Hybrid Locomotion Research

Research / RL
NEAT + imitation → DDPG fine-tuning
Researcher

A hybrid locomotion pipeline that bootstraps walking from motion-captured human joint angles using NEAT, then fine-tunes with DDPG for generalization in rough terrain. The main bottleneck was observation design.

Overview

Most reinforcement learning locomotion projects waste a huge amount of time in the early phase doing useless exploration. The agent spends forever flailing before it discovers anything that resembles walking.

This project was my attempt to fix that first part.

Instead of asking an agent to learn locomotion from scratch, I built a hybrid pipeline that starts with a human example, then transfers into reinforcement learning to generalize.

The idea is simple: 1) learn the shape of walking first (mimicry), 2) then learn how to survive real environments (fine-tuning).

The problem I was solving

Locomotion RL has a frustrating early phase. The agent has no clue what “movement” even means yet, so it burns time exploring actions that will never work on a real robot.

Humans don’t learn like that. Kids learn by copying. They start with examples. Then they get better through practice.

So I tried to do the same thing for a bipedal walker.

The approach

I designed a two-stage training pipeline.

Stage 1: mimic human walking

I took a walking video and ran it through Mediapipe frame by frame. From each frame, I computed key joint angles and created paired training examples:

  • input: joint angles at time t
  • target: joint angles at time t+1 (converted into motor control)

That conversion mattered because the environment I used (OpenAI’s BipedalWalker) doesn’t take target angles. It takes continuous motor torques in a -1 to 1 range, so I mapped joint-angle deltas into motor commands.
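
To make the data step concrete, here is a minimal sketch of the extraction and pair construction, assuming Mediapipe Pose for landmarks and OpenCV for frame reading. The landmark triplets, the `gain` scale, and all function names are illustrative choices, not the project's actual code.

```python
import cv2
import mediapipe as mp
import numpy as np

mp_pose = mp.solutions.pose

def joint_angle(a, b, c):
    """Angle at landmark b formed by the segments b->a and b->c, in radians."""
    v1 = np.array([a.x - b.x, a.y - b.y])
    v2 = np.array([c.x - b.x, c.y - b.y])
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def frame_to_angles(pose_landmarks):
    """Four joint angles per frame: left/right hip and left/right knee."""
    lm, P = pose_landmarks.landmark, mp_pose.PoseLandmark
    return np.array([
        joint_angle(lm[P.LEFT_SHOULDER],  lm[P.LEFT_HIP],   lm[P.LEFT_KNEE]),
        joint_angle(lm[P.LEFT_HIP],       lm[P.LEFT_KNEE],  lm[P.LEFT_ANKLE]),
        joint_angle(lm[P.RIGHT_SHOULDER], lm[P.RIGHT_HIP],  lm[P.RIGHT_KNEE]),
        joint_angle(lm[P.RIGHT_HIP],      lm[P.RIGHT_KNEE], lm[P.RIGHT_ANKLE]),
    ])

def video_to_angles(path):
    """Run Mediapipe Pose over every frame and collect the joint angles."""
    frames, cap = [], cv2.VideoCapture(path)
    with mp_pose.Pose(static_image_mode=False) as pose:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if result.pose_landmarks:
                frames.append(frame_to_angles(result.pose_landmarks))
    cap.release()
    return np.array(frames)

def build_pairs(angle_frames, gain=2.0):
    """Pair angles at t with a clipped angle delta toward t+1, used as a motor command in [-1, 1]."""
    X = angle_frames[:-1]
    Y = np.clip(gain * (angle_frames[1:] - angle_frames[:-1]), -1.0, 1.0)
    return X, Y
```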

Then I trained a policy network to imitate the human movement.
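
The imitation step itself is plain supervised regression. Below is a sketch in PyTorch (a framework assumption on my part); the fixed MLP stands in for whatever topology the NEAT search described later settles on.

```python
import torch
import torch.nn as nn

# Small actor: 4 joint angles in, 4 motor commands out (tanh keeps outputs in [-1, 1]).
actor = nn.Sequential(
    nn.Linear(4, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 4), nn.Tanh(),
)

def imitation_train(actor, X, Y, epochs=200, lr=1e-3):
    """Supervised mimicry: regress next-step motor commands from current joint angles."""
    X = torch.as_tensor(X, dtype=torch.float32)
    Y = torch.as_tensor(Y, dtype=torch.float32)
    opt = torch.optim.Adam(actor.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(actor(X), Y)
        loss.backward()
        opt.step()
    return actor
```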

Stage 2: fine-tune in a real RL environment

Once I had a network that could produce walking-like actions, I used it as the actor inside DDPG and fine-tuned it directly in the simulated environment.

This let the agent start from something that looks like a gait instead of random thrashing.
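
Mechanically, the transfer is just weight copying: the mimicry network becomes the DDPG actor's starting point, and its target network gets the same copy. A minimal sketch below, assuming the Gymnasium API and the PyTorch actor from the previous sketch; the checkpoint path is hypothetical, the joint-angle indices are my reading of the standard BipedalWalker-v3 layout, and the critic, replay buffer, and exploration noise are omitted.

```python
import copy

import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn

# Same architecture as the mimicry actor; load its trained weights (hypothetical path).
actor = nn.Sequential(
    nn.Linear(4, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 4), nn.Tanh(),
)
actor.load_state_dict(torch.load("mimicry_actor.pt"))

# Warm start: the DDPG actor and its target both begin as copies of the mimicry network,
# so exploration starts from a gait-like policy instead of random torques.
ddpg_actor = copy.deepcopy(actor)
ddpg_actor_target = copy.deepcopy(actor)

# Indices of the four joint-angle entries (hips and knees) in the 24-dim observation.
JOINT_IDX = np.array([4, 6, 9, 11])

env = gym.make("BipedalWalker-v3")
obs, _ = env.reset(seed=0)
done, total_reward = False, 0.0
while not done:
    joint_angles = torch.as_tensor(obs[JOINT_IDX], dtype=torch.float32)
    with torch.no_grad():
        action = ddpg_actor(joint_angles).numpy()  # critic updates and noise omitted
    obs, reward, terminated, truncated, _ = env.step(action)
    total_reward += reward
    done = terminated or truncated
```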

Why NEAT was part of it

Instead of locking myself into one static network shape, I used NEAT-style evolution to search for a lightweight topology that fit the mimicry task.

The goal was not “bigger network.” The goal was “minimal structure that works.”

I used NEAT-style mutation (adding nodes, adding connections, toggling connections, resetting weights, perturbing weights) and scored networks based on inverse training error.
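
For reference, here is a stripped-down sketch of that operator set over a toy genome encoding (a node list plus (src, dst, weight, enabled) connection tuples). It illustrates the mutations and the inverse-error fitness, not the project's actual NEAT implementation.

```python
import copy
import random

def mutate(genome, rng=random):
    """Apply one NEAT-style mutation. Genome: {"nodes": [ids], "conns": [(src, dst, w, enabled)]}."""
    g = copy.deepcopy(genome)
    op = rng.choice(["add_node", "add_conn", "toggle", "reset_weight", "perturb"])
    if op == "add_node" and g["conns"]:
        i = rng.randrange(len(g["conns"]))
        src, dst, w, _ = g["conns"][i]
        new = max(g["nodes"]) + 1                    # split a connection with a new hidden node
        g["nodes"].append(new)
        g["conns"][i] = (src, dst, w, False)         # disable the old connection
        g["conns"] += [(src, new, 1.0, True), (new, dst, w, True)]
    elif op == "add_conn":
        src, dst = rng.choice(g["nodes"]), rng.choice(g["nodes"])
        g["conns"].append((src, dst, rng.uniform(-1, 1), True))
    elif g["conns"]:
        i = rng.randrange(len(g["conns"]))
        s, d, w, e = g["conns"][i]
        if op == "toggle":
            g["conns"][i] = (s, d, w, not e)
        elif op == "reset_weight":
            g["conns"][i] = (s, d, rng.uniform(-1, 1), e)
        else:  # perturb
            g["conns"][i] = (s, d, w + rng.gauss(0, 0.1), e)
    return g

def fitness(training_error):
    """Score a candidate network by inverse training error on the mimicry dataset."""
    return 1.0 / (training_error + 1e-8)
```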

The takeaway was clear: simpler topologies trained more cleanly and converged faster.

What happened (results)

There were two major outcomes.

1) The pretrained agent started stronger

When I transferred the mimicry-trained actor into DDPG, it started off with a noticeably higher average reward than a blank-slate DDPG agent trained with the same limited inputs.

That’s the core win. The human example actually did what it was supposed to do.

2) It plateaued hard

Even though the pretrained agent started higher and plateaued higher, it still hit a ceiling and stopped improving.

At first that looked like an algorithm failure. It wasn’t.

It was an input failure.

The real lesson: observation design is the ceiling

In both the mimicry stage and the fine-tuning stage, I was only giving the agent a small slice of state: basically just the four joint angles.

That means the policy is walking blind. No body angle, no contact understanding, no ground relationship, no richer state.
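
For scale, here is roughly what the full BipedalWalker-v3 observation contains versus the slice the policy actually saw. The index layout is my reading of the standard environment, so treat it as approximate.

```python
import gymnasium as gym
import numpy as np

env = gym.make("BipedalWalker-v3")
obs, _ = env.reset(seed=0)
print(obs.shape)                      # (24,) in the standard environment

# What the policy saw: only the hip/knee joint angles.
limited_obs = obs[np.array([4, 6, 9, 11])]

# What it never saw (approximate layout of the remaining channels):
#   obs[0:4]                          hull angle, hull angular velocity, horizontal/vertical velocity
#   obs[5], obs[7], obs[10], obs[12]  joint angular speeds
#   obs[8], obs[13]                   leg-ground contact flags
#   obs[14:24]                        lidar rangefinder readings
```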

When I ran DDPG from scratch with more environmental inputs, it kept improving instead of plateauing.

So the pretrained policy helped early exploration, but the limited observation space capped long-term learning.

This project is the reason I’m strict now about interface design in simulation. The model matters, but the observation pipeline often matters more.

What I’d do next

The next version is a hybrid that keeps the imitation bootstrapping but expands the state representation in a controlled way.

The hard part is that you can’t just bolt on new inputs at the end without breaking what you trained earlier. If you do, you overwrite the skill you were trying to preserve.

So the next step is either:

  • evolve with the full observation space from the start, even if some channels are initially empty, or
  • introduce an intermediate step that expands the policy capacity without destroying the gait prior
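
A minimal sketch of that second option, assuming the PyTorch actor from the earlier sketches: widen the input layer to the full observation, keep the trained columns where the old channels land in the new layout, and zero-initialize everything new so the expanded network reproduces the old gait exactly until fine-tuning starts adjusting it.

```python
import copy

import torch
import torch.nn as nn

def expand_first_layer(actor, new_obs_dim=24, old_idx_in_new=(4, 6, 9, 11)):
    """Widen the actor's input layer to the full observation without changing its behavior.

    Columns for channels the old policy already used keep their trained weights;
    every new channel starts at zero, so outputs are identical until fine-tuning.
    Assumes the first module of `actor` is an nn.Linear.
    """
    expanded = copy.deepcopy(actor)
    old_fc = expanded[0]
    new_fc = nn.Linear(new_obs_dim, old_fc.out_features)
    with torch.no_grad():
        new_fc.weight.zero_()
        new_fc.bias.copy_(old_fc.bias)
        for old_col, new_col in enumerate(old_idx_in_new):
            new_fc.weight[:, new_col] = old_fc.weight[:, old_col]
    expanded[0] = new_fc
    return expanded

# Sanity check: the expanded actor matches the original on the channels it already knew.
actor = nn.Sequential(
    nn.Linear(4, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 4), nn.Tanh(),
)
big_actor = expand_first_layer(actor)
full_obs = torch.zeros(24)
full_obs[[4, 6, 9, 11]] = torch.tensor([0.3, -0.2, 0.1, 0.4])
assert torch.allclose(actor(full_obs[[4, 6, 9, 11]]), big_actor(full_obs))
```

Because the new columns start at zero, fine-tuning can gradually learn to use hull angle, contacts, and lidar without first wiping out the behavior the imitation stage paid for.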

That’s where this gets interesting.