Overview
This project was done through UTMIST, the University of Toronto Machine Intelligence Student Team.
The goal was ambitious on purpose. Build a full pipeline that goes from evolving bodies to coordinated multi-agent behavior in a 2v2 soccer environment.
Not just “make creatures walk.” Make creatures coordinate.
The core idea
Most RL creature projects stop once movement works. But movement is the easy part.
Coordination is where things break:
- agents can get reward without learning teamwork
- coordination can collapse under small distribution shifts
- policies that “work” can fail the second the environment changes
So we treated this like a full stack problem. Morphology, locomotion, strategy, execution.
Phase 1: evolve baseline creatures (bodies + sensors)
We generated creatures through a genetic algorithm inspired by Karl Sims’ classic 1994 “Evolving Virtual Creatures” work.
The goal of this phase was to evolve creatures that could move toward a waypoint reliably.
The loop looked like this:
- selection: keep the top performers
- pruning: remove creatures that cannot move or that only drift aimlessly
- mutation: change segments, joints, sensor placement, and morphology
We used a simple reward based on distance to the target: \[ R_t = \frac{1}{d_t^{2}} \] where \( d_t \) is the creature's distance from the target at time \( t \).
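Roughly, the loop looked like the sketch below. This is a minimal illustration rather than the project's actual code: `evaluate` and `mutate` are stand-ins for a MuJoCo rollout and a real morphology mutation, and `MIN_FITNESS` is a hypothetical pruning cutoff.

```python
import random

MIN_FITNESS = 0.01  # hypothetical cutoff for pruning creatures that barely move

def waypoint_reward(distance):
    # the reward above: R_t = 1 / d_t^2 (small epsilon avoids division by zero)
    return 1.0 / (distance ** 2 + 1e-8)

def evaluate(creature):
    # stand-in: a real version rolls the creature out in simulation and
    # averages waypoint_reward over the episode
    return waypoint_reward(creature["final_distance"])

def mutate(creature):
    # stand-in: a real version perturbs segments, joints, and sensor placement
    child = dict(creature)
    child["final_distance"] = max(0.1, creature["final_distance"] + random.gauss(0.0, 0.5))
    return child

def evolve(population, generations, keep_fraction=0.25):
    for _ in range(generations):
        ranked = sorted(population, key=evaluate, reverse=True)
        survivors = ranked[: max(1, int(len(ranked) * keep_fraction))]             # selection
        survivors = [c for c in survivors if evaluate(c) > MIN_FITNESS] or ranked[:1]  # pruning
        population = survivors + [mutate(random.choice(survivors))                 # mutation
                                  for _ in range(len(population) - len(survivors))]
    return population

# usage: evolve([{"final_distance": random.uniform(1.0, 5.0)} for _ in range(32)], generations=20)
```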
This produced a range of morphologies with different movement styles, including aggressive “launch forward” designs and more stable “stay near target” ones.
Phase 2: learn a universal walking policy (PPO)
Once we had bodies, the next step was to build a low-level motor control policy that generalizes across different morphologies.
We trained a universal PPO locomotion policy in MuJoCo, training on multiple creature designs simultaneously. We also used curriculum learning, gradually increasing the target distance once creatures reliably reached closer targets.
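The curriculum itself is simple to state. Here is a hedged sketch of it, where `make_env`, `policy.train`, and `policy.success_rate` are assumed interfaces (an env factory, PPO updates, and an evaluation rollout) rather than our actual APIs:

```python
def curriculum_train(policy, make_env,
                     distances=(1.0, 2.0, 4.0, 8.0),
                     success_threshold=0.8,
                     steps_per_stage=200_000):
    """Only graduate to a farther waypoint once the policy reliably reaches
    the current one; all interfaces here are hypothetical."""
    for target_distance in distances:
        env = make_env(target_distance=target_distance)     # assumed env factory
        while policy.success_rate(env, episodes=50) < success_threshold:
            policy.train(env, steps=steps_per_stage)         # PPO updates at this distance
    return policy
```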
The key point: we weren’t training one creature. We were training a policy that could adapt.
This phase mattered because it meant movement became a reusable skill rather than a one-off solution.
Phase 3: strategy and coordination (soccer)
Then we switched from locomotion to team play.
We set up a 2v2 soccer environment and represented the field as a grid. Strategy learning happened at the grid level, not the motor level.
The coordination stack looked like this:
- learn two heatmaps with Transformers:
  - a player heatmap for positioning
  - a ball heatmap for where to push the ball for advantage
- initialize heatmaps using random rollouts labeled by outcome (+1 or -1)
- fine-tune with MCTS and AlphaZero-style self-play
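To make the heatmap idea concrete, here is a minimal PyTorch stand-in (not the project's exact architecture): one learned token per grid cell, a small Transformer encoder conditioned on the game state, and an outcome-weighted loss for the rollout-labeled initialization.

```python
import torch
import torch.nn as nn

class GridHeatmap(nn.Module):
    """Toy stand-in for the heatmap models: embeds each grid cell, conditions
    on the game state, and outputs a probability over cells (e.g. where a
    player should stand, or where to push the ball)."""
    def __init__(self, grid_cells=64, state_dim=16, d_model=64):
        super().__init__()
        self.cell_embed = nn.Embedding(grid_cells, d_model)
        self.state_proj = nn.Linear(state_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.score = nn.Linear(d_model, 1)

    def forward(self, state):
        # state: (batch, state_dim) -> probability over grid cells (batch, grid_cells)
        cells = self.cell_embed.weight.unsqueeze(0).expand(state.size(0), -1, -1)
        tokens = cells + self.state_proj(state).unsqueeze(1)
        return self.score(self.encoder(tokens)).squeeze(-1).softmax(dim=-1)

def init_loss(heatmap, states, target_cells, outcomes):
    """Rollout-labeled initialization: raise probability on cells chosen in
    winning rollouts (outcome +1), lower it for losing ones (outcome -1).
    states: (B, state_dim) float, target_cells: (B,) long, outcomes: (B,) float."""
    probs = heatmap(states)
    logp = probs.gather(1, target_cells.unsqueeze(1)).clamp_min(1e-8).log().squeeze(1)
    return -(outcomes * logp).mean()
```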
So this wasn’t end-to-end neural net soccer. It was hierarchical:
- a high-level planner outputs target coordinates
- a low-level walking policy turns coordinates into movement
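A sketch of that interface, with every name (`choose_cell`, `cell_center`, `observe`, `act`) assumed for illustration rather than taken from the codebase:

```python
MOTOR_STEPS_PER_PLAN = 50   # hypothetical: how long to walk before replanning

def play_step(planner, motor_policy, env, grid):
    """High level picks a grid cell; low level walks the creature toward it."""
    state = env.game_state()               # assumed: ball + player positions on the grid
    cell = planner.choose_cell(state)      # planner output: a target grid cell
    target_xy = grid.cell_center(cell)     # grid cell -> field coordinates
    for _ in range(MOTOR_STEPS_PER_PLAN):
        obs = env.observe(target_xy)       # observation includes the target waypoint
        action = motor_policy.act(obs)     # universal PPO policy from Phase 2
        env.step(action)
```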
End-to-end pipeline (what actually ran)
The pipeline was:
1) evolve creature morphologies
2) encode them into MuJoCo-compatible XML
3) drop them into a soccer environment
4) run a planner to pick target coordinates
5) execute those targets through the universal PPO motor policy
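Stitched together, that looks roughly like the sketch below. Every name here is hypothetical (the loaders, `SoccerEnv`, and `evolve_morphologies` are illustrative); `play_step` is the glue sketched above.

```python
def run_match():
    creatures = evolve_morphologies(generations=50)           # 1) genetic algorithm
    xml_models = [to_mujoco_xml(c) for c in creatures]        # 2) MuJoCo-compatible XML
    env = SoccerEnv(xml_models, team_size=2)                  # 3) 2v2 soccer environment
    planner = load_planner("heatmaps.ckpt")                   # 4) grid-level planner
    motor_policy = load_motor_policy("universal_ppo.ckpt")    # 5) universal walking policy
    while not env.done():
        play_step(planner, motor_policy, env, env.grid)
    return env.score()
```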
What I think is interesting here
Two things.
1) Modularity without losing realism
A lot of projects either:
- do pure physics simulation with no strategy, or
- do pure strategy with toy movement
This combined both.
The motor control wasn’t fake. The planner wasn’t hard-coded. They interfaced cleanly.
2) Coordination becomes measurable
Instead of guessing whether teamwork exists, we forced it into explicit representations:
- heatmaps
- grid advantage
- target-based control
That made it much easier to debug.
Outcome
We built the full stack. Evolved bodies. Learned locomotion that generalizes. Added a coordination layer. Ran a 2v2 soccer simulation with realistic physics underneath.
More importantly, it produced a platform we could keep extending, because the pieces weren’t welded together as a single black box.