This project solves the Unity Reacher (20-agent) continuous control task using Deep Deterministic Policy Gradient (DDPG), following Lillicrap et al. (2016): https://arxiv.org/abs/1509.02971.
- State space: 33-dimensional vector per agent
- Action space: 4-dimensional continuous vector in [-1, 1]
- Success criterion: average score (over 100 consecutive episodes, averaged across 20 agents) >= 30
DDPG is an off-policy actor-critic algorithm for continuous control:
- Actor learns a deterministic policy \(\mu(s|\theta^\mu)\)
- Critic learns the action-value function \(Q(s,a|\theta^Q)\)
- Target networks provide stable temporal-difference targets
- Replay buffer breaks temporal correlation and improves sample reuse
Update equations used:
- Critic target: \[ y_i = r_i + \gamma\, Q'\big(s_{i+1}, \mu'(s_{i+1})\big) \]
- Critic loss: \[ L = \frac{1}{N}\sum_i \big(Q(s_i,a_i) - y_i\big)^2 \]
- Actor gradient: \[ \nabla_{\theta^\mu} J \approx \frac{1}{N}\sum_i \nabla_a Q(s,a|\theta^Q)\big|_{s=s_i,\, a=\mu(s_i)}\, \nabla_{\theta^\mu}\mu(s|\theta^\mu)\big|_{s_i} \]
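These three updates can be sketched in PyTorch as a single gradient step. The function signature and the `(1 - dones)` terminal mask are illustrative assumptions, not the project's exact code:

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, gamma=0.99):
    """One DDPG gradient step implementing the equations above.

    All network and optimizer arguments are placeholders for the
    project's own actor/critic instances. `batch` holds tensors
    (states, actions, rewards, next_states, dones).
    """
    states, actions, rewards, next_states, dones = batch

    # Critic target: y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))
    # (a terminal mask is added here, which the equation above omits)
    with torch.no_grad():
        next_actions = actor_target(next_states)
        y = rewards + gamma * (1.0 - dones) * critic_target(next_states, next_actions)

    # Critic loss: mean squared TD error over the minibatch
    critic_loss = F.mse_loss(critic(states, actions), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor objective: ascend Q(s, mu(s)), i.e. minimize -Q
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    return critic_loss.item(), actor_loss.item()
```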
Compared with a minimal baseline, the final implementation includes stabilizing changes:
- Replay warmup before gradient updates (`replay_warmup=5000`)
- Controlled update ratio (`learn_updates_per_step=2`)
- Independent exploration noise per agent (Gaussian action noise per action dimension)
- Corrected OU process implementation (zero-mean Gaussian increment)
- Larger replay buffer for multi-agent data throughput
- Soft target updates with small `tau`
- Critic gradient clipping
- Small critic weight decay for regularization
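The corrected OU process can be sketched as follows; the parameter values (`theta=0.15`, `sigma=0.2`) are conventional defaults, not necessarily the project's:

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process with a zero-mean Gaussian increment.

    A sketch of the corrected noise process described above: the
    stochastic term draws from N(0, 1), not from a uniform distribution.
    """
    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, seed=0):
        self.mu = mu * np.ones(size)
        self.theta = theta
        self.sigma = sigma
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        # Return the internal state to the long-run mean
        self.state = self.mu.copy()

    def sample(self):
        # dx = theta * (mu - x) + sigma * N(0, 1): zero-mean Gaussian increment
        dx = self.theta * (self.mu - self.state) \
            + self.sigma * self.rng.standard_normal(self.state.shape)
        self.state += dx
        return self.state
```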
The final networks are defined in model.py.

Actor:
- Input: state (33)
- Hidden layer 1: 256 units + ReLU
- Hidden layer 2: 128 units + ReLU
- Output: action (4) with `tanh`

Critic:
- State path: Linear(33 -> 256) + ReLU
- Concatenate action (4) after the first state layer
- Joint layer: Linear(256+4 -> 128) + ReLU
- Output: Q-value scalar
- Hidden layers initialized with fan-in based uniform bounds
- Final layers initialized in the small range `[-3e-3, 3e-3]`
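A sketch of what the actor and its initialization might look like in PyTorch; the class and helper names are illustrative, and the project's actual definitions live in model.py:

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

def hidden_init(layer):
    """Fan-in based uniform bound for a linear layer's weights."""
    fan_in = layer.weight.data.size(1)
    lim = 1.0 / np.sqrt(fan_in)
    return (-lim, lim)

class Actor(nn.Module):
    """Actor: 33 -> 256 -> 128 -> 4 with tanh output, as described above."""
    def __init__(self, state_size=33, action_size=4):
        super().__init__()
        self.fc1 = nn.Linear(state_size, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, action_size)
        # Hidden layers: fan-in uniform; final layer: small [-3e-3, 3e-3]
        self.fc1.weight.data.uniform_(*hidden_init(self.fc1))
        self.fc2.weight.data.uniform_(*hidden_init(self.fc2))
        self.fc3.weight.data.uniform_(-3e-3, 3e-3)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return torch.tanh(self.fc3(x))  # actions bounded in [-1, 1]
```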
Final training configuration used:
| Hyperparameter | Value |
|---|---|
| Replay buffer size | 3e5 |
| Batch size | 128 |
| Discount factor (gamma) | 0.99 |
| Soft update (tau) | 1e-3 |
| Actor learning rate | 1e-4 |
| Critic learning rate | 1e-3 |
| Critic weight decay | 1e-5 |
| Max episodes | 500 |
| Max timesteps / episode | 1000 |
| Updates per env step | 2 |
| Replay warmup | 5000 transitions |
| Noise sigma start | 0.30 |
| Noise sigma end | 0.05 |
| Noise decay (per episode) | 0.997 |
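The soft target update with `tau=1e-3` amounts to an exponential moving average over parameters; a minimal sketch:

```python
def soft_update(target, source, tau=1e-3):
    """theta_target <- tau * theta_source + (1 - tau) * theta_target.

    With tau=1e-3 the target networks trail the learned networks
    slowly, which stabilizes the TD targets.
    """
    for t, s in zip(target.parameters(), source.parameters()):
        t.data.copy_(tau * s.data + (1.0 - tau) * t.data)
```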
- Start the 20-agent Reacher environment.
- Reset environment each episode in training mode.
- For each timestep:
- Get action for each agent from actor
- Add independent exploration noise
- Step environment
- Store all 20 transitions in replay buffer
- If warmup reached, run a fixed number of gradient updates
- Track:
- episode score (mean across agents)
- rolling 100-episode average
- Save actor/critic checkpoints when best rolling average improves.
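The score-tracking logic above (rolling 100-episode average, checkpointing on a new best average, success at >= 30) can be sketched environment-free; the function name is illustrative:

```python
from collections import deque
import numpy as np

def track_scores(episode_scores, window=100, target=30.0):
    """Track per-episode mean scores as in the loop described above.

    Returns the rolling-average history, the best rolling average seen
    (the point where a checkpoint would be saved), and the first episode
    at which a full window meets the success criterion (or None).
    """
    rolling = deque(maxlen=window)
    best = -np.inf
    solved_at = None
    history = []
    for episode, score in enumerate(episode_scores, 1):
        rolling.append(score)
        avg = float(np.mean(rolling))
        history.append(avg)
        if avg > best:
            best = avg  # checkpoint would be saved here
        if solved_at is None and len(rolling) == window and avg >= target:
            solved_at = episode
    return history, best, solved_at
```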
A plotting cell visualizes training progress.

Possible extensions:
- Hyperparameter sweeps
  - Tune `tau`, update ratio, warmup, and noise schedule using controlled experiments
- D4PG / PPO baseline comparison
  - Compare performance and training stability on the same environment
