
Project Report: Continuous Control (Reacher, 20 Agents)

1) Project Overview

This project solves the Unity Reacher (20-agent) continuous control task using Deep Deterministic Policy Gradient (DDPG), following Lillicrap et al. (2016): https://arxiv.org/abs/1509.02971.

  • State space: 33-dimensional vector per agent
  • Action space: 4-dimensional continuous vector in [-1, 1]
  • Success criterion: average score (over 100 consecutive episodes, averaged across 20 agents) >= 30

2) Learning Algorithm

DDPG Summary

DDPG is an off-policy actor-critic algorithm for continuous control:

  • Actor learns a deterministic policy $\mu(s \mid \theta^\mu)$
  • Critic learns the action-value function $Q(s, a \mid \theta^Q)$
  • Target networks provide stable temporal-difference targets
  • Replay buffer breaks temporal correlation and improves sample reuse

Update equations used:

  • Critic target: $y_i = r_i + \gamma\, Q'\big(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\big)$
  • Critic loss: $L = \frac{1}{N}\sum_i \big(Q(s_i, a_i \mid \theta^Q) - y_i\big)^2$
  • Actor gradient: $\nabla_{\theta^\mu} J \approx \frac{1}{N}\sum_i \nabla_a Q(s, a \mid \theta^Q)\big|_{s=s_i,\, a=\mu(s_i)}\, \nabla_{\theta^\mu}\mu(s \mid \theta^\mu)\big|_{s=s_i}$

Implementation Details Used

Compared with a minimal baseline, the final implementation includes stabilizing changes:

  1. Replay warmup before gradient updates (replay_warmup=5000)
  2. Controlled update ratio (learn_updates_per_step=2)
  3. Independent exploration noise per agent (Gaussian action noise per action dimension)
  4. Corrected OU process implementation (zero-mean Gaussian increment)
  5. Larger replay buffer for multi-agent data throughput
  6. Soft target updates with small tau
  7. Critic gradient clipping
  8. Small critic weight decay for regularization
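Items 3 and 4 concern exploration noise. Below is a sketch of both schemes under stated assumptions: the class names and the sigma-decay schedule parameters are illustrative (taken from the hyperparameter table later in this report), not the repository's identifiers. The OU comment marks the bug this implementation corrects: a common broken variant draws the increment from a uniform distribution instead of a zero-mean Gaussian.

```python
import numpy as np

class GaussianActionNoise:
    """Independent Gaussian exploration noise per agent and action dimension."""
    def __init__(self, shape, sigma_start=0.30, sigma_end=0.05, decay=0.997):
        self.shape = shape
        self.sigma = sigma_start
        self.sigma_end = sigma_end
        self.decay = decay

    def sample(self):
        return np.random.normal(0.0, self.sigma, self.shape)

    def decay_sigma(self):
        # Called once per episode; sigma anneals toward its floor.
        self.sigma = max(self.sigma_end, self.sigma * self.decay)

class OUNoise:
    """Ornstein-Uhlenbeck process with a zero-mean Gaussian increment."""
    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2):
        self.mu = mu * np.ones(size)
        self.theta, self.sigma = theta, sigma
        self.reset()

    def reset(self):
        self.state = self.mu.copy()

    def sample(self):
        # dx = theta*(mu - x) + sigma*N(0, 1)
        # Corrected: standard_normal, not a uniform draw in [0, 1).
        dx = self.theta * (self.mu - self.state) \
             + self.sigma * np.random.standard_normal(self.state.shape)
        self.state = self.state + dx
        return self.state
```

Sampled noise is added to the actor's output before the action is clipped to [-1, 1].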

3) Network Architecture

The final networks are defined in model.py.

Actor

  • Input: state (33)
  • Hidden layer 1: 256 units + ReLU
  • Hidden layer 2: 128 units + ReLU
  • Output: action (4) with tanh

Critic

  • State path: Linear(33 -> 256) + ReLU
  • Concatenate action (4) after first state layer
  • Joint layer: Linear(256+4 -> 128) + ReLU
  • Output: Q-value scalar

Initialization

  • Hidden layers initialized with fan-in based uniform bounds
  • Final layers initialized in small range [-3e-3, 3e-3]
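The architecture and initialization described above can be sketched as PyTorch modules. This is a reconstruction consistent with the description, not the actual contents of model.py; layer attribute names are assumptions.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

def hidden_init(layer):
    # Fan-in based uniform bound: 1/sqrt(fan_in)
    fan_in = layer.weight.data.size(1)
    lim = 1.0 / np.sqrt(fan_in)
    return (-lim, lim)

class Actor(nn.Module):
    def __init__(self, state_size=33, action_size=4):
        super().__init__()
        self.fc1 = nn.Linear(state_size, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, action_size)
        self.fc1.weight.data.uniform_(*hidden_init(self.fc1))
        self.fc2.weight.data.uniform_(*hidden_init(self.fc2))
        self.fc3.weight.data.uniform_(-3e-3, 3e-3)  # small final-layer range

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return torch.tanh(self.fc3(x))  # actions bounded in [-1, 1]

class Critic(nn.Module):
    def __init__(self, state_size=33, action_size=4):
        super().__init__()
        self.fcs1 = nn.Linear(state_size, 256)
        self.fc2 = nn.Linear(256 + action_size, 128)
        self.fc3 = nn.Linear(128, 1)
        self.fcs1.weight.data.uniform_(*hidden_init(self.fcs1))
        self.fc2.weight.data.uniform_(*hidden_init(self.fc2))
        self.fc3.weight.data.uniform_(-3e-3, 3e-3)

    def forward(self, state, action):
        xs = F.relu(self.fcs1(state))
        x = torch.cat([xs, action], dim=1)  # action joins after first layer
        x = F.relu(self.fc2(x))
        return self.fc3(x)  # scalar Q-value
```

Injecting the action after the first state layer (rather than at the input) follows the original DDPG architecture.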

4) Hyperparameters

Final training configuration used:

| Hyperparameter | Value |
| --- | --- |
| Replay buffer size | 3e5 |
| Batch size | 128 |
| Discount factor (gamma) | 0.99 |
| Soft update (tau) | 1e-3 |
| Actor learning rate | 1e-4 |
| Critic learning rate | 1e-3 |
| Critic weight decay | 1e-5 |
| Max episodes | 500 |
| Max timesteps / episode | 1000 |
| Updates per env step | 2 |
| Replay warmup | 5000 transitions |
| Noise sigma start | 0.30 |
| Noise sigma end | 0.05 |
| Noise decay (per episode) | 0.997 |
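For reference, the table can be mirrored as a plain configuration dict; the key names here are illustrative, not the repository's actual variable names.

```python
config = {
    "buffer_size": int(3e5),       # replay buffer size
    "batch_size": 128,
    "gamma": 0.99,                 # discount factor
    "tau": 1e-3,                   # soft target update rate
    "lr_actor": 1e-4,
    "lr_critic": 1e-3,
    "critic_weight_decay": 1e-5,
    "max_episodes": 500,
    "max_t": 1000,                 # max timesteps per episode
    "updates_per_step": 2,
    "replay_warmup": 5000,         # transitions before learning starts
    "noise_sigma_start": 0.30,
    "noise_sigma_end": 0.05,
    "noise_decay": 0.997,          # per episode
}
```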

5) Training Procedure

  1. Start the 20-agent Reacher environment.
  2. Reset environment each episode in training mode.
  3. For each timestep:
    • Get action for each agent from actor
    • Add independent exploration noise
    • Step environment
    • Store all 20 transitions in replay buffer
    • If warmup reached, run a fixed number of gradient updates
  4. Track:
    • episode score (mean across agents)
    • rolling 100-episode average
  5. Save actor/critic checkpoints when best rolling average improves.
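The loop above can be sketched as follows. The `env` and `agent` interfaces (`reset`/`step`, `act`/`step`/`save_checkpoint`) are assumptions for illustration, not the project's actual API; the agent is assumed to handle replay warmup and the fixed update count internally.

```python
import numpy as np
from collections import deque

def train(env, agent, n_episodes=500, max_t=1000, target=30.0):
    """Multi-agent DDPG training loop (sketch)."""
    scores, window = [], deque(maxlen=100)  # rolling 100-episode window
    best_avg = -np.inf
    for ep in range(1, n_episodes + 1):
        states = env.reset()                      # (20, 33) agent states
        ep_scores = np.zeros(len(states))
        for _ in range(max_t):
            actions = agent.act(states)           # noisy actions in [-1, 1]
            next_states, rewards, dones = env.step(actions)
            agent.step(states, actions, rewards, next_states, dones)
            ep_scores += rewards
            states = next_states
            if np.any(dones):
                break
        score = ep_scores.mean()                  # mean across 20 agents
        scores.append(score)
        window.append(score)
        avg = np.mean(window)
        if avg > best_avg:
            best_avg = avg
            agent.save_checkpoint()               # save actor/critic weights
        if len(window) == 100 and avg >= target:  # success criterion: avg >= 30
            break
    return scores
```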

A plotting cell to visualize training progress:

*(figure: training-plot)*


6) Ideas for Future Improvements

  1. Hyperparameter sweeps

    • Tune tau, update ratio, warmup, and noise schedule using controlled experiments
  2. D4PG / PPO baseline comparison

    • Compare performance and training stability on the same environment