PolyRL/ROADMAP.md at main · Terra-Flux/PolyRL

Roadmap (subject to change)

Goal:

[Cost Efficiency] It harvests resources with dynamic availability (e.g. spot instances in the cloud) for rollout to reduce the cost of reinforcement learning.
[Adaptivity] It employs a progressive workload-balancing algorithm to adaptively offload rollout tasks to available remote rollout engines. The algorithm estimates the workload gap between training and rollout from recent steps, so it adapts both to changes in response length and to dynamic resource availability.
[Fault Tolerance] It handles failures of rollout instances with negligible overhead via token-level tracking and continuous generation.
[Utilization] It also uses a server-based local rollout engine so that the rollout manager can balance workload holistically across reserved and preemptible resources, which maximizes resource utilization.

Intra-stage (rollout) balance
- Rollout workload balance, starting from homogeneous rollout instances.
  - Token-level tracking and workload balance.
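
One way to picture token-level tracking for intra-stage balance is the minimal sketch below. It assumes the manager knows each request's token budget and receives token counts as results stream back. The class `TokenBalancer` and its methods are hypothetical names for illustration, not PolyRL's API.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TokenBalancer:
    """Hypothetical tracker: routes each request to the instance with the fewest pending tokens."""
    pending: Dict[str, int] = field(default_factory=dict)  # instance -> outstanding token budget

    def add_instance(self, name: str) -> None:
        self.pending.setdefault(name, 0)

    def assign(self, max_new_tokens: int) -> str:
        # Pick the least-loaded instance; ties are broken by name for determinism.
        name = min(self.pending, key=lambda n: (self.pending[n], n))
        self.pending[name] += max_new_tokens
        return name

    def on_tokens(self, name: str, generated: int) -> None:
        # Token-level update: shrink the instance's load as tokens stream back.
        self.pending[name] = max(0, self.pending[name] - generated)


balancer = TokenBalancer()
for inst in ("rollout-0", "rollout-1"):
    balancer.add_instance(inst)

first = balancer.assign(512)   # both idle, so the tie goes to rollout-0
second = balancer.assign(128)  # rollout-1 is idle
third = balancer.assign(256)   # rollout-1 (128 pending) is lighter than rollout-0 (512)
```

Counting pending tokens rather than pending requests keeps a few long generations from hiding behind an even request count.
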
Inter-stage (rollout vs. training) balance
- Rollout offload
  - Estimate workload gap between training and rollout in seconds.
  - Adaptively offload rollout workloads to remote rollout engines.
  - Fine-grained workload partition (request -> time).
- Dynamic rollout instances allocation
  - Add rollout instances at runtime when rollout time increases.
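
The inter-stage items above can be illustrated with a small feedback loop. This is a sketch under stated assumptions, not the project's actual algorithm: the gap is taken as the mean of (rollout seconds minus training seconds) over a short window, and the remote share moves by a fraction of that gap each step. `OffloadController`, `window`, and `step` are hypothetical names.

```python
from collections import deque


class OffloadController:
    """Hypothetical controller: nudges the remote share of rollout time toward closing the gap."""

    def __init__(self, window: int = 4, step: float = 0.5):
        self.gaps = deque(maxlen=window)  # recent (rollout - training) gaps, in seconds
        self.step = step                  # fraction of the estimated gap corrected per update
        self.remote_share = 0.0           # share of rollout time sent to remote engines

    def observe(self, rollout_s: float, train_s: float) -> float:
        self.gaps.append(rollout_s - train_s)
        gap = sum(self.gaps) / len(self.gaps)
        # Positive gap: rollout is the bottleneck, so offload more.
        # Negative gap: training is the bottleneck, so pull work back.
        delta = self.step * gap / max(rollout_s, 1e-9)
        self.remote_share = min(1.0, max(0.0, self.remote_share + delta))
        return self.remote_share


ctrl = OffloadController()
s1 = ctrl.observe(rollout_s=120.0, train_s=80.0)  # rollout lags by 40 s, so the share rises
s2 = ctrl.observe(rollout_s=100.0, train_s=80.0)  # still lagging, so the share rises again
```

Expressing the share as a fraction of time rather than a request count matches the "request -> time" partition item: requests differ widely in length, so seconds are the steadier unit.
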
Pack sequences in rollout manager
- Send rollout prompts to manager in a batch.
- The rollout manager decomposes the batch into per-sample requests and sends them to rollout instances.
- Manage the order of rollout results and pack them into micro-batches.
- Dynamic batch size with a lower bound (block if under; return all when asked).
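
The packing behaviour can be sketched as follows. This is an assumed, simplified shape: `ResultPacker` and `take` are hypothetical names, and an empty return stands in for the blocking wait a real manager would perform while it is under the lower bound.

```python
from typing import List, Tuple

Result = Tuple[int, List[int]]  # (sample index in the original batch, generated token ids)


class ResultPacker:
    """Hypothetical packer: releases results only once a minimum micro-batch size is reached."""

    def __init__(self, min_batch: int):
        self.min_batch = min_batch
        self.ready: List[Result] = []

    def add(self, index: int, tokens: List[int]) -> None:
        # Results arrive in completion order, not in prompt order.
        self.ready.append((index, tokens))

    def take(self, flush: bool = False) -> List[Result]:
        # Under the lower bound a real manager would block; this sketch returns nothing.
        if not flush and len(self.ready) < self.min_batch:
            return []
        # Restore prompt order inside the micro-batch and hand everything over.
        batch = sorted(self.ready, key=lambda r: r[0])
        self.ready = []
        return batch


packer = ResultPacker(min_batch=2)
packer.add(3, [1, 2])
too_few = packer.take()            # one result is below the lower bound
packer.add(1, [5])
micro_batch = packer.take()        # two results, returned in index order
packer.add(2, [9])
tail = packer.take(flush=True)     # "return all when asked"
```
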
Fault tolerance
- Handle multiple failure cases
  - Spot instance preemption
  - Failure during weight transfer
  - Failure during rollout
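
For the preemption and rollout-failure cases, token-level tracking lets generation continue instead of restarting. The sketch below assumes the manager keeps every token streamed back so far. `TrackedRequest` and `reassign_failed` are hypothetical names, and round-robin reassignment is a placeholder policy.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class TrackedRequest:
    prompt: List[int]
    max_new_tokens: int
    generated: List[int] = field(default_factory=list)  # tokens streamed back so far
    instance: Optional[str] = None

    def resume_payload(self) -> Tuple[List[int], int]:
        # Continue from prompt + partial output, with only the remaining token budget.
        return self.prompt + self.generated, self.max_new_tokens - len(self.generated)


def reassign_failed(
    requests: Dict[str, TrackedRequest], failed: str, healthy: List[str]
) -> Dict[str, Tuple[List[int], int]]:
    """Re-dispatch unfinished requests of a failed instance, round-robin over healthy ones."""
    payloads = {}
    victims = [rid for rid, r in requests.items() if r.instance == failed]
    for i, rid in enumerate(victims):
        requests[rid].instance = healthy[i % len(healthy)]
        payloads[rid] = requests[rid].resume_payload()
    return payloads


requests = {
    "r1": TrackedRequest(prompt=[1, 2], max_new_tokens=5, generated=[7, 8], instance="spot-a"),
    "r2": TrackedRequest(prompt=[3], max_new_tokens=4, generated=[9], instance="spot-b"),
}
# "spot-a" is preempted: r1 resumes on "spot-b" with its two generated tokens kept.
payloads = reassign_failed(requests, failed="spot-a", healthy=["spot-b"])
```

Only the prefill over the already-generated tokens is repeated on the new instance, which is why the overhead of a failure can stay small.
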
Weight compression
- Quantization + lossless compression to reduce the size of the weights before transfer.
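
The two-stage idea can be shown with the standard library alone. This sketch assumes symmetric per-tensor int8 quantization followed by `zlib`. The roadmap does not name the actual quantization scheme or codec, so both are stand-ins, and `compress_weights` / `decompress_weights` are hypothetical helpers.

```python
import struct
import zlib
from typing import List, Tuple


def compress_weights(weights: List[float]) -> Tuple[bytes, float]:
    # Stage 1 (lossy): symmetric int8 quantization with one scale per tensor.
    scale = max(abs(w) for w in weights) / 127 or 1.0  # fall back to 1.0 for all-zero tensors
    quantized = [round(w / scale) for w in weights]
    raw = struct.pack(f"{len(quantized)}b", *quantized)
    # Stage 2 (lossless): generic byte-level compression of the quantized payload.
    return zlib.compress(raw), scale


def decompress_weights(blob: bytes, scale: float) -> List[float]:
    raw = zlib.decompress(blob)
    quantized = struct.unpack(f"{len(raw)}b", raw)
    return [q * scale for q in quantized]


weights = [0.5, -0.25, 0.0, 1.0] * 256
blob, scale = compress_weights(weights)
restored = decompress_weights(blob, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The quantization error is bounded by half a quantization step (`scale / 2`), while the lossless stage adds no further error. It only shrinks the bytes sent over the wire.
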
Off-policy support
- Unlock weight update of rollout engines.

Goal:

Rollout instances run on dynamic, independent hardware.
Rollout instances stream results to the training engine via an async-efficient manager process.
Weights are updated via an aggregated TCP & RDMA channel.