Roadmap (subject to change)

Release v1: Elastic and auto-balanced resource management

Goal:

  1. [Cost Efficiency] It harvests resources with dynamic availability (e.g., spot instances in the cloud) for rollout to reduce the cost of reinforcement learning.
  2. [Adaptivity] It employs a progressive workload balance algorithm to adaptively offload rollout tasks to available remote rollout engines. The algorithm estimates the workload gap between training and rollout from recent steps, so it adapts to changes in generation length and to dynamic resource availability.
  3. [Fault Tolerance] It handles failures of rollout instances with negligible overhead via token-level tracking and continuous generation.
  4. [Utilization] It also uses a server-based local rollout engine, so the rollout manager can balance workload holistically between reserved and preemptible resources to maximize resource utilization.
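
The fault-tolerance goal above relies on token-level tracking and continuous generation. The following is a minimal sketch of that idea, not the project's implementation: `TrackedRequest`, `reassign_after_failure`, and the payload keys `input_ids` and `max_new_tokens` are hypothetical names chosen for illustration. It assumes the manager records every streamed token, so a request interrupted by a failed instance can resume on another instance from the tokens already produced.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class TrackedRequest:
    """Hypothetical per-request record kept by the rollout manager."""
    prompt_ids: List[int]
    max_new_tokens: int
    generated_ids: List[int] = field(default_factory=list)
    instance: Optional[str] = None

    def on_tokens(self, token_ids: List[int]) -> None:
        # Record each streamed token so finished work survives a failure.
        self.generated_ids.extend(token_ids)

    def resume_payload(self) -> dict:
        # Continue from prompt + generated tokens with the remaining budget.
        return {
            "input_ids": self.prompt_ids + self.generated_ids,
            "max_new_tokens": self.max_new_tokens - len(self.generated_ids),
        }


def reassign_after_failure(
    requests: List[TrackedRequest], failed: str, healthy: List[str]
) -> List[Tuple[str, dict]]:
    """Move requests owned by a failed instance to healthy ones (round-robin)."""
    orphans = [r for r in requests if r.instance == failed]
    payloads = []
    for i, req in enumerate(orphans):
        req.instance = healthy[i % len(healthy)]
        payloads.append((req.instance, req.resume_payload()))
    return payloads
```

Under these assumptions, the cost of a preemption is limited to re-processing the already generated prefix on the new instance rather than regenerating it token by token.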
  • Intra-stage (rollout) balance
    • Rollout workload balance, starting from homogeneous rollout instances.
      • Token-level tracking and workload balance.
  • Inter-stage (rollout vs. training) balance
    • Rollout offload
      • Estimate workload gap between training and rollout in seconds.
      • Adaptively offload rollout workloads to remote rollout engines.
      • Fine-grained workload partition (request -> time).
    • Dynamic rollout instance allocation
      • Add rollout instances at runtime when rollout time increases.
  • Pack sequences in rollout manager
    • Send rollout prompts to manager in a batch.
    • Rollout manager decomposes the batch into per-sample requests and sends them to rollout instances.
    • Manage the order of rollout results and pack into micro-batches.
    • Dynamic batch size with a lower bound (block if under; return all when asked).
  • Fault tolerance
    • Handle multiple failure cases
      • Spot instance preemption
      • Failure during weight transfer
      • Failure during rollout
  • Weight compression
    • Quantization plus lossless compression to reduce the size of weights before transfer.
  • Off-policy support
    • Unlock weight update of rollout engines.

Release v0: Basic disaggregated RL system

Goal:

  1. Rollout instances running on dynamic independent hardware.
  2. Rollout instances stream results to the training engine via an async-efficient manager process.
  3. Weight update via a TCP & RDMA aggregated channel.
  • Rollout
    • Decouple rollout and update
      • Training engine sends requests to the SGLang server via API.
      • Rank zero sends all generation requests and skips local generation.
      • Wrap rollout results streaming into an iterator.
    • Multiple rollout instances management
      • Rollout manager in Rust
        • Rollout instances register via an API request.
        • Relay requests from training engine to rollout instances.
        • Interface of algorithm-driven request scheduling.
    • Rollout instances dynamic in-n-out
      • Register new rollout instances at runtime.
      • Shut down the connection when a rollout instance goes down.
  • Training
    • From batch process to stream process
      • Align micro-batch size along rollout-actor-critic.
      • Reward, KL, advantage in micro-batch.
      • Critic and actor forward/backward in micro-batch and update in mini-batch.
      • Align runtime metric collection with micro-batch processing.
  • Weight transfer
    • TCP&RDMA aggregated interface
      • Integrate Mooncake transfer engine.
      • Weight transfer agent for each instance.
    • Model weight gather and re-shard
      • Rank zero gathers weights from FSDP ranks and calls the agent to transfer them.
      • Support weight resharding on TP>1 rollout instances.
    • Compression interface
      • Encode and decode weight interface to support compressed weight update.
      • Asynchronous full weight update interface.
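
The move from batch processing to stream processing described under Training can be illustrated with a small sketch. It is not the project's code: `micro_batches`, `train_stream`, and the callback parameters `forward_backward` and `optimizer_step` are hypothetical names. It assumes rollout results arrive through an iterator, are grouped into micro-batches as they stream in, and trigger one optimizer update per mini-batch (a fixed number of micro-batches).

```python
from typing import Callable, Iterable, Iterator, List


def micro_batches(results: Iterable[dict], micro_batch_size: int) -> Iterator[List[dict]]:
    """Group streamed rollout results into micro-batches as they arrive."""
    batch: List[dict] = []
    for sample in results:
        batch.append(sample)
        if len(batch) == micro_batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial micro-batch


def train_stream(
    results: Iterable[dict],
    micro_batch_size: int,
    micro_per_mini: int,
    forward_backward: Callable[[List[dict]], None],
    optimizer_step: Callable[[], None],
) -> int:
    """Run forward/backward per micro-batch and one update per mini-batch."""
    pending = 0   # micro-batches accumulated since the last update
    updates = 0
    for mb in micro_batches(results, micro_batch_size):
        forward_backward(mb)  # gradients accumulate across micro-batches
        pending += 1
        if pending == micro_per_mini:
            optimizer_step()
            pending = 0
            updates += 1
    if pending:
        optimizer_step()  # flush a trailing partial mini-batch
        updates += 1
    return updates
```

Since `results` is consumed lazily, training on the first micro-batch can start while later samples are still being generated, which is the point of wrapping rollout result streaming in an iterator.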