Describe the bug
With NeMo Gym's `local_vllm_model` integration, setting `VLLM_RAY_DP_PACK_STRATEGY` to `strict` or `fill` does not by itself colocate all data-parallel (DP) ranks on a single node when each DP rank is backed by its own Ray placement group (one PG per rank, each requesting `world_size` GPUs, often 1 GPU for TP=1).
Ray's STRICT_PACK / PACK strategy only affects packing *within* one placement group. Independent PGs are still scheduled separately, so DP ranks can land on different nodes even though the strategy name suggests tight packing. This matches observing e.g. `safety_judge_model_dp_rank_*` spread across multiple distinct node IPs while `data_parallel_size: 4` and `fill`/`strict` are set.
safety_judge_model (4 unique PGs, 4 total entries)
- safety_judge_model_dp_rank_0 state=CREATED GPU=1 id=9401191b... node 1 (1 GPU)
- safety_judge_model_dp_rank_3 state=CREATED GPU=1 id=ba090962... node 2 (1 GPU)
- safety_judge_model_dp_rank_2 state=CREATED GPU=1 id=bfdd245c... node 2 (1 GPU)
- safety_judge_model_dp_rank_1 state=CREATED GPU=1 id=d04e063f... node 1 (1 GPU)
-> total GPU (sum over entries): 4
Upstream vLLM's DP placement path uses bundle node hints (`node:<ip>: 0.001`) so multiple DP PGs can be pinned to the same node; the Gym `create_dp_placement_groups` patch previously created extra PGs without those hints, so behavior diverged from upstream and from user expectations of "strict/fill = one node when it fits."
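The difference between the two bundle shapes can be sketched in plain Python (illustrative only; `dp_rank_bundles` is a made-up helper, not the actual vLLM or Gym code, though the `node:<ip>: 0.001` hint follows the upstream pattern described above):

```python
def dp_rank_bundles(world_size, dp_master_ip=None):
    """Build the bundle list for one DP rank's placement group.

    With a DP master IP, every bundle also requests a tiny slice of the
    Ray custom node resource `node:<ip>`; only that node can satisfy it,
    so the PG is pinned there. Without the hint (the buggy Gym path),
    independent per-rank PGs carry no cross-PG affinity, and Ray may
    scatter them across nodes regardless of STRICT_PACK / PACK.
    """
    bundle = {"GPU": 1.0}
    if dp_master_ip is not None:
        bundle[f"node:{dp_master_ip}"] = 0.001  # node-affinity hint
    return [dict(bundle) for _ in range(world_size)]

# Buggy shape: plain GPU bundles, nothing ties the per-rank PGs together.
unpinned = dp_rank_bundles(world_size=1)
# Upstream-style shape: bundles pinned to the DP master node.
pinned = dp_rank_bundles(world_size=1, dp_master_ip="10.0.0.1")
```

A list like `pinned` is what would then be handed to `ray.util.placement_group(...)` for each extra rank's PG.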
Steps/Code to reproduce bug
- Use a NeMo Gym stack with `responses_api_models.local_vllm_model` and `data_parallel_size > 1`, `tensor_parallel_size: 1` (e.g. a safety judge with DP=4).
- Set e.g. `vllm_serve_env_vars.VLLM_RAY_DP_PACK_STRATEGY: fill` (or `strict`).
- Start the job (`ng_run` / training entrypoint) and observe Ray placement groups (e.g. `python scripts/visualize_ray_placement_groups.py` or the Ray dashboard).
Expected behavior
When a single node has enough GPUs for `world_size * data_parallel_size`, all DP ranks for that deployment should schedule on one node (typically the DP master / rank-0 node), similar to upstream vLLM's behavior with node-affinity bundles, unless resources truly force multi-node placement.
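The "fits on one node" condition is just arithmetic; a one-line sanity check (hypothetical helper name, thresholds per the description above):

```python
def fits_on_one_node(gpus_per_node, world_size, dp_size):
    # All DP ranks (each needing `world_size` GPUs) fit on a single node.
    return world_size * dp_size <= gpus_per_node

# The reported setup: TP=1 (world_size=1), DP=4, 8-GPU nodes -> should colocate.
colocate_expected = fits_on_one_node(gpus_per_node=8, world_size=1, dp_size=4)
```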
Configs
- NeMo Gym YAML with `local_vllm_model`, e.g. `safety_judge_model` (or any model), with:
  - `tensor_parallel_size: 1`
  - `data_parallel_size: 4` (or any value `> 1`)
  - `data_parallel_size_local: 1` (common default)
  - `vllm_serve_env_vars.VLLM_RAY_DP_PACK_STRATEGY: fill` or `strict`
- Example training bundle: `training_configs/grpo_superv3_boisterous_dodo-20260131-r1-localvllm-medium.yaml` (safety judge section).
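A minimal YAML fragment matching the settings above (values are from this report; the exact nesting under `responses_api_models` and surrounding keys are assumed, so adapt to your config layout):

```yaml
# Hypothetical fragment -- key nesting is illustrative, values per this report
responses_api_models:
  safety_judge_model:
    tensor_parallel_size: 1
    data_parallel_size: 4
    data_parallel_size_local: 1
    vllm_serve_env_vars:
      VLLM_RAY_DP_PACK_STRATEGY: fill   # or "strict"
```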
Environment details
- OS: Linux (cluster / Slurm + Ray multi-node typical).
- Python: per NeMo Gym / training image (e.g. 3.11/3.12).
- Ray + vLLM versions: match the Gym submodule / `3rdparty/vllm` pin in the repo.
- (Attach `uv pip list` from the local_vllm_model venv or training container if filing upstream.)
Additional context
- Root cause (conceptual): one PG per DP rank + no cross-PG affinity ⇒ `fill`/`strict` don't imply "all ranks on the same node."
- Mitigation (NeMo Gym): patch `local_vllm_model/app.py` so that for `strict`/`fill`, when the DP master has enough available GPUs for the remaining ranks (`world_size * (dp_size - 1)` after rank 0), extra PGs include the same `node:<dp_master_ip>` bundle hint as upstream vLLM; see the `local_vllm_model/README.md` section on strict/fill and colocation.
- Hardware: e.g. multi-node cluster, 8×GPU nodes; issue visible when 4×1-GPU PGs could fit on one node but spread anyway.
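The mitigation's gating logic could look roughly like this (a sketch; `master_node_hint` and its arguments are illustrative names, not the actual `local_vllm_model/app.py` code):

```python
def master_node_hint(strategy, master_free_gpus, world_size, dp_size, dp_master_ip):
    """Extra bundle resource pinning the remaining DP ranks to the master node.

    Returns a `node:<ip>` hint only for strict/fill AND only when the DP
    master node can actually host all remaining ranks; otherwise None,
    leaving Ray free to place the extra PGs on other nodes.
    """
    remaining_gpus = world_size * (dp_size - 1)  # ranks after rank 0
    if strategy in ("strict", "fill") and master_free_gpus >= remaining_gpus:
        return {f"node:{dp_master_ip}": 0.001}
    return None
```

With this gate, the reported case (TP=1, DP=4, 7 free GPUs left on the rank-0 node of an 8-GPU node) pins the three remaining rank PGs to the master, while genuinely oversubscribed setups still fall back to multi-node placement.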