Ray fill/strict does not colocate multi-DP local_vllm_model ranks (missing node-affinity on extra PGs) #914

@ffrujeri

Description

Describe the bug

With NeMo Gym’s local_vllm_model integration, setting VLLM_RAY_DP_PACK_STRATEGY to strict or fill does not by itself colocate all data-parallel (DP) ranks on a single node when each DP rank is backed by its own Ray placement group (one PG per rank, each requesting world_size GPUs—often 1 GPU for TP=1).

Ray’s STRICT_PACK / PACK strategies only control packing within a single placement group. Independent PGs are still scheduled separately, so DP ranks can land on different nodes even though the strategy name suggests tight packing. This matches the observed symptom: safety_judge_model_dp_rank_* entries spread across multiple distinct node IPs even with data_parallel_size: 4 and fill/strict set.

safety_judge_model  (4 unique PGs, 4 total entries)
    - safety_judge_model_dp_rank_0  state=CREATED  GPU=1  id=9401191b...  nodes 1 (1 GPU)
    - safety_judge_model_dp_rank_3  state=CREATED  GPU=1  id=ba090962...  nodes 2 (1 GPU)
    - safety_judge_model_dp_rank_2  state=CREATED  GPU=1  id=bfdd245c...  nodes 2 (1 GPU)
    - safety_judge_model_dp_rank_1  state=CREATED  GPU=1  id=d04e063f...  nodes 1 (1 GPU)
    -> total GPU (sum over entries): 4

Upstream vLLM’s DP placement path uses bundle node hints (node:<ip>: 0.001) so multiple DP PGs can be pinned to the same node; the Gym create_dp_placement_groups patch previously created extra PGs without those hints, so behavior diverged from upstream and from user expectations of “strict/fill = one node when it fits.”

Steps/Code to reproduce bug

  1. Use a NeMo Gym stack with responses_api_models.local_vllm_model and data_parallel_size > 1, tensor_parallel_size: 1 (e.g. safety judge with DP=4).
  2. Set e.g. vllm_serve_env_vars.VLLM_RAY_DP_PACK_STRATEGY: fill (or strict).
  3. Start the job (ng_run / training entrypoint) and observe Ray placement groups (e.g. python scripts/visualize_ray_placement_groups.py or Ray dashboard).

Expected behavior

When a single node has enough GPUs for world_size * data_parallel_size, all DP ranks for that deployment should schedule on one node (typically the DP master / rank-0 node), similar to upstream vLLM’s behavior with node-affinity bundles—unless resources truly force multi-node placement.
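The “when it fits” condition above reduces to a simple capacity check; a sketch with illustrative names:

```python
def fits_on_one_node(node_gpus: int, world_size: int, data_parallel_size: int) -> bool:
    """True when a single node can host every DP rank of the deployment."""
    return node_gpus >= world_size * data_parallel_size

# 8-GPU node, TP=1, DP=4: four 1-GPU placement groups fit on one node.
print(fits_on_one_node(8, 1, 4))   # True
# TP=4, DP=4 needs 16 GPUs, so multi-node placement is unavoidable.
print(fits_on_one_node(8, 4, 4))   # False
```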

Configs

  • NeMo Gym YAML with local_vllm_model, e.g. safety_judge_model (or any model) with:
    • tensor_parallel_size: 1
    • data_parallel_size: 4 (or any > 1)
    • data_parallel_size_local: 1 (common default)
    • vllm_serve_env_vars.VLLM_RAY_DP_PACK_STRATEGY: fill or strict
  • Example training bundle: training_configs/grpo_superv3_boisterous_dodo-20260131-r1-localvllm-medium.yaml (safety judge section).

Environment details

  • OS: Linux (cluster / Slurm + Ray multi-node typical).
  • Python: per NeMo Gym / training image (e.g. 3.11/3.12).
  • Ray + vLLM versions: match the Gym submodule / 3rdparty/vllm pin in the repo.
  • (Attach uv pip list from the local_vllm_model venv or training container if filing upstream.)

Additional context

  • Root cause (conceptual): one PG per DP rank + no cross-PG affinity ⇒ fill/strict don’t imply “all ranks same node.”
  • Mitigation (NeMo Gym): patch local_vllm_model/app.py so that for strict/fill, when the DP master has enough available GPUs for the remaining ranks (world_size * (dp_size - 1) after rank 0), extra PGs include the same node:<dp_master_ip> bundle hint as upstream vLLM; see local_vllm_model/README.md section on strict/fill and colocation.
  • Hardware: e.g. multi-node cluster, 8×GPU nodes; issue visible when 4×1-GPU PGs could fit on one node but spread anyway.
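The mitigation above can be sketched as follows; the function signature and field names are hypothetical stand-ins, not the actual local_vllm_model/app.py code:

```python
def extra_pg_bundles(
    strategy: str,
    dp_master_ip: str,
    master_free_gpus: int,
    world_size: int,
    dp_size: int,
) -> list[list[dict]]:
    """Bundle specs for DP ranks 1..dp_size-1.

    For strict/fill, if the DP master node still has enough free GPUs
    for the remaining ranks, pin every extra PG to the master node with
    the same `node:<ip>` hint upstream vLLM uses; otherwise leave the
    PGs unconstrained and let Ray spread them.
    """
    remaining = world_size * (dp_size - 1)  # GPUs needed after rank 0
    colocate = strategy in ("strict", "fill") and master_free_gpus >= remaining
    pgs = []
    for _ in range(dp_size - 1):
        bundles = [{"GPU": 1.0} for _ in range(world_size)]
        if colocate:
            for bundle in bundles:
                bundle[f"node:{dp_master_ip}"] = 0.001
        pgs.append(bundles)
    return pgs
```

Each inner bundle list would then be handed to `ray.util.placement_group(bundles=..., strategy=...)` when creating the per-rank PGs, mirroring what upstream's create_dp_placement_groups path does.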
