
Add DP-master node affinity for Ray strict/fill pack strategy.#916

Open
ffrujeri wants to merge 1 commit into main from ffrujeri/local-vllm-dp-patch

Conversation


@ffrujeri ffrujeri commented Mar 19, 2026

What does this PR do?

Pins the extra Ray placement groups that local_vllm_model creates for data-parallel ranks to the DP master node (via node:<dp_master_ip> bundle hints) when VLLM_RAY_DP_PACK_STRATEGY is strict or fill and the master has enough available GPUs. This matches upstream vLLM behavior, so multi-DP deployments colocate on one node when capacity allows.

Issues

Usage

  • strict / fill: No new user-facing API. Keep using local_vllm_model with vllm_serve_env_vars and VLLM_RAY_DP_PACK_STRATEGY: fill or strict. When the DP master node has enough free GPUs for all non-rank-0 placement groups (world_size * (data_parallel_size - 1) GPUs beyond rank 0's), the extra PGs are scheduled with the same node-affinity hint that upstream vLLM uses.
# Example: safety judge (or any local_vllm_model) with multi-DP on one node when it fits
safety_judge_model:
  _target_: responses_api_models.local_vllm_model.app.LocalVLLMModel
  # ...
  vllm_serve_kwargs:
    tensor_parallel_size: 1
    pipeline_parallel_size: 1
    data_parallel_size: 4
    data_parallel_size_local: 1
    # ... other serve args
  vllm_serve_env_vars:
    VLLM_RAY_DP_PACK_STRATEGY: fill   # or strict

Verify colocation with your usual tooling (e.g. Ray dashboard or python scripts/visualize_ray_placement_groups.py).
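For the example config above, the capacity threshold works out as follows (a quick arithmetic sketch; `world_size` here means `tensor_parallel_size * pipeline_parallel_size`, and the variable names are illustrative):

```python
# Capacity needed on the DP master for the non-rank-0 placement groups,
# per the rule described in this PR (illustrative arithmetic only).
tensor_parallel_size = 1
pipeline_parallel_size = 1
data_parallel_size = 4

world_size = tensor_parallel_size * pipeline_parallel_size
extra_gpus_needed = world_size * (data_parallel_size - 1)
print(extra_gpus_needed)  # 3: GPUs that must be free on the master beyond rank 0's
```

If the master has at least this many GPUs available after rank 0 is placed, the extra PGs get pinned; otherwise they are left unpinned.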

Additional Information

  • Problem: Ray STRICT_PACK / PACK only packs within a single placement group. NeMo Gym’s patch creates one PG per DP rank; without cross-PG hints, ranks can spread across nodes even with fill/strict, diverging from the expectations in issue #914 and from upstream vLLM’s node:<ip> bundle hints.

Before this PR we would see:

safety_judge_model  (4 unique PGs, 4 total entries)
    - safety_judge_model_dp_rank_0  state=CREATED  GPU=1  id=9401191b...  nodes 1 (1 GPU)
    - safety_judge_model_dp_rank_3  state=CREATED  GPU=1  id=ba090962...  nodes 2 (1 GPU)
    - safety_judge_model_dp_rank_2  state=CREATED  GPU=1  id=bfdd245c...  nodes 2 (1 GPU)
    - safety_judge_model_dp_rank_1  state=CREATED  GPU=1  id=d04e063f...  nodes 1 (1 GPU)
    -> total GPU (sum over entries): 4

After:

safety_judge_model  (4 unique PGs, 4 total entries)
    - safety_judge_model_dp_rank_0  state=CREATED  GPU=1  id=9401191b...  nodes 1 (1 GPU)
    - safety_judge_model_dp_rank_3  state=CREATED  GPU=1  id=ba090962...  nodes 1 (1 GPU)
    - safety_judge_model_dp_rank_2  state=CREATED  GPU=1  id=bfdd245c...  nodes 1 (1 GPU)
    - safety_judge_model_dp_rank_1  state=CREATED  GPU=1  id=d04e063f...  nodes 1 (1 GPU)
    -> total GPU (sum over entries): 4
  • Change: For strict/fill and dp_size > 1, if the DP master’s available GPU count ≥ world_size * (dp_size - 1), extra PGs use {device_str: 1.0, "node:" + dp_master_ip: 0.001} on each GPU bundle; otherwise affinity is left unset and a log line explains why pinning was skipped.
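The scheduling rule in this bullet can be sketched as plain Python. This is a minimal illustration of the described logic, not the actual code in app.py; the function and parameter names are made up for the sketch:

```python
def build_extra_pg_bundles(world_size, dp_size, device_str,
                           dp_master_ip, available_gpus_on_master):
    """Return per-rank bundle lists for DP ranks 1..dp_size-1.

    Mirrors the capacity check described above: pin the extra placement
    groups to the DP master only when it can host all non-rank-0 ranks.
    """
    needed = world_size * (dp_size - 1)
    pin = dp_size > 1 and available_gpus_on_master >= needed
    bundles_per_rank = []
    for _rank in range(1, dp_size):
        bundle = {device_str: 1.0}
        if pin:
            # A tiny fractional claim on Ray's custom "node:<ip>" resource
            # acts as a node-affinity hint, as upstream vLLM does.
            bundle["node:" + dp_master_ip] = 0.001
        bundles_per_rank.append([dict(bundle) for _ in range(world_size)])
    return bundles_per_rank
```

With world_size=1, dp_size=4, and 3 free GPUs on the master, every extra bundle carries the node hint; with only 2 free GPUs, the hint is omitted and scheduling falls back to Ray's default behavior.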

  • Scope: responses_api_models/local_vllm_model/app.py (_patch_create_dp_placement_groups); head/rank-0 PG and existing colocated-PG resource filtering are unchanged.

  • Docs: If local_vllm_model/README.md already describes strict/fill and colocation, consider a one-line note that pinning now matches upstream when the master has capacity (optional follow-up).

Signed-off-by: Felipe Vieira Frujeri <ffrujeri@nvidia.com>

copy-pr-bot bot commented Mar 19, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.




Development

Successfully merging this pull request may close these issues.

Ray fill/strict does not colocate multi-DP local_vllm_model ranks (missing node-affinity on extra PGs)
