Describe the bug
With NeMo Gym's `local_vllm_model` integration, setting `VLLM_RAY_DP_PACK_STRATEGY` to `strict` or `fill` does not by itself colocate all data-parallel (DP) ranks on a single node when each DP rank is backed by its own Ray placement group (one PG per rank, each requesting `world_size` GPUs, often 1 GPU for TP=1).
Ray's STRICT_PACK / PACK strategy only affects packing *within* one placement group. Independent PGs are still scheduled separately, so DP ranks can land on different nodes even though the strategy name suggests tight packing. This matches observing e.g. `safety_judge_model_dp_rank_*` spread across multiple distinct node IPs while `data_parallel_size: 4` and `fill`/`strict` are set.
safety_judge_model (4 unique PGs, 4 total entries)
- safety_judge_model_dp_rank_0 state=CREATED GPU=1 id=9401191b... node 1 (1 GPU)
- safety_judge_model_dp_rank_3 state=CREATED GPU=1 id=ba090962... node 2 (1 GPU)
- safety_judge_model_dp_rank_2 state=CREATED GPU=1 id=bfdd245c... node 2 (1 GPU)
- safety_judge_model_dp_rank_1 state=CREATED GPU=1 id=d04e063f... node 1 (1 GPU)
-> total GPU (sum over entries): 4
Upstream vLLM's DP placement path uses bundle node hints (`node:<ip>: 0.001`) so multiple DP PGs can be pinned to the same node; the Gym `create_dp_placement_groups` patch previously created extra PGs without those hints, so behavior diverged from upstream and from user expectations of "strict/fill = one node when it fits."
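The difference between the two bundle shapes can be sketched in plain Python (illustrative only; `dp_rank_bundles` is a made-up helper, not the actual vLLM or Gym code, though the `node:<ip>: 0.001` hint follows the upstream pattern described above):

```python
def dp_rank_bundles(world_size, dp_master_ip=None):
    """Build the bundle list for one DP rank's placement group.

    With a DP master IP, every bundle also requests a tiny slice of the
    Ray custom node resource `node:<ip>`; only that node can satisfy it,
    so the PG is pinned there. Without the hint (the buggy Gym path),
    independent per-rank PGs carry no cross-PG affinity, and Ray may
    scatter them across nodes regardless of STRICT_PACK / PACK.
    """
    bundle = {"GPU": 1.0}
    if dp_master_ip is not None:
        bundle[f"node:{dp_master_ip}"] = 0.001  # node-affinity hint
    return [dict(bundle) for _ in range(world_size)]

# Buggy shape: plain GPU bundles, nothing ties the per-rank PGs together.
unpinned = dp_rank_bundles(world_size=1)
# Upstream-style shape: bundles pinned to the DP master node.
pinned = dp_rank_bundles(world_size=1, dp_master_ip="10.0.0.1")
```

A list like `pinned` is what would then be handed to `ray.util.placement_group(...)` for each extra rank's PG.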
Steps/Code to reproduce bug
- Use a NeMo Gym stack with `responses_api_models.local_vllm_model` and `data_parallel_size > 1`, `tensor_parallel_size: 1` (e.g. a safety judge with DP=4).
- Set e.g. `vllm_serve_env_vars.VLLM_RAY_DP_PACK_STRATEGY: fill` (or `strict`).
- Start the job (`ng_run` / training entrypoint) and observe Ray placement groups (e.g. `python scripts/visualize_ray_placement_groups.py` or the Ray dashboard).
Expected behavior
When a single node has enough GPUs for `world_size * data_parallel_size`, all DP ranks for that deployment should schedule on one node (typically the DP master / rank-0 node), similar to upstream vLLM's behavior with node-affinity bundles, unless resources truly force multi-node placement.
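The "fits on one node" condition is just arithmetic; a one-line sanity check (hypothetical helper name, thresholds per the description above):

```python
def fits_on_one_node(gpus_per_node, world_size, dp_size):
    # All DP ranks (each needing `world_size` GPUs) fit on a single node.
    return world_size * dp_size <= gpus_per_node

# The reported setup: TP=1 (world_size=1), DP=4, 8-GPU nodes -> should colocate.
colocate_expected = fits_on_one_node(gpus_per_node=8, world_size=1, dp_size=4)
```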
Configs
- NeMo Gym YAML with `local_vllm_model`, e.g. `safety_judge_model` (or any model), with:
  - `tensor_parallel_size: 1`
  - `data_parallel_size: 4` (or any value `> 1`)
  - `data_parallel_size_local: 1` (common default)
  - `vllm_serve_env_vars.VLLM_RAY_DP_PACK_STRATEGY: fill` or `strict`
- Example training bundle: `training_configs/grpo_superv3_boisterous_dodo-20260131-r1-localvllm-medium.yaml` (safety judge section).
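A minimal YAML fragment matching the settings above (values are from this report; the exact nesting under `responses_api_models` and surrounding keys are assumed, so adapt to your config layout):

```yaml
# Hypothetical fragment -- key nesting is illustrative, values per this report
responses_api_models:
  safety_judge_model:
    tensor_parallel_size: 1
    data_parallel_size: 4
    data_parallel_size_local: 1
    vllm_serve_env_vars:
      VLLM_RAY_DP_PACK_STRATEGY: fill   # or "strict"
```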
Environment details
- OS: Linux (cluster / Slurm + Ray multi-node typical).
- Python: per NeMo Gym / training image (e.g. 3.11/3.12).
- Ray + vLLM versions: match the Gym submodule / `3rdparty/vllm` pin in the repo.
- (Attach `uv pip list` from the local_vllm_model venv or training container if filing upstream.)
Additional context
- Root cause (conceptual): one PG per DP rank + no cross-PG affinity ⇒ `fill`/`strict` don't imply "all ranks on the same node."
- Mitigation (NeMo Gym): patch `local_vllm_model/app.py` so that for `strict`/`fill`, when the DP master has enough available GPUs for the remaining ranks (`world_size * (dp_size - 1)` after rank 0), extra PGs include the same `node:<dp_master_ip>` bundle hint as upstream vLLM; see the `local_vllm_model/README.md` section on strict/fill and colocation.
- Hardware: e.g. multi-node cluster, 8×GPU nodes; issue visible when 4×1-GPU PGs could fit on one node but spread anyway.
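The mitigation's gating logic could look roughly like this (a sketch; `master_node_hint` and its arguments are illustrative names, not the actual `local_vllm_model/app.py` code):

```python
def master_node_hint(strategy, master_free_gpus, world_size, dp_size, dp_master_ip):
    """Extra bundle resource pinning the remaining DP ranks to the master node.

    Returns a `node:<ip>` hint only for strict/fill AND only when the DP
    master node can actually host all remaining ranks; otherwise None,
    leaving Ray free to place the extra PGs on other nodes.
    """
    remaining_gpus = world_size * (dp_size - 1)  # ranks after rank 0
    if strategy in ("strict", "fill") and master_free_gpus >= remaining_gpus:
        return {f"node:{dp_master_ip}": 0.001}
    return None
```

With this gate, the reported case (TP=1, DP=4, 7 free GPUs left on the rank-0 node of an 8-GPU node) pins the three remaining rank PGs to the master, while genuinely oversubscribed setups still fall back to multi-node placement.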