
fix: type-aware micro batch distribution to prevent FSDP hang with VLMs#1918

Open
samsja wants to merge 1 commit into `main` from `fix/fsdp-vlm-type-aligned-distribution`

Conversation


@samsja samsja commented Mar 1, 2026

Summary

  • When training VLMs (e.g. Qwen3-VL) with FSDP, the vision encoder is wrapped as its own FSDP unit, so all ranks must participate in its all-gather collectives at every micro_step. If some GPUs receive a multimodal micro batch while others receive text-only, FSDP hangs indefinitely.
  • Reorders micro batch distribution in `prepare_batch` using a type-aware round-robin: split micro batches into multimodal and text-only groups, pad each group independently, concatenate, then distribute `batches[i::W]` so all GPUs process the same type at every micro_step.
  • No-op for pure text-only training. No changes to the forward pass, model, or data pipeline.
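The grouping and round-robin steps above can be sketched as follows. This is a minimal illustration, not the actual `prepare_batch` implementation: the `distribute_type_aligned` name, the plain-dict micro batches, and the `pixel_values`/`loss_mask` keys are assumptions for the sake of the example.

```python
from typing import Any


def distribute_type_aligned(
    micro_batches: list[dict[str, Any]], num_train_workers: int
) -> list[list[dict[str, Any]]]:
    W = num_train_workers
    # Split into multimodal and text-only groups (hypothetical key check).
    mm = [b for b in micro_batches if "pixel_values" in b]
    text = [b for b in micro_batches if "pixel_values" not in b]

    def pad(group: list[dict[str, Any]]) -> list[dict[str, Any]]:
        # Pad a non-empty group until its length is divisible by W. A padding
        # batch clones the last real batch but zeroes its loss mask, keeping
        # pixel_values intact so the vision encoder still runs on every rank.
        while group and len(group) % W != 0:
            pad_batch = dict(group[-1])
            pad_batch["loss_mask"] = [0] * len(pad_batch["loss_mask"])
            group.append(pad_batch)
        return group

    batches = pad(mm) + pad(text)
    # Round-robin: worker i takes batches[i::W], so at every micro_step all
    # workers see the same batch type (first the MM block, then text-only).
    return [batches[i::W] for i in range(W)]
```

With 3 multimodal and 2 text-only micro batches on 2 workers, the MM group is padded to 4, and each worker ends up with the sequence MM, MM, text: types match at every step.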

Test plan

  • All 8 existing test_batch.py tests pass (regression)
  • New test: mixed MM/text batches are type-aligned across workers at every micro_step
  • New test: MM padding batches preserve pixel_values (vision encoder still runs)
  • New test: all-multimodal edge case
  • New test: pure text-only is unchanged

🤖 Generated with Claude Code


Note

Medium Risk
Changes micro-batch padding and distribution logic in `prepare_batch`, which can affect training ordering/throughput and edge-case batching behavior, especially for mixed multimodal/text rollouts. Adds coverage for multimodal padding and type alignment, reducing regression risk, but the change still touches core training batching.

Overview
Updates `prepare_batch` to type-align multimodal vs text-only micro-batches across all GPUs per micro-step to prevent FSDP all-gather hangs when training VLMs.

The batcher now splits micro-batches into multimodal/text groups, pads each group with zero-loss “padding” micro-batches (preserving `pixel_values`/`image_grid_thw`), concatenates, and distributes via round-robin (`micro_batches[i::W]`) instead of contiguous chunking.

Adds unit tests covering mixed MM/text alignment, multimodal padding preservation, all-multimodal behavior, and that pure text-only behavior remains unchanged.
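The contrast between the old contiguous chunking and the new round-robin scheme can be seen in a toy sketch, using `"M"`/`"T"` strings as stand-ins for multimodal and text-only micro-batches (the variable names here are illustrative, not from the codebase):

```python
# "M" = multimodal micro-batch, "T" = text-only; 2 workers.
batches = ["M", "M", "T", "T"]
W = 2
chunk = len(batches) // W

# Contiguous chunking: worker 0 gets ["M", "M"], worker 1 gets ["T", "T"].
# At micro_step 0, rank 0 enters the vision encoder's all-gather while
# rank 1 never calls the vision encoder at all -> FSDP hangs.
contiguous = [batches[i * chunk:(i + 1) * chunk] for i in range(W)]

# Round-robin: worker i gets batches[i::W]; both workers see ["M", "T"],
# so the batch type matches on every rank at every micro_step.
round_robin = [batches[i::W] for i in range(W)]
```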

Written by Cursor Bugbot for commit 6733138.

When training VLMs with FSDP, the vision encoder is wrapped as its own
FSDP unit requiring all ranks to participate in all-gather collectives.
If some GPUs process a multimodal micro batch (calling the vision encoder)
while others process text-only at the same micro_step, FSDP hangs.

Reorder micro batch distribution in prepare_batch so that at every
micro_step all GPUs process the same type (multimodal or text-only):
- Split micro batches into MM and text-only groups
- Pad each group independently to be divisible by num_train_workers
- Concatenate and distribute round-robin (GPU i gets batches[i::W])

This is a no-op for pure text-only training.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>