
fix: type-aware micro batch distribution to prevent FSDP hang with VLMs#1918

Open
samsja wants to merge 1 commit into `main` from `fix/fsdp-vlm-type-aligned-distribution`

Conversation


@samsja samsja commented Mar 1, 2026

Summary

  • When training VLMs (e.g. Qwen3-VL) with FSDP, the vision encoder is wrapped as its own FSDP unit, so all ranks must participate in its all-gather collectives at every micro_step. If some GPUs receive a multimodal micro batch while others receive text-only, FSDP hangs indefinitely.
  • Reorders micro batch distribution in `prepare_batch` using a type-aware round-robin: split micro batches into multimodal and text-only groups, pad each group independently, concatenate, then distribute `batches[i::W]` so all GPUs process the same type at every micro_step.
  • No-op for pure text-only training. No changes to the forward pass, model, or data pipeline.
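The grouping and round-robin steps above can be sketched as follows. This is a minimal illustration, not the actual `prepare_batch` implementation: the `distribute_type_aligned` name, the plain-dict micro batches, and the `pixel_values`/`loss_mask` keys are assumptions for the sake of the example.

```python
from typing import Any


def distribute_type_aligned(
    micro_batches: list[dict[str, Any]], num_train_workers: int
) -> list[list[dict[str, Any]]]:
    W = num_train_workers
    # Split into multimodal and text-only groups (hypothetical key check).
    mm = [b for b in micro_batches if "pixel_values" in b]
    text = [b for b in micro_batches if "pixel_values" not in b]

    def pad(group: list[dict[str, Any]]) -> list[dict[str, Any]]:
        # Pad a non-empty group until its length is divisible by W. A padding
        # batch clones the last real batch but zeroes its loss mask, keeping
        # pixel_values intact so the vision encoder still runs on every rank.
        while group and len(group) % W != 0:
            pad_batch = dict(group[-1])
            pad_batch["loss_mask"] = [0] * len(pad_batch["loss_mask"])
            group.append(pad_batch)
        return group

    batches = pad(mm) + pad(text)
    # Round-robin: worker i takes batches[i::W], so at every micro_step all
    # workers see the same batch type (first the MM block, then text-only).
    return [batches[i::W] for i in range(W)]
```

With 3 multimodal and 2 text-only micro batches on 2 workers, the MM group is padded to 4, and each worker ends up with the sequence MM, MM, text: types match at every step.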

Test plan

  • All 8 existing test_batch.py tests pass (regression)
  • New test: mixed MM/text batches are type-aligned across workers at every micro_step
  • New test: MM padding batches preserve pixel_values (vision encoder still runs)
  • New test: all-multimodal edge case
  • New test: pure text-only is unchanged

🤖 Generated with Claude Code


Note

Medium Risk
Changes micro-batch padding and distribution logic in `prepare_batch`, which can affect training ordering/throughput and edge-case batching behavior, especially for mixed multimodal/text rollouts. Adds coverage for multimodal padding and type alignment, reducing regression risk, but the change still touches core training batching.

Overview
Updates `prepare_batch` to type-align multimodal vs text-only micro-batches across all GPUs per micro-step to prevent FSDP all-gather hangs when training VLMs.

The batcher now splits micro-batches into multimodal/text groups, pads each group with zero-loss “padding” micro-batches (preserving `pixel_values`/`image_grid_thw`), concatenates, and distributes via round-robin (`micro_batches[i::W]`) instead of contiguous chunking.

Adds unit tests covering mixed MM/text alignment, multimodal padding preservation, all-multimodal behavior, and that pure text-only behavior remains unchanged.
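The contrast between the old contiguous chunking and the new round-robin scheme can be seen in a toy sketch, using `"M"`/`"T"` strings as stand-ins for multimodal and text-only micro-batches (the variable names here are illustrative, not from the codebase):

```python
# "M" = multimodal micro-batch, "T" = text-only; 2 workers.
batches = ["M", "M", "T", "T"]
W = 2
chunk = len(batches) // W

# Contiguous chunking: worker 0 gets ["M", "M"], worker 1 gets ["T", "T"].
# At micro_step 0, rank 0 enters the vision encoder's all-gather while
# rank 1 never calls the vision encoder at all -> FSDP hangs.
contiguous = [batches[i * chunk:(i + 1) * chunk] for i in range(W)]

# Round-robin: worker i gets batches[i::W]; both workers see ["M", "T"],
# so the batch type matches on every rank at every micro_step.
round_robin = [batches[i::W] for i in range(W)]
```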

Written by Cursor Bugbot for commit 6733138.

When training VLMs with FSDP, the vision encoder is wrapped as its own
FSDP unit requiring all ranks to participate in all-gather collectives.
If some GPUs process a multimodal micro batch (calling the vision encoder)
while others process text-only at the same micro_step, FSDP hangs.

Reorder micro batch distribution in prepare_batch so that at every
micro_step all GPUs process the same type (multimodal or text-only):
- Split micro batches into MM and text-only groups
- Pad each group independently to be divisible by num_train_workers
- Concatenate and distribute round-robin (GPU i gets batches[i::W])

This is a no-op for pure text-only training.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>