fix: type-aware micro batch distribution to prevent FSDP hang with VLMs (#1918)
Open
Conversation
When training VLMs with FSDP, the vision encoder is wrapped as its own FSDP unit, requiring all ranks to participate in its all-gather collectives. If some GPUs process a multimodal micro batch (calling the vision encoder) while others process a text-only micro batch at the same micro_step, FSDP hangs.

Reorder micro batch distribution in `prepare_batch` so that at every micro_step all GPUs process the same type (multimodal or text-only):

- Split micro batches into MM and text-only groups
- Pad each group independently to be divisible by num_train_workers
- Concatenate and distribute round-robin (GPU i gets `batches[i::W]`)

This is a no-op for pure text-only training.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
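The split/pad/round-robin steps above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the function name, the batch-dict shape, and the `is_padding` marker are assumptions for the example.

```python
from typing import Any


def distribute_micro_batches(
    micro_batches: list[dict[str, Any]], num_train_workers: int
) -> list[list[dict[str, Any]]]:
    """Type-aware round-robin distribution (illustrative sketch).

    Ensures that at every micro_step, all workers process micro batches
    of the same type (multimodal vs. text-only).
    """
    # 1. Split into multimodal (has pixel_values) and text-only groups.
    mm = [b for b in micro_batches if "pixel_values" in b]
    text = [b for b in micro_batches if "pixel_values" not in b]

    def pad(group: list[dict[str, Any]]) -> list[dict[str, Any]]:
        # 2. Pad each group independently to a multiple of num_train_workers
        # by duplicating the last batch of the group, so the filler keeps its
        # pixel_values and still exercises the vision encoder.
        if not group:
            return group
        remainder = len(group) % num_train_workers
        if remainder:
            filler = dict(group[-1])
            filler["is_padding"] = True  # hypothetical zero-loss marker
            group = group + [filler] * (num_train_workers - remainder)
        return group

    # 3. Concatenate padded groups and distribute round-robin:
    # worker i receives ordered[i::W], so position s of every worker's
    # list comes from the same type-homogeneous block of W batches.
    ordered = pad(mm) + pad(text)
    return [ordered[i::num_train_workers] for i in range(num_train_workers)]
```

With, say, 3 multimodal and 5 text-only micro batches on 4 workers, the MM group is padded to 4 and the text group to 8, giving each worker 3 micro batches whose types line up step-by-step across workers.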
Summary

Reorder micro batch distribution in `prepare_batch` using type-aware round-robin: split into MM/text groups, pad each independently, concatenate, then distribute `batches[i::W]` so all GPUs process the same type at every step.

Test plan

- `test_batch.py` tests pass (regression)
- Padding micro batches preserve `pixel_values` (vision encoder still runs)

🤖 Generated with Claude Code
Note
Medium Risk
Changes micro-batch padding and distribution logic in `prepare_batch`, which can affect training ordering/throughput and edge-case batching behavior, especially for mixed multimodal/text rollouts. Adds coverage for multimodal padding/type alignment, reducing regression risk, but still touches core training batching.

Overview

Updates `prepare_batch` to type-align multimodal vs. text-only micro-batches across all GPUs per micro-step, preventing FSDP all-gather hangs when training VLMs. The batcher now splits micro-batches into multimodal/text groups, pads each group with zero-loss "padding" micro-batches (preserving `pixel_values`/`image_grid_thw`), concatenates, and distributes via round-robin (`micro_batches[i::W]`) instead of contiguous chunking. Adds unit tests covering mixed MM/text alignment, multimodal padding preservation, all-multimodal behavior, and that pure text-only behavior remains unchanged.
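A toy example (string labels standing in for micro batches) shows why round-robin slicing aligns types per micro-step where the old contiguous chunking does not. Both groups here already have sizes divisible by the worker count, which is exactly what the padding step guarantees in general:

```python
# 2 multimodal and 2 text-only micro batches, 2 workers.
batches = ["MM", "MM", "TEXT", "TEXT"]
W = 2

# Old contiguous chunking: each rank takes a contiguous slice.
chunk = len(batches) // W
contiguous = [batches[i * chunk:(i + 1) * chunk] for i in range(W)]
# At micro_step 0, rank 0 sees "MM" while rank 1 sees "TEXT": rank 1
# never enters the vision encoder's all-gather, so FSDP hangs.

# New round-robin distribution: rank i takes batches[i::W].
round_robin = [batches[i::W] for i in range(W)]
# At every micro_step, both ranks now see the same batch type.
```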
Written by Cursor Bugbot for commit 6733138.