
Conversation

@djsaunde (Collaborator) commented Dec 2, 2025

This PR is built on top of #3566, and should be merged after it.

This PR auto-sets padding_free=True when applicable (text-only SFT training) and computes sequence-length metadata behind the scenes, so we can use the varlen flash attention kernels, block-diagonal SDPA, or the xformers kernels.
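For context, here is a minimal sketch (not the code from this PR) of the kind of sequence-length metadata the varlen path relies on: once padding is removed, the batch is one packed tensor of tokens, and the kernel is told where each sequence starts and ends via cumulative sequence lengths (cu_seqlens). The example assumes flash-attn 2.x and uses its public flash_attn_varlen_func API; everything outside that API (helper name, shapes, example lengths) is illustrative.

```python
# Minimal sketch, for illustration only: building the sequence-length metadata
# that varlen attention kernels expect from an unpadded ("padding-free") batch.
import torch
from flash_attn import flash_attn_varlen_func

def build_cu_seqlens(seq_lens: list[int], device="cuda") -> tuple[torch.Tensor, int]:
    """Cumulative sequence lengths (int32, shape [batch + 1]) plus the max length."""
    lens = torch.tensor(seq_lens, dtype=torch.int32, device=device)
    cu_seqlens = torch.nn.functional.pad(
        torch.cumsum(lens, dim=0, dtype=torch.int32), (1, 0)
    )
    return cu_seqlens, int(lens.max())

# Example: three sequences of lengths 5, 3, and 7 concatenated into one packed
# tensor of 15 tokens, with no padding tokens in between.
seq_lens = [5, 3, 7]
total_tokens, n_heads, head_dim = sum(seq_lens), 8, 64
q = torch.randn(total_tokens, n_heads, head_dim, dtype=torch.float16, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)

cu_seqlens, max_seqlen = build_cu_seqlens(seq_lens)
# cu_seqlens == tensor([0, 5, 8, 15]) marks the sequence boundaries, so the varlen
# kernel only attends within each document (equivalent to a block-diagonal mask).
out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=max_seqlen, max_seqlen_k=max_seqlen,
    causal=True,
)
```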

This gives us throughput gains with sufficiently large models / batch sizes; e.g., a very small model like unsloth/qwen2.5-0.5b needs per_device_train_batch_size = 16 or higher to see significant throughput gains, while unsloth/llama-3-8b needs only per_device_train_batch_size = 4 or higher. With unsloth/llama-3-8b at per_device_train_batch_size = 8, training is much faster: 52s padding-free vs. 96s without, roughly a 1.85x speedup.

[image: throughput comparison screenshot]

There are very slight loss and gradient-norm differences between the padding-free and non-padding-free settings; I think these can be chalked up to the different kernels being used (FA2 varlen vs. dense, and block-diagonal vs. causal kernels for xformers / SDPA).
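To illustrate where such differences can come from, here is a minimal pure-PyTorch sketch (again, not this PR's implementation) of the block-diagonal causal mask that the padding-free SDPA path corresponds to; the dense path instead applies a plain causal mask over a padded batch, and the two mask patterns dispatch to different kernels, so tiny numerical deltas are expected. The helper and shapes below are illustrative.

```python
# Minimal sketch, for illustration only: a block-diagonal causal mask for SDPA,
# built from the same per-sequence lengths as the cu_seqlens metadata above.
import torch
import torch.nn.functional as F

def block_diagonal_causal_mask(seq_lens: list[int], device="cpu") -> torch.Tensor:
    """Boolean mask of shape (total, total): True where attention is allowed."""
    total = sum(seq_lens)
    mask = torch.zeros(total, total, dtype=torch.bool, device=device)
    offset = 0
    for n in seq_lens:
        # Each packed sequence only attends to earlier tokens within itself.
        mask[offset:offset + n, offset:offset + n] = torch.tril(
            torch.ones(n, n, dtype=torch.bool, device=device)
        )
        offset += n
    return mask

seq_lens = [5, 3, 7]
total, n_heads, head_dim = sum(seq_lens), 8, 64
q = torch.randn(1, n_heads, total, head_dim)
k, v = torch.randn_like(q), torch.randn_like(q)

mask = block_diagonal_causal_mask(seq_lens)  # (total, total), broadcast over batch/heads
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```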

@djsaunde force-pushed the padding-free-seqlen-metadata-v2 branch from d5b342f to a69f35b on December 9, 2025 22:03
@djsaunde (Collaborator, Author) commented:

Closing in favor of #3702.

@djsaunde closed this Dec 10, 2025
@djsaunde reopened this Dec 10, 2025
@danielhanchen merged commit 35606da into unslothai:main Dec 10, 2025
1 check passed