tianyu-l approved these changes Mar 12, 2026
pianpwk approved these changes Mar 12, 2026
Summary
Until we can reorder the wrapping to `MXFP8WeightWrapperTensor(DTensor(...))` ([TP] reorder MXFP8 wrapper over DTensor ao#4010) and land the DTensor scaled_mm sharding strategy fixes ([DTensor] fix scaled_mm sharding strategy pytorch#177234), TP-sharded weights keep the current ordering `DTensor(MXFP8WeightWrapperTensor(..))`, so we need to warn that linears using TP will use default precision, not MXFP8.
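As a rough illustration of that warning (the helper name, the `mxfp8_wrapper_cls` argument, and the detection logic below are assumptions for this sketch, not the PR's actual code), the wrapping order could be checked on the sharded weight roughly like this:

```python
# Hypothetical sketch: detect the DTensor(MXFP8WeightWrapperTensor(...)) ordering
# and warn that the layer will fall back to default precision.
# Assumes the public DTensor import path (PyTorch >= 2.5); helper name and
# `mxfp8_wrapper_cls` argument are made up for illustration.
import logging

import torch
from torch.distributed.tensor import DTensor

logger = logging.getLogger(__name__)


def warn_if_mxfp8_inside_dtensor(weight: torch.Tensor, mxfp8_wrapper_cls: type) -> None:
    """Warn when a TP-sharded weight is DTensor(wrapper) instead of wrapper(DTensor)."""
    if isinstance(weight, DTensor) and isinstance(weight.to_local(), mxfp8_wrapper_cls):
        logger.warning(
            "TP-sharded linear weight is a DTensor wrapping the MXFP8 subclass; "
            "this linear will run in default precision, not MXFP8."
        )
```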
Optional extra details

With the `DTensor(MXFP8WeightWrapperTensor(..))` ordering, the linear op is decomposed into `aten.t` + `aten.mm` and goes straight through `__torch_dispatch__` instead of first going through `__torch_function__`. This prevents our subclass from intercepting the linear op to dispatch to the `_to_mxfp8_then_scaled_mm` autograd func. We cannot intercept at the `__torch_dispatch__` level because then autograd would not capture the backward for the autograd func we dispatch to. By contrast, we do see `aten._grouped_mm` in `__torch_function__` and can intercept it.
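To make the dispatch-level difference concrete, here is a minimal, self-contained sketch; it uses logging modes on plain tensors purely for illustration (the real MXFP8 path intercepts via a tensor subclass, not a mode):

```python
# Minimal sketch: log what each interception level sees for a linear call.
import torch
import torch.nn.functional as F
from torch.overrides import TorchFunctionMode
from torch.utils._python_dispatch import TorchDispatchMode


class FunctionLogger(TorchFunctionMode):
    def __torch_function__(self, func, types, args=(), kwargs=None):
        print("__torch_function__ sees:", func)  # the linear op, still intact
        return func(*args, **(kwargs or {}))


class DispatchLogger(TorchDispatchMode):
    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        print("__torch_dispatch__ sees:", func)  # decomposed aten ops, e.g. aten.t / aten.mm
        return func(*args, **(kwargs or {}))


x, w = torch.randn(2, 4), torch.randn(3, 4)
with FunctionLogger(), DispatchLogger():
    F.linear(x, w)
```

For a plain 2D `F.linear` call, the function-level hook sees the linear op itself, while the dispatch-level hook only sees the already-decomposed aten ops, which is why the MXFP8 interception has to happen at the `__torch_function__` level.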