I am currently experimenting with Qwen3Moe.
Running on 4 GPUs with DP=1, EP=1, SP=False,
and the CrossEntropyLoss loss (not the most recent Chunked one),
the losses with TP=4 or with CP=4 (with or without loss compile) are reasonably consistent with each other.
However, enabling TP=2 and CP=2 together produces a loss that is about half of those.
I was wondering whether there might be an issue with how the loss is aggregated across ranks when both parallelism dimensions are active.
Attached image: the higher loss curves are TP=4 and CP=4; the lower one is TP=2 with CP=2.
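
To make the suspicion concrete, here is a minimal pure-Python sketch of one way a factor of two could appear when a loss is reduced over a context-parallel group of size 2. It is only an illustration under my own assumptions, not the library's actual code: the helper names (`shard`, `sum_then_normalize`, `normalize_then_average`) are hypothetical, and the lists stand in for per-rank tensors and all-reduce calls.

```python
def shard(values, parts):
    """Split a list into `parts` contiguous, equally sized chunks."""
    size = len(values) // parts
    return [values[i * size:(i + 1) * size] for i in range(parts)]


def reference_loss(token_losses):
    """Mean per-token loss over the full sequence (no parallelism)."""
    return sum(token_losses) / len(token_losses)


def sum_then_normalize(token_losses, cp):
    """Each CP rank sums its local token losses; the partial sums are added
    (stand-in for an all-reduce SUM) and divided once by the global count."""
    local_sums = [sum(chunk) for chunk in shard(token_losses, cp)]
    return sum(local_sums) / len(token_losses)


def normalize_then_average(token_losses, cp):
    """Hypothetical faulty variant: each rank already divides by the global
    token count, and the results are then averaged (stand-in for an
    all-reduce MEAN) instead of summed, dividing by `cp` one extra time."""
    n = len(token_losses)
    local = [sum(chunk) / n for chunk in shard(token_losses, cp)]
    return sum(local) / cp


# Toy per-token losses, assumed values for illustration only.
token_losses = [2.0, 4.0, 6.0, 8.0]
ref = reference_loss(token_losses)
good = sum_then_normalize(token_losses, cp=2)
bad = normalize_then_average(token_losses, cp=2)
print(ref, good, bad)
```

With a group size of 2, the faulty variant returns exactly half of the reference value. On its own this variant would also shrink the loss for CP=4, which is not what I observe, so if the cause is an aggregation issue it would have to be specific to the combined TP and CP setup.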
