Qwen3Moe: TP+CP loss discrepancy with TP and CP only.

I am currently experimenting with Qwen3Moe.
When running on 4 GPUs with DP=1, EP=1, SP=False,
and CrossEntropyLoss as the loss (not the more recent chunked one),
the losses for TP=4 and for CP=4 (with or without loss compile) are reasonably consistent with each other.
However, enabling TP=2 and CP=2 together produces a loss that is about half of those values.

I was wondering if there might be an issue with how the loss is aggregated across ranks when the two parallelism dimensions are combined.
In the attached image, the higher loss curves are TP=4 and CP=4, and the lower one is TP=2 with CP=2.

<img width="1129" height="379" alt="Image" src="https://github.com/user-attachments/assets/3b1e4d28-615c-456b-86c4-56851bcecf3a" />
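
To make the suspicion concrete, here is a minimal sketch in plain Python of one way a factor of two could appear. It is only an illustration under assumptions, not a diagnosis of the actual code path: the toy per-token losses, the helper names `global_mean` and `mean_of_local_means`, and the idea that the loss is summed over the CP ranks but divided by the full TP x CP world size are all hypothetical.

```python
# Hypothetical toy example: per-token losses for one sequence of 8 tokens.
token_losses = [2.0, 4.0, 6.0, 8.0, 1.0, 3.0, 5.0, 7.0]

tp, cp = 2, 2  # the combined setting from the report (4 GPUs in total)


def global_mean(shards):
    """Reference aggregation: sum of all local sums over the total token count."""
    total = sum(sum(s) for s in shards)
    count = sum(len(s) for s in shards)
    return total / count


def mean_of_local_means(shards, divisor):
    """Sum the per-shard mean losses, then divide by `divisor`.

    With equally sized shards this matches the global mean only when
    `divisor` equals the number of shards that hold distinct tokens.
    """
    return sum(sum(s) / len(s) for s in shards) / divisor


# CP splits the sequence dimension, so each CP rank sees a different half.
shard_len = len(token_losses) // cp
cp_shards = [token_losses[i * shard_len:(i + 1) * shard_len] for i in range(cp)]

reference = sum(token_losses) / len(token_losses)
correct = global_mean(cp_shards)

# Dividing by the CP group size reproduces the reference value.
right_divisor = mean_of_local_means(cp_shards, cp)

# Assumed failure mode: TP ranks hold the same tokens, yet the divisor
# is the full world size (tp * cp) instead of the CP group size.
wrong_divisor = mean_of_local_means(cp_shards, tp * cp)

print(reference, correct, right_divisor, wrong_divisor)
```

In this toy setup the wrong divisor yields exactly half of the reference loss, which is the same ratio seen in the plot. If something similar happens in the real run, comparing the unreduced per-rank loss sums and token counts between the TP=4, CP=4, and TP=2 with CP=2 runs should show where the extra factor comes from.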
