Qwen3Moe: loss discrepancy between TP+CP and TP-only or CP-only runs. #3280

@pretidav

I am currently experimenting with Qwen3Moe.
Running on 4 GPUs with DP=1, EP=1, SP=False,
and CrossEntropyLoss as the loss (not the more recent chunked one),
TP=4 alone and CP=4 alone (with or without loss compile) give reasonably consistent losses.
However, TP=2 and CP=2 enabled together produce a loss that is about half of that value.

I was wondering whether there might be an issue with how the loss is aggregated across ranks when both parallelism dimensions are active.
See the attached image: the higher loss curves are TP=4 and CP=4, the lower one is TP=2 and CP=2.

Image
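To make the suspicion concrete, here is a minimal, purely hypothetical sketch of one aggregation mistake that would match all three observations. It is not taken from the codebase and is not a diagnosis: the function names and the assumed bug (dividing the CP-summed loss by the full world size instead of the CP group size) are my own illustration, run in plain Python without any distributed calls.

```python
# Hypothetical illustration only; none of these names come from the actual codebase.

def mean(values):
    return sum(values) / len(values)

def local_means(per_token_loss, cp):
    """Split the sequence into cp equal contiguous shards (one per CP rank)
    and return the mean loss each rank would compute locally."""
    shard = len(per_token_loss) // cp
    return [mean(per_token_loss[i * shard:(i + 1) * shard]) for i in range(cp)]

def reference_loss(per_token_loss, tp, cp):
    """Sum the local means over the CP group and divide by the CP group size.
    TP ranks hold the same tokens, so tp does not enter the normalization."""
    return sum(local_means(per_token_loss, cp)) / cp

def suspect_loss(per_token_loss, tp, cp):
    """ASSUMED bug: when CP is active, divide by the world size (tp * cp)
    instead of the CP group size. With CP disabled, no reduction happens."""
    if cp == 1:
        return mean(per_token_loss)
    return sum(local_means(per_token_loss, cp)) / (tp * cp)

# Toy per-token losses; the global mean is 4.5.
losses = [2.0, 4.0, 6.0, 8.0, 1.0, 3.0, 5.0, 7.0]

for tp, cp in [(4, 1), (1, 4), (2, 2)]:
    ref = reference_loss(losses, tp, cp)
    sus = suspect_loss(losses, tp, cp)
    print(f"TP={tp} CP={cp}: reference={ref} suspect={sus}")
```

Under this assumption the TP=4 and CP=4 runs agree with the reference, because the world size equals the CP group size or CP is disabled, while TP=2 with CP=2 comes out at exactly half. Other causes (for example token-count normalization over the wrong process group) could produce the same ratio, so this only shows that a factor of two is consistent with a mis-scoped reduction.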

Status

In Progress