
[Module] Modularize MoE components #2571

Open
acisseJZhong wants to merge 6 commits into gh/acisseJZhong/3/base from gh/acisseJZhong/3/head

Conversation

@acisseJZhong acisseJZhong commented Mar 13, 2026

Stack from ghstack (oldest at bottom):

Motivation

Continues the modularization effort from #2428 (Embedding), #2434 (RMSNorm), and
#2527 (Linear). This PR makes MoE components (GroupedExperts,
TokenChoiceTopKRouter, TokenReorderer) inherit from Module and use the
Config/build() pattern, enabling per-component configurability and consistent
weight initialization.

Summary

  • GroupedExperts inherits from Module with a nested Config dataclass. dim,
    hidden_dim, and num_experts use field(init=False) so they are set at build() time
    by the parent MoE.
  • TokenChoiceTopKRouter inherits from Module with a nested Config dataclass.
    Replaces the flat gate_bias: bool field with gate: Linear.Config, enabling full
    gate configuration (e.g., Linear.Config(bias=True)). dim and num_experts use
    field(init=False).
  • TokenReorderer inherits from Module with a no-op init_weights (stateless
    module). num_experts and top_k use field(init=False).
  • MoE.Config replaces flat router/expert fields (top_k, score_func, route_norm,
    route_scale, gate_bias, use_grouped_mm, num_expert_groups, num_limited_groups,
    _debug_force_load_balance) with nested experts: GroupedExperts.Config and router:
    TokenChoiceTopKRouter.Config.
  • Updates all model configs (DeepSeek V3, Qwen3, GPT-OSS, Llama4) to use nested
    router/expert configs.
  • Adds 34 unit tests covering Config/build(), init_weights, forward passes, and
    nested config propagation for all modularized components.

cc @fegin: #2540 may need to rebase on this.

[ghstack-poisoned]
acisseJZhong added a commit that referenced this pull request Mar 13, 2026
ghstack-source-id: 471089b
Pull Request resolved: #2571
@meta-cla bot added the CLA Signed label Mar 13, 2026
@acisseJZhong acisseJZhong marked this pull request as draft March 13, 2026 22:30
@acisseJZhong acisseJZhong changed the title first attempt [Module] Modularize MoE components Mar 13, 2026
acisseJZhong added a commit that referenced this pull request Mar 14, 2026
ghstack-source-id: 471089b
Pull Request resolved: #2571
acisseJZhong added a commit that referenced this pull request Mar 14, 2026
ghstack-source-id: ff35069
Pull Request resolved: #2571
acisseJZhong added a commit that referenced this pull request Mar 14, 2026
ghstack-source-id: d58ba64
Pull Request resolved: #2571
acisseJZhong added a commit that referenced this pull request Mar 14, 2026
ghstack-source-id: 8a34c7a
Pull Request resolved: #2571
acisseJZhong added a commit that referenced this pull request Mar 14, 2026
ghstack-source-id: 6c1296e
Pull Request resolved: #2571
acisseJZhong added a commit that referenced this pull request Mar 14, 2026
ghstack-source-id: 00a0a5d
Pull Request resolved: #2571
@acisseJZhong acisseJZhong marked this pull request as ready for review March 14, 2026 19:38
score_func: Literal["softmax", "sigmoid"] = "sigmoid"
route_norm: bool = False
route_scale: float = 1.0
gate: Linear.Config = field(default_factory=Linear.Config)
Maybe better if it doesn't come with a default, so that model config constructors have to specify it.

_debug_force_load_balance: bool = False,
):
@dataclass(kw_only=True, slots=True)
class Config(Module.Config):
makes sense. I wonder if we should move the following into the router as well:

  • the tokens_per_expert and expert_bias buffers into the router module
  • load_balance_coeff into the router config

Labels

ciflow/8gpu, CLA Signed