Fix StepMutator.mutate() not seeing late-attached platform decorators by chirag-bajaj · Pull Request #3026 · Netflix/metaflow

chirag-bajaj · 2026-03-13T10:16:38Z

PR Type

Summary

_process_late_attached_decorator now re-runs StepMutator.mutate() after late-attaching platform decorators, so mutators can see and modify decorators like @kubernetes that are attached after _init_step_decorators has already executed.

Issue

Fixes #3025

Reproduction

Runtime: Argo Workflows

test_flow.py

from metaflow import FlowSpec, StepMutator, step
from metaflow.user_decorators.mutable_step import MutableStep

class kubernetes_override(StepMutator):
    """Override @kubernetes attributes via StepMutator."""

    def init(self, *args, **kwargs):
        self.deco_kwargs = kwargs

    def mutate(self, mutable_step):
        for deco_name, _, _, attributes in mutable_step.decorator_specs:
            if deco_name == "kubernetes":
                attributes.update(self.deco_kwargs)
                mutable_step.add_decorator(
                    deco_type=deco_name,
                    deco_kwargs=attributes,
                    duplicates=mutable_step.OVERRIDE,
                )

class MyFlow(FlowSpec):
    @kubernetes_override(cpu=2, memory=8192)
    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    MyFlow()

Commands to run:

python test_flow.py argo-workflows create --only-json | grep -A5 '"resources"'

Where evidence shows up: Generated Argo workflow JSON — resource requests on steps

Before (error / log snippet)

# StepMutator sets cpu=2, memory=8192 but output shows defaults:
"resources": {
    "requests": {
        "cpu": "1",
        "memory": "4096M",
    }
}

After (evidence that fix works)

# StepMutator overrides are now correctly applied:
"resources": {
    "requests": {
        "cpu": "2",
        "memory": "8192M",
    }
}

Root Cause

PR #2719 moved _init_step_decorators() from Argo's make_flow() into cli.py start(). This means mutators now run before Argo attaches @kubernetes to steps. The new _process_late_attached_decorator() function calls external_init() and step_init() on the late-attached decorators but never re-runs StepMutator.mutate(), so mutators that iterate decorator_specs looking for "kubernetes" find nothing.

2.19.13 order (working):

_attach_decorators adds @kubernetes
_init_step_decorators runs mutate() — mutators see @kubernetes ✅

2.19.14 order (broken):

_init_step_decorators runs mutate() — no @kubernetes yet ❌
_attach_decorators adds @kubernetes
_process_late_attached_decorator — does not re-run mutate()

Why This Fix Is Correct

The fix adds a mutator re-run loop inside _process_late_attached_decorator, between external_init() and the step_init() loop. This restores the invariant that mutate() always sees the full set of decorators on a step. After mutators run, cls._init_graph() is called to rebuild the graph so node.decorators reflects any changes made by add_decorator(OVERRIDE).

Failure Modes Considered

Steps without mutators: Steps that have no StepMutator in config_decorators are unaffected — the inner loop simply doesn't execute. Verified by test_process_late_attached_preserves_defaults_without_mutator.
Mutators without target decorators: If @kubernetes is never late-attached, the mutator's iteration over decorator_specs finds nothing and is a no-op. Verified by test_mutator_is_noop_without_late_attachment.
Graph consistency: cls._init_graph() is called after mutator re-run to ensure node.decorators matches step.decorators. Without this, the Argo compiler reads stale decorator references from graph nodes.
Double-init: external_init() is only called once (in the existing code, before the mutator re-run). The mutator re-run only calls mutate(), not init() again.

Tests

Unit tests added/updated
Reproduction script provided (required for Core Runtime)
CI passes
If tests are impractical: explain why below and provide manual evidence above

Three regression tests added in test/unit/test_late_attached_mutator.py:

test_process_late_attached_reruns_mutators — verifies mutator overrides apply after late attachment (fails without fix, passes with fix)
test_process_late_attached_preserves_defaults_without_mutator — verifies steps without mutators retain defaults
test_mutator_is_noop_without_late_attachment — verifies mutator is a no-op when target decorator is never attached

Non-Goals

No changes to the StepMutator API or MutableStep interface.
No changes to how _init_step_decorators works — only _process_late_attached_decorator is modified.
No changes to existing tests.

AI Tool Usage

No AI tools were used in this contribution
AI tools were used (describe below)

AI tooling was used to verify the solution against the test suite and to help write the PR description. All code changes were reviewed, understood, and tested manually.

In 2.19.14, _init_step_decorators (which runs StepMutator.mutate()) was removed from the Argo CLI's make_flow() and replaced with _process_late_attached_decorator. This changed the execution order: mutators now run before the platform attaches @kubernetes, so StepMutators that iterate decorator_specs looking for "kubernetes" find nothing. This adds a re-run of step mutators inside _process_late_attached_decorator, after late-attached decorators have been added but before step_init is called. The graph is also rebuilt so that node.decorators reflects any changes made by the mutators (e.g. add_decorator with OVERRIDE).

Verify that _process_late_attached_decorator re-runs step mutators so they can see and modify late-attached platform decorators like @kubernetes. Tests cover: - Mutator overrides apply after late attachment - Steps without mutators retain defaults - Mutator is a no-op when target decorator is never attached

greptile-apps · 2026-03-13T10:19:26Z

Greptile Summary

This PR fixes a regression introduced in PR #2719 where StepMutator.mutate() was called before platform decorators like @kubernetes were late-attached to steps, causing mutator overrides (e.g. custom CPU/memory) to silently have no effect on Argo Workflows deployments. The fix adds a targeted mutator re-run loop inside _process_late_attached_decorator, guarded correctly by late_attached_step_names and mutators_ran flags to avoid redundant graph rebuilds and non-idempotent mutator calls.

Core fix (metaflow/decorators.py): After external_init() is called on late-attached decorators, StepMutator.mutate() is re-run only for affected steps; cls._init_graph() is then called (only when mutators actually ran) and external_init() is called on any new decorator instances created by add_decorator(OVERRIDE).
Graph refresh (graph = flow._graph on line 945): Correctly updates the local graph variable so the subsequent step_init calls receive the post-mutation graph — matching the pattern already used in _init_step_decorators.
Missing FlowMutator re-run: The fix addresses StepMutator but not FlowMutator. A FlowMutator that iterates step decorators looking for late-attached platform decorators (e.g. @kubernetes) would still see a stale view, leaving the same ordering bug for flow-level mutators.
Tests (test/unit/test_late_attached_mutator.py): Three regression tests correctly use pytest.importorskip for graceful skip and exercise _process_late_attached_decorator end-to-end, including the no-attachment early-exit path and the defaults-preserved path for steps without a mutator.

Confidence Score: 3/5

The fix correctly resolves the StepMutator ordering regression, but the analogous FlowMutator.mutate() re-run is absent, leaving a symmetric gap for flow-level mutators that inspect late-attached step decorators.
The core logic for StepMutator is sound — the late_attached_step_names guard prevents re-running mutators on unaffected steps, the mutators_ran flag avoids unnecessary graph rebuilds, and graph = flow._graph correctly refreshes the local reference after _init_graph(). Tests are well-structured and use pytest.importorskip appropriately. The score is held at 3 rather than 4 because FlowMutator.mutate() is not re-run after late attachment, which is a symmetric omission to the bug being fixed and could silently affect flow-level mutators that iterate over step-level platform decorators.
Pay close attention to metaflow/decorators.py lines 918–939 — the FlowMutator re-run should be added before the StepMutator loop to fully close the ordering gap.

Important Files Changed

Filename	Overview
metaflow/decorators.py	Adds mutator re-run logic inside `_process_late_attached_decorator` to fix the ordering regression from PR #2719. The `StepMutator` re-run is correct and well-guarded, but the analogous `FlowMutator.mutate()` re-run is missing, leaving `FlowMutator` implementations that inspect step-level platform decorators in the same broken state.
test/unit/test_late_attached_mutator.py	New regression tests that correctly exercise `_process_late_attached_decorator` with `pytest.importorskip` for graceful skip when the Kubernetes plugin is unavailable. Tests cover the happy path (mutator sees late-attached decorator), defaults-preserved path, and no-op path. Coverage is reasonable given the constraints, though `FlowMutator` scenarios are not tested.

Sequence Diagram

sequenceDiagram
    participant CLI as cli.py start()
    participant ISD as _init_step_decorators()
    participant AD as _attach_decorators()
    participant PLAD as _process_late_attached_decorator()
    participant SM as StepMutator.mutate()
    participant IG as cls._init_graph()

    CLI->>ISD: run flow/step mutators
    note over ISD: @kubernetes NOT yet attached
    ISD->>SM: mutate() — sees no @kubernetes ❌ (pre-fix)
    ISD->>IG: rebuild graph

    CLI->>AD: late-attach @kubernetes to steps

    CLI->>PLAD: process late-attached decorators
    PLAD->>PLAD: external_init() on @kubernetes<br/>collect late_attached_step_names

    note over PLAD: NEW (this PR)
    PLAD->>SM: re-run mutate() for affected steps ✅
    PLAD->>IG: rebuild graph (only if mutators_ran)
    PLAD->>PLAD: external_init() on any new<br/>decorator instances (_ran_init=False)

    PLAD->>PLAD: step_init() on all @kubernetes<br/>using refreshed graph

Comments Outside Diff (1)

metaflow/decorators.py, line 918-939 (link)

FlowMutator.mutate() not re-run after late attachment

The fix re-runs StepMutator.mutate() for affected steps but omits the analogous re-run of FlowMutator.mutate(). In _init_step_decorators, FlowMutator.mutate() is called first (via MutableFlow) and receives the full step decorator state via mutable_flow.steps. Any FlowMutator that iterates over step decorators — for example one that looks for @kubernetes to set global resource constraints — will still see a stale view where @kubernetes is absent, exactly the same bug as the one this PR is fixing for StepMutator.

For consistency with the ordering in _init_step_decorators (flow mutators first, then step mutators), a re-run of FlowMutator.mutate() should be inserted before the StepMutator loop:
```
if late_attached_step_names:
    # Re-run FlowMutators first (same order as _init_step_decorators)
    for deco in cls._flow_state[FlowStateItems.FLOW_MUTATORS]:
        if isinstance(deco, FlowMutator):
            inserted_by_value = [deco.decorator_name] + (deco.inserted_by or [])
            mutable_flow = MutableFlow(
                cls,
                pre_mutate=False,
                statically_defined=deco.statically_defined,
                inserted_by=inserted_by_value,
            )
            deco.mutate(mutable_flow)

    # Re-run StepMutators ...
    for step in cls._steps:
        ...
```
The same idempotency caveat that applies to StepMutator applies here; users should ensure FlowMutator.mutate() is idempotent if it could be called more than once in a single deployment path.

_{Last reviewed commit: ac1b6c9}

metaflow/decorators.py

test/unit/test_late_attached_mutator.py

- Refresh local graph variable after _init_graph() so step_init receives the rebuilt graph instead of the stale one passed as a parameter (mirrors _init_step_decorators line 874). - Restrict mutator re-run to only steps that received a late-attached decorator, avoiding issues with non-idempotent mutators. - Update test_process_late_attached_preserves_defaults_without_mutator to call _process_late_attached_decorator so the new code path is actually exercised.

metaflow/decorators.py

test/unit/test_late_attached_mutator.py

- Only call _init_graph() and refresh graph when late-attached decorators were actually found, preserving the caller-supplied graph parameter when no mutations occur. - Use pytest.importorskip for KubernetesDecorator import so tests skip gracefully if the plugin is unavailable.

test/unit/test_late_attached_mutator.py

- Remove dead __iter__ and _steps_list from _MockFlow — after __class__ reassignment, FlowSpec.__iter__ is used via type(obj). - Update test_mutator_is_noop_without_late_attachment to call _process_late_attached_decorator, exercising the empty late_attached_step_names guard path.

test/unit/test_late_attached_mutator.py

metaflow/decorators.py

- Only rebuild graph when a mutator actually executed, avoiding unnecessary _init_graph() when late-attached decorators exist but no step has a StepMutator. - Add assertion that start step's @kubernetes was overridden in test_process_late_attached_preserves_defaults_without_mutator to confirm the mutator loop ran successfully.

test/unit/test_late_attached_mutator.py

Add assertion on end_step in test_process_late_attached_reruns_mutators to confirm the step without a StepMutator keeps default cpu=1, memory=4096 after the mutator re-run loop.

When add_decorator(OVERRIDE) creates a new decorator instance, it skips external_init(). Ensure replacement decorators are initialized before step_init() is called on them.

chirag-bajaj added 2 commits March 13, 2026 11:11