Skip to content

Fix StepMutator.mutate() not seeing late-attached platform decorators#3026

Open
chirag-bajaj wants to merge 8 commits intoNetflix:masterfrom
chirag-bajaj:fix/step-mutator-platform-decorator-ordering-v2
Open

Fix StepMutator.mutate() not seeing late-attached platform decorators#3026
chirag-bajaj wants to merge 8 commits intoNetflix:masterfrom
chirag-bajaj:fix/step-mutator-platform-decorator-ordering-v2

Conversation

@chirag-bajaj
Copy link

@chirag-bajaj chirag-bajaj commented Mar 13, 2026

PR Type

  • Bug fix
  • New feature
  • Core Runtime change (higher bar -- see CONTRIBUTING.md)
  • Docs / tooling
  • Refactoring

Summary

_process_late_attached_decorator now re-runs StepMutator.mutate() after late-attaching platform decorators, so mutators can see and modify decorators like @kubernetes that are attached after _init_step_decorators has already executed.

Issue

Fixes #3025

Reproduction

Runtime: Argo Workflows

test_flow.py
from metaflow import FlowSpec, StepMutator, step
from metaflow.user_decorators.mutable_step import MutableStep

class kubernetes_override(StepMutator):
    """Override @kubernetes attributes via StepMutator."""

    def init(self, *args, **kwargs):
        self.deco_kwargs = kwargs

    def mutate(self, mutable_step):
        for deco_name, _, _, attributes in mutable_step.decorator_specs:
            if deco_name == "kubernetes":
                attributes.update(self.deco_kwargs)
                mutable_step.add_decorator(
                    deco_type=deco_name,
                    deco_kwargs=attributes,
                    duplicates=mutable_step.OVERRIDE,
                )

class MyFlow(FlowSpec):
    @kubernetes_override(cpu=2, memory=8192)
    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    MyFlow()

Commands to run:

python test_flow.py argo-workflows create --only-json | grep -A5 '"resources"'

Where evidence shows up: Generated Argo workflow JSON — resource requests on steps

Before (error / log snippet)
# StepMutator sets cpu=2, memory=8192 but output shows defaults:
"resources": {
    "requests": {
        "cpu": "1",
        "memory": "4096M",
    }
}
After (evidence that fix works)
# StepMutator overrides are now correctly applied:
"resources": {
    "requests": {
        "cpu": "2",
        "memory": "8192M",
    }
}

Root Cause

PR #2719 moved _init_step_decorators() from Argo's make_flow() into cli.py start(). This means mutators now run before Argo attaches @kubernetes to steps. The new _process_late_attached_decorator() function calls external_init() and step_init() on the late-attached decorators but never re-runs StepMutator.mutate(), so mutators that iterate decorator_specs looking for "kubernetes" find nothing.

2.19.13 order (working):

  1. _attach_decorators adds @kubernetes
  2. _init_step_decorators runs mutate() — mutators see @kubernetes

2.19.14 order (broken):

  1. _init_step_decorators runs mutate() — no @kubernetes yet ❌
  2. _attach_decorators adds @kubernetes
  3. _process_late_attached_decorator — does not re-run mutate()

Why This Fix Is Correct

The fix adds a mutator re-run loop inside _process_late_attached_decorator, between external_init() and the step_init() loop. This restores the invariant that mutate() always sees the full set of decorators on a step. After mutators run, cls._init_graph() is called to rebuild the graph so node.decorators reflects any changes made by add_decorator(OVERRIDE).

Failure Modes Considered

  1. Steps without mutators: Steps that have no StepMutator in config_decorators are unaffected — the inner loop simply doesn't execute. Verified by test_process_late_attached_preserves_defaults_without_mutator.
  2. Mutators without target decorators: If @kubernetes is never late-attached, the mutator's iteration over decorator_specs finds nothing and is a no-op. Verified by test_mutator_is_noop_without_late_attachment.
  3. Graph consistency: cls._init_graph() is called after mutator re-run to ensure node.decorators matches step.decorators. Without this, the Argo compiler reads stale decorator references from graph nodes.
  4. Double-init: external_init() is only called once (in the existing code, before the mutator re-run). The mutator re-run only calls mutate(), not init() again.

Tests

  • Unit tests added/updated
  • Reproduction script provided (required for Core Runtime)
  • CI passes
  • If tests are impractical: explain why below and provide manual evidence above

Three regression tests added in test/unit/test_late_attached_mutator.py:

  • test_process_late_attached_reruns_mutators — verifies mutator overrides apply after late attachment (fails without fix, passes with fix)
  • test_process_late_attached_preserves_defaults_without_mutator — verifies steps without mutators retain defaults
  • test_mutator_is_noop_without_late_attachment — verifies mutator is a no-op when target decorator is never attached

Non-Goals

  • No changes to the StepMutator API or MutableStep interface.
  • No changes to how _init_step_decorators works — only _process_late_attached_decorator is modified.
  • No changes to existing tests.

AI Tool Usage

  • No AI tools were used in this contribution
  • AI tools were used (describe below)

AI tooling was used to verify the solution against the test suite and to help write the PR description. All code changes were reviewed, understood, and tested manually.

In 2.19.14, _init_step_decorators (which runs StepMutator.mutate())
was removed from the Argo CLI's make_flow() and replaced with
_process_late_attached_decorator. This changed the execution order:
mutators now run before the platform attaches @kubernetes, so
StepMutators that iterate decorator_specs looking for "kubernetes"
find nothing.

This adds a re-run of step mutators inside
_process_late_attached_decorator, after late-attached decorators
have been added but before step_init is called. The graph is also
rebuilt so that node.decorators reflects any changes made by the
mutators (e.g. add_decorator with OVERRIDE).
Verify that _process_late_attached_decorator re-runs step mutators
so they can see and modify late-attached platform decorators like
@kubernetes. Tests cover:

- Mutator overrides apply after late attachment
- Steps without mutators retain defaults
- Mutator is a no-op when target decorator is never attached
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Mar 13, 2026

Greptile Summary

This PR fixes a regression introduced in PR #2719 where StepMutator.mutate() was called before platform decorators like @kubernetes were late-attached to steps, causing mutator overrides (e.g. custom CPU/memory) to silently have no effect on Argo Workflows deployments. The fix adds a targeted mutator re-run loop inside _process_late_attached_decorator, guarded correctly by late_attached_step_names and mutators_ran flags to avoid redundant graph rebuilds and non-idempotent mutator calls.

  • Core fix (metaflow/decorators.py): After external_init() is called on late-attached decorators, StepMutator.mutate() is re-run only for affected steps; cls._init_graph() is then called (only when mutators actually ran) and external_init() is called on any new decorator instances created by add_decorator(OVERRIDE).
  • Graph refresh (graph = flow._graph on line 945): Correctly updates the local graph variable so the subsequent step_init calls receive the post-mutation graph — matching the pattern already used in _init_step_decorators.
  • Missing FlowMutator re-run: The fix addresses StepMutator but not FlowMutator. A FlowMutator that iterates step decorators looking for late-attached platform decorators (e.g. @kubernetes) would still see a stale view, leaving the same ordering bug for flow-level mutators.
  • Tests (test/unit/test_late_attached_mutator.py): Three regression tests correctly use pytest.importorskip for graceful skip and exercise _process_late_attached_decorator end-to-end, including the no-attachment early-exit path and the defaults-preserved path for steps without a mutator.

Confidence Score: 3/5

  • The fix correctly resolves the StepMutator ordering regression, but the analogous FlowMutator.mutate() re-run is absent, leaving a symmetric gap for flow-level mutators that inspect late-attached step decorators.
  • The core logic for StepMutator is sound — the late_attached_step_names guard prevents re-running mutators on unaffected steps, the mutators_ran flag avoids unnecessary graph rebuilds, and graph = flow._graph correctly refreshes the local reference after _init_graph(). Tests are well-structured and use pytest.importorskip appropriately. The score is held at 3 rather than 4 because FlowMutator.mutate() is not re-run after late attachment, which is a symmetric omission to the bug being fixed and could silently affect flow-level mutators that iterate over step-level platform decorators.
  • Pay close attention to metaflow/decorators.py lines 918–939 — the FlowMutator re-run should be added before the StepMutator loop to fully close the ordering gap.

Important Files Changed

Filename Overview
metaflow/decorators.py Adds mutator re-run logic inside _process_late_attached_decorator to fix the ordering regression from PR #2719. The StepMutator re-run is correct and well-guarded, but the analogous FlowMutator.mutate() re-run is missing, leaving FlowMutator implementations that inspect step-level platform decorators in the same broken state.
test/unit/test_late_attached_mutator.py New regression tests that correctly exercise _process_late_attached_decorator with pytest.importorskip for graceful skip when the Kubernetes plugin is unavailable. Tests cover the happy path (mutator sees late-attached decorator), defaults-preserved path, and no-op path. Coverage is reasonable given the constraints, though FlowMutator scenarios are not tested.

Sequence Diagram

sequenceDiagram
    participant CLI as cli.py start()
    participant ISD as _init_step_decorators()
    participant AD as _attach_decorators()
    participant PLAD as _process_late_attached_decorator()
    participant SM as StepMutator.mutate()
    participant IG as cls._init_graph()

    CLI->>ISD: run flow/step mutators
    note over ISD: @kubernetes NOT yet attached
    ISD->>SM: mutate() — sees no @kubernetes ❌ (pre-fix)
    ISD->>IG: rebuild graph

    CLI->>AD: late-attach @kubernetes to steps

    CLI->>PLAD: process late-attached decorators
    PLAD->>PLAD: external_init() on @kubernetes<br/>collect late_attached_step_names

    note over PLAD: NEW (this PR)
    PLAD->>SM: re-run mutate() for affected steps ✅
    PLAD->>IG: rebuild graph (only if mutators_ran)
    PLAD->>PLAD: external_init() on any new<br/>decorator instances (_ran_init=False)

    PLAD->>PLAD: step_init() on all @kubernetes<br/>using refreshed graph
Loading

Comments Outside Diff (1)

  1. metaflow/decorators.py, line 918-939 (link)

    FlowMutator.mutate() not re-run after late attachment

    The fix re-runs StepMutator.mutate() for affected steps but omits the analogous re-run of FlowMutator.mutate(). In _init_step_decorators, FlowMutator.mutate() is called first (via MutableFlow) and receives the full step decorator state via mutable_flow.steps. Any FlowMutator that iterates over step decorators — for example one that looks for @kubernetes to set global resource constraints — will still see a stale view where @kubernetes is absent, exactly the same bug as the one this PR is fixing for StepMutator.

    For consistency with the ordering in _init_step_decorators (flow mutators first, then step mutators), a re-run of FlowMutator.mutate() should be inserted before the StepMutator loop:

    if late_attached_step_names:
        # Re-run FlowMutators first (same order as _init_step_decorators)
        for deco in cls._flow_state[FlowStateItems.FLOW_MUTATORS]:
            if isinstance(deco, FlowMutator):
                inserted_by_value = [deco.decorator_name] + (deco.inserted_by or [])
                mutable_flow = MutableFlow(
                    cls,
                    pre_mutate=False,
                    statically_defined=deco.statically_defined,
                    inserted_by=inserted_by_value,
                )
                deco.mutate(mutable_flow)
    
        # Re-run StepMutators ...
        for step in cls._steps:
            ...

    The same idempotency caveat that applies to StepMutator applies here; users should ensure FlowMutator.mutate() is idempotent if it could be called more than once in a single deployment path.

Last reviewed commit: ac1b6c9

- Refresh local graph variable after _init_graph() so step_init
  receives the rebuilt graph instead of the stale one passed as
  a parameter (mirrors _init_step_decorators line 874).
- Restrict mutator re-run to only steps that received a late-attached
  decorator, avoiding issues with non-idempotent mutators.
- Update test_process_late_attached_preserves_defaults_without_mutator
  to call _process_late_attached_decorator so the new code path is
  actually exercised.
- Only call _init_graph() and refresh graph when late-attached
  decorators were actually found, preserving the caller-supplied
  graph parameter when no mutations occur.
- Use pytest.importorskip for KubernetesDecorator import so tests
  skip gracefully if the plugin is unavailable.
- Remove dead __iter__ and _steps_list from _MockFlow — after
  __class__ reassignment, FlowSpec.__iter__ is used via type(obj).
- Update test_mutator_is_noop_without_late_attachment to call
  _process_late_attached_decorator, exercising the empty
  late_attached_step_names guard path.
- Only rebuild graph when a mutator actually executed, avoiding
  unnecessary _init_graph() when late-attached decorators exist
  but no step has a StepMutator.
- Add assertion that start step's @kubernetes was overridden in
  test_process_late_attached_preserves_defaults_without_mutator
  to confirm the mutator loop ran successfully.
Add assertion on end_step in test_process_late_attached_reruns_mutators
to confirm the step without a StepMutator keeps default cpu=1,
memory=4096 after the mutator re-run loop.
When add_decorator(OVERRIDE) creates a new decorator instance, it
skips external_init(). Ensure replacement decorators are initialized
before step_init() is called on them.
@chirag-bajaj chirag-bajaj force-pushed the fix/step-mutator-platform-decorator-ordering-v2 branch from 67854ba to ac1b6c9 Compare March 13, 2026 13:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: StepMutator.mutate() never sees late-attached platform decorators (regression in 2.19.14)

1 participant