Skip to content

Fix teardown scope: lazy iteration avoids wasted DB writes in scheduler#64558

Merged
kaxil merged 1 commit intoapache:mainfrom
astronomer:fix/teardown-scope-review-fixes
Mar 31, 2026
Merged

Fix teardown scope: lazy iteration avoids wasted DB writes in scheduler#64558
kaxil merged 1 commit intoapache:mainfrom
astronomer:fix/teardown-scope-review-fixes

Conversation

@kaxil
Copy link
Copy Markdown
Member

@kaxil kaxil commented Mar 31, 2026

Addresses review feedback on #64181, fixes the teardown scope evaluation to avoid unnecessary work in the scheduler hot path.

  • Lazy iteration instead of list() materialization: _evaluate_direct_relatives() has side effects (ti.set_state() on line 460). The original list() call forced all those DB writes before yielding anything. With lazy iteration, we yield as we go and only evaluate teardown scope if no direct-relative status was produced.

  • Early return for empty in_scope_ids: Skips the ensure_finished_tis DB query when there are no in-scope tasks to check.

  • Parallel branch regression test: The existing tests used linear chains, but the reported bug involves parallel branches (setup >> [t_fail, t_slow] >> downstream >> teardown). Added a test matching that DAG shape.

  • FAILED/UPSTREAM_FAILED state test: Teardowns must run to clean up resources regardless of upstream failure state. Added a test verifying this works with FAILED and UPSTREAM_FAILED in-scope tasks.

  • Removed fragile string assertion: assert "2 in-scope" in reason couples tests to exact message format.

Benchmark: scheduler impact of list() vs lazy

The scheduler evaluates trigger rules for every TI in every loop. For teardowns, list() forces all set_state() DB writes upfront even when the first upstream already blocks:

Deployment DAGs Tasks/DAG list() writes lazy writes wasted writes
Small 50 10 500 50 450
Medium 200 20 8,000 400 7,600
Large 1,000 50 250,000 5,000 245,000
Enterprise 5,000 100 5,000,000 50,000 4,950,000
Deployment Teardowns list() time lazy time speedup
Small 50 0.05 ms 0.02 ms 2.4x
Medium 400 0.66 ms 0.16 ms 4.2x
Large 5,000 17.54 ms 2.00 ms 8.8x
Enterprise 50,000 334.51 ms 19.82 ms 16.9x

…ests

- Replace `list(_evaluate_direct_relatives())` with lazy iteration to
  avoid eagerly materializing the generator. The original `list()` call
  forced all side effects (including `ti.set_state()` DB writes) to
  execute upfront. In the scheduler hot path, this means N unnecessary
  DB writes per teardown when the first upstream already blocks it.
  At enterprise scale (5000 DAGs, 100 tasks/DAG), benchmarks show
  ~4.95M wasted DB writes and ~17x slower evaluation per scheduling loop.

- Add early return in `_evaluate_teardown_scope()` when `in_scope_ids`
  is empty, avoiding unnecessary iteration over all finished TIs.

- Add test for parallel branch DAG shape matching the actual bug
  topology from apache#29332 (setup >> [t_fail, t_slow] >> downstream >> td).

- Add test verifying teardowns run when in-scope tasks are FAILED or
  UPSTREAM_FAILED (not just SUCCESS), since teardowns must clean up
  resources regardless of upstream failure state.

- Remove fragile string assertion on error message format.
@kaxil kaxil requested review from ashb and vatsrahul1001 March 31, 2026 22:00
@kaxil kaxil added this to the Airflow 3.2.1 milestone Mar 31, 2026
@kaxil kaxil merged commit d15a756 into apache:main Mar 31, 2026
81 checks passed
@kaxil kaxil deleted the fix/teardown-scope-review-fixes branch March 31, 2026 23:04
@kaxil kaxil modified the milestones: Airflow 3.2.1, Airflow 3.2.0 Mar 31, 2026
kaxil added a commit that referenced this pull request Mar 31, 2026
…ests (#64558)

- Replace `list(_evaluate_direct_relatives())` with lazy iteration to
  avoid eagerly materializing the generator. The original `list()` call
  forced all side effects (including `ti.set_state()` DB writes) to
  execute upfront. In the scheduler hot path, this means N unnecessary
  DB writes per teardown when the first upstream already blocks it.
  At enterprise scale (5000 DAGs, 100 tasks/DAG), benchmarks show
  ~4.95M wasted DB writes and ~17x slower evaluation per scheduling loop.

- Add early return in `_evaluate_teardown_scope()` when `in_scope_ids`
  is empty, avoiding unnecessary iteration over all finished TIs.

- Add test for parallel branch DAG shape matching the actual bug
  topology from #29332 (setup >> [t_fail, t_slow] >> downstream >> td).

- Add test verifying teardowns run when in-scope tasks are FAILED or
  UPSTREAM_FAILED (not just SUCCESS), since teardowns must clean up
  resources regardless of upstream failure state.

- Remove fragile string assertion on error message format.

(cherry picked from commit d15a756)
@kaxil
Copy link
Copy Markdown
Member Author

kaxil commented Mar 31, 2026

Cherry-picked to v3-2-test: e9c0c96

Subham-KRLX pushed a commit to Subham-KRLX/airflow that referenced this pull request Apr 3, 2026
…ests (apache#64558)

- Replace `list(_evaluate_direct_relatives())` with lazy iteration to
  avoid eagerly materializing the generator. The original `list()` call
  forced all side effects (including `ti.set_state()` DB writes) to
  execute upfront. In the scheduler hot path, this means N unnecessary
  DB writes per teardown when the first upstream already blocks it.
  At enterprise scale (5000 DAGs, 100 tasks/DAG), benchmarks show
  ~4.95M wasted DB writes and ~17x slower evaluation per scheduling loop.

- Add early return in `_evaluate_teardown_scope()` when `in_scope_ids`
  is empty, avoiding unnecessary iteration over all finished TIs.

- Add test for parallel branch DAG shape matching the actual bug
  topology from apache#29332 (setup >> [t_fail, t_slow] >> downstream >> td).

- Add test verifying teardowns run when in-scope tasks are FAILED or
  UPSTREAM_FAILED (not just SUCCESS), since teardowns must clean up
  resources regardless of upstream failure state.

- Remove fragile string assertion on error message format.
@huashi-st
Copy link
Copy Markdown
Contributor

Thanks @kaxil for this

Suraj-kumar00 pushed a commit to Suraj-kumar00/airflow that referenced this pull request Apr 7, 2026
…ests (apache#64558)

- Replace `list(_evaluate_direct_relatives())` with lazy iteration to
  avoid eagerly materializing the generator. The original `list()` call
  forced all side effects (including `ti.set_state()` DB writes) to
  execute upfront. In the scheduler hot path, this means N unnecessary
  DB writes per teardown when the first upstream already blocks it.
  At enterprise scale (5000 DAGs, 100 tasks/DAG), benchmarks show
  ~4.95M wasted DB writes and ~17x slower evaluation per scheduling loop.

- Add early return in `_evaluate_teardown_scope()` when `in_scope_ids`
  is empty, avoiding unnecessary iteration over all finished TIs.

- Add test for parallel branch DAG shape matching the actual bug
  topology from apache#29332 (setup >> [t_fail, t_slow] >> downstream >> td).

- Add test verifying teardowns run when in-scope tasks are FAILED or
  UPSTREAM_FAILED (not just SUCCESS), since teardowns must clean up
  resources regardless of upstream failure state.

- Remove fragile string assertion on error message format.
vatsrahul1001 pushed a commit that referenced this pull request Apr 8, 2026
…ests (#64558)

- Replace `list(_evaluate_direct_relatives())` with lazy iteration to
  avoid eagerly materializing the generator. The original `list()` call
  forced all side effects (including `ti.set_state()` DB writes) to
  execute upfront. In the scheduler hot path, this means N unnecessary
  DB writes per teardown when the first upstream already blocks it.
  At enterprise scale (5000 DAGs, 100 tasks/DAG), benchmarks show
  ~4.95M wasted DB writes and ~17x slower evaluation per scheduling loop.

- Add early return in `_evaluate_teardown_scope()` when `in_scope_ids`
  is empty, avoiding unnecessary iteration over all finished TIs.

- Add test for parallel branch DAG shape matching the actual bug
  topology from #29332 (setup >> [t_fail, t_slow] >> downstream >> td).

- Add test verifying teardowns run when in-scope tasks are FAILED or
  UPSTREAM_FAILED (not just SUCCESS), since teardowns must clean up
  resources regardless of upstream failure state.

- Remove fragile string assertion on error message format.

(cherry picked from commit d15a756)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants