Fix teardown scope: lazy iteration avoids wasted DB writes in scheduler by kaxil · Pull Request #64558 · apache/airflow

kaxil · 2026-03-31T22:00:37Z

Addresses review feedback on #64181, fixes the teardown scope evaluation to avoid unnecessary work in the scheduler hot path.

Lazy iteration instead of list() materialization: _evaluate_direct_relatives() has side effects (ti.set_state() on line 460). The original list() call forced all those DB writes before yielding anything. With lazy iteration, we yield as we go and only evaluate teardown scope if no direct-relative status was produced.
Early return for empty in_scope_ids: Skips the ensure_finished_tis DB query when there are no in-scope tasks to check.
Parallel branch regression test: The existing tests used linear chains, but the reported bug involves parallel branches (setup >> [t_fail, t_slow] >> downstream >> teardown). Added a test matching that DAG shape.
FAILED/UPSTREAM_FAILED state test: Teardowns must run to clean up resources regardless of upstream failure state. Added a test verifying this works with FAILED and UPSTREAM_FAILED in-scope tasks.
Removed fragile string assertion: assert "2 in-scope" in reason couples tests to exact message format.

Benchmark: scheduler impact of `list()` vs lazy

The scheduler evaluates trigger rules for every TI in every loop. For teardowns, list() forces all set_state() DB writes upfront even when the first upstream already blocks:

Deployment	DAGs	Tasks/DAG	`list()` writes	lazy writes	wasted writes
Small	50	10	500	50	450
Medium	200	20	8,000	400	7,600
Large	1,000	50	250,000	5,000	245,000
Enterprise	5,000	100	5,000,000	50,000	4,950,000

Deployment	Teardowns	`list()` time	lazy time	speedup
Small	50	0.05 ms	0.02 ms	2.4x
Medium	400	0.66 ms	0.16 ms	4.2x
Large	5,000	17.54 ms	2.00 ms	8.8x
Enterprise	50,000	334.51 ms	19.82 ms	16.9x

…ests - Replace `list(_evaluate_direct_relatives())` with lazy iteration to avoid eagerly materializing the generator. The original `list()` call forced all side effects (including `ti.set_state()` DB writes) to execute upfront. In the scheduler hot path, this means N unnecessary DB writes per teardown when the first upstream already blocks it. At enterprise scale (5000 DAGs, 100 tasks/DAG), benchmarks show ~4.95M wasted DB writes and ~17x slower evaluation per scheduling loop. - Add early return in `_evaluate_teardown_scope()` when `in_scope_ids` is empty, avoiding unnecessary iteration over all finished TIs. - Add test for parallel branch DAG shape matching the actual bug topology from apache#29332 (setup >> [t_fail, t_slow] >> downstream >> td). - Add test verifying teardowns run when in-scope tasks are FAILED or UPSTREAM_FAILED (not just SUCCESS), since teardowns must clean up resources regardless of upstream failure state. - Remove fragile string assertion on error message format.

…ests (#64558) - Replace `list(_evaluate_direct_relatives())` with lazy iteration to avoid eagerly materializing the generator. The original `list()` call forced all side effects (including `ti.set_state()` DB writes) to execute upfront. In the scheduler hot path, this means N unnecessary DB writes per teardown when the first upstream already blocks it. At enterprise scale (5000 DAGs, 100 tasks/DAG), benchmarks show ~4.95M wasted DB writes and ~17x slower evaluation per scheduling loop. - Add early return in `_evaluate_teardown_scope()` when `in_scope_ids` is empty, avoiding unnecessary iteration over all finished TIs. - Add test for parallel branch DAG shape matching the actual bug topology from #29332 (setup >> [t_fail, t_slow] >> downstream >> td). - Add test verifying teardowns run when in-scope tasks are FAILED or UPSTREAM_FAILED (not just SUCCESS), since teardowns must clean up resources regardless of upstream failure state. - Remove fragile string assertion on error message format. (cherry picked from commit d15a756)

kaxil · 2026-03-31T23:05:43Z

Cherry-picked to v3-2-test: e9c0c96

…ests (apache#64558) - Replace `list(_evaluate_direct_relatives())` with lazy iteration to avoid eagerly materializing the generator. The original `list()` call forced all side effects (including `ti.set_state()` DB writes) to execute upfront. In the scheduler hot path, this means N unnecessary DB writes per teardown when the first upstream already blocks it. At enterprise scale (5000 DAGs, 100 tasks/DAG), benchmarks show ~4.95M wasted DB writes and ~17x slower evaluation per scheduling loop. - Add early return in `_evaluate_teardown_scope()` when `in_scope_ids` is empty, avoiding unnecessary iteration over all finished TIs. - Add test for parallel branch DAG shape matching the actual bug topology from apache#29332 (setup >> [t_fail, t_slow] >> downstream >> td). - Add test verifying teardowns run when in-scope tasks are FAILED or UPSTREAM_FAILED (not just SUCCESS), since teardowns must clean up resources regardless of upstream failure state. - Remove fragile string assertion on error message format.

huashi-st · 2026-04-06T01:38:10Z

Thanks @kaxil for this

…ests (apache#64558) - Replace `list(_evaluate_direct_relatives())` with lazy iteration to avoid eagerly materializing the generator. The original `list()` call forced all side effects (including `ti.set_state()` DB writes) to execute upfront. In the scheduler hot path, this means N unnecessary DB writes per teardown when the first upstream already blocks it. At enterprise scale (5000 DAGs, 100 tasks/DAG), benchmarks show ~4.95M wasted DB writes and ~17x slower evaluation per scheduling loop. - Add early return in `_evaluate_teardown_scope()` when `in_scope_ids` is empty, avoiding unnecessary iteration over all finished TIs. - Add test for parallel branch DAG shape matching the actual bug topology from apache#29332 (setup >> [t_fail, t_slow] >> downstream >> td). - Add test verifying teardowns run when in-scope tasks are FAILED or UPSTREAM_FAILED (not just SUCCESS), since teardowns must clean up resources regardless of upstream failure state. - Remove fragile string assertion on error message format.

…ests (#64558) - Replace `list(_evaluate_direct_relatives())` with lazy iteration to avoid eagerly materializing the generator. The original `list()` call forced all side effects (including `ti.set_state()` DB writes) to execute upfront. In the scheduler hot path, this means N unnecessary DB writes per teardown when the first upstream already blocks it. At enterprise scale (5000 DAGs, 100 tasks/DAG), benchmarks show ~4.95M wasted DB writes and ~17x slower evaluation per scheduling loop. - Add early return in `_evaluate_teardown_scope()` when `in_scope_ids` is empty, avoiding unnecessary iteration over all finished TIs. - Add test for parallel branch DAG shape matching the actual bug topology from #29332 (setup >> [t_fail, t_slow] >> downstream >> td). - Add test verifying teardowns run when in-scope tasks are FAILED or UPSTREAM_FAILED (not just SUCCESS), since teardowns must clean up resources regardless of upstream failure state. - Remove fragile string assertion on error message format. (cherry picked from commit d15a756)

kaxil requested review from ashb and vatsrahul1001 March 31, 2026 22:00

kaxil added this to the Airflow 3.2.1 milestone Mar 31, 2026

ashb approved these changes Mar 31, 2026

View reviewed changes

kaxil merged commit d15a756 into apache:main Mar 31, 2026
81 checks passed

kaxil deleted the fix/teardown-scope-review-fixes branch March 31, 2026 23:04

kaxil modified the milestones: Airflow 3.2.1, Airflow 3.2.0 Mar 31, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix teardown scope: lazy iteration avoids wasted DB writes in scheduler#64558

Fix teardown scope: lazy iteration avoids wasted DB writes in scheduler#64558
kaxil merged 1 commit intoapache:mainfrom
astronomer:fix/teardown-scope-review-fixes

kaxil commented Mar 31, 2026

Uh oh!

Uh oh!

kaxil commented Mar 31, 2026

Uh oh!

huashi-st commented Apr 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

kaxil commented Mar 31, 2026

Benchmark: scheduler impact of list() vs lazy

Uh oh!

Uh oh!

kaxil commented Mar 31, 2026

Uh oh!

huashi-st commented Apr 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Benchmark: scheduler impact of `list()` vs lazy