Fix teardown scope: lazy iteration avoids wasted DB writes in scheduler#64558
Merged
kaxil merged 1 commit intoapache:mainfrom Mar 31, 2026
Merged
Conversation
…ests - Replace `list(_evaluate_direct_relatives())` with lazy iteration to avoid eagerly materializing the generator. The original `list()` call forced all side effects (including `ti.set_state()` DB writes) to execute upfront. In the scheduler hot path, this means N unnecessary DB writes per teardown when the first upstream already blocks it. At enterprise scale (5000 DAGs, 100 tasks/DAG), benchmarks show ~4.95M wasted DB writes and ~17x slower evaluation per scheduling loop. - Add early return in `_evaluate_teardown_scope()` when `in_scope_ids` is empty, avoiding unnecessary iteration over all finished TIs. - Add test for parallel branch DAG shape matching the actual bug topology from apache#29332 (setup >> [t_fail, t_slow] >> downstream >> td). - Add test verifying teardowns run when in-scope tasks are FAILED or UPSTREAM_FAILED (not just SUCCESS), since teardowns must clean up resources regardless of upstream failure state. - Remove fragile string assertion on error message format.
ashb
approved these changes
Mar 31, 2026
kaxil
added a commit
that referenced
this pull request
Mar 31, 2026
…ests (#64558) - Replace `list(_evaluate_direct_relatives())` with lazy iteration to avoid eagerly materializing the generator. The original `list()` call forced all side effects (including `ti.set_state()` DB writes) to execute upfront. In the scheduler hot path, this means N unnecessary DB writes per teardown when the first upstream already blocks it. At enterprise scale (5000 DAGs, 100 tasks/DAG), benchmarks show ~4.95M wasted DB writes and ~17x slower evaluation per scheduling loop. - Add early return in `_evaluate_teardown_scope()` when `in_scope_ids` is empty, avoiding unnecessary iteration over all finished TIs. - Add test for parallel branch DAG shape matching the actual bug topology from #29332 (setup >> [t_fail, t_slow] >> downstream >> td). - Add test verifying teardowns run when in-scope tasks are FAILED or UPSTREAM_FAILED (not just SUCCESS), since teardowns must clean up resources regardless of upstream failure state. - Remove fragile string assertion on error message format. (cherry picked from commit d15a756)
Member
Author
|
Cherry-picked to v3-2-test: e9c0c96 |
Subham-KRLX
pushed a commit
to Subham-KRLX/airflow
that referenced
this pull request
Apr 3, 2026
…ests (apache#64558) - Replace `list(_evaluate_direct_relatives())` with lazy iteration to avoid eagerly materializing the generator. The original `list()` call forced all side effects (including `ti.set_state()` DB writes) to execute upfront. In the scheduler hot path, this means N unnecessary DB writes per teardown when the first upstream already blocks it. At enterprise scale (5000 DAGs, 100 tasks/DAG), benchmarks show ~4.95M wasted DB writes and ~17x slower evaluation per scheduling loop. - Add early return in `_evaluate_teardown_scope()` when `in_scope_ids` is empty, avoiding unnecessary iteration over all finished TIs. - Add test for parallel branch DAG shape matching the actual bug topology from apache#29332 (setup >> [t_fail, t_slow] >> downstream >> td). - Add test verifying teardowns run when in-scope tasks are FAILED or UPSTREAM_FAILED (not just SUCCESS), since teardowns must clean up resources regardless of upstream failure state. - Remove fragile string assertion on error message format.
Contributor
|
Thanks @kaxil for this |
Suraj-kumar00
pushed a commit
to Suraj-kumar00/airflow
that referenced
this pull request
Apr 7, 2026
…ests (apache#64558) - Replace `list(_evaluate_direct_relatives())` with lazy iteration to avoid eagerly materializing the generator. The original `list()` call forced all side effects (including `ti.set_state()` DB writes) to execute upfront. In the scheduler hot path, this means N unnecessary DB writes per teardown when the first upstream already blocks it. At enterprise scale (5000 DAGs, 100 tasks/DAG), benchmarks show ~4.95M wasted DB writes and ~17x slower evaluation per scheduling loop. - Add early return in `_evaluate_teardown_scope()` when `in_scope_ids` is empty, avoiding unnecessary iteration over all finished TIs. - Add test for parallel branch DAG shape matching the actual bug topology from apache#29332 (setup >> [t_fail, t_slow] >> downstream >> td). - Add test verifying teardowns run when in-scope tasks are FAILED or UPSTREAM_FAILED (not just SUCCESS), since teardowns must clean up resources regardless of upstream failure state. - Remove fragile string assertion on error message format.
vatsrahul1001
pushed a commit
that referenced
this pull request
Apr 8, 2026
…ests (#64558) - Replace `list(_evaluate_direct_relatives())` with lazy iteration to avoid eagerly materializing the generator. The original `list()` call forced all side effects (including `ti.set_state()` DB writes) to execute upfront. In the scheduler hot path, this means N unnecessary DB writes per teardown when the first upstream already blocks it. At enterprise scale (5000 DAGs, 100 tasks/DAG), benchmarks show ~4.95M wasted DB writes and ~17x slower evaluation per scheduling loop. - Add early return in `_evaluate_teardown_scope()` when `in_scope_ids` is empty, avoiding unnecessary iteration over all finished TIs. - Add test for parallel branch DAG shape matching the actual bug topology from #29332 (setup >> [t_fail, t_slow] >> downstream >> td). - Add test verifying teardowns run when in-scope tasks are FAILED or UPSTREAM_FAILED (not just SUCCESS), since teardowns must clean up resources regardless of upstream failure state. - Remove fragile string assertion on error message format. (cherry picked from commit d15a756)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Addresses review feedback on #64181, fixes the teardown scope evaluation to avoid unnecessary work in the scheduler hot path.
Lazy iteration instead of
list()materialization:_evaluate_direct_relatives()has side effects (ti.set_state()on line 460). The originallist()call forced all those DB writes before yielding anything. With lazy iteration, we yield as we go and only evaluate teardown scope if no direct-relative status was produced.Early return for empty
in_scope_ids: Skips theensure_finished_tisDB query when there are no in-scope tasks to check.Parallel branch regression test: The existing tests used linear chains, but the reported bug involves parallel branches (
setup >> [t_fail, t_slow] >> downstream >> teardown). Added a test matching that DAG shape.FAILED/UPSTREAM_FAILED state test: Teardowns must run to clean up resources regardless of upstream failure state. Added a test verifying this works with FAILED and UPSTREAM_FAILED in-scope tasks.
Removed fragile string assertion:
assert "2 in-scope" in reasoncouples tests to exact message format.Benchmark: scheduler impact of
list()vs lazyThe scheduler evaluates trigger rules for every TI in every loop. For teardowns,
list()forces allset_state()DB writes upfront even when the first upstream already blocks:list()writeslist()time