Fix execution timeout enforcement in task supervisor (#57174) #59657

Open
qwe-kev wants to merge 7 commits into apache:main from qwe-kev:fix/execution-timeout-supervisor-enforcement

@qwe-kev qwe-kev commented Dec 19, 2025

closes: #53337
related: #57174

This PR implements proper execution timeout handling for Airflow 3.0 by moving timeout enforcement from the task process to the supervisor process.

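Previously, execution_timeout was handled inside the task process using a timeout decorator. This approach failed when:

  • The task process encountered SIGSEGV or other signals (#57174)
  • Native code ran in tight loops without handling Python signals
  • The process was killed before the timeout could be enforced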

Changes

  • Added TaskExecutionTimeout message for worker-to-supervisor communication
  • Supervisor monitors execution time and enforces timeout with SIGTERM/SIGKILL
  • Removed in-process timeout decorator from task execution
  • Timeout measurement starts after DAG parsing (excludes startup overhead)
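
For illustration only, a worker-to-supervisor message of this kind might look like the sketch below. Only the name `TaskExecutionTimeout` and the `timeout_seconds` field come from this description; the `type` discriminator, the JSON-lines encoding, and the `to_bytes`/`from_bytes` helpers are assumptions made for the example, not the definitions used in the PR.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TaskExecutionTimeout:
    """Hypothetical one-way message: the worker tells the supervisor the timeout."""

    timeout_seconds: float
    type: str = "TaskExecutionTimeout"  # discriminator; assumed for this sketch

    def to_bytes(self) -> bytes:
        # One JSON object per line, so a reader can split the stream on newlines.
        return json.dumps(asdict(self)).encode() + b"\n"

    @classmethod
    def from_bytes(cls, raw: bytes) -> "TaskExecutionTimeout":
        data = json.loads(raw)
        if data.get("type") != "TaskExecutionTimeout":
            raise ValueError(f"unexpected message type: {data.get('type')!r}")
        return cls(timeout_seconds=float(data["timeout_seconds"]))
```

Because the message is one-way, the sketch defines no reply type; the supervisor would only record the value and start its clock.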

Implementation

  1. Worker sends timeout_seconds to supervisor after DAG parsing
  2. Supervisor tracks elapsed time using monotonic clock
  3. On timeout: sends SIGTERM, then SIGKILL after 5-second grace period

This ensures reliable timeout enforcement at the supervisor level, preventing runaway tasks even when the task process encounters errors.
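
The three implementation steps can be sketched roughly as follows. This is a minimal standalone illustration on POSIX using `subprocess`, not the supervisor code from the PR: the `supervise` helper, its parameters, and the polling loop are hypothetical, and the clock starts when the function is called, which stands in for "after DAG parsing".

```python
import signal
import subprocess
import sys
import time

GRACE_PERIOD = 5.0  # seconds between SIGTERM and SIGKILL, as in the description


def supervise(proc, timeout_seconds, grace_period=GRACE_PERIOD, poll_interval=0.05):
    """Wait for proc to finish; return (returncode, timed_out).

    Hypothetical helper for this sketch only.
    """
    # A monotonic clock is unaffected by wall-clock adjustments (NTP, DST).
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if proc.poll() is not None:
            return proc.returncode, False
        time.sleep(poll_interval)
    if proc.poll() is not None:  # exited right at the deadline
        return proc.returncode, False

    proc.send_signal(signal.SIGTERM)  # give the task a chance to clean up
    try:
        proc.wait(timeout=grace_period)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX; cannot be caught or ignored
        proc.wait()
    return proc.returncode, True


if __name__ == "__main__":
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    print(supervise(child, timeout_seconds=1.0))
```

Enforcing the deadline from outside the task process is what makes it independent of whether the task can still run Python signal handlers.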

@boring-cyborg

boring-cyborg bot commented Dec 19, 2025

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst).
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our prek-hooks will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide. Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.

Apache Airflow is a community-driven project and together we are making it better 🚀.
In case of doubts contact the developers at:
Mailing List: dev@airflow.apache.org
Slack: https://s.apache.org/airflow-slack

@potiuk
Member

potiuk commented Dec 20, 2025

This looks really good. Can you please add test cases covering the timeout handling in the supervisor?

@qwe-kev
Author

qwe-kev commented Dec 20, 2025

Added a comprehensive test suite for timeout handling as requested. All tests are passing.

Test Coverage:

  • 9 unit tests for timeout logic (SIGTERM/SIGKILL escalation, grace periods, etc.)
  • 4 integration tests with real subprocesses
  • All 13 tests passing locally
  • No existing tests broken

@qwe-kev
Author

qwe-kev commented Jan 6, 2026

Hi, I wanted to follow up on this contribution. Is there anything I can do to help move this forward or any additional information needed?

@potiuk
Member

potiuk commented Jan 6, 2026

> Hi, I wanted to follow up on this contribution. Is there anything I can do to help move this forward or any additional information needed?

Turning that PR green might be a good start.

@qwe-kev
Author

qwe-kev commented Mar 9, 2026

Hi, I wanted to follow up on this contribution.

@amoghrajesh
Contributor

@qwe-kev can you help fix the CI and resolve the merge conflicts on this one? We can take a look once that's done.

@potiuk potiuk marked this pull request as draft April 2, 2026 13:54
@potiuk
Member

potiuk commented Apr 2, 2026

@qwe-kev This PR has been converted to draft because it does not yet meet our Pull Request quality criteria.

Issues found:

  • Merge conflicts: This PR has merge conflicts with the main branch. Your branch is 747 commits behind main. Please rebase your branch (git fetch origin && git rebase origin/main), resolve the conflicts, and push again. See contributing quick start.
  • Pre-commit / static checks: Failing: CI image checks / Static checks. Run prek run --from-ref main locally to find and fix issues. See Pre-commit / static checks docs.
  • Build docs: Failing: CI image checks / Build documentation (--docs-only). Run breeze build-docs locally to reproduce. See Build docs docs.
  • Provider tests: Failing: Postgres tests: providers / DB-prov:Postgres:14:3.10:-amazon,celer...standard, MySQL tests: providers / DB-prov:MySQL:8.0:3.10:-amazon,celer...standard, Sqlite tests: providers / DB-prov:Sqlite:3.10:-amazon,celer...standard, Special tests / Min SQLAlchemy test: providers / DB-prov:MinSQLAlchemy-Postgres:14:3.10:-amazon,celer...standard, Special tests / Latest SQLAlchemy test: providers / DB-prov:LatestSQLAlchemy-Postgres:14:3.10:-amazon,celer...standard (+2 more). Run provider tests with breeze run pytest <provider-test-path> -xvs. See Provider tests docs.
  • Other failing CI checks: Failing: CI image checks / Build documentation (--spellcheck-only), Postgres tests: core / DB-core:Postgres:14:3.10:API...Serialization, MySQL tests: core / DB-core:MySQL:8.0:3.10:API...Serialization, Sqlite tests: core / DB-core:Sqlite:3.10:API...Serialization, Non-DB tests: core / Non-DB-core::3.10:API...Serialization (+5 more). Run prek run --from-ref main locally to reproduce. See static checks docs.
  • ⚠️ Unresolved review comments: This PR has 1 unresolved review thread from maintainers: @uranusjr (MEMBER): 1 unresolved thread. Please review and resolve all inline review comments before requesting another review. You can resolve a conversation by clicking 'Resolve conversation' on each thread after addressing the feedback. See pull request guidelines.

Note: Your branch is 747 commits behind main. Some check failures may be caused by changes in the base branch rather than by your PR. Please rebase your branch and push again to get up-to-date CI results.

What to do next:

  • The comment informs you what you need to do.
  • Fix each issue, then mark the PR as "Ready for review" in the GitHub UI - but only after making sure that all the issues are fixed.
  • There is no rush — take your time and work at your own pace. We appreciate your contribution and are happy to wait for updates.
  • Maintainers will then proceed with a normal review.

Converting a PR to draft is not a rejection; it is an invitation to bring the PR up to the project's standards so that maintainer review time is spent productively. If you have questions, feel free to ask on the Airflow Slack.

qwe-kev added 5 commits April 3, 2026 09:15
closes: apache#53337
related: apache#57174

This PR implements proper execution timeout handling for Airflow 3.0 by
moving timeout enforcement from the task process to the supervisor process.

Previously, execution_timeout was handled inside the task process using a
timeout decorator. This approach failed when:
- Task process encountered SIGSEGV or other signals (apache#57174)
- Native code ran in tight loops without handling Python signals
- Process was killed before timeout could be enforced

Changes:
- Added TaskExecutionTimeout message for worker-to-supervisor communication
- Supervisor monitors execution time and enforces timeout with SIGTERM/SIGKILL
- Removed in-process timeout decorator from task execution
- Timeout measurement starts after DAG parsing (excludes startup overhead)

Implementation:
1. Worker sends timeout_seconds to supervisor after DAG parsing
2. Supervisor tracks elapsed time using monotonic clock
3. On timeout: sends SIGTERM, then SIGKILL after 5-second grace period

This ensures reliable timeout enforcement at the supervisor level,
preventing runaway tasks even when the task process encounters errors.

- Add 9 unit tests for timeout handling logic
  - Test SIGTERM/SIGKILL escalation behavior
  - Test grace period enforcement
  - Test monotonic clock usage
  - Test message serialization

- Add 4 integration tests with real subprocesses
  - Test actual timeout enforcement
  - Test SIGKILL escalation when task ignores SIGTERM
  - Test tasks completing before/without timeout

- Add client_with_ti_start fixture for mocking API client
- Tests account for MIN_HEARTBEAT_INTERVAL in timing assertions
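
As a rough idea of what an integration-style check for the "task ignores SIGTERM" case could look like, here is a standalone sketch on POSIX. It is not taken from the PR's test suite: `STUBBORN_CHILD`, `run_until_killed`, and the `SLACK` bound (standing in for the allowance the real tests make for `MIN_HEARTBEAT_INTERVAL`) are hypothetical names and values.

```python
import signal
import subprocess
import sys
import time

# Child ignores SIGTERM, reports readiness, then sleeps far past the timeout.
STUBBORN_CHILD = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)

SLACK = 5.0  # generous upper-bound allowance for polling and scheduling delays


def run_until_killed(timeout_seconds, grace_period):
    """Return (returncode, elapsed) for a child that ignores SIGTERM."""
    proc = subprocess.Popen([sys.executable, "-c", STUBBORN_CHILD], stdout=subprocess.PIPE)
    proc.stdout.readline()  # wait for 'ready' so startup is excluded from the timing
    start = time.monotonic()
    try:
        proc.wait(timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        proc.send_signal(signal.SIGTERM)  # ignored by the child
        try:
            proc.wait(timeout=grace_period)
        except subprocess.TimeoutExpired:
            proc.kill()  # escalate to SIGKILL
            proc.wait()
    elapsed = time.monotonic() - start
    proc.stdout.close()
    return proc.returncode, elapsed


def test_sigkill_escalation():
    rc, elapsed = run_until_killed(timeout_seconds=0.3, grace_period=0.3)
    assert rc == -signal.SIGKILL  # killed by SIGKILL, not SIGTERM
    assert 0.55 <= elapsed < 0.6 + SLACK  # lower bound is firm, upper bound is loose
```

Asserting a firm lower bound and a loose upper bound keeps a timing test like this from failing on a slow CI worker.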

All 13 tests passing. No existing tests broken.

defined on BaseOperator. Addresses review feedback
- Changed falsy checks to 'is None' checks in supervisor.py to handle
  edge case where timeout values could be 0.0
- Added validation in task_runner.py to only send positive timeouts
- Prevents 0.0 (falsy) from being incorrectly treated as None
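
The difference between a falsy check and an `is None` check can be seen in a small sketch. The helper names `timeout_to_send` and `is_expired` are hypothetical and are not the functions in supervisor.py or task_runner.py; they only illustrate the two behaviors described in the list above.

```python
from datetime import timedelta
from typing import Optional


def timeout_to_send(execution_timeout: Optional[timedelta]) -> Optional[float]:
    """Worker side (hypothetical helper): only a positive timeout is sent."""
    if execution_timeout is None:
        return None
    seconds = execution_timeout.total_seconds()
    return seconds if seconds > 0 else None


def is_expired(timeout_seconds: Optional[float], elapsed: float) -> bool:
    """Supervisor side: None means "no timeout"; 0.0 is a real value."""
    # `if not timeout_seconds` would also match 0.0 and silently disable the timeout.
    if timeout_seconds is None:
        return False
    return elapsed >= timeout_seconds


print(is_expired(None, 1000.0))  # False: no timeout configured
print(is_expired(0.0, 0.0))      # True: 0.0 is not treated as "unset"
```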

Addresses reviewer feedback

- Add test coverage for TaskExecutionTimeout message in test_supervisor.py
- Remove deprecated test_run_task_timeout and test_execution_timeout from test_task_runner.py
  (timeout now handled by supervisor, covered by TestExecutionTimeoutIntegration tests)
- Remove unused imports (AirflowTaskTimeout, _execute_task, time)
- Set expected_body=None for TaskExecutionTimeout as it's a one-way message

All tests now pass (689 passed).
@qwe-kev qwe-kev force-pushed the fix/execution-timeout-supervisor-enforcement branch from 6081148 to bee59e7 Compare April 3, 2026 14:54
@qwe-kev qwe-kev marked this pull request as ready for review April 4, 2026 10:04
Development

Successfully merging this pull request may close these issues.

Handle task timeouts (execution_timeout) at supervisor
