Skip to content

Fix deferred task resume failure when worker is older than server#64598

Merged
vatsrahul1001 merged 2 commits intoapache:mainfrom
astronomer:fix/trigger-next-kwargs-backward-compat
Apr 2, 2026
Merged

Fix deferred task resume failure when worker is older than server#64598
vatsrahul1001 merged 2 commits intoapache:mainfrom
astronomer:fix/trigger-next-kwargs-backward-compat

Conversation

@kaxil
Copy link
Copy Markdown
Member

@kaxil kaxil commented Apr 1, 2026

Summary

  • Workers on task-sdk < 1.2 crash with KeyError: __var when resuming deferred tasks whose trigger has fired on a 3.2 server
  • Add a Cadwyn response converter on TIRunContext that wraps next_kwargs in BaseSerialization format for old API versions

Context

After #59711 switched trigger kwargs to SDK serde, handle_event_submit re-serializes next_kwargs as plain dicts. Old workers only know BaseSerialization.deserialize(), which requires {"__type": "dict", "__var": {...}} wrapping on all dicts -- so they crash with KeyError: __var.

This was flagged during review of #59711 -- the compat shim was added for the deserialize path but not the serialize path.

Approach

Instead of changing trigger.py to use BaseSerialization.serialize() (which regresses the DB to old format), the fix adds a response converter in the Execution API versioning layer (ModifyDeferredTaskKwargsToJsonValue). This:

  • Keeps the DB in SDK serde format (the direction we're moving)
  • Translates at the API boundary for old workers -- consistent with how all other backward-compat changes work via Cadwyn
  • Short-circuits when data already has __type/__var keys (rolling upgrade with old data in DB)
  • Gets removed automatically when old API versions are dropped

@vatsrahul1001 vatsrahul1001 added this to the Airflow 3.2.0 milestone Apr 1, 2026
@vatsrahul1001 vatsrahul1001 added the backport-to-v3-2-test Mark PR with this label to backport to v3-2-test branch label Apr 1, 2026
@kaxil kaxil force-pushed the fix/trigger-next-kwargs-backward-compat branch from d5379c9 to 57b9114 Compare April 1, 2026 18:06
Workers on task-sdk < 1.2 crash with `KeyError: __var` when resuming
deferred tasks whose trigger has fired on a 3.2 server. The server's
handle_event_submit re-serializes next_kwargs with SDK serde (plain
dicts), but old workers expect BaseSerialization format (__type/__var
wrapping). submit_failure and scheduler timeout paths also write plain
dicts that old workers cannot parse.

Add a Cadwyn response converter on TIRunContext that deserializes SDK
serde then re-serializes with BaseSerialization for old API versions.
This catches all next_kwargs writers at the single API read point.
Short-circuits when data is already in BaseSerialization format.
@kaxil kaxil force-pushed the fix/trigger-next-kwargs-backward-compat branch from 57b9114 to 858818b Compare April 1, 2026 23:51
@jason810496 jason810496 self-requested a review April 2, 2026 04:18
@vatsrahul1001
Copy link
Copy Markdown
Contributor

e2e test failing tests/airflow_e2e_tests/basic_tests/test_basic_dag_operations.py::TestBasicDagFunctionality::test_xcom_value FAILED [ 2%]

@vatsrahul1001
Copy link
Copy Markdown
Contributor

E2E failure looks unrelated. Failing for others' PR as well.

@vatsrahul1001 vatsrahul1001 merged commit 891c7fb into apache:main Apr 2, 2026
549 of 553 checks passed
@vatsrahul1001 vatsrahul1001 deleted the fix/trigger-next-kwargs-backward-compat branch April 2, 2026 08:09
github-actions bot pushed a commit that referenced this pull request Apr 2, 2026
…n server (#64598)

Fix deferred task resume failure when worker is older than server (#64598)
(cherry picked from commit 891c7fb)

Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>
@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 2, 2026

Backport successfully created: v3-2-test

Note: As of Merging PRs targeted for Airflow 3.X
the committer who merges the PR is responsible for backporting the PRs that are bug fixes (generally speaking) to the maintenance branches.

In matter of doubt please ask in #release-management Slack channel.

Status Branch Result
v3-2-test PR Link

github-actions bot pushed a commit to aws-mwaa/upstream-to-airflow that referenced this pull request Apr 2, 2026
…n server (apache#64598)

Fix deferred task resume failure when worker is older than server (apache#64598)
(cherry picked from commit 891c7fb)

Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>
vatsrahul1001 pushed a commit that referenced this pull request Apr 2, 2026
…n server (#64598) (#64619)

Fix deferred task resume failure when worker is older than server (#64598)
(cherry picked from commit 891c7fb)

Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>
vatsrahul1001 pushed a commit that referenced this pull request Apr 8, 2026
…n server (#64598) (#64619)

Fix deferred task resume failure when worker is older than server (#64598)
(cherry picked from commit 891c7fb)

Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:API Airflow's REST/HTTP API area:task-sdk area:Triggerer backport-to-v3-2-test Mark PR with this label to backport to v3-2-test branch

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants