Decouple Serialization and Deserialization Code for tasks #54569
Merged
kaxil merged 1 commit into apache:main on Aug 29, 2025
kaxil commented Aug 20, 2025
jedcunningham approved these changes Aug 29, 2025
Remove Task SDK dependencies from airflow-core deserialization by establishing a schema-based contract between client and server components. This change enables independent deployment and upgrades while laying the foundation for multi-language SDK support.

Key Decoupling Achievements:
- Replace dynamic get_serialized_fields() calls with hardcoded class methods
- Add schema-driven default resolution with get_operator_defaults_from_schema()
- Remove OPERATOR_DEFAULTS import dependency from airflow-core
- Implement SerializedBaseOperator class attributes for all operator defaults
- Update _is_excluded() logic to use schema defaults for efficient serialization

Serialization Optimizations:
- Unified partial_kwargs optimization supporting both encoded/non-encoded formats
- Intelligent default exclusion reducing storage redundancy
- MappedOperator.operator_class memory optimization (~90-95% reduction)
- Comprehensive client_defaults system with hierarchical resolution

Compatibility & Performance:
- Significant size reduction for typical DAGs with mapped operators
- Minimal overhead for client_defaults section (excellent efficiency)
- All existing serialized DAGs continue to work unchanged

Technical Implementation:
- Add generate_client_defaults() with LRU caching for optimal performance
- Implement _deserialize_partial_kwargs() supporting dual formats
- Centralized field deserialization eliminating code duplication
- Consolidated preprocessing logic in _preprocess_encoded_operator()
- Callback field preprocessing for backward compatibility

Testing & Validation:
- Added TestMappedOperatorSerializationAndClientDefaults with 9 comprehensive tests
- Parameterized tests for multiple serialization formats
- End-to-end validation of serialization/deserialization workflows
- Backward compatibility validation for callback field migration

This decoupling enables independent deployment/upgrades and provides the foundation for multi-language SDK ecosystem alongside the Task Execution API.

Part of apache#45428
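To make the "generate_client_defaults() with LRU caching" item concrete, here is a minimal sketch of the idea using `functools.lru_cache`. It is not the code from this PR: the function name `generate_client_defaults_sketch`, both default tables, and the rule that only values differing from the schema are emitted are assumptions made for illustration.

```python
from functools import lru_cache

# Hypothetical schema defaults, standing in for values read from schema.json.
_SCHEMA_DEFAULTS = {"retries": 0, "pool": "default_pool", "owner": "unknown"}

# Hypothetical defaults a Task SDK might ship; not Airflow's real values.
_SDK_TASK_DEFAULTS = {"retries": 0, "pool": "default_pool", "owner": "airflow"}


@lru_cache(maxsize=1)
def generate_client_defaults_sketch() -> tuple:
    """Compute the SDK defaults that differ from the schema defaults.

    The result is returned as a tuple of sorted items so the cached value
    is immutable and cannot be altered by a caller.
    """
    diff = {
        key: value
        for key, value in _SDK_TASK_DEFAULTS.items()
        if _SCHEMA_DEFAULTS.get(key) != value
    }
    return tuple(sorted(diff.items()))


def client_defaults_section() -> dict:
    """Build a fresh client_defaults section from the cached computation."""
    return {"tasks": dict(generate_client_defaults_sketch())}
```

Because the cache holds a single immutable result, serializing many DAGs in one process would compute the defaults once and copy them into each serialized document.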
kaxil added a commit to astronomer/airflow that referenced this pull request Sep 18, 2025
This change reduces serialized DAG size by automatically excluding fields that match their schema default values, similar to how operator serialization works. Fields like `catchup=False`, `max_active_runs=16`, and `fail_fast=False` are no longer stored when they have default values. Follow-up of apache#54569
kaxil added a commit that referenced this pull request Sep 18, 2025, with the same message (cherry picked from commit a582464)
🎯 Problem Statement
The Task SDK separation in Airflow 3.1 requires decoupling serialization and deserialization code to eliminate server-side dependencies on client SDK implementations:
- airflow-core deserialization currently depends on Task SDK's BaseOperator for default values and field lists
🚀 Solution Overview
This PR decouples (to a great extent) serialization and deserialization code by removing Task SDK dependencies from airflow-core:
- Replace dynamic get_serialized_fields() calls with hardcoded class methods
- Remove OPERATOR_DEFAULTS and other Task SDK imports from server-side code
- Use schema.json and client_defaults instead of Task SDK classes for default resolution
📊 Benchmark
As part of this change, I optimised how defaults are stored and when a field is stored at all: anything that matches a default or is null is no longer serialized. The change with the biggest impact is that entire callback functions are no longer stored as strings; a boolean now indicates whether a callback was set.
The bigger the DAG (more tasks, especially with callbacks), the greater the savings.
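As a rough illustration of the callback change, the sketch below replaces each `on_*_callback` value with a `has_on_*_callback` flag before encoding. The helper `encode_callbacks`, the `CALLBACK_FIELDS` tuple, and the sample task dict are hypothetical and only show the shape of the idea, not the serializer in this PR.

```python
import json

# Assumed subset of callback fields, chosen for illustration only.
CALLBACK_FIELDS = ("on_execute_callback", "on_failure_callback", "on_success_callback")


def on_failure(context):
    """Example callback; its source never needs to reach the server."""
    print("task failed", context)


def encode_callbacks(task_kwargs: dict) -> dict:
    """Drop callback objects and record only whether each one was set."""
    encoded = {k: v for k, v in task_kwargs.items() if k not in CALLBACK_FIELDS}
    for field in CALLBACK_FIELDS:
        if task_kwargs.get(field):
            # Store a flag instead of the function's source as a string.
            encoded[f"has_{field}"] = True
    return encoded


task = {"task_id": "extract", "on_failure_callback": on_failure, "on_success_callback": None}
print(json.dumps(encode_callbacks(task), sort_keys=True))
# prints {"has_on_failure_callback": true, "task_id": "extract"}
```

Unset callbacks produce no key at all, which is consistent with dropping nulls and defaults elsewhere in the serialized output.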
Using actual pre-optimization code:
🔥 Callback Optimization Analysis (100 tasks with 3 callbacks each):
🎯 Key Optimization Impact:
🏗️ Architecture Changes
Task Default Resolution
Implements hierarchical defaults during deserialization:
- Schema defaults (schema.json) - lowest priority
- client_defaults.tasks - SDK-specific overrides
- partial_kwargs - MappedOperator values
Serialization Exclusion
Fields matching client_defaults are automatically excluded from task serialization, reducing redundancy while maintaining full information.
fyi: Following the Task Execution API pattern, I aim to add a versioned schema contract to the Airflow website directly, or to the versioned docs, soon'ish.
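The default hierarchy and the exclusion rule described in this section can be sketched together as a round trip. Everything here is illustrative: `SCHEMA_DEFAULTS`, `CLIENT_DEFAULTS`, `resolve_task`, and `exclude_defaults` are hypothetical names, and the sketch assumes that explicitly serialized task values take precedence over every default layer.

```python
from typing import Optional

# Hypothetical values standing in for schema.json defaults.
SCHEMA_DEFAULTS = {"retries": 0, "pool": "default_pool", "queue": "default"}
# Hypothetical SDK-specific overrides, as carried in client_defaults.tasks.
CLIENT_DEFAULTS = {"tasks": {"retries": 2}}


def resolve_task(encoded_task: dict, partial_kwargs: Optional[dict] = None) -> dict:
    """Layer defaults from lowest to highest priority during deserialization."""
    resolved = dict(SCHEMA_DEFAULTS)           # lowest priority
    resolved.update(CLIENT_DEFAULTS["tasks"])  # SDK-specific overrides
    resolved.update(partial_kwargs or {})      # MappedOperator values
    resolved.update(encoded_task)              # assumption: explicit values win
    return resolved


def exclude_defaults(task: dict) -> dict:
    """Drop fields whose value equals the effective default before storing."""
    effective = {**SCHEMA_DEFAULTS, **CLIENT_DEFAULTS["tasks"]}
    return {
        key: value
        for key, value in task.items()
        if key not in effective or effective[key] != value
    }


task = {"task_id": "load", "retries": 2, "pool": "etl", "queue": "default"}
stored = exclude_defaults(task)
print(stored)  # prints {'task_id': 'load', 'pool': 'etl'}
```

Resolving `stored` again with `resolve_task` returns the original `task` dict, which is the sense in which exclusion reduces redundancy without losing information.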
Thinking about a URL like: https://airflow.apache.org/schemas/dag-serialization/v2.json
🚦 Migration Path
For Users
Appendix (for my own tracking)
TODOs (some might be done in a future PR):
- schema.json
- on_*_callback on tasks to use has_on_*_callback
- unmap method from scheduler-side #54816
- client_defaults generation in serialization (Task SDK side)
Future Work:
- schema.json in the calver OpenAPI spec for Execution API and/or in Airflow versioned docs
- ui_color & ui_fgcolor
Other points
- ExtendedJSON - TypeDecorator used in serialization of the following:
  - DagRun.context_carrier
  - TaskInstance.next_kwargs
Benchmark script: