Skip to content

Add session-level query tags to Databricks SQL operators (fixes #66839)#66895

Open
SkastVnT wants to merge 1 commit into
apache:mainfrom
SkastVnT:feat/databricks-sql-query-tags
Open

Add session-level query tags to Databricks SQL operators (fixes #66839)#66895
SkastVnT wants to merge 1 commit into
apache:mainfrom
SkastVnT:feat/databricks-sql-query-tags

Conversation

@SkastVnT
Copy link
Copy Markdown

Summary

Closes #66839.

When Databricks SQL operators execute queries, those queries are invisible in system.query.history because no session-level tags are attached. This PR injects Airflow context metadata into the Databricks QUERY_TAGS session parameter so every query can be traced back to the DAG/task that triggered it.

Approach

Tags are serialised into the key:value,key:value format required by Databricks and injected via session_configuration={"QUERY_TAGS": "..."} inside DatabricksSqlHook.get_conn(). This is the correct mechanism — the connector-level query_tags= parameter does not propagate to system.query.history.

Key design decisions

Decision Rationale
Dict API at the operator level Callers pass query_tags={"key": "value"} — readable, type-safe, mergeable
String serialisation at the hook level A single place (_format_query_tags) owns the escaping/truncation rules
Value escaping \, ,, : are escaped; values truncated at 128 chars
Merge with existing QUERY_TAGS Preserves any tags already set in session_configuration
include_airflow_query_tags=False Opt-out for operators that don't want automatic Airflow context tags

Changes

providers/databricks/src/airflow/providers/databricks/hooks/databricks_sql.py

  • Added _format_query_tag_value(value) — escape special chars, truncate at 128 chars.
  • Added _format_query_tags(tags) — convert dict[str, str | None]"key:value,...".
  • get_conn() now builds a session_config dict, merges any incoming QUERY_TAGS string with the formatted operator tags, and passes it as session_configuration to sql.connect().

providers/databricks/src/airflow/providers/databricks/operators/databricks_sql.py

  • Added module-level _get_airflow_query_tags(context) that returns a dict of Airflow context keys (airflow_dag_id, airflow_task_id, airflow_run_id, airflow_try_number, airflow_map_index).
  • Both DatabricksSqlOperator and DatabricksCopyIntoOperator accept query_tags: dict[str, str | None] and include_airflow_query_tags: bool = True.
  • On execute(), each operator merges its own query_tags with Airflow context tags (if enabled) and assigns the result to hook.query_tags.

Tests

  • test_databricks_sql.py (hooks): replaced old query_tags= connector-kwarg tests with new session_configuration["QUERY_TAGS"] assertions; added TestFormatQueryTags class covering value escaping, truncation, None-omission, and round-trip correctness.
  • test_databricks_sql.py (operators): existing tests unchanged — they already test the dict-level API.

Verification

Tested against a real Databricks workspace: all 4 tasks in the test DAG completed successfully, and QUERY_TAGS appeared in system.query.history as a MAP<STRING,STRING> with the expected Airflow keys.

Checklist

  • My PR title and commit messages follow the commit message guidelines.
  • My changes are covered by unit tests.
  • I've run ruff format and ruff check --fix — all checks pass.
  • I've updated the provider changelog / newsfragment (will add if requested).

@boring-cyborg boring-cyborg Bot added area:dev-tools area:production-image Production image improvements and fixes area:providers backport-to-v3-2-test Mark PR with this label to backport to v3-2-test branch kind:documentation provider:databricks labels May 14, 2026
@boring-cyborg
Copy link
Copy Markdown

boring-cyborg Bot commented May 14, 2026

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contributors' Guide
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our prek-hooks will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide Consider adding an example Dag that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@SkastVnT SkastVnT force-pushed the feat/databricks-sql-query-tags branch from ae3f2a6 to 333417a Compare May 14, 2026 03:56
@SkastVnT
Copy link
Copy Markdown
Author

SkastVnT commented May 14, 2026

I added the following validation screenshots for the Databricks query tags test DAG:

Screenshot 1 — no_airflow_tags task

This validates the negative-control case: DatabricksSqlOperator still runs successfully when Airflow query tag injection is disabled with include_airflow_query_tags=False.

1

Screenshot 2 — insert_more_rows task

This validates a tagged write operation: the task completed successfully and inserted additional rows into the Databricks demo table through DatabricksSqlOperator.

2

Screenshot 3 — setup_table task

This validates the setup step: the task successfully created the Delta table and inserted the initial demo rows.

4

Screenshot 4 — Databricks demo table rows

This confirms that the Databricks table was created and populated correctly. The final result contains 10 rows inserted by the Airflow tasks.

5

Screenshot 5 — Databricks table existence check

This confirms that airflow_query_tags_demo exists in the Databricks default schema.

7

Screenshot 6 — system.query.history availability check

This confirms that system.query.history is accessible in the Databricks workspace, so query history can be used to validate query tags.

8

Screenshot 7 — tagged_select task

This validates a tagged read operation: the task completed successfully with custom query tags such as pipeline and step, together with Airflow context query tags.

3

These screenshots cover the successful Airflow task execution, Databricks table creation, inserted demo rows, and access to system.query.history. I will add the query-history tag validation screenshot separately once the updated run returns rows containing the expected Airflow query tags.

@SkastVnT SkastVnT force-pushed the feat/databricks-sql-query-tags branch 2 times, most recently from 3b0cc58 to cd0ae67 Compare May 14, 2026 04:26
@SkastVnT SkastVnT marked this pull request as ready for review May 14, 2026 04:35
@SkastVnT
Copy link
Copy Markdown
Author

Additional validation — query tags are visible in Databricks system.query.history

I re-ran the Databricks query tags test DAG with the updated implementation and verified that the Airflow context tags are now written to Databricks query history.

The query below returns rows from system.query.history where query_tags contains the expected Airflow DAG metadata:

  • airflow_dag_id
  • airflow_task_id
  • airflow_run_id
  • airflow_try_number
  • custom tags such as pipeline and step

The result shows tagged queries from setup_table, tagged_select, and insert_more_rows, confirming that session-level query tags are propagated correctly to Databricks query history.

Databricks system.query.history query tags validation

The Airflow DAG run was triggered successfully, and the tagged Databricks SQL tasks executed successfully end-to-end.

Airflow Databricks query tags test DAG run

@SkastVnT SkastVnT force-pushed the feat/databricks-sql-query-tags branch from 78b9d53 to a6557d8 Compare May 14, 2026 04:39
@eladkal
Copy link
Copy Markdown
Contributor

eladkal commented May 14, 2026

cc @alexott @mwojtyczka @moomindani can you help with review?

@eladkal eladkal removed area:dev-tools area:production-image Production image improvements and fixes backport-to-v3-2-test Mark PR with this label to backport to v3-2-test branch labels May 14, 2026
@SkastVnT SkastVnT force-pushed the feat/databricks-sql-query-tags branch from a6557d8 to d8a3970 Compare May 14, 2026 05:42
@SkastVnT
Copy link
Copy Markdown
Author

Fixed the CI failure.

The issue was caused by the new query_tags argument being added before caller in DatabricksSqlHook.__init__, while the Databricks SQL sensors were still passing self.caller positionally. This made self.caller get interpreted as query_tags.

I updated the sensors to pass caller=self.caller as a keyword argument, amended the commit, and force-pushed the fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve Databricks operators with query tags

2 participants