Add HookToolset and SQLToolset for agentic LLM workflows #62785
Merged
kaxil merged 5 commits into apache:main on Mar 3, 2026
HookToolset: Generic adapter that exposes any Airflow Hook's methods as pydantic-ai tools via introspection. Requires explicit allowed_methods list (no auto-discovery). Builds JSON Schema from method signatures and enriches tool descriptions from docstrings.

SQLToolset: Curated 4-tool database toolset (list_tables, get_schema, query, check_query) wrapping DbApiHook. Read-only by default with SQL validation, allowed_tables metadata filtering, and max_rows truncation.

Both implement pydantic-ai's AbstractToolset interface with sequential=True on all tool definitions to prevent concurrent sync I/O.
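The read-only default and max_rows truncation described for SQLToolset can be pictured with a small, self-contained sketch. It is not the PR's implementation: `run_query` and the regex deny-list are hypothetical stand-ins (the PR routes queries through `validate_sql()`, which is not reproduced here), and the standard-library `sqlite3` module replaces a DbApiHook connection.

```python
import re
import sqlite3

# Hypothetical deny-list used only for illustration; a real validator
# would parse the statement rather than match a leading keyword.
_WRITE_PREFIX = re.compile(
    r"^\s*(insert|update|delete|drop|alter|create|truncate)\b", re.IGNORECASE
)


def run_query(conn, sql, *, allow_writes=False, max_rows=50):
    """Run `sql`, rejecting writes by default and capping the returned rows."""
    if not allow_writes and _WRITE_PREFIX.match(sql):
        raise ValueError("write statements are rejected in read-only mode")
    cursor = conn.execute(sql)
    # Fetch one row more than the cap so truncation can be reported.
    rows = cursor.fetchmany(max_rows + 1)
    columns = [col[0] for col in cursor.description or []]
    return {
        "rows": [dict(zip(columns, row)) for row in rows[:max_rows]],
        "truncated": len(rows) > max_rows,
    }


# Usage against an in-memory database with five rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)", [(i, f"e{i}") for i in range(5)]
)
result = run_query(conn, "SELECT id, name FROM events ORDER BY id", max_rows=2)
```

Returning a `truncated` flag alongside the rows is one way to let the agent know that more data exists without handing it an unbounded result set.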
The list comprehension in the else branch produces list[list[Any]] while the if branch produces list[dict[str, Any]]. Add an explicit type annotation to satisfy mypy.
Sphinx autoapi generates RST from pydantic-ai's AbstractToolset base class docstrings. These words appear in the auto-generated docs and need to be in the global wordlist.
Docs for HookToolset (generic hook→tools adapter) and SQLToolset (curated 4-tool DB toolset). Includes defense layers table, allowed_tables limitation, HookToolset guidelines, recommended configurations, and production checklist.
gopidesupavan (Member) approved these changes on Mar 3, 2026 and left a comment:
woohoo Great work kaxil.
Remove toolsets.rst from how-to-guide list in provider.yaml — the validation script only recognizes operators/, sensors/, and transfer/ doc paths. The toolsets docs remain accessible via the index.rst toctree. Add "hardcode" to the spelling wordlist.
Member
Maybe we should add
dominikhei pushed a commit to dominikhei/airflow that referenced this pull request on Mar 11, 2026

…62785)

HookToolset: Generic adapter that exposes any Airflow Hook's methods as pydantic-ai tools via introspection. Requires explicit allowed_methods list (no auto-discovery). Builds JSON Schema from method signatures and enriches tool descriptions from docstrings.

SQLToolset: Curated 4-tool database toolset (list_tables, get_schema, query, check_query) wrapping DbApiHook. Read-only by default with SQL validation, allowed_tables metadata filtering, and max_rows truncation.

Both implement pydantic-ai's AbstractToolset interface with sequential=True on all tool definitions to prevent concurrent sync I/O.

* Fix mypy error: annotate result variable in SQLToolset._query

The list comprehension in the else branch produces list[list[Any]] while the if branch produces list[dict[str, Any]]. Add an explicit type annotation to satisfy mypy.

* Add toolset/agentic/ctx to spelling wordlist

Sphinx autoapi generates RST from pydantic-ai's AbstractToolset base class docstrings. These words appear in the auto-generated docs and need to be in the global wordlist.

Docs for HookToolset (generic hook→tools adapter) and SQLToolset (curated 4-tool DB toolset). Includes defense layers table, allowed_tables limitation, HookToolset guidelines, recommended configurations, and production checklist.
Airflow Hooks as AI Agent Tools
Airflow's 350+ provider hooks already form the largest authenticated tool registry in the data ecosystem. Each hook has typed methods, rich docstrings, and managed credentials stored in Airflow's secret backend. This PR adds a thin adapter layer that exposes them as pydantic-ai tools, turning Airflow into an AI agent toolkit. It continues the talk Pavan and I gave at Airflow Summit 2025 [YouTube].
For context: MCP servers typically expose ~30 tools each and require separate authentication setup. Airflow hooks cover thousands of methods across AWS, GCP, Azure, databases, HTTP APIs, Slack, and more — all pre-authenticated through Airflow Connections.
What's in this PR
HookToolset — generic adapter that turns any Airflow Hook into a set of pydantic-ai tools via introspection:
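A minimal sketch of the adapter idea, assuming a stand-in hook class rather than a real provider hook. `FakeStorageHook`, `build_tool_specs`, and the type mapping are hypothetical names invented for illustration; only `inspect.signature`, `inspect.getdoc`, and `typing.get_type_hints` are real standard-library calls. The actual HookToolset constructor and its output format may differ.

```python
import inspect
from typing import Any, get_type_hints

# Illustrative mapping from Python annotations to JSON Schema type names.
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


class FakeStorageHook:
    """Stand-in for an Airflow hook; not a real provider class."""

    def list_keys(self, bucket: str, prefix: str = "", limit: int = 100) -> list:
        """List object keys in a bucket."""
        return [f"{prefix}file-{i}" for i in range(min(limit, 3))]

    def delete_bucket(self, bucket: str) -> None:
        """Destructive method that stays hidden unless explicitly allowed."""
        raise RuntimeError("not exposed to the agent in this sketch")


def build_tool_specs(hook: Any, allowed_methods: list[str]) -> dict[str, dict]:
    """Build one tool spec per explicitly allowed method (no auto-discovery)."""
    specs: dict[str, dict] = {}
    for name in allowed_methods:
        method = getattr(hook, name, None)
        if not callable(method):
            raise ValueError(f"{type(hook).__name__} has no callable {name!r}")
        signature = inspect.signature(method)  # bound method, so no `self`
        hints = get_type_hints(method)
        properties, required = {}, []
        for param_name, param in signature.parameters.items():
            json_type = _JSON_TYPES.get(hints.get(param_name), "string")
            properties[param_name] = {"type": json_type}
            if param.default is inspect.Parameter.empty:
                required.append(param_name)
        specs[name] = {
            "description": inspect.getdoc(method) or "",
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        }
    return specs


# Only list_keys is opted in; delete_bucket never becomes a tool.
specs = build_tool_specs(FakeStorageHook(), allowed_methods=["list_keys"])
```

Because the method list is validated when the specs are built, a typo in `allowed_methods` fails early instead of surfacing as a missing tool during an agent run.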
The introspection engine builds JSON Schema from method signatures (inspect.signature + get_type_hints) and enriches tool descriptions from docstrings (Sphinx :param: and Google Args: styles). This works with any hook: S3Hook, GCSHook, SlackHook, DbApiHook, etc.

SQLToolset: curated 4-tool database toolset inspired by LangChain's SQLDatabaseToolkit (one of their most-used agent features):
* list_tables (filtered by allowed_tables)
* get_schema
* query (with max_rows truncation)
* check_query

Documentation: full toolsets.rst with usage, parameters, and a security section (defense layers table, allowed_tables limitation, HookToolset guidelines, recommended configurations, production checklist).

Safety
* HookToolset requires an explicit allowed_methods list. No auto-discovery: DAG authors must opt in to each method. Methods are validated with hasattr + callable at instantiation time.
* SQLToolset sets allow_writes=False by default, validates every query through validate_sql(), and rejects INSERT/UPDATE/DELETE/DROP. allowed_tables filters metadata visibility (not query-level; this is documented clearly). max_rows truncates results.
* Both toolsets set sequential=True on all tool definitions to prevent concurrent sync I/O on shared hook state.

Design decisions

Why custom introspection instead of pydantic-ai's _function_schema? Hook methods are bound methods with self, decorators like @provide_bucket_name, and complex signatures. Our lightweight approach avoids coupling to pydantic-ai internals.

Why is allowed_tables metadata-only? Parsing SQL for table references (CTEs, subqueries, aliases, vendor-specific syntax) is complex and error-prone. Providing a false sense of security is worse than being honest about the limitation. Real access control belongs at the DB permission level.

Why not auto-discover hook methods? Auto-discovery would expose every public method, including run(), get_connection(), etc., giving an LLM broad unintended access. Explicit listing forces DAG authors to think about the blast radius.

Dag: