AIP-99: Add LLMFileAnalysisOperator and @task.llm_file_analysis #64077

Merged
gopidesupavan merged 7 commits into apache:main from gopidesupavan:file-analysis on Mar 26, 2026

gopidesupavan commented Mar 22, 2026

Add LLMFileAnalysisOperator and @task.llm_file_analysis to the common-ai provider.

This adds a read-only file analysis operator that resolves files through ObjectStoragePath, normalizes supported text formats into prompt context, and optionally attaches PNG/JPG/PDF inputs for multimodal models. It supports single files and directory/prefix inputs, enforces file and prompt limits before calling the model, and includes structured-output support through output_type.

The change also adds helper utilities, docs, example DAGs, optional Avro/Parquet extras, and unit tests for text, structured, multimodal, limit, and dependency-gated paths.
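
To illustrate the idea of enforcing file and prompt limits before calling the model, here is a minimal standard-library sketch. The names `FileLimits`, `LimitExceededError`, and `check_limits`, as well as the default values, are hypothetical and are not taken from the provider's code; the real operator resolves files through ObjectStoragePath rather than receiving sizes in a plain mapping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileLimits:
    """Hypothetical limits; the provider's real names and defaults may differ."""

    max_files: int = 20
    max_total_bytes: int = 5_000_000
    max_prompt_chars: int = 100_000


class LimitExceededError(ValueError):
    """Raised before any model call when an input exceeds a limit."""


def check_limits(
    file_sizes: dict[str, int],
    prompt_text: str,
    limits: FileLimits = FileLimits(),
) -> None:
    """Validate inputs up front so an oversized request never reaches the model."""
    if len(file_sizes) > limits.max_files:
        raise LimitExceededError(
            f"{len(file_sizes)} files exceeds max_files={limits.max_files}"
        )
    total_bytes = sum(file_sizes.values())
    if total_bytes > limits.max_total_bytes:
        raise LimitExceededError(
            f"{total_bytes} bytes exceeds max_total_bytes={limits.max_total_bytes}"
        )
    if len(prompt_text) > limits.max_prompt_chars:
        raise LimitExceededError(
            f"prompt of {len(prompt_text)} chars exceeds "
            f"max_prompt_chars={limits.max_prompt_chars}"
        )
```

Failing fast like this keeps a misconfigured prefix (for example, one that matches thousands of files) from turning into an expensive or rejected model request.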

closes: #ISSUE


Was generative AI tooling used to co-author this PR?
  • Yes — Codex (GPT-5)

Generated-by: Codex (GPT-5) following the guidelines




gopidesupavan requested a review from kaxil as a code owner March 22, 2026 21:13
gopidesupavan moved this from Backlog to In review in AIP-99 Common Data Access Pattern + AI Mar 22, 2026
gopidesupavan changed the title Add LLMFileAnalysisOperator and @task.llm_file_analysis AIP-99: Add LLMFileAnalysisOperator and @task.llm_file_analysis Mar 22, 2026
gopidesupavan (Member, Author) commented

@codex review

gopidesupavan marked this pull request as draft March 23, 2026 11:20
gopidesupavan marked this pull request as ready for review March 24, 2026 05:13
gopidesupavan reopened this Mar 24, 2026
gopidesupavan force-pushed the file-analysis branch 4 times, most recently from 74a77e9 to fce35be March 24, 2026 20:27
kaxil requested a review from Copilot March 24, 2026 23:01
Copilot AI left a comment

Pull request overview

Adds a new “read-only file analysis” capability to the common-ai provider by introducing LLMFileAnalysisOperator and the @task.llm_file_analysis TaskFlow decorator. The feature resolves local/object-storage paths via ObjectStoragePath, normalizes supported text formats into prompt context, and supports multimodal attachments (PNG/JPG/PDF) plus optional Avro/Parquet readers.

Changes:

  • Introduces LLMFileAnalysisOperator and @task.llm_file_analysis, including HITL approval support and structured output (output_type).
  • Adds file discovery + format rendering utilities with limits (max files/bytes/text) and optional Avro/Parquet support.
  • Adds docs, example DAGs, dependency extras, and unit tests covering core behaviors and dependency-gated paths.
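
As a rough illustration of the format handling summarized above, the sketch below classifies a path by extension, treats a trailing `.gz` as a compression wrapper, and separates text formats from multimodal attachments and from formats that need an optional extra. The function name `classify`, the set of text extensions, and the rule that attachments are rejected unless `multi_modal` is enabled are assumptions made for this example, not the provider's confirmed behavior.

```python
from pathlib import PurePosixPath

# Illustrative sets only; the provider's supported formats may differ.
TEXT_FORMATS = {".csv", ".json", ".txt", ".log"}
ATTACHMENT_FORMATS = {".png", ".jpg", ".jpeg", ".pdf"}
# Formats that depend on an optional extra (fastavro / pyarrow per the PR).
OPTIONAL_FORMATS = {".avro": "avro", ".parquet": "parquet"}


def classify(path: str, multi_modal: bool = False) -> tuple[str, bool]:
    """Return (kind, gzipped) for a file path based on its extension(s)."""
    suffixes = [s.lower() for s in PurePosixPath(path).suffixes]
    gzipped = bool(suffixes) and suffixes[-1] == ".gz"
    if gzipped:
        suffixes = suffixes[:-1]  # look at the extension underneath .gz
    ext = suffixes[-1] if suffixes else ""

    if ext in ATTACHMENT_FORMATS:
        if not multi_modal:
            # Assumption: binary attachments require an explicit opt-in.
            raise ValueError(f"{path} needs multi_modal=True")
        return "attachment", gzipped
    if ext in OPTIONAL_FORMATS:
        return OPTIONAL_FORMATS[ext], gzipped
    if ext in TEXT_FORMATS:
        return "text", gzipped
    raise ValueError(f"unsupported file format: {path}")
```

For example, `classify("data/orders.csv.gz")` yields `("text", True)`, while a `.png` path is accepted only when `multi_modal=True` is passed.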

Reviewed changes

Copilot reviewed 15 out of 17 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| uv.lock | Adds avro/parquet extras and locked deps for fastavro/pyarrow. |
| providers/common/ai/pyproject.toml | Declares avro and parquet optional extras with version markers. |
| providers/common/ai/provider.yaml | Registers new operator docs/module and the new task decorator. |
| providers/common/ai/src/airflow/providers/common/ai/get_provider_info.py | Exposes the new operator and decorator via provider metadata. |
| providers/common/ai/src/airflow/providers/common/ai/exceptions.py | Adds file-analysis-specific exception types. |
| providers/common/ai/src/airflow/providers/common/ai/utils/file_analysis.py | Implements file resolution, format detection, sampling/truncation, and prompt construction (incl. multimodal attachments). |
| providers/common/ai/src/airflow/providers/common/ai/operators/llm_file_analysis.py | New operator that builds file-analysis request content and performs a single LLM call (with optional approval + structured output). |
| providers/common/ai/src/airflow/providers/common/ai/decorators/llm_file_analysis.py | Adds TaskFlow decorator wrapper around the new operator. |
| providers/common/ai/src/airflow/providers/common/ai/example_dags/example_llm_file_analysis.py | Example DAGs for basic, prefix, multimodal, structured, and decorator usage. |
| providers/common/ai/docs/operators/llm_file_analysis.rst | New user-facing docs page for the operator and decorator (usage + parameters + formats). |
| providers/common/ai/docs/operators/index.rst | Updates operator overview table and adds a short description of the new operator. |
| docs/spelling_wordlist.txt | Adds terms used in the new docs (“codec”, “PDFs”). |
| providers/common/ai/tests/unit/common/ai/utils/test_file_analysis.py | Unit tests for file discovery, limits, gzip handling, multimodal behavior, and Avro/Parquet gating. |
| providers/common/ai/tests/unit/common/ai/operators/test_llm_file_analysis.py | Unit tests for operator execution, structured output serialization, and approval deferral/complete flows. |
| providers/common/ai/tests/unit/common/ai/decorators/test_llm_file_analysis.py | Unit tests for decorator execution and prompt validation/templating behavior. |
| providers/common/ai/tests/unit/common/ai/assets/__init__.py | Marks test assets directory as a package. |
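
The "sampling/truncation" behavior mentioned for utils/file_analysis.py can be pictured with a small standard-library sketch: keep the header plus the first N data rows of a CSV and record whether anything was dropped, so the prompt context can say so. The function `sample_csv` and the wording of the marker line are hypothetical and do not reflect the provider's actual implementation.

```python
import csv
import io
from itertools import islice


def sample_csv(text: str, sample_rows: int) -> tuple[str, bool]:
    """Return (rendered_text, was_sampled) keeping at most sample_rows data rows."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None:
        return "", False  # empty input: nothing to render

    kept = list(islice(reader, sample_rows))
    # If at least one more row remains, the content was sampled.
    was_sampled = next(reader, None) is not None

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(kept)
    rendered = out.getvalue()
    if was_sampled:
        # Tell the model explicitly that it is not seeing the full file.
        rendered += f"[sampled: showing first {sample_rows} rows]\n"
    return rendered, was_sampled
```

Carrying an explicit marker into the prompt is what lets a model truthfully answer an instruction such as "mention if the file content was sampled or truncated".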


gopidesupavan commented Mar 25, 2026

Example to try out:

Create a small CSV file in the dev folder at /opt/airflow/dev/sample_orders.csv:

order_id,customer,amount,status
1,Alice,125.50,paid
2,Bob,0.00,failed
3,Carol,79.99,paid
4,Dan,250.00,refunded

Then define a DAG that analyzes it:

from datetime import datetime

from airflow.providers.common.ai.operators.llm_file_analysis import LLMFileAnalysisOperator
from airflow.providers.common.compat.sdk import dag, task

@dag(
    dag_id="example_llm_file_analysis_customer_orders",
    schedule=None,
    start_date=datetime(2024, 1, 1),
    catchup=False,
)
def example_llm_file_analysis_customer_orders():
    t1 = LLMFileAnalysisOperator(
        task_id="analyze_orders_csv",
        llm_conn_id="pydanticai_default",  # replace with any configured LLM connection
        file_path="/opt/airflow/dev/sample_orders.csv",
        prompt=(
            "Analyze this CSV file. "
            "List the columns, count the rows, identify any failed or refunded orders, "
            "and compute the total amount for rows where status=paid. "
            "Mention if the file content was sampled or truncated."
        ),
        sample_rows=10,
        model_id="google-gla:gemini-2.5-pro",
    )

    @task
    def print_result(result):
        print(result)

    t1 >> print_result(t1.output)


example_llm_file_analysis_customer_orders()
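
When trying the example above, it helps to know what a correct answer looks like. The following standard-library sketch computes the same facts the prompt asks for, directly from the sample CSV; the helper name `summarize_orders` is made up for this check and is not part of the provider.

```python
import csv
import io
from decimal import Decimal

SAMPLE_ORDERS = """order_id,customer,amount,status
1,Alice,125.50,paid
2,Bob,0.00,failed
3,Carol,79.99,paid
4,Dan,250.00,refunded
"""


def summarize_orders(csv_text: str) -> dict:
    """Compute the columns, row count, flagged orders, and paid total."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    flagged = [
        (row["order_id"], row["status"])
        for row in rows
        if row["status"] in {"failed", "refunded"}
    ]
    # Decimal avoids float rounding noise when summing currency amounts.
    paid_total = sum(
        (Decimal(row["amount"]) for row in rows if row["status"] == "paid"),
        Decimal("0"),
    )
    return {
        "columns": columns,
        "row_count": len(rows),
        "flagged": flagged,
        "paid_total": paid_total,
    }


summary = summarize_orders(SAMPLE_ORDERS)
```

For the four sample rows, the paid total is 125.50 + 79.99 = 205.49, and the flagged orders are order 2 (failed) and order 4 (refunded). A model response that disagrees with these values is worth a closer look.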


gopidesupavan (Member, Author) commented

With an image file as input:

from datetime import datetime

from airflow.providers.common.ai.operators.llm_file_analysis import LLMFileAnalysisOperator
from airflow.providers.common.compat.sdk import dag, task

@dag(
    dag_id="example_llm_file_analysis_images",
    schedule=None,
    start_date=datetime(2024, 1, 1),
    catchup=False,
)
def example_llm_file_analysis_images():
    t1 = LLMFileAnalysisOperator(
        task_id="analyze_architecture_diagram",
        llm_conn_id="pydanticai_default",  # replace with any configured LLM connection
        file_path="/opt/airflow/task-sdk/docs/img/airflow-3-task-sdk.png",
        prompt=(
            "Analyze this file. "
            "Which components appear in the architecture diagram?"
        ),
        sample_rows=10,
        model_id="google-gla:gemini-2.5-pro",
        multi_modal=True,
    )

    @task
    def print_result(result):
        print(result)

    t1 >> print_result(t1.output)


example_llm_file_analysis_images()
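
For readers curious what "attaching" an image might involve, here is a minimal sketch that reads a file as bytes and pairs it with a media type guessed from the file name. The `Attachment` dataclass and `load_attachment` function are hypothetical illustrations; the operator's real attachment handling (and how it passes binary content to the model) is not shown in this PR conversation.

```python
import mimetypes
from dataclasses import dataclass
from pathlib import Path

# Media types matching the PNG/JPG/PDF inputs named in the PR description.
SUPPORTED_MEDIA_TYPES = {"image/png", "image/jpeg", "application/pdf"}


@dataclass(frozen=True)
class Attachment:
    """Hypothetical container for one binary input sent alongside the prompt."""

    name: str
    media_type: str
    data: bytes


def load_attachment(path: str) -> Attachment:
    """Read a local file as bytes and tag it with its guessed media type."""
    media_type, _encoding = mimetypes.guess_type(path)
    if media_type not in SUPPORTED_MEDIA_TYPES:
        raise ValueError(f"unsupported attachment type for {path}: {media_type}")
    file_path = Path(path)
    return Attachment(
        name=file_path.name,
        media_type=media_type,
        data=file_path.read_bytes(),
    )
```

The media type is guessed from the extension only, so this sketch does not verify that the bytes really are a valid PNG, JPEG, or PDF.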

gopidesupavan merged commit 3bec5d6 into apache:main Mar 26, 2026
141 of 145 checks passed
gopidesupavan deleted the file-analysis branch March 26, 2026 17:21
nailo2c pushed a commit to nailo2c/airflow that referenced this pull request Mar 30, 2026
…he#64077)

* Add LLMFileAnalysisOperator and @task.llm_file_analysis to the common-ai provider

# Conflicts:
#	uv.lock

* Fix mypy issues

* Update utils

* Update return model

* Fix spells

* fix up read

* document prefix lookup operation
Suraj-kumar00 pushed a commit to Suraj-kumar00/airflow that referenced this pull request Apr 7, 2026