Add ExpressionAnalyzer for pluggable expression-level statistics estimation#21122

Draft
asolimando wants to merge 2 commits into apache:main from asolimando:asolimando/ndv-expression-analyzer

@asolimando
Member

Which issue does this PR close?

Part of #21120 (framework + projection/filter integration)

Rationale for this change

DataFusion currently loses expression-level statistics when computing plan metadata. Projected expressions that aren't bare columns or literals get unknown statistics, and filter selectivity falls back to a hardcoded 20% when interval analysis cannot handle the predicate. There is also no extension point for users to provide statistics for their own UDFs.

This PR introduces ExpressionAnalyzer, a pluggable chain-of-responsibility framework that addresses these gaps. It follows the same extensibility pattern used elsewhere in DataFusion (ExprPlanner, OptimizerRule).

Addresses reviewer feedback from #19957: chain delegation, SessionState integration, own folder.
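
To make the chain-delegation idea concrete, here is a minimal, self-contained sketch. It is not the code in this PR: `AnalysisResult`, `Analyzer`, `Registry`, the toy `Expr` type, and the `my_hash` UDF are all hypothetical names chosen for illustration. It only shows the shape of the pattern, where each analyzer either computes an answer or delegates, and receives the registry so it can recurse into sub-expressions through the whole chain.

```rust
/// Outcome of asking one analyzer about an expression (hypothetical type).
#[derive(Debug, PartialEq)]
enum AnalysisResult<T> {
    Computed(T),
    Delegate,
}

/// Toy expression type standing in for a physical expression.
#[derive(Debug)]
enum Expr {
    Column(String),
    Udf { name: String, arg: Box<Expr> },
}

trait Analyzer {
    /// Estimate the number of distinct values (NDV) of `expr`.
    /// The registry is passed in so an analyzer can ask the full chain
    /// about sub-expressions it does not understand itself.
    fn ndv(&self, expr: &Expr, registry: &Registry) -> AnalysisResult<u64>;
}

struct Registry {
    analyzers: Vec<Box<dyn Analyzer>>,
}

impl Registry {
    /// The first analyzer returning `Computed` wins; `None` means unknown.
    fn ndv(&self, expr: &Expr) -> Option<u64> {
        self.analyzers.iter().find_map(|a| match a.ndv(expr, self) {
            AnalysisResult::Computed(v) => Some(v),
            AnalysisResult::Delegate => None,
        })
    }
}

/// Knows only about the bare column `a` (NDV hard-coded for the sketch).
struct ColumnAnalyzer;

impl Analyzer for ColumnAnalyzer {
    fn ndv(&self, expr: &Expr, _registry: &Registry) -> AnalysisResult<u64> {
        match expr {
            Expr::Column(name) if name.as_str() == "a" => AnalysisResult::Computed(100),
            _ => AnalysisResult::Delegate,
        }
    }
}

/// User-provided analyzer for a hypothetical injective UDF `my_hash`:
/// the result has the same NDV as the argument, looked up via the registry.
struct MyHashAnalyzer;

impl Analyzer for MyHashAnalyzer {
    fn ndv(&self, expr: &Expr, registry: &Registry) -> AnalysisResult<u64> {
        match expr {
            Expr::Udf { name, arg } if name.as_str() == "my_hash" => match registry.ndv(arg) {
                Some(v) => AnalysisResult::Computed(v),
                None => AnalysisResult::Delegate,
            },
            _ => AnalysisResult::Delegate,
        }
    }
}

/// Custom analyzers go first so they take precedence over the defaults.
fn build_registry() -> Registry {
    Registry {
        analyzers: vec![Box::new(MyHashAnalyzer), Box::new(ColumnAnalyzer)],
    }
}

fn main() {
    let registry = build_registry();
    let expr = Expr::Udf {
        name: "my_hash".to_string(),
        arg: Box::new(Expr::Column("a".to_string())),
    };
    println!("{:?}", registry.ndv(&expr)); // prints Some(100)
}
```

In this sketch an expression that no analyzer recognizes yields `None`, which corresponds to "unknown statistics" rather than a guessed value.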

What changes are included in this PR?

  • ExpressionAnalyzer trait with registry parameter for chain delegation
  • ExpressionAnalyzerRegistry to chain analyzers (first Computed wins)
  • DefaultExpressionAnalyzer: Selinger-style estimation for columns, literals, binary expressions (AND/OR/comparisons), NOT, and arithmetic
  • ExpressionAnalyzerRegistry stored in SessionState, injected into ProjectionExec and FilterExec by the planner
  • ProjectionExprs uses registry to estimate NDV, min/max, and null fraction through projected expressions
  • FilterExec uses registry selectivity as fallback when check_support returns false
  • Config option datafusion.optimizer.enable_expression_analyzer (default false) to opt in; zero behavior change when disabled
  • Limitation: projections/filters created by optimizer rules after planning do not receive the registry and fall back to upstream behavior. Full coverage requires an operator-level statistics registry (orthogonal, will be tracked separately).
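
For readers unfamiliar with Selinger-style estimation, the sketch below shows the textbook heuristics under the usual independence assumption: equality against a literal selects 1/NDV of the rows, AND multiplies selectivities, OR uses inclusion-exclusion, and NOT takes the complement. The `Pred` type and `selectivity` function are hypothetical and exist only for this illustration. The exact rules in DefaultExpressionAnalyzer may differ, and the 20% constant is the existing fallback mentioned in the rationale.

```rust
/// Fallback selectivity used when no rule applies (20%).
const DEFAULT_SELECTIVITY: f64 = 0.2;

/// Toy predicate tree (hypothetical; not a DataFusion type).
enum Pred {
    /// `col = literal`, where `ndv` is the column's distinct-value count.
    Eq { ndv: u64 },
    /// Any predicate the estimator has no rule for.
    Unknown,
    And(Box<Pred>, Box<Pred>),
    Or(Box<Pred>, Box<Pred>),
    Not(Box<Pred>),
}

/// Estimate the fraction of rows that satisfy `p`, assuming independence
/// between sub-predicates.
fn selectivity(p: &Pred) -> f64 {
    match p {
        // Uniform distribution over distinct values: 1 / NDV.
        Pred::Eq { ndv } if *ndv > 0 => 1.0 / *ndv as f64,
        // NDV of zero or no applicable rule: use the fallback.
        Pred::Eq { .. } | Pred::Unknown => DEFAULT_SELECTIVITY,
        Pred::And(l, r) => selectivity(l) * selectivity(r),
        Pred::Or(l, r) => {
            let (a, b) = (selectivity(l), selectivity(r));
            a + b - a * b
        }
        Pred::Not(inner) => 1.0 - selectivity(inner),
    }
}

fn main() {
    // a = 1 AND NOT (b = 2), with ndv(a) = 10 and ndv(b) = 4.
    let pred = Pred::And(
        Box::new(Pred::Eq { ndv: 10 }),
        Box::new(Pred::Not(Box::new(Pred::Eq { ndv: 4 }))),
    );
    let input_rows = 1_000.0;
    println!("estimated output rows: {:.1}", input_rows * selectivity(&pred));
}
```

The fallback arm is what keeps such an estimator total: a predicate without a rule still gets a number, just a less informed one.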

Are these changes tested?

  • 15 unit tests for ExpressionAnalyzer (NDV, selectivity, min/max, null fraction, custom analyzers, chain delegation)
  • 31 projection tests (including new test_project_statistics_with_expression_analyzer)
  • 26 filter tests
  • 7 session state tests
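
As an illustration of the kind of property a projection-statistics test can check, the following standalone sketch propagates NDV, min/max, and null fraction through `col + literal`. It does not use the PR's types: `ColStats`, `Expr`, and `project_stats` are hypothetical, and the propagation rules are simple assumptions (independence for nulls, a product upper bound for NDV) rather than the rules implemented here.

```rust
/// Per-column statistics; `None` means unknown (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct ColStats {
    ndv: Option<u64>,
    min: Option<i64>,
    max: Option<i64>,
    null_fraction: Option<f64>,
}

/// Toy projected expression over integer columns.
enum Expr {
    Column(ColStats),
    Literal(i64),
    Add(Box<Expr>, Box<Expr>),
}

/// Derive statistics for the value produced by `e`.
fn project_stats(e: &Expr) -> ColStats {
    match e {
        Expr::Column(s) => s.clone(),
        // A literal is a single, non-null value.
        Expr::Literal(v) => ColStats {
            ndv: Some(1),
            min: Some(*v),
            max: Some(*v),
            null_fraction: Some(0.0),
        },
        Expr::Add(l, r) => {
            let (a, b) = (project_stats(l), project_stats(r));
            ColStats {
                // Upper bound: at most ndv(l) * ndv(r) distinct sums.
                // A real estimator would also cap this by the row count.
                ndv: a.ndv.zip(b.ndv).map(|(x, y)| x.saturating_mul(y)),
                // Bounds add; overflow degrades to unknown.
                min: a.min.zip(b.min).and_then(|(x, y)| x.checked_add(y)),
                max: a.max.zip(b.max).and_then(|(x, y)| x.checked_add(y)),
                // The sum is null if either side is null (independence assumed).
                null_fraction: a
                    .null_fraction
                    .zip(b.null_fraction)
                    .map(|(p, q)| 1.0 - (1.0 - p) * (1.0 - q)),
            }
        }
    }
}

fn main() {
    let col = ColStats {
        ndv: Some(50),
        min: Some(0),
        max: Some(99),
        null_fraction: Some(0.1),
    };
    let expr = Expr::Add(Box::new(Expr::Column(col)), Box::new(Expr::Literal(5)));
    println!("{:?}", project_stats(&expr));
}
```

For `col + 5` the NDV stays at 50 because adding a constant is injective, and the bounds shift to 5 and 104.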

Are there any user-facing changes?

New public API (purely additive, non-breaking):

  • ExpressionAnalyzer trait and ExpressionAnalyzerRegistry in datafusion-physical-expr
  • SessionState::expression_analyzer_registry() getter
  • SessionStateBuilder::with_expression_analyzer_registry() setter
  • ProjectionExprs::with_expression_analyzer_registry() setter
  • FilterExecBuilder::with_expression_analyzer_registry() setter
  • ProjectionExec::with_expression_analyzer_registry() setter
  • Config option datafusion.optimizer.enable_expression_analyzer

No breaking changes. Default behavior is unchanged (config defaults to false).


Disclaimer: I used AI to assist with code generation. I have manually reviewed the output, and it matches my intention and understanding.

@github-actions (bot) added the following labels on Mar 23, 2026: physical-expr (Changes to the physical-expr crates), core (Core DataFusion crate), common (Related to common crate), physical-plan (Changes to the physical-plan crate), documentation (Improvements or additions to documentation), sqllogictest (SQL Logic Tests (.slt))
Introduce ExpressionAnalyzer, a chain-of-responsibility framework for
expression-level statistics estimation (NDV, selectivity, min/max).

Framework:
- ExpressionAnalyzer trait with registry parameter for chain delegation
- ExpressionAnalyzerRegistry to chain analyzers (first Computed wins)
- DefaultExpressionAnalyzer: Selinger-style estimation for columns,
  literals, binary expressions, NOT, boolean predicates

Integration:
- ExpressionAnalyzerRegistry stored in SessionState, initialized once
- ProjectionExprs stores optional registry (non-breaking, no signature
  changes to project_statistics)
- ProjectionExec sets registry via Projector, injected by planner
- FilterExec uses registry for selectivity when interval analysis
  cannot handle the predicate
- Custom nodes get builtin analyzer as fallback when registry is absent
- Regenerate configs.md for new enable_expression_analyzer option
- Add enable_expression_analyzer to information_schema.slt expected output
- Fix unresolved doc links to SessionState and DefaultExpressionAnalyzer
  (cross-crate references use backticks instead of doc links)
- Simplify config description
@asolimando asolimando force-pushed the asolimando/ndv-expression-analyzer branch from 322b97f to f101c51 Compare March 25, 2026 12:05