Pluggable expression-level statistics estimation (ExpressionAnalyzer) #21120

@asolimando

Is your feature request related to a problem or challenge?

DataFusion currently loses expression-level statistics when computing plan metadata:

  • Projections: any expression that isn't a bare column or literal gets NDV = Absent, even for simple cases like col + 1 where NDV is trivially derivable from the input
  • Filters: when interval analysis cannot handle a predicate (check_support returns false), selectivity falls back to a hardcoded 20% regardless of available column statistics
  • Custom UDFs: there is no way for users to provide statistics metadata for their functions, making all UDFs opaque to the optimizer

Without expression-level statistics, the optimizer lacks the information it needs for join ordering, cardinality estimation, and cost-based decisions involving computed columns or UDFs. Projects embedding DataFusion currently have no extension point to provide this information for their own functions.
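To make the `col + 1` case concrete, here is a minimal sketch of the kind of derivation that is currently lost. The `Expr` enum and `estimate_ndv` function are hypothetical stand-ins for illustration, not DataFusion types; `None` plays the role of an absent statistic.

```rust
use std::collections::HashMap;

/// Toy expression tree (hypothetical; a stand-in for physical expressions).
#[allow(dead_code)]
#[derive(Debug, Clone)]
enum Expr {
    Column(String),
    Literal(i64),
    Add(Box<Expr>, Box<Expr>),
    /// Anything the estimator knows nothing about, e.g. a UDF call.
    Opaque(String),
}

/// Returns an NDV estimate, or `None` when nothing can be derived.
fn estimate_ndv(expr: &Expr, column_ndv: &HashMap<String, u64>) -> Option<u64> {
    match expr {
        // A bare column reuses the input statistic, if there is one.
        Expr::Column(name) => column_ndv.get(name).copied(),
        // A literal has exactly one distinct value.
        Expr::Literal(_) => Some(1),
        Expr::Add(l, r) => match (l.as_ref(), r.as_ref()) {
            // Adding a constant is one-to-one, so the NDV of the other
            // operand carries over unchanged.
            (e, Expr::Literal(_)) | (Expr::Literal(_), e) => estimate_ndv(e, column_ndv),
            // column + column needs more care; give up in this sketch.
            _ => None,
        },
        Expr::Opaque(_) => None,
    }
}
```

With a column `a` whose NDV is 100, `a + 1` yields `Some(100)` here, while an opaque UDF call still yields `None`, which is the gap the custom-analyzer extension point is meant to close.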

Related: this was previously raised in #992 (closed as non-actionable at the time).

Describe the solution you'd like

A pluggable chain-of-responsibility framework for expression-level statistics, covering:

  1. Selectivity (predicate filtering fraction)
  2. NDV (number of distinct values)
  3. Min/max bounds
  4. Null fraction

The framework should:

  • Ship with a default Selinger-style analyzer handling columns, literals, binary expressions (AND/OR/comparisons), NOT, and arithmetic
  • Include built-in analyzers for common function families (string, math, date_part/date_trunc)
  • Allow users to register custom analyzers via SessionState for UDF-specific or domain-specific estimation (e.g., histogram-based, geometry-aware)
  • Integrate into physical operators that need expression-level statistics (projections, filters, joins, aggregates, etc.)
  • Be non-breaking and purely additive
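A rough sketch of what the chain-of-responsibility shape could look like, reduced to NDV only. Apart from the `ExpressionAnalyzer` name from the title, everything here is an assumption made for illustration: the `Analysis` enum, the `AnalyzerRegistry` type, the method signature, and the `bucket_of` UDF are hypothetical, and the real trait would operate on DataFusion's expression and statistics types.

```rust
/// Toy expression (hypothetical stand-in for a physical expression).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Udf { name: String, arg: Box<Expr> },
}

/// Either an answer, or "not mine, ask the next analyzer in the chain".
#[derive(Debug, PartialEq)]
enum Analysis<T> {
    Computed(T),
    Delegate,
}

/// Hypothetical trait shape; only the NDV method is shown.
trait ExpressionAnalyzer {
    fn ndv(&self, expr: &Expr, input_ndv: u64) -> Analysis<u64>;
}

/// Built-in fallback: understands bare columns only in this sketch.
struct DefaultAnalyzer;

impl ExpressionAnalyzer for DefaultAnalyzer {
    fn ndv(&self, expr: &Expr, input_ndv: u64) -> Analysis<u64> {
        match expr {
            Expr::Column(_) => Analysis::Computed(input_ndv),
            _ => Analysis::Delegate,
        }
    }
}

/// User-supplied analyzer that knows `bucket_of(x)` has at most 16 outputs.
struct BucketAnalyzer;

impl ExpressionAnalyzer for BucketAnalyzer {
    fn ndv(&self, expr: &Expr, input_ndv: u64) -> Analysis<u64> {
        match expr {
            Expr::Udf { name, .. } if name == "bucket_of" => {
                Analysis::Computed(input_ndv.min(16))
            }
            _ => Analysis::Delegate,
        }
    }
}

struct AnalyzerRegistry {
    analyzers: Vec<Box<dyn ExpressionAnalyzer>>,
}

impl AnalyzerRegistry {
    /// Starts with only the default analyzer at the end of the chain.
    fn new() -> Self {
        Self { analyzers: vec![Box::new(DefaultAnalyzer)] }
    }

    /// Custom analyzers go in front so they are consulted before the default.
    fn register(&mut self, analyzer: Box<dyn ExpressionAnalyzer>) {
        self.analyzers.insert(0, analyzer);
    }

    /// Walks the chain and returns the first computed answer, if any.
    fn ndv(&self, expr: &Expr, input_ndv: u64) -> Option<u64> {
        self.analyzers.iter().find_map(|a| match a.ndv(expr, input_ndv) {
            Analysis::Computed(v) => Some(v),
            Analysis::Delegate => None,
        })
    }
}
```

Before `BucketAnalyzer` is registered, the UDF call falls through the whole chain and stays unknown; after registration it resolves to at most 16. Returning `Delegate` rather than an error keeps the design additive: an analyzer that does not recognise an expression changes nothing.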

Describe alternatives you've considered

Planned work

Framework

  • ExpressionAnalyzer trait, chain-of-responsibility registry, SessionState integration
  • Default analyzer with Selinger-style heuristics (columns, literals, binary expressions, NOT)
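For reference, a minimal sketch of Selinger-style selectivity rules, assuming independent predicates. The `Predicate` enum is a hypothetical simplification, and the fallback constants (1/10 for equality, 1/3 for ranges) are the defaults usually attributed to the System R paper, not values this proposal commits to.

```rust
/// Simplified predicate tree (hypothetical, for illustration only).
#[derive(Debug, Clone)]
enum Predicate {
    /// `col = literal`, carrying the column's NDV when it is known.
    Eq { ndv: Option<u64> },
    /// `col < literal` and similar, with no usable min/max bounds.
    Range,
    And(Box<Predicate>, Box<Predicate>),
    Or(Box<Predicate>, Box<Predicate>),
    Not(Box<Predicate>),
}

/// Fallbacks used when no column statistics are available (assumed values).
const DEFAULT_EQ_SELECTIVITY: f64 = 0.1;
const DEFAULT_RANGE_SELECTIVITY: f64 = 1.0 / 3.0;

/// Estimated fraction of rows that satisfy the predicate, in [0, 1].
fn selectivity(p: &Predicate) -> f64 {
    let s = match p {
        // Uniform distribution: each distinct value matches 1/NDV of rows.
        Predicate::Eq { ndv: Some(n) } if *n > 0 => 1.0 / *n as f64,
        Predicate::Eq { .. } => DEFAULT_EQ_SELECTIVITY,
        Predicate::Range => DEFAULT_RANGE_SELECTIVITY,
        // Independence assumption: probabilities multiply.
        Predicate::And(l, r) => selectivity(l) * selectivity(r),
        // Inclusion-exclusion under the same independence assumption.
        Predicate::Or(l, r) => {
            let (a, b) = (selectivity(l), selectivity(r));
            a + b - a * b
        }
        Predicate::Not(inner) => 1.0 - selectivity(inner),
    };
    s.clamp(0.0, 1.0)
}
```

For example, `a = 5 AND b < 10` with NDV(a) = 50 comes out at 1/50 × 1/3 ≈ 0.0067, which is far from a flat 20% guess.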

Built-in analyzers for common functions

  • String functions (UPPER, LOWER, TRIM, SUBSTRING, REPLACE, ...)
  • Math functions (FLOOR, CEIL, ROUND, ABS, EXP, LN, ...)
  • Date/time functions (date_part, date_trunc)
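One possible way to think about these function families is by how they can change NDV. The sketch below is an assumption about how such analyzers might reason, not a description of the planned implementation; `FunctionKind`, `NdvEstimate`, `classify`, and `output_ndv` are hypothetical names, and treating EXP/LN as one-to-one ignores floating-point rounding.

```rust
/// How a single-argument function can affect the number of distinct values.
#[derive(Debug, Clone, Copy, PartialEq)]
enum FunctionKind {
    /// Treated as one-to-one on its domain: NDV carries over.
    Injective,
    /// May merge distinct inputs: the input NDV is only an upper bound.
    Reducing,
    /// Output domain is small and known, whatever the input looks like.
    BoundedDomain(u64),
}

#[derive(Debug, PartialEq)]
enum NdvEstimate {
    Exact(u64),
    AtMost(u64),
}

/// Classifies a few functions by name; `part` is the first argument of
/// date_part when it is a literal. Unknown functions return `None`.
fn classify(func: &str, part: Option<&str>) -> Option<FunctionKind> {
    match (func, part) {
        ("exp" | "ln", _) => Some(FunctionKind::Injective),
        ("upper" | "lower" | "trim" | "floor" | "ceil" | "round" | "abs", _) => {
            Some(FunctionKind::Reducing)
        }
        ("date_part", Some("month")) => Some(FunctionKind::BoundedDomain(12)),
        ("date_part", Some("dow")) => Some(FunctionKind::BoundedDomain(7)),
        ("date_part", Some("hour")) => Some(FunctionKind::BoundedDomain(24)),
        _ => None,
    }
}

/// Derives the output NDV from the input NDV and the function's kind.
fn output_ndv(kind: FunctionKind, input_ndv: u64) -> NdvEstimate {
    match kind {
        FunctionKind::Injective => NdvEstimate::Exact(input_ndv),
        FunctionKind::Reducing => NdvEstimate::AtMost(input_ndv),
        // Never more values than the domain allows, nor more than the input.
        FunctionKind::BoundedDomain(k) => NdvEstimate::AtMost(input_ndv.min(k)),
    }
}
```

Under these assumptions, `date_part('month', ts)` over a timestamp column with a million distinct values is capped at 12, and `UPPER(name)` keeps the input NDV as an upper bound instead of dropping it.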

Operator integration

  • Projection: propagate statistics through projected expressions
  • Filter: use analyzer selectivity when interval analysis is not applicable
  • Joins: expression-aware cardinality estimation for join key expressions
  • Aggregates: NDV-based output row estimation for GROUP BY expressions
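To illustrate why operators need expression-level NDV, here is a sketch of two textbook estimates that consume it: the equi-join estimate |L| × |R| / max(NDV_L, NDV_R), and a GROUP BY estimate capped by the input row count. Both assume uniform, independent values; the function names are hypothetical and this is not a description of DataFusion's existing estimators.

```rust
/// Textbook equi-join cardinality: |L| * |R| / max(ndv_l, ndv_r).
/// The NDVs are those of the join key expressions on each side.
fn join_rows(left_rows: u64, right_rows: u64, left_key_ndv: u64, right_key_ndv: u64) -> u64 {
    // Guard against a zero NDV to avoid dividing by zero.
    let max_ndv = left_key_ndv.max(right_key_ndv).max(1);
    // Widen to u128 so the row-count product cannot overflow.
    let product = (left_rows as u128) * (right_rows as u128);
    (product / max_ndv as u128).min(u64::MAX as u128) as u64
}

/// GROUP BY output rows: at most one group per combination of key values,
/// and never more groups than input rows.
fn group_by_rows(input_rows: u64, key_ndvs: &[u64]) -> u64 {
    let combinations = key_ndvs
        .iter()
        .fold(1u64, |acc, &n| acc.saturating_mul(n.max(1)));
    combinations.min(input_rows)
}
```

For instance, joining 1,000 rows to 500 rows on keys with NDVs 100 and 50 gives 5,000 estimated rows, and grouping 10,000 rows by two expressions with NDVs 12 and 7 gives at most 84 groups. If the key NDVs are absent because the keys are computed expressions, neither estimate is available.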

Additional context
