Pluggable expression-level statistics estimation (ExpressionAnalyzer) #21120

### Is your feature request related to a problem or challenge?

DataFusion currently loses expression-level statistics when computing plan metadata:

- Projections: any expression that isn't a bare column or literal gets `NDV = Absent`, even for simple cases like `col + 1` where NDV is trivially derivable from the input
- Filters: when interval analysis cannot handle a predicate (`check_support` returns false), selectivity falls back to a hardcoded 20% regardless of available column statistics
- Custom UDFs: there is no way for users to provide statistics metadata for their functions, making all UDFs opaque to the optimizer

Without expression-level statistics, the optimizer lacks the information it needs for join ordering, cardinality estimation, and cost-based decisions involving computed columns or UDFs. Projects embedding DataFusion currently have no extension point to provide this information for their own functions.

Related: this was previously raised in #992 (closed as non-actionable at the time).

### Describe the solution you'd like

A pluggable chain-of-responsibility framework for expression-level statistics, covering:

1. Selectivity (the fraction of input rows a predicate keeps)
2. NDV (number of distinct values)
3. Min/max bounds
4. Null fraction

The framework should:

- Ship with a default Selinger-style analyzer handling columns, literals, binary expressions (AND/OR, comparisons, arithmetic), and NOT
- Include built-in analyzers for common function families (string, math, date_part/date_trunc)
- Allow users to register custom analyzers via `SessionState` for UDF-specific or domain-specific estimation (e.g., histogram-based, geometry-aware)
- Integrate into physical operators that need expression-level statistics (projections, filters, joins, aggregates, etc.)
- Be non-breaking and purely additive
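To make the chain-of-responsibility idea concrete, here is a minimal, self-contained sketch. It is not the proposed API: `Expr`, `InputStats`, `Analysis`, `AnalyzerRegistry`, `DefaultAnalyzer`, `BucketAnalyzer`, and the `bucket_10` UDF are all hypothetical stand-ins, and only NDV is modeled. Each analyzer either computes an answer or delegates to the next one, and user-registered analyzers are consulted before the default.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Toy expression tree standing in for a physical expression (hypothetical).
#[derive(Debug, Clone)]
enum Expr {
    Column(String),
    Literal(i64),
    Add(Box<Expr>, Box<Expr>),
    Udf { name: String, args: Vec<Expr> },
}

/// Per-column NDV of the input (hypothetical shape).
struct InputStats {
    ndv: HashMap<String, u64>,
}

/// An analyzer either answers or passes the expression down the chain.
enum Analysis {
    Computed(u64),
    Delegate,
}

trait ExpressionAnalyzer {
    fn ndv(&self, expr: &Expr, stats: &InputStats, registry: &AnalyzerRegistry) -> Analysis;
}

/// Ordered chain of analyzers; the first one that computes a value wins.
struct AnalyzerRegistry {
    analyzers: Vec<Arc<dyn ExpressionAnalyzer>>,
}

impl AnalyzerRegistry {
    fn with_default() -> Self {
        let default: Arc<dyn ExpressionAnalyzer> = Arc::new(DefaultAnalyzer);
        Self { analyzers: vec![default] }
    }

    /// User analyzers go to the front so they can override the defaults.
    fn register(&mut self, analyzer: Arc<dyn ExpressionAnalyzer>) {
        self.analyzers.insert(0, analyzer);
    }

    /// Returns `None` when every analyzer delegates (statistic is absent).
    fn ndv(&self, expr: &Expr, stats: &InputStats) -> Option<u64> {
        self.analyzers.iter().find_map(|a| match a.ndv(expr, stats, self) {
            Analysis::Computed(n) => Some(n),
            Analysis::Delegate => None,
        })
    }
}

/// Handles columns, literals, and `expr + literal` (assumed injective, ignoring overflow).
struct DefaultAnalyzer;

impl ExpressionAnalyzer for DefaultAnalyzer {
    fn ndv(&self, expr: &Expr, stats: &InputStats, registry: &AnalyzerRegistry) -> Analysis {
        match expr {
            Expr::Column(name) => match stats.ndv.get(name) {
                Some(n) => Analysis::Computed(*n),
                None => Analysis::Delegate,
            },
            Expr::Literal(_) => Analysis::Computed(1),
            Expr::Add(l, r) => match (l.as_ref(), r.as_ref()) {
                // Adding a constant preserves the NDV of the other operand.
                (e, Expr::Literal(_)) | (Expr::Literal(_), e) => match registry.ndv(e, stats) {
                    Some(n) => Analysis::Computed(n),
                    None => Analysis::Delegate,
                },
                _ => Analysis::Delegate,
            },
            // UDFs are opaque to the default analyzer.
            Expr::Udf { .. } => Analysis::Delegate,
        }
    }
}

/// User-provided analyzer for a hypothetical `bucket_10` UDF with at most 10 outputs.
struct BucketAnalyzer;

impl ExpressionAnalyzer for BucketAnalyzer {
    fn ndv(&self, expr: &Expr, stats: &InputStats, registry: &AnalyzerRegistry) -> Analysis {
        match expr {
            Expr::Udf { name, args } if name.as_str() == "bucket_10" && args.len() == 1 => {
                let input = registry.ndv(&args[0], stats).unwrap_or(u64::MAX);
                Analysis::Computed(input.min(10))
            }
            _ => Analysis::Delegate,
        }
    }
}
```

In this sketch, `a + 1` over a column with 1000 distinct values resolves to 1000 through the default analyzer alone, while `bucket_10(a)` stays absent until `BucketAnalyzer` is registered, after which it resolves to 10. Passing the registry into each analyzer lets a custom analyzer recurse into its arguments through the full chain.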

### Describe alternatives you've considered

- Extending `PhysicalExpr::evaluate_statistics()` (#14699): this provides per-expression statistics but doesn't support chain delegation or user-registered overrides, and would require changes to the `PhysicalExpr` trait
- Hardcoding heuristics in each operator (the status quo): does not scale as more expressions and operators need statistics, and provides no extension point for users
- Distribution-based API (#14896, #14699): more powerful but significantly more complex to implement and adopt; ExpressionAnalyzer can serve as the foundation, with distribution-based estimation plugged in as a custom analyzer

### Planned work

Framework
- [ ] ExpressionAnalyzer trait, chain-of-responsibility registry, SessionState integration
- [ ] Default analyzer with Selinger-style heuristics (columns, literals, binary expressions, NOT)
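
For reference, the classic Selinger (System R) selectivity rules that "Selinger-style" refers to can be sketched as below. The `ColumnStats` and `Predicate` types are hypothetical and exist only for this illustration; the fallback constants (1/10 for equality, 1/3 for ranges) are the textbook System R defaults, not values this proposal commits to.

```rust
/// Hypothetical per-column statistics used only by this sketch.
#[derive(Debug, Clone, Copy)]
struct ColumnStats {
    ndv: Option<u64>,
    min: Option<f64>,
    max: Option<f64>,
}

/// Hypothetical predicate tree: comparisons against a literal plus AND/OR/NOT.
enum Predicate {
    /// `column = literal`
    Eq(ColumnStats),
    /// `column > literal`
    Gt(ColumnStats, f64),
    And(Box<Predicate>, Box<Predicate>),
    Or(Box<Predicate>, Box<Predicate>),
    Not(Box<Predicate>),
}

fn selectivity(p: &Predicate) -> f64 {
    match p {
        // Equality: uniform over distinct values, else the 1/10 default.
        Predicate::Eq(c) => match c.ndv {
            Some(n) if n > 0 => 1.0 / n as f64,
            _ => 0.1,
        },
        // Range: linear interpolation between min and max, else the 1/3 default.
        Predicate::Gt(c, v) => match (c.min, c.max) {
            (Some(lo), Some(hi)) if hi > lo => ((hi - *v) / (hi - lo)).clamp(0.0, 1.0),
            _ => 1.0 / 3.0,
        },
        // AND: independence assumption.
        Predicate::And(a, b) => selectivity(a) * selectivity(b),
        // OR: inclusion-exclusion under independence.
        Predicate::Or(a, b) => {
            let (x, y) = (selectivity(a), selectivity(b));
            x + y - x * y
        }
        // NOT: complement.
        Predicate::Not(a) => 1.0 - selectivity(a),
    }
}
```

With `ndv = 100`, `min = 0`, `max = 100`, the predicate `c = lit` estimates to 0.01 and `c > 75` to 0.25, so their conjunction is 0.0025 and their disjunction 0.2575.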

Built-in analyzers for common functions
- [ ] String functions (UPPER, LOWER, TRIM, SUBSTRING, REPLACE, ...)
- [ ] Math functions (FLOOR, CEIL, ROUND, ABS, EXP, LN, ...)
- [ ] Date/time functions (date_part, date_trunc)

Operator integration
- [ ] Projection: propagate statistics through projected expressions
- [ ] Filter: use analyzer selectivity when interval analysis is not applicable
- [ ] Joins: expression-aware cardinality estimation for join key expressions
- [ ] Aggregates: NDV-based output row estimation for GROUP BY expressions
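
As an illustration of the aggregate item, one common heuristic bounds the number of groups by the product of the per-key NDVs, capped at the input row count. The function below is a hypothetical sketch of that heuristic, not DataFusion's current behavior; falling back to the input row count when any key's NDV is absent is an assumption of this example.

```rust
/// Estimate the number of output groups for a GROUP BY.
/// `key_ndvs` holds the estimated NDV of each grouping expression,
/// or `None` when no estimate is available (hypothetical convention).
fn estimate_group_count(input_rows: u64, key_ndvs: &[Option<u64>]) -> u64 {
    let mut product: u64 = 1;
    for ndv in key_ndvs {
        match ndv {
            // Treat an NDV of 0 as 1 so it cannot zero out the product.
            Some(n) => product = product.saturating_mul((*n).max(1)),
            // Without an NDV for some key, assume every row may form its own group.
            None => return input_rows,
        }
    }
    // There can never be more groups than input rows.
    product.min(input_rows)
}
```

For 1,000,000 input rows grouped by two expressions with NDVs 10 and 50, this yields 500; with only 100 input rows it yields 100.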

### Additional context

- Related: #992 (similar request, closed as non-actionable), #8227 (statistics improvements epic), #14699 (expression statistics API), #14896 (expression statistics tracking)
