Open
Description
Is your feature request related to a problem or challenge?
DataFusion currently loses expression-level statistics when computing plan metadata:
- Projections: any expression that isn't a bare column or literal gets NDV = Absent, even for simple cases like `col + 1` where NDV is trivially derivable from the input
- Filters: when interval analysis cannot handle a predicate (`check_support` returns false), selectivity falls back to a hardcoded 20% regardless of available column statistics
- Custom UDFs: there is no way for users to provide statistics metadata for their functions, making all UDFs opaque to the optimizer
Without expression-level statistics, the optimizer lacks the information it needs for join ordering, cardinality estimation, and cost-based decisions involving computed columns or UDFs. Projects embedding DataFusion currently have no extension point to provide this information for their own functions.
Related: this was previously raised in #992 (closed as non-actionable at the time).
Describe the solution you'd like
A pluggable chain-of-responsibility framework for expression-level statistics, covering:
- Selectivity (predicate filtering fraction)
- NDV (number of distinct values)
- Min/max bounds
- Null fraction
The framework should:
- Ship with a default Selinger-style analyzer handling columns, literals, binary expressions (AND/OR/NOT/comparisons), and arithmetic
- Include built-in analyzers for common function families (string, math, date_part/date_trunc)
- Allow users to register custom analyzers via `SessionState` for UDF-specific or domain-specific estimation (e.g., histogram-based, geometry-aware)
- Integrate into physical operators that need expression-level statistics (projections, filters, joins, aggregates, etc.)
- Be non-breaking and purely additive
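To make the chain-of-responsibility idea concrete, here is a minimal, self-contained sketch restricted to NDV. Everything in it is hypothetical and for illustration only: `Expr`, `ColumnNdv`, `Analysis`, `AnalyzerChain`, `DefaultAnalyzer`, `BucketAnalyzer`, and the `bucket10` UDF are stand-ins, not DataFusion types or the proposed API. Each analyzer either computes an answer or delegates to the next one, and user-registered analyzers are consulted before the default.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified expression tree (not DataFusion's `PhysicalExpr`).
#[derive(Debug, Clone)]
pub enum Expr {
    Column(String),
    Literal(i64),
    Add(Box<Expr>, Box<Expr>),
    Udf(String, Vec<Expr>),
}

/// Hypothetical input statistics: NDV per column name.
pub type ColumnNdv = HashMap<String, u64>;

/// An analyzer either produces an estimate or passes the request on.
pub enum Analysis {
    Computed(u64),
    Delegate,
}

pub trait ExpressionAnalyzer {
    fn ndv(&self, expr: &Expr, input: &ColumnNdv, chain: &AnalyzerChain) -> Analysis;
}

/// Ordered list of analyzers; the first one that computes a value wins.
#[derive(Default)]
pub struct AnalyzerChain {
    analyzers: Vec<Box<dyn ExpressionAnalyzer>>,
}

impl AnalyzerChain {
    pub fn register(&mut self, analyzer: Box<dyn ExpressionAnalyzer>) {
        self.analyzers.push(analyzer);
    }

    /// Returns `None` when every analyzer delegates (statistic unknown).
    pub fn ndv(&self, expr: &Expr, input: &ColumnNdv) -> Option<u64> {
        self.analyzers.iter().find_map(|a| match a.ndv(expr, input, self) {
            Analysis::Computed(n) => Some(n),
            Analysis::Delegate => None,
        })
    }
}

/// Stand-in for the default analyzer: columns, literals, and `x + constant`.
pub struct DefaultAnalyzer;

impl ExpressionAnalyzer for DefaultAnalyzer {
    fn ndv(&self, expr: &Expr, input: &ColumnNdv, chain: &AnalyzerChain) -> Analysis {
        match expr {
            Expr::Column(name) => match input.get(name) {
                Some(n) => Analysis::Computed(*n),
                None => Analysis::Delegate,
            },
            Expr::Literal(_) => Analysis::Computed(1),
            // Children go back through the whole chain so custom analyzers
            // can still answer for nested UDF calls.
            Expr::Add(l, r) => match (chain.ndv(l, input), chain.ndv(r, input)) {
                // Adding a single-valued operand is one-to-one, so the other
                // side's NDV carries over (nulls and overflow are ignored here).
                (Some(n), Some(1)) | (Some(1), Some(n)) => Analysis::Computed(n),
                _ => Analysis::Delegate,
            },
            Expr::Udf(..) => Analysis::Delegate,
        }
    }
}

/// Stand-in for a user-provided analyzer for a hypothetical `bucket10` UDF
/// that maps its input into at most 10 buckets.
pub struct BucketAnalyzer;

impl ExpressionAnalyzer for BucketAnalyzer {
    fn ndv(&self, expr: &Expr, _input: &ColumnNdv, _chain: &AnalyzerChain) -> Analysis {
        match expr {
            Expr::Udf(name, _) if name == "bucket10" => Analysis::Computed(10),
            _ => Analysis::Delegate,
        }
    }
}

/// Custom analyzers are registered first so they take priority over the default.
pub fn example_chain() -> AnalyzerChain {
    let mut chain = AnalyzerChain::default();
    chain.register(Box::new(BucketAnalyzer));
    chain.register(Box::new(DefaultAnalyzer));
    chain
}
```

With this shape, `col + 1` inherits the NDV of `col`, the `bucket10` UDF is answered by the custom analyzer, and an unknown UDF yields `None`, which corresponds to today's "Absent" behavior. How registration would actually be exposed on `SessionState` is left open here.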
Describe alternatives you've considered
- Extending `PhysicalExpr::evaluate_statistics()` (StatisticsV2: initial statistics framework redesign #14699): this provides per-expression statistics but doesn't support chain delegation or user-registered overrides, and would require changes to the `PhysicalExpr` trait
- Hardcoding heuristics in each operator (the status quo): does not scale as more expressions and operators need statistics, and provides no extension point for users
- Distribution-based API (Statistics: Migrate to `Distribution` from `Precision` #14896, StatisticsV2: initial statistics framework redesign #14699): more powerful but significantly more complex to implement and adopt; `ExpressionAnalyzer` can serve as the foundation, with distribution-based estimation plugged in as a custom analyzer
Planned work
Framework
- ExpressionAnalyzer trait, chain-of-responsibility registry, SessionState integration
- Default analyzer with Selinger-style heuristics (columns, literals, binary expressions, NOT)
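As a rough illustration of what "Selinger-style heuristics" could mean for selectivity, the sketch below combines predicate selectivities under the usual independence assumption: `1/NDV` for equality, product for AND, inclusion-exclusion for OR, and complement for NOT. The `Predicate` enum and the fallback constants (1/10 for equality, 1/3 for ranges, the classic System R defaults) are assumptions of this sketch, not values the proposal commits to.

```rust
/// Hypothetical predicate shape used only for this sketch.
#[derive(Debug, Clone)]
pub enum Predicate {
    /// `column = literal`, carrying the column's NDV when it is known.
    Eq { ndv: Option<u64> },
    /// A range comparison such as `column < literal` with no usable bounds.
    Range,
    And(Box<Predicate>, Box<Predicate>),
    Or(Box<Predicate>, Box<Predicate>),
    Not(Box<Predicate>),
}

/// Fallbacks used when no statistics are available (assumed defaults).
const DEFAULT_EQ_SELECTIVITY: f64 = 0.1;
const DEFAULT_RANGE_SELECTIVITY: f64 = 1.0 / 3.0;

/// Estimated fraction of rows that satisfy `p`, always within [0, 1].
pub fn selectivity(p: &Predicate) -> f64 {
    let s = match p {
        // Uniformity assumption: each distinct value is equally likely.
        Predicate::Eq { ndv: Some(n) } if *n > 0 => 1.0 / *n as f64,
        Predicate::Eq { .. } => DEFAULT_EQ_SELECTIVITY,
        Predicate::Range => DEFAULT_RANGE_SELECTIVITY,
        // Independence assumption between the two sides.
        Predicate::And(l, r) => selectivity(l) * selectivity(r),
        Predicate::Or(l, r) => {
            let (a, b) = (selectivity(l), selectivity(r));
            a + b - a * b
        }
        Predicate::Not(inner) => 1.0 - selectivity(inner),
    };
    s.clamp(0.0, 1.0)
}
```

For example, with NDV = 4 on both columns, `a = 1 AND b = 2` is estimated at 0.25 × 0.25 = 0.0625, and `a = 1 OR b = 2` at 0.25 + 0.25 − 0.0625 = 0.4375.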
Built-in analyzers for common functions
- String functions (UPPER, LOWER, TRIM, SUBSTRING, REPLACE, ...)
- Math functions (FLOOR, CEIL, ROUND, ABS, EXP, LN, ...)
- Date/time functions (date_part, date_trunc)
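One possible way to organize the built-in function analyzers is to classify each function by how it can change NDV. The sketch below is an assumption about how that might look, not the planned implementation: `FnClass`, `NdvEstimate`, `classify`, and `output_ndv` are hypothetical names, function names are matched as lowercase strings, and the classification of each function is a simplification (for example, floating-point overflow in `exp` is ignored).

```rust
/// How a scalar function can affect the number of distinct values.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum FnClass {
    /// Treated as one-to-one on its input: NDV is preserved.
    Injective,
    /// May merge distinct inputs: input NDV is only an upper bound.
    Reducing,
    /// Output domain has at most this many values.
    BoundedDomain(u64),
}

/// NDV estimate plus whether it is exact or only an upper bound.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct NdvEstimate {
    pub value: u64,
    pub exact: bool,
}

/// Hypothetical classification table; `part` is the first argument of
/// `date_part`, e.g. "month". Unknown functions return `None`.
pub fn classify(name: &str, part: Option<&str>) -> Option<FnClass> {
    match (name, part) {
        ("exp", _) | ("ln", _) => Some(FnClass::Injective),
        ("upper", _) | ("lower", _) | ("trim", _) => Some(FnClass::Reducing),
        ("floor", _) | ("ceil", _) | ("round", _) | ("abs", _) => Some(FnClass::Reducing),
        ("date_trunc", _) => Some(FnClass::Reducing),
        ("date_part", Some("month")) => Some(FnClass::BoundedDomain(12)),
        ("date_part", Some("hour")) => Some(FnClass::BoundedDomain(24)),
        ("date_part", Some("dow")) => Some(FnClass::BoundedDomain(7)),
        _ => None,
    }
}

/// Derives the output NDV from the input NDV; `None` means "delegate".
pub fn output_ndv(name: &str, part: Option<&str>, input_ndv: u64) -> Option<NdvEstimate> {
    let estimate = match classify(name, part)? {
        FnClass::Injective => NdvEstimate { value: input_ndv, exact: true },
        FnClass::Reducing => NdvEstimate { value: input_ndv, exact: false },
        // The output can exceed neither the domain size nor the input NDV.
        FnClass::BoundedDomain(max) => NdvEstimate { value: input_ndv.min(max), exact: false },
    };
    Some(estimate)
}
```

Under these assumptions, `date_part('month', ts)` over a column with 500 distinct timestamps is estimated at no more than 12 distinct values, while `upper(s)` keeps 500 as an upper bound.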
Operator integration
- Projection: propagate statistics through projected expressions
- Filter: use analyzer selectivity when interval analysis is not applicable
- Joins: expression-aware cardinality estimation for join key expressions
- Aggregates: NDV-based output row estimation for GROUP BY expressions
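For the join and aggregate items, the textbook estimators that expression-level NDV would feed into can be sketched as follows. The function names are hypothetical, and the formulas are the standard ones (equi-join rows ≈ |L| · |R| / max(NDV_L, NDV_R); GROUP BY rows ≈ min(∏ NDV_i, input rows)), not necessarily what DataFusion's operators would end up using.

```rust
/// Equi-join cardinality under the containment assumption: every key value
/// on the side with fewer distinct keys also appears on the other side.
pub fn estimate_join_rows(
    left_rows: u64,
    right_rows: u64,
    left_key_ndv: u64,
    right_key_ndv: u64,
) -> u64 {
    // Guard against a zero NDV to avoid division by zero.
    let denom = left_key_ndv.max(right_key_ndv).max(1) as u128;
    // Widen to u128 so the row-count product cannot overflow.
    let product = left_rows as u128 * right_rows as u128;
    (product / denom).min(u64::MAX as u128) as u64
}

/// GROUP BY output rows: at most one group per combination of distinct
/// values, and never more groups than input rows.
pub fn estimate_group_rows(input_rows: u64, group_ndvs: &[u64]) -> u64 {
    let combinations = group_ndvs
        .iter()
        .fold(1u64, |acc, &n| acc.saturating_mul(n.max(1)));
    combinations.min(input_rows)
}
```

With these formulas, joining 1000 rows to 200 rows on key expressions with NDV 100 and 50 gives 1000 · 200 / 100 = 2000 rows, and grouping 1000 rows by two expressions with NDV 10 and 20 gives min(200, 1000) = 200 groups. Both depend on the key or grouping expressions having a known NDV, which is what the analyzer would supply for computed expressions.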
Additional context
- Related: Expressions should also evaluate on statistics #992 (similar request, closed as non-actionable), Epic: Statistics improvements #8227 (statistics improvements epic), StatisticsV2: initial statistics framework redesign #14699 (expression statistics API), Statistics: Migrate to `Distribution` from `Precision` #14896 (expression statistics tracking)