Skip to content

feat(extensions): support int arguments with std_dev and variance functions#1012

Open
nielspardon wants to merge 11 commits intosubstrait-io:mainfrom
nielspardon:par-stddev-int
Open

feat(extensions): support int arguments with std_dev and variance functions#1012
nielspardon wants to merge 11 commits intosubstrait-io:mainfrom
nielspardon:par-stddev-int

Conversation

@nielspardon
Copy link
Copy Markdown
Member

@nielspardon nielspardon commented Mar 17, 2026

This PR depends on #1010 and #1011 since it uses the new enum argument syntax in the test cases and adds additional generated test cases using integer arguments to the generated test cases in #1011.

The main changes for this PR are in commit 714d363.

This PR adds new function signatures for std_dev and variance which accept integer arguments. Most SQL systems support any numeric arguments not just floating point arguments as the current Substrait function signatures suggest.

Integer arguments are used in TPC-DS and are required so we can correctly convert TPC-DS queries with isthmus in substrait-java. See: substrait-io/substrait-java#68


This change is Reviewable

Copy link
Copy Markdown
Contributor

@mbwhite mbwhite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - good to get the complete implementation - and following the pattern of avg

@nielspardon nielspardon force-pushed the par-stddev-int branch 2 times, most recently from 76ff67d to 82379c0 Compare March 20, 2026 20:08
@nielspardon

This comment was marked as outdated.

Copy link
Copy Markdown
Contributor

@yongchul yongchul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for supporting integral types.

@nielspardon nielspardon added the PMC Ready PRs ready for review by PMCs label Mar 23, 2026
…and variance

BREAKING CHANGE: changes the function signature of existing functions std_dev and variance

Signed-off-by: Niels Pardon <par@zurich.ibm.com>
Signed-off-by: Niels Pardon <par@zurich.ibm.com>
Signed-off-by: Niels Pardon <par@zurich.ibm.com>
Signed-off-by: Niels Pardon <par@zurich.ibm.com>
Signed-off-by: Niels Pardon <par@zurich.ibm.com>
Signed-off-by: Niels Pardon <par@zurich.ibm.com>
Signed-off-by: Niels Pardon <par@zurich.ibm.com>
Signed-off-by: Niels Pardon <par@zurich.ibm.com>
@nielspardon
Copy link
Copy Markdown
Member Author

rebased on latest #1011 to remove the deprecation

@vbarua
Copy link
Copy Markdown
Member

vbarua commented Mar 23, 2026

Do we actually need to support calls likestd_dev(<i8>), or should we pushing producers to generate std_dev(<i8>::fp64) to force them to capture their type promotion logic.

following the pattern of avg

This is slightly different from the avg case IMO. The avg functions output the same numeric type as the input. So avg:i8 outputs i8, avg:i16 outputs i16, etc.

For the std_dev and variance functions with numeric inputs, the output is always fp64. There is no difference in output types that we need to capture.

Signed-off-by: Niels Pardon <par@zurich.ibm.com>
…ctions

Signed-off-by: Niels Pardon <par@zurich.ibm.com>
Signed-off-by: Niels Pardon <par@zurich.ibm.com>
@nielspardon
Copy link
Copy Markdown
Member Author

For the std_dev and variance functions with numeric inputs, the output is always fp64. There is no difference in output types that we need to capture.

Currently, the fp32 argument versions also returns an fp32 result:

- args:
- name: x
value: fp32
options:
rounding:
values: [ TIE_TO_EVEN, TIE_AWAY_FROM_ZERO, TRUNCATE, CEILING, FLOOR ]
distribution:
values: [ SAMPLE, POPULATION]
nullability: DECLARED_OUTPUT
return: fp32?

- args:
- name: x
value: fp32
options:
rounding:
values: [ TIE_TO_EVEN, TIE_AWAY_FROM_ZERO, TRUNCATE, CEILING, FLOOR ]
distribution:
values: [ SAMPLE, POPULATION]
nullability: DECLARED_OUTPUT
return: fp32?

@nielspardon
Copy link
Copy Markdown
Member Author

Do we actually need to support calls likestd_dev(<i8>), or should we pushing producers to generate std_dev(<i8>::fp64) to force them to capture their type promotion logic.

I think not offering function signatures with all numeric types makes it only less convenient for Substrait adopters since they need to add and parse extra cast expressions which they otherwise would not need. Sure, we could force them to do so but then I also feel that the Substrait specification is under-specified on the expected casting behavior like e.g. when casting from fp32 to i64 does it round up or down?

@vbarua
Copy link
Copy Markdown
Member

vbarua commented Mar 24, 2026

Currently, the fp32 argument versions also returns an fp32 result:

Should have been clearer. The integer args versions all return fp64.

I also feel that the Substrait specification is under-specified on the expected casting behavior like e.g. when casting from fp32 to i64 does it round up or down?

I agree actually, but that sounds like something worth pinning down as well.

I think not offering function signatures with all numeric types makes it only less convenient for Substrait adopters since they need to add and parse extra cast expressions which they otherwise would not need.

It makes the producers life more difficult, but lessens the burden on the consumer because the surface area of functions to implement/support is smaller. I haven't figured out a good criteria to express when we should push for a plan producer to cast, but I don't think think we need to provide 1-1 functions for everything in SQL. This actually might be a case where it's worth it, but forcing producers to cast is also a mechanism to get them to capture their type coercion behaviors explicitly.

@nielspardon
Copy link
Copy Markdown
Member Author

I also feel that the Substrait specification is under-specified on the expected casting behavior like e.g. when casting from fp32 to i64 does it round up or down?

I agree actually, but that sounds like something worth pinning down as well.

created an issue for the casting docs improvement: #1023

@nielspardon
Copy link
Copy Markdown
Member Author

It makes the producers life more difficult, but lessens the burden on the consumer because the surface area of functions to implement/support is smaller. I haven't figured out a good criteria to express when we should push for a plan producer to cast, but I don't think think we need to provide 1-1 functions for everything in SQL. This actually might be a case where it's worth it, but forcing producers to cast is also a mechanism to get them to capture their type coercion behaviors explicitly.

If you wanted to enforce such a behavior you would have to analyze all of the existing function signatures and deprecate/remove the ones that are "too convenient", not forcing producers to expose their type coercion behavior.

Also this will require a major rewrite of the isthmus aggregate function mapping (at least for std_dev, variance but also for scalar, windowing functions prospectively) since the way it is implemented today does not allow to explicitly coerce types using casts since in the function conversion logic we have only access to field indices of fields used in aggregate functions but not their types.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

PMC Ready PRs ready for review by PMCs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants