Skip to content

feat(extensions): add unsigned integer extension types (u8, u16, u32, u64)#953

Open
kadinrabo wants to merge 24 commits intosubstrait-io:mainfrom
kadinrabo:feat/unsigned-extension-types
Open

feat(extensions): add unsigned integer extension types (u8, u16, u32, u64)#953
kadinrabo wants to merge 24 commits intosubstrait-io:mainfrom
kadinrabo:feat/unsigned-extension-types

Conversation

@kadinrabo
Copy link
Copy Markdown
Contributor

@kadinrabo kadinrabo commented Jan 29, 2026

Description

Adds unsigned integer types (u8, u16, u32, u64) as first-class extension types with arithmetic function support and test coverage.

  • Self-contained unsigned_integers.yaml with type definitions (string structure encoding) and arithmetic function overloads (add, subtract, multiply, divide, modulus, sum, min, max)
  • functions_arithmetic.yaml is untouched
  • Test cases in tests/cases/arithmetic_unsigned/, following the arithmetic_decimal convention
  • Generic udtArg grammar rule for parsing UDT literals in test cases
  • Test framework updated to scan all extension YAML files for function definitions

Closes #944 and follows up community agreement from Substrait Meeting Notes on 28 Jan 2026 that type variations are not appropriate for unsigned integers due to differing semantics.


This change is Reviewable

@kadinrabo kadinrabo force-pushed the feat/unsigned-extension-types branch from e4584ca to daccb86 Compare January 30, 2026 15:36
@kadinrabo kadinrabo force-pushed the feat/unsigned-extension-types branch from daccb86 to d19c87a Compare January 30, 2026 15:38
@kadinrabo kadinrabo closed this Jan 30, 2026
@kadinrabo kadinrabo reopened this Jan 30, 2026
@kadinrabo kadinrabo force-pushed the feat/unsigned-extension-types branch from 6d28b59 to 4af53f8 Compare January 30, 2026 20:24
@kadinrabo kadinrabo closed this Jan 30, 2026
@kadinrabo kadinrabo reopened this Jan 30, 2026
@kadinrabo kadinrabo force-pushed the feat/unsigned-extension-types branch from 4af53f8 to 7d6f7f9 Compare January 30, 2026 20:53
@kadinrabo kadinrabo closed this Jan 30, 2026
@kadinrabo kadinrabo marked this pull request as ready for review January 30, 2026 21:06
@kadinrabo kadinrabo reopened this Jan 30, 2026
@kadinrabo kadinrabo marked this pull request as draft January 30, 2026 21:07
@kadinrabo kadinrabo force-pushed the feat/unsigned-extension-types branch 3 times, most recently from 8650358 to 10baf81 Compare February 4, 2026 15:53
@kadinrabo kadinrabo marked this pull request as ready for review February 4, 2026 15:59
@kadinrabo kadinrabo force-pushed the feat/unsigned-extension-types branch from 1a8fd73 to 856e63d Compare February 27, 2026 04:05
Copy link
Copy Markdown
Member

@benbellick benbellick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this 🙂

Left a few more comments

-
name: "divide"
description: >
Divide x by y. Partial values are truncated (rounded towards 0).
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens if y is zero?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should include the divide description from

-
name: "divide"
description: >
Divide x by y. In the case of integer division, partial values are truncated (i.e. rounded towards 0).
The `on_division_by_zero` option governs behavior in cases where y is 0. If the option is IEEE then
the IEEE754 standard is followed: all values except +/-infinity return NaN and +/-infinity are unchanged.
If the option is LIMIT then the result is +/-infinity in all cases.
If either x or y are NaN then behavior will be governed by `on_domain_error`.
If x and y are both +/-infinity, behavior will be governed by `on_domain_error`.
impls:
here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can go with something like this, thoughts?

Divide x by y. Partial values are truncated (i.e. rounded towards 0).
The `on_division_by_zero` option governs behavior in cases where y is 0.                                                                                                                                                       
If either x or y are out of range, behavior will be governed by `on_domain_error`.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to have null cases added to these tests? Might be overkill but can't hurt.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to keep parity with signed tests here, which don't have null cases for add/subtract/multiply either. Open to adding them if you feel strongly, but might be better as a follow up

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay with not doing this for now.

Copy link
Copy Markdown
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good overall to me.

The one thing I would like to see is that for the test files, let's map over not just the basic tests but also the overflow, null handling, etc tests that are applicable to usigned integers. I think the floating exception tests are the only ones that don't make apply.

-
name: "divide"
description: >
Divide x by y. Partial values are truncated (rounded towards 0).
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should include the divide description from

-
name: "divide"
description: >
Divide x by y. In the case of integer division, partial values are truncated (i.e. rounded towards 0).
The `on_division_by_zero` option governs behavior in cases where y is 0. If the option is IEEE then
the IEEE754 standard is followed: all values except +/-infinity return NaN and +/-infinity are unchanged.
If the option is LIMIT then the result is +/-infinity in all cases.
If either x or y are NaN then behavior will be governed by `on_domain_error`.
If x and y are both +/-infinity, behavior will be governed by `on_domain_error`.
impls:
here.

Copy link
Copy Markdown
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left one comment on the definition of divide, and I think we should pull modulus out of this entirely because the existing definition could use a once over, but otherwise looks good to me.

The core set of functions you've added, along with the tests files, set a good example for how to add both future functions for unsigned integers, and also for adding new types.

division_type:
values: [ TRUNCATE, FLOOR ]
overflow:
values: [ SILENT, SATURATE, ERROR ]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can... modulus even overflow?

value: u!u8
options:
division_type:
values: [ TRUNCATE, FLOOR ]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is division_type an option we need here. The signed modulus operator defines 2 tests for them:

# division_type: Examples demonstrating truncate and floor division types
modulus(8::i8, -3::i8) [division_type:TRUNCATE] = 2::i8
modulus(8::i8, -3::i8) [division_type:FLOOR] = -1::i8
. I'm not sure this is applicable for unsigned ints.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay with not doing this for now.

value: u!u8
options:
overflow:
values: [ SILENT, SATURATE, ERROR ]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With signed integers, the overflow cases involve a sign change.

# overflow: Examples demonstrating overflow behavior
divide(-9223372036854775808::i64, -1::i64) [overflow:ERROR] = <!ERROR>
divide(-128::i8, -1::i8) [overflow:SATURATE] = 127::i8

With unsigned integers, this case isn't applicable, and in general overflow shouldn't be possible so we can drop this option fully for all of the impls.

@vbarua vbarua added the PMC Ready PRs ready for review by PMCs label Mar 6, 2026
Copy link
Copy Markdown
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good to me, thanks for working on this Kadin.

Looking at this has definitely made me notice some stuff in the existing arithmetic extension that could use improvements as well 🧹

Comment on lines +435 to +436
# Type is "u!" + identifier, e.g., "u!u8"
type_str = "u!" + ctx.Identifier().getText().lower()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this handle YAML dependency reference or that is out of scope in the test?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By yaml dependency reference, are you talking about this feature for extensions?

dependencies:
# For reusing type classes and type variations from other extension files.
# The keys are namespace identifiers that you can then use as dot-separated
# prefix for type class and type variation names in functions and the base
# type class for variations. The values must be extension URNs, following
# the same format and conventions as those used in the proto plans.
type: object
patternProperties:
"^[a-zA-Z_\\$][a-zA-Z0-9_\\$]*$":
type: string

I don't think the test cases support dependency references, yet

substrait/tests/README.md

Lines 92 to 101 in 00bc3c2

### Spec
```
doc := <version>
<include>
(<dependency>)*
((<test_group>)?(<test_case>)+\n)+
version := ### SUBSTRAIT_SCALAR_TEST: <test_library_version>
include := ### SUBSTRAIT_INCLUDE: <uri>
dependency := ### SUBSTRAIT_DEPENDENCY: <uri>

Copy link
Copy Markdown
Contributor

@yongchul yongchul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+0 using string for representation (because we go at length in defining representation in core types (e.g., decimal)) but overall looks good to me.

@vbarua
Copy link
Copy Markdown
Member

vbarua commented Mar 25, 2026

@yongchul what does

+0 using string for representation ...

from you mean here? Is that an approval with no vote?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

PMC Ready PRs ready for review by PMCs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Handling of unsigned numeric types

4 participants