Skip to content

feat(extensions): support deprecation info in extensions#1014

Merged
yongchul merged 12 commits intosubstrait-io:mainfrom
yongchul:extension_deprecation
Mar 20, 2026
Merged

feat(extensions): support deprecation info in extensions#1014
yongchul merged 12 commits intosubstrait-io:mainfrom
yongchul:extension_deprecation

Conversation

@yongchul
Copy link
Copy Markdown
Contributor

@yongchul yongchul commented Mar 19, 2026

There are a few recent PRs making breaking changes to extensions but we can do this more gracefully.

This PR proposes adding deprecation information to all kinds of extensions: type, type variations, scalar, aggregate, and window functions so that we can present deprecation information to the consumers. This also makes possible tooling leverage this information to produce warnings.

AI Disclaimer

  • Claude Opus 4.6
  • Example YAML was generated by AI
  • Documentation was edited by AI

@yongchul yongchul added the PMC Ready PRs ready for review by PMCs label Mar 19, 2026
Copy link
Copy Markdown
Member

@benbellick benbellick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments. Also wondering if there are any markdown doc updates we want to do as well? Thanks!

Copy link
Copy Markdown
Member

@benbellick benbellick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm overall in support, just the one comment about simplifying things

Comment on lines +173 to +174
* `semanticVersion` — a [semantic version](https://semver.org) string for extensions that maintain their own versioning independent of Substrait releases.
* `genericVersion` — a freeform versioning scheme with `name` (the versioning scheme name) and `version` (the version string) fields. The interpretation is up to the extension authors.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I say let's keep things simple for now and just have only substraitVersion and genericVersion, though I don't feel super strong about this. It just feels to me like adding semanticVersion is just adding a special case of genericVersion.

If others would prefer having both then that is fine with me.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also in favor of keeping things simple. Couldn't we just add an optional version field for each YAML file and if it's not set it is implicitly the substrait version?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nielspardon if we put file-level default, then we need to make reason required if I understood correctly how this YAML thing works? I think it is okay to just have it per entry level and being explicit as we don't have too many deprecated stuff...

Also, @benbellick and I discussed offline briefly about introducing per YAML file version and deprecation can be with respect to the YAML file version. We may explore this idea with future URN changes. So please let me know what do you think.

@benbellick I just put semanticVersion there because it is such a popular scheme. But in the end, we still need to know which version you need to follow... without anchoring point, there is no meaning to compare verisons. Will remove the semanticVersion.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without anchoring point, there is no meaning to compare verisons.

Yes, that was also my main thought with introducing a version per YAML file since then you can deprecate relative to the YAML file version. My understanding is that for the default YAML files that come with the Substrait spec we use the same version as the Substrait spec by convention. 3rd party extension YAMLs would need to explicitly define a version per file.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we assume for a second that this relatively short extension file was a 3rd party one:

%YAML 1.2
---
urn: extension:org.example:functions_aggregate_approx
version: 1.2.3
aggregate_functions:
  - name: "approx_count_distinct"
    description: >-
      Calculates the approximate number of rows that contain distinct values of the expression argument using
      HyperLogLog. This function provides an alternative to the COUNT (DISTINCT expression) function, which
      returns the exact number of rows that contain distinct values of an expression. APPROX_COUNT_DISTINCT
      processes large amounts of data significantly faster than COUNT, with negligible deviation from the exact
      result.
    deprecated:
      since: 1.2.0
      reason: deprecated in favor of exact count distinct
    impls:
      - args:
          - name: x
            value: any
        nullability: DECLARED_OUTPUT
        decomposable: MANY
        intermediate: binary
        return: i64

then it is clear that since is relative to the top-level version number

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for default Substrait extensions you would just not populate the version field since it would be the same as the Substrait version of the Plan message:

%YAML 1.2
---
urn: extension:io.substrait:functions_aggregate_approx
aggregate_functions:
  - name: "approx_count_distinct"
    description: >-
      Calculates the approximate number of rows that contain distinct values of the expression argument using
      HyperLogLog. This function provides an alternative to the COUNT (DISTINCT expression) function, which
      returns the exact number of rows that contain distinct values of an expression. APPROX_COUNT_DISTINCT
      processes large amounts of data significantly faster than COUNT, with negligible deviation from the exact
      result.
    deprecated:
      since: 1.2.0
      reason: deprecated in favor of exact count distinct
    impls:
      - args:
          - name: x
            value: any
        nullability: DECLARED_OUTPUT
        decomposable: MANY
        intermediate: binary
        return: i64

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think introducing versioning for YAML files opens up a whole can of worms. We may want to tackle that eventually, but it is worth its own dedicated PR and discussion.

At the current moment, there is nothing stopping extension authors outside of the core extension set from managing their own internal version schema via one of the metadata endpoints.

So what are you actually proposing here? adding the multitude of versioning fields as you initially proposed and then clean up later? I'm a bit confused.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, I'm okay with dropping my proposal. You can technically put any extra deprecation info you want into the existing metadata fields.

I was just trying to allow a way for unofficial extensions to mark themselves as deprecated in an "official" way without (so their deprecation has nothing to do with substrait versioning). My suggestion was to just add an extension point where people can put whatever string they want.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw. another simple solution that avoids the whole versioning debate would be to just add a boolean deprecated field as I did here not being aware that YongChul started this PR: 44c2c5b

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nielspardon thanks for the example. Now I got what you mentioned and I think I like the idea. In fact, adding boolean field was indeed my first approach but then I realized that it is not particularly useful without which release it has been deprecated for official documentation. So here we are. 😄

The main points for this is following.

  • versioning scheme
  • granularity of versioning

For Substrait, it appears that we all agree that we use Substrait release [granularity] version [semantic version].

For third party, we could leave the version field free form (i.e., just string and not enforcing anything) for maximal flexibility but we lose systematic validation and perhaps tracking as well.
Or, we can enforce the version must be following semantic versioning scheme.
Or, we can hand wave that if it look like semantic.

That's where my attempt to genericSemanticVersion to handle things in systematic way...

Here is probably the middle ground, just to unblock the official extensions.

deprecated:
  since: "X.Y.Z" # this can be extended with oneOf to support other convention but by default it is semantic version string.

and that's it. If third party happens to want to use it, it can as long as they use semantic version string (and of course use the metadata field as well).

We can continue debate whether we want file level version field or not in other PR or issue.

How does that sound?

yongchul and others added 2 commits March 19, 2026 13:49
Co-authored-by: Ben Bellick <36523439+benbellick@users.noreply.github.com>
type: boolean
semanticVersionString:
type: string
pattern: "^[0-9]+\\.[0-9]+\\.[0-9]+$"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😢 semver recommended regex is quite complex

We could just include it and put a comment to that link?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pondered this, and the expression is too complex and we don't need that complexity (capturing tags and release) as a spec IMO. So I'll stick to the simpler regex for now. We can always increase the complexity later. 😄

…tionally underspecify third party extensions for future development.
since:
description: >-
Version since when the item is deprecated.
$ref: "#/$defs/coreSemanticVersionString"
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@benbellick @nielspardon thanks for reviewing and discussion!

PTAL this simplified version -- almost going back to the initial proposal.

  • since is now only semantic version string.
  • If we need to add more explicit versioning scheme, we can make the since as oneof in a backward compatible way to incorporate other scheme.
  • Documented explicitly what this means for core substrait extensions and intentionally underspecified third-party extensions.
  • In a follow up PR or issue, we can further discuss standardizing the deprecation in third party extensions. We can actually punt this until someone in pain steps up. ;-)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is sufficient to unblock the breaking changes in core extensions in more graceful way.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, makes sense to me and also thanks for being mindful of unblocking other PRs quickly depending on this. hopefully there are enough other PMCs members around so we can get the minimum approvals required in a reasonable time

Copy link
Copy Markdown
Member

@nielspardon nielspardon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once small comment in the example, but otherwise this looks good to me.

@yongchul yongchul merged commit 25d87ef into substrait-io:main Mar 20, 2026
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

PMC Ready PRs ready for review by PMCs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants