feat(extensions): support deprecation info in extensions#1014
feat(extensions): support deprecation info in extensions#1014yongchul merged 12 commits intosubstrait-io:mainfrom
Conversation
benbellick
left a comment
There was a problem hiding this comment.
Left a few comments. Also wondering if there are any markdown doc updates we want to do as well? Thanks!
benbellick
left a comment
There was a problem hiding this comment.
I'm overall in support, just the one comment about simplifying things
site/docs/extensions/index.md
Outdated
| * `semanticVersion` — a [semantic version](https://semver.org) string for extensions that maintain their own versioning independent of Substrait releases. | ||
| * `genericVersion` — a freeform versioning scheme with `name` (the versioning scheme name) and `version` (the version string) fields. The interpretation is up to the extension authors. |
There was a problem hiding this comment.
I say let's keep things simple for now and just have only substraitVersion and genericVersion, though I don't feel super strong about this. It just feels to me like adding semanticVersion is just adding a special case of genericVersion.
If others would prefer having both then that is fine with me.
There was a problem hiding this comment.
I'm also in favor of keeping things simple. Couldn't we just add an optional version field for each YAML file and if it's not set it is implicitly the substrait version?
There was a problem hiding this comment.
@nielspardon if we put file-level default, then we need to make reason required if I understood correctly how this YAML thing works? I think it is okay to just have it per entry level and being explicit as we don't have too many deprecated stuff...
Also, @benbellick and I discussed offline briefly about introducing per YAML file version and deprecation can be with respect to the YAML file version. We may explore this idea with future URN changes. So please let me know what do you think.
@benbellick I just put semanticVersion there because it is such a popular scheme. But in the end, we still need to know which version you need to follow... without anchoring point, there is no meaning to compare verisons. Will remove the semanticVersion.
There was a problem hiding this comment.
without anchoring point, there is no meaning to compare verisons.
Yes, that was also my main thought with introducing a version per YAML file since then you can deprecate relative to the YAML file version. My understanding is that for the default YAML files that come with the Substrait spec we use the same version as the Substrait spec by convention. 3rd party extension YAMLs would need to explicitly define a version per file.
There was a problem hiding this comment.
if we assume for a second that this relatively short extension file was a 3rd party one:
%YAML 1.2
---
urn: extension:org.example:functions_aggregate_approx
version: 1.2.3
aggregate_functions:
- name: "approx_count_distinct"
description: >-
Calculates the approximate number of rows that contain distinct values of the expression argument using
HyperLogLog. This function provides an alternative to the COUNT (DISTINCT expression) function, which
returns the exact number of rows that contain distinct values of an expression. APPROX_COUNT_DISTINCT
processes large amounts of data significantly faster than COUNT, with negligible deviation from the exact
result.
deprecated:
since: 1.2.0
reason: deprecated in favor of exact count distinct
impls:
- args:
- name: x
value: any
nullability: DECLARED_OUTPUT
decomposable: MANY
intermediate: binary
return: i64
then it is clear that since is relative to the top-level version number
There was a problem hiding this comment.
for default Substrait extensions you would just not populate the version field since it would be the same as the Substrait version of the Plan message:
%YAML 1.2
---
urn: extension:io.substrait:functions_aggregate_approx
aggregate_functions:
- name: "approx_count_distinct"
description: >-
Calculates the approximate number of rows that contain distinct values of the expression argument using
HyperLogLog. This function provides an alternative to the COUNT (DISTINCT expression) function, which
returns the exact number of rows that contain distinct values of an expression. APPROX_COUNT_DISTINCT
processes large amounts of data significantly faster than COUNT, with negligible deviation from the exact
result.
deprecated:
since: 1.2.0
reason: deprecated in favor of exact count distinct
impls:
- args:
- name: x
value: any
nullability: DECLARED_OUTPUT
decomposable: MANY
intermediate: binary
return: i64
There was a problem hiding this comment.
I think introducing versioning for YAML files opens up a whole can of worms. We may want to tackle that eventually, but it is worth its own dedicated PR and discussion.
At the current moment, there is nothing stopping extension authors outside of the core extension set from managing their own internal version schema via one of the
metadataendpoints.
So what are you actually proposing here? adding the multitude of versioning fields as you initially proposed and then clean up later? I'm a bit confused.
There was a problem hiding this comment.
Honestly, I'm okay with dropping my proposal. You can technically put any extra deprecation info you want into the existing metadata fields.
I was just trying to allow a way for unofficial extensions to mark themselves as deprecated in an "official" way without (so their deprecation has nothing to do with substrait versioning). My suggestion was to just add an extension point where people can put whatever string they want.
There was a problem hiding this comment.
btw. another simple solution that avoids the whole versioning debate would be to just add a boolean deprecated field as I did here not being aware that YongChul started this PR: 44c2c5b
There was a problem hiding this comment.
@nielspardon thanks for the example. Now I got what you mentioned and I think I like the idea. In fact, adding boolean field was indeed my first approach but then I realized that it is not particularly useful without which release it has been deprecated for official documentation. So here we are. 😄
The main points for this is following.
- versioning scheme
- granularity of versioning
For Substrait, it appears that we all agree that we use Substrait release [granularity] version [semantic version].
For third party, we could leave the version field free form (i.e., just string and not enforcing anything) for maximal flexibility but we lose systematic validation and perhaps tracking as well.
Or, we can enforce the version must be following semantic versioning scheme.
Or, we can hand wave that if it look like semantic.
That's where my attempt to genericSemanticVersion to handle things in systematic way...
Here is probably the middle ground, just to unblock the official extensions.
deprecated:
since: "X.Y.Z" # this can be extended with oneOf to support other convention but by default it is semantic version string.and that's it. If third party happens to want to use it, it can as long as they use semantic version string (and of course use the metadata field as well).
We can continue debate whether we want file level version field or not in other PR or issue.
How does that sound?
Co-authored-by: Ben Bellick <36523439+benbellick@users.noreply.github.com>
text/simple_extensions_schema.yaml
Outdated
| type: boolean | ||
| semanticVersionString: | ||
| type: string | ||
| pattern: "^[0-9]+\\.[0-9]+\\.[0-9]+$" |
There was a problem hiding this comment.
😢 semver recommended regex is quite complex
We could just include it and put a comment to that link?
There was a problem hiding this comment.
Pondered this, and the expression is too complex and we don't need that complexity (capturing tags and release) as a spec IMO. So I'll stick to the simpler regex for now. We can always increase the complexity later. 😄
…tionally underspecify third party extensions for future development.
| since: | ||
| description: >- | ||
| Version since when the item is deprecated. | ||
| $ref: "#/$defs/coreSemanticVersionString" |
There was a problem hiding this comment.
@benbellick @nielspardon thanks for reviewing and discussion!
PTAL this simplified version -- almost going back to the initial proposal.
sinceis now only semantic version string.- If we need to add more explicit versioning scheme, we can make the
sinceas oneof in a backward compatible way to incorporate other scheme. - Documented explicitly what this means for core substrait extensions and intentionally underspecified third-party extensions.
- In a follow up PR or issue, we can further discuss standardizing the deprecation in third party extensions. We can actually punt this until someone in pain steps up. ;-)
There was a problem hiding this comment.
I believe this is sufficient to unblock the breaking changes in core extensions in more graceful way.
There was a problem hiding this comment.
thanks, makes sense to me and also thanks for being mindful of unblocking other PRs quickly depending on this. hopefully there are enough other PMCs members around so we can get the minimum approvals required in a reasonable time
vbarua
left a comment
There was a problem hiding this comment.
Once small comment in the example, but otherwise this looks good to me.
There are a few recent PRs making breaking changes to extensions but we can do this more gracefully.
This PR proposes adding deprecation information to all kinds of extensions:
type,type variations,scalar,aggregate, andwindowfunctions so that we can present deprecation information to the consumers. This also makes possible tooling leverage this information to produce warnings.AI Disclaimer