Conversation
60c70d7 to
2b2591c
Compare
yongchul
left a comment
There was a problem hiding this comment.
Just heads up. I'm not a native speaker so take lots of grain of salts with my nit comments about sentence and grammar. :)
vbarua
left a comment
There was a problem hiding this comment.
Left a suggestion. The meaning/parsing of Extension URNs changes slightly if we prefix urn: in front of them, versus replacing extension: with urn:. Let me know what you think.
site/docs/extensions/index.md
Outdated
| * Table Functions | ||
|
|
||
| To extend these items, developers can create one or more YAML files that describe the properties of each of these extensions. Each YAML file must include a required `urn` field that uniquely identifies the extension. While these identifiers are URN-like but not technically URNs (they lack the `urn:` prefix), they will be referred to as `extension URNs` for clarity. | ||
| To extend these items, developers can create one or more YAML files that describe the properties of each of these extensions. Each YAML file must include a required `urn` field that uniquely identifies the extension. While these identifiers are URN-like but not technically URNs (they lack the `urn:` prefix), they will be referred to as `extension URNs` for clarity. These URNs must be valid [RFC 8141](https://www.rfc-editor.org/rfc/rfc8141.html) format without the `urn:` prefix. |
There was a problem hiding this comment.
Minor adjustment for clarity.
| To extend these items, developers can create one or more YAML files that describe the properties of each of these extensions. Each YAML file must include a required `urn` field that uniquely identifies the extension. While these identifiers are URN-like but not technically URNs (they lack the `urn:` prefix), they will be referred to as `extension URNs` for clarity. These URNs must be valid [RFC 8141](https://www.rfc-editor.org/rfc/rfc8141.html) format without the `urn:` prefix. | |
| To extend these items, users can create one or more YAML files that describe the properties of each of these extensions. Each YAML file must include a required `urn` field that uniquely identifies the extension. These identifiers are URN-like but not technically URNs (they are prefixed with `extension:` instead of `urn:`), and will be referred to as `extension URNs` for clarity. | |
| Extension URNs must be valid [RFC 8141](https://www.rfc-editor.org/rfc/rfc8141.html) URNs when replacing `extension:` with `urn:`. |
| Simple extensions within a plan are split into three components: an extension URN, an extension declaration and a number of references. | ||
|
|
||
| * **Extension URN**: A unique identifier for the extension following the format `extension:<OWNER>:<ID>` that identifies a YAML document specifying one or more specific extensions. Declares an anchor that can be used in extension declarations. | ||
| * **Extension URN**: A unique identifier for the extension following the format `extension:<OWNER>:<ID>` that identifies a YAML document specifying one or more specific extensions. Declares an anchor that can be used in extension declarations. The URN with the `urn:` prefix added must conform to [RFC 8141](https://www.rfc-editor.org/rfc/rfc8141.html). |
There was a problem hiding this comment.
The URN with the
urn:prefix added must conform to RFC 8141.
It's a bit weird to say this. The way we've structured them now maps to
| <NID> | <NSS>
extension:<owner>:<id>
I guess they do conform to the RFC if we prefix urn, but the interpretation would be different technically:
| <NID> | <NSS>
urn:extension:<owner>:<id>
What do you think about:
The Extension URN with the
extension:replaced withurn:must conform to RFC 8141
There was a problem hiding this comment.
Hmm... I see the point you are making. However, we don't actually enforce that the structure of the <owner> part is reverse DNS. Why don't we just loosen the restriction on the urn entirely and say:
The
urnis required to be a valid URN whenurn:is prepended to the string. The format must conform tourn:extension:<Identifier>. The recommended format for the identifier is<Reverse-DNS-Name>:<any-valid-name>. This is consistent with the default substrait extensions and prevents name collisions.
To me, this feels more consistent with the urn spec. Maybe its just how my brain works, but saying "urn: added to the front makes it a valid URN" makes more sense to me than saying "urn: replacing extension: makes it a valid URN".
|
@jacques-n I have altered the regex to be |
d0792a9 to
a9ed3f5
Compare
The required regex is: ^extension:[^:]+:[^:]+$
a9ed3f5 to
8247607
Compare
site/docs/extensions/index.md
Outdated
| - `OWNER` represents the organization or entity providing the extension and should follow [reverse domain name convention](https://en.wikipedia.org/wiki/Reverse_domain_name_notation) (e.g., `io.substrait`, `com.example`, `org.apache.arrow`) to prevent name collisions | ||
| - `ID` is the specific identifier for the extension (e.g., `functions_arithmetic`, `custom_types`) | ||
|
|
||
| These URNs must match the regex `^extension:[a-zA-Z0-9_.-]+:[a-zA-Z0-9_.-]+$`. |
There was a problem hiding this comment.
Now that I see it again, why do we allow upper case if we were to allow only narrow set of characters?
There was a problem hiding this comment.
Do you recommend an even more restrictive urn? How about:
^extension:[a-z0-9_.-]+:[a-z0-9_.-]+$
i.e same thing without capital letters.
There was a problem hiding this comment.
I updated it to be without capital letters, let me know what you think!
`^extension:[a-z0-9_.-]+:[a-z0-9_.-]+$`
mbrobbel
left a comment
There was a problem hiding this comment.
Looking at this again I'm wondering if it wouldn't be easier to just use URNs as defined in RFC 8141?
@mbrobbel The problem is that defining in terms of RFC 8141 is inherently clunky because of the decision to not include
The problem with the first approach above is then The second approach is technically fine, but a bit clunky IMO. The third approach seems simpler all in all as you can check the string against a regex directly. It also gives us more flexibility in the future to add things like versioning to the end when we are ready. Also, the inspiration for this approach to using URN-like things came from java's maven, which is also not a general purpose URN. I am open to relying on the RFC 8141 spec, but it doesn't seem to me that that is necessarily the simplest solution. In hindsight I wish that we had included |
|
I would also prefer to stay closer to the RFC.
Wikipedia says that the NID should be registered with IANA according to the RFC which probably makes |
@nielspardon I do think that that is a good approach. What if we then said that valid URNs are |
sure, we can do the change as a 1.0 item. Just saying if we want to give this another go we probably should consider that NID should be something we could register with IANA if we wanted to. |
|
Sounds good. We will still need to introduce some sort of regex to validate the |
yongchul
left a comment
There was a problem hiding this comment.
So, what do we do with this PR?
I'm +1 to urn:substrait: optional forever, then allow furll URN urn:substrait:extension:<rest> with IANA registration on 1.0. Then this PR can move forward with a couple of follow up tasks. WDYT? @benbellick @nielspardon
|
@yongchul I would be happy with that decision. Thanks for keeping tabs on this one. I'll update the PR to reflect that today. So, would weconsider the following two urns equivalent for now?
What recommendation should we give for which of these go in the YAML file? |
I would use this for the second one:
I would use the second one. |
Ah my bad, that was a typo! Agreed :) |
…<id> - Update documentation to define canonical URN format - Add backwards compatibility note for extension: prefix - Update all extension YAML files to use canonical form - Update schema to enforce canonical format
| properties: | ||
| urn: | ||
| type: string | ||
| pattern: "^urn:substrait:extension:[a-z0-9_.-]+:[a-z0-9_.-]+$" |
There was a problem hiding this comment.
This is technically a breaking change, as it would force extensions to update their URNs to the new format. However, we can make it so that all of the implementing libraries (go, python, rs, java) do a proper migration, where they can accept either old and new, and emit old.
Later we emit new, and then finally we can consider dropping the old if we would like.
There was a problem hiding this comment.
I am probably being paranoid and overdoing here but if we do this, I suggest we register substrait with IANA ASAP. We don't have to make this a break change right now (simply make the regex treat urn:substrait: optional) but I don't have a strong opinion on this as I do not know how many things will actually break because of this change...
There was a problem hiding this comment.
I have no idea what the process looks like or how long it would take to register with IANA, but sounds good to me :)
There was a problem hiding this comment.
As for making it optional, it will break peoples YAML files, but that would be an easy fix. Then I would imagine that in all of the libraries, we make it so that they can accept both:
urn:substrait:extension:io.substrait:functions_list, andextension:io.substrait:functions_list
And then they always emit
extension:io.substrait:functions_list
Then one day we can switch them all to emitting
urn:substrait:extension:io.substrait:functions_list
And then we can consider dropping support for the old URN.
As discovered in this discussion with @mbrobbel, there is a need to clarify a URN ambiguity. Current
urnimplementation across java, python, and go assumes that there are exactly two colons, i.e. they all are using the regex^extension:[^:]+:[^:]+$.Instead, we clarify that urns as defined here are exactly as in rfc 8141 but with theurn:prefix cut off.We update the documentation to enforce regex
^extension:[^:]+:[^:]+$.This change is