docs: clarify valid URNs by benbellick · Pull Request #881 · substrait-io/substrait

benbellick · 2025-11-03T15:29:32Z

As discovered in this discussion with @mbrobbel, there is a need to clarify a URN ambiguity. The current URN implementations in Java, Python, and Go assume that there are exactly two colons, i.e. they all use the regex `^extension:[^:]+:[^:]+$`.

~~Instead, we clarify that urns as defined here are exactly as in rfc 8141 but with the urn: prefix cut off.~~

We update the documentation to enforce regex ^extension:[^:]+:[^:]+$.

yongchul

Just a heads-up: I'm not a native speaker, so take my nit comments about sentences and grammar with a grain of salt. :)

site/docs/extensions/index.md

site/docs/serialization/binary_serialization.md

proto/substrait/extensions/extensions.proto

vbarua

Left a suggestion. The meaning/parsing of Extension URNs changes slightly if we prefix urn: in front of them, versus replacing extension: with urn:. Let me know what you think.

vbarua · 2025-11-18T18:29:03Z

site/docs/extensions/index.md

 * Table Functions

-To extend these items, developers can create one or more YAML files that describe the properties of each of these extensions. Each YAML file must include a required `urn` field that uniquely identifies the extension. While these identifiers are URN-like but not technically URNs (they lack the `urn:` prefix), they will be referred to as `extension URNs` for clarity.
+To extend these items, developers can create one or more YAML files that describe the properties of each of these extensions. Each YAML file must include a required `urn` field that uniquely identifies the extension. While these identifiers are URN-like but not technically URNs (they lack the `urn:` prefix), they will be referred to as `extension URNs` for clarity. These URNs must be valid [RFC 8141](https://www.rfc-editor.org/rfc/rfc8141.html) format without the `urn:` prefix.


Minor adjustment for clarity.

Suggested change

To extend these items, developers can create one or more YAML files that describe the properties of each of these extensions. Each YAML file must include a required `urn` field that uniquely identifies the extension. While these identifiers are URN-like but not technically URNs (they lack the `urn:` prefix), they will be referred to as `extension URNs` for clarity. These URNs must be valid [RFC 8141](https://www.rfc-editor.org/rfc/rfc8141.html) format without the `urn:` prefix.

To extend these items, users can create one or more YAML files that describe the properties of each of these extensions. Each YAML file must include a required `urn` field that uniquely identifies the extension. These identifiers are URN-like but not technically URNs (they are prefixed with `extension:` instead of `urn:`), and will be referred to as `extension URNs` for clarity.

Extension URNs must be valid [RFC 8141](https://www.rfc-editor.org/rfc/rfc8141.html) URNs when replacing `extension:` with `urn:`.

vbarua · 2025-11-18T18:32:27Z

site/docs/serialization/binary_serialization.md

 Simple extensions within a plan are split into three components: an extension URN, an extension declaration and a number of references.

-* **Extension URN**: A unique identifier for the extension following the format `extension:<OWNER>:<ID>` that identifies a YAML document specifying one or more specific extensions. Declares an anchor that can be used in extension declarations.
+* **Extension URN**: A unique identifier for the extension following the format `extension:<OWNER>:<ID>` that identifies a YAML document specifying one or more specific extensions. Declares an anchor that can be used in extension declarations. The URN with the `urn:` prefix added must conform to [RFC 8141](https://www.rfc-editor.org/rfc/rfc8141.html).


> The URN with the `urn:` prefix added must conform to RFC 8141.

It's a bit weird to say this. The way we've structured them now maps to

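`extension:<owner>:<id>`, where `<owner>` sits in the `<NID>` position and `<id>` in the `<NSS>` position.

I guess they do conform to the RFC if we prefix `urn:`, but the interpretation would technically be different:

`urn:extension:<owner>:<id>`, where `extension` is the `<NID>` and `<owner>:<id>` is the `<NSS>`.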

What do you think about:

The Extension URN with the extension: replaced with urn: must conform to RFC 8141

Hmm... I see the point you are making. However, we don't actually enforce that the structure of the <owner> part is reverse DNS. Why don't we just loosen the restriction on the urn entirely and say:

The urn is required to be a valid URN when urn: is prepended to the string. The format must conform to urn:extension:<Identifier>. The recommended format for the identifier is <Reverse-DNS-Name>:<any-valid-name>. This is consistent with the default substrait extensions and prevents name collisions.

To me, this feels more consistent with the urn spec. Maybe it's just how my brain works, but saying "urn: added to the front makes it a valid URN" makes more sense to me than saying "urn: replacing extension: makes it a valid URN".
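The two readings debated in this thread can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the split is deliberately simplified (it separates `urn:<NID>:<NSS>` at the first two colons and does not check RFC 8141 character rules).

```python
def split_urn(urn: str) -> tuple[str, str]:
    """Split 'urn:<NID>:<NSS>' into (NID, NSS) without validating characters."""
    scheme, nid, nss = urn.split(":", 2)  # raises ValueError if a part is missing
    if scheme.lower() != "urn":
        raise ValueError(f"missing urn: scheme in {urn!r}")
    return nid, nss


def reading_with_urn_prepended(extension_urn: str) -> tuple[str, str]:
    """Interpretation 1: put 'urn:' in front of the whole extension URN."""
    return split_urn("urn:" + extension_urn)


def reading_with_extension_replaced(extension_urn: str) -> tuple[str, str]:
    """Interpretation 2: swap the leading 'extension:' for 'urn:'."""
    prefix = "extension:"
    if not extension_urn.startswith(prefix):
        raise ValueError(f"expected an extension: prefix in {extension_urn!r}")
    return split_urn("urn:" + extension_urn[len(prefix):])


example = "extension:io.substrait:functions_list"
print(reading_with_urn_prepended(example))       # NID is 'extension'
print(reading_with_extension_replaced(example))  # NID is the owner
```

Under the first reading `extension` becomes the NID and the owner moves into the NSS; under the second the owner itself is the NID.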

benbellick · 2025-11-18T19:22:16Z

After a discussion with @vbarua, it makes sense to just enforce that the "URN" is something compliant with the regex ^extension:[^:]+:[^:]+$. We can say it is URN-like, but we may as well be overly restrictive.

~~This regex is exactly what is implemented in the substrait libs for java, go, and python.~~

@jacques-n I have altered the regex to be ^extension:[a-z0-9_.-]+:[a-z0-9_.-]+$.

playground

The required regex is: ^extension:[^:]+:[^:]+$

yongchul · 2025-11-21T04:51:16Z

site/docs/extensions/index.md

 - `OWNER` represents the organization or entity providing the extension and should follow [reverse domain name convention](https://en.wikipedia.org/wiki/Reverse_domain_name_notation) (e.g., `io.substrait`, `com.example`, `org.apache.arrow`) to prevent name collisions
 - `ID` is the specific identifier for the extension (e.g., `functions_arithmetic`, `custom_types`)

+These URNs must match the regex `^extension:[a-zA-Z0-9_.-]+:[a-zA-Z0-9_.-]+$`.


Now that I see it again, why do we allow upper case if we only allow a narrow set of characters?

Do you recommend an even more restrictive urn? How about:
^extension:[a-z0-9_.-]+:[a-z0-9_.-]+$

i.e. the same thing without capital letters.

I updated it to be without capital letters, let me know what you think!
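For comparison, here is a small Python sketch of the three patterns that appear in this PR, showing which candidate strings each one accepts. The dictionary keys and the `accepted_by` helper are hypothetical names used only for illustration; the regexes are copied from the discussion.

```python
import re

# The three regexes that were proposed at different points in the review.
PATTERNS = {
    "loose": re.compile(r"^extension:[^:]+:[^:]+$"),
    "mixed_case": re.compile(r"^extension:[a-zA-Z0-9_.-]+:[a-zA-Z0-9_.-]+$"),
    "lower_only": re.compile(r"^extension:[a-z0-9_.-]+:[a-z0-9_.-]+$"),
}


def accepted_by(urn: str) -> set[str]:
    """Return the names of the patterns that accept `urn`."""
    return {name for name, pattern in PATTERNS.items() if pattern.fullmatch(urn)}


for candidate in (
    "extension:io.substrait:functions_arithmetic",
    "extension:com.Example:custom_types",
    "extension:com.example:my types",
    "extension:a:b:c",
):
    print(candidate, sorted(accepted_by(candidate)))
```

Each tightening only removes strings from the accepted set, so anything valid under the lowercase-only pattern is also valid under the earlier two.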

proto/substrait/extensions/extensions.proto

`^extension:[a-z0-9_.-]+:[a-z0-9_.-]+$`

mbrobbel

Looking at this again I'm wondering if it wouldn't be easier to just use URNs as defined in RFC 8141?

benbellick · 2025-11-23T16:26:56Z

> Looking at this again I'm wondering if it wouldn't be easier to just use URNs as defined in RFC 8141?

@mbrobbel The problem is that defining in terms of RFC 8141 is inherently clunky because of the decision to not include urn: in the string. This leaves us with three choices:

1. clarify that an extension urn called `extension:<rest>` is only valid if `urn:extension:<rest>` is RFC 8141 valid,
2. clarify that an extension urn called `extension:<rest>` is only valid if `urn:<rest>` is RFC 8141 valid, or
3. give our own format which we describe as URN-like and then formally give a regex.

The problem with the first approach above is then extension becomes the NID, and so we have to put restrictions on what the NSS can be anyways.

The second approach is technically fine, but a bit clunky IMO.

The third approach seems simpler all in all as you can check the string against a regex directly. It also gives us more flexibility in the future to add things like versioning to the end when we are ready.

Also, the inspiration for this approach to using URN-like things came from java's maven, which is also not a general purpose URN.

I am open to relying on the RFC 8141 spec, but it doesn't seem to me that that is necessarily the simplest solution. In hindsight I wish that we had included `urn:` at the beginning 😅. Another possible solution is to migrate to using `urn:` at the start and take option 1 (with an extra regex for the `<NSS>`), but that means an extra migration. In which case, we might as well go with 3 for now and tackle 1 later so as not to have two urn-related migrations happening at once.

nielspardon · 2025-12-01T14:45:48Z

I would also prefer to stay closer to the RFC.

> The problem with the first approach above is then extension becomes the NID, and so we have to put restrictions on what the NSS can be anyways.

Wikipedia says that the NID should be registered with IANA according to the RFC which probably makes extension not a good choice for a unique namespace identifier and something like substrait would be a better choice so you could have something like:

urn:substrait:extension:<rest>

benbellick · 2025-12-01T16:35:12Z

> I would also prefer to stay closer to the RFC.
>
> > The problem with the first approach above is then extension becomes the NID, and so we have to put restrictions on what the NSS can be anyways.
>
> Wikipedia says that the NID should be registered with IANA according to the RFC which probably makes extension not a good choice for a unique namespace identifier and something like substrait would be a better choice so you could have something like:
> urn:substrait:extension:<rest>

@nielspardon I do think that that is a good approach. What if we then said that valid URNs are urn:substrait:extension:<rest> where the urn:substrait portion is optional? That way we don't have to do a migration yet, but we could later transition to the fully explicit URN. I would prefer not to do any migration at the moment, considering we are in the middle of the uri -> urn migration.
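A minimal sketch of what an "optional `urn:substrait:` prefix" check could look like in Python. It assumes the lowercase-only character set from earlier in this thread is kept for the `<rest>` part; the combined pattern and the function name are illustrative and are not taken from the PR.

```python
import re

# Assumption: <rest> keeps the lowercase-only <owner>:<id> shape discussed earlier.
OPTIONAL_PREFIX_URN = re.compile(
    r"^(?:urn:substrait:)?extension:[a-z0-9_.-]+:[a-z0-9_.-]+$"
)


def is_valid_extension_urn(urn: str) -> bool:
    """Accept both 'extension:<rest>' and 'urn:substrait:extension:<rest>'."""
    return OPTIONAL_PREFIX_URN.fullmatch(urn) is not None


print(is_valid_extension_urn("extension:io.substrait:functions_list"))
print(is_valid_extension_urn("urn:substrait:extension:io.substrait:functions_list"))
print(is_valid_extension_urn("urn:extension:io.substrait:functions_list"))
```

The non-capturing group makes the whole `urn:substrait:` prefix all-or-nothing, so a partial prefix such as `urn:` alone is rejected.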

nielspardon · 2025-12-01T16:38:26Z

> That way we don't have to do a migration yet, but we could later transition to the fully explicit URN. I would prefer not to do any migration at the moment, considering we are in the middle of the uri -> urn migration.

Sure, we can do the change as a 1.0 item. I'm just saying that if we want to give this another go, we should probably consider that the NID should be something we could register with IANA if we wanted to.

benbellick · 2025-12-01T16:45:49Z

Sounds good. We will still need to introduce some sort of regex to validate the <NSS> component. This would be required to register with IANA anyways. We also may want to withhold from doing any registration until 1.0, when we have a stable idea of what these should look like (e.g. we will want to include version in the string eventually).

yongchul

So, what do we do with this PR?

I'm +1 to `urn:substrait:` being optional forever, then allowing the full URN `urn:substrait:extension:<rest>` with IANA registration at 1.0. Then this PR can move forward with a couple of follow-up tasks. WDYT? @benbellick @nielspardon

benbellick · 2026-03-18T16:53:08Z

@yongchul I would be happy with that decision. Thanks for keeping tabs on this one. I'll update the PR to reflect that today.

So, would we consider the following two urns equivalent for now?

extension:io.substrait:functions_list
urn:substrait:extension:io.substrait_functions_list

What recommendation should we give for which of these go in the YAML file?

nielspardon · 2026-03-18T19:01:25Z

> extension:io.substrait:functions_list
>
> urn:substrait:extension:io.substrait_functions_list

I would use this for the second one:

urn:substrait:extension:io.substrait:functions_list

> What recommendation should we give for which of these go in the YAML file?

I would use the second one.

benbellick · 2026-03-18T19:03:03Z

> > extension:io.substrait:functions_list
> >
> > urn:substrait:extension:io.substrait_functions_list
>
> I would use this for the second one:
>
> urn:substrait:extension:io.substrait:functions_list
>
> > What recommendation should we give for which of these go in the YAML file?
>
> I would use the second one.

Ah my bad, that was a typo! Agreed :)

…rn-structure

…<id> - Update documentation to define canonical URN format - Add backwards compatibility note for extension: prefix - Update all extension YAML files to use canonical form - Update schema to enforce canonical format

benbellick · 2026-03-18T21:51:00Z

text/simple_extensions_schema.yaml

 properties:
  urn:
    type: string
+    pattern: "^urn:substrait:extension:[a-z0-9_.-]+:[a-z0-9_.-]+$"


This is technically a breaking change, as it would force extensions to update their URNs to the new format. However, we can make it so that all of the implementing libraries (go, python, rs, java) do a proper migration, where they can accept either old or new, and emit old.

Later we emit new, and then finally we can consider dropping the old if we would like.

I am probably being paranoid and overdoing it here, but if we do this, I suggest we register substrait with IANA ASAP. We don't have to make this a breaking change right now (simply make the regex treat `urn:substrait:` as optional), but I don't have a strong opinion on this as I do not know how many things will actually break because of this change...

I have no idea what the process looks like or how long it would take to register with IANA, but sounds good to me :)

As for making it optional, it will break people's YAML files, but that would be an easy fix. Then I would imagine that in all of the libraries, we make it so that they can accept both:

urn:substrait:extension:io.substrait:functions_list, and

extension:io.substrait:functions_list

And then they always emit

extension:io.substrait:functions_list

Then one day we can switch them all to emitting

urn:substrait:extension:io.substrait:functions_list

And then we can consider dropping support for the old URN.
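The accept-both, emit-one migration described above could look roughly like the following Python sketch. It is not code from any of the Substrait libraries: the function names and the `emit_canonical` flag are hypothetical, and the `<owner>:<id>` character set is assumed to be the lowercase-only one from the schema change.

```python
import re

# Assumed shape of the <owner>:<id> part, taken from the schema pattern in this PR.
_BODY = r"[a-z0-9_.-]+:[a-z0-9_.-]+"
LEGACY = re.compile(rf"^extension:({_BODY})$")
CANONICAL = re.compile(rf"^urn:substrait:extension:({_BODY})$")


def parse_extension_urn(urn: str) -> str:
    """Return the '<owner>:<id>' part of either accepted spelling."""
    for pattern in (LEGACY, CANONICAL):
        match = pattern.fullmatch(urn)
        if match:
            return match.group(1)
    raise ValueError(f"not a recognized extension URN: {urn!r}")


def emit_extension_urn(urn: str, emit_canonical: bool = False) -> str:
    """Accept old or new input; emit old by default, new once the flag is flipped."""
    body = parse_extension_urn(urn)
    prefix = "urn:substrait:extension:" if emit_canonical else "extension:"
    return prefix + body


# Phase 1: accept both spellings, always emit the old one.
print(emit_extension_urn("urn:substrait:extension:io.substrait:functions_list"))
# Later phase: flip the flag so the new spelling is emitted instead.
print(emit_extension_urn("extension:io.substrait:functions_list", emit_canonical=True))
```

Keeping parsing and emitting separate means the later switch to emitting the new form is a one-flag change, and dropping the old form only requires removing `LEGACY` from the accepted patterns.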

yongchul

I don't have an issue with this but it would be great if we can register substrait as official namespace in IANA registry. Created #1016 to track the registration.

benbellick marked this pull request as ready for review November 3, 2025 15:30

benbellick requested review from EpsilonPrime, cpcloud, jacques-n, vbarua and westonpace as code owners November 3, 2025 15:30

benbellick mentioned this pull request Nov 3, 2025

chore: add some tests validating urn logic substrait-io/substrait-rs#419

Draft

benbellick marked this pull request as draft November 3, 2025 15:56

docs: clarify valid URNs

2b2591c

benbellick force-pushed the ben.bellick/clarify-urn-structure branch from 60c70d7 to 2b2591c Compare November 3, 2025 16:30

benbellick requested a review from mbrobbel November 3, 2025 16:30

benbellick marked this pull request as ready for review November 3, 2025 17:50

yongchul reviewed Nov 3, 2025

site/docs/extensions/index.md · Outdated

site/docs/serialization/binary_serialization.md · Outdated

proto/substrait/extensions/extensions.proto · Outdated

benbellick changed the title ~~docs: clarify valid URNs ':' usage~~ docs: clarify valid URNs Nov 3, 2025

docs: improve documentation surrounding urns

99e8a7f

benbellick requested a review from yongchul November 3, 2025 21:14

yongchul approved these changes Nov 3, 2025

Merge branch 'main' into ben.bellick/clarify-urn-structure

1f7aa85

yongchul approved these changes Nov 7, 2025

vbarua reviewed Nov 18, 2025

benbellick force-pushed the ben.bellick/clarify-urn-structure branch from d0792a9 to a9ed3f5 Compare November 18, 2025 19:52

docs: clarify URN required format with regex

8247607

The required regex is: ^extension:[^:]+:[^:]+$

benbellick force-pushed the ben.bellick/clarify-urn-structure branch from a9ed3f5 to 8247607 Compare November 18, 2025 19:54

benbellick requested review from vbarua and yongchul November 18, 2025 19:54

benbellick added 2 commits November 20, 2025 15:24

Merge branch 'main' into ben.bellick/clarify-urn-structure

fef19a6

tweak: tighten up urn regex and add to extension schema

6ff4172

yongchul reviewed Nov 21, 2025

proto/substrait/extensions/extensions.proto · Outdated

tweak: use more restricive URN regex w/o capital letters

03a9d7f

`^extension:[a-z0-9_.-]+:[a-z0-9_.-]+$`

benbellick requested a review from yongchul November 21, 2025 20:36

mbrobbel reviewed Nov 23, 2025

benbellick mentioned this pull request Mar 10, 2026

feat(core)!: bump substrait to v0.85.0, drop URI support substrait-io/substrait-java#740

Merged

yongchul reviewed Mar 18, 2026

benbellick added 2 commits March 18, 2026 17:35

Merge remote-tracking branch 'origin/main' into ben.bellick/clarify-u…

de7373b

…rn-structure

benbellick requested a review from yongchul March 18, 2026 21:49

benbellick commented Mar 18, 2026

yongchul mentioned this pull request Mar 20, 2026

Register substrait in URN namespace #1016

Open

yongchul approved these changes Mar 20, 2026

5 participants
