[Proposal]: Add UUID conversion to and from 16 byte fixed sequences #100
Open
urmastalimaa wants to merge 2 commits into beam-community:main
UUIDs are often passed around in application code in their canonical
hex-string representation, e.g. "550e8400-e29b-41d4-a716-446655440000".
Encoding a UUID as an Avro "string" takes 37 bytes (a 1-byte length
prefix plus 36 characters), while its binary form fits into a 16 byte
"fixed", saving 21 bytes per encoded UUID.
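
The size figures can be checked with a small sketch. It assumes the standard Avro binary encoding, where a string is written as a zigzag varint length followed by its UTF-8 bytes; the module and function names are hypothetical and not part of this change.

```elixir
defmodule AvroSizeSketch do
  import Bitwise

  # Avro writes a long as a zigzag-mapped, variable-length integer.
  def zigzag(n) when is_integer(n), do: bxor(n <<< 1, n >>> 63)

  # Bytes needed for a non-negative value at 7 payload bits per byte.
  def varint_size(v) when v < 128, do: 1
  def varint_size(v), do: 1 + varint_size(v >>> 7)

  # Encoded size of an Avro "string": length prefix plus the raw bytes.
  def string_size(s) when is_binary(s) do
    varint_size(zigzag(byte_size(s))) + byte_size(s)
  end

  # A "fixed" carries no length prefix, so its encoded size is its declared size.
  def fixed_size(size) when is_integer(size) and size >= 0, do: size
end

uuid = "550e8400-e29b-41d4-a716-446655440000"
saved = AvroSizeSketch.string_size(uuid) - AvroSizeSketch.fixed_size(16)
IO.puts("bytes saved per UUID: #{saved}")
```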
This change allows application code to keep passing around canonical hex
UUIDs while converting to the compact encoding, requiring only
`uuid_format: :canonical_string` to be given in decode options.
The [Java reference implementation][java-implementation] also supports
encoding UUIDs as both strings and 16 byte fixed sequences.
* Encoding is augmented such that, for a 16 byte fixed schema with
`%{"logicalType" => "uuid"}`, a hex-string UUID is converted to the 16
byte binary representation.
* Decoding is augmented such that given `uuid_format: :canonical_string`
in decode options, the binary representation is converted to the
canonical hex-string representation.
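
The two conversions listed above can be sketched with only the Elixir standard library. This is an illustration, not the code in this PR (which uses the `uniq` library); `UuidFixedSketch` and its function names are hypothetical.

```elixir
defmodule UuidFixedSketch do
  # Canonical 36-character string -> 16 raw bytes.
  # The pattern only accepts the 8-4-4-4-12 layout with dashes.
  def to_binary(
        <<a::binary-size(8), ?-, b::binary-size(4), ?-, c::binary-size(4), ?-,
          d::binary-size(4), ?-, e::binary-size(12)>>
      ) do
    Base.decode16!(a <> b <> c <> d <> e, case: :mixed)
  end

  # 16 raw bytes -> canonical lowercase string.
  def to_canonical(
        <<a::binary-size(4), b::binary-size(2), c::binary-size(2),
          d::binary-size(2), e::binary-size(6)>>
      ) do
    [a, b, c, d, e]
    |> Enum.map(&Base.encode16(&1, case: :lower))
    |> Enum.join("-")
  end
end

uuid = "550e8400-e29b-41d4-a716-446655440000"
raw = UuidFixedSketch.to_binary(uuid)
IO.inspect(byte_size(raw), label: "raw size")
IO.inspect(UuidFixedSketch.to_canonical(raw) == uuid, label: "round trip")
```

Pattern matching on the dash positions means malformed input fails with a `FunctionClauseError` rather than being silently converted; a real implementation may prefer to return an error tuple.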
The encoding change is nearly backwards-compatible: previously, when
given an incorrectly sized "fixed" with `{"logicalType": "uuid"}`, an
error was raised, whereas now conversion is attempted.
The decoding change is fully backwards-compatible, as `uuid_format`
defaults to `:binary`.
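
A minimal sketch of why the default keeps decoding backwards-compatible, assuming the option is read with `Keyword.get/3` as in the diff under review. The module name, function name and the `{value, rest}` return shape are hypothetical.

```elixir
defmodule DecodeOptionSketch do
  # Reads one 16 byte "fixed" off the front of the payload and formats it
  # according to the :uuid_format option, which defaults to :binary.
  def decode_uuid_fixed(<<fixed::binary-size(16), rest::binary>>, opts \\ []) do
    case Keyword.get(opts, :uuid_format, :binary) do
      # Default: the caller receives the raw bytes, as before the change.
      :binary -> {fixed, rest}
      # Opt-in: the caller receives the canonical hex string.
      :canonical_string -> {to_canonical(fixed), rest}
    end
  end

  defp to_canonical(
         <<a::binary-size(4), b::binary-size(2), c::binary-size(2),
           d::binary-size(2), e::binary-size(6)>>
       ) do
    [a, b, c, d, e]
    |> Enum.map(&Base.encode16(&1, case: :lower))
    |> Enum.join("-")
  end
end

payload = <<0x55, 0x0E, 0x84, 0x00, 0xE2, 0x9B, 0x41, 0xD4,
            0xA7, 0x16, 0x44, 0x66, 0x55, 0x44, 0x00, 0x00, "tail">>

IO.inspect(DecodeOptionSketch.decode_uuid_fixed(payload))
IO.inspect(DecodeOptionSketch.decode_uuid_fixed(payload, uuid_format: :canonical_string))
```

Callers that pass no options take the `:binary` branch and see the same bytes they saw before.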
For the UUID codec, the `uniq` library was added (it has no transitive
dependencies).
[java-implementation]: https://github.com/apache/avro/blob/230414abbb68e63e68f3b55bfc0cbca94f2737f6/lang/java/avro/src/main/java/org/apache/avro/LogicalTypes.java#L291-L309
e2bfb37 to
472d025
urmastalimaa (Contributor, Author) commented on Feb 12, 2025:

        when is_binary(data) do
          <<fixed::binary-size(size), rest::binary>> = data

          case Keyword.get(opts, :uuid_format, :binary) do

Without any opts-based configuration, the change would be backwards-incompatible.
I'll gladly accept input on whether configuration is necessary at all and, if so, on the key and value names.
5deff6f to
e43f62b
setup-beam does not allow ubuntu-24
e43f62b to
c5abbde