Skip to content

Support of run end encoded#88

Merged
Alex-PLACET merged 7 commits intosparrow-org:mainfrom
Alex-PLACET:support_of_run_end_encoded
Jan 29, 2026
Merged

Support of run end encoded#88
Alex-PLACET merged 7 commits intosparrow-org:mainfrom
Alex-PLACET:support_of_run_end_encoded

Conversation

@Alex-PLACET
Copy link
Copy Markdown
Collaborator

No description provided.

@Alex-PLACET Alex-PLACET self-assigned this Jan 27, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Jan 27, 2026

Codecov Report

❌ Patch coverage is 89.33333% with 8 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
.../sparrow_ipc/deserialize_run_end_encoded_array.hpp 83.72% 7 Missing ⚠️
src/array_deserializer.cpp 96.29% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds support for deserializing run-end encoded (REE) arrays in the Apache Arrow IPC format. Run-end encoded arrays are a compression technique that stores run end positions and their corresponding values, which is efficient for data with many consecutive repeated values.

Changes:

  • Introduced systematic node_index parameter tracking throughout the deserialization pipeline to properly track FieldNode consumption
  • Added new deserialize_run_end_encoded_array.hpp implementing REE array deserialization logic
  • Extended test coverage to include run-end encoded array test files

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Show a summary per file
File Description
include/sparrow_ipc/deserialize_run_end_encoded_array.hpp New header implementing run-end encoded array deserialization with proper child array handling
src/array_deserializer.hpp Added node_index parameter to all deserializer method signatures and incremented it in primitive array deserializers
src/array_deserializer.cpp Implemented deserialize_run_end_encoded method and updated all existing deserializers to propagate node_index
src/deserialize.cpp Initialized node_index tracking in the main deserialization loop
tests/test_de_serialization_with_files.cpp Added test file path for run-end encoded array integration tests

@Alex-PLACET Alex-PLACET linked an issue Jan 27, 2026 that may be closed by this pull request
Comment thread src/array_deserializer.cpp Outdated
size_t& buffer_index,
size_t&,
size_t& node_index,
size_t& variadic_counts_idx,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

variadic_counts_idx was omitted here so that it gives a hint to the reader that it's not used (and also avoids warnings of unused variables).

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to have a name to avoid any issue in the order of the parameters. I may add this information in the documentation.

Copy link
Copy Markdown
Contributor

@JohanMabille JohanMabille Jan 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the variable is not used, the name should be commented to avoid a warning.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment thread src/array_deserializer.cpp Outdated
size_t& buffer_index,
size_t&,
size_t& node_index,
size_t& variadic_counts_idx,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same remark for variadic_counts_idx.

Comment thread src/array_deserializer.cpp Outdated
size_t& buffer_index,
size_t&,
size_t& node_index,
size_t& variadic_counts_idx,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same (ok I will stop commenting on each line with this as there are many of them, you got the point).

Comment thread src/array_deserializer.cpp Outdated
size_t& variadic_counts_idx,
const org::apache::arrow::flatbuf::Field& field)
{
static array_deserializer instance;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know these array_deserializer.{hpp, cpp} files are getting big (especially the header file - because of recursivity) but I'm not sure it's a good idea to move the instance around. I was meaning to think about this while doing some more refactoring but didn't get into it yet, so maybe we should think of something else.
Maybe @JohanMabille has an opinion already.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't create an object anymore, I call the static function in the type deserializer

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A solution is to have a dedicated file for each serialization function and a function to register the serializers; this way, it's easy for users to register new serializers for extension types for instance.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't create an object anymore, I call the static function in the type deserializer

Yeah but this requires including array_deserializer.hpp in deserialize_run_end_encoded_array.hpp (which is missing) as the array_deserializer::deserialize function is concretely called there and needs the full definition (otherwise leads to incomplete types) which leads to a circular dependency... As deserialize_run_end_encoded_array is not templated (rather add a .cpp file as well and include the header there?).

A solution is to have a dedicated file for each serialization function and a function to register the serializers; this way, it's easy for users to register new serializers for extension types for instance.

We have a map with the different types and corresponding functions which are defined in their dedicated files for non nested types, but we will have circular dependencies with nested templated functions - where we call array_deserializer::deserialize recursively - if we put them in different headers (or I didn't get what you mean).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is that you have a header with the main serializer with something like:

class deserializer
{
public:

    void register_deserializer(id, std::function<R(Args...)>);
    array deserialize(...);
};

Then each specific serializer will include ithis header, register its own deserialization function, and can call deserializer::deserialize recursively

size_t& variadic_counts_idx,
const org::apache::arrow::flatbuf::Field& field)
{
++node_index; // Consume one FieldNode for this fixed-size binary array
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the list view array needed this as well, I'm actually incrementing this in array_deserializer::deserialize so that it's done only once before calling the specialized deserialization functions and called it next_field_node_index, but not sure what's best really... The way it's done here is more verbose but maybe more explicit/clear? What do you think? @Alex-PLACET @JohanMabille?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be the reponsability of each array deserializer to do it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, especially if not all arrays need it.

@@ -0,0 +1,182 @@
#pragma once
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we decide to keep this new file, we should add it to the CMakeLists.txt.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

n_children,
array_children,
nullptr,
std::move(buffers)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we directly construct an empty buffers here? (it would be more explicit).

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

schema_children[0] = new ArrowSchema(std::move(run_ends_arrow_schema));
schema_children[1] = new ArrowSchema(std::move(values_arrow_schema));

const std::string_view format = "+r";
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we fix this in sparrow (or open an issue) to be able to use data_type_to_format instead of hardcoding it?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add a todo and push a fix on sparrow side

@Alex-PLACET Alex-PLACET requested a review from Hind-M January 28, 2026 10:18
@Alex-PLACET Alex-PLACET merged commit c59b8a9 into sparrow-org:main Jan 29, 2026
28 of 30 checks passed
@Alex-PLACET Alex-PLACET deleted the support_of_run_end_encoded branch January 29, 2026 13:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support of run end encoded array

4 participants