Support of run end encoded by Alex-PLACET · Pull Request #88 · sparrow-org/sparrow-ipc

Alex-PLACET · 2026-01-27T12:48:18Z

No description provided.

codecov · 2026-01-27T13:02:36Z

Codecov Report

❌ Patch coverage is 89.33333% with 8 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
.../sparrow_ipc/deserialize_run_end_encoded_array.hpp	83.72%	7 Missing ⚠️
src/array_deserializer.cpp	96.29%	1 Missing ⚠️

📢 Thoughts on this report? Let us know!

Copilot

Pull request overview

This PR adds support for deserializing run-end encoded (REE) arrays in the Apache Arrow IPC format. Run-end encoded arrays are a compression technique that stores run end positions and their corresponding values, which is efficient for data with many consecutive repeated values.

Changes:

Introduced systematic node_index parameter tracking throughout the deserialization pipeline to properly track FieldNode consumption
Added new deserialize_run_end_encoded_array.hpp implementing REE array deserialization logic
Extended test coverage to include run-end encoded array test files

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
`include/sparrow_ipc/deserialize_run_end_encoded_array.hpp`	New header implementing run-end encoded array deserialization with proper child array handling
`src/array_deserializer.hpp`	Added `node_index` parameter to all deserializer method signatures and incremented it in primitive array deserializers
`src/array_deserializer.cpp`	Implemented `deserialize_run_end_encoded` method and updated all existing deserializers to propagate `node_index`
`src/deserialize.cpp`	Initialized `node_index` tracking in the main deserialization loop
`tests/test_de_serialization_with_files.cpp`	Added test file path for run-end encoded array integration tests

Hind-M · 2026-01-27T15:37:18Z

                                                 size_t& buffer_index,
-                                                 size_t&,
+                                                 size_t& node_index,
+                                                 size_t& variadic_counts_idx,


variadic_counts_idx was omitted here so that it gives a hint to the reader that it's not used (and also avoids warnings of unused variables).

I prefer to have a name to avoid any issue in the order of the parameters. I may add this information in the documentation.

If the variable is not used, the name should be commented to avoid a warning.

Hind-M · 2026-01-27T15:37:49Z

                                                          size_t& buffer_index,
-                                                          size_t&,
+                                                          size_t& node_index,
+                                                          size_t& variadic_counts_idx,


Same remark for variadic_counts_idx.

Hind-M · 2026-01-27T15:38:32Z

                                                       size_t& buffer_index,
-                                                       size_t&,
+                                                       size_t& node_index,
+                                                       size_t& variadic_counts_idx,


Same (ok I will stop commenting on each line with this as there are many of them, you got the point).

Hind-M · 2026-01-27T15:47:35Z

+        size_t& variadic_counts_idx,
+        const org::apache::arrow::flatbuf::Field& field)
+    {
+        static array_deserializer instance;


I know these array_deserializer.{hpp, cpp} files are getting big (especially the header file - because of recursivity) but I'm not sure it's a good idea to move the instance around. I was meaning to think about this while doing some more refactoring but didn't get into it yet, so maybe we should think of something else.
Maybe @JohanMabille has an opinion already.

I don't create an object anymore, I call the static function in the type deserializer

A solution is to have a dedicated file for each serialization function and a function to register the serializers; this way, it's easy for users to register new serializers for extension types for instance.

I don't create an object anymore, I call the static function in the type deserializer

Yeah but this requires including array_deserializer.hpp in deserialize_run_end_encoded_array.hpp (which is missing) as the array_deserializer::deserialize function is concretely called there and needs the full definition (otherwise leads to incomplete types) which leads to a circular dependency... As deserialize_run_end_encoded_array is not templated (rather add a .cpp file as well and include the header there?).

A solution is to have a dedicated file for each serialization function and a function to register the serializers; this way, it's easy for users to register new serializers for extension types for instance.

We have a map with the different types and corresponding functions which are defined in their dedicated files for non nested types, but we will have circular dependencies with nested templated functions - where we call array_deserializer::deserialize recursively - if we put them in different headers (or I didn't get what you mean).

The idea is that you have a header with the main serializer with something like:

class deserializer { public: void register_deserializer(id, std::function<R(Args...)>); array deserialize(...); };

Then each specific serializer will include ithis header, register its own deserialization function, and can call deserializer::deserialize recursively

Hind-M · 2026-01-27T15:56:12Z

+                                                 size_t& variadic_counts_idx,
                                                 const org::apache::arrow::flatbuf::Field& field)
    {
+        ++node_index;  // Consume one FieldNode for this fixed-size binary array


As the list view array needed this as well, I'm actually incrementing this in array_deserializer::deserialize so that it's done only once before calling the specialized deserialization functions and called it next_field_node_index, but not sure what's best really... The way it's done here is more verbose but maybe more explicit/clear? What do you think? @Alex-PLACET @JohanMabille?

I think it should be the reponsability of each array deserializer to do it.

I agree, especially if not all arrays need it.

Hind-M · 2026-01-27T15:57:35Z

@@ -0,0 +1,182 @@
+#pragma once


If we decide to keep this new file, we should add it to the CMakeLists.txt.

Hind-M · 2026-01-27T16:07:38Z

+            n_children,
+            array_children,
+            nullptr,
+            std::move(buffers)


Should we directly construct an empty buffers here? (it would be more explicit).

Hind-M · 2026-01-27T16:10:58Z

+        schema_children[0] = new ArrowSchema(std::move(run_ends_arrow_schema));
+        schema_children[1] = new ArrowSchema(std::move(values_arrow_schema));
+
+        const std::string_view format = "+r";


Should we fix this in sparrow (or open an issue) to be able to use data_type_to_format instead of hardcoding it?

I will add a todo and push a fix on sparrow side

Alex-PLACET added 3 commits January 27, 2026 11:02

Add the support of run_end_encoded

89dca8a

wip

242851e

remove comments

45495fb

Alex-PLACET self-assigned this Jan 27, 2026

remove useless doc

cbb607d

Alex-PLACET requested review from Hind-M and Copilot January 27, 2026 14:20

Copilot started reviewing on behalf of Alex-PLACET January 27, 2026 14:20 View session

Copilot AI reviewed Jan 27, 2026

View reviewed changes

Alex-PLACET linked an issue Jan 27, 2026 that may be closed by this pull request

Support of run end encoded array #47

Closed

Hind-M reviewed Jan 27, 2026

View reviewed changes

Alex-PLACET added 2 commits January 28, 2026 09:17

wip

ac5d24d

wip

7b08ff2

Alex-PLACET requested a review from Hind-M January 28, 2026 10:18

comment unused parameters

e9a956f

Hind-M approved these changes Jan 29, 2026

View reviewed changes

Alex-PLACET merged commit c59b8a9 into sparrow-org:main Jan 29, 2026
28 of 30 checks passed

Alex-PLACET deleted the support_of_run_end_encoded branch January 29, 2026 13:18

Conversation

Alex-PLACET commented Jan 27, 2026

Uh oh!

codecov Bot commented Jan 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

JohanMabille Jan 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

codecov Bot commented Jan 27, 2026 •

edited

Loading

JohanMabille Jan 28, 2026 •

edited

Loading