Support of run end encoded#88
Conversation
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Pull request overview
This PR adds support for deserializing run-end encoded (REE) arrays in the Apache Arrow IPC format. Run-end encoded arrays are a compression technique that stores run end positions and their corresponding values, which is efficient for data with many consecutive repeated values.
Changes:
- Introduced systematic
node_indexparameter tracking throughout the deserialization pipeline to properly track FieldNode consumption - Added new
deserialize_run_end_encoded_array.hppimplementing REE array deserialization logic - Extended test coverage to include run-end encoded array test files
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
include/sparrow_ipc/deserialize_run_end_encoded_array.hpp |
New header implementing run-end encoded array deserialization with proper child array handling |
src/array_deserializer.hpp |
Added node_index parameter to all deserializer method signatures and incremented it in primitive array deserializers |
src/array_deserializer.cpp |
Implemented deserialize_run_end_encoded method and updated all existing deserializers to propagate node_index |
src/deserialize.cpp |
Initialized node_index tracking in the main deserialization loop |
tests/test_de_serialization_with_files.cpp |
Added test file path for run-end encoded array integration tests |
| size_t& buffer_index, | ||
| size_t&, | ||
| size_t& node_index, | ||
| size_t& variadic_counts_idx, |
There was a problem hiding this comment.
variadic_counts_idx was omitted here so that it gives a hint to the reader that it's not used (and also avoids warnings of unused variables).
There was a problem hiding this comment.
I prefer to have a name to avoid any issue in the order of the parameters. I may add this information in the documentation.
There was a problem hiding this comment.
If the variable is not used, the name should be commented to avoid a warning.
| size_t& buffer_index, | ||
| size_t&, | ||
| size_t& node_index, | ||
| size_t& variadic_counts_idx, |
There was a problem hiding this comment.
Same remark for variadic_counts_idx.
| size_t& buffer_index, | ||
| size_t&, | ||
| size_t& node_index, | ||
| size_t& variadic_counts_idx, |
There was a problem hiding this comment.
Same (ok I will stop commenting on each line with this as there are many of them, you got the point).
| size_t& variadic_counts_idx, | ||
| const org::apache::arrow::flatbuf::Field& field) | ||
| { | ||
| static array_deserializer instance; |
There was a problem hiding this comment.
I know these array_deserializer.{hpp, cpp} files are getting big (especially the header file - because of recursivity) but I'm not sure it's a good idea to move the instance around. I was meaning to think about this while doing some more refactoring but didn't get into it yet, so maybe we should think of something else.
Maybe @JohanMabille has an opinion already.
There was a problem hiding this comment.
I don't create an object anymore, I call the static function in the type deserializer
There was a problem hiding this comment.
A solution is to have a dedicated file for each serialization function and a function to register the serializers; this way, it's easy for users to register new serializers for extension types for instance.
There was a problem hiding this comment.
I don't create an object anymore, I call the static function in the type deserializer
Yeah but this requires including array_deserializer.hpp in deserialize_run_end_encoded_array.hpp (which is missing) as the array_deserializer::deserialize function is concretely called there and needs the full definition (otherwise leads to incomplete types) which leads to a circular dependency... As deserialize_run_end_encoded_array is not templated (rather add a .cpp file as well and include the header there?).
A solution is to have a dedicated file for each serialization function and a function to register the serializers; this way, it's easy for users to register new serializers for extension types for instance.
We have a map with the different types and corresponding functions which are defined in their dedicated files for non nested types, but we will have circular dependencies with nested templated functions - where we call array_deserializer::deserialize recursively - if we put them in different headers (or I didn't get what you mean).
There was a problem hiding this comment.
The idea is that you have a header with the main serializer with something like:
class deserializer
{
public:
void register_deserializer(id, std::function<R(Args...)>);
array deserialize(...);
};Then each specific serializer will include ithis header, register its own deserialization function, and can call deserializer::deserialize recursively
| size_t& variadic_counts_idx, | ||
| const org::apache::arrow::flatbuf::Field& field) | ||
| { | ||
| ++node_index; // Consume one FieldNode for this fixed-size binary array |
There was a problem hiding this comment.
As the list view array needed this as well, I'm actually incrementing this in array_deserializer::deserialize so that it's done only once before calling the specialized deserialization functions and called it next_field_node_index, but not sure what's best really... The way it's done here is more verbose but maybe more explicit/clear? What do you think? @Alex-PLACET @JohanMabille?
There was a problem hiding this comment.
I think it should be the reponsability of each array deserializer to do it.
There was a problem hiding this comment.
I agree, especially if not all arrays need it.
| @@ -0,0 +1,182 @@ | |||
| #pragma once | |||
There was a problem hiding this comment.
If we decide to keep this new file, we should add it to the CMakeLists.txt.
| n_children, | ||
| array_children, | ||
| nullptr, | ||
| std::move(buffers) |
There was a problem hiding this comment.
Should we directly construct an empty buffers here? (it would be more explicit).
| schema_children[0] = new ArrowSchema(std::move(run_ends_arrow_schema)); | ||
| schema_children[1] = new ArrowSchema(std::move(values_arrow_schema)); | ||
|
|
||
| const std::string_view format = "+r"; |
There was a problem hiding this comment.
Should we fix this in sparrow (or open an issue) to be able to use data_type_to_format instead of hardcoding it?
There was a problem hiding this comment.
I will add a todo and push a fix on sparrow side
No description provided.