Skip to content

refactor AccGenericColumn for better data type inference and memory usage statistics.#997

Merged
lihao712 merged 1 commit intomasterfrom
refactor-acc-column
May 23, 2025
Merged

refactor AccGenericColumn for better data type inference and memory usage statistics.#997
lihao712 merged 1 commit intomasterfrom
refactor-acc-column

Conversation

@richox
Copy link
Contributor

@richox richox commented May 22, 2025

No description provided.

@richox richox force-pushed the refactor-acc-column branch from df29dab to 258dce3 Compare May 22, 2025 16:13
@richox richox force-pushed the refactor-acc-column branch from 258dce3 to bdf1dc0 Compare May 22, 2025 16:59
@lihao712 lihao712 merged commit 79fac47 into master May 23, 2025
619 checks passed
@richox richox requested a review from Copilot June 27, 2025 09:55
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Refactors the generic accumulator column into specialized column types to improve data type inference and uninitialized memory allocation, and updates aggregation implementations accordingly.

  • Introduce UninitializedInit for zero-cost buffer allocation in file readers and batch serde.
  • Replace the monolithic AccGenericColumn with AccPrimColumn, AccBooleanColumn, AccBytesColumn, and AccScalarValueColumn, plus helper functions create_acc_generic_column and acc_generic_column_to_array.
  • Update all Agg implementations (sum, maxmin, first, count, collect, bloom_filter, avg, spark_udaf_wrapper) to use the new column structs and remove unwrap() on downcasts.

Reviewed Changes

Copilot reviewed 14 out of 15 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
native-engine/datafusion-ext-plans/src/scan/internal_file_reader.rs Use uninitialized-init buffer allocation in read_fully.
native-engine/datafusion-ext-plans/src/agg/sum.rs Migrate sum to AccPrimColumn and helper APIs.
native-engine/datafusion-ext-plans/src/agg/spark_udaf_wrapper.rs Use Vec::uninitialized_init for JVM byte arrays and ? on downcasts.
native-engine/datafusion-ext-plans/src/agg/maxmin.rs Refactor max/min to specialized Acc*Column types.
native-engine/datafusion-ext-plans/src/agg/first_ignores_null.rs Swap out AccGenericColumn for concrete column types.
native-engine/datafusion-ext-plans/src/agg/first.rs Introduce separate flags column and new accumulator helpers.
native-engine/datafusion-ext-plans/src/agg/count.rs Replace unwrap() on downcasts with error propagation ?.
native-engine/datafusion-ext-plans/src/agg/collect.rs Add guard for small-to-huge conversion and refactor collection.
native-engine/datafusion-ext-plans/src/agg/bloom_filter.rs Switch to specialized columns and remove panicking unwraps.
native-engine/datafusion-ext-plans/src/agg/avg.rs Update AVG to use new column creation and merge patterns.
native-engine/datafusion-ext-plans/src/agg/acc.rs Overhaul AccGenericColumn into concrete column structs and APIs.
native-engine/datafusion-ext-commons/src/lib.rs Add UninitializedInit trait for Vec and SmallVec.
native-engine/datafusion-ext-commons/src/io/batch_serde.rs Apply uninitialized-init to raw batch serialization buffers.
native-engine/datafusion-ext-commons/Cargo.toml Add smallvec dependency for UninitializedInit support.
Comments suppressed due to low confidence (2)

native-engine/datafusion-ext-plans/src/agg/first.rs:157

  • For a Binary data type you must downcast to BinaryArray, not StringArray. Please use downcast_any!(partial_arg, BinaryArray)?.
            DataType::Utf8 => handle_bytes!(downcast_any!(partial_arg, StringArray)?),

native-engine/datafusion-ext-plans/src/agg/collect.rs:514

  • The if let with && is invalid syntax. To add a guard, use if let Self::Small(s) = self if s.len() >= 4 { ... } or nest an if.
            && s.len() >= 4

DataType::Utf8 => handle_bytes!(String),
DataType::Binary => handle_bytes!(Binary),
DataType::Utf8 => handle_bytes!(downcast_any!(partial_arg, StringArray)?),
DataType::Binary => handle_bytes!(downcast_any!(partial_arg, StringArray)?),
Copy link

Copilot AI Jun 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Binary arrays should be downcast to BinaryArray, not StringArray. Update to downcast_any!(partial_arg, BinaryArray)?.

Suggested change
DataType::Binary => handle_bytes!(downcast_any!(partial_arg, StringArray)?),
DataType::Binary => handle_bytes!(downcast_any!(partial_arg, BinaryArray)?),

Copilot uses AI. Check for mistakes.
fn convert_to_huge_if_needed(&mut self, list: &mut AccList) {
if let Self::Small(s) = self {
if let Self::Small(s) = self
&& s.len() >= 4
Copy link

Copilot AI Jun 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The threshold 4 is a magic number. Consider extracting it into a named constant to clarify its purpose.

Suggested change
&& s.len() >= 4
&& s.len() >= SMALL_TO_HUGE_THRESHOLD

Copilot uses AI. Check for mistakes.
@cxzl25 cxzl25 deleted the refactor-acc-column branch September 5, 2025 08:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants