
feat(parquet/pqarrow): parallelize SeekToRow #380

Merged
zeroshade merged 2 commits into apache:main from zeroshade:parallelize-seek-rr on May 22, 2025


@zeroshade
Member

Rationale for this change

Closes #379

What changes are included in this PR?

Update the record reader's SeekToRow method so that, when the Parallel option is set to true, the SeekToRow calls for the individual columns run in parallel instead of one after another.
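For readers unfamiliar with the pattern, here is a minimal sketch of fanning a seek out across columns. It is not the code merged in this PR: `columnSeeker`, `seekAll`, and `fakeColumn` are hypothetical names invented for illustration, and it uses only the standard library (`sync.WaitGroup` plus `errors.Join`, which requires Go 1.20 or later), which may differ from how the real record reader coordinates its goroutines.

```go
package main

import (
	"errors"
	"sync"
)

// columnSeeker is a hypothetical stand-in for a per-column reader that can
// reposition itself to a given row. It is not an arrow-go type.
type columnSeeker interface {
	SeekToRow(row int64) error
}

// seekAll repositions every column to the given row. When parallel is true,
// each column seeks in its own goroutine; otherwise columns seek one by one.
func seekAll(cols []columnSeeker, row int64, parallel bool) error {
	if !parallel {
		for _, c := range cols {
			if err := c.SeekToRow(row); err != nil {
				return err
			}
		}
		return nil
	}

	// One error slot per column, so goroutines never write to the same index.
	errs := make([]error, len(cols))
	var wg sync.WaitGroup
	for i, c := range cols {
		wg.Add(1)
		go func(i int, c columnSeeker) {
			defer wg.Done()
			errs[i] = c.SeekToRow(row)
		}(i, c)
	}
	wg.Wait()

	// errors.Join returns nil when every entry is nil.
	return errors.Join(errs...)
}

// fakeColumn is a hypothetical in-memory column used only to exercise seekAll.
type fakeColumn struct {
	pos  int64
	fail bool
}

func (f *fakeColumn) SeekToRow(row int64) error {
	if f.fail {
		return errors.New("seek failed")
	}
	f.pos = row
	return nil
}
```

The benefit of this shape comes from overlapping the latency of each column's seek, which matters most when every seek involves remote I/O; with purely in-memory columns the goroutine overhead may outweigh any gain.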

Are these changes tested?

Yes, the RecordReader SeekToRow is already tested via unit tests with multiple columns.

Are there any user-facing changes?

None beyond the performance benefit.


@zeroshade left a comment


@Sunyue can you try this branch out and let me know if it is sufficient for your issue?

@zeroshade requested a review from lidavidm May 19, 2025 19:18

Sunyue commented May 22, 2025

> @Sunyue can you try this branch out and let me know if it is sufficient for your issue?

Looks good to me. I tested with a 400+ column Parquet file from ADLV2: the seek used to take more than 4 minutes and now takes 15 seconds.

@zeroshade merged commit 3e8e919 into apache:main May 22, 2025
23 checks passed
@zeroshade deleted the parallelize-seek-rr branch May 22, 2025 15:50

Development

Successfully merging this pull request may close these issues.

Follow up for https://github.com/apache/arrow-go/issues/278

3 participants