-
Notifications
You must be signed in to change notification settings - Fork 50
Add a Flush API to the Parquet writer to flush the buffered row group.
#213
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. Weβll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
d0390b6 to
815adc2
Compare
815adc2 to
5295e1d
Compare
|
|
||
| Status Flush() override { | ||
| if (row_group_writer_ != nullptr) { | ||
| auto row_group_writer = row_group_writer_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why a temp assignment needed ?
PARQUET_CATCH_NOT_OK(row_group_writer_->Close()); row_group_writer_ = nullptr; does not work ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This ensures that row_group_writer_ is set to nullptr when an exception occurs. When row_group_writer_->Close() fails (e.g., due to an HDFS write error), the upper layer typically calls Close() to terminate writing this Parquet file. If we donβt reset row_group_writer_ to nullptr, the Close() method will invoke row_group_writer_->Close() again. At that point, the data inside row_group_writer_ may already be incomplete and it may throw again; calling row_group_writer_->Close() is no longer meaningful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to use Guard or try-catch(...) to deal with this case.
| } | ||
|
|
||
| void Writer::flush(int64_t rowsInCurrentRowGroup) { | ||
| if (enableFlushBasedOnBlockSize_ && arrowContext_->writer) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this change needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When enableFlushBasedOnBlockSize_ is enabled, the underlying implementation uses BufferedRowGroup, which only takes effect if the Flush API is called. This change ensures the correctness of the Writer::flush semantics and also enables adding tests.
d3a23f8 to
1f2abb3
Compare
|
|
||
| Status Flush() override { | ||
| if (row_group_writer_ != nullptr) { | ||
| auto row_group_writer = row_group_writer_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to use Guard or try-catch(...) to deal with this case.
1f2abb3 to
e4bd229
Compare
What problem does this PR solve?
Issue Number: close #212
Type of Change
Description
Add a Flush API to the Parquet writer to flush the buffered row group.
In scenarios where
BufferedRowGroupis used, the currentNewBufferedRowGroupAPI flushes the current row group and also creates a newBufferedRowGroup. In some cases, we only want to flush the current row group without creating a newBufferedRowGroup, to avoid writing an empty row group.Performance Impact
No Impact: This change does not affect the critical path (e.g., build system, doc, error handling).
Positive Impact: I have run benchmarks.
Click to view Benchmark Results
Negative Impact: Explained below (e.g., trade-off for correctness).
Release Note
Please describe the changes in this PR
Release Note:
Checklist (For Author)
Breaking Changes
No
Yes (Description: ...)
Click to view Breaking Changes