Add a `Flush` API to the Parquet writer to flush the buffered row group. #213

luozenglin · 2026-02-06T08:06:49Z

What problem does this PR solve?

Issue Number: close #212

Type of Change

🐛 Bug fix (non-breaking change which fixes an issue)
✨ New feature (non-breaking change which adds functionality)
🚀 Performance improvement (optimization)
⚠️ Breaking change (fix or feature that would cause existing functionality to change)
🔨 Refactoring (no logic changes)
🔧 Build/CI or Infrastructure changes
📝 Documentation only

Description

Add a Flush API to the Parquet writer to flush the buffered row group.

When BufferedRowGroup is used, the existing NewBufferedRowGroup API flushes the current row group and then creates a new BufferedRowGroup. In some cases we only want to flush the current row group without creating a new one, so that no empty row group is written.
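To make the difference concrete, here is a minimal sketch with a toy writer. `ToyFileWriter`, `ToyRowGroup`, `Append`, and the two helper functions are hypothetical names invented for this illustration; they are not the classes touched by this PR. The toy assumes, as the description above implies, that a row group left open with no rows is still written out on close.

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for a buffered row group; not the real Parquet class.
struct ToyRowGroup {
  int64_t rows = 0;
};

// Hypothetical toy writer that mimics the two behaviours described above.
class ToyFileWriter {
 public:
  // Flushes the current row group (if any), then always opens a new one.
  void NewBufferedRowGroup() {
    Flush();
    current_ = std::make_unique<ToyRowGroup>();
  }

  // Flushes the current row group (if any) without opening a new one.
  void Flush() {
    if (current_ != nullptr) {
      written_.push_back(current_->rows);
      current_.reset();
    }
  }

  // Buffers rows, lazily opening a row group when none is open.
  void Append(int64_t rows) {
    if (current_ == nullptr) {
      current_ = std::make_unique<ToyRowGroup>();
    }
    current_->rows += rows;
  }

  // Closing writes out whatever row group is still open, even an empty one.
  void Close() { Flush(); }

  // Row counts of the row groups written so far, in order.
  const std::vector<int64_t>& written() const { return written_; }

 private:
  std::unique_ptr<ToyRowGroup> current_;
  std::vector<int64_t> written_;
};

// Flushing through NewBufferedRowGroup leaves an empty group open at close.
std::vector<int64_t> FlushViaNewRowGroup() {
  ToyFileWriter writer;
  writer.Append(100);
  writer.NewBufferedRowGroup();
  writer.Close();
  return writer.written();
}

// Flushing through Flush leaves nothing open at close.
std::vector<int64_t> FlushOnly() {
  ToyFileWriter writer;
  writer.Append(100);
  writer.Flush();
  writer.Close();
  return writer.written();
}
```

In this toy model, `FlushViaNewRowGroup()` records two row groups (100 rows and 0 rows), while `FlushOnly()` records only the 100-row group.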

Performance Impact

No Impact: This change does not affect the critical path (e.g., build system, doc, error handling).

Positive Impact: I have run benchmarks.

Negative Impact: Explained below (e.g., trade-off for correctness).

Release Note

Please describe the changes in this PR

Release Note:
- Add a `Flush` API to the Parquet writer to flush the buffered row group.

Checklist (For Author)

I have added/updated unit tests (ctest).
I have verified the code with local build (Release/Debug).
I have run clang-format / linters.
(Optional) I have run Sanitizers (ASAN/TSAN) locally for complex C++ changes.
No need to test or manual test.

Breaking Changes

No

Yes (Description: ...)

guhaiyan0221 · 2026-02-10T03:17:13Z

bolt/dwio/parquet/arrow/Writer.cpp


+  Status Flush() override {
+    if (row_group_writer_ != nullptr) {
+      auto row_group_writer = row_group_writer_;


Why is a temporary assignment needed?
Doesn't `PARQUET_CATCH_NOT_OK(row_group_writer_->Close()); row_group_writer_ = nullptr;` work?

The temporary ensures that row_group_writer_ is reset to nullptr even when an exception occurs. When row_group_writer_->Close() fails (e.g., due to an HDFS write error), the upper layer typically calls Close() to stop writing this Parquet file. If row_group_writer_ has not been reset to nullptr, Close() will invoke row_group_writer_->Close() again. At that point the data inside row_group_writer_ may already be incomplete and the call may throw again, so calling row_group_writer_->Close() a second time is no longer meaningful.

Better to use Guard or try-catch(...) to deal with this case.
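As a rough illustration of the guard idea, the sketch below resets the pointer in a scope guard's destructor so that it is cleared on both the normal and the exceptional path. Everything here is an assumption made for the example: `ScopeGuard`, `ToyRowGroupWriter`, `ToyFileWriter`, and `CloseCallsAfterFailedFlush` are hypothetical names, and the toy uses plain exceptions instead of the `Status` return type and `PARQUET_CATCH_NOT_OK` macro seen in the diff.

```cpp
#include <stdexcept>
#include <utility>

// Minimal scope guard: runs the stored callable when it goes out of scope,
// whether the scope exits normally or through an exception.
template <typename F>
class ScopeGuard {
 public:
  explicit ScopeGuard(F fn) : fn_(std::move(fn)) {}
  ~ScopeGuard() { fn_(); }
  ScopeGuard(const ScopeGuard&) = delete;
  ScopeGuard& operator=(const ScopeGuard&) = delete;

 private:
  F fn_;
};

// Hypothetical row group writer whose Close() can fail.
struct ToyRowGroupWriter {
  bool fail_on_close = false;
  int close_calls = 0;

  void Close() {
    ++close_calls;
    if (fail_on_close) {
      throw std::runtime_error("simulated write error");
    }
  }
};

// Hypothetical file writer holding a non-owning pointer to the row group.
class ToyFileWriter {
 public:
  explicit ToyFileWriter(ToyRowGroupWriter* writer)
      : row_group_writer_(writer) {}

  void Flush() {
    if (row_group_writer_ != nullptr) {
      // The guard clears the member even if Close() throws.
      auto reset = [this] { row_group_writer_ = nullptr; };
      ScopeGuard<decltype(reset)> guard(reset);
      row_group_writer_->Close();
    }
  }

  // Closing the file flushes any row group that is still open.
  void Close() { Flush(); }

 private:
  ToyRowGroupWriter* row_group_writer_;
};

// Returns how many times the row group writer's Close() ran in total when
// Flush() fails and the caller then closes the file.
int CloseCallsAfterFailedFlush() {
  ToyRowGroupWriter row_group;
  row_group.fail_on_close = true;
  ToyFileWriter writer(&row_group);
  try {
    writer.Flush();
  } catch (const std::runtime_error&) {
    // The caller reacts to the failure by closing the file below.
  }
  writer.Close();  // Does not reach the row group writer again.
  return row_group.close_calls;
}
```

With the guard in place, `CloseCallsAfterFailedFlush()` sees a single `Close()` call on the row group writer, because the later file-level `Close()` finds the pointer already reset.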

guhaiyan0221 · 2026-02-10T03:30:01Z

bolt/dwio/parquet/writer/Writer.cpp

 }

 void Writer::flush(int64_t rowsInCurrentRowGroup) {
+  if (enableFlushBasedOnBlockSize_ && arrowContext_->writer) {


Why is this change needed?

When enableFlushBasedOnBlockSize_ is enabled, the underlying implementation uses BufferedRowGroup, so a flush only takes effect if the Flush API is called. This change keeps the Writer::flush semantics correct and also makes it possible to add tests.

guhaiyan0221 · 2026-02-12T03:02:34Z

bolt/dwio/parquet/arrow/Writer.cpp


+  Status Flush() override {
+    if (row_group_writer_ != nullptr) {
+      auto row_group_writer = row_group_writer_;


Better to use Guard or try-catch(...) to deal with this case.

luozenglin force-pushed the add_writer_flush branch 4 times, most recently from d0390b6 to 815adc2 Compare February 9, 2026 03:49

luozenglin requested a review from guhaiyan0221 February 9, 2026 06:32

luozenglin force-pushed the add_writer_flush branch from 815adc2 to 5295e1d Compare February 10, 2026 03:17

guhaiyan0221 reviewed Feb 10, 2026


luozenglin requested a review from guhaiyan0221 February 11, 2026 03:27

luozenglin force-pushed the add_writer_flush branch 2 times, most recently from d3a23f8 to 1f2abb3 Compare February 12, 2026 02:53

guhaiyan0221 approved these changes Feb 12, 2026


Add a Flush API to the Parquet writer to flush the buffered row group.

e4bd229

luozenglin force-pushed the add_writer_flush branch from 1f2abb3 to e4bd229 Compare February 12, 2026 06:50

luozenglin enabled auto-merge February 12, 2026 06:50

luozenglin added this pull request to the merge queue Feb 12, 2026

github-merge-queue bot removed this pull request from the merge queue due to no response for status checks Feb 12, 2026

luozenglin added this pull request to the merge queue Feb 12, 2026

Merged via the queue into bytedance:main with commit e304347 Feb 12, 2026
11 of 12 checks passed
