Skip to content

Conversation

@luozenglin
Copy link
Collaborator

What problem does this PR solve?

Issue Number: close #212

Type of Change

  • πŸ› Bug fix (non-breaking change which fixes an issue)
  • ✨ New feature (non-breaking change which adds functionality)
  • πŸš€ Performance improvement (optimization)
  • ⚠️ Breaking change (fix or feature that would cause existing functionality to change)
  • πŸ”¨ Refactoring (no logic changes)
  • πŸ”§ Build/CI or Infrastructure changes
  • πŸ“ Documentation only

Description

Add a Flush API to the Parquet writer to flush the buffered row group.

In scenarios where BufferedRowGroup is used, the current NewBufferedRowGroup API flushes the current row group and also creates a new BufferedRowGroup. In some cases, we only want to flush the current row group without creating a new BufferedRowGroup, to avoid writing an empty row group.

Performance Impact

  • No Impact: This change does not affect the critical path (e.g., build system, doc, error handling).

  • Positive Impact: I have run benchmarks.

    Click to view Benchmark Results
    Paste your google-benchmark or TPC-H results here.
    Before: 10.5s
    After:   8.2s  (+20%)
    
  • Negative Impact: Explained below (e.g., trade-off for correctness).

Release Note

Please describe the changes in this PR

Release Note:

Release Note:
- Add a `Flush` API to the Parquet writer to flush the buffered row group.

Checklist (For Author)

  • I have added/updated unit tests (ctest).
  • I have verified the code with local build (Release/Debug).
  • I have run clang-format / linters.
  • (Optional) I have run Sanitizers (ASAN/TSAN) locally for complex C++ changes.
  • No need to test or manual test.

Breaking Changes

  • No

  • Yes (Description: ...)

    Click to view Breaking Changes
    Breaking Changes:
    - Description of the breaking change.
    - Possible solutions or workarounds.
    - Any other relevant information.
    

@luozenglin luozenglin force-pushed the add_writer_flush branch 4 times, most recently from d0390b6 to 815adc2 Compare February 9, 2026 03:49

Status Flush() override {
if (row_group_writer_ != nullptr) {
auto row_group_writer = row_group_writer_;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why a temp assignment needed ?
PARQUET_CATCH_NOT_OK(row_group_writer_->Close()); row_group_writer_ = nullptr; does not work ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ensures that row_group_writer_ is set to nullptr when an exception occurs. When row_group_writer_->Close() fails (e.g., due to an HDFS write error), the upper layer typically calls Close() to terminate writing this Parquet file. If we don’t reset row_group_writer_ to nullptr, the Close() method will invoke row_group_writer_->Close() again. At that point, the data inside row_group_writer_ may already be incomplete and it may throw again; calling row_group_writer_->Close() is no longer meaningful.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to use Guard or try-catch(...) to deal with this case.

}

void Writer::flush(int64_t rowsInCurrentRowGroup) {
if (enableFlushBasedOnBlockSize_ && arrowContext_->writer) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this change needed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When enableFlushBasedOnBlockSize_ is enabled, the underlying implementation uses BufferedRowGroup, which only takes effect if the Flush API is called. This change ensures the correctness of the Writer::flush semantics and also enables adding tests.

@luozenglin luozenglin force-pushed the add_writer_flush branch 2 times, most recently from d3a23f8 to 1f2abb3 Compare February 12, 2026 02:53

Status Flush() override {
if (row_group_writer_ != nullptr) {
auto row_group_writer = row_group_writer_;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to use Guard or try-catch(...) to deal with this case.

@luozenglin luozenglin enabled auto-merge February 12, 2026 06:50
@luozenglin luozenglin added this pull request to the merge queue Feb 12, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to no response for status checks Feb 12, 2026
@luozenglin luozenglin added this pull request to the merge queue Feb 12, 2026
Merged via the queue into bytedance:main with commit e304347 Feb 12, 2026
11 of 12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature] Support flush interface for parquet writer

2 participants