@zhangxffff zhangxffff commented Jan 29, 2026

What problem does this PR solve?

Issue Number: close #168

  • Add codec abstraction layer in bolt shuffle to support checksum during shuffle operations for data corruption detection
  • Implement multiple compression codecs: GZIP, Snappy, LZ4, LZ4_FRAME, ZSTD with optional checksum validation
  • Add checksumEnabled option to ShuffleReaderOptions and PartitionWriterOptions (enabled by default)
  • Include benchmarks for codec performance and block payload operations
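As a rough illustration of the optional checksum described in the list above, here is a minimal sketch of sealing a block with a 4-byte trailer and verifying it on read. Everything except the `checksumEnabled` name and its default of true is an assumption made for this example: `SketchCodecOptions`, `sealBlock`, `openBlock`, and `sketchCrc32` are hypothetical names, and CRC-32 is only a stand-in for whatever checksum the real codecs use.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical option struct. Only the field name and its default (enabled)
// come from the PR description.
struct SketchCodecOptions {
  bool checksumEnabled = true;
};

// Bitwise CRC-32 (reflected polynomial 0xEDB88320). Stand-in checksum only.
inline uint32_t sketchCrc32(const uint8_t* data, size_t size) {
  uint32_t crc = 0xFFFFFFFFu;
  for (size_t i = 0; i < size; ++i) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; ++bit) {
      crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
  }
  return ~crc;
}

// Copies the payload and, if enabled, appends a 4-byte little-endian CRC.
inline std::vector<uint8_t> sealBlock(
    const std::vector<uint8_t>& payload,
    const SketchCodecOptions& options) {
  std::vector<uint8_t> block = payload;
  if (options.checksumEnabled) {
    const uint32_t crc = sketchCrc32(payload.data(), payload.size());
    for (int shift = 0; shift < 32; shift += 8) {
      block.push_back(static_cast<uint8_t>((crc >> shift) & 0xFFu));
    }
    assert(block.size() == payload.size() + 4);
  }
  return block;
}

// Verifies and strips the trailer. Throws std::runtime_error on a mismatch.
inline std::vector<uint8_t> openBlock(
    const std::vector<uint8_t>& block,
    const SketchCodecOptions& options) {
  if (!options.checksumEnabled) {
    return block;  // Nothing to verify; corruption passes through silently.
  }
  if (block.size() < 4) {
    throw std::runtime_error("block too short to hold a checksum");
  }
  const size_t payloadSize = block.size() - 4;
  uint32_t stored = 0;
  for (size_t i = 0; i < 4; ++i) {
    stored |= static_cast<uint32_t>(block[payloadSize + i]) << (8 * i);
  }
  if (stored != sketchCrc32(block.data(), payloadSize)) {
    throw std::runtime_error("checksum mismatch: block is corrupted");
  }
  return std::vector<uint8_t>(
      block.begin(), block.begin() + static_cast<std::ptrdiff_t>(payloadSize));
}
```

The 4-byte trailer was chosen because the checksum rows in the codec benchmark tables below are 4 bytes larger than their no-checksum counterparts; the actual on-wire layout may differ.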

Benchmark result

BlockPayload Benchmark Results (ZSTD)

Dataset: store_sales.parquet | 2.88M rows | 814MB uncompressed | 119MB compressed (14.6%)

Serialization & Deserialization Performance

| Benchmark | Time/Iter | Iters/s | Throughput (MB/s) |
|---|---|---|---|
| Serialize_NoChecksum | 1.21s | 824m | 831 |
| Serialize_Checksum | 1.25s | 798m | 788 |
| Deserialize_NoChecksum | 401ms | 2.49 | 1942 |
| Deserialize_Checksum | 442ms | 2.26 | 1762 |

Checksum Overhead

| Operation | NoChecksum | Checksum | Overhead |
|---|---|---|---|
| Serialize | 1.21s | 1.25s | +3.3% |
| Deserialize | 401ms | 442ms | +10.2% |

Corruption Detection (1056 tests)

| Mode | Total Tests | Detected | Undetected | Detection Rate |
|---|---|---|---|---|
| No Checksum | 1,056 | 204 | 852 | 19.3% |
| With Checksum | 1,056 | 1,056 | 0 | 100% |

With checksums enabled, all 1,056 corruptions in the test were detected; without them, 852 went undetected.
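To illustrate why a reader without checksums catches only part of the corruption, here is a small sketch of a bit-flip harness. It is not the test used in this PR: the block layout (a 4-byte length header, the payload, and an optional 4-byte CRC-32 trailer) and all names in it are assumptions made for this example. Without a checksum, only flips that break the length header are noticed; with one, every single-bit flip is caught.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct DetectionStats {
  size_t total = 0;     // Number of corrupted blocks tried.
  size_t detected = 0;  // Number the reader rejected.
};

// Bitwise CRC-32 over the first `size` bytes. Stand-in checksum only.
inline uint32_t sketchCrc32(const std::vector<uint8_t>& bytes, size_t size) {
  uint32_t crc = 0xFFFFFFFFu;
  for (size_t i = 0; i < size; ++i) {
    crc ^= bytes[i];
    for (int bit = 0; bit < 8; ++bit) {
      crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
  }
  return ~crc;
}

inline void appendLe32(std::vector<uint8_t>& out, uint32_t value) {
  for (int shift = 0; shift < 32; shift += 8) {
    out.push_back(static_cast<uint8_t>((value >> shift) & 0xFFu));
  }
}

inline uint32_t readLe32(const std::vector<uint8_t>& in, size_t offset) {
  uint32_t value = 0;
  for (size_t i = 0; i < 4; ++i) {
    value |= static_cast<uint32_t>(in[offset + i]) << (8 * i);
  }
  return value;
}

// Hypothetical layout: [length][payload][optional CRC over length + payload].
inline std::vector<uint8_t> buildBlock(
    const std::vector<uint8_t>& payload,
    bool withChecksum) {
  std::vector<uint8_t> block;
  appendLe32(block, static_cast<uint32_t>(payload.size()));
  block.insert(block.end(), payload.begin(), payload.end());
  if (withChecksum) {
    appendLe32(block, sketchCrc32(block, block.size()));
  }
  return block;
}

// Structural check on the length header, plus the CRC check when enabled.
inline bool blockLooksValid(const std::vector<uint8_t>& block, bool withChecksum) {
  const size_t framing = withChecksum ? 8 : 4;
  if (block.size() < framing) {
    return false;
  }
  const size_t payloadSize = block.size() - framing;
  if (readLe32(block, 0) != payloadSize) {
    return false;
  }
  if (!withChecksum) {
    return true;
  }
  return readLe32(block, 4 + payloadSize) == sketchCrc32(block, 4 + payloadSize);
}

// Flips each bit of the block in turn and counts how many flips are rejected.
inline DetectionStats flipEveryBit(
    const std::vector<uint8_t>& payload,
    bool withChecksum) {
  const std::vector<uint8_t> clean = buildBlock(payload, withChecksum);
  assert(blockLooksValid(clean, withChecksum));
  DetectionStats stats;
  for (size_t byte = 0; byte < clean.size(); ++byte) {
    for (int bit = 0; bit < 8; ++bit) {
      std::vector<uint8_t> corrupted = clean;
      corrupted[byte] ^= static_cast<uint8_t>(1u << bit);
      ++stats.total;
      if (!blockLooksValid(corrupted, withChecksum)) {
        ++stats.detected;
      }
    }
  }
  return stats;
}
```

For a 16-byte payload, this sketch rejects 32 of 160 flips without a checksum (only the header bits) and all 192 flips with one. The ratios are specific to this toy layout and are not meant to reproduce the 19.3% figure above.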

Codec Benchmark - Throughput

OneShot Mode

| Codec | Checksum | Compressed (bytes) | Compress (MB/s) | Decompress (MB/s) |
|---|---|---|---|---|
| ZSTD | No | 583,272 | 587.4 | 1,670.1 |
| ZSTD | Yes | 583,276 | 553.5 | 1,484.2 |
| LZ4_FRAME | No | 885,095 | 832.4 | 5,224.5 |
| LZ4_FRAME | Yes | 885,099 | 716.4 | 2,318.2 |
| GZIP | Built-in | 579,671 | 3.5 | 270.2 |
| LZ4 | No | 681,074 | 7.7 | 3,198.9 |
| SNAPPY | No | 834,173 | 420.7 | 826.4 |

Stream Mode

| Codec | Checksum | Compressed (bytes) | Compress (MB/s) | Decompress (MB/s) |
|---|---|---|---|---|
| ZSTD | No | 583,271 | 484.9 | 1,309.7 |
| ZSTD | Yes | 583,275 | 455.8 | 1,171.3 |
| LZ4_FRAME | No | 885,095 | 889.5 | 5,419.4 |
| LZ4_FRAME | Yes | 885,099 | 713.2 | 2,313.7 |
| GZIP | Built-in | 579,671 | 3.5 | 265.4 |

Codec Benchmark - Checksum Overhead

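Values are the relative change in throughput with checksums enabled, computed from the tables above; negative means slower.

| Codec | Mode | Compress Throughput Change | Decompress Throughput Change |
|---|---|---|---|
| ZSTD | OneShot | -5.8% | -11.1% |
| ZSTD | Stream | -6.0% | -10.6% |
| LZ4_FRAME | OneShot | -13.9% | -55.6% |
| LZ4_FRAME | Stream | -19.8% | -57.3% |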
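The percentages in this table appear to be relative changes in throughput, which is why they are negative, while the BlockPayload overhead figures earlier are relative increases in time per iteration. The sketch below shows both calculations; the function names are made up for this example and are not part of the benchmark code.

```cpp
#include <cassert>

// Relative change in throughput, in percent, when checksums are enabled.
// Negative values mean the checksum variant is slower.
inline double throughputChangePercent(double noChecksumMBps, double checksumMBps) {
  assert(noChecksumMBps > 0.0);
  return (checksumMBps - noChecksumMBps) / noChecksumMBps * 100.0;
}

// Relative increase in time per iteration, in percent. Both arguments must
// use the same unit (for example, both in milliseconds).
inline double timeOverheadPercent(double noChecksumTime, double checksumTime) {
  assert(noChecksumTime > 0.0);
  return (checksumTime - noChecksumTime) / noChecksumTime * 100.0;
}

// Examples using numbers from the tables in this description:
//   throughputChangePercent(587.4, 553.5)  is about -5.8  (ZSTD OneShot compress)
//   throughputChangePercent(5224.5, 2318.2) is about -55.6 (LZ4_FRAME OneShot decompress)
//   timeOverheadPercent(401.0, 442.0)      is about +10.2 (BlockPayload deserialize)
```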

Type of Change

  • 🐛 Bug fix (non-breaking change which fixes an issue)
  • ✨ New feature (non-breaking change which adds functionality)
  • 🚀 Performance improvement (optimization)
  • ⚠️ Breaking change (fix or feature that would cause existing functionality to change)
  • 🔨 Refactoring (no logic changes)
  • 🔧 Build/CI or Infrastructure changes
  • 📝 Documentation only

Performance Impact

  • No Impact: This change does not affect the critical path (e.g., build system, doc, error handling).
  • Positive Impact: I have run benchmarks.
  • Negative Impact: Explained below (e.g., trade-off for correctness).
    For Spark on Bolt, we primarily use ZSTD for compression. In the BlockPayload benchmark, enabling ZSTD checksums adds about 3% to serialization time and about 10% to deserialization time. Without checksums, however, data corruption can go undetected and lead to incorrect results. To balance correctness and performance, checksums are enabled by default and can be disabled when performance is the higher priority.

Checklist (For Author)

  • I have added/updated unit tests (ctest).
  • I have verified the code with local build (Release/Debug).
  • I have run clang-format / linters.
  • (Optional) I have run Sanitizers (ASAN/TSAN) locally for complex C++ changes.
  • No need to test or manual test.

Breaking Changes

  • No

  • Yes (Description: ...)


@zhangxffff zhangxffff force-pushed the feat/shuffle_checksum branch 2 times, most recently from 43b26f1 to 51b102c Compare February 2, 2026 08:47
@zhangxffff zhangxffff marked this pull request as draft February 3, 2026 06:41
@zhangxffff zhangxffff force-pushed the feat/shuffle_checksum branch from 96b2a8c to 1866374 Compare February 5, 2026 07:37
@zhangxffff zhangxffff marked this pull request as ready for review February 5, 2026 07:37
@zhangxffff zhangxffff changed the title from "[feat][shuffle] Add codec in bolt to support checksum during shuffle" to "feat(shuffle): Add codec in bolt to support checksum during shuffle" Feb 5, 2026
@zhangxffff zhangxffff requested a review from fzhedu February 5, 2026 15:15
@zhangxffff zhangxffff requested a review from fzhedu February 10, 2026 03:43
@zhangxffff (Collaborator, Author)

@fzhedu Thanks for your comments. I have resolved all of them and made a few other minor changes. Please review the latest code when you have time.

Comment on lines 25 to 27
#include <arrow/util/type_fwd.h>
#include <common/base/Exceptions.h>
#include <common/base/SimdUtil.h>

format


#include "bolt/shuffle/sparksql/compression/GzipCodec.h"

#include <common/base/Exceptions.h>

format

*/

#include "bolt/shuffle/sparksql/compression/Lz4Codec.h"
#include <common/base/Exceptions.h>

format

@zhangxffff zhangxffff force-pushed the feat/shuffle_checksum branch from 631fbad to 34948e4 Compare February 12, 2026 06:42

@fzhedu fzhedu left a comment

LGTM

@zhangxffff zhangxffff enabled auto-merge February 12, 2026 08:07
@zhangxffff zhangxffff added this pull request to the merge queue Feb 12, 2026
@zhangxffff zhangxffff removed this pull request from the merge queue due to a manual request Feb 12, 2026
@zhangxffff zhangxffff added this pull request to the merge queue Feb 12, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to no response for status checks Feb 12, 2026
@zhangxffff zhangxffff added this pull request to the merge queue Feb 12, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 12, 2026
@zhangxffff zhangxffff added this pull request to the merge queue Feb 12, 2026
Merged via the queue into bytedance:main with commit 0ab4ce7 Feb 12, 2026
7 checks passed

Development

Successfully merging this pull request may close these issues.

[Feature] Support checksum in Bolt Pull-Based Shuffle to enhance data integrity