feat(shuffle): Add codec in bolt to support checksum during shuffle #180
@fzhedu Thanks for your comments. I have resolved all of them and made a few other minor changes; please review the latest code when you have time.
#include <arrow/util/type_fwd.h>
#include <common/base/Exceptions.h>
#include <common/base/SimdUtil.h>
format
#include "bolt/shuffle/sparksql/compression/GzipCodec.h"

#include <common/base/Exceptions.h>
format
*/

#include "bolt/shuffle/sparksql/compression/Lz4Codec.h"
#include <common/base/Exceptions.h>
format
fzhedu left a comment
LGTM
What problem does this PR solve?
Issue Number: close #168
Benchmark result
BlockPayload Benchmark Results (ZSTD)
Dataset: store_sales.parquet | 2.88M rows | 814MB uncompressed | 119MB compressed (14.6%)
Serialization & Deserialization Performance
Checksum Overhead
Corruption Detection (1056 tests)
The checksum detected every corruption injected in the tests.
Codec Benchmark - Throughput
OneShot Mode
Stream Mode
Codec Benchmark - Checksum Overhead
Type of Change
Description
Performance Impact
For Spark on Bolt, we primarily use Zstd for compression. Enabling Zstd checksums adds about 3% overhead to compression and about 10% overhead to decompression. However, without checksums, data corruption can go undetected and lead to incorrect results. To balance correctness and performance, we also provide an option to disable checksums when performance is the higher priority.
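The trade-off described above can be illustrated with a minimal, self-contained sketch. It is not the codec added in this PR: the frame layout, the `encodeFrame`/`decodeFrame`/`flipBit` helpers, and the `withChecksum` switch are hypothetical names chosen for illustration, and CRC-32 is used as a stand-in checksum so the example needs only the standard library (Zstd's own frame checksum is derived from XXH64). No compression is performed; the sketch only shows why a flipped bit is caught when the checksum is enabled and passes silently when it is disabled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// CRC-32 (IEEE 802.3, reflected polynomial 0xEDB88320), bitwise variant.
// Stand-in only: this is NOT the checksum algorithm Zstd uses.
inline std::uint32_t crc32(const std::uint8_t* data, std::size_t size) {
  std::uint32_t crc = 0xFFFFFFFFu;
  for (std::size_t i = 0; i < size; ++i) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; ++bit) {
      // XOR in the polynomial only when the lowest bit is set.
      crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
  }
  return ~crc;
}

// Hypothetical frame layout: [payload][optional 4-byte CRC, little-endian].
inline std::vector<std::uint8_t> encodeFrame(const std::string& payload,
                                             bool withChecksum) {
  std::vector<std::uint8_t> frame(payload.begin(), payload.end());
  if (withChecksum) {
    const std::uint32_t crc = crc32(frame.data(), frame.size());
    for (int shift = 0; shift < 32; shift += 8) {
      frame.push_back(static_cast<std::uint8_t>(crc >> shift));
    }
  }
  return frame;
}

// Returns the payload, or std::nullopt when the frame is too short or the
// stored checksum does not match the recomputed one.
inline std::optional<std::string> decodeFrame(
    const std::vector<std::uint8_t>& frame, bool withChecksum) {
  const std::size_t trailer = withChecksum ? 4 : 0;
  if (frame.size() < trailer) {
    return std::nullopt;
  }
  const std::size_t payloadSize = frame.size() - trailer;
  if (withChecksum) {
    std::uint32_t stored = 0;
    for (int i = 0; i < 4; ++i) {
      stored |= static_cast<std::uint32_t>(frame[payloadSize + i]) << (8 * i);
    }
    if (stored != crc32(frame.data(), payloadSize)) {
      return std::nullopt;  // corruption detected
    }
  }
  return std::string(frame.begin(), frame.begin() + payloadSize);
}

// Simulates corruption in transit by flipping one bit of the frame.
inline void flipBit(std::vector<std::uint8_t>& frame, std::size_t byteIndex,
                    unsigned bit) {
  assert(byteIndex < frame.size() && bit < 8);
  frame[byteIndex] ^= static_cast<std::uint8_t>(1u << bit);
}
```

With `withChecksum` set, flipping any single bit of an encoded frame makes `decodeFrame` return `std::nullopt`; with it cleared, the same flip yields a payload that differs from the original but is returned as if valid. That is the correctness risk the option to disable checksums accepts in exchange for lower overhead.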
Checklist (For Author)
Breaking Changes
No
Yes (Description: ...)