
[Feature] Support checksum in Bolt Pull-Based Shuffle to enhance data integrity #168

@zhangxffff

Description


Feature Category

Other

Problem / Use Case

In the Spark on Bolt scenario, the shuffle data lifecycle involves several critical stages where data integrity must be guaranteed:

  1. Shuffle Writer Phase: Data is serialized, compressed, written to files, and committed to Spark.
  2. Shuffle Reader Phase: The Reader uses Spark's InputReader to fetch data from the Writer's Netty server; the fetched data is then decompressed and deserialized for downstream computation.

Spark's Approach & Limitation: While Spark introduced shuffle block checksums in version 3.2 (via SPARK-35207), its design is primarily one of reactive diagnosis:
◦ Checksums are calculated during the Write phase.
◦ However, during the Read phase, full verification is only triggered reactively when a FetchFailedException or stream corruption is detected downstream (e.g., during decompression).
◦ This means silent corruption (bit-rot) might not be detected until the data is deserialized or processed, leading to late failures.
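For context, the write-side checksums from SPARK-35207 are controlled by two Spark configs (names as of Spark 3.2+; verification on the read side remains reactive, as described above):

```
# spark-defaults.conf (Spark 3.2+)
spark.shuffle.checksum.enabled    true     # compute per-block checksums at shuffle write time
spark.shuffle.checksum.algorithm  ADLER32  # checksum algorithm (CRC32 is also supported)
```

Even with these enabled, the checksum is only consulted after a fetch failure or decompression error, to diagnose where corruption happened.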

As a result, there is currently no complete application-level checksum mechanism covering this entire process.
• Transport Layer: The system relies almost entirely on TCP/IP protocol checksums for data transmission integrity. Since Spark/Bolt heavily utilizes Netty's Zero-Copy (sendfile/transferTo) mechanism to maximize throughput, the application layer does not verify the payload during transfer.
• Application Layer: There is no checksum validation before the data is converted into a RowVector. Although LZ4/Zstd also support checksums during compression and decompression, this capability is not used in Bolt shuffle.

Relying solely on TCP checksums is insufficient for large-scale distributed systems: they cannot detect hardware bit-rot, silent disk corruption, or software bugs in the compression/decompression path.

Proposed Solution

Currently, Bolt relies on Arrow’s codec abstraction to compress and decompress shuffle data. We plan to remove this dependency and integrate LZ4 and Zstd directly, so we can explicitly enable and enforce checksum validation during both compression and decompression, reducing the risk of silent data corruption.

References / Prior Art

No response

Importance

Blocker (Cannot use Bolt without this)

Willingness to Contribute

Yes, I can submit a PR
