
[Feature] Support checksum in Bolt Pull-Based Shuffle to enhance data integrity #168

@zhangxffff

Description


Feature Category

Other

Problem / Use Case

In the Spark on Bolt scenario, the shuffle data lifecycle involves several critical stages where data integrity must be guaranteed:

  1. Shuffle Writer Phase: Data is serialized, compressed, written to files, and committed to Spark.
  2. Shuffle Reader Phase: The Reader uses Spark's InputReader to fetch data from the Writer's Netty server; the fetched data is then decompressed and deserialized for downstream computation.

Spark's Approach & Limitation: While Spark introduced shuffle block checksums in version 3.2 (via SPARK-35207), its design is primarily one of reactive diagnosis:
◦ Checksums are calculated during the Write phase.
◦ However, during the Read phase, full verification is only triggered reactively when a FetchFailedException or stream corruption is detected downstream (e.g., during decompression).
◦ This means silent corruption (bit-rot) might not be detected until the data is deserialized or processed, leading to late failures.
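For context, the write-side checksums from SPARK-35207 are controlled by two Spark configs (names as of Spark 3.2+; verification on the read side remains reactive, as described above):

```
# spark-defaults.conf (Spark 3.2+)
spark.shuffle.checksum.enabled    true     # compute per-block checksums at shuffle write time
spark.shuffle.checksum.algorithm  ADLER32  # checksum algorithm (CRC32 is also supported)
```

Even with these enabled, the checksum is only consulted after a fetch failure or decompression error, to diagnose where corruption happened.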

As a result, there is currently no complete application-level checksum mechanism covering this entire process.
• Transport Layer: The system relies almost entirely on TCP/IP protocol checksums for data transmission integrity. Since Spark/Bolt heavily utilizes Netty's Zero-Copy (sendfile/transferTo) mechanism to maximize throughput, the application layer does not verify the payload during transfer.
• Application Layer: There is no checksum validation before the data is converted into a RowVector. Although LZ4/Zstd also support checksums during compression and decompression, this capability is not used in Bolt shuffle.

Relying solely on TCP checksums is insufficient for large-scale distributed systems: they cannot detect hardware bit-rot, silent disk corruption, or software bugs in the compression/decompression path.

Proposed Solution

Currently, Bolt relies on Arrow’s codec abstraction to compress and decompress shuffle data. We plan to remove this dependency and integrate LZ4 and Zstd directly, so we can explicitly enable and enforce checksum validation during both compression and decompression, reducing the risk of silent data corruption.

References / Prior Art

No response

Importance

Blocker (Cannot use Bolt without this)

Willingness to Contribute

Yes, I can submit a PR
