[v2] Validate full object checksum on multipart downloads#10179

Closed
hssyoo wants to merge 5 commits into aws:v2 from hssyoo:checksum-download

hssyoo commented Apr 1, 2026

Current State

s3transfer doesn't perform any checksum calculation or validation. This is handled by botocore under the following conditions (assuming checksum mode is enabled):

  • Object was stored with a full object checksum and it's retrieved using a single GET. In this case, botocore calculates and validates the full object checksum.
  • Object was stored with a composite checksum and a part GET returns a part-level checksum. In this case, botocore calculates and validates the part-level checksum for each part retrieved.

There are 2 gaps here:

  1. Object was stored with a full object checksum and retrieved using multiple GETs. In this case, since each retrieved part does not return a part-level checksum, botocore performs no validation.
  2. Object was stored with a composite checksum and it's retrieved using a single GET. In this case, botocore performs no validation. There's not much the client can do here. It would need to know the exact part sizes the object was originally uploaded with and then use those offsets to independently calculate the composite checksum.
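To illustrate why the second gap is hard to close on the client side, here is a minimal sketch of a composite ("checksum of checksums") calculation. It assumes the composite value is the checksum of the concatenated raw part checksums, base64-encoded, with a `-N` part-count suffix. `composite_crc32` is a hypothetical helper, CRC32 from the standard library stands in for CRC32C, and the part sizes are toy values.

```python
import base64
import zlib


def composite_crc32(data: bytes, part_size: int) -> str:
    """Hypothetical helper: checksum of the concatenated per-part CRC32 digests."""
    # One big-endian 4-byte digest per part, split at the given part size.
    digests = [
        zlib.crc32(data[i:i + part_size]).to_bytes(4, "big")
        for i in range(0, len(data), part_size)
    ]
    # Checksum the concatenated digests, then append the part count.
    outer = zlib.crc32(b"".join(digests)).to_bytes(4, "big")
    return f"{base64.b64encode(outer).decode('ascii')}-{len(digests)}"


data = bytes(range(256)) * 64                # 16 KiB of sample data
as_uploaded = composite_crc32(data, 4096)    # split into 4 parts
wrong_guess = composite_crc32(data, 8192)    # same bytes, split into 2 parts
```

The same bytes produce different composite values under different part boundaries, so a single GET cannot be validated without knowing the original part layout.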

This PR is designed to address the first gap by having s3transfer calculate and validate the full object checksum.

Solution

Calculating and validating full object checksums is constrained to CRC-based algorithms because CRC checksums can be combined: merging the part-level checksums (given each part's length) produces the same final value as calculating the checksum over the whole object in a single, serial stream. This matters because the transfer manager downloads multiple parts in parallel. If the full object checksum had to be calculated in a single, serial stream, part bodies would be held waiting for their turn to update a single checksum object. With combinable checksums, each part body can be released as soon as it is written. We use CRT's CRC combine functions for this.
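The combine property can be demonstrated with the standard library alone. This is a minimal sketch for CRC32 only, not the CRT functions the PR uses: `crc32_combine` is a hypothetical helper that advances the first checksum through `len_b` zero bytes, which costs time linear in the part length (production combine routines are typically much faster).

```python
import zlib

_MASK = 0xFFFFFFFF


def crc32_combine(crc_a: int, crc_b: int, len_b: int) -> int:
    """Hypothetical helper: return crc32(A + B) from crc32(A), crc32(B), len(B)."""
    # Push crc_a through len_b zero bytes as a raw register value.
    # zlib.crc32 XORs both its start value and its result with 0xFFFFFFFF,
    # so both XORs are undone here to get the raw shift.
    shifted = zlib.crc32(bytes(len_b), crc_a ^ _MASK) ^ _MASK
    return shifted ^ crc_b


part_a = b"hello, " * 100
part_b = b"world" * 300

# Each part is checksummed independently, as parallel workers would do.
combined = crc32_combine(zlib.crc32(part_a), zlib.crc32(part_b), len(part_b))
# Reference value: one serial pass over the whole object.
serial = zlib.crc32(part_a + part_b)
```

Here `combined` and `serial` are equal even though neither part's checksum depended on the other part's bytes.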

The design makes calculating part-level checksums the responsibility of botocore and combining them into a single full object checksum the responsibility of s3transfer. botocore's StreamingChecksumBody already calculates the checksum as the body is read from the stream and returns it to s3transfer. However, if the object was downloaded without checksum mode enabled, the body won't be returned as a StreamingChecksumBody object. In that case, s3transfer wraps the body in a StreamingChecksumBody so the checksum is still calculated. Reusing botocore's checksum when it is already present avoids computing it twice. One tradeoff is that checking whether the returned body has a checksum attribute creates some coupling between botocore and s3transfer, but I couldn't think of a way to cleanly separate responsibilities without introducing any coupling.
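The wrap-if-needed check might look roughly like the sketch below. `ChecksummingBody` and `ensure_checksummed` are hypothetical stand-ins, not botocore's StreamingChecksumBody or the PR's actual code, and the attribute name `checksum` is an assumption made for illustration.

```python
import io
import zlib


class ChecksummingBody:
    """Stand-in wrapper that updates a running CRC32 as the body is read."""

    def __init__(self, raw):
        self._raw = raw
        self.checksum = 0      # running CRC32 of everything read so far
        self.bytes_read = 0    # part length, needed later for combining

    def read(self, amt=-1):
        chunk = self._raw.read(amt)
        self.checksum = zlib.crc32(chunk, self.checksum)
        self.bytes_read += len(chunk)
        return chunk


def ensure_checksummed(body):
    """Reuse a body that already tracks a checksum; wrap it otherwise."""
    if hasattr(body, "checksum"):
        return body            # already calculated upstream, avoid double work
    return ChecksummingBody(body)


plain = io.BytesIO(b"part body bytes")
wrapped = ensure_checksummed(plain)
payload = wrapped.read()
```

The `hasattr` check is where the coupling described above shows up: the caller has to know which attribute the upstream body exposes.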

When s3transfer initiates a multipart download, it decides from the HeadObject response whether it should calculate the full object checksum. If so, it creates a FullObjectChecksumCombiner object. As each part is downloaded, its part-level checksum is calculated and stored in the FullObjectChecksumCombiner object. Once all parts have been downloaded, FullObjectChecksumCombiner combines the part-level checksums and validates the result against the stored full object checksum returned by the initial HeadObject.
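The flow above can be sketched as follows. Only the class name comes from this PR. The methods (`register_part`, `combine`, `validate`), the `ChecksumMismatch` exception, and the pure-Python CRC32 combine are assumptions made for illustration, and the expected value is assumed to be a base64-encoded big-endian checksum.

```python
import base64
import threading
import zlib

_MASK = 0xFFFFFFFF


def _crc32_combine(crc_a, crc_b, len_b):
    # Advance crc_a through len_b zero bytes, then fold in crc_b.
    return (zlib.crc32(bytes(len_b), crc_a ^ _MASK) ^ _MASK) ^ crc_b


class ChecksumMismatch(Exception):
    """Hypothetical error raised when validation fails."""


class FullObjectChecksumCombiner:
    """Illustrative interface only; the PR's real class may differ."""

    def __init__(self, expected_b64):
        self._expected = expected_b64
        self._parts = {}                 # offset -> (crc, length)
        self._lock = threading.Lock()    # parts finish on worker threads

    def register_part(self, offset, crc, length):
        with self._lock:
            self._parts[offset] = (crc, length)

    def combine(self):
        combined = 0                     # CRC32 of the empty byte string
        for offset in sorted(self._parts):   # combine in object order
            crc, length = self._parts[offset]
            combined = _crc32_combine(combined, crc, length)
        return combined

    def validate(self):
        raw = self.combine().to_bytes(4, "big")
        actual = base64.b64encode(raw).decode("ascii")
        if actual != self._expected:
            raise ChecksumMismatch(f"expected {self._expected}, got {actual}")


data = bytes(range(256)) * 40
expected = base64.b64encode(zlib.crc32(data).to_bytes(4, "big")).decode("ascii")
combiner = FullObjectChecksumCombiner(expected)
# Parts may complete in any order; here the last part registers first.
for offset in (8192, 0, 4096):
    part = data[offset:offset + 4096]
    combiner.register_part(offset, zlib.crc32(part), len(part))
combiner.validate()                      # raises ChecksumMismatch on failure
```

Keying parts by offset lets workers register results as they finish, while the final combine still walks the parts in object order.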

Manual Testing

Reviewers should also sanity check the scenarios below.

  • Uploaded with MPU + CRC64, stored full object checksum
    • Multipart download - S3 doesn't return part-level checksums when the object has a full object checksum, so botocore doesn't do any calculation and s3transfer calculates the part checksums instead. s3transfer then combines the part checksums and validates the full object checksum.
    • Single GET - botocore calculates and validates full object checksum.
  • Uploaded with single PUT + CRC64, stored full object checksum
    • Multipart download - s3transfer calculates part checksum and validates full object checksum.
    • Single GET - botocore calculates and validates full object checksum.
  • Uploaded with MPU + CRC32C, stored composite checksum
    • Multipart download - botocore calculates and validates the part checksums. No full object checksum validation is performed.
    • Single GET - No checksum validation at any level. This is the second gap from Current State section.
  • Uploaded with single PUT + CRC32C, stored full object checksum
    • Multipart download - s3transfer calculates part checksum and validates full object checksum.
    • Single GET - botocore calculates and validates full object checksum.
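The scenarios above reduce to a small lookup keyed on the stored checksum type and the download mode. The names below (`VALIDATION_MATRIX`, `who_validates`, and the key strings) are hypothetical and exist only to summarize the test matrix.

```python
# Keys: (stored checksum type, download mode).
# Values: (component that validates, scope of validation).
VALIDATION_MATRIX = {
    ("full_object", "multipart"): ("s3transfer", "full object"),
    ("full_object", "single_get"): ("botocore", "full object"),
    ("composite", "multipart"): ("botocore", "per part"),
    ("composite", "single_get"): (None, None),  # second gap: no validation
}


def who_validates(checksum_type, download_mode):
    """Return (validator, scope) for a scenario from the manual test list."""
    return VALIDATION_MATRIX[(checksum_type, download_mode)]


validator, scope = who_validates("full_object", "multipart")
```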

hssyoo commented Apr 1, 2026

Closing in favor of #10180 so the dry-run build runs.

hssyoo closed this Apr 1, 2026
hssyoo deleted the checksum-download branch April 1, 2026 18:19