feat: Implement MultipartUploadBackend on TieredStorage#458
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #458 +/- ##
==========================================
+ Coverage 86.94% 86.95% +0.01%
==========================================
Files 77 77
Lines 11394 11685 +291
==========================================
+ Hits 9906 10161 +255
- Misses 1488 1524 +36
☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
This comment was marked as outdated.
This comment was marked as outdated.
| fn try_into(self) -> Result<UploadId, Self::Error> { | ||
| use base64::Engine; | ||
| let json = | ||
| serde_json::to_vec(&self).map_err(|e| Error::serde("encoding multipart token", e))?; |
There was a problem hiding this comment.
Maybe there's a smarter way to concatenate the 2 with less size overhead than JSON...
But any separator could theoretically also appear in the ID returned by the inner backend.
This comment was marked as off-topic.
This comment was marked as off-topic.
| .await? | ||
| .ok_or_else(|| { | ||
| Error::generic("completed multipart object not found in long-term storage") | ||
| })?; |
There was a problem hiding this comment.
Transient metadata read failure causes irrecoverable upload loss
Medium Severity
After the inner complete_multipart succeeds (consuming the multipart upload irreversibly), the guard is advanced to Written, and then get_metadata is called. If this call fails (transient network error or returns None), the ? propagates an error and the guard drops in Written phase. The cleanup logic then deletes the successfully assembled object from long-term storage. Unlike put_long_term — where the metadata is available as a parameter and no extra call exists between the LT write and CAS commit — here the client cannot retry complete_multipart because the upload has been consumed by the inner backend. They must re-upload all parts from scratch, which can be very costly for large uploads. Encoding the expiration_policy in the TieredUploadId at initiation time would eliminate this failure point entirely.
Reviewed by Cursor Bugbot for commit 409be20. Configure here.
There was a problem hiding this comment.
It's very unlikely that this fails. I don't want to encode more information in the token.
There was a problem hiding this comment.
another option is to move the inner.long_term.get_metadata() call earlier in the function before we create our change guard. that way if it fails we can at least 5XX and the user can retry before we've made any persistent changes
looking at the change guard impl
objectstore/objectstore-service/src/backend/changelog.rs
Lines 304 to 330 in 1615f71
thinking through the remaining possible errors and how the change guard would handle them:
- if
inner.long_term.complete_multipart()returns an error:- the guard will drop in the
Recordedwithout cleaning anything up. the user can retry, unless the underlying backend is in an unretryable state
- the guard will drop in the
- if
inner.high_volume.compare_and_write()returns an error and didn't persist anything:- the guard will drop in the
Writtenstate. it will fetch the current tombstone ID, see that our new ID doesn't match, and delete the new ID from LT storage. - the user has to redo the whole multipart upload, but the system should be sane. we need to make sure the error communicates this
- this is the same issue as what the bot describes, but this instance is unavoidable
- the guard will drop in the
- if
inner.high_volume.compare_and_write()returns an error but persisted the tombstone:- the guard will drop in the
Writtenstate. it will fetch the current tombstone ID, see that our new ID does match, and exit without cleaning anything up.
- the guard will drop in the
incidentally i see our change log doesn't yet have a durable backend implemented. i guess that's what FS-310 is still open for
There was a problem hiding this comment.
incidentally i see our change log doesn't yet have a durable backend implemented. i guess that's what FS-310 is still open for
Exactly.
another option is to move the inner.long_term.get_metadata() call earlier in the function before we create our change guard. that way if it fails we can at least 5XX and the user can retry before we've made any persistent changes
We cannot call get_metadata yet because the object doesn't exist. S3 and GCS don't have APIs to retrieve the metadata for an incomplete multipart upload.
That's why we need to materialize the object first. And therefore, we need to create the guard beforehand, to avoid leaving that blob orphaned.
The alternative would be to include the expiration policy in the UploadId we give back to the user, but that introduces all kinds of conditions you have to handle, what if the user passes a different expiration policy at creation/completion? Also, we would need to validate the expiration policy in two places, etc. etc. -- in my opinion the first call should be the source of truth for this.
But you're right that this warrants some more thought.
With the current implementation, the client can basically never retry complete_multipart effectivly, because the dropped ChangeGuard will probably delete the new blob even if it completed successfully 🤔
There was a problem hiding this comment.
The core issue is that ChangeGuard::Recorded assumes that every LT write uses a unique key, while in this case we would retry multiple times on the same key (revision).
| use tokio_util::task::task_tracker::TaskTrackerToken; | ||
|
|
||
| use crate::backend::common::{Backend, HighVolumeBackend, TieredMetadata}; | ||
| use crate::backend::common::{HighVolumeBackend, MultipartUploadBackend, TieredMetadata}; |
There was a problem hiding this comment.
is every long-term storage backend required to impl MultipartUploadBackend? should we rename it LongTermBackend trait to parallel the HighVolumeBackend and make sure both have a trait bound requiring that Backend be implemented too?
There was a problem hiding this comment.
Yes exactly, it's implemented as follows:
So, it already has a trait bound on
Backend.
I guess a more abstracted name for it could be LongTermBackend, yeah. Let me change it in a follow-up.
| .await?; | ||
|
|
||
| if error.is_some() { | ||
| return Ok(error); |
There was a problem hiding this comment.
is Ok appropriate here? is this an endpoint that returns 200 OK always and you have to read the response body to interpret errors?
There was a problem hiding this comment.
Yes that's intended, that's also how S3/GCS implement this and I mentioned it in the design doc.
That's because the complete call can take a while (I guess it's assembling the object in the backend), so S3 takes exactly this approach of a 200 and error in the body, therefore the most general solution is for us to mirror it.
Otherwise, the client might very easily run into a timeout if we don't return a response/headers within X s.
There was a problem hiding this comment.
But probably this warrants a comment explaining how it works.
| .await? | ||
| .ok_or_else(|| { | ||
| Error::generic("completed multipart object not found in long-term storage") | ||
| })?; |
There was a problem hiding this comment.
another option is to move the inner.long_term.get_metadata() call earlier in the function before we create our change guard. that way if it fails we can at least 5XX and the user can retry before we've made any persistent changes
looking at the change guard impl
objectstore/objectstore-service/src/backend/changelog.rs
Lines 304 to 330 in 1615f71
thinking through the remaining possible errors and how the change guard would handle them:
- if
inner.long_term.complete_multipart()returns an error:- the guard will drop in the
Recordedwithout cleaning anything up. the user can retry, unless the underlying backend is in an unretryable state
- the guard will drop in the
- if
inner.high_volume.compare_and_write()returns an error and didn't persist anything:- the guard will drop in the
Writtenstate. it will fetch the current tombstone ID, see that our new ID doesn't match, and delete the new ID from LT storage. - the user has to redo the whole multipart upload, but the system should be sane. we need to make sure the error communicates this
- this is the same issue as what the bot describes, but this instance is unavoidable
- the guard will drop in the
- if
inner.high_volume.compare_and_write()returns an error but persisted the tombstone:- the guard will drop in the
Writtenstate. it will fetch the current tombstone ID, see that our new ID does match, and exit without cleaning anything up.
- the guard will drop in the
incidentally i see our change log doesn't yet have a durable backend implemented. i guess that's what FS-310 is still open for
e8a193d to
803c658
Compare
Move ChangeGuard creation before get_metadata so that if the metadata read fails after complete_multipart succeeds, the guard cleans up the assembled blob. Previously, Manual-expiration objects could be orphaned permanently. Also fix redundant rustdoc link.
409be20 to
d6eb979
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit d6eb979. Configure here.
|
|
||
| if error.is_some() { | ||
| return Ok(error); | ||
| } |
There was a problem hiding this comment.
Cleanup guard can race with retries on fixed revision key
Medium Severity
The change guard is recorded before calling complete_multipart on the long-term backend. If completion fails (body error or transport error), the guard drops in Recorded phase and spawns a background cleanup that deletes the physical key via delete_object. Unlike put_long_term — where each call generates a fresh revision key via new_long_term_revision — here the physical key is fixed (baked into the TieredUploadId at initiation time). If a client retries complete_multipart with the same upload_id and the retry succeeds before the old cleanup task executes delete_object, the cleanup deletes the newly completed object, leaving the subsequently written HV tombstone pointing to a non-existent LT blob (data loss).
Additional Locations (1)
Reviewed by Cursor Bugbot for commit d6eb979. Configure here.
|
This is pending resolution of #458 (comment) |


Last implementation of
MultipartUploadBackendinobjectstore-service.Now
TieredStoragerequires aMultipartUploadBackendin itslong_termslot.All methods just pretty much forward to
long_term, with the exception ofcomplete_multipartwhich needs to use theChangelogfor consistency and deal with the tombstones, with logic very similar toput_objectexcept that we need an additional metadata read call.Close FS-339