Better handling for mediator time jumps in datashard #2342
Merged
snaury merged 1 commit into ydb-platform:main on Feb 29, 2024
CyberROFL approved these changes on Feb 29, 2024
snaury added a commit to snaury/ydb that referenced this pull request on Mar 1, 2024
Changelog entry
Better handling for mediator time jumps in datashard
Changelog category
Additional information
While investigating G2-item and G-single-item anomalies detected with Jepsen, we discovered that datashards did not handle mediator time jumps well. When a mediator is restarted, it may replay a transaction stream that has not been acknowledged yet. This in turn could cause the time cast atomic variable to jump backwards and lead to confusion, where a chosen mvcc version would not later produce the intended side effects and edge promotions. For example, a write may choose the current mediator step as its version, but later the current step jumps backwards and PromoteCompleteEdge is not called, because the write is "in the future". This could theoretically cause later reads to incorrectly choose an earlier version (based on a concurrent distributed transaction) than intended. It's unclear whether there's an actual bug though, since PromoteImmediatePostExecuteEdges currently calls MarkPlannedLogicallyCompleteUpTo (which also calls PromoteCompleteEdge for all earlier inflight distributed transactions). However, we may want to remove that call later (to avoid unintended writes when performing reads concurrently with distributed transactions), and the current code is not robust enough.

This PR has two fixes. The first is to never allow the atomic time cast variable to go backwards (it's too difficult to reason about code correctness otherwise). The second is to choose mvcc versions unambiguously: new reads always include all previously replied immediate writes, and new writes always happen after all previously performed immediate writes.
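The first fix can be illustrated with a minimal sketch of a step counter that only moves forward. This is not the code from this PR: MonotonicStep, Advance, Get, and ReplayExample are hypothetical names chosen for illustration, and the real time cast variable in datashard may be updated differently. The sketch only shows the general technique of a compare-and-swap loop that ignores stale values.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a shared "current step" variable; not YDB's actual type.
class MonotonicStep {
public:
    // Raises the stored step to `candidate` if it is larger.
    // Returns true when the stored value actually advanced.
    bool Advance(uint64_t candidate) {
        uint64_t current = Value_.load(std::memory_order_acquire);
        while (current < candidate) {
            // On failure compare_exchange_weak reloads `current`,
            // so the loop re-checks against the latest stored step.
            if (Value_.compare_exchange_weak(current, candidate,
                    std::memory_order_acq_rel, std::memory_order_acquire)) {
                return true;
            }
        }
        return false; // stale or replayed step: ignored
    }

    uint64_t Get() const {
        return Value_.load(std::memory_order_acquire);
    }

private:
    std::atomic<uint64_t> Value_{0};
};

// Simulates a restarted mediator replaying a step that was already observed.
inline uint64_t ReplayExample() {
    MonotonicStep step;
    step.Advance(100);
    step.Advance(105);
    bool moved = step.Advance(101); // replayed, older step
    assert(!moved);                 // the stored step did not go backwards
    return step.Get();
}
```

With a variable like this, a version chosen from the current step can never end up "in the future" relative to a later reading of the same variable, which is the property the description above relies on.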
Fixes KIKIMR-21065.