Skip to content

[Filestore] unconfirmed three stage write cache bypass#5913

Merged
neihar merged 1 commit into
mainfrom
users/ihar/fix-unconfirmed-data-db-cache
May 12, 2026
Merged

[Filestore] unconfirmed three stage write cache bypass#5913
neihar merged 1 commit into
mainfrom
users/ihar/fix-unconfirmed-data-db-cache

Conversation

@neihar

@neihar neihar commented May 6, 2026

Copy link
Copy Markdown
Collaborator

Problem

Currently with unconfirmed writes, we respond from tablet at ExecuteTx stage. At the same time, InMemoryIndexState updates at CompleteTx stage. During this window some requests, that require file metadata can obtain stale values from that cache. The solution is simple, track all commit ids of inflight unconfirmed writes and if some TXes that clash with our inflight writes we force them to bypass the cache and go full TX flow instead of quick reply from cache.

Implemented Solution

Introduced cache bypass mechanic for inflight TXes of unconfirmed writes
Added relevant ut

Issue

#2293

@neihar neihar added large-tests Launch large tests for PR filestore Add this label to run only cloud/filestore build and tests on PR need_ai_review labels May 6, 2026
Comment thread cloud/filestore/libs/storage/tablet/tablet_actor_addblob.cpp
Comment thread cloud/filestore/libs/storage/tablet/tablet_actor_addblob.cpp Outdated
Comment thread cloud/filestore/libs/storage/tablet/tablet_actor_addblob.cpp Outdated
Comment thread cloud/filestore/libs/storage/tablet/tablet_actor_addblob.cpp
Comment thread cloud/filestore/libs/storage/tablet/tablet_actor.cpp Outdated
Comment thread cloud/filestore/libs/storage/tablet/tablet_actor_addblob.cpp
Comment thread cloud/filestore/libs/storage/tablet/tablet_state_cache.cpp Outdated
@robot-vibe-db

robot-vibe-db Bot commented May 6, 2026

Copy link
Copy Markdown

AI Review Summary

Verdict: ❌ 1 critical-adjacent (Major) issue found

Critical issues

No critical issues found.

Major issues

  • Major | Medium: Removed error guards before ConfirmedDataAdded in Execute_AddBlob_WriteUnconfirmed — the old code skipped ConfirmedDataAdded on write failure (commit ID overflow or CheckFreshBytes error), preserving unconfirmed data in DB for recovery. The new code always calls ConfirmedDataAdded, which deletes the DB entry and releases the collect barrier even when the write was not indexed. This can lead to data loss on tablet restart after a failed write, especially during recovery when there is no client to receive the error. — tablet_actor_addblob.cpp:201

Other findings

  • Minor | Medium: The reordering of Execute_AddBlob_Write before UnconfirmedAddBlobSafePointReached changes the ConfirmAddData response behavior — previously always success, now can return the actual write error. This is arguably an improvement but is a behavioral change. — tablet_actor_addblob.cpp:186
  • Minor | Medium: TABLET_VERIFY(args.WriteRanges.size() == 1) is a hard tablet crash if violated. Currently safe (single caller), but fragile to future changes. — tablet_actor_addblob.cpp:194
  • Minor | Medium: DeactivateInMemoryIndexStateBypass uses TABLET_VERIFY to assert FIFO ordering without a descriptive error message. — tablet_state_cache.cpp:97
  • Minor | Low: CacheBypassEnabled reuses GetAddingUnconfirmedDataEnabled() config flag — no independent control for cache bypass. — tablet_actor.cpp:48
  • Nit | High: CompleteTx_AddBlob condition change from ConfirmedDataRefCommitId != InvalidCommitId to Mode == WriteUnconfirmed is a good clarity improvement. — tablet_actor_addblob.cpp:578

This review was generated automatically. Critical issues require attention; other findings are advisory.
If this comment was useful, please give it a 👍 — it helps us improve the review bot.

@github-actions

github-actions Bot commented May 7, 2026

Copy link
Copy Markdown
Contributor

Note

This is an automated comment that will be appended during run.

🔴 linux-x86_64-relwithdebinfo target: cloud/filestore/ (test time: 7365s): some tests FAILED for commit a466280.

TESTS PASSED ERRORS FAILED FAILED BUILD SKIPPED MUTED?
3698 3697 0 1 0 0 0

🔴 linux-x86_64-relwithdebinfo target: cloud/filestore/ (test time: 34s): some tests FAILED for commit a466280.

TESTS PASSED ERRORS FAILED FAILED BUILD SKIPPED MUTED?
2 1 0 1 0 0 0

🔴 linux-x86_64-relwithdebinfo target: cloud/filestore/ (test time: 32s): some tests FAILED for commit a466280.

TESTS PASSED ERRORS FAILED FAILED BUILD SKIPPED MUTED?
2 1 0 1 0 0 0

@neihar neihar requested review from debnatkh and qkrorlqr May 7, 2026 09:00
@github-actions

github-actions Bot commented May 7, 2026

Copy link
Copy Markdown
Contributor

Note

This is an automated comment that will be appended during run.

🟢 linux-x86_64-relwithdebinfo target: cloud/filestore/ (test time: 7292s): all tests PASSED for commit a105ca7.

TESTS PASSED ERRORS FAILED FAILED BUILD SKIPPED MUTED?
3699 3699 0 0 0 0 0

@neihar

neihar commented May 7, 2026

Copy link
Copy Markdown
Collaborator Author

AI Review Summary

Verdict: ❌ 1 critical-adjacent (Major) issue found

Critical issues

No critical issues found.

Major issues

  • Major | Medium: Removed error guards before ConfirmedDataAdded in Execute_AddBlob_WriteUnconfirmed — the old code skipped ConfirmedDataAdded on write failure (commit ID overflow or CheckFreshBytes error), preserving unconfirmed data in DB for recovery. The new code always calls ConfirmedDataAdded, which deletes the DB entry and releases the collect barrier even when the write was not indexed. This can lead to data loss on tablet restart after a failed write, especially during recovery when there is no client to receive the error. — tablet_actor_addblob.cpp:201

Other findings

  • Minor | Medium: The reordering of Execute_AddBlob_Write before UnconfirmedAddBlobSafePointReached changes the ConfirmAddData response behavior — previously always success, now can return the actual write error. This is arguably an improvement but is a behavioral change. — tablet_actor_addblob.cpp:186
  • Minor | Medium: TABLET_VERIFY(args.WriteRanges.size() == 1) is a hard tablet crash if violated. Currently safe (single caller), but fragile to future changes. — tablet_actor_addblob.cpp:194
  • Minor | Medium: DeactivateInMemoryIndexStateBypass uses TABLET_VERIFY to assert FIFO ordering without a descriptive error message. — tablet_state_cache.cpp:97
  • Minor | Low: CacheBypassEnabled reuses GetAddingUnconfirmedDataEnabled() config flag — no independent control for cache bypass. — tablet_actor.cpp:48
  • Nit | High: CompleteTx_AddBlob condition change from ConfirmedDataRefCommitId != InvalidCommitId to Mode == WriteUnconfirmed is a good clarity improvement. — tablet_actor_addblob.cpp:578

This review was generated automatically. Critical issues require attention; other findings are advisory. If this comment was useful, please give it a 👍 — it helps us improve the review bot.

Some points were really good.

Comment thread cloud/filestore/libs/storage/tablet/tablet_actor_addblob.cpp Outdated
Comment thread cloud/filestore/libs/storage/tablet/tablet_actor_addblob.cpp Outdated
Comment thread cloud/filestore/libs/storage/tablet/tablet_actor_addblob.cpp Outdated
Comment thread cloud/filestore/libs/storage/tablet/tablet_state_cache.cpp Outdated
Comment thread cloud/filestore/libs/storage/tablet/tablet_state_cache.cpp Outdated
Comment thread cloud/filestore/libs/storage/tablet/tablet_state_cache.cpp Outdated
Comment thread cloud/filestore/libs/storage/tablet/tablet_ut_state_cache.cpp Outdated
Comment thread cloud/filestore/libs/storage/tablet/tablet_ut_unconfirmed_data.cpp Outdated
Comment thread cloud/filestore/libs/storage/tablet/tablet_ut_unconfirmed_data.cpp Outdated
@neihar neihar requested a review from qkrorlqr May 7, 2026 19:30
@github-actions

github-actions Bot commented May 7, 2026

Copy link
Copy Markdown
Contributor

Note

This is an automated comment that will be appended during run.

🟢 linux-x86_64-relwithdebinfo target: cloud/filestore/ (test time: 7091s): all tests PASSED for commit 81b8339.

TESTS PASSED ERRORS FAILED FAILED BUILD SKIPPED MUTED?
3699 3699 0 0 0 0 0

@github-actions

github-actions Bot commented May 8, 2026

Copy link
Copy Markdown
Contributor

Note

This is an automated comment that will be appended during run.

🟢 linux-x86_64-relwithdebinfo target: cloud/filestore/ (test time: 7009s): all tests PASSED for commit b5f814b.

TESTS PASSED ERRORS FAILED FAILED BUILD SKIPPED MUTED?
3699 3699 0 0 0 0 0

qkrorlqr
qkrorlqr previously approved these changes May 11, 2026
Comment thread cloud/filestore/libs/storage/tablet/tablet_actor.cpp Outdated
Comment thread cloud/filestore/libs/storage/tablet/tablet_state_cache.h Outdated
qkrorlqr
qkrorlqr previously approved these changes May 11, 2026
@neihar neihar requested review from agalibin and yegorskii May 11, 2026 13:20
yegorskii
yegorskii previously approved these changes May 12, 2026
@neihar neihar dismissed stale reviews from yegorskii and qkrorlqr via a005632 May 12, 2026 11:26
@neihar neihar force-pushed the users/ihar/fix-unconfirmed-data-db-cache branch from 69f8b59 to a005632 Compare May 12, 2026 11:26
@github-actions

github-actions Bot commented May 12, 2026

Copy link
Copy Markdown
Contributor

Note

This is an automated comment that will be appended during run.

Note

All workloads for linux-x86_64-relwithdebinfo have completed.

Tip

Planned checks for linux-x86_64-relwithdebinfo.

🔴 linux-x86_64-relwithdebinfo target: cloud/filestore/ (test time: 7220s): some tests FAILED for commit a005632.

TESTS PASSED ERRORS FAILED FAILED BUILD SKIPPED MUTED?
3719 3717 0 2 0 0 0

🟢 linux-x86_64-relwithdebinfo target: cloud/filestore/ (test time: 290s): all tests PASSED for commit a005632.

TESTS PASSED ERRORS FAILED FAILED BUILD SKIPPED MUTED?
6 6 0 0 0 0 0

@neihar neihar merged commit 5af5047 into main May 12, 2026
16 of 20 checks passed
@neihar neihar deleted the users/ihar/fix-unconfirmed-data-db-cache branch May 12, 2026 14:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ai_reviewed filestore Add this label to run only cloud/filestore build and tests on PR large-tests Launch large tests for PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants