HDDS-14758. Mismatch in Recon container count between Cluster State and Container Summary#10074
Open
devmadhuu wants to merge 10 commits intoapache:masterfrom
Open
HDDS-14758. Mismatch in Recon container count between Cluster State and Container Summary#10074devmadhuu wants to merge 10 commits intoapache:masterfrom
devmadhuu wants to merge 10 commits intoapache:masterfrom
Conversation
added 2 commits
April 13, 2026 11:38
…d Container Summary Totals - Initial Commit.
sumitagrawl
reviewed
Apr 15, 2026
Contributor
sumitagrawl
left a comment
There was a problem hiding this comment.
@devmadhuu Thanks for working over this, few points to be considered,
- We support moving back container from deleted to closed / quasi-closed based on state from DN
- open containers can be incremental based on last sync container ID, as new containers are created in increment id order. But closed / quasi-closed needs to be full sunc. So full sync can be 3 hours gap or more.
- Closing may not be required as its temporary state for few minutes, DN sync can help over this.
- For stale DN or some volume failure, there can be sudden spike of container mismatch like container moving to closing state from open state. Need consider full db sync for open container difference -- may be not required. And for quasi-closed/closed state, any way doing sync container IDs
Do we really need full db sync for quasi-closed/closed only OR for Open container state ?
Contributor
Author
Thanks @sumitagrawl for your review. Kindly have a re-look into the code. I have pushed the changes and now doing OPEN containers sync incrementally as you suggested. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this pull request?
This PR addresses the issue of large deviations between Recon and SCM related to container counts, container Ids and their respective states.
Largely the PR addresses two issues:
OPEN,CLOSING, orQUASI_CLOSEDafter SCM has already advanced it.The implementation now has two SCM sync mechanisms:
Full snapshot sync
This remains the safety net. Recon replaces its SCM DB view from a fresh SCM
checkpoint on the existing snapshot schedule.
Incremental targeted sync
This now runs on its own schedule and decides between:
NO_ACTIONTARGETED_SYNCFULL_SNAPSHOTThe targeted sync is implemented as a four-pass workflow:
Pass 1:
CLOSEDAdd missing
CLOSEDcontainers and correct staleOPEN/CLOSING/QUASI_CLOSEDcontainers to
CLOSED.Pass 2:
OPENAdd missing
OPENcontainers only. No downgrades and no state correction.Pass 3:
QUASI_CLOSEDAdd missing
QUASI_CLOSEDcontainers and correct staleOPEN/CLOSINGcontainers up to
QUASI_CLOSED.Pass 4: retirement for
DELETING/DELETEDStart from Recon's own
CLOSEDandQUASI_CLOSEDcontainers and move themforward only when SCM explicitly returns
DELETINGorDELETED.Root Causes Addressed
1. DN-report path could not advance beyond
CLOSING2. Sync used to be add-only for
CLOSED3. Recon could miss
OPENandQUASI_CLOSEDcontainers entirely4. Recon never retired stale live states based on SCM deletion progress
5. SCM batch API could drop containers when pipeline lookup failed
6. Recon add path and SCM state manager were not null-pipeline safe
7. Open-container count per pipeline could drift
8.
decideSyncAction()became a real tiered decisionCurrent logic:
>ozone.recon.scm.container.threshold:FULL_SNAPSHOT> 0:TARGETED_SYNCOPENQUASI_CLOSEDCLOSEDremainderozone.recon.scm.per.state.drift.threshold:TARGETED_SYNCNO_ACTION9. Incremental sync got its own schedule
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-14758
How was this patch tested?
Tests added or updated
TestReconSCMContainerSyncIntegration.java]TestReconStorageContainerSyncHelper.java]TestTriggerDBSyncEndpoint.java]TestUnhealthyContainersDerbyPerformance.java]TestReconContainerHealthSummaryEndToEnd.java]