HDFS-17722. DataNode stuck decommissioning on standby NameNode due to excess replica timing race. #8295
balodesecurity wants to merge 6 commits into apache:trunk
Conversation
… excess replica timing race. In HA mode, a timing race can cause the standby NN to incorrectly mark a replica as excess before it learns that a DataNode is decommissioning. This leaves the standby's isSufficient() check permanently returning false (live=1 < RF=2), so the decommission monitor never calls setDecommissioned() and logs under-replication warnings indefinitely. Fix: in isSufficient(), count excess replicas (physically-present block copies) alongside live replicas when checking decommission sufficiency for non-UC blocks. A hasMinStorage guard ensures at least dfs.replication.min live copies exist for durability. If the excess replica is later deleted, the block manager detects under-replication and schedules re-replication.
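The fixed sufficiency rule described in the commit message can be sketched as a simplified standalone check. All names here (`FixedCheckSketch`, `isSufficient`, `MIN_REPLICATION`) are illustrative placeholders, not the actual `DatanodeAdminManager` code:

```java
// Hedged sketch of the post-fix sufficiency logic, assuming simplified
// integer replica counts rather than real Hadoop block/storage objects.
public class FixedCheckSketch {
    static final int MIN_REPLICATION = 1; // dfs.replication.min default

    // Post-fix rule for non-UC blocks: excess replicas count toward the
    // replication factor, but at least MIN_REPLICATION live copies must
    // exist (the hasMinStorage guard) so durability never rests on
    // excess-only copies.
    static boolean isSufficient(int live, int excess, int rf) {
        boolean hasMinStorage = live >= MIN_REPLICATION;
        return hasMinStorage && (live + excess >= rf);
    }

    public static void main(String[] args) {
        System.out.println(isSufficient(1, 1, 2)); // HDFS-17722 scenario
        System.out.println(isSufficient(0, 2, 2)); // blocked by the guard
    }
}
```

If the excess copy is later deleted, this check has already let decommission finish, and normal under-replication handling restores the replication factor.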
…cess replica fix. Tests cover:
- Bug scenario: live=1 + excess=1 >= RF=2 → decommission allowed (HDFS-17722 fix)
- Normal case: live=2, excess=0 → decommission allowed (not broken by fix)
- Safety guard: live=0, excess=2 → decommission blocked (no durable copy)
- Insufficient even with excess: live=0 + excess=1 < RF=2 → blocked
- Excess above RF with min live: live=1 + excess=2 >= RF=2, live >= min → allowed
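The five scenarios above can be re-created with a minimal stand-in for the check. The `isSufficient` method below is a simplified illustration of the described rule, not the real `DatanodeAdminManager` method:

```java
// Illustrative re-creation of the five test scenarios, assuming a
// simplified integer model of the sufficiency rule (live replicas must
// reach dfs.replication.min, and live + excess must reach RF).
public class DecommissionScenarios {
    static boolean isSufficient(int live, int excess, int rf, int minLive) {
        return live >= minLive && (live + excess) >= rf;
    }

    public static void main(String[] args) {
        int rf = 2, minLive = 1; // RF=2, dfs.replication.min=1
        System.out.println(isSufficient(1, 1, rf, minLive)); // bug scenario: allowed
        System.out.println(isSufficient(2, 0, rf, minLive)); // normal case: allowed
        System.out.println(isSufficient(0, 2, rf, minLive)); // safety guard: blocked
        System.out.println(isSufficient(0, 1, rf, minLive)); // insufficient: blocked
        System.out.println(isSufficient(1, 2, rf, minLive)); // excess above RF: allowed
    }
}
```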
Docker Integration Test Results

Tested on a 3-DataNode Docker cluster (1 NameNode + 3 DataNodes, RF=3, balodesecurity/hadoop HDFS-17722 branch).

Note on replicating the bug naturally: in a single-NameNode setup the race does not occur naturally (the block manager processes setrep deletions before the decommission check runs in the same thread). The bug is specific to the standby NameNode path. The unit tests in TestDatanodeAdminManagerIsSufficient therefore exercise the check directly.
CI failed due to Jenkins OOM kill (exit code 137) — unrelated to the patch. Requesting retest. /retest
💔 -1 overall
This message was automatically generated.
💔 -1 overall
🎊 +1 overall
Problem
On a standby NameNode, a DataNode can get stuck in the
DECOMMISSION_INPROGRESS state indefinitely when a timing race causes a replica to be flagged as excess instead of live during decommissioning.

Sequence:
- isSufficient(): numLive=2 (DN-B, DN-C) satisfies RF=3? No. It sees only 2 live copies, so decommission stalls.
- isSufficient() never returns true.
- The excess replica on DN-D is a physically present block copy and contributes to durability; ignoring it causes the deadlock.
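The stall can be illustrated with a minimal sketch of the pre-fix behaviour, where only live replicas are counted. Names here are illustrative, not the actual Hadoop code:

```java
// Sketch of the pre-fix check that causes the stall, assuming a
// simplified integer model: excess replicas are ignored entirely, so
// the copy on DN-D never contributes to sufficiency.
public class StallSketch {
    static boolean isSufficientPreFix(int live, int rf) {
        return live >= rf; // excess replicas not considered
    }

    public static void main(String[] args) {
        int rf = 3;
        int live = 2;   // DN-B, DN-C
        int excess = 1; // DN-D's copy, flagged excess by the race
        // live=2 < RF=3 forever, so on the standby NameNode the
        // decommission monitor never calls setDecommissioned().
        System.out.println(isSufficientPreFix(live, rf));
    }
}
```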
Fix
In
DatanodeAdminManager.isSufficient(), count excess replicas alongside live replicas for the sufficiency check on non-under-construction blocks. The hasMinStorage guard (checks dfs.replication.min, default 1) ensures decommission does not proceed if zero live replicas exist; excess-only replicas are not guaranteed durable. After decommission completes, if the excess replica on DN-D is subsequently deleted, the block manager's normal under-replication detection will schedule re-replication.

Testing
Unit tests — TestDatanodeAdminManagerIsSufficient (5 tests, no cluster required):

| Test | isSufficient returns |
|---|---|
| testExcessReplicaCountsTowardSufficiency | true |
| testNormalDecommissionStillSufficient | true |
| testNoLiveReplicaBlocksDecommission | false |
| testInsufficientEvenWithExcess | false |
| testExcessAboveRFWithMinLive | true |

Docker integration — 3-DataNode cluster with 1 NameNode and RF=3, 5 scenarios:
Related