HDFS-17722. DataNode stuck decommissioning on standby NameNode due to excess replica timing race. #8295
balodesecurity wants to merge 6 commits into apache:trunk
Conversation
… excess replica timing race. In HA mode, a timing race can cause the standby NN to incorrectly mark a replica as excess before it learns that a DataNode is decommissioning. This leaves the standby's isSufficient() check permanently returning false (live=1 < RF=2), so the decommission monitor never calls setDecommissioned() and logs under-replication warnings indefinitely. Fix: in isSufficient(), count excess replicas (physically-present block copies) alongside live replicas when checking decommission sufficiency for non-UC blocks. A hasMinStorage guard ensures at least dfs.replication.min live copies exist for durability. If the excess replica is later deleted, the block manager detects under-replication and schedules re-replication.
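The fixed sufficiency rule described in the commit message can be sketched as a simplified standalone check. All names here (`FixedCheckSketch`, `isSufficient`, `MIN_REPLICATION`) are illustrative placeholders, not the actual `DatanodeAdminManager` code:

```java
// Hedged sketch of the post-fix sufficiency logic, assuming simplified
// integer replica counts rather than real Hadoop block/storage objects.
public class FixedCheckSketch {
    static final int MIN_REPLICATION = 1; // dfs.replication.min default

    // Post-fix rule for non-UC blocks: excess replicas count toward the
    // replication factor, but at least MIN_REPLICATION live copies must
    // exist (the hasMinStorage guard) so durability never rests on
    // excess-only copies.
    static boolean isSufficient(int live, int excess, int rf) {
        boolean hasMinStorage = live >= MIN_REPLICATION;
        return hasMinStorage && (live + excess >= rf);
    }

    public static void main(String[] args) {
        System.out.println(isSufficient(1, 1, 2)); // HDFS-17722 scenario
        System.out.println(isSufficient(0, 2, 2)); // blocked by the guard
    }
}
```

If the excess copy is later deleted, this check has already let decommission finish, and normal under-replication handling restores the replication factor.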
…cess replica fix. Tests cover:
- Bug scenario: live=1 + excess=1 >= RF=2 → decommission allowed (HDFS-17722 fix)
- Normal case: live=2, excess=0 → decommission allowed (not broken by fix)
- Safety guard: live=0, excess=2 → decommission blocked (no durable copy)
- Insufficient even with excess: live=0 + excess=1 < RF=2 → blocked
- Excess above RF with min live: live=1 + excess=2 >= RF=2, live >= min → allowed
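The five scenarios above can be re-created with a minimal stand-in for the check. The `isSufficient` method below is a simplified illustration of the described rule, not the real `DatanodeAdminManager` method:

```java
// Illustrative re-creation of the five test scenarios, assuming a
// simplified integer model of the sufficiency rule (live replicas must
// reach dfs.replication.min, and live + excess must reach RF).
public class DecommissionScenarios {
    static boolean isSufficient(int live, int excess, int rf, int minLive) {
        return live >= minLive && (live + excess) >= rf;
    }

    public static void main(String[] args) {
        int rf = 2, minLive = 1; // RF=2, dfs.replication.min=1
        System.out.println(isSufficient(1, 1, rf, minLive)); // bug scenario: allowed
        System.out.println(isSufficient(2, 0, rf, minLive)); // normal case: allowed
        System.out.println(isSufficient(0, 2, rf, minLive)); // safety guard: blocked
        System.out.println(isSufficient(0, 1, rf, minLive)); // insufficient: blocked
        System.out.println(isSufficient(1, 2, rf, minLive)); // excess above RF: allowed
    }
}
```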
Docker Integration Test Results

Tested on a 3-DataNode Docker cluster (1 NameNode + 3 DataNodes, RF=3, balodesecurity/hadoop HDFS-17722 branch).

Note on replicating the bug naturally: in a single-NameNode setup the race does not occur naturally (the block manager processes setrep deletions before the decommission check runs in the same thread). The bug is specific to the standby NameNode path. The unit tests in TestDatanodeAdminManagerIsSufficient therefore exercise the check directly.
CI failed due to Jenkins OOM kill (exit code 137) — unrelated to the patch. Requesting retest. /retest
💔 -1 overall
This message was automatically generated.
💔 -1 overall
🎊 +1 overall
Problem
On a standby NameNode, a DataNode can get stuck in the
DECOMMISSION_INPROGRESS state indefinitely when a timing race causes a replica to be flagged as excess instead of live during decommissioning.

Sequence:
- isSufficient(): numLive=2 (DN-B, DN-C) satisfies RF=3? No. It sees only 2 live copies, so decommission stalls.
- isSufficient() never returns true.
- The excess replica on DN-D is a physically present block copy and contributes to durability; ignoring it causes the deadlock.
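The stall can be illustrated with a minimal sketch of the pre-fix behaviour, where only live replicas are counted. Names here are illustrative, not the actual Hadoop code:

```java
// Sketch of the pre-fix check that causes the stall, assuming a
// simplified integer model: excess replicas are ignored entirely, so
// the copy on DN-D never contributes to sufficiency.
public class StallSketch {
    static boolean isSufficientPreFix(int live, int rf) {
        return live >= rf; // excess replicas not considered
    }

    public static void main(String[] args) {
        int rf = 3;
        int live = 2;   // DN-B, DN-C
        int excess = 1; // DN-D's copy, flagged excess by the race
        // live=2 < RF=3 forever, so on the standby NameNode the
        // decommission monitor never calls setDecommissioned().
        System.out.println(isSufficientPreFix(live, rf));
    }
}
```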
Fix
In
DatanodeAdminManager.isSufficient(), count excess replicas alongside live replicas for the sufficiency check on non-under-construction blocks. The hasMinStorage guard (checks dfs.replication.min, default 1) ensures decommission does not proceed if zero live replicas exist; excess-only replicas are not guaranteed durable. After decommission completes, if the excess replica on DN-D is subsequently deleted, the block manager's normal under-replication detection will schedule re-replication.

Testing
Unit tests — TestDatanodeAdminManagerIsSufficient (5 tests, no cluster required):

| Test | isSufficient returns |
|---|---|
| testExcessReplicaCountsTowardSufficiency | true |
| testNormalDecommissionStillSufficient | true |
| testNoLiveReplicaBlocksDecommission | false |
| testInsufficientEvenWithExcess | false |
| testExcessAboveRFWithMinLive | true |

Docker integration — 3-DataNode cluster with 1 NameNode and RF=3, 5 scenarios:
Related