HDFS-17918. WebHDFS client-side failover fails during NameNode fsimage loading in HA #8487

Open
magnuma3 wants to merge 1 commit into apache:trunk from magnuma3:webhdfs-fails-failover

Conversation

magnuma3 commented May 8, 2026

Description of PR

When NameNode HA is configured and NameNode 1 is restarting and still loading its fsimage, WebHDFS requests do not trigger client-side failover.

While a restarted NameNode in an HA configuration is loading its fsimage, WebHDFS returns RetriableException instead of StandbyException. As a result, client-side failover is never triggered and the client keeps retrying the same NameNode. Once its retries are exhausted, the client fails without ever reaching the active NameNode.

$ hdfs dfs -ls swebhdfs://blue
2023-07-14 12:23:55,890 INFO web.WebHdfsFileSystem: Retrying connect to namenode: blue-nn1.host/10.10.10.10:9470. Already retried 0 time(s); retry policy is org.apache.hadoop.io.retry.RetryPolicies$FailoverOnNetworkExceptionRetry@95e33cc, delay 0ms.
2023-07-14 12:23:55,929 INFO web.WebHdfsFileSystem: Retrying connect to namenode: blue-nn1.host/10.10.10.10:9470. Already retried 1 time(s); retry policy is org.apache.hadoop.io.retry.RetryPolicies$FailoverOnNetworkExceptionRetry@95e33cc, delay 731ms.
2023-07-14 12:23:56,697 INFO web.WebHdfsFileSystem: Retrying connect to namenode: blue-nn1.host/10.10.10.10:9470. Already retried 2 time(s); retry policy is org.apache.hadoop.io.retry.RetryPolicies$FailoverOnNetworkExceptionRetry@95e33cc, delay 1475ms.
2023-07-14 12:23:58,207 INFO web.WebHdfsFileSystem: Retrying connect to namenode: blue-nn1.host/10.10.10.10:9470. Already retried 3 time(s); retry policy is org.apache.hadoop.io.retry.RetryPolicies$FailoverOnNetworkExceptionRetry@95e33cc, delay 5061ms.
2023-07-14 12:24:03,304 INFO web.WebHdfsFileSystem: Retrying connect to namenode: blue-nn1.host/10.10.10.10:9470. Already retried 4 time(s); retry policy is org.apache.hadoop.io.retry.RetryPolicies$FailoverOnNetworkExceptionRetry@95e33cc, delay 7898ms.
2023-07-14 12:24:11,235 INFO web.WebHdfsFileSystem: Retrying connect to namenode: blue-nn1.host/10.10.10.10:9470. Already retried 5 time(s); retry policy is org.apache.hadoop.io.retry.RetryPolicies$FailoverOnNetworkExceptionRetry@95e33cc, delay 18847ms.
2023-07-14 12:24:30,117 INFO web.WebHdfsFileSystem: Retrying connect to namenode: blue-nn1.host/10.10.10.10:9470. Already retried 6 time(s); retry policy is org.apache.hadoop.io.retry.RetryPolicies$FailoverOnNetworkExceptionRetry@95e33cc, delay 20826ms.
2023-07-14 12:24:50,979 INFO web.WebHdfsFileSystem: Retrying connect to namenode: blue-nn1.host/10.10.10.10:9470. Already retried 7 time(s); retry policy is org.apache.hadoop.io.retry.RetryPolicies$FailoverOnNetworkExceptionRetry@95e33cc, delay 10374ms.
2023-07-14 12:25:01,385 INFO web.WebHdfsFileSystem: Retrying connect to namenode: blue-nn1.host/10.10.10.10:9470. Already retried 8 time(s); retry policy is org.apache.hadoop.io.retry.RetryPolicies$FailoverOnNetworkExceptionRetry@95e33cc, delay 16640ms.
2023-07-14 12:25:18,059 INFO web.WebHdfsFileSystem: Retrying connect to namenode: blue-nn1.host/10.10.10.10:9470. Already retried 9 time(s); retry policy is org.apache.hadoop.io.retry.RetryPolicies$FailoverOnNetworkExceptionRetry@95e33cc, delay 15514ms.
2023-07-14 12:25:33,610 INFO web.WebHdfsFileSystem: Retrying connect to namenode: blue-nn1.host/10.10.10.10:9470. Already retried 10 time(s); retry policy is org.apache.hadoop.io.retry.RetryPolicies$FailoverOnNetworkExceptionRetry@95e33cc, delay 10745ms.
ls: Namenode is in startup mode
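
The retry loop in the log above can be modelled with a small, self-contained sketch. This is not the Hadoop source: `StandbyLikeException` and `RetriableLikeException` are hypothetical local stand-ins for `StandbyException` and `RetriableException`, and the single `maxRetries` budget is a simplifying assumption. It only illustrates why a retriable error keeps the client on the same NameNode, while a standby error lets it move to the next one.

```java
import java.io.IOException;
import java.net.ConnectException;

/** Simplified model of a failover-aware retry decision; NOT the Hadoop implementation. */
final class FailoverDecisionSketch {

  /** What the client does after a failed call. */
  enum Action { RETRY_SAME_NAMENODE, FAILOVER_AND_RETRY, FAIL }

  /** Hypothetical stand-in for StandbyException. */
  static final class StandbyLikeException extends IOException {
    StandbyLikeException(String msg) { super(msg); }
  }

  /** Hypothetical stand-in for RetriableException. */
  static final class RetriableLikeException extends IOException {
    RetriableLikeException(String msg) { super(msg); }
  }

  /** Decide the next step from the exception type (assumed, simplified rules). */
  static Action decide(Exception e, int retries, int maxRetries) {
    if (retries >= maxRetries) {
      return Action.FAIL;
    }
    if (e instanceof ConnectException || e instanceof StandbyLikeException) {
      return Action.FAILOVER_AND_RETRY;   // try the other NameNode
    }
    if (e instanceof RetriableLikeException) {
      return Action.RETRY_SAME_NAMENODE;  // stay on the current NameNode
    }
    return Action.FAIL;
  }

  /**
   * responses[i] is the exception NameNode i always throws, or null if it serves the call.
   * Returns the index of the NameNode that served the call, or -1 if the client gave up.
   */
  static int call(Exception[] responses, int maxRetries) {
    int current = 0;
    for (int retries = 0; ; retries++) {
      Exception e = responses[current];
      if (e == null) {
        return current;
      }
      Action action = decide(e, retries, maxRetries);
      if (action == Action.FAIL) {
        return -1;
      }
      if (action == Action.FAILOVER_AND_RETRY) {
        current = (current + 1) % responses.length;
      }
    }
  }

  public static void main(String[] args) {
    // nn1 is loading its fsimage, nn2 is active (null means the call succeeds).
    Exception[] retriable = { new RetriableLikeException("Namenode is in startup mode"), null };
    Exception[] standby = { new StandbyLikeException("Namenode is in startup mode"), null };
    System.out.println("retriable error, served by index: " + call(retriable, 10));
    System.out.println("standby error, served by index: " + call(standby, 10));
  }
}
```

In this model the first scenario gives up after ten retries against the first NameNode, matching the shape of the log, while the second reaches the second NameNode after one failover. How the patch itself produces the failover is not shown here; the second scenario only assumes the client sees a failover-triggering exception.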

How was this patch tested?

Added unit test. Also verified on a real HA cluster that client-side failover works correctly after the fix.

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

AI Tooling

If an AI tool was used:

@hadoop-yetus

💔 -1 overall

| Vote | Subsystem | Runtime | Logfile | Comment |
| :---: | :--- | ---: | :--- | :--- |
| +0 🆗 | reexec | 0m 22s | | Docker mode activated. |
| | | | | _ Prechecks _ |
| +1 💚 | dupname | 0m 0s | | No case conflicting files found. |
| +0 🆗 | codespell | 0m 0s | | codespell was not available. |
| +0 🆗 | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 💚 | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 💚 | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
| | | | | _ trunk Compile Tests _ |
| +1 💚 | mvninstall | 26m 22s | | trunk passed |
| +1 💚 | compile | 0m 55s | | trunk passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04 |
| +1 💚 | compile | 0m 53s | | trunk passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1 |
| +1 💚 | checkstyle | 0m 59s | | trunk passed |
| +1 💚 | mvnsite | 1m 1s | | trunk passed |
| +1 💚 | javadoc | 0m 53s | | trunk passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04 |
| +1 💚 | javadoc | 0m 49s | | trunk passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1 |
| +1 💚 | spotbugs | 2m 10s | | trunk passed |
| +1 💚 | shadedclient | 17m 59s | | branch has no errors when building and testing our client artifacts. |
| | | | | _ Patch Compile Tests _ |
| +1 💚 | mvninstall | 0m 47s | | the patch passed |
| +1 💚 | compile | 0m 40s | | the patch passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04 |
| +1 💚 | javac | 0m 40s | | the patch passed |
| +1 💚 | compile | 0m 40s | | the patch passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1 |
| +1 💚 | javac | 0m 40s | | the patch passed |
| +1 💚 | blanks | 0m 0s | | The patch has no blanks issues. |
| -0 ⚠️ | checkstyle | 0m 45s | /results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt | hadoop-hdfs-project/hadoop-hdfs: The patch generated 3 new + 130 unchanged - 1 fixed = 133 total (was 131) |
| +1 💚 | mvnsite | 0m 46s | | the patch passed |
| +1 💚 | javadoc | 0m 32s | | the patch passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04 |
| +1 💚 | javadoc | 0m 34s | | the patch passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1 |
| +1 💚 | spotbugs | 2m 3s | | the patch passed |
| +1 💚 | shadedclient | 17m 3s | | patch has no errors when building and testing our client artifacts. |
| | | | | _ Other Tests _ |
| -1 ❌ | unit | 185m 40s | /patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt | hadoop-hdfs in the patch passed. |
| +1 💚 | asflicense | 0m 28s | | The patch does not generate ASF License warnings. |
| | | 261m 50s | | |

| Reason | Tests |
| :--- | :--- |
| Failed junit tests | hadoop.hdfs.server.balancer.TestBalancerWithHANameNodes |

| Subsystem | Report/Notes |
| :--- | :--- |
| Docker | ClientAPI=1.54 ServerAPI=1.54 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8487/1/artifact/out/Dockerfile |
| GITHUB PR | #8487 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux bba5b49fb61b 5.15.0-141-generic #151-Ubuntu SMP Sun May 18 21:35:19 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / a076f8b |
| Default Java | Ubuntu-17.0.18+8-Ubuntu-124.04.1 |
| Multi-JDK versions | /usr/lib/jvm/java-21-openjdk-amd64:Ubuntu-21.0.10+7-Ubuntu-124.04 /usr/lib/jvm/java-17-openjdk-amd64:Ubuntu-17.0.18+8-Ubuntu-124.04.1 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8487/1/testReport/ |
| Max. process+thread count | 4179 (vs. ulimit of 10000) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8487/1/console |
| versions | git=2.43.0 maven=3.9.15 spotbugs=4.9.7 |
| Powered by | Apache Yetus 0.14.1 https://yetus.apache.org |

This message was automatically generated.

2 participants