YARN-11959. NodeManager becomes unhealthy when container exits with code 22 or 24 #8474

Open
magnuma3 wants to merge 1 commit into apache:trunk from magnuma3:nm-unhealthy-exit-22

magnuma3 commented May 7, 2026

Description of PR

When a user container exits with code 22 or 24, the NodeManager becomes unhealthy and no more containers are allocated to that node. This situation can be resolved by restarting the NodeManager.

It can be reproduced immediately by running a Scala Spark wordcount job that exits with code 22.

I propose fixing this by wrapping exit codes 22 and 24 in a different exit code, so that the ConfigurationException that makes the NodeManager unhealthy is not triggered.

2024-09-23 18:50:14,360 INFO  nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(532)) - Obtaining the exit code...
2024-09-23 18:50:14,360 INFO  nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(532)) - Docker inspect command: /usr/bin/docker inspect --format {{.State.ExitCode}} container_e161_1711009858797_8304894_01_000015
2024-09-23 18:50:14,360 INFO  nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(532)) - Exit code from docker inspect: 22
2024-09-23 18:50:14,360 INFO  nodemanager.ContainerExecutor (ContainerExecutor.java:logOutput(532)) - Wrote the exit code 22 to /data6/hadoop/yarn/local/nmPrivate/application_1711009858797_8304894/container_e161_1711009858797_8304894_01_000015/container_e161_1711009858797_8304894_01_000015.pid.exitcode
2024-09-23 18:50:14,381 ERROR launcher.ContainerLaunch (ContainerLaunch.java:call(340)) - Failed to launch container due to configuration error.
org.apache.hadoop.yarn.exceptions.ConfigurationException: Linux Container Executor reached unrecoverable exception
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleExitCode(LinuxContainerExecutor.java:615)
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:573)
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:479)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.launchContainer(ContainerLaunch.java:513)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:323)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:106)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.yarn.server.nodemanager.containermanager.runtime.ContainerExecutionException: Launch container failed
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime.launchContainer(DockerLinuxContainerRuntime.java:1099)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DelegatingLinuxContainerRuntime.launchContainer(DelegatingLinuxContainerRuntime.java:166)
        at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.handleLaunchForLaunchType(LinuxContainerExecutor.java:564)
        ... 8 more 
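
The collision visible in the trace above can be modelled with a minimal C sketch. It assumes, based on this description, that the NodeManager-side check keys only on the integer values 22 and 24. The actual check lives in Java (`LinuxContainerExecutor.handleExitCode`), and the names `nm_treats_as_unrecoverable` and `demo_collision` are hypothetical and used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Values taken from the PR title: the two exit codes that lead to the
 * ConfigurationException. The enum names here are illustrative. */
enum {
  EXIT_INVALID_CONTAINER_EXEC_PERMISSIONS = 22,
  EXIT_INVALID_CONFIG_FILE = 24
};

/* Hypothetical model of the NodeManager-side decision. The check sees only
 * the integer, not whether container-executor itself or the user's process
 * produced it. */
bool nm_treats_as_unrecoverable(int exit_code) {
  return exit_code == EXIT_INVALID_CONTAINER_EXEC_PERMISSIONS ||
         exit_code == EXIT_INVALID_CONFIG_FILE;
}

/* Hypothetical self-check: a user process exiting with 22 is
 * indistinguishable from a genuine executor permission error, while an
 * ordinary failure (1) or the wrapped code proposed here (85) is not. */
void demo_collision(void) {
  int user_exit = 22;
  assert(nm_treats_as_unrecoverable(user_exit));
  assert(!nm_treats_as_unrecoverable(1));
  assert(!nm_treats_as_unrecoverable(85));
}
```

Under this model, any remapping of 22 and 24 to a value outside that pair before the code reaches the NodeManager would avoid the exception path.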

How was this patch tested?

It can be reproduced immediately by running a Scala Spark wordcount job that exits with code 22.

For code changes:

  • Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
  • Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • If applicable, have you updated the LICENSE, LICENSE-binary, NOTICE-binary files?

AI Tooling

If an AI tool was used:

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 8m 26s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 1s codespell was not available.
+0 🆗 detsecrets 0m 1s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
-1 ❌ test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ trunk Compile Tests _
+1 💚 mvninstall 30m 21s trunk passed
+1 💚 compile 1m 28s trunk passed with JDK Red Hat, Inc.-21.0.11+10-LTS
+1 💚 compile 1m 47s trunk passed with JDK Red Hat, Inc.-17.0.19+10-LTS
+1 💚 mvnsite 1m 28s trunk passed
+1 💚 shadedclient 53m 55s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 59s the patch passed
+1 💚 compile 0m 59s the patch passed with JDK Red Hat, Inc.-21.0.11+10-LTS
+1 💚 cc 0m 59s the patch passed
+1 💚 golang 0m 59s the patch passed
+1 💚 javac 0m 59s the patch passed
+1 💚 compile 0m 57s the patch passed with JDK Red Hat, Inc.-17.0.19+10-LTS
+1 💚 cc 0m 57s the patch passed
+1 💚 golang 0m 57s the patch passed
+1 💚 javac 0m 57s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 mvnsite 0m 35s the patch passed
+1 💚 shadedclient 16m 52s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 24m 15s hadoop-yarn-server-nodemanager in the patch passed.
+1 💚 asflicense 0m 39s The patch does not generate ASF License warnings.
108m 33s
Subsystem Report/Notes
Docker ClientAPI=1.54 ServerAPI=1.54 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8474/1/artifact/out/Dockerfile
GITHUB PR #8474
Optional Tests dupname asflicense compile cc mvnsite javac unit codespell detsecrets golang
uname Linux fbcdc01709b1 5.15.0-173-generic #183-Ubuntu SMP Fri Mar 6 13:29:34 UTC 2026 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 2e0af25
Default Java Red Hat, Inc.-17.0.19+10-LTS
Multi-JDK versions /usr/lib/jvm/java-21-openjdk-21.0.11.0.10-1.el8_10.x86_64:Red Hat, Inc.-21.0.11+10-LTS /usr/lib/jvm/java-17-openjdk-17.0.19.0.10-1.el8_10.x86_64:Red Hat, Inc.-17.0.19+10-LTS
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8474/1/testReport/
Max. process+thread count 615 (vs. ulimit of 10000)
modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8474/1/console
versions git=2.43.7 maven=3.9.11
Powered by Apache Yetus 0.14.1 https://yetus.apache.org

This message was automatically generated.

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 6m 47s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
-1 ❌ test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ trunk Compile Tests _
+1 💚 mvninstall 19m 10s trunk passed
+1 💚 compile 0m 55s trunk passed
+1 💚 mvnsite 0m 46s trunk passed
+1 💚 shadedclient 36m 50s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 40s the patch passed
+1 💚 compile 0m 39s the patch passed
+1 💚 cc 0m 39s the patch passed
+1 💚 golang 0m 39s the patch passed
+1 💚 javac 0m 39s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 mvnsite 0m 25s the patch passed
+1 💚 shadedclient 15m 30s patch has no errors when building and testing our client artifacts.
_ Other Tests _
-1 ❌ unit 22m 48s /patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-nodemanager.txt hadoop-yarn-server-nodemanager in the patch passed.
-1 ❌ asflicense 0m 23s /results-asflicense.txt The patch generated 1 ASF License warnings.
84m 59s
Reason Tests
Failed junit tests hadoop.yarn.server.nodemanager.amrmproxy.TestFederationInterceptor
TEST-cetest
hadoop.yarn.server.nodemanager.containermanager.resourceplugin.fpga.TestFpgaDiscoverer
Subsystem Report/Notes
Docker ClientAPI=1.54 ServerAPI=1.54 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8474/1/artifact/out/Dockerfile
GITHUB PR #8474
Optional Tests dupname asflicense compile cc mvnsite javac unit codespell detsecrets golang
uname Linux 60e96f3ea54a 5.15.0-173-generic #183-Ubuntu SMP Fri Mar 6 13:29:34 UTC 2026 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 2e0af25
Default Java Debian-25.0.3+9-2-deb13u1-Debian
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8474/1/testReport/
Max. process+thread count 753 (vs. ulimit of 10000)
modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8474/1/console
versions git=2.47.3 maven=3.9.11
Powered by Apache Yetus 0.14.1 https://yetus.apache.org

This message was automatically generated.

@hadoop-yetus

💔 -1 overall

Vote Subsystem Runtime Logfile Comment
+0 🆗 reexec 0m 22s Docker mode activated.
_ Prechecks _
+1 💚 dupname 0m 0s No case conflicting files found.
+0 🆗 codespell 0m 0s codespell was not available.
+0 🆗 detsecrets 0m 0s detect-secrets was not available.
+1 💚 @author 0m 0s The patch does not contain any @author tags.
-1 ❌ test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
_ trunk Compile Tests _
+1 💚 mvninstall 20m 3s trunk passed
+1 💚 compile 0m 57s trunk passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04
+1 💚 compile 0m 58s trunk passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1
+1 💚 mvnsite 0m 42s trunk passed
+1 💚 shadedclient 38m 11s branch has no errors when building and testing our client artifacts.
_ Patch Compile Tests _
+1 💚 mvninstall 0m 42s the patch passed
+1 💚 compile 0m 40s the patch passed with JDK Ubuntu-21.0.10+7-Ubuntu-124.04
+1 💚 cc 0m 40s the patch passed
+1 💚 golang 0m 40s the patch passed
+1 💚 javac 0m 40s the patch passed
+1 💚 compile 0m 41s the patch passed with JDK Ubuntu-17.0.18+8-Ubuntu-124.04.1
+1 💚 cc 0m 41s the patch passed
+1 💚 golang 0m 41s the patch passed
+1 💚 javac 0m 41s the patch passed
+1 💚 blanks 0m 0s The patch has no blanks issues.
+1 💚 mvnsite 0m 23s the patch passed
+1 💚 shadedclient 14m 54s patch has no errors when building and testing our client artifacts.
_ Other Tests _
+1 💚 unit 22m 48s hadoop-yarn-server-nodemanager in the patch passed.
+1 💚 asflicense 0m 22s The patch does not generate ASF License warnings.
80m 4s
Subsystem Report/Notes
Docker ClientAPI=1.54 ServerAPI=1.54 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8474/1/artifact/out/Dockerfile
GITHUB PR #8474
Optional Tests dupname asflicense compile cc mvnsite javac unit codespell detsecrets golang
uname Linux 868c4341055f 5.15.0-173-generic #183-Ubuntu SMP Fri Mar 6 13:29:34 UTC 2026 x86_64 x86_64 x86_64 GNU/Linux
Build tool maven
Personality dev-support/bin/hadoop.sh
git revision trunk / 2e0af25
Default Java Ubuntu-17.0.18+8-Ubuntu-124.04.1
Multi-JDK versions /usr/lib/jvm/java-21-openjdk-amd64:Ubuntu-21.0.10+7-Ubuntu-124.04 /usr/lib/jvm/java-17-openjdk-amd64:Ubuntu-17.0.18+8-Ubuntu-124.04.1
Test Results https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8474/1/testReport/
Max. process+thread count 612 (vs. ulimit of 10000)
modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager
Console output https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-8474/1/console
versions git=2.43.0 maven=3.9.11
Powered by Apache Yetus 0.14.1 https://yetus.apache.org

This message was automatically generated.

Copilot AI left a comment

Pull request overview

Fixes a YARN NodeManager health regression where user container exit codes 22 or 24 are interpreted as LinuxContainerExecutor unrecoverable errors, causing the NodeManager to become unhealthy and stop scheduling containers.

Changes:

  • Introduces a new native container-executor error code (85) intended to represent a “wrapped” user container exit.
  • Wraps container launch exit codes 22 and 24 to 85 for both Docker and non-Docker RUN_AS_USER_LAUNCH_* paths.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
.../container-executor/impl/util.h Adds a new enum value for a wrapped user container exit code.
.../container-executor/impl/main.c Wraps exit codes 22/24 to avoid triggering NodeManager unrecoverable ConfigurationException behavior.

Comment on lines 109 to 114
    TOO_LONG_EXECUTOR_PATH = 81,
    CANNOT_GET_EXECUTABLE_NAME_FROM_KERNEL = 82,
    CANNOT_GET_EXECUTABLE_NAME_FROM_PID = 83,
-   WRONG_PATH_OF_EXECUTABLE = 84
+   WRONG_PATH_OF_EXECUTABLE = 84,
+   WRAPPED_EXIT_CODE_USER_CONTAINER_FAILED = 85
  };
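
One property the new enum value needs is that it does not collide with the codes it replaces. A hedged sketch of a compile-time guard follows. The enum below is a standalone excerpt rather than the real `util.h`, and the values 22 and 24 for the two existing names are assumed from the PR title. The guard itself is not part of the patch.

```c
#include <assert.h> /* static_assert (C11) */

/* Standalone excerpt for illustration only; the real enum has many more
 * entries. 22 and 24 are assumed from the PR title. */
enum errorcodes_excerpt {
  INVALID_CONTAINER_EXEC_PERMISSIONS = 22,
  INVALID_CONFIG_FILE = 24,
  WRONG_PATH_OF_EXECUTABLE = 84,
  WRAPPED_EXIT_CODE_USER_CONTAINER_FAILED = 85
};

/* The wrapped code must differ from both codes it replaces. */
static_assert(WRAPPED_EXIT_CODE_USER_CONTAINER_FAILED !=
                  INVALID_CONTAINER_EXEC_PERMISSIONS,
              "wrapped code collides with exit code 22");
static_assert(WRAPPED_EXIT_CODE_USER_CONTAINER_FAILED != INVALID_CONFIG_FILE,
              "wrapped code collides with exit code 24");

/* A process exit status carries only 8 bits, so the value must fit. */
static_assert(WRAPPED_EXIT_CODE_USER_CONTAINER_FAILED > 0 &&
                  WRAPPED_EXIT_CODE_USER_CONTAINER_FAILED <= 255,
              "wrapped code must be a valid non-zero exit status");
```

If any of these conditions were violated by a later renumbering, compilation would fail instead of reintroducing the problem at runtime.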
Comment on lines +656 to +663
+ static int wrap_exit_code(int exit_code) {
+   if (exit_code == INVALID_CONTAINER_EXEC_PERMISSIONS || exit_code == INVALID_CONFIG_FILE) {
+     int wrap_code = WRAPPED_EXIT_CODE_USER_CONTAINER_FAILED;
+     fprintf(LOGFILE, "Wrapped exit code of user container from %d to %d to avoid NodeManager unhealthy...\n", exit_code, wrap_code);
+     return wrap_code;
+   } else {
+     return exit_code;
+   }
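
Since the Yetus `test4tests` check reports that the patch has no tests, a table-driven check of the mapping might look like the sketch below. It is an assumption-laden illustration and not part of the patch. `wrap_exit_code_for_test` is a hypothetical standalone copy of the logic that takes the log stream as a parameter instead of using the `LOGFILE` global, and the enum values 22 and 24 are assumed from the PR title.

```c
#include <assert.h>
#include <stdio.h>

/* Standalone excerpt; values 22 and 24 are assumed from the PR title. */
enum {
  INVALID_CONTAINER_EXEC_PERMISSIONS = 22,
  INVALID_CONFIG_FILE = 24,
  WRAPPED_EXIT_CODE_USER_CONTAINER_FAILED = 85
};

/* Hypothetical copy of the wrapping logic. Logging is skipped when `log`
 * is NULL so the function can be exercised without a log file. */
int wrap_exit_code_for_test(int exit_code, FILE *log) {
  if (exit_code == INVALID_CONTAINER_EXEC_PERMISSIONS ||
      exit_code == INVALID_CONFIG_FILE) {
    if (log != NULL) {
      fprintf(log, "Wrapped exit code of user container from %d to %d\n",
              exit_code, WRAPPED_EXIT_CODE_USER_CONTAINER_FAILED);
    }
    return WRAPPED_EXIT_CODE_USER_CONTAINER_FAILED;
  }
  return exit_code;
}

/* Table-driven check: 22 and 24 map to 85, every other code passes
 * through unchanged, including neighbours such as 23 and signal-style
 * codes such as 137 and 143. */
void check_wrap_exit_code(void) {
  static const struct {
    int in;
    int want;
  } cases[] = {
      {0, 0},   {1, 1},   {22, 85},   {23, 23},
      {24, 85}, {85, 85}, {137, 137}, {143, 143},
  };
  for (size_t i = 0; i < sizeof cases / sizeof cases[0]; i++) {
    assert(wrap_exit_code_for_test(cases[i].in, NULL) == cases[i].want);
  }
}
```

Calling `check_wrap_exit_code()` from a test driver aborts on the first mismatch. Passing `stderr` as the second argument to `wrap_exit_code_for_test` shows the log line that would be written.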