
OCPBUGS-66983: Fix race condition in gather_core_dumps pod name retrieval#531

Open
shivprakashmuley wants to merge 2 commits into openshift:main from shivprakashmuley:gather_core_dumps_fix

Conversation

@shivprakashmuley
Contributor

@shivprakashmuley shivprakashmuley commented Mar 10, 2026

As per the discussion with @anuragthehatter, I am raising this PR with his original changes from #517.

The current logic failed on one of the Prow CI runs: the node did have a core dump present, but the dump was not collected by the oc adm must-gather -- gather_core_dumps command running underneath a Prow chain. It seems we may have been missing core dump collection for a long time because of this race issue.

Failed logs from the prod builds are here.

We saw a race condition where the script tried to copy the dump, but the debug pod had already been removed.

@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Mar 10, 2026
@openshift-ci-robot

@shivprakashmuley: This pull request references Jira Issue OCPBUGS-66983, which is invalid:

  • expected the bug to target the "4.22.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

As per the discussion with @anuragthehatter, I am raising this PR with his original changes from #517.

The current logic failed on one of the Prow CI runs: the node did have a core dump present, but the dump was not collected by the oc adm must-gather -- gather_core_dumps command running underneath a Prow chain. It seems we may have been missing core dump collection for a long time because of this race issue.

Failed logs from the prod builds are here.

We saw a race condition where the script tried to copy the dump, but the debug pod had already been removed.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented Mar 10, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c5e58b6e-131c-48fe-a55b-685b761cf8ae

📥 Commits

Reviewing files that changed from the base of the PR and between 9998690 and 8eaf470.

📒 Files selected for processing (1)
  • collection-scripts/gather_core_dumps
🚧 Files skipped from review as they are similar to previous changes (1)
  • collection-scripts/gather_core_dumps

Walkthrough

Replaces synchronous oc debug usage by starting debug pods asynchronously, polling with exponential backoff to obtain the debug pod name and wait for Ready, adds non-fatal error handling for oc cp, tracks background PIDs for coordinated waiting, and ensures debug pod/process cleanup and temp-file removal.
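
For illustration, a minimal bash sketch of that flow, assuming a per-node function receives the node name (the variable names, backoff bounds, and namespace here are placeholders, not the exact patch):

```bash
node="$1"
tmpfile=$(mktemp)

# Start the debug pod in the background; its output carries the
# "Starting pod/<name> ..." line the script needs.
oc debug --to-namespace="default" node/"$node" -- /bin/bash -c 'sleep 300' > "$tmpfile" 2>&1 &
debug_pid=$!

# Poll the captured output with exponential backoff until the pod name appears.
debugPod=""
attempt=0
max_attempts=6
delay=1
while [ -z "$debugPod" ] && [ "$attempt" -lt "$max_attempts" ]; do
    sleep "$delay"
    debugPod=$(sed -n 's/.*pod\/\([^ ]*\).*/\1/p' "$tmpfile" 2>/dev/null | head -1)
    attempt=$((attempt + 1))
    delay=$((delay * 2))
done

if [ -n "$debugPod" ]; then
    # Wait for Ready before copying; on timeout, delete without blocking.
    if ! oc wait --for=condition=Ready "pod/$debugPod" -n default --timeout=60s; then
        oc delete pod "$debugPod" -n default --wait=false
    fi
fi
rm -f "$tmpfile"
```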

Changes

  • Debug pod lifecycle & polling (collection-scripts/gather_core_dumps): Starts oc debug asynchronously, captures output to a temp file, uses exponential-backoff polling to extract the debug pod name, explicitly waits for pod Ready with timeout handling, deletes the debug pod non-blocking on timeout, and cleans up temp files.
  • Copy, error handling & PID coordination (collection-scripts/gather_core_dumps): Makes oc cp failures non-fatal (logs a warning instead of exiting), appends each get_dump_off_node background PID into PIDS for coordinated waiting (sketched below), and ensures background process termination/cleanup after copy/wait paths.
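
A minimal illustration of that PID-coordination pattern, assuming a get_dump_off_node function and a NODES list as in the script:

```bash
# Launch one collection per node in the background, recording each PID,
# then wait on exactly those PIDs so no collector is orphaned.
PIDS=()
for node in ${NODES}; do
    get_dump_off_node "$node" &
    PIDS+=($!)
done
wait "${PIDS[@]}"
```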

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Title check: ✅ Passed. The title clearly and specifically describes the main change: fixing a race condition in the gather_core_dumps pod name retrieval process.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
  • Stable And Deterministic Test Names: ✅ Passed. This PR modifies a bash shell script (gather_core_dumps) and does not introduce any Ginkgo test files or test definitions.
  • Test Structure And Quality: ✅ Passed. The custom check for Ginkgo test structure is not applicable to this PR as it modifies shell scripts using BATS testing, not Ginkgo.
  • Description Check: ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.



Comment @coderabbitai help to get the list of available commands and usage tips.


@shivprakashmuley
Contributor Author

Can you please take a look, @sferich888 @ingvagabund?

@openshift-ci
Contributor

openshift-ci Bot commented Mar 10, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: shivprakashmuley
Once this PR has been reviewed and has the lgtm label, please assign sferich888 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@shivprakashmuley
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Mar 10, 2026
@openshift-ci-robot

@shivprakashmuley: This pull request references Jira Issue OCPBUGS-66983, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @KeenonLee

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot requested a review from KeenonLee March 10, 2026 14:13

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@collection-scripts/gather_core_dumps`:
- Around line 35-38: The oc wait call for the debug pod (the debugPod variable)
must be checked for failure and handled: after invoking oc wait
--for=condition=Ready pod/"$debugPod" add a conditional check on its exit status
and, on non-zero, delete the pod (oc delete pod/"$debugPod") and exit the script
for this node so you do not proceed to oc cp; apply the same pattern around the
oc cp block (check oc cp return code and delete the pod on failure) to ensure
failed readiness or copy attempts clean up and short-circuit further processing.
- Around line 11-33: The oc debug invocation is started in background but its
child PID isn’t captured, so the script can’t cancel or reap that process if the
pod appears late; save the background PID immediately after the oc debug call
(e.g., debug_pid=$!) and use that PID to wait for or kill/reap the process when
the loop finishes or on cleanup/failure, ensuring you still remove tmpfile and
handle debugPod as before; apply the same fix for the other oc debug
invocation(s) that also start background jobs, referencing tmpfile, oc debug,
debugPod, and $! (store as debug_pid and use wait/kill).
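
A hedged sketch of how both findings could be addressed together ($node, $tmpfile, $CORE_DUMP_DIR, and $OUT_DIR are placeholders standing in for the surrounding script's variables):

```bash
oc debug --to-namespace="default" node/"$node" -- /bin/bash -c 'sleep 300' > "$tmpfile" 2>&1 &
debug_pid=$!   # finding 2: keep the background PID so it can be reaped or killed

# ... backoff polling populates $debugPod ...

# Finding 1: check the oc wait exit status and short-circuit on failure.
if ! oc wait --for=condition=Ready "pod/$debugPod" -n default --timeout=60s; then
    echo "WARNING: debug pod $debugPod never became Ready; skipping copy" >&2
    oc delete pod "$debugPod" -n default --wait=false
    kill "$debug_pid" 2>/dev/null
    rm -f "$tmpfile"
    return 1
fi

# Apply the same pattern to the copy: warn and clean up rather than exit.
if ! oc cp "default/$debugPod:$CORE_DUMP_DIR" "$OUT_DIR"; then
    echo "WARNING: oc cp from $debugPod failed" >&2
fi
```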

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 63010c4a-9d0d-4887-90ef-5cce320c7100

📥 Commits

Reviewing files that changed from the base of the PR and between d7db85b and 9998690.

📒 Files selected for processing (1)
  • collection-scripts/gather_core_dumps

Contributor

@sferich888 sferich888 left a comment


Thanks for tackling this @shivprakashmuley — the core approach of combining the two oc debug calls and polling with exponential backoff is solid. A few things need to be addressed before we can merge:

  1. Background oc debug PID not captured or cleaned up — As @coderabbitai flagged, the oc debug ... & starts a background process but $! is never saved. If the pod name is never found (all retries exhausted), that background process is orphaned. Capture debug_pid=$! immediately after the background call and kill/wait it on failure paths.

  2. oc wait failure not handled — Also flagged by CodeRabbit. If oc wait --for=condition=Ready times out, the script still proceeds to oc cp because the if [ -n "$debugPod" ] check passes (the pod name was found, it just isn't ready). Check the exit code of oc wait and skip the copy / clean up the pod if it fails.

  3. tmpfile cleanup on early exit — If the function exits early or the script is killed between mktemp and rm -f "$tmpfile", the temp file leaks. Consider adding a trap at the top of the function to ensure cleanup, e.g.:

    trap 'rm -f "$tmpfile"' RETURN

    This ensures the temp file is removed regardless of how the function exits.

Please address these and we should be good to go. The rest of the change looks great — the exponential backoff, graceful oc cp warning, and PID tracking in gather_core_dump_data are all improvements.
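
Putting the three points together, the failure-path skeleton might look like this (a sketch under the assumptions above, not the actual patch):

```bash
get_dump_off_node() {
    local node="$1"
    local tmpfile debugPod debug_pid
    tmpfile=$(mktemp)
    # Point 3: the temp file is removed however the function returns.
    trap 'rm -f "$tmpfile"' RETURN

    oc debug --to-namespace="default" node/"$node" -- /bin/bash -c 'sleep 300' > "$tmpfile" 2>&1 &
    # Point 1: capture the background PID immediately so it is never orphaned.
    debug_pid=$!

    # ... exponential-backoff polling for $debugPod ...

    # Point 2: skip the copy and clean up if the pod never becomes Ready.
    if ! oc wait --for=condition=Ready "pod/$debugPod" -n default --timeout=60s; then
        oc delete pod "$debugPod" -n default --wait=false
        kill "$debug_pid" 2>/dev/null
        return 1
    fi

    # ... oc cp with a non-fatal warning, then pod deletion and wait "$debug_pid" ...
}
```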

@openshift-ci-robot

@shivprakashmuley: This pull request references Jira Issue OCPBUGS-66983, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @KeenonLee

Details

In response to this:

As per the discussion with @anuragthehatter, I am raising this PR with his original changes from #517.

The current logic failed on one of the Prow CI runs: the node did have a core dump present, but the dump was not collected by the oc adm must-gather -- gather_core_dumps command running underneath a Prow chain. It seems we may have been missing core dump collection for a long time because of this race issue.

Failed logs from the prod builds are here.

We saw a race condition where the script tried to copy the dump, but the debug pod had already been removed.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@shivprakashmuley
Contributor Author

Thanks @sferich888 for the comments. I have pushed the relevant changes; please take another look.

@openshift-ci
Contributor

openshift-ci Bot commented Mar 20, 2026

@shivprakashmuley: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@shivprakashmuley
Contributor Author

Hi @sferich888, can you please review the latest changes?

@shivprakashmuley
Contributor Author

Hi @sferich888, @ingvagabund, can you please review and approve if everything looks good?

Contributor

@sferich888 sferich888 left a comment


Reviewing from my phone, but some of these changes seem odd or may simply need to be explained.

#Start Debug pod force it to stay up until removed in "default" namespace
oc debug --to-namespace="default" node/"$1" -- /bin/bash -c 'sleep 300' >/dev/null 2>&1 &
# Start Debug pod in background and capture output to get pod name
oc debug --to-namespace="default" node/"$node" -- /bin/bash -c 'sleep 300' > "$tmpfile" 2>&1 &
Contributor


We probably don't want this in the default namespace! We likely want it in the must-gather namespace so that when it's deleted, these pods are cleaned up.


#clean up debug pod after we are done using them
oc delete pod "$debugPod" -n "default"
kill "${oc_debug_pid}" 2>/dev/null
Contributor


Why move away from deleting the pod?

Contributor Author


sleep 0.5

while [ -z "$debugPod" ] && [ $attempt -lt $max_attempts ]; do
debugPod=$(sed -n 's/.*pod\/\([^ ]*\).*/\1/p' "$tmpfile" 2>/dev/null | head -1)
Member


I would not rely on whatever is printed in the logs as something referential. I don't think anyone will consider the "Starting pod/xyz-debug-abc..." string to be part of the API.

sleep 0.5

while [ -z "$debugPod" ] && [ $attempt -lt $max_attempts ]; do
debugPod=$(sed -n 's/.*pod\/\([^ ]*\).*/\1/p' "$tmpfile" 2>/dev/null | head -1)
Member

@ingvagabund ingvagabund Apr 9, 2026


Instead of reading the pod name from the logs, would it help to fetch the corresponding pod name via oc get pod, setting the right label selector (debug.openshift.io/managed-by=oc-debug) and checking for the right node name?

If this path does not work well (e.g. when two or more pods are still listed), the oc debug nodes/... command could be extended with an extra option for adding custom labels.
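
For illustration, the lookup could look roughly like this (a sketch; the namespace and variable names are assumptions):

```bash
# Find the debug pod for this node via the oc-debug label instead of
# parsing "Starting pod/..." from captured output.
debugPod=$(oc get pods -n default \
    -l debug.openshift.io/managed-by=oc-debug \
    --field-selector spec.nodeName="$node" \
    -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
```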

Contributor Author

@shivprakashmuley shivprakashmuley Apr 10, 2026


I have tried to check whether we can leverage labels to identify the corresponding debug pod:

```
$ oc get po smuley-20260407-520b9-879rp-master-0-debug-7rd2p -n default -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    debug.openshift.io/source-container: container-00
    debug.openshift.io/source-resource: /v1, Resource=nodes/smuley-20260407-520b9-879rp-master-0
    openshift.io/required-scc: privileged
  creationTimestamp: "2026-04-10T07:44:50Z"
  generation: 1
  labels:
    debug.openshift.io/managed-by: oc-debug
  name: smuley-20260407-520b9-879rp-master-0-debug-7rd2p
  namespace: default
  resourceVersion: "1003576"
  uid: 3f7d2587-4025-4c6a-8f2f-d64f5898af55
spec:
  containers:
  - command:
    - /bin/bash
    - -c
    - sleep 300
    env:
    - name: TMOUT
      value: "900"
    - name: HOST
      value: /host
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1bf238da7a1100201223f287894d6cb90d7ac5eb6bcd3311382bcc6774cae8c0
    imagePullPolicy: IfNotPresent
  imagePullSecrets:
  - name: default-dockercfg-s4xcc
  nodeName: smuley-20260407-520b9-879rp-master-0
```

I thought we could go with the label plus spec.nodeName, but in the case of two runs of gather_core_dumps on the same node, we won't be able to identify the pod corresponding to a particular run.

And from my understanding, we cannot inject a custom label via the oc debug command.

Thoughts, @ingvagabund?

Member


Would you have time to investigate the possibility of extending oc debug nodes with a new option for a list of custom labels?
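
To make the idea concrete, here is a purely hypothetical sketch: the --labels flag below does not exist in oc debug today and is exactly what would need to be added.

```bash
# HYPOTHETICAL: --labels is the proposed new flag, not a real oc option.
run_id="gather-core-dumps-$(date +%s)-$$"
oc debug --to-namespace="default" node/"$node" \
    --labels "run-id=$run_id" -- /bin/bash -c 'sleep 300' &

# With a per-run label, each invocation can find exactly its own pod,
# even when two runs target the same node concurrently.
debugPod=$(oc get pods -n default -l "run-id=$run_id" \
    -o jsonpath='{.items[0].metadata.name}')
```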

Contributor Author

@shivprakashmuley shivprakashmuley Apr 13, 2026


Would you have time to investigate the possibility of extending oc debug nodes with a new option for a list of custom labels?

Hi @ingvagabund, I can try to investigate. That would require another story/ticket to track, right?
Can you please suggest the process usually followed for oc changes? Do we need an enhancement proposal for the oc debug nodes changes? Also, this means it will add more time to resolving #531; please share your thoughts on whether we are OK with that.

Member

@ingvagabund ingvagabund Apr 14, 2026


Thank you.

That would require another story/ticket to track, right?

That's up to you how you track the effort.

Can you please suggest the process usually followed for oc changes? Do we need an enhancement proposal for the oc debug nodes changes?

Given this is only about adding a new flag localized to oc debug node, no enhancement proposal is needed. Just provide a meaningful description when opening the PR under openshift/oc.

Also, this means it will add more time to resolving #531; please share your thoughts on whether we are OK with that.

Not sure who "we" is in this context :). Also, it's up to you or your team how much time you have allocated for resolving the issue. I am not aware of any time constraints here.

