OCPBUGS-66983: Fix race condition in gather_core_dumps pod name retrieval #531

shivprakashmuley wants to merge 2 commits into openshift:main
Conversation
@shivprakashmuley: This pull request references Jira Issue OCPBUGS-66983, which is invalid.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Can you please take a look, @sferich888 @ingvagabund?
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: shivprakashmuley

The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
/jira refresh |
@shivprakashmuley: This pull request references Jira Issue OCPBUGS-66983, which is valid. 3 validation(s) were run on this bug.

Requesting review from QA contact.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `collection-scripts/gather_core_dumps`:
- Around lines 35-38: The `oc wait` call for the debug pod (the `debugPod` variable) must be checked for failure and handled: after invoking `oc wait --for=condition=Ready pod/"$debugPod"`, check its exit status and, on non-zero, delete the pod (`oc delete pod/"$debugPod"`) and stop processing this node so the script does not proceed to `oc cp`. Apply the same pattern around the `oc cp` block (check the `oc cp` return code and delete the pod on failure) so failed readiness or copy attempts clean up and short-circuit further processing.
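A minimal sketch of the pattern this comment asks for, under stated assumptions: the helper name `wait_and_copy`, the pod name, and the copied path are hypothetical, and `oc` is stubbed so the control flow can run without a cluster (the real script would call the actual CLI):

```shell
# Stub standing in for the real `oc` CLI: `oc wait` simulates a readiness
# timeout, everything else succeeds. Illustrative only.
oc() {
  case "$1" in
    wait) return 1 ;;   # simulate `oc wait` timing out
    *)    return 0 ;;   # delete/cp succeed in this stub
  esac
}

wait_and_copy() {
  local debugPod="$1"
  # Check oc wait's exit status instead of assuming readiness.
  if ! oc wait --for=condition=Ready "pod/$debugPod" --timeout=60s; then
    echo "pod $debugPod never became Ready; cleaning up" >&2
    oc delete "pod/$debugPod" -n default
    return 1
  fi
  # Same pattern for the copy: clean up the pod if oc cp fails.
  if ! oc cp "default/$debugPod:/host/var/lib/systemd/coredump" ./coredumps; then
    echo "copy from $debugPod failed; cleaning up" >&2
    oc delete "pod/$debugPod" -n default
    return 1
  fi
}

wait_and_copy example-node-debug-abcde || echo "skipped node"
```

The point is that both failure paths delete the pod and short-circuit, so a node whose debug pod never becomes Ready cannot reach the copy step.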
- Around lines 11-33: The `oc debug` invocation is started in the background, but its child PID isn't captured, so the script can't cancel or reap that process if the pod appears late. Save the background PID immediately after the `oc debug` call (e.g., `debug_pid=$!`) and use that PID to wait for, or kill and reap, the process when the loop finishes or on cleanup/failure, while still removing `tmpfile` and handling `debugPod` as before. Apply the same fix to the other `oc debug` invocation(s) that start background jobs.
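This suggestion can be sketched as follows. Everything here is a hypothetical stand-in: `oc` is a stub that just sleeps so the sketch runs anywhere, and the node name is invented.

```shell
# Stub standing in for the long-running `oc debug ... sleep 300` process.
oc() { sleep 5; }

tmpfile=$(mktemp)
oc debug node/example -- /bin/bash -c 'sleep 300' > "$tmpfile" 2>&1 &
debug_pid=$!   # capture immediately, before any other command can clobber $!

# ... the pod-name polling loop would run here ...

# On cleanup/failure paths: stop and reap the background process,
# then remove the temp file as before.
kill "$debug_pid" 2>/dev/null
wait "$debug_pid" 2>/dev/null
rm -f "$tmpfile"
echo "reaped background pid $debug_pid"
```

Capturing `$!` on the very next line matters because any later background job overwrites it; the `wait` after `kill` prevents the killed process from lingering as a zombie.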
📒 Files selected for processing (1)
collection-scripts/gather_core_dumps
sferich888 left a comment
Thanks for tackling this @shivprakashmuley — the core approach of combining the two oc debug calls and polling with exponential backoff is solid. A few things need to be addressed before we can merge:
- Background `oc debug` PID not captured or cleaned up — As @coderabbitai flagged, the `oc debug ... &` starts a background process but `$!` is never saved. If the pod name is never found (all retries exhausted), that background process is orphaned. Capture `debug_pid=$!` immediately after the background call and `kill`/`wait` it on failure paths.
- `oc wait` failure not handled — Also flagged by CodeRabbit. If `oc wait --for=condition=Ready` times out, the script still proceeds to `oc cp` because the `if [ -n "$debugPod" ]` check passes (the pod name was found; it just isn't ready). Check the exit code of `oc wait` and skip the copy / clean up the pod if it fails.
- `tmpfile` cleanup on early exit — If the function exits early or the script is killed between `mktemp` and `rm -f "$tmpfile"`, the temp file leaks. Consider adding a `trap` at the top of the function to ensure cleanup, e.g.: `trap 'rm -f "$tmpfile"' RETURN`
This ensures the temp file is removed regardless of how the function exits.
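A minimal sketch of that trap-based cleanup, assuming bash (a `RETURN` trap set inside a function fires when that function returns); the function and variable names are illustrative, not the script's real ones:

```shell
gather_from_node() {
  tmpfile=$(mktemp)                 # deliberately not local, so the caller can verify cleanup
  trap 'rm -f "$tmpfile"' RETURN    # runs however the function exits

  echo "captured output" > "$tmpfile"
  [ -s "$tmpfile" ] || return 1     # an early return still triggers the trap
  echo "processed $tmpfile"
}

gather_from_node
[ ! -e "$tmpfile" ] && echo "tmpfile cleaned up"
```

An `EXIT` trap would cover the whole-script-killed case as well, at the cost of being global rather than per-function.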
Please address these and we should be good to go. The rest of the change looks great — the exponential backoff, graceful oc cp warning, and PID tracking in gather_core_dump_data are all improvements.
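The exponential-backoff polling mentioned above can be sketched as follows. This is a local simulation, not the script itself: the `oc debug` output is pre-written to the temp file, whereas the real script appends it asynchronously in the background.

```shell
tmpfile=$(mktemp)
# Simulated `oc debug` output containing the pod name.
echo "Starting pod/example-node-debug-7rd2p ..." > "$tmpfile"

debugPod=""
attempt=0
max_attempts=6
delay=0.5
while [ -z "$debugPod" ] && [ "$attempt" -lt "$max_attempts" ]; do
  # Scrape the pod name out of the captured oc debug output.
  debugPod=$(sed -n 's/.*pod\/\([^ ]*\).*/\1/p' "$tmpfile" 2>/dev/null | head -1)
  [ -n "$debugPod" ] && break
  sleep "$delay"
  delay=$(awk -v d="$delay" 'BEGIN { print d * 2 }')   # double the wait each try
  attempt=$((attempt + 1))
done
rm -f "$tmpfile"
echo "debugPod=$debugPod"
```

Doubling the delay keeps the first check fast while bounding the total wait when the pod never appears.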
Force-pushed from 9998690 to 8eaf470.
Thanks @sferich888 for the comments. I have pushed the relevant changes. Please take another look.
@shivprakashmuley: all tests passed! Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Hi @sferich888, can you please review the latest changes. |
Hi @sferich888, @ingvagabund, can you please review and approve if everything looks good?
sferich888 left a comment
Reviewing from my phone, but some of these changes seem odd or may simply need to be explained.
```diff
-#Start Debug pod force it to stay up until removed in "default" namespace
-oc debug --to-namespace="default" node/"$1" -- /bin/bash -c 'sleep 300' >/dev/null 2>&1 &
+# Start Debug pod in background and capture output to get pod name
+oc debug --to-namespace="default" node/"$node" -- /bin/bash -c 'sleep 300' > "$tmpfile" 2>&1 &
```
We probably don't want this in the default namespace! We likely want it in the must-gather namespace, so that when it's deleted these pods are cleaned up.
```diff
-#clean up debug pod after we are done using them
-oc delete pod "$debugPod" -n "default"
+kill "${oc_debug_pid}" 2>/dev/null
```
Why move away from deleting the pod?
```shell
sleep 0.5

while [ -z "$debugPod" ] && [ $attempt -lt $max_attempts ]; do
  debugPod=$(sed -n 's/.*pod\/\([^ ]*\).*/\1/p' "$tmpfile" 2>/dev/null | head -1)
```
I would not rely on whatever is printed in the logs as something referential. I don't think anyone will consider "Starting pod/xyz-debug-abc..." string as a part of the API.
Instead of reading the pod name from the logs would it help to fetch the corresponding pod name via oc get pod while setting the right label selector (debug.openshift.io/managed-by=oc-debug) and checking for the right node name?
If this path does not work well (e.g. when two or more pods are still listed) oc debug nodes/... command could be extended with an extra option for adding custom labels.
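A minimal sketch of that label-selector lookup, under stated assumptions: `oc` is stubbed to return a single pod name so the sketch runs locally, the node name is invented, and it assumes only one debug pod per node at a time (the exact ambiguity discussed below):

```shell
# Stub standing in for the real CLI; returns one matching pod name.
oc() { echo "example-node-debug-7rd2p"; }

node="example-node"
# Select on the label oc debug applies, narrowed to pods scheduled on
# the target node, instead of parsing "Starting pod/..." log output.
debugPod=$(oc get pods -n default \
  -l debug.openshift.io/managed-by=oc-debug \
  --field-selector "spec.nodeName=$node" \
  -o jsonpath='{.items[0].metadata.name}')
echo "debugPod=$debugPod"
```

As noted in the thread, this breaks down when two gather runs target the same node, since both debug pods match the selector.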
I have tried to check whether we can leverage labels to identify the corresponding debug pod:
```
$ oc get po smuley-20260407-520b9-879rp-master-0-debug-7rd2p -n default -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    debug.openshift.io/source-container: container-00
    debug.openshift.io/source-resource: /v1, Resource=nodes/smuley-20260407-520b9-879rp-master-0
    openshift.io/required-scc: privileged
  creationTimestamp: "2026-04-10T07:44:50Z"
  generation: 1
  labels:
    debug.openshift.io/managed-by: oc-debug
  name: smuley-20260407-520b9-879rp-master-0-debug-7rd2p
  namespace: default
  resourceVersion: "1003576"
  uid: 3f7d2587-4025-4c6a-8f2f-d64f5898af55
spec:
  containers:
  - command:
    - /bin/bash
    - -c
    - sleep 300
    env:
    - name: TMOUT
      value: "900"
    - name: HOST
      value: /host
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1bf238da7a1100201223f287894d6cb90d7ac5eb6bcd3311382bcc6774cae8c0
    imagePullPolicy: IfNotPresent
  imagePullSecrets:
  - name: default-dockercfg-s4xcc
  nodeName: smuley-20260407-520b9-879rp-master-0
```
I thought we could go with the label + `spec.nodeName`, but in the case of two runs of gather_core_dumps on the same node, we won't be able to identify the pod corresponding to a particular run.
And from my understanding, we cannot inject a custom label via the `oc debug` command.
Thoughts, @ingvagabund?
Would you have time to investigate the possibility of extending oc debug nodes with a new option for a list of custom labels?
> Would you have time to investigate the possibility of extending `oc debug nodes` with a new option for a list of custom labels?
Hi @ingvagabund, I can try to investigate. That would require another story/ticket to track, right?
Can you please suggest the process usually followed for oc changes? Do we need an enhancement proposal for the `oc debug nodes` changes? Also, this will add more time to resolving #531; please share your thoughts on whether we are OK with that.
Thank you.
> That would require another story/ticket to track right?
That's up to you how you track the effort.
> Can you please suggest on the process usually followed for oc changes. Do we need an enhancement proposal for the oc debug nodes changes?
Given this is only about adding a new flag localized to `oc debug node`, no enhancement is needed; just provide a meaningful description when opening a PR under openshift/oc.
> Also this means it will add more time in resolving #531 please share your thoughts on if we are ok with that.
Not sure who "we" is in this context :). Also, it's up to you or your team how much time you have allocated for resolving the issue. I am not aware of any time constraints here.
As per the discussion with @anuragthehatter, raising this PR with his original changes in #517.
The current logic failed on one of the Prow CI runs, where a node indeed had a core dump present but the dump was not collected by the `gather_core_dumps` command run underneath `oc adm must-gather` in a Prow chain. It seems we may have been missing core dump collection for a long time due to this race.
Failed logs on prod builds here
We saw a race condition where the dump copy was attempted after the debug pod had already been removed.