
OCPBUGS-66983: Fix race condition in gather_core_dumps pod name retrieval#531

Open
shivprakashmuley wants to merge 2 commits into openshift:main from shivprakashmuley:gather_core_dumps_fix

Conversation

@shivprakashmuley
Contributor

@shivprakashmuley shivprakashmuley commented Mar 10, 2026

As per the discussion with @anuragthehatter, I am raising this PR with his original changes from #517.

The current logic failed on one of the Prow CI runs: the node did have a core dump present, but the dump was not collected by the oc adm must-gather -- gather_core_dumps command running underneath a Prow chain. It seems we may have been missing core dump collection for a long time because of this race issue.

Failed logs from the prod builds are here.

We saw a race condition where the script tried to copy the dump, but the debug pod had already been removed.

@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Mar 10, 2026
@openshift-ci-robot

@shivprakashmuley: This pull request references Jira Issue OCPBUGS-66983, which is invalid:

  • expected the bug to target the "4.22.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

As per the discussion with @anuragthehatter, I am raising this PR with his original changes from #517.

The current logic failed on one of the Prow CI runs: the node did have a core dump present, but the dump was not collected by the oc adm must-gather -- gather_core_dumps command running underneath a Prow chain. It seems we may have been missing core dump collection for a long time because of this race issue.

Failed logs from the prod builds are here.

We saw a race condition where the script tried to copy the dump, but the debug pod had already been removed.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented Mar 10, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c5e58b6e-131c-48fe-a55b-685b761cf8ae

📥 Commits

Reviewing files that changed from the base of the PR and between 9998690 and 8eaf470.

📒 Files selected for processing (1)
  • collection-scripts/gather_core_dumps
🚧 Files skipped from review as they are similar to previous changes (1)
  • collection-scripts/gather_core_dumps

Walkthrough

Replaces synchronous oc debug usage by starting debug pods asynchronously, polling with exponential backoff to obtain the debug pod name and wait for Ready, adds non-fatal error handling for oc cp, tracks background PIDs for coordinated waiting, and ensures debug pod/process cleanup and temp-file removal.
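
For illustration, a minimal bash sketch of that flow, assuming a per-node function receives the node name (the variable names, backoff bounds, and namespace here are placeholders, not the exact patch):

```bash
node="$1"
tmpfile=$(mktemp)

# Start the debug pod in the background; its output carries the
# "Starting pod/<name> ..." line the script needs.
oc debug --to-namespace="default" node/"$node" -- /bin/bash -c 'sleep 300' > "$tmpfile" 2>&1 &
debug_pid=$!

# Poll the captured output with exponential backoff until the pod name appears.
debugPod=""
attempt=0
max_attempts=6
delay=1
while [ -z "$debugPod" ] && [ "$attempt" -lt "$max_attempts" ]; do
    sleep "$delay"
    debugPod=$(sed -n 's/.*pod\/\([^ ]*\).*/\1/p' "$tmpfile" 2>/dev/null | head -1)
    attempt=$((attempt + 1))
    delay=$((delay * 2))
done

if [ -n "$debugPod" ]; then
    # Wait for Ready before copying; on timeout, delete without blocking.
    if ! oc wait --for=condition=Ready "pod/$debugPod" -n default --timeout=60s; then
        oc delete pod "$debugPod" -n default --wait=false
    fi
fi
rm -f "$tmpfile"
```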

Changes

  • Debug pod lifecycle & polling (collection-scripts/gather_core_dumps): Starts oc debug asynchronously, captures output to a temp file, uses exponential-backoff polling to extract the debug pod name, explicitly waits for pod Ready with timeout handling, deletes the debug pod non-blocking on timeout, and cleans up temp files.
  • Copy, error handling & PID coordination (collection-scripts/gather_core_dumps): Makes oc cp failures non-fatal (logs a warning instead of exiting), appends each get_dump_off_node background PID into PIDS for coordinated waiting (sketched below), and ensures background process termination/cleanup after copy/wait paths.
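
A minimal illustration of that PID-coordination pattern, assuming a get_dump_off_node function and a NODES list as in the script:

```bash
# Launch one collection per node in the background, recording each PID,
# then wait on exactly those PIDs so no collector is orphaned.
PIDS=()
for node in ${NODES}; do
    get_dump_off_node "$node" &
    PIDS+=($!)
done
wait "${PIDS[@]}"
```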

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
  • Title check: ✅ Passed. The title clearly and specifically describes the main change: fixing a race condition in the gather_core_dumps pod name retrieval process.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
  • Stable And Deterministic Test Names: ✅ Passed. This PR modifies a bash shell script (gather_core_dumps) and does not introduce any Ginkgo test files or test definitions.
  • Test Structure And Quality: ✅ Passed. The custom check for Ginkgo test structure is not applicable to this PR as it modifies shell scripts using BATS testing, not Ginkgo.
  • Description Check: ✅ Passed. Check skipped: CodeRabbit's high-level summary is enabled.



Comment @coderabbitai help to get the list of available commands and usage tips.


@shivprakashmuley
Contributor Author

Can you please take a look, @sferich888 @ingvagabund?

@openshift-ci
Contributor

openshift-ci Bot commented Mar 10, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: shivprakashmuley
Once this PR has been reviewed and has the lgtm label, please assign sferich888 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@shivprakashmuley
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Mar 10, 2026
@openshift-ci-robot

@shivprakashmuley: This pull request references Jira Issue OCPBUGS-66983, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @KeenonLee

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot requested a review from KeenonLee March 10, 2026 14:13

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@collection-scripts/gather_core_dumps`:
- Around line 35-38: The oc wait call for the debug pod (the debugPod variable)
must be checked for failure and handled: after invoking oc wait
--for=condition=Ready pod/"$debugPod" add a conditional check on its exit status
and, on non-zero, delete the pod (oc delete pod/"$debugPod") and exit the script
for this node so you do not proceed to oc cp; apply the same pattern around the
oc cp block (check oc cp return code and delete the pod on failure) to ensure
failed readiness or copy attempts clean up and short-circuit further processing.
- Around line 11-33: The oc debug invocation is started in background but its
child PID isn’t captured, so the script can’t cancel or reap that process if the
pod appears late; save the background PID immediately after the oc debug call
(e.g., debug_pid=$!) and use that PID to wait for or kill/reap the process when
the loop finishes or on cleanup/failure, ensuring you still remove tmpfile and
handle debugPod as before; apply the same fix for the other oc debug
invocation(s) that also start background jobs, referencing tmpfile, oc debug,
debugPod, and $! (store as debug_pid and use wait/kill).
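
A hedged sketch of how both findings could be addressed together ($node, $tmpfile, $CORE_DUMP_DIR, and $OUT_DIR are placeholders standing in for the surrounding script's variables):

```bash
oc debug --to-namespace="default" node/"$node" -- /bin/bash -c 'sleep 300' > "$tmpfile" 2>&1 &
debug_pid=$!   # finding 2: keep the background PID so it can be reaped or killed

# ... backoff polling populates $debugPod ...

# Finding 1: check the oc wait exit status and short-circuit on failure.
if ! oc wait --for=condition=Ready "pod/$debugPod" -n default --timeout=60s; then
    echo "WARNING: debug pod $debugPod never became Ready; skipping copy" >&2
    oc delete pod "$debugPod" -n default --wait=false
    kill "$debug_pid" 2>/dev/null
    rm -f "$tmpfile"
    return 1
fi

# Apply the same pattern to the copy: warn and clean up rather than exit.
if ! oc cp "default/$debugPod:$CORE_DUMP_DIR" "$OUT_DIR"; then
    echo "WARNING: oc cp from $debugPod failed" >&2
fi
```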

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 63010c4a-9d0d-4887-90ef-5cce320c7100

📥 Commits

Reviewing files that changed from the base of the PR and between d7db85b and 9998690.

📒 Files selected for processing (1)
  • collection-scripts/gather_core_dumps

Contributor

@sferich888 sferich888 left a comment


Thanks for tackling this @shivprakashmuley — the core approach of combining the two oc debug calls and polling with exponential backoff is solid. A few things need to be addressed before we can merge:

  1. Background oc debug PID not captured or cleaned up — As @coderabbitai flagged, the oc debug ... & starts a background process but $! is never saved. If the pod name is never found (all retries exhausted), that background process is orphaned. Capture debug_pid=$! immediately after the background call and kill/wait it on failure paths.

  2. oc wait failure not handled — Also flagged by CodeRabbit. If oc wait --for=condition=Ready times out, the script still proceeds to oc cp because the if [ -n "$debugPod" ] check passes (the pod name was found, it just isn't ready). Check the exit code of oc wait and skip the copy / clean up the pod if it fails.

  3. tmpfile cleanup on early exit — If the function exits early or the script is killed between mktemp and rm -f "$tmpfile", the temp file leaks. Consider adding a trap at the top of the function to ensure cleanup, e.g.:

    trap 'rm -f "$tmpfile"' RETURN

    This ensures the temp file is removed regardless of how the function exits.

Please address these and we should be good to go. The rest of the change looks great — the exponential backoff, graceful oc cp warning, and PID tracking in gather_core_dump_data are all improvements.
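
Putting the three points together, the failure-path skeleton might look like this (a sketch under the assumptions above, not the actual patch):

```bash
get_dump_off_node() {
    local node="$1"
    local tmpfile debugPod debug_pid
    tmpfile=$(mktemp)
    # Point 3: the temp file is removed however the function returns.
    trap 'rm -f "$tmpfile"' RETURN

    oc debug --to-namespace="default" node/"$node" -- /bin/bash -c 'sleep 300' > "$tmpfile" 2>&1 &
    # Point 1: capture the background PID immediately so it is never orphaned.
    debug_pid=$!

    # ... exponential-backoff polling for $debugPod ...

    # Point 2: skip the copy and clean up if the pod never becomes Ready.
    if ! oc wait --for=condition=Ready "pod/$debugPod" -n default --timeout=60s; then
        oc delete pod "$debugPod" -n default --wait=false
        kill "$debug_pid" 2>/dev/null
        return 1
    fi

    # ... oc cp with a non-fatal warning, then pod deletion and wait "$debug_pid" ...
}
```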

@openshift-ci-robot

@shivprakashmuley: This pull request references Jira Issue OCPBUGS-66983, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @KeenonLee

Details

In response to this:

As per the discussion with @anuragthehatter, I am raising this PR with his original changes from #517.

The current logic failed on one of the Prow CI runs: the node did have a core dump present, but the dump was not collected by the oc adm must-gather -- gather_core_dumps command running underneath a Prow chain. It seems we may have been missing core dump collection for a long time because of this race issue.

Failed logs from the prod builds are here.

We saw a race condition where the script tried to copy the dump, but the debug pod had already been removed.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@shivprakashmuley
Contributor Author

Thanks @sferich888 for the comments. I have pushed the relevant changes; please take another look.

@openshift-ci
Contributor

openshift-ci Bot commented Mar 20, 2026

@shivprakashmuley: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@shivprakashmuley
Contributor Author

Hi @sferich888, can you please review the latest changes?

@shivprakashmuley
Contributor Author

Hi @sferich888, @ingvagabund, can you please review and approve if everything looks good?

Contributor

@sferich888 sferich888 left a comment


Reviewing from my phone, but some of these changes seem odd or may simply need to be explained.

#Start Debug pod force it to stay up until removed in "default" namespace
oc debug --to-namespace="default" node/"$1" -- /bin/bash -c 'sleep 300' >/dev/null 2>&1 &
# Start Debug pod in background and capture output to get pod name
oc debug --to-namespace="default" node/"$node" -- /bin/bash -c 'sleep 300' > "$tmpfile" 2>&1 &
Contributor


We probably don't want this in the default namespace! We likely want it in the must-gather namespace so that when it's deleted, these pods are cleaned up.


#clean up debug pod after we are done using them
oc delete pod "$debugPod" -n "default"
kill "${oc_debug_pid}" 2>/dev/null
Contributor


Why move away from deleting the pod?

Contributor Author


sleep 0.5

while [ -z "$debugPod" ] && [ $attempt -lt $max_attempts ]; do
debugPod=$(sed -n 's/.*pod\/\([^ ]*\).*/\1/p' "$tmpfile" 2>/dev/null | head -1)
Member


I would not rely on whatever is printed in the logs as something referential. I don't think anyone will consider the "Starting pod/xyz-debug-abc..." string to be part of the API.

sleep 0.5

while [ -z "$debugPod" ] && [ $attempt -lt $max_attempts ]; do
debugPod=$(sed -n 's/.*pod\/\([^ ]*\).*/\1/p' "$tmpfile" 2>/dev/null | head -1)
Member

@ingvagabund ingvagabund Apr 9, 2026


Instead of reading the pod name from the logs, would it help to fetch the corresponding pod name via oc get pod, setting the right label selector (debug.openshift.io/managed-by=oc-debug) and checking for the right node name?

If this path does not work well (e.g. when two or more pods are still listed), the oc debug nodes/... command could be extended with an extra option for adding custom labels.
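
For illustration, the lookup could look roughly like this (a sketch; the namespace and variable names are assumptions):

```bash
# Find the debug pod for this node via the oc-debug label instead of
# parsing "Starting pod/..." from captured output.
debugPod=$(oc get pods -n default \
    -l debug.openshift.io/managed-by=oc-debug \
    --field-selector spec.nodeName="$node" \
    -o jsonpath='{.items[0].metadata.name}' 2>/dev/null)
```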

Contributor Author

@shivprakashmuley shivprakashmuley Apr 10, 2026


I have tried to check whether we can leverage labels to identify the corresponding debug pod:

```
$ oc get po smuley-20260407-520b9-879rp-master-0-debug-7rd2p -n default -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    debug.openshift.io/source-container: container-00
    debug.openshift.io/source-resource: /v1, Resource=nodes/smuley-20260407-520b9-879rp-master-0
    openshift.io/required-scc: privileged
  creationTimestamp: "2026-04-10T07:44:50Z"
  generation: 1
  labels:
    debug.openshift.io/managed-by: oc-debug
  name: smuley-20260407-520b9-879rp-master-0-debug-7rd2p
  namespace: default
  resourceVersion: "1003576"
  uid: 3f7d2587-4025-4c6a-8f2f-d64f5898af55
spec:
  containers:
  - command:
    - /bin/bash
    - -c
    - sleep 300
    env:
    - name: TMOUT
      value: "900"
    - name: HOST
      value: /host
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1bf238da7a1100201223f287894d6cb90d7ac5eb6bcd3311382bcc6774cae8c0
    imagePullPolicy: IfNotPresent
  imagePullSecrets:
  - name: default-dockercfg-s4xcc
  nodeName: smuley-20260407-520b9-879rp-master-0
```

I thought we could go with the label plus spec.nodeName, but in the case of two runs of gather_core_dumps on the same node, we won't be able to identify the pod corresponding to a particular run.

And from my understanding, we cannot inject a custom label via the oc debug command.

Thoughts, @ingvagabund?

Member


Would you have time to investigate the possibility of extending oc debug nodes with a new option for a list of custom labels?
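
To make the idea concrete, here is a purely hypothetical sketch: the --labels flag below does not exist in oc debug today and is exactly what would need to be added.

```bash
# HYPOTHETICAL: --labels is the proposed new flag, not a real oc option.
run_id="gather-core-dumps-$(date +%s)-$$"
oc debug --to-namespace="default" node/"$node" \
    --labels "run-id=$run_id" -- /bin/bash -c 'sleep 300' &

# With a per-run label, each invocation can find exactly its own pod,
# even when two runs target the same node concurrently.
debugPod=$(oc get pods -n default -l "run-id=$run_id" \
    -o jsonpath='{.items[0].metadata.name}')
```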

Contributor Author

@shivprakashmuley shivprakashmuley Apr 13, 2026


Would you have time to investigate the possibility of extending oc debug nodes with a new option for a list of custom labels?

Hi @ingvagabund, I can try to investigate. That would require another story/ticket to track, right?
Can you please suggest the process usually followed for oc changes? Do we need an enhancement proposal for the oc debug nodes changes? Also, this means it will add more time to resolving #531; please share your thoughts on whether we are OK with that.

Member

@ingvagabund ingvagabund Apr 14, 2026


Thank you.

That would require another story/ticket to track, right?

That's up to you how you track the effort.

Can you please suggest the process usually followed for oc changes? Do we need an enhancement proposal for the oc debug nodes changes?

Given this is only about adding a new flag localized to oc debug node, no enhancement proposal is needed. Just provide a meaningful description when opening the PR under openshift/oc.

Also, this means it will add more time to resolving #531; please share your thoughts on whether we are OK with that.

Not sure who "we" is in this context :). Also, it's up to you or your team how much time you have allocated for resolving the issue. I am not aware of any time constraints here.

