OCPBUGS-66983: Fix race condition in gather_core_dumps pod name retrieval #531

sferich888 · 2026-04-07T13:52:31Z

We probably don't want this in the default namespace! We likely want it in the must-gather namespace, so that when that namespace is deleted these pods are cleaned up.

ingvagabund · 2026-04-09T12:54:22Z

I would not treat whatever is printed in the logs as a stable reference. I don't think anyone considers the "Starting pod/xyz-debug-abc..." string to be part of the API.
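To make the concern concrete, here is a minimal sketch of what scraping the name out of captured output looks like, using the same `sed` expression as the PR. The helper name `extract_pod_name` and both sample lines are illustrative assumptions; nothing guarantees that `oc debug` prints either of them.

```shell
# Sketch only: extract_pod_name is a hypothetical helper, and the input
# lines below are assumed examples, not guaranteed oc debug output.
extract_pod_name() {
    # Same expression as in the PR: print whatever follows the last "pod/"
    # up to the next space, keeping only the first matching line.
    sed -n 's/.*pod\/\([^ ]*\).*/\1/p' | head -1
}

# Works as long as the message keeps this shape.
printf 'Starting pod/xyz-debug-abc ...\n' | extract_pod_name

# Prints nothing as soon as the wording changes.
printf 'Created debug pod xyz-debug-abc\n' | extract_pod_name
```

The second call shows the failure mode: the script would silently end up with an empty pod name and report that the debug pod never activated.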

ingvagabund · 2026-04-09T12:59:42Z

Instead of reading the pod name from the logs, would it help to fetch the corresponding pod name via `oc get pod` with the right label selector (`debug.openshift.io/managed-by=oc-debug`) and a check for the right node name?

If this path does not work well (e.g. when two or more pods are still listed), the `oc debug nodes/...` command could be extended with an extra option for adding custom labels.
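A minimal sketch of that lookup, assuming the debug pods still land in the `default` namespace as in the current PR. The helper name `find_debug_pods` is hypothetical; the label comes from this thread and `spec.nodeName` is a standard pod field selector.

```shell
# Sketch only: find_debug_pods is a hypothetical helper, not part of
# must-gather. It lists the names of oc-debug pods scheduled on one node.
find_debug_pods() {
    local node="$1"
    local namespace="${2:-default}"   # assumption: same namespace as the PR
    oc get pods -n "$namespace" \
        -l debug.openshift.io/managed-by=oc-debug \
        --field-selector "spec.nodeName=${node}" \
        -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'
}

# Usage (requires a logged-in oc session):
#   find_debug_pods "$node"
```

This avoids parsing log output, but it returns every matching pod on the node, so the caller still has to deal with more than one result.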

shivprakashmuley · 2026-04-10

I tried to check whether we can use labels to identify the corresponding debug pod:

`$ oc get po smuley-20260407-520b9-879rp-master-0-debug-7rd2p -n default -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    debug.openshift.io/source-container: container-00
    debug.openshift.io/source-resource: /v1, Resource=nodes/smuley-20260407-520b9-879rp-master-0
    openshift.io/required-scc: privileged
  creationTimestamp: "2026-04-10T07:44:50Z"
  generation: 1
  labels:
    debug.openshift.io/managed-by: oc-debug
  name: smuley-20260407-520b9-879rp-master-0-debug-7rd2p
  namespace: default
  resourceVersion: "1003576"
  uid: 3f7d2587-4025-4c6a-8f2f-d64f5898af55
spec:
  containers:
  - command:
    - /bin/bash
    - -c
    - sleep 300
    env:
    - name: TMOUT
      value: "900"
    - name: HOST
      value: /host
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1bf238da7a1100201223f287894d6cb90d7ac5eb6bcd3311382bcc6774cae8c0
    imagePullPolicy: IfNotPresent
  imagePullSecrets:
  - name: default-dockercfg-s4xcc
  nodeName: smuley-20260407-520b9-879rp-master-0`

I thought we could go with the label plus spec.nodeName, but if gather_core_dumps runs twice on the same node, we won't be able to tell which pod belongs to which run.

And from my understanding, we cannot inject a custom label through the oc debug command.

Thoughts, @ingvagabund?
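The ambiguity can at least be detected rather than guessed at. Below is a minimal sketch, assuming the candidate pod names arrive one per line on stdin (for example from an `oc get pods` listing filtered by label and node). The helper name `pick_unique_pod` and the pod names are hypothetical.

```shell
# Sketch only: pick_unique_pod is a hypothetical helper. It prints the pod
# name if exactly one candidate is given on stdin and fails otherwise.
pick_unique_pod() {
    local candidates count
    # Drop empty lines so an empty listing counts as zero candidates.
    candidates=$(grep -v '^$' || true)
    count=$(printf '%s\n' "$candidates" | grep -c . || true)
    if [ "$count" -eq 1 ]; then
        printf '%s\n' "$candidates"
        return 0
    fi
    echo "expected exactly one debug pod, found $count" >&2
    return 1
}

# One run on the node: the name is returned.
printf 'node-a-debug-aaaaa\n' | pick_unique_pod

# Two concurrent runs on the same node: the helper refuses to choose.
printf 'node-a-debug-aaaaa\nnode-a-debug-bbbbb\n' | pick_unique_pod \
    || echo "ambiguous: cannot tell the runs apart"
```

This does not solve the problem of two runs on the same node; it only turns a silent mix-up into an explicit failure.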

ingvagabund · 2026-04-10

Would you have time to investigate the possibility of extending oc debug nodes with a new option for a list of custom labels?

shivprakashmuley · 2026-04-13

> Would you have time to investigate the possibility of extending oc debug nodes with a new option for a list of custom labels?

Hi @ingvagabund, I can try to investigate. That would require another story/ticket to track, right?
Can you please advise on the process usually followed for oc changes? Do we need an enhancement proposal for the oc debug nodes changes? This also means it will take longer to resolve #531; please share your thoughts on whether we are OK with that.

Thank you.

ingvagabund · 2026-04-14

> That would require another story/ticket to track, right?

How you track the effort is up to you.

> Can you please advise on the process usually followed for oc changes? Do we need an enhancement proposal for the oc debug nodes changes?

Since this only adds a new flag local to oc debug node, no enhancement is needed. Just provide a meaningful description when opening the PR under openshift/oc.

> This also means it will take longer to resolve #531; please share your thoughts on whether we are OK with that.

Not sure who "we" is in this context :). Also, it's up to you or your team how much time you have allocated for resolving the issue. I am not aware of any time constraints here.

shivprakashmuley · 2026-04-28

cc: @Prashanth684

sferich888 · 2026-04-07T13:53:38Z

Why move away from deleting the pod?

shivprakashmuley · 2026-04-08

@sferich888 The pod still gets deleted, here: https://github.com/openshift/must-gather/pull/531/changes#diff-f7dfd1af20867af285734a2bf99bf98c1766f2a1a74074d45a75f28ee0ef3a82R61

     mkdir -p "${CORE_DUMP_PATH}"/
     function get_dump_off_node {
+    	local node="$1"
     	local debugPod=""
+    	local oc_debug_pid=""
+    	local tmpfile
-    	#Get debug pod's name
-    	debugPod=$(oc debug --to-namespace="default" node/"$1" -o jsonpath='{.metadata.name}')
+    	tmpfile=$(mktemp)
+    	trap 'rm -f "$tmpfile"' RETURN
-    	#Start Debug pod force it to stay up until removed in "default" namespace
-    	oc debug --to-namespace="default" node/"$1" -- /bin/bash -c 'sleep 300' >/dev/null 2>&1 &
+    	# Start Debug pod in background and capture output to get pod name
+    	oc debug --to-namespace="default" node/"$node" -- /bin/bash -c 'sleep 300' > "$tmpfile" 2>&1 &
+    	oc_debug_pid=$!
-    	#Mimic a normal oc call, i.e pause between two successive calls to allow pod to register
-    	sleep 2
-    	oc wait -n "default" --for=condition=Ready pod/"$debugPod" --timeout=30s
+    	#Wait for the debug pod to be created and extract its name with exponential backoff
+    	local max_attempts=10  # Fewer attempts needed with exponential backoff
+    	local attempt=0
+    	local base_delay=0.1  # Starting delay in seconds
+    	local max_delay=2.0   # Cap the maximum delay
+    	# Initial delay to allow pod creation to start
+    	sleep 0.5
+    	while [ -z "$debugPod" ] && [ $attempt -lt $max_attempts ]; do
+    		debugPod=$(sed -n 's/.*pod\/\([^ ]*\).*/\1/p' "$tmpfile" 2>/dev/null | head -1)
+    		if [ -z "$debugPod" ]; then
+    			# Calculate exponential backoff: base_delay * 2^attempt
+    			local delay=$(awk -v base="$base_delay" -v exponent="$attempt" -v max="$max_delay" \
+    				'BEGIN {d = base * (2 ^ exponent); print (d > max) ? max : d}')
+    			sleep "$delay"
+    			attempt=$((attempt + 1))
+    		fi
+    	done
+    	rm -f "$tmpfile"
     	if [ -z "$debugPod" ]; then
-    		echo "Debug pod for node ""$1"" never activated"
-    	else
-    		#Copy Core Dumps out of Nodes suppress Stdout
-    		echo "Copying core dumps on node ""$1"""
-    		oc cp --loglevel 1 -n "default" "$debugPod":/host/var/lib/systemd/coredump "${CORE_DUMP_PATH}"/"$1"_core_dump >/dev/null 2>&1 && PIDS+=($!)
-    		#clean up debug pod after we are done using them
-    		oc delete pod "$debugPod" -n "default"
+    		kill "${oc_debug_pid}" 2>/dev/null
+    		wait "${oc_debug_pid}" 2>/dev/null
+    		echo "Debug pod for node $node never activated"
+    		return
+    	fi
+    	if ! oc wait -n "default" --for=condition=Ready pod/"$debugPod" --timeout=30s > /dev/null 2>&1; then
+    		echo "Warning: Debug pod $debugPod on node $node did not become Ready in time"
+    		oc delete pod "$debugPod" -n "default" --wait=false > /dev/null 2>&1
+    		kill "${oc_debug_pid}" 2>/dev/null
+    		wait "${oc_debug_pid}" 2>/dev/null
+    		return
+    	fi
+    	echo "Copying core dumps on node $node"
+    	if ! oc cp --loglevel 1 -n "default" "$debugPod":/host/var/lib/systemd/coredump "${CORE_DUMP_PATH}/${node}_core_dump" > /dev/null 2>&1; then
+    		echo "Warning: Failed to copy core dumps from node $node"
     	fi
+    	oc delete pod "$debugPod" -n "default" --wait=false > /dev/null 2>&1
+    	kill "${oc_debug_pid}" 2>/dev/null
+    	wait "${oc_debug_pid}" 2>/dev/null
     }
     function gather_core_dump_data {
-    	#Run coredump pull function on all nodes in parallel
+    	# Run coredump pull function on all nodes in parallel
     	for NODE in ${NODES}; do
     		get_dump_off_node "${NODE}" &
+    		PIDS+=($!)
     	done
     }
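The polling loop in the diff waits with a capped exponential backoff. The sketch below isolates that calculation so the resulting schedule is easy to see. The helper name `backoff_delay` is hypothetical, and the values (base 0.1 s, cap 2.0 s, 10 attempts) are taken from the diff.

```shell
# Sketch only: backoff_delay is a hypothetical helper that mirrors the awk
# calculation in the diff: min(base * 2^attempt, max), printed in seconds.
backoff_delay() {
    local base="$1" attempt="$2" max="$3"
    LC_ALL=C awk -v base="$base" -v attempt="$attempt" -v max="$max" \
        'BEGIN { d = base * (2 ^ attempt); if (d > max) d = max; printf "%.1f\n", d }'
}

# Print the delay used before each retry, with the values from the diff.
attempt=0
while [ "$attempt" -lt 10 ]; do
    printf 'attempt %d: sleep %s\n' "$attempt" "$(backoff_delay 0.1 "$attempt" 2.0)"
    attempt=$((attempt + 1))
done
```

With these values the delays are 0.1, 0.2, 0.4, 0.8 and 1.6 seconds, then 2.0 seconds for the remaining five attempts. Together with the initial 0.5 second pause, the function sleeps for about 13.6 seconds in total before it gives up on finding the pod name.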