67 changes: 51 additions & 16 deletions collection-scripts/gather_core_dumps
@@ -5,34 +5,69 @@ CORE_DUMP_PATH=${OUT:-"${BASE_COLLECTION_PATH}/node_core_dumps"}
mkdir -p "${CORE_DUMP_PATH}"/

function get_dump_off_node {
    local node="$1"
    local debugPod=""
    local oc_debug_pid=""
    local tmpfile

    #Get debug pod's name
    debugPod=$(oc debug --to-namespace="default" node/"$1" -o jsonpath='{.metadata.name}')
    tmpfile=$(mktemp)
    trap 'rm -f "$tmpfile"' RETURN

    #Start Debug pod force it to stay up until removed in "default" namespace
    oc debug --to-namespace="default" node/"$1" -- /bin/bash -c 'sleep 300' >/dev/null 2>&1 &
    # Start Debug pod in background and capture output to get pod name
    oc debug --to-namespace="default" node/"$node" -- /bin/bash -c 'sleep 300' > "$tmpfile" 2>&1 &
Contributor

We probably don't want this in the default namespace! We likely want it in the must-gather namespace, so that when it's deleted these pods are cleaned up.

    oc_debug_pid=$!

    #Mimic a normal oc call, i.e pause between two successive calls to allow pod to register
    sleep 2
    oc wait -n "default" --for=condition=Ready pod/"$debugPod" --timeout=30s
    #Wait for the debug pod to be created and extract its name with exponential backoff
    local max_attempts=10 # Fewer attempts needed with exponential backoff
    local attempt=0
    local base_delay=0.1  # Starting delay in seconds
    local max_delay=2.0   # Cap the maximum delay

    # Initial delay to allow pod creation to start
    sleep 0.5

    while [ -z "$debugPod" ] && [ $attempt -lt $max_attempts ]; do
        debugPod=$(sed -n 's/.*pod\/\([^ ]*\).*/\1/p' "$tmpfile" 2>/dev/null | head -1)
Member

I would not rely on whatever is printed in the logs as something referential. I don't think anyone will consider the "Starting pod/xyz-debug-abc..." string a part of the API.
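For reference, what the sed expression in the diff matches can be exercised against a sample of the current banner format; as this comment notes, the "Starting pod/<name> ..." line is not a stable API, so the sample here is only illustrative:

```shell
# Sketch: extract the debug pod name from oc debug's startup banner.
# The banner format is NOT part of any API and may change between releases.
sample='Starting pod/master-0-debug-7rd2p ...
To use host binaries, run `chroot /host`'
debugPod=$(printf '%s\n' "$sample" | sed -n 's/.*pod\/\([^ ]*\).*/\1/p' | head -1)
echo "$debugPod"   # master-0-debug-7rd2p
```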

Member (@ingvagabund), Apr 9, 2026

Instead of reading the pod name from the logs would it help to fetch the corresponding pod name via oc get pod while setting the right label selector (debug.openshift.io/managed-by=oc-debug) and checking for the right node name?

If this path does not work well (e.g. when two or more pods are still listed) oc debug nodes/... command could be extended with an extra option for adding custom labels.
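A hedged sketch of the lookup this comment proposes, untested against a live cluster; the label name comes from the pod dump later in this thread, and the "default" namespace matches the surrounding diff:

```shell
# Hypothetical alternative (not the script's current approach): find the
# debug pod via the label oc debug applies, plus a field selector on
# spec.nodeName, instead of parsing oc's log output.
find_debug_pod() {
    local node="$1"
    oc get pod -n "default" \
        -l "debug.openshift.io/managed-by=oc-debug" \
        --field-selector "spec.nodeName=${node}" \
        -o jsonpath='{.items[0].metadata.name}'
}
# usage: debugPod=$(find_debug_pod "$node")
```

As discussed below, this still cannot disambiguate two concurrent runs targeting the same node.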

Contributor Author (@shivprakashmuley), Apr 10, 2026

I have tried to check whether we can leverage labels to identify the corresponding debug pod:

```yaml
$ oc get po smuley-20260407-520b9-879rp-master-0-debug-7rd2p -n default -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    debug.openshift.io/source-container: container-00
    debug.openshift.io/source-resource: /v1, Resource=nodes/smuley-20260407-520b9-879rp-master-0
    openshift.io/required-scc: privileged
  creationTimestamp: "2026-04-10T07:44:50Z"
  generation: 1
  labels:
    debug.openshift.io/managed-by: oc-debug
  name: smuley-20260407-520b9-879rp-master-0-debug-7rd2p
  namespace: default
  resourceVersion: "1003576"
  uid: 3f7d2587-4025-4c6a-8f2f-d64f5898af55
spec:
  containers:
  - command:
    - /bin/bash
    - -c
    - sleep 300
    env:
    - name: TMOUT
      value: "900"
    - name: HOST
      value: /host
    image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1bf238da7a1100201223f287894d6cb90d7ac5eb6bcd3311382bcc6774cae8c0
    imagePullPolicy: IfNotPresent
  imagePullSecrets:
  - name: default-dockercfg-s4xcc
  nodeName: smuley-20260407-520b9-879rp-master-0
```

I thought we could go with the label + spec.nodeName, but in the case of two runs of gather_core_dumps on the same node we won't be able to identify the pod corresponding to a particular run.

And from my understanding, we cannot inject a custom label via the oc debug command.

Thoughts, @ingvagabund?

Member

Would you have time to investigate the possibility of extending oc debug nodes with a new option for a list of custom labels?

Contributor Author (@shivprakashmuley), Apr 13, 2026

> Would you have time to investigate the possibility of extending oc debug nodes with a new option for a list of custom labels?

Hi @ingvagabund, I can try to investigate. That would require another story/ticket to track, right?
Can you please suggest the process usually followed for oc changes? Do we need an enhancement proposal for the oc debug nodes changes? Also, this means it will add more time to resolving #531; please share your thoughts on whether we are OK with that.

Member (@ingvagabund), Apr 14, 2026

Thank you.

> That would require another story/ticket to track right?

That's up to you how you track the effort.

> Can you please suggest on the process usually followed for oc changes. Do we need a enhancement proposal for the oc debug nodes changes?

Given this is only about adding a new flag localized to oc debug node, no enhancement is needed. Only providing a meaningful description when opening a PR under openshift/oc.

> Also this means it will add more time in resolving #531 please share your thoughts on if we are ok with that.

Not sure who "we" is in this context :). Also, it's up to you or your team how much time you have allocated for resolving the issue. I am not aware of any time constraints here.

        if [ -z "$debugPod" ]; then
            # Calculate exponential backoff: base_delay * 2^attempt
            local delay=$(awk -v base="$base_delay" -v exponent="$attempt" -v max="$max_delay" \
                'BEGIN {d = base * (2 ^ exponent); print (d > max) ? max : d}')
            sleep "$delay"
            attempt=$((attempt + 1))
        fi
    done
    rm -f "$tmpfile"
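The backoff arithmetic above can be checked standalone; this sketch reproduces the sequence of delays the awk expression yields:

```shell
# Standalone check of the capped exponential backoff above: the delay
# doubles from base_delay on each attempt and is capped at max_delay.
base_delay=0.1
max_delay=2.0
delays=""
attempt=0
while [ $attempt -lt 7 ]; do
    delay=$(awk -v base="$base_delay" -v exponent="$attempt" -v max="$max_delay" \
        'BEGIN {d = base * (2 ^ exponent); print (d > max) ? max : d}')
    delays="$delays $delay"
    attempt=$((attempt + 1))
done
echo "$delays"
```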
shivprakashmuley marked this conversation as resolved.

    if [ -z "$debugPod" ]; then
        echo "Debug pod for node ""$1"" never activated"
    else
        #Copy Core Dumps out of Nodes suppress Stdout
        echo "Copying core dumps on node ""$1"""
        oc cp --loglevel 1 -n "default" "$debugPod":/host/var/lib/systemd/coredump "${CORE_DUMP_PATH}"/"$1"_core_dump >/dev/null 2>&1 && PIDS+=($!)

        #clean up debug pod after we are done using them
        oc delete pod "$debugPod" -n "default"
        kill "${oc_debug_pid}" 2>/dev/null
Contributor

Why move away from deleting the pod?


        wait "${oc_debug_pid}" 2>/dev/null
        echo "Debug pod for node $node never activated"
        return
    fi

    if ! oc wait -n "default" --for=condition=Ready pod/"$debugPod" --timeout=30s > /dev/null 2>&1; then
        echo "Warning: Debug pod $debugPod on node $node did not become Ready in time"
        oc delete pod "$debugPod" -n "default" --wait=false > /dev/null 2>&1
        kill "${oc_debug_pid}" 2>/dev/null
        wait "${oc_debug_pid}" 2>/dev/null
        return
    fi

    echo "Copying core dumps on node $node"
    if ! oc cp --loglevel 1 -n "default" "$debugPod":/host/var/lib/systemd/coredump "${CORE_DUMP_PATH}/${node}_core_dump" > /dev/null 2>&1; then
        echo "Warning: Failed to copy core dumps from node $node"
    fi

    oc delete pod "$debugPod" -n "default" --wait=false > /dev/null 2>&1
    kill "${oc_debug_pid}" 2>/dev/null
    wait "${oc_debug_pid}" 2>/dev/null
}
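The kill-then-wait pair used for oc_debug_pid can be seen in isolation; waiting on the killed background job lets the shell reap it so no "Terminated" job-control notice leaks into the gather output (a sketch, not taken from the diff):

```shell
# Sketch of the kill/wait cleanup pattern: kill the backgrounded process,
# then wait on it so the shell reaps it; wait's status reflects the signal.
sleep 300 &
bg_pid=$!
kill "$bg_pid" 2>/dev/null
wait "$bg_pid" 2>/dev/null
status=$?
echo "$status"
```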

function gather_core_dump_data {
    #Run coredump pull function on all nodes in parallel
    # Run coredump pull function on all nodes in parallel
    for NODE in ${NODES}; do
        get_dump_off_node "${NODE}" &
        PIDS+=($!)
    done
}
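The fan-out structure of gather_core_dump_data can be sketched standalone with a stub in place of get_dump_off_node (stub_gather and the node names are illustrative only), including the wait on PIDS the caller is expected to perform so the script does not exit before the per-node jobs finish:

```shell
# Minimal sketch of the fan-out/wait pattern, runnable without a cluster.
stub_gather() { echo "gathered $1"; }   # stand-in for get_dump_off_node

results=$(mktemp)
PIDS=()
for NODE in node-a node-b node-c; do
    stub_gather "$NODE" >> "$results" &
    PIDS+=($!)
done
# Wait for every background job before reading the collected output.
for pid in "${PIDS[@]}"; do
    wait "$pid"
done
gathered=$(wc -l < "$results")
rm -f "$results"
echo "$gathered"
```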
