-
Notifications
You must be signed in to change notification settings - Fork 227
OCPBUGS-66983: Fix race condition in gather_core_dumps pod name retrieval #531
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -5,34 +5,69 @@ CORE_DUMP_PATH=${OUT:-"${BASE_COLLECTION_PATH}/node_core_dumps"} | |
| mkdir -p "${CORE_DUMP_PATH}"/ | ||
|
|
||
| function get_dump_off_node { | ||
| local node="$1" | ||
| local debugPod="" | ||
| local oc_debug_pid="" | ||
| local tmpfile | ||
|
|
||
| #Get debug pod's name | ||
| debugPod=$(oc debug --to-namespace="default" node/"$1" -o jsonpath='{.metadata.name}') | ||
| tmpfile=$(mktemp) | ||
| trap 'rm -f "$tmpfile"' RETURN | ||
|
|
||
| #Start Debug pod force it to stay up until removed in "default" namespace | ||
| oc debug --to-namespace="default" node/"$1" -- /bin/bash -c 'sleep 300' >/dev/null 2>&1 & | ||
| # Start Debug pod in background and capture output to get pod name | ||
| oc debug --to-namespace="default" node/"$node" -- /bin/bash -c 'sleep 300' > "$tmpfile" 2>&1 & | ||
| oc_debug_pid=$! | ||
|
|
||
| #Mimic a normal oc call, i.e pause between two successive calls to allow pod to register | ||
| sleep 2 | ||
| oc wait -n "default" --for=condition=Ready pod/"$debugPod" --timeout=30s | ||
| #Wait for the debug pod to be created and extract its name with exponential backoff | ||
| local max_attempts=10 # Fewer attempts needed with exponential backoff | ||
| local attempt=0 | ||
| local base_delay=0.1 # Starting delay in seconds | ||
| local max_delay=2.0 # Cap the maximum delay | ||
|
|
||
| # Initial delay to allow pod creation to start | ||
| sleep 0.5 | ||
|
|
||
| while [ -z "$debugPod" ] && [ $attempt -lt $max_attempts ]; do | ||
| debugPod=$(sed -n 's/.*pod\/\([^ ]*\).*/\1/p' "$tmpfile" 2>/dev/null | head -1) | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would not rely on whatever is printed in the logs as something referential. I don't think anyone will consider
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Instead of reading the pod name from the logs would it help to fetch the corresponding pod name via If this path does not work well (e.g. when two or more pods are still listed)
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I have tried to check if we can leverage labels somehow to identify the corresponding debug pod `$ oc get po smuley-20260407-520b9-879rp-master-0-debug-7rd2p -n default -o yaml
Thought we can go with And from my understanding, we can not inject a custom label in Thoughts? @ingvagabund .
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Would you have time to investigate the possibility of extending
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Hi @ingvagabund , I can try to investigate. That would require another story/ticket to track right?
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thank you.
That's up to you how you track the effort.
Given this is only about adding a new flag localized to oc debug node, no enhancement is needed. Only providing a meaningful description when opening a PR under openshift/oc.
Not sure who "we" is in this context :). Also, it's up to you or your team how much time you have allocated for resolving the issue. I am not aware of any time constraints here.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. cc: @Prashanth684 |
||
| if [ -z "$debugPod" ]; then | ||
| # Calculate exponential backoff: base_delay * 2^attempt | ||
| local delay=$(awk -v base="$base_delay" -v exponent="$attempt" -v max="$max_delay" \ | ||
| 'BEGIN {d = base * (2 ^ exponent); print (d > max) ? max : d}') | ||
| sleep "$delay" | ||
| attempt=$((attempt + 1)) | ||
| fi | ||
| done | ||
| rm -f "$tmpfile" | ||
|
shivprakashmuley marked this conversation as resolved.
|
||
|
|
||
| if [ -z "$debugPod" ]; then | ||
| echo "Debug pod for node ""$1"" never activated" | ||
| else | ||
| #Copy Core Dumps out of Nodes suppress Stdout | ||
| echo "Copying core dumps on node ""$1""" | ||
| oc cp --loglevel 1 -n "default" "$debugPod":/host/var/lib/systemd/coredump "${CORE_DUMP_PATH}"/"$1"_core_dump >/dev/null 2>&1 && PIDS+=($!) | ||
|
|
||
| #clean up debug pod after we are done using them | ||
| oc delete pod "$debugPod" -n "default" | ||
| kill "${oc_debug_pid}" 2>/dev/null | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why move away from deleting the pod?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. |
||
| wait "${oc_debug_pid}" 2>/dev/null | ||
| echo "Debug pod for node $node never activated" | ||
| return | ||
| fi | ||
|
|
||
| if ! oc wait -n "default" --for=condition=Ready pod/"$debugPod" --timeout=30s > /dev/null 2>&1; then | ||
| echo "Warning: Debug pod $debugPod on node $node did not become Ready in time" | ||
| oc delete pod "$debugPod" -n "default" --wait=false > /dev/null 2>&1 | ||
| kill "${oc_debug_pid}" 2>/dev/null | ||
| wait "${oc_debug_pid}" 2>/dev/null | ||
| return | ||
| fi | ||
|
|
||
| echo "Copying core dumps on node $node" | ||
| if ! oc cp --loglevel 1 -n "default" "$debugPod":/host/var/lib/systemd/coredump "${CORE_DUMP_PATH}/${node}_core_dump" > /dev/null 2>&1; then | ||
| echo "Warning: Failed to copy core dumps from node $node" | ||
| fi | ||
|
|
||
| oc delete pod "$debugPod" -n "default" --wait=false > /dev/null 2>&1 | ||
| kill "${oc_debug_pid}" 2>/dev/null | ||
| wait "${oc_debug_pid}" 2>/dev/null | ||
| } | ||
|
|
||
| function gather_core_dump_data { | ||
| #Run coredump pull function on all nodes in parallel | ||
| # Run coredump pull function on all nodes in parallel | ||
| for NODE in ${NODES}; do | ||
| get_dump_off_node "${NODE}" & | ||
| PIDS+=($!) | ||
| done | ||
| } | ||
|
|
||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably don't want this in the default namespace! We like want it in the must/hater name space that way when it's deleted these pods are cleaned up.