TCL-5951: Add DeleteNode to SlurmControlInterface#25

Open
jhu-svg wants to merge 1 commit into slurm-1.0-together-changes from jhu/tcl-5951-add-delete-node-interface

jhu-svg commented May 4, 2026

Summary

  • Adds DeleteNode(ctx, nodeset, nodeName) to SlurmControlInterface, enabling the operator to programmatically remove Slurm node registrations
  • Previously, scontrol delete was only callable from inside a dying pod's PreStop hook — if that hook doesn't run (force-delete, OOM, node crash), the Slurm entry persists as a ghost forever
  • Takes a node name string (not a pod) so callers can delete orphaned entries that have no corresponding pod — needed for the follow-up orphan reconciliation PR
  • Follows the established MakeNodeDrain pattern: lookupClient → Get → Delete, with tolerateError on 404/204 for idempotency

Context

Part of TCL-5951 (Slinky operator lifecycle gaps). This is PR 1 of 3:

  1. This PR — adds the capability (no behavior change)
  2. PR TCL-5951: Reconcile orphaned Slurm node registrations #26 — orphaned node reconciliation (uses DeleteNode to clean up ghosts in the sync loop)
  3. TCCO PR #500 — detects PVCs bound to occupied (not just missing) nodes

Test plan

  • Unit tests: successful delete (verifies node removed from fake cache) and idempotent delete (non-existent node returns nil)
  • go build ./... passes
  • Staging E2E verified via PR TCL-5951: Reconcile orphaned Slurm node registrations #26. DeleteNode was exercised on staging cluster jhu-test-slurm-slinky-gap-orphaned-node (s2-us-central-8a). Operator logs confirmed "deleting slurm node" for the injected orphan nodes (slinky-99, slinky-5, slinky-10, slinky-42). Idempotency was verified: deleting already-gone nodes returns nil.

Add DeleteNode(ctx, nodeset, nodeName) to the SlurmControlInterface,
enabling the operator to programmatically remove Slurm node
registrations. Previously, scontrol delete was only possible from
inside a dying pod's PreStop hook, leaving ghost entries when pods
terminate abnormally.

The implementation follows the established MakeNodeDrain pattern:
lookupClient → Get → Delete, with tolerateError on 404/204 for
idempotency. Takes a node name string (not a pod) so callers can
delete orphaned entries that have no corresponding pod.