How to categorize this issue?
/area robustness
/kind bug
/priority 3
What happened:
A machine was stuck with no phase for ~17 hours (far exceeding the creation timeout) when its node failed to join the cluster. The machine's status field was missing entirely.
Root cause: The new helper uncordonNodeIfCordoned returns node "X" not found errors as-is instead of swallowing IsNotFound the way getNodePreserveAnnotationValue does. It's invoked unconditionally from manageMachinePreservation (called on every reconcile, including when the phase is blank), so the error short-circuits the entire reconcile loop before triggerCreationFlow can write any status update (such as Phase: Pending) on the machine.
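A minimal sketch of the tolerant behaviour described above, not the actual controller code. The types and helpers here (node, getNode, uncordonIfCordoned, errNodeNotFound) are hypothetical stand-ins; in the real controller the check would presumably be apierrors.IsNotFound from k8s.io/apimachinery/pkg/api/errors rather than a sentinel error.

```go
package main

import (
	"errors"
	"fmt"
)

// errNodeNotFound is a hypothetical stand-in for a NotFound API error.
// Real controller code would check apierrors.IsNotFound(err) instead.
var errNodeNotFound = errors.New("node not found")

// node is a hypothetical, minimal model of a cluster node.
type node struct {
	unschedulable bool
}

// getNode is a hypothetical lookup that fails while the node has not joined.
func getNode(nodes map[string]*node, name string) (*node, error) {
	n, ok := nodes[name]
	if !ok {
		return nil, fmt.Errorf("node %q: %w", name, errNodeNotFound)
	}
	return n, nil
}

// uncordonIfCordoned treats a missing node as "nothing to do" so that the
// caller's reconcile can continue, and only propagates other errors.
func uncordonIfCordoned(nodes map[string]*node, name string) error {
	n, err := getNode(nodes, name)
	if errors.Is(err, errNodeNotFound) {
		return nil // node has not joined yet; there is nothing to uncordon
	}
	if err != nil {
		return err
	}
	n.unschedulable = false
	return nil
}

func main() {
	nodes := map[string]*node{"joined": {unschedulable: true}}
	fmt.Println(uncordonIfCordoned(nodes, "missing"))
	fmt.Println(uncordonIfCordoned(nodes, "joined"), nodes["joined"].unschedulable)
}
```

With this shape, a node that has not joined yet no longer aborts the reconcile, so the creation flow can still record a phase on the machine.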
MachineCreationTimeout is gated behind Phase != "" AND Phase ∈ {Pending,Unknown,InPlaceUpdating,InPlaceUpdateFailed}.
- Because the phase is never written (see above), the empty phase satisfies neither condition, so the 20-minute creation timeout never fires.
- As a side effect, triggerCreationFlow is blocked until the node joins the cluster, so the machine receives no status updates (such as the Pending phase) until the node finally joins.
What you expected to happen:
Any machine should move to the Pending phase, and then to Failed once it hits the CreationTimeout.
Environment:
- Kubernetes version (use kubectl version): 1.32.13
- Cloud provider or hardware configuration: GCP