
Machine stuck with no phase past CreationTimeout #1110

Description

@gagan16k

How to categorize this issue?

/area robustness
/kind bug
/priority 3

What happened:
A machine was stuck with no phase for ~17 hours (well past the creation timeout) after its node failed to join the cluster. The machine was entirely missing its status field.

Root cause: The new helper uncordonNodeIfCordoned propagates node "X" not found errors to its caller instead of swallowing IsNotFound the way getNodePreserveAnnotationValue does. It is invoked unconditionally from manageMachinePreservation, which runs on every reconcile, including when the phase is blank. The returned error short-circuits the entire reconcile loop before triggerCreationFlow can write any status update (such as Phase: Pending) on the machine.
MachineCreationTimeout is gated behind Phase != "" AND Phase ∈ {Pending,Unknown,InPlaceUpdating,InPlaceUpdateFailed}.

  • Because the phase is never written (see above), the empty phase satisfies neither condition, so the 20-minute creation timeout never fires.
  • As a further side effect, triggerCreationFlow stays blocked until the node joins the cluster, so the machine receives no status updates (such as the Pending phase) until the node finally joins.

What you expected to happen:
Any machine should move to the Pending phase, and then to Failed once it hits the CreationTimeout.

Environment:

  • Kubernetes version (use kubectl version): 1.32.13
  • Cloud provider or hardware configuration: GCP
