OCPBUGS-62633: Fix Progressing/Available status conditions on node reboot #349
tchap wants to merge 3 commits into openshift:main
Conversation
Cover the current behavior of syncStatus, isDeploymentStatusAvailable, isDeploymentStatusAvailableAndUpdated, and isDeploymentStatusComplete, including partial availability scenarios.
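For illustration, a minimal sketch of the table-driven style such coverage typically takes; the helper name and value-receiver signature are assumed from snippets quoted later in this PR, not copied from the real test file:

```go
package operator

import (
	"testing"

	appsv1 "k8s.io/api/apps/v1"
)

// Sketch only: exercises an assumed isDeploymentStatusAvailable(appsv1.Deployment) bool helper.
func TestIsDeploymentStatusAvailableSketch(t *testing.T) {
	cases := []struct {
		name              string
		availableReplicas int32
		expect            bool
	}{
		{name: "no available replicas", availableReplicas: 0, expect: false},
		{name: "one available replica", availableReplicas: 1, expect: true},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			dep := appsv1.Deployment{
				Status: appsv1.DeploymentStatus{AvailableReplicas: tc.availableReplicas},
			}
			if got := isDeploymentStatusAvailable(dep); got != tc.expect {
				t.Errorf("isDeploymentStatusAvailable() = %v, want %v", got, tc.expect)
			}
		})
	}
}
```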
@tchap: This pull request references Jira Issue OCPBUGS-62633, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
Walkthrough: Separates "up-to-date" from "available" for deployments, tightens completion checks to require both, changes syncStatus control flow to handle deleting, up-to-date-but-unavailable, and not-up-to-date cases (with an early-return path), switches a set type, and adds extensive unit tests covering many rollout scenarios.
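To make the "up-to-date" versus "available" split concrete, here is a minimal sketch of what such helpers can look like. The function names match the ones mentioned in this PR, but the bodies below are assumptions for illustration, not the operator's actual implementation:

```go
package operator

import appsv1 "k8s.io/api/apps/v1"

// isDeploymentUpToDate (sketch): the controller has observed the latest spec
// and every existing replica already runs the current pod template.
func isDeploymentUpToDate(dep appsv1.Deployment) bool {
	return dep.Status.ObservedGeneration >= dep.Generation &&
		dep.Status.UpdatedReplicas == dep.Status.Replicas
}

// isDeploymentStatusAvailable (sketch): at least one replica is able to serve.
func isDeploymentStatusAvailable(dep appsv1.Deployment) bool {
	return dep.Status.AvailableReplicas > 0
}

// isDeploymentStatusComplete (sketch): "complete" now requires both conditions.
func isDeploymentStatusComplete(dep appsv1.Deployment) bool {
	return isDeploymentUpToDate(dep) && isDeploymentStatusAvailable(dep)
}
```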
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 9 | ❌ 3
❌ Failed checks (3 warnings)
✅ Passed checks (9 passed)
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: tchap. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
/jira refresh
@tchap: This pull request references Jira Issue OCPBUGS-62633, which is invalid:
/jira refresh
@tchap: This pull request references Jira Issue OCPBUGS-62633, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (xxia@redhat.com), skipping review request.
@tchap: This pull request references Jira Issue OCPBUGS-62633, which is valid. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (xxia@redhat.com), skipping review request.
Actionable comments posted: 2
🧹 Nitpick comments (1)
pkg/operator/status_test.go (1)
251-409: Add a mixed-state regression case for unavailable deployments. Please add one `TestSyncStatus` case with two targets where one deployment is up-to-date-but-unavailable and the other is not-up-to-date-and-unavailable; the expected outcome should remain `ManagedDeploymentsNotReady`, `Available=False`, `Progressing=True`, and `versionSet=false`.

Suggested table entry:
```diff
+		{
+			name:    "two targets, mixed unavailable states prefers not-ready",
+			targets: sets.NewString("deploy-a", "deploy-b"),
+			deployments: []appsv1.Deployment{
+				makeDeployment("deploy-a", 1, 1, 1, 1, 0), // up-to-date but unavailable
+				makeDeployment("deploy-b", 1, 1, 0, 0, 0), // not up-to-date and unavailable
+			},
+			expectProgressing: operatorv1.ConditionTrue,
+			expectAvailable:   operatorv1.ConditionFalse,
+			expectVersionSet:  false,
+			expectReason:      "ManagedDeploymentsNotReady",
+		},
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/operator/status_test.go` around lines 251 - 409, Add a new TestSyncStatus table case inside the tests slice that uses TestSyncStatus and makeDeployment: set targets to sets.NewString("deploy-a","deploy-b") and deployments to one up-to-date-but-unavailable deployment (e.g. makeDeployment("deploy-a", 1, 1, 1, 1, 0)) and one not-up-to-date-and-unavailable deployment (e.g. makeDeployment("deploy-b", 2, 1, 1, 1, 0)); set expectProgressing to operatorv1.ConditionTrue, expectAvailable to operatorv1.ConditionFalse, expectVersionSet to false, and expectReason to "ManagedDeploymentsNotReady".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/operator/status.go`:
- Around line 170-183: The current check handles
deploymentUnavailableButUpToDate before deploymentUnavailableNotUpToDate, which
can prematurely call c.setVersion() and set Progressing=False; invert the two
condition blocks so the branch for deploymentUnavailableNotUpToDate runs first
(set reason "ManagedDeploymentsNotReady", call setAvailableFalse and
setProgressingTrue and return), and only after that check for
deploymentUnavailableButUpToDate (setAvailableFalse, setProgressingFalse, call
c.setVersion() and return); keep using the same helper calls (setAvailableFalse,
setProgressingFalse, setProgressingTrue, c.setVersion()) and return semantics.
- Line 103: Change the deprecated usage of sets.String to the generic sets API:
update the function signature of serviceCAOperator.syncStatus to accept
targetDeploymentNames as sets.Set[string] instead of sets.String, replace any
literal initializations like sets.String{} with sets.Set[string]{} or use
sets.New[string](), and update all call sites (e.g., where syncStatus is invoked
and any variables in starter (starter.go) and tests (status_test.go)) to create
and pass sets.New[string](...) or an empty sets.Set[string]{} accordingly so all
usages compile with k8s.io/apimachinery v0.35.1.
---
Nitpick comments:
In `@pkg/operator/status_test.go`:
- Around line 251-409: Add a new TestSyncStatus table case inside the tests
slice that uses TestSyncStatus and makeDeployment: set targets to
sets.NewString("deploy-a","deploy-b") and deployments to one
up-to-date-but-unavailable deployment (e.g. makeDeployment("deploy-a", 1, 1, 1,
1, 0)) and one not-up-to-date-and-unavailable deployment (e.g.
makeDeployment("deploy-b", 2, 1, 1, 1, 0)); set expectProgressing to
operatorv1.ConditionTrue, expectAvailable to operatorv1.ConditionFalse,
expectVersionSet to false, and expectReason to "ManagedDeploymentsNotReady".
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b5a06a13-3c4c-4501-abe8-c070124d7841
📒 Files selected for processing (2)
pkg/operator/status.go, pkg/operator/status_test.go
Force-pushed from 2ecc13f to 4a3f1bc
Force-pushed from 867e379 to 25e9256
♻️ Duplicate comments (1)
pkg/operator/status.go (1)
104-123: ⚠️ Potential issue | 🟠 Major
Handle deleting and unavailable/outdated deployments before the `ManagedDeploymentsAvailable` fallback. Line 109 and Line 120 only flip `versionUpdatable = false`, so both cases currently fall through to lines 175-177 and report `Available=True`. That regresses the status this PR is meant to fix: a deleting deployment, or one with 0 available replicas that is not yet up to date, should end up `Available=False`, `Progressing=True`, with `ManagedDeploymentsNotReady`.

Proposed fix:
```diff
 	versionUpdatable := true
 	versionUpdatableAndDeploymentsComplete := true
 	deploymentUnavailableButUpToDate := false
+	deploymentUnavailableNotUpToDate := false
+	deploymentDeleting := false
 	statusMsg := ""
 	existingDeploymentNames := sets.New[string]()
 	for _, dep := range existingDeployments.Items {
 		existingDeploymentNames.Insert(dep.Name)
 		if dep.DeletionTimestamp != nil {
 			statusMsg += fmt.Sprintf("\n%s deleting", dep.Name)
+			deploymentDeleting = true
 			versionUpdatable = false
 			versionUpdatableAndDeploymentsComplete = false
 			continue
 		}
 		if !isDeploymentStatusAvailable(dep) {
 			if isDeploymentUpToDate(dep) {
 				statusMsg += fmt.Sprintf("\n%s: all replicas are up-to-date but not yet available", dep.Name)
 				deploymentUnavailableButUpToDate = true
 			} else {
 				statusMsg += fmt.Sprintf("\n%s does not have available replicas", dep.Name)
+				deploymentUnavailableNotUpToDate = true
 				versionUpdatable = false
 			}
 			versionUpdatableAndDeploymentsComplete = false
 			continue
 		}
@@
+	if deploymentDeleting || deploymentUnavailableNotUpToDate {
+		reason := "ManagedDeploymentsNotReady"
+		setAvailableFalse(operatorConfigCopy, reason, statusMsg)
+		setProgressingTrue(operatorConfigCopy, reason, statusMsg)
+		return
+	}
+
 	// Deployment is up-to-date but temporarily unavailable (e.g. node reboot,
 	// pod eviction). The versionUpdatable guard ensures this only fires when no
 	// other deployment is mid-rollout or deleting.
 	if deploymentUnavailableButUpToDate && versionUpdatable {
```

Also applies to: 154-177
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/operator/status.go` around lines 104 - 123, The loop over existingDeployments currently only flips versionUpdatable in some failure cases, letting the ManagedDeploymentsAvailable fallback incorrectly report Available=True; update the branches that handle deleting deployments and "not available" deployments (the blocks using dep.DeletionTimestamp, isDeploymentStatusAvailable(dep) and isDeploymentUpToDate(dep)) to also set versionUpdatableAndDeploymentsComplete = false (and ensure versionUpdatable=false where appropriate) and keep/decorate the deploymentUnavailableButUpToDate flag so that any deleting, unavailable, or outdated deployment prevents the ManagedDeploymentsAvailable fallback from marking the overall status Available=True (look for variables existingDeployments, isDeploymentStatusAvailable, isDeploymentUpToDate, deploymentUnavailableButUpToDate, versionUpdatable, versionUpdatableAndDeploymentsComplete, and the ManagedDeploymentsAvailable fallback logic).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@pkg/operator/status.go`:
- Around line 104-123: The loop over existingDeployments currently only flips
versionUpdatable in some failure cases, letting the ManagedDeploymentsAvailable
fallback incorrectly report Available=True; update the branches that handle
deleting deployments and "not available" deployments (the blocks using
dep.DeletionTimestamp, isDeploymentStatusAvailable(dep) and
isDeploymentUpToDate(dep)) to also set versionUpdatableAndDeploymentsComplete =
false (and ensure versionUpdatable=false where appropriate) and keep/decorate
the deploymentUnavailableButUpToDate flag so that any deleting, unavailable, or
outdated deployment prevents the ManagedDeploymentsAvailable fallback from
marking the overall status Available=True (look for variables
existingDeployments, isDeploymentStatusAvailable, isDeploymentUpToDate,
deploymentUnavailableButUpToDate, versionUpdatable,
versionUpdatableAndDeploymentsComplete, and the ManagedDeploymentsAvailable
fallback logic).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: db89e0dd-03a2-4258-bd01-2055f7623ff7
📒 Files selected for processing (3)
pkg/operator/starter.go, pkg/operator/status.go, pkg/operator/status_test.go
✅ Files skipped from review due to trivial changes (1)
- pkg/operator/starter.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/operator/status_test.go
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/operator/status.go`:
- Around line 116-122: When a deployment is unavailable but not up-to-date the
code must route to the ManagedDeploymentsNotReady path instead of falling
through to ManagedDeploymentsAvailable; in the else branch around
isDeploymentUpToDate(dep) update (the block setting statusMsg and
versionUpdatable) set the flag that triggers the "not ready"/delete path (e.g.,
set a managedDeploymentsNotReady or reuse deploymentUnavailableButUpToDate
appropriately) so that the post-loop logic distinguishes
"unavailable+not-up-to-date" from "unavailable+up-to-date" and reports
Available=false; update the checks later (the post-loop special-case that
currently only looks at deploymentUnavailableButUpToDate) to also consider this
not-ready flag so the ManagedDeploymentsNotReady branch executes for
stale/unavailable deployments.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 5c05b29b-d3ec-436e-b82e-000cf5b9dd3e
📒 Files selected for processing (2)
pkg/operator/status.go, pkg/operator/status_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/operator/status_test.go
Force-pushed from 25e9256 to c3e8952
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/operator/status_test.go (1)
23-39: Consider parameterizing `Spec.Replicas` in the helper for clarity. The `makeDeployment` helper accepts `replicas` for `Status.Replicas` but hardcodes `Spec.Replicas` to 1. This works, but can be confusing when tests need `Status.Replicas != Spec.Replicas`. Several test cases already work around this with inline functions (e.g., lines 59-63, 106-110). Consider adding an optional parameter or creating a separate helper variant for multi-replica scenarios.
♻️ Suggested helper improvement
```diff
-func makeDeployment(name string, generation, observedGeneration int64, replicas, updatedReplicas, availableReplicas int32) appsv1.Deployment {
+func makeDeployment(name string, generation, observedGeneration int64, specReplicas, replicas, updatedReplicas, availableReplicas int32) appsv1.Deployment {
 	return appsv1.Deployment{
 		ObjectMeta: metav1.ObjectMeta{
 			Name:       name,
 			Generation: generation,
 		},
 		Spec: appsv1.DeploymentSpec{
-			Replicas: ptr.To[int32](1),
+			Replicas: ptr.To[int32](specReplicas),
 		},
 		Status: appsv1.DeploymentStatus{
 			Replicas:           replicas,
 			UpdatedReplicas:    updatedReplicas,
 			AvailableReplicas:  availableReplicas,
 			ObservedGeneration: observedGeneration,
 		},
 	}
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/operator/status_test.go` around lines 23 - 39, The helper makeDeployment currently hardcodes Spec.Replicas to 1 while taking replicas for Status.Replicas, which causes confusion and forces test workarounds; modify the helper to accept an additional parameter (e.g., specReplicas int32) and set Spec.Replicas to ptr.To[int32](specReplicas), or add a second helper (e.g., makeDeploymentWithSpecReplicas) that takes specReplicas and uses it for Spec.Replicas, then update tests that previously used inline functions (the callers of makeDeployment that need non-1 spec replicas) to call the new/extended helper so Spec.Replicas aligns with intended test scenarios.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/operator/status_test.go`:
- Around line 318-325: Rename the misleading test case name in the table entry
that uses makeDeployment("service-ca", 1, 1, 2, 2, 1) so it reflects that extra
replicas are still terminating rather than "not all replicas available yet";
update the name string (the value of the name field in that test case) to
something like "available and updated but extra replicas still terminating" to
match the actual scenario and align with TestIsDeploymentStatusComplete's
naming.
---
Nitpick comments:
In `@pkg/operator/status_test.go`:
- Around line 23-39: The helper makeDeployment currently hardcodes Spec.Replicas
to 1 while taking replicas for Status.Replicas, which causes confusion and
forces test workarounds; modify the helper to accept an additional parameter
(e.g., specReplicas int32) and set Spec.Replicas to ptr.To[int32](specReplicas),
or add a second helper (e.g., makeDeploymentWithSpecReplicas) that takes
specReplicas and uses it for Spec.Replicas, then update tests that previously
used inline functions (the callers of makeDeployment that need non-1 spec
replicas) to call the new/extended helper so Spec.Replicas aligns with intended
test scenarios.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 16313b69-b0db-421c-b32a-263d261541c6
📒 Files selected for processing (2)
pkg/operator/status.go, pkg/operator/status_test.go
When a deployment has zero available replicas but is fully up-to-date (all replicas running the current pod template), syncStatus was reporting Progressing=True. This happens during node reboots or pod evictions, situations where the operator is not making progress; the deployment is simply waiting for pods to be rescheduled. The root cause was that syncStatus set conditions inside the loop but post-loop code unconditionally overwrote them. Fix by collecting flags in the loop and setting conditions only after it:
- Up-to-date but unavailable (node reboot): Available=False, Progressing=False, version set
- Mid-rollout or deleting: falls through to existing Progressing=True paths
Force-pushed from c3e8952 to 2cef2de
/retitle OCPBUGS-62633: Fix Progressing/Available status conditions on node reboot
@tchap: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
@tchap: This pull request references Jira Issue OCPBUGS-62633, which is valid. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (xxia@redhat.com), skipping review request. The bug has been updated to refer to the pull request using the external bug tracker.
```diff
 	statusMsg := ""
-	existingDeploymentNames := sets.String{}
+	existingDeploymentNames := sets.New[string]()
 	for _, dep := range existingDeployments.Items {
```
Why do we track multiple deployments? I can only see one. This is unnecessarily confusing.
```go
			versionUpdatable = false
			versionUpdatableAndDeploymentsComplete = false
			continue
		}
```
The status check also seems to expect a very specific deployment with 1 replica.
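If that single-replica assumption ever needs to be lifted, one way to generalize the check could look like the sketch below; this is an illustration, not the operator's code, and relies only on the fact that Kubernetes defaults `Spec.Replicas` to 1 when unset:

```go
package operator

import appsv1 "k8s.io/api/apps/v1"

// allDesiredReplicasAvailable (sketch): compares availability against
// Spec.Replicas instead of assuming a single-replica deployment.
func allDesiredReplicasAvailable(dep appsv1.Deployment) bool {
	desired := int32(1) // Kubernetes defaults Spec.Replicas to 1 when unset
	if dep.Spec.Replicas != nil {
		desired = *dep.Spec.Replicas
	}
	return dep.Status.AvailableReplicas >= desired
}
```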
```go
		setProgressingTrue(operatorConfigCopy, reason, statusMsg)
		setAvailableFalse(operatorConfigCopy, reason, statusMsg)
		versionUpdatable = false
		if isDeploymentUpToDate(dep) {
```
This could be a false positive. During a workload rollout, we should report progressing until all the replicas become available. Why can't we depend on the Deployment Progressing condition instead like we did in openshift/library-go#2128?
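For context, leaning on the Deployment's own Progressing condition could look roughly like the sketch below. This illustrates the reviewer's suggestion; it is not code from this PR or from library-go#2128, though `NewReplicaSetAvailable` is the standard reason the Deployment controller sets once a rollout has completed:

```go
package operator

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// deploymentRolloutComplete (sketch): the Deployment controller reports
// Progressing=True with reason NewReplicaSetAvailable once the newest
// ReplicaSet is fully rolled out and available.
func deploymentRolloutComplete(dep *appsv1.Deployment) bool {
	for _, cond := range dep.Status.Conditions {
		if cond.Type == appsv1.DeploymentProgressing {
			return cond.Status == corev1.ConditionTrue && cond.Reason == "NewReplicaSetAvailable"
		}
	}
	return false
}
```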
```go
	// Deployment is up-to-date but temporarily unavailable (e.g. node reboot,
	// pod eviction). The versionUpdatable guard ensures this only fires when no
	// other deployment is mid-rollout or deleting. Keep Available=True to avoid
	// tripping CVO invariant monitors during transient pod rescheduling.
```
This is incorrect. The deployment is unavailable and since it is a single replica nothing can be done during rescheduling. We either need a monitor exception or increase the number of replicas.
```go
			expect: false,
		},
		{
			name: "extra replicas still terminating",
```
FYI: Recreate strategy takes terminating replicas into account. We can even observe the terminating replicas in the status.
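A rough illustration of that observation (field semantics only, not code from this operator): while old pods are still being torn down, `Status.Replicas` can exceed `Spec.Replicas`, which matches the "extra replicas still terminating" case discussed above:

```go
package operator

import appsv1 "k8s.io/api/apps/v1"

// extraReplicasTerminating (sketch): Status.Replicas still counts pods that are
// being torn down, so during a Recreate rollout it can temporarily be larger
// than the desired Spec.Replicas.
func extraReplicasTerminating(dep appsv1.Deployment) bool {
	if dep.Spec.Replicas == nil {
		return false
	}
	return dep.Status.Replicas > *dep.Spec.Replicas
}
```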
I am going to pause this for now and implement some helpers that can be used for status conditions management as part of openshift/library-go#2128.

/hold
The main motivation is to have `Progressing=false` when a node is rebooting and not all replicas are available.

I started with creating unit tests for the current state in the first commit; the following commits actually apply the necessary changes. It is more pleasant to review commit-by-commit.
A better long-term solution would be to migrate to the workload controller in library-go, which would cause a change in operator-level conditions, but we would get all condition management for free.
The overall situation regarding conditions in this operator seems rather arbitrary, but I focused on just fixing the issue and not breaking anything else.
The following summary is generated by Claude:
Summary
When a deployment has zero available replicas but is fully up-to-date (all replicas running the current pod template), `syncStatus` was incorrectly reporting `Progressing=True`. This happens during node reboots or pod evictions: the operator is not making progress, the deployment is simply waiting for pods to be rescheduled. Reporting `Progressing=True` in this case can mislead the CVO into thinking an upgrade is still rolling out, delaying or blocking upgrades.

The root cause was a structural bug in `syncStatus`: conditions (`Available=False`, `Progressing=True`) were set inside the deployment loop, but the post-loop code unconditionally overwrote them; every code path after the loop set `Available=True`. This meant `setAvailableFalse` calls inside the loop had no effect.

Fix
Stop setting conditions inside the loop. Instead, collect flags (`deploymentUnavailableButUpToDate`, `versionUpdatable`) and decide after the loop:
- Up-to-date but unavailable: `Available=True`, `Progressing=False`, version is set. The operator has nothing to do; it's waiting for infrastructure. We keep `Available=True` to avoid tripping CVO invariant monitors during transient pod rescheduling.
- Mid-rollout or deleting: the existing `Progressing=True` paths are unchanged.

A `versionUpdatable` guard on the new branch ensures that if one deployment is transiently unavailable but another is mid-rollout, we still report `Progressing=True`.

Additional changes
- New `isDeploymentUpToDate` helper to distinguish "unavailable because the node is rebooting" from "unavailable because we're mid-rollout"
- `isDeploymentStatusComplete` refactored to reuse `isDeploymentUpToDate`
- `sets.String` replaced with `sets.Set[string]` (deprecated API cleanup; see the short example after this list)
- Unit tests for `syncStatus` and all deployment status helpers
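As a quick illustration of that set-type change, here is how the deprecated and generic APIs compare (illustrative usage of the apimachinery sets package, not code from this PR):

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/sets"
)

func main() {
	// Deprecated typed set vs. its generic replacement.
	old := sets.NewString("service-ca")   // sets.String, deprecated
	cur := sets.New[string]("service-ca") // sets.Set[string]
	cur.Insert("another-target")

	fmt.Println(old.Has("service-ca")) // true
	fmt.Println(cur.Has("service-ca")) // true
	fmt.Println(cur.UnsortedList())    // order not guaranteed
}
```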
Test plan
- `make test-unit` passes
- `make verify` passes
- `Progressing` stays `False` while the service-ca pod is rescheduling
- `Progressing=True` is reported during the rollout and `Progressing=False` after completion

Summary by CodeRabbit
Bug Fixes
Tests