OCPBUGS-62633: Fix Progressing/Available status conditions on node reboot#349

Open

tchap wants to merge 3 commits into openshift:main from tchap:fix-deployment-true-node-reboot-rewrite

Conversation

@tchap
Contributor

@tchap tchap commented Apr 28, 2026

The main motivation is to have Progressing=false when a node is rebooting and not all replicas are available.

I started by creating unit tests for the current state in the first commit; the following commits apply the necessary changes. It is easier to review commit-by-commit.

A better long-term solution would be to migrate to the workload controller in library-go, which would change the operator-level conditions, but we would get all condition management for free.

The overall situation regarding conditions in this operator seems rather arbitrary, but I focused on just fixing the issue and not breaking anything else.

The following summary is generated by Claude:

Summary

When a deployment has zero available replicas but is fully up-to-date (all replicas running the current pod template), syncStatus was incorrectly reporting Progressing=True. This happens during node reboots or pod evictions — the operator is not making progress, the deployment is simply waiting for pods to be rescheduled. Reporting Progressing=True in this case can mislead the CVO into thinking an upgrade is still rolling out, delaying or blocking upgrades.

The root cause was a structural bug in syncStatus: conditions (Available=False, Progressing=True) were set inside the deployment loop, but the post-loop code unconditionally overwrote them — every code path after the loop set Available=True. This meant setAvailableFalse calls inside the loop had no effect.
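The overwrite described above can be sketched in a few lines of Go. This is a simplified, hypothetical reconstruction of the bug pattern — the names (syncStatusBuggy, deployment) are illustrative, not the operator's actual code:

```go
package main

import "fmt"

type deployment struct {
	name      string
	available bool
}

// syncStatusBuggy mirrors the structural bug: a condition is set inside
// the loop, then unconditionally overwritten after the loop, so the
// in-loop assignment never takes effect.
func syncStatusBuggy(deps []deployment) (available bool) {
	for _, d := range deps {
		if !d.available {
			// Set inside the loop...
			available = false
		}
	}
	// ...but every post-loop code path sets Available=True,
	// clobbering whatever the loop decided.
	available = true
	return available
}

func main() {
	deps := []deployment{{name: "service-ca", available: false}}
	fmt.Println(syncStatusBuggy(deps)) // prints true despite no available replicas
}
```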

Fix

Stop setting conditions inside the loop. Instead, collect flags (deploymentUnavailableButUpToDate, versionUpdatable) and decide after the loop:

  • Up-to-date but unavailable (node reboot, pod eviction): Available=True, Progressing=False, version is set. The operator has nothing to do — it's waiting for infrastructure. We keep Available=True to avoid tripping CVO invariant monitors during transient pod rescheduling.
  • Mid-rollout or deleting: falls through to existing Progressing=True paths unchanged.

A versionUpdatable guard on the new branch ensures that if one deployment is transiently unavailable but another is mid-rollout, we still report Progressing=True.
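The flag-collection approach can be sketched as follows. This is a simplified Go sketch of the post-loop decision; the flag names follow the PR description, but the types are stand-ins rather than the operator's real syncStatus:

```go
package main

import "fmt"

type deployment struct {
	upToDate  bool
	available bool
}

type conditions struct {
	available   bool
	progressing bool
	versionSet  bool
}

// syncStatusFixed collects flags during iteration and decides only after
// the loop, so no deployment's state is silently overwritten.
func syncStatusFixed(deps []deployment) conditions {
	versionUpdatable := true
	upToDateButUnavailable := false
	for _, d := range deps {
		if !d.available {
			if d.upToDate {
				upToDateButUnavailable = true // waiting for infrastructure
			} else {
				versionUpdatable = false // genuinely mid-rollout
			}
		}
	}
	// The versionUpdatable guard ensures this branch only fires when no
	// other deployment is mid-rollout.
	if upToDateButUnavailable && versionUpdatable {
		return conditions{available: true, progressing: false, versionSet: true}
	}
	if !versionUpdatable {
		return conditions{available: false, progressing: true, versionSet: false}
	}
	return conditions{available: true, progressing: false, versionSet: true}
}

func main() {
	rebooting := []deployment{{upToDate: true, available: false}}
	fmt.Println(syncStatusFixed(rebooting).progressing) // prints false
}
```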

Additional changes

  • Add isDeploymentUpToDate helper to distinguish "unavailable because the node is rebooting" from "unavailable because we're mid-rollout"
  • Refactor isDeploymentStatusComplete to reuse isDeploymentUpToDate
  • Migrate sets.String to sets.Set[string] (deprecated API cleanup)
  • Add comprehensive unit tests for syncStatus and all deployment status helpers
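The sets.String migration in the bullets above amounts to swapping the deprecated string-specialized API for the generic one. The sketch below mirrors the shape of k8s.io/apimachinery/pkg/util/sets with a local stand-in so it stays self-contained; in the real code the change is simply sets.NewString(...) becoming sets.New(...):

```go
package main

import "fmt"

// Set is a local stand-in mirroring sets.Set[T] from
// k8s.io/apimachinery/pkg/util/sets, for illustration only.
type Set[T comparable] map[T]struct{}

// New mirrors sets.New, the generic replacement for the
// deprecated sets.NewString.
func New[T comparable](items ...T) Set[T] {
	s := Set[T]{}
	for _, item := range items {
		s[item] = struct{}{}
	}
	return s
}

func (s Set[T]) Has(item T) bool {
	_, ok := s[item]
	return ok
}

func main() {
	// Before: targetDeploymentNames := sets.NewString("service-ca")  // sets.String (deprecated)
	// After:  targetDeploymentNames := sets.New("service-ca")        // sets.Set[string]
	targets := New("service-ca")
	fmt.Println(targets.Has("service-ca")) // prints true
}
```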

Test plan

  • make test-unit passes
  • make verify passes
  • e2e: deploy on a cluster with a single worker, drain the worker node, verify Progressing stays False while the service-ca pod is rescheduling
  • e2e: trigger a deployment rollout (e.g. by changing controller args), verify Progressing=True is reported during the rollout and Progressing=False after completion

Summary by CodeRabbit

  • Bug Fixes

    • Refined deployment status evaluation to better distinguish deployments that are up-to-date but temporarily unavailable, preventing unnecessary "progressing" or version-blocking and improving operator condition accuracy (including setting Available/Progressing appropriately in that case).
  • Tests

    • Added comprehensive tests covering many deployment rollout scenarios, generation observation, replica edge cases, and status synchronization outcomes.

Cover the current behavior of syncStatus, isDeploymentStatusAvailable,
isDeploymentStatusAvailableAndUpdated, and isDeploymentStatusComplete
including partial availability scenarios.
@openshift-ci-robot added the jira/valid-reference label (indicates that this PR references a valid Jira ticket of any type) and the jira/invalid-bug label (indicates that a referenced Jira bug is invalid for the branch this PR is targeting) on Apr 28, 2026
@openshift-ci-robot
Contributor

@tchap: This pull request references Jira Issue OCPBUGS-62633, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

The original motivation was to have Progressing=false when a node is rebooting and not all replicas are available. I also ended up fixing Available when a deployment is being deleted. These are separate commits and the latter can be dropped if not deemed necessary.

I started by creating unit tests for the current state in the first commit; the following commits apply the necessary changes. It is easier to review commit-by-commit.

A better long-term solution would be to migrate to the workload controller in library-go, which would change the operator-level conditions, but we would get all condition management for free.

The following summary is generated by Claude:

Summary

Fix incorrect Progressing=True and Available=True status reporting when the controller deployment is temporarily unavailable (node reboot, pod eviction) or being deleted.

Problem

When the service-ca controller pod becomes temporarily unavailable — e.g. during a node reboot or pod eviction — the operator reports Progressing=True. This is incorrect because the deployment is already at the desired version; the pod is just transiently missing. Reporting Progressing=True can block cluster upgrades and trigger unnecessary alerts.

A related bug caused Available=True to be reported even when no replicas were available or the deployment was being deleted. The conditions set inside the loop were always overwritten by the fallthrough at the end of syncStatus.

Fix

Extract an isDeploymentUpToDate helper that checks whether the deployment controller has observed the current spec and all replicas use the current pod template (without requiring them to be available). Use it to distinguish three unavailability scenarios:
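The helper described above can be sketched roughly as follows, using a trimmed stand-in for appsv1.Deployment; the exact fields the PR checks may differ:

```go
package main

import "fmt"

// deploymentStatus and deployment are trimmed stand-ins for the
// corresponding appsv1 types, for illustration only.
type deploymentStatus struct {
	ObservedGeneration int64
	Replicas           int32
	UpdatedReplicas    int32
	AvailableReplicas  int32
}

type deployment struct {
	Generation   int64
	SpecReplicas int32
	Status       deploymentStatus
}

// isDeploymentUpToDate reports whether the deployment controller has
// observed the current spec and every replica runs the current pod
// template. Availability is deliberately not required.
func isDeploymentUpToDate(d deployment) bool {
	if d.Status.ObservedGeneration < d.Generation {
		return false
	}
	return d.Status.UpdatedReplicas == d.Status.Replicas &&
		d.Status.Replicas == d.SpecReplicas
}

func main() {
	rebooting := deployment{
		Generation:   1,
		SpecReplicas: 1,
		Status: deploymentStatus{
			ObservedGeneration: 1,
			Replicas:           1,
			UpdatedReplicas:    1,
			AvailableReplicas:  0, // pod evicted / node rebooting
		},
	}
	fmt.Println(isDeploymentUpToDate(rebooting)) // prints true
}
```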

Scenario                                            Available  Progressing  Version set  Reason
Up-to-date but unavailable (node reboot)            False      False        Yes          ManagedDeploymentsUpToDateButUnavailable
Not up-to-date and unavailable (recreate rollout)   False      True         No           ManagedDeploymentsNotReady
Deployment being deleted                            False      True         No           ManagedDeploymentsNotReady

Previously all three reported Available=True and Progressing=True.

The conditions are no longer set inside the loop body (where they were overwritten). Instead, flags are tracked during iteration and the conditions are set in dedicated branches after the loop, consistent with the existing versionUpdatable / versionUpdatableAndDeploymentsComplete pattern.

Test plan

  • TestIsDeploymentUpToDate — new unit tests for the extracted helper
  • TestIsDeploymentStatusComplete — updated for refactored implementation, added missing cases
  • TestSyncStatus — updated expectations for node reboot, recreate rollout, deleting deployment, and multi-target scenarios
  • make test-unit passes
  • make test-e2e on a cluster (node reboot / pod eviction scenario)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented Apr 28, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Separates "up-to-date" from "available" for deployments, tightens completion checks to require both, changes syncStatus control flow to handle deleting, up-to-date-but-unavailable, and not-up-to-date cases (with an early-return path), switches a set type, and adds extensive unit tests covering many rollout scenarios.

Changes

  • Deployment status logic (pkg/operator/status.go): Introduces isDeploymentUpToDate and updates isDeploymentStatusComplete to require up-to-date + available replicas; refactors syncStatus to distinguish deleting vs up-to-date-but-unavailable vs not-up-to-date, adds an early-return that sets Available=True/Progressing=False with reason ManagedDeploymentsUpToDateButUnavailable, and changes targetDeploymentNames from sets.String to sets.Set[string].
  • Status tests (pkg/operator/status_test.go): Adds comprehensive tests and helpers exercising deployment predicates and syncStatus across many cases: generation observation, nil Spec.Replicas, recreate/scale-down edge cases, extra terminating replicas, deleting/missing deployments, partial updates, up-to-date-but-unavailable behavior, multi-target mixes, and assertions on Conditions, Reasons, and version recording.
  • Starter set construction (pkg/operator/starter.go): Replaces sets.NewString(...) with sets.New(...) when constructing targetDeploymentNames to match the updated sets.Set[string] type.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 9 | ❌ 3

❌ Failed checks (3 warnings)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 9.09%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Test Structure And Quality — ⚠️ Warning: The test case at line 311 has a misleading name; the scenario "available and updated but not all replicas available yet" actually represents extra replicas terminating after scale-down, inconsistent with the correctly-named similar case in TestIsDeploymentStatusComplete. Resolution: rename the test case at line 311 to "extra replicas still terminating after scale-down" for accuracy and consistency.
  • Topology-Aware Scheduling Compatibility — ⚠️ Warning: The shell script execution failed; no git repository context was available to verify commit details or file changes, and git log / git show output could not be accessed. Resolution: ensure the command is run within a git repository with commit history.
✅ Passed checks (9 passed)
  • Linked Issues check — ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check — ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Stable And Deterministic Test Names — ✅ Passed: All 31 test names in pkg/operator/status_test.go are static strings with no dynamic values, meeting Ginkgo naming standards.
  • Microshift Test Compatibility — ✅ Passed: The new test file contains standard Go unit tests using only standard Kubernetes APIs and library-go utilities, not Ginkgo e2e tests, so the MicroShift compatibility check is not applicable.
  • Single Node Openshift (Sno) Test Compatibility — ✅ Passed: The PR adds unit tests using Go's standard testing package, not Ginkgo e2e tests. No SNO compatibility issues detected.
  • Ote Binary Stdout Contract — ✅ Passed: The modified files contain no process-level code that writes to stdout. Test functions are properly isolated and business logic contains no initialization code violations.
  • Ipv6 And Disconnected Network Test Compatibility — ✅ Passed: The PR adds only standard Go unit tests in pkg/operator/status_test.go, not Ginkgo e2e tests, so IPv6 and disconnected network requirements do not apply.
  • Title check — ✅ Passed: The title directly addresses the main purpose of the PR: fixing Progressing/Available status conditions during node reboot scenarios, which matches the core change of distinguishing up-to-date-but-unavailable deployments from other states.
  • Description Check — ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from atiratree and ingvagabund April 28, 2026 11:39
@openshift-ci
Contributor

openshift-ci Bot commented Apr 28, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tchap
Once this PR has been reviewed and has the lgtm label, please assign rh-roman for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tchap
Contributor Author

tchap commented Apr 28, 2026

/jira refresh

@openshift-ci-robot
Contributor

@tchap: This pull request references Jira Issue OCPBUGS-62633, which is invalid:

  • expected the bug to target the "5.0.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

Details

In response to this:

/jira refresh


@tchap
Contributor Author

tchap commented Apr 28, 2026

/jira refresh

@openshift-ci-robot added the jira/valid-bug label (indicates that a referenced Jira bug is valid for the branch this PR is targeting) and removed the jira/invalid-bug label on Apr 28, 2026
@openshift-ci-robot
Contributor

@tchap: This pull request references Jira Issue OCPBUGS-62633, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (xxia@redhat.com), skipping review request.

Details

In response to this:

/jira refresh


@openshift-ci-robot
Contributor

@tchap: This pull request references Jira Issue OCPBUGS-62633, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (xxia@redhat.com), skipping review request.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
pkg/operator/status_test.go (1)

251-409: Add a mixed-state regression case for unavailable deployments.

Please add one TestSyncStatus case with two targets where one deployment is up-to-date-but-unavailable and the other is not-up-to-date-and-unavailable; expected outcome should remain ManagedDeploymentsNotReady, Available=False, Progressing=True, and versionSet=false.

Suggested table entry
+		{
+			name:    "two targets, mixed unavailable states prefers not-ready",
+			targets: sets.NewString("deploy-a", "deploy-b"),
+			deployments: []appsv1.Deployment{
+				makeDeployment("deploy-a", 1, 1, 1, 1, 0), // up-to-date but unavailable
+				makeDeployment("deploy-b", 1, 1, 0, 0, 0), // not up-to-date and unavailable
+			},
+			expectProgressing: operatorv1.ConditionTrue,
+			expectAvailable:   operatorv1.ConditionFalse,
+			expectVersionSet:  false,
+			expectReason:      "ManagedDeploymentsNotReady",
+		},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/operator/status_test.go` around lines 251 - 409, Add a new TestSyncStatus
table case inside the tests slice that uses TestSyncStatus and makeDeployment:
set targets to sets.NewString("deploy-a","deploy-b") and deployments to one
up-to-date-but-unavailable deployment (e.g. makeDeployment("deploy-a", 1, 1, 1,
1, 0)) and one not-up-to-date-and-unavailable deployment (e.g.
makeDeployment("deploy-b", 2, 1, 1, 1, 0)); set expectProgressing to
operatorv1.ConditionTrue, expectAvailable to operatorv1.ConditionFalse,
expectVersionSet to false, and expectReason to "ManagedDeploymentsNotReady".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/operator/status.go`:
- Around line 170-183: The current check handles
deploymentUnavailableButUpToDate before deploymentUnavailableNotUpToDate, which
can prematurely call c.setVersion() and set Progressing=False; invert the two
condition blocks so the branch for deploymentUnavailableNotUpToDate runs first
(set reason "ManagedDeploymentsNotReady", call setAvailableFalse and
setProgressingTrue and return), and only after that check for
deploymentUnavailableButUpToDate (setAvailableFalse, setProgressingFalse, call
c.setVersion() and return); keep using the same helper calls (setAvailableFalse,
setProgressingFalse, setProgressingTrue, c.setVersion()) and return semantics.
- Line 103: Change the deprecated usage of sets.String to the generic sets API:
update the function signature of serviceCAOperator.syncStatus to accept
targetDeploymentNames as sets.Set[string] instead of sets.String, replace any
literal initializations like sets.String{} with sets.Set[string]{} or use
sets.New[string](), and update all call sites (e.g., where syncStatus is invoked
and any variables in starter (starter.go) and tests (status_test.go)) to create
and pass sets.New[string](...) or an empty sets.Set[string]{} accordingly so all
usages compile with k8s.io/apimachinery v0.35.1.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b5a06a13-3c4c-4501-abe8-c070124d7841

📥 Commits

Reviewing files that changed from the base of the PR and between 2aa88ac and 2ecc13f.

📒 Files selected for processing (2)
  • pkg/operator/status.go
  • pkg/operator/status_test.go

tchap force-pushed the fix-deployment-true-node-reboot-rewrite branch from 2ecc13f to 4a3f1bc on April 28, 2026 at 12:13
tchap force-pushed the fix-deployment-true-node-reboot-rewrite branch 2 times, most recently from 867e379 to 25e9256, on April 28, 2026 at 16:10

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
pkg/operator/status.go (1)

104-123: ⚠️ Potential issue | 🟠 Major

Handle deleting and unavailable/outdated deployments before the ManagedDeploymentsAvailable fallback.

Line 109 and Line 120 only flip versionUpdatable=false, so both cases currently fall through to Lines 175-177 and report Available=True. That regresses the status this PR is meant to fix: a deleting deployment, or one with 0 available replicas that is not yet up to date, should end up Available=False, Progressing=True, with ManagedDeploymentsNotReady.

Proposed fix
 	versionUpdatable := true
 	versionUpdatableAndDeploymentsComplete := true
 	deploymentUnavailableButUpToDate := false
+	deploymentUnavailableNotUpToDate := false
+	deploymentDeleting := false
 	statusMsg := ""
 	existingDeploymentNames := sets.New[string]()
 	for _, dep := range existingDeployments.Items {
 		existingDeploymentNames.Insert(dep.Name)
 		if dep.DeletionTimestamp != nil {
 			statusMsg += fmt.Sprintf("\n%s deleting", dep.Name)
+			deploymentDeleting = true
 			versionUpdatable = false
 			versionUpdatableAndDeploymentsComplete = false
 			continue
 		}
 		if !isDeploymentStatusAvailable(dep) {
 			if isDeploymentUpToDate(dep) {
 				statusMsg += fmt.Sprintf("\n%s: all replicas are up-to-date but not yet available", dep.Name)
 				deploymentUnavailableButUpToDate = true
 			} else {
 				statusMsg += fmt.Sprintf("\n%s does not have available replicas", dep.Name)
+				deploymentUnavailableNotUpToDate = true
 				versionUpdatable = false
 			}
 			versionUpdatableAndDeploymentsComplete = false
 			continue
 		}
@@
+	if deploymentDeleting || deploymentUnavailableNotUpToDate {
+		reason := "ManagedDeploymentsNotReady"
+		setAvailableFalse(operatorConfigCopy, reason, statusMsg)
+		setProgressingTrue(operatorConfigCopy, reason, statusMsg)
+		return
+	}
+
 	// Deployment is up-to-date but temporarily unavailable (e.g. node reboot,
 	// pod eviction). The versionUpdatable guard ensures this only fires when no
 	// other deployment is mid-rollout or deleting.
 	if deploymentUnavailableButUpToDate && versionUpdatable {

Also applies to: 154-177

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/operator/status.go` around lines 104 - 123, The loop over
existingDeployments currently only flips versionUpdatable in some failure cases,
letting the ManagedDeploymentsAvailable fallback incorrectly report
Available=True; update the branches that handle deleting deployments and "not
available" deployments (the blocks using dep.DeletionTimestamp,
isDeploymentStatusAvailable(dep) and isDeploymentUpToDate(dep)) to also set
versionUpdatableAndDeploymentsComplete = false (and ensure
versionUpdatable=false where appropriate) and keep/decorate the
deploymentUnavailableButUpToDate flag so that any deleting, unavailable, or
outdated deployment prevents the ManagedDeploymentsAvailable fallback from
marking the overall status Available=True (look for variables
existingDeployments, isDeploymentStatusAvailable, isDeploymentUpToDate,
deploymentUnavailableButUpToDate, versionUpdatable,
versionUpdatableAndDeploymentsComplete, and the ManagedDeploymentsAvailable
fallback logic).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: db89e0dd-03a2-4258-bd01-2055f7623ff7

📥 Commits

Reviewing files that changed from the base of the PR and between 4a3f1bc and 867e379.

📒 Files selected for processing (3)
  • pkg/operator/starter.go
  • pkg/operator/status.go
  • pkg/operator/status_test.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/operator/starter.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/operator/status_test.go


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/operator/status.go`:
- Around line 116-122: When a deployment is unavailable but not up-to-date the
code must route to the ManagedDeploymentsNotReady path instead of falling
through to ManagedDeploymentsAvailable; in the else branch around
isDeploymentUpToDate(dep) update (the block setting statusMsg and
versionUpdatable) set the flag that triggers the "not ready"/delete path (e.g.,
set a managedDeploymentsNotReady or reuse deploymentUnavailableButUpToDate
appropriately) so that the post-loop logic distinguishes
"unavailable+not-up-to-date" from "unavailable+up-to-date" and reports
Available=false; update the checks later (the post-loop special-case that
currently only looks at deploymentUnavailableButUpToDate) to also consider this
not-ready flag so the ManagedDeploymentsNotReady branch executes for
stale/unavailable deployments.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5c05b29b-d3ec-436e-b82e-000cf5b9dd3e

📥 Commits

Reviewing files that changed from the base of the PR and between 867e379 and 25e9256.

📒 Files selected for processing (2)
  • pkg/operator/status.go
  • pkg/operator/status_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/operator/status_test.go

tchap force-pushed the fix-deployment-true-node-reboot-rewrite branch from 25e9256 to c3e8952 on April 29, 2026 at 11:00

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
pkg/operator/status_test.go (1)

23-39: Consider parameterizing Spec.Replicas in the helper for clarity.

The makeDeployment helper accepts replicas for Status.Replicas but hardcodes Spec.Replicas to 1. This works but can be confusing when tests need Status.Replicas != Spec.Replicas. Several test cases already work around this with inline functions (e.g., lines 59-63, 106-110).

Consider adding an optional parameter or creating a separate helper variant for multi-replica scenarios.

♻️ Suggested helper improvement
-func makeDeployment(name string, generation, observedGeneration int64, replicas, updatedReplicas, availableReplicas int32) appsv1.Deployment {
+func makeDeployment(name string, generation, observedGeneration int64, specReplicas, replicas, updatedReplicas, availableReplicas int32) appsv1.Deployment {
 	return appsv1.Deployment{
 		ObjectMeta: metav1.ObjectMeta{
 			Name:       name,
 			Generation: generation,
 		},
 		Spec: appsv1.DeploymentSpec{
-			Replicas: ptr.To[int32](1),
+			Replicas: ptr.To[int32](specReplicas),
 		},
 		Status: appsv1.DeploymentStatus{
 			Replicas:           replicas,
 			UpdatedReplicas:    updatedReplicas,
 			AvailableReplicas:  availableReplicas,
 			ObservedGeneration: observedGeneration,
 		},
 	}
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/operator/status_test.go` around lines 23 - 39, The helper makeDeployment
currently hardcodes Spec.Replicas to 1 while taking replicas for
Status.Replicas, which causes confusion and forces test workarounds; modify the
helper to accept an additional parameter (e.g., specReplicas int32) and set
Spec.Replicas to ptr.To[int32](specReplicas), or add a second helper (e.g.,
makeDeploymentWithSpecReplicas) that takes specReplicas and uses it for
Spec.Replicas, then update tests that previously used inline functions (the
callers of makeDeployment that need non-1 spec replicas) to call the
new/extended helper so Spec.Replicas aligns with intended test scenarios.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/operator/status_test.go`:
- Around line 318-325: Rename the misleading test case name in the table entry
that uses makeDeployment("service-ca", 1, 1, 2, 2, 1) so it reflects that extra
replicas are still terminating rather than "not all replicas available yet";
update the name string (the value of the name field in that test case) to
something like "available and updated but extra replicas still terminating" to
match the actual scenario and align with TestIsDeploymentStatusComplete's
naming.

---

Nitpick comments:
In `@pkg/operator/status_test.go`:
- Around line 23-39: The helper makeDeployment currently hardcodes Spec.Replicas
to 1 while taking replicas for Status.Replicas, which causes confusion and
forces test workarounds; modify the helper to accept an additional parameter
(e.g., specReplicas int32) and set Spec.Replicas to ptr.To[int32](specReplicas),
or add a second helper (e.g., makeDeploymentWithSpecReplicas) that takes
specReplicas and uses it for Spec.Replicas, then update tests that previously
used inline functions (the callers of makeDeployment that need non-1 spec
replicas) to call the new/extended helper so Spec.Replicas aligns with intended
test scenarios.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 16313b69-b0db-421c-b32a-263d261541c6

📥 Commits

Reviewing files that changed from the base of the PR and between 25e9256 and c3e8952.

📒 Files selected for processing (2)
  • pkg/operator/status.go
  • pkg/operator/status_test.go

Comment thread pkg/operator/status_test.go
When a deployment has zero available replicas but is fully up-to-date
(all replicas running the current pod template), syncStatus was
reporting Progressing=True. This happens during node reboots or pod
evictions — situations where the operator is not making progress; the
deployment is simply waiting for pods to be rescheduled.

The root cause was that syncStatus set conditions inside the loop but
post-loop code unconditionally overwrote them.

Fix by collecting flags in the loop and setting conditions only after:
- Up-to-date but unavailable (node reboot): Available=False,
  Progressing=False, version set
- Mid-rollout or deleting: falls through to existing Progressing=True
  paths
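The flag-collection pattern the commit message describes can be sketched as follows. This is a simplified illustration, not the operator's actual code: local structs stand in for the `k8s.io/api/apps/v1` types, and the names `upToDate`, `decideConditions`, and the flag variables are illustrative.

```go
package main

import "fmt"

// deployment is a simplified stand-in for appsv1.Deployment fields.
type deployment struct {
	Generation         int64
	ObservedGeneration int64
	SpecReplicas       int32
	Replicas           int32
	UpdatedReplicas    int32
	AvailableReplicas  int32
}

// upToDate mirrors the idea behind the isDeploymentUpToDate helper:
// the controller has observed the latest spec and every replica runs
// the current pod template.
func upToDate(d deployment) bool {
	return d.ObservedGeneration >= d.Generation &&
		d.UpdatedReplicas == d.SpecReplicas &&
		d.Replicas == d.SpecReplicas
}

// decideConditions sketches the fix: collect flags inside the loop,
// then set conditions exactly once after it, so no code path can
// silently overwrite what the loop decided.
func decideConditions(deps []deployment) (available, progressing bool) {
	versionUpdatable := true
	unavailableButUpToDate := false
	for _, d := range deps {
		if d.AvailableReplicas == 0 {
			if upToDate(d) {
				unavailableButUpToDate = true // e.g. node reboot, pod eviction
			} else {
				versionUpdatable = false // mid-rollout
			}
		}
	}
	if unavailableButUpToDate && versionUpdatable {
		// Waiting on infrastructure, not rolling anything out.
		return false, false
	}
	if !versionUpdatable {
		return false, true // mid-rollout: keep Progressing=True
	}
	return true, false
}

func main() {
	// Up-to-date single-replica deployment with its pod evicted.
	rebooting := []deployment{{Generation: 2, ObservedGeneration: 2,
		SpecReplicas: 1, Replicas: 1, UpdatedReplicas: 1, AvailableReplicas: 0}}
	a, p := decideConditions(rebooting)
	fmt.Println(a, p) // false false: unavailable but not progressing
}
```

The `versionUpdatable` guard corresponds to the commit message's note that a transiently unavailable deployment must not mask another deployment that is genuinely mid-rollout.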
@tchap tchap force-pushed the fix-deployment-true-node-reboot-rewrite branch from c3e8952 to 2cef2de Compare April 29, 2026 11:22
@tchap
Contributor Author

tchap commented Apr 29, 2026

/retitle OCPBUGS-62633: Fix Progressing/Available status conditions on node reboot

@openshift-ci openshift-ci Bot changed the title OCPBUGS-62633: Fix Progressing/Available status conditions on node reboot or deployment being deleted OCPBUGS-62633: Fix Progressing/Available status conditions on node reboot Apr 29, 2026
@openshift-ci
Contributor

openshift-ci Bot commented Apr 29, 2026

@tchap: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-upgrade | 2cef2de | link | true | /test e2e-aws-upgrade |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-ci-robot
Contributor

@tchap: This pull request references Jira Issue OCPBUGS-62633, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (xxia@redhat.com), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

The main motivation is to have Progressing=false when a node is rebooting and not all replicas are available.

I started with creating unit tests for the current state in the first commit, the following commits actually apply necessary changes. It is more pleasant to review commit-by-commit.

A better long-term solution would be to migrate to the workload controller in library-go, which would cause change in operator-level conditions, but we would get all condition management for free.

The overall situation regarding conditions in this operator seems rather arbitrary, but I focused on just fixing the issue and not breaking anything else.

The following summary is generated by Claude:

Summary

When a deployment has zero available replicas but is fully up-to-date (all replicas running the current pod template), syncStatus was incorrectly reporting Progressing=True. This happens during node reboots or pod evictions — the operator is not making progress; the deployment is simply waiting for pods to be rescheduled. Reporting Progressing=True in this case can mislead the CVO into thinking an upgrade is still rolling out, delaying or blocking upgrades.

The root cause was a structural bug in syncStatus: conditions (Available=False, Progressing=True) were set inside the deployment loop, but the post-loop code unconditionally overwrote them — every code path after the loop set Available=True. This meant setAvailableFalse calls inside the loop had no effect.

Fix

Stop setting conditions inside the loop. Instead, collect flags (deploymentUnavailableButUpToDate, versionUpdatable) and decide after the loop:

  • Up-to-date but unavailable (node reboot, pod eviction): Available=True, Progressing=False, version is set. The operator has nothing to do — it's waiting for infrastructure. We keep Available=True to avoid tripping CVO invariant monitors during transient pod rescheduling.
  • Mid-rollout or deleting: falls through to existing Progressing=True paths unchanged.

A versionUpdatable guard on the new branch ensures that if one deployment is transiently unavailable but another is mid-rollout, we still report Progressing=True.

Additional changes

  • Add isDeploymentUpToDate helper to distinguish "unavailable because the node is rebooting" from "unavailable because we're mid-rollout"
  • Refactor isDeploymentStatusComplete to reuse isDeploymentUpToDate
  • Migrate sets.String to sets.Set[string] (deprecated API cleanup)
  • Add comprehensive unit tests for syncStatus and all deployment status helpers

Test plan

  • make test-unit passes
  • make verify passes
  • e2e: deploy on a cluster with a single worker, drain the worker node, verify Progressing stays False while the service-ca pod is rescheduling
  • e2e: trigger a deployment rollout (e.g. by changing controller args), verify Progressing=True is reported during the rollout and Progressing=False after completion

Summary by CodeRabbit

  • Bug Fixes

  • Refined deployment status evaluation to better distinguish deployments that are up-to-date but temporarily unavailable, preventing unnecessary "progressing" or version-blocking and improving operator condition accuracy (including setting Available/Progressing appropriately in that case).

  • Tests

  • Added comprehensive tests covering many deployment rollout scenarios, generation observation, replica edge cases, and status synchronization outcomes.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Comment thread pkg/operator/status.go
statusMsg := ""
-existingDeploymentNames := sets.String{}
+existingDeploymentNames := sets.New[string]()
for _, dep := range existingDeployments.Items {
Member


Why do we track multiple deployments? I can only see one. This is unnecessarily confusing.

Comment thread pkg/operator/status.go
versionUpdatable = false
versionUpdatableAndDeploymentsComplete = false
continue
}
Member


The status check also seems to expect a very specific deployment with 1 replica.

Comment thread pkg/operator/status.go
setProgressingTrue(operatorConfigCopy, reason, statusMsg)
setAvailableFalse(operatorConfigCopy, reason, statusMsg)
versionUpdatable = false
if isDeploymentUpToDate(dep) {
Member


This could be a false positive. During a workload rollout, we should report progressing until all the replicas become available. Why can't we depend on the Deployment Progressing condition instead like we did in openshift/library-go#2128?
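Depending on the Deployment's own Progressing condition, as this comment suggests, might look roughly like the sketch below. Local structs stand in for `appsv1.DeploymentCondition`; the reason string `"NewReplicaSetAvailable"` is what the Deployment controller sets on a Progressing=True condition once a rollout has fully completed, so it must be excluded when deciding whether a rollout is still in flight.

```go
package main

import "fmt"

// condition is a simplified stand-in for appsv1.DeploymentCondition.
type condition struct {
	Type   string // e.g. "Progressing" (appsv1.DeploymentProgressing)
	Status string // "True" / "False" / "Unknown"
	Reason string
}

// progressingFromDeployment reads the Deployment's own Progressing
// condition instead of re-deriving rollout state from replica counts.
// A completed rollout keeps Progressing=True with reason
// "NewReplicaSetAvailable", so that combination counts as not rolling out.
func progressingFromDeployment(conds []condition) bool {
	for _, c := range conds {
		if c.Type == "Progressing" {
			return c.Status == "True" && c.Reason != "NewReplicaSetAvailable"
		}
	}
	return false // no condition yet: nothing observed to be rolling out
}

func main() {
	rolledOut := []condition{{Type: "Progressing", Status: "True", Reason: "NewReplicaSetAvailable"}}
	midRollout := []condition{{Type: "Progressing", Status: "True", Reason: "ReplicaSetUpdated"}}
	fmt.Println(progressingFromDeployment(rolledOut), progressingFromDeployment(midRollout))
	// false true
}
```

This delegates rollout detection to the Deployment controller's bookkeeping, which already accounts for surge, unavailability budgets, and terminating pods.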

Comment thread pkg/operator/status.go
// Deployment is up-to-date but temporarily unavailable (e.g. node reboot,
// pod eviction). The versionUpdatable guard ensures this only fires when no
// other deployment is mid-rollout or deleting. Keep Available=True to avoid
// tripping CVO invariant monitors during transient pod rescheduling.
Member


This is incorrect. The deployment is unavailable and since it is a single replica nothing can be done during rescheduling. We either need a monitor exception or increase the number of replicas.

expect: false,
},
{
name: "extra replicas still terminating",
Member


FYI: Recreate strategy takes terminating replicas into account. We can even observe the terminating replicas in the status.
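The terminating-replica situation this comment points at can be sketched as a simple predicate. This is an illustrative assumption about the replica-count shape, not the operator's code: with terminating pods still counted, `Status.Replicas` temporarily exceeds `Spec.Replicas` even though the desired number of replicas is already available.

```go
package main

import "fmt"

// extraReplicasTerminating is a hypothetical check for the test case
// makeDeployment("service-ca", 1, 1, 2, 2, 1): spec asks for 1 replica,
// status still reports 2 because an old pod has not finished terminating,
// while 1 replica is already available.
func extraReplicasTerminating(specReplicas, statusReplicas, availableReplicas int32) bool {
	return statusReplicas > specReplicas && availableReplicas >= specReplicas
}

func main() {
	fmt.Println(extraReplicasTerminating(1, 2, 1)) // true: extra replica still terminating
	fmt.Println(extraReplicasTerminating(1, 1, 1)) // false: steady state
}
```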

@tchap
Contributor Author

tchap commented May 5, 2026

I am going to pause this for now and implement some helpers that can be used for status conditions management as part of openshift/library-go#2128

/hold

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 5, 2026

Labels

do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants