OCPBUGS-62633: Fix Progressing/Available status conditions on node reboot #349
tchap wants to merge 3 commits into openshift:main
Conversation
Cover the current behavior of syncStatus, isDeploymentStatusAvailable, isDeploymentStatusAvailableAndUpdated, and isDeploymentStatusComplete, including partial availability scenarios.
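For illustration, a minimal sketch of the table-driven style such coverage typically takes; the helper name and value-receiver signature are assumed from snippets quoted later in this PR, not copied from the real test file:

```go
package operator

import (
	"testing"

	appsv1 "k8s.io/api/apps/v1"
)

// Sketch only: exercises an assumed isDeploymentStatusAvailable(appsv1.Deployment) bool helper.
func TestIsDeploymentStatusAvailableSketch(t *testing.T) {
	cases := []struct {
		name              string
		availableReplicas int32
		expect            bool
	}{
		{name: "no available replicas", availableReplicas: 0, expect: false},
		{name: "one available replica", availableReplicas: 1, expect: true},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			dep := appsv1.Deployment{
				Status: appsv1.DeploymentStatus{AvailableReplicas: tc.availableReplicas},
			}
			if got := isDeploymentStatusAvailable(dep); got != tc.expect {
				t.Errorf("isDeploymentStatusAvailable() = %v, want %v", got, tc.expect)
			}
		})
	}
}
```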
@tchap: This pull request references Jira Issue OCPBUGS-62633, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
Walkthrough: Separates "up-to-date" from "available" for deployments, tightens completion checks to require both, changes syncStatus control flow to handle deleting, up-to-date-but-unavailable, and not-up-to-date cases (with an early-return path), switches a set type, and adds extensive unit tests covering many rollout scenarios.
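To make the "up-to-date" versus "available" split concrete, here is a minimal sketch of what such helpers can look like. The function names match the ones mentioned in this PR, but the bodies below are assumptions for illustration, not the operator's actual implementation:

```go
package operator

import appsv1 "k8s.io/api/apps/v1"

// isDeploymentUpToDate (sketch): the controller has observed the latest spec
// and every existing replica already runs the current pod template.
func isDeploymentUpToDate(dep appsv1.Deployment) bool {
	return dep.Status.ObservedGeneration >= dep.Generation &&
		dep.Status.UpdatedReplicas == dep.Status.Replicas
}

// isDeploymentStatusAvailable (sketch): at least one replica is able to serve.
func isDeploymentStatusAvailable(dep appsv1.Deployment) bool {
	return dep.Status.AvailableReplicas > 0
}

// isDeploymentStatusComplete (sketch): "complete" now requires both conditions.
func isDeploymentStatusComplete(dep appsv1.Deployment) bool {
	return isDeploymentUpToDate(dep) && isDeploymentStatusAvailable(dep)
}
```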
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 9 | ❌ 3
❌ Failed checks (3 warnings)
✅ Passed checks (9 passed)
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: tchap. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
/jira refresh
@tchap: This pull request references Jira Issue OCPBUGS-62633, which is invalid:
/jira refresh
@tchap: This pull request references Jira Issue OCPBUGS-62633, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (xxia@redhat.com), skipping review request.
@tchap: This pull request references Jira Issue OCPBUGS-62633, which is valid. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (xxia@redhat.com), skipping review request.
Actionable comments posted: 2
🧹 Nitpick comments (1)
pkg/operator/status_test.go (1)
251-409: Add a mixed-state regression case for unavailable deployments. Please add one `TestSyncStatus` case with two targets where one deployment is up-to-date-but-unavailable and the other is not-up-to-date-and-unavailable; the expected outcome should remain `ManagedDeploymentsNotReady`, `Available=False`, `Progressing=True`, and `versionSet=false`.

Suggested table entry:
```diff
+		{
+			name:    "two targets, mixed unavailable states prefers not-ready",
+			targets: sets.NewString("deploy-a", "deploy-b"),
+			deployments: []appsv1.Deployment{
+				makeDeployment("deploy-a", 1, 1, 1, 1, 0), // up-to-date but unavailable
+				makeDeployment("deploy-b", 1, 1, 0, 0, 0), // not up-to-date and unavailable
+			},
+			expectProgressing: operatorv1.ConditionTrue,
+			expectAvailable:   operatorv1.ConditionFalse,
+			expectVersionSet:  false,
+			expectReason:      "ManagedDeploymentsNotReady",
+		},
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/operator/status_test.go` around lines 251 - 409, Add a new TestSyncStatus table case inside the tests slice that uses TestSyncStatus and makeDeployment: set targets to sets.NewString("deploy-a","deploy-b") and deployments to one up-to-date-but-unavailable deployment (e.g. makeDeployment("deploy-a", 1, 1, 1, 1, 0)) and one not-up-to-date-and-unavailable deployment (e.g. makeDeployment("deploy-b", 2, 1, 1, 1, 0)); set expectProgressing to operatorv1.ConditionTrue, expectAvailable to operatorv1.ConditionFalse, expectVersionSet to false, and expectReason to "ManagedDeploymentsNotReady".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/operator/status.go`:
- Around line 170-183: The current check handles
deploymentUnavailableButUpToDate before deploymentUnavailableNotUpToDate, which
can prematurely call c.setVersion() and set Progressing=False; invert the two
condition blocks so the branch for deploymentUnavailableNotUpToDate runs first
(set reason "ManagedDeploymentsNotReady", call setAvailableFalse and
setProgressingTrue and return), and only after that check for
deploymentUnavailableButUpToDate (setAvailableFalse, setProgressingFalse, call
c.setVersion() and return); keep using the same helper calls (setAvailableFalse,
setProgressingFalse, setProgressingTrue, c.setVersion()) and return semantics.
- Line 103: Change the deprecated usage of sets.String to the generic sets API:
update the function signature of serviceCAOperator.syncStatus to accept
targetDeploymentNames as sets.Set[string] instead of sets.String, replace any
literal initializations like sets.String{} with sets.Set[string]{} or use
sets.New[string](), and update all call sites (e.g., where syncStatus is invoked
and any variables in starter (starter.go) and tests (status_test.go)) to create
and pass sets.New[string](...) or an empty sets.Set[string]{} accordingly so all
usages compile with k8s.io/apimachinery v0.35.1.
---
Nitpick comments:
In `@pkg/operator/status_test.go`:
- Around line 251-409: Add a new TestSyncStatus table case inside the tests
slice that uses TestSyncStatus and makeDeployment: set targets to
sets.NewString("deploy-a","deploy-b") and deployments to one
up-to-date-but-unavailable deployment (e.g. makeDeployment("deploy-a", 1, 1, 1,
1, 0)) and one not-up-to-date-and-unavailable deployment (e.g.
makeDeployment("deploy-b", 2, 1, 1, 1, 0)); set expectProgressing to
operatorv1.ConditionTrue, expectAvailable to operatorv1.ConditionFalse,
expectVersionSet to false, and expectReason to "ManagedDeploymentsNotReady".
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b5a06a13-3c4c-4501-abe8-c070124d7841
📒 Files selected for processing (2)
pkg/operator/status.go, pkg/operator/status_test.go
Force-pushed from 2ecc13f to 4a3f1bc
Force-pushed from 867e379 to 25e9256
♻️ Duplicate comments (1)
pkg/operator/status.go (1)
104-123: ⚠️ Potential issue | 🟠 Major
Handle deleting and unavailable/outdated deployments before the `ManagedDeploymentsAvailable` fallback. Line 109 and Line 120 only flip `versionUpdatable = false`, so both cases currently fall through to lines 175-177 and report `Available=True`. That regresses the status this PR is meant to fix: a deleting deployment, or one with 0 available replicas that is not yet up to date, should end up `Available=False`, `Progressing=True`, with `ManagedDeploymentsNotReady`.

Proposed fix:
```diff
 	versionUpdatable := true
 	versionUpdatableAndDeploymentsComplete := true
 	deploymentUnavailableButUpToDate := false
+	deploymentUnavailableNotUpToDate := false
+	deploymentDeleting := false
 	statusMsg := ""
 	existingDeploymentNames := sets.New[string]()
 	for _, dep := range existingDeployments.Items {
 		existingDeploymentNames.Insert(dep.Name)
 		if dep.DeletionTimestamp != nil {
 			statusMsg += fmt.Sprintf("\n%s deleting", dep.Name)
+			deploymentDeleting = true
 			versionUpdatable = false
 			versionUpdatableAndDeploymentsComplete = false
 			continue
 		}
 		if !isDeploymentStatusAvailable(dep) {
 			if isDeploymentUpToDate(dep) {
 				statusMsg += fmt.Sprintf("\n%s: all replicas are up-to-date but not yet available", dep.Name)
 				deploymentUnavailableButUpToDate = true
 			} else {
 				statusMsg += fmt.Sprintf("\n%s does not have available replicas", dep.Name)
+				deploymentUnavailableNotUpToDate = true
 				versionUpdatable = false
 			}
 			versionUpdatableAndDeploymentsComplete = false
 			continue
 		}
@@
+	if deploymentDeleting || deploymentUnavailableNotUpToDate {
+		reason := "ManagedDeploymentsNotReady"
+		setAvailableFalse(operatorConfigCopy, reason, statusMsg)
+		setProgressingTrue(operatorConfigCopy, reason, statusMsg)
+		return
+	}
+
 	// Deployment is up-to-date but temporarily unavailable (e.g. node reboot,
 	// pod eviction). The versionUpdatable guard ensures this only fires when no
 	// other deployment is mid-rollout or deleting.
 	if deploymentUnavailableButUpToDate && versionUpdatable {
```

Also applies to: 154-177
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/operator/status.go` around lines 104 - 123, The loop over existingDeployments currently only flips versionUpdatable in some failure cases, letting the ManagedDeploymentsAvailable fallback incorrectly report Available=True; update the branches that handle deleting deployments and "not available" deployments (the blocks using dep.DeletionTimestamp, isDeploymentStatusAvailable(dep) and isDeploymentUpToDate(dep)) to also set versionUpdatableAndDeploymentsComplete = false (and ensure versionUpdatable=false where appropriate) and keep/decorate the deploymentUnavailableButUpToDate flag so that any deleting, unavailable, or outdated deployment prevents the ManagedDeploymentsAvailable fallback from marking the overall status Available=True (look for variables existingDeployments, isDeploymentStatusAvailable, isDeploymentUpToDate, deploymentUnavailableButUpToDate, versionUpdatable, versionUpdatableAndDeploymentsComplete, and the ManagedDeploymentsAvailable fallback logic).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@pkg/operator/status.go`:
- Around line 104-123: The loop over existingDeployments currently only flips
versionUpdatable in some failure cases, letting the ManagedDeploymentsAvailable
fallback incorrectly report Available=True; update the branches that handle
deleting deployments and "not available" deployments (the blocks using
dep.DeletionTimestamp, isDeploymentStatusAvailable(dep) and
isDeploymentUpToDate(dep)) to also set versionUpdatableAndDeploymentsComplete =
false (and ensure versionUpdatable=false where appropriate) and keep/decorate
the deploymentUnavailableButUpToDate flag so that any deleting, unavailable, or
outdated deployment prevents the ManagedDeploymentsAvailable fallback from
marking the overall status Available=True (look for variables
existingDeployments, isDeploymentStatusAvailable, isDeploymentUpToDate,
deploymentUnavailableButUpToDate, versionUpdatable,
versionUpdatableAndDeploymentsComplete, and the ManagedDeploymentsAvailable
fallback logic).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: db89e0dd-03a2-4258-bd01-2055f7623ff7
📒 Files selected for processing (3)
pkg/operator/starter.go, pkg/operator/status.go, pkg/operator/status_test.go
✅ Files skipped from review due to trivial changes (1)
- pkg/operator/starter.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/operator/status_test.go
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/operator/status.go`:
- Around line 116-122: When a deployment is unavailable but not up-to-date the
code must route to the ManagedDeploymentsNotReady path instead of falling
through to ManagedDeploymentsAvailable; in the else branch around
isDeploymentUpToDate(dep) update (the block setting statusMsg and
versionUpdatable) set the flag that triggers the "not ready"/delete path (e.g.,
set a managedDeploymentsNotReady or reuse deploymentUnavailableButUpToDate
appropriately) so that the post-loop logic distinguishes
"unavailable+not-up-to-date" from "unavailable+up-to-date" and reports
Available=false; update the checks later (the post-loop special-case that
currently only looks at deploymentUnavailableButUpToDate) to also consider this
not-ready flag so the ManagedDeploymentsNotReady branch executes for
stale/unavailable deployments.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 5c05b29b-d3ec-436e-b82e-000cf5b9dd3e
📒 Files selected for processing (2)
pkg/operator/status.go, pkg/operator/status_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/operator/status_test.go
Force-pushed from 25e9256 to c3e8952
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/operator/status_test.go (1)
23-39: Consider parameterizing `Spec.Replicas` in the helper for clarity. The `makeDeployment` helper accepts `replicas` for `Status.Replicas` but hardcodes `Spec.Replicas` to 1. This works, but can be confusing when tests need `Status.Replicas != Spec.Replicas`. Several test cases already work around this with inline functions (e.g., lines 59-63, 106-110). Consider adding an optional parameter or creating a separate helper variant for multi-replica scenarios.
♻️ Suggested helper improvement
```diff
-func makeDeployment(name string, generation, observedGeneration int64, replicas, updatedReplicas, availableReplicas int32) appsv1.Deployment {
+func makeDeployment(name string, generation, observedGeneration int64, specReplicas, replicas, updatedReplicas, availableReplicas int32) appsv1.Deployment {
 	return appsv1.Deployment{
 		ObjectMeta: metav1.ObjectMeta{
 			Name:       name,
 			Generation: generation,
 		},
 		Spec: appsv1.DeploymentSpec{
-			Replicas: ptr.To[int32](1),
+			Replicas: ptr.To[int32](specReplicas),
 		},
 		Status: appsv1.DeploymentStatus{
 			Replicas:           replicas,
 			UpdatedReplicas:    updatedReplicas,
 			AvailableReplicas:  availableReplicas,
 			ObservedGeneration: observedGeneration,
 		},
 	}
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/operator/status_test.go` around lines 23 - 39, The helper makeDeployment currently hardcodes Spec.Replicas to 1 while taking replicas for Status.Replicas, which causes confusion and forces test workarounds; modify the helper to accept an additional parameter (e.g., specReplicas int32) and set Spec.Replicas to ptr.To[int32](specReplicas), or add a second helper (e.g., makeDeploymentWithSpecReplicas) that takes specReplicas and uses it for Spec.Replicas, then update tests that previously used inline functions (the callers of makeDeployment that need non-1 spec replicas) to call the new/extended helper so Spec.Replicas aligns with intended test scenarios.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/operator/status_test.go`:
- Around line 318-325: Rename the misleading test case name in the table entry
that uses makeDeployment("service-ca", 1, 1, 2, 2, 1) so it reflects that extra
replicas are still terminating rather than "not all replicas available yet";
update the name string (the value of the name field in that test case) to
something like "available and updated but extra replicas still terminating" to
match the actual scenario and align with TestIsDeploymentStatusComplete's
naming.
---
Nitpick comments:
In `@pkg/operator/status_test.go`:
- Around line 23-39: The helper makeDeployment currently hardcodes Spec.Replicas
to 1 while taking replicas for Status.Replicas, which causes confusion and
forces test workarounds; modify the helper to accept an additional parameter
(e.g., specReplicas int32) and set Spec.Replicas to ptr.To[int32](specReplicas),
or add a second helper (e.g., makeDeploymentWithSpecReplicas) that takes
specReplicas and uses it for Spec.Replicas, then update tests that previously
used inline functions (the callers of makeDeployment that need non-1 spec
replicas) to call the new/extended helper so Spec.Replicas aligns with intended
test scenarios.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 16313b69-b0db-421c-b32a-263d261541c6
📒 Files selected for processing (2)
pkg/operator/status.go, pkg/operator/status_test.go
When a deployment has zero available replicas but is fully up-to-date (all replicas running the current pod template), syncStatus was reporting Progressing=True. This happens during node reboots or pod evictions, situations where the operator is not making progress; the deployment is simply waiting for pods to be rescheduled. The root cause was that syncStatus set conditions inside the loop but post-loop code unconditionally overwrote them. Fix by collecting flags in the loop and setting conditions only after it:
- Up-to-date but unavailable (node reboot): Available=False, Progressing=False, version set
- Mid-rollout or deleting: falls through to existing Progressing=True paths
Force-pushed from c3e8952 to 2cef2de
/retitle OCPBUGS-62633: Fix Progressing/Available status conditions on node reboot
@tchap: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
@tchap: This pull request references Jira Issue OCPBUGS-62633, which is valid. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (xxia@redhat.com), skipping review request. The bug has been updated to refer to the pull request using the external bug tracker.
```diff
 	statusMsg := ""
-	existingDeploymentNames := sets.String{}
+	existingDeploymentNames := sets.New[string]()
 	for _, dep := range existingDeployments.Items {
```
Why do we track multiple deployments? I can only see one. This is unnecessarily confusing.
```go
			versionUpdatable = false
			versionUpdatableAndDeploymentsComplete = false
			continue
		}
```
The status check also seems to expect a very specific deployment with 1 replica.
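If that single-replica assumption ever needs to be lifted, one way to generalize the check could look like the sketch below; this is an illustration, not the operator's code, and relies only on the fact that Kubernetes defaults `Spec.Replicas` to 1 when unset:

```go
package operator

import appsv1 "k8s.io/api/apps/v1"

// allDesiredReplicasAvailable (sketch): compares availability against
// Spec.Replicas instead of assuming a single-replica deployment.
func allDesiredReplicasAvailable(dep appsv1.Deployment) bool {
	desired := int32(1) // Kubernetes defaults Spec.Replicas to 1 when unset
	if dep.Spec.Replicas != nil {
		desired = *dep.Spec.Replicas
	}
	return dep.Status.AvailableReplicas >= desired
}
```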
```go
		setProgressingTrue(operatorConfigCopy, reason, statusMsg)
		setAvailableFalse(operatorConfigCopy, reason, statusMsg)
		versionUpdatable = false
		if isDeploymentUpToDate(dep) {
```
This could be a false positive. During a workload rollout, we should report progressing until all the replicas become available. Why can't we depend on the Deployment Progressing condition instead like we did in openshift/library-go#2128?
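For context, leaning on the Deployment's own Progressing condition could look roughly like the sketch below. This illustrates the reviewer's suggestion; it is not code from this PR or from library-go#2128, though `NewReplicaSetAvailable` is the standard reason the Deployment controller sets once a rollout has completed:

```go
package operator

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// deploymentRolloutComplete (sketch): the Deployment controller reports
// Progressing=True with reason NewReplicaSetAvailable once the newest
// ReplicaSet is fully rolled out and available.
func deploymentRolloutComplete(dep *appsv1.Deployment) bool {
	for _, cond := range dep.Status.Conditions {
		if cond.Type == appsv1.DeploymentProgressing {
			return cond.Status == corev1.ConditionTrue && cond.Reason == "NewReplicaSetAvailable"
		}
	}
	return false
}
```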
```go
	// Deployment is up-to-date but temporarily unavailable (e.g. node reboot,
	// pod eviction). The versionUpdatable guard ensures this only fires when no
	// other deployment is mid-rollout or deleting. Keep Available=True to avoid
	// tripping CVO invariant monitors during transient pod rescheduling.
```
This is incorrect. The deployment is unavailable and since it is a single replica nothing can be done during rescheduling. We either need a monitor exception or increase the number of replicas.
```go
			expect: false,
		},
		{
			name: "extra replicas still terminating",
```
FYI: Recreate strategy takes terminating replicas into account. We can even observe the terminating replicas in the status.
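A rough illustration of that observation (field semantics only, not code from this operator): while old pods are still being torn down, `Status.Replicas` can exceed `Spec.Replicas`, which matches the "extra replicas still terminating" case discussed above:

```go
package operator

import appsv1 "k8s.io/api/apps/v1"

// extraReplicasTerminating (sketch): Status.Replicas still counts pods that are
// being torn down, so during a Recreate rollout it can temporarily be larger
// than the desired Spec.Replicas.
func extraReplicasTerminating(dep appsv1.Deployment) bool {
	if dep.Spec.Replicas == nil {
		return false
	}
	return dep.Status.Replicas > *dep.Spec.Replicas
}
```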
I am going to pause this for now and implement some helpers that can be used for status conditions management as part of openshift/library-go#2128.

/hold
The main motivation is to have `Progressing=false` when a node is rebooting and not all replicas are available.

I started with creating unit tests for the current state in the first commit; the following commits actually apply the necessary changes. It is more pleasant to review commit-by-commit.
A better long-term solution would be to migrate to the workload controller in library-go, which would cause a change in operator-level conditions, but we would get all condition management for free.
The overall situation regarding conditions in this operator seems rather arbitrary, but I focused on just fixing the issue and not breaking anything else.
The following summary is generated by Claude:
Summary
When a deployment has zero available replicas but is fully up-to-date (all replicas running the current pod template), `syncStatus` was incorrectly reporting `Progressing=True`. This happens during node reboots or pod evictions: the operator is not making progress, the deployment is simply waiting for pods to be rescheduled. Reporting `Progressing=True` in this case can mislead the CVO into thinking an upgrade is still rolling out, delaying or blocking upgrades.

The root cause was a structural bug in `syncStatus`: conditions (`Available=False`, `Progressing=True`) were set inside the deployment loop, but the post-loop code unconditionally overwrote them; every code path after the loop set `Available=True`. This meant `setAvailableFalse` calls inside the loop had no effect.

Fix
Stop setting conditions inside the loop. Instead, collect flags (`deploymentUnavailableButUpToDate`, `versionUpdatable`) and decide after the loop:
- Up-to-date but unavailable: `Available=True`, `Progressing=False`, version is set. The operator has nothing to do; it's waiting for infrastructure. We keep `Available=True` to avoid tripping CVO invariant monitors during transient pod rescheduling.
- Mid-rollout or deleting: the existing `Progressing=True` paths are unchanged.

A `versionUpdatable` guard on the new branch ensures that if one deployment is transiently unavailable but another is mid-rollout, we still report `Progressing=True`.

Additional changes
- New `isDeploymentUpToDate` helper to distinguish "unavailable because the node is rebooting" from "unavailable because we're mid-rollout"
- `isDeploymentStatusComplete` refactored to reuse `isDeploymentUpToDate`
- `sets.String` replaced with `sets.Set[string]` (deprecated API cleanup; see the short example after this list)
- Unit tests for `syncStatus` and all deployment status helpers
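As a quick illustration of that set-type change, here is how the deprecated and generic APIs compare (illustrative usage of the apimachinery sets package, not code from this PR):

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/util/sets"
)

func main() {
	// Deprecated typed set vs. its generic replacement.
	old := sets.NewString("service-ca")   // sets.String, deprecated
	cur := sets.New[string]("service-ca") // sets.Set[string]
	cur.Insert("another-target")

	fmt.Println(old.Has("service-ca")) // true
	fmt.Println(cur.Has("service-ca")) // true
	fmt.Println(cur.UnsortedList())    // order not guaranteed
}
```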
Test plan
- `make test-unit` passes
- `make verify` passes
- `Progressing` stays `False` while the service-ca pod is rescheduling
- `Progressing=True` is reported during the rollout and `Progressing=False` after completion

Summary by CodeRabbit
Bug Fixes
Tests