
OCPBUGS-65896: controller/workload: Rework Degraded/Progressing conditions #2128

Open · tchap wants to merge 1 commit into openshift:master from tchap:workload-condition-overwrites

Conversation

@tchap (Contributor) commented Feb 18, 2026:

Reworks how the workload controller computes Degraded, Progressing, and Available conditions to fix false positives during scaling and improve failure detection.

Progressing/Degraded no longer consult Deployment generation. Previously, a generation mismatch (e.g. from a scale-up) would immediately set Progressing=True and could trigger Degraded. Now the controller relies solely on the Deployment's own Progressing condition:

  • Progressing=True when the deployment has not yet reported NewReplicaSetAvailable and hasn't timed out.
  • ProgressDeadlineExceeded is detected from the Deployment's Progressing condition and surfaces as both Progressing=False and DeploymentDegraded=True (a sketch of this detection follows the list).
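A minimal sketch of that detection, named after the hasDeploymentTimedOutProgressing helper the review walkthrough mentions and assuming its (message, timedOut) result shape; the real implementation lives in workload.go:

```go
package workload

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// hasDeploymentTimedOutProgressing reports whether the deployment controller
// gave up on the rollout. Instead of comparing generations, it reads the
// Deployment's own Progressing condition and looks for the
// ProgressDeadlineExceeded reason set by the deployment controller once
// .spec.progressDeadlineSeconds elapses without progress.
func hasDeploymentTimedOutProgressing(status appsv1.DeploymentStatus) (string, bool) {
	for _, cond := range status.Conditions {
		if cond.Type != appsv1.DeploymentProgressing {
			continue
		}
		if cond.Status == corev1.ConditionFalse && cond.Reason == "ProgressDeadlineExceeded" {
			return cond.Message, true
		}
		// Progressing is still True (or False for another reason):
		// the rollout has not timed out.
		return cond.Message, false
	}
	return "", false
}
```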

DeploymentDegraded now detects failing pods (when not mid-rollout). A pod is considered failing when it has not become Ready within ProgressDeadlineSeconds of its creation, or (when MinReadySeconds > 0) when its Ready condition is flapping: currently Ready, but with a last transition too recent to count as available, checked only after ProgressDeadlineSeconds + MinReadySeconds have elapsed. Terminating pods are excluded to avoid false positives from the old ReplicaSet during rollouts. Diagnostic messages use deployment.PodContainersStatus to describe container states. A sketch of the heuristic follows.
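To make the heuristic concrete, here is a sketch under the rules above. countFailingPods and podReadyCondition are hypothetical names (the PR's actual helpers are hasFailingPods and isFailureWaitingReason), and the deadlines are passed in from the Deployment spec:

```go
package workload

import (
	"time"

	corev1 "k8s.io/api/core/v1"
)

// countFailingPods applies the failure rules described above. A pod counts as
// failing when it has not become Ready within progressDeadline of its
// creation, or, when minReady > 0, when it is Ready but its Ready condition
// flipped too recently to count as available (a flapping pod), checked only
// once progressDeadline+minReady has elapsed.
func countFailingPods(pods []*corev1.Pod, progressDeadline, minReady time.Duration, now time.Time) int {
	failing := 0
	for _, pod := range pods {
		// Terminating pods belong to the old ReplicaSet during a rollout;
		// counting them would create false positives.
		if pod.DeletionTimestamp != nil {
			continue
		}
		ready, lastTransition := podReadyCondition(pod)
		age := now.Sub(pod.CreationTimestamp.Time)
		switch {
		case !ready && age > progressDeadline:
			failing++ // never became Ready within the per-pod deadline
		case ready && minReady > 0 && age > progressDeadline+minReady && now.Sub(lastTransition) < minReady:
			failing++ // Ready, but flapping: too recent to count as available
		}
	}
	return failing
}

// podReadyCondition extracts the Ready condition's status and last transition.
func podReadyCondition(pod *corev1.Pod) (ready bool, lastTransition time.Time) {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodReady {
			return cond.Status == corev1.ConditionTrue, cond.LastTransitionTime.Time
		}
	}
	return false, time.Time{}
}
```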

The WorkloadDegraded SyncError condition is set via defer to ensure it is applied regardless of the return path, as sketched below.
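A minimal sketch of that defer pattern, assuming illustrative Controller fields (conditionsPrefix, operatorClient); v1helpers.UpdateStatus and v1helpers.UpdateConditionFn are existing library-go helpers:

```go
package workload

import (
	"context"

	operatorv1 "github.com/openshift/api/operator/v1"
	utilerrors "k8s.io/apimachinery/pkg/util/errors"

	"github.com/openshift/library-go/pkg/operator/v1helpers"
)

// Controller is a stand-in; only the fields the sketch touches are shown.
type Controller struct {
	conditionsPrefix string
	operatorClient   v1helpers.OperatorClient
}

func (c *Controller) sync(ctx context.Context) (err error) {
	var errs []error
	defer func() {
		// Runs on every return path, so the WorkloadDegraded/SyncError
		// condition can never be skipped by an early return.
		cond := operatorv1.OperatorCondition{
			Type:   c.conditionsPrefix + "WorkloadDegraded",
			Status: operatorv1.ConditionFalse,
			Reason: "AsExpected",
		}
		if len(errs) > 0 {
			cond.Status = operatorv1.ConditionTrue
			cond.Reason = "SyncError"
			cond.Message = utilerrors.NewAggregate(errs).Error()
		}
		if _, _, updateErr := v1helpers.UpdateStatus(ctx, c.operatorClient, v1helpers.UpdateConditionFn(cond)); updateErr != nil {
			err = utilerrors.NewAggregate([]error{err, updateErr})
		}
	}()
	// ... the actual sync work appends failures to errs ...
	return utilerrors.NewAggregate(errs)
}
```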

This affects authentication-operator and openshift-apiserver-operator.

Summary by CodeRabbit

  • Bug Fixes

    • Centralized status error messaging and clearer Progressing state handling with improved timeout detection and explicit timed-out messaging.
    • More accurate Degraded reporting: better detection of failing pods and concise "N of M unavailable" messages; preserves "AsExpected" when appropriate.
    • Stricter version recording gated on observed generation and replica alignment; pod-list errors now surface as SyncError in status.
  • Tests

    • Expanded and rewritten status scenarios covering timeouts, multi-pod failures, pod-list errors, recovery paths, and version-recording validations.

openshift-ci-robot added the jira/valid-reference and jira/valid-bug labels Feb 18, 2026
@openshift-ci-robot:

@tchap: This pull request references Jira Issue OCPBUGS-65896, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (ksiddiqu@redhat.com), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

To enable scaling not to automatically set Progressing/Degraded,
conditions handling is aligned so that the Deployment generation is not
consulted any more.

Degraded only happens when the Deployment is not Available or it times
out progressing.

Progressing is set to True when the Deployment is not in
NewReplicaSetAvailable and it hasn't timed out progressing.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@tchap (Contributor, Author) commented Feb 18, 2026:

/cc @atiratree

openshift-ci Bot requested a review from atiratree February 18, 2026 10:53
Review thread on pkg/operator/apiserver/controller/workload/workload.go:

```go
// Update is done when all pods have been updated to the latest revision
// and the deployment controller has reported NewReplicaSetAvailable
workloadIsBeingUpdated := !workloadAtHighestGeneration || !hasDeploymentProgressed(workload.Status)
workloadIsBeingUpdatedTooLong := v1helpers.IsUpdatingTooLong(previousStatus, *deploymentProgressingCondition.Type)
```
Contributor Author (tchap): We don't use IsUpdatingTooLong anymore; the check is now based on ProgressDeadlineExceeded from the deployment controller.

Member (atiratree): Let's also set .spec.progressDeadlineSeconds to 15m on the consumer workloads to honor the superseded progressingConditionTimeout value (a sketch follows at the end of this thread).

Contributor Author (tchap): I don't think we can do that. Sync from the delegate returns the synchronized deployment already, so we would have to send another update request, I guess.

Contributor Author (tchap): We can align this in auth when updating library-go. I will actually open a PR there once this is accepted, to run CI anyway.

Member (atiratree): Yup, I meant that we should update it in the consumer repos like auth.

Member (atiratree): Might be better to open a proof PR before the library-go changes are merged.

@atiratree (Member) left a review: Just one nit, otherwise LGTM, thanks!


@atiratree (Member):
/lgtm
/hold
for the proof PR openshift/cluster-authentication-operator#843

openshift-ci Bot added the do-not-merge/hold label Feb 19, 2026
openshift-ci Bot added the lgtm label Feb 19, 2026
tchap force-pushed the workload-condition-overwrites branch from 0b782a2 to 08c8c15 (April 14, 2026 08:21)
openshift-ci Bot removed the lgtm label Apr 14, 2026
@coderabbitai (Bot) commented Apr 14, 2026:

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Refactors workload controller condition evaluation: centralizes error-to-message formatting, replaces generation-based Progressing checks with timeout/progress detection, simplifies Degraded detection to pod-level failure reasoning, tightens version-recording gating, and expands tests to cover timeout, multi-pod failures, pod-list errors, and version-recording assertions.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Controller logic: pkg/operator/apiserver/controller/workload/workload.go | Added errMessage, hasDeploymentTimedOutProgressing, hasFailingPods, and isFailureWaitingReason. Replaced generation/update-based progressing checks with timeout/progress logic and moved degraded evaluation to pod-failure heuristics. Tightened version-recording gating to require Generation == ObservedGeneration and updated/available replicas equal to desired. Minor message formatting tweaks. |
| Tests: pkg/operator/apiserver/controller/workload/workload_test.go | Rewrote TestUpdateOperatorStatus scenarios: added cases for ProgressDeadlineExceeded, multi-pod crashloop aggregation, pod-list lister errors, and timeout recovery via previous conditions. Introduced fakeVersionRecorder, per-scenario validateVersionRecorder, and podListErr injection. Updated expected Condition reasons/messages/statuses and version-recording expectations. |

Sequence Diagram(s)

```mermaid
sequenceDiagram
  participant Controller
  participant Deployment as DeploymentStatus
  participant PodLister
  participant VersionRecorder
  Controller->>Deployment: read deployment status (generation, observedGeneration, desired/updated/available replicas, conditions)
  Controller->>PodLister: list pods for deployment
  alt pod list returns error
    Controller->>Controller: collect error -> errs
    Controller->>Controller: set Degraded = True (SyncError with errMessage)
  else pod list OK
    Controller->>Controller: hasFailingPods(pods) -> failingPodsCount
    alt failingPodsCount > 0
      Controller->>Controller: set Degraded = True (N of M instances unavailable...)
    else
      Controller->>Controller: hasDeploymentTimedOutProgressing(status) -> (msg, timedOut)
      alt timedOut
        Controller->>Controller: set Progressing = False (ProgressDeadlineExceeded, timed-out message)
        Controller->>Controller: set Degraded = True (ProgressDeadlineExceeded)
      else
        Controller->>Controller: detect PodsUpdating or AsExpected
        alt PodsUpdating
          Controller->>Controller: set Progressing = True (PodsUpdating)
        else AsExpected
          Controller->>Controller: set Progressing = False (AsExpected)
          Controller->>Controller: set Degraded = False (AsExpected)
        end
      end
    end
  end
  alt Generation/replica gating satisfied (Generation==ObservedGeneration && updated==available==desired)
    Controller->>VersionRecorder: SetVersion(operand, targetVersion)
  end
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

🚥 Pre-merge checks | ✅ 10 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 22.22%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Test Structure And Quality | ⚠️ Warning | Test assertions lack meaningful failure messages: lines ~1163-1170 use bare t.Fatal(err) without context. | Replace bare t.Fatal(err) with t.Fatalf(...) including context, e.g. t.Fatalf("scenario %q: status validation failed: %v", scenario.name, err). |

✅ Passed checks (10 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title directly and specifically describes the main change: reworking the Degraded and Progressing condition logic in the workload controller. |
| Linked Issues check | ✅ Passed | Skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | All test case names in the modified test file are stable and deterministic, with no dynamic identifiers, generated suffixes, timestamps, UUIDs, or random values. |
| Microshift Test Compatibility | ✅ Passed | No Ginkgo e2e tests are added; the modified test file contains only standard Go unit tests. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | No new Ginkgo e2e tests; the modifications in workload_test.go are standard Go unit tests. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | The PR modifies workload controller condition logic only; no scheduling-related keywords (affinity, topology, nodeSelector, PodDisruptionBudget, etc.) appear in the modified files. |
| Ote Binary Stdout Contract | ✅ Passed | The contract applies only to OTE test binaries; this PR modifies library code and unit tests. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | The modified tests use Go's standard testing package; no Ginkgo e2e tests with IPv4 assumptions or external connectivity requirements. |
| Description Check | ✅ Passed | Skipped because CodeRabbit's high-level summary is enabled. |



@openshift-ci-robot:

@tchap: This pull request references Jira Issue OCPBUGS-65896, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (ksiddiqu@redhat.com), skipping review request.

Details

In response to this:

To enable scaling not to automatically set Progressing/Degraded,
conditions handling is aligned so that the Deployment generation is not
consulted any more.

Degraded only happens when the deployment is not Available or it times
out progressing. This is less strict than before, but checking available
replicas would mean we go Degraded on scaling automatically.

Progressing is set to True when the deployment is not in
NewReplicaSetAvailable and it hasn't timed out progressing.

I also improved the overall test case names in a separate commit.

This affects authentication-operator and openshift-apiserver-operator.

Summary by CodeRabbit

  • Bug Fixes
  • Improved deployment progress tracking and timeout detection to provide more accurate status reporting
  • Enhanced degradation state identification to better reflect unavailable pod conditions
  • Refined version recording requirements to ensure proper state validation before marking updates as complete


@coderabbitai (Bot) left a review comment:

🧹 Nitpick comments (1)
pkg/operator/apiserver/controller/workload/workload_test.go (1)

54-67: Please add explicit coverage for the new version-recording gate.

pkg/operator/apiserver/controller/workload/workload.go lines 358-361 changed when SetVersion fires, but this table still never exercises that path: no scenario sets operatorConfigAtHighestRevision = true, and the test controller leaves versionRecorder unset. A small fake recorder plus one positive and one negative case would keep this behavior from regressing (a sketch follows below).

Also applies to: 846-878
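A fake recorder along those lines might look like the following sketch. The embedded status.VersionGetter is library-go's real interface; the struct and field names are illustrative (the PR itself later introduced a fakeVersionRecorder):

```go
package workload

import "github.com/openshift/library-go/pkg/operator/status"

// fakeVersionRecorder records SetVersion calls so test scenarios can assert
// whether the version-recording gate fired.
type fakeVersionRecorder struct {
	status.VersionGetter // embedded so only SetVersion needs an override
	setVersionCalls      []string
}

// SetVersion captures the operand/version pair instead of recording it.
func (f *fakeVersionRecorder) SetVersion(operandName, version string) {
	f.setVersionCalls = append(f.setVersionCalls, operandName+"="+version)
}
```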


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 15f9a88c-d378-463c-ad2b-87fa88d3a8d7

📥 Commits

Reviewing files that changed from the base of the PR and between d2db42c and 08c8c15.

📒 Files selected for processing (2)
  • pkg/operator/apiserver/controller/workload/workload.go
  • pkg/operator/apiserver/controller/workload/workload_test.go

@coderabbitai (Bot) left a review comment:

🧹 Nitpick comments (1)
pkg/operator/apiserver/controller/workload/workload_test.go (1)

847-889: Add negative SetVersion() cases for the other readiness gates.

SetVersion() is also gated on Generation == ObservedGeneration, AvailableReplicas == desiredReplicas, and UpdatedReplicas == desiredReplicas in pkg/operator/apiserver/controller/workload/workload.go:355-364. Right now only the operatorConfigAtHighestRevision gate is exercised, so a regression that records the version during an incomplete rollout would still pass this suite. (A sketch of the full gate follows.)
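For reference, the combined gate described above amounts to something like this sketch; shouldRecordVersion is a hypothetical name, since the actual check is inline in workload.go:

```go
package workload

import appsv1 "k8s.io/api/apps/v1"

// shouldRecordVersion mirrors the readiness gates named in the comment: the
// operator config must be at its highest revision, the deployment controller
// must have observed the latest spec, and every desired replica must be both
// updated and available.
func shouldRecordVersion(d *appsv1.Deployment, operatorConfigAtHighestRevision bool) bool {
	desired := int32(1)
	if d.Spec.Replicas != nil {
		desired = *d.Spec.Replicas
	}
	return operatorConfigAtHighestRevision &&
		d.Generation == d.Status.ObservedGeneration &&
		d.Status.UpdatedReplicas == desired &&
		d.Status.AvailableReplicas == desired
}
```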


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: f83fb810-ff6d-40fd-9f77-c2365c142fe9

📥 Commits

Reviewing files that changed from the base of the PR and between 08c8c15 and 882ecf8.

📒 Files selected for processing (1)
  • pkg/operator/apiserver/controller/workload/workload_test.go

tchap force-pushed the workload-condition-overwrites branch from 882ecf8 to 3ae8d51 (April 14, 2026 09:08)
@atiratree (Member):

/hold cancel
/lgtm

openshift-ci Bot added the lgtm label and removed the do-not-merge/hold label Apr 20, 2026
Review thread (file header):

```go
package workload

import (
	"context"
```
Contributor:

Am I right that those are the three (at least) main behavioral changes introduced in this PR?

1. Scaling the workload (up/down) won't report Progressing=True 

2. A stuck rollout eventually stops reporting Progressing=True 

3. A workload missing some replicas (2/3) no longer triggers Degraded=True

Contributor Author (tchap):

Yeah, right. We only degrade on Available == 0.

Also, checking the Progressing timeout now depends on the native Deployment ProgressDeadlineExceeded instead of using a helper, and once the progress deadline is hit, we go Degraded.

Contributor: I would like to get @benluddy's opinion on that.

Contributor Author (tchap): OK, I added another commit that actually makes Degraded work as expected, i.e. Degraded when not all replicas are available. This is done by checking the relevant pods and counting the failing ones, so that pods that are just starting up are not included in the Degraded condition.

openshift-ci Bot removed the lgtm label Apr 24, 2026
tchap force-pushed the workload-condition-overwrites branch from 08ecc7d to dee9075 (April 24, 2026 14:29)
Review thread on pkg/operator/apiserver/controller/workload/workload.go:

```go
} else {
	if hasFailing || workload.Status.AvailableReplicas == 0 {
		var failureDescription string
		if hasFailing {
```
Member (atiratree):

The pods might still exist with AvailableReplicas == 0. Even if they don't, shouldn't PodContainersStatus fn handle that?

@tjungblu (Contributor):

/cc

openshift-ci Bot requested a review from tjungblu April 30, 2026 09:52
tchap force-pushed the workload-condition-overwrites branch from a216c34 to 8af844e (April 30, 2026 12:05)
@tchap (Contributor, Author) commented Apr 30, 2026:

/jira refresh

openshift-ci-robot added the jira/valid-bug label and removed the jira/invalid-bug label Apr 30, 2026
@openshift-ci-robot:

@tchap: This pull request references Jira Issue OCPBUGS-65896, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (ksiddiqu@redhat.com), skipping review request.

Details

In response to this:

/jira refresh


tchap force-pushed the workload-condition-overwrites branch from 8af844e to 3159205 (April 30, 2026 12:18)
@atiratree (Member):

I think this looks like a reasonable way to monitor Progressing and Degraded even during scaling operations. The deployment controller does not support this directly, but we do use the Deployment primitives to achieve it.
/lgtm

openshift-ci Bot added the lgtm label Apr 30, 2026
Commit message:

Rework how the workload controller reports Degraded and Progressing
conditions to support scaling without false positives, and improve
failure detection using pod-level Ready conditions.

Key changes:

- Stop consulting deployment's observed vs desired generation for
  Progressing; instead use the Progressing condition's
  NewReplicaSetAvailable reason, so scaling does not trigger
  Progressing=True.

- Report ProgressDeadlineExceeded via both DeploymentDegraded and
  DeploymentProgressing (set to False) when the deployment controller
  signals a timeout.

- Replace container-level failure inspection with pod Ready condition
  checks gated by per-pod ProgressDeadlineSeconds. Filter out
  terminating pods to avoid false positives from old ReplicaSets.

- Detect flapping Ready conditions: when MinReadySeconds > 0 and a pod
  keeps toggling Ready without staying stable, flag it as failing once
  the combined progressDeadline + minReadySeconds window has elapsed.

- Move SyncError condition into a defer so it is always set regardless
  of the return path.

- Always call PodContainersStatus when reporting UnavailablePod to
  include container-level diagnostics in the degraded message.
tchap force-pushed the workload-condition-overwrites branch from 3159205 to 2f343be (April 30, 2026 12:29)
openshift-ci Bot removed the lgtm label Apr 30, 2026
@atiratree (Member):

/lgtm

openshift-ci Bot added the lgtm label Apr 30, 2026
@openshift-ci (Bot) commented Apr 30, 2026:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: atiratree, tchap
Once this PR has been reviewed and has the lgtm label, please assign p0lyn0mial for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci (Bot) commented Apr 30, 2026:

@tchap: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@atiratree (Member):

Btw @tchap can you also try to run this again? openshift/cluster-authentication-operator#843

tjungblu added a commit to tjungblu/library-go that referenced this pull request May 4, 2026

…controller

Contains the degraded/progressing condition changes made to the
workload controller from openshift#2128.

Signed-off-by: Thomas Jungblut <tjungblu@redhat.com>
@p0lyn0mial (Contributor):

Could we split this PR into smaller PRs? It looks like the following changes were included:

  1. Some cleanup, which could be moved to a separate refactoring PR
  2. A rework of the progressing condition
  3. A rewording of the degraded condition
  4. Changes related to the version

This breakdown would help me better understand the proposed changes. Thanks.


Labels

  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants