Skip to content

fix: use server-side apply for RBAC resources in management controllers#6157

Open
alexmt wants to merge 1 commit into
akuity:mainfrom
alexmt:fix/rbac-ssa-apply
Open

fix: use server-side apply for RBAC resources in management controllers#6157
alexmt wants to merge 1 commit into
akuity:mainfrom
alexmt:fix/rbac-ssa-apply

Conversation

@alexmt
Copy link
Copy Markdown
Contributor

@alexmt alexmt commented Apr 23, 2026

Summary

Fixes #6155

Replaces the CreateAlreadyExistsUpdate pattern with server-side apply (SSA) in the project and serviceaccount management controllers.

  • In newReconciler, all createXFn function pointers now use an applySSA closure that calls client.Patch(..., client.Apply, client.ForceOwnership, client.FieldOwner("kargo")). The GVK is populated automatically from the client's scheme so no TypeMeta boilerplate is needed at each call site.
  • All call sites are simplified: AlreadyExists checks and fallback Update calls are removed.
  • Same fix applied to serviceaccounts.go which called r.client.Create/r.client.Update directly.

Before: every reconcile cycle generates N × 409s (one per project/resource), then a redundant Update even when nothing changed.

After: a single PATCH per resource — no 409s, and no etcd write when the desired state already matches.

@alexmt alexmt requested a review from a team as a code owner April 23, 2026 17:44
@netlify
Copy link
Copy Markdown

netlify Bot commented Apr 23, 2026

Deploy Preview for docs-kargo-io ready!

Name Link
🔨 Latest commit f42dca1
🔍 Latest deploy log https://app.netlify.com/projects/docs-kargo-io/deploys/69ea5b040d7859000823a192
😎 Deploy Preview https://deploy-preview-6157.docs.kargo.io
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

Replace the Create→AlreadyExists→Update pattern with server-side apply
(SSA) in the project and serviceaccount management controllers. This
eliminates ~800 HTTP 409 errors per minute observed in clusters with
active Kargo usage, where every reconcile cycle attempted to POST-create
RBAC resources that already existed.

Fixes akuity#6155
@alexmt alexmt force-pushed the fix/rbac-ssa-apply branch from 54c213e to f42dca1 Compare April 23, 2026 17:46
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 23, 2026

Codecov Report

❌ Patch coverage is 92.15686% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.54%. Comparing base (a25f290) to head (f42dca1).
⚠️ Report is 28 commits behind head on main.

Files with missing lines Patch % Lines
pkg/controller/management/projects/projects.go 94.73% 1 Missing and 1 partial ⚠️
...ller/management/serviceaccounts/serviceaccounts.go 84.61% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6157      +/-   ##
==========================================
+ Coverage   57.37%   57.54%   +0.16%     
==========================================
  Files         474      474              
  Lines       49666    40439    -9227     
==========================================
- Hits        28497    23270    -5227     
+ Misses      19785    15785    -4000     
  Partials     1384     1384              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@jessesuen
Copy link
Copy Markdown
Member

jessesuen commented Apr 23, 2026

I have a question. If my understanding is correct, it seems we are continually attempting to update resources (perhaps on every reconciliation) even when it is not necessary (the object is the same). So even if we switch to SSA (which would be a good improvement and reduce API calls), nothing would stop us from continually performing the apply.

I wonder if there is a second improvement we can make that prevents the apply if we know it's not needed.

@krancour krancour self-requested a review April 23, 2026 20:22
@krancour krancour added this to the v1.10.3 milestone Apr 23, 2026
@krancour krancour added kind/bug Something isn't working as intended; If unsure that something IS a bug, start a discussion instead kind/refactor Non-functional changes to implementation details priority/normal This is the priority for most work area/management-controller Affects the controller that manages the Kargo control plane itself labels Apr 23, 2026
@krancour
Copy link
Copy Markdown
Member

@jessesuen I didn't look at this in detail yet, but have a recollection of there having been a guard to avoid unnecessary updates over and over, though it's possible I'm confusing it with another similar spot elsewhere in the code. If it's absent, I agree it should be there.

return fmt.Errorf("could not determine GVK for object: %w", err)
}
obj.GetObjectKind().SetGroupVersionKind(gvks[0])
return r.client.Patch(ctx, obj, client.Apply, client.ForceOwnership, client.FieldOwner("kargo"))
Copy link
Copy Markdown
Contributor

@EronWright EronWright Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One implication of switching to Apply is that the controller will NOT overwrite third-party contributions (of annotations, the RBAC subjects of a given CRB, etc). Looking at the current code, the Update call would've replaced everything. I think the SSA behavior is a good thing, just wanted to call out that the remediation behavior is slightly different.

Also, is the field manager name the same before-after here? It should be, to ensure that a newer version of Kargo replaces any earlier's contribution to a given object.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good difference in behavior to call out.

The purpose of this code, as is reflected in many of the comments on methods in this file, is to ensure the existence of RBAC resources required for the system to operate correctly upon each Project, so...

Overwriting "third-party contributions" was a feature; not a bug. If these are tampered with, things can break.

@krancour
Copy link
Copy Markdown
Member

it's possible I'm confusing it with another similar spot elsewhere in the code

And I was.

@jessesuen
Copy link
Copy Markdown
Member

jessesuen commented Apr 24, 2026

I can confirm through code inspection that we are ensuring RBAC with every Project reconciliation. So while SSA would help prevent the object's resourceVersion from constantly being updated, it would still result in an API call. To eliminate unnecessary API calls entirely, we could check what we have in an informer cache is already what we want and skip the call(s) completely. Even something like a DeepEquals check would help prevent this problem.

@jessesuen
Copy link
Copy Markdown
Member

jessesuen commented Apr 28, 2026

I'm targeting v1.11.0 for the potential switch to SSA, since there is debate if we are okay preserving users' annotations (which will be allowed with SSA), versus clobbering them completely (current behavior). Personally, I do see value in allowing users to annotate the OIDC claims of our generated ServiceAccounts, so they don't need to duplicate the entire set of Role/ServiceAccount/RoleBindings just to have an OIDC claim map to the kargo-admin role. But I do agree that the actual RBAC definitions should not drift from what we want (which might happen if we allow SSA).

To fix the continuous Update calls, I feel there is a safer approach to avoid unnecessary API calls using a DeepEquals check against the RBAC config. This avoids the change in behavior (risk) that comes with SSA, and is something I would like in v1.10.x. I have filed #6193 for that.

@fuskovic
Copy link
Copy Markdown
Member

fuskovic commented Apr 29, 2026

To fix the continuous Update calls, I feel there is a safer approach to avoid unnecessary API calls using a DeepEquals check against the RBAC config. This avoids the change in behavior (risk) that comes with SSA, and is something I would like in v1.10.x. I have filed #6193 for that.

I have the suggested approach implemented here: #6194

@hairyhum
Copy link
Copy Markdown
Contributor

I do see value in allowing users to annotate the OIDC claims of our generated ServiceAccounts, so they don't need to duplicate the entire set of Role/ServiceAccount/RoleBindings just to have an OIDC claim map to the kargo-admin role.

As an off-topic, I wonder what the purpose of that would be?

@jessesuen
Copy link
Copy Markdown
Member

jessesuen commented Apr 30, 2026

As an off-topic, I wonder what the purpose of that would be?

After looking at the current implementation, my concern may be a non-issue. I am concerned about the following scenario:

  1. Kargo vends kargo-admin, kargo-promoter, kargo-viewer SA/Role/RoleBinding
  2. User adds annotations to one of the above ServiceAccounts, in order to map it to OIDC groups/users/etc... This can be done using the UI.
  3. Controller undoes the user's change to the ServiceAccount annotations because it replaces/updates it to the hardwired version.

It turns out that the way ensureDefaultUserRoles() is currently implemented, (3) never happens because it only creates ServiceAccounts but never updates them. As long as we keep that behavior, I am good.

@krancour
Copy link
Copy Markdown
Member

@jessesuen is this something that can be closed because of alternative fixes @fuskovic implemented and merged for the one specific issue this PR was aimed at?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/management-controller Affects the controller that manages the Kargo control plane itself kind/bug Something isn't working as intended; If unsure that something IS a bug, start a discussion instead kind/refactor Non-functional changes to implementation details priority/normal This is the priority for most work

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Management controllers generate excessive 409s by using Create→AlreadyExists→Update instead of server-side apply

6 participants