Skip to content

Classify OCI pull errors and fall back to git for skills content API#5014

Merged
samuv merged 5 commits into
mainfrom
content-fallback
Apr 23, 2026
Merged

Classify OCI pull errors and fall back to git for skills content API#5014
samuv merged 5 commits into
mainfrom
content-fallback

Conversation

@samuv
Copy link
Copy Markdown
Contributor

@samuv samuv commented Apr 22, 2026

Summary

  • Follow-up on the flat 502 behaviour from Return 502 Bad Gateway when OCI skill pull fails #4956: a content fetch that fails upstream now tells the caller why (auth vs not-found vs timeout vs network), and — when the registry entry declares both an OCI and a git package — transparently falls back to git instead of surfacing the failure at all.
  • Motivating case: GET /api/v1beta/skills/content?ref=ghcr.io%2Fstacklok%2Fdockyard%2Fskills%2Fyara-rule-authoring%3A0.1.0 returned a bare 502 Bad Gateway with no body, even though the catalog entry for yara-rule-authoring also lists a git package pinned to a commit on a public repo that would have served the same content.
  • Swagger annotations on POST /api/v1beta/skills and GET /api/v1beta/skills/content now advertise 401, 404, 429, and 504 alongside the existing codes, and docs/server/{docs.go,swagger.json,swagger.yaml} are regenerated to match.
  • Four independent commits, each shipped in isolation so they can be reviewed / reverted individually.

Fix A — surface upstream 5xx error messages

pkg/api/errors/handler.go previously collapsed every 5xx response body to http.StatusText(code). That's right for 500 (may wrap DB connection strings, container-runtime internals, etc.) but hides actionable detail for 502/503/504 — those are explicitly upstream failures the caller can act on.

Branch the 5xx path: keep 500 generic, return err.Error() for 502/503/504. Server-side slog/OTel/Sentry reporting is unchanged.

Fix B — classify oras-go pull errors

Registry.Pull failures in both getContentFromOCI and installFromOCI were mapped to a flat 502, so a 401 from GHCR looked the same as a network outage or a timeout. A new classifyPullError helper (pkg/skills/skillsvc/pull_errors.go) maps:

oras-go error HTTP status
context.DeadlineExceeded / context.Canceled 504 Gateway Timeout
*errcode.ErrorResponse 401 / 403 401 Unauthorized
*errcode.ErrorResponse 404 404 Not Found
*errcode.ErrorResponse 429 429 Too Many Requests
*errcode.ErrorResponse other 4xx / 5xx 502 Bad Gateway
errdef.ErrNotFound 404 Not Found
fallback 502 Bad Gateway

4xx classifications flow through the unchanged 4xx handler path, so auth / not-found errors surface verbatim without relying on the 5xx body change in Fix A.

Fix C — registry-declared git fallback

GetContent short-circuited whenever the ref was "unambiguous OCI" (digest / explicit tag / multi-segment path), which is why the YARA case above never tried the git package declared on the same registry entry.

A new resolveGitFallbackForOCIRef in registry.go:

  • Derives a search term from the ref's tail path segment (e.g. yara-rule-authoring).
  • Calls the existing SkillLookup.SearchSkills (no interface change).
  • Post-filters matches to entries whose OCI package identifier parses to the same repository path — tag/digest agnostic, so :0.1.0 and :latest match the same entry.
  • Requires exactly one match; ambiguous catalog matches are deliberately skipped (refusing to guess beats silently serving the wrong skill).
  • Returns a pinned git://host/owner/repo[@commit][#subfolder] URL.

GetContent now tries this fallback before returning the OCI error for unambiguous OCI refs. The existing name-based registry resolution for ambiguous refs (plain name / single-segment path) is kept intact.

Flow after the change

flowchart TD
    start[GET /content?ref] --> classify{ref scheme}
    classify -->|git:// or https://| git[getContentFromGit]
    classify -->|OCI| oci[getContentFromOCI]
    oci -->|success| ok[200 with content]
    oci -->|failure| classifyErr[classifyPullError]
    classifyErr --> lookup[resolveGitFallbackForOCIRef]
    lookup -->|unique git package| gitFallback[getContentFromGit]
    gitFallback -->|success| ok
    gitFallback -->|failure| err["return wrapped error (401/404/429/502/504)"]
    lookup -->|no fallback| err
    classify -->|plain name| nameLookup[resolveFromRegistry]
    nameLookup --> oci
    nameLookup --> git
Loading

Type of change

  • Bug fix (observability of content-preview failures)
  • New functionality (automatic git fallback when the OCI pull fails)
  • Documentation (OpenAPI spec updated for the new status codes)

Test plan

  • go test ./pkg/skills/skillsvc/... -count=1
  • go test ./pkg/api/errors/... -count=1
  • go build ./...
  • swag init -g pkg/api/server.go --v3.1 -o docs/server --parseDependencyLevel 1 regenerated cleanly.
  • New TestClassifyPullError table-driven cases cover every branch of the classifier (context errors, each registry status code, wrapped errors, generic fallback).
  • New TestErrorHandler subtests assert 500 stays generic while 502/503/504 now include err.Error().
  • New TestGetContent subtests cover: fallback succeeds with pinned commit; fallback tolerates a different OCI version tag (:v9 caller vs :0.1.0 catalog); no git package → original OCI error; ambiguous catalog matches → no fallback; OCI success → registry lookup never invoked (verified via gomock with no SearchSkills expectation); registry lookup error → treated as no-fallback.

Changes

File Change
pkg/api/errors/handler.go isUpstreamStatus helper; 502/503/504 responses now include err.Error()
pkg/api/errors/handler_test.go Three new subtests for 502/503/504; 500-generic assertion kept
pkg/skills/skillsvc/pull_errors.go NewclassifyPullError(err) int
pkg/skills/skillsvc/pull_errors_test.go New — table-driven classifier tests
pkg/skills/skillsvc/content.go Wire classifyPullError at the registry.Pull site; dispatcher tries resolveGitFallbackForOCIRef before returning the OCI error for unambiguous OCI refs
pkg/skills/skillsvc/install_oci.go Reuse classifyPullError at the install pull site
pkg/skills/skillsvc/registry.go resolveGitFallbackForOCIRef + canonicalOCIRepo / skillHasMatchingOCIRepo / firstGitPackageURL / preferredGitRef helpers
pkg/skills/skillsvc/content_test.go Six new subtests covering git-fallback success / version-tag tolerance / no-git / ambiguous / OCI-success-skips-lookup / lookup-error cases
pkg/api/v1/skills.go, docs/server/{docs.go,swagger.json,swagger.yaml} Swagger @Failure annotations for the new 401/404/429/504 responses on installSkill and getSkillContent; regenerated OpenAPI spec

Does this introduce a user-facing change?

Yes — two, both additive:

  • Sharper error responses. Clients calling GET /api/v1beta/skills/content or POST /api/v1beta/skills/install that previously saw 502 Bad Gateway with an empty body now see 401 Unauthorized / 404 Not Found / 429 Too Many Requests / 504 Gateway Timeout where appropriate, and the 502 body now includes the wrapped error for debugging.
  • Automatic git fallback. A content-preview request for an OCI ref whose pull fails will transparently serve the content from the registry-declared git package when there is exactly one unambiguous match. An ambiguous catalog (two entries for the same repo tail) still returns the original OCI error.

@samuv samuv self-assigned this Apr 22, 2026
@samuv samuv requested review from JAORMX and amirejaz as code owners April 22, 2026 16:14
@github-actions github-actions Bot added the size/L Large PR: 600-999 lines changed label Apr 22, 2026
samuv added a commit that referenced this pull request Apr 22, 2026
Two issues reported by CI on #5014:

- revive (content.go): drop the else after return in the git-fallback block and hoist the short declaration onto its own line.

- unparam (registry.go): resolveGitFallbackForOCIRef never returns a non-nil error (search errors are swallowed as "no fallback available"), so narrow the signature to return just string and update the caller.
@github-actions github-actions Bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels Apr 22, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 22, 2026

Codecov Report

❌ Patch coverage is 83.76068% with 19 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.09%. Comparing base (b5d1cf3) to head (a44fcca).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
pkg/skills/skillsvc/registry.go 76.19% 9 Missing and 6 partials ⚠️
pkg/skills/skillsvc/content.go 75.00% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5014      +/-   ##
==========================================
- Coverage   69.11%   69.09%   -0.03%     
==========================================
  Files         554      555       +1     
  Lines       73176    73288     +112     
==========================================
+ Hits        50574    50636      +62     
- Misses      19589    19633      +44     
- Partials     3013     3019       +6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

samuv added 5 commits April 23, 2026 10:21
Today every 5xx response body collapses to http.StatusText(code), which is appropriate for 500 (may wrap DB drivers, container runtimes, connection strings) but hides actionable detail for 502/503/504 — which are explicitly about upstream failures the caller can act on.

Branch the 5xx path: keep 500 generic, include err.Error() in the body for 502/503/504. Server-side slog/OTel/Sentry reporting is unchanged.
Registry.Pull failures in getContentFromOCI and installFromOCI were all mapped to a flat 502 Bad Gateway, so a 401 from GHCR, a missing manifest, a 429, and a network outage all looked identical to the caller.

Add classifyPullError to inspect oras-go errors: context.DeadlineExceeded/Canceled → 504; *errcode.ErrorResponse 401/403 → 401, 404 → 404, 429 → 429, other 4xx/5xx → 502; errdef.ErrNotFound → 404; fallback → 502.

Wire into content.go and install_oci.go at the Pull call sites. 4xx classifications flow through the unchanged 4xx handler path so auth/not-found errors surface verbatim without relying on the 5xx body change.
A content-preview request for an OCI ref that also appears in the skill registry with a git package was returned as a flat error as soon as the OCI pull failed, even though the catalog declared a perfectly good git mirror alongside the OCI artifact.

Add resolveGitFallbackForOCIRef in registry.go. It searches the catalog by the refs tail path segment, keeps entries whose OCI package identifier parses to the same repository path (tag/digest agnostic, so :0.1.0 and :latest match), and returns the single matching entry's pinned git:// URL. Ambiguous matches are deliberately skipped — refusing to guess is better than silently serving the wrong skill.

Wire into GetContent: for an unambiguous OCI ref whose remote pull fails, try the git fallback before returning the OCI error. The existing name-based registry resolution for ambiguous refs is kept intact.
Fixes A and B of this series widened the set of HTTP status codes both endpoints can return: 401 (registry auth), 404 (artifact missing), 429 (registry rate limit), and 504 (upstream pull timeout) on top of the existing 400/500/502.

Add @failure annotations for the new codes on installSkill and getSkillContent in pkg/api/v1/skills.go and regenerate docs/server/{docs.go,swagger.json,swagger.yaml} via `swag init -g pkg/api/server.go --v3.1 -o docs/server --parseDependencyLevel 1`.
Two issues reported by CI on #5014:

- revive (content.go): drop the else after return in the git-fallback block and hoist the short declaration onto its own line.

- unparam (registry.go): resolveGitFallbackForOCIRef never returns a non-nil error (search errors are swallowed as "no fallback available"), so narrow the signature to return just string and update the caller.
@samuv samuv force-pushed the content-fallback branch from 416e4da to a44fcca Compare April 23, 2026 08:21
@github-actions github-actions Bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels Apr 23, 2026
@samuv samuv merged commit 72ef835 into main Apr 23, 2026
42 checks passed
@samuv samuv deleted the content-fallback branch April 23, 2026 08:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/L Large PR: 600-999 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants