fix(ci): retry dockhand build-skill on transient GHCR 5xx#517

Merged
JAORMX merged 1 commit into
mainfrom
fix/build-skill-retry-on-push-failure
Apr 20, 2026
Collaborator

@JAORMX JAORMX commented Apr 20, 2026

Summary

Concurrent first-time pushes to brand-new GHCR packages regularly fail on the manifest PUT with `response status code 500: unknown: unknown error`. We observed this clearly once PR #516 unmasked the dockhand error.

Wrap the `dockhand build-skill` invocation in a bounded retry loop (3 attempts, 10s then 20s backoff). A successful retry typically lands on the first extra attempt, and matrix jobs have a 30-minute timeout, so the worst-case added latency is well under budget. Genuine build/validation errors still fail; they just cost ~30s (the 10s + 20s of backoff) before surfacing.

Behavior

  • Each attempt's dockhand output streams through `tee` to the job log (same visibility as PR #516, "fix(ci): surface dockhand errors in Build Skill Artifacts step").
  • Only the last attempt's output is kept in the tempfile used for digest extraction. When the step succeeds, the last attempt is the successful one, so digest extraction is unchanged.
  • If all 3 attempts fail, the step exits 1 with all attempts visible in the job log.
  • The empty-digest path is preserved via `|| true`; downstream steps already gate on `digest != ''`.
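
The behavior listed above could be sketched roughly as follows. This is a minimal illustration, not the workflow's actual script: `retry_with_backoff`, `extract_digest`, `MAX_ATTEMPTS`, and `BACKOFF_BASE_SECONDS` are hypothetical names, and the `sha256:` pattern used for digest extraction is an assumption about what the dockhand output contains.

```shell
#!/usr/bin/env bash
# Without pipefail, `cmd | tee` would report tee's exit status, not cmd's.
set -o pipefail

# Hypothetical helper: run a command up to MAX_ATTEMPTS times (default 3).
# Each attempt's output streams to stdout (the job log) through tee, and
# out_file is truncated on every attempt, so it holds only the last output.
retry_with_backoff() {
  local out_file="$1"; shift
  local max_attempts="${MAX_ATTEMPTS:-3}"
  local base="${BACKOFF_BASE_SECONDS:-10}"
  local attempt=1
  while :; do
    if "$@" 2>&1 | tee "$out_file"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "all ${max_attempts} attempts failed" >&2
      return 1
    fi
    # Linear backoff: 10s after the first failure, 20s after the second.
    sleep $(( base * attempt ))
    attempt=$(( attempt + 1 ))
  done
}

# Hypothetical digest extraction (the output format is an assumption).
# `|| true` keeps the empty-digest path from failing the step.
extract_digest() {
  grep -oE 'sha256:[0-9a-f]{64}' "$1" | tail -n 1 || true
}
```

In the workflow step, the wrapped command would be the existing `dockhand build-skill` invocation with its current arguments, followed by something like `digest="$(extract_digest "$out_file")"`. Making the backoff base a variable is only a convenience for exercising the loop without waiting.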

Test plan

🤖 Generated with Claude Code

When a push to main touches N skill specs, the `build-skill-artifacts`
matrix schedules N concurrent first-time pushes to GHCR, each creating
a brand-new package. GHCR intermittently fails the manifest PUT with
`response status code 500: unknown: unknown error`, and the failure
rate scales with matrix width (observed ~16/17 failures on an 18-skill
batch, ~2/3 on smaller ones). Re-running the same jobs later succeeds
because by then the packages exist.

Wrap the dockhand invocation in a bounded retry loop (3 attempts,
10s/20s linear backoff). The backoff is short on purpose: a successful
retry typically lands on the first extra attempt, and matrix jobs have
a 30-minute timeout so the total ceiling is well under budget. Real
(non-transient) build or validation failures still fail the step —
they just cost ~30s extra before surfacing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

🛡️ Skill Security Scan Results

⚠️ No skills were scanned in this PR.

Contributor

@samuv samuv left a comment

:shipit:

@JAORMX JAORMX merged commit 27fec4a into main Apr 20, 2026
54 checks passed
@JAORMX JAORMX deleted the fix/build-skill-retry-on-push-failure branch April 20, 2026 10:19