Skip to content

queue-runner and builder: fail faster on infrastructure errors#1748

Draft
Ericson2314 wants to merge 5 commits into
NixOS:masterfrom
obsidiansystems:fail-fast
Draft

queue-runner and builder: fail faster on infrastructure errors#1748
Ericson2314 wants to merge 5 commits into
NixOS:masterfrom
obsidiansystems:fail-fast

Conversation

@Ericson2314

@Ericson2314 Ericson2314 commented May 17, 2026

Copy link
Copy Markdown
Member

WIP. See commit for preliminary details --- it is a long run on commit message because it's a bunch of separate edits tacked together, and it will probably become multiple commits once finished up.

I am hoping this will make it easier to fix the remaining issues, especially the ones that are hard to reproduce out of real-world testing

@Ericson2314

Copy link
Copy Markdown
Member Author

@dasJ the second commit of this I would very much like your feedback on. The general rules I am thinking are:

  • destination stores are always best effort (don't die if fail to upload)
  • queue doesn't die on bad builders
  • builders don't die on bad queue runner (but could change that?)
  • queue runner does die on bad PostgreSQL
  • when in doubt ? and (at the top of the call stack) die

I think all the edits here fit that pattern but I definitely didn't finish reviewing them yet.

I am also confused why we had

tracing::error!("Rejecting new machine creation: {e}");
return Err(tonic::...);

in so many places before.

@dasJ

dasJ commented May 19, 2026

Copy link
Copy Markdown
Member

I am fully on board with most of these rules. Not dying is never a bad thing ;)

The only thing I am unsure is the "don't die if fail to upload". Wdym by that? I don't see any unwraps or expects so I assume that's "assume the build succeeded". I don't think that's what we want because it could mean we end up with an incomplete binary cache.

@Ericson2314

Copy link
Copy Markdown
Member Author

I think doing #1755 first will make this more clear.

The only thing I am unsure is the "don't die if fail to upload". Wdym by that?

To be clear, this was something I am keeping! It already is OK with "destination stores" being down, and I while I am not obvious to me that we want that, I didn't feel certain enough that we didn't to change it.

@Ericson2314

Copy link
Copy Markdown
Member Author

I am fully on board with most of these rules. Not dying is never a bad thing ;)

Also to be clear, this PR should generally be making is die more. The counter-intuitive logic of "fail fast" is that not-dying can allow simple problems to fester, only for one to go down with a more complex problem later, and so it ends up being more sysadmin/debubgging work than it would be otherwise.

@Ericson2314 Ericson2314 force-pushed the fail-fast branch 4 times, most recently from e2302ac to f92c786 Compare May 20, 2026 18:48
amaanq and others added 5 commits June 9, 2026 18:22
Every function now returns the narrowest error type that covers its
actual failure modes, rather than a catch-all `anyhow::Error`. This
makes it possible to distinguish infrastructure failures (DB down, store
unreachable) from logic errors (step not found in queue, drv lookup
miss) at the type level.

Among other benefits, this will assist the "fail-fast" PR in deciding
what is fatal vs recoverable, and making sure those decisions are
consistent and enforced.

Error types are co-located with the functions that produce them, using
split `impl State` blocks where needed. The fine-grained sub-enums (e.g.
`ResolutionError`, `StepLookupError`, `MachineLookupError`,
`DrvLookupError`) might seem like overkill in a single 2500-line
`state/mod.rs`, but they are designed so that a future file split (e.g.
`state/resolution.rs`, `state/step_completion.rs`) would give each
sub-module its own natural error type, with `StateLogicError` in
`mod.rs` just re-exporting them. The error hierarchy becomes the module
hierarchy.

Similarly, the builder's per-phase error types (`ImportError`,
`UploadError`, `BuildOutputParseError`, `ResultInfoError`) are each
defined next to the free functions they belong to, and `JobFailure` ties
them together by build phase. A future split into `import.rs`,
`upload.rs`, etc. would be straightforward.

New `error-context` crate provides `WithContext<E>`, a recursive enum
(`Root(E)` | `Context { context, source }`) that adds context layers
without changing the type — a typed replacement for `anyhow::context()`.
`StateError` is a newtype around `WithContext<StateErrorKind>`.

`anyhow` is removed from workspace dependencies entirely. Examples use
it as a non-workspace `[dev-dependencies]` only.
The queue-runner should not try to recover from a down database or
invalid store state — only builder failures should be totally
recoverable since builders are numerous and expected to come and go.

The builder should die if it cannot communicate with the queue-runner —
the queue-runner will notice and retry the step elsewhere.

These are the general principles at play:

- destination stores are always best effort (don't die if fail to upload)

- queue doesn't die on bad builders

- builders don't die on bad queue runner (but could change that?)

- queue runner does die on bad PostgreSQL

- when in doubt `?` and (at the top of the call stack) die

Queue-runner changes:

- Critical task loops (queue monitor, dispatch) now return
  `JoinHandle<Result<()>>` and are joined in the main `select!` loop.
  If either fails, the queue-runner exits with the error.
  Non-critical tasks (config reloader, uploader, fod checker) remain
  abort-only.

- `create_build` returns `Result`; an invalid drv path is a hard error
  (indicates a GC rooting bug) rather than silently aborting the build

- `handle_previous_failure` and `handle_cached_build` errors propagate
  (DB unreachable = stop processing)

- `process_new_builds` returns `Result` and propagates child build
  creation errors

- Queue monitor loop propagates `get_queued_builds`,
  `process_queue_change`, and `handle_jobset_change` errors instead of
  logging and continuing

- `fail_step` handles missing jobs in queue/machine as warnings (race
  with builder disconnect), but DB errors propagate. Callers
  (`remove_machine`, cancelled steps) now use `?` instead of
  log-and-continue.

- `abort_unsupported` propagates `inner_fail_job` DB errors through
  the dispatch loop. Unsupported build step log level downgraded from
  `error` to `warn` (expected operational condition, not a bug).

- Fire-and-forget gRPC handlers (`build_step_update`, `complete_build`)
  use `spawn_fatal` — DB errors panic the task, bringing down the
  queue-runner. Builder parse errors (e.g. bad `BuildOutput`) are
  logged and dropped without crashing.

- Machine tunnel loop: `remove_machine` is called once after the loop
  exits instead of at every break point. Channel errors downgraded
  from `error` to `warn` (builder disconnections are expected).

- gRPC error messages now include the underlying error instead of
  generic strings. Redundant `tracing::error` calls removed where
  `#[tracing::instrument(err)]` already logs the error.

- `parse_uuid` helper deduplicates UUID parsing across gRPC handlers.

- Log directory creation at startup is fatal on failure

- mTLS misconfiguration is a typed `ConfigError`

Builder changes:

- `filter_missing` returns `Result` — daemon connection failure is
  a hard error, not "path is present"

- `handle_request` errors propagate (kills the builder — queue-runner
  will retry)

- Ping message construction failure breaks the ping stream (builder
  exits), with the error captured and returned from the gRPC function

- `submit_build_result` helper deduplicates the retry-then-report
  pattern for `complete_build` calls

- Build task returns `Result<()>` instead of `()` — failure to submit
  results after retries propagates rather than calling
  `process::exit(1)`

- mTLS misconfiguration is a typed `ConfigError`

- Unreachable error paths in log stream (`utils.rs`) now panic with
  explanatory comments

- `substitute_output` S3 replication uses async-block-as-try-block
  pattern for clean early exit on daemon error
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants