queue-runner and builder: fail faster on infrastructure errors#1748
queue-runner and builder: fail faster on infrastructure errors#1748Ericson2314 wants to merge 5 commits into
Conversation
77dc98b to
706238a
Compare
e61875b to
9fab2e1
Compare
|
@dasJ the second commit of this I would very much like your feedback on. The general rules I am thinking are:
I think all the edits here fit that pattern but I definitely didn't finish reviewing them yet. I am also confused why we had in so many places before. |
|
I am fully on board with most of these rules. Not dying is never a bad thing ;) The only thing I am unsure is the "don't die if fail to upload". Wdym by that? I don't see any unwraps or expects so I assume that's "assume the build succeeded". I don't think that's what we want because it could mean we end up with an incomplete binary cache. |
|
I think doing #1755 first will make this more clear.
To be clear, this was something I am keeping! It already is OK with "destination stores" being down, and I while I am not obvious to me that we want that, I didn't feel certain enough that we didn't to change it. |
Also to be clear, this PR should generally be making is die more. The counter-intuitive logic of "fail fast" is that not-dying can allow simple problems to fester, only for one to go down with a more complex problem later, and so it ends up being more sysadmin/debubgging work than it would be otherwise. |
e2302ac to
f92c786
Compare
Assisted-by: GNU sed 4.9
Every function now returns the narrowest error type that covers its
actual failure modes, rather than a catch-all `anyhow::Error`. This
makes it possible to distinguish infrastructure failures (DB down, store
unreachable) from logic errors (step not found in queue, drv lookup
miss) at the type level.
Among other benefits, this will assist the "fail-fast" PR in deciding
what is fatal vs recoverable, and making sure those decisions are
consistent and enforced.
Error types are co-located with the functions that produce them, using
split `impl State` blocks where needed. The fine-grained sub-enums (e.g.
`ResolutionError`, `StepLookupError`, `MachineLookupError`,
`DrvLookupError`) might seem like overkill in a single 2500-line
`state/mod.rs`, but they are designed so that a future file split (e.g.
`state/resolution.rs`, `state/step_completion.rs`) would give each
sub-module its own natural error type, with `StateLogicError` in
`mod.rs` just re-exporting them. The error hierarchy becomes the module
hierarchy.
Similarly, the builder's per-phase error types (`ImportError`,
`UploadError`, `BuildOutputParseError`, `ResultInfoError`) are each
defined next to the free functions they belong to, and `JobFailure` ties
them together by build phase. A future split into `import.rs`,
`upload.rs`, etc. would be straightforward.
New `error-context` crate provides `WithContext<E>`, a recursive enum
(`Root(E)` | `Context { context, source }`) that adds context layers
without changing the type — a typed replacement for `anyhow::context()`.
`StateError` is a newtype around `WithContext<StateErrorKind>`.
`anyhow` is removed from workspace dependencies entirely. Examples use
it as a non-workspace `[dev-dependencies]` only.
The queue-runner should not try to recover from a down database or invalid store state — only builder failures should be totally recoverable since builders are numerous and expected to come and go. The builder should die if it cannot communicate with the queue-runner — the queue-runner will notice and retry the step elsewhere. These are the general principles at play: - destination stores are always best effort (don't die if fail to upload) - queue doesn't die on bad builders - builders don't die on bad queue runner (but could change that?) - queue runner does die on bad PostgreSQL - when in doubt `?` and (at the top of the call stack) die Queue-runner changes: - Critical task loops (queue monitor, dispatch) now return `JoinHandle<Result<()>>` and are joined in the main `select!` loop. If either fails, the queue-runner exits with the error. Non-critical tasks (config reloader, uploader, fod checker) remain abort-only. - `create_build` returns `Result`; an invalid drv path is a hard error (indicates a GC rooting bug) rather than silently aborting the build - `handle_previous_failure` and `handle_cached_build` errors propagate (DB unreachable = stop processing) - `process_new_builds` returns `Result` and propagates child build creation errors - Queue monitor loop propagates `get_queued_builds`, `process_queue_change`, and `handle_jobset_change` errors instead of logging and continuing - `fail_step` handles missing jobs in queue/machine as warnings (race with builder disconnect), but DB errors propagate. Callers (`remove_machine`, cancelled steps) now use `?` instead of log-and-continue. - `abort_unsupported` propagates `inner_fail_job` DB errors through the dispatch loop. Unsupported build step log level downgraded from `error` to `warn` (expected operational condition, not a bug). - Fire-and-forget gRPC handlers (`build_step_update`, `complete_build`) use `spawn_fatal` — DB errors panic the task, bringing down the queue-runner. Builder parse errors (e.g. bad `BuildOutput`) are logged and dropped without crashing. - Machine tunnel loop: `remove_machine` is called once after the loop exits instead of at every break point. Channel errors downgraded from `error` to `warn` (builder disconnections are expected). - gRPC error messages now include the underlying error instead of generic strings. Redundant `tracing::error` calls removed where `#[tracing::instrument(err)]` already logs the error. - `parse_uuid` helper deduplicates UUID parsing across gRPC handlers. - Log directory creation at startup is fatal on failure - mTLS misconfiguration is a typed `ConfigError` Builder changes: - `filter_missing` returns `Result` — daemon connection failure is a hard error, not "path is present" - `handle_request` errors propagate (kills the builder — queue-runner will retry) - Ping message construction failure breaks the ping stream (builder exits), with the error captured and returned from the gRPC function - `submit_build_result` helper deduplicates the retry-then-report pattern for `complete_build` calls - Build task returns `Result<()>` instead of `()` — failure to submit results after retries propagates rather than calling `process::exit(1)` - mTLS misconfiguration is a typed `ConfigError` - Unreachable error paths in log stream (`utils.rs`) now panic with explanatory comments - `substitute_output` S3 replication uses async-block-as-try-block pattern for clean early exit on daemon error
WIP. See commit for preliminary details --- it is a long run on commit message because it's a bunch of separate edits tacked together, and it will probably become multiple commits once finished up.
I am hoping this will make it easier to fix the remaining issues, especially the ones that are hard to reproduce out of real-world testing