
Tune log levels for production observability #2215

Open
rejuvenile wants to merge 11 commits into TraceMachina:main from rejuvenile:pr/11-log-levels

rejuvenile commented Mar 12, 2026

Summary

  • Promote key operational summaries to info!: action phase timing, download completion, FindMissingBlobs/GetTree summaries, command timeouts, background CAS upload completion, stall recovery, BlobsAvailable reports, scheduler stall detection
  • Keep per-blob/per-RPC detail at debug!: ByteStream per-blob, BatchRead/Update per-blob, WorkerProxyStore per-blob redirect/race, per-command execute/complete, internal phase logs
  • Promote BatchReadBlobs batch/retry failures to warn!
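The split described above (one summary at info, per-blob detail at debug) can be sketched roughly as follows. This is a minimal, std-only illustration and not the code in this PR: `Level`, `Record`, `BlobCheck`, `log_find_missing`, and `visible_at` are hypothetical names, and the real change presumably uses the `info!`/`debug!` macros directly instead of collecting records.

```rust
/// Severity levels, declared from most to least verbose so that the
/// derived ordering gives Debug < Info < Warn.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Level {
    Debug,
    Info,
    Warn,
}

/// Hypothetical log record: a level plus an already rendered message.
#[derive(Debug, Clone, PartialEq)]
struct Record {
    level: Level,
    message: String,
}

/// Hypothetical result of checking one blob in a FindMissingBlobs-style call.
struct BlobCheck {
    digest: String,
    size_bytes: u64,
    missing: bool,
}

/// Produces one debug record per blob and a single info summary for the batch.
fn log_find_missing(checks: &[BlobCheck]) -> Vec<Record> {
    let mut records = Vec::with_capacity(checks.len() + 1);
    for check in checks {
        records.push(Record {
            level: Level::Debug,
            message: format!(
                "blob {} size={} missing={}",
                check.digest, check.size_bytes, check.missing
            ),
        });
    }
    let missing = checks.iter().filter(|c| c.missing).count();
    let total_bytes: u64 = checks.iter().map(|c| c.size_bytes).sum();
    records.push(Record {
        level: Level::Info,
        message: format!(
            "FindMissingBlobs: checked={} missing={} total_bytes={}",
            checks.len(),
            missing,
            total_bytes
        ),
    });
    records
}

/// Keeps only the records at or above the configured threshold.
fn visible_at(records: &[Record], threshold: Level) -> Vec<&Record> {
    records.iter().filter(|r| r.level >= threshold).collect()
}
```

With the threshold set to `Level::Info`, a batch of any size yields exactly one visible line, which is the behavior the third test-plan item checks for in production logs.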

Test plan

  • cargo check passes
  • CI checks pass
  • Production logs show only actionable info! messages, not per-blob noise

Stack: 11/11 — depends on #2214

🤖 Generated with Claude Code



rejuvenile and others added 11 commits March 11, 2026 16:54
Update tonic, prost, and other dependencies to latest versions.
Regenerate protobuf bindings. Add aws-lc-rs and rayon support
to digest hasher. Update BUILD.bazel files for LRE toolchain.

Co-Authored-By: Claude <noreply@anthropic.com>
Rewrite existence cache to prevent stale positives by bypassing cache
on update and cleaning entries on NotFound. Fix BatchUpdateBlobs
duplicate digest handling. Add POSIX_FADV_SEQUENTIAL for read-ahead.
Pre-set CAS files to 0o555 to avoid redundant chmod on hardlink.
Fix LRU eviction ordering at startup by sorting files by atime.
Add stall detector for store operations. Replace async Mutex with
parking_lot in EvictingMap. Increase gRPC connections_per_endpoint
default to 32.

Co-Authored-By: Claude <noreply@anthropic.com>
Fix ByteStream protocol compliance for large blobs. Add gRPC error
details with proper status codes. Add max_total_batch_size config
for BatchReadBlobs/BatchUpdateBlobs. Improve capabilities server
to report actual supported features. Add TLS improvements and
server startup configuration options. Downgrade per-request
transfer logs to debug level.

Co-Authored-By: Claude <noreply@anthropic.com>
Add load-aware worker selection with CPU load tracking. Implement
directory cache and subtree coverage scoring for locality-aware
scheduling. Add stall detection for queued actions. Support
SIGKILL retry on worker timeout. Handle FAILED_PRECONDITION for
missing inputs. Add fair round-robin dispatch via LRU promotion.
Downgrade per-action dispatch logs to debug level.

Co-Authored-By: Claude <noreply@anthropic.com>
Implement batch input fetching using GetTree + has_with_results +
BatchReadBlobs + ByteStream for concurrent blob downloads. Add
hardlink retry on ENOENT (eviction during link). Implement phase
timing for action lifecycle. Add upload timeout with stall
detection. Support peer blob hints from scheduler. Downgrade
per-blob fetch/upload logs to debug level.

Co-Authored-By: Claude <noreply@anthropic.com>
Implement WorkerProxyStore for worker-to-worker blob transfers
with locality-aware routing. Add BlobLocalityMap for tracking
which workers have which blobs. Add BlobsAvailable RPC for workers
to report cached digests. Support redirect-based peer discovery
and racing peer vs server downloads. Add inner_store() to all
store types. Update all config examples with store_type field.
Add integration tests for peer sharing.

Co-Authored-By: Claude <noreply@anthropic.com>
Implement DirectoryCache that caches complete directory trees on
disk for reuse across actions. Support subtree matching for partial
cache hits. Add cache versioning for format changes. Fix EPERM on
shell scripts by preserving file permissions during hardlink.
Handle zero-byte files and CAS inode corruption during cleanup.
Fall back to download when cached subtree is evicted.

Co-Authored-By: Claude <noreply@anthropic.com>
Rewrite hardlink_directory_tree, set_readonly_recursive, and
calculate_directory_size from async recursive to single spawn_blocking
with sync std::fs, eliminating ~5,550 async task transitions per large
tree. Add combined set_readonly_and_calculate_size function. Add macOS
clonefile(2) support in fs_util.

In directory_cache: parallel subtree clones with tokio::join!, skip
set_readonly on macOS (CoW makes it unnecessary), populate_fast_store_
unchecked for batch-checked blobs, page cache warming, zero-byte file
handling, format version bumps, EPERM fixes, subtree race fallback
with eviction recovery, and cfg-gate fixes.

Demote per-construction internal detail logs to debug! while keeping
cache HIT/MISS summaries and startup messages at info!.

Add macOS F_RDADVISE in advise_sequential for file-level readahead.
Add advise_willneed method with 3-way cfg (F_RDADVISE on macOS,
POSIX_FADV_WILLNEED on Linux, no-op elsewhere). Add micro-prefetch
in read_file_to_channel read loop. Thread start_offset parameter
through read_file_to_channel callers.

Add subtree_files tracking alongside byte size for more accurate
cache coverage estimation. Introduce PER_FILE_WEIGHT constant for
blended scoring in coverage winner selection. Demote per-dispatch
scheduling logs to debug! while keeping quarantine recovery at info!.

Promote key operational summaries to info!:
- Action phase timing (queue/fetch/execute/upload durations)
- download_to_directory completion summary
- FindMissingBlobs and GetTree RPC summaries
- Command timeout events
- upload_to_remote background CAS upload completion
- Stall recovery events
- BlobsAvailable periodic reports
- Worker connection removal
- Scheduler stall detection oldest-actions report
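As an illustration of the first item in the list above, a per-action phase timing summary could be rendered as a single line, so that one info-level event covers the whole action. This is a sketch under assumptions: `PhaseTimings` and its methods are hypothetical names, and only the four phase names (queue, fetch, execute, upload) come from the commit message.

```rust
use std::time::Duration;

/// Hypothetical per-action timings for the four phases named in the
/// commit message.
#[derive(Debug, Default, Clone, Copy)]
struct PhaseTimings {
    queue: Duration,
    fetch: Duration,
    execute: Duration,
    upload: Duration,
}

impl PhaseTimings {
    /// Sum of all four phases.
    fn total(&self) -> Duration {
        self.queue + self.fetch + self.execute + self.upload
    }

    /// Renders one summary line, suitable for a single info-level event.
    fn summary(&self, action_id: &str) -> String {
        format!(
            "action {} phases: queue={}ms fetch={}ms execute={}ms upload={}ms total={}ms",
            action_id,
            self.queue.as_millis(),
            self.fetch.as_millis(),
            self.execute.as_millis(),
            self.upload.as_millis(),
            self.total().as_millis()
        )
    }
}
```

Emitting the durations together keeps the info stream at one line per action, while finer-grained events inside each phase can stay at debug.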

Keep per-blob/per-RPC detail at debug!:
- ByteStream read/write per-blob completion
- BatchReadBlobs/BatchUpdateBlobs per-blob transfers
- WorkerProxyStore per-blob redirect/race logs
- Per-command execute/complete with full args
- download_to_directory internal phase logs
- resolve_directory_tree internal steps
- Per-blob upload_file, stdout/stderr uploads

Promote BatchReadBlobs batch/retry failures to warn!.
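The promotion of batch/retry failures to warn! can be pictured with a small retry helper that records one warning per failed attempt. This is a hedged, std-only sketch: `retry_batch` is a hypothetical function, the message format is invented for illustration, and the actual retry logic in this PR may differ.

```rust
use std::fmt::Display;

/// Retries `op` up to `max_attempts` times. Each failure appends one
/// warn-style line to `warnings`; the first success is returned.
fn retry_batch<T, E: Display>(
    max_attempts: u32,
    mut op: impl FnMut(u32) -> Result<T, E>,
    warnings: &mut Vec<String>,
) -> Option<T> {
    for attempt in 1..=max_attempts {
        match op(attempt) {
            Ok(value) => return Some(value),
            // In real code this is where a warn-level event would be emitted.
            Err(err) => warnings.push(format!(
                "WARN BatchReadBlobs attempt {}/{} failed: {}",
                attempt, max_attempts, err
            )),
        }
    }
    None
}
```

Failures are reported at a level that survives a production filter, while a successful batch adds nothing to the warning stream.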

CLAassistant commented Mar 12, 2026

CLA assistant check
All committers have signed the CLA.

2 participants