
Improve worker execution pipeline with batch fetch and upload #2209

Open
rejuvenile wants to merge 5 commits into TraceMachina:main from rejuvenile:pr/05-worker-pipeline

rejuvenile commented Mar 12, 2026

Summary

  • Batch input fetch: GetTree + has_with_results + BatchReadBlobs + ByteStream pipeline
  • Pipelined fetch+hardlink with concurrent producer/consumer
  • Parallel output uploads with background CAS upload
  • Action phase timing summary logging
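The "pipelined fetch+hardlink" item can be illustrated with a minimal producer/consumer sketch. This is not the code from this PR: `Blob`, `pipelined_fetch_and_link`, the channel depth of 4, and the use of OS threads instead of async tasks are all assumptions chosen to keep the example standard-library only. It shows a producer fetching blobs while the consumer links the ones that have already arrived.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for a fetched CAS blob.
#[derive(Debug, Clone, PartialEq)]
pub struct Blob {
    pub digest: String,
    pub data: Vec<u8>,
}

/// Runs `fetch` on a producer thread and `link` on the calling thread, so
/// linking blob N can overlap with fetching blob N+1.
/// Returns the digests in the order they were linked.
pub fn pipelined_fetch_and_link<F, L>(digests: Vec<String>, fetch: F, mut link: L) -> Vec<String>
where
    F: Fn(&str) -> Vec<u8> + Send + 'static,
    L: FnMut(&Blob),
{
    // Bounded channel (depth is an assumption): the producer blocks once
    // four fetched blobs are waiting, which caps memory held in flight.
    let (tx, rx) = mpsc::sync_channel::<Blob>(4);

    let producer = thread::spawn(move || {
        for digest in digests {
            let data = fetch(&digest);
            if tx.send(Blob { digest, data }).is_err() {
                break; // The consumer went away; stop fetching.
            }
        }
        // `tx` is dropped here, which closes the channel.
    });

    let mut linked = Vec::new();
    // The loop ends when the producer drops its sender.
    for blob in rx {
        link(&blob);
        linked.push(blob.digest);
    }
    producer.join().expect("producer thread panicked");
    linked
}
```

In this sketch `fetch` and `link` are plain closures, so a caller can plug in whatever download and hardlink steps it has; a single producer is used only to keep ordering deterministic.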

Test plan

  • `cargo check` passes
  • CI checks pass

Stack: 5/11 — depends on #2208

🤖 Generated with Claude Code



rejuvenile and others added 5 commits March 11, 2026 16:54
Update tonic, prost, and other dependencies to latest versions.
Regenerate protobuf bindings. Add aws-lc-rs and rayon support
to digest hasher. Update BUILD.bazel files for LRE toolchain.

Co-Authored-By: Claude <noreply@anthropic.com>
Rewrite existence cache to prevent stale positives by bypassing cache
on update and cleaning entries on NotFound. Fix BatchUpdateBlobs
duplicate digest handling. Add POSIX_FADV_SEQUENTIAL for read-ahead.
Pre-set CAS files to 0o555 to avoid redundant chmod on hardlink.
Fix LRU eviction ordering at startup by sorting files by atime.
Add stall detector for store operations. Replace async Mutex with
parking_lot in EvictingMap. Increase gRPC connections_per_endpoint
default to 32.

Co-Authored-By: Claude <noreply@anthropic.com>
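To make the "stale positives" point in the commit message above concrete, here is a minimal sketch of an existence cache that always writes through on update and drops its entry when a read finds the blob missing. The `Backend` trait, `MemoryBackend`, and `ExistenceCache` shown here are hypothetical illustrations, not the project's actual types.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical backing-store interface (an assumption for this sketch).
pub trait Backend {
    fn has(&self, digest: &str) -> bool;
    fn get(&self, digest: &str) -> Option<Vec<u8>>;
    fn update(&mut self, digest: &str, data: Vec<u8>);
}

/// In-memory backend with an `evict` hook to simulate eviction.
#[derive(Default)]
pub struct MemoryBackend {
    blobs: HashMap<String, Vec<u8>>,
}

impl MemoryBackend {
    pub fn evict(&mut self, digest: &str) {
        self.blobs.remove(digest);
    }
}

impl Backend for MemoryBackend {
    fn has(&self, digest: &str) -> bool {
        self.blobs.contains_key(digest)
    }
    fn get(&self, digest: &str) -> Option<Vec<u8>> {
        self.blobs.get(digest).cloned()
    }
    fn update(&mut self, digest: &str, data: Vec<u8>) {
        self.blobs.insert(digest.to_string(), data);
    }
}

pub struct ExistenceCache<B: Backend> {
    backend: B,
    known_present: HashSet<String>,
}

impl<B: Backend> ExistenceCache<B> {
    pub fn new(backend: B) -> Self {
        Self { backend, known_present: HashSet::new() }
    }

    pub fn backend_mut(&mut self) -> &mut B {
        &mut self.backend
    }

    /// Answers from the cache when possible; only positives are cached.
    pub fn has(&mut self, digest: &str) -> bool {
        if self.known_present.contains(digest) {
            return true;
        }
        let present = self.backend.has(digest);
        if present {
            self.known_present.insert(digest.to_string());
        }
        present
    }

    /// Bypasses the cache: a cached "present" never skips the write.
    pub fn update(&mut self, digest: &str, data: Vec<u8>) {
        self.backend.update(digest, data);
        self.known_present.insert(digest.to_string());
    }

    /// A miss in the backend removes the (now stale) cache entry.
    pub fn get(&mut self, digest: &str) -> Option<Vec<u8>> {
        let result = self.backend.get(digest);
        if result.is_none() {
            self.known_present.remove(digest);
        }
        result
    }
}
```

In this sketch a stale positive can still be returned by `has` between an eviction and the next failed `get`; the cleanup on NotFound only guarantees that the entry does not outlive the first read that observes the miss.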
Fix ByteStream protocol compliance for large blobs. Add gRPC error
details with proper status codes. Add max_total_batch_size config
for BatchReadBlobs/BatchUpdateBlobs. Improve capabilities server
to report actual supported features. Add TLS improvements and
server startup configuration options. Downgrade per-request
transfer logs to debug level.

Co-Authored-By: Claude <noreply@anthropic.com>
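One way to picture how a `max_total_batch_size` limit could shape BatchReadBlobs requests is the planner sketched below. The commit message only names the config option; the greedy grouping, the `DigestInfo` and `FetchPlan` types, and the rule that oversized blobs fall back to ByteStream are assumptions made for illustration. A real implementation would also need to account for per-message protobuf overhead, which this sketch ignores.

```rust
/// Hypothetical digest record: hash plus blob size in bytes.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DigestInfo {
    pub hash: String,
    pub size_bytes: u64,
}

/// Result of planning: small blobs grouped into batches, large ones streamed.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct FetchPlan {
    pub batches: Vec<Vec<DigestInfo>>,
    pub streamed: Vec<DigestInfo>,
}

/// Greedily packs digests, in input order, into batches whose summed blob
/// sizes stay at or below `max_total_batch_size`. Blobs larger than the
/// limit are routed to the streaming list instead.
pub fn plan_fetch(digests: &[DigestInfo], max_total_batch_size: u64) -> FetchPlan {
    let mut plan = FetchPlan::default();
    let mut current: Vec<DigestInfo> = Vec::new();
    let mut current_bytes: u64 = 0;

    for digest in digests {
        if digest.size_bytes > max_total_batch_size {
            // Too large for any batch: assumed to go through ByteStream.
            plan.streamed.push(digest.clone());
            continue;
        }
        let would_be = current_bytes.saturating_add(digest.size_bytes);
        if would_be > max_total_batch_size && !current.is_empty() {
            // Close the current batch and start a new one.
            plan.batches.push(std::mem::take(&mut current));
            current_bytes = 0;
        }
        current_bytes += digest.size_bytes;
        current.push(digest.clone());
    }
    if !current.is_empty() {
        plan.batches.push(current);
    }
    plan
}
```

Greedy packing in input order is the simplest choice; it does not minimize the number of batches, but it keeps the plan deterministic and easy to reason about.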
Add load-aware worker selection with CPU load tracking. Implement
directory cache and subtree coverage scoring for locality-aware
scheduling. Add stall detection for queued actions. Support
SIGKILL retry on worker timeout. Handle FAILED_PRECONDITION for
missing inputs. Add fair round-robin dispatch via LRU promotion.
Downgrade per-action dispatch logs to debug level.

Co-Authored-By: Claude <noreply@anthropic.com>
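The combination of load-aware selection and "fair round-robin dispatch via LRU promotion" described in the commit message above might look roughly like the sketch below. The `Worker` and `Dispatcher` types, the single `cpu_load_pct` metric, and the "lowest load wins, ties go to the least recently used worker" rule are assumptions for illustration; the scheduler's actual scoring also involves locality, which is omitted here.

```rust
use std::collections::VecDeque;

/// Hypothetical worker record with a reported CPU load percentage.
#[derive(Debug, Clone, PartialEq)]
pub struct Worker {
    pub id: String,
    pub cpu_load_pct: u32,
}

/// Keeps workers in LRU order: front = least recently dispatched to.
pub struct Dispatcher {
    workers: VecDeque<Worker>,
}

impl Dispatcher {
    pub fn new(workers: Vec<Worker>) -> Self {
        Self { workers: workers.into() }
    }

    /// Records a new load sample for a worker, if it is known.
    pub fn set_load(&mut self, id: &str, cpu_load_pct: u32) {
        if let Some(worker) = self.workers.iter_mut().find(|w| w.id == id) {
            worker.cpu_load_pct = cpu_load_pct;
        }
    }

    /// Picks the least-loaded worker. `min_by_key` returns the first of
    /// several equal minima, so ties go to the least recently used worker.
    /// The chosen worker is then promoted to the back of the queue.
    pub fn pick(&mut self) -> Option<String> {
        let idx = self
            .workers
            .iter()
            .enumerate()
            .min_by_key(|(_, w)| w.cpu_load_pct)
            .map(|(i, _)| i)?;
        let worker = self.workers.remove(idx)?;
        let id = worker.id.clone();
        self.workers.push_back(worker);
        Some(id)
    }
}
```

With equal loads this degenerates to plain round-robin; once one worker reports a higher load, it is skipped until its load drops back to the minimum.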
Implement batch input fetching using GetTree + has_with_results +
BatchReadBlobs + ByteStream for concurrent blob downloads. Add
hardlink retry on ENOENT (eviction during link). Implement phase
timing for action lifecycle. Add upload timeout with stall
detection. Support peer blob hints from scheduler. Downgrade
per-blob fetch/upload logs to debug level.

Co-Authored-By: Claude <noreply@anthropic.com>
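The "hardlink retry on ENOENT" behavior from the commit message above can be sketched as a bounded retry loop around `std::fs::hard_link`. The function name `hardlink_with_retry`, the `refetch` callback, and the retry limit are hypothetical; the sketch only assumes that a missing source file means the blob was evicted between fetch and link and can be restored by fetching it again.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hard-links `src` to `dst`. If the link fails with NotFound (ENOENT),
/// calls `refetch` to restore the source and tries again, up to
/// `max_retries` times. Any other error is returned immediately.
///
/// Note: NotFound can also mean the parent directory of `dst` is missing;
/// refetching does not help in that case, so the retries are bounded.
pub fn hardlink_with_retry<F>(
    src: &Path,
    dst: &Path,
    max_retries: u32,
    mut refetch: F,
) -> io::Result<()>
where
    F: FnMut() -> io::Result<()>,
{
    let mut attempt = 0;
    loop {
        match fs::hard_link(src, dst) {
            Ok(()) => return Ok(()),
            Err(e) if e.kind() == io::ErrorKind::NotFound && attempt < max_retries => {
                attempt += 1;
                // Assumed recovery step: download the blob into `src` again.
                refetch()?;
            }
            Err(e) => return Err(e),
        }
    }
}
```

A caller would pass a closure that re-downloads the blob into the local cache path; passing `max_retries = 0` disables the retry and surfaces the first error unchanged.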
CLAassistant commented Mar 12, 2026

All committers have signed the CLA.


2 participants