Merged
98 commits
93c8317
Setup new module for a job queue service
DiegoTavares Jul 10, 2025
425a393
Initial version of the distributed job-scheduler
DiegoTavares Jul 18, 2025
1068970
[draft] dispatcher
DiegoTavares Jul 22, 2025
dc3bf95
Add frame range parsing and chunking for job dispatching
DiegoTavares Aug 7, 2025
71aa58b
Add job_resource cores limits to host_dao query
DiegoTavares Aug 8, 2025
378dde3
Merge branch 'master' into distributed_scheduler
DiegoTavares Aug 8, 2025
c0a6193
Implement scheduler using kafka
DiegoTavares Aug 19, 2025
afb000c
Compiles
DiegoTavares Aug 29, 2025
75ac963
Make all memory fields bytesize
DiegoTavares Aug 29, 2025
1596d02
Fix database memory values from bytes to kb
DiegoTavares Aug 29, 2025
dd3d41a
Implement cluster logic using facility+show+tag
DiegoTavares Sep 4, 2025
40d8a2b
Fix layer host candidate loop
DiegoTavares Sep 4, 2025
b147b7f
Remove dead files and old TODOs
DiegoTavares Sep 4, 2025
94d7a6c
Add integration tests
DiegoTavares Sep 5, 2025
d04eaa9
Rename and refactor integration_tests to smoke_tests
DiegoTavares Sep 12, 2025
804f2ae
WIP: Add scheduler stress tests
DiegoTavares Sep 18, 2025
78ecb0e
Minor fixes
DiegoTavares Sep 19, 2025
623fcc3
First working stress tests
DiegoTavares Sep 22, 2025
fd6305e
Update job fetcher to use fetch_all and stream processing
DiegoTavares Sep 23, 2025
d959d16
Refactor modules
DiegoTavares Sep 24, 2025
1b7a9ad
Batch layer and frame queries
DiegoTavares Sep 24, 2025
2b97f12
Fixed several dashmap related deadlocks
DiegoTavares Sep 25, 2025
3216258
Convert host_cache to scc
DiegoTavares Sep 26, 2025
2e9f651
Wrap HostCache in an Actor System using actix
DiegoTavares Sep 27, 2025
55f5e41
Remove unnecessary debug statements
DiegoTavares Sep 27, 2025
733f366
Migrate Dispatcher interface into an Actor
DiegoTavares Sep 27, 2025
2dee076
Migrate Dispatcher interface into an Actor
DiegoTavares Sep 27, 2025
2af6c54
Clean up warnings
DiegoTavares Oct 2, 2025
91ab556
Prevent race condition trying to book the same host
DiegoTavares Oct 2, 2025
dd5f9e3
Document pub functions
DiegoTavares Oct 2, 2025
8edb761
Document host_cache
DiegoTavares Oct 2, 2025
7b68055
Add host locking mechanism to avoid race conditions
DiegoTavares Oct 9, 2025
915bebf
Add option to launch scheduler with a list of clusters
DiegoTavares Oct 9, 2025
b825e18
Add ProcDao to store resource allocation for frame dispatch
DiegoTavares Oct 9, 2025
f945316
Fix warnings
DiegoTavares Oct 9, 2025
b3eaadb
Add Allocation and Subscription Service to Scheduler
DiegoTavares Oct 20, 2025
1cd71c6
Replace experimental duration_constructors with explicit Duration calls
DiegoTavares Oct 28, 2025
f49c759
Refactor dispatcher to centralize virtual proc dispatch logic
DiegoTavares Oct 28, 2025
899e379
Add row-level locking to frame dispatch to prevent conflicts
DiegoTavares Oct 28, 2025
82970a4
Handle selfish services
DiegoTavares Oct 28, 2025
eba78c7
Fix unit tests
DiegoTavares Oct 28, 2025
48890ef
Add show:alloc exclusion list to cuebot
DiegoTavares Oct 28, 2025
faa4efa
Turn off spotless for HostDaoJdbc and format queries
DiegoTavares Oct 29, 2025
20bb2ca
Merge branch 'master' into distributed_scheduler_2
DiegoTavares Oct 29, 2025
e1caa64
Use rust latest stable edition 2021
DiegoTavares Oct 29, 2025
c9d9714
Remove cargo.lock
DiegoTavares Oct 29, 2025
9acc76d
Ignore Cargo.lock
DiegoTavares Oct 29, 2025
7913f3f
Remove stress test from the basic build
DiegoTavares Oct 29, 2025
e7a4591
Fix warnings
DiegoTavares Oct 29, 2025
b900590
Use Cargo resolver version 2 in workspace configuration
DiegoTavares Oct 29, 2025
fd7aa51
Add a dockerfile for the scheduler module
DiegoTavares Oct 30, 2025
c096ae1
Remove kafka references from scheduler
DiegoTavares Oct 30, 2025
3bf764c
Use channels between cluster and entrypoint logics
DiegoTavares Nov 5, 2025
7c1ed0b
Add LayerPermitService to prevent concurrent processing of same layer
DiegoTavares Nov 5, 2025
6da9ee0
Add memory_hungry.sh script to allocate specified memory for testing
DiegoTavares Nov 5, 2025
d6e48d6
Refine dry run logging to debug level
DiegoTavares Nov 6, 2025
511e889
Add host booking strategy and cluster round counters
DiegoTavares Nov 6, 2025
548ccd6
Remove lock_for_update from frame dispatch logic
DiegoTavares Nov 6, 2025
9acbd1a
Fix unit of memory on proc_dao.rs
DiegoTavares Nov 6, 2025
b4cc44f
Improve logging
DiegoTavares Nov 7, 2025
e76db1c
Add updated_at timestamp to frame and layer models
DiegoTavares Nov 7, 2025
35ee40a
Add Prometheus metrics to scheduler service
DiegoTavares Nov 7, 2025
f6f2008
Disable stress tests on default testset
DiegoTavares Nov 7, 2025
bc962d8
Merge branch 'master' into distributed_scheduler_2
DiegoTavares Nov 7, 2025
aec329a
Add optional host locking in dispatcher commands
DiegoTavares Nov 7, 2025
401cf64
Refactor DatabaseConfig to use explicit connection params
DiegoTavares Nov 7, 2025
550c7b8
Update Scheduler Configuration and Config Model
DiegoTavares Nov 8, 2025
886c194
Fix unit tests and warnings
DiegoTavares Nov 8, 2025
635b6ff
Add URL encoding for database credentials
DiegoTavares Nov 8, 2025
14e60b5
Improve RqdDispatcherService connection caching and timeout handling
DiegoTavares Nov 10, 2025
221e120
Add metrics for job query duration in scheduler
DiegoTavares Nov 10, 2025
491feb3
Add config option to turn off host booking
DiegoTavares Nov 10, 2025
3ac24dc
Change log level from info to debug in matcher
DiegoTavares Nov 10, 2025
1816b49
Add OS validation to host matching process
DiegoTavares Nov 10, 2025
bb8a53c
Ensure grpc connection cache is invalidated in any error condition
DiegoTavares Nov 10, 2025
de1de29
Fix reference to facility ID in cluster feed loading
DiegoTavares Nov 10, 2025
f092f62
[rebase] Update host cache and DAO to improve resource tracking and a…
DiegoTavares Nov 13, 2025
f75c8ba
[rebase] Refactor host_cache
DiegoTavares Nov 13, 2025
1ed312a
Refactor host_cache
DiegoTavares Nov 20, 2025
125930f
Refactor HostStore with atomic operations and improved concurrency
DiegoTavares Nov 20, 2025
f1c9016
Add debug logging for host cache and signal handling
DiegoTavares Nov 21, 2025
ac35982
Add debug log when host is considered stale in cache
DiegoTavares Nov 21, 2025
178ba52
Change Id types to Uuid
DiegoTavares Nov 24, 2025
11aab71
Replace host.ts_last_updated by host_stat.ts_ping
DiegoTavares Nov 25, 2025
0e9bc63
Set PostgreSQL connection to UTC timezone
DiegoTavares Nov 25, 2025
e5d60bb
Remove custom timestamp layers from tracing logs
DiegoTavares Nov 25, 2025
c3581e6
Fix case issue on facility pk on host_dao
DiegoTavares Nov 25, 2025
8d35a31
Revert cuebot changes
DiegoTavares Nov 25, 2025
fafef7e
Remove unused opencue.properties entries
DiegoTavares Nov 25, 2025
00c318c
Add retry count tracking for dispatched frames
DiegoTavares Nov 27, 2025
9ffcb65
Add option to ignore a list of tags
DiegoTavares Nov 28, 2025
06ad353
Make host_dao facility query case-insensitive
DiegoTavares Dec 3, 2025
d11b51d
Add configurable frame memory soft and hard limits
DiegoTavares Dec 4, 2025
2faa2cf
Update Cluster type to include facility ID
DiegoTavares Dec 8, 2025
0b62d3b
Merge branch 'master' into distributed_scheduler_2
DiegoTavares Dec 8, 2025
5699d77
Remove local files
DiegoTavares Dec 8, 2025
738ce31
Merge branch 'master' into distributed_scheduler_2
DiegoTavares Dec 11, 2025
628c878
Fix unit test
DiegoTavares Dec 12, 2025
3 changes: 3 additions & 0 deletions .gitignore
@@ -27,9 +27,12 @@ cuebot/.project
/pycue/opencue/compiled_proto/
/rqd/rqd/compiled_proto/
docker-compose-local.yml
/sandbox/kafka*
/sandbox/zookeeper*
docs/_site/
docs/bin/
sandbox/kafka-data
sandbox/zookeeper-data
sandbox/zookeeper-logs
docs/_data/version.yml
target/*
7 changes: 6 additions & 1 deletion rust/.gitignore
@@ -3,7 +3,12 @@
# TODO: Remove once these crates are stable and ready for public use
/crates/cuebot-config
/crates/dist-lock
/crates/scheduler

.DS_Store
config/rqd.local_docker.yaml
/sandbox/kafka*
/reference
Cargo.lock

# Localized files only meant for building docker images locally
proto
175 changes: 175 additions & 0 deletions rust/AGENTS.md
@@ -0,0 +1,175 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This is the Rust implementation of OpenCue components - a render farm management system. The project consists of three main crates:

- **rqd**: The main worker daemon that executes rendering tasks
- **dummy-cuebot**: A testing/development server for interacting with rqd
- **opencue-proto**: gRPC protocol definitions and generated code

## Build and Development Commands

### Prerequisites
```bash
# macOS
brew install protobuf

# Ubuntu/Debian
sudo apt-get install protobuf-compiler
```

### Build Commands
```bash
# Build entire project (release mode - OS-specific)
cargo build -r

# Build in debug mode (includes both Linux and macOS versions)
cargo build

# Build specific crate
cargo build -p rqd
cargo build -p dummy-cuebot
cargo build -p opencue-proto

# Run tests (unit tests only)
cargo test

# Run all tests including integration tests (requires database setup)
cargo test --features integration-tests

# Run only integration tests
cargo test --features integration-tests integration_tests

# Run clippy linting
cargo clippy -- -D warnings

# Format code
cargo fmt
```

### Running the System

1. **Start dummy-cuebot report server:**
```bash
target/release/dummy-cuebot report-server
```

2. **Start RQD service:**
```bash
# With fake Linux environment simulation
env OPENCUE_RQD_CONFIG=config/rqd.fake_linux.yaml target/release/openrqd

# With default config
target/release/openrqd
```

3. **Launch a test frame:**
```bash
target/release/dummy-cuebot rqd-client launch-frame crates/rqd/resources/test_scripts/memory_fork.sh
```

### Development Testing
```bash
# Run a single test
cargo test test_name

# Run tests with output
cargo test -- --nocapture

# Run tests for specific crate
cargo test -p rqd

# Check logs for test frames
tail -f /tmp/rqd/test_job.test_frame.rqlog
```

## Architecture Overview

### Core Components

**MachineMonitor** (`crates/rqd/src/system/machine.rs`):
- Central orchestrator for system monitoring and resource management
- Manages CPU/GPU reservations and NIMBY (user activity detection)
- Handles process cleanup and zombie detection

**FrameManager** (`crates/rqd/src/frame/manager.rs`):
- Manages frame lifecycle: validation, spawning, monitoring, cleanup
- Supports frame recovery after restarts via snapshot system
- Handles resource affinity and Docker containerization

**ReportClient** (`crates/rqd/src/report/report_client.rs`):
- Handles communication with Cuebot server
- Implements retry logic with exponential backoff
- Supports endpoint rotation for high availability
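
The retry behavior described for ReportClient can be illustrated with a minimal sketch. The code below is not taken from `report_client.rs`; the function names `backoff_delay` and `endpoint_for_attempt`, the doubling factor, and the round-robin rotation are all assumptions chosen to show the general technique.

```rust
use std::time::Duration;

/// Hypothetical helper: delay before retry number `attempt` (0-based).
/// Computes base * 2^attempt and caps the result at `max`.
fn backoff_delay(base: Duration, max: Duration, attempt: u32) -> Duration {
    // checked_pow and saturating_mul guard against overflow on large attempts.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.saturating_mul(factor).min(max)
}

/// Hypothetical helper: round-robin endpoint choice, so each failed
/// attempt moves on to the next configured endpoint.
fn endpoint_for_attempt<'a>(endpoints: &'a [&'a str], attempt: usize) -> Option<&'a str> {
    if endpoints.is_empty() {
        None
    } else {
        Some(endpoints[attempt % endpoints.len()])
    }
}
```

A caller would sleep for `backoff_delay(base, max, n)` after the n-th failure and then contact `endpoint_for_attempt(&endpoints, n)`. Capping the delay keeps a long outage from pushing retries arbitrarily far apart.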

**RqdServant** (`crates/rqd/src/servant/rqd_servant.rs`):
- gRPC service implementation
- Handles incoming commands from Cuebot
- Delegates to appropriate managers

### Key Architectural Patterns

1. **Async/Await Throughout**: Full async architecture with Tokio runtime
2. **Resource Management**: Careful resource reservation and cleanup
3. **Platform Abstraction**: Separate Linux/macOS system implementations
4. **Configuration System**: YAML-based config with environment variable overrides
5. **Error Handling**: Uses `miette` for user-friendly error reporting

### Configuration

- **Default config location**: `~/.local/share/rqd.yaml`
- **Environment override**: `OPENCUE_RQD_CONFIG` environment variable
- **Environment prefix**: `OPENRQD_` for individual settings
- **Test config**: `config/rqd.fake_linux.yaml` for development
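
As a rough illustration of the lookup order listed above, the sketch below resolves the config path from the `OPENCUE_RQD_CONFIG` override first and falls back to the default location under the home directory. The function names are hypothetical, and reading `HOME` to expand `~` is an assumption; the actual loader in the rqd crate may work differently.

```rust
use std::path::PathBuf;

/// Hypothetical helper: pick the override path when present, otherwise
/// fall back to `<home>/.local/share/rqd.yaml`.
fn resolve_config_path(override_path: Option<String>, home: Option<String>) -> Option<PathBuf> {
    override_path
        .map(PathBuf::from)
        .or_else(|| home.map(|h| PathBuf::from(h).join(".local/share/rqd.yaml")))
}

/// Hypothetical wrapper that reads both values from the environment.
/// Assumption: `HOME` is used to expand `~`.
fn config_path_from_env() -> Option<PathBuf> {
    resolve_config_path(
        std::env::var("OPENCUE_RQD_CONFIG").ok(),
        std::env::var("HOME").ok(),
    )
}
```

Keeping the environment reads in a thin wrapper leaves `resolve_config_path` easy to test without mutating process-wide state.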

### Frame Execution Flow

1. **Validation**: Machine state, user permissions, resource availability
2. **Resource Reservation**: CPU cores and GPUs via CoreStateManager
3. **User Management**: Creates system users if needed
4. **Frame Spawning**: Launches in separate threads with optional Docker
5. **Monitoring**: Tracks execution, resource usage, process health
6. **Cleanup**: Releases resources and reports completion
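
The six steps above can be sketched as a small state machine. This is only an illustrative model of the ordering; the `FrameStage` enum and `stages_from` helper are hypothetical and do not correspond to types in the rqd crate.

```rust
/// Hypothetical model of the frame execution stages, in the listed order.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FrameStage {
    Validation,
    ResourceReservation,
    UserManagement,
    Spawning,
    Monitoring,
    Cleanup,
}

impl FrameStage {
    /// Returns the stage that follows `self`, or `None` after `Cleanup`.
    fn next(self) -> Option<FrameStage> {
        match self {
            FrameStage::Validation => Some(FrameStage::ResourceReservation),
            FrameStage::ResourceReservation => Some(FrameStage::UserManagement),
            FrameStage::UserManagement => Some(FrameStage::Spawning),
            FrameStage::Spawning => Some(FrameStage::Monitoring),
            FrameStage::Monitoring => Some(FrameStage::Cleanup),
            FrameStage::Cleanup => None,
        }
    }
}

/// Collects `from` and every stage after it, in order.
fn stages_from(from: FrameStage) -> Vec<FrameStage> {
    let mut stages = vec![from];
    let mut current = from;
    while let Some(next) = current.next() {
        stages.push(next);
        current = next;
    }
    stages
}
```

Encoding the order in a single `next` method makes it harder to skip a stage, such as spawning a frame before its resources are reserved.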

### Development Notes

- **Resource Isolation**: Frames run in separate process groups
- **Container Support**: Optional Docker containerization via `containerized_frames` feature
- **Recovery System**: Restores running frames from snapshots after restarts
- **Kill Monitoring**: Tracks frame termination with forced kill capability
- **NIMBY Support**: Prevents frame execution when user is active

### Important Files

- `crates/rqd/src/main.rs`: RQD entry point and application setup
- `crates/rqd/src/config/config.rs`: Configuration structure definitions
- `crates/rqd/src/system/reservation.rs`: Resource reservation system
- `crates/dummy-cuebot/src/main.rs`: Testing server entry point
- `crates/opencue-proto/build.rs`: Protocol buffer build configuration

### Platform-Specific Code

- `crates/rqd/src/system/linux.rs`: Linux-specific system monitoring
- `crates/rqd/src/system/macos.rs`: macOS-specific system monitoring
- Build configuration automatically selects appropriate implementation
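
One common way to get this kind of automatic selection in Rust is conditional compilation with `#[cfg(target_os = ...)]`. The sketch below shows the general pattern only; the `platform` module and its `name` function are hypothetical stand-ins, not the contents of `linux.rs` or `macos.rs`.

```rust
// Exactly one of these modules is compiled, depending on the target OS.
#[cfg(target_os = "linux")]
mod platform {
    pub fn name() -> &'static str {
        "linux"
    }
}

#[cfg(target_os = "macos")]
mod platform {
    pub fn name() -> &'static str {
        "macos"
    }
}

// Fallback so the sketch still compiles on other targets.
#[cfg(not(any(target_os = "linux", target_os = "macos")))]
mod platform {
    pub fn name() -> &'static str {
        "unsupported"
    }
}

/// Callers use one name and never mention the target OS themselves.
fn platform_name() -> &'static str {
    platform::name()
}
```

Because the choice happens at compile time, the unused implementation is not built into the binary and callers need no runtime branching.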

### Logging and Debugging

- **Log location**: Configurable via logging config
- **Log levels**: trace, debug, info, warn, error
- **Frame logs**: Individual frame execution logs in `/tmp/rqd/`
- **Structured logging**: Uses `tracing` crate for structured logging

## Code Review and Standards

### Rules

- When reviewing code, check:
  - That all public methods are documented in their header comment
  - Whether the existing documentation of each changed function needs to be updated
  - Possible race conditions introduced by the changes
  - The overall quality of the change against Rust standards
  - Introduced panic conditions that are not properly documented