Bernstein is a multi-agent orchestration platform for CLI coding agents. It is the Kubernetes of AI software engineering — spawn agents, assign tasks, verify output, merge results, learn from failures, repeat. The goal is to build the most reliable, observable, and effective orchestrator in the ecosystem.
Bernstein orchestrates SHORT-LIVED agents (1–3 tasks each, then exit). State
lives in FILES (.sdd/), not in agent memory. Agents are spawned fresh per
task — no "sleep" problem. The orchestrator itself is DETERMINISTIC CODE, not
an LLM. It works with ANY CLI agent (Claude Code, Codex, Gemini CLI, etc.).
Stack: Python 3.12+, Starlette/FastAPI task server, Textual TUI, git worktree isolation, YAML-based task specs, JSONL metrics/traces.
Optimize for: reliability, agent effectiveness, observability, safe iteration, and user trust. Do not optimize for: clever abstractions nobody asked for, premature generalization, architecture theater, or broad refactors without measured gain.
- Smallest safe delta. Isolate one change per commit. Preserve rollback clarity. If a change touches 5+ files, consider splitting.
- No monoliths. Do not create or extend god-files. Split by capability. Triggers: >400 LOC (soft), >600 LOC (hard stop unless justified), mixed concerns, multiple reasons to change. The orchestrator is already split across `orchestrator.py`, `tick_pipeline.py`, `task_lifecycle.py`, and `agent_lifecycle.py` — follow this pattern.
- Structure. Thin orchestration facade, isolated core logic, separate adapters, explicit schemas, prompt building separate from retrieval.
- OOP where useful, pure funcs where better. Small classes for stateful collaborators; pure functions for deterministic transforms, parsing, scoring. Prefer composition over inheritance. Use Protocols/ABCs only at real seams.
- Strict typing. No dict soup, loose `Any`, or silent `Optional` misuse in core paths. Type all public APIs. Use `TypedDict`/`@dataclass` for internal records, Pydantic only for FastAPI request/response boundaries. Pyright strict is mandatory for all touched code.
- Async for IO, sync for CPU. No blocking sync IO in async paths. No fire-and-forget without owned lifecycle. Use explicit timeouts. Do not break SSE, streaming, or telemetry.
- Observability. Preserve or improve logging, metrics, traces, token accounting, and agent signal files. No hidden globals, silent fallbacks, or unauditable magic.
- Performance. Avoid repeated parsing, N+1 HTTP calls, needless serialization, duplicate work. Cache only when invalidation is safe.
- Testing. Add the smallest deterministic tests proving the change works. No fake-green tests. Always mock the CLI adapter and HTTP calls. Use `tmp_path` for filesystem.
- YAGNI. Don't build for hypothetical future requirements. Three similar lines are better than a premature abstraction.
| Class | What | Example |
|---|---|---|
| A | Tiny low-risk patch (1-2 files, <50 lines) | Fix typo, add log line, extract constant |
| B | Narrow feature or fix (2-5 files, <200 lines) | New quality gate, adapter fix, endpoint |
| C | Bounded refactor + logic (5-10 files) | Split module, new subsystem, drain system |
| D | Major feature branch | New TUI screen, protocol support, provider |
| E | Investigate / needs discussion | Conflict, unclear requirement, risky change |
For C and D changes: write a plan before coding. For A and B: just do it.
- Ignoring Pyright strict on touched code
- Leaving Ruff/pytest failures for "later"
- Blocking sync IO in async paths
- Breaking SSE, telemetry, or agent signal protocol
- Creating new god-files (>600 LOC)
- Untyped new core logic (`Any` soup)
- `pkill -f bernstein` or `pgrep bernstein` (use PID files)
- Running `pytest tests/` without the isolated runner
- Committing `.sdd/runtime/` contents
- Pushing to `master` (branch is `main`)
If you discover conflicting behavior between code, docs, tests, or specs:
```
[CONFLICT DETECTED]
File(s): ...
Conflict: ...
Why it matters: ...
Smallest safe resolution: ...
```
Do not paper over conflicts. Report and resolve explicitly.
```bash
uv venv && uv pip install -e ".[dev]"
```

```bash
uv run python scripts/run_tests.py -x        # all tests (isolated per-file, stops on first failure)
uv run python scripts/run_tests.py -k router # filter by keyword
uv run pytest tests/unit/test_foo.py -x -q   # single file (fast)
```

NEVER run `uv run pytest tests/ -x -q` — the full suite keeps references across 2000+ tests and can leak 100+ GB RAM. The isolated runner in `scripts/run_tests.py` caps each file at ~200 MB.
```bash
uv run ruff check src/
uv run ruff format src/
uv run pyright src/
```

All three must pass before committing. No exceptions, no "fix later."
- Python 3.12+, type hints on every public function and method
- Max line length: 120 (enforced by ruff)
- `from __future__ import annotations` at the top of every module
- Ruff rules: E, F, W, I, UP, B, SIM, TCH, RUF
- No dict soup — use `@dataclass` or `TypedDict`, not raw `dict[str, Any]`
- Enums over string literals for any value that has a fixed set of options
- Google-style docstrings on all public symbols
- Async only for IO-bound code; sync for CPU-bound/pure logic
- Concise inline comments only for non-obvious logic or dangerous edges
| File | Purpose |
|---|---|
| `models.py` | Core data models for tasks, agents, and cells |
| `server.py` | FastAPI task server — central coordination point for all agents |
| `orchestrator.py` | Orchestrator loop: watch tasks, spawn agents, verify completion, repeat |
| `tick_pipeline.py` | Tick pipeline helpers: task fetching, batching, and server interaction |
| `task_lifecycle.py` | Task lifecycle: claim, spawn, complete, retry, decompose |
| `agent_lifecycle.py` | Agent lifecycle: tracking, heartbeat, crash detection, reaping |
| `spawner.py` | Spawn short-lived CLI agents for task batches |
| `router.py` | Route tasks to appropriate model and effort level with tier awareness |
| `janitor.py` | Verify task completion via concrete signals |
| `context.py` | Gather project context for the manager's planning prompt |
| `a2a.py` | A2A (Agent-to-Agent) protocol support |
| `agency_loader.py` | Load Agency agent personas as additional Bernstein role templates |
| `agent_discovery.py` | Auto-discover installed CLI coding agents, check login status, and register capabilities |
| `agent_signals.py` | Agent signal file protocol: WAKEUP, SHUTDOWN, and HEARTBEAT |
| `api_usage.py` | API usage tracking and metrics collection |
| `approval.py` | Approval gates: configurable review step between janitor verification and merge |
| `batch_router.py` | Batch API routing for non-urgent tasks |
| `bootstrap.py` | Bootstrap orchestration: coordinate startup, task planning, and agent spawning |
| `bulletin.py` | Append-only bulletin board for cross-agent communication |
| `ci_fix.py` | CI self-healing: detect failing CI jobs and create fix tasks |
| `ci_log_parser.py` | Generic CI log parser with adapter pattern |
| `cluster.py` | Cluster coordination: node registration, heartbeats, topology management |
| `complexity_advisor.py` | Complexity Advisor: single-agent vs multi-agent mode selection |
| `cost.py` | Intelligent cost optimization engine |
| `cost_history.py` | Cost history persistence and alert logic |
| `cost_tracker.py` | Per-run cost budget tracker |
| `cross_model_verifier.py` | Cross-model verification: route completed task diffs to a different model for review |
| `evolution.py` | Backward-compatibility shim — delegates to the `bernstein.evolution` package |
| `fast_path.py` | Fast-path execution for trivial tasks that don't need an LLM agent |
| `file_discovery.py` | File discovery and project context gathering |
| `file_locks.py` | File-level locking for concurrent agent safety |
| `git_basic.py` | Basic git operations: run, status, staging, committing |
| `git_context.py` | Git read operations for building agent context |
| `git_ops.py` | Centralized git write operations for Bernstein |
| `git_pr.py` | Pull request and branching operations |
| `github.py` | GitHub API integration for evolve coordination |
| `graph.py` | Task dependency graph with critical-path and parallelism analysis |
| `guardrails.py` | Output guardrails: secret detection, scope enforcement, dangerous operations |
| `heartbeat.py` | Agent heartbeat and stall detection |
| `hijacker.py` | Automatic tier hijacking — detects and routes to free-tier opportunities |
| `home.py` | Global `~/.bernstein` home directory management |
| `knowledge_base.py` | Knowledge base, file indexing, and task context enrichment |
| `lessons.py` | Agent lesson propagation system |
| `llm.py` | Async native LLM client for Bernstein manager and external models |
| `manager.py` | Manager Intelligence — LLM-powered task decomposition and review |
| `manager_models.py` | Manager result types and data models |
| `manager_parsing.py` | Manager LLM response parsing |
| `manager_prompts.py` | Manager prompt templates and rendering |
| `mcp_manager.py` | MCP server lifecycle manager |
| `mcp_registry.py` | MCP server auto-discovery and per-task configuration |
| `merge_queue.py` | FIFO merge queue for serialized branch merging with conflict routing |
| `metric_collector.py` | Metrics collection and recording |
| `metric_export.py` | Metrics export and reporting functionality |
| `metrics.py` | Performance metrics collection and storage (facade) |
| `multi_cell.py` | Multi-cell orchestrator: coordinates multiple cells, each with its own manager + workers |
| `notifications.py` | Webhook notification system for Bernstein run events |
| `policy.py` | Policy engine for tier optimization and provider routing |
| `pr_size_governor.py` | PR Size Governor — auto-split large agent PRs into reviewable chunks |
| `preflight.py` | Pre-flight checks: validate CLI, API key, port availability before bootstrap |
| `prometheus.py` | Prometheus metrics for Bernstein |
| `prompt_caching.py` | Prompt caching orchestration for token savings via prefix detection |
| `quality_gates.py` | Automated quality gates: lint, type-check, and test gates after task completion |
| `quarantine.py` | Cross-run task quarantine — track repeatedly-failing tasks across Bernstein runs |
| `rag.py` | Lightweight codebase RAG using SQLite FTS5 (BM25 ranking) |
| `rate_limit_tracker.py` | Rate-limit-aware scheduling: per-provider throttle tracking and 429 detection |
| `researcher.py` | Web research module for evolve mode |
| `retrospective.py` | Run retrospective report generation |
| `rule_enforcer.py` | Organizational rule enforcement: load `.bernstein/rules.yaml`, check violations |
| `seed.py` | Seed file parser for `bernstein.yaml` |
| `server_launch.py` | Server and spawner lifecycle: startup, health checks, task injection, cleanup |
| `session.py` | Session state persistence for fast resume after `bernstein stop`/restart |
| `signals.py` | Pivot signal system for strategic re-evaluation of tickets |
| `store.py` / `store_redis.py` / `store_postgres.py` | Abstract `TaskStore` base class for pluggable storage backends |
| `store_factory.py` | Storage backend factory for the Bernstein task server |
| `sync.py` | Sync `.sdd/backlog/*.yaml` files with the task server |
| `task_store.py` | Thread-safe in-memory task store with JSONL persistence |
| `token_monitor.py` | Token growth monitor with auto-intervention |
| `traces.py` | Agent execution trace storage, parsing, and replay utilities |
| `upgrade_executor.py` | Autonomous upgrade executor with transaction-like safety and rollback |
| `worker.py` | `bernstein-worker`: visible process wrapper for spawned CLI agents |
| `workspace.py` | Multi-repo workspace orchestration |
| `worktree.py` | `WorktreeManager` — git worktree lifecycle for agent session isolation |
Modules added after the initial map:
| File | Purpose |
|---|---|
| `auth.py` | SSO / SAML / OIDC authentication for the Bernstein task server |
| `auth_middleware.py` | Authentication middleware for the Bernstein task server |
| `cascade_router.py` | Cost-aware model cascading router |
| `circuit_breaker.py` | Real-time circuit breaker for purpose enforcement |
| `context_degradation_detector.py` | Monitor agent quality over time; restart when degraded |
| `graduation.py` | Pilot-to-production graduation framework |
| `plan_approval.py` | Plan mode: pre-execution cost estimation and human approval |
| `planner.py` | Task planning: LLM-powered goal decomposition and replan |
| `repo_index.py` | Repository intelligence index — lightweight code graph for agent context |
| `reviewer.py` | Task review: LLM-powered completion review and queue correction |
| `semantic_cache.py` | Semantic caching layer for LLM requests |
| `semantic_graph.py` | Semantic code graph — symbol-level dependency graph for context routing |
| `benchmark_gate.py` | Benchmark regression gate — block merge when performance degrades |
| `cost_anomaly.py` | Cost anomaly detection with Z-score signaling |
| `log_redact.py` | PII redaction filter for Python logging |
| `loop_detector.py` | Agent loop and file-lock deadlock detection |
| `spawn_prompt.py` | Prompt rendering utilities for agent spawning |
| `task_completion.py` | Task completion, retry, and post-completion processing |
| `trigger_manager.py` | Event-driven trigger manager — evaluates incoming events against user-defined rules |
| `trigger_sources/` | Trigger source adapters: `github.py`, `slack.py`, `file_watch.py`, `webhook.py` |
| File | Purpose |
|---|---|
| `agents.py` | Agent inspection routes — logs, kill signals, and SSE output streams |
| `auth.py` | Authentication routes for SSO / SAML / OIDC flows (OIDC, SAML, device flow, session) |
| `costs.py` | Cost budget routes |
| `dashboard.py` | Dashboard routes — file lock inspection |
| `graduation.py` | Graduation framework routes — stage inspection, event recording, and promotion |
| `plans.py` | Plan approval routes — list, view, approve, and reject execution plans |
| `quality.py` | Quality metrics routes — success rate, token usage, p50/p90/p99 completion times |
| `slack.py` | Slack webhook routes — slash command and Events API endpoints |
| `status.py` | Status, health, metrics, dashboard, and SSE event routes |
| `tasks.py` | Task CRUD routes, agent heartbeats, bulletin board, A2A, cluster, session streaming |
| `webhooks.py` | Inbound webhook routes for external event ingestion |
| File | Purpose |
|---|---|
| `aider.py` | Aider CLI adapter |
| `amp.py` | Amp CLI adapter |
| `base.py` | Base adapter for CLI coding agents |
| `caching_adapter.py` | Caching wrapper for CLI adapters to enable prompt prefix deduplication |
| `claude.py` | Claude Code CLI adapter |
| `codex.py` | OpenAI Codex CLI adapter |
| `env_isolation.py` | Environment variable isolation for spawned agents |
| `gemini.py` | Google Gemini CLI adapter |
| `generic.py` | Generic CLI adapter for arbitrary coding agent CLIs |
| `manager.py` | Manager adapter — spawns the internal Python ManagerAgent as a CLI participant |
| `qwen.py` | Qwen CLI adapter for OpenAI-compatible models |
| `registry.py` | Adapter registry — look up CLI adapters by name |
| `roo_code.py` | Roo Code CLI adapter |
| `cody.py` | Sourcegraph Cody CLI adapter |
| `continue_dev.py` | Continue.dev CLI adapter |
| `cursor.py` | Cursor CLI adapter |
| `goose.py` | Goose CLI adapter |
| `kilo.py` | Kilo Code CLI adapter |
| `kiro.py` | Kiro CLI adapter |
| `ollama.py` | Ollama local model CLI adapter |
| `opencode.py` | OpenCode CLI adapter |
| `tabby.py` | Tabby CLI adapter |
| `claude_agents.py` | Claude Agents SDK adapter |
| `iac.py` | Infrastructure-as-Code adapter |
| `mock.py` | Mock adapter for testing |
| `skills_injector.py` | Skills injection middleware for adapters |
| `conformance.py` | Adapter conformance test suite |
| `ci/` | CI system adapters for log parsing and failure extraction (`github_actions.py`) |
| File | Purpose |
|---|---|
| `agency_provider.py` | `AgencyProvider` — loads `CatalogAgent` instances from msitarzewski/agency-agents format |
| `catalog.py` | Agent catalog registry — loads agent definitions from external sources |
| `discovery.py` | Agent directory auto-discovery for Bernstein |
| `registry.py` | Dynamic agent registry with YAML-based definitions and hot-reload support |
| File | Purpose |
|---|---|
| `advanced_cmd.py` | Advanced tools and utilities for Bernstein CLI |
| `agents_cmd.py` | Agent catalog management commands: sync, list, validate, showcase, match, discover |
| `cost.py` | `bernstein cost` — spend visibility across all recorded metrics |
| `dashboard.py` | Bernstein TUI — retro-futuristic agent orchestration dashboard |
| `errors.py` | Structured error reporting for Bernstein CLI |
| `eval_benchmark_cmd.py` | Evaluation and benchmarking commands for Bernstein CLI |
| `evolve_cmd.py` | Evolution commands: evolve run/review/approve/status/export |
| `helpers.py` | Shared constants, helpers, and utilities for Bernstein CLI modules |
| `live.py` | Live view helpers for `bernstein live --classic` |
| `main.py` | CLI entry point for Bernstein — multi-agent orchestration |
| `run.py` | Enhanced run output for `bernstein run` |
| `run_cmd.py` | Run commands: init, conduct, downbeat (legacy start), and the main CLI group |
| `status.py` | Formatted status output for `bernstein status` |
| `status_cmd.py` | Status and diagnostic commands: status, ps, doctor |
| `stop_cmd.py` | Stop commands: soft/hard stop, shutdown signals, session save, ticket recovery |
| `task_cmd.py` | Task lifecycle commands for Bernstein CLI |
| `ui.py` | Shared Rich UI components for Bernstein CLI |
| `workspace_cmd.py` | Workspace and configuration commands for Bernstein CLI |
| File | Purpose |
|---|---|
| `aggregator.py` | Metrics aggregation with EWMA, CUSUM, BOCPD, and Goodhart defenses |
| `applicator.py` | Change applicator — execute upgrades via file modification |
| `benchmark.py` | Tiered benchmark runner for evolution validation |
| `circuit.py` | `CircuitBreaker` — halt evolution when safety conditions are violated |
| `creative.py` | Creative evolution pipeline — visionary → analyst → production gate |
| `cycle_runner.py` | Evolution cycle execution engine |
| `detector.py` | Opportunity detection from aggregated metrics |
| `gate.py` | `ApprovalGate` and `EvalGate` — risk-stratified routing for evolution proposals |
| `governance.py` | Adaptive governance for the evolution system |
| `invariants.py` | `InvariantsGuard` — hash-lock safety-critical files |
| `loop.py` | Autoresearch evolution loop — continuous self-improvement via experiment cycles |
| `proposal_scorer.py` | Proposal risk scoring and routing classification |
| `proposals.py` | Upgrade proposal generation |
| `report.py` | Evolution observability — history table and static report generation |
| `risk.py` | Strategic Risk Score (SRS) computation for evolution proposals |
| `sandbox.py` | `SandboxValidator` — isolated testing of evolution proposals |
| `types.py` | Shared types for the evolution system |
| File | Purpose |
|---|---|
| `baseline.py` | Baseline tracking for eval-gated evolution |
| `golden.py` | Golden benchmark suite — curated tasks for eval |
| `harness.py` | Eval harness — multiplicative scoring, LLM judge, failure taxonomy |
| `judge.py` | LLM judge — evaluate code quality of agent-produced changes |
| `metrics.py` | Custom eval metrics — each metric is a dataclass with a compute method |
| `scenario_runner.py` | Scenario runner — execute YAML-defined eval scenarios against the live codebase |
| `taxonomy.py` | Failure taxonomy — classify every eval failure into a closed set |
| `telemetry.py` | Telemetry contract — strict schema for agent output metadata |
| File | Purpose |
|---|---|
| `hookspecs.py` | Hook specifications — defines extension points for Bernstein plugins |
| `manager.py` | Plugin manager — discovers, loads, and invokes Bernstein plugins |
| File | Purpose |
|---|---|
| `app.py` | Main Textual application for the Bernstein TUI session manager |
| `widgets.py` | Custom Textual widgets for the Bernstein TUI |
| File | Purpose |
|---|---|
| `app.py` | GitHub App authentication: JWT creation and installation token exchange |
| `ci_router.py` | CI failure routing: blame attribution and enriched fix-task generation |
| `mapper.py` | Event-to-task conversion: maps GitHub webhook events to Bernstein task payloads |
| `webhooks.py` | Webhook parsing and HMAC-SHA256 signature verification |
| File | Purpose |
|---|---|
| `server.py` | Bernstein MCP server |
| File | Purpose |
|---|---|
| `swe_bench.py` | SWE-Bench evaluation harness for Bernstein |
| Path | Purpose |
|---|---|
| `templates/roles/` | Jinja2 role prompts (manager, backend, qa, security, devops, etc.) |
| `templates/prompts/` | Prompt templates (`judge.md`, etc.) — bundled into wheel |
| `.sdd/` | All runtime state (never commit `.sdd/runtime/`) |
| `.sdd/backlog/open/` | YAML task specs waiting to be picked up |
| `.sdd/backlog/claimed/` | Tasks currently being worked |
| `.sdd/backlog/done/` | Completed tasks (automated sync moves files here) |
| `.sdd/backlog/closed/` | Completed tasks (manual sprint scripts move files here) |
| `.sdd/runtime/` | PIDs, logs, session state, signal files |
| `.sdd/metrics/` | JSONL metric records |
| `.sdd/traces/` | JSONL agent traces |
| `.sdd/agents/catalog.json` | Registered agent catalog |
| `tests/unit/` | Fast unit tests (no network) |
| `tests/integration/` | Integration tests (require running server) |
| `scripts/run_tests.py` | Per-file isolated test runner |
- `snake_case.py` for all Python modules
- Test files: `test_<module_name>.py` mirrors source structure
- Backlog task files: `p{priority}_c{complexity}_{date}_{type}_{slug}.yaml`
- Role templates: `<role-name>.md` or `<role-name>/` directory
- PascalCase: `TaskGraph`, `AgentSpawner`, `TierAwareRouter`
- Enums: PascalCase name, SCREAMING_SNAKE members: `TaskStatus.IN_PROGRESS`
- Dataclasses preferred over Pydantic models in core; Pydantic only for FastAPI request/response
- `snake_case`, verbs: `spawn_for_tasks()`, `verify_task()`, `build_worker_cmd()`
- Private helpers: leading underscore — `_read_cached()`, `_render_prompt()`
- Async functions: no special prefix, but always `async def` and awaited correctly
- Module-level helpers that accept the orchestrator as an explicit arg (not `self`): free functions in `task_lifecycle.py`/`agent_lifecycle.py`
- `snake_case` for variables
- `SCREAMING_SNAKE` for module-level constants: `MAX_JUDGE_RETRIES`, `JUDGE_MODEL`
- Private module-level caches: `_FILE_CACHE`, `_DIR_CACHE`
- Task IDs — short hex string: `16e2d84f94aa` (12 hex chars from `uuid.uuid4().hex[:12]`)
- Session IDs — full UUID4: `str(uuid.uuid4())`
- Role names — lowercase hyphenated: `backend`, `qa`, `security`, `devops`, `docs`, `frontend`, `architect`, `manager`
```python
"""Tests for <module> — <what is mocked>."""
from __future__ import annotations

import pytest
from unittest.mock import MagicMock, patch

# --- Fixtures ---

@pytest.fixture()
def my_thing(tmp_path):
    ...

# --- TestClassName ---

class TestMyThing:
    def test_happy_path(self, ...) -> None:
        ...

    def test_failure_case(self, ...) -> None:
        ...
```

For async endpoint tests:

```python
import pytest

@pytest.mark.asyncio
async def test_something(client: AsyncClient) -> None:
    resp = await client.post("/tasks", json={...})
    assert resp.status_code == 200
```

Use `httpx.ASGITransport` + `AsyncClient` against the FastAPI app directly — no real network:
```python
import pytest_asyncio
from httpx import ASGITransport, AsyncClient

from bernstein.core.server import create_app

@pytest_asyncio.fixture()
async def client(tmp_path):
    app = create_app(jsonl_path=tmp_path / "tasks.jsonl")
    transport = ASGITransport(app=app)
    async with AsyncClient(transport=transport, base_url="http://test") as c:
        yield c
```

Shared test helpers:

- `make_task()` — factory for `Task` with defaults; override only what matters
- `mock_adapter_factory(pid=42)` — returns a `MagicMock(spec=CLIAdapter)` with `.spawn()` returning `SpawnResult`
- `sdd_dir(tmp_path)` — temp `.sdd/` with standard subdirectories
- `_memory_guard` (autouse) — forces GC after every test; aborts if RSS > 2 GB
- Always mock the CLI adapter in spawner/orchestrator tests — never shell out for real
- Always mock httpx calls in orchestrator tests — use `unittest.mock.patch` or inject fake responses
- Real filesystem via `tmp_path` — never mock `Path` or file I/O when `tmp_path` works
- No database — state is files; use `tmp_path` for `.sdd/`
Group related cases: `class TestSpawnForTasks:`, `class TestProviderType:`. Each method is one scenario.
`pytest tests/` will leak memory across 2000+ test files and can hit 100 GB. Always use:

```bash
uv run python scripts/run_tests.py -x
```

The script runs each `test_*.py` file in a fresh subprocess.
Bernstein writes PID metadata JSON files to `.sdd/runtime/pids/`. Use those to find and stop processes. Never `pkill -f bernstein` or `pgrep bernstein` — it will kill the orchestrator indiscriminately.
```bash
# Correct: signal via file
echo "stop" > .sdd/runtime/signals/<role>-<session>/SHUTDOWN

# Correct: use bernstein CLI
bernstein stop

# WRONG: grep-kill
pkill -f bernstein  # kills everything, including your own shell session if "bernstein" is in the path
```

`src/bernstein/core/evolution.py` is a backward-compat re-export shim. The real implementation lives in `src/bernstein/evolution/`. Don't add code to the shim — extend the package.
`orchestrator.py` is the public façade. The actual logic is split:

- `tick_pipeline.py` — data containers and task fetching
- `task_lifecycle.py` — claim/spawn/complete/retry
- `agent_lifecycle.py` — heartbeat/crash/reap
If you're editing orchestration behavior, read all three before touching any one.
`manager.py` is the public façade for the LLM-powered Manager. The logic is split:

- `manager_models.py` — `ReviewResult`, `QueueCorrection`, `QueueReviewResult` dataclasses
- `manager_parsing.py` — JSON response parsing from LLM calls
- `manager_prompts.py` — prompt template loading and rendering

`manager.py` imports from all three and exposes `ManagerAgent`. Don't add models/parsing/prompts to `manager.py` itself — extend the relevant sub-module.
All modules start with `from __future__ import annotations` for forward references and PEP 604 union syntax. Without it, type annotations that reference yet-to-be-defined classes fail at import.
`.sdd/backlog/` and `.sdd/metrics/` persist across restarts and are git-friendly. `.sdd/runtime/` contains ephemeral PIDs, logs, and signal files — never commit it. The server flushes tasks to `.sdd/runtime/tasks.jsonl`, but that's only a recovery checkpoint.
```python
task_id = uuid.uuid4().hex[:12]  # "16e2d84f94aa"
```

Don't use full UUIDs for task IDs — the server, backlog filenames, and trace files all expect the short form.
Every `POST /tasks/{id}/complete` or `/fail` increments `task.version`. If two agents try to complete the same task, the second call gets a 409. Build your agent completion code to handle 409 gracefully.
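A minimal sketch of what graceful 409 handling can look like — the helper name and return values here are hypothetical, only the status-code semantics come from the doc:

```python
def handle_complete_response(status_code: int) -> str:
    """Map the status of POST /tasks/{id}/complete to an agent action.

    Illustrative helper: the real completion client lives elsewhere.
    """
    if status_code == 200:
        return "done"
    if status_code == 409:
        # Version conflict: another agent already completed or failed
        # this task. Do not retry -- log it and move to the next task.
        return "already_completed"
    return "error"
```

The key point is that 409 is an expected outcome under concurrent agents, not a failure to retry.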
All adapter `.spawn()` implementations must wrap the CLI command with `build_worker_cmd()` from `adapters/base.py`. This sets the process title and writes the PID metadata file that the orchestrator uses for `bernstein ps` and crash detection.
The project uses pytest-asyncio. Async tests need `@pytest.mark.asyncio`. Async fixtures need `@pytest_asyncio.fixture()` (not plain `@pytest.fixture()`).
Any import used only for type annotations must be under `if TYPE_CHECKING:`. Ruff will flag imports that can be moved there. This is enforced in CI.
Files in `templates/roles/` are Jinja2 templates. The `TemplateRenderer` in `templates/renderer.py` resolves them. When adding a new role, create `templates/roles/<role>.md` and register it in the role catalog.

When an agent starts, the task file moves from `open/` → `claimed/`. On success the automated sync system moves it to `done/`. If you find tasks stuck in `claimed/`, the agent likely crashed — run janitor cleanup or use `bernstein gc`. Note: manual sprint scripts may move completed tickets to `closed/` instead — both directories are checked by cleanup commands.

`rule_enforcer.py` reads `.bernstein/rules.yaml` from the working directory (not `.sdd/`). If the file is absent, enforcement is silently skipped — no error. `error`-severity violations hard-block merge; `warning` violations are soft flags only. Violations are appended to `.sdd/metrics/rule_violations.jsonl`.

`lessons.py` stores lessons in `.sdd/memory/lessons.jsonl`. Retrieval is by tag overlap with the current task — not vector search. Confidence decays exponentially over time. The same lesson filed twice from different agents raises its confidence rather than creating a duplicate.
`prompt_caching.py` deduplicates system prompts by hashing the role prompt + shared context. If you change a role template or context, the cache key changes automatically. Cache hits are logged to `.sdd/caching/`. The `CachingAdapter` wrapper in `adapters/caching_adapter.py` applies this transparently to any adapter.
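A sketch of such a key derivation — the hash algorithm, separator, and key length are assumptions; the doc only says the key is a hash of role prompt + shared context:

```python
import hashlib

def cache_key(role_prompt: str, shared_context: str) -> str:
    """Derive a prompt-cache key from the role prompt and shared context."""
    h = hashlib.sha256()
    h.update(role_prompt.encode("utf-8"))
    h.update(b"\x00")  # separator so ("ab", "c") != ("a", "bc")
    h.update(shared_context.encode("utf-8"))
    return h.hexdigest()[:16]
```

Because the key covers both inputs, editing either a role template or the shared context invalidates the cached entry automatically.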
`core/complexity_advisor.py` inspects task `owned_files` and cross-file dependency scores to choose `ComplexityMode.SINGLE` or `ComplexityMode.MULTI`. Tasks routed to SINGLE skip spawning sub-agents. This fires before the spawner — if you see tasks not fanning out, check the advisor output first.

Never push to or create a branch named `master`. PRs target `main`. The git config enforces this via CI.

When `plan_mode` is enabled in orchestrator config, the planner decomposes goals into PLANNED-status tasks and holds them for human approval via `POST /plans/{id}/approve`. Tasks stay frozen until approved — agents will not pick them up. Approval routes are in `routes/plans.py`.

Event-driven triggers are configured in `.bernstein/triggers.yaml` (not `.sdd/`). The `TriggerManager` evaluates incoming `TriggerEvent` objects against configured rules and creates tasks when rules match. Trigger sources (`trigger_sources/`) normalize raw events (GitHub webhooks, Slack events, file-system changes, generic HTTP webhooks) into `TriggerEvent` before evaluation.

`get_or_build_graph()` persists the code graph to `.sdd/index/codebase.db`. The cache expires after 30 minutes by default. If you need a fresh graph after a large refactor, delete the cache file or call `build_repo_graph()` directly. The graph is used by `semantic_graph.py` for symbol-level context routing.
`router.py` is tier-aware model selection (which model, which tier). `cascade_router.py` is cost-aware cascading (try a cheap model first, escalate on failure/low confidence). They are separate concerns — don't conflate them. `cascade_router.py` wraps `router.py` output.

The circuit breaker monitors agent output for purpose violations. When it fires, it sends a SHUTDOWN signal to the offending agent and marks the task failed. Check `.sdd/runtime/signals/<role>-<session>/SHUTDOWN` if an agent exits unexpectedly.

`graduation.py` stages work through configurable promotion stages (e.g. pilot → staging → production). Stage transitions fire events recorded via `POST /graduation/events`. The graduation routes are at `routes/graduation.py`.

`janitor.py` verifies task completion via concrete signals (file exists, tests pass). `reviewer.py` uses an LLM to review the quality of what was produced and can push corrections back into the queue. Both run post-task, in that order.

`check_loops_and_deadlocks()` in `agent_lifecycle.py` polls file modification times each tick. When the same agent edits the same file more than `LOOP_EDIT_THRESHOLD` times within `LOOP_WINDOW_SECONDS`, the agent is killed. Deadlock detection builds a wait-for graph from `FileLockManager` and breaks cycles by releasing the oldest lock holder.
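The edit-loop half of that check reduces to a sliding-window count; a minimal sketch, with threshold and window values that are illustrative (the real constants live in `agent_lifecycle.py`):

```python
LOOP_EDIT_THRESHOLD = 5      # illustrative value
LOOP_WINDOW_SECONDS = 120.0  # illustrative value

def is_edit_loop(edit_times: list[float], now: float,
                 threshold: int = LOOP_EDIT_THRESHOLD,
                 window_s: float = LOOP_WINDOW_SECONDS) -> bool:
    """True when one agent edited one file > threshold times in the window.

    edit_times are mtime-style timestamps collected each tick.
    """
    recent = [t for t in edit_times if now - t <= window_s]
    return len(recent) > threshold
```

An agent tripping this check is assumed to be stuck rewriting the same file and is killed rather than left to burn tokens.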
`install_pii_filter()` is called in `bootstrap.py` and attaches to the root logger. All log handlers (file, console, structured) receive sanitised messages — emails, phone numbers, SSNs, and credit card numbers are replaced with `[REDACTED]`.
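The mechanism is a standard `logging.Filter` that rewrites the record in place; a minimal sketch with deliberately narrow patterns (the real regexes in `log_redact.py` cover more PII classes):

```python
import logging
import re

# Illustrative patterns only -- far from exhaustive.
_EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

class PiiFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        msg = record.getMessage()
        msg = _EMAIL.sub("[REDACTED]", msg)
        msg = _SSN.sub("[REDACTED]", msg)
        # Replace msg/args so every downstream handler sees sanitised text.
        record.msg, record.args = msg, ()
        return True  # never drop the record, only sanitise it
```

Attaching such a filter to the root logger is what makes the redaction handler-agnostic.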
After task completion, cost data is checked against historical Z-scores. `AnomalySignal.LOG` just logs, `AnomalySignal.PAUSE_SPAWNING` stops new agent spawning, and `AnomalySignal.KILL_AGENT` terminates the expensive agent.
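The underlying arithmetic is a plain Z-score over cost history; a sketch, where the escalation thresholds are assumptions (the doc names the signals but not their cutoffs):

```python
import statistics

def cost_z_score(history: list[float], current: float) -> float:
    """Z-score of the current task cost against historical costs."""
    if len(history) < 2:
        return 0.0  # not enough data to judge
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return 0.0  # flat history: nothing is anomalous
    return (current - mean) / stdev

def anomaly_signal(z: float) -> str:
    """Map z-score to an escalating signal (thresholds are illustrative)."""
    if z > 6:
        return "KILL_AGENT"
    if z > 4:
        return "PAUSE_SPAWNING"
    if z > 2:
        return "LOG"
    return "OK"
```

A task costing roughly double the historical mean on a tight history lands well past the kill threshold in this sketch.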
Bernstein is an open-source project aiming to become the standard orchestrator for AI coding agents. Key competitive advantages to protect:
- Agent-agnostic — works with any CLI agent, not locked to one vendor
- Deterministic orchestrator — scheduling is code, not LLM (predictable, auditable)
- File-based state — `.sdd/` is git-friendly, inspectable, recoverable
- Self-evolving — Bernstein develops itself via `bernstein evolve`
- Enterprise-ready — approval gates, audit trails, cost tracking, compliance
When making decisions, ask: does this make Bernstein more reliable for users who trust it with their codebase? Does this make agents more effective at completing tasks? Does this make the system more observable when things go wrong?
- The orchestrator is deterministic code. No LLM in the scheduling loop.
- Agents are short-lived. No persistent agent processes.
- State lives in `.sdd/` files. No hidden in-memory-only state.
- Every agent runs in a git worktree. Main branch is never dirty.
- Task completion is verified by concrete signals, not trust.
- Git branch is `main`. Never `master`.
- Fixes a real failure mode observed in production
- Improves agent success rate (fewer retries, better prompts)
- Improves observability (better logs, metrics, traces)
- Reduces cost (smarter model selection, caching, batching)
- Reduces time-to-completion (parallelism, fast path, scheduling)
- Has tests proving it works
- Is small enough to review in 5 minutes
- Refactoring that doesn't fix a bug or enable a feature
- Adding abstractions for one caller
- Config options nobody asked for
- "Improving" code style in files you didn't otherwise touch
- Architecture changes without a design doc
- Branch from `main`
- Title: imperative mood ("Add X", "Fix Y", "Refactor Z")
- Run `uv run ruff check src/ && uv run pyright src/ && uv run python scripts/run_tests.py -x` before committing
- One logical change per PR/commit
- Mark task complete on the task server when done:

```bash
curl -s -X POST http://127.0.0.1:8052/tasks/<id>/complete \
  -H "Content-Type: application/json" \
  -d '{"result_summary": "Done: <description>"}'
```
Check `.sdd/backlog/open/` for YAML task specs. Each file has a role, priority, and description. Take tasks matching your role. Use `bernstein status` to see what's running. Prioritize by the `priority` field (1=critical, 2=normal, 3=nice-to-have). Note: ticket filenames use a 0-based prefix (p0/p1/p2/p3/p4) but the task server normalises priority to 1–3 on ingestion.
When picking tasks: prefer tasks where you can make measurable progress in 15-30 minutes. If a task seems too large, decompose it into subtasks. If a task is blocked by another task, skip it and take the next one.