# High-Performance Orchestration Layer for Heterogeneous LLM Agents

*Orchestrating agents with near-zero overhead using Speculative Routing, KV-Cache Affinity, and Multi-LoRA Serving.*
The Zero-Latency Switchboard is a middleware layer that solves the critical latency problems in multi-agent LLM workflows:
| Problem | Solution |
|---|---|
| Network hops between services | Edge computing with embedded ONNX |
| Model loading delays | Speculative pre-loading while user types |
| Context recomputation | KV-Cache affinity routing |
| Multiple GPU deployments | Multi-LoRA on shared base model |
## Architecture

```
┌────────────────────────────────────────────┐
│ User Input                                 │
└─────────────────────┬──────────────────────┘
                      ▼
┌────────────────────────────────────────────┐
│ Tier 1: Edge Gateway                       │
│ FastAPI + uvloop (10k+ WebSockets)         │
│                                            │
│   ┌───────────────────────────────────┐    │
│   │ Tier 2: Speculative Brain         │    │
│   │ ONNX DistilBERT (<1ms inference)  │    │
│   │ Predicts intent BEFORE Enter key  │    │
│   └─────────────────┬─────────────────┘    │
│                     │                      │
│   ┌─────────────────┴─────────────────┐    │
│   │ Redis State                       │    │
│   │ Session Affinity                  │    │
│   └───────────────────────────────────┘    │
└─────────────────────┬──────────────────────┘
                      ▼
┌────────────────────────────────────────────┐
│ Tier 3: Inference Cluster                  │
│ SGLang Router (cache_aware policy)         │
│                     │                      │
│       ┌─────────────┼─────────────┐        │
│       ▼             ▼             ▼        │
│  ┌─────────┐   ┌─────────┐   ┌─────────┐   │
│  │ GPU 1   │   │ GPU 2   │   │ GPU N   │   │
│  │ Worker  │   │ Worker  │   │ Worker  │   │
│  │         │   │         │   │         │   │
│  │ LoRA A  │   │ LoRA B  │   │ LoRA C  │   │
│  │ LoRA B  │   │ LoRA A  │   │ LoRA A  │   │
│  └─────────┘   └─────────┘   └─────────┘   │
└────────────────────────────────────────────┘
```
### Speculative Routing

- Predicts user intent before they finish typing
- Embedded ONNX model runs in <1ms
- Pre-loads the correct LoRA adapter while the user composes

### KV-Cache Affinity

- Routes returning users to the same GPU
- Zero context recomputation (0s time-to-first-token)
- SGLang's `cache_aware` policy handles routing

### Multi-LoRA Serving

- Single base model (Llama-3-405B)
- Hot-swap specialized adapters (<100ms)
- Massive cost efficiency
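Session affinity can be reduced to a deterministic mapping from session ID to worker, so a returning user always lands on the GPU that already holds their KV cache. The sketch below is an illustration, not the project's code; `pick_worker` and the worker URLs are hypothetical, and SGLang's real `cache_aware` policy does smarter prefix-cache matching than a plain hash.

```python
import hashlib

def pick_worker(client_id: str, workers: list) -> str:
    """Map a session to a worker deterministically so repeat requests
    from the same client hit the GPU that already holds their KV cache."""
    if not workers:
        raise ValueError("no workers available")
    # Stable hash: the same client_id always yields the same index,
    # unlike Python's built-in hash(), which is salted per process.
    digest = hashlib.sha256(client_id.encode()).hexdigest()
    return workers[int(digest, 16) % len(workers)]

workers = ["http://gpu1:30001", "http://gpu2:30002", "http://gpu3:30003"]
assert pick_worker("alice", workers) == pick_worker("alice", workers)
```

The trade-off of hash affinity is load imbalance under skewed traffic, which is one reason a cache-aware router that can fall back to the least-loaded worker is preferable in production.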
## Project Structure

```
zero-latency-switchboard/
├── src/
│   ├── gateway/               # FastAPI Edge Gateway
│   │   ├── main.py            # WebSocket handler
│   │   ├── config.py          # Configuration
│   │   ├── state.py           # Redis session management
│   │   ├── sglang_client.py   # SGLang communication
│   │   └── router/            # Speculative routing (ONNX)
│   │       └── __init__.py
│   ├── inference/             # SGLang Configuration
│   │   ├── adapters.json      # LoRA adapter definitions
│   │   ├── router_config.yaml
│   │   ├── launch_router.sh
│   │   └── launch_worker.sh
│   ├── frontend/              # Test Client
│   │   └── index.html
│   └── mock_sglang/           # Development Mock
│       └── server.py
├── docs/                      # Documentation
├── Dockerfile.gateway         # Gateway container
├── Dockerfile.worker          # Worker container
├── docker-compose.yaml        # Production deployment
├── docker-compose.dev.yaml    # Development deployment
└── requirements.txt
```
## Quick Start

### Development

```shell
# 1. Clone and navigate
cd zero-latency-switchboard

# 2. Start development stack
docker-compose -f docker-compose.dev.yaml up -d

# 3. Open test client
open http://localhost:3000

# 4. View gateway logs
docker logs -f zls-gateway-dev
```

### Production

```shell
# 1. Prepare model and adapters
# Download Llama-3-405B to ./models/
# Place LoRA adapters in ./adapters/

# 2. Start full stack
docker-compose up -d

# 3. Monitor
docker-compose logs -f
```

### Local development

```shell
# 1. Create virtual environment
python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows

# 2. Install dependencies
pip install -r requirements-dev.txt

# 3. Start Redis
docker run -d -p 6379:6379 redis:alpine

# 4. Run gateway
cd src
python -m uvicorn gateway.main:app --reload --port 8000

# 5. Open frontend
# Open src/frontend/index.html in browser
```

## API

### WebSocket: `/ws/{client_id}`

Messages (Client → Server):

- Plain text: partial input for speculative routing
- `__submit__`: trigger generation
- `__ping__`: heartbeat

Messages (Server → Client):

```json
{"type": "intent_change", "old": "chat", "new": "code", "confidence": 0.92}
{"type": "generation_start", "adapter": "code"}
{"type": "token", "content": "def "}
{"type": "generation_complete"}
{"type": "error", "message": "..."}
```

### HTTP

```
GET  /health          # Health check
POST /predict?text=   # Debug intent prediction
```
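A client consuming this WebSocket stream mainly has to dispatch on the `type` field of each server frame and accumulate `token` messages. A minimal, dependency-free sketch of that client-side handling (the `handle_frame` helper and `state` dict are illustrative, not part of the codebase):

```python
import json

def handle_frame(frame: str, state: dict) -> None:
    """Dispatch one server->client JSON frame from the /ws stream,
    accumulating streamed tokens into state["text"]."""
    msg = json.loads(frame)
    kind = msg["type"]
    if kind == "intent_change":
        state["intent"] = msg["new"]
    elif kind == "generation_start":
        state["adapter"] = msg["adapter"]
        state["text"] = ""          # fresh buffer for this generation
    elif kind == "token":
        state["text"] = state.get("text", "") + msg["content"]
    elif kind == "error":
        state["error"] = msg["message"]
    # "generation_complete" carries no payload; nothing to record

state = {}
for frame in [
    '{"type": "intent_change", "old": "chat", "new": "code", "confidence": 0.92}',
    '{"type": "generation_start", "adapter": "code"}',
    '{"type": "token", "content": "def "}',
    '{"type": "token", "content": "add"}',
    '{"type": "generation_complete"}',
]:
    handle_frame(frame, state)

assert state["intent"] == "code" and state["text"] == "def add"
```

In a real client the frames would arrive over a `websockets` or browser WebSocket connection rather than a list, but the dispatch logic is the same.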
## Configuration

Environment variables for the gateway:

| Variable | Default | Description |
|---|---|---|
| `GATEWAY_REDIS_URL` | `redis://localhost:6379` | Redis connection |
| `GATEWAY_SGLANG_ROUTER_URL` | `http://localhost:30000` | SGLang router |
| `GATEWAY_LOG_LEVEL` | `INFO` | Logging level |
| `GATEWAY_DEBUG` | `false` | Debug mode |
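Resolving these variables with the documented defaults can be done with the standard library alone. This is a sketch under assumptions: the actual `src/gateway/config.py` may use a settings library such as pydantic, and the `GatewaySettings` class here is hypothetical.

```python
import os
from dataclasses import dataclass

@dataclass
class GatewaySettings:
    """Gateway configuration resolved from GATEWAY_* environment
    variables, falling back to the documented defaults."""
    redis_url: str = "redis://localhost:6379"
    sglang_router_url: str = "http://localhost:30000"
    log_level: str = "INFO"
    debug: bool = False

    @classmethod
    def from_env(cls) -> "GatewaySettings":
        return cls(
            redis_url=os.getenv("GATEWAY_REDIS_URL", cls.redis_url),
            sglang_router_url=os.getenv("GATEWAY_SGLANG_ROUTER_URL",
                                        cls.sglang_router_url),
            log_level=os.getenv("GATEWAY_LOG_LEVEL", cls.log_level),
            # Only the literal string "true" (any case) enables debug mode.
            debug=os.getenv("GATEWAY_DEBUG", "false").lower() == "true",
        )

os.environ["GATEWAY_LOG_LEVEL"] = "DEBUG"
settings = GatewaySettings.from_env()
assert settings.log_level == "DEBUG"
assert settings.redis_url == "redis://localhost:6379"
```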
## Request Flow

```
1. User starts typing: "Write a Python function..."
         │
2. On every space character:
         ├──▶ Gateway runs ONNX prediction (<1ms)
         └──▶ Intent: "code" (87% confidence)
         │
3. Confidence > 75%?
         ├──▶ YES: Update Redis state
         └──▶ Fire-and-forget: POST /load_adapter to SGLang
         │
4. User hits Enter
         ├──▶ Forward to SGLang (adapter already loaded!)
         └──▶ Stream response tokens back
```
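The gate at step 3 boils down to one decision: pre-load an adapter only when the prediction is confident and actually changes the current intent, so low-confidence guesses and repeat predictions don't thrash adapters. A minimal sketch of that rule; `should_preload` and its signature are illustrative, not the gateway's actual API.

```python
CONFIDENCE_THRESHOLD = 0.75  # the flow above pre-loads only above 75%

def should_preload(intent, confidence, current_intent):
    """Decide whether to fire-and-forget an adapter pre-load:
    only on a confident prediction that changes the current intent."""
    return confidence > CONFIDENCE_THRESHOLD and intent != current_intent

# "Write a Python function..." -> classifier says "code" at 0.87
assert should_preload("code", 0.87, "chat") is True
# Same intent predicted again: adapter is already loaded, do nothing
assert should_preload("code", 0.91, "code") is False
# Low confidence: don't thrash adapters on a weak guess
assert should_preload("sql", 0.40, "chat") is False
```

Because the pre-load is fire-and-forget, a wrong guess costs only a wasted adapter load; the authoritative routing decision still happens when the user hits Enter.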
The critical insight: LoRA adapters must NOT modify K/V projections.

```python
# ✅ GOOD: Apply LoRA to these (KV cache shareable)
target_modules = ["o_proj", "up_proj", "down_proj"]

# ❌ BAD: Avoid these (breaks KV cache sharing)
avoid_modules = ["k_proj", "v_proj"]
```

This allows all agents to share the same KV cache, enabling instant switching.
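The reasoning: the KV cache stores the outputs of `k_proj` and `v_proj`, so an adapter that alters those projections would make its cached keys and values diverge from the shared base model's, forcing a recompute on every adapter switch. A guard like the following could enforce the rule when registering adapters; `validate_adapter_targets` is a hypothetical helper, not part of the codebase.

```python
# Projections that feed the KV cache; a LoRA delta on these would make
# each adapter's cached keys/values diverge from the shared base model's.
KV_MODULES = {"k_proj", "v_proj"}

def validate_adapter_targets(target_modules):
    """Raise if an adapter's LoRA targets would break KV-cache sharing."""
    offending = sorted(KV_MODULES.intersection(target_modules))
    if offending:
        raise ValueError(f"adapter modifies KV projections: {offending}")
    return target_modules

validate_adapter_targets(["o_proj", "up_proj", "down_proj"])  # accepted
try:
    validate_adapter_targets(["q_proj", "k_proj"])
except ValueError:
    pass  # rejected: k_proj would invalidate the shared cache
```

Note that `q_proj` is safe to adapt: queries are recomputed for every token and never cached, so changing them leaves the shared KV cache valid.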
## Performance Targets

| Metric | Target | Description |
|---|---|---|
| Speculative Inference | <1ms | ONNX on CPU |
| Adapter Pre-load | <100ms | Before user hits Enter |
| Context Reload | 0ms | With KV affinity |
| Time-to-First-Token | <50ms | After submission |
| Concurrent Users | 10,000+ | Per gateway instance |
## Roadmap

- Phase 1: Infrastructure & Gateway
- Phase 2: Speculative Intelligence
- Phase 3: SGLang Configuration
- Phase 4: End-to-End Integration Testing
- Phase 5: Kubernetes Deployment
- Phase 6: Monitoring & Observability
MIT License - See LICENSE for details.
Built for the future of agentic AI.