Zero-Latency Switchboard

High-Performance Orchestration Layer for Heterogeneous LLM Agents

Orchestrating agents with near-zero overhead using Speculative Routing, KV-Cache Affinity, and Multi-LoRA Serving.


πŸš€ Overview

The Zero-Latency Switchboard is a middleware layer that solves the critical latency problems in multi-agent LLM workflows:

Problem                       | Solution
Network hops between services | Edge computing with embedded ONNX
Model loading delays          | Speculative pre-loading while user types
Context recomputation         | KV-Cache affinity routing
Multiple GPU deployments      | Multi-LoRA on shared base model

πŸ“ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                         User Input                               β”‚
β”‚                             β”‚                                    β”‚
β”‚                             β–Ό                                    β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚              Tier 1: Edge Gateway                        β”‚    β”‚
β”‚  β”‚         FastAPI + uvloop (10k+ WebSockets)              β”‚    β”‚
β”‚  β”‚                         β”‚                                β”‚    β”‚
β”‚  β”‚    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”          β”‚    β”‚
β”‚  β”‚    β”‚      Tier 2: Speculative Brain          β”‚          β”‚    β”‚
β”‚  β”‚    β”‚    ONNX DistilBERT (<1ms inference)     β”‚          β”‚    β”‚
β”‚  β”‚    β”‚    Predicts intent BEFORE Enter key     β”‚          β”‚    β”‚
β”‚  β”‚    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜          β”‚    β”‚
β”‚  β”‚                         β”‚                                β”‚    β”‚
β”‚  β”‚              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                    β”‚    β”‚
β”‚  β”‚              β”‚    Redis State      β”‚                    β”‚    β”‚
β”‚  β”‚              β”‚  Session Affinity   β”‚                    β”‚    β”‚
β”‚  β”‚              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                    β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚                            β”‚                                     β”‚
β”‚                            β–Ό                                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚              Tier 3: Inference Cluster                   β”‚    β”‚
β”‚  β”‚           SGLang Router (cache_aware policy)            β”‚    β”‚
β”‚  β”‚                         β”‚                                β”‚    β”‚
β”‚  β”‚         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”               β”‚    β”‚
β”‚  β”‚         β–Ό               β–Ό               β–Ό               β”‚    β”‚
β”‚  β”‚    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”           β”‚    β”‚
β”‚  β”‚    β”‚ GPU 1   β”‚    β”‚ GPU 2   β”‚    β”‚ GPU N   β”‚           β”‚    β”‚
β”‚  β”‚    β”‚ Worker  β”‚    β”‚ Worker  β”‚    β”‚ Worker  β”‚           β”‚    β”‚
β”‚  β”‚    β”‚         β”‚    β”‚         β”‚    β”‚         β”‚           β”‚    β”‚
β”‚  β”‚    β”‚ LoRA A  β”‚    β”‚ LoRA B  β”‚    β”‚ LoRA C  β”‚           β”‚    β”‚
β”‚  β”‚    β”‚ LoRA B  β”‚    β”‚ LoRA A  β”‚    β”‚ LoRA A  β”‚           β”‚    β”‚
β”‚  β”‚    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜           β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

✨ Key Features

1. Speculative Routing ("Psychic" Router)

  • Predicts user intent before they finish typing
  • Embedded ONNX model runs in <1ms
  • Pre-loads the predicted LoRA adapter while the user is still typing
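
A minimal sketch of the prediction step, assuming a DistilBERT intent classifier exported to intent.onnx (the file name, input names, and labels below are illustrative, not taken from this repo):

# Hypothetical speculative intent predictor: fast enough to run per keystroke.
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

LABELS = ["chat", "code", "summarize"]  # illustrative adapter intents

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
session = ort.InferenceSession("intent.onnx", providers=["CPUExecutionProvider"])

def predict_intent(partial_text: str) -> tuple[str, float]:
    enc = tokenizer(partial_text, return_tensors="np", truncation=True)
    logits = session.run(None, {"input_ids": enc["input_ids"],
                                "attention_mask": enc["attention_mask"]})[0]
    probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)  # softmax
    idx = int(probs.argmax())
    return LABELS[idx], float(probs[0, idx])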

2. KV-Cache Affinity ("Sticky Minds")

  • Routes returning users to the same GPU
  • Zero context recomputation (0ms context reload)
  • SGLang's cache_aware policy handles routing
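
On the gateway side, the session-to-worker pin can be kept in Redis. A sketch (key names and TTL are illustrative; see src/gateway/state.py for the real implementation):

# Hypothetical session-affinity bookkeeping with redis-py's asyncio client.
import redis.asyncio as redis

r = redis.from_url("redis://localhost:6379", decode_responses=True)

async def get_pinned_worker(client_id: str) -> str | None:
    return await r.get(f"affinity:{client_id}")

async def pin_worker(client_id: str, worker_url: str, ttl: int = 3600) -> None:
    # Pin the session to one worker so its warm KV cache keeps being reused.
    await r.set(f"affinity:{client_id}", worker_url, ex=ttl)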

3. Multi-LoRA Serving ("One Base, Many Faces")

  • Single base model (Llama-3-405B)
  • Hot-swap specialized adapters (<100ms)
  • Massive cost efficiency: one GPU pool serves every agent instead of one deployment per agent
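
With every specialization living in an adapter, choosing "which agent" becomes a per-request routing detail. A request sketch (the /generate payload, in particular the lora_path field, is an assumption about SGLang's HTTP API; verify against its docs):

# Hypothetical per-request adapter selection through the SGLang router.
import httpx

resp = httpx.post(
    "http://localhost:30000/generate",
    json={
        "text": "Refactor this loop into a list comprehension.",
        "lora_path": "code",  # assumed field name for the adapter to apply
    },
)
print(resp.json())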

πŸ› οΈ Project Structure

zero-latency-switchboard/
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ gateway/              # FastAPI Edge Gateway
β”‚   β”‚   β”œβ”€β”€ main.py           # WebSocket handler
β”‚   β”‚   β”œβ”€β”€ config.py         # Configuration
β”‚   β”‚   β”œβ”€β”€ state.py          # Redis session management
β”‚   β”‚   β”œβ”€β”€ sglang_client.py  # SGLang communication
β”‚   β”‚   └── router/           # Speculative routing (ONNX)
β”‚   β”‚       └── __init__.py
β”‚   β”œβ”€β”€ inference/            # SGLang Configuration
β”‚   β”‚   β”œβ”€β”€ adapters.json     # LoRA adapter definitions
β”‚   β”‚   β”œβ”€β”€ router_config.yaml
β”‚   β”‚   β”œβ”€β”€ launch_router.sh
β”‚   β”‚   └── launch_worker.sh
β”‚   β”œβ”€β”€ frontend/             # Test Client
β”‚   β”‚   └── index.html
β”‚   └── mock_sglang/          # Development Mock
β”‚       └── server.py
β”œβ”€β”€ docs/                     # Documentation
β”œβ”€β”€ Dockerfile.gateway        # Gateway container
β”œβ”€β”€ Dockerfile.worker         # Worker container
β”œβ”€β”€ docker-compose.yaml       # Production deployment
β”œβ”€β”€ docker-compose.dev.yaml   # Development deployment
└── requirements.txt

πŸš€ Quick Start

Development Mode (No GPU Required)

# 1. Clone and navigate
cd zero-latency-switchboard

# 2. Start development stack
docker-compose -f docker-compose.dev.yaml up -d

# 3. Open test client
open http://localhost:3000

# 4. View gateway logs
docker logs -f zls-gateway-dev

Production Mode (With GPUs)

# 1. Prepare model and adapters
# Download Llama-3-405B to ./models/
# Place LoRA adapters in ./adapters/

# 2. Start full stack
docker-compose up -d

# 3. Monitor
docker-compose logs -f

Local Development (Python)

# 1. Create virtual environment
python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows

# 2. Install dependencies
pip install -r requirements.txt

# 3. Start Redis
docker run -d -p 6379:6379 redis:alpine

# 4. Run gateway
cd src
python -m uvicorn gateway.main:app --reload --port 8000

# 5. Open frontend
# Open src/frontend/index.html in browser

πŸ“‘ API Reference

WebSocket Endpoint

WS /ws/{client_id}

Messages (Client β†’ Server):

  • Text: Partial input for speculative routing
  • __submit__: Trigger generation
  • __ping__: Heartbeat

Messages (Server β†’ Client):

{"type": "intent_change", "old": "chat", "new": "code", "confidence": 0.92}
{"type": "generation_start", "adapter": "code"}
{"type": "token", "content": "def "}
{"type": "generation_complete"}
{"type": "error", "message": "..."}

REST Endpoints

GET /health           # Health check
POST /predict?text=   # Debug intent prediction
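
A quick smoke test of both endpoints (assuming the gateway from the Quick Start on port 8000):

# Hit the health check, then the debug intent predictor.
import requests

print(requests.get("http://localhost:8000/health").json())
print(requests.post("http://localhost:8000/predict",
                    params={"text": "write a sorting function"}).json())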

βš™οΈ Configuration

Environment variables for the gateway:

Variable                  | Default                | Description
GATEWAY_REDIS_URL         | redis://localhost:6379 | Redis connection
GATEWAY_SGLANG_ROUTER_URL | http://localhost:30000 | SGLang router
GATEWAY_LOG_LEVEL         | INFO                   | Logging level
GATEWAY_DEBUG             | false                  | Debug mode
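
One plausible shape for src/gateway/config.py that reads these variables, assuming pydantic-settings (the actual module may differ):

# Field names map to the GATEWAY_* variables via the env prefix.
from pydantic_settings import BaseSettings, SettingsConfigDict

class GatewaySettings(BaseSettings):
    model_config = SettingsConfigDict(env_prefix="GATEWAY_")

    redis_url: str = "redis://localhost:6379"
    sglang_router_url: str = "http://localhost:30000"
    log_level: str = "INFO"
    debug: bool = False

settings = GatewaySettings()  # e.g. GATEWAY_DEBUG=true flips settings.debug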

πŸ”¬ How It Works

Speculative Routing Flow

1. User starts typing: "Write a Python function..."
                                    β”‚
2. On every space character:        β”‚
   └─► Gateway runs ONNX prediction (<1ms)
       └─► Intent: "code" (87% confidence)
                                    β”‚
3. Confidence > 75%?                β”‚
   └─► YES: Update Redis state      β”‚
       └─► Fire-and-forget: POST /load_adapter to SGLang
                                    β”‚
4. User hits Enter                  β”‚
   └─► Forward to SGLang (adapter already loaded!)
       └─► Stream response tokens back
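
Step 3's fire-and-forget preload might look like this in the gateway (the /load_adapter endpoint comes from the flow above, but the payload shape is an assumption; check SGLang's docs):

# Preload the predicted adapter without blocking the WebSocket loop.
# Must be called from within the gateway's running event loop.
import asyncio

import httpx

ROUTER_URL = "http://localhost:30000"
CONFIDENCE_THRESHOLD = 0.75

def preload_adapter(client: httpx.AsyncClient, adapter: str, confidence: float) -> None:
    if confidence < CONFIDENCE_THRESHOLD:
        return  # not confident enough; keep the current adapter
    # Fire-and-forget: schedule the POST and return to handling keystrokes.
    asyncio.create_task(
        client.post(f"{ROUTER_URL}/load_adapter", json={"lora_name": adapter})
    )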

KV-Cache Affinity

The critical insight: LoRA adapters must NOT modify K/V projections.

# βœ… GOOD: Apply LoRA to these (KV cache shareable)
target_modules = ["o_proj", "up_proj", "down_proj"]

# ❌ BAD: Avoid these (breaks KV cache sharing)
avoid_modules = ["k_proj", "v_proj"]

This allows all agents to share the same KV cache, enabling instant switching.
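
At fine-tuning time this constraint is a one-line choice in the LoRA config. A sketch with Hugging Face peft (hyperparameters are illustrative):

# K/V projections stay untouched, so every adapter shares one KV cache.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["o_proj", "up_proj", "down_proj"],  # never k_proj/v_proj
    task_type="CAUSAL_LM",
)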

πŸ“Š Performance Targets

Metric                | Target  | Description
Speculative Inference | <1ms    | ONNX on CPU
Adapter Pre-load      | <100ms  | Before user hits Enter
Context Reload        | 0ms     | With KV affinity
Time-to-First-Token   | <50ms   | After submission
Concurrent Users      | 10,000+ | Per gateway instance

πŸ—ΊοΈ Roadmap

  • Phase 1: Infrastructure & Gateway
  • Phase 2: Speculative Intelligence
  • Phase 3: SGLang Configuration
  • Phase 4: End-to-End Integration Testing
  • Phase 5: Kubernetes Deployment
  • Phase 6: Monitoring & Observability

πŸ“„ License

MIT License - See LICENSE for details.


Built for the future of agentic AI.
