Zero-Latency Switchboard

High-Performance Orchestration Layer for Heterogeneous LLM Agents

Orchestrating agents with near-zero overhead using Speculative Routing, KV-Cache Affinity, and Multi-LoRA Serving.


πŸš€ Overview

The Zero-Latency Switchboard is a middleware layer that solves the critical latency problems in multi-agent LLM workflows:

Problem                       | Solution
Network hops between services | Edge computing with embedded ONNX
Model loading delays          | Speculative pre-loading while user types
Context recomputation         | KV-Cache affinity routing
Multiple GPU deployments      | Multi-LoRA on shared base model

πŸ“ Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                         User Input                               β”‚
β”‚                             β”‚                                    β”‚
β”‚                             β–Ό                                    β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚              Tier 1: Edge Gateway                        β”‚    β”‚
β”‚  β”‚         FastAPI + uvloop (10k+ WebSockets)              β”‚    β”‚
β”‚  β”‚                         β”‚                                β”‚    β”‚
β”‚  β”‚    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”          β”‚    β”‚
β”‚  β”‚    β”‚      Tier 2: Speculative Brain          β”‚          β”‚    β”‚
β”‚  β”‚    β”‚    ONNX DistilBERT (<1ms inference)     β”‚          β”‚    β”‚
β”‚  β”‚    β”‚    Predicts intent BEFORE Enter key     β”‚          β”‚    β”‚
β”‚  β”‚    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜          β”‚    β”‚
β”‚  β”‚                         β”‚                                β”‚    β”‚
β”‚  β”‚              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                    β”‚    β”‚
β”‚  β”‚              β”‚    Redis State      β”‚                    β”‚    β”‚
β”‚  β”‚              β”‚  Session Affinity   β”‚                    β”‚    β”‚
β”‚  β”‚              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                    β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚                            β”‚                                     β”‚
β”‚                            β–Ό                                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚  β”‚              Tier 3: Inference Cluster                   β”‚    β”‚
β”‚  β”‚           SGLang Router (cache_aware policy)            β”‚    β”‚
β”‚  β”‚                         β”‚                                β”‚    β”‚
β”‚  β”‚         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”               β”‚    β”‚
β”‚  β”‚         β–Ό               β–Ό               β–Ό               β”‚    β”‚
β”‚  β”‚    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”           β”‚    β”‚
β”‚  β”‚    β”‚ GPU 1   β”‚    β”‚ GPU 2   β”‚    β”‚ GPU N   β”‚           β”‚    β”‚
β”‚  β”‚    β”‚ Worker  β”‚    β”‚ Worker  β”‚    β”‚ Worker  β”‚           β”‚    β”‚
β”‚  β”‚    β”‚         β”‚    β”‚         β”‚    β”‚         β”‚           β”‚    β”‚
β”‚  β”‚    β”‚ LoRA A  β”‚    β”‚ LoRA B  β”‚    β”‚ LoRA C  β”‚           β”‚    β”‚
β”‚  β”‚    β”‚ LoRA B  β”‚    β”‚ LoRA A  β”‚    β”‚ LoRA A  β”‚           β”‚    β”‚
β”‚  β”‚    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜           β”‚    β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

✨ Key Features

1. Speculative Routing ("Psychic" Router)

  • Predicts user intent before they finish typing
  • Embedded ONNX model runs in <1ms
  • Pre-loads the predicted LoRA adapter while the user is still typing
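
A minimal sketch of the prediction step, assuming a DistilBERT intent classifier exported to intent.onnx (the file name, input names, and labels below are illustrative, not taken from this repo):

# Hypothetical speculative intent predictor: fast enough to run per keystroke.
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

LABELS = ["chat", "code", "summarize"]  # illustrative adapter intents

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
session = ort.InferenceSession("intent.onnx", providers=["CPUExecutionProvider"])

def predict_intent(partial_text: str) -> tuple[str, float]:
    enc = tokenizer(partial_text, return_tensors="np", truncation=True)
    logits = session.run(None, {"input_ids": enc["input_ids"],
                                "attention_mask": enc["attention_mask"]})[0]
    probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)  # softmax
    idx = int(probs.argmax())
    return LABELS[idx], float(probs[0, idx])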

2. KV-Cache Affinity ("Sticky Minds")

  • Routes returning users to the same GPU
  • Zero context recomputation (0ms context reload)
  • SGLang's cache_aware policy handles routing
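
On the gateway side, the session-to-worker pin can be kept in Redis. A sketch (key names and TTL are illustrative; see src/gateway/state.py for the real implementation):

# Hypothetical session-affinity bookkeeping with redis-py's asyncio client.
import redis.asyncio as redis

r = redis.from_url("redis://localhost:6379", decode_responses=True)

async def get_pinned_worker(client_id: str) -> str | None:
    return await r.get(f"affinity:{client_id}")

async def pin_worker(client_id: str, worker_url: str, ttl: int = 3600) -> None:
    # Pin the session to one worker so its warm KV cache keeps being reused.
    await r.set(f"affinity:{client_id}", worker_url, ex=ttl)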

3. Multi-LoRA Serving ("One Base, Many Faces")

  • Single base model (Llama-3-405B)
  • Hot-swap specialized adapters (<100ms)
  • Massive cost efficiency: one GPU pool serves every agent instead of one deployment per agent
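
With every specialization living in an adapter, choosing "which agent" becomes a per-request routing detail. A request sketch (the /generate payload, in particular the lora_path field, is an assumption about SGLang's HTTP API; verify against its docs):

# Hypothetical per-request adapter selection through the SGLang router.
import httpx

resp = httpx.post(
    "http://localhost:30000/generate",
    json={
        "text": "Refactor this loop into a list comprehension.",
        "lora_path": "code",  # assumed field name for the adapter to apply
    },
)
print(resp.json())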

πŸ› οΈ Project Structure

zero-latency-switchboard/
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ gateway/              # FastAPI Edge Gateway
β”‚   β”‚   β”œβ”€β”€ main.py           # WebSocket handler
β”‚   β”‚   β”œβ”€β”€ config.py         # Configuration
β”‚   β”‚   β”œβ”€β”€ state.py          # Redis session management
β”‚   β”‚   β”œβ”€β”€ sglang_client.py  # SGLang communication
β”‚   β”‚   └── router/           # Speculative routing (ONNX)
β”‚   β”‚       └── __init__.py
β”‚   β”œβ”€β”€ inference/            # SGLang Configuration
β”‚   β”‚   β”œβ”€β”€ adapters.json     # LoRA adapter definitions
β”‚   β”‚   β”œβ”€β”€ router_config.yaml
β”‚   β”‚   β”œβ”€β”€ launch_router.sh
β”‚   β”‚   └── launch_worker.sh
β”‚   β”œβ”€β”€ frontend/             # Test Client
β”‚   β”‚   └── index.html
β”‚   └── mock_sglang/          # Development Mock
β”‚       └── server.py
β”œβ”€β”€ docs/                     # Documentation
β”œβ”€β”€ Dockerfile.gateway        # Gateway container
β”œβ”€β”€ Dockerfile.worker         # Worker container
β”œβ”€β”€ docker-compose.yaml       # Production deployment
β”œβ”€β”€ docker-compose.dev.yaml   # Development deployment
└── requirements.txt

πŸš€ Quick Start

Development Mode (No GPU Required)

# 1. Clone and navigate
cd zero-latency-switchboard

# 2. Start development stack
docker-compose -f docker-compose.dev.yaml up -d

# 3. Open test client
open http://localhost:3000

# 4. View gateway logs
docker logs -f zls-gateway-dev

Production Mode (With GPUs)

# 1. Prepare model and adapters
# Download Llama-3-405B to ./models/
# Place LoRA adapters in ./adapters/

# 2. Start full stack
docker-compose up -d

# 3. Monitor
docker-compose logs -f

Local Development (Python)

# 1. Create virtual environment
python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows

# 2. Install dependencies
pip install -r requirements.txt

# 3. Start Redis
docker run -d -p 6379:6379 redis:alpine

# 4. Run gateway
cd src
python -m uvicorn gateway.main:app --reload --port 8000

# 5. Open frontend
# Open src/frontend/index.html in browser

πŸ“‘ API Reference

WebSocket Endpoint

WS /ws/{client_id}

Messages (Client β†’ Server):

  • Text: Partial input for speculative routing
  • __submit__: Trigger generation
  • __ping__: Heartbeat

Messages (Server β†’ Client):

{"type": "intent_change", "old": "chat", "new": "code", "confidence": 0.92}
{"type": "generation_start", "adapter": "code"}
{"type": "token", "content": "def "}
{"type": "generation_complete"}
{"type": "error", "message": "..."}

REST Endpoints

GET /health           # Health check
POST /predict?text=   # Debug intent prediction
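
A quick smoke test of both endpoints (assuming the gateway from the Quick Start on port 8000):

# Hit the health check, then the debug intent predictor.
import requests

print(requests.get("http://localhost:8000/health").json())
print(requests.post("http://localhost:8000/predict",
                    params={"text": "write a sorting function"}).json())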

βš™οΈ Configuration

Environment variables for the gateway:

Variable                  | Default                | Description
GATEWAY_REDIS_URL         | redis://localhost:6379 | Redis connection
GATEWAY_SGLANG_ROUTER_URL | http://localhost:30000 | SGLang router
GATEWAY_LOG_LEVEL         | INFO                   | Logging level
GATEWAY_DEBUG             | false                  | Debug mode
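
One plausible shape for src/gateway/config.py that reads these variables, assuming pydantic-settings (the actual module may differ):

# Field names map to the GATEWAY_* variables via the env prefix.
from pydantic_settings import BaseSettings, SettingsConfigDict

class GatewaySettings(BaseSettings):
    model_config = SettingsConfigDict(env_prefix="GATEWAY_")

    redis_url: str = "redis://localhost:6379"
    sglang_router_url: str = "http://localhost:30000"
    log_level: str = "INFO"
    debug: bool = False

settings = GatewaySettings()  # e.g. GATEWAY_DEBUG=true flips settings.debug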

πŸ”¬ How It Works

Speculative Routing Flow

1. User starts typing: "Write a Python function..."
                                    β”‚
2. On every space character:        β”‚
   └─► Gateway runs ONNX prediction (<1ms)
       └─► Intent: "code" (87% confidence)
                                    β”‚
3. Confidence > 75%?                β”‚
   └─► YES: Update Redis state      β”‚
       └─► Fire-and-forget: POST /load_adapter to SGLang
                                    β”‚
4. User hits Enter                  β”‚
   └─► Forward to SGLang (adapter already loaded!)
       └─► Stream response tokens back
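
Step 3's fire-and-forget preload might look like this in the gateway (the /load_adapter endpoint comes from the flow above, but the payload shape is an assumption; check SGLang's docs):

# Preload the predicted adapter without blocking the WebSocket loop.
# Must be called from within the gateway's running event loop.
import asyncio

import httpx

ROUTER_URL = "http://localhost:30000"
CONFIDENCE_THRESHOLD = 0.75

def preload_adapter(client: httpx.AsyncClient, adapter: str, confidence: float) -> None:
    if confidence < CONFIDENCE_THRESHOLD:
        return  # not confident enough; keep the current adapter
    # Fire-and-forget: schedule the POST and return to handling keystrokes.
    asyncio.create_task(
        client.post(f"{ROUTER_URL}/load_adapter", json={"lora_name": adapter})
    )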

KV-Cache Affinity

The critical insight: LoRA adapters must NOT modify K/V projections.

# βœ… GOOD: Apply LoRA to these (KV cache shareable)
target_modules = ["o_proj", "up_proj", "down_proj"]

# ❌ BAD: Avoid these (breaks KV cache sharing)
avoid_modules = ["k_proj", "v_proj"]

This allows all agents to share the same KV cache, enabling instant switching.
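
At fine-tuning time this constraint is a one-line choice in the LoRA config. A sketch with Hugging Face peft (hyperparameters are illustrative):

# K/V projections stay untouched, so every adapter shares one KV cache.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["o_proj", "up_proj", "down_proj"],  # never k_proj/v_proj
    task_type="CAUSAL_LM",
)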

πŸ“Š Performance Targets

Metric                | Target  | Description
Speculative Inference | <1ms    | ONNX on CPU
Adapter Pre-load      | <100ms  | Before user hits Enter
Context Reload        | 0ms     | With KV affinity
Time-to-First-Token   | <50ms   | After submission
Concurrent Users      | 10,000+ | Per gateway instance

πŸ—ΊοΈ Roadmap

  • Phase 1: Infrastructure & Gateway
  • Phase 2: Speculative Intelligence
  • Phase 3: SGLang Configuration
  • Phase 4: End-to-End Integration Testing
  • Phase 5: Kubernetes Deployment
  • Phase 6: Monitoring & Observability

πŸ“„ License

MIT License - See LICENSE for details.


Built for the future of agentic AI.
