High-performance LLM inference runtime with a monolithic architecture, targeting 2000+ tok/s throughput.
All components are integrated into a single `nanovllm.py` module:
- Model loading (HuggingFace transformers)
- KV cache management
- Token generation with sampling
- Batch processing
- CUDA optimizations
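As a rough illustration of the "token generation with sampling" component, here is a minimal, dependency-free sketch of temperature and top-k sampling over raw logits. This is not the actual `nanovllm.py` code; the function name `sample_token` and its signature are illustrative.

```python
import math
import random

def sample_token(logits, temperature=1.0, top_k=0, rng=random):
    """Sample a token id from raw logits with temperature and optional top-k.

    temperature <= 0 falls back to greedy (argmax) decoding.
    top_k == 0 means no top-k truncation.
    """
    if temperature <= 0:  # greedy decoding
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    if top_k > 0:  # keep only the k highest-scoring tokens
        kept = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    else:
        kept = list(range(len(scaled)))
    m = max(scaled[i] for i in kept)
    weights = [math.exp(scaled[i] - m) for i in kept]  # numerically stable softmax
    total = sum(weights)
    r = rng.random() * total
    acc = 0.0
    for idx, w in zip(kept, weights):
        acc += w
        if r <= acc:
            return idx
    return kept[-1]  # guard against floating-point rounding
```

In a real runtime the same logic runs on GPU tensors (e.g. `torch.multinomial` over a softmaxed logits row) rather than Python lists, but the control flow is the same.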
Install dependencies and run the benchmark:

```shell
pip install -r requirements.txt
python bench.py
```

The benchmark expects the model at ~/huggingface/Qwen3-0.6B/.
- Goal: 2000 tok/s throughput
- Test: 256 sequences, random input/output lengths up to 1024 tokens
- Model: Qwen3-0.6B
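The benchmark workload above can be sketched as follows. This is a hypothetical reconstruction of what `bench.py` measures, not its actual contents; `make_workload` and `throughput` are illustrative names.

```python
import random

def make_workload(num_seqs=256, max_len=1024, seed=0):
    """Random (input_len, output_len) pairs: 256 sequences, each length up to 1024 tokens."""
    rng = random.Random(seed)
    return [(rng.randint(1, max_len), rng.randint(1, max_len)) for _ in range(num_seqs)]

def throughput(total_output_tokens, elapsed_s):
    """Generation throughput in tokens per second."""
    return total_output_tokens / elapsed_s

workload = make_workload()
out_tokens = sum(out_len for _, out_len in workload)
# Hitting the 2000 tok/s goal means generating `out_tokens` tokens
# in at most out_tokens / 2000 seconds of wall-clock time.
target_seconds = out_tokens / 2000
```

With ~256 sequences averaging ~512 output tokens each (~130k tokens total), the 2000 tok/s goal corresponds to finishing the whole run in roughly a minute.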
- Integrated KV cache (no separate module overhead)
- Batch processing with past_key_values reuse
- CUDA optimizations (Flash Attention, TF32)
- Minimal abstraction layers
- Direct tensor operations
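The core of "batch processing with past_key_values reuse" is that each decode step appends one key/value pair per layer to a cache, so the model only runs a forward pass over the newest token instead of re-encoding the whole prefix. A toy, framework-free sketch of that cache behavior (the class `KVCache` and the tuple "tensors" are stand-ins, not the real per-layer GPU tensors):

```python
class KVCache:
    """Toy per-sequence KV cache: one (key, value) entry is appended per
    generated token, mirroring how past_key_values grows during decoding."""

    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def __len__(self):
        return len(self.keys)

def decode_step(cache, new_token):
    # Stand-ins for the real projected key/value tensors of new_token.
    k, v = ("k", new_token), ("v", new_token)
    cache.append(k, v)
    # Attention at this step reads the full cache (all previous tokens plus
    # the new one) but only computed projections for the single new token.
    return len(cache)
```

In the real runtime the equivalent is passing the `past_key_values` returned by one HuggingFace forward call into the next, with TF32 and Flash Attention enabled on the CUDA side to speed up the matmuls and attention kernels themselves.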
The code is designed to run on the GCP VM (researchvm-ubuntu) which has all dependencies pre-installed.