A high-performance C++ and Python implementation for exporting Qwen language models to ONNX format and running optimized inference with ONNX Runtime.
This project provides tools to:
- Export Qwen models (Qwen3-1.7B, Qwen3-8B) from Hugging Face format to optimized ONNX format
- Run inference using ONNX Runtime in both Python and C++
- Achieve fast token generation with CPU/CUDA acceleration
- Test model outputs with structured evaluation datasets
- ONNX Export: Convert Qwen models to ONNX with optimized graph transformations
- Dual Implementation: Both Python and C++ inference engines
- Performance Optimized: Configured ONNX Runtime sessions with multiple optimization flags
- KV-Cache Support: Efficient autoregressive generation with past key-value caching
- Tokenizer Integration: Uses HuggingFace tokenizers (Python) and tokenizers-cpp (C++)
- Quantization Ready: Embedding layer quantization to uint8 for reduced memory footprint
- Flexible Configuration: JSON-based configuration for paths and model parameters
.
├── src/ # Source code
│ ├── onnx_inference.cpp # C++ inference engine
│ ├── onnx_inference.py # Python inference engine
│ ├── exporter.py # Model export script (PyTorch → ONNX)
│ └── __init__.py # Package initialization
├── scripts/ # Utility scripts
│ ├── run_gpu_inference.sh # GPU inference wrapper
│ ├── test_inference.sh # Integration tests
│ ├── quick_test.sh # Quick validation
│ └── download_onnxruntime.sh # ONNX Runtime downloader
├── configs/ # Configuration files
│ ├── config.json # Runtime configuration (user-specific)
│ ├── config.example.json # Example configuration template
│ └── launch.json # VS Code debug configuration
├── docs/ # Documentation
│ ├── BUILD.md # Detailed build instructions
│ ├── CONTRIBUTING.md # Contribution guidelines
│ ├── FIXES_APPLIED.md # Bug fixes and workarounds
│ └── todo.md # Development roadmap
├── examples/ # Example scripts and notebooks
│ └── test_onnx_model.py # ONNX model testing
├── test/ # Test datasets
├── build/ # Build artifacts (generated)
│ ├── bin/ # Compiled executables
│ └── lib/ # Compiled libraries
├── export/ # ONNX exports (1.7B model)
├── export3-8B/ # ONNX exports (8B model)
├── Qwen3-1.7B/ # Qwen 1.7B model files (downloaded)
├── Qwen3-8B/ # Qwen 8B model files (downloaded)
├── tokenizers/ # Tokenizers C++ bindings (submodule)
├── CMakeLists.txt # Build configuration
├── setup.py # Python package setup
├── requirements.txt # Python dependencies
├── .gitignore # Git ignore rules
└── README.md # This file
pip install -r requirements.txt
Core dependencies:
- torch>=2.0.0
- transformers>=4.30.0
- onnxruntime>=1.19.0
- numpy>=1.24.0
Optional for CUDA:
pip install onnxruntime-gpu
For the C++ build you will need:
- CMake 3.16+
- C++17 compatible compiler (gcc, clang, MSVC)
- ONNX Runtime 1.19.0+ (GPU build recommended)
- CUDA Toolkit 11.x or 12.x (for GPU inference)
- cuDNN 9.x (for GPU inference)
- nlohmann/json library
- tokenizers-cpp (included as submodule)
git clone https://github.com/sgowdaks/llm-inference.git
cd llm-inference
git submodule update --init --recursive
Copy and edit the configuration file:
cp configs/config.example.json configs/config.json
Edit configs/config.json to set your paths:
{
"paths": {
"model_path": "/path/to/Qwen3-8B",
"model_config": "/path/to/Qwen3-8B/config.json",
"onnx_file": "/path/to/export/qwen.onnx",
"test_file": "/path/to/test/test1.json"
}
}
python src/exporter.py --config configs/config.json --mode export
This will:
- Load the Qwen model from Hugging Face format
- Wrap it in an optimized PyTorch module
- Export to ONNX with dynamic axes for efficient batching (see the sketch after this list)
- Save model weights and constants to the export directory
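For reference, the dynamic-axes mechanism looks roughly like the toy sketch below. The model class, tensor names, and file paths are illustrative stand-ins, not the actual code in src/exporter.py:
import torch
import torch.nn as nn

# Toy stand-in for the exporter's wrapped model; exporter.py wraps the real Qwen weights instead.
class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(1000, 64)
        self.lm_head = nn.Linear(64, 1000)

    def forward(self, input_ids):
        return self.lm_head(self.embed(input_ids))

model = ToyModel().eval()
example_ids = torch.ones(1, 8, dtype=torch.int64)

torch.onnx.export(
    model,
    (example_ids,),
    "toy.onnx",
    input_names=["input_ids"],
    output_names=["logits"],
    # Dynamic axes let one exported graph serve any batch size and sequence length.
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq_len"},
        "logits": {0: "batch", 1: "seq_len"},
    },
    opset_version=13,  # the Troubleshooting section notes opset 13 is the default
)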
# Single prompt
python src/onnx_inference.py --config configs/config.json --prompt "What is the capital of France?"
# Test mode with JSON test file
python src/onnx_inference.py --config configs/config.json --test-mode
# Short answer mode (stops at first sentence)
python src/onnx_inference.py --config configs/config.json --short-answer --prompt "2 + 2 = ?"
See docs/BUILD.md for detailed build instructions.
Quick build:
# Configure
mkdir build && cd build
cmake .. -DONNXRUNTIME_ROOT_DIR=/path/to/onnxruntime-linux-x64-gpu-1.19.0
# Build
make -j4
# Run (GPU)
cd ..
./scripts/run_gpu_inference.sh "What is 2+2?"
# Or run directly (CPU)
./build/bin/onnx_inference "What is 2+2?"
Download ONNX Runtime GPU build:
wget https://github.com/microsoft/onnxruntime/releases/download/v1.19.0/onnxruntime-linux-x64-gpu-1.19.0.tgz
tar -xzf onnxruntime-linux-x64-gpu-1.19.0.tgz
Install cuDNN 9 (required for GPU inference):
# Using conda (recommended)
conda install -c conda-forge cudnn=9
# Or download from NVIDIA and set LD_LIBRARY_PATH manually
# Configure build
mkdir build && cd build
cmake .. -DONNXRUNTIME_ROOT_DIR=/path/to/onnxruntime-linux-x64-gpu-1.19.0
# Build
make -j4
# Return to project root
cd ..
Use the provided wrapper script to set up the environment:
# Create run_gpu_inference.sh if it doesn't exist
cat > run_gpu_inference.sh << 'EOF'
#!/bin/bash
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
export LD_PRELOAD=$CONDA_PREFIX/lib/libstdc++.so.6
./build/onnx_inference "$@"
EOF
chmod +x run_gpu_inference.sh
# Run inference
./run_gpu_inference.sh "What is 2+2?"Or for CPU-only inference (slower):
./build/onnx_inference "What is 2+2?"The exporter wraps the original Hugging Face model with a custom QWENWrapper that:
- Quantizes embeddings to uint8 (scale/zero-point per token; see the sketch after these lists)
- Precomputes rotary embeddings (cos/sin tables)
- Integrates attention mask generation
- Implements custom attention with KV-cache management
- Applies RMSNorm with variance epsilon for stability
Key optimizations:
- Q/K normalization scaling integrated into weights
- Efficient KV-cache concatenation and reuse
- Dynamic axes for variable sequence lengths
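For intuition, here is a minimal, self-contained sketch of three of these ideas (per-row uint8 embedding quantization, precomputed rotary cos/sin tables, RMSNorm). Shapes and names are simplified stand-ins for what QWENWrapper actually does:
import torch

def quantize_embedding(weight):
    # Per-row (per-token) uint8 quantization: each row gets its own scale/zero-point.
    w_min = weight.min(dim=1, keepdim=True).values
    w_max = weight.max(dim=1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / 255.0
    q = torch.round((weight - w_min) / scale).to(torch.uint8)
    return q, scale, w_min  # dequantize with q.float() * scale + w_min

def rotary_tables(head_dim, max_len, base=10000.0):
    # Precomputed cos/sin tables become ONNX constants instead of runtime trig ops.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(max_len).float(), inv_freq)
    return torch.cos(angles), torch.sin(angles)

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: scale by the inverse root-mean-square over the hidden dimension.
    variance = x.pow(2).mean(dim=-1, keepdim=True)
    return x * torch.rsqrt(variance + eps) * weight

# Toy usage with made-up sizes
q, scale, zero_point = quantize_embedding(torch.randn(1000, 64))
cos, sin = rotary_tables(head_dim=64, max_len=128)
normed = rms_norm(torch.randn(2, 5, 64), torch.ones(64))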
Features:
- Clean, maintainable rewrite with type hints and logging
- HuggingFace tokenizer integration
- OrtValue-based inference for zero-copy tensor management
- Configurable ONNX Runtime session options
- Test mode for batch evaluation
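The generation loop itself follows the standard KV-cache pattern. The sketch below uses hypothetical input/output names (input_ids, past.*) and omits the attention-mask and position inputs the real exported graph requires, so treat it as an illustration rather than the engine's actual interface:
import numpy as np

# `session` is an onnxruntime.InferenceSession created elsewhere (see the session options below).
def generate(session, prompt_ids, max_new_tokens, num_layers, num_kv_heads, head_dim):
    # Start with empty per-layer caches of shape (batch, kv_heads, 0, head_dim).
    past = {
        f"past.{i}.{kind}": np.zeros((1, num_kv_heads, 0, head_dim), dtype=np.float32)
        for i in range(num_layers)
        for kind in ("key", "value")
    }
    tokens = list(prompt_ids)
    input_ids = np.array([tokens], dtype=np.int64)  # whole prompt on the first step

    for _ in range(max_new_tokens):
        outputs = session.run(None, {"input_ids": input_ids, **past})
        logits = outputs[0]  # assumes logits is the first output
        next_token = int(np.argmax(logits[0, -1]))
        tokens.append(next_token)

        # Feed the returned "present" tensors back in as the next step's "past",
        # assuming they come back in the same order the past inputs were declared,
        # and pass only the newly generated token (the cache covers the rest).
        past = dict(zip(past.keys(), outputs[1:]))
        input_ids = np.array([[next_token]], dtype=np.int64)

    return tokens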
Performance settings:
- Sequential execution mode
- All graph optimizations enabled
- QDQ cleanup for quantized models
- GELU approximation
- CPU memory arena enabled
Features:
- Native ONNX Runtime C++ API (v1.19.0+)
- GPU acceleration with CUDA execution provider
- tokenizers-cpp for fast tokenization
- Streaming token generation with KV-cache management
- Automatic CUDA detection and fallback to CPU
- Proper tensor lifecycle management for multi-iteration decode
- Naive UTF-8 decoding with character replacement
Optimizations:
- Zero-copy tensor creation where possible
- Efficient KV-cache reuse via std::move
- Persistent data buffers across decode iterations
- Minimal heap allocations in decode loop
- Device selection for multi-GPU systems
Both Python and C++ implementations configure:
- log_severity_level: 4 (minimal logging)
- inter/intra_op_num_threads: 0 (auto-detect)
- execution_mode: Sequential
- graph_optimization_level: All optimizations enabled
- enable_cpu_mem_arena: True
- set_denormal_as_zero: True
- allow_spinning: True for inter/intra ops
- qdq_matmulnbits_accuracy_level: 4
- enable_gelu_approximation: True
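In Python, these settings map onto onnxruntime.SessionOptions roughly as follows. The string config-entry keys can differ between ONNX Runtime releases, so treat this as a sketch and check the keys against the version you build with:
import onnxruntime as ort

so = ort.SessionOptions()
so.log_severity_level = 4                      # minimal logging
so.intra_op_num_threads = 0                    # 0 = auto-detect
so.inter_op_num_threads = 0
so.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
so.enable_cpu_mem_arena = True

# Extended options are set through string config entries.
so.add_session_config_entry("session.set_denormal_as_zero", "1")
so.add_session_config_entry("session.intra_op.allow_spinning", "1")
so.add_session_config_entry("session.inter_op.allow_spinning", "1")
so.add_session_config_entry("session.enable_quant_qdq_cleanup", "1")
so.add_session_config_entry("session.qdq_matmulnbits_accuracy_level", "4")
so.add_session_config_entry("optimization.enable_gelu_approximation", "1")

# CUDA with CPU fallback; device_id selects the GPU on multi-GPU machines.
session = ort.InferenceSession(
    "export/qwen.onnx",
    sess_options=so,
    providers=[("CUDAExecutionProvider", {"device_id": 0}), "CPUExecutionProvider"],
)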
Required fields:
- num_key_value_heads: Number of KV heads (e.g., 4 for 8B model)
- head_dim: Dimension per attention head (e.g., 128)
- num_hidden_layers: Number of transformer layers (e.g., 32)
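Together, these fields fix the shape of every per-layer KV-cache tensor: (batch, num_key_value_heads, seq_len, head_dim). A small sketch of reading them from the Hugging Face config.json (the path is a placeholder; the head_dim fallback is the usual hidden_size / num_attention_heads convention):
import json

with open("/path/to/Qwen3-8B/config.json") as f:  # placeholder path
    cfg = json.load(f)

num_kv_heads = cfg["num_key_value_heads"]
head_dim = cfg.get("head_dim", cfg["hidden_size"] // cfg["num_attention_heads"])
num_layers = cfg["num_hidden_layers"]

# One past key tensor and one past value tensor per transformer layer.
print(f"{num_layers} layers, per-layer cache shape: (1, {num_kv_heads}, seq_len, {head_dim})")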
Create a test file in JSON format (test/test1.json):
[
{
"prompt": "What is 2+2?",
"expected": "4"
},
{
"prompt": "What is the capital of France?",
"expected": "Paris"
}
]
Run tests:
# Python
python onnx_inference.py --config config.json --test-mode
# HuggingFace baseline comparison
python exporter.py --config config.json --mode test
Typical performance on example hardware:
Qwen3-8B on GPU (NVIDIA RTX A6000, CUDA):
- First token: ~200ms (model load time separate)
- Subsequent tokens: ~12 tokens/sec
- Using ONNX Runtime 1.19.0 GPU + cuDNN 9
Qwen3-8B on GPU (NVIDIA RTX 3090):
- First token: ~100ms
- Subsequent tokens: ~60-80 tokens/sec
Qwen3-8B on CPU (AMD Ryzen 9 5950X):
- First token: ~500ms
- Subsequent tokens: ~15-20 tokens/sec
Qwen3-1.7B on CPU:
- Subsequent tokens: ~30-40 tokens/sec
Performance varies based on sequence length, hardware, ONNX Runtime build, and CUDA/cuDNN versions.
See docs/todo.md for planned improvements:
- Fix ONNX Runtime 1.19.0 API compatibility
- Enable GPU inference with CUDA provider
- Fix tensor lifecycle issues in decode loop
- Add cuDNN 9 support
- Remove transformers dependency at inference time
- Create minimal runtime config during export
- Add HuggingFace download automation
- Expand test suite with evaluation metrics
- Add accuracy comparison with HF baseline
- Support multi-GPU inference
- Add batch inference support
The C++ code requires ONNX Runtime 1.19.0 or later. Key API changes:
- AddSessionConfigEntry → AddConfigEntry
- SetLogVerbosityLevel removed (use session options)
- SetProviders → AppendExecutionProvider_CUDA for GPU
If you encounter API errors, ensure you're using ONNX Runtime 1.19.0+.
Error: Failed to load CUDA execution provider or cuDNN library not found
Solution: Install cuDNN 9 and set up environment:
# Install via conda (recommended)
conda install -c conda-forge cudnn=9
# Use the wrapper script to run
./run_gpu_inference.sh "Your prompt here"The wrapper script sets:
- LD_LIBRARY_PATH=$CONDA_PREFIX/lib (finds cuDNN)
- LD_PRELOAD=$CONDA_PREFIX/lib/libstdc++.so.6 (avoids version conflicts)
Error: Could not find an implementation for Mul(14) node
Cause: ONNX Runtime GPU builds may not include all CPU fallback kernels for newer opsets.
Solution: The exporter now uses opset 13 by default. If you have an old export, re-run:
python exporter.py --config config.json --mode export
Error: Unexpected input data type. Actual: (tensor(int64)), expected: (tensor(float))
Cause: Input tensors not properly reconstructed between decode iterations.
Solution: Already fixed in current code. Ensure you have the latest version where:
- current_tokens, history_len_data, ids_len_data, attention_mask_data persist across iterations
- KV cache tensors are moved (not copied) from outputs to inputs
- Memory info and shapes are properly maintained
Set the ONNXRUNTIME_ROOT_DIR CMake variable:
cmake .. -DONNXRUNTIME_ROOT_DIR=/path/to/onnxruntime-linux-x64-gpu-1.19.0
Ensure tokenizer.json exists in your model directory:
ls -la /path/to/Qwen3-8B/tokenizer.json
Check available providers in C++:
# The program will print: "Using CUDA execution provider on GPU 0"
# Or fall back to: "CUDA provider not available, using CPU"
For Python:
import onnxruntime
print(onnxruntime.get_available_providers())
# Should show: ['CUDAExecutionProvider', 'CPUExecutionProvider']
If you run out of GPU memory:
- Use a smaller model (Qwen3-1.7B instead of 8B)
- Reduce batch size (currently 1)
- Select a different GPU: modify device_id in onnx_inference.cpp
- Close other GPU applications
If inference is slow:
- Ensure you're using GPU inference (check for "Using CUDA" message)
- Build ONNX Runtime from source with optimizations
- Use cuDNN 9 for best performance
- Check GPU utilization with nvidia-smi
This project is based on the Native-LLM-for-Android example by DakeQQ, with significant refactoring for:
- Improved maintainability and code quality
- Better error handling and logging
- Type safety (Python type hints)
- Modular architecture
- Enhanced documentation
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Submit a pull request
Author: Shivani Gowda
Repository: llm-inference