Run text generation with pretrained models from HuggingFace Hub using the unified inference pipeline.
The UnifiedPipeline auto-detects model family and provides a simplified, one-liner API:
from chuk_lazarus.inference import UnifiedPipeline, UnifiedPipelineConfig, DType
# One-liner model loading - auto-detects family!
pipeline = UnifiedPipeline.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
# Simple chat API
result = pipeline.chat("What is the capital of France?")
print(result.text)
print(result.stats.summary) # "25 tokens in 0.42s (59.5 tok/s)"from chuk_lazarus.inference import (
UnifiedPipeline,
UnifiedPipelineConfig,
GenerationConfig,
DType,
)
# Configure the pipeline
config = UnifiedPipelineConfig(
dtype=DType.BFLOAT16,
default_system_message="You are a helpful coding assistant.",
default_max_tokens=200,
default_temperature=0.7,
)
pipeline = UnifiedPipeline.from_pretrained(
"HuggingFaceTB/SmolLM2-360M-Instruct",
pipeline_config=config,
)
# Generate with custom settings
result = pipeline.chat(
"Write a Python function to calculate Fibonacci numbers",
max_new_tokens=300,
temperature=0.3,
)
print(result.text)# Basic inference with TinyLlama
chuk-lazarus infer --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --prompt "What is the capital of France?"
# With generation parameters
chuk-lazarus infer \
--model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
--prompt "Explain quantum computing in one sentence" \
--max-tokens 100 \
--temperature 0.7For more control, use the models directly:
from chuk_lazarus.models_v2 import LlamaConfig, LlamaForCausalLM
from transformers import AutoTokenizer
import mlx.core as mx
# Load model and tokenizer
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
config = LlamaConfig.from_pretrained(model_id)
model = LlamaForCausalLM.from_pretrained(model_id, config)
# Generate text
prompt = "What is machine learning?"
input_ids = tokenizer.encode(prompt, return_tensors="np")
input_ids = mx.array(input_ids)
output_ids = model.generate(
input_ids,
max_new_tokens=100,
temperature=0.7,
stop_tokens=[tokenizer.eos_token_id],
)
mx.eval(output_ids)
response = tokenizer.decode(output_ids[0].tolist(), skip_special_tokens=True)
print(response)| Class | Description |
|---|---|
UnifiedPipeline |
High-level API with auto-detection and generation |
UnifiedPipelineConfig |
Pipeline configuration (dtype, defaults, introspection) |
GenerationConfig |
Generation parameters (max_tokens, temperature, top_p) |
GenerationResult |
Generation output with text and stats |
ChatHistory |
Multi-turn conversation management |
# Synchronous loading (auto-detects family)
pipeline = UnifiedPipeline.from_pretrained(
"TinyLlama/TinyLlama-1.1B-Chat-v1.0",
pipeline_config=UnifiedPipelineConfig(dtype=DType.BFLOAT16),
)
# Async loading
pipeline = await UnifiedPipeline.from_pretrained_async(
"TinyLlama/TinyLlama-1.1B-Chat-v1.0",
)
# Access detected family
print(f"Family: {pipeline.family_type}") # ModelFamilyType.LLAMAThe pipeline auto-detects these model families:
| Family | Model Types | Example Models |
|---|---|---|
llama |
llama, mistral, codellama | TinyLlama, Llama 2/3, SmolLM2 |
llama4 |
llama4 | Llama 4 Scout |
gemma |
gemma, gemma2, gemma3 | Gemma 3 270M-27B |
granite |
granite | Granite 3.x |
granitemoehybrid |
granitemoehybrid | Granite 4.0 |
jamba |
jamba | Jamba, Jamba 1.5 |
mamba |
mamba | Mamba SSM models |
starcoder2 |
starcoder2 | StarCoder2 3B/7B/15B |
qwen3 |
qwen2, qwen3 | Qwen 2/3 |
gpt2 |
gpt2 | GPT-2, DistilGPT-2 |
# Simple single-turn chat
result = pipeline.chat("What is 2+2?")
# With custom system message
result = pipeline.chat(
"Write a haiku",
system_message="You are a poet.",
)
# Multi-turn conversation
from chuk_lazarus.inference import ChatHistory
history = ChatHistory()
history.add_system("You are a helpful assistant.")
history.add_user("What is Python?")
history.add_assistant("Python is a programming language.")
history.add_user("What is it used for?")
result = pipeline.chat_with_history(history)# Direct prompt without chat formatting
result = pipeline.generate(
"Once upon a time",
max_new_tokens=100,
temperature=0.9,
)
# With full config
from chuk_lazarus.inference import GenerationConfig
config = GenerationConfig(
max_new_tokens=200,
temperature=0.7,
top_p=0.9,
top_k=40,
)
result = pipeline.generate("The quick brown fox", config=config)from chuk_lazarus.inference import generate_stream
# Stream tokens as they're generated
for chunk in generate_stream(model, tokenizer, "Write a story"):
print(chunk, end="", flush=True)Select the KV_DIRECT engine at pipeline construction time and retrieve a ready-to-use generator:
from chuk_lazarus.inference import UnifiedPipeline, UnifiedPipelineConfig
from chuk_lazarus.inference.unified import EngineMode
# Load with KV-direct engine mode selected
config = UnifiedPipelineConfig(engine=EngineMode.KV_DIRECT)
pipeline = UnifiedPipeline.from_pretrained("google/gemma-3-1b-it", config=config)
# Get a ready-to-use KVDirectGenerator for any supported model
kv_gen = pipeline.make_engine()Stores K,V directly after prefill (post-norm, post-RoPE). Bit-exact with standard KV cache. Exposes the full generation lifecycle — prefill, step, extend, and slide — for direct KV store control:
import mlx.core as mx
from chuk_lazarus.inference.context import make_kv_generator
# Auto-detects model family (Gemma, Llama, Mistral, ...)
kv_gen = make_kv_generator(pipeline.model)
# Prefill on a prompt
prompt_ids = mx.array([[tok1, tok2, ...]]) # (1, S)
logits, kv_store = kv_gen.prefill(prompt_ids)
# Generate tokens one at a time
seq_len = prompt_ids.shape[1]
for _ in range(max_new_tokens):
next_token = mx.argmax(logits[0, -1])
logits, kv_store = kv_gen.step(next_token[None, None], kv_store, seq_len)
seq_len += 1
# Extend with a new batch of tokens (e.g. second turn)
new_ids = mx.array([[...]]) # (1, N)
logits, kv_store = kv_gen.extend(new_ids, kv_store, abs_start=seq_len)
# Evict oldest tokens when budget is hit
kv_store = kv_gen.slide(kv_store, evict_count=64)
# Memory accounting
print(f"KV bytes at 512 tokens: {kv_gen.kv_bytes(512):,}")Use the auto-detect factory or construct from a specific adapter. To support a new architecture, implement ModelBackboneProtocol and pass it to KVDirectGenerator:
from chuk_lazarus.inference.context import (
KVDirectGenerator,
make_kv_generator,
GemmaBackboneAdapter,
LlamaBackboneAdapter,
ModelBackboneProtocol,
TransformerLayerProtocol,
)
# Auto-detect factory (Gemma, Llama, Mistral):
gen = make_kv_generator(model)
# Or construct from a specific adapter:
gen = KVDirectGenerator.from_gemma_rs(rs_model)
gen = KVDirectGenerator.from_llama(llama_model)
# For a new architecture, implement ModelBackboneProtocol and pass it in:
class MyBackboneAdapter: # implements ModelBackboneProtocol
...
gen = KVDirectGenerator(MyBackboneAdapter(my_model))The --engine flag selects the engine at the CLI level:
# Standard generation (default)
lazarus infer --model google/gemma-3-1b-it --prompt "Hello"
# KV-direct stateful engine
lazarus infer --model google/gemma-3-1b-it --prompt "Hello" --engine kv_direct| Engine | Class | Use case |
|---|---|---|
standard |
Built-in KV cache | General inference, default |
kv_direct |
KVDirectGenerator |
Stateful multi-turn, sliding window, custom eviction |
bounded_kv |
BoundedKVEngine |
HOT/WARM/COLD memory budgets (Gemma) |
unlimited |
UnlimitedContextEngine |
Unbounded context via checkpoint replay |
The examples/inference/ directory contains streamlined examples using UnifiedPipeline:
# Simple inference (any supported model)
uv run python examples/inference/simple_inference.py --prompt "What is the capital of France?"
# List supported families
uv run python examples/inference/simple_inference.py --list-families
# Llama family with model presets
uv run python examples/inference/llama_inference.py --model smollm2-360m
uv run python examples/inference/llama_inference.py --list
# Gemma 3 with interactive chat
uv run python examples/inference/gemma_inference.py --chat
# Granite (IBM)
uv run python examples/inference/granite_inference.py --model granite-3.1-2b
# Llama 4 Scout (Mamba-Transformer hybrid)
uv run python examples/inference/llama4_inference.py
# StarCoder2 (code generation)
uv run python examples/inference/starcoder2_inference.py --prompt "def quicksort(arr):"
# Jamba (hybrid Mamba-Transformer MoE)
uv run python examples/inference/jamba_inference.py --test-tinyThe examples/inference/llama_inference.py script provides a unified interface for Llama-architecture models:
# List available model presets
uv run python examples/inference/llama_inference.py --list
# Run with different models
uv run python examples/inference/llama_inference.py --model tinyllama
uv run python examples/inference/llama_inference.py --model smollm2-360m
uv run python examples/inference/llama_inference.py --model llama3.2-1b
# Custom prompt
uv run python examples/inference/llama_inference.py \
--model smollm2-360m \
--prompt "Explain relativity in simple terms" \
--max-tokens 150 \
--temperature 0.8| Preset | Model ID | Parameters | Notes |
|---|---|---|---|
tinyllama |
TinyLlama/TinyLlama-1.1B-Chat-v1.0 | 1.1B | Fast, good for testing |
smollm2-135m |
HuggingFaceTB/SmolLM2-135M-Instruct | 135M | Tiny, runs anywhere |
smollm2-360m |
HuggingFaceTB/SmolLM2-360M-Instruct | 360M | Good quality/speed balance |
smollm2-1.7b |
HuggingFaceTB/SmolLM2-1.7B-Instruct | 1.7B | High quality, still fast |
llama2-7b |
meta-llama/Llama-2-7b-chat-hf | 7B | Llama 2 Chat |
llama2-13b |
meta-llama/Llama-2-13b-chat-hf | 13B | Larger Llama 2 |
llama3.2-1b |
meta-llama/Llama-3.2-1B-Instruct | 1B | Smallest Llama 3 |
llama3.2-3b |
meta-llama/Llama-3.2-3B-Instruct | 3B | Small but capable |
llama3.1-8b |
meta-llama/Llama-3.1-8B-Instruct | 8B | Standard size |
mistral-7b |
mistralai/Mistral-7B-Instruct-v0.3 | 7B | Sliding window attention |
Note: Meta Llama models require HuggingFace authentication. Run huggingface-cli login first.
The generate() method supports several parameters to control text generation:
| Parameter | Type | Default | Description |
|---|---|---|---|
max_new_tokens |
int | 100 | Maximum tokens to generate |
temperature |
float | 0.7 | Sampling temperature (0 = greedy, higher = more random) |
top_p |
float | 0.9 | Nucleus sampling probability threshold |
top_k |
int | None | Top-k sampling (limits to k most likely tokens) |
repetition_penalty |
float | 1.0 | Penalty for repeating tokens (>1 reduces repetition) |
stop_tokens |
list | None | Token IDs that stop generation |
# Greedy decoding (deterministic)
output = model.generate(input_ids, temperature=0.0)
# Low temperature (focused, coherent)
output = model.generate(input_ids, temperature=0.3)
# Medium temperature (balanced)
output = model.generate(input_ids, temperature=0.7)
# High temperature (creative, diverse)
output = model.generate(input_ids, temperature=1.2)Models are loaded from HuggingFace Hub and automatically converted:
- Download: Weights are downloaded via
huggingface_hub.snapshot_download() - Load: MLX's native
mx.load()handles safetensors files - Detect: Model family is auto-detected from config.json
- Convert: Weights are mapped using family-specific converters
- Update: Weights are loaded into the model via
model.update()
Use bfloat16 for numerical stability with most models:
config = UnifiedPipelineConfig(dtype=DType.BFLOAT16) # RecommendedThe pipeline uses the tokenizer's built-in chat template when available:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"},
]
# Apply chat template
prompt = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)- Use bfloat16: Default dtype for numerical stability
- Enable KV-cache: Automatically enabled for autoregressive generation
- Batch prompts: Process multiple prompts in a single forward pass when possible
- Use smaller models: SmolLM2-135M for fast iteration, larger models for quality
- Use KVDirectGenerator for stateful inference:
make_kv_generator(model)gives you direct control over the KV store — sliding window eviction, turn-level extend, and memory accounting without touching the model code.
If the model produces nonsensical output:
- Ensure weights loaded correctly (check tensor count)
- Verify dtype is
bfloat16(notfloat16) - Check that
tie_word_embeddingsmatches config
- First token is slow (model compilation)
- Subsequent tokens should be fast (~50-150 tok/s)
- Larger models need more memory bandwidth
If weight loading fails:
- Check model files exist in cache
- Verify safetensors format
- Some models may need HF authentication
If you get "Unable to detect model family":
- Check if the model's
model_typeis supported - The model architecture may not be implemented yet
- Open an issue to request support for new model families
Gemma 3 is Google's latest open model family with 5 sizes (270M, 1B, 4B, 12B, 27B) and 128K context.
# Basic inference
uv run python examples/inference/gemma_inference.py --prompt "What is the capital of France?"
# Gemma 3 270M (smallest, fastest)
uv run python examples/inference/gemma_inference.py --model gemma3-270m
# Interactive chat mode
uv run python examples/inference/gemma_inference.py --chatIBM Granite models are available in dense (3.x) and hybrid (4.x) variants:
# Granite 3.1 (dense)
uv run python examples/inference/granite_inference.py --model granite-3.1-2b
# Granite 4.0 (hybrid Mamba/Transformer)
uv run python examples/inference/granite_inference.py --model granite-4.0-micro
# Test without downloading
uv run python examples/inference/granite_inference.py --test-tinyJamba is AI21 Labs' hybrid Mamba-Transformer MoE model with 256K context:
# Test with tiny model
uv run python examples/inference/jamba_inference.py --test-tiny
# Basic inference
uv run python examples/inference/jamba_inference.py --prompt "What is quantum computing?"BigCode's code generation models:
# Code completion
uv run python examples/inference/starcoder2_inference.py --prompt "def fibonacci(n):"
# Use specific model size
uv run python examples/inference/starcoder2_inference.py --model starcoder2-7bTo expose a model over HTTP as an OpenAI-compatible API — for use with mcp-cli, LangChain, the openai SDK, or any other OpenAI-compatible client — start the built-in inference server:
# Requires the server extra
uv add "chuk-lazarus[server]"
lazarus serve --model google/gemma-3-4b-itThe server is ready at http://localhost:8080/v1. From there:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "google/gemma-3-4b-it", "messages": [{"role": "user", "content": "Hello!"}]}'Or use the Python client library:
from chuk_lazarus.client import LazarusClient, ChatMessage, ClientRole
with LazarusClient() as client:
response = client.chat(
model="google/gemma-3-4b-it",
messages=[ChatMessage(role=ClientRole.USER, content="Hello!")],
)
print(response.content)See server.md for the full server guide and client.md for the client library.
- Models Guide - Architecture details
- Training Guide - Fine-tuning models
- Inference Server - OpenAI-compatible HTTP server
- Client Library - Python client for the server
- Examples - Working inference examples