From smartphones to datacenters • No vendor lock-in • Privacy-preserving on-device training
Quick Start • Downloads • Datasets • Research • Benchmarks
The Problem: LLM fine-tuning has been locked to NVIDIA GPUs and CUDA. Mobile devices, AMD/Intel GPUs, and Apple Silicon were left behind.
Our Solution: A unified LoRA fine-tuning framework that works on any modern GPU:
| Platform | Hardware |
|---|---|
| 📱 Android | Qualcomm Adreno, ARM Mali |
| 🍎 iOS/macOS | Apple Silicon (A-series, M-series) |
| 🖥️ Windows/Linux | AMD, Intel, NVIDIA GPUs |
Key Innovation: A novel dynamic tiling algorithm enables stable training on mobile GPUs despite hard hardware memory limits.
This repository contains the implementation and artifacts for our article:
"An Edge-First Generalized LLM LoRA Fine-Tuning Framework for Heterogeneous GPUs"
- 🌍 Cross-Platform LoRA Framework - First unified solution for parameter-efficient fine-tuning across heterogeneous consumer hardware
- 📱 Mobile GPU Support - First successful fine-tuning on Adreno, Mali, and Apple mobile GPUs
- 📝 Instruction-Tuning - Masked-loss training for instruction-following alignment
- ⚡ Modern Architecture Support - Extended llama.cpp to support Qwen3 and Gemma-3 fine-tuning
- 🔧 Hardware Innovation - Dynamic tiling algorithm solves critical Adreno GPU memory constraints
To accelerate development and innovation, Tether Data is publicly releasing:
- Fine-tuned Model Adapters: 🔗 fabric-llm-finetune on Hugging Face
- Source Code (Work-in-Progress): 🔗 qvac-fabric-llm.cpp (fabric-llm-finetune branch). Currently experimental and intended for developers to extend the solution to other LLM models.
- ✅ Quality Parity: 45-48% win rate vs PyTorch/HuggingFace (LLM-as-judge)
- ✅ Domain Adaptation: 79-94% accuracy on biomedical Q&A tasks
- ✅ Production Scale: Tested on 6 GPU architectures, 5 model families, 4 quantization levels
📊 View detailed benchmarks | 📄 Research paper: Coming soon
- First time? Start with the Quick Start section below
- Platform-specific setup? Go to `releases/[your-platform]/README.md`
- Download binaries? Browse the `releases/` directory
- Training datasets: `evaluation/email_style_transfer/`
- Dataset format guide: `evaluation/email_style_transfer/README.md`
- How to perform custom fine-tuning: `evaluation/README.md`
- Run model comparisons: use scripts in `evaluation/scripts/`
- View benchmark results: `docs/BENCHMARKS.md` (comprehensive)
- Detailed experiment reports: `evaluation/reports/` directory
- Compare base vs fine-tuned: `evaluation/scripts/compare_base_vs_adapters.py`
- Complete benchmarks: `docs/BENCHMARKS.md` (all platforms, metrics)
- Methodology & results: `evaluation/reports/COMPLETE_PROJECT_REPORT.md`
- Biomedical case study: `evaluation/reports/BIOMED_FINETUNING_REPORT.md`
| Task | Location |
|---|---|
| Download binaries | releases/[platform]/ |
| Get training data | evaluation/email_style_transfer/email_dataset.jsonl |
| See platform benchmarks | docs/BENCHMARKS.md |
| Run evaluation scripts | evaluation/scripts/ |
| View experiment results | evaluation/reports/ |
| Platform setup guide | releases/[platform]/README.md |
qvac-fabric/
├── README.md                          # This file - main documentation
│
├── docs/                              # 📊 Research Documentation
│   └── BENCHMARKS.md                  # Comprehensive performance metrics across all platforms
│
├── evaluation/                        # 🧪 Datasets, Scripts & Results
│   ├── README.md                      # Evaluation guide and methodology
│   │
│   ├── biomedical_qa/                 # Biomedical Question-Answering Dataset
│   │   └── biomedical_qa.zip          # PubMedQA-derived dataset (330 examples)
│   │
│   ├── email_style_transfer/          # Personal Email Style Transfer Dataset
│   │   ├── email_dataset.jsonl        # Email conversation examples
│   │   └── README.md                  # Usage and format documentation
│   │
│   ├── scripts/                       # Python Evaluation & Monitoring Tools
│   │   ├── compare_base_vs_adapters.py  # Compare base model vs fine-tuned
│   │   ├── monitor_training.py        # Real-time training monitoring
│   │   ├── quick_compare.py           # Quick model comparison
│   │   ├── test_biomed_prompts.py     # Test prompt accuracy
│   │   └── build_biomed_yn_dataset.py # Dataset preprocessing
│   │
│   └── reports/                       # Experimental Results & Analysis
│       ├── BIOMED_FINETUNING_REPORT.md     # Detailed results
│       ├── COMPLETE_PROJECT_REPORT.md      # Full project overview
│       ├── README_inference_comparison.md  # Inference benchmarks
│       └── biomed_comparison_results.json  # Structured results data
│
└── releases/                          # 📦 Pre-built Binaries
    ├── README.md                      # Platform overview and installation
    │
    ├── android/                       # Android (Termux) Builds
    │   ├── README.md                  # Android-specific setup guide
    │   └── qvac-android-adreno-arm64-v1.0.zip
    │
    ├── ios/                           # iOS Builds
    │   ├── README.md                  # iOS-specific setup guide
    │   └── qvac-ios-v1.0.zip
    │
    ├── linux/                         # Linux Builds (Multiple Backends)
    │   ├── README.md                  # Linux setup and backend selection
    │   ├── qvac-linux-arm64-v1.0.zip        # ARM64 CPU build
    │   ├── qvac-linux-vulkan-x64-v1.0.zip   # AMD/NVIDIA/Intel Vulkan
    │   └── qvac-linux-sycl-intel-v1.0.zip   # Intel GPU SYCL
    │
    └── macos/                         # macOS Builds
        ├── README.md                  # macOS setup guide
        ├── qvac-macos-apple-silicon-v1.0.zip  # M1/M2/M3/M4
        └── qvac-macos-intel-v1.0.zip          # Intel x64
📱 Android (Termux)
# Download pre-built binary for your device
wget https://github.com/tetherto/qvac-fabric/releases/download/v1.0/qvac-android-adreno-arm64-v1.0.zip
unzip qvac-android-adreno-arm64-v1.0.zip
cd qvac-android-adreno-arm64-v1.0
# Set library path (required for Android)
export LD_LIBRARY_PATH=.
# Download model
mkdir -p models
wget https://huggingface.co/Qwen/Qwen3-0.6B-GGUF/resolve/main/qwen3-0_6b-q8_0.gguf -O models/qwen3-0.6b-q8_0.gguf
# Download and extract biomedical dataset
wget https://github.com/tetherto/qvac-rnd-fabric-llm-finetune/raw/main/evaluation/biomedical_qa/biomedical_qa.zip
unzip biomedical_qa.zip
# Quick test with biomedical dataset
./bin/llama-finetune-lora \
-m models/qwen3-0.6b-q8_0.gguf \
-f biomedical_qa/train.jsonl \
--assistant-loss-only \
-c 128 -b 128 -ub 128 -ngl 99 -fa off \
  --num-epochs 2

📱 Full Android Guide
🍎 macOS (Apple Silicon)
# Download pre-built binary
curl -L https://github.com/tetherto/qvac-fabric/releases/download/v1.0/qvac-macos-apple-silicon-v1.0.zip -o qvac-macos.zip
unzip qvac-macos.zip
cd qvac-macos-apple-silicon-v1.0
# Download model
mkdir -p models
wget https://huggingface.co/Qwen/Qwen3-1.7B-GGUF/resolve/main/qwen3-1_7b-q8_0.gguf -O models/qwen3-1.7b-q8_0.gguf
# Download and extract biomedical dataset
wget https://github.com/tetherto/qvac-rnd-fabric-llm-finetune/raw/main/evaluation/biomedical_qa/biomedical_qa.zip
unzip biomedical_qa.zip
# Quick test with biomedical dataset
./bin/llama-finetune-lora \
-m models/qwen3-1.7b-q8_0.gguf \
-f biomedical_qa/train.jsonl \
--assistant-loss-only \
-c 128 -b 128 -ub 128 -ngl 999 -fa off \
  --num-epochs 3

🍎 Full macOS Guide
🖥️ Linux/Windows (AMD/Intel/NVIDIA)
# Download binary for your GPU
# For AMD/Intel/NVIDIA (Vulkan):
wget https://github.com/tetherto/qvac-fabric/releases/download/v1.0/qvac-linux-vulkan-x64-v1.0.zip
unzip qvac-linux-vulkan-x64-v1.0.zip
cd qvac-linux-vulkan-x64-v1.0
# Download model
mkdir -p models
wget https://huggingface.co/Qwen/Qwen3-1.7B-GGUF/resolve/main/qwen3-1_7b-q8_0.gguf -O models/qwen3-1.7b-q8_0.gguf
# Download and extract biomedical dataset
wget https://github.com/tetherto/qvac-rnd-fabric-llm-finetune/raw/main/evaluation/biomedical_qa/biomedical_qa.zip
unzip biomedical_qa.zip
# Run biomedical fine-tuning
./bin/llama-finetune-lora \
-m models/qwen3-1.7b-q8_0.gguf \
-f biomedical_qa/train.jsonl \
--assistant-loss-only \
-c 128 -b 128 -ub 128 -ngl 999 -fa off \
--learning-rate 1e-5 --lr-min 1e-8 \
--lr-scheduler cosine --warmup-ratio 0.1 \
--num-epochs 8 \
  --lora-modules "attn_q,attn_k,attn_v,attn_o,ffn_gate,ffn_up,ffn_down"

🐧 Full Linux Guide
Pre-built binaries optimized for each platform:
| Platform | Hardware | Backend | Size | Download |
|---|---|---|---|---|
| Android | Qualcomm Adreno, ARM Mali | Vulkan | 180MB | 📥 Download |
| macOS | Apple M1/M2/M3/M4 | Metal | 35MB | 📥 Download |
| macOS | Intel x64 | CPU | 36MB | 📥 Download |
| iOS | Apple A-series | Metal | 1.3MB | 📥 Download |
| Linux/Win | AMD/Intel/NVIDIA | Vulkan | 55MB | 📥 Download |
| Linux | ARM64 | CPU | 37MB | 📥 Download |
| Linux | Intel GPU | SYCL | 56MB | 📥 Download |
Each download contains pre-built binaries:
- ✅ `llama-finetune-lora` - LoRA fine-tuning binary
- ✅ `llama-finetune` - Full fine-tuning binary
- ✅ `llama-cli` - Inference and interactive chat
- ✅ `llama-quantize` - Model quantization tool
- ✅ `llama-perplexity` - Model evaluation tool
- ✅ `llama-export-lora` - Export/merge LoRA adapters
- ✅ All required libraries (GGML, Vulkan/Metal backends)
Note: Datasets and examples are available in the evaluation directory of this repository.
📦 All Releases & Documentation
We provide curated, privacy-safe datasets for reproducible fine-tuning research. See the evaluation/ directory for available datasets and documentation.
- 🎯 Full Fine-tuning & LoRA - Support for both full model updates and parameter-efficient LoRA
- 📝 Instruction Fine-Tuning - Masked-loss training on assistant tokens only
- 💬 Chat Templates - Built-in ChatML + custom Jinja template support
- 💾 Checkpointing - Resume training with complete optimizer state
- 📈 Learning Rate Scheduling - Cosine annealing with warmup (see the sketch after this list)
- 📦 Quantization - Train and infer with F32, F16, Q8_0, Q4_0
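The scheduler's shape is driven by the `--learning-rate`, `--lr-min`, `--lr-scheduler`, and `--warmup-ratio` flags used in the examples below. As a minimal Python sketch of that shape (illustrative only, assuming linear warmup; not the framework's internal code):

```python
import math

def lr_at_step(step: int, total_steps: int, lr_max: float = 1e-5,
               lr_min: float = 1e-8, warmup_ratio: float = 0.1) -> float:
    """Cosine-annealed learning rate with linear warmup (illustrative sketch)."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear ramp from ~0 up to lr_max over the warmup window.
        return lr_max * (step + 1) / warmup_steps
    # Cosine decay from lr_max down to lr_min over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
```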
- ✅ Qwen3 (0.6B, 1.7B, 4B)
- ✅ Gemma-3 (1B, 4B)
- ✅ LLaMA family
- ✅ TinyLlama
- 🔷 Vulkan - AMD, Intel, NVIDIA, Qualcomm Adreno, ARM Mali
- 🍎 Metal - Apple Silicon (M-series, A-series)
- 🟢 CUDA - NVIDIA GPUs (optional, Vulkan works too)
- 🖥️ CPU - Fallback for any platform
| Model | Mali | Adreno | Intel A770 | AMD 7900XTX | RTX 4090 | Apple M3 | iPhone 16 |
|---|---|---|---|---|---|---|---|
| Qwen3-0.6B Q8 | 15.4 | 35.0 | 133.0 | 178.2 | 199+ | 120+ | 32.5 |
| Qwen3-1.7B Q8 | 7.8 | 17.3 | 90.0 | 158.0 | 176+ | 62-90 | 15.2 |
| Gemma-1B Q8 | 11.7 | 36.6 | 89.0 | 148.8 | 150+ | 70-90 | 33.2 |
| Hardware | Time/Epoch | Full Training (8 epochs) |
|---|---|---|
| RTX 4090 | 5.5 min | 45 min ⚡ |
| AMD 7900 XTX | 13 min | 1.7 hrs |
| Intel Arc A770 | 20 min | 2.7 hrs |
| Apple M3 Pro | 40 min | 5.3 hrs |
| iPhone 16 | 1h 55min | 15 hrs |
| Adreno 830 | 1h 40min | 13 hrs |
| Mali G715 | 7h 40min | 61 hrs |
📊 View complete benchmarks with detailed metrics across all platforms
| Metric | qvac-fabric | PyTorch/HuggingFace |
|---|---|---|
| LLM-as-Judge Win Rate | 45-48% | 52-55% |
| Biomedical Accuracy | 79-94% | 78-86% |
| Cosine Similarity | 0.82 | 0.77 |
Conclusion: Near-parity quality with established frameworks, but works on 8x more hardware platforms.
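The cosine-similarity row measures how closely response embeddings from the two frameworks align (1.0 means identical direction). For reference, the metric itself in a few lines of NumPy (the choice of embedding model is not specified here):

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """cos(u, v) = (u · v) / (‖u‖ ‖v‖), in [-1, 1]."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```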
# Create new LoRA adapter
./bin/llama-finetune-lora \
-m model.gguf \
-f dataset.txt \
  -ngl 999 -c 512 -b 512 -ub 512 -fa off

# Advanced LoRA parameters
./bin/llama-finetune-lora \
-m model.gguf \
-f dataset.txt \
-ngl 999 -c 512 -b 512 -ub 512 \
--lora-rank 16 --lora-alpha 32 \
--lora-modules "attn_q,attn_k,attn_v,attn_o,ffn_gate,ffn_up,ffn_down" \
  -fa off

# Train only on assistant responses
./bin/llama-finetune-lora \
-m model.gguf \
-f conversations.jsonl \
--assistant-loss-only \
--chat-template custom.jinja \
  -ngl 999 -c 512 -b 128 -ub 128 -fa off
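The exact JSONL schema is documented in `evaluation/email_style_transfer/README.md`; as a purely hypothetical illustration (field names assumed, not taken from this repo), a chat-style dataset with one conversation per line might look like:

```json
{"messages": [{"role": "user", "content": "Does metformin reduce cardiovascular risk?"}, {"role": "assistant", "content": "Yes. Trial evidence suggests a modest reduction, particularly in overweight patients."}]}
```

With `--assistant-loss-only`, loss is computed only over the assistant turns, so user text and template tokens are masked out of the gradient.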
# Save checkpoints every 50 steps
./bin/llama-finetune-lora \
-m model.gguf \
-f dataset.txt \
--checkpoint-save-steps 50 \
--checkpoint-save-dir "./checkpoints" \
-ngl 999
# Resume from checkpoint
./bin/llama-finetune-lora \
-m model.gguf \
-f dataset.txt \
--resume-from "./checkpoints/checkpoint_step_00000150/" \
--output-adapter improved_adapter.gguf \
  -ngl 999

# Inference with LoRA adapter
./bin/llama-cli \
-m base_model.gguf \
--lora trained_adapter.gguf \
-ngl 999 \
-p "Your prompt here"Our implementation augments pretrained weights with low-rank updates:
W' = W + α(AB)

Where:

- W: frozen base model weights
- A ∈ ℝ^(d×r), B ∈ ℝ^(r×d): trainable low-rank matrices
- r: LoRA rank (typically 8-32)
- α: scaling factor

Only matrices A and B are updated during training, reducing trainable parameters by orders of magnitude.
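A minimal NumPy sketch of the forward pass with the low-rank update (illustrative only; the real implementation lives in the GGML training graph, and we assume the common `lora_alpha / rank` scaling convention):

```python
import numpy as np

d, r, lora_alpha = 1024, 16, 32.0   # hidden size, LoRA rank, alpha

W = np.random.randn(d, d).astype(np.float32)           # frozen base weights
A = (0.01 * np.random.randn(d, r)).astype(np.float32)  # trainable
B = np.zeros((r, d), dtype=np.float32)                 # trainable; zero-init so W' == W at step 0

def lora_forward(x: np.ndarray) -> np.ndarray:
    """y = x @ W' with W' = W + α(AB), taking α = lora_alpha / r."""
    scale = lora_alpha / r
    return x @ W + scale * ((x @ A) @ B)

# Trainable parameters: 2*d*r = 32,768 vs d*d = 1,048,576 for full
# fine-tuning of this single matrix (~3%); the gap widens with hidden size.
```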
Problem: Adreno GPUs have an undocumented 128 MiB SSBO limit that causes device-loss errors.
Solution: Dynamically tile large matrix operations based on input shapes:
- Calculate tile dimensions that respect 128MiB limit
- Execute operations on tiles independently
- Assemble results into final output tensor
This enables stable training on mobile GPUs where static approaches fail.
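A minimal sketch of how tile bounds could be derived so each buffer stays under the limit, assuming row-wise tiling of an (M×K)·(K×N) matmul with float32 data (hypothetical helper, not the actual Vulkan backend code; the real algorithm also handles the other dimensions):

```python
SSBO_LIMIT = 128 * 1024 * 1024  # 128 MiB per-buffer limit observed on Adreno

def row_tiles(M: int, K: int, N: int, elem_size: int = 4):
    """Yield (start, end) row ranges of an (M x K) @ (K x N) matmul such
    that each tile's input slice and output slice fit under SSBO_LIMIT."""
    widest = max(K, N)  # per-row footprint is dominated by the wider of K, N
    rows_per_tile = max(1, SSBO_LIMIT // (widest * elem_size))
    for start in range(0, M, rows_per_tile):
        yield start, min(start + rows_per_tile, M)

# Example: tiling a 200_000 x 4096 activation matrix yields chunks of at
# most 8192 rows (~128 MiB each) instead of one ~3.3 GB allocation.
```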
┌──────────────────────────────────────┐
│         llama.cpp Public API         │
│   (llama_lora_training_init, etc.)   │
├──────────────────────────────────────┤
│           GGML Core Engine           │
│  (Forward/Backward Pass, Optimizer)  │
├──────────┬──────────┬────────────────┤
│  Vulkan  │  Metal   │      CUDA      │
│  (Cross) │  (Apple) │    (NVIDIA)    │
└──────────┴──────────┴────────────────┘
     │          │           │
┌──────────┬──────────┬────────────────┐
│  Adreno  │  Apple   │      RTX       │
│  Mali    │  M/A     │      AMD       │
│  Intel   │  Series  │      etc.      │
└──────────┴──────────┴────────────────┘
We welcome contributions! Areas of interest:
- 🔧 Optimizations for specific GPU architectures
- 📱 Testing on additional mobile devices
- 🏗️ Support for new model architectures
- 📊 Benchmark contributions
- 📝 Documentation improvements
Please open an issue or pull request on GitHub to contribute.
- ⚠️ Qwen3-4B causes OOM on most mobile devices → use 1.7B or smaller
- ⚠️ iOS may suspend background training → keep the app in the foreground
- ⚠️ Mali G715 training is slower than Adreno → functional but requires patience
- ⚠️ Flash attention not yet supported on Vulkan → use `-fa off`
- ⚠️ Multi-GPU training is experimental → use a single GPU
If you use this work in your research, please cite:
@article{qvac-fabric,
title={An Edge-First Generalized LLM LoRA Fine-Tuning Framework for Heterogeneous GPUs},
author={[Subash, Akshay, Patrik, Milan, Nurman]},
journal={arXiv preprint},
year={2025}
}

This work builds on:
- llama.cpp - Foundation inference engine
- LoRA (Hu et al., 2021) - Parameter-efficient fine-tuning method
- PubMedQA - Jin, Qiao, et al. "PubMedQA: A Dataset for Biomedical Research Question Answering." Proceedings of EMNLP-IJCNLP. 2019.
This project is licensed under the Apache 2.0 License.
- 🌐 Project Website
- 📦 Release Downloads
- 💬 Discussion Forum
- 🐛 Issue Tracker
- 📚 Full Documentation