
An Edge-First Generalized LLM LoRA Fine-Tuning Framework for Heterogeneous GPUs

The first truly cross-platform LoRA fine-tuning solution for Large Language Models
From smartphones to datacenters • No vendor lock-in • Privacy-preserving on-device training

Quick Start • Downloads • Datasets • Research • Benchmarks


🎯 What Makes This Different?

The Problem: LLM fine-tuning has been locked to NVIDIA GPUs and CUDA. Mobile devices, AMD/Intel GPUs, and Apple Silicon were left behind.

Our Solution: A unified LoRA fine-tuning framework that works on any modern GPU:

| Platform | Hardware |
|----------|----------|
| 📱 Android | Qualcomm Adreno, ARM Mali |
| 🍎 iOS/macOS | Apple Silicon (A-series, M-series) |
| 🖥️ Windows/Linux | AMD, Intel, NVIDIA GPUs |

Key Innovation: A novel dynamic tiling algorithm enables stable training on mobile GPUs despite hard hardware memory limits.


🔬 Research Highlights

This repository contains the implementation and artifacts for our article:

"An Edge-First Generalized LLM LoRA Fine-Tuning Framework for Heterogeneous GPUs"

Key Contributions

  1. 🌍 Cross-Platform LoRA Framework - First unified solution for parameter-efficient fine-tuning across heterogeneous consumer hardware
  2. 📱 Mobile GPU Support - First successful fine-tuning on Adreno, Mali, and Apple mobile GPUs
  3. 🎓 Instruction-Tuning - Masked-loss training for instruction-following alignment
  4. ⚡ Modern Architecture Support - Extended llama.cpp to support Qwen3 and Gemma3 fine-tuning
  5. 🔧 Hardware Innovation - Dynamic tiling algorithm solves critical Adreno GPU memory constraints

🚀 Empowering the Community with Open Resources

To accelerate development and innovation, Tether Data is publicly releasing the pre-built binaries, datasets, and evaluation artifacts in this repository:

Validated Performance

  • ✅ Quality Parity: 45-48% win rate vs PyTorch/HuggingFace (LLM-as-judge)
  • ✅ Domain Adaptation: 79-94% accuracy on biomedical Q&A tasks
  • ✅ Production Scale: Tested on 6 GPU architectures, 5 model families, 4 quantization levels

📊 View detailed benchmarks | 📄 Research paper: Coming soon


🗺️ Navigation Guide: Where to Find What

🚀 Getting Started

  • First time? Start with Quick Start section below
  • Platform-specific setup? Go to releases/[your-platform]/README.md
  • Download binaries? Browse releases/ directory

📊 Datasets & Examples

  • Training datasets: evaluation/email_style_transfer/
  • Dataset format guide: evaluation/email_style_transfer/README.md
  • How to perform custom fine-tuning: evaluation/README.md

🧪 Evaluation & Testing

  • Run model comparisons: Use scripts in evaluation/scripts/
  • View benchmark results: docs/BENCHMARKS.md (comprehensive)
  • Detailed experiment reports: evaluation/reports/ directory
  • Compare base vs fine-tuned: evaluation/scripts/compare_base_vs_adapters.py

📖 Documentation & Research

  • Complete benchmarks: docs/BENCHMARKS.md (all platforms, metrics)
  • Methodology & results: evaluation/reports/COMPLETE_PROJECT_REPORT.md
  • Biomedical case study: evaluation/reports/BIOMED_FINETUNING_REPORT.md

💡 Common Tasks

| Task | Location |
|------|----------|
| Download binaries | releases/[platform]/ |
| Get training data | evaluation/email_style_transfer/email_dataset.jsonl |
| See platform benchmarks | docs/BENCHMARKS.md |
| Run evaluation scripts | evaluation/scripts/ |
| View experiment results | evaluation/reports/ |
| Platform setup guide | releases/[platform]/README.md |

πŸ“ Repository Structure

qvac-fabric/
├── README.md                      # This file - main documentation
│
├── docs/                          # 📖 Research Documentation
│   └── BENCHMARKS.md              # Comprehensive performance metrics across all platforms
│
├── evaluation/                    # 🧪 Datasets, Scripts & Results
│   ├── README.md                  # Evaluation guide and methodology
│   │
│   ├── biomedical_qa/             # Biomedical Question-Answering Dataset
│   │   └── biomedical_qa.zip      # PubMedQA-derived dataset (330 examples)
│   │
│   ├── email_style_transfer/      # Personal Email Style Transfer Dataset
│   │   ├── email_dataset.jsonl    # Email conversation examples
│   │   └── README.md              # Usage and format documentation
│   │
│   ├── scripts/                   # Python Evaluation & Monitoring Tools
│   │   ├── compare_base_vs_adapters.py    # Compare base model vs fine-tuned
│   │   ├── monitor_training.py            # Real-time training monitoring
│   │   ├── quick_compare.py               # Quick model comparison
│   │   ├── test_biomed_prompts.py         # Test prompt accuracy
│   │   └── build_biomed_yn_dataset.py     # Dataset preprocessing
│   │
│   └── reports/                   # Experimental Results & Analysis
│       ├── BIOMED_FINETUNING_REPORT.md        # Detailed results
│       ├── COMPLETE_PROJECT_REPORT.md         # Full project overview
│       ├── README_inference_comparison.md     # Inference benchmarks
│       └── biomed_comparison_results.json     # Structured results data
│
└── releases/                      # 📦 Pre-built Binaries
    ├── README.md                  # Platform overview and installation
    │
    ├── android/                   # Android (Termux) Builds
    │   ├── README.md              # Android-specific setup guide
    │   └── qvac-android-adreno-arm64-v1.0.zip
    │
    ├── ios/                       # iOS Builds
    │   ├── README.md              # iOS-specific setup guide
    │   └── qvac-ios-v1.0.zip
    │
    ├── linux/                     # Linux Builds (Multiple Backends)
    │   ├── README.md              # Linux setup and backend selection
    │   ├── qvac-linux-arm64-v1.0.zip         # ARM64 CPU build
    │   ├── qvac-linux-vulkan-x64-v1.0.zip    # AMD/NVIDIA/Intel Vulkan
    │   └── qvac-linux-sycl-intel-v1.0.zip    # Intel GPU SYCL
    │
    └── macos/                     # macOS Builds
        ├── README.md              # macOS setup guide
        ├── qvac-macos-apple-silicon-v1.0.zip  # M1/M2/M3/M4
        └── qvac-macos-intel-v1.0.zip          # Intel x64

🚀 Quick Start

Choose Your Platform

📱 Android (Termux)
# Download pre-built binary for your device
wget https://github.com/tetherto/qvac-fabric/releases/download/v1.0/qvac-android-adreno-arm64-v1.0.zip
unzip qvac-android-adreno-arm64-v1.0.zip
cd qvac-android-adreno-arm64-v1.0

# Set library path (required for Android)
export LD_LIBRARY_PATH=.

# Download model
mkdir -p models
wget https://huggingface.co/Qwen/Qwen3-0.6B-GGUF/resolve/main/qwen3-0_6b-q8_0.gguf -O models/qwen3-0.6b-q8_0.gguf

# Download and extract biomedical dataset
wget https://github.com/tetherto/qvac-rnd-fabric-llm-finetune/raw/main/evaluation/biomedical_qa/biomedical_qa.zip
unzip biomedical_qa.zip

# Quick test with biomedical dataset
./bin/llama-finetune-lora \
  -m models/qwen3-0.6b-q8_0.gguf \
  -f biomedical_qa/train.jsonl \
  --assistant-loss-only \
  -c 128 -b 128 -ub 128 -ngl 99 -fa off \
  --num-epochs 2

📖 Full Android Guide

🍎 macOS (Apple Silicon)
# Download pre-built binary
curl -L https://github.com/tetherto/qvac-fabric/releases/download/v1.0/qvac-macos-apple-silicon-v1.0.zip -o qvac-macos.zip
unzip qvac-macos.zip
cd qvac-macos-apple-silicon-v1.0

# Download model
mkdir -p models
curl -L https://huggingface.co/Qwen/Qwen3-1.7B-GGUF/resolve/main/qwen3-1_7b-q8_0.gguf -o models/qwen3-1.7b-q8_0.gguf

# Download and extract biomedical dataset
curl -L https://github.com/tetherto/qvac-rnd-fabric-llm-finetune/raw/main/evaluation/biomedical_qa/biomedical_qa.zip -o biomedical_qa.zip
unzip biomedical_qa.zip

# Quick test with biomedical dataset
./bin/llama-finetune-lora \
  -m models/qwen3-1.7b-q8_0.gguf \
  -f biomedical_qa/train.jsonl \
  --assistant-loss-only \
  -c 128 -b 128 -ub 128 -ngl 999 -fa off \
  --num-epochs 3

📖 Full macOS Guide

🖥️ Linux/Windows (AMD/Intel/NVIDIA)
# Download binary for your GPU
# For AMD/Intel/NVIDIA (Vulkan):
wget https://github.com/tetherto/qvac-fabric/releases/download/v1.0/qvac-linux-vulkan-x64-v1.0.zip

unzip qvac-linux-vulkan-x64-v1.0.zip
cd qvac-linux-vulkan-x64-v1.0

# Download model
mkdir -p models
wget https://huggingface.co/Qwen/Qwen3-1.7B-GGUF/resolve/main/qwen3-1_7b-q8_0.gguf -O models/qwen3-1.7b-q8_0.gguf

# Download and extract biomedical dataset
wget https://github.com/tetherto/qvac-rnd-fabric-llm-finetune/raw/main/evaluation/biomedical_qa/biomedical_qa.zip
unzip biomedical_qa.zip

# Run biomedical fine-tuning
./bin/llama-finetune-lora \
  -m models/qwen3-1.7b-q8_0.gguf \
  -f biomedical_qa/train.jsonl \
  --assistant-loss-only \
  -c 128 -b 128 -ub 128 -ngl 999 -fa off \
  --learning-rate 1e-5 --lr-min 1e-8 \
  --lr-scheduler cosine --warmup-ratio 0.1 \
  --num-epochs 8 \
  --lora-modules "attn_q,attn_k,attn_v,attn_o,ffn_gate,ffn_up,ffn_down"

📖 Full Linux Guide


📦 Downloads

Pre-built binaries optimized for each platform:

| Platform | Hardware | Backend | Size | Download |
|----------|----------|---------|------|----------|
| Android | Qualcomm Adreno, ARM Mali | Vulkan | 180 MB | 📥 Download |
| macOS | Apple M1/M2/M3/M4 | Metal | 35 MB | 📥 Download |
| macOS | Intel x64 | CPU | 36 MB | 📥 Download |
| iOS | Apple A-series | Metal | 1.3 MB | 📥 Download |
| Linux/Win | AMD/Intel/NVIDIA | Vulkan | 55 MB | 📥 Download |
| Linux | ARM64 | CPU | 37 MB | 📥 Download |
| Linux | Intel GPU | SYCL | 56 MB | 📥 Download |

What's Included

Each download contains pre-built binaries:

  • ✅ llama-finetune-lora - LoRA fine-tuning binary
  • ✅ llama-finetune - Full fine-tuning binary
  • ✅ llama-cli - Inference and interactive chat
  • ✅ llama-quantize - Model quantization tool
  • ✅ llama-perplexity - Model evaluation tool
  • ✅ llama-export-lora - Export/merge LoRA adapters
  • ✅ All required libraries (GGML, Vulkan/Metal backends)

Note: Datasets and examples are available in the evaluation directory of this repository.
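
A quick smoke test after unpacking (a minimal sketch; llama-cli ships in every bundle):

# Confirm the binaries run and report their build info
./bin/llama-cli --version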

📖 All Releases & Documentation


📊 Datasets

We provide curated, privacy-safe datasets for reproducible fine-tuning research. See the evaluation/ directory for available datasets and documentation.
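
For orientation, here is a rough sketch of the conversational JSONL shape that the -f / --assistant-loss-only examples in this README consume. The authoritative schema lives in evaluation/email_style_transfer/README.md; the field names and record below are illustrative, not copied from the datasets:

# One JSON object per line; roles follow the ChatML convention mentioned under Key Features (illustrative)
cat > toy_dataset.jsonl <<'EOF'
{"messages": [{"role": "user", "content": "Is aspirin an anti-inflammatory drug?"}, {"role": "assistant", "content": "Yes."}]}
EOF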


🎯 Key Features

Training Capabilities

  • 🎯 Full Fine-tuning & LoRA - Support for both full model updates and parameter-efficient LoRA
  • 🔄 Instruction Fine-Tuning - Masked-loss training on assistant tokens only
  • 📝 Chat Templates - Built-in ChatML + custom Jinja template support
  • 💾 Checkpointing - Resume training with complete optimizer state
  • 📊 Learning Rate Scheduling - Cosine annealing with warmup
  • 📦 Quantization - Train and infer with F32, F16, Q8_0, Q4_0 (see the sketch below)
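
For example, the bundled llama-quantize tool converts a model between precisions before training (a minimal sketch, assuming an F16 GGUF is already on disk; file names are illustrative):

# Produce a Q8_0 model from an F16 one for memory-constrained devices
./bin/llama-quantize models/qwen3-1.7b-f16.gguf models/qwen3-1.7b-q8_0.gguf Q8_0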

Architecture Support

  • ✅ Qwen3 (0.6B, 1.7B, 4B)
  • ✅ Gemma-3 (1B, 4B)
  • ✅ LLaMA family
  • ✅ TinyLlama

Hardware Backends

  • 🔷 Vulkan - AMD, Intel, NVIDIA, Qualcomm Adreno, ARM Mali
  • 🍎 Metal - Apple Silicon (M-series, A-series)
  • 💚 CUDA - NVIDIA GPUs (optional; Vulkan works too)
  • 🖥️ CPU - Fallback for any platform

📈 Performance Benchmarks

Inference Speed (tokens/second)

| Model | Mali | Adreno | Intel A770 | AMD 7900 XTX | RTX 4090 | Apple M3 | iPhone 16 |
|-------|------|--------|------------|--------------|----------|----------|-----------|
| Qwen3-0.6B Q8 | 15.4 | 35.0 | 133.0 | 178.2 | 199+ | 120+ | 32.5 |
| Qwen3-1.7B Q8 | 7.8 | 17.3 | 90.0 | 158.0 | 176+ | 62-90 | 15.2 |
| Gemma-1B Q8 | 11.7 | 36.6 | 89.0 | 148.8 | 150+ | 70-90 | 33.2 |

Fine-tuning Speed (Time per Epoch, Qwen3-1.7B Q8)

| Hardware | Time/Epoch | Full Training (8 epochs) |
|----------|------------|--------------------------|
| RTX 4090 | 5.5 min | 45 min ⚡ |
| AMD 7900 XTX | 13 min | 1.7 hrs |
| Intel Arc A770 | 20 min | 2.7 hrs |
| Apple M3 Pro | 40 min | 5.3 hrs |
| iPhone 16 | 1 h 55 min | 15 hrs |
| Adreno 830 | 1 h 40 min | 13 hrs |
| Mali G715 | 7 h 40 min | 61 hrs |

📊 View complete benchmarks with detailed metrics across all platforms

Quality Comparison vs PyTorch

| Metric | qvac-fabric | PyTorch/HuggingFace |
|--------|-------------|---------------------|
| LLM-as-Judge Win Rate | 45-48% | 52-55% |
| Biomedical Accuracy | 79-94% | 78-86% |
| Cosine Similarity | 0.82 | 0.77 |

Conclusion: Near-parity quality with established frameworks, while running on 8× more hardware platforms.


🔧 Usage Examples

Basic LoRA Fine-Tuning

# Create new LoRA adapter
./bin/llama-finetune-lora \
  -m model.gguf \
  -f dataset.txt \
  -ngl 999 -c 512 -b 512 -ub 512 -fa off

Custom LoRA Configuration

# Advanced LoRA parameters
./bin/llama-finetune-lora \
  -m model.gguf \
  -f dataset.txt \
  -ngl 999 -c 512 -b 512 -ub 512 \
  --lora-rank 16 --lora-alpha 32 \
  --lora-modules "attn_q,attn_k,attn_v,attn_o,ffn_gate,ffn_up,ffn_down" \
  -fa off

Instruction Fine-Tuning (SFT)

# Train only on assistant responses
./bin/llama-finetune-lora \
  -m model.gguf \
  -f conversations.jsonl \
  --assistant-loss-only \
  --chat-template custom.jinja \
  -ngl 999 -c 512 -b 128 -ub 128 -fa off
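
The --chat-template flag points at a custom Jinja file; built-in ChatML is also supported (see Key Features), so a custom file is only needed for non-ChatML formats. A minimal ChatML-style template sketch (our illustration, not the project's shipped template):

# Write a minimal ChatML-style Jinja template (illustrative)
cat > custom.jinja <<'EOF'
{%- for message in messages -%}
<|im_start|>{{ message.role }}
{{ message.content }}<|im_end|>
{%- endfor -%}
EOF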

Checkpointing & Resume

# Save checkpoints every 50 steps
./bin/llama-finetune-lora \
  -m model.gguf \
  -f dataset.txt \
  --checkpoint-save-steps 50 \
  --checkpoint-save-dir "./checkpoints" \
  -ngl 999

# Resume from checkpoint
./bin/llama-finetune-lora \
  -m model.gguf \
  -f dataset.txt \
  --resume-from "./checkpoints/checkpoint_step_00000150/" \
  --output-adapter improved_adapter.gguf \
  -ngl 999

Using Trained Adapters

# Inference with LoRA adapter
./bin/llama-cli \
  -m base_model.gguf \
  --lora trained_adapter.gguf \
  -ngl 999 \
  -p "Your prompt here"

πŸ—οΈ Technical Architecture

LoRA Integration

Our implementation augments pretrained weights with low-rank updates:

W' = W + α(AB)

Where:

  • W: Frozen base model weights
  • A ∈ ℝ^(d×r), B ∈ ℝ^(r×d): Trainable low-rank matrices
  • r: LoRA rank (typically 8-32)
  • α: Scaling factor

Only matrices A and B are updated during training, reducing parameters by orders of magnitude.
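
As a concrete illustration (our numbers, not from the paper): for a 2048×2048 projection with rank r = 16, a full update would train d² = 2048² ≈ 4.19M parameters, while the LoRA pair trains only 2·d·r = 2·2048·16 = 65,536, a 64× reduction for that single matrix.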

Dynamic Tiling Algorithm

Problem: Adreno GPUs have an undocumented 128 MiB SSBO limit, causing device-lost (VK_ERROR_DEVICE_LOST) errors.

Solution: Dynamically tile large matrix operations based on input shapes:

  1. Calculate tile dimensions that respect the 128 MiB limit
  2. Execute operations on tiles independently
  3. Assemble results into final output tensor

This enables stable training on mobile GPUs where static approaches fail.
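
A back-of-envelope sizing sketch (ours, not the actual kernel code; the shapes are illustrative):

# How many output rows fit in one dispatch under the 128 MiB SSBO cap?
LIMIT=$((128 * 1024 * 1024))   # 128 MiB in bytes
COLS=8192                      # matmul output columns (illustrative)
BYTES_PER_ELEM=4               # fp32
MAX_ROWS=$(( LIMIT / (COLS * BYTES_PER_ELEM) ))
echo "tile at most ${MAX_ROWS} rows per dispatch"   # -> 4096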

Backend Architecture

┌─────────────────────────────────────┐
│    llama.cpp Public API             │
│  (llama_lora_training_init, etc.)   │
├─────────────────────────────────────┤
│         GGML Core Engine            │
│  (Forward/Backward Pass, Optimizer) │
├──────────┬──────────┬───────────────┤
│  Vulkan  │  Metal   │    CUDA       │
│ (Cross)  │ (Apple)  │  (NVIDIA)     │
└──────────┴──────────┴───────────────┘
     ↓           ↓            ↓
┌──────────┬──────────┬───────────────┐
│  Adreno  │  Apple   │    RTX        │
│   Mali   │  M/A     │    AMD        │
│  Intel   │  Series  │    etc.       │
└──────────┴──────────┴───────────────┘

📚 Documentation

Getting Started - the Quick Start above and releases/README.md

Advanced Topics - evaluation/README.md (custom fine-tuning) and docs/BENCHMARKS.md (detailed metrics)

Platform Guides - releases/[platform]/README.md for android, ios, linux, and macos


🤝 Contributing

We welcome contributions! Areas of interest:

  • 🔧 Optimizations for specific GPU architectures
  • 📱 Testing on additional mobile devices
  • 🏗️ Support for new model architectures
  • 📊 Benchmark contributions
  • 📝 Documentation improvements

Please open an issue or pull request on GitHub to contribute.


πŸ› Known Issues

Mobile Platforms

  • ⚠️ Qwen3-4B causes OOM on most mobile devices → use 1.7B or smaller
  • ⚠️ iOS may suspend background training → keep the app in the foreground
  • ⚠️ Mali G715 training is slower than Adreno → functional but requires patience

Desktop Platforms

  • ⚠️ Flash attention is not yet supported on Vulkan → use -fa off
  • ⚠️ Multi-GPU training is experimental → use a single GPU

πŸ“ Citation

If you use this work in your research, please cite:

@article{qvac-fabric,
  title={An Edge-First Generalized LLM LoRA Fine-Tuning Framework for Heterogeneous GPUs},
  author={[Subash, Akshay, Patrik, Milan, Nurman]},
  journal={arXiv preprint},
  year={2025}
}

🙏 Acknowledgments

This work builds on:

  • llama.cpp - Foundation inference engine
  • LoRA (Hu et al., 2021) - Parameter-efficient fine-tuning method
  • PubMedQA - Jin, Qiao, et al. "PubMedQA: A Dataset for Biomedical Research Question Answering." EMNLP-IJCNLP 2019.

📄 License

This project is licensed under the Apache 2.0 License.


Making LLM fine-tuning accessible to everyone, everywhere

From smartphones to datacenters • No vendor lock-in • Privacy-preserving

⭐ Star this repo if you find it useful!
