audiohacking/claude-code-local


🧠 Claude Code Local

Run a 122-billion-parameter AI model on your MacBook.
No cloud. No fees. No data leaves your machine.


🤔 What Is This?

Your MacBook has a powerful GPU built right into the chip. This project uses that GPU to run a massive AI model — the same kind that powers ChatGPT and Claude — entirely on your computer.

  • 🚫 No internet needed
  • 💰 No monthly subscription
  • 🔒 No one sees your code or data
  • ✅ Full Claude Code experience — write code, edit files, manage projects, control your browser

         📱 You (Mac or Phone)
          │
     🤖 Claude Code          ← the AI coding tool you know
          │
     ⚡ MLX Native Server    ← our server (200 lines of Python)
          │
     🧠 Qwen 3.5 122B        ← 122 billion parameter brain
          │
     🖥️ Apple Silicon GPU    ← your M-series chip does all the work

📱 Control From Your Phone

You don't have to be at your Mac to use this. We built a remote-control pipeline:

📱 Your iPhone                    💻 Your Mac
     │                                │
     │── iMessage ──────────────────>│
     │                                │── Claude Code
     │                                │── MLX Server
     │                                │── Qwen 3.5 122B
     │                                │── (does the work)
     │<── iMessage response ─────────│
     │                                │
   🛋️ From your couch            🖥️ At your desk

How it works:

  • 📲 Send a message from your phone via iMessage
  • 🤖 Claude Code receives it and runs the task on your local AI
  • 💬 The response comes back to your phone
  • ✈️ Works anywhere your Mac has power — even offline for the AI part

We built this before Anthropic shipped their Dispatch feature. Same concept, but ours uses iMessage and runs on your local model instead of the cloud.

💡 Pro tip: Anthropic's Dispatch doesn't read your CLAUDE.md — mention it in your message or it'll miss your custom setup. Our iMessage system doesn't have this problem.
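The plumbing behind an iMessage control loop like this isn't shown in this README, but it can be sketched. The following is a hypothetical illustration, not the repo's actual code: it assumes the standard macOS Messages store at `~/Library/Messages/chat.db` (reading it requires Full Disk Access) and replies via `osascript`.

```python
# Hypothetical sketch of an iMessage control loop, NOT the repo's actual code.
# Assumes macOS Messages' sqlite store (Full Disk Access required) and
# AppleScript for sending replies (Automation permission required).
import sqlite3
import subprocess

def fetch_new(conn, last_rowid):
    """Return (rowid, text) for incoming messages newer than last_rowid.

    Works against any table shaped like Messages' `message` table:
    incoming rows have is_from_me = 0."""
    cur = conn.execute(
        "SELECT ROWID, text FROM message "
        "WHERE ROWID > ? AND is_from_me = 0 AND text IS NOT NULL "
        "ORDER BY ROWID",
        (last_rowid,),
    )
    return cur.fetchall()

def send_imessage(recipient, body):
    """Reply via AppleScript through the Messages app."""
    script = (
        'tell application "Messages" to send "{}" to buddy "{}"'
        .format(body.replace('"', '\\"'), recipient)
    )
    subprocess.run(["osascript", "-e", script], check=True)
```

A driver would poll `fetch_new` against the real `chat.db`, hand each new message to Claude Code, and push the result back with `send_imessage`.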


📊 Benchmarks

We built and tested three different approaches. Each one got faster.

⚡ Speed Comparison

                         Tokens per Second
  🐌 Ollama (Gen 1)      ██████████████████████████████ 30 tok/s
  🏃 llama.cpp (Gen 2)   █████████████████████████████████████████ 41 tok/s
  🚀 MLX Native (Gen 3)  █████████████████████████████████████████████████████████████████ 65 tok/s

⏱️ Real-World Claude Code Task

How long it takes Claude Code to write a function:

  😴 Ollama + Proxy          ████████████████████████████████████████████ 133 seconds
  😐 llama.cpp + Proxy       ████████████████████████████████████████████ 133 seconds
  🔥 MLX Native (no proxy)   ██████ 17.6 seconds

                              7.5x faster ⚡

📋 Side-by-Side

|                  | 🐌 Ollama | 🏃 llama.cpp + TurboQuant | 🚀 MLX Native (ours) |
|------------------|-----------|---------------------------|----------------------|
| Speed            | 30 tok/s  | 41 tok/s                  | 65 tok/s             |
| Claude Code task | 133 s     | 133 s                     | 17.6 s               |
| Needs a proxy?   | ❌ Yes    | ❌ Yes                    | ✅ No                |
| Lines of code    | N/A       | N/A (C++ fork)            | ~200 Python          |
| Apple native?    | ❌ Generic | ❌ Ported                | ✅ MLX               |

☁️ vs Cloud APIs

|                      | 🖥️ Our Local Setup | ☁️ Claude Sonnet | ☁️ Claude Opus |
|----------------------|--------------------|------------------|----------------|
| Speed                | 65 tok/s           | ~80 tok/s        | ~40 tok/s      |
| Monthly cost         | $0 🎉              | $20-100+         | $20-100+       |
| Privacy              | 100% local 🔒      | Cloud            | Cloud          |
| Works offline        | Yes ✈️             | No               | No             |
| Data leaves your Mac | Never              | Always           | Always         |

💡 Our local setup beats cloud Opus on raw speed (65 vs ~40 tok/s) at $0/month.


💡 How We Got Here

Most people trying to run Claude Code locally hit the same wall:

Claude Code speaks the Anthropic API. Local model servers speak the OpenAI API. Different languages. 🤷

So everyone builds a proxy to translate between them. That proxy adds latency and complexity, and breaks things.
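To see why that translation layer exists at all, here's a minimal sketch (illustrative, not any particular proxy's code) of what a proxy has to do to every request, mapping Anthropic's Messages format onto OpenAI's chat-completions format:

```python
# Illustrative sketch of the per-request translation a proxy performs.
# Field names follow the public Anthropic Messages / OpenAI chat-completions schemas.
def anthropic_to_openai(req: dict) -> dict:
    messages = []
    if "system" in req:  # Anthropic keeps the system prompt outside `messages`
        messages.append({"role": "system", "content": req["system"]})
    for m in req["messages"]:
        content = m["content"]
        if isinstance(content, list):  # Anthropic allows lists of content blocks
            content = "".join(b.get("text", "") for b in content)
        messages.append({"role": m["role"], "content": content})
    return {
        "model": req.get("model", "local"),
        "messages": messages,
        "max_tokens": req.get("max_tokens", 1024),
    }
```

And the response has to be translated back the other way, once per round trip — that's where the latency and the breakage come from.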

We took a different approach:

| 🐌 What everyone else does           | 🚀 What we did                   |
|--------------------------------------|----------------------------------|
| Claude Code → Proxy → Ollama → Model | Claude Code → Our Server → Model |
| 3 processes, 2 API translations      | 1 process, 0 translations        |
| 133 seconds per task                 | 17.6 seconds per task            |

🎯 That one change — eliminating the proxy — made it 7.5x faster.


💻 What You Need

| Your Mac            | RAM        | What You Can Run         |
|---------------------|------------|--------------------------|
| M1/M2/M3/M4 (base)  | 8-16 GB    | 🟡 Small models (4B)     |
| M1/M2/M3/M4 Pro     | 18-36 GB   | 🟠 Medium models (32B)   |
| M2/M3/M4/M5 Max     | 64-128 GB  | 🟢 Large models (122B)   |
| M2/M3/M4 Ultra      | 128-192 GB | 🔵 Multiple large models |

Also need:

  • 🐍 Python 3.12+ (for MLX)
  • 🤖 Claude Code (`npm install -g @anthropic-ai/claude-code`)

🚀 Quick Start (4 Steps)

1️⃣ Set up the Python environment

```bash
python3.12 -m venv ~/.local/mlx-server
~/.local/mlx-server/bin/pip install mlx-lm
```

2️⃣ Download the AI model

The first run downloads ~50 GB (one time only):

```bash
~/.local/mlx-server/bin/python3 -c "
from mlx_lm import load
load('mlx-community/Qwen3.5-122B-A10B-4bit')
print('Done!')
"
```

3️⃣ Start the server

```bash
~/.local/mlx-server/bin/python3 proxy/server.py
```

4️⃣ Launch Claude Code

```bash
ANTHROPIC_BASE_URL=http://localhost:4000 \
ANTHROPIC_API_KEY=sk-local \
claude --model claude-sonnet-4-6
```

💡 Or just double-click `Claude Local.command` on your Desktop. It does all of this automatically.
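If the launcher fails, a quick way to verify step 3 worked before running step 4 is to check whether anything is listening on port 4000. This is a generic TCP liveness check, not something the repo ships:

```python
# Generic TCP liveness check for the local server; not part of the repo.
import socket

def server_is_up(host="localhost", port=4000, timeout=1.0):
    """True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```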


🔧 How It Works

┌──────────────────────────────────────────────────┐
│              Your MacBook (M5 Max)               │
│                                                  │
│  📱 You type ──> 🤖 Claude Code                  │
│                      │                           │
│                      ▼                           │
│                 ⚡ MLX Server (port 4000)        │
│                      │                           │
│                      ▼                           │
│                 🧠 Qwen 3.5 122B ──> 🖥️ GPU      │
│                      │                           │
│                      ▼                           │
│  📱 Answer <─── ✨ Clean response                │
│                                                  │
│         🔒 Nothing leaves this box. Ever.        │
└──────────────────────────────────────────────────┘

The server (`proxy/server.py`) is one file, ~200 lines. It does three things:

  1. 📦 Loads the model — Apple's MLX framework, native Metal GPU, unified memory
  2. 🔌 Speaks the Anthropic API — Claude Code thinks it's talking to Anthropic's cloud. It's not.
  3. 🧹 Cleans the output — Qwen thinks out loud in `<think>` tags. We strip those.
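Items 2 and 3 can be sketched as two small functions. This is a simplified sketch assuming the response shape of Anthropic's public Messages API; the shipped `server.py` is the real thing and also handles streaming, tool use, and more:

```python
# Simplified sketch of steps 2 and 3 above; not the shipped server.py.
import re

# Qwen emits its chain of thought inside <think>...</think> blocks.
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_think(text: str) -> str:
    """Step 3: clean the output by dropping <think> reasoning blocks."""
    return THINK_RE.sub("", text)

def to_anthropic_message(text: str, model: str = "local") -> dict:
    """Step 2: wrap a completion in the minimal Messages-API response
    shape Claude Code expects back from POST /v1/messages."""
    return {
        "type": "message",
        "role": "assistant",
        "model": model,
        "content": [{"type": "text", "text": strip_think(text)}],
        "stop_reason": "end_turn",
    }
```

Because the server answers in this format natively, Claude Code needs no proxy and no translation step.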

🌐 Browser Control

Claude Code can control your real web browser — not a sandbox. Your actual browser, with all your logins. 🔓

|           | 🟢 Chrome DevTools (CDP) | 🔵 Playwright               |
|-----------|--------------------------|-----------------------------|
| Controls  | Your real Brave/Chrome   | A separate sandboxed browser |
| Logged in? | ✅ All your sessions    | ❌ Starts fresh             |
| Speed     | ⚡ Fast                  | 🐌 Slower                   |
| Best for  | Daily tasks              | Automated jobs              |

💡 Example: "Go to my GitHub and check which PRs need review" — it opens your actual browser, already logged in, and does it. No re-authenticating. Ever.
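For a taste of how CDP attachment works, here's a generic sketch (not the repo's browser agent): launch the browser with `--remote-debugging-port=9222`, and the DevTools HTTP endpoint will list and open tabs for you.

```python
# Generic Chrome DevTools Protocol sketch, not the repo's browser agent.
# Assumes the browser was started with --remote-debugging-port=9222.
import json
from urllib.request import Request, urlopen

CDP = "http://localhost:9222"

def page_tabs(tabs):
    """Keep only real pages (skip extensions, workers, devtools targets)."""
    return [t for t in tabs if t.get("type") == "page"]

def list_tabs():
    """Ask the DevTools HTTP endpoint for all open targets."""
    with urlopen(f"{CDP}/json/list") as resp:
        return page_tabs(json.load(resp))

def open_tab(url):
    """Open a URL in the real (logged-in) browser; recent Chrome requires PUT."""
    with urlopen(Request(f"{CDP}/json/new?{url}", method="PUT")) as resp:
        return json.load(resp)
```

Playwright wraps the same protocol, but against its own freshly launched browser — hence the fresh-session column above.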


✈️ When To Use This

| Situation                   | Use This? | Why                                        |
|-----------------------------|-----------|--------------------------------------------|
| On a plane                  | ✅        | Full AI coding, no internet needed         |
| Sensitive client code       | ✅        | Nothing leaves your machine                |
| Don't want API fees         | ✅        | $0/month forever                           |
| Want the fastest possible   | ☁️        | Cloud APIs are still faster                |
| Need Claude-level reasoning | ☁️        | The local model is good, not Claude-level  |
| Controlling from your phone | ✅        | The iMessage pipeline works offline        |

πŸ“ What's In This Repo

πŸ“¦ claude-code-local/
 β”œβ”€β”€ ⚑ proxy/
 β”‚   └── server.py              ← The entire server. 200 lines. This IS the project.
 β”œβ”€β”€ πŸš€ launchers/
 β”‚   β”œβ”€β”€ Claude Local.command    ← Double-click to start everything
 β”‚   └── Browser Agent.command   ← Double-click for browser control
 β”œβ”€β”€ πŸ› οΈ scripts/
 β”‚   β”œβ”€β”€ download-and-import.sh  ← Download models
 β”‚   β”œβ”€β”€ persistent-download.sh  ← Auto-retry downloader
 β”‚   └── start-mlx-server.sh    ← Alternative config
 β”œβ”€β”€ πŸ“Š docs/
 β”‚   β”œβ”€β”€ BENCHMARKS.md           ← Detailed speed comparisons
 β”‚   └── TWITTER-THREAD.md       ← Social media content
 └── setup.sh                    ← One-command installer

🔒 Security

We audited every component before running it:

| Component      | Source                | Network Calls | Verdict |
|----------------|-----------------------|---------------|---------|
| server.py      | We wrote it           | 0             | ✅ Safe |
| MLX framework  | Apple                 | 0             | ✅ Safe |
| Qwen 3.5 model | HuggingFace, verified | 0             | ✅ Safe |

🚫 No telemetry · 🚫 No analytics · 🚫 No phone-home · 🚫 No sketchy pip packages

⚠️ We removed LiteLLM after supply-chain attack concerns were raised. Every dependency was audited.


πŸ›€οΈ The Journey

We didn't start here. We went through three generations in one night:

Gen What We Tried Speed πŸ’‘ What We Learned
1️⃣ Ollama + custom proxy 30 tok/s Ollama works but Claude Code can't talk to it directly
2️⃣ llama.cpp TurboQuant + proxy 41 tok/s TurboQuant compresses KV cache 4.9x, but the proxy is the bottleneck
3️⃣ MLX native server 65 tok/s Kill the proxy. Speak Anthropic API directly. 7.5x faster.

🎯 Each generation taught us something. The final insight β€” the proxy was the bottleneck, not the model β€” changed everything.


πŸ™ Credits

Built on the shoulders of giants:

Project What It Does By
πŸ€– Claude Code AI coding agent Anthropic
🍎 MLX Apple Silicon ML framework Apple
πŸ“¦ mlx-lm Model loading + inference Apple
🧠 Qwen 3.5 The 122B model Alibaba
⚑ TurboQuant KV cache compression research Google Research

Tested on Apple M5 Max with 128 GB unified memory.


πŸ“œ MIT License β€” Use it however you want.

⭐ Star this repo if it helped you! ⭐
