Run a 122-billion-parameter AI on your MacBook.
No cloud. No fees. No data leaves your machine.
Your MacBook has a powerful GPU built right into the chip. This project uses that GPU to run a massive AI model, the same kind that powers ChatGPT and Claude, entirely on your computer.
- No internet needed
- No monthly subscription
- No one sees your code or data
- Full Claude Code experience: write code, edit files, manage projects, control your browser
```
You (Mac or Phone)
        ↓
Claude Code         - the AI coding tool you know
        ↓
MLX Native Server   - our server (200 lines of Python)
        ↓
Qwen 3.5 122B       - a 122-billion-parameter brain
        ↓
Apple Silicon GPU   - your M-series chip does all the work
```
You don't have to be at your Mac to use this. We built a remote-control pipeline:

```
Your iPhone                          Your Mac
     |                                  |
     |---- iMessage ------------------> |
     |                                  |--- Claude Code
     |                                  |--- MLX Server
     |                                  |--- Qwen 3.5 122B
     |                                  |--- (does the work)
     | <--- iMessage response --------- |
     |                                  |
From your couch                    At your desk
```
How it works:
- Send a message from your phone via iMessage
- Claude Code receives it and runs the task on your local AI
- The response comes back to your phone

Works anywhere your Mac has power, even offline for the AI part.
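How might the receive side of such a pipeline work? One common approach on macOS is to poll the Messages store, a SQLite database at `~/Library/Messages/chat.db`. Below is a minimal sketch under that assumption; the helper name and demo database are hypothetical, not the project's actual code, though the table and column names (`message`, `text`, `is_from_me`) follow the known `chat.db` schema:

```python
import os
import sqlite3
import tempfile

def new_incoming_messages(db_path, since_rowid=0):
    """Return (ROWID, text) pairs for incoming texts newer than since_rowid.

    Polling by ROWID lets a daemon remember the last message it handled
    and pick up only new ones on each pass.
    """
    con = sqlite3.connect(db_path)
    try:
        return con.execute(
            "SELECT ROWID, text FROM message "
            "WHERE ROWID > ? AND is_from_me = 0 AND text IS NOT NULL "
            "ORDER BY ROWID",
            (since_rowid,),
        ).fetchall()
    finally:
        con.close()

# Demo against a throwaway database that mimics the schema:
demo = os.path.join(tempfile.gettempdir(), "demo-chat.db")
if os.path.exists(demo):
    os.remove(demo)
con = sqlite3.connect(demo)
con.execute("CREATE TABLE message (text TEXT, is_from_me INTEGER)")
con.execute("INSERT INTO message (text, is_from_me) VALUES ('run the tests', 0)")
con.commit()
con.close()
print(new_incoming_messages(demo))  # → [(1, 'run the tests')]
```

A real daemon would loop on this, hand each new text to Claude Code, and reply through the Messages app.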
We built this before Anthropic shipped their Dispatch feature. Same concept, but ours uses iMessage and runs on your local model instead of the cloud.
Pro tip: Anthropic's Dispatch doesn't read your CLAUDE.md. Mention it in your message or it'll miss your custom setup. Our iMessage system doesn't have this problem.
We built and tested three different approaches. Each one got faster.
```
Tokens per second

Ollama (Gen 1)        ###############                    30 tok/s
llama.cpp (Gen 2)     #####################              41 tok/s
MLX Native (Gen 3)    #################################  65 tok/s
```
How long it takes Claude Code to write a function:

```
Ollama + proxy          ######################################  133 s
llama.cpp + proxy       ######################################  133 s
MLX Native (no proxy)   #####                                   17.6 s
```

7.5x faster.
| | Ollama | llama.cpp + TurboQuant | MLX Native (ours) |
|---|---|---|---|
| Speed | 30 tok/s | 41 tok/s | 65 tok/s |
| Claude Code task | 133 s | 133 s | 17.6 s |
| Needs a proxy? | Yes | Yes | No |
| Lines of code | N/A | N/A (C++ fork) | ~200 Python |
| Apple native? | Generic | Ported | MLX-native |
| | Our Local Setup | Claude Sonnet (cloud) | Claude Opus (cloud) |
|---|---|---|---|
| Speed | 65 tok/s | ~80 tok/s | ~40 tok/s |
| Monthly cost | $0 | $20-100+ | $20-100+ |
| Privacy | 100% local | Cloud | Cloud |
| Works offline | Yes | No | No |
| Data leaves your Mac | Never | Always | Always |

Our local setup beats cloud Opus on raw speed (65 vs ~40 tok/s) at $0/month.
Most people trying to run Claude Code locally hit the same wall:
Claude Code speaks the Anthropic API. Local models speak the OpenAI API. Different languages.
So everyone builds a proxy to translate between them. That proxy adds latency and complexity, and it breaks things.
We took a different approach:
| What everyone else does | What we did |
|---|---|
| Claude Code → Proxy → Ollama → Model | Claude Code → Our Server → Model |
| 3 processes, 2 API translations | 1 process, 0 translations |
| 133 seconds per task | 17.6 seconds per task |
That one change, eliminating the proxy, made it 7.5x faster.
| Your Mac | RAM | What You Can Run |
|---|---|---|
| M1/M2/M3/M4 (base) | 8-16 GB | Small models (4B) |
| M1/M2/M3/M4 Pro | 18-36 GB | Medium models (32B) |
| M2/M3/M4/M5 Max | 64-128 GB | Large models (122B) |
| M2/M3/M4 Ultra | 128-192 GB | Multiple large models |
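Why a 122B model needs the Max tier: a quick back-of-the-envelope, counting only the quantized weights (the KV cache and activations add several more GB on top):

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough weight footprint in GB: parameter count times bits per weight.

    Ignores KV cache and activation overhead, so treat the result as a
    lower bound on required unified memory.
    """
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# The 4-bit Qwen 3.5 122B build used here:
print(round(weight_memory_gb(122, 4), 1))  # → 61.0 (GB), hence the 64-128 GB tier
```

At 4 bits per weight the model just fits in 64 GB, and breathes comfortably in 128 GB.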
Also need:
- Python 3.12+ (for MLX)
- Claude Code (`npm install -g @anthropic-ai/claude-code`)
```sh
python3.12 -m venv ~/.local/mlx-server
~/.local/mlx-server/bin/pip install mlx-lm
```

First run downloads ~50 GB (one time only):

```sh
~/.local/mlx-server/bin/python3 -c "
from mlx_lm.utils import load
load('mlx-community/Qwen3.5-122B-A10B-4bit')
print('Done!')
"
```

Start the server:

```sh
~/.local/mlx-server/bin/python3 proxy/server.py
```

Then point Claude Code at it:

```sh
ANTHROPIC_BASE_URL=http://localhost:4000 \
ANTHROPIC_API_KEY=sk-local \
claude --model claude-sonnet-4-6
```

Or just double-click `Claude Local.command` on your Desktop. It does all of this automatically.
```
+--------------------------------------------------+
|  Your MacBook (M5 Max)                           |
|                                                  |
|  You type --> Claude Code                        |
|                   |                              |
|                   v                              |
|  MLX Server (port 4000)                          |
|                   |                              |
|                   v                              |
|  Qwen 3.5 122B --> Apple Silicon GPU             |
|                   |                              |
|                   v                              |
|  Answer <-- Clean response                       |
|                                                  |
|  Nothing leaves this box. Ever.                  |
+--------------------------------------------------+
```
The server (`proxy/server.py`) is one file, ~200 lines. It does three things:
- Loads the model using Apple's MLX framework: native Metal GPU, unified memory
- Speaks the Anthropic API, so Claude Code thinks it's talking to Anthropic's cloud. It's not.
- Cleans the output: Qwen thinks out loud in `<think>` tags. We strip those.
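The last two steps can be sketched in a few lines. This is not the project's actual code, just an illustration of the shape of the work (function names are hypothetical): strip the reasoning tags, then wrap the cleaned text in an Anthropic Messages API-style response body.

```python
import json
import re

def strip_think(text: str) -> str:
    """Remove Qwen's <think>...</think> reasoning blocks from raw output."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

def to_anthropic_message(text: str, model: str = "qwen3.5-122b-local") -> dict:
    """Wrap cleaned text in the Anthropic Messages API response shape that
    Claude Code expects back from POST /v1/messages."""
    return {
        "type": "message",
        "role": "assistant",
        "model": model,
        "content": [{"type": "text", "text": strip_think(text)}],
        "stop_reason": "end_turn",
    }

raw = "<think>User wants a greeting; keep it short.</think>Hello!"
print(json.dumps(to_anthropic_message(raw), indent=2))
```

Because the server emits this shape directly, no OpenAI-to-Anthropic translation proxy is needed in front of it.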
Claude Code can control your real web browser, not a sandbox. Your actual browser, with all your logins.
| | Chrome DevTools (CDP) | Playwright |
|---|---|---|
| Controls | Your real Brave/Chrome | A separate sandboxed browser |
| Logged in? | All your sessions | Starts fresh |
| Speed | Fast | Slower |
| Best for | Daily tasks | Automated jobs |
Example: ask "Go to my GitHub and check which PRs need review" and it opens your actual browser, already logged in, and does it. No re-authenticating. Ever.
| Situation | Use This? | Why |
|---|---|---|
| On a plane | Yes | Full AI coding, no internet needed |
| Sensitive client code | Yes | Nothing leaves your machine |
| Don't want API fees | Yes | $0/month forever |
| Want the fastest option | Maybe not | The cloud API is still faster |
| Need Claude-level reasoning | Maybe not | The local model is good, not Claude-level |
| Controlling from your phone | Yes | The iMessage pipeline works offline |
```
claude-code-local/
├── proxy/
│   └── server.py              - The entire server. ~200 lines. This IS the project.
├── launchers/
│   ├── Claude Local.command   - Double-click to start everything
│   └── Browser Agent.command  - Double-click for browser control
├── scripts/
│   ├── download-and-import.sh - Download models
│   ├── persistent-download.sh - Auto-retry downloader
│   └── start-mlx-server.sh    - Alternative config
├── docs/
│   ├── BENCHMARKS.md          - Detailed speed comparisons
│   └── TWITTER-THREAD.md      - Social media content
└── setup.sh                   - One-command installer
```
We audited every component before running it:
| Component | Source | Network Calls | Verdict |
|---|---|---|---|
| server.py | We wrote it | 0 | Safe |
| MLX framework | Apple | 0 | Safe |
| Qwen 3.5 model | HuggingFace, verified | 0 | Safe |

No telemetry. No analytics. No phone-home. No sketchy pip packages.

We removed LiteLLM after supply-chain-attack concerns were raised. Every dependency was audited.
We didn't start here. We went through three generations in one night:
| Gen | What We Tried | Speed | What We Learned |
|---|---|---|---|
| 1 | Ollama + custom proxy | 30 tok/s | Ollama works, but Claude Code can't talk to it directly |
| 2 | llama.cpp TurboQuant + proxy | 41 tok/s | TurboQuant compresses the KV cache 4.9x, but the proxy is the bottleneck |
| 3 | MLX native server | 65 tok/s | Kill the proxy. Speak the Anthropic API directly. 7.5x faster. |
Each generation taught us something. The final insight, that the proxy was the bottleneck rather than the model, changed everything.
Built on the shoulders of giants:
| Project | What It Does | By |
|---|---|---|
| Claude Code | AI coding agent | Anthropic |
| MLX | Apple Silicon ML framework | Apple |
| mlx-lm | Model loading + inference | Apple |
| Qwen 3.5 | The 122B model | Alibaba |
| TurboQuant | KV cache compression research | Google Research |
Tested on Apple M5 Max with 128 GB unified memory.
MIT License. Use it however you want.
Star this repo if it helped you!