Llama 70B, Flux image generation, Kokoro TTS, Whisper STT — all running locally on an M1 Max with 64GB unified memory. The practical setup, benchmarks, and honest assessment of what's ready and what isn't.
In early 2025 I bought a MacBook Pro with the M1 Max and 64GB of unified memory. The primary reason was local AI inference — the ability to run large language models, image generation, speech synthesis, and speech recognition entirely offline, at no marginal cost per request.
Eleven months later, I have a working local AI stack that I use daily. Some of it has exceeded my expectations. Some of it hasn't. This is an honest assessment of what actually works, what the real limitations are, and whether it's worth the investment.
The case for local inference over cloud APIs comes down to four things:
Cost. Cloud inference has a marginal cost per request. Running Llama 70B locally has zero marginal cost. For development work — which involves a lot of iterative prompting, testing edge cases, and generating test data — the API bills accumulate fast. At my development intensity, I was spending ₹15,000–25,000/month on Claude and GPT-4 API calls before going local.
Privacy. Anything I send to a cloud API may become training data, and is at minimum retained for some period. Code I'm working on, documents I'm analyzing, internal knowledge I'm querying against — none of that should leave the machine.
Offline operation. I work on trains, in areas with poor connectivity, and at hours when services have outages. Local models don't have availability dependencies.
No rate limits. I can run inference in a tight loop for hours. For automated evaluation pipelines, document processing, and batch generation, rate limits are a genuine bottleneck on cloud APIs.
The M1 Max with 64GB unified memory is specifically good for AI inference for one reason: unified memory means the CPU, GPU, and neural engine share the same memory pool. There's no PCIe bandwidth bottleneck between CPU and discrete GPU — the weights live in unified memory and the GPU reads them at full memory bandwidth.
For an Nvidia setup, you need a GPU with enough VRAM to hold the model weights entirely. A 70B model in 4-bit quantization is about 40GB. Consumer Nvidia cards top out at 24GB VRAM. You can run larger models with CPU offloading, but it's slow. On M1 Max with 64GB, the 40GB model fits in the shared pool and inference runs entirely on GPU.
This is the specific reason why Apple Silicon is competitive for local inference — not raw FLOPS, but memory architecture.
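The arithmetic is easy to sanity-check: weight memory is roughly parameters × bits-per-weight ÷ 8. The 4.5 bits per weight below is my approximation for Q4_K_M's mixed 4/6-bit blocks, so treat it as a ballpark:

```python
# Back-of-envelope weight memory for a quantized model.
params = 70e9          # Llama 3.1 70B
bits_per_weight = 4.5  # my approximation for Q4_K_M's mixed 4/6-bit blocks

print(f"~{params * bits_per_weight / 8 / 1e9:.0f} GB")  # ~39 GB: fits in a 64GB pool
```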
- Runner: Ollama with llama.cpp backend
- Model: Llama 3.1 70B in Q4_K_M quantization (~40GB)
```bash
ollama pull llama3.1:70b
ollama run llama3.1:70b
```

Performance on M1 Max 64GB: 12–15 tokens/second of generation. That's readable at normal reading speed. It's not GPT-4-turbo fast, but it's fast enough that you're not watching the answer arrive token by token.
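The throughput number is easy to verify yourself, since Ollama reports timing metadata in its non-streaming responses (assuming the default port):

```python
# Measure generation throughput from Ollama's own response metadata.
import httpx

r = httpx.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1:70b", "prompt": "Explain mmap in one paragraph.", "stream": False},
    timeout=None,  # a local 70B answer can take tens of seconds
)
stats = r.json()
# eval_count = generated tokens, eval_duration = generation time in nanoseconds
print(f"{stats['eval_count'] / (stats['eval_duration'] / 1e9):.1f} tokens/sec")
```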
Quality: Llama 3.1 70B is genuinely close to GPT-4 on most tasks I throw at it — code review, explanation, refactoring, documentation generation. The gap is noticeable on complex multi-step reasoning and tasks requiring very recent knowledge. For development work, I rarely feel the quality gap.
- Runner: ComfyUI
- Model: Flux.1-schnell (the distilled fast variant, ~15GB)
Flux produces substantially better output than Stable Diffusion XL for photography-style images and technical diagrams. The "schnell" variant trades some quality for speed — generation time on M1 Max is about 8–12 seconds for a 1024×1024 image at 4 steps.
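ComfyUI exposes a local HTTP API, so generation is scriptable: export a workflow with "Save (API Format)" and POST it to /prompt. A sketch, where the filename and the node ID are from my graph and will differ in yours:

```python
# Queue a Flux generation through ComfyUI's local API.
# "flux_schnell_api.json" is a workflow exported via "Save (API Format)";
# node ID "6" is the positive-prompt node in my graph, a placeholder here.
import json
import httpx

with open("flux_schnell_api.json") as f:
    workflow = json.load(f)

workflow["6"]["inputs"]["text"] = "isometric technical diagram of a home server rack"

resp = httpx.post("http://127.0.0.1:8188/prompt", json={"prompt": workflow})
print(resp.json())  # contains a prompt_id; poll /history/<prompt_id> for the output
```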
I use this for:
The honest limitation: Flux can't render text reliably. Any generation involving text in the image will come out garbled. For anything text-critical, you still need to composite the text in post.
- Runner: Python FastAPI server wrapping the model
- Model: Kokoro 82M (small but very good quality)
Kokoro is an open-source TTS model that produces surprisingly natural-sounding output for its size. The 82M-parameter version runs at 2–3× real time on M1 Max — a 60-second audio clip generates in about 25 seconds.
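The wrapper itself is thin. A minimal sketch using the `kokoro` package's KPipeline interface (the voice name and 24kHz output rate come from the model card; the endpoint shape is just my choice):

```python
# Sketch of a Kokoro TTS endpoint: illustrative, not the exact server I run.
import numpy as np
import soundfile as sf
from fastapi import FastAPI
from kokoro import KPipeline

app = FastAPI()
pipeline = KPipeline(lang_code="a")  # "a" selects American English voices

@app.post("/tts")
def tts(text: str, voice: str = "af_heart"):
    # KPipeline yields (graphemes, phonemes, audio) chunks; stitch the audio together
    chunks = [audio for _, _, audio in pipeline(text, voice=voice)]
    sf.write("out.wav", np.concatenate(chunks), 24000)  # Kokoro outputs 24kHz audio
    return {"path": "out.wav"}
```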
I use Kokoro for:
The voice quality is not quite ElevenLabs, but it's better than anything open source was producing 18 months ago. The Indian English accent handling is notably good — it doesn't mangle words the way most TTS models do when they encounter transliterated Hindi terms.
- Runner: faster-whisper (CTranslate2 backend, much faster than the original implementation)
- Model: large-v3 (~3GB)
Whisper large-v3 is the best open-source STT model available. On M1 Max, transcription runs at approximately real-time speed for large-v3, or 3–4x real-time for the medium model.
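Setup is a few lines. A sketch, noting that CTranslate2 has no Metal backend, so this runs on the CPU cores with int8 compute (the file path is a placeholder):

```python
# Transcription with faster-whisper.
from faster_whisper import WhisperModel

# int8 compute keeps memory and latency reasonable for large-v3 on CPU
model = WhisperModel("large-v3", device="cpu", compute_type="int8")

segments, info = model.transcribe("voice-memo.m4a", vad_filter=True)
print(f"language: {info.language} (p={info.language_probability:.2f})")
for seg in segments:
    print(f"[{seg.start:7.1f}s -> {seg.end:7.1f}s] {seg.text}")
```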
I use Whisper for:
The quality on Indian-accented English is noticeably better than cloud STT services, probably because OpenAI trained Whisper on a more diverse corpus than most enterprise STT providers.
Running four separate model servers (Ollama, ComfyUI, Kokoro, Whisper) creates an integration problem — you need a unified way to call them from scripts and applications. I built a thin FastAPI router that provides a single endpoint interface:
```python
# router.py — single local endpoint in front of Ollama, ComfyUI, Kokoro, Whisper
from fastapi import FastAPI
import httpx

app = FastAPI()

@app.post("/generate/text")
async def generate_text(prompt: str, model: str = "llama3.1:70b"):
    # A 70B generation can take tens of seconds; don't use httpx's 5s default timeout
    async with httpx.AsyncClient(timeout=None) as client:
        response = await client.post(
            "http://localhost:11434/api/generate",  # Ollama's local API
            json={"model": model, "prompt": prompt, "stream": False},
        )
    return response.json()

@app.post("/generate/image")
async def generate_image(prompt: str):
    # ComfyUI workflow trigger
    ...

@app.post("/generate/audio")
async def generate_audio(text: str):
    # Kokoro TTS
    ...

@app.post("/transcribe")
async def transcribe(audio_path: str):
    # Whisper STT
    ...
```

This router is what my shell scripts and Claude Code MCP tools call. Adding a new local model means adding one endpoint — the calling code doesn't change.
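Calling it looks the same no matter which model sits behind the endpoint. For example (localhost:8000 is just where I bind the router):

```python
# Example caller; assumes the router is bound to localhost:8000 (my choice).
import httpx

r = httpx.post(
    "http://localhost:8000/generate/text",
    params={"prompt": "Explain what a KV cache is in two sentences."},
    timeout=None,  # local 70B inference is slow; don't let the client give up
)
print(r.json()["response"])  # Ollama puts the completion under "response"
```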
The most useful thing I've done with local AI is building MCP (Model Context Protocol) servers that expose local model capabilities to Claude Code. This means I can be in a Claude Code session and say "transcribe this voice memo" or "generate an image for this blog post" and Claude routes the request to the local model.
```json
// .claude/mcp-servers.json
{
  "local-ai": {
    "command": "python",
    "args": ["/Users/xczer/local-ai/mcp_server.py"],
    "description": "Local AI models: Llama, Flux, Kokoro, Whisper"
  }
}
```

The practical result: Claude Code can call Whisper to transcribe a voice note I dropped in the project folder, use the transcript as context, and then generate code based on what I said. The entire thing runs locally. No data leaves the machine.
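The server behind that config is small. A sketch of the shape of mcp_server.py using the official Python SDK's FastMCP helper, assuming the router from earlier is on localhost:8000 and returns a "text" field:

```python
# mcp_server.py — sketch of an MCP server exposing one local tool.
import httpx
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("local-ai")

@mcp.tool()
def transcribe(audio_path: str) -> str:
    """Transcribe an audio file with the local Whisper server."""
    r = httpx.post(
        "http://localhost:8000/transcribe",
        params={"audio_path": audio_path},
        timeout=None,
    )
    return r.json()["text"]  # assumes the router returns {"text": ...}

if __name__ == "__main__":
    mcp.run()  # stdio transport, which is what Claude Code launches and talks to
```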
Honest comparison on tasks I actually run:
| Task | Local (M1 Max) | Cloud equivalent | Quality delta |
|------|----------------|------------------|---------------|
| Code explanation (2K tokens) | ~20 seconds | ~3 seconds | Minimal |
| Document summarization | ~35 seconds | ~5 seconds | Minimal |
| Complex reasoning chain | ~90 seconds | ~15 seconds | Noticeable |
| Image generation (1024px) | ~10 seconds | ~3 seconds (DALL-E 3) | Cloud better |
| Speech-to-text (10 min audio) | ~10 minutes | ~30 seconds | Minimal |
| TTS (1 min audio) | ~25 seconds | ~5 seconds | Cloud marginally better |
Speed is the honest limitation. For tasks where I'm in an interactive loop — asking a question, reading the answer, refining — 20 seconds per response is fine. For anything where I need rapid iteration, cloud is still faster.
Vision tasks. I haven't found a local vision model that competes with Claude's image understanding for technical diagrams and screenshots. LLaVA and the various Llama vision variants are decent for photos but poor for code screenshots and system diagrams.
Very long context. Llama 3.1 70B supports a 128K context window, but filling it locally hits memory limits. At ~80K tokens, generation speed drops noticeably as the KV cache fills unified memory. Cloud models handle full context windows more gracefully.
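The arithmetic behind that, assuming an unquantized fp16 KV cache (llama.cpp can quantize the cache, which shifts these numbers):

```python
# Rough KV-cache math for Llama 3.1 70B at fp16: an estimate, not a measurement.
layers, kv_heads, head_dim = 80, 8, 128  # 70B config: GQA with 8 KV heads
bytes_per_value = 2                      # fp16
tokens = 80_000

per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # K and V
print(f"{per_token / 1024:.0f} KB/token, ~{per_token * tokens / 1e9:.0f} GB total")
# ~320 KB/token -> ~26 GB of cache; add ~40 GB of weights and the 64GB pool is full
```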
Function calling reliability. Tool use / function calling with local Llama models is less reliable than GPT-4 or Claude. The models don't always follow the JSON schema correctly, and the error recovery is worse. For any automated pipeline that depends on structured output, I still route to cloud models.
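The routing policy that falls out of this is simple: try local, validate, fall back to cloud. A sketch, where call_local and call_cloud stand in for whatever clients you actually use:

```python
# Sketch of the routing policy. call_local / call_cloud are caller-supplied
# stand-ins for real clients: placeholders, not library functions.
import json
from typing import Callable

def structured_generate(
    prompt: str,
    call_local: Callable[[str], str],  # e.g. a wrapper around the local router
    call_cloud: Callable[[str], str],  # e.g. a Claude or GPT-4 client call
    required_keys: set,
    retries: int = 2,
) -> dict:
    for _ in range(retries):
        try:
            data = json.loads(call_local(prompt))
            if isinstance(data, dict) and required_keys <= data.keys():
                return data   # local model followed the schema
        except json.JSONDecodeError:
            pass              # malformed JSON: retry, then give up on local
    return json.loads(call_cloud(prompt))  # cloud models hold the schema reliably
```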
Multimodal workflows. Chaining text → image → audio on local models works, but the latency chain adds up. A workflow that takes 2 minutes on local models takes 15 seconds with cloud APIs. For automated pipelines, the latency difference matters.
Local AI on M1 Max is genuinely useful, not just a technical demonstration. It covers probably 60–70% of my AI workflow at zero marginal cost. The remaining 30–40% goes to cloud models for tasks where speed, vision quality, or function calling reliability matter.
The investment pays off at a specific usage intensity. If you're spending ₹20,000+/month on AI APIs and doing the kind of development work where privacy and offline operation have value, the hardware cost amortizes in a few months. If you're an occasional AI user, managed services are the right answer.
The thing that surprised me most: it's changed how I think about AI features in the things I build. When inference is free, you instrument everything. I've added AI-powered features to internal tools that I would never have added if they had per-request costs. Marginal cost dropping from something to zero is a genuine product-design shift, not just a cost optimization.
The other thing: the local stack has made me better at understanding what the models are actually doing. When you own the inference, you see the parameters, the quantization choices, the memory usage, the temperature effects in real time. It's the same lesson as self-hosting mail infrastructure — understanding the thing you depend on makes you better at using it.