Llama 70B, Flux image generation, Kokoro TTS, Whisper STT — all running locally on an M1 Max with 64GB unified memory. The practical setup, benchmarks, and honest assessment of what's ready and what isn't.
In early 2025 I bought a MacBook Pro with the M1 Max and 64GB of unified memory. The primary reason was local AI inference — the ability to run large language models, image generation, speech synthesis, and speech recognition entirely offline, at no marginal cost per request.
Eleven months later, I have a working local AI stack that I use daily. Some of it has exceeded my expectations. Some of it hasn't. This is an honest assessment of what actually works, what the real limitations are, and whether it's worth the investment.
The case for local inference over cloud APIs comes down to four things:
Cost. Cloud inference has a marginal cost per request. Running Llama 70B locally has zero marginal cost. For development work — which involves a lot of iterative prompting, testing edge cases, and generating test data — the API bills accumulate fast. At my development intensity, I was spending ₹15,000–25,000/month on Claude and GPT-4 API calls before going local.
Privacy. Anything I send to a cloud API may become training data, and is at minimum retained for some period. Code I'm working on, documents I'm analyzing, internal knowledge I'm querying against — none of that should leave the machine.
Offline operation. I work on trains, in areas with poor connectivity, and at hours when services have outages. Local models don't have availability dependencies.
No rate limits. I can run inference in a tight loop for hours. For automated evaluation pipelines, document processing, and batch generation, rate limits are a genuine bottleneck on cloud APIs.
The M1 Max with 64GB unified memory is specifically good for AI inference for one reason: unified memory means the CPU, GPU, and neural engine share the same memory pool. There's no PCIe bandwidth bottleneck between CPU and discrete GPU — the weights live in unified memory and the GPU reads them at full memory bandwidth.
For an Nvidia setup, you need a GPU with enough VRAM to hold the model weights entirely. A 70B model in 4-bit quantization is about 40GB. Consumer Nvidia cards top out at 24GB VRAM. You can run larger models with CPU offloading, but it's slow. On M1 Max with 64GB, the 40GB model fits in the shared pool and inference runs entirely on GPU.
This is the specific reason why Apple Silicon is competitive for local inference — not raw FLOPS, but memory architecture.
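The arithmetic is easy to sanity-check: weight memory is roughly parameters × bits-per-weight ÷ 8. The 4.5 bits per weight below is my approximation for Q4_K_M's mixed 4/6-bit blocks, so treat it as a ballpark:

```python
# Back-of-envelope weight memory for a quantized model.
params = 70e9          # Llama 3.1 70B
bits_per_weight = 4.5  # my approximation for Q4_K_M's mixed 4/6-bit blocks

print(f"~{params * bits_per_weight / 8 / 1e9:.0f} GB")  # ~39 GB: fits in a 64GB pool
```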
- Runner: Ollama with llama.cpp backend
- Model: Llama 3.1 70B in Q4_K_M quantization (~40GB)
```bash
ollama pull llama3.1:70b
ollama run llama3.1:70b
```

Performance on M1 Max 64GB: 12–15 tokens/second of generation. That's readable at normal reading speed. It's not GPT-4-turbo fast, but it's fast enough that you're not watching the answer arrive token by token.
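The throughput number is easy to verify yourself, since Ollama reports timing metadata in its non-streaming responses (assuming the default port):

```python
# Measure generation throughput from Ollama's own response metadata.
import httpx

r = httpx.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1:70b", "prompt": "Explain mmap in one paragraph.", "stream": False},
    timeout=None,  # a local 70B answer can take tens of seconds
)
stats = r.json()
# eval_count = generated tokens, eval_duration = generation time in nanoseconds
print(f"{stats['eval_count'] / (stats['eval_duration'] / 1e9):.1f} tokens/sec")
```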
Quality: Llama 3.1 70B is genuinely close to GPT-4 on most tasks I throw at it — code review, explanation, refactoring, documentation generation. The gap is noticeable on complex multi-step reasoning and tasks requiring very recent knowledge. For development work, I rarely feel the quality gap.
- Runner: ComfyUI
- Model: Flux.1-schnell (the distilled fast variant, ~15GB)
Flux produces substantially better output than Stable Diffusion XL for photography-style images and technical diagrams. The "schnell" variant trades some quality for speed — generation time on M1 Max is about 8–12 seconds for a 1024×1024 image at 4 steps.
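ComfyUI exposes a local HTTP API, so generation is scriptable: export a workflow with "Save (API Format)" and POST it to /prompt. A sketch, where the filename and the node ID are from my graph and will differ in yours:

```python
# Queue a Flux generation through ComfyUI's local API.
# "flux_schnell_api.json" is a workflow exported via "Save (API Format)";
# node ID "6" is the positive-prompt node in my graph, a placeholder here.
import json
import httpx

with open("flux_schnell_api.json") as f:
    workflow = json.load(f)

workflow["6"]["inputs"]["text"] = "isometric technical diagram of a home server rack"

resp = httpx.post("http://127.0.0.1:8188/prompt", json={"prompt": workflow})
print(resp.json())  # contains a prompt_id; poll /history/<prompt_id> for the output
```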
I use this for:
The honest limitation: Flux can't render text reliably. Any generation involving text in the image will come out garbled. For anything text-critical, you still need to composite the text in post.
- Runner: Python FastAPI server wrapping the model
- Model: Kokoro 82M (small but very good quality)
Kokoro is an open-source TTS model that produces surprisingly natural-sounding output for its size. The 82M-parameter version runs at 2–3× real time on M1 Max — a 60-second audio clip generates in about 25 seconds.
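The wrapper itself is thin. A minimal sketch using the `kokoro` package's KPipeline interface (the voice name and 24kHz output rate come from the model card; the endpoint shape is just my choice):

```python
# Sketch of a Kokoro TTS endpoint: illustrative, not the exact server I run.
import numpy as np
import soundfile as sf
from fastapi import FastAPI
from kokoro import KPipeline

app = FastAPI()
pipeline = KPipeline(lang_code="a")  # "a" selects American English voices

@app.post("/tts")
def tts(text: str, voice: str = "af_heart"):
    # KPipeline yields (graphemes, phonemes, audio) chunks; stitch the audio together
    chunks = [audio for _, _, audio in pipeline(text, voice=voice)]
    sf.write("out.wav", np.concatenate(chunks), 24000)  # Kokoro outputs 24kHz audio
    return {"path": "out.wav"}
```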
I use Kokoro for:
The voice quality is not quite ElevenLabs, but it's better than anything open source was producing 18 months ago. The Indian English accent handling is notably good — it doesn't mangle words the way most TTS models do when they encounter transliterated Hindi terms.
- Runner: faster-whisper (CTranslate2 backend, much faster than the original implementation)
- Model: large-v3 (~3GB)
Whisper large-v3 is the best open-source STT model available. On M1 Max, transcription runs at approximately real-time speed for large-v3, or 3–4x real-time for the medium model.
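Setup is a few lines. A sketch, noting that CTranslate2 has no Metal backend, so this runs on the CPU cores with int8 compute (the file path is a placeholder):

```python
# Transcription with faster-whisper.
from faster_whisper import WhisperModel

# int8 compute keeps memory and latency reasonable for large-v3 on CPU
model = WhisperModel("large-v3", device="cpu", compute_type="int8")

segments, info = model.transcribe("voice-memo.m4a", vad_filter=True)
print(f"language: {info.language} (p={info.language_probability:.2f})")
for seg in segments:
    print(f"[{seg.start:7.1f}s -> {seg.end:7.1f}s] {seg.text}")
```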
I use Whisper for:
The quality on Indian-accented English is noticeably better than cloud STT services, probably because OpenAI trained Whisper on a more diverse corpus than most enterprise STT providers.
Running four separate model servers (Ollama, ComfyUI, Kokoro, Whisper) creates an integration problem — you need a unified way to call them from scripts and applications. I built a thin FastAPI router that provides a single endpoint interface:
```python
# router.py — single local endpoint in front of Ollama, ComfyUI, Kokoro, Whisper
from fastapi import FastAPI
import httpx

app = FastAPI()

@app.post("/generate/text")
async def generate_text(prompt: str, model: str = "llama3.1:70b"):
    # A 70B generation can take tens of seconds; don't use httpx's 5s default timeout
    async with httpx.AsyncClient(timeout=None) as client:
        response = await client.post(
            "http://localhost:11434/api/generate",  # Ollama's local API
            json={"model": model, "prompt": prompt, "stream": False},
        )
    return response.json()

@app.post("/generate/image")
async def generate_image(prompt: str):
    # ComfyUI workflow trigger
    ...

@app.post("/generate/audio")
async def generate_audio(text: str):
    # Kokoro TTS
    ...

@app.post("/transcribe")
async def transcribe(audio_path: str):
    # Whisper STT
    ...
```

This router is what my shell scripts and Claude Code MCP tools call. Adding a new local model means adding one endpoint — the calling code doesn't change.
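Calling it looks the same no matter which model sits behind the endpoint. For example (localhost:8000 is just where I bind the router):

```python
# Example caller; assumes the router is bound to localhost:8000 (my choice).
import httpx

r = httpx.post(
    "http://localhost:8000/generate/text",
    params={"prompt": "Explain what a KV cache is in two sentences."},
    timeout=None,  # local 70B inference is slow; don't let the client give up
)
print(r.json()["response"])  # Ollama puts the completion under "response"
```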
The most useful thing I've done with local AI is building MCP (Model Context Protocol) servers that expose local model capabilities to Claude Code. This means I can be in a Claude Code session and say "transcribe this voice memo" or "generate an image for this blog post" and Claude routes the request to the local model.
```json
// .claude/mcp-servers.json
{
  "local-ai": {
    "command": "python",
    "args": ["/Users/xczer/local-ai/mcp_server.py"],
    "description": "Local AI models: Llama, Flux, Kokoro, Whisper"
  }
}
```

The practical result: Claude Code can call Whisper to transcribe a voice note I dropped in the project folder, use the transcript as context, and then generate code based on what I said. The entire thing runs locally. No data leaves the machine.
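The server behind that config is small. A sketch of the shape of mcp_server.py using the official Python SDK's FastMCP helper, assuming the router from earlier is on localhost:8000 and returns a "text" field:

```python
# mcp_server.py — sketch of an MCP server exposing one local tool.
import httpx
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("local-ai")

@mcp.tool()
def transcribe(audio_path: str) -> str:
    """Transcribe an audio file with the local Whisper server."""
    r = httpx.post(
        "http://localhost:8000/transcribe",
        params={"audio_path": audio_path},
        timeout=None,
    )
    return r.json()["text"]  # assumes the router returns {"text": ...}

if __name__ == "__main__":
    mcp.run()  # stdio transport, which is what Claude Code launches and talks to
```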
Honest comparison on tasks I actually run:
| Task | Local (M1 Max) | Cloud equivalent | Quality delta |
|------|----------------|------------------|---------------|
| Code explanation (2K tokens) | ~20 seconds | ~3 seconds | Minimal |
| Document summarization | ~35 seconds | ~5 seconds | Minimal |
| Complex reasoning chain | ~90 seconds | ~15 seconds | Noticeable |
| Image generation (1024px) | ~10 seconds | ~3 seconds (DALL-E 3) | Cloud better |
| Speech-to-text (10 min audio) | ~10 minutes | ~30 seconds | Minimal |
| TTS (1 min audio) | ~25 seconds | ~5 seconds | Cloud marginally better |
Speed is the honest limitation. For tasks where I'm in an interactive loop — asking a question, reading the answer, refining — 20 seconds per response is fine. For anything where I need rapid iteration, cloud is still faster.
Vision tasks. I haven't found a local vision model that competes with Claude's image understanding for technical diagrams and screenshots. LLaVA and the various Llama vision variants are decent for photos but poor for code screenshots and system diagrams.
Very long context. Llama 3.1 70B supports a 128K context window, but filling it locally hits memory limits. At ~80K tokens, generation speed drops noticeably as the KV cache fills unified memory. Cloud models handle full context windows more gracefully.
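The arithmetic behind that, assuming an unquantized fp16 KV cache (llama.cpp can quantize the cache, which shifts these numbers):

```python
# Rough KV-cache math for Llama 3.1 70B at fp16: an estimate, not a measurement.
layers, kv_heads, head_dim = 80, 8, 128  # 70B config: GQA with 8 KV heads
bytes_per_value = 2                      # fp16
tokens = 80_000

per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # K and V
print(f"{per_token / 1024:.0f} KB/token, ~{per_token * tokens / 1e9:.0f} GB total")
# ~320 KB/token -> ~26 GB of cache; add ~40 GB of weights and the 64GB pool is full
```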
Function calling reliability. Tool use / function calling with local Llama models is less reliable than GPT-4 or Claude. The models don't always follow the JSON schema correctly, and the error recovery is worse. For any automated pipeline that depends on structured output, I still route to cloud models.
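The routing policy that falls out of this is simple: try local, validate, fall back to cloud. A sketch, where call_local and call_cloud stand in for whatever clients you actually use:

```python
# Sketch of the routing policy. call_local / call_cloud are caller-supplied
# stand-ins for real clients: placeholders, not library functions.
import json
from typing import Callable

def structured_generate(
    prompt: str,
    call_local: Callable[[str], str],  # e.g. a wrapper around the local router
    call_cloud: Callable[[str], str],  # e.g. a Claude or GPT-4 client call
    required_keys: set,
    retries: int = 2,
) -> dict:
    for _ in range(retries):
        try:
            data = json.loads(call_local(prompt))
            if isinstance(data, dict) and required_keys <= data.keys():
                return data   # local model followed the schema
        except json.JSONDecodeError:
            pass              # malformed JSON: retry, then give up on local
    return json.loads(call_cloud(prompt))  # cloud models hold the schema reliably
```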
Multimodal workflows. Chaining text → image → audio on local models works, but the latency chain adds up. A workflow that takes 2 minutes on local models takes 15 seconds with cloud APIs. For automated pipelines, the latency difference matters.
Local AI on M1 Max is genuinely useful, not just a technical demonstration. It covers probably 60–70% of my AI workflow at zero marginal cost. The remaining 30–40% goes to cloud models for tasks where speed, vision quality, or function calling reliability matter.
The investment pays off at a specific usage intensity. If you're spending ₹20,000+/month on AI APIs and doing the kind of development work where privacy and offline operation have value, the hardware cost amortizes in a few months. If you're an occasional AI user, managed services are the right answer.
The thing that surprised me most: it's changed how I think about AI features in the things I build. When inference is free, you instrument everything. I've added AI-powered features to internal tools that I would never have added if they had per-request costs. Marginal cost dropping from something to zero is a genuine product-design shift, not just a cost optimization.
The other thing: the local stack has made me better at understanding what the models are actually doing. When you own the inference, you see the parameters, the quantization choices, the memory usage, the temperature effects in real time. It's the same lesson as self-hosting mail infrastructure — understanding the thing you depend on makes you better at using it.