Back to HomeAI Infrastructure

Why I Run a Local LLM on My Mac Mini

By PickingMay 19, 20264 min read
Why I Run a Local LLM on My Mac Mini

The honest reason: I didn't want every sentence I typed going to an API.

When I started working with AI seriously — late 2023, early 2024 — everything went through cloud APIs. OpenAI, Anthropic, whatever was available. Every prompt, every draft, every system prompt going somewhere external. For work that involves client context or strategy thinking, that was a compromise I kept making but didn't love.

The second reason was cost. Not API cost directly — I was already paying for subscriptions. It was the psychology of cost. When every token has a price, you ration. You cut the prompt shorter. You don't run a second draft because "good enough." You stop exploring because you can feel the meter running.

A local model removes that friction. The Mac Mini M4 with 24GB wasn't free — but the marginal cost of the next prompt is zero.

What it actually took to get running

Harder than I expected, easier than I feared.

The first model I ran was Hermes 3 Llama 3.1 8B via llama.cpp. That meant:

  • Downloading a GGUF file (the quantised model format llama.cpp uses)
  • Compiling llama.cpp for Apple Silicon
  • Starting a server process manually with the right flags: GPU layer offload, context window size, RAM headroom
  • Pointing a client at localhost:8080
It's not plug-and-play. You're managing a server process. You need to know what you're trading off. These aren't difficult decisions once you've made them — but the first time, they're opaque.

Ollama makes this easier. One command, model downloaded, server running. I tried it. It works. But I found I didn't use it for anything serious — the startup latency and reliability profile didn't fit my automation setup. More on that in a separate post.

What I actually use it for

The Hermes 3 8B model on llama.cpp is still running at localhost:8080. Legacy fallback — chat only, not in any automated jobs. It was my first working local model. I keep it because it costs nothing to keep it running.

The main model is Qwen3.5-9B-MLX-4bit via the MLX framework. MLX is Apple's machine learning framework — native to the M4 chip, runs on the Neural Engine and GPU directly. Significantly faster than anything I've run via llama.cpp on the same hardware.

I use it through Hermes, my local agent framework. The local profile routes all draft work, research scraping, and brainstorming to Qwen3.5 at localhost:8888. The rule is simple: if I'd feel stupid paying API fees for this task, it goes local.

The thing nobody tells you

The speed numbers look worse than cloud. ~10-15 tokens per second vs near-instant cloud responses. But it doesn't feel slower in practice — because when you're not rationing prompts, you actually iterate. You run a draft, don't like it, run it again with a different angle. Tasks you'd never pay API fees for become habits.

The local model hasn't replaced cloud APIs. It's absorbed the volume work that shouldn't have been going to cloud APIs in the first place.

That's the real value proposition — not speed, not capability, but the removal of the friction that was quietly degrading how I worked.


This post is part of the Local LLM Lab case study.

Tags:

local-llmmac-minimlxllama-cppai-infrastructure

Recommended Reading