
Local LLM Lab
What Actually Works on a Mac Mini

Six models. One Mac Mini M4 with 24GB RAM.

Some were fast. Some were accurate. Some ran out of memory before finishing a sentence.

Here's what I learned — and what I'm still running.

Act I — What I Tested

Six models, three runtimes, one question: will this run unattended at 02:00 without crashing?

2025 (Running)

Hermes 3 Llama 3.1 8B (Q8_0)

First local model. llama.cpp, localhost:8080. Chat only — no automation, no scheduled work. Still running as a legacy fallback.

Early 2026 (Available)

Ollama — Various Models

Installed and available, with multiple models pulled. Not used in scheduled jobs: Abacus RouteLLM (cloud) and MLX (local) proved faster and more reliable for automation workloads.

2026 (Primary)

Qwen3.5-9B-MLX-4bit

Current primary local model. MLX framework, localhost:8888. Zero marginal cost. Handles all draft and volume work via the Hermes "local" profile.

2026 (Benchmarked)

Qwen3 14B (MLX) — Benchmark

10.68 tokens/second average throughput. 0.764s time to first token. 10GB RAM usage. Baseline benchmark for local capability.

2026 (Pending)

Qwen3.6 35B-A3B — Planned

Pending benchmark. Hypothesis: a 35B MoE model at Q4 quantisation might fit in 24GB and produce meaningfully better output for complex reasoning tasks.

What “Running a Local Model” Actually Means

The real test isn't benchmark scores — it's whether it runs unattended at 02:00 without crashing.

llama.cpp

You start a server process by hand, point it at a GGUF file, and manage RAM headroom yourself. Portable, but not the fastest option on the M4.

MLX Framework

Apple Silicon native, and the fastest of the three runtimes on the M4 in my testing. Uses a different model format (MLX-converted weights). Current primary for local inference.

Ollama

Convenient wrapper. Not used for production automation because of startup latency and reliability issues in scheduled jobs.
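Since the test that matters is whether a server is still up when nobody is watching, a liveness probe is a natural first step. The following is a minimal sketch, not the monitoring actually used here: it only checks that something accepts a TCP connection on the two ports mentioned above (8080 for llama.cpp, 8888 for MLX). The names SERVERS, is_listening and check_all are hypothetical, and an open port does not prove the model can still generate text.

```python
import socket

# Ports are taken from the article; the dictionary keys are illustrative labels.
SERVERS = {
    "llama.cpp": ("127.0.0.1", 8080),
    "mlx": ("127.0.0.1", 8888),
}


def is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if something accepts a TCP connection on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


def check_all(servers: dict = SERVERS) -> dict:
    """Map each server label to its current liveness result."""
    return {name: is_listening(host, port) for name, (host, port) in servers.items()}
```

Calling check_all() from a scheduled task and alerting on any False value would catch a crashed server process. A stricter check would also send a short prompt and verify that a completion comes back.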

The Numbers

Verified figures. Not marketing claims.

6+: Models tested across llama.cpp, MLX, Ollama, and cloud
2: Local models actively running
M4 24GB: Mac Mini, Apple Silicon unified memory
10.68 t/s: Average throughput, Qwen3 14B benchmark
0.764s: Time to first token, Qwen3 14B on MLX
10GB: RAM usage, Qwen3 14B during inference
Zero: Marginal cost, with Qwen3.5-9B handling all draft work
0: Scheduled jobs using local models (cloud is used for reliability)
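For readers who want to reproduce figures like the throughput and time-to-first-token numbers above, here is one way such metrics can be derived from a streamed response. This is a sketch under stated assumptions, not the benchmark harness behind the published numbers. It treats each streamed chunk as one token and divides the token count by total wall-clock time including the wait for the first token. StreamStats and measure_stream are hypothetical names.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class StreamStats:
    tokens: int
    ttft_s: float        # seconds from request start to first chunk
    tokens_per_s: float  # chunks divided by total elapsed seconds


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> StreamStats:
    """Consume a token stream and report time to first token and throughput.

    Assumption: one chunk equals one token. Real servers may batch several
    tokens per chunk, in which case the stream must be re-tokenised first.
    """
    start = clock()
    first = None
    count = 0
    for _ in chunks:
        now = clock()
        if first is None:
            first = now
        count += 1
    end = clock()
    if first is None:
        raise ValueError("stream produced no tokens")
    total = end - start
    rate = count / total if total > 0 else float("inf")
    return StreamStats(tokens=count, ttft_s=first - start, tokens_per_s=rate)
```

The clock parameter exists so the function can be tested with a fake timer. In real use, chunks would be the iterator returned by a streaming client. Whether prompt-processing time is counted changes the reported throughput, so any comparison should state which definition was used.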

Act II — What I Learned

Five lessons from running local models in a real production workflow.

1. Speed benchmarks don't tell you what you need to know.

The question isn't tokens per second. It's: does it finish a 2,000-word draft without hallucinating a URL that doesn't exist? Does it follow a system prompt consistently across a 90-minute session? Does it stay running at 02:00 when nothing is watching it? Those tests are not in any benchmark suite.
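One of these checks, the invented URL, can be partly automated. The sketch below is illustrative only and assumes the draft was written from some source notes: it flags any URL in the draft that does not appear in those notes. It does not fetch anything, so it cannot tell whether a URL is live. extract_urls and unsupported_urls are hypothetical helper names, and the regular expression is a deliberately simple approximation.

```python
import re

# Simple approximation of a URL pattern; good enough to flag candidates for review.
URL_RE = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_urls(text: str) -> set[str]:
    """Return the set of URLs in text, with trailing punctuation stripped."""
    return {match.rstrip(".,;:!?") for match in URL_RE.findall(text)}


def unsupported_urls(draft: str, source_material: str) -> set[str]:
    """Return URLs that appear in the draft but not in the source material."""
    return extract_urls(draft) - extract_urls(source_material)
```

An empty result means every URL in the draft can be traced back to the notes. A non-empty result is a list of links to verify by hand before publishing.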

2. The model format matters as much as the model.

Qwen3.5 9B via MLX runs faster and uses less RAM than the same model via Ollama on Apple Silicon. The MLX framework is native to the chip. GGUF via llama.cpp is more portable but not as fast on M4. Choosing a model without choosing a runtime is only half the decision.

3. Scheduled jobs need cloud reliability, not local cost savings.

The 31 automated jobs that run daily don't use local models. They use Abacus RouteLLM. The reason: a local model that crashes at 06:00 means no morning briefing, no grid intelligence snapshot, no content review. The cost saving is not worth the reliability risk for unattended automation. Local models are for interactive, supervised work — drafts, research, brainstorming — where a failure is visible and recoverable.
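The rule in this lesson is simple enough to write down as a routing function. This is an illustrative sketch rather than the actual configuration: Job, choose_backend and the "cloud" and "local" labels are hypothetical names, and the only input is whether someone is present to notice a failure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    unattended: bool  # True for scheduled work with nobody watching


def choose_backend(job: Job) -> str:
    """Send unattended jobs to the cloud and supervised work to the local model."""
    # A local crash in an unattended job fails silently, so reliability wins there.
    # In supervised work a failure is visible and recoverable, so cost wins.
    return "cloud" if job.unattended else "local"
```

For example, choose_backend(Job("morning briefing", unattended=True)) returns "cloud", while an interactive drafting session with unattended=False returns "local".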

4. Zero marginal cost changes how you use it.

When a model is free at the point of use, you stop rationing. You run a draft, don't like it, run it again with a different prompt. You use it for throwaway tasks you'd never pay API fees for. The creative and exploratory value of a local model comes from the psychology of free — not the token speed.

5. What I'm still working out.

The 35B model benchmark is pending. The hypothesis: a 35B MoE model at Q4 quantisation might fit in 24GB and produce meaningfully better output for complex tasks — strategy, analysis, long-form writing. If it does, the routing logic changes: small model for drafts, large model for anything requiring reasoning depth.
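The "might fit in 24GB" part of the hypothesis can be sanity-checked with arithmetic before running anything. The sketch below estimates weight storage only. It assumes all 35B parameters must be resident even though a mixture-of-experts model activates only a fraction per token, and it ignores the KV cache, runtime overhead, and memory the operating system needs. The function names are hypothetical, and the 4.5 bits-per-weight figure in the usage note is an assumed allowance for quantisation metadata, not a measured value.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def headroom_gb(total_ram_gb: float, params_billion: float, bits_per_weight: float) -> float:
    """RAM left after loading the weights, before KV cache and OS overhead."""
    return total_ram_gb - weight_footprint_gb(params_billion, bits_per_weight)
```

At a nominal 4 bits per weight, weight_footprint_gb(35, 4) gives 17.5 GB, leaving 6.5 GB of the 24 GB. At an assumed effective 4.5 bits per weight the weights take roughly 19.7 GB, leaving about 4.3 GB for everything else. On paper the model fits, but with little margin, which is why the benchmark is still needed.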

“The best local model is the one that runs unattended at 02:00 without crashing. Token speed is a distant second.”

Supporting Evidence

Blog posts documenting the local LLM experiments in detail.