Local LLM Lab
What Actually Works on a Mac Mini
Six models. One Mac Mini M4 with 24GB RAM.
Some were fast. Some were accurate. Some ran out of memory before finishing a sentence.
Here's what I learned — and what I'm still running.
Act I — What I Tested
Six models, three runtimes, one question: will this run unattended at 02:00 without crashing?
Hermes 3 Llama 3.1 8B (Q8_0)
First local model. llama.cpp, localhost:8080. Chat only — no automation, no scheduled work. Still running as a legacy fallback.
Ollama — Various Models
Installed and available, multiple models pulled. Not used in scheduled jobs — Abacus and MLX proved faster and more reliable for automation workloads.
Qwen3.5-9B-MLX-4bit
Current primary local model. MLX framework, localhost:8888. Zero marginal cost. Handles all draft and volume work via the Hermes "local" profile.
Qwen3 14B (MLX) — Benchmark
10.68 tokens/second average throughput. 0.764s time to first token. 10GB RAM usage. Baseline benchmark for local capability.
Qwen3.6 35B-A3B — Planned
Pending benchmark. Hypothesis: a 35B MoE model at Q4 quantisation might fit in 24GB and produce meaningfully better output for complex reasoning tasks.
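The "might fit in 24GB" hypothesis can be sanity-checked with back-of-envelope arithmetic before running anything. The sketch below is only that: the effective bits per weight and the memory reserved for macOS, the KV cache and other apps are assumptions, not measurements from this lab. It also assumes that all experts of a MoE model must be resident in memory, even though only a fraction of the parameters are active per token.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of quantised weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         total_ram_gb: float, reserve_gb: float) -> bool:
    """True if the weights fit in what is left after the reserve.

    reserve_gb is a hypothetical allowance for the OS, KV cache and
    other processes; tune it to your own machine.
    """
    return weights_gb(params_billion, bits_per_weight) <= total_ram_gb - reserve_gb


if __name__ == "__main__":
    # 4.0 = nominal 4-bit; 4.5 = assumed overhead for scales and metadata.
    for bits in (4.0, 4.5):
        size = weights_gb(35, bits)
        ok = fits(35, bits, total_ram_gb=24, reserve_gb=6)
        print(f"35B at {bits} bits/weight: {size:.1f} GB, fits: {ok}")
```

At a nominal 4 bits the weights alone come to 17.5 GB, which leaves little headroom on a 24GB machine. Under these assumptions the answer flips with small changes in quantisation overhead or reserve, which is why the real benchmark is still needed.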
What “Running a Local Model” Actually Means
The real test isn't benchmark scores — it's whether it runs unattended at 02:00 without crashing.
llama.cpp
Manually start a server process, point at a GGUF file, manage RAM headroom. Portable but not the fastest on M4.
MLX Framework
Apple Silicon native. Faster than llama.cpp and Ollama on the M4. Uses its own model format (MLX-converted weights rather than GGUF). Current primary for local inference.
Ollama
Convenient wrapper. Not used for production automation — startup latency and reliability issues in scheduled jobs.
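Since the real test is whether a server is still up when nobody is watching, a small liveness probe is a natural companion to any of these runtimes. The following is a minimal sketch, not the setup used here: it only checks that something accepts TCP connections on the ports mentioned above (8080 for llama.cpp, 8888 for MLX) and says nothing about whether the model behind the port is responding sensibly. The function and dictionary names are hypothetical.

```python
import socket


def is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


# Ports as described in this article; adjust for your own setup.
LOCAL_SERVERS = {"llama.cpp": 8080, "mlx": 8888}


def check_all(servers=LOCAL_SERVERS, host: str = "127.0.0.1") -> dict:
    """Map each server name to whether its port is accepting connections."""
    return {name: is_listening(host, port) for name, port in servers.items()}


if __name__ == "__main__":
    for name, up in check_all().items():
        print(f"{name}: {'up' if up else 'DOWN'}")
```

Run from a scheduler a few minutes before a job is due, a probe like this turns a silent overnight failure into a visible one.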
The Numbers
Verified figures. Not marketing claims.
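As a rough worked example of what the Qwen3 14B figures quoted earlier (10.68 tokens/second, 0.764s to first token) mean in practice, the sketch below estimates how long a 2,000-word draft takes to generate. The tokens-per-word ratio is an assumption (about 1.3 is a common rule of thumb for English text) and varies by tokenizer and content.

```python
def draft_seconds(words: int,
                  tokens_per_word: float = 1.3,    # assumption, not measured
                  tokens_per_second: float = 10.68,  # Qwen3 14B (MLX) benchmark
                  ttft_seconds: float = 0.764) -> float:
    """Estimate wall-clock seconds to generate a draft of `words` words."""
    tokens = words * tokens_per_word
    return ttft_seconds + tokens / tokens_per_second


if __name__ == "__main__":
    seconds = draft_seconds(2000)
    print(f"2,000-word draft: about {seconds / 60:.1f} minutes")
```

Under these assumptions a 2,000-word draft takes roughly four minutes, with time to first token contributing less than a second of that.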
Act II — What I Learned
Five lessons from running local models in a real production workflow.
1. Speed benchmarks don't tell you what you need to know.
The question isn't tokens per second. It's: does it finish a 2,000-word draft without inventing a URL? Does it follow a system prompt consistently across a 90-minute session? Does it stay running at 02:00 when nothing is watching it? Those tests are not in any benchmark suite.
2. The model format matters as much as the model.
Qwen3.5 9B via MLX runs faster and uses less RAM than the same model via Ollama on Apple Silicon. MLX is built specifically for Apple Silicon. GGUF via llama.cpp is more portable but not as fast on M4. Choosing a model without choosing a runtime is only half the decision.
3. Scheduled jobs need cloud reliability, not local cost savings.
The 31 automated jobs that run daily don't use local models. They use Abacus RouteLLM. The reason: a local model that crashes at 06:00 means no morning briefing, no grid intelligence snapshot, no content review. The cost saving is not worth the reliability risk for unattended automation. Local models are for interactive, supervised work — drafts, research, brainstorming — where a failure is visible and recoverable.
4. Zero marginal cost changes how you use it.
When a model is free at the point of use, you stop rationing. You run a draft, don't like it, run it again with a different prompt. You use it for throwaway tasks you'd never pay API fees for. The creative and exploratory value of a local model comes from the psychology of free — not the token speed.
5. What I'm still working out.
The 35B model benchmark is pending. The hypothesis: a 35B MoE model at Q4 quantisation might fit in 24GB and produce meaningfully better output for complex tasks — strategy, analysis, long-form writing. If it does, the routing logic changes: small model for drafts, large model for anything requiring reasoning depth.
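The routing rules described in lessons 3 and 5 can be sketched as a small decision function. This is an illustration of the logic as stated in the text, not the actual implementation: the type and backend names are hypothetical, and the fallback to the small local model when no large model is available is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Backend(Enum):
    CLOUD = "cloud"              # e.g. Abacus RouteLLM for scheduled jobs
    LOCAL_SMALL = "local-small"  # e.g. the 9B MLX model for drafts
    LOCAL_LARGE = "local-large"  # e.g. a 35B model, if the benchmark works out


@dataclass(frozen=True)
class Task:
    name: str
    scheduled: bool                     # runs unattended on a timer
    needs_deep_reasoning: bool = False  # strategy, analysis, long-form


def route(task: Task, large_model_available: bool = False) -> Backend:
    """Pick a backend for a task using the rules from the article."""
    if task.scheduled:
        # Unattended work: reliability matters more than cost.
        return Backend.CLOUD
    if task.needs_deep_reasoning and large_model_available:
        return Backend.LOCAL_LARGE
    # Assumed fallback: supervised work stays on the small local model.
    return Backend.LOCAL_SMALL


if __name__ == "__main__":
    briefing = Task("morning briefing", scheduled=True)
    draft = Task("blog draft", scheduled=False)
    print(route(briefing).value, route(draft).value)
```

Keeping the rule in one function means that if the 35B benchmark succeeds, the only change is flipping `large_model_available`.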
“The best local model is the one that runs unattended at 02:00 without crashing. Token speed is a distant second.”
Supporting Evidence
Blog posts documenting the local LLM experiments in detail.
Why I Run a Local LLM on My Mac Mini
The cost, control, and psychology of zero-marginal-cost inference.
Benchmarking Qwen3 14B on Apple Silicon
MLX vs llama.cpp vs Ollama — real numbers, not marketing claims.
When to Use Local vs Cloud — My Actual Routing Logic
Why scheduled jobs stay on cloud and interactive work stays local.
