How-To Series · Episode 51 / 59 · Module 8: Provider Plumbing

Hermes · Run Local LLMs on Mac

A capable model on your own Apple Silicon: private, free, offline. Then point Hermes at it.

After this videoYou can now run Hermes fully offline on a Mac.

On Apple Silicon you can run a genuinely capable model locally: full privacy, $0/token, usable speed. The only real constraint is memory, budget = model size + KV cache, so quantize the cache to fit (a q4 KV cache cuts memory ~75%). Two backends: llama.cpp (brew install llama.cpp, fastest first token, tight memory) and MLX via omlx (Apple's own framework, fastest sustained generation). Point Hermes at either with hermes model → Custom endpoint → http://localhost:8080 + the model name; Hermes auto-detects local endpoints and relaxes streaming timeouts. Want it simpler? Ollama is one command, zero config.

About these resources. Every command comes from the Run Local LLMs on Mac guide; the AI Providers doc is cited for the Ollama and custom-endpoint paths.

Sources · What this video distills

2 docs pages · every command below traces to one of them
Primary · llama.cpp + omlx/MLX, model sizing, quantized KV cache, connecting Hermes, timeouts
Run Local LLMs on Mac
Read ↗
The Ollama zero-config path and custom-endpoint flow
AI Providers
Read ↗

Commands shown · Copy and paste

each shows the source doc it came from
Install + serve (llama.cpp)from source ↗
brew install llama.cpp · llama-server -m model.gguf -ngl 99 -c 131072 -fa on --cache-type-k q4_0 --cache-type-v q4_0
Download a modelfrom source ↗
huggingface-cli download unsloth/Qwen3.5-9B-GGUF Qwen3.5-9B-Q4_K_M.gguf --local-dir ~/models
Connect Hermesfrom source ↗
hermes model → Custom endpoint → http://localhost:8080 + model name

Going deeper · Related Hermes docs

further reading · not sources of facts shown above

Next in the series · Episodes that build on this

E50
The Provider Landscape
E55
Configuring Models
E52
Provider Routing