How-To Series · Episode 51 / 59 · Module 8: Provider Plumbing
Hermes · Run Local LLMs on Mac
A capable model on your own Apple Silicon: private, free, offline. Then point Hermes at it.
After this video
You can now run Hermes fully offline on a Mac.
On Apple Silicon you can run a genuinely capable model locally, with full privacy, $0/token, and usable speed. The only real constraint is memory: your budget is model size + KV cache, so quantize the cache to fit (a q4 KV cache cuts cache memory by ~75%). There are two backends: llama.cpp (brew install llama.cpp; fastest first token, tight memory) and MLX via omlx (Apple's own framework; fastest sustained generation). Point Hermes at either with hermes model → Custom endpoint → http://localhost:8080 + the model name; Hermes auto-detects local endpoints and relaxes streaming timeouts. Want it simpler? Ollama is one command, zero config.
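To make the memory budget concrete, here is a rough sketch of the arithmetic in Python. It assumes a standard transformer KV cache (one K and one V tensor per layer) and uses made-up architecture numbers (32 layers, 8 KV heads, head dimension 128) and a made-up 5.5 GiB model file; none of these come from the guide, so substitute the values for your own model. It also ignores compute buffers and other runtime overhead, so treat the result as a lower bound.

```python
# Rough memory budget: model file size + KV cache size.
# All architecture numbers below are illustrative assumptions, not the
# real values for any specific model.

GIB = 1024 ** 3

# Bytes stored per cached value for two llama.cpp cache types.
# f16 uses 2 bytes per value; q4_0 packs 32 values into an 18-byte block
# (16 bytes of 4-bit values plus a 2-byte scale).
BYTES_PER_VALUE = {"f16": 2.0, "q4_0": 18 / 32}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type="f16"):
    """Estimate KV cache size: K and V tensors for every layer and token."""
    values = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return values * BYTES_PER_VALUE[cache_type]


def memory_budget_bytes(model_file_bytes, **kv_args):
    """Budget = model size + KV cache (overhead not included)."""
    return model_file_bytes + kv_cache_bytes(**kv_args)


# Hypothetical model shape with the 131072-token context from the command.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=131072)

f16 = kv_cache_bytes(**shape, cache_type="f16")
q4 = kv_cache_bytes(**shape, cache_type="q4_0")

print(f"f16 cache:  {f16 / GIB:.1f} GiB")   # 16.0 GiB
print(f"q4_0 cache: {q4 / GIB:.1f} GiB")    # 4.5 GiB
print(f"saving:     {1 - q4 / f16:.0%}")    # 72%

total = memory_budget_bytes(5.5 * GIB, **shape, cache_type="q4_0")
print(f"budget:     {total / GIB:.1f} GiB")  # 10.0 GiB
```

With these assumed numbers the cache alone drops from 16 GiB to 4.5 GiB, which is where the "~75%" figure comes from: at long context lengths the cache, not the weights, tends to decide whether the model fits.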
About these resources. Every command comes from the Run Local LLMs on Mac guide; the AI Providers doc is cited for the Ollama and custom-endpoint paths.
Sources · What this video distills
2 docs pages · every command below traces to one of them
Commands shown · Copy and paste
each shows the source doc it came from
brew install llama.cpp · llama-server -m model.gguf -ngl 99 -c 131072 -fa on --cache-type-k q4_0 --cache-type-v q4_0
huggingface-cli download unsloth/Qwen3.5-9B-GGUF Qwen3.5-9B-Q4_K_M.gguf --local-dir ~/models
hermes model → Custom endpoint → http://localhost:8080 + model name
Going deeper · Related Hermes docs
further reading · not sources of facts shown above
Next in the series · Episodes that build on this
E50 · The Provider Landscape
E55 · Configuring Models
E52 · Provider Routing