How-To Series · Episode 51 / 59 · Module 8: Provider Plumbing

Hermes · Run Local LLMs on Mac

A capable model on your own Apple Silicon: private, free, offline. Then point Hermes at it.

After this video · You can now run Hermes fully offline on a Mac.

On Apple Silicon you can run a genuinely capable model locally with full privacy, $0/token, and usable speed. The only real constraint is memory: your budget is model size + KV cache, so quantize the cache to fit (a q4 KV cache cuts the cache's memory by ~75%). There are two backends: llama.cpp (brew install llama.cpp; fastest first token, tight memory use) and MLX via omlx (MLX is Apple's own framework; fastest sustained generation). Point Hermes at either one with hermes model → Custom endpoint → http://localhost:8080 + the model name; Hermes auto-detects local endpoints and relaxes streaming timeouts. Want it simpler? Ollama is one command, zero config.
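As a rough illustration of the memory budget above, the sketch below estimates KV cache size at f16 and at q4_0 for a 131072-token context. The model shape (layer count, KV heads, head dimension) is hypothetical and chosen only to make the arithmetic easy to follow; real values come from your model's metadata. The q4_0 figure assumes ggml's block layout of 18 bytes per 32 values, which is why the saving lands slightly under 75%.

```shell
# Hypothetical model shape (illustrative only, not taken from any specific model)
n_layers=32
n_kv_heads=8
head_dim=128
ctx=131072   # matches the -c 131072 context size used in this episode

# Values stored: K and V, for every layer, KV head, head dimension, and token
elems=$(( 2 * n_layers * n_kv_heads * head_dim * ctx ))

# f16 stores 2 bytes per value; q4_0 packs 32 values into an 18-byte block
f16_bytes=$(( elems * 2 ))
q4_bytes=$(( elems / 32 * 18 ))

f16_mib=$(( f16_bytes / 1024 / 1024 ))
q4_mib=$(( q4_bytes / 1024 / 1024 ))
saved_pct=$(( 100 - 100 * q4_bytes / f16_bytes ))

echo "f16 KV cache:  ${f16_mib} MiB"
echo "q4_0 KV cache: ${q4_mib} MiB"
echo "saved:         ${saved_pct}%"
```

With this made-up shape the cache drops from 16384 MiB to 4608 MiB, which on a 16 GB or 24 GB Mac can decide whether a long context fits at all.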

About these resources. Every command comes from the Run Local LLMs on Mac guide; the AI Providers doc is cited for the Ollama and custom-endpoint paths.

New words here · Plain English

one sentence each · full glossary

Apple Silicon · The M-series chips in modern Macs. They are fast enough to run AI models locally.

Ollama · A tool that runs AI models on your own machine. No API costs, no internet needed.

Sources · What this video distills

2 docs pages · every command below traces to one of them

Primary · llama.cpp + omlx/MLX, model sizing, quantized KV cache, connecting Hermes, timeouts

Run Local LLMs on Mac

The Ollama zero-config path and custom-endpoint flow

AI Providers

Commands shown · Copy and paste

each shows the source doc it came from

Install + serve (llama.cpp) · from source ↗

brew install llama.cpp

llama-server -m model.gguf -ngl 99 -c 131072 -fa on --cache-type-k q4_0 --cache-type-v q4_0

Download a model · from source ↗

huggingface-cli download unsloth/Qwen3.5-9B-GGUF Qwen3.5-9B-Q4_K_M.gguf --local-dir ~/models
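Once the file is downloaded, a quick check of model size + KV cache against installed memory can save a failed launch. The sketch below is one way to do it on macOS. The `fits` helper, the 25% headroom, and the KV cache figure are all assumptions made for illustration, not values from the guide.

```shell
# Path produced by the download command above
model_path="$HOME/models/Qwen3.5-9B-Q4_K_M.gguf"

# fits MODEL_BYTES KV_BYTES RAM_BYTES
# Succeeds if model + KV cache stay within 75% of RAM.
# Assumption: leaving 25% for macOS and other apps is an illustrative threshold.
fits() {
  [ $(( $1 + $2 )) -le $(( $3 / 4 * 3 )) ]
}

if [ -f "$model_path" ]; then
  model_bytes=$(wc -c < "$model_path")   # file size in bytes
  ram_bytes=$(sysctl -n hw.memsize)      # macOS: total physical memory in bytes
  kv_bytes=$(( 4608 * 1024 * 1024 ))     # hypothetical KV cache estimate
  if fits "$model_bytes" "$kv_bytes" "$ram_bytes"; then
    echo "model + KV cache should fit"
  else
    echo "too large: use a smaller context, a smaller model, or a more quantized cache"
  fi
fi
```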

Connect Hermes · from source ↗

hermes model → Custom endpoint → http://localhost:8080 + model name
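Before entering the endpoint in Hermes, it can help to confirm that something is actually listening on it. The sketch below probes the OpenAI-style model-listing route that llama-server exposes at /v1/models; it assumes your backend serves that route on the same port, which may not hold for every server. The helper names `models_url` and `probe` are made up for this example.

```shell
# Base URL from the custom-endpoint step; change it if your server listens elsewhere
base_url="http://localhost:8080"

# models_url BASE: print the model-listing URL, tolerating a trailing slash
models_url() {
  printf '%s/v1/models\n' "${1%/}"
}

# probe BASE: succeed only if the server answers the model-listing request
probe() {
  curl -fsS --max-time 3 "$(models_url "$1")" >/dev/null 2>&1
}

if probe "$base_url"; then
  echo "server reachable at $base_url"
else
  echo "no server answering at $base_url (is llama-server running?)"
fi
```

If the probe succeeds, running the same curl request without the output redirect prints the JSON model list, which is a likely place to find the model name that Hermes asks for.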

Going deeper · Related Hermes docs

further reading · not sources of facts shown above

The Provider Landscape

where local fits among the options

Configuration · context length

sizing context for tools

Next in the series · Episodes that build on this

E50

The Provider Landscape

E55

Configuring Models

E52

Provider Routing