
HomeLab

Four machines. One OpenAI-compatible endpoint.

A private AI cluster running entirely on local hardware. Each machine has one job — fast vision, heavy coding, deep reasoning, or general chat — and a small custom gateway called homelab-router decides which one should answer each request.

Everything speaks the OpenAI Chat Completions schema, so Cursor, OpenCode, a personal Telegram agent, Home Assistant, and the security-camera pipeline all hit a single endpoint and never need to know which GPU is doing the work.
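Because every client already speaks that schema, pointing one at the cluster is just a base-URL change. A minimal sketch — the port and the dummy API key are assumptions; the page only says the gateway sits on the loopback interface:

from openai import OpenAI

# Point any OpenAI-compatible client at the gateway; "homelab/auto" lets the
# router pick the machine. Port 8000 and the placeholder key are assumptions.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="local-no-key")

resp = client.chat.completions.create(
    model="homelab/auto",
    messages=[{"role": "user", "content": "Summarize last night's Frigate events."}],
)
print(resp.choices[0].message.content)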

Architecture

One gateway in front of four GPUs

Every client speaks to the gateway. None of them know which machine will answer.

Architecture at a glance:

  • CLIENTS: Cursor, OpenCode, Hermes, Home Assistant, Frigate, Ring
  • GATEWAY: homelab-router · deterministic routing · fallback chain · health checks · Prometheus
  • THE EYES (ubuntu-4090): fast vision, JSON utility, low-latency routing pre-filter.
  • THE HANDS (ubuntu-6000): heavy code generation, refactoring, agentic / multi-file work.
  • THE BRAIN (Spark): senior reasoning, architecture, final review, safety fallback.
  • THE MOUTH (win-5090): general chat, rewrites, medium-context summarization, spillover.

One OpenAI-compatible endpoint. Four machines. Every routing decision is deterministic and auditable.

Clients

Who calls the endpoint

Cursor, OpenCode, a personal Telegram agent called Hermes, Home Assistant, Frigate, and a Ring camera pipeline all speak the same Chat Completions schema. The router does the rest.

  • Cursor

    AI-native editor → coding bucket → ubuntu-6000

  • OpenCode

    Multi-file refactors → agentic bucket → ubuntu-6000

  • Hermes

    Personal agent — manages the lab, also a chat client

  • Home Assistant

    Camera + automation orchestrator

  • Frigate

    NVR motion events → vision pipeline

  • Ring pipeline

    Doorbell snapshots → vision pipeline

The four machines

Each one optimized for a single role


The Eyes

ubuntu-4090

Fast vision, JSON utility, low-latency routing pre-filter.

homelab/vision · homelab/fast

Hardware

GPU: RTX 4090 Laptop GPU
VRAM: 16 GB
Bus: Internal PCIe (workstation)
OS: Ubuntu

Production model

Model: Qwen2.5-VL-7B-Instruct-AWQ
Served as: qwen2.5-vl-7b-router-awq
Runtime: vLLM (OpenAI-compatible)

What it does

  • Vision — every image-bearing request lands here first
  • Ring camera snapshots
  • Frigate events
  • Screenshot understanding
  • Fast classifier: the “is this even worth escalating?” pre-filter
  • JSON utility: structured outputs for the camera pipeline, notifications, and tagging

Latency it actually hits

p50: target ≤ 1 s · measured 0.61 s
p95: target ≤ 2 s · measured 0.81 s

  • 4 / 4 routing safety cases pass
  • Zero false suppressions
  • Zero invalid JSON

Optimizations

  • AWQ INT4 quantization — roughly a quarter the bits per weight of FP16
  • vLLM with continuous batching and prefix caching
  • Production context locked to 32K tokens after a latency / KV / VRAM sweep
  • Auto-start as a user-level systemd unit — no manual login after reboot
  • Approximately 9.5 GB of 16 GB used, leaving headroom for CUDA-graph capture and prefix-cache growth

Hard rules

  • Nothing else co-hosts on the 4090 — no coding model, no large-context model
  • The slot is single-purpose on purpose

Recovery story

When the eGPU on the sibling machine has a PCIe link-down/up cycle, the entire NVIDIA driver enters a degraded state where nvidia-smi still works but every fresh CUDA init fails. The router detects this, fails traffic over to Spark, and a recover-cuda.sh script reloads the NVIDIA UVM module and brings everything back up automatically. The whole sequence is documented step-by-step in the runbook.
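The script itself is not reproduced here; the documented fix boils down to reloading the UVM module and restarting the serving process. A rough Python rendition of that sequence for illustration (the real recover-cuda.sh is a shell script, and the systemd unit name below is a placeholder, not the real one):

import subprocess

def recover_cuda() -> None:
    # Stop the GPU consumer so the kernel module can actually unload.
    subprocess.run(["systemctl", "--user", "stop", "vllm-vision.service"], check=False)
    # Reload the NVIDIA UVM module to clear the poisoned CUDA-init state.
    subprocess.run(["sudo", "modprobe", "-r", "nvidia_uvm"], check=True)
    subprocess.run(["sudo", "modprobe", "nvidia_uvm"], check=True)
    # Bring the model server back up.
    subprocess.run(["systemctl", "--user", "start", "vllm-vision.service"], check=False)

if __name__ == "__main__":
    recover_cuda()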

vLLM launch (excerpt)

# qwen2.5-vl-7b-router-awq on ubuntu-4090
vllm serve Qwen/Qwen2.5-VL-7B-Instruct-AWQ \
  --served-model-name qwen2.5-vl-7b-router-awq \
  --quantization awq_marlin \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.78 \
  --enable-prefix-caching \
  --host 127.0.0.1

The Hands

ubuntu-6000

Heavy code generation, refactoring, agentic / multi-file work.

homelab/code · homelab/agent

Hardware

GPU: RTX PRO 6000 Blackwell
VRAM: 96 GB
Bus: Thunderbolt eGPU
OS: Ubuntu

Production model

Model: GLM-4.5-Air-AWQ
Served as: glm-4.5-air-awq
Runtime: vLLM (OpenAI-compatible)

What it does

  • Code generation
  • Refactoring
  • Multi-file edits
  • Agentic tool-use work
  • Cursor and OpenCode are routed here by default
  • Longer-context utility work

Optimizations

  • AWQ W4A16 quantization using compressed tensors
  • TurboQuant KV cache in k8v4 mode — 8-bit keys, 4-bit values
  • ≈ 2.6× KV compression vs FP16 with negligible quality cost
  • ≈ 14% more KV cache slots than the FP8 baseline at the same context budget
  • Per-layer aware: 42 quantized layers use TurboQuant; 4 boundary layers auto-skip to FLASH_ATTN for quality safety
  • 128K-token context window
  • Auto-start as a user-level systemd unit with --profile long

Hard rules

  • No cross-GPU tensor parallelism with the 4090 — Thunderbolt latency would kill it
  • No model swapping — treat it as a single resident endpoint
  • No CPU offload — 96 GB is plenty of headroom and any swapping would gut throughput

Reliability story — why TurboQuant beat FP8

The earlier FP8 KV configuration with TRITON_ATTN was faster per token on paper, but consistently crashed the AWQ Marlin kernel mid-load with `CUDA error: unspecified launch failure`. TurboQuant k8v4 was slower in raw decode but completed every load on this Blackwell + driver combination. Reliability won. The configuration choice is documented as “the only one that survived the sustained bench.” MTP (multi-token speculative decoding) is intentionally off: the current AWQ checkpoint drops the speculative head’s weights, so draft acceptance is 0%. It will be re-enabled when the checkpoint is fixed upstream.

vLLM launch (excerpt)

# glm-4.5-air-awq on ubuntu-6000 (RTX PRO 6000 Blackwell)
vllm serve cpatonn/GLM-4.5-Air-AWQ \
  --served-model-name glm-4.5-air-awq \
  --quantization compressed-tensors \
  --kv-cache-dtype turboquant_k8v4 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --speculative-config '{"method":"none"}' \
  --host 127.0.0.1

The Brain

Spark

Senior reasoning, architecture, final review, safety fallback.

homelab/reasoning · homelab/review

Hardware

GPU: NVIDIA DGX Spark
VRAM: Unified CPU + GPU memory
Bus: GB10 ARM SoC, attached via private overlay
OS: Linux

Production model

Model: Qwen3.6-35B-A3B
Served as: qwen3.6-35b-a3b-fp8
Runtime: vLLM (OpenAI-compatible)

What it does

  • Deep reasoning
  • Architecture review
  • Planning
  • Hard debugging
  • Final review and risk analysis
  • Safety fallback for every other route

What it does not do

  • Spark never falls back — it is the bottom of the chain
  • If Spark is down, the router returns a 503, period
  • No silent downgrade to a less-capable model on a safety-critical decision

Optimizations

  • FP8 weights — throughput and KV-cache efficiency on GB10’s unified memory
  • vLLM with prefix caching
  • Chunked prefill
  • YaRN for long-context overrides
  • Open WebUI front-end pinned to its own data directory so other UIs do not collide
  • Treated by the router as a network-attached upstream over the private overlay

Hard rules

  • Spark never falls back to another model
  • If Spark is down, the answer is 503 — never a quiet downgrade
  • The router does not try to start, stop, or recover Spark — the operator gets paged

vLLM launch (excerpt)

# qwen3.6-35b-a3b-fp8 on Spark (DGX, GB10)
vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
  --served-model-name qwen3.6-35b-a3b-fp8 \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --max-model-len 131072 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --rope-scaling '{"type":"yarn","factor":4.0}' \
  --host 127.0.0.1

The Mouth

win-5090

General chat, rewrites, medium-context summarization, spillover.

homelab/chat · homelab/utility · homelab/gemma

Hardware

GPU: RTX 5090
VRAM: 32 GB
Bus: Internal PCIe
OS: Windows 11

Production model

Model: Gemma 4 31B
Served as: gemma-4-31b
Runtime: LM Studio (OpenAI-compatible)

What it does

  • General chat
  • Light Q&A
  • Rewrites, format cleanup, paraphrasing, translation
  • Medium-context summarization (≈ 1K–8K tokens)
  • Text-only spillover when ubuntu-4090 is saturated
  • Code spillover when ubuntu-6000 is busy

What it does not do

  • No image work — Gemma cannot see, so the router never sends image-bearing requests here
  • Images always go to ubuntu-4090 or, during an outage, to Spark
  • No final architecture review or deep reasoning — that stays on Spark
  • No camera safety decisions

Optimizations

  • vLLM-style serving via LM Studio with three concurrent slots
  • Router concurrency cap matched exactly to the upstream slot count
  • Callers beyond the cap get a `busy` response and are sent to the fallback instead of queueing forever
  • gpu-memory-utilization set to 0.86
  • 32K context
  • Batched-token budget tuned to the 32 GB card
  • Inbound traffic pinned to the private overlay network at the Windows Firewall layer

Hard rules

  • SSH, the LLM endpoint, and the Netdata UI are unreachable from public Wi-Fi by design
  • Never the image path — the router carries a has_image flag through preflight so this rule wins per request

LM Studio config (excerpt)

# gemma-4-31b on win-5090 (LM Studio, OpenAI-compatible)
{
  "served_model_name": "gemma-4-31b",
  "context_length": 32768,
  "gpu_memory_utilization": 0.86,
  "max_concurrent_slots": 3,
  "bind": "private-overlay-only",
  "firewall": "block all non-overlay inbound"
}

The router

homelab-router — the keystone

A small FastAPI service that exposes one OpenAI-compatible endpoint on the loopback interface. Every client speaks to it; none of them know about the four upstream models individually.

Design principles

Deterministic, not LLM-based

Routing decisions are made by a keyword classifier scoring the prompt against a YAML policy file, plus three override mechanisms — request header, pinned model name, metadata hint.

A control plane, not a fourth model

The router never generates text itself. Every decision is auditable and reproducible.

Safety-first tie-breaks

When the classifier scores multiple buckets: vision > coding > reasoning > chat. 'Review the security of this code' lands on Spark, not on the chat model.

Image presence is always honored

Any request carrying an image_url part goes to ubuntu-4090 regardless of keywords — nothing else in the rig can see.
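Put together, the deterministic core is small enough to sketch in a few lines. This is an illustration, not the production router: the bucket keywords mirror the policy table further down, while the scoring details and function shape are assumptions.

# Illustrative sketch of the deterministic classifier (not the real router code).
BUCKETS = {
    "vision":       {"keywords": ["image", "screenshot", "ring", "frigate", "motion"], "target": "ubuntu-4090"},
    "coding":       {"keywords": ["refactor", "implement", "code", "fastapi", "pytest"], "target": "ubuntu-6000"},
    "reasoning":    {"keywords": ["architecture", "trade-off", "review", "risk", "plan"], "target": "spark"},
    "general_chat": {"keywords": ["tell me about", "what is", "how do i"], "target": "win-5090"},
}
TIE_BREAK = ["vision", "coding", "reasoning", "general_chat"]  # safety-first priority

def route(prompt: str, has_image: bool) -> str:
    if has_image:
        return "ubuntu-4090"            # image presence always wins; nothing else can see
    text = prompt.lower()
    scores = {name: sum(kw in text for kw in cfg["keywords"]) for name, cfg in BUCKETS.items()}
    best = max(scores.values())
    if best == 0:
        return "spark"                  # nothing matched: route to the senior model for safety
    for bucket in TIE_BREAK:            # break ties by safety priority, not alphabetically
        if scores[bucket] == best:
            return BUCKETS[bucket]["target"]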

Camera tasks never silently say 'ignore'

Schema-validation failure on a camera event produces a synthetic escalate_to_spark envelope, never a suppression. The router fails closed.

Hot-reloadable policy

Routing rules live in routing_policy.yaml and can be reloaded over an admin endpoint without restarting the gateway.

Loopback endpoints

Method  Path                    Purpose
POST    /v1/chat/completions    OpenAI Chat Completions shim — every client speaks here
GET     /v1/models              Friendly model catalog: homelab/auto, homelab/code, homelab/vision, homelab/agent, …
POST    /v1/embeddings          Embeddings shim for downstream RAG and vector stores
GET     /metrics                Prometheus exposition — counters, histograms, gauges
GET     /stats                  JSON snapshot of the same counters for humans
GET     /admin/upstreams        Per-route health, load, saturation, drained flag
POST    /admin/drain/<route>    Drain a route without restarting — explicit pins still pass through
POST    /admin/reload-config    Hot-reload routing_policy.yaml
POST    /admin/recover-cuda     Authenticated one-button CUDA recovery

All routes are bound to the loopback interface on the workstation. No public surface, no Wi-Fi exposure.
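Operating the gateway from the workstation is plain HTTP against those paths. A hedged example — the port, the auth header, and the exact route names are assumptions; only the paths above are given on this page:

import requests

BASE = "http://127.0.0.1:8000"
HEADERS = {"Authorization": "Bearer <admin-token>"}   # hypothetical auth scheme

# Take the coding route out of auto-routing before maintenance; explicit pins still pass.
requests.post(f"{BASE}/admin/drain/code", headers=HEADERS, timeout=5)

# Push an edited routing_policy.yaml without restarting the gateway.
requests.post(f"{BASE}/admin/reload-config", headers=HEADERS, timeout=5)

# Inspect per-route health, load, saturation, and drained flags.
print(requests.get(f"{BASE}/admin/upstreams", headers=HEADERS, timeout=5).json())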

Operational ergonomics

Drain a route live

POST /admin/drain/<route> redirects auto-routed traffic but lets explicit homelab/<route> pins pass through. No restart.

Hot-reload policy

POST /admin/reload-config swaps routing_policy.yaml in place. The keyword classifier picks up the new buckets immediately.

Watchdog every minute

A one-minute systemd timer re-probes every upstream, restarts containers whose /v1/models is down, and triggers CUDA recovery on context failure.

Structured logs

logs/router-decisions.jsonl       # bucket, route, reason, latency, fallback path
logs/upstream-failures.jsonl      # 5xx, timeouts, CUDA poisoning events
logs/recovery-events.jsonl        # CUDA reloads, container restarts
logs/structured-validation.jsonl  # schema failures on camera events

Routing policy

Capability buckets

The router classifies each request into a bucket, then maps the bucket to a machine. The mapping lives in routing_policy.yaml and can be hot-reloaded.

Bucket            Default target                       Example prompts
vision            ubuntu-4090                          image, screenshot, ring, frigate, motion
notification      win-5090, or ubuntu-4090 if image    sms, alert, push notification
fast              ubuntu-4090                          classify, json, tags, label, summarize briefly
general_chat      win-5090                             tell me about, ask, what is, how do I, opinion
utility           win-5090                             rewrite, clean up, format, paraphrase, polish, translate
medium_summary    win-5090                             summarize, tl;dr, condense, recap, executive summary
coding            ubuntu-6000                          refactor, implement, code, FastAPI, pytest
agentic           ubuntu-6000                          agent, multi-step, tool use, OpenCode, Hermes
reasoning         Spark                                architecture, trade-off, review, risk, plan
unknown           Spark                                nothing matched — route to senior model for safety

Tie-breaks: vision > coding > reasoning > chat. Image-bearing requests always route to ubuntu-4090 regardless of keywords.

Size promotions

When prompts get big, the bucket promotes up

If a prompt is large, the router promotes it to a heavier model even if the bucket is cheap.

Bucket                      ≤ 8K tokens     8K–24K          > 24K
vision                      ubuntu-4090     ubuntu-4090     ubuntu-4090
coding / agentic            ubuntu-6000     ubuntu-6000     ubuntu-6000
reasoning / review          Spark           Spark           Spark
fast                        ubuntu-4090     ubuntu-6000     Spark
chat / utility / summary    win-5090        win-5090        Spark

Size promotions are applied after the bucket is chosen. A large chat prompt promotes to Spark; vision and coding stay pinned to their dedicated machines regardless of size.
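In code form the promotion is a single post-classification step. A sketch using the thresholds from the table above (the function shape and names are illustrative):

def promote_for_size(bucket: str, target: str, prompt_tokens: int) -> str:
    # Vision, coding/agentic, and reasoning stay pinned to their machines regardless of size.
    if bucket in ("vision", "coding", "agentic", "reasoning"):
        return target
    if prompt_tokens > 24_000:
        return "spark"                  # very large prompts promote to the senior model
    if bucket == "fast" and prompt_tokens > 8_000:
        return "ubuntu-6000"            # mid-size "fast" work moves to the bigger card
    return target                       # chat / utility / summary stay on win-5090 up to 24K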

Fallback chain

Selected · busy · outage · everything down

The asymmetric ubuntu-4090 fallback is the prettiest part — text spills to win-5090 for a fast local hop, but anything carrying an image cascades straight to Spark because win-5090 is blind.

Fallback behavior per route:

  • ubuntu-4090 (THE EYES): busy → text spills to win-5090, images go to Spark · outage → failover to Spark, cameras get a synthetic envelope · everything down → 503
  • ubuntu-6000 (THE HANDS): busy → spill to win-5090 · outage → failover to Spark · everything down → 503
  • win-5090 (THE MOUTH): busy → spill to Spark · outage → failover to Spark · everything down → 503
  • Spark (THE BRAIN): no spill — terminal · outage → 503 · the senior model never falls back

The asymmetric ubuntu-4090 fallback

On a saturated 4090, text-only requests spill to win-5090 (fast local hop). Image-bearing requests cascade to Spark — because win-5090 is blind to images. The router carries a `has_image` flag so the right rule wins per request.

Spark is the bottom

The senior model never falls back. If Spark itself is down, the answer is 503 — never a quiet downgrade to a smaller model on a safety-critical decision.
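As a sketch, the per-request decision for a struggling ubuntu-4090 reduces to a few branches. Status names and the function shape are illustrative; the chain itself is the one in the diagram above.

def eyes_fallback(status: str, has_image: bool) -> str:
    # Fallback for ubuntu-4090, "The Eyes".
    if status == "busy":
        return "spark" if has_image else "win-5090"   # win-5090 is blind to images
    if status == "outage":
        return "spark"                                 # failover; camera events may get a synthetic envelope
    return "503"                                       # everything down: fail loudly, never downgrade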

Sample request flow

Same prompt. Two timelines.

The left timeline is the normal path. The right one is what happens when ubuntu-6000 is saturated and the router spills to win-5090 instead of queueing forever.

Normal request

Cursor → router → ubuntu-6000

  1. Cursor: POST /v1/chat/completions · model: "homelab/auto"
  2. homelab-router: preflight, has_image flag set false
  3. classifier: bucket "coding" (keywords: refactor, FastAPI, pytest)
  4. ubuntu-6000: glm-4.5-air-awq · turboquant k8v4 · 128K context
  5. 200 OK: streamed reply · route=ubuntu-6000 · reason=bucket_match

Saturated request

Cursor → router → ubuntu-6000 busy → spill to win-5090

  1. Cursor: POST /v1/chat/completions · model: "homelab/auto"
  2. homelab-router: preflight, has_image flag set false
  3. classifier: bucket "coding" (keywords: refactor, FastAPI, pytest)
  4. ubuntu-6000 busy: concurrency cap reached, spill to win-5090
  5. win-5090: gemma-4-31b · LM Studio · 32K context
  6. 200 OK: streamed reply · route=win-5090 · reason=ubuntu-6000_saturated

Home automation + cameras

Two-stage vision triage

A Ring or Frigate event lands on a sidecar. The sidecar asks ubuntu-4090 whether the image is worth deeper analysis. Confident negatives drop locally. Positives and ambiguous events escalate to Spark for a detailed description.

Flow: Ring / Frigate motion event → vision-notifier sidecar → ubuntu-4090 fast classifier → confident negatives drop locally, positives and ambiguous events go to Spark for a detailed description → Home Assistant as orchestrator. Notifications fan out on three channels: a rich primary (image + AI description), a reliable secondary (text-only push), and an archival timeline (persistent record). One channel failing never silences the others.

Two-stage triage. The fast classifier filters confident negatives locally; everything ambiguous escalates to Spark for a detailed description. If Spark is unreachable, the sidecar builds a synthetic envelope from the classifier output and still reaches Home Assistant — a reasoning-model outage degrades description quality but never silences the alert.
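A condensed sketch of that sidecar logic, assuming the gateway on the loopback port used earlier and a simple {"escalate": bool, ...} classifier schema — both assumptions, this is not the production sidecar:

import json
import requests

ROUTER = "http://127.0.0.1:8000/v1/chat/completions"

def triage(image_url: str, camera: str) -> dict:
    # Stage 1: fast classifier on ubuntu-4090 returns structured JSON.
    fast = requests.post(ROUTER, json={
        "model": "homelab/vision",
        "messages": [{"role": "user", "content": [
            {"type": "text", "text": "Is this motion event worth escalating? Answer as JSON."},
            {"type": "image_url", "image_url": {"url": image_url}},
        ]}],
    }, timeout=10).json()

    try:
        verdict = json.loads(fast["choices"][0]["message"]["content"])
    except (KeyError, json.JSONDecodeError):
        verdict = {"escalate": True, "reason": "schema_validation_failed"}   # fail closed, never suppress

    if not verdict.get("escalate", True):
        return {"action": "drop_locally", "camera": camera}   # confident negative: no notification

    # Stage 2: positives and ambiguous events go to Spark for a detailed description.
    try:
        deep = requests.post(ROUTER, json={
            "model": "homelab/reasoning",
            "messages": [{"role": "user", "content": f"Describe this camera event in detail: {verdict}"}],
        }, timeout=30).json()
        description = deep["choices"][0]["message"]["content"]
    except requests.RequestException:
        # Spark unreachable: a synthetic envelope from the classifier output still reaches HA.
        description = f"[degraded description] {verdict}"

    return {"action": "notify", "camera": camera, "description": description}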

Defensive monkey-patch on Ring listener

Malformed cloud payloads no longer kill the listener thread.

Parallel push via MQTT

An independent push source that does not depend on the cloud-side listener.

Triple triggers + two-layer dedup

Real-time push, parallel push, and polled fallback — deduped so they never fan out into three notifications.

DNS pinned to public resolvers

Inside the affected containers, to bypass an intermittent local DNS issue that broke Ring REST calls.

Observability

Netdata, MQTT, Home Assistant

Every machine runs Netdata. The Ubuntu workstation is the parent; Spark and win-5090 stream metrics into it over the private overlay. A small MQTT bridge polls Netdata every 15 seconds and publishes curated metrics as auto-discovered Home Assistant sensors.
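The bridge itself is small. A rough sketch of its loop — Netdata's allmetrics endpoint and Home Assistant MQTT discovery are real interfaces, but the hostnames, chart names, and topic layout below are assumptions:

import json
import time
import requests
import paho.mqtt.publish as publish

NETDATA = "http://127.0.0.1:19999"    # Netdata parent; 19999 is Netdata's default port
BROKER = "homeassistant.local"        # MQTT broker location assumed

def publish_sensor(host: str, name: str, value: float, unit: str) -> None:
    uid = f"homelab_{host}_{name}"
    # Retained discovery config so Home Assistant auto-creates the sensor.
    publish.single(f"homeassistant/sensor/{uid}/config", json.dumps({
        "name": f"{host} {name}", "unique_id": uid,
        "state_topic": f"homelab/{host}/{name}", "unit_of_measurement": unit,
    }), hostname=BROKER, retain=True)
    publish.single(f"homelab/{host}/{name}", str(value), hostname=BROKER)

while True:
    metrics = requests.get(f"{NETDATA}/api/v1/allmetrics?format=json", timeout=5).json()
    cpu_user = metrics["system.cpu"]["dimensions"]["user"]["value"]   # chart/dimension names assumed
    publish_sensor("workstation", "cpu_user_pct", cpu_user, "%")
    time.sleep(15)                                                    # the 15-second poll cadence from above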

Metrics flow: the Spark and win-5090 Netdata agents stream over the private overlay into the Netdata parent on the Ubuntu workstation (ubuntu-6000 and the 4090 run local agents on the same host as the parent). The MQTT bridge polls the parent every 15 s and publishes curated metrics per host via MQTT discovery; Home Assistant renders them on a System Health dashboard (Overview, GPUs, Storage) that is visible on a phone. Per-host metrics: CPU, RAM, load, disk, plus per-GPU utilization, memory, temperature, and power.

Sensor probing

The bridge probes each candidate sensor at discovery time and skips ones the hardware does not actually expose — so dashboards never show ghost charts.

Unified-memory Spark

Spark has unified CPU + GPU memory, so its VRAM chart is intentionally absent. The dashboard tracks Spark GPU memory pressure via the unified-RAM sensor instead.

[Image: Netdata parent — system metrics across the cluster, top nodes by CPU and RAM]

[Image: Home Assistant System Health — per-GPU utilization, power draw, temperature]

Automation

Hermes runs the lab

Hermes is two things at once: a normal client of homelab-router for inference, and an operator of it via the admin API. It schedules health checks, runs incident-response playbooks as skills, and uses Conductor sub-agents to do parallel work without polluting the main loop. It reads ubuntu-6000 and Spark through Netdata + Home Assistant, and alerts on Telegram first with HA as the backup.

Hermes Workspace (a PWA command center with chat, terminal, memory, and skills) is the operator UI for the Hermes Agent, a main loop on the workstation acting as an OpenAI-compatible client with cron, skills, memory, and Conductor. Both its LLM calls and its admin API calls go to homelab-router (OpenAI-compatible plus /admin endpoints). It reads the Netdata parent (metric streams) and Home Assistant (state and alerts) as read-only signals, and delegate_task spawns sub-agents for log triage, A/B prompt eval, and batch eval, each of which calls the router itself. Hermes is both a client of the router and an operator of it.

Scheduled health checks

Cron polls /admin/upstreams and watches GPU temps. no_agent mode runs pure script-output watchdogs without spending tokens.

"every 5m" → poll /admin/upstreams

Runbooks as skills

drain-route, recover-cuda, reload-policy become first-class skills the agent can invoke and improve over time.

skill: recover-cuda

Conductor sub-agents

delegate_task spawns isolated children with their own context and toolset. Parallel log analysis and batch evals stay out of the main loop.

delegate_task → 4× parallel

Persistent memory

Every decision and outcome lives in editable markdown under ~/.hermes/. Honcho builds an evolving model of the operator over time.

~/.hermes/memory/

Multi-platform alerts

Telegram first; Home Assistant fallback when Hermes itself is down. The lab is never silent on a real failure.

Self-improvement

Skills get better with use. Runbooks evolve as the lab evolves. The agent and the lab compound on each other.

Scheduled work — examples

every 5m         → GET /admin/upstreams → flag any not-healthy  # no_agent=True
*/15 * * * *     → scrape /metrics, alert if router p95 > 4s
0 9 * * 1        → weekly: summarize last week's routing-decisions.jsonl
every 2h         → verify camera pipeline end-to-end on a known sample
30m              → one-shot: re-check after an incident-response drain

The scheduler accepts relative delays (30m), intervals (every 2h), cron expressions, and ISO one-shots. Anything labeled no_agent=True runs as a pure script-output watchdog and never spends LLM tokens.
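The no_agent checks are plain scripts whose output Hermes simply forwards. A minimal sketch of the "every 5m" upstream check — the port and the shape of the /admin/upstreams response are assumptions:

import requests

def check_upstreams() -> str:
    upstreams = requests.get("http://127.0.0.1:8000/admin/upstreams", timeout=5).json()
    # Assumed response shape: {"code": {"healthy": true, ...}, "vision": {...}, ...}
    unhealthy = [name for name, info in upstreams.items() if not info.get("healthy", False)]
    if unhealthy:
        return "ALERT: unhealthy routes: " + ", ".join(unhealthy)   # relayed to Telegram, HA as backup
    return "OK: all routes healthy"

if __name__ == "__main__":
    print(check_upstreams())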

Why it fits

Open-source, OpenAI-compatible, local-LLM friendly

Hermes is MIT-licensed (Nous Research) and speaks the OpenAI Chat Completions schema natively, so it points at homelab-router with no adapters. It already supports Ollama, vLLM, and LM Studio — exactly what is running on the four machines.

Where Hermes never overrides

The router still owns routing

Hermes can drain a route, reload policy, or trigger CUDA recovery, but it never picks which model answers a request. If Spark is down, Hermes still gets a 503 like every other client. The deterministic policy is the only thing that decides where requests go.

Design principles

The rules the lab actually lives by

These are the difference between “four GPUs in a closet” and “a system.”

Every machine has one job

ubuntu-4090 sees. ubuntu-6000 codes. Spark thinks. win-5090 chats. No co-hosting. No swapping.

One endpoint, no client-side routing

Every client speaks to the gateway. The gateway hides which machine is busy, down, or saturated.

Deterministic routing, never LLM-based

Every routing decision is made by a keyword classifier scoring against a YAML policy. No model gets to decide where requests go.

Safety routes fail closed

Camera events never silently produce 'ignore'. Reasoning ambiguity always lands on Spark.

The senior model never falls back

Spark is the bottom of the stack. If it is down, the answer is 503 — not 'we sent it to a smaller model and hoped.'

Reliability beats raw speed

FP8 KV on ubuntu-6000, MTP on GLM, and Gemma on ubuntu-4090 were faster on paper. All three were rejected because they were slower under real failure conditions.

Recovery is one command

Driver poisoning, eGPU PCIe link-down, and container hangs all have documented one-line fixes. Most are exposed as authenticated admin API calls.

Nothing private leaves the house

Every model runs locally. The only external dependencies are model weights at install time and the private overlay network. Ring cloud is an input, never a destination.

Operate. Publish. Teach.


See also: /stack for the full daily-driver toolkit · /lab for the rest of the workshop.