
HomeLab

Four machines. One OpenAI-compatible endpoint.

A private AI cluster running entirely on local hardware. Each machine has one job — fast vision, heavy coding, deep reasoning, or general chat — and a small custom gateway called homelab-router decides which one should answer each request.

Everything speaks the OpenAI Chat Completions schema, so Cursor, OpenCode, a personal Telegram agent, Home Assistant, and the security-camera pipeline all hit a single endpoint and never need to know which GPU is doing the work.
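Because every client already speaks that schema, pointing one at the cluster is just a base-URL change. A minimal sketch — the port and the dummy API key are assumptions; the page only says the gateway sits on the loopback interface:

from openai import OpenAI

# Point any OpenAI-compatible client at the gateway; "homelab/auto" lets the
# router pick the machine. Port 8000 and the placeholder key are assumptions.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="local-no-key")

resp = client.chat.completions.create(
    model="homelab/auto",
    messages=[{"role": "user", "content": "Summarize last night's Frigate events."}],
)
print(resp.choices[0].message.content)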

Architecture

One gateway in front of four GPUs

Every client speaks to the gateway. None of them know which machine will answer.

Architecture at a glance:

  • CLIENTS: Cursor, OpenCode, Hermes, Home Assistant, Frigate, Ring
  • GATEWAY: homelab-router · deterministic routing · fallback chain · health checks · Prometheus
  • THE EYES (ubuntu-4090): fast vision, JSON utility, low-latency routing pre-filter.
  • THE HANDS (ubuntu-6000): heavy code generation, refactoring, agentic / multi-file work.
  • THE BRAIN (Spark): senior reasoning, architecture, final review, safety fallback.
  • THE MOUTH (win-5090): general chat, rewrites, medium-context summarization, spillover.

One OpenAI-compatible endpoint. Four machines. Every routing decision is deterministic and auditable.

Clients

Who calls the endpoint

Cursor, OpenCode, a personal Telegram agent called Hermes, Home Assistant, Frigate, and a Ring camera pipeline all speak the same Chat Completions schema. The router does the rest.

  • Cursor

    AI-native editor → coding bucket → ubuntu-6000

  • OpenCode

    Multi-file refactors → agentic bucket → ubuntu-6000

  • Hermes

    Personal agent — manages the lab, also a chat client

  • Home Assistant

    Camera + automation orchestrator

  • Frigate

    NVR motion events → vision pipeline

  • Ring pipeline

    Doorbell snapshots → vision pipeline

The four machines

Each one optimized for a single role


The Eyes

ubuntu-4090

Fast vision, JSON utility, low-latency routing pre-filter.

homelab/vision · homelab/fast

Hardware

GPU: RTX 4090 Laptop GPU
VRAM: 16 GB
Bus: Internal PCIe (workstation)
OS: Ubuntu

Production model

Model: Qwen2.5-VL-7B-Instruct-AWQ
Served as: qwen2.5-vl-7b-router-awq
Runtime: vLLM (OpenAI-compatible)

What it does

  • Vision — every image-bearing request lands here first
  • Ring camera snapshots
  • Frigate events
  • Screenshot understanding
  • Fast classifier: the “is this even worth escalating?” pre-filter
  • JSON utility: structured outputs for the camera pipeline, notifications, and tagging

Latency it actually hits

p50: target ≤ 1 s · measured 0.61 s
p95: target ≤ 2 s · measured 0.81 s

  • 4 / 4 routing safety cases pass
  • Zero false suppressions
  • Zero invalid JSON

Optimizations

  • AWQ INT4 quantization — roughly a quarter the bits per weight of FP16
  • vLLM with continuous batching and prefix caching
  • Production context locked to 32K tokens after a latency / KV / VRAM sweep
  • Auto-start as a user-level systemd unit — no manual login after reboot
  • Approximately 9.5 GB of 16 GB used, leaving headroom for CUDA-graph capture and prefix-cache growth

Hard rules

  • Nothing else co-hosts on the 4090 — no coding model, no large-context model
  • The slot is single-purpose on purpose

Recovery story

When the eGPU on the sibling machine has a PCIe link-down/up cycle, the entire NVIDIA driver enters a degraded state where nvidia-smi still works but every fresh CUDA init fails. The router detects this, fails traffic over to Spark, and a recover-cuda.sh script reloads the NVIDIA UVM module and brings everything back up automatically. The whole sequence is documented step-by-step in the runbook.
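The script itself is not reproduced here; the documented fix boils down to reloading the UVM module and restarting the serving process. A rough Python rendition of that sequence for illustration (the real recover-cuda.sh is a shell script, and the systemd unit name below is a placeholder, not the real one):

import subprocess

def recover_cuda() -> None:
    # Stop the GPU consumer so the kernel module can actually unload.
    subprocess.run(["systemctl", "--user", "stop", "vllm-vision.service"], check=False)
    # Reload the NVIDIA UVM module to clear the poisoned CUDA-init state.
    subprocess.run(["sudo", "modprobe", "-r", "nvidia_uvm"], check=True)
    subprocess.run(["sudo", "modprobe", "nvidia_uvm"], check=True)
    # Bring the model server back up.
    subprocess.run(["systemctl", "--user", "start", "vllm-vision.service"], check=False)

if __name__ == "__main__":
    recover_cuda()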

vLLM launch (excerpt)

# qwen2.5-vl-7b-router-awq on ubuntu-4090
vllm serve Qwen/Qwen2.5-VL-7B-Instruct-AWQ \
  --served-model-name qwen2.5-vl-7b-router-awq \
  --quantization awq_marlin \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.78 \
  --enable-prefix-caching \
  --host 127.0.0.1

The Hands

ubuntu-6000

Heavy code generation, refactoring, agentic / multi-file work.

homelab/code · homelab/agent

Hardware

GPU: RTX PRO 6000 Blackwell
VRAM: 96 GB
Bus: Thunderbolt eGPU
OS: Ubuntu

Production model

Model: GLM-4.5-Air-AWQ
Served as: glm-4.5-air-awq
Runtime: vLLM (OpenAI-compatible)

What it does

  • Code generation
  • Refactoring
  • Multi-file edits
  • Agentic tool-use work
  • Cursor and OpenCode are routed here by default
  • Longer-context utility work

Optimizations

  • AWQ W4A16 quantization using compressed tensors
  • TurboQuant KV cache in k8v4 mode — 8-bit keys, 4-bit values
  • ≈ 2.6× KV compression vs FP16 with negligible quality cost
  • ≈ 14% more KV cache slots than the FP8 baseline at the same context budget
  • Per-layer aware: 42 quantized layers use TurboQuant; 4 boundary layers auto-skip to FLASH_ATTN for quality safety
  • 128K-token context window
  • Auto-start as a user-level systemd unit with --profile long

Hard rules

  • No cross-GPU tensor parallelism with the 4090 — Thunderbolt latency would kill it
  • No model swapping — treat it as a single resident endpoint
  • No CPU offload — 96 GB is plenty of headroom and any swapping would gut throughput

Reliability story — why TurboQuant beat FP8

The earlier FP8 KV configuration with TRITON_ATTN was faster per token on paper, but consistently crashed the AWQ Marlin kernel mid-load with `CUDA error: unspecified launch failure`. TurboQuant k8v4 was slower in raw decode but completed every load on this Blackwell + driver combination. Reliability won. The configuration choice is documented as “the only one that survived the sustained bench.” MTP (multi-token speculative decoding) is intentionally off: the current AWQ checkpoint drops the speculative head’s weights, so draft acceptance is 0%. It will be re-enabled when the checkpoint is fixed upstream.

vLLM launch (excerpt)

# glm-4.5-air-awq on ubuntu-6000 (RTX PRO 6000 Blackwell)
vllm serve cpatonn/GLM-4.5-Air-AWQ \
  --served-model-name glm-4.5-air-awq \
  --quantization compressed-tensors \
  --kv-cache-dtype turboquant_k8v4 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --speculative-config '{"method":"none"}' \
  --host 127.0.0.1

The Brain

Spark

Senior reasoning, architecture, final review, safety fallback.

homelab/reasoning · homelab/review

Hardware

GPU: NVIDIA DGX Spark
VRAM: Unified CPU + GPU memory
Bus: GB10 ARM SoC, attached via private overlay
OS: Linux

Production model

Model: Qwen3.6-35B-A3B
Served as: qwen3.6-35b-a3b-fp8
Runtime: vLLM (OpenAI-compatible)

What it does

  • Deep reasoning
  • Architecture review
  • Planning
  • Hard debugging
  • Final review and risk analysis
  • Safety fallback for every other route

What it does not do

  • Spark never falls back — it is the bottom of the chain
  • If Spark is down, the router returns a 503, period
  • No silent downgrade to a less-capable model on a safety-critical decision

Optimizations

  • FP8 weights — throughput and KV-cache efficiency on GB10’s unified memory
  • vLLM with prefix caching
  • Chunked prefill
  • YaRN for long-context overrides
  • Open WebUI front-end pinned to its own data directory so other UIs do not collide
  • Treated by the router as a network-attached upstream over the private overlay

Hard rules

  • Spark never falls back to another model
  • If Spark is down, the answer is 503 — never a quiet downgrade
  • The router does not try to start, stop, or recover Spark — the operator gets paged

vLLM launch (excerpt)

# qwen3.6-35b-a3b-fp8 on Spark (DGX, GB10)
vllm serve Qwen/Qwen3.6-35B-A3B-FP8 \
  --served-model-name qwen3.6-35b-a3b-fp8 \
  --dtype auto \
  --kv-cache-dtype fp8 \
  --max-model-len 131072 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --rope-scaling '{"type":"yarn","factor":4.0}' \
  --host 127.0.0.1

The Mouth

win-5090

General chat, rewrites, medium-context summarization, spillover.

homelab/chat · homelab/utility · homelab/gemma

Hardware

GPU: RTX 5090
VRAM: 32 GB
Bus: Internal PCIe
OS: Windows 11

Production model

Model: Gemma 4 31B
Served as: gemma-4-31b
Runtime: LM Studio (OpenAI-compatible)

What it does

  • General chat
  • Light Q&A
  • Rewrites, format cleanup, paraphrasing, translation
  • Medium-context summarization (≈ 1K–8K tokens)
  • Text-only spillover when ubuntu-4090 is saturated
  • Code spillover when ubuntu-6000 is busy

What it does not do

  • No image work — Gemma cannot see, so the router never sends image-bearing requests here
  • Images always go to ubuntu-4090 or, during an outage, to Spark
  • No final architecture review or deep reasoning — that stays on Spark
  • No camera safety decisions

Optimizations

  • vLLM-style serving via LM Studio with three concurrent slots
  • Router concurrency cap matched exactly to the upstream slot count
  • Callers beyond the cap get a `busy` response and are sent to the fallback instead of queueing forever
  • gpu-memory-utilization set to 0.86
  • 32K context
  • Batched-token budget tuned to the 32 GB card
  • Inbound traffic pinned to the private overlay network at the Windows Firewall layer

Hard rules

  • SSH, the LLM endpoint, and the Netdata UI are unreachable from public Wi-Fi by design
  • Never the image path — the router carries a has_image flag through preflight so this rule wins per request

LM Studio config (excerpt)

# gemma-4-31b on win-5090 (LM Studio, OpenAI-compatible)
{
  "served_model_name": "gemma-4-31b",
  "context_length": 32768,
  "gpu_memory_utilization": 0.86,
  "max_concurrent_slots": 3,
  "bind": "private-overlay-only",
  "firewall": "block all non-overlay inbound"
}

The router

homelab-router — the keystone

A small FastAPI service that exposes one OpenAI-compatible endpoint on the loopback interface. Every client speaks to it; none of them know about the four upstream models individually.

Design principles

Deterministic, not LLM-based

Routing decisions are made by a keyword classifier scoring the prompt against a YAML policy file, plus three override mechanisms — request header, pinned model name, metadata hint.

A control plane, not a fourth model

The router never generates text itself. Every decision is auditable and reproducible.

Safety-first tie-breaks

When the classifier scores multiple buckets: vision > coding > reasoning > chat. 'Review the security of this code' lands on Spark, not on the chat model.

Image presence is always honored

Any request carrying an image_url part goes to ubuntu-4090 regardless of keywords — nothing else in the rig can see.
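Put together, the deterministic core is small enough to sketch in a few lines. This is an illustration, not the production router: the bucket keywords mirror the policy table further down, while the scoring details and function shape are assumptions.

# Illustrative sketch of the deterministic classifier (not the real router code).
BUCKETS = {
    "vision":       {"keywords": ["image", "screenshot", "ring", "frigate", "motion"], "target": "ubuntu-4090"},
    "coding":       {"keywords": ["refactor", "implement", "code", "fastapi", "pytest"], "target": "ubuntu-6000"},
    "reasoning":    {"keywords": ["architecture", "trade-off", "review", "risk", "plan"], "target": "spark"},
    "general_chat": {"keywords": ["tell me about", "what is", "how do i"], "target": "win-5090"},
}
TIE_BREAK = ["vision", "coding", "reasoning", "general_chat"]  # safety-first priority

def route(prompt: str, has_image: bool) -> str:
    if has_image:
        return "ubuntu-4090"            # image presence always wins; nothing else can see
    text = prompt.lower()
    scores = {name: sum(kw in text for kw in cfg["keywords"]) for name, cfg in BUCKETS.items()}
    best = max(scores.values())
    if best == 0:
        return "spark"                  # nothing matched: route to the senior model for safety
    for bucket in TIE_BREAK:            # break ties by safety priority, not alphabetically
        if scores[bucket] == best:
            return BUCKETS[bucket]["target"]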

Camera tasks never silently say 'ignore'

Schema-validation failure on a camera event produces a synthetic escalate_to_spark envelope, never a suppression. The router fails closed.

Hot-reloadable policy

Routing rules live in routing_policy.yaml and can be reloaded over an admin endpoint without restarting the gateway.

Loopback endpoints

Method  Path                    Purpose
POST    /v1/chat/completions    OpenAI Chat Completions shim — every client speaks here
GET     /v1/models              Friendly model catalog: homelab/auto, homelab/code, homelab/vision, homelab/agent, …
POST    /v1/embeddings          Embeddings shim for downstream RAG and vector stores
GET     /metrics                Prometheus exposition — counters, histograms, gauges
GET     /stats                  JSON snapshot of the same counters for humans
GET     /admin/upstreams        Per-route health, load, saturation, drained flag
POST    /admin/drain/<route>    Drain a route without restarting — explicit pins still pass through
POST    /admin/reload-config    Hot-reload routing_policy.yaml
POST    /admin/recover-cuda     Authenticated one-button CUDA recovery

All routes are bound to the loopback interface on the workstation. No public surface, no Wi-Fi exposure.
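Operating the gateway from the workstation is plain HTTP against those paths. A hedged example — the port, the auth header, and the exact route names are assumptions; only the paths above are given on this page:

import requests

BASE = "http://127.0.0.1:8000"
HEADERS = {"Authorization": "Bearer <admin-token>"}   # hypothetical auth scheme

# Take the coding route out of auto-routing before maintenance; explicit pins still pass.
requests.post(f"{BASE}/admin/drain/code", headers=HEADERS, timeout=5)

# Push an edited routing_policy.yaml without restarting the gateway.
requests.post(f"{BASE}/admin/reload-config", headers=HEADERS, timeout=5)

# Inspect per-route health, load, saturation, and drained flags.
print(requests.get(f"{BASE}/admin/upstreams", headers=HEADERS, timeout=5).json())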

Operational ergonomics

Drain a route live

POST /admin/drain/<route> redirects auto-routed traffic but lets explicit homelab/<route> pins pass through. No restart.

Hot-reload policy

POST /admin/reload-config swaps routing_policy.yaml in place. The keyword classifier picks up the new buckets immediately.

Watchdog every minute

A one-minute systemd timer re-probes every upstream, restarts containers whose /v1/models is down, and triggers CUDA recovery on context failure.

Structured logs

logs/router-decisions.jsonl       # bucket, route, reason, latency, fallback path
logs/upstream-failures.jsonl      # 5xx, timeouts, CUDA poisoning events
logs/recovery-events.jsonl        # CUDA reloads, container restarts
logs/structured-validation.jsonl  # schema failures on camera events

Routing policy

Capability buckets

The router classifies each request into a bucket, then maps the bucket to a machine. The mapping lives in routing_policy.yaml and can be hot-reloaded.

Bucket            Default target                       Example prompts
vision            ubuntu-4090                          image, screenshot, ring, frigate, motion
notification      win-5090, or ubuntu-4090 if image    sms, alert, push notification
fast              ubuntu-4090                          classify, json, tags, label, summarize briefly
general_chat      win-5090                             tell me about, ask, what is, how do I, opinion
utility           win-5090                             rewrite, clean up, format, paraphrase, polish, translate
medium_summary    win-5090                             summarize, tl;dr, condense, recap, executive summary
coding            ubuntu-6000                          refactor, implement, code, FastAPI, pytest
agentic           ubuntu-6000                          agent, multi-step, tool use, OpenCode, Hermes
reasoning         Spark                                architecture, trade-off, review, risk, plan
unknown           Spark                                nothing matched — route to senior model for safety

Tie-breaks: vision > coding > reasoning > chat. Image-bearing requests always route to ubuntu-4090 regardless of keywords.

Size promotions

When prompts get big, the bucket promotes up

If a prompt is large, the router promotes it to a heavier model even if the bucket is cheap.

Bucket                      ≤ 8K tokens     8K–24K          > 24K
vision                      ubuntu-4090     ubuntu-4090     ubuntu-4090
coding / agentic            ubuntu-6000     ubuntu-6000     ubuntu-6000
reasoning / review          Spark           Spark           Spark
fast                        ubuntu-4090     ubuntu-6000     Spark
chat / utility / summary    win-5090        win-5090        Spark

Size promotions are applied after the bucket is chosen. A large chat prompt promotes to Spark; vision and coding stay pinned to their dedicated machines regardless of size.
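In code form the promotion is a single post-classification step. A sketch using the thresholds from the table above (the function shape and names are illustrative):

def promote_for_size(bucket: str, target: str, prompt_tokens: int) -> str:
    # Vision, coding/agentic, and reasoning stay pinned to their machines regardless of size.
    if bucket in ("vision", "coding", "agentic", "reasoning"):
        return target
    if prompt_tokens > 24_000:
        return "spark"                  # very large prompts promote to the senior model
    if bucket == "fast" and prompt_tokens > 8_000:
        return "ubuntu-6000"            # mid-size "fast" work moves to the bigger card
    return target                       # chat / utility / summary stay on win-5090 up to 24K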

Fallback chain

Selected · busy · outage · everything down

The asymmetric ubuntu-4090 fallback is the prettiest part — text spills to win-5090 for a fast local hop, but anything carrying an image cascades straight to Spark because win-5090 is blind.

Fallback behavior per route:

  • ubuntu-4090 (THE EYES): busy → text spills to win-5090, images go to Spark · outage → failover to Spark, cameras get a synthetic envelope · everything down → 503
  • ubuntu-6000 (THE HANDS): busy → spill to win-5090 · outage → failover to Spark · everything down → 503
  • win-5090 (THE MOUTH): busy → spill to Spark · outage → failover to Spark · everything down → 503
  • Spark (THE BRAIN): no spill — terminal · outage → 503 · the senior model never falls back

The asymmetric ubuntu-4090 fallback

On a saturated 4090, text-only requests spill to win-5090 (fast local hop). Image-bearing requests cascade to Spark — because win-5090 is blind to images. The router carries a `has_image` flag so the right rule wins per request.

Spark is the bottom

The senior model never falls back. If Spark itself is down, the answer is 503 — never a quiet downgrade to a smaller model on a safety-critical decision.
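As a sketch, the per-request decision for a struggling ubuntu-4090 reduces to a few branches. Status names and the function shape are illustrative; the chain itself is the one in the diagram above.

def eyes_fallback(status: str, has_image: bool) -> str:
    # Fallback for ubuntu-4090, "The Eyes".
    if status == "busy":
        return "spark" if has_image else "win-5090"   # win-5090 is blind to images
    if status == "outage":
        return "spark"                                 # failover; camera events may get a synthetic envelope
    return "503"                                       # everything down: fail loudly, never downgrade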

Sample request flow

Same prompt. Two timelines.

The left timeline is the normal path. The right one is what happens when ubuntu-6000 is saturated and the router spills to win-5090 instead of queueing forever.

Normal request

Cursor → router → ubuntu-6000

  1. Cursor: POST /v1/chat/completions · model: "homelab/auto"
  2. homelab-router: preflight, has_image flag set false
  3. classifier: bucket "coding" (keywords: refactor, FastAPI, pytest)
  4. ubuntu-6000: glm-4.5-air-awq · turboquant k8v4 · 128K context
  5. 200 OK: streamed reply · route=ubuntu-6000 · reason=bucket_match

Saturated request

Cursor → router → ubuntu-6000 busy → spill to win-5090

  1. Cursor: POST /v1/chat/completions · model: "homelab/auto"
  2. homelab-router: preflight, has_image flag set false
  3. classifier: bucket "coding" (keywords: refactor, FastAPI, pytest)
  4. ubuntu-6000 busy: concurrency cap reached, spill to win-5090
  5. win-5090: gemma-4-31b · LM Studio · 32K context
  6. 200 OK: streamed reply · route=win-5090 · reason=ubuntu-6000_saturated

Home automation + cameras

Two-stage vision triage

A Ring or Frigate event lands on a sidecar. The sidecar asks ubuntu-4090 whether the image is worth deeper analysis. Confident negatives drop locally. Positives and ambiguous events escalate to Spark for a detailed description.

Flow: Ring / Frigate motion event → vision-notifier sidecar → ubuntu-4090 fast classifier → confident negatives drop locally, positives and ambiguous events go to Spark for a detailed description → Home Assistant as orchestrator. Notifications fan out on three channels: a rich primary (image + AI description), a reliable secondary (text-only push), and an archival timeline (persistent record). One channel failing never silences the others.

Two-stage triage. The fast classifier filters confident negatives locally; everything ambiguous escalates to Spark for a detailed description. If Spark is unreachable, the sidecar builds a synthetic envelope from the classifier output and still reaches Home Assistant — a reasoning-model outage degrades description quality but never silences the alert.
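A condensed sketch of that sidecar logic, assuming the gateway on the loopback port used earlier and a simple {"escalate": bool, ...} classifier schema — both assumptions, this is not the production sidecar:

import json
import requests

ROUTER = "http://127.0.0.1:8000/v1/chat/completions"

def triage(image_url: str, camera: str) -> dict:
    # Stage 1: fast classifier on ubuntu-4090 returns structured JSON.
    fast = requests.post(ROUTER, json={
        "model": "homelab/vision",
        "messages": [{"role": "user", "content": [
            {"type": "text", "text": "Is this motion event worth escalating? Answer as JSON."},
            {"type": "image_url", "image_url": {"url": image_url}},
        ]}],
    }, timeout=10).json()

    try:
        verdict = json.loads(fast["choices"][0]["message"]["content"])
    except (KeyError, json.JSONDecodeError):
        verdict = {"escalate": True, "reason": "schema_validation_failed"}   # fail closed, never suppress

    if not verdict.get("escalate", True):
        return {"action": "drop_locally", "camera": camera}   # confident negative: no notification

    # Stage 2: positives and ambiguous events go to Spark for a detailed description.
    try:
        deep = requests.post(ROUTER, json={
            "model": "homelab/reasoning",
            "messages": [{"role": "user", "content": f"Describe this camera event in detail: {verdict}"}],
        }, timeout=30).json()
        description = deep["choices"][0]["message"]["content"]
    except requests.RequestException:
        # Spark unreachable: a synthetic envelope from the classifier output still reaches HA.
        description = f"[degraded description] {verdict}"

    return {"action": "notify", "camera": camera, "description": description}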

Defensive monkey-patch on Ring listener

Malformed cloud payloads no longer kill the listener thread.

Parallel push via MQTT

An independent push source that does not depend on the cloud-side listener.

Triple triggers + two-layer dedup

Real-time push, parallel push, and polled fallback — deduped so they never fan out into three notifications.

DNS pinned to public resolvers

Inside the affected containers, to bypass an intermittent local DNS issue that broke Ring REST calls.

Observability

Netdata, MQTT, Home Assistant

Every machine runs Netdata. The Ubuntu workstation is the parent; Spark and win-5090 stream metrics into it over the private overlay. A small MQTT bridge polls Netdata every 15 seconds and publishes curated metrics as auto-discovered Home Assistant sensors.
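The bridge itself is small. A rough sketch of its loop — Netdata's allmetrics endpoint and Home Assistant MQTT discovery are real interfaces, but the hostnames, chart names, and topic layout below are assumptions:

import json
import time
import requests
import paho.mqtt.publish as publish

NETDATA = "http://127.0.0.1:19999"    # Netdata parent; 19999 is Netdata's default port
BROKER = "homeassistant.local"        # MQTT broker location assumed

def publish_sensor(host: str, name: str, value: float, unit: str) -> None:
    uid = f"homelab_{host}_{name}"
    # Retained discovery config so Home Assistant auto-creates the sensor.
    publish.single(f"homeassistant/sensor/{uid}/config", json.dumps({
        "name": f"{host} {name}", "unique_id": uid,
        "state_topic": f"homelab/{host}/{name}", "unit_of_measurement": unit,
    }), hostname=BROKER, retain=True)
    publish.single(f"homelab/{host}/{name}", str(value), hostname=BROKER)

while True:
    metrics = requests.get(f"{NETDATA}/api/v1/allmetrics?format=json", timeout=5).json()
    cpu_user = metrics["system.cpu"]["dimensions"]["user"]["value"]   # chart/dimension names assumed
    publish_sensor("workstation", "cpu_user_pct", cpu_user, "%")
    time.sleep(15)                                                    # the 15-second poll cadence from above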

Metrics flow: the Spark and win-5090 Netdata agents stream over the private overlay into the Netdata parent on the Ubuntu workstation (ubuntu-6000 and the 4090 run local agents on the same host as the parent). The MQTT bridge polls the parent every 15 s and publishes curated metrics per host via MQTT discovery; Home Assistant renders them on a System Health dashboard (Overview, GPUs, Storage) that is visible on a phone. Per-host metrics: CPU, RAM, load, disk, plus per-GPU utilization, memory, temperature, and power.

Sensor probing

The bridge probes each candidate sensor at discovery time and skips ones the hardware does not actually expose — so dashboards never show ghost charts.

Unified-memory Spark

Spark has unified CPU + GPU memory, so its VRAM chart is intentionally absent. The dashboard tracks Spark GPU memory pressure via the unified-RAM sensor instead.

[Image: Netdata parent — system metrics across the cluster, top nodes by CPU and RAM]

[Image: Home Assistant System Health — per-GPU utilization, power draw, temperature]

Automation

Hermes runs the lab

Hermes is two things at once: a normal client of homelab-router for inference, and an operator of it via the admin API. It schedules health checks, runs incident-response playbooks as skills, and uses Conductor sub-agents to do parallel work without polluting the main loop. It reads ubuntu-6000 and Spark through Netdata + Home Assistant, and alerts on Telegram first with HA as the backup.

Hermes Workspace (a PWA command center with chat, terminal, memory, and skills) is the operator UI for the Hermes Agent, a main loop on the workstation acting as an OpenAI-compatible client with cron, skills, memory, and Conductor. Both its LLM calls and its admin API calls go to homelab-router (OpenAI-compatible plus /admin endpoints). It reads the Netdata parent (metric streams) and Home Assistant (state and alerts) as read-only signals, and delegate_task spawns sub-agents for log triage, A/B prompt eval, and batch eval, each of which calls the router itself. Hermes is both a client of the router and an operator of it.

Scheduled health checks

Cron polls /admin/upstreams and watches GPU temps. no_agent mode runs pure script-output watchdogs without spending tokens.

"every 5m" → poll /admin/upstreams

Runbooks as skills

drain-route, recover-cuda, reload-policy become first-class skills the agent can invoke and improve over time.

skill: recover-cuda

Conductor sub-agents

delegate_task spawns isolated children with their own context and toolset. Parallel log analysis and batch evals stay out of the main loop.

delegate_task → 4× parallel

Persistent memory

Every decision and outcome lives in editable markdown under ~/.hermes/. Honcho builds an evolving model of the operator over time.

~/.hermes/memory/

Multi-platform alerts

Telegram first; Home Assistant fallback when Hermes itself is down. The lab is never silent on a real failure.

Self-improvement

Skills get better with use. Runbooks evolve as the lab evolves. The agent and the lab compound on each other.

Scheduled work — examples

every 5m         → GET /admin/upstreams → flag any not-healthy  # no_agent=True
*/15 * * * *     → scrape /metrics, alert if router p95 > 4s
0 9 * * 1        → weekly: summarize last week's routing-decisions.jsonl
every 2h         → verify camera pipeline end-to-end on a known sample
30m              → one-shot: re-check after an incident-response drain

The scheduler accepts relative delays (30m), intervals (every 2h), cron expressions, and ISO one-shots. Anything labeled no_agent=True runs as a pure script-output watchdog and never spends LLM tokens.
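The no_agent checks are plain scripts whose output Hermes simply forwards. A minimal sketch of the "every 5m" upstream check — the port and the shape of the /admin/upstreams response are assumptions:

import requests

def check_upstreams() -> str:
    upstreams = requests.get("http://127.0.0.1:8000/admin/upstreams", timeout=5).json()
    # Assumed response shape: {"code": {"healthy": true, ...}, "vision": {...}, ...}
    unhealthy = [name for name, info in upstreams.items() if not info.get("healthy", False)]
    if unhealthy:
        return "ALERT: unhealthy routes: " + ", ".join(unhealthy)   # relayed to Telegram, HA as backup
    return "OK: all routes healthy"

if __name__ == "__main__":
    print(check_upstreams())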

Why it fits

Open-source, OpenAI-compatible, local-LLM friendly

Hermes is MIT-licensed (Nous Research) and speaks the OpenAI Chat Completions schema natively, so it points at homelab-router with no adapters. It already supports Ollama, vLLM, and LM Studio — exactly what is running on the four machines.

Where Hermes never overrides

The router still owns routing

Hermes can drain a route, reload policy, or trigger CUDA recovery, but it never picks which model answers a request. If Spark is down, Hermes still gets a 503 like every other client. The deterministic policy is the only thing that decides where requests go.

Design principles

The rules the lab actually lives by

These are the difference between “four GPUs in a closet” and “a system.”

Every machine has one job

ubuntu-4090 sees. ubuntu-6000 codes. Spark thinks. win-5090 chats. No co-hosting. No swapping.

One endpoint, no client-side routing

Every client speaks to the gateway. The gateway hides which machine is busy, down, or saturated.

Deterministic routing, never LLM-based

Every routing decision is made by a keyword classifier scoring against a YAML policy. No model gets to decide where requests go.

Safety routes fail closed

Camera events never silently produce 'ignore'. Reasoning ambiguity always lands on Spark.

The senior model never falls back

Spark is the bottom of the stack. If it is down, the answer is 503 — not 'we sent it to a smaller model and hoped.'

Reliability beats raw speed

FP8 KV on ubuntu-6000, MTP on GLM, and Gemma on ubuntu-4090 were faster on paper. All three were rejected because they were slower under real failure conditions.

Recovery is one command

Driver poisoning, eGPU PCIe link-down, and container hangs all have documented one-line fixes. Most are exposed as authenticated admin API calls.

Nothing private leaves the house

Every model runs locally. The only external dependencies are model weights at install time and the private overlay network. Ring cloud is an input, never a destination.

Operate. Publish. Teach.


See also: /stack for the full daily-driver toolkit · /lab for the rest of the workshop.