LLMs are software. Software in production needs SLOs, observability, and a kill switch. So do these.
Two years ago "we use an LLM" was a feature pitch. Today it is plumbing — running entitlement decisions, customer support routing, and content moderation in production at thousands of companies. Production LLM workloads need the same reliability engineering as any other tier-one service. Most teams are figuring this out late.
SLOs for LLM features
Define them per feature, not per model call: "The customer support agent responds with a relevant answer in under 30 seconds, 99% of the time." Latency, accuracy, and availability are all dimensions of that SLO. Monitor each one separately.
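One way to keep those targets enforceable is to define them in code next to the feature. A minimal sketch, where the class name, feature key, and numeric targets are illustrative assumptions, not anything prescribed above:

```python
# A minimal sketch of per-feature SLO definitions; names and targets
# are illustrative placeholders.
from dataclasses import dataclass

@dataclass(frozen=True)
class LlmSlo:
    feature: str
    p99_latency_s: float      # 99th-percentile response time
    min_accuracy: float       # fraction of eval cases judged relevant
    min_availability: float   # fraction of requests that get any answer

SLOS = {
    "support_agent": LlmSlo(
        feature="support_agent",
        p99_latency_s=30.0,       # "under 30 seconds, 99% of the time"
        min_accuracy=0.95,
        min_availability=0.999,
    ),
}

def slo_breached(slo: LlmSlo, p99_latency_s: float,
                 accuracy: float, availability: float) -> bool:
    """True if any measured dimension misses its target."""
    return (p99_latency_s > slo.p99_latency_s
            or accuracy < slo.min_accuracy
            or availability < slo.min_availability)
```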
Eval harnesses are the test suite
Every LLM feature in production needs an eval set: a curated collection of input/expected-output pairs that you run on every model change, prompt change, or pipeline change. When you upgrade GPT versions or swap providers, the eval suite tells you whether quality regressed.
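A harness like this can be a few dozen lines wired into CI. The sketch below assumes a hypothetical `eval_set.jsonl` file and caller-supplied `call_model` and `grade` functions; none of those names come from the post itself:

```python
# A sketch of an eval harness for CI; eval_set.jsonl, call_model(),
# and grade() are hypothetical stand-ins for your own data and client.
import json

def load_eval_set(path: str = "eval_set.jsonl"):
    with open(path) as f:
        return [json.loads(line) for line in f]

def run_evals(call_model, grade, threshold: float = 0.95) -> bool:
    """Run every case; fail the build if the pass rate drops below threshold."""
    cases = load_eval_set()
    passed = sum(grade(call_model(c["input"]), c["expected"]) for c in cases)
    rate = passed / len(cases)
    print(f"eval pass rate: {rate:.1%} over {len(cases)} cases")
    return rate >= threshold
```

`grade` can be an exact-string match, a regex, or an LLM-as-judge call; the harness doesn't care, as long as a pass rate comes out the other end and the build fails when it drops.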
Fallback behavior is mandatory
The model will be down. Rate-limited. Unavailable in your region. Slow today. Your application has to handle every one of those failure modes — typically by routing to a smaller model, a cached response, or graceful degradation. "We assume the LLM works" is not a strategy.
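In code, those failure modes collapse into one chain that tries each tier in order. A sketch, with hypothetical `primary`, `fallback`, and `cache` hooks and placeholder exception types standing in for your provider's actual errors:

```python
# A minimal sketch of a fallback chain. The exception names and the
# primary/fallback/cache hooks are hypothetical placeholders.
class RateLimited(Exception): pass
class ProviderDown(Exception): pass

def answer(prompt: str, primary, fallback, cache) -> str:
    try:
        return primary(prompt, timeout=10)     # normal path: the big model
    except (TimeoutError, RateLimited, ProviderDown):
        pass
    try:
        return fallback(prompt, timeout=5)     # smaller, cheaper model
    except Exception:
        pass
    cached = cache.get(prompt)                 # last-known-good response
    if cached is not None:
        return cached
    # graceful degradation: deterministic, honest, never raises
    return "We can't generate an answer right now. A human will follow up."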
Observability for token streams
Log every prompt, every completion, and the latency of every call. Tag each record with user, session, and feature. When something goes wrong in production, you need to see exactly what the model saw and what it produced. Privacy considerations apply: redact PII before logging.
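A sketch of what such a log record might look like, emitting one JSON line per call. The redaction patterns here are deliberately naive placeholders, not a complete PII solution:

```python
# A sketch of structured prompt/completion logging with naive PII
# redaction; the regexes are illustrative, not exhaustive.
import json, re, sys, time

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

def log_llm_call(feature: str, user_id: str, session_id: str,
                 prompt: str, completion: str, latency_s: float) -> None:
    record = {
        "ts": time.time(),
        "feature": feature,         # tag every record for querying
        "user": user_id,
        "session": session_id,
        "prompt": redact(prompt),   # redact PII before it hits storage
        "completion": redact(completion),
        "latency_s": round(latency_s, 3),
    }
    sys.stdout.write(json.dumps(record) + "\n")  # ship to your log pipeline
```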
Cost as a SLO
LLM workloads have cost variability that classical software does not. A prompt that grows over time, a customer who repeatedly triggers long responses, a regression that produces verbose outputs: all of it is visible only if you track cost per request. Alert on cost spikes the same way you alert on error-rate spikes.
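A sketch of the two pieces involved: a per-request cost computed from token counts, and a rolling-window spike alert. The prices, window, and multiplier are placeholders, not real provider rates:

```python
# A sketch of per-request cost tracking with a spike alert; the rates
# and thresholds are placeholder values.
from collections import deque

PRICE_PER_1K = {"input": 0.01, "output": 0.03}  # placeholder USD per 1K tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1000 * PRICE_PER_1K["input"]
            + output_tokens / 1000 * PRICE_PER_1K["output"])

class CostSpikeAlert:
    """Fire when a request costs several times the recent rolling mean."""
    def __init__(self, window: int = 1000, multiplier: float = 3.0):
        self.costs = deque(maxlen=window)
        self.multiplier = multiplier

    def observe(self, cost: float) -> bool:
        baseline = sum(self.costs) / len(self.costs) if self.costs else None
        self.costs.append(cost)
        return baseline is not None and cost > baseline * self.multiplier
```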
The kill switch
Every LLM feature needs a kill switch: a boolean flag, flippable in seconds, that bypasses the LLM entirely and uses the fallback logic. When the model misbehaves at 2 AM, you flip the switch first and debug second.
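A sketch of the switch in request-handling code, assuming a hypothetical `flags.is_enabled()` hook standing in for whatever feature-flag system you already run:

```python
# A sketch of a kill switch as a feature flag; flags.is_enabled() is a
# hypothetical hook into your existing flag system.
def handle_request(prompt: str, flags, llm_answer, rule_based_answer) -> str:
    # Check the flag on every request, not at startup, so a flip
    # takes effect in seconds without a redeploy.
    if not flags.is_enabled("support_agent_llm"):
        return rule_based_answer(prompt)   # bypass the LLM entirely
    try:
        return llm_answer(prompt)
    except Exception:
        return rule_based_answer(prompt)   # fallback doubles as a safety net
```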
What we ship
For LLM-driven features, we install: an eval harness in CI, structured logging of prompts and completions to a queryable store, cost dashboards per feature, fallback paths to smaller models or rule-based logic, and a kill switch in the feature flag system. Cost: about a sprint's work. Value: every subsequent deploy is safer.