SRE Without an SRE Team: A Pragmatic Starting Point

Most companies cannot hire a dedicated SRE team. Here are the SRE practices that pay off even when run by your existing engineers.

Dec 30, 2025 4 min

SRE is a set of practices, not a team structure. The practices work even at companies with no SREs on staff.

Google developed SRE to manage reliability at a scale few other companies face. Many companies have since tried to hire SREs and found that they cost far more than software engineers and are far harder to recruit. The solution for the rest of us is to adopt the SRE practices without the SRE team.

Step 1: define SLOs

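Pick the three to five user journeys that matter most. For each, define a service-level indicator (e.g., the share of checkouts that complete in under 3 seconds) and a target for it (e.g., 99.5%, measured over a fixed window such as a month). The indicator plus the target is your SLO. Publish it where both engineering and product can see it.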
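
One way to make an SLO concrete is to write it down as data. The sketch below is a minimal illustration, not a standard format: the `SLO` class, its field names, and the 30-day window are assumptions made for the example, and only the checkout threshold and the 99.5% target come from the text above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SLO:
    """Hypothetical record for one SLO; field names are illustrative."""
    journey: str              # user journey the SLO protects
    sli: str                  # human-readable indicator definition
    threshold_seconds: float  # a request is "good" if it finishes under this
    target: float             # fraction of requests that must be good
    window_days: int          # evaluation window (assumed, not from the article)

    def is_met(self, good: int, total: int) -> bool:
        """Return True if the observed good/total ratio meets the target."""
        if total == 0:
            return True  # no traffic in the window, so nothing was violated
        return good / total >= self.target


checkout_slo = SLO(
    journey="checkout",
    sli="checkout completes in under 3 seconds",
    threshold_seconds=3.0,
    target=0.995,
    window_days=30,
)

# "Publish it": serialize the SLO so it can live in a repo or on a wiki page.
published = json.dumps(asdict(checkout_slo), indent=2)
print(published)
```

Keeping the definition in version control means a change to the target goes through the same review as a change to the code.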

Step 2: install observability

You need to be able to answer: "did we hit the SLO last week?" Datadog, Grafana Cloud, Honeycomb, or open-source Prometheus + Grafana will all work. Which tool you choose matters less than having one in place.
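
Whichever tool you pick, the question it has to answer comes down to a ratio over a time window. The sketch below computes that ratio by hand from raw request samples, purely to show the logic. The tuple layout, the function name `weekly_attainment`, and the sample data are assumptions for illustration. In practice the observability tool runs the equivalent query for you.

```python
from datetime import datetime, timedelta, timezone


def weekly_attainment(samples, now, threshold_seconds=3.0):
    """Fraction of requests in the 7 days before `now` that were good.

    `samples` is an iterable of (timestamp, duration_seconds, succeeded)
    tuples. Returns None when there was no traffic in the window.
    """
    window_start = now - timedelta(days=7)
    good = total = 0
    for timestamp, duration, succeeded in samples:
        if not (window_start <= timestamp < now):
            continue  # outside last week's window
        total += 1
        if succeeded and duration < threshold_seconds:
            good += 1
    return good / total if total else None


# Made-up sample data, only to exercise the function.
now = datetime(2025, 12, 29, tzinfo=timezone.utc)
samples = [
    (now - timedelta(days=1), 1.2, True),
    (now - timedelta(days=2), 4.8, True),    # succeeded but too slow
    (now - timedelta(days=3), 0.9, False),   # fast but failed
    (now - timedelta(days=3), 2.1, True),
    (now - timedelta(days=10), 9.9, True),   # older than a week, ignored
]

attainment = weekly_attainment(samples, now)
hit_slo = attainment is not None and attainment >= 0.995
print(f"attainment={attainment}, hit_slo={hit_slo}")
```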

Step 3: error budget conversations

An error budget is the complement of the SLO: the amount of unreliability the target allows. A 99.5% target leaves a 0.5% error budget, which for a time-based availability SLO is roughly 3.6 hours of downtime in a 30-day month (0.5% of 720 hours). When you burn the budget faster than expected, the conversation is "ship safer, slower" until the budget recovers. When you are not burning it, you have permission to take risks.
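
The arithmetic is simple enough to check by hand. The sketch below assumes a time-based availability SLO and a 30-day window. The function names and the burn-rate example numbers are illustrative, not from the article. A burn rate above 1.0 means the budget will run out before the window ends if nothing changes.

```python
def error_budget_hours(slo_target: float, window_days: int = 30) -> float:
    """Hours of downtime the SLO allows over the window."""
    return (1.0 - slo_target) * window_days * 24


def burn_rate(downtime_hours: float, elapsed_days: float,
              slo_target: float, window_days: int = 30) -> float:
    """Budget consumed so far, relative to the share of the window elapsed.

    1.0 means the budget is being spent exactly on pace, above 1.0 means
    faster than planned, below 1.0 means there is room to take risks.
    """
    budget_used = downtime_hours / error_budget_hours(slo_target, window_days)
    window_used = elapsed_days / window_days
    return budget_used / window_used


budget = error_budget_hours(0.995)  # roughly 3.6 hours per 30 days
# Hypothetical situation: 1.8 hours of downtime after 10 of 30 days.
rate = burn_rate(downtime_hours=1.8, elapsed_days=10, slo_target=0.995)
print(f"budget: {budget:.1f} h, burn rate: {rate:.2f}")
```

In this made-up case half the budget is gone after a third of the window, which gives a burn rate of about 1.5 and is the signal to have the "ship safer, slower" conversation.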

Step 4: incident response

Define severity tiers (P0/P1/P2) by user impact. Define response targets for each tier (e.g., P0: ack in 5 min, mitigation in 30 min). Run a post-mortem for every P0 and P1 within 5 business days. The post-mortem is blameless: assume good intent, focus on the system, and identify action items.
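
The tiers and deadlines can be kept as a small table in code so that tooling and people agree on them. In the sketch below only the P0 targets and the 5-business-day post-mortem rule come from the text. The P1 and P2 numbers, the `Tier` class, and the `postmortem_due` helper are hypothetical placeholders, and the business-day count treats Monday to Friday as working days and ignores public holidays.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class Tier:
    """Hypothetical severity tier with its response targets."""
    name: str
    ack_minutes: Optional[int]       # None means no paging target
    mitigate_minutes: Optional[int]  # None means no mitigation target
    postmortem_required: bool


TIERS = {
    "P0": Tier("P0", ack_minutes=5, mitigate_minutes=30,
               postmortem_required=True),    # targets from the article
    "P1": Tier("P1", ack_minutes=15, mitigate_minutes=120,
               postmortem_required=True),    # placeholder targets
    "P2": Tier("P2", ack_minutes=None, mitigate_minutes=None,
               postmortem_required=False),   # placeholder: ticket, no page
}


def postmortem_due(incident_day: date, business_days: int = 5) -> date:
    """Date `business_days` working days after the incident (Mon-Fri only)."""
    day = incident_day
    remaining = business_days
    while remaining > 0:
        day += timedelta(days=1)
        if day.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return day


incident = date(2025, 12, 30)  # a Tuesday
if TIERS["P0"].postmortem_required:
    print("Post-mortem due by", postmortem_due(incident))
```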

Step 5: chaos and game days

Once a quarter, schedule a 90-minute exercise where engineering deliberately breaks something — a database, a region, a deploy — and the team practices recovering. The first one is humbling. The fourth one is routine.

What you avoid by skipping the SRE hire

What you avoid is the "SRE team" antipattern: a separate group that owns reliability while the product team owns features. The boundary becomes a wall. With shared ownership and the practices above, you get most of the SRE benefit without the staffing cost.