Every team says "zero downtime." The patterns that deliver it are well-known. The execution is what separates teams.
Customers do not care about your deployment strategy. They care that the site does not flicker when you ship. The patterns that achieve this are old; the executions that fail are predictable. Here are the four that work and the gotchas that kill them.
Rolling deploys
The default in Kubernetes and ECS. Replace instances in small batches, wait for health checks to pass, move on. Works for most stateless services. Failure mode: a bad build that still passes its health checks keeps rolling out, old and new versions serve traffic side by side, and you have to roll back manually. Mitigation: health checks that exercise real dependencies, plus an automatic rollback when the error rate spikes.
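The rollback trigger can be as simple as comparing the error rate during the rollout against the rate just before it. A minimal sketch of that decision, not a drop-in controller: the Window type, the should_roll_back name, and every threshold below are illustrative assumptions, and the request counts would come from whatever metrics system you already run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Request and error counts over one observation window (hypothetical type)."""
    requests: int
    errors: int


def should_roll_back(
    baseline: Window,
    current: Window,
    *,
    max_ratio: float = 2.0,    # assumed: tolerate up to 2x the baseline error rate
    floor: float = 0.01,       # assumed: never trigger below a 1% error rate
    min_requests: int = 100,   # assumed: ignore windows with too little traffic
) -> bool:
    """Return True when the current error rate is a spike relative to baseline."""
    if current.requests < min_requests:
        return False  # not enough data to judge; keep watching
    current_rate = current.errors / current.requests
    baseline_rate = baseline.errors / baseline.requests if baseline.requests else 0.0
    threshold = max(floor, baseline_rate * max_ratio)
    return current_rate > threshold
```

The floor keeps a near-zero baseline from turning one stray error into a rollback, and the minimum request count keeps a quiet window from deciding anything at all.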
Blue/green
Run two identical environments. Deploy to the inactive one, then switch the load balancer over. Failure modes: long-running requests cut off on the old environment, sticky sessions that break at the switch, and a shared database whose schema has to work for both versions. Mitigation: drain connections gracefully, keep session state out of the instances, version your APIs, and keep migrations backwards-compatible.
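Graceful draining comes down to two rules: stop accepting new work, then wait for in-flight work to finish, with a deadline. A minimal in-process sketch of that idea follows. The Drainer class and its method names are hypothetical, and in practice the load balancer's own deregistration delay does part of this job for you.

```python
import threading


class Drainer:
    """Tracks in-flight requests so an instance can shut down cleanly (hypothetical helper)."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._in_flight = 0
        self._draining = False

    def try_begin(self) -> bool:
        """Call at request start. Returns False once draining has begun."""
        with self._cond:
            if self._draining:
                return False
            self._in_flight += 1
            return True

    def end(self) -> None:
        """Call when a request finishes, whether it succeeded or failed."""
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def drain(self, timeout: float) -> bool:
        """Refuse new requests and wait for in-flight ones.

        Returns True if everything finished before the timeout.
        """
        with self._cond:
            self._draining = True
            return self._cond.wait_for(lambda: self._in_flight == 0, timeout=timeout)
```

A False return from drain means requests were still running at the deadline, which is the signal to either extend the window or accept that those requests will be cut off.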
Canary
Route a small percentage of traffic to the new version, watch the metrics, ramp up gradually. Most sophisticated of the four. Failure mode: the canary subset is not representative (e.g., all from one region). Mitigation: weighted random routing across all traffic, not header-based selection.
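Weighted random routing means every request, whatever its region or headers, has the same chance of landing on the canary. A minimal sketch of the per-request choice, assuming a hypothetical pick_version function and a weight you raise step by step during the ramp; a real setup would do this in the load balancer or mesh rather than in application code.

```python
import random


def pick_version(canary_weight: float, rng: random.Random) -> str:
    """Choose a backend for one request.

    canary_weight is the fraction of traffic (0.0 to 1.0) sent to the canary.
    The choice ignores request attributes, so the canary sample mirrors
    overall traffic instead of one region or one client type.
    """
    if not 0.0 <= canary_weight <= 1.0:
        raise ValueError("canary_weight must be between 0.0 and 1.0")
    return "canary" if rng.random() < canary_weight else "stable"
```

The trade-off is that a single user can bounce between versions from one request to the next; if that matters for the change you are shipping, a feature flag is the better tool.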
Feature flags
Deploy the code dark, enable it for a small cohort, ramp up by user attribute. The gold standard for customer-facing changes. Failure mode: flag debt, meaning flags that should have been removed years ago. Mitigation: a flag lifecycle policy with a hard expiration date for every flag.
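Both halves of this, the gradual rollout and the expiration date, fit in a few lines. The sketch below is illustrative only and does not reflect the API of any flag service: the Flag type, bucket, is_enabled, and expired_flags are hypothetical names, and hashing the user ID into a percentage bucket is an assumed stand-in for whatever user attribute you ramp on.

```python
import hashlib
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Flag:
    """A feature flag that cannot be created without an expiration date."""
    name: str
    rollout_percent: int  # 0 to 100
    expires: date


def bucket(flag_name: str, user_id: str) -> int:
    """Map a user to a stable bucket in 0..99, independent per flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100


def is_enabled(flag: Flag, user_id: str) -> bool:
    """A user stays enabled as the percentage ramps up, because buckets never change."""
    return bucket(flag.name, user_id) < flag.rollout_percent


def expired_flags(flags: list[Flag], today: date) -> list[str]:
    """Names of flags past their expiration date, e.g. to fail a CI check."""
    return [f.name for f in flags if f.expires < today]
```

Running expired_flags in CI and failing the build on a non-empty result is one way to make the expiration date hard rather than advisory.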
Database migrations: the real hard part
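The deployment patterns above assume your code can run against both old and new schemas during the deploy window. Backwards-compatible migrations are the actual discipline: never drop a column in the same release that stops writing to it. Split each change into two phases instead: an expand release that adds the new schema while the old one keeps working, and a later contract release that removes the old schema once nothing reads or writes it.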
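Here is what the two phases look like for a column rename, sketched against SQLite so it runs anywhere. The users table, the full_name and display_name columns, and the function names are all made up for illustration. The contract step rebuilds the table only to stay portable across SQLite versions; on most production databases it would be a plain DROP COLUMN.

```python
import sqlite3


def expand(conn: sqlite3.Connection) -> None:
    """Release N: add the new column and backfill it. Old code keeps working."""
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
    conn.execute("UPDATE users SET display_name = full_name WHERE display_name IS NULL")
    conn.commit()


def write_user_dual(conn: sqlite3.Connection, user_id: int, name: str) -> None:
    """During the deploy window, new code writes both columns so either version can read."""
    conn.execute(
        "INSERT INTO users (id, full_name, display_name) VALUES (?, ?, ?)",
        (user_id, name, name),
    )
    conn.commit()


def contract(conn: sqlite3.Connection) -> None:
    """A later release: drop the old column once nothing reads or writes it."""
    conn.executescript(
        """
        CREATE TABLE users_new (id INTEGER PRIMARY KEY, display_name TEXT);
        INSERT INTO users_new (id, display_name) SELECT id, display_name FROM users;
        DROP TABLE users;
        ALTER TABLE users_new RENAME TO users;
        """
    )
```

Between expand and contract, both the old query (SELECT full_name) and the new one (SELECT display_name) succeed, which is exactly the property a rolling, blue/green, or canary deploy depends on.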
What we ship
For client production environments we default to canary on Kubernetes (Argo Rollouts) or weighted target groups on ALB. Feature flags via Unleash or LaunchDarkly. Migrations follow the expand-contract pattern. Mean time to detect a bad deploy: under 5 minutes. Rollback time: under 2 minutes.