DevOps Kubernetes CI/CD

Zero-Downtime Deployments: How We Migrate Production Systems Without Sweating

Ezaan.tech Engineering Team

The Challenge

Moving a production system with active users, real traffic, and real data from a bare-metal monolith to Kubernetes without a single second of downtime sounds impossible. It's not. Here's the exact playbook we use.

Phase 1: Shadow Traffic

Before touching production, we mirror live traffic to the new system. With Nginx or Envoy in front, each mirrored request is sent to both the old and the new backend, while users only ever receive the old system's response. We compare responses, measure latency differences, and surface edge cases under real production traffic, without users ever noticing.

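# Envoy route action with request mirroring (a route setting, not an HTTP filter)
route:
  cluster: legacy-backend            # serves the real response to users
  request_mirror_policies:
  - cluster: new-k8s-backend         # receives a copy; its response is discarded
    runtime_fraction:
      default_value:
        numerator: 10                # mirror 10% of requests
        denominator: HUNDRED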

Start by mirroring 10% of traffic, then raise the numerator in steps to 100% over several days as the new backend proves stable.
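Because Envoy mirrors requests fire-and-forget and discards the mirrored response, the comparison has to happen offline. The sketch below is one way to do that, assuming both backends log each response as a record with `status`, `body`, and `latency_ms`, and that records are already paired by a shared request ID. Those field names, the `VOLATILE_FIELDS` set, and the 50 ms latency budget are illustrative assumptions, not part of our actual tooling.

```python
import json

# Hypothetical fields that legitimately differ between the two systems.
VOLATILE_FIELDS = {"request_id", "timestamp", "served_by"}


def strip_volatile(value):
    """Recursively drop volatile keys from parsed JSON."""
    if isinstance(value, dict):
        return {k: strip_volatile(v) for k, v in value.items()
                if k not in VOLATILE_FIELDS}
    if isinstance(value, list):
        return [strip_volatile(v) for v in value]
    return value


def normalize(body: str):
    """Parse a JSON body and remove volatile fields; fall back to raw text."""
    try:
        return strip_volatile(json.loads(body))
    except ValueError:
        return body


def compare(legacy: dict, candidate: dict, latency_budget_ms: float = 50.0):
    """Return a list of differences between two paired response records."""
    diffs = []
    if legacy["status"] != candidate["status"]:
        diffs.append(f"status: {legacy['status']} != {candidate['status']}")
    if normalize(legacy["body"]) != normalize(candidate["body"]):
        diffs.append("body differs after normalization")
    delta = candidate["latency_ms"] - legacy["latency_ms"]
    if delta > latency_budget_ms:
        diffs.append(f"latency regression: +{delta:.1f} ms")
    return diffs
```

An empty list means the pair matched. Anything else is worth bucketing by route before raising the mirror percentage.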

Phase 2: Database Migration Strategy

The hardest part is always the database. Our approach:

  1. Add columns, never remove them during the transition: old and new code must both be able to read the same schema
  2. Dual writes: new code writes to both the old and new schemas until cutover
  3. Backfill in batches: avoid long-held table locks by processing 1,000 rows at a time and sleeping between batches
  4. Verify before cutting over: compare row counts and checksums between the old and new data
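The backfill and verification steps can be sketched roughly as follows. This is a minimal illustration using Python's built-in `sqlite3` module, not our production tooling: the table names `users_old` and `users_new`, the `id` and `email` columns, and the 0.1 second pause are all assumptions, and `INSERT OR REPLACE` is SQLite-specific upsert syntax.

```python
import hashlib
import sqlite3
import time

BATCH_SIZE = 1000     # rows per batch, matching the list above
PAUSE_SECONDS = 0.1   # assumed pause; tune to the database's headroom


def backfill(conn, batch_size=BATCH_SIZE, pause=PAUSE_SECONDS):
    """Copy users_old into users_new in primary-key order, one short
    transaction per batch. Returns the number of rows copied."""
    last_id = 0
    copied = 0
    while True:
        rows = conn.execute(
            "SELECT id, email FROM users_old WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return copied
        with conn:  # commits the batch, or rolls it back on error
            conn.executemany(
                "INSERT OR REPLACE INTO users_new (id, email) VALUES (?, ?)",
                rows,
            )
        last_id = rows[-1][0]
        copied += len(rows)
        time.sleep(pause)  # give live traffic room between batches


def table_fingerprint(conn, table):
    """Return (row_count, sha256 of rows in id order).
    `table` must be a trusted name; it is interpolated into the SQL."""
    digest = hashlib.sha256()
    count = 0
    for row in conn.execute(f"SELECT id, email FROM {table} ORDER BY id"):
        digest.update(repr(row).encode("utf-8"))
        count += 1
    return count, digest.hexdigest()
```

Paging by primary key rather than `OFFSET` keeps each batch cheap as the table grows, and the upsert makes reruns safe. In a real system the upsert would also need to avoid overwriting a row that a dual write has already updated. Cutover proceeds only when `table_fingerprint` returns the same pair for both tables.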

Phase 3: The Cutover

With shadow traffic stable and the database migrated:

  1. Enable the feature flag that routes 1% of users to the new infrastructure
  2. Monitor error rates, p99 latency, and business metrics for 24 hours
  3. Roll out to 10%, then 50%, then 100%
  4. Keep the old system running for 72 hours after cutover as a rollback option
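One common way to implement the percentage steps above is stable hash bucketing, so that a user admitted at 1% stays on the new infrastructure at 10% and beyond. The sketch below assumes string user IDs. The salt value and function names are hypothetical, and a real feature-flag service may bucket differently.

```python
import hashlib

ROLLOUT_STEPS = [1, 10, 50, 100]  # percentages from the list above


def bucket(user_id: str, salt: str = "k8s-cutover") -> int:
    """Map a user ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def use_new_infrastructure(user_id: str, percent: int) -> bool:
    """True if this user falls inside the current rollout percentage."""
    return bucket(user_id) < percent


def next_percent(current: int, healthy: bool) -> int:
    """Advance one step when metrics are healthy; otherwise roll back to 0."""
    if not healthy:
        return 0
    later = [p for p in ROLLOUT_STEPS if p > current]
    return later[0] if later else current
```

Because the bucket depends only on the salt and the user ID, raising the percentage only ever adds users, and setting it to 0 sends everyone back to the old system.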

What Makes This Work

The key insight: zero downtime is a property of the process, not the technology. Kubernetes, Helm, and ArgoCD give us the tools. The discipline of incremental migration, comprehensive observability, and a clear rollback plan is what makes it safe.

If you're planning a production migration and want a second opinion, reach out.