Build a Zero-Downtime Deployment Pipeline for Microservices on EKS
Design a complete CI/CD pipeline for a microservices application on EKS with canary deployments, automated rollbacks, and zero downtime.
You're the DevOps lead for a 30-service e-commerce platform running on EKS. The team deploys 10 times per day. Last month, a bad deploy took down the payment service for 8 minutes before anyone noticed. The CTO says: 'Fix deployments. Zero downtime. Automated rollback. I don't want to hear about another bad deploy taking us down.'
The Problem
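A plain kubectl apply on a Deployment triggers a rolling update that keeps going as long as new pods pass their readiness probes. Nothing watches real error rates, so if the new image is broken in a way the probes miss, the rollout reaches 100% and users hit errors until someone notices and rolls back manually. The goal is a pipeline that: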
- Automatically builds, scans, and tests every commit
- Releases to production incrementally (5% → 20% → 50% → 100%)
- Watches real error rates and rolls back automatically if they spike
- Never takes the service fully offline
Pipeline Architecture
Developer Push to main
│
▼
GitHub / CodeCommit
│
▼
CodePipeline
├── Stage 1: Source
├── Stage 2: Build (CodeBuild)
│   ├── Unit Tests
│   ├── SAST scan (Semgrep / Checkov)
│   ├── Docker build → push to ECR
│   └── Trivy image vulnerability scan
├── Stage 3: Deploy to Dev (kubectl apply)
├── Stage 4: Integration Tests
├── Stage 5: Deploy to Staging (Blue/Green)
├── Stage 6: Load Tests (k6)
└── Stage 7: Deploy to Prod (Canary via Argo Rollouts)
Step 1: CodeBuild — Build, Scan, Push
# buildspec.yml
version: 0.2
phases:
  pre_build:
    commands:
      # Authenticate Docker to ECR
      - aws ecr get-login-password | docker login --username AWS --password-stdin $ECR_REGISTRY
      # Tag images with the first 8 characters of the commit SHA
      - export IMAGE_TAG=$(echo $CODEBUILD_RESOLVED_SOURCE_VERSION | cut -c 1-8)
  build:
    commands:
      - docker build -t $ECR_REGISTRY/payment-service:$IMAGE_TAG .
      # Fail the build on HIGH or CRITICAL vulnerabilities
      - trivy image --exit-code 1 --severity HIGH,CRITICAL $ECR_REGISTRY/payment-service:$IMAGE_TAG
      # Push in the same phase as the scan: post_build still runs after a failed
      # build phase, so a push placed there would ship an image that failed the scan
      - docker push $ECR_REGISTRY/payment-service:$IMAGE_TAG
  post_build:
    commands:
      - printf '[{"name":"payment-service","imageUri":"%s"}]' $ECR_REGISTRY/payment-service:$IMAGE_TAG > imagedefinitions.json
artifacts:
  files:
    - imagedefinitions.json
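The gate that Trivy applies with --exit-code 1 --severity HIGH,CRITICAL can be sketched in a few lines of Python. This is only an illustration of the decision, assuming a report produced with trivy image --format json whose top level holds a Results list with per-target Vulnerabilities entries. The function names and the sample report are hypothetical.

```python
import json

# Severities that should block the pipeline, matching --severity HIGH,CRITICAL
BLOCKING = {"HIGH", "CRITICAL"}


def blocking_findings(report, severities=BLOCKING):
    """Return (vulnerability id, severity) pairs at a blocking severity."""
    found = []
    for result in report.get("Results") or []:
        # "Vulnerabilities" may be missing or null when a target is clean
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in severities:
                found.append((vuln.get("VulnerabilityID"), vuln.get("Severity")))
    return found


def gate_exit_code(report, severities=BLOCKING):
    """Mirror --exit-code 1: non-zero when any blocking finding exists."""
    return 1 if blocking_findings(report, severities) else 0


# Hypothetical report fragment for illustration, not real scan output
SAMPLE = json.loads("""
{
  "Results": [
    {
      "Target": "payment-service (hypothetical base image)",
      "Vulnerabilities": [
        {"VulnerabilityID": "CVE-0000-0001", "Severity": "MEDIUM"},
        {"VulnerabilityID": "CVE-0000-0002", "Severity": "CRITICAL"}
      ]
    }
  ]
}
""")
```

A MEDIUM finding alone would let the build continue. The CRITICAL entry in the sample is what makes the gate return 1.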
The --exit-code 1 flag makes Trivy exit non-zero when it finds a HIGH or CRITICAL vulnerability, which fails the CodeBuild stage and blocks the pipeline. Never ship a container image you haven’t scanned.
Step 2: Argo Rollouts — Canary Strategy
Install Argo Rollouts in EKS and replace your Deployment with a Rollout:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-service
  namespace: production
spec:
  replicas: 20
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      containers:
        - name: payment-service
          # Placeholder tag: have the pipeline set the commit-SHA tag from Step 1,
          # because a Rollout only starts when the pod template changes
          image: 123456789.dkr.ecr.us-east-1.amazonaws.com/payment-service:latest
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
  strategy:
    canary:
      steps:
        - setWeight: 5            # 5% of traffic to new version
        - pause: {duration: 5m}
        - setWeight: 20
        - pause: {duration: 10m}
        - analysis:               # Auto-check error rate before continuing
            templates:
              - templateName: error-rate-check
        - setWeight: 50
        - pause: {duration: 10m}
        - setWeight: 100
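The Rollout above has no trafficRouting section, so Argo Rollouts can only approximate each setWeight by changing the ratio of new pods to old pods. The following Python sketch shows what the four weights mean for 20 replicas, assuming the controller rounds the canary pod count up. The helper names are hypothetical, and the real controller also applies surge and availability limits that are not modeled here.

```python
def canary_replicas(total_replicas: int, weight: int) -> int:
    """Pods running the new version for a given setWeight, rounded up."""
    if not 0 <= weight <= 100:
        raise ValueError("weight must be between 0 and 100")
    # Integer ceiling of total_replicas * weight / 100, avoiding float rounding
    return (total_replicas * weight + 99) // 100


def canary_plan(total_replicas: int, weights):
    """Pair each setWeight step with its approximate canary pod count."""
    return [(w, canary_replicas(total_replicas, w)) for w in weights]


for weight, pods in canary_plan(20, [5, 20, 50, 100]):
    print(f"setWeight {weight}: {pods} of 20 pods run the new version")
```

With 20 replicas, the 5% step maps to exactly one pod, so a single broken pod is the worst case for the first five minutes. Finer-grained or request-level splitting would need a traffic router such as an ingress controller or service mesh.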
Step 3: AnalysisTemplate — Automatic Rollback
The AnalysisTemplate queries Prometheus every minute. If the error rate is 1% or higher on three measurements, the analysis fails and Argo Rollouts aborts the rollout and shifts traffic back to the stable version, with no human needed. Note that failureLimit is the number of failures tolerated, so the run fails on failure number failureLimit + 1.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-check
  namespace: production
spec:
  metrics:
    - name: error-rate
      interval: 1m
      count: 5                          # Assumed value: bounds the run so the inline analysis step can finish
      provider:
        prometheus:
          address: http://prometheus-server.monitoring:9090
          query: |
            sum(rate(http_requests_total{
              service="payment-service",
              status=~"5.."
            }[5m]))
            /
            sum(rate(http_requests_total{
              service="payment-service"
            }[5m])) * 100
      successCondition: result[0] < 1   # Pass if error rate < 1%
      failureLimit: 2                   # Tolerate 2 failures; the 3rd triggers rollback
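The pass/fail logic of the analysis can be sketched in Python to make the counting explicit. This is a simplified model, assuming a metric fails once its failed measurements exceed failureLimit. The function names are hypothetical and real AnalysisRuns also track errors and inconclusive results. The second helper illustrates a consequence of the query above: it measures all payment-service traffic, so canary errors are diluted by the stable pods.

```python
def evaluate_analysis(error_rates, threshold, failure_limit):
    """Return (phase, measurements_taken) for a series of error rates in percent.

    A measurement succeeds when rate < threshold (successCondition).
    The run fails as soon as failures exceed failure_limit.
    """
    failures = 0
    for taken, rate in enumerate(error_rates, start=1):
        if not rate < threshold:
            failures += 1
        if failures > failure_limit:
            return "Failed", taken
    return "Successful", len(error_rates)


def blended_error_rate(canary_weight_pct, canary_rate, stable_rate):
    """Overall error rate when the canary serves canary_weight_pct of traffic."""
    share = canary_weight_pct / 100
    return share * canary_rate + (1 - share) * stable_rate


# failureLimit: 2 fails the run on the third bad measurement
print(evaluate_analysis([0.2, 3.0, 4.5, 6.0, 0.1], threshold=1.0, failure_limit=2))

# failureLimit: 3 needs a fourth bad measurement before it fails
print(evaluate_analysis([5.0, 5.0, 5.0], threshold=1.0, failure_limit=3))
```

At a 20% canary weight, a canary failing 4% of its requests on top of a 0.1% stable baseline blends to under 1% overall and would pass this check. Filtering the query by a label that identifies canary pods, if your metrics carry one, makes the check more sensitive.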
Step 4: PodDisruptionBudget — No Sudden Gaps
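A PodDisruptionBudget does not control how the rollout itself replaces pods; the rollout strategy governs that. What it limits is voluntary evictions, such as a node drain during a cluster upgrade or an autoscaler scale-down. Without a PDB, a drain that lands in the middle of a deploy can evict so many pods that traffic overwhelms the remaining ones:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-service-pdb
  namespace: production
spec:
  minAvailable: 80%   # Always keep 80% of pods running
  selector:
    matchLabels:
      app: payment-service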
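The arithmetic behind minAvailable: 80% is worth spelling out. The sketch below assumes Kubernetes rounds a percentage minAvailable up to the next whole pod; the function names are hypothetical.

```python
def min_available_pods(expected_pods: int, min_available_pct: int) -> int:
    """Pods that must stay available, rounding the percentage up."""
    # Integer ceiling of expected_pods * pct / 100
    return (expected_pods * min_available_pct + 99) // 100


def allowed_disruptions(healthy_pods: int, expected_pods: int, min_available_pct: int) -> int:
    """How many more pods may be evicted right now without breaking the budget."""
    floor = min_available_pods(expected_pods, min_available_pct)
    return max(0, healthy_pods - floor)


# 20 replicas at 80%: 16 must stay up, so at most 4 evictions at a time
print(min_available_pods(20, 80), allowed_disruptions(20, 20, 80))

# If only 17 pods are currently healthy, a drain may evict just 1 more
print(allowed_disruptions(17, 20, 80))
```

When the healthy count is already at or below the floor, the result is 0 and eviction requests are refused until pods recover.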
Step 5: Readiness Probe — No Traffic Before Ready
Your pod must pass readiness before it receives any production traffic. If your service needs 30 seconds to warm up a connection pool, have the readiness endpoint report not-ready until the pool is usable, and set the probe timings to allow for that:
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 20
  periodSeconds: 5
  failureThreshold: 3
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
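On the application side, the two probe paths should answer different questions: liveness means the process is running, readiness means it can serve traffic. Here is a minimal Python sketch of that split using only the standard library. The class and function names are hypothetical, and a real service would typically expose these routes through its own web framework.

```python
import threading
from http.server import BaseHTTPRequestHandler


class HealthState:
    """Tracks whether warm-up (for example, a connection pool) has finished."""

    def __init__(self):
        self._ready = threading.Event()

    def mark_ready(self):
        """Call once warm-up is complete."""
        self._ready.set()

    def status_for(self, path):
        """HTTP status code for a probe path."""
        if path == "/health/live":
            return 200  # the process is up and able to answer
        if path == "/health/ready":
            # 503 keeps the pod out of the Service endpoints until warm-up ends
            return 200 if self._ready.is_set() else 503
        return 404


def make_handler(state):
    """Build a request handler class bound to the given HealthState."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(state.status_for(self.path))
            self.send_header("Content-Length", "0")
            self.end_headers()

        def log_message(self, format, *args):
            pass  # keep probe traffic out of the logs

    return HealthHandler
```

To serve it, pass make_handler(state) to http.server.HTTPServer bound to port 8080 and call state.mark_ready() once warm-up completes. Keeping liveness independent of warm-up avoids restarts of a pod that is merely slow to start.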
Step 6: Pre-Sync DB Migrations
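Database schema changes must be applied before the new pod version starts serving. Because old and new pods run side by side during a canary, each migration also has to stay compatible with the previous version. Use an init container or, if you deploy through Argo CD, a PreSync hook Job:
# As an init container in the Rollout template
# (runs on every pod start; assumes the image includes the Flyway CLI)
initContainers:
  - name: db-migrate
    image: 123456789.dkr.ecr.us-east-1.amazonaws.com/payment-service:latest
    command: ["flyway", "migrate"]
    env:
      - name: FLYWAY_URL
        valueFrom:
          secretKeyRef:
            name: db-credentials
            key: url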
Step 7: Notifications
Wire Slack into every stage so the team knows what’s happening without watching dashboards:
# Argo Rollouts notification config
# (belongs in the namespace where the Argo Rollouts controller runs)
apiVersion: v1
kind: ConfigMap
metadata:
  name: argo-rollouts-notification-configmap
data:
  trigger.on-rollback: |
    - send: [slack-rollback]
      when: rollout.status.phase == 'Degraded'
  template.slack-rollback: |
    message: |
      🚨 *{{.rollout.metadata.name}}* auto-rolled back in *{{.rollout.metadata.namespace}}*
      Error rate exceeded 1% threshold.
Summary
| Concern | Solution |
|---|---|
| Bad image ships | Trivy blocks high/critical CVEs in CodeBuild |
| App breaks in prod | Canary at 5% limits blast radius |
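| Slow rollback | AnalysisTemplate triggers automatic rollback in roughly 3 min once the analysis step is running |
| Node drain during a deploy kills capacity | PodDisruptionBudget keeps at least 80% of pods available during voluntary evictions |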
| Migration runs after app | Init container runs Flyway first |
What you'll learn:
- How to structure a multi-stage CodePipeline for EKS
- How Argo Rollouts enables progressive delivery with canary steps
- How AnalysisTemplate auto-rollback works using Prometheus metrics
- How PodDisruptionBudgets limit evictions so node drains don't cause downtime during deploys
- How to wire Slack/PagerDuty notifications into each pipeline stage