
Build a Zero-Downtime Deployment Pipeline for Microservices on EKS

Design a complete CI/CD pipeline for a microservices application on EKS with canary deployments, automated rollbacks, and zero downtime.

January 20, 2025 4 min read ~40 min to complete DB
The Situation

You're the DevOps lead for a 30-service e-commerce platform running on EKS. The team deploys 10 times per day. Last month, a bad deploy took down the payment service for 8 minutes before anyone noticed. The CTO says: 'Fix deployments. Zero downtime. Automated rollback. I don't want to hear about another bad deploy taking us down.'

7 Steps
7 Services Used
~40 min Duration
Advanced Difficulty

The Problem

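A plain kubectl apply on a Deployment starts a rolling update that keeps replacing pods as long as the new ones pass their probes. Nothing checks real error rates, so if the new image is broken in a way the probes miss, the bad version reaches 100% and users hit errors until you notice and roll back manually. The goal is a pipeline that: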

  1. Automatically builds, scans, and tests every commit
  2. Releases to production incrementally (5% → 20% → 50% → 100%)
  3. Watches real error rates and rolls back automatically if they spike
  4. Never takes the service fully offline

Pipeline Architecture

Developer Push to main
        │
        ▼
  GitHub / CodeCommit
        │
        ▼
   CodePipeline
   ├── Stage 1: Source
   ├── Stage 2: Build (CodeBuild)
   │    ├── Unit Tests
   │    ├── SAST scan (Semgrep / Checkov)
   │    ├── Docker build → push to ECR
   │    └── Trivy image vulnerability scan
   ├── Stage 3: Deploy to Dev (kubectl apply)
   ├── Stage 4: Integration Tests
   ├── Stage 5: Deploy to Staging (Blue/Green)
   ├── Stage 6: Load Tests (k6)
   └── Stage 7: Deploy to Prod (Canary via Argo Rollouts)

Step 1: CodeBuild — Build, Scan, Push

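# buildspec.yml
version: 0.2
phases:
  pre_build:
    commands:
      # Authenticate Docker to ECR; the region comes from the build environment
      - aws ecr get-login-password | docker login --username AWS --password-stdin "$ECR_REGISTRY"
      # Tag images with the first 8 characters of the commit SHA
      - export IMAGE_TAG=$(echo "$CODEBUILD_RESOLVED_SOURCE_VERSION" | cut -c 1-8)

  build:
    commands:
      - docker build -t "$ECR_REGISTRY/payment-service:$IMAGE_TAG" .
      # Scan before pushing (assumes trivy is installed in the build image)
      - trivy image --exit-code 1 --severity HIGH,CRITICAL "$ECR_REGISTRY/payment-service:$IMAGE_TAG"
      # Push in the same phase as the scan: post_build still runs after a failed build phase,
      # so pushing there could publish an image that failed the scan
      - docker push "$ECR_REGISTRY/payment-service:$IMAGE_TAG"

  post_build:
    commands:
      # Record the image URI for later pipeline stages
      - printf '[{"name":"payment-service","imageUri":"%s"}]' $ECR_REGISTRY/payment-service:$IMAGE_TAG > imagedefinitions.json

artifacts:
  files:
    - imagedefinitions.json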
Fail the Build on High and Critical CVEs
The --exit-code 1 flag makes Trivy exit non-zero when it finds a HIGH or CRITICAL vulnerability, which fails the CodeBuild stage and blocks the pipeline. Never ship a container image you haven’t scanned.
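
The gate itself is simple enough to sketch. The Python below is a minimal illustration, not Trivy's own logic: it assumes the scan was also run with --format json and that the report uses Trivy's Results[].Vulnerabilities[].Severity layout. The function name and the sample CVE IDs are made up for the example.

```python
import json

# Severities that should block the pipeline, mirroring --severity HIGH,CRITICAL
BLOCKING = {"HIGH", "CRITICAL"}

def blocking_findings(report: dict) -> list[str]:
    """Return the IDs of vulnerabilities whose severity should fail the build."""
    ids = []
    for result in report.get("Results") or []:
        # "Vulnerabilities" can be missing or null when a target is clean
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in BLOCKING:
                ids.append(vuln.get("VulnerabilityID", "unknown"))
    return ids

# Hypothetical report fragment with placeholder CVE IDs
sample = json.loads("""
{"Results": [{"Target": "payment-service", "Vulnerabilities": [
  {"VulnerabilityID": "CVE-0000-0001", "Severity": "LOW"},
  {"VulnerabilityID": "CVE-0000-0002", "Severity": "CRITICAL"}
]}]}
""")

# Same contract as --exit-code 1: non-zero when anything blocking is found
exit_code = 1 if blocking_findings(sample) else 0
```

Parsing the report yourself is mainly useful if you want to post the blocking IDs to the team; for the pass/fail decision the CLI flags are enough.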

Step 2: Argo Rollouts — Canary Strategy

Install Argo Rollouts in EKS and replace your Deployment with a Rollout:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-service
  namespace: production
spec:
  replicas: 20
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      containers:
      - name: payment-service
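        # Placeholder tag: the pipeline should set the commit-SHA tag from Step 1,
        # because re-pushing :latest leaves the pod template unchanged and starts no rollout
        image: 123456789.dkr.ecr.us-east-1.amazonaws.com/payment-service:latest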
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
  strategy:
    canary:
      steps:
      - setWeight: 5        # ~5% of traffic to new version (1 of 20 pods without a traffic router)
      - pause: {duration: 5m}
      - setWeight: 20
      - pause: {duration: 10m}
      - analysis:           # Auto-check error rate before continuing
          templates:
          - templateName: error-rate-check
      - setWeight: 50
      - pause: {duration: 10m}
      - setWeight: 100
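
Because this Rollout defines no trafficRouting block, Argo Rollouts should approximate each setWeight by the ratio of canary pods to stable pods rather than by exact request percentages. The sketch below works out the pod counts for the steps above with 20 replicas, assuming both counts are rounded up. canary_pods is a hypothetical helper, and the real controller also takes maxSurge and maxUnavailable into account.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, avoiding float rounding."""
    return -(-a // b)

def canary_pods(replicas: int, weight: int) -> tuple[int, int]:
    """Return (canary, stable) pod counts for a replica-based canary weight."""
    canary = ceil_div(replicas * weight, 100)
    stable = ceil_div(replicas * (100 - weight), 100)
    return canary, stable

# Pod split at each setWeight step of the Rollout above
for weight in (5, 20, 50, 100):
    canary, stable = canary_pods(20, weight)
    print(f"setWeight {weight:>3}: {canary} canary / {stable} stable")
```

At 5% a single pod out of 20 runs the new version, so roughly one request in twenty can reach a bad build.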

Step 3: AnalysisTemplate — Automatic Rollback

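The AnalysisTemplate queries Prometheus every minute. A measurement with an error rate of 1% or more counts as a failure. Once the number of failures exceeds failureLimit, the analysis fails, and Argo Rollouts aborts the rollout and shifts all traffic back to the previous version with no human involved.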

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-rate-check
  namespace: production
spec:
  metrics:
  - name: error-rate
    interval: 1m
    count: 5                            # Assumed value: without count the inline step would keep measuring and never finish
    provider:
      prometheus:
        address: http://prometheus-server.monitoring:9090
        # "or vector(0)" keeps the result non-empty when there are no 5xx responses
        query: |
          (
            sum(rate(http_requests_total{
              service="payment-service",
              status=~"5.."
            }[5m]))
            or vector(0)
          )
          /
          sum(rate(http_requests_total{
            service="payment-service"
          }[5m])) * 100
    successCondition: result[0] < 1    # Pass if error rate < 1%
    failureLimit: 2                     # Tolerate 2 failures; the 3rd triggers rollback
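
The interplay between successCondition and failureLimit is easy to misread, so here is a small simulation. It is a sketch under one assumption about Argo Rollouts semantics: failureLimit is the number of failed measurements that are tolerated, and the metric fails once failures exceed it. analysis_outcome and the sample error rates are invented for illustration.

```python
def analysis_outcome(error_rates, threshold=1.0, failure_limit=3):
    """Replay per-interval error rates (in %) against the success condition.

    Returns (phase, index), where index is the measurement that decided it.
    """
    failures = 0
    for index, rate in enumerate(error_rates):
        if not rate < threshold:          # successCondition: result[0] < 1
            failures += 1
        if failures > failure_limit:      # the limit counts tolerated failures
            return "Failed", index
    return "Successful", len(error_rates) - 1

spike = [0.2, 4.0, 4.0, 4.0, 4.0]   # sustained 4% error rate from minute 1
blip = [0.2, 4.0, 0.3, 0.1, 0.2]    # one bad minute, then recovery

print(analysis_outcome(spike, failure_limit=3))  # fails on the 4th bad measurement
print(analysis_outcome(spike, failure_limit=2))  # fails on the 3rd bad measurement
print(analysis_outcome(blip, failure_limit=2))   # a single blip is tolerated
```

Under that assumption, a limit of 2 is what makes the third bad minute trigger the rollback, and a limit of 3 waits for a fourth.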

Step 4: PodDisruptionBudget — No Sudden Gaps

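A canary deploy terminates old pods as it scales the stable version down, and node drains or autoscaler scale-ins can evict more pods at the same time. A PodDisruptionBudget caps those voluntary evictions so that a drain cannot remove so many pods that traffic overwhelms the remaining ones. It does not limit the Rollout controller's own scale-down, which the canary steps govern: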

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-service-pdb
  namespace: production   # Must match the Rollout's namespace
spec:
  minAvailable: 80%   # Evictions may not take the service below 80% of pods
  selector:
    matchLabels:
      app: payment-service
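
To see what 80% means in pods, the arithmetic can be sketched as follows. This assumes Kubernetes rounds a percentage minAvailable up to the next whole pod; both function names are hypothetical.

```python
def required_pods(expected: int, min_available_pct: int) -> int:
    """Pods that must stay available, rounding the percentage up."""
    return -(-expected * min_available_pct // 100)

def allowed_evictions(healthy: int, expected: int, min_available_pct: int) -> int:
    """How many more pods an eviction (for example a node drain) may remove."""
    return max(0, healthy - required_pods(expected, min_available_pct))

# 20 replicas with minAvailable: 80%
print(required_pods(20, 80))            # pods that must stay up
print(allowed_evictions(20, 20, 80))    # all pods healthy
print(allowed_evictions(17, 20, 80))    # three canary pods not ready yet
print(allowed_evictions(16, 20, 80))    # already at the floor
```

The budget shrinks while new pods are still warming up, which is exactly when a node drain would otherwise hurt most.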

Step 5: Readiness Probe — No Traffic Before Ready

Your pod must pass readiness before it receives any production traffic. If your service needs 30 seconds to warm up a connection pool, account for that:

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 20
  periodSeconds: 5
  failureThreshold: 3
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
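
The timings above can be checked with a little arithmetic. The sketch below is an idealised model: it ignores kubelet jitter and probe timeouts, and it assumes the liveness probe uses the Kubernetes default failureThreshold of 3 because the snippet does not set one. Both helper names are made up.

```python
def first_ready_at(warmup_s: int, initial_delay_s: int, period_s: int) -> int:
    """Earliest time (s) a readiness probe can succeed for a given warm-up."""
    t = initial_delay_s
    while t < warmup_s:      # probe fires but the app is not ready yet
        t += period_s
    return t

def liveness_restart_at(initial_delay_s: int, period_s: int,
                        failure_threshold: int = 3) -> int:
    """Earliest time (s) a pod that never becomes live gets restarted."""
    return initial_delay_s + (failure_threshold - 1) * period_s

# Readiness: initialDelaySeconds 20, periodSeconds 5, 30 s warm-up
print(first_ready_at(30, 20, 5))
# Liveness: initialDelaySeconds 30, periodSeconds 10
print(liveness_restart_at(30, 10))
```

With these values a pod that needs 30 seconds to warm up joins the load balancer at about the 30-second mark, while a pod whose /health/live endpoint takes longer than roughly 50 seconds to respond would be restarted. If start-up can take that long, a startupProbe is the usual remedy.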

Step 6: Pre-Sync DB Migrations

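Database schema changes must run before the new pod version starts, and they must stay backward compatible because the old version keeps serving traffic throughout the canary. Use an init container, or run the migration as a separate Job before the Rollout is updated (for example, an Argo CD PreSync hook):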

# As an init container in the Rollout template
initContainers:
- name: db-migrate
  image: 123456789.dkr.ecr.us-east-1.amazonaws.com/payment-service:latest
  command: ["flyway", "migrate"]
  env:
  - name: FLYWAY_URL
    valueFrom:
      secretKeyRef:
        name: db-credentials
        key: url

Step 7: Notifications

Wire Slack into every stage so the team knows what’s happening without watching dashboards:

# Argo Rollouts notification config
# Also needs a service.slack entry here and a subscribe annotation on the Rollout
apiVersion: v1
kind: ConfigMap
metadata:
  name: argo-rollouts-notification-configmap
  namespace: argo-rollouts   # Assumes the controller's default install namespace
data:
  trigger.on-rollback: |
    - send: [slack-rollback]
      when: rollout.status.phase == 'Degraded'
  template.slack-rollback: |
    message: |
      🚨 *{{.rollout.metadata.name}}* auto-rolled back in *{{.rollout.metadata.namespace}}*
      Error rate exceeded 1% threshold.

Summary

Concern                     Solution
Bad image ships             Trivy blocks HIGH/CRITICAL CVEs in CodeBuild
App breaks in prod          Canary at 5% limits blast radius
Slow rollback               AnalysisTemplate triggers automatic rollback after about 3 min of failing measurements
Evictions cut capacity      PodDisruptionBudget keeps 80% of pods available during node drains
Migration runs after app    Init container runs Flyway first
Interview Angle
Mention error budget: if you deploy 10x per day, each deploy must be safe enough that the aggregate risk stays within your SLO. Canary + automated rollback is the mechanism that lets you ship fast without burning the error budget.
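
To put numbers on that argument, here is a rough calculation. The 99.9% availability SLO over a 30-day window is an assumption made for illustration, since the scenario does not state one. The 8-minute outage and the 10 deploys per day come from the situation described at the top.

```python
def error_budget_minutes(slo_percent: float, days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window."""
    return (100 - slo_percent) / 100 * days * 24 * 60

budget = error_budget_minutes(99.9)            # about 43.2 minutes per 30 days
outage_share = 8 / budget                      # the 8-minute payment outage
per_deploy_seconds = budget * 60 / (10 * 30)   # 10 deploys/day for 30 days

print(f"budget: {budget:.1f} min")
print(f"one bad deploy used {outage_share:.0%} of it")
print(f"average allowance per deploy: {per_deploy_seconds:.2f} s")
```

Under that assumed SLO, one unnoticed bad deploy consumes almost a fifth of the monthly budget, and 300 deploys a month leave less than ten seconds of downtime each on average. That is the case for limiting every release to a small canary slice with an automated rollback.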
Services Used
CodePipeline, CodeBuild, ECR, EKS, Argo Rollouts, Prometheus, SNS
Prerequisites
  • Familiarity with Kubernetes deployments and services
  • Basic understanding of CI/CD pipelines
  • Understanding of canary and blue/green deployment concepts
What You Learned
  • How to structure a multi-stage CodePipeline for EKS
  • How Argo Rollouts enables progressive delivery with canary steps
  • How AnalysisTemplate auto-rollback works using Prometheus metrics
  • How PodDisruptionBudgets protect capacity when pods are evicted during deploys
  • How to wire Slack/PagerDuty notifications into each pipeline stage
