Single AZ Failure Took Down Black Friday — Root Cause & Fix

The Problem

Your application went down because critical resources were effectively pinned to a single Availability Zone. The ASG may have looked multi-AZ on paper, but subtle misconfigurations meant that every instance that mattered sat in the zone that failed.

AZ Failure Alert

[CRITICAL] AWS Health Dashboard: “We are investigating increased error rates and latency in the us-east-1b Availability Zone affecting EC2, EBS, and RDS.”
Time: 11:42 AM — peak Black Friday traffic.

Step 1: Identify What Failed

Before fixing anything, find exactly what was AZ-pinned.

# Check ASG instance distribution across AZs
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names my-app-asg \
  --query 'AutoScalingGroups[0].Instances[*].{AZ:AvailabilityZone,State:LifecycleState,Health:HealthStatus}'

# Check if RDS is Multi-AZ
aws rds describe-db-instances \
  --db-instance-identifier prod-postgres \
  --query 'DBInstances[0].{MultiAZ:MultiAZ,AZ:AvailabilityZone}'

# Check ElastiCache cluster AZs
aws elasticache describe-cache-clusters \
  --show-cache-node-info \
  --query 'CacheClusters[*].{Cluster:CacheClusterId,AZ:PreferredAvailabilityZone}'

Common findings:

ASG was configured with only one subnet (in us-east-1b)
RDS was single-AZ (cost-saving decision from 2 years ago)
ElastiCache had no replica — single node in the failed AZ
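
Each of these findings comes down to counting how many distinct AZs a tier actually occupies. The helper below is a minimal sketch of that check. `az_spread` is a hypothetical name, not an AWS tool, and it assumes AZ names arrive on stdin separated by whitespace, which is what `--output text` produces for a list of strings.

```shell
# az_spread: read AZ names (whitespace-separated) on stdin,
# print "<count> <az>" per zone, and return non-zero if fewer than two zones are in use.
az_spread() {
  local counts
  counts=$(tr -s ' \t' '\n' | grep -v '^$' | sort | uniq -c | awk '{print $1, $2}')
  printf '%s\n' "$counts"
  [ "$(printf '%s\n' "$counts" | grep -c .)" -ge 2 ]
}

# Usage (ASG name taken from the commands above):
# aws autoscaling describe-auto-scaling-groups \
#   --auto-scaling-group-names my-app-asg \
#   --query "AutoScalingGroups[0].Instances[?LifecycleState=='InService'].AvailabilityZone" \
#   --output text | az_spread || echo "WARNING: instances are in fewer than two AZs"
```

The same function can take the AZ column from the RDS or ElastiCache queries, since it only looks at the zone names it is given.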

Step 2: Immediate Fixes

Fix 1: Force ASG rebalance across AZs

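If your ASG already has subnets in multiple AZs but its instances ended up imbalanced, an instance refresh replaces them in rolling batches and the replacements are spread across the configured AZs. It also terminates healthy instances along the way, so during peak traffic consider a higher MinHealthyPercentage than the 50 shown here: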

aws autoscaling start-instance-refresh \
  --auto-scaling-group-name my-app-asg \
  --preferences '{
    "MinHealthyPercentage": 50,
    "InstanceWarmup": 60
  }'
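
An instance refresh cannot help when the ASG only knows about one subnet, which was the first finding in Step 1. In that case the subnet list itself has to change. The sketch below wraps `aws autoscaling update-auto-scaling-group` in a hypothetical helper named `widen_asg`. The subnet IDs are placeholders, and the call replaces the ASG's current subnet list rather than appending to it.

```shell
# widen_asg <asg-name> <subnet-id>...
# Point the ASG at every subnet given (one per AZ), replacing its current list.
widen_asg() {
  local asg="$1"; shift
  if [ "$#" -lt 2 ]; then
    echo "need at least two subnets in different AZs" >&2
    return 1
  fi
  local subnets
  subnets=$(IFS=,; echo "$*")   # join the remaining arguments with commas
  aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name "$asg" \
    --vpc-zone-identifier "$subnets"
}

# Usage (placeholder subnet IDs):
# widen_asg my-app-asg subnet-az1 subnet-az2 subnet-az3
```

Once the healthy subnets are attached, new launches can land outside the failed zone.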

Fix 2: Fail over RDS to Multi-AZ standby

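If Multi-AZ was already enabled, RDS fails over to the standby automatically, typically within one to two minutes. If not, enable it now. The conversion normally does not require downtime, but it can raise I/O latency while the standby is synchronized, and it is unlikely to complete while the primary’s AZ is impaired: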

aws rds modify-db-instance \
  --db-instance-identifier prod-postgres \
  --multi-az \
  --apply-immediately
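
If the single-AZ instance sits in the failed zone and cannot be modified, one fallback is a point-in-time restore into a healthy AZ. This is a sketch under assumptions: automated backups must already be enabled, `restore_to_az` and the `prod-postgres-recovery` identifier are hypothetical names, and the restored instance gets a new endpoint that the application has to be pointed at.

```shell
# restore_to_az <source-id> <target-id> <healthy-az>
# Restore the latest recoverable state into a new instance in the given AZ,
# then block until that instance is available.
restore_to_az() {
  aws rds restore-db-instance-to-point-in-time \
    --source-db-instance-identifier "$1" \
    --target-db-instance-identifier "$2" \
    --use-latest-restorable-time \
    --availability-zone "$3" &&
  aws rds wait db-instance-available \
    --db-instance-identifier "$2"
}

# Usage (hypothetical target name):
# restore_to_az prod-postgres prod-postgres-recovery us-east-1c
```

The restore pins one AZ on purpose, so Multi-AZ is left off here and can be enabled with the `modify-db-instance` command above once the new instance is serving traffic.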

Fix 3: Restore ElastiCache from backup in a healthy AZ

aws elasticache create-cache-cluster \
  --cache-cluster-id prod-cache-recovery \
  --preferred-availability-zone us-east-1c \
  --snapshot-name my-cache-snapshot-latest
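
The snapshot name above is a placeholder, so the real one has to be looked up first. The sketch below picks the most recent available snapshot for a cluster. `latest_snapshot` is a hypothetical helper and `prod-cache` is an assumed cluster ID. It relies on the ISO 8601 timestamps that `describe-snapshots` returns sorting correctly as plain text.

```shell
# latest_snapshot <cache-cluster-id>
# Print the name of the newest snapshot in the "available" state.
latest_snapshot() {
  aws elasticache describe-snapshots \
    --cache-cluster-id "$1" \
    --query "Snapshots[?SnapshotStatus=='available'].[NodeSnapshots[0].SnapshotCreateTime,SnapshotName]" \
    --output text |
    sort | tail -n 1 | cut -f2   # rows are "<timestamp>\t<name>"; keep the newest name
}

# Usage (assumed cluster ID):
# aws elasticache create-cache-cluster \
#   --cache-cluster-id prod-cache-recovery \
#   --preferred-availability-zone us-east-1c \
#   --snapshot-name "$(latest_snapshot prod-cache)"
```

If the function prints nothing, there is no usable snapshot and the cache has to be rebuilt cold.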

Step 3: Permanent Fix — True Multi-AZ Infrastructure

Replace fragile single-AZ deployments with proper multi-AZ Terraform:

resource "aws_autoscaling_group" "app" {
  # Three subnets across three AZs — mandatory
  vpc_zone_identifier = [
    aws_subnet.private_az1.id,   # us-east-1a
    aws_subnet.private_az2.id,   # us-east-1b
    aws_subnet.private_az3.id    # us-east-1c
  ]
  
  min_size         = 3    # At least 1 per AZ
  max_size         = 30
  desired_capacity = 6

  # Mix On-Demand (stable) with Spot (cost-saving)
  mixed_instances_policy {
    instances_distribution {
      on_demand_base_capacity                  = 3
      on_demand_percentage_above_base_capacity = 20
      spot_allocation_strategy                 = "capacity-optimized"
    }
    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.app.id
        version            = "$Latest"
      }
      overrides {
        instance_type = "m5.large"
      }
      overrides {
        instance_type = "m5a.large"
      }
      overrides {
        instance_type = "m6i.large"
      }
    }
  }
}

# RDS — always Multi-AZ in production
resource "aws_db_instance" "prod" {
  multi_az              = true   # Non-negotiable
  allocated_storage     = 100
  max_allocated_storage = 1000   # Setting this above allocated_storage enables storage autoscaling
  deletion_protection   = true
}

# ElastiCache — replica in each AZ
resource "aws_elasticache_replication_group" "prod" {
  automatic_failover_enabled  = true
  num_cache_clusters          = 3       # 1 primary + 2 replicas
  preferred_cache_cluster_azs = ["us-east-1a", "us-east-1b", "us-east-1c"]
}

Step 4: ALB — Verify All AZs Are Registered

An ALB only routes to targets in Availability Zones that are enabled on the load balancer, so a healthy instance in an AZ that is not enabled receives no traffic:

# Enable all three AZs on the ALB.
# set-subnets replaces the current subnet list, so include every subnet you want to keep.
aws elbv2 set-subnets \
  --load-balancer-arn arn:aws:elasticloadbalancing:... \
  --subnets subnet-az1 subnet-az2 subnet-az3
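
Before changing subnets, it helps to see which AZs the ALB already has enabled. The sketch below compares the enabled zones against the zones you expect. `alb_missing_azs` is a hypothetical helper, and the load balancer ARN is whatever you would pass to the command above.

```shell
# alb_missing_azs <lb-arn> <expected-az>...
# Print each expected AZ that is not enabled on the load balancer.
alb_missing_azs() {
  local arn="$1"; shift
  local enabled az
  enabled=$(aws elbv2 describe-load-balancers \
    --load-balancer-arns "$arn" \
    --query 'LoadBalancers[0].AvailabilityZones[*].ZoneName' \
    --output text)
  # Normalize tabs and newlines to spaces so each zone can be matched as a whole word.
  enabled=" $(printf '%s' "$enabled" | tr '\t\n' '  ') "
  for az in "$@"; do
    case "$enabled" in
      *" $az "*) ;;          # zone is enabled
      *) echo "$az" ;;       # zone is missing
    esac
  done
}

# Usage:
# alb_missing_azs "$LB_ARN" us-east-1a us-east-1b us-east-1c
```

Empty output means every expected zone is enabled on the ALB.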

Step 5: Enforce Multi-AZ With SCPs

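Prevent future engineers from creating or converting production RDS instances to single-AZ with a Service Control Policy. This policy covers RDS only, and a new instance has no resource tags at creation time, so the create path may need a matching condition on aws:RequestTag/Environment instead. Test it in a sandbox OU before attaching it to production: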

{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyRDSSingleAZ",
    "Effect": "Deny",
    "Action": ["rds:CreateDBInstance", "rds:ModifyDBInstance"],
    "Resource": "*",
    "Condition": {
      "Bool": {
        "rds:MultiAz": "false"
      },
      "StringEquals": {
        "aws:ResourceTag/Environment": "production"
      }
    }
  }]
}
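
The policy only takes effect once it is created in AWS Organizations and attached to a target. The sketch below does both steps from the management account. `deploy_scp` is a hypothetical helper, and the file name and OU ID in the usage line are placeholders.

```shell
# deploy_scp <policy-json-file> <target-ou-or-account-id>
# Create the SCP from a local JSON file, then attach it to the given OU or account.
deploy_scp() {
  local file="$1" target="$2" policy_id
  policy_id=$(aws organizations create-policy \
    --name DenyRDSSingleAZ \
    --description "Deny single-AZ RDS instances in production" \
    --type SERVICE_CONTROL_POLICY \
    --content "file://$file" \
    --query 'Policy.PolicySummary.Id' \
    --output text) || return 1
  aws organizations attach-policy \
    --policy-id "$policy_id" \
    --target-id "$target"
}

# Usage (placeholder file name and OU ID):
# deploy_scp deny-rds-single-az.json ou-xxxx-xxxxxxxx
```

Attaching to a sandbox OU first gives a safe place to confirm that the deny fires on the requests you expect and on nothing else.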

Post-Incident Actions

Action	Priority	Owner
Enable RDS Multi-AZ on all prod databases	🔴 Today	DBA
Add 3rd subnet to all ASGs	🔴 Today	Infra
Add ElastiCache replicas	🔴 This week	Infra
SCP to deny single-AZ in prod OU	🟡 This week	Platform
Run “AZ failure” chaos test monthly	🟡 This month	SRE

Interview Angle

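Interviewers love follow-ups such as “What if you can’t afford Multi-AZ RDS?” One answer: Aurora Serverless v2 spread across AZs can cost less than a standard Multi-AZ RDS instance for spiky or mostly idle workloads, because capacity scales with load and, on engine versions that support auto-pause, can drop to zero when idle. Under steady high load it may cost more, so always know the cost-availability trade-off.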
