Single AZ Failure Took Down Black Friday — Root Cause & Fix
Your e-commerce application crashed during Black Friday because a single Availability Zone failed. Walk through what went wrong, the immediate fix, and how to prevent it permanently.
It's Black Friday. Orders are flowing. Suddenly, your monitoring fires 50 alerts in 10 seconds. The application is down. Revenue loss: $50,000/minute. Your Slack is on fire. The cause: AWS reported degraded capacity in us-east-1b — the AZ where all your critical resources happened to live.
The Problem
Your application went down because its critical resources were effectively pinned to a single Availability Zone. Even if the ASG spans multiple AZs on paper, subtle misconfigurations can leave all of your important instances in the one zone that failed.
Time: 11:42 AM — peak Black Friday traffic.
Step 1: Identify What Failed
Before fixing anything, find exactly what was AZ-pinned.
# Check ASG instance distribution across AZs
aws autoscaling describe-auto-scaling-groups \
--auto-scaling-group-names my-app-asg \
--query 'AutoScalingGroups[0].Instances[*].{AZ:AvailabilityZone,State:LifecycleState,Health:HealthStatus}'
# Check if RDS is Multi-AZ
aws rds describe-db-instances \
--db-instance-identifier prod-postgres \
--query 'DBInstances[0].{MultiAZ:MultiAZ,AZ:AvailabilityZone}'
# Check ElastiCache cluster AZs
aws elasticache describe-cache-clusters \
--show-cache-node-info \
--query 'CacheClusters[*].{Cluster:CacheClusterId,AZ:PreferredAvailabilityZone}'
Common findings:
- ASG was configured with only one subnet (in us-east-1b)
- RDS was single-AZ (cost-saving decision from 2 years ago)
- ElastiCache had no replica — single node in the failed AZ
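To make findings like these quick to spot, the output of the first command above can be summarized per AZ. The following is a minimal sketch, assuming that command's JSON output (a list of objects with the `AZ`, `State`, and `Health` keys defined by its `--query`) has been parsed into Python. The function names and the sample data are hypothetical.

```python
from collections import Counter


def az_distribution(instances):
    """Count in-service, healthy instances per Availability Zone."""
    return Counter(
        inst["AZ"]
        for inst in instances
        if inst.get("State") == "InService" and inst.get("Health") == "Healthy"
    )


def is_az_pinned(instances, min_azs=2):
    """True if healthy capacity lives in fewer than min_azs zones."""
    return len(az_distribution(instances)) < min_azs


# Hypothetical sample mirroring the shape of the CLI query output
sample = [
    {"AZ": "us-east-1b", "State": "InService", "Health": "Healthy"},
    {"AZ": "us-east-1b", "State": "InService", "Health": "Healthy"},
    {"AZ": "us-east-1b", "State": "InService", "Health": "Healthy"},
]

print(dict(az_distribution(sample)))  # {'us-east-1b': 3}
print(is_az_pinned(sample))           # True
```

A group that reports a single key in the distribution is the pattern described in the first finding.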
Step 2: Immediate Fixes
Fix 1: Force ASG rebalance across AZs
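If your ASG already has subnets in multiple AZs but its instances ended up imbalanced, an instance refresh replaces them in rolling batches so the replacements are spread across the group's AZs. If the ASG has only one subnet (the first finding above), add subnets in other AZs first, using aws autoscaling update-auto-scaling-group with --vpc-zone-identifier.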
aws autoscaling start-instance-refresh \
--auto-scaling-group-name my-app-asg \
--preferences '{
"MinHealthyPercentage": 50,
"InstanceWarmup": 60
}'
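As a rough guide to what `MinHealthyPercentage` means for the refresh above, the arithmetic can be sketched as follows. This assumes the percentage is applied to the group's desired capacity and rounded up, which is a simplification and not a statement of the exact rounding Auto Scaling uses. The function name is hypothetical.

```python
import math


def refresh_batch(desired_capacity, min_healthy_percentage):
    """Return (instances kept in service, instances replaceable at once).

    Assumption: the percentage applies to desired capacity, rounded up.
    """
    keep = math.ceil(desired_capacity * min_healthy_percentage / 100)
    return keep, desired_capacity - keep


keep, replaceable = refresh_batch(6, 50)
print(keep, replaceable)  # 3 3
```

With half the fleet allowed out of service, a refresh at peak traffic trades headroom for speed; a higher percentage is slower but leaves more capacity serving orders.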
Fix 2: Fail over RDS to Multi-AZ standby
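If Multi-AZ was already enabled, RDS fails over to the standby automatically (usually 1-2 minutes). If not, enable it now. The conversion normally does not require a restart, but it can degrade I/O performance while the standby is being synchronized. It also cannot rescue a primary whose AZ is already unreachable; in that case, restore from a snapshot or to a point in time in a healthy AZ instead.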
aws rds modify-db-instance \
--db-instance-identifier prod-postgres \
--multi-az \
--apply-immediately
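The same check is worth running across every database, not just `prod-postgres`. A minimal sketch, assuming the JSON from `aws rds describe-db-instances --output json` has been loaded into a dict. The helper name and the `prod-reporting` identifier are hypothetical.

```python
def single_az_databases(response):
    """Return identifiers of DB instances that are not Multi-AZ."""
    return [
        db["DBInstanceIdentifier"]
        for db in response.get("DBInstances", [])
        if not db.get("MultiAZ", False)
    ]


# Hypothetical sample in the shape of the describe-db-instances response
sample = {
    "DBInstances": [
        {"DBInstanceIdentifier": "prod-postgres", "MultiAZ": False,
         "AvailabilityZone": "us-east-1b"},
        {"DBInstanceIdentifier": "prod-reporting", "MultiAZ": True,
         "AvailabilityZone": "us-east-1a"},
    ]
}

print(single_az_databases(sample))  # ['prod-postgres']
```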
Fix 3: Restore ElastiCache from backup in a healthy AZ
aws elasticache create-cache-cluster \
--cache-cluster-id prod-cache-recovery \
--preferred-availability-zone us-east-1c \
--snapshot-name my-cache-snapshot-latest
Step 3: Permanent Fix — True Multi-AZ Infrastructure
Replace fragile single-AZ deployments with proper multi-AZ Terraform:
resource "aws_autoscaling_group" "app" {
  # Three subnets across three AZs (mandatory)
  vpc_zone_identifier = [
    aws_subnet.private_az1.id, # us-east-1a
    aws_subnet.private_az2.id, # us-east-1b
    aws_subnet.private_az3.id  # us-east-1c
  ]
  min_size         = 3 # At least 1 per AZ
  max_size         = 30
  desired_capacity = 6
  # Mix On-Demand (stable) with Spot (cost-saving)
  mixed_instances_policy {
    instances_distribution {
      on_demand_base_capacity                  = 3
      on_demand_percentage_above_base_capacity = 20
      spot_allocation_strategy                 = "capacity-optimized"
    }
    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.app.id
        version            = "$Latest"
      }
      # One override block (singular) per acceptable instance type
      override {
        instance_type = "m5.large"
      }
      override {
        instance_type = "m5a.large"
      }
      override {
        instance_type = "m6i.large"
      }
    }
  }
}
# RDS — always Multi-AZ in production
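resource "aws_db_instance" "prod" {
  # engine, instance_class, and other required arguments omitted
  multi_az              = true # Non-negotiable
  allocated_storage     = 100
  # Setting max_allocated_storage above allocated_storage enables
  # storage autoscaling, which prevents storage-full outages
  max_allocated_storage = 1000
  deletion_protection   = true
}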
# ElastiCache — replica in each AZ
resource "aws_elasticache_replication_group" "prod" {
  automatic_failover_enabled  = true
  num_cache_clusters          = 3 # 1 primary + 2 replicas
  preferred_cache_cluster_azs = ["us-east-1a", "us-east-1b", "us-east-1c"]
}
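The sizing comments above imply some arithmetic worth making explicit: with instances spread evenly over N AZs, losing one AZ removes roughly 1/N of capacity. A small sketch, assuming an even spread (Auto Scaling aims for balance but does not guarantee it at every moment). The function names are hypothetical.

```python
import math


def capacity_after_az_loss(desired_capacity, az_count):
    """Instances left if the fullest AZ fails, assuming an even spread."""
    largest_az = math.ceil(desired_capacity / az_count)
    return desired_capacity - largest_az


def capacity_to_survive_az_loss(needed_at_peak, az_count):
    """Total instances to run so needed_at_peak remain after one AZ fails."""
    if az_count < 2:
        raise ValueError("a single AZ cannot survive its own failure")
    per_az = math.ceil(needed_at_peak / (az_count - 1))
    return per_az * az_count


print(capacity_after_az_loss(6, 3))       # 4
print(capacity_to_survive_az_loss(6, 3))  # 9
```

With the desired capacity of 6 used above, an AZ failure leaves 4 instances until replacements launch. If peak traffic genuinely needs 6, this model suggests running 9.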
Step 4: ALB — Verify All AZs Are Registered
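An ALB only routes to targets in Availability Zones that are enabled on it. Note that set-subnets replaces the existing subnet list, so pass every subnet you want enabled: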
# Enable all three AZs on the ALB
aws elbv2 set-subnets \
--load-balancer-arn arn:aws:elasticloadbalancing:... \
--subnets subnet-az1 subnet-az2 subnet-az3
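One way to confirm the ALB covers every AZ that holds targets is to compare the two sets. A minimal sketch, assuming one entry of the `LoadBalancers` list from `aws elbv2 describe-load-balancers` has been parsed into a dict, and that the instance AZs come from the Step 1 ASG query. The function name and sample values are hypothetical.

```python
def unrouted_azs(load_balancer, instance_azs):
    """Return AZs that contain instances but are not enabled on the ALB."""
    enabled = {
        az["ZoneName"] for az in load_balancer.get("AvailabilityZones", [])
    }
    return sorted(set(instance_azs) - enabled)


# Hypothetical ALB description with only two AZs enabled
alb = {
    "AvailabilityZones": [
        {"ZoneName": "us-east-1a"},
        {"ZoneName": "us-east-1b"},
    ]
}

print(unrouted_azs(alb, ["us-east-1a", "us-east-1b", "us-east-1c"]))
# ['us-east-1c']
```

An empty result means every AZ with instances is reachable through the load balancer.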
Step 5: Enforce Multi-AZ With SCPs
Prevent future engineers from deploying single-AZ resources using a Service Control Policy:
{
"Version": "2012-10-17",
"Statement": [{
"Sid": "DenyRDSSingleAZ",
"Effect": "Deny",
"Action": ["rds:CreateDBInstance", "rds:ModifyDBInstance"],
"Resource": "*",
"Condition": {
"Bool": {
"rds:MultiAz": "false"
},
"StringEquals": {
"aws:ResourceTag/Environment": "production"
}
}
}]
}
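The two conditions in this policy are combined with AND, so the deny applies only when the request sets Multi-AZ to false and the resource carries the production tag. The sketch below models just that logic. It is a simplified illustration, not IAM's actual evaluation engine, and the function name is hypothetical. It also assumes the usual IAM behavior that a condition on a key absent from the request does not match.

```python
def scp_denies(multi_az, environment_tag):
    """Model the DenyRDSSingleAZ statement.

    multi_az: True/False as sent in the request, or None if the key is absent.
    environment_tag: value of the Environment tag, or None if untagged.
    """
    bool_matches = multi_az is False               # "rds:MultiAz": "false"
    tag_matches = environment_tag == "production"  # Environment tag check
    return bool_matches and tag_matches            # conditions are ANDed


print(scp_denies(False, "production"))  # True
print(scp_denies(False, "staging"))     # False
print(scp_denies(None, "production"))   # False
```

Under this model an untagged database, or a request that omits the Multi-AZ setting, is not blocked, so the policy is only as strong as the tagging discipline behind it.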
Post-Incident Actions
| Action | Priority | Owner |
|---|---|---|
| Enable RDS Multi-AZ on all prod databases | 🔴 Today | DBA |
| Add 3rd subnet to all ASGs | 🔴 Today | Infra |
| Add ElastiCache replicas | 🔴 This week | Infra |
| SCP to deny single-AZ in prod OU | 🟡 This week | Platform |
| Run “AZ failure” chaos test monthly | 🟡 This month | SRE |
What this scenario covers:
- How to diagnose which resources are AZ-pinned
- How to force an ASG to rebalance across AZs
- Terraform patterns for truly multi-AZ deployments
- How to use SCPs to enforce multi-AZ at the organization level