Single AZ Failure Took Down Black Friday — Root Cause & Fix

Your e-commerce application crashed during Black Friday because a single Availability Zone failed. Walk through what went wrong, the immediate fix, and how to prevent it permanently.

January 20, 2025 | 3 min read | ~25 min to complete | DB

The Situation

It's Black Friday. Orders are flowing. Suddenly, your monitoring fires 50 alerts in 10 seconds. The application is down. Revenue loss: $50,000/minute. Your Slack is on fire. The cause: AWS reported degraded capacity in us-east-1b — the AZ where all your critical resources happened to live.

5 Steps
5 Services Used
~25 min Duration
Advanced Difficulty

The Problem

Your application went down because critical resources were effectively pinned to a single Availability Zone. Even though the ASG looked multi-AZ on paper, subtle misconfigurations meant that every critical instance sat in the zone that failed.

AZ Failure Alert
[CRITICAL] AWS Health Dashboard: “We are investigating increased error rates and latency in the us-east-1b Availability Zone affecting EC2, EBS, and RDS.”
Time: 11:42 AM — peak Black Friday traffic.

Step 1: Identify What Failed

Before fixing anything, find exactly what was AZ-pinned.

# Check ASG instance distribution across AZs
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names my-app-asg \
  --query 'AutoScalingGroups[0].Instances[*].{AZ:AvailabilityZone,State:LifecycleState,Health:HealthStatus}'

# Check if RDS is Multi-AZ
aws rds describe-db-instances \
  --db-instance-identifier prod-postgres \
  --query 'DBInstances[0].{MultiAZ:MultiAZ,AZ:AvailabilityZone}'

# Check ElastiCache cluster AZs
aws elasticache describe-cache-clusters \
  --show-cache-node-info \
  --query 'CacheClusters[*].{Cluster:CacheClusterId,AZ:PreferredAvailabilityZone}'

Common findings:

  • ASG was configured with only one subnet (in us-east-1b)
  • RDS was single-AZ (cost-saving decision from 2 years ago)
  • ElastiCache had no replica — single node in the failed AZ
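
The ASG check can be turned into a quick pass/fail test. The following is a minimal Python sketch, assuming the JSON output of the first command (a list of objects with the AZ, State, and Health keys from its --query projection) has been loaded into a list. The function names and the 80% threshold are illustrative choices, not part of any AWS tooling.

```python
from collections import Counter


def az_distribution(instances):
    """Count in-service instances per Availability Zone."""
    return Counter(
        inst["AZ"] for inst in instances if inst["State"] == "InService"
    )


def pinned_az(instances, threshold=0.8):
    """Return the AZ holding at least `threshold` of in-service instances.

    Returns None when no single AZ dominates. `threshold` is a
    hypothetical cutoff chosen for illustration.
    """
    counts = az_distribution(instances)
    total = sum(counts.values())
    if total == 0:
        return None
    az, count = counts.most_common(1)[0]
    return az if count / total >= threshold else None


# Sample data shaped like the --query output of the ASG command
sample = (
    [{"AZ": "us-east-1b", "State": "InService", "Health": "Healthy"}] * 5
    + [{"AZ": "us-east-1a", "State": "InService", "Health": "Healthy"}]
)

print(dict(az_distribution(sample)))
print(pinned_az(sample))
```

With five of six instances in us-east-1b, the sample is reported as pinned to that zone; an evenly spread fleet returns None.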

Step 2: Immediate Fixes

Fix 1: Force ASG rebalance across AZs

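If your ASG has subnets in multiple AZs but instances became imbalanced, an instance refresh replaces them in rolling batches. MinHealthyPercentage is set to 50 here, so up to half the fleet can be out of service during the refresh; raise it if you are still serving peak traffic: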

aws autoscaling start-instance-refresh \
  --auto-scaling-group-name my-app-asg \
  --preferences '{
    "MinHealthyPercentage": 50,
    "InstanceWarmup": 60
  }'

Fix 2: Fail over RDS to Multi-AZ standby

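If Multi-AZ was already enabled, AWS automatically fails over (usually 1-2 minutes). If not, enable it now. The conversion does not normally require downtime, but RDS snapshots the primary's storage to build the standby, which can add latency while it runs, and it only works if the primary is still reachable: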

aws rds modify-db-instance \
  --db-instance-identifier prod-postgres \
  --multi-az \
  --apply-immediately
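
To put the failover window in business terms, the short Python sketch below multiplies the 1-2 minute failover estimate by the $50,000/minute loss figure from the scenario. It is simple arithmetic under the assumption that revenue loss is constant for the whole window; `outage_cost` is a hypothetical helper name.

```python
REVENUE_LOSS_PER_MINUTE = 50_000  # USD, the figure quoted in the scenario


def outage_cost(minutes, loss_per_minute=REVENUE_LOSS_PER_MINUTE):
    """Revenue lost over an outage of the given length, in USD."""
    return minutes * loss_per_minute


for minutes in (1, 2):
    print(f"{minutes} min failover: ${outage_cost(minutes):,}")
```

Even the automatic failover path costs $50,000 to $100,000 at this rate, which is the number to weigh against the monthly price of a standby.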

Fix 3: Restore ElastiCache from backup in a healthy AZ

aws elasticache create-cache-cluster \
  --cache-cluster-id prod-cache-recovery \
  --preferred-availability-zone us-east-1c \
  --snapshot-name my-cache-snapshot-latest

Step 3: Permanent Fix — True Multi-AZ Infrastructure

Replace fragile single-AZ deployments with proper multi-AZ Terraform:

resource "aws_autoscaling_group" "app" {
  # Three subnets across three AZs — mandatory
  vpc_zone_identifier = [
    aws_subnet.private_az1.id,   # us-east-1a
    aws_subnet.private_az2.id,   # us-east-1b
    aws_subnet.private_az3.id    # us-east-1c
  ]
  
  min_size         = 3    # At least 1 per AZ
  max_size         = 30
  desired_capacity = 6

  # Mix On-Demand (stable) with Spot (cost-saving)
  mixed_instances_policy {
    instances_distribution {
      on_demand_base_capacity                  = 3
      on_demand_percentage_above_base_capacity = 20
      spot_allocation_strategy                 = "capacity-optimized"
    }
    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.app.id
        version            = "$Latest"
      }
      # The Terraform AWS provider names this block "override" (singular)
      override {
        instance_type = "m5.large"
      }
      override {
        instance_type = "m5a.large"
      }
      override {
        instance_type = "m6i.large"
      }
    }
  }
}

# RDS: always Multi-AZ in production
resource "aws_db_instance" "prod" {
  multi_az              = true   # Non-negotiable
  allocated_storage     = 100
  # Setting max_allocated_storage above allocated_storage is what enables
  # storage autoscaling; there is no separate storage_autoscaling argument
  max_allocated_storage = 1000
  deletion_protection   = true
}

# ElastiCache: one node per AZ
resource "aws_elasticache_replication_group" "prod" {
  automatic_failover_enabled  = true
  multi_az_enabled            = true    # Requires automatic failover
  num_cache_clusters          = 3       # 1 primary + 2 replicas
  preferred_cache_cluster_azs = ["us-east-1a", "us-east-1b", "us-east-1c"]
}
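
Spreading instances across three AZs only helps if the surviving zones can carry the load. The Python sketch below works through that arithmetic for the ASG sizes used here. It assumes equally sized instances spread as evenly as possible across AZs and ignores Spot interruptions; both function names are hypothetical.

```python
def surviving_capacity(total_instances, az_count):
    """Instances left after the fullest AZ fails, assuming an even spread."""
    lost = -(-total_instances // az_count)  # ceiling division: worst-case AZ
    return total_instances - lost


def required_total(needed_instances, az_count):
    """Fleet size so that `needed_instances` survive the loss of any one AZ."""
    if az_count < 2:
        raise ValueError("need at least two AZs to survive an AZ failure")
    per_az = -(-needed_instances // (az_count - 1))  # ceiling division
    return per_az * az_count


# desired_capacity = 6 across 3 AZs, as in the Terraform above
print(surviving_capacity(6, 3))  # instances left after one AZ is lost
print(required_total(6, 3))      # fleet size needed to keep 6 after a loss
```

With six instances over three AZs, losing a zone leaves four. If peak traffic genuinely needs six, the fleet has to run nine so that any two zones can carry the full load.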

Step 4: ALB — Verify All AZs Are Registered

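An ALB only routes to targets in Availability Zones that are enabled on it, and an AZ is enabled by attaching one of its subnets. Note that set-subnets replaces the whole subnet list, so include every subnet you want to keep: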

# Enable all three AZs on the ALB
aws elbv2 set-subnets \
  --load-balancer-arn arn:aws:elasticloadbalancing:... \
  --subnets subnet-az1 subnet-az2 subnet-az3
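
Before changing subnets, it helps to confirm which AZs are actually missing. The following Python sketch compares the zones enabled on the load balancer with the zones where instances run. It assumes `load_balancer` is one entry of the `LoadBalancers` list returned by `aws elbv2 describe-load-balancers`, and that the instance AZs come from the ASG query in Step 1; the function name is hypothetical.

```python
def azs_missing_from_alb(load_balancer, instance_azs):
    """Return AZs that host instances but are not enabled on the ALB."""
    enabled = {zone["ZoneName"] for zone in load_balancer["AvailabilityZones"]}
    return sorted(set(instance_azs) - enabled)


# Sample data: the ALB has two AZs enabled, instances run in three
alb = {
    "AvailabilityZones": [
        {"ZoneName": "us-east-1a", "SubnetId": "subnet-az1"},
        {"ZoneName": "us-east-1b", "SubnetId": "subnet-az2"},
    ]
}
instance_azs = ["us-east-1a", "us-east-1b", "us-east-1c", "us-east-1c"]

print(azs_missing_from_alb(alb, instance_azs))
```

Any zone in the result holds instances that will not receive traffic until a subnet from that zone is attached to the load balancer.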

Step 5: Enforce Multi-AZ With SCPs

Prevent future engineers from deploying single-AZ resources using a Service Control Policy:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyRDSSingleAZ",
    "Effect": "Deny",
    "Action": ["rds:CreateDBInstance", "rds:ModifyDBInstance"],
    "Resource": "*",
    "Condition": {
      "Bool": {
        "rds:MultiAz": "false"
      },
      "StringEquals": {
        "aws:ResourceTag/Environment": "production"
      }
    }
  }]
}
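
The policy denies a request only when the action, the Multi-AZ condition, and the tag condition all match, because separate condition operators in one statement are combined with AND. The Python sketch below is a toy model of that logic for reasoning about which requests are blocked. It is not how IAM evaluates policies internally, and the function and parameter names are hypothetical.

```python
GUARDED_ACTIONS = {"rds:CreateDBInstance", "rds:ModifyDBInstance"}


def scp_denies(action, multi_az, environment_tag):
    """Toy model of the Deny statement: every condition must match."""
    return (
        action in GUARDED_ACTIONS
        and multi_az is False
        and environment_tag == "production"
    )


requests = [
    ("rds:CreateDBInstance", False, "production"),  # single-AZ in prod
    ("rds:CreateDBInstance", True, "production"),   # Multi-AZ in prod
    ("rds:CreateDBInstance", False, "staging"),     # single-AZ outside prod
]
for action, multi_az, env in requests:
    print(action, multi_az, env, "->", "DENY" if scp_denies(action, multi_az, env) else "allow")
```

Only the first request is denied, so single-AZ databases remain available for non-production environments where the cost saving is acceptable.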

Post-Incident Actions

Action | Priority | Owner
Enable RDS Multi-AZ on all prod databases | 🔴 Today | DBA
Add 3rd subnet to all ASGs | 🔴 Today | Infra
Add ElastiCache replicas | 🔴 This week | Infra
SCP to deny single-AZ in prod OU | 🟡 This week | Platform
Run “AZ failure” chaos test monthly | 🟡 This month | SRE

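Interview Angle

Interviewers love follow-ups: “What if you can’t afford Multi-AZ RDS?” One answer: Aurora Serverless v2 can cost less than a standard Multi-AZ RDS instance for spiky or mostly idle workloads, because its capacity scales down with demand and, where auto-pause is supported, to zero when idle. For steady heavy load, a provisioned instance is usually the cheaper option. Always know the cost-availability trade-off.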

Services Used

  • EC2 Auto Scaling
  • RDS Multi-AZ
  • ElastiCache
  • Application Load Balancer
  • VPC

Prerequisites

  • Understanding of AWS Availability Zones
  • Familiarity with Auto Scaling Groups and RDS

What You Learned

  • How to diagnose which resources are AZ-pinned
  • How to force an ASG to rebalance across AZs
  • Terraform patterns for truly multi-AZ deployments
  • How to use SCPs to enforce multi-AZ at the organization level
