Design a Multi-Region AWS Architecture for 99.99% Uptime
Your company's e-commerce app runs in a single AWS region. The business demands 99.99% uptime with RTO of 15 minutes and RPO of 5 minutes. Design the architecture.
You're the lead architect. The CTO walks in on a Monday: 'Legal just signed a contract guaranteeing 99.99% uptime. We currently run everything in us-east-1. What do we need to change — and by when?'
The Problem
Your e-commerce application runs entirely in us-east-1. A single region failure would take the site down for hours — unacceptable under the new SLA. The contract requires:
- RTO: ≤ 15 minutes (how fast you recover)
- RPO: ≤ 5 minutes (how much data you can lose)
- Uptime: 99.99% (~52 minutes downtime per year)
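The downtime figure above follows directly from the availability target. A quick sketch of the arithmetic, assuming a 365-day year (the function name is illustrative, not from any library):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def downtime_budget_minutes(availability_pct: float) -> float:
    """Return the allowed downtime per year, in minutes, for an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


budget = downtime_budget_minutes(99.99)
print(f"{budget:.2f} minutes of downtime allowed per year")

# Share of the yearly budget consumed by one failover that takes the full RTO
rto_share = 15 / budget
print(f"One 15-minute failover uses {rto_share:.0%} of the budget")
```

The second figure is worth keeping in mind: a single failover that takes the full 15-minute RTO consumes more than a quarter of the yearly budget.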
Architecture Overview
Primary Region (us-east-1)                DR Region (us-west-2)
───────────────────────────               ───────────────────────────
Route53 (Health Check)        ──────────► Route53 Failover
ALB (Multi-AZ)                            ALB (Warm Standby)
ASG (3 AZs)                               ASG (Min Capacity)
Aurora Global DB              ──────────► Aurora Secondary
ElastiCache Cluster                       ElastiCache (Replica)
S3 (Source of Truth)          ──────────► S3 (Cross-Region Rep.)
───────────────────────────               ───────────────────────────
Step 1: Route 53 — Traffic Control & Failover
Configure Route 53 with a Failover routing policy. Health checks run every 10 seconds against your ALB endpoint; after 3 consecutive failures (30 seconds), Route 53 flips DNS to the DR region.
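# Create health check for primary ALB
# Note: the FQDN should always resolve to the primary ALB. If it is the
# failover record itself, the check follows traffic to DR after a failover.
aws route53 create-health-check \
  --caller-reference "primary-alb-$(date +%s)" \
  --health-check-config '{
    "Type": "HTTPS",
    "FullyQualifiedDomainName": "api.yourstore.com",
    "ResourcePath": "/health",
    "RequestInterval": 10,
    "FailureThreshold": 3
  }'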
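The health check only drives failover once it is attached to a PRIMARY/SECONDARY record pair. A minimal sketch of that pair, built as the `ChangeBatch` structure that the Route 53 `ChangeResourceRecordSets` API accepts. The ALB DNS names, hosted zone IDs, and health check ID are placeholders, and `failover_record` is an illustrative helper, not part of any SDK:

```python
def failover_record(name, role, alb_dns, alb_zone_id, health_check_id=None):
    """Build one UPSERT change for a failover alias record (role: PRIMARY or SECONDARY)."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        "SetIdentifier": f"{role.lower()}-alb",
        "Failover": role,
        "AliasTarget": {
            "HostedZoneId": alb_zone_id,  # the ALB's own zone ID, not your domain's
            "DNSName": alb_dns,
            "EvaluateTargetHealth": True,
        },
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}


# All identifiers below are placeholders for illustration.
change_batch = {
    "Changes": [
        failover_record("api.yourstore.com", "PRIMARY",
                        "primary-alb.us-east-1.elb.amazonaws.com",
                        "ZPRIMARYPLACEHOLDER", health_check_id="hc-placeholder"),
        failover_record("api.yourstore.com", "SECONDARY",
                        "dr-alb.us-west-2.elb.amazonaws.com",
                        "ZDRPLACEHOLDER"),
    ]
}
```

With boto3, this dictionary would be passed as `ChangeBatch=change_batch` to `change_resource_record_sets`, together with the `HostedZoneId` of your own domain.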
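Add Global Accelerator in front of the regional ALBs (Route 53 then points the domain at the accelerator rather than at an ALB). It provides static anycast IPs that route to the nearest healthy endpoint and typically shifts traffic away from an unhealthy endpoint within tens of seconds, avoiding the DNS TTL wait that Route 53 alone requires.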
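The DNS TTL wait can be estimated with simple arithmetic: detection time plus the time cached answers stay valid. A rough sketch, assuming a 60-second record TTL (an assumption, not a value from this scenario) and ignoring resolvers that cache longer than the TTL:

```python
def dns_failover_seconds(request_interval_s, failure_threshold, record_ttl_s):
    """Approximate worst-case time for clients to follow a DNS-based failover."""
    detection = request_interval_s * failure_threshold  # consecutive failed checks
    return detection + record_ttl_s  # plus cached answers expiring


# 10-second interval and 3 failures come from the health check in Step 1;
# the 60-second TTL is assumed for illustration.
worst_case = dns_failover_seconds(10, 3, record_ttl_s=60)
print(f"DNS-only failover: up to ~{worst_case} seconds")
```

Under these assumptions DNS-only failover takes up to about 90 seconds, which is why removing the TTL term matters for a tight RTO.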
Step 2: Aurora Global Database — Sub-Second RPO
Standard cross-region RDS Read Replicas have replication lag measured in seconds to minutes. Aurora Global Database replicates at the storage layer, typically with less than 1 second of lag, and a secondary cluster can usually be promoted in about a minute.
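# After a regional failure: promote the secondary Aurora cluster (Console or CLI).
# Global Database clusters are promoted with failover-global-cluster, not
# promote-read-replica-db-cluster. "my-global-db" and the account ID are placeholders.
aws rds failover-global-cluster \
  --region us-west-2 \
  --global-cluster-identifier my-global-db \
  --target-db-cluster-identifier arn:aws:rds:us-west-2:123456789012:cluster:my-aurora-secondary-us-west-2 \
  --allow-data-loss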
This single choice takes your database RPO from “minutes” to typically “< 1 second”, provided replication lag is monitored and stays low.
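The sub-second figure is a typical value, not a guarantee, so it is worth alarming on the `AuroraGlobalDBReplicationLag` CloudWatch metric (reported in milliseconds). A minimal sketch of the check itself, assuming the samples have already been fetched; the function name and the 50% safety factor are illustrative choices:

```python
def rpo_at_risk(lag_samples_ms, rpo_seconds=300, safety_factor=0.5):
    """Return True if the worst observed replication lag exceeds a fraction of the RPO."""
    if not lag_samples_ms:
        return True  # no data points: treat missing telemetry as a risk
    worst_lag_s = max(lag_samples_ms) / 1000
    return worst_lag_s > rpo_seconds * safety_factor


healthy = rpo_at_risk([120, 450, 800])        # sub-second lag
degraded = rpo_at_risk([900, 200_000])        # one sample at 200 seconds
no_data = rpo_at_risk([])
print(healthy, degraded, no_data)
```

Alerting at half the 5-minute RPO leaves time to react before the contract threshold is actually crossed.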
Step 3: S3 Cross-Region Replication — User Content
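Enable Replication Time Control (RTC) on your source S3 bucket. RTC is designed to replicate 99.99% of objects within 15 minutes (most objects replicate within seconds) and provides replication metrics in CloudWatch. Note that the 15-minute bound is looser than the 5-minute RPO, so either agree a separate objective for user content or alarm on replication latency. RTC requires the current rule schema (priority, filter, delete-marker setting) and an IAM role; the role ARN and rule ID below are placeholders.
{
  "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
  "Rules": [{
    "ID": "replicate-to-dr",
    "Status": "Enabled",
    "Priority": 1,
    "Filter": {},
    "DeleteMarkerReplication": { "Status": "Disabled" },
    "Destination": {
      "Bucket": "arn:aws:s3:::my-app-content-us-west-2",
      "ReplicationTime": {
        "Status": "Enabled",
        "Time": { "Minutes": 15 }
      },
      "Metrics": {
        "Status": "Enabled",
        "EventThreshold": { "Minutes": 15 }
      }
    }
  }]
}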
Step 4: Auto Scaling — Warm Standby in DR
Keep the DR region at minimum capacity (e.g., 2 instances) so it can scale up during failover without waiting for full cold-start provisioning.
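resource "aws_autoscaling_group" "dr_app" {
  name             = "dr-app-asg" # Matches the name used in the failover runbook
  min_size         = 2            # Always warm
  max_size         = 30
  desired_capacity = 2
  vpc_zone_identifier = [
    aws_subnet.dr_private_az1.id,
    aws_subnet.dr_private_az2.id
  ]
  # Assumed: a launch template defined elsewhere that references the replicated AMI
  launch_template {
    id      = aws_launch_template.dr_app.id
    version = "$Latest"
  }
  tag {
    key                 = "Environment"
    value               = "dr-standby"
    propagate_at_launch = true
  }
}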
Step 5: AMI Replication — Keep DR Current
Use EventBridge + Lambda to copy your golden AMI to the DR region after every deployment, so the DR ASG can launch the right application version.
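import boto3

def replicate_ami(event, context):
    """Copy the AMI named in a CloudTrail CreateImage event to the DR region."""
    # copy_image is called in the destination region and pulls from the source region
    ec2_dr = boto3.client('ec2', region_name='us-west-2')
    source_ami_id = event['detail']['responseElements']['imageId']
    response = ec2_dr.copy_image(
        Name=f"dr-copy-{source_ami_id}",
        SourceImageId=source_ami_id,
        SourceRegion='us-east-1',
        Description="Auto-replicated for DR"
    )
    # The copy gets a new AMI ID in us-west-2; the DR launch template must reference it
    return {"dr_ami_id": response["ImageId"]}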
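The handler above indexes straight into the event, which raises an exception when the `CreateImage` call failed (CloudTrail then records an `errorCode` and a null `responseElements`). A defensive sketch of the extraction step, using hand-written sample events; `extract_ami_id` is an illustrative helper name:

```python
def extract_ami_id(event):
    """Return the new AMI ID from a CloudTrail CreateImage event, or None if unusable."""
    detail = event.get("detail") or {}
    if detail.get("eventName") != "CreateImage":
        return None  # not the event this function is meant to handle
    if detail.get("errorCode"):
        return None  # the CreateImage call itself failed
    return (detail.get("responseElements") or {}).get("imageId")


# Hand-written sample events for illustration, trimmed to the fields used here
ok_event = {"detail": {"eventName": "CreateImage",
                       "responseElements": {"imageId": "ami-0123456789abcdef0"}}}
failed_event = {"detail": {"eventName": "CreateImage",
                           "errorCode": "Client.UnauthorizedOperation",
                           "responseElements": None}}

print(extract_ami_id(ok_event))
print(extract_ami_id(failed_event))
```

Returning `None` lets the Lambda exit cleanly for events it cannot act on instead of failing and retrying.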
Step 6: Failover Runbook with Systems Manager
Automate the failover sequence using an SSM Automation runbook so any on-call engineer can trigger a reliable, tested process instead of improvising at 3 AM.
# SSM Automation runbook steps (simplified)
# "my-global-db" and the account ID are placeholders.
- name: PromoteAuroraCluster
  action: aws:executeAwsApi
  inputs:
    Service: rds
    Api: FailoverGlobalCluster   # Global Database promotion, not PromoteReadReplicaDBCluster
    GlobalClusterIdentifier: my-global-db
    TargetDbClusterIdentifier: arn:aws:rds:us-west-2:123456789012:cluster:my-aurora-secondary-us-west-2
    AllowDataLoss: true
- name: ScaleUpDRAsg
  action: aws:executeAwsApi
  inputs:
    Service: autoscaling
    Api: UpdateAutoScalingGroup
    AutoScalingGroupName: dr-app-asg
    DesiredCapacity: 10
- name: UpdateRoute53
  action: aws:executeAwsApi
  inputs:
    Service: route53
    Api: ChangeResourceRecordSets
    # Switch primary record to DR ALB
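Before a game day, the step ordering and stop-on-failure behaviour can be rehearsed locally. The sketch below is a stand-in written for this purpose, not the SSM Automation engine; the step names mirror the runbook and the actions are stubs that make no AWS calls:

```python
import time


def run_failover(steps):
    """Run (name, action) pairs in order and stop at the first failure."""
    started = time.monotonic()
    completed = []
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            return {"ok": False, "failed_step": name, "completed": completed,
                    "error": str(exc), "elapsed_s": time.monotonic() - started}
        completed.append(name)
    return {"ok": True, "failed_step": None, "completed": completed,
            "error": None, "elapsed_s": time.monotonic() - started}


def simulated_failure():
    raise RuntimeError("simulated API error")


# Stubbed actions: replace with real calls only in a controlled test
dry_run = run_failover([
    ("PromoteAuroraCluster", lambda: None),
    ("ScaleUpDRAsg", lambda: None),
    ("UpdateRoute53", lambda: None),
])
broken_run = run_failover([
    ("PromoteAuroraCluster", lambda: None),
    ("ScaleUpDRAsg", simulated_failure),
    ("UpdateRoute53", lambda: None),
])
print(dry_run["completed"], broken_run["failed_step"])
```

Recording `elapsed_s` per rehearsal gives a measured number to compare against the 15-minute RTO rather than an estimate.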
RTO / RPO Summary
| Component | RTO Contribution | RPO Contribution |
|---|---|---|
| Route 53 + Global Accelerator failover | ~1 min | — |
| Aurora Global DB promotion | ~1 min | < 1 sec |
| ASG scale-up in DR | ~5 min | — |
| Application warmup | ~2 min | — |
| Total | ~9 min (target ≤ 15 min) ✅ | < 1 sec for the database (target ≤ 5 min) ✅ |
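The RTO total in the table is a straight sum, which assumes the steps run one after another. A small sketch of that calculation using the table's estimates (the variable names are illustrative):

```python
RTO_TARGET_MIN = 15

# Per-component estimates taken from the table above, in minutes
rto_minutes = {
    "Route 53 + Global Accelerator failover": 1,
    "Aurora Global DB promotion": 1,
    "ASG scale-up in DR": 5,
    "Application warmup": 2,
}

total_rto = sum(rto_minutes.values())  # assumes steps run sequentially
headroom = RTO_TARGET_MIN - total_rto
slowest = max(rto_minutes, key=rto_minutes.get)

print(f"Total RTO: ~{total_rto} min, headroom: {headroom} min")
print(f"Largest contributor: {slowest}")
```

The six minutes of headroom is what absorbs detection delays and human decision time, and the ASG scale-up is the first place to look if measured recovery runs long.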
Interview Angle
Interviewers want to hear three things: why you chose each component (not just listing services), the trade-offs (Aurora Global DB costs more than a Read Replica), and your testing strategy. A candidate who says “I’d do game day testing and automate the runbook” stands out.
- How to design an active-passive multi-region architecture
- Why Aurora Global Database beats cross-region RDS Read Replicas for RPO
- How Route 53 health checks drive automatic failover
- How to calculate actual RTO/RPO per component
- What Global Accelerator adds over plain Route 53