Design a Multi-Region AWS Architecture for 99.99% Uptime

The Problem

Your e-commerce application runs entirely in us-east-1. A single region failure would take the site down for hours — unacceptable under the new SLA. The contract requires:

RTO: ≤ 15 minutes (how fast you recover)
RPO: ≤ 5 minutes (how much data you can lose)
Uptime: 99.99% (~52 minutes downtime per year)
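
The uptime figure converts to a yearly downtime budget with simple arithmetic. A minimal sketch, assuming a 365-day year; the function name is illustrative and not part of any AWS tooling:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, assuming a non-leap year


def downtime_budget_minutes(uptime_percent: float) -> float:
    """Return the yearly downtime (in minutes) allowed by an uptime target."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


if __name__ == "__main__":
    for target in (99.9, 99.99):
        budget = downtime_budget_minutes(target)
        print(f"{target}% uptime allows {budget:.2f} minutes of downtime per year")
```

A single failover that uses the full 15-minute RTO consumes more than a quarter of that budget, which is why the rest of the design focuses on automated cutover.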

Current Risk

A single-region deployment has no protection against an AWS regional outage. In November 2020, us-east-1 experienced a multi-hour outage affecting dozens of services. If your app ran there alone, you had zero options.

Architecture Overview

Primary Region (us-east-1)          DR Region (us-west-2)
───────────────────────────         ───────────────────────────
  Route53 (Health Check) ──────────►  Route53 Failover
  ALB (Multi-AZ)                      ALB (Warm Standby)
  ASG (3 AZs)                         ASG (Min Capacity)
  Aurora Global DB        ──────────►  Aurora Secondary
  ElastiCache Cluster                  ElastiCache (Replica)
  S3 (Source of Truth)    ──────────►  S3 (Cross-Region Rep.)
───────────────────────────         ───────────────────────────

Step 1: Route 53 — Traffic Control & Failover

Configure Route 53 with a Failover routing policy. Health checks run every 10 seconds against your ALB endpoint; after 3 consecutive failures (30 seconds), Route 53 flips DNS to the DR region.

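# Create health check for primary ALB
# Note: the domain name checked here should resolve only to the primary ALB.
# If it follows the failover record, the check starts probing DR after cutover.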
aws route53 create-health-check \
  --caller-reference "primary-alb-$(date +%s)" \
  --health-check-config '{
    "Type": "HTTPS",
    "FullyQualifiedDomainName": "api.yourstore.com",
    "ResourcePath": "/health",
    "RequestInterval": 10,
    "FailureThreshold": 3
  }'
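
The health check only takes effect once it is attached to a PRIMARY failover record, with a SECONDARY record pointing at the DR ALB. The sketch below builds that change batch as plain data. All hostnames, hosted zone IDs, and the health check ID are placeholders, and the helper name is hypothetical:

```python
import json


def failover_change_batch(record_name, primary, secondary, health_check_id):
    """Build a Route 53 ChangeBatch with PRIMARY and SECONDARY alias records.

    `primary` and `secondary` are dicts holding each ALB's DNS name and the
    ALB's canonical hosted zone ID (placeholders in this sketch).
    """
    def record(role, target, check_id=None):
        rrset = {
            "Name": record_name,
            "Type": "A",
            "SetIdentifier": f"{role.lower()}-alb",
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "AliasTarget": {
                "HostedZoneId": target["zone_id"],
                "DNSName": target["dns_name"],
                "EvaluateTargetHealth": True,
            },
        }
        if check_id:
            rrset["HealthCheckId"] = check_id
        return {"Action": "UPSERT", "ResourceRecordSet": rrset}

    return {
        "Comment": "Failover routing: us-east-1 primary, us-west-2 secondary",
        "Changes": [
            record("PRIMARY", primary, health_check_id),
            record("SECONDARY", secondary),
        ],
    }


if __name__ == "__main__":
    batch = failover_change_batch(
        "api.yourstore.com.",
        {"zone_id": "ZPRIMARYALBZONE", "dns_name": "primary-alb.example.com."},
        {"zone_id": "ZDRALBZONE", "dns_name": "dr-alb.example.com."},
        "00000000-0000-0000-0000-000000000000",
    )
    print(json.dumps(batch, indent=2))
```

The printed JSON can be saved to a file and passed to `aws route53 change-resource-record-sets` with `--change-batch file://...` once the placeholders are replaced with real values.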

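Put Global Accelerator in front of the regional ALBs and point your DNS record at the accelerator (it sits behind Route 53, not in front of it). It provides static anycast IPs that route to the nearest healthy endpoint and shifts traffic away from an unhealthy region, typically in well under a minute, without the DNS TTL wait that Route 53 failover alone requires.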
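
To see why the TTL wait matters, estimate the worst-case DNS-only failover time. This is a rough sketch: it assumes resolvers honor the record TTL, and the 60-second TTL in the example is an assumption, not a value taken from this design.

```python
def dns_failover_seconds(request_interval, failure_threshold, record_ttl):
    """Worst-case seconds before clients follow a Route 53 failover.

    Detection takes interval * threshold; cached DNS answers can then
    remain in use for up to one more TTL.
    """
    detection = request_interval * failure_threshold
    return detection + record_ttl


if __name__ == "__main__":
    # 10-second checks, 3 failures, assumed 60-second TTL
    print(dns_failover_seconds(10, 3, 60), "seconds worst case via DNS alone")
```

Clients or resolvers that cache beyond the TTL will take longer, which is the gap the anycast IPs are meant to close.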

Step 2: Aurora Global Database — Sub-Second RPO

Standard cross-region RDS Read Replicas have replication lag measured in seconds to minutes. Aurora Global Database replicates at the storage layer, typically with under 1 second of lag, and a secondary cluster can usually be promoted in under 1 minute.

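# During a regional outage: fail over the global database to the DR cluster.
# The global cluster name and account ID below are placeholders.
# --allow-data-loss marks this as an unplanned failover. The application must
# then connect to the promoted cluster's writer endpoint.
aws rds failover-global-cluster \
  --global-cluster-identifier my-global-aurora \
  --target-db-cluster-identifier arn:aws:rds:us-west-2:123456789012:cluster:my-aurora-secondary-us-west-2 \
  --allow-data-loss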

This single choice takes your RPO from “minutes” to “< 1 second”.
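
Sub-second lag is the typical case, not a guarantee, so it is worth checking observed lag against the RPO. The sketch below assumes you have already pulled samples of the `AuroraGlobalDBReplicationLag` CloudWatch metric (reported in milliseconds) into a list; the function names are illustrative.

```python
def rpo_breaches(lag_samples_ms, rpo_seconds=300):
    """Return the samples (in ms) whose replication lag exceeds the RPO."""
    limit_ms = rpo_seconds * 1000
    return [lag for lag in lag_samples_ms if lag > limit_ms]


def worst_lag_seconds(lag_samples_ms):
    """Return the worst observed lag in seconds (0.0 when there are no samples)."""
    return max(lag_samples_ms, default=0) / 1000


if __name__ == "__main__":
    samples = [120, 850, 400_000]  # made-up values for illustration
    print("worst lag (s):", worst_lag_seconds(samples))
    print("RPO breaches (ms):", rpo_breaches(samples))
```

A CloudWatch alarm on the same metric gives you the same signal continuously; the functions above are only useful for after-the-fact review of a drill.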

Step 3: S3 Cross-Region Replication — User Content

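Enable Replication Time Control (RTC) on your source S3 bucket. RTC is designed to replicate 99.99% of new objects within 15 minutes (most arrive within seconds) and publishes replication metrics to CloudWatch. Because 15 minutes is longer than the 5-minute RPO, monitor replication latency for any objects that matter to recovery. Versioning must be enabled on both buckets. The role ARN and rule ID below are placeholders.

{
  "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
  "Rules": [{
    "ID": "replicate-content-to-us-west-2",
    "Status": "Enabled",
    "Priority": 1,
    "Filter": {},
    "DeleteMarkerReplication": { "Status": "Disabled" },
    "Destination": {
      "Bucket": "arn:aws:s3:::my-app-content-us-west-2",
      "ReplicationTime": {
        "Status": "Enabled",
        "Time": { "Minutes": 15 }
      },
      "Metrics": {
        "Status": "Enabled",
        "EventThreshold": { "Minutes": 15 }
      }
    }
  }]
}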
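
A replication rule can be linted before it is applied. The checker below is a sketch that encodes the fields RTC rules are generally expected to carry, as understood from the S3 replication schema. Treat the list as an assumption and confirm it against the current API reference. The function name is hypothetical.

```python
def rtc_rule_problems(rule):
    """Return a list of problems that would likely stop a rule from using RTC."""
    problems = []
    dest = rule.get("Destination", {})
    if rule.get("Status") != "Enabled":
        problems.append("rule is not enabled")
    if dest.get("ReplicationTime", {}).get("Status") != "Enabled":
        problems.append("ReplicationTime is not enabled")
    metrics = dest.get("Metrics", {})
    if metrics.get("Status") != "Enabled":
        problems.append("Metrics must be enabled alongside RTC")
    if "EventThreshold" not in metrics:
        problems.append("Metrics.EventThreshold is missing")
    for field in ("Priority", "Filter", "DeleteMarkerReplication"):
        if field not in rule:
            problems.append(f"{field} is missing")
    return problems


if __name__ == "__main__":
    incomplete_rule = {
        "Status": "Enabled",
        "Destination": {
            "Bucket": "arn:aws:s3:::my-app-content-us-west-2",
            "ReplicationTime": {"Status": "Enabled", "Time": {"Minutes": 15}},
            "Metrics": {"Status": "Enabled"},
        },
    }
    for problem in rtc_rule_problems(incomplete_rule):
        print("-", problem)
```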

Step 4: Auto Scaling — Warm Standby in DR

Keep the DR region at minimum capacity (e.g., 2 instances) so it can scale up during failover without waiting for full cold-start provisioning.

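resource "aws_autoscaling_group" "dr_app" {
  name             = "dr-app-asg" # Name referenced by the failover runbook
  min_size         = 2            # Always warm
  max_size         = 30
  desired_capacity = 2

  vpc_zone_identifier = [
    aws_subnet.dr_private_az1.id,
    aws_subnet.dr_private_az2.id
  ]

  # Assumed launch template, defined elsewhere, that points at the replicated AMI
  launch_template {
    id      = aws_launch_template.dr_app.id
    version = "$Latest"
  }

  tag {
    key                 = "Environment"
    value               = "dr-standby"
    propagate_at_launch = true
  }
}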
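
During failover the DR group has to grow from its warm minimum to roughly the size of the primary fleet. One way to pick the target is to mirror the primary's in-service count, clamped to the DR group's bounds. This is a sketch under that assumption; the `headroom` multiplier and function name are illustrative.

```python
import math


def dr_target_capacity(primary_in_service, dr_min=2, dr_max=30, headroom=1.0):
    """Size the DR ASG to match the primary fleet, clamped to the ASG bounds."""
    wanted = math.ceil(primary_in_service * headroom)
    return max(dr_min, min(dr_max, wanted))


if __name__ == "__main__":
    # Primary was serving traffic with 10 instances before the outage
    print("DR desired capacity:", dr_target_capacity(10))
```

Sizing from the last known primary count avoids hard-coding a number that goes stale as traffic grows.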

Step 5: AMI Replication — Keep DR Current

Use EventBridge + Lambda to copy your golden AMI to the DR region after every deployment, so the DR ASG can launch the right application version.

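import boto3

def replicate_ami(event, context):
    """Lambda handler: copy a newly created AMI from us-east-1 to us-west-2."""
    ec2_dr = boto3.client('ec2', region_name='us-west-2')
    # CloudTrail CreateImage events carry the new AMI ID in responseElements
    source_ami_id = event['detail']['responseElements']['imageId']

    # copy_image runs in the destination region and pulls from the source region
    response = ec2_dr.copy_image(
        Name=f"dr-copy-{source_ami_id}",
        SourceImageId=source_ami_id,
        SourceRegion='us-east-1',
        Description="Auto-replicated for DR"
    )
    # The copy gets a new AMI ID; the DR launch template must be updated to use it
    return {"dr_image_id": response["ImageId"]}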
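
The handler above expects a CloudTrail `CreateImage` event delivered through EventBridge. The sketch below shows an event pattern that would select those events, plus a deliberately simplified matcher for testing the pattern locally. The matcher handles only exact values and nested objects, not the full EventBridge pattern syntax, and the sample event is trimmed to the fields used here.

```python
CREATE_IMAGE_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["ec2.amazonaws.com"],
        "eventName": ["CreateImage"],
    },
}


def matches(pattern, event):
    """Simplified EventBridge matching: exact values and nested objects only."""
    for key, expected in pattern.items():
        actual = event.get(key)
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


if __name__ == "__main__":
    sample_event = {
        "source": "aws.ec2",
        "detail-type": "AWS API Call via CloudTrail",
        "detail": {
            "eventSource": "ec2.amazonaws.com",
            "eventName": "CreateImage",
            "responseElements": {"imageId": "ami-0123456789abcdef0"},
        },
    }
    print("pattern matches:", matches(CREATE_IMAGE_PATTERN, sample_event))
```

API-call events of this kind reach EventBridge only when CloudTrail is recording management events in the account, so confirm that a trail is in place before relying on this trigger.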

Step 6: Failover Runbook with Systems Manager

Automate the failover sequence using an SSM Automation runbook so any on-call engineer can trigger a reliable, tested process instead of improvising at 3 AM.

# SSM Automation runbook steps (simplified)
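- name: PromoteAuroraCluster
  action: aws:executeAwsApi
  inputs:
    Service: rds
    Api: FailoverGlobalCluster
    GlobalClusterIdentifier: my-global-aurora  # placeholder name
    TargetDbClusterIdentifier: arn:aws:rds:us-west-2:123456789012:cluster:my-aurora-secondary-us-west-2  # placeholder ARN
    AllowDataLoss: true  # unplanned failover during a regional outage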

- name: ScaleUpDRAsg
  action: aws:executeAwsApi
  inputs:
    Service: autoscaling
    Api: UpdateAutoScalingGroup
    AutoScalingGroupName: dr-app-asg
    DesiredCapacity: 10

- name: UpdateRoute53
  action: aws:executeAwsApi
  inputs:
    Service: route53
    Api: ChangeResourceRecordSets
    # Switch primary record to DR ALB
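
The property that matters in the runbook is ordering: each step runs only if the previous one succeeded, and every step is timed for the incident log. The sketch below shows that control flow in isolation. It is not an SSM API; the step callables are stand-ins you would replace with real calls.

```python
import time


def run_failover(steps, clock=time.monotonic):
    """Run (name, callable) steps in order and stop at the first failure.

    Returns a list of (name, status, seconds) tuples for the incident log.
    """
    results = []
    for name, action in steps:
        started = clock()
        try:
            action()
            status = "ok"
        except Exception as exc:  # record the failure and halt the sequence
            status = f"failed: {exc}"
        results.append((name, status, clock() - started))
        if status != "ok":
            break
    return results


if __name__ == "__main__":
    # Stand-in steps; real ones would call RDS, Auto Scaling, and Route 53
    demo_steps = [
        ("PromoteAuroraCluster", lambda: None),
        ("ScaleUpDRAsg", lambda: None),
        ("UpdateRoute53", lambda: None),
    ]
    for name, status, seconds in run_failover(demo_steps):
        print(f"{name}: {status} ({seconds:.3f}s)")
```

Halting on the first failure keeps traffic from being pointed at a region whose database was never promoted.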

RTO / RPO Summary

Component	RTO Contribution	RPO Contribution
Route 53 + Global Accelerator failover	~1 min	—
Aurora Global DB promotion	~1 min	< 1 sec
ASG scale-up in DR	~5 min	—
Application warmup	~2 min	—
Total	~9 min ✅ (target ≤ 15 min)	< 1 sec for database writes ✅ (target ≤ 5 min)
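
The total is a straight sum of the rows, which assumes the steps run one after another and leaves out the time it takes to detect the outage. A quick check of that arithmetic against the 15-minute target:

```python
RTO_TARGET_MIN = 15

# Estimates copied from the summary table, in minutes
rto_contributions_min = {
    "Route 53 + Global Accelerator failover": 1,
    "Aurora Global DB promotion": 1,
    "ASG scale-up in DR": 5,
    "Application warmup": 2,
}


def total_rto(contributions):
    """Sum the per-component estimates, assuming the steps run sequentially."""
    return sum(contributions.values())


def rto_margin(contributions, target=RTO_TARGET_MIN):
    """Return the minutes of slack left before the RTO target is missed."""
    return target - total_rto(contributions)


if __name__ == "__main__":
    print("estimated RTO (min):", total_rto(rto_contributions_min))
    print("margin vs target (min):", rto_margin(rto_contributions_min))
```

The remaining margin has to absorb detection time and any human decision time, so it is smaller in practice than it looks on paper.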

Game Day Testing

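Run a quarterly failover drill using AWS Fault Injection Service (FIS, formerly Fault Injection Simulator). An “availability zone impaired” fault only exercises Multi-AZ resilience inside one region, so to time your actual RTO against the target, simulate loss of the primary region instead, for example with the FIS cross-Region connectivity scenario or by forcing the primary health check to fail. A DR plan untested is a DR plan that fails when you need it most.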
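
To time a drill, probe the public endpoint on a fixed interval and compute the outage from the recorded results. The sketch below assumes you already have time-ordered `(timestamp, healthy)` samples from such a probe; the function name and sample data are made up for illustration.

```python
def measured_outage_seconds(samples):
    """Return the longest continuous unhealthy stretch, in seconds.

    `samples` is a time-ordered list of (unix_timestamp, healthy) pairs.
    An outage runs from the first failed probe to the next healthy one;
    an outage still in progress at the end of the data is not counted.
    """
    longest = 0.0
    outage_start = None
    for ts, healthy in samples:
        if not healthy and outage_start is None:
            outage_start = ts
        elif healthy and outage_start is not None:
            longest = max(longest, ts - outage_start)
            outage_start = None
    return longest


if __name__ == "__main__":
    probes = [(0, True), (10, False), (20, False), (400, True), (410, True)]
    outage = measured_outage_seconds(probes)
    print(f"measured RTO: {outage / 60:.1f} min (target: 15 min)")
```

Probe from outside AWS if you can, so the measurement reflects what customers see and not what the control plane reports.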

Interview Angle

Interviewers want to hear three things: why you chose each component (not just listing services), the trade-offs (Aurora Global DB costs more than a Read Replica), and your testing strategy. A candidate who says “I’d do game day testing and automate the runbook” stands out.
