
Design a Backup Strategy for a 10TB RDS PostgreSQL With 4-Hour RPO and 2-Hour RTO

Design a multi-layer backup strategy for a critical 10TB RDS PostgreSQL database including automated backups, cross-region snapshots, point-in-time recovery, and pre-warmed read replicas.

January 20, 2025 · 5 min read · ~30 min to complete

The Situation

The CTO asks: 'If our production database is destroyed right now — hardware failure, ransomware, accidental deletion — how long before we're back online, and how much data do we lose?' Your current answer is: 'I'm not sure.' That needs to change. Design a backup strategy that guarantees RPO ≤ 4 hours and RTO ≤ 2 hours for a 10TB PostgreSQL RDS instance.

5 Steps
7 Services Used
~30 min Duration
Advanced Difficulty

The Problem

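A backup strategy that you haven’t tested is not a backup strategy; it’s a hope. Before designing the solution, be honest about what “RTO ≤ 2 hours” means for 10TB: a cold restore takes 45-90 minutes just to produce an available instance, and storage restored from a snapshot is loaded lazily from S3, so queries stay slow until the blocks they touch have been fetched. That leaves almost no margin inside a 2-hour window. To hit the RTO reliably, you need a pre-warmed read replica strategy.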
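
To put a number on that, the arithmetic can be sketched in Python. This is a back-of-the-envelope calculation only: it assumes decimal units and that the whole 10TB volume has to be read within the window, and the function name is illustrative rather than part of any AWS tooling.

```python
def required_throughput_mb_s(size_tb: float, window_hours: float) -> float:
    """Sustained throughput (MB/s, decimal units) needed to move size_tb in window_hours."""
    size_mb = size_tb * 1_000_000      # 1 TB = 1,000,000 MB (decimal)
    window_s = window_hours * 3600     # hours -> seconds
    return size_mb / window_s


if __name__ == "__main__":
    rate = required_throughput_mb_s(10, 2)
    print(f"{rate:.0f} MB/s sustained for the full window")  # 1389 MB/s
```

Any time spent on detection, validation, or DNS cutover shrinks the window and raises the required rate further.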

Backup Architecture (Three Layers)

Layer 1: RDS Automated Backups (continuous)
  ├── Daily full backup + transaction logs
  ├── RPO: 5 minutes (point-in-time recovery)
  ├── RTO: 45-90 minutes (restore creates new instance)
  └── Retention: 35 days, stored in managed S3

Layer 2: Manual Snapshots (pre-deployment)
  ├── Triggered by CodePipeline before every prod deploy
  ├── Cross-region copy automated via Lambda
  ├── Retention: 90 days
  └── Encrypted with customer-managed KMS key

Layer 3: pg_dump Logical Exports (weekly)
  ├── Full schema + data export
  ├── Stored in S3 with versioning enabled
  ├── S3 Lifecycle: → Glacier after 30 days
  └── Enables partial restore (single table, single schema)

Step 1: Configure RDS Automated Backups

# Enable automated backups with maximum retention.
# --backup-retention-period 35 is the maximum allowed;
# the backup window (UTC) is set to a low-traffic period.
aws rds modify-db-instance \
  --db-instance-identifier prod-postgres-10tb \
  --backup-retention-period 35 \
  --preferred-backup-window "03:00-04:00" \
  --apply-immediately

# Verify PITR is available
aws rds describe-db-instances \
  --db-instance-identifier prod-postgres-10tb \
  --query 'DBInstances[0].{
    LatestRestore:LatestRestorableTime,
    BackupRetention:BackupRetentionPeriod,
    BackupWindow:PreferredBackupWindow
  }'
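
The same check can be scripted so that it runs on a schedule. The sketch below assumes you already have a `describe_db_instances` response (for example from boto3, which returns `LatestRestorableTime` as a timezone-aware datetime); the function names and the 240-minute default are illustrative choices based on the 4-hour RPO in this scenario.

```python
from datetime import datetime, timezone


def pitr_lag_minutes(describe_response: dict, now: datetime) -> float:
    """Minutes between `now` and the instance's LatestRestorableTime."""
    instance = describe_response["DBInstances"][0]
    latest = instance["LatestRestorableTime"]
    return (now - latest).total_seconds() / 60


def within_rpo(describe_response: dict, now: datetime, rpo_minutes: float = 240) -> bool:
    """True if a restore started right now would lose at most rpo_minutes of data."""
    return pitr_lag_minutes(describe_response, now) <= rpo_minutes


if __name__ == "__main__":
    # Hand-built sample shaped like the API response; values are made up.
    sample = {"DBInstances": [{
        "LatestRestorableTime": datetime(2024, 1, 15, 14, 12, tzinfo=timezone.utc),
    }]}
    now = datetime(2024, 1, 15, 14, 17, tzinfo=timezone.utc)
    print(pitr_lag_minutes(sample, now), within_rpo(sample, now))  # 5.0 True
```

In a real check you would pass `datetime.now(timezone.utc)` as `now` and alert when `within_rpo` returns False.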

Point-in-time restore — achieves your 4-hour RPO (actually gives you 5-minute RPO):

# Restore to 3 hours ago (within RPO)
aws rds restore-db-instance-to-point-in-time \
  --source-db-instance-identifier prod-postgres-10tb \
  --target-db-instance-identifier prod-postgres-recovery \
  --restore-time "2024-01-15T11:17:00Z" \
  --db-instance-class db.r6g.4xlarge \
  --multi-az

PITR RTO Reality Check

Restoring 10TB with PITR to a new instance takes 45-90 minutes, which technically fits a 2-hour RTO but leaves little room for detection, validation, and cutover. To hit the RTO with margin, you need the pre-warmed read replica approach in Step 3.

Step 2: Automate Cross-Region Snapshot Copies

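from datetime import datetime, timedelta, timezone

import boto3

SOURCE_REGION = 'us-east-1'
DR_REGION = 'us-west-2'
DR_PREFIX = 'dr-'
RETENTION_DAYS = 90


def copy_snapshot_to_dr_region(event, context):
    """Triggered by EventBridge when RDS creates an automated snapshot."""

    # Parse the snapshot ARN from the EventBridge event.
    # Automated snapshot names start with "rds:", so taking the last
    # colon-separated segment also strips that prefix (colons are not
    # allowed in the target identifier).
    snapshot_arn = event['detail']['SourceArn']
    snapshot_id = snapshot_arn.split(':')[-1]

    # Copy to DR region (the client is created in the destination region)
    rds_dr = boto3.client('rds', region_name=DR_REGION)

    response = rds_dr.copy_db_snapshot(
        SourceDBSnapshotIdentifier=snapshot_arn,
        TargetDBSnapshotIdentifier=f"{DR_PREFIX}{snapshot_id}",
        SourceRegion=SOURCE_REGION,
        # The KMS key must exist in the DR region
        KmsKeyId='arn:aws:kms:us-west-2:123456789:key/abc-def-123',
        CopyTags=True
    )

    print(f"DR snapshot creation started: {response['DBSnapshot']['DBSnapshotIdentifier']}")

    # Cleanup copies older than 90 days in DR region. Paginate, and only
    # delete snapshots this function created (the "dr-" prefix).
    cutoff = datetime.now(timezone.utc) - timedelta(days=RETENTION_DAYS)
    paginator = rds_dr.get_paginator('describe_db_snapshots')
    pages = paginator.paginate(
        DBInstanceIdentifier='prod-postgres-10tb',
        SnapshotType='manual'
    )

    for page in pages:
        for snap in page['DBSnapshots']:
            # Check status first: snapshots still being created may have no create time yet
            if (snap['Status'] == 'available'
                    and snap['DBSnapshotIdentifier'].startswith(DR_PREFIX)
                    and snap['SnapshotCreateTime'] < cutoff):
                rds_dr.delete_db_snapshot(DBSnapshotIdentifier=snap['DBSnapshotIdentifier'])

# EventBridge rule: trigger Lambda on every automated snapshot completion
# (the rule invokes nothing until the function is attached with `aws events put-targets`)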
aws events put-rule \
  --name "RDSSnapshotCopyToDR" \
  --event-pattern '{
    "source": ["aws.rds"],
    "detail-type": ["RDS DB Snapshot Event"],
    "detail": {
      "EventID": ["RDS-EVENT-0091"],
      "SourceType": ["SNAPSHOT"]
    }
  }' \
  --state ENABLED
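
To make the rule and the handler easier to reason about, here is a small sketch that mirrors the event pattern locally and derives the DR snapshot name from the ARN. The sample event is abbreviated and illustrative (the account ID and snapshot name are made up), and both function names are hypothetical helpers, not part of any AWS SDK.

```python
def is_automated_snapshot_created(event: dict) -> bool:
    """Mirror the EventBridge pattern: source, detail-type, EventID, SourceType."""
    detail = event.get("detail", {})
    return (
        event.get("source") == "aws.rds"
        and event.get("detail-type") == "RDS DB Snapshot Event"
        and detail.get("EventID") == "RDS-EVENT-0091"
        and detail.get("SourceType") == "SNAPSHOT"
    )


def dr_snapshot_id(source_arn: str, prefix: str = "dr-") -> str:
    """Keep only the last ARN segment, which drops the 'rds:' prefix of automated snapshots."""
    return prefix + source_arn.split(":")[-1]


# Abbreviated, illustrative event; real events carry more fields.
SAMPLE_EVENT = {
    "source": "aws.rds",
    "detail-type": "RDS DB Snapshot Event",
    "detail": {
        "EventID": "RDS-EVENT-0091",
        "SourceType": "SNAPSHOT",
        "SourceArn": "arn:aws:rds:us-east-1:123456789012:snapshot:"
                     "rds:prod-postgres-10tb-2024-01-15-03-07",
    },
}

if __name__ == "__main__":
    if is_automated_snapshot_created(SAMPLE_EVENT):
        print(dr_snapshot_id(SAMPLE_EVENT["detail"]["SourceArn"]))
```

Running the handler logic against a hand-built event like this is a cheap way to catch naming problems before the first real snapshot fires the rule.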

Step 3: Pre-Warmed Read Replica — The Key to 2-Hour RTO

A cold restore from snapshot takes 45-90 minutes. A pre-warmed cross-region read replica can be promoted in 2-5 minutes:

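# Create cross-region read replica in DR region.
# For a cross-region replica the source must be given as an ARN
# (the account ID below is a placeholder).
# If the source is encrypted, also pass --kms-key-id with a key in us-west-2.
aws rds create-db-instance-read-replica \
  --db-instance-identifier prod-postgres-dr-replica \
  --source-db-instance-identifier arn:aws:rds:us-east-1:123456789012:db:prod-postgres-10tb \
  --source-region us-east-1 \
  --db-instance-class db.r6g.4xlarge \
  --region us-west-2 \
  --multi-az \
  --no-publicly-accessible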

During a disaster — promote the replica:

# Promote to standalone primary (takes 2-5 minutes)
aws rds promote-read-replica \
  --db-instance-identifier prod-postgres-dr-replica \
  --region us-west-2

# Wait for promotion to complete
aws rds wait db-instance-available \
  --db-instance-identifier prod-postgres-dr-replica \
  --region us-west-2

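# Update Route 53 CNAME to point to DR database
# (the record value below is a placeholder; use the replica's actual endpoint from describe-db-instances)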
aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890 \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "db.prod.internal",
        "Type": "CNAME",
        "TTL": 30,
        "ResourceRecords": [{"Value": "prod-postgres-dr-replica.us-west-2.rds.amazonaws.com"}]
      }
    }]
  }'
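
If the failover is scripted rather than typed by hand, the change batch can be built programmatically. This is a minimal sketch: `build_cname_upsert` is a hypothetical helper, and the endpoint in the example is a placeholder rather than a real RDS hostname.

```python
import json


def build_cname_upsert(record_name: str, target_endpoint: str, ttl: int = 30) -> dict:
    """Build a Route 53 change batch that points record_name at target_endpoint."""
    return {
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "TTL": ttl,
                "ResourceRecords": [{"Value": target_endpoint}],
            },
        }]
    }


if __name__ == "__main__":
    batch = build_cname_upsert(
        "db.prod.internal",
        "prod-postgres-dr-replica.example.us-west-2.rds.amazonaws.com",  # placeholder
    )
    print(json.dumps(batch, indent=2))
```

The resulting dict can be passed as `ChangeBatch` to boto3's `change_resource_record_sets`, or written to a file and supplied to the CLI command above.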

Total RTO with pre-warmed replica:

  • Detect disaster: ~1 min
  • Promote replica: ~3 min
  • Update DNS + application reconnect: ~3 min
  • Total: ~7-10 minutes ✅ (well within 2-hour RTO)
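
The budget above is simple enough to keep as code next to the runbook, so that drill results can be compared against it. The step names and durations below are the estimates from the list; the helper itself is an illustrative sketch.

```python
RTO_TARGET_MIN = 120  # 2-hour RTO from the scenario

# Estimated minutes per failover step, taken from the list above.
FAILOVER_STEPS = {
    "detect_disaster": 1,
    "promote_replica": 3,
    "dns_update_and_reconnect": 3,
}


def rto_budget(steps: dict, target_min: float) -> tuple:
    """Return (total minutes, remaining headroom in minutes)."""
    total = sum(steps.values())
    return total, target_min - total


if __name__ == "__main__":
    total, headroom = rto_budget(FAILOVER_STEPS, RTO_TARGET_MIN)
    print(f"Estimated RTO: {total} min, headroom: {headroom} min")
```

Replacing the estimates with timings measured during a drill turns this from a plan into evidence.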

Step 4: Weekly pg_dump Logical Backup

For partial restores (single table corruption), snapshots are overkill. pg_dump lets you restore just the affected table:

# Run from an EC2 instance in the same VPC.
# --format=directory allows parallel dump and restore;
# --jobs=8 runs 8 parallel workers for 10TB.
pg_dump \
  --host=prod-postgres-10tb.us-east-1.rds.amazonaws.com \
  --username=admin \
  --format=directory \
  --jobs=8 \
  --compress=1 \
  --file=/tmp/pgdump-$(date +%Y%m%d) \
  mydb

# Upload to S3
aws s3 cp /tmp/pgdump-$(date +%Y%m%d) \
  s3://my-backup-bucket/postgres/logical/$(date +%Y%m%d)/ \
  --recursive \
  --sse aws:kms \
  --sse-kms-key-id arn:aws:kms:us-east-1:...:key/abc123

Partial restore — single table:

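# Restore ONLY the orders table.
# --dbname is required: with --jobs, pg_restore needs a direct database connection.
pg_restore \
  --host=prod-postgres-recovery.rds.amazonaws.com \
  --username=admin \
  --dbname=mydb \
  --table=orders \
  --jobs=8 \
  /tmp/pgdump-20240115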

Step 5: Test Your Backups Quarterly

An untested backup is not a backup. Run quarterly restore drills:

import time

import boto3


def quarterly_restore_drill(event, context):
    """Automated restore test.

    A 10TB restore outlasts Lambda's 15-minute maximum timeout, so run this
    from a long-lived runner, or split the wait into separate steps.
    """
    rds = boto3.client('rds')

    # 1. Get the most recent completed snapshot
    snapshots = rds.describe_db_snapshots(
        DBInstanceIdentifier='prod-postgres-10tb',
        SnapshotType='automated'
    )
    available = [s for s in snapshots['DBSnapshots'] if s['Status'] == 'available']
    latest = max(available, key=lambda s: s['SnapshotCreateTime'])

    start_time = time.time()

    # 2. Restore to a test instance
    rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier='restore-test-quarterly',
        DBSnapshotIdentifier=latest['DBSnapshotIdentifier'],
        DBInstanceClass='db.r6g.2xlarge'
    )

    # The default waiter stops polling after about 30 minutes; allow up to 3 hours
    rds.get_waiter('db_instance_available').wait(
        DBInstanceIdentifier='restore-test-quarterly',
        WaiterConfig={'Delay': 60, 'MaxAttempts': 180}
    )

    restore_time_minutes = (time.time() - start_time) / 60

    # 3. Run validation queries
    # ... (connect and verify row counts, checksums)

    # 4. Document actual RTO
    print(f"Actual restore time: {restore_time_minutes:.1f} minutes")

    # 5. Clean up test instance
    rds.delete_db_instance(
        DBInstanceIdentifier='restore-test-quarterly',
        SkipFinalSnapshot=True
    )

    # 6. Publish RTO metric to CloudWatch
    cloudwatch = boto3.client('cloudwatch')
    cloudwatch.put_metric_data(
        Namespace='BackupDrillMetrics',
        MetricData=[{
            'MetricName': 'RestoreTimeMins',
            'Value': restore_time_minutes,
            'Unit': 'None'  # CloudWatch has no "minutes" unit
        }]
    )
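
The validation step is left as a placeholder above. One possible shape for it, sketched under assumptions: row counts per table are collected separately (for example with `SELECT count(*)` on production and on the restored instance) and then compared with a tolerance, since the snapshot is older than the live data. The function name and the 1% default are illustrative.

```python
def compare_row_counts(expected: dict, actual: dict, tolerance: float = 0.01) -> list:
    """Return a list of human-readable mismatches; an empty list means the drill passed.

    expected/actual map table name -> row count. A table passes when its
    restored count is within `tolerance` (as a fraction) of the expected count.
    """
    problems = []
    for table, want in expected.items():
        got = actual.get(table)
        if got is None:
            problems.append(f"{table}: missing from restored instance")
            continue
        if abs(got - want) > want * tolerance:
            problems.append(f"{table}: expected ~{want}, got {got}")
    return problems


if __name__ == "__main__":
    prod = {"orders": 1000, "users": 200}
    restored = {"orders": 995, "users": 150}
    print(compare_row_counts(prod, restored))  # ['users: expected ~200, got 150']
```

Row counts only catch gross problems; checksums over a few critical tables give stronger evidence that the restored data is intact.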

RPO/RTO Summary

Backup Method                   | RPO                       | RTO                     | Use Case
Automated PITR + cold restore   | 5 min                     | 45-90 min               | Data corruption, accidental delete
Pre-warmed cross-region replica | 1-5 min (replication lag) | 7-10 min                | Full region failure
Manual pre-deploy snapshot      | Point of last deploy      | 60+ min                 | Bad deploy rollback
pg_dump logical backup          | Weekly                    | Varies (table-specific) | Single-table restore
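
The summary can also be expressed as data and checked against the targets. The sketch below uses the worst-case figure from each row; entries that the table gives as open-ended or situation-dependent are stored as `None` and treated as not meeting the target. The class and function names are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BackupMethod:
    name: str
    rpo_min: Optional[float]  # worst-case data loss in minutes; None = not a fixed figure
    rto_min: Optional[float]  # worst-case recovery time in minutes; None = not a fixed figure


METHODS = [
    BackupMethod("Automated PITR + cold restore", 5, 90),
    BackupMethod("Pre-warmed cross-region replica", 5, 10),
    BackupMethod("Manual pre-deploy snapshot", None, None),      # RPO = last deploy, RTO 60+ min
    BackupMethod("pg_dump logical backup", 7 * 24 * 60, None),   # weekly, RTO varies
]


def meets_targets(m: BackupMethod, rpo_target: float = 240, rto_target: float = 120) -> bool:
    """True only when both figures are known and within the targets."""
    if m.rpo_min is None or m.rto_min is None:
        return False
    return m.rpo_min <= rpo_target and m.rto_min <= rto_target


if __name__ == "__main__":
    for m in METHODS:
        print(f"{m.name}: {'meets' if meets_targets(m) else 'does not meet'} RPO 4h / RTO 2h")
```

Only the first two rows qualify on paper, and of those the replica is the one with real headroom on RTO.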

Interview Angle

Always mention Aurora Global Database as the alternative: it provides < 1 second RPO and < 1 minute RTO with cross-region replication built in, at a higher price point. The interview question often becomes “when would you choose Aurora Global Database over RDS with cross-region replica?” The answer: when you need sub-5-minute RTO and can justify the 30-40% premium.

Services Used

RDS, S3, Lambda, EventBridge, KMS, Route 53, Aurora Global Database

Prerequisites
  • Familiarity with RDS automated backups and snapshots
  • Understanding of RTO and RPO concepts
  • Basic understanding of cross-region AWS services
What You Learned
  • The three-layer backup strategy (automated backups, snapshots, pg_dump exports)
  • Why a pre-warmed cross-region read replica dramatically reduces RTO
  • How to automate cross-region snapshot copies with Lambda + EventBridge
  • How to test your backup strategy before an actual disaster
  • Why Aurora Global Database changes the RPO/RTO math entirely
