Design a Backup Strategy for a 10TB RDS PostgreSQL With 4-Hour RPO and 2-Hour RTO
Design a multi-layer backup strategy for a critical 10TB RDS PostgreSQL database including automated backups, cross-region snapshots, point-in-time recovery, and pre-warmed read replicas.
The CTO asks: 'If our production database is destroyed right now — hardware failure, ransomware, accidental deletion — how long before we're back online, and how much data do we lose?' Your current answer is: 'I'm not sure.' That needs to change. Design a backup strategy that guarantees RPO ≤ 4 hours and RTO ≤ 2 hours for a 10TB PostgreSQL RDS instance.
The Problem
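A backup strategy that you haven’t tested is not a backup strategy; it’s a hope. Before designing the solution, be honest about what “RTO ≤ 2 hours” means for 10TB: a cold snapshot restore leaves little margin. The new instance typically takes 45-90 minutes to become available, and RDS lazy-loads storage blocks from S3 afterward, so queries stay slow until the data they touch has been read. To hit the target reliably, you need a pre-warmed read replica strategy.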
Backup Architecture (Three Layers)
Layer 1: RDS Automated Backups (continuous)
├── Daily full backup + transaction logs
├── RPO: 5 minutes (point-in-time recovery)
├── RTO: 45-90 minutes (restore creates new instance)
└── Retention: 35 days, stored in managed S3
Layer 2: Manual Snapshots (pre-deployment)
├── Triggered by CodePipeline before every prod deploy
├── Cross-region copy automated via Lambda
├── Retention: 90 days
└── Encrypted with customer-managed KMS key
Layer 3: pg_dump Logical Exports (weekly)
├── Full schema + data export
├── Stored in S3 with versioning enabled
├── S3 Lifecycle: → Glacier after 30 days
└── Enables partial restore (single table, single schema)
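As a rough sanity check, the layers above can be compared against the stated targets (RPO ≤ 4 hours, RTO ≤ 2 hours). The sketch below is illustrative only: the `RecoveryPath` class and `meets_targets` function are hypothetical names, the figures are worst-case values taken from the layer descriptions, and a layer with no stated RTO is treated as not meeting the target.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RecoveryPath:
    name: str
    rpo_minutes: float            # worst-case data loss
    rto_minutes: Optional[float]  # worst-case restore time; None if not stated


# Worst-case figures from the layer list; the pg_dump RTO is not given there.
PATHS = [
    RecoveryPath("automated backups (PITR)", rpo_minutes=5, rto_minutes=90),
    RecoveryPath("weekly pg_dump", rpo_minutes=7 * 24 * 60, rto_minutes=None),
]


def meets_targets(path: RecoveryPath,
                  rpo_target_min: float = 240,
                  rto_target_min: float = 120) -> bool:
    """True only when both RPO and RTO are known and within the targets."""
    if path.rto_minutes is None:
        return False
    return (path.rpo_minutes <= rpo_target_min
            and path.rto_minutes <= rto_target_min)


for p in PATHS:
    print(p.name, "meets targets:", meets_targets(p))
```

The weekly export fails the check by design: it exists for partial restores, not for meeting the headline RPO.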
Step 1: Configure RDS Automated Backups
# Enable automated backups with maximum retention.
# 35 days is the maximum allowed retention period;
# 03:00-04:00 (UTC) is assumed to be a low-traffic window.
aws rds modify-db-instance \
  --db-instance-identifier prod-postgres-10tb \
  --backup-retention-period 35 \
  --preferred-backup-window "03:00-04:00" \
  --apply-immediately
# Verify PITR is available
aws rds describe-db-instances \
--db-instance-identifier prod-postgres-10tb \
--query 'DBInstances[0].{
LatestRestore:LatestRestorableTime,
BackupRetention:BackupRetentionPeriod,
BackupWindow:PreferredBackupWindow
}'
Point-in-time restore meets the 4-hour RPO with room to spare; in practice the latest restorable time is usually within about 5 minutes of the present:
# Restore to 3 hours ago (within RPO)
aws rds restore-db-instance-to-point-in-time \
--source-db-instance-identifier prod-postgres-10tb \
--target-db-instance-identifier prod-postgres-recovery \
--restore-time "2024-01-15T11:17:00Z" \
--db-instance-class db.r6g.4xlarge \
--multi-az
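The `--restore-time` value above is hard-coded. A minimal sketch of computing "N hours ago" in the UTC format the command uses might look like this; `restore_time_iso` is a hypothetical helper name, not part of any AWS tooling.

```python
from datetime import datetime, timedelta, timezone


def restore_time_iso(now: datetime, hours_back: float) -> str:
    """Return the UTC timestamp `hours_back` hours before `now`,
    formatted like 2024-01-15T11:17:00Z."""
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    target = now.astimezone(timezone.utc) - timedelta(hours=hours_back)
    return target.strftime("%Y-%m-%dT%H:%M:%SZ")


# Example: 3 hours before 14:17 UTC on 2024-01-15
print(restore_time_iso(datetime(2024, 1, 15, 14, 17, tzinfo=timezone.utc), 3))
# → 2024-01-15T11:17:00Z
```

The chosen time still has to fall inside the window RDS reports as restorable, so it is worth checking it against `LatestRestorableTime` before starting a restore.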
Step 2: Automate Cross-Region Snapshot Copies
import boto3
from datetime import datetime, timezone, timedelta

def copy_snapshot_to_dr_region(event, context):
    """Triggered by EventBridge when RDS creates an automated snapshot."""
    # Parse the snapshot ARN from the EventBridge event
    snapshot_arn = event['detail']['SourceArn']
    # Automated snapshot IDs look like "rds:<instance>-<timestamp>"; taking the
    # last ":"-separated field drops the "rds:" prefix, which is not valid in a
    # manual snapshot identifier.
    snapshot_id = snapshot_arn.split(':')[-1]

    # Copy to DR region (the client is created in the destination region)
    rds_dr = boto3.client('rds', region_name='us-west-2')
    response = rds_dr.copy_db_snapshot(
        SourceDBSnapshotIdentifier=snapshot_arn,
        TargetDBSnapshotIdentifier=f"dr-{snapshot_id}",
        SourceRegion='us-east-1',
        KmsKeyId='arn:aws:kms:us-west-2:123456789:key/abc-def-123',
        CopyTags=True
    )
    print(f"DR snapshot creation started: {response['DBSnapshot']['DBSnapshotIdentifier']}")

    # Cleanup snapshots older than 90 days in DR region.
    # Paginate so that long snapshot histories are fully covered.
    cutoff = datetime.now(timezone.utc) - timedelta(days=90)
    paginator = rds_dr.get_paginator('describe_db_snapshots')
    for page in paginator.paginate(
        DBInstanceIdentifier='prod-postgres-10tb',
        SnapshotType='manual'
    ):
        for snap in page['DBSnapshots']:
            # Snapshots still being created have no SnapshotCreateTime yet
            if snap['Status'] == 'available' and snap['SnapshotCreateTime'] < cutoff:
                rds_dr.delete_db_snapshot(DBSnapshotIdentifier=snap['DBSnapshotIdentifier'])
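# EventBridge rule: match every automated snapshot completion (RDS-EVENT-0091).
# The rule only matches events; attach the Lambda as a target separately with `aws events put-targets`.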
aws events put-rule \
--name "RDSSnapshotCopyToDR" \
--event-pattern '{
"source": ["aws.rds"],
"detail-type": ["RDS DB Snapshot Event"],
"detail": {
"EventID": ["RDS-EVENT-0091"],
"SourceType": ["SNAPSHOT"]
}
}' \
--state ENABLED
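The naming and retention logic in the copy function can be pulled out into pure helpers so it can be checked without touching AWS. This is a sketch under assumptions: `dr_snapshot_id` and `expired_snapshot_ids` are hypothetical names, and the snapshot dictionaries only mimic the fields the Lambda reads (`DBSnapshotIdentifier`, `Status`, `SnapshotCreateTime`).

```python
from datetime import datetime, timedelta, timezone


def dr_snapshot_id(snapshot_arn: str) -> str:
    """Build the DR-side identifier from a source snapshot ARN.

    The last ':'-separated field is used, which drops the 'rds:' prefix
    carried by automated snapshot identifiers.
    """
    return "dr-" + snapshot_arn.split(":")[-1]


def expired_snapshot_ids(snapshots, now, retention_days=90):
    """Return identifiers of available snapshots older than the retention."""
    cutoff = now - timedelta(days=retention_days)
    return [
        s["DBSnapshotIdentifier"]
        for s in snapshots
        if s.get("Status") == "available"
        and s.get("SnapshotCreateTime") is not None
        and s["SnapshotCreateTime"] < cutoff
    ]


# Illustrative data only; the account ID and identifiers are placeholders.
arn = ("arn:aws:rds:us-east-1:123456789012:snapshot:"
       "rds:prod-postgres-10tb-2024-01-15-03-00")
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
snaps = [
    {"DBSnapshotIdentifier": "dr-old", "Status": "available",
     "SnapshotCreateTime": now - timedelta(days=120)},
    {"DBSnapshotIdentifier": "dr-new", "Status": "available",
     "SnapshotCreateTime": now - timedelta(days=10)},
    {"DBSnapshotIdentifier": "dr-copying", "Status": "creating"},
]
print(dr_snapshot_id(arn))
print(expired_snapshot_ids(snaps, now))
```

Keeping the deletion decision in a pure function makes it easy to unit-test the retention rule before letting it delete anything.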
Step 3: Pre-Warmed Read Replica — The Key to 2-Hour RTO
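A cold restore from snapshot takes 45-90 minutes. A pre-warmed cross-region read replica already holds the data and can be promoted in 2-5 minutes. Keep in mind that a replica replays whatever the primary does, including a mistaken DELETE, so it covers instance or region loss; logical corruption still needs point-in-time recovery. To create the replica: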
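# Create cross-region read replica in DR region.
# For a cross-region replica the source must be given as an ARN;
# the account ID below is a placeholder.
# If the source is encrypted, also pass --kms-key-id with a key in us-west-2.
aws rds create-db-instance-read-replica \
  --db-instance-identifier prod-postgres-dr-replica \
  --source-db-instance-identifier arn:aws:rds:us-east-1:123456789012:db:prod-postgres-10tb \
  --source-region us-east-1 \
  --db-instance-class db.r6g.4xlarge \
  --region us-west-2 \
  --multi-az \
  --no-publicly-accessible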
During a disaster — promote the replica:
# Promote to standalone primary (takes 2-5 minutes)
aws rds promote-read-replica \
--db-instance-identifier prod-postgres-dr-replica \
--region us-west-2
# Wait for promotion to complete
aws rds wait db-instance-available \
--db-instance-identifier prod-postgres-dr-replica \
--region us-west-2
# Update Route 53 CNAME to point to DR database
aws route53 change-resource-record-sets \
--hosted-zone-id Z1234567890 \
--change-batch '{
"Changes": [{
"Action": "UPSERT",
"ResourceRecordSet": {
"Name": "db.prod.internal",
"Type": "CNAME",
"TTL": 30,
"ResourceRecords": [{"Value": "prod-postgres-dr-replica.us-west-2.rds.amazonaws.com"}]
}
}]
}'
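During an incident, hand-editing JSON is error-prone. One option is to generate the change batch from a small function; the sketch below builds the same structure as the command above. `cname_upsert_batch` is a hypothetical helper, and the record name and target are the placeholder values from the example.

```python
import json


def cname_upsert_batch(record_name: str, target: str, ttl: int = 30) -> dict:
    """Build a Route 53 change batch that upserts a single CNAME record."""
    return {
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "TTL": ttl,
                "ResourceRecords": [{"Value": target}],
            },
        }]
    }


batch = cname_upsert_batch(
    "db.prod.internal",
    "prod-postgres-dr-replica.us-west-2.rds.amazonaws.com",
)
# The JSON string can be passed to --change-batch
print(json.dumps(batch, indent=2))
```

The low TTL matters here: clients that cached the old record keep using it until the TTL expires, which is part of the reconnect time in the RTO estimate.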
Total RTO with pre-warmed replica:
- Detect disaster: ~1 min
- Promote replica: ~3 min
- Update DNS + application reconnect: ~3 min
- Total: ~7-10 minutes ✅ (well within 2-hour RTO)
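The arithmetic behind that total can be kept as a small budget so that drill measurements replace the estimates over time. This is only a sketch: the step durations are the approximate figures listed above, and `rto_budget` is a hypothetical helper name.

```python
RTO_TARGET_MIN = 120  # 2-hour RTO

# Approximate step durations in minutes, from the list above
failover_steps = {
    "detect disaster": 1,
    "promote replica": 3,
    "update DNS + application reconnect": 3,
}


def rto_budget(steps: dict, target_min: float):
    """Return (total minutes, remaining margin against the target)."""
    total = sum(steps.values())
    return total, target_min - total


total, margin = rto_budget(failover_steps, RTO_TARGET_MIN)
print(f"Estimated failover: {total} min, margin: {margin} min")
# → Estimated failover: 7 min, margin: 113 min
```

The margin is what absorbs slow detection or a long promotion, so it is worth re-running this with measured values after each drill.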
Step 4: Weekly pg_dump Logical Backup
For partial restores (single table corruption), snapshots are overkill. pg_dump lets you restore just the affected table:
# Run from an EC2 instance in the same VPC
# Capture the date once so the dump and the upload use the same path
DUMP_DATE=$(date +%Y%m%d)
# --format=directory allows parallel dump and restore;
# --jobs=8 runs 8 parallel workers for the 10TB database.
pg_dump \
  --host=prod-postgres-10tb.us-east-1.rds.amazonaws.com \
  --username=admin \
  --format=directory \
  --jobs=8 \
  --compress=1 \
  --file="/tmp/pgdump-${DUMP_DATE}" \
  mydb
# Upload to S3
aws s3 cp "/tmp/pgdump-${DUMP_DATE}" \
  "s3://my-backup-bucket/postgres/logical/${DUMP_DATE}/" \
  --recursive \
  --sse aws:kms \
  --sse-kms-key-id arn:aws:kms:us-east-1:...:key/abc123
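The backup architecture calls for moving logical exports to Glacier after 30 days. A minimal sketch of that lifecycle rule follows, assuming the `my-backup-bucket` bucket and `postgres/logical/` prefix from the upload command; the rule ID and both function names are hypothetical. The boto3 import is kept inside `apply_lifecycle` so the configuration can be inspected without AWS access.

```python
def glacier_lifecycle(prefix: str = "postgres/logical/", days: int = 30) -> dict:
    """Lifecycle configuration that transitions objects under `prefix`
    to the GLACIER storage class after `days` days."""
    return {
        "Rules": [{
            "ID": "logical-dumps-to-glacier",  # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": prefix},
            "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
        }]
    }


def apply_lifecycle(bucket: str = "my-backup-bucket") -> None:
    """Apply the rule to the bucket (requires boto3 and AWS credentials)."""
    import boto3
    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=glacier_lifecycle(),
    )


print(glacier_lifecycle())
```

Note that objects in Glacier must be retrieved before they can be read, so a restore from an older export takes longer than one from a recent export still in standard storage.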
Partial restore — single table:
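# --table=orders restores ONLY the orders table;
# --dbname is needed so pg_restore connects to the database instead of printing SQL.
pg_restore \
  --host=prod-postgres-recovery.rds.amazonaws.com \
  --username=admin \
  --dbname=mydb \
  --table=orders \
  --jobs=8 \
  /tmp/pgdump-20240115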
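When several tables are affected, the same command can be assembled programmatically. The sketch below only builds the argument list and does not run it; `pg_restore_args` is a hypothetical helper, and the database name `mydb` is assumed from the dump command.

```python
def pg_restore_args(host, username, dbname, dump_dir, tables, jobs=8):
    """Build a pg_restore argument list restoring only the given tables."""
    if not tables:
        raise ValueError("at least one table is required")
    args = [
        "pg_restore",
        f"--host={host}",
        f"--username={username}",
        f"--dbname={dbname}",
        f"--jobs={jobs}",
    ]
    # pg_restore accepts --table more than once
    args += [f"--table={t}" for t in tables]
    args.append(dump_dir)
    return args


cmd = pg_restore_args(
    host="prod-postgres-recovery.rds.amazonaws.com",
    username="admin",
    dbname="mydb",
    dump_dir="/tmp/pgdump-20240115",
    tables=["orders"],
)
print(" ".join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)`. Passing a list rather than a shell string avoids quoting problems with unusual table names.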
Step 5: Test Your Backups Quarterly
An untested backup is not a backup. Run quarterly restore drills:
import time
import boto3

def quarterly_restore_drill(event, context):
    """Automated restore test.

    Note: a 10TB restore can outlast Lambda's 15-minute limit, so this body
    may need to run from a longer-lived job or be split into start/poll steps.
    """
    rds = boto3.client('rds')
    # 1. Get the most recent completed snapshot
    snapshots = rds.describe_db_snapshots(
        DBInstanceIdentifier='prod-postgres-10tb',
        SnapshotType='automated'
    )
    available = [s for s in snapshots['DBSnapshots'] if s['Status'] == 'available']
    latest = max(available, key=lambda s: s['SnapshotCreateTime'])
    start_time = time.time()
    # 2. Restore to a test instance
    rds.restore_db_instance_from_db_snapshot(
        DBInstanceIdentifier='restore-test-quarterly',
        DBSnapshotIdentifier=latest['DBSnapshotIdentifier'],
        DBInstanceClass='db.r6g.2xlarge'
    )
    # The default waiter gives up after about 30 minutes; allow up to 3 hours
    rds.get_waiter('db_instance_available').wait(
        DBInstanceIdentifier='restore-test-quarterly',
        WaiterConfig={'Delay': 60, 'MaxAttempts': 180}
    )
    restore_time_minutes = (time.time() - start_time) / 60
    # 3. Run validation queries
    # ... (connect and verify row counts, checksums)
    # 4. Document actual RTO
    print(f"Actual restore time: {restore_time_minutes:.1f} minutes")
    # 5. Clean up test instance
    rds.delete_db_instance(
        DBInstanceIdentifier='restore-test-quarterly',
        SkipFinalSnapshot=True
    )
    # 6. Publish RTO metric to CloudWatch
    cloudwatch = boto3.client('cloudwatch')
    cloudwatch.put_metric_data(
        Namespace='BackupDrillMetrics',
        MetricData=[{
            'MetricName': 'RestoreTimeMins',
            'Value': restore_time_minutes,
            'Unit': 'Count'
        }]
    )
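The validation step in the drill is left as a placeholder. One simple way to approach it is to compare per-table row counts recorded from production against counts read from the restored instance, then combine that with the measured restore time. The sketch below covers only the comparison logic; `compare_row_counts` and `drill_passed` are hypothetical names, and the counts are made-up sample data.

```python
def compare_row_counts(expected: dict, actual: dict) -> dict:
    """Return {table: (expected, actual)} for every table whose count
    differs or is missing (None) on the restored instance."""
    return {
        table: (count, actual.get(table))
        for table, count in expected.items()
        if actual.get(table) != count
    }


def drill_passed(restore_minutes: float, mismatches: dict,
                 rto_target_min: float = 120) -> bool:
    """A drill passes when the restore fit the RTO and the data matched."""
    return restore_minutes <= rto_target_min and not mismatches


# Sample data only; real counts would come from queries on both instances.
expected = {"orders": 1000, "customers": 250, "invoices": 40}
restored = {"orders": 1000, "customers": 249}

mismatches = compare_row_counts(expected, restored)
print(mismatches)
print(drill_passed(75.0, mismatches))
```

Exact row-count equality only holds when both counts refer to the same point in time, so in practice the expected counts would be captured at the snapshot time or compared with a tolerance.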
RPO/RTO Summary
| Backup Method | RPO | RTO | Use Case |
|---|---|---|---|
| Automated PITR + cold restore | 5 min | 45-90 min | Data corruption, accidental delete |
| Pre-warmed cross-region replica | 1-5 min (replication lag) | 7-10 min | Full region failure |
| Manual pre-deploy snapshot | Point of last deploy | 60+ min | Bad deploy rollback |
| pg_dump logical backup | Weekly | Varies (table-specific) | Single-table restore |
What You'll Learn
- The three-layer backup strategy (automated backups, snapshots, pg_dump exports)
- Why a pre-warmed cross-region read replica dramatically reduces RTO
- How to automate cross-region snapshot copies with Lambda + EventBridge
- How to test your backup strategy before an actual disaster
- Why Aurora Global Database changes the RPO/RTO math entirely