Production RDS Ran Out of Storage at 3 AM — Incident Response and RCA
Walk through the full incident response when an RDS PostgreSQL instance runs out of storage, including immediate recovery, 5-why root cause analysis, and preventive controls.
It's 3:17 AM. PagerDuty fires: 'RDS prod-postgres — StorageFullException. Application cannot write to database.' Every API request that requires a database write is failing. The application is effectively read-only. Orders cannot be placed. The on-call engineer is you.
The Problem
RDS storage full is a P0 incident. Unlike CPU spikes that throttle gracefully, a full disk makes the database refuse all writes — inserts, updates, and deletes all fail. If you have Multi-AZ enabled, this is recoverable without downtime. If not, you need to act faster.
prod-postgres: storage at 100%
Error: ERROR: could not extend file: No space left on device
Impact: All write operations failing (orders, sessions, audit logs)
Time: 03:17 UTC
Step 1: Immediate Recovery — Expand Storage
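RDS applies storage increases online, so no downtime is expected on Multi-AZ deployments. After a storage change, RDS will not accept another storage modification for at least six hours (or until storage optimization completes, whichever is longer), so size the increase generously: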
# Check current storage allocation
aws rds describe-db-instances \
--db-instance-identifier prod-postgres \
--query 'DBInstances[0].{Storage:AllocatedStorage,MaxStorage:MaxAllocatedStorage,MultiAZ:MultiAZ}'
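# Expand storage to 500 GB, double the current 250 GB
# (takes 5-15 minutes, no downtime on Multi-AZ)
# Note: a comment after a trailing backslash breaks the line continuation, so comments stay on their own lines
aws rds modify-db-instance \
--db-instance-identifier prod-postgres \
--allocated-storage 500 \
--apply-immediately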
# Monitor the modification progress
aws rds describe-db-instances \
--db-instance-identifier prod-postgres \
--query 'DBInstances[0].PendingModifiedValues'
# Wait for status to return to 'available'
aws rds wait db-instance-available \
--db-instance-identifier prod-postgres
Time T+5: Storage expanding. Post an update to stakeholders via your status page.
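While the expansion runs, it helps to confirm that free space is actually recovering rather than waiting blind. The following is a minimal sketch, assuming the AWS CLI is configured with credentials for this account and that GNU `date` is available; the function names are illustrative and not part of any AWS tooling.

```shell
#!/usr/bin/env bash
# Sketch only: watch instance status and free space during the expansion.
# Assumes configured AWS CLI credentials and GNU date (for the -d flag).

DB_ID="prod-postgres"

# FreeStorageSpace is reported in bytes; convert to whole GiB (rounded down).
bytes_to_gib() {
  echo $(( $1 / 1073741824 ))
}

# Current lifecycle status, e.g. modifying, storage-optimization, available.
instance_status() {
  aws rds describe-db-instances \
    --db-instance-identifier "$DB_ID" \
    --query 'DBInstances[0].DBInstanceStatus' \
    --output text
}

# Most recent 1-minute average of FreeStorageSpace as an integer byte count.
# Returns non-zero if CloudWatch has no datapoint in the last 10 minutes.
latest_free_bytes() {
  local raw
  raw=$(aws cloudwatch get-metric-statistics \
    --namespace AWS/RDS \
    --metric-name FreeStorageSpace \
    --dimensions Name=DBInstanceIdentifier,Value="$DB_ID" \
    --start-time "$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --period 60 \
    --statistics Average \
    --query 'sort_by(Datapoints, &Timestamp)[-1].Average' \
    --output text) || return 1
  [ "$raw" = "None" ] && return 1
  printf '%.0f\n' "$raw"
}

# Usage (not run automatically):
#   instance_status
#   bytes_to_gib "$(latest_free_bytes)"
```

The instance may report `storage-optimization` for some time after writes recover; that state is expected and the database remains usable.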
Step 2: Communicate Status
# Template for incident Slack message
echo "
[INC-2024-0115] RDS Storage Outage
Status: Mitigating
Impact: All write operations failing (orders, sessions)
Action: Expanding RDS storage from 250GB to 500GB (no downtime)
ETA: 15 minutes to full recovery
Owner: @on-call-engineer
"
Step 3: Find the Root Cause — What Consumed the Storage?
Once storage is expanding (can take 10+ minutes), investigate what filled the disk:
-- Connect to PostgreSQL and check database sizes
SELECT
datname,
pg_size_pretty(pg_database_size(datname)) as size
FROM pg_database
ORDER BY pg_database_size(datname) DESC;
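-- Find the largest user tables (quote_ident keeps mixed-case or unusual names working)
SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(quote_ident(schemaname) || '.' || quote_ident(tablename))) as total_size,
pg_size_pretty(pg_relation_size(quote_ident(schemaname) || '.' || quote_ident(tablename))) as table_size,
-- total minus heap covers indexes plus TOAST data
pg_size_pretty(pg_total_relation_size(quote_ident(schemaname) || '.' || quote_ident(tablename))
- pg_relation_size(quote_ident(schemaname) || '.' || quote_ident(tablename))) as index_and_toast_size
FROM pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY pg_total_relation_size(quote_ident(schemaname) || '.' || quote_ident(tablename)) DESC
LIMIT 20;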
-- Check dead tuples and vacuum history for the largest tables
-- (many dead rows relative to live rows suggests bloat; pg_stat_user_tables exposes the table OID as relid)
SELECT
relname as table_name,
n_live_tup as live_rows,
n_dead_tup as dead_rows,
pg_size_pretty(pg_relation_size(relid)) as size,
last_autovacuum,
last_autoanalyze
FROM pg_stat_user_tables
ORDER BY pg_relation_size(relid) DESC
LIMIT 10;
Typical findings:
- A logging or audit table with no archiving or truncation schedule
- Dead tuples accumulating because autovacuum is behind (very high write rate)
- Unindexed sort operations writing large temporary files
- WAL (Write-Ahead Log) accumulation from a stalled replication slot
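The queries in this step cover tables and dead tuples, but not the last two findings in the list. A minimal sketch for checking them follows, assuming `psql` is installed, PostgreSQL 10 or later, and a connection string you supply yourself; the function names and the `PROD_DATABASE_URL` variable are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: check WAL retained by replication slots and temp-file usage.
# Function names and the connection variable are illustrative.

storage_triage_sql() {
  cat <<'SQL'
-- WAL retained on disk by each replication slot; an inactive slot with a
-- large retained_wal value is a likely culprit. Run this on the primary.
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY restart_lsn;

-- Cumulative temporary-file usage per database since the last stats reset;
-- large values point at sorts or hashes spilling to disk.
SELECT datname,
       temp_files,
       pg_size_pretty(temp_bytes) AS temp_written
FROM pg_stat_database
WHERE datname IS NOT NULL
ORDER BY temp_bytes DESC;
SQL
}

# Feed the SQL on stdin so psql prints the result of every statement.
run_storage_triage() {
  local conn="$1"
  storage_triage_sql | psql "$conn" -X -v ON_ERROR_STOP=1
}

# Usage (not run automatically):
#   run_storage_triage "$PROD_DATABASE_URL"
```

Both queries are read-only, so they are safe to run while the storage expansion is still in progress.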
Step 4: Immediate Cleanup (If Safe)
If you find a clear culprit (e.g., a logging table with 10M rows of old data):
-- Delete old log entries in batches (PostgreSQL has no DELETE ... LIMIT, so the ids come from a subquery)
-- COMMIT after each batch releases locks; this needs PostgreSQL 11+ and must run outside an explicit transaction block
DO $$
DECLARE
deleted_count INTEGER;
BEGIN
LOOP
DELETE FROM application_logs
WHERE id IN (
SELECT id FROM application_logs
WHERE created_at < NOW() - INTERVAL '30 days'
LIMIT 10000
);
GET DIAGNOSTICS deleted_count = ROW_COUNT;
EXIT WHEN deleted_count = 0;
COMMIT; -- end this batch's transaction so its locks are released
PERFORM pg_sleep(0.1); -- 100ms pause between batches
END LOOP;
END $$;
-- After deleting, VACUUM so the freed space can be reused
VACUUM ANALYZE application_logs;
VACUUM FULL acquires an exclusive lock and rebuilds the table on disk, blocking all queries on that table. It also needs free space for the rewritten copy, which a full disk does not have. Prefer plain VACUUM (without FULL), which makes the freed space reusable for future inserts without blocking reads or writes. If you need to return space to the OS, run VACUUM FULL during a maintenance window.
Step 5: Root Cause Analysis (5 Whys)
| Why | Answer |
|---|---|
| Why did RDS run out of storage? | Table application_logs grew from 8GB to 248GB in 72 hours |
| Why did it grow so fast? | The log level was set to DEBUG in production 3 weeks ago and never reverted |
| Why wasn’t it cleaned up? | The nightly archival job failed silently 3 weeks ago |
| Why did the job fail silently? | No alerting configured on Lambda function errors; CloudWatch logs weren’t reviewed |
| Why was there no storage alarm? | CloudWatch alarm was set at 95% free space — not 20% (80% used) |
Corrective Actions
Immediate (done during incident):
- RDS storage expanded to 500GB
- Reverted log level from DEBUG to INFO
Short-term (this sprint):
- Enable RDS Storage Auto Scaling (max 1TB — automatic expansion prevents future incidents)
- Add CloudWatch alarm at 20% free space remaining (not 95%)
- Add Lambda error rate alarm for the archival job
- Add rotation/truncation policy to application_logs (30-day TTL)
Preventive (this quarter):
- Implement table partitioning on application_logs by month
- Add weekly storage growth report to ops review
- Document this incident in the runbook with the recovery steps
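The weekly storage growth report can start as a simple projection from two free-space samples. A minimal sketch follows, using integer arithmetic and a hypothetical helper name. It assumes growth is roughly linear between the samples.

```shell
#!/usr/bin/env bash
# Sketch only: estimate whole days until the volume is full from two
# free-space samples. The helper name is illustrative.

# Args: free space now (GB), free space at the earlier sample (GB),
#       hours between the two samples.
days_until_full() {
  local free_now_gb=$1 free_prev_gb=$2 hours_between=$3
  local consumed_gb=$(( free_prev_gb - free_now_gb ))
  if (( consumed_gb <= 0 )); then
    # Free space did not shrink, so there is no projected fill date.
    echo "inf"
    return 0
  fi
  # days = free_now / (consumed per day), rounded down
  echo $(( free_now_gb * hours_between / (consumed_gb * 24) ))
}

# Example using the growth rate from the RCA table (240 GB in 72 hours):
# with 250 GB free after the expansion, the projection is 3 days.
days_until_full 250 490 72
```

At the growth rate seen in this incident, doubling the volume buys only about three days, which is why reverting the log level belongs in the immediate actions.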
# Enable RDS Storage Auto Scaling right now (auto-scale up to a 1 TB maximum)
aws rds modify-db-instance \
--db-instance-identifier prod-postgres \
--max-allocated-storage 1000 \
--apply-immediately
# Create proper CloudWatch alarm on free space
# Threshold is 80 GB free, roughly 30% of the original 250 GB; raise it to match the expanded volume
aws cloudwatch put-metric-alarm \
--alarm-name "RDS-prod-postgres-LowFreeStorage" \
--alarm-description "RDS storage below 30% free" \
--metric-name FreeStorageSpace \
--namespace AWS/RDS \
--dimensions Name=DBInstanceIdentifier,Value=prod-postgres \
--statistic Average \
--period 300 \
--threshold 80000000000 \
--comparison-operator LessThanThreshold \
--alarm-actions arn:aws:sns:us-east-1:...:ops-critical \
--ok-actions arn:aws:sns:us-east-1:...:ops-critical \
--evaluation-periods 2
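Two of the short-term actions can be sketched the same way: deriving the free-space threshold from the allocated size instead of hard-coding it, and alarming on errors from the archival Lambda. This is a minimal sketch. The function names are illustrative, and the Lambda function name and SNS topic ARN are placeholders you would supply, since the source does not name them. RDS reports allocated storage in GiB, so the helper converts with 1024^3.

```shell
#!/usr/bin/env bash
# Sketch only: threshold helper plus an error alarm for the archival job.
# Function names are illustrative; pass your own Lambda name and SNS ARN.

# Bytes of free space that correspond to a given percentage of the volume.
# Args: allocated storage in GiB, percentage that should remain free.
free_space_threshold_bytes() {
  local allocated_gib=$1 percent_free=$2
  echo $(( allocated_gib * 1073741824 * percent_free / 100 ))
}

# Alarm whenever the archival function reports any error in a 5-minute window.
# Args: Lambda function name, SNS topic ARN for notifications.
create_archival_error_alarm() {
  local function_name=$1 sns_topic_arn=$2
  aws cloudwatch put-metric-alarm \
    --alarm-name "Lambda-${function_name}-Errors" \
    --alarm-description "Archival job ${function_name} reported errors" \
    --namespace AWS/Lambda \
    --metric-name Errors \
    --dimensions Name=FunctionName,Value="$function_name" \
    --statistic Sum \
    --period 300 \
    --evaluation-periods 1 \
    --threshold 0 \
    --comparison-operator GreaterThanThreshold \
    --treat-missing-data notBreaching \
    --alarm-actions "$sns_topic_arn"
}

# Usage (not run automatically):
#   free_space_threshold_bytes 500 20   # 20% of the expanded 500 GiB volume
#   create_archival_error_alarm "<archival-function-name>" "<sns-topic-arn>"
```

An `Errors` alarm only fires when the function runs and fails. A job that stops being invoked at all produces no datapoints, so a complementary check on the `Invocations` metric may be worth adding.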
- How to expand RDS storage while the database is still running (no downtime on Multi-AZ)
- How to identify which tables consumed the storage
- A proper 5-Why root cause analysis for this incident class
- How to configure CloudWatch alarms and RDS Storage Auto Scaling to prevent recurrence