Production RDS Ran Out of Storage at 3 AM — Incident Response and RCA

The Problem

RDS storage full is a P0 incident. Unlike CPU spikes that throttle gracefully, a full disk makes the database refuse all writes — inserts, updates, and deletes all fail. If you have Multi-AZ enabled, this is recoverable without downtime. If not, you need to act faster.

Active Incident

[CRITICAL] RDS instance prod-postgres — storage at 100%
Error: ERROR: could not extend file: No space left on device
Impact: All write operations failing — orders, sessions, audit logs
Time: 03:17 UTC

Step 1: Immediate Recovery — Expand Storage

In most cases RDS can increase allocated storage online, with no outage, on both Multi-AZ and Single-AZ deployments. Expect some I/O degradation while the change is applied. After a storage change the instance enters a storage-optimization phase, and RDS rejects further storage modifications for at least six hours, so size the increase generously the first time:

# Check current storage allocation
aws rds describe-db-instances \
  --db-instance-identifier prod-postgres \
  --query 'DBInstances[0].{Storage:AllocatedStorage,MaxStorage:MaxAllocatedStorage,MultiAZ:MultiAZ}'

# Expand storage: double from the current 250GB (typically takes 5-15 minutes)
# Note: a comment after a trailing backslash breaks the line continuation, so keep comments on their own lines
aws rds modify-db-instance \
  --db-instance-identifier prod-postgres \
  --allocated-storage 500 \
  --apply-immediately

# Monitor the modification progress
aws rds describe-db-instances \
  --db-instance-identifier prod-postgres \
  --query 'DBInstances[0].PendingModifiedValues'

# Wait for status to return to 'available'
aws rds wait db-instance-available \
  --db-instance-identifier prod-postgres

Note on Non-Multi-AZ Instances

Storage expansion is normally online on Single-AZ instances as well, but with no standby there is nothing to fail over to if the instance has to be restarted or replaced during recovery. This is why Multi-AZ is non-negotiable for production databases.

Time T+5: Storage expanding. Post an update to stakeholders via your status page.

Step 2: Communicate Status

# Template for incident Slack message
echo "
[INC-2024-0115] RDS Storage Outage
Status: Mitigating
Impact: All write operations failing (orders, sessions)
Action: Expanding RDS storage from 250GB to 500GB (no downtime)
ETA: 15 minutes to full recovery
Owner: @on-call-engineer
"

Step 3: Find the Root Cause — What Consumed the Storage?

Once storage is expanding (can take 10+ minutes), investigate what filled the disk:

-- Connect to PostgreSQL and check database sizes
SELECT 
  datname,
  pg_size_pretty(pg_database_size(datname)) as size
FROM pg_database
ORDER BY pg_database_size(datname) DESC;

-- Find the largest user tables
-- (relid avoids quoting problems with mixed-case or dotted table names)
SELECT
  schemaname,
  relname as tablename,
  pg_size_pretty(pg_total_relation_size(relid)) as total_size,
  pg_size_pretty(pg_table_size(relid)) as table_size,    -- heap + TOAST
  pg_size_pretty(pg_indexes_size(relid)) as index_size
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 20;

-- Check for bloat: compare live rows, dead rows, and size, and see when autovacuum last ran
-- (pg_stat_user_tables exposes the table OID as relid, not oid)
SELECT
  relname as table_name,
  n_live_tup as live_rows,
  n_dead_tup as dead_rows,
  pg_size_pretty(pg_relation_size(relid)) as size,
  last_autovacuum,
  last_autoanalyze
FROM pg_stat_user_tables
ORDER BY pg_relation_size(relid) DESC
LIMIT 10;

Typical findings:

A logging or audit table with no archiving or truncation schedule
Dead tuples accumulating because autovacuum is behind (very high write rate)
Unindexed sort operations writing large temporary files
WAL (Write-Ahead Log) accumulation from a stalled replication slot
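
The last two findings (temporary files and WAL held by a replication slot) do not show up in table sizes, so they need their own checks. The following is a minimal sketch, assuming psql is installed and the standard PGHOST, PGUSER, and PGDATABASE environment variables point at the primary instance. The variable and function names are illustrative and not part of the original runbook.

```shell
# WAL retained by each replication slot. An inactive slot with a large
# retained_wal value is pinning WAL on disk. Run this on the primary.
SLOT_SQL="SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC NULLS LAST;"

# Cumulative temporary-file usage per database since the last stats reset.
TEMP_SQL="SELECT datname, temp_files, pg_size_pretty(temp_bytes) AS temp_written
FROM pg_stat_database
WHERE datname IS NOT NULL
ORDER BY temp_bytes DESC;"

# Runs both checks in one psql session; -X skips any local .psqlrc.
run_storage_checks() {
  psql -X -v ON_ERROR_STOP=1 -c "$SLOT_SQL" -c "$TEMP_SQL"
}

# Usage (needs a reachable database):
#   run_storage_checks
```

The temp_bytes counter is cumulative, so a large value shows that queries have been spilling to disk, not that the files are still there.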

Step 4: Immediate Cleanup (If Safe)

If you find a clear culprit (e.g., a logging table with 10M rows of old data):

-- Delete old log entries in batches to keep lock times short.
-- PostgreSQL has no DELETE ... LIMIT, so each batch picks its ids in a subquery.
-- COMMIT after each batch (PostgreSQL 11+, run outside an explicit transaction);
-- without it the whole loop is one long transaction.
DO $$
DECLARE
  deleted_count INTEGER;
BEGIN
  LOOP
    DELETE FROM application_logs
    WHERE id IN (
      SELECT id FROM application_logs
      WHERE created_at < NOW() - INTERVAL '30 days'
      LIMIT 10000
    );

    GET DIAGNOSTICS deleted_count = ROW_COUNT;
    COMMIT;  -- release locks and let vacuum see the dead rows
    EXIT WHEN deleted_count = 0;

    PERFORM pg_sleep(0.1);  -- 100ms pause between batches
  END LOOP;
END $$;

-- After deleting, VACUUM to reclaim space
VACUUM ANALYZE application_logs;

Don't VACUUM FULL

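VACUUM FULL acquires an ACCESS EXCLUSIVE lock and rewrites the whole table, blocking every query on that table until it finishes. It also needs enough free disk space for the new copy, which a nearly full instance may not have. Use plain VACUUM, which marks space as reusable for future inserts without blocking reads or writes. If you need to return space to the OS, run VACUUM FULL during a maintenance window.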

Step 5: Root Cause Analysis (5 Whys)

Why	Answer
Why did RDS run out of storage?	Table `application_logs` grew from 8GB to 248GB in 72 hours
Why did it grow so fast?	The log level was set to DEBUG in production 3 weeks ago and never reverted
Why wasn’t it cleaned up?	The nightly archival job failed silently 3 weeks ago
Why did the job fail silently?	No alerting configured on Lambda function errors; CloudWatch logs weren’t reviewed
Why did no storage alarm fire in time?	The CloudWatch alarm was set at 95% used (5% free), too late to react, instead of 80% used (20% free)
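
To see why the alarm threshold matters, it helps to turn the growth figure from the first row into lead time. The sketch below assumes linear growth and whole-number inputs. The function name is illustrative, and the 12GB and 50GB figures are simply 5% and 20% of the 250GB volume.

```shell
# Estimate hours until the disk is full, assuming linear growth.
# Arguments: free space in GB, observed growth in GB, window of that growth in hours.
hours_until_full() {
  local free_gb=$1 growth_gb=$2 window_hours=$3
  if [ "$growth_gb" -le 0 ]; then
    echo "inf"   # no growth observed, so no deadline
    return 0
  fi
  # Integer division rounds down, which errs on the side of less time.
  echo $(( free_gb * window_hours / growth_gb ))
}

# This incident: 8GB -> 248GB is 240GB of growth in 72 hours.
hours_until_full 12 240 72   # alarm at 95% used (about 12GB free)
hours_until_full 50 240 72   # alarm at 80% used (50GB free)
```

At this growth rate a 95%-used alarm leaves about 3 hours to respond, while an 80%-used alarm leaves about 15.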

Corrective Actions

Immediate (done during incident):

RDS storage expanded to 500GB
Reverted log level from DEBUG to INFO

Short-term (this sprint):

Enable RDS Storage Auto Scaling (max 1TB) so that routine growth triggers automatic expansion
Add CloudWatch alarm at 20% free space remaining (80% used), replacing the 95% threshold
Add Lambda error rate alarm for the archival job
Add rotation/truncation policy to application_logs (30-day TTL)

Preventive (this quarter):

Implement table partitioning on application_logs by month
Add weekly storage growth report to ops review
Document this incident in the runbook with the recovery steps
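
The Lambda error alarm from the short-term list could look like the following sketch. It assumes the archival job is a single Lambda function. The function name nightly-log-archival and the account ID 123456789012 are placeholders, not values from this incident. Setting DRY_RUN prints the command for review instead of calling AWS.

```shell
# Sketch: alarm on any error from the archival Lambda.
# function_name and topic_arn are placeholders supplied by the caller.
create_archival_error_alarm() {
  local function_name=$1 topic_arn=$2
  # When DRY_RUN is set to any non-empty value, print the command instead of running it.
  ${DRY_RUN:+echo} aws cloudwatch put-metric-alarm \
    --alarm-name "Lambda-${function_name}-Errors" \
    --namespace AWS/Lambda \
    --metric-name Errors \
    --dimensions "Name=FunctionName,Value=${function_name}" \
    --statistic Sum \
    --period 300 \
    --evaluation-periods 1 \
    --threshold 0 \
    --comparison-operator GreaterThanThreshold \
    --treat-missing-data notBreaching \
    --alarm-actions "$topic_arn"
}

# Review the generated command first (hypothetical function name and account ID):
DRY_RUN=1 create_archival_error_alarm nightly-log-archival \
  "arn:aws:sns:us-east-1:123456789012:ops-critical"
```

An Errors alarm only catches invocations that fail. A job that stops being invoked at all produces no errors, so a second alarm on missing invocations may be worth adding.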

# Enable RDS Storage Auto Scaling right now (auto-scale up to 1TB max)
aws rds modify-db-instance \
  --db-instance-identifier prod-postgres \
  --max-allocated-storage 1000 \
  --apply-immediately

# Create the CloudWatch alarm at 20% free (80% used)
# FreeStorageSpace is reported in bytes: 20% of 500 GiB = 107374182400
aws cloudwatch put-metric-alarm \
  --alarm-name "RDS-prod-postgres-LowFreeStorage" \
  --alarm-description "RDS storage below 20% free" \
  --metric-name FreeStorageSpace \
  --namespace AWS/RDS \
  --dimensions Name=DBInstanceIdentifier,Value=prod-postgres \
  --statistic Average \
  --period 300 \
  --threshold 107374182400 \
  --comparison-operator LessThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:...:ops-critical \
  --ok-actions arn:aws:sns:us-east-1:...:ops-critical \
  --evaluation-periods 2
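
Because FreeStorageSpace is reported in bytes while RDS allocated storage is set in GiB, the alarm threshold has to be recalculated whenever the allocation changes. A small helper for that arithmetic might look like this. The function name is illustrative.

```shell
# Convert "N percent free of an M GiB volume" into a byte threshold
# suitable for a FreeStorageSpace alarm.
free_space_threshold_bytes() {
  local allocated_gib=$1 free_percent=$2
  echo $(( allocated_gib * 1024 * 1024 * 1024 * free_percent / 100 ))
}

free_space_threshold_bytes 500 20   # 20% free on a 500 GiB volume
free_space_threshold_bytes 250 30   # 30% free on a 250 GiB volume
```

The two calls print 107374182400 and 80530636800. A fixed byte threshold does not follow Storage Auto Scaling, so it is worth re-running the calculation after every expansion.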

Interview Angle

Interviewers often follow up: “What if the database was so full that even the storage expansion failed?” The expansion happens at the storage layer, so a full disk does not by itself prevent it. What usually blocks it is an earlier storage change: RDS rejects a second storage modification for at least six hours, or while the instance is still in storage-optimization. If you cannot expand, the options are to drop a stalled replication slot that is pinning WAL, delete or truncate expendable data, or restore a snapshot to a new instance with more allocated storage. Always have a break-glass runbook for this scenario.
