DynamoDB Hot Partition During a Flash Sale — Diagnose and Fix
Your DynamoDB table is experiencing hot partition issues during peak traffic. Debug using CloudWatch metrics, implement write sharding, DAX caching, and SQS buffering.
Your e-commerce platform just launched a flash sale: 50% off for one hour. Product pages load instantly, but add-to-cart is failing with ProvisionedThroughputExceededException errors for the top 3 sale items. DynamoDB throttling is causing an 8% error rate on the checkout flow, and revenue is at risk.
The Problem
DynamoDB partitions data by hashing the partition key. If all requests go to items with the same partition key value (like productId=FLASH-ITEM-001), all traffic lands on one partition, even if the table's provisioned capacity is spread across 10 partitions. This is a hot partition.
A single partition can serve at most 3,000 read capacity units and 1,000 write capacity units per second, so a hot partition throttles even when the table's total consumed capacity is well below the table-level limit.
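To make that arithmetic concrete, here is a minimal sketch under a deliberately simplified model: provisioned capacity is split evenly across partitions, and burst and adaptive capacity are ignored. The function name and the sample numbers are illustrative assumptions, not figures from the incident.

```python
def throttled_writes(per_partition_load, table_wcu):
    """Return (accepted, throttled) writes per second under an even-split model.

    Simplifying assumptions: provisioned capacity is divided evenly across
    partitions, and burst/adaptive capacity are ignored.
    """
    per_partition_limit = table_wcu / len(per_partition_load)
    accepted = sum(min(load, per_partition_limit) for load in per_partition_load)
    throttled = sum(per_partition_load) - accepted
    return accepted, throttled


# Hypothetical table: 5,000 WCU over 10 partitions (500 WCU each).
# One partition holds the flash-sale item and takes 1,400 writes/s;
# the other nine take 100 writes/s each.
load = [1400] + [100] * 9
accepted, throttled = throttled_writes(load, table_wcu=5000)
print(sum(load), accepted, throttled)  # 2300 1400.0 900.0
```

In this example the table consumes 2,300 of its 5,000 WCU, under half, yet 900 writes per second are rejected because they all target the same partition.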
Table: product-inventory
Metric: SystemErrors rate: 8%
Error: ProvisionedThroughputExceededException
Affected items: FLASH-ITEM-001, FLASH-ITEM-002, FLASH-ITEM-003
Time: 14:00 UTC (flash sale start)
Step 1: Diagnose — Confirm Hot Partition
# Check table-level consumed WCU (CloudWatch has no per-partition metric for DynamoDB)
# Note: `date -d` below requires GNU date
aws cloudwatch get-metric-statistics \
--namespace AWS/DynamoDB \
--metric-name ConsumedWriteCapacityUnits \
--dimensions Name=TableName,Value=product-inventory \
--statistics Sum \
--period 60 \
--start-time $(date -d '30 minutes ago' -u +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ)
# Check throttled requests (set Operation to the call your write path uses, e.g. UpdateItem)
aws cloudwatch get-metric-statistics \
--namespace AWS/DynamoDB \
--metric-name ThrottledRequests \
--dimensions Name=TableName,Value=product-inventory Name=Operation,Value=PutItem \
--statistics Sum \
--period 60 \
--start-time $(date -d '30 minutes ago' -u +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ)
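With --statistics Sum and --period 60, ConsumedWriteCapacityUnits reports units consumed per minute, so divide each datapoint by 60 before comparing it with the provisioned WCU. If ThrottledRequests is high while that per-second figure is below the table limit, you have a hot partition, not an overall capacity problem.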
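The decision rule above can be sketched as a small function. This is an illustrative helper, not part of any AWS SDK: the function name, the `headroom` cutoff, and the sample numbers are all assumptions, and the inputs are the per-minute Sum datapoints returned by the two CLI calls.

```python
def classify_throttling(consumed_sum_per_min, throttled_per_min, provisioned_wcu,
                        headroom=0.8):
    """Classify one minute of CloudWatch data.

    consumed_sum_per_min: Sum of ConsumedWriteCapacityUnits for a 60 s period.
    throttled_per_min:    Sum of ThrottledRequests for the same period.
    provisioned_wcu:      The table's provisioned write capacity (units/second).
    headroom:             Hypothetical cutoff; utilization below it counts as spare capacity.
    """
    # Convert the per-minute sum to a per-second rate before comparing.
    utilization = (consumed_sum_per_min / 60) / provisioned_wcu
    if throttled_per_min == 0:
        return "healthy", utilization
    if utilization < headroom:
        return "hot partition suspected", utilization
    return "table capacity exhausted", utilization


# Hypothetical datapoints: 138,000 units consumed in one minute on a 5,000 WCU table.
verdict, utilization = classify_throttling(138_000, 54_000, provisioned_wcu=5000)
print(verdict, round(utilization, 2))
```

Throttling at well under full utilization points to key distribution rather than table size, which is why raising provisioned capacity alone rarely helps here.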
# Switch to on-demand immediately to stop throttling (emergency fix)
aws dynamodb update-table \
--table-name product-inventory \
--billing-mode PAY_PER_REQUEST
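This removes the table-level provisioned limit, but it doesn't fix the root cause. The mode switch can take several minutes to complete, and on-demand tables are still subject to per-partition throughput limits, so a single hot key can keep throttling at extreme rates.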
Step 2: Write Sharding — Eliminate the Hot Partition
Add a random suffix to the partition key so writes distribute across multiple partitions:
import boto3
import random
from decimal import Decimal

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('product-inventory')
SHARD_COUNT = 10  # Distribute across 10 virtual partitions

def write_inventory_update(product_id: str, quantity_delta: int):
    # Pick a random shard so concurrent writes spread across partition keys
    shard = random.randint(0, SHARD_COUNT - 1)
    sharded_key = f"{product_id}#SHARD{shard}"
    table.update_item(
        Key={'PK': sharded_key},
        UpdateExpression='ADD quantity :delta',
        ExpressionAttributeValues={':delta': Decimal(quantity_delta)},
        ReturnValues='UPDATED_NEW'
    )

def get_total_inventory(product_id: str) -> int:
    """Query all shards and sum the quantity."""
    total = 0
    for shard in range(SHARD_COUNT):
        sharded_key = f"{product_id}#SHARD{shard}"
        response = table.get_item(
            Key={'PK': sharded_key},
            ProjectionExpression='quantity'
        )
        if 'Item' in response:
            total += int(response['Item'].get('quantity', 0))
    return total
Trade-off: Reading inventory now requires querying all 10 shards and summing. This is fine for a write-heavy workload — reads can be served from cache (DAX or ElastiCache).
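The scatter-gather behaviour behind this trade-off can be tried without AWS access. The sketch below uses an in-memory dictionary as a stand-in for the table; `shard_keys`, `write_delta`, `read_total`, and `fake_table` are hypothetical names introduced only for illustration.

```python
import random

SHARD_COUNT = 10

def shard_keys(product_id, shard_count=SHARD_COUNT):
    """All partition key values that together hold one product's counter."""
    return [f"{product_id}#SHARD{n}" for n in range(shard_count)]

# In-memory stand-in for the DynamoDB table: sharded key -> quantity.
fake_table = {}

def write_delta(product_id, delta):
    """Apply a delta to one randomly chosen shard, like the ADD update does."""
    key = random.choice(shard_keys(product_id))
    fake_table[key] = fake_table.get(key, 0) + delta
    return key

def read_total(product_id):
    """Gather every shard and sum; missing shards count as zero."""
    return sum(fake_table.get(key, 0) for key in shard_keys(product_id))

write_delta("FLASH-ITEM-001", 1000)      # initial stock lands on a single shard
for _ in range(250):
    write_delta("FLASH-ITEM-001", -1)    # 250 single-unit reservations
print(read_total("FLASH-ITEM-001"))      # 750
```

Individual shards may hold negative values here, because decrements land on shards that never received stock. Only the sum is meaningful, which also means a per-shard condition check cannot by itself prevent overselling.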
Step 3: DAX — Microsecond Read Caching
For product catalog reads (not writes), DAX provides in-memory caching at microsecond latency directly in front of DynamoDB:
resource "aws_dax_cluster" "product_catalog" {
  cluster_name       = "product-catalog-dax"
  node_type          = "dax.r4.large"
  replication_factor = 3 # 1 primary + 2 replicas for HA
  iam_role_arn       = aws_iam_role.dax_role.arn

  server_side_encryption {
    enabled = true
  }
}

import amazondax
import boto3

# Use DAX client instead of DynamoDB client for reads
dax = amazondax.AmazonDaxClient.resource(
    endpoints=['my-dax-cluster.us-east-1.dax-clusters.amazonaws.com:8111'],
    region_name='us-east-1'
)
table = dax.Table('product-inventory')

# This read is served from DAX cache if available
response = table.get_item(Key={'PK': 'FLASH-ITEM-001#SHARD0'})
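DAX is API-compatible with DynamoDB, so the application change is limited to swapping the client. The item cache TTL defaults to 5 minutes. Only eventually consistent reads are served from the cache; strongly consistent reads pass through to DynamoDB.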
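The TTL matters most when some writers bypass the cache. The toy model below is not the DAX client; it is a hypothetical `ReadThroughCache` class with an explicit clock argument, meant only to show how long a stale value can survive when a write goes straight to the backing store.

```python
class ReadThroughCache:
    """Toy model of a read-through item cache with a fixed TTL (not the DAX client)."""

    def __init__(self, backing_store, ttl_seconds=300):
        self.store = backing_store
        self.ttl = ttl_seconds
        self.entries = {}  # key -> (value, expiry_time)

    def get(self, key, now):
        hit = self.entries.get(key)
        if hit is not None and now < hit[1]:
            return hit[0]                      # served from cache
        value = self.store.get(key)            # miss or expired: read the store
        self.entries[key] = (value, now + self.ttl)
        return value

    def put(self, key, value, now):
        # Write-through: update the backing store and the cached copy together.
        self.store[key] = value
        self.entries[key] = (value, now + self.ttl)


store = {"FLASH-ITEM-001#SHARD0": 40}
cache = ReadThroughCache(store, ttl_seconds=300)

first = cache.get("FLASH-ITEM-001#SHARD0", now=0)      # 40, now cached until t=300
store["FLASH-ITEM-001#SHARD0"] = 12                    # a writer bypasses the cache
stale = cache.get("FLASH-ITEM-001#SHARD0", now=60)     # still 40
fresh = cache.get("FLASH-ITEM-001#SHARD0", now=301)    # 12, after the TTL expires
cache.put("FLASH-ITEM-001#SHARD0", 5, now=310)         # write through the cache
after_put = cache.get("FLASH-ITEM-001#SHARD0", now=311)  # 5 immediately
print(first, stale, fresh, after_put)
```

If inventory writes go directly to DynamoDB while reads go through the cache, readers can see counts that are up to one TTL old, so a 5-minute default suits catalog data better than fast-moving stock levels.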
Step 4: SQS Buffering — Smooth Write Spikes
For extremely high write rates (flash sale: 10,000 add-to-cart per second), buffer writes through SQS:
import boto3
import json
import time
import uuid

sqs = boto3.client('sqs')
# MessageGroupId is only valid on FIFO queues, whose names must end in .fifo
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789/inventory-updates.fifo'

def add_to_cart(product_id: str, user_id: str, quantity: int):
    # Fast path: just enqueue the request
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({
            'product_id': product_id,
            'user_id': user_id,
            'quantity': quantity,
            'timestamp': int(time.time())
        }),
        MessageGroupId=product_id,  # FIFO queue for ordering per product
        # Required unless content-based deduplication is enabled on the queue
        MessageDeduplicationId=str(uuid.uuid4())
    )
    return {'status': 'queued', 'message': 'Your cart is being updated'}
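import hashlib

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('product-inventory')
SHARD_COUNT = 10

def stable_shard(user_id: str) -> int:
    # Built-in hash() is salted per process for strings, so use a stable digest
    return int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % SHARD_COUNT

# Lambda consumer: reads from SQS and writes to DynamoDB at controlled rate
def process_inventory_batch(event, context):
    for record in event['Records']:
        body = json.loads(record['body'])
        shard = stable_shard(body['user_id'])
        # put_item would overwrite the shard's counter item (the key is PK only),
        # so apply each message as an atomic ADD, as in Step 2.
        # Assumption: an add-to-cart reserves stock, hence the negative delta.
        table.update_item(
            Key={'PK': f"{body['product_id']}#SHARD{shard}"},
            UpdateExpression='ADD quantity :delta',
            ExpressionAttributeValues={':delta': -int(body['quantity'])}
        )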
Result: DynamoDB sees a smooth, controlled write rate from the Lambda consumer instead of a 10,000 req/sec spike.
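Smoothing the spike trades throttling for queueing delay. The back-of-the-envelope sketch below estimates that delay; the function name, the 2,500 writes/s drain rate, and the 60-second burst are assumptions chosen for illustration, and only the 10,000 req/sec arrival rate comes from the scenario.

```python
def queue_backlog(arrival_rate, drain_rate, burst_seconds):
    """Peak backlog (messages) and time to drain it after the burst ends.

    Assumes a constant arrival rate during the burst, a constant drain rate,
    and no arrivals afterwards, which real traffic will not match exactly.
    """
    peak = max(0, (arrival_rate - drain_rate) * burst_seconds)
    drain_seconds = peak / drain_rate
    return peak, drain_seconds


# 10,000 add-to-cart requests/s for 60 s, consumer capped at 2,500 writes/s.
peak, drain_seconds = queue_backlog(10_000, 2_500, 60)
print(peak, drain_seconds)  # 450000 180.0
```

Under these assumptions the last request of the burst waits about three minutes before it reaches DynamoDB, which is why the API returns a "queued" status instead of a confirmed cart update.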
Step 5: ElastiCache for Read-Heavy Flash Sale Pages
Product page reads (inventory count, price) during a flash sale are read-heavy. Cache them in ElastiCache Redis with a short TTL:
import redis
import json

cache = redis.Redis(host='my-elasticache.cache.amazonaws.com', port=6379, ssl=True)

def get_product_inventory(product_id: str) -> dict:
    cache_key = f"inventory:{product_id}"

    # Check cache first
    cached = cache.get(cache_key)
    if cached:
        return json.loads(cached)

    # Cache miss: query DynamoDB (sum all shards)
    total = get_total_inventory(product_id)
    result = {'product_id': product_id, 'available': total}

    # Cache for 10 seconds (acceptable staleness for flash sale)
    cache.setex(cache_key, 10, json.dumps(result))
    return result
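A rough estimate shows why even a 10-second TTL helps when each inventory read fans out to every shard. The function and the traffic numbers below are hypothetical; the model assumes each product is viewed at least once per TTL window and ignores concurrent misses (cache stampedes).

```python
def inventory_reads_per_second(views_per_sec, products, shard_count, ttl_seconds=None):
    """Estimate DynamoDB GetItem calls per second for inventory lookups.

    Without a cache, every page view reads all shards. With a cache-aside TTL,
    each product is recomputed roughly once per TTL window.
    """
    uncached = views_per_sec * shard_count
    if ttl_seconds is None:
        return uncached
    cached = products * shard_count / ttl_seconds
    return min(uncached, cached)


# Hypothetical load: 20,000 page views/s spread over 3 sale items, 10 shards each.
without_cache = inventory_reads_per_second(20_000, products=3, shard_count=10)
with_cache = inventory_reads_per_second(20_000, products=3, shard_count=10, ttl_seconds=10)
print(without_cache, with_cache)  # 200000 3.0
```

The cached figure depends only on the number of products and shards, not on traffic, so the read cost of sharding stays flat as the sale gets busier.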
Fix Summary
| Problem | Root Cause | Fix | Time to Deploy |
|---|---|---|---|
| ThrottledRequests on writes | Hot partition (same product_id) | Write sharding (10 shards) | Immediate |
| High read latency | Every read hits DynamoDB | DAX cluster (microsecond cache) | < 1 day to deploy |
| Write spike overwhelms capacity | 10K req/sec burst | SQS buffering + Lambda consumer | < 1 day to deploy |
| Inventory reads under load | Flash sale traffic | ElastiCache Redis (10s TTL) | < 1 day to deploy |
- How DynamoDB partitions data and why hot partitions occur
- How write sharding eliminates hot partitions when writes concentrate on a few keys
- When to use DAX vs ElastiCache for DynamoDB caching
- How SQS buffering smooths write spikes
- The trade-offs between provisioned, on-demand, and adaptive capacity