DynamoDB Hot Partition During a Flash Sale — Diagnose and Fix

The Problem

DynamoDB partitions data based on the partition key. If all requests go to items with the same partition key value (like productId=FLASH-ITEM-001), all traffic hits one partition — even if DynamoDB has provisioned capacity across 10 partitions. This is a hot partition.

A hot partition throttles even when the table’s total consumed capacity is well below the table-level limit.
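To make the mismatch concrete, the following sketch simulates one second of traffic against a table whose capacity is split evenly across 10 partitions. The md5-based `partition_for` function, the 10,000 WCU table limit, and the traffic mix are all illustrative assumptions. DynamoDB's real hash function is internal, and adaptive capacity can shift some throughput between partitions.

```python
import hashlib
from collections import Counter

PARTITIONS = 10                      # partition count from the example above
TABLE_WCU = 10_000                   # assumed table-level provisioned WCU
PER_PARTITION_WCU = TABLE_WCU // PARTITIONS

def partition_for(key: str) -> int:
    """Stand-in for DynamoDB's internal hash (md5 is an assumption)."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % PARTITIONS

def throttled_writes(writes_per_key: dict) -> dict:
    """Map each overloaded partition to the writes above its per-second share."""
    load = Counter()
    for key, writes in writes_per_key.items():
        load[partition_for(key)] += writes
    return {p: n - PER_PARTITION_WCU for p, n in load.items() if n > PER_PARTITION_WCU}

# One second of traffic: 3,000 writes to one flash item, 2,000 spread over other keys.
traffic = {"FLASH-ITEM-001": 3000}
traffic.update({f"ITEM-{i}": 1 for i in range(2000)})

hot = throttled_writes(traffic)
print("table usage:", sum(traffic.values()), "of", TABLE_WCU)
print("overloaded partitions:", hot)
```

The table as a whole consumes 5,000 of its 10,000 WCU, yet the partition holding FLASH-ITEM-001 is the only entry in `hot`.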

Throttling Alert

[CRITICAL] DynamoDB table product-inventory
Metric: ThrottledRequests rate: 8%
Error: ProvisionedThroughputExceededException
Affected items: FLASH-ITEM-001, FLASH-ITEM-002, FLASH-ITEM-003
Time: 14:00 UTC (flash sale start)

Step 1: Diagnose — Confirm Hot Partition

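# Check table-level consumed WCU (CloudWatch has no per-partition metric).
# Sum over a 60-second period is units per minute; divide by 60 to compare with provisioned WCU.
# Note: date -d is GNU date syntax (Linux).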
aws cloudwatch get-metric-statistics \
  --namespace AWS/DynamoDB \
  --metric-name ConsumedWriteCapacityUnits \
  --dimensions Name=TableName,Value=product-inventory \
  --statistics Sum \
  --period 60 \
  --start-time $(date -d '30 minutes ago' -u +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ)

# Check throttled requests (reported per operation; set Operation to match your write path, e.g. UpdateItem)
aws cloudwatch get-metric-statistics \
  --namespace AWS/DynamoDB \
  --metric-name ThrottledRequests \
  --dimensions Name=TableName,Value=product-inventory Name=Operation,Value=PutItem \
  --statistics Sum \
  --period 60 \
  --start-time $(date -d '30 minutes ago' -u +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ)

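If ThrottledRequests is high but ConsumedWriteCapacityUnits (the per-minute Sum divided by 60) is well below the table's provisioned WCU, you have a hot partition, not an overall capacity problem. CloudWatch Contributor Insights for DynamoDB can then confirm which keys are the most throttled.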
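The rule above can be sketched as a small helper that takes the two CloudWatch values for one 60-second period. The function name and the 80% `headroom` threshold are assumptions chosen for illustration, not AWS guidance.

```python
def classify_write_throttling(consumed_sum_per_minute: float,
                              provisioned_wcu: float,
                              throttled_requests: float,
                              headroom: float = 0.8) -> str:
    """Classify one 60-second CloudWatch datapoint pair."""
    if throttled_requests == 0:
        return "no-throttling"
    # A Sum over a 60-second period is units per minute, so convert to per second.
    consumed_per_second = consumed_sum_per_minute / 60
    if consumed_per_second < headroom * provisioned_wcu:
        # Throttling while the table has spare capacity points at one partition.
        return "hot-partition-suspected"
    return "table-capacity"

# 120,000 units in a minute is 2,000 WCU/s against 10,000 provisioned.
print(classify_write_throttling(120_000, 10_000, throttled_requests=500))
```

A result of "hot-partition-suspected" is only a hint; per-key data is still needed to identify the key.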

# Switch to on-demand immediately to stop throttling (emergency fix)
aws dynamodb update-table \
  --table-name product-inventory \
  --billing-mode PAY_PER_REQUEST

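This removes table-level throttling once the switch completes, which can take several minutes, but it doesn't fix the root cause. A single partition is still limited to about 1,000 write units and 3,000 read units per second in on-demand mode, so a hot key can keep throttling.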
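A quick calculation shows why the billing mode alone cannot absorb a single hot key. The sketch below assumes the documented ceiling of 1,000 write units per second per partition, items of 1 KB or less, and the 10,000 writes per second figure this article uses for the flash sale. The function names are hypothetical.

```python
import math

# Documented per-partition write ceiling, in write units per second.
PARTITION_WRITE_LIMIT = 1000

def write_units_per_item(item_size_kb: float) -> int:
    """A standard write costs 1 unit per 1 KB, rounded up."""
    return max(1, math.ceil(item_size_kb))

def partition_overload(writes_per_second: int, item_size_kb: float = 1.0) -> float:
    """How many times over the per-partition limit one key's traffic is."""
    demand = writes_per_second * write_units_per_item(item_size_kb)
    return demand / PARTITION_WRITE_LIMIT

def min_shards(writes_per_second: int, item_size_kb: float = 1.0) -> int:
    """Smallest shard count that keeps each shard at or under the limit,
    assuming writes spread evenly and each shard lands on its own partition."""
    return math.ceil(partition_overload(writes_per_second, item_size_kb))

print(partition_overload(10_000))
print(min_shards(10_000))
```

For 10,000 writes per second to one key this prints 10.0 and 10: the key is ten times over what one partition can take, and ten evenly loaded shards would sit exactly at the limit with no headroom.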

Step 2: Write Sharding — Eliminate the Hot Partition

Add a random suffix to the partition key so writes distribute across multiple partitions:

import boto3
import random
from decimal import Decimal

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('product-inventory')

SHARD_COUNT = 10  # Distribute across 10 virtual partitions

def write_inventory_update(product_id: str, quantity_delta: int):
    shard = random.randint(0, SHARD_COUNT - 1)
    sharded_key = f"{product_id}#SHARD{shard}"
    
    table.update_item(
        Key={'PK': sharded_key},
        UpdateExpression='ADD quantity :delta',
        ExpressionAttributeValues={':delta': Decimal(quantity_delta)},
        ReturnValues='UPDATED_NEW'
    )

def get_total_inventory(product_id: str) -> int:
    """Query all shards and sum the quantity."""
    total = 0
    for shard in range(SHARD_COUNT):
        sharded_key = f"{product_id}#SHARD{shard}"
        response = table.get_item(
            Key={'PK': sharded_key},
            ProjectionExpression='quantity'
        )
        if 'Item' in response:
            total += int(response['Item'].get('quantity', 0))
    return total

Trade-off: Reading inventory now requires querying all 10 shards and summing. This is fine for a write-heavy workload — reads can be served from cache (DAX or ElastiCache).
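The ten separate reads can also be collapsed into one round trip. The sketch below uses the `batch_get_item` call on the boto3 DynamoDB service resource and retries anything returned in `UnprocessedKeys`. The function name, the `base_delay` parameter, and the backoff schedule are assumptions for illustration. The resource is passed in so the function can be exercised without AWS access.

```python
import time
from typing import Any

TABLE_NAME = "product-inventory"
SHARD_COUNT = 10

def get_total_inventory_batched(dynamodb: Any, product_id: str,
                                base_delay: float = 0.05) -> int:
    """Sum the quantity across all shards with one BatchGetItem request."""
    keys = [{"PK": f"{product_id}#SHARD{n}"} for n in range(SHARD_COUNT)]
    request = {TABLE_NAME: {"Keys": keys, "ProjectionExpression": "quantity"}}
    total = 0
    attempt = 0
    while request:
        response = dynamodb.batch_get_item(RequestItems=request)
        for item in response.get("Responses", {}).get(TABLE_NAME, []):
            total += int(item.get("quantity", 0))
        # Keys DynamoDB could not serve come back in the same shape as RequestItems.
        request = response.get("UnprocessedKeys") or {}
        if request:
            # Back off before retrying the leftover keys (capped at 1 second).
            time.sleep(min(base_delay * (2 ** attempt), 1.0))
            attempt += 1
    return total
```

With real credentials this would be called as `get_total_inventory_batched(boto3.resource('dynamodb'), 'FLASH-ITEM-001')`. Shards that do not exist yet are simply absent from the response and count as zero.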

Step 3: DAX — Microsecond Read Caching

For product catalog reads (not writes), DAX provides in-memory caching at microsecond latency directly in front of DynamoDB:

resource "aws_dax_cluster" "product_catalog" {
  cluster_name       = "product-catalog-dax"
  node_type          = "dax.r4.large"
  replication_factor = 3    # 1 primary + 2 replicas for HA
  iam_role_arn       = aws_iam_role.dax_role.arn
  
  server_side_encryption {
    enabled = true
  }
}

import amazondax
import boto3

# Use DAX client instead of DynamoDB client for reads
dax = amazondax.AmazonDaxClient.resource(
    endpoints=['my-dax-cluster.us-east-1.dax-clusters.amazonaws.com:8111'],
    region_name='us-east-1'
)
table = dax.Table('product-inventory')

# This read is served from DAX cache if available
response = table.get_item(Key={'PK': 'FLASH-ITEM-001#SHARD0'})

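DAX is API-compatible with DynamoDB, so the only application change is swapping in the DAX client. The item cache TTL defaults to 5 minutes, and only eventually consistent reads are served from the cache; strongly consistent reads pass through to DynamoDB. That staleness is fine for catalog data but too long for live inventory counts.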

Step 4: SQS Buffering — Smooth Write Spikes

For extremely high write rates (flash sale: 10,000 add-to-cart per second), buffer writes through SQS:

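import boto3
import json
import time
import uuid
import zlib
from decimal import Decimal

sqs = boto3.client('sqs')
# FIFO queue URLs end in .fifo (required because MessageGroupId is used below)
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789/inventory-updates.fifo'
SHARD_COUNT = 10

def add_to_cart(product_id: str, user_id: str, quantity: int):
    # Fast: just enqueue the request and return
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({
            'product_id': product_id,
            'user_id': user_id,
            'quantity': quantity,
            'timestamp': int(time.time())
        }),
        # FIFO queue for ordering per product. Note that one group is
        # processed in order, which limits throughput for a single hot product.
        MessageGroupId=product_id,
        # Required unless content-based deduplication is enabled on the queue
        MessageDeduplicationId=str(uuid.uuid4())
    )
    return {'status': 'queued', 'message': 'Your cart is being updated'}

# Created once per execution environment and reused across invocations
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('product-inventory')

# Lambda consumer: reads from SQS and writes to DynamoDB at controlled rate
def process_inventory_batch(event, context):
    for record in event['Records']:
        body = json.loads(record['body'])
        # hash() is salted per process for strings, so use a stable checksum
        shard = zlib.crc32(body['user_id'].encode()) % SHARD_COUNT

        # put_item would replace the whole shard item (and batch_writer rejects
        # duplicate keys in one batch), so apply an atomic ADD as in Step 2.
        # Assumption: an add-to-cart reserves stock, so the counter is decremented.
        table.update_item(
            Key={'PK': f"{body['product_id']}#SHARD{shard}"},
            UpdateExpression='ADD quantity :delta',
            ExpressionAttributeValues={':delta': Decimal(-body['quantity'])}
        )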

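Result: DynamoDB sees a smooth write rate from the Lambda consumer instead of a 10,000 req/sec spike. That rate is bounded by the event source mapping's batch size and the function's concurrency, so those two settings are the throttle.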
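The buffer only helps if a write that still gets throttled is retried rather than dropped along with its whole batch. A minimal sketch, assuming the event source mapping has `ReportBatchItemFailures` enabled: the handler returns the IDs of messages that were not applied so that only those are redelivered. Because the queue is FIFO, it stops at the first failure and reports every later message too, which preserves ordering. The `handle_batch` name and the injected `apply_update` callable are hypothetical.

```python
import json
from typing import Callable

def handle_batch(event: dict, apply_update: Callable[[dict], None]) -> dict:
    """Apply each SQS record in order and report the ones that must be retried."""
    failures = []
    failed = False
    for record in event["Records"]:
        if not failed:
            try:
                apply_update(json.loads(record["body"]))
                continue  # applied successfully, nothing to report
            except Exception:
                # e.g. a throttling error from DynamoDB
                failed = True
        # The failed record and everything after it go back to the queue.
        failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

An empty `batchItemFailures` list tells Lambda the whole batch succeeded.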

Step 5: ElastiCache for Read-Heavy Flash Sale Pages

Product pages generate far more reads (inventory count, price) than writes during a flash sale. Cache those reads in ElastiCache Redis with a short TTL:

import redis
import json

cache = redis.Redis(host='my-elasticache.cache.amazonaws.com', port=6379, ssl=True)

def get_product_inventory(product_id: str) -> dict:
    cache_key = f"inventory:{product_id}"
    
    # Check cache first
    cached = cache.get(cache_key)
    if cached:
        return json.loads(cached)
    
    # Cache miss — query DynamoDB (sum all shards)
    total = get_total_inventory(product_id)
    result = {'product_id': product_id, 'available': total}
    
    # Cache for 10 seconds (acceptable staleness for flash sale)
    cache.setex(cache_key, 10, json.dumps(result))
    return result
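A rough way to see what the 10-second TTL buys is to count how often the backend is actually hit. The sketch below replaces Redis with an in-memory dict and a fake clock so it runs without any infrastructure. The `TTLCache` class, the 100 reads per second rate, and the loader are hypothetical stand-ins for the cache-aside function above.

```python
from typing import Callable, Dict, Tuple

class TTLCache:
    """In-memory cache-aside helper with an injectable clock."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float]):
        self.ttl = ttl_seconds
        self.clock = clock
        self._data: Dict[str, Tuple[float, dict]] = {}

    def get_or_load(self, key: str, loader: Callable[[], dict]) -> dict:
        now = self.clock()
        entry = self._data.get(key)
        if entry is not None and now < entry[0]:
            return entry[1]                      # still fresh: serve from cache
        value = loader()                         # miss or expired: hit the backend
        self._data[key] = (now + self.ttl, value)
        return value

now = [0.0]          # fake clock, advanced by the loop below
backend_loads = [0]  # how many times the shard-summing read would run

def load_from_dynamodb() -> dict:
    backend_loads[0] += 1
    return {"product_id": "FLASH-ITEM-001", "available": 100}

cache = TTLCache(ttl_seconds=10, clock=lambda: now[0])
for tick in range(3000):                         # 100 reads/sec for 30 seconds
    now[0] = tick / 100
    cache.get_or_load("inventory:FLASH-ITEM-001", load_from_dynamodb)

print(backend_loads[0])
```

Over 30 simulated seconds, 3,000 page reads trigger 3 backend loads. With 10 shards per load, that is 30 DynamoDB reads instead of 30,000.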

Fix Summary

Problem	Root Cause	Fix	Time to Deploy
ThrottledRequests on writes	Hot partition (same product_id)	Write sharding (10 shards)	Immediate
High read latency	Every read hits DynamoDB	DAX cluster (microsecond cache)	< 1 day to deploy
Write spike overwhelms capacity	10K req/sec burst	SQS buffering + Lambda consumer	< 1 day to deploy
Inventory reads under load	Flash sale traffic	ElastiCache Redis (10s TTL)	< 1 day to deploy

Interview Angle

Interviewers want to hear that you understand the difference between a table-level capacity problem and a partition-level problem. Hot partitions throttle even when the table has plenty of overall capacity. Write sharding is the structural fix; on-demand mode is an emergency stop-gap that doesn’t solve the underlying distribution problem.
