
DynamoDB Hot Partition During a Flash Sale — Diagnose and Fix

Your DynamoDB table is experiencing hot partition issues during peak traffic. Debug using CloudWatch metrics, implement write sharding, DAX caching, and SQS buffering.

January 20, 2025 · 4 min read · ~30 min to complete
The Situation

Your e-commerce platform just launched a flash sale: 50% off for one hour. Product page loads are instant, but add-to-cart is failing with ProvisionedThroughputExceededException errors for the top 3 sale items. DynamoDB throttling is causing an 8% error rate on the checkout flow, and revenue is at risk.

5 Steps
5 Services Used
~30 min Duration
Advanced Difficulty

The Problem

DynamoDB partitions data based on the partition key. If all requests go to items with the same partition key value (like productId=FLASH-ITEM-001), all traffic hits one partition — even if DynamoDB has provisioned capacity across 10 partitions. This is a hot partition.

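A hot partition throttles even when the table's total consumed capacity is well below the table-level limit, because each partition has its own ceiling of roughly 1,000 write capacity units and 3,000 read capacity units per second.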
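
To make the arithmetic concrete, here is a rough sketch of why one key can throttle while the table looks mostly idle. The 1,000 WCU per-partition ceiling is DynamoDB's documented limit; the traffic figures and the `throttled_share` helper are illustrative assumptions, not measurements from this incident, and each write is assumed to cost 1 WCU (items of 1 KB or less).

```python
# Illustrative arithmetic only; the traffic numbers below are assumed, not measured.
PARTITION_WCU_LIMIT = 1000  # per-partition write ceiling (WCU per second)

def throttled_share(key_wcu_demand: float, limit: float = PARTITION_WCU_LIMIT) -> float:
    """Fraction of writes to a single key that exceed the per-partition ceiling."""
    if key_wcu_demand <= limit:
        return 0.0
    return (key_wcu_demand - limit) / key_wcu_demand

table_provisioned_wcu = 10_000   # assumed table-level setting
hot_key_demand = 2_500           # assumed writes/sec against one productId
other_traffic = 500              # assumed writes/sec spread over all other keys

# Only 1,000 of the 2,500 hot-key writes can succeed each second.
accepted = min(hot_key_demand, PARTITION_WCU_LIMIT) + other_traffic
table_utilization = accepted / table_provisioned_wcu

print(f"table utilization: {table_utilization:.0%}")                       # 15%
print(f"hot-key writes throttled: {throttled_share(hot_key_demand):.0%}")  # 60%
```

Under these assumptions the table sits at 15% of its provisioned capacity while 60% of writes to the hot key are rejected, which is the signature to look for in Step 1.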

Throttling Alert
[CRITICAL] DynamoDB table product-inventory
Metric: ThrottledRequests (8% of write requests)
Error: ProvisionedThroughputExceededException
Affected items: FLASH-ITEM-001, FLASH-ITEM-002, FLASH-ITEM-003
Time: 14:00 UTC (flash sale start)

Step 1: Diagnose — Confirm Hot Partition

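# Check consumed WCU for the table. CloudWatch has no per-partition metric,
# so compare Sum / 60 against the table's provisioned WCU.
# Note: `date -d` is GNU date syntax; on macOS use `date -u -v-30M +%Y-%m-%dT%H:%M:%SZ`.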
aws cloudwatch get-metric-statistics \
  --namespace AWS/DynamoDB \
  --metric-name ConsumedWriteCapacityUnits \
  --dimensions Name=TableName,Value=product-inventory \
  --statistics Sum \
  --period 60 \
  --start-time $(date -d '30 minutes ago' -u +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ)

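# Check throttled requests for the operation the write path uses
# (UpdateItem in Step 2; repeat with PutItem or BatchWriteItem if you use them)
aws cloudwatch get-metric-statistics \
  --namespace AWS/DynamoDB \
  --metric-name ThrottledRequests \
  --dimensions Name=TableName,Value=product-inventory Name=Operation,Value=UpdateItem \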
  --statistics Sum \
  --period 60 \
  --start-time $(date -d '30 minutes ago' -u +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ)

If ThrottledRequests is high but ConsumedWriteCapacityUnits is below the table limit, you have a hot partition — not an overall capacity problem.
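
The comparison in that rule of thumb can be sketched as a small helper. This is a minimal sketch, not an official diagnostic: the function name, the return labels, and the 80% "table is busy" threshold are all assumptions. The inputs are the per-minute `Sum` values returned by the two CLI calls.

```python
def classify_throttling(consumed_sum_per_minute: float,
                        throttled_sum_per_minute: float,
                        provisioned_wcu: float,
                        busy_threshold: float = 0.8) -> str:
    """Classify one minute of CloudWatch data.

    consumed_sum_per_minute: Sum of ConsumedWriteCapacityUnits over a 60 s period
    throttled_sum_per_minute: Sum of ThrottledRequests over the same period
    provisioned_wcu: the table's provisioned write capacity (units per second)
    busy_threshold: assumed cut-off for "the table itself is near its limit"
    """
    # A 60-second Sum must be divided by 60 to compare with a per-second limit.
    consumed_per_second = consumed_sum_per_minute / 60
    utilization = consumed_per_second / provisioned_wcu
    if throttled_sum_per_minute == 0:
        return "healthy"
    if utilization >= busy_threshold:
        return "table-level capacity"
    return "likely hot partition"

# Assumed numbers: 90,000 WCU consumed in a minute on a 10,000 WCU table
# is 15% utilization, yet 4,200 requests were throttled.
print(classify_throttling(90_000, 4_200, 10_000))  # likely hot partition
```

The result is only a hint. Confirming which keys are hot still requires key-level data, for example from application logs.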

# Switch to on-demand as an emergency stop-gap (the change can take several minutes to apply)
aws dynamodb update-table \
  --table-name product-inventory \
  --billing-mode PAY_PER_REQUEST

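Once the mode change completes, this removes the table-level provisioned limit, but it doesn't fix the root cause. On-demand tables are still bound by the per-partition ceiling, so a single hot key can keep throttling at extreme rates. AWS also limits how often a table can switch billing modes, so plan to stay on on-demand for the duration of the sale.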

Step 2: Write Sharding — Eliminate the Hot Partition

Add a random suffix to the partition key so writes distribute across multiple partitions:

import boto3
import random
from decimal import Decimal

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('product-inventory')

SHARD_COUNT = 10  # Distribute across 10 virtual partitions

def write_inventory_update(product_id: str, quantity_delta: int):
    shard = random.randint(0, SHARD_COUNT - 1)
    sharded_key = f"{product_id}#SHARD{shard}"
    
    table.update_item(
        Key={'PK': sharded_key},
        UpdateExpression='ADD quantity :delta',
        ExpressionAttributeValues={':delta': Decimal(quantity_delta)},
        ReturnValues='UPDATED_NEW'
    )

def get_total_inventory(product_id: str) -> int:
    """Query all shards and sum the quantity."""
    total = 0
    for shard in range(SHARD_COUNT):
        sharded_key = f"{product_id}#SHARD{shard}"
        response = table.get_item(
            Key={'PK': sharded_key},
            ProjectionExpression='quantity'
        )
        if 'Item' in response:
            total += int(response['Item'].get('quantity', 0))
    return total

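Trade-off: Reading inventory now requires fetching all 10 shards and summing them. That is acceptable for a write-heavy workload, because reads can be served from cache (DAX or ElastiCache). A second cost is that no single item holds the full count, so a condition such as "quantity must stay above zero" can no longer be enforced in one write; overselling has to be guarded against elsewhere.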
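
The shard count of 10 is a fixed choice in the code above. One hedged way to size it is to divide the expected peak write rate for the hottest key by the per-partition ceiling, leaving headroom. The helper names, the 50% headroom, and the 4,000 writes/sec figure are assumptions for illustration only.

```python
import math
import random
from collections import Counter

PARTITION_WCU_LIMIT = 1000  # per-partition write ceiling (WCU per second)

def shards_needed(peak_writes_per_sec: int, item_size_bytes: int = 1024,
                  headroom: float = 0.5) -> int:
    """Estimate shards so each stays under `headroom` of the per-partition limit."""
    wcu_per_write = math.ceil(item_size_bytes / 1024)  # 1 WCU covers up to 1 KB
    peak_wcu = peak_writes_per_sec * wcu_per_write
    return max(1, math.ceil(peak_wcu / (PARTITION_WCU_LIMIT * headroom)))

def simulate_one_second(writes: int, shard_count: int, seed: int = 0) -> Counter:
    """Count how many of `writes` land on each shard under random suffixing."""
    rng = random.Random(seed)
    return Counter(rng.randrange(shard_count) for _ in range(writes))

shard_count = shards_needed(4_000)   # assumed 4,000 writes/sec on one product
load = simulate_one_second(4_000, shard_count)
print(f"shards: {shard_count}, busiest shard: {max(load.values())} writes/sec")
```

Random suffixing spreads load evenly across key values, but DynamoDB decides physical placement, so distinct key values are not guaranteed to land on distinct partitions. Treat the estimate as a starting point and verify it against throttling metrics.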

Step 3: DAX — Microsecond Read Caching

For product catalog reads (not writes), DAX provides in-memory caching at microsecond latency directly in front of DynamoDB:

resource "aws_dax_cluster" "product_catalog" {
  cluster_name       = "product-catalog-dax"
  node_type          = "dax.r4.large"
  replication_factor = 3    # 1 primary + 2 replicas for HA
  iam_role_arn       = aws_iam_role.dax_role.arn
  
  server_side_encryption {
    enabled = true
  }
}

import amazondax
import boto3

# Use the DAX client instead of the DynamoDB client for reads.
# The endpoint below is a placeholder; substitute your cluster's discovery endpoint.
dax = amazondax.AmazonDaxClient.resource(
    endpoints=['my-dax-cluster.us-east-1.dax-clusters.amazonaws.com:8111'],
    region_name='us-east-1'
)
table = dax.Table('product-inventory')

# This read is served from DAX cache if available
response = table.get_item(Key={'PK': 'FLASH-ITEM-001#SHARD0'})

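DAX is API-compatible with DynamoDB, so the application keeps the same calls and only swaps the client. The item cache TTL defaults to 5 minutes. Writes that bypass DAX (such as the direct update_item calls in Step 2) are not reflected in cached items until the TTL expires, so use DAX for catalog data that tolerates that staleness, or shorten the TTL.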

Step 4: SQS Buffering — Smooth Write Spikes

For extremely high write rates (flash sale: 10,000 add-to-cart per second), buffer writes through SQS:

import boto3
import json
import time

sqs = boto3.client('sqs')
# Placeholder URL. MessageGroupId is only valid on FIFO queues, whose names end in .fifo
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/inventory-updates.fifo'

def add_to_cart(product_id: str, user_id: str, quantity: int):
    # Fast path: just enqueue the request and return.
    # Assumes content-based deduplication is enabled on the queue;
    # otherwise a FIFO queue also requires a MessageDeduplicationId.
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({
            'product_id': product_id,
            'user_id': user_id,
            'quantity': quantity,
            'timestamp': int(time.time())
        }),
        MessageGroupId=product_id  # FIFO queue for ordering per product
    )
    return {'status': 'queued', 'message': 'Your cart is being updated'}

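# Lambda consumer: reads from SQS and writes to DynamoDB at a controlled rate
import json
import zlib

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('product-inventory')
SHARD_COUNT = 10  # must match the shard count used in Step 2

def process_inventory_batch(event, context):
    for record in event['Records']:
        body = json.loads(record['body'])
        # Python's hash() is salted per process, so use crc32 for a stable shard
        shard = zlib.crc32(body['user_id'].encode()) % SHARD_COUNT

        # update_item with ADD keeps the shard counter intact. A batch put_item
        # would overwrite it and fails on duplicate keys within one batch.
        # Assumption: add-to-cart reserves stock, so the counter is decremented.
        table.update_item(
            Key={'PK': f"{body['product_id']}#SHARD{shard}"},
            UpdateExpression='ADD quantity :delta',
            ExpressionAttributeValues={':delta': -body['quantity']}
        )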

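Result: DynamoDB sees a smooth write rate set by the Lambda consumer's concurrency and batch size instead of a 10,000 req/sec spike. With a FIFO queue, Lambda processes each message group in order, so concurrency is bounded by the number of active message groups (here, one per product). The cost is that cart updates become asynchronous: the user sees "queued" before the write lands.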
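
Buffering trades throttling for delay, and the size of that delay is simple arithmetic. The sketch below assumes a constant arrival rate during the burst, a constant drain rate, and no arrivals afterwards. The function name and the 60-second, 4,000 writes/sec figures are illustrative assumptions.

```python
def backlog_and_drain(arrival_rate: float, drain_rate: float, burst_seconds: float):
    """Return (peak backlog in messages, seconds to drain after the burst ends)."""
    if drain_rate <= 0:
        raise ValueError("drain_rate must be positive")
    backlog = max(0.0, (arrival_rate - drain_rate) * burst_seconds)
    return backlog, backlog / drain_rate

# Assumed: 10,000 msgs/sec arriving for a 60-second burst, consumer capped at 4,000/sec.
backlog, drain = backlog_and_drain(10_000, 4_000, 60)
print(f"peak backlog: {backlog:,.0f} messages, drained {drain:.0f}s after the burst")
# peak backlog: 360,000 messages, drained 90s after the burst
```

If the drain time is longer than users will tolerate for a cart update, raise the consumer rate (more shards, more concurrency) rather than relying on the queue alone.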

Step 5: ElastiCache for Read-Heavy Flash Sale Pages

Product page reads (inventory count, price) dominate traffic during a flash sale. Cache them in ElastiCache Redis with a short TTL:

import redis
import json

cache = redis.Redis(host='my-elasticache.cache.amazonaws.com', port=6379, ssl=True)

def get_product_inventory(product_id: str) -> dict:
    cache_key = f"inventory:{product_id}"
    
    # Check cache first
    cached = cache.get(cache_key)
    if cached:
        return json.loads(cached)
    
    # Cache miss — query DynamoDB (sum all shards)
    total = get_total_inventory(product_id)
    result = {'product_id': product_id, 'available': total}
    
    # Cache for 10 seconds (acceptable staleness for flash sale)
    cache.setex(cache_key, 10, json.dumps(result))
    return result
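
To see what a 10-second TTL buys, the following sketch replays the same read-through pattern against an in-memory stand-in for Redis with a fake clock, so it runs without Redis or DynamoDB. `TTLCache`, `load_inventory`, and the one-view-per-second traffic are hypothetical names and numbers used only for illustration.

```python
import time
from typing import Callable, Dict, Tuple

class TTLCache:
    """In-memory stand-in for Redis GET/SETEX, used only to illustrate TTL behavior."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: Dict[str, Tuple[float, dict]] = {}

    def get_or_load(self, key: str, loader: Callable[[], dict]) -> dict:
        entry = self._store.get(key)
        now = self.clock()
        if entry is not None and now < entry[0]:
            return entry[1]              # cache hit: no backing-store read
        value = loader()                 # cache miss: one backing-store read
        self._store[key] = (now + self.ttl, value)
        return value

# Fake clock and loader so the example is deterministic.
fake_now = [0.0]
backend_reads = []

def load_inventory() -> dict:
    backend_reads.append(fake_now[0])
    return {"product_id": "FLASH-ITEM-001", "available": 42}

ttl_cache = TTLCache(ttl_seconds=10, clock=lambda: fake_now[0])
for second in range(30):                 # one page view per second for 30 seconds
    fake_now[0] = float(second)
    ttl_cache.get_or_load("inventory:FLASH-ITEM-001", load_inventory)

print(len(backend_reads))  # 3 backing-store reads for 30 page views
```

The backing store is read once per TTL window per key regardless of page views, and the displayed count can be up to 10 seconds stale. When a hot key expires under real concurrency, many requests can miss at the same moment, so the actual read count at expiry may be higher than this single-threaded replay suggests.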

Fix Summary

| Problem | Root Cause | Fix | Time to Deploy |
| --- | --- | --- | --- |
| ThrottledRequests on writes | Hot partition (same product_id) | Write sharding (10 shards) | Immediate |
| High read latency | Every read hits DynamoDB | DAX cluster (microsecond cache) | < 1 day |
| Write spike overwhelms capacity | 10K req/sec burst | SQS buffering + Lambda consumer | < 1 day |
| Inventory reads under load | Flash sale traffic | ElastiCache Redis (10s TTL) | < 1 day |

Interview Angle
Interviewers want to hear that you understand the difference between a table-level capacity problem and a partition-level problem. Hot partitions throttle even when the table has plenty of overall capacity. Write sharding is the structural fix; on-demand mode is an emergency stop-gap that doesn’t solve the underlying distribution problem.
Services Used
  • DynamoDB
  • DAX (DynamoDB Accelerator)
  • SQS
  • CloudWatch
  • ElastiCache
Prerequisites
  • Understanding of DynamoDB partition key design
  • Basic familiarity with DynamoDB read/write capacity modes
What You Learned
  • How DynamoDB partitions data and why hot partitions occur
  • How write sharding eliminates hot partitions for high-volume writes to a few keys
  • When to use DAX vs ElastiCache for DynamoDB caching
  • How SQS buffering smooths write spikes
  • The trade-offs between provisioned, on-demand, and adaptive capacity
