
Lambda Function Timing Out Processing Kinesis Stream Records

Debug intermittent Lambda timeouts on a Kinesis stream consumer. Investigate iterator age, cold starts, poison pill records, and implement BisectBatchOnFunctionError with DLQ.

January 20, 2025 5 min read ~25 min to complete DB

The Situation

Your real-time analytics pipeline reads from a Kinesis stream and writes aggregated metrics to DynamoDB. Alerts show Lambda is timing out intermittently — not every invocation, but enough that the Kinesis iterator age keeps climbing. By the time you investigate, you're 45 minutes behind real-time. The pipeline processes 10,000 records per second at peak.

5 Steps
5 Services Used
~25 min Duration
Advanced Difficulty

The Problem

Lambda + Kinesis timeouts are particularly dangerous because, by default, Lambda retries a failed batch until it succeeds or the records expire from the stream. A single malformed record (poison pill) can block an entire shard, causing iterator age to climb and, if left unchecked, the records queued behind it to age out of the stream unprocessed (24-hour default retention).

Iterator Age Alert
[WARNING] Lambda function analytics-stream-processor
Metric: IteratorAgeMilliseconds — 2,700,000 ms (45 minutes behind)
Metric: Throttles — 0 (not a concurrency issue)
Metric: Errors — 12 in last hour
Timeout: 30 seconds

Step 1: Diagnose — Find the Timeout Pattern

# CloudWatch Logs Insights — find timeout events
aws logs start-query \
  --log-group-name /aws/lambda/analytics-stream-processor \
  --start-time $(date -d '2 hours ago' +%s) \
  --end-time $(date +%s) \
  --query-string '
    fields @timestamp, @message, @duration
    | filter @message like /Task timed out/ or @message like /Runtime exited/
    | sort @timestamp asc
    | limit 50
  '

# Get query results
aws logs get-query-results --query-id <query-id>

# Check iterator age trend (is it growing or stable?)
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name IteratorAge \
  --dimensions Name=FunctionName,Value=analytics-stream-processor \
  --statistics Maximum \
  --period 300 \
  --start-time $(date -d '2 hours ago' -u +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ)

A growing iterator age means Lambda can't keep up with the stream, or a shard is blocked by a failing batch. A stable but high iterator age means Lambda is keeping pace with new records but has not worked off an earlier backlog.
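
The trend check can also be scripted. The following is a minimal sketch, assuming you have already extracted the Maximum values from the datapoints and sorted them by Timestamp (get-metric-statistics does not return datapoints in chronological order). The function name, the labels, and the 60-second tolerance are illustrative assumptions, not part of any AWS API.

```python
from typing import List


def classify_iterator_age(samples_ms: List[float], tolerance_ms: float = 60_000) -> str:
    """Classify an ordered series of IteratorAge maxima (oldest first).

    tolerance_ms is an assumed noise threshold; tune it for your stream.
    """
    if len(samples_ms) < 2:
        return "insufficient-data"
    delta = samples_ms[-1] - samples_ms[0]
    if delta > tolerance_ms:
        return "growing"      # consumer is falling further behind
    if delta < -tolerance_ms:
        return "recovering"   # backlog is being worked off
    # Flat trend: either keeping pace with a backlog, or simply healthy
    return "stable-high" if samples_ms[-1] > tolerance_ms else "healthy"


if __name__ == "__main__":
    print(classify_iterator_age([600_000, 1_500_000, 2_700_000]))
```

A "growing" result points at throughput or a blocked shard; "stable-high" suggests the function keeps pace but needs extra capacity to catch up.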

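# Check stream-level iterator age from the Kinesis side.
# Per-shard values need enhanced (shard-level) monitoring: query the
# IteratorAgeMilliseconds metric with both StreamName and ShardId dimensions.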
aws cloudwatch get-metric-statistics \
  --namespace AWS/Kinesis \
  --metric-name GetRecords.IteratorAgeMilliseconds \
  --dimensions Name=StreamName,Value=analytics-stream \
  --statistics Maximum \
  --period 300 \
  --start-time $(date -d '2 hours ago' -u +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ)

Step 2: Identify the Root Cause

Scenario A: Poison pill record causing repeated timeouts

One malformed record forces Lambda to retry the entire batch containing it, timing out each time. Iterator age climbs because that shard is stuck.

Scenario B: Downstream bottleneck (DynamoDB throttling)

Lambda starts processing but waits 25+ seconds for DynamoDB writes to complete before timing out.

Scenario C: Lambda cold start eating into timeout budget

At low traffic, Lambda cold starts (300ms-3s) reduce effective processing time below the timeout.

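# Add timing instrumentation to identify which step is slow
import logging
import time

logger = logging.getLogger()
logger.setLevel(logging.INFO)  # without this, INFO lines are typically filtered out


def handler(event, context):
    records = event['Records']
    logger.info(f"Processing batch of {len(records)} records")

    start = time.perf_counter()

    # Step 1: Parse records (parse_records is your existing parsing helper)
    parsed = parse_records(records)
    logger.info(f"Parse took: {(time.perf_counter() - start) * 1000:.0f}ms")

    step2_start = time.perf_counter()

    # Step 2: Write to DynamoDB (write_to_dynamodb is your existing write helper)
    write_to_dynamodb(parsed)
    logger.info(f"DynamoDB write took: {(time.perf_counter() - step2_start) * 1000:.0f}ms")

    # Log the remaining budget so near-timeouts are visible before they fail
    logger.info(f"Total execution: {(time.perf_counter() - start) * 1000:.0f}ms, "
                f"remaining: {context.get_remaining_time_in_millis()}ms")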
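
For Scenario A, it can help to make the parse step tolerant of malformed records so that they are logged by sequence number instead of crashing the whole batch. The sketch below assumes the payloads are base64-encoded JSON (the encoding is the standard Kinesis event shape; the JSON part is an assumption about this pipeline). The name `safe_parse_records` is hypothetical, and unlike the `parse_records` call above it returns two values.

```python
import base64
import json
from typing import Any, Dict, List, Tuple


def safe_parse_records(
    records: List[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[str]]:
    """Decode Kinesis records, collecting sequence numbers of malformed ones."""
    parsed: List[Dict[str, Any]] = []
    bad_sequence_numbers: List[str] = []
    for record in records:
        kinesis = record["kinesis"]
        try:
            # Kinesis delivers the payload base64-encoded in the Lambda event
            payload = json.loads(base64.b64decode(kinesis["data"]))
        except ValueError:
            # Covers bad base64, invalid UTF-8, and invalid JSON
            bad_sequence_numbers.append(kinesis["sequenceNumber"])
            continue
        parsed.append(payload)
    return parsed, bad_sequence_numbers
```

This only catches records that fail fast. A record that makes processing hang until the timeout still needs the event source mapping settings in Step 3.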

Step 3: Fix — Tune Event Source Mapping

# Update the Kinesis event source mapping with best practices
#   --batch-size 100                         Down from 200: smaller batches = faster processing
#   --maximum-batching-window-in-seconds 5   Wait up to 5s to fill a batch
#   --parallelization-factor 10              Up to 10 concurrent invocations per shard
#   --bisect-batch-on-function-error         KEY: splits a failing batch to isolate poison pills
#   --maximum-retry-attempts 3               Don't retry forever
aws lambda update-event-source-mapping \
  --uuid "$(aws lambda list-event-source-mappings \
    --function-name analytics-stream-processor \
    --query 'EventSourceMappings[0].UUID' \
    --output text)" \
  --batch-size 100 \
  --maximum-batching-window-in-seconds 5 \
  --parallelization-factor 10 \
  --bisect-batch-on-function-error \
  --maximum-retry-attempts 3 \
  --destination-config '{
    "OnFailure": {
      "Destination": "arn:aws:sqs:us-east-1:123456789012:analytics-dlq"
    }
  }'

What each parameter does:

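| Parameter | Value | Why |
|---|---|---|
| batch-size | 100 | Smaller batches process faster and reduce timeout risk |
| parallelization-factor | 10 | Up to 10 concurrent Lambda invocations per shard |
| bisect-batch-on-function-error | true | On failure (including timeout): split the batch in half and retry each half. Isolates the poison pill to a single-record batch. |
| maximum-retry-attempts | 3 | Once retries are exhausted, discard the batch and notify the DLQ instead of blocking forever |
| DLQ destination | SQS | Metadata about the failed batch (shard ID, sequence number range) goes to SQS for investigation. The payloads stay in the stream until retention expires. |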
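
The bisect behavior described in the table can be modeled in a few lines. This is a simplified sketch of the idea, not Lambda's actual implementation: it ignores retry counting and the in-order processing Lambda maintains per shard. The function names and record labels are made up for illustration.

```python
from typing import Callable, List, Sequence


def bisect_until_isolated(
    batch: Sequence[str],
    process: Callable[[Sequence[str]], None],
) -> List[str]:
    """Return the records that still fail once split down to single-record batches."""
    try:
        process(batch)
        return []  # whole batch succeeded, nothing to isolate
    except Exception:
        if len(batch) == 1:
            return list(batch)  # isolated: this record is the poison pill
        mid = len(batch) // 2
        # Retry each half separately; only the half with the bad record fails again
        return (bisect_until_isolated(batch[:mid], process)
                + bisect_until_isolated(batch[mid:], process))


if __name__ == "__main__":
    calls = 0

    def process(batch: Sequence[str]) -> None:
        global calls
        calls += 1
        if "r37" in batch:  # hypothetical poison pill
            raise RuntimeError("simulated timeout")

    records = [f"r{i}" for i in range(100)]
    print(bisect_until_isolated(records, process), "after", calls, "invocations")
```

Because each split halves the failing batch, a single bad record in a batch of 100 is isolated after a handful of invocations rather than blocking all 100 records.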

Step 4: Fix Lambda Timeout and Memory

# Increase timeout and memory (more memory = more CPU = faster execution)
#   --timeout 120        Was 30s: gives more headroom for DynamoDB latency
#   --memory-size 1024   Was 256MB: more memory gives proportionally more CPU
aws lambda update-function-configuration \
  --function-name analytics-stream-processor \
  --timeout 120 \
  --memory-size 1024
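
A longer timeout gives headroom, but the handler can also watch its own budget. The sketch below uses the Lambda context method `get_remaining_time_in_millis()` to stop with a clear error before the hard timeout hits. The helper name and the 5-second safety margin are assumptions chosen for illustration.

```python
from typing import Any, Callable, Iterable, List

SAFETY_MARGIN_MS = 5_000  # assumed headroom; tune for your workload


def process_with_budget(
    items: Iterable[Any],
    process_one: Callable[[Any], None],
    context: Any,
    safety_margin_ms: int = SAFETY_MARGIN_MS,
) -> List[Any]:
    """Process items until remaining time drops below the margin, then raise."""
    done: List[Any] = []
    for item in items:
        if context.get_remaining_time_in_millis() < safety_margin_ms:
            # Failing explicitly leaves a readable log line and still lets the
            # event source mapping retry or bisect the batch
            raise TimeoutError(
                f"stopping after {len(done)} items to avoid a hard Lambda timeout"
            )
        process_one(item)
        done.append(item)
    return done
```

Raising instead of silently returning matters here: a successful return would checkpoint the batch and skip the unprocessed records.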

Use AWS Lambda Power Tuning to find the optimal memory setting:

# Deploy the Lambda Power Tuning state machine (open source tool)
# https://github.com/alexcasalboni/aws-lambda-power-tuning
aws cloudformation deploy \
  --template-file aws-lambda-power-tuning.yaml \
  --stack-name lambda-power-tuning \
  --capabilities CAPABILITY_IAM

# Run tuning for your function
aws stepfunctions start-execution \
  --state-machine-arn arn:aws:states:us-east-1:...:stateMachine:powerTuningStateMachine \
  --input '{
    "lambdaARN": "arn:aws:lambda:us-east-1:...:function:analytics-stream-processor",
    "powerValues": [128, 256, 512, 1024, 2048, 3008],
    "num": 50,
    "payload": {},
    "parallelInvocation": true,
    "strategy": "cost"
  }'

Step 5: Provisioned Concurrency — Eliminate Cold Starts

If cold starts are contributing to timeouts, enable Provisioned Concurrency:

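# Keep 5 execution environments warm
# --qualifier must be a published version or alias (here an alias named production),
# and the event source mapping must target that same qualifier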
aws lambda put-provisioned-concurrency-config \
  --function-name analytics-stream-processor \
  --qualifier production \
  --provisioned-concurrent-executions 5

For Kinesis workloads, you typically need Provisioned Concurrency only during scale-out events when Lambda needs to create many new execution environments simultaneously.

DLQ — Handle Failed Records

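Set up an SQS DLQ and process failure notifications for alerting and reprocessing. For Kinesis sources, the message Lambda sends contains metadata about the failed batch (shard ID and sequence number range), not the record payloads:

# Lambda to process DLQ messages and alert
import json
import logging
from datetime import datetime, timezone

import boto3

logger = logging.getLogger()
logger.setLevel(logging.INFO)
s3 = boto3.client('s3')


def process_analytics_dlq(event, context):
    for record in event['Records']:
        body = json.loads(record['body'])

        # Failure metadata: why the batch failed and where it sits in the stream
        request_context = body.get('requestContext', {})
        batch_info = body.get('KinesisBatchInfo', {})

        logger.error(
            f"Failed batch on shard {batch_info.get('shardId')}: "
            f"sequence numbers {batch_info.get('startSequenceNumber')} to "
            f"{batch_info.get('endSequenceNumber')}, "
            f"condition={request_context.get('condition')}"
        )

        # Alert on-call if DLQ depth grows
        # (CloudWatch alarm on ApproximateNumberOfMessagesVisible > 100)

        # Optionally: store the metadata for manual reprocessing
        s3.put_object(
            Bucket='analytics-failed-records',
            Key=f"failed/{datetime.now(timezone.utc).isoformat()}/{record['messageId']}.json",
            Body=record['body']
        )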
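
For Kinesis sources, the on-failure message carries batch metadata in a `KinesisBatchInfo` object (shard ID, start and end sequence numbers, stream ARN) rather than the payloads, so reprocessing means reading the records back from the stream before they expire. The following is a rough sketch of that read, assuming a boto3 Kinesis client is passed in. The function name and the page size of 100 are illustrative, and error handling for already-expired records is left out.

```python
from typing import Any, Dict, List


def fetch_failed_records(kinesis_client: Any, batch_info: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Re-read the failed batch from the stream using its sequence number range."""
    # Stream ARNs end in "stream/<name>"; the name is what StreamName expects
    stream_name = batch_info["streamArn"].split("/")[-1]
    end_seq = int(batch_info["endSequenceNumber"])

    iterator = kinesis_client.get_shard_iterator(
        StreamName=stream_name,
        ShardId=batch_info["shardId"],
        ShardIteratorType="AT_SEQUENCE_NUMBER",
        StartingSequenceNumber=batch_info["startSequenceNumber"],
    )["ShardIterator"]

    found: List[Dict[str, Any]] = []
    while iterator:
        page = kinesis_client.get_records(ShardIterator=iterator, Limit=100)
        for rec in page["Records"]:
            if int(rec["SequenceNumber"]) > end_seq:
                return found  # past the end of the failed batch
            found.append(rec)
        # An empty page at the shard tip means there is nothing more to read
        if not page["Records"] and page.get("MillisBehindLatest", 0) == 0:
            break
        iterator = page.get("NextShardIterator")
    return found
```

In a real handler you would call it as `fetch_failed_records(boto3.client("kinesis"), body["KinesisBatchInfo"])` and then inspect or replay the returned records.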

Optimized Configuration Summary

| Parameter | Before | After | Reason |
|---|---|---|---|
| Timeout | 30s | 120s | DynamoDB latency headroom |
| Memory | 256MB | 1024MB | More CPU for processing |
| Batch size | 200 | 100 | Smaller = faster per batch |
| Parallelization factor | 1 | 10 | Up to 10× concurrent invocations per shard |
| BisectBatchOnFunctionError | disabled | enabled | Isolates poison pills |
| Max retry attempts | unlimited | 3 | Prevents infinite blocking |
| DLQ | none | SQS | Failed batch metadata preserved |
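
To see why the batch size and parallelization factor changes matter at 10,000 records per second, a back-of-the-envelope calculation helps. The sketch below assumes 10 shards (the minimum for that ingest rate at 1,000 records per second per shard in provisioned mode; the real shard count is not stated here), full batches, and an even spread of records across shards.

```python
def max_seconds_per_batch(
    records_per_second: float,
    shards: int,
    parallelization_factor: int,
    batch_size: int,
) -> float:
    """Average time each invocation may take before the consumer falls behind."""
    concurrent_batches = shards * parallelization_factor
    batches_per_second = records_per_second / batch_size
    return concurrent_batches / batches_per_second


if __name__ == "__main__":
    # Before: factor 1, batch size 200 -> 10 concurrent batches, 50 batches/s needed
    print(max_seconds_per_batch(10_000, 10, 1, 200))   # 0.2
    # After: factor 10, batch size 100 -> 100 concurrent batches, 100 batches/s needed
    print(max_seconds_per_batch(10_000, 10, 10, 100))  # 1.0
```

Under these assumptions, the original settings left about 0.2 seconds per batch of 200 records at peak, while the tuned settings allow about 1 second per batch of 100.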

Interview Angle

The key insight for interviewers: BisectBatchOnFunctionError is the most underused Lambda Kinesis feature. When a batch of 100 records times out, Lambda bisects it: it tries records 1-50, then 51-100. If 1-50 fails, it tries 1-25, then 26-50. This binary search eventually isolates the single poison pill record, which is reported to the DLQ once its retries are exhausted. Without it, one bad record can block a shard until the record expires from the stream.

Services Used

  • Lambda
  • Kinesis Data Streams
  • CloudWatch Logs Insights
  • SQS (DLQ)
  • DynamoDB

Prerequisites

  • Familiarity with Lambda event source mappings for Kinesis
  • Basic understanding of Kinesis shards and iterator age

What You Learned

  • How to use CloudWatch Logs Insights to find timeout patterns
  • What iterator age tells you about pipeline health
  • How BisectBatchOnFunctionError isolates poison pill records
  • How to tune BatchSize, parallelization factor, and timeout
  • When to use Provisioned Concurrency to eliminate cold start timeouts
