Lambda Function Timing Out Processing Kinesis Stream Records
Debug intermittent Lambda timeouts on a Kinesis stream consumer. Investigate iterator age, cold starts, poison pill records, and implement BisectBatchOnFunctionError with DLQ.
Your real-time analytics pipeline reads from a Kinesis stream and writes aggregated metrics to DynamoDB. Alerts show Lambda is timing out intermittently — not every invocation, but enough that the Kinesis iterator age keeps climbing. By the time you investigate, you're 45 minutes behind real-time. The pipeline processes 10,000 records per second at peak.
The Problem
Lambda + Kinesis timeouts are particularly dangerous because, by default, Lambda retries a failed batch until it succeeds or the records expire. A single malformed record (poison pill) can block an entire shard: iterator age climbs and, if left unchecked, the unread records age out of the stream (24-hour default retention) and are lost.
analytics-stream-processor
- Metric: IteratorAgeMilliseconds = 2,700,000 ms (45 minutes behind)
- Metric: Throttles = 0 (not a concurrency issue)
- Metric: Errors = 12 in last hour
- Timeout: 30 seconds
Step 1: Diagnose — Find the Timeout Pattern
# CloudWatch Logs Insights — find timeout events
aws logs start-query \
--log-group-name /aws/lambda/analytics-stream-processor \
--start-time $(date -d '2 hours ago' +%s) \
--end-time $(date +%s) \
--query-string '
fields @timestamp, @message, @duration
| filter @message like /Task timed out/ or @message like /Runtime exited/
| sort @timestamp asc
| limit 50
'
# Get query results
aws logs get-query-results --query-id <query-id>
# Check iterator age trend (is it growing or stable?)
aws cloudwatch get-metric-statistics \
--namespace AWS/Lambda \
--metric-name IteratorAge \
--dimensions Name=FunctionName,Value=analytics-stream-processor \
--statistics Maximum \
--period 300 \
--start-time $(date -d '2 hours ago' -u +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ)
Growing iterator age = Lambda can't keep up with the stream, or at least one shard is blocked. Stable but high = Lambda is matching the incoming rate but has not worked off an earlier backlog.
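To make the growing-versus-stable distinction concrete, here is a minimal sketch that classifies a series of `IteratorAge` maxima such as the ones returned by the command above. The function names, the tolerance, and the sample values are illustrative assumptions, not part of any AWS tooling. One useful property: iterator age cannot grow faster than wall-clock time, so a slope close to 1,000 ms per second suggests that at least one shard is making no progress at all.

```python
from typing import List, Optional

RETENTION_MS = 24 * 60 * 60 * 1000  # default Kinesis retention (24 hours)


def lag_slope_ms_per_s(max_ages_ms: List[float], period_s: int = 300) -> float:
    """Average change in iterator age (ms) per second across the series (oldest first)."""
    elapsed_s = period_s * (len(max_ages_ms) - 1)
    return (max_ages_ms[-1] - max_ages_ms[0]) / elapsed_s


def iterator_age_trend(max_ages_ms: List[float], period_s: int = 300,
                       tolerance_ms_per_s: float = 50.0) -> str:
    """Classify the series as growing, recovering, or stable (tolerance is an assumed value)."""
    if len(max_ages_ms) < 2:
        return "unknown"
    slope = lag_slope_ms_per_s(max_ages_ms, period_s)
    if slope > tolerance_ms_per_s:
        return "growing"
    if slope < -tolerance_ms_per_s:
        return "recovering"
    return "stable"


def seconds_until_expiry(max_ages_ms: List[float], period_s: int = 300) -> Optional[float]:
    """Rough time until the oldest unread record ages out; None if lag is not growing."""
    if len(max_ages_ms) < 2:
        return None
    slope = lag_slope_ms_per_s(max_ages_ms, period_s)
    if slope <= 0:
        return None
    return (RETENTION_MS - max_ages_ms[-1]) / slope


# Hypothetical 5-minute maxima ending at the 45-minute lag from the alert
samples = [1_800_000, 2_100_000, 2_400_000, 2_700_000]
print(iterator_age_trend(samples), seconds_until_expiry(samples))
```

With these sample values the lag grows by one second per second, which is the signature of a fully blocked shard rather than a consumer that is merely slow.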
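# Check stream-level iterator age as reported by Kinesis.
# For per-shard values, enable enhanced (shard-level) monitoring and query
# IteratorAgeMilliseconds with an additional ShardId dimension.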
aws cloudwatch get-metric-statistics \
--namespace AWS/Kinesis \
--metric-name GetRecords.IteratorAgeMilliseconds \
--dimensions Name=StreamName,Value=analytics-stream \
--statistics Maximum \
--period 300 \
--start-time $(date -d '2 hours ago' -u +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ)
Step 2: Identify the Root Cause
Scenario A: Poison pill record causing repeated timeouts
One malformed record forces Lambda to retry the entire batch containing it, timing out each time. Iterator age climbs because that shard is stuck.
Scenario B: Downstream bottleneck (DynamoDB throttling)
Lambda starts processing but waits 25+ seconds for DynamoDB writes to complete before timing out.
Scenario C: Lambda cold start eating into timeout budget
At low traffic, Lambda cold starts (300ms-3s) reduce effective processing time below the timeout.
# Add timing instrumentation to identify which step is slow
import time
import logging
import boto3

logger = logging.getLogger()
logger.setLevel(logging.INFO)  # Lambda's root logger defaults to WARNING, which hides info logs

def handler(event, context):
    records = event['Records']
    logger.info(f"Processing batch of {len(records)} records")
    start = time.time()

    # Step 1: Parse records (parse_records is your existing helper)
    parsed = parse_records(records)
    logger.info(f"Parse took: {(time.time()-start)*1000:.0f}ms")

    step2_start = time.time()
    # Step 2: Write to DynamoDB (write_to_dynamodb is your existing helper)
    write_to_dynamodb(parsed)
    logger.info(f"DynamoDB write took: {(time.time()-step2_start)*1000:.0f}ms")
    logger.info(f"Total execution: {(time.time()-start)*1000:.0f}ms")
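Timing logs are lost when the runtime kills the function mid-step, so it can help to stop work shortly before the deadline and report how far the batch got. The following is a rough sketch: `process_until_deadline` and the 5-second margin are assumptions for illustration, while `context.get_remaining_time_in_millis()` is the method the Lambda Python runtime provides on the context object.

```python
from typing import Any, Callable, Iterable, List

SAFETY_MARGIN_MS = 5_000  # assumed buffer; tune it to your slowest single step


def process_until_deadline(items: Iterable[Any], process: Callable[[Any], None],
                           context: Any, margin_ms: int = SAFETY_MARGIN_MS) -> List[Any]:
    """Process items in order until the remaining invocation time drops below margin_ms.

    Returns the items that were NOT processed (empty list if everything finished).
    """
    pending = list(items)
    for index, item in enumerate(pending):
        if context.get_remaining_time_in_millis() < margin_ms:
            return pending[index:]
        process(item)
    return []
```

In a handler you could log the number of leftover items and then raise, so the batch is retried with a clear error message instead of a bare "Task timed out" line.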
Step 3: Fix — Tune Event Source Mapping
# Update the Kinesis event source mapping with best practices
#   --batch-size 100                         down from 200; smaller batches = faster processing
#   --maximum-batching-window-in-seconds 5   wait up to 5s to fill a batch
#   --parallelization-factor 10              up to 10 concurrent invocations per shard
#   --bisect-batch-on-function-error         KEY: splits a failing batch to isolate poison pills
#   --maximum-retry-attempts 3               don't retry forever
aws lambda update-event-source-mapping \
  --uuid "$(aws lambda list-event-source-mappings \
    --function-name analytics-stream-processor \
    --query 'EventSourceMappings[0].UUID' \
    --output text)" \
  --batch-size 100 \
  --maximum-batching-window-in-seconds 5 \
  --parallelization-factor 10 \
  --bisect-batch-on-function-error \
  --maximum-retry-attempts 3 \
  --destination-config '{
    "OnFailure": {
      "Destination": "arn:aws:sqs:us-east-1:123456789012:analytics-dlq"
    }
  }'
What each parameter does:
| Parameter | Value | Why |
|---|---|---|
| batch-size | 100 | Smaller batches process faster and reduce timeout risk |
| parallelization-factor | 10 | Up to 10 concurrent Lambda invocations per shard |
| bisect-batch-on-function-error | true | On error or timeout: split the batch in half and retry each half. Isolates the poison pill to a single-record batch. |
| maximum-retry-attempts | 3 | Once retries are exhausted, send to the DLQ instead of blocking forever |
| DLQ destination | SQS | Metadata for the failed batch (shard ID, sequence number range) goes to SQS for investigation; fetch the records from the stream before they expire |
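The bisect behaviour in the table can be illustrated with a small simulation. This is a simplified model of the splitting idea under the assumption that a batch fails whenever it contains a bad record; it is not Lambda's actual implementation and ignores retry limits, ordering guarantees, and timing. The function and variable names are hypothetical.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def bisect_failures(batch: Sequence[T],
                    succeeds: Callable[[Sequence[T]], bool]) -> List[T]:
    """Return the records that still fail when retried as a single-record batch."""
    if not batch or succeeds(batch):
        return []
    if len(batch) == 1:
        return [batch[0]]  # cannot split further: this is the poison pill
    mid = len(batch) // 2
    # Retry each half independently, the way a bisected batch is re-invoked
    return bisect_failures(batch[:mid], succeeds) + bisect_failures(batch[mid:], succeeds)


# Hypothetical batch of 100 records where record 37 always causes a failure
records = list(range(1, 101))
poison = 37
print(bisect_failures(records, lambda b: poison not in b))
```

Every half that does not contain the bad record succeeds on its first retry, so healthy records are processed while the failing range keeps shrinking.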
Step 4: Fix Lambda Timeout and Memory
# Increase timeout and memory (more memory = more CPU = faster execution)
#   --timeout 120        was 30s; gives more headroom for DynamoDB latency
#   --memory-size 1024   was 256MB; more memory gives proportionally more CPU
aws lambda update-function-configuration \
  --function-name analytics-stream-processor \
  --timeout 120 \
  --memory-size 1024
Use AWS Lambda Power Tuning to find the optimal memory setting:
# Deploy the Lambda Power Tuning state machine (open source tool)
# https://github.com/alexcasalboni/aws-lambda-power-tuning
aws cloudformation deploy \
--template-file aws-lambda-power-tuning.yaml \
--stack-name lambda-power-tuning \
--capabilities CAPABILITY_IAM
# Run tuning for your function
aws stepfunctions start-execution \
--state-machine-arn arn:aws:states:us-east-1:...:stateMachine:powerTuningStateMachine \
--input '{
"lambdaARN": "arn:aws:lambda:us-east-1:...:function:analytics-stream-processor",
"powerValues": [128, 256, 512, 1024, 2048, 3008],
"num": 50,
"payload": {},
"parallelInvocation": true,
"strategy": "cost"
}'
Step 5: Provisioned Concurrency — Eliminate Cold Starts
If cold starts are contributing to timeouts, enable Provisioned Concurrency:
# Keep 5 execution environments warm
aws lambda put-provisioned-concurrency-config \
--function-name analytics-stream-processor \
--qualifier production \
--provisioned-concurrent-executions 5
For Kinesis workloads, you typically need Provisioned Concurrency only during scale-out events when Lambda needs to create many new execution environments simultaneously.
DLQ — Handle Failed Records
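Set up an SQS DLQ and process the failure messages for alerting and reprocessing. For a Kinesis source, each message describes the failed batch (shard ID and sequence number range); it does not contain the record data itself: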
# Lambda to process DLQ messages and alert
import json
import logging
from datetime import datetime, timezone
import boto3

logger = logging.getLogger()
s3 = boto3.client('s3')

def process_analytics_dlq(event, context):
    for record in event['Records']:
        body = json.loads(record['body'])
        # Parse the failed batch details (metadata only, not the record payloads)
        request_context = body.get('requestContext', {})
        batch_info = body.get('KinesisBatchInfo', {})
        logger.error(f"Failed batch on shard {batch_info.get('shardId')} "
                     f"(sequence {batch_info.get('startSequenceNumber')} to "
                     f"{batch_info.get('endSequenceNumber')}): "
                     f"{request_context.get('condition')}")
        # Alert on-call if DLQ depth grows
        # (CloudWatch alarm on ApproximateNumberOfMessagesVisible > 100)
        # Optionally: store the metadata for manual reprocessing
        s3.put_object(
            Bucket='analytics-failed-records',
            Key=f"failed/{datetime.now(timezone.utc).isoformat()}/{record['messageId']}.json",
            Body=record['body']
        )
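Because the on-failure message for a Kinesis source carries batch metadata rather than payloads, reprocessing means reading the records back from the stream. The sketch below builds the arguments for the Kinesis `GetShardIterator` call from a message body, assuming the body follows the documented invocation-record layout with a `KinesisBatchInfo` object. The helper name is hypothetical.

```python
import json
from typing import Any, Dict


def shard_iterator_args(sqs_body: str) -> Dict[str, Any]:
    """Turn an on-failure message body into keyword arguments for GetShardIterator."""
    info = json.loads(sqs_body)["KinesisBatchInfo"]
    return {
        # streamArn has the form arn:aws:kinesis:<region>:<account>:stream/<name>
        "StreamName": info["streamArn"].split("/", 1)[1],
        "ShardId": info["shardId"],
        # Start reading at the first record of the failed batch
        "ShardIteratorType": "AT_SEQUENCE_NUMBER",
        "StartingSequenceNumber": info["startSequenceNumber"],
    }
```

With boto3 this could be used as `kinesis.get_shard_iterator(**shard_iterator_args(record['body']))`, followed by `get_records` on the returned iterator. The read has to happen before the records pass the stream's retention period.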
Optimized Configuration Summary
| Parameter | Before | After | Reason |
|---|---|---|---|
| Timeout | 30s | 120s | DynamoDB latency headroom |
| Memory | 256MB | 1024MB | More CPU for processing |
| Batch size | 200 | 100 | Smaller = faster per batch |
| Parallelization factor | 1 | 10 | Up to 10× concurrent batches per shard |
| BisectBatchOnFunctionError | disabled | enabled | Isolates poison pills |
| Max retry attempts | unlimited | 3 | Prevents infinite blocking |
| DLQ | none | SQS | Failed batch metadata preserved |
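A quick back-of-the-envelope check shows what these settings imply for the 10,000 records per second peak. The sketch assumes 10 shards (the minimum for that rate given the 1,000 records per second write limit per shard; the article does not state the actual shard count) and records spread evenly across shards. The function names are illustrative.

```python
import math


def min_shards(records_per_s: float, per_shard_limit: int = 1000) -> int:
    """Fewest shards that can ingest the given rate (1,000 records/s per shard)."""
    return math.ceil(records_per_s / per_shard_limit)


def required_batch_latency_s(records_per_s: float, shards: int,
                             parallelization_factor: int, batch_size: int) -> float:
    """Longest average time per batch that still keeps up with the incoming rate."""
    concurrent_batches = shards * parallelization_factor  # max simultaneous invocations
    batches_per_s = records_per_s / batch_size            # batches arriving each second
    return concurrent_batches / batches_per_s


shards = min_shards(10_000)
print(shards)
print(required_batch_latency_s(10_000, shards, parallelization_factor=1, batch_size=100))
print(required_batch_latency_s(10_000, shards, parallelization_factor=10, batch_size=100))
```

Under these assumptions each invocation has to average about one second per 100-record batch to keep up at peak, compared with a tenth of a second at a parallelization factor of 1. The 120-second timeout only provides headroom for outliers; it does not change the sustained rate the function needs to hit.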
BisectBatchOnFunctionError is the most underused Lambda Kinesis feature. When a batch of 100 records times out, Lambda bisects it and retries records 1-50 and 51-100 as separate batches. If 1-50 fails again, it splits into 1-25 and 26-50, and so on. This binary search eventually isolates the single poison pill record; once its retries are exhausted, Lambda sends the failure details to the DLQ and the shard moves on. Without it, one bad record can block a shard indefinitely.

- How to use CloudWatch Logs Insights to find timeout patterns
- What iterator age tells you about pipeline health
- How BisectBatchOnFunctionError isolates poison pill records
- How to tune BatchSize, parallelization factor, and timeout
- When to use Provisioned Concurrency to eliminate cold start timeouts