Lambda Function Timing Out Processing Kinesis Stream Records

The Problem

Lambda + Kinesis timeouts are particularly dangerous because, by default, Lambda retries a failed batch until its records expire from the stream. A single malformed record (a poison pill) can block an entire shard: the iterator age keeps climbing and, if left unchecked, unprocessed records are lost once they age out of the stream (24-hour default retention).

Iterator Age Alert

[WARNING] Lambda function analytics-stream-processor
Metric: IteratorAgeMilliseconds — 2,700,000 ms (45 minutes behind)
Metric: Throttles — 0 (not a concurrency issue)
Metric: Errors — 12 in last hour
Timeout: 30 seconds
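
To put the alert in perspective, the arithmetic below estimates how long the oldest unprocessed record has left before it expires. It is a minimal sketch that assumes the default 24-hour retention and a fully blocked shard (so the iterator age grows in real time); the function name is illustrative.

```python
from datetime import timedelta


def time_until_data_loss(iterator_age_ms: int, retention_hours: int = 24) -> timedelta:
    """Time left before the oldest unprocessed record ages out of the stream."""
    remaining = timedelta(hours=retention_hours) - timedelta(milliseconds=iterator_age_ms)
    # Never report a negative duration: at zero, records are already expiring.
    return max(remaining, timedelta(0))


# Value taken from the alert above: 2,700,000 ms = 45 minutes behind
print(time_until_data_loss(2_700_000))  # 23:15:00
```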

Step 1: Diagnose — Find the Timeout Pattern

# CloudWatch Logs Insights — find timeout events
aws logs start-query \
  --log-group-name /aws/lambda/analytics-stream-processor \
  --start-time $(date -d '2 hours ago' +%s) \
  --end-time $(date +%s) \
  --query-string '
    fields @timestamp, @message, @duration
    | filter @message like /Task timed out/ or @message like /Runtime exited/
    | sort @timestamp asc
    | limit 50
  '

# Get query results
aws logs get-query-results --query-id <query-id>

# Check iterator age trend (is it growing or stable?)
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name IteratorAge \
  --dimensions Name=FunctionName,Value=analytics-stream-processor \
  --statistics Maximum \
  --period 300 \
  --start-time $(date -d '2 hours ago' -u +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ)

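A growing iterator age means Lambda can’t keep up with the stream. A stable but high iterator age means Lambda is keeping pace with new records but is not clearing an existing backlog, often because it fell behind on a specific shard.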
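
The same reading can be automated. The sketch below classifies the `Datapoints` list returned by `get-metric-statistics`; the 10% tolerance and the 60-second "high" threshold are arbitrary assumptions to tune for your workload, and the function name is illustrative.

```python
def classify_iterator_age(datapoints, tolerance=0.10, high_ms=60_000):
    """Classify an IteratorAge trend from CloudWatch datapoints.

    Each datapoint is a dict with "Timestamp" and "Maximum" keys.
    CloudWatch does not guarantee ordering, so sort by timestamp first.
    """
    ordered = sorted(datapoints, key=lambda d: d["Timestamp"])
    values = [d["Maximum"] for d in ordered]
    if len(values) < 2:
        return "insufficient data"
    first, last = values[0], values[-1]
    if last > first * (1 + tolerance):
        return "growing"          # Lambda is falling further behind
    if last < first * (1 - tolerance):
        return "recovering"       # backlog is draining
    if last >= high_ms:
        return "stable but high"  # keeping pace, but backlog is not clearing
    return "healthy"
```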

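# Check stream-level iterator age as reported by Kinesis.
# For a per-shard view, enable enhanced (shard-level) monitoring and query the
# IteratorAgeMilliseconds metric with an additional Name=ShardId,Value=<shard-id> dimension.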
aws cloudwatch get-metric-statistics \
  --namespace AWS/Kinesis \
  --metric-name GetRecords.IteratorAgeMilliseconds \
  --dimensions Name=StreamName,Value=analytics-stream \
  --statistics Maximum \
  --period 300 \
  --start-time $(date -d '2 hours ago' -u +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ)

Step 2: Identify the Root Cause

Scenario A: Poison pill record causing repeated timeouts

One malformed record forces Lambda to retry the entire batch containing it, timing out each time. Iterator age climbs because that shard is stuck.
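
One way to keep a malformed record from failing the whole batch is to parse defensively and set bad records aside. The sketch below assumes the payloads are JSON (the article does not say what format the stream carries) and uses an illustrative function name; the `kinesis.data` and `kinesis.sequenceNumber` fields are part of the standard Kinesis event that Lambda delivers.

```python
import base64
import json


def split_parseable(records):
    """Decode Kinesis records, separating parseable payloads from malformed ones."""
    good, bad = [], []
    for record in records:
        try:
            # Kinesis record data arrives base64-encoded in the Lambda event
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
            good.append(payload)
        except (KeyError, ValueError) as exc:
            # ValueError covers bad base64, invalid UTF-8, and invalid JSON
            bad.append({
                "sequenceNumber": record.get("kinesis", {}).get("sequenceNumber"),
                "error": str(exc),
            })
    return good, bad
```

Logging or storing the `bad` list and continuing with `good` lets the batch succeed, at the cost of handling the rejected records out of band.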

Scenario B: Downstream bottleneck (DynamoDB throttling)

Lambda starts processing but waits 25+ seconds for DynamoDB writes to complete before timing out.
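
If throttled writes are the bottleneck, bounding the retries keeps one slow batch from consuming the whole timeout. The sketch below is one possible shape, not the article's implementation: `batch_write` is a hypothetical callable that wraps DynamoDB `BatchWriteItem` for one chunk and returns the items DynamoDB reported as unprocessed. The 25-item chunk size matches the `BatchWriteItem` request limit.

```python
import time


def write_in_batches(items, batch_write, max_attempts=5, base_delay=0.05, sleep=time.sleep):
    """Write items in chunks of 25, retrying unprocessed items with exponential backoff."""
    for i in range(0, len(items), 25):
        pending = items[i:i + 25]
        for attempt in range(max_attempts):
            pending = batch_write(pending)  # hypothetical: returns unprocessed items
            if not pending:
                break
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 50 ms, 100 ms, 200 ms, ...
        else:
            # Fail fast so the error is visible instead of ending in a timeout
            raise RuntimeError(
                f"{len(pending)} items still unprocessed after {max_attempts} attempts"
            )
```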

Scenario C: Lambda cold start eating into timeout budget

At low traffic, Lambda cold starts (300ms-3s) reduce effective processing time below the timeout.
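
To confirm whether timeouts coincide with cold starts, a module-level flag is a common trick: module code runs once per execution environment, so the first invocation can be tagged. A minimal sketch, with an illustrative handler and return value:

```python
import logging
import time

logger = logging.getLogger()

# Module scope runs once per execution environment, i.e. once per cold start
_cold_start = True
_module_loaded_at = time.time()


def handler(event, context):
    global _cold_start
    was_cold, _cold_start = _cold_start, False
    if was_cold:
        elapsed_ms = (time.time() - _module_loaded_at) * 1000
        logger.warning("Cold start: %.0f ms elapsed since module load", elapsed_ms)
    # ... process the batch here ...
    return {"coldStart": was_cold}
```

Filtering the logs for the "Cold start" line next to "Task timed out" entries shows whether the two line up.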

# Add timing instrumentation to identify which step is slow
import logging
import time

logger = logging.getLogger()
logger.setLevel(logging.INFO)  # The root logger defaults to WARNING, which hides info logs

def handler(event, context):
    records = event['Records']
    logger.info(f"Processing batch of {len(records)} records")

    start = time.time()

    # Step 1: Parse records (parse_records is your own parsing function)
    parsed = parse_records(records)
    logger.info(f"Parse took: {(time.time()-start)*1000:.0f}ms")

    step2_start = time.time()

    # Step 2: Write to DynamoDB (write_to_dynamodb is your own write function)
    write_to_dynamodb(parsed)
    logger.info(f"DynamoDB write took: {(time.time()-step2_start)*1000:.0f}ms")

    logger.info(f"Total execution: {(time.time()-start)*1000:.0f}ms")
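
Alongside per-step timings, the Lambda context object exposes `get_remaining_time_in_millis()`, which can be checked before starting a slow step. The helper below is a sketch; its name and the 2-second safety margin are assumptions.

```python
def has_time_for(context, needed_ms, safety_ms=2000):
    """Return True if the invocation has enough time left to run a step.

    `context` is the Lambda context object passed to the handler.
    """
    return context.get_remaining_time_in_millis() > needed_ms + safety_ms
```

Calling this before the DynamoDB write, and logging a clear error when it returns False, makes the cause visible in the logs instead of a bare "Task timed out".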

Step 3: Fix — Tune Event Source Mapping

# Update the Kinesis event source mapping with best practices.
# Comments cannot follow a trailing backslash, so the flags are explained here:
#   --batch-size 100                         Down from 200; smaller batches = faster processing
#   --maximum-batching-window-in-seconds 5   Wait up to 5s to fill a batch
#   --parallelization-factor 10              Up to 10 concurrent invocations per shard
#   --bisect-batch-on-function-error         KEY: splits a failing batch to isolate poison pills
#   --maximum-retry-attempts 3               Don't retry forever
aws lambda update-event-source-mapping \
  --uuid "$(aws lambda list-event-source-mappings \
    --function-name analytics-stream-processor \
    --query 'EventSourceMappings[0].UUID' \
    --output text)" \
  --batch-size 100 \
  --maximum-batching-window-in-seconds 5 \
  --parallelization-factor 10 \
  --bisect-batch-on-function-error \
  --maximum-retry-attempts 3 \
  --destination-config '{
    "OnFailure": {
      "Destination": "arn:aws:sqs:us-east-1:123456789012:analytics-dlq"
    }
  }'

What each parameter does:

Parameter	Value	Why
`batch-size`	100	Smaller batches process faster and reduce timeout risk
`parallelization-factor`	10	Up to 10 concurrent Lambda invocations per shard
`bisect-batch-on-function-error`	true	On timeout: split batch in half, retry each half. Isolates the poison pill to a single-record batch.
`maximum-retry-attempts`	3	After 3 retries, discard the batch and notify the DLQ instead of blocking forever
DLQ destination	SQS	Metadata about each failed batch (shard ID, sequence number range) goes to SQS for investigation; the records themselves stay in the stream until they expire
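
The same settings can be applied from Python. The sketch below only builds the keyword arguments for boto3's `update_event_source_mapping`; the function name is illustrative, and the UUID and queue ARN are inputs you supply.

```python
def build_mapping_update(mapping_uuid, dlq_arn):
    """Keyword arguments mirroring the CLI flags in the table above."""
    return {
        "UUID": mapping_uuid,
        "BatchSize": 100,
        "MaximumBatchingWindowInSeconds": 5,
        "ParallelizationFactor": 10,
        "BisectBatchOnFunctionError": True,
        "MaximumRetryAttempts": 3,
        "DestinationConfig": {"OnFailure": {"Destination": dlq_arn}},
    }
```

With boto3 installed and credentials configured, this would be passed as `boto3.client("lambda").update_event_source_mapping(**build_mapping_update(uuid, arn))`.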

Step 4: Fix Lambda Timeout and Memory

# Increase timeout and memory (more memory = more CPU = faster execution)
#   --timeout 120        Was 30s; gives more headroom for DynamoDB latency
#   --memory-size 1024   Was 256MB; more memory gives proportionally more CPU
aws lambda update-function-configuration \
  --function-name analytics-stream-processor \
  --timeout 120 \
  --memory-size 1024

Use AWS Lambda Power Tuning to find the optimal memory setting:

# Deploy the Lambda Power Tuning state machine (open source tool)
# https://github.com/alexcasalboni/aws-lambda-power-tuning
aws cloudformation deploy \
  --template-file aws-lambda-power-tuning.yaml \
  --stack-name lambda-power-tuning \
  --capabilities CAPABILITY_IAM

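# Run tuning for your function.
# Replace the empty "payload" with a representative Kinesis event; a handler that
# reads event['Records'] fails on {} and the measurements would be meaningless.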
aws stepfunctions start-execution \
  --state-machine-arn arn:aws:states:us-east-1:...:stateMachine:powerTuningStateMachine \
  --input '{
    "lambdaARN": "arn:aws:lambda:us-east-1:...:function:analytics-stream-processor",
    "powerValues": [128, 256, 512, 1024, 2048, 3008],
    "num": 50,
    "payload": {},
    "parallelInvocation": true,
    "strategy": "cost"
  }'

Step 5: Provisioned Concurrency — Eliminate Cold Starts

If cold starts are contributing to timeouts, enable Provisioned Concurrency:

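# Keep 5 execution environments warm.
# Provisioned Concurrency requires a published version or alias ("production" here is an alias),
# and the event source mapping must target that same alias to benefit from it.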
aws lambda put-provisioned-concurrency-config \
  --function-name analytics-stream-processor \
  --qualifier production \
  --provisioned-concurrent-executions 5

For Kinesis workloads, you typically need Provisioned Concurrency only during scale-out events when Lambda needs to create many new execution environments simultaneously.
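
For sizing, the number of concurrent invocations a Kinesis event source can drive is the shard count multiplied by the parallelization factor. A small sketch of that arithmetic; the shard count in the example is an assumption, since the article does not state it.

```python
def max_kinesis_concurrency(shard_count: int, parallelization_factor: int = 1) -> int:
    """Upper bound on concurrent invocations from one Kinesis event source mapping."""
    if not 1 <= parallelization_factor <= 10:
        raise ValueError("parallelization factor must be between 1 and 10")
    return shard_count * parallelization_factor


# Example: a hypothetical 4-shard stream with the factor of 10 used above
print(max_kinesis_concurrency(4, 10))  # 40
```

Provisioning far fewer environments than this bound means scale-out invocations can still hit cold starts.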

DLQ — Handle Failed Records

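Set up an SQS DLQ and process its messages for alerting and reprocessing. For Kinesis sources, each message describes the failed batch (shard ID and sequence number range) rather than containing the records, so reprocessing means reading them back from the stream before they expire: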

# Lambda to process DLQ messages and alert
import json
import logging
from datetime import datetime, timezone

import boto3

logger = logging.getLogger()
s3 = boto3.client('s3')

def process_analytics_dlq(event, context):
    for record in event['Records']:
        body = json.loads(record['body'])

        # The message holds batch metadata only, not the Kinesis records themselves
        batch_info = body.get('KinesisBatchInfo', {})
        failure = body.get('requestContext', {})

        logger.error(f"Failed batch from shard {batch_info.get('shardId')}: "
                     f"sequence numbers {batch_info.get('startSequenceNumber')} to "
                     f"{batch_info.get('endSequenceNumber')}, "
                     f"condition {failure.get('condition')}")

        # Alert on-call if DLQ depth grows
        # (CloudWatch alarm on ApproximateNumberOfMessagesVisible > 100)

        # Optionally: store the batch metadata for manual reprocessing
        s3.put_object(
            Bucket='analytics-failed-records',
            Key=f"failed/{datetime.now(timezone.utc).isoformat()}/{record['messageId']}.json",
            Body=record['body']
        )
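
For Kinesis sources, the on-failure message carries a `KinesisBatchInfo` object (shard ID, start and end sequence numbers, stream ARN) instead of the record data, so reprocessing requires reading the records back from the stream. The sketch below shows one way to do that with the Kinesis `get_shard_iterator` and `get_records` calls. The function name is illustrative, `kinesis` is assumed to be a `boto3.client("kinesis")`, and the call fails if the records have already aged out of the stream.

```python
def fetch_failed_records(kinesis, batch_info, page_limit=1000):
    """Read back the records described by a KinesisBatchInfo dict."""
    # Stream ARNs end in "stream/<name>"
    stream_name = batch_info["streamArn"].split("/")[-1]
    end_seq = int(batch_info["endSequenceNumber"])

    iterator = kinesis.get_shard_iterator(
        StreamName=stream_name,
        ShardId=batch_info["shardId"],
        ShardIteratorType="AT_SEQUENCE_NUMBER",
        StartingSequenceNumber=batch_info["startSequenceNumber"],
    )["ShardIterator"]

    found = []
    while iterator:
        page = kinesis.get_records(ShardIterator=iterator, Limit=page_limit)
        for rec in page["Records"]:
            # Sequence numbers are decimal strings; compare them as integers
            if int(rec["SequenceNumber"]) > end_seq:
                return found
            found.append(rec)
        if not page["Records"] and page.get("MillisBehindLatest", 0) == 0:
            break  # reached the tip of the shard without passing end_seq
        iterator = page.get("NextShardIterator")
    return found
```

Each returned record includes `Data` as raw bytes, which can then be fed through the normal parsing path or stored for manual inspection.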

Optimized Configuration Summary

Parameter	Before	After	Reason
Timeout	30s	120s	DynamoDB latency headroom
Memory	256MB	1024MB	More CPU for processing
Batch size	200	100	Smaller = faster per batch
Parallelization factor	1	10	Up to 10× throughput per shard
BisectBatchOnFunctionError	disabled	enabled	Isolates poison pills
Max retry attempts	unlimited	3	Prevents infinite blocking
DLQ	none	SQS	Failed batch metadata preserved

Interview Angle

The key insight for interviewers: BisectBatchOnFunctionError is the most underused Lambda Kinesis feature. When a batch of 100 records times out, Lambda bisects it: tries records 1-50, then 51-100. If 1-50 fails, tries 1-25, then 26-50. This binary search eventually isolates the single poison pill record. Once its retries are exhausted, Lambda sends that record's batch metadata to the DLQ and moves on. Without bisecting and a retry limit, one bad record can block a shard until the record expires.
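
The bisecting behavior described above can be simulated in a few lines. This is a simplified model, not Lambda's actual implementation: it ignores retry attempts and assumes a batch fails exactly when it contains a poison record.

```python
def isolate_poison_pill(batch, is_poison):
    """Simulate bisect-on-error; return (poison_records, invocation_count)."""
    invocations = 0
    poison = []
    stack = [list(batch)]
    while stack:
        current = stack.pop()
        invocations += 1  # one Lambda invocation per (sub-)batch
        if not any(is_poison(r) for r in current):
            continue  # batch succeeded
        if len(current) == 1:
            poison.append(current[0])  # isolated: this is what lands in the DLQ
            continue
        mid = len(current) // 2
        stack.append(current[mid:])   # second half, processed later
        stack.append(current[:mid])   # first half, processed next
    return poison, invocations


# Records numbered 1-100 with record 37 as the poison pill
print(isolate_poison_pill(range(1, 101), lambda r: r == 37))  # ([37], 15)
```

In this model, 15 invocations are enough to isolate one bad record out of 100, and the other 99 records are processed successfully along the way.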
