EKS Pods Getting OOMKilled in Production — Diagnose and Fix
Your EKS pods are being OOMKilled frequently. Walk through diagnosing memory limits, finding leaks, using VPA recommendations, and implementing Karpenter for better bin-packing.
Your EKS production cluster is sending PagerDuty alerts every 2-3 hours: pods in the payment-service deployment are getting OOMKilled and restarting. Customers see intermittent 503 errors during the restart window. The memory limit is set to 512Mi, but you're not sure whether that's too low or whether there's a memory leak.
The Problem
OOMKilled (exit code 137) means the Linux kernel’s Out-Of-Memory killer terminated your container because it exceeded its memory limit. The cause is one of two things:
- Limits set too low — the app needs more memory than you’ve allocated
- Memory leak — the app consumes ever-increasing memory over time until it’s killed
The restart itself causes the 503 errors. The fix depends on which root cause you’re dealing with.
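Exit codes above 128 conventionally mean "terminated by signal (code - 128)", which is why an OOM kill shows up as 137 (128 + 9, SIGKILL). The helper below is a minimal sketch of that arithmetic; `describe_exit_code` is a hypothetical name, and it assumes a Linux host. Note that 137 only tells you the process received SIGKILL; the `OOMKilled` reason in the pod status is what confirms the memory cause.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a container exit code using the 128 + signal-number convention."""
    if code > 128:
        try:
            sig = signal.Signals(code - 128)
        except ValueError:
            return f"killed by unknown signal {code - 128}"
        return f"killed by {sig.name} (signal {sig.value})"
    return f"exited on its own with status {code}"


print(describe_exit_code(137))  # killed by SIGKILL (signal 9)
```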
payment-service-7d4f8b9-xk2p restarted 4 times in 30 minutes.
Exit code: 137 (OOMKilled)
Last restart: 14:32 UTC
Customer impact: ~340 failed payment attempts during restart windows
Step 1: Confirm OOMKilled and Get Timeline
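# Find pods whose last container termination was an OOM kill (requires jq)
kubectl get pods -A -o json | jq -r '.items[]
  | select(any(.status.containerStatuses[]?; .lastState.terminated.reason == "OOMKilled"))
  | "\(.metadata.namespace)/\(.metadata.name)"'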
# Check restart counts — high restarts = chronic OOM
kubectl get pods -n production --sort-by='.status.containerStatuses[0].restartCount'
# Describe the pod to see OOM details
kubectl describe pod payment-service-7d4f8b9-xk2p -n production
The describe output will show:
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 15 Jan 2024 14:29:12 +0000
      Finished:     Mon, 15 Jan 2024 14:32:44 +0000
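If you would rather script the check than read describe output pod by pod, the same fields are available in the JSON from `kubectl get pods -o json`. The following is a rough sketch that scans that structure for containers whose last termination reason was `OOMKilled`; the function name `find_oomkilled` and the inline sample data are illustrative assumptions, not output from a real cluster.

```python
def find_oomkilled(pod_list: dict) -> list[tuple[str, str, str, int]]:
    """Return (namespace, pod, container, restartCount) for OOMKilled containers,
    sorted with the most-restarted container first."""
    hits = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        statuses = pod.get("status", {}).get("containerStatuses") or []
        for cs in statuses:
            terminated = (cs.get("lastState") or {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                hits.append((
                    meta.get("namespace", ""),
                    meta.get("name", ""),
                    cs.get("name", ""),
                    cs.get("restartCount", 0),
                ))
    return sorted(hits, key=lambda h: h[3], reverse=True)


# Hand-written sample in the shape of `kubectl get pods -A -o json` (illustrative only).
sample = {
    "items": [
        {
            "metadata": {"namespace": "production", "name": "payment-service-7d4f8b9-xk2p"},
            "status": {"containerStatuses": [{
                "name": "payment-service",
                "restartCount": 4,
                "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            }]},
        },
        {
            "metadata": {"namespace": "production", "name": "healthy-pod"},
            "status": {"containerStatuses": [{
                "name": "app", "restartCount": 0, "lastState": {},
            }]},
        },
    ]
}

for namespace, pod, container, restarts in find_oomkilled(sample):
    print(f"{namespace}/{pod} [{container}] restarts={restarts}")
```

Against a live cluster, the dictionary could come from `json.load(sys.stdin)` with the kubectl output piped in.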
Step 2: Check Actual Memory Usage vs Limits
# Real-time pod memory usage
kubectl top pods -n production --sort-by=memory
# Check node memory pressure
kubectl describe nodes | grep -A 10 "Conditions:"
# Get per-container resource limits
kubectl get pods payment-service-7d4f8b9-xk2p -n production \
-o jsonpath='{.spec.containers[*].resources}' | python3 -m json.tool
In CloudWatch Container Insights:
# Query pod memory working set vs limit (Logs Insights).
# start-query returns a queryId; fetch results with:
#   aws logs get-query-results --query-id <queryId>
aws logs start-query \
  --log-group-name /aws/containerinsights/prod-cluster/performance \
  --start-time $(date -d '2 hours ago' +%s) \
  --end-time $(date +%s) \
  --query-string '
    fields @timestamp, PodName, pod_memory_working_set, pod_memory_limit
    | filter Type = "Pod" and PodName like "payment-service"
    | stats max(pod_memory_working_set), max(pod_memory_limit) by PodName
  '
If pod_memory_working_set regularly reaches pod_memory_limit, the limit is too low. If memory grows steadily over hours without stabilizing, it’s a leak.
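The distinction above (plateau near the limit versus steady growth) can be roughed out in code. This sketch fits a least-squares slope to evenly spaced memory samples and applies two thresholds; the function name, the 90% "near limit" cutoff, and the 5 MiB-per-sample growth cutoff are all illustrative assumptions you would tune for your own sampling interval and workload.

```python
from statistics import mean


def classify_memory(samples_mib, limit_mib, near_limit=0.9, growth_mib_per_sample=5.0):
    """Classify evenly spaced working-set samples (MiB) against a memory limit.

    Thresholds are hypothetical defaults, not recommendations.
    """
    n = len(samples_mib)
    if n < 2:
        raise ValueError("need at least two samples")
    xs = range(n)
    x_bar, y_bar = mean(xs), mean(samples_mib)
    # Ordinary least-squares slope: MiB gained per sample interval.
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, samples_mib))
        / sum((x - x_bar) ** 2 for x in xs)
    )
    if slope >= growth_mib_per_sample:
        return "possible leak"
    if max(samples_mib) >= near_limit * limit_mib:
        return "limit likely too low"
    return "looks healthy"


print(classify_memory([200, 230, 260, 290, 320, 350], limit_mib=512))  # possible leak
print(classify_memory([490, 485, 495, 488, 492, 487], limit_mib=512))  # limit likely too low
print(classify_memory([300, 305, 298, 302, 301, 299], limit_mib=512))  # looks healthy
```

A slope test like this is only a first filter: a cache that is still warming up also grows, so confirm a suspected leak with profiling before acting on it.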
Step 3: Right-Size Memory Limits With VPA
Don’t guess. Use the Vertical Pod Autoscaler in Recommendation mode to get data-driven limits:
# vpa-recommendation.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payment-service-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  updatePolicy:
    updateMode: "Off"  # Recommendation only; don't auto-apply yet
kubectl apply -f vpa-recommendation.yaml
# After running for 24-48 hours:
kubectl get vpa payment-service-vpa -n production -o yaml | grep -A 20 recommendation
Output shows:
recommendation:
  containerRecommendations:
    - containerName: payment-service
      lowerBound:
        memory: 384Mi
      target:
        memory: 640Mi   # Recommended request; size the limit from this
      upperBound:
        memory: 1024Mi
Apply the recommendation, using the lowerBound for the request and the target plus headroom for the limit:
resources:
  requests:
    memory: "384Mi"
    cpu: "250m"
  limits:
    memory: "768Mi"  # ~1.2× the VPA target for headroom
    cpu: "500m"
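The arithmetic behind those numbers is simple enough to script. The sketch below takes a VPA container recommendation shaped like the output above and derives memory settings using the lowerBound as the request and the target times a headroom factor as the limit. The function names are hypothetical, the parser handles only the `Ki`/`Mi`/`Gi` suffixes, and the 1.2 factor is the one used in this example rather than a universal constant.

```python
import math

_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '640Mi' to bytes (binary suffixes only)."""
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: already plain bytes


def to_mi(num_bytes: int) -> str:
    """Format bytes as whole MiB, rounding up."""
    return f"{math.ceil(num_bytes / 1024**2)}Mi"


def memory_from_vpa(rec: dict, headroom: float = 1.2) -> dict:
    """Derive memory request/limit from one VPA container recommendation."""
    request = parse_memory(rec["lowerBound"]["memory"])
    limit = int(parse_memory(rec["target"]["memory"]) * headroom)
    return {
        "requests": {"memory": to_mi(request)},
        "limits": {"memory": to_mi(limit)},
    }


recommendation = {
    "lowerBound": {"memory": "384Mi"},
    "target": {"memory": "640Mi"},
    "upperBound": {"memory": "1024Mi"},
}
print(memory_from_vpa(recommendation))
# {'requests': {'memory': '384Mi'}, 'limits': {'memory': '768Mi'}}
```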
Step 4: Detect Memory Leaks
If memory grows continuously over time (VPA’s upperBound keeps climbing), there’s a leak.
Java (Spring Boot / JVM):
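# Enable heap dump on OOM (add to JVM flags).
# Note: the dump is written on a Java OutOfMemoryError, not on a kernel OOM kill
# (SIGKILL), so cap the heap below the container limit. 75.0 is an example value.
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/tmp/heapdump.hprof
-XX:MaxRAMPercentage=75.0
# Check GC pressure in CloudWatch
# Look for high gc.pause.total or increasing heap after GC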
Node.js:
# Run with heap profiling enabled
node --expose-gc --inspect app.js
# Or use clinic.js in a staging environment
npx clinic heapprofiler -- node app.js
Python:
# Add tracemalloc for memory profiling
import tracemalloc
tracemalloc.start()
# ... run code ...
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')
for stat in top_stats[:10]:
    print(stat)
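A single snapshot shows where memory is held, but a leak is about growth, so it often helps to compare two snapshots taken some time apart. The sketch below uses the standard library's `Snapshot.compare_to` for that; `leaky_cache` and `handle_request` are made-up stand-ins for real application code, with the leak introduced deliberately.

```python
import tracemalloc

leaky_cache = []  # deliberately unbounded, to simulate a leak


def handle_request(i: int) -> None:
    # Each "request" stores roughly 10 KB that is never released.
    leaky_cache.append("x" * 10_000 + str(i))


tracemalloc.start()
before = tracemalloc.take_snapshot()

for i in range(1_000):
    handle_request(i)

after = tracemalloc.take_snapshot()
tracemalloc.stop()

# Differences per source line, largest change in size first.
top_growth = after.compare_to(before, "lineno")
for stat in top_growth[:3]:
    print(stat)
```

The first entry should point at the `append` line in `handle_request`, which is the kind of signal to look for in a real service.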
Common memory leak causes:
- Event listeners added but never removed
- Unbounded caches growing without eviction policy
- Database connection pools not returning connections
- Prometheus metrics registered inside request handlers (duplicates accumulate)
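To illustrate the unbounded-cache item in the list above, here is a minimal sketch of a cache with a least-recently-used eviction policy built on `collections.OrderedDict`. The class name `BoundedCache` and the size of 100 entries are illustrative assumptions; for caching function results, `functools.lru_cache(maxsize=...)` in the standard library does a similar job.

```python
from collections import OrderedDict


class BoundedCache:
    """A small LRU cache: memory use is capped by max_entries."""

    def __init__(self, max_entries: int):
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._max = max_entries
        self._data = OrderedDict()

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self._max:
            self._data.popitem(last=False)  # evict the least recently used entry

    def __len__(self) -> int:
        return len(self._data)


cache = BoundedCache(max_entries=100)
for i in range(1_000):
    cache.put(i, f"payload-{i}")

print(len(cache))  # 100
```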
Step 5: Set Resource Requests Correctly
OOMKilled also happens when nodes are oversubscribed — many pods each have low requests but spike simultaneously. Setting requests correctly prevents over-scheduling:
resources:
  requests:
    memory: "384Mi"  # Scheduler uses this for placement
    cpu: "250m"
  limits:
    memory: "768Mi"  # OOM killer triggers here
    cpu: "1000m"     # CPU throttling (not OOMKill)
Rule of thumb: Set requests to the VPA lowerBound and limits to about 1.2× the VPA target, as in the example above (640Mi × 1.2 = 768Mi).
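To make the oversubscription risk concrete, the sketch below sums requests and limits for the pods on one node and compares them with the node's allocatable memory. The scheduler only guarantees that requests fit, so a limit-to-allocatable ratio well above 1 means simultaneous spikes cannot all be satisfied. The function name, the 14,000 MiB allocatable figure, and the pod count are hypothetical.

```python
def overcommit_report(allocatable_mib: int, pods: list[dict]) -> dict:
    """Summarize memory requests and limits on a node against allocatable memory."""
    total_requests = sum(p["request_mib"] for p in pods)
    total_limits = sum(p["limit_mib"] for p in pods)
    return {
        "requests_mib": total_requests,
        "limits_mib": total_limits,
        # What the scheduler checks: do the requests fit?
        "requests_fit": total_requests <= allocatable_mib,
        # What can hurt you: how far limits exceed real capacity.
        "limit_overcommit": round(total_limits / allocatable_mib, 2),
    }


# Hypothetical node with 14,000 MiB allocatable, running 30 identical pods.
pods = [{"request_mib": 384, "limit_mib": 768} for _ in range(30)]
print(overcommit_report(14_000, pods))
# {'requests_mib': 11520, 'limits_mib': 23040, 'requests_fit': True, 'limit_overcommit': 1.65}
```

Here every pod schedules fine, yet if they all approached their limits at once they would need about 1.65 times the memory the node has.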
Step 6: Karpenter — Smarter Node Bin-Packing
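Karpenter provisions nodes based on the resource requests of pending pods, not on their live usage. It reduces node-level memory pressure by choosing nodes with enough memory for the pods scheduled on them, which only works if your requests are accurate: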
# Note: Provisioner (v1alpha5) is the legacy API; Karpenter v0.32+ replaces it
# with NodePool and EC2NodeClass, so adapt this to the version you run.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: production
spec:
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["m5.xlarge", "m5.2xlarge", "m6i.xlarge", "m6i.2xlarge"]
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
  limits:
    resources:
      memory: "1Ti"  # Max total memory across Karpenter-managed nodes
  ttlSecondsAfterEmpty: 30
Karpenter chooses instance types that fit your pod’s memory requests without wasting capacity — reducing both OOM risk and cost.
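As a toy illustration of bin-packing by memory request, the sketch below places pending pods onto nodes using first-fit decreasing and compares 16 GiB nodes (m5.xlarge, m6i.xlarge) with 32 GiB nodes (m5.2xlarge, m6i.2xlarge). This is not Karpenter's actual algorithm, which also weighs CPU, price, pod limits per node, and scheduling constraints; the function name, the pending pod mix, and the 10% reserved-memory overhead are assumptions made for the example.

```python
def first_fit_decreasing(requests_mib: list[int], node_capacity_mib: int) -> list[list[int]]:
    """Pack memory requests onto as few equal-sized nodes as a simple greedy pass finds."""
    nodes: list[list[int]] = []  # requests placed on each node
    free: list[int] = []         # remaining capacity per node
    for req in sorted(requests_mib, reverse=True):
        if req > node_capacity_mib:
            raise ValueError(f"request {req}Mi does not fit on any node")
        for i, remaining in enumerate(free):
            if req <= remaining:
                nodes[i].append(req)
                free[i] -= req
                break
        else:
            # No existing node has room: open a new one.
            nodes.append([req])
            free.append(node_capacity_mib - req)
    return nodes


# Hypothetical pending pods: 20 small, 4 medium, 2 large (memory requests in MiB).
pending = [384] * 20 + [2048] * 4 + [4096] * 2

# Assume roughly 10% of instance memory is reserved for the OS and kubelet.
usable_16gib = int(16 * 1024 * 0.9)
usable_32gib = int(32 * 1024 * 0.9)

print(len(first_fit_decreasing(pending, usable_16gib)))  # 2
print(len(first_fit_decreasing(pending, usable_32gib)))  # 1
```

The packing works from requests alone, so requests that understate real usage produce a node that looks well packed on paper and runs out of memory in practice.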
Summary
| Root Cause | Fix | Timeline |
|---|---|---|
| Limits too low | Set limits based on VPA recommendation | This sprint |
| Memory leak | Profile with heap dump / tracemalloc | This sprint |
| Requests too low → oversubscription | Set requests = VPA lowerBound | Today |
| Bad bin-packing | Deploy Karpenter | This quarter |
The key points: understand the difference between requests (scheduling hint) and limits (hard cap), use VPA recommendations before guessing, and look for leaks before raising limits blindly.
- How to diagnose OOMKilled pods using kubectl and CloudWatch
- The difference between memory requests and limits, and why both matter
- How to use VPA in recommendation mode before committing to new limits
- How to detect memory leaks in Java, Node.js, and Python applications
- How Karpenter improves node bin-packing to prevent unnecessary OOM events