
EKS Pods Getting OOMKilled in Production — Diagnose and Fix

Your EKS pods are being OOMKilled frequently. Walk through diagnosing memory limits, finding leaks, using VPA recommendations, and implementing Karpenter for better bin-packing.

January 20, 2025 4 min read ~30 min to complete DB
The Situation

Your EKS production cluster is sending PagerDuty alerts every 2-3 hours: pods in the payment-service deployment are getting OOMKilled and restarting. Customers see intermittent 503 errors during the restart window. The memory limit is set to 512Mi, but you're not sure whether that's too low or whether the service has a memory leak.

6 Steps
4 Services Used
~30 min Duration
Advanced Difficulty

The Problem

OOMKilled (exit code 137) means the Linux kernel’s Out-Of-Memory killer terminated your container because it exceeded its memory limit. The cause is usually one of two things:

  1. Limits set too low — the app needs more memory than you’ve allocated
  2. Memory leak — the app consumes ever-increasing memory over time until it’s killed

The restart itself causes the 503 errors. The fix depends on which root cause you’re dealing with.
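
Exit code 137 follows the shell convention of 128 plus the signal number, and signal 9 is SIGKILL. The sketch below decodes that arithmetic; the helper name describe_exit_code is hypothetical. Note that 137 alone only proves the process received SIGKILL. It is the Reason: OOMKilled field in the pod status that ties it to the OOM killer.

```python
import signal


def describe_exit_code(exit_code: int) -> str:
    """Translate a container exit code into a short description.

    Codes above 128 conventionally mean "terminated by signal (code - 128)".
    """
    if exit_code > 128:
        try:
            sig = signal.Signals(exit_code - 128)
        except ValueError:
            return f"killed by unknown signal {exit_code - 128}"
        return f"killed by {sig.name} (signal {sig.value})"
    return f"exited on its own with status {exit_code}"


if __name__ == "__main__":
    print(describe_exit_code(137))
```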

OOMKilled Alert
[CRITICAL] payment-service pod payment-service-7d4f8b9-xk2p restarted 4 times in 30 minutes.
Exit code: 137 (OOMKilled)
Last restart: 14:32 UTC
Customer impact: ~340 failed payment attempts during restart windows

Step 1: Confirm OOMKilled and Get Timeline

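# Find all pods with a container whose last termination was an OOM kill (requires jq)
kubectl get pods -A -o json | jq -r '
  .items[]
  | select(any(.status.containerStatuses[]?; .lastState.terminated.reason == "OOMKilled"))
  | "\(.metadata.namespace)/\(.metadata.name)"'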

# Check restart counts — high restarts = chronic OOM
kubectl get pods -n production --sort-by='.status.containerStatuses[0].restartCount'

# Describe the pod to see OOM details
kubectl describe pod payment-service-7d4f8b9-xk2p -n production

The describe output will show:

Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
  Started:      Mon, 15 Jan 2024 14:29:12 +0000
  Finished:     Mon, 15 Jan 2024 14:32:44 +0000
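
If you would rather script this check than read describe output pod by pod, the same fields are available in the JSON from kubectl get pods -o json. The following is a minimal sketch that scans such a document for containers whose last state was OOMKilled. The SAMPLE data and the function name find_oom_killed are illustrative assumptions, not real cluster output.

```python
import json

# Hypothetical, trimmed-down stand-in for `kubectl get pods -n production -o json`
SAMPLE = """
{"items": [
  {"metadata": {"namespace": "production", "name": "payment-service-7d4f8b9-xk2p"},
   "status": {"containerStatuses": [
     {"name": "payment-service", "restartCount": 4,
      "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}}]}},
  {"metadata": {"namespace": "production", "name": "healthy-pod"},
   "status": {"containerStatuses": [
     {"name": "app", "restartCount": 0, "lastState": {}}]}}
]}
"""


def find_oom_killed(pod_list: dict) -> list:
    """Return (namespace, pod, container, restarts) for OOMKilled containers."""
    hits = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        statuses = pod.get("status", {}).get("containerStatuses") or []
        for cs in statuses:
            terminated = cs.get("lastState", {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                hits.append((meta.get("namespace"), meta.get("name"),
                             cs.get("name"), cs.get("restartCount", 0)))
    # Most-restarted containers first: these are the chronic offenders
    return sorted(hits, key=lambda h: h[3], reverse=True)


if __name__ == "__main__":
    for hit in find_oom_killed(json.loads(SAMPLE)):
        print(hit)
```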

Step 2: Check Actual Memory Usage vs Limits

# Real-time pod memory usage
kubectl top pods -n production --sort-by=memory

# Check node memory pressure
kubectl describe nodes | grep -A 10 "Conditions:"

# Get per-container resource limits
kubectl get pods payment-service-7d4f8b9-xk2p -n production \
  -o jsonpath='{.spec.containers[*].resources}' | python3 -m json.tool

In CloudWatch Container Insights:

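# Query pod memory working set vs limit (Logs Insights).
# Container Insights pod-level events carry Type="Pod" and a PodName field.
# Note: `date -d` is GNU date syntax; on macOS use `date -v-2H +%s`.
aws logs start-query \
  --log-group-name /aws/containerinsights/prod-cluster/performance \
  --start-time $(date -d '2 hours ago' +%s) \
  --end-time $(date +%s) \
  --query-string '
    fields @timestamp, PodName, pod_memory_working_set, pod_memory_limit
    | filter Type = "Pod" and PodName like "payment-service"
    | stats max(pod_memory_working_set), max(pod_memory_limit) by PodName
  '

# start-query only returns a queryId; fetch the results with it
aws logs get-query-results --query-id <queryId>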

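If pod_memory_working_set plateaus at or near pod_memory_limit shortly after startup, the limit is too low. If it keeps climbing for hours without leveling off, suspect a leak. The query above reports only the peak; to see the trend, also group the stats by bin(5m).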
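
The plateau-versus-climb distinction can be checked mechanically once you have a series of memory samples. The sketch below fits a least-squares slope to the samples and applies two thresholds. Both thresholds (90% of the limit, 1 MiB of growth per sample) and the function name classify_memory_trend are assumptions chosen for illustration, so tune them to your sampling interval.

```python
def classify_memory_trend(samples_mib, limit_mib,
                          near_limit=0.9, leak_slope_mib_per_sample=1.0):
    """Classify a memory time series as a leak, an undersized limit, or healthy.

    samples_mib: working-set readings in MiB, evenly spaced in time.
    """
    n = len(samples_mib)
    if n < 2:
        raise ValueError("need at least two samples to estimate a trend")

    # Ordinary least-squares slope of memory against sample index
    mean_x = (n - 1) / 2
    mean_y = sum(samples_mib) / n
    cov = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples_mib))
    var = sum((i - mean_x) ** 2 for i in range(n))
    slope = cov / var

    if slope >= leak_slope_mib_per_sample:
        return "possible leak"
    if max(samples_mib) >= near_limit * limit_mib:
        return "limit likely too low"
    return "healthy"


if __name__ == "__main__":
    print(classify_memory_trend([200, 250, 300, 350, 400, 450], limit_mib=512))
    print(classify_memory_trend([480, 478, 482, 479, 481, 480], limit_mib=512))
```

A steady climb is only a hint: a cache that is still warming up looks the same over a short window, so compare several hours of data before concluding there is a leak.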

Step 3: Right-Size Memory Limits With VPA

Don’t guess. Use the Vertical Pod Autoscaler in Recommendation mode to get data-driven limits:

# vpa-recommendation.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payment-service-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  updatePolicy:
    updateMode: "Off"   # Recommendation only; don't auto-apply yet

# Create the VPA object
kubectl apply -f vpa-recommendation.yaml

# After running for 24-48 hours:
kubectl get vpa payment-service-vpa -n production -o yaml | grep -A 20 recommendation

Output shows:

recommendation:
  containerRecommendations:
  - containerName: payment-service
    lowerBound:
      memory: 384Mi
    target:
      memory: 640Mi       # Base your new limit on this value
    upperBound:
      memory: 1024Mi

Size the container from the recommendation, taking the request from lowerBound and the limit from target plus headroom:

resources:
  requests:
    memory: "384Mi"
    cpu: "250m"
  limits:
    memory: "768Mi"   # ~1.2× the VPA target for headroom
    cpu: "500m"
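
The headroom arithmetic is easy to get wrong by hand, so here is a small sketch of it. It parses a Kubernetes binary memory quantity, applies a percentage of headroom, and rounds up to a tidy step. The helper names, the 120% default, and the 64Mi rounding step are assumptions for illustration; only the binary suffixes (Ki, Mi, Gi, Ti) are handled.

```python
_BINARY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '640Mi' to bytes (binary suffixes only)."""
    for suffix, factor in _BINARY_UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # no suffix: already plain bytes


def limit_with_headroom(target: str, headroom_pct: int = 120,
                        step_mib: int = 64) -> str:
    """Scale the VPA target by headroom_pct and round up to step_mib."""
    needed = parse_memory(target) * headroom_pct // 100
    step = step_mib * 2**20
    steps = -(-needed // step)  # integer ceiling division
    return f"{steps * step_mib}Mi"


if __name__ == "__main__":
    print(limit_with_headroom("640Mi"))
```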

Step 4: Detect Memory Leaks

If memory grows continuously over time (VPA’s upperBound keeps climbing), there’s a leak.

Java (Spring Boot / JVM):

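# Enable heap dump on OOM (add to JVM flags).
# Note: this fires on a Java OutOfMemoryError, not on a kernel OOMKill (SIGKILL),
# so cap the heap below the container limit to make the JVM fail first.
# The 75.0 value is an example; write the dump to a mounted volume so it survives the restart.
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/tmp/heapdump.hprof
-XX:MaxRAMPercentage=75.0

# Check GC pressure in CloudWatch
# Look for high gc.pause.total or increasing heap after GC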

Node.js:

# Run with heap profiling enabled
node --expose-gc --inspect app.js

# Or use clinic.js in a staging environment
npx clinic heapprofiler -- node app.js

Python:

# Add tracemalloc for memory profiling
import tracemalloc
tracemalloc.start()
# ... run code ...
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')
for stat in top_stats[:10]:
    print(stat)

Common memory leak causes:

  • Event listeners added but never removed
  • Unbounded caches growing without eviction policy
  • Database connection pools not returning connections
  • Prometheus metrics registered inside request handlers (duplicates accumulate)
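
To make the unbounded-cache case concrete, here is a minimal sketch of a cache with a least-recently-used eviction policy, so that memory stays flat however many distinct keys arrive. The class name BoundedCache and its interface are illustrative assumptions; for caching pure function results, functools.lru_cache in the standard library covers the same need.

```python
from collections import OrderedDict


class BoundedCache:
    """Dictionary-like cache that evicts the least recently used entry."""

    def __init__(self, max_entries: int):
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._max = max_entries
        self._data = OrderedDict()

    def get(self, key, default=None):
        if key in self._data:
            self._data.move_to_end(key)  # mark as most recently used
            return self._data[key]
        return default

    def put(self, key, value) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self._max:
            self._data.popitem(last=False)  # drop the oldest entry

    def __len__(self) -> int:
        return len(self._data)


if __name__ == "__main__":
    cache = BoundedCache(max_entries=3)
    for i in range(10):
        cache.put(i, f"value-{i}")
    print(len(cache), cache.get(9), cache.get(0))
```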

Step 5: Set Resource Requests Correctly

OOMKilled also happens when nodes are oversubscribed — many pods each have low requests but spike simultaneously. Setting requests correctly prevents over-scheduling:

resources:
  requests:
    memory: "384Mi"   # Scheduler uses this for placement
    cpu: "250m"
  limits:
    memory: "768Mi"   # OOM killer triggers here
    cpu: "1000m"      # CPU throttling (not OOMKill)

Rule of thumb: Set requests to the VPA lowerBound and limits to roughly 1.2× the VPA target (768Mi for a 640Mi target).
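
Oversubscription is easiest to see as two percentages: how much of a node's allocatable memory is promised by requests, and how much could be consumed if every pod reached its limit. The sketch below computes both. The node size (15000 MiB allocatable) and the pod count are made-up numbers, and node_memory_commitment is a hypothetical helper.

```python
def node_memory_commitment(pods, allocatable_mib: float) -> dict:
    """Summarize how heavily a node's memory is committed.

    pods: iterable of dicts with 'request_mib' and 'limit_mib' keys.
    """
    pods = list(pods)
    total_requests = sum(p["request_mib"] for p in pods)
    total_limits = sum(p["limit_mib"] for p in pods)
    return {
        # The scheduler keeps this at or below 100%
        "requests_pct": round(100 * total_requests / allocatable_mib, 1),
        # Above 100% means simultaneous spikes cannot all be satisfied
        "limits_pct": round(100 * total_limits / allocatable_mib, 1),
    }


if __name__ == "__main__":
    # 30 pods sized like the example above, on a hypothetical node
    pods = [{"request_mib": 384, "limit_mib": 768}] * 30
    print(node_memory_commitment(pods, allocatable_mib=15000))
```

A limits percentage well above 100 is normal for burstable workloads, but the further it climbs, the more likely it is that a simultaneous spike ends in OOM kills at the node level.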

Step 6: Karpenter — Smarter Node Bin-Packing

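Karpenter provisions nodes based on the resource requests of pending pods, not on their live usage. It reduces node-level OOM risk by launching nodes with enough memory for the pods scheduled on them, which only works if the requests from Step 5 are accurate: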

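# Note: the Provisioner API (karpenter.sh/v1alpha5) applies to older Karpenter releases;
# Karpenter v0.32 and later replace it with NodePool.
apiVersion: karpenter.sh/v1alpha5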
kind: Provisioner
metadata:
  name: production
spec:
  requirements:
  - key: node.kubernetes.io/instance-type
    operator: In
    values: ["m5.xlarge", "m5.2xlarge", "m6i.xlarge", "m6i.2xlarge"]
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["on-demand"]
  limits:
    resources:
      memory: "1Ti"   # Max total memory across Karpenter-managed nodes
  ttlSecondsAfterEmpty: 30

Karpenter chooses instance types that fit your pod’s memory requests without wasting capacity — reducing both OOM risk and cost.
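
To build intuition for what bin-packing by memory request means, here is a simplified sketch using the classic first-fit-decreasing heuristic. It is an illustration of the general idea only and is not Karpenter's actual scheduling algorithm; it ignores CPU, instance-type choice, and daemonset overhead. The function name and the sample numbers are assumptions.

```python
def first_fit_decreasing(pod_requests_mib, node_capacity_mib: int):
    """Pack pod memory requests onto identical nodes, largest pods first.

    Returns a list of nodes, each a list of the requests placed on it.
    """
    nodes = []  # requests placed on each node
    free = []   # remaining capacity of each node, same index as nodes
    for request in sorted(pod_requests_mib, reverse=True):
        if request > node_capacity_mib:
            raise ValueError(f"request {request}Mi exceeds node capacity")
        for i, remaining in enumerate(free):
            if request <= remaining:
                nodes[i].append(request)
                free[i] -= request
                break
        else:
            # No existing node has room: open a new one
            nodes.append([request])
            free.append(node_capacity_mib - request)
    return nodes


if __name__ == "__main__":
    requests = [384] * 10 + [2048, 1024, 1024]
    for i, node in enumerate(first_fit_decreasing(requests, 4096)):
        print(f"node {i}: {sum(node)}Mi used by {len(node)} pods")
```

Because packing works from requests, a pod that requests 384Mi but routinely uses 700Mi will be packed too tightly, which is why accurate requests come before Karpenter in this walkthrough.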

Summary

Root Cause                          | Fix                                    | Timeline
------------------------------------|----------------------------------------|-------------
Limits too low                      | Set limits based on VPA recommendation | This sprint
Memory leak                         | Profile with heap dump / tracemalloc   | This sprint
Requests too low → oversubscription | Set requests = VPA lowerBound          | Today
Bad bin-packing                     | Deploy Karpenter                       | This quarter

Interview Angle
OOMKilled is a favorite Kubernetes operations interview question. Show that you know the difference between requests (what the scheduler reserves) and limits (the hard cap), that you’d use VPA recommendations before guessing, and that you’d look for leaks before raising limits blindly.
Services Used
EKS, CloudWatch Container Insights, Vertical Pod Autoscaler (VPA), Karpenter
Prerequisites
  • Familiarity with Kubernetes resource requests and limits
  • Basic understanding of how Linux OOM killer works
What You Learned
  • How to diagnose OOMKilled pods using kubectl and CloudWatch
  • The difference between memory requests and limits, and why both matter
  • How to use VPA in recommendation mode before committing to new limits
  • How to detect memory leaks in Java, Node.js, and Python applications
  • How Karpenter improves node bin-packing to prevent unnecessary OOM events
