EKS Pods Getting OOMKilled in Production — Diagnose and Fix

The Problem

OOMKilled (exit code 137) means the Linux kernel’s Out-Of-Memory killer terminated your container with SIGKILL, most often because the container exceeded its memory limit. The cause is usually one of two things:

Limits set too low — the app needs more memory than you’ve allocated
Memory leak — the app consumes ever-increasing memory over time until it’s killed

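Each kill forces a container restart, and requests that arrive during the restart window fail with 503 errors. The fix depends on which root cause you’re dealing with. Step 5 covers a third, node-level cause: oversubscribed nodes.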
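The exit code itself is plain arithmetic: shells and container runtimes report 128 plus the signal number when a process dies from a signal, so 137 is 128 + 9 (SIGKILL). The helper below is a small illustrative sketch (the function name is made up for this example). Note that any SIGKILL produces 137, so the `OOMKilled` reason in the pod status is what confirms the memory cause.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a container exit code; values above 128 mean 'killed by signal (code - 128)'."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform
            name = f"signal {signum}"
        return f"killed by {name}"
    if code == 0:
        return "exited normally"
    return f"application error (exit {code})"


print(describe_exit_code(137))
```

On a Linux host this reports SIGKILL for 137; the same helper also distinguishes ordinary application failures such as exit code 1.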

OOMKilled Alert

[CRITICAL] payment-service pod payment-service-7d4f8b9-xk2p restarted 4 times in 30 minutes.
Exit code: 137 (OOMKilled)
Last restart: 14:32 UTC
Customer impact: ~340 failed payment attempts during restart windows

Step 1: Confirm OOMKilled and Get Timeline

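# Find pods whose last container termination was OOMKilled
kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' | grep OOMKilled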

# Check restart counts — high restarts = chronic OOM
kubectl get pods -n production --sort-by='.status.containerStatuses[0].restartCount'

# Describe the pod to see OOM details
kubectl describe pod payment-service-7d4f8b9-xk2p -n production

The describe output will show:

Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
  Started:      Mon, 15 Jan 2024 14:29:12 +0000
  Finished:     Mon, 15 Jan 2024 14:32:44 +0000
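
The same fields are available in machine-readable form from `kubectl get pod <name> -n production -o json`. The sketch below assumes that JSON has already been loaded into a dict; the function name and the sample data are illustrative, with timestamps copied from the describe output above. It shows how long the container survived before being killed, which is useful when building a timeline.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # timestamp format used in Kubernetes API JSON


def oom_kills(pod: dict) -> list:
    """Return one record per container whose last termination reason was OOMKilled."""
    records = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        terminated = status.get("lastState", {}).get("terminated")
        if not terminated or terminated.get("reason") != "OOMKilled":
            continue
        started = datetime.strptime(terminated["startedAt"], TIME_FORMAT)
        finished = datetime.strptime(terminated["finishedAt"], TIME_FORMAT)
        records.append({
            "container": status["name"],
            "restarts": status.get("restartCount", 0),
            "lifetime_seconds": int((finished - started).total_seconds()),
        })
    return records


# Hypothetical excerpt of the pod JSON, mirroring the describe output above
sample_pod = {
    "status": {
        "containerStatuses": [{
            "name": "payment-service",
            "restartCount": 4,
            "lastState": {"terminated": {
                "reason": "OOMKilled",
                "exitCode": 137,
                "startedAt": "2024-01-15T14:29:12Z",
                "finishedAt": "2024-01-15T14:32:44Z",
            }},
        }]
    }
}

print(oom_kills(sample_pod))
```

A short lifetime (here a few minutes) that repeats across restarts suggests the container hits its limit quickly under normal load, while lifetimes of many hours point more toward gradual growth.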

Step 2: Check Actual Memory Usage vs Limits

# Real-time pod memory usage
kubectl top pods -n production --sort-by=memory

# Check node memory pressure
kubectl describe nodes | grep -A 10 "Conditions:"

# Get per-container resource limits
kubectl get pod payment-service-7d4f8b9-xk2p -n production \
  -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources}{"\n"}{end}'

In CloudWatch Container Insights:

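# Query pod memory working set vs limit (Logs Insights)
# Note: `date -d` is GNU date syntax (Linux); on macOS use `date -v-2H +%s`.
QUERY_ID=$(aws logs start-query \
  --log-group-name /aws/containerinsights/prod-cluster/performance \
  --start-time $(date -d '2 hours ago' +%s) \
  --end-time $(date +%s) \
  --query-string '
    fields @timestamp, PodName, pod_memory_working_set, pod_memory_limit
    | filter Type = "Pod" and PodName like "payment-service"
    | stats max(pod_memory_working_set), max(pod_memory_limit) by PodName
  ' \
  --query 'queryId' --output text)

# start-query is asynchronous; fetch the results once the query has completed
aws logs get-query-results --query-id "$QUERY_ID"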

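If pod_memory_working_set regularly peaks at or near pod_memory_limit and then plateaus, the limit is too low. If it climbs steadily for hours without stabilizing, suspect a leak. The query above only reports maxima, so chart the metric over time (for example, by grouping with bin(5m)) to tell the two patterns apart.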
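
The distinction between "limit too low" and "leak" can be sketched as a trend check on memory samples over time. The code below is a rough heuristic, not an established method: the function names and both thresholds (10 MiB/hour of growth, 90% of the limit) are assumptions chosen for illustration, and real workloads with warm-up phases or periodic batch jobs will need tuning.

```python
def slope_per_hour(samples):
    """Least-squares slope of memory (MiB) against time (hours)."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den if den else 0.0


def classify(samples, limit_mib, growth_threshold=10.0, near_limit=0.9):
    """Classify (hour, MiB) samples; thresholds are illustrative assumptions."""
    peak = max(m for _, m in samples)
    if slope_per_hour(samples) >= growth_threshold:
        return "possible leak"          # memory keeps climbing
    if peak >= near_limit * limit_mib:
        return "limit likely too low"   # flat, but pressed against the limit
    return "healthy"


growing = [(0, 300), (1, 350), (2, 400), (3, 450)]
plateau = [(0, 700), (1, 710), (2, 695), (3, 705)]
print(classify(growing, limit_mib=768))
print(classify(plateau, limit_mib=768))
```

The growth check runs first on purpose: a leaking service will eventually sit near its limit too, and raising the limit in that case only delays the next kill.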

Step 3: Right-Size Memory Limits With VPA

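Don’t guess. Use the Vertical Pod Autoscaler (VPA) in recommendation-only mode to get data-driven limits. VPA is an add-on rather than part of core Kubernetes, so it must be installed in the cluster before this manifest will work: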

# vpa-recommendation.yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: payment-service-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  updatePolicy:
    updateMode: "Off"   # Recommendation only — don't auto-apply yet

kubectl apply -f vpa-recommendation.yaml

# After running for 24-48 hours:
kubectl get vpa payment-service-vpa -n production -o yaml | grep -A 20 recommendation

Output shows:

recommendation:
  containerRecommendations:
  - containerName: payment-service
    lowerBound:
      memory: 384Mi
    target:
      memory: 640Mi       # Starting point for sizing; add headroom for the limit
    upperBound:
      memory: 1024Mi

Apply the recommendation, using lowerBound for the request and the target plus headroom for the limit:

resources:
  requests:
    memory: "384Mi"
    cpu: "250m"
  limits:
    memory: "768Mi"   # ~1.2× the VPA target for headroom
    cpu: "500m"
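
The memory values above follow from simple arithmetic on the VPA output: the request is the lowerBound and the limit is the target plus 20% headroom (640Mi × 1.2 = 768Mi). A minimal sketch of that calculation, assuming quantities are always expressed in Mi (the helper names are made up for this example and do not handle other Kubernetes units):

```python
def parse_mi(quantity: str) -> int:
    """Parse a quantity such as '640Mi' into an integer number of MiB."""
    if not quantity.endswith("Mi"):
        raise ValueError(f"only Mi quantities are handled in this sketch: {quantity}")
    return int(quantity[:-2])


def size_memory(lower_bound: str, target: str, headroom_percent: int = 120) -> dict:
    """Request = VPA lowerBound; limit = VPA target scaled by headroom_percent, rounded up."""
    target_mi = parse_mi(target)
    limit_mi = (target_mi * headroom_percent + 99) // 100  # integer ceiling division
    return {"requests": lower_bound, "limits": f"{limit_mi}Mi"}


print(size_memory("384Mi", "640Mi"))
```

Integer arithmetic is used deliberately so that the result never depends on floating-point rounding.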

Step 4: Detect Memory Leaks

If memory grows continuously over time (VPA’s upperBound keeps climbing), there’s a leak.

Java (Spring Boot / JVM):

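# Enable heap dump on OOM (add to JVM flags)
# Note: this fires on a Java OutOfMemoryError, not on a kernel OOM kill (SIGKILL),
# so the max heap must sit below the container limit for a dump to be written.
# Point HeapDumpPath at a mounted volume; the container filesystem is lost on restart.
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/tmp/heapdump.hprof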

# Check GC pressure in CloudWatch
# Look for high gc.pause.total or increasing heap after GC

Node.js:

# Run with the inspector enabled so Chrome DevTools can take heap snapshots
node --expose-gc --inspect app.js

# Or use clinic.js in a staging environment
npx clinic heapprofiler -- node app.js

Python:

# Add tracemalloc for memory profiling
import tracemalloc
tracemalloc.start()
# ... run code ...
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')
for stat in top_stats[:10]:
    print(stat)
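
A single snapshot shows where memory is allocated, but a leak shows up more clearly as the difference between two snapshots taken some time apart. The sketch below uses `Snapshot.compare_to` from the standard library for that; `leaky_store` and `handle_request` are hypothetical stand-ins for application state that grows on every request.

```python
import tracemalloc

leaky_store = []  # hypothetical: state that is appended to and never cleared


def handle_request():
    # Simulates a handler that retains about 100 KB per call
    leaky_store.append(bytearray(100_000))


tracemalloc.start()
before = tracemalloc.take_snapshot()
for _ in range(50):
    handle_request()
after = tracemalloc.take_snapshot()
tracemalloc.stop()

# Differences are sorted with the largest change in allocated size first
growth = after.compare_to(before, "lineno")
for diff in growth[:5]:
    print(diff)
```

The first entry should point at the line inside `handle_request` that allocates the buffer, with a size difference of roughly 5 MB.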

Common memory leak causes:

Event listeners added but never removed
Unbounded caches growing without eviction policy
Database connection pools not returning connections
Prometheus metrics registered inside request handlers (duplicates accumulate)
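
The unbounded cache case is usually the easiest of these to fix: give the cache a maximum size and evict the least recently used entry when it is full. The class below is a minimal sketch of that idea (the class name and size are made up for this example); for caching function results, `functools.lru_cache(maxsize=...)` from the standard library does the same job.

```python
from collections import OrderedDict


class BoundedCache:
    """Least-recently-used cache that never holds more than max_entries items."""

    def __init__(self, max_entries: int):
        self.max_entries = max_entries
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict the least recently used entry

    def __len__(self):
        return len(self._data)


cache = BoundedCache(max_entries=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")      # "a" is now the most recently used
cache.put("c", 3)   # evicts "b"
print(len(cache), cache.get("b"))
```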

Step 5: Set Resource Requests Correctly

OOMKilled also happens when nodes are oversubscribed — many pods each have low requests but spike simultaneously. Setting requests correctly prevents over-scheduling:

resources:
  requests:
    memory: "384Mi"   # Scheduler uses this for placement
    cpu: "250m"
  limits:
    memory: "768Mi"   # OOM killer triggers here
    cpu: "1000m"      # CPU throttling (not OOMKill)

Rule of thumb: Set requests to the VPA lowerBound and limits to roughly 1.2× the VPA target, as in the example above (640Mi × 1.2 = 768Mi).
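
Oversubscription is possible because the scheduler only checks that the sum of requests fits within a node's allocatable memory; the sum of limits is allowed to exceed it. The sketch below makes that gap visible for one node. The function name, the 15000 MiB allocatable figure, and the pod count are illustrative assumptions, not values from a real cluster.

```python
def overcommit(pods, allocatable_mib):
    """Compare summed requests and summed limits against a node's allocatable memory."""
    requested = sum(p["request_mib"] for p in pods)
    limited = sum(p["limit_mib"] for p in pods)
    return {
        "request_ratio": round(requested / allocatable_mib, 3),
        "limit_ratio": round(limited / allocatable_mib, 3),
    }


# Hypothetical node: 30 identical pods sized like the example above
pods = [{"request_mib": 384, "limit_mib": 768} for _ in range(30)]
print(overcommit(pods, allocatable_mib=15000))
```

A request ratio below 1.0 means the pods were all schedulable, while a limit ratio above 1.0 means the node runs out of memory if enough pods approach their limits at the same time.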

Step 6: Karpenter — Smarter Node Bin-Packing

Karpenter provisions nodes based on the resource requests of pending pods. It reduces node-level memory pressure by launching nodes with enough memory for the pods scheduled on them, which only works if those requests are accurate:

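# Legacy API: newer Karpenter releases replace Provisioner with NodePool
# (karpenter.sh/v1beta1, later karpenter.sh/v1). Use this manifest only on
# clusters that still run a v1alpha5-era Karpenter.
apiVersion: karpenter.sh/v1alpha5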
kind: Provisioner
metadata:
  name: production
spec:
  requirements:
  - key: node.kubernetes.io/instance-type
    operator: In
    values: ["m5.xlarge", "m5.2xlarge", "m6i.xlarge", "m6i.2xlarge"]
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["on-demand"]
  limits:
    resources:
      memory: "1Ti"   # Max total memory across Karpenter-managed nodes
  ttlSecondsAfterEmpty: 30

Karpenter chooses instance types that fit your pod’s memory requests without wasting capacity — reducing both OOM risk and cost.
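
The idea of fitting instance types to memory requests can be illustrated with a toy selection function. This is a deliberate simplification and not Karpenter's actual algorithm, which also weighs CPU, price, availability, and packing across multiple nodes. The memory sizes are the published totals for these instance types; the 10% reserved for system overhead is an assumption made for this sketch.

```python
# Total memory per instance type in MiB (16 GiB and 32 GiB)
INSTANCE_MEMORY_MIB = {
    "m5.xlarge": 16384,
    "m5.2xlarge": 32768,
    "m6i.xlarge": 16384,
    "m6i.2xlarge": 32768,
}


def smallest_fit(pending_requests_mib, reserved_percent=10):
    """Pick the smallest instance whose usable memory covers the summed requests.

    reserved_percent is an assumed allowance for kubelet and system overhead.
    Returns None when no single instance in the list is large enough.
    """
    needed = sum(pending_requests_mib)
    candidates = [
        (memory, name)
        for name, memory in INSTANCE_MEMORY_MIB.items()
        if memory * (100 - reserved_percent) // 100 >= needed
    ]
    if not candidates:
        return None
    return min(candidates)[1]  # smallest memory first, then name as tie-breaker


print(smallest_fit([384] * 20))   # 7680 MiB requested
print(smallest_fit([384] * 40))   # 15360 MiB requested
```

The second call needs the larger instance even though 15360 MiB is below 16 GiB, because part of each node's memory is never available to pods.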

Summary

Root Cause                             Fix                                      Timeline
Limits too low                         Set limits based on VPA recommendation   This sprint
Memory leak                            Profile with heap dump / tracemalloc     This sprint
Requests too low → oversubscription    Set requests = VPA lowerBound            Today
Bad bin-packing                        Deploy Karpenter                         This quarter

Interview Angle

OOMKilled is a favorite Kubernetes operations interview question. Show that you know the difference between requests (what the scheduler reserves for placement) and limits (the hard cap), that you’d use VPA recommendations before guessing, and that you’d look for leaks before raising limits blindly.
