Pod Stuck in CrashLoopBackOff — Investigate and Resolve
A newly deployed microservice keeps crashing immediately on startup. Learn to read logs, describe pods, and identify misconfigured environment variables and missing secrets.
You just merged a PR and your CD pipeline deployed the new image. Thirty seconds later Slack shows: 'payment-service pod restarting repeatedly in production namespace.' Your deploy just broke payments.
Overview
CrashLoopBackOff is one of the most common Kubernetes errors. It means a container started, crashed, Kubernetes tried to restart it, it crashed again — and now Kubernetes is backing off exponentially before each retry. This scenario covers the full debug workflow you’ll use in every real cluster.
The Problem
Your CD pipeline completed successfully — image pushed, manifest applied. But the pod won’t stay up.
NAME                              READY   STATUS             RESTARTS   AGE
payment-service-7d9f4b8c6-xk2p9   0/1     CrashLoopBackOff   4          3m12s
RESTARTS: 4 means the container has crashed four times in about three minutes. Kubernetes will keep retrying, but the delay between attempts doubles each time (10s, 20s, 40s, and so on) up to a cap of five minutes.
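As a rough sketch of how the restart delay grows, the helper below computes the approximate wait before each restart. It assumes the kubelet's default schedule (10 seconds initially, doubling, capped at 300 seconds); `backoff_delay` is a hypothetical name, and the mapping from the RESTARTS count to the exact delay is only approximate.

```shell
# backoff_delay N
# Approximate delay in seconds before restart attempt N.
# Assumption: default kubelet schedule (10s start, doubling, 300s cap).
backoff_delay() {
  attempt=$1
  delay=10
  i=1
  while [ "$i" -lt "$attempt" ]; do
    delay=$((delay * 2))
    if [ "$delay" -ge 300 ]; then
      delay=300
      break
    fi
    i=$((i + 1))
  done
  echo "$delay"
}

# Print the schedule for the first six restarts.
for n in 1 2 3 4 5 6; do
  echo "restart $n: wait ~$(backoff_delay "$n")s"
done
```

Under these assumptions the wait reaches the five-minute cap by the sixth restart, which is why a pod that has been crashing for a while can look idle between attempts.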
Step 1: Get the Logs
The first thing you always check is the container logs from the most recent crash:
kubectl logs payment-service-7d9f4b8c6-xk2p9 -n production
If the container has already restarted and you want the logs from the previous instance:
kubectl logs payment-service-7d9f4b8c6-xk2p9 -n production --previous
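The two log commands can be combined into one helper. This is a minimal sketch, assuming you want the previous instance's logs when they exist and the current instance's otherwise; `crash_logs` is a hypothetical function name, and `--tail=50` is an arbitrary choice to keep the output short.

```shell
# crash_logs POD NAMESPACE
# Try the previous container instance first; fall back to the current
# one if kubectl reports that no previous instance exists.
crash_logs() {
  pod=$1
  ns=$2
  kubectl logs "$pod" -n "$ns" --previous --tail=50 2>/dev/null ||
    kubectl logs "$pod" -n "$ns" --tail=50
}
```

For the pod in this scenario the call would be `crash_logs payment-service-7d9f4b8c6-xk2p9 production`.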
In this scenario you’ll see:
2024-03-15T02:14:22Z INFO Starting payment-service v2.1.0
2024-03-15T02:14:22Z INFO Loading configuration...
2024-03-15T02:14:22Z FATAL Missing required environment variable: STRIPE_SECRET_KEY
2024-03-15T02:14:22Z FATAL Application startup failed. Exiting.
The application is exiting because a required environment variable is missing. But before we fix it, let’s understand the full picture.
Step 2: Describe the Pod
kubectl describe gives you everything Kubernetes knows about the pod — events, resource requests, env vars, mount points, and crucially the exit code:
kubectl describe pod payment-service-7d9f4b8c6-xk2p9 -n production
Look at three sections:
Environment variables — is STRIPE_SECRET_KEY listed?
Environment:
DATABASE_URL: <set to the key 'url' in secret 'postgres-secret'> Optional: false
STRIPE_SECRET_KEY: <not set>
Last State / Exit Code:
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Fri, 15 Mar 2024 02:14:22 +0000
Finished: Fri, 15 Mar 2024 02:14:22 +0000
| Exit Code | Meaning |
|---|---|
| 0 | Clean exit (can still loop if the process exits when it is expected to keep running) |
| 1 | Application error: check logs |
| 137 | Killed by SIGKILL (128 + 9), often OOMKilled or a forced kill after the grace period |
| 139 | Segmentation fault (128 + 11, SIGSEGV) |
| 143 | Terminated by SIGTERM (128 + 15), a graceful shutdown request |
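The table follows a simple convention: codes above 128 usually mean the process was terminated by signal number (code minus 128). The sketch below encodes that rule and reads the last exit code straight from the pod status. Both function names are hypothetical, and `last_exit_code` assumes a single-container pod (index 0).

```shell
# explain_exit_code CODE
# Codes above 128 conventionally mean "terminated by signal (CODE - 128)".
explain_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "clean exit"
  elif [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128))"
  else
    echo "application error, check logs"
  fi
}

# last_exit_code POD NAMESPACE
# Reads the exit code of the last terminated container instance.
# Assumption: the pod has one container, so index 0 is the right one.
last_exit_code() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
}
```

A possible use is `explain_exit_code "$(last_exit_code payment-service-7d9f4b8c6-xk2p9 production)"`.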
Events section at the bottom:
Events:
Warning BackOff 2m kubelet Back-off restarting failed container
Step 3: Find the Root Cause
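The app needs STRIPE_SECRET_KEY, but the running container never receives a value for it. Check how your deployment defines it:
kubectl get deployment payment-service -n production -o yaml | grep -A30 env:
You’ll see STRIPE_SECRET_KEY is referenced from a Kubernetes Secret called stripe-secret. Because the container started at all, that reference must be marked optional: true; a required reference to a missing Secret would leave the pod in CreateContainerConfigError rather than CrashLoopBackOff. Now check whether the secret exists: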
kubectl get secret stripe-secret -n production
Error from server (NotFound): secrets "stripe-secret" not found
The secret was never created in the production namespace. It exists in staging but wasn’t applied to production — a classic environment promotion miss.
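One way to spot this kind of promotion miss is to compare Secret names between the two namespaces. The sketch below is illustrative only: `missing_secrets` is a hypothetical helper, the namespace names `staging` and `production` come from this scenario, and the comparison is by name only (it says nothing about whether the values or keys match).

```shell
# missing_secrets SOURCE_NS TARGET_NS
# Prints Secret names that exist in SOURCE_NS but not in TARGET_NS.
# `kubectl get secrets -o name` prints one "secret/<name>" per line.
missing_secrets() {
  src_ns=$1
  dst_ns=$2
  dst_list=$(kubectl get secrets -n "$dst_ns" -o name | sort)
  kubectl get secrets -n "$src_ns" -o name | sort | while IFS= read -r s; do
    # Exact, whole-line match against the target namespace's list.
    printf '%s\n' "$dst_list" | grep -Fxq -- "$s" || printf '%s\n' "${s#secret/}"
  done
}
```

Running `missing_secrets staging production` lists what still needs to be created in production. Some Secrets are legitimately namespace-specific, so treat the output as a prompt for review rather than a to-do list.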
Step 4: Fix the Issue
Create the missing secret in the production namespace:
# Avoid putting real secrets in your shell history; in practice, load them from a file or a secrets manager
# For this scenario we'll use kubectl create secret directly with a placeholder value
kubectl create secret generic stripe-secret \
  --from-literal=secret-key="sk_live_your_actual_key_here" \
  -n production
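To keep the key out of your shell history, the value can be read from a file with `--from-file` instead. The following is a sketch under assumptions: `create_secret_from_file` is a hypothetical wrapper, and the file path you pass is whatever location your team uses for short-lived credentials.

```shell
# create_secret_from_file NAME KEY FILE NAMESPACE
# Reads the value from FILE so the key never appears on the command line.
# Note: the file content is stored verbatim, including any trailing newline.
create_secret_from_file() {
  name=$1 key=$2 file=$3 ns=$4
  if [ ! -r "$file" ]; then
    echo "ERROR: cannot read $file" >&2
    return 1
  fi
  kubectl create secret generic "$name" \
    --from-file="$key=$file" \
    -n "$ns"
}
```

For example, `create_secret_from_file stripe-secret secret-key ./stripe-key.txt production`, where `./stripe-key.txt` is a placeholder path. Writing the file with `printf '%s'` rather than `echo` avoids a stray newline ending up in the key.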
Now update your deployment manifest to reference the secret correctly:
# In your deployment spec, under containers[].env:
env:
  - name: STRIPE_SECRET_KEY
    valueFrom:
      secretKeyRef:
        name: stripe-secret
        key: secret-key
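The `key` in `secretKeyRef` has to match a data key inside the Secret, so it is worth confirming the key exists before restarting anything. Here is a small sketch that does so without printing the value; `secret_has_key` is a hypothetical helper that relies on `kubectl describe secret` listing each data key with its size in bytes.

```shell
# secret_has_key NAME KEY NAMESPACE
# `kubectl describe secret` lists each data key as "KEY:  N bytes" and
# never prints the value, so this checks presence without exposing it.
# Returns 0 if the key is present, non-zero otherwise.
secret_has_key() {
  kubectl describe secret "$1" -n "$3" 2>/dev/null | grep -q "^$2:"
}
```

Usage might look like `secret_has_key stripe-secret secret-key production && echo "key present"`.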
Apply the change — or just restart the pod if the manifest was already correct and only the secret was missing:
# Rollout restart forces new pods to be created (picks up the new secret)
kubectl rollout restart deployment/payment-service -n production
Step 5: Verify Recovery
Watch the rollout:
kubectl rollout status deployment/payment-service -n production
# Waiting for deployment "payment-service" rollout to finish: 1 old replicas are pending termination...
# deployment "payment-service" successfully rolled out
kubectl get pods -n production -l app=payment-service
# NAME READY STATUS RESTARTS AGE
# payment-service-8f6d5c9b7-m3nt1 1/1 Running 0 45s
READY: 1/1 and RESTARTS: 0 — fixed.
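For a deployment with several replicas, scanning the READY column by eye gets tedious. As a sketch, the hypothetical helper below prints only pods whose ready count is lower than their container count; it assumes the default `kubectl get pods` column order (NAME, READY, STATUS).

```shell
# unready_pods NAMESPACE SELECTOR
# Prints "<pod> <status>" for pods whose READY column (for example 0/1)
# shows fewer ready containers than total containers.
unready_pods() {
  kubectl get pods -n "$1" -l "$2" --no-headers |
    awk '{ split($2, r, "/"); if (r[1] != r[2]) print $1, $3 }'
}
```

Calling `unready_pods production app=payment-service` should print nothing once the rollout has fully recovered.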
Prevent It Happening Again
Add a readiness probe so Kubernetes knows the app is truly ready before sending traffic:
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
And add startup validation to your deploy pipeline — check that all required secrets exist in the target namespace before the rollout starts.
#!/usr/bin/env bash
# Pre-deploy check script: fail fast if a required secret is missing
REQUIRED_SECRETS=("stripe-secret" "postgres-secret" "jwt-secret")
NAMESPACE="production"
for secret in "${REQUIRED_SECRETS[@]}"; do
  # kubectl exits non-zero when the secret does not exist
  if ! kubectl get secret "$secret" -n "$NAMESPACE" &>/dev/null; then
    echo "ERROR: Required secret '$secret' not found in namespace '$NAMESPACE'" >&2
    exit 1
  fi
done
echo "All required secrets present. Proceeding with deploy."
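A hardcoded list can drift from what the deployment actually references. One possible alternative, sketched below, derives the names from the deployment itself. `referenced_secrets` is a hypothetical helper; it reads the deployment already present in the cluster and only covers `env[].valueFrom.secretKeyRef`, not `envFrom` or volume-mounted Secrets.

```shell
# referenced_secrets DEPLOYMENT NAMESPACE
# Lists the unique Secret names referenced through
# env[].valueFrom.secretKeyRef in the deployment's pod template.
# Limitation: envFrom and volume-mounted Secrets are not included.
referenced_secrets() {
  kubectl get deployment "$1" -n "$2" \
    -o jsonpath='{.spec.template.spec.containers[*].env[*].valueFrom.secretKeyRef.name}' |
    tr ' ' '\n' | sort -u | sed '/^$/d'
}
```

The output of `referenced_secrets payment-service production` could feed the same existence check used in the pre-deploy script, one name per line.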
Summary
CrashLoopBackOff always means the container is starting and then exiting. The fix is always the same three-step process:
- kubectl logs --previous: what did the app say before it died?
- kubectl describe pod: what did Kubernetes observe? (exit code, events, env vars)
- Fix the underlying cause and kubectl rollout restart
This scenario covered a missing Kubernetes Secret, which is one of the most common causes. Other frequent culprits: wrong image tag, misconfigured liveness probe killing a slow-starting app, and OOMKilled (need to increase memory limits).
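To tell these causes apart quickly, the namespace's Warning events are one place to look, since image pull failures, failed probes, and back-off restarts are all reported there. A minimal sketch, with `warning_events` as a hypothetical wrapper name:

```shell
# warning_events NAMESPACE
# Shows only Warning events in the namespace, oldest first.
warning_events() {
  kubectl get events -n "$1" \
    --field-selector type=Warning \
    --sort-by=.lastTimestamp
}
```

For this scenario the call would be `warning_events production`. Events are only retained for a limited time, so an empty result on an old incident does not mean nothing happened.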
Add kubectl get events --sort-by=.lastTimestamp -n production to your triage toolkit. It shows recent events across all resources in the namespace in time order, which is incredibly useful when you don’t know what broke.
What this scenario covers:
- The exact kubectl commands to diagnose a CrashLoopBackOff
- How to read container exit codes to understand why a crash happened
- Common causes — missing env vars, missing secrets, wrong image, OOM
- How to safely edit a running deployment to fix the issue
- How to set up liveness/readiness probes to catch this earlier