Pod Stuck in CrashLoopBackOff — Investigate and Resolve

Overview

CrashLoopBackOff is one of the most common Kubernetes errors. It means a container started, crashed, Kubernetes tried to restart it, it crashed again — and now Kubernetes is backing off exponentially before each retry. This scenario covers the full debug workflow you’ll use in every real cluster.

The Problem

Your CD pipeline completed successfully — image pushed, manifest applied. But the pod won’t stay up.

kubectl get pods output

NAME                               READY   STATUS             RESTARTS   AGE
payment-service-7d9f4b8c6-xk2p9   0/1     CrashLoopBackOff   4          3m12s

RESTARTS: 4 means the container has crashed four times in about three minutes. Kubernetes keeps retrying, but the delay between attempts doubles each time (10s, 20s, 40s, and so on) up to a cap of five minutes.
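To make the backoff concrete, here is a minimal sketch that prints the approximate wait before each retry. It assumes the default kubelet schedule (10 second base, doubling, capped at 300 seconds); `backoff_delay` is a hypothetical helper name, and real timings also include however long the container ran before it crashed.

```shell
# Approximate kubelet restart delay in seconds for the k-th consecutive crash (1-based).
# Assumption: default schedule of 10s base, doubling each time, capped at 300s.
backoff_delay() {
  local k="$1"
  local delay=10
  local i=1
  while [ "$i" -lt "$k" ]; do
    delay=$((delay * 2))
    if [ "$delay" -ge 300 ]; then
      delay=300
      break
    fi
    i=$((i + 1))
  done
  echo "$delay"
}

# Print the first six delays in the sequence
for k in 1 2 3 4 5 6; do
  printf 'crash %d -> wait ~%ss\n' "$k" "$(backoff_delay "$k")"
done
```

The kubelet resets the backoff once a container has run for ten minutes without problems.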

Step 1: Get the Logs

The first thing you always check is the container logs from the most recent crash:

kubectl logs payment-service-7d9f4b8c6-xk2p9 -n production

If the pod has already restarted and you want the previous container’s logs:

kubectl logs payment-service-7d9f4b8c6-xk2p9 -n production --previous

In this scenario you’ll see:

2024-03-15T02:14:22Z INFO  Starting payment-service v2.1.0
2024-03-15T02:14:22Z INFO  Loading configuration...
2024-03-15T02:14:22Z FATAL Missing required environment variable: STRIPE_SECRET_KEY
2024-03-15T02:14:22Z FATAL Application startup failed. Exiting.

The application is exiting because a required environment variable is missing. But before we fix it, let’s understand the full picture.
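The two log commands from this step can be folded into one helper so you do not have to guess which one applies. This is a sketch only: `crash_logs` is a hypothetical name, and the default of 50 lines is an arbitrary choice. `--previous` and `--tail` are standard `kubectl logs` flags.

```shell
# Print logs from the previous (crashed) container instance, falling back to
# the current instance when no previous one exists, e.g. on the very first start.
# Usage: crash_logs <pod> <namespace> [lines]
crash_logs() {
  local pod="$1"
  local ns="$2"
  local lines="${3:-50}"
  kubectl logs "$pod" -n "$ns" --previous --tail="$lines" 2>/dev/null ||
    kubectl logs "$pod" -n "$ns" --tail="$lines"
}
```

For this scenario you would call it as `crash_logs payment-service-7d9f4b8c6-xk2p9 production`.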

Step 2: Describe the Pod

kubectl describe gives you everything Kubernetes knows about the pod — events, resource requests, env vars, mount points, and crucially the exit code:

kubectl describe pod payment-service-7d9f4b8c6-xk2p9 -n production

Look at three sections:

Environment variables — is STRIPE_SECRET_KEY listed?

Environment:
  DATABASE_URL:     <set to the key 'url' in secret 'postgres-secret'>   Optional: false
  STRIPE_SECRET_KEY:  <not set>

Last State / Exit Code:

Last State:     Terminated
  Reason:       Error
  Exit Code:    1
  Started:      Fri, 15 Mar 2024 02:14:22 +0000
  Finished:     Fri, 15 Mar 2024 02:14:22 +0000
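If you only need the termination reason and exit code, for example in a script, a JSONPath query avoids scanning the full describe output. The sketch below assumes the pod has a single container (index 0); `last_termination` is a hypothetical helper name.

```shell
# Print "<reason> <exit code>" for the last terminated instance of the first container.
# Assumption: the container of interest is containerStatuses[0].
# Usage: last_termination <pod> <namespace>
last_termination() {
  local pod="$1"
  local ns="$2"
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{" "}{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'
}
```

For the pod in this scenario, the output should correspond to the Reason and Exit Code fields shown above.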

Exit Code Cheat Sheet

Exit Code	Meaning
`0`	Clean exit (shouldn’t cause CrashLoopBackOff)
`1`	Application error — check logs
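`137`	Killed by signal 9, SIGKILL (OOMKilled, or force-killed after the termination grace period expired)
`139`	Segmentation fault (signal 11, SIGSEGV)
`143`	Killed by signal 15, SIGTERM (the container was asked to shut down gracefully)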
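The codes above 128 in this table follow a simple rule: the exit code is 128 plus the number of the signal that killed the process. A small sketch of that decoding, with `explain_exit_code` as a hypothetical helper name:

```shell
# Translate a container exit code into a short explanation.
# Codes above 128 mean "terminated by signal (code - 128)".
explain_exit_code() {
  local code="$1"
  if [ "$code" -eq 0 ]; then
    echo "clean exit"
  elif [ "$code" -gt 128 ]; then
    echo "terminated by signal $((code - 128))"
  else
    echo "application error, check the logs"
  fi
}
```

For example, `explain_exit_code 137` reports signal 9 and `explain_exit_code 143` reports signal 15.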

Events section at the bottom:

Events:
  Warning  BackOff   2m    kubelet  Back-off restarting failed container

Step 3: Find the Root Cause

The app needs STRIPE_SECRET_KEY, but the container never receives a value for it. Check how the deployment defines its environment:

kubectl get deployment payment-service -n production -o yaml | grep -A30 env:

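You’ll see STRIPE_SECRET_KEY is referenced from a Kubernetes Secret called stripe-secret. Note that the container can only start without that secret if the reference is marked optional: true; a required reference to a missing secret leaves the pod in CreateContainerConfigError rather than CrashLoopBackOff. Either way, check whether the secret exists: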

kubectl get secret stripe-secret -n production

Error from server (NotFound): secrets "stripe-secret" not found

The secret was never created in the production namespace. It exists in staging but wasn’t applied to production — a classic environment promotion miss.
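A promotion miss like this is quick to confirm by checking the same secret name in each environment. The sketch below assumes each environment is a namespace in the current cluster context; `secret_presence` is a hypothetical helper name.

```shell
# Report whether a secret exists in each of the given namespaces.
# Usage: secret_presence <secret-name> <namespace>...
secret_presence() {
  local secret="$1"
  local ns
  shift
  for ns in "$@"; do
    # kubectl exits non-zero when the secret is missing or cannot be read
    if kubectl get secret "$secret" -n "$ns" >/dev/null 2>&1; then
      echo "$ns: present"
    else
      echo "$ns: MISSING"
    fi
  done
}
```

Running `secret_presence stripe-secret staging production` would show the gap described above. If staging and production live in different clusters, you would need to switch contexts between checks.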

Step 4: Fix the Issue

Create the missing secret in the production namespace:

# NEVER put real secrets in command history — use a file or a secrets manager
# For this scenario we'll use kubectl create secret directly

kubectl create secret generic stripe-secret \
  --from-literal=secret-key="sk_live_your_actual_key_here" \
  -n production

Real Production Advice

In production, secrets should come from AWS Secrets Manager / HashiCorp Vault / GCP Secret Manager and be injected via the External Secrets Operator or similar. Manually creating secrets from the CLI is fine for emergencies but adds a secret that isn’t tracked in your IaC.
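For the emergency case, reading the value from a file keeps the key out of your shell history. This is a sketch under assumptions: `create_secret_from_file` is a hypothetical helper, and the file path is whatever location you stored the key in. `--from-file=<key>=<path>` is a standard `kubectl create secret generic` option.

```shell
# Create a generic secret whose value is read from a file instead of the command line.
# Usage: create_secret_from_file <secret-name> <key> <file> <namespace>
create_secret_from_file() {
  local name="$1"
  local key="$2"
  local file="$3"
  local ns="$4"
  if [ ! -r "$file" ]; then
    echo "ERROR: cannot read $file" >&2
    return 1
  fi
  # The file content is stored verbatim, including any trailing newline
  kubectl create secret generic "$name" --from-file="$key=$file" -n "$ns"
}
```

Because the file content is used as-is, write the key without a trailing newline (for example with `printf '%s'`) and delete the file afterwards.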

Now update your deployment manifest to reference the secret correctly:

# In your deployment spec, under containers[].env:
env:
  - name: STRIPE_SECRET_KEY
    valueFrom:
      secretKeyRef:
        name: stripe-secret
        key: secret-key

Apply the change — or just restart the pod if the manifest was already correct and only the secret was missing:

# Rollout restart forces new pods to be created (picks up the new secret)
kubectl rollout restart deployment/payment-service -n production

Step 5: Verify Recovery

Watch the rollout:

kubectl rollout status deployment/payment-service -n production
# Waiting for deployment "payment-service" rollout to finish: 1 old replicas are pending termination...
# deployment "payment-service" successfully rolled out

kubectl get pods -n production -l app=payment-service
# NAME                               READY   STATUS    RESTARTS   AGE
# payment-service-8f6d5c9b7-m3nt1   1/1     Running   0          45s

READY: 1/1 and RESTARTS: 0 — fixed.
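The same check can be scripted so a pipeline fails when any pod behind the label has restarted. A minimal sketch, assuming single-container pods (index 0) and using the hypothetical helper name `check_no_restarts`:

```shell
# Fail if any pod matching the selector reports a restart count above zero.
# Assumption: the container of interest is containerStatuses[0].
# Usage: check_no_restarts <namespace> <label-selector>
check_no_restarts() {
  local ns="$1"
  local selector="$2"
  local bad
  bad=$(kubectl get pods -n "$ns" -l "$selector" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.containerStatuses[0].restartCount}{"\n"}{end}' |
    awk '$2 + 0 > 0 { print $1 }')
  if [ -n "$bad" ]; then
    echo "Pods with restarts: $bad" >&2
    return 1
  fi
  echo "No restarts observed."
}
```

For this scenario the call would be `check_no_restarts production app=payment-service`, run a minute or two after the rollout completes so a crash has time to show up.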

Prevent It Happening Again

Add a readiness probe so Kubernetes knows the app is truly ready before sending traffic:

readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
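To confirm the probe actually made it into the live deployment, you can query the spec directly. The sketch assumes the app container is the first one in the pod template; `has_readiness_probe` is a hypothetical helper name.

```shell
# Succeed if the first container in the deployment's pod template defines a readinessProbe.
# Usage: has_readiness_probe <deployment> <namespace>
has_readiness_probe() {
  local deploy="$1"
  local ns="$2"
  local probe
  # The JSONPath query prints nothing when the field is absent
  probe=$(kubectl get deployment "$deploy" -n "$ns" \
    -o jsonpath='{.spec.template.spec.containers[0].readinessProbe}')
  [ -n "$probe" ]
}
```

A typical use is `has_readiness_probe payment-service production || echo "no readiness probe configured"`.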

Also add a pre-deploy check to your pipeline: verify that all required secrets exist in the target namespace before the rollout starts.

#!/usr/bin/env bash
# Pre-deploy check script: fail the pipeline if any required secret is missing
set -euo pipefail

REQUIRED_SECRETS=("stripe-secret" "postgres-secret" "jwt-secret")
NAMESPACE="production"

for secret in "${REQUIRED_SECRETS[@]}"; do
  # kubectl exits non-zero when the secret does not exist or cannot be read
  if ! kubectl get secret "$secret" -n "$NAMESPACE" &>/dev/null; then
    echo "ERROR: Required secret '$secret' not found in namespace '$NAMESPACE'" >&2
    exit 1
  fi
done
echo "All required secrets present. Proceeding with deploy."

Summary

CrashLoopBackOff means the container keeps starting and then exiting. The triage process is the same three steps every time:

1. kubectl logs --previous: what did the app say before it died?
2. kubectl describe pod: what did Kubernetes observe? (exit code, events, env vars)
3. Fix the underlying cause and kubectl rollout restart

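This scenario covered a missing Kubernetes Secret, which is one of the most common causes. Other frequent culprits: an image tag that points at the wrong or a broken build (a tag that does not exist at all shows up as ImagePullBackOff instead), a misconfigured liveness probe killing a slow-starting app, and OOMKilled containers (which need higher memory limits).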

Pro Tip

Add kubectl get events --sort-by=.lastTimestamp -n production to your triage toolkit. It shows you everything that’s happened across all resources in the namespace in time order — incredibly useful when you don’t know what broke.
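When the namespace is busy, narrowing the event stream to one pod can help. The sketch below uses the `involvedObject.name` field selector, which `kubectl get events` supports; `pod_events` is a hypothetical helper name.

```shell
# Show events for a single pod, oldest first.
# Usage: pod_events <pod> <namespace>
pod_events() {
  local pod="$1"
  local ns="$2"
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.name=$pod" \
    --sort-by=.lastTimestamp
}
```

For this scenario: `pod_events payment-service-7d9f4b8c6-xk2p9 production`.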
