
EC2 Production Instance Unreachable — Diagnose and Recover

Your e-commerce site's EC2 instance stopped responding at 2 AM. Customers can't check out. Walk through the full incident response from alert to resolution.

The Situation

It's 2:14 AM. PagerDuty fires: 'HTTP health check FAILED — api.yourstore.com'. Your on-call phone won't stop ringing. The checkout page is returning 502. Revenue is bleeding at $400/minute.

March 10, 2024 5 min read ~30 min to complete DB

Overview

An EC2 instance going unreachable is one of the most common production incidents. The tricky part isn’t fixing it — it’s finding the cause quickly under pressure. This scenario walks you through a methodical triage process that works whether the issue is a security group change, a runaway process eating CPU, a full disk, or a network ACL misconfiguration.

The Problem

At 2:14 AM your monitoring fires. The ELB health checks are failing for one of your backend EC2 instances. The other instances in the Auto Scaling Group are healthy, so traffic is being routed around it — but you’re now running at reduced capacity and any further failures will cause a full outage.

Active Alert
[CRITICAL] ELB Target Unhealthy
Target: i-0a1b2c3d4e5f (10.0.1.45)
Health check path: /health
Status: Target.FailedHealthChecks — 3 consecutive failures
Time: 02:14 UTC

Step 1: Confirm the Scope

Before touching anything, understand what’s actually broken. Open the AWS Console and check:

EC2 → Instances → find i-0a1b2c3d4e5f

Look at the Instance State and Status Checks columns:

Check | What it tests | Failure means
--- | --- | ---
System status | AWS hypervisor / hardware | AWS-side problem; a stop and start (which moves the instance to new hardware) may fix it
Instance status | OS-level reachability | Your problem: OS, disk, CPU

# Or via CLI:
aws ec2 describe-instance-status \
  --instance-ids i-0a1b2c3d4e5f \
  --query 'InstanceStatuses[*].{System:SystemStatus.Status,Instance:InstanceStatus.Status}'

In this scenario you’ll see:

[
  {
    "System": "ok",
    "Instance": "impaired"
  }
]

The hardware is fine. The OS is impaired. This means the problem is inside the instance.
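The decision this step encodes can be sketched as a small shell helper. This is a minimal illustration, not an AWS tool: `classify_status` is a hypothetical name, and it only distinguishes the cases discussed above.

```shell
# Sketch: map the two EC2 status checks to a first triage decision.
# classify_status is a hypothetical helper, not part of the AWS CLI.
classify_status() {
  local system="$1" instance="$2"
  if [ "$system" = "impaired" ]; then
    echo "aws-side"            # hypervisor or hardware problem
  elif [ "$instance" = "impaired" ]; then
    echo "os-side"             # problem is inside the instance
  elif [ "$system" = "ok" ] && [ "$instance" = "ok" ]; then
    echo "status-checks-ok"    # look at the app or the load balancer instead
  else
    echo "unknown"             # e.g. initializing or insufficient-data
  fi
}

# Usage (requires AWS credentials; instance ID taken from this scenario):
# read -r system instance < <(aws ec2 describe-instance-status \
#   --instance-ids i-0a1b2c3d4e5f --include-all-instances \
#   --query 'InstanceStatuses[0].[SystemStatus.Status,InstanceStatus.Status]' \
#   --output text)
# classify_status "$system" "$instance"
```

For the output shown above (`ok` / `impaired`), the helper prints `os-side`.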

Step 2: Check CloudWatch Metrics

Before trying to connect, look at the metrics to understand the timeline.

Navigate to CloudWatch → Metrics → EC2 → Per-Instance Metrics and check:

  • CPUUtilization — was there a spike before the failure?
  • StatusCheckFailed_Instance — when did it start?
  • NetworkIn / NetworkOut — did traffic drop suddenly?
What You'll Find
In this scenario, CPU hit 100% at 02:09 UTC and stayed there. The health check started failing at 02:11 UTC. Something is eating all the CPU.
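If you'd rather stay in the terminal, the same timeline can be pulled with the CLI. The sketch below wraps `aws cloudwatch get-metric-statistics`; the function name `cpu_timeline` is hypothetical, and the 300-second period assumes basic monitoring (use 60 if detailed monitoring is enabled on the instance).

```shell
# Sketch: print the per-period maximum CPU for one instance, oldest first.
# cpu_timeline is a hypothetical wrapper; the aws subcommand and flags are real.
cpu_timeline() {
  local instance_id="$1" start="$2" end="$3"
  aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions "Name=InstanceId,Value=${instance_id}" \
    --start-time "$start" \
    --end-time "$end" \
    --period 300 \
    --statistics Maximum \
    --query 'sort_by(Datapoints,&Timestamp)[].[Timestamp,Maximum]' \
    --output text
}

# Usage (DAY is the incident date in YYYY-MM-DD form, an assumption here):
# cpu_timeline i-0a1b2c3d4e5f "${DAY}T01:45:00Z" "${DAY}T02:20:00Z"
```

Swapping `CPUUtilization` for `StatusCheckFailed_Instance` gives the second metric from the list above.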

Step 3: Try to Connect

Since CPU is pegged, SSH may time out. Try SSM Session Manager first: it doesn't need the SSH port open, and it often still connects when the OS is slow to respond, provided the SSM Agent is running on the instance.

# Install SSM plugin first if needed:
# https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-working-with-install-plugin.html

aws ssm start-session --target i-0a1b2c3d4e5f

If SSM is not configured on the instance, try SSH — but be prepared for it to be slow:

ssh -i ~/.ssh/your-key.pem ec2-user@10.0.1.45
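SSM Not Working?
If SSM fails too, the OS may be fully locked up. Fall back to a forced reboot (aws ec2 reboot-instances --instance-ids i-0a1b2c3d4e5f), then continue with Step 6 to verify recovery. Document the instance state first by taking a screenshot of the CloudWatch metrics.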
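Before picking a connection method, it can help to ask Systems Manager whether the agent is still checking in. The sketch below is one way to do that; `pick_connection` is a hypothetical helper, and it assumes that any status other than `Online` means SSM is unlikely to work.

```shell
# Sketch: choose a connection method from the SSM agent's reported PingStatus.
# pick_connection is a hypothetical helper name.
pick_connection() {
  case "$1" in
    Online) echo "ssm" ;;
    *)      echo "ssh" ;;   # ConnectionLost, Inactive, None, or empty
  esac
}

# Usage (requires AWS credentials):
# status="$(aws ssm describe-instance-information \
#   --filters "Key=InstanceIds,Values=i-0a1b2c3d4e5f" \
#   --query 'InstanceInformationList[0].PingStatus' --output text)"
# pick_connection "$status"
```

A result of `None` usually means the instance was never registered with Systems Manager, in which case SSH is the only interactive option.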

Step 4: Find the Rogue Process

Once connected, immediately check what’s consuming CPU:

# Snapshot of the top CPU consumers (batch mode, single iteration)
top -b -n 1 | head -20

# Or for a cleaner view:
ps aux --sort=-%cpu | head -15

You’ll see something like this (some columns trimmed for readability):

USER       PID %CPU %MEM    VSZ   RSS COMMAND
www-data  4821 99.2  2.1 854321 43210 /usr/bin/python3 /app/worker.py

A Python worker process is consuming 99% CPU. Check how long it’s been running:

ps -p 4821 -o pid,etimes,cmd
# etimes = elapsed time in seconds

# Check if it's a legitimate process or something unexpected
# (sudo because the process belongs to www-data, not your login user)
sudo ls -la /proc/4821/exe
tr '\0' ' ' < /proc/4821/cmdline; echo
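When more than one process looks suspicious, a small filter over `ps` output saves scrolling. The sketch below is an illustration only: `cpu_hogs` is a hypothetical helper, and the 90% default threshold is an arbitrary assumption.

```shell
# Sketch: list processes at or above a CPU threshold with their running time.
# cpu_hogs is a hypothetical helper; it reads ps output on stdin so it can
# also be run against saved output.
cpu_hogs() {
  local threshold="${1:-90}"
  # Expected input columns: PID %CPU ELAPSED_SECONDS USER COMMAND...
  awk -v t="$threshold" '$2 + 0 >= t {
    printf "pid=%s cpu=%s%% running_for=%dm user=%s\n", $1, $2, $3 / 60, $4
  }'
}

# Usage on the instance (the trailing "=" on each column suppresses the header):
# ps -e -o pid= -o pcpu= -o etimes= -o user= -o args= --sort=-pcpu | cpu_hogs 90
```

Running time matters here: a worker that has been pegged for a few minutes points at a recent trigger, while one that has been busy for days is more likely normal load.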

Step 5: Fix the Immediate Problem

In this scenario the worker is stuck in an infinite loop due to a bad database query result. Kill it to restore the instance:

# Graceful kill first (sudo because the process runs as www-data)
sudo kill -15 4821
sleep 3

# Force kill only if it's still there
if ps -p 4821 >/dev/null 2>&1; then
  echo "Still running, sending SIGKILL"
  sudo kill -9 4821
else
  echo "Process killed"
fi

Now restart the application service properly:

sudo systemctl restart your-app-worker
sudo systemctl status your-app-worker
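The terminate-then-escalate sequence from this step can be wrapped in a function so the force kill only happens when it is actually needed. This is a minimal sketch: `stop_process` is a hypothetical helper, the 10-second default timeout is an assumption, and it has to run as root (or under sudo) when the target belongs to another user such as www-data.

```shell
# Sketch: send SIGTERM, wait up to a timeout, then escalate to SIGKILL.
# stop_process is a hypothetical helper name.
stop_process() {
  local pid="$1" timeout="${2:-10}" waited=0
  if ! kill -TERM "$pid" 2>/dev/null; then
    echo "cannot signal PID $pid (already gone, or insufficient permissions)" >&2
    return 1
  fi
  # Poll once per second until the process disappears or the timeout expires.
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "PID $pid ignored SIGTERM for ${timeout}s, sending SIGKILL" >&2
      kill -KILL "$pid" 2>/dev/null
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Usage:
# stop_process 4821 5
```

Giving the process a few seconds to handle SIGTERM lets it flush logs and release locks, which makes the later root cause analysis easier.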

Step 6: Verify Recovery

Check the health endpoint directly from the instance:

curl -v http://localhost/health

Then check the ELB target group. It should flip back to healthy once the target passes the configured number of consecutive health checks; with a healthy threshold of 2-3 checks that is typically 30-60 seconds.

aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:... \
  --query 'TargetHealthDescriptions[?Target.Id==`i-0a1b2c3d4e5f`]'
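
Rather than re-running that command by hand, you can poll until the target recovers. The sketch below is one way to do it: `get_target_state` and `wait_for_healthy` are hypothetical names, the attempt count and delay are assumptions, and `TARGET_GROUP_ARN` must be set to your real target group ARN.

```shell
# Sketch: poll a target's health until it reports "healthy" or attempts run out.
# get_target_state and wait_for_healthy are hypothetical helper names.
# TARGET_GROUP_ARN must be exported beforehand with your own target group ARN.
get_target_state() {
  aws elbv2 describe-target-health \
    --target-group-arn "$TARGET_GROUP_ARN" \
    --targets "Id=$1" \
    --query 'TargetHealthDescriptions[0].TargetHealth.State' \
    --output text
}

wait_for_healthy() {
  local instance_id="$1" attempts="${2:-12}" delay="${3:-10}" state i=1
  while [ "$i" -le "$attempts" ]; do
    state="$(get_target_state "$instance_id")"
    echo "attempt $i/$attempts: $state"
    if [ "$state" = "healthy" ]; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1   # still not healthy after all attempts
}

# Usage:
# wait_for_healthy i-0a1b2c3d4e5f 12 10 || echo "target did not recover"
```

A non-zero exit after all attempts is a signal to go back to Step 4 rather than to keep waiting.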
Don't Close the Incident Yet
The instance is healthy again but you haven’t found the root cause. The bad query will run again. Continue to Step 7 before closing the alert.

Step 7: Root Cause and Prevention

Check the application logs for the original error:

sudo journalctl -u your-app-worker --since "02:00" --until "02:20" | grep -i "error\|exception\|traceback"

# Or application-specific logs:
tail -200 /var/log/app/worker.log | grep -A5 "ERROR"

You’ll find a database query returning a result set 50x larger than expected due to a missing LIMIT clause in a recent deploy.

Post-Incident Actions

Action | Why | Priority
--- | --- | ---
Add CPU utilization alarm (threshold: 85%) | Catch this before it causes failures | 🔴 High
Add process-level CPU limit (cgroups or systemd CPUQuota) | Contain blast radius | 🔴 High
Fix the missing LIMIT in the database query | Fix root cause | 🔴 High
Add the offending query to your integration test suite | Prevent regression | 🟡 Medium
Document this in your runbook | Speed up future response | 🟡 Medium

# Example: limit CPU for the worker service via systemd
sudo systemctl edit your-app-worker

Add to the override file (CPUQuota is relative to a single CPU core, so 80% caps the service at 80% of one core):

[Service]
CPUQuota=80%
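
The CPU alarm from the first row of the table could be created along these lines. Treat it as a sketch under assumptions: `create_cpu_alarm` and the alarm name are hypothetical, the 60-second period assumes detailed monitoring is enabled, and the SNS topic ARN is one you supply. For instances that an Auto Scaling Group may replace, an alarm on the `AutoScalingGroupName` dimension is likely a better fit than a per-instance one.

```shell
# Sketch: create an 85% CPU alarm for one instance.
# create_cpu_alarm and the alarm name are hypothetical; the aws flags are real.
create_cpu_alarm() {
  local instance_id="$1" sns_topic_arn="$2"
  aws cloudwatch put-metric-alarm \
    --alarm-name "high-cpu-${instance_id}" \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions "Name=InstanceId,Value=${instance_id}" \
    --statistic Average \
    --period 60 \
    --evaluation-periods 2 \
    --threshold 85 \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --alarm-actions "$sns_topic_arn"
}

# Usage (SNS_TOPIC_ARN is your own on-call notification topic):
# create_cpu_alarm i-0a1b2c3d4e5f "$SNS_TOPIC_ARN"
```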

Summary

The instance wasn’t truly “unreachable” — it was alive but too CPU-starved to respond to health checks. The triage pattern here applies to most EC2 incidents:

  1. Scope — is it one instance or many? Is it AWS-side or OS-side?
  2. Metrics — what changed right before the failure?
  3. Connect — SSM first, SSH second, force-reboot as last resort
  4. Stabilize — fix the immediate problem to restore capacity
  5. Root cause — don’t close until you know why and have a prevention plan
Interview Angle
This scenario is a common SRE / DevOps interview question: “Walk me through how you’d debug an EC2 instance that’s failing ELB health checks.” The answer interviewers want to hear follows exactly this structure — scope first, metrics second, connection third, fix fourth, prevention always.
What You Learned
  • How to triage an unreachable EC2 instance systematically
  • The difference between AWS-side (system status) and OS-side (instance status) failures
  • How to use SSM Session Manager when SSH is blocked
  • How to use CloudWatch metrics to confirm a timeline
  • Three high-priority changes that prevent repeat incidents
