EC2 Production Instance Unreachable — Diagnose and Recover

Overview

An EC2 instance going unreachable is one of the most common production incidents. The tricky part isn’t fixing it — it’s finding the cause quickly under pressure. This scenario walks you through a methodical triage process that works whether the issue is a security group change, a runaway process eating CPU, a full disk, or a network ACL misconfiguration.

The Problem

At 2:14 AM your monitoring fires. The ELB health checks are failing for one of your backend EC2 instances. The other instances in the Auto Scaling Group are healthy, so traffic is being routed around it — but you’re now running at reduced capacity and any further failures will cause a full outage.

Active Alert

[CRITICAL] ELB Target Unhealthy
Target: i-0a1b2c3d4e5f (10.0.1.45)
Health check path: /health
Status: Target.FailedHealthChecks — 3 consecutive failures
Time: 02:14 UTC

Step 1: Confirm the Scope

Before touching anything, understand what’s actually broken. Open the AWS Console and check:

EC2 → Instances → find i-0a1b2c3d4e5f

Look at the Instance State and Status Checks columns:

Check	What it tests	Failure means
System status	AWS hypervisor / hardware	AWS-side problem; a stop/start moves the instance to a new host (a reboot stays on the same one)
Instance status	OS-level reachability	Your problem — OS, disk, CPU

# Or via CLI:
aws ec2 describe-instance-status \
  --instance-ids i-0a1b2c3d4e5f \
  --query 'InstanceStatuses[*].{System:SystemStatus.Status,Instance:InstanceStatus.Status}'

In this scenario you’ll see:

[
  {
    "System": "ok",
    "Instance": "impaired"
  }
]

The hardware is fine. The OS is impaired. This means the problem is inside the instance.
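The decision this step encodes can be written down as a small helper. This is only a sketch: `next_action` and its messages are hypothetical names made up for illustration, and it treats any combination other than the three listed (for example `insufficient-data` or `initializing`) as inconclusive.

```shell
# Hypothetical helper: map the two status-check values to a next action.
# $1 = System status, $2 = Instance status (as returned by the query above).
next_action() {
  case "$1:$2" in
    ok:ok)       echo "Status checks pass: look at the app or network path" ;;
    impaired:*)  echo "AWS-side: consider a stop/start to change hosts" ;;
    ok:impaired) echo "OS-side: investigate inside the instance" ;;
    *)           echo "Inconclusive ($1/$2): re-check shortly" ;;
  esac
}

next_action ok impaired   # prints: OS-side: investigate inside the instance
```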

Step 2: Check CloudWatch Metrics

Before trying to connect, look at the metrics to understand the timeline.

Navigate to CloudWatch → Metrics → EC2 → Per-Instance Metrics and check:

CPUUtilization — was there a spike before the failure?
StatusCheckFailed_Instance — when did it start?
NetworkIn / NetworkOut — did traffic drop suddenly?

What You'll Find

In this scenario, CPU hit 100% at 02:09 UTC and stayed there. The health check started failing at 02:11 UTC. Something is eating all the CPU.
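The same timeline can be pulled from the CLI, which is handy for pasting into the incident channel. A minimal sketch, assuming detailed monitoring is enabled (with basic monitoring, datapoints arrive every 5 minutes, so use `--period 300`). The function name `cpu_timeline` and the date in the example call are placeholders, not values from the incident.

```shell
# Sketch: print per-minute maximum CPU for a time window, oldest first.
# $1 = instance ID, $2 = start time, $3 = end time (ISO 8601, UTC)
cpu_timeline() {
  aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions "Name=InstanceId,Value=$1" \
    --start-time "$2" \
    --end-time "$3" \
    --period 60 \
    --statistics Maximum \
    --query 'sort_by(Datapoints, &Timestamp)[].[Timestamp, Maximum]' \
    --output text
}

# Example call (placeholder date):
# cpu_timeline i-0a1b2c3d4e5f 2024-01-01T02:00:00Z 2024-01-01T02:20:00Z
```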

Step 3: Try to Connect

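Since CPU is pegged, SSH may time out. Try SSM Session Manager first: it doesn’t need the SSH port open or a key pair. It does need the SSM Agent running on the instance and an instance profile that allows it, and it can still be sluggish when the OS is starved for CPU.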

# Install SSM plugin first if needed:
# https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-working-with-install-plugin.html

aws ssm start-session --target i-0a1b2c3d4e5f

If SSM is not configured on the instance, try SSH — but be prepared for it to be slow:

ssh -i ~/.ssh/your-key.pem ec2-user@10.0.1.45
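Before waiting on a session that may never open, it can help to ask Systems Manager whether the agent is still checking in. A small sketch, with `ssm_ping_status` as a hypothetical wrapper name: it prints the agent's `PingStatus` (for example `Online` or `ConnectionLost`), or `None` if the instance is not registered with SSM at all.

```shell
# Sketch: report whether the SSM Agent on an instance is reachable.
# $1 = instance ID
ssm_ping_status() {
  aws ssm describe-instance-information \
    --filters "Key=InstanceIds,Values=$1" \
    --query 'InstanceInformationList[0].PingStatus' \
    --output text
}

# Example call:
# ssm_ping_status i-0a1b2c3d4e5f
```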

SSM Not Working?

If SSM fails too, the OS may be fully locked up. Document the instance state first by taking a screenshot of the CloudWatch metrics, then force a reboot as a last resort and continue at Step 6 to verify recovery.
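A last-resort reboot might be scripted roughly like this. It is a sketch, not a prescribed procedure: `capture_and_reboot` is a hypothetical name, and the console output is saved to a file in the current directory purely as an example of preserving evidence. `reboot-instances` requests a normal OS reboot, and EC2 falls back to a hard reboot if the instance does not shut down cleanly within a few minutes.

```shell
# Sketch: save the console output for the post-mortem, then reboot.
# $1 = instance ID
capture_and_reboot() {
  id="$1"
  # Keep whatever the OS last wrote to the serial console.
  aws ec2 get-console-output --instance-id "$id" \
    --query Output --output text > "console-$id.txt" \
    || echo "warning: could not fetch console output" >&2
  # Ask EC2 to reboot the instance.
  aws ec2 reboot-instances --instance-ids "$id"
}

# Example call:
# capture_and_reboot i-0a1b2c3d4e5f
```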

Step 4: Find the Rogue Process

Once connected, immediately check what’s consuming CPU:

# One-shot snapshot of the top CPU consumers (batch mode, single iteration)
top -b -n 1 | head -20

# Or for a cleaner view:
ps aux --sort=-%cpu | head -15

You’ll see something like this:

USER       PID %CPU %MEM    VSZ   RSS COMMAND
www-data  4821 99.2  2.1 854321 43210 /usr/bin/python3 /app/worker.py

A Python worker process is consuming 99% CPU. Check how long it’s been running:

ps -p 4821 -o pid,etimes,cmd
# etimes = elapsed time in seconds

# Check if it's a legitimate process or something unexpected
# (sudo is needed to resolve the exe link of another user's process)
sudo ls -la /proc/4821/exe
tr '\0' ' ' < /proc/4821/cmdline
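One caveat when reading these numbers: the %CPU column in `ps` is CPU time divided by the process's elapsed time, so it is an average over the process lifetime, while `top` shows usage over its sampling interval. Putting elapsed time next to %CPU makes that easier to judge. The filter below is a sketch, with `hot_processes` as a hypothetical name and 80 as an arbitrary default threshold.

```shell
# Sketch: filter `ps -eo pid,pcpu,etimes,comm` output (read from stdin)
# down to processes at or above a CPU threshold.
# $1 = threshold in percent (defaults to 80)
hot_processes() {
  awk -v min="${1:-80}" 'NR > 1 && $2 + 0 >= min + 0 { print $1, $2, $3, $4 }'
}

# On the instance:
# ps -eo pid,pcpu,etimes,comm --sort=-pcpu | hot_processes 80
```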

Step 5: Fix the Immediate Problem

In this scenario the worker is stuck in an infinite loop due to a bad database query result. Kill it to restore the instance:

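# Graceful kill first (SIGTERM); sudo because the process belongs to www-data
sudo kill -15 4821
sleep 3

# Force kill (SIGKILL) only if the process is still there
if sudo kill -0 4821 2>/dev/null; then
  echo "Still running, sending SIGKILL"
  sudo kill -9 4821
else
  echo "Process killed"
fi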

Now restart the application service properly:

sudo systemctl restart your-app-worker
sudo systemctl status your-app-worker

Step 6: Verify Recovery

Check the health endpoint directly from the instance:

curl -v http://localhost/health

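Then check the ELB target group. It should flip back to healthy once the target passes the configured number of consecutive health checks, so the delay is roughly the check interval multiplied by the healthy threshold (anywhere from under a minute to a few minutes, depending on your settings).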

aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:... \
  --query 'TargetHealthDescriptions[?Target.Id==`i-0a1b2c3d4e5f`]'
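Instead of re-running the describe call by hand, the CLI's built-in waiter can block until the target is back in service. A minimal sketch: `wait_target_healthy` is a hypothetical wrapper name, and the target group ARN is passed in because the one above is elided. The waiter exits non-zero if the target does not become healthy before it gives up.

```shell
# Sketch: block until the instance is healthy in the target group.
# $1 = target group ARN, $2 = instance ID
wait_target_healthy() {
  aws elbv2 wait target-in-service \
    --target-group-arn "$1" \
    --targets "Id=$2"
}

# Example call (supply your own ARN):
# wait_target_healthy "$TARGET_GROUP_ARN" i-0a1b2c3d4e5f && echo "Back in service"
```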

Don't Close the Incident Yet

The instance is healthy again but you haven’t found the root cause. The bad query will run again. Continue to Step 7 before closing the alert.

Step 7: Root Cause and Prevention

Check the application logs for the original error:

sudo journalctl -u your-app-worker --since "02:00" --until "02:20" | grep -i "error\|exception\|traceback"

# Or application-specific logs:
tail -n 200 /var/log/app/worker.log | grep -A5 "ERROR"

You’ll find a database query returning a result set 50x larger than expected due to a missing LIMIT clause in a recent deploy.

Post-Incident Actions

Action	Why	Priority
Add CPU utilization alarm (threshold: 85%)	Catch this before it causes failures	🔴 High
Add process-level CPU limit (cgroups or systemd `CPUQuota`)	Contain blast radius	🔴 High
Fix the missing `LIMIT` in the database query	Fix root cause	🔴 High
Add the offending query to your integration test suite	Prevent regression	🟡 Medium
Document this in your runbook	Speed up future response	🟡 Medium
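The first action in the table, a CPU alarm at 85%, could be created along these lines. Treat it as a sketch: the alarm name, the two-datapoint evaluation window, and the SNS topic passed as the second argument are assumptions for illustration, and the 60-second period assumes detailed monitoring is enabled on the instance.

```shell
# Sketch: alarm when average CPU stays above 85% for two consecutive minutes.
# $1 = instance ID, $2 = SNS topic ARN to notify (supply your own)
create_cpu_alarm() {
  aws cloudwatch put-metric-alarm \
    --alarm-name "high-cpu-$1" \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions "Name=InstanceId,Value=$1" \
    --statistic Average \
    --period 60 \
    --evaluation-periods 2 \
    --threshold 85 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions "$2"
}

# Example call:
# create_cpu_alarm i-0a1b2c3d4e5f "$ALERT_TOPIC_ARN"
```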

# Example: limit CPU for the worker service via systemd
sudo systemctl edit your-app-worker

Add to the override file:

[Service]
CPUQuota=80%
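After saving the override, it is worth confirming that systemd picked it up. The sketch below uses a hypothetical wrapper, `show_cpu_quota`, around `systemctl show`. With `CPUQuota=80%` the property should read `800ms`, meaning 800 ms of CPU time per second. Note that the quota is relative to a single CPU, so 80% caps the worker at most of one core, and values above 100% are valid on multi-core instances.

```shell
# Sketch: print the effective CPU quota systemd holds for a unit.
# $1 = unit name
show_cpu_quota() {
  systemctl show "$1" --property=CPUQuotaPerSecUSec
}

# Example call:
# show_cpu_quota your-app-worker
```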

Summary

The instance wasn’t truly “unreachable” — it was alive but too CPU-starved to respond to health checks. The triage pattern here applies to most EC2 incidents:

Scope — is it one instance or many? Is it AWS-side or OS-side?
Metrics — what changed right before the failure?
Connect — SSM first, SSH second, force-reboot as last resort
Stabilize — fix the immediate problem to restore capacity
Root cause — don’t close until you know why and have a prevention plan

Interview Angle

This scenario is a common SRE / DevOps interview question: “Walk me through how you’d debug an EC2 instance that’s failing ELB health checks.” The answer interviewers want to hear follows exactly this structure — scope first, metrics second, connection third, fix fourth, prevention always.
