EC2 Production Instance Unreachable — Diagnose and Recover
Your e-commerce site's EC2 instance stopped responding at 2 AM. Customers can't check out. Walk through the full incident response from alert to resolution.
It's 2:14 AM. PagerDuty fires: 'HTTP health check FAILED — api.yourstore.com'. Your on-call phone won't stop ringing. The checkout page is returning 502. Revenue is bleeding at $400/minute.
Overview
An EC2 instance going unreachable is one of the most common production incidents. The tricky part isn’t fixing it — it’s finding the cause quickly under pressure. This scenario walks you through a methodical triage process that works whether the issue is a security group change, a runaway process eating CPU, a full disk, or a network ACL misconfiguration.
The Problem
At 2:14 AM your monitoring fires. The ELB health checks are failing for one of your backend EC2 instances. The other instances in the Auto Scaling Group are healthy, so traffic is being routed around it — but you’re now running at reduced capacity and any further failures will cause a full outage.
- Target: i-0a1b2c3d4e5f (10.0.1.45)
- Health check path: /health
- Status: Target.FailedHealthChecks (3 consecutive failures)
- Time: 02:14 UTC
Step 1: Confirm the Scope
Before touching anything, understand what’s actually broken. Open the AWS Console and check:
EC2 → Instances → find i-0a1b2c3d4e5f
Look at the Instance State and Status Checks columns:
| Check | What it tests | Failure means |
|---|---|---|
| System status | AWS hypervisor / hardware | AWS-side problem; a stop and start usually moves the instance to healthy hardware (a plain reboot keeps the same host) |
| Instance status | OS-level reachability | Your problem — OS, disk, CPU |
# Or via CLI:
aws ec2 describe-instance-status \
--instance-ids i-0a1b2c3d4e5f \
--query 'InstanceStatuses[*].{System:SystemStatus.Status,Instance:InstanceStatus.Status}'
In this scenario you’ll see:
[
{
"System": "ok",
"Instance": "impaired"
}
]
The hardware is fine. The OS is impaired. This means the problem is inside the instance.
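The same interpretation can be scripted for a runbook. The sketch below is only an illustration: `classify_status` is a hypothetical function name, and the labels it prints are made up for this example rather than anything the AWS CLI returns.

```shell
# Hypothetical helper: turn the two status-check results into a triage hint.
# Arguments: $1 = SystemStatus.Status, $2 = InstanceStatus.Status
classify_status() {
  local system="$1" instance="$2"
  if [ "$system" = "impaired" ]; then
    echo "aws-side"      # hypervisor or hardware problem
  elif [ "$instance" = "impaired" ]; then
    echo "os-side"       # inside the instance: OS, disk, CPU, memory
  elif [ "$system" = "ok" ] && [ "$instance" = "ok" ]; then
    echo "healthy"
  else
    echo "unknown"       # e.g. initializing or insufficient-data
  fi
}

# Example call with the values from this scenario:
classify_status ok impaired   # prints: os-side
```

To feed it live data, one option is to run the describe-instance-status command with `--query 'InstanceStatuses[0].[SystemStatus.Status,InstanceStatus.Status]' --output text` and pass the two resulting words as arguments.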
Step 2: Check CloudWatch Metrics
Before trying to connect, look at the metrics to understand the timeline.
Navigate to CloudWatch → Metrics → EC2 → Per-Instance Metrics and check:
- CPUUtilization — was there a spike before the failure?
- StatusCheckFailed_Instance — when did it start?
- NetworkIn / NetworkOut — did traffic drop suddenly?
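If the console is slow to load, the same metrics can be pulled from the CLI. This is a minimal sketch: `metric_history` is a hypothetical wrapper name, the 300-second period assumes basic (5-minute) monitoring, and the dates in the example calls are placeholders because the scenario only gives a time of day.

```shell
# Hypothetical wrapper around `aws cloudwatch get-metric-statistics`.
# Arguments: $1 = instance ID, $2 = metric name,
#            $3 = start time, $4 = end time (ISO 8601, UTC)
metric_history() {
  aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 \
    --metric-name "$2" \
    --dimensions "Name=InstanceId,Value=$1" \
    --start-time "$3" \
    --end-time "$4" \
    --period 300 \
    --statistics Maximum \
    --query 'sort_by(Datapoints, &Timestamp)[].[Timestamp, Maximum]' \
    --output text
}

# Example calls (placeholder date, one hour around the 02:14 UTC alert):
# metric_history i-0a1b2c3d4e5f CPUUtilization 2024-01-01T01:30:00Z 2024-01-01T02:30:00Z
# metric_history i-0a1b2c3d4e5f StatusCheckFailed_Instance 2024-01-01T01:30:00Z 2024-01-01T02:30:00Z
```

Sorting by timestamp makes it easier to see whether the CPU spike started before the first failed status check.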
Step 3: Try to Connect
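In this scenario the CPUUtilization graph is pegged near 100%, so SSH may time out. Try SSM Session Manager first: it doesn’t need the SSH port open or a key pair. The SSM Agent still runs inside the OS, so expect the session to be sluggish on a CPU-starved instance.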
# Install SSM plugin first if needed:
# https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-working-with-install-plugin.html
aws ssm start-session --target i-0a1b2c3d4e5f
If SSM is not configured on the instance, try SSH — but be prepared for it to be slow:
ssh -i ~/.ssh/your-key.pem ec2-user@10.0.1.45
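Before guessing which path will work, you can ask Systems Manager whether the agent on the instance is still checking in. The sketch below assumes the instance was registered with SSM at some point; `ssm_ping` and `connect_hint` are hypothetical helper names, and the 15-second SSH timeout is an arbitrary choice.

```shell
# Hypothetical helper: print the SSM Agent ping status for an instance.
# Typical values are Online, ConnectionLost, or Inactive; the CLI prints
# "None" when the instance is not known to Systems Manager.
ssm_ping() {
  aws ssm describe-instance-information \
    --filters "Key=InstanceIds,Values=$1" \
    --query 'InstanceInformationList[0].PingStatus' \
    --output text
}

# Suggest a connection command. Arguments: $1 = instance ID, $2 = private IP
connect_hint() {
  if [ "$(ssm_ping "$1")" = "Online" ]; then
    echo "aws ssm start-session --target $1"
  else
    # ConnectTimeout stops ssh from hanging indefinitely on a starved host
    echo "ssh -o ConnectTimeout=15 -i ~/.ssh/your-key.pem ec2-user@$2"
  fi
}

# Example:
# connect_hint i-0a1b2c3d4e5f 10.0.1.45
```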
Step 4: Find the Rogue Process
Once connected, immediately check what’s consuming CPU:
# One-shot snapshot of the top CPU consumers (batch mode, single iteration)
top -b -n 1 | head -20
# Or for a cleaner view:
ps aux --sort=-%cpu | head -15
You’ll see something like this (some columns omitted):
USER PID %CPU %MEM VSZ RSS COMMAND
www-data 4821 99.2 2.1 854321 43210 /usr/bin/python3 /app/worker.py
A Python worker process is consuming 99% CPU. Check how long it’s been running:
ps -p 4821 -o pid,etimes,cmd
# etimes = elapsed time in seconds
# Check if it's a legitimate process or something unexpected
# sudo is needed to read the exe link of a process owned by another user
sudo ls -la /proc/4821/exe
tr '\0' ' ' < /proc/4821/cmdline
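When the process list is long, a small filter can narrow it to the offenders. This is a sketch under assumptions: `filter_hot` and `hot_procs` are hypothetical names, the 80% default threshold is arbitrary, and the canned input is made-up sample data modelled on this scenario. Note that `ps` reports CPU averaged over the lifetime of the process, so a process that only recently started spinning may show a lower figure than `top` does.

```shell
# Hypothetical filter: keep the header row plus any process whose %CPU
# (second column) is at or above the threshold in $1 (default 80).
filter_hot() {
  awk -v t="${1:-80}" 'NR == 1 || $2 + 0 >= t'
}

# List processes sorted by CPU and keep only the hot ones.
hot_procs() {
  ps -eo pid,pcpu,etimes,user,args --sort=-pcpu | filter_hot "${1:-80}"
}

# Demonstration on canned, made-up input (only the worker row survives):
printf '%s\n' \
  'PID %CPU ELAPSED USER COMMAND' \
  '4821 99.2 1380 www-data /usr/bin/python3 /app/worker.py' \
  '912 1.3 86400 root /usr/sbin/sshd -D' | filter_hot 80
```

On the instance itself, `hot_procs 80` runs the same filter against the live process table.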
Step 5: Fix the Immediate Problem
In this scenario the worker is stuck in an infinite loop due to a bad database query result. Kill it to restore the instance:
# Graceful kill first (sudo because the process runs as www-data)
sudo kill -15 4821
sleep 3
# Force kill only if it's still there
if ps -p 4821 > /dev/null 2>&1; then
  echo "Still running, sending SIGKILL"
  sudo kill -9 4821
else
  echo "Process killed"
fi
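If you would rather not rely on a fixed three-second sleep, the same TERM-then-KILL escalation can be wrapped in a function that polls for the process to exit. This is a sketch, not a hardened tool: `stop_pid` is a hypothetical name, the five-second default grace period is arbitrary, and it assumes you run it as root or as the process owner.

```shell
# Hypothetical helper: send SIGTERM, wait up to $2 seconds (default 5),
# then send SIGKILL if the process is still present.
# Caveat: a permission error is indistinguishable from "already gone" here.
stop_pid() {
  local pid="$1" grace="${2:-5}" waited=0
  kill -TERM "$pid" 2>/dev/null || return 0      # nothing to signal
  while [ "$waited" -lt "$grace" ]; do
    kill -0 "$pid" 2>/dev/null || return 0       # exited after SIGTERM
    sleep 1
    waited=$((waited + 1))
  done
  kill -KILL "$pid" 2>/dev/null || true          # last resort
}

# Example for this scenario (run from a root shell):
# stop_pid 4821 5
```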
Now restart the application service properly:
sudo systemctl restart your-app-worker
sudo systemctl status your-app-worker
Step 6: Verify Recovery
Check the health endpoint directly from the instance:
curl -v http://localhost/health
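Then check the ELB target group. The target flips back to healthy after HealthyThresholdCount consecutive passing checks, each HealthCheckIntervalSeconds apart, so the delay depends on your target group settings (often somewhere between 30 seconds and a few minutes).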
aws elbv2 describe-target-health \
--target-group-arn arn:aws:elasticloadbalancing:... \
--query "TargetHealthDescriptions[?Target.Id=='i-0a1b2c3d4e5f']"
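Rather than re-running that command by hand, you could poll until the target reports healthy. The sketch below is illustrative: `wait_target_healthy` is a hypothetical name, and the default of 12 attempts at 10-second spacing is an assumption you should match to your own health check settings.

```shell
# Hypothetical helper: poll one target's health until it reports "healthy".
# Arguments: $1 = target group ARN, $2 = instance ID,
#            $3 = max attempts (default 12), $4 = seconds between attempts (default 10)
wait_target_healthy() {
  local arn="$1" id="$2" tries="${3:-12}" pause="${4:-10}" state n=0
  while [ "$n" -lt "$tries" ]; do
    state="$(aws elbv2 describe-target-health \
      --target-group-arn "$arn" \
      --targets "Id=$id" \
      --query 'TargetHealthDescriptions[0].TargetHealth.State' \
      --output text)"
    echo "attempt $((n + 1)): $state"
    [ "$state" = "healthy" ] && return 0
    sleep "$pause"
    n=$((n + 1))
  done
  return 1   # still not healthy after all attempts
}

# Example (substitute your full target group ARN):
# wait_target_healthy "$TARGET_GROUP_ARN" i-0a1b2c3d4e5f
```

The AWS CLI also ships a built-in waiter, `aws elbv2 wait target-in-service`, which may be enough if you don't need the per-attempt output.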
Step 7: Root Cause and Prevention
Check the application logs for the original error:
sudo journalctl -u your-app-worker --since "02:00" --until "02:20" | grep -i "error\|exception\|traceback"
# Or application-specific logs:
tail -200 /var/log/app/worker.log | grep -A5 "ERROR"
You’ll find a database query returning a result set 50x larger than expected due to a missing LIMIT clause in a recent deploy.
Post-Incident Actions
| Action | Why | Priority |
|---|---|---|
| Add CPU utilization alarm (threshold: 85%) | Catch this before it causes failures | 🔴 High |
| Add process-level CPU limit (cgroups or systemd CPUQuota) | Contain blast radius | 🔴 High |
| Fix the missing LIMIT in the database query | Fix root cause | 🔴 High |
| Add the offending query to your integration test suite | Prevent regression | 🟡 Medium |
| Document this in your runbook | Speed up future response | 🟡 Medium |
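The first action in the table, the CPU alarm, might look like the following. Treat it as a sketch: `create_cpu_alarm` is a hypothetical wrapper name, the alarm name and the two-period evaluation window are assumptions, and the SNS topic ARN is whatever notification target you already use.

```shell
# Hypothetical wrapper: alarm when average CPU stays above 85% for two
# consecutive 5-minute periods.
# Arguments: $1 = instance ID, $2 = SNS topic ARN to notify (supply your own)
create_cpu_alarm() {
  aws cloudwatch put-metric-alarm \
    --alarm-name "high-cpu-$1" \
    --namespace AWS/EC2 \
    --metric-name CPUUtilization \
    --dimensions "Name=InstanceId,Value=$1" \
    --statistic Average \
    --period 300 \
    --evaluation-periods 2 \
    --threshold 85 \
    --comparison-operator GreaterThanThreshold \
    --alarm-actions "$2"
}

# Example:
# create_cpu_alarm i-0a1b2c3d4e5f "$SNS_TOPIC_ARN"
```

Because instances in an Auto Scaling Group get replaced, a per-instance alarm like this goes stale; an alarm on the AutoScalingGroupName dimension is likely a better long-term fit.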
# Example: limit CPU for the worker service via systemd
sudo systemctl edit your-app-worker
Add to the override file:
[Service]
CPUQuota=80%
Summary
The instance wasn’t truly “unreachable” — it was alive but too CPU-starved to respond to health checks. The triage pattern here applies to most EC2 incidents:
- Scope — is it one instance or many? Is it AWS-side or OS-side?
- Metrics — what changed right before the failure?
- Connect — SSM first, SSH second, force-reboot as last resort
- Stabilize — fix the immediate problem to restore capacity
- Root cause — don’t close until you know why and have a prevention plan
What You Learned
- How to triage an unreachable EC2 instance systematically
- The difference between network-layer vs OS-layer failures
- How to use SSM Session Manager when SSH is blocked
- How to use CloudWatch metrics to confirm a timeline
- Three runbook changes that prevent repeat incidents