Application Latency Spiked After Migrating EC2 to ECS Fargate
Your application's p99 latency increased 3× after moving from EC2 to ECS Fargate. Diagnose the root cause — cold starts, DNS delays, missing VPC endpoints, or undersized tasks.
Your team migrated the API gateway service from EC2 Auto Scaling to ECS Fargate last weekend. p50 latency is the same, but p99 jumped from 120ms to 380ms and some users report requests timing out entirely. Everything looks fine in the Fargate task metrics — CPU is at 15%, memory at 40%.
The Problem
Latency regressions after migrating to Fargate are almost always caused by networking changes — not application code. Fargate uses awsvpc networking mode, where each task gets its own Elastic Network Interface (ENI). This is different from EC2 where the instance ENI is already attached and warm.
The p99 spikes you’re seeing are a classic pattern: most requests are fast, but occasionally something — ENI creation, DNS lookup, VPC endpoint miss — adds hundreds of milliseconds to a small percentage of requests.
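To see why a small share of slow requests moves p99 but leaves p50 untouched, here is a minimal sketch using made-up latency samples. The 3% slow-path share and the `percentile` helper are assumptions for illustration, not measurements from the incident.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds
baseline = [100] * 1000                       # every request takes the fast path
after_migration = [100] * 970 + [380] * 30    # 3% of requests hit a slow path

for label, samples in (("before", baseline), ("after", after_migration)):
    print(label, "p50 =", percentile(samples, 50), "ms, p99 =", percentile(samples, 99), "ms")
```

Because the median only looks at the middle of the distribution, it cannot see a slow path that affects fewer than half of the requests. Averages and p50 dashboards will therefore look healthy during this kind of regression.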
Step 1: Use X-Ray to Find the Slow Segment
Enable X-Ray tracing in your ECS task definition and instrument your application:
{
"containerDefinitions": [
{
"name": "api-gateway",
"image": "...",
"environment": [
{"name": "AWS_XRAY_DAEMON_ADDRESS", "value": "127.0.0.1:2000"}
]
},
{
"name": "xray-daemon",
"image": "amazon/aws-xray-daemon",
"portMappings": [{"containerPort": 2000, "protocol": "udp"}]
}
]
}
# Analyze the X-Ray service graph for services with slow responses
# (histogram values are in seconds; date -d requires GNU date)
aws xray get-service-graph \
--start-time $(date -d '1 hour ago' +%s) \
--end-time $(date +%s) \
--query 'Services[?ResponseTimeHistogram[?Value > `0.2`]]'
# Find traces with high latency (response time over 300 ms)
aws xray get-trace-summaries \
--filter-expression 'responsetime > 0.3' \
--start-time $(date -d '1 hour ago' +%s) \
--end-time $(date +%s)
The X-Ray service map will show you exactly which downstream call (S3, DynamoDB, RDS, another service) is slow.
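Once you have a slow trace, the same question can be answered programmatically. The sketch below assumes you have already fetched a segment document (X-Ray returns each one as a JSON string) and ranks its subsegments by duration. The sample document and the `slowest_subsegments` helper are hypothetical; only the `name`, `start_time`, `end_time`, and `subsegments` fields come from the X-Ray segment document format.

```python
import json

def slowest_subsegments(segment, top=3):
    """Return (name, duration_ms) for the slowest subsegments, including nested ones."""
    found = []
    stack = list(segment.get("subsegments", []))
    while stack:
        sub = stack.pop()
        if "end_time" in sub:  # in-progress subsegments have no end_time yet
            duration_ms = round((sub["end_time"] - sub["start_time"]) * 1000, 1)
            found.append((sub["name"], duration_ms))
        stack.extend(sub.get("subsegments", []))
    return sorted(found, key=lambda item: item[1], reverse=True)[:top]

# Hypothetical segment document for one slow request
document = json.dumps({
    "name": "api-gateway",
    "start_time": 1700000000.000,
    "end_time": 1700000000.380,
    "subsegments": [
        {"name": "DynamoDB", "start_time": 1700000000.010, "end_time": 1700000000.030},
        {"name": "secretsmanager", "start_time": 1700000000.040, "end_time": 1700000000.340},
    ],
})

for name, duration_ms in slowest_subsegments(json.loads(document)):
    print(f"{name}: {duration_ms} ms")
```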
Step 2: Check for Missing VPC Endpoints
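On both EC2 and Fargate, calls to AWS services (S3, DynamoDB, Secrets Manager, etc.) follow the route table of the subnet the ENI lives in. A migration often places tasks in different subnets than the old instances used. With awsvpc, each task has its own ENI in a private subnet, so calls that previously went through a VPC endpoint or an internet gateway may now go through a NAT Gateway instead.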
Symptom: Occasional latency spikes on calls to S3, DynamoDB, or other AWS services.
# Check if VPC endpoints exist for services you're calling
aws ec2 describe-vpc-endpoints \
--filters "Name=vpc-id,Values=vpc-0abc123" \
--query 'VpcEndpoints[*].{Service:ServiceName,State:State}'
If you see s3, dynamodb, secretsmanager, ecr.api, ecr.dkr in the list, they’re covered. If missing, traffic goes via NAT Gateway (adds 5-30ms per call).
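The comparison can be scripted if you check several VPCs. This is a minimal sketch that takes the JSON list produced by the `describe-vpc-endpoints` query above (objects with `Service` and `State` keys) and reports which services have no usable endpoint. The `REQUIRED` list and the `missing_endpoints` helper are assumptions; adjust the list to the services your tasks actually call.

```python
REQUIRED = ("s3", "dynamodb", "secretsmanager", "ecr.api", "ecr.dkr")

def missing_endpoints(endpoints, required=REQUIRED):
    """Return the required services that have no endpoint in the 'available' state."""
    covered = set()
    for endpoint in endpoints:
        if endpoint.get("State", "").lower() != "available":
            continue
        # Service names look like com.amazonaws.<region>.<service>
        parts = endpoint["Service"].split(".", 3)
        if len(parts) == 4:
            covered.add(parts[3])
    return [service for service in required if service not in covered]

# Hypothetical output of the CLI query above
endpoints = [
    {"Service": "com.amazonaws.us-east-1.s3", "State": "available"},
    {"Service": "com.amazonaws.us-east-1.ecr.api", "State": "available"},
    {"Service": "com.amazonaws.us-east-1.dynamodb", "State": "pending"},
]
print("Missing endpoints:", missing_endpoints(endpoints))
```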
# Create missing VPC endpoints (interface type for most services)
aws ec2 create-vpc-endpoint \
--vpc-id vpc-0abc123 \
--service-name com.amazonaws.us-east-1.secretsmanager \
--vpc-endpoint-type Interface \
--subnet-ids subnet-private-az1 subnet-private-az2 \
--security-group-ids sg-allow-https
# Gateway endpoint for S3 and DynamoDB (free, no per-hour charge)
aws ec2 create-vpc-endpoint \
--vpc-id vpc-0abc123 \
--service-name com.amazonaws.us-east-1.s3 \
--vpc-endpoint-type Gateway \
--route-table-ids rtb-private
Step 3: ENI Cold Start — Fargate Task Startup Latency
When ECS launches a new Fargate task, it must:
- Pull the container image from ECR (~10-30s)
- Attach an ENI to the task (~5-20s)
- Register the task with the target group (~30s)
During scale-out events, new tasks aren’t ready to serve traffic for 1-3 minutes. During that window, the ALB routes more traffic to existing tasks, increasing their latency.
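Fix 1: Run more tasks at baseline so that ordinary traffic peaks do not trigger a scale-out. If Service Auto Scaling manages this service, raise its minimum capacity as well, otherwise it will scale the count back down: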
aws ecs update-service \
--cluster prod-cluster \
--service api-gateway \
--desired-count 6 # Was 3 — more headroom before scale triggers
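How much headroom is enough depends on your traffic. As a rough sizing sketch, assuming you know your peak request rate and how many requests per second one task handles comfortably, the baseline count can be estimated as follows. All numbers and the `min_task_count` helper are hypothetical.

```python
def min_task_count(peak_rps, rps_per_task, target_utilization_pct=60):
    """Tasks needed so that peak traffic uses at most target_utilization_pct of capacity."""
    if peak_rps < 0 or rps_per_task <= 0 or not 0 < target_utilization_pct <= 100:
        raise ValueError("rates must be positive and utilization must be in (0, 100]")
    usable_per_task = rps_per_task * target_utilization_pct  # scaled by 100
    # Integer ceiling division avoids floating-point rounding surprises
    return max(1, -(-peak_rps * 100 // usable_per_task))

# Hypothetical: 900 req/s at peak, 250 req/s per task, keep tasks at or below 60% load
print(min_task_count(peak_rps=900, rps_per_task=250))
```

Sizing for a utilization target well below 100% leaves the existing tasks room to absorb extra traffic during the minutes a new task needs to become healthy.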
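Fix 2: Use ECS Service Connect for service-to-service calls (its sidecar proxy handles service discovery, so callers avoid a network DNS lookup for each downstream service):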
{
"serviceConnectConfiguration": {
"enabled": true,
"namespace": "prod.local",
"services": [{
"portName": "api-gateway-port",
"clientAliases": [{"port": 80, "dnsName": "api-gateway"}]
}]
}
}
Fix 3: Enable connection keep-alive in your application to reuse connections instead of creating new TCP connections per request.
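The effect of keep-alive is easy to demonstrate locally. The sketch below starts a throwaway HTTP/1.1 server on localhost and counts how many TCP connections the client opens with and without reuse. It uses only the standard library; the `EchoHandler` and `count_connections` names are made up for this example, and a real service would get the same benefit from its HTTP client's connection pool.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class EchoHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # required for persistent connections
    seen_clients = set()           # (host, port) of every TCP connection that sent a request

    def do_GET(self):
        EchoHandler.seen_clients.add(self.client_address)
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet

def count_connections(reuse, n_requests=5):
    """Send n_requests GETs and return how many TCP connections the server saw."""
    EchoHandler.seen_clients = set()
    server = ThreadingHTTPServer(("127.0.0.1", 0), EchoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    shared = http.client.HTTPConnection(host, port) if reuse else None
    try:
        for _ in range(n_requests):
            conn = shared if reuse else http.client.HTTPConnection(host, port)
            conn.request("GET", "/")
            conn.getresponse().read()  # drain the body so the connection can be reused
            if not reuse:
                conn.close()
    finally:
        if shared is not None:
            shared.close()
        server.shutdown()
        server.server_close()
    return len(EchoHandler.seen_clients)

print("connections with keep-alive:", count_connections(reuse=True))
print("connections without keep-alive:", count_connections(reuse=False))
```

Every extra connection costs at least one TCP handshake, plus a TLS handshake and possibly a DNS lookup for HTTPS endpoints, which is exactly the kind of occasional overhead that shows up in p99.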
Step 4: DNS Caching in Fargate Tasks
Containers in a Fargate task usually have no local DNS cache, so every lookup goes over the network to the VPC resolver. If your application performs a new hostname lookup per request, it pays the DNS round trip every time.
# Check if DNS resolution is slow from inside a running task
# (requires ECS Exec to be enabled; "time" is a shell keyword, so run it via sh -c)
aws ecs execute-command \
--cluster prod-cluster \
--task <task-id> \
--container api-gateway \
--interactive \
--command "sh -c 'time nslookup mydb.us-east-1.rds.amazonaws.com'"
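If the image has no `nslookup`, the same measurement can be taken from the application's own runtime. This is a small sketch that times repeated lookups through the system resolver; the `time_lookups` helper is an assumption, and `localhost` is used only so the example runs anywhere. Inside a task, pass the hostname your application actually calls.

```python
import socket
import time

def time_lookups(hostname, attempts=5):
    """Resolve hostname repeatedly and return each lookup's duration in milliseconds."""
    durations_ms = []
    for _ in range(attempts):
        start = time.perf_counter()
        socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
        durations_ms.append((time.perf_counter() - start) * 1000)
    return durations_ms

durations = time_lookups("localhost")
print("lookup times (ms):", [round(d, 2) for d in durations])
print("slowest (ms):", round(max(durations), 2))
```

If every attempt takes roughly the same non-trivial time, nothing is caching the answer between calls.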
Fix: Enable DNS caching at the application level:
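# Python: use dnspython with a local cache
import dns.resolver
resolver = dns.resolver.get_default_resolver()
resolver.cache = dns.resolver.LRUCache(max_size=500)
# Route socket.getaddrinfo() through dnspython so HTTP clients in this process use the cache
dns.resolver.override_system_resolver(resolver)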
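Or run a caching resolver such as CoreDNS (with its cache plugin) as a sidecar container and point the application at it. NodeLocal DNSCache is a Kubernetes add-on and is not available to ECS tasks.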
Step 5: Rightsize the Fargate Task
Fargate network throughput grows with task size, although AWS does not publish exact per-size limits. A 0.25 vCPU task has very limited network throughput. If your task makes many outbound calls (microservice fanout), the network becomes the bottleneck even when CPU and memory show low utilization.
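# Check the network receive rate for the service (repeat with NetworkTxBytes for outbound traffic)
# These metrics come from Container Insights, which must be enabled on the cluster; they are reported in bytes/second
aws cloudwatch get-metric-statistics \
--namespace ECS/ContainerInsights \
--metric-name NetworkRxBytes \
--dimensions Name=ServiceName,Value=api-gateway Name=ClusterName,Value=prod-cluster \
--statistics Average \
--period 60 \
--start-time $(date -d '1 hour ago' -u +%Y-%m-%dT%H:%M:%SZ) \
--end-time $(date -u +%Y-%m-%dT%H:%M:%SZ)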
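The datapoints are easier to judge once converted to megabits per second. The sketch below assumes each datapoint carries an `Average` in bytes per second, which is the unit Container Insights reports for this metric; if yours are byte totals per period, divide by the period length first. The 1000 Mbps limit, the 80% threshold, and the sample datapoints are placeholders, not AWS figures.

```python
def to_mbps(bytes_per_second):
    """Convert a rate in bytes/second to megabits/second."""
    return bytes_per_second * 8 / 1_000_000

def saturated_datapoints(datapoints, limit_mbps, threshold=0.8):
    """Return (timestamp, mbps) for datapoints above threshold * limit_mbps."""
    flagged = []
    for point in datapoints:
        mbps = to_mbps(point["Average"])
        if mbps > limit_mbps * threshold:
            flagged.append((point["Timestamp"], mbps))
    return flagged

# Hypothetical datapoints, shaped like the "Datapoints" list in the CLI output
datapoints = [
    {"Timestamp": "2024-01-01T10:00:00Z", "Average": 10_000_000},
    {"Timestamp": "2024-01-01T10:01:00Z", "Average": 125_000_000},
]
for timestamp, mbps in saturated_datapoints(datapoints, limit_mbps=1000):
    print(f"{timestamp}: {mbps:.0f} Mbps is above 80% of the assumed limit")
```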
If network throughput is consistently near the limit for your task size, upgrade:
| Fargate CPU | Approximate max network bandwidth |
|---|---|
| 0.25 vCPU | Up to ~1 Gbps (shared) |
| 1 vCPU | Up to ~2.5 Gbps |
| 4 vCPU | Up to 10 Gbps |
Increase cpu from 256 to 1024 for more network headroom (memory must be a value Fargate allows for that CPU size):
{
"cpu": "1024",
"memory": "2048"
}
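Fargate only accepts certain CPU and memory pairings, so a size bump can fail at registration time. The sketch below checks a pair against the combinations documented for Fargate at the time of writing; treat the table as an assumption and confirm it against the current ECS documentation. The `is_valid_fargate_size` helper is hypothetical.

```python
# CPU units -> allowed memory values in MiB (assumed from the Fargate task size documentation)
FARGATE_MEMORY_MIB = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
    8192: list(range(16384, 61440 + 1, 4096)),
    16384: list(range(32768, 122880 + 1, 8192)),
}

def is_valid_fargate_size(cpu, memory):
    """Accept the string values used in task definitions as well as integers."""
    return int(memory) in FARGATE_MEMORY_MIB.get(int(cpu), [])

print(is_valid_fargate_size("1024", "2048"))
print(is_valid_fargate_size("256", "4096"))
```

Running a check like this in CI catches an invalid pairing before the deployment does.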
Root Cause Summary
| Root Cause | Symptom | Fix |
|---|---|---|
| Missing VPC endpoints | Slow S3/DynamoDB/SM calls | Create Interface/Gateway endpoints |
| ENI attachment on scale-out | Latency spikes during scale events | Increase minimum desired count |
| DNS per-request lookups | High p99 on first call per host | Application-level DNS caching |
| Low CPU → low network bandwidth | Network saturation | Increase task CPU |
| No HTTP keep-alive | New TCP handshake per request | Enable keep-alive in app config |
- Why p99 latency spikes on Fargate while p50 stays flat
- How Fargate's awsvpc networking mode differs from EC2 networking
- Why missing VPC endpoints cause latency for calls to AWS services
- How ENI attachment delays cause cold-start latency spikes
- How to use X-Ray to pinpoint the exact segment causing slow requests