
Application Latency Spiked After Migrating EC2 to ECS Fargate

Your application's p99 latency increased 3× after moving from EC2 to ECS Fargate. Diagnose the root cause — cold starts, DNS delays, missing VPC endpoints, or undersized tasks.

January 20, 2025 | 5 min read | ~25 min to complete
The Situation

Your team migrated the API gateway service from EC2 Auto Scaling to ECS Fargate last weekend. p50 latency is the same, but p99 jumped from 120ms to 380ms and some users report requests timing out entirely. Everything looks fine in the Fargate task metrics — CPU is at 15%, memory at 40%.

5 Steps
6 Services Used
~25 min Duration
Advanced Difficulty

The Problem

Latency regressions after migrating to Fargate are almost always caused by networking changes — not application code. Fargate uses awsvpc networking mode, where each task gets its own Elastic Network Interface (ENI). This is different from EC2 where the instance ENI is already attached and warm.

The p99 spikes you’re seeing are a classic pattern: most requests are fast, but occasionally something — ENI creation, DNS lookup, VPC endpoint miss — adds hundreds of milliseconds to a small percentage of requests.
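To make the pattern concrete, here is a minimal sketch with made-up latency samples (the 2% share of slow requests and the millisecond values are illustrative assumptions, not measurements from this incident). It shows how a small fraction of slow requests leaves p50 untouched while moving p99:

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 1000 requests, all fast.
baseline = [100.0] * 1000
# Same traffic, but 2% of requests pay an extra "cold" penalty.
with_cold_events = [100.0] * 980 + [380.0] * 20

for name, data in (("baseline", baseline), ("with cold events", with_cold_events)):
    print(name, "p50 =", percentile(data, 50), "p99 =", percentile(data, 99))
```

Because only the tail moves, average-based dashboards (and low CPU/memory numbers) can look healthy while users still hit slow requests.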

Step 1: Use X-Ray to Find the Slow Segment

Enable X-Ray tracing in your ECS task definition and instrument your application:

{
  "containerDefinitions": [
    {
      "name": "api-gateway",
      "image": "...",
      "environment": [
        {"name": "AWS_XRAY_DAEMON_ADDRESS", "value": "127.0.0.1:2000"}
      ]
    },
    {
      "name": "xray-daemon",
      "image": "amazon/aws-xray-daemon",
      "portMappings": [{"containerPort": 2000, "protocol": "udp"}]
    }
  ]
}

# Analyze X-Ray service map for slow segments
# (date -d is GNU date syntax; histogram entries expose Value, the response time in seconds, and Count)
aws xray get-service-graph \
  --start-time $(date -d '1 hour ago' +%s) \
  --end-time $(date +%s) \
  --query 'Services[?ResponseTimeHistogram[?Value > `0.2`]]'

# Find traces with high latency (responsetime is in seconds)
aws xray get-trace-summaries \
  --filter-expression 'responsetime > 0.3' \
  --start-time $(date -d '1 hour ago' +%s) \
  --end-time $(date +%s)

The X-Ray service map will show you exactly which downstream call (S3, DynamoDB, RDS, another service) is slow.
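If you prefer to inspect a single slow trace offline, the sketch below ranks the subsegments of one X-Ray segment document by duration. It assumes you already have the document as a JSON string (for example from `aws xray batch-get-traces`); the function name `slowest_subsegments` and the sample data are hypothetical.

```python
import json


def slowest_subsegments(segment_document: str, top: int = 3) -> list[tuple[str, float]]:
    """Return (name, duration_ms) for the slowest subsegments in one segment document."""
    root = json.loads(segment_document)
    found: list[tuple[str, float]] = []

    def walk(node: dict) -> None:
        for sub in node.get("subsegments", []):
            # In-progress subsegments have no end_time yet, so skip them.
            if "end_time" in sub:
                duration_ms = (sub["end_time"] - sub["start_time"]) * 1000
                found.append((sub["name"], duration_ms))
            walk(sub)

    walk(root)
    return sorted(found, key=lambda item: item[1], reverse=True)[:top]


# Hypothetical segment document with two downstream calls.
sample = json.dumps({
    "name": "api-gateway",
    "start_time": 100.0,
    "end_time": 100.5,
    "subsegments": [
        {"name": "DynamoDB", "start_time": 100.0, "end_time": 100.0625},
        {"name": "SecretsManager", "start_time": 100.125, "end_time": 100.375},
    ],
})
print(slowest_subsegments(sample))
```

In this made-up trace the Secrets Manager call dominates, which would point you toward Step 2 rather than toward task sizing.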

Step 2: Check for Missing VPC Endpoints

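AWS service calls (to S3, DynamoDB, Secrets Manager, etc.) follow the route table of the subnet the caller runs in, on EC2 and Fargate alike. If the old EC2 instances sat in subnets with VPC endpoints or a direct internet gateway route, while the Fargate tasks (each with its own ENI under awsvpc) were placed in private subnets without those endpoints, the same calls now route through a NAT Gateway instead.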

Symptom: Occasional latency spikes on calls to S3, DynamoDB, or other AWS services.

# Check if VPC endpoints exist for services you're calling
aws ec2 describe-vpc-endpoints \
  --filters "Name=vpc-id,Values=vpc-0abc123" \
  --query 'VpcEndpoints[*].{Service:ServiceName,State:State}'

If you see s3, dynamodb, secretsmanager, ecr.api, ecr.dkr in the list, they’re covered. If missing, traffic goes via NAT Gateway (adds 5-30ms per call).
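The comparison can also be scripted. This is a minimal sketch that takes the `ServiceName` values from the command above and reports which required services have no endpoint; the function name `missing_endpoints` and the sample lists are illustrative assumptions.

```python
def missing_endpoints(existing_service_names: list[str], required: list[str]) -> list[str]:
    """Return the required short names (e.g. 's3') that have no matching endpoint."""
    present = set()
    for name in existing_service_names:
        # Service names look like 'com.amazonaws.<region>.<service>',
        # e.g. 'com.amazonaws.us-east-1.ecr.api' -> 'ecr.api'.
        parts = name.split(".", 3)
        if len(parts) == 4:
            present.add(parts[3])
    return [service for service in required if service not in present]


# Hypothetical output of describe-vpc-endpoints for one VPC.
existing = [
    "com.amazonaws.us-east-1.s3",
    "com.amazonaws.us-east-1.ecr.api",
]
required = ["s3", "dynamodb", "secretsmanager", "ecr.api", "ecr.dkr"]
print(missing_endpoints(existing, required))
```

Adjust the `required` list to the services your X-Ray traces actually show; endpoints you never call only add cost.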

# Create missing VPC endpoints (interface type for most services)
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc123 \
  --service-name com.amazonaws.us-east-1.secretsmanager \
  --vpc-endpoint-type Interface \
  --subnet-ids subnet-private-az1 subnet-private-az2 \
  --security-group-ids sg-allow-https

# Gateway endpoint for S3 and DynamoDB (free, no per-hour charge)
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc123 \
  --service-name com.amazonaws.us-east-1.s3 \
  --vpc-endpoint-type Gateway \
  --route-table-ids rtb-private

Step 3: ENI Cold Start — Fargate Task Startup Latency

When ECS launches a new Fargate task, it must:

  1. Attach an ENI to the task (~5-20s); the image pull needs this network interface
  2. Pull the container image from ECR (~10-30s)
  3. Register the task with the target group and pass its health checks (~30s)

During scale-out events, new tasks aren’t ready to serve traffic for 1-3 minutes. During that window, the ALB routes more traffic to existing tasks, increasing their latency.
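A rough back-of-the-envelope sketch of that window, using hypothetical traffic numbers (900 and 1500 requests per second are assumptions chosen for illustration): until the new tasks are ready, the existing tasks absorb the whole increase.

```python
def per_task_rps(total_rps: float, ready_tasks: int) -> float:
    """Requests per second each ready task must handle under even load balancing."""
    if ready_tasks < 1:
        raise ValueError("need at least one ready task")
    return total_rps / ready_tasks


# Hypothetical scenario: traffic jumps from 900 to 1500 requests/s.
before = per_task_rps(900, 3)            # steady state with 3 tasks
during_window = per_task_rps(1500, 3)    # new tasks still starting
after_scale_out = per_task_rps(1500, 6)  # 3 extra tasks now healthy

print(before, during_window, after_scale_out)
```

The middle number is the one your users feel: per-task load rises by two thirds for the whole startup window, which is where the tail latency comes from.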

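Fix 1: Keep the minimum desired count higher so routine bursts don't trigger a scale-out (if the service uses Service Auto Scaling, raise its minimum capacity as well, or it will scale back in):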

aws ecs update-service \
  --cluster prod-cluster \
  --service api-gateway \
  --desired-count 6   # Was 3 — more headroom before scale triggers

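Fix 2: Use ECS Service Connect for calls between ECS services (its sidecar proxy handles service discovery and connection management, so the application does less DNS work per call):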

{
  "serviceConnectConfiguration": {
    "enabled": true,
    "namespace": "prod.local",
    "services": [{
      "portName": "api-gateway-port",
      "clientAliases": [{"port": 80, "dnsName": "api-gateway"}]
    }]
  }
}

Fix 3: Enable connection keep-alive in your application to reuse connections instead of creating new TCP connections per request.
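As an illustration of Fix 3, here is a minimal sketch using Python's standard `http.client`, which speaks HTTP/1.1 and keeps the connection open between requests. The helper name `fetch_all` is hypothetical; the point is that one connection object is created once and reused, so only the first request pays for the TCP (and DNS) setup.

```python
import http.client


def fetch_all(conn: http.client.HTTPConnection, paths: list[str]) -> list[int]:
    """Issue several GET requests over one persistent connection and return the status codes."""
    statuses = []
    for path in paths:
        conn.request("GET", path)
        response = conn.getresponse()
        # Drain the body; http.client refuses the next request until it is read.
        response.read()
        statuses.append(response.status)
    return statuses
```

You would call it with something like `fetch_all(http.client.HTTPConnection(host, 80, timeout=5), ["/health", "/items"])`, where `host` is your downstream service. Most higher-level HTTP clients offer the same behaviour through a session or connection pool object that you create once at startup.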

Step 4: DNS Caching in Fargate Tasks

Fargate tasks have no local DNS cache by default: every lookup goes to the VPC's Route 53 Resolver. If your application performs a new hostname lookup per request, it pays the DNS round trip every time.

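# Check if DNS resolution is slow from inside a running task
# (requires ECS Exec to be enabled on the service; time is usually a shell
#  keyword rather than a binary, so run it through a shell)
aws ecs execute-command \
  --cluster prod-cluster \
  --task <task-id> \
  --container api-gateway \
  --interactive \
  --command "sh -c 'time nslookup mydb.us-east-1.rds.amazonaws.com'"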

Fix: Enable DNS caching at the application level:

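# Python: use dnspython with a local cache
import dns.resolver

resolver = dns.resolver.get_default_resolver()
resolver.cache = dns.resolver.LRUCache(max_size=500)
# The cache only applies to lookups made through dnspython. To cover
# socket.getaddrinfo() (used by most HTTP and database clients), route the
# standard library's lookups through this resolver as well:
dns.resolver.override_system_resolver(resolver)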

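CoreDNS with NodeLocal DNSCache is a Kubernetes feature, so it only helps if part of the workload runs on EKS; for ECS tasks on Fargate, keep the cache in the application or its client library.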
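If you would rather not add a dependency, application-level caching can be sketched with the standard library alone. The class below is a hypothetical helper (`TTLHostCache` is not part of any library) that remembers each answer for a fixed time; the 30-second default is an assumption, and it should stay short for hostnames such as RDS endpoints whose address changes on failover.

```python
import socket
import time
from typing import Callable


class TTLHostCache:
    """Cache hostname -> IPv4 address lookups for a fixed number of seconds."""

    def __init__(
        self,
        ttl_seconds: float = 30.0,
        lookup: Callable[[str], str] = socket.gethostbyname,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._lookup = lookup
        self._clock = clock
        self._entries: dict[str, tuple[str, float]] = {}

    def resolve(self, host: str) -> str:
        now = self._clock()
        entry = self._entries.get(host)
        if entry is not None and entry[1] > now:
            return entry[0]  # still fresh: no DNS round trip
        address = self._lookup(host)
        self._entries[host] = (address, now + self._ttl)
        return address
```

The `lookup` and `clock` parameters exist only so the cache can be exercised without real DNS; in production you would leave the defaults. Connection keep-alive (Fix 3 in Step 3) reduces lookups further, because a reused connection never resolves the name again.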

Step 5: Rightsize the Fargate Task

Fargate network bandwidth scales with task size; the figures in this step are approximate rather than documented guarantees. A 0.25 vCPU task has very limited network throughput. If your task makes many outbound calls (microservice fan-out), the network can become the bottleneck even when CPU and memory show low utilization.

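# Check network bytes in/out for the service's tasks
# (network metrics come from Container Insights, which must be enabled on the
#  cluster; the default AWS/ECS namespace only carries CPU and memory metrics.
#  NetworkRxBytes is reported in bytes per second; repeat with NetworkTxBytes.)
aws cloudwatch get-metric-statistics \
  --namespace ECS/ContainerInsights \
  --metric-name NetworkRxBytes \
  --dimensions Name=ServiceName,Value=api-gateway Name=ClusterName,Value=prod-cluster \
  --statistics Average \
  --period 60 \
  --start-time $(date -d '1 hour ago' -u +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ)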
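CloudWatch reports bytes while bandwidth figures are quoted in bits per second, so a small conversion helps when comparing the two. The sketch below assumes you have a byte count for a known period (for a datapoint already expressed in bytes per second, pass a period of 1); the function names and the 1000 Mbps limit in the example are illustrative assumptions.

```python
def throughput_mbps(byte_count: float, period_seconds: float) -> float:
    """Convert a byte count over a period into megabits per second."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return byte_count * 8 / period_seconds / 1_000_000


def utilization(rx_bytes: float, tx_bytes: float, period_seconds: float, limit_mbps: float) -> float:
    """Fraction of the assumed bandwidth limit used by the busier direction."""
    busiest = max(
        throughput_mbps(rx_bytes, period_seconds),
        throughput_mbps(tx_bytes, period_seconds),
    )
    return busiest / limit_mbps


# Hypothetical datapoint: 6 GB received and 1.5 GB sent in a 60-second period,
# compared against an assumed 1000 Mbps limit.
print(throughput_mbps(6_000_000_000, 60))
print(utilization(6_000_000_000, 1_500_000_000, 60, 1000))
```

Averages over 60 seconds hide short bursts, so treat sustained utilization well below 100% as a possible problem too.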

If network throughput is consistently near the limit for your task size, upgrade:

| Fargate CPU | Max Network Bandwidth (approximate) |
|---|---|
| 0.25 vCPU | Up to ~1 Gbps (shared) |
| 1 vCPU | Up to ~2.5 Gbps |
| 4 vCPU | Up to 10 Gbps |

Task definition change (cpu was 256; the larger size gives more network headroom):

{
  "cpu": "1024",
  "memory": "2048"
}
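Fargate only accepts certain CPU and memory pairings, so raising `cpu` may force a memory change too. The sketch below encodes the combinations as I understand them from the Fargate task size documentation; treat the table as an assumption to verify against the current docs, and note that `is_valid_fargate_size` is a hypothetical helper, not an AWS API.

```python
# CPU units -> allowed memory values in MiB (verify against current AWS docs).
VALID_MEMORY_MIB: dict[int, list[int]] = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
    8192: list(range(16384, 61440 + 1, 4096)),
    16384: list(range(32768, 122880 + 1, 8192)),
}


def is_valid_fargate_size(cpu: int, memory_mib: int) -> bool:
    """Return True if the cpu/memory pair is an allowed Fargate task size."""
    return memory_mib in VALID_MEMORY_MIB.get(cpu, [])


print(is_valid_fargate_size(1024, 2048))  # the pairing used above
print(is_valid_fargate_size(256, 4096))   # too much memory for 0.25 vCPU
```

Running a check like this in CI catches an invalid pairing before `register-task-definition` rejects it.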

Root Cause Summary

| Root Cause | Symptom | Fix |
|---|---|---|
| Missing VPC endpoints | Slow S3/DynamoDB/Secrets Manager calls | Create Interface/Gateway endpoints |
| ENI attachment on scale-out | Latency spikes during scale events | Increase minimum desired count |
| DNS per-request lookups | High p99 on first call per host | Application-level DNS caching |
| Low CPU → low network bandwidth | Network saturation | Increase task CPU |
| No HTTP keep-alive | New TCP handshake per request | Enable keep-alive in app config |

Interview Angle
The key insight: p50 stays flat but p99 spikes. This pattern points to cold events (new connections, new ENIs, DNS misses) not steady-state load. Interviewers want to hear that you’d check X-Ray first to pinpoint the slow segment before guessing at a fix.
Services Used
ECS Fargate, X-Ray, CloudWatch, VPC Endpoints, ALB, Route 53
Prerequisites
  • Familiarity with ECS task definitions and Fargate networking
  • Basic understanding of AWS X-Ray distributed tracing
What You Learned
  • Why p99 latency spikes on Fargate while p50 stays flat
  • How Fargate's awsvpc networking mode differs from EC2 networking
  • Why missing VPC endpoints cause latency for calls to AWS services
  • How ENI attachment delays cause cold-start latency spikes
  • How to use X-Ray to pinpoint the exact segment causing slow requests
