Application Latency Spiked After Migrating EC2 to ECS Fargate

The Problem

Latency regressions after migrating to Fargate are usually caused by networking changes rather than application code. Fargate uses the awsvpc network mode, in which each task gets its own Elastic Network Interface (ENI) that is provisioned when the task starts. On EC2, the application uses the instance ENI, which is attached once at instance launch and stays in place.

The p99 spikes you’re seeing are a classic pattern: most requests are fast, but occasionally something (ENI creation during scale-out, a DNS lookup, a call that misses a VPC endpoint) adds hundreds of milliseconds to a small percentage of requests.
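To make the pattern concrete, here is a minimal sketch of how a small share of slow requests moves p99 while leaving p50 untouched. The latency values (20 ms fast, 400 ms slow) and the helper names are illustrative assumptions, not measurements from a real service.

```python
import math


def percentile(samples, pct):
    """Return the pct-th percentile of samples using the nearest-rank method."""
    ordered = sorted(samples)
    # Nearest rank: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


def build_samples(total, slow_count, fast_ms=20.0, slow_ms=400.0):
    """Build a synthetic latency set: mostly fast requests plus a few slow ones."""
    return [fast_ms] * (total - slow_count) + [slow_ms] * slow_count


if __name__ == "__main__":
    # Hypothetical workload: 1000 requests, 2% of them hit a cold path.
    samples = build_samples(total=1000, slow_count=20)
    print("p50:", percentile(samples, 50))
    print("p99:", percentile(samples, 99))
```

With 2% of requests on the slow path the median does not move at all, but p99 lands on the slow value. Dropping the slow share below 1% pulls p99 back to the fast value, which is why fixing one rare cold path can remove the spike entirely.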

Step 1: Use X-Ray to Find the Slow Segment

Enable X-Ray tracing in your ECS task definition and instrument your application:

{
  "containerDefinitions": [
    {
      "name": "api-gateway",
      "image": "...",
      "environment": [
        {"name": "AWS_XRAY_DAEMON_ADDRESS", "value": "127.0.0.1:2000"}
      ]
    },
    {
      "name": "xray-daemon",
      "image": "amazon/aws-xray-daemon",
      "portMappings": [{"containerPort": 2000, "protocol": "udp"}]
    }
  ]
}

# List services whose response-time histogram has entries above 200 ms
# (histogram values are reported in seconds)
aws xray get-service-graph \
  --start-time $(date -d '1 hour ago' +%s) \
  --end-time $(date +%s) \
  --query 'Services[?ResponseTimeHistogram[?Value > `0.2`]].Name'

# Find traces with high latency (response time above 300 ms)
aws xray get-trace-summaries \
  --filter-expression 'responsetime > 0.3' \
  --start-time $(date -d '1 hour ago' +%s) \
  --end-time $(date +%s)

The X-Ray service map shows which downstream call (S3, DynamoDB, RDS, another service) is slow, provided those calls are instrumented.
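Once a slow trace is identified, its segment documents can be ranked by subsegment duration. The sketch below assumes a segment document has already been parsed from JSON (X-Ray returns each segment's document as a JSON string) and uses the documented `name`, `start_time`, `end_time`, and `subsegments` fields. The sample document and function names are hypothetical.

```python
def flatten_subsegments(segment, path=()):
    """Yield (path, duration_seconds) for every finished subsegment, depth-first."""
    for sub in segment.get("subsegments", []):
        sub_path = path + (sub.get("name", "?"),)
        # In-progress subsegments have no end_time yet, so skip their duration.
        if "start_time" in sub and "end_time" in sub:
            yield " > ".join(sub_path), sub["end_time"] - sub["start_time"]
        yield from flatten_subsegments(sub, sub_path)


def slowest_subsegments(segment, top=3):
    """Return the `top` slowest subsegments as (path, duration_seconds) pairs."""
    ranked = sorted(flatten_subsegments(segment), key=lambda item: item[1], reverse=True)
    return ranked[:top]


if __name__ == "__main__":
    # Hypothetical segment document, shaped like X-Ray's but with made-up timings.
    segment = {
        "name": "api-gateway",
        "start_time": 0.0,
        "end_time": 0.5,
        "subsegments": [
            {"name": "DynamoDB", "start_time": 0.0, "end_time": 0.02},
            {
                "name": "S3",
                "start_time": 0.05,
                "end_time": 0.45,
                "subsegments": [
                    {"name": "connect", "start_time": 0.05, "end_time": 0.35},
                ],
            },
        ],
    }
    for path, seconds in slowest_subsegments(segment):
        print(f"{path}: {seconds * 1000:.0f} ms")
```

Ranking nested subsegments by path makes it easier to see whether the time goes to the remote call itself or to connection setup underneath it.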

Step 2: Check for Missing VPC Endpoints

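VPC endpoints are configured per VPC, and gateway endpoints (S3, DynamoDB) are additionally tied to specific route tables. If the Fargate tasks run in different subnets than the old EC2 instances did, or in a new VPC, calls to AWS services (S3, DynamoDB, Secrets Manager, etc.) may be routed through a NAT Gateway instead of an endpoint. With awsvpc, image pulls and secret retrieval also go through the task ENI, so that traffic follows the same path.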

Symptom: Occasional latency spikes on calls to S3, DynamoDB, or other AWS services.

# Check if VPC endpoints exist for services you're calling
aws ec2 describe-vpc-endpoints \
  --filters "Name=vpc-id,Values=vpc-0abc123" \
  --query 'VpcEndpoints[*].{Service:ServiceName,State:State}'

If s3, dynamodb, secretsmanager, ecr.api, and ecr.dkr all appear in the list in an available state, they’re covered. If any are missing, that traffic goes via the NAT Gateway, which can add roughly 5-30 ms per call.
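The comparison can be automated against the JSON that the describe-vpc-endpoints query prints. This is a minimal sketch: the required-service list mirrors the one named in the text, and the function name and sample data are assumptions for illustration.

```python
REQUIRED_SERVICES = ("s3", "dynamodb", "secretsmanager", "ecr.api", "ecr.dkr")


def missing_endpoints(endpoints, required=REQUIRED_SERVICES):
    """Return required services that have no available VPC endpoint.

    endpoints: list of {"Service": ..., "State": ...} dicts, as produced by
    the --query expression used with describe-vpc-endpoints.
    """
    covered = set()
    for endpoint in endpoints:
        if endpoint.get("State", "").lower() != "available":
            continue  # pending or failed endpoints do not carry traffic
        # Service names look like com.amazonaws.<region>.<service>;
        # keep everything after the region (e.g. "ecr.api").
        parts = endpoint["Service"].split(".", 3)
        if len(parts) == 4:
            covered.add(parts[3])
    return [service for service in required if service not in covered]


if __name__ == "__main__":
    # Hypothetical CLI output for a VPC with only two usable endpoints.
    sample = [
        {"Service": "com.amazonaws.us-east-1.s3", "State": "available"},
        {"Service": "com.amazonaws.us-east-1.ecr.api", "State": "available"},
        {"Service": "com.amazonaws.us-east-1.dynamodb", "State": "pending"},
    ]
    print("Missing:", missing_endpoints(sample))
```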

# Create missing VPC endpoints (interface type for most services)
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc123 \
  --service-name com.amazonaws.us-east-1.secretsmanager \
  --vpc-endpoint-type Interface \
  --subnet-ids subnet-private-az1 subnet-private-az2 \
  --security-group-ids sg-allow-https

# Gateway endpoint for S3 and DynamoDB (free, no per-hour charge)
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc123 \
  --service-name com.amazonaws.us-east-1.s3 \
  --vpc-endpoint-type Gateway \
  --route-table-ids rtb-private

Step 3: ENI Cold Start — Fargate Task Startup Latency

When ECS launches a new Fargate task, it must:

1. Provision and attach an ENI to the task (~5-20s)
2. Pull the container image from ECR (~10-30s)
3. Register the task with the target group and pass health checks (~30s)

During scale-out events, new tasks aren’t ready to serve traffic for 1-3 minutes. During that window, the ALB routes more traffic to existing tasks, increasing their latency.
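A quick back-of-the-envelope calculation shows why that window hurts. The sketch below assumes traffic is spread evenly across ready tasks; the request rates and task counts are hypothetical.

```python
def per_task_rps(total_rps, ready_tasks):
    """Requests per second each ready task receives under even load balancing."""
    if ready_tasks <= 0:
        raise ValueError("at least one ready task is required")
    return total_rps / ready_tasks


def overload_factor(total_rps, ready_tasks, baseline_rps_per_task):
    """How far above its baseline load each ready task runs while new tasks warm up."""
    return per_task_rps(total_rps, ready_tasks) / baseline_rps_per_task


if __name__ == "__main__":
    # Hypothetical service: each task is sized for 100 requests per second.
    # Traffic jumps to 600 rps while only the original tasks are ready.
    print("3 ready tasks:", overload_factor(600, 3, 100))
    print("6 ready tasks:", overload_factor(600, 6, 100))
```

In this example three ready tasks run at twice their sized load until the new tasks pass health checks, while six ready tasks absorb the same jump at baseline.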

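Fix 1: Keep the minimum desired count higher to avoid scale-out. If Service Auto Scaling manages the service, raise its minimum capacity as well, otherwise scale-in will lower the count again: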

aws ecs update-service \
  --cluster prod-cluster \
  --service api-gateway \
  --desired-count 6   # Was 3 — more headroom before scale triggers

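Fix 2: Use ECS Service Connect for service-to-service calls, so clients reach peers through the Service Connect proxy instead of resolving service discovery DNS names themselves: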

{
  "serviceConnectConfiguration": {
    "enabled": true,
    "namespace": "prod.local",
    "services": [{
      "portName": "api-gateway-port",
      "clientAliases": [{"port": 80, "dnsName": "api-gateway"}]
    }]
  }
}

Fix 3: Enable connection keep-alive in your application to reuse connections instead of creating new TCP connections per request.
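The effect of keep-alive can be observed with a throwaway local server that counts accepted TCP connections. This is a self-contained sketch using only the Python standard library; the handler and function names are made up for the demonstration, and a real service would configure keep-alive in its own HTTP client or connection pool.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class CountingHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # HTTP/1.1 allows persistent connections
    connections = 0
    lock = threading.Lock()

    def setup(self):
        # setup() runs once per accepted TCP connection, not once per request.
        super().setup()
        with CountingHandler.lock:
            CountingHandler.connections += 1

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def count_connections(requests, reuse):
    """Send `requests` GETs to a local server and return the TCP connections opened."""
    CountingHandler.connections = 0
    server = ThreadingHTTPServer(("127.0.0.1", 0), CountingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    shared = http.client.HTTPConnection(host, port) if reuse else None
    try:
        for _ in range(requests):
            client = shared if reuse else http.client.HTTPConnection(host, port)
            client.request("GET", "/")
            client.getresponse().read()  # drain the body so the socket can be reused
            if not reuse:
                client.close()
    finally:
        if shared is not None:
            shared.close()
        server.shutdown()
        server.server_close()
    return CountingHandler.connections


if __name__ == "__main__":
    print("with keep-alive:   ", count_connections(5, reuse=True), "connection(s)")
    print("without keep-alive:", count_connections(5, reuse=False), "connection(s)")
```

Each extra connection costs a TCP handshake (plus a TLS handshake for HTTPS), which is the per-request overhead that connection reuse removes.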

Step 4: DNS Caching in Fargate Tasks

Fargate tasks do not run a local caching DNS resolver; lookups go to the VPC resolver (Route 53 Resolver). If your application performs a new hostname lookup per request, it pays the DNS round trip every time.

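# Check if DNS resolution is slow from inside a running task
# (requires ECS Exec to be enabled and nslookup to exist in the image;
# `time` is a shell keyword, so the command runs through sh -c)
aws ecs execute-command \
  --cluster prod-cluster \
  --task <task-id> \
  --container api-gateway \
  --interactive \
  --command "sh -c 'time nslookup mydb.us-east-1.rds.amazonaws.com'"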

Fix: Enable DNS caching at the application level:

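# Python: dnspython with an in-process LRU cache.
# This only covers lookups made through dnspython; libraries that call
# socket.getaddrinfo (for example requests or boto3) bypass this cache.
import dns.resolver
dns.resolver.get_default_resolver().cache = dns.resolver.LRUCache(max_size=500)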

CoreDNS with NodeLocal DNSCache is the equivalent on Kubernetes (EKS) nodes, but it is not available to ECS tasks, where the cache has to live in the application.
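For code paths that resolve names through the standard library, a small time-bounded cache is one way to implement application-level caching. The sketch below is an illustration under stated assumptions: the class name, the 30-second TTL, and the injectable `lookup` and `clock` parameters are all hypothetical choices, not part of any AWS or library API.

```python
import socket
import time


class TTLResolver:
    """Cache hostname lookups for a fixed number of seconds."""

    def __init__(self, ttl_seconds=30.0, lookup=None, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._lookup = lookup or self._system_lookup
        self._clock = clock
        self._cache = {}  # hostname -> (expires_at, addresses)

    @staticmethod
    def _system_lookup(hostname):
        infos = socket.getaddrinfo(hostname, None, type=socket.SOCK_STREAM)
        # Deduplicate addresses while preserving the resolver's order.
        return list(dict.fromkeys(info[4][0] for info in infos))

    def resolve(self, hostname):
        now = self._clock()
        entry = self._cache.get(hostname)
        if entry and entry[0] > now:
            return entry[1]  # still fresh: no network round trip
        addresses = self._lookup(hostname)
        self._cache[hostname] = (now + self._ttl, addresses)
        return addresses


if __name__ == "__main__":
    resolver = TTLResolver(ttl_seconds=30)
    print(resolver.resolve("localhost"))  # first call performs a real lookup
    print(resolver.resolve("localhost"))  # second call is served from the cache
```

Keep the TTL short for endpoints whose addresses can change, such as an RDS endpoint after a failover, so a stale entry expires quickly.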

Step 5: Rightsize the Fargate Task

Fargate network throughput tends to scale with task size, although AWS does not publish official per-size bandwidth figures. A 0.25 vCPU task has very limited network throughput. If your task makes many outbound calls (microservice fanout), the network can become the bottleneck even when CPU and memory show low utilization.

# Check network bytes received by the service's tasks
# (NetworkRxBytes is a Container Insights metric, so Container Insights
# must be enabled on the cluster; check NetworkTxBytes the same way)
aws cloudwatch get-metric-statistics \
  --namespace ECS/ContainerInsights \
  --metric-name NetworkRxBytes \
  --dimensions Name=ServiceName,Value=api-gateway Name=ClusterName,Value=prod-cluster \
  --statistics Sum \
  --period 60 \
  --start-time $(date -d '1 hour ago' -u +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ)
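Raw byte counts are hard to compare against a bandwidth figure, so it helps to convert them to megabits per second. The sketch below assumes each datapoint is a bytes-per-second rate, which is the unit Container Insights documents for this metric; pass `period_seconds` if your datapoints are byte totals per period instead. The bandwidth limit is an assumed number you supply, and the function names are hypothetical.

```python
def to_mbps(value, period_seconds=None):
    """Convert a datapoint to megabits per second.

    value is treated as a bytes-per-second rate, or as a byte total over
    period_seconds when period_seconds is given.
    """
    bytes_per_second = value / period_seconds if period_seconds else value
    return bytes_per_second * 8 / 1_000_000


def peak_utilization(datapoints, assumed_limit_mbps, period_seconds=None):
    """Return peak throughput as a fraction of an assumed bandwidth limit."""
    peak = max(to_mbps(value, period_seconds) for value in datapoints)
    return peak / assumed_limit_mbps


if __name__ == "__main__":
    # Hypothetical datapoints (bytes per second) against an assumed 1 Gbps limit.
    rates = [12_500_000, 100_000_000]
    print("peak Mbps:", max(to_mbps(rate) for rate in rates))
    print("utilization:", peak_utilization(rates, assumed_limit_mbps=1000))
```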

If network throughput is consistently near the limit for your task size, upgrade:

Fargate CPU	Approx. max network bandwidth (unofficial)
0.25 vCPU	Up to ~1 Gbps (shared)
1 vCPU	Up to ~2.5 Gbps
4 vCPU	Up to 10 Gbps

In the task definition, raise cpu from 256 to 1024 for more network headroom:

{
  "cpu": "1024",
  "memory": "2048"
}

Root Cause Summary

Root Cause	Symptom	Fix
Missing VPC endpoints	Slow S3/DynamoDB/Secrets Manager calls	Create Interface/Gateway endpoints
ENI attachment on scale-out	Latency spikes during scale events	Increase minimum desired count
DNS per-request lookups	High p99 on first call per host	Application-level DNS caching
Low CPU → low network bandwidth	Network saturation	Increase task CPU
No HTTP keep-alive	New TCP handshake per request	Enable keep-alive in app config

Interview Angle

The key insight: p50 stays flat but p99 spikes. This pattern points to cold events (new connections, new ENIs, DNS misses) rather than steady-state load. Interviewers want to hear that you’d check X-Ray first to pinpoint the slow segment before guessing at a fix.
