AWS Bill Jumped 40% Last Month — Investigate and Reduce Costs
Walk through a systematic cost investigation using Cost Explorer, Compute Optimizer, and Trusted Advisor. Implement savings via rightsizing, Spot, Savings Plans, and FinOps practices.
Your VP of Engineering drops a Slack message on Monday morning: 'Our AWS bill hit $420K last month — up from $300K. Finance wants an explanation by end of week and a plan to bring it under $300K within 30 days. Find it.' You have access to Cost Explorer, CloudWatch, and Trusted Advisor. Where do you start?
The Problem
A 40% cost spike rarely comes from one source. It’s usually a combination of: a forgotten service that kept running, an unexpected data transfer charge, a scaling event that never scaled back down, and a lack of visibility that meant nobody noticed for 30 days.
The investigation process is the same as debugging an application: reproduce the symptom, isolate the cause, fix it, and add monitoring so it can’t sneak up on you again.
Step 1: Identify the Source With Cost Explorer
# Get cost by service — find which service spiked
aws ce get-cost-and-usage \
--time-period Start=2024-01-01,End=2024-02-01 \
--granularity MONTHLY \
--metrics BlendedCost \
--group-by Type=DIMENSION,Key=SERVICE \
--query 'reverse(sort_by(ResultsByTime[0].Groups, &to_number(Metrics.BlendedCost.Amount)))[*].{Service:Keys[0],Cost:Metrics.BlendedCost.Amount}' \
--output table
# Compare current month to last month by service
aws ce get-cost-and-usage \
--time-period Start=2023-12-01,End=2024-02-01 \
--granularity MONTHLY \
--metrics BlendedCost \
--group-by Type=DIMENSION,Key=SERVICE
# Check AWS Cost Anomaly Detection (finds statistical outliers automatically)
aws ce get-anomalies \
--date-interval StartDate=2024-01-01,EndDate=2024-02-01 \
--query 'Anomalies[*].{Score:AnomalyScore.MaxScore,Impact:Impact.TotalImpact,Root:RootCauses}'
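The two-month comparison above returns raw JSON, so the per-service change still has to be worked out by hand. A minimal sketch of that step, assuming the response has been saved as JSON in the shape `get-cost-and-usage` returns when grouped by `SERVICE`; the function name `service_deltas` and the sample figures are hypothetical:

```python
from typing import Dict, List, Tuple


def service_deltas(response: dict) -> List[Tuple[str, float, float, float]]:
    """Return (service, previous, current, delta) rows, largest increase first."""
    periods = response["ResultsByTime"]
    if len(periods) < 2:
        raise ValueError("need at least two monthly periods to compare")

    def costs(period: dict) -> Dict[str, float]:
        # Each group carries the service name in Keys[0] and the amount as a string.
        return {
            g["Keys"][0]: float(g["Metrics"]["BlendedCost"]["Amount"])
            for g in period["Groups"]
        }

    previous, current = costs(periods[-2]), costs(periods[-1])
    rows = []
    for svc in sorted(set(previous) | set(current)):
        prev, cur = previous.get(svc, 0.0), current.get(svc, 0.0)
        rows.append((svc, prev, cur, cur - prev))
    return sorted(rows, key=lambda r: r[3], reverse=True)


def _group(service: str, amount: str) -> dict:
    return {"Keys": [service], "Metrics": {"BlendedCost": {"Amount": amount, "Unit": "USD"}}}


# Hypothetical sample data; the amounts are made up for illustration.
sample = {
    "ResultsByTime": [
        {"Groups": [_group("Amazon EC2", "100.0"), _group("NAT Gateway", "10.0")]},
        {"Groups": [_group("Amazon EC2", "130.0"), _group("NAT Gateway", "45.0"),
                    _group("CloudWatch", "5.0")]},
    ]
}

for svc, prev, cur, delta in service_deltas(sample):
    print(f"{svc:<15} {prev:>10.2f} {cur:>10.2f} {delta:>+10.2f}")
```

Services that only appear in the newer month are treated as starting from zero, which is usually how a forgotten or newly launched service shows up.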
Common culprits found in spikes:
| Service | Typical Spike Cause |
|---|---|
| EC2 | Scale-out event that never scaled back; wrong instance type launched |
| NAT Gateway | Data processing jobs routing AWS API calls through NAT instead of VPC endpoints |
| RDS | Read replicas left running after a load test |
| Data Transfer | New service transferring data cross-region or to internet without CloudFront |
| S3 | Request costs from unoptimized application making thousands of HEAD requests |
| CloudWatch | Excessive log ingestion from a verbose debug logging mode left on |
Step 2: Quick Wins — Fix the Immediate Problem
Remove unattached EBS volumes:
aws ec2 describe-volumes \
--filters Name=status,Values=available \
--query 'Volumes[*].{VolumeId:VolumeId,Size:Size,AZ:AvailabilityZone,Created:CreateTime}' \
--output table
# After verifying they're truly unused (deletion is irreversible, so snapshot first if unsure):
for vol in $(aws ec2 describe-volumes --filters Name=status,Values=available \
    --query 'Volumes[*].VolumeId' --output text); do
  aws ec2 delete-volume --volume-id "$vol"
done
Find idle RDS instances:
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name CPUUtilization \
--dimensions Name=DBInstanceIdentifier,Value=my-db \
--statistics Average \
--period 86400 \
--start-time 2024-01-01T00:00:00Z \
--end-time 2024-02-01T00:00:00Z \
--query 'Datapoints[*].Average'
# If average CPU < 5% for 30 days → consider Aurora Serverless v2 or deletion
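The 5% rule above can be applied to the list of daily averages that the query prints. A small sketch, assuming the datapoints have been loaded into a Python list; `is_idle` and its `min_days` guard are illustrative choices, not part of any AWS API:

```python
from statistics import mean
from typing import Sequence


def is_idle(daily_avg_cpu: Sequence[float],
            threshold_pct: float = 5.0,
            min_days: int = 30) -> bool:
    """True when there is enough history and the mean daily CPU is under the threshold."""
    if len(daily_avg_cpu) < min_days:
        # Too little data to call an instance idle; err on the side of keeping it.
        return False
    return mean(daily_avg_cpu) < threshold_pct


# Hypothetical datapoints: 31 quiet days versus a month with one busy day.
quiet_month = [2.0] * 31
busy_day = [2.0] * 30 + [200.0]
print(is_idle(quiet_month), is_idle(busy_day))
```

CPU alone can mislead for databases that sit mostly on I/O, so it is probably worth checking the `DatabaseConnections` metric the same way before deleting anything.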
Eliminate NAT Gateway data transfer costs:
# Check NAT Gateway data transfer cost
aws ce get-cost-and-usage \
--time-period Start=2024-01-01,End=2024-02-01 \
--granularity MONTHLY \
--filter '{"Dimensions":{"Key":"USAGE_TYPE","Values":["USE1-NatGateway-Bytes"]}}' \
--metrics BlendedCost
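# Create S3 and DynamoDB gateway endpoints so that traffic bypasses the NAT Gateway
# (vpc-0abc123 and rtb-private are placeholders for your own VPC and route table IDs)
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc123 \
  --service-name com.amazonaws.us-east-1.s3 \
  --vpc-endpoint-type Gateway \
  --route-table-ids rtb-private
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc123 \
  --service-name com.amazonaws.us-east-1.dynamodb \
  --vpc-endpoint-type Gateway \
  --route-table-ids rtb-private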
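To judge whether the endpoints are worth the change, a rough estimate of the avoidable NAT processing charge helps. This is a sketch under stated assumptions: the per-GB rate is the us-east-1 list price as I understand it and should be checked against current pricing, and the traffic volume and S3/DynamoDB share are made-up inputs:

```python
# Assumed NAT Gateway data processing rate for us-east-1; verify before relying on it.
NAT_PROCESSING_USD_PER_GB = 0.045


def nat_processing_cost(gb_processed: float,
                        usd_per_gb: float = NAT_PROCESSING_USD_PER_GB) -> float:
    """Data processing charge for traffic that passes through a NAT Gateway."""
    return gb_processed * usd_per_gb


def gateway_endpoint_savings(total_gb: float, s3_dynamodb_share: float) -> float:
    """Charge avoided if the S3/DynamoDB share of NAT traffic moves to gateway endpoints."""
    if not 0.0 <= s3_dynamodb_share <= 1.0:
        raise ValueError("share must be between 0 and 1")
    return nat_processing_cost(total_gb * s3_dynamodb_share)


# Hypothetical month: 100 TB through NAT, 70% of it bound for S3 or DynamoDB.
saved = gateway_endpoint_savings(total_gb=100_000, s3_dynamodb_share=0.7)
print(f"estimated monthly saving: ${saved:,.0f}")
```

Gateway endpoints for S3 and DynamoDB carry no hourly or per-GB charge, which is why the whole processing fee for that share counts as a saving here.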
Step 3: Rightsize EC2 With Compute Optimizer
# Enable Compute Optimizer (free service)
aws compute-optimizer update-enrollment-status --status Active
# Get EC2 rightsizing recommendations after 14 days of data
aws compute-optimizer get-ec2-instance-recommendations \
--query 'instanceRecommendations[*].{
Instance:instanceArn,
Finding:finding,
CurrentType:currentInstanceType,
RecommendedType:recommendationOptions[0].instanceType,
Savings:recommendationOptions[0].savingsOpportunity.estimatedMonthlySavings.value
}' \
--output table
Typical Compute Optimizer findings:
| Finding | Meaning | Action |
|---|---|---|
| OVER_PROVISIONED | Instance uses < 40% CPU consistently | Downsize or switch to Graviton |
| UNDER_PROVISIONED | Instance is CPU-throttled | Upsize to prevent performance degradation |
| OPTIMIZED | Right-sized | No action needed |
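With more than a handful of instances, it helps to rank the over-provisioned ones by estimated saving and start at the top. A minimal sketch, assuming the recommendations were saved as JSON and that each option carries its saving under `savingsOpportunity.estimatedMonthlySavings.value`; the helper name `top_savings`, the sample ARNs, and the figures are hypothetical:

```python
from typing import List, Tuple


def top_savings(response: dict, limit: int = 10) -> List[Tuple[str, str, str, float]]:
    """Return (arn, current type, recommended type, monthly saving), largest saving first."""
    rows = []
    for rec in response.get("instanceRecommendations", []):
        # Normalise the finding so both OVER_PROVISIONED and Overprovisioned match.
        finding = rec.get("finding", "").replace("_", "").upper()
        options = rec.get("recommendationOptions") or []
        if finding != "OVERPROVISIONED" or not options:
            continue
        best = options[0]
        saving = (best.get("savingsOpportunity", {})
                      .get("estimatedMonthlySavings", {})
                      .get("value", 0.0))
        rows.append((rec["instanceArn"], rec["currentInstanceType"],
                     best["instanceType"], float(saving)))
    return sorted(rows, key=lambda r: r[3], reverse=True)[:limit]


def _rec(arn: str, finding: str, current: str, target: str, saving: float) -> dict:
    return {
        "instanceArn": arn, "finding": finding, "currentInstanceType": current,
        "recommendationOptions": [{
            "instanceType": target,
            "savingsOpportunity": {"estimatedMonthlySavings": {"currency": "USD", "value": saving}},
        }],
    }


# Hypothetical recommendations; ARNs and savings are made up.
sample = {"instanceRecommendations": [
    _rec("arn:aws:ec2:us-east-1:123456789012:instance/i-aaa", "OVER_PROVISIONED", "m5.4xlarge", "m5.2xlarge", 120.5),
    _rec("arn:aws:ec2:us-east-1:123456789012:instance/i-bbb", "OVER_PROVISIONED", "r5.8xlarge", "r5.4xlarge", 300.0),
    _rec("arn:aws:ec2:us-east-1:123456789012:instance/i-ccc", "OPTIMIZED", "t3.medium", "t3.medium", 0.0),
]}

for arn, current, target, saving in top_savings(sample):
    print(f"{arn.rsplit('/', 1)[-1]}: {current} -> {target} (${saving:,.2f}/month)")
```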
Step 4: Automate Dev Resource Shutdown
One of the fastest ways to cut costs: stop dev/staging instances outside working hours.
import boto3

def stop_dev_instances(event, context):
    ec2 = boto3.client('ec2')
    # Find all running dev/staging EC2 instances; paginate so large fleets are not truncated
    paginator = ec2.get_paginator('describe_instances')
    pages = paginator.paginate(
        Filters=[
            {'Name': 'tag:Environment', 'Values': ['dev', 'staging']},
            {'Name': 'instance-state-name', 'Values': ['running']}
        ]
    )
    instance_ids = [
        i['InstanceId']
        for page in pages
        for r in page['Reservations']
        for i in r['Instances']
        # Skip instances tagged to always run
        if not any(t['Key'] == 'AlwaysOn' and t['Value'] == 'true' for t in i.get('Tags', []))
    ]
    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)
        print(f"Stopped {len(instance_ids)} instances: {instance_ids}")
    return {'stopped': len(instance_ids)}
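Schedule with EventBridge (cron expressions are evaluated in UTC, and each rule still needs the Lambda function attached as a target with aws events put-targets):
# Stop at 7 PM weekdays (19:00 UTC; shift the hour to match your local time zone)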
aws events put-rule \
--name "StopDevInstances" \
--schedule-expression "cron(0 19 ? * MON-FRI *)" \
--state ENABLED
# Start at 8 AM weekdays
aws events put-rule \
--name "StartDevInstances" \
--schedule-expression "cron(0 8 ? * MON-FRI *)" \
--state ENABLED
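The StartDevInstances rule needs a function to invoke, and only the stop side is shown above. A possible counterpart, sketched under assumptions: the name `start_dev_instances` and the optional `ec2` argument are hypothetical, and the argument exists only so the function can be exercised with a stub client instead of real AWS credentials:

```python
def start_dev_instances(event, context, ec2=None):
    """Start every stopped dev/staging instance; mirrors the stop function's tag filter."""
    if ec2 is None:
        import boto3  # imported lazily so a stub client can be passed in tests
        ec2 = boto3.client("ec2")

    paginator = ec2.get_paginator("describe_instances")
    pages = paginator.paginate(
        Filters=[
            {"Name": "tag:Environment", "Values": ["dev", "staging"]},
            {"Name": "instance-state-name", "Values": ["stopped"]},
        ]
    )
    instance_ids = [
        inst["InstanceId"]
        for page in pages
        for reservation in page["Reservations"]
        for inst in reservation["Instances"]
    ]
    if instance_ids:
        ec2.start_instances(InstanceIds=instance_ids)
        print(f"Started {len(instance_ids)} instances: {instance_ids}")
    return {"started": len(instance_ids)}
```

One caveat with this approach: it also starts instances that someone stopped by hand, so a tag marking deliberate shutdowns may be needed in practice.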
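Estimated savings: running dev instances only 8 AM–7 PM on weekdays means 55 of 168 hours per week, roughly a 67% reduction in compute-hours on those instances. Attached EBS volumes keep billing while the instances are stopped.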
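The savings figure follows directly from the schedule, so it is easy to recompute for other working hours. A small sketch using the 8 AM start and 7 PM stop from the EventBridge rules; `running_fraction` is a hypothetical helper name:

```python
HOURS_PER_WEEK = 24 * 7


def running_fraction(start_hour: int, stop_hour: int, days_per_week: int = 5) -> float:
    """Share of the week an instance runs when started and stopped on a fixed schedule."""
    if not 0 <= start_hour < stop_hour <= 24:
        raise ValueError("expected 0 <= start_hour < stop_hour <= 24")
    return (stop_hour - start_hour) * days_per_week / HOURS_PER_WEEK


fraction = running_fraction(start_hour=8, stop_hour=19)
print(f"running {fraction:.0%} of the week, saving {1 - fraction:.0%} of instance-hours")
# prints: running 33% of the week, saving 67% of instance-hours
```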
Step 5: Commit for Savings — Savings Plans vs Spot
Savings Plans (an hourly spend commitment, not a commitment to a specific instance type):
# Get Savings Plans recommendations based on your last 60 days of usage
# COMPUTE_SP is the most flexible type: it covers EC2, Fargate, and Lambda
aws ce get-savings-plans-purchase-recommendation \
--savings-plans-type COMPUTE_SP \
--term-in-years ONE_YEAR \
--payment-option NO_UPFRONT \
--lookback-period-in-days SIXTY_DAYS \
--query 'SavingsPlansPurchaseRecommendation.SavingsPlansPurchaseRecommendationDetails[*].{
Commitment:HourlyCommitmentToPurchase,
Savings:EstimatedSavingsAmount,
ROI:EstimatedROI
}'
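Because the commitment is billed every hour whether or not it is used, the saving depends on how usage compares with what the commitment covers. A simplified model of that trade-off, assuming a single flat discount rate across all usage; the 20% rate and the dollar figures are hypothetical, and real Savings Plans rates vary by instance family, region, and term:

```python
def hourly_net_savings(commitment: float, discount: float, on_demand_usage: float) -> float:
    """Net hourly saving versus paying on-demand for everything.

    commitment      -- committed spend in USD per hour
    discount        -- assumed flat discount versus on-demand, e.g. 0.2 for 20%
    on_demand_usage -- what the hour's usage would cost at on-demand rates
    """
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    # The commitment pays for this much on-demand-equivalent usage each hour.
    coverable = commitment / (1.0 - discount)
    covered = min(on_demand_usage, coverable)
    # Covered usage costs the full commitment even when part of it goes unused.
    return covered - commitment


for usage in (20.0, 10.0, 6.0):
    saving = hourly_net_savings(commitment=10.0, discount=0.2, on_demand_usage=usage)
    print(f"on-demand usage ${usage:>5.2f}/h -> net saving ${saving:+.2f}/h")
```

The negative result at low usage is the reason to rightsize and shut down idle resources first, then commit against the smaller baseline.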
Spot Instances for stateless and fault-tolerant workloads (up to 90% discount):
resource "aws_autoscaling_group" "batch_workers" {
  # Required arguments such as min_size, max_size, and vpc_zone_identifier are omitted for brevity
  mixed_instances_policy {
    instances_distribution {
      on_demand_base_capacity                  = 1 # Always 1 on-demand
      on_demand_percentage_above_base_capacity = 0 # Rest all Spot
      spot_allocation_strategy                 = "price-capacity-optimized"
    }
    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.worker.id
      }
      # Diversify across instance types to maximize Spot availability
      override {
        instance_type = "m5.xlarge"
      }
      override {
        instance_type = "m5a.xlarge"
      }
      override {
        instance_type = "m6i.xlarge"
      }
    }
  }
}
Step 6: FinOps Practice — Prevent the Next Spike
# Set per-team budget alerts
aws budgets create-budget \
--account-id 123456789012 \
--budget '{
  "BudgetName": "team-payments-monthly",
  "BudgetLimit": {"Amount": "8000", "Unit": "USD"},
  "BudgetType": "COST",
  "TimeUnit": "MONTHLY",
  "CostFilters": {
    "TagKeyValue": ["user:team$payments"]
  }
}' \
--notifications-with-subscribers '[{
  "Notification": {
    "NotificationType": "ACTUAL",
    "ComparisonOperator": "GREATER_THAN",
    "Threshold": 80,
    "ThresholdType": "PERCENTAGE"
  },
  "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "payments-team@example.com"}]
}]'
Cost visibility checklist:
- Every resource tagged with team, environment, service
- Per-team budgets with 80% email alerts
- Weekly cost review in team standups (show top 5 cost changes)
- Compute Optimizer recommendations reviewed monthly
- Dev instances shut down on nights and weekends
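The first checklist item is the one that tends to drift, so it is worth checking mechanically. A sketch of a tag audit, assuming resources arrive as dicts with `ResourceARN` and a `Tags` list of `Key`/`Value` pairs (the shape the Resource Groups Tagging API uses); `missing_tags` and the sample ARNs are hypothetical:

```python
from typing import Dict, Iterable, List

REQUIRED_TAGS = ("team", "environment", "service")


def missing_tags(resources: Iterable[dict],
                 required: Iterable[str] = REQUIRED_TAGS) -> Dict[str, List[str]]:
    """Map each resource ARN to the required tag keys it lacks or leaves empty."""
    report = {}
    for res in resources:
        # Tag keys are case-sensitive, so "Team" does not satisfy "team".
        present = {t["Key"] for t in res.get("Tags", []) if t.get("Value")}
        missing = [key for key in required if key not in present]
        if missing:
            report[res["ResourceARN"]] = missing
    return report


# Hypothetical inventory; ARNs and tag values are made up.
inventory = [
    {"ResourceARN": "arn:aws:ec2:us-east-1:123456789012:instance/i-aaa",
     "Tags": [{"Key": "team", "Value": "payments"},
              {"Key": "environment", "Value": "prod"},
              {"Key": "service", "Value": "checkout"}]},
    {"ResourceARN": "arn:aws:ec2:us-east-1:123456789012:volume/vol-bbb",
     "Tags": [{"Key": "team", "Value": "payments"}]},
    {"ResourceARN": "arn:aws:s3:::untagged-bucket", "Tags": []},
]

for arn, missing in missing_tags(inventory).items():
    print(f"{arn}: missing {', '.join(missing)}")
```

Untagged spend cannot be attributed to a team budget, so a report like this could feed the weekly cost review.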
30-Day Cost Reduction Plan
| Week | Action | Estimated Saving |
|---|---|---|
| Week 1 | Stop dev instances nights/weekends, delete unattached EBS | ~$30K |
| Week 1 | Add VPC endpoints (eliminate NAT Gateway excess) | ~$15K |
| Week 2 | Rightsize top 10 overprovisioned EC2 instances | ~$20K |
| Week 3 | Convert eligible batch workloads to Spot | ~$25K |
| Week 4 | Purchase 1-year Compute Savings Plan based on new baseline | ~$30K |
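Adding up the estimates shows how the plan compares with the target from the scenario. A quick check using the figures from the table and the $420K bill quoted at the top:

```python
CURRENT_BILL = 420_000
TARGET_BILL = 300_000

# (week, action, estimated monthly saving in USD), taken from the plan table.
plan = [
    ("Week 1", "Stop dev instances nights/weekends, delete unattached EBS", 30_000),
    ("Week 1", "Add VPC endpoints", 15_000),
    ("Week 2", "Rightsize top 10 overprovisioned EC2 instances", 20_000),
    ("Week 3", "Convert eligible batch workloads to Spot", 25_000),
    ("Week 4", "Purchase 1-year Compute Savings Plan", 30_000),
]

total_saving = sum(saving for _, _, saving in plan)
projected_bill = CURRENT_BILL - total_saving
print(f"total saving ${total_saving:,}, projected bill ${projected_bill:,}")
# prints: total saving $120,000, projected bill $300,000
print("margin against target:", TARGET_BILL - projected_bill)
# prints: margin against target: 0
```

The estimates land exactly on $300K with no margin, so any item that underdelivers leaves the bill above the target; it may be worth lining up one more saving as a buffer.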
- How to use Cost Explorer and anomaly detection to pinpoint cost spikes
- The most common AWS cost culprits and their fixes
- How to implement a Lambda-based cost automation (stop dev instances at night)
- The difference between Spot, Reserved Instances, and Savings Plans
- How to build a FinOps practice with tagging and team budgets