AWS Bill Jumped 40% Last Month — Investigate and Reduce Costs

The Problem

A 40% cost spike rarely comes from one source. It is usually some combination of a forgotten service that kept running, an unexpected data transfer charge, a scaling event that never scaled back down, and a lack of visibility that meant nobody noticed for 30 days.

The investigation process is the same as debugging an application: reproduce the symptom, isolate the cause, fix it, and add monitoring so it can’t sneak up on you again.

Step 1: Identify the Source With Cost Explorer

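# Get cost by service, highest first, to find which service spiked
# (sorting happens in the JMESPath query; piping table output through sort does not order the rows reliably)
aws ce get-cost-and-usage \
  --time-period Start=2024-01-01,End=2024-02-01 \
  --granularity MONTHLY \
  --metrics BlendedCost \
  --group-by Type=DIMENSION,Key=SERVICE \
  --query 'reverse(sort_by(ResultsByTime[0].Groups, &to_number(Metrics.BlendedCost.Amount)))[*].{Service:Keys[0],Cost:Metrics.BlendedCost.Amount}' \
  --output table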

# Compare current month to last month by service
aws ce get-cost-and-usage \
  --time-period Start=2023-12-01,End=2024-02-01 \
  --granularity MONTHLY \
  --metrics BlendedCost \
  --group-by Type=DIMENSION,Key=SERVICE

# Check AWS Cost Anomaly Detection (finds statistical outliers automatically;
# returns results only if an anomaly monitor was already configured for the period)
aws ce get-anomalies \
  --date-interval StartDate=2024-01-01,EndDate=2024-02-01 \
  --query 'Anomalies[*].{Score:AnomalyScore.MaxScore,Impact:Impact.TotalImpact,Root:RootCauses}'
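
The month-over-month comparison above returns raw JSON, and the spike is easier to see as a per-service delta. The following is a minimal sketch, assuming the response has exactly two monthly periods grouped by SERVICE; the function name `service_deltas` and the sample figures are hypothetical and only illustrate the response shape.

```python
from decimal import Decimal


def service_deltas(response, metric="BlendedCost"):
    """Return (service, previous, current, delta) tuples, largest increase first.

    `response` is assumed to follow the GetCostAndUsage shape with MONTHLY
    granularity, two periods, and a single SERVICE group-by.
    """
    previous_period, current_period = response["ResultsByTime"][:2]

    def costs(period):
        # Cost Explorer returns amounts as strings, so parse them as Decimal.
        return {
            group["Keys"][0]: Decimal(group["Metrics"][metric]["Amount"])
            for group in period["Groups"]
        }

    previous, current = costs(previous_period), costs(current_period)
    rows = []
    for service in set(previous) | set(current):
        before = previous.get(service, Decimal("0"))
        after = current.get(service, Decimal("0"))
        rows.append((service, before, after, after - before))
    return sorted(rows, key=lambda row: row[3], reverse=True)


def _group(service, amount):
    return {"Keys": [service], "Metrics": {"BlendedCost": {"Amount": amount, "Unit": "USD"}}}


# Hypothetical sample data, not real billing figures.
sample = {
    "ResultsByTime": [
        {"Groups": [
            _group("Amazon Elastic Compute Cloud - Compute", "1000.00"),
            _group("Amazon Relational Database Service", "400.00"),
        ]},
        {"Groups": [
            _group("Amazon Elastic Compute Cloud - Compute", "1100.00"),
            _group("Amazon Relational Database Service", "400.00"),
            _group("EC2 - Other", "300.00"),
        ]},
    ]
}

for service, before, after, delta in service_deltas(sample):
    print(f"{service}: {before} -> {after} ({delta:+})")
```

Services that appear only in the second month are treated as starting from zero, which is often where a forgotten or newly launched resource shows up.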

Common culprits found in spikes:

Service	Typical Spike Cause
EC2	Scale-out event that never scaled back; wrong instance type launched
NAT Gateway	Data processing jobs routing AWS API calls through NAT instead of VPC endpoints
RDS	Read replicas left running after a load test
Data Transfer	New service transferring data cross-region or to internet without CloudFront
S3	Request costs from unoptimized application making thousands of HEAD requests
CloudWatch	Excessive log ingestion from a verbose debug logging mode left on

Step 2: Quick Wins — Fix the Immediate Problem

Remove unattached EBS volumes:

aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query 'Volumes[*].{VolumeId:VolumeId,Size:Size,AZ:AvailabilityZone,Created:CreateTime}' \
  --output table

# After verifying they're truly unused (take a snapshot first if in doubt; deletion is irreversible):
for vol in $(aws ec2 describe-volumes --filters Name=status,Values=available \
  --query 'Volumes[*].VolumeId' --output text); do
  aws ec2 delete-volume --volume-id "$vol"
done
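
Before deleting anything, it can help to see which unattached volumes are old enough to be safe candidates and what they cost. This is a rough sketch, assuming volume dictionaries in the `describe-volumes` shape (`State`, `Size`, `CreateTime`); the 14-day age cutoff and the gp3 price constant are assumptions to adjust for your account and region, and creation time is only a proxy for how long a volume has been detached.

```python
from datetime import datetime, timedelta, timezone

# Assumption: gp3 list price in us-east-1; check your own region and volume types.
GP3_USD_PER_GB_MONTH = 0.08


def stale_volumes(volumes, min_age_days=14, now=None):
    """Return unattached volumes created at least `min_age_days` ago."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=min_age_days)
    return [
        v for v in volumes
        if v["State"] == "available" and v["CreateTime"] <= cutoff
    ]


def monthly_cost(volumes, usd_per_gb_month=GP3_USD_PER_GB_MONTH):
    """Estimate monthly storage cost from volume sizes (GiB)."""
    return sum(v["Size"] for v in volumes) * usd_per_gb_month


# Hypothetical sample data.
sample_volumes = [
    {"VolumeId": "vol-aaa", "State": "available", "Size": 500,
     "CreateTime": datetime(2023, 11, 1, tzinfo=timezone.utc)},
    {"VolumeId": "vol-bbb", "State": "available", "Size": 100,
     "CreateTime": datetime(2024, 1, 30, tzinfo=timezone.utc)},
    {"VolumeId": "vol-ccc", "State": "in-use", "Size": 200,
     "CreateTime": datetime(2023, 6, 1, tzinfo=timezone.utc)},
]

candidates = stale_volumes(sample_volumes, now=datetime(2024, 2, 1, tzinfo=timezone.utc))
print([v["VolumeId"] for v in candidates], round(monthly_cost(candidates), 2))
```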

Find idle RDS instances:

aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name CPUUtilization \
  --dimensions Name=DBInstanceIdentifier,Value=my-db \
  --statistics Average \
  --period 86400 \
  --start-time 2024-01-01T00:00:00Z \
  --end-time 2024-02-01T00:00:00Z \
  --query 'Datapoints[*].Average'
# If every daily average stays below 5% for 30 days, also check the DatabaseConnections metric;
# if there are no connections either → consider Aurora Serverless v2 or deletion
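
The 5% rule above can be applied mechanically to the list of daily averages that the query returns. A minimal sketch, assuming one datapoint per day; the function name `is_idle` and the requirement of at least 30 datapoints are illustrative choices, not an AWS definition of idleness.

```python
def is_idle(daily_cpu_averages, cpu_threshold=5.0, min_days=30):
    """Return True when every daily CPU average is below the threshold.

    Returns False when there are fewer than `min_days` datapoints, because a
    short window is not enough evidence to call an instance idle.
    """
    if len(daily_cpu_averages) < min_days:
        return False
    return max(daily_cpu_averages) < cpu_threshold


# Hypothetical datapoints, as returned by --query 'Datapoints[*].Average'.
print(is_idle([2.1] * 31))           # quiet for the whole month
print(is_idle([2.1] * 30 + [40.0]))  # one busy day breaks the pattern
```

Using the maximum of the daily averages rather than the overall mean avoids flagging a database that is quiet most of the month but does real work on a few days.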

Eliminate NAT Gateway data transfer costs:

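# Check NAT Gateway data processing cost
# (usage type names vary by region and may not carry a prefix in us-east-1;
#  list the exact values with: aws ce get-dimension-values --dimension USAGE_TYPE --time-period ...)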
aws ce get-cost-and-usage \
  --time-period Start=2024-01-01,End=2024-02-01 \
  --granularity MONTHLY \
  --filter '{"Dimensions":{"Key":"USAGE_TYPE","Values":["USE1-NatGateway-Bytes"]}}' \
  --metrics BlendedCost

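# Create S3 and DynamoDB gateway endpoints so that traffic bypasses the NAT Gateway
# (vpc-0abc123 and rtb-private are placeholders for your VPC and private route table IDs)
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc123 \
  --service-name com.amazonaws.us-east-1.s3 \
  --vpc-endpoint-type Gateway \
  --route-table-ids rtb-private

aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc123 \
  --service-name com.amazonaws.us-east-1.dynamodb \
  --vpc-endpoint-type Gateway \
  --route-table-ids rtb-private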
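
To judge whether the endpoints are worth the change, a back-of-the-envelope estimate is usually enough. The sketch below assumes the us-east-1 NAT Gateway data processing list price and that gateway endpoints for S3 and DynamoDB carry no per-GB charge; the traffic volume and the share of traffic bound for S3 or DynamoDB are hypothetical inputs you would take from VPC Flow Logs or application metrics.

```python
# Assumption: us-east-1 NAT Gateway data processing list price per GB.
NAT_USD_PER_GB = 0.045


def nat_processing_savings(total_gb, s3_dynamodb_share, usd_per_gb=NAT_USD_PER_GB):
    """Return (current_cost, estimated_saving) for NAT data processing.

    `s3_dynamodb_share` is the fraction (0.0 to 1.0) of NAT traffic that goes
    to S3 or DynamoDB and would move to gateway endpoints.
    """
    if not 0.0 <= s3_dynamodb_share <= 1.0:
        raise ValueError("s3_dynamodb_share must be between 0.0 and 1.0")
    current_cost = total_gb * usd_per_gb
    return current_cost, current_cost * s3_dynamodb_share


# Hypothetical example: 10 TB through the NAT Gateway, 60% of it to S3.
current, saving = nat_processing_savings(10_000, 0.6)
print(f"current ${current:,.2f}/month, estimated saving ${saving:,.2f}/month")
```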

Step 3: Rightsize EC2 With Compute Optimizer

# Enable Compute Optimizer (free service)
aws compute-optimizer update-enrollment-status --status Active

# Get EC2 rightsizing recommendations (by default based on up to the last 14 days of metrics)
aws compute-optimizer get-ec2-instance-recommendations \
  --query 'instanceRecommendations[*].{
    Instance:instanceArn,
    Finding:finding,
    CurrentType:currentInstanceType,
    RecommendedType:recommendationOptions[0].instanceType,
    Savings:recommendationOptions[0].savingsOpportunity.estimatedMonthlySavings.value
  }' \
  --output table

Typical Compute Optimizer findings:

Finding	Meaning	Action
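`OVER_PROVISIONED`	At least one dimension (CPU, memory, network) can be sized down and still meet the workload's needs	Downsize or switch to Graviton
`UNDER_PROVISIONED`	At least one dimension (CPU, memory, network) does not meet the workload's needs	Upsize to prevent performance degradation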
`OPTIMIZED`	Right-sized	No action needed
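
With more than a handful of instances, a summary by finding shows where the money is. The sketch below assumes recommendation dictionaries in the `get-ec2-instance-recommendations` shape, with the monthly saving under `recommendationOptions[0].savingsOpportunity.estimatedMonthlySavings.value`; the function name and the sample values are hypothetical.

```python
from collections import defaultdict


def summarize_findings(recommendations):
    """Group recommendations by finding with a count and total monthly saving."""
    summary = defaultdict(lambda: {"count": 0, "monthly_savings": 0.0})
    for rec in recommendations:
        options = rec.get("recommendationOptions") or [{}]
        # Only the top-ranked option is counted; missing fields count as zero.
        saving = (
            options[0]
            .get("savingsOpportunity", {})
            .get("estimatedMonthlySavings", {})
            .get("value", 0.0)
        )
        entry = summary[rec["finding"]]
        entry["count"] += 1
        entry["monthly_savings"] += saving
    return dict(summary)


def _rec(finding, saving=None):
    options = []
    if saving is not None:
        options = [{"savingsOpportunity": {"estimatedMonthlySavings": {"currency": "USD", "value": saving}}}]
    return {"finding": finding, "recommendationOptions": options}


# Hypothetical sample data.
sample_recs = [_rec("OVER_PROVISIONED", 120.0), _rec("OVER_PROVISIONED", 80.0), _rec("OPTIMIZED")]
print(summarize_findings(sample_recs))
```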

Step 4: Automate Dev Resource Shutdown

One of the fastest ways to cut costs: stop dev/staging instances outside working hours.

import boto3

def stop_dev_instances(event, context):
    ec2 = boto3.client('ec2')

    # Find all running dev/staging EC2 instances.
    # A paginator is used so that large fleets are covered past the first page of results.
    paginator = ec2.get_paginator('describe_instances')
    pages = paginator.paginate(
        Filters=[
            {'Name': 'tag:Environment', 'Values': ['dev', 'staging']},
            {'Name': 'instance-state-name', 'Values': ['running']}
        ]
    )

    instance_ids = [
        i['InstanceId']
        for page in pages
        for r in page['Reservations']
        for i in r['Instances']
        # Skip instances tagged to always run
        if not any(t['Key'] == 'AlwaysOn' and t['Value'] == 'true' for t in i.get('Tags', []))
    ]

    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)
        print(f"Stopped {len(instance_ids)} instances: {instance_ids}")

    return {'stopped': len(instance_ids)}
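
Stopping instances in the evening only helps if something starts them again in the morning. The following is a sketch of a matching start function, written so that the EC2 client is passed in and the logic can be dry-run without AWS access; `start_dev_instances`, `handler`, and `StubEC2` are hypothetical names. Note that it starts every stopped dev/staging instance, including ones someone stopped on purpose, so a real version may want an opt-out tag.

```python
def start_dev_instances(ec2):
    """Start all stopped dev/staging instances using the given EC2 client."""
    paginator = ec2.get_paginator("describe_instances")
    pages = paginator.paginate(
        Filters=[
            {"Name": "tag:Environment", "Values": ["dev", "staging"]},
            {"Name": "instance-state-name", "Values": ["stopped"]},
        ]
    )
    instance_ids = [
        instance["InstanceId"]
        for page in pages
        for reservation in page["Reservations"]
        for instance in reservation["Instances"]
    ]
    if instance_ids:
        ec2.start_instances(InstanceIds=instance_ids)
    return {"started": len(instance_ids)}


def handler(event, context):
    # Lambda entry point; boto3 is imported here so the logic above stays testable offline.
    import boto3
    return start_dev_instances(boto3.client("ec2"))


class StubEC2:
    """Hypothetical stand-in for a boto3 EC2 client, for local dry runs only."""

    def __init__(self, stopped_ids):
        self._stopped_ids = list(stopped_ids)
        self.started = []

    def get_paginator(self, operation_name):
        return self

    def paginate(self, **kwargs):
        instances = [{"InstanceId": i} for i in self._stopped_ids]
        return [{"Reservations": [{"Instances": instances}]}]

    def start_instances(self, InstanceIds):
        self.started = list(InstanceIds)


stub = StubEC2(["i-0aaa", "i-0bbb"])
print(start_dev_instances(stub), stub.started)
```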

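Schedule with EventBridge (cron expressions are evaluated in UTC, so shift the hours for your time zone):

# Stop at 7 PM (19:00 UTC) weekdays
aws events put-rule \
  --name "StopDevInstances" \
  --schedule-expression "cron(0 19 ? * MON-FRI *)" \
  --state ENABLED

# Start at 8 AM (08:00 UTC) weekdays; this rule needs its own Lambda that calls start_instances
aws events put-rule \
  --name "StartDevInstances" \
  --schedule-expression "cron(0 8 ? * MON-FRI *)" \
  --state ENABLED

# A rule does nothing until it has a target. Attach the Lambda (the ARN below is a placeholder)
# and grant EventBridge permission to invoke it with `aws lambda add-permission`.
aws events put-targets \
  --rule "StopDevInstances" \
  --targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:123456789012:function:stop-dev-instances"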

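Estimated savings: running dev instances from 8 AM to 7 PM on weekdays instead of 24/7 cuts them from 168 to 55 instance-hours per week, roughly a 67% reduction in compute cost for those instances. Attached EBS volumes are still billed while an instance is stopped.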
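
The saving from a schedule follows directly from the hours it leaves running. A short worked calculation for the 8 AM start and 7 PM stop rules above, assuming billing is proportional to running hours and ignoring storage, which is billed regardless:

```python
HOURS_PER_WEEK = 24 * 7  # 168


def weekly_running_hours(start_hour, stop_hour, days_per_week=5):
    """Hours per week an instance runs under a daily start/stop schedule."""
    return (stop_hour - start_hour) * days_per_week


def instance_hour_reduction(start_hour, stop_hour, days_per_week=5):
    """Fraction of instance-hours saved compared with running 24/7."""
    running = weekly_running_hours(start_hour, stop_hour, days_per_week)
    return 1 - running / HOURS_PER_WEEK


running = weekly_running_hours(8, 19)
print(f"{running} of {HOURS_PER_WEEK} hours, {instance_hour_reduction(8, 19):.0%} fewer instance-hours")
```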

Step 5: Commit for Savings — Savings Plans vs Spot

Savings Plans (compute commitment, not instance type commitment):

# Get Savings Plans recommendations based on your last 60 days of usage.
# COMPUTE_SP is the most flexible type: it covers EC2, Fargate, and Lambda.
aws ce get-savings-plans-purchase-recommendation \
  --savings-plans-type COMPUTE_SP \
  --term-in-years ONE_YEAR \
  --payment-option NO_UPFRONT \
  --lookback-period-in-days SIXTY_DAYS \
  --query 'SavingsPlansPurchaseRecommendation.SavingsPlansPurchaseRecommendationDetails[*].{
    Commitment:HourlyCommitmentToPurchase,
    Savings:EstimatedSavingsAmount,
    ROI:EstimatedROI
  }'
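
An hourly commitment looks small until it is multiplied out over the term. This sketch turns the recommendation details into a total commitment, assuming each detail carries `HourlyCommitmentToPurchase` as a string, as the Cost Explorer API returns it; the function name and the sample values are hypothetical.

```python
from decimal import Decimal

HOURS_PER_YEAR = 24 * 365


def total_commitment(details, term_years=1):
    """Return (hourly_commitment, total_commitment_over_term) as Decimals."""
    hourly = sum(
        (Decimal(d["HourlyCommitmentToPurchase"]) for d in details),
        Decimal("0"),
    )
    return hourly, hourly * HOURS_PER_YEAR * term_years


# Hypothetical sample data in the recommendation-details shape.
sample_details = [
    {"HourlyCommitmentToPurchase": "1.50"},
    {"HourlyCommitmentToPurchase": "0.25"},
]

hourly, total = total_commitment(sample_details)
print(f"${hourly}/hour commits you to ${total} over one year")
```

The commitment is owed whether or not the capacity is used, which is why the plan is best sized after the rightsizing and shutdown steps have lowered the baseline.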

Spot Instances for stateless and fault-tolerant workloads (up to 90% discount):

resource "aws_autoscaling_group" "batch_workers" {
  # Placeholder sizing and subnets (var.private_subnet_ids is hypothetical); set these for your workload
  min_size            = 1
  max_size            = 10
  vpc_zone_identifier = var.private_subnet_ids

  mixed_instances_policy {
    instances_distribution {
      on_demand_base_capacity                  = 1   # Always 1 on-demand
      on_demand_percentage_above_base_capacity = 0   # Rest all Spot
      spot_allocation_strategy                 = "price-capacity-optimized"
    }
    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.worker.id
      }
      # Diversify across instance types to maximize Spot availability
      override {
        instance_type = "m5.xlarge"
      }
      override {
        instance_type = "m5a.xlarge"
      }
      override {
        instance_type = "m6i.xlarge"
      }
    }
  }
}
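
The effect of the distribution above, one On-Demand instance as a base and Spot for everything else, can be estimated with simple arithmetic. A minimal sketch; the prices passed in are hypothetical placeholders, since Spot prices vary by instance type, Availability Zone, and time.

```python
def blended_hourly_cost(desired_capacity, on_demand_price, spot_price, on_demand_base=1):
    """Hourly cost with `on_demand_base` On-Demand instances and the rest on Spot.

    Mirrors on_demand_percentage_above_base_capacity = 0: everything above
    the base runs on Spot.
    """
    on_demand_count = min(on_demand_base, desired_capacity)
    spot_count = desired_capacity - on_demand_count
    return on_demand_count * on_demand_price + spot_count * spot_price


# Hypothetical prices per instance-hour, for illustration only.
ON_DEMAND_PRICE = 0.192
SPOT_PRICE = 0.07

mixed = blended_hourly_cost(10, ON_DEMAND_PRICE, SPOT_PRICE)
all_on_demand = 10 * ON_DEMAND_PRICE
print(f"mixed ${mixed:.3f}/h vs all On-Demand ${all_on_demand:.3f}/h")
```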

Step 6: FinOps Practice — Prevent the Next Spike

# Set per-team budget alerts
# (the tag must be activated as a cost allocation tag; user-defined tags take the form user:<key>$<value>)
aws budgets create-budget \
  --account-id 123456789012 \
  --budget '{
    "BudgetName": "team-payments-monthly",
    "BudgetLimit": {"Amount": "8000", "Unit": "USD"},
    "BudgetType": "COST",
    "TimeUnit": "MONTHLY",
    "CostFilters": {
      "TagKeyValue": ["user:team$payments"]
    }
  }' \
  --notifications-with-subscribers '[{
    "Notification": {
      "NotificationType": "ACTUAL",
      "ComparisonOperator": "GREATER_THAN",
      "Threshold": 80
    },
    "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "<alert-email-address>"}]
  }]'
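
Once there are more than one or two teams, it is easier to generate the budget definitions than to hand-edit JSON. The sketch below builds the same structure as the command above for any team; the function name `team_budget`, the team names, limits, and the example.com address are all hypothetical, and the `user:` tag prefix assumes a user-defined cost allocation tag.

```python
def team_budget(team, monthly_limit_usd, alert_email, threshold_pct=80):
    """Build (budget, notifications) dictionaries for one team's monthly budget."""
    budget = {
        "BudgetName": f"team-{team}-monthly",
        "BudgetLimit": {"Amount": str(monthly_limit_usd), "Unit": "USD"},
        "BudgetType": "COST",
        "TimeUnit": "MONTHLY",
        # Budgets expects user-defined tags in the form user:<key>$<value>.
        "CostFilters": {"TagKeyValue": [f"user:team${team}"]},
    }
    notifications = [{
        "Notification": {
            "NotificationType": "ACTUAL",
            "ComparisonOperator": "GREATER_THAN",
            "Threshold": threshold_pct,
            "ThresholdType": "PERCENTAGE",
        },
        "Subscribers": [{"SubscriptionType": "EMAIL", "Address": alert_email}],
    }]
    return budget, notifications


# Hypothetical teams and limits.
for team, limit in {"payments": 8000, "search": 5000}.items():
    budget, notifications = team_budget(team, limit, "finops@example.com")
    print(budget["BudgetName"], budget["CostFilters"]["TagKeyValue"])
```

The two return values map onto the `Budget` and `NotificationsWithSubscribers` parameters of boto3's `create_budget` call on the `budgets` client, alongside `AccountId`.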

Cost visibility checklist:

Every resource tagged with team, environment, service
Per-team budgets with 80% email alerts
Weekly cost review in team standups (show top 5 cost changes)
Compute Optimizer recommendations reviewed monthly
Dev instances shut down on nights and weekends
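
The first checklist item is the one that tends to decay, so it is worth checking automatically. A minimal sketch of a tag compliance check, assuming tags in the `[{"Key": ..., "Value": ...}]` shape that EC2 returns; matching tag keys case-insensitively is an assumption made here because the examples above mix `Environment` and lowercase names, and the resource IDs are hypothetical.

```python
REQUIRED_TAGS = ("team", "environment", "service")


def missing_tags(resource_tags, required=REQUIRED_TAGS):
    """Return the required tag keys that are absent or empty on a resource."""
    present = {
        tag["Key"].lower()
        for tag in resource_tags
        if tag.get("Value")
    }
    return [key for key in required if key not in present]


def untagged_report(resources):
    """Map resource ID to its missing tags, omitting compliant resources."""
    report = {}
    for resource_id, tags in resources.items():
        missing = missing_tags(tags)
        if missing:
            report[resource_id] = missing
    return report


# Hypothetical sample data.
sample_resources = {
    "i-0aaa": [
        {"Key": "team", "Value": "payments"},
        {"Key": "Environment", "Value": "dev"},
        {"Key": "service", "Value": "checkout"},
    ],
    "i-0bbb": [{"Key": "team", "Value": "search"}],
    "vol-ccc": [],
}

print(untagged_report(sample_resources))
```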

30-Day Cost Reduction Plan

Week	Action	Estimated Saving
Week 1	Stop dev instances nights/weekends, delete unattached EBS	~$30K
Week 1	Add VPC endpoints (eliminate NAT Gateway excess)	~$15K
Week 2	Rightsize top 10 overprovisioned EC2 instances	~$20K
Week 3	Convert eligible batch workloads to Spot	~$25K
Week 4	Purchase 1-year Compute Savings Plan based on new baseline	~$30K

Interview Angle

Mention the FinOps loop: Inform (show teams their costs), Optimize (rightsize, Spot, delete waste), Operate (budgets, reviews, automation). Most companies fail at “Inform” — teams don’t know they’re spending money until the bill arrives. Tagged resources + per-team dashboards fix this at the cultural level.
