
AWS Bill Jumped 40% Last Month — Investigate and Reduce Costs

Walk through a systematic cost investigation using Cost Explorer, Compute Optimizer, and Trusted Advisor. Implement savings via rightsizing, Spot, Savings Plans, and FinOps practices.

January 20, 2025 5 min read ~30 min to complete DB
The Situation

Your VP of Engineering drops a Slack message on Monday morning: 'Our AWS bill hit $420K last month — up from $300K. Finance wants an explanation by end of week and a plan to bring it under $300K within 30 days. Find it.' You have access to Cost Explorer, CloudWatch, and Trusted Advisor. Where do you start?

6 Steps
7 Services Used
~30 min Duration
Advanced Difficulty

The Problem

A 40% cost spike rarely comes from one source. It’s usually a combination of: a forgotten service that kept running, an unexpected data transfer charge, a scaling event that never scaled back down, and a lack of visibility that meant nobody noticed for 30 days.

The investigation process is the same as debugging an application: reproduce the symptom, isolate the cause, fix it, and add monitoring so it can’t sneak up on you again.

Step 1: Identify the Source With Cost Explorer

# Get cost by service, highest first, to find which service spiked
# (text output with the cost in the first column so sort -rn can order it numerically)
aws ce get-cost-and-usage \
  --time-period Start=2024-01-01,End=2024-02-01 \
  --granularity MONTHLY \
  --metrics BlendedCost \
  --group-by Type=DIMENSION,Key=SERVICE \
  --query 'ResultsByTime[0].Groups[*].[Metrics.BlendedCost.Amount,Keys[0]]' \
  --output text | sort -rn

# Compare the spike month with the previous month, by service
aws ce get-cost-and-usage \
  --time-period Start=2023-12-01,End=2024-02-01 \
  --granularity MONTHLY \
  --metrics BlendedCost \
  --group-by Type=DIMENSION,Key=SERVICE

# Check AWS Cost Anomaly Detection (flags statistical outliers automatically;
# an anomaly monitor must already exist for the period)
aws ce get-anomalies \
  --date-interval StartDate=2024-01-01,EndDate=2024-02-01 \
  --query 'Anomalies[*].{Score:AnomalyScore.MaxScore,Impact:Impact.TotalImpact,Root:RootCauses}'
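
The two-month comparison is easier to read as a per-service delta. The following is a minimal sketch, assuming the JSON shape that `get-cost-and-usage` returns when grouped by SERVICE; the function name `cost_deltas`, the service names, and the sample figures are illustrative and not taken from a real bill.

```python
def cost_deltas(results_by_time, metric="BlendedCost"):
    """Return (service, previous, current, delta) tuples, largest increase first.

    Expects the first two entries of ResultsByTime from a MONTHLY
    get-cost-and-usage call grouped by SERVICE.
    """
    def by_service(period):
        return {
            group["Keys"][0]: float(group["Metrics"][metric]["Amount"])
            for group in period["Groups"]
        }

    previous, current = (by_service(p) for p in results_by_time[:2])
    rows = [
        (svc, previous.get(svc, 0.0), current.get(svc, 0.0),
         current.get(svc, 0.0) - previous.get(svc, 0.0))
        for svc in set(previous) | set(current)
    ]
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Hypothetical sample data in the Cost Explorer response shape
sample = [
    {"Groups": [
        {"Keys": ["Amazon EC2"], "Metrics": {"BlendedCost": {"Amount": "150000"}}},
        {"Keys": ["Amazon RDS"], "Metrics": {"BlendedCost": {"Amount": "60000"}}},
    ]},
    {"Groups": [
        {"Keys": ["Amazon EC2"], "Metrics": {"BlendedCost": {"Amount": "210000"}}},
        {"Keys": ["Amazon RDS"], "Metrics": {"BlendedCost": {"Amount": "65000"}}},
        {"Keys": ["NAT Gateway"], "Metrics": {"BlendedCost": {"Amount": "18000"}}},
    ]},
]

for service, prev, cur, delta in cost_deltas(sample):
    print(f"{service:<12} {prev:>10.0f} -> {cur:>10.0f}  ({delta:+.0f})")
```

Services that appear only in the second month (here the hypothetical NAT Gateway line) show up with a previous cost of zero, which is often where a forgotten resource hides.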

Common culprits found in spikes:

Service | Typical Spike Cause
EC2 | Scale-out event that never scaled back; wrong instance type launched
NAT Gateway | Data processing jobs routing AWS API calls through NAT instead of VPC endpoints
RDS | Read replicas left running after a load test
Data Transfer | New service transferring data cross-region or to the internet without CloudFront
S3 | Request costs from an unoptimized application making thousands of HEAD requests
CloudWatch | Excessive log ingestion from a verbose debug logging mode left on

Step 2: Quick Wins — Fix the Immediate Problem

Remove unattached EBS volumes:

aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query 'Volumes[*].{VolumeId:VolumeId,Size:Size,AZ:AvailabilityZone,Created:CreateTime}' \
  --output table

# After verifying they're truly unused:
for vol in $(aws ec2 describe-volumes --filters Name=status,Values=available \
  --query 'Volumes[*].VolumeId' --output text); do
  echo "Deleting $vol"
  aws ec2 delete-volume --volume-id "$vol"
done

Find idle RDS instances:

aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name CPUUtilization \
  --dimensions Name=DBInstanceIdentifier,Value=my-db \
  --statistics Average \
  --period 86400 \
  --start-time 2024-01-01T00:00:00Z \
  --end-time 2024-02-01T00:00:00Z \
  --query 'Datapoints[*].Average'
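# If average CPU stays below 5% for 30 days, confirm with the DatabaseConnections metric,
# then consider Aurora Serverless v2, a smaller instance class, or deletion after a final snapshot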
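
The idle test can be applied to the returned datapoints in a few lines. This is a sketch only: the helper name `is_idle` is hypothetical, and it uses a stricter rule than a single 30-day average by requiring every daily average to stay under the threshold.

```python
def is_idle(daily_cpu_averages, threshold_pct=5.0, min_days=30):
    """True when at least min_days daily averages exist and all are below the threshold."""
    if len(daily_cpu_averages) < min_days:
        return False  # not enough history to call the instance idle
    return max(daily_cpu_averages) < threshold_pct


# Hypothetical values, as returned by --query 'Datapoints[*].Average'
quiet_db = [1.8] * 31
bursty_db = [1.8] * 30 + [42.0]

print(is_idle(quiet_db))   # every day under 5%
print(is_idle(bursty_db))  # one busy day disqualifies it
```

Checking each day rather than the monthly mean avoids flagging a database that sits quiet most of the month but carries a real monthly batch job.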

Eliminate NAT Gateway data transfer costs:

# Check NAT Gateway data transfer cost
aws ce get-cost-and-usage \
  --time-period Start=2024-01-01,End=2024-02-01 \
  --granularity MONTHLY \
  --filter '{"Dimensions":{"Key":"USAGE_TYPE","Values":["USE1-NatGateway-Bytes"]}}' \
  --metrics BlendedCost

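# Create S3 and DynamoDB gateway VPC endpoints to avoid NAT Gateway charges
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc123 \
  --service-name com.amazonaws.us-east-1.s3 \
  --vpc-endpoint-type Gateway \
  --route-table-ids rtb-private

aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc123 \
  --service-name com.amazonaws.us-east-1.dynamodb \
  --vpc-endpoint-type Gateway \
  --route-table-ids rtb-private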
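
To judge whether the endpoints are worth prioritizing, the avoided NAT processing charge can be estimated from traffic volume. The sketch below assumes a per-GB processing rate of $0.045 (check current pricing for your Region) and a hypothetical traffic split; gateway endpoints for S3 and DynamoDB themselves carry no additional charge.

```python
NAT_PROCESSING_USD_PER_GB = 0.045  # assumed rate; verify against current regional pricing


def nat_processing_cost(gb_processed, usd_per_gb=NAT_PROCESSING_USD_PER_GB):
    """Monthly NAT Gateway data processing charge for a given volume."""
    return gb_processed * usd_per_gb


def endpoint_savings(total_gb, share_to_s3_dynamodb, usd_per_gb=NAT_PROCESSING_USD_PER_GB):
    """Processing charge avoided when S3/DynamoDB traffic moves to gateway endpoints."""
    return nat_processing_cost(total_gb * share_to_s3_dynamodb, usd_per_gb)


# Hypothetical month: 400,000 GB through NAT, 80% of it bound for S3 or DynamoDB
saved = endpoint_savings(400_000, 0.80)
print(f"Estimated monthly saving: ${saved:,.0f}")
```

The share of NAT traffic that actually targets S3 or DynamoDB is the key unknown; VPC Flow Logs are one way to measure it before relying on the estimate.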

Step 3: Rightsize EC2 With Compute Optimizer

# Enable Compute Optimizer (free service)
aws compute-optimizer update-enrollment-status --status Active

# Get EC2 rightsizing recommendations (by default based on the last 14 days of CloudWatch metrics)
aws compute-optimizer get-ec2-instance-recommendations \
  --query 'instanceRecommendations[*].{
    Instance:instanceArn,
    Finding:finding,
    CurrentType:currentInstanceType,
    RecommendedType:recommendationOptions[0].instanceType,
    Savings:recommendationOptions[0].savingsOpportunity.estimatedMonthlySavings.value
  }' \
  --output table

Typical Compute Optimizer findings:

Finding | Meaning | Action
OVER_PROVISIONED | Instance uses < 40% CPU consistently | Downsize or switch to Graviton
UNDER_PROVISIONED | Instance is CPU-throttled | Upsize to prevent performance degradation
OPTIMIZED | Right-sized | No action needed
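
Once recommendations are exported as JSON, a short script can rank the over-provisioned instances by estimated saving. This is a sketch under assumptions: the function name is hypothetical, the sample ARNs and figures are invented, and the field path `savingsOpportunity.estimatedMonthlySavings.value` should be checked against your own CLI output.

```python
def top_rightsizing_candidates(recommendations, limit=10):
    """Return (arn, current_type, recommended_type, monthly_savings), largest saving first."""
    candidates = []
    for rec in recommendations:
        if rec.get("finding") != "OVER_PROVISIONED":
            continue
        options = rec.get("recommendationOptions") or []
        if not options:
            continue
        best = options[0]  # first option, as used in the CLI query
        savings = (best.get("savingsOpportunity", {})
                       .get("estimatedMonthlySavings", {})
                       .get("value", 0.0))
        candidates.append(
            (rec["instanceArn"], rec["currentInstanceType"], best["instanceType"], savings)
        )
    candidates.sort(key=lambda c: c[3], reverse=True)
    return candidates[:limit]


# Hypothetical recommendations in the instanceRecommendations shape
sample_recs = [
    {"instanceArn": "arn:aws:ec2:us-east-1:123456789012:instance/i-0aaa",
     "finding": "OVER_PROVISIONED", "currentInstanceType": "m5.4xlarge",
     "recommendationOptions": [{"instanceType": "m5.2xlarge",
         "savingsOpportunity": {"estimatedMonthlySavings": {"currency": "USD", "value": 280.0}}}]},
    {"instanceArn": "arn:aws:ec2:us-east-1:123456789012:instance/i-0bbb",
     "finding": "OPTIMIZED", "currentInstanceType": "m5.xlarge",
     "recommendationOptions": []},
    {"instanceArn": "arn:aws:ec2:us-east-1:123456789012:instance/i-0ccc",
     "finding": "OVER_PROVISIONED", "currentInstanceType": "c5.9xlarge",
     "recommendationOptions": [{"instanceType": "c5.4xlarge",
         "savingsOpportunity": {"estimatedMonthlySavings": {"currency": "USD", "value": 550.0}}}]},
]

for arn, current, recommended, savings in top_rightsizing_candidates(sample_recs):
    print(f"{arn.split('/')[-1]}: {current} -> {recommended} (~${savings:.0f}/month)")
```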

Step 4: Automate Dev Resource Shutdown

One of the fastest ways to cut costs: stop dev/staging instances outside working hours.

import boto3

def stop_dev_instances(event, context):
    ec2 = boto3.client('ec2')

    # Find all running dev/staging EC2 instances.
    # Paginate, because a single describe_instances call returns only one page of results.
    paginator = ec2.get_paginator('describe_instances')
    pages = paginator.paginate(
        Filters=[
            {'Name': 'tag:Environment', 'Values': ['dev', 'staging']},
            {'Name': 'instance-state-name', 'Values': ['running']}
        ]
    )

    instance_ids = [
        i['InstanceId']
        for page in pages
        for r in page['Reservations']
        for i in r['Instances']
        # Skip instances tagged to always run
        if not any(t['Key'] == 'AlwaysOn' and t['Value'] == 'true' for t in i.get('Tags', []))
    ]

    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)
        print(f"Stopped {len(instance_ids)} instances: {instance_ids}")

    return {'stopped': len(instance_ids)}

Schedule with EventBridge:

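# Stop at 7 PM on weekdays
# (EventBridge cron expressions are evaluated in UTC; adjust the hour for your time zone)
aws events put-rule \
  --name "StopDevInstances" \
  --schedule-expression "cron(0 19 ? * MON-FRI *)" \
  --state ENABLED

# Start at 8 AM on weekdays (needs a matching start function; only the stop handler is shown above)
aws events put-rule \
  --name "StartDevInstances" \
  --schedule-expression "cron(0 8 ? * MON-FRI *)" \
  --state ENABLED

# A rule does nothing until it has a target. Attach the Lambda function
# (the function name in this ARN is a placeholder) and grant EventBridge
# permission to invoke it with `aws lambda add-permission`.
aws events put-targets \
  --rule "StopDevInstances" \
  --targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:123456789012:function:stop-dev-instances"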

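Estimated savings: running dev instances 8 AM–7 PM on weekdays instead of 24/7 means 55 of 168 hours per week, roughly a 67% reduction in instance-hours on those instances. Attached EBS volumes still bill while the instances are stopped.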
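
The reduction follows directly from the schedule. A quick check, assuming the 8 AM start and 7 PM stop from the EventBridge rules above and counting instance-hours only (storage is excluded):

```python
HOURS_PER_WEEK = 24 * 7  # 168


def weekly_running_hours(start_hour, stop_hour, days_per_week=5):
    """Hours per week an instance runs under a start/stop schedule."""
    return (stop_hour - start_hour) * days_per_week


def instance_hour_reduction(start_hour, stop_hour, days_per_week=5):
    """Fraction of instance-hours saved compared with running 24/7."""
    return 1 - weekly_running_hours(start_hour, stop_hour, days_per_week) / HOURS_PER_WEEK


print(weekly_running_hours(8, 19))                # 55
print(f"{instance_hour_reduction(8, 19):.1%}")    # 67.3%
```

Shifting the start to 9 AM raises the reduction only slightly, so the exact working hours matter less than making sure nights and weekends are covered.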

Step 5: Commit for Savings — Savings Plans vs Spot

Savings Plans (you commit to an hourly compute spend, not to a specific instance type):

# Get Savings Plans recommendations based on your last 60 days of usage.
# COMPUTE_SP is the most flexible type: it covers EC2, Fargate, and Lambda.
aws ce get-savings-plans-purchase-recommendation \
  --savings-plans-type COMPUTE_SP \
  --term-in-years ONE_YEAR \
  --payment-option NO_UPFRONT \
  --lookback-period-in-days SIXTY_DAYS \
  --query 'SavingsPlansPurchaseRecommendation.SavingsPlansPurchaseRecommendationDetails[*].{
    Commitment:HourlyCommitmentToPurchase,
    Savings:EstimatedSavingsAmount,
    ROI:EstimatedROI
  }'

Spot Instances for stateless and fault-tolerant workloads (up to 90% discount):

resource "aws_autoscaling_group" "batch_workers" {
  # Sizes and subnets are placeholders; set them for your workload
  min_size            = 1
  max_size            = 10
  vpc_zone_identifier = var.private_subnet_ids

  mixed_instances_policy {
    instances_distribution {
      on_demand_base_capacity                  = 1   # Always 1 on-demand
      on_demand_percentage_above_base_capacity = 0   # Rest all Spot
      spot_allocation_strategy                 = "price-capacity-optimized"
    }
    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.worker.id
      }
      # Diversify across instance types to maximize Spot availability
      override {
        instance_type = "m5.xlarge"
      }
      override {
        instance_type = "m5a.xlarge"
      }
      override {
        instance_type = "m6i.xlarge"
      }
    }
  }
}
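
The effect of a one-instance on-demand base with everything else on Spot can be estimated before changing the group. The prices below are assumptions for illustration only; Spot prices fluctuate by instance type and Availability Zone, so substitute current figures.

```python
def blended_hourly_cost(total_instances, on_demand_base, on_demand_price, spot_price):
    """Hourly cost when all capacity above the on-demand base runs on Spot
    (that is, on_demand_percentage_above_base_capacity = 0)."""
    on_demand = min(on_demand_base, total_instances)
    spot = total_instances - on_demand
    return on_demand * on_demand_price + spot * spot_price


ON_DEMAND_PRICE = 0.192  # assumed on-demand $/hour for the instance size
SPOT_PRICE = 0.070       # assumed Spot $/hour; real Spot prices vary

all_on_demand = blended_hourly_cost(10, 10, ON_DEMAND_PRICE, SPOT_PRICE)
mixed = blended_hourly_cost(10, 1, ON_DEMAND_PRICE, SPOT_PRICE)
print(f"all on-demand: ${all_on_demand:.3f}/h")
print(f"1 on-demand + 9 Spot: ${mixed:.3f}/h")
print(f"reduction: {1 - mixed / all_on_demand:.0%}")
```

The single on-demand instance keeps the group from dropping to zero during a Spot interruption, at the cost of a slightly lower overall discount.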

Step 6: FinOps Practice — Prevent the Next Spike

# Set per-team budget alerts
# (user-defined tags are filtered as user:<key>$<value>; replace <team-email-address> with a real address)
aws budgets create-budget \
  --account-id 123456789012 \
  --budget '{
    "BudgetName": "team-payments-monthly",
    "BudgetLimit": {"Amount": "8000", "Unit": "USD"},
    "BudgetType": "COST",
    "TimeUnit": "MONTHLY",
    "CostFilters": {
      "TagKeyValue": ["user:team$payments"]
    }
  }' \
  --notifications-with-subscribers '[{
    "Notification": {
      "NotificationType": "ACTUAL",
      "ComparisonOperator": "GREATER_THAN",
      "Threshold": 80,
      "ThresholdType": "PERCENTAGE"
    },
    "Subscribers": [{"SubscriptionType": "EMAIL", "Address": "<team-email-address>"}]
  }]'
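
With more than a handful of teams, generating the budget definitions is less error-prone than hand-editing JSON. The sketch below builds request dictionaries in the shape that boto3's `create_budget` accepts; the function name, team names, limits, and example.com addresses are all hypothetical, and the `user:` tag prefix assumes a user-defined cost allocation tag named `team`.

```python
def team_budget_request(account_id, team, monthly_limit_usd, email, threshold_pct=80):
    """Build create_budget keyword arguments for one team's monthly cost budget."""
    return {
        "AccountId": account_id,
        "Budget": {
            "BudgetName": f"team-{team}-monthly",
            "BudgetLimit": {"Amount": str(monthly_limit_usd), "Unit": "USD"},
            "BudgetType": "COST",
            "TimeUnit": "MONTHLY",
            # Filter on the user-defined cost allocation tag team=<team>
            "CostFilters": {"TagKeyValue": [f"user:team${team}"]},
        },
        "NotificationsWithSubscribers": [{
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": float(threshold_pct),
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [{"SubscriptionType": "EMAIL", "Address": email}],
        }],
    }


# Hypothetical teams and limits
teams = {
    "payments": (8000, "payments-leads@example.com"),
    "search": (5000, "search-leads@example.com"),
}
requests = [
    team_budget_request("123456789012", name, limit, email)
    for name, (limit, email) in teams.items()
]
for req in requests:
    print(req["Budget"]["BudgetName"], req["Budget"]["BudgetLimit"]["Amount"])
    # To apply: boto3.client("budgets").create_budget(**req)
```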

Cost visibility checklist:

  • Every resource tagged with team, environment, service
  • Per-team budgets with 80% email alerts
  • Weekly cost review in team standups (show top 5 cost changes)
  • Compute Optimizer recommendations reviewed monthly
  • Dev instances shut down on nights and weekends
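
The first checklist item is the easiest to automate. A minimal sketch of a tag audit, assuming resources have already been fetched into dictionaries with an `Id` and an AWS-style `Tags` list; the function name and sample resource IDs are hypothetical.

```python
REQUIRED_TAGS = ("team", "environment", "service")


def missing_tags(resources, required=REQUIRED_TAGS):
    """Map resource id -> sorted list of required tag keys that are absent or empty."""
    report = {}
    for res in resources:
        tags = {t["Key"]: t.get("Value", "") for t in res.get("Tags", [])}
        missing = sorted(key for key in required if not tags.get(key))
        if missing:
            report[res["Id"]] = missing
    return report


# Hypothetical inventory
inventory = [
    {"Id": "i-0aaa", "Tags": [
        {"Key": "team", "Value": "payments"},
        {"Key": "environment", "Value": "prod"},
        {"Key": "service", "Value": "checkout"},
    ]},
    {"Id": "i-0bbb", "Tags": [{"Key": "team", "Value": "search"}]},
    {"Id": "vol-0ccc"},  # no tags at all
]

for resource_id, tags in missing_tags(inventory).items():
    print(f"{resource_id}: missing {', '.join(tags)}")
```

Untagged resources are exactly the ones that per-team budgets cannot see, so a report like this is worth reviewing alongside the weekly cost changes.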

30-Day Cost Reduction Plan

Week | Action | Estimated Saving
Week 1 | Stop dev instances nights/weekends, delete unattached EBS | ~$30K
Week 1 | Add VPC endpoints (eliminate NAT Gateway excess) | ~$15K
Week 2 | Rightsize top 10 overprovisioned EC2 instances | ~$20K
Week 3 | Convert eligible batch workloads to Spot | ~$25K
Week 4 | Purchase 1-year Compute Savings Plan based on new baseline | ~$30K
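
It is worth totalling the plan against the target before presenting it. The sketch below uses only the figures from the scenario and the table, in thousands of dollars.

```python
CURRENT_BILL_K = 420   # last month's bill, from the scenario
TARGET_BILL_K = 300    # finance wants the bill under this

# (week, action, estimated monthly saving in $K), from the plan table
plan = [
    ("Week 1", "Stop dev instances nights/weekends, delete unattached EBS", 30),
    ("Week 1", "Add VPC endpoints", 15),
    ("Week 2", "Rightsize top 10 overprovisioned EC2 instances", 20),
    ("Week 3", "Convert eligible batch workloads to Spot", 25),
    ("Week 4", "Purchase 1-year Compute Savings Plan", 30),
]

total_saving_k = sum(saving for _, _, saving in plan)
projected_bill_k = CURRENT_BILL_K - total_saving_k

print(f"Total estimated saving: ${total_saving_k}K")
print(f"Projected bill: ${projected_bill_k}K")
print(f"Strictly under target: {projected_bill_k < TARGET_BILL_K}")
```

The estimates add up to $120K, which brings the bill to the $300K line rather than below it, so the plan has no margin if any single estimate falls short.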
Interview Angle
Mention the FinOps loop: Inform (show teams their costs), Optimize (rightsize, Spot, delete waste), Operate (budgets, reviews, automation). Most companies fail at “Inform” — teams don’t know they’re spending money until the bill arrives. Tagged resources + per-team dashboards fix this at the cultural level.
Services Used
  • AWS Cost Explorer
  • AWS Compute Optimizer
  • AWS Trusted Advisor
  • S3 Intelligent Tiering
  • EC2 Spot
  • Savings Plans
  • Lambda
Prerequisites
  • Basic understanding of AWS billing concepts
  • Familiarity with EC2 instance types and pricing models
What You Learned
  • How to use Cost Explorer and anomaly detection to pinpoint cost spikes
  • The most common AWS cost culprits and their fixes
  • How to implement a Lambda-based cost automation (stop dev instances at night)
  • The difference between Spot, Reserved Instances, and Savings Plans
  • How to build a FinOps practice with tagging and team budgets
