
Terraform Plan Takes 45 Minutes — How to Fix It at Scale

Your team has 500+ Terraform resources and every plan takes 45 minutes, blocking deployments. Diagnose the root cause and implement state splitting, parallelism, and remote execution.

January 20, 2025 4 min read ~30 min to complete DB
The Situation

Your platform team manages 500+ AWS resources in a single Terraform root module. Every `terraform plan` now takes 45 minutes because Terraform refreshes every resource's state against live AWS APIs. Developers are sitting idle waiting for plans. Deployments are stacking up. The monolithic state file is also a blast-radius problem — one bad apply can affect everything.

6 steps · 4 services used · ~30 min duration · Advanced difficulty

The Problem

Terraform's plan command calls AWS APIs to refresh the current state of every resource. With 500 resources, that is at least 500 API calls (many resource types need several calls each), and Terraform works through them at most 10 at a time by default. At scale, a monolithic state file creates three compounding problems:

  1. Slow plans — every resource is refreshed every time, even if you only changed 1 resource
  2. High blast radius — a bad apply can touch any of the 500 resources
  3. Concurrency conflicts — only one engineer can run apply at a time; others are blocked

Step 1: Diagnose — Find Your Slow Resources

Before splitting, understand which providers and resources are slowest:

# Write a timestamped debug log; long gaps between entries show where the time goes
TF_LOG=DEBUG TF_LOG_PATH=plan-debug.log terraform plan -out=plan.tfplan

# Time the refresh phase on its own (-refresh-only needs Terraform 0.15.4 or later)
time terraform plan -refresh-only

Common culprits:

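  • aws_instance with user_data: each refresh makes several EC2 Describe* calls per instance
  • aws_s3_bucket_policy: S3 policy reads can be slow
  • Hundreds of aws_route53_record resources: the Route 53 API is throttled to five requests per second per account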
  • data sources that make API calls on every plan
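
A quick way to see which resource types dominate the state is to tally the output of `terraform state list`. The helper below is a sketch (the name `count_by_type` is made up for this example) that reads resource addresses on stdin, strips module prefixes and index suffixes, and prints a count per type, largest first.

```shell
# Hypothetical helper: tally resource addresses on stdin by resource type.
count_by_type() {
  # 1) drop [index] / ["key"] suffixes, 2) drop leading module.<name>. prefixes,
  # 3) keep the type (or data.<type> for data sources), 4) count and sort.
  sed -E 's/\[[^]]*\]//g; s/^(module\.[^.]+\.)+//' \
    | awk -F. '{ print ($1 == "data" ? "data." $2 : $1) }' \
    | sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}
```

Run it from the monolith's root directory as `terraform state list | count_by_type | head`. Types with hundreds of instances are the first candidates to move into their own state.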

Step 2: Split State Into Independent Modules

Break the monolith along deployment boundaries — components that change independently:

environments/
├── prod/
│   ├── networking/       ← VPC, subnets, TGW — changes rarely
│   │   ├── main.tf
│   │   └── backend.tf    ← separate state file
│   ├── compute/          ← EC2, ASG, ECS — changes frequently
│   │   ├── main.tf
│   │   └── backend.tf
│   ├── databases/        ← RDS, DynamoDB — changes rarely
│   │   ├── main.tf
│   │   └── backend.tf
│   └── security/         ← IAM, SGs — changes on demand
│       ├── main.tf
│       └── backend.tf
└── staging/
    └── ...

Each module has its own S3 backend key:

# environments/prod/compute/backend.tf
terraform {
  backend "s3" {
    bucket         = "company-tfstate-prod"
    key            = "prod/compute/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}

# Set up S3 + DynamoDB backend (one-time)
aws s3 mb s3://company-tfstate-prod
aws s3api put-bucket-versioning \
  --bucket company-tfstate-prod \
  --versioning-configuration Status=Enabled

aws dynamodb create-table \
  --table-name terraform-state-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST

Result: A compute plan now refreshes only ~80 resources instead of 500. If refresh time scales roughly with resource count, plan time drops from 45 minutes to ~7 minutes (45 × 80/500 ≈ 7).
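
Before moving anything, it can help to draft which state each existing resource should land in. The sketch below reads resource addresses (for example from `terraform state list`) and labels each with one of the four components from the directory layout in this step. The type-to-component mapping and the function name `component_for` are illustrative assumptions, not part of Terraform; adjust the patterns to your own resources.

```shell
# Hypothetical helper: print "<component> <address>" for each address on stdin.
# The patterns are an assumed mapping based on the directory layout in this step.
component_for() {
  while IFS= read -r addr; do
    # Drop index suffixes and module prefixes to get the bare resource type.
    type="$(printf '%s\n' "$addr" \
      | sed -E 's/\[[^]]*\]//g; s/^(module\.[^.]+\.)+//; s/\..*$//')"
    case "$type" in
      aws_vpc|aws_subnet|aws_route_table*|aws_nat_gateway|aws_ec2_transit_gateway*)
        component=networking ;;
      aws_instance|aws_autoscaling_group|aws_launch_template|aws_ecs_*)
        component=compute ;;
      aws_db_*|aws_rds_*|aws_dynamodb_*)
        component=databases ;;
      aws_iam_*|aws_security_group*)
        component=security ;;
      *)
        component=unassigned ;;
    esac
    printf '%s %s\n' "$component" "$addr"
  done
}
```

Piping `terraform state list` through `component_for | sort` gives a reviewable draft of the split. Anything labeled `unassigned` needs a manual decision, and the actual move (for example with `terraform state mv`) still has to be done and verified per resource.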

Step 3: Short-Term Flags for Immediate Relief

While you’re migrating to split state, use these flags to speed up existing plans:

# Skip refresh if you trust the state is current (e.g., no manual console changes)
terraform plan -refresh=false

# Only plan resources in a specific module
terraform plan -target=module.ecs_service

# Increase parallelism (default is 10)
terraform plan -parallelism=30

Use -target Carefully

-target can leave your state partially applied. Use it only for emergency single-resource fixes, never as a routine workflow. It masks real state drift.
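
If the parallelism setting should apply to every plan in a shell session or CI job, one option is Terraform's `TF_CLI_ARGS_plan` environment variable, whose value is inserted after the `plan` subcommand on each run. A minimal sketch; the value 30 is carried over from the example above and is not a tuned recommendation, since higher parallelism can also trigger more AWS API throttling.

```shell
# Inserted after "plan" on every `terraform plan` run in this shell.
# Flags typed on the command line come later and take precedence.
export TF_CLI_ARGS_plan="-parallelism=30"
```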

Step 4: Terragrunt — DRY Configurations and Parallel Runs

Terragrunt wraps Terraform and solves two problems: repetitive backend config, and serial module applies.

# terragrunt.hcl (root)
locals {
  # Environment name used in the bucket name below; assumed to be set once per environment root
  env = "prod"
}

remote_state {
  backend = "s3"
  config = {
    bucket         = "company-tfstate-${local.env}"
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}

# environments/prod/compute/terragrunt.hcl
include "root" {
  path = find_in_parent_folders()
}

dependency "networking" {
  config_path = "../networking"
}

inputs = {
  vpc_id = dependency.networking.outputs.vpc_id
}

# Plan ALL modules in parallel (respects dependency order)
terragrunt run-all plan

# Apply all modules in parallel
terragrunt run-all apply

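Result: Independent modules are planned concurrently, so the total is bounded by the slowest dependency chain rather than the sum of all modules. What was 45 minutes serial becomes ~10 minutes parallel.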
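
As a rough illustration of why dependency order matters for the total, the sketch below compares the serial sum with the longest dependency chain. The per-module minutes are hypothetical numbers, not measurements from this scenario, and the model assumes at most one dependency per module, dependencies listed before their dependents, and no API throttling. The name `estimate_run_all` is made up for this example.

```shell
# Hypothetical estimator. Input lines: "<module> <minutes> <dependency or ->".
estimate_run_all() {
  awk '
    {
      name = $1; minutes = $2; dep = $3
      # A module starts when its dependency finishes (or at 0 if it has none).
      start = (dep == "-" ? 0 : finish[dep])
      finish[name] = start + minutes
      serial += minutes
      if (finish[name] > parallel) parallel = finish[name]
    }
    END { printf "serial=%d parallel=%d\n", serial, parallel }
  '
}

# Example with made-up durations:
#   printf '%s\n' 'networking 5 -' 'security 3 -' \
#     'databases 8 networking' 'compute 7 networking' | estimate_run_all
#   prints: serial=23 parallel=13
```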

Step 5: Use data Sources for Cross-State References

Instead of hard-coding values from other modules, use the terraform_remote_state data source to read another state's outputs. Only values the other module declares as outputs are visible:

# In compute/main.tf — reference networking state
data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "company-tfstate-prod"
    key    = "prod/networking/terraform.tfstate"
    region = "us-east-1"
  }
}

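resource "aws_autoscaling_group" "app" {
  # Subnet IDs come from an output declared in the networking module
  vpc_zone_identifier = data.terraform_remote_state.networking.outputs.private_subnet_ids
  # ... min_size, max_size, launch template, and other arguments omitted
}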
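
A remote-state reference only resolves if the producing module actually declares the output. One way to check this before wiring up consumers is to inspect `terraform output -json` in the networking directory. The helper below is a rough sketch: `missing_outputs` is a hypothetical name, and it uses a plain text match rather than a JSON parser, so a nested key with the same name would also count as present.

```shell
# Hypothetical helper: read `terraform output -json` on stdin and print each
# requested output name that does not appear as a key.
missing_outputs() (
  json="$(cat)"
  for name in "$@"; do
    printf '%s' "$json" | grep -q "\"$name\"[[:space:]]*:" || echo "$name"
  done
)
```

For example, `terraform output -json | missing_outputs vpc_id private_subnet_ids` prints nothing when both outputs exist and lists the missing names otherwise.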

Step 6: CI/CD — Automated Plans on Every PR

Move terraform plan out of developer laptops and into CodeBuild:

# buildspec-tf-plan.yml
version: 0.2
phases:
  install:
    commands:
      - wget https://releases.hashicorp.com/terraform/1.7.0/terraform_1.7.0_linux_amd64.zip
      - unzip terraform_1.7.0_linux_amd64.zip && mv terraform /usr/local/bin/
  build:
    commands:
      - cd environments/prod/compute
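      - terraform init -input=false
      # Run plan without a pipe so a failed plan fails the build
      - terraform plan -input=false -no-color -out=plan.tfplan
      # Render the saved plan as text for reviewers
      - terraform show -no-color plan.tfplan > plan.txt
      - cat plan.txt
artifacts:
  files:
    - environments/prod/compute/plan.tfplan
    - environments/prod/compute/plan.txt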

Post the plan output as a GitHub PR comment so reviewers see exactly what will change before approving.
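
One way to prepare that comment is to wrap the plan text in a Markdown code fence and cap its size, since very long plans can exceed what a single comment accepts. This is a sketch under assumptions: `format_plan_comment` is a hypothetical helper, and the 60000-byte default is a conservative guess rather than a documented limit.

```shell
# Hypothetical helper: read plan text on stdin, print a Markdown comment body.
# Optional first argument: maximum number of plan bytes to include.
format_plan_comment() (
  max_bytes="${1:-60000}"
  # Octal 140 is a backtick; three of them form the Markdown code fence.
  fence="$(printf '\140\140\140')"
  printf '### Terraform plan\n\n%s\n' "$fence"
  head -c "$max_bytes"
  printf '\n%s\n' "$fence"
)
```

In the build, `format_plan_comment < plan.txt > comment.md` produces the body. It can then be posted with whatever client the build image provides, for example GitHub CLI's `gh pr comment` with `--body-file comment.md`, assuming `gh` is installed and authenticated.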

Summary

| Problem | Solution | Time Saved |
|---|---|---|
| 500-resource monolith | Split into 4-5 independent state files | 45 min → 7 min per module |
| Serial module applies | Terragrunt run-all for parallelism | 3× faster |
| Expensive refresh | -refresh=false for trusted state | 60% faster per plan |
| No locking | S3 + DynamoDB backend | Eliminates conflicts |
| Manual plans | CodeBuild on every PR | Consistent, auditable |

Interview Angle

Mention the organizational impact: splitting state is also a blast radius reduction strategy, not just a performance fix. A bad apply in networking/ can no longer accidentally delete an RDS resource in databases/. Interviewers want to hear you think about failure modes, not just speed.

Services Used

S3, DynamoDB, CodeBuild, IAM

Prerequisites

  • Experience working with Terraform state files
  • Understanding of Terraform backends and workspaces

What You Learned

  • Why a monolithic Terraform state file slows plans to a crawl
  • How to break state into independently deployable modules
  • How to use Terragrunt for DRY multi-module orchestration
  • Key Terraform flags that speed up individual plans
  • How to set up S3 + DynamoDB remote state with locking
