Terraform Plan Takes 45 Minutes — How to Fix It at Scale

The Problem

Terraform’s plan command calls AWS APIs to refresh the current state of every resource. With 500 resources, that is 500+ API calls on every plan, and only 10 of them run concurrently by default. At scale, a monolithic state file creates three compounding problems:

Slow plans — every resource is refreshed every time, even if you only changed 1 resource
High blast radius — a bad apply can touch any of the 500 resources
Concurrency conflicts — only one engineer can run apply at a time; others are blocked

Step 1: Diagnose — Find Your Slow Resources

Before splitting, understand which providers and resources are slowest:

# Enable detailed logging to see per-resource refresh times
TF_LOG=TRACE terraform plan 2>&1 | grep -E "Refreshing|elapsed"

# Or time the refresh phase specifically (-refresh-only reconciles state without proposing config changes)
time terraform plan -refresh-only

Common culprits:

aws_instance with user_data — reads EC2 metadata
aws_s3_bucket_policy — S3 policy reads can be slow
Hundreds of aws_route53_record resources
data sources that make API calls on every plan
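Before chasing individual culprits, it can help to see which resource types dominate the state, since every managed resource costs at least one refresh call. The sketch below tallies `terraform state list` output by type. The helper name `count_by_type` and the sample addresses are illustrative assumptions, not taken from a real workspace.

```shell
#!/usr/bin/env bash
# Tally resources by type from `terraform state list` output read on stdin.
# count_by_type is a hypothetical helper name; the sample addresses are made up.
count_by_type() {
  # 1. strip leading module paths (module.name, module.name[0], module.name["key"])
  # 2. drop data sources  3. keep the resource type  4. count, largest first
  sed -E 's/^(module\.[^.[]+(\[[^]]*\])?\.)+//' \
    | grep -v '^data\.' \
    | cut -d. -f1 \
    | sort | uniq -c | sort -rn \
    | awk '{ print $2, $1 }'
}

# Demo on sample input. Against a real workspace: terraform state list | count_by_type
count_by_type <<'EOF'
aws_instance.web
module.app["blue.green"].aws_instance.api[0]
module.dns.aws_route53_record.a
module.dns.aws_route53_record.b
module.dns.aws_route53_record.c["x.example.com"]
data.aws_ami.ubuntu
EOF
```

A type with hundreds of entries at the top of this list is usually the first candidate to move into its own state file.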

Step 2: Split State Into Independent Modules

Break the monolith along deployment boundaries — components that change independently:

environments/
├── prod/
│   ├── networking/       ← VPC, subnets, TGW — changes rarely
│   │   ├── main.tf
│   │   └── backend.tf    ← separate state file
│   ├── compute/          ← EC2, ASG, ECS — changes frequently
│   │   ├── main.tf
│   │   └── backend.tf
│   ├── databases/        ← RDS, DynamoDB — changes rarely
│   │   ├── main.tf
│   │   └── backend.tf
│   └── security/         ← IAM, SGs — changes on demand
│       ├── main.tf
│       └── backend.tf
└── staging/
    └── ...
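To decide which resources land in which directory, one option is to draft the assignment from the existing state before moving anything. The following is a minimal sketch: the type-to-component mapping in `component_for` is an assumption based on the comments in the tree above, and both function names are hypothetical. Adjust the patterns to your own boundaries.

```shell
#!/usr/bin/env bash
# Draft a split plan: map each resource address to a target component.
# The mapping below is an assumed example, not a rule Terraform enforces.
component_for() {
  case "$1" in
    aws_vpc|aws_subnet*|aws_route_table*|aws_ec2_transit_gateway*) echo networking ;;
    aws_instance|aws_autoscaling_*|aws_ecs_*|aws_launch_template)  echo compute ;;
    aws_db_*|aws_rds_*|aws_dynamodb_*)                             echo databases ;;
    aws_iam_*|aws_security_group*)                                 echo security ;;
    *)                                                             echo unassigned ;;
  esac
}

# Reads `terraform state list` output on stdin, prints "<component><TAB><address>".
plan_split() {
  local addr bare
  while IFS= read -r addr; do
    # Strip module prefixes so the resource type is the first dotted segment.
    bare=$(printf '%s\n' "$addr" | sed -E 's/^(module\.[^.[]+(\[[^]]*\])?\.)+//')
    case "$bare" in data.*) continue ;; esac   # data sources are not moved
    printf '%s\t%s\n' "$(component_for "${bare%%.*}")" "$addr"
  done
}

# Demo on made-up addresses.
plan_split <<'EOF'
aws_vpc.main
module.app.aws_ecs_service.api
aws_db_instance.orders
aws_iam_role.ci
aws_sqs_queue.jobs
EOF
```

Anything reported as `unassigned` needs a deliberate decision before the split, which is easier to make on paper than halfway through a state migration.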

Each module has its own S3 backend key:

# environments/prod/compute/backend.tf
terraform {
  backend "s3" {
    bucket         = "company-tfstate-prod"
    key            = "prod/compute/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}

# Set up S3 + DynamoDB backend (one-time)
aws s3 mb s3://company-tfstate-prod
aws s3api put-bucket-versioning \
  --bucket company-tfstate-prod \
  --versioning-configuration Status=Enabled

aws dynamodb create-table \
  --table-name terraform-state-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST

Result: A compute plan now refreshes only ~80 resources instead of 500. Plan time drops from 45 minutes to ~7 minutes.
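The ~7 minute figure is roughly what you get if plan time scales linearly with the number of resources refreshed. That is an assumption: it ignores slow resource types and API throttling, so treat it as a first estimate only. The arithmetic, with a hypothetical helper `estimate_minutes`:

```shell
#!/usr/bin/env bash
# Rough estimate only: assumes plan time is proportional to resource count.
# $1 = current plan minutes, $2 = total resources, $3 = resources in the split module
estimate_minutes() {
  awk -v t="$1" -v n="$2" -v m="$3" 'BEGIN { printf "%.1f\n", t * m / n }'
}

estimate_minutes 45 500 80   # prints 7.2
```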

Step 3: Short-Term Flags for Immediate Relief

While you’re migrating to split state, use these flags to speed up existing plans:

# Skip refresh if you trust the state is current (e.g., no manual console changes)
terraform plan -refresh=false

# Only plan resources in a specific module
terraform plan -target=module.ecs_service

# Increase parallelism (default is 10)
terraform plan -parallelism=30

Use -target Carefully

-target can leave your state partially applied. Use it only for emergency single-resource fixes, never as a routine workflow. It masks real state drift.
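One way to keep -target out of routine workflows is a thin wrapper that refuses it unless someone opts in explicitly. This is a sketch under assumptions: `tf_plan_guard`, `ALLOW_TARGET`, and `TF_BIN` are hypothetical names, not Terraform features.

```shell
#!/usr/bin/env bash
# Refuse -target unless ALLOW_TARGET=1 is set (hypothetical opt-in variable).
# TF_BIN is a hypothetical override for the binary to run; it defaults to terraform.
tf_plan_guard() {
  local arg
  for arg in "$@"; do
    case "$arg" in
      -target|-target=*|--target|--target=*)
        if [ "${ALLOW_TARGET:-0}" != "1" ]; then
          echo "refusing -target: set ALLOW_TARGET=1 for an emergency fix" >&2
          return 1
        fi
        ;;
    esac
  done
  # No -target present, or it was explicitly allowed: run the real plan.
  "${TF_BIN:-terraform}" plan "$@"
}

# Usage:
#   tf_plan_guard -refresh=false
#   ALLOW_TARGET=1 tf_plan_guard -target=module.ecs_service
```

The opt-in makes a targeted run a visible, deliberate act rather than a habit.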

Step 4: Terragrunt — DRY Configurations and Parallel Runs

Terragrunt wraps Terraform and solves two problems: repetitive backend config, and serial module applies.

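# terragrunt.hcl (root; assumed to live in environments/ so keys match Step 2)
locals {
  # Assumption: child paths look like <env>/<component>, so the first segment is the environment
  env = split("/", path_relative_to_include())[0]
}

remote_state {
  backend = "s3"
  config = {
    bucket         = "company-tfstate-${local.env}"
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}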

# environments/prod/compute/terragrunt.hcl
include "root" {
  path = find_in_parent_folders()
}

dependency "networking" {
  config_path = "../networking"
}

inputs = {
  vpc_id = dependency.networking.outputs.vpc_id
}

# Plan ALL modules in parallel (respects dependency order)
terragrunt run-all plan

# Apply all modules in parallel
terragrunt run-all apply

Result: What was 45 minutes serial becomes ~10 minutes parallel.
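To make "respects dependency order" concrete, the ordering constraint can be sketched as a topological sort over the dependency edges. This uses the standard `tsort` utility purely as an illustration and is not how Terragrunt is implemented. Only the networking → compute edge comes from the config above; the other edges and the name `run_order` are assumptions.

```shell
#!/usr/bin/env bash
# Each input line is "<dependency> <dependent>".
# Only "networking compute" comes from the article; the rest is assumed.
run_order() {
  # tsort prints one valid order in which every dependency precedes its dependents.
  tsort
}

run_order <<'EOF'
networking compute
networking databases
security compute
EOF
```

Modules with no path between them in this graph (for example networking and security) have no ordering constraint, which is where the parallel speedup comes from. The wall-clock time is bounded by the slowest dependency chain rather than the sum of all modules.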

Step 5: Use `data` Sources for Cross-State References

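Instead of hard-coding values from other modules, use the terraform_remote_state data source to read them from the other module’s state. The producing module must declare each shared value as an output: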

# In compute/main.tf — reference networking state
data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "company-tfstate-prod"
    key    = "prod/networking/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_autoscaling_group" "app" {
  vpc_zone_identifier = data.terraform_remote_state.networking.outputs.private_subnet_ids
  # min_size, max_size, and the launch template are omitted here for brevity
}

Step 6: CI/CD — Automated Plans on Every PR

Move terraform plan off developer laptops and into CodeBuild:

# buildspec-tf-plan.yml
version: 0.2
phases:
  install:
    commands:
      - wget https://releases.hashicorp.com/terraform/1.7.0/terraform_1.7.0_linux_amd64.zip
      - unzip terraform_1.7.0_linux_amd64.zip && mv terraform /usr/local/bin/
  build:
    commands:
      - cd environments/prod/compute
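      - terraform init -input=false
      # Plan first, then render the saved plan; piping plan into tee would hide a failed plan's exit code
      - terraform plan -out=plan.tfplan -input=false -no-color
      - terraform show -no-color plan.tfplan > plan.txt
artifacts:
  files:
    - environments/prod/compute/plan.tfplan
    - environments/prod/compute/plan.txt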

Post the plan output as a GitHub PR comment so reviewers see exactly what will change before approving.
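One way to prepare that comment is to wrap the plan text in a code fence and cap its size, since very large plans can exceed what GitHub accepts in a single comment. This is a sketch: `build_comment` is a hypothetical helper, the 60000-byte default is an assumed safety margin, and the posting step assumes the GitHub CLI (`gh`) is installed and authenticated in the build image.

```shell
#!/usr/bin/env bash
# Build a Markdown comment body from a plan text file.
# $1 = path to plan text, $2 = max bytes of plan text to include (assumed default: 60000)
build_comment() {
  local plan_file=$1 max=${2:-60000} size
  size=$(( $(wc -c < "$plan_file") ))
  printf '### Terraform plan\n\n~~~\n'
  head -c "$max" "$plan_file"
  printf '\n~~~\n'
  if [ "$size" -gt "$max" ]; then
    printf '\n_Plan truncated to %s bytes; see the build log for the full output._\n' "$max"
  fi
}

# Usage (PR_NUMBER is a hypothetical variable your pipeline would need to supply):
#   build_comment plan.txt > comment.md
#   gh pr comment "$PR_NUMBER" --body-file comment.md
```

Tilde fences are used so that backticks inside the plan output cannot close the block early.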

Summary

Problem	Solution	Benefit
500-resource monolith	Split into 4-5 independent state files	45 min → ~7 min per module
Serial module applies	Terragrunt `run-all` for parallelism	45 min → ~10 min across all modules
Expensive refresh	`-refresh=false` for trusted state	Skips the refresh phase entirely
Concurrent applies	S3 + DynamoDB backend locking	Prevents simultaneous writes to the same state
Manual plans	CodeBuild on every PR	Consistent, auditable

Interview Angle

Mention the organizational impact: splitting state is also a blast radius reduction strategy, not just a performance fix. A bad apply in networking/ can no longer accidentally delete an RDS resource in databases/. Interviewers want to hear you think about failure modes, not just speed.
