Terraform Plan Takes 45 Minutes — How to Fix It at Scale
Your team has 500+ Terraform resources and every plan takes 45 minutes, blocking deployments. Diagnose the root cause and implement state splitting, parallelism, and remote execution.
Your platform team manages 500+ AWS resources in a single Terraform root module. Every `terraform plan` now takes 45 minutes because Terraform refreshes every resource's state against live AWS APIs. Developers are sitting idle waiting for plans. Deployments are stacking up. The monolithic state file is also a blast-radius problem — one bad apply can affect everything.
The Problem
Terraform’s plan command calls AWS APIs to refresh the current state of every resource. With 500 resources, that’s 500+ API calls, and Terraform works through them only 10 at a time by default. At scale, a monolithic state file creates three compounding problems:
- Slow plans: every resource is refreshed every time, even if you only changed 1 resource
- High blast radius: a bad `apply` can touch any of the 500 resources
- Concurrency conflicts: only one engineer can run `apply` at a time; others are blocked
Step 1: Diagnose — Find Your Slow Resources
Before splitting, understand which providers and resources are slowest:
# Enable detailed logging to see per-resource refresh times
TF_LOG=TRACE terraform plan 2>&1 | grep -E "Refreshing|elapsed"
# Or time the refresh phase specifically
time terraform plan -refresh=true -out=plan.tfplan
Common culprits:
- `aws_instance` with `user_data`: reads EC2 metadata
- `aws_s3_bucket_policy`: S3 policy reads can be slow
- Hundreds of `aws_route53_record` resources
- `data` sources that make API calls on every plan
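Before reading trace logs, it can help to see which resource types dominate the state, since a type with hundreds of instances is a likely refresh hotspot. The sketch below is one way to count them from `terraform state list` output. The helper name `count_by_type` is hypothetical, and the `sed` pattern assumes module instance keys contain no dots.

```shell
#!/bin/sh
# Count resources per type from `terraform state list` output on stdin.
# Hypothetical helper; assumes module instance keys contain no dots.
count_by_type() {
  # Strip leading module.<name>. prefixes, then keep only the resource type.
  sed -E 's/^(module\.[^.]+\.)*//' |
    awk -F. '{ t = ($1 == "data") ? ("data." $2) : $1; n[t]++ }
             END { for (t in n) print n[t], t }' |
    sort -k1,1nr -k2,2
}

# Usage (run inside the root module):
#   terraform state list | count_by_type | head -20
```

Types near the top of the list are good candidates for the first state split.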
Step 2: Split State Into Independent Modules
Break the monolith along deployment boundaries — components that change independently:
environments/
├── prod/
│   ├── networking/   ← VPC, subnets, TGW (changes rarely)
│   │   ├── main.tf
│   │   └── backend.tf   ← separate state file
│   ├── compute/      ← EC2, ASG, ECS (changes frequently)
│   │   ├── main.tf
│   │   └── backend.tf
│   ├── databases/    ← RDS, DynamoDB (changes rarely)
│   │   ├── main.tf
│   │   └── backend.tf
│   └── security/     ← IAM, SGs (changes on demand)
│       ├── main.tf
│       └── backend.tf
└── staging/
    └── ...
Each module has its own S3 backend key:
# environments/prod/compute/backend.tf
terraform {
  backend "s3" {
    bucket         = "company-tfstate-prod"
    key            = "prod/compute/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}

# Set up S3 + DynamoDB backend (one-time)
aws s3 mb s3://company-tfstate-prod
aws s3api put-bucket-versioning \
  --bucket company-tfstate-prod \
  --versioning-configuration Status=Enabled
aws dynamodb create-table \
  --table-name terraform-state-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST
Result: A compute plan now refreshes only ~80 resources instead of 500. Plan time drops from 45 minutes to ~7 minutes.
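The ~7 minute figure is consistent with a simple assumption: plan time scales roughly linearly with the number of resources refreshed. The sketch below reproduces that arithmetic so you can plug in your own module sizes. The function name is hypothetical, and real plans will deviate because some resource types refresh far more slowly than others.

```shell
#!/bin/sh
# Rough estimate only: assumes plan time is proportional to resources refreshed.
# Arguments: <minutes for the full plan> <total resources> <resources in the split module>
estimate_plan_minutes() {
  awk -v m="$1" -v total="$2" -v part="$3" \
    'BEGIN { printf "%.1f\n", m * part / total }'
}

estimate_plan_minutes 45 500 80   # prints 7.2
```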
Step 3: Short-Term Flags for Immediate Relief
While you’re migrating to split state, use these flags to speed up existing plans:
# Skip refresh if you trust the state is current (e.g., no manual console changes)
terraform plan -refresh=false
# Only plan resources in a specific module
terraform plan -target=module.ecs_service
# Increase parallelism (default is 10)
terraform plan -parallelism=30
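If the team uses `-refresh=false` day to day, it is worth pairing it with a scheduled refresh-only run so that drift is still detected somewhere. The wrappers below are a minimal sketch of that pairing. The function names and the `TF_PARALLELISM` variable are hypothetical. With `-detailed-exitcode`, an exit status of 2 should indicate that the refresh found differences, but verify this on your Terraform version.

```shell
#!/bin/sh
# Hypothetical wrappers around the flags above; names and defaults are illustrative.

# Day-to-day plan: skip refresh and raise parallelism (override with TF_PARALLELISM).
plan_fast() {
  terraform plan -refresh=false -parallelism="${TF_PARALLELISM:-30}" "$@"
}

# Scheduled drift check: compare state with real infrastructure, propose no changes.
# With -detailed-exitcode, exit status 2 means differences were found.
drift_check() {
  terraform plan -refresh-only -detailed-exitcode -input=false "$@"
}
```

Running `drift_check` nightly in CI keeps the "trusted state" assumption behind `-refresh=false` honest.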
Warning: `-target` can leave your configuration only partially applied. Use it only for emergency single-resource fixes, never as a routine workflow. It masks real state drift.
Step 4: Terragrunt for DRY Configurations and Parallel Runs
Terragrunt wraps Terraform and solves two problems: repetitive backend config, and serial module applies.
# terragrunt.hcl (root)
locals {
  # Assumed value for illustration; derive it from the directory path
  # or an environment variable in practice
  env = "prod"
}

remote_state {
  backend = "s3"
  config = {
    bucket         = "company-tfstate-${local.env}"
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
  }
}

# environments/prod/compute/terragrunt.hcl
include "root" {
  path = find_in_parent_folders()
}

dependency "networking" {
  config_path = "../networking"
}

inputs = {
  vpc_id = dependency.networking.outputs.vpc_id
}
# Plan ALL modules in parallel (respects dependency order)
terragrunt run-all plan
# Apply all modules in parallel
terragrunt run-all apply
Result: What was 45 minutes serial becomes ~10 minutes parallel.
Step 5: Use data Sources for Cross-State References
Instead of hard-coding values from other modules, use Terraform data sources to read remote state:
# In compute/main.tf: reference networking state
data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "company-tfstate-prod"
    key    = "prod/networking/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_autoscaling_group" "app" {
  # private_subnet_ids must be declared as an output in the networking module
  vpc_zone_identifier = data.terraform_remote_state.networking.outputs.private_subnet_ids
  # ... min_size, max_size, launch_template and other arguments omitted
}
Step 6: CI/CD — Automated Plans on Every PR
Move `terraform plan` off developer laptops and into CodeBuild:
# buildspec-tf-plan.yml
version: 0.2
phases:
  install:
    commands:
      - wget https://releases.hashicorp.com/terraform/1.7.0/terraform_1.7.0_linux_amd64.zip
      - unzip terraform_1.7.0_linux_amd64.zip && mv terraform /usr/local/bin/
  build:
    commands:
      - cd environments/prod/compute
      - terraform init -input=false
      # Redirect instead of piping to tee: a pipe would hide a failing plan's exit code
      - terraform plan -input=false -out=plan.tfplan -no-color > plan.txt 2>&1 || (cat plan.txt; exit 1)
      - cat plan.txt
artifacts:
  files:
    - environments/prod/compute/plan.tfplan
Post the plan output as a GitHub PR comment so reviewers see exactly what will change before approving.
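One way to do this is to wrap the saved plan text in a Markdown code block and post it with the GitHub CLI. The sketch below assumes `gh` is installed and authenticated in the build environment and that your pipeline supplies a `PR_NUMBER` variable. The function name and the 60,000-byte cap are illustrative choices, intended to keep the comment under GitHub's length limit.

```shell
#!/bin/sh
# Wrap plan output in a Markdown code block for a PR comment.
# Arguments: <plan text file> [max bytes, default 60000]
# The size cap is an assumption, chosen to stay under GitHub's comment length limit.
format_plan_comment() {
  printf '### Terraform plan\n\n~~~\n'
  head -c "${2:-60000}" "$1"
  printf '\n~~~\n'
}

# Usage in CI (assumes gh is authenticated and PR_NUMBER is set by the pipeline):
#   format_plan_comment plan.txt > comment.md
#   gh pr comment "$PR_NUMBER" --body-file comment.md
```

Truncating from the top keeps the resource changes that appear first. If reviewers mostly want the final `Plan:` summary line, you may prefer `tail -c` instead.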
Summary
| Problem | Solution | Benefit |
|---|---|---|
| 500-resource monolith | Split into 4-5 independent state files | 45 min → 7 min per module |
| Serial module applies | Terragrunt run-all for parallelism | 45 min → ~10 min across all modules |
| Expensive refresh | -refresh=false for trusted state | 60% faster per plan |
| No locking | S3 + DynamoDB backend | Eliminates conflicts |
| Manual plans | CodeBuild on every PR | Consistent, auditable |
With split state, a bad `apply` in networking/ can no longer accidentally delete an RDS resource in databases/. Interviewers want to hear you think about failure modes, not just speed.
- Why a monolithic Terraform state file slows plans to a crawl
- How to break state into independently deployable modules
- How to use Terragrunt for DRY multi-module orchestration
- Key Terraform flags that speed up individual plans
- How to set up S3 + DynamoDB remote state with locking