Handling Terraform State Drift in Production
AWS resources changed outside of Terraform — via Console, CLI, or another tool. Detect the drift, reconcile state, and prevent it from happening again.
A developer manually changed an RDS parameter group through the AWS Console last week to fix a production issue. Terraform doesn't know about it. Today, a routine `terraform apply` is about to revert that change and remove the parameter setting that's keeping the app stable. How do you detect drift before it causes damage, fix it properly, and prevent it from happening again?
The Problem
Drift is when your live AWS infrastructure no longer matches what Terraform’s state file describes. It happens when:
- Engineers make emergency changes via the Console
- Another team’s automation modifies shared resources
- AWS makes service-side changes (e.g., auto-updating default parameter groups)
- A Terraform apply partially succeeds and leaves orphaned resources
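Drift stays invisible until Terraform next refreshes against live AWS during a plan or apply. If that first refresh happens inside an unreviewed terraform apply, it may already be too late.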
Step 1: Detect Drift — terraform plan With Refresh
The most direct way to see drift is a plan with refresh enabled:
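# Refresh reads live AWS state; the plan then compares it to your .tf files
# (-refresh=true is the default and is spelled out here for clarity)
terraform plan -refresh=true -detailed-exitcode
# Exit codes:
# 0 = no changes (no drift)
# 1 = error
# 2 = changes present (drift, or .tf edits that have not been applied yet)
echo "Exit code: $?"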
# For a specific resource only
terraform plan -target=aws_db_parameter_group.prod -refresh=true
The plan output shows exactly what Terraform would change. Before running apply, confirm whether each proposed change is drift (the live resource was modified outside Terraform) or a genuine desired change (the .tf file was edited and has not been applied yet).
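Telling drift apart from a desired change can be partly automated by inspecting the plan as JSON. The following is a minimal sketch, assuming the plan was saved with `terraform plan -out=plan.tfplan` and rendered with `terraform show -json plan.tfplan`, and that the Terraform version in use reports out-of-band changes under the `resource_drift` key. The function name `summarize_plan` and the sample document are illustrative and do not come from a real run.

```python
import json


def summarize_plan(plan_json: str) -> dict:
    """Split a JSON plan into drifted resources and planned actions."""
    plan = json.loads(plan_json)

    # Resources whose live attributes changed outside Terraform.
    drifted = [rc["address"] for rc in plan.get("resource_drift", [])]

    # Resources Terraform intends to act on (anything other than a no-op).
    planned = {
        rc["address"]: rc["change"]["actions"]
        for rc in plan.get("resource_changes", [])
        if rc["change"]["actions"] != ["no-op"]
    }
    return {"drifted": drifted, "planned": planned}


# Hypothetical, hand-written sample shaped like `terraform show -json` output.
SAMPLE_PLAN = json.dumps({
    "resource_drift": [
        {"address": "aws_db_parameter_group.prod",
         "change": {"actions": ["update"]}},
    ],
    "resource_changes": [
        {"address": "aws_db_parameter_group.prod",
         "change": {"actions": ["update"]}},
        {"address": "aws_s3_bucket.logs",
         "change": {"actions": ["no-op"]}},
    ],
})

if __name__ == "__main__":
    print(summarize_plan(SAMPLE_PLAN))
```

A resource that appears in both lists is one where applying would overwrite an out-of-band change, which is the case that deserves a human review before apply.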
Step 2: Automated Drift Detection Pipeline
Running manual plans isn’t scalable. Set up a scheduled drift detection job:
# EventBridge rule: run drift detection every 6 hours
# ↓ triggers CodeBuild project ↓ which runs terraform plan
# buildspec-drift-check.yml
version: 0.2
phases:
  build:
    commands:
      - |
        for module in networking compute databases security; do
          cd "environments/prod/$module"
          terraform init -input=false
          terraform plan -refresh=true -detailed-exitcode -no-color > "/tmp/$module-plan.txt" 2>&1
          EXIT=$?
          if [ "$EXIT" -eq 2 ]; then
            echo "DRIFT DETECTED in $module module"
            aws sns publish \
              --topic-arn "$DRIFT_ALERT_TOPIC" \
              --subject "Terraform Drift: prod/$module" \
              --message "$(cat "/tmp/$module-plan.txt")"
          elif [ "$EXIT" -eq 1 ]; then
            # Exit code 1 means the plan itself failed; surface it in the build log
            echo "PLAN FAILED in $module module"
          fi
          cd ../../../
        done
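Plan output for a large module can exceed what SNS accepts in one message (256 KB), and a subject is limited to 100 characters. The sketch below shows one way to turn a plan exit code and its output into a publishable alert. It is an illustration under assumptions: the function name `build_drift_alert`, the truncation notice, and the choice to alert on exit code 1 as well are not part of the original pipeline.

```python
from typing import Optional

SNS_MAX_MESSAGE_BYTES = 256 * 1024  # SNS per-message size limit
SNS_MAX_SUBJECT_CHARS = 100         # SNS subject length limit
TRUNCATION_NOTICE = "\n[plan output truncated]"


def build_drift_alert(module: str, exit_code: int, plan_text: str) -> Optional[dict]:
    """Return Subject/Message for an alert, or None when the plan is clean."""
    if exit_code == 0:
        return None  # no changes, nothing to report

    if exit_code == 2:
        subject = f"Terraform Drift: prod/{module}"
    else:
        # Exit code 1: the plan itself failed (assumed to be alert-worthy too).
        subject = f"Terraform plan FAILED: prod/{module}"

    body = plan_text.encode("utf-8")
    if len(body) > SNS_MAX_MESSAGE_BYTES:
        budget = SNS_MAX_MESSAGE_BYTES - len(TRUNCATION_NOTICE.encode("utf-8"))
        # errors="ignore" drops a multi-byte character cut in half by the slice.
        message = body[:budget].decode("utf-8", errors="ignore") + TRUNCATION_NOTICE
    else:
        message = plan_text

    return {"Subject": subject[:SNS_MAX_SUBJECT_CHARS], "Message": message}


if __name__ == "__main__":
    alert = build_drift_alert("databases", 2, "~ parameter changed outside Terraform")
    print(alert["Subject"])
```

The returned dictionary uses the same key names as the `Subject` and `Message` arguments of boto3's `sns.publish`, so it could be passed along with a `TopicArn`.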
AWS Config Rule for resource-level detection:
# Lambda-backed Config rule: flag resources that lack the Terraform ownership tag
def evaluate_compliance(configuration_item, rule_parameters):
    # Check whether the resource was created outside Terraform.
    # Assumes the provider's default_tags adds "ManagedBy=Terraform" (see Step 5);
    # Terraform does not add this tag on its own.
    # Config exposes tags as a top-level key/value map on the configuration item.
    tags = configuration_item.get('tags') or {}
    if tags.get('ManagedBy') != 'Terraform':
        return 'NON_COMPLIANT'
    return 'COMPLIANT'
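The compliance check needs a Lambda entry point around it that unpacks the Config event and builds the evaluation record. The following is a hedged sketch of that wrapper, assuming a configuration-change-triggered custom rule whose event carries `invokingEvent` as a JSON string, and assuming tags are available as the item's top-level `tags` map. The name `build_evaluation` and the sample event are hypothetical.

```python
import json


def evaluate_compliance(configuration_item: dict) -> str:
    """COMPLIANT only when the resource carries ManagedBy=Terraform."""
    tags = configuration_item.get("tags") or {}
    return "COMPLIANT" if tags.get("ManagedBy") == "Terraform" else "NON_COMPLIANT"


def build_evaluation(event: dict) -> dict:
    """Turn a Config rule event into one evaluation record."""
    invoking_event = json.loads(event["invokingEvent"])
    item = invoking_event["configurationItem"]

    # Deleted resources cannot be compliant or non-compliant.
    if item.get("configurationItemStatus") in ("ResourceDeleted",
                                               "ResourceDeletedNotRecorded"):
        compliance = "NOT_APPLICABLE"
    else:
        compliance = evaluate_compliance(item)

    return {
        "ComplianceResourceType": item["resourceType"],
        "ComplianceResourceId": item["resourceId"],
        "ComplianceType": compliance,
        "OrderingTimestamp": item["configurationItemCaptureTime"],
    }


# Hypothetical event, trimmed to the fields this sketch reads.
SAMPLE_EVENT = {
    "invokingEvent": json.dumps({
        "configurationItem": {
            "resourceType": "AWS::RDS::DBInstance",
            "resourceId": "db-EXAMPLE",
            "configurationItemStatus": "OK",
            "configurationItemCaptureTime": "2024-01-01T00:00:00.000Z",
            "tags": {"Environment": "production"},
        }
    }),
    "resultToken": "hypothetical-token",
}

if __name__ == "__main__":
    print(build_evaluation(SAMPLE_EVENT)["ComplianceType"])
```

In a deployed function, the record would be reported back with boto3's `config.put_evaluations(Evaluations=[...], ResultToken=event["resultToken"])`; that call is left out here so the sketch runs without AWS credentials.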
Step 3: Reconcile Drift — Import or Update
Once drift is detected, you have two options:
Option A: Import the Drift Into State
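Use this when the manual change was correct and you want Terraform to adopt it. Note that terraform import applies only to resources that are not yet in state (for example, a parameter group created by hand). For a resource Terraform already manages, skip the import and go straight to updating the .tf file:
# Import a manually created RDS parameter group that is not yet in state
# (fails with "Resource already managed by Terraform" if it is already tracked)
terraform import aws_db_parameter_group.prod my-prod-pg14
# Now update your .tf file to match the actual parameters
# (terraform plan should show no changes after)
terraform plan # verify: 0 changes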
Option B: Override Drift — Apply Back to Desired State
Use this when the manual change was wrong or unauthorized:
# Review exactly what will change
terraform plan -refresh=true
# Apply to bring infrastructure back to the declared state
terraform apply -refresh=true -target=aws_db_parameter_group.prod
Warning: using terraform apply to fix drift can destroy legitimate configuration. Always read the plan output carefully. If in doubt, import the change into Terraform first, discuss with the team, then decide whether to keep or revert it.
Step 4: terraform refresh — Sync State Only
If you want to update Terraform’s state file to match reality without making any changes:
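# Update state to match real AWS (doesn't touch infrastructure)
terraform refresh
# On Terraform v0.15.4 and later, terraform refresh is deprecated;
# prefer the reviewable equivalent: terraform apply -refresh-only
# Then check what your .tf files declare vs the refreshed state
terraform plan # now shows what WOULD change if you applied
This is useful when another team legitimately owns part of a shared resource. Keep in mind that the next plan will still propose reverting their changes unless your .tf files match them or the affected attributes are listed under lifecycle ignore_changes.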
Step 5: Prevent Drift — IaC-Only Enforcement
Tag every Terraform-managed resource, then use AWS Config to alert on resources that lack the tag:
# In your Terraform provider configuration
provider "aws" {
  default_tags {
    tags = {
      ManagedBy   = "Terraform"
      Environment = var.environment
      Module      = var.module_name
    }
  }
}
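Whether the default tags actually reach every resource can be checked before apply by scanning a JSON plan. This is a minimal sketch, assuming the plan was rendered with `terraform show -json` and that the AWS provider exposes the merged tags of taggable resources as `tags_all` in each resource's planned `after` values. The function name `untagged_resources` and the sample data are hypothetical.

```python
import json


def untagged_resources(plan_json: str,
                       required_tag: str = "ManagedBy",
                       required_value: str = "Terraform") -> list:
    """List planned resources whose merged tags lack the ownership tag."""
    plan = json.loads(plan_json)
    missing = []
    for rc in plan.get("resource_changes", []):
        after = rc["change"].get("after") or {}  # None when being destroyed
        if "tags_all" not in after:
            # Not taggable, being destroyed, or tags unknown until apply.
            continue
        tags = after["tags_all"] or {}
        if tags.get(required_tag) != required_value:
            missing.append(rc["address"])
    return missing


# Hypothetical, hand-written sample shaped like `terraform show -json` output.
SAMPLE_PLAN = json.dumps({
    "resource_changes": [
        {"address": "aws_db_instance.prod",
         "change": {"actions": ["update"],
                    "after": {"tags_all": {"ManagedBy": "Terraform"}}}},
        {"address": "aws_s3_bucket.scratch",
         "change": {"actions": ["create"],
                    "after": {"tags_all": {"Owner": "data-team"}}}},
        {"address": "aws_iam_policy_attachment.legacy",
         "change": {"actions": ["delete"], "after": None}},
    ],
})

if __name__ == "__main__":
    print(untagged_resources(SAMPLE_PLAN))
```

A check like this could fail a pull request build when the list is non-empty, for example when a resource is created through an aliased provider block that does not set `default_tags`.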
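SCP to deny modifications to tagged production resources by any principal other than the Terraform pipeline role. An SCP cannot distinguish Console calls from CLI or SDK calls, so this blocks all of them. Because the exempted principal ARN contains a wildcard, the exemption uses ArnNotLike; StringNotEquals would compare the pattern literally and deny the pipeline role as well:
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyConsoleModifyTerraformResources",
    "Effect": "Deny",
    "Action": [
      "ec2:ModifyInstanceAttribute",
      "rds:ModifyDBInstance",
      "rds:ModifyDBParameterGroup"
    ],
    "Resource": "*",
    "Condition": {
      "StringEquals": {
        "aws:ResourceTag/ManagedBy": "Terraform",
        "aws:ResourceTag/Environment": "production"
      },
      "ArnNotLike": {
        "aws:PrincipalArn": [
          "arn:aws:iam::*:role/terraform-pipeline-role"
        ]
      }
    }
  }]
}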
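The way the conditions in this statement combine is easy to misread, so here is a deliberately simplified model of the intended behavior. It is not IAM's evaluation engine: it assumes every condition block must hold for the Deny to apply, and it approximates wildcard ARN matching (what a Like-style operator such as ArnNotLike provides; StringNotEquals compares literally) with `fnmatchcase`. The function name `is_denied` and the example ARNs are hypothetical.

```python
from fnmatch import fnmatchcase

DENIED_ACTIONS = {
    "ec2:ModifyInstanceAttribute",
    "rds:ModifyDBInstance",
    "rds:ModifyDBParameterGroup",
}
PIPELINE_ROLE_PATTERN = "arn:aws:iam::*:role/terraform-pipeline-role"


def is_denied(action: str, resource_tags: dict, principal_arn: str) -> bool:
    """Simplified model: does the SCP's Deny statement apply to this call?"""
    if action not in DENIED_ACTIONS:
        return False  # statement does not cover this action

    # Both tag conditions must match for the statement to apply.
    tags_match = (
        resource_tags.get("ManagedBy") == "Terraform"
        and resource_tags.get("Environment") == "production"
    )

    # Wildcard match on the caller's role ARN exempts the pipeline role.
    is_pipeline = fnmatchcase(principal_arn, PIPELINE_ROLE_PATTERN)

    return tags_match and not is_pipeline


if __name__ == "__main__":
    prod_tags = {"ManagedBy": "Terraform", "Environment": "production"}
    developer = "arn:aws:iam::123456789012:role/developer"
    print(is_denied("rds:ModifyDBParameterGroup", prod_tags, developer))
```

Resources without both tags fall outside the statement entirely, which is why consistent tagging is a precondition for this kind of enforcement.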
Drift Resolution Workflow
Drift Detected (plan shows changes)
              │
              ▼
Is the manual change correct/intentional?
        │                   │
       YES                  NO
        │                   │
        ▼                   ▼
terraform import      terraform apply
+ update .tf file     -target=<resource>
+ git commit          (reverts to declared state)
+ PR review
Summary
| Scenario | Command | When to Use |
|---|---|---|
| Check for drift | terraform plan -refresh=true | Before every apply |
| Adopt a manual change | terraform import + update .tf | Change was valid |
| Revert manual change | terraform apply -target=... | Change was wrong |
| Sync state only | terraform refresh (or terraform apply -refresh-only) | Another team owns the resource |
| Continuous monitoring | EventBridge + CodeBuild scheduled plan, AWS Config rule + Lambda | Catch drift automatically |
- How to detect drift between Terraform state and real AWS resources
- The `terraform import` workflow for reconciling state
- How to use AWS Config for continuous drift detection
- How to enforce IaC-only changes using SCPs and tagging