Handling Terraform State Drift in Production

The Problem

Drift is when your live AWS infrastructure no longer matches what Terraform’s state file describes. It happens when:

Engineers make emergency changes via the Console
Another team’s automation modifies shared resources
AWS makes service-side changes (e.g., auto-updating default parameter groups)
A Terraform apply partially succeeds and leaves orphaned resources

Drift stays invisible until something refreshes state, which in practice means the next terraform plan or apply. If the first refresh happens during an apply, it may be too late.

Step 1: Detect Drift — `terraform plan` With Refresh

The most direct way to see drift is a plan with refresh enabled:

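# Refresh (on by default; spelled out here for clarity) reads live AWS state,
# and the plan compares the refreshed state to your .tf files
terraform plan -refresh=true -detailed-exitcode

# Exit codes:
# 0 = no changes (no drift)
# 1 = error
# 2 = changes present (drift, or .tf edits that have not been applied yet)
echo "Exit code: $?"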

# For a specific resource only
terraform plan -target=aws_db_parameter_group.prod -refresh=true

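The plan output shows exactly what Terraform would change. Before running apply, confirm whether each change is drift (the live resource differs from what the .tf file declares) or an intended configuration change that simply has not been applied yet. On Terraform 0.15.4 and later, terraform plan -refresh-only reports only the changes made outside Terraform.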
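To separate drift from pending configuration changes programmatically, one option is to inspect the plan's JSON form. The sketch below assumes the plan was saved with `terraform plan -out=tfplan` and rendered with `terraform show -json tfplan`, and that the Terraform version in use emits the `resource_drift` and `resource_changes` keys. The function name `summarize_plan` and the sample data are hypothetical.

```python
import json


def summarize_plan(plan: dict) -> dict:
    """Split a Terraform JSON plan into drifted resources and planned actions."""
    # resource_drift: changes Terraform detected that were made outside its own runs
    drifted = [item["address"] for item in plan.get("resource_drift", [])]

    # resource_changes: what an apply would do; pure no-ops are skipped
    planned = {
        item["address"]: item["change"]["actions"]
        for item in plan.get("resource_changes", [])
        if item["change"]["actions"] != ["no-op"]
    }
    return {"drifted": drifted, "planned": planned}


if __name__ == "__main__":
    # Hypothetical, heavily trimmed sample of `terraform show -json tfplan` output
    sample = json.loads("""
    {
      "resource_drift": [
        {"address": "aws_db_parameter_group.prod",
         "change": {"actions": ["update"]}}
      ],
      "resource_changes": [
        {"address": "aws_db_parameter_group.prod",
         "change": {"actions": ["update"]}},
        {"address": "aws_instance.web",
         "change": {"actions": ["no-op"]}}
      ]
    }
    """)
    print(summarize_plan(sample))
```

A resource that appears under "planned" but not under "drifted" is most likely an unapplied .tf edit rather than drift.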

Step 2: Automated Drift Detection Pipeline

Running manual plans isn’t scalable. Set up a scheduled drift detection job:

# EventBridge rule: run drift detection every 6 hours
# ↓ triggers CodeBuild project ↓ which runs terraform plan

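# buildspec-drift-check.yml
version: 0.2
phases:
  build:
    commands:
      - |
        for module in networking compute databases security; do
          # Subshell keeps each cd local, so a failed module cannot break the next path
          (
            cd "environments/prod/$module" || exit 1
            terraform init -input=false || exit 1
            terraform plan -refresh=true -detailed-exitcode -no-color > "/tmp/$module-plan.txt" 2>&1
          )
          EXIT=$?
          if [ "$EXIT" -eq 2 ]; then
            echo "DRIFT DETECTED in $module module"
            # SNS rejects messages over 256 KB, so send only the head of large plans
            aws sns publish \
              --topic-arn "$DRIFT_ALERT_TOPIC" \
              --subject "Terraform Drift: prod/$module" \
              --message "$(head -c 200000 "/tmp/$module-plan.txt")"
          elif [ "$EXIT" -ne 0 ]; then
            # Exit code 1 means the check itself failed; do not report it as "no drift"
            echo "ERROR: drift check failed for $module (exit $EXIT)"
            FAILED=1
          fi
        done
        exit "${FAILED:-0}"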
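If the alert is published from code rather than the CLI, the SNS limits are worth handling explicitly: a subject may be at most 100 characters and a message at most 256 KB. The following is a minimal sketch of a helper that trims a plan to fit. `build_alert` and its constants are hypothetical names, not part of any AWS SDK.

```python
SNS_SUBJECT_MAX_CHARS = 100         # SNS subject length limit
SNS_MESSAGE_MAX_BYTES = 256 * 1024  # SNS message size limit (256 KB)


def build_alert(module: str, plan_text: str,
                max_bytes: int = SNS_MESSAGE_MAX_BYTES):
    """Return (subject, message) trimmed to fit within SNS limits."""
    subject = f"Terraform Drift: prod/{module}"[:SNS_SUBJECT_MAX_CHARS]

    encoded = plan_text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return subject, plan_text

    notice = "\n[plan output truncated]"
    keep = max_bytes - len(notice.encode("utf-8"))
    # errors="ignore" drops a multi-byte character cut in half at the boundary
    trimmed = encoded[:keep].decode("utf-8", errors="ignore")
    return subject, trimmed + notice
```

The returned pair could then be passed to boto3's `sns.publish(TopicArn=..., Subject=subject, Message=message)`.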

AWS Config Rule for resource-level detection:

# Lambda-backed Config rule: detect untagged or manually created resources
def evaluate_compliance(configuration_item, rule_parameters):
    # Check if resource was created outside Terraform
    # (assumes the provider's default_tags add "ManagedBy=Terraform" to every resource)
    # AWS Config exposes tags as a key/value map at the top level of the item
    tags = configuration_item.get('tags') or {}
    if tags.get('ManagedBy') != 'Terraform':
        return 'NON_COMPLIANT'
    return 'COMPLIANT'
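The function above only decides compliance; a custom rule also has to unpack the event AWS Config sends and report the verdict back. Here is a hedged sketch of that wiring, with a condensed copy of the tag check so the block is self-contained. `handle_event` is a hypothetical helper, and the client is passed in so the logic can be exercised without AWS credentials.

```python
import json


def evaluate_compliance(configuration_item, rule_parameters=None):
    # Condensed tag check: compliant only when ManagedBy=Terraform is present
    tags = configuration_item.get("tags") or {}
    return "COMPLIANT" if tags.get("ManagedBy") == "Terraform" else "NON_COMPLIANT"


def handle_event(event, config_client):
    """Evaluate one configuration change and report the result to AWS Config."""
    # invokingEvent arrives as a JSON string, not a nested object
    invoking_event = json.loads(event["invokingEvent"])
    item = invoking_event["configurationItem"]

    # Deleted resources cannot be compliant or non-compliant
    if item.get("configurationItemStatus") == "ResourceDeleted":
        compliance = "NOT_APPLICABLE"
    else:
        compliance = evaluate_compliance(item)

    config_client.put_evaluations(
        Evaluations=[{
            "ComplianceResourceType": item["resourceType"],
            "ComplianceResourceId": item["resourceId"],
            "ComplianceType": compliance,
            "OrderingTimestamp": item["configurationItemCaptureTime"],
        }],
        ResultToken=event["resultToken"],
    )
    return compliance


def lambda_handler(event, context):
    import boto3  # imported lazily so handle_event can be tested offline
    return handle_event(event, boto3.client("config"))
```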

Step 3: Reconcile Drift — Import or Update

Once drift is detected, you have two options:

Option A: Import the Drift Into State

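Use this when the manual change was correct and you want Terraform to adopt it. How you adopt it depends on whether the resource is already in state: terraform import only works for resources Terraform does not track yet, and it fails on an address that is already managed.

# Resource already in state but modified by hand (e.g. the RDS parameter group):
# edit the .tf file to match the live parameters, then accept the live values into state
terraform apply -refresh-only

# Resource created outside Terraform (not in state yet):
# add a resource block for it, then import it under that address
terraform import aws_db_parameter_group.prod my-prod-pg14

# Either way, keep adjusting the .tf file until the plan is clean
terraform plan  # verify: 0 changes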

Option B: Override Drift — Apply Back to Desired State

Use this when the manual change was wrong or unauthorized:

# Review exactly what will change
terraform plan -refresh=true

# Apply to bring infrastructure back to the declared state
terraform apply -refresh=true -target=aws_db_parameter_group.prod

Always Review Before Applying

Blindly running terraform apply to fix drift can destroy legitimate configuration. Always read the plan output carefully. If in doubt, capture the manual change in Terraform first (update the .tf file on a branch so it matches the live resource), discuss it with the team, then decide whether to keep or revert it.
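Part of this review can be automated as a gate in front of apply. The sketch below flags any plan that would destroy or replace a resource, assuming the plan has been loaded from `terraform show -json tfplan` into a dict. `destructive_changes` and `review_gate` are hypothetical names, and the gate supplements a human read of the plan rather than replacing it.

```python
def destructive_changes(plan: dict) -> list:
    """List addresses whose planned actions include a delete.

    A replacement shows up as ["delete", "create"] or ["create", "delete"],
    so checking for "delete" covers both destroys and replaces.
    """
    return [
        item["address"]
        for item in plan.get("resource_changes", [])
        if "delete" in item["change"]["actions"]
    ]


def review_gate(plan: dict) -> str:
    """Return a short verdict that a CI step or a human can act on."""
    doomed = destructive_changes(plan)
    if doomed:
        return "MANUAL REVIEW: would destroy or replace " + ", ".join(doomed)
    return "OK: no destroy or replace actions in this plan"


if __name__ == "__main__":
    # Hypothetical, trimmed plan: one in-place update and one replacement
    sample = {
        "resource_changes": [
            {"address": "aws_db_parameter_group.prod",
             "change": {"actions": ["update"]}},
            {"address": "aws_instance.web",
             "change": {"actions": ["delete", "create"]}},
        ]
    }
    print(review_gate(sample))
```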

Step 4: `terraform refresh` — Sync State Only

If you want to update Terraform’s state file to match reality without making any changes:

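# Update state to match real AWS (doesn't touch infrastructure)
# terraform refresh is deprecated as of Terraform 0.15.4; refresh-only mode does the
# same job but shows the detected changes and asks for approval before writing state
terraform apply -refresh-only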

# Then check what your .tf files declare vs the refreshed state
terraform plan  # now shows what WOULD change if you applied

This is useful when another team legitimately owns part of a shared resource.

Step 5: Prevent Drift — IaC-Only Enforcement

Tag all Terraform resources and use AWS Config to alert on untagged changes:

# In your Terraform provider configuration
provider "aws" {
  default_tags {
    tags = {
      ManagedBy   = "Terraform"
      Environment = var.environment
      Module      = var.module_name
    }
  }
}

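SCP to deny modifications to tagged production resources by any principal other than the Terraform pipeline role. An SCP cannot tell Console calls from CLI or SDK calls, so this blocks all three: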

{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyManualModifyTerraformResources",
    "Effect": "Deny",
    "Action": [
      "ec2:ModifyInstanceAttribute",
      "rds:ModifyDBInstance",
      "rds:ModifyDBParameterGroup"
    ],
    "Resource": "*",
    "Condition": {
      "StringEquals": {
        "aws:ResourceTag/ManagedBy": "Terraform",
        "aws:ResourceTag/Environment": "production"
      },
      "ArnNotLike": {
        "aws:PrincipalArn": [
          "arn:aws:iam::*:role/terraform-pipeline-role"
        ]
      }
    }
  }]
}
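When a policy like this exempts a role through a wildcard ARN pattern, the condition operator matters: exact-match operators such as StringNotEquals treat "*" as a literal character, while Like-style operators such as ArnNotLike expand it. The toy model below illustrates the difference. It is a simplification that uses Python's `fnmatchcase` as a stand-in for IAM's wildcard matching, not a faithful policy evaluator, and the account ID is made up.

```python
from fnmatch import fnmatchcase

PATTERN = "arn:aws:iam::*:role/terraform-pipeline-role"


def string_not_equals(value: str, pattern: str) -> bool:
    # Exact comparison: "*" is just another character
    return value != pattern


def arn_not_like(value: str, pattern: str) -> bool:
    # Simplified wildcard match; stands in for IAM's "*" and "?" handling
    return not fnmatchcase(value, pattern)


if __name__ == "__main__":
    pipeline = "arn:aws:iam::123456789012:role/terraform-pipeline-role"

    # A result of True means the Deny statement applies to this principal
    print("StringNotEquals denies pipeline role:", string_not_equals(pipeline, PATTERN))
    print("ArnNotLike denies pipeline role:", arn_not_like(pipeline, PATTERN))
```

With the exact-match operator the pipeline role never equals the pattern, so it would be denied along with everyone else.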

Drift Resolution Workflow

Drift Detected (plan shows changes)
         │
         ▼
Is the manual change correct/intentional?
   │                    │
  YES                   NO
   │                    │
   ▼                    ▼
update .tf file     terraform apply
(import if new)     -target=<resource>
+ git commit        (reverts to declared state)
+ PR review

Summary

| Scenario | Command | When to Use |
| --- | --- | --- |
| Check for drift | `terraform plan -refresh=true` | Before every apply |
| Adopt a manual change | Update .tf (`terraform import` if the resource is not in state yet) | Change was valid |
| Revert manual change | `terraform apply -target=...` | Change was wrong |
| Sync state only | `terraform apply -refresh-only` (formerly `terraform refresh`) | Another team owns the resource |
| Continuous monitoring | EventBridge + CodeBuild + SNS, AWS Config + Lambda | Catch drift automatically |

Interview Angle

Mention the cultural fix: drift usually happens because engineers feel they can’t use Terraform fast enough in an emergency. The answer is a pre-approved “break-glass” runbook: emergency Console changes are allowed, but they must be documented and reconciled into Terraform within 24 hours.
