Advanced Terraform & IaC Patterns

State Surgery

Terraform state is the single source of truth that maps your HCL configuration to real-world infrastructure. In an ideal world you would never touch it directly — every resource would be born from terraform apply and die from terraform destroy. Production is not ideal. Infrastructure teams face brownfield resources created before Terraform existed, refactors that rename resources, accidental state corruption, and the need to reorganize large monolithic state files. Knowing how to perform state surgery — safely manipulating state without destroying real infrastructure — is a non-negotiable skill at any organization running Terraform at scale.

State is the blast-radius multiplier. A mistaken targeted destroy hits one resource. A botched state operation can desync dozens of resources at once, causing Terraform to plan to recreate all of them on the next apply. Always back up state before any surgical operation, and always run terraform plan afterwards to verify that Terraform's intent matches yours.

Backing Up State Before Surgery

Whether your backend is S3, GCS, or Terraform Cloud, capture a local snapshot before touching anything:

    # Pull current state to a local backup file
    BACKUP="state-backup-$(date +%Y%m%d-%H%M%S).tfstate"
    terraform state pull > "$BACKUP"

    # Verify the backup is valid JSON (check this one file, not a glob:
    # json.tool treats a second filename as its output file)
    python3 -m json.tool "$BACKUP" > /dev/null && echo "Backup OK"

On S3 backends with versioning enabled, the previous state version is automatically preserved — but the explicit pull gives you a local copy you can inspect and restore from without cloud access.
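
A JSON syntax check confirms that the file parses, not that it is a Terraform state. A slightly stricter check, sketched below, looks for the lineage and serial fields that a state snapshot carries. The function name check_state_backup is hypothetical, and the grep matching is a rough heuristic rather than a real JSON parse.

```shell
# check_state_backup FILE  (hypothetical helper, not part of Terraform)
# Succeeds only if FILE is non-empty and mentions the "lineage" and "serial"
# keys that a Terraform state snapshot contains.
check_state_backup() {
  file="$1"
  [ -s "$file" ] || { echo "empty or missing: $file" >&2; return 1; }
  grep -q '"lineage"[[:space:]]*:' "$file" || { echo "no lineage in $file" >&2; return 1; }
  grep -q '"serial"[[:space:]]*:' "$file" || { echo "no serial in $file" >&2; return 1; }
  echo "Backup looks like Terraform state: $file"
}

# Usage (the filename is illustrative):
#   check_state_backup state-backup-20240101-120000.tfstate
```

An empty file is the failure this guards against most: it is what a redirect leaves behind when terraform state pull itself fails.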

The terraform import Command

Use import when a resource already exists in the cloud but has no corresponding Terraform state entry. Classic scenarios: a DBA created an RDS instance manually to unblock a release, an ops engineer added a security group rule in the console, or you are adopting an old AWS account that was never managed by IaC.

The workflow is: write the HCL config first, then import the real resource into state. The terraform import CLI only writes state, not config. The one exception is terraform plan -generate-config-out, introduced in Terraform 1.5 and used together with import blocks, which can scaffold config as a starting point.

    # 1. Write (or scaffold) the HCL for the existing resource
    # main.tf
    resource "aws_security_group" "app_sg" {
      name        = "app-prod-sg"
      description = "Application security group"
      vpc_id      = var.vpc_id
    }

    # 2. Import the real AWS resource into Terraform state
    # Syntax: terraform import <resource_address> <provider_id>
    terraform import aws_security_group.app_sg sg-0abc123def456789a

    # 3. Always plan immediately after import: the config must match reality
    terraform plan
    # Goal: "No changes. Your infrastructure matches the configuration."
    # Any diff means your HCL does not fully describe the real resource.
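
In a pipeline, the "no changes" goal can be checked mechanically rather than by reading the output. The sketch below relies on the -detailed-exitcode flag of terraform plan, which returns 0 for no changes, 1 for an error, and 2 when changes are present. The wrapper name plan_is_clean is hypothetical, and it takes the command to run as its arguments.

```shell
# plan_is_clean CMD...  (hypothetical wrapper)
# Runs CMD, expected to be `terraform plan -detailed-exitcode ...`, and maps
# its exit codes: 0 = no changes, 1 = error, 2 = changes present.
plan_is_clean() {
  rc=0
  "$@" || rc=$?
  case "$rc" in
    0) echo "clean: config matches the imported resource"; return 0 ;;
    2) echo "drift: plan contains changes, fix the HCL before applying"; return 2 ;;
    *) echo "error: plan failed with exit code $rc"; return 1 ;;
  esac
}

# Usage after an import:
#   plan_is_clean terraform plan -detailed-exitcode -input=false
```

Keeping exit code 2 separate from 1 lets a CI job tell "the HCL needs more work" apart from "the plan itself failed".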

In Terraform 1.5+, you can also declare imports inside HCL with an import block, which is idempotent and pipeline-safe:

    # import.tf: committed to Git, reviewed in PR, applied once
    import {
      to = aws_security_group.app_sg
      id = "sg-0abc123def456789a"
    }

    resource "aws_security_group" "app_sg" {
      name        = "app-prod-sg"
      description = "Application security group"
      vpc_id      = var.vpc_id
    }

Prefer import blocks over CLI import for team workflows. The CLI terraform import is imperative: it runs, modifies state, and leaves no record in Git. Import blocks are declarative, version-controlled, and can be planned and applied through a normal CI/CD pipeline. Once the import is applied, remove the block from the codebase: the resource is now under management, and although a leftover import block is a no-op, it adds noise.

The moved Block — Renaming Without Destroying

When you refactor HCL — rename a resource, move it into a module, or change a for_each key — Terraform sees the old address disappear and the new one appear. Without guidance, it plans to destroy the old resource and create a new one. In production this means downtime. The moved block (introduced in Terraform 1.1) tells Terraform that the old and new addresses refer to the same real-world object, so no destroy/create cycle occurs.

    # Before refactor: resource was at the root level
    # resource "aws_instance" "web" { ... }

    # After refactor: moved into a module
    module "web" {
      source        = "./modules/ec2"
      instance_type = "t3.medium"
    }

    # moved.tf: explains the rename to Terraform
    moved {
      from = aws_instance.web
      to   = module.web.aws_instance.instance
    }

Moved blocks are a record of the refactor, not a one-off command. Keep them in the codebase until every workspace and pipeline that uses the configuration has applied them, then remove them after a stabilization period (typically one sprint). If you remove a moved block before a given state has applied it, the next plan against that state will propose a destroy/create for the renamed resource.

The moved block also handles for_each key renames, which are a common refactor pain point:

    # Before: keyed by environment name
    resource "aws_s3_bucket" "data" {
      for_each = toset(["prod", "staging"])
      bucket   = "mycompany-data-${each.key}"
    }

    # After: keys changed to include region for clarity
    resource "aws_s3_bucket" "data" {
      for_each = toset(["prod-us-east-1", "staging-us-east-1"])
      bucket   = "mycompany-data-${each.key}"
    }

    # moved blocks for each renamed key
    moved {
      from = aws_s3_bucket.data["prod"]
      to   = aws_s3_bucket.data["prod-us-east-1"]
    }

    moved {
      from = aws_s3_bucket.data["staging"]
      to   = aws_s3_bucket.data["staging-us-east-1"]
    }
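
With more than a handful of keys, writing one moved block per key by hand invites typos. One option is to generate the blocks from a list of old and new keys, as in the sketch below. The helper name gen_moved_blocks is hypothetical, and it assumes keys contain no whitespace or double quotes.

```shell
# gen_moved_blocks RESOURCE  (hypothetical helper)
# Reads "OLD_KEY NEW_KEY" pairs from stdin, one pair per line, and prints one
# moved block per pair for the for_each resource at address RESOURCE.
gen_moved_blocks() {
  resource="$1"
  while read -r old new; do
    # Skip blank or incomplete lines
    [ -n "$old" ] && [ -n "$new" ] || continue
    printf 'moved {\n  from = %s["%s"]\n  to   = %s["%s"]\n}\n\n' \
      "$resource" "$old" "$resource" "$new"
  done
}

# Usage, reproducing the two blocks from the example above:
#   printf '%s\n' 'prod prod-us-east-1' 'staging staging-us-east-1' \
#     | gen_moved_blocks aws_s3_bucket.data > moved.tf
```

The generated file still goes through review and terraform plan like any hand-written moved block.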

State CLI Commands: mv, rm, list

The terraform state subcommands are surgical tools for cases where HCL blocks are not sufficient — typically cross-workspace or cross-backend moves.

terraform state list — lists all resource addresses currently tracked in state. Essential before any surgery to understand what you are working with:

    terraform state list
    # module.network.aws_vpc.main
    # module.network.aws_subnet.private[0]
    # module.network.aws_subnet.private[1]
    # aws_iam_role.eks_node
    # aws_eks_cluster.main

    # Filter by module (state list accepts resource or module addresses)
    terraform state list module.network

    # Filter by resource type: addresses do not take wildcards, so use grep
    terraform state list | grep '^aws_iam_role\.'
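
Surgery scripts often need every resource of one type, including those inside modules. Assuming the address format shown above, a small stdin filter can do that. The name filter_by_type is hypothetical, and the pattern is a text match on addresses rather than anything Terraform provides.

```shell
# filter_by_type TYPE  (hypothetical helper)
# Reads addresses in `terraform state list` format from stdin and prints those
# whose final "TYPE.NAME" segment has the given TYPE, with or without an
# index such as [0] or ["key"]. Data sources of the same type also match.
# Note: like grep, it exits non-zero when nothing matches.
filter_by_type() {
  grep -E "(^|\.)$1\.[A-Za-z_][A-Za-z0-9_-]*(\[.*\])?\$"
}

# Usage:
#   terraform state list | filter_by_type aws_subnet
```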

terraform state mv — moves a resource from one address to another within the same state file, or between two state files. Use this when moved blocks are not an option (e.g., moving resources across workspaces):

    # Rename within the same state
    terraform state mv aws_instance.web aws_instance.web_server

    # Move from root to a module (when moved block is not usable)
    terraform state mv aws_security_group.app_sg module.app.aws_security_group.sg

    # Cross-state move: pull state from both sides, mv, push
    # In the source workspace:
    terraform state pull > old-workspace.tfstate
    # In the target workspace:
    terraform state pull > new-workspace.tfstate
    # -state / -state-out operate on the two local files
    terraform state mv -state=old-workspace.tfstate \
      -state-out=new-workspace.tfstate \
      aws_s3_bucket.data aws_s3_bucket.data
    terraform state push new-workspace.tfstate
    # Back in the source workspace, push the state that no longer holds the bucket:
    terraform state push old-workspace.tfstate

terraform state rm — removes a resource from state without destroying the real infrastructure. Use when you want Terraform to stop managing a resource (hand it back to manual management, or migrate to a different tool) while leaving the real cloud resource intact:

    # Preview what would be removed, without changing state
    terraform state rm -dry-run module.legacy_network

    # Remove a single resource from state (the real EC2 instance is NOT terminated)
    terraform state rm aws_instance.legacy_app

    # Remove an entire module from state
    terraform state rm module.legacy_network

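After state rm, the next terraform plan will show the resource as "to be created" if its resource block is still in the configuration, because Terraform no longer knows the real resource exists. Either remove the resource block from HCL or re-import the resource. A lifecycle { ignore_changes = all } block does not help here: it suppresses attribute diffs on resources that are already in state, not the creation of one that is missing. On Terraform 1.7+, a removed block with lifecycle { destroy = false } achieves the same result declaratively. Always have a clear intent before removing from state.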

Refactoring Without Destroying: The Safe Pattern

The safest way to restructure a large Terraform codebase follows this sequence:

  1. Back up state: terraform state pull > backup.tfstate
  2. Write the new HCL — rename resources, extract modules, restructure hierarchies
  3. Add moved blocks — one per renamed or relocated address
  4. Run terraform plan — the plan must show zero adds/destroys; only renames appear as moved notes
  5. Apply in a non-production workspace first — validate the refactor is clean
  6. Apply to production — with a team member reviewing the plan output before confirming
  7. Remove moved blocks — after all environments have successfully applied

[Figure: State Surgery Safe Refactor Flow. Steps 1 to 7 in sequence, with one branch at the plan step: if the plan shows destroys, restore the backup and investigate.]
Safe state surgery flow: back up first, then verify that the plan shows zero destroys before touching any environment.
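
The "zero adds, zero destroys" gate in step 4 can be automated. The sketch below reads the human-readable plan output and checks its summary line. The helper name plan_has_no_adds_or_destroys is hypothetical. It assumes the plan was produced with -no-color and that the summary has the form "Plan: N to add, N to change, N to destroy."; for anything beyond a quick gate, parsing the output of terraform show -json on a saved plan is likely to be more robust.

```shell
# plan_has_no_adds_or_destroys  (hypothetical helper)
# Reads `terraform plan -no-color` output on stdin. Succeeds if the plan
# reports no changes at all, or a summary with 0 to add and 0 to destroy.
plan_has_no_adds_or_destroys() {
  output=$(cat)
  if printf '%s\n' "$output" | grep -q '^No changes\.'; then
    return 0
  fi
  # Take the last summary line, e.g. "Plan: 0 to add, 0 to change, 0 to destroy."
  summary=$(printf '%s\n' "$output" | grep -E '^Plan: ' | tail -n 1)
  [ -n "$summary" ] || { echo "no plan summary found" >&2; return 1; }
  adds=$(printf '%s\n' "$summary" | sed -E 's/.*[^0-9]([0-9]+) to add.*/\1/')
  destroys=$(printf '%s\n' "$summary" | sed -E 's/.*[^0-9]([0-9]+) to destroy.*/\1/')
  [ "$adds" -eq 0 ] && [ "$destroys" -eq 0 ]
}

# Usage:
#   terraform plan -no-color | plan_has_no_adds_or_destroys || echo "STOP: restore backup and investigate"
```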

Production Failure Modes

The most dangerous state surgery mistakes at scale:

  • Applying without planning after import: if the imported resource's config does not match reality, Terraform will modify the resource in place or, for attributes that force replacement, destroy and recreate it. Always plan immediately post-import and resolve all diffs before proceeding.
  • Removing moved blocks too early: any workspace or state that has not yet applied the move will see a destroy/create cycle on its next apply. Keep moved blocks for at least one release cycle.
  • State push without safeguards: terraform state push takes the state lock by default and refuses a state with a different lineage or an older serial. Passing -lock=false or -force disables those protections, and if a CI pipeline is mid-apply, the push can overwrite its changes. Use either flag only as a last resort, when you know no apply is in progress.
  • Mixing state mv and moved blocks: running state mv by hand in one environment while relying on a moved block in others means the same rename reaches each state by a different path, only one of which was reviewed. Pick one path per refactor and stick to it.

At Google scale, state surgery is gated behind a break-glass process. Direct terraform state push requires a second engineer sign-off, a linked incident ticket, and triggers an automated alert to the platform team. Even in smaller organizations, treat any manual state modification as an incident-class event: document what you did, why, and what the resulting plan showed.
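
Part of that discipline can be a pre-push check that compares the edited state file with the backup taken before surgery. The sketch below applies a deliberately strict local rule: same lineage, and a serial that has increased. Both helper names are hypothetical, and the field extraction assumes the pretty-printed, one-key-per-line layout that terraform state pull emits; it is a text match, not a JSON parser.

```shell
# state_field FILE KEY  (hypothetical helper)
# Prints the first value found for a scalar KEY such as "serial" or "lineage".
state_field() {
  sed -n -E "s/^[[:space:]]*\"$2\"[[:space:]]*:[[:space:]]*\"?([^\",]*)\"?,?[[:space:]]*\$/\1/p" "$1" | head -n 1
}

# safe_to_push BACKUP CANDIDATE  (hypothetical helper)
# Succeeds only if CANDIDATE has the same lineage as BACKUP and a strictly
# higher serial, so the push should not need -force.
safe_to_push() {
  [ "$(state_field "$1" lineage)" = "$(state_field "$2" lineage)" ] \
    || { echo "lineage mismatch" >&2; return 1; }
  [ "$(state_field "$2" serial)" -gt "$(state_field "$1" serial)" ] \
    || { echo "serial not newer than backup" >&2; return 1; }
}

# Usage:
#   safe_to_push backup.tfstate new-workspace.tfstate && terraform state push new-workspace.tfstate
```

A check like this catches the wrong-file mistake (pushing another workspace's state) before the backend has to.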