Suyog Maid
Article · 2026-01-18

Terraform Best Practices: From Basics to Enterprise-Scale Infrastructure

#terraform #infrastructure-as-code #devops #aws #azure #gcp #automation


Terraform has revolutionized infrastructure management, but with great power comes great responsibility. After managing infrastructure across AWS, Azure, and GCP using Terraform at enterprise scale, I've learned valuable lessons about what works and what doesn't. This comprehensive guide shares battle-tested best practices for Terraform.

Project Structure and Organization

Recommended Directory Structure

terraform/
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── terraform.tfvars
│   ├── staging/
│   └── production/
├── modules/
│   ├── vpc/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── README.md
│   ├── eks/
│   └── rds/
├── global/
│   ├── iam/
│   └── s3/
└── .terraform-docs.yml

Why This Structure Works

  1. Environment Isolation: Each environment has its own state file
  2. Module Reusability: Shared modules across all environments
  3. Global Resources: IAM roles, S3 buckets that span environments
  4. Clear Boundaries: Obvious separation of concerns
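As a concrete sketch of point 1: each environment directory carries its own backend configuration whose state key differs, so dev and production can never collide on a state file (bucket and table names here are placeholders).

```hcl
# environments/dev/backend.tf
terraform {
  backend "s3" {
    bucket         = "my-terraform-state-bucket"
    key            = "env/dev/terraform.tfstate"  # unique key per environment
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
  }
}
```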

State Management Best Practices

Remote State Configuration

Never store state files locally in production. Use remote backends:

# backend.tf
terraform {
  backend "s3" {
    bucket         = "my-terraform-state-bucket"
    key            = "env/production/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"

    # Note: versioning is not a backend argument; enable it on the
    # S3 bucket itself (see "State File Security" below) for rollback capability.
  }
}

State Locking

Prevent concurrent modifications with DynamoDB locking:

resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }

  tags = {
    Name        = "Terraform State Lock Table"
    Environment = "global"
  }
}
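If a run is interrupted and leaves the lock behind, Terraform prints the lock ID in the error message on the next attempt; the lock can be cleared manually, but only after confirming no other run is actually in progress:

```shell
# <LOCK_ID> is the ID Terraform reports in the lock error message
terraform force-unlock <LOCK_ID>
```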

State File Security

resource "aws_s3_bucket" "terraform_state" {
  bucket = "my-terraform-state-bucket"

  lifecycle {
    prevent_destroy = true
  }

  tags = {
    Name        = "Terraform State Bucket"
    Environment = "global"
  }
}

resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
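One caveat: a versioned state bucket keeps every superseded state forever. A lifecycle rule can expire stale noncurrent versions while preserving recent ones for rollback (90 days here is an arbitrary choice, not a recommendation):

```hcl
resource "aws_s3_bucket_lifecycle_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    id     = "expire-old-state-versions"
    status = "Enabled"

    filter {} # apply to all objects in the bucket

    noncurrent_version_expiration {
      noncurrent_days = 90 # keep 90 days of superseded state versions
    }
  }
}
```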

Module Design Principles

Creating Reusable Modules

A well-designed module should be:

  • Focused: Single responsibility
  • Configurable: Flexible via variables
  • Documented: Clear README and examples
  • Tested: Validated with terratest or similar
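The "Documented" bullet can be partly automated: the .terraform-docs.yml shown in the directory tree configures terraform-docs to regenerate each module's README from its variables and outputs. A minimal configuration (assuming terraform-docs v0.16+ syntax):

```yaml
# .terraform-docs.yml
formatter: markdown table

output:
  file: README.md
  mode: inject # writes between marker comments in an existing README
```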

Example VPC Module

# modules/vpc/main.tf
variable "vpc_cidr" {
  description = "CIDR block for VPC"
  type        = string
  validation {
    condition     = can(cidrhost(var.vpc_cidr, 0))
    error_message = "Must be a valid IPv4 CIDR block."
  }
}

variable "environment" {
  description = "Environment name"
  type        = string
  validation {
    condition     = contains(["dev", "staging", "production"], var.environment)
    error_message = "Environment must be dev, staging, or production."
  }
}

variable "availability_zones" {
  description = "List of availability zones"
  type        = list(string)
  default     = ["us-east-1a", "us-east-1b", "us-east-1c"]
}

variable "project_name" {
  description = "Project name used in common tags"
  type        = string
}

locals {
  common_tags = {
    Environment = var.environment
    ManagedBy   = "Terraform"
    Project     = var.project_name
  }
}

resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = merge(
    local.common_tags,
    {
      Name = "${var.environment}-vpc"
    }
  )
}

# Public Subnets
resource "aws_subnet" "public" {
  count             = length(var.availability_zones)
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(var.vpc_cidr, 4, count.index)
  availability_zone = var.availability_zones[count.index]

  map_public_ip_on_launch = true

  tags = merge(
    local.common_tags,
    {
      Name = "${var.environment}-public-subnet-${count.index + 1}"
      Type = "Public"
    }
  )
}

# Private Subnets
resource "aws_subnet" "private" {
  count             = length(var.availability_zones)
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(var.vpc_cidr, 4, count.index + length(var.availability_zones))
  availability_zone = var.availability_zones[count.index]

  tags = merge(
    local.common_tags,
    {
      Name = "${var.environment}-private-subnet-${count.index + 1}"
      Type = "Private"
    }
  )
}

# Outputs
output "vpc_id" {
  description = "ID of the VPC"
  value       = aws_vpc.main.id
}

output "public_subnet_ids" {
  description = "List of public subnet IDs"
  value       = aws_subnet.public[*].id
}

output "private_subnet_ids" {
  description = "List of private subnet IDs"
  value       = aws_subnet.private[*].id
}

Module Usage

# environments/production/main.tf
module "vpc" {
  source = "../../modules/vpc"

  vpc_cidr           = "10.0.0.0/16"
  environment        = "production"
  availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
  project_name       = "my-project"
}

output "vpc_id" {
  value = module.vpc.vpc_id
}
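Local ../../modules paths work well in a monorepo; when modules are published to a registry or a separate repository, pin a version so environments only upgrade deliberately. A hypothetical registry-sourced equivalent:

```hcl
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws" # illustrative public module, not the local one above
  version = "~> 5.0"                        # accept 5.x patch/minor releases only

  # inputs as in the local example
}
```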

Variable Management

Input Variable Best Practices

# variables.tf
variable "instance_type" {
  description = "EC2 instance type for application servers"
  type        = string
  default     = "t3.medium"

  validation {
    condition     = contains(["t3.small", "t3.medium", "t3.large"], var.instance_type)
    error_message = "Instance type must be t3.small, t3.medium, or t3.large."
  }
}

variable "enable_monitoring" {
  description = "Enable CloudWatch detailed monitoring"
  type        = bool
  default     = false
}

variable "tags" {
  description = "Additional tags for resources"
  type        = map(string)
  default     = {}
}

variable "database_config" {
  description = "Database configuration"
  type = object({
    engine            = string
    engine_version    = string
    instance_class    = string
    allocated_storage = number
  })

  validation {
    condition     = var.database_config.allocated_storage >= 20
    error_message = "Allocated storage must be at least 20 GB."
  }
}
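On Terraform 1.3+, object type constraints also support optional attributes with defaults, which keeps environment tfvars files minimal. A sketch extending database_config:

```hcl
variable "database_config" {
  description = "Database configuration"
  type = object({
    engine            = string
    engine_version    = string
    instance_class    = string
    allocated_storage = number
    multi_az          = optional(bool, false) # defaults to false when omitted from tfvars
  })
}
```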

Environment-Specific Variables

# environments/production/terraform.tfvars
environment       = "production"
instance_type     = "t3.large"
enable_monitoring = true

database_config = {
  engine            = "postgres"
  engine_version    = "15.4"
  instance_class    = "db.r6i.xlarge"
  allocated_storage = 100
}

tags = {
  CostCenter = "Engineering"
  Owner      = "DevOps Team"
}

Security Best Practices

Secrets Management

Never hardcode secrets in Terraform files. Use secure secret management:

# Using AWS Secrets Manager
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "production/database/password"
}

resource "aws_db_instance" "main" {
  engine               = "postgres"
  instance_class       = "db.t3.medium"
  allocated_storage    = 100
  
  # Use secret from Secrets Manager
  password = data.aws_secretsmanager_secret_version.db_password.secret_string

  # Other configurations...
}
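Be aware that any value Terraform reads, including this password, is persisted in plaintext in the state file; encrypting and locking down the state bucket (above) is part of the secrets story. An alternative that avoids ever handling the secret by hand is to generate and store it in the same configuration. A sketch using the hashicorp/random provider (names are illustrative):

```hcl
resource "random_password" "db" {
  length  = 24
  special = true
}

resource "aws_secretsmanager_secret" "db_password" {
  name = "production/database/password"
}

resource "aws_secretsmanager_secret_version" "db_password" {
  secret_id     = aws_secretsmanager_secret.db_password.id
  secret_string = random_password.db.result
}
```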

Sensitive Output Protection

output "database_endpoint" {
  description = "Database connection endpoint"
  value       = aws_db_instance.main.endpoint
  sensitive   = false  # Safe to expose
}

output "database_password" {
  description = "Database master password"
  value       = aws_db_instance.main.password
  sensitive   = true  # Protected in output
}
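Marking an output sensitive redacts it in plan and apply output, but operators can still read it deliberately when needed:

```shell
# Print the raw value of a sensitive output (use sparingly; it lands in shell history)
terraform output -raw database_password
```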

IAM Policy Management

# Use data sources for AWS managed policies
data "aws_iam_policy" "ssm_managed_instance_core" {
  arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}

# Create custom policies with least privilege
resource "aws_iam_policy" "app_s3_access" {
  name        = "${var.environment}-app-s3-access"
  description = "Allows application to access specific S3 bucket"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "s3:GetObject",
          "s3:PutObject"
        ]
        Resource = "arn:aws:s3:::${var.app_bucket}/*"
      }
    ]
  })
}
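For completeness, the managed-policy data source above is typically consumed by attaching it to a role; aws_iam_role.app here is a hypothetical role defined elsewhere in the configuration:

```hcl
resource "aws_iam_role_policy_attachment" "ssm_core" {
  role       = aws_iam_role.app.name # hypothetical application role
  policy_arn = data.aws_iam_policy.ssm_managed_instance_core.arn
}
```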

Testing Terraform Code

Validation and Formatting

# Format code
terraform fmt -recursive

# Validate syntax
terraform validate

# Security scanning with tfsec
tfsec .

# Cost estimation with Infracost
infracost breakdown --path .

Terratest for Integration Testing

// test/vpc_test.go
package test

import (
    "testing"
    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
)

func TestVPCCreation(t *testing.T) {
    terraformOptions := &terraform.Options{
        TerraformDir: "../environments/test",
        Vars: map[string]interface{}{
            "vpc_cidr":     "10.0.0.0/16",
            "environment":  "test",
        },
    }

    defer terraform.Destroy(t, terraformOptions)
    terraform.InitAndApply(t, terraformOptions)

    vpcID := terraform.Output(t, terraformOptions, "vpc_id")
    assert.NotEmpty(t, vpcID)
}
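These tests create (and then destroy) real AWS resources, so run them against a sandbox account with a generous timeout:

```shell
cd test
go test -v -timeout 30m -run TestVPCCreation
```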

CI/CD Pipeline Integration

GitHub Actions Workflow

name: Terraform CI/CD

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

env:
  TF_VERSION: 1.6.0

jobs:
  terraform:
    name: Terraform Plan and Apply
    runs-on: ubuntu-latest
    
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: us-east-1

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: ${{ env.TF_VERSION }}

      - name: Terraform Format Check
        run: terraform fmt -check -recursive

      - name: Terraform Init
        run: terraform init
        working-directory: ./environments/production

      - name: Terraform Validate
        run: terraform validate
        working-directory: ./environments/production

      - name: Terraform Plan
        run: terraform plan -out=tfplan
        working-directory: ./environments/production

      - name: Terraform Apply
        if: github.ref == 'refs/heads/main'
        run: terraform apply -auto-approve tfplan
        working-directory: ./environments/production

Performance Optimization

Use Data Sources Wisely

# Cache AMI lookup with data source
data "aws_ami" "amazon_linux_2" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }
}

# Reference cached data
resource "aws_instance" "app" {
  ami           = data.aws_ami.amazon_linux_2.id
  instance_type = var.instance_type
}

Parallelism Tuning

# Increase parallelism for faster applies (default is 10)
terraform apply -parallelism=30

# Reduce for rate limit sensitive operations
terraform apply -parallelism=5

Targeted Operations

# Update specific resources
terraform apply -target=module.vpc

# Refresh specific module state (a refresh-only apply replaces the deprecated `terraform refresh`)
terraform apply -refresh-only -target=module.database

Disaster Recovery and Rollback

State File Backup

# Backup current state before major changes
terraform state pull > terraform.tfstate.backup-$(date +%Y%m%d-%H%M%S)

# Restore from backup if needed
terraform state push terraform.tfstate.backup-20260118-100000

Import Existing Resources

# Import existing AWS resources
terraform import module.vpc.aws_vpc.main vpc-1234567890abcdef0

# Verify import
terraform plan
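On Terraform 1.5+, the same import can instead be declared in configuration with an import block, which goes through plan and code review like any other change:

```hcl
# import.tf -- remove the block once the resource is in state
import {
  to = module.vpc.aws_vpc.main
  id = "vpc-1234567890abcdef0"
}
```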

Common Pitfalls to Avoid

āŒ DON'T: Hardcode Values

# Bad
resource "aws_instance" "app" {
  ami           = "ami-12345678"  # Hardcoded AMI
  instance_type = "t3.medium"     # Hardcoded type
}

✅ DO: Use Variables and Data Sources

# Good
data "aws_ami" "app" {
  # Dynamic AMI lookup
}

resource "aws_instance" "app" {
  ami           = data.aws_ami.app.id
  instance_type = var.instance_type
}

āŒ DON'T: Use count with Complex Objects

# Bad - fragile to list reordering
resource "aws_instance" "app" {
  count         = 3
  ami           = var.ami_id
  instance_type = var.instance_type
}

✅ DO: Use for_each with Maps

# Good - stable resource addressing
locals {
  instances = {
    web1 = { type = "t3.medium" }
    web2 = { type = "t3.medium" }
    api  = { type = "t3.large" }
  }
}

resource "aws_instance" "app" {
  for_each      = local.instances
  ami           = var.ami_id
  instance_type = each.value.type

  tags = {
    Name = each.key
  }
}

Key Takeaways

  1. State Management: Always use remote state with locking
  2. Module Design: Create focused, reusable modules
  3. Security: Never commit secrets, use proper IAM policies
  4. Testing: Validate, format, and test your code
  5. CI/CD: Automate with pipelines
  6. Documentation: Comment your code and maintain READMEs
  7. Version Control: Use Git with meaningful commits
  8. Cost Awareness: Use Infracost to track expenses

Conclusion

Terraform is a powerful tool, but it requires discipline and best practices to use effectively at scale. By following these guidelines, you'll build infrastructure that is:

  • Reproducible: Consistent across environments
  • Secure: Protected against common vulnerabilities
  • Maintainable: Easy to understand and modify
  • Reliable: Tested and validated
  • Collaborative: Team-friendly with clear patterns

Remember: good Terraform code is code that your team can read, understand, and confidently modify six months from now.


Want to dive deeper? Check out my other posts on AWS CDK and Kubernetes, or reach out for consulting on your infrastructure challenges!
