Terraform Best Practices: From Basics to Enterprise-Scale Infrastructure
Terraform has revolutionized infrastructure management, but with great power comes great responsibility. After managing infrastructure across AWS, Azure, and GCP using Terraform at enterprise scale, I've learned valuable lessons about what works and what doesn't. This comprehensive guide shares battle-tested best practices for Terraform.
Project Structure and Organization
Recommended Directory Structure
terraform/
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── terraform.tfvars
│   ├── staging/
│   └── production/
├── modules/
│   ├── vpc/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── README.md
│   ├── eks/
│   └── rds/
├── global/
│   ├── iam/
│   └── s3/
└── .terraform-docs.yml
Why This Structure Works
- Environment Isolation: Each environment has its own state file
- Module Reusability: Shared modules across all environments
- Global Resources: IAM roles, S3 buckets that span environments
- Clear Boundaries: Obvious separation of concerns
State Management Best Practices
Remote State Configuration
Never store state files locally in production. Use remote backends:
# backend.tf
terraform {
  backend "s3" {
    bucket         = "my-terraform-state-bucket"
    key            = "env/production/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"

    # Note: versioning is not a backend argument. Enable versioning on the
    # S3 bucket itself (see "State File Security" below) for rollback capability.
  }
}
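If you started with local state, adding the backend block above is not enough on its own; Terraform has to be reinitialized so the existing state is copied into S3 (the bucket and key are the placeholders from the example):

```shell
# Reinitialize and copy existing local state into the S3 backend
terraform init -migrate-state

# Confirm the state now lives remotely
terraform state list
```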
State Locking
Prevent concurrent modifications with DynamoDB locking:
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }

  tags = {
    Name        = "Terraform State Lock Table"
    Environment = "global"
  }
}
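If a crashed or interrupted run leaves a stale lock in this table, Terraform reports the lock ID in its error message, and you can release it manually — but only after confirming no other apply is actually running:

```shell
# Release a stale lock using the lock ID from the error message
terraform force-unlock <LOCK_ID>
```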
State File Security
resource "aws_s3_bucket" "terraform_state" {
  bucket = "my-terraform-state-bucket"

  lifecycle {
    prevent_destroy = true
  }

  tags = {
    Name        = "Terraform State Bucket"
    Environment = "global"
  }
}

resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
Module Design Principles
Creating Reusable Modules
A well-designed module should be:
- Focused: Single responsibility
- Configurable: Flexible via variables
- Documented: Clear README and examples
- Tested: Validated with terratest or similar
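The `.terraform-docs.yml` in the directory structure above is what keeps the "Documented" requirement cheap to satisfy: terraform-docs can regenerate each module's README from its variables and outputs. A minimal sketch (settings here are illustrative):

```yaml
# .terraform-docs.yml — run `terraform-docs .` in a module directory
formatter: markdown table

output:
  file: README.md
  mode: inject
```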
Example VPC Module
# modules/vpc/main.tf
variable "vpc_cidr" {
  description = "CIDR block for VPC"
  type        = string

  validation {
    condition     = can(cidrhost(var.vpc_cidr, 0))
    error_message = "Must be a valid IPv4 CIDR block."
  }
}

variable "environment" {
  description = "Environment name"
  type        = string

  validation {
    condition     = contains(["dev", "staging", "production"], var.environment)
    error_message = "Environment must be dev, staging, or production."
  }
}

variable "project_name" {
  description = "Project name used for tagging"
  type        = string
}

variable "availability_zones" {
  description = "List of availability zones"
  type        = list(string)
  default     = ["us-east-1a", "us-east-1b", "us-east-1c"]
}

locals {
  common_tags = {
    Environment = var.environment
    ManagedBy   = "Terraform"
    Project     = var.project_name
  }
}

resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = merge(
    local.common_tags,
    {
      Name = "${var.environment}-vpc"
    }
  )
}

# Public Subnets
resource "aws_subnet" "public" {
  count = length(var.availability_zones)

  vpc_id                  = aws_vpc.main.id
  cidr_block              = cidrsubnet(var.vpc_cidr, 4, count.index)
  availability_zone       = var.availability_zones[count.index]
  map_public_ip_on_launch = true

  tags = merge(
    local.common_tags,
    {
      Name = "${var.environment}-public-subnet-${count.index + 1}"
      Type = "Public"
    }
  )
}

# Private Subnets
resource "aws_subnet" "private" {
  count = length(var.availability_zones)

  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(var.vpc_cidr, 4, count.index + length(var.availability_zones))
  availability_zone = var.availability_zones[count.index]

  tags = merge(
    local.common_tags,
    {
      Name = "${var.environment}-private-subnet-${count.index + 1}"
      Type = "Private"
    }
  )
}

# Outputs
output "vpc_id" {
  description = "ID of the VPC"
  value       = aws_vpc.main.id
}

output "public_subnet_ids" {
  description = "List of public subnet IDs"
  value       = aws_subnet.public[*].id
}

output "private_subnet_ids" {
  description = "List of private subnet IDs"
  value       = aws_subnet.private[*].id
}
Module Usage
# environments/production/main.tf
module "vpc" {
  source = "../../modules/vpc"

  vpc_cidr           = "10.0.0.0/16"
  environment        = "production"
  availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
  project_name       = "my-project"
}

output "vpc_id" {
  value = module.vpc.vpc_id
}
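When modules live in their own repository rather than alongside the environments, pin each consumer to an immutable release tag so environments upgrade deliberately rather than picking up whatever is on the default branch (the Git URL below is a placeholder):

```hcl
module "vpc" {
  # Pin to a release tag instead of a moving branch
  source = "git::https://github.com/example-org/terraform-modules.git//vpc?ref=v1.2.0"

  vpc_cidr    = "10.0.0.0/16"
  environment = "production"
}
```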
Variable Management
Input Variable Best Practices
# variables.tf
variable "instance_type" {
  description = "EC2 instance type for application servers"
  type        = string
  default     = "t3.medium"

  validation {
    condition     = contains(["t3.small", "t3.medium", "t3.large"], var.instance_type)
    error_message = "Instance type must be t3.small, t3.medium, or t3.large."
  }
}

variable "enable_monitoring" {
  description = "Enable CloudWatch detailed monitoring"
  type        = bool
  default     = false
}

variable "tags" {
  description = "Additional tags for resources"
  type        = map(string)
  default     = {}
}

variable "database_config" {
  description = "Database configuration"
  type = object({
    engine            = string
    engine_version    = string
    instance_class    = string
    allocated_storage = number
  })

  validation {
    condition     = var.database_config.allocated_storage >= 20
    error_message = "Allocated storage must be at least 20 GB."
  }
}
Environment-Specific Variables
# environments/production/terraform.tfvars
environment       = "production"
instance_type     = "t3.large"
enable_monitoring = true

database_config = {
  engine            = "postgres"
  engine_version    = "15.4"
  instance_class    = "db.r6i.xlarge"
  allocated_storage = 100
}

tags = {
  CostCenter = "Engineering"
  Owner      = "DevOps Team"
}
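Terraform loads `terraform.tfvars` from the working directory automatically; any additional variable files have to be passed explicitly (the filename here is just an example):

```shell
# terraform.tfvars is picked up automatically; extra files need -var-file
terraform plan -var-file="overrides.tfvars"
```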
Security Best Practices
Secrets Management
Never hardcode secrets in Terraform files. Use secure secret management:
# Using AWS Secrets Manager
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "production/database/password"
}

resource "aws_db_instance" "main" {
  engine            = "postgres"
  instance_class    = "db.t3.medium"
  allocated_storage = 100

  # Use secret from Secrets Manager. Note: the value is still written to
  # state in plaintext, so the state backend must be encrypted and
  # access-controlled (as configured above).
  password = data.aws_secretsmanager_secret_version.db_password.secret_string

  # Other configurations...
}
Sensitive Output Protection
output "database_endpoint" {
  description = "Database connection endpoint"
  value       = aws_db_instance.main.endpoint
  sensitive   = false # Safe to expose
}

output "database_password" {
  description = "Database master password"
  value       = aws_db_instance.main.password
  sensitive   = true # Protected in output
}
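Sensitive outputs are redacted in `terraform apply` and `terraform output` listings, but they can still be read deliberately when a downstream tool needs the value:

```shell
# Sensitive values show as (sensitive) in the normal listing...
terraform output

# ...but can be read explicitly, e.g. for piping into another tool
terraform output -raw database_password
```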
IAM Policy Management
# Use data sources for AWS managed policies
data "aws_iam_policy" "ssm_managed_instance_core" {
  arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}

# Create custom policies with least privilege
resource "aws_iam_policy" "app_s3_access" {
  name        = "${var.environment}-app-s3-access"
  description = "Allows application to access specific S3 bucket"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "s3:GetObject",
          "s3:PutObject"
        ]
        Resource = "arn:aws:s3:::${var.app_bucket}/*"
      }
    ]
  })
}
Testing Terraform Code
Validation and Formatting
# Format code
terraform fmt -recursive
# Validate syntax
terraform validate
# Security scanning with tfsec
tfsec .
# Cost estimation with Infracost
infracost breakdown --path .
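These checks can also run automatically before every commit via the pre-commit framework; a sketch using the community pre-commit-terraform hooks (pin `rev` to a real release of that repository):

```yaml
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.86.0 # example pin; use a current release
    hooks:
      - id: terraform_fmt
      - id: terraform_validate
      - id: terraform_tfsec
```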
Terratest for Integration Testing
// test/vpc_test.go
package test

import (
	"testing"

	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
)

func TestVPCCreation(t *testing.T) {
	terraformOptions := &terraform.Options{
		TerraformDir: "../environments/test",
		Vars: map[string]interface{}{
			"vpc_cidr":    "10.0.0.0/16",
			"environment": "test",
		},
	}

	defer terraform.Destroy(t, terraformOptions)
	terraform.InitAndApply(t, terraformOptions)

	vpcID := terraform.Output(t, terraformOptions, "vpc_id")
	assert.NotEmpty(t, vpcID)
}
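Run the test from the test directory; integration tests create real resources, so give them a generous timeout (the module name in `go mod init` is arbitrary):

```shell
cd test
go mod init vpc-test && go mod tidy   # first run only
go test -v -timeout 30m -run TestVPCCreation
```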
CI/CD Pipeline Integration
GitHub Actions Workflow
name: Terraform CI/CD

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

# Required for OIDC-based role assumption
permissions:
  id-token: write
  contents: read

env:
  TF_VERSION: 1.6.0

jobs:
  terraform:
    name: Terraform Plan and Apply
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: us-east-1

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: ${{ env.TF_VERSION }}

      - name: Terraform Format Check
        run: terraform fmt -check -recursive

      - name: Terraform Init
        run: terraform init
        working-directory: ./environments/production

      - name: Terraform Validate
        run: terraform validate
        working-directory: ./environments/production

      - name: Terraform Plan
        run: terraform plan -out=tfplan
        working-directory: ./environments/production

      - name: Terraform Apply
        if: github.ref == 'refs/heads/main'
        run: terraform apply -auto-approve tfplan
        working-directory: ./environments/production
Performance Optimization
Use Data Sources Wisely
# Look up the latest AMI dynamically instead of hardcoding an ID
# (data sources are re-read on each plan, so this always tracks the latest image)
data "aws_ami" "amazon_linux_2" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }
}

# Reference the data source result
resource "aws_instance" "app" {
  ami           = data.aws_ami.amazon_linux_2.id
  instance_type = var.instance_type
}
Parallelism Tuning
# Increase parallelism for faster applies (default is 10)
terraform apply -parallelism=30
# Reduce for rate limit sensitive operations
terraform apply -parallelism=5
Targeted Operations
# Update specific resources
terraform apply -target=module.vpc
# Refresh a specific module's state (the standalone refresh command is
# deprecated; prefer a refresh-only apply)
terraform apply -refresh-only -target=module.database
Disaster Recovery and Rollback
State File Backup
# Backup current state before major changes
terraform state pull > terraform.tfstate.backup-$(date +%Y%m%d-%H%M%S)
# Restore from backup if needed
terraform state push terraform.tfstate.backup-20260118-100000
Import Existing Resources
# Import existing AWS resources
terraform import module.vpc.aws_vpc.main vpc-1234567890abcdef0
# Verify import
terraform plan
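On Terraform 1.5+, imports can also be declared in configuration, which makes them reviewable in a PR and visible in the plan rather than a one-off CLI step:

```hcl
# Config-driven import (Terraform 1.5+); shows up in `terraform plan`
import {
  to = module.vpc.aws_vpc.main
  id = "vpc-1234567890abcdef0"
}
```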
Common Pitfalls to Avoid
❌ DON'T: Hardcode Values
# Bad
resource "aws_instance" "app" {
  ami           = "ami-12345678" # Hardcoded AMI
  instance_type = "t3.medium"    # Hardcoded type
}
✅ DO: Use Variables and Data Sources
# Good
data "aws_ami" "app" {
  # Dynamic AMI lookup instead of a hardcoded ID
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-hvm-*-x86_64-gp2"]
  }
}

resource "aws_instance" "app" {
  ami           = data.aws_ami.app.id
  instance_type = var.instance_type
}
❌ DON'T: Use count with Complex Objects
# Bad - fragile to list reordering
resource "aws_instance" "app" {
  count         = 3
  ami           = var.ami_id
  instance_type = var.instance_type
}
✅ DO: Use for_each with Maps
# Good - stable resource addressing
locals {
  instances = {
    web1 = { type = "t3.medium" }
    web2 = { type = "t3.medium" }
    api  = { type = "t3.large" }
  }
}

resource "aws_instance" "app" {
  for_each = local.instances

  ami           = var.ami_id
  instance_type = each.value.type

  tags = {
    Name = each.key
  }
}
Key Takeaways
- State Management: Always use remote state with locking
- Module Design: Create focused, reusable modules
- Security: Never commit secrets, use proper IAM policies
- Testing: Validate, format, and test your code
- CI/CD: Automate with pipelines
- Documentation: Comment your code and maintain READMEs
- Version Control: Use Git with meaningful commits
- Cost Awareness: Use Infracost to track expenses
Conclusion
Terraform is a powerful tool, but it requires discipline and best practices to use effectively at scale. By following these guidelines, you'll build infrastructure that is:
- Reproducible: Consistent across environments
- Secure: Protected against common vulnerabilities
- Maintainable: Easy to understand and modify
- Reliable: Tested and validated
- Collaborative: Team-friendly with clear patterns
Remember: good Terraform code is code that your team can read, understand, and confidently modify six months from now.
Want to dive deeper? Check out my other posts on AWS CDK and Kubernetes, or reach out for consulting on your infrastructure challenges!