Building Production-Ready EKS Clusters with AWS CDK
Amazon Elastic Kubernetes Service (EKS) has become the de facto standard for running Kubernetes workloads on AWS. However, deploying a production-ready EKS cluster involves much more than just clicking a button in the console. In this post, I'll walk you through building a robust, scalable, and cost-optimized EKS cluster using AWS CDK (Cloud Development Kit).
Why AWS CDK for EKS?
Infrastructure as Code (IaC) is essential for modern DevOps practices. While CloudFormation and Terraform are popular choices, AWS CDK offers several advantages:
- Type Safety: Write infrastructure code in TypeScript, Python, or Java with full IDE support
- Constructs Library: High-level abstractions for complex AWS resources
- Reusability: Create and share custom constructs across projects
- Native AWS Integration: Synthesizes to CloudFormation, with broad coverage of AWS services
- Testability: Unit test your infrastructure code before deployment
Architecture Overview
Our production EKS cluster follows AWS Well-Architected Framework principles:
Network Architecture
VPC (10.0.0.0/16)
├── Public Subnets (3 AZs)
│   ├── 10.0.1.0/24 (us-east-1a)
│   ├── 10.0.2.0/24 (us-east-1b)
│   └── 10.0.3.0/24 (us-east-1c)
│
├── Private Subnets (3 AZs)
│   ├── 10.0.11.0/24 (us-east-1a)
│   ├── 10.0.12.0/24 (us-east-1b)
│   └── 10.0.13.0/24 (us-east-1c)
│
└── NAT Gateway (single, for cost optimization)
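The subnet layout above is just arithmetic over the VPC's /16 block. As a quick sanity check, this standalone snippet derives the same /24 CIDRs shown in the diagram (the third-octet offsets are chosen to match the diagram, not computed by CDK itself):

```typescript
// Derive /24 subnet CIDRs from the /16 VPC base address.
// Illustrative only: CDK allocates subnet CIDRs automatically.
function subnetCidr(vpcBase: string, thirdOctet: number): string {
  const octets = vpcBase.split('/')[0].split('.').map(Number);
  return `${octets[0]}.${octets[1]}.${thirdOctet}.0/24`;
}

const publicSubnets = [1, 2, 3].map((n) => subnetCidr('10.0.0.0/16', n));
const privateSubnets = [11, 12, 13].map((n) => subnetCidr('10.0.0.0/16', n));

console.log(publicSubnets);  // [ '10.0.1.0/24', '10.0.2.0/24', '10.0.3.0/24' ]
console.log(privateSubnets); // [ '10.0.11.0/24', '10.0.12.0/24', '10.0.13.0/24' ]
```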
EKS Components
- Control Plane: Managed by AWS in Multi-AZ configuration
- Worker Nodes: EC2 instances in private subnets across 3 AZs
- Add-ons: VPC CNI, CoreDNS, kube-proxy, EBS CSI Driver
- Observability: CloudWatch Container Insights, Prometheus, Grafana
- Ingress: AWS Load Balancer Controller for ALB/NLB integration
Implementation with AWS CDK
Setting Up the CDK Project
// lib/eks-cluster-stack.ts
import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as eks from 'aws-cdk-lib/aws-eks';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as iam from 'aws-cdk-lib/aws-iam';

export class EksClusterStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // VPC with a cost-optimized NAT Gateway configuration
    const vpc = new ec2.Vpc(this, 'EksVpc', {
      maxAzs: 3,
      natGateways: 1, // Single NAT Gateway to reduce cost
      subnetConfiguration: [
        {
          name: 'Public',
          subnetType: ec2.SubnetType.PUBLIC,
          cidrMask: 24,
        },
        {
          name: 'Private',
          subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS,
          cidrMask: 24,
        },
      ],
    });

    // EKS cluster (recent CDK versions also require a kubectlLayer prop
    // matching the cluster version)
    const cluster = new eks.Cluster(this, 'EksCluster', {
      version: eks.KubernetesVersion.V1_34,
      vpc,
      vpcSubnets: [{ subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS }],
      defaultCapacity: 0, // Managed node groups are added separately
    });
  }
}
Node Group Configuration
// Multi-AZ managed node group
cluster.addNodegroupCapacity('ManagedNodeGroup', {
  instanceTypes: [new ec2.InstanceType('t3.medium')],
  minSize: 2,
  maxSize: 10,
  desiredSize: 3,
  subnets: { subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS },
  labels: {
    role: 'general-purpose',
  },
  tags: {
    // Auto-discovery tags read by the Cluster Autoscaler
    'k8s.io/cluster-autoscaler/enabled': 'true',
    [`k8s.io/cluster-autoscaler/${cluster.clusterName}`]: 'owned',
  },
});
Essential Add-ons and Controllers
1. AWS Load Balancer Controller
The AWS Load Balancer Controller enables Kubernetes Ingress resources to provision Application Load Balancers automatically.
// IRSA service account for the controller, with the required IAM
// permissions (abridged; the upstream project publishes the full policy)
const albServiceAccount = cluster.addServiceAccount('AlbControllerSA', {
  name: 'aws-load-balancer-controller',
  namespace: 'kube-system',
});

albServiceAccount.addToPrincipalPolicy(
  new iam.PolicyStatement({
    actions: [
      'ec2:CreateTags',
      'ec2:DeleteTags',
      'elasticloadbalancing:*',
      'ec2:AuthorizeSecurityGroupIngress',
      'ec2:RevokeSecurityGroupIngress',
      'ec2:DeleteSecurityGroup',
    ],
    resources: ['*'],
  })
);

// Install the AWS Load Balancer Controller via Helm, reusing the
// service account created above instead of letting the chart create one
cluster.addHelmChart('AwsLoadBalancerController', {
  chart: 'aws-load-balancer-controller',
  repository: 'https://aws.github.io/eks-charts',
  namespace: 'kube-system',
  values: {
    clusterName: cluster.clusterName,
    serviceAccount: {
      create: false,
      name: 'aws-load-balancer-controller',
    },
  },
});
2. EBS CSI Driver
For persistent storage, the EBS CSI Driver is essential:
// Enable the EBS CSI Driver (value names vary between chart versions)
cluster.addHelmChart('EbsCsiDriver', {
  chart: 'aws-ebs-csi-driver',
  repository: 'https://kubernetes-sigs.github.io/aws-ebs-csi-driver',
  namespace: 'kube-system',
  values: {
    enableVolumeScheduling: true,
    enableVolumeResizing: true,
    enableVolumeSnapshot: true,
  },
});
3. Cluster Autoscaler
Automatic node scaling based on pod resource requirements:
// Cluster Autoscaler Deployment (simplified: the RBAC resources and the
// 'cluster-autoscaler' service account must also exist)
const clusterAutoscaler = cluster.addManifest('ClusterAutoscaler', {
  apiVersion: 'apps/v1',
  kind: 'Deployment',
  metadata: {
    name: 'cluster-autoscaler',
    namespace: 'kube-system',
  },
  spec: {
    selector: {
      matchLabels: {
        app: 'cluster-autoscaler',
      },
    },
    template: {
      metadata: {
        labels: {
          app: 'cluster-autoscaler',
        },
      },
      spec: {
        serviceAccountName: 'cluster-autoscaler',
        containers: [
          {
            name: 'cluster-autoscaler',
            // Pin the image to the same minor version as the cluster
            image: 'registry.k8s.io/autoscaling/cluster-autoscaler:v1.34.0',
            command: [
              './cluster-autoscaler',
              '--cloud-provider=aws',
              '--aws-region=us-east-1',
              `--node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/${cluster.clusterName}`,
              '--balance-similar-node-groups',
              '--skip-nodes-with-system-pods=false',
            ],
          },
        ],
      },
    },
  },
});
Security Best Practices
1. IAM Roles for Service Accounts (IRSA)
IRSA provides fine-grained IAM permissions to Kubernetes pods:
const s3AccessServiceAccount = cluster.addServiceAccount('S3AccessServiceAccount', {
  name: 's3-access',
  namespace: 'default',
});

s3AccessServiceAccount.addToPrincipalPolicy(
  new iam.PolicyStatement({
    actions: ['s3:GetObject', 's3:PutObject'],
    resources: ['arn:aws:s3:::my-bucket/*'],
  })
);
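Under the hood, IRSA works by creating an IAM role whose trust policy federates with the cluster's OIDC provider and is scoped to a single service account. A sketch of the trust policy that gets generated is shown below; the OIDC provider ID and account number are placeholders, not values from a real cluster:

```typescript
// Sketch of the IAM trust policy IRSA generates: the role trusts the
// cluster's OIDC provider, restricted to one Kubernetes service account.
// Provider URL and account ID below are placeholders.
const oidcProvider =
  'oidc.eks.us-east-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B71EXAMPLE';

const trustPolicy = {
  Version: '2012-10-17',
  Statement: [
    {
      Effect: 'Allow',
      Principal: {
        Federated: `arn:aws:iam::123456789012:oidc-provider/${oidcProvider}`,
      },
      Action: 'sts:AssumeRoleWithWebIdentity',
      Condition: {
        StringEquals: {
          // Only pods using the 's3-access' service account in the
          // 'default' namespace may assume this role
          [`${oidcProvider}:sub`]: 'system:serviceaccount:default:s3-access',
          [`${oidcProvider}:aud`]: 'sts.amazonaws.com',
        },
      },
    },
  ],
};

console.log(JSON.stringify(trustPolicy, null, 2));
```

CDK's `addServiceAccount` assembles this role and annotates the Kubernetes service account with its ARN, so pods never handle long-lived credentials.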
2. Network Policies
Implement network segmentation using Calico or Cilium:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-ingress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress: []
3. Pod Security Standards
Enforce pod security policies:
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
Monitoring and Observability
CloudWatch Container Insights
Enable comprehensive cluster monitoring:
// Enable CloudWatch Container Insights
cluster.addHelmChart('CloudWatchInsights', {
  chart: 'aws-cloudwatch-metrics',
  repository: 'https://aws.github.io/eks-charts',
  namespace: 'amazon-cloudwatch',
  values: {
    clusterName: cluster.clusterName,
  },
});
Prometheus and Grafana Stack
Deploy the kube-prometheus-stack for advanced monitoring:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --values prometheus-values.yaml
Cost Optimization Strategies
1. Spot Instances for Non-Critical Workloads
cluster.addNodegroupCapacity('SpotNodeGroup', {
  instanceTypes: [
    new ec2.InstanceType('t3.medium'),
    new ec2.InstanceType('t3.large'),
  ],
  capacityType: eks.CapacityType.SPOT,
  minSize: 0,
  maxSize: 5,
  labels: {
    role: 'spot',
  },
  taints: [
    {
      key: 'spot',
      value: 'true',
      effect: eks.TaintEffect.NO_SCHEDULE,
    },
  ],
});
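To get a feel for why the taint and the extra node group are worth the trouble, here is a back-of-the-envelope savings estimate. The hourly prices are hypothetical placeholders (real spot prices fluctuate per AZ and instance type), so treat the output as an illustration, not a quote:

```typescript
// Rough monthly cost comparison for the spot node group above.
// Prices are hypothetical placeholders, not current AWS pricing.
const HOURS_PER_MONTH = 730;

function monthlyCost(hourlyPrice: number, nodeCount: number): number {
  return hourlyPrice * nodeCount * HOURS_PER_MONTH;
}

const onDemand = monthlyCost(0.0416, 5); // t3.medium on-demand (placeholder)
const spot = monthlyCost(0.0125, 5);     // t3.medium spot (placeholder)

const savingsPct = Math.round((1 - spot / onDemand) * 100);
console.log(`Spot saves roughly ${savingsPct}% at these sample prices`);
```

Savings in the 60-90% range versus on-demand are typical for spot capacity, which is why routing interruptible workloads there pays off quickly.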
2. Right-Sizing with Vertical Pod Autoscaler
Install VPA to automatically adjust resource requests, using the setup script from the autoscaler repository:
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh
3. Automated Scaling During Off-Hours
Use CronJobs or external schedulers to scale down during non-business hours:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-down-evening
spec:
  schedule: "0 18 * * 1-5" # 6 PM weekdays
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: kubectl
              image: bitnami/kubectl:latest
              command:
                - /bin/sh
                - -c
                - kubectl scale deployment myapp --replicas=1
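If cron syntax is unfamiliar, the five fields of `"0 18 * * 1-5"` read as minute 0, hour 18, any day-of-month, any month, Monday through Friday. This toy matcher (it handles only `*`, plain numbers, and ranges, not full cron syntax) demonstrates the reading:

```typescript
// Minimal cron-field matcher covering only '*', numbers, and ranges.
function matchesField(field: string, value: number): boolean {
  if (field === '*') return true;
  if (field.includes('-')) {
    const [lo, hi] = field.split('-').map(Number);
    return value >= lo && value <= hi;
  }
  return Number(field) === value;
}

function cronMatches(expr: string, d: Date): boolean {
  const [min, hour, dom, mon, dow] = expr.split(' ');
  return (
    matchesField(min, d.getMinutes()) &&
    matchesField(hour, d.getHours()) &&
    matchesField(dom, d.getDate()) &&
    matchesField(mon, d.getMonth() + 1) &&
    matchesField(dow, d.getDay()) // Sunday = 0 ... Saturday = 6
  );
}

console.log(cronMatches('0 18 * * 1-5', new Date(2024, 0, 5, 18, 0))); // true  (a Friday)
console.log(cronMatches('0 18 * * 1-5', new Date(2024, 0, 6, 18, 0))); // false (a Saturday)
```

Pair this schedule with a morning scale-up job so capacity returns before business hours.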
Deployment Pipeline Integration
GitOps with ArgoCD
Implement GitOps for declarative cluster management:
# Install ArgoCD
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
# Access ArgoCD UI
kubectl port-forward svc/argocd-server -n argocd 8080:443
Helm Chart Deployment
Example application deployment with Helm:
# values.yaml
replicaCount: 3

image:
  repository: my-registry/my-app
  tag: v1.0.0
  pullPolicy: IfNotPresent

resources:
  limits:
    cpu: 500m
    memory: 512Mi
  requests:
    cpu: 250m
    memory: 256Mi

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
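The autoscaling values above drive the Horizontal Pod Autoscaler, whose core formula is documented as `desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)`, clamped to the min/max bounds. A quick model of how the 70% CPU target behaves:

```typescript
// Model of the HPA scaling decision for the values.yaml above:
// desired = ceil(current * currentUtil / targetUtil), clamped to [min, max].
function desiredReplicas(
  current: number,
  currentUtilPct: number,
  targetUtilPct: number,
  min: number,
  max: number
): number {
  const raw = Math.ceil(current * (currentUtilPct / targetUtilPct));
  return Math.min(max, Math.max(min, raw));
}

console.log(desiredReplicas(3, 90, 70, 2, 10)); // CPU at 90% -> scale up to 4
console.log(desiredReplicas(3, 20, 70, 2, 10)); // CPU at 20% -> scale down to 2 (min)
```

Note the real controller also applies a tolerance band and stabilization windows to avoid flapping; this sketch shows only the headline formula.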
Disaster Recovery and Backup
Velero for Cluster Backup
# Install Velero
velero install \
  --provider aws \
  --bucket eks-backup-bucket \
  --secret-file ./credentials-velero \
  --backup-location-config region=us-east-1 \
  --snapshot-location-config region=us-east-1

# Create scheduled backup (daily at 2 AM)
velero schedule create daily-backup --schedule="0 2 * * *"
Key Takeaways
- Infrastructure as Code: Use AWS CDK for reproducible, version-controlled infrastructure
- Multi-AZ Design: Ensure high availability across availability zones
- Security First: Implement IRSA, network policies, and pod security standards
- Cost Optimization: Leverage spot instances, autoscaling, and right-sizing
- Observability: Deploy comprehensive monitoring with CloudWatch and Prometheus
- GitOps: Adopt declarative configuration management with ArgoCD
- Automation: Automate everything from deployment to scaling and backup
Conclusion
Building production-ready EKS clusters requires careful planning and implementation of best practices across security, scalability, cost, and operations. AWS CDK provides an excellent framework for codifying these practices and ensuring consistent, reliable deployments.
The investment in proper EKS architecture pays dividends in:
- Reduced operational overhead through automation
- Improved reliability with multi-AZ deployments
- Cost savings through optimization strategies
- Enhanced security with a defense-in-depth approach
- Better developer experience with self-service capabilities
Ready to build your own EKS cluster? Start with the CDK examples above and customize them for your specific requirements!
Questions or feedback? Feel free to reach out through the contact form. I'd love to hear about your EKS journey!