Karpenter Configuration for Dory Orchestrator
This directory contains Karpenter configuration for automatic node provisioning.
Files
| File | Purpose |
|---|---|
| `nodepool-application.yaml` | NodePool (dory-app-pool) + EC2NodeClass (dory-app-nodeclass) for application workloads |
| `ecr-token-refresh-cronjob.yaml` | CronJob to refresh ECR tokens every 6 hours |
| `ecr-token-refresh-rbac.yaml` | RBAC permissions for ECR token refresh |
Quick Start
Prerequisites
- EKS Cluster running Kubernetes 1.33+ (tested on 1.35)
- Karpenter v1.7+ installed in the cluster (uses the `karpenter.sh/v1` / `karpenter.k8s.aws/v1` APIs)
- IAM Roles configured for Karpenter and nodes
- A separate system node group (e.g. an eksctl managed nodegroup) labeled `role=system` and tainted `node-role=system:NoSchedule` for the orchestrator and cluster add-ons. Karpenter only provisions the application pool. See the deployment guide.
Apply Configuration
# Apply NodePool and EC2NodeClass
kubectl apply -f nodepool-application.yaml
# Apply ECR token refresh (required for image pulls)
kubectl apply -f ecr-token-refresh-rbac.yaml
kubectl apply -f ecr-token-refresh-cronjob.yaml
Verify Installation
# Check NodePool
kubectl get nodepool dory-app-pool
# Check EC2NodeClass
kubectl get ec2nodeclass dory-app-nodeclass
# Check Karpenter logs
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter --tail=50
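To make the checks above scriptable (for example in CI), the following sketch blocks until both resources report a `Ready` condition. It assumes the installed Karpenter version publishes a `Ready` status condition on NodePool and EC2NodeClass objects. The function name `verify_karpenter_resources` and the 60-second default are illustrative and not part of this repository.

```bash
# Hypothetical helper: wait until the NodePool and EC2NodeClass are Ready.
# Assumption: both resources publish a "Ready" status condition.
verify_karpenter_resources() {
  local timeout="${1:-60s}"   # illustrative default
  kubectl wait --for=condition=Ready nodepool/dory-app-pool --timeout="$timeout" &&
    kubectl wait --for=condition=Ready ec2nodeclass/dory-app-nodeclass --timeout="$timeout"
}
```

Calling `verify_karpenter_resources 2m` returns non-zero if either resource is not Ready within the timeout, which makes it usable as a gate before deploying workloads.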
NodePool Configuration
dory-app-pool
| Setting | Value | Notes |
|---|---|---|
| Node Label | `workload-type: application` | Used by pods for scheduling |
| Capacity Type | on-demand | Predictable for orchestrator workloads |
| Instance Types | t3.small, t3.medium, t3a.small, t3a.medium | Cost-effective AMD64 instances |
| Consolidation | WhenEmpty after 3m | Only consolidate empty nodes |
| Disruption Budget | 20% | Max nodes disrupted at once |
| CPU Limit | 100 cores | Pool-wide limit |
| Memory Limit | 400Gi | Pool-wide limit |
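The two pool-wide limits interact with the instance sizes: whichever limit is exhausted first caps the pool. The sketch below estimates that cap for a single instance size. It assumes the published EC2 sizes (t3/t3a.small: 2 vCPU, 2 GiB; t3/t3a.medium: 2 vCPU, 4 GiB) and treats the limits as hard caps, so the result is an estimate rather than a guarantee of Karpenter's behavior. The helper name `max_nodes_under_limits` is hypothetical.

```bash
# Hypothetical helper: estimate how many nodes of one size fit under the pool limits.
# Args: cpu_limit_cores mem_limit_gib node_vcpus node_mem_gib
max_nodes_under_limits() {
  local by_cpu=$(( $1 / $3 ))
  local by_mem=$(( $2 / $4 ))
  # The smaller quotient is the binding limit.
  if [ "$by_cpu" -le "$by_mem" ]; then
    echo "$by_cpu nodes (cpu-bound)"
  else
    echo "$by_mem nodes (memory-bound)"
  fi
}

max_nodes_under_limits 100 400 2 4   # t3.medium / t3a.medium
max_nodes_under_limits 100 400 2 2   # t3.small / t3a.small
```

Both calls print `50 nodes (cpu-bound)`: with these instance types the 100-core limit is reached long before the 400Gi memory limit.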
EC2NodeClass (dory-app-nodeclass)
| Setting | Value |
|---|---|
| AMI Family | AL2023 (Amazon Linux 2023) |
| AMI Selector | al2023@latest |
| IAM Role | KarpenterNodeRole-dory-demo |
| Volume Type | GP3 SSD, 30Gi |
| IMDSv2 | Required |
| Encryption | Enabled |
How It Works
Normal Pod Creation (Managed Workloads)
- Orchestrator creates pod with `nodeSelector: {workload-type: application}`
- Pod is initially Pending
- Karpenter detects Pending pod and provisions EC2 instance
- New node joins cluster with `workload-type=application` label
- Pod is scheduled on the new node
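The provisioning flow above can be exercised by hand with a throwaway pod. The sketch below only prints a manifest; the pod name `karpenter-smoke-test`, the pause image, and the resource requests are illustrative assumptions, while the `nodeSelector` matches the node label used by the pool.

```bash
# Hypothetical helper: print a throwaway Pod manifest that targets the application pool.
# Pod name, image, and requests are illustrative; only the nodeSelector comes from this README.
smoke_pod_manifest() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: karpenter-smoke-test
spec:
  nodeSelector:
    workload-type: application
  containers:
    - name: pause
      image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
      resources:
        requests:
          cpu: 100m
          memory: 64Mi
EOF
}
```

Apply it with `smoke_pod_manifest | kubectl apply -f -`, watch it with `kubectl get pod karpenter-smoke-test -o wide -w`, and remove it with `kubectl delete pod karpenter-smoke-test`. If an existing node in the pool still has room, the pod schedules there and no new node is provisioned.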
Edge Failover (Critical)
When edge nodes fail, pods are migrated to managed nodes:
- Edge pod detected on NotReady node (>30 seconds)
- Orchestrator force-deletes stuck edge pod
- Creates new pod with `nodeSelector: {workload-type: application}`
- Karpenter provisions node for failover pod
- When edge recovers, pod is migrated back
IMPORTANT: The dory-app-pool NodePool is required for edge failover to work.
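The orchestrator performs the force-delete step automatically. When debugging a stuck failover, a manual approximation of that step might look like the sketch below. The helper name `force_delete_pods_on_node` and its arguments are hypothetical, and this is not the orchestrator's own implementation.

```bash
# Hypothetical helper: force-delete the pods in one namespace that are bound to a given node.
# Args: node name, namespace (defaults to "default").
# --force --grace-period=0 removes the pod object without waiting for the kubelet to confirm.
force_delete_pods_on_node() {
  local node="$1" ns="${2:-default}"
  kubectl get pods -n "$ns" --field-selector "spec.nodeName=$node" -o name |
    while read -r pod; do
      kubectl delete "$pod" -n "$ns" --force --grace-period=0 </dev/null
    done
}
```

Find the affected node with `kubectl get nodes` (status `NotReady`), then run `force_delete_pods_on_node <node-name> <namespace>`. Force deletion does not confirm that the container has actually stopped on the unreachable node, so use it only when the node is known to be down.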
Node Decommissioning
- Orchestrator deletes pod
- Karpenter detects that the node is empty (with the WhenEmpty policy, underutilized nodes are not consolidated)
- After the node has been empty for 3 minutes, Karpenter consolidates by terminating it
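To confirm that the live NodePool actually carries the consolidation settings described here, the values can be read back from the cluster. This sketch assumes the `karpenter.sh/v1` field names `spec.disruption.consolidationPolicy` and `spec.disruption.consolidateAfter`; the helper name `nodepool_disruption` is hypothetical.

```bash
# Hypothetical helper: print the live consolidation settings of a NodePool.
# Arg: NodePool name (defaults to dory-app-pool).
nodepool_disruption() {
  kubectl get nodepool "${1:-dory-app-pool}" \
    -o jsonpath='{.spec.disruption.consolidationPolicy}{" after "}{.spec.disruption.consolidateAfter}{"\n"}'
}
```

If the applied manifest matches the configuration table, `nodepool_disruption` should report the WhenEmpty policy with a 3m delay.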
Monitoring
# List nodes managed by Karpenter
kubectl get nodes -l karpenter.sh/nodepool=dory-app-pool
# View node claims
kubectl get nodeclaims
# Check node utilization
kubectl top nodes
# View Karpenter logs
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter -f
Troubleshooting
Nodes Not Provisioning
# Check Karpenter logs for errors
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter --tail=100
# Verify NodePool status
kubectl describe nodepool dory-app-pool
# Check EC2NodeClass status
kubectl describe ec2nodeclass dory-app-nodeclass
# Verify subnet/security group discovery tags
aws ec2 describe-subnets --filters "Name=tag:karpenter.sh/discovery,Values=dory-demo"
aws ec2 describe-security-groups --filters "Name=tag:karpenter.sh/discovery,Values=dory-demo"
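The raw `describe-subnets` output is verbose, so a missing tag is easy to overlook. The sketch below reduces it to a pass/fail check. It assumes the AWS CLI prints nothing for an empty result with `--output text`; the helper name `check_discovery_subnets` is hypothetical.

```bash
# Hypothetical helper: fail loudly when no subnet carries the discovery tag.
# Arg: value of the karpenter.sh/discovery tag (defaults to dory-demo).
check_discovery_subnets() {
  local tag_value="${1:-dory-demo}" ids
  ids="$(aws ec2 describe-subnets \
    --filters "Name=tag:karpenter.sh/discovery,Values=$tag_value" \
    --query 'Subnets[].SubnetId' --output text)"
  if [ -z "$ids" ]; then
    echo "no subnets tagged karpenter.sh/discovery=$tag_value" >&2
    return 1
  fi
  echo "$ids"
}
```

The same pattern applies to security groups by swapping in `describe-security-groups` and the query `SecurityGroups[].GroupId`.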
Pods Stuck in ImagePullBackOff
# Check if ECR secret exists
kubectl get secret ecr-registry-secret -n default
# Manually refresh ECR token (CronJob is in kube-system)
kubectl create job --from=cronjob/ecr-token-refresh ecr-refresh-now -n kube-system
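Creating the job does not by itself show whether the refresh succeeded. A small wrapper, sketched below, creates the job and then waits for its `complete` condition. The helper name `refresh_ecr_token_now` and the 120-second timeout are illustrative assumptions.

```bash
# Hypothetical helper: trigger a one-off ECR token refresh and wait for it to finish.
# Arg: job name (defaults to ecr-refresh-now); a job name cannot be reused until the old job is deleted.
refresh_ecr_token_now() {
  local job="${1:-ecr-refresh-now}"
  kubectl create job --from=cronjob/ecr-token-refresh "$job" -n kube-system &&
    kubectl wait --for=condition=complete "job/$job" -n kube-system --timeout=120s
}
```

If the wait times out, `kubectl logs job/ecr-refresh-now -n kube-system` usually shows why the refresh failed.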
Failover Pods Not Scheduling
# Verify NodePool can provision nodes
kubectl get nodepool dory-app-pool -o yaml | grep -A5 "status:"
# Check if pool has capacity
kubectl describe nodepool dory-app-pool | grep -A10 "Limits:"
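A pool that has reached its CPU limit will leave failover pods Pending even though the NodePool itself is healthy. The sketch below prints provisioned CPU next to the configured limit. It assumes the `karpenter.sh/v1` NodePool reports provisioned capacity under `status.resources`; the helper name `nodepool_cpu_usage` is hypothetical.

```bash
# Hypothetical helper: show provisioned CPU against the pool-wide CPU limit.
# Arg: NodePool name (defaults to dory-app-pool).
nodepool_cpu_usage() {
  kubectl get nodepool "${1:-dory-app-pool}" \
    -o jsonpath='{"cpu used: "}{.status.resources.cpu}{" / limit: "}{.spec.limits.cpu}{"\n"}'
}
```

If the two numbers are close, raise the limit in `nodepool-application.yaml` or free capacity before retrying. Pods that are still waiting can be listed with `kubectl get pods --all-namespaces --field-selector status.phase=Pending`.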