Karpenter Configuration for Dory Orchestrator

This directory contains Karpenter configuration for automatic node provisioning.

Files

| File | Purpose |
| --- | --- |
| `nodepool-application.yaml` | NodePool (`dory-app-pool`) + EC2NodeClass (`dory-app-nodeclass`) for application workloads |
| `ecr-token-refresh-cronjob.yaml` | CronJob to refresh ECR tokens every 6 hours |
| `ecr-token-refresh-rbac.yaml` | RBAC permissions for ECR token refresh |

Quick Start

Prerequisites

EKS Cluster running Kubernetes 1.33+ (tested on 1.35)
Karpenter v1.7+ installed in the cluster (uses the karpenter.sh/v1 / karpenter.k8s.aws/v1 APIs)
IAM Roles configured for Karpenter and nodes
A separate system node group (e.g. an eksctl managed nodegroup), labeled `role=system` and tainted `node-role=system:NoSchedule`, for the orchestrator and cluster add-ons. Karpenter provisions only the application pool. See the deployment guide.
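
For illustration, a workload meant for the system node group would need both a matching node selector and a toleration for the taint described above. The fragment below is a minimal sketch of the relevant part of a pod spec, assuming the label and taint are exactly `role=system` and `node-role=system:NoSchedule`; the orchestrator's actual manifest may differ.

```yaml
# Fragment of a pod template spec (sketch, not the orchestrator's real manifest).
# Pins the pod to the system node group and tolerates its taint.
nodeSelector:
  role: system              # label on the system node group
tolerations:
  - key: node-role          # taint key from the prerequisite above
    operator: Equal
    value: system
    effect: NoSchedule
```

Without the toleration, the `NoSchedule` taint keeps the pod off the system nodes even when the node selector matches.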

Apply Configuration

```bash
# Apply NodePool and EC2NodeClass
kubectl apply -f nodepool-application.yaml

# Apply ECR token refresh (required for image pulls)
kubectl apply -f ecr-token-refresh-rbac.yaml
kubectl apply -f ecr-token-refresh-cronjob.yaml
```

Verify Installation

```bash
# Check NodePool
kubectl get nodepool dory-app-pool

# Check EC2NodeClass
kubectl get ec2nodeclass dory-app-nodeclass

# Check Karpenter logs
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter --tail=50
```

NodePool Configuration

dory-app-pool

| Setting | Value | Notes |
| --- | --- | --- |
| Node Label | `workload-type: application` | Used by pods for scheduling |
| Capacity Type | on-demand | Predictable capacity (no Spot interruptions) for orchestrator-managed workloads |
| Instance Types | t3.small, t3.medium, t3a.small, t3a.medium | Cost-effective x86_64 (amd64) instances |
| Consolidation | `WhenEmpty` after 3m | Only consolidate empty nodes |
| Disruption Budget | 20% | Max share of nodes disrupted at once |
| CPU Limit | 100 cores | Pool-wide limit |
| Memory Limit | 400Gi | Pool-wide limit |
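
As a rough guide to how these settings map onto the `karpenter.sh/v1` API, the manifest below is a sketch reconstructed from the table, not a copy of `nodepool-application.yaml`. Field names follow the v1 NodePool schema; anything not listed in the table (for example, additional requirements or an expiry setting) is deliberately left out and may exist in the real file.

```yaml
# Sketch of a NodePool matching the table above (assumed layout).
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: dory-app-pool
spec:
  template:
    metadata:
      labels:
        workload-type: application      # label pods select on
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: dory-app-nodeclass
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["t3.small", "t3.medium", "t3a.small", "t3a.medium"]
  disruption:
    consolidationPolicy: WhenEmpty      # only empty nodes are consolidated
    consolidateAfter: 3m
    budgets:
      - nodes: "20%"                    # max share of nodes disrupted at once
  limits:
    cpu: "100"                          # pool-wide CPU cores
    memory: 400Gi                       # pool-wide memory
```

Once the pool reaches either limit, Karpenter stops launching new nodes for it, which is worth keeping in mind when failover pods stay Pending.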

EC2NodeClass (dory-app-nodeclass)

| Setting | Value |
| --- | --- |
| AMI Family | AL2023 (Amazon Linux 2023) |
| AMI Selector | `al2023@latest` |
| IAM Role | `KarpenterNodeRole-dory-demo` |
| Volume Type | gp3 SSD, 30Gi |
| IMDSv2 | Required |
| Encryption | Enabled |
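
The corresponding EC2NodeClass could look roughly like the sketch below. It is assembled from the table plus the `karpenter.sh/discovery=dory-demo` tags used in the troubleshooting commands; the root device name `/dev/xvda` is an assumption (the usual AL2023 root device), and the real `nodepool-application.yaml` may set further fields.

```yaml
# Sketch of an EC2NodeClass matching the table above (assumed layout).
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: dory-app-nodeclass
spec:
  amiSelectorTerms:
    - alias: al2023@latest              # latest Amazon Linux 2023 AMI
  role: KarpenterNodeRole-dory-demo     # IAM role attached to the nodes
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: dory-demo
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: dory-demo
  blockDeviceMappings:
    - deviceName: /dev/xvda             # assumed root device name
      ebs:
        volumeSize: 30Gi
        volumeType: gp3
        encrypted: true
  metadataOptions:
    httpTokens: required                # enforce IMDSv2
```

Pinning a specific alias version instead of `@latest` is a common choice when unplanned AMI changes are undesirable, at the cost of manual updates.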

How It Works

Normal Pod Creation (Managed Workloads)

Orchestrator creates pod with nodeSelector: {workload-type: application}
Pod is initially Pending
Karpenter detects Pending pod and provisions EC2 instance
New node joins cluster with workload-type=application label
Pod is scheduled on the new node
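
The kind of pod that triggers this flow can be sketched as follows. The pod name, image, and resource requests are hypothetical placeholders; only the `nodeSelector` comes from the steps above. Karpenter sizes the new node from the pod's resource requests, so setting them explicitly is generally advisable.

```yaml
# Sketch of a pod that targets the application pool (names are placeholders).
apiVersion: v1
kind: Pod
metadata:
  name: example-managed-workload        # hypothetical name
spec:
  nodeSelector:
    workload-type: application          # matches the dory-app-pool node label
  containers:
    - name: app
      image: registry.k8s.io/pause:3.9  # placeholder image for illustration
      resources:
        requests:
          cpu: 250m                     # assumed values; used for node sizing
          memory: 256Mi
```

If no existing node with the `workload-type=application` label has room, the pod stays Pending until Karpenter launches one.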

Edge Failover (Critical)

When edge nodes fail, pods are migrated to managed nodes:

Edge pod detected on NotReady node (>30 seconds)
Orchestrator force-deletes stuck edge pod
Creates new pod with nodeSelector: {workload-type: application}
Karpenter provisions node for failover pod
When edge recovers, pod is migrated back

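IMPORTANT: The dory-app-pool NodePool is required for edge failover to work. Without it, no new nodes labeled workload-type=application are provisioned, so failover pods can remain Pending.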

Node Decommissioning

Orchestrator deletes pod
Karpenter detects the empty node (with the WhenEmpty policy, underutilized nodes are not consolidated)
After 3 minutes, Karpenter consolidates by terminating the node

Monitoring

```bash
# List nodes managed by Karpenter
kubectl get nodes -l karpenter.sh/nodepool=dory-app-pool

# View node claims
kubectl get nodeclaims

# Check node utilization
kubectl top nodes

# View Karpenter logs
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter -f
```

Troubleshooting

Nodes Not Provisioning

```bash
# Check Karpenter logs for errors
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter --tail=100

# Verify NodePool status
kubectl describe nodepool dory-app-pool

# Check EC2NodeClass status
kubectl describe ec2nodeclass dory-app-nodeclass

# Verify subnet/security group discovery tags
aws ec2 describe-subnets --filters "Name=tag:karpenter.sh/discovery,Values=dory-demo"
aws ec2 describe-security-groups --filters "Name=tag:karpenter.sh/discovery,Values=dory-demo"
```

Pods Stuck in ImagePullBackOff

```bash
# Check if ECR secret exists
kubectl get secret ecr-registry-secret -n default

# Manually refresh ECR token (CronJob is in kube-system)
kubectl create job --from=cronjob/ecr-token-refresh ecr-refresh-now -n kube-system
```

Failover Pods Not Scheduling

```bash
# Verify NodePool can provision nodes
kubectl get nodepool dory-app-pool -o yaml | grep -A5 "status:"

# Check if pool has capacity
kubectl describe nodepool dory-app-pool | grep -A10 "Limits:"
```