
Karpenter Configuration for Dory Orchestrator

This directory contains Karpenter configuration for automatic node provisioning.

Files

| File | Purpose |
|------|---------|
| nodepool-application.yaml | NodePool (dory-app-pool) + EC2NodeClass (dory-app-nodeclass) for application workloads |
| ecr-token-refresh-cronjob.yaml | CronJob to refresh ECR tokens every 6 hours |
| ecr-token-refresh-rbac.yaml | RBAC permissions for ECR token refresh |

Quick Start

Prerequisites

  1. EKS Cluster running Kubernetes 1.33+ (tested on 1.35)
  2. Karpenter v1.7+ installed in the cluster (uses the karpenter.sh/v1 / karpenter.k8s.aws/v1 APIs)
  3. IAM Roles configured for Karpenter and nodes
  4. A separate system node group (e.g. an eksctl managed nodegroup), labeled role=system and tainted node-role=system:NoSchedule, for the orchestrator and cluster add-ons. Karpenter only provisions the application pool. See the deployment guide.

Apply Configuration

# Apply NodePool and EC2NodeClass
kubectl apply -f nodepool-application.yaml

# Apply ECR token refresh (required for image pulls)
kubectl apply -f ecr-token-refresh-rbac.yaml
kubectl apply -f ecr-token-refresh-cronjob.yaml

Verify Installation

# Check NodePool
kubectl get nodepool dory-app-pool

# Check EC2NodeClass
kubectl get ec2nodeclass dory-app-nodeclass

# Check Karpenter logs
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter --tail=50

NodePool Configuration

dory-app-pool

| Setting | Value | Notes |
|---------|-------|-------|
| Node Label | workload-type: application | Used by pods for scheduling |
| Capacity Type | on-demand | Predictable for orchestrator workloads |
| Instance Types | t3.small, t3.medium, t3a.small, t3a.medium | Cost-effective AMD64 instances |
| Consolidation | WhenEmpty after 3m | Only consolidate empty nodes |
| Disruption Budget | 20% | Max nodes disrupted at once |
| CPU Limit | 100 cores | Pool-wide limit |
| Memory Limit | 400Gi | Pool-wide limit |
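
To make the disruption budget and pool limits concrete, the sketch below works through the arithmetic. It assumes Karpenter's documented rounding for percentage budgets (allowed disruptions = roundup(total nodes x percentage), minus nodes that are already deleting or NotReady) and that every instance type in the pool has 2 vCPUs, which holds for the small and medium t3/t3a sizes. The function names are hypothetical and not part of this repository.

```shell
# Hypothetical helper: nodes Karpenter may disrupt at once under a percentage budget.
# Assumes the formula roundup(total * pct / 100) - deleting - not_ready, floored at 0.
allowed_disruptions() {
  total=$1; pct=$2; deleting=${3:-0}; not_ready=${4:-0}
  # Integer ceiling of total * pct / 100, then subtract nodes already unavailable.
  allowed=$(( (total * pct + 99) / 100 - deleting - not_ready ))
  if [ "$allowed" -lt 0 ]; then
    allowed=0
  fi
  echo "$allowed"
}

# Hypothetical helper: how many nodes fit under the pool-wide CPU limit.
max_nodes_by_cpu() {
  cpu_limit=$1; vcpus_per_node=$2
  echo $(( cpu_limit / vcpus_per_node ))
}

allowed_disruptions 10 20   # 10 nodes in the pool, 20% budget
max_nodes_by_cpu 100 2      # 100-core limit, 2 vCPUs per node
```

Under these assumptions the CPU limit caps the pool at 50 nodes. Even if all of them were 4 GiB medium instances, that is about 200 GiB of memory, so the 100-core limit is reached before the 400Gi memory limit.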

EC2NodeClass (dory-app-nodeclass)

| Setting | Value |
|---------|-------|
| AMI Family | AL2023 (Amazon Linux 2023) |
| AMI Selector | al2023@latest |
| IAM Role | KarpenterNodeRole-dory-demo |
| Volume Type | GP3 SSD, 30Gi |
| IMDSv2 | Required |
| Encryption | Enabled |

How It Works

Normal Pod Creation (Managed Workloads)

  1. Orchestrator creates pod with nodeSelector: {workload-type: application}
  2. Pod is initially Pending
  3. Karpenter detects Pending pod and provisions EC2 instance
  4. New node joins cluster with workload-type=application label
  5. Pod is scheduled on the new node
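
The flow above can be tried by hand with a pod that carries the same nodeSelector. The following is a minimal sketch, not the orchestrator's actual pod spec: render_app_pod is a hypothetical helper, and the pod name and image are placeholders.

```shell
# Hypothetical helper: print a minimal Pod manifest that targets the application pool.
# The pod name and image are placeholders, not values used by the orchestrator.
render_app_pod() {
  name=$1; image=$2
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${name}
spec:
  nodeSelector:
    workload-type: application
  containers:
    - name: app
      image: ${image}
EOF
}

# Print the manifest; pipe it to `kubectl apply -f -` to create the pod.
render_app_pod demo-app nginx:stable
```

If no existing node with the workload-type=application label has room, the pod should stay Pending until Karpenter provisions one, which can be observed with kubectl get nodeclaims.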

Edge Failover (Critical)

When edge nodes fail, pods are migrated to managed nodes:

  1. Orchestrator detects an edge pod on a node that has been NotReady for more than 30 seconds
  2. Orchestrator force-deletes the stuck edge pod
  3. Orchestrator creates a replacement pod with nodeSelector: {workload-type: application}
  4. Karpenter provisions a node for the failover pod
  5. When the edge node recovers, the pod is migrated back

IMPORTANT: The dory-app-pool NodePool is required for edge failover to work.
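
The detection step can be approximated manually when debugging a failover. The sketch below is an illustration only and does not reproduce the orchestrator's implementation: both function names and the threshold variable are hypothetical, and the 30-second value simply mirrors the rule stated above.

```shell
# Hypothetical threshold, mirroring the ">30 seconds" rule described above.
FAILOVER_THRESHOLD_SECONDS=30

# Succeeds (exit 0) when a node has been NotReady longer than the threshold.
should_fail_over() {
  not_ready_seconds=$1
  [ "$not_ready_seconds" -gt "$FAILOVER_THRESHOLD_SECONDS" ]
}

# Reads `kubectl get nodes --no-headers` output on stdin and prints the names
# of nodes whose STATUS column starts with NotReady.
not_ready_nodes() {
  awk '$2 ~ /^NotReady/ { print $1 }'
}

# Demo with canned input in the format of `kubectl get nodes --no-headers`.
# Against a live cluster: kubectl get nodes --no-headers | not_ready_nodes
printf '%s\n' \
  'edge-1 NotReady <none> 5d v1.33.0' \
  'ip-10-0-1-5 Ready <none> 1d v1.33.0' | not_ready_nodes
```

A pod stuck on a NotReady node can be removed manually with kubectl delete pod <name> --grace-period=0 --force, which corresponds to the force-delete step in the list above.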

Node Decommissioning

  1. Orchestrator deletes pod
  2. Karpenter detects that the node is empty (with the WhenEmpty policy, underutilized nodes are not consolidated)
  3. After the node has been empty for 3 minutes, Karpenter terminates it

Monitoring

# List nodes managed by Karpenter
kubectl get nodes -l karpenter.sh/nodepool=dory-app-pool

# View node claims
kubectl get nodeclaims

# Check node utilization
kubectl top nodes

# View Karpenter logs
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter -f

Troubleshooting

Nodes Not Provisioning

# Check Karpenter logs for errors
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter --tail=100

# Verify NodePool status
kubectl describe nodepool dory-app-pool

# Check EC2NodeClass status
kubectl describe ec2nodeclass dory-app-nodeclass

# Verify subnet/security group discovery tags
aws ec2 describe-subnets --filters "Name=tag:karpenter.sh/discovery,Values=dory-demo"
aws ec2 describe-security-groups --filters "Name=tag:karpenter.sh/discovery,Values=dory-demo"

Pods Stuck in ImagePullBackOff

# Check if ECR secret exists
kubectl get secret ecr-registry-secret -n default

# Manually refresh ECR token (CronJob is in kube-system)
kubectl create job --from=cronjob/ecr-token-refresh ecr-refresh-now -n kube-system

Failover Pods Not Scheduling

# Verify NodePool can provision nodes
kubectl get nodepool dory-app-pool -o yaml | grep -A5 "status:"

# Check if pool has capacity
kubectl describe nodepool dory-app-pool | grep -A10 "Limits:"

References