Architecture
The Dory Orchestrator is a single-replica Go control plane (version v0.1.0) that reconciles processor rows in a PostgreSQL database into running Kubernetes pods. It owns pod lifecycle, bin-packing placement, Karpenter-driven scaling, edge failover, and state-preserving migrations.
Components
| Component | Package | Responsibility |
|---|---|---|
| Config client | config | Polls PostgreSQL (ListProcessorConfigs) for the desired processor set |
| Reconciler | pkg/reconciler | Computes the desired-vs-actual diff and drives pod create/delete, failover, consolidation |
| Scheduler | pkg/scheduler | First-fit bin-packing placement; emits provision_node when nothing fits |
| Migrator | pkg/migrator | Create-before-delete pod migration (default and enhanced/validated paths) |
| State transfer | pkg/state | HTTP /state capture and restore between pods |
| Drain manager | pkg/drain | State-preserving migration when an application node is cordoned |
| Failover manager | failover | Edge node failover/failback decisions (see Edge Failover) |
| Edge store | pkg/state | Edge node registration, heartbeat, decommission |
| Event monitor | monitor | Optional pod/node watch; triggers drain on a NoSchedule taint |
| K8s client | pkg/k8s | Pod spec construction and cluster operations |
| Health manager | health | Liveness/readiness checkers (DB + K8s) |
| Shutdown manager | - | Gates reconciliation and coordinates graceful shutdown |
| HTTP server | cmd/orchestrator | Single server on :MetricsPort for metrics, health, and edge APIs |
graph TD
DB[(PostgreSQL)] -->|ListProcessorConfigs| CW[Config Watcher]
CW -->|onChange every PollInterval| REC[Reconciler]
REC --> SCH[Scheduler]
REC --> MIG[Migrator]
REC --> FO[Failover Manager]
SCH -->|provision_node| KARP[Karpenter]
MIG --> ST[State Transfer]
REC -->|create/delete| K8S[(Kubernetes API)]
MON[Event Monitor<br/>--enable-monitor] -->|node cordoned| DRAIN[Drain Manager]
DRAIN --> ST
DRAIN --> K8S
HTTP[HTTP Server :8080] --> METRICS[/metrics, /healthz, /readyz, /livez/]
HTTP --> EDGE[/POST /api/v1/edge/.../]
HP[Health Poller] -->|/healthz| K8S
HTTP server
A single http.Server listens on :MetricsPort (default 8080) and serves:
- GET /metrics: Prometheus metrics (see Metrics)
- GET /healthz, GET /readyz, GET /livez: orchestrator health
- POST /api/v1/edge/heartbeat
- POST /api/v1/edge/nodes
- POST /api/v1/edge/nodes/decommission
See the API Reference for request and response shapes.
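As a rough illustration of the single-server layout, the sketch below registers all of the listed routes on one mux. It is not the orchestrator's code: the `only` and `statusFor` helpers are hypothetical, the handler bodies are placeholders, and answering a wrong method with 405 is an assumption rather than documented behavior.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// only wraps h so that any other HTTP method is answered with 405.
// Assumption: the real server's handling of a wrong method is not documented here.
func only(method string, h http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.Method != method {
			w.Header().Set("Allow", method)
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		h(w, r)
	}
}

// newMux registers every documented route on one ServeMux.
// The shared handler is a placeholder that only returns 200.
func newMux() *http.ServeMux {
	ok := func(w http.ResponseWriter, _ *http.Request) { w.WriteHeader(http.StatusOK) }
	mux := http.NewServeMux()
	for _, path := range []string{"/metrics", "/healthz", "/readyz", "/livez"} {
		mux.HandleFunc(path, only(http.MethodGet, ok))
	}
	for _, path := range []string{
		"/api/v1/edge/heartbeat",
		"/api/v1/edge/nodes",
		"/api/v1/edge/nodes/decommission",
	} {
		mux.HandleFunc(path, only(http.MethodPost, ok))
	}
	return mux
}

// statusFor sends one in-memory request through the mux and returns the status code.
func statusFor(mux *http.ServeMux, method, path string) int {
	rec := httptest.NewRecorder()
	mux.ServeHTTP(rec, httptest.NewRequest(method, path, nil))
	return rec.Code
}

func main() {
	mux := newMux()
	fmt.Println(statusFor(mux, http.MethodGet, "/healthz"))
	fmt.Println(statusFor(mux, http.MethodGet, "/api/v1/edge/heartbeat"))
}
```

In a real binary the mux would be handed to an `http.Server` whose `Addr` is built from MetricsPort; the in-memory recorder is used here only so the sketch runs without opening a socket.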
Startup sequence
- Parse CLI flags.
- Layered config load: file → defaults → CLI → env (see Configuration).
- Validate the resolved config.
- Initialize the logger.
- Create the PostgreSQL config client and pgx connection pool.
- Initialize the EdgeStore.
- Initialize the Kubernetes client.
- Build the health manager with DB and K8s checkers.
- If EnableStartupValidation (default true), run critical checks: database_connectivity, kubernetes_connectivity, namespace_exists. Any failure is fatal.
- Construct scheduler, migrator, failover manager, reconciler, and shutdown manager.
- Start the HTTP server.
- Start the config-watcher goroutine (the control loop).
- Start the processor health poller (polls each running pod's /healthz, concurrency 10).
- MarkReady.
- Block on SIGINT/SIGTERM or a watcher error.
- GracefulShutdown.
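The "any failure is fatal" rule for the startup checks can be sketched as a fail-fast loop. Everything below is illustrative: the `check` type, `runCritical`, and `errUnreachable` are hypothetical names, and only the three check names come from the list above.

```go
package main

import (
	"errors"
	"fmt"
)

// check is a named startup probe (hypothetical type for this sketch).
type check struct {
	name string
	run  func() error
}

// errUnreachable simulates a dependency that cannot be contacted.
var errUnreachable = errors.New("unreachable")

// runCritical runs the checks in order and stops at the first failure,
// because a failed critical check aborts startup.
func runCritical(checks []check) error {
	for _, c := range checks {
		if err := c.run(); err != nil {
			return fmt.Errorf("startup check %q failed: %w", c.name, err)
		}
	}
	return nil
}

func main() {
	pass := func() error { return nil }
	checks := []check{
		{"database_connectivity", pass},
		{"kubernetes_connectivity", func() error { return errUnreachable }},
		{"namespace_exists", pass},
	}
	if err := runCritical(checks); err != nil {
		// A real binary would log this and exit non-zero.
		fmt.Println("fatal:", err)
	}
}
```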
Note
There is no leader election. The Deployment runs replicas=1 with the Recreate strategy. See Deployment.
The control loop is the config watcher
There is no dedicated reconcile goroutine. The config watcher is the control loop:
- Every PollInterval, poll the DB via ListProcessorConfigs.
- Always call onChange(processors), even when nothing changed, so that consolidation can run.
- onChange invokes recon.Reconcile(ctx, processors) with a ReconciliationTimeout context, gated by the shutdown manager.
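The three steps above can be sketched as a ticker-driven loop. This is a minimal sketch under assumptions, not the orchestrator's implementation: `Processor`, `watch`, `accepting`, and `countCalls` are hypothetical names, and treating a failed poll as a loop-ending watcher error is a guess.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Processor stands in for config.ProcessorConfig; only the ID is modeled.
type Processor struct{ ID string }

// watch polls on every tick and always calls onChange, even when the result
// is identical to the previous poll. It returns when ctx is cancelled or a
// poll fails (assumption: a poll failure counts as a watcher error).
func watch(
	ctx context.Context,
	pollInterval, reconcileTimeout time.Duration,
	list func(context.Context) ([]Processor, error),
	accepting func() bool, // shutdown gate: false once shutdown has begun
	onChange func(context.Context, []Processor),
) error {
	ticker := time.NewTicker(pollInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			// Re-check cancellation so a tick that races with Done is ignored.
			if err := ctx.Err(); err != nil {
				return err
			}
			procs, err := list(ctx)
			if err != nil {
				return err
			}
			if !accepting() {
				continue
			}
			// Each reconcile cycle gets its own deadline.
			rctx, cancel := context.WithTimeout(ctx, reconcileTimeout)
			onChange(rctx, procs)
			cancel()
		}
	}
}

// countCalls runs the loop against an unchanging processor set, stops after
// n reconcile calls, and returns how many times onChange fired.
func countCalls(n int) int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	calls := 0
	list := func(context.Context) ([]Processor, error) {
		return []Processor{{ID: "p1"}}, nil // same result on every poll
	}
	onChange := func(context.Context, []Processor) {
		calls++
		if calls == n {
			cancel()
		}
	}
	_ = watch(ctx, time.Millisecond, time.Second, list, func() bool { return true }, onChange)
	return calls
}

func main() {
	fmt.Println("onChange calls with no config change:", countCalls(3))
}
```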
Tip
Because onChange fires on every poll regardless of change, consolidation gets a chance to run on a steady-state cluster, not only when the desired set changes.
The reconcile cycle
Reconcile(ctx, desiredApps []config.ProcessorConfig) runs these phases in order:
- gatherClusterState: list pods labeled managed-by=dory-orchestrator (the actual set) and list all pods (capacity).
- calculateDiff: see below.
- deleteRemovedPods: delete pods no longer desired.
- createNewPods: create pods for newly desired processors.
- handleFailover: edge failover decisions.
- runConsolidation: bin-pack onto fewer nodes when possible.
Diff calculation
- Desired pods are keyed by processor ID.
- Actual pods are keyed by pod.Labels["processor-id"].
- Pods in the PodFailed phase are scheduled for recreation.
- ToAdd = desired keys absent from actual.
- ToRemove = actual keys absent from desired.
One processor ID maps to exactly one pod. Pods are immutable, so an "update" is a delete-plus-recreate. See Pod Lifecycle.
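A minimal sketch of the diff rules follows. The `Pod`, `Diff`, and `Phase` types are simplified stand-ins, and the `ToRecreate` field is a hypothetical way to model "failed pods are recreated"; the source does not say how the real reconciler represents that case.

```go
package main

import (
	"fmt"
	"sort"
)

type Phase string

const (
	PodRunning Phase = "Running"
	PodFailed  Phase = "Failed"
)

// Pod models only the fields the diff needs.
type Pod struct {
	Name   string
	Labels map[string]string
	Phase  Phase
}

// Diff holds processor IDs, sorted for stable output.
type Diff struct {
	ToAdd      []string // desired, no pod exists
	ToRemove   []string // pod exists, no longer desired
	ToRecreate []string // desired, but the pod has failed (hypothetical field)
}

// calculateDiff assumes at most one pod per processor ID.
func calculateDiff(desiredIDs []string, actual []Pod) Diff {
	desired := make(map[string]bool, len(desiredIDs))
	for _, id := range desiredIDs {
		desired[id] = true
	}

	var d Diff
	seen := make(map[string]bool, len(actual))
	for _, p := range actual {
		id := p.Labels["processor-id"]
		seen[id] = true
		switch {
		case !desired[id]:
			d.ToRemove = append(d.ToRemove, id)
		case p.Phase == PodFailed:
			d.ToRecreate = append(d.ToRecreate, id)
		}
	}
	for id := range desired {
		if !seen[id] {
			d.ToAdd = append(d.ToAdd, id)
		}
	}
	sort.Strings(d.ToAdd)
	sort.Strings(d.ToRemove)
	sort.Strings(d.ToRecreate)
	return d
}

func main() {
	actual := []Pod{
		{Name: "pod-a", Labels: map[string]string{"processor-id": "a"}, Phase: PodRunning},
		{Name: "pod-c", Labels: map[string]string{"processor-id": "c"}, Phase: PodFailed},
		{Name: "pod-d", Labels: map[string]string{"processor-id": "d"}, Phase: PodRunning},
	}
	fmt.Printf("%+v\n", calculateDiff([]string{"a", "b", "c"}, actual))
}
```

Because pods are immutable, an entry in `ToRecreate` would be handled as a delete followed by a create, the same way the text describes an "update".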
Scheduler (bin-packing + Karpenter)
The scheduler does first-fit bin-packing on the most-utilized healthy node: nodes are sorted by descending CPU+memory utilization, and a 10% resource buffer is reserved. A node is healthy when it is not cordoned, has Ready=True, and carries no unschedulable taint.
- Defaults: DefaultCPUMillis=100, DefaultMemoryBytes=128Mi.
- The NodeResourceCache has a 30s TTL and is invalidated by the reconciler on each pod create/delete (InvalidatePodChange).
Decision types:
| Decision | Meaning |
|---|---|
| schedule_existing | Pod fits on an existing healthy node |
| provision_node | Nothing fits: create a Pending pod (empty NodeName) so Karpenter provisions a node |
| wait | Defer scheduling |
The orchestrator does not create NodePool/EC2NodeClass CRDs or call any Karpenter API. It relies on native Karpenter behavior: Pending pods with an empty NodeName trigger provisioning.
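The placement rule can be sketched as follows. This is an approximation under stated assumptions: the `Node` type and `decide` function are hypothetical, utilization is taken as the sum of the CPU and memory fractions, the 10% buffer is applied to each resource separately, and the defaults are assumed to apply when a request is zero. The `wait` decision is not modeled because the text does not say when it is emitted.

```go
package main

import (
	"fmt"
	"sort"
)

const (
	DefaultCPUMillis   int64 = 100
	DefaultMemoryBytes int64 = 128 << 20 // 128Mi
)

type Decision string

const (
	ScheduleExisting Decision = "schedule_existing"
	ProvisionNode    Decision = "provision_node"
)

// Node carries allocatable capacity and the sum of existing pod requests.
type Node struct {
	Name               string
	Cordoned, Ready    bool
	UnschedulableTaint bool
	CPUMillis, MemBytes int64
	UsedCPU, UsedMem    int64
}

func (n Node) healthy() bool { return !n.Cordoned && n.Ready && !n.UnschedulableTaint }

// utilization is the CPU fraction plus the memory fraction (assumed formula).
func (n Node) utilization() float64 {
	return float64(n.UsedCPU)/float64(n.CPUMillis) + float64(n.UsedMem)/float64(n.MemBytes)
}

// fits keeps 10% of each resource free, using integer math to avoid rounding.
func (n Node) fits(cpu, mem int64) bool {
	return n.UsedCPU+cpu <= n.CPUMillis*9/10 && n.UsedMem+mem <= n.MemBytes*9/10
}

// decide returns the first healthy node that fits, trying the most-utilized
// nodes first, or provision_node with an empty node name when nothing fits.
func decide(nodes []Node, cpu, mem int64) (Decision, string) {
	if cpu == 0 {
		cpu = DefaultCPUMillis
	}
	if mem == 0 {
		mem = DefaultMemoryBytes
	}
	var candidates []Node
	for _, n := range nodes {
		if n.healthy() && n.CPUMillis > 0 && n.MemBytes > 0 {
			candidates = append(candidates, n)
		}
	}
	sort.SliceStable(candidates, func(i, j int) bool {
		return candidates[i].utilization() > candidates[j].utilization()
	})
	for _, n := range candidates {
		if n.fits(cpu, mem) {
			return ScheduleExisting, n.Name
		}
	}
	return ProvisionNode, "" // empty NodeName leaves the pod Pending for Karpenter
}

func main() {
	gi := int64(1) << 30
	nodes := []Node{
		{Name: "a", Ready: true, CPUMillis: 4000, MemBytes: 8 * gi, UsedCPU: 3500, UsedMem: 2 * gi},
		{Name: "b", Ready: true, CPUMillis: 4000, MemBytes: 8 * gi, UsedCPU: 1000, UsedMem: 1 * gi},
	}
	fmt.Println(decide(nodes, 500, 256<<20))
}
```

Sorting by descending utilization before the first-fit scan is what makes the packing tight: busy nodes fill up first, which leaves lightly loaded nodes as candidates for consolidation.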
Warning
provisioner.RequestNodeProvisioning validates only by listing nodes labeled karpenter.sh/nodepool=dory-pool, but the cluster NodePool is dory-app-pool. This label mismatch means the validation step may not match real nodes. The eager teardown path is karpenter/decommission.go DecommissionNode (cordon → evict → delete Node).
Consolidation
runConsolidation is cooldown-gated (default 1m) and requires at least two workload-type=application nodes. It calls scheduler.ConsolidatePods to plan moves, then migrator.MigrateBatch to execute them. Emptied nodes are decommissioned by Karpenter, not the orchestrator.
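The two preconditions (cooldown elapsed, at least two application nodes) can be sketched as a small gate. The `consolidationGate` type and `simulate` helper are hypothetical, and starting the cooldown when a run begins, rather than when it finishes, is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// consolidationGate tracks when consolidation last ran.
type consolidationGate struct {
	cooldown time.Duration
	lastRun  time.Time
}

// shouldRun reports whether consolidation may run at now, and records the
// run when it does. appNodes counts workload-type=application nodes.
func (g *consolidationGate) shouldRun(now time.Time, appNodes int) bool {
	if appNodes < 2 {
		return false // nothing to consolidate onto
	}
	if !g.lastRun.IsZero() && now.Sub(g.lastRun) < g.cooldown {
		return false // still cooling down
	}
	g.lastRun = now
	return true
}

// simulate evaluates the gate at the given second offsets from a fixed start.
func simulate(offsetsSec []int, appNodes int) []bool {
	g := &consolidationGate{cooldown: time.Minute}
	start := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	out := make([]bool, 0, len(offsetsSec))
	for _, s := range offsetsSec {
		at := start.Add(time.Duration(s) * time.Second)
		out = append(out, g.shouldRun(at, appNodes))
	}
	return out
}

func main() {
	fmt.Println(simulate([]int{0, 30, 60, 61}, 2)) // prints [true false true false]
}
```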
Event monitor (--enable-monitor)
The event monitor is optional and only runs with --enable-monitor:
- Pods: watches pods labeled managed-by=dory-orchestrator (log only).
- Nodes: watches all nodes. A NoSchedule taint on an application node triggers an async drain handler with a 5-minute timeout, invoking the state-transfer migration path.
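The node-side trigger can be sketched with plain structs instead of the Kubernetes client types. The `Node`, `Taint`, `needsDrain`, and `onNodeEvent` names are hypothetical, and identifying an application node by the workload-type=application label is an assumption borrowed from the consolidation section.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

type Taint struct{ Key, Effect string }

// Node models only what the trigger inspects.
type Node struct {
	Name   string
	Labels map[string]string
	Taints []Taint
}

const drainTimeout = 5 * time.Minute

// needsDrain is true for an application node carrying any NoSchedule taint.
func needsDrain(n Node) bool {
	if n.Labels["workload-type"] != "application" {
		return false
	}
	for _, t := range n.Taints {
		if t.Effect == "NoSchedule" {
			return true
		}
	}
	return false
}

// onNodeEvent starts the drain asynchronously so the watch loop is never
// blocked, and bounds the drain with its own timeout.
func onNodeEvent(ctx context.Context, n Node, drain func(context.Context, string) error) bool {
	if !needsDrain(n) {
		return false
	}
	go func() {
		dctx, cancel := context.WithTimeout(ctx, drainTimeout)
		defer cancel()
		if err := drain(dctx, n.Name); err != nil {
			fmt.Println("drain failed:", n.Name, err)
		}
	}()
	return true
}

func main() {
	n := Node{
		Name:   "app-1",
		Labels: map[string]string{"workload-type": "application"},
		Taints: []Taint{{Key: "example", Effect: "NoSchedule"}},
	}
	done := make(chan string, 1)
	started := onNodeEvent(context.Background(), n, func(_ context.Context, name string) error {
		done <- name
		return nil
	})
	if started {
		fmt.Println("draining", <-done)
	}
}
```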
Warning
Without --enable-monitor, node drains are handled only by Karpenter/Kubernetes — the orchestrator's state-transfer migration path does not run. See State Migration & Node Drain.
Graceful shutdown
On SIGINT/SIGTERM (or a watcher error), the orchestrator runs GracefulShutdown. The shutdown manager gates reconciliation so no new reconcile cycle starts during teardown, and in-flight work is allowed to wind down before the process exits.
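One way to picture the gating is a flag plus a wait group: new cycles are refused once shutdown begins, and shutdown waits for cycles already running. The `shutdownGate` type and its methods are hypothetical; the real shutdown manager's API is not shown in this section.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// shutdownGate admits reconcile cycles until shutdown starts.
type shutdownGate struct {
	mu       sync.Mutex
	closing  bool
	inFlight sync.WaitGroup
}

// begin registers a cycle and returns false once shutdown has started.
func (g *shutdownGate) begin() bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.closing {
		return false
	}
	g.inFlight.Add(1)
	return true
}

// end marks a cycle admitted by begin as finished.
func (g *shutdownGate) end() { g.inFlight.Done() }

// shutdown closes the gate, then waits for in-flight cycles or ctx expiry.
func (g *shutdownGate) shutdown(ctx context.Context) error {
	g.mu.Lock()
	g.closing = true
	g.mu.Unlock()

	done := make(chan struct{})
	go func() {
		g.inFlight.Wait()
		close(done)
	}()
	select {
	case <-done:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// demo admits one cycle, shuts down while it is still running, and then
// tries to start another cycle.
func demo() (before, clean, after bool) {
	g := &shutdownGate{}
	before = g.begin()
	go func() {
		time.Sleep(10 * time.Millisecond) // simulated in-flight reconcile
		g.end()
	}()
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	clean = g.shutdown(ctx) == nil
	after = g.begin()
	return before, clean, after
}

func main() {
	before, clean, after := demo()
	fmt.Println(before, clean, after)
}
```

In a full program the call to `shutdown` would typically follow a context from `signal.NotifyContext` being cancelled by SIGINT or SIGTERM.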