Architecture¶

The Dory Orchestrator (v0.1.0) is a single-replica Go control plane that reconciles processor rows in a PostgreSQL database into running Kubernetes pods. It owns pod lifecycle, bin-packing placement, Karpenter-driven scaling, edge failover, and state-preserving migrations.

Components¶

Component	Package	Responsibility
Config client	`config`	Polls PostgreSQL (`ListProcessorConfigs`) for the desired processor set
Reconciler	`pkg/reconciler`	Computes the desired-vs-actual diff and drives pod create/delete, failover, consolidation
Scheduler	`pkg/scheduler`	First-fit bin-packing placement; emits `provision_node` when nothing fits
Migrator	`pkg/migrator`	Create-before-delete pod migration (default and enhanced/validated paths)
State transfer	`pkg/state`	HTTP `/state` capture and restore between pods
Drain manager	`pkg/drain`	State-preserving migration when an application node is cordoned
Failover manager	`failover`	Edge node failover/failback decisions (see Edge Failover)
Edge store	`pkg/state`	Edge node registration, heartbeat, decommission
Event monitor	`monitor`	Optional pod/node watch; triggers drain on a NoSchedule taint
K8s client	`pkg/k8s`	Pod spec construction and cluster operations
Health manager	`health`	Liveness/readiness checkers (DB + K8s)
Shutdown manager	—	Gates reconciliation and coordinates graceful shutdown
HTTP server	`cmd/orchestrator`	Single server on `:MetricsPort` for metrics, health, and edge APIs

graph TD
    DB[(PostgreSQL)] -->|ListProcessorConfigs| CW[Config Watcher]
    CW -->|onChange every PollInterval| REC[Reconciler]
    REC --> SCH[Scheduler]
    REC --> MIG[Migrator]
    REC --> FO[Failover Manager]
    SCH -->|provision_node| KARP[Karpenter]
    MIG --> ST[State Transfer]
    REC -->|create/delete| K8S[(Kubernetes API)]
    MON[Event Monitor<br/>--enable-monitor] -->|node cordoned| DRAIN[Drain Manager]
    DRAIN --> ST
    DRAIN --> K8S
    HTTP[HTTP Server :8080] --> METRICS[/metrics, /healthz, /readyz, /livez/]
    HTTP --> EDGE[/POST /api/v1/edge/.../]
    HP[Health Poller] -->|/healthz| K8S

HTTP server¶

A single http.Server listens on :MetricsPort (default 8080) and serves:

GET /metrics — Prometheus metrics (see Metrics)
GET /healthz, GET /readyz, GET /livez — orchestrator health
POST /api/v1/edge/heartbeat
POST /api/v1/edge/nodes
POST /api/v1/edge/nodes/decommission

See the API Reference for request and response shapes.

Startup sequence¶

Parse CLI flags.
Layered config load — file → defaults → CLI → env (see Configuration).
Validate the resolved config.
Initialize the logger.
Create the PostgreSQL config client and pgx connection pool.
Initialize the EdgeStore.
Initialize the Kubernetes client.
Build the health manager with DB and K8s checkers.
If EnableStartupValidation is enabled (default true), run the critical checks database_connectivity, kubernetes_connectivity, and namespace_exists. Any failure is fatal.
Construct scheduler, migrator, failover manager, reconciler, and shutdown manager.
Start the HTTP server.
Start the config-watcher goroutine (the control loop).
Start the processor health poller (polls each running pod's /healthz, concurrency 10).
MarkReady.
Block on SIGINT/SIGTERM or a watcher error.
GracefulShutdown.

Note

There is no leader election. The Deployment runs replicas=1 with the Recreate strategy. See Deployment.

The control loop is the config watcher¶

There is no dedicated reconcile goroutine. The config watcher is the control loop:

Every PollInterval, poll the DB via ListProcessorConfigs.
Always call onChange(processors) — even when nothing changed — so that consolidation can run.
onChange invokes recon.Reconcile(ctx, processors) with a ReconciliationTimeout context, gated by the shutdown manager.

Tip

Because onChange fires on every poll regardless of change, consolidation gets a chance to run on a steady-state cluster, not only when the desired set changes.

The reconcile cycle¶

Reconcile(ctx, desiredApps []config.ProcessorConfig) runs these phases in order:

gatherClusterState — list pods labeled managed-by=dory-orchestrator (the actual set) and list all pods (capacity).
calculateDiff — see below.
deleteRemovedPods — delete pods no longer desired.
createNewPods — create pods for newly desired processors.
handleFailover — edge failover decisions.
runConsolidation — bin-pack onto fewer nodes when possible.

Diff calculation¶

Desired pods are keyed by processor ID.
Actual pods are keyed by pod.Labels["processor-id"].
Pods in the PodFailed phase are scheduled for recreation.
ToAdd = desired keys absent from actual.
ToRemove = actual keys absent from desired.

One processor ID maps to exactly one pod. Pods are immutable, so an "update" is a delete-plus-recreate. See Pod Lifecycle.

Scheduler (bin-packing + Karpenter)¶

The scheduler uses first-fit bin-packing that favors the most-utilized healthy node: it sorts nodes by descending CPU+memory utilization, keeps a 10% resource buffer in reserve, and places the pod on the first node where it fits. A node is healthy when it is not cordoned, has Ready=True, and carries no unschedulable taint.

Defaults: DefaultCPUMillis=100, DefaultMemoryBytes=128Mi.
The NodeResourceCache has a 30s TTL and is invalidated by the reconciler on each pod create/delete (InvalidatePodChange).

Decision types:

Decision	Meaning
`schedule_existing`	Pod fits on an existing healthy node
`provision_node`	Nothing fits — create a Pending pod (empty `NodeName`) so Karpenter provisions a node
`wait`	Defer scheduling

The orchestrator does not create NodePool/EC2NodeClass CRDs or call any Karpenter API. It relies on native Karpenter behavior: Pending pods with an empty NodeName trigger provisioning.
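The placement rule can be sketched as follows. Several details are assumptions rather than documented behaviour: utilization is taken as the mean of the CPU and memory ratios, the 10% buffer is applied per node against capacity, the defaults are used when a request is zero, and the `wait` decision is not modeled. The `Node` type and `place` function are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// Node is a simplified view of a schedulable node.
type Node struct {
	Name    string
	Healthy bool  // not cordoned, Ready=True, no unschedulable taint
	CPUCap  int64 // millicores
	CPUUsed int64
	MemCap  int64 // bytes
	MemUsed int64
}

const (
	DefaultCPUMillis   int64 = 100
	DefaultMemoryBytes int64 = 128 << 20 // 128Mi
	bufferPercent      int64 = 10
)

// utilization combines CPU and memory usage into one score (assumed: the mean).
func utilization(n Node) float64 {
	return (float64(n.CPUUsed)/float64(n.CPUCap) + float64(n.MemUsed)/float64(n.MemCap)) / 2
}

// fits reports whether the request leaves the reserved buffer untouched.
func fits(n Node, cpu, mem int64) bool {
	return n.CPUUsed+cpu <= n.CPUCap*(100-bufferPercent)/100 &&
		n.MemUsed+mem <= n.MemCap*(100-bufferPercent)/100
}

// place returns the decision and, for schedule_existing, the chosen node name.
func place(nodes []Node, cpu, mem int64) (decision, nodeName string) {
	if cpu == 0 {
		cpu = DefaultCPUMillis
	}
	if mem == 0 {
		mem = DefaultMemoryBytes
	}
	var healthy []Node
	for _, n := range nodes {
		if n.Healthy && n.CPUCap > 0 && n.MemCap > 0 {
			healthy = append(healthy, n)
		}
	}
	// Most-utilized first, so the first fit packs pods tightly.
	sort.SliceStable(healthy, func(i, j int) bool {
		return utilization(healthy[i]) > utilization(healthy[j])
	})
	for _, n := range healthy {
		if fits(n, cpu, mem) {
			return "schedule_existing", n.Name
		}
	}
	return "provision_node", "" // empty NodeName leaves the pod Pending for Karpenter
}

func main() {
	nodes := []Node{
		{Name: "a", Healthy: true, CPUCap: 4000, CPUUsed: 3700, MemCap: 8 << 30, MemUsed: 4 << 30},
		{Name: "b", Healthy: true, CPUCap: 4000, CPUUsed: 2000, MemCap: 8 << 30, MemUsed: 2 << 30},
		{Name: "c", Healthy: false, CPUCap: 4000, CPUUsed: 100, MemCap: 8 << 30, MemUsed: 1 << 30},
	}
	fmt.Println(place(nodes, 0, 0))    // default-sized request
	fmt.Println(place(nodes, 3000, 0)) // too large for any healthy node
}
```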

Warning

provisioner.RequestNodeProvisioning validates only by listing nodes labeled karpenter.sh/nodepool=dory-pool, but the cluster's NodePool is named dory-app-pool. Because of this label mismatch, the validation step may not match any real nodes. The eager teardown path is DecommissionNode in karpenter/decommission.go (cordon → evict → delete Node).

Consolidation¶

runConsolidation is cooldown-gated (default 1m) and requires at least two workload-type=application nodes. It calls scheduler.ConsolidatePods to plan moves, then migrator.MigrateBatch to execute them. Emptied nodes are decommissioned by Karpenter, not the orchestrator.

Event monitor (`--enable-monitor`)¶

The event monitor is optional and only runs with --enable-monitor:

Pods — watches pods labeled managed-by=dory-orchestrator (log only).
Nodes — watches all nodes. A NoSchedule taint on an application node triggers an async drain handler with a 5-minute timeout, invoking the state-transfer migration path.

Warning

Without --enable-monitor, node drains are handled only by Karpenter/Kubernetes — the orchestrator's state-transfer migration path does not run. See State Migration & Node Drain.

Graceful shutdown¶

On SIGINT/SIGTERM (or a watcher error), the orchestrator runs GracefulShutdown. The shutdown manager gates reconciliation so no new reconcile cycle starts during teardown, and in-flight work is allowed to wind down before the process exits.
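One way to picture the gating behaviour is a small guard that refuses new reconcile cycles once shutdown begins and waits for in-flight ones. This is a sketch, not the shutdown manager's actual API: `shutdownGate`, `Begin`, `End`, and `Shutdown` are hypothetical, and a real implementation would presumably bound the wait with a timeout.

```go
package main

import (
	"fmt"
	"sync"
)

// shutdownGate admits reconcile cycles until shutdown starts.
type shutdownGate struct {
	mu       sync.Mutex
	closing  bool
	inflight sync.WaitGroup
}

// Begin registers a new cycle and returns true, or returns false once
// shutdown has started.
func (g *shutdownGate) Begin() bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.closing {
		return false
	}
	g.inflight.Add(1)
	return true
}

// End marks a cycle admitted by Begin as finished.
func (g *shutdownGate) End() { g.inflight.Done() }

// Shutdown blocks new cycles, then waits for in-flight cycles to finish.
func (g *shutdownGate) Shutdown() {
	g.mu.Lock()
	g.closing = true
	g.mu.Unlock()
	g.inflight.Wait()
}

func main() {
	g := &shutdownGate{}
	if g.Begin() {
		fmt.Println("reconcile cycle running")
		g.End()
	}
	g.Shutdown()
	fmt.Println("new cycle admitted after shutdown:", g.Begin())
}
```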