
Architecture

The Dory Orchestrator is a single-replica Go control plane (version v0.1.0) that reconciles processor rows in a PostgreSQL database into running Kubernetes pods. It owns pod lifecycle, bin-packing placement, Karpenter-driven scaling, edge failover, and state-preserving migrations.

Components

Component | Package | Responsibility
--- | --- | ---
Config client | config | Polls PostgreSQL (ListProcessorConfigs) for the desired processor set
Reconciler | pkg/reconciler | Computes the desired-vs-actual diff and drives pod create/delete, failover, consolidation
Scheduler | pkg/scheduler | First-fit bin-packing placement; emits provision_node when nothing fits
Migrator | pkg/migrator | Create-before-delete pod migration (default and enhanced/validated paths)
State transfer | pkg/state | HTTP /state capture and restore between pods
Drain manager | pkg/drain | State-preserving migration when an application node is cordoned
Failover manager | failover | Edge node failover/failback decisions (see Edge Failover)
Edge store | pkg/state | Edge node registration, heartbeat, decommission
Event monitor | monitor | Optional pod/node watch; triggers drain on a NoSchedule taint
K8s client | pkg/k8s | Pod spec construction and cluster operations
Health manager | health | Liveness/readiness checkers (DB + K8s)
Shutdown manager | | Gates reconciliation and coordinates graceful shutdown
HTTP server | cmd/orchestrator | Single server on :MetricsPort for metrics, health, and edge APIs
graph TD
    DB[(PostgreSQL)] -->|ListProcessorConfigs| CW[Config Watcher]
    CW -->|onChange every PollInterval| REC[Reconciler]
    REC --> SCH[Scheduler]
    REC --> MIG[Migrator]
    REC --> FO[Failover Manager]
    SCH -->|provision_node| KARP[Karpenter]
    MIG --> ST[State Transfer]
    REC -->|create/delete| K8S[(Kubernetes API)]
    MON[Event Monitor<br/>--enable-monitor] -->|node cordoned| DRAIN[Drain Manager]
    DRAIN --> ST
    DRAIN --> K8S
    HTTP[HTTP Server :8080] --> METRICS[/metrics, /healthz, /readyz, /livez/]
    HTTP --> EDGE[/POST /api/v1/edge/.../]
    HP[Health Poller] -->|/healthz| K8S

HTTP server

A single http.Server listens on :MetricsPort (default 8080) and serves:

  • GET /metrics — Prometheus metrics (see Metrics)
  • GET /healthz, GET /readyz, GET /livez — orchestrator health
  • POST /api/v1/edge/heartbeat
  • POST /api/v1/edge/nodes
  • POST /api/v1/edge/nodes/decommission

See the API Reference for request and response shapes.
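
As a rough illustration of the single-server layout, the sketch below registers the routes listed above on one http.ServeMux. The names route, newMux, and statusFor, the placeholder handler bodies, and the explicit method check are assumptions made for illustration; they are not taken from the orchestrator's code.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// route pairs an HTTP method with a path from the list above.
type route struct {
	method, path string
}

// routes mirrors the endpoints listed in this section.
var routes = []route{
	{http.MethodGet, "/metrics"},
	{http.MethodGet, "/healthz"},
	{http.MethodGet, "/readyz"},
	{http.MethodGet, "/livez"},
	{http.MethodPost, "/api/v1/edge/heartbeat"},
	{http.MethodPost, "/api/v1/edge/nodes"},
	{http.MethodPost, "/api/v1/edge/nodes/decommission"},
}

// newMux registers every route on a single mux. The handlers are
// hypothetical placeholders that only enforce the expected method.
func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	for _, r := range routes {
		r := r // capture the loop variable for the closure
		mux.HandleFunc(r.path, func(w http.ResponseWriter, req *http.Request) {
			if req.Method != r.method {
				http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
				return
			}
			w.WriteHeader(http.StatusOK)
		})
	}
	return mux
}

// statusFor sends one in-memory request through the mux and returns
// the resulting status code.
func statusFor(method, path string) int {
	rec := httptest.NewRecorder()
	newMux().ServeHTTP(rec, httptest.NewRequest(method, path, nil))
	return rec.Code
}

func main() {
	for _, r := range routes {
		fmt.Println(r.method, r.path, statusFor(r.method, r.path))
	}
}
```

In a real process, a mux like this would be handed to one http.Server whose Addr is derived from MetricsPort, so metrics, health, and edge traffic share a single listener.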

Startup sequence

  1. Parse CLI flags.
  2. Layered config load — file → defaults → CLI → env (see Configuration).
  3. Validate the resolved config.
  4. Initialize the logger.
  5. Create the PostgreSQL config client and pgx connection pool.
  6. Initialize the EdgeStore.
  7. Initialize the Kubernetes client.
  8. Build the health manager with DB and K8s checkers.
  9. If EnableStartupValidation (default true), run critical checks: database_connectivity, kubernetes_connectivity, namespace_exists. Any failure is fatal.
  10. Construct scheduler, migrator, failover manager, reconciler, and shutdown manager.
  11. Start the HTTP server.
  12. Start the config-watcher goroutine (the control loop).
  13. Start the processor health poller (polls each running pod's /healthz, concurrency 10).
  14. MarkReady.
  15. Block on SIGINT/SIGTERM or a watcher error.
  16. GracefulShutdown.
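
Step 9 can be pictured with the minimal sketch below: named checks run in order, and the first failure is returned so the caller can abort startup. The check type, runStartupChecks, and errNamespaceMissing are hypothetical names; only the three check names and the EnableStartupValidation toggle come from the list above.

```go
package main

import (
	"errors"
	"fmt"
)

// errNamespaceMissing is a hypothetical failure used in the demo below.
var errNamespaceMissing = errors.New("namespace not found")

// check is a hypothetical named startup check.
type check struct {
	name string
	run  func() error
}

// runStartupChecks runs the checks in order and returns the first
// failure. When validation is disabled, no check is executed.
func runStartupChecks(enabled bool, checks []check) error {
	if !enabled {
		return nil
	}
	for _, c := range checks {
		if err := c.run(); err != nil {
			return fmt.Errorf("startup check %s failed: %w", c.name, err)
		}
	}
	return nil
}

func main() {
	pass := func() error { return nil }
	checks := []check{
		{"database_connectivity", pass},
		{"kubernetes_connectivity", pass},
		{"namespace_exists", func() error { return errNamespaceMissing }},
	}
	// A real process would treat this error as fatal and exit.
	if err := runStartupChecks(true, checks); err != nil {
		fmt.Println("fatal:", err)
	}
}
```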

Note

There is no leader election. The Deployment runs replicas=1 with the Recreate strategy. See Deployment.

The control loop is the config watcher

There is no dedicated reconcile goroutine. The config watcher is the control loop:

  1. Every PollInterval, poll the DB via ListProcessorConfigs.
  2. Always call onChange(processors) — even when nothing changed — so that consolidation can run.
  3. onChange invokes recon.Reconcile(ctx, processors) with a ReconciliationTimeout context, gated by the shutdown manager.

Tip

Because onChange fires on every poll regardless of change, consolidation gets a chance to run on a steady-state cluster, not only when the desired set changes.
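
The loop described above might look roughly like the following sketch. The watch signature, the processor type, the accepting gate, and the decision to return on a list error are all assumptions; the ticks channel stands in for a time.Ticker created with PollInterval so that the demo is deterministic.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// processor is a hypothetical stand-in for config.ProcessorConfig.
type processor struct{ ID string }

// watch calls reconcile on every tick, whether or not the processor set
// changed. Each reconcile gets its own timeout context and is skipped
// when the (hypothetical) shutdown gate reports it is closed.
func watch(
	ctx context.Context,
	ticks <-chan time.Time,
	timeout time.Duration,
	list func(context.Context) ([]processor, error),
	accepting func() bool,
	reconcile func(context.Context, []processor) error,
) error {
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticks:
			procs, err := list(ctx)
			if err != nil {
				// Assumption: a failed poll surfaces as a watcher error.
				return fmt.Errorf("list processors: %w", err)
			}
			if !accepting() {
				continue
			}
			rctx, cancel := context.WithTimeout(ctx, timeout)
			err = reconcile(rctx, procs)
			cancel()
			if err != nil {
				fmt.Println("reconcile failed:", err)
			}
		}
	}
}

// runDemo feeds three ticks with an unchanged processor set and returns
// how many times reconcile ran.
func runDemo() int {
	ctx, cancel := context.WithCancel(context.Background())
	ticks := make(chan time.Time)
	go func() {
		for i := 0; i < 3; i++ {
			ticks <- time.Now()
		}
		cancel()
	}()
	calls := 0
	list := func(context.Context) ([]processor, error) {
		return []processor{{ID: "p1"}}, nil
	}
	reconcile := func(context.Context, []processor) error {
		calls++
		return nil
	}
	_ = watch(ctx, ticks, time.Second, list, func() bool { return true }, reconcile)
	return calls
}

func main() {
	fmt.Println("reconcile calls:", runDemo())
}
```

Passing time.NewTicker(pollInterval).C as ticks would give the periodic behavior described in step 1.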

The reconcile cycle

Reconcile(ctx, desiredApps []config.ProcessorConfig) runs these phases in order:

  1. gatherClusterState — list pods labeled managed-by=dory-orchestrator (the actual set) and list all pods (capacity).
  2. calculateDiff — see below.
  3. deleteRemovedPods — delete pods no longer desired.
  4. createNewPods — create pods for newly desired processors.
  5. handleFailover — edge failover decisions.
  6. runConsolidation — bin-pack onto fewer nodes when possible.

Diff calculation

  • Desired pods are keyed by processor ID.
  • Actual pods are keyed by pod.Labels["processor-id"].
  • Pods in the PodFailed phase are marked for recreation.
  • ToAdd = desired keys absent from actual.
  • ToRemove = actual keys absent from desired.

One processor ID maps to exactly one pod. Pods are immutable, so an "update" is a delete-plus-recreate. See Pod Lifecycle.
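
The rules above amount to a set difference over processor IDs. The sketch below is one way to express it; the pod and diff types are simplified stand-ins, and tracking failed pods in a separate ToRecreate list is an assumption about how recreation is represented.

```go
package main

import (
	"fmt"
	"sort"
)

// podFailed mirrors the value of the Kubernetes PodFailed phase.
const podFailed = "Failed"

// pod is a simplified stand-in for a Kubernetes pod.
type pod struct {
	Labels map[string]string
	Phase  string
}

// diff holds processor IDs grouped by the action they need.
// ToRecreate is a hypothetical field for desired pods that have failed.
type diff struct {
	ToAdd, ToRemove, ToRecreate []string
}

// calculateDiff compares desired processor IDs with actual pods keyed by
// their "processor-id" label.
func calculateDiff(desiredIDs []string, actual []pod) diff {
	desired := make(map[string]bool, len(desiredIDs))
	for _, id := range desiredIDs {
		desired[id] = true
	}
	actualByID := make(map[string]pod, len(actual))
	for _, p := range actual {
		actualByID[p.Labels["processor-id"]] = p
	}

	var d diff
	for id := range desired {
		p, ok := actualByID[id]
		switch {
		case !ok:
			d.ToAdd = append(d.ToAdd, id)
		case p.Phase == podFailed:
			d.ToRecreate = append(d.ToRecreate, id)
		}
	}
	for id := range actualByID {
		if !desired[id] {
			d.ToRemove = append(d.ToRemove, id)
		}
	}
	// Sort for stable output; map iteration order is random.
	sort.Strings(d.ToAdd)
	sort.Strings(d.ToRemove)
	sort.Strings(d.ToRecreate)
	return d
}

func main() {
	actual := []pod{
		{Labels: map[string]string{"processor-id": "b"}, Phase: "Running"},
		{Labels: map[string]string{"processor-id": "c"}, Phase: podFailed},
		{Labels: map[string]string{"processor-id": "d"}, Phase: "Running"},
	}
	fmt.Printf("%+v\n", calculateDiff([]string{"a", "b", "c"}, actual))
}
```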

Scheduler (bin-packing + Karpenter)

The scheduler does first-fit bin-packing over healthy nodes, trying the most-utilized node first: nodes are sorted by descending CPU+memory utilization, and a 10% resource buffer is reserved. A node is healthy when it is not cordoned, has Ready=True, and carries no unschedulable taint.

  • Defaults: DefaultCPUMillis=100, DefaultMemoryBytes=128Mi.
  • The NodeResourceCache has a 30s TTL and is invalidated by the reconciler on each pod create/delete (InvalidatePodChange).
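
To make the cache behavior concrete, here is a small sketch of a TTL cache with explicit invalidation. Only the 30s TTL and the InvalidatePodChange name come from the text; the struct layout, the injected clock, and the loader function are assumptions chosen so the example can run without a cluster.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// nodeResourceCache is a hypothetical TTL cache over a node listing.
type nodeResourceCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	now     func() time.Time // injected clock, so the demo needs no sleeps
	load    func() []string  // fetches fresh data on a miss
	fetched time.Time
	valid   bool
	nodes   []string
}

// Get returns cached data, reloading when the entry is missing,
// invalidated, or older than the TTL.
func (c *nodeResourceCache) Get() []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if !c.valid || c.now().Sub(c.fetched) >= c.ttl {
		c.nodes = c.load()
		c.fetched = c.now()
		c.valid = true
	}
	return c.nodes
}

// InvalidatePodChange drops the cached entry after a pod create/delete.
func (c *nodeResourceCache) InvalidatePodChange() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.valid = false
}

// demoLoads exercises a miss, a hit, a TTL expiry, and an invalidation,
// and returns how many times the loader ran.
func demoLoads() int {
	clock := time.Unix(0, 0)
	loads := 0
	c := &nodeResourceCache{
		ttl:  30 * time.Second,
		now:  func() time.Time { return clock },
		load: func() []string { loads++; return []string{"node-a"} },
	}
	c.Get() // miss: loads
	c.Get() // hit: served from cache
	clock = clock.Add(31 * time.Second)
	c.Get() // expired: loads again
	c.InvalidatePodChange()
	c.Get() // invalidated: loads again
	return loads
}

func main() {
	fmt.Println("loader calls:", demoLoads())
}
```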

Decision types:

Decision | Meaning
--- | ---
schedule_existing | Pod fits on an existing healthy node
provision_node | Nothing fits; create a Pending pod (empty NodeName) so Karpenter provisions a node
wait | Defer scheduling

The orchestrator does not create NodePool/EC2NodeClass CRDs or call any Karpenter API. It relies on native Karpenter behavior: Pending pods with an empty NodeName trigger provisioning.
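
The placement rule can be sketched as follows. This is an illustration under stated assumptions, not the scheduler's implementation: utilization is taken as the sum of the CPU and memory fractions, the 10% buffer is applied per node to allocatable capacity, the node type and place function are hypothetical, and the wait decision is not modeled.

```go
package main

import (
	"fmt"
	"sort"
)

const (
	defaultCPUMillis   = 100
	defaultMemoryBytes = 128 << 20 // 128Mi

	scheduleExisting = "schedule_existing"
	provisionNode    = "provision_node"
)

// node is a simplified view of a cluster node (CPU in millicores,
// memory in bytes).
type node struct {
	Name               string
	AllocCPU, AllocMem int64
	UsedCPU, UsedMem   int64
	Ready, Cordoned    bool
	UnschedulableTaint bool
}

// healthy applies the three conditions from the text.
func (n node) healthy() bool {
	return n.Ready && !n.Cordoned && !n.UnschedulableTaint
}

// utilization combines CPU and memory fractions (an assumed formula).
func (n node) utilization() float64 {
	return float64(n.UsedCPU)/float64(n.AllocCPU) +
		float64(n.UsedMem)/float64(n.AllocMem)
}

// place returns a decision and, for schedule_existing, the chosen node.
// Zero requests fall back to the documented defaults.
func place(nodes []node, cpu, mem int64) (string, string) {
	if cpu == 0 {
		cpu = defaultCPUMillis
	}
	if mem == 0 {
		mem = defaultMemoryBytes
	}
	var candidates []node
	for _, n := range nodes {
		if n.healthy() && n.AllocCPU > 0 && n.AllocMem > 0 {
			candidates = append(candidates, n)
		}
	}
	// Most-utilized first, so partially full nodes fill up before empty ones.
	sort.SliceStable(candidates, func(i, j int) bool {
		return candidates[i].utilization() > candidates[j].utilization()
	})
	for _, n := range candidates {
		// Keep 10% of allocatable capacity free on each node.
		if n.UsedCPU+cpu <= n.AllocCPU*9/10 && n.UsedMem+mem <= n.AllocMem*9/10 {
			return scheduleExisting, n.Name
		}
	}
	// Nothing fits: the caller creates a Pending pod with an empty NodeName.
	return provisionNode, ""
}

func main() {
	nodes := []node{
		{Name: "a", AllocCPU: 4000, AllocMem: 8 << 30, UsedCPU: 3500, UsedMem: 4 << 30, Ready: true},
		{Name: "b", AllocCPU: 4000, AllocMem: 8 << 30, UsedCPU: 1000, UsedMem: 1 << 30, Ready: true},
	}
	decision, name := place(nodes, 500, 1<<30)
	fmt.Println(decision, name)
}
```

Sorting by descending utilization before the first-fit scan is what biases placement toward already busy nodes, which in turn leaves lightly used nodes easier to empty during consolidation.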

Warning

provisioner.RequestNodeProvisioning validates only by listing nodes labeled karpenter.sh/nodepool=dory-pool, but the cluster NodePool is dory-app-pool. Because of this label mismatch, the validation step may not match any real nodes. The eager teardown path is DecommissionNode in karpenter/decommission.go (cordon → evict → delete Node).

Consolidation

runConsolidation is cooldown-gated (default 1m) and requires at least two workload-type=application nodes. It calls scheduler.ConsolidatePods to plan moves, then migrator.MigrateBatch to execute them. Emptied nodes are decommissioned by Karpenter, not the orchestrator.
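
The two preconditions can be sketched as a small gate. The consolidationGate type and shouldRun method are hypothetical names; only the 1m default cooldown and the two-node minimum come from the text, and recording the run time at the moment the gate opens is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// consolidationGate is a hypothetical cooldown tracker.
type consolidationGate struct {
	cooldown time.Duration
	lastRun  time.Time
}

// shouldRun reports whether consolidation may run now. It requires at
// least two application nodes and an elapsed cooldown, and records the
// run time when it returns true.
func (g *consolidationGate) shouldRun(now time.Time, appNodes int) bool {
	if appNodes < 2 {
		return false
	}
	if !g.lastRun.IsZero() && now.Sub(g.lastRun) < g.cooldown {
		return false
	}
	g.lastRun = now
	return true
}

// demoGate walks through four reconcile cycles and returns each verdict.
func demoGate() []bool {
	g := &consolidationGate{cooldown: time.Minute}
	t0 := time.Unix(0, 0)
	var out []bool
	out = append(out, g.shouldRun(t0, 1))                     // too few nodes
	out = append(out, g.shouldRun(t0, 2))                     // first run
	out = append(out, g.shouldRun(t0.Add(30*time.Second), 3)) // still cooling down
	out = append(out, g.shouldRun(t0.Add(61*time.Second), 3)) // cooldown elapsed
	return out
}

func main() {
	fmt.Println(demoGate())
}
```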

Event monitor (--enable-monitor)

The event monitor is optional and only runs with --enable-monitor:

  • Pods — watches pods labeled managed-by=dory-orchestrator (log only).
  • Nodes — watches all nodes. A NoSchedule taint on an application node triggers an async drain handler with a 5-minute timeout, invoking the state-transfer migration path.
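
The node branch of the monitor could be sketched like this. The node and taint types are simplified stand-ins for the Kubernetes API types, identifying an application node by the workload-type=application label is an assumption, and needsDrain and handleNodeEvent are hypothetical names.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// drainTimeout matches the 5-minute limit described above.
const drainTimeout = 5 * time.Minute

// taint and node are simplified stand-ins for the Kubernetes types.
type taint struct {
	Key, Effect string
}

type node struct {
	Name   string
	Labels map[string]string
	Taints []taint
}

// needsDrain reports whether an application node carries a NoSchedule taint.
func needsDrain(n node) bool {
	if n.Labels["workload-type"] != "application" {
		return false
	}
	for _, t := range n.Taints {
		if t.Effect == "NoSchedule" {
			return true
		}
	}
	return false
}

// handleNodeEvent starts an asynchronous drain with a timeout when the
// node qualifies. It returns nil when no drain is needed; otherwise the
// channel delivers the drain result.
func handleNodeEvent(ctx context.Context, n node, drain func(context.Context, string) error) <-chan error {
	if !needsDrain(n) {
		return nil
	}
	done := make(chan error, 1)
	go func() {
		dctx, cancel := context.WithTimeout(ctx, drainTimeout)
		defer cancel()
		done <- drain(dctx, n.Name)
	}()
	return done
}

func main() {
	n := node{
		Name:   "app-1",
		Labels: map[string]string{"workload-type": "application"},
		Taints: []taint{{Key: "example.com/maintenance", Effect: "NoSchedule"}},
	}
	done := handleNodeEvent(context.Background(), n, func(ctx context.Context, name string) error {
		fmt.Println("draining", name)
		return nil
	})
	if done != nil {
		fmt.Println("drain result:", <-done)
	}
}
```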

Warning

Without --enable-monitor, node drains are handled only by Karpenter/Kubernetes — the orchestrator's state-transfer migration path does not run. See State Migration & Node Drain.

Graceful shutdown

On SIGINT/SIGTERM (or a watcher error), the orchestrator runs GracefulShutdown. The shutdown manager gates reconciliation so no new reconcile cycle starts during teardown, and in-flight work is allowed to wind down before the process exits.
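
One common way to build such a gate is a closed flag plus a sync.WaitGroup, sketched below. The shutdownGate type and its Enter, Leave, and Shutdown methods are hypothetical; the text does not describe how the shutdown manager is actually implemented.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// shutdownGate is a hypothetical gate that admits work until shutdown
// begins and then waits for admitted work to finish.
type shutdownGate struct {
	mu     sync.Mutex
	closed bool
	wg     sync.WaitGroup
}

// Enter admits one unit of work, or returns false once shutdown started.
func (g *shutdownGate) Enter() bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.closed {
		return false
	}
	g.wg.Add(1)
	return true
}

// Leave marks one admitted unit of work as finished.
func (g *shutdownGate) Leave() {
	g.wg.Done()
}

// Shutdown closes the gate and waits for in-flight work, giving up when
// ctx is done.
func (g *shutdownGate) Shutdown(ctx context.Context) error {
	g.mu.Lock()
	g.closed = true
	g.mu.Unlock()

	done := make(chan struct{})
	go func() {
		g.wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// demoShutdown admits one cycle, finishes it, shuts down, and then
// tries to admit another cycle.
func demoShutdown() (before bool, after bool, err error) {
	g := &shutdownGate{}
	before = g.Enter()
	if before {
		g.Leave()
	}
	err = g.Shutdown(context.Background())
	after = g.Enter()
	return before, after, err
}

func main() {
	before, after, err := demoShutdown()
	fmt.Println(before, after, err)
}
```

Taking the same mutex in Enter and Shutdown ensures no cycle can be admitted after the gate closes, so the wait group only ever counts work that started before shutdown.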