Edge Failover & Failback

The Dory Orchestrator keeps edge-hosted processors available when an edge node fails by relocating them to managed (cloud / Karpenter) nodes, and returns them to the edge once it recovers. This is implemented by the FailoverManager in pkg/failover/failover.go and runs as part of every reconcile cycle.

See Architecture for how the reconcile loop is structured, Pod Lifecycle for pod creation/deletion mechanics, and State Migration for the HTTP-based transfer path used for live (non-failed) migrations.

Overview

FailoverManager.HandleFailover(ctx) performs two phases each cycle:

- Failover (edge → managed): PlanFailover then ExecuteFailover.
- Failback (managed → edge): PlanFailback then ExecuteFailback.
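As a rough illustration of the two-phase cycle, the sketch below runs a plan step followed by an execute step for each phase. The phase type, runCycle, and demoCycle are hypothetical names introduced for this example. The real FailoverManager signatures and its error handling are not specified on this page, so the sketch simply assumes that a cycle stops at the first error and skips execution when a plan is empty.

```go
package main

import (
	"context"
	"fmt"
)

// phase is a hypothetical pairing of a plan step and an execute step.
// The real PlanFailover/ExecuteFailover signatures may differ.
type phase struct {
	name    string
	plan    func(ctx context.Context) ([]string, error)
	execute func(ctx context.Context, apps []string) error
}

// runCycle runs the phases in order. Assumption: an empty plan skips
// execution, and the first error ends the cycle.
func runCycle(ctx context.Context, phases []phase) error {
	for _, p := range phases {
		apps, err := p.plan(ctx)
		if err != nil {
			return fmt.Errorf("%s: plan: %w", p.name, err)
		}
		if len(apps) == 0 {
			continue // nothing to do for this phase in this cycle
		}
		if err := p.execute(ctx, apps); err != nil {
			return fmt.Errorf("%s: execute: %w", p.name, err)
		}
	}
	return nil
}

// demoCycle wires two stub phases and returns the order in which steps ran.
func demoCycle() []string {
	var trace []string
	stub := func(name string, apps []string) phase {
		return phase{
			name: name,
			plan: func(context.Context) ([]string, error) {
				trace = append(trace, "Plan"+name)
				return apps, nil
			},
			execute: func(context.Context, []string) error {
				trace = append(trace, "Execute"+name)
				return nil
			},
		}
	}
	phases := []phase{
		stub("Failover", []string{"camera-app"}), // one app needs failover
		stub("Failback", nil),                    // nothing to fail back
	}
	if err := runCycle(context.Background(), phases); err != nil {
		trace = append(trace, "error: "+err.Error())
	}
	return trace
}

func main() {
	fmt.Println(demoCycle()) // prints [PlanFailover ExecuteFailover PlanFailback]
}
```

Failover is listed before failback so that newly failed nodes are handled before any pods are moved back to recovered ones.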

Note

Failover relies on ConfigMap-based state persistence, not HTTP state transfer. When an edge node is dead, its pod cannot serve a state-export request, so the cloud replacement restores state from the SDK's ConfigMap-backed store keyed by processor-id. Contrast this with the live migration path in State Migration.

Health detection

CheckEdgeNodeHealth lists nodes labeled node-type=edge and reads their NodeReady condition, tracking the first time each node was seen NotReady.

When the database is enabled, it additionally calls GetStaleHeartbeatNodes(HeartbeatStaleThreshold). A node that is Ready in Kubernetes but whose DB heartbeat is stale is forced to Ready=false with reason HeartbeatStale. This catches a kubelet that still reports healthy while the processor itself has stopped checking in.

The heartbeat used here is the POST /api/v1/edge/heartbeat endpoint, which updates edge_nodes.last_heartbeat_at. See HTTP API Reference.
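The heartbeat override can be pictured as a small pure function. This is a minimal sketch, not the actual CheckEdgeNodeHealth code: the condition type and withHeartbeat are hypothetical names, and the constant mirrors the 60s HeartbeatStaleThreshold. It assumes that a node already NotReady in Kubernetes keeps its original reason.

```go
package main

import (
	"fmt"
	"time"
)

// heartbeatStaleThreshold mirrors HeartbeatStaleThreshold (60s).
const heartbeatStaleThreshold = 60 * time.Second

// condition is a simplified, hypothetical stand-in for a node's readiness.
type condition struct {
	Ready  bool
	Reason string
}

// withHeartbeat overrides a Kubernetes-Ready verdict when the DB heartbeat is
// older than the threshold. A node that is already NotReady is left unchanged.
func withHeartbeat(c condition, heartbeatAge time.Duration) condition {
	if c.Ready && heartbeatAge > heartbeatStaleThreshold {
		return condition{Ready: false, Reason: "HeartbeatStale"}
	}
	return c
}

func main() {
	k8s := condition{Ready: true, Reason: "KubeletReady"}
	fmt.Println(withHeartbeat(k8s, 10*time.Second)) // prints {true KubeletReady}
	fmt.Println(withHeartbeat(k8s, 90*time.Second)) // prints {false HeartbeatStale}
}
```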

| Threshold | Value | Meaning |
| --- | --- | --- |
| `NodeFailureGracePeriod` | `30s` | `GetFailedEdgeNodes` only returns nodes that have been `NotReady` longer than this. |
| `HeartbeatStaleThreshold` | `60s` | A K8s-Ready edge node with a DB heartbeat older than this is forced `NotReady` (`HeartbeatStale`). 60s ≈ 12 missed heartbeats at a 5s interval. |

Warning

A node must be NotReady for longer than NodeFailureGracePeriod (30s) before it is eligible for failover. This prevents flapping on transient blips.
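The grace-period rule amounts to remembering when each node was first seen NotReady and comparing that timestamp against the current time. The sketch below is illustrative only: tracker, observe, and failed are hypothetical names, and it assumes that a node returning to Ready clears its timer.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// nodeFailureGracePeriod mirrors NodeFailureGracePeriod (30s).
const nodeFailureGracePeriod = 30 * time.Second

// epoch is an arbitrary fixed instant that keeps the example deterministic.
var epoch = time.Unix(0, 0)

// tracker remembers when each node was first observed NotReady.
type tracker struct {
	firstNotReady map[string]time.Time
}

func newTracker() *tracker {
	return &tracker{firstNotReady: make(map[string]time.Time)}
}

// observe records one health check result for a node.
func (t *tracker) observe(node string, ready bool, now time.Time) {
	if ready {
		delete(t.firstNotReady, node) // assumption: recovery resets the timer
		return
	}
	if _, seen := t.firstNotReady[node]; !seen {
		t.firstNotReady[node] = now
	}
}

// failed returns, in sorted order, the nodes that have been NotReady for
// longer than the grace period.
func (t *tracker) failed(now time.Time) []string {
	var out []string
	for node, since := range t.firstNotReady {
		if now.Sub(since) > nodeFailureGracePeriod {
			out = append(out, node)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	tr := newTracker()
	tr.observe("edge-1", false, epoch)
	fmt.Println(tr.failed(epoch.Add(10 * time.Second))) // prints []
	fmt.Println(tr.failed(epoch.Add(31 * time.Second))) // prints [edge-1]
	tr.observe("edge-1", true, epoch.Add(40*time.Second))
	fmt.Println(tr.failed(epoch.Add(2 * time.Minute))) // prints []
}
```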

Planning failover

PlanFailover (DB path) requires at least one failed edge node. It then calls GetAllFailoverCandidates, which returns apps whose edge node is offline, that have failover_enabled set, and that are active. Candidates are de-duplicated against existing Running/Pending pods so that a replacement is never created twice.

K8s fallback path (no DB): finds pods running on failed nodes plus Pending edge pods that cannot schedule.
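The de-duplication step in the DB path can be sketched as a set lookup. How pods are matched to candidates is not stated here, so this example makes two labeled assumptions: matching is by the app label, and only pods carrying migrated-from-edge=true count as existing replacements (the stale pod on the failed node may itself still be reported as Running). The pod type and dedupe function are hypothetical.

```go
package main

import "fmt"

// pod is a simplified, hypothetical view of a Kubernetes pod.
type pod struct {
	App         string // value of the app label
	Phase       string // pod phase, e.g. "Running" or "Pending"
	Replacement bool   // assumption: true when migrated-from-edge=true
}

// dedupe drops candidates that already have a Running or Pending replacement
// pod, and also removes repeats within the candidate list itself.
func dedupe(candidates []string, pods []pod) []string {
	covered := make(map[string]bool)
	for _, p := range pods {
		if p.Replacement && (p.Phase == "Running" || p.Phase == "Pending") {
			covered[p.App] = true
		}
	}
	var out []string
	for _, app := range candidates {
		if covered[app] {
			continue
		}
		covered[app] = true
		out = append(out, app)
	}
	return out
}

func main() {
	pods := []pod{
		{App: "camera", Phase: "Pending", Replacement: true},  // replacement already created
		{App: "lidar", Phase: "Running", Replacement: false},  // stale pod on the failed node
		{App: "radar", Phase: "Succeeded", Replacement: true}, // finished, does not count
	}
	fmt.Println(dedupe([]string{"camera", "lidar", "radar", "lidar"}, pods)) // prints [lidar radar]
}
```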

Executing failover

ExecuteFailover:

1. Force-delete the edge pod with ForceDeletePod (grace=0). Graceful deletion is impossible because the NotReady node's kubelet cannot confirm termination.
2. Create the cloud replacement with createFailoverPod.
3. Record a DB event and increment dory_failover_events_total{type="edge_to_managed"}.

The cloud replacement pod (`createFailoverPod`)

Name: {app}-{uuid8}.
Image / port / resources / env: taken from the DB processor config. The image is required — failover aborts if it is missing.
Labels:

| Label | Value |
| --- | --- |
| `app` | application slug |
| `managed-by` | `dory-orchestrator` |
| `workload-location` | `edge` |
| `migrated-from-edge` | `true` |
| `original-workload-type` | `edge` |
| `original-edge-node` | `<node>` |
| `processor-id` | set if present; prevents the reconciler from creating duplicates |

Scheduling:

- nodeSelector: {workload-type: application}; lands on Karpenter-provisioned managed nodes.
- Retains the edge-node toleration so the pod can later fail back to the edge.
- Node affinity: node-role NotIn [system].

Environment: downward-API fields plus:

| Env var | Value |
| --- | --- |
| `WORKLOAD_TYPE` | `edge` |
| `DORY_MIGRATED_FROM_EDGE` | `true` |
| `DORY_ORIGINAL_NODE` | original edge node name |
| `DORY_STATE_RESTORE_PATH` | `edge_node_apps.state_storage_path` |
| `PROCESSOR_ID` | processor UUID |

Plus custom and sensor env from the DB config.

PreStop hook: a python3 command that calls urlopen against /prestop. curl is not used because the Python containers do not ship it.

The replacement restores its state from its ConfigMap-backed store keyed by processor-id; no HTTP transfer is attempted.
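The naming, labels, and fixed environment described above can be sketched with plain Go values. This is an illustration rather than the actual createFailoverPod code. The function names and the envVar type are hypothetical, and it assumes that uuid8 means eight random hex characters, such as the first eight characters of a UUID. Downward-API fields and the custom and sensor env from the DB config are left out.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// envVar is a minimal stand-in for a container environment entry.
type envVar struct{ Name, Value string }

// failoverPodName returns "{app}-{uuid8}". Assumption: uuid8 is eight random
// hex characters; the real code may derive it from a full UUID.
func failoverPodName(app string) (string, error) {
	b := make([]byte, 4) // 4 bytes encode to 8 hex characters
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return app + "-" + hex.EncodeToString(b), nil
}

// failoverLabels builds the label set from the labels table.
// processor-id is only set when a processor ID is known.
func failoverLabels(app, edgeNode, processorID string) map[string]string {
	labels := map[string]string{
		"app":                    app,
		"managed-by":             "dory-orchestrator",
		"workload-location":      "edge",
		"migrated-from-edge":     "true",
		"original-workload-type": "edge",
		"original-edge-node":     edgeNode,
	}
	if processorID != "" {
		labels["processor-id"] = processorID
	}
	return labels
}

// failoverEnv builds the fixed environment entries from the env table.
func failoverEnv(edgeNode, statePath, processorID string) []envVar {
	return []envVar{
		{"WORKLOAD_TYPE", "edge"},
		{"DORY_MIGRATED_FROM_EDGE", "true"},
		{"DORY_ORIGINAL_NODE", edgeNode},
		{"DORY_STATE_RESTORE_PATH", statePath},
		{"PROCESSOR_ID", processorID},
	}
}

func main() {
	name, err := failoverPodName("camera")
	if err != nil {
		panic(err)
	}
	fmt.Println(name) // e.g. camera- followed by eight hex characters
	fmt.Println(failoverLabels("camera", "edge-1", "")["original-edge-node"])
	for _, e := range failoverEnv("edge-1", "/state/camera", "example-processor-id") {
		fmt.Printf("%s=%s\n", e.Name, e.Value)
	}
}
```

The node name, state path, and processor ID passed in main are made-up sample values.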

Failback (managed → edge)

GetFailbackCandidates returns apps where the edge node is online again AND the app status is failover, with the node verified Ready in Kubernetes. For each:

1. Delete the migrated managed pod gracefully.
2. Recreate the edge pod via CreatePodFromConfig with WorkloadType=edge and NodeName=<target edge node>.
3. Record a DB event and increment dory_failover_events_total{type="managed_to_edge"}.
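The selection rule for failback combines the database view with a live Kubernetes check. The sketch below is a hedged illustration of that filter, not the GetFailbackCandidates implementation. The app type, its field names, and the k8sReady map are assumptions made for the example.

```go
package main

import "fmt"

// app is a simplified, hypothetical view of one edge_node_apps row joined
// with its edge node.
type app struct {
	Name       string
	Status     string // app status in the DB, e.g. "failover" or "active"
	EdgeNode   string // name of the app's home edge node
	NodeOnline bool   // DB view: the edge node is online again
}

// failbackCandidates keeps apps that are in failover status, whose edge node
// is online in the DB, and whose node is also Ready in Kubernetes.
func failbackCandidates(apps []app, k8sReady map[string]bool) []app {
	var out []app
	for _, a := range apps {
		if a.Status == "failover" && a.NodeOnline && k8sReady[a.EdgeNode] {
			out = append(out, a)
		}
	}
	return out
}

func main() {
	apps := []app{
		{Name: "camera", Status: "failover", EdgeNode: "edge-1", NodeOnline: true},
		{Name: "lidar", Status: "failover", EdgeNode: "edge-2", NodeOnline: true},
		{Name: "radar", Status: "active", EdgeNode: "edge-1", NodeOnline: true},
	}
	// edge-2 is online in the DB but not yet Ready in Kubernetes.
	ready := map[string]bool{"edge-1": true, "edge-2": false}
	for _, a := range failbackCandidates(apps, ready) {
		fmt.Println(a.Name) // prints camera
	}
}
```

Requiring both signals avoids moving a pod back to a node that has resumed heartbeats but that Kubernetes cannot schedule onto yet.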

DB event recording on failover

On failover the orchestrator:

1. Terminates the old processor row.
2. Calls CreateFailoverProcessor, which inserts a new processor with node_type=managed, status=failover.
3. Calls UpdateAppStatus(failover).
4. Calls UpdateCurrentProcessorID.
5. Records a failover_start event.

See the Deployment page for the edge_nodes, edge_node_apps, and edge_node_events schema.

Warning

edge_node_apps and edge_node_events reference the template through the column processor_config_id, not processor_template_id. The processors and processor_template_versions tables use processor_template_id. In every case the column is a foreign key to processor_templates.id.

The failover state ConfigMap

Failover state is tracked in a ConfigMap named dory-failover-state.

Failover sequence

sequenceDiagram
    participant R as Reconcile loop
    participant FM as FailoverManager
    participant K8s as Kubernetes API
    participant DB as PostgreSQL
    participant CM as ConfigMap store

    R->>FM: HandleFailover(ctx)
    FM->>K8s: List nodes (node-type=edge), read NodeReady
    FM->>DB: GetStaleHeartbeatNodes(60s)
    Note over FM: Ready-in-K8s + stale heartbeat<br/>=> forced NotReady (HeartbeatStale)
    FM->>FM: GetFailedEdgeNodes (NotReady > 30s)
    FM->>DB: GetAllFailoverCandidates<br/>(offline + failover_enabled + active)
    FM->>FM: Dedup vs existing Running/Pending pods
    FM->>K8s: ForceDeletePod(edge pod, grace=0)
    FM->>K8s: createFailoverPod ({app}-{uuid8}, managed node)
    Note over CM: Replacement restores state from<br/>ConfigMap store keyed by processor-id
    FM->>DB: terminate old proc, CreateFailoverProcessor,<br/>UpdateAppStatus(failover), record failover_start
    FM->>FM: dory_failover_events_total{type=edge_to_managed}++
    Note over FM: Later cycle — edge node Ready again
    FM->>DB: GetFailbackCandidates (online + status=failover)
    FM->>K8s: delete managed pod (graceful)
    FM->>K8s: CreatePodFromConfig (edge, NodeName=target)
    FM->>FM: dory_failover_events_total{type=managed_to_edge}++

See Metrics for full definitions:

- dory_edge_nodes_total{status}: edge node count by healthy / not_ready.
- dory_failover_pods_active: pods currently failed over edge → managed.
- dory_failover_events_total{type}: edge_to_managed / managed_to_edge.
- dory_node_failures_total: edge node failures.