Edge Failover & Failback
The Dory Orchestrator keeps edge-hosted processors available when an edge node fails by relocating them to managed (cloud / Karpenter) nodes, and returns them to the edge once it recovers. This is implemented by the FailoverManager in pkg/failover/failover.go and runs as part of every reconcile cycle.
See Architecture for how the reconcile loop is structured, Pod Lifecycle for pod creation/deletion mechanics, and State Migration for the HTTP-based transfer path used for live (non-failed) migrations.
Overview
FailoverManager.HandleFailover(ctx) performs two phases each cycle:
- Failover (edge → managed): PlanFailover, then ExecuteFailover.
- Failback (managed → edge): PlanFailback, then ExecuteFailback.
Note
Failover relies on ConfigMap-based state persistence, not HTTP state transfer. When an edge node is dead, its pod cannot serve a state-export request, so the cloud replacement restores state from the SDK's ConfigMap-backed store keyed by processor-id. Contrast this with the live migration path in State Migration.
Health detection
CheckEdgeNodeHealth lists nodes labeled node-type=edge and reads their NodeReady condition, tracking the first time each node was seen NotReady.
When the database is enabled, it additionally calls GetStaleHeartbeatNodes(HeartbeatStaleThreshold). A node that is Ready in Kubernetes but whose DB heartbeat is stale is forced to Ready=false with reason HeartbeatStale. This catches a kubelet that still reports healthy while the processor itself has stopped checking in.
The heartbeat used here is the POST /api/v1/edge/heartbeat endpoint, which updates edge_nodes.last_heartbeat_at. See HTTP API Reference.
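The way the two health signals combine can be sketched as below. This is a minimal illustration, not the code in pkg/failover/failover.go: the `EdgeNode` and `Health` types, the `evaluateHealth` function, and the `KubernetesNotReady` reason string are hypothetical names, and the heartbeat age is passed in directly rather than fetched through GetStaleHeartbeatNodes.

```go
package main

import "time"

// HeartbeatStaleThreshold mirrors the 60s value documented on this page.
const HeartbeatStaleThreshold = 60 * time.Second

// EdgeNode is a hypothetical, simplified view of one edge node.
type EdgeNode struct {
	Name         string
	K8sReady     bool          // NodeReady condition reported by Kubernetes
	HeartbeatAge time.Duration // time since edge_nodes.last_heartbeat_at
}

// Health is the verdict for one node (hypothetical type).
type Health struct {
	Ready  bool
	Reason string
}

// evaluateHealth applies the Kubernetes condition first and, when the
// database is enabled, overrides a Ready node whose heartbeat is stale.
func evaluateHealth(n EdgeNode, dbEnabled bool) Health {
	if !n.K8sReady {
		// Reason string is an assumption made for this sketch.
		return Health{Ready: false, Reason: "KubernetesNotReady"}
	}
	if dbEnabled && n.HeartbeatAge > HeartbeatStaleThreshold {
		return Health{Ready: false, Reason: "HeartbeatStale"}
	}
	return Health{Ready: true}
}
```

With the database disabled the heartbeat age is ignored, so a node is judged on its NodeReady condition alone.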
| Threshold | Value | Meaning |
|---|---|---|
| NodeFailureGracePeriod | 30s | GetFailedEdgeNodes only returns nodes that have been NotReady longer than this. |
| HeartbeatStaleThreshold | 60s | A K8s-Ready edge node with a DB heartbeat older than this is forced NotReady (HeartbeatStale). 60s ≈ 12 missed heartbeats at a 5s interval. |
Warning
A node must be NotReady for longer than NodeFailureGracePeriod (30s) before it is eligible for failover. This prevents flapping on transient blips.
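The grace-period rule can be illustrated with a small tracker that remembers when each node was first seen NotReady. This is a sketch under stated assumptions: `notReadyTracker`, `Observe`, and `Failed` are hypothetical names, and resetting the clock when a node reports Ready again is an assumption, not something this page confirms.

```go
package main

import (
	"sort"
	"time"
)

// NodeFailureGracePeriod mirrors the 30s value documented on this page.
const NodeFailureGracePeriod = 30 * time.Second

// notReadyTracker remembers when each node was first observed NotReady.
type notReadyTracker struct {
	firstSeen map[string]time.Time
}

func newNotReadyTracker() *notReadyTracker {
	return &notReadyTracker{firstSeen: make(map[string]time.Time)}
}

// Observe records one health observation. A Ready observation clears the
// entry (assumed behavior); a NotReady one keeps the earliest timestamp.
func (t *notReadyTracker) Observe(node string, ready bool, now time.Time) {
	if ready {
		delete(t.firstSeen, node)
		return
	}
	if _, ok := t.firstSeen[node]; !ok {
		t.firstSeen[node] = now
	}
}

// Failed returns, in sorted order, the nodes that have been NotReady for
// strictly longer than the grace period.
func (t *notReadyTracker) Failed(now time.Time) []string {
	var failed []string
	for node, since := range t.firstSeen {
		if now.Sub(since) > NodeFailureGracePeriod {
			failed = append(failed, node)
		}
	}
	sort.Strings(failed)
	return failed
}
```

Because the comparison is strict, a node that has been NotReady for exactly 30s is not yet eligible; it becomes eligible on the first cycle after that.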
Planning failover
PlanFailover (DB path) requires at least one failed edge node, then calls GetAllFailoverCandidates — apps where the edge node is offline AND failover_enabled AND the app is active. Candidates are de-duplicated against any already-existing Running/Pending pods so a replacement is never created twice.
K8s fallback path (no DB): finds pods running on failed nodes plus Pending edge pods that cannot schedule.
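The de-duplication step might look like the sketch below. The `Candidate` and `Pod` types and `dedupCandidates` are hypothetical, and two details are assumptions: that pods are matched to candidates by processor-id, and that only replacement pods (those labeled migrated-from-edge=true) count as existing replacements.

```go
package main

// Candidate is a hypothetical, simplified failover candidate.
type Candidate struct {
	App         string
	ProcessorID string
}

// Pod is a hypothetical, simplified view of an existing pod.
type Pod struct {
	ProcessorID      string
	Phase            string // "Running", "Pending", "Failed", ...
	MigratedFromEdge bool   // true when labeled migrated-from-edge=true
}

// dedupCandidates drops every candidate that already has a Running or
// Pending replacement pod, so a replacement is never created twice.
func dedupCandidates(cands []Candidate, pods []Pod) []Candidate {
	live := make(map[string]bool)
	for _, p := range pods {
		if p.MigratedFromEdge && (p.Phase == "Running" || p.Phase == "Pending") {
			live[p.ProcessorID] = true
		}
	}
	var out []Candidate
	for _, c := range cands {
		if !live[c.ProcessorID] {
			out = append(out, c)
		}
	}
	return out
}
```

Filtering on the replacement label matters in this sketch because the original edge pod can still be reported as Running while its node is NotReady.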
Executing failover
ExecuteFailover:
- ForceDeletePod the edge pod with grace=0. Graceful deletion is impossible because the NotReady node's kubelet cannot confirm termination.
- createFailoverPod: create the cloud replacement.
- Record a DB event and increment dory_failover_events_total{type="edge_to_managed"}.
The cloud replacement pod (createFailoverPod)
- Name: {app}-{uuid8}.
- Image / port / resources / env: taken from the DB processor config. The image is required; failover aborts if it is missing.
- Labels:

  | Label | Value |
  |---|---|
  | app | application slug |
  | managed-by | dory-orchestrator |
  | workload-location | edge |
  | migrated-from-edge | true |
  | original-workload-type | edge |
  | original-edge-node | <node> |
  | processor-id | set if present; prevents the reconciler from creating duplicates |

- Scheduling:
  - nodeSelector: {workload-type: application}, which lands the pod on Karpenter-provisioned managed nodes.
  - Retains the edge-node toleration so the pod can later fail back to the edge.
- Node affinity: node-role NotIn [system].
- Environment: downward-API fields plus:

  | Env var | Value |
  |---|---|
  | WORKLOAD_TYPE | edge |
  | DORY_MIGRATED_FROM_EDGE | true |
  | DORY_ORIGINAL_NODE | original edge node name |
  | DORY_STATE_RESTORE_PATH | edge_node_apps.state_storage_path |
  | PROCESSOR_ID | processor UUID |

  Plus custom and sensor env from the DB config.
- PreStop hook: python3 calling urlopen against /prestop (no curl, because Python containers do not ship it).
The replacement restores its state from its ConfigMap-backed store keyed by processor-id; no HTTP transfer is attempted.
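The name, labels, and fixed environment variables listed above can be assembled roughly as follows. This is a sketch, not the body of createFailoverPod: `FailoverSpec` and the three helper functions are hypothetical, and `{uuid8}` is assumed here to mean eight hexadecimal characters, generated from four random bytes instead of a real UUID.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
)

// FailoverSpec is a hypothetical bundle of the inputs the pod needs.
type FailoverSpec struct {
	AppSlug          string
	OriginalNode     string
	ProcessorID      string // may be empty
	StateStoragePath string // from edge_node_apps.state_storage_path
}

// failoverPodName returns "{app}-{uuid8}" using 8 random hex characters
// (an assumption about what uuid8 means).
func failoverPodName(app string) (string, error) {
	b := make([]byte, 4)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return app + "-" + hex.EncodeToString(b), nil
}

// failoverLabels builds the label set from the table on this page.
func failoverLabels(s FailoverSpec) map[string]string {
	labels := map[string]string{
		"app":                    s.AppSlug,
		"managed-by":             "dory-orchestrator",
		"workload-location":      "edge",
		"migrated-from-edge":     "true",
		"original-workload-type": "edge",
		"original-edge-node":     s.OriginalNode,
	}
	if s.ProcessorID != "" {
		// Only set when present, as the table states.
		labels["processor-id"] = s.ProcessorID
	}
	return labels
}

// failoverEnv builds the fixed env vars; downward-API, custom, and
// sensor variables are left out of this sketch.
func failoverEnv(s FailoverSpec) map[string]string {
	return map[string]string{
		"WORKLOAD_TYPE":           "edge",
		"DORY_MIGRATED_FROM_EDGE": "true",
		"DORY_ORIGINAL_NODE":      s.OriginalNode,
		"DORY_STATE_RESTORE_PATH": s.StateStoragePath,
		"PROCESSOR_ID":            s.ProcessorID,
	}
}
```

Returning plain maps keeps the sketch free of Kubernetes client types; real code would copy these values into the pod's metadata and container spec.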
Failback (managed → edge)
GetFailbackCandidates returns apps where the edge node is online again AND the app status is failover, with the node verified Ready in Kubernetes. For each:
- Delete the migrated managed pod gracefully.
- Recreate the edge pod via CreatePodFromConfig with WorkloadType=edge and NodeName=<target edge node>.
- Record a DB event and increment dory_failover_events_total{type="managed_to_edge"}.
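The candidate selection for failback combines a database view with a Kubernetes check, which can be sketched as below. The `App` type, `failbackCandidates`, and the `k8sReady` map are hypothetical stand-ins for whatever GetFailbackCandidates actually queries.

```go
package main

// App is a hypothetical, simplified row describing one edge app.
type App struct {
	Slug       string
	EdgeNode   string
	Status     string // e.g. "active", "failover"
	NodeOnline bool   // the DB considers the edge node online again
}

// failbackCandidates keeps apps whose status is "failover", whose edge
// node is online in the DB, and whose node is also Ready in Kubernetes.
func failbackCandidates(apps []App, k8sReady map[string]bool) []App {
	var out []App
	for _, a := range apps {
		if a.Status == "failover" && a.NodeOnline && k8sReady[a.EdgeNode] {
			out = append(out, a)
		}
	}
	return out
}
```

A node missing from `k8sReady` reads as false, so an app is never failed back to a node Kubernetes has not confirmed as Ready.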
DB event recording on failover
On failover the orchestrator:
- Terminates the old processor row.
- CreateFailoverProcessor: inserts a new processor with node_type=managed, status=failover.
- UpdateAppStatus(failover).
- UpdateCurrentProcessorID.
- Records a failover_start event.
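The order of those writes can be sketched against an interface. The method signatures are assumptions, `TerminateProcessor` and `RecordEvent` are hypothetical names for steps the page describes without naming, and stopping at the first error is an assumed policy. `traceStore` is an in-memory stand-in used only to show the call order.

```go
package main

import "errors"

// FailoverStore lists the DB operations described on this page.
// Signatures are assumptions made for this sketch.
type FailoverStore interface {
	TerminateProcessor(processorID string) error // hypothetical name
	CreateFailoverProcessor(appID, nodeType, status string) (string, error)
	UpdateAppStatus(appID, status string) error
	UpdateCurrentProcessorID(appID, processorID string) error
	RecordEvent(appID, eventType string) error // hypothetical name
}

// recordFailover runs the writes in the documented order and returns
// the first error it meets (assumed policy).
func recordFailover(s FailoverStore, appID, oldProcessorID string) error {
	if err := s.TerminateProcessor(oldProcessorID); err != nil {
		return err
	}
	newID, err := s.CreateFailoverProcessor(appID, "managed", "failover")
	if err != nil {
		return err
	}
	if err := s.UpdateAppStatus(appID, "failover"); err != nil {
		return err
	}
	if err := s.UpdateCurrentProcessorID(appID, newID); err != nil {
		return err
	}
	return s.RecordEvent(appID, "failover_start")
}

// traceStore records the call order and can fail at a named step.
type traceStore struct {
	calls  []string
	failOn string
}

func (t *traceStore) step(name string) error {
	t.calls = append(t.calls, name)
	if name == t.failOn {
		return errors.New(name + " failed")
	}
	return nil
}

func (t *traceStore) TerminateProcessor(string) error { return t.step("TerminateProcessor") }
func (t *traceStore) CreateFailoverProcessor(_, _, _ string) (string, error) {
	return "proc-new", t.step("CreateFailoverProcessor")
}
func (t *traceStore) UpdateAppStatus(_, _ string) error { return t.step("UpdateAppStatus") }
func (t *traceStore) UpdateCurrentProcessorID(_, _ string) error {
	return t.step("UpdateCurrentProcessorID")
}
func (t *traceStore) RecordEvent(_, _ string) error { return t.step("RecordEvent") }
```

Calling `recordFailover(&traceStore{}, "app-1", "proc-old")` leaves the five step names in `calls` in the order listed above; setting `failOn` to a step name shows that later steps are skipped after a failure.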
See the Deployment page for the edge_nodes, edge_node_apps, and edge_node_events schema.
Warning
edge_node_apps and edge_node_events use the column processor_config_id (FK → processor_templates.id), not processor_template_id. The processors and processor_template_versions tables use processor_template_id. In every case the column is a foreign key to processor_templates.id.
The failover state ConfigMap
Failover state is tracked in a ConfigMap named dory-failover-state.
Failover sequence

```mermaid
sequenceDiagram
    participant R as Reconcile loop
    participant FM as FailoverManager
    participant K8s as Kubernetes API
    participant DB as PostgreSQL
    participant CM as ConfigMap store
    R->>FM: HandleFailover(ctx)
    FM->>K8s: List nodes (node-type=edge), read NodeReady
    FM->>DB: GetStaleHeartbeatNodes(60s)
    Note over FM: Ready-in-K8s + stale heartbeat<br/>=> forced NotReady (HeartbeatStale)
    FM->>FM: GetFailedEdgeNodes (NotReady > 30s)
    FM->>DB: GetAllFailoverCandidates<br/>(offline + failover_enabled + active)
    FM->>FM: Dedup vs existing Running/Pending pods
    FM->>K8s: ForceDeletePod(edge pod, grace=0)
    FM->>K8s: createFailoverPod ({app}-{uuid8}, managed node)
    Note over CM: Replacement restores state from<br/>ConfigMap store keyed by processor-id
    FM->>DB: terminate old proc, CreateFailoverProcessor,<br/>UpdateAppStatus(failover), record failover_start
    FM->>DB: dory_failover_events_total{type=edge_to_managed}++
    Note over FM: Later cycle: edge node Ready again
    FM->>DB: GetFailbackCandidates (online + status=failover)
    FM->>K8s: delete managed pod (graceful)
    FM->>K8s: CreatePodFromConfig (edge, NodeName=target)
    FM->>DB: dory_failover_events_total{type=managed_to_edge}++
```
Related metrics
See Metrics for full definitions:
- dory_edge_nodes_total{status}: edge node count by healthy / not_ready.
- dory_failover_pods_active: pods currently failed over edge → managed.
- dory_failover_events_total{type}: edge_to_managed / managed_to_edge.
- dory_node_failures_total: edge node failures.