State Migration & Node Drain

The Dory Orchestrator preserves processor state across pod moves by capturing it over HTTP from the old pod and restoring it into a replacement. This page covers the state-transfer protocol, the node-drain migration path, the two migrator implementations, and consolidation.

State-transfer protocol

TransferManager (pkg/state/transfer.go) moves state between pods over HTTP.

  • Port: statePort, default 8080.
  • Auth: bearer token from env DORY_STATE_TOKEN. If unset, requests are unauthenticated and a warning is logged.
  • Timeout: DefaultHTTPTimeout = 30s.
  • Body limit: MaxResponseBodySize = 10MB (enforced via io.LimitReader).
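
The settings above can be sketched as a small request builder. This is an illustration only: newStateRequest is a hypothetical helper, not a function from pkg/state, and the Authorization: Bearer header format is an assumption (the page only says "bearer token").

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"os"
	"time"
)

const (
	statePort          = 8080             // documented default port
	defaultHTTPTimeout = 30 * time.Second // mirrors DefaultHTTPTimeout
)

// newStateRequest is a hypothetical helper. It builds a request for a pod's
// /state endpoint and attaches the bearer token when one is configured.
func newStateRequest(method, podIP, token string) (*http.Request, error) {
	url := fmt.Sprintf("http://%s:%d/state", podIP, statePort)
	req, err := http.NewRequest(method, url, nil)
	if err != nil {
		return nil, err
	}
	if token == "" {
		// Documented behavior: no token means unauthenticated requests plus a warning.
		log.Println("warning: DORY_STATE_TOKEN is unset; state requests are unauthenticated")
	} else {
		// Assumed header format for a bearer token.
		req.Header.Set("Authorization", "Bearer "+token)
	}
	return req, nil
}

func main() {
	client := &http.Client{Timeout: defaultHTTPTimeout}
	req, err := newStateRequest(http.MethodGet, "10.0.0.12", os.Getenv("DORY_STATE_TOKEN"))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(req.Method, req.URL, "timeout:", client.Timeout)
}
```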

Capture

CaptureState issues GET http://<oldPodIP>:8080/state. A 401 response logs "check DORY_STATE_TOKEN". The body is unmarshalled into ApplicationState:

Fields: PodName, AppName, CapturedAt, StateVersion, Data, Metrics, Connections, ActiveSessions, SessionData, Uptime, RequestCount, LastHealthTime
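
As a rough sketch of the capture decoding, the struct below lists the documented field names. The Go types and the JSON key spelling are guesses for illustration; the real definitions live in pkg/state/transfer.go. decodeState and parseState are hypothetical helpers that show the body limit being applied before unmarshalling.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"strings"
	"time"
)

const maxResponseBodySize = 10 << 20 // 10MB, mirrors MaxResponseBodySize

// ApplicationState carries the documented field names.
// Every Go type below is an assumption made for this sketch.
type ApplicationState struct {
	PodName        string
	AppName        string
	CapturedAt     time.Time
	StateVersion   string
	Data           map[string]any
	Metrics        map[string]float64
	Connections    int
	ActiveSessions int
	SessionData    map[string]any
	Uptime         string
	RequestCount   int64
	LastHealthTime time.Time
}

// decodeState reads at most maxResponseBodySize bytes and unmarshals them.
// A body cut off at the limit fails to parse instead of exhausting memory.
func decodeState(body io.Reader) (*ApplicationState, error) {
	raw, err := io.ReadAll(io.LimitReader(body, maxResponseBodySize))
	if err != nil {
		return nil, err
	}
	var st ApplicationState
	if err := json.Unmarshal(raw, &st); err != nil {
		return nil, fmt.Errorf("unmarshal state: %w", err)
	}
	return &st, nil
}

// parseState is a convenience wrapper for string payloads.
func parseState(payload string) (*ApplicationState, error) {
	return decodeState(strings.NewReader(payload))
}

func main() {
	st, err := parseState(`{"PodName":"orders-7","AppName":"orders","StateVersion":"3","SessionData":{"s1":{}}}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(st.AppName, st.StateVersion, len(st.SessionData))
}
```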

Restore

TransferState issues POST http://<newPodIP>:8080/state with Content-Type: application/json and the bearer token.
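
A minimal sketch of the restore call, exercised against an in-process fake pod so it can run without a cluster. restoreState and newFakePod are hypothetical names, the Authorization: Bearer header format is assumed, and treating any non-2xx status as an error is also an assumption.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// restoreState POSTs a JSON-encoded state to baseURL + "/state".
func restoreState(client *http.Client, baseURL, token string, state any) error {
	body, err := json.Marshal(state)
	if err != nil {
		return err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/state", bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	if token != "" {
		req.Header.Set("Authorization", "Bearer "+token) // assumed header format
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return fmt.Errorf("restore rejected: %s", resp.Status)
	}
	return nil
}

// newFakePod stands in for a replacement pod. It accepts only
// POST /state with a JSON content type and the expected token.
func newFakePod(wantToken string) *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/state", func(w http.ResponseWriter, r *http.Request) {
		switch {
		case r.Method != http.MethodPost:
			w.WriteHeader(http.StatusMethodNotAllowed)
		case r.Header.Get("Authorization") != "Bearer "+wantToken:
			w.WriteHeader(http.StatusUnauthorized)
		case r.Header.Get("Content-Type") != "application/json":
			w.WriteHeader(http.StatusUnsupportedMediaType)
		default:
			w.WriteHeader(http.StatusOK)
		}
	})
	return httptest.NewServer(mux)
}

func main() {
	pod := newFakePod("secret")
	defer pod.Close()

	client := &http.Client{Timeout: 30 * time.Second}
	err := restoreState(client, pod.URL, "secret", map[string]string{"AppName": "orders"})
	fmt.Println("restore error:", err)
}
```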

Validate

ValidateState re-fetches /state from the new pod and checks AppName, StateVersion, and the SessionData count match.
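
The comparison itself can be sketched as follows. stateSummary and validateState are hypothetical names that reduce the captured and re-fetched states to the three documented checks; the field types are assumed.

```go
package main

import "fmt"

// stateSummary holds only the values that validation compares.
type stateSummary struct {
	AppName      string
	StateVersion string // type assumed for this sketch
	SessionCount int    // number of SessionData entries
}

// validateState returns the first mismatch between the captured state (want)
// and the state re-fetched from the new pod (got), or nil when they agree.
func validateState(want, got stateSummary) error {
	if want.AppName != got.AppName {
		return fmt.Errorf("AppName mismatch: want %q, got %q", want.AppName, got.AppName)
	}
	if want.StateVersion != got.StateVersion {
		return fmt.Errorf("StateVersion mismatch: want %q, got %q", want.StateVersion, got.StateVersion)
	}
	if want.SessionCount != got.SessionCount {
		return fmt.Errorf("SessionData count mismatch: want %d, got %d", want.SessionCount, got.SessionCount)
	}
	return nil
}

func main() {
	captured := stateSummary{AppName: "orders", StateVersion: "3", SessionCount: 2}
	restored := stateSummary{AppName: "orders", StateVersion: "3", SessionCount: 1}
	fmt.Println(validateState(captured, captured))
	fmt.Println(validateState(captured, restored))
}
```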

Readiness and retries

  • WaitForPodReady polls GET /health every 500ms for up to 15s before capture.
  • Retry helpers CaptureWithRetry, RestoreWithRetry, and TransferWithRetry use exponential backoff: base 1s, ×2, capped at 30s.
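
The backoff schedule described above works out as in this sketch. backoffDelay is a hypothetical helper, and whether the real retry helpers sleep before or after the first attempt is not stated on this page.

```go
package main

import (
	"fmt"
	"time"
)

const (
	baseDelay = 1 * time.Second  // documented base
	maxDelay  = 30 * time.Second // documented cap
)

// backoffDelay returns the wait after the given zero-based failed attempt:
// the base delay doubled once per attempt, capped at maxDelay.
func backoffDelay(attempt int) time.Duration {
	d := baseDelay
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= maxDelay {
			return maxDelay
		}
	}
	return d
}

func main() {
	// Prints 1s, 2s, 4s, 8s, 16s, 30s.
	for attempt := 0; attempt < 6; attempt++ {
		fmt.Println(attempt, backoffDelay(attempt))
	}
}
```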

Node-drain migration

When the event monitor (--enable-monitor) sees a NoSchedule taint on an application node, it invokes the DrainManager asynchronously with a 5-minute context.

Warning

Without --enable-monitor, node drains are handled only by Karpenter/Kubernetes — this state-preserving migration path does not run.

HandleNodeDrain(ctx, nodeName):

  1. Cooldown: DefaultDrainHandlingCooldown = 30s between drains for a node.
  2. getPodsOnNode: pods labeled managed-by=dory-orchestrator, fieldSelector spec.nodeName, Running and not terminating.
  3. getHealthyNodes: nodes labeled workload-type=application, excluding draining nodes, Ready with no NoSchedule taint.
  4. If zero healthy nodes → needsKarpenter: create replacements with an empty targetNode so they go Pending and Karpenter provisions a node.
  5. Per pod → migratePodWithStateTransfer.
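
The node filter in step 3 and the Karpenter fallback in step 4 can be sketched with plain structs. The node and taint types below are simplified stand-ins for the Kubernetes API objects, and healthyNodes takes a single draining node name for brevity; both are assumptions made for this example.

```go
package main

import "fmt"

// taint and node are simplified stand-ins for the Kubernetes API types.
type taint struct{ Key, Effect string }

type node struct {
	Name   string
	Labels map[string]string
	Ready  bool
	Taints []taint
}

func hasNoSchedule(n node) bool {
	for _, t := range n.Taints {
		if t.Effect == "NoSchedule" {
			return true
		}
	}
	return false
}

// healthyNodes keeps application nodes that are Ready, carry no NoSchedule
// taint, and are not the node being drained.
func healthyNodes(nodes []node, draining string) []string {
	var out []string
	for _, n := range nodes {
		if n.Name == draining || n.Labels["workload-type"] != "application" {
			continue
		}
		if !n.Ready || hasNoSchedule(n) {
			continue
		}
		out = append(out, n.Name)
	}
	return out
}

// sampleNodes returns a small fixture: one draining, one healthy,
// one not Ready, and one non-application node.
func sampleNodes() []node {
	app := map[string]string{"workload-type": "application"}
	return []node{
		{Name: "node-a", Labels: app, Ready: true, Taints: []taint{{Key: "drain", Effect: "NoSchedule"}}},
		{Name: "node-b", Labels: app, Ready: true},
		{Name: "node-c", Labels: app, Ready: false},
		{Name: "node-d", Labels: map[string]string{"workload-type": "system"}, Ready: true},
	}
}

func main() {
	healthy := healthyNodes(sampleNodes(), "node-a")
	needsKarpenter := len(healthy) == 0 // empty targetNode when true
	fmt.Println(healthy, needsKarpenter) // [node-b] false
}
```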

migratePodWithStateTransfer(oldPod, targetNode)

Step | Action
1. Capture | Capture state from the old pod IP before replacement (CaptureWithRetry, 30s ctx, 3 retries). Failure → fresh start; never fails the migration.
2. Create | Create a replacement named {app}-drain-{unixts}. Pull processor config for port/resources/env, build PodSpecConfig, ensure the sentinel ConfigMap dory-controller-ref and attach its owner ref (Controller: true, BlockOwnerDeletion: false). targetNode == "" leaves NodeName empty for Karpenter.
3. Wait | WaitForPodReady: 2m normally, 5m when waiting on Karpenter.
4. Restore | Restore state (RestoreWithRetry); non-fatal.
5. Finalize | markPodMigrated + DB updates (UpdateProcessorPodName with hostIP, status running, health).
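
The control flow of these five steps, in particular which failures abort the migration, can be sketched with injected functions. The hooks type, migrate, and fakeHooks are hypothetical; the real function works against the Kubernetes API, the state TransferManager, and the database. Treating create, wait, and finalize failures as fatal is an assumption, since the page only states that capture and restore are non-fatal.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const (
	podReadyTimeout          = 2 * time.Minute // DefaultPodReadyTimeout
	karpenterPodReadyTimeout = 5 * time.Minute // DefaultKarpenterPodReadyTimeout
)

// hooks are hypothetical stand-ins for the real Kubernetes, HTTP, and DB calls.
type hooks struct {
	Capture   func() ([]byte, error)
	Create    func(name, targetNode string) error
	WaitReady func(name string, timeout time.Duration) error
	Restore   func(name string, state []byte) error
	Finalize  func(name string) error
}

// replacementName follows the documented {app}-drain-{unixts} pattern.
func replacementName(app string, unixTS int64) string {
	return fmt.Sprintf("%s-drain-%d", app, unixTS)
}

// readyTimeout picks the longer wait when Karpenter must provision a node.
func readyTimeout(targetNode string) time.Duration {
	if targetNode == "" {
		return karpenterPodReadyTimeout
	}
	return podReadyTimeout
}

// migrate returns the replacement name and whether state was restored.
func migrate(app, targetNode string, unixTS int64, h hooks) (name string, restored bool, err error) {
	state, capErr := h.Capture() // step 1: a failure means a fresh start
	name = replacementName(app, unixTS)
	if err = h.Create(name, targetNode); err != nil { // step 2
		return name, false, fmt.Errorf("create: %w", err)
	}
	if err = h.WaitReady(name, readyTimeout(targetNode)); err != nil { // step 3
		return name, false, fmt.Errorf("wait ready: %w", err)
	}
	if capErr == nil {
		restored = h.Restore(name, state) == nil // step 4: non-fatal
	}
	return name, restored, h.Finalize(name) // step 5
}

var errCaptureFailed = errors.New("old pod unreachable")

// fakeHooks builds in-memory hooks; captureErr simulates a failed capture.
func fakeHooks(captureErr error) hooks {
	return hooks{
		Capture:   func() ([]byte, error) { return []byte(`{}`), captureErr },
		Create:    func(string, string) error { return nil },
		WaitReady: func(string, time.Duration) error { return nil },
		Restore:   func(string, []byte) error { return nil },
		Finalize:  func(string) error { return nil },
	}
}

func main() {
	name, restored, err := migrate("orders", "", 1700000000, fakeHooks(errCaptureFailed))
	fmt.Println(name, restored, err) // orders-drain-1700000000 false <nil>
}
```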

Note

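The old pod is not deleted by the drain manager. The kubectl drain eviction removes it. The replacement is created first, so the old pod keeps serving until it is evicted. State preservation itself is best-effort: capture and restore failures are non-fatal, and a failed capture means the replacement starts fresh.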

Tip

The sentinel ConfigMap owner ref with Controller: true, BlockOwnerDeletion: false makes kubectl drain treat each managed pod as controller-owned, so it evicts cleanly without --force — even though these are otherwise bare pods.

Drain constants: DefaultStateTransferTimeout=30s, DefaultPodReadyTimeout=2m, DefaultStateTransferRetries=3, DefaultDrainHandlingCooldown=30s, DefaultKarpenterPodReadyTimeout=5m, MigratedPodTTL=30m.

sequenceDiagram
    participant Mon as Event Monitor
    participant DM as DrainManager
    participant Old as Old Pod
    participant K8s as K8s API
    participant Karp as Karpenter
    participant New as Replacement Pod

    Mon->>DM: NoSchedule taint on application node
    DM->>DM: cooldown (30s) + list pods on node
    DM->>DM: getHealthyNodes
    alt no healthy node
        DM->>DM: needsKarpenter (empty targetNode)
    end
    DM->>Old: GET /state (capture, 3 retries)
    Old-->>DM: ApplicationState (or fresh start)
    DM->>K8s: create {app}-drain-{ts} + sentinel ownerRef
    opt targetNode == ""
        K8s->>Karp: Pending pod triggers provisioning
        Karp-->>K8s: new node
    end
    DM->>New: WaitForPodReady (2m / 5m Karpenter)
    DM->>New: POST /state (restore, non-fatal)
    DM->>K8s: markPodMigrated + DB update (hostIP, running)
    Note over Old: kubectl drain evicts old pod

Migrator paths

The migrator (pkg/migrator) offers two implementations. Both are create-before-delete.

Default — Migrate() (no HTTP state transfer)

Relies on the SDK's own ConfigMap persistence rather than HTTP state transfer:

  1. Preserve image and labels; new pod name toggles a -m suffix.
  2. Create the new pod from DB config.
  3. WaitForPodRunning (on failure: rollback and delete the new pod).
  4. Update DB pod name and nodeIP (pod.Status.HostIP).
  5. Delete the old pod and wait for deletion.
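
A small sketch of the naming toggle and the create-before-delete ordering, including the rollback branch. toggleSuffix and migrateActions are hypothetical, and the exact toggle rule (strip a trailing -m if present, otherwise append it) is an assumption about what "toggles" means here.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// toggleSuffix flips a trailing "-m" so successive migrations alternate
// between two names. The exact rule is assumed for this sketch.
func toggleSuffix(name string) string {
	if strings.HasSuffix(name, "-m") {
		return strings.TrimSuffix(name, "-m")
	}
	return name + "-m"
}

// migrateActions lists the ordered actions for one migration.
// A non-nil waitErr simulates the new pod never reaching Running.
func migrateActions(oldName string, waitErr error) []string {
	newName := toggleSuffix(oldName)
	actions := []string{"create " + newName, "wait-running " + newName}
	if waitErr != nil {
		// Roll back: the old pod is untouched and keeps serving.
		return append(actions, "rollback: delete "+newName)
	}
	return append(actions,
		"update-db "+newName,
		"delete "+oldName,
		"wait-deleted "+oldName,
	)
}

var errNotRunning = errors.New("pod never reached Running")

func main() {
	for _, a := range migrateActions("orders", nil) {
		fmt.Println(a)
	}
	fmt.Println(migrateActions("orders-m", errNotRunning))
}
```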

Constants: DefaultMaxConcurrentMigrations=3, PodDeletionTimeout=50s, migrator HTTP client timeout 5s (health checks only). MigrateBatch runs at most 3 migrations concurrently.
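
One common way to cap concurrency at a fixed number is a buffered-channel semaphore, sketched below. This is not necessarily how MigrateBatch is implemented; migrateBatch and peakConcurrency are hypothetical, and limit must be greater than zero.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const defaultMaxConcurrentMigrations = 3 // mirrors DefaultMaxConcurrentMigrations

// migrateBatch runs migrate for every name with at most limit calls in flight.
func migrateBatch(names []string, limit int, migrate func(string) error) map[string]error {
	sem := make(chan struct{}, limit)
	results := make(map[string]error, len(names))
	var mu sync.Mutex
	var wg sync.WaitGroup
	for _, name := range names {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit migrations are running
		go func(name string) {
			defer wg.Done()
			defer func() { <-sem }()
			err := migrate(name)
			mu.Lock()
			results[name] = err
			mu.Unlock()
		}(name)
	}
	wg.Wait()
	return results
}

// peakConcurrency runs a batch of fake migrations and reports the highest
// number that were running at the same time.
func peakConcurrency(jobs, limit int) int {
	names := make([]string, jobs)
	for i := range names {
		names[i] = fmt.Sprintf("pod-%d", i)
	}
	var mu sync.Mutex
	cur, peak := 0, 0
	migrateBatch(names, limit, func(string) error {
		mu.Lock()
		cur++
		if cur > peak {
			peak = cur
		}
		mu.Unlock()
		time.Sleep(5 * time.Millisecond) // simulate work
		mu.Lock()
		cur--
		mu.Unlock()
		return nil
	})
	return peak
}

func main() {
	fmt.Println("peak:", peakConcurrency(9, defaultMaxConcurrentMigrations))
}
```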

Enhanced — MigrateWithValidation() (with state transfer)

Five phases, plus an intermediate Phase 2.5 for state transfer:

  1. Create the new pod and wait until ready.
  2. Validate /health.
  3. Phase 2.5: transferState via state.TransferManager.TransferWithRetry (old → new, 3 retries). Failure is logged but non-fatal.
  4. Gradual traffic shift (no real load balancer).
  5. Drain the old pod (10s).
  6. Delete the old pod.
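
The mix of fatal and non-fatal phases can be expressed as a table-driven runner, sketched below. The phase type, runPhases, and enhancedPhases are hypothetical. Only the state-transfer phase is documented as non-fatal; marking every other phase as fatal is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// phase is one step of the enhanced migration.
type phase struct {
	Name  string
	Fatal bool
	Run   func() error
}

// runPhases executes phases in order. A failing fatal phase stops the run;
// a failing non-fatal phase is recorded as a warning and the run continues.
func runPhases(phases []phase) (warnings []string, err error) {
	for _, p := range phases {
		if runErr := p.Run(); runErr != nil {
			if p.Fatal {
				return warnings, fmt.Errorf("%s: %w", p.Name, runErr)
			}
			warnings = append(warnings, p.Name+": "+runErr.Error())
		}
	}
	return warnings, nil
}

var errTransfer = errors.New("state transfer failed after 3 retries")

// enhancedPhases builds the documented sequence with no-op bodies;
// transferErr simulates the outcome of Phase 2.5.
func enhancedPhases(transferErr error) []phase {
	ok := func() error { return nil }
	return []phase{
		{Name: "create and wait ready", Fatal: true, Run: ok},
		{Name: "validate /health", Fatal: true, Run: ok},
		{Name: "phase 2.5 state transfer", Fatal: false, Run: func() error { return transferErr }},
		{Name: "gradual traffic shift", Fatal: true, Run: ok},
		{Name: "drain old pod", Fatal: true, Run: ok},
		{Name: "delete old pod", Fatal: true, Run: ok},
	}
}

func main() {
	warnings, err := runPhases(enhancedPhases(errTransfer))
	fmt.Println(warnings, err)
}
```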

Consolidation

Consolidation bin-packs running pods onto fewer nodes to let Karpenter reclaim emptied capacity. It is cooldown-gated (default 1m) and requires at least two workload-type=application nodes. scheduler.ConsolidatePods plans the moves and migrator.MigrateBatch executes them (≤3 concurrent). Emptied nodes are decommissioned by Karpenter, not the orchestrator. See Architecture for where consolidation sits in the reconcile cycle.
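
The two preconditions can be sketched as a simple gate. shouldConsolidate is a hypothetical helper, and whether the real cooldown comparison is inclusive is an assumption; the planning logic in scheduler.ConsolidatePods is not reproduced here.

```go
package main

import (
	"fmt"
	"time"
)

const (
	consolidationCooldown = time.Minute // documented default
	minApplicationNodes   = 2           // documented minimum
)

// shouldConsolidate reports whether a consolidation pass may run, given the
// time since the last pass and the number of workload-type=application nodes.
func shouldConsolidate(sinceLast time.Duration, applicationNodes int) bool {
	return sinceLast >= consolidationCooldown && applicationNodes >= minApplicationNodes
}

func main() {
	fmt.Println(shouldConsolidate(30*time.Second, 3)) // false: still cooling down
	fmt.Println(shouldConsolidate(2*time.Minute, 1))  // false: too few nodes
	fmt.Println(shouldConsolidate(2*time.Minute, 3))  // true
}
```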