
Pod Lifecycle

Every running pod managed by the Dory Orchestrator originates from a processors row in PostgreSQL. This page traces how a DB row becomes a pod, how the desired-vs-actual diff keeps the cluster in sync, and how the pod spec is assembled from the processor's runtime_config_template.

From DB row to desired set

Each PollInterval, the config client runs:

SELECT p.id, pa.slug, pv.image_uri, pv.digest, pv.version,
       p.node_type, p.k8s_namespace, pv.runtime_config_template,
       <sensor fields>
FROM processors p
JOIN processor_templates pa ON p.processor_template_id = pa.id
JOIN processor_template_versions pv
     ON pv.processor_template_id = pa.id AND pv.is_active = true
LEFT JOIN sensors s ON p.sensor_id = s.id
WHERE p.status NOT IN ('terminated','failed');

Each row becomes a ProcessorConfig in the desired set.

Desired-vs-actual diff (keyed by processor-id)

The reconciler compares the desired set against the actual set:

  • Desired pods are keyed by processor ID.
  • Actual pods are the live pods labeled managed-by=dory-orchestrator, keyed by pod.Labels["processor-id"].
  • Pods in the PodFailed phase are scheduled for recreation.
  • ToAdd = desired IDs absent from actual → create.
  • ToRemove = actual IDs absent from desired → delete.

One processor ID maps to exactly one pod. Pods are immutable, so any change to a processor is realized as delete + recreate, never an in-place update. See Architecture for the full reconcile cycle.
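The diff rules above can be sketched in Go as follows. This is a minimal illustration, not the orchestrator's actual code: the ActualPod and Diff types, the ToRecreate field name, and the calculateDiff signature are assumptions made for the example, and only the label keys and the ToAdd/ToRemove semantics come from this page.

```go
package main

import (
	"fmt"
	"sort"
)

// PodPhase holds the pod phase strings used in this sketch.
type PodPhase string

const (
	PodRunning PodPhase = "Running"
	PodFailed  PodPhase = "Failed"
)

// ActualPod is a hypothetical, trimmed-down view of a live pod.
type ActualPod struct {
	Name   string
	Labels map[string]string
	Phase  PodPhase
}

// Diff is a hypothetical result type for the comparison.
type Diff struct {
	ToAdd      []string // desired processor IDs with no pod
	ToRemove   []string // processor IDs of pods that are no longer desired
	ToRecreate []string // desired processor IDs whose pod is in PodFailed
}

// calculateDiff keys actual pods by their processor-id label and compares
// them with the desired processor IDs.
func calculateDiff(desired map[string]struct{}, pods []ActualPod) Diff {
	actual := make(map[string]ActualPod, len(pods))
	for _, p := range pods {
		// Only pods owned by the orchestrator take part in the diff.
		if p.Labels["managed-by"] != "dory-orchestrator" {
			continue
		}
		if id := p.Labels["processor-id"]; id != "" {
			actual[id] = p
		}
	}

	var d Diff
	for id := range desired {
		p, ok := actual[id]
		switch {
		case !ok:
			d.ToAdd = append(d.ToAdd, id)
		case p.Phase == PodFailed:
			d.ToRecreate = append(d.ToRecreate, id)
		}
	}
	for id := range actual {
		if _, ok := desired[id]; !ok {
			d.ToRemove = append(d.ToRemove, id)
		}
	}
	// Map iteration order is random, so sort for stable output.
	sort.Strings(d.ToAdd)
	sort.Strings(d.ToRemove)
	sort.Strings(d.ToRecreate)
	return d
}

func main() {
	owned := func(id string) map[string]string {
		return map[string]string{"managed-by": "dory-orchestrator", "processor-id": id}
	}
	desired := map[string]struct{}{"p1": {}, "p2": {}, "p3": {}}
	pods := []ActualPod{
		{Name: "a", Labels: owned("p1"), Phase: PodRunning},
		{Name: "b", Labels: owned("p2"), Phase: PodFailed},
		{Name: "c", Labels: owned("p9"), Phase: PodRunning},
	}
	fmt.Printf("%+v\n", calculateDiff(desired, pods))
}
```

Keying both sides by processor ID is what makes the one-processor-to-one-pod rule cheap to enforce: each ID is looked up once per side, and anything that appears on only one side needs action.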

graph LR
    DB[(processors rows)] -->|desired by processor ID| DIFF{calculateDiff}
    PODS[Pods labeled<br/>managed-by=dory-orchestrator] -->|actual by processor-id label| DIFF
    DIFF -->|ToAdd| CREATE[createPodForProcessor]
    DIFF -->|ToRemove| DELETE[deleteRemovedPods]
    DIFF -->|PodFailed| CREATE

Pod creation

createPodForProcessor builds a k8s.PodSpecConfig (resources, env via GetEnvVars/GetSensorEnvVars, probes, command/args, termination grace, ProcessorID label), attaches a sentinel ConfigMap owner reference, and calls CreatePodFromConfig. It then:

  1. InvalidatePodChange(node) — invalidates the scheduler's NodeResourceCache for that node.
  2. UpdateProcessorPodName(..., nodeIP="") — records the pod name (node IP not yet known).
  3. edgeStore.UpdateProcessorOnPodCreated.

Async Running follow-up

An async worker pool (size 5, timeout 120s) waits for the pod to reach Running, then:

  • Sets status=Running and records health.
  • Records the pod name plus nodeIP from pod.Status.HostIP.
  • Calls edgeStore.UpdateProcessorOnPodRunning.
  • Records the startup duration.
  • Runs SDK detection.
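A bounded pool with a per-pod deadline could look roughly like the sketch below. The function names, the polling approach, and the check callback are assumptions for illustration, since this page only states the pool size (5) and the timeout (120s); the real worker may watch pod events instead of polling.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// waitRunning polls check until it reports true or ctx expires.
func waitRunning(ctx context.Context, check func() bool, interval time.Duration) bool {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if check() {
			return true
		}
		select {
		case <-ctx.Done():
			return false
		case <-ticker.C:
		}
	}
}

// runFollowUps waits for each pod with at most `workers` goroutines and a
// per-pod timeout. It returns whether each pod reached Running in time.
func runFollowUps(pods []string, workers int, timeout, interval time.Duration,
	check func(pod string) bool) map[string]bool {

	jobs := make(chan string)
	results := make(map[string]bool, len(pods))
	var mu sync.Mutex
	var wg sync.WaitGroup

	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for pod := range jobs {
				ctx, cancel := context.WithTimeout(context.Background(), timeout)
				ok := waitRunning(ctx, func() bool { return check(pod) }, interval)
				cancel()
				mu.Lock()
				results[pod] = ok
				mu.Unlock()
			}
		}()
	}
	for _, p := range pods {
		jobs <- p
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	// Hypothetical check: every pod is already Running.
	alwaysRunning := func(string) bool { return true }
	res := runFollowUps([]string{"detector-1a2b3c4d"}, 5, 120*time.Second,
		time.Second, alwaysRunning)
	fmt.Println(res)
}
```

In this sketch the post-Running steps (status update, node IP, startup duration, SDK detection) would run where the result is recorded, and only for pods that came up before the deadline.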

Pod deletion

Pods whose processor ID is no longer in the desired set are removed in deleteRemovedPods. Because pods are immutable, this is also half of the delete-plus-recreate path for any processor change.

Pod spec built from runtime_config_template

buildPodSpec(cfg PodSpecConfig) assembles the pod. Key rules:

  • Image: REQUIRED. {image_uri}@{digest} when a digest is present, otherwise image_uri.
  • Pod name: cfg.PodName, or {app}-{unixts%1e8} if unset.
  • Container port: cfg.ContainerPort, or DefaultContainerPort (8080). One container port named health.
  • Labels: app, managed-by=dory-orchestrator, workload-location=<type>, and processor-id (when set).
  • TerminationGracePeriodSeconds: default 45 (or cfg + 15).
  • ServiceAccountName: dory-processor.
  • ImagePullSecrets: ecr-registry-secret (override via env DORY_IMAGE_PULL_SECRET).
  • SecurityContext: runAsNonRoot, runAsUser/fsGroup = 1000.
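The image and pod-name rules are simple enough to show directly. The helper names imageRef and podName below are hypothetical and the example values are made up; only the two fallback rules themselves come from the list above.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// imageRef pins the image to its digest when one is available.
func imageRef(imageURI, digest string) (string, error) {
	if imageURI == "" {
		return "", errors.New("image is required")
	}
	if digest != "" {
		return imageURI + "@" + digest, nil
	}
	return imageURI, nil
}

// podName returns the configured name, or {app}-{unixts % 1e8} when unset.
func podName(configured, app string, unixSeconds int64) string {
	if configured != "" {
		return configured
	}
	return fmt.Sprintf("%s-%d", app, unixSeconds%100_000_000)
}

func main() {
	ref, err := imageRef("registry.example.com/detector", "sha256:abc123")
	if err != nil {
		panic(err)
	}
	fmt.Println(ref)
	fmt.Println(podName("", "detector", time.Now().Unix()))
}
```

Pinning by digest means a re-pushed tag cannot silently change what a recreated pod runs, which matters when every processor change is a delete plus recreate.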

Probes (paths are fixed)

| Probe | Path | Defaults |
|---|---|---|
| Readiness | /ready (HTTPGet) | initialDelay 5, period 2, timeout 1, success 2, failure 3 |
| Liveness | /health (HTTPGet) | initialDelay 10 (or readiness+5), period 10, timeout 2, success 1, failure 3 |

Warning

The probe paths /ready and /health are fixed in the pod spec builder. Only the timing values are configurable (via health_probes.*). A processor that serves health on other paths will fail its probes.

PreStop hook

python3 -c "import urllib.request; urllib.request.urlopen('http://localhost:<port>/prestop')" 2>/dev/null || true; sleep 10

Note

The PreStop hook uses python3 + urllib.request (not curl), so it works in Python-based processor containers without extra tooling.
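As a rough sketch, the hook command can be rendered for a given port like this. The function name preStopCommand and the /bin/sh -c wrapper are assumptions: a shell is needed for the `||` and `;` operators, but the page does not say which one the builder uses.

```go
package main

import "fmt"

// preStopCommand renders the PreStop exec command for a container port.
// The /bin/sh -c wrapper is assumed, not confirmed by the source.
func preStopCommand(port int) []string {
	script := fmt.Sprintf(
		`python3 -c "import urllib.request; urllib.request.urlopen('http://localhost:%d/prestop')" 2>/dev/null || true; sleep 10`,
		port,
	)
	return []string{"/bin/sh", "-c", script}
}

func main() {
	for _, part := range preStopCommand(8080) {
		fmt.Println(part)
	}
}
```

The `|| true` keeps a failed or missing /prestop endpoint from failing the hook, and the trailing `sleep 10` always runs before the container receives SIGTERM.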

Node placement

  • Node affinity (always): node-role NotIn [system].
  • Managed workloads: nodeSelector{workload-type: application}.
  • Edge workloads: toleration edge-node=true:NoSchedule, plus (when NodeName is unset) nodeSelector{node-type: edge}.
  • OwnerReference: attached when provided; the drain manager supplies a sentinel ConfigMap owner ref (see State Migration & Node Drain).
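The placement rules can be summarized with plain Go types, avoiding the Kubernetes client libraries so the example stays self-contained. The Placement and Toleration types, the placementFor function, and the workload type strings "managed" and "edge" are all assumptions for illustration.

```go
package main

import "fmt"

// Toleration is a simplified stand-in for the Kubernetes toleration type.
type Toleration struct {
	Key, Value, Effect string
}

// Placement is a hypothetical summary of the scheduling constraints.
type Placement struct {
	ExcludedNodeRoles []string          // node affinity: node-role NotIn [...]
	NodeSelector      map[string]string // empty when no selector applies
	Tolerations       []Toleration
}

// placementFor applies the rules above for a workload type and an optional
// pre-assigned node name.
func placementFor(workloadType, nodeName string) Placement {
	p := Placement{
		ExcludedNodeRoles: []string{"system"}, // always applied
		NodeSelector:      map[string]string{},
	}
	switch workloadType {
	case "managed":
		p.NodeSelector["workload-type"] = "application"
	case "edge":
		p.Tolerations = append(p.Tolerations,
			Toleration{Key: "edge-node", Value: "true", Effect: "NoSchedule"})
		// A pinned NodeName already decides placement, so the selector is
		// only added when no node has been chosen.
		if nodeName == "" {
			p.NodeSelector["node-type"] = "edge"
		}
	}
	return p
}

func main() {
	fmt.Printf("%+v\n", placementFor("managed", ""))
	fmt.Printf("%+v\n", placementFor("edge", ""))
	fmt.Printf("%+v\n", placementFor("edge", "edge-node-7"))
}
```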

Environment variables

buildEnvVars(workloadType, processorID) writes system env first:

  • POD_NAME, POD_NAMESPACE, POD_IP, NODE_NAME (downward API)
  • WORKLOAD_TYPE
  • DATABASE_URL (from secret dory-db-secret, key database-url)
  • PROCESSOR_ID (when set)
  • For edge only: ORCHESTRATOR_URL=http://dory-orchestrator-metrics.dory-system.svc.cluster.local:8080

Then cfg.EnvVars (from runtime_config_template) are appended after the system env.

Warning

In Kubernetes, the last duplicate env var wins. Because runtime_config_template env vars are appended after the system defaults, a value in runtime_config_template overrides the matching system default.
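To make the override behavior concrete, the sketch below appends template variables after the system ones and then resolves duplicates the same way, with the last entry winning. The EnvVar type and the function names are hypothetical, and plain string values stand in for the downward API and secret references that the real system variables use.

```go
package main

import "fmt"

// EnvVar is a simplified name/value pair.
type EnvVar struct {
	Name, Value string
}

// buildEnv places system variables first and template variables after them.
func buildEnv(system, template []EnvVar) []EnvVar {
	out := make([]EnvVar, 0, len(system)+len(template))
	out = append(out, system...)
	return append(out, template...)
}

// effectiveEnv resolves duplicates: a later entry replaces an earlier one.
func effectiveEnv(env []EnvVar) map[string]string {
	resolved := make(map[string]string, len(env))
	for _, e := range env {
		resolved[e.Name] = e.Value
	}
	return resolved
}

// overridden lists system variable names that the template redefines.
func overridden(system, template []EnvVar) []string {
	sys := make(map[string]bool, len(system))
	for _, e := range system {
		sys[e.Name] = true
	}
	var names []string
	seen := map[string]bool{}
	for _, e := range template {
		if sys[e.Name] && !seen[e.Name] {
			seen[e.Name] = true
			names = append(names, e.Name)
		}
	}
	return names
}

func main() {
	system := []EnvVar{{"WORKLOAD_TYPE", "managed"}, {"PROCESSOR_ID", "p1"}}
	template := []EnvVar{{"PROCESSOR_ID", "from-template"}, {"LOG_LEVEL", "debug"}}
	fmt.Println(effectiveEnv(buildEnv(system, template)))
	fmt.Println("overridden:", overridden(system, template))
}
```

A check like overridden could run when a template is saved, so that an accidental redefinition of DATABASE_URL or PROCESSOR_ID is caught before it reaches a pod.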

Field mapping (runtime_config_template → pod)

| Pod field | Source / rule |
|---|---|
| Image | {image_uri}@{digest} if digest, else image_uri |
| Pod name | {slug}-{last 8 chars of processor UUID} |
| Container name | slug |
| Container port | container.port → env DORY_HEALTH_PORT → dory.health.port → resource_defaults.container_port → 8080 |
| Resources | resources.{cpu_request,memory_request,cpu_limit,memory_limit} or legacy resource_defaults.* |
| GPU | resources.gpu_required → nvidia.com/gpu: 1 |
| Env vars | runtime_config_template.env_vars map (+ legacy dory.* converted); DORY_APP_VERSION always set from processor_template_versions.version |
| Sensor env | DORY_SENSOR_ID, DORY_SENSOR_TYPE, DORY_SENSOR_CONNECTION_CONFIG, DORY_SENSOR_METADATA, DORY_GEO_REFERENCE_POINTS, DORY_GEOHASH (precision 9 from location_point) |
| System env (always) | POD_NAME, POD_NAMESPACE, POD_IP, NODE_NAME, WORKLOAD_TYPE, DATABASE_URL, PROCESSOR_ID |
| Command / Args | container.command / container.args override |
| ImagePullPolicy | container.image_pull_policy (default IfNotPresent) |
| Health probes | timing from health_probes.* (legacy dory.health) |
| Labels | app=slug, managed-by=dory-orchestrator, workload-location, processor-id |

See Deployment for the cluster-side configuration that backs these fields.
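The container port row describes a fallback chain, which can be sketched as a first-match lookup. The RuntimeConfig struct and its field names are hypothetical, as is the assumption that DORY_HEALTH_PORT is read from the template's env_vars map; the actual template layout may differ.

```go
package main

import (
	"fmt"
	"strconv"
)

// DefaultContainerPort is the final fallback named on this page.
const DefaultContainerPort = 8080

// RuntimeConfig is a hypothetical, flattened view of the template fields
// that can supply the container port. Zero means "not set".
type RuntimeConfig struct {
	ContainerPort        int               // container.port
	EnvVars              map[string]string // env_vars, may hold DORY_HEALTH_PORT
	LegacyHealthPort     int               // dory.health.port
	ResourceDefaultsPort int               // resource_defaults.container_port
}

// resolvePort returns the first usable port in the documented order.
func resolvePort(c RuntimeConfig) int {
	if c.ContainerPort > 0 {
		return c.ContainerPort
	}
	if raw, ok := c.EnvVars["DORY_HEALTH_PORT"]; ok {
		// A malformed value is skipped rather than treated as an error.
		if p, err := strconv.Atoi(raw); err == nil && p > 0 {
			return p
		}
	}
	if c.LegacyHealthPort > 0 {
		return c.LegacyHealthPort
	}
	if c.ResourceDefaultsPort > 0 {
		return c.ResourceDefaultsPort
	}
	return DefaultContainerPort
}

func main() {
	fmt.Println(resolvePort(RuntimeConfig{}))
	fmt.Println(resolvePort(RuntimeConfig{
		EnvVars:          map[string]string{"DORY_HEALTH_PORT": "9000"},
		LegacyHealthPort: 7000,
	}))
}
```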

SDK detection via X-Dory-SDK-Version

After a pod reaches Running, the reconciler GETs http://<podIP>:<port>/health and reads the X-Dory-SDK-Version response header (up to 3 retries, 2s delay, 5s timeout).

  • Header present — increments dory_sdk_detected_total{sdk_version, app_name}.
  • Header absent — logs:

    SDK NOT DETECTED ... Health probes, state transfer, and graceful shutdown will not work correctly

    and increments dory_sdk_not_detected_total{app_name}.
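A detection loop along these lines might look like the sketch below. It treats "up to 3 retries" as three attempts in total and retries on both request errors and a missing header; both readings are assumptions, as are the function names. The demo in main uses an in-process test server in place of a real pod.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// httpFetcher returns a function that GETs url and reads the SDK header.
func httpFetcher(client *http.Client, url string) func() (string, error) {
	return func() (string, error) {
		resp, err := client.Get(url)
		if err != nil {
			return "", err
		}
		defer resp.Body.Close()
		return resp.Header.Get("X-Dory-SDK-Version"), nil
	}
}

// detectSDK calls fetch up to `attempts` times, sleeping `delay` between
// tries, and reports the first non-empty SDK version it sees.
func detectSDK(fetch func() (string, error), attempts int, delay time.Duration) (string, bool) {
	for i := 0; i < attempts; i++ {
		if i > 0 {
			time.Sleep(delay)
		}
		version, err := fetch()
		if err == nil && version != "" {
			return version, true
		}
	}
	return "", false
}

func main() {
	// Stand-in for a processor pod that uses the SDK.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("X-Dory-SDK-Version", "1.4.0")
	}))
	defer srv.Close()

	client := &http.Client{Timeout: 5 * time.Second}
	version, ok := detectSDK(httpFetcher(client, srv.URL+"/health"), 3, 2*time.Second)
	if ok {
		fmt.Println("SDK detected:", version)
	} else {
		fmt.Println("SDK NOT DETECTED")
	}
}
```

Separating the fetch from the retry loop keeps the loop testable without a network, and the counters described above would be incremented at the two return points.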

See Metrics for these counters.