Skip to content

State Management

State management is the reason the Dory SDK exists: in-memory processor state is captured, serialized, and persisted so a pod can shut down, migrate to another node, or restart without losing work. This page covers how state round-trips, the backends, serialization, the migration endpoint, snapshots, and recovery decisions.

How stateful vars round-trip

stateful(...) class attributes (see Core Concepts) are the default state surface:

  • get_state(self) -> dict returns the current values of all stateful vars.
  • restore_state(self, state) writes them back on startup.

You can override both hooks to capture or rebuild state that does not live in stateful attributes (e.g. derived caches, external cursors).

class P(BaseProcessor):
    counter = stateful(0)

    def get_state(self) -> dict:
        return {"counter": self.counter, "derived": self.counter * 2}

    async def restore_state(self, state) -> None:
        self.counter = state["counter"]

State backends

The backend is selected by DORY_STATE_BACKEND (default configmap). The StateBackend enum:

Backend Value Storage Keying
ConfigMap (default) configmap Kubernetes ConfigMap name dory-state-{processor_id}, data key "state"
S3 s3 S3 object + offline SQLite buffer key {prefix}/{processor_id}/state.json
Local local local file dory-state-{processor_id}.json under /data (fallback /tmp)
PVC pvc local file on a mounted PVC same file layout as Local

StateManager(backend="configmap" | StateBackend, config=None) exposes async methods:

await sm.save_state(processor_id, state, restart_count=0)
await sm.load_state(processor_id)            # -> dict | None
await sm.delete_state(processor_id)          # -> bool (golden reset)
await sm.sync_s3_buffer()                    # -> int (flush offline buffer)

Namespace and pod identity come from POD_NAMESPACE / POD_NAME.

ConfigMap backend requires the kubernetes package

Install the [production] extra. If kubernetes is missing, K8S_AVAILABLE is silently False and the failure only surfaces on the first ConfigMap operation — not at startup.

ConfigMap TTL is advisory only

ConfigMapStore writes a dory.io/state-ttl annotation (default 3600s). Nothing reaps it — the annotation is informational. Stale ConfigMaps are not garbage-collected by the SDK.

S3 backend

S3 configuration is purely env-driven via S3Config.from_env():

Variable Default / note
DORY_S3_BUCKET required
DORY_S3_PREFIX dory-state
DORY_S3_REGION / AWS_REGION region
DORY_S3_ENDPOINT_URL custom endpoint
AWS creds, DORY_S3_ROLE_ARN auth
DORY_S3_ENABLE_OFFLINE_BUFFER enable offline SQLite buffer
DORY_S3_BUFFER_PATH /data/dory-state-buffer.db

S3Store retries 3 times with ×2 backoff and buffers writes to a local SQLite file when offline; flush with sync_s3_buffer().

Local/PVC state is lost without a real volume

The Local backend falls back to /tmp when /data is unavailable. /tmp does not survive a pod restart — use a mounted PVC (the pvc backend reuses the same file code) for durable local state.

Serialization (checksum envelope)

StateSerializer wraps state in a metadata envelope with a SHA256 checksum:

StateSerializer.serialize(state, processor_id, pod_name, restart_count=0) -> str
StateSerializer.deserialize(data) -> dict   # validates checksum

deserialize raises DoryStateError on a checksum mismatch. Versioning is a loose string field state_version="1.0".

No StateFormatVersion enum, no serialize_state function

There is no StateFormatVersion enum and no module-level serialize_state helper. Use StateSerializer.serialize / .deserialize.

Size and timeout limits

Two layers of limits apply — the SDK is stricter than the orchestrator:

Constant SDK Orchestrator
Max state size DEFAULT_MAX_STATE_SIZE = 8 MB (warn at 75%) ORCHESTRATOR_MAX_STATE_SIZE = 10 MB
Capture timeout DEFAULT_CAPTURE_TIMEOUT_SEC = 25 s ORCHESTRATOR_STATE_TIMEOUT_SEC = 30 s
Restore timeout DEFAULT_RESTORE_TIMEOUT_SEC = 25 s

Validate size with:

validate_state_size(state_json, max_size=8_388_608, warn_threshold=0.75)

validate_state_size takes a string

The first argument is the serialized JSON string, not a byte count or a dict. It raises StateSizeExceeded when over the limit and warns at 75%.

Stay under the SDK budget

Keep state well under 8 MB and capture/restore under 25 s. The SDK limits are tighter than the orchestrator's, so passing the SDK checks keeps you within the orchestrator's window too.

The /state migration endpoint

During a pod migration, the orchestrator pulls state over HTTP from the source pod's /state endpoint. The response wraps get_state() in an ApplicationState envelope:

{
  "pod_name": "...",
  "app_name": "dory-processor",
  "captured_at": "...",
  "state_version": "1.0",
  "data": { "...": "your get_state() output" },
  "metrics": {},
  "connections": {},
  "active_sessions": {}
}

app_name comes from APP_NAME (default dory-processor). On restore, the SDK extracts state["data"].

/state requires bearer auth

The /state endpoint is protected by a bearer token read from DORY_STATE_TOKEN. The orchestrator must present a matching token; unauthenticated requests are rejected. Set the same token on the source pod and the orchestrator.

Golden snapshots

GoldenSnapshotManager(max_snapshots_per_processor=5) captures point-in-time snapshots independent of the live state:

  • capture / restore / list snapshots.
  • Each snapshot is gzip + base64 encoded with a checksum.
  • Stored through the StateManager under key {processor_id}-snapshot, so it inherits the backend's size limits (~1 MB ConfigMap / 8 MB SDK).
  • SIGUSR1 triggers a snapshot save (see Core Concepts).

Recovery decisions

On startup the SDK decides how to recover. Restart detection and strategy selection are separate steps; for the resilience primitives (circuit breakers, retries) see Resilience.

Restart detection

RestartDetector.detect() reads RESTART_COUNT (set by an init container) first, falling back to a /tmp marker file.

Strategy selection

RecoveryDecisionMaker(golden_image_threshold=3, rapid_restart_window_sec=60, max_backoff_sec=300):

decide(restart_count, is_migrating=False, state_valid=True, state_exists=False)
Strategy When
FIRST_START no prior state
RESTORE_STATE valid saved state exists
GOLDEN_IMAGE restart count ≥ threshold (3) — reset to clean image
GOLDEN_WITH_BACKOFF repeated rapid restarts; exponential backoff min(10 * 2^(n-3), 300) seconds

Golden-image reset and validation

  • GoldenImageManager resets at levels SOFT / MODERATE / FULL / FACTORY. FULL deletes all state; MODERATE/FACTORY are partly stubbed.
  • StateValidator.validate(state: dict) -> bool requires a dict, applies optional state_schema type checks, and rejects state carrying a __corrupted__ marker.
  • PartialRecoveryManager recovers state field-by-field when a full restore is not possible.

Rapid restarts trigger golden reset

Crossing golden_image_threshold (3) restarts moves recovery from RESTORE_STATE to GOLDEN_IMAGE, discarding saved state in favor of a clean start. Override on_rapid_restart_detected(restart_count) to observe or veto this — returning True (the default) allows the reset path.

Variable Default Purpose
DORY_STATE_BACKEND configmap backend selection
DORY_STARTUP_TIMEOUT_SEC 30 startup hook timeout
DORY_SHUTDOWN_TIMEOUT_SEC 30 shutdown hook timeout
DORY_HEALTH_PORT 8080 health/metrics/state port (0 = auto)
DORY_LOG_LEVEL INFO log level
DORY_STATE_TOKEN bearer auth for /state
POD_NAME pod identity / state keying
POD_NAMESPACE default namespace for ConfigMap state
RESTART_COUNT restart count (set by init container)
APP_NAME dory-processor app_name in the /state envelope
DORY_S3_BUCKET see above S3 backend config

See Configuration for the complete environment reference and Edge & Failover for how state interacts with fencing and role transitions.