State Management¶

State management is the reason the Dory SDK exists: in-memory processor state is captured, serialized, and persisted so a pod can shut down, migrate to another node, or restart without losing work. This page covers how state round-trips, the backends, serialization, the migration endpoint, snapshots, and recovery decisions.

How `stateful` vars round-trip¶

stateful(...) class attributes (see Core Concepts) are the default state surface:

get_state(self) -> dict returns the current values of all stateful vars.
restore_state(self, state) writes them back on startup.

You can override both hooks to capture or rebuild state that does not live in stateful attributes (e.g. derived caches, external cursors).

class P(BaseProcessor):
    counter = stateful(0)

    def get_state(self) -> dict:
        return {"counter": self.counter, "derived": self.counter * 2}

    async def restore_state(self, state) -> None:
        self.counter = state["counter"]

State backends¶

The backend is selected by DORY_STATE_BACKEND (default configmap). The StateBackend enum:

Backend	Value	Storage	Keying
ConfigMap (default)	`configmap`	Kubernetes ConfigMap	name `dory-state-{processor_id}`, data key `"state"`
S3	`s3`	S3 object + offline SQLite buffer	key `{prefix}/{processor_id}/state.json`
Local	`local`	local file	`dory-state-{processor_id}.json` under `/data` (fallback `/tmp`)
PVC	`pvc`	local file on a mounted PVC	same file layout as Local

StateManager(backend="configmap" | StateBackend, config=None) exposes async methods:

await sm.save_state(processor_id, state, restart_count=0)
await sm.load_state(processor_id)            # -> dict | None
await sm.delete_state(processor_id)          # -> bool (golden reset)
await sm.sync_s3_buffer()                    # -> int (flush offline buffer)

Namespace and pod identity come from POD_NAMESPACE / POD_NAME.

ConfigMap backend requires the kubernetes package

Install the [production] extra. If kubernetes is missing, K8S_AVAILABLE is silently False and the failure only surfaces on the first ConfigMap operation — not at startup.

ConfigMap TTL is advisory only

ConfigMapStore writes a dory.io/state-ttl annotation (default 3600s). Nothing reaps it — the annotation is informational. Stale ConfigMaps are not garbage-collected by the SDK.

S3 backend¶

S3 configuration is purely env-driven via S3Config.from_env():

Variable	Default / note
`DORY_S3_BUCKET`	required
`DORY_S3_PREFIX`	`dory-state`
`DORY_S3_REGION` / `AWS_REGION`	region
`DORY_S3_ENDPOINT_URL`	custom endpoint
AWS creds, `DORY_S3_ROLE_ARN`	auth
`DORY_S3_ENABLE_OFFLINE_BUFFER`	enable offline SQLite buffer
`DORY_S3_BUFFER_PATH`	`/data/dory-state-buffer.db`

S3Store retries 3 times with ×2 backoff and buffers writes to a local SQLite file when offline; flush with sync_s3_buffer().

Local/PVC state is lost without a real volume

The Local backend falls back to /tmp when /data is unavailable. /tmp does not survive a pod restart — use a mounted PVC (the pvc backend reuses the same file code) for durable local state.

Serialization (checksum envelope)¶

StateSerializer wraps state in a metadata envelope with a SHA256 checksum:

StateSerializer.serialize(state, processor_id, pod_name, restart_count=0) -> str
StateSerializer.deserialize(data) -> dict   # validates checksum

deserialize raises DoryStateError on a checksum mismatch. Versioning is a loose string field state_version="1.0".

No StateFormatVersion enum, no serialize_state function

There is no StateFormatVersion enum and no module-level serialize_state helper. Use StateSerializer.serialize / .deserialize.

Size and timeout limits¶

Two layers of limits apply — the SDK is stricter than the orchestrator:

Constant	SDK	Orchestrator
Max state size	`DEFAULT_MAX_STATE_SIZE` = 8 MB (warn at 75%)	`ORCHESTRATOR_MAX_STATE_SIZE` = 10 MB
Capture timeout	`DEFAULT_CAPTURE_TIMEOUT_SEC` = 25 s	`ORCHESTRATOR_STATE_TIMEOUT_SEC` = 30 s
Restore timeout	`DEFAULT_RESTORE_TIMEOUT_SEC` = 25 s	—

Validate size with:

validate_state_size(state_json, max_size=8_388_608, warn_threshold=0.75)

validate_state_size takes a string

The first argument is the serialized JSON string, not a byte count or a dict. It raises StateSizeExceeded when over the limit and warns at 75%.

Stay under the SDK budget

Keep state well under 8 MB and capture/restore under 25 s. The SDK limits are tighter than the orchestrator's, so passing the SDK checks keeps you within the orchestrator's window too.

The `/state` migration endpoint¶

During a pod migration, the orchestrator pulls state over HTTP from the source pod's /state endpoint. The response wraps get_state() in an ApplicationState envelope:

{
  "pod_name": "...",
  "app_name": "dory-processor",
  "captured_at": "...",
  "state_version": "1.0",
  "data": { "...": "your get_state() output" },
  "metrics": {},
  "connections": {},
  "active_sessions": {}
}

app_name comes from APP_NAME (default dory-processor). On restore, the SDK extracts state["data"].

/state requires bearer auth

The /state endpoint is protected by a bearer token read from DORY_STATE_TOKEN. The orchestrator must present a matching token; unauthenticated requests are rejected. Set the same token on the source pod and the orchestrator.

Golden snapshots¶

GoldenSnapshotManager(max_snapshots_per_processor=5) captures point-in-time snapshots independent of the live state:

capture / restore / list snapshots.
Each snapshot is gzip + base64 encoded with a checksum.
Stored through the StateManager under key {processor_id}-snapshot, so it inherits the backend's size limits (~1 MB ConfigMap / 8 MB SDK).
SIGUSR1 triggers a snapshot save (see Core Concepts).

Recovery decisions¶

On startup the SDK decides how to recover. Restart detection and strategy selection are separate steps; for the resilience primitives (circuit breakers, retries) see Resilience.

Restart detection¶

RestartDetector.detect() reads RESTART_COUNT (set by an init container) first, falling back to a /tmp marker file.

Strategy selection¶

RecoveryDecisionMaker(golden_image_threshold=3, rapid_restart_window_sec=60, max_backoff_sec=300):

decide(restart_count, is_migrating=False, state_valid=True, state_exists=False)

Strategy	When
`FIRST_START`	no prior state
`RESTORE_STATE`	valid saved state exists
`GOLDEN_IMAGE`	restart count ≥ threshold (3) — reset to clean image
`GOLDEN_WITH_BACKOFF`	repeated rapid restarts; exponential backoff `min(10 * 2^(n-3), 300)` seconds

Golden-image reset and validation¶

GoldenImageManager resets at levels SOFT / MODERATE / FULL / FACTORY. FULL deletes all state; MODERATE/FACTORY are partly stubbed.
StateValidator.validate(state: dict) -> bool requires a dict, applies optional state_schema type checks, and rejects state carrying a __corrupted__ marker.
PartialRecoveryManager recovers state field-by-field when a full restore is not possible.

Rapid restarts trigger golden reset

Crossing golden_image_threshold (3) restarts moves recovery from RESTORE_STATE to GOLDEN_IMAGE, discarding saved state in favor of a clean start. Override on_rapid_restart_detected(restart_count) to observe or veto this — returning True (the default) allows the reset path.

Variable	Default	Purpose
`DORY_STATE_BACKEND`	`configmap`	backend selection
`DORY_STARTUP_TIMEOUT_SEC`	`30`	startup hook timeout
`DORY_SHUTDOWN_TIMEOUT_SEC`	`30`	shutdown hook timeout
`DORY_HEALTH_PORT`	`8080`	health/metrics/state port (`0` = auto)
`DORY_LOG_LEVEL`	`INFO`	log level
`DORY_STATE_TOKEN`	—	bearer auth for `/state`
`POD_NAME`	—	pod identity / state keying
`POD_NAMESPACE`	`default`	namespace for ConfigMap state
`RESTART_COUNT`	—	restart count (set by init container)
`APP_NAME`	`dory-processor`	`app_name` in the `/state` envelope
`DORY_S3_BUCKET` …	see above	S3 backend config

See Configuration for the complete environment reference and Edge & Failover for how state interacts with fencing and role transitions.