State Management
State management is the reason the Dory SDK exists: in-memory processor state is captured, serialized, and persisted so a pod can shut down, migrate to another node, or restart without losing work. This page covers how state round-trips, the backends, serialization, the migration endpoint, snapshots, and recovery decisions.
How stateful vars round-trip
stateful(...) class attributes (see Core Concepts) are the default state surface:
- get_state(self) -> dict returns the current values of all stateful vars.
- restore_state(self, state) writes them back on startup.
You can override both hooks to capture or rebuild state that does not live in stateful attributes (e.g. derived caches, external cursors).
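class P(BaseProcessor):
    counter = stateful(0)

    def get_state(self) -> dict:
        # Capture the stateful var plus a value derived from it.
        return {"counter": self.counter, "derived": self.counter * 2}

    async def restore_state(self, state) -> None:
        # "derived" is recomputed from counter, so only counter is restored.
        self.counter = state["counter"]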
State backends
The backend is selected by DORY_STATE_BACKEND (default configmap). The StateBackend enum:
| Backend | Value | Storage | Keying |
|---|---|---|---|
| ConfigMap (default) | configmap | Kubernetes ConfigMap | name dory-state-{processor_id}, data key "state" |
| S3 | s3 | S3 object + offline SQLite buffer | key {prefix}/{processor_id}/state.json |
| Local | local | local file | dory-state-{processor_id}.json under /data (fallback /tmp) |
| PVC | pvc | local file on a mounted PVC | same file layout as Local |
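The keying column can be read as three small naming rules. The sketch below restates them as plain functions; the function names are hypothetical helpers for illustration and are not part of the SDK.

```python
from pathlib import PurePosixPath


def configmap_name(processor_id: str) -> str:
    # ConfigMap backend: one ConfigMap per processor, state under data key "state".
    return f"dory-state-{processor_id}"


def s3_key(processor_id: str, prefix: str = "dory-state") -> str:
    # S3 backend: {prefix}/{processor_id}/state.json
    return f"{prefix.rstrip('/')}/{processor_id}/state.json"


def local_state_path(processor_id: str, base_dir: str = "/data") -> str:
    # Local and PVC backends share this file layout; only the directory differs.
    return str(PurePosixPath(base_dir) / f"dory-state-{processor_id}.json")
```

Passing base_dir="/tmp" to local_state_path mirrors the Local backend's fallback location.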
StateManager(backend="configmap" | StateBackend, config=None) exposes async methods:
await sm.save_state(processor_id, state, restart_count=0)
await sm.load_state(processor_id) # -> dict | None
await sm.delete_state(processor_id) # -> bool (golden reset)
await sm.sync_s3_buffer() # -> int (flush offline buffer)
Namespace and pod identity come from POD_NAMESPACE / POD_NAME.
ConfigMap backend requires the kubernetes package
Install the [production] extra. If kubernetes is missing, K8S_AVAILABLE is silently False and the failure only surfaces on the first ConfigMap operation — not at startup.
ConfigMap TTL is advisory only
ConfigMapStore writes a dory.io/state-ttl annotation (default 3600s). Nothing reaps it — the annotation is informational. Stale ConfigMaps are not garbage-collected by the SDK.
S3 backend
S3 configuration is purely env-driven via S3Config.from_env():
| Variable | Default / note |
|---|---|
| DORY_S3_BUCKET | required |
| DORY_S3_PREFIX | dory-state |
| DORY_S3_REGION / AWS_REGION | region |
| DORY_S3_ENDPOINT_URL | custom endpoint |
| AWS creds, DORY_S3_ROLE_ARN | auth |
| DORY_S3_ENABLE_OFFLINE_BUFFER | enable offline SQLite buffer |
| DORY_S3_BUFFER_PATH | /data/dory-state-buffer.db |
S3Store retries 3 times with ×2 backoff and buffers writes to a local SQLite file when offline; flush with sync_s3_buffer().
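To make the retry behavior concrete, here is a minimal sketch of a retry loop with a ×2 backoff. It is not S3Store's code: the function name, the 0.5 s starting delay, and the reading of "3 times" as three attempts in total are all assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    op: Callable[[], T],
    attempts: int = 3,        # assumption: 3 attempts in total
    base_delay: float = 0.5,  # assumption: the SDK's initial delay is not documented here
    factor: float = 2.0,      # the "x2" backoff
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return op()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(delay)
            delay *= factor
    raise RuntimeError("unreachable")
```

A caller that catches the final exception could then write to the offline buffer instead, which is the role the SQLite file plays in the description above.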
Local/PVC state is lost without a real volume
The Local backend falls back to /tmp when /data is unavailable. /tmp does not survive a pod restart — use a mounted PVC (the pvc backend reuses the same file code) for durable local state.
Serialization (checksum envelope)
StateSerializer wraps state in a metadata envelope with a SHA256 checksum:
StateSerializer.serialize(state, processor_id, pod_name, restart_count=0) -> str
StateSerializer.deserialize(data) -> dict # validates checksum
deserialize raises DoryStateError on a checksum mismatch. Versioning is a loose string field state_version="1.0".
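The checksum-envelope idea can be sketched with the standard library alone. This is an illustration of the technique, not StateSerializer itself: the function names, the envelope field names, and the canonical-JSON hashing are assumptions, and ValueError stands in for DoryStateError.

```python
import hashlib
import json


def _checksum(state: dict) -> str:
    # Hash a canonical JSON form so key order does not change the digest.
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def wrap_state(state: dict, processor_id: str, pod_name: str, restart_count: int = 0) -> str:
    envelope = {
        "metadata": {
            "processor_id": processor_id,
            "pod_name": pod_name,
            "restart_count": restart_count,
            "state_version": "1.0",
            "checksum": _checksum(state),
        },
        "state": state,
    }
    return json.dumps(envelope)


def unwrap_state(data: str) -> dict:
    envelope = json.loads(data)
    state = envelope["state"]
    if _checksum(state) != envelope["metadata"]["checksum"]:
        raise ValueError("state checksum mismatch")  # stand-in for DoryStateError
    return state
```

Verifying on read is what lets a corrupted or truncated payload be rejected before it reaches restore_state.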
No StateFormatVersion enum, no serialize_state function
There is no StateFormatVersion enum and no module-level serialize_state helper. Use StateSerializer.serialize / .deserialize.
Size and timeout limits
Two layers of limits apply — the SDK is stricter than the orchestrator:
| Constant | SDK | Orchestrator |
|---|---|---|
| Max state size | DEFAULT_MAX_STATE_SIZE = 8 MB (warn at 75%) | ORCHESTRATOR_MAX_STATE_SIZE = 10 MB |
| Capture timeout | DEFAULT_CAPTURE_TIMEOUT_SEC = 25 s | ORCHESTRATOR_STATE_TIMEOUT_SEC = 30 s |
| Restore timeout | DEFAULT_RESTORE_TIMEOUT_SEC = 25 s | — |
Validate the serialized state against these limits with validate_state_size.
validate_state_size takes a string
The first argument is the serialized JSON string, not a byte count or a dict. It raises StateSizeExceeded when over the limit and warns at 75%.
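The behavior described above (measure the serialized string, warn at 75%, raise over the limit) might look like the following stand-in. It is not the SDK's validate_state_size: the function and exception names are hypothetical, 8 MB is assumed to mean 8 × 1024 × 1024 bytes, and the size is assumed to be measured on the UTF-8 encoding.

```python
import warnings

MAX_STATE_SIZE = 8 * 1024 * 1024  # assumption: "8 MB" means 8 MiB
WARN_RATIO = 0.75


class StateTooLarge(Exception):
    """Stand-in for the SDK's StateSizeExceeded."""


def check_state_size(serialized: str, max_size: int = MAX_STATE_SIZE) -> int:
    # Measure the encoded payload, since that is what gets stored and transferred.
    size = len(serialized.encode("utf-8"))
    if size > max_size:
        raise StateTooLarge(f"{size} bytes exceeds limit of {max_size}")
    if size >= max_size * WARN_RATIO:
        warnings.warn(f"state is at {size / max_size:.0%} of the size limit")
    return size
```

Note that the argument is the serialized string, matching the admonition above; passing a dict or a byte count would not measure the right thing.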
Stay under the SDK budget
Keep state well under 8 MB and capture/restore under 25 s. The SDK limits are tighter than the orchestrator's, so passing the SDK checks keeps you within the orchestrator's window too.
The /state migration endpoint
During a pod migration, the orchestrator pulls state over HTTP from the source pod's /state endpoint. The response wraps get_state() in an ApplicationState envelope:
{
  "pod_name": "...",
  "app_name": "dory-processor",
  "captured_at": "...",
  "state_version": "1.0",
  "data": { "...": "your get_state() output" },
  "metrics": {},
  "connections": {},
  "active_sessions": {}
}
app_name comes from APP_NAME (default dory-processor). On restore, the SDK extracts state["data"].
/state requires bearer auth
The /state endpoint is protected by a bearer token read from DORY_STATE_TOKEN. The orchestrator must present a matching token; unauthenticated requests are rejected. Set the same token on the source pod and the orchestrator.
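For debugging a migration by hand, a client-side pull of /state could be sketched as follows. The helper names are hypothetical, and the plain-HTTP scheme, the GET method, and the standard "Authorization: Bearer <token>" header format are assumptions; port 8080 is the DORY_HEALTH_PORT default.

```python
import json
import urllib.request


def build_state_request(pod_host: str, token: str, port: int = 8080) -> urllib.request.Request:
    # Assumption: the token travels in a standard Authorization: Bearer header.
    return urllib.request.Request(
        f"http://{pod_host}:{port}/state",
        headers={"Authorization": f"Bearer {token}"},
    )


def extract_data(envelope_json: str) -> dict:
    # The processor's own state lives under "data" in the ApplicationState envelope.
    envelope = json.loads(envelope_json)
    return envelope["data"]


def fetch_state(pod_host: str, token: str, port: int = 8080, timeout: float = 30.0) -> dict:
    request = build_state_request(pod_host, token, port)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_data(response.read().decode("utf-8"))
```

extract_data mirrors what the SDK does on restore: only the "data" member is handed back to restore_state.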
Golden snapshots
GoldenSnapshotManager(max_snapshots_per_processor=5) captures point-in-time snapshots independent of the live state:
- capture / restore / list snapshots.
- Each snapshot is gzip + base64 encoded with a checksum.
- Stored through the StateManager under key {processor_id}-snapshot, so it inherits the backend's size limits (~1 MB ConfigMap / 8 MB SDK).
- SIGUSR1 triggers a snapshot save (see Core Concepts).
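The gzip + base64 + checksum encoding can be illustrated with a short round-trip. This is a sketch of the technique rather than GoldenSnapshotManager's format: the function names and dict keys are hypothetical, and SHA256 over the uncompressed JSON is an assumption.

```python
import base64
import gzip
import hashlib
import json


def encode_snapshot(state: dict) -> dict:
    raw = json.dumps(state, sort_keys=True).encode("utf-8")
    return {
        # Assumption: checksum covers the uncompressed JSON bytes.
        "checksum": hashlib.sha256(raw).hexdigest(),
        # gzip shrinks the payload; base64 makes it safe to store as text.
        "payload": base64.b64encode(gzip.compress(raw)).decode("ascii"),
    }


def decode_snapshot(snapshot: dict) -> dict:
    raw = gzip.decompress(base64.b64decode(snapshot["payload"]))
    if hashlib.sha256(raw).hexdigest() != snapshot["checksum"]:
        raise ValueError("snapshot checksum mismatch")
    return json.loads(raw)
```

Base64 inflates the compressed bytes by roughly a third, which is worth keeping in mind against the ~1 MB ConfigMap limit.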
Recovery decisions
On startup the SDK decides how to recover. Restart detection and strategy selection are separate steps; for the resilience primitives (circuit breakers, retries) see Resilience.
Restart detection
RestartDetector.detect() reads RESTART_COUNT (set by an init container) first, falling back to a /tmp marker file.
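The env-first, marker-file-second order can be sketched like this. It is not RestartDetector's code: the function name, the marker path, and the idea that the marker file holds an integer count are assumptions made for illustration.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Hypothetical marker location; the SDK's actual /tmp file name is not given here.
DEFAULT_MARKER = Path("/tmp/dory-restart-marker")


def detect_restart_count(
    env: Optional[Mapping[str, str]] = None,
    marker: Path = DEFAULT_MARKER,
) -> int:
    env = os.environ if env is None else env
    raw = env.get("RESTART_COUNT")
    if raw is not None:
        try:
            return max(int(raw), 0)  # the init container's value wins when valid
        except ValueError:
            pass  # unparsable value: fall through to the marker file
    if marker.exists():
        try:
            return max(int(marker.read_text().strip()), 0)
        except ValueError:
            return 1  # marker present but unreadable: assume at least one restart
    return 0
```

Because the fallback lives under /tmp, it only detects restarts where the container filesystem survives, which is why the environment variable is checked first.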
Strategy selection
RecoveryDecisionMaker(golden_image_threshold=3, rapid_restart_window_sec=60, max_backoff_sec=300):
| Strategy | When |
|---|---|
| FIRST_START | no prior state |
| RESTORE_STATE | valid saved state exists |
| GOLDEN_IMAGE | restart count ≥ threshold (3); reset to clean image |
| GOLDEN_WITH_BACKOFF | repeated rapid restarts; exponential backoff min(10 * 2^(n-3), 300) seconds |
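As a worked example of the backoff formula min(10 * 2^(n-3), 300), where n is the restart count, the helper below (a hypothetical name, not an SDK function) evaluates it directly. Clamping the exponent at zero for n below the threshold is an assumption; the table only defines the formula for rapid restarts at or past it.

```python
def golden_backoff_seconds(
    restart_count: int,
    threshold: int = 3,      # golden_image_threshold
    base: int = 10,
    max_backoff: int = 300,  # max_backoff_sec
) -> int:
    # Assumption: counts below the threshold are treated like the threshold itself.
    exponent = max(restart_count - threshold, 0)
    return min(base * 2 ** exponent, max_backoff)


# Restart counts 3..8 give 10, 20, 40, 80, 160, 300 seconds.
schedule = [golden_backoff_seconds(n) for n in range(3, 9)]
```

The cap takes effect at the eighth restart, where the uncapped value would be 320 seconds.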
Golden-image reset and validation
- GoldenImageManager resets at levels SOFT / MODERATE / FULL / FACTORY. FULL deletes all state; MODERATE / FACTORY are partly stubbed.
- StateValidator.validate(state: dict) -> bool requires a dict, applies optional state_schema type checks, and rejects state carrying a __corrupted__ marker.
- PartialRecoveryManager recovers state field-by-field when a full restore is not possible.
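The validation and field-by-field recovery rules above can be approximated in a few lines. These are stand-ins, not StateValidator or PartialRecoveryManager: the function names are hypothetical, state_schema is assumed to map field names to Python types, the marker is assumed to be detected by key presence, and missing fields are assumed to be tolerated by validation.

```python
from typing import Any, Mapping, Optional


def validate_state(state: Any, state_schema: Optional[Mapping[str, type]] = None) -> bool:
    if not isinstance(state, dict):
        return False
    if "__corrupted__" in state:  # assumption: presence of the key marks corruption
        return False
    for field, expected in (state_schema or {}).items():
        if field in state and not isinstance(state[field], expected):
            return False
    return True


def recover_partial(
    state: Any,
    state_schema: Mapping[str, type],
    defaults: Mapping[str, Any],
) -> dict:
    # Keep each field that passes its own type check; fall back to the default otherwise.
    recovered = dict(defaults)
    if isinstance(state, dict):
        for field, expected in state_schema.items():
            if field in state and isinstance(state[field], expected):
                recovered[field] = state[field]
    return recovered
```

With a schema of {"counter": int, "name": str}, a state whose counter is a string fails validate_state, while recover_partial still salvages a valid name.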
Rapid restarts trigger golden reset
When the restart count reaches golden_image_threshold (3), recovery moves from RESTORE_STATE to GOLDEN_IMAGE and discards saved state in favor of a clean start. Override on_rapid_restart_detected(restart_count) to observe or veto this; returning True (the default) allows the reset path.
State-related environment variables
| Variable | Default | Purpose |
|---|---|---|
| DORY_STATE_BACKEND | configmap | backend selection |
| DORY_STARTUP_TIMEOUT_SEC | 30 | startup hook timeout |
| DORY_SHUTDOWN_TIMEOUT_SEC | 30 | shutdown hook timeout |
| DORY_HEALTH_PORT | 8080 | health/metrics/state port (0 = auto) |
| DORY_LOG_LEVEL | INFO | log level |
| DORY_STATE_TOKEN | — | bearer auth for /state |
| POD_NAME | — | pod identity / state keying |
| POD_NAMESPACE | default | namespace for ConfigMap state |
| RESTART_COUNT | — | restart count (set by init container) |
| APP_NAME | dory-processor | app_name in the /state envelope |
| DORY_S3_BUCKET … | see above | S3 backend config |
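For tooling that needs the same defaults outside the SDK (a preflight check, for instance), the table can be mirrored in a small reader. The StateEnv class and load_state_env function are hypothetical; only the variable names and defaults come from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class StateEnv:
    backend: str
    health_port: int
    namespace: str
    app_name: str
    pod_name: Optional[str]
    state_token: Optional[str]


def load_state_env(env: Optional[Mapping[str, str]] = None) -> StateEnv:
    env = os.environ if env is None else env
    return StateEnv(
        backend=env.get("DORY_STATE_BACKEND", "configmap"),
        health_port=int(env.get("DORY_HEALTH_PORT", "8080")),
        namespace=env.get("POD_NAMESPACE", "default"),
        app_name=env.get("APP_NAME", "dory-processor"),
        pod_name=env.get("POD_NAME"),              # no default in the table
        state_token=env.get("DORY_STATE_TOKEN"),   # no default in the table
    )
```

Accepting a mapping instead of reading os.environ directly keeps the reader easy to exercise in tests.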
See Configuration for the complete environment reference and Edge & Failover for how state interacts with fencing and role transitions.