Resilience

The Dory SDK ships three resilience primitives that let a processor degrade gracefully instead of failing hard: circuit breakers that stop hammering a broken dependency, retry with backoff that absorbs transient faults, and an error classifier that decides what to do with an exception. BaseProcessor wires usable defaults for all three, so a processor can lean on them without any setup (see Core Concepts).

Circuit breakers

A circuit breaker wraps a call to an external dependency. While the dependency is healthy the breaker is closed and calls pass through. Once failures pile up it opens and fails fast — protecting both your processor and the struggling dependency — then probes for recovery before closing again.

States

| State | Meaning | Transition |
| --- | --- | --- |
| CLOSED ("closed") | Calls pass through normally. | Opens after failure_threshold consecutive failures. |
| OPEN ("open") | Calls fail fast with CircuitOpenError. | Moves to half-open after timeout_seconds elapses. |
| HALF_OPEN ("half_open") | A limited number of trial calls are allowed. | Closes after success_threshold successes; any failure re-opens it. |
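
The transitions in the table can be sketched as a small state machine. This is an illustration only, not the SDK's CircuitBreaker: the names SketchBreaker, record_success, record_failure, and clock are hypothetical, and half_open_max_calls and the async call() wrapper are left out to keep it short.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class SketchBreaker:
    """Illustrative state machine only; not the SDK's CircuitBreaker."""

    def __init__(self, failure_threshold=5, success_threshold=2,
                 timeout_seconds=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.timeout_seconds = timeout_seconds
        self.clock = clock  # injectable so the example can fake time
        self._state = State.CLOSED
        self.failures = 0
        self.successes = 0
        self.opened_at = None

    @property
    def state(self):
        # OPEN -> HALF_OPEN once timeout_seconds has elapsed.
        if (self._state is State.OPEN
                and self.clock() - self.opened_at >= self.timeout_seconds):
            self._state = State.HALF_OPEN
            self.successes = 0
        return self._state

    def record_success(self):
        if self.state is State.HALF_OPEN:
            self.successes += 1
            if self.successes >= self.success_threshold:
                self._state = State.CLOSED  # recovered
                self.failures = 0
        else:
            self.failures = 0  # a success ends the consecutive-failure run

    def record_failure(self):
        current = self.state
        if current is State.OPEN:
            return  # calls are rejected while open, nothing to count
        if current is State.HALF_OPEN:
            self._trip()  # any failure while probing re-opens
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._trip()

    def _trip(self):
        self._state = State.OPEN
        self.opened_at = self.clock()
        self.failures = 0
```

Injecting the clock is a design choice for the sketch: it lets you step through the timeout without sleeping.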

Configuration

CircuitBreakerConfig (and the matching CircuitBreaker constructor arguments):

| Field | Default | Purpose |
| --- | --- | --- |
| name | | Identifier used in stats and logs. |
| failure_threshold | 5 | Consecutive failures before opening. |
| success_threshold | 2 | Successes in half-open before closing. |
| timeout_seconds | 60.0 | How long to stay open before probing. |
| half_open_max_calls | 1 | Trial calls allowed while half-open. |
| on_state_change | None | Callback invoked on every state transition. |

Wrapping a call

call() is the wrapping primitive. It auto-detects whether the wrapped function is a coroutine or a sync function, runs it, and records the outcome. When the breaker is open it raises CircuitOpenError(circuit_name, next_attempt_time) immediately without invoking the function.

from dory.resilience import CircuitBreaker, CircuitOpenError

breaker = CircuitBreaker("payments", failure_threshold=3, timeout_seconds=30.0)

async def charge(amount):
    # charge_card is your own function that talks to the payment dependency.
    try:
        return await breaker.call(charge_card, amount)
    except CircuitOpenError as exc:
        # Breaker is open: fail fast. exc.next_attempt_time tells you when to retry.
        return None

Useful members: properties state, is_closed, is_open, is_half_open; get_stats() -> dict; and the async controls reset() (back to closed) and open() (force open).

The three breakers on BaseProcessor

Every processor gets self.circuit_breakers, a dict pre-populated with three breakers:

| Key | Intended dependency |
| --- | --- |
| "database" | Database / persistence calls |
| "external_api" | Outbound HTTP / RPC calls |
| "cache" | Cache backends |

class MyProcessor(BaseProcessor):
    async def process(self):
        rows = await self.circuit_breakers["database"].call(db.fetch, query)

Warning

These three breakers are constructed with failure_threshold=5, success_threshold=2, timeout_seconds=30.0, half_open_max_calls=3 — the timeout_seconds and half_open_max_calls values differ from the CircuitBreakerConfig dataclass defaults (60.0 and 1). If you depend on exact timing, read the breaker's get_stats() rather than assuming the dataclass defaults.

A CircuitBreakerRegistry (with a process-wide get_global_registry()) is available for sharing named breakers across components.

Retry with backoff

Transient faults — a dropped connection, a momentary 429 — are best absorbed by retrying with exponential backoff rather than tripping a breaker. The retry_with_backoff decorator works on both async and sync functions.

from dory.resilience import retry_with_backoff, RetryBudget

budget = RetryBudget(budget_percent=20.0)

@retry_with_backoff(max_attempts=4, initial_delay=0.5, budget=budget,
                    retryable_exceptions=(ConnectionError,))
async def call_api():
    ...

On exhaustion the decorator raises RetryExhaustedError(attempts, last_error).

RetryPolicy

The decorator's parameters mirror RetryPolicy:

| Field | Default | Purpose |
| --- | --- | --- |
| max_attempts | 3 | Total attempts (including the first). |
| initial_delay | 1.0 | Base delay in seconds. |
| max_delay | 30.0 | Upper bound on any single delay. |
| multiplier | 2.0 | Exponential growth factor. |
| jitter | True | Add randomized jitter to each delay. |
| retryable_exceptions | (Exception,) | Exceptions that trigger a retry. |
| non_retryable_exceptions | () | Exceptions that never retry (checked first). |
| on_retry | None | Callback invoked before each retry. |

The delay for an attempt is min(initial_delay * multiplier ** attempt, max_delay). Whether an exception is retryable is decided by is_retryable: non_retryable_exceptions always wins, then the retryable_exceptions check applies.
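
As a worked example of the formula and the precedence rule, here is a minimal sketch using the RetryPolicy defaults. The function names and parameter names are illustrative, and it assumes attempt is counted from 0, which the text above does not state.

```python
def backoff_delay(attempt, initial_delay=1.0, multiplier=2.0, max_delay=30.0):
    """Delay in seconds for a given attempt (assumed here to start at 0)."""
    return min(initial_delay * multiplier ** attempt, max_delay)


def is_retryable(error, retryable=(Exception,), non_retryable=()):
    """non_retryable is checked first, so it wins over retryable."""
    if isinstance(error, non_retryable):
        return False
    return isinstance(error, retryable)


# With the defaults the delay doubles until max_delay caps it.
delays = [backoff_delay(n) for n in range(6)]
print(delays)  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

Under that assumption the sixth delay would be 32 seconds uncapped, so max_delay holds it at 30.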

Jitter

Jitter is one-sided: when enabled, the computed delay is increased by a random amount between 0 and 25% of itself (delay + uniform(0, delay * 0.25)). This spreads retries from many processors so they do not all wake up at the same instant (the "thundering herd").
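
The jitter expression can be tried in isolation. This sketch uses a hypothetical helper, with_jitter, and a seeded generator purely so the example is repeatable; the SDK's own random source is not described here.

```python
import random


def with_jitter(delay, rng=random):
    """One-sided jitter: never shorter than `delay`, at most 25% longer."""
    return delay + rng.uniform(0, delay * 0.25)


rng = random.Random(42)  # seeded only to make the example repeatable
samples = [with_jitter(4.0, rng) for _ in range(5)]

# Every sample falls between the base delay and 125% of it.
print(all(4.0 <= s <= 5.0 for s in samples))
```

Because the jitter only ever adds time, the computed backoff acts as a floor rather than a midpoint.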

RetryBudget

A retry budget caps how much of your traffic is allowed to be retries, preventing a retry storm from amplifying load on an already-degraded dependency.

| Field | Default | Purpose |
| --- | --- | --- |
| budget_percent | 20.0 | Max retries as a percentage of requests. |
| window_seconds | 60.0 | Sliding window; counters auto-reset. |

can_retry() returns True while (retries / requests) * 100 <= budget_percent. Pass the budget to retry_with_backoff(budget=...).
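
To make the ratio check concrete, here is a rough sketch of a budget. It is not the SDK's RetryBudget: the class name, record_request, record_retry, and clock are hypothetical, the window is simplified to a fixed reset rather than a true sliding window, and allowing retries when no requests have been recorded is an assumption.

```python
import time


class SketchRetryBudget:
    """Illustrative only; the SDK's RetryBudget may count differently."""

    def __init__(self, budget_percent=20.0, window_seconds=60.0,
                 clock=time.monotonic):
        self.budget_percent = budget_percent
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the example can fake time
        self.requests = 0
        self.retries = 0
        self.window_start = clock()

    def _maybe_reset(self):
        # Simplification: start a fresh window once window_seconds has passed.
        if self.clock() - self.window_start >= self.window_seconds:
            self.requests = 0
            self.retries = 0
            self.window_start = self.clock()

    def record_request(self):
        self._maybe_reset()
        self.requests += 1

    def record_retry(self):
        self._maybe_reset()
        self.retries += 1

    def can_retry(self):
        self._maybe_reset()
        if self.requests == 0:
            return True  # assumption: with no traffic yet, allow the retry
        return (self.retries / self.requests) * 100 <= self.budget_percent
```

With 10 requests recorded, one retry is 10% of traffic and passes, while three retries is 30% and is refused until the window resets.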

Warning

When the budget is exhausted, the decorator does not raise RetryExhaustedError — it re-raises the original exception immediately instead of retrying. Handle the underlying exception type at the call site, not just RetryExhaustedError.
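
The two failure paths can be seen side by side in a self-contained sketch. Everything here is a stand-in: RetryExhaustedError is redefined locally, call_with_retries is a hypothetical loop that mimics the behavior described above without sleeping, and can_retry is a plain callable rather than a RetryBudget.

```python
class RetryExhaustedError(Exception):
    """Local stand-in for the SDK exception of the same name."""

    def __init__(self, attempts, last_error):
        super().__init__(f"gave up after {attempts} attempts: {last_error!r}")
        self.attempts = attempts
        self.last_error = last_error


def call_with_retries(func, max_attempts, can_retry):
    """Stand-in retry loop (no sleeping) with the two outcomes described above."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except ConnectionError as exc:
            if attempt == max_attempts:
                raise RetryExhaustedError(attempt, exc) from exc
            if not can_retry():
                raise  # budget exhausted: the original exception escapes


def flaky():
    raise ConnectionError("upstream reset")


def guarded(can_retry):
    # Handle both exception types at the call site.
    try:
        return call_with_retries(flaky, max_attempts=3, can_retry=can_retry)
    except RetryExhaustedError as exc:
        return f"exhausted after {exc.attempts}"
    except ConnectionError:
        return "budget denied retry"


print(guarded(lambda: True))   # exhausted after 3
print(guarded(lambda: False))  # budget denied retry
```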

Error classification

Rather than hard-coding except arms, let ErrorClassifier map an exception to a structured decision. BaseProcessor exposes self.error_classifier.

result = self.error_classifier.classify(error)
if result.retryable:
    ...  # safe to retry

classify(error) returns a ClassificationResult:

| Field | Type | Description |
| --- | --- | --- |
| error_type | ErrorType | Category of the error. |
| recommended_action | RecoveryAction | What to do about it. |
| retryable | bool | Whether a retry is sensible. |
| severity | str | "low", "medium", "high", or "critical". |
| details | dict | Additional context. |

Use classify_and_handle(error) to classify and log in one call.

ErrorType and RecoveryAction

ErrorType: TRANSIENT, PERMANENT, RESOURCE, EXTERNAL, LOGIC, UNKNOWN.

RecoveryAction: RETRY, CIRCUIT_BREAKER, BACKOFF, SCALE, GOLDEN_RESET, DEGRADE, ALERT, FAIL, LOG.

The classifier maps each type to a default action, and marks {TRANSIENT, EXTERNAL, RESOURCE} as retryable:

| ErrorType | Default RecoveryAction | Retryable |
| --- | --- | --- |
| TRANSIENT | RETRY | Yes |
| EXTERNAL | CIRCUIT_BREAKER | Yes |
| RESOURCE | BACKOFF | Yes |
| LOGIC | ALERT | No |
| PERMANENT | GOLDEN_RESET | No |
| UNKNOWN | LOG | No |

HTTP status codes are recognized heuristically: 500/502/503/504/429 classify as EXTERNAL; 400/401/403/404 classify as PERMANENT.
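
The default mapping and the status-code heuristic can be written out as plain data. This sketch uses local stand-in enums rather than the SDK's ErrorType and RecoveryAction (the member values are arbitrary), the function names are hypothetical, and treating any other status as UNKNOWN is an assumption.

```python
from enum import Enum, auto


class ErrorType(Enum):  # stand-in for the SDK enum
    TRANSIENT = auto()
    PERMANENT = auto()
    RESOURCE = auto()
    EXTERNAL = auto()
    LOGIC = auto()
    UNKNOWN = auto()


class RecoveryAction(Enum):  # stand-in; only the default actions are listed
    RETRY = auto()
    CIRCUIT_BREAKER = auto()
    BACKOFF = auto()
    ALERT = auto()
    GOLDEN_RESET = auto()
    LOG = auto()


DEFAULT_ACTION = {
    ErrorType.TRANSIENT: RecoveryAction.RETRY,
    ErrorType.EXTERNAL: RecoveryAction.CIRCUIT_BREAKER,
    ErrorType.RESOURCE: RecoveryAction.BACKOFF,
    ErrorType.LOGIC: RecoveryAction.ALERT,
    ErrorType.PERMANENT: RecoveryAction.GOLDEN_RESET,
    ErrorType.UNKNOWN: RecoveryAction.LOG,
}
RETRYABLE = {ErrorType.TRANSIENT, ErrorType.EXTERNAL, ErrorType.RESOURCE}


def classify_status(status):
    """Map an HTTP status code to an ErrorType using the heuristic above."""
    if status in (500, 502, 503, 504, 429):
        return ErrorType.EXTERNAL
    if status in (400, 401, 403, 404):
        return ErrorType.PERMANENT
    return ErrorType.UNKNOWN  # assumption: other codes are not specified


def decide(status):
    """Return (error type, default action, retryable) for a status code."""
    error_type = classify_status(status)
    return error_type, DEFAULT_ACTION[error_type], error_type in RETRYABLE
```

For instance, decide(503) yields EXTERNAL with CIRCUIT_BREAKER and retryable, while decide(404) yields PERMANENT with GOLDEN_RESET and not retryable.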

Custom registration

Map your own exception classes to a type so the classifier handles them correctly:

from dory.errors import ErrorClassifier, ErrorType

classifier = ErrorClassifier()
classifier.register_error_type(MyTimeoutError, ErrorType.TRANSIENT)

Use clear_error_type_registry() to reset. Module-level helpers classify_error(error) and is_retryable(error) are available for one-off use.

Error codes

The SDK defines stable error codes in the format E-<DOMAIN>-<NUMBER>, where domains include COR, STA, MIG, RET, CBR, ECL, GLD, REC, VAL, and PRC. Roughly 28 codes are predefined, for example:

| Code | Meaning |
| --- | --- |
| E-RET-001 | Retry budget exhausted |
| E-CBR-001 | Circuit breaker is OPEN |
| E-STA-003 | State corruption detected |

ErrorCodeRegistry supports register, get, search, list_by_domain, and all for looking codes up programmatically.
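
If you need to pull the domain out of a code yourself, for example to group log lines, the E-<DOMAIN>-<NUMBER> format is easy to parse. This is a sketch with a hypothetical helper, parse_error_code; it assumes uppercase-letter domains and decimal numbers, as in the examples above, and does not use ErrorCodeRegistry.

```python
import re

# Assumption: domain is uppercase letters, number is decimal digits.
CODE_RE = re.compile(r"^E-([A-Z]+)-(\d+)$")


def parse_error_code(code):
    """Split 'E-RET-001' into ('RET', 1); raise ValueError on a bad format."""
    match = CODE_RE.match(code)
    if match is None:
        raise ValueError(f"not an E-<DOMAIN>-<NUMBER> code: {code!r}")
    return match.group(1), int(match.group(2))


print(parse_error_code("E-RET-001"))  # ('RET', 1)
```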

Putting it together

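A typical defensive call layers the primitives: retry transient failures within budget, let the breaker shed load when a dependency is genuinely down, and classify whatever error finally escapes. The snippet shows the first two layers, with the retry decorator wrapped around a breaker call.

class MyProcessor(BaseProcessor):
    # Retry wraps the breaker call, so the breaker records every attempt.
    @retry_with_backoff(max_attempts=3, retryable_exceptions=(ConnectionError,))
    async def query(self, sql):
        return await self.circuit_breakers["database"].call(db.fetch, sql)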

See Configuration for the environment variables that tune these defaults and API Reference for full signatures.