Metrics¶
The Dory Orchestrator exports Prometheus metrics from pkg/metrics/prometheus.go, served at GET /metrics on port 8080 via promhttp.
See HTTP API Reference for the endpoint, Configuration for MetricsPort, and Deployment for the Prometheus scrape annotations on the dory-orchestrator-metrics Service.
Scrape target¶
The dory-orchestrator-metrics ClusterIP Service (port 8080) carries the Prometheus scrape annotations. Edge pods also reach the orchestrator at this Service for heartbeats.
Metric reference¶
| Name | Type | Labels | Meaning |
|---|---|---|---|
dory_applications_total |
Gauge | — | Apps in the config table. |
dory_database_query_duration_seconds |
HistogramVec | query_type (list_configs/get_config/update_pod_ip) |
DB query latency. |
dory_database_connections_active |
Gauge | — | Active DB connections. |
dory_database_errors_total |
CounterVec | error_type (connection/query/timeout) |
DB errors. |
dory_circuit_breaker_state |
GaugeVec | component (database) |
0=closed, 1=half-open, 2=open. |
dory_circuit_breaker_transitions_total |
CounterVec | component, from_state, to_state |
Circuit breaker transitions. |
dory_reconciliation_queue_depth |
Gauge | — | Pending reconciliations. |
dory_reconciliation_errors_total |
CounterVec | error_type (pod_creation/pod_deletion/scheduling/timeout/general) |
Reconcile errors. |
dory_pod_creation_errors_total |
CounterVec | reason (quota/scheduling/image_pull/resource) |
Pod create failures. |
dory_config_watcher_healthy |
Gauge | — | 1/0. |
dory_config_watcher_errors_total |
Counter | — | Watcher errors. |
dory_config_changes_total |
CounterVec | change_type (added/removed/modified) |
Config changes. |
dory_node_utilization_ratio |
GaugeVec | node |
Managed pods / max pods. |
dory_shutdown_duration_seconds |
HistogramVec | component |
Shutdown duration. |
dory_startup_duration_seconds |
Histogram | — | Startup duration. |
dory_component_healthy |
GaugeVec | component (database/kubernetes/watcher/reconciler) |
1/0. |
dory_pods_total |
GaugeVec | status (running/pending/terminating) |
Managed pods. |
dory_nodes_total |
Gauge | — | Cluster nodes. |
dory_scheduling_decisions_total |
CounterVec | decision (provision/migrate/decommission) |
Scheduling decisions. |
dory_reconciliation_duration_seconds |
Histogram | — | Reconcile time (buckets to 120s). |
dory_pod_startup_duration_seconds |
Histogram | — | Scheduling → Running (buckets to 120s). |
dory_edge_nodes_total |
GaugeVec | status (healthy/not_ready) |
Edge nodes. |
dory_failover_pods_active |
Gauge | — | Pods failed over edge → managed. |
dory_failover_events_total |
CounterVec | type (edge_to_managed/managed_to_edge) |
Failover events. |
dory_node_failures_total |
Counter | — | Edge node failures. |
dory_sdk_detected_total |
CounterVec | sdk_version, app_name |
Pods where the SDK was detected. |
dory_sdk_not_detected_total |
CounterVec | app_name |
Pods where the SDK was NOT detected. |
dory_health_poll_processors_up |
Gauge | — | Healthy processors in the last poll. |
dory_health_poll_processors_down |
Gauge | — | Unhealthy/unreachable processors in the last poll. |
dory_health_poll_duration_seconds |
Histogram | — | Poll cycle duration. |
dory_health_poll_errors_total |
Counter | — | Poll cycle errors. |
Note
The failover metrics (dory_edge_nodes_total, dory_failover_pods_active, dory_failover_events_total, dory_node_failures_total) correspond directly to the behavior described in Edge Failover.
Alert & recording rules¶
Alerting and recording rules live in deploy/prometheus-alerts.yaml (job label dory-orchestrator):
- Critical alerts: orchestrator down, database down, Kubernetes down, circuit breaker open.
- SLA expressions: reconciliation
120s, pod startup30s. - Recording rules: p95s plus success rates.