Metrics

The Dory Orchestrator exports Prometheus metrics from pkg/metrics/prometheus.go, served at GET /metrics on port 8080 via promhttp.

See HTTP API Reference for the endpoint, Configuration for MetricsPort, and Deployment for the Prometheus scrape annotations on the dory-orchestrator-metrics Service.

Scrape target

The dory-orchestrator-metrics ClusterIP Service (port 8080) carries the Prometheus scrape annotations. Edge pods also reach the orchestrator at this Service for heartbeats.

Metric reference

| Name | Type | Labels | Meaning |
| --- | --- | --- | --- |
| dory_applications_total | Gauge | | Apps in the config table. |
| dory_database_query_duration_seconds | HistogramVec | query_type (list_configs/get_config/update_pod_ip) | DB query latency. |
| dory_database_connections_active | Gauge | | Active DB connections. |
| dory_database_errors_total | CounterVec | error_type (connection/query/timeout) | DB errors. |
| dory_circuit_breaker_state | GaugeVec | component (database) | 0=closed, 1=half-open, 2=open. |
| dory_circuit_breaker_transitions_total | CounterVec | component, from_state, to_state | Circuit breaker transitions. |
| dory_reconciliation_queue_depth | Gauge | | Pending reconciliations. |
| dory_reconciliation_errors_total | CounterVec | error_type (pod_creation/pod_deletion/scheduling/timeout/general) | Reconcile errors. |
| dory_pod_creation_errors_total | CounterVec | reason (quota/scheduling/image_pull/resource) | Pod create failures. |
| dory_config_watcher_healthy | Gauge | | 1 = healthy, 0 = unhealthy. |
| dory_config_watcher_errors_total | Counter | | Watcher errors. |
| dory_config_changes_total | CounterVec | change_type (added/removed/modified) | Config changes. |
| dory_node_utilization_ratio | GaugeVec | node | Managed pods / max pods. |
| dory_shutdown_duration_seconds | HistogramVec | component | Shutdown duration. |
| dory_startup_duration_seconds | Histogram | | Startup duration. |
| dory_component_healthy | GaugeVec | component (database/kubernetes/watcher/reconciler) | 1 = healthy, 0 = unhealthy. |
| dory_pods_total | GaugeVec | status (running/pending/terminating) | Managed pods. |
| dory_nodes_total | Gauge | | Cluster nodes. |
| dory_scheduling_decisions_total | CounterVec | decision (provision/migrate/decommission) | Scheduling decisions. |
| dory_reconciliation_duration_seconds | Histogram | | Reconcile time (buckets to 120s). |
| dory_pod_startup_duration_seconds | Histogram | | Time from scheduling to Running (buckets to 120s). |
| dory_edge_nodes_total | GaugeVec | status (healthy/not_ready) | Edge nodes. |
| dory_failover_pods_active | Gauge | | Pods failed over from edge to managed. |
| dory_failover_events_total | CounterVec | type (edge_to_managed/managed_to_edge) | Failover events. |
| dory_node_failures_total | Counter | | Edge node failures. |
| dory_sdk_detected_total | CounterVec | sdk_version, app_name | Pods where the SDK was detected. |
| dory_sdk_not_detected_total | CounterVec | app_name | Pods where the SDK was NOT detected. |
| dory_health_poll_processors_up | Gauge | | Healthy processors in the last poll. |
| dory_health_poll_processors_down | Gauge | | Unhealthy/unreachable processors in the last poll. |
| dory_health_poll_duration_seconds | Histogram | | Poll cycle duration. |
| dory_health_poll_errors_total | Counter | | Poll cycle errors. |
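
Tools that read dory_circuit_breaker_state need to turn the numeric sample back into a state. The sketch below mirrors the documented encoding (0=closed, 1=half-open, 2=open). The BreakerState type, the DecodeBreakerState helper, and the string names are hypothetical and are not taken from pkg/metrics/prometheus.go.

```go
package main

import "fmt"

// BreakerState mirrors the documented gauge encoding:
// 0=closed, 1=half-open, 2=open. Hypothetical type for illustration.
type BreakerState int

const (
	Closed   BreakerState = 0
	HalfOpen BreakerState = 1
	Open     BreakerState = 2
)

// String returns a human-readable name; the exact spellings are illustrative.
func (s BreakerState) String() string {
	switch s {
	case Closed:
		return "closed"
	case HalfOpen:
		return "half-open"
	case Open:
		return "open"
	default:
		return fmt.Sprintf("unknown(%d)", int(s))
	}
}

// DecodeBreakerState converts a scraped gauge sample into a BreakerState,
// rejecting fractional or out-of-range values.
func DecodeBreakerState(v float64) (BreakerState, error) {
	s := BreakerState(v)
	if float64(s) != v || s < Closed || s > Open {
		return 0, fmt.Errorf("unexpected circuit breaker value %v", v)
	}
	return s, nil
}

func main() {
	for _, v := range []float64{0, 1, 2, 3} {
		s, err := DecodeBreakerState(v)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%v -> %s\n", v, s)
	}
}
```

Rejecting unknown values rather than defaulting to closed keeps a malformed sample from being read as a healthy breaker.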

Note

The failover metrics (dory_edge_nodes_total, dory_failover_pods_active, dory_failover_events_total, dory_node_failures_total) correspond directly to the behavior described in Edge Failover.

Alert & recording rules

Alerting and recording rules live in deploy/prometheus-alerts.yaml (job label dory-orchestrator):

  • Critical alerts: orchestrator down, database down, Kubernetes down, circuit breaker open.
  • SLA expressions: reconciliation within 120s, pod startup within 30s.
  • Recording rules: p95 latencies plus success rates.