Metrics¶

The Dory Orchestrator exports Prometheus metrics from pkg/metrics/prometheus.go, served at GET /metrics on port 8080 via promhttp.

See HTTP API Reference for the endpoint, Configuration for MetricsPort, and Deployment for the Prometheus scrape annotations on the dory-orchestrator-metrics Service.

Scrape target¶

The dory-orchestrator-metrics ClusterIP Service (port 8080) carries the Prometheus scrape annotations. Edge pods also reach the orchestrator at this Service for heartbeats.

Metric reference¶

Name	Type	Labels	Meaning
`dory_applications_total`	Gauge	—	Apps in the config table.
`dory_database_query_duration_seconds`	HistogramVec	`query_type` (`list_configs`/`get_config`/`update_pod_ip`)	DB query latency.
`dory_database_connections_active`	Gauge	—	Active DB connections.
`dory_database_errors_total`	CounterVec	`error_type` (`connection`/`query`/`timeout`)	DB errors.
`dory_circuit_breaker_state`	GaugeVec	`component` (`database`)	0=closed, 1=half-open, 2=open.
`dory_circuit_breaker_transitions_total`	CounterVec	`component`, `from_state`, `to_state`	Circuit breaker transitions.
`dory_reconciliation_queue_depth`	Gauge	—	Pending reconciliations.
`dory_reconciliation_errors_total`	CounterVec	`error_type` (`pod_creation`/`pod_deletion`/`scheduling`/`timeout`/`general`)	Reconcile errors.
`dory_pod_creation_errors_total`	CounterVec	`reason` (`quota`/`scheduling`/`image_pull`/`resource`)	Pod create failures.
`dory_config_watcher_healthy`	Gauge	—	1/0.
`dory_config_watcher_errors_total`	Counter	—	Watcher errors.
`dory_config_changes_total`	CounterVec	`change_type` (`added`/`removed`/`modified`)	Config changes.
`dory_node_utilization_ratio`	GaugeVec	`node`	Managed pods / max pods.
`dory_shutdown_duration_seconds`	HistogramVec	`component`	Shutdown duration.
`dory_startup_duration_seconds`	Histogram	—	Startup duration.
`dory_component_healthy`	GaugeVec	`component` (`database`/`kubernetes`/`watcher`/`reconciler`)	1/0.
`dory_pods_total`	GaugeVec	`status` (`running`/`pending`/`terminating`)	Managed pods.
`dory_nodes_total`	Gauge	—	Cluster nodes.
`dory_scheduling_decisions_total`	CounterVec	`decision` (`provision`/`migrate`/`decommission`)	Scheduling decisions.
`dory_reconciliation_duration_seconds`	Histogram	—	Reconcile time (buckets to 120s).
`dory_pod_startup_duration_seconds`	Histogram	—	Scheduling → Running (buckets to 120s).
`dory_edge_nodes_total`	GaugeVec	`status` (`healthy`/`not_ready`)	Edge nodes.
`dory_failover_pods_active`	Gauge	—	Pods failed over edge → managed.
`dory_failover_events_total`	CounterVec	`type` (`edge_to_managed`/`managed_to_edge`)	Failover events.
`dory_node_failures_total`	Counter	—	Edge node failures.
`dory_sdk_detected_total`	CounterVec	`sdk_version`, `app_name`	Pods where the SDK was detected.
`dory_sdk_not_detected_total`	CounterVec	`app_name`	Pods where the SDK was NOT detected.
`dory_health_poll_processors_up`	Gauge	—	Healthy processors in the last poll.
`dory_health_poll_processors_down`	Gauge	—	Unhealthy/unreachable processors in the last poll.
`dory_health_poll_duration_seconds`	Histogram	—	Poll cycle duration.
`dory_health_poll_errors_total`	Counter	—	Poll cycle errors.

Note

The failover metrics (dory_edge_nodes_total, dory_failover_pods_active, dory_failover_events_total, dory_node_failures_total) correspond directly to the behavior described in Edge Failover.

Alert & recording rules¶

Alerting and recording rules live in deploy/prometheus-alerts.yaml (job label dory-orchestrator):

Critical alerts: orchestrator down, database down, Kubernetes down, circuit breaker open.
SLA expressions: reconciliation 120s, pod startup 30s.
Recording rules: p95s plus success rates.