Monitoring

Monitor self-hosted Oz workers with OpenTelemetry metrics. Export to Prometheus, OTLP, or console to track worker health, task throughput, and saturation.

The oz-agent-worker daemon exports infrastructure-level metrics over OpenTelemetry, giving your team real-time visibility into worker health, task throughput, and capacity. Combine these metrics with the Oz dashboard for full observability across both the orchestration plane and your self-hosted compute.

  • Prometheus scrape — Expose a /metrics endpoint for Prometheus to scrape, with optional PodMonitor support for the Prometheus Operator.
  • OTLP push — Push metrics to any OpenTelemetry-compatible collector (Grafana Alloy, Datadog Agent, New Relic, etc.).
  • Standard configuration — Exporter selection uses the standard OpenTelemetry environment variables, so the worker integrates with your existing observability stack without custom configuration.
  • Pre-seeded series — All metric series appear at startup (before any tasks run), so dashboards and alerts can reference them immediately.

The worker uses the go.opentelemetry.io/contrib/exporters/autoexport package to select an exporter at runtime based on the OTEL_METRICS_EXPORTER environment variable. Supported values:

  • prometheus — Starts an in-process HTTP server serving /metrics.
  • otlp — Pushes metrics over OTLP (HTTP/protobuf by default).
  • console — Writes metrics to stdout (useful for debugging).
  • none — Disables metrics export entirely.
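For a quick local sanity check before wiring up a collector, the console exporter prints every metric batch to stdout. A minimal sketch (the worker ID is illustrative):

```shell
# Print metrics to stdout instead of exporting them — useful for
# verifying instrumentation locally before configuring a collector.
export OTEL_METRICS_EXPORTER=console
oz-agent-worker --api-key "$WARP_API_KEY" --worker-id "debug-worker"
```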

When OTEL_METRICS_EXPORTER is unset, the worker defaults to OTLP push targeting OTEL_EXPORTER_OTLP_ENDPOINT (which itself defaults to http://localhost:4318 for http/protobuf or http://localhost:4317 for grpc).

All metrics carry resource attributes (service.name=oz-agent-worker, service.version, worker.id, worker.backend) so each worker process shows up as a distinct series in your monitoring system.


Prometheus scrape

Set these environment variables before starting the worker to expose a Prometheus-compatible /metrics endpoint:

Terminal window
export OTEL_METRICS_EXPORTER=prometheus
export OTEL_EXPORTER_PROMETHEUS_HOST=0.0.0.0
export OTEL_EXPORTER_PROMETHEUS_PORT=9464
oz-agent-worker --api-key "$WARP_API_KEY" --worker-id "my-worker"

Verify the endpoint is serving metrics:

Terminal window
curl -s localhost:9464/metrics | grep oz_worker_

Expected outcome: You see oz_worker_connected, oz_worker_tasks_active, and other oz_worker_* metric families in the output.
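The output uses the Prometheus exposition format and looks roughly like the following (values and the exact set of families are illustrative and depend on worker state):

```
# TYPE oz_worker_connected gauge
oz_worker_connected 1
# TYPE oz_worker_tasks_active gauge
oz_worker_tasks_active 0
# TYPE oz_worker_tasks_claimed_total counter
oz_worker_tasks_claimed_total 0
```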


OTLP push

Set these environment variables to push metrics to an OpenTelemetry collector:

Terminal window
export OTEL_METRICS_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector.observability.svc:4318
oz-agent-worker --api-key "$WARP_API_KEY" --worker-id "my-worker"

The worker pushes metrics at the SDK’s default interval (60 seconds). Configure the collector endpoint, protocol, headers, and export interval using the standard OTLP exporter environment variables.
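For example, the standard OpenTelemetry SDK variables can tune push behavior; the token variable below is illustrative:

```shell
# Authenticate against the collector and shorten the export interval.
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer $COLLECTOR_TOKEN"
export OTEL_METRIC_EXPORT_INTERVAL=15000   # milliseconds; the SDK default is 60000
```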


Kubernetes deployment (Helm)

The Helm chart includes built-in support for metrics. Enable metrics with metrics.enabled=true:

Terminal window
helm install oz-agent-worker ./charts/oz-agent-worker \
  --namespace warp-oz \
  --set worker.workerId=my-worker \
  --set image.tag=VERSION \
  --set metrics.enabled=true

With metrics.enabled=true and the default metrics.exporter=prometheus, the chart adds:

  • A containerPort: metrics (default 9464) on the worker Deployment.
  • The OTEL_METRICS_EXPORTER, OTEL_EXPORTER_PROMETHEUS_HOST, and OTEL_EXPORTER_PROMETHEUS_PORT environment variables.
  • A namespace-scoped Service named <release>-oz-agent-worker-metrics with prometheus.io/scrape annotations.
  • Optionally, a PodMonitor (metrics.podMonitor.create=true) for clusters using the Prometheus Operator.
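To verify the endpoint from outside the cluster, you can port-forward the metrics Service the chart creates. A sketch, assuming the namespace from the install example above; substitute your release name:

```shell
# Forward the chart's metrics Service and confirm series are being served.
kubectl -n warp-oz port-forward svc/<release>-oz-agent-worker-metrics 9464:9464 &
curl -s localhost:9464/metrics | grep oz_worker_
```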

Chart values

Core:

  • metrics.enabled — Enable metrics export. Defaults to false.
  • metrics.exporter — Exporter type: prometheus (default), otlp, console, or none.
  • metrics.port — Port for the Prometheus exporter. Defaults to 9464. Ignored for otlp/console.
  • metrics.extraEnv — Extra environment variables for the worker container (e.g., OTEL_EXPORTER_OTLP_ENDPOINT).

Service (Prometheus scrape):

  • metrics.service.create — Create a metrics Service. Defaults to true.
  • metrics.service.type — Service type. Defaults to ClusterIP.
  • metrics.service.annotations — Annotations on the Service. Defaults include prometheus.io/scrape: "true".

PodMonitor (Prometheus Operator):

  • metrics.podMonitor.create — Create a PodMonitor. Defaults to false (avoids requiring monitoring.coreos.com CRDs).
  • metrics.podMonitor.interval — Scrape interval. Defaults to 30s.
  • metrics.podMonitor.scrapeTimeout — Scrape timeout. Defaults to 10s.
  • metrics.podMonitor.additionalLabels — Extra labels on the PodMonitor resource.
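For clusters running the Prometheus Operator, a values snippet along these lines enables the PodMonitor (the additional label is an illustrative example; match it to your Prometheus instance’s selector):

```yaml
metrics:
  enabled: true
  podMonitor:
    create: true
    interval: 30s
    scrapeTimeout: 10s
    additionalLabels:
      release: kube-prometheus-stack   # example: match your Prometheus selector
```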

To push metrics to an OTLP collector instead of exposing a Prometheus endpoint, set metrics.exporter=otlp and forward the endpoint via metrics.extraEnv:

metrics:
  enabled: true
  exporter: otlp
  extraEnv:
    - name: OTEL_EXPORTER_OTLP_ENDPOINT
      value: http://otel-collector.observability.svc:4318

Metrics reference

All metrics use the oz_worker_ prefix. Each worker process emits a distinct set of series, identified by the resource attributes service.name, service.version, worker.id, and worker.backend.

  • oz_worker_connected (gauge) — 1 while the worker has an active WebSocket connection to Oz’s backend, 0 otherwise.
  • oz_worker_tasks_active (gauge / UpDownCounter) — Tasks currently executing on this worker.
  • oz_worker_tasks_max_concurrent (gauge) — Configured concurrency limit (0 means unlimited).
  • oz_worker_tasks_claimed_total (counter) — Total tasks accepted since process start.
  • oz_worker_tasks_rejected_total{reason} (counter) — Tasks the worker declined (e.g., reason="at_capacity").
  • oz_worker_tasks_completed_total{result} (counter) — Completed tasks labeled result="succeeded" or result="failed".
  • oz_worker_task_duration_seconds{result} (histogram) — Wall-clock task duration on the worker, labeled by result.
  • oz_worker_websocket_reconnects_total{reason} (counter) — WebSocket reconnect attempts (e.g., reason="dial_failed", reason="remote_close"). Spikes indicate flapping workers.
  • oz_worker_info{version,backend,worker_id} (gauge, constant 1) — Build and runtime metadata. Useful for joining other series by labels.
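As an example of the info-metric join pattern, the following query attaches the build version to the active-task gauge. This is a sketch: it assumes both series come from the same scrape target and therefore share the instance label.

```promql
oz_worker_tasks_active
  * on (instance) group_left (version, worker_id)
oz_worker_info
```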

Example queries

Direct mappings for common operational questions:

  • Workers available:

    sum(oz_worker_connected)
  • Workers active (running at least one task):

    count(oz_worker_tasks_active > 0)
  • Fleet saturation:

    sum(oz_worker_tasks_active) / sum(oz_worker_tasks_max_concurrent > 0)

    This ratio is only meaningful when every worker has a non-zero oz_worker_tasks_max_concurrent. Workers configured with 0 (unlimited) are excluded from the denominator, which can make the saturation result look misleadingly high or undefined for fleets that mix bounded and unlimited workers.
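If the fleet mixes bounded and unlimited workers, one way to keep the ratio honest is to restrict the numerator to bounded workers as well. A sketch, assuming both series share the instance label:

```promql
sum(oz_worker_tasks_active and on (instance) (oz_worker_tasks_max_concurrent > 0))
/
sum(oz_worker_tasks_max_concurrent > 0)
```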

  • Task success rate (5-minute window):

    sum(rate(oz_worker_tasks_completed_total{result="succeeded"}[5m]))
    / sum(rate(oz_worker_tasks_completed_total[5m]))
  • Task duration p95:

    histogram_quantile(0.95, sum by (le) (rate(oz_worker_task_duration_seconds_bucket[5m])))
  • Failure rate:

    sum(rate(oz_worker_tasks_completed_total{result="failed"}[5m]))
  • Reconnect storms (alert threshold):

    sum(rate(oz_worker_websocket_reconnects_total[5m])) > 0.1
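Wired into a Prometheus alerting rule, the threshold above might look like this (the group name, for: duration, and labels are illustrative):

```yaml
groups:
  - name: oz-worker.rules
    rules:
      - alert: OzWorkerReconnectStorm
        expr: sum(rate(oz_worker_websocket_reconnects_total[5m])) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Oz workers are reconnecting frequently (possible flapping)"
```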

Disabling metrics

To fully disable metrics export, set OTEL_METRICS_EXPORTER=none:

Terminal window
export OTEL_METRICS_EXPORTER=none
oz-agent-worker --api-key "$WARP_API_KEY" --worker-id "my-worker"

Or in the Helm chart:

metrics:
  enabled: false