Self-hosting troubleshooting

Diagnose and fix common problems with self-hosted Oz worker daemons across Docker, Kubernetes, and Direct backends.

Diagnostic guides for the oz-agent-worker daemon and its task execution. Use this page when a worker won’t start, won’t connect, tasks stay queued, or tasks fail.


Worker won’t start (Docker backend)

Cause: Docker isn’t running, or the daemon platform isn’t supported.

Fix:

  1. Verify Docker is running: docker info.
  2. Confirm the daemon platform is linux/amd64 or linux/arm64. Windows containers are not supported.
  3. If the worker itself runs inside Docker, confirm the /var/run/docker.sock mount is correct and that the user inside the container has permission on the socket. See the sketch after this list.
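
These checks can be run from a shell on the worker host; a minimal sketch, assuming the default socket path (the docker:cli image here is just a convenient client for the socket test):

```
# Fail fast if the Docker daemon isn't running.
docker info

# Confirm the daemon platform (expect linux/amd64 or linux/arm64).
docker version --format '{{.Server.Os}}/{{.Server.Arch}}'

# If the worker runs inside Docker, check the socket mount and its permissions.
ls -l /var/run/docker.sock
docker run --rm -v /var/run/docker.sock:/var/run/docker.sock \
  docker:cli docker version --format '{{.Server.Version}}'
```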

Worker won’t start (Kubernetes backend)

Cause: The startup preflight Job failed. Common reasons include insufficient RBAC, restrictive Pod Security policies, or an unreachable Kubernetes API server.

Fix:

  1. Check the worker logs for the preflight diagnostic message.
  2. Confirm the worker’s namespace grants these permissions: create, get, list, watch, delete on jobs; get, list, watch on pods; get on pods/log; list on events. A matching Role is sketched after this list.
  3. Confirm the task namespace allows pods with a root init container (required for sidecar materialization).
  4. If your cluster restricts image sources, set preflight_image in the worker config to an allowlisted image (default is busybox:1.36).
  5. To pull the preflight image from a private registry, configure imagePullSecrets in pod_template — these secrets also apply to the preflight Job.
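
As a sketch, a Role granting exactly the verbs listed in step 2. The namespace and resource names are illustrative, and the worker’s ServiceAccount still needs a RoleBinding to this Role:

```
# Apply a Role matching the permissions listed above (namespace is illustrative).
cat <<'EOF' | kubectl apply -n oz-worker -f -
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: oz-worker
rules:
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["create", "get", "list", "watch", "delete"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources: ["pods/log"]
  verbs: ["get"]
- apiGroups: [""]
  resources: ["events"]
  verbs: ["list"]
EOF
```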

Worker won’t start (Direct backend)

Cause: The oz CLI isn’t installed or isn’t on the worker’s PATH.

Fix:

  1. Install the Oz CLI on the worker host. See Installing the CLI.
  2. If the CLI isn’t on PATH, set oz_path in the config file to the absolute path of the oz binary, as shown below.
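
A quick check, with the config line sketched as a comment (oz_path is the key named above; the surrounding config syntax may differ in your setup):

```
# Confirm the CLI resolves from the worker's environment (run as the worker's user).
command -v oz || echo "oz not found on PATH"

# If it isn't on PATH, set oz_path in the worker config to the binary's
# absolute path, e.g. (illustrative syntax):
#   oz_path: /usr/local/bin/oz
```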

Worker won’t connect

Cause: The API key is invalid, expired, or the host cannot reach Oz’s backend.

Fix:

  1. Confirm your API key is correct, not expired, and has team scope.
  2. Regenerate the API key in Settings > Cloud platform > Oz Cloud API Keys if you suspect it’s invalid.
  3. Ensure the host has outbound internet access to oz.warp.dev:443 (see the connectivity checks after this list).
  4. Check that no firewall rules are blocking WebSocket connections to wss://oz.warp.dev.
  5. Increase log verbosity with --log-level debug to see connection details.
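
Two quick connectivity checks from the worker host; these confirm TCP and TLS reachability only, not the WebSocket upgrade itself:

```
# TLS handshake against the backend; a completed handshake rules out basic blocks.
curl -sv https://oz.warp.dev -o /dev/null 2>&1 | grep -iE 'connected|tls|ssl'

# Bare TCP check if curl is unavailable.
nc -vz oz.warp.dev 443
```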

See Security and networking for the full list of outbound endpoints the worker needs.


Tasks stay queued

Cause: The worker isn’t running, the --host value doesn’t match the worker’s --worker-id, or the worker and task belong to different teams.

Fix:

  1. Confirm the worker is running and connected. Check the worker logs for Listening for tasks or similar.
  2. Verify the --host (or worker_host) value you passed matches the worker’s --worker-id exactly; the comparison is case-sensitive. See the schematic after this list.
  3. Ensure the worker’s team matches the team creating the task.
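
Schematically (the two flags are the ones named above; the surrounding commands and the ID are placeholders, not real invocations):

```
# Worker side: the ID this daemon registers under.
oz-agent-worker --worker-id build-farm-1 ...

# Task side: --host must match --worker-id exactly, including case.
<task-create-command> --host build-farm-1 ...
```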

Metrics aren’t appearing

Cause: The worker is running, but metrics aren’t showing up in Prometheus or your collector.

Fix:

  1. Verify OTEL_METRICS_EXPORTER is set correctly on the worker process. For prometheus mode, run curl -s localhost:9464/metrics on the worker host to confirm the endpoint is serving (condensed checks follow this list).
  2. For Prometheus scrape mode, confirm the bind address is 0.0.0.0 (not localhost) when running in Docker or Kubernetes. localhost is only reachable from inside the container.
  3. Confirm no firewall or network policy blocks the metrics port (default 9464).
  4. For OTLP push mode, verify OTEL_EXPORTER_OTLP_ENDPOINT points to a reachable collector and that the protocol matches (http/protobuf vs grpc).
  5. When using the Helm chart, confirm metrics.enabled=true is set. Check that the Service and (optionally) PodMonitor were created: kubectl get svc,podmonitor -n <namespace>.
  6. If using metrics.podMonitor.create=true, verify the monitoring.coreos.com CRDs are installed in the cluster. The PodMonitor resource requires the Prometheus Operator.
  7. Restart the worker with --log-level debug and look for metrics-related error messages at startup.
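
Condensed checks for both export modes; the OTEL_* variables are standard OpenTelemetry settings, and the collector endpoint, ports, and namespace are illustrative:

```
# Prometheus pull mode: the worker should serve metrics on port 9464 by default.
curl -s localhost:9464/metrics | head

# OTLP push mode: the endpoint must be reachable and the protocol must match.
export OTEL_METRICS_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318   # 4317 for grpc
export OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf                # or grpc

# Helm chart: confirm the metrics Service and (optionally) PodMonitor exist.
kubectl get svc,podmonitor -n <namespace>
```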

See Monitoring for the full setup guide.


Tasks fail

Cause: Varies by backend. Start with the diagnostic steps common to all backends, then follow the backend-specific checks.

Fix (all backends):

  1. Review task logs in the Oz dashboard or via session sharing.
  2. Use --no-cleanup to keep the container, Job, or workspace around for inspection after failure.
  3. Use --log-level debug to see detailed execution logs.
  4. Ensure the worker machine or cluster has sufficient resources (CPU, memory, disk).

Fix (Docker backend):

  1. Verify Docker is running (docker info).
  2. If using a custom image, confirm it is glibc-based (not Alpine/musl) and that its architecture matches the worker’s Docker daemon platform. See the check below.
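
A quick way to check both properties of a custom image; <image> is a placeholder, and the libc check is a heuristic that assumes the image ships a shell:

```
# Compare the image's architecture with the daemon's.
docker image inspect <image> --format '{{.Os}}/{{.Architecture}}'
docker version --format '{{.Server.Os}}/{{.Server.Arch}}'

# Heuristic libc check: a musl loader indicates an Alpine/musl image.
docker run --rm --entrypoint sh <image> -c \
  'ls /lib/ld-musl-* 2>/dev/null && echo "musl-based (unsupported)" || echo "no musl loader found"'
```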

Fix (Kubernetes backend):

  1. Check task Job and Pod status: kubectl get jobs,pods -n <namespace>.
  2. Common issues:
    • Unschedulable pods — Check node selectors, tolerations, and resource requests in pod_template.
    • Image pull failures — Check imagePullSecrets in pod_template.
    • Admission policy rejections — Review Pod Security Standards, OPA Gatekeeper, Kyverno, or similar admission controllers.
  3. The worker fails a task early if its pod remains unschedulable beyond unschedulable_timeout (default 30s). Raise the timeout or fix the scheduling issue, as shown below.
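
To see why a pod is stuck Pending, and to give the scheduler more time (unschedulable_timeout is the key named above; the config syntax and value are illustrative):

```
# Show the scheduler's events for a Pending pod.
kubectl describe pod <pod-name> -n <namespace> | sed -n '/Events:/,$p'

# Worker config: allow more time before the worker fails an unschedulable
# task, e.g. (illustrative syntax):
#   unschedulable_timeout: 120s
```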

Fix (Direct backend):

  1. Verify the Oz CLI is accessible.
  2. Verify the workspace root directory is writable by the user running the worker.

Image pulls fail

Cause: The task image can’t be pulled from its registry, most often because credentials for a private registry are missing.

Fix (Docker backend):

  1. If using a private registry, ensure Docker credentials are available to the worker. See Private Docker registries.
  2. Try pulling the image manually on the worker host: docker pull <image>.

Fix (Kubernetes backend):

  1. Configure imagePullSecrets in the pod_template section of your worker config, as sketched below.
  2. Verify the Secret exists in the task namespace and contains valid credentials.
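
One way to wire this up; the Secret name and registry are placeholders, and the pod_template syntax shown is illustrative:

```
# Create the registry credential Secret in the task namespace.
kubectl create secret docker-registry regcred \
  --docker-server=registry.example.com \
  --docker-username=<user> \
  --docker-password=<password> \
  -n <task-namespace>

# Reference it from the worker config (illustrative syntax):
#   pod_template:
#     spec:
#       imagePullSecrets:
#       - name: regcred
```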

Fix (all backends):

  • Verify the image exists and the tag is correct.
  • Check network connectivity from the worker or cluster to the registry.