# Self-hosting troubleshooting
Diagnostic guides for the `oz-agent-worker` daemon and its task execution. Use this page when a worker won't start, won't connect, tasks stay queued, or tasks fail.

:::note
The steps below apply to the [managed architecture](/agent-platform/cloud-agents/self-hosting/#managed-architecture) (`oz-agent-worker` daemon). For [unmanaged](/agent-platform/cloud-agents/self-hosting/unmanaged/) deployments, refer to the documentation for the environment running `oz agent run` (e.g., GitHub Actions, Kubernetes).
:::
---

## Worker won't start
### Docker backend

**Cause:** Docker isn't running, or the daemon platform isn't supported.

**Fix:**

1. Verify Docker is running: `docker info`.
2. Confirm the daemon platform is `linux/amd64` or `linux/arm64`. Windows containers are not supported.
3. If the worker runs inside Docker, confirm the `/var/run/docker.sock` mount is correct and the mounting user has permission to the socket.
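The Docker checks above can be scripted. A minimal sketch for a Linux worker host (note that the daemon reports `x86_64`/`aarch64`, which correspond to `amd64`/`arm64`):

```shell
# Report the daemon's platform, or print a hint if the daemon is unreachable.
if docker info --format '{{.OSType}}/{{.Architecture}}' 2>/dev/null; then
  echo "Docker daemon is reachable"
else
  echo "Docker daemon not reachable: start Docker or check DOCKER_HOST"
fi

# If the worker runs inside Docker, verify the socket mount and its permissions.
ls -l /var/run/docker.sock 2>/dev/null || echo "/var/run/docker.sock is not present"
```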
### Kubernetes backend

**Cause:** The startup preflight Job failed. Common reasons include insufficient RBAC, restrictive Pod Security policies, or an unreachable Kubernetes API server.

**Fix:**

1. Check the worker logs for the preflight diagnostic message.
2. Confirm the worker's namespace has these permissions: `create`, `get`, `list`, `watch`, `delete` on `jobs`; `get`, `list`, `watch` on `pods`; `get` on `pods/log`; `list` on `events`.
3. Confirm the task namespace allows pods with a **root init container** (required for sidecar materialization).
4. If your cluster restricts image sources, set `preflight_image` in the worker config to an allowlisted image (default is `busybox:1.36`).
5. To pull the preflight image from a private registry, configure `imagePullSecrets` in `pod_template`; these secrets also apply to the preflight Job.
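The RBAC permissions listed above can be verified with `kubectl auth can-i`. A sketch; the namespace and service-account names are placeholders, substitute your own:

```shell
if command -v kubectl >/dev/null 2>&1; then
  NS=oz-worker                                      # placeholder namespace
  SA=system:serviceaccount:oz-worker:oz-worker-sa   # placeholder service account
  for check in "create jobs" "get jobs" "list jobs" "watch jobs" "delete jobs" \
               "get pods" "list pods" "watch pods" "get pods/log" "list events"; do
    verb=${check%% *}; resource=${check#* }
    printf '%-8s %-10s ' "$verb" "$resource"
    kubectl auth can-i "$verb" "$resource" -n "$NS" --as "$SA"
  done
else
  echo "kubectl not found on PATH"
fi
```

Any `no` in the output points at the missing Role or RoleBinding.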
### Direct backend

**Cause:** The `oz` CLI isn't installed or isn't on the worker's `PATH`.

**Fix:**

1. Install the Oz CLI on the worker host. See [Installing the CLI](/reference/cli/#installing-the-cli).
2. If the CLI isn't on `PATH`, set `oz_path` in the config file to the absolute path of the `oz` binary.
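A quick way to tell which of the two cases above applies, sketched for a POSIX shell:

```shell
# Is the oz binary resolvable from the worker's environment?
if command -v oz >/dev/null 2>&1; then
  echo "oz resolves to: $(command -v oz)"
else
  echo "oz is not on PATH: install the CLI or set oz_path in the worker config"
fi
```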
---

## Worker won't connect

**Cause:** The API key is invalid, expired, or the host cannot reach Oz's backend.

**Fix:**

1. Confirm your API key is correct, not expired, and has team scope.
2. Regenerate the API key in **Settings** > **Cloud platform** > **Oz Cloud API Keys** if you suspect it's invalid.
3. Ensure the host has outbound internet access to `oz.warp.dev:443`.
4. Check that no firewall rules are blocking WebSocket connections to `wss://oz.warp.dev`.
5. Increase log verbosity with `--log-level debug` to see connection details.

See [Security and networking](/agent-platform/cloud-agents/self-hosting/security-and-networking/#network-requirements) for the full list of outbound endpoints the worker needs.
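The reachability checks above can be spot-checked from the worker host with `curl`. A sketch; a successful response only proves TCP and TLS connectivity, it does not validate the API key or the WebSocket upgrade:

```shell
# TCP + TLS reachability of the backend endpoint.
if curl --silent --max-time 10 --output /dev/null https://oz.warp.dev/; then
  echo "oz.warp.dev:443 is reachable over TLS"
else
  echo "cannot reach oz.warp.dev:443: check firewall, proxy, and DNS"
fi
```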
---

## Tasks not being picked up

**Cause:** The worker isn't running, the `--host` value doesn't match the worker's `--worker-id`, or the worker and task belong to different teams.

**Fix:**

1. Confirm the worker is running and connected. Check the worker logs for `Listening for tasks` or similar.
2. Verify the `--host` (or `worker_host`) value you passed matches your `--worker-id` exactly; the comparison is case-sensitive.
3. Ensure the worker's team matches the team creating the task.
---

## Metrics not appearing

**Cause:** The worker is running but metrics aren't showing up in Prometheus or your collector.

**Fix:**

1. Verify `OTEL_METRICS_EXPORTER` is set correctly on the worker process. Run `curl -s localhost:9464/metrics` from the worker host (for `prometheus` mode) to confirm the endpoint is serving.
2. For Prometheus scrape mode, confirm the bind address is `0.0.0.0` (not `localhost`) when running in Docker or Kubernetes. `localhost` is only reachable from inside the container.
3. Confirm no firewall or network policy blocks the metrics port (default `9464`).
4. For OTLP push mode, verify `OTEL_EXPORTER_OTLP_ENDPOINT` points to a reachable collector and that the protocol matches (`http/protobuf` vs `grpc`).
5. When using the Helm chart, confirm `metrics.enabled=true` is set. Check that the `Service` and (optionally) `PodMonitor` were created: `kubectl get svc,podmonitor -n <namespace>`.
6. If using `metrics.podMonitor.create=true`, verify the `monitoring.coreos.com` CRDs are installed in the cluster. The `PodMonitor` resource requires the Prometheus Operator.
7. Restart the worker with `--log-level debug` and look for metrics-related error messages at startup.

See [Monitoring](/agent-platform/cloud-agents/self-hosting/monitoring/) for the full setup guide.
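The endpoint and environment checks above can be sketched as a single script run on the worker host (`9464` is the documented default port; adjust if you overrode it):

```shell
# Prometheus scrape mode: does the endpoint answer with any metric lines?
if curl --silent --max-time 5 http://localhost:9464/metrics | grep -q '^[a-zA-Z#]'; then
  echo "metrics endpoint is serving"
else
  echo "no response on :9464: check OTEL_METRICS_EXPORTER and the bind address"
fi

# OTLP push mode: confirm the collector endpoint is actually set for this process.
echo "OTEL_EXPORTER_OTLP_ENDPOINT=${OTEL_EXPORTER_OTLP_ENDPOINT:-<unset>}"
```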
---

## Task failures

**Cause:** Causes vary by backend. Start with the diagnostic steps common to all backends, then follow the backend-specific checks.

**Fix (all backends):**

1. Review task logs in the [Oz dashboard](https://oz.warp.dev) or via [session sharing](/agent-platform/local-agents/session-sharing/).
2. Use `--no-cleanup` to keep the container, Job, or workspace around for inspection after failure.
3. Use `--log-level debug` to see detailed execution logs.
4. Ensure the worker machine or cluster has sufficient resources (CPU, memory, disk).
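For the resource check above, a quick host snapshot on a Linux worker (macOS fallbacks included):

```shell
df -h /                                                          # free disk on root
free -h 2>/dev/null || vm_stat 2>/dev/null || echo "memory stats unavailable"
nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo "cpu count unavailable"
```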
### Docker backend (task failures)

1. Verify Docker is running (`docker info`).
2. If using a custom image, confirm it is **glibc-based** (not Alpine/musl) and that its architecture matches the worker's Docker daemon platform.
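The image checks above can be sketched with standard Docker commands. The image name is a placeholder, and the `ldd` probe assumes the image ships `ldd` at all; glibc-based images typically identify themselves as "GNU libc" here, while musl-based (Alpine) images do not:

```shell
IMG=registry.example.com/my-task-image:latest   # placeholder image name

if command -v docker >/dev/null 2>&1; then
  # Image platform vs. the daemon's platform (x86_64 == amd64, aarch64 == arm64).
  docker image inspect --format '{{.Os}}/{{.Architecture}}' "$IMG"
  docker info --format '{{.OSType}}/{{.Architecture}}'
  # Probe the image's libc.
  docker run --rm --entrypoint ldd "$IMG" --version 2>&1 | head -n 1
else
  echo "docker CLI not found on PATH"
fi
```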
### Kubernetes backend (task failures)

1. Check task Job and Pod status: `kubectl get jobs,pods -n <namespace>`.
2. Common issues:
   * **Unschedulable pods:** Check node selectors, tolerations, and resource requests in `pod_template`.
   * **Image pull failures:** Check `imagePullSecrets` in `pod_template`.
   * **Admission policy rejections:** Review Pod Security Standards, OPA Gatekeeper, Kyverno, or similar admission controllers.
3. The worker fails a task early if its pod remains unschedulable beyond `unschedulable_timeout` (default `30s`). Raise the timeout or fix the scheduling issue.
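When a pod is stuck in `Pending`, the scheduler's reasoning usually appears in the events. A sketch of the lookup above; the namespace is a placeholder:

```shell
NS=oz-tasks   # placeholder: your task namespace

if command -v kubectl >/dev/null 2>&1; then
  # Pods that never scheduled.
  kubectl get pods -n "$NS" --field-selector=status.phase=Pending
  # Scheduler events explain why (taints, insufficient CPU/memory, selectors).
  kubectl get events -n "$NS" --sort-by=.lastTimestamp \
    | grep -i -E 'FailedScheduling|Insufficient|Taint' || echo "no scheduling events found"
else
  echo "kubectl not found on PATH"
fi
```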
### Direct backend (task failures)

1. Verify the Oz CLI is accessible.
2. Verify the workspace root directory has write permissions for the user running the worker.
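The writability check above can be run directly as the worker's user; the workspace path below is a placeholder for whatever root you configured:

```shell
WORKSPACE_ROOT=/var/lib/oz-worker/workspaces   # placeholder path

if [ -d "$WORKSPACE_ROOT" ] && [ -w "$WORKSPACE_ROOT" ]; then
  echo "workspace root exists and is writable"
else
  echo "workspace root is missing or not writable: $WORKSPACE_ROOT"
fi
```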
---

## Image pull failures

### Docker backend (image pull)

1. If using a private registry, ensure Docker credentials are available to the worker. See [Private Docker registries](/agent-platform/cloud-agents/self-hosting/managed-docker/#private-docker-registries).
2. Try pulling the image manually on the worker host: `docker pull <image>`.

### Kubernetes backend (image pull)

1. Configure `imagePullSecrets` in the `pod_template` section of your worker config.
2. Verify the Secret exists in the task namespace and contains valid credentials.

### Both backends (image pull)

* Verify the image exists and the tag is correct.
* Check network connectivity from the worker/cluster to the registry.
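Two sketches for the credential checks above; the secret, namespace, and image names are placeholders:

```shell
# Kubernetes: decode the pull secret that pod_template references and inspect it.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get secret regcred -n oz-tasks \
    -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d \
    || echo "secret not found: create it in the task namespace"
fi

# Docker: confirm the host's stored credentials can actually pull the image.
if command -v docker >/dev/null 2>&1; then
  docker pull registry.example.com/my-task-image:latest \
    || echo "pull failed: check credentials and the image name/tag"
fi
```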
---

## Related pages

* [Self-hosting overview](/agent-platform/cloud-agents/self-hosting/): Architecture and decision guide.
* [Self-hosted worker reference](/agent-platform/cloud-agents/self-hosting/reference/): CLI flags and config schema, including every flag mentioned here.
* [Security and networking](/agent-platform/cloud-agents/self-hosting/security-and-networking/): Outbound endpoints the worker needs.
* [Agent Session Sharing](/agent-platform/local-agents/session-sharing/): Attach to running tasks to debug interactively.