Skip to content

Agents > Inference & providers

Custom inference endpoint

Open in ChatGPT ↗
Ask ChatGPT about this page
Open in Claude ↗
Ask Claude about this page
Copied!

Connect Warp's agents to any OpenAI-compatible inference endpoint — OpenRouter, LiteLLM, z.ai, or an internal gateway you already run.

Warp supports custom inference endpoints for users who want to power Warp’s agents with any OpenAI-compatible inference endpoint — a model router, hosted gateway, or internal infrastructure they already run.

This lets you route AI requests through your preferred provider, run inference behind your own gateway, or use a router like OpenRouter or LiteLLM, while keeping the agent experience inside Warp.

  • OpenAI-compatible - Works with any endpoint that implements the OpenAI Chat Completions API.
  • Provider flexibility - Use a model router (OpenRouter, LiteLLM), a model provider with an OpenAI-compatible surface (z.ai), or your own internal gateway.
  • No AI credits consumed for inference - Inference is billed directly by your endpoint provider. On Business and Enterprise, local agent runs that route through a custom inference endpoint still consume platform credits for Warp’s platform infrastructure.
  • Local configuration - Endpoint URLs and credentials are stored locally on your device and never synced to the cloud.

A custom inference endpoint expects your endpoint to implement the OpenAI Chat Completions API (POST /v1/chat/completions). Any service that exposes a compatible surface can be used as a target:

  • OpenRouter - Aggregates many model providers behind a single OpenAI-compatible API and consolidated billing.
  • LiteLLM - A self-hosted proxy that exposes a unified, OpenAI-compatible API across providers.
  • z.ai - A model provider with an OpenAI-compatible API surface for its models.
  • Internal gateways - Any in-house service that fronts model providers behind an OpenAI-compatible endpoint (for example, a corporate AI gateway with logging, redaction, or access control).

When you configure a custom inference endpoint, Warp stores the endpoint URL, model identifiers, and credentials locally on your device. They are never synced to Warp’s servers.

When a model routed through your endpoint is selected:

  • Warp doesn’t consume your AI credits for that request.
  • Costs are billed directly by your endpoint provider.
  • Warp doesn’t retain or store your endpoint credentials on any of its servers.

To enable and configure a custom inference endpoint:

  1. In Warp, open Settings and search for inference endpoint to jump to the configuration.
  2. Add your endpoint URL (the base URL that exposes /v1/chat/completions) and any required credentials (typically an API key).
  3. Specify the model identifier(s) you want to route through this endpoint.
  4. Save the configuration. Once added, you’ll see your custom models appear in the model picker.

When you explicitly select an endpoint-routed model from the model picker, Warp routes the request through your endpoint instead of consuming Warp’s AI credits.

The configuration flow mirrors the Bring Your Own API Key setup, so the steps will feel familiar if you’ve already configured BYOK.

When you select an endpoint-routed model from the model picker, inference is billed directly by your endpoint provider, according to their pricing, rather than drawing from your Warp AI credits.

Warp’s Auto models dynamically route across providers using Warp’s infrastructure. Because Auto routing depends on Warp, Auto always consumes Warp’s credits, even if you’ve configured a custom inference endpoint.

To use your endpoint, select the specific endpoint-routed model from the model picker rather than an Auto option.

Some AI-powered features (Codebase Context, Active AI recommendations, cloud agent runs) rely on Warp’s infrastructure and are unaffected by a custom inference endpoint. See the feature breakdown on the BYOK page for which features still consume Warp credits.

Warp is SOC 2 compliant and has Zero Data Retention (ZDR) agreements with all of its contracted LLM providers.

When you use a custom inference endpoint:

  • Data retention is determined by your endpoint provider and any upstream model providers they route to.
  • Warp cannot enforce ZDR for requests sent through a custom inference endpoint.
  • If your endpoint provider does not have ZDR with the underlying model provider, your requests may be retained according to their terms.

Review your endpoint provider’s data handling and retention policies before routing sensitive prompts through a custom inference endpoint.

Custom inference endpoints are configured at the user level on every plan. Each user adds their own endpoint locally; centrally configured, admin-managed endpoints for teams are not yet available.

Enterprise teams that need centrally managed model routing today should see Bring Your Own LLM.

How custom inference endpoints differ from BYOK and BYOLLM

Section titled “How custom inference endpoints differ from BYOK and BYOLLM”

Warp offers three ways to bring your own AI infrastructure. Use this table to pick the right one, and follow the links for full details.

NameMeaningPlans
Bring Your Own API Key (BYOK)Use your own API key for OpenAI, Anthropic, or Google models. Keys are stored locally on your device.Free and all eligible paid plans
Custom inference endpointConnect Warp to an OpenAI-compatible endpoint such as OpenRouter, LiteLLM, z.ai, or an internal gateway.Free and all eligible paid plans
Bring Your Own LLM (BYOLLM)Enterprise-managed inference through your cloud provider (AWS Bedrock today; Azure Foundry and Google Vertex coming soon), with Warp handling routing, orchestration, governance, and observability.Enterprise only

Platform credits may apply for local agent runs on Business and Enterprise when using BYOK, a custom inference endpoint, or BYOLLM. See platform credits.

  • Bring Your Own API Key — Use your own OpenAI, Anthropic, or Google API keys.
  • Bring Your Own LLM — Enterprise-managed inference through your cloud provider or approved infrastructure.
  • Model Choice — Full list of supported models and model_id values.
  • Credits — How Warp credits work and when they’re consumed.