Chat & inference
One OpenAI-compatible API across cloud LLMs, your own GPUs, and spot GPUs. Health-gated machines, automatic failover, and reapers that stop idle spend: the reliability of a managed provider at neocloud prices.
LLM gateways route to models but never touch GPUs. GPU clouds run models but never route to OpenAI or protect your bill. ai-ctrl is the seam — one API that does both.
A pool manager sits in the middle. Ask for 5 machines and it boots a few extra, benchmarks every one (GPU, SSD, network), keeps the fastest 5 and kills the rest. If one dies — or a benchmark flags slow disk or network — it's replaced automatically. You only ever pay for healthy machines.
A pool manager keeps your fleet healthy — automatically.
Strategy shown: Vast.ai — machines are cheap, so over-provision and cull. On RunPod the manager keeps fewer spares — boot time is billed and machines are more reliable.
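The over-provision-and-cull strategy described above can be sketched roughly as follows. This is an illustration only, not ai-ctrl's implementation: the `Benchmark` type, the single combined `score`, and the spare percentages per provider are all assumptions made for the example.

```python
import math
from dataclasses import dataclass

# Hypothetical spare percentages: more spares where machines are cheap
# and less reliable, fewer where boot time is billed.
SPARE_PERCENT = {"vast": 40, "runpod": 20}


@dataclass(frozen=True)
class Benchmark:
    """Assumed benchmark result for one machine; a higher score is faster."""
    machine_id: str
    score: float


def boot_count(wanted: int, provider: str) -> int:
    """How many machines to boot so that `wanted` can survive the cull."""
    return wanted + math.ceil(wanted * SPARE_PERCENT[provider] / 100)


def cull(results: list[Benchmark], wanted: int):
    """Return (keep, kill): the fastest `wanted` machines and all the rest."""
    ranked = sorted(results, key=lambda r: r.score, reverse=True)
    return ranked[:wanted], ranked[wanted:]
```

With these assumed ratios, asking for 5 machines boots 7 on Vast and 6 on RunPod; everything outside the top 5 is returned in `kill` so it can be destroyed before it bills further.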
Run spot & neocloud GPUs (60–90% cheaper) without the spot-roulette. Only health-checked machines take traffic; if a node dies, requests fail over automatically.
Built-in reapers stop idle machines, kill stragglers, and reclaim provider-side zombies. Per-pool budgets pause spend before it runs away. The #1 FinOps problem, solved in the gateway.
Point the OpenAI SDK at ai-ctrl. Reach every cloud provider and your own vLLM fleet behind a single API — automatic failover, BYOK, no token markup, no lock-in.
One control plane for inference, batch, training and image/video generation — all on the same pooled GPUs.
OpenAI-compatible streaming completions across cloud models and your own vLLM.
OpenAI-compatible batch jobs at roughly half price, ~24 h turnaround.
Fine-tune and LoRA-train open models on pooled GPUs — submit a run, get your adapter back.
Image & video generation pipelines dispatched to GPU workers.
Run FLUX, SDXL & friends — text-to-image on your own GPUs, not a rented API.
Any vLLM / Hugging-Face model, swapped live onto a pool with prestaged weights.
Idle, stalled, crashed, never-booted, and provider-side zombie machines are detected and destroyed automatically. Budgets and account-halt stop spend before it runs away.
Idle machine → stop (keep disk) → kill after grace. No more paying for GPUs doing nothing.
Provider-side pods with no DB record, or machines gone silent, are reclaimed so they stop billing.
Per-pool daily budgets pause provisioning; account-halt stops a runaway scope cold.
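A minimal sketch of the three reaper rules above, assuming simple inputs. The function names, the `Action` enum, and the idea of passing thresholds in seconds are hypothetical; the source describes the behaviour but not its API.

```python
from enum import Enum


class Action(Enum):
    KEEP = "keep"
    STOP = "stop"        # stop the machine but keep its disk
    DESTROY = "destroy"  # release the machine entirely


def idle_action(idle_seconds: float, stopped: bool,
                idle_limit: float, grace: float) -> Action:
    """Idle machine -> stop (keep disk) -> destroy once the grace period passes."""
    if idle_seconds < idle_limit:
        return Action.KEEP
    if not stopped:
        return Action.STOP
    if idle_seconds >= idle_limit + grace:
        return Action.DESTROY
    return Action.KEEP  # stopped and still inside the grace period


def find_zombies(provider_ids: set[str], db_ids: set[str]) -> set[str]:
    """Pods the provider still bills for that have no database record."""
    return provider_ids - db_ids


def may_provision(spent_today: float, daily_budget: float) -> bool:
    """Provisioning pauses once the pool's daily budget is used up."""
    return spent_today < daily_budget
```

Keeping each rule a pure function of observed state makes it straightforward to run the reaper on a timer and to replay its decisions from a log.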
Every provisioned box is audited — SSH, GPU, disk & network — before it joins a pool. Bad hardware is rejected; a dead spot node fails over to cloud so a cheap GPU never becomes a failed request.
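The admission audit could look something like the sketch below. The four checks come from the text; the `Audit` fields, the thresholds, and the `pick_target` helper are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Audit:
    """Hypothetical audit report for one freshly provisioned machine."""
    ssh_ok: bool
    gpus_visible: int
    disk_mb_s: float
    net_mbit_s: float


def audit_failures(a: Audit, gpus_expected: int,
                   min_disk_mb_s: float, min_net_mbit_s: float) -> list[str]:
    """List the failed checks; an empty list means the machine may join the pool."""
    failures = []
    if not a.ssh_ok:
        failures.append("ssh")
    if a.gpus_visible < gpus_expected:
        failures.append("gpu")
    if a.disk_mb_s < min_disk_mb_s:
        failures.append("disk")
    if a.net_mbit_s < min_net_mbit_s:
        failures.append("network")
    return failures


def pick_target(healthy_nodes: list[str], cloud_fallback: str) -> str:
    """Serve from an admitted node if any remain, otherwise fail over to cloud."""
    return healthy_nodes[0] if healthy_nodes else cloud_fallback
```

Returning the list of failed checks, rather than a bare boolean, leaves a reason that can be recorded when a machine is rejected.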
Route across 8+ LLM providers and your own vLLM. Provision GPUs on RunPod, Vast & Clore into autoscaling, budgeted pools — or add your own boxes to the same fleet.
OpenAI, Anthropic, Gemini, Mistral, DeepSeek, Qwen, Perplexity, OpenRouter + self-hosted vLLM. Priority + cost ranking, opt-in fallback. BYOK.
Declarative, multi-member, autoscaling pools with per-pool budgets. RunPod / Vast / Clore — plus your own gx10 / on-prem boxes over Tailscale.
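Priority plus cost ranking with opt-in fallback might be sketched as below. The `Route` type, the convention that a lower priority number is tried first, and the prices are all assumptions; `call` stands in for whatever actually sends the request.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    provider: str
    priority: int        # assumed convention: lower number is tried first
    usd_per_mtok: float  # illustrative price per million tokens


def rank(routes: list[Route]) -> list[Route]:
    """Order by priority, breaking ties by cost."""
    return sorted(routes, key=lambda r: (r.priority, r.usd_per_mtok))


def complete(routes, call, allow_fallback=False):
    """Try routes in ranked order; go past the first only if fallback is opted in."""
    ordered = rank(routes)
    candidates = ordered if allow_fallback else ordered[:1]
    last_error = None
    for route in candidates:
        try:
            return route.provider, call(route)
        except Exception as exc:  # sketch only: a real router would be more selective
            last_error = exc
    raise RuntimeError("all candidate routes failed") from last_error
```

Because fallback is off by default in this sketch, a request never silently lands on a more expensive provider unless the caller asked for that.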
A live SSE event stream, ~130 machine states and an append-only audit log. Automation routes on stable enums; humans read the detail. No guessing.
status: Small stable enum. Automation fans out on this.
state_detail: Human-readable current activity.
state_progress: 0–100% through the current step.
error_code: Stable category for retry / alert switching.
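As a rough illustration of how automation might consume these fields, the sketch below decodes one SSE `data:` line and branches only on the stable values. The four field names come from the list above; the JSON payload shape and every enum value shown (`ready`, `failed`, `network_timeout`, and so on) are hypothetical.

```python
import json

# Hypothetical error codes; the source names the field but not its values.
RETRYABLE = {"provider_capacity", "network_timeout"}


def parse_sse_data(line: str):
    """Decode one `data: {...}` line of a server-sent event stream."""
    if not line.startswith("data:"):
        return None  # comments, keep-alives, and other SSE fields
    return json.loads(line[len("data:"):].strip())


def handle(event: dict) -> str:
    """Branch on the stable fields; state_detail is display text only."""
    status = event["status"]
    if status == "failed":
        return "retry" if event.get("error_code") in RETRYABLE else "alert"
    if status == "ready":
        return "route_traffic"
    return "wait"
```

Nothing here parses `state_detail`; it is meant for people, so automation that depended on its wording would break whenever the text changed.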
| Capability | LLM gateways | GPU / inference clouds | TrueFoundry | ai-ctrl |
|---|---|---|---|---|
| Cross-provider LLM routing + fallback | Yes | No | Yes | Yes |
| Provisions cheap / spot cloud GPUs | No | Yes | Own GPUs only | Yes |
| Active money-drain reapers + $ budgets | Token $ only | Weak / partial | Partial | Yes |
| Health-gated machines | — | Rare | — | Roadmap |
| Your own / local boxes in the fleet | No | Some (BYOC) | Yes (your K8s) | Yes |
| Fleet / GPU-lifecycle observability | Request-level | Partial | Partial | Yes |
Honest take: routing and observability are table-stakes among gateways; provisioning + reapers are the GPU clouds' turf. ai-ctrl is the only one combining both behind one OpenAI-compatible API — TrueFoundry is closest but orchestrates only GPUs you already own. Health-gating is on our roadmap (see badges above).
    # Point any OpenAI client at ai-ctrl
    curl https://api.ai-ctrl.net/v1/chat/completions \
      -H "Authorization: Bearer $AI_CTRL_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"hi"}]}'

    # Or with the OpenAI Python SDK
    import os
    from openai import OpenAI

    # Read the key from the environment; Python does not expand "$AI_CTRL_API_KEY" inside a string.
    client = OpenAI(base_url="https://api.ai-ctrl.net/v1", api_key=os.environ["AI_CTRL_API_KEY"])