Picking a fast model

The chat model your agent uses is the single biggest knob for reply speed. A 200ms time-to-first-token (TTFT) feels instant; a 1.5s TTFT feels broken. Most "the bot is slow" complaints from buyers trace back to picking a heavyweight model when a fast one would have done.

Where to pick a model

Super-admin → Settings → System → AI providers. Each provider (Cloudflare, OpenAI, OpenRouter) ships a dropdown of curated models with a speed badge, a cost tier, and an estimated TTFT. The estimate comes from each vendor's published latency dashboard as of the catalogue's last update; the real number on your install can vary by ±30% depending on your region.
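To make the ±30% band concrete, here is a minimal sketch of the arithmetic. The function name `ttft_range_ms` is hypothetical and not part of the product; it only turns a catalogue estimate into the range you might plausibly measure on your own install.

```python
def ttft_range_ms(estimate_ms: float, tolerance: float = 0.30) -> tuple[float, float]:
    """Return the (low, high) TTFT band around a catalogue estimate.

    tolerance=0.30 mirrors the ±30% regional variation quoted above.
    """
    return estimate_ms * (1 - tolerance), estimate_ms * (1 + tolerance)


low, high = ttft_range_ms(200)
print(f"catalogue 200 ms -> expect roughly {low:.0f} to {high:.0f} ms")
```

A model badged at 200 ms could therefore land anywhere between about 140 ms and 260 ms, which is why the measured figure from Test connection is the one to trust.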

The "Test connection" button

Next to each model dropdown is a Test connection button. Click it and the server runs a one-token chat against the configured provider and reports the actual TTFT your install sees. The measurement is cached for 24 hours so reloading the page does not burn API budget. Click again to refresh.

If the probe fails — bad key, model deprecated, provider down — the button shows the error inline so you can fix it without leaving the page.

Default picks (June 2026)

Provider	Pick	Why
Cloudflare Workers AI	`@cf/meta/llama-3.3-70b-instruct-fp8-fast`	Fastest 70B on Cloudflare. Free for most installs. Tool-calling reliable.
OpenAI	`gpt-4o-mini`	10× cheaper than GPT-4o, ~3× faster TTFT. Quality good enough for sales-bot use.
OpenRouter	`meta-llama/llama-3.3-70b-instruct:free`	Free tier; rate-limited but enough for low-volume sites.

When to pick something slower

Long-form / nuanced replies (legal, finance, support escalations). Switch to GPT-4o or Claude 3.5 Sonnet. Buyers will accept slower TTFT in exchange for fewer wrong answers.
Huge knowledge bases. Gemini 1.5 Flash has a 1M-token context window, which fits most knowledge bases in a single prompt.
EU residency requirement. Mistral Small via OpenRouter routes to European infrastructure.

Why a model is or is not in the catalogue

The catalogue (`App\Services\Llm\ModelCatalog`) is hand-curated. We surface models that:

Stream tokens via the OpenAI-style chat/completions endpoint.
Reliably honour the JSON tool-calling format (where flagged).
Are not deprecated by their vendor.
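The three criteria above amount to a simple filter. The sketch below shows one way to express it. The `ModelEntry` fields and the example IDs are hypothetical, invented for illustration; they do not describe the real `ModelCatalog` class or any real model.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ModelEntry:
    model_id: str
    streams_chat_completions: bool  # streams via an OpenAI-style chat/completions endpoint
    tool_calling_flagged: bool      # catalogue advertises tool calling for this model
    tool_calling_reliable: bool     # model reliably honours the JSON tool-calling format
    deprecated: bool                # vendor has deprecated the model


def is_listable(m: ModelEntry) -> bool:
    """Apply the three curation criteria to one candidate model."""
    if not m.streams_chat_completions:
        return False
    # Tool-calling reliability only matters where the model is flagged for it.
    if m.tool_calling_flagged and not m.tool_calling_reliable:
        return False
    return not m.deprecated


def listable(entries: List[ModelEntry]) -> List[ModelEntry]:
    return [m for m in entries if is_listable(m)]


# Hypothetical candidates, not real catalogue data.
candidates = [
    ModelEntry("example/fast-chat", True, True, True, False),
    ModelEntry("example/no-streaming", False, False, False, False),
    ModelEntry("example/flaky-tools", True, True, False, False),
    ModelEntry("example/plain-chat", True, False, False, False),
    ModelEntry("example/retired", True, True, True, True),
]

print([m.model_id for m in listable(candidates)])
```

Note that `example/plain-chat` passes even though it has no reliable tool calling, because it is not flagged as a tool-calling model in the first place.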

If your model is not in the dropdown, the picker still lets you paste a custom ID: the "Custom" entry is pinned at the top of the list and switches the field back to the original free-text input. The Test connection button works against custom models too.

Caveats

Catalogue TTFT estimates drift. Vendors silently change inference hardware. Re-run Test connection monthly if you care about exact numbers.
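Probe runs from your server's region, not the buyer's. The measurement reflects the latency between the provider and whichever server the install runs on. A buyer in Tokyo will still see different end-to-end latency than one in Frankfurt for the same model, because the probe does not cover the network hop between the buyer and your server.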
Probe consumes one output token per click, plus a handful of prompt tokens. At OpenAI's gpt-4o-mini rate, that is well under $0.0001. Negligible.
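As a rough check on the cost claim, the arithmetic can be sketched like this. The per-token prices and the 10-token prompt size are assumptions chosen for illustration (they are not taken from this page), so check them against your provider's current price list before relying on the figure.

```python
# Assumed prices in USD per 1M tokens; verify against the provider's price list.
ASSUMED_INPUT_USD_PER_MILLION = 0.15
ASSUMED_OUTPUT_USD_PER_MILLION = 0.60


def probe_cost_usd(prompt_tokens: int, output_tokens: int = 1) -> float:
    """Estimate the cost of one probe: a short prompt plus one generated token."""
    input_cost = prompt_tokens * ASSUMED_INPUT_USD_PER_MILLION / 1_000_000
    output_cost = output_tokens * ASSUMED_OUTPUT_USD_PER_MILLION / 1_000_000
    return input_cost + output_cost


# Assumption: the probe prompt is about 10 tokens long.
cost = probe_cost_usd(prompt_tokens=10)
print(f"one probe costs about ${cost:.7f}")
print("under $0.0001:", cost < 0.0001)
```

Under these assumed prices a single probe costs a few millionths of a dollar, so even daily refreshes across several models stay far below one cent per year.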