Troubleshooting

Slow replies: find the bottleneck

When a buyer says "the bot is slow", the first question is which stage is slow. A visitor turn crosses four network round-trips before the first character appears, and each one can be slow independently of the others:

Stage	What it is	Typical	Fix when slow
`embed_ms`	Embedding the visitor's question (provider API call)	80–250ms	Provider region; retrieve cache absorbs repeats
`ann_ms`	Vector similarity search (Vectorize or Qdrant)	80–300ms remote / <10ms local Qdrant	Run Qdrant on the app server (`VECTOR_PROVIDER=qdrant`)
`rerank_ms`	Cross-encoder reranking of candidates	120–400ms	Lower `RAG_RERANK_FAN_OUT`; accept ANN order
`first_token_ms`	Wait from turn start until the LLM's first streamed token	200–900ms	Switch to a faster chat model — biggest single lever

The dashboard

Super-admin → /settings/system/hotpath-latency. Shows the last 100 visitor turns with one column per stage, p50 / p95 / max aggregate cards, and a plain-English verdict line naming the dominant cost with its remedy. Colour coding: green < 300ms, amber < 800ms, red above.
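As a rough illustration of the aggregate cards and colour bands, here is a minimal Python sketch. The nearest-rank percentile method and the function names are assumptions for illustration; the dashboard's actual implementation may compute percentiles differently. Only the 300ms / 800ms band edges come from the description above.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile (an assumed method, not necessarily the dashboard's)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def colour(ms):
    """Colour bands as documented: green < 300ms, amber < 800ms, red above."""
    if ms < 300:
        return "green"
    if ms < 800:
        return "amber"
    return "red"

def summarise(samples_ms):
    """Build the three aggregate values shown on a dashboard card."""
    return {
        "p50": percentile(samples_ms, 50),
        "p95": percentile(samples_ms, 95),
        "max": max(samples_ms),
    }

# Ten made-up first_token_ms samples.
first_token_ms = [220, 310, 280, 950, 400, 260, 240, 300, 870, 290]
summary = summarise(first_token_ms)
bands = {name: colour(value) for name, value in summary.items()}
```

With these samples the median stays green while p95 lands in red, which is the usual reason to read p95 alongside p50: a healthy median can hide a slow tail.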

Data comes from a cache-backed ring buffer that the stream handler appends to after the reply finishes streaming, so visitors pay no latency for the bookkeeping. The buffer holds entries for up to 7 days or 100 turns, whichever limit is reached first. No database table is involved, so it works identically on the Redis and database cache drivers.
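The retention rule can be pictured with a small in-memory stand-in. This is only a sketch: the real buffer lives in the application cache, and the class name, method names, and per-entry expiry check here are hypothetical.

```python
import time
from collections import deque

MAX_TURNS = 100                 # documented cap on stored turns
TTL_SECONDS = 7 * 24 * 3600     # documented 7-day lifetime

class TurnBuffer:
    """In-memory stand-in for the cache-backed ring buffer (hypothetical class)."""

    def __init__(self, max_turns=MAX_TURNS, ttl=TTL_SECONDS):
        # deque(maxlen=...) silently drops the oldest entry once the cap is reached.
        self._turns = deque(maxlen=max_turns)
        self._ttl = ttl

    def append(self, timings, now=None):
        """Record one turn's stage timings; intended to run after streaming ends."""
        now = time.time() if now is None else now
        self._turns.append((now, timings))

    def recent(self, now=None):
        """Return the stored timings that are still within the TTL."""
        now = time.time() if now is None else now
        return [timings for stamp, timings in self._turns if now - stamp <= self._ttl]
```

Appending 150 turns leaves only the newest 100, and reading the buffer more than seven days later returns nothing.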

Reading the table

First token high, retrieval low — the LLM is the bottleneck. Open Settings → System → AI providers and pick a faster model (the dropdown shows expected TTFT per model).
Retrieval high — check which sub-column grew. Search high → vector store round-trip; running Qdrant locally takes it under 10ms. Rerank high → reduce fan-out. Embed high → provider region.
"retrieve cache hit" rows — repeat questions skip embed/search/rerank entirely (30-minute cache). These rows show what your pipeline costs when retrieval is free.
Everything green but visitors still complain — the problem is between the visitor's browser and your server: proxy buffering (see the SSE heartbeat notes in Architecture → Hot path), TLS setup time, or plain geography.
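The reading rules above can be condensed into a small triage function. This is a sketch under assumptions: the order of the checks, the use of the 300ms green threshold as the cut-off for "high", and the wording of the verdicts are illustrative choices, not the dashboard's actual verdict logic.

```python
GREEN_BELOW_MS = 300  # the dashboard's green band; reused here as the "high" cut-off (assumption)

RETRIEVAL_REMEDIES = {
    "embed_ms": "check the embedding provider region",
    "ann_ms": "run Qdrant locally on the app server",
    "rerank_ms": "reduce RAG_RERANK_FAN_OUT",
}

def triage(turn):
    """Return a one-line verdict for a dict of stage timings in milliseconds."""
    # Missing retrieval stages (for example on a retrieve cache hit) count as 0ms.
    slowest = max(RETRIEVAL_REMEDIES, key=lambda stage: turn.get(stage, 0))
    if turn.get(slowest, 0) >= GREEN_BELOW_MS:
        return f"retrieval: {slowest} is high, {RETRIEVAL_REMEDIES[slowest]}"
    if turn.get("first_token_ms", 0) >= GREEN_BELOW_MS:
        return "llm: first token is high, pick a faster chat model"
    return "network: all stages green, check proxy buffering, TLS setup, geography"

slow_search = triage({"embed_ms": 120, "ann_ms": 450, "rerank_ms": 200, "first_token_ms": 900})
slow_model = triage({"embed_ms": 100, "ann_ms": 8, "rerank_ms": 150, "first_token_ms": 850})
cache_hit = triage({"first_token_ms": 250})
```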

The fast router

When an agent has tools enabled, every legacy turn pays a tool-check completion — a full non-streaming LLM call (1–3 of them, 5–15s each on Workers AI 70B) before the streamed answer starts. Measured in production this put first-token p50 at ~14s even on the fastest model, while ~90% of visitor questions never needed a tool.

The fast router decides per turn whether the tool check is worth running, in under 2ms and with no extra network calls:

Keyword gate — per-tool phrase lists ("open a ticket", "where is my order", the human-handoff phrases).
Embedding gate — the RAG query embedding (already computed for retrieval) is compared against per-tool exemplar centroids by cosine similarity. Centroids are embedded off the hot path (queued job / deploy command) and cached 30 days, keyed by embed model + dimension + exemplar text so they auto-invalidate on any change.
No signal → knowledge route: RAG + stream only, tool check skipped.
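The three steps above can be sketched as a single routing function. This is a minimal illustration, not the shipped router: the function signature, the data shapes, and the `embedding` reason label are assumptions, while the `knowledge` / `no_signal` and `tool_loop` / `keyword` labels follow the route names mentioned in this guide.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def route(message, query_embedding, keyword_lists, centroids, thresholds):
    """Return (route, reason) for one visitor turn without any network call."""
    text = message.lower()
    # 1. Keyword gate: any configured phrase for any tool triggers the tool check.
    for tool, phrases in keyword_lists.items():
        if any(phrase in text for phrase in phrases):
            return ("tool_loop", "keyword")
    # 2. Embedding gate: reuse the RAG query embedding against per-tool centroids.
    for tool, centroid in centroids.items():
        if cosine(query_embedding, centroid) >= thresholds[tool]:
            return ("tool_loop", "embedding")
    # 3. No signal: RAG + stream only, tool check skipped.
    return ("knowledge", "no_signal")

# Toy two-dimensional embeddings, purely for illustration.
keyword_lists = {"ticket": ["open a ticket"], "order": ["where is my order"]}
centroids = {"ticket": [1.0, 0.0], "order": [0.0, 1.0]}
thresholds = {"ticket": 0.8, "order": 0.8}

by_keyword = route("Where is my order?", [0.5, 0.5], keyword_lists, centroids, thresholds)
by_embedding = route("my parcel has not arrived", [0.1, 0.9], keyword_lists, centroids, thresholds)
no_signal = route("what are your opening hours", [0.7, 0.7], keyword_lists, centroids, thresholds)
```

Both gates work on data that is already in memory (the message text and the query embedding computed for retrieval), which is what keeps the decision off the network.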

Enable it:

FAST_ROUTER_ENABLED=true        # .env (default off)
php artisan router:warm         # embed tool centroids (also run on deploy)
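The centroids that `router:warm` embeds are described above as cached under a key built from the embed model, the dimension, and the exemplar text. One way such a self-invalidating key could be derived is sketched below; the `fast_router:centroid:` prefix, the function name, and the choice of SHA-256 over a JSON payload are assumptions for illustration only.

```python
import hashlib
import json

def centroid_cache_key(embed_model, dimension, exemplars):
    """Derive a cache key that changes whenever any input changes (hypothetical scheme)."""
    # JSON keeps the three parts unambiguous, so "ab" + "c" never collides with "a" + "bc".
    payload = json.dumps([embed_model, dimension, list(exemplars)], ensure_ascii=False)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"fast_router:centroid:{digest}"

# Model names and exemplars below are placeholders.
base = centroid_cache_key("model-a", 768, ["open a ticket", "report a problem"])
same = centroid_cache_key("model-a", 768, ["open a ticket", "report a problem"])
new_text = centroid_cache_key("model-a", 768, ["open a ticket", "file a complaint"])
new_model = centroid_cache_key("model-b", 768, ["open a ticket", "report a problem"])
```

Because the key is a function of its inputs, editing an exemplar or switching the embedding model simply produces a new key; the old entry is never read again and ages out with its TTL.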

Safety: explicit "talk to a human" messages are caught by the keyword shortcut before the router and always escalate; escalation also gets the lowest embedding threshold (FAST_ROUTER_ESCALATE_THRESHOLD) so it stays the easiest tool to trigger. A per-agent kill switch lives in vertical_overrides.fast_router (true/false beats the global flag). The dashboard's Notes column and perf:hotpath's route split line show every decision (knowledge (no_signal), tool_loop (keyword), …) so misroutes are auditable, never mysterious.

Adaptive rerank skip

The cross-encoder reranker is the slowest retrieval stage (500–1,200ms on Workers AI). Its real job is choosing which top-K of the fan-out candidates survive — but when the vector search already returned K candidates that all score above RAG_RERANK_SKIP_SCORE, that choice is already made and the round-trip buys nothing. With RAG_RERANK_SKIP=true those turns keep ANN order and skip the reranker entirely; weak or sparse candidate sets always rerank (precision matters most exactly when recall was shaky).

RAG_RERANK_SKIP=true          # default off
RAG_RERANK_SKIP_SCORE=0.68    # ANN cosine bar; 0.80 default on OpenAI embeddings
RAG_RERANK_FAN_OUT=3          # candidates = topK × fan_out; lower = faster rerank

Skipped turns show rerank skipped in the dashboard's Rerank column and Notes, and perf:hotpath prints a Rerank skipped (ANN decisive): N/M turns line — so you can see exactly how often the skip fires before trusting it.
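The skip rule can be sketched as a short predicate. This is an illustration under assumptions: the function name and arguments are hypothetical, and treating a score equal to the bar as passing (`>=`) is a guess, since the text only says "above".

```python
def should_skip_rerank(ann_scores, top_k, skip_score, enabled=True):
    """Decide whether ANN order is decisive enough to skip the reranker."""
    # Feature off, or a sparse candidate set: always rerank.
    if not enabled or len(ann_scores) < top_k:
        return False
    # Skip only when the K best candidates all clear the cosine bar.
    best = sorted(ann_scores, reverse=True)[:top_k]
    return all(score >= skip_score for score in best)

# topK = 3 with fan-out 3 gives 9 candidates from the vector search.
strong = [0.91, 0.88, 0.85, 0.61, 0.58, 0.55, 0.52, 0.50, 0.47]
weak = [0.91, 0.70, 0.62, 0.61, 0.58, 0.55, 0.52, 0.50, 0.47]
sparse = [0.93, 0.90]

skip_strong = should_skip_rerank(strong, top_k=3, skip_score=0.68)
skip_weak = should_skip_rerank(weak, top_k=3, skip_score=0.68)
skip_sparse = should_skip_rerank(sparse, top_k=3, skip_score=0.68)
```

Only the first set skips: the second has a third-best score below the bar, and the third has fewer than K candidates, so both still go through the reranker.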

CLI: `perf:hotpath`

Same measurement, scriptable. Run after any change you hope made things faster (model switch, Qdrant migration, cache driver) and compare runs:

php artisan perf:hotpath                # 10 turns, first published agent
php artisan perf:hotpath --turns=25     # bigger sample
php artisan perf:hotpath --agent=<id>   # specific agent

The command sends real widget turns (same JWT + SSE path the embedded widget uses), prints wall-clock TTFB per turn, the same per-stage p50/p95 table as the dashboard, and the same verdict line. A unique suffix per message defeats the 30-minute retrieve cache so every turn exercises the full pipeline. When no provider keys are configured it warns that FakeOpenAi is bound — those runs measure pipeline overhead only.

Raw logs

Every turn also writes a structured rag.turn line to the Laravel log with the same stage breakdown plus retrieve_timings. For historical analysis beyond the 100-turn buffer:

grep 'rag.turn' storage/logs/laravel.log | tail -50
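For more than eyeballing, the same lines can be aggregated with a short script. The sketch below assumes each `rag.turn` entry uses Laravel's default single-line format with the stage timings as a JSON context object, and that the keys match the stage names in the table at the top of this page; check one real line from your log before relying on either assumption.

```python
import json
from statistics import median

STAGES = ("embed_ms", "ann_ms", "rerank_ms", "first_token_ms")

def parse_rag_turn(line):
    """Extract the JSON context from one rag.turn log line, or None if absent."""
    marker = line.find("rag.turn")
    if marker == -1:
        return None
    start = line.find("{", marker)
    if start == -1:
        return None
    try:
        # raw_decode stops at the end of the first JSON value and ignores trailing text.
        context, _ = json.JSONDecoder().raw_decode(line[start:])
    except json.JSONDecodeError:
        return None
    return context if isinstance(context, dict) else None

def stage_medians(lines):
    """Median per stage across all parseable rag.turn lines."""
    turns = [turn for turn in map(parse_rag_turn, lines) if turn is not None]
    return {
        stage: median(turn[stage] for turn in turns if stage in turn)
        for stage in STAGES
        if any(stage in turn for turn in turns)
    }

# Made-up sample lines in the assumed format.
sample = [
    '[2025-01-01 10:00:00] production.INFO: rag.turn {"embed_ms":120,"ann_ms":90,"rerank_ms":300,"first_token_ms":700} ',
    '[2025-01-01 10:00:05] production.INFO: rag.turn {"embed_ms":100,"ann_ms":110,"rerank_ms":260,"first_token_ms":900} ',
    '[2025-01-01 10:00:09] production.ERROR: something else',
]
medians = stage_medians(sample)
```

To run it against a real log, replace `sample` with the lines of `storage/logs/laravel.log`, for example `open("storage/logs/laravel.log", encoding="utf-8")`.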