Troubleshooting
Slow replies: find the bottleneck
When a buyer says "the bot is slow", the first question is which stage is slow. A visitor turn crosses four network round-trips before the first character appears, and they fail independently:
| Stage | What it is | Typical | Fix when slow |
|---|---|---|---|
| embed_ms | Embedding the visitor's question (provider API call) | 80–250ms | Provider region; retrieve cache absorbs repeats |
| ann_ms | Vector similarity search (Vectorize or Qdrant) | 80–300ms remote / <10ms local Qdrant | Run Qdrant on the app server (VECTOR_PROVIDER=qdrant) |
| rerank_ms | Cross-encoder reranking of candidates | 120–400ms | Lower RAG_RERANK_FAN_OUT; accept ANN order |
| first_token_ms | Wait from turn start until the LLM's first streamed token | 200–900ms | Switch to a faster chat model (biggest single lever) |
The dashboard
Super-admin → /settings/system/hotpath-latency.
Shows the last 100 visitor turns with one column per stage,
p50 / p95 / max aggregate cards, and a plain-English verdict line
naming the dominant cost with its remedy. Colour coding: green
< 300ms, amber < 800ms, red above.
Data comes from a cache-backed ring buffer that the stream handler appends to after the reply finishes streaming, so visitors pay no latency for the bookkeeping. The buffer keeps entries for 7 days or 100 turns, whichever limit is reached first. No database table is involved, so it works identically on the Redis and database cache drivers.
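The buffer behaviour described above can be sketched in a few lines. This is a minimal illustration, not the application's code: the `FakeCache` class, the `hotpath:turns` key, and the choice to refresh the 7-day expiry on every write are all assumptions made for the example.

```python
import time

MAX_TURNS = 100                    # buffer depth stated in the text
TTL_SECONDS = 7 * 24 * 60 * 60     # 7 days

class FakeCache:
    """Hypothetical in-memory stand-in for a key/value cache with expiry."""

    def __init__(self):
        self._store = {}

    def get(self, key, now):
        entry = self._store.get(key)
        if entry is None or entry[1] <= now:
            return None            # missing or expired
        return entry[0]

    def put(self, key, value, ttl, now):
        self._store[key] = (value, now + ttl)

def record_turn(cache, turn, now=None):
    """Append one turn's timings and keep only the newest MAX_TURNS."""
    now = time.time() if now is None else now
    turns = cache.get("hotpath:turns", now) or []
    turns = (turns + [turn])[-MAX_TURNS:]
    cache.put("hotpath:turns", turns, TTL_SECONDS, now)
    return turns
```

Because the whole buffer lives under one cache key, nothing in the sketch depends on which cache driver sits behind it.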
Reading the table
- First token high, retrieval low → the LLM is the bottleneck. Open Settings → System → AI providers and pick a faster model (the dropdown shows expected TTFT per model).
- Retrieval high → check which sub-column grew.
  - Search high → vector store round-trip; running Qdrant locally takes it under 10ms.
  - Rerank high → reduce fan-out.
  - Embed high → provider region.
- "retrieve cache hit" rows → repeat questions skip embed/search/rerank entirely (30-minute cache). These rows show what your pipeline costs when retrieval is free.
- Everything green but visitors still complain → the problem is between the visitor's browser and your server: proxy buffering (see the SSE heartbeat notes in Architecture → Hot path), TLS setup time, or plain geography.
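The reading rules above amount to a small decision procedure. The sketch below is one way to express it, under stated assumptions: the 300ms "high" cut-off simply reuses the dashboard's green/amber boundary, the remedy strings paraphrase the table, and `diagnose` is a hypothetical helper, not the dashboard's actual verdict logic.

```python
from statistics import median

RETRIEVAL_STAGES = ("embed_ms", "ann_ms", "rerank_ms")
HIGH_MS = 300  # assumption: treat the green/amber boundary as "high"

REMEDY = {
    "embed_ms": "check the embedding provider region",
    "ann_ms": "run Qdrant locally (VECTOR_PROVIDER=qdrant)",
    "rerank_ms": "lower RAG_RERANK_FAN_OUT",
}

def colour(ms):
    """Dashboard colour bands: green < 300ms, amber < 800ms, red above."""
    if ms < 300:
        return "green"
    if ms < 800:
        return "amber"
    return "red"

def diagnose(turns):
    """Name the dominant stage from per-stage p50 values, with a remedy."""
    stages = RETRIEVAL_STAGES + ("first_token_ms",)
    p50 = {s: median(t[s] for t in turns) for s in stages}
    retrieval = sum(p50[s] for s in RETRIEVAL_STAGES)
    if retrieval >= HIGH_MS:
        # Retrieval is high: report whichever sub-column is largest.
        worst = max(RETRIEVAL_STAGES, key=lambda s: p50[s])
        return worst, REMEDY[worst]
    if p50["first_token_ms"] >= HIGH_MS:
        return "first_token_ms", "pick a faster chat model"
    return "network", "look between the visitor's browser and the server"
```

Each turn is assumed to be a dict carrying the four stage timings in milliseconds, for example `{"embed_ms": 100, "ann_ms": 50, "rerank_ms": 100, "first_token_ms": 900}`.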
The fast router
When an agent has tools enabled, every legacy turn pays a tool-check completion: a full non-streaming LLM call (1–3 of them, 5–15s each on Workers AI 70B) before the streamed answer starts. Measured in production, this put first-token p50 at ~14s even on the fastest model, while ~90% of visitor questions never needed a tool.
The fast router decides per turn whether the tool check is worth running, in under 2ms and with no extra network calls:
- Keyword gate: per-tool phrase lists ("open a ticket", "where is my order", the human-handoff phrases).
- Embedding gate: the RAG query embedding (already computed for retrieval) is compared against per-tool exemplar centroids by cosine similarity. Centroids are embedded off the hot path (queued job / deploy command) and cached 30 days, keyed by embed model + dimension + exemplar text so they auto-invalidate on any change.
- No signal → knowledge route: RAG + stream only, tool check skipped.
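The three gates above can be pictured as a short cascade. The following is a rough sketch under assumptions: the shape of the `tools` dict, the substring keyword match, and the `"embedding"` reason label are invented for illustration, while the `knowledge (no_signal)` and `tool_loop (keyword)` labels come from the route names shown in the dashboard.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def route(message, query_embedding, tools):
    """Return (route, reason) for one visitor turn."""
    text = message.lower()
    # Gate 1: cheap keyword match against each tool's phrase list.
    for tool in tools.values():
        if any(phrase in text for phrase in tool["phrases"]):
            return ("tool_loop", "keyword")
    # Gate 2: reuse the query embedding already computed for retrieval.
    for tool in tools.values():
        if cosine(query_embedding, tool["centroid"]) >= tool["threshold"]:
            return ("tool_loop", "embedding")
    # No signal: skip the tool check and answer from knowledge.
    return ("knowledge", "no_signal")
```

Both gates work on data already in memory (the message text and the retrieval embedding), which is why the decision adds no network round-trip.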
Enable it:
FAST_ROUTER_ENABLED=true # .env (default off)
php artisan router:warm # embed tool centroids (also run on deploy)
Safety: explicit "talk to a human" messages are caught by the
keyword shortcut before the router and always escalate;
escalation also gets the lowest embedding threshold
(FAST_ROUTER_ESCALATE_THRESHOLD) so it stays the
easiest tool to trigger. A per-agent kill switch lives in
vertical_overrides.fast_router (true/false beats the
global flag). The dashboard's Notes column and
perf:hotpath's route split line show every decision
(knowledge (no_signal),
tool_loop (keyword), …) so misroutes are auditable,
never mysterious.
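The precedence between the global flag and the per-agent override can be stated as a tiny function. This is a sketch only: it assumes `vertical_overrides` behaves like a dict and that a missing or non-boolean `fast_router` value means "no override"; `fast_router_enabled` is a hypothetical name.

```python
def fast_router_enabled(global_flag, vertical_overrides):
    """Per-agent true/false wins; otherwise fall back to the global flag."""
    override = (vertical_overrides or {}).get("fast_router")
    if isinstance(override, bool):
        return override
    return global_flag
```

So an agent with `{"fast_router": False}` stays on the legacy path even when FAST_ROUTER_ENABLED=true, and an agent with `{"fast_router": True}` opts in while the global default is off.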
Adaptive rerank skip
The cross-encoder reranker is the slowest retrieval stage (500–1,200ms
on Workers AI). Its real job is choosing which top-K of the
fan-out candidates survive, but when the vector search already
returned K candidates that all score above
RAG_RERANK_SKIP_SCORE, that choice is already made and
the round-trip buys nothing. With RAG_RERANK_SKIP=true
those turns keep ANN order and skip the reranker entirely; weak or
sparse candidate sets always rerank (precision matters most exactly
when recall was shaky).
RAG_RERANK_SKIP=true # default off
RAG_RERANK_SKIP_SCORE=0.68 # ANN cosine bar; 0.80 default on OpenAI embeddings
RAG_RERANK_FAN_OUT=3 # candidates = topK × fan_out; lower = faster rerank
Skipped turns show rerank skipped in the dashboard's Rerank
column and Notes, and perf:hotpath prints a
Rerank skipped (ANN decisive): N/M turns line, so you
can see exactly how often the skip fires before trusting it.
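The skip condition can be sketched as a single predicate. This is an interpretation, not the shipped logic: it assumes "K candidates that all score above the bar" means the K best ANN scores each meet or exceed RAG_RERANK_SKIP_SCORE, and `should_skip_rerank` is a hypothetical name.

```python
def should_skip_rerank(scores, top_k, skip_score, enabled=True):
    """True when the K best ANN scores all clear the bar, so ANN order is kept."""
    if not enabled:
        return False          # RAG_RERANK_SKIP is off: always rerank
    best = sorted(scores, reverse=True)[:top_k]
    if len(best) < top_k:
        return False          # sparse candidate set: always rerank
    return all(score >= skip_score for score in best)
```

With `top_k=3` and a bar of 0.80, scores of `[0.90, 0.85, 0.82, 0.50]` would skip the reranker, while `[0.90, 0.70, 0.60]` (weak) and `[0.90, 0.85]` (sparse) would still rerank.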
CLI: perf:hotpath
Same measurement, scriptable. Run after any change you hope made things faster (model switch, Qdrant migration, cache driver) and compare runs:
php artisan perf:hotpath # 10 turns, first published agent
php artisan perf:hotpath --turns=25 # bigger sample
php artisan perf:hotpath --agent=<id> # specific agent
The command sends real widget turns (same JWT + SSE path the embedded widget uses), prints wall-clock TTFB per turn, the same per-stage p50/p95 table as the dashboard, and the same verdict line. A unique suffix per message defeats the 30-minute retrieve cache so every turn exercises the full pipeline. When no provider keys are configured it warns that FakeOpenAi is bound โ those runs measure pipeline overhead only.
Raw logs
Every turn also writes a structured rag.turn line to
the Laravel log with the same stage breakdown plus
retrieve_timings. For historical analysis beyond the
100-turn buffer:
grep 'rag.turn' storage/logs/laravel.log | tail -50