Blengi docs

Troubleshooting

Slow replies: find the bottleneck

When a buyer says "the bot is slow", the first question is which stage is slow. A visitor turn crosses four network round-trips before the first character appears, and each one can be slow independently of the others:

| Stage | What it is | Typical | Fix when slow |
|---|---|---|---|
| embed_ms | Embedding the visitor's question (provider API call) | 80–250ms | Provider region; retrieve cache absorbs repeats |
| ann_ms | Vector similarity search (Vectorize or Qdrant) | 80–300ms remote / <10ms local Qdrant | Run Qdrant on the app server (VECTOR_PROVIDER=qdrant) |
| rerank_ms | Cross-encoder reranking of candidates | 120–400ms | Lower RAG_RERANK_FAN_OUT; accept ANN order |
| first_token_ms | Wait from turn start until the LLM's first streamed token | 200–900ms | Switch to a faster chat model (biggest single lever) |

The dashboard

Super-admin → /settings/system/hotpath-latency. Shows the last 100 visitor turns with one column per stage, p50 / p95 / max aggregate cards, and a plain-English verdict line naming the dominant cost with its remedy. Colour coding: green below 300ms, amber below 800ms, red at 800ms and above.
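To make the aggregate cards and colour bands concrete, here is a minimal sketch. The nearest-rank percentile method and the function names are assumptions for illustration; the dashboard may compute its percentiles differently.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile (an assumed method); pct is in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def colour(ms):
    """Map a stage latency in milliseconds to the documented colour bands."""
    if ms < 300:
        return "green"
    if ms < 800:
        return "amber"
    return "red"


def summarise(samples_ms):
    """Build the p50 / p95 / max card values for one stage column."""
    return {
        "p50": percentile(samples_ms, 50),
        "p95": percentile(samples_ms, 95),
        "max": max(samples_ms),
    }
```

With only 100 turns in the buffer, p95 rests on roughly five samples, so a single slow turn can move it noticeably.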

Data comes from a cache-backed ring buffer that the stream handler appends to after the reply finishes streaming, so visitors pay zero latency for the bookkeeping. Entries survive for 7 days or until 100 newer turns push them out, whichever comes first. No dedicated database table is involved, so it works identically on the Redis and database cache drivers.
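The retention rule can be sketched with an in-memory stand-in. The class below is hypothetical and is not the application's code: the real buffer lives in the cache driver, and the `TurnBuffer` name and its methods are assumptions.

```python
import time
from collections import deque


class TurnBuffer:
    """In-memory stand-in for the ring buffer: newest N turns, dropped after a TTL."""

    def __init__(self, max_turns=100, ttl_seconds=7 * 24 * 3600, clock=time.time):
        self._turns = deque(maxlen=max_turns)  # oldest entry falls off when full
        self._ttl = ttl_seconds
        self._clock = clock

    def append(self, timings):
        """Record one finished turn; call this after the reply has streamed."""
        self._turns.append((self._clock(), dict(timings)))

    def recent(self):
        """Return the timings still inside the TTL window, oldest first."""
        cutoff = self._clock() - self._ttl
        return [timings for stamp, timings in self._turns if stamp >= cutoff]
```

Appending after the stream completes is what keeps the bookkeeping off the visitor's critical path.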

Reading the table

  • First token high, retrieval low: the LLM is the bottleneck. Open Settings → System → AI providers and pick a faster model (the dropdown shows expected TTFT per model).
  • Retrieval high: check which sub-column grew. Search high → vector store round-trip; running Qdrant locally takes it under 10ms. Rerank high → reduce fan-out. Embed high → provider region.
  • "retrieve cache hit" rows: repeat questions skip embed/search/rerank entirely (30-minute cache). These rows show what your pipeline costs when retrieval is free.
  • Everything green but visitors still complain: the problem is between the visitor's browser and your server: proxy buffering (see the SSE heartbeat notes in Architecture → Hot path), TLS setup time, or plain geography.
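The reading rules above can be approximated in a few lines. This is a sketch under assumptions, not the dashboard's verdict logic: it treats the four columns as independent, uses the 300ms green threshold as the cut-off for "nothing to fix", and the `diagnose` function and remedy strings are illustrative.

```python
# Remedies paraphrased from the table and bullets in this page.
REMEDIES = {
    "embed_ms": "check the embedding provider region",
    "ann_ms": "run Qdrant locally to cut the vector store round-trip",
    "rerank_ms": "lower RAG_RERANK_FAN_OUT",
    "first_token_ms": "pick a faster chat model",
}
GREEN_BELOW_MS = 300


def diagnose(turn):
    """Name the dominant stage of one turn and its suggested remedy."""
    # Cache-hit rows have no retrieval stages, so only inspect what is present.
    stages = {name: turn[name] for name in REMEDIES if name in turn}
    if not stages:
        raise ValueError("turn has no stage timings")
    if all(ms < GREEN_BELOW_MS for ms in stages.values()):
        return "all stages green: look at proxy buffering, TLS setup, or geography"
    worst = max(stages, key=stages.get)
    return f"{worst} dominates ({stages[worst]}ms): {REMEDIES[worst]}"
```

For example, `diagnose({"embed_ms": 100, "ann_ms": 90, "rerank_ms": 150, "first_token_ms": 1200})` points at the chat model.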

The fast router

When an agent has tools enabled, every legacy turn pays a tool-check completion: a full non-streaming LLM call (1–3 of them, 5–15s each on Workers AI 70B) before the streamed answer starts. Measured in production, this put first-token p50 at ~14s even on the fastest model, while ~90% of visitor questions never needed a tool.

The fast router decides per turn whether the tool check is worth running, in <2ms and with no extra network calls:

  1. Keyword gate: per-tool phrase lists ("open a ticket", "where is my order", the human-handoff phrases).
  2. Embedding gate: the RAG query embedding (already computed for retrieval) is compared against per-tool exemplar centroids by cosine similarity. Centroids are embedded off the hot path (queued job / deploy command) and cached 30 days, keyed by embed model + dimension + exemplar text so they auto-invalidate on any change.
  3. No signal → knowledge route: RAG + stream only, tool check skipped.
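The three steps above can be sketched as a single function. This is an illustration under assumptions, not the router's actual code: the `tools` dictionary shape, the per-tool `threshold` field, and the `"embedding"` reason label are hypothetical, while `knowledge (no_signal)` and `tool_loop (keyword)` mirror the labels this page mentions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def route(message, query_embedding, tools):
    """Return (route, reason, tool_name) for one visitor turn.

    tools maps a tool name to {"phrases": [...], "centroid": [...], "threshold": float}.
    """
    text = message.lower()

    # 1. Keyword gate: any configured phrase appearing in the message wins.
    for name, tool in tools.items():
        if any(phrase in text for phrase in tool["phrases"]):
            return ("tool_loop", "keyword", name)

    # 2. Embedding gate: reuse the query embedding already computed for retrieval.
    best = None
    for name, tool in tools.items():
        score = cosine(query_embedding, tool["centroid"])
        if score >= tool["threshold"] and (best is None or score > best[1]):
            best = (name, score)
    if best is not None:
        return ("tool_loop", "embedding", best[0])

    # 3. No signal: answer from the knowledge base and skip the tool check.
    return ("knowledge", "no_signal", None)
```

Both gates are pure in-process work (substring checks and a handful of dot products), which is consistent with a sub-2ms decision that needs no network call.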

Enable it:

FAST_ROUTER_ENABLED=true        # .env (default off)
php artisan router:warm         # embed tool centroids (also run on deploy)
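The centroids that router:warm embeds are described above as cached under a key built from embed model, dimension, and exemplar text. One way such a self-invalidating key could be derived is sketched below; the hashing scheme, the `router:centroid:` prefix, and the function name are assumptions, not the application's format.

```python
import hashlib
import json


def centroid_cache_key(embed_model, dimension, exemplars):
    """Derive a key that changes whenever the model, dimension, or exemplars change."""
    # Sorting makes the key independent of exemplar order; JSON encoding avoids
    # ambiguity between, say, one exemplar "a b" and two exemplars "a" and "b".
    payload = json.dumps([embed_model, dimension, sorted(exemplars)])
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"router:centroid:{digest}"
```

Because any change to the inputs produces a different key, stale centroids are simply never read again and expire on their own after the TTL.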

Safety: explicit "talk to a human" messages are caught by the keyword shortcut before the router and always escalate; escalation also gets the lowest embedding threshold (FAST_ROUTER_ESCALATE_THRESHOLD) so it stays the easiest tool to trigger. A per-agent kill switch lives in vertical_overrides.fast_router (true/false beats the global flag). The dashboard's Notes column and perf:hotpath's route split line show every decision (knowledge (no_signal), tool_loop (keyword), …) so misroutes are auditable, never mysterious.
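The precedence between the global flag and the per-agent override can be sketched as follows. The function name and the treatment of a missing or non-boolean override (fall back to the global flag) are assumptions.

```python
def fast_router_enabled(global_flag, vertical_overrides):
    """Resolve the effective setting: an explicit per-agent true/false wins."""
    override = (vertical_overrides or {}).get("fast_router")
    if isinstance(override, bool):
        return override
    # No explicit override: inherit FAST_ROUTER_ENABLED.
    return global_flag
```

This lets you enable the router globally and still switch it off for a single agent whose tools misroute, or trial it on one agent while the global flag stays off.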

Adaptive rerank skip

The cross-encoder reranker is the slowest retrieval stage (500–1,200ms on Workers AI). Its real job is choosing which top-K of the fan-out candidates survive, but when the vector search already returned K candidates that all score above RAG_RERANK_SKIP_SCORE, that choice is already made and the round-trip buys nothing. With RAG_RERANK_SKIP=true those turns keep ANN order and skip the reranker entirely; weak or sparse candidate sets always rerank (precision matters most exactly when recall was shaky).

RAG_RERANK_SKIP=true          # default off
RAG_RERANK_SKIP_SCORE=0.68    # ANN cosine bar; 0.80 default on OpenAI embeddings
RAG_RERANK_FAN_OUT=3          # candidates = topK × fan_out; lower = faster rerank
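The skip condition can be sketched as a small predicate. It is an illustration under assumptions: the function name is hypothetical, and "above" is read here as strictly greater than the threshold.

```python
def should_skip_rerank(ann_scores, top_k, skip_score, enabled=True):
    """Return True when the ANN results alone already settle the top-K.

    ann_scores holds cosine scores for the fan-out candidates
    (up to top_k * fan_out of them).
    """
    if not enabled:
        return False
    if len(ann_scores) < top_k:
        # Sparse candidate set: always rerank.
        return False
    strongest = sorted(ann_scores, reverse=True)[:top_k]
    # Weak candidate set (any of the top-K at or below the bar): always rerank.
    return all(score > skip_score for score in strongest)
```

With top_k=3 and a fan-out of 3, up to 9 candidates come back; the reranker is skipped only if the best 3 all clear RAG_RERANK_SKIP_SCORE.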

Skipped turns show rerank skipped in the dashboard's Rerank column and Notes, and perf:hotpath prints a Rerank skipped (ANN decisive): N/M turns line, so you can see exactly how often the skip fires before trusting it.

CLI: perf:hotpath

Same measurement, scriptable. Run after any change you hope made things faster (model switch, Qdrant migration, cache driver) and compare runs:

php artisan perf:hotpath                # 10 turns, first published agent
php artisan perf:hotpath --turns=25     # bigger sample
php artisan perf:hotpath --agent=<id>   # specific agent

The command sends real widget turns (same JWT + SSE path the embedded widget uses), prints wall-clock TTFB per turn, the same per-stage p50/p95 table as the dashboard, and the same verdict line. A unique suffix per message defeats the 30-minute retrieve cache so every turn exercises the full pipeline. When no provider keys are configured it warns that FakeOpenAi is bound โ€” those runs measure pipeline overhead only.
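Why a unique suffix defeats the retrieve cache can be shown with a toy key function. This assumes the cache is keyed on the normalised question text, which this page does not confirm; both function names are hypothetical.

```python
import hashlib
import uuid


def retrieve_cache_key(agent_id, question):
    """Toy cache key: same agent plus same normalised question gives the same key."""
    normalised = " ".join(question.lower().split())
    return hashlib.sha256(f"{agent_id}:{normalised}".encode("utf-8")).hexdigest()


def benchmark_message(base):
    """Append a short random suffix so each benchmark turn misses the cache."""
    return f"{base} [{uuid.uuid4().hex[:8]}]"
```

Two visitors asking the same question share a key and hit the cache, while every benchmark message hashes to a fresh key and runs embed, search, and rerank in full.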

Raw logs

Every turn also writes a structured rag.turn line to the Laravel log with the same stage breakdown plus retrieve_timings. For historical analysis beyond the 100-turn buffer:

grep 'rag.turn' storage/logs/laravel.log | tail -50
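For more than eyeballing, the log lines can be aggregated with a short script. This sketch assumes each rag.turn line carries its stage breakdown as a JSON object after the message (Laravel's default line format) and that the keys match the stage names used in this page; check a real line from your log before relying on it.

```python
import json
from statistics import median

STAGES = ("embed_ms", "ann_ms", "rerank_ms", "first_token_ms")


def parse_rag_turn(line):
    """Extract the JSON context from one rag.turn log line, or None if absent."""
    marker = line.find("rag.turn")
    if marker == -1:
        return None
    start = line.find("{", marker)
    if start == -1:
        return None
    try:
        # raw_decode tolerates trailing text after the JSON object.
        payload, _ = json.JSONDecoder().raw_decode(line[start:])
    except json.JSONDecodeError:
        return None
    return payload


def stage_medians(lines, stages=STAGES):
    """Median per stage across all parseable rag.turn lines."""
    turns = [t for t in map(parse_rag_turn, lines) if t is not None]
    result = {}
    for stage in stages:
        values = [t[stage] for t in turns if isinstance(t.get(stage), (int, float))]
        if values:
            result[stage] = median(values)
    return result
```

Feeding it the lines of storage/logs/laravel.log (for example via `open(path)`) gives a per-stage median over as much history as the log retains.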