docs: RAG.md — combined story for the two RAG surfaces

The browser RAG (@crema/lexical-rag-ui) and the server RAG
(arcadia-search) coexist in this app, and the agent picks between
them via tool descriptions. The two libs each have their own README,
but neither covers the picking decision or how they relate. This doc
fills that gap.

- At-a-glance table: lib, engine, runtime location, corpus size,
  update cadence, auth, agent tool, what it's best for.
- The system-prompt snippet that drives tool picking, with notes on
  observed DeepSeek V3 behavior.
- Per-surface deep dive (build path, tools, storage, limits, why it
  exists).
- Why both run by default (always-on fallback, A/B regression).
- Decision checklist for adding new corpora.
- "What lives where" cheat sheet pointing at the right file in the
  right repo for common tasks.

Cross-references arcadia-search's README.md, MULTI_TENANT.md, and
ARCADIA_INTEGRATION.md so a reader landing here can navigate to the
right rabbit hole.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
jules
2026-05-04 14:15:09 +10:00
parent 628691d2df
commit 444516e900

173
docs/RAG.md Normal file
View File

@@ -0,0 +1,173 @@
# Retrieval-Augmented Generation in arcadia-admin
This app exposes **two** lexical RAG surfaces to the assistant. They
share a contract (`search` + `read`) but live at different layers and
serve different content. The agent picks between them based on tool
descriptions; the operator chooses which to deploy based on corpus
shape.
---
## At a glance
| | Browser RAG | Server RAG |
|---|---|---|
| Lib / service | `@crema/lexical-rag-ui` | `arcadia-search` (Rust) |
| Engine | MiniSearch (BM25, JS) | Tantivy (BM25, Rust) |
| Where it runs | In the user's browser | Sibling of arcadia-app |
| Index storage | Static JSON, fetched once | mmap'd disk, ~3080MB resident |
| Practical corpus size | ~510MB / ~50100k chunks | GB-scale, no hard cap |
| Update cadence | Static — rebuilt at app build time | Live — cron, webhook, or admin trigger |
| Auth | None (bundled with the app) | JWT (via arcadia's Guardian) |
| Tool the agent calls | `search_docs(query, limit)` | `search_kb(query, corpus, limit?, tags?)` + `read_chunk(chunk_id, corpus)` |
| Source content lives in | `arcadia-admin/public/docs-index.json` | `arcadia-search`'s data dir, ingested from disk or arcadia API |
| What's it best for | Static reference docs that ship with the app | Tenant-uploaded files, audit log search, anything that grows |
---
## When the agent picks each
The system prompt in `app/routes/assistant.tsx::buildAdminPreface` tells
the model:
> Two retrieval surfaces exist for documentation/knowledge:
> `search_docs` (browser-side, BM25 over the bundled arcadia docs —
> fast, always available, small corpus) and `search_kb` (server-side,
> BM25 over arcadia-search — same docs as `corpus=docs` for parity,
> plus larger and additional corpora as the operator adds them). For
> questions about the bundled arcadia docs either is fine; prefer
> `search_kb` when you want richer hits or when the user is asking
> about content that wouldn't be in the bundled docs (uploaded files,
> tenant-specific knowledge). When `search_kb` returns a `chunk_id`
> you want to expand, call `read_chunk(chunk_id, corpus)`.
In practice DeepSeek + V3 picks `search_kb` for anything that mentions
"the kb" or sounds dynamic, and `search_docs` for quick lookups against
the bundled docs. Neither pick is wrong for content that exists in
both.
---
## Browser RAG (`@crema/lexical-rag-ui`)
**What it is.** A small React + MiniSearch wrapper. The lib provides
`RAGProvider`, `useRAG`, and a headless `createRAGClient(indexUrl)`.
The index is a single JSON file built offline by
`scripts/build-docs-index.mjs` and shipped in the app's `public/`.
**How it's wired here.**
- Build script: `arcadia-admin/scripts/build-docs-index.mjs` reads
markdown from `../reference/arcadia-app/`, chunks at H1H3,
produces `public/docs-index.json`. Runs on `npm run build:docs`
(and as the `prebuild` step before `npm run build`).
- Tool wrapper: `app/lib/admin-tools.ts` constructs a singleton
`createRAGClient("/docs-index.json")` and exposes it as the
`search_docs` tool. The tool returns hits with the legacy
`category` field collapsed back from `tags[0]` so the agent's
prior expectations stay stable.
- Storage: just the static JSON. No state, no auth, no indexer
process.
**Limits.** Practical ceiling is ~510MB index. Past that, first-load
parse and browser memory get painful (200MB+ heap on a 50MB index;
mobile breaks). Updates require a build step + redeploy.
**Why it exists.** Static reference content that ships with the app —
arcadia's own docs, in this case. Always available even if the search
service is down. Zero infrastructure.
For the lib itself see `lib-lexical-rag-ui/README.md`.
---
## Server RAG (`arcadia-search`)
**What it is.** A standalone Rust HTTP service (Tantivy + axum). Single
static binary, ~3080MB resident. Per-tenant per-corpus indexes on
disk. JWT auth, HMAC webhook intake, atomic rebuild swap, systemd
timer cron.
**How it's wired here.**
- Tools: `app/lib/admin-tools.ts` exposes `search_kb` and
`read_chunk`. The fetch URL is `KB_BASE_URL` (default
`http://127.0.0.1:7800`, override via `window.__ARCADIA_SEARCH_URL`
or `VITE_ARCADIA_SEARCH_URL`). The bearer token is the user's
arcadia JWT from `sessionStorage["arcadia_access_token"]`, with a
`"dev"` fallback when no login.
- Reindex button: `app/routes/ai.tsx::reindexKB` calls
`POST /index/:corpus/build` and toasts the result. Lives in the
AI page's empty state next to the block-preview button.
- System prompt: see the snippet in `assistant.tsx::buildAdminPreface`
above.
**Storage.** `<INDEX_DIR>/<tenant>/<corpus>/current/` per index;
`previous-<stamp>/` for the last few rebuilds (rollback). Sources can
be on-disk markdown, or pulled from arcadia's `/api/v1/digital_objects`
API (see `arcadia-search/MULTI_TENANT.md` and `ARCADIA_INTEGRATION.md`).
**Update cadence.** Three triggers, layered so each compensates for
the others' failure modes:
- **Cron** (systemd timer, hourly default) — always-on safety net.
- **Admin button** — one-click rebuild from the AI page.
- **Webhook** — arcadia POSTs `/events/changed` on file create/delete;
search debounces (2-min default) and rebuilds.
**Why it exists.** Anything that doesn't fit the browser ceiling:
tenant-uploaded files, audit-log-ish content, multi-tenant knowledge
bases, anything that grows over time.
For the service see `arcadia-search/README.md`.
For multi-tenant config see `arcadia-search/MULTI_TENANT.md`.
For the upstream arcadia integration story (file content fetch,
text extraction, webhook signature, service tokens) see
`arcadia-search/ARCADIA_INTEGRATION.md`.
---
## How they coexist
The default deploy runs **both**:
- `search_docs` indexes the same arcadia-app docs the parity corpus
on `arcadia-search` indexes. Same content, two engines.
- This is intentional — it means the assistant always has *something*
to search, even if `arcadia-search` is down or unreachable. The
failure mode is "no `search_kb`, but `search_docs` still works."
- It also gives a permanent A/B regression test: query both, compare
hits, catch relevance regressions in either engine.
---
## Picking ONE for a new corpus
Use this checklist when adding new content:
| Question | Answer → use |
|---|---|
| Is the corpus < 5MB and basically static? | Browser |
| Does it need to update without a redeploy? | Server |
| Is it per-tenant content (uploaded files, tenant-specific KB)? | Server |
| Are you OK shipping it in the JS bundle? | Browser |
| Does it need agentic `read_chunk` follow-up? | Server (browser doesn't expose `read` over the tool surface) |
| Does it need to work offline / with no backend? | Browser |
| Is it growing > 10MB? | Server |
Most "knowledge base" content lives in the server side. The browser
side is reserved for the always-bundled reference material that ships
with the app.
---
## What lives where (cheat sheet)
| Want to | Look at |
|---|---|
| Add a doc to the bundled browser RAG | `arcadia-admin/scripts/build-docs-index.mjs` (extend `SOURCES`) |
| Add a tool to the agent | `arcadia-admin/app/lib/admin-tools.ts` |
| Change the LLM's tool-picking guidance | `arcadia-admin/app/routes/assistant.tsx::buildAdminPreface` |
| Add a corpus to arcadia-search | `arcadia-search/deploy/<tenant>/<corpus>.config.json` + new systemd timer |
| Add a tenant to arcadia-search | `arcadia-search/MULTI_TENANT.md` |
| Wire an arcadia file → search ingest | `arcadia-search/ARCADIA_INTEGRATION.md` (needs upstream changes first) |
| Reindex the server-side corpus right now | "reindex kb (docs)" button on `/ai` empty state, or `POST /index/docs/build` |