Files
arcadia-admin/docs/RAG.md
jules 938143f3f5 refactor: rename service references arcadia-app → arcadia-core
The Phoenix auth/identity/tenancy backend repo is being renamed
arcadia-app → arcadia-core (its primary OTP app is already arcadia_core).
Updates prose, doc paths, and git.sky-ai.com repo URLs. Deliberately
leaves the Rust crate arcadia-app-client and host arcadia-app.internal
(handled separately), and the kept namespace (issuer/release "arcadia").

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-11 13:40:25 +10:00

7.5 KiB
Raw Permalink Blame History

Retrieval-Augmented Generation in arcadia-admin

This app exposes two lexical RAG surfaces to the assistant. They share a contract (search + read) but live at different layers and serve different content. The agent picks between them based on tool descriptions; the operator chooses which to deploy based on corpus shape.


At a glance

Browser RAG Server RAG
Lib / service @crema/lexical-rag-ui arcadia-search (Rust)
Engine MiniSearch (BM25, JS) Tantivy (BM25, Rust)
Where it runs In the user's browser Sibling of arcadia-core
Index storage Static JSON, fetched once mmap'd disk, ~3080MB resident
Practical corpus size ~510MB / ~50100k chunks GB-scale, no hard cap
Update cadence Static — rebuilt at app build time Live — cron, webhook, or admin trigger
Auth None (bundled with the app) JWT (via arcadia's Guardian)
Tool the agent calls search_docs(query, limit) search_kb(query, corpus, limit?, tags?) + read_chunk(chunk_id, corpus)
Source content lives in arcadia-admin/public/docs-index.json arcadia-search's data dir, ingested from disk or arcadia API
What's it best for Static reference docs that ship with the app Tenant-uploaded files, audit log search, anything that grows

When the agent picks each

The system prompt in app/routes/assistant.tsx::buildAdminPreface tells the model:

Two retrieval surfaces exist for documentation/knowledge: search_docs (browser-side, BM25 over the bundled arcadia docs — fast, always available, small corpus) and search_kb (server-side, BM25 over arcadia-search — same docs as corpus=docs for parity, plus larger and additional corpora as the operator adds them). For questions about the bundled arcadia docs either is fine; prefer search_kb when you want richer hits or when the user is asking about content that wouldn't be in the bundled docs (uploaded files, tenant-specific knowledge). When search_kb returns a chunk_id you want to expand, call read_chunk(chunk_id, corpus).

In practice DeepSeek + V3 picks search_kb for anything that mentions "the kb" or sounds dynamic, and search_docs for quick lookups against the bundled docs. Neither pick is wrong for content that exists in both.


Browser RAG (@crema/lexical-rag-ui)

What it is. A small React + MiniSearch wrapper. The lib provides RAGProvider, useRAG, and a headless createRAGClient(indexUrl). The index is a single JSON file built offline by scripts/build-docs-index.mjs and shipped in the app's public/.

How it's wired here.

  • Build script: arcadia-admin/scripts/build-docs-index.mjs reads markdown from ../reference/arcadia-core/, chunks at H1H3, produces public/docs-index.json. Runs on npm run build:docs (and as the prebuild step before npm run build).
  • Tool wrapper: app/lib/admin-tools.ts constructs a singleton createRAGClient("/docs-index.json") and exposes it as the search_docs tool. The tool returns hits with the legacy category field collapsed back from tags[0] so the agent's prior expectations stay stable.
  • Storage: just the static JSON. No state, no auth, no indexer process.

Limits. Practical ceiling is ~510MB index. Past that, first-load parse and browser memory get painful (200MB+ heap on a 50MB index; mobile breaks). Updates require a build step + redeploy.

Why it exists. Static reference content that ships with the app — arcadia's own docs, in this case. Always available even if the search service is down. Zero infrastructure.

For the lib itself see lib-lexical-rag-ui/README.md.


What it is. A standalone Rust HTTP service (Tantivy + axum). Single static binary, ~3080MB resident. Per-tenant per-corpus indexes on disk. JWT auth, HMAC webhook intake, atomic rebuild swap, systemd timer cron.

How it's wired here.

  • Tools: app/lib/admin-tools.ts exposes search_kb and read_chunk. The fetch URL is KB_BASE_URL (default http://127.0.0.1:7800, override via window.__ARCADIA_SEARCH_URL or VITE_ARCADIA_SEARCH_URL). The bearer token is the user's arcadia JWT from sessionStorage["arcadia_access_token"], with a "dev" fallback when no login.
  • Reindex button: app/routes/ai.tsx::reindexKB calls POST /index/:corpus/build and toasts the result. Lives in the AI page's empty state next to the block-preview button.
  • System prompt: see the snippet in assistant.tsx::buildAdminPreface above.

Storage. <INDEX_DIR>/<tenant>/<corpus>/current/ per index; previous-<stamp>/ for the last few rebuilds (rollback). Sources can be on-disk markdown, or pulled from arcadia's /api/v1/digital_objects API (see arcadia-search/MULTI_TENANT.md and ARCADIA_INTEGRATION.md).

Update cadence. Three triggers, layered so each compensates for the others' failure modes:

  • Cron (systemd timer, hourly default) — always-on safety net.
  • Admin button — one-click rebuild from the AI page.
  • Webhook — arcadia POSTs /events/changed on file create/delete; search debounces (2-min default) and rebuilds.

Why it exists. Anything that doesn't fit the browser ceiling: tenant-uploaded files, audit-log-ish content, multi-tenant knowledge bases, anything that grows over time.

For the service see arcadia-search/README.md. For multi-tenant config see arcadia-search/MULTI_TENANT.md. For the upstream arcadia integration story (file content fetch, text extraction, webhook signature, service tokens) see arcadia-search/ARCADIA_INTEGRATION.md.


How they coexist

The default deploy runs both:

  • search_docs indexes the same arcadia-core docs the parity corpus on arcadia-search indexes. Same content, two engines.
  • This is intentional — it means the assistant always has something to search, even if arcadia-search is down or unreachable. The failure mode is "no search_kb, but search_docs still works."
  • It also gives a permanent A/B regression test: query both, compare hits, catch relevance regressions in either engine.

Picking ONE for a new corpus

Use this checklist when adding new content:

Question Answer → use
Is the corpus < 5MB and basically static? Browser
Does it need to update without a redeploy? Server
Is it per-tenant content (uploaded files, tenant-specific KB)? Server
Are you OK shipping it in the JS bundle? Browser
Does it need agentic read_chunk follow-up? Server (browser doesn't expose read over the tool surface)
Does it need to work offline / with no backend? Browser
Is it growing > 10MB? Server

Most "knowledge base" content lives in the server side. The browser side is reserved for the always-bundled reference material that ships with the app.


What lives where (cheat sheet)

Want to Look at
Add a doc to the bundled browser RAG arcadia-admin/scripts/build-docs-index.mjs (extend SOURCES)
Add a tool to the agent arcadia-admin/app/lib/admin-tools.ts
Change the LLM's tool-picking guidance arcadia-admin/app/routes/assistant.tsx::buildAdminPreface
Add a corpus to arcadia-search arcadia-search/deploy/<tenant>/<corpus>.config.json + new systemd timer
Add a tenant to arcadia-search arcadia-search/MULTI_TENANT.md
Wire an arcadia file → search ingest arcadia-search/ARCADIA_INTEGRATION.md (needs upstream changes first)
Reindex the server-side corpus right now "reindex kb (docs)" button on /ai empty state, or POST /index/docs/build