Cited RAG search across 2M legal documents
Search that answers with citations across two million documents — reranked, redacted, and audit-trailed for a legal environment.
Institutional knowledge, unfindable
Twenty years of precedent lived across document management silos. Finding relevant prior work meant asking senior people or re-doing research the firm had already paid for.
The cost of leaving it alone
Associates burned billable hours reconstructing work that existed. Worse, inconsistent precedent use created quality risk the partners could feel but not measure.
Retrieval with receipts
A retrieval pipeline where every answer carries citations to source documents, redaction rules run before display, and every query is audit-logged.
- 2.1M documents chunked into 12.4M segments, embedded with voyage-3-large
- pgvector + HNSW for retrieval, cross-encoder reranking on top
- Redaction layer enforces matter walls, stripping privileged and client-identifying content before results are displayed
- Every query and result set is audit-trailed for compliance review
Stack: Pinecone · Claude · pgvector · Voyage embeddings
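A minimal sketch of that flow, in Python with the standard library only. It is an illustration, not the production code. Every name here (Chunk, AuditLog, answer_context, overlap_score) is hypothetical, the word-overlap scorer stands in for the cross-encoder reranker, and matter walls are simplified to withholding whole chunks rather than redacting content inside them.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable


@dataclass(frozen=True)
class Chunk:
    doc_id: str      # source document the citation points back to
    matter_id: str   # matter the document belongs to (used for walls)
    text: str


@dataclass
class AuditLog:
    entries: list = field(default_factory=list)

    def record(self, user: str, query: str, shown: list, withheld: list) -> None:
        # One entry per query: who asked, what was shown, what was held back.
        self.entries.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "query": query,
            "shown_doc_ids": shown,
            "withheld_doc_ids": withheld,
        })


def answer_context(
    query: str,
    user: str,
    allowed_matters: set,
    candidates: list,
    rerank_score: Callable[[str, str], float],
    log: AuditLog,
    top_k: int = 3,
) -> list:
    # 1. Rerank the retrieved candidates (assumed to come from vector search).
    ranked = sorted(candidates, key=lambda c: rerank_score(query, c.text), reverse=True)
    # 2. Apply matter walls before anything is displayed.
    visible = [c for c in ranked if c.matter_id in allowed_matters]
    withheld = [c.doc_id for c in ranked if c.matter_id not in allowed_matters]
    shown = visible[:top_k]
    # 3. Audit-log the query and its result set.
    log.record(user, query, [c.doc_id for c in shown], withheld)
    # 4. Return passages with their citations attached.
    return [{"citation": c.doc_id, "text": c.text} for c in shown]


def overlap_score(query: str, text: str) -> float:
    # Toy stand-in for a cross-encoder: count shared lowercase words.
    return float(len(set(query.lower().split()) & set(text.lower().split())))


candidates = [
    Chunk("doc-1", "M-100", "indemnity clause for supplier agreements"),
    Chunk("doc-2", "M-200", "indemnity clause negotiated for a walled client"),
    Chunk("doc-3", "M-100", "lease renewal notice periods"),
]
log = AuditLog()
results = answer_context(
    "indemnity clause", "associate-7", {"M-100"}, candidates, overlap_score, log
)
for r in results:
    print(r["citation"], "|", r["text"])
```

The ordering is the point of the sketch: the wall check runs before anything is returned for display, and the audit entry records withheld documents as well as shown ones, so a compliance reviewer can see what a query touched.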
How it was built
- Weeks 1–3: corpus ingestion, chunking strategy, and embedding pipeline
- Weeks 4–5: retrieval quality tuning against a partner-built eval set of 200 questions
- Weeks 6–7: redaction rules, matter walls, and audit logging with compliance
- Weeks 8–9: rollout to two practice groups, then firm-wide
What the numbers say
What happened next
Usage settled at ~900 queries a day. The eval set grew into a living quality benchmark: every index update replays it, so retrieval quality is a number, not an opinion.
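One way to turn an eval replay into a single number, sketched in Python. The write-up does not say which metric the benchmark uses, so hit rate at k is an assumption here, as are the eval-set shape (question plus expected_doc_ids), the function names, and the tolerance-based gate.

```python
from typing import Callable


def hit_rate_at_k(eval_set: list, retrieve: Callable[[str], list], k: int = 5) -> float:
    """Fraction of questions whose top-k results include an expected document."""
    if not eval_set:
        raise ValueError("eval set is empty")
    hits = 0
    for item in eval_set:
        top_k = retrieve(item["question"])[:k]
        if set(top_k) & set(item["expected_doc_ids"]):
            hits += 1
    return hits / len(eval_set)


def passes_gate(score: float, baseline: float, tolerance: float = 0.02) -> bool:
    # An index update passes if it does not regress beyond the tolerance.
    return score >= baseline - tolerance


# Toy data: a dict lookup stands in for the real retrieval call.
toy_index = {
    "q1": ["doc-1", "doc-9"],
    "q2": ["doc-4"],
    "q3": ["doc-7", "doc-2"],
    "q4": ["doc-8"],
}
eval_set = [
    {"question": "q1", "expected_doc_ids": ["doc-1"]},
    {"question": "q2", "expected_doc_ids": ["doc-4", "doc-5"]},
    {"question": "q3", "expected_doc_ids": ["doc-2"]},
    {"question": "q4", "expected_doc_ids": ["doc-3"]},
]

score = hit_rate_at_k(eval_set, lambda q: toy_index.get(q, []), k=5)
print(f"hit rate@5 = {score:.2f}")
print("passes gate:", passes_gate(score, baseline=0.76))
```

Running the same function after every index update and comparing against the last accepted score is what makes a regression visible before users notice it.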
This system is an example of AI Agents & Internal Assistants work.