Marcus Kane
Lead Engineer, Engium · Oct 08, 2024
Off-the-shelf LLMs are remarkably capable, yet they hallucinate on your specific pricing, get product names wrong, and confidently answer questions about policies that changed six months ago. The solution is not a better model — it is better context.
Why RAG Works
Retrieval-Augmented Generation grounds the LLM in verified facts at inference time. Instead of relying on statistical patterns learned during training, the model reads your actual documents and answers based on what it finds.
The architectural insight is that retrieval is a search problem, not a generation problem. Embedding-based semantic search finds relevant chunks even when users phrase questions differently from your documentation.
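To make the search framing concrete, here is a minimal sketch of ranking chunks by cosine similarity. The three-dimensional vectors are toy stand-ins for real embeddings, and `cosine_similarity` and `top_k` are illustrative names rather than part of any Engium API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunks, k=2):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy vectors: a real system would get these from an embedding model.
chunks = [
    ("Refunds are issued within 14 days.", [0.9, 0.1, 0.0]),
    ("Our office is closed on public holidays.", [0.0, 0.2, 0.9]),
    ("How to return a purchase for your money back.", [0.8, 0.3, 0.1]),
]
query = [0.85, 0.2, 0.05]  # hypothetical embedding of "Can I get my money back?"

results = top_k(query, chunks, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

Both refund-related chunks rank above the unrelated one even though neither shares the exact wording of the question. That is the behavior keyword search struggles to reproduce.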
Ingestion Pipeline
Engium's ingestion pipeline accepts PDFs, Markdown, DOCX, and web URLs. Each document is parsed, split into overlapping chunks, embedded with gemini-embedding-001 (768 dimensions), and stored in pgvector.
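The embed-and-store step might look roughly like the sketch below. It is not Engium's actual code: the table name `doc_chunks`, its columns, and the `fake_embed` placeholder are assumptions for illustration, and a real pipeline would call the embedding model instead. The SQL uses pgvector's `vector(768)` column type, and the `%s` placeholders assume a psycopg-style driver.

```python
import hashlib

EMBEDDING_DIM = 768

# Hypothetical schema: table and column names are illustrative only.
CREATE_TABLE_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS doc_chunks (
    id bigserial PRIMARY KEY,
    source text NOT NULL,
    content text NOT NULL,
    embedding vector(768)
);
"""

INSERT_SQL = (
    "INSERT INTO doc_chunks (source, content, embedding) "
    "VALUES (%s, %s, %s::vector)"
)


def fake_embed(text, dim=EMBEDDING_DIM):
    """Placeholder embedder: deterministic numbers derived from a hash.

    It carries no semantic meaning; swap in a real embedding call.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [digest[i % len(digest)] / 255.0 for i in range(dim)]


def to_pgvector_literal(vector):
    """Format a list of floats as pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(f"{x:.6f}" for x in vector) + "]"


def build_rows(source, chunks, embed=fake_embed):
    """Turn text chunks into parameter tuples matching INSERT_SQL."""
    return [(source, chunk, to_pgvector_literal(embed(chunk))) for chunk in chunks]


rows = build_rows("faq.md", ["First example chunk.", "Second example chunk."])
print(len(rows), "rows ready for insertion")
```

Each tuple in `rows` can be passed to a database cursor together with `INSERT_SQL`. Parsing and chunking are assumed to have happened upstream.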
Chunking Strategy
Chunk size is the most important hyperparameter in the ingestion pipeline. Chunks that are too large dilute the embedding signal; chunks that are too small lose context. For FAQ-style content, chunks of 256–512 tokens with a 64-token overlap work reliably.
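A minimal sketch of overlapping chunking follows. It operates on a pre-tokenized list. Splitting on whitespace, as in the usage line, is only an approximation of a real tokenizer, and `chunk_tokens` is a hypothetical helper name.

```python
def chunk_tokens(tokens, chunk_size=256, overlap=64):
    """Split a token list into windows that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the final window already reaches the end
    return chunks


# Whitespace splitting stands in for a real tokenizer here.
tokens = ("word " * 600).split()
chunks = chunk_tokens(tokens, chunk_size=256, overlap=64)
print([len(c) for c in chunks])
```

With 600 tokens, a 256-token window, and a 64-token overlap, each window advances 192 tokens. The last 64 tokens of one chunk reappear at the start of the next, which keeps sentences that straddle a boundary retrievable from either side.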
Hierarchical chunking — where document summaries are stored alongside paragraph-level chunks — dramatically improves recall on complex multi-part questions.
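One way to represent this two-level structure is sketched below. The record layout, the ID scheme, and `naive_summary` (which simply truncates the text) are assumptions for illustration. A production pipeline would likely generate summaries with an LLM.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ChunkRecord:
    chunk_id: str
    doc_id: str
    level: str  # "summary" or "paragraph"
    text: str
    parent_id: Optional[str] = None


def naive_summary(text, max_words=30):
    """Placeholder summarizer: keeps the first max_words words."""
    return " ".join(text.split()[:max_words])


def build_hierarchy(doc_id, text, summarize=naive_summary) -> List[ChunkRecord]:
    """Emit one summary record plus one record per paragraph."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    summary_id = f"{doc_id}:summary"
    records = [ChunkRecord(summary_id, doc_id, "summary", summarize(text))]
    for i, para in enumerate(paragraphs):
        records.append(
            ChunkRecord(f"{doc_id}:p{i}", doc_id, "paragraph", para, parent_id=summary_id)
        )
    return records


doc = "First paragraph about plans.\n\nSecond paragraph about billing."
records = build_hierarchy("doc-1", doc)
for r in records:
    print(r.level, r.chunk_id, r.parent_id)
```

Both levels are embedded and indexed. A broad question can match the summary, while a narrow one matches a paragraph, and `parent_id` lets the retriever pull in the surrounding document context when needed.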
Embedding Strategy
Embedding model choice affects both retrieval quality and cost. Engium defaults to gemini-embedding-001 at 768 dimensions rather than OpenAI's text-embedding-3-small (1536 dimensions). Halving the vector size reduces storage costs and speeds up approximate nearest neighbor (ANN) queries without measurable recall degradation on business content.
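The storage difference is easy to estimate. The sketch below counts only the raw float32 payload (4 bytes per dimension) and ignores per-row and index overhead. The one-million-vector corpus is a hypothetical size chosen for illustration.

```python
BYTES_PER_FLOAT32 = 4


def raw_vector_bytes(dimensions, vector_count):
    """Raw float32 payload only; excludes row headers and index overhead."""
    return dimensions * BYTES_PER_FLOAT32 * vector_count


vector_count = 1_000_000  # hypothetical corpus size
small = raw_vector_bytes(768, vector_count)
large = raw_vector_bytes(1536, vector_count)

print(f"768 dims:  {small / 1e9:.3f} GB")
print(f"1536 dims: {large / 1e9:.3f} GB")
print(f"ratio:     {large / small:.1f}x")
```

Distance computations also scale with dimensionality, so the same factor of two shows up, roughly, in the per-comparison cost of each query.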
"The jump from keyword search to semantic search is not incremental — it's categorical. Users phrase things in ten different ways, and only embedding-based retrieval handles that reliably."
Retrieval Tuning
HNSW indexes (used for FAQ items in Engium) deliver sub-millisecond lookups on datasets up to several million vectors. Tune the similarity threshold: 0.78 is the default, but domain-specific content often benefits from raising it to 0.82 to reduce false-positive retrievals.
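The effect of the threshold can be sketched as a simple filter over scored results. The scores below are made up for illustration, and the helper names are hypothetical. Because pgvector's `<=>` operator returns cosine distance, the sketch converts distance to similarity before comparing.

```python
DEFAULT_THRESHOLD = 0.78


def distance_to_similarity(cosine_distance):
    """pgvector's <=> yields cosine distance; similarity is 1 - distance."""
    return 1.0 - cosine_distance


def filter_by_threshold(scored_chunks, threshold=DEFAULT_THRESHOLD):
    """Keep only (similarity, text) pairs at or above the threshold."""
    return [(score, text) for score, text in scored_chunks if score >= threshold]


# Illustrative (cosine distance, text) pairs as a query might return them.
raw_results = [
    (0.09, "Exact policy match"),
    (0.20, "Related but broader section"),
    (0.25, "Loosely related paragraph"),
]
scored = [(distance_to_similarity(d), text) for d, text in raw_results]

default_hits = filter_by_threshold(scored)        # threshold 0.78
strict_hits = filter_by_threshold(scored, 0.82)   # stricter threshold

print(len(default_hits), "hits at 0.78;", len(strict_hits), "hits at 0.82")
```

Raising the threshold trades coverage for precision. If the stricter setting leaves no hits, the application needs a fallback such as answering "I don't know" rather than passing weak context to the model.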
- 01. Use HNSW indexes for FAQ retrieval (fast, high recall)
- 02. Use IVFFlat for large document sets (cheaper to build, good for batch)
- 03. Re-embed stale content monthly as your knowledge base evolves
- 04. Monitor mean similarity scores: a drop signals knowledge base drift
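The monitoring step could be sketched as a comparison between a baseline window and a recent window of top-hit similarity scores. The `max_drop` value of 0.05 and the sample scores are arbitrary illustrations, not recommended settings, and `detect_drift` is a hypothetical helper.

```python
from statistics import mean


def detect_drift(baseline_scores, recent_scores, max_drop=0.05):
    """Flag drift when the mean top-hit similarity falls by more than max_drop.

    Returns (drifted, baseline_mean, recent_mean).
    """
    if not baseline_scores or not recent_scores:
        raise ValueError("both score windows must be non-empty")
    baseline_mean = mean(baseline_scores)
    recent_mean = mean(recent_scores)
    drifted = (baseline_mean - recent_mean) > max_drop
    return drifted, baseline_mean, recent_mean


# Illustrative top-hit similarity scores logged per query.
baseline = [0.86, 0.84, 0.85]
this_week = [0.78, 0.76, 0.77]

drifted, base_mean, recent_mean = detect_drift(baseline, this_week)
print(f"baseline={base_mean:.2f} recent={recent_mean:.2f} drifted={drifted}")
```

A sustained drop usually means users are asking about things the knowledge base no longer covers well, which is a cue to add or re-embed content.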

