
Marcus Kane

Lead Engineer, Engium · Oct 08, 2024

10 min read

Off-the-shelf LLMs are remarkably capable, yet they hallucinate your pricing, get product names wrong, and confidently answer questions using policies that changed six months ago. The fix is not a better model; it is better context.

Why RAG Works

Retrieval-Augmented Generation (RAG) grounds the LLM in verified facts at inference time. Instead of relying only on statistical patterns learned during training, the model reads passages retrieved from your actual documents and answers based on what it finds.

The architectural insight is that retrieval is a search problem, not a generation problem. Embedding-based semantic search finds relevant chunks even when users phrase questions differently from your documentation.
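To make the "retrieval is search" point concrete, here is a minimal sketch of similarity search over embedded chunks. The three-dimensional vectors are hand-written stand-ins for real embeddings, and the helper names (cosine_similarity, top_k) are illustrative assumptions, not part of Engium's code.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, indexed_chunks, k=2):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text)
              for text, vec in indexed_chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy vectors standing in for real embeddings (illustrative only).
INDEXED_CHUNKS = [
    ("Refunds are issued within 14 days.", [0.9, 0.1, 0.0]),
    ("Our office is closed on Sundays.", [0.0, 0.2, 0.9]),
    ("Premium plans are billed monthly.", [0.1, 0.9, 0.1]),
]

# Pretend this is the embedding of "how do I get my money back?"
query = [0.8, 0.2, 0.1]
results = top_k(query, INDEXED_CHUNKS, k=1)
print(results[0][1])
```

The query shares no keywords with the refund chunk, yet it ranks first because the vectors point in nearly the same direction. A production system would delegate this ranking to the vector index rather than scanning every chunk.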

Ingestion Pipeline

Engium's ingestion pipeline accepts PDFs, Markdown, DOCX, and web URLs. Each document is parsed, split into overlapping chunks, embedded with gemini-embedding-001 (768 dimensions), and stored in pgvector.

ingestion-pipeline.py

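# Engium knowledge ingestion
from sqlalchemy import insert  # assumes the SQLAlchemy insert() construct

async def ingest(db, splitter, gemini, knowledge_chunks, document, tenant_id):
    # Split into 512-token chunks; neighbors share 64 tokens of overlap.
    chunks = splitter.split(document, chunk_size=512, overlap=64)
    # One embedding per chunk, returned in the same order as the input.
    embeddings = gemini.embed(chunks, model="gemini-embedding-001")

    # Store each chunk with its vector, scoped to the owning tenant.
    await db.execute(
        insert(knowledge_chunks).values([
            {"content": c, "embedding": e, "tenant_id": tenant_id}
            for c, e in zip(chunks, embeddings)
        ])
    )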

Chunking Strategy

Chunk size is the most important hyperparameter in the pipeline. Chunks that are too large dilute the embedding signal; chunks that are too small lose context. For FAQ-style content, chunks of 256–512 tokens with a 64-token overlap work reliably.
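As a rough illustration of overlapping chunks, the sketch below slides a fixed-size window over a token list. It assumes tokens are already available as a list (here plain integers stand in for them); a real pipeline would use the embedding model's tokenizer, and split_with_overlap is a hypothetical helper, not Engium's splitter.

```python
def split_with_overlap(tokens, chunk_size=512, overlap=64):
    """Split a token list into windows that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end of the document
    return chunks

# 1,000 placeholder tokens; integers make the overlap easy to inspect.
tokens = list(range(1000))
chunks = split_with_overlap(tokens, chunk_size=512, overlap=64)
print([(c[0], c[-1]) for c in chunks])
```

Each window starts 448 tokens after the previous one, so the last 64 tokens of a chunk reappear at the start of the next. Text near a boundary therefore shows up with surrounding context in at least one chunk.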

Hierarchical chunking, where document summaries are stored alongside paragraph-level chunks, can substantially improve recall on complex multi-part questions.
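One way to picture hierarchical chunking is a summary record that every paragraph chunk points back to. The sketch below is an assumption about how such records could be shaped: the Chunk fields, the build_hierarchy helper, and the placeholder summarizer (first sentence of each paragraph) are all hypothetical. In practice the summary would more likely come from an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Chunk:
    chunk_id: str
    level: str                # "summary" or "paragraph"
    content: str
    parent_id: Optional[str]  # paragraph chunks point at their summary

def split_paragraphs(text: str) -> List[str]:
    """Split on blank lines and drop empty pieces."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def first_sentences(text: str) -> str:
    """Placeholder summarizer: the first sentence of each paragraph."""
    return " ".join(p.split(". ")[0].rstrip(".") + "."
                    for p in split_paragraphs(text))

def build_hierarchy(doc_id: str, text: str,
                    summarize: Callable[[str], str] = first_sentences) -> List[Chunk]:
    """Emit one summary chunk plus one chunk per paragraph."""
    summary_id = f"{doc_id}:summary"
    chunks = [Chunk(summary_id, "summary", summarize(text), None)]
    for i, paragraph in enumerate(split_paragraphs(text)):
        chunks.append(Chunk(f"{doc_id}:p{i}", "paragraph", paragraph, summary_id))
    return chunks

doc = ("Refunds take 14 days. Contact support to start one.\n\n"
       "Shipping is free over a minimum order. Express costs extra.")
chunks = build_hierarchy("policies", doc)
for chunk in chunks:
    print(chunk.level, "|", chunk.content)
```

Both levels would be embedded and stored in the same table. A broad question can then match the summary, and the parent_id link lets the retriever pull in the paragraphs underneath it.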

Embedding Strategy

Embedding model choice affects both retrieval quality and cost. Engium defaults to gemini-embedding-001 (768 dimensions) rather than OpenAI's text-embedding-3-small (1536 dimensions). Halving the vector size reduces storage costs and speeds up approximate nearest neighbor (ANN) queries without measurable recall degradation on business content.
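The storage difference is easy to estimate. The sketch below assumes each dimension is stored as a 4-byte single-precision float and counts only the raw vector payload, ignoring per-row and index overhead. The one-million-vector corpus is an illustrative figure, not an Engium measurement.

```python
BYTES_PER_FLOAT32 = 4  # assumption: single-precision storage per dimension

def raw_vector_bytes(dimensions: int, num_vectors: int) -> int:
    """Raw payload size of the vectors alone, without row or index overhead."""
    return dimensions * BYTES_PER_FLOAT32 * num_vectors

num_vectors = 1_000_000  # illustrative corpus size
small = raw_vector_bytes(768, num_vectors)
large = raw_vector_bytes(1536, num_vectors)

print(f"768 dims:  {small / 1e9:.2f} GB")   # 3.07 GB
print(f"1536 dims: {large / 1e9:.2f} GB")   # 6.14 GB
print(f"ratio:     {large / small:.1f}x")   # 2.0x
```

Because index size and distance computations also scale with dimension count, the same factor of two shows up, roughly, in memory use and per-comparison cost.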

"The jump from keyword search to semantic search is not incremental — it's categorical. Users phrase things in ten different ways, and only embedding-based retrieval handles that reliably."

Retrieval Tuning

HNSW indexes (used for FAQ items in Engium) deliver sub-millisecond lookups with high recall on datasets up to several million vectors. Tune the similarity threshold as well: 0.78 is the default, but domain-specific content often benefits from raising it to 0.82 to reduce false-positive retrievals.
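The effect of the threshold can be seen with a small filter over scored matches. This is a sketch with made-up scores; filter_by_similarity is a hypothetical helper, and in a real deployment the cutoff would usually be applied in the database query itself.

```python
def filter_by_similarity(matches, threshold=0.78):
    """Keep only (text, score) pairs whose score meets the threshold."""
    return [(text, score) for text, score in matches if score >= threshold]

# Illustrative retrieval results: (chunk title, cosine similarity).
matches = [
    ("Refund policy", 0.91),
    ("Shipping times", 0.80),
    ("Careers page", 0.79),
    ("Office hours", 0.55),
]

default_hits = filter_by_similarity(matches)        # default threshold 0.78
strict_hits = filter_by_similarity(matches, 0.82)   # stricter threshold

print([text for text, _ in default_hits])
print([text for text, _ in strict_hits])
```

At 0.78 the borderline "Shipping times" and "Careers page" chunks reach the model; at 0.82 only the strong match does. The trade-off is that a stricter threshold can also return nothing, so the application needs a fallback for empty results.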

01. Use HNSW indexes for FAQ retrieval (fast, high recall)
02. Use IVFFlat for large document sets (cheaper to build, good for batch)
03. Re-embed stale content monthly as your knowledge base evolves
04. Monitor mean similarity scores: a drop signals knowledge base drift
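The checklist above can be sketched in a few lines. The index statements use pgvector's hnsw and ivfflat index methods with cosine operators. The faq_items table name, the lists = 100 starting value, and the 0.05 drift tolerance are assumptions for illustration, not Engium defaults.

```python
from statistics import mean

def index_ddl(table: str, column: str, kind: str) -> str:
    """Build a pgvector index statement for cosine-distance search."""
    if kind == "hnsw":
        return f"CREATE INDEX ON {table} USING hnsw ({column} vector_cosine_ops);"
    if kind == "ivfflat":
        # lists = 100 is only a starting point; tune it to the row count.
        return (f"CREATE INDEX ON {table} USING ivfflat "
                f"({column} vector_cosine_ops) WITH (lists = 100);")
    raise ValueError(f"unknown index kind: {kind}")

def similarity_drift(baseline_scores, recent_scores, max_drop=0.05):
    """Return (drifted, drop), where drop is the fall in mean similarity."""
    drop = mean(baseline_scores) - mean(recent_scores)
    return drop > max_drop, drop

# Items 01 and 02: one index type per workload (table names are hypothetical).
print(index_ddl("faq_items", "embedding", "hnsw"))
print(index_ddl("knowledge_chunks", "embedding", "ivfflat"))

# Item 04: compare last month's top-match scores with this week's.
drifted, drop = similarity_drift([0.85, 0.83, 0.84], [0.76, 0.74, 0.75])
print(f"drifted={drifted}, drop={drop:.2f}")
```

A sustained drop in the mean suggests users are asking about things the knowledge base no longer covers well, which is a reasonable trigger for the re-embedding pass in item 03.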


Beyond LLMs: Building a Proprietary Knowledge Base
