Lazizbek

RAG Is Not Just Retrieval


The Short Version:

Retrieval is 20% of RAG. The other 80% is everything that happens before and after — chunking, reranking, query transformation, and knowing when not to retrieve at all.

Retrieval-Augmented Generation has become the default architecture for any LLM application that needs to work with external knowledge. The basic idea is simple: instead of relying on the model's parametric memory, retrieve relevant documents at inference time and put them in the context.

The simple version is easy to implement. The working version is not.

Here's what I've learned building RAG systems in production.

The Chunking Problem Is Underrated

Most tutorials chunk documents at fixed character counts — 512 tokens, 1000 tokens — with some overlap. This works well enough for demos. It fails in production for a simple reason: meaning doesn't respect character boundaries.

A fixed-size chunk might split a paragraph mid-sentence, separating a claim from its supporting evidence. It might break a code block into two useless halves. It might put the question and the answer in different chunks.

Chunking strategy should follow document structure, not character count:

  • Code: chunk by function or class, not lines
  • Documentation: chunk by section heading
  • Conversations: chunk by turn or topic shift
  • Legal/technical docs: chunk by numbered clause or section
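As a concrete illustration of structure-following chunking, here is a minimal sketch that splits a markdown document at section headings rather than at a fixed character count. The regex and function name are my own; a production chunker would also enforce a maximum size and carry heading context into each chunk.

```python
import re

def chunk_by_heading(markdown_text: str) -> list[str]:
    """Split a markdown document at section headings instead of at a
    fixed character count, so each chunk is a coherent section."""
    # Split immediately before every heading line (1-6 leading '#'s).
    parts = re.split(r"(?m)^(?=#{1,6} )", markdown_text)
    return [p.strip() for p in parts if p.strip()]

doc = "# Install\npip install foo\n\n# Usage\nRun foo --help"
chunks = chunk_by_heading(doc)
# Each chunk keeps a heading together with its body text.
```

The same pattern generalizes: split on function definitions for code, on speaker turns for conversations, on numbered clauses for legal text.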

The right chunk size also depends on what you're retrieving for. For question answering, smaller chunks with more precision. For summarization, larger chunks with more context. One-size-fits-all is a compromise that fits nothing well.

Embedding Models Are Not Interchangeable

OpenAI's text-embedding-3-small is not the same as text-embedding-3-large is not the same as a fine-tuned domain-specific model.

General-purpose embedding models are trained on general-purpose text. They capture semantic similarity well for common concepts and break down in specialized domains.

Medical terminology, legal language, trucking jargon, financial terms — these have semantic relationships that general models don't represent well. "Brake fade" and "brake failure" are semantically similar in general English. In heavy truck mechanics, they're different problems with different causes and solutions.

For domain-specific RAG, fine-tuning the embedding model (or at minimum, selecting one trained on similar text) matters more than fine-tuning the generative model. Retrieval quality is the ceiling.
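One cheap way to compare candidate embedding models on your domain is a triple-based check: for each (query, domain-correct document, plausible distractor) triple, does the model score the correct document higher? A sketch, where `embed` is whatever embedding function you're evaluating and the triples are hand-labeled domain pairs:

```python
from math import sqrt

def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def domain_pair_accuracy(embed, triples) -> float:
    """Fraction of (query, positive, negative) triples where the
    domain-correct document outscores the distractor under `embed`."""
    hits = 0
    for query, positive, negative in triples:
        q, p, n = embed(query), embed(positive), embed(negative)
        hits += cosine(q, p) > cosine(q, n)
    return hits / len(triples)
```

A few dozen such triples, written by a domain expert, will separate embedding models far faster than end-to-end generation evals will.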

Retrieval Recall vs. Precision

The default metric people optimize for is retrieval relevance — did we get the right documents? But there are two distinct failure modes:

Low recall: the relevant document exists in the knowledge base but wasn't retrieved. The model gives a wrong answer or says "I don't know."

Low precision: irrelevant documents were retrieved alongside the relevant ones. The model gets confused, contradicts itself, or hallucinates a synthesis of conflicting information.

These require different fixes. Low recall is usually a chunking or embedding problem. Low precision is usually a retrieval count problem — you're retrieving too many chunks and flooding the context with noise.

Start with retrieval eval before you evaluate generation. If your retrieval isn't surfacing the right documents, no amount of prompt engineering will fix the generation.
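The two failure modes map directly onto two standard metrics, which you can compute with a labeled set of (query, relevant document ids) pairs. A minimal version:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k.
    Low values mean the right chunks exist but aren't being surfaced."""
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k that is actually relevant.
    Low values mean the context is being flooded with noise."""
    hits = len(set(retrieved[:k]) & relevant)
    return hits / k
```

Tracking both, per query, tells you which fix to reach for: recall problems point at chunking and embeddings, precision problems point at retrieval count and reranking.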

Query Transformation

The query a user types is often not the best query for retrieval.

Users write conversational queries: "how do I fix the thing that was happening with the truck?" The vector store needs semantic overlap with the stored document, which might be titled "Cummins ISX valve lash adjustment procedure."

Query transformation bridges this gap:

Hypothetical Document Embeddings (HyDE): ask the LLM to generate a hypothetical answer to the question, then embed that answer and use it as the retrieval query. Counter-intuitive but effective — the hypothetical answer's embedding is often closer to the actual answer's embedding than the question's embedding is.
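The mechanics of HyDE fit in a few lines. Here `generate` and `embed` are whatever LLM call and embedding function you already use; the prompt wording is my own illustration, not a fixed recipe:

```python
def hyde_query_vector(question: str, generate, embed):
    """HyDE: embed a hypothetical answer instead of the raw question.
    `generate` is any LLM completion call; `embed` is any embedding
    function. Both are assumed interfaces, not a specific library's API."""
    hypothetical_answer = generate(
        f"Write a short passage that plausibly answers: {question}"
    )
    # The hypothetical answer's embedding tends to land closer to the
    # real answer's embedding than the question's embedding does.
    return embed(hypothetical_answer)
```

The retrieval query vector then goes to the vector store exactly as before; only the embedding step changes.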

Query expansion: generate multiple variants of the query and retrieve against all of them, then deduplicate. Catches cases where the original phrasing doesn't match the stored phrasing.

Decomposition: for complex multi-part questions, break them into sub-queries, retrieve for each, then synthesize. "What are the most common failure modes for the DD15 and how do they compare to the ISX?" should be two retrievals, not one.
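Query expansion is mostly bookkeeping: retrieve against each variant, then merge by document id, keeping the best score per document. A sketch, with `expand` (an LLM call that rephrases the query) and `retrieve` (a vector-store lookup returning `(doc_id, score)` pairs) as assumed callables:

```python
def multi_query_retrieve(query: str, expand, retrieve, top_k: int = 5):
    """Retrieve against the original query plus LLM-generated variants,
    then deduplicate by doc id, keeping each document's best score."""
    best_score: dict[str, float] = {}
    for variant in [query, *expand(query)]:
        for doc_id, score in retrieve(variant):
            if doc_id not in best_score or score > best_score[doc_id]:
                best_score[doc_id] = score
    ranked = sorted(best_score.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]
```

Decomposition reuses the same merge step: generate sub-queries instead of rephrasings, retrieve for each, and hand the combined set to the model with an explicit instruction to synthesize.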

Reranking After Retrieval

Vector similarity search is fast but approximate. The top-k results by cosine similarity are not necessarily the top-k by relevance to the actual question.

Reranking adds a second pass: take the top-k from vector search, then run them through a cross-encoder that scores (query, document) pairs jointly. Cross-encoders are slower than bi-encoders but significantly more accurate at relevance ranking.

In practice: retrieve top-20 by vector similarity, rerank with a cross-encoder, pass top-5 to the LLM. The extra latency is usually 100-200ms — worth it for the precision improvement.

Cohere Rerank and cross-encoders from sentence-transformers both work well here.
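The rerank stage itself is simple once the scorer is abstracted out. In the sketch below, `score_pairs` is any function that scores a batch of (query, document) pairs jointly; with sentence-transformers you would pass `CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2").predict` for that argument:

```python
def rerank(query: str, candidates: list[str], score_pairs, keep: int = 5) -> list[str]:
    """Second-pass reranking: score (query, doc) pairs jointly with a
    cross-encoder-style scorer, then keep only the highest-scoring docs."""
    scores = score_pairs([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:keep]]
```

This is the retrieve-20, keep-5 pattern from above: pass the vector-search top-20 as `candidates` and hand the returned top-5 to the LLM.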

When Not to Retrieve

This is the most underrated design decision in RAG.

Some queries don't need retrieval. "What is 2 + 2?" "Summarize the text I just pasted." "Translate this to Spanish." Forcing these through a retrieval pipeline adds latency and introduces noise.

Build a classifier (even a simple one) that routes queries:

  • Factual questions about your knowledge base → retrieve
  • General knowledge questions → direct generation or search
  • Transformations of user-provided content → direct generation
  • Ambiguous → retrieve and let the model decide whether to use it
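Even a rule-based version of this router earns its keep. The rules and domain terms below are purely illustrative; a production system would train a small classifier, but the three-way routing structure is the same:

```python
# Illustrative domain vocabulary for the knowledge-base check.
DOMAIN_TERMS = ("truck", "carrier", "engine", "valve", "brake")

def route(query: str) -> str:
    """Toy rule-based query router. Returns one of:
    'direct' (skip retrieval), 'retrieve' (always retrieve),
    'retrieve_optional' (retrieve, let the model decide whether to use it)."""
    q = query.lower()
    if any(verb in q for verb in ("translate", "summarize", "rewrite")):
        return "direct"           # transformation of user-provided content
    if any(term in q for term in DOMAIN_TERMS):
        return "retrieve"         # likely a knowledge-base question
    return "retrieve_optional"    # ambiguous: retrieve and let the model decide
```

The router sits in front of the pipeline, so the "direct" path skips embedding, search, and reranking entirely.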

Routing reduces latency, reduces cost, and reduces the chance that irrelevant retrieved content degrades a generation that would have been good without it.

The Answer Isn't Always in the Chunks

The failure mode that's hardest to debug: the answer requires synthesizing information across multiple chunks, but each individual chunk looks irrelevant when scored against the query.

Example: "What's the trend in carrier safety scores in Texas over the last 3 years?" No single document contains this. The answer requires aggregating data across many records.

RAG as typically implemented can't answer this. You need either a different architecture (structured data query + generation) or explicit synthesis logic that asks the model to aggregate across retrieved chunks rather than find the answer in one.
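For the safety-score example, the structured-query path looks nothing like vector search: it's a filter plus an aggregation over records. A sketch, with illustrative field names (`state`, `year`, `score`) standing in for whatever your records actually contain:

```python
from statistics import mean

def score_trend(records: list[dict], state: str, years: set[int]) -> dict[int, float]:
    """Aggregate a per-year average that no single chunk contains:
    filter records by state and year, then average scores per year.
    Field names are illustrative, not from a real schema."""
    by_year: dict[int, list[float]] = {}
    for record in records:
        if record["state"] == state and record["year"] in years:
            by_year.setdefault(record["year"], []).append(record["score"])
    return {year: mean(scores) for year, scores in sorted(by_year.items())}
```

The generation step then receives this computed table, not raw chunks, and its job shrinks to describing a trend rather than inventing one.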

Know what your knowledge base actually contains and design the retrieval strategy accordingly. The most common RAG failure isn't a retrieval failure — it's using RAG for a problem that needs a different tool.