The failure mode nobody diagnoses correctly
A team builds a RAG system. They embed their documents, stand up a vector database, wire in a retrieval step, and ship. The demo works. The early users are impressed.
Six weeks later, the support tickets start. Users are getting wrong answers. The team diagnoses the problem as a model issue — the LLM is hallucinating. They swap models. The problem persists. They tune the prompt. It persists.
The actual problem, almost always, is chunking.
RAG systems fail at retrieval far more often than they fail at generation. A capable language model given the wrong context will produce confident, fluent nonsense. A capable language model given the right context will produce correct answers. The retrieval step — which is almost entirely a function of how documents were chunked — is the bottleneck.
What naive chunking actually does
The standard approach, used in most tutorials and most production systems, is character-count chunking: split the document every N characters (or N tokens), with an optional overlap of M characters.
This is simple. It is also wrong in ways that matter.
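For reference, a minimal sketch of what character-count chunking typically looks like. The function name `chunk_by_chars` and its default sizes are illustrative assumptions, not taken from any particular library.

```python
def chunk_by_chars(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text every `size` characters, repeating `overlap` characters between chunks.

    Hypothetical helper for illustration: it knows nothing about sentences,
    headings, or code blocks, which is exactly the weakness discussed here.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # each new chunk starts `step` characters after the last
    return [text[start:start + size] for start in range(0, len(text), step)]


if __name__ == "__main__":
    sample = "The mitigation is to implement circuit breakers at service boundaries."
    for chunk in chunk_by_chars(sample, size=30, overlap=5):
        print(repr(chunk))
```

Boundaries land wherever the character count says they should, regardless of what the text is doing at that point.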
Consider a section of technical documentation that looks like this:
...is the most common failure mode in distributed systems. The mitigation
is to implement circuit breakers at service boundaries.
## Circuit Breaker Pattern
A circuit breaker monitors calls to an external service and tracks the
failure rate over a rolling window...
If the chunk boundary falls immediately after "The mitigation", you get two chunks:
- Chunk A: "...is the most common failure mode in distributed systems. The mitigation"
- Chunk B: "is to implement circuit breakers at service boundaries. ## Circuit Breaker Pattern A circuit breaker monitors..."
Chunk A ends on a dangling sentence. It will not embed well, because the words that complete its meaning are in Chunk B. Chunk B begins mid-sentence and is less likely to be retrieved for queries about "circuit breaker pattern", because the most semantically rich part of that section, the heading, sits behind a sentence fragment instead of leading the chunk.
Now multiply this across a document with hundreds of sections. Your retrieval quality has a structural ceiling that no prompt engineering will break.
Semantic chunking: what it means and what it costs
Semantic chunking is splitting documents at logical boundaries rather than character counts. Headers, paragraphs, procedure steps, code blocks — these are the natural units of meaning in most technical documents.
The implementation varies by document type:
Markdown / structured text: Split at heading boundaries. A chunk is everything from one heading to the next heading at the same or higher level. Headings are preserved at the start of each chunk to maintain context.
PDFs and unstructured documents: Use a layout-aware parser (like pdfplumber, unstructured, or docling) that understands visual structure — columns, headers, tables, captions — before chunking.
Code: Split at function and class boundaries, not at arbitrary line counts. A function is a unit of meaning. Half a function is not.
Long prose: Split at paragraph boundaries. If paragraphs are very long, split at sentence boundaries but preserve the section heading as context.
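For the Markdown case, heading-based splitting might be sketched as follows. This is one possible reading of the rule above, not a definitive implementation: the function name `split_by_headings` and the `max_level` parameter are assumptions, only ATX-style (`#`) headings are recognized, and headings deeper than `max_level` stay inside their parent's chunk.

```python
import re

# ATX heading: one to six '#' characters, whitespace, then some text.
HEADING = re.compile(r"^(#{1,6})\s+\S")


def split_by_headings(markdown: str, max_level: int = 2) -> list[str]:
    """Start a new chunk at every heading of level <= max_level.

    Each chunk therefore begins with its own heading line. Text before the
    first heading becomes a chunk of its own. (Hypothetical helper.)
    """
    chunks: list[list[str]] = [[]]
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # a '#' inside a code fence is not a heading
        match = None if in_fence else HEADING.match(line)
        if match and len(match.group(1)) <= max_level:
            chunks.append([])  # close the current chunk, open a new one
        chunks[-1].append(line)
    # Drop chunks that contain only blank lines.
    return ["\n".join(c).strip() for c in chunks if any(s.strip() for s in c)]


if __name__ == "__main__":
    doc = (
        "Intro text.\n"
        "## Circuit Breaker Pattern\n"
        "A circuit breaker monitors calls to an external service.\n"
        "### States\n"
        "Closed, open, half-open.\n"
        "## Retries\n"
        "Retry with backoff.\n"
    )
    for chunk in split_by_headings(doc):
        print(repr(chunk))
```

Because the heading leads each chunk, a query about "circuit breaker pattern" has the heading text working for it in the embedding rather than against it.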
The cost is implementation complexity. Character-count chunking is three lines of code. Semantic chunking requires understanding your document structure. For most production use cases, this complexity pays for itself within weeks.
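For the code case, Python's standard `ast` module can locate function and class boundaries. The sketch below assumes the source is valid Python and only captures top-level definitions; the function name `split_python_source` is hypothetical, and a production version would likely also need to handle decorators, module-level statements, and very large classes.

```python
import ast


def split_python_source(source: str) -> list[str]:
    """Return one chunk per top-level function or class definition.

    Module-level statements (imports, constants) and decorator lines are
    not captured by this simple sketch.
    """
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment returns the exact source text of the node.
            segment = ast.get_source_segment(source, node)
            if segment is not None:
                chunks.append(segment)
    return chunks


if __name__ == "__main__":
    src = "import os\n\ndef a():\n    return 1\n\nclass B:\n    def m(self):\n        return 2\n"
    for chunk in split_python_source(src):
        print(chunk)
        print("---")
```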
The retrieval evaluation harness you need before you start
The most important tool in RAG development is a retrieval evaluation harness built before the first embedding is computed.
The harness is simple: a test set of questions that real users would ask, with manually annotated correct source passages. For each query, the harness retrieves the top-k documents and computes:
- Precision@k — of the k retrieved documents, how many are relevant?
- Recall@k — of all relevant documents, how many were retrieved in the top k?
- MRR (Mean Reciprocal Rank) — how highly was the first relevant document ranked?
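class RetrievalEvaluator:
    """Scores a retriever against queries with manually annotated relevant passages."""

    # TestCase, EvalResult, EvalReport and the _precision / _recall / _mrr
    # helpers are assumed to be defined elsewhere.
    def __init__(self, retriever, test_cases: list[TestCase]):
        self.retriever = retriever
        self.test_cases = test_cases

    def evaluate(self, top_k: int = 5) -> EvalReport:
        results = []
        for case in self.test_cases:
            # Retrieve the top-k documents and keep only their IDs for scoring.
            retrieved = self.retriever.retrieve(case.query, top_k=top_k)
            retrieved_ids = [doc.id for doc in retrieved]
            precision = self._precision(retrieved_ids, case.relevant_ids)
            recall = self._recall(retrieved_ids, case.relevant_ids)
            # Per query this is a reciprocal rank; averaging over all cases gives MRR.
            mrr = self._mrr(retrieved_ids, case.relevant_ids)
            results.append(EvalResult(case.query, precision, recall, mrr))
        return EvalReport(results)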
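The evaluator calls `_precision`, `_recall`, and `_mrr` without showing them. A minimal sketch of how the three metrics might be computed for a single query, written as standalone functions: the function names are assumptions, and precision here divides by the number of documents actually returned, which equals k whenever the retriever returns a full top-k.

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of the retrieved documents that are relevant."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids if doc_id in relevant_ids)
    return hits / len(retrieved_ids)


def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of all relevant documents that appear in the retrieved list."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids) & relevant_ids)
    return hits / len(relevant_ids)


def reciprocal_rank(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(per_query: list[float]) -> float:
    """MRR is the average of the per-query reciprocal ranks."""
    return sum(per_query) / len(per_query) if per_query else 0.0


if __name__ == "__main__":
    retrieved = ["d3", "d1", "d7", "d2"]
    relevant = {"d1", "d2", "d9"}
    print(precision_at_k(retrieved, relevant))
    print(recall_at_k(retrieved, relevant))
    print(reciprocal_rank(retrieved, relevant))
```

Reporting the three numbers per query, not only the averages, makes it easier to see which questions a chunking change helped and which it hurt.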
Build this before you build anything else. Then use it to measure every decision: chunking strategy, embedding model, chunk size, overlap, metadata filtering, hybrid vs pure semantic retrieval.
Without this harness, you are guessing. With it, you are engineering.
Hybrid retrieval and why it usually wins
Pure semantic retrieval (dense vector search) is outperformed by hybrid retrieval (dense + sparse/BM25) in most production scenarios. The reason is that real user queries contain proper nouns, product names, version numbers, and code identifiers that semantic search handles poorly.
A query for "PXGraph.createInstance deprecated in 24.1" will find the relevant section through BM25 because the section contains the exact strings. Semantic search may miss it, because the embedding of the query does not closely match the embedding of the relevant documentation section.
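Many vector databases and search engines support hybrid retrieval natively (Pinecone, Weaviate, Elasticsearch). The uplift on precision metrics is usually in the 10–20% range, though it varies by corpus and query mix, so measure it with your own harness. It is generally worth the added complexity.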
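When the dense and sparse results have to be merged by hand, one common technique is reciprocal rank fusion (RRF). The text above does not name a fusion method, so treat this as one option among several. The sketch assumes each retriever returns an ordered list of document IDs; the function name is hypothetical, and `k = 60` is a commonly used default for the smoothing constant.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked ID lists into one, best first.

    Each document scores 1 / (k + rank) in every list it appears in;
    documents ranked well by more than one retriever rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


if __name__ == "__main__":
    dense_hits = ["a", "b", "c"]   # assumed output of a vector search
    sparse_hits = ["c", "a", "d"]  # assumed output of a BM25 search
    print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
```

RRF uses only ranks, so it avoids having to normalize BM25 scores against cosine similarities, which live on different scales.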
The metadata problem
Chunking and retrieval are only part of the problem. Metadata filtering is the other part that most implementations get wrong.
Documents are not equivalent. In a knowledge base for an enterprise device management product, a document about macOS 14 configuration is not the same as a document about macOS 12 configuration, even if they are semantically similar. A user asking about their macOS 14 deployment does not want to retrieve macOS 12 documentation.
Every chunk should carry structured metadata that can be used to filter retrieval before or after the vector search:
- Document type (guide, API reference, release note, troubleshooting)
- Version or date
- Product area or category
- Audience (admin, developer, end user)
- Source status (official, community, deprecated)
Retrieval then becomes: find the semantically closest chunks that also match the metadata filters for this user's context. This is significantly more accurate than treating all documents as equivalent.
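A minimal in-memory sketch of filter-then-rank retrieval, assuming exact-match metadata filters and cosine similarity. The `Chunk` class, its field names, and `filtered_search` are all hypothetical; a real vector database would apply the filter inside its own index.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Chunk:
    id: str
    embedding: list[float]
    metadata: dict[str, str] = field(default_factory=dict)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(
    query_embedding: list[float],
    chunks: list[Chunk],
    filters: dict[str, str],
    top_k: int = 5,
) -> list[Chunk]:
    """Keep chunks whose metadata matches every filter, then rank by similarity."""
    candidates = [
        c for c in chunks
        if all(c.metadata.get(key) == value for key, value in filters.items())
    ]
    candidates.sort(key=lambda c: cosine(query_embedding, c.embedding), reverse=True)
    return candidates[:top_k]


if __name__ == "__main__":
    chunks = [
        Chunk("macos12-config", [1.0, 0.0], {"version": "12"}),
        Chunk("macos14-config", [0.9, 0.1], {"version": "14"}),
    ]
    # Without the filter the macOS 12 chunk is the closest match;
    # with it, only macOS 14 documentation is eligible.
    for hit in filtered_search([1.0, 0.0], chunks, {"version": "14"}):
        print(hit.id)
```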
The result when you get it right
The difference between naive and semantic chunking is not marginal. It is the difference between a RAG system whose retrieval users notice, because it fails, and one whose retrieval they stop noticing, which is the target state.
When users stop noticing the retrieval layer, they are consistently getting the right document. That only happens when every chunk is a meaningful unit of information, when retrieval is measured rigorously, and when metadata filtering respects each document's version, audience, and status.
The implementation takes longer than character-count chunking. It works better from day one and continues to work better as the knowledge base grows.