ENTRY_10

RAG Is Dead (And So Is Email Search)

Every few months someone declares RAG dead. The argument goes like this: context windows keep growing. Gemini hit a million tokens. Claude’s at 200k. Why bother with retrieval when you can just put everything in the prompt?

It sounds reasonable. More space means more documents. More documents means more coverage. Problem solved.

But there’s a gap in the logic.

The email inbox problem

Imagine you need to find one specific email from three years ago. You have two options.

Option one: read every email you’ve ever received until you find it.

Option two: use search.

Nobody picks option one. It doesn’t matter that your inbox can hold 50,000 emails. Capacity isn’t the problem. Finding the right email is the problem.

This is the same mistake people make with context windows. A million tokens means you can fit more documents. It doesn’t mean the model can find the right information inside them.

What actually happens with long contexts

There’s a phenomenon researchers call “lost in the middle.” When you give a model a long context, it doesn’t process all of it equally. Information at the beginning and end gets more attention. Stuff in the middle gets partially ignored.

This isn’t a minor edge case. Recent research on what’s been called “context rot” tested 18 frontier models, including GPT-4, Claude, and Gemini. Performance degraded as input length increased, even on straightforward retrieval tasks. The more content you added, the worse the model got at finding specific information within it.

Adding irrelevant content made it worse. Every document that wasn’t directly relevant to the question acted as noise, making it harder for the model to surface what actually mattered.

The counterintuitive finding: it’s not just about length. It’s about signal-to-noise ratio. A focused 10-page context often outperforms a comprehensive 100-page one.
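You can see the shape of this failure with a simple probe: plant one relevant sentence at different depths in otherwise irrelevant filler and ask the model about it. Here’s a minimal sketch in that spirit; ask_model is a hypothetical stand-in for whatever LLM client you’d use, so it’s left commented out.

```python
# Minimal "needle at depth" probe: place one relevant sentence at
# varying positions inside filler text, then question the model.
def build_context(needle: str, filler: str, depth: float, n_fillers: int = 200) -> str:
    docs = [filler] * n_fillers
    docs.insert(int(depth * n_fillers), needle)  # depth 0.0 = start, 1.0 = end
    return "\n".join(docs)

for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    ctx = build_context(
        needle="The payment terms changed to net-45 in March.",
        filler="This paragraph is unrelated background noise.",
        depth=depth,
    )
    # answer = ask_model(f"{ctx}\n\nWhat are the payment terms?")  # hypothetical LLM call
```

Run the same question at each depth: per the research above, accuracy tends to dip when the needle sits in the middle, and degrade further as you raise n_fillers.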

What RAG actually solves

When you need an LLM to answer questions about proprietary data, hallucination is the core problem. The model wasn’t trained on your contracts, policies, or customer records. Ask it about them anyway and it doesn’t say “I don’t know.” It makes something up. Confidently. Leaning on what the model learned in training is fine for general knowledge. It’s a disaster when you need accurate answers about your own data.

The first instinct is to put that proprietary information in the prompt. Give the model what it needs, then ask your question. This works until you hit the problems we just discussed. Long contexts, lost in the middle, noise drowning out signal.

The other option is fine-tuning. Retrain the model on your proprietary data so it actually “knows” the information instead of just seeing it in context. This does help with hallucination. A model trained on your contracts will answer questions about them more accurately than one that’s never seen them.

The problem is what happens six months later. Your contracts changed. Your policies updated. The model doesn’t know that. It still has the old information baked into its weights, and it will answer questions about outdated terms just as confidently as before. You’ve traded “making things up” for “confidently wrong about stale data.” Different flavor of the same issue.

Fine-tuning is also expensive. Depending on the model and data, you’re looking at tens of thousands to potentially millions of dollars. It takes days or weeks. If your contracts update quarterly or your policies change monthly, you’re retraining constantly or accepting that your model is working with yesterday’s answers.

RAG sidesteps both problems. Instead of stuffing everything into context or baking information into model weights, you pull out the relevant pieces when someone asks a question. The model gets what it needs for this specific question, nothing more. Your data lives in a vector database that you can update anytime without touching the model.
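Here’s what that query path looks like in miniature. This is a sketch, not any particular library’s API: embed() is a toy bag-of-words stand-in for a real embedding model, and the “vector database” is just a list of chunks scored by cosine similarity.

```python
# Minimal RAG query path: score stored chunks against the question,
# keep the top k, and build a focused prompt from just those.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding." A real pipeline swaps in a dense
    # embedding model here; the retrieval shape stays the same.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, chunks: list[str], k: int = 3) -> list[str]:
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(question: str, chunks: list[str]) -> str:
    context = "\n---\n".join(retrieve(question, chunks))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

chunks = [
    "Contract A: payment terms are net-45 as of March.",
    "Holiday schedule for the 2024 office calendar.",
    "Contract B: renews automatically each January.",
]
print(build_prompt("What are the payment terms?", chunks))
```

Note what updating looks like here: edit the chunks list (in practice, the rows in your vector database) and the very next question sees the new data. No retraining, no stale weights.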

The quality of that retrieval depends entirely on your preprocessing pipeline. How you parse documents. How you chunk them. What metadata you preserve. If your chunks are incoherent or your parsing mangled the source content, retrieval can’t fix that. This is the problem I’m working on with VectorFlow: making it easier to get accurate, well-structured information into the vector database so the retrieval layer has something good to work with.
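For a sense of what that preprocessing layer does at its simplest, here’s a baseline chunker: fixed-size character windows with overlap, each chunk keeping metadata that ties it back to its source. The numbers are illustrative, and this is not VectorFlow’s implementation; it’s the naive baseline that structure-aware parsing and chunking improve on.

```python
# Baseline preprocessing: fixed-size character chunks with overlap.
# Each chunk carries metadata so retrieved text can be traced back
# to its source document for citations and debugging.
def chunk(doc_id: str, text: str, size: int = 800, overlap: int = 100) -> list[dict]:
    step = size - overlap
    chunks = []
    for start in range(0, max(len(text) - overlap, 1), step):
        chunks.append({
            "doc_id": doc_id,                  # which document this came from
            "offset": start,                   # where in it
            "text": text[start:start + size],  # overlapping window of the source
        })
    return chunks
```

The overlap keeps sentences that straddle a boundary recoverable from at least one chunk. Naive character splitting still cuts tables and lists apart, which is exactly the kind of mangling the retrieval layer can’t undo downstream.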

The red herring

So where does the “RAG is dead” argument fit in? It assumes context windows and retrieval are competing solutions. They’re not. A million-token window doesn’t help you find the right documents. It just gives you more room to fit them once you’ve found them.

If anything, better models make retrieval quality matter more. A capable model with precisely the right context will outperform the same model drowning in loosely relevant documents. The ceiling goes up, but only if you’re feeding it well.

Context windows determine how much you can fit. Retrieval determines whether you find it. Different problems.

The inbox can hold 50,000 emails. You’re still going to use search.
