Your RAG Retrieval Isn't Broken. Your Processing Is.
This post is part of why I’m building VectorFlow: configure your ingestion pipeline through conversation, preview every stage, and load the results into your vector database.
I’ve been talking to engineers about their RAG setups lately. The same story keeps coming up.
“Retrieval quality sucks. I’ve tried BM25, hybrid search, rerankers. Nothing moves the needle.”
So they tune. They swap embedding models. Adjust k values. Add rerankers. They spend weeks in the retrieval layer, convinced that’s where the problem lives.
It usually isn’t.
The retrieval layer is a scapegoat
Retrieval finds the chunks most similar to a query and returns them. That’s it. If the right answer isn’t in your chunks, or it’s split across three chunks with no connecting context, retrieval can’t find it. It’s similarity search over whatever you gave it.
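To see how little is going on in that layer, here’s a minimal sketch of retrieval as similarity search. It uses sentence-transformers purely as a stand-in for whatever embedding model and vector store you actually run; the model name and example chunks are placeholders, not a recommendation.

```python
# Minimal sketch: the retrieval layer is similarity search over whatever chunks
# you handed it. Model name and example chunks are placeholders.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Refunds are processed within 5 business days of the return being received.",
    "| Q3 | 41.2 |",                                      # table fragment from a bad PDF parse
    "3. Restart the service after applying the patch.",   # list that begins mid-way
]

chunk_vecs = model.encode(chunks, normalize_embeddings=True)
query_vec = model.encode("How long do refunds take?", normalize_embeddings=True)

# Vectors are normalized, so cosine similarity is a dot product.
scores = chunk_vecs @ query_vec
for score, chunk in sorted(zip(scores, chunks), reverse=True):
    print(f"{score:.3f}  {chunk}")

# If the answer isn't in the chunks, or got split across them during processing,
# nothing in this loop can recover it: not a reranker, not a different k.
```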
When engineers complain about retrieval quality, they’re usually complaining about processing: tables split in half, parsers mangling PDFs, noise embedded alongside signal, metadata stripped out.
None of these are retrieval problems. No amount of reranker tuning fixes them.
The feedback loop problem
The frustration sounds like: “I’ll spend like 3 days just figuring out why my PDFs are extracting weird characters. Meanwhile the actual RAG part takes an afternoon to wire up.”
Three days on processing. An afternoon on retrieval. That ratio tells you where the real work is.
The problem is the feedback loop. When retrieval returns bad results, you don’t get an error message saying chunk 47,832 has a split table. You just get irrelevant passages. So you start tuning retrieval because that’s the layer you can see.
The fix is to shorten the feedback loop on processing. Look at your chunks before you embed them. Catch the problems when they’re cheap to fix.
What to do about it
If your RAG retrieval quality is poor, before you touch the retrieval layer:
Sample your chunks. Literally read 50 random ones. Are they coherent? Can you understand them without context? (A rough script for this and the split-structure check is sketched after this list.)
Check your PDFs specifically. Compare what your parser produced to the original. Look for merged columns, garbled characters.
Look for split structures. Partial tables, numbered lists that start at “3”, code blocks that end mid-function. These are retrieval killers.
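The first and third checks are easy to script. Here’s a rough sketch, assuming your chunks are stored as JSONL with a “text” field; the path, field name, and heuristics are all placeholders to adapt to your own pipeline.

```python
# Rough sketch: sample chunks and flag likely split structures before embedding.
# Assumes one JSON object per line with a "text" field; "chunks.jsonl" is a placeholder.
import json
import random
import re

def load_chunks(path):
    with open(path, encoding="utf-8") as f:
        return [json.loads(line)["text"] for line in f if line.strip()]

def looks_split(text):
    """Cheap heuristics for structures that were cut mid-way during processing."""
    flags = []
    stripped = text.strip()
    # Numbered list that starts somewhere other than 1.
    m = re.match(r"(\d+)[.)]\s", stripped)
    if m and int(m.group(1)) > 1:
        flags.append(f"list starts at {m.group(1)}")
    # Pipe-delimited rows with no header separator: possible partial table.
    if stripped.count("|") >= 4 and "---" not in stripped:
        flags.append("possible partial table")
    # Code fence opened but never closed (or the reverse).
    if stripped.count("```") % 2 == 1:
        flags.append("unclosed code block")
    # Chunk ends mid-sentence.
    if stripped and stripped[-1] not in ".!?\"'`)":
        flags.append("ends mid-sentence")
    return flags

chunks = load_chunks("chunks.jsonl")
sample = random.sample(chunks, min(50, len(chunks)))

for i, chunk in enumerate(sample):
    flags = looks_split(chunk)
    print(f"--- chunk {i} " + (f"[{', '.join(flags)}]" if flags else ""))
    print(chunk[:500])  # now read it: can you understand it without context?
    print()
```

The heuristics are deliberately crude. The point isn’t automated detection, it’s forcing yourself to look at the chunks before they disappear into the vector database.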
If you’re stuck in retrieval debugging hell, it might be worth looking one layer up.