The Preprocessing Gap Between RAG and Agentic
RAG is the standard way to connect documents to LLMs. Most people building RAG systems know the steps by now: parse documents, chunk them, embed, store vectors, retrieve at query time. But something different happens when you’re building systems that act rather than answer.
The RAG mental model
RAG preprocessing optimizes for retrieval. Someone asks a question, you find relevant chunks, you synthesize an answer. The whole pipeline is designed around that interaction pattern.
The work happens before anyone asks anything. Documents get parsed into text, extracting content from PDFs, Word docs, HTML, whatever format you’re working with. Then chunking splits that text into pieces sized for context windows. You choose a strategy based on your content: split on paragraphs, headings, or fixed token counts. Overlap between chunks preserves context across boundaries. Finally, embedding converts each chunk into a vector where similar meanings cluster together. “The contract expires in December” ends up near “Agreement termination date: 12/31/2024” even though they share few words. That’s what makes semantic search work.
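Here’s a minimal sketch of the chunking step, fixed-size with overlap (words stand in for tokens to keep it self-contained; a real pipeline would count with the embedding model’s tokenizer):

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into fixed-size word chunks, with overlap preserving
    context across boundaries. Overlap must be smaller than chunk_size."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```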
Retrieval is similarity search over those vectors. Query comes in, gets embedded, you find the nearest chunks in vector space. For Q&A, this works well. You ask a question, the system finds relevant passages, an LLM synthesizes an answer. The whole architecture assumes a query-response pattern.
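The retrieval step really is that small. A sketch with cosine similarity over an in-memory matrix, where `embed` is a deterministic stand-in for whatever embedding model you call, and a real system would precompute and store the chunk vectors:

```python
import numpy as np

def embed(text: str, dim: int = 384) -> np.ndarray:
    """Stand-in for an embedding model call; returns a unit vector."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    vec = rng.standard_normal(dim)
    return vec / np.linalg.norm(vec)

def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Embed the query and return the k nearest chunks by cosine similarity."""
    matrix = np.stack([embed(c) for c in chunks])  # one unit vector per chunk
    sims = matrix @ embed(query)                   # dot product = cosine here
    top = np.argsort(sims)[::-1][:k]
    return [chunks[i] for i in top]
```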
The requirements shift when you’re building systems that act instead of answer.
What agentic actually needs
Consider a contract monitoring system. It tracks obligations across hundreds of agreements: Example Bank owes a quarterly audit report by the 15th, so the system sends a reminder on the 10th, flags it as overdue on the 16th, and escalates to legal on the 20th. The system doesn’t just find text about deadlines. It acts on them.
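Once the facts are structured, the rules themselves are almost trivial. A sketch of the escalation logic from that example (the 5-day thresholds come straight from the dates above; the action names are illustrative):

```python
from datetime import date, timedelta

def next_action(due: date, today: date) -> str | None:
    """Remind 5 days before the deadline, flag after it, escalate 5 days later."""
    if today >= due + timedelta(days=5):
        return "escalate_to_legal"
    if today > due:
        return "flag_overdue"
    if today >= due - timedelta(days=5):
        return "send_reminder"
    return None

# Quarterly audit report due on the 15th:
assert next_action(date(2025, 3, 15), date(2025, 3, 10)) == "send_reminder"
assert next_action(date(2025, 3, 15), date(2025, 3, 16)) == "flag_overdue"
assert next_action(date(2025, 3, 15), date(2025, 3, 20)) == "escalate_to_legal"
```

The hard part isn’t this logic. It’s getting the `due` dates and parties out of prose reliably.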
That requires something different at the data layer. The system needs to understand that Party A owes Party B deliverable X by date Y under condition Z. And it needs to connect those facts across documents. Not just find text about obligations, but actually know what’s owed to whom and when.
The preprocessing has to pull out that structure, not just preserve text for later search. You’re not chunking paragraphs. You’re turning “Example Bank shall submit quarterly compliance reports within 15 days of quarter end” into data you can query: party, obligation type, deadline, conditions. Think rows in a database, not passages in a search index.
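What those “rows in a database” might look like, as a sketch (the schema fields are my assumptions, not a standard; the obligee isn’t named in the clause, so it’s illustrative):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Obligation:
    """One extracted obligation: who owes what to whom, by when, from where."""
    obligor: str        # party that owes the deliverable
    obligee: str        # party owed the deliverable
    deliverable: str    # what must be delivered
    deadline_rule: str  # the deadline as stated in the contract
    next_due: date      # normalized next deadline
    source: str         # clause the obligation came from

report = Obligation(
    obligor="Example Bank",
    obligee="Counterparty",  # illustrative; the clause doesn't name the obligee
    deliverable="quarterly compliance report",
    deadline_rule="within 15 days of quarter end",
    next_due=date(2025, 4, 15),
    source="Contract #1847, Section 4.2",
)
```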
Two parallel paths
The architecture ends up looking completely different.
RAG has a linear pipeline. Documents go in, chunking happens, embeddings get created, vectors get stored. At query time, search, retrieve, generate.
Agentic systems need two tracks running in parallel. The main one pulls structured data out of documents. An LLM reads each contract, extracts the obligations, parties, dates, and conditions, and writes them to a graph database. Why a graph? Because you’re not just storing isolated facts, you’re storing how they connect. Example Bank owes a report. That report is due quarterly. The obligation comes from Section 4.2 of Contract #1847. Those connections between entities are what graph databases are built for. This is what powers the actual monitoring.
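A sketch of that write path using Neo4j’s Python driver and Cypher (the node labels, relationship types, and connection details are assumptions, not a prescribed schema):

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def write_obligation(obligor: str, obligee: str, deliverable: str,
                     next_due: str, source: str) -> None:
    """Store one obligation as a small subgraph: parties, obligation, source clause."""
    with driver.session() as session:
        session.run(
            """
            MERGE (a:Party {name: $obligor})
            MERGE (b:Party {name: $obligee})
            MERGE (c:Clause {ref: $source})
            MERGE (o:Obligation {deliverable: $deliverable, next_due: $next_due})
            MERGE (a)-[:OWES]->(o)
            MERGE (o)-[:OWED_TO]->(b)
            MERGE (o)-[:DERIVED_FROM]->(c)
            """,
            obligor=obligor, obligee=obligee, deliverable=deliverable,
            next_due=next_due, source=source,
        )
```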
But you still need embeddings. Just for different reasons.
The second track catches what extraction misses. Sometimes “the Lender” in paragraph 12 needs to connect to “Example Bank” from paragraph 3. Sometimes you don’t know what patterns matter until you see them repeated across documents. The vector search helps you find connections that weren’t obvious enough to extract upfront.
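One way vectors help here is entity resolution: embed a mention in its surrounding context, compare against the entities you already know, and only link above a threshold. A sketch (the threshold value and unit-vector inputs are assumptions):

```python
import numpy as np

def resolve_mention(mention_vec: np.ndarray,
                    entity_vecs: dict[str, np.ndarray],
                    threshold: float = 0.8) -> str | None:
    """Link a mention ("the Lender") to the most similar known entity
    ("Example Bank"), or return None if nothing clears the threshold."""
    best_name, best_sim = None, threshold
    for name, vec in entity_vecs.items():
        sim = float(mention_vec @ vec)  # cosine similarity on unit vectors
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name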
So you end up with two databases working together. The graph database stores entities and their relationships: who owes what to whom by when. The vector database helps you find things you didn’t know to look for.
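A sketch of how the two might cooperate at query time: answer from the graph when the question maps onto extracted structure, and fall back to semantic search when it doesn’t. Here `graph_session` is a Neo4j session as above, and `vector_index.search` is a stand-in for whatever vector store interface you use:

```python
def obligations_for(party: str, graph_session, vector_index) -> list:
    """Answer from extracted structure when possible; fall back to retrieval."""
    result = graph_session.run(
        "MATCH (p:Party {name: $party})-[:OWES]->(o:Obligation) RETURN o",
        party=party,
    )
    facts = [dict(record["o"]) for record in result]
    if facts:
        return facts
    # Nothing was extracted for this party; fall back to unstructured search.
    return vector_index.search(f"obligations owed by {party}")
```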
Different failure modes
For RAG, you care about chunk boundaries. Semantic coherence, retrieval quality, context windows. You obsess over overlap sizes and splitting strategies.
For agentic systems, you care about extraction quality. Did you get all the obligations? Are parties correctly identified? Are dates normalized? Do conditions link to the right clauses? You obsess over schema design and extraction prompts.
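Some of those checks can be encoded directly. A sketch of validating one extracted record (the required fields match the schema sketched earlier; the rules are illustrative):

```python
from datetime import date

REQUIRED = ("obligor", "obligee", "deliverable", "next_due", "source")

def validate(record: dict) -> list[str]:
    """Collect problems with one extracted obligation record."""
    problems = [f"missing {f}" for f in REQUIRED if not record.get(f)]
    if record.get("next_due"):
        try:
            date.fromisoformat(str(record["next_due"]))
        except ValueError:
            problems.append("next_due is not a normalized ISO date")
    return problems
```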
RAG chunking decisions are about retrieval performance. Agentic extraction decisions are about correctness. Miss an obligation and your monitoring system has a blind spot; chunk a contract poorly and you just get slightly worse answers.
Different failure modes, different preprocessing priorities.
Different configuration challenges
Both approaches have their own configuration puzzles to solve.
RAG preprocessing is about tuning retrieval quality. Chunk size, overlap, splitting strategy, embedding model. The parameters are well understood at this point, though finding the right combination for your data still takes iteration. This is exactly the problem I’m solving with VectorFlow: configure through conversation, preview each stage, shorten the feedback loop.
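The parameter surface is small enough to write down in full. This isn’t VectorFlow’s API, just the knobs as a plain config, with illustrative defaults rather than recommendations:

```python
from dataclasses import dataclass

@dataclass
class ChunkingConfig:
    """The usual RAG preprocessing knobs; they interact, so expect iteration."""
    splitter: str = "heading"    # "paragraph", "heading", or "fixed"
    chunk_size: int = 512        # target tokens per chunk
    overlap: int = 64            # tokens shared across chunk boundaries
    embedding_model: str = "text-embedding-3-small"  # illustrative choice
```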
Agentic preprocessing is about defining what to extract. Extraction schemas. Entity resolution strategies. How to merge conflicting information across documents. When to fall back to unstructured retrieval. The parameters are different, not necessarily harder. You’re working in schema design and extraction prompts rather than chunking strategies.
The tooling maturity differs. RAG pipelines have had a few years to develop standard approaches. Agentic extraction is newer, so the patterns are still emerging. It’s part of why I’ve been thinking about entity and relationship extraction for VectorFlow. The gap between “good enough for retrieval” and “reliable enough for action” is where a lot of the interesting problems live.
The split
Chunks don’t have deadlines. Vectors don’t have obligations.
If you’re building something that answers questions, optimize for retrieval. RAG pipelines are maturing.
If you’re building something that takes actions over time, you need structured extraction first and embeddings second: a graph database for relationships, a vector database for discovery.
The documents are the same. What you’re trying to do with them isn’t.