RAG explained: how to give your AI agent a long-term memory.
A language model knows a lot — but it doesn't know your contracts, your policies, your past proposals, or your product documentation. RAG (Retrieval-Augmented Generation) fixes that by giving your agent access to your knowledge base at query time, without expensive fine-tuning.
Every business that builds an AI system eventually runs into the same limitation: the model knows a lot, but it doesn't know their stuff. It doesn't know their pricing. It doesn't know their internal policies. It doesn't know the terms in the contract they signed with a specific client three years ago.
The solution isn't training a new model. For most applications, the solution is RAG (Retrieval-Augmented Generation), one of the most important concepts to understand for any business building serious AI systems. Here's how it works.
The fundamental problem RAG solves
A language model's knowledge comes from its training data — everything it saw during the pre-training process, frozen at a specific date. There are two problems with this:
First, the model doesn't know your proprietary information. Your company's internal documents, policies, past work, and specific data were never in the training set.
Second, the model's knowledge has a cutoff date. Things that happened after training are unknown to the model.
RAG solves both problems by connecting the model to an external knowledge base. When a question arrives, the system retrieves the most relevant information from the knowledge base and passes it to the model as part of the prompt. The model now has access to current, proprietary information — not because it was trained on it, but because it was given it at the moment of the query.
The mechanics: how retrieval actually works
The retrieval step is where most of the engineering complexity lives in a RAG system. Here's the process:
Step 1: Ingestion. Your documents — PDFs, Word files, web pages, database records, whatever — are broken into chunks. The chunking strategy matters: too large and the chunks contain too much irrelevant information; too small and they lose context. A common default is 400–800 token chunks with some overlap between adjacent chunks.
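To make the chunking step concrete, here is a minimal sketch of fixed-size chunking with overlap. It assumes whitespace-separated words are an acceptable stand-in for model tokens (a real pipeline would count tokens with the embedding model's own tokenizer), and the names chunk_tokens and chunk_text are illustrative, not from any particular library.

```python
def chunk_tokens(tokens, chunk_size=600, overlap=100):
    """Split a token list into fixed-size chunks that overlap their neighbors."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


def chunk_text(text, chunk_size=600, overlap=100):
    # Assumption: words approximate tokens closely enough for a sketch.
    words = text.split()
    return [" ".join(chunk) for chunk in chunk_tokens(words, chunk_size, overlap)]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.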
Step 2: Embedding. Each chunk is converted into a numerical vector — a list of hundreds or thousands of numbers — using an embedding model. This vector represents the semantic meaning of the chunk. The critical property: chunks with similar meaning produce similar vectors.
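"Similar vectors" is usually measured with cosine similarity. The sketch below uses hand-made three-dimensional vectors purely for illustration; they are assumptions, not the output of a real embedding model, which would produce hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical): two chunks about refunds, one about parking.
refund_policy = [0.9, 0.1, 0.0]
money_back_terms = [0.8, 0.2, 0.1]
office_parking = [0.0, 0.1, 0.9]

related = cosine_similarity(refund_policy, money_back_terms)
unrelated = cosine_similarity(refund_policy, office_parking)
```

With these toy values, the two refund-related chunks score far higher against each other than either does against the parking chunk, which is the property retrieval depends on.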
Step 3: Storage. The vectors are stored in a vector database — a database optimized for finding similar vectors quickly. Common options include Pinecone, Weaviate, Chroma, and pgvector (for teams already using PostgreSQL).
Step 4: Query-time retrieval. When a user asks a question, the question is also embedded into a vector. The vector database finds the stored chunks whose vectors are most similar to the query vector. Those chunks are the "relevant context" — the information most likely to help answer the question.
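Query-time retrieval can be sketched as a top-k similarity search. The version below scans an in-memory list as a stand-in for a vector database, which would typically use an approximate nearest-neighbor index rather than comparing against every stored vector. The function name retrieve, the toy vectors, and the sample chunks are all assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vector, index, k=3):
    """Return the k (score, chunk_text) pairs most similar to the query.

    index is a list of (chunk_text, vector) pairs.
    """
    scored = [(cosine_similarity(query_vector, vector), text) for text, vector in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest similarity first
    return scored[:k]


# Hypothetical index: in practice the vectors come from the embedding model.
index = [
    ("Refunds are issued within 30 days.", [0.9, 0.1, 0.0]),
    ("Parking passes are renewed yearly.", [0.0, 0.1, 0.9]),
    ("Money-back terms for annual plans.", [0.8, 0.2, 0.1]),
]
query_vector = [1.0, 0.0, 0.0]  # stand-in for the embedded user question
top_chunks = retrieve(query_vector, index, k=2)
```

The value of k is a tuning knob: too low and the right chunk may be missed, too high and the prompt fills with marginal material.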
Step 5: Augmented generation. The retrieved chunks are added to the prompt along with the user's question. The model now has both its pre-trained knowledge and the retrieved context, and produces an answer grounded in both.
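The augmentation step is mostly string assembly. Here is one possible prompt layout, offered as a sketch: the instruction wording, the numbered-context format, and the name build_prompt are assumptions, and the resulting string would be sent to whichever model API you use.

```python
def build_prompt(question, chunks):
    """Combine retrieved chunks and the user's question into a single prompt."""
    # Number each chunk so the model (and a human reviewer) can refer to sources.
    context = "\n\n".join(
        f"[{number}] {chunk}" for number, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_prompt(
    "How long do customers have to request a refund?",
    ["Refunds are issued within 30 days.", "Money-back terms for annual plans."],
)
```

Telling the model to admit when the context lacks the answer is a common guard against it falling back on guesses.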
Why semantic search is the key innovation
Traditional keyword search matches your query to documents that contain the same words. This works well when you know exactly how the information is expressed in the source document. It fails when you use different terminology, when the document uses jargon you didn't know to search for, or when the meaning of the query is clear but no single document matches the exact keywords.
Semantic search finds documents by meaning similarity, not keyword matching. You can ask "what's our policy on client data retention?" and retrieve the relevant section of your data governance policy even if it never uses the phrase "data retention" — because the embedding model recognizes semantic similarity between your question and the policy language.
This is a significant improvement over keyword search for knowledge retrieval tasks, and it's what makes RAG practical for large, varied knowledge bases.
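The keyword-search failure described above is easy to reproduce. In this sketch, the query and the policy sentence are invented examples, and the stopword list is a small hand-picked assumption; the point is only that the two texts share no content words even though they are about the same thing.

```python
import re

# Small illustrative stopword list (an assumption, not a standard one).
STOPWORDS = {"what", "s", "is", "our", "on", "the", "a", "of", "for", "are", "we"}


def keywords(text):
    """Lowercase the text and return its set of non-stopword words."""
    words = re.findall(r"[a-z]+", text.lower())
    return {word for word in words if word not in STOPWORDS}


def keyword_overlap(query, document):
    """Return the content words shared by the query and the document."""
    return keywords(query) & keywords(document)


query = "What's our policy on client data retention?"
policy_text = "Customer records are deleted seven years after the engagement ends."

shared = keyword_overlap(query, policy_text)  # empty: no words in common
```

A keyword engine has nothing to match here, while an embedding model can still place the question and the policy sentence close together because they mean similar things.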
RAG vs. fine-tuning: which one you actually need
The choice between RAG and fine-tuning is one of the most commonly misunderstood in enterprise AI. "Let's fine-tune the model on our data" sounds intuitive, but it's almost never the right solution for knowledge access problems.
Fine-tuning changes the model's weights — it modifies the model itself. It's appropriate when you want to change how the model behaves: its tone, its format, its reasoning style on a specific class of tasks. It's not appropriate for giving the model access to facts, because:
- Fine-tuning is expensive and slow — you can't update the model every time a policy changes.
- Models don't reliably "memorize" specific facts from fine-tuning; they absorb patterns of behavior.
- You can't easily audit or update what a fine-tuned model "knows."
- RAG, by comparison, is faster to build, cheaper to run, and much easier to maintain and update.
Use RAG when you want the model to know things. Use fine-tuning when you want the model to behave differently. In most business applications, you want the model to know things — use RAG.
What goes in the knowledge base
Almost any text-based content can feed a RAG system. The most common sources:
- Internal policy documents: HR policies, operational procedures, compliance requirements, product specifications.
- Past proposals and contracts: let the agent reference previous work when drafting new proposals or answering contract questions.
- Client records and history: meeting notes, email threads, case files — the agent can answer questions about specific clients based on the full history of the relationship.
- Product documentation: user manuals, technical specs, FAQs — the agent answers support questions accurately by citing the actual documentation.
- Research and market intelligence: industry reports, competitor information, market data — the agent synthesizes across sources rather than hallucinating general knowledge.
Where RAG fails and how to prevent it
RAG reduces hallucination, but it introduces a new dominant failure mode: retrieval failure. The model's answers are only as good as what gets retrieved. Common failure patterns:
Poor chunking: chunks that split mid-thought, or are too short to contain enough context, produce fragments that don't answer the question correctly even when retrieved. Spend time on chunking strategy — it has more impact on quality than most teams expect.
Low-quality source documents: garbage in, garbage out. If the knowledge base contains conflicting information, outdated documents, or poorly written policies, the agent will retrieve and cite that information. Curate what goes in.
Retrieval that misses the right document: sometimes the most relevant chunk isn't the most similar by vector — the query language and the document language are too different. Hybrid search (combining vector similarity with keyword matching) often improves retrieval for specialized terminology.
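One common way to combine vector and keyword results is reciprocal rank fusion (RRF), which merges ranked lists using only each item's position. The sketch below is one option among several, not necessarily what a given vector database does internally; the document IDs are hypothetical, and k=60 is a conventional default for the smoothing constant.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one ranked list.

    Each document scores 1 / (k + rank) for every list it appears in,
    so items ranked well by both searches rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by score descending; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_ranking = ["a", "b", "c"]   # best-first results from vector similarity
keyword_ranking = ["c", "a", "d"]  # best-first results from keyword matching

fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
# fused == ["a", "c", "b", "d"]
```

Because RRF uses ranks rather than raw scores, it avoids having to normalize similarity scores and keyword scores onto a common scale.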
Context window overflow: retrieving too many chunks can overflow the model's context window or bury the most relevant information. Tune the number of retrieved chunks based on your context window and the typical query complexity.
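A simple guard against overflow is to keep adding chunks in rank order until a token budget is used up. This sketch assumes the chunks arrive best-first and, by default, approximates token counts by counting words; the name select_within_budget and the default counter are illustrative assumptions, and a real system would pass in the model's tokenizer.

```python
def select_within_budget(ranked_chunks, max_tokens, count_tokens=None):
    """Keep the highest-ranked chunks whose combined size fits max_tokens."""
    if count_tokens is None:
        # Assumption: word count is a rough proxy for token count.
        count_tokens = lambda text: len(text.split())
    selected = []
    used = 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break  # stop at the first chunk that does not fit, preserving rank order
        selected.append(chunk)
        used += cost
    return selected


ranked = ["a b c", "d e", "f g h i"]  # best-first, sizes 3, 2, and 4 "tokens"
kept = select_within_budget(ranked, max_tokens=5)
```

The budget should leave room for the question, the instructions, and the model's answer, not just the retrieved chunks.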
RAG is one of the most practical tools in AI engineering, and it's the foundation of most knowledge-grounded agent systems. If you're building a system that needs to reason over your specific documents, data, or history, it's almost certainly part of the right architecture.
If you want to understand how RAG would work for your specific use case, book a call. We architect and build RAG systems as part of most of our agent engagements.