How RAG Works

Adityo Pratomo
3 min read · Dec 14, 2024



RAG, short for Retrieval-Augmented Generation, is a software engineering pattern that enhances the ability of an LLM (Large Language Model) to answer questions. It expands the model's knowledge by giving it access to newer and/or private data, without having to re-train or fine-tune the model.

Personally, I find RAG very interesting, because when I first learned about LLMs and GenAI, my initial thought was "Hey, can I use this to talk to the personal data I store at work? Will that breach company privacy? Does it adhere to security and compliance policy?" All of those questions were answered by RAG. My data stays in a private zone, while I only need the LLM's capability to understand my queries and send back relevant answers.

To implement this pattern, we make the LLM part of a bigger system. Instead of passing a user's query directly to the LLM, a RAG system first looks into an external database for data relevant to that query. The findings from this database are then added as context to the initial query. The LLM responds according to this augmented context, enabling it to give a more accurate answer, without having to be retrained or fine-tuned.
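The flow above can be sketched in a few lines of Python. Everything here is a hypothetical stand-in: the corpus is made up, the retriever is a toy keyword matcher rather than a real database lookup, and `call_llm` is a stub where an actual LLM API call would go.

```python
# Toy sketch of the RAG flow: retrieve relevant data, add it as
# context to the query, then ask the LLM.

CORPUS = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support hours are 9am to 5pm, Monday through Friday.",
    "The warranty covers manufacturing defects for one year.",
]

def retrieve(query: str, corpus: list[str], top_k: int = 1) -> list[str]:
    """Score each document by shared words with the query; keep the top_k."""
    q_words = set(query.lower().replace("?", "").split())
    scored = sorted(
        corpus,
        key=lambda doc: len(q_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_prompt(query: str, context: list[str]) -> str:
    """Prepend the retrieved documents to the user's query as context."""
    joined = "\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"

def call_llm(prompt: str) -> str:
    """Stub: a real implementation would call an LLM API here."""
    return f"[LLM answer based on a prompt of {len(prompt)} characters]"

query = "What is the refund policy?"
context = retrieve(query, CORPUS)
answer = call_llm(build_prompt(query, context))
```

In a real system only `retrieve` changes substantially (it becomes a vector database query); the overall shape of retrieve, augment, then generate stays the same.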

The following image is taken from the paper Retrieval-Augmented Generation for Large Language Models: A Survey by Yunfan Gao et al., and it gives a concise picture of how a RAG system typically works.

Typical workflow of an RAG system

Based on the above definition, in its basic form, a RAG system involves the following components:

  1. A front end for the user to send queries
  2. External documents relevant to the RAG system's domain
  3. An external database that stores those documents
  4. An LLM

Here’s a high-level diagram of how it might be implemented, along with the functionality of each building block. This was taken from my last presentation. If you ask me, this looks like a standard 3-tier web app, i.e. frontend, backend and database. Only now, instead of one database, we add an LLM to the fray.

Example architecture of an RAG system

In practice, the most commonly used type of database for RAG is a vector database, as it offers a straightforward way to find data closely related to the user’s query. A vector database stores data as vectors: numeric representations in which semantically related pieces of text sit close together, making it easy to seek out and retrieve data similar to the user’s query.
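"Close together" is usually measured with cosine similarity between vectors. The sketch below illustrates the idea with tiny made-up 3-dimensional vectors; real embeddings have hundreds or thousands of dimensions, and real vector databases use approximate nearest-neighbor indexes instead of scanning every entry.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Pretend these vectors came from an embedding model.
stored = {
    "cats are small pets": [0.9, 0.1, 0.0],
    "dogs are loyal pets": [0.8, 0.2, 0.1],
    "stocks fell sharply": [0.0, 0.1, 0.95],
}
query_vector = [0.85, 0.15, 0.05]  # e.g. the embedding of "popular pets"

best = max(stored, key=lambda text: cosine_similarity(stored[text], query_vector))
```

Both pet-related entries score near 1.0 against the query vector, while the finance entry scores far lower, which is exactly the behavior a vector database exploits at scale.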

A vector database can store any form of text-based data, as well as tables and images. The transformation from text to a vector is called an embedding, and how it is calculated depends on an embedding model. Documents are typically split into chunks, and the embedding model assigns each chunk a multi-dimensional vector based on the context in which its words appear. Upon query, the same embedding model converts the query into a vector, and the database finds stored vectors similar to it.
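The ingestion side of this can be sketched as: chunk, embed, store. The hash-based `embed` function below is a toy stand-in only; a real embedding model captures meaning, not raw word identity, and the chunk size and dimension count here are arbitrary.

```python
def chunk(text: str, size: int = 8) -> list[str]:
    """Split text into chunks of `size` words each."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text: str, dims: int = 16) -> list[float]:
    """Toy embedding: hash each word into one of `dims` buckets."""
    vec = [0.0] * dims
    for word in text.lower().split():
        vec[hash(word) % dims] += 1.0
    return vec

document = ("RAG systems retrieve relevant documents from a database "
            "and add them to the prompt before calling the model.")

# The "vector database": each chunk mapped to its vector.
index = {c: embed(c) for c in chunk(document)}
```

At query time, the same `embed` function would be applied to the user's query, and the index searched for the nearest vectors, as in the similarity example above.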

Of course, this is only a short introduction to RAG. We can develop the system further by improving its workflow or expanding each of its components. For example:

  1. Make the RAG system store every query and answer, so it grows more capable with each interaction
  2. Re-prompt based on the first query
  3. Build a more robust data ingestion pipeline
  4. Add better data validation, etc.
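The first idea above can be as simple as keeping a log of past queries and answers, so repeated questions are served from the log instead of hitting the LLM again. This is a minimal sketch; `answer_with_llm` is a hypothetical stand-in for the full retrieve-and-generate pipeline.

```python
# A query/answer log in front of the RAG pipeline.
history: dict[str, str] = {}

def answer_with_llm(query: str) -> str:
    """Stand-in for the full RAG pipeline (retrieve + prompt + LLM call)."""
    return f"[fresh LLM answer to: {query}]"

def answer(query: str) -> str:
    """Serve repeated questions from the log; otherwise call the pipeline."""
    key = query.strip().lower()
    if key not in history:
        history[key] = answer_with_llm(query)
    return history[key]

first = answer("What is RAG?")
second = answer("what is RAG?  ")  # normalized to the same key
```

A production system would go further, e.g. feeding the logged pairs back into the vector database so they become retrievable context themselves.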

In a future article, we’ll look into how to implement a simple RAG system for further experiments.
