Building RAG Locally
In the previous article, we discussed what RAG is and how it works under the hood. That should have given us enough context to start implementing our own RAG system. In this next part of the series, we're going to discuss how we can do it locally, without utilizing any cloud infrastructure.
But you may ask: why would we want to do it locally? Truth be told, this wasn't even my first preference. However, after giving presentations about building RAG, I kept getting questions about how to achieve it in a local setting. And in a way it does make sense. By building it locally, we reduce the number of additional variables to think about: cloud infra (whether managed or self-managed), networking, cost, and everything in between. In other words, we can focus solely on getting closer to the core of the solution we're trying to build, and set aside the complexities that only appear once we try to productize and operationalize it. All within our laptops.
Another supporting argument for this choice is that we can start with zero cost, making the most of the hardware we already have in our laptops. A powerful GPU might help speed up some processes, but as we'll see later, running a smaller LLM that doesn't require a GPU is enough for our learning journey.
That being said, let's choose the components to implement this RAG system. Referring to the block diagram in the previous article, we can now compose something that looks like this:
Here we have the following choices:
- LLM: Gemma 2B
- Vector DB: ChromaDB
- Embedding function: all-MiniLM-L6-v2 (ChromaDB’s default)
The arguments for the choices above are as follows.
- Gemma 2B is an open-source, lightweight LLM from Google that should be enough for local use cases where no GPU/CUDA support is in place
- ChromaDB provides a straightforward API for searching vectors within the DB, and it is essentially a RAG-first database.
- For the embedding function, as a starter, I'll use the default provided by ChromaDB. Note that I don't even have to define the embedding model; ChromaDB's API handles it for me behind the scenes.
Additionally, for the sake of simplicity, we will run this code in a local Jupyter notebook, so all of the interaction happens via code blocks. Again, we want to see how the RAG system components run, so we opt for a more under-the-hood approach instead of a polished product with a chat UI.
Implementation
I'll note a few interesting bits from it. As an additional note, I'm running this on my 2021 Lenovo Legion, with 16 GB of RAM, a Ryzen 7 CPU, and a GTX 1660 Ti; you can run it in your own local Jupyter notebook. The code is based on this Google Colab, which I tweaked a bit here and there to fit our scenario.
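Before running anything, the dependencies need to be installed. The list below is my best guess at the package names based on the imports used throughout this article (in particular, I'm assuming the HTML chunker is published as google-labs-html-chunker); adjust it to whatever works in your environment.
# Install dependencies from a notebook cell
# (package names are assumptions based on the imports used in the code below)
%pip install chromadb transformers torch accelerate bitsandbytes sentence-transformers google-labs-html-chunker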
Importing External Data for RAG
Here we'll download an article and store it locally before putting it into our vector database. I'm downloading a rather long article that describes the history of the NBA jump ball.
from google_labs_html_chunker.html_chunker import HtmlChunker
from urllib.request import urlopen

# Download the article's HTML
with urlopen(
    "https://www.theringer.com/nba/2024/2/21/24074472/jump-ball-rules-strategy-chicanery-history"
) as f:
    html = f.read().decode("utf-8")

# Chunk the file using HtmlChunker
chunker = HtmlChunker(
    max_words_per_aggregate_passage=200,
    greedily_aggregate_sibling_nodes=True,
    html_tags_to_exclude={"noscript", "script", "style"},
)
passages = chunker.chunk(html)
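As a quick sanity check (my own addition, not strictly necessary), we can look at how many passages the chunker produced and peek at the first one before loading anything into the database.
# Optional sanity check on the chunking result
print(f"Number of passages: {len(passages)}")
print(passages[0][:200])  # first 200 characters of the first passage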
Insert Data into Database
Now let's put it into our database. Here we set up a ChromaDB client and add the aforementioned result of chunking the HTML page. Of course, ChromaDB has to be installed first.
import chromadb
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(name="rag_cookbook_collection")
collection.add(documents=passages, ids=[str(i) for i in range(len(passages))])
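To confirm the insert went through, ChromaDB lets us count the records in a collection; a minimal check could look like this.
# The number of stored records should match the number of chunked passages
print(collection.count(), len(passages))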
Creating Prompt
Here we build the prompt: we query ChromaDB for the passages most relevant to the user's question and combine them with the question using a prompt template. The template also instructs the model to say so when the answer isn't in the provided context.
prompt_template = """Hi, please give me answer to the following question. Use the provided context below.
In case you can't find answer in the article, just respond "I could not find the answer based on the context you provided."
User question: {}
Context:
{}
"""
user_question = "Does the NBA have a formal rule on how to do a jump ball?"
results = collection.query(query_texts=user_question, n_results=3)
context = "\n".join(
    [f"{i+1}. {passage}" for i, passage in enumerate(results["documents"][0])]
)
prompt = prompt_template.format(user_question, context)
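Before handing the prompt to the LLM, it can be helpful to see what the retrieval step actually returned. The snippet below is my own addition: it prints each retrieved passage with its distance score, then the assembled prompt.
# Inspect the retrieved passages and the final prompt (illustrative)
for doc, dist in zip(results["documents"][0], results["distances"][0]):
    print(f"distance={dist:.4f} | {doc[:120]}")
print(prompt)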
Loading LLM
Here we load our LLM, Gemma 2B, using Hugging Face's transformers API.
from transformers import AutoTokenizer
import transformers
import torch
import bitsandbytes, accelerate  # used later when loading the 4-bit quantized model
model = "google/gemma-2-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    model_kwargs={
        "torch_dtype": torch.bfloat16,
    },
)
Prompting Result
In this step we run the pipeline for inference. Again, the pipeline is part of Hugging Face's API, where answering a question in a chat-style interaction is neatly abstracted away for us.
messages = [
    {"role": "user", "content": prompt},
]
prompt = pipeline.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.1)
print(outputs[0]["generated_text"][len(prompt):])
Experimentation
As previously explained, one thing we should look into after having this kind of setup is experimenting with the RAG system components at hand. For example, we can switch the LLM and use Gemma 7B instead. However, this makes the load time quite long if we try to do it without GPU support. This is where the GPU comes into play, as it helps speed up the process.
The code below loads Gemma 7B with Hugging Face's quantization API, which utilizes the GPU. This technique reduces memory usage when loading the model, as it uses a lower-precision data type so that the model fits in our limited memory.
model = "google/gemma-1.1-7b-it"
tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    model_kwargs={
        "torch_dtype": torch.bfloat16,
        "quantization_config": {"load_in_4bit": True},
    },
)
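If you are unsure whether your machine can actually take advantage of this, a quick check like the one below (my own addition, assuming a CUDA-capable setup) shows whether a GPU is visible and roughly how much memory the quantized model occupies.
# Optional: confirm a GPU is visible and check the model's approximate memory footprint
print(torch.cuda.is_available())
print(f"{pipeline.model.get_memory_footprint() / 1e9:.2f} GB")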
Another example: we can look at how a different embedding function impacts the results of storing and retrieving data. Essentially, an embedding function performs a particular kind of sentence transformation, and using a different function affects how information is stored and matched. For example, instead of using the default all-MiniLM-L6-v2, I can use all-mpnet-base-v2 and see how it affects the prompting result. The code is as follows; note that to use a different embedding function, we have to pass it explicitly as a parameter.
import chromadb
from chromadb.utils import embedding_functions

sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-mpnet-base-v2"
)
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(
    name="jump_ball_collection", embedding_function=sentence_transformer_ef
)
collection.add(documents=passages, ids=[str(i) for i in range(len(passages))])
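To actually compare the two embedding functions, we can run the same question against this new collection and eyeball how the retrieved passages differ from the earlier run. A minimal sketch, reusing the user_question defined earlier:
# Retrieve with the all-mpnet-base-v2 collection and compare against the earlier results
results_mpnet = collection.query(query_texts=user_question, n_results=3)
for doc in results_mpnet["documents"][0]:
    print(doc[:120])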
Closing
These experiments have brought us closer to having a RAG system with components we can tweak, without having to think about additional variables such as infrastructure architecture and cost. This is a good way to learn about RAG, and we end up with a very simple pipeline that we can modify in the future.
In the next entry of this series, we'll look into how we can use Vertex AI as the cloud infra abstraction to run the LLM and implement a RAG system. Until then, happy experimenting.