Building RAG with Vertex AI RAG Engine

Adityo Pratomo


Continuing from the previous entries, we can now look at how to implement a RAG system for a public-facing use case. Recalling from that article, a RAG system can be implemented by composing an LLM and a vector database to process prompts coming in from a user. On top of that, if we want this system to accept inputs from external users, we have to create a frontend where a user's input can be sent to the backend of the system (LLM + vector DB) and the backend's output is made visible to the user.

To implement this, we can use one of the offerings in Vertex AI, the RAG Engine, which already provides ready-to-use components for implementing RAG. In short, the required functionality is available through Vertex AI's Python SDK, which we can put into our code. As an overview, this is the block diagram of our solution.

RAG solution architecture

From an implementation perspective, here is a short explanation of each component of the solution:

  1. Frontend: a chat UI written in React, specifically using Next.js as the framework
  2. Backend: a server written in Python that performs three big tasks:
    - Acts as an API endpoint that the frontend hits whenever a user sends a chat message
    - Acts as a connector to the vector DB and LLM created by Vertex AI RAG Engine, via the Vertex AI SDK
    - Sends the LLM response back to the frontend
  3. Vertex AI: hosts the vector database and the LLM. It offers a ready-made solution through its RAG Engine, so we'll use those off-the-shelf components

To simplify our journey, both the frontend and backend components will run locally. Vertex AI, however, runs on Google Cloud, so you'll need to enable the Vertex AI API in your GCP project.
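Before calling the SDK from the backend, you'll also need to initialize it against your project. Here's a minimal setup sketch, assuming the google-cloud-aiplatform package is installed; the project ID and region below are placeholders you'd replace with your own:

# Initialize the Vertex AI SDK (sketch; replace the project ID and region with your own values)
import vertexai

vertexai.init(project="your-gcp-project-id", location="us-central1")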

Show Me the Code

The full code is available in this GitHub repo, so I suggest you check it out. But let me highlight a few things in the code that showcase how RAG Engine in Vertex AI can help us build a RAG solution faster.

Creating Vector Database

In only a few lines of code, we can create a vector DB, import an external file from Google Drive into it, and have it ready to be accessed. The term RAG Engine uses for a vector DB is a Corpus, and it can store vectors generated from multiple files stored in either Google Drive or Google Cloud Storage. We can upload HTML, PDF, and Google Docs files, including Sheets and Slides.

Note that we can also set the embedding model as well as configure the parameters used to chunk texts into vectors. We don't specify a DB here; we'll use the built-in managed vector DB offered by RAG Engine.

from vertexai.preview import rag

external_file = ["https://drive.google.com/file/d/xxx/"]
display_name = "my-rag-corpus"  # Example display name

# Configure the embedding model, for example "text-embedding-004"
embedding_model_config = rag.EmbeddingModelConfig(
    publisher_model="publishers/google/models/text-embedding-004"
)

# Create the RagCorpus
rag_corpus = rag.create_corpus(
    display_name=display_name,
    embedding_model_config=embedding_model_config,
)

# Import files into the RagCorpus
rag.import_files(
    rag_corpus.name,
    external_file,
    chunk_size=1024,  # Optional
    chunk_overlap=100,  # Optional
    max_embedding_requests_per_min=900,  # Optional
)

Additionally, we don't need to create a RAG Corpus every time. To use a previously created RAG Corpus, we can call the get_corpus function instead of create_corpus, as below:

rag_corpus = rag.get_corpus(name=corpus_name)
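If you don't have the corpus resource name at hand, the SDK can also list existing corpora. A minimal sketch, assuming rag.list_corpora is available in the SDK version you're using:

# List existing corpora and print their display names and resource names
# (sketch; assumes at least one corpus already exists in the project)
for corpus in rag.list_corpora():
    print(corpus.display_name, corpus.name)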

Interfacing with LLM

This has two parts. First, we create a RAG retrieval tool, which interfaces with the previously created RAG Corpus. This is where we perform the search on the vector DB. Second, we describe which LLM we want to use. Here we use Gemini 1.5 Flash, but we can use any model available in Vertex AI Model Garden.

Also, if required, we can retrieve from multiple RAG Corpora.

from vertexai.generative_models import GenerativeModel, Tool

# Create a RAG retrieval tool backed by the RagCorpus
rag_retrieval_tool = Tool.from_retrieval(
    retrieval=rag.Retrieval(
        source=rag.VertexRagStore(
            rag_resources=[
                rag.RagResource(
                    rag_corpus=rag_corpus.name
                )
            ],
            similarity_top_k=3,  # Optional
            vector_distance_threshold=0.5,  # Optional
        ),
    )
)

# Create a Gemini 1.5 Flash model instance that uses the retrieval tool
rag_model = GenerativeModel(
    model_name="gemini-1.5-flash-001", tools=[rag_retrieval_tool]
)

Crafting Response

Finally, to generate a response from the user's prompt, we can use the following code:

# Generate response
response = rag_model.generate_content(prompt)

# Return response
return response.text

All that remains is to bundle the RAG retrieval tool and response generation into a function and put it in our server code, to be invoked every time a specific endpoint is hit from the frontend. To keep this writing short, I advise you to take a look at the GitHub repo above for more detail.
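To give a rough idea of what that glue looks like, here is a minimal sketch of such an endpoint. It assumes FastAPI as the server framework and wraps the snippets above in a hypothetical query_rag helper exposed at a /chat route; the actual repo may structure this differently:

# Minimal server sketch (assumes FastAPI and uvicorn are installed;
# the real repo may use a different framework and layout)
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    prompt: str

def query_rag(prompt: str) -> str:
    # Reuses the rag_model created earlier with the RAG retrieval tool attached
    response = rag_model.generate_content(prompt)
    return response.text

@app.post("/chat")
def chat(request: ChatRequest):
    # Called by the frontend whenever the user sends a message
    return {"answer": query_rag(request.prompt)}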

Lessons Learned

From the experience of creating this application, I can see how much abstraction is provided by RAG Engine, so we don't have to think a lot about backend components such as the embedding model, the LLM, and the chunking mechanism. We can instead focus on creating the pipeline and bundling the RAG mechanism into the backend code.

The way RAG Engine works here is similar to what I explained in the previous writing, but seeing that we can have something running in the cloud with so little code puts a smile on my face.

Cost-wise, it's also not too much, as inference like this is quite cheap. Again, many of the infrastructure components are already abstracted by Vertex AI. And that's another advantage of this approach: I get first-hand experience with Vertex AI. I was previously confused by the solutions offered within Vertex AI, because there's a lot going on there. But it turns out the best way to learn is by taking a small slice of the cake, and that's where the gateway is.

What an exciting time to learn AI!
