RAG Basics: A Beginner’s Guide to Retrieval-Augmented Generation
If you are new to this topic, I would suggest going through these two earlier editions of the newsletter before you start reading:
- Build Your Business Specific LLMs Using RAG: covers the fundamentals
- Chat with Knowledge Base through RAG: technical code to build RAG-based chatbots
This week, we will start with a very basic RAG built from scratch, based on the reference repository from Mistral AI. The goal is to clarify your understanding of RAG’s internal workings and equip you with the foundational knowledge needed to construct a RAG pipeline with minimal dependencies.
Let’s start with installing the required packages:
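(The exact package list below is an assumption based on the steps in this walkthrough: the mistralai client, faiss-cpu for the vector index, requests for fetching the source text, and numpy.)

```
pip install mistralai faiss-cpu requests numpy
```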
Now, to get the data from any article or document, or web source:
```python
import requests

# Download the source document (a plain-text book from Project Gutenberg)
response = requests.get("https://www.gutenberg.org/cache/epub/1513/pg1513.txt")
text = response.text
```
Then, split the data into chunks: in a Retrieval-Augmented Generation (RAG) system, breaking the document into smaller chunks is essential so that the most relevant information can be identified and retrieved efficiently. In this example, we split the text by characters and grouped 2048 characters into each chunk.
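As a minimal sketch of this character-based splitting (2048 characters per chunk, as described above):

```python
chunk_size = 2048  # characters per chunk

# Slice the full text into fixed-size character chunks
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

print(f"Split the document into {len(chunks)} chunks")
```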
Key points:
Chunk size: To achieve optimal performance in RAG, we may need to customize or experiment with different chunk sizes and overlaps based on the specific use case. Smaller chunks can be more beneficial for retrieval processes, as larger chunks often contain filler text that can obscure semantic representation. Using smaller chunks allows the RAG system to identify and extract relevant information more effectively and accurately. However, be mindful of the trade-offs, such as increased processing time and computational resources, that come with using smaller chunks.
How to split: The simplest method is to split the text by character, but other options are based on the use case and document structure. To avoid exceeding token limits in API calls, you might need to split the text by tokens. Consider splitting the text into sentences, paragraphs, or HTML headers to maintain chunk cohesiveness. When working with code, it’s often best to split by meaningful code chunks, such as using an Abstract Syntax Tree (AST) parser.
Creation of embeddings for each text chunk:
Text embeddings convert text into numeric representations in a vector, enabling the model to understand semantic relationships between words. Words with similar meanings will be closer in this space, which is crucial for tasks like information retrieval and semantic search.
To generate these embeddings, we use Mistral AI’s embeddings API endpoint with the mistral-embed model. We create a function called get_text_embedding to retrieve the embedding for a single text chunk. Then, we use list comprehension to apply this function to all text chunks and obtain their embeddings efficiently.
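A minimal sketch of this step is shown below. It assumes the mistralai Python client (v1.x), where embeddings are requested through client.embeddings.create, and an API key stored in the MISTRAL_API_KEY environment variable; older client versions use a different call signature.

```python
import os
import numpy as np
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

def get_text_embedding(text_chunk):
    # Request an embedding for a single chunk using the mistral-embed model
    response = client.embeddings.create(model="mistral-embed", inputs=text_chunk)
    return response.data[0].embedding

# Embed every chunk; the result is a (num_chunks, embedding_dim) array
text_embeddings = np.array([get_text_embedding(chunk) for chunk in chunks])
```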
Loading into a vector database: after getting the embeddings in place, we need to store them in a vector database so they can be searched efficiently.
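Here is a minimal sketch using Faiss as the vector store; the choice of Faiss with a flat L2 index is an assumption for illustration, and any vector database would work.

```python
import faiss

# Create a flat (exact-search) L2 index sized to the embedding dimension
dimension = text_embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)

# Add all chunk embeddings to the index (Faiss expects float32)
index.add(text_embeddings.astype("float32"))
```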
The user’s question also needs to be embedded, so that we can retrieve the most similar chunks for it from the vector database.
To perform a search on the vector database, we use the index.search method, which requires two arguments: the vector of the question embeddings and the number of similar vectors to retrieve. This method returns the distances and indices of the most similar vectors to the question vector in the database. Using these indices, we can then retrieve the corresponding relevant text chunks.
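Continuing the Faiss-based sketch, retrieving the two most similar chunks for a question could look like this (the question text and the value of k are only illustrative):

```python
question = "Who are the two feuding families?"  # example question

# Embed the question with the same model used for the chunks
question_embedding = np.array([get_text_embedding(question)]).astype("float32")

# Retrieve the distances and indices of the 2 nearest chunk vectors
distances, indices = index.search(question_embedding, 2)

# Map the indices back to the original text chunks
retrieved_chunks = [chunks[i] for i in indices[0]]
```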
There are some common methods:
- Similarity Search with Embeddings: This method uses embeddings to find similar text chunks based on their vector representations. It’s a straightforward approach that directly compares the vector distances.
- Filtering with Metadata: If metadata is available, it can be beneficial to filter the data based on this metadata before performing the similarity search. This can narrow the search space and improve the relevance of the results.
- Statistical Retrieval Methods: TF-IDF (Term Frequency-Inverse Document Frequency) evaluates the importance of a term in a document relative to a collection of documents. It uses the frequency of terms to identify relevant text chunks. BM25 is a ranking function based on term frequency and document length, which provides a more nuanced approach to identifying relevant text chunks compared to TF-IDF.
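As an illustration of the statistical route, a TF-IDF retriever over the same chunks can be sketched with scikit-learn; scikit-learn is not part of the main walkthrough and is used here only as an alternative to embedding-based search.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Build a TF-IDF matrix over the text chunks
vectorizer = TfidfVectorizer()
chunk_vectors = vectorizer.fit_transform(chunks)

# Score every chunk against the question and keep the top 2
question_vector = vectorizer.transform([question])
scores = cosine_similarity(question_vector, chunk_vectors)[0]
top_chunks = [chunks[i] for i in scores.argsort()[::-1][:2]]
```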
Combine Context and Question in a Prompt to Generate a Response:
Lastly, we can use the retrieved text chunks as a context within the prompt to generate a response.
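A sketch of this final step, again assuming the mistralai v1.x client; the prompt template and the mistral-large-latest model name are illustrative choices rather than fixed requirements.

```python
# Combine the retrieved chunks into a single context string
context = "\n\n".join(retrieved_chunks)

prompt = f"""
Context information is below.
---------------------
{context}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {question}
Answer:
"""

# Send the combined prompt to a Mistral chat model and print the answer
chat_response = client.chat.complete(
    model="mistral-large-latest",
    messages=[{"role": "user", "content": prompt}],
)
print(chat_response.choices[0].message.content)
```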
Prompting Techniques for Developing a RAG System: In developing a Retrieval-Augmented Generation (RAG) system, various prompting techniques can significantly enhance the model’s performance and the quality of its responses.
Here are some key techniques that can be applied:
- Few-Shot Learning: Few-shot learning involves providing the model with a few task examples to guide its responses. By including these examples in the prompt, the model can better understand the desired format and context, leading to more accurate and relevant answers. Example: Suppose you are building an RAG system to answer questions about historical events. The prompt could include a few examples of questions and answers to show the model how to respond appropriately.
- Explicit Instructions: Explicitly instructing the model to format its answers in a specific way can help standardize the output, making it more consistent and easier to interpret. This can be especially useful for tasks that require a specific structure, such as generating reports or summaries. Example: If you need the model to provide responses in bullet points or a numbered list, you can include these instructions in the prompt to ensure the output follows the desired format.
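To make both techniques concrete, here is a hypothetical prompt template that combines a couple of few-shot examples with an explicit formatting instruction; the question/answer pairs are invented for illustration, and {context} and {question} are placeholders to be filled in at query time.

```python
few_shot_prompt = """
You answer questions about historical events using only the provided context.
Answer in exactly two bullet points: the first stating the answer, the second naming the part of the context that supports it.

Example 1:
Question: When did World War II end?
Answer:
- World War II ended in 1945.
- The context describes the surrender of Japan in September 1945.

Example 2:
Question: Who was the first President of the United States?
Answer:
- George Washington was the first President of the United States.
- The context mentions his inauguration in 1789.

Context:
{context}

Question: {question}
Answer:
"""
```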
Head over to this link, and you can try building your first simple RAG.