Large language models (LLMs) have created huge opportunities for aspiring researchers and AI enthusiasts. But we need to understand one important behavior of these models: they hallucinate. They sometimes produce responses that are inaccurate, biased, or simply made up. Several techniques exist to mitigate this behavior, and retrieval-augmented generation (RAG) is considered one of the most effective. That said, building a robust RAG application depends on more than just the choice of LLM: the embedding model, chunk size and chunking strategy, AI framework, vector database, retrieval strategy, and more all play a role.
Today, we are going to see how semantic chunking helps RAG applications retrieve the most accurate and contextually relevant chunks.
What is Retrieval Augmented Generation (RAG)?
Retrieval-Augmented Generation (RAG) is an approach that leverages external data stored in a database to respond to user queries, enriching response generation with additional context. RAG combines retrieval techniques with generative models to produce contextually relevant responses.
Think of a scenario where you would like to get custom responses from your AI application. First, the organization's documents are converted into embeddings through an embedding model and stored in a vector database. When a query is sent to the application, it is converted into a query embedding and matched against the vector database using vector similarity search to find the most similar objects. This way, your LLM-powered application is far less likely to hallucinate, since you have instructed it to ground its responses in your custom data.
One simple use case is a customer support application: the custom data is stored in a vector database, and when a user query comes in, the application generates a response grounded in your products or services rather than a generic answer.
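To make this concrete, here is a minimal, framework-agnostic sketch of the idea: embed a few documents and a query with a sentence-transformer, then pick the closest document by cosine similarity. The model name and sample texts are placeholders, not part of the tutorial's actual code.

```python
from sentence_transformers import SentenceTransformer, util

# Any sentence-transformer works here; this model name is only an example.
model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Our premium plan includes 24/7 phone support.",
    "Refunds are processed within 5 business days.",
    "The mobile app supports offline mode on Android and iOS.",
]

# 1. Embed the documents (in a real app these vectors live in a vector database).
doc_embeddings = model.encode(documents, convert_to_tensor=True)

# 2. Embed the user query and run a vector similarity search.
query = "How long does a refund take?"
query_embedding = model.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_embedding, doc_embeddings)[0]

# 3. The most similar document becomes the grounding context for the LLM.
print(documents[int(scores.argmax())])
```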
A RAG pipeline involves three critical components: retrieval, augmentation, and generation.
Retrieval:
This component fetches the relevant information from an external knowledge base, such as a vector database, for a given user query. It is crucial because it is the first step in producing meaningful and contextually correct responses.
Augmentation:
This step combines the retrieved information with the user query, adding the relevant context to the prompt that will be sent to the model.
Generation:
Finally, the output is generated by a large language model (LLM), which combines its own knowledge with the provided context to produce an apt response to the user's query.
These three components form the basis of a RAG pipeline, helping users get the contextually rich, accurate responses they are looking for. That is why RAG is so valuable for building chatbots, question-answering systems, and similar applications.
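Here is a rough sketch of how the three stages fit together. The `retriever` and `llm` objects and the prompt wording are placeholders; later in the tutorial, LangChain and Groq handle this wiring for us.

```python
def answer(query, retriever, llm):
    # Retrieval: fetch the chunks most similar to the query from the vector store.
    chunks = retriever.get_relevant_documents(query)

    # Augmentation: stuff the retrieved context into the prompt.
    context = "\n\n".join(chunk.page_content for chunk in chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

    # Generation: the LLM produces the final, grounded response.
    return llm.invoke(prompt)
```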
Chunking in RAG applications
Chunking in Retrieval-Augmented Generation (RAG) applications involves breaking down large pieces of data into smaller, manageable segments or "chunks." This process enhances the efficiency and accuracy of information retrieval by enabling the model to handle more precise and relevant portions of data.
In RAG systems, when a query is made, the model searches through these chunks to find the most relevant information, rather than going through an entire document. This not only speeds up the retrieval process but also improves the quality of the generated responses by focusing on the most pertinent information. Chunking is especially useful in scenarios where documents are lengthy or contain diverse topics, as it ensures that the retrieved data is contextually appropriate and precise.
Chunking strategies
Chunking strategies are essential in RAG applications to improve retrieval efficiency and accuracy. Which strategy to use depends on the use case. Here are some prominent chunking strategies.
Fixed-Length Chunking:
The fixed-length chunking strategy divides the text into chunks of a fixed number of words or characters. It is simple to implement and ensures uniform chunk sizes, but it may sometimes split relevant information across chunks.
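As a quick illustration, LangChain's CharacterTextSplitter can do this kind of fixed-size splitting. The chunk size, overlap, and placeholder text below are arbitrary values, not recommendations:

```python
from langchain.text_splitter import CharacterTextSplitter

long_text = "..."  # the raw document text you want to chunk

splitter = CharacterTextSplitter(
    separator=" ",     # split on whitespace so words are not cut in half
    chunk_size=200,    # maximum number of characters per chunk
    chunk_overlap=20,  # small overlap to soften the split-context problem
)
chunks = splitter.split_text(long_text)
```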
Recursive Character Chunking:
Recursive Character Chunking is a strategy for dividing the data into smaller segments by progressively breaking it down based on character count while ensuring semantic coherence. Initially, a large document is divided into sizable chunks at character boundaries. These initial chunks are then recursively split further into smaller segments, each time preserving meaningful units like sentences or phrases to maintain context.
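A minimal sketch with LangChain's RecursiveCharacterTextSplitter; the separators are listed explicitly here only to show the recursive fallback order, and the sizes are placeholders:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

long_text = "..."  # the raw document text you want to chunk

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    # Tried in order: paragraph breaks first, then lines, sentences, words.
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(long_text)
```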
Agentic Chunking:
Agentic chunking is a strategy in RAG that leverages autonomous AI agents to dynamically segment text into coherent and contextually relevant chunks.
Document Based Chunking:
Instead of segmenting text by arbitrary character counts or recursive methods, document-based chunking creates chunks that align with the logical sections of the document, such as headings, paragraphs, or subsections.
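For example, when the source is Markdown, LangChain's MarkdownHeaderTextSplitter applies this idea by splitting on headings. The heading levels below are assumptions about the document's structure:

```python
from langchain.text_splitter import MarkdownHeaderTextSplitter

markdown_text = "..."  # a Markdown document with # / ## section headings

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "section"), ("##", "subsection")]
)
# Each resulting Document covers one logical section, with the heading
# hierarchy preserved in its metadata.
docs = splitter.split_text(markdown_text)
```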
Semantic Chunking:
Semantic Chunking is a method that focuses on extracting and preserving the semantic meaning within text segments. By utilizing embeddings to capture the underlying semantics, this approach assesses the relationships between different chunks to ensure that similar content is kept together.
By focusing on the text's meaning and context, Semantic Chunking significantly enhances retrieval quality. It's ideal for maintaining semantic integrity, ensuring coherent and relevant information retrieval.
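A minimal sketch using LangChain's experimental SemanticChunker with a Hugging Face embedding model (the model name and threshold type are examples, not fixed requirements):

```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_community.embeddings import HuggingFaceEmbeddings

long_text = "..."  # the raw document text you want to chunk

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# SemanticChunker embeds sentences and starts a new chunk wherever the
# similarity between consecutive sentences drops past a breakpoint threshold.
semantic_splitter = SemanticChunker(
    embeddings, breakpoint_threshold_type="percentile"
)
semantic_docs = semantic_splitter.create_documents([long_text])
```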
So let's see how semantic chunking works in practice through a simple tutorial.
Semantic Chunking Tutorial
We will take a publicly available PDF, chunk it using both naive and semantic chunking, store the chunks in our database, and build a RAG pipeline on top of each strategy to compare how they respond.
Tech Stack Used
LangChain - Open-source AI framework used to load, split, and embed the data.
SingleStore - Serves as the vector database for storing the embeddings; we also use its notebooks feature to run our code. Sign up to get a free account.
Groq and HuggingFace - Providers of our LLM and embedding models.
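Inside the notebook, a setup cell along these lines installs the libraries and sets credentials. The package list is an assumption based on the stack above; GROQ_API_KEY is the variable the Groq client expects, and outside a SingleStore notebook you would also supply a connection string (commonly via SINGLESTOREDB_URL):

```python
!pip install langchain langchain-community langchain-experimental langchain-groq sentence-transformers pypdf --quiet

import os

os.environ["GROQ_API_KEY"] = "your-groq-api-key"  # from the Groq console
# Outside a SingleStore notebook, also point the client at your database, e.g.:
# os.environ["SINGLESTOREDB_URL"] = "user:password@host:3306/your_database"
```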
Once you sign up for SingleStore, create a database under your workspace.
Then create a new notebook by going to the ‘Data Studio’ tab.
Once you create a new notebook, select the workspace and the respective database from the dropdown as shown below.
Then you can start running your code.
The complete notebook code is available here.
We use both naive and semantic chunking to chunk our data, then construct a RAG pipeline for each.
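Condensed, the chunking step can look like the sketch below. The PDF path, embedding model, and chunk sizes are placeholders; the linked notebook has the exact values used in the tutorial.

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_experimental.text_splitter import SemanticChunker
from langchain_community.embeddings import HuggingFaceEmbeddings

# Load the publicly available PDF (the file path is a placeholder).
docs = PyPDFLoader("example.pdf").load()

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# Naive chunking: fixed-size character windows with a little overlap.
naive_chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100
).split_documents(docs)

# Semantic chunking: split wherever the embedding similarity between
# consecutive sentences drops.
semantic_chunks = SemanticChunker(embeddings).split_documents(docs)
```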
RAG pipeline for semantic chunking
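A sketch of the semantic-chunking pipeline, continuing from the chunking step above. The table name and Groq model name are assumptions rather than the tutorial's exact choices:

```python
from langchain_community.vectorstores import SingleStoreDB
from langchain_groq import ChatGroq
from langchain.chains import RetrievalQA

# Store the semantic chunks in SingleStore and expose them as a retriever.
semantic_store = SingleStoreDB.from_documents(
    semantic_chunks, embeddings, table_name="semantic_chunks"
)
semantic_retriever = semantic_store.as_retriever(search_kwargs={"k": 3})

llm = ChatGroq(model="llama3-8b-8192", temperature=0)  # model name is an example

semantic_rag = RetrievalQA.from_chain_type(llm=llm, retriever=semantic_retriever)
```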
RAG pipeline for naive chunking
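The naive pipeline is built the same way, just over the fixed-size chunks (again with assumed names):

```python
naive_store = SingleStoreDB.from_documents(
    naive_chunks, embeddings, table_name="naive_chunks"
)
naive_retriever = naive_store.as_retriever(search_kwargs={"k": 3})
naive_rag = RetrievalQA.from_chain_type(llm=llm, retriever=naive_retriever)
```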
Then we ask three questions against each pipeline and compare the responses.
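For instance (the questions here are placeholders, not the ones from the notebook):

```python
questions = [
    "What is the main topic of the document?",
    "What methodology does the document describe?",
    "What conclusions does the document reach?",
]

for q in questions:
    print("Q:", q)
    print("Semantic:", semantic_rag.invoke(q)["result"])
    print("Naive:   ", naive_rag.invoke(q)["result"])
```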
By the end of this tutorial, you should find that the responses generated using semantic chunking are more detailed and contextually rich than those generated using naive chunking.
This illustrates semantic chunking's strength in extracting and presenting nuanced information, making it a strong choice for information retrieval in complex scenarios.