Large Language Models (LLMs) have revolutionized natural language processing, enabling applications that range from automated customer service to content generation. However, optimizing their performance remains a challenge due to issues like hallucinations - where the model generates plausible but incorrect information. This article delves into key strategies to enhance the performance of your LLMs, starting with prompt engineering and moving through Retrieval-Augmented Generation (RAG) and fine-tuning techniques.
Each method provides unique benefits: prompt engineering refines input for clarity, RAG leverages external knowledge to fill gaps, and fine-tuning tailors the model to specific tasks and domains. Understanding and applying these strategies can significantly improve the accuracy, reliability, and efficiency of your LLM applications.
RAG & Fine-tuning
While LLMs are prone to hallucination, there are some groundbreaking approaches we can use to provide more context to them and reduce or mitigate the impact of hallucinations.
Every LLM journey begins with Prompt Engineering.
Then come RAG and fine-tuning. But it is important to understand how these three techniques work.
RAG comes into play when the LLM needs an extra layer of context.
This method is about leveraging external knowledge to enhance the model's responses. However, it's not always the right tool.
Invoke RAG when evaluations reveal knowledge gaps or when the model requires a wider breadth of context.
Fine-tuning is about specialization, adapting your LLM to your application's specific task, unique voice and context.
The decision to fine-tune comes after you've gauged your model's proficiency through thorough evaluations. When your LLM needs to understand industry-specific jargon, maintain a consistent personality, or provide in-depth answers that require a deeper understanding of a particular domain, fine-tuning is your go-to process.
Fine-tuning an LLM is a nuanced process that can significantly elevate the performance of your model - if done correctly.
Chunk Sizes Matter in LLMs
Chunking is the process of dividing a large corpus of text data into smaller, semantically meaningful units.
The size of chunks is critical in semantic retrieval tasks due to its direct impact on the effectiveness and efficiency of information retrieval from large datasets and complex language models.
Different chunk sizes can significantly influence semantic retrieval results in the following ways:
Smaller chunk sizes offer finer granularity by capturing more detailed information within the text. However, they may lack context, leading to potential ambiguity or incomplete understanding.
Larger chunk sizes provide a broader context, enabling a comprehensive view of the text. While enhancing coherence, they may also introduce noise or irrelevant information. Optimal chunk sizes balance granularity and coherence, ensuring that each chunk represents a coherent semantic unit. But there doesn't seem to be a one-size-fits-all optimal chunk size; the ideal size depends on the specific use case and the desired outcome of the system.
One effective process for retrieving the best results is as follows (a minimal sketch follows these steps):
→ Chunk up the same document in a bunch of different ways, say with chunk sizes: 128, 256, 512, and 1024.
→ During retrieval, fetch relevant chunks from each retriever and ensemble the results.
→ Use a reranker to rank results.
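Here is a minimal sketch of that idea in Python. It assumes the sentence-transformers package is available; the model names, the chunk sizes, and the naive character-based splitter are illustrative assumptions, not a prescribed setup.

```python
# pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder

embedder = SentenceTransformer("all-MiniLM-L6-v2")               # assumed embedding model
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed reranker

def chunk(text: str, size: int) -> list[str]:
    # Naive character-based splitter; real systems usually split on tokens or sentences.
    return [text[i:i + size] for i in range(0, len(text), size)]

def retrieve(query: str, document: str, sizes=(128, 256, 512, 1024), top_k=3) -> list[str]:
    q_vec = embedder.encode([query])[0]
    candidates = []
    for size in sizes:
        chunks = chunk(document, size)
        vecs = embedder.encode(chunks)
        # Cosine similarity between the query and every chunk of this size.
        sims = vecs @ q_vec / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q_vec))
        best = np.argsort(sims)[::-1][:top_k]
        candidates.extend(chunks[i] for i in best)
    # Rerank the pooled candidates from all chunk sizes with a cross-encoder.
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = [c for _, c in sorted(zip(scores, candidates), reverse=True)]
    return ranked[:top_k]
```

The key point is that candidates from every chunk size compete in the same reranking step, so the system is not locked into a single granularity.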
Chunks are usually converted into vector embeddings that capture their contextual meaning and support accurate retrieval.
Here is my article on understanding vector embeddings
How to Fine-tune Your Model?
Fine-tuning involves using a Large Language Model as a base and further training it with a domain-based dataset to enhance its performance on specific tasks.
Let's take as an example a model that detects sentiment in tweets. Instead of creating a new model from scratch, we could take advantage of the natural language capabilities of GPT-3 and further train it with a dataset of tweets labeled with their corresponding sentiment.
This would improve the model on our specific task of detecting sentiment in tweets.
This process reduces computational costs, eliminates the need to develop new models from scratch, and makes the resulting models more effective for real-world applications tailored to specific needs and goals.
➤ Supervised Fine-tuning: This common method involves training the model on a labeled dataset relevant to a specific task, like text classification or named entity recognition.
➤ Few-shot Learning: In situations where it's not feasible to gather a large labeled dataset, few-shot learning comes into play. This method uses only a few examples to give the model a context of the task, thus bypassing the need for extensive fine-tuning.
➤ Transfer Learning: While all fine-tuning is a form of transfer learning, this specific category is designed to enable a model to tackle a task different from its initial training. It utilizes the broad knowledge acquired from a general dataset and applies it to a more specialized or related task.
➤ Domain-specific Fine-tuning: This approach focuses on preparing the model to comprehend and generate text for a specific industry or domain. By fine-tuning the model on text from a targeted domain, it gains better context and expertise in domain-specific tasks.
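As a rough sketch, here is what supervised fine-tuning for the tweet-sentiment example above could look like with the Hugging Face transformers and datasets libraries. The base model, dataset name, and hyperparameters are illustrative assumptions, not recommendations.

```python
# pip install transformers datasets
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Assumed dataset: tweets labeled with sentiment (here the public "tweet_eval" benchmark).
dataset = load_dataset("tweet_eval", "sentiment")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")  # assumed base model

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

tokenized = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=3  # negative / neutral / positive
)

args = TrainingArguments(
    output_dir="tweet-sentiment-model",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
)
trainer.train()
```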
Know more about fine-tuning strategies & best practices in this article
Reinforcement Learning with Human Feedback (RLHF)
RLHF is one of the most effective approaches for training and aligning LLMs.
Did you know that ChatGPT uses RLHF?
Yes. To generate conversational, natural-sounding answers for the person making the query, ChatGPT uses RLHF. Under the hood, ChatGPT is built on large language models (LLMs) trained on a massive amount of data to predict the next word in a sentence.
RLHF is an iterative process because collecting human feedback and refining the model with reinforcement learning is repeated for continuous improvement.
With Reinforcement Learning with Human Feedback (RLHF), you improve model precision by aligning its outputs with human feedback.
Instead of providing human-curated prompt/response pairs (as in instruction tuning), a reward model provides feedback through its scoring mechanism about the quality and alignment of the model's response.
This mimics a human providing feedback, but in a cost-optimized way.
The model generates a response to a prompt sampled from a distribution.
The model's response is scored by the reward model, and based on the reward, the RL policy updates the weights of the model.
The RL policy is designed to maximize the reward.
In addition to maximizing reward, there is another constraint added to prevent excessive divergence from the underlying model's behavior.
This is done by comparing the responses of the pre-trained model and the trained model using a KL-divergence score and adding it as a penalty to the objective function.
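A minimal sketch of that KL-penalized objective, assuming we already have token log-probabilities from the current policy and from the frozen pre-trained (reference) model; the beta value and the numbers are arbitrary illustrative choices.

```python
import torch

def penalized_reward(reward_score: torch.Tensor,
                     policy_logprobs: torch.Tensor,
                     ref_logprobs: torch.Tensor,
                     beta: float = 0.1) -> torch.Tensor:
    """Reward-model score minus a KL penalty that keeps the policy close to the reference model."""
    # Per-token KL estimate between the trained policy and the frozen reference model.
    kl = (policy_logprobs - ref_logprobs).sum()
    # The RL policy maximizes this quantity: high reward, low divergence from the base model.
    return reward_score - beta * kl

# Toy usage with made-up numbers for a 4-token response.
reward = torch.tensor(1.8)                            # score from the reward model
policy_lp = torch.tensor([-1.2, -0.8, -0.5, -0.9])    # log-probs under the trained policy
ref_lp = torch.tensor([-1.3, -1.0, -0.6, -1.1])       # log-probs under the pre-trained model
print(penalized_reward(reward, policy_lp, ref_lp))
```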
Know more about model training patterns in this article
RAG isn't a Silver Bullet!
Yes, RAG is often the cheapest way to improve LLM outputs,
BUT that may not always be the case.
Here is a flowchart guiding the decision on whether to use Retrieval-Augmented Generation (RAG); a simple code rendering of the same logic follows the list.
⮕ Dataset Size and Specificity:
If the dataset is large and diverse, proceed with considering RAG.
If the dataset is small and specific, do not use RAG.
⮕ For Large and Diverse Datasets:
If contextual information is needed, use RAG.
If you can handle increased complexity and latency, use RAG.
If you aim for improved search and answer quality, use RAG.
⮕ For Small and Specific Datasets:
If there is no need for external knowledge, do not use RAG.
If faster response times are preferred, do not use RAG.
If the task involves simple Q&A or a fixed data source, do not use RAG.
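For illustration only, the same decision flow can be written down as a small function; the boolean flags are just names for the questions in the flowchart above.

```python
def should_use_rag(large_and_diverse_dataset: bool,
                   needs_external_context: bool,
                   can_accept_extra_latency: bool,
                   simple_fixed_qa: bool) -> bool:
    """Encodes the RAG decision flowchart above as plain boolean logic."""
    if not large_and_diverse_dataset:
        return False      # small, specific dataset: skip RAG
    if simple_fixed_qa:
        return False      # simple Q&A over a fixed source: skip RAG
    # Large, diverse data that needs outside context and can tolerate latency: use RAG.
    return needs_external_context and can_accept_extra_latency

print(should_use_rag(True, True, True, False))   # -> True
print(should_use_rag(False, True, True, False))  # -> False
```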
If not RAG, then what can we use? We can use fine-tuning and prompt engineering.
Fine-tuning involves training the large language model (LLM) on a specific dataset relevant to your task. This helps the LLM understand the domain and improve its accuracy for tasks within that domain.
Prompt engineering is where you focus on crafting informative prompts and instructions for the LLM. By carefully guiding the LLM with the right questions and context, you can steer it towards generating more relevant and accurate responses without needing an external information retrieval step.
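As a small illustration, here is what a more carefully engineered prompt might look like compared to a bare question; the wording and structure are just one possible pattern, not a fixed template.

```python
bare_prompt = "Summarize this support ticket."

engineered_prompt = """You are a support analyst for a billing system.

Task: Summarize the ticket below in exactly 3 bullet points.
Audience: an on-call engineer who has not seen the ticket.
Constraints: mention the customer's plan tier and any error codes verbatim;
if information is missing, say "not stated" rather than guessing.

Ticket:
{ticket_text}
"""

# The engineered version pins down role, format, audience and failure behaviour,
# which narrows the space of acceptable answers without any retrieval step.
print(engineered_prompt.format(
    ticket_text="Customer on Pro plan reports error E1042 after upgrade."))
```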
Ultimately, the best alternative depends on your specific needs.
Take a look at my article on RAG.
If you'd like to use a robust database not just for AI/ML applications but also for real-time analytics, try the SingleStore database.
Semantic Caching to Improve LLMs & RAG
Fast retrieval is a must in RAG for today's AI/ML applications.
Latency and computational cost are the two major challenges while deploying these applications in production.
While RAG enhances this capability to a certain extent, integrating a semantic cache layer in between is a must: it stores previous user queries and decides whether a prompt should be enriched with information from the vector database or answered directly from the cache.
A semantic caching system aims to identify similar or identical user requests. When a matching request is found, the system retrieves the corresponding information from the cache, reducing the need to fetch it from the original source.
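A minimal sketch of a semantic cache lookup, again assuming the sentence-transformers package; the similarity threshold and model name are illustrative, and a production setup would persist the cache in a database rather than in memory.

```python
# pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")    # assumed embedding model
cache: list[tuple[np.ndarray, str]] = []              # (query embedding, cached answer)

def semantic_cache_lookup(query: str, threshold: float = 0.9) -> str | None:
    """Return a cached answer if a previous query is semantically close enough."""
    q_vec = embedder.encode([query])[0]
    q_vec = q_vec / np.linalg.norm(q_vec)
    for vec, answer in cache:
        if float(np.dot(q_vec, vec)) >= threshold:
            return answer      # cache hit: skip retrieval and generation entirely
    return None                # cache miss: fall through to the RAG pipeline

def cache_store(query: str, answer: str) -> None:
    q_vec = embedder.encode([query])[0]
    cache.append((q_vec / np.linalg.norm(q_vec), answer))
```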
There are many solutions that can help you with semantic caching, but I recommend using the SingleStore database.
Why use SingleStore Database as the semantic cache layer?
SingleStoreDB is a real-time, distributed database designed for blazing fast queries with an architecture that supports a hybrid model for transactional and analytical workloads.
This pairs nicely with generative AI use cases as it allows for reading or writing data for both training and real-time tasks - without adding complexity and data movement from multiple products for the same task.
SingleStoreDB also has a built-in plan cache to speed up subsequent queries that share the same plan.
Know more about semantic caching with SingleStore.
LLM Evaluation
LLM evaluation metrics score an LLM's output based on the criteria you care about.
Fortunately, there are numerous established methods available for calculating metric scores - some utilize neural networks, including embedding models and LLMs, while others are based entirely on statistical analysis.
Let's look at some notable ones below.
⮕ G-Eval:
G-Eval is a recently developed framework from a paper titled "NLG Evaluation using GPT-4 with Better Human Alignment" that uses LLMs to evaluate LLM outputs (aka. LLM-Evals).
G-Eval first generates a series of evaluation steps using chain-of-thought (CoT) prompting, then uses the generated steps to determine the final score via a form-filling paradigm (this is just a fancy way of saying G-Eval requires several pieces of information to work).
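A rough sketch of the G-Eval pattern using the OpenAI Python client; the model name, criterion, and prompts are illustrative assumptions, and the procedure in the paper is more involved (it also weights scores by token probabilities).

```python
# pip install openai
from openai import OpenAI

client = OpenAI()       # assumes OPENAI_API_KEY is set in the environment
MODEL = "gpt-4o-mini"   # illustrative model choice

def g_eval_style_score(criterion: str, source: str, output: str) -> str:
    # Step 1: have the LLM write evaluation steps for the criterion (the CoT part).
    steps = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user",
                   "content": f"Write concise evaluation steps for judging the "
                              f"'{criterion}' of a summary against its source text."}],
    ).choices[0].message.content

    # Step 2: fill in the "form": apply those steps to the actual source/output pair.
    verdict = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user",
                   "content": f"Evaluation steps:\n{steps}\n\nSource:\n{source}\n\n"
                              f"Summary:\n{output}\n\nFollow the steps and return a score from 1 to 5."}],
    ).choices[0].message.content
    return verdict
```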
⮕ GPTScore:
Unlike G-Eval which directly performs the evaluation task with a form-filling paradigm, GPTScore uses the conditional probability of generating the target text as an evaluation metric.
⮕ SelfCheckGPT:
SelfCheckGPT is an odd one. It is a simple sampling-based approach that is used to fact-check LLM outputs. It assumes that hallucinated outputs are not reproducible, whereas if an LLM has knowledge of a given concept, sampled responses are likely to be similar and contain consistent facts.
SelfCheckGPT is an interesting approach because it makes detecting hallucination a reference-less process, which is extremely useful in a production setting.
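Here is a simplified, reference-free consistency check in the spirit of SelfCheckGPT. The real method offers several scoring variants (BERTScore, QA, NLI); this sketch just uses average embedding similarity against resampled answers, and the model name is an assumption.

```python
# pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding model

def consistency_score(claim: str, resampled_answers: list[str]) -> float:
    """Higher score = the claim keeps reappearing across samples, so it is less likely hallucinated."""
    vecs = embedder.encode([claim] + resampled_answers)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    claim_vec, sample_vecs = vecs[0], vecs[1:]
    return float(np.mean(sample_vecs @ claim_vec))

# Usage: ask the LLM the same question N times (with higher temperature), then score each
# claim from the original answer against those samples and flag low-consistency claims.
```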
⮕ QAG Score:
QAG (Question Answer Generation) Score is a scorer that leverages LLMs' high reasoning capabilities to reliably evaluate LLM outputs. It uses answers (usually either a 'yes' or 'no') to close-ended questions (which can be generated or preset) to compute a final metric score. It is reliable because it does NOT use LLMs to directly generate scores.
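A toy sketch of the QAG idea: close-ended questions are generated from the output and answered against the source by whichever LLM you use; the sketch below only shows how the final score is computed from the resulting yes/no answers.

```python
def qag_score(yes_no_answers: list[str]) -> float:
    """Fraction of close-ended questions answered 'yes' (i.e. supported by the source)."""
    answers = [a.strip().lower() for a in yes_no_answers]
    return sum(a == "yes" for a in answers) / max(len(answers), 1)

# Example: 4 of 5 generated questions about the LLM's output were confirmed by the source text.
print(qag_score(["yes", "yes", "no", "yes", "yes"]))  # -> 0.8
```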
Know in-depth about LLM evaluation metrics in this original article.
Do you know that the SingleStore database has a free shared tier? Yes, free forever. You can sign up and start using it in minutes.