Large Language Models (LLMs) like GPT-4, Claude, Llama and Gemini have contributed a great deal to the AI community, helping organizations build robust LLM-powered applications. Yet even after all these advancements, LLMs still hallucinate, confidently producing fabricated answers that sound true. It has therefore become critical for organizations to use LLMs safely, securely and responsibly, and to evaluate them not just for speed, but also for accuracy and overall performance.
Today, we will walk through how to evaluate LLMs with a simple tutorial. But first, let's get a better understanding of what LLM evaluation is.
What is LLM evaluation?
LLM evaluation is key to understanding how well an LLM performs. It helps developers identify the model's strengths and weaknesses, ensuring it functions effectively in real-world applications. This evaluation process also helps mitigate risks like biased or misleading content.
There are two main types of LLM evaluation:
- Model evaluation, which assesses the core abilities of the LLM itself
- System evaluation, which looks at how the model performs within a specific application or with real user input.
LLM evaluation metrics
It is important to have metrics in place to evaluate LLMs before launching your application to production. These metrics act as scoring mechanisms that assess an LLM's outputs against given criteria. The following are some common metrics and criteria to consider:
Response completeness and conciseness. Completeness determines whether the LLM's response fully resolves the user query, while conciseness determines whether it does so without including irrelevant or redundant information.
Text similarity metrics. These compare the generated text to a reference or benchmark text, gauging how similar they are and producing a score that indicates how well a particular LLM performed.
Question answering accuracy. Measures how well an LLM can answer the questions posed based on factual correctness.
Relevance. Determines how relevant an LLM's response is to a given prompt or user question.
Hallucination index. Identifies how much of the information an LLM makes up, or whether it produces biased output for a given prompt.
Toxicity. This determines the percentage of offensive or harmful language in the LLM's output.
Task-specific metrics. Depending on the type of task and application (summarization, translation, etc.), specialized metrics exist, such as the BLEU score for machine translation; a short scoring sketch follows this list.
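To make the idea of a scoring mechanism concrete, here is a minimal sketch of one task-specific metric, BLEU, computed with NLTK. The example sentences are made up for illustration, and production pipelines typically compute corpus-level BLEU with a library such as sacrebleu:

```python
# pip install nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Tokenized reference translation(s) and a tokenized model output (illustrative examples).
reference = [["the", "cat", "sits", "on", "the", "mat"]]
candidate = ["the", "cat", "is", "sitting", "on", "the", "mat"]

# Smoothing avoids a zero score when a higher-order n-gram has no overlap.
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.2f}")  # 1.0 = identical n-gram overlap, 0.0 = no overlap
```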
LLM evaluation frameworks and tools
LLM evaluation frameworks and tools are important because they provide standardized benchmarks to measure and improve the performance, reliability and fairness of language models. The following are some widely used LLM evaluation frameworks and tools:
DeepEval. DeepEval is an open-source framework that helps organizations evaluate LLM applications by quantifying their performance on various important metrics like contextual recall, answer relevance and faithfulness.
promptfoo. A CLI and library for evaluating LLM output quality and performance, promptfoo enables you to systematically test prompts and models against predefined test cases.
EleutherAI LM Eval. A framework for few-shot evaluation of language models across a wide range of tasks with minimal fine-tuning.
MMLU. A benchmark that tests models on a wide range of subjects in zero-shot and few-shot settings.
BLEU (BiLingual Evaluation Understudy). A metric that measures the similarity of machine-translated text to one or more high-quality reference translations; scores range from 0 to 1.
SQuAD (Stanford Question Answering Dataset). A dataset for evaluating LLMs on question-answering tasks. It includes context passages and corresponding questions, each associated with a specific answer (a simplified scoring sketch follows this list).
OpenAI Evals. Evals is OpenAI's standard framework for evaluating LLMs, along with an open-source registry of benchmarks. The framework is used to test models and verify their accuracy.
UpTrain. UpTrain is an open-source LLM evaluation tool. It provides pre-built metrics to check LLM responses on aspects including correctness, hallucination and toxicity, among others.
H2O LLM EvalGPT. This is an open tool for understanding a model’s performance across a plethora of tasks and benchmarks.
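As an illustration of how question-answering accuracy is typically scored against SQuAD-style reference answers, here is a simplified sketch of exact match and token-level F1. The official SQuAD evaluation script additionally strips punctuation and articles before comparing; this version only lowercases and tokenizes on whitespace:

```python
from collections import Counter

def exact_match(prediction: str, ground_truth: str) -> float:
    # 1.0 if the normalized strings are identical, else 0.0
    return float(prediction.strip().lower() == ground_truth.strip().lower())

def f1_score(prediction: str, ground_truth: str) -> float:
    # Token-level overlap between the predicted answer and the reference answer
    pred_tokens = prediction.lower().split()
    gt_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(gt_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))              # 1.0
print(f1_score("in the city of Paris", "Paris"))  # partial credit (~0.33)
```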
Evaluating LLMs with UpTrain: Notebook tutorial
If you haven’t already, sign up for your free SingleStore trial to follow along with the tutorial. We will be using SingleStore Notebooks, which are just like Jupyter Notebooks but with the additional capabilities and benefits of an integrated database.
When you sign up, you need to create a workspace.
Go to the main dashboard and click on the Develop tab.
Create a new Notebook, and name it whatever you’d like.
Now you can get started. Add all the code shown here to the notebook you created.
Create a database named ‘evaluate_llm’
%%sql
DROP DATABASE IF EXISTS evaluate_llm;
CREATE DATABASE evaluate_llm;
Install the necessary packages
!pip install uptrain==0.5.0 openai==1.3.3 langchain==0.1.4 tiktoken==0.5.2 --quiet
The next step involves setting the required environment variables: the OpenAI API key (for generating responses), the SingleStoreDB connection (for context retrieval) and the UpTrain API key (for evaluating responses).
You can create an account with UpTrain and generate an API key for free at https://uptrain.ai/.
import getpass
import os
os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key: ')
import openai
client = openai.OpenAI()
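Note that inside a SingleStore Notebook the database connection is provided for you once you select a workspace, so no extra setup is usually needed. If you run this code elsewhere, the singlestoredb client can read the connection string from an environment variable; a minimal sketch, assuming the standard SINGLESTOREDB_URL convention:

```python
# Only needed outside SingleStore Notebooks, where the connection is not injected automatically.
# The expected value is a connection string such as 'user:password@host:port/evaluate_llm'.
os.environ['SINGLESTOREDB_URL'] = getpass.getpass('SingleStore connection URL: ')
```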
Add the UpTrain API key.
UPTRAIN_API_KEY = getpass.getpass('Uptrain API Key: ')
Import necessary modules
import singlestoredb
from uptrain import APIClient, Evals
from langchain.vectorstores import SingleStoreDB
from langchain.embeddings import OpenAIEmbeddings
Load data from the web
from langchain.document_loaders import WebBaseLoader
loader = WebBaseLoader('https://cloud.google.com/vertex-ai/docs/generative-ai/learn/generative-ai-studio')
data = loader.load()
Next, split the data
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=0)
all_splits = text_splitter.split_documents(data)
Set up the SingleStore database with OpenAI embeddings
import os
from langchain.vectorstores import SingleStoreDB
from langchain.embeddings import OpenAIEmbeddings
from singlestoredb import create_engine
conn = create_engine().connect()
vectorstore = SingleStoreDB.from_documents(documents=all_splits,
                                           embedding=OpenAIEmbeddings(),
                                           table_name='vertex_ai_docs_chunk_size_200')
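As a quick sanity check (not part of the original notebook flow), you can query the vector store directly to confirm that relevant chunks are being retrieved; similarity_search is the standard LangChain vector-store method, and the sample question is hypothetical:

```python
# Retrieve the chunks most similar to a sample question to verify that ingestion worked.
query = "What is Generative AI Studio?"
docs = vectorstore.similarity_search(query, k=3)
for doc in docs:
    print(doc.page_content[:200], "\n---")
```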
The complete, step-by-step Notebook code is available here in our Spaces.
Finally, you will run evaluations using UpTrain. You can access the UpTrain dashboards to see the evaluation results, and experiment with different chunk sizes to compare the outcomes.
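A hedged sketch of what that evaluation step can look like: retrieve context from the vector store, generate an answer with OpenAI, then send both to UpTrain for scoring. The question and project name are placeholders, and the APIClient / log_and_evaluate usage follows UpTrain 0.5's interface as I recall it; check the UpTrain docs if the method or parameter names have changed:

```python
# Build one evaluation record: question, retrieved context and generated response.
question = "What is Generative AI Studio?"
context = "\n".join(doc.page_content for doc in vectorstore.similarity_search(question, k=3))

# `client` is the OpenAI client created earlier in the notebook.
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{
        "role": "user",
        "content": f"Answer using only this context:\n{context}\n\nQuestion: {question}",
    }],
).choices[0].message.content

data = [{"question": question, "context": context, "response": response}]

# Send the record to UpTrain and score it on the chosen checks.
uptrain_client = APIClient(uptrain_api_key=UPTRAIN_API_KEY)
results = uptrain_client.log_and_evaluate(
    project_name="evaluate_llm_demo",   # hypothetical project name
    data=data,
    checks=[Evals.CONTEXT_RELEVANCE, Evals.FACTUAL_ACCURACY, Evals.RESPONSE_RELEVANCE],
)
```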
UpTrain's API client also provides an evaluate_experiments method, which takes the input data to be evaluated, the list of checks to run and the names of the columns associated with the experiment.
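Based on that description, comparing two chunk sizes might look roughly like the sketch below. The exp_columns parameter name, the project name and the per-chunk-size variables are assumptions for illustration; the contexts and responses would come from re-running the retrieval and generation step against tables built with different chunk sizes:

```python
# Each record carries a 'chunk_size' column so UpTrain can group results per experiment.
# context_200 / response_200 and context_1000 / response_1000 are illustrative variables
# produced by repeating the retrieval + generation step at chunk sizes 200 and 1,000.
experiment_data = [
    {"question": question, "context": context_200,  "response": response_200,  "chunk_size": 200},
    {"question": question, "context": context_1000, "response": response_1000, "chunk_size": 1000},
]

results = uptrain_client.evaluate_experiments(
    project_name="evaluate_llm_chunk_size",   # hypothetical project name
    data=experiment_data,
    checks=[Evals.CONTEXT_RELEVANCE, Evals.RESPONSE_RELEVANCE],
    exp_columns=["chunk_size"],               # column(s) identifying the experiment
)
```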
By following the LLM evaluation approach and tools shown in this tutorial, we gain a deeper understanding of LLM strengths and weaknesses. This allows us to leverage their capabilities responsibly, mitigating potential risks associated with factual inaccuracies and biases. Ultimately, effective LLM evaluation paves the way for building trust and fostering the ethical development of AI across LLM-powered applications.