Interpreting Benchmarks and Evaluations in LLMs

Yash Jivani - Jun 25 - Dev Community

Welcome to the era of Large Language Models (LLMs). If you've heard terms like "benchmarks" or "evaluations" and wondered what they mean and how to interpret them in the context of LLMs, you're in the right place. Let's break down these concepts.

Benchmarks

Benchmarks are standardized tests or tasks, packaged as datasets, that are used to evaluate and compare the performance of different models. They typically cover a range of language understanding and generation tasks, such as text completion, question answering, and summarization, and they provide a measure of how well one LLM performs relative to others. Some commonly used benchmarks are -

  1. The General Language Understanding Evaluation (GLUE) benchmark is a collection of nine natural language understanding tasks, such as sentiment analysis and textual entailment.
  2. MMLU (Massive Multitask Language Understanding) measures knowledge acquired during pretraining. It covers 57 subjects spanning STEM, the humanities, the social sciences, and more.
  3. HumanEval focuses on whether the LLM's generated code works as intended, typically by running the generated programs against unit tests.

Detailed information can be found in [2] and [3].
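
To make the idea concrete, here is a minimal, hypothetical sketch of how a multiple-choice benchmark score (in the spirit of MMLU) could be computed. The `ask_llm` function and the tiny inline dataset are made-up stand-ins, not part of any real benchmark harness; real suites run thousands of items and report aggregate accuracy.

```python
# Minimal sketch: scoring a model on a multiple-choice benchmark (MMLU-style).
# `ask_llm` is a hypothetical stand-in for whatever model you are testing;
# here it is stubbed to always answer "A" so the script runs end to end.

def ask_llm(question: str, choices: list[str]) -> str:
    """Return the letter (A/B/C/D) the model picks for this question."""
    return "A"  # replace with a real model call

# A tiny made-up dataset; real benchmarks ship thousands of such items.
DATASET = [
    {"question": "What is 2 + 2?",
     "choices": ["3", "4", "5", "22"], "answer": "B"},
    {"question": "H2O is commonly known as?",
     "choices": ["Salt", "Steam", "Water", "Oxygen"], "answer": "C"},
]

def evaluate(dataset) -> float:
    correct = 0
    for item in dataset:
        prediction = ask_llm(item["question"], item["choices"])
        if prediction.strip().upper() == item["answer"]:
            correct += 1
    # Accuracy on a 0-100 scale - the kind of number reported on leaderboards.
    return 100.0 * correct / len(dataset)

print(f"Benchmark score: {evaluate(DATASET):.1f}")
```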

Evaluations

Evaluation refers to measuring and assessing how well a model performs: how accurately it predicts or generates the next word in a sentence, understands context, summarizes text, and responds to queries. Evaluation is crucial because it helps determine the model's strengths and weaknesses and provides insight into areas for improvement. There are two common ways to compute metric scores -

  1. Statistical Scorer
    These are purely statistical, string-based scorers - i.e. they do not take semantics into account (a from-scratch sketch of the n-gram idea appears after this list). Some of these are -
    a. BLEU (BiLingual Evaluation Understudy) evaluates the output of an LLM application against annotated ground truths by calculating precision over the n-grams shared between the predicted and reference outputs.
    b. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is primarily used for evaluating text summaries from NLP models and calculates recall by comparing the overlap of n-grams between LLM outputs and expected outputs.

  2. Model-Based Scorer
    These are model-based, NLP-driven scorers that take semantics into account (a small NLI-based sketch follows below). Some of these are -
    a. BLEURT (Bilingual Evaluation Understudy with Representations from Transformers), often used for machine translation, uses pre-trained models such as BERT to score LLM outputs against expected (reference) outputs.
    b. NLI (Natural Language Inference) uses a natural language inference classification model to classify whether an LLM output is logically consistent with (entailment), contradictory to, or unrelated (neutral) to a given reference text.
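
As a rough illustration of the statistical approach, here is a from-scratch sketch of unigram precision (the core idea behind BLEU) and unigram recall (the core idea behind ROUGE-1). It deliberately omits the refinements real implementations add, such as higher-order n-grams, clipping, brevity penalties, and stemming.

```python
# Minimal sketch: unigram precision (BLEU-like) and unigram recall (ROUGE-1-like).
# Real BLEU/ROUGE implementations add higher-order n-grams, clipping,
# brevity penalties, and stemming; this only shows the core overlap idea.
from collections import Counter

def unigram_overlap(prediction: str, reference: str) -> int:
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Count each word at most as many times as it appears in both texts.
    return sum((pred_counts & ref_counts).values())

def precision(prediction: str, reference: str) -> float:
    """BLEU-style: what fraction of predicted words appear in the reference?"""
    total = len(prediction.split())
    return unigram_overlap(prediction, reference) / total if total else 0.0

def recall(prediction: str, reference: str) -> float:
    """ROUGE-style: what fraction of reference words appear in the prediction?"""
    total = len(reference.split())
    return unigram_overlap(prediction, reference) / total if total else 0.0

reference = "the cat sat on the mat"
prediction = "the cat lay on the mat"
print(f"precision = {precision(prediction, reference):.2f}")  # 0.83
print(f"recall    = {recall(prediction, reference):.2f}")     # 0.83
```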

Detailed information can be found in [4].
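
For the model-based approach, here is a minimal sketch of NLI-style scoring. It assumes the Hugging Face `transformers` library and an NLI-finetuned checkpoint (`roberta-large-mnli` is used here as an example; any entailment model would do), so treat the model name and label strings as assumptions rather than a prescribed setup.

```python
# Minimal sketch: NLI-based scoring of an LLM output against a reference text.
# Assumes `pip install transformers torch` and an NLI-finetuned checkpoint;
# roberta-large-mnli is one example, not the only option.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"  # assumption: any NLI model works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)

def nli_label(reference: str, llm_output: str) -> str:
    """Classify the LLM output as entailment, neutral, or contradiction
    with respect to the reference text (used as the premise)."""
    inputs = tokenizer(reference, llm_output, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    predicted_id = int(logits.argmax(dim=-1))
    return model.config.id2label[predicted_id]

reference = "The report was published in 2021."
llm_output = "The report came out in 2021."
print(nli_label(reference, llm_output))  # expected: an entailment label
```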


After being evaluated on a benchmark, a model is usually assigned a score, often on a 0 to 100 scale. These are the numbers companies publish alongside a new LLM to compare it against other models evaluated on the same benchmark.

References

  1. Turing
  2. Papers with Code
  3. HumanLoop
  4. Confident-AI