MT-Bench: Comparing different LLM Judges

Maxim Saplin - Jun 8 - Dev Community

By default, MT-Bench uses OpenAI as the judge provider with the gpt-4 model ID, i.e. the vanilla GPT-4 model with 8K context introduced back in Spring 2023. However, it is possible to override the model ID via the --judge-model argument.
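For example, re-scoring a set of answers with a different judge could look roughly like the sketch below. It assumes FastChat's llm_judge scripts (gen_judgment.py, show_result.py) are run from their directory; the model ID is a placeholder and the exact flags may differ between FastChat versions.

# Hedged sketch: re-scoring previously generated answers with a cheaper judge.
# Assumes FastChat's llm_judge scripts; "phi-3-medium-8k" is a placeholder
# model ID and flags may vary between FastChat versions.
import subprocess

target_model = "phi-3-medium-8k"   # hypothetical --model-id used during answer generation
judge_model = "gpt-4o"             # default is "gpt-4"; "gpt-4-0125-preview" would also work

# Generate judgments with the overridden judge model
subprocess.run(
    ["python", "gen_judgment.py",
     "--model-list", target_model,
     "--judge-model", judge_model],
    check=True,
)

# Print the aggregated 1st turn / 2nd turn / average scores
subprocess.run(
    ["python", "show_result.py",
     "--model-list", target_model,
     "--judge-model", judge_model],
    check=True,
)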

As of June 2024, GPT-4 series models have the following pricing (per million tokens):

Model                        Prompt    Completion
GPT-4o                       $5.00     $15.00
GPT-4-Turbo (0125-preview)   $10.00    $30.00
GPT-4 8K (0613)              $30.00    $60.00

By running MT-Bench with GPT-4 Turbo or GPT-4o (Omni) as the judge, one can potentially cut the API cost of a single evaluation by a factor of up to 6. But how will the score change? Let's find out :)

Costs

I used Phi-3 Medium with 8K context, quantized to 8 bits (running the inference server via LM Studio). I executed answer generation 4 times. Then, for each answer set, I ran one judgment generation with each of the three judge models.

OpenAI API cost per evaluation*:

GPT-4o                       $0.93
GPT-4-Turbo (0125-preview)   $1.85
GPT-4 8K (0613)              $5.10

*I only collected the total token count consumed by gpt-4-0613 (621,805 tokens across the 4 runs). For the calculation, I assumed that each judge model had similar consumption: roughly 580k prompt and 60k completion tokens in total.
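For reference, a minimal sketch of how the per-eval figures can be estimated from that assumption (~580k prompt and ~60k completion tokens across 4 runs, prices per million tokens as of June 2024). The output is approximate and differs slightly from the table above because the real per-model token split was not recorded.

# Estimate OpenAI API cost per MT-Bench evaluation for each judge model.
# Assumption (from the footnote above): ~580k prompt + ~60k completion tokens
# across 4 runs, identical for every judge. Figures are approximate.

PRICES_PER_MTOK = {                      # (prompt, completion) in USD per 1M tokens
    "GPT-4o": (5.00, 15.00),
    "GPT-4-Turbo (0125-preview)": (10.00, 30.00),
    "GPT-4 8K (0613)": (30.00, 60.00),
}

PROMPT_TOKENS, COMPLETION_TOKENS, RUNS = 580_000, 60_000, 4

for model, (p_in, p_out) in PRICES_PER_MTOK.items():
    total = (PROMPT_TOKENS * p_in + COMPLETION_TOKENS * p_out) / 1_000_000
    print(f"{model}: ${total / RUNS:.2f} per eval")

# Prints roughly $0.95 / $1.90 / $5.25, in the same ballpark as the table above.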

Reviewing the Scores

The findings below cannot be generalized, as they are based on a small sample of results for just one target model (Phi-3). Still...

For each of the LLM judges, I calculated the mean score (across the 4 runs) and the standard deviation as a percentage of the mean. As you can see:

  • GPT-4o (Omni) tends to inflate the score, by roughly 12% relative to the vanilla GPT-4 baseline
  • All judges are quite consistent, with just a 1-3% deviation in scores across runs
  • The vanilla GPT-4 shows the most consistency across turns
Mean                         1st Turn    2nd Turn    Avg
GPT-4o                       9.13125     8.2814875   8.70720325
GPT-4-Turbo (0125-preview)   8.290625    7.5270175   7.90932575
GPT-4 8K (0613)              8.41875     7.04375     7.73125

StDev (% of mean)            1st Turn    2nd Turn    Avg
GPT-4o                       0.23%       2.62%       1.30%
GPT-4-Turbo (0125-preview)   0.62%       2.34%       1.40%
GPT-4 8K (0613)              1.18%       1.86%       1.15%

GPT-4 Turbo stays closest to the GPT-4 8K baseline; the 2nd turn sees the most deviation:

% of GPT-4 8K (1st Turn)
GPT-4o                       108.5%
GPT-4-Turbo (0125-preview)   98.5%
GPT-4 8K (0613)              100.0%

Both Omni and Turbo show a smaller drop in 2nd-turn scores than the vanilla GPT-4:

2nd turn drop
GPT-4o                       9.31%
GPT-4-Turbo (0125-preview)   9.21%
GPT-4 8K (0613)              16.33%

Raw Scores

Model                            1st Turn   2nd Turn   Avg
GPT-4o #1                        9.14375    8.5625     8.853125
GPT-4o #2                        9.14375    8.3375     8.740625
GPT-4o #3                        9.1        8.15       8.625
GPT-4o #4                        9.1375     8.07595    8.610063
GPT-4-Turbo (0125-preview) #1    8.35       7.7        8.025
GPT-4-Turbo (0125-preview) #2    8.2875     7.64557    7.968553
GPT-4-Turbo (0125-preview) #3    8.3        7.4375     7.86875
GPT-4-Turbo (0125-preview) #4    8.225      7.325      7.775
GPT-4 8K (0613) #1               8.4875     7.2125     7.85
GPT-4 8K (0613) #2               8.5125     6.975      7.74375
GPT-4 8K (0613) #3               8.3        7.075      7.6875
GPT-4 8K (0613) #4               8.375      6.9125     7.64375
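The aggregates in the previous section can be reproduced from these raw runs. A minimal sketch, using the GPT-4o runs as example data (the deviation figures are the sample standard deviation divided by the mean):

# Aggregate raw MT-Bench runs: mean, stdev as % of mean, and 2nd-turn drop.
# The numbers below are the GPT-4o runs #1-#4 from the table above.
from statistics import mean, stdev

first_turn = [9.14375, 9.14375, 9.1, 9.1375]
second_turn = [8.5625, 8.3375, 8.15, 8.07595]

def stdev_pct(values):
    """Sample standard deviation expressed as a fraction of the mean."""
    return stdev(values) / mean(values)

m1, m2 = mean(first_turn), mean(second_turn)
print(f"1st turn mean: {m1:.5f}  (stdev {stdev_pct(first_turn):.2%})")
print(f"2nd turn mean: {m2:.5f}  (stdev {stdev_pct(second_turn):.2%})")
print(f"2nd turn drop: {(m1 - m2) / m1:.2%}")

# -> 1st turn mean: 9.13125  (stdev 0.23%)
# -> 2nd turn mean: 8.28149  (stdev 2.62%)
# -> 2nd turn drop: 9.31%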

About

MT-Bench is a quick (and dirty?) way to evaluate a chatbot model (a fine-tuned, instruction-following LLM). When a new open-source model is published on Hugging Face, it is not uncommon to see its MT-Bench score presented as a testament to quality. For roughly $5 worth of OpenAI API calls, it gives a good ballpark estimate of how your model does. It is a good tool to iterate on when fine-tuning an assistant model.


MT-Bench is a Python program that asks the target model 80 predefined questions (running inference via HF Transformers or an OpenAI-compatible API endpoint). The questions cover Humanities, STEM, Extraction, Roleplay, Writing, Reasoning, and Coding. There are 2 turns: it asks a question and gets the answer (1st turn), then adds a follow-up question and collects the 2nd answer (2nd turn). It then iterates through all the questions and asks the GPT-4 judge (by default, the legacy 8K model from Spring 2023) to score both answers on a scale from 1 to 10 (hence the lowest a model can get is 1, not 0 :). The results are 3 aggregate scores: 1st turn, 2nd turn, and average.



########## First turn ##########
                                        score
model                          turn
stablelm-2-brief-1_6b_2        1     3.240506

########## Second turn ##########
                                        score
model                          turn
stablelm-2-brief-1_6b_3        2     2.443038


########## Average ##########
                                   score
model
stablelm-2-brief-1_6b_3         2.822785



As explained in this paper, which introduced MT-Bench and investigated the utility of LLMs as evaluators, the score shows high agreement with human preferences: the larger the MT-Bench score, the higher the model ranks on the LMSYS Chatbot Arena.

Another popular option for LLM evaluation is AlpacaEval. This one uses the newer and cheaper GPT-4 Turbo model as a baseline. The authors of AlpacaEval provide correlation coefficients of different evals with the LMSYS Arena, showing a strong association between LLM judges' scores and human preferences at the Arena:

Chatbot Arena Correlation
