DoLa and MT-Bench - A Quick Eval of a new LLM trick

Maxim Saplin - Jul 11 - Dev Community

Decoding by Contrasting Layers (DoLa) is a technique that takes a different approach to computing next-token probabilities in a transformer. It is described in the paper "DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models" (arXiv:2309.03883). What is interesting is that, without any changes to the model weights, a code change to the decoding step alone can noticeably improve the model's factuality and reduce hallucinations.
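To illustrate the core idea: the next-token distribution is computed by contrasting the final ("mature") layer's predictions with those of an earlier ("premature") layer, keeping only tokens the mature layer already finds plausible. Below is a minimal sketch of that scoring step; the function name and the alpha value are my own, and the paper's dynamic selection of the premature layer (via Jensen-Shannon divergence) is omitted.

import math

import torch
import torch.nn.functional as F

def dola_contrast(mature_logits, premature_logits, alpha=0.1):
    # Log-probabilities from the final layer and an earlier layer
    # (in DoLa, both are read off through the same LM head)
    log_p_mature = F.log_softmax(mature_logits, dim=-1)
    log_p_premature = F.log_softmax(premature_logits, dim=-1)
    # Adaptive plausibility constraint: keep only tokens whose mature-layer
    # probability is at least alpha times the top token's probability
    cutoff = log_p_mature.max(dim=-1, keepdim=True).values + math.log(alpha)
    # Contrast: boost tokens whose probability grew the most from the
    # premature layer to the mature one; mask out everything else
    scores = log_p_mature - log_p_premature
    return scores.masked_fill(log_p_mature < cutoff, float("-inf"))

# Toy usage: batch of 1, vocabulary of 8 tokens
mature = torch.randn(1, 8)
premature = torch.randn(1, 8)
next_token_probs = F.softmax(dola_contrast(mature, premature), dim=-1)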

A few days ago a PR was merged into the Hugging Face Transformers library implementing this trick.

It happened that I had MT-Bench set up while tinkering with a 1.6B model and running evals. The LLM Judge relies on HF Transformers, so it was easy to do a quick trial of DoLa and see whether it improves the chatbot's overall performance (reasoning, coding, writing, etc.).

  1. I installed Transformers from source (the new feature is not available on PyPI yet): pip install git+https://github.com/huggingface/transformers

  2. Made a change to gen_model_answer.py, adding the dola_layers parameter:

output_ids = model.generate(
    torch.as_tensor(input_ids).cuda(),
    do_sample=do_sample,
    temperature=temperature,
    max_new_tokens=max_new_token,
    # 'low' contrasts the final layer with lower layers; 'high' with upper layers
    dola_layers='low'
)
  3. Ran MT-Bench three times: with the dola_layers param commented out (baseline), then set to 'low', then to 'high' (a way to avoid editing the file between runs is sketched below)
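As a side note, instead of editing gen_model_answer.py between runs, the three configurations could be driven from the environment. A minimal sketch, assuming the surrounding variables (input_ids, do_sample, temperature, max_new_token) from gen_model_answer.py; the DOLA_MODE variable name is my own invention, not part of FastChat:

import os

# DOLA_MODE (hypothetical): unset for the baseline run, 'low' or 'high' otherwise
dola_mode = os.environ.get("DOLA_MODE")
gen_kwargs = dict(
    do_sample=do_sample,
    temperature=temperature,
    max_new_tokens=max_new_token,
)
if dola_mode:
    gen_kwargs["dola_layers"] = dola_mode
output_ids = model.generate(torch.as_tensor(input_ids).cuda(), **gen_kwargs)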

Here are the results:

Mode: single
Input file: data/mt_bench/model_judgment/gpt-4_single.jsonl

########## First turn ##########
                                           score
model                               turn        
stablelm-2-brief-1_6b_r57_no_dola   1     4.8375
stablelm-2-brief-1_6b_r57_dola_low  1     4.6125
stablelm-2-brief-1_6b_r57_dola_high 1     3.9500

########## Second turn ##########
                                          score
model                               turn       
stablelm-2-brief-1_6b_r57_dola_low  2     3.700
stablelm-2-brief-1_6b_r57_no_dola   2     3.475
stablelm-2-brief-1_6b_r57_dola_high 2     2.825

########## Average ##########
                                       score
model                                       
stablelm-2-brief-1_6b_r57_dola_low   4.15625
stablelm-2-brief-1_6b_r57_no_dola    4.15625
stablelm-2-brief-1_6b_r57_dola_high  3.38750

As you can see, with dola_layers='low' the first-turn score went down while the second-turn score improved, leaving the average exactly level with the baseline; 'high' hurt both turns. These results shouldn't be treated as representative: from my experience, MT-Bench scores can vary by around 10% between runs. Overall, if there are any effects, they are marginal.
