Decoding by Contrasting Layers (DoLa) is a technique that suggests a different way of computing next-token probabilities in a transformer; it is described in this paper. What is interesting is that, without any changes to the model weights, a code change in the decoding step alone can give a noticeable boost in factuality and fewer hallucinations.
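In rough terms, DoLa compares the next-token distribution read off the final ("mature") layer with one read off an earlier ("premature") layer, and boosts the tokens whose probability grows as information flows up the network. Here is a minimal sketch of just that contrast step (my own simplification: the actual method also selects the premature layer dynamically by divergence, and the function below is made up for illustration):

```python
import torch
import torch.nn.functional as F

def dola_contrast(final_logits: torch.Tensor, early_logits: torch.Tensor,
                  alpha: float = 0.1) -> torch.Tensor:
    """Contrast the final layer's next-token distribution against an early layer's.

    Simplified sketch: real DoLa also picks the early ("premature") layer
    dynamically from a set of candidate layers.
    """
    log_p_final = F.log_softmax(final_logits, dim=-1)
    log_p_early = F.log_softmax(early_logits, dim=-1)

    # Plausibility constraint: keep only tokens to which the final layer
    # already assigns at least alpha * (its max probability).
    threshold = log_p_final.max(dim=-1, keepdim=True).values + torch.log(torch.tensor(alpha))
    plausible = log_p_final >= threshold

    # Tokens whose probability grows between the early and the final layer
    # get boosted; everything else is suppressed.
    contrast = log_p_final - log_p_early
    return contrast.masked_fill(~plausible, float("-inf"))

# Greedy pick over the contrasted scores (both logits would come from the
# respective layers' hidden states projected through the LM head):
# next_token = dola_contrast(final_logits, early_logits).argmax(dim=-1)
```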
A few days ago a PR was merged into the Hugging Face Transformers library implementing this trick.
It happened that I had MT-Bench set up while tinkering with a 1.6B model and running evals. The LLM Judge relies on HF Transformers, so it was easy to do a quick trial of DoLa and see whether it improves the chatbot's overall performance (reasoning, coding, writing, etc.).
I installed Transformers from source (the new feature is not available on PyPI yet):
pip install git+https://github.com/huggingface/transformers
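Just to make sure the dev build actually exposes the new parameter before kicking off the benchmark (this quick check is my own addition, not part of the FastChat setup):

```python
import transformers
from transformers import GenerationConfig

print(transformers.__version__)  # should be a .dev0 build installed from git
# dola_layers was added to GenerationConfig by the PR; None means disabled
print(getattr(GenerationConfig(), "dola_layers", "not available"))
```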
Made a change to `gen_model_answer.py`, adding the `dola_layers` param:
output_ids = model.generate(
    torch.as_tensor(input_ids).cuda(),
    do_sample=do_sample,
    temperature=temperature,
    max_new_tokens=max_new_token,
    dola_layers='low'
)
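For reference, here is what the same call looks like outside of FastChat. This is only an illustrative sketch (the model path is a placeholder, not my actual checkpoint); per the Transformers docs, `dola_layers` also accepts `'high'` or an explicit list of layer indices, and a small repetition penalty is suggested alongside DoLa.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path-or-hub-id-of-your-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16).to("cuda")

inputs = tokenizer("The capital of Australia is", return_tensors="pt").to(model.device)
output_ids = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=False,
    dola_layers="low",       # or "high", or a list of layer indices
    repetition_penalty=1.2,  # suggested in the docs to curb repetition with DoLa
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```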
Then I ran MT-Bench three times: with the param commented out, set to `low`, and set to `high`. Here are the results:
Mode: single
Input file: data/mt_bench/model_judgment/gpt-4_single.jsonl

| model | first turn | second turn | average |
|---|---|---|---|
| stablelm-2-brief-1_6b_r57_no_dola | 4.8375 | 3.475 | 4.15625 |
| stablelm-2-brief-1_6b_r57_dola_low | 4.6125 | 3.700 | 4.15625 |
| stablelm-2-brief-1_6b_r57_dola_high | 3.9500 | 2.825 | 3.38750 |
As you can see, while the first-turn score went down with DoLa, the second-turn score actually improved for the `low` setting. The results shouldn't be taken as representative, though: in my experience MT-Bench scores can vary by about 10% between runs. Overall, if there is any effect here, it is marginal.