Llama 3 has just been rolled out, exactly nine months after the release of Llama 2. It is already available for chat on Meta's website and can be downloaded from Hugging Face in safetensors or GGUF format.
While the previous generation was trained on a dataset of 2 trillion tokens, the new one was trained on 15 trillion tokens.
What is fascinating is how the smaller 8B version outperforms the bigger previous-gen 70B model in every benchmark listed on the model card:
| Benchmark | Llama 3 8B | Llama 2 7B | Llama 2 13B | Llama 3 70B | Llama 2 70B |
|---|---|---|---|---|---|
| GPQA (0-shot) | 34.2 | 21.7 | 22.3 | 39.5 | 21.0 |
| HumanEval (0-shot) | 62.2 | 7.9 | 14.0 | 81.7 | 25.6 |
| GSM-8K (8-shot, CoT) | 79.6 | 25.7 | 77.4 | 93.0 | 57.5 |
| MATH (4-shot, CoT) | 30.0 | 3.8 | 6.7 | 50.4 | 11.6 |
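The 8B-beats-old-70B claim can be checked directly from the numbers in the table. A minimal sketch (scores copied from the model card figures above; the dictionary layout is just for illustration):

```python
# Benchmark scores from the Llama 3 model card table above.
scores = {
    "GPQA (0-shot)":        {"llama3_8b": 34.2, "llama2_70b": 21.0},
    "HumanEval (0-shot)":   {"llama3_8b": 62.2, "llama2_70b": 25.6},
    "GSM-8K (8-shot, CoT)": {"llama3_8b": 79.6, "llama2_70b": 57.5},
    "MATH (4-shot, CoT)":   {"llama3_8b": 30.0, "llama2_70b": 11.6},
}

# Verify the smaller new model leads the larger previous-gen model everywhere,
# and show the margin per benchmark.
for name, s in scores.items():
    margin = s["llama3_8b"] - s["llama2_70b"]
    assert margin > 0, f"Llama 3 8B does not lead on {name}"
    print(f"{name}: Llama 3 8B leads Llama 2 70B by {margin:.1f} points")
```

The largest gap is on HumanEval, where the 8B model scores more than twice the old 70B's result.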
Llama 3 has also doubled the context window, from 4k to 8k tokens.