The recent release of Llama 3.1 was reminiscent of many releases this year. It underlined a trend that formed in the first half of 2024:
- Closed SOTA LLMs (GPT-4o, Gemini 1.5, Claude 3.5) had marginal improvements over their predecessors, sometimes even falling behind (e.g. GPT-4o hallucinating more than previous versions).
- Smaller open models were catching up across a range of evals.
There have been many releases this year. OpenAI introduced GPT-4o, Anthropic released the well-received Claude 3.5 Sonnet, and Google's newer Gemini 1.5 boasted a 1 million token context window.
Among open models, we've seen CommandR, DBRX, Phi-3, Yi-1.5, Qwen2, DeepSeek v2, Mistral (NeMo, Large), Gemma 2, Llama 3, and Nemotron-4. Every post I read about a new model compared its evals against OpenAI's models, usually to challenge them.
Take these few evals from the Llama 3.1 blog post as an example:
Notice how 7-9B models come close to or surpass the scores of GPT-3.5 - the king model behind the ChatGPT revolution. Also, see how models nearing 100B params confidently surpass GPT-3.5.
This is the pattern I noticed reading all those blog posts introducing new LLMs. Judging by their evals, models converge to the same levels of performance: LLMs around 10B params converge to GPT-3.5 performance, and LLMs around 100B and larger converge to GPT-4 scores.
Another colorful picture supporting this statement is Aider's recent eval of coding capabilities:
The Ceiling
The marginal improvements, eval scores fluctuating within the margin of error, "vibe checks" and the feedback users share on the SOTA LLMs... All of that suggests that the models' performance has hit some natural limit. LLMs are not getting smarter. Take, for example, the SEAL LLM leaderboard:
Couple this saturated LLM performance with all the talk of a Gen AI bubble and the little tangible value the technology has brought so far... Titles like "Gen AI: too much spend, too little benefit?" or "So far the technology has had almost no economic impact"...
The technology of LLMs has hit the ceiling with no clear answer as to whether the $600B investment will ever have reasonable returns.
Efficiency, not Effectiveness
There's another evident trend: the cost of LLMs is going down and the speed of generation is going up, while performance across different evals is maintained or slightly improved.
Take Anthropic and OpenAI models as an example:
Model | Price (USD per 1M tokens, input/output) | Speed (tokens/sec) |
---|---|---|
Claude 3 Sonnet | $3/$15 | 63 |
Claude 3.5 Sonnet | $3/$15 | 79 |
gpt-3.5-turbo-16k-0613 | $3/$4 | ~40-50 |
gpt-3.5-turbo-0125 | $0.5/$1.5 | 83 |
gpt-4-32k | $60/$120 | ~22 |
gpt-4o | $5/$15 | 83 |
See how each successor gets cheaper, faster, or both. The most drastic difference is in the GPT-4 family: going by the table, gpt-4-32k costs 12 times more than gpt-4o for input tokens and 8 times more for output tokens, yet it is roughly 4 times slower.
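The ratios can be checked directly against the table. Below is a minimal sketch that does the arithmetic; the `MODELS` dictionary and the `compare` helper are illustrative names of my own, the numbers are copied from the table, and the approximate speeds ("~22", "~40-50") are replaced by single values, which is an assumption.

```python
# Prices are USD per 1M tokens, speeds are tokens/sec, both copied from the table.
# Assumption: "~22" is treated as 22 and "~40-50" as its midpoint, 45.
MODELS = {
    "claude-3-sonnet": {"input": 3.0, "output": 15.0, "speed": 63.0},
    "claude-3.5-sonnet": {"input": 3.0, "output": 15.0, "speed": 79.0},
    "gpt-3.5-turbo-16k-0613": {"input": 3.0, "output": 4.0, "speed": 45.0},
    "gpt-3.5-turbo-0125": {"input": 0.5, "output": 1.5, "speed": 83.0},
    "gpt-4-32k": {"input": 60.0, "output": 120.0, "speed": 22.0},
    "gpt-4o": {"input": 5.0, "output": 15.0, "speed": 83.0},
}


def compare(old: str, new: str) -> dict:
    """Return how many times cheaper and faster `new` is relative to `old`."""
    o, n = MODELS[old], MODELS[new]
    return {
        "input_price_ratio": o["input"] / n["input"],
        "output_price_ratio": o["output"] / n["output"],
        "speedup": n["speed"] / o["speed"],
    }


if __name__ == "__main__":
    pairs = [
        ("claude-3-sonnet", "claude-3.5-sonnet"),
        ("gpt-3.5-turbo-16k-0613", "gpt-3.5-turbo-0125"),
        ("gpt-4-32k", "gpt-4o"),
    ]
    for old, new in pairs:
        r = compare(old, new)
        print(
            f"{old} -> {new}: "
            f"input {r['input_price_ratio']:.1f}x cheaper, "
            f"output {r['output_price_ratio']:.1f}x cheaper, "
            f"{r['speedup']:.1f}x faster"
        )
```

For the gpt-4-32k to gpt-4o pair this gives a 12x drop in input price, an 8x drop in output price, and a speedup of about 3.8x. For the Claude pair the price ratios are 1, and only the speed changes.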
We see progress in efficiency - faster generation at lower cost. We see little improvement in effectiveness (evals).
What could be the reason? I can speculate that:
- Closed models are getting smaller, i.e. closer in size to their open-source counterparts.
- Closed models use the efficiency tricks the open-source world has produced over the past few years, e.g. FlashAttention, quantisation, etc.
Could this be another manifestation of convergence? This time it is the old-big-fat-closed models moving towards the new-small-slim-open ones.