Gen AI Hype - the Never Ending Excitement

Maxim Saplin - Sep 23 - Dev Community

"When sensationalism wins over nuance, we lose our ability to think." is a great quote by Lex Fridman that sums up what I have to say next :)


The gap between expectations and the real utility of Generative AI (chatbots, agents, content generators, etc.) hasn't narrowed much in the past few years. Some have started voicing doubts, asking whether the huge Gen AI investments will ever produce any returns. Others paint a bleak picture of a big-tech industry that hasn't yet figured out how to build valuable and economically sustainable Gen AI products.

I have been following the Gen AI agenda very closely: subscribing to influencers and newsletters, reading social media (X, LinkedIn), watching YouTube (I like AI Explained a lot), digging into Gen AI product blogs (like this one), reading news articles, talking in Gen AI chats... I find Gen AI exciting and captivating! At the same time, I sense and see the frustration with Gen AI not always delivering.

Here are a few cases that demonstrate the mechanics of inflating expectations.

Influencers Hurrying to Share Their Excitement

Here's a recent example where Philipp Schmid tells us about the release of the Qwen2.5-Coder 7-billion-parameter model, which surpassed the performance of GPT-4-0613, released in June 2023 (supposedly a 1.7-trillion-parameter model):

Qwen2.5-Coder 7B beating GPT-4

As he points out, there's now a free, open-weight 7B model beating a monstrous 1.7T LLM by OpenAI at coding!! That is insane progress in LLM development!!! Or is it?

Philipp is a respected voice in the world of Gen AI, with over 110k followers on LinkedIn. I came across his great blog early this year when I started experimenting with LLM fine-tuning and was looking for practical examples. Since then I have followed his LinkedIn page and find it very useful for timely updates on what's happening in the world of Gen AI.

Yet I also started noticing a disconnect between the number of significant Gen AI events he (and the rest of the AI flock) regularly posts about, and how little of that breakthrough shows up in real life - in chatting with newer models or tackling coding tasks with AI assistants. With that many record-breaking evals throughout the year, the gains should have accumulated, and the breakthrough should be apparent in the products everyone uses daily!

The claims in the above post on Qwen2.5 made me very, very suspicious. Take, for example, the BigCodeBench score (a relatively new eval introduced in June 2024):

  • Qwen2.5-Coder 7B: 29.6%
  • GPT-4 0613: 17.4%

The difference seemed so outrageous that I made the effort and ran the model through BigCodeBench myself. Here's the result (the average of the COMPLETE and INSTRUCT scores on the HARD tasks subset):



```
## Avg

- pass: 57 / 19.2%
  For comparison:
    - GPT-4-Turbo-2024-04-09: 32.1%
    - GPT-4o-2024-08-06: 30.8%
    - Claude-3.5-Sonnet-20240620: 29.4%
    - DeepSeek-Coder-V2-Instruct (2024-07-24): 29.4%
    - GPT-3.5-Turbo-0125: 19.9%
    - GPT-4-0613: 17.6%
- total: 296
```



I followed the guide in the BigCodeBench GitHub repo: used greedy generation, sanitized the LLM answers, and ran the provided Docker images to evaluate them...
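For reference, here is a minimal sketch of how the single number above can be computed once the two hard-subset runs (COMPLETE and INSTRUCT) have been evaluated. The file names and the JSON layout are my own assumptions for illustration - the actual output of bigcodebench.evaluate may be structured differently:

```python
import json

# Hypothetical output paths of the two hard-subset runs (COMPLETE and INSTRUCT);
# the actual file names and JSON layout depend on how the evaluation is configured.
RESULT_FILES = [
    "qwen2.5-coder-7b_hard_complete_eval_results.json",
    "qwen2.5-coder-7b_hard_instruct_eval_results.json",
]

passed, total = 0, 0
for path in RESULT_FILES:
    with open(path) as f:
        results = json.load(f)
    # Assumed structure: {"task_id": {"status": "pass" | "fail"}, ...}
    statuses = [entry["status"] for entry in results.values()]
    passed += statuses.count("pass")
    total += len(statuses)

print(f"pass: {passed} / {passed / total:.1%}")  # e.g. "pass: 57 / 19.2%"
print(f"total: {total}")                          # e.g. "total: 296"
```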

I didn't replicate the 29.6% figure - the score I achieved, 19.2%, is well below it. Is it a manifestation of the replication crisis? Or did the Qwen authors mess up the numbers - confuse the hard/full task sets, use the wrong averages - whatever.

Still, 19.2% is noticeably higher than GPT-4-0613's 17.6%... And GPT-3.5-Turbo's 19.9% also beats that GPT-4. So the score I achieved is significantly lower than the reported one, yet higher than the older GPT-4's, yet lower than the newer GPT-3.5's - it is all confusing :)

What I am trying to say is that:

  • The reported benchmark results didn't match my experiment - which raises suspicion of cherry-picking by the Qwen team
  • Even if the numbers matched, it's unlikely they would translate into real-life scenarios
  • Comparing against the older GPT-4-0613 while skipping newer models doesn't seem right
  • It's so easy to fall for sensationalism and find breakthroughs where there are none
  • The problem of model contamination with benchmark data and of overfitting/gaming the evals is getting worse

Let me elaborate on the last point. There's a popular opinion that public evals of LLMs no longer reflect reality - benchmark data is hard to filter out of training datasets (hence the memorization of correct answers), and evals are easy to game and tune for (if the goal is good optics). There are attempts to create closed benchmarks (such as SEAL) or to apply clever tricks to ensure models are evaluated on newer data that was unlikely to exist at training time (such as LiveCodeBench).
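To make the "hard to filter out" point concrete, here is a minimal sketch of the kind of n-gram overlap check commonly used for decontamination. The 13-word window and the toy data are my own assumptions, not anything a particular benchmark prescribes:

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Lower-cased word n-grams; a 13-word window is a common decontamination heuristic."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(training_doc: str, eval_prompt: str, n: int = 13) -> bool:
    """Flag a training document if it shares any n-gram with an eval prompt."""
    return bool(ngrams(training_doc, n) & ngrams(eval_prompt, n))

# Toy usage: a web-scraped document that quotes a benchmark task verbatim
eval_prompt = "Write a function that returns the longest common prefix of a list of strings given as input"
scraped_doc = "Solution to a coding task: write a function that returns the longest common prefix of a list of strings given as input"
print(is_contaminated(scraped_doc, eval_prompt))  # True -> this document should be filtered out
```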

LiveCodeBench provides evidence that smaller models can be especially susceptible to eval contamination - being overfitted (i.e. trained repeatedly) on data that includes samples from the evals.

LiveCodeBench Overfitting

We also find that models that perform well on HumanEval might be overfitting on the benchmark. Particularly, the models are separated into two clusters depicted by the green and red shaded region in the right scatterplot. The models in the green region perform similarly on HumanEval and LCB-Easy, while the models in the red region perform well on HumanEval but lag behind on LCB-Easy.
...
This highlights a potential lack of diverse fine-tuning data being employed by the open source community and the need for optimizing models for a broader set of code-related tasks.

Influencers are not Journalists

They are not expected to put effort into verification, to lead with skepticism instead of enthusiasm, or to question the achievements they talk about. Rushing to share an exciting piece of news is the top priority.

Even without running every eval and checking every number (which wouldn't be practical anyway), one can still scrutinize the value of the evals and question whether those bumps in scores will be reflected in real-life use.

Another recent example is the story of Matthew Berman (a YouTuber with 330k subscribers covering everything Gen AI), who jumped onto the hype train of Reflection 70B, "the world's top open-source model". A few days after he covered the model and interviewed its creator, he had to record an apology video sharing his account of how he ended up involved in promoting a scam-like project.

Those two examples are not corner cases; they are representative of what is happening in the media - a constant tide of sensational achievements, with people with large audiences sharing their enthusiasm for the tech and presenting data that seems to support the claims.

If you are like me - interested in Gen AI and closely following the events in the industry - just be cautious with all the grand claims and breakthroughs you come across every day. It is easy to get lost in the information flow and take it at face value.

A New Rockstar Open-Source Model Coming Every Week

  • "outperforms Llama 3.1 70B and matches the performance of the massive 405B"
  • "an open Llama 3 model, has surpassed Anthropic Claude 3.5 Sonnet and OpenAI GPT-4o using Reflection-Tuning"
  • "this breakthrough confirms that we're not hitting a ceiling on LLM performance"

Quotes like the ones above have been a frequent sight this year - weekly, bi-weekly... You come across an announcement post in your feed praising yet another Llama fine-tune or open-weight model from some AI shop. So many rockstar eval champions have been introduced recently that they are hard to tell apart: Llama 3 and 3.1, Phi 3 and 3.5, Qwen 2 and 2.5, Yi 1.5, Gemma 2, SuperNova, Hermes, Reflection, all kinds of Minitrons and NeMoTrons...

Incremental changes stacked on top of incremental changes at such a pace suggest that open LLMs must have made great progress this year. You might get that impression from reading the feeds - but not from trying the models.

The Disappointment in SLMs

I recently posted about the convergence of LLMs - a trend of several clusters of similarly sized models converging on a certain baseline across evals. One such cluster is Small Language Models (SLMs) - models below 20-30B parameters that are relatively accessible for local deployment on consumer hardware.

There was one SLM-related thing missing from that post... Nobody in their right mind should use Llama 3.1 8B or Gemma 2 9B for productivity tasks such as chatbots or coding. They are horrible compared to the relatively affordable, bigger models from OpenAI and Anthropic.

Besides building narrow use cases, integrating Gen AI into products, and fine-tuning small models on your own data, I don't see much use for SLMs.

I try them from time to time... There's something therapeutic in waiting for a model to finish downloading, getting it up and running, and chatting with it. At some point I had 70+ models downloaded in LM Studio/Ollama - is that a lot? I don't know...

Ollama downloading multiple models

Besides chit-chat, I have my own advanced-level benchmark that I use to crash-test local models: a Crew.ai workflow with 3 agents and 4 tasks that requires reading and evaluating multiple files and creating intermediate file outputs (reused by other agents). I would swap the LLM config for a local one, with the model of interest loaded via LM Studio or Ollama and accessible through an OpenAI-compatible endpoint on localhost. And you know, I tested plenty of rockstar models ranging from 7B to 33B parameters. They could rarely finish the whole chain and produce any output at all, let alone meaningful output. Compared to the cheapest GPT-4o-mini, SLMs struggle with instruction following, adhering to the ReAct prompt structure and tool-call conventions, sometimes snowballing into endless hallucinations.
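As an illustration of the "swap the LLM config" step, here is a minimal sketch of pointing an OpenAI-compatible client at a locally served model. The port and the model name are assumptions that depend on your LM Studio/Ollama setup, and the real workflow wires this into Crew.ai rather than calling the client directly:

```python
from openai import OpenAI

# LM Studio and Ollama both expose an OpenAI-compatible server on localhost;
# the port (1234 for LM Studio, 11434 for Ollama) and the model name depend on your setup.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="qwen2.5-coder-7b-instruct",  # whatever model you have loaded locally
    messages=[
        {"role": "system", "content": "You are an agent. Reply strictly in the requested format."},
        {"role": "user", "content": "List the files you would read first and why, as a numbered list."},
    ],
    temperature=0,
)
print(response.choices[0].message.content)
```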

LLMs are Task Specific

Sometimes they work, sometimes they don't. That is my key observation. LLMs turned out to be not as universal and versatile as they seemed at the dawn of the ChatGPT revolution. You might find a combination of model, prompt, and use case that makes you happy. However, you will be disappointed when that combination doesn't scale to a different (even similar) use case, or when swapping the model breaks everything.

Trying to extrapolate from evals and blog posts doesn't seem to help; the only reliable way to find the right model is trial and error. And, of course, start with the biggest, most expensive, and most capable model.
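If it helps, here is a minimal sketch of what that trial and error can look like in code - looping over candidate models on your own task with a simple pass/fail check. The model list, the task, and the check are placeholders for whatever your use case actually needs, and the client again assumes an OpenAI-compatible endpoint (hosted or local):

```python
from openai import OpenAI

client = OpenAI()  # or OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed") for local models

# Ordered from most capable/expensive to cheapest - placeholder names, use whatever you have access to.
CANDIDATES = ["gpt-4o", "gpt-4o-mini", "qwen2.5-coder-7b-instruct"]

TASK = "Extract the year from this sentence and answer with the number only: 'The paper was published in 2017.'"

def passes(answer: str) -> bool:
    # A trivial task-specific check; real use cases need real acceptance criteria.
    return answer.strip() == "2017"

for model in CANDIDATES:
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": TASK}],
        temperature=0,
    ).choices[0].message.content
    print(f"{model}: {'PASS' if passes(reply) else 'FAIL'} ({reply!r})")
```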

P.S.

Exaggerated enthusiasm, a lack of skepticism, and the amplifying effects of social media may be among the factors hindering the pragmatic adoption of Gen AI.

"Journalism seems to increasingly optimize for drama over truth." is another super relevant quote from Lex.
