The recent release of OpenAI's new model included a few evals of Llama 3 400B (teased but not released by Meta):
Besides the fact that the data didn't come from Meta, what caught my attention was that a model roughly 4 times smaller outperformed the original GPT-4 (supposedly 1.76T params). It is also within the margin of error of the more recent GPT-4 models. It would be a pity if the 400B never gets released...
It seems that for the new model OpenAI has chosen the path of "horizontal integration", i.e. supporting more modalities out of the box, improving latency, and making the model more natural and lively, so that it is hard to tell whether you are talking to a living person or not.
Yet judging by the initial set of text benchmarks (knowledge, reasoning, math, coding), GPT-4o doesn't look like a breakthrough but a mere incremental improvement. Third-party benchmarks are surfacing, and GPT-4o is not the all-round winner one might expect. E.g. here's the needle-in-a-haystack test showing that GPT-4o struggles with fact retrieval beyond 1k tokens:
Another benchmark, this one on hallucinations, shows that GPT-4o is behind the older GPT-3.5.
It also can't boast an increased context size - still at 128k.
The Positioning of ChatGPT
The announcement reads as a pitch for the "new best friend". Few missed the opportunity to bring up analogies with the 11-year-old movie "Her", which depicts the development of a romantic relationship between a human and a digital assistant.
OpenAI's use-case videos demonstrate conversations with emotion, engagement, and interest from GPT-4o. The ChatGPT digital friend might be more successful at building friendships than most humans. It reminds me of the book Bowling Alone, which discusses how in-person social interactions have seen a steady decline over the years. GPT-4o perfectly fits the trend - just like instant messengers took over voice calls and social media substituted in-person communication, the free-to-use and nearly perfect ChatGPT has every chance of taking the next step in alienating people.
Pivoting away from the Utility
While the year 2023 was full of headlines warning of AI taking over cognitive jobs (just like machinery started phasing out manual labor in the 19th century), in 2024 there's little evidence of AI automation bringing meaningful productivity gains. Some voices are already calling an AI winter, emphasizing the limited applicability of Gen AI to real-world problems at scale.
"The first AI software engineer" Devin went viral at the beginning of 2024 and faced a lot of valid criticism shortly after. Even the marketing materials from its creators showed that on a benchmark that distantly resembles real-world dev tasks, it failed 86% of the time.
Additionally, the data from Sequoia tells us that Gen AI tools have problems with user engagement and retention. People try those tools but do not return:
In 2024 we might see a pivot from AI seen as a utility tool, productivity booster, and cognitive labor automation technology to AI as a personal assistant that builds relationships.
As a for-profit company, OpenAI likely faces user-engagement challenges and is seeking ways to make users spend more time with its products. GPT-4o fits that strategy perfectly.
While the faster and 2 times cheaper GPT-4o (compared to GPT-4 Turbo) might open up new avenues for building AI agents aimed at higher-level autonomy, this aspect is clearly not the focus of OpenAI's recent release. The key problems, such as hallucinations and the reliability of the output, have not been addressed in the new model. The whole release is about the "digital friend" features, with utility not being a priority.
Our garbage civilization seems to be headed towards the loneliness age at full speed...