OpenAI teased the o3 model today: a further development of its "reasoning" model line and a successor to o1.
I was impressed by how much it improved on ARC-AGI-1, a benchmark supposedly unbeatable by the current generation of LLMs: o1's high score was 32%, while o3 jumped right to 88%. The authors of the ARC Challenge (a $1M reward for beating ARC-AGI) were quite confident that transformer-based models wouldn't succeed on their benchmark, and they were not impressed with o1. The o3 blog post carries a completely different sentiment, with words such as "surprising", "novel", and "breakthrough". Yet there's a catch: it's very, very expensive. Scoring 76% cost around $9k; for 88%, OpenAI didn't disclose the cost (one can estimate it at roughly $1.5M, given the statement that 172x more compute was used: $9k × 172 ≈ $1.55M).
o3 reminded me of an analogy often mentioned when discussing LLMs. No matter the complexity of the task, GPTs use the same amount of compute and time per token, as if they were streaming information from their subconscious without ever stopping to think. This is similar to how the "Fast" System 1 of the human brain operates.
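To make that concrete, here's a toy sketch of plain autoregressive decoding (the `model.forward` interface is hypothetical; the point is the shape of the loop, not the API):

```python
# Toy sketch: in plain autoregressive decoding, every token costs exactly
# one forward pass, no matter how hard the question is.
def generate(model, prompt_tokens: list[int], max_new_tokens: int) -> list[int]:
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # One fixed-cost forward pass per token: the model spends as much
        # compute emitting a token of "2+2=4" as a token of a hard proof.
        next_token = model.forward(tokens)
        tokens.append(next_token)
    return tokens
```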
A quick recap: "Thinking, Fast and Slow" is a 2011 book by Daniel Kahneman. He argues that, functionally (based on empirical research), our brain has two departments (or modes):
- System 1, Fast - effortless, autonomous, associative.
- System 2, Slow - effortful, deliberate, logical.
The two systems work together and shape human thinking. We can read a book out loud without any stress, yet not remember a single word. Or we can read with focus, constantly replaying the scenes and pictures in our mind, keeping track of events and timelines, and be exhausted after a short while, yet actually acquire new knowledge.
As Andrew Ng once noted, try typing a text without ever hitting backspace: it seems like a hard task, and yet that is exactly how LLMs work.
Well, that's how they worked until recently. With o1 (and later DeepSeek R1, QwQ, and Gemini 2.0 Flash Thinking), models learned to take a pause and operate in a mode similar to the "Slow" system.
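You can even see the "pause" in the API response: reasoning models bill hidden thinking tokens separately. A minimal sketch with the OpenAI Python SDK (the exact usage fields may differ across SDK versions):

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

resp = client.chat.completions.create(
    model="o1-mini",
    messages=[{"role": "user", "content": "How many Rs are in 'strawberry'?"}],
)

print(resp.choices[0].message.content)
# Reasoning models report hidden "thinking" tokens in the usage stats;
# field names may vary between API/SDK versions.
details = resp.usage.completion_tokens_details
print("reasoning tokens:", details.reasoning_tokens)
```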
Recently there has been a lot of talk of LLM pre-training plateauing, training data being exhausted, and AI development hitting a wall.
We might be seeing a trend forming for 2025: combining reasoning/thinking models with traditional LLMs, interconnecting them as Slow and Fast minds: planning (Slow) and taking action (Fast), identifying (Fast) and evaluating (Slow), etc.
Here's a recent example from the Aider AI coding assistant, which shows how combining QwQ as an Architect and Qwen 2.5 as a Coder (there's a two-step "architect-code" mode that lets you choose a different model for each step) increases coding performance; a sketch of the pattern follows below.
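The pattern boils down to something like this (a sketch, not aider's actual implementation; the model names are illustrative, and I assume an OpenAI-compatible endpoint serving both models):

```python
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint serving both models

def architect_then_code(task: str) -> str:
    # Slow mind: a reasoning model deliberates and produces a plan.
    plan = client.chat.completions.create(
        model="qwq-32b-preview",  # illustrative model name
        messages=[{"role": "user",
                   "content": f"Write a step-by-step implementation plan for: {task}"}],
    ).choices[0].message.content

    # Fast mind: a conventional model executes the plan without deliberating.
    code = client.chat.completions.create(
        model="qwen2.5-coder-32b-instruct",  # illustrative model name
        messages=[{"role": "user",
                   "content": f"Implement this plan as code, no commentary:\n{plan}"}],
    ).choices[0].message.content
    return code
```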
Whether this will play out is hard to say. There are plenty of challenges where we haven't seen much progress lately, even with Slow models. It's unclear how resistant models such as o3 will be to hallucinations. Context windows are still too small. Prices are going up... Slow models, while they reach new levels on various "isolated" evals, are far from practical application at scale (running large projects on their own OR simulating a junior intern). And the Fast models, the actors, don't seem to have shown progress in computer use; Moravec's paradox remains a challenge when it comes to automating a computer clerk.
P.S.
Around the same time o3 was announced, I received API access to o1-mini. I ran my own LLM Chess Eval, which simulates chess games by prompting models to play against a random player. While previous SOTA models couldn't score even a single win (and I had assumed the benchmark was as hard as the ARC eval)... o1-mini won 30% of the time! Now I am less skeptical; there might be some reasoning after all.
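For the curious, the harness is essentially a loop like this (a simplified sketch using python-chess; `ask_llm_for_move` is a hypothetical helper standing in for the prompting and parsing logic):

```python
import random
import chess  # python-chess

def ask_llm_for_move(board: chess.Board) -> chess.Move:
    # Hypothetical helper: in the real eval this prompts the LLM with the
    # position and the list of legal moves, then parses its reply.
    raise NotImplementedError

def play_game(max_moves: int = 200) -> str:
    board = chess.Board()
    for _ in range(max_moves):
        if board.is_game_over():
            break
        if board.turn == chess.WHITE:
            # Random player: picks any legal move.
            board.push(random.choice(list(board.legal_moves)))
        else:
            # The model under evaluation plays Black.
            board.push(ask_llm_for_move(board))
    return board.result()  # e.g. "0-1" if the model (Black) wins
```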