Is “Thinking AI” a Leap Forward or Just Marketing Hype?

Asim Qazi - Feb 23 - Dev Community

The rapid evolution of generative AI is transforming our digital landscape. With new models boasting enhanced reasoning capabilities—often branded as "thinking models"—businesses are questioning whether these systems are truly smarter or whether we're merely witnessing a rebranding of incremental improvements. To explore this, I designed an experiment to evaluate the logical prowess of various free-to-use conversational AI models. My approach? Challenge both traditional and enhanced-reasoning variants with cipher tests thematically linked to a well-known song title: "Stairway to Heaven" by Led Zeppelin.

Experiment Overview
I set up two distinct decryption challenges:

  • Test 1: Polybius Square Cipher

Ciphered Text:

D3.D4.A1.B4.D2.E2.A1.E4.D4

C4.B3.A5.A1.E1.A5.C3

This test used a variant of the 5x5 Polybius Square (merging 'I' and 'J'), with the ciphertext segmented across two lines to increase complexity. No details about the cipher method were provided, forcing the models to infer the approach on their own. (A reference decoder is sketched after the Test 1 analysis below.)

  • Test 2: Playfair Grid Cipher

Ciphered Text: TNPBBVYFNQKGLWNU

In this challenge, I explicitly named the Playfair cipher and hinted that a common key should be used. Instead of the often-cited "MONARCHY," I selected "PLAYFAIR" as the intended decryption key, to reward logical reasoning tied directly to the cipher's name. (A reference decryption routine is sketched after the Test 2 analysis below.)

Each test was repeated four times to account for the non-deterministic nature of these models, with the average time to complete the task recorded.

Test 1: Polybius Square Cipher
Prompt: “Can you decode the hidden message?

D3.D4.A1.B4.D2.E2.A1.E4.D4

C4.B3.A5.A1.E1.A5.C3”

Results:

  • Traditional Models (e.g., ChatGPT 4o, DeepSeek V3, Claude 3.5 Sonnet, Gemini 2.0 Flash): All failed to decode the cipher, indicating a limitation in handling even moderately obfuscated text.

  • Thinking Models (e.g., o3-mini, DeepSeek R1, Gemini 2.0 Flash Thinking): o3-mini successfully decoded the message in an average of 19 seconds. DeepSeek R1 also decoded the cipher, but required an average of 110 seconds. Gemini 2.0 Flash Thinking could not solve the cipher in any repetition.

Analysis: The Polybius Square challenge revealed a clear performance divergence. Traditional models failed outright, while certain thinking models—especially o3-mini—demonstrated genuinely enhanced reasoning. The inconsistent results among the thinking models, however, show that not all architectures labeled as "thinking" are equally effective.
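
For readers who want to verify the Test 1 answer themselves, here is a minimal Python sketch of the decoding the models had to infer. It assumes the standard 5x5 Polybius layout (rows labeled A–E, columns 1–5, standard alphabet with J folded into I), which is exactly the arrangement that recovers the song title.

```python
# Minimal Polybius Square decoder.
# Assumed layout: rows A-E, columns 1-5, standard alphabet with J folded into I.
#   1 2 3 4 5
# A A B C D E
# B F G H I K
# C L M N O P
# D Q R S T U
# E V W X Y Z
ALPHABET = "ABCDEFGHIKLMNOPQRSTUVWXYZ"  # 25 letters, I/J merged

def decode_polybius(ciphertext: str) -> str:
    letters = []
    for pair in ciphertext.replace("\n", ".").split("."):
        pair = pair.strip()
        if not pair:
            continue
        row = ord(pair[0]) - ord("A")  # 'A'..'E' -> row 0..4
        col = int(pair[1]) - 1         # '1'..'5' -> column 0..4
        letters.append(ALPHABET[row * 5 + col])
    return "".join(letters)

message = "D3.D4.A1.B4.D2.E2.A1.E4.D4\nC4.B3.A5.A1.E1.A5.C3"
print(decode_polybius(message))  # -> STAIRWAYTOHEAVEN
```

Note how the line break falls mid-word, splitting "TO" across the two segments; that misleading segmentation is part of what the models had to see past.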

Test 2: Playfair Grid Cipher
Prompt: “Can you decrypt the cipher using playfair grid TNPBBVYFNQKGLWNU
Use a common key if you need one.”

Results:

  • Traditional Models: Consistently failed to decrypt the cipher.

  • Thinking Models: o3-mini again excelled, solving the cipher in an average of 119 seconds. DeepSeek R1 made partial progress but ultimately "overthought" the problem, averaging over four minutes without reaching a solution. Gemini 2.0 Flash Thinking once again failed to decrypt the cipher.

Analysis: The Playfair cipher, with its added complexity and hint, further underscored the advantages—and limitations—of thinking models. While o3-mini demonstrated robust problem-solving even under challenging conditions, DeepSeek R1’s struggle highlighted that enhanced reasoning does not guarantee success across all tasks.
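
To make the Test 2 task concrete, below is a short Python sketch of Playfair decryption with the key "PLAYFAIR": build the 5x5 key square (I/J merged), then invert the standard row, column, and rectangle rules for each two-letter group. This is a textbook implementation for checking the answer, not a reconstruction of how any model reasoned.

```python
# Playfair decryption with the key "PLAYFAIR" (I/J merged).
def build_square(key: str) -> str:
    seen, square = set(), []
    for ch in key.upper().replace("J", "I") + "ABCDEFGHIKLMNOPQRSTUVWXYZ":
        if ch not in seen:
            seen.add(ch)
            square.append(ch)
    return "".join(square)  # 25 letters, read row by row

def decrypt_playfair(ciphertext: str, key: str) -> str:
    square = build_square(key)
    pos = {ch: divmod(i, 5) for i, ch in enumerate(square)}  # letter -> (row, col)
    out = []
    for a, b in zip(ciphertext[::2], ciphertext[1::2]):  # process digraphs
        (ra, ca), (rb, cb) = pos[a], pos[b]
        if ra == rb:       # same row: shift each letter one column left
            out += square[ra * 5 + (ca - 1) % 5], square[rb * 5 + (cb - 1) % 5]
        elif ca == cb:     # same column: shift each letter one row up
            out += square[((ra - 1) % 5) * 5 + ca], square[((rb - 1) % 5) * 5 + cb]
        else:              # rectangle: swap the two columns
            out += square[ra * 5 + cb], square[rb * 5 + ca]
    return "".join(out)

print(decrypt_playfair("TNPBBVYFNQKGLWNU", "PLAYFAIR"))  # -> STAIRWAYTOHEAVEN
```

Running it confirms the intended plaintext, so a model only had to infer the key from the hint and then apply these mechanical rules.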

Figure: Average decryption times for the Polybius and Playfair cipher tests across the tested models, highlighting the performance gaps between traditional and thinking models.

Conclusion
This experiment provides compelling evidence that "thinking models" are more than just a marketing buzzword. The clear performance gap—especially the consistent success of o3-mini compared to the failures of traditional models—demonstrates that reinforcement learning techniques are yielding AI with tangible advances in logical reasoning.

Importantly, a trend emerges across releases: the most recent model tested, o3-mini, consistently outperformed its predecessors, and the second most recent, DeepSeek R1, showed marked improvements over earlier models in both problem-solving capability and speed. These incremental gains suggest that as these models evolve, their ability to tackle complex analytical tasks is advancing rapidly.

These findings extend well beyond cryptography. As AI becomes more deeply integrated into business operations—from data analysis and strategic planning to customer service and product development—understanding these nuanced reasoning abilities will be essential. The early successes observed in models like o3-mini hint at a future where AI can effectively contribute to complex tasks that require both critical thinking and rapid problem-solving.
