Reasoning models are all the rage at the moment, and justifiably so.
It means a slightly longer wait for an answer, but hopefully a more accurate response with fewer hallucinations.
Test 1: Truth or Lie?
The prompt: A TV game show contestant stands in front of two boxes.
Box 1 contains the keys to the star prize of a new car, Box 2 holds an apple.
Verdict
The o3 model nailed the answer extremely easily, using both high and low reasoning.
On high reasoning it took 5,424 ms and used 867 tokens for the answer.
On low, it took 3,157 ms and 231 output tokens.
Quite a difference in effort.
So she has to choose the opposite box to whatever she's told.
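That "pick the opposite box" rule can be sketched in a few lines. This is a minimal illustration, assuming (as the verdict implies) that the host always lies; the function name and box numbering are my own, not from the article's prompt.

```python
def best_choice(claimed_box: int, always_lies: bool = True) -> int:
    """Return which of Box 1 or Box 2 the contestant should open,
    given which box the host claims holds the car keys."""
    if always_lies:
        # Negating a liar's claim recovers the truth: pick the other box.
        return 3 - claimed_box
    return claimed_box

print(best_choice(1))  # host says Box 1 -> open Box 2
print(best_choice(2))  # host says Box 2 -> open Box 1
```

The `3 - claimed_box` trick simply flips between 1 and 2, which is exactly the inversion the puzzle demands.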
The prompt: I'm playing the Assetto Corsa Competizione racing game.
Question: I need you to tell me how many liters of fuel to take for a race.
Answer: You need 27.3 liters, with a bonus for adding a little extra for safety.
You cannot do a partial lap, of course.
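The calculation the prompt demands is simple enough: divide the race length by the lap time, round the lap count up (since a partial final lap still burns a full lap of fuel), and multiply by fuel per lap. Here is a minimal sketch; the race length, lap time, and per-lap consumption below are hypothetical illustrative values, not the figures from the actual prompt.

```python
import math

def fuel_needed(race_minutes: float, lap_minutes: float,
                litres_per_lap: float) -> float:
    # You cannot do a partial lap, so round the lap count up.
    laps = math.ceil(race_minutes / lap_minutes)
    return laps * litres_per_lap

# Hypothetical example: a 25-minute race, 2-minute laps, 2.1 L per lap.
# 25 / 2 = 12.5 laps, rounded up to 13 laps.
print(round(fuel_needed(25, 2, 2.1), 1))
```

The rounding-up step is the part the models appear to stumble on: truncating to 12 laps would leave the car dry before the flag.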
Shockingly, it got the answer wrong on its most powerful high reasoning setting.
Even worse, it took a whopping 10.9 seconds and 1,918 output tokens to get an incorrect answer.
o3-mini on high said 26.3 liters rounded up to about 27.
To put this into perspective, DeepSeek R1 got the correct answer first time in 29 seconds.
Qwen 2.5 7B said 27.03 liters, or approximately 27 to 28 liters.
To say I'm staggered is an understatement.
It's yet another example of the "how many Rs in strawberry" debacle, which many LLMs originally got wrong.