"In many ways, the paper echoes and amplifies an argument that I have been making since 1998."
Gary Marcus, AI critic

Key Facts
- Apple researchers released a 53-page paper arguing that large reasoning models (LRMs) like OpenAI's o3 and Google's Gemini do not truly 'think' or reason.
- The paper found that these AI models struggle with simple puzzles such as the Tower of Hanoi, scoring below 80% accuracy on seven-disk instances and failing on eight-disk puzzles (see the sketch after this list).
- The research sparked a viral debate on X, with many interpreting it as evidence that current LLMs are glorified autocomplete engines rather than genuine reasoners.
- AI critic Gary Marcus supported the findings, noting they echo arguments he has made since 1998 about AI's limitations.
- The debate highlighted that evaluation design is now as crucial as model design in assessing AI capabilities.
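The disk puzzles referenced above are Tower of Hanoi instances, where the shortest solution for n disks takes 2^n - 1 moves, so each added disk doubles the move sequence a model must produce without error. Below is a minimal Python sketch of the classic recursive solver, included only to illustrate that scaling; the solver itself is the standard textbook algorithm, not code from the Apple paper.

```python
# Classic recursive Tower of Hanoi solver (standard algorithm, not from the
# Apple paper). The optimal solution for n disks takes 2**n - 1 moves, so
# each added disk doubles the number of steps a model must get right.

def hanoi(n, source="A", target="C", spare="B", moves=None):
    """Return the optimal move list for n disks as (disk, from_peg, to_peg)."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, spare, target, moves)   # move n-1 disks out of the way
    moves.append((n, source, target))            # move the largest disk
    hanoi(n - 1, spare, target, source, moves)   # re-stack the n-1 disks on top
    return moves

for disks in (7, 8):
    print(f"{disks} disks: {len(hanoi(disks))} moves")  # 127, then 255
```

Seven disks already require 127 consecutive correct moves; eight require 255, which is the scale at which the paper reports accuracy collapsing.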
Key Stats at a Glance
- Length of Apple research paper: 53 pages
- Accuracy of AI models on seven-disk puzzles: below 80%
