Report Contents

  • 1. Executive Summary & Key Findings
  • 2. Theoretical Foundations: Chain-of-Thought Reasoning
  • 3. Beyond Linear: Tree-of-Thoughts & Graph-Based Approaches
  • 4. Test-Time Compute Scaling (Snell et al. 2025)
  • 5. Process Reward Models & Verification
  • 6. Frontier Model Analysis: o3, GPT-5.2, DeepSeek R1
  • 7. Benchmark Deep Dive: GPQA, ARC-AGI, AIME 2025
  • 8. Practical Implementation Guidance
  • 9. Failure Modes & Limitations
  • 10. Future Directions & Open Problems

What You'll Learn

  • How Chain-of-Thought prompting improves reasoning accuracy by 17.9 percentage points on GSM8K
  • Why Tree-of-Thoughts outperforms CoT by 18.5x on decomposition tasks
  • Test-time compute scaling curves: when to invest in inference vs training
  • Process Reward Models vs Outcome Reward Models for verification
  • o3's 87.5% ARC-AGI breakthrough and the path to 93.2% GPQA Diamond
  • Chain-of-Draft: achieving 92% token reduction while maintaining accuracy
  • LIMO pattern: 57% AIME accuracy with only 817 training samples
  • Critical failure modes and when structured reasoning breaks down
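The Chain-of-Thought prompting previewed above can be sketched in a few lines. This is a minimal, model-agnostic illustration: the few-shot example, the question text, and the answer-marker convention are all assumptions for demonstration, and no real model API is invoked.

```python
# Minimal sketch of Chain-of-Thought (CoT) prompting.
# The few-shot exemplar and "The answer is X." convention are
# illustrative assumptions, not a specific model's required format.

FEW_SHOT = """Q: A farmer has 15 apples and gives away 6. How many remain?
A: The farmer starts with 15 apples. Giving away 6 leaves 15 - 6 = 9.
The answer is 9.
"""

def build_cot_prompt(question: str) -> str:
    """Prepend a worked example so the model imitates step-by-step reasoning."""
    return f"{FEW_SHOT}\nQ: {question}\nA: Let's think step by step."

def extract_answer(completion: str) -> str:
    """Pull the final answer out of a CoT completion ending in 'The answer is X.'"""
    marker = "The answer is "
    idx = completion.rfind(marker)
    if idx == -1:
        return completion.strip()
    return completion[idx + len(marker):].rstrip(". \n")

# Example: build a prompt and parse a hypothetical completion.
prompt = build_cot_prompt("A shelf holds 24 books; 7 are borrowed. How many remain?")
completion = "The shelf holds 24 books. Borrowing 7 leaves 24 - 7 = 17. The answer is 17."
print(extract_answer(completion))  # -> 17
```

The point of the parser is that CoT completions mix reasoning tokens with the answer, so downstream evaluation (e.g. GSM8K scoring) needs a stable extraction convention.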