NeurIPS 2023

Game of 24 Benchmark

Success rates of Chain-of-Thought and Tree-of-Thoughts prompting on the Game of 24 arithmetic puzzle

Why an 18.5× Improvement?

The architectural difference behind the gap between 4% and 74% success (74 / 4 = 18.5)

Chain-of-Thought

Start: 4, 5, 6, 10
Try: 4 × 6 = 24
Need 5, 10... stuck
Single path fails
Paths Explored: 1
Success Rate: 4%
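
The dead end in the Chain-of-Thought panel can be checked by brute force. The sketch below is an illustration only, not part of the benchmark: the helper `reachable` is a hypothetical name, and it assumes the standard Game of 24 rules (each number used exactly once, with +, -, × and ÷). It enumerates every value obtainable from a set of numbers and confirms that, once 4 × 6 = 24 has been committed, the remaining numbers 24, 5 and 10 can no longer be combined into 24.

```python
from fractions import Fraction
from itertools import permutations

def reachable(values):
    """Return every result obtainable by combining all values exactly once."""
    if len(values) == 1:
        return {values[0]}
    results = set()
    # Pick an ordered pair so that a - b and a / b cover both directions.
    for i, j in permutations(range(len(values)), 2):
        a, b = values[i], values[j]
        rest = [values[k] for k in range(len(values)) if k not in (i, j)]
        outcomes = [a + b, a - b, a * b]
        if b != 0:
            outcomes.append(a / b)
        for r in outcomes:
            results |= reachable(rest + [r])
    return results

# State after the single committed step 4 x 6 = 24 (5 and 10 still unused).
after_first_step = [Fraction(24), Fraction(5), Fraction(10)]
dead_end = 24 not in reachable(after_first_step)

# The original puzzle is solvable; only this particular first step is not.
solvable = 24 in reachable([Fraction(n) for n in (4, 5, 6, 10)])
```

Exact `Fraction` arithmetic is used so that intermediate divisions do not introduce floating-point error.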

Tree-of-Thoughts

Start: 4, 5, 6, 10
├─ Branch A: (10-4)×(6-5) → eval
├─ Branch B: 6×5-10+4 → eval
└─ Branch C: (10-6)×(5+...) → eval
Select: B = 24
Paths Explored: 3+
Success Rate: 74%
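
The branching idea in the panel above can be sketched without a language model. The Python below is a minimal illustration under stated assumptions, not the procedure from Yao et al. and not ReasonKit's API: the hypothetical helpers `expand` and `solve` replace LLM-proposed and LLM-scored thoughts with exhaustive enumeration and depth-first backtracking. Each state is a list of (value, expression) pairs, and each step merges two numbers into one.

```python
from fractions import Fraction
from itertools import combinations

def expand(state):
    """Yield every child state made by combining two numbers with one operation."""
    for i, j in combinations(range(len(state)), 2):
        (a, ea), (b, eb) = state[i], state[j]
        rest = [state[k] for k in range(len(state)) if k not in (i, j)]
        candidates = [
            (a + b, f"({ea}+{eb})"),
            (a * b, f"({ea}*{eb})"),
            (a - b, f"({ea}-{eb})"),
            (b - a, f"({eb}-{ea})"),
        ]
        if b != 0:
            candidates.append((a / b, f"({ea}/{eb})"))
        if a != 0:
            candidates.append((b / a, f"({eb}/{ea})"))
        for value, expr in candidates:
            yield rest + [(value, expr)]

def solve(numbers, target=24):
    """Depth-first search with backtracking.

    Returns (expression, states_explored), or (None, states_explored)
    when no combination of the numbers reaches the target.
    """
    start = [(Fraction(n), str(n)) for n in numbers]
    stack = [start]
    explored = 0
    while stack:
        state = stack.pop()
        explored += 1
        if len(state) == 1:
            if state[0][0] == target:
                return state[0][1], explored
            continue  # dead end: back up and try another branch
        stack.extend(expand(state))
    return None, explored

expr, explored = solve([4, 5, 6, 10])
```

Because a failed branch is abandoned rather than fatal, the search recovers from a poor first move such as 4 × 6. An LLM-driven variant would prune with a model-based evaluator instead of visiting every state.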

Source: Yao et al., "Tree of Thoughts: Deliberate Problem Solving with Large Language Models"

arXiv:2305.10601 (NeurIPS 2023)

Bring Structured Reasoning to Your AI

ReasonKit implements Tree-of-Thoughts and other advanced reasoning protocols in production-ready Rust.