Game of 24 Benchmark

Name: Tree-of-Thoughts vs Chain-of-Thought Benchmark Comparison
Creator: ReasonKit

Success rate comparison between reasoning methodologies on mathematical problem-solving

Performance gain: 18.5× (74% success with Tree-of-Thoughts vs. 4% with Chain-of-Thought, as reported by Yao et al.)
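The 18.5× figure is a ratio of success rates. A minimal check, assuming the 4% Chain-of-Thought baseline reported by Yao et al. for Game of 24 alongside the 74% Tree-of-Thoughts rate quoted on this page (the variable names are illustrative):

```python
from fractions import Fraction

# Success rates on Game of 24 (assumption: figures as reported by Yao et al.).
cot_success = Fraction(4, 100)   # Chain-of-Thought prompting
tot_success = Fraction(74, 100)  # Tree-of-Thoughts

# Relative gain is the ratio of the two success rates.
gain = tot_success / cot_success
print(float(gain))  # 18.5
```

Note that this is a relative gain; in absolute terms the difference is 70 percentage points.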

Chain-of-Thought: linear sequential reasoning

Tree-of-Thoughts: branching exploration with evaluation

Why 18.5× Improvement?

The architectural difference that drives the performance gap

Chain-of-Thought (single path)

→ Start: 4, 5, 6, 10

→ Try: 4 × 6 = 24, but 5 and 10 are unused ✗

→ Must still use 5 and 10... stuck

→ Single path fails

Paths Explored

Success Rate: 4%
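The dead end in the single-path trace can be checked exhaustively. The sketch below is a plain brute-force solver, not a language model and not ReasonKit code; the helper names `next_states` and `can_reach_24` are hypothetical. It suggests that the puzzle is solvable from the start, but no longer solvable once the chain has committed to 4 × 6 and is left with 24, 5, and 10.

```python
from fractions import Fraction
from itertools import combinations

def next_states(nums):
    """All states reachable by combining two numbers with one operation."""
    states = []
    for i, j in combinations(range(len(nums)), 2):
        a, b = nums[i], nums[j]
        rest = [n for k, n in enumerate(nums) if k not in (i, j)]
        results = {a + b, a - b, b - a, a * b}
        if b != 0:
            results.add(a / b)
        if a != 0:
            results.add(b / a)
        states.extend(tuple(rest + [r]) for r in results)
    return states

def can_reach_24(nums):
    """True if the numbers can be combined, each used exactly once, into 24."""
    if len(nums) == 1:
        return nums[0] == 24
    return any(can_reach_24(s) for s in next_states(nums))

start = tuple(Fraction(n) for n in (4, 5, 6, 10))
# State after the chain commits to 4 * 6 as its first step.
after_first_step = tuple(Fraction(n) for n in (24, 5, 10))

print(can_reach_24(start))             # True
print(can_reach_24(after_first_step))  # False
```

Exact fractions are used instead of floats so that division steps cannot introduce rounding errors into the comparison with 24.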

Tree-of-Thoughts (branching search)

→ Start: 4, 5, 6, 10

├─ Branch A: (10-4)×(6-5) → eval

├─ Branch B: 5×(10-4)-6 → eval

└─ Branch C: (10-6)×(5+...) → eval

→ Select: B = 24 ✓

Paths Explored

Success Rate: 74%
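The branch, evaluate, and select loop in the trace can be sketched as a breadth-first search with a fixed beam width. This is an illustrative sketch only: in Tree-of-Thoughts the language model proposes and scores the branches, whereas here an exact solver (`solvable`, a hypothetical stand-in) plays the evaluator so the example is deterministic. The names `expand`, `tree_search`, and `beam_width` are likewise assumptions, not ReasonKit's API.

```python
from fractions import Fraction
from itertools import combinations

def expand(state):
    """Generate child states: combine any two numbers with one operation."""
    nums, steps = state
    children = []
    for i, j in combinations(range(len(nums)), 2):
        a, b = nums[i], nums[j]
        rest = tuple(n for k, n in enumerate(nums) if k not in (i, j))
        candidates = [(a + b, f"{a} + {b}"), (a * b, f"{a} * {b}"),
                      (a - b, f"{a} - {b}"), (b - a, f"{b} - {a}")]
        if b != 0:
            candidates.append((a / b, f"{a} / {b}"))
        if a != 0:
            candidates.append((b / a, f"{b} / {a}"))
        for value, text in candidates:
            children.append((rest + (value,), steps + [f"{text} = {value}"]))
    return children

def solvable(nums):
    """Exact evaluator (assumption): can these numbers still reach 24?"""
    if len(nums) == 1:
        return nums[0] == 24
    return any(solvable(child_nums) for child_nums, _ in expand((nums, [])))

def tree_search(numbers, beam_width=5):
    """Breadth-first search keeping the best `beam_width` states per level."""
    frontier = [(tuple(Fraction(n) for n in numbers), [])]
    for _ in range(len(numbers) - 1):
        children = [c for state in frontier for c in expand(state)]
        # Evaluate every branch, then keep only the most promising ones.
        children.sort(key=lambda c: solvable(c[0]), reverse=True)
        frontier = children[:beam_width]
    return next((steps for nums, steps in frontier if nums[0] == 24), None)

steps = tree_search([4, 5, 6, 10])
print(steps)
```

Because the stand-in evaluator is exact, the beam never discards every solvable branch. A model-based evaluator gives no such guarantee, which is one reason the reported success rate is 74% rather than 100%.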

Source: Yao et al., "Tree of Thoughts: Deliberate Problem Solving with Large Language Models" (NeurIPS 2023)

ReasonKit implements Tree-of-Thoughts and other advanced reasoning protocols in production-ready Rust.