Variance Reduction Benchmarks

Measured and simulated results showing how structured reasoning protocols reduce LLM output variance.

Tree-of-Thoughts vs Chain-of-Thought

Interactive visualization of the Game of 24 benchmark from the Tree of Thoughts paper (NeurIPS 2023), showing an 18.5× improvement in success rate (4% → 74%) through structured reasoning.

CoT: 4%
ToT: 74%

View Interactive Chart →

Claude CLI Benchmark (2025-12-30)

5 runs per question, 10 questions across factual/math/logic/decision/complex categories.

Inconsistency: 4.0% → 2.0%
Complex Tasks: 80% → 100%

View full report →
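The inconsistency figures above can be reproduced with a simple per-question tally. Below is a minimal sketch, assuming inconsistency is defined as the fraction of repeated runs whose answer deviates from that question's modal answer; this metric definition and the sample data are illustrative assumptions, not taken from the report.

```python
from collections import Counter

def inconsistency_rate(runs_per_question):
    """Fraction of runs deviating from each question's modal answer.

    `runs_per_question` maps a question id to the list of answers
    produced by repeated runs (e.g. 5 runs per question).
    NOTE: this definition is an assumption for illustration; the
    benchmark report may define inconsistency differently.
    """
    total_runs = 0
    deviating = 0
    for answers in runs_per_question.values():
        # Count how many runs agree with the most common answer.
        mode_count = Counter(answers).most_common(1)[0][1]
        total_runs += len(answers)
        deviating += len(answers) - mode_count
    return deviating / total_runs

# Hypothetical example: 2 questions, 5 runs each.
runs = {
    "q1": ["42", "42", "42", "42", "42"],  # fully consistent
    "q2": ["A", "A", "A", "A", "B"],       # one deviating run
}
print(f"{inconsistency_rate(runs):.1%}")  # → 10.0%
```

Running the same tally on a baseline condition and a protocol condition gives the before/after pair reported above.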

Multi-Model Benchmark (2025-12-30)

20 runs per question, simulated variance patterns based on academic literature.

Inconsistency: 20.0% → 1.5%
Complex Tasks: 55% → 100%

View full report →

Methodology

Benchmark design and the simulated variance patterns are based on the academic literature.

Run your own benchmarks: python benchmarks/variance_benchmark_claude.py