Despite impressive progress on complex reasoning, current large language models (LLMs) typically operate in isolation—treating each problem as an independent attempt, without accumulating or integrating experiential knowledge. In contrast, expert problem solvers—such as Olympiad or programming contest teams—leverage a rich tapestry of experiences: absorbing mentorship from coaches, developing intuition from past problems, leveraging knowledge of tool usage and library functionality, adapting strategies based on the expertise and experiences of peers, continuously refining their reasoning through trial and error, and learning from other related problems even during competition. Inspired by this, we introduce 𝕏olver—a training-free, multi-agent reasoning framework that equips a black-box LLM with a persistent, evolving memory of holistic experience. 𝕏olver integrates diverse experience modalities, including external and self-retrieval, tool use, agent collaboration, self-evaluation, and iterative refinement. By learning from relevant strategies, code fragments, and abstract reasoning patterns at inference time, 𝕏olver avoids generating solutions from scratch—marking a transition from isolated inference toward experience-aware language agents. Built on both open-weight and proprietary models, 𝕏olver consistently outperforms specialized reasoning agents (e.g., OctoTools, CheatSheet, Search-o1). Even when instantiated with lightweight backbones (e.g., QWQ-32B), it often surpasses the most advanced models to date—including Qwen3-235B, Gemini 2.5 Pro, o3, and o4-mini-high. With a stronger backbone like o3-mini-high, it achieves a new best result—98.1% on GSM8K, 94.4% on AIME'24, 93.7% on AIME'25, 99.8% on Math-500, and 91.6% on LiveCodeBench—highlighting holistic experience learning as a key step toward dynamic, generalist agents capable of expert-level reasoning.
𝕏olver Scaffold. At each iteration, agents receive their past reasoning history and top-ranked exemplars to generate new thoughts and responses, using tools (e.g., code) as needed. A judge model ranks the outputs, and an intermediate memory maintains the best responses over time. Exemplars are initialized via episodic retrieval and continually updated with high-quality solutions from the memory. Iteration stops when convergence is reached or the maximum number of steps is hit, followed by final verification.
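To make the scaffold concrete, below is a minimal Python sketch of this loop. It is an illustrative approximation only: the callables `agents`, `judge`, `retrieve`, and `verify`, the score-plateau convergence test, and the memory capacity are assumptions standing in for 𝕏olver's actual components, which are described here only at a high level.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class SharedMemory:
    """Intermediate memory that keeps the top-scoring responses seen so far."""
    capacity: int = 5
    entries: List[Tuple[float, str]] = field(default_factory=list)  # (score, response)

    def update(self, scored: List[Tuple[float, str]]) -> None:
        # Merge new scored responses and keep only the best `capacity` entries.
        self.entries = sorted(self.entries + scored, key=lambda e: e[0], reverse=True)[: self.capacity]

    def best(self) -> Tuple[float, str]:
        return self.entries[0] if self.entries else (float("-inf"), "")


def xolver_loop(
    problem: str,
    agents: List[Callable[[str, List[str], List[str]], str]],  # (problem, exemplars, history) -> response
    judge: Callable[[str, List[str]], List[float]],            # (problem, responses) -> scores
    retrieve: Callable[[str, int], List[str]],                 # episodic retrieval of exemplars
    verify: Callable[[str, str], bool],                        # final verification of the chosen answer
    max_iters: int = 4,
    eps: float = 1e-3,
) -> Tuple[str, bool]:
    exemplars = retrieve(problem, 5)      # exemplars initialized via episodic retrieval
    memory = SharedMemory()
    history: List[str] = []
    prev_best = float("-inf")

    for _ in range(max_iters):
        # Each dynamic agent sees the problem, the current exemplars, and the shared history.
        responses = [agent(problem, exemplars, history) for agent in agents]
        scores = judge(problem, responses)
        memory.update(list(zip(scores, responses)))
        history.extend(responses)

        # Exemplars are continually refreshed with high-quality solutions from memory.
        exemplars = retrieve(problem, 5) + [r for _, r in memory.entries]

        best_score, _ = memory.best()
        if best_score - prev_best < eps:  # assumed convergence rule: best score stops improving
            break
        prev_best = best_score

    best_score, best_response = memory.best()
    return best_response, verify(problem, best_response)
```

Plugging in LLM-backed callables (e.g., API calls for each dynamic agent and the judge) turns this skeleton into a runnable pipeline; the score-plateau rule is just one reasonable way to realize "iteration stops when convergence or max steps are reached."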
Model | Appr. | GSM8K | AIME '24 | AIME '25 | Math-500 | LiveCodeBench (v5) |
---|---|---|---|---|---|---|
Proprietary Models | ||||||
Claude 3.7 Sonnet T. | LongCoT | – | 61.3 | 49.5 | 96.2 | 51.4 |
Grok-3 (Beta) T. | Direct | – | 83.9 | 77.3 | – | 70.6 |
Grok-3-mini (Beta) T. | LongCoT | – | 89.5 | 82.0 | – | - |
Gemini 2.5 Flash T. | LongCoT | – | 88.0 | 78.0 | – | 63.5 |
o1 | LongCoT | 96.4 | 74.3 | 79.2 | 96.4 | 71.0 |
o3-mini-high | LongCoT | – | 87.3 | 86.5 | – | 69.5 |
Gemini 2.5 Pro. | Direct | – | 92.0 | 86.7 | – | 70.4 |
o3 | LongCoT | 96.7 | 91.6 | 88.9 | – | – |
o4-mini-high | LongCoT | – | 93.4 | 92.7 | – | 69.5 |
Open Weights Models | ||||||
DeepSeek-R1 | LongCoT | – | 79.8 | 70.0 | 97.3 | 64.3 |
Qwen3-235B-A22B | LongCoT | – | 85.7 | 81.5 | – | 70.7 |
Math/Code Specialized Models | ||||||
rStar-Math (Best) | – | 95.2 | 53.3 | – | 90.0 | – |
OpenMathReason (Best) | – | – | 93.3 | 80.0 | – | – |
AlphaOne (Best) | – | – | 53.3 | – | 89.4 | 75.8 |
OpenCodeReason (Best) | – | – | – | – | – | 61.8 |
rStar-Coder (Best) | – | – | – | – | – | 62.5 |
Kimi-k1.6-IOI-high | – | – | – | – | – | 73.8 |
Reasoning Agents/Frameworks | ||||||
o3-mini-medium | LongCoT | 95.2 | 75.8 | 70.4 | 97.3 | 66.3 |
o3-mini-medium | Self-Refl. | 93.1 | 79.4 | 76.5 | 95.2 | 73.2 |
o3-mini-medium | OctoTools | 95.4 | 81.7 | 75.3 | 97.5 | – |
o3-mini-medium | Search-o1 | 95.8 | 81.8 | 76.7 | 97.6 | 73.6 |
o3-mini-medium | CheatSheet | 95.9 | 82.2 | 75.8 | 97.7 | – |
o3-mini-medium | CodeSim | – | – | – | – | 73.8 |
o3-mini-medium | 𝕏olver (–) | 95.6 | 87.2 | 85.1 | 97.7 | 79.6 |
o3-mini-medium | 𝕏olver (+) | 97.1 | 93.8 | 89.4 | 99.2 | 87.3 |
QWQ-32B | LongCoT | 96.1 | 78.1 | 65.8 | 83.2 | 63.4 |
QWQ-32B | Self-Refl. | 94.0 | 79.3 | 66.3 | 80.4 | 69.2 |
QWQ-32B | OctoTools | 96.3 | 83.0 | 71.7 | 86.1 | – |
QWQ-32B | Search-o1 | 96.4 | 84.4 | 71.8 | 87.1 | 69.3 |
QWQ-32B | CheatSheet | 96.8 | 83.5 | 72.2 | 86.5 | – |
QWQ-32B | CodeSim | – | – | – | – | 70.5 |
QWQ-32B | 𝕏olver (–) | 96.5 | 89.9 | 79.5 | 93.1 | 76.2 |
QWQ-32B | 𝕏olver (+) | 98.0 | 93.6 | 82.7 | 95.5 | 79.2 |
o3-mini-high | 𝕏olver (+) | 98.1 | 94.4 | 93.7 | 99.8 | 91.6 |
𝕏olver against SoTA reasoning models, specialized models, and other reasoning agents across mathematical and coding tasks. Best results are boldfaced and second-best results are underlined. T.: Think models. LongCoT: standard prompting for reasoning models. "–" denotes either n/a (e.g., for math/code-specialized models) or results not reported.
We analyze the effect of varying the number of agents and reasoning iterations on 𝕏olver's performance. In a controlled setup, we fix one variable (e.g., 3 agents or 2 iterations) and incrementally increase the other. Performance improves consistently on both AIME '25 and LiveCodeBench with more agents or iterations. To probe deeper, we conduct a budget-controlled experiment on AIME '25, where the total reasoning budget (i.e., number of agents × number of iterations) is fixed. While iterative reasoning remains a crucial factor in 𝕏olver's performance, we find that increasing the number of agents—particularly beyond three—yields additional, emergent improvements, leading to a performance gain of over 4%.
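The budget-controlled protocol can be made concrete with a small sweep over (agents, iterations) splits whose product equals the fixed budget. The sketch below assumes a hypothetical `evaluate` callback that runs 𝕏olver with a given split and reports benchmark accuracy; it only illustrates the experimental design, not the paper's evaluation code.

```python
from typing import Callable, Dict, Tuple

def budget_controlled_sweep(
    budget: int,
    evaluate: Callable[[int, int], float],  # (num_agents, num_iterations) -> accuracy (hypothetical)
) -> Dict[Tuple[int, int], float]:
    """Evaluate every (agents, iterations) split with agents * iterations == budget."""
    results: Dict[Tuple[int, int], float] = {}
    for num_agents in range(1, budget + 1):
        if budget % num_agents == 0:          # only exact splits keep the total budget fixed
            num_iters = budget // num_agents
            results[(num_agents, num_iters)] = evaluate(num_agents, num_iters)
    return results

# Example: a budget of 12 compares (1,12), (2,6), (3,4), (4,3), (6,2), (12,1).
```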
We evaluate the impact of different retrieval strategies on 𝕏olver by comparing three settings: (1) External Retrieval, where the model retrieves the top-k (e.g., k = 5) most similar problems and their solutions from an external corpus; (2) Self-Retrieval, where the model recalls the top-k most similar problems and solutions from its own internal memory; and (3) No Retrieval, where neither external nor self-retrieval is used. Performance on both AIME '25 and LiveCodeBench follows the trend: External Retrieval > Self-Retrieval > No Retrieval.
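Both retrieval settings reduce to a top-k nearest-neighbor lookup over embedded problem-solution pairs (an external corpus in setting 1, the agent's own episodic memory in setting 2). The sketch below is a generic cosine-similarity implementation under that assumption; the embedding model and corpus format are placeholders, not 𝕏olver's actual retriever.

```python
import numpy as np

def top_k_retrieve(query_vec: np.ndarray,
                   corpus_vecs: np.ndarray,
                   corpus_items: list,
                   k: int = 5) -> list:
    """Return the k corpus items most similar to the query by cosine similarity.

    corpus_vecs: (N, d) matrix of precomputed embeddings for the external corpus
    or the agent's own episodic memory, depending on the retrieval setting.
    """
    q = query_vec / (np.linalg.norm(query_vec) + 1e-8)
    c = corpus_vecs / (np.linalg.norm(corpus_vecs, axis=1, keepdims=True) + 1e-8)
    sims = c @ q                      # cosine similarity of every corpus item to the query
    top = np.argsort(-sims)[:k]       # indices of the k most similar items
    return [corpus_items[i] for i in top]
```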
We perform a fine-grained analysis of 𝕏olver's performance across both MATH-500 and LiveCodeBench.On MATH-500, 𝕏olver (both o3-mini-medium and QWQ-32B) consistently outperforms CheatSheet across nearly all seven subject categories, despite the latter relying on costly per-problem memory updates. The only exception is in Number Theory, where o3-mini-medium scores 99.2 compared to CheatSheet's 99.5. As for QWQ-32B, 𝕏olver achieves substantial accuracy gains over CheatSheet across all categories, with improvements of +9.0% in Prealgebra, +8.5% in Algebra, +11.0% in Number Theory, +8.5% in Counting and Probability, +8.8% in Geometry, +10.0% in Intermediate Algebra, and +7.5% in Precalculus. On LiveCodeBench, 𝕏olver demonstrates even more pronounced gains. The o3-mini-medium variant achieves 95.6%, 90.4%, and 85.8% accuracy on Easy, Medium, and Hard problems respectively, significantly outperforming CodeSim by +4.5%, +11.9%, and a striking +32.3% margin on hard examples. Even with a weaker QWQ-32B backbone, 𝕏olver (95.2%, 87.5%, 70.0%) surpasses all baselines and achieves similar gains.
We perform a detailed analysis of token usage, reporting input, reasoning, and output statistics for 𝕏olver (QWQ-32B) across all datasets. 𝕏olver's LLM token usage has computational complexity O(mI), where m is the number of agents and I is the number of reasoning iterations. However, the runtime complexity remains O(I), since the dynamic agents operate in parallel. This is significantly more efficient than self-consistency, which typically requires 32–64 generations per example, as well as the baseline CheatSheet framework, which incurs a memory-update complexity of O(n²)—quadratic in the test-set size—because it estimates the usefulness of all previous examples after solving each new one. As a multi-agent system, 𝕏olver allocates a majority of its tokens to context sharing and inter-agent communication, while approximately 25% are spent on actual reasoning steps. We also compare the total token usage of 𝕏olver with that of Search-o1, a single-agent reasoning framework. As expected, 𝕏olver incurs higher token costs—approximately 1.5× those of Search-o1—due to its collaborative and iterative multi-agent reasoning. However, this moderate increase is a highly efficient trade-off given the substantial performance improvements observed.
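The O(I) wall-clock claim rests on running the m agent calls of each iteration concurrently. A minimal sketch of that pattern is shown below, assuming each call is an I/O-bound request to a hypothetical `call_agent` function; thread-based concurrency is one reasonable choice for API-backed agents, not necessarily 𝕏olver's implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

def run_iteration_parallel(
    call_agent: Callable[[int, str], str],  # (agent_id, prompt) -> response; assumed I/O-bound (API call)
    prompts: List[str],                     # one prompt per dynamic agent for this iteration
) -> List[str]:
    """Run all m agent calls of one iteration concurrently.

    Token cost is O(m) per iteration (hence O(mI) overall), but wall-clock time per
    iteration is roughly that of a single agent call, giving O(I) runtime over I iterations.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(prompts))) as pool:
        futures = [pool.submit(call_agent, i, p) for i, p in enumerate(prompts)]
        return [f.result() for f in futures]
```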
We conduct an in-depth analysis combining qualitative runtime inspection with controlled experiments to understand the reasoning and problem-solving strategies behind 𝕏olver. We begin by manually studying 𝕏olver's agent interaction traces on AIME '25 and LiveCodeBench. These case studies reveal that at each iteration, dynamic agents attempt to improve upon earlier failures by leveraging Judge-agent feedback and by aligning with top-ranked outputs stored in the shared memory. This process results in progressively refined outputs, increased agent alignment, and eventual convergence toward correct solutions. To verify this behavior systematically, we conduct a controlled experiment across both math and code tasks. We instantiate two dynamic agents with complementary strengths: a Coder agent and a Mathematician agent, each proficient in one domain but suboptimal in the other. We then measure their performance and agreement across iterations—defined as the percentage of problems in which both agents independently produce the same correct answer (for math) or code that passes the same test cases (for code). Both agents demonstrate consistent accuracy improvements over time, accompanied by a rising agreement rate.
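The agreement metric from this experiment can be computed as sketched below; `is_correct` and `same_solution` are hypothetical benchmark-specific checkers (exact match of final answers for math, identical passed test cases for code), not the paper's actual evaluation harness.

```python
from typing import Callable, List

def agreement_rate(
    outputs_a: List[str],
    outputs_b: List[str],
    is_correct: Callable[[int, str], bool],     # (problem_idx, output) -> passes the checker
    same_solution: Callable[[str, str], bool],  # same final answer / same test cases passed
) -> float:
    """Fraction of problems where both agents independently produce the same correct solution."""
    assert len(outputs_a) == len(outputs_b)
    if not outputs_a:
        return 0.0
    hits = sum(
        1 for i, (a, b) in enumerate(zip(outputs_a, outputs_b))
        if is_correct(i, a) and is_correct(i, b) and same_solution(a, b)
    )
    return hits / len(outputs_a)
```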
Reasoning Pattern | Correct: Easy → Medium | Correct: Medium → High | Incorrect: Easy → Medium | Incorrect: Medium → High |
---|---|---|---|---|
Self-Evaluation (↑) | 0.35 → 0.38 | 0.38 → 0.40 | 0.35 → 0.37 | 0.32 → 0.35 |
New Approach (↑) | 0.18 → 0.21 | 0.21 → 0.24 | 0.17 → 0.24 | 0.24 → 0.26 |
Problem Rephrasing (↓↑) | 0.20 → 0.17 | 0.18 → 0.18 | 0.23 → 0.24 | 0.24 → 0.25 |
Subgoal Setup (↓↑) | 0.14 → 0.13 | 0.13 → 0.11 | 0.11 → 0.12 | 0.11 → 0.11 |
Frequency changes of major reasoning patterns as problem difficulty increases on LiveCodeBench, comparing correct vs. incorrect solutions. Green and red indicate statistically significant increases or decreases (p < 0.05). Underlined cells highlight patterns where 𝕏olver improves over OpenCodeReasoning, which otherwise shows a declining trend. Direction arrows denote: ↑ = increase, ↓ = decrease, ↓↑ = mixed trend (decrease in correct, increase in incorrect). 𝕏olver increases its use of self-evaluation and new approaches as task difficulty grows, and deploys targeted subgoal setup and problem rephrasing when solutions fail—reflecting its adaptive, collaborative reasoning.
@article{hosain2025xolver,
title={𝕏olver: Multi-Agent Reasoning with Holistic Experience Learning Just Like an Olympiad Team},
author={Md Tanzib Hosain and Salman Rahman and Md Kishor Morol and Md Rizwan Parvez},
journal={arXiv preprint},
year={2025}
}