Despite impressive progress on complex reasoning, current large language models (LLMs) typically operate in isolation—treating each problem as an independent attempt, without accumulating or integrating experiential knowledge. In contrast, expert problem solvers—such as Olympiad or programming contest teams—leverage a rich tapestry of experiences: absorbing mentorship from coaches, developing intuition from past problems, leveraging knowledge of tool usage and library functionality, adapting strategies based on the expertise and experiences of peers, continuously refining their reasoning through trial and error, and learning from other related problems even during competition. Inspired by this, we introduce 𝕏olver—a training-free, multi-agent reasoning framework that equips a black-box LLM with a persistent, evolving memory of holistic experience. 𝕏olver integrates diverse experience modalities, including external and self-retrieval, tool use, agent collaboration, self-evaluation, and iterative refinement. By learning from relevant strategies, code fragments, and abstract reasoning patterns at inference time, 𝕏olver avoids generating solutions from scratch—marking a transition from isolated inference toward experience-aware language agents. Built on both open-weight and proprietary models, 𝕏olver consistently outperforms specialized reasoning agents (e.g., OctoTools, CheatSheet, Search-o1). Even when instantiated with lightweight backbones (e.g., QWQ-32B), it often surpasses the most advanced models to date—including Qwen3-235B, Gemini 2.5 Pro, o3, and o4-mini-high. With a stronger backbone like o3-mini-high, it achieves a new best result—98.1% on GSM8K, 94.4% on AIME'24, 93.7% on AIME'25, 99.8% on Math-500, and 91.6% on LiveCodeBench—highlighting holistic experience learning as a key step toward dynamic, generalist agents capable of expert-level reasoning.
𝕏olver Scaffold. At each iteration, agents receive their past reasoning history and top-ranked exemplars to generate new thoughts and responses, using tools (e.g., code) as needed. A judge model ranks the outputs, and an intermediate memory maintains the best responses over time. Exemplars are initialized via episodic retrieval and continually updated with high-quality solutions from the memory. Iteration stops when convergence is reached or the maximum number of steps is hit, followed by final verification.
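To make the scaffold concrete, below is a minimal Python sketch of this loop. It is an illustrative approximation only: the callables `agents`, `judge`, `retrieve`, and `verify`, the score-plateau convergence test, and the memory capacity are assumptions standing in for 𝕏olver's actual components, which are described here only at a high level.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class SharedMemory:
    """Intermediate memory that keeps the top-scoring responses seen so far."""
    capacity: int = 5
    entries: List[Tuple[float, str]] = field(default_factory=list)  # (score, response)

    def update(self, scored: List[Tuple[float, str]]) -> None:
        # Merge new scored responses and keep only the best `capacity` entries.
        self.entries = sorted(self.entries + scored, key=lambda e: e[0], reverse=True)[: self.capacity]

    def best(self) -> Tuple[float, str]:
        return self.entries[0] if self.entries else (float("-inf"), "")


def xolver_loop(
    problem: str,
    agents: List[Callable[[str, List[str], List[str]], str]],  # (problem, exemplars, history) -> response
    judge: Callable[[str, List[str]], List[float]],            # (problem, responses) -> scores
    retrieve: Callable[[str, int], List[str]],                 # episodic retrieval of exemplars
    verify: Callable[[str, str], bool],                        # final verification of the chosen answer
    max_iters: int = 4,
    eps: float = 1e-3,
) -> Tuple[str, bool]:
    exemplars = retrieve(problem, 5)      # exemplars initialized via episodic retrieval
    memory = SharedMemory()
    history: List[str] = []
    prev_best = float("-inf")

    for _ in range(max_iters):
        # Each dynamic agent sees the problem, the current exemplars, and the shared history.
        responses = [agent(problem, exemplars, history) for agent in agents]
        scores = judge(problem, responses)
        memory.update(list(zip(scores, responses)))
        history.extend(responses)

        # Exemplars are continually refreshed with high-quality solutions from memory.
        exemplars = retrieve(problem, 5) + [r for _, r in memory.entries]

        best_score, _ = memory.best()
        if best_score - prev_best < eps:  # assumed convergence rule: best score stops improving
            break
        prev_best = best_score

    best_score, best_response = memory.best()
    return best_response, verify(problem, best_response)
```

Plugging in LLM-backed callables (e.g., API calls for each dynamic agent and the judge) turns this skeleton into a runnable pipeline; the score-plateau rule is just one reasonable way to realize "iteration stops when convergence or max steps are reached."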
Model | Appr. | GSM8K | AIME '24 | AIME '25 | Math-500 | LiveCodeBench (v5) |
---|---|---|---|---|---|---|
Proprietary Models | ||||||
Claude 3.7 Sonnet T. | LongCoT | – | 61.3 | 49.5 | 96.2 | 51.4 |
Grok-3 (Beta) T. | Direct | – | 83.9 | 77.3 | – | 70.6 |
Grok-3-mini (Beta) T. | LongCoT | – | 89.5 | 82.0 | – | - |
Gemini 2.5 Flash T. | LongCoT | – | 88.0 | 78.0 | – | 63.5 |
o1 | LongCoT | 96.4 | 74.3 | 79.2 | 96.4 | 71.0 |
o3-mini-high | LongCoT | – | 87.3 | 86.5 | – | 69.5 |
Gemini 2.5 Pro. | Direct | – | 92.0 | 86.7 | – | 70.4 |
o3 | LongCoT | 96.7 | 91.6 | 88.9 | – | – |
o4-mini-high | LongCoT | – | 93.4 | 92.7 | – | 69.5 |
Open Weights Models | ||||||
DeepSeek-R1 | LongCoT | – | 79.8 | 70.0 | 97.3 | 64.3 |
Qwen3-235B-A22B | LongCoT | – | 85.7 | 81.5 | – | 70.7 |
Math/Code Specialized Models | ||||||
rStar-Math (Best) | – | 95.2 | 53.3 | – | 90.0 | – |
OpenMathReason (Best) | – | – | 93.3 | 80.0 | – | – |
AlphaOne (Best) | – | – | 53.3 | – | 89.4 | 75.8 |
OpenCodeReason (Best) | – | – | – | – | – | 61.8 |
rStar-Coder (Best) | – | – | – | – | – | 62.5 |
Kimi-k1.6-IOI-high | – | – | – | – | – | 73.8 |
Reasoning Agents/Frameworks | ||||||
o3-mini-medium | LongCoT | 95.2 | 75.8 | 70.4 | 97.3 | 66.3 |
o3-mini-medium | Self-Refl. | 93.1 | 79.4 | 76.5 | 95.2 | 73.2 |
o3-mini-medium | OctoTools | 95.4 | 81.7 | 75.3 | 97.5 | – |
o3-mini-medium | Search-o1 | 95.8 | 81.8 | 76.7 | 97.6 | 73.6 |
o3-mini-medium | CheatSheet | 95.9 | 82.2 | 75.8 | 97.7 | – |
o3-mini-medium | CodeSim | – | – | – | – | 73.8 |
o3-mini-medium | 𝕏olver (–) | 95.6 | 87.2 | 85.1 | 97.7 | 79.6 |
o3-mini-medium | 𝕏olver (+) | 97.1 | 93.8 | 89.4 | 99.2 | 87.3 |
QWQ-32B | LongCoT | 96.1 | 78.1 | 65.8 | 83.2 | 63.4 |
QWQ-32B | Self-Refl. | 94.0 | 79.3 | 66.3 | 80.4 | 69.2 |
QWQ-32B | OctoTools | 96.3 | 83.0 | 71.7 | 86.1 | – |
QWQ-32B | Search-o1 | 96.4 | 84.4 | 71.8 | 87.1 | 69.3 |
QWQ-32B | CheatSheet | 96.8 | 83.5 | 72.2 | 86.5 | – |
QWQ-32B | CodeSim | – | – | – | – | 70.5 |
QWQ-32B | 𝕏olver (–) | 96.5 | 89.9 | 79.5 | 93.1 | 76.2 |
QWQ-32B | 𝕏olver (+) | 98.0 | 93.6 | 82.7 | 95.5 | 79.2 |
o3-mini-high | 𝕏olver (+) | 98.1 | 94.4 | 93.7 | 99.8 | 91.6 |
𝕏olver against SoTA reasoning models, specialized models, and other reasoning agents across mathematical and coding tasks. Best results are boldfaced and second-best results are underlined. T.: Think models. LongCoT: standard prompting for reasoning models. "–" denotes either n/a (e.g., for math/code-specialized models) or results not reported.
We analyze the effect of varying the number of agents and reasoning iterations on 𝕏olver's performance. In a controlled setup, we fix one variable (e.g., 3 agents or 2 iterations) and incrementally increase the other. Performance improves consistently on both AIME '25 and LiveCodeBench with more agents or iterations. To probe deeper, we conduct a budget-controlled experiment on AIME '25, where the total reasoning budget (i.e., number of agents × number of iterations) is fixed. While iterative reasoning remains a crucial factor in 𝕏olver's performance, we find that increasing the number of agents—particularly beyond three—yields additional, emergent improvements, leading to a performance gain of over 4%.
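The budget-controlled protocol can be made concrete with a small sweep over (agents, iterations) splits whose product equals the fixed budget. The sketch below assumes a hypothetical `evaluate` callback that runs 𝕏olver with a given split and reports benchmark accuracy; it only illustrates the experimental design, not the paper's evaluation code.

```python
from typing import Callable, Dict, Tuple

def budget_controlled_sweep(
    budget: int,
    evaluate: Callable[[int, int], float],  # (num_agents, num_iterations) -> accuracy (hypothetical)
) -> Dict[Tuple[int, int], float]:
    """Evaluate every (agents, iterations) split with agents * iterations == budget."""
    results: Dict[Tuple[int, int], float] = {}
    for num_agents in range(1, budget + 1):
        if budget % num_agents == 0:          # only exact splits keep the total budget fixed
            num_iters = budget // num_agents
            results[(num_agents, num_iters)] = evaluate(num_agents, num_iters)
    return results

# Example: a budget of 12 compares (1,12), (2,6), (3,4), (4,3), (6,2), (12,1).
```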
We evaluate the impact of different retrieval strategies on 𝕏olver by comparing three settings: (1) External Retrieval, where the model retrieves the top-k (e.g., k = 5) most similar problems and their solutions from an external corpus; (2) Self-Retrieval, where the model recalls the top-k most similar problems and solutions from its own internal memory; and (3) No Retrieval, where neither external nor self-retrieval is used. Performance on both AIME '25 and LiveCodeBench follows the trend: External Retrieval > Self-Retrieval > No Retrieval.
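Both retrieval settings reduce to a top-k nearest-neighbor lookup over embedded problem-solution pairs (an external corpus in setting 1, the agent's own episodic memory in setting 2). The sketch below is a generic cosine-similarity implementation under that assumption; the embedding model and corpus format are placeholders, not 𝕏olver's actual retriever.

```python
import numpy as np

def top_k_retrieve(query_vec: np.ndarray,
                   corpus_vecs: np.ndarray,
                   corpus_items: list,
                   k: int = 5) -> list:
    """Return the k corpus items most similar to the query by cosine similarity.

    corpus_vecs: (N, d) matrix of precomputed embeddings for the external corpus
    or the agent's own episodic memory, depending on the retrieval setting.
    """
    q = query_vec / (np.linalg.norm(query_vec) + 1e-8)
    c = corpus_vecs / (np.linalg.norm(corpus_vecs, axis=1, keepdims=True) + 1e-8)
    sims = c @ q                      # cosine similarity of every corpus item to the query
    top = np.argsort(-sims)[:k]       # indices of the k most similar items
    return [corpus_items[i] for i in top]
```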
We perform a fine-grained analysis of 𝕏olver's performance across both MATH-500 and LiveCodeBench.On MATH-500, 𝕏olver (both o3-mini-medium and QWQ-32B) consistently outperforms CheatSheet across nearly all seven subject categories, despite the latter relying on costly per-problem memory updates. The only exception is in Number Theory, where o3-mini-medium scores 99.2 compared to CheatSheet's 99.5. As for QWQ-32B, 𝕏olver achieves substantial accuracy gains over CheatSheet across all categories, with improvements of +9.0% in Prealgebra, +8.5% in Algebra, +11.0% in Number Theory, +8.5% in Counting and Probability, +8.8% in Geometry, +10.0% in Intermediate Algebra, and +7.5% in Precalculus. On LiveCodeBench, 𝕏olver demonstrates even more pronounced gains. The o3-mini-medium variant achieves 95.6%, 90.4%, and 85.8% accuracy on Easy, Medium, and Hard problems respectively, significantly outperforming CodeSim by +4.5%, +11.9%, and a striking +32.3% margin on hard examples. Even with a weaker QWQ-32B backbone, 𝕏olver (95.2%, 87.5%, 70.0%) surpasses all baselines and achieves similar gains.
We perform a detailed analysis of token usage, reporting input, reasoning, and output statistics for 𝕏olver (QWQ-32B) across all datasets. 𝕏olver's LLM token usage has computational complexity O(mI), where m is the number of agents and I is the number of reasoning iterations. However, the runtime complexity remains O(I), since the dynamic agents operate in parallel. This is significantly more efficient than self-consistency, which typically requires 32–64 generations per example, as well as the baseline CheatSheet framework, which incurs a memory-update complexity of O(n²)—quadratic in the test-set size—because it estimates the usefulness of all previous examples after solving each new one. As a multi-agent system, 𝕏olver allocates a majority of its tokens to context sharing and inter-agent communication, while approximately 25% are spent on actual reasoning steps. We also compare the total token usage of 𝕏olver with that of Search-o1, a single-agent reasoning framework. As expected, 𝕏olver incurs higher token costs—approximately 1.5× those of Search-o1—due to its collaborative and iterative multi-agent reasoning. However, this moderate increase is a highly efficient trade-off given the substantial performance improvements observed.
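The O(I) wall-clock claim rests on running the m agent calls of each iteration concurrently. A minimal sketch of that pattern is shown below, assuming each call is an I/O-bound request to a hypothetical `call_agent` function; thread-based concurrency is one reasonable choice for API-backed agents, not necessarily 𝕏olver's implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

def run_iteration_parallel(
    call_agent: Callable[[int, str], str],  # (agent_id, prompt) -> response; assumed I/O-bound (API call)
    prompts: List[str],                     # one prompt per dynamic agent for this iteration
) -> List[str]:
    """Run all m agent calls of one iteration concurrently.

    Token cost is O(m) per iteration (hence O(mI) overall), but wall-clock time per
    iteration is roughly that of a single agent call, giving O(I) runtime over I iterations.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(prompts))) as pool:
        futures = [pool.submit(call_agent, i, p) for i, p in enumerate(prompts)]
        return [f.result() for f in futures]
```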
We conduct an in-depth analysis combining qualitative runtime inspection with controlled experiments to understand the reasoning and problem-solving strategies behind 𝕏olver. We begin by manually studying 𝕏olver's agent interaction traces on AIME '25 and LiveCodeBench. These case studies reveal that at each iteration, dynamic agents attempt to improve upon earlier failures by leveraging Judge-agent feedback and by aligning with top-ranked outputs stored in the shared memory. This process results in progressively refined outputs, increased agent alignment, and eventual convergence toward correct solutions. To verify this behavior systematically, we conduct a controlled experiment across both math and code tasks. We instantiate two dynamic agents with complementary strengths: a Coder agent and a Mathematician agent, each proficient in one domain but suboptimal in the other. We then measure their performance and agreement across iterations—defined as the percentage of problems in which both agents independently produce the same correct answer (for math) or code that passes the same test cases (for code). Both agents demonstrate consistent accuracy improvements over time, accompanied by a rising agreement rate.
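The agreement metric from this experiment can be computed as sketched below; `is_correct` and `same_solution` are hypothetical benchmark-specific checkers (exact match of final answers for math, identical passed test cases for code), not the paper's actual evaluation harness.

```python
from typing import Callable, List

def agreement_rate(
    outputs_a: List[str],
    outputs_b: List[str],
    is_correct: Callable[[int, str], bool],     # (problem_idx, output) -> passes the checker
    same_solution: Callable[[str, str], bool],  # same final answer / same test cases passed
) -> float:
    """Fraction of problems where both agents independently produce the same correct solution."""
    assert len(outputs_a) == len(outputs_b)
    if not outputs_a:
        return 0.0
    hits = sum(
        1 for i, (a, b) in enumerate(zip(outputs_a, outputs_b))
        if is_correct(i, a) and is_correct(i, b) and same_solution(a, b)
    )
    return hits / len(outputs_a)
```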
Reasoning Pattern | Correct: Easy → Medium | Correct: Medium → High | Incorrect: Easy → Medium | Incorrect: Medium → High |
---|---|---|---|---|
Self-Evaluation (↑) | 0.35 → 0.38 | 0.38 → 0.40 | 0.35 → 0.37 | 0.32 → 0.35 |
New Approach (↑) | 0.18 → 0.21 | 0.21 → 0.24 | 0.17 → 0.24 | 0.24 → 0.26 |
Problem Rephrasing (↓↑) | 0.20 → 0.17 | 0.18 → 0.18 | 0.23 → 0.24 | 0.24 → 0.25 |
Subgoal Setup (↓↑) | 0.14 → 0.13 | 0.13 → 0.11 | 0.11 → 0.12 | 0.11 → 0.11 |
Frequency changes of major reasoning patterns as problem difficulty increases on LiveCodeBench, comparing correct vs. incorrect solutions. Green and red indicate statistically significant increases or decreases (p < 0.05). Underlined cells highlight patterns where 𝕏olver improves over OpenCodeReasoning, which otherwise shows a declining trend. Direction arrows denote: ↑ = increase, ↓ = decrease, ↓↑ = mixed trend (decrease in correct, increase in incorrect). 𝕏olver increases its use of self-evaluation and new approaches as task difficulty grows, and deploys targeted subgoal setup and problem rephrasing when solutions fail—reflecting its adaptive, collaborative reasoning.
@article{hosain2025xolver,
title={𝕏olver: Multi-Agent Reasoning with Holistic Experience Learning Just Like an Olympiad Team},
author={Md Tanzib Hosain and Salman Rahman and Md Kishor Morol and Md Rizwan Parvez},
journal={arXiv preprint},
year={2025}
}