1 Bangladesh University of Engineering and Technology (BUET)
2 Qatar Computing Research Institute (QCRI)
* Work done while working as a remote RA at QCRI.
Figure: Overview of CodeSIM. It consists of three agents: planning, coding, and debugging. The Planning Agent first generates an exemplar problem-solution (i.e., via self-retrieval) and devises a plan, which is then verified and refined through simulation. Next, the Coding Agent implements the plan. Finally, the Debugging Agent addresses potential bugs through step-wise simulation across d trials. The entire process iterates p times.
Large Language Models (LLMs) have made significant strides in code generation and problem solving. Current approaches employ external tool-based iterative debuggers that use compiler or other runtime feedback to refine coarse programs generated by various methods. However, the effectiveness of these approaches heavily relies on the quality of the initial code generation, which remains an open challenge. In this paper, we introduce CodeSIM, a novel multi-agent code generation framework that comprehensively addresses the stages of program synthesis (planning, coding, and debugging) through a human-like perception approach. Just as humans verify their understanding of an algorithm through visual simulation, CodeSIM uniquely features plan verification and internal debugging via step-by-step simulation of input/output. Extensive experiments across seven challenging competitive problem-solving and program synthesis benchmarks demonstrate CodeSIM's remarkable code generation capabilities. Our framework achieves new state-of-the-art pass@1 results: HumanEval 95.1%, MBPP 90.7%, APPS 22.0%, and CodeContests 29.1%. Furthermore, our method shows potential for even greater enhancement when cascaded with external debuggers.
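A minimal Python sketch of this loop is shown below. The helper functions call_llm and execute_tests, the prompt wording, and the default values of p and d are illustrative assumptions for exposition, not the authors' exact implementation.

def codesim(problem, sample_io, call_llm, execute_tests, p=5, d=3):
    """Planning -> coding -> simulation-driven debugging, repeated up to p times."""
    code = ""
    for _ in range(p):
        # Planning Agent: self-retrieve an exemplar problem-solution, then draft a plan.
        exemplar = call_llm("Recall a similar problem and its solution:\n" + problem)
        plan = call_llm("Write a step-by-step plan using this exemplar:\n" + exemplar + "\n" + problem)
        # Verify the plan by simulating it on the sample input/output; refine if it looks wrong.
        verdict = call_llm("Simulate the plan on this sample I/O and judge it:\n" + plan + "\n" + sample_io)
        if "incorrect" in verdict.lower():
            plan = call_llm("Refine the plan given this simulation:\n" + plan + "\n" + verdict)
        # Coding Agent: implement the (possibly refined) plan.
        code = call_llm("Implement this plan in Python:\n" + plan + "\n" + problem)
        # Debugging Agent: up to d trials of step-wise simulation on failing sample I/O.
        for _ in range(d):
            passed, feedback = execute_tests(code, sample_io)
            if passed:
                return code
            trace = call_llm("Simulate this code step by step on the failing input:\n" + code + "\n" + feedback)
            code = call_llm("Fix the bug revealed by the simulation:\n" + code + "\n" + trace)
    return code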
Table: Pass@1 results for different approaches on basic programming tasks.
Overall, CodeSIM demonstrates consistently superior performance compared to all other baselines across all datasets and LLMs. Notably, CodeSIM achieves top scores with GPT-4o, reaching 95.1% on HumanEval, 87.2% on EvalPlus, and 90.7% on MBPP, for an impressive 82.7% overall average and new state-of-the-art (SoTA) results. This represents a significant improvement over the next-best method, MapCoder, which scores 79.0% on average with GPT-4o. CodeSIM's effectiveness is consistent across model variants, outperforming other approaches with ChatGPT (75.1% avg) and GPT-4 (81.3% avg) as well. Its robust performance across diverse datasets, including the challenging MBPP-ET where it achieves 61.5% with GPT-4, underscores its versatility in handling various programming tasks. These results strongly indicate that CodeSIM's simulation-driven planning and debugging marks a substantial advancement in code generation and problem-solving capabilities.
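For reference, pass@1 is commonly computed with the unbiased pass@k estimator of Chen et al. (2021); the short Python sketch below illustrates it for n generated samples per problem, of which c pass the hidden tests (the function name pass_at_k and the example counts are illustrative).

import math

def pass_at_k(n, c, k):
    # Probability that at least one of k samples drawn from n (with c correct) passes.
    if n - c < k:
        return 1.0
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

# With a single greedy sample per problem (n = 1, k = 1), pass@1 is simply the
# fraction of problems solved; e.g., 95.1% on HumanEval corresponds to roughly
# 156 of its 164 problems.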
Table: Pass@1 results for different approaches on the CodeContests and APPS datasets.
We also evaluate performance on complex, contest-level code generation tasks. CodeSIM delivers significant improvements over the other baselines here as well. With GPT-4, CodeSIM reaches a strong 29.1% on CodeContests and 22.0% on APPS, a consistent edge over MapCoder's 25.3% average. The gains are even more pronounced with ChatGPT, where CodeSIM achieves 16.4% on CodeContests and 12.0% on APPS, for a 14.2% overall average, outperforming MapCoder's 12.0%. These results highlight CodeSIM's ability to handle the complexity of contest-level problems more effectively, especially through its simulation-driven approach.
Table: Pass@1 results for different approaches using Open-source LLMs.
To further demonstrate CodeSIM's generalization capability, we evaluate its performance with open-source LLMs, including Gemma2-9B, Mixtral8x7B, LLaMa3.1-8B, and LLaMa3.1-70B. As shown in the table above, CodeSIM consistently outperforms all other methods across these models. On LLaMa3.1-70B, CodeSIM achieves 90.2% on HumanEval and 76.2% on EvalPlus, with an average of 80.1%, closely matching GPT-4o's performance. Because open-source LLMs often struggle to produce output in the correct format under MapCoder's complex prompting scheme, we exclude MapCoder from this experiment. Reflexion, on the other hand, shows minimal improvement in accuracy. These results highlight CodeSIM's strong generalization ability across LLM architectures, even on smaller models like Gemma2-9B, which achieves a notable average accuracy of 75.8%.
@misc{islam2025codesim,
      title={CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging},
      author={Md. Ashraful Islam and Mohammed Eunus Ali and Md Rizwan Parvez},
      year={2025},
      eprint={2502.05664},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.05664},
}