1 Bangladesh University of Engineering and Technology (BUET)
2 Qatar Computing Research Institute (QCRI)
* Work done while working as a remote RA at QCRI.
Figure: Overview of CodeSIM. It consists of three agents: planning, coding, and debugging. The Planning Agent first generates an exemplar problem-solution (i.e., via self-retrieval) and devises a plan, which is then verified and refined through simulation. Next, the Coding Agent implements the plan. Finally, the Debugging Agent addresses potential bugs through step-wise simulation across d trials. The entire process iterates p times.
Large Language Models (LLMs) have made significant strides in code generation and problem solving. Current approaches employ external tool-based iterative debuggers that use compiler or other runtime feedback to refine coarse programs generated by various methods. However, the effectiveness of these approaches heavily relies on the quality of the initial code generation, which remains an open challenge. In this paper, we introduce CodeSIM, a novel multi-agent code generation framework that comprehensively addresses the stages of program synthesis (planning, coding, and debugging) through a human-like perception approach. Just as humans verify their understanding of an algorithm through visual simulation, CodeSIM uniquely features plan verification and internal debugging via step-by-step simulation of input/output. Extensive experiments across seven challenging competitive problem-solving and program synthesis benchmarks demonstrate CodeSIM's remarkable code generation capabilities. Our framework achieves new state-of-the-art pass@1 results: HumanEval 95.1%, MBPP 90.7%, APPS 22.0%, and CodeContests 29.1%. Furthermore, our method shows potential for even greater enhancement when cascaded with external debuggers.
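A minimal Python sketch of this loop is shown below. The helper functions call_llm and execute_tests, the prompt wording, and the default values of p and d are illustrative assumptions for exposition, not the authors' exact implementation.

def codesim(problem, sample_io, call_llm, execute_tests, p=5, d=3):
    """Planning -> coding -> simulation-driven debugging, repeated up to p times."""
    code = ""
    for _ in range(p):
        # Planning Agent: self-retrieve an exemplar problem-solution, then draft a plan.
        exemplar = call_llm("Recall a similar problem and its solution:\n" + problem)
        plan = call_llm("Write a step-by-step plan using this exemplar:\n" + exemplar + "\n" + problem)
        # Verify the plan by simulating it on the sample input/output; refine if it looks wrong.
        verdict = call_llm("Simulate the plan on this sample I/O and judge it:\n" + plan + "\n" + sample_io)
        if "incorrect" in verdict.lower():
            plan = call_llm("Refine the plan given this simulation:\n" + plan + "\n" + verdict)
        # Coding Agent: implement the (possibly refined) plan.
        code = call_llm("Implement this plan in Python:\n" + plan + "\n" + problem)
        # Debugging Agent: up to d trials of step-wise simulation on failing sample I/O.
        for _ in range(d):
            passed, feedback = execute_tests(code, sample_io)
            if passed:
                return code
            trace = call_llm("Simulate this code step by step on the failing input:\n" + code + "\n" + feedback)
            code = call_llm("Fix the bug revealed by the simulation:\n" + code + "\n" + trace)
    return code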
Table: Pass@1 results for different approaches on basic programming tasks.
Overall, CodeSIM demonstrates consistently superior performance compared to all other baselines across all datasets and LLMs. Notably, CodeSIM achieves top scores with GPT-4o, reaching 95.1% on HumanEval, 87.2% on EvalPlus, and 90.7% on MBPP, for an impressive 82.7% overall average and new state-of-the-art (SoTA) results. This represents a significant improvement over the next-best method, MapCoder, which scores 79.0% on average with GPT-4o. CodeSIM's effectiveness is consistent across model variants, outperforming other approaches with ChatGPT (75.1% avg) and GPT-4 (81.3% avg) as well. Its robust performance across diverse datasets, including the challenging MBPP-ET where it achieves 61.5% with GPT-4, underscores its versatility in handling various programming tasks. These results strongly indicate that CodeSIM's simulation-driven planning and debugging marks a substantial advancement in code generation and problem-solving capabilities.
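For reference, pass@1 is commonly computed with the unbiased pass@k estimator of Chen et al. (2021); the short Python sketch below illustrates it for n generated samples per problem, of which c pass the hidden tests (the function name pass_at_k and the example counts are illustrative).

import math

def pass_at_k(n, c, k):
    # Probability that at least one of k samples drawn from n (with c correct) passes.
    if n - c < k:
        return 1.0
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

# With a single greedy sample per problem (n = 1, k = 1), pass@1 is simply the
# fraction of problems solved; e.g., 95.1% on HumanEval corresponds to roughly
# 156 of its 164 problems.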
Table: Pass@1 results for different approaches on the CodeContests and APPS datasets.
We also evaluate performance on complex, contest-level code generation tasks. CodeSIM delivers significant improvements over the other baselines here as well. With GPT-4, CodeSIM reaches a strong 29.1% on CodeContests and 22.0% on APPS, a consistent edge over MapCoder's 25.3% average. The gains are even more pronounced with ChatGPT, where CodeSIM achieves 16.4% on CodeContests and 12.0% on APPS, for a 14.2% overall average, outperforming MapCoder's 12.0%. These results highlight CodeSIM's ability to handle the complexity of contest-level problems more effectively, especially through its simulation-driven approach.
Table: Pass@1 results for different approaches using Open-source LLMs.
To further demonstrate CodeSIM's generalization capability, we evaluate its performance with open-source LLMs, including Gemma2-9B, Mixtral8x7B, LLaMa3.1-8B, and LLaMa3.1-70B. As shown in the table above, CodeSIM consistently outperforms all other methods across these models. On LLaMa3.1-70B, CodeSIM achieves 90.2% on HumanEval and 76.2% on EvalPlus, with an average of 80.1%, closely matching GPT-4o's performance. Because open-source LLMs often struggle to produce output in the correct format under MapCoder's complex prompting scheme, we exclude MapCoder from this experiment. Reflexion, on the other hand, shows minimal improvement in accuracy. These results highlight CodeSIM's strong generalization ability across LLM architectures, even on smaller models like Gemma2-9B, which achieves a notable average accuracy of 75.8%.
@misc{islam2025codesim,
      title={CODESIM: Multi-Agent Code Generation and Problem Solving through Simulation-Driven Planning and Debugging},
      author={Md. Ashraful Islam and Mohammed Eunus Ali and Md Rizwan Parvez},
      year={2025},
      eprint={2502.05664},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.05664},
}