Background
The automation of software engineering tasks has long been a goal for researchers and practitioners in the field. Recent progress in large language models (LLMs) like GPT-4o, Claude-3.5, and Doubao Pro has brought us closer to this vision, enabling significant advancements in code generation, program repair, and other software development activities. Amid this trend, LLM-based agents, intelligent entities capable of perceiving the external environment, operating tools, and making autonomous decisions, have garnered increasing attention from both the research and industry communities.
Developers often encounter issues such as:
- Test case failures, which may include errors or exception stacks due to logic errors or failed test assertions;
- Code output not meeting expectations, with no explicit error messages but clear expected results;
- The need to extend existing functionality or add new features, with clear development requirements and expected outcomes, but uncertainty about how and where to implement them;
- Simple defect fixes, with a rough idea of the solution but requiring assistance due to unfamiliar language features.
In this report, we introduce MarsCode Agent, a novel framework designed to automatically solve the above issues using LLMs. By building an agent framework and providing interactive interfaces and tools for code retrieval, debugging, and editing, MarsCode Agent makes it possible for agents to take over some software engineering tasks.
Core contributions of MarsCode Agent include:
- MarsCode Agent has developed a multi-agent collaboration framework that allocates static or dynamic solving pipelines based on the nature of the problem to be addressed, thereby flexibly adapting to various bug fixing challenges.
- MarsCode Agent combines code knowledge graphs and the Language Server Protocol to provide agents with comprehensive capabilities for code entity retrieval, relationship retrieval, and definition-and-reference navigation, enabling agents to browse and analyze code similarly to human developers.
- For code editing, MarsCode Agent uses conflict-based code edit descriptions and static syntax checking to accurately generate well-formatted code patches.
Multi-Agent Collaborative Framework
The problems and tasks in the field of Research & Development are very diverse, which makes it difficult to solve all problems with a fixed approach. For instance, some simple defect fixes or feature extensions can be completed by reviewing existing relevant code, while deeper exception stacks or complex logic errors may require dynamic code execution and variable tracking to uncover and fix the defects. Therefore, we have adopted a multi-agent collaboration framework to adapt to different development scenarios.
In dynamic debugging repair scenarios, the collaboration workflow of the agents is as follows:
1. The Reproducer creates a reproduction script matching the issue description.
2. The reproduction script is provided to the Tester for verification, which then supplies the resulting exception stack and other output information to the Programmer for repair.
3. After the Programmer completes the repair, a testing request is made to the Tester.
4. The Tester verifies the repair using the reproduction script and determines if the issue is resolved:
   - If resolved, the diff tool is used to capture the code changes as the repair solution, ending the dynamic debugging.
   - If unresolved, the exception stack and other output information from the reproduction process are returned to the Programmer.
5. The Programmer may continue modifications based on the Tester’s error messages or reset the repository and start anew until the Tester confirms the issue is resolved.
During this process, we set up a runtime sandbox environment in a Docker container to achieve dynamic debugging, issue reproduction, and validation.
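The workflow above can be sketched as a simple control loop. This is only an illustration under stated assumptions, not MarsCode Agent's implementation: the four callables stand in for the Reproducer, Tester, Programmer, and diff tool, and the names `RunReport`, `dynamic_debug`, and the `max_rounds` budget are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RunReport:
    """Result of running the reproduction script (hypothetical structure)."""
    passed: bool
    output: str  # exception stack and other output information


def dynamic_debug(
    reproduce: Callable[[], str],            # Reproducer: returns a reproduction script
    run_test: Callable[[str], RunReport],    # Tester: runs the script in the sandbox
    repair: Callable[[str], None],           # Programmer: edits the repo given the output
    capture_diff: Callable[[], str],         # diff tool: returns the current code changes
    max_rounds: int = 5,                     # assumed retry budget, not from the report
) -> Optional[str]:
    script = reproduce()
    report = run_test(script)  # initial run supplies the exception stack
    for _ in range(max_rounds):
        repair(report.output)
        report = run_test(script)
        if report.passed:
            # Issue resolved: the captured diff is the repair solution.
            return capture_diff()
    return None  # budget exhausted without a confirmed fix
```

The loop ends only when the Tester confirms the fix or the assumed budget runs out; resetting the repository between attempts is left to the `repair` callable.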
In static repair scenarios, the agent collaboration process is simpler. The Editor attempts to fix the issue directly based on the code snippets retrieved by the Searcher. Given the randomness in LLM code modifications, we draw on an approach similar to Agentless: we generate multiple candidate repair solutions in a single LLM request and normalize the code through its abstract syntax tree (AST). Finally, the model merges and votes on all candidate solutions, selecting the highest-voted one as the final repair solution.
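A minimal sketch of AST normalization followed by voting, using Python's standard `ast` module (Python 3.9+ for `ast.unparse`). It makes a simplifying assumption: the report has the model merge and vote on candidates, whereas this example substitutes a plain majority count over normalized candidates. The function names are hypothetical.

```python
import ast
from collections import Counter
from typing import List, Optional


def normalize(source: str) -> Optional[str]:
    """Return a canonical form of Python source, or None if it does not parse.

    Round-tripping through the AST drops comments and formatting differences,
    so candidates that differ only cosmetically compare equal.
    """
    try:
        return ast.unparse(ast.parse(source))
    except SyntaxError:
        return None


def pick_by_vote(candidates: List[str]) -> Optional[str]:
    """Return the most frequent normalized candidate, ignoring unparsable ones."""
    normalized = [n for n in map(normalize, candidates) if n is not None]
    if not normalized:
        return None
    return Counter(normalized).most_common(1)[0][0]
```

Normalizing before counting matters because two candidates that differ only in whitespace or comments would otherwise split the vote.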
Since we have not yet completed the development of the sandbox environment in the IDE and cannot run programs dynamically and safely, MarsCode IDE currently supports only static repair.
Code Indexing
We have developed several code indexing tools with multilingual support, to satisfy different code search requirements for different software development tasks.
The code knowledge graph (CKG) is the main way for MarsCode Agent to perform code retrieval. A code knowledge graph represents code elements, their attributes, and the relationships between these elements in a graph structure, helping agents better understand and manage large codebases. In this graph, vertices represent code entities (such as functions, variables, and classes), and edges represent relationships between these entities (such as function calls, variable references, and class inheritance). This structured representation provides richer information about the codebase.
MarsCode Agent analyzes and organizes the code and documentation in a repository to generate a multi-directional graph using program analysis techniques. This graph includes semantic nodes, such as variables, functions, classes, and files, and edges representing file structure relationships, function call relationships, and symbol index relationships. This results in a code knowledge graph that integrates code, documentation, and repository information from multiple data sources.
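To make the vertex-and-edge structure concrete, here is a small sketch that builds a toy graph for a single Python file with the standard `ast` module. It is an illustration only: the node kinds, the relation names (`contains`, `calls`, `inherits`), and the function `build_graph` are assumptions, and it covers only top-level definitions, far less than a graph built with full program analysis.

```python
import ast
from collections import defaultdict
from typing import Dict, Set, Tuple

Edges = Dict[str, Set[Tuple[str, str]]]


def build_graph(source: str, filename: str) -> Tuple[Dict[str, str], Edges]:
    """Return (nodes, edges) for one file.

    nodes maps an entity name to its kind; edges maps a source entity to a
    set of (relation, target) pairs.
    """
    tree = ast.parse(source)
    nodes: Dict[str, str] = {filename: "file"}
    edges: Edges = defaultdict(set)
    for top in tree.body:
        if isinstance(top, (ast.FunctionDef, ast.AsyncFunctionDef)):
            nodes[top.name] = "function"
            edges[filename].add(("contains", top.name))
            # Record direct calls to plain names made inside the function body.
            for node in ast.walk(top):
                if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                    edges[top.name].add(("calls", node.func.id))
        elif isinstance(top, ast.ClassDef):
            nodes[top.name] = "class"
            edges[filename].add(("contains", top.name))
            for base in top.bases:
                if isinstance(base, ast.Name):
                    edges[top.name].add(("inherits", base.id))
    return nodes, dict(edges)
```

With a graph like this, a query such as "which functions call `helper`?" becomes a scan over `calls` edges rather than a text search.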
In addition to the code knowledge graph, we also use the Language Server Protocol (LSP) to provide MarsCode Agent with go-to-definition and find-references capabilities. When the Agent calls the language server for code retrieval, the process mirrors a developer using Ctrl+left-click on an identifier in the IDE to jump to the corresponding code. Through this code navigation, MarsCode Agent browses and analyzes code in a way similar to human developers.
In addition to LSP and CKG, we have uniformly encapsulated common capabilities such as file retrieval within a project (find a file) and identifier retrieval within a project or file (grep) under the MarsCode Agent framework, thereby providing the Agent with a code retrieval tool library with a consistent calling style.
Code Editing
In our long-term exploration of AI agents for software development, we tried various methods of using LLMs for code edit descriptions and found that current LLMs have generally weak code modification capabilities. We concluded that LLM code edit descriptions need the following characteristics:
- No strict format validation, with descriptions that can be stably applied after processing and parsing;
- No need to provide line number ranges or perform line number calculations, as LLMs are unstable in this aspect;
- Simple, concise descriptions to minimize token and time costs.
Inspired by Aider’s code change method, we developed a relatively stable code edit tool: MarsCode Agent AutoDiff.
AutoDiff’s code edit description resembles git conflict markers, where the agent provides the file path, original code, and replacement code within conflict markers. AutoDiff parses the edit block, matches the provided original code snippet to the most similar segment in the file, and replaces it with the provided replacement code. It then adjusts the indentation of the replacement code based on the modified file context. Finally, the differences before and after the modification are compared to generate a unified diff format change file.
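The steps described above (parse the edit block, fuzzy-match the original snippet, re-indent, emit a unified diff) can be sketched with Python's standard `difflib`. This is not AutoDiff itself: the exact marker syntax (borrowed here from Aider's SEARCH/REPLACE style), the similarity threshold `min_score`, the indentation heuristic, and all function names are assumptions made for illustration.

```python
import difflib
import re
from typing import List, Tuple

# Assumed edit-block layout: a path line, then conflict-style markers.
EDIT_RE = re.compile(
    r"^(?P<path>[^\n]+)\n<<<<<<< SEARCH\n(?P<old>.*?)\n=======\n(?P<new>.*?)\n>>>>>>> REPLACE",
    re.DOTALL | re.MULTILINE,
)


def parse_edit(block: str) -> Tuple[str, List[str], List[str]]:
    """Extract (path, original lines, replacement lines) from an edit block."""
    m = EDIT_RE.search(block)
    if m is None:
        raise ValueError("no edit block found")
    return m["path"].strip(), m["old"].splitlines(), m["new"].splitlines()


def best_match(file_lines: List[str], old_lines: List[str]) -> Tuple[int, float]:
    """Return (start index, score) of the window most similar to old_lines."""
    target = "\n".join(line.strip() for line in old_lines)
    n = len(old_lines)
    best_start, best_score = 0, -1.0
    for start in range(len(file_lines) - n + 1):
        window = "\n".join(line.strip() for line in file_lines[start:start + n])
        score = difflib.SequenceMatcher(None, window, target).ratio()
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score


def indent_of(line: str) -> str:
    return line[: len(line) - len(line.lstrip())]


def apply_edit(path: str, original: str, old_lines: List[str],
               new_lines: List[str], min_score: float = 0.8) -> Tuple[str, str]:
    """Apply one edit and return (edited text, unified diff)."""
    if not old_lines:
        raise ValueError("original snippet is empty")
    file_lines = original.splitlines()
    start, score = best_match(file_lines, old_lines)
    if score < min_score:
        raise ValueError("original snippet not found in file")
    # Shift the replacement so it sits at the matched code's indentation.
    file_indent, snippet_indent = indent_of(file_lines[start]), indent_of(old_lines[0])
    shifted = []
    for line in new_lines:
        if not line.strip():
            shifted.append("")
        elif line.startswith(snippet_indent):
            shifted.append(file_indent + line[len(snippet_indent):])
        else:
            shifted.append(file_indent + line.lstrip())
    edited = file_lines[:start] + shifted + file_lines[start + len(old_lines):]
    diff = difflib.unified_diff(file_lines, edited, fromfile=f"a/{path}",
                                tofile=f"b/{path}", lineterm="")
    return "\n".join(edited) + "\n", "\n".join(diff)
```

Matching on stripped lines means the agent does not have to reproduce the file's indentation or supply line numbers, which is the property the requirements above ask for.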
Although AutoDiff can correctly complete most code editing requests, LLM-generated edits still commonly contain problems such as type errors, undefined variables, indentation errors, and incorrect bracket closures. To address these problems, we use the Language Server Protocol to perform static code diagnosis on files before and after AutoDiff modification. If the Agent's modification introduces a new error, the relevant diagnostic information is returned to the Agent for modification and adjustment.
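The before-and-after comparison can be illustrated with a much weaker stand-in for a language server: Python's built-in parser. The sketch below catches only syntax and indentation errors, not type errors or undefined variables, and the function names are hypothetical.

```python
import ast
from typing import List


def diagnose(source: str) -> List[str]:
    """Return syntax diagnostics for Python source (stand-in for LSP diagnostics)."""
    try:
        ast.parse(source)
    except SyntaxError as exc:  # IndentationError is a subclass of SyntaxError
        return [f"line {exc.lineno}: {exc.msg}"]
    return []


def new_diagnostics(before: str, after: str) -> List[str]:
    """Diagnostics present after the edit that were not present before it."""
    known = set(diagnose(before))
    return [d for d in diagnose(after) if d not in known]
```

Comparing against the pre-edit diagnostics means the agent is only asked to fix problems its own edit introduced, not ones that already existed in the file.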
Experiments
We evaluated the performance of MarsCode Agent on the SWE-bench Lite dataset.
SWE-bench is a highly challenging benchmark for LLMs to solve program logic and functional bugs. This dataset consists of 2294 issues from 12 industrial-grade Python code repositories on GitHub. Given a codebase and a description of the issue to be resolved, the agent needs to retrieve and edit the code from the repository, ultimately submitting a code patch that resolves the issue. Solving problems in SWE-bench typically requires understanding and coordinating changes across multiple functions, classes, or even files, necessitating interaction with the execution environment, handling extremely long contexts, and performing more complex reasoning than traditional code generation. Evaluations show that directly applying Claude2 and GPT-4 can only solve 4.8% and 1.7% of the instances, respectively.
Because SWE-bench is so difficult, subsequent research found that evaluating all 2294 instances is a time- and token-intensive process that is frustrating and does not validate short-term progress. Therefore, the authors of SWE-bench extracted 300 instances with complete issue descriptions, clear solving logic, and relative ease of resolution to form the SWE-bench Lite dataset. Currently, the SWE-bench Lite dataset has become the benchmark for evaluating the capability of agents to solve software engineering problems, with over 20 companies and research organizations participating in the evaluation and submissions.
In the latest SWE-bench Lite evaluation, MarsCode Agent successfully solved 102 instances, achieving a solve rate of 34%.
Items | Value |
---|---|
Resolved Issues | 102 |
Resolved Rate | 102 / 300 = 34% |
Precision of File Localization | 265 / 300 = 88.3% |
Precision Rate of Code Snippet Localization | 206 / 300 = 68.7% |
Percentage of Issues Progressed to Dynamic Debugging | 84 / 300 = 28.0% |
Percentage of Issues Progressed to Static Repair | 216 / 300 = 72.0% |
Success Rate of Dynamic Debugging | 32 / 84 = 38.1% |
Success Rate of Static Repair | 70 / 216 = 32.4% |
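
As a quick consistency check, the rates in the table can be recomputed from the raw counts. The counts below are taken from the table (with 216 static-repair issues, the denominator of the last row); the helper `rate` is only illustrative.

```python
TOTAL = 300                  # SWE-bench Lite instances
DYNAMIC, STATIC = 84, 216    # issues routed to each pipeline
DYNAMIC_OK, STATIC_OK = 32, 70  # issues resolved by each pipeline


def rate(numerator: int, denominator: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


summary = {
    "resolved": rate(DYNAMIC_OK + STATIC_OK, TOTAL),
    "routed_dynamic": rate(DYNAMIC, TOTAL),
    "routed_static": rate(STATIC, TOTAL),
    "dynamic_success": rate(DYNAMIC_OK, DYNAMIC),
    "static_success": rate(STATIC_OK, STATIC),
}
```

The two pipelines partition the 300 instances, and their resolved counts sum to the 102 resolved issues reported overall.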
Future Plans
The MarsCode team is committed to the implementation and application of AI Agent methods in the field of software engineering. In the future, we will continue to focus on the following optimization directions:
- Reduce the cost of calling large language models, promote the implementation of MarsCode Agent in more scenarios, and provide high-quality intelligent software development services to more users;
- Strengthen the collaboration and interaction between users and Agents, and enhance users' control over the Agent operation process;
- Support Agents in dynamically debugging user workspaces safely, to avoid problems such as polluting the user's environment;
- Further improve the accuracy of file-level fault localization and of code modification.