Articles in this series:
- Part 1 - Practical Applications of AI in Test Automation — Context, Demo with UI-Tars LLM & Midscene
- Part 2 - Data: UI-Tars VS GPT-4o in Midscene
- Part 3 - Stage Conclusion: UI-Tars + RAG = E2E/Exploratory AI Test Agent
This series of articles will provide a clear and practical guide to AI applications in end-to-end test automation. I will use AI to verify a product's end-to-end functionality, ensuring that it meets the required specifications.
Let's watch a demo first 👀👀, and then I will explain how it works.
(The video is not accelerated. I use Vinted.com as an example because I hear about it very frequently from my wife...)
I tell the AI Agent that it must open the home page, search for a product, then open a product detail page, and check the price on the Product page.
The video above demonstrates how the AI Agent works through the process: autonomously interpreting business-oriented test cases, evaluating the webpage's current state (a screenshot), making plans and decisions, and executing the test. It engages in multi-step decision-making, leveraging various types of reasoning to achieve its goal.
1. Reviewing the Role of End-to-End Testing
It is crucial to emphasize once again that the primary goal of end-to-end testing is to validate that new features and regression functionality match the product requirements and design by simulating customers' behavior.
End-to-end testing is a testing approach widely used in regression validation. It can be performed either manually, for example by writing a sanity checklist and executing it by hand, or through automation by writing test scripts with tools like Playwright or Appium.
The three key aspects of end-to-end testing are described in the above figure:
- Understand How Users Benefit from the Feature – Identify the value the feature brings to users.
- Design Test Cases from the User's Perspective – Create test scenarios that align with real user interactions.
- Iterate Test Execution During Development – Continuously run test cases throughout the development process to verify that the implemented code meets the required functionality.
2. In Action – Executing Your Test Cases with an AI Agent
In traditional end-to-end test automation, the typical approach is as follows:
- Analyze the functionality – Understand the feature and its expected behavior.
- Analyze and write test cases – Define test scenarios based on user interactions and requirements.
- Write automation scripts – Implement test cases using automation frameworks.
When writing automated test cases, we usually create Page Object-like classes to represent the HTML tree, allowing the test script to interact with or retrieve elements efficiently.
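As an example, a Page Object for the search flow above might look like the sketch below; the class name and selectors are invented for illustration and do not reflect Vinted's real markup.

```typescript
// searchPage.ts - an illustrative Page Object (names and selectors are hypothetical)
import { type Locator, type Page } from "@playwright/test";

export class SearchPage {
  private readonly searchInput: Locator;
  private readonly productCards: Locator;

  constructor(private readonly page: Page) {
    // Hand-maintained selectors: they break whenever the markup changes.
    this.searchInput = this.page.getByRole("searchbox");
    this.productCards = this.page.locator('[data-testid="product-card"]');
  }

  async searchFor(term: string): Promise<void> {
    await this.searchInput.fill(term);
    await this.searchInput.press("Enter");
  }

  async openProduct(index: number): Promise<void> {
    await this.productCards.nth(index).click();
  }
}
```

Keeping classes like this in sync with the UI is exactly the maintenance cost that the AI Agent approach tries to remove.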
Now, let's see how an AI Agent can optimize this process.
2.1 - A multi-step decision-making AI Agent can optimize the test process
From the video above, this is the test case that the AI Agent reads:
Scenario: Customer can search and open product detail page from a search result
Go to https://www.vinted.com
The country selection popup is visible
Select France as the country that I'm living
Accept all privacy preferences
Search 'Chanel', and press the Enter
Scroll down to the 1st product
Click the 2nd product from the 1st row in the product list
Assert Price is visible
Thus, the AI Agent actually reads business-oriented language rather than low-level instructions like "Click A" or "Type B". The AI Agent itself reasons about the test case and plans the steps.
Running this test requires the following hardware:
- Nvidia L40s - 1 x GPU, 48GB GPU Memory, 7 x vCPU, 40GB CPU memory
2.2 - What problems were solved by this AI Agent
Reflecting on what we mentioned in Section 1, end-to-end testing can be performed in two ways: by writing automated test scripts or executing tests manually.
With the introduction of an AI Agent, a new approach emerges—simply providing your test cases to the AI Agent without writing test scripts. The AI Agent then replaces manual execution by autonomously carrying out the test cases.
Specifically, it addresses the following problems:
- Reduces Manual Testing Costs – The AI Agent can interpret test cases written by anyone, eliminating the need to write test scripts and allowing tests to be executed at any time.
- Lowers Test Script Maintenance Effort – The AI Agent autonomously determines the next browser action, reducing the need to modify tests for minor UI changes.
- Increases Accessibility and Participation – Shifting from traditional QA engineers writing automation scripts to a decentralized model where developers contribute, and now to a stage where anyone proficient in English and familiar with the product can write end-to-end test cases.
2.3 - Embedded into Playwright
test("[UI-Tars - Business]a user can search then view a product", async ({ page, ai, aiAssert, aiWaitFor }) => {
await page.goto("https://www.vinted.com")
await aiWaitFor('The country selection popup is visible')
await ai("Select France as the country that I'm living")
await page.waitForURL((url: URL) => url.hostname.indexOf('vinted.fr') >= 0)
await ai("Accept all privacy preferences")
await ai("Click Search bar, then Search 'Chanel', and press the Enter")
await ai("Scroll down to the 1st product")
await ai("Click the 2nd product from the 1st row in the product list")
expect(page.url()).toContain("/items/")
await aiAssert("Price is visible")
})
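The `ai`, `aiAssert`, and `aiWaitFor` fixtures come from Midscene's Playwright integration. A minimal fixture file, following Midscene's documented setup (export names may differ slightly between versions), looks roughly like this:

```typescript
// fixture.ts - extends Playwright's test object with Midscene's AI helpers
import { test as base } from "@playwright/test";
import type { PlayWrightAiFixtureType } from "@midscene/web/playwright";
import { PlaywrightAiFixture } from "@midscene/web/playwright";

// Which model the agent talks to (UI-Tars behind an OpenAI-compatible endpoint,
// or GPT-4o directly) is configured through environment variables; see the Midscene docs.
export const test = base.extend<PlayWrightAiFixtureType>(PlaywrightAiFixture());
```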
3. Introducing the UI-Tars LLM and Midscene
3.1 UI-Tars
UI-Tars is a native, open-source multimodal GUI LLM, rebuilt on top of qwen-2.5-VL (通义千问2.5 VL). The model processes text and GUI screenshots simultaneously, was trained on a huge amount of GUI screenshots, and is released in two variants, SFT and DPO. UI-Tars is specifically designed for interacting with GUIs.
It performs well in:
- Browser applications
- Desktops and desktop applications
- Mobile devices and mobile applications
It supports prompts in 2 languages:
- Chinese
- English
For more details, please read the paper.
3.2 Midscene
Midscene is a state machine: it builds a multi-step-reasoning AI Agent on top of the models it is given.
It supports the following LLMs:
- UI-Tars (the main branch doesn't support aiAssert, aiQuery, and aiWaitFor, but you can check my branch)
- Qwen-2.5 VL (通义千问2.5, I really love this name...)
- GPT-4o
4. The Mechanism Between UI-Tars & Midscene
4.1 Orchestrations and Comparisons
The core capability of the UI-Tars + Midscene agent is to plan, reason, and execute multiple steps autonomously, just like a human, based on both visual input and instructions, continuing until it determines that the task is complete.
It possesses three key abilities:
- Multi-Step Planning Across Platforms – Given an instruction, it can plan multiple actions across web browsers, desktop, or mobile applications.
- Tool Utilization for Execution – It can leverage external tools to carry out the planned actions.
- Autonomous Reasoning & Adaptation – It can determine whether the task is complete or take additional actions if necessary.
I compared the most popular solutions on the market as of the end of February 2025:
Solution | Is it an AI Agent? | Cost | Additional Input to LLM | How Elements Are Located | Multi-Step Decisions & Autonomous Reasoning | Playwright Integration | Mobile App Support | Desktop App Support
---|---|---|---|---|---|---|---|---
UI-Tars (/GPT-4o) + Midscene | Yes | $1.80/h with UI-Tars:7B, or ~$0.1264/test with OpenAI GPT-4o | GUI screenshot | GUI screenshot processing | Yes | Yes | Yes | Yes
Llama 3.2 + bound tools + LangGraph | Yes | $0.20/test | HTML | HTML DOM processing | No | Yes | Not yet | No
ZeroStep / auto-playwright | Kind of | Unknown | HTML | HTML DOM processing | No | Yes | Unknown | No
StageHand (GPT-4o or Claude 3.5) | Yes | Unknown | HTML & GUI screenshot | HTML DOM processing | Not yet | Yes | Not yet | No
To summarise, UI-Tars (or GPT-4o) combined with Midscene seems to be the most applicable and cheapest approach.
4.2 Multi-Step Decisions and Reasoning
Let's have a look at an actual step from the example above: "Search 'Chanel', and press the Enter".
4.2.1 Midscene sends a system message to UI-Tars
Midscene sends the test step as part of the System Message to the LLM, together with the current screenshot.
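Conceptually, this is an ordinary multimodal chat completion request: the test step goes into the text part and the current screenshot is attached as an image. Below is a simplified sketch, not Midscene's internal code; the endpoint, model name, environment variables, and prompt wording are all assumptions for illustration.

```typescript
import OpenAI from "openai";

// Assumes an OpenAI-compatible endpoint serving UI-Tars (e.g. a local vLLM server);
// the base URL, model name, and prompt text here are illustrative, not Midscene's.
const client = new OpenAI({
  baseURL: process.env.UI_TARS_BASE_URL, // hypothetical env var
  apiKey: process.env.UI_TARS_API_KEY ?? "EMPTY",
});

async function askForNextAction(userStep: string, screenshotPng: Buffer): Promise<string> {
  const response = await client.chat.completions.create({
    model: "ui-tars-7b",
    messages: [
      {
        role: "system",
        content: `You are a GUI agent. Carry out the user step: ${userStep}`,
      },
      {
        role: "user",
        content: [
          { type: "text", text: "Here is the current state of the browser." },
          {
            type: "image_url",
            image_url: { url: `data:image/png;base64,${screenshotPng.toString("base64")}` },
          },
        ],
      },
    ],
  });
  return response.choices[0].message.content ?? ""; // "Thought: ... Action: ..." text
}
```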
4.2.2 UI-Tars returns the Thought and the Action
Because this "User step" requires multiple browser actions, like identifying where is the search bar, then click the search bar, then type "channel", and pressing "Enter" at the end. Thus UI-Tars make 1st decision to "click the search bar".
4.2.3 UI-Tars iteratively reasons and plans the next browser actions for the same user step
Midscene takes a fresh screenshot before each reasoning round, so UI-Tars always knows the latest state of the browser. Besides that, Midscene currently sends the whole chat history back to UI-Tars every time it reasons.
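Putting the pieces together, one user step drives a loop of screenshot, reasoning, and browser action. The sketch below reuses the hypothetical `askForNextAction` and `parseUiTarsReply` helpers from above and assumes an equally hypothetical `executeInBrowser` that maps an action string onto Playwright calls; Midscene's real state machine additionally manages the chat history, retries, and reporting.

```typescript
import { type Page } from "@playwright/test";

// Hypothetical helpers from the earlier sketches, declared here so the file stands alone.
declare function askForNextAction(userStep: string, screenshotPng: Buffer): Promise<string>;
declare function parseUiTarsReply(reply: string): { thought: string; action: string };
declare function executeInBrowser(page: Page, action: string): Promise<void>;

async function runUserStep(page: Page, userStep: string): Promise<void> {
  // Hard cap so a confused model cannot loop forever.
  for (let round = 0; round < 20; round++) {
    const screenshot = await page.screenshot(); // fresh screenshot before every reasoning round
    const reply = await askForNextAction(userStep, screenshot);
    const { action } = parseUiTarsReply(reply);

    if (action.startsWith("finished")) {
      return; // the model has judged the user step to be achieved
    }
    await executeInBrowser(page, action); // e.g. translate a click/type/scroll into Playwright calls
  }
  throw new Error(`User step not finished within the step budget: ${userStep}`);
}
```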
4.2.4 UI-Tars autonomously checks whether the user step is achieved
5. Code
6. The cost for this PoC
7. References
@software{Midscene.js,
author = {Zhou, Xiao and Yu, Tao},
title = {Midscene.js: Let AI be your browser operator.},
year = {2025},
publisher = {GitHub},
url = {https://github.com/web-infra-dev/midscene}
}
@article{qin2025ui,
title={UI-TARS: Pioneering Automated GUI Interaction with Native Agents},
author={Qin, Yujia and Ye, Yining and Fang, Junjie and Wang, Haoming and Liang, Shihao and Tian, Shizuo and Zhang, Junda and Li, Jiahao and Li, Yunxin and Huang, Shijue and others},
journal={arXiv preprint arXiv:2501.12326},
year={2025}
}