Articles in this series:
- Part 1 - Practical Applications of AI in Test Automation — Context, Demo with UI-Tars LLM & Midscene
- Part 2 - Data: UI-Tars VS GPT-4o in Midscene
- Part 3 - Stage Conclusion: UI-Tars + RAG = E2E/Exploratory AI Test Agent
This series of articles will provide a clear and practical guide to AI applications in end-to-end test automation. I will use AI to verify a product's end-to-end functionality, ensuring that it meets the required specifications.
Let's watch a demo first 👀👀, and then I will explain how it works.
(The video is not accelerated. I use Vinted.com as an example because I hear about it very frequently from my wife...)
I tell the AI Agent that it must open the home page, search for a product, then open a product detail page, and check the price on the Product page.
The video above demonstrates how the AI Agent works through the process: autonomously interpreting business-oriented test cases, evaluating the webpage's current state (a screenshot), making plans and decisions, and executing the test. It engages in multi-step decision-making, leveraging various types of reasoning to achieve its goal.
1. Reviewing the Role of End-to-End Testing
It is crucial to emphasize once again that the primary goal of end-to-end testing is to validate that new features and regression functionality match the product requirements and design by simulating customers' behavior.
End-to-end testing is a testing approach widely used in regression validation. It can be performed either manually, for example by writing a sanity checklist and executing it by hand, or through automation by writing test scripts with tools like Playwright or Appium.
The three key aspects of end-to-end testing are described in the above figure:
- Understand How Users Benefit from the Feature – Identify the value the feature brings to users.
- Design Test Cases from the User's Perspective – Create test scenarios that align with real user interactions.
- Iterate Test Execution During Development – Continuously run test cases throughout the development process to verify that the implemented code meets the required functionality.
2. In Action – Executing Your Test Cases with an AI Agent
In traditional end-to-end test automation, the typical approach is as follows:
- Analyze the functionality – Understand the feature and its expected behavior.
- Analyze and write test cases – Define test scenarios based on user interactions and requirements.
- Write automation scripts – Implement test cases using automation frameworks.
When writing automated test cases, we usually create Page Object-like classes to represent the HTML tree, allowing the test script to interact with or retrieve elements efficiently.
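As an example, a Page Object for the search flow above might look like the sketch below; the class name and selectors are invented for illustration and do not reflect Vinted's real markup.

```typescript
// searchPage.ts - an illustrative Page Object (names and selectors are hypothetical)
import { type Locator, type Page } from "@playwright/test";

export class SearchPage {
  private readonly searchInput: Locator;
  private readonly productCards: Locator;

  constructor(private readonly page: Page) {
    // Hand-maintained selectors: they break whenever the markup changes.
    this.searchInput = this.page.getByRole("searchbox");
    this.productCards = this.page.locator('[data-testid="product-card"]');
  }

  async searchFor(term: string): Promise<void> {
    await this.searchInput.fill(term);
    await this.searchInput.press("Enter");
  }

  async openProduct(index: number): Promise<void> {
    await this.productCards.nth(index).click();
  }
}
```

Keeping classes like this in sync with the UI is exactly the maintenance cost that the AI Agent approach tries to remove.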
Now, let's see how an AI Agent can optimize this process.
2.1 - A multi-step decision-making AI Agent can optimize the test process
From the video above, this is the test case that the AI Agent reads:
Scenario: Customer can search and open product detail page from a search result
Go to https://www.vinted.com
The country selection popup is visible
Select France as the country that I'm living
Accept all privacy preferences
Search 'Chanel', and press the Enter
Scroll down to the 1st product
Click the 2nd product from the 1st row in the product list
Assert Price is visible
Thus, the AI Agent actually reads business-oriented language rather than low-level instructions like "Click A" or "Type B". The AI Agent itself reasons about the test case and plans the steps.
Running this test requires the following hardware:
- Nvidia L40s - 1 x GPU, 48GB GPU Memory, 7 x vCPU, 40GB CPU memory
2.2 - What problems were solved by this AI Agent
Reflecting on what we mentioned in Section 1, end-to-end testing can be performed in two ways: by writing automated test scripts or executing tests manually.
With the introduction of an AI Agent, a new approach emerges—simply providing your test cases to the AI Agent without writing test scripts. The AI Agent then replaces manual execution by autonomously carrying out the test cases.
Specifically, it addresses the following problems:
- Reduces Manual Testing Costs – The AI Agent can interpret test cases written by anyone, eliminating the need to write test scripts and allowing tests to be executed at any time.
- Lowers Test Script Maintenance Effort – The AI Agent autonomously determines the next browser action, reducing the need to modify tests for minor UI changes.
- Increases Accessibility and Participation – Shifting from traditional QA engineers writing automation scripts to a decentralized model where developers contribute, and now to a stage where anyone proficient in English and familiar with the product can write end-to-end test cases.
2.3 - Embedded into Playwright
test("[UI-Tars - Business]a user can search then view a product", async ({ page, ai, aiAssert, aiWaitFor }) => {
await page.goto("https://www.vinted.com")
await aiWaitFor('The country selection popup is visible')
await ai("Select France as the country that I'm living")
await page.waitForURL((url: URL) => url.hostname.indexOf('vinted.fr') >= 0)
await ai("Accept all privacy preferences")
await ai("Click Search bar, then Search 'Chanel', and press the Enter")
await ai("Scroll down to the 1st product")
await ai("Click the 2nd product from the 1st row in the product list")
expect(page.url()).toContain("/items/")
await aiAssert("Price is visible")
})
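The `ai`, `aiAssert`, and `aiWaitFor` fixtures come from Midscene's Playwright integration. A minimal fixture file, following Midscene's documented setup (export names may differ slightly between versions), looks roughly like this:

```typescript
// fixture.ts - extends Playwright's test object with Midscene's AI helpers
import { test as base } from "@playwright/test";
import type { PlayWrightAiFixtureType } from "@midscene/web/playwright";
import { PlaywrightAiFixture } from "@midscene/web/playwright";

// Which model the agent talks to (UI-Tars behind an OpenAI-compatible endpoint,
// or GPT-4o directly) is configured through environment variables; see the Midscene docs.
export const test = base.extend<PlayWrightAiFixtureType>(PlaywrightAiFixture());
```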
3. Introducing the UI-Tars LLM and Midscene
3.1 UI-Tars
UI-Tars is a native, open-source multimodal GUI LLM, rebuilt on top of qwen-2.5-VL (通义千问2.5 VL). The model processes text and GUI screenshots simultaneously, was trained on a huge amount of GUI screenshots, and is released in two variants, SFT and DPO. UI-Tars is specifically designed for interacting with GUIs.
It performs well in:
- Browser applications
- Desktops and desktop applications
- Mobile devices and mobile applications
It supports prompts in 2 languages:
- Chinese
- English
For more details, please read the paper.
3.2 Midscene
Midscene is a state machine: it builds a multi-step-reasoning AI Agent on top of the models it is given.
It supports the following LLMs:
- UI-Tars (the main branch doesn't support aiAssert, aiQuery, and aiWaitFor, but you can check my branch)
- Qwen-2.5 VL (通义千问2.5, I really love this name...)
- GPT-4o
4. The Mechanism Between UI-Tars & Midscene
4.1 Orchestrations and Comparisons
The core capability of the UI-Tars + Midscene agent is to plan, reason, and execute multiple steps autonomously, just like a human, based on both visual input and instructions, continuing until it determines that the task is complete.
It possesses three key abilities:
- Multi-Step Planning Across Platforms – Given an instruction, it can plan multiple actions across web browsers, desktop, or mobile applications.
- Tool Utilization for Execution – It can leverage external tools to carry out the planned actions.
- Autonomous Reasoning & Adaptation – It can determine whether the task is complete or take additional actions if necessary.
I compared the most popular solutions on the market as of the end of February 2025:
Solution | Is it an AI Agent? | Cost | Additional Input to LLM | How Elements Are Located | Multi-Step Decisions & Autonomous Reasoning | Playwright Integration | Mobile App Support | Desktop App Support
---|---|---|---|---|---|---|---|---
UI-Tars (/GPT-4o) + Midscene | Yes | $1.80/h with UI-Tars:7B, or ~$0.1264/test with OpenAI GPT-4o | GUI screenshot | GUI screenshot processing | Yes | Yes | Yes | Yes
Llama 3.2 + bound tools + LangGraph | Yes | $0.20/test | HTML | HTML DOM processing | No | Yes | Not yet | No
ZeroStep / auto-playwright | Kind of | Unknown | HTML | HTML DOM processing | No | Yes | Unknown | No
StageHand (GPT-4o or Claude 3.5) | Yes | Unknown | HTML & GUI screenshot | HTML DOM processing | Not yet | Yes | Not yet | No
To summarise, UI-Tars (or GPT-4o) combined with Midscene seems to be the most applicable and cheapest approach.
4.2 Multi-Step Decisions and Reasoning
Let's have a look at an actual step from the example above: "Search 'Chanel', and press the Enter".
4.2.1 Midscene sends a system message to UI-Tars
Midscene sends the test step as part of the System Message to the LLM, together with the current screenshot.
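Conceptually, this is an ordinary multimodal chat completion request: the test step goes into the text part and the current screenshot is attached as an image. Below is a simplified sketch, not Midscene's internal code; the endpoint, model name, environment variables, and prompt wording are all assumptions for illustration.

```typescript
import OpenAI from "openai";

// Assumes an OpenAI-compatible endpoint serving UI-Tars (e.g. a local vLLM server);
// the base URL, model name, and prompt text here are illustrative, not Midscene's.
const client = new OpenAI({
  baseURL: process.env.UI_TARS_BASE_URL, // hypothetical env var
  apiKey: process.env.UI_TARS_API_KEY ?? "EMPTY",
});

async function askForNextAction(userStep: string, screenshotPng: Buffer): Promise<string> {
  const response = await client.chat.completions.create({
    model: "ui-tars-7b",
    messages: [
      {
        role: "system",
        content: `You are a GUI agent. Carry out the user step: ${userStep}`,
      },
      {
        role: "user",
        content: [
          { type: "text", text: "Here is the current state of the browser." },
          {
            type: "image_url",
            image_url: { url: `data:image/png;base64,${screenshotPng.toString("base64")}` },
          },
        ],
      },
    ],
  });
  return response.choices[0].message.content ?? ""; // "Thought: ... Action: ..." text
}
```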
4.2.2 UI-Tars returns the Thought and the Action
Because this "User step" requires multiple browser actions, like identifying where is the search bar, then click the search bar, then type "channel", and pressing "Enter" at the end. Thus UI-Tars make 1st decision to "click the search bar".
4.2.3 UI-Tars iteratively reasons and plans the next browser actions for the same user step
Midscene takes a fresh screenshot before each reasoning round, so UI-Tars always knows the latest state of the browser. Besides that, Midscene currently sends the whole chat history back to UI-Tars every time it reasons.
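Putting the pieces together, one user step drives a loop of screenshot, reasoning, and browser action. The sketch below reuses the hypothetical `askForNextAction` and `parseUiTarsReply` helpers from above and assumes an equally hypothetical `executeInBrowser` that maps an action string onto Playwright calls; Midscene's real state machine additionally manages the chat history, retries, and reporting.

```typescript
import { type Page } from "@playwright/test";

// Hypothetical helpers from the earlier sketches, declared here so the file stands alone.
declare function askForNextAction(userStep: string, screenshotPng: Buffer): Promise<string>;
declare function parseUiTarsReply(reply: string): { thought: string; action: string };
declare function executeInBrowser(page: Page, action: string): Promise<void>;

async function runUserStep(page: Page, userStep: string): Promise<void> {
  // Hard cap so a confused model cannot loop forever.
  for (let round = 0; round < 20; round++) {
    const screenshot = await page.screenshot(); // fresh screenshot before every reasoning round
    const reply = await askForNextAction(userStep, screenshot);
    const { action } = parseUiTarsReply(reply);

    if (action.startsWith("finished")) {
      return; // the model has judged the user step to be achieved
    }
    await executeInBrowser(page, action); // e.g. translate a click/type/scroll into Playwright calls
  }
  throw new Error(`User step not finished within the step budget: ${userStep}`);
}
```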
4.2.4 UI-Tars autonomously checks whether the user step is achieved
5. Code
6. The cost for this PoC
7. References
@software{Midscene.js,
author = {Zhou, Xiao and Yu, Tao},
title = {Midscene.js: Let AI be your browser operator.},
year = {2025},
publisher = {GitHub},
url = {https://github.com/web-infra-dev/midscene}
}
@article{qin2025ui,
title={UI-TARS: Pioneering Automated GUI Interaction with Native Agents},
author={Qin, Yujia and Ye, Yining and Fang, Junjie and Wang, Haoming and Liang, Shihao and Tian, Shizuo and Zhang, Junda and Li, Jiahao and Li, Yunxin and Huang, Shijue and others},
journal={arXiv preprint arXiv:2501.12326},
year={2025}
}