Articles in this series:
- Part 1 - Practical Applications of AI in Test Automation — Context, Demo with UI-Tars LLM & Midscene
- Part 2 - Data: UI-Tars VS GPT-4o in Midscene
- Part 3 - Stage Conclusion: UI-Tars + RAG = E2E/Exploratory AI Test Agent
This is the final article in this series.
I will present an example demonstrating:
- How to integrate UI-Tars' system-2 reasoning with our locally built RAG (using Ollama and LangChain) to create a system that understands high-level user instructions and automates execution based on browser screenshots at each stage.
- Verify the capability of system-2 reasoning after combining UI-Tars with a local RAG.
1. End-to-End Demo
This demo uses Miro as an example to demonstrate the agent's ability to handle a non-B2C, complicated system. The AI Agent follows a single user instruction: "Create a new board with 2 sticky notes and link these 2 sticky notes by a line."
⚠️VERY IMPORTANT⚠️: The demo uses Miro's Free Plan, which is open to everyone. The test was executed fewer than 10 times to verify stability. The authentication part uses my personal Miro Free account (hardcoded already to avoid any other risks). I strongly ask any reader who wants to reproduce this test against any customer-facing product: do NOT impact the normal usage of the product, and you MUST follow the respective product's policies.
(👀 Except for authentication, all actions are planned and executed by the AI Agent from a single high-level user instruction.)
2. The Explanation
Before the test, we need to deploy UI-Tars-7B-SFT:
- Step 1: Deploy UI-Tars-7B-SFT to Hugging Face (1 x L40S GPU); you can follow the steps from here.
- Step 2: Configure the .env file for Midscene:
OPENAI_API_KEY="hf_" // this is can be the HF access key
OPENAI_BASE_URL="https://something/v1" // this can be HF Endpoint URL
MIDSCENE_MODEL_NAME="ui-tars-7b-sft"
MIDSCENE_USE_VLM_UI_TARS=1 // must tell Midscene to switch to UI-Tars
MIDSCENE_LANGSMITH_DEBUG=1 // enable trace send to Langsmith, help us debug and test this AI Agent
LANGSMITH_API_KEY=
LANGSMITH_ENDPOINT="https://api.smith.langchain.com"
LANGSMITH_TRACING_V2=true
LANGSMITH_PROJECT=
DEBUG=true //This will enable the OpenAI SDK to print Debug logs to the console
- Step 3: Build your own Ollama + LangChain environment locally, pull nomic-embed-text as the embeddings model, and build a local RAG that contains "very structured" documents (a minimal sketch follows below).
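The article doesn't show the RAG wiring itself, so here is a minimal LangChain.js sketch of Step 3, assuming the @langchain/ollama package and an Ollama instance on its default port; the document contents and the query are illustrative stand-ins for the "very structured" product docs:

```typescript
import { OllamaEmbeddings } from "@langchain/ollama";
import { Document } from "@langchain/core/documents";
import { MemoryVectorStore } from "langchain/vectorstores/memory";

// Embeddings served by the local Ollama instance (default: http://localhost:11434),
// using the nomic-embed-text model pulled in Step 3.
const embeddings = new OllamaEmbeddings({ model: "nomic-embed-text" });

// Illustrative "very structured" product documents; the real RAG would describe
// the product's UI flows step by step.
const docs = [
  new Document({
    pageContent: "Sticky note: click the sticky-note tool in the left toolbar, then click the canvas to place it.",
    metadata: { topic: "sticky-note" },
  }),
  new Document({
    pageContent: "Connector: hover over a sticky note's edge and drag the blue arrow onto the target note.",
    metadata: { topic: "connector" },
  }),
];

const store = await MemoryVectorStore.fromDocuments(docs, embeddings);

// Retrieve the snippets most relevant to the current step; these can be
// prepended to the instruction before it is passed to ai().
const context = await store.similaritySearch("how to link two sticky notes", 2);
console.log(context.map((d) => d.pageContent).join("\n"));
```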
The test consists of 3 steps:
- Step 1: Authenticate and save the authentication state to a local file. This is implemented in Playwright code by following the guides from Playwright Authentication (a combined sketch of all 3 steps follows this list).
- Step 2: Pass only one high-level user instruction to ai(), and let ai() plan and execute it. This function is exposed by Midscene, a tool created by ByteDance.
// Pass only ONE user instruction to UI-Tars; it reasons from this single instruction alone.
// The Chinese prefix means: "this is a new instruction, completely ignore previous memory,
// follow the instruction and rules strictly, no hallucination".
await ai(`**ID: ${ uuidv4() }, 这是一个新指令,完全忽略之前的记忆,严格按照指令和规则执行,禁止幻想**
Given A free plan user creates a new board without upgrade,
And the user creates 2 sticky notes on 2 different sides of the grid,
Then the user adds a line from the 1st sticky note to the 2nd sticky note`)
- Step 3: Use AI to assert the final state via aiAssert().
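Putting the three steps together, here is a minimal sketch, assuming Midscene's Playwright fixture (import paths follow Midscene's Playwright integration; the URL, auth-file path, and test wording are illustrative):

```typescript
import { test as base } from "@playwright/test";
import type { PlayWrightAiFixtureType } from "@midscene/web/playwright";
import { PlaywrightAiFixture } from "@midscene/web/playwright";
import { v4 as uuidv4 } from "uuid";

// Extend Playwright's test with Midscene's ai()/aiAssert() fixtures.
const test = base.extend<PlayWrightAiFixtureType>(PlaywrightAiFixture());

// Step 1: reuse the authentication state saved earlier with
// page.context().storageState({ path: "miro-auth.json" }) — the path is illustrative.
test.use({ storageState: "miro-auth.json" });

test("create a board and link 2 sticky notes", async ({ page, ai, aiAssert }) => {
  await page.goto("https://miro.com/app/dashboard/"); // illustrative entry URL

  // Step 2: one high-level instruction; UI-Tars plans and executes the sub-steps.
  await ai(`**ID: ${uuidv4()}, this is a new instruction, completely ignore previous memory,
follow the instruction and rules strictly, no hallucination**
Given A free plan user creates a new board without upgrade,
And the user creates 2 sticky notes on 2 different sides of the grid,
Then the user adds a line from the 1st sticky note to the 2nd sticky note`);

  // Step 3: AI-based assertion of the final state.
  await aiAssert("the board contains 2 sticky notes connected by a line");
});
```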
3. The Stage Conclusions
3.1 System-2 Reasoning Capability
From the referenced paper: "System-1 reasoning refers to the model directly producing actions without chain-of-thought, while system-2 reasoning involves a more deliberate thinking process where the model generates reasoning steps before selecting an action."
Part 1 demonstrated an example using Vinted. It showed a good result for system-1 reasoning: translating human language into a single browser action without chain-of-thought.
The video above demonstrates UI-Tars' system-2 reasoning ability: it builds a chain of thought, drawing on our own local RAG.
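To make the distinction concrete, a hypothetical contrast using Midscene's ai() (both instructions are illustrative):

```typescript
// System-1: the instruction maps directly to a single browser action,
// no chain-of-thought required.
await ai('Click the "New board" button');

// System-2: a high-level goal that UI-Tars must decompose into a multi-step
// plan (create the board, add 2 sticky notes, draw the connecting line).
await ai("Create a new board with 2 sticky notes and link these 2 sticky notes by a line");
```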
Takeaway messages:
- Compared with GPT-4o, UI-Tars has a stronger system-2 reasoning ability (data from Part 2).
- However, getting proper reasoning from UI-Tars for a specific sector or product requires a structured RAG, or fine-tuning UI-Tars with your own input data (data from [RAG for UI-Tars](https://github.com/web-infra-dev/midscene/issues/426)).
- For a B2C site, a small RAG together with UI-Tars-7B-SFT can fulfill a very high-level instruction such as "I want to buy a bag" on Vinted. (This has been tested against Vinted.com, but I cannot share the demo because the test was treated as a robot, and the action breached Vinted's policy.)
3.2 Applicable products and scenarios
- This solution can already serve as a supplement to existing regression E2E automated testing for B2C websites, even without building a RAG, provided the scenarios need only a little system-2 reasoning.
- It can serve as part of automated exploratory testing for B2C applications; other products require building a RAG or fine-tuning UI-Tars-SFT with their own structured business data.
- It can perform GUI and OCR checks to replace the screenshot assertions in your existing Playwright/Puppeteer tests, but only if the purpose of your screenshot assertions is NOT comparing the UI style against reference images (see the sketch below).
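For instance, a hypothetical before/after inside an existing Playwright test, reusing the aiAssert fixture from the earlier sketch (the assertion text is illustrative):

```typescript
// Before: pixel-based screenshot assertion — brittle against minor rendering changes.
// await expect(page).toHaveScreenshot("board.png");

// After: semantic GUI/OCR check via Midscene — asserts meaning, not pixels.
// Appropriate only when you don't need pixel-exact comparison with a reference image.
await aiAssert("the board shows 2 sticky notes connected by a line");
```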
3.3 Problems & Risk
You might be excited! It can already address real challenges in QA engineering, but it requires cautious use and ongoing development.
- One of the problems in Quality Assurance engineering may shift from "the automated test is flaky/outdated" to "the AI-decided automated test result is not trustworthy".
- When using a RAG together with UI-Tars, the solution works stably with 7B-SFT but poorly (or not at all) with 7B-DPO, so more effort is needed to make it scalable.
- Applying this AI Agent as a virtual QA engineer in an existing SDLC still faces challenges: when and how to introduce it, and the stability of its results, although it is already quite stable for system-1 reasoning and partially stable for system-2 reasoning.
- If one of your goals is high test-execution velocity, it may not work well for complicated web applications (such as some SaaS applications) that demand heavy system-2 reasoning: UI-Tars keeps its own short-term and long-term memory, and jumping to an "incorrect" cache will send the result to the Moon...