E-ANT: A Large-Scale Dataset for Efficient Automatic GUI NavigaTion

Ke Wang ¹, Tianyu Xia ¹¹¹footnotemark: 1 , Zhangxuan Gu¹, Yi Zhao²,
Shuheng Shen¹, Changhua Meng¹, Weiqiang Wang ¹, Ke Xu ²,
¹Ant Group, ²Tsinghua University
[email protected], [email protected] Equal contribution

Abstract

Online GUI navigation on mobile devices has driven a lot of attention recent years since it contributes to many real-world applications. With the rapid development of large language models (LLM), multimodal large language models (MLLM) have tremendous potential on this task. However, existing MLLMs need high quality data to improve its abilities of making the correct navigation decisions according to the human user inputs. In this paper, we developed a novel and highly valuable dataset, named E-ANT, as the first Chinese GUI navigation dataset that contains real human behaviour and high quality screenshots with annotations, containing more than 40,000+ real human traces over 20000+ different tinyAPPs and URLs. Furthermore, we evaluate various powerful MLLMs on E-ANT and show their experiments results with sufficient ablations. We believe that our proposed dataset will be beneficial for both the evaluation and development of GUI navigation and LLM/MLLM decision-making capabilities.

Ke Wang ¹^†^†thanks: Equal contribution, Tianyu Xia ¹¹¹footnotemark: 1 , Zhangxuan Gu¹, Yi Zhao², Shuheng Shen¹, Changhua Meng¹, Weiqiang Wang ¹, Ke Xu ², ¹Ant Group, ²Tsinghua University [email protected], [email protected]

1 Introduction

The integration of natural language and voice commands for automating tasks on mobile devices is a pivotal topic within human-computer interaction and intelligent agent design. This research holds immense value for individuals with physical disabilities or those engaged in activities like driving, where hands-free operation is essential. An efficient mobile device automatic control system should continuously comprehend the active screen, make informed decisions, and execute the necessary actions to fulfill objectives articulated through natural language, such as automatically opening vehicle navigation or ordering food while driving.

Refer to caption — Figure 1: Left: Annotations of AitW. Middle: Our page analysis results on AitW data. Right: Our page analysis results on our dataset. Note: Our method accurately identifies the "Select Specifications" button, critical for order placements.

Existing research approaches are from either a software engineering Google ; Apple ; Li et al. (2017a); Azim et al. (2016); Li et al. (2017b) or navigation algorithms Li et al. (2020); Bai et al. (2021); Hong et al. (2023); Rawles et al. (2024); Wang et al. (2023a); Wen et al. (2023b, a); Yan et al. (2023); Zhang et al. (2023a). The former methods explore the automation of instruction execution or the abstraction of various APIs, while the latter ones are primarily concerned with translating natural language commands into system-comprehensible instructions (such as clicks or slides). Nevertheless, they both need the model has powerful decision-making capability given human inputs. In this case, GUI navigation benchmark is an essential aspect of evaluating the decision-making capabilities of both large language models(LLMs) Achiam et al. (2023); Zheng et al. (2024); Touvron et al. (2023); Bai et al. (2023); Baichuan (2023); Jiang et al. (2023); Almazrouei et al. (2023) and multi-modal large language models(MLLMs) Liu et al. (2024, 2023); Zhu et al. (2023); Chen et al. (2023); Li et al. (2022, 2023a); Wang et al. (2023b); Hong et al. (2023). It holds significant importance in assessing their performance as agents. As a result, Android in the Wild (AitW) Rawles et al. (2024) is a data benchmark recently introduced by Google to assess the efficiency of UI navigation algorithms while performing daily tasks on native Android systems. This benchmark effectively fills the void in data benchmarks for evaluating UI interaction within Android systems.

Despite the proliferation of Chinese mobile applications, surpassing 2.61 million MIIT of China (2023), the majority of existing GUI navigation datasets primarily cater to English, leading to a clear lack of comprehensive datesets for Chinese GUI navigation. Additionally, datasets like AitW are narrowly focused on GUI navigation within the native Android operating system and its inherent applications, making their applicability to third-party apps from various developers limited. Moreover, the quality of annotation regarding GUI element positions in these datasets is poor, with some inaccuracies(figure 1) and wrong labels Hong et al. (2023), which can impair the decision-making precision in downstream GUI navigation activities.

In this paper, we focus on navigating the Chinese GUI in third-party applications created by various developers. We primarily gather our data from tiny-apps, which are lightweight and simple to develop mobile applications. Currently, there are over 4.6 million active tiny-apps. We developed a Large-Scale Dataset for Efficient Automatic GUI NavigaTion (E-ANT), consisting of over 40,000 user operation trajectories. This dataset covers a wide range of navigation intentions and includes various tiny apps. Comparatively, interacting with the Android native system is different as tiny apps are typically created by numerous third-party developers, each with their own design logic and art styles. This diversity poses unique challenges in navigating through tiny apps. We provide a clear and comprehensive understanding of each trajectory. This includes an intention described in natural language, a series of consecutive page screenshots ranging from several to dozens, and the corresponding actions performed on each page, such as clicks and slides in coordinate dimensions. Additionally, for each page screenshot, we offer detailed information about the page elements captured, including their type (such as button, icon, OCR, etc.), coordinates, and the text contained within each element.

To gain a precise understanding and evaluation of the GUI navigation performances of current mainstream LLMs/MLLMs, we conducted extensive benchmark tests on E-ANT. Specifically, we evaluate the GUI navigation level of the current mainstream models under the following strategies. (1)Zero-shot inference. Directly use the existing pre-trained model to test on the test set. (3)Fine-tuning. Use a part of the samples as a training set to fine-tune the model before inference. (4)Fine-tuning with data augmentation. This is our recommended method of fine-tuning the UI navigation model. It does not directly use coordinate positions as labels, but allows them to make decisions step by step in a chain. We will introduce this method in detail later.

We summarize our contributions as follows:

•

We gather and publish the first large-scale Chinese dataset for GUI navigation, collected from diverse tiny apps. It will make foreseeable contributions to both the multimodal and the automatic GUI navigation community.
•

We analyze the characteristics of our dataset and provide a recommended fine-tuning methods for GUI navigation data.
•

We evaluate in detail the performance of current mainstream LLMs/MLLMs on this dataset under different inference methods.

2 Related Work

2.1 UI Navigation and Automation Execution

Previously, three primary methods existed for incorporating automated UI navigation on mobile devices. These options included smart assistants developed by mobile phone manufacturers such as Siri, as well as macro recording tools Rodrigues (2015); Rodrigues and Guerreiro (2014); Li (2021) and Programming by Demonstration (PBD) systems Cypher and Halbert (1993); Lieberman (2001); Guibert et al. (2004); Maués and Barbosa (2013); Li et al. (2017a). They can all translate user intentions into low-level operations and automate execution. The smart assistant is restricted to calling only the built-in applications on mobile phones and a few select external applications that are in collaboration with mobile phone manufacturers. This limitation significantly constrains its range of application scenarios. The macro recording tool’s Rodrigues (2015); Rodrigues and Guerreiro (2014); Li (2021) capabilities are limited to playing back user-recorded operations. It lacks the ability to handle tasks with altered parameters or customized actions. The PBD system not only supports the automatic generation of execution scripts through user demonstrations but also provides corresponding interfaces for users to edit scripts Cypher and Halbert (1993); Lieberman (2001); Guibert et al. (2004); Maués and Barbosa (2013); Li et al. (2017a). However, despite its usefulness, the system still has a learning curve, and the scripts are not easily applicable across various applications with similar functions.

2.2 UI Navigation with LLMs/MLLMs

The increasing popularity of large-scale language models and multi-modal language models, including GPT Achiam et al. (2023), LLaMA Touvron et al. (2023), BaiChuan Baichuan (2023), LLaVA Liu et al. (2024, 2023), MiniGPT4-V Zhu et al. (2023); Chen et al. (2023), and others, has led to a growing interest among researchers in utilizing these models as intelligent agents for automating UI navigation Wang et al. (2023a); Kim et al. (2024); Wen et al. (2023b); Zhang et al. (2023b); Wen et al. (2023a); Lee et al. (2023); Yan et al. (2023); Zhan and Zhang (2023); Hong et al. (2023); Yang et al. (2023).

Among them, some works Wang et al. (2023a); Kim et al. (2024); Yan et al. (2023); Wen et al. (2023a) utilize trained LLMs/MLLMs like GPT. They achieve automatic navigation on mobile devices by prompt and incorporating the knowledge of UI navigation domain in pre-trained LLMs/MLLMs. Furthermore, Li et al. Li et al. (2023b)introduced structured self-reflection into the UI navigation agent to improve its planning capabilities, while Zhang et al. Zhan and Zhang (2023) used Chain-of-Action prompts to improve the performance of multi-modal agents on UI navigation tasks. In addition, Zhang et al. Zhan and Zhang (2023) and Hong et al. Hong et al. (2023) based on pre-trained LLMs/MLLMs, fine-tuned instructions for the content in the UI navigation field to enhance the accuracy of the model’s navigation decisions.

2.3 UI Navigation Benchmark

There are currently some studies focusing on evaluation data collection and benchmark construction in the field of UI navigation Shi et al. (2017); Liu et al. (2018); Yao et al. (2022); Rawles et al. (2024); Bai et al. (2021); Deka et al. (2017). MiniWob Shi et al. (2017) and MiniWob++ Liu et al. (2018) are established benchmarks in the field of computer UI navigation research. It requires agents to perform specific tasks in the computer environment they build through instructions such as clicks and inputs. MiniWob++ goes a step further by providing programmatically defined rewards for each decision made during execution. In addition, WebShop Yao et al. (2022) provides the UI navigation community with a simulated e-commerce environment. In this environment, agents need to navigate multiple types of web pages and issue different actions to find and purchase products based on instructions. These environments, datasets, and benchmarks focus primarily on navigation and decision-making on web pages Shi et al. (2017); Liu et al. (2018); Yao et al. (2022). For mobile phones, UIbert Bai et al. (2021) and RICO Deka et al. (2017) provide practical page understanding benchmarks that can effectively evaluate the target detection and recognition capabilities of models or agents on the page. However, they lack the intentions and operational actions actually performed on the page and cannot evaluate the agent’s Navigation and decision-making skills. The AitW dataset Rawles et al. (2024) fills this gap by providing an extensive collection of over 6 million images and corresponding actions performed on the Android operating system. However, its focus is mostly limited to first-party applications, such as settings, clock, and Google Maps, with minimal support for third-party applications with different design styles. This limits the comprehensiveness of the evaluation capabilities of this dataset and the generalizability of models trained on this dataset.

3 E-Ant TinyAPP Dataset

3.1 GUI Navigation Task

Generally, navigating a GUI can be seen as a series of decision-making tasks when interacting with a webpage or app.

Decision-making tasks on UI pages. For a web page denoted as $S_{t}$ , it hosts a range of interactive controls. We identify the collection of all possible interactions within these elements as the action set $A_{t}$ . Upon selection and execution of an action $a_{t}$ from $A_{t}$ by either a user or an agent, the web page transitions to its updated state, labeled as $S_{t+1}$ . We use the notation $S_{t+1}=S_{t}\wedge a_{t}$ to represent that by executing action $a_{t}$ on state $S_{t}$ , the system transitions to the new state $S_{t+1}$ .

Navigation tasks on UI pages. Given an initial page $S_{0}$ and a final state $S^{*}$ , the objective of the UI navigation task is to ensure that $S_{T}=S^{*}$ , which is achieved by sequentially applying a series of decisions $A=\{a_{0},a_{1},\cdots,a_{T-1}\}$ to transition from $S_{0}$ to $S_{T}$ through the operations $S_{0}\wedge a_{0}\wedge a_{1}\wedge\cdots\wedge a_{T-1}$ . For each $t\in\{0,\cdots,T-1\}$ , the action $a_{t}$ is chosen from the set $A_{t}$ , which contains all possible interactive actions available on page $S_{t}$ . We usually call $S^{*}$ the intention or purpose of the UI navigation task.

Therefore, the aim for a tool or model designed for UI navigation is to develop a decision-making function $\hat{a}_{t}=f(S^{*},S_{0})$ that achieves $S^{*}=S_{T}=S_{0}\wedge\hat{a}_{0}\wedge\hat{a}_{1}\cdots\wedge\hat{a}_{T-1}$ , all within a limited number of steps $T$ . Previous research Rawles et al. (2024); Humphreys et al. (2022) has also incorporated historical operation trajectories and states into the input for decision-making functions in the study of UI navigation tasks. For these approaches, the decision-making function can be represented as $\hat{a}_{t}=f(S^{*},S_{0},\cdots,S_{t-1},a_{0},\cdots,a_{t-1})$ . Additionally, for ease of annotation and comprehension, the target state $S^{*}$ is typically described using a single sentence $p$ .

3.2 Data Collections

We design an annotation systems for annotators to interact with and record tasks to record the real human’s behaviour on tiny apps. The annotation system establishes a real-time connection with an Android emulator in the backend, while the annotators interact with the frontend which contains a mobile interface and task description. The entire data collection process is as follows: (1)The backend server synchronizes screenshots from the backend Android device to the frontend interface. (2)Annotators act on the interface according to specified tasks, such as clicking buttons, scrolling pages, entering content, and navigating back, etc. (3)The backend server records the actions and synchronizes the current screenshot, operation coordinates, and text to the cloud as a record. (4)The backend server sends the operation instructions of the tagging personnel to the Android virtual machine and actually performs the corresponding operations. (5)After the emulator performs the actions, it updates the screenshot to the frontend.

3.3 Data Organization Methods

Our dataset is composed of 49,023 operational traces across diverse mini-programs within super apps. It encompasses both single-step and multi-step traces, spanning 27 sectors including catering, retail, healthcare, and government services, and extends to over 20,000 distinct tiny-apps and urls. For each operation trace, we will provide the corresponding operation purpose $p$ , an indicator of whether the purpose was achieved, and a series of operation steps. At the same time, for each operation step, we will provide page screenshots, page layout analysis results and corresponding actions.

Operation Purpose $p$ . The purpose $p$ is succinctly described in a single sentence, typically suggesting a corresponding page state $S^{*}$ . For instance, the intent to "rent an iPhone 15" suggests that $S^{*}$ would be the order or payment page for the iPhone 15 within the tiny-app. Our dataset encompasses over 10,000 such action intents. Moreover, for more complex purposes, we also detail the sub-purposes associated with different steps.

Operation Step. Each operational step is defined by the current state information including a page screenshot, page layout analysis results and the action performed. Executable actions fall into several categories: CLICK, SWIPE, WAIT, INPUT, and STOP. The CLICK action specifies the coordinates where the click occurred, while the SWIPE action contains a start point and an end point to represent the sliding between the two points.

To gain a deeper insight into our dataset’s composition, we have visualized one of its trajectories in Figure 2. Additionally, Figure 3 illustrates the data associated with each step of the trajectory alongside the data production pipeline.

3.4 Analysis of our Datasets

More divergence. We contend that our dataset exhibits greater diversity compared to existing datasets in the UI navigation domain. The data we have gathered originates from mini-apps created by various developers, signifying a range of UI design styles and operational logics. As illustrated in the figure 4, our dataset encompasses over 20,000 distinct tiny-apps, with the majority appearing in fewer than 5 trajectories within the dataset. This diversity introduces significant challenges for the generalization of models or agents developed or assessed using this dataset. In contrast, most of AitW’s data is sourced from Google’s first-party applets or those applets developed in close collaboration with Google, with the majority of this data concentrated in specific applications such as Chrome, Android Settings, and Gmail. We also visualize the succes and fail rate at different length of traces, as shown in the figure 4, which shows that as the step length raise, the success rate get lower, implying that longer steps tend to have a higher failure rate.

Chinese language. Both MiniWob for web pages and AitW for Android phones focus mostly on English. They don’t have much data for other languages. Since Chinese is one of the world’s most used languages on the internet, and because it uses a different writing system from English, it’s hard for models trained on English data to work well with Chinese. This means we really need to make a dataset for Chinese UI navigation.

Layout Analysis vs OCR. In our dataset, we employ a layout analysis algorithm Gu et al. (2023) rooted in UI data as an alternative to OCR. As depicted in the figure 1, our layout analysis algorithm outperforms OCR technology by capturing a wider range of UI elements that are likely integral to decision-making processes. When we compare our layout analysis approach to the IconNet methodology used by AitW in the AitW dataset, it is evident that our algorithm identifies a more comprehensive set of elements. Furthermore, our method groups text and icons that are spatially proximate, which better reflects the spatial logic inherent in UI design.

4 Dataset Evaluation

4.1 Evaluation Metrics

Picture-level Accuracy. We employ a key evaluation metric that measures the congruence between the model’s decisions in its present state and the corresponding real-world actions observed in each image. This alignment is quantified by posing a binary question. For "CLICK" actions, the model must identify the precise click element. That is, the model needs to find the element information that matches the expected action (marked by the annotator) from the given layout analysis results and output it. Meanwhile, the model is considered to have made a correct decision for other action types as long as it accurately predicts the type of action performed. We leverage these criteria to calculate the model’s mean accuracy across various images, serving as a gauge of its performance in UI navigation tasks.

Trajectory-level Accuracy. In undertaking GUI navigation tasks, each trajectory consists of a sequence of steps. An error in predicting any one of these steps could hinder the successful navigation to the intended destination. Consequently, we consider the model’s trajectory-level accuracy as a key performance measure. Successful navigation is achieved only when the model executes the appropriate action for each image within a given trajectory.

4.2 Methods

We employ three evaluation methods to assess the performance of various models on the E-ANT, specifically designed for LLMs/MLLMs. These methods include Zero-shot inference and Fine-tuning. In addition, we trained an XYLayoutLM model Gu et al. (2022) trained using the behavioral cloning method, which is a baseline for non-generative methods.

Table 1: Benchmarks on E-ANT cover LLM, MLLM and non-generative models

model type	training strategy	Picture-level acc	Trajectory-level acc
GPT-3.5-16K	Zero-shot	23.5%	1.9%
LLaVA-v1.5-7B	Zero-shot	12.8%	0.4%
LLaVA-NeXT-7B	Zero-shot	19.9%	0.8%
Qwen-72B	Zero-shot	28.6%	2.1%
Qwen1.5-14B	Zero-shot	18.7%	0.6%
XYLayoutLM	Finetune(Behavioral Cloning)	66.8%	11.1%
LLaVA-v1.5-7B	Finetune	47.3%	3.6%
LLaVA-v1.5-7B	Finetune with data augmentation	51.6%	4.0%

Zero-shot inference for LLM/MLLMs. In inference using Language Models (LLMs), we systematically analyze the layout information presented by each image and provide it to LLMs. This information is captured within a structured element bar, which adheres to a standardized format: $\{^{\prime}id^{\prime}:<\cdot>,^{\prime}cate^{\prime}:<\cdot>,^{\prime}text^{% \prime}:<\cdot>,^{\prime}box^{\prime}:<\cdot>\}$ . Once formatted, this data is then fed into the LLM, which is instructed to output responses that conform to a predetermined structure, given by the template: $\{^{\prime}thinking^{\prime}:<\cdot>,^{\prime}action\_type^{\prime}:<\cdot>,^{% \prime}button^{\prime}:\{^{\prime}id^{\prime}:<\cdot>,^{\prime}cate^{\prime}:<% \cdot>,^{\prime}text^{\prime}:<\cdot>,^{\prime}box^{\prime}:<\cdot>\}\}$ . Based on this scheme, the LLM selects an appropriate action; however, if the ‘ $action\_type$ ’ is not a ‘ $click$ ’, the details of the ‘button’ need not be furnished. Meanwhile, We supply MLLMs with both layout parsing text prompts and original image embeddings.

Fine-tuning for MLLMs. In addition to the setting without retraining, we also focus on the performance of multi-modal large models after fine-tuning instructions for certain UI navigation task data. We referred to the training method provided by Auto-UI Zhan and Zhang (2023), used pictures as model input, and then directly organized the decision results into text as the object that the model needed to learn. In organizing training data for fine-tuning, we categorize it into two types: general domain data, sourced from LLaVA Liu et al. (2024, 2023), and data specifically aimed at GUI (Graphical User Interface) navigation decisions. All LLaVA-based fine-tuning results are obtained by training one epoch using the mixed data on the official checkpoint released by the company. The training settings refer to the standard settings.

Fine-tuning for XYLayoutLM. We formulate the decision task as a combination of a NER(Named Entity Recognition) task and a sentence-level classification task using a multimodal model, and train these two tasks on the XYLayoutLM model Gu et al. (2022), which is an improvement of the LayoutLM family of models. Most of the model’s raw settings are kept except each detected element is treated as an delicated word with text content like ’icon: <ocr text>’ and its corresponding position. The sentence-level classification task is used to learn the expected action type on the current page, which will make judgments among types such as CLICK, WAIT, SWIPE, INPUT, SUCCESS and FAIL. For CLICK or INPUT action, the model is also required to highlight the token corresponding to the element expected to be tapped on. This is described as an NER task, that is a binary classification task in the token dimension to decide which elemented to be chosen. As for the INPUT part, since XYLayoutLM is not a generative model, so we treat a action is correct as long as the model can identify the correct action and token.

4.3 Experiments Results on E-ANT

We randomly selected 1000 trajectories as our testset (about 5%), and for fine-tuning experiments, we selected 4000 trajectories (about 20%) that are not included in the testset as training data. We present our experimental results in the table 1. We can notice that for zero-shot inference, although most models can correctly infer the decision under a single step on some pictures, they perform poorly in terms of accuracy in the trajectory-level. At the same time, for the same zero-shot inference, LLaVA did not show a higher level of competitiveness than stronger LLMs such as GPT3.5 and Qwen72B, even though it added image modal input. This may be because under the premise of carrying layout parsing text input, increasing the basic ability of the text base itself is more important for decision-making than adding a modal input. In addition, the fine-tuned LLaVA can show higher accuracy than the un-fine-tuned version. Combining the understanding of data through GUI with decision-making data effectively enhances its accuracy.

5 Data Augmentation Method for E-ANT

5.1 Motivation

Typically, human users navigate through GUI navigation tasks in a two-step approach: initially by comprehending the contents of the page (page understanding), and subsequently determining the necessary actions (action decision-making). This indicates that a profound understanding of the page significantly enhances the model’s ability to connect the input image with specific actions.

Generally speaking, if you want the model to obtain a good understanding of UI pages, you need to rely on additional training data, including page understanding data sets such as UI-Bert and RICO. However, a challenge arises due to the stylistic divergence between the UI designs in these datasets and those encountered in current UI navigation tasks. This discrepancy may lead to a disjointed integration of the two critical phases of understanding and action.

In fact, the image trajectory data utilized for GUI navigation tasks is valuable for understanding image pages. However, it lacks the necessary annotation information for each element within the image. Simultaneously, employing annotators to add further annotations to all images and their respective elements would entail a significant investment in terms of both manpower and financial resources. Another approach involves leveraging the existing MLLM for auxiliary annotation. However, we’ve observed that the current MLLM struggles to fully and accurately identify elements within GUI images. This challenge arises from the characteristic of the MLLM’s training data, which primarily consist of regular images and not specialized GUI content like app screenshots. To enhance the image trajectory data from GUI navigation tasks into high-quality image page understanding data without incurring additional costs, this paper introduces a bootstrap data augmentation method. This method leverages the existing layout parsing model and the MLLMs to expand data.

5.2 Overview and data workflow

In this subsection, we describe our approach to training our multimodal UI navigation model using UI navigation data. Our methodology is structured around two critical processes: generating page understanding data and creating chained decision-making data. The goal of the first process is to enrich the model’s comprehension of webpage content, while the second process is designed to improve the model’s ability to link navigation objectives with ultimate decision-making actions.

Page understanding data generation. For processing an GUI page, we commence by deploying the GUI layout parsing model to identify and segment the page elements into multiple sub-images. Each sub-element image is then fed into an advanced multi-modal large model (such as LLaVA, MiniGPT4, or Blip) that does not specialize in UI navigation data fine-tuning, requiring the model to independently generate an outline. Through this approach, we capture both the coordinates of each element (via layout analysis) and their descriptive information (via the multi-modal model). Subsequently, we synthesize three distinct types of data for a comprehensive understanding of the page: element positioning data, page element enumeration data, and a page summary. Examples of these data types are illustrated in accompanying figures. To create the element positioning data, we simply pair the coordinates with their respective descriptions for each element. The page element enumeration data is produced by aggregating these pairs across all page elements into a cohesive paragraph. Finally, for the page summary data, we compile the descriptions of all elements and submit them to the LLM to generate a succinct summary.

Chained decision-making data generation. Indeed, for numerous pages and navigational goals, the link between the intended navigation and the specific actions required on a given page is not always intuitive. Even humans must meticulously review the page’s content before deciding on their navigational approach. To improve the model’s ability to establish this connection, we used multi-round conversation data in the training data, requiring the model to first answer questions related to the understanding of the page and then make corresponding action decisions.

6 Conclusion

Navigation plays an important rule in people’s daily life, yet we find there is a lack of a comprehensive and well-designed data benchmark. Moreover, existing benchmarks are predominantly in English, with poor box quality and limited availability for Chinese. To address these issues, we have introduced a new benchmark with a large-scale dataset and several distinct features, including more divergence and good layout baselines. For now we collect over 40k high quality trajectories performed and corrected by human annotators, which will fill the gaps of navigation data on Mobile UI.

7 Limitation

The E-ANT dataset is annotated by the annotators on the computer page through ADB interacting with the Android virtual machine in the background, which means that there is still a gap between our environment and the real Android mobile phone, and it cannot have more flexible operations like using the Android system directly on the mobile phone.

In addition, although E-ANT is the first large-scale Chinese GUI navigation dataset with a large amount of data, due to the existence of a large number of heterogeneous mobile devices and Chinese applications with different resolutions and GUI styles, there is still a need to further improve the scope and quality of the data to help build a more effective GUI navigation intelligent agent.

Finally, various high-performance LLM/MLLMs are being released by different researchers. Due to limited resources (computing resources and open model checkpoint resources), we, as data publishers, cannot traverse and evaluate all models on the market using E-ANT data. However, we will work with the community to improve such evaluations as much as possible and continuously iterate and optimize the training and testing pipelines.

References

Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
Almazrouei et al. (2023) Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Merouane Debbah, Etienne Goffinet, Daniel Heslow, Julien Launay, Quentin Malartic, Badreddine Noune, Baptiste Pannier, and Guilherme Penedo. 2023. Falcon-40B: an open large language model with state-of-the-art performance.
(3) Apple. Xcode. https://developer.apple.com/xcode/.
Azim et al. (2016) Tanzirul Azim, Oriana Riva, and Suman Nath. 2016. ulink: Enabling user-defined deep linking to app content. In Proceedings of the 14th Annual International Conference on mobile systems, applications, and services, pages 305–318.
Bai et al. (2021) Chongyang Bai, Xiaoxue Zang, Ying Xu, Srinivas Sunkara, Abhinav Rastogi, Jindong Chen, et al. 2021. Uibert: Learning generic multimodal representations for ui understanding. arXiv preprint arXiv:2107.13731.
Bai et al. (2023) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609.
Baichuan (2023) Baichuan. 2023. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305.
Chen et al. (2023) Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechu Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. 2023. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478.
Cypher and Halbert (1993) Allen Cypher and Daniel Conrad Halbert. 1993. Watch what I do: programming by demonstration. MIT press.
Deka et al. (2017) Biplab Deka, Zifeng Huang, Chad Franzen, Joshua Hibschman, Daniel Afergan, Yang Li, Jeffrey Nichols, and Ranjitha Kumar. 2017. Rico: A mobile app dataset for building data-driven design applications. In Proceedings of the 30th annual ACM symposium on user interface software and technology, pages 845–854.
(11) Google. Android debug bridge. https://developer.android.com/tools/adb.
Gu et al. (2022) Zhangxuan Gu, Changhua Meng, Ke Wang, Jun Lan, Weiqiang Wang, Ming Gu, and Liqing Zhang. 2022. Xylayoutlm: Towards layout-aware multimodal networks for visually-rich document understanding. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pages 4573–4582. IEEE.
Gu et al. (2023) Zhangxuan Gu, Zhuoer Xu, Haoxing Chen, Jun Lan, Changhua Meng, and Weiqiang Wang. 2023. Mobile user interface element detection via adaptively prompt tuning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pages 11155–11164. IEEE.
Guibert et al. (2004) Nicolas Guibert, Patrick Girard, and Laurent Guittet. 2004. Example-based programming: a pertinent visual approach for learning to program. In Proceedings of the working conference on Advanced visual interfaces, pages 358–361.
Hong et al. (2023) Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. 2023. Cogagent: A visual language model for gui agents. arXiv preprint arXiv:2312.08914.
Humphreys et al. (2022) Peter C Humphreys, David Raposo, Tobias Pohlen, Gregory Thornton, Rachita Chhaparia, Alistair Muldal, Josh Abramson, Petko Georgiev, Adam Santoro, and Timothy Lillicrap. 2022. A data-driven approach for learning to control computers. In International Conference on Machine Learning, pages 9466–9482. PMLR.
Jiang et al. (2023) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. 2023. Mistral 7b. arXiv preprint arXiv:2310.06825.
Kim et al. (2024) Geunwoo Kim, Pierre Baldi, and Stephen McAleer. 2024. Language models can solve computer tasks. Advances in Neural Information Processing Systems, 36.
Lee et al. (2023) Sunjae Lee, Junyoung Choi, Jungjae Lee, Hojun Choi, Steven Y Ko, Sangeun Oh, and Insik Shin. 2023. Explore, select, derive, and recall: Augmenting llm with human-like memory for mobile task automation. arXiv preprint arXiv:2312.03003.
Li et al. (2023a) Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023a. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597.
Li et al. (2022) Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR.
Li et al. (2023b) Tao Li, Gang Li, Zhiwei Deng, Bryan Wang, and Yang Li. 2023b. A zero-shot language agent for computer control with structured reflection. arXiv preprint arXiv:2310.08740.
Li et al. (2017a) Toby Jia-Jun Li, Amos Azaria, and Brad A Myers. 2017a. Sugilite: creating multimodal smartphone automation by demonstration. In Proceedings of the 2017 CHI conference on human factors in computing systems, pages 6038–6049.
Li et al. (2017b) Toby Jia-Jun Li, Yuanchun Li, Fanglin Chen, and Brad A Myers. 2017b. Programming iot devices by demonstration using mobile apps. In End-User Development: 6th International Symposium, IS-EUD 2017, Eindhoven, The Netherlands, June 13-15, 2017, Proceedings 6, pages 3–17. Springer.
Li (2021) Wei Li. 2021. Learning ui navigation through demonstrations composed of macro actions. arXiv preprint arXiv:2110.08653.
Li et al. (2020) Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, and Jason Baldridge. 2020. Mapping natural language instructions to mobile ui action sequences. arXiv preprint arXiv:2005.03776.
Lieberman (2001) Henry Lieberman. 2001. Your wish is my command: Programming by example. Morgan Kaufmann.
Liu et al. (2018) Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, and Percy Liang. 2018. Reinforcement learning on web interfaces using workflow-guided exploration. arXiv preprint arXiv:1802.08802.
Liu et al. (2023) Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2023. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744.
Liu et al. (2024) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2024. Visual instruction tuning. Advances in neural information processing systems, 36.
Maués and Barbosa (2013) Rodrigo de A Maués and Simone Diniz Junqueira Barbosa. 2013. Keep doing what i just did: automating smartphones by demonstration. In Proceedings of the 15th international conference on Human-computer interaction with mobile devices and services, pages 295–303.
MIIT of China (2023) MIIT of China. 2023. Operation of the internet and related service industries in the first three quarters of 2023.
Rawles et al. (2024) Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. 2024. Androidinthewild: A large-scale dataset for android device control. Advances in Neural Information Processing Systems, 36.
Rodrigues (2015) André Rodrigues. 2015. Breaking barriers with assistive macros. In Proceedings of the 17th International ACM SIGACCESS Conference on Computers & Accessibility, pages 351–352.
Rodrigues and Guerreiro (2014) André Rodrigues and Tiago Guerreiro. 2014. Swat: Mobile system-wide assistive technologies. In Proceedings of the 28th International BCS Human Computer Interaction Conference (HCI 2014). BCS Learning & Development.
Shi et al. (2017) Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, and Percy Liang. 2017. World of bits: An open-domain platform for web-based agents. In International Conference on Machine Learning, pages 3135–3144. PMLR.
Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
Wang et al. (2023a) Bryan Wang, Gang Li, and Yang Li. 2023a. Enabling conversational interaction with mobile ui using large language models. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pages 1–17.
Wang et al. (2023b) Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. 2023b. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079.
Wen et al. (2023a) Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, and Yunxin Liu. 2023a. Empowering llm to use smartphone for intelligent task automation. arXiv preprint arXiv:2308.15272.
Wen et al. (2023b) Hao Wen, Hongming Wang, Jiaxuan Liu, and Yuanchun Li. 2023b. Droidbot-gpt: Gpt-powered ui automation for android. arXiv preprint arXiv:2304.07061.
Yan et al. (2023) An Yan, Zhengyuan Yang, Wanrong Zhu, Kevin Lin, Linjie Li, Jianfeng Wang, Jianwei Yang, Yiwu Zhong, Julian McAuley, Jianfeng Gao, et al. 2023. Gpt-4v in wonderland: Large multimodal models for zero-shot smartphone gui navigation. arXiv preprint arXiv:2311.07562.
Yang et al. (2023) Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. 2023. Appagent: Multimodal agents as smartphone users. arXiv preprint arXiv:2312.13771.
Yao et al. (2022) Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. 2022. Webshop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems, 35:20744–20757.
Zhan and Zhang (2023) Zhuosheng Zhan and Aston Zhang. 2023. You only look at screens: Multimodal chain-of-action agents. arXiv preprint arXiv:2309.11436.
Zhang et al. (2023a) Zhizheng Zhang, Wenxuan Xie, Xiaoyi Zhang, and Yan Lu. 2023a. Reinforced ui instruction grounding: Towards a generic ui task automation api. arXiv preprint arXiv:2310.04716.
Zhang et al. (2023b) Zhizheng Zhang, Xiaoyi Zhang, Wenxuan Xie, and Yan Lu. 2023b. Responsible task automation: Empowering large language models as responsible task automators. arXiv preprint arXiv:2306.01242.
Zheng et al. (2024) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2024. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36.
Zhu et al. (2023) Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. 2023. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592.