-
Automated Code Fix Suggestions for Accessibility Issues in Mobile Apps
Authors:
Forough Mehralian,
Titus Barik,
Jeff Nichols,
Amanda Swearngin
Abstract:
Accessibility is crucial for inclusive app usability, yet developers often struggle to identify and fix app accessibility issues due to a lack of awareness, expertise, and inadequate tools. Current accessibility testing tools can identify accessibility issues but may not always provide guidance on how to address them. We introduce FixAlly, an automated tool designed to suggest source code fixes for accessibility issues detected by automated accessibility scanners. FixAlly employs a multi-agent LLM architecture to generate fix strategies, localize issues within the source code, and propose code modification suggestions to fix the accessibility issue. Our empirical study demonstrates FixAlly's capability in suggesting fixes that resolve issues found by accessibility scanners -- with an effectiveness of 77% in generating plausible fix suggestions -- and our survey of 12 iOS developers finds they would be willing to accept 69.4% of evaluated fix suggestions.
Submitted 7 August, 2024;
originally announced August 2024.
-
UICoder: Finetuning Large Language Models to Generate User Interface Code through Automated Feedback
Authors:
Jason Wu,
Eldon Schoop,
Alan Leung,
Titus Barik,
Jeffrey P. Bigham,
Jeffrey Nichols
Abstract:
Large language models (LLMs) struggle to consistently generate UI code that compiles and produces visually relevant designs. Existing approaches to improve generation rely on expensive human feedback or distilling a proprietary model. In this paper, we explore the use of automated feedback (compilers and multi-modal models) to guide LLMs to generate high-quality UI code. Our method starts with an existing LLM and iteratively produces improved models by self-generating a large synthetic dataset using an original model, applying automated tools to aggressively filter, score, and de-duplicate the data into a refined higher quality dataset. The original LLM is improved by finetuning on this refined dataset. We applied our approach to several open-source LLMs and compared the resulting performance to baseline models with both automated metrics and human preferences. Our evaluation shows the resulting models outperform all other downloadable baselines and approach the performance of larger proprietary models.
Submitted 11 June, 2024;
originally announced June 2024.
-
UIClip: A Data-driven Model for Assessing User Interface Design
Authors:
Jason Wu,
Yi-Hao Peng,
Amanda Li,
Amanda Swearngin,
Jeffrey P. Bigham,
Jeffrey Nichols
Abstract:
User interface (UI) design is a difficult yet important task for ensuring the usability, accessibility, and aesthetic qualities of applications. In our paper, we develop a machine-learned model, UIClip, for assessing the design quality and visual relevance of a UI given its screenshot and natural language description. To train UIClip, we used a combination of automated crawling, synthetic augmentation, and human ratings to construct a large-scale dataset of UIs, collated by description and ranked by design quality. Through training on the dataset, UIClip implicitly learns properties of good and bad designs by i) assigning a numerical score that represents a UI design's relevance and quality and ii) providing design suggestions. In an evaluation that compared the outputs of UIClip and other baselines to UIs rated by 12 human designers, we found that UIClip achieved the highest agreement with ground-truth rankings. Finally, we present three example applications that demonstrate how UIClip can facilitate downstream applications that rely on instantaneous assessment of UI design quality: i) UI code generation, ii) UI design tips generation, and iii) quality-aware UI example search.
Submitted 18 April, 2024;
originally announced April 2024.
-
BISCUIT: Scaffolding LLM-Generated Code with Ephemeral UIs in Computational Notebooks
Authors:
Ruijia Cheng,
Titus Barik,
Alan Leung,
Fred Hohman,
Jeffrey Nichols
Abstract:
Programmers frequently engage with machine learning tutorials in computational notebooks and have been adopting code generation technologies based on large language models (LLMs). However, they encounter difficulties in understanding and working with code produced by LLMs. To mitigate these challenges, we introduce a novel workflow into computational notebooks that augments LLM-based code generation with an additional ephemeral UI step, offering users UI scaffolds as an intermediate stage between user prompts and code generation. We present this workflow in BISCUIT, an extension for JupyterLab that provides users with ephemeral UIs generated by LLMs based on the context of their code and intentions, scaffolding users to understand, guide, and explore with LLM-generated code. Through a user study where 10 novices used BISCUIT for machine learning tutorials, we found that BISCUIT offers users representations of code to aid their understanding, reduces the complexity of prompt engineering, and creates a playground for users to explore different variables and iterate on their ideas.
Submitted 11 July, 2024; v1 submitted 10 April, 2024;
originally announced April 2024.
-
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
Authors:
Keen You,
Haotian Zhang,
Eldon Schoop,
Floris Weers,
Amanda Swearngin,
Jeffrey Nichols,
Yinfei Yang,
Zhe Gan
Abstract:
Recent advancements in multimodal large language models (MLLMs) have been noteworthy, yet, these general-domain MLLMs often fall short in their ability to comprehend and interact effectively with user interface (UI) screens. In this paper, we present Ferret-UI, a new MLLM tailored for enhanced understanding of mobile UI screens, equipped with referring, grounding, and reasoning capabilities. Given that UI screens typically exhibit a more elongated aspect ratio and contain smaller objects of interest (e.g., icons, texts) than natural images, we incorporate "any resolution" on top of Ferret to magnify details and leverage enhanced visual features. Specifically, each screen is divided into 2 sub-images based on the original aspect ratio (i.e., horizontal division for portrait screens and vertical division for landscape screens). Both sub-images are encoded separately before being sent to LLMs. We meticulously gather training samples from an extensive range of elementary UI tasks, such as icon recognition, find text, and widget listing. These samples are formatted for instruction-following with region annotations to facilitate precise referring and grounding. To augment the model's reasoning ability, we further compile a dataset for advanced tasks, including detailed description, perception/interaction conversations, and function inference. After training on the curated datasets, Ferret-UI exhibits outstanding comprehension of UI screens and the capability to execute open-ended instructions. For model evaluation, we establish a comprehensive benchmark encompassing all the aforementioned tasks. Ferret-UI excels not only beyond most open-source UI MLLMs, but also surpasses GPT-4V on all the elementary UI tasks.
Submitted 8 April, 2024;
originally announced April 2024.
-
Keyframer: Empowering Animation Design using Large Language Models
Authors:
Tiffany Tseng,
Ruijia Cheng,
Jeffrey Nichols
Abstract:
Large language models (LLMs) have the potential to impact a wide range of creative domains, but the application of LLMs to animation is underexplored and presents novel challenges such as how users might effectively describe motion in natural language. In this paper, we present Keyframer, a design tool for animating static images (SVGs) with natural language. Informed by interviews with professional animation designers and engineers, Keyframer supports exploration and refinement of animations through the combination of prompting and direct editing of generated output. The system also enables users to request design variants, supporting comparison and ideation. Through a user study with 13 participants, we contribute a characterization of user prompting strategies, including a taxonomy of semantic prompt types for describing motion and a 'decomposed' prompting style where users continually adapt their goals in response to generated output. We share how direct editing along with prompting enables iteration beyond one-shot prompting interfaces common in generative tools today. Through this work, we propose how LLMs might empower a range of audiences to engage with animation creation.
Submitted 8 February, 2024;
originally announced February 2024.
-
ILuvUI: Instruction-tuned LangUage-Vision modeling of UIs from Machine Conversations
Authors:
Yue Jiang,
Eldon Schoop,
Amanda Swearngin,
Jeffrey Nichols
Abstract:
Multimodal Vision-Language Models (VLMs) enable powerful applications from their fused understanding of images and language, but many perform poorly on UI tasks due to the lack of UI training data. In this paper, we adapt a recipe for generating paired text-image training data for VLMs to the UI domain by combining existing pixel-based methods with a Large Language Model (LLM). Unlike prior art, our method requires no human-provided annotations, and it can be applied to any dataset of UI screenshots. We generate a dataset of 335K conversational examples paired with UIs that cover Q&A, UI descriptions, and planning, and use it to fine-tune a conversational VLM for UI tasks. To assess the performance of our model, we benchmark it on UI element detection tasks, evaluate response quality, and showcase its applicability to multi-step UI navigation and planning.
Submitted 7 October, 2023;
originally announced October 2023.
-
AXNav: Replaying Accessibility Tests from Natural Language
Authors:
Maryam Taeb,
Amanda Swearngin,
Eldon Schoop,
Ruijia Cheng,
Yue Jiang,
Jeffrey Nichols
Abstract:
Developers and quality assurance testers often rely on manual testing to test accessibility features throughout the product lifecycle. Unfortunately, manual testing can be tedious, often has an overwhelming scope, and can be difficult to schedule amongst other development milestones. Recently, Large Language Models (LLMs) have been used for a variety of tasks including automation of UIs; however, to our knowledge, no one has yet explored their use in controlling assistive technologies for the purposes of supporting accessibility testing. In this paper, we explore the requirements of a natural language based accessibility testing workflow, starting with a formative study. From this we build a system that takes as input a manual accessibility test (e.g., "Search for a show in VoiceOver") and uses an LLM combined with pixel-based UI Understanding models to execute the test and produce a chaptered, navigable video. In each video, to help QA testers, we apply heuristics to detect and flag accessibility issues (e.g., Text size not increasing with Large Text enabled, VoiceOver navigation loops). We evaluate this system through a 10-participant user study with accessibility QA professionals who indicated that the tool would be very useful in their current work and performed tests similarly to how they would manually test the features. The study also reveals insights for future work on using LLMs for accessibility testing.
Submitted 4 March, 2024; v1 submitted 3 October, 2023;
originally announced October 2023.
-
Towards Automated Accessibility Report Generation for Mobile Apps
Authors:
Amanda Swearngin,
Jason Wu,
Xiaoyi Zhang,
Esteban Gomez,
Jen Coughenour,
Rachel Stukenborg,
Bhavya Garg,
Greg Hughes,
Adriana Hilliard,
Jeffrey P. Bigham,
Jeffrey Nichols
Abstract:
Many apps have basic accessibility issues, like missing labels or low contrast. Automated tools can help app developers catch basic issues, but can be laborious or require writing dedicated tests. We propose a system, motivated by a collaborative process with accessibility stakeholders at a large technology company, to generate whole app accessibility reports by combining varied data collection methods (e.g., app crawling, manual recording) with an existing accessibility scanner. Many such scanners are based on single-screen scanning, and a key problem in whole app accessibility reporting is to effectively de-duplicate and summarize issues collected across an app. To this end, we developed a screen grouping model with 96.9% accuracy (88.8% F1-score) and UI element matching heuristics with 97% accuracy (98.2% F1-score). We combine these technologies in a system to report and summarize unique issues across an app, and enable a unique pixel-based ignore feature to help engineers and testers better manage reported issues across their app's lifetime. We conducted a qualitative evaluation with 18 accessibility-focused engineers and testers which showed this system can enhance their existing accessibility testing toolkit and address key limitations in current accessibility scanning tools.
Submitted 16 October, 2023; v1 submitted 29 September, 2023;
originally announced October 2023.
-
AI ATAC 1: An Evaluation of Prominent Commercial Malware Detectors
Authors:
Robert A. Bridges,
Brian Weber,
Justin M. Beaver,
Jared M. Smith,
Miki E. Verma,
Savannah Norem,
Kevin Spakes,
Cory Watson,
Jeff A. Nichols,
Brian Jewell,
Michael D. Iannacone,
Chelsey Dunivan Stahl,
Kelly M. T. Huffer,
T. Sean Oesch
Abstract:
This work presents an evaluation of six prominent commercial endpoint malware detectors, a network malware detector, and a file-conviction algorithm from a cyber technology vendor. The evaluation was administered as the first of the Artificial Intelligence Applications to Autonomous Cybersecurity (AI ATAC) prize challenges, funded by / completed in service of the US Navy. The experiment employed 100K files (50/50% benign/malicious) with a stratified distribution of file types, including ~1K zero-day program executables (increasing experiment size by two orders of magnitude over previous work). We present an evaluation process of delivering a file to a fresh virtual machine donning the detection technology, waiting 90s to allow static detection, then executing the file and waiting another period for dynamic detection; this allows greater fidelity in the observational data than previous experiments, in particular, resource and time-to-detection statistics. To execute all 800K trials (100K files × 8 tools), a software framework was designed to choreograph the experiment into a completely automated, time-synced, and reproducible workflow with substantial parallelization. A cost-benefit model was configured to integrate the tools' recall, precision, time to detection, and resource requirements into a single comparable quantity by simulating costs of use. This provides a ranking methodology for cyber competitions and a lens through which to reason about the varied statistical viewpoints of the results. These statistical and cost-model results provide insights on the state of commercial malware detection.
Submitted 28 August, 2023;
originally announced August 2023.
-
Never-ending Learning of User Interfaces
Authors:
Jason Wu,
Rebecca Krosnick,
Eldon Schoop,
Amanda Swearngin,
Jeffrey P. Bigham,
Jeffrey Nichols
Abstract:
Machine learning models have been trained to predict semantic information about user interfaces (UIs) to make apps more accessible, easier to test, and to automate. Currently, most models rely on datasets that are collected and labeled by human crowd-workers, a process that is costly and surprisingly error-prone for certain tasks. For example, it is possible to guess if a UI element is "tappable" from a screenshot (i.e., based on visual signifiers) or from potentially unreliable metadata (e.g., a view hierarchy), but one way to know for certain is to programmatically tap the UI element and observe the effects. We built the Never-ending UI Learner, an app crawler that automatically installs real apps from a mobile app store and crawls them to discover new and challenging training examples to learn from. The Never-ending UI Learner has crawled for more than 5,000 device-hours, performing over half a million actions on 6,000 apps to train three computer vision models for i) tappability prediction, ii) draggability prediction, and iii) screen similarity.
Submitted 16 August, 2023;
originally announced August 2023.
-
Topological Deep Learning: A Review of an Emerging Paradigm
Authors:
Ali Zia,
Abdelwahed Khamis,
James Nichols,
Zeeshan Hayder,
Vivien Rolland,
Lars Petersson
Abstract:
Topological data analysis (TDA) provides insight into data shape. The summaries obtained by these methods are principled global descriptions of multi-dimensional data whilst exhibiting stable properties such as robustness to deformation and noise. Such properties are desirable in deep learning pipelines but they are typically obtained using non-TDA strategies. This is partly caused by the difficulty of combining TDA constructs (e.g. barcode and persistence diagrams) with current deep learning algorithms. Fortunately, we are now witnessing a growth of deep learning applications embracing topologically-guided components. In this survey, we review the nascent field of topological deep learning by first revisiting the core concepts of TDA. We then explore how the use of TDA techniques has evolved over time to support deep learning frameworks, and how they can be integrated into different aspects of deep learning. Furthermore, we touch on TDA usage for analyzing existing deep models (deep topological analytics). Finally, we discuss the challenges and future prospects of topological deep learning.
Submitted 7 February, 2023;
originally announced February 2023.
-
WebUI: A Dataset for Enhancing Visual UI Understanding with Web Semantics
Authors:
Jason Wu,
Siyan Wang,
Siman Shen,
Yi-Hao Peng,
Jeffrey Nichols,
Jeffrey P. Bigham
Abstract:
Modeling user interfaces (UIs) from visual information allows systems to make inferences about the functionality and semantics needed to support use cases in accessibility, app automation, and testing. Current datasets for training machine learning models are limited in size due to the costly and time-consuming process of manually collecting and annotating UIs. We crawled the web to construct WebUI, a large dataset of 400,000 rendered web pages associated with automatically extracted metadata. We analyze the composition of WebUI and show that while automatically extracted data is noisy, most examples meet basic criteria for visual UI modeling. We applied several strategies for incorporating semantics found in web pages to increase the performance of visual UI understanding models in the mobile domain, where less labeled data is available: (i) element detection, (ii) screen classification and (iii) screen similarity.
Submitted 30 January, 2023;
originally announced January 2023.
-
Screen Correspondence: Mapping Interchangeable Elements between UIs
Authors:
Jason Wu,
Amanda Swearngin,
Xiaoyi Zhang,
Jeffrey Nichols,
Jeffrey P. Bigham
Abstract:
Understanding user interface (UI) functionality is a useful yet challenging task for both machines and people. In this paper, we investigate a machine learning approach for screen correspondence, which allows reasoning about UIs by mapping their elements onto previously encountered examples with known functionality and properties. We describe and implement a model that incorporates element semantics, appearance, and text to support correspondence computation without requiring any labeled examples. Through a comprehensive performance evaluation, we show that our approach improves upon baselines by incorporating multi-modal properties of UIs. Finally, we show three example applications where screen correspondence facilitates better UI understanding for humans and machines: (i) instructional overlay generation, (ii) semantic UI element search, and (iii) automated interface testing.
Submitted 19 January, 2023;
originally announced January 2023.
-
Testing SOAR Tools in Use
Authors:
Robert A. Bridges,
Ashley E. Rice,
Sean Oesch,
Jeff A. Nichols,
Cory Watson,
Kevin Spakes,
Savannah Norem,
Mike Huettel,
Brian Jewell,
Brian Weber,
Connor Gannon,
Olivia Bizovi,
Samuel C Hollifield,
Samantha Erwin
Abstract:
Modern security operation centers (SOCs) rely on operators and a tapestry of logging and alerting tools with large scale collection and query abilities. SOC investigations are tedious as they rely on manual efforts to query diverse data sources, overlay related logs, and correlate the data into information and then document results in a ticketing system. Security orchestration, automation, and response (SOAR) tools are a new technology that promise to collect, filter, and display needed data; automate common tasks that require SOC analysts' time; facilitate SOC collaboration; and, improve both efficiency and consistency of SOCs. SOAR tools have never been tested in practice to evaluate their effect and understand them in use. In this paper, we design and administer the first hands-on user study of SOAR tools, involving 24 participants and 6 commercial SOAR tools. Our contributions include the experimental design, itemizing six characteristics of SOAR tools and a methodology for testing them. We describe configuration of the test environment in a cyber range, including network, user, and threat emulation; a full SOC tool suite; and creation of artifacts allowing multiple representative investigation scenarios to permit testing. We present the first research results on SOAR tools. We found that SOAR configuration is critical, as it involves creative design for data display and automation. We found that SOAR tools increased efficiency and reduced context switching during investigations, although ticket accuracy and completeness (indicating investigation quality) decreased with SOAR use. Our findings indicated that user preferences are slightly negatively correlated with their performance with the tool; overautomation was a concern of senior analysts, and SOAR tools that balanced automation with assisting a user to make decisions were preferred.
Submitted 14 February, 2023; v1 submitted 11 August, 2022;
originally announced August 2022.
-
Reflow: Automatically Improving Touch Interactions in Mobile Applications through Pixel-based Refinements
Authors:
Jason Wu,
Titus Barik,
Xiaoyi Zhang,
Colin Lea,
Jeffrey Nichols,
Jeffrey P. Bigham
Abstract:
Touch is the primary way that users interact with smartphones. However, building mobile user interfaces where touch interactions work well for all users is a difficult problem, because users have different abilities and preferences. We propose a system, Reflow, which automatically applies small, personalized UI adaptations, called refinements, to mobile app screens to improve touch efficiency. Reflow uses a pixel-based strategy to work with existing applications, and improves touch efficiency while minimally disrupting the design intent of the original application. Our system optimizes a UI by (i) extracting its layout from its screenshot, (ii) refining its layout, and (iii) re-rendering the UI to reflect these modifications. We conducted a user study with 10 participants and a heuristic evaluation with 6 experts and found that applications optimized by Reflow led to, on average, 9% faster selection time with minimal layout disruption. The results demonstrate that Reflow's refinements are useful UI adaptations to improve touch interactions.
Submitted 15 July, 2022;
originally announced July 2022.
-
Extracting Replayable Interactions from Videos of Mobile App Usage
Authors:
Jieshan Chen,
Amanda Swearngin,
Jason Wu,
Titus Barik,
Jeffrey Nichols,
Xiaoyi Zhang
Abstract:
Screen recordings of mobile apps are a popular and readily available way for users to share how they interact with apps, such as in online tutorial videos, user reviews, or as attachments in bug reports. Unfortunately, both people and systems can find it difficult to reproduce touch-driven interactions from video pixel data alone. In this paper, we introduce an approach to extract and replay user interactions in videos of mobile apps, using only pixel information in video frames. To identify interactions, we apply heuristic-based image processing and convolutional deep learning to segment screen recordings, classify the interaction in each segment, and locate the interaction point. To replay interactions on another device, we match elements on app screens using UI element detection. We evaluate the feasibility of our pixel-based approach using two datasets: the Rico mobile app dataset and a new dataset of 64 apps with both iOS and Android versions. We find that our end-to-end approach can successfully replay a majority of interactions (iOS--84.1%, Android--78.4%) on different devices, which is a step towards supporting a variety of scenarios, including automatically annotating interactions in existing videos, automated UI testing, and creating interactive app tutorials.
Submitted 8 July, 2022;
originally announced July 2022.
-
Assembling a Cyber Range to Evaluate Artificial Intelligence / Machine Learning (AI/ML) Security Tools
Authors:
Jeffrey A. Nichols,
Kevin D. Spakes,
Cory L. Watson,
Robert A. Bridges
Abstract:
In this case study, we describe the design and assembly of a cyber security testbed at Oak Ridge National Laboratory in Oak Ridge, TN, USA. The range is designed to provide agile reconfigurations to facilitate a wide variety of experiments for evaluations of cyber security tools -- particularly those involving AI/ML. In particular, the testbed provides realistic test environments while permitting control and programmatic observations/data collection during the experiments. We have designed in the ability to repeat the evaluations, so additional tools can be evaluated and compared at a later time. The system is one that can be scaled up or down for experiment sizes. At the time of the conference we will have completed two full-scale, national, government challenges on this range. These challenges are evaluating the performance and operating costs for AI/ML-based cyber security tools for application into large, government-sized networks. These evaluations will be described as examples providing motivation and context for various design decisions and adaptations we have made. The first challenge measured end-point security tools against 100K file samples (benignware and malware) chosen across a range of file types. The second is an evaluation of network intrusion detection systems' efficacy in identifying multi-step adversarial campaigns -- involving reconnaissance, penetration and exploitations, lateral movement, etc. -- with varying levels of covertness in a high-volume business network. The scale of each of these challenges requires automation systems to repeat, or simultaneously mirror, identical experiments for each ML tool under test. Providing an array of easy-to-difficult malicious activity for sussing out the true abilities of the AI/ML tools has been a particularly interesting and challenging aspect of designing and executing these challenge events.
Submitted 20 January, 2022;
originally announced January 2022.
-
Sketch-based Creativity Support Tools using Deep Learning
Authors:
Forrest Huang,
Eldon Schoop,
David Ha,
Jeffrey Nichols,
John Canny
Abstract:
Sketching is a natural and effective visual communication medium commonly used in creative processes. Recent developments in deep-learning models have drastically improved machines' ability to understand and generate visual content. An exciting area of development explores deep-learning approaches used to model human sketches, opening opportunities for creative applications. This chapter describes three fundamental steps in developing deep-learning-driven creativity support tools that consume and generate sketches: 1) a data collection effort that generated a new paired dataset between sketches and mobile user interfaces; 2) a sketch-based user interface retrieval system adapted from state-of-the-art computer vision techniques; and, 3) a conversational sketching system that supports the novel interaction of a natural-language-based sketch/critique authoring process. In this chapter, we survey relevant prior work in both the deep-learning and human-computer-interaction communities, document the data collection process and the systems' architectures in detail, present qualitative and quantitative results, and paint the landscape of several future research directions in this exciting area.
Submitted 18 November, 2021;
originally announced November 2021.
-
Screen Parsing: Towards Reverse Engineering of UI Models from Screenshots
Authors:
Jason Wu,
Xiaoyi Zhang,
Jeff Nichols,
Jeffrey P. Bigham
Abstract:
Automated understanding of user interfaces (UIs) from their pixels can improve accessibility, enable task automation, and facilitate interface design without relying on developers to comprehensively provide metadata. A first step is to infer what UI elements exist on a screen, but current approaches are limited in how they infer how those elements are semantically grouped into structured interface definitions. In this paper, we motivate the problem of screen parsing, the task of predicting UI elements and their relationships from a screenshot. We describe our implementation of screen parsing and provide an effective training procedure that optimizes its performance. In an evaluation comparing the accuracy of the generated output, we find that our implementation significantly outperforms current systems (up to 23%). Finally, we show three example applications that are facilitated by screen parsing: (i) UI similarity search, (ii) accessibility enhancement, and (iii) code generation from UI screenshots.
Submitted 17 September, 2021;
originally announced September 2021.
-
Screen Recognition: Creating Accessibility Metadata for Mobile Applications from Pixels
Authors:
Xiaoyi Zhang,
Lilian de Greef,
Amanda Swearngin,
Samuel White,
Kyle Murray,
Lisa Yu,
Qi Shan,
Jeffrey Nichols,
Jason Wu,
Chris Fleizach,
Aaron Everitt,
Jeffrey P. Bigham
Abstract:
Many accessibility features available on mobile platforms require applications (apps) to provide complete and accurate metadata describing user interface (UI) components. Unfortunately, many apps do not provide sufficient metadata for accessibility features to work as expected. In this paper, we explore inferring accessibility metadata for mobile apps from their pixels, as the visual interfaces often best reflect an app's full functionality. We trained a robust, fast, memory-efficient, on-device model to detect UI elements using a dataset of 77,637 screens (from 4,068 iPhone apps) that we collected and annotated. To further improve UI detections and add semantic information, we introduced heuristics (e.g., UI grouping and ordering) and additional models (e.g., recognize UI content, state, interactivity). We built Screen Recognition to generate accessibility metadata to augment iOS VoiceOver. In a study with 9 screen reader users, we validated that our approach improves the accessibility of existing mobile apps, enabling even previously inaccessible apps to be used.
Submitted 13 January, 2021;
originally announced January 2021.
-
Beyond the Hype: A Real-World Evaluation of the Impact and Cost of Machine Learning-Based Malware Detection
Authors:
Robert A. Bridges,
Sean Oesch,
Miki E. Verma,
Michael D. Iannacone,
Kelly M. T. Huffer,
Brian Jewell,
Jeff A. Nichols,
Brian Weber,
Justin M. Beaver,
Jared M. Smith,
Daniel Scofield,
Craig Miles,
Thomas Plummer,
Mark Daniell,
Anne M. Tall
Abstract:
In this paper, we present a scientific evaluation of four prominent malware detection tools to assist an organization with two primary questions: To what extent do ML-based tools accurately classify previously- and never-before-seen files? Is it worth purchasing a network-level malware detector? To identify weaknesses, we tested each tool against 3,536 total files (2,554 or 72% malicious, 982 or 28% benign) of a variety of file types, including hundreds of malicious zero-days, polyglots, and APT-style files, delivered on multiple protocols. We present statistical results on detection time and accuracy, consider complementary analysis (using multiple tools together), and provide two novel applications of the recent cost-benefit evaluation procedure of Iannacone & Bridges. While the ML-based tools are more effective at detecting zero-day files and executables, the signature-based tool may still be an overall better option. Both network-based tools provide substantial (simulated) savings when paired with either host tool, yet both show poor detection rates on protocols other than HTTP or SMTP. Our results show that all four tools have near-perfect precision but alarmingly low recall, especially on file types other than executables and office files -- 37% of malware tested, including all polyglot files, were undetected. Priorities for researchers and takeaways for end users are given.
Submitted 17 August, 2022; v1 submitted 16 December, 2020;
originally announced December 2020.
-
Radon cumulative distribution transform subspace modeling for image classification
Authors:
Mohammad Shifat-E-Rabbi,
Xuwang Yin,
Abu Hasnat Mohammad Rubaiyat,
Shiying Li,
Soheil Kolouri,
Akram Aldroubi,
Jonathan M. Nichols,
Gustavo K. Rohde
Abstract:
We present a new supervised image classification method applicable to a broad class of image deformation models. The method makes use of the previously described Radon Cumulative Distribution Transform (R-CDT) for image data, whose mathematical properties are exploited to express the image data in a form that is more suitable for machine learning. While certain operations such as translation, scaling, and higher-order transformations are challenging to model in native image space, we show the R-CDT can capture some of these variations and thus render the associated image classification problems easier to solve. The method -- utilizing a nearest-subspace algorithm in R-CDT space -- is simple to implement, non-iterative, has no hyper-parameters to tune, is computationally efficient, label efficient, and provides competitive accuracies to state-of-the-art neural networks for many types of classification problems. In addition to the test accuracy performances, we show improvements (with respect to neural network-based methods) in terms of computational efficiency (it can be implemented without the use of GPUs), number of training samples needed for training, as well as out-of-distribution generalization. The Python code for reproducing our results is available at https://github.com/rohdelab/rcdt_ns_classifier.
Submitted 2 March, 2022; v1 submitted 7 April, 2020;
originally announced April 2020.
-
A Computational Method for Evaluating UI Patterns
Authors:
Bardia Doosti,
Tao Dong,
Biplab Deka,
Jeffrey Nichols
Abstract:
UI design languages, such as Google's Material Design, make applications both easier to develop and easier to learn by providing a set of standard UI components. Nonetheless, it is hard to assess the impact of design languages in the wild. Moreover, designers often get stranded by strongly opinionated debates around the merit of certain UI components, such as the Floating Action Button and the Navigation Drawer. To address these challenges, this short paper introduces a method for measuring the impact of design languages and informing design debates through analyzing a dataset consisting of view hierarchies, screenshots, and app metadata for more than 9,000 mobile apps. Our data analysis shows that use of Material Design is positively correlated with app ratings, and to some extent, also with the number of installs. Furthermore, we show that use of UI components varies by app category, suggesting that a more nuanced view is needed in design debates.
Submitted 11 July, 2018;
originally announced July 2018.
-
Towards Malware Detection via CPU Power Consumption: Data Collection Design and Analytics (Extended Version)
Authors:
Robert Bridges,
Jarilyn Hernandez Jimenez,
Jeffrey Nichols,
Katerina Goseva-Popstojanova,
Stacy Prowell
Abstract:
This paper presents an experimental design and data analytics approach aimed at power-based malware detection on general-purpose computers. Leveraging the fact that malware executions must consume power, we explore the postulate that malware can be accurately detected via power data analytics. Our experimental design and implementation allow for programmatic collection of CPU power profiles for fixed tasks during uninfected and infected states using five different rootkits. To characterize the power consumption profiles, we use both simple statistical and novel, sophisticated features. We test a one-class anomaly detection ensemble (that baselines non-infected power profiles) and several kernel-based SVM classifiers (that train on both uninfected and infected profiles) in detecting previously unseen malware and clean profiles. The anomaly detection system exhibits perfect detection when using all features and tasks, with a smaller false detection rate than the supervised classifiers. The primary contribution is the proof of concept that baselining power of fixed tasks can provide accurate detection of rootkits. Moreover, our treatment presents engineering hurdles needed for experimentation and allows analysis of each statistical feature individually. This work appears to be the first step towards a viable power-based detection capability for general-purpose computers, and presents next steps toward this goal.
Submitted 16 May, 2018;
originally announced May 2018.
-
Malware Detection on General-Purpose Computers Using Power Consumption Monitoring: A Proof of Concept and Case Study
Authors:
Jarilyn M. Hernández Jiménez,
Jeffrey A. Nichols,
Katerina Goseva-Popstojanova,
Stacy Prowell,
Robert A. Bridges
Abstract:
Malware detection is challenging when faced with automatically generated and polymorphic malware, as well as with rootkits, which are exceptionally hard to detect. In an attempt to contribute towards addressing these challenges, we conducted a proof of concept study that explored the use of power consumption for detection of malware presence in a general-purpose computer. The results of our experiments indicate that malware indeed leaves a signal on the power consumption of a general-purpose computer. Specifically, for the case study based on two different rootkits, the data collected at the +12V rails on the motherboard showed the most noticeable increment of the power consumption after the computer was infected. Our future work includes experimenting with more malware examples and workloads, and developing a data analytics approach for automatic malware detection based on power consumption.
Submitted 4 May, 2017;
originally announced May 2017.
-
Recommending Targeted Strangers from Whom to Solicit Information on Social Media
Authors:
Jalal Mahmud,
Michelle X. Zhou,
Nimrod Megiddo,
Jeffrey Nichols,
Clemens Drews
Abstract:
We present an intelligent, crowd-powered information collection system that automatically identifies and asks targeted strangers on Twitter for desired information (e.g., current wait time at a nightclub). Our work includes three parts. First, we identify a set of features that characterize one's willingness and readiness to respond based on their exhibited social behavior, including the content of their tweets and social interaction patterns. Second, we use the identified features to build a statistical model that predicts one's likelihood to respond to information solicitations. Third, we develop a recommendation algorithm that selects a set of targeted strangers using the probabilities computed by our statistical model with the goal to maximize the overall response rate. Our experiments, including several in the real world, demonstrate the effectiveness of our work.
Submitted 21 May, 2014;
originally announced May 2014.
-
Who Will Retweet This? Automatically Identifying and Engaging Strangers on Twitter to Spread Information
Authors:
Kyumin Lee,
Jalal Mahmud,
Jilin Chen,
Michelle Zhou,
Jeffrey Nichols
Abstract:
There has been much effort on studying how social media sites, such as Twitter, help propagate information in different situations, including spreading alerts and SOS messages in an emergency. However, existing work has not addressed how to actively identify and engage the right strangers at the right time on social media to help effectively propagate intended information within a desired time frame. To address this problem, we have developed two models: (i) a feature-based model that leverages people's exhibited social behavior, including the content of their tweets and social interactions, to characterize their willingness and readiness to propagate information on Twitter via the act of retweeting; and (ii) a wait-time model based on a user's previous retweeting wait times to predict her next retweeting time when asked. Based on these two models, we build a recommender system that predicts the likelihood of a stranger to retweet information when asked, within a specific time window, and recommends the top-N qualified strangers to engage with. Our experiments, including live studies in the real world, demonstrate the effectiveness of our work.
Submitted 12 July, 2014; v1 submitted 15 May, 2014;
originally announced May 2014.
-
Optimizing The Selection of Strangers To Answer Questions in Social Media
Authors:
Jalal Mahmud,
Michelle Zhou,
Nimrod Megiddo,
Jeffrey Nichols,
Clemens Drews
Abstract:
Millions of people express themselves on public social media, such as Twitter. Through their posts, these people may reveal themselves as potentially valuable sources of information. For example, real-time information about an event might be collected through asking questions of people who tweet about being at the event location. In this paper, we explore how to model and select users to target with questions so as to improve answering performance while managing the load on people who must be asked. We first present a feature-based model that leverages users' exhibited social behavior, including the content of their tweets and social interactions, to characterize their willingness and readiness to respond to questions on Twitter. We then use the model to predict the likelihood for people to answer questions. To support real-world information collection applications, we present an optimization-based approach that selects a proper set of strangers to answer questions while achieving a set of application-dependent objectives, such as achieving a desired number of answers and minimizing the number of questions to be sent. Our cross-validation experiments using multiple real-world data sets demonstrate the effectiveness of our work.
Submitted 8 April, 2014;
originally announced April 2014.
-
Home Location Identification of Twitter Users
Authors:
Jalal Mahmud,
Jeffrey Nichols,
Clemens Drews
Abstract:
We present a new algorithm for inferring the home location of Twitter users at different granularities, including city, state, time zone or geographic region, using the content of users' tweets and their tweeting behavior. Unlike existing approaches, our algorithm uses an ensemble of statistical and heuristic classifiers to predict locations and makes use of a geographic gazetteer dictionary to identify place-name entities. We find that a hierarchical classification approach, where time zone, state or geographic region is predicted first and city is predicted next, can improve prediction accuracy. We have also analyzed movement variations of Twitter users, built a classifier to predict whether a user was travelling in a certain period of time, and used that to further improve the location detection accuracy. Experimental evidence suggests that our algorithm works well in practice and outperforms the best existing algorithms for predicting the home location of Twitter users.
Submitted 7 March, 2014;
originally announced March 2014.
-
Why Are You More Engaged? Predicting Social Engagement from Word Use
Authors:
Jalal Mahmud,
Jilin Chen,
Jeffrey Nichols
Abstract:
We present a study to analyze how word use can predict social engagement behaviors such as replies and retweets in Twitter. We compute psycholinguistic category scores from word usage, and investigate how people with different scores exhibited different reply and retweet behaviors on Twitter. We also found psycholinguistic categories that show significant correlations with such social engagement behaviors. In addition, we have built predictive models of replies and retweets from such psycholinguistic-category-based features. Our experiments using a real-world dataset collected from Twitter validate that such predictions can be done with reasonable accuracy.
Submitted 26 February, 2014;
originally announced February 2014.
-
TGCat, The Chandra Transmission Grating Catalog and Archive: Systems, Design and Accessibility
Authors:
Arik W. Mitschang,
David P. Huenemoerder,
Joy S. Nichols
Abstract:
The recently released Chandra Transmission Grating Catalog and Archive, TGCat, presents a fully dynamic on-line catalog allowing users to browse and categorize Chandra gratings observations quickly and easily, generate custom plots of resulting response-corrected spectra on-line without the need for special software, and download analysis-ready products from multiple observations in one convenient operation. TGCat has been registered as a VO resource with the NVO, providing direct access to the catalog's interface. The catalog is supported by a back-end designed to automatically fetch newly public data, process, archive and catalog them, while at the same time utilizing an advanced queue system integrated into the archive's MySQL database, allowing large processing projects to take advantage of an unlimited number of CPUs across a network for rapid completion. A unique feature of the catalog is that all of the high level functions used to retrieve inputs from the Chandra archive and to generate the final data products are available to the user in an ISIS written library with detailed documentation. Here we present a structural overview of the Systems, Design, and Accessibility features of the catalog and archive.
Submitted 30 December, 2009;
originally announced January 2010.
-
Pushdown dimension
Authors:
David Doty,
Jared Nichols
Abstract:
This paper develops the theory of pushdown dimension and explores its relationship with finite-state dimension. Pushdown dimension is trivially bounded above by finite-state dimension for all sequences, since a pushdown gambler can simulate any finite-state gambler. We show that for every rational 0 < d < 1, there exists a sequence with finite-state dimension d whose pushdown dimension is at most d/2. This establishes a quantitative analogue of the well-known fact that pushdown automata decide strictly more languages than finite automata.
Submitted 26 May, 2005; v1 submitted 12 April, 2005;
originally announced April 2005.