-
Power in Numbers: Robust reading comprehension by finetuning with four adversarial sentences per example
Authors:
Ariel Marcus
Abstract:
Recent models have achieved human-level performance on the Stanford Question Answering Dataset when using F1 scores to evaluate the reading comprehension task. Yet, teaching machines to comprehend text has not been solved in the general case. Past research has shown that appending one adversarial sentence to the context paragraph cuts the F1 scores of reading comprehension models almost in half. In this paper, I replicate past adversarial research with a new model, ELECTRA-Small, and demonstrate that the new model's F1 score drops from 83.9% to 29.2%. To improve ELECTRA-Small's resistance to this attack, I finetune the model on SQuAD v1.1 training examples with one to five adversarial sentences appended to the context paragraph. Like past research, I find that the model finetuned on one adversarial sentence does not generalize well across evaluation datasets. However, when finetuned on four or five adversarial sentences, the model attains an F1 score of more than 70% on most evaluation datasets with multiple appended and prepended adversarial sentences. The results suggest that with enough examples we can make models robust to adversarial attacks.
Submitted 18 January, 2024;
originally announced January 2024.
-
On Using GUI Interaction Data to Improve Text Retrieval-based Bug Localization
Authors:
Junayed Mahmud,
Nadeeshan De Silva,
Safwat Ali Khan,
Seyed Hooman Mostafavi,
SM Hasan Mansur,
Oscar Chaparro,
Andrian Marcus,
Kevin Moran
Abstract:
One of the most important tasks related to managing bug reports is localizing the fault so that a fix can be applied. As such, prior work has aimed to automate this task of bug localization by formulating it as an information retrieval problem, where potentially buggy files are retrieved and ranked according to their textual similarity with a given bug report. However, there is often a notable semantic gap between the information contained in bug reports and identifiers or natural language contained within source code files. For user-facing software, there is currently a key source of information that could aid in bug localization, but has not been thoroughly investigated - information from the GUI.
We investigate the hypothesis that, for end user-facing applications, connecting information in a bug report with information from the GUI, and using this to aid in retrieving potentially buggy files, can improve upon existing techniques for bug localization. To examine this phenomenon, we conduct a comprehensive empirical study that augments four baseline techniques for bug localization with GUI interaction information from a reproduction scenario to (i) filter out potentially irrelevant files, (ii) boost potentially relevant files, and (iii) reformulate text-retrieval queries. To carry out our study, we source the current largest dataset of fully-localized and reproducible real bugs for Android apps, with corresponding bug reports, consisting of 80 bug reports from 39 popular open-source apps. Our results illustrate that augmenting traditional techniques with GUI information leads to a marked increase in effectiveness across multiple metrics, including a relative increase in Hits@10 of 13-18%. Additionally, through further analysis, we find that our studied augmentations largely complement existing techniques.
Submitted 12 October, 2023;
originally announced October 2023.
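The Hits@10 figure quoted in the abstract above is usually read as the share of bug reports for which at least one truly buggy file appears among the top 10 ranked files. The sketch below illustrates that common definition only; it is not taken from the paper, and the function name, bug IDs, and file names are hypothetical.

```python
def hits_at_k(rankings, relevant, k=10):
    """Fraction of queries with at least one relevant item in the top k.

    rankings: dict mapping a query id to its ranked list of file names.
    relevant: dict mapping a query id to the set of truly buggy files.
    """
    if not rankings:
        return 0.0
    hits = 0
    for query_id, ranked_files in rankings.items():
        buggy = relevant.get(query_id, set())
        # A query counts as a hit if any of its top-k files is buggy.
        if any(name in buggy for name in ranked_files[:k]):
            hits += 1
    return hits / len(rankings)


# Hypothetical rankings for two bug reports (illustrative data only).
rankings = {
    "bug-1": ["A.java", "B.java", "C.java"],
    "bug-2": ["D.java", "E.java", "F.java"],
}
relevant = {"bug-1": {"C.java"}, "bug-2": {"Z.java"}}

score_at_10 = hits_at_k(rankings, relevant, k=10)
score_at_2 = hits_at_k(rankings, relevant, k=2)
```

Under this definition, filtering out irrelevant files or boosting relevant ones can only raise the score by moving a buggy file into the top k for more reports.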
-
Concurrent ischemic lesion age estimation and segmentation of CT brain using a Transformer-based network
Authors:
Adam Marcus,
Paul Bentley,
Daniel Rueckert
Abstract:
The cornerstone of stroke care is expedient management that varies depending on the time since stroke onset. Consequently, clinical decision making is centered on accurate knowledge of timing and often requires a radiologist to interpret Computed Tomography (CT) of the brain to confirm the occurrence and age of an event. These tasks are particularly challenging due to the subtle expression of acute ischemic lesions and the dynamic nature of their appearance. Automation efforts have not yet applied deep learning to estimate lesion age and have treated these two tasks independently, thereby overlooking their inherent complementary relationship. To leverage this, we propose a novel end-to-end multi-task transformer-based network optimized for concurrent segmentation and age estimation of cerebral ischemic lesions. By utilizing gated positional self-attention and CT-specific data augmentation, the proposed method can capture long-range spatial dependencies while maintaining its ability to be trained from scratch under low-data regimes commonly found in medical imaging. Furthermore, to better combine multiple predictions, we incorporate uncertainty by utilizing quantile loss to facilitate estimating a probability density function of lesion age. The effectiveness of our model is then extensively evaluated on a clinical dataset consisting of 776 CT images from two medical centers. Experimental results demonstrate that our method obtains promising performance, with an area under the curve (AUC) of 0.933 for classifying lesion ages <=4.5 hours compared to 0.858 using a conventional approach, and outperforms task-specific state-of-the-art algorithms.
Submitted 21 June, 2023;
originally announced June 2023.
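The quantile loss mentioned in the abstract above is commonly written as the pinball loss, which penalizes under- and over-prediction asymmetrically so that a model learns a chosen quantile rather than the mean. The sketch below shows that standard formula for a single prediction; it is not the paper's implementation, and the function name and the lesion-age numbers are hypothetical.

```python
def pinball_loss(y_true, y_pred, q):
    """Pinball (quantile) loss for one prediction at quantile level q in (0, 1)."""
    diff = y_true - y_pred
    # Under-prediction (diff > 0) costs q * diff;
    # over-prediction (diff < 0) costs (1 - q) * |diff|.
    return max(q * diff, (q - 1) * diff)


# Hypothetical lesion age of 10 hours, evaluated at the 0.9 quantile.
under = pinball_loss(10.0, 8.0, 0.9)   # predicted too low
over = pinball_loss(10.0, 12.0, 0.9)   # predicted too high
```

At a high quantile level, under-prediction is penalized more heavily than over-prediction of the same size, and training one output per quantile level gives a set of points from which a distribution can be approximated.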
-
BURT: A Chatbot for Interactive Bug Reporting
Authors:
Yang Song,
Junayed Mahmud,
Nadeeshan De Silva,
Ying Zhou,
Oscar Chaparro,
Kevin Moran,
Andrian Marcus,
Denys Poshyvanyk
Abstract:
This paper introduces BURT, a web-based chatbot for interactive reporting of Android app bugs. BURT is designed to assist Android app end-users in reporting high-quality defect information using an interactive interface. BURT guides the users in reporting essential bug report elements, i.e., the observed behavior, expected behavior, and the steps to reproduce the bug. It verifies the quality of the text written by the user and provides instant feedback. In addition, BURT provides graphical suggestions that the users can choose as alternatives to textual descriptions. We empirically evaluated BURT, asking end-users to report bugs from six Android apps. The reporters found that BURT's guidance and automated suggestions and clarifications are useful and BURT is easy to use. BURT is an open-source tool, available at github.com/sea-lab-wm/burt/tree/tool-demo. A video showing the full capabilities of BURT can be found at https://youtu.be/SyfOXpHYGRo
Submitted 12 February, 2023;
originally announced February 2023.
-
Translating Video Recordings of Complex Mobile App UI Gestures into Replayable Scenarios
Authors:
Carlos Bernal-Cárdenas,
Nathan Cooper,
Madeleine Havranek,
Kevin Moran,
Oscar Chaparro,
Denys Poshyvanyk,
Andrian Marcus
Abstract:
Screen recordings of mobile applications are easy to obtain and capture a wealth of information pertinent to software developers (e.g., bugs or feature requests), making them a popular mechanism for crowdsourced app feedback. Thus, these videos are becoming a common artifact that developers must manage. In light of unique mobile development constraints, including swift release cycles and rapidly evolving platforms, automated techniques for analyzing all types of rich software artifacts provide benefit to mobile developers. Unfortunately, automatically analyzing screen recordings presents serious challenges, due to their graphical nature, compared to other types of (textual) artifacts. To address these challenges, this paper introduces V2S+, an automated approach for translating video recordings of Android app usages into replayable scenarios. V2S+ is based primarily on computer vision techniques and adapts recent solutions for object detection and image classification to detect and classify user gestures captured in a video, and convert these into a replayable test scenario. Given that V2S+ takes a computer vision-based approach, it is applicable to both hybrid and native Android applications. We performed an extensive evaluation of V2S+ involving 243 videos depicting 4,028 GUI-based actions collected from users exercising features and reproducing bugs from a collection of over 90 popular native and hybrid Android apps. Our results illustrate that V2S+ can accurately replay scenarios from screen recordings, and is capable of reproducing $\approx$ 90.2% of sequential actions recorded in native application scenarios on physical devices, and $\approx$ 83% of sequential actions recorded in hybrid application scenarios on emulators, both with low overhead. A case study with three industrial partners illustrates the potential usefulness of V2S+ from the viewpoint of developers.
Submitted 3 January, 2023;
originally announced January 2023.
-
Toward Interactive Bug Reporting for (Android App) End-Users
Authors:
Yang Song,
Junayed Mahmud,
Ying Zhou,
Oscar Chaparro,
Kevin Moran,
Andrian Marcus,
Denys Poshyvanyk
Abstract:
Many software bugs are reported manually, particularly bugs that manifest themselves visually in the user interface. End-users typically report these bugs via app reviewing websites, issue trackers, or in-app built-in bug reporting tools, if available. While these systems have various features that facilitate bug reporting (e.g., textual templates or forms), they often provide limited guidance, concrete feedback, or quality verification to end-users, who are often inexperienced at reporting bugs and submit low-quality bug reports that lead to excessive developer effort in bug report management tasks. We propose an interactive bug reporting system for end-users (Burt), implemented as a task-oriented chatbot. Unlike existing bug reporting systems, Burt provides guided reporting of essential bug report elements (i.e., the observed behavior, expected behavior, and steps to reproduce the bug), instant quality verification, and graphical suggestions for these elements. We implemented a version of Burt for Android and conducted an empirical evaluation study with end-users, who reported 12 bugs from six Android apps studied in prior work. The reporters found that Burt's guidance and automated suggestions/clarifications are useful and Burt is easy to use. We found that Burt reports contain higher-quality information than reports collected via a template-based bug reporting system. Improvements to Burt, informed by the reporters, include support for various wordings to describe bug report elements and improved quality verification. Our work marks an important paradigm shift from static to interactive bug reporting for end-users.
Submitted 20 September, 2022;
originally announced September 2022.
-
Existence and polynomial time construction of biregular, bipartite Ramanujan graphs of all degrees
Authors:
Aurelien Gribinski,
Adam W. Marcus
Abstract:
We prove that there exist bipartite, biregular Ramanujan graphs of every degree and every number of vertices provided that the cardinalities of the two sets of the bipartition divide each other. This generalizes a result of Marcus, Spielman, and Srivastava and, similar to theirs, the proof is based on the analysis of expected polynomials. The primary difference is the use of some new machinery involving rectangular convolutions, developed in a companion paper. We also prove the constructibility of such graphs in polynomial time in the number of vertices, extending a result of Cohen to this biregular case.
Submitted 5 August, 2021;
originally announced August 2021.
-
An Empirical Study of Data Constraint Implementations in Java
Authors:
Juan Manuel Florez,
Laura Moreno,
Zenong Zhang,
Shiyi Wei,
Andrian Marcus
Abstract:
Software systems are designed according to guidelines and constraints defined by business rules. Some of these constraints define the allowable or required values for data handled by the systems. These data constraints usually originate from the problem domain (e.g., regulations), and developers must write code that enforces them. Understanding how data constraints are implemented is essential for testing, debugging, and software change. Unfortunately, there are no widely-accepted guidelines or best practices on how to implement data constraints.
This paper presents an empirical study that investigates how data constraints are implemented in Java. We study the implementation of 187 data constraints extracted from the documentation of eight real-world Java software systems. First, we perform a qualitative analysis of the textual description of data constraints and identify four data constraint types. Second, we manually identify the implementations of these data constraints and reveal that they can be grouped into 30 implementation patterns. The analysis of these implementation patterns indicates that developers prefer a handful of patterns when implementing data constraints and deviations from these patterns are associated with unusual implementation decisions or code smells. Third, we develop a tool-assisted protocol that allows us to identify 256 additional trace links for the data constraints implemented using the 13 most common patterns. We find that almost half of these data constraints have multiple enforcing statements, which are code clones of different types.
Submitted 9 July, 2021;
originally announced July 2021.
-
Toward Speeding up Mutation Analysis by Memoizing Expensive Methods
Authors:
Ali Ghanbari,
Andrian Marcus
Abstract:
Mutation analysis has many applications, such as assessing the quality of test cases, fault localization, test input generation, security analysis, etc. Such applications involve running a test suite against a large number of program mutants, leading to poor scalability. Much research has been aimed at speeding up this process, focusing on reducing the number of mutants, the number of executed tests, or the execution time of the mutants. This paper presents a novel approach, named MeMu, for reducing the execution time of the mutants by memoizing the most expensive methods in the system. Memoization is an optimization technique that allows bypassing the execution of expensive methods when repeated inputs are detected. MeMu can be used in conjunction with existing acceleration techniques. We implemented MeMu on top of PITest, a well-known JVM bytecode-level mutation analysis system, and obtained, on average, an 18.15% speed-up over PITest in the execution time of the mutants for 12 real-world programs. These promising results, and the fact that MeMu could also be used for other applications that involve repeated execution of tests (e.g., automatic program repair and regression testing), strongly support future research for improving its efficiency.
Submitted 23 February, 2021;
originally announced February 2021.
-
PRF: A Framework for Building Automatic Program Repair Prototypes for JVM-Based Languages
Authors:
Ali Ghanbari,
Andrian Marcus
Abstract:
PRF is a Java-based framework that allows researchers to build prototypes of test-based generate-and-validate automatic program repair techniques for JVM languages by simply extending it with their patch generation plugins. The framework also provides other useful components for constructing automatic program repair tools, e.g., a fault localization component that provides spectrum-based fault localization information at different levels of granularity, a configurable and safe patch validation component that is 11+X faster than vanilla testing, and a customizable post-processing component to generate fix reports. A demo video of PRF is available at https://bit.ly/3ehduSS.
Submitted 14 September, 2020;
originally announced September 2020.
-
Translating Video Recordings of Mobile App Usages into Replayable Scenarios
Authors:
Carlos Bernal-Cárdenas,
Nathan Cooper,
Kevin Moran,
Oscar Chaparro,
Andrian Marcus,
Denys Poshyvanyk
Abstract:
Screen recordings of mobile applications are easy to obtain and capture a wealth of information pertinent to software developers (e.g., bugs or feature requests), making them a popular mechanism for crowdsourced app feedback. Thus, these videos are becoming a common artifact that developers must manage. In light of unique mobile development constraints, including swift release cycles and rapidly evolving platforms, automated techniques for analyzing all types of rich software artifacts provide benefit to mobile developers. Unfortunately, automatically analyzing screen recordings presents serious challenges, due to their graphical nature, compared to other types of (textual) artifacts. To address these challenges, this paper introduces V2S, a lightweight, automated approach for translating video recordings of Android app usages into replayable scenarios. V2S is based primarily on computer vision techniques and adapts recent solutions for object detection and image classification to detect and classify user actions captured in a video, and convert these into a replayable test scenario. We performed an extensive evaluation of V2S involving 175 videos depicting 3,534 GUI-based actions collected from users exercising features and reproducing bugs from over 80 popular Android apps. Our results illustrate that V2S can accurately replay scenarios from screen recordings, and is capable of reproducing $\approx$ 89% of our collected videos with minimal overhead. A case study with three industrial partners illustrates the potential usefulness of V2S from the viewpoint of developers.
Submitted 18 May, 2020;
originally announced May 2020.
-
Assessing the Quality of the Steps to Reproduce in Bug Reports
Authors:
Oscar Chaparro,
Carlos Bernal-Cardenas,
Jing Lu,
Kevin Moran,
Andrian Marcus,
Massimiliano Di Penta,
Denys Poshyvanyk,
Vincent Ng
Abstract:
A major problem with user-written bug reports, indicated by developers and documented by researchers, is the (lack of high) quality of the reported steps to reproduce the bugs. Low-quality steps to reproduce lead to excessive manual effort spent on bug triage and resolution. This paper proposes Euler, an approach that automatically identifies and assesses the quality of the steps to reproduce in a bug report, providing feedback to the reporters, which they can use to improve the bug report. The feedback provided by Euler was assessed by external evaluators and the results indicate that Euler correctly identified 98% of the existing steps to reproduce and 58% of the missing ones, while 73% of its quality annotations are correct.
Submitted 17 June, 2019;
originally announced June 2019.
-
A Mobile Device Prototype Application for the Detection and Prediction of Node Faults in Wireless Sensor Networks
Authors:
Anthony Marcus,
Ionut Cardei,
Borko Furht,
Osman Salem,
Ahmed Mehaoua
Abstract:
Various implementations of wireless sensor networks (e.g., personal area and wireless body area networks) are prone to node and network failures due to such characteristics as limited node energy resources and hardware damage incurred from their surrounding environment (e.g., flooding, forest fires, a patient falling). This may jeopardize their reliability as early warning systems, monitoring systems for patients and athletes, and industrial and environmental observation networks. Following the current trend and widespread use of handheld mobile communication devices, we outline an application architecture designed to detect and predict faulty nodes in wireless sensor networks. Furthermore, we implement our design as a proof-of-concept prototype for Android-based smartphones, which may be extended to develop other applications used for monitoring networked wireless personal area and body sensors used in other capacities. We have conducted several preliminary experiments to demonstrate the use of our design, which is capable of monitoring networks of wireless sensor devices and predicting node faults based on several localized metrics. As attributes of such networks may change over time, any models generated when the application is initialized must be updated periodically such that the applied machine learning algorithm maintains high levels of both accuracy and precision. The application is designed to discover node faults and, once identified, alert the user so that appropriate action may be taken.
Submitted 21 January, 2014;
originally announced January 2014.
-
Human-powered Sorts and Joins
Authors:
Adam Marcus,
Eugene Wu,
David Karger,
Samuel Madden,
Robert Miller
Abstract:
Crowdsourcing markets like Amazon's Mechanical Turk (MTurk) make it possible to task people with small jobs, such as labeling images or looking up phone numbers, via a programmatic interface. MTurk tasks for processing datasets with humans are currently designed with significant reimplementation of common workflows and ad-hoc selection of parameters such as price to pay per task. We describe how we have integrated crowds into a declarative workflow engine called Qurk to reduce the burden on workflow designers. In this paper, we focus on how to use humans to compare items for sorting and joining data, two of the most common operations in DBMSs. We describe our basic query interface and the user interface of the tasks we post to MTurk. We also propose a number of optimizations, including task batching, replacing pairwise comparisons with numerical ratings, and pre-filtering tables before joining them, which dramatically reduce the overall cost of running sorts and joins on the crowd. In an experiment joining two sets of images, we reduce the overall cost from $67 in a naive implementation to about $3, without substantially affecting accuracy or latency. In an end-to-end experiment, we reduced cost by a factor of 14.5.
Submitted 30 September, 2011;
originally announced September 2011.
-
Entropy and set cardinality inequalities for partition-determined functions
Authors:
Mokshay Madiman,
Adam Marcus,
Prasad Tetali
Abstract:
A new notion of partition-determined functions is introduced, and several basic inequalities are developed for the entropy of such functions of independent random variables, as well as for cardinalities of compound sets obtained using these functions. Here a compound set means a set obtained by varying each argument of a function of several variables over a set associated with that argument, where all the sets are subsets of an appropriate algebraic structure so that the function is well defined. On the one hand, the entropy inequalities developed for partition-determined functions imply entropic analogues of general inequalities of Plünnecke-Ruzsa type. On the other hand, the cardinality inequalities developed for compound sets imply several inequalities for sumsets, including for instance a generalization of inequalities proved by Gyarmati, Matolcsi and Ruzsa (2010). We also provide partial progress towards a conjecture of Ruzsa (2007) for sumsets in nonabelian groups. All proofs are elementary and rely on properly developing certain information-theoretic inequalities.
Submitted 9 August, 2011; v1 submitted 30 December, 2008;
originally announced January 2009.