-
Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter Encoders for Natural Language Understanding Systems
Authors:
Jack FitzGerald,
Shankar Ananthakrishnan,
Konstantine Arkoudas,
Davide Bernardi,
Abhishek Bhagia,
Claudio Delli Bovi,
Jin Cao,
Rakesh Chada,
Amit Chauhan,
Luoxin Chen,
Anurag Dwarakanath,
Satyam Dwivedi,
Turan Gojayev,
Karthik Gopalakrishnan,
Thomas Gueudre,
Dilek Hakkani-Tur,
Wael Hamza,
Jonathan Hueser,
Kevin Martin Jose,
Haidar Khan,
Beiye Liu,
Jianhua Lu,
Alessandro Manzotti,
Pradeep Natarajan,
Karolina Owczarzak,
et al. (16 additional authors not shown)
Abstract:
We present results from a large-scale experiment on pretraining encoders with non-embedding parameter counts ranging from 700M to 9.3B, their subsequent distillation into smaller models ranging from 17M to 170M parameters, and their application to the Natural Language Understanding (NLU) component of a virtual assistant system. Though we train using 70% spoken-form data, our teacher models perform comparably to XLM-R and mT5 when evaluated on the written-form Cross-lingual Natural Language Inference (XNLI) corpus. We perform a second stage of pretraining on our teacher models using in-domain data from our system, improving error rates by 3.86% relative for intent classification and 7.01% relative for slot filling. We find that even a 170M-parameter model distilled from our Stage 2 teacher model has 2.88% better intent classification and 7.69% better slot filling error rates when compared to the 2.3B-parameter teacher trained only on public data (Stage 1), emphasizing the importance of in-domain data for pretraining. When evaluated offline using labeled NLU data, our 17M-parameter Stage 2 distilled model outperforms XLM-R Base (85M params) and DistilBERT (42M params) by 4.23% and 6.14%, respectively. Finally, we present results from a full virtual assistant experimentation platform, where we find that models trained using our pretraining and distillation pipeline outperform models distilled from 85M-parameter teachers by 3.74% to 4.91% on an automatic measurement of full-system user dissatisfaction.
Submitted 15 June, 2022;
originally announced June 2022.
-
Metamorphic Testing of a Deep Learning based Forecaster
Authors:
Anurag Dwarakanath,
Manish Ahuja,
Sanjay Podder,
Silja Vinu,
Arijit Naskar,
Koushik MV
Abstract:
In this paper, we present the Metamorphic Testing of an in-use deep learning based forecasting application. The application looks at the past data of system characteristics (e.g. 'memory allocation') to predict outages in the future. We focus on two statistical / machine learning based components: a) detection of correlation between system characteristics and b) estimating the future value of a system characteristic using an LSTM (a deep learning architecture). In total, 19 Metamorphic Relations have been developed, and we provide proofs and algorithms where applicable. We evaluated our method in two settings. In the first, we executed the relations on the actual application and uncovered 8 previously unknown issues. In the second, we generated hypothetical bugs, through Mutation Testing, on a reference implementation of the LSTM based forecaster and found that 65.9% of the bugs were caught by the relations.
Submitted 13 July, 2019;
originally announced July 2019.
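As an illustration of what a metamorphic relation for a correlation-detection component might look like, the sketch below checks two properties that hold for the Pearson coefficient in general: invariance under a positive affine rescaling of one series, and invariance under applying the same permutation to both series. This is not taken from the paper; the function names, the synthetic `memory` and `cpu` series, and the tolerance are all assumptions made for the example.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def check_affine_invariance(xs, ys, scale=3.0, shift=10.0, tol=1e-9):
    """Relation: r(scale * x + shift, y) == r(x, y) whenever scale > 0."""
    source = pearson(xs, ys)
    follow_up = pearson([scale * x + shift for x in xs], ys)
    return abs(source - follow_up) <= tol


def check_permutation_invariance(xs, ys, seed=0, tol=1e-9):
    """Relation: reordering both series with the same permutation keeps r unchanged."""
    order = list(range(len(xs)))
    random.Random(seed).shuffle(order)
    source = pearson(xs, ys)
    follow_up = pearson([xs[i] for i in order], [ys[i] for i in order])
    return abs(source - follow_up) <= tol


# Hypothetical system characteristics, generated only for this example.
rng = random.Random(42)
memory = [rng.uniform(0, 100) for _ in range(50)]
cpu = [0.5 * m + rng.gauss(0, 5) for m in memory]

print(check_affine_invariance(memory, cpu))
print(check_permutation_invariance(memory, cpu))
```

Neither check needs to know the correct correlation value, which is the point of the technique: a violation of either relation signals an implementation bug without requiring a test oracle.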
-
Trustworthiness in Enterprise Crowdsourcing: a Taxonomy & evidence from data
Authors:
Anurag Dwarakanath,
Shrikanth N. C.,
Kumar Abhinav,
Alex Kass
Abstract:
In this paper we study the trustworthiness of the crowd for crowdsourced software development. Through the study of literature from various domains, we present the risks that impact the trustworthiness in an enterprise context. We survey known techniques to mitigate these risks. We also analyze key metrics from multiple years of empirical data of actual crowdsourced software development tasks from two leading vendors. We present the metrics around untrustworthy behavior and the performance of certain mitigation techniques. Our study and results can serve as guidelines for crowdsourced enterprise software development.
Submitted 25 September, 2018;
originally announced September 2018.
-
Machines that test Software like Humans
Authors:
Anurag Dwarakanath,
Neville Dubash,
Sanjay Podder
Abstract:
Automated software testing involves the execution of test scripts by a machine instead of running them manually. This significantly reduces the amount of manual time and effort needed and thus is of great interest to the software testing industry. Various tools have been developed to automate the testing of web applications (e.g. Selenium WebDriver); however, the practical adoption of test automation is still minuscule. This is due to the complexity of creating and maintaining automation scripts. The key problem with the existing methods is that the automation test scripts require certain implementation specifics of the Application Under Test (AUT) (e.g. the HTML code of a web element, or an image of a web element). On the other hand, if we look at the way manual testing is done, the tester interprets the textual test scripts and interacts with the AUT purely based on what they perceive visually through the GUI. In this paper, we present an approach to build a machine that can mimic human behavior for software testing using recent advances in Computer Vision. We also present four use-cases of how this approach can significantly advance the test automation space, making test automation simple enough to be adopted practically.
Submitted 25 September, 2018;
originally announced September 2018.
-
Minimum Number of Test Paths for Prime Path and other Structural Coverage Criteria
Authors:
Anurag Dwarakanath,
Aruna Jankiti
Abstract:
The software system under test can be modeled as a graph comprising a set of vertices (V) and a set of edges (E). Test Cases are Test Paths over the graph meeting a particular test criterion. In this paper, we present a method to achieve the minimum number of Test Paths needed to cover different structural coverage criteria. Our method can accommodate Prime Path, Edge-Pair, Simple & Complete Round Trip, Edge and Node coverage criteria. Our method obtains the optimal solution by transforming the graph into a flow graph and solving the minimum flow problem. We present an algorithm for the minimum flow problem that matches the best known solution complexity of $O(|V| |E|)$. Our method is evaluated through two sets of tests. In the first, we test against graphs representing actual software. In the second test, we create random graphs of varying complexity. In each test we measure the number of Test Paths, the length of Test Paths, the lower bound on the minimum number of Test Paths and the execution time.
Submitted 22 September, 2018;
originally announced September 2018.
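To make the Prime Path criterion concrete, the sketch below enumerates the prime paths of a small directed graph using the usual textbook definition: a simple path (no repeated node, except that the first and last may coincide) that is not a proper subpath of any other simple path. It only lists the test requirements and does not implement the paper's minimum-flow step for choosing Test Paths. The brute-force enumeration, the function names, and `example_graph` are assumptions made for this example.

```python
def all_nodes(graph):
    """Every node that appears as a source or a target in the adjacency dict."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    return sorted(nodes)


def simple_paths(graph):
    """Enumerate all simple paths; the first and last node may be equal (a cycle)."""
    paths = []
    stack = [(node,) for node in all_nodes(graph)]
    while stack:
        path = stack.pop()
        paths.append(path)
        # A closed cycle cannot be extended without repeating an inner node.
        if len(path) > 1 and path[0] == path[-1]:
            continue
        for nxt in graph.get(path[-1], ()):
            # Extend to an unvisited node, or close the cycle back to the start.
            if nxt not in path or nxt == path[0]:
                stack.append(path + (nxt,))
    return paths


def is_proper_subpath(small, big):
    """True if `small` occurs contiguously inside the strictly longer `big`."""
    if len(small) >= len(big):
        return False
    width = len(small)
    return any(big[i:i + width] == small for i in range(len(big) - width + 1))


def prime_paths(graph):
    """Simple paths that are not a proper subpath of any other simple path."""
    candidates = simple_paths(graph)
    return [
        p for p in candidates
        if not any(is_proper_subpath(p, q) for q in candidates)
    ]


# Hypothetical control-flow graph: node 1 is a loop header, 2 the loop body.
example_graph = {0: [1], 1: [2, 3], 2: [1], 3: []}

for path in sorted(prime_paths(example_graph)):
    print(path)
```

The exhaustive enumeration is exponential in the worst case, so it is suitable only for small illustrative graphs.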
-
Accelerating Test Automation through a Domain Specific Language
Authors:
Anurag Dwarakanath,
Dipin Era,
Aditya Priyadarshi,
Neville Dubash,
Sanjay Podder
Abstract:
Test automation involves the automatic execution of test scripts instead of running them manually. This significantly reduces the amount of manual effort needed and thus is of great interest to the software testing industry. There are two key problems in the existing tools and methods for test automation: a) creating an automation test script is essentially a code development task, which most testers are not trained in; and b) the automation test script is seldom readable, making maintenance an effort-intensive process. We present the Accelerating Test Automation Platform (ATAP), which is aimed at making test automation accessible to non-programmers. ATAP allows the creation of an automation test script through a domain specific language based on English. The English-like test scripts are automatically converted to machine executable code using Selenium WebDriver. ATAP's English-like test scripts are easy for non-programmers to author. The functional flow of an ATAP script is also easy to understand, which makes maintenance simpler (you can understand the flow of the test script when you revisit it many months later). ATAP has been built around the Eclipse ecosystem and has been used in a real-life testing project. We present the details of the implementation of ATAP and the results from its usage in practice.
Submitted 21 September, 2018;
originally announced September 2018.
-
Identifying Implementation Bugs in Machine Learning based Image Classifiers using Metamorphic Testing
Authors:
Anurag Dwarakanath,
Manish Ahuja,
Samarth Sikand,
Raghotham M. Rao,
R. P. Jagadeesh Chandra Bose,
Neville Dubash,
Sanjay Podder
Abstract:
We have recently witnessed tremendous success of Machine Learning (ML) in practical applications. Computer vision, speech recognition and language translation have all seen near-human-level performance. We expect that, in the near future, most business applications will have some form of ML. However, testing such applications is extremely challenging and would be very expensive if we follow today's methodologies. In this work, we present an articulation of the challenges in testing ML based applications. We then present our solution approach, based on the concept of Metamorphic Testing, which aims to identify implementation bugs in ML based image classifiers. We have developed metamorphic relations for an application based on Support Vector Machine and a Deep Learning based application. Empirical validation showed that our approach was able to catch 71% of the implementation bugs in the ML applications.
Submitted 16 August, 2018;
originally announced August 2018.
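One way to picture a metamorphic relation for a classifier is the sketch below. It uses a 1-nearest-neighbour classifier as a stand-in (not the SVM or deep learning applications studied in the paper) and checks that applying the same feature permutation to the training and test data leaves every prediction unchanged, which must hold for any distance that sums over features. The classifier, the function names, and the random integer "pixel" data are assumptions made for this example.

```python
import random


def predict_1nn(train_x, train_y, query):
    """Label of the training example closest to `query` (squared Euclidean distance)."""
    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    best = min(range(len(train_x)), key=lambda i: dist(train_x[i], query))
    return train_y[best]


def permute(vector, order):
    """Reorder the features of one example according to `order`."""
    return [vector[i] for i in order]


def feature_permutation_holds(train_x, train_y, test_x, seed=0):
    """Relation: permuting features consistently must not change any prediction."""
    order = list(range(len(train_x[0])))
    random.Random(seed).shuffle(order)

    source = [predict_1nn(train_x, train_y, q) for q in test_x]
    permuted_train = [permute(x, order) for x in train_x]
    follow_up = [
        predict_1nn(permuted_train, train_y, permute(q, order)) for q in test_x
    ]
    return source == follow_up


# Hypothetical data: 16 integer "pixel" features per example, 3 classes.
# Integer features keep the distance arithmetic exact, so the check is deterministic.
rng = random.Random(7)
train_x = [[rng.randint(0, 255) for _ in range(16)] for _ in range(30)]
train_y = [rng.randint(0, 2) for _ in range(30)]
test_x = [[rng.randint(0, 255) for _ in range(16)] for _ in range(10)]

print(feature_permutation_holds(train_x, train_y, test_x))
```

An implementation bug that treats features asymmetrically, for example an off-by-one that skips the last feature, could break this relation even though no ground-truth labels are consulted.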