Learning Program Behavioral Models from Synthesized Input-Output Pairs

Authors: Tural Mammadov, Dietrich Klakow, Alexander Koller, Andreas Zeller

Abstract: We introduce Modelizer - a novel framework that, given a black-box program, learns a _model from its input/output behavior_ using _neural machine translation_. The resulting model _mocks_ the original program: Given an input, the model predicts the output that would have been produced by the program. However, the model is also _reversible_ - that is, the model can predict the input that would have produced a given output. Finally, the model is _differentiable_ and can be efficiently restricted to predict only a certain aspect of the program behavior. Modelizer uses _grammars_ to synthesize inputs and to parse the resulting outputs, allowing it to learn sequence-to-sequence associations between token streams. Other than input and output grammars, Modelizer only requires the ability to execute the program. The resulting models are _small_, requiring fewer than 6.3 million parameters for languages such as Markdown or HTML; and they are _accurate_, achieving up to 95.4% accuracy and a BLEU score of 0.98 with standard error 0.04 in mocking real-world applications. We foresee several _applications_ of these models, especially as the output of the program can be any aspect of program behavior. Besides mocking and predicting program behavior, the model can also synthesize inputs that are likely to produce a particular behavior, such as failures or coverage.

Submitted 11 July, 2024; originally announced July 2024.

Comments: 42 pages, 6 figures, 8 tables

MSC Class: 68T07 (Primary); 68N30 (Secondary); 68Q42

ACM Class: D.2.5; D.2.7; I.2.6; F.1.1; F.4.3

arXiv:2404.11223 [pdf, other]

AndroLog: Android Instrumentation and Code Coverage Analysis

Authors: Jordan Samhi, Andreas Zeller

Abstract: Dynamic analysis has emerged as a pivotal technique for testing Android apps, enabling the detection of bugs, malicious code, and vulnerabilities. A key metric in evaluating the efficacy of tools employed by both research and practitioner communities for this purpose is code coverage. Obtaining code coverage typically requires planting probes within apps to gather coverage data during runtime. Due to the general unavailability of source code to analysts, there is a necessity for instrumenting apps to insert these probes in black-box environments. However, the tools available for such instrumentation are limited in their reliability and require intrusive changes interfering with apps' functionalities. This paper introduces AndroLog, a novel tool developed on top of the Soot framework, designed to provide fine-grained coverage information at multiple levels, including classes, methods, statements, and Android components. In contrast to existing tools, AndroLog leaves the responsibility to test apps to analysts, and its motto is simplicity. As demonstrated in this paper, AndroLog can instrument up to 98% of recent Android apps, compared to 79% and 48% for the existing tools COSMO and ACVTool, respectively. AndroLog also stands out for its potential for future enhancements to increase granularity on demand. We make AndroLog available to the community and provide a video demonstration of AndroLog (see section 8).

Submitted 17 April, 2024; originally announced April 2024.

arXiv:2309.16618 [pdf, other]

doi 10.1145/3611643.3616308

Revisiting Neural Program Smoothing for Fuzzing

Authors: Maria-Irina Nicolae, Max Eisele, Andreas Zeller

Abstract: Testing with randomly generated inputs (fuzzing) has gained significant traction due to its capacity to expose program vulnerabilities automatically. Fuzz testing campaigns generate large amounts of data, making them ideal for the application of machine learning (ML). Neural program smoothing (NPS), a specific family of ML-guided fuzzers, aims to use a neural network as a smooth approximation of the program target for new test case generation. In this paper, we conduct the most extensive evaluation of NPS fuzzers against standard gray-box fuzzers (>11 CPU years and >5.5 GPU years), and make the following contributions: (1) We find that the original performance claims for NPS fuzzers do not hold; a gap we relate to fundamental, implementation, and experimental limitations of prior works. (2) We contribute the first in-depth analysis of the contribution of machine learning and gradient-based mutations in NPS. (3) We implement Neuzz++, which shows that addressing the practical limitations of NPS fuzzers improves performance, but that standard gray-box fuzzers almost always surpass NPS-based fuzzers. (4) As a consequence, we propose new guidelines targeted at benchmarking fuzzing based on machine learning, and present MLFuzz, a platform with GPU access for easy and reproducible evaluation of ML-based fuzzers. Neuzz++, MLFuzz, and all our data are public.

Submitted 28 September, 2023; originally announced September 2023.

Comments: Accepted as conference paper at ESEC/FSE 2023

arXiv:2307.05147 [pdf, other]

Tests4Py: A Benchmark for System Testing

Authors: Marius Smytzek, Martin Eberlein, Batuhan Serce, Lars Grunske, Andreas Zeller

Abstract: Benchmarks are among the main drivers of progress in software engineering research. However, many current benchmarks are limited by inadequate system oracles and sparse unit tests. Our Tests4Py benchmark, derived from the BugsInPy benchmark, addresses these limitations. It includes 73 bugs from seven real-world Python applications and six bugs from example programs. Each subject in Tests4Py is equipped with an oracle for verifying functional correctness and supports both system and unit test generation. This allows for comprehensive qualitative studies and extensive evaluations, making Tests4Py a cutting-edge benchmark for research in test generation, debugging, and automatic program repair.

Submitted 14 May, 2024; v1 submitted 11 July, 2023; originally announced July 2023.

Comments: 5 pages, 4 figures

ACM Class: D.2.5; D.2.13

arXiv:2306.13331 [pdf, other]

Energy-optimal control of adaptive structures

Authors: Manuel Schaller, Amelie Zeller, Michael Böhm, Oliver Sawodny, Cristina Tarín, Karl Worthmann

Abstract: Adaptive structures are equipped with sensors and actuators to actively counteract external loads such as wind. This can significantly reduce resource consumption and emissions during the life cycle compared to conventional structures. A common approach for active damping is to derive a port-Hamiltonian model and to employ linear-quadratic control. However, the quadratic control penalization lacks physical interpretation and merely serves as a regularization term. Rather, we propose a controller, which achieves the goal of vibration damping while acting energy-optimal. Leveraging the port-Hamiltonian structure, we show that the optimal control is uniquely determined, even on singular arcs. Further, we prove a stable long-time behavior of optimal trajectories by means of a turnpike property. Last, the proposed controller's efficiency is evaluated in a numerical study.

Submitted 8 December, 2023; v1 submitted 23 June, 2023; originally announced June 2023.

Comments: 15 pages, 4 figures

arXiv:2305.19384 [pdf]

User Driven Functionality Deletion for Mobile Apps

Authors: Maleknaz Nayebi, Konstantin Kuznetsov, Andreas Zeller, Guenther Ruhe

Abstract: Evolving software with an increasing number of features is harder to understand and thus harder to use. Software release planning has been concerned with planning these additions. Moreover, software of increasing size takes more effort to be maintained. In the domain of mobile apps, too much functionality can easily impact usability, maintainability, and resource consumption. Hence, it is important to understand the extent to which the law of continuous growth applies to mobile apps. Previous work showed that the deletion of functionality is common and sometimes driven by user reviews. However, it is not known if these deletions are visible or important to the app users. In this study, we performed a survey study with 297 mobile app users to understand the significance of functionality deletion for them. Our results showed that for the majority of users, the deletion of features corresponds with negative sentiments, a change in usage, and even churn. Motivated by these preliminary results, we propose RADIATION, which takes user reviews as input and recommends whether any functionality should be deleted from an app's User Interface (UI). We evaluate RADIATION using historical data and surveying developers' opinions. From the analysis of 190,062 reviews from 115 randomly selected apps, we show that RADIATION can recommend functionality deletion with an average F-Score of 74% if sufficiently many negative user reviews suggest so.

Submitted 30 May, 2023; originally announced May 2023.

Comments: The paper is accepted to RE 2023 research track

Journal ref: IEEE International Conference on Requirements Engineering, 2023

arXiv:2212.03075 [pdf, other]

Systematic Assessment of Fuzzers using Mutation Analysis

Authors: Philipp Görz, Björn Mathis, Keno Hassler, Emre Güler, Thorsten Holz, Andreas Zeller, Rahul Gopinath

Abstract: Fuzzing is an important method to discover vulnerabilities in programs. Despite considerable progress in this area in the past years, measuring and comparing the effectiveness of fuzzers is still an open research question. In software testing, the gold standard for evaluating test quality is mutation analysis, which evaluates a test's ability to detect synthetic bugs: If a set of tests fails to detect such mutations, it is expected to also fail to detect real bugs. Mutation analysis subsumes various coverage measures and provides a large and diverse set of faults that can be arbitrarily hard to trigger and detect, thus preventing the problems of saturation and overfitting. Unfortunately, the cost of traditional mutation analysis is exorbitant for fuzzing, as mutations need independent evaluation. In this paper, we apply modern mutation analysis techniques that pool multiple mutations and allow us -- for the first time -- to evaluate and compare fuzzers with mutation analysis. We introduce an evaluation bench for fuzzers and apply it to a number of popular fuzzers and subjects. In a comprehensive evaluation, we show how we can use it to assess fuzzer performance and measure the impact of improved techniques. The required CPU time remains manageable: 4.09 CPU years are needed to analyze a fuzzer on seven subjects and a total of 141,278 mutations. We find that today's fuzzers can detect only a small percentage of mutations, which should be seen as a challenge for future research -- notably in improving (1) the detection of failures beyond generic crashes and (2) the triggering of mutations (and thus faults).

Submitted 25 July, 2023; v1 submitted 6 December, 2022; originally announced December 2022.

Comments: 13 pages, 4 figures

ACM Class: D.2.5; D.4.6

arXiv:2208.12049 [pdf, ps, other]

Electronic Appendix to "Input Invariants"

Authors: Dominic Steinhöfel, Andreas Zeller

Abstract: In this electronic appendix to our paper "Input Invariants," accepted at ESEC/FSE'22, we provide additional examples, formal definitions, theorems, and proof sketches to complement our paper. Furthermore, we show the invariants that ISLearn mined in our evaluation.

Submitted 25 August, 2022; originally announced August 2022.

Comments: 9 pages. The main paper "Input Invariants" appeared at ESEC/FSE 2022

ACM Class: D.3.3; D.2.5; F.4.2; F.3.1

arXiv:2208.08235 [pdf, other]

Input Repair via Synthesis and Lightweight Error Feedback

Authors: Lukas Kirschner, Ezekiel Soremekun, Rahul Gopinath, Andreas Zeller

Abstract: Oftentimes, input data may ostensibly conform to a given input format, but cannot be parsed by a conforming program, for instance, due to human error or data corruption. In such cases, a data engineer is tasked with input repair, i.e., she has to manually repair the corrupt data such that it follows a given format, and hence can be processed by the conforming program. Such manual repair can be time-consuming and error-prone. In particular, input repair is challenging without an input specification (e.g., input grammar) or program analysis. In this work, we show that incorporating lightweight failure feedback (e.g., input incompleteness) to parsers is sufficient to repair any corrupt input data with maximal closeness to the semantics of the input data. We propose an approach (called FSYNTH) that leverages lightweight error-feedback and input synthesis to repair invalid inputs. FSYNTH is grammar-agnostic, and it does not require program analysis. Given a conforming program, and any invalid input, FSYNTH provides a set of repairs prioritized by the distance of the repair from the original input. We evaluate FSYNTH on 806 (real-world) invalid inputs using four well-known input formats, namely INI, TinyC, SExp, and cJSON. In our evaluation, we found that FSYNTH recovers 91% of valid input data. FSYNTH is also highly effective and efficient in input repair: It repairs 77% of invalid inputs within four minutes. It is up to 35% more effective than DDMax, the previously best-known approach. Overall, our approach addresses several limitations of DDMax, both in terms of what it can repair, as well as in terms of the set of repairs offered.

Submitted 17 August, 2022; originally announced August 2022.

ACM Class: D.2

arXiv:2109.11277 [pdf, other]

FormatFuzzer: Effective Fuzzing of Binary File Formats

Authors: Rafael Dutra, Rahul Gopinath, Andreas Zeller

Abstract: Effective fuzzing of programs that process structured binary inputs, such as multimedia files, is a challenging task, since those programs expect a very specific input format. Existing fuzzers, however, are mostly format-agnostic, which makes them versatile, but also ineffective when a specific format is required. We present FormatFuzzer, a generator for format-specific fuzzers. FormatFuzzer takes as input a binary template (a format specification used by the 010 Editor) and compiles it into C++ code that acts as parser, mutator, and highly efficient generator of inputs conforming to the rules of the language. The resulting format-specific fuzzer can be used as a standalone producer or mutator in black-box settings, where no guidance from the program is available. In addition, by providing mutable decision seeds, it can be easily integrated with arbitrary format-agnostic fuzzers such as AFL to make them format-aware. In our evaluation on complex formats such as MP4 or ZIP, FormatFuzzer proved to be a highly effective producer of valid inputs that also detected previously unknown memory errors in ffmpeg and timidity.

Submitted 27 September, 2023; v1 submitted 23 September, 2021; originally announced September 2021.

Comments: ACM Transactions on Software Engineering and Methodology

arXiv:2105.03144 [pdf, other]

What do all these Buttons do? Statically Mining Android User Interfaces at Scale

Authors: Konstantin Kuznetsov, Chen Fu, Song Gao, David N. Jansen, Lijun Zhang, Andreas Zeller

Abstract: We introduce FRONTMATTER: a tool to automatically mine both user interface models and behavior of Android apps at a large scale with high precision. Given an app, FRONTMATTER statically extracts all declared screens, the user interface elements, their textual and graphical features, as well as Android APIs invoked by interacting with them. Executed on tens of thousands of real-world apps, FRONTMATTER opens the door for comprehensive mining of mobile user interfaces, jumpstarting empirical research at a large scale, addressing questions such as "How many travel apps require registration?", "Which apps do not follow accessibility guidelines?", "Does the user interface correspond to the description?", and many more. FRONTMATTER and the mined dataset are available under an open-source license.

Submitted 7 May, 2021; originally announced May 2021.

Comments: 12 pages, 1 figure, 2 tables

arXiv:2103.02959 [pdf]

Restoring Execution Environments of Jupyter Notebooks

Authors: Jiawei Wang, Li Li, Andreas Zeller

Abstract: More than ninety percent of published Jupyter notebooks do not state dependencies on external packages. This makes them non-executable and thus hinders reproducibility of scientific results. We present SnifferDog, an approach that 1) collects the APIs of Python packages and versions, creating a database of APIs; 2) analyzes notebooks to determine candidates for required packages and versions; and 3) checks which packages are required to make the notebook executable (and ideally, reproduce its stored results). In its evaluation, we show that SnifferDog precisely restores execution environments for the large majority of notebooks, making them immediately executable for end users.

Submitted 4 March, 2021; originally announced March 2021.

Comments: to be published in the 43rd ACM/IEEE International Conference on Software Engineering (ICSE 2021)

arXiv:2101.03008 [pdf, other]

Locating Faults with Program Slicing: An Empirical Analysis

Authors: Ezekiel Soremekun, Lukas Kirschner, Marcel Böhme, Andreas Zeller

Abstract: Statistical fault localization is an easily deployed technique for quickly determining candidates for faulty code locations. If a human programmer has to search for the fault beyond the top candidate locations, though, more traditional techniques of following dependencies along dynamic slices may be better suited. In a large study of 457 bugs (369 single faults and 88 multiple faults) in 46 open source C programs, we compare the effectiveness of statistical fault localization against dynamic slicing. For single faults, we find that dynamic slicing was eight percentage points more effective than the best performing statistical debugging formula; for 66% of the bugs, dynamic slicing finds the fault earlier than the best performing statistical debugging formula. In our evaluation, dynamic slicing is more effective for programs with a single fault, but statistical debugging performs better on multiple faults. Best results, however, are obtained by a hybrid approach: If programmers first examine at most the top five most suspicious locations from statistical debugging, and then switch to dynamic slices, on average, they will need to examine 15% (30 lines) of the code. These findings hold for the 18 most effective statistical debugging formulas and our results are independent of the number of faults (i.e. single or multiple faults) and error type (i.e. artificial or real errors).

Submitted 8 January, 2021; originally announced January 2021.
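
To make the notion of a "statistical debugging formula" concrete, here is a minimal sketch that ranks code lines by suspiciousness from per-test coverage. It uses Ochiai, a widely used formula, purely as an assumed example; the abstract does not say which formulas the study evaluated, and the names `ochiai` and `rank` are hypothetical.

```python
import math


def ochiai(failed_cov: int, passed_cov: int, total_failed: int) -> float:
    """Ochiai suspiciousness: ef / sqrt(total_failed * (ef + ep))."""
    denom = math.sqrt(total_failed * (failed_cov + passed_cov))
    return failed_cov / denom if denom else 0.0


def rank(coverage, outcomes):
    """Rank line numbers from most to least suspicious.

    coverage: one set of executed line numbers per test.
    outcomes: one bool per test, True if the test failed.
    """
    total_failed = sum(outcomes)
    lines = set().union(*coverage)
    scores = {}
    for line in lines:
        # ef / ep: failing / passing tests that executed this line.
        ef = sum(1 for cov, failed in zip(coverage, outcomes) if failed and line in cov)
        ep = sum(1 for cov, failed in zip(coverage, outcomes) if not failed and line in cov)
        scores[line] = ochiai(ef, ep, total_failed)
    # Highest score first; ties broken by line number.
    return sorted(scores, key=lambda line: (-scores[line], line))


# One failing test covers lines 1-3; two passing tests cover fewer lines.
ranking = rank([{1, 2, 3}, {1, 2}, {1}], [True, False, False])
```

A programmer following the hybrid strategy from the abstract would inspect only the first few entries of `ranking` before switching to dynamic slices.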

arXiv:2012.13516 [pdf, other]

Fuzzing with Fast Failure Feedback

Authors: Rahul Gopinath, Bachir Bendrissou, Björn Mathis, Andreas Zeller

Abstract: Fuzzing -- testing programs with random inputs -- has become the prime technique to detect bugs and vulnerabilities in programs. To generate inputs that cover new functionality, fuzzers require execution feedback from the program -- for instance, the coverage obtained by previous inputs, or the conditions that need to be resolved to cover new branches. If such execution feedback is not available, though, fuzzing can only rely on chance, which is ineffective. In this paper, we introduce a novel fuzzing technique that relies on failure feedback only -- that is, information on whether an input is valid or not, and if not, where the error occurred. Our bFuzzer tool enumerates byte after byte of the input space and tests the program until it finds valid prefixes, and continues exploration from these prefixes. Since no instrumentation or execution feedback is required, bFuzzer is language agnostic and the required tests execute very quickly. We evaluate our technique on five subjects, and show that bFuzzer is effective and efficient even in comparison to its white-box counterpart.

Submitted 25 December, 2020; originally announced December 2020.

Comments: 12 pages, 6 figures

ACM Class: D.4.6; D.2.5
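
The prefix-by-prefix exploration described in this abstract can be illustrated with a small sketch. This is not the bFuzzer implementation: the `check` oracle is a hypothetical stand-in for running the program under test on a toy language, and the sketch omits backtracking and the error-position feedback the abstract mentions.

```python
import random

ALPHABET = "()0123456789ab"  # toy input alphabet, chosen for this sketch


def check(candidate: str) -> str:
    """Hypothetical stand-in for running the program under test.

    Toy language: a run of digits wrapped in zero or more balanced
    parentheses, e.g. "7", "(42)", "((0))".
    Returns "complete", "incomplete" (a valid prefix), or "invalid".
    """
    i = 0
    while i < len(candidate) and candidate[i] == "(":
        i += 1
    depth = i
    j = i
    while j < len(candidate) and candidate[j] in "0123456789":
        j += 1
    if j == i:
        # No digit yet: acceptable only if the input simply ended here.
        return "incomplete" if i == len(candidate) else "invalid"
    k = j
    while k < len(candidate) and candidate[k] == ")":
        k += 1
    closed = k - j
    if k != len(candidate) or closed > depth:
        return "invalid"
    return "complete" if closed == depth else "incomplete"


def fuzz(oracle, alphabet, max_len=20, rng=None):
    """Grow an input one character at a time, using only failure feedback."""
    rng = rng or random.Random()
    prefix = ""
    while len(prefix) < max_len:
        candidates = list(alphabet)
        rng.shuffle(candidates)
        for ch in candidates:
            status = oracle(prefix + ch)
            if status == "complete":
                return prefix + ch
            if status == "incomplete":
                prefix += ch  # valid prefix found: continue from it
                break
        else:
            return None  # no character extends the prefix
    return None  # length budget exhausted
```

Calling `fuzz(check, ALPHABET, rng=random.Random(0))` returns either a string accepted by `check` or `None` if the length budget runs out; no coverage or instrumentation of the oracle is needed.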

arXiv:1912.05937 [pdf, other]

Inferring Input Grammars from Dynamic Control Flow

Authors: Rahul Gopinath, Björn Mathis, Andreas Zeller

Abstract: A program is characterized by its input model, and a formal input model can be of use in diverse areas including vulnerability analysis, reverse engineering, fuzzing and software testing, clone detection and refactoring. Unfortunately, input models for typical programs are often unavailable or out of date. While there exist algorithms that can mine the syntactical structure of program inputs, they either produce unwieldy and incomprehensible grammars, or require heuristics that target specific parsing patterns. In this paper, we present a general algorithm that takes a program and a small set of sample inputs and automatically infers a readable context-free grammar capturing the input language of the program. We infer the syntactic input structure only by observing access of input characters at different locations of the input parser. This works on all program-stack-based recursive descent input parsers, including PEG and parser combinators, and needs no program-specific heuristics. Our Mimid prototype produced accurate and readable grammars for a variety of evaluation subjects, including expr, URLparse, and microJSON.

Submitted 12 December, 2019; originally announced December 2019.

MSC Class: D.2.0; D.2.4; D.2.5; D.3.0

ACM Class: D.2.0; D.2.4; D.2.5; D.3.0

arXiv:1911.07707 [pdf, other]

Building Fast Fuzzers

Authors: Rahul Gopinath, Andreas Zeller

Abstract: Fuzzing is one of the key techniques for evaluating the robustness of programs against attacks. Fuzzing has to be effective in producing inputs that cover functionality and find vulnerabilities. But it also has to be efficient in producing such inputs quickly. Random fuzzers are very efficient, as they can quickly generate random inputs; but they are not very effective, as the large majority of inputs generated is syntactically invalid. Grammar-based fuzzers make use of a grammar (or another model for the input language) to produce syntactically correct inputs, and thus can quickly cover input space and associated functionality. Existing grammar-based fuzzers are surprisingly inefficient, though: Even the fastest grammar fuzzer Dharma still produces inputs about a thousand times slower than the fastest random fuzzer. So far, one can have an effective or an efficient fuzzer, but not both. In this paper, we describe how to build fast grammar fuzzers from the ground up, treating the problem of fuzzing from a programming language implementation perspective. Starting with a Python textbook approach, we adopt and adapt optimization techniques from functional programming and virtual machine implementation techniques together with other novel domain-specific optimizations in a step-by-step fashion. In our F1 prototype fuzzer, these improve production speed by a factor of 100--300 over the fastest grammar fuzzer Dharma. As F1 is even 5--8 times faster than a lexical random fuzzer, we can find bugs faster and test with much larger valid inputs than previously possible.

Submitted 18 November, 2019; originally announced November 2019.

Comments: 12 pages, 12 figures

ACM Class: D.4.6; D.2.5
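
As a rough picture of the unoptimized starting point this abstract alludes to, the sketch below is a plain recursive grammar-based generator. It is an illustration only, not the F1 code: the toy `GRAMMAR`, the `generate` function, and the depth-limiting rule are all assumptions made for this example.

```python
import random

# Toy grammar (an assumption for this sketch): sums of non-negative integers.
# Each nonterminal maps to a list of alternatives; each alternative is a
# list of symbols. Symbols not present as keys are terminals.
GRAMMAR = {
    "<expr>": [["<term>", "+", "<expr>"], ["<term>"]],
    "<term>": [["<digit>", "<term>"], ["<digit>"]],
    "<digit>": [[d] for d in "0123456789"],
}


def generate(symbol, grammar, rng, depth=0, max_depth=8):
    """Expand `symbol` into a string by random choice among alternatives."""
    if symbol not in grammar:
        return symbol  # terminal: emit as-is
    alternatives = grammar[symbol]
    if depth >= max_depth:
        # Past the depth budget, take the shortest alternative so that
        # expansion terminates (sufficient for this toy grammar).
        alternatives = [min(alternatives, key=len)]
    expansion = rng.choice(alternatives)
    return "".join(
        generate(s, grammar, rng, depth + 1, max_depth) for s in expansion
    )


sample = generate("<expr>", GRAMMAR, random.Random(1))
```

Every `sample` is syntactically valid by construction, which is the effectiveness argument in the abstract; the per-symbol dictionary lookups and recursive calls are the kind of overhead the described optimizations aim to remove.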

arXiv:1906.05234 [pdf]

Better Code, Better Sharing: On the Need of Analyzing Jupyter Notebooks

Authors: Jiawei Wang, Li Li, Andreas Zeller

Abstract: By bringing together code, text, and examples, Jupyter notebooks have become one of the most popular means to produce scientific results in a productive and reproducible way. As many of the notebook authors are experts in their scientific fields, but laymen with respect to software engineering, one may ask questions on the quality of notebooks and their code. In a preliminary study, we experimentally demonstrate that Jupyter notebooks are inundated with poor quality code, e.g., not respecting recommended coding practices, or containing unused variables and deprecated functions. Considering the educational nature of Jupyter notebooks, these poor coding practices as well as the lack of quality control might be propagated into the next generation of developers. Hence, we argue that there is a strong need to programmatically analyze Jupyter notebooks, calling on our community to pay more attention to the reliability of Jupyter notebooks.

Submitted 12 June, 2019; originally announced June 2019.

arXiv:1906.01463 [pdf, other]

Bridging the Gap between Unit Test Generation and System Test Generation

Authors: Alexander Kampmann, Andreas Zeller

Abstract: Common test generators fall into two categories. Generating test inputs at the unit level is fast, but can lead to false alarms when a function is called with inputs that would not occur in a system context. If a generated input at the system level causes a failure, this is a true alarm, as the input could also have come from the user or a third party; but system testing is much slower. In this paper, we introduce the concept of a test generation bridge, which joins the accuracy of system testing with the speed of unit testing. A test generation bridge allows combining an arbitrary system test generator with an arbitrary unit test generator. It does so by carving parameterized unit tests from system (test) executions. These unit tests run in a context recorded from the system test, but individual parameters are left free for the unit test generator to systematically explore. This allows symbolic test generators such as KLEE to operate on individual functions in the recorded system context. If the test generator detects a failure, we lift the failure-inducing parameter back to the system input; if the failure can be reproduced at the system level, it is reported as a true alarm. Our BASILISK prototype can extract and test units out of complex systems such as a Web/Python/SQLite/C stack; in its evaluation, it achieves a higher coverage than a state-of-the-art system test generator.

Submitted 4 June, 2019; originally announced June 2019.

Comments: this article supersedes arXiv:1812.07932

arXiv:1812.07932 [pdf, other]

Carving Parameterized Unit Tests

Authors: Alexander Kampmann, Andreas Zeller

Abstract: We present a method to automatically extract ("carve") parameterized unit tests from system executions. The unit tests execute the same functions as the system tests they are carved from, but can do so much faster as they call functions directly; furthermore, being parameterized, they can execute the functions with a large variety of randomly selected input values. If a unit-level test fails, we lift it to the system level to ensure the failure can be reproduced there. Our method thus allows focusing testing efforts on selected modules while still avoiding false alarms: In our experiments, running parameterized unit tests for individual functions was, on average, 30 times faster than running the system tests they were carved from.

Submitted 19 December, 2018; originally announced December 2018.

arXiv:1812.07525 [pdf, other]

Inputs from Hell: Generating Uncommon Inputs from Common Samples

Authors: Esteban Pavese, Ezekiel Soremekun, Nikolas Havrikov, Lars Grunske, Andreas Zeller

Abstract: Generating structured input files to test programs can be performed by techniques that produce them from a grammar that serves as the specification for syntactically correct input files. Two interesting scenarios then arise for effective testing. In the first scenario, software engineers would like to generate inputs that are as similar as possible to the inputs in common usage of the program, to test the reliability of the program. More interesting is the second scenario where inputs should be as dissimilar as possible from normal usage. This is useful for robustness testing and exploring yet uncovered behavior. To provide test cases for both scenarios, we leverage a context-free grammar to parse a set of sample input files that represent the program's common usage, and determine probabilities for individual grammar productions as they occur while parsing the inputs. Replicating these probabilities during grammar-based test input generation, we obtain inputs that are close to the samples. Inverting these probabilities yields inputs that are strongly dissimilar to common inputs, yet still valid with respect to the grammar. Our evaluation on three common input formats (JSON, JavaScript, CSS) shows the effectiveness of these approaches in obtaining instances from both sets of inputs.

Submitted 19 December, 2018; v1 submitted 18 December, 2018; originally announced December 2018.
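
The idea of replicating versus inverting production probabilities can be illustrated with a small sketch. This is not the authors' implementation: the toy grammar, the counts (which stand in for what parsing real samples would yield), the helper names, and in particular the inversion rule (weight each alternative by 1/p and renormalize) are all assumptions made for illustration.

```python
import random
from typing import Dict, List

# Hypothetical toy grammar: nonterminal -> list of alternatives (symbol lists).
# The first alternative of each rule is assumed to be non-recursive.
GRAMMAR: Dict[str, List[List[str]]] = {
    "<value>": [["<digit>"], ["[", "<value>", "]"]],
    "<digit>": [["0"], ["1"]],
}

# Assumed counts of how often each alternative was used while parsing samples.
# Every alternative is assumed to occur at least once (no zero counts).
COUNTS: Dict[str, List[int]] = {
    "<value>": [9, 1],
    "<digit>": [3, 1],
}


def to_probabilities(counts: Dict[str, List[int]]) -> Dict[str, List[float]]:
    """Turn per-alternative counts into per-rule probabilities."""
    return {nt: [c / sum(cs) for c in cs] for nt, cs in counts.items()}


def invert(probs: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """One possible inversion (an assumption): weight by 1/p, then renormalize."""
    inverted = {}
    for nt, ps in probs.items():
        weights = [1.0 / p for p in ps]
        total = sum(weights)
        inverted[nt] = [w / total for w in weights]
    return inverted


def generate(symbol: str, probs: Dict[str, List[float]], rng: random.Random,
             depth: int = 0, max_depth: int = 20) -> str:
    """Expand `symbol`, choosing alternatives according to `probs`."""
    if symbol not in GRAMMAR:
        return symbol  # terminal symbol
    alternatives = GRAMMAR[symbol]
    if depth >= max_depth:
        choice = alternatives[0]  # depth guard: take the non-recursive alternative
    else:
        choice = rng.choices(alternatives, weights=probs[symbol], k=1)[0]
    return "".join(generate(s, probs, rng, depth + 1, max_depth) for s in choice)


common = to_probabilities(COUNTS)   # favors what the samples contain
uncommon = invert(common)           # favors what the samples rarely contain
rng = random.Random(0)
print("common-style input:  ", generate("<value>", common, rng))
print("uncommon-style input:", generate("<value>", uncommon, rng))
```

Both generators draw from the same grammar, so every output remains syntactically valid; only the distribution over derivations changes.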

arXiv:1810.08289 [pdf, other]

Sample-Free Learning of Input Grammars for Comprehensive Software Fuzzing

Authors: Rahul Gopinath, Björn Mathis, Mathias Höschele, Alexander Kampmann, Andreas Zeller

Abstract: Generating valid test inputs for a program is much easier if one knows the input language. We present first successes for a technique that, given a program P without any input samples or models, learns an input grammar that represents the syntactically valid inputs for P -- a grammar which can then be used for highly effective test generation for P. To this end, we introduce a test generator targeted at input parsers that systematically explores parsing alternatives based on dynamic tracking of constraints; the resulting inputs go into a grammar learner producing a grammar that can then be used for fuzzing. In our evaluation on subjects such as JSON, URL, or Mathexpr, our PYGMALION prototype took only a few minutes to infer grammars and generate thousands of valid high-quality inputs.

Submitted 18 October, 2018; originally announced October 2018.

arXiv:1805.07248 [pdf, other]

Derivative-Free Optimization Algorithms based on Non-Commutative Maps

Authors: Jan Feiling, Amelie Zeller, Christian Ebenbauer

Abstract: A novel class of derivative-free optimization algorithms is developed. The main idea is to utilize certain non-commutative maps in order to approximate the gradient of the objective function. Convergence properties of the novel algorithms are established and simulation examples are presented.

Submitted 18 May, 2018; originally announced May 2018.

arXiv:1708.08731 [pdf, other]

Active Learning of Input Grammars

Authors: Matthias Höschele, Alexander Kampmann, Andreas Zeller

Abstract: Knowing the precise format of a program's input is a necessary prerequisite for systematic testing. Given a program and a small set of sample inputs, we (1) track the data flow of inputs to aggregate input fragments that share the same data flow through program execution into lexical and syntactic entities; (2) assign these entities names that are based on the associated variable and function identifiers; and (3) systematically generalize production rules by means of membership queries. As a result, we need only a minimal set of sample inputs to obtain human-readable context-free grammars that reflect valid input structure. In our evaluation on inputs like URLs, spreadsheets, or configuration files, our AUTOGRAM prototype obtains input grammars that are both accurate and very readable - and that can be directly fed into test generators for comprehensive automated testing.

Submitted 29 August, 2017; originally announced August 2017.

Comments: 12 pages

ACM Class: F.4.2; F.3.2; D.2.5

arXiv:1611.04426 [pdf, other]

Quantifying the Information Leak in Cache Attacks through Symbolic Execution

Authors: Sudipta Chattopadhyay, Moritz Beck, Ahmed Rezine, Andreas Zeller

Abstract: Cache timing attacks allow attackers to infer the properties of a secret execution by observing cache hits and misses. But how much information can actually leak through such attacks? For a given program, a cache model, and an input, our CHALICE framework leverages symbolic execution to compute the amount of information that can possibly leak through cache attacks. At the core of CHALICE is a novel approach to quantify information leak that can highlight critical cache side-channel leaks on arbitrary binary code. In our evaluation on real-world programs from OpenSSL and Linux GDK libraries, CHALICE effectively quantifies information leaks: For an AES-128 implementation on Linux, for instance, CHALICE finds that a cache attack can leak as much as 127 out of 128 bits of the encryption key.

Submitted 14 November, 2016; originally announced November 2016.

arXiv:1601.02976 [pdf, ps, other]

Higher regularity of the free boundary in the parabolic Signorini problem

Authors: Agnid Banerjee, Mariana Smit Vega Garcia, Andrew K. Zeller

Abstract: We show that the quotient of two caloric functions which vanish on a portion of an $H^{k+ α}$ regular slit is $H^{k+ α}$ at the slit, for $k \geq 2$. In the case $k=1$, we show that the quotient is in $H^{1+α}$ if the slit is assumed to be space-time $C^{1, α}$ regular. This can be thought of as a parabolic analogue of a recent important result in [DSS14a], whose ideas inspired us. As an application, we show that the free boundary near a regular point of the parabolic thin obstacle problem studied in [DGPT] with zero obstacle is $C^{\infty}$ regular in space and time.

Submitted 23 September, 2016; v1 submitted 12 January, 2016; originally announced January 2016.

Comments: Revised version, to appear in Calculus of Variations and Partial Differential Equations

arXiv:1512.09173 [pdf, other]

Boundedness and continuity of the time derivative in the parabolic Signorini problem

Authors: Arshak Petrosyan, Andrew Zeller

Abstract: We prove the boundedness of the time derivative in the parabolic Signorini problem, as well as establish its Hölder continuity at regular free boundary points.

Submitted 30 December, 2015; originally announced December 2015.

Comments: 8 pages, 1 figure

MSC Class: 35R35

arXiv:1407.5286 [pdf, other]

doi 10.1109/TSE.2015.2431688

Inferring Loop Invariants by Mutation, Dynamic Analysis, and Static Checking

Authors: Juan P. Galeotti, Carlo A. Furia, Eva May, Gordon Fraser, Andreas Zeller

Abstract: Verifiers that can prove programs correct against their full functional specification require, for programs with loops, additional annotations in the form of loop invariants---properties that hold for every iteration of a loop. We show that significant loop invariant candidates can be generated by systematically mutating postconditions; then, dynamic checking (based on automatically generated tests) weeds out invalid candidates, and static checking selects provably valid ones. We present a framework that automatically applies these techniques to support a program prover, paving the way for fully automatic verification without manually written loop invariants: Applied to 28 methods (including 39 different loops) from various java.util classes (occasionally modified to avoid using Java features not fully supported by the static checker), our DYNAMATE prototype automatically discharged 97% of all proof obligations, resulting in automatic complete correctness proofs of 25 out of the 28 methods---outperforming several state-of-the-art tools for fully automatic verification.

Submitted 5 February, 2016; v1 submitted 20 July, 2014; originally announced July 2014.

Comments: Only change in v4: rectified May's affiliation

Journal ref: IEEE Transactions on Software Engineering, 41(10):1019-1037, October 2015

arXiv:1403.1117 [pdf, other]

doi 10.1109/TSE.2014.2312918

Automated Fixing of Programs with Contracts

Authors: Yu Pei, Carlo A. Furia, Martin Nordio, Yi Wei, Bertrand Meyer, Andreas Zeller

Abstract: This paper describes AutoFix, an automatic debugging technique that can fix faults in general-purpose software. To provide high-quality fix suggestions and to enable automation of the whole debugging process, AutoFix relies on the presence of simple specification elements in the form of contracts (such as pre- and postconditions). Using contracts enhances the precision of dynamic analysis techniques for fault detection and localization, and for validating fixes. The only required user input to the AutoFix supporting tool is then a faulty program annotated with contracts; the tool produces a collection of validated fixes for the fault ranked according to an estimate of their suitability. In an extensive experimental evaluation, we applied AutoFix to over 200 faults in four code bases of different maturity and quality (of implementation and of contracts). AutoFix successfully fixed 42% of the faults, producing, in the majority of cases, corrections of quality comparable to those competent programmers would write; the computational resources used were modest, with an average time per fix below 20 minutes on commodity hardware. These figures compare favorably to the state of the art in automated program fixing, and demonstrate that the AutoFix approach is successfully applicable to reduce the debugging burden in real-world scenarios.

Submitted 25 April, 2014; v1 submitted 5 March, 2014; originally announced March 2014.

Comments: Minor changes after proofreading

Journal ref: IEEE Transactions on Software Engineering, 40(5):427-449. IEEE Computer Society, May 2014

arXiv:physics/0703232 [pdf, other]

doi 10.1007/s10751-007-9539-y

The cyclotron gas stopper project at the NSCL

Authors: C. Guenaut, G. Bollen, S. Chouhan, F. Marti, D. J. Morrissey, D. Lawton, J. Ottarson, G. K. Pang, S. Schwarz, B. M. Sherrill, M. Wada, A. F. Zeller

Abstract: Gas stopping is becoming the method of choice for converting beams of rare isotopes obtained via projectile fragmentation and in-flight separation into low-energy beams. These beams allow ISOL-type experiments, such as mass measurements with traps or laser spectroscopy, to be performed with projectile fragmentation products. Current gas stopper systems for high-energy beams are based on linear gas cells filled with 0.1-1 bar of helium. While already used successfully for experiments, it was found that space charge effects induced by the ionization of the helium atoms during the stopping process pose a limit on the maximum beam rate that can be used. Furthermore, the extraction time of stopped ions from these devices can exceed 100 ms, causing substantial decay losses for very short-lived isotopes. To avoid these limitations, a new type of gas stopper is being developed at the NSCL/MSU. The new system is based on a cyclotron-type magnet with a stopping chamber filled with helium buffer gas at low pressure. RF-guiding techniques are used to extract the ions. The space charge effects are considerably reduced by the large volume and by a separation between the stopping region and the region of highest ionization. Cyclotron gas stopper systems of different sizes and with different magnetic field strengths and field shapes are presently being investigated.

Submitted 26 March, 2007; originally announced March 2007.

Comments: Proceedings of the TCP06 conference, accepted in Hyp. Int

arXiv:cs/0309047 [pdf, ps, other]

Causes and Effects in Computer Programs

Authors: Andreas Zeller

Abstract: Debugging is commonly understood as finding and fixing the cause of a problem. But what does "cause" mean? How can we find causes? How can we prove that a cause is a cause--or even "the" cause? This paper defines common terms in debugging and highlights the principal techniques along with their capabilities and limitations.

Submitted 24 September, 2003; originally announced September 2003.

ACM Class: D.2.5

arXiv:cs/0012009 [pdf, ps, other]

Finding Failure Causes through Automated Testing

Authors: Holger Cleve, Andreas Zeller

Abstract: A program fails. Under which circumstances does this failure occur? One single algorithm, the delta debugging algorithm, suffices to determine these failure-inducing circumstances. Delta debugging tests a program systematically and automatically to isolate failure-inducing circumstances such as the program input, changes to the program code, or executed statements.

Submitted 14 December, 2000; originally announced December 2000.

ACM Class: D.2.5
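
As a rough illustration of how delta debugging narrows down failure-inducing circumstances, the following is a simplified sketch of a minimizing variant in the style of ddmin. It is not the code from the paper: the function name `ddmin`, the `fails` predicate, and the chunking details are assumptions chosen for brevity. The circumstances are modeled as a list, and `fails` is assumed to be a deterministic test that returns True when the failure occurs.

```python
from typing import Callable, List, Sequence


def ddmin(items: Sequence, fails: Callable[[List], bool]) -> List:
    """Shrink `items` to a smaller list on which `fails` still returns True."""
    current = list(items)
    assert fails(current), "the full input must trigger the failure"
    n = 2  # current granularity: roughly how many chunks to split into
    while len(current) >= 2:
        chunk = len(current) // n
        subsets = [current[i:i + chunk] for i in range(0, len(current), chunk)]
        reduced = False
        for i, subset in enumerate(subsets):
            # Everything except the i-th chunk.
            complement = [x for j, s in enumerate(subsets) if j != i for x in s]
            if fails(subset):
                # A single chunk suffices: continue with it at coarse granularity.
                current, n, reduced = subset, 2, True
                break
            if fails(complement):
                # The chunk is irrelevant: drop it and keep the granularity.
                current, n, reduced = complement, max(n - 1, 2), True
                break
        if not reduced:
            if n >= len(current):
                break  # already testing single elements; nothing more to remove
            n = min(len(current), n * 2)  # refine granularity
    return current


# Hypothetical failure: it occurs whenever both 3 and 7 are part of the input.
minimal = ddmin(list(range(10)), lambda s: 3 in s and 7 in s)
print(minimal)
```

The result is minimal in the sense that removing any single remaining element makes the failure disappear; it is not guaranteed to be the globally smallest failing input.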

arXiv:nucl-ex/9908013 [pdf]

Analysis of a Cyclotron Based 400 MeV/u Driver System for a Radioactive Beam Facility

Authors: F. Marti, R. C. York, H. Blosser, M. M. Gordon, D. Gorelov, T. Grimm, D. Johnson, P. Miller, E. Pozdeyev, J. Vincent, X. Wu, A. Zeller

Abstract: The creation of intense radioactive beams requires intense and energetic primary beams. A task force analysis of this subject recommended an acceleration system capable of 400 MeV/u uranium at 1 particle uA as an appropriate driver for such a facility. The driver system should be capable of accelerating lighter ions at higher intensity such that a constant final beam power (~100kW) is maintained. This document is a more detailed follow-on to the previous analysis of such a system incorporating a cyclotron. The proposed driver pre-acceleration system consists of an ion source, radio frequency quadrupole, and linac chain capable of producing a final energy of 30 MeV/u and a charge (Q) to mass (A) of Q/A ~1/3. This acceleration system would be followed by a Separated Sector Cyclotron with a final output energy of 400 MeV/u. This system provides a more cost-effective solution in terms of initial capital investment as well as of operation compared to a fully linac system with the same primary beam output parameters.

Submitted 20 August, 1999; originally announced August 1999.

Comments: 37 pages, 30 figures, 12 tables

Report number: MSUCL-1131

Showing 1–32 of 32 results for author: Zeller, A