Search | arXiv e-print repository

Got Root? A Linux Priv-Esc Benchmark

Abstract: Linux systems are integral to the infrastructure of modern computing environments, necessitating robust security measures to prevent unauthorized access. Privilege escalation attacks represent a significant threat, typically allowing attackers to elevate their privileges from an initial low-privilege account to the all-powerful root account. A benchmark set of vulnerable systems is of high impor… ▽ More Linux systems are integral to the infrastructure of modern computing environments, necessitating robust security measures to prevent unauthorized access. Privilege escalation attacks represent a significant threat, typically allowing attackers to elevate their privileges from an initial low-privilege account to the all-powerful root account. A benchmark set of vulnerable systems is of high importance to evaluate the effectiveness of privilege-escalation techniques performed by both humans and automated tooling. Analyzing their behavior allows defenders to better fortify their entrusted Linux systems and thus protect their infrastructure from potentially devastating attacks. To address this gap, we developed a comprehensive benchmark for Linux privilege escalation. It provides a standardized platform to evaluate and compare the performance of human and synthetic actors, e.g., hacking scripts or automated tooling. △ Less

Submitted 6 May, 2024; v1 submitted 3 May, 2024; originally announced May 2024.

Comments: arXiv admin note: text overlap with arXiv:2310.11409

arXiv:2402.15632 [pdf, other]

Statically Inferring Usage Bounds for Infrastructure as Code

Authors: Feitong Qiao, Aryana Mohammadi, Jürgen Cito, Mark Santolucito

Abstract: Infrastructure as Code (IaC) has enabled cloud customers to have more agility in creating and modifying complex deployments of cloud-provisioned resources. By writing a configuration in IaC languages such as CloudFormation, users can declaratively specify their infrastructure and CloudFormation will handle the creation of the resources. However, understanding the complexity of IaC deployments has… ▽ More Infrastructure as Code (IaC) has enabled cloud customers to have more agility in creating and modifying complex deployments of cloud-provisioned resources. By writing a configuration in IaC languages such as CloudFormation, users can declaratively specify their infrastructure and CloudFormation will handle the creation of the resources. However, understanding the complexity of IaC deployments has emerged as an unsolved issue. In particular, estimating the cost of an IaC deployment requires estimating the future usage and pricing models of every cloud resource in the deployment. Gaining transparency into predicted usage/costs is a leading challenge in cloud management. Existing work either relies on historical usage metrics to predict cost or on coarse-grain static analysis that ignores interactions between resources. Our key insight is that the topology of an IaC deployment imposes constraints on the usage of each resource, and we can formalize and automate the reasoning on constraints by using an SMT solver. This allows customers to have formal guarantees on the bounds of their cloud usage. We propose a tool for fine-grained static usage analysis that works by modeling the inter-resource interactions in an IaC deployment as a set of SMT constraints, and evaluate our tool on a benchmark of over 1000 real world IaC configurations. △ Less

Submitted 23 February, 2024; originally announced February 2024.

arXiv:2310.11409 [pdf, other]

LLMs as Hackers: Autonomous Linux Privilege Escalation Attacks

Authors: Andreas Happe, Aaron Kaplan, Juergen Cito

Abstract: Penetration testing, an essential component of software security testing, allows organizations to identify and remediate vulnerabilities in their systems, thus bolstering their defense mechanisms against cyberattacks. One recent advancement in the realm of penetration testing is the utilization of Language Models (LLMs). We explore the intersection of LLMs and penetration testing to gain insight i… ▽ More Penetration testing, an essential component of software security testing, allows organizations to identify and remediate vulnerabilities in their systems, thus bolstering their defense mechanisms against cyberattacks. One recent advancement in the realm of penetration testing is the utilization of Language Models (LLMs). We explore the intersection of LLMs and penetration testing to gain insight into their capabilities and challenges in the context of privilege escalation. We introduce a fully automated privilege-escalation tool designed for evaluating the efficacy of LLMs for (ethical) hacking, executing benchmarks using multiple LLMs, and investigating their respective results. Our results show that GPT-4-turbo is well suited to exploit vulnerabilities (33-83% of vulnerabilities). GPT-3.5-turbo can abuse 16-50% of vulnerabilities, while local models, such as Llama3, can only exploit between 0 and 33% of the vulnerabilities. We analyze the impact of different context sizes, in-context learning, optional high-level guidance mechanisms, and memory management techniques. We discuss challenging areas for LLMs, including maintaining focus during testing, coping with errors, and finally comparing LLMs with human hackers. The current version of the LLM-guided privilege-escalation prototype can be found at https://github.com/ipa-labs/hackingBuddyGPT. △ Less

Submitted 1 August, 2024; v1 submitted 17 October, 2023; originally announced October 2023.

Comments: 12 pages

arXiv:2308.07057 [pdf, other]

doi 10.1145/3611643.3613900

Understanding Hackers' Work: An Empirical Study of Offensive Security Practitioners

Authors: Andreas Happe, Jürgen Cito

Abstract: Offensive security-tests are a common way to pro-actively discover potential vulnerabilities. They are performed by specialists, often called penetration-testers or white-hat hackers. The chronic lack of available white-hat hackers prevents sufficient security test coverage of software. Research into automation tries to alleviate this problem by improving the efficiency of security testing. To ach… ▽ More Offensive security-tests are a common way to pro-actively discover potential vulnerabilities. They are performed by specialists, often called penetration-testers or white-hat hackers. The chronic lack of available white-hat hackers prevents sufficient security test coverage of software. Research into automation tries to alleviate this problem by improving the efficiency of security testing. To achieve this, researchers and tool builders need a solid understanding of how hackers work, their assumptions, and pain points. In this paper, we present a first data-driven exploratory qualitative study of twelve security professionals, their work and problems occurring therein. We perform a thematic analysis to gain insights into the execution of security assignments, hackers' thought processes and encountered challenges. This analysis allows us to conclude with recommendations for researchers and tool builders to increase the efficiency of their automation and identify novel areas for research. △ Less

Submitted 23 August, 2023; v1 submitted 14 August, 2023; originally announced August 2023.

arXiv:2308.00121 [pdf, other]

doi 10.1145/3611643.3613083

Getting pwn'd by AI: Penetration Testing with Large Language Models

Authors: Andreas Happe, Jürgen Cito

Abstract: The field of software security testing, more specifically penetration testing, is an activity that requires high levels of expertise and involves many manual testing and analysis steps. This paper explores the potential usage of large-language models, such as GPT3.5, to augment penetration testers with AI sparring partners. We explore the feasibility of supplementing penetration testers with AI mo… ▽ More The field of software security testing, more specifically penetration testing, is an activity that requires high levels of expertise and involves many manual testing and analysis steps. This paper explores the potential usage of large-language models, such as GPT3.5, to augment penetration testers with AI sparring partners. We explore the feasibility of supplementing penetration testers with AI models for two distinct use cases: high-level task planning for security testing assignments and low-level vulnerability hunting within a vulnerable virtual machine. For the latter, we implemented a closed-feedback loop between LLM-generated low-level actions with a vulnerable virtual machine (connected through SSH) and allowed the LLM to analyze the machine state for vulnerabilities and suggest concrete attack vectors which were automatically executed within the virtual machine. We discuss promising initial results, detail avenues for improvement, and close deliberating on the ethics of providing AI-based sparring partners. △ Less

Submitted 17 August, 2023; v1 submitted 24 July, 2023; originally announced August 2023.

arXiv:2304.09733 [pdf, ps, other]

An Exploratory Study of Ad Hoc Parsers in Python

Authors: Michael Schröder, Marc Goritschnig, Jürgen Cito

Abstract: Background: Ad hoc parsers are pieces of code that use common string functions like split, trim, or slice to effectively perform parsing. Whether it is handling command-line arguments, reading configuration files, parsing custom file formats, or any number of other minor string processing tasks, ad hoc parsing is ubiquitous -- yet poorly understood. Objective: This study aims to reveal the commo… ▽ More Background: Ad hoc parsers are pieces of code that use common string functions like split, trim, or slice to effectively perform parsing. Whether it is handling command-line arguments, reading configuration files, parsing custom file formats, or any number of other minor string processing tasks, ad hoc parsing is ubiquitous -- yet poorly understood. Objective: This study aims to reveal the common syntactic and semantic characteristics of ad hoc parsing code in real world Python projects. Our goal is to understand the nature of ad hoc parsers in order to inform future program analysis efforts in this area. Method: We plan to conduct an exploratory study based on large-scale mining of open-source Python repositories from GitHub. We will use program slicing to identify program fragments related to ad hoc parsing and analyze these parsers and their surrounding contexts across 9 research questions using 25 initial syntactic and semantic metrics. Beyond descriptive statistics, we will attempt to identify common parsing patterns by cluster analysis. △ Less

Submitted 19 April, 2023; originally announced April 2023.

Comments: 5 pages, accepted as a registered report for MSR 2023 with Continuity Acceptance (CA)

arXiv:2208.04351 [pdf, other]

Learning to Learn to Predict Performance Regressions in Production at Meta

Authors: Moritz Beller, Hongyu Li, Vivek Nair, Vijayaraghavan Murali, Imad Ahmad, Jürgen Cito, Drew Carlson, Ari Aye, Wes Dyer

Abstract: Catching and attributing code change-induced performance regressions in production is hard; predicting them beforehand, even harder. A primer on automatically learning to predict performance regressions in software, this article gives an account of the experiences we gained when researching and deploying an ML-based regression prediction pipeline at Meta. In this paper, we report on a comparative… ▽ More Catching and attributing code change-induced performance regressions in production is hard; predicting them beforehand, even harder. A primer on automatically learning to predict performance regressions in software, this article gives an account of the experiences we gained when researching and deploying an ML-based regression prediction pipeline at Meta. In this paper, we report on a comparative study with four ML models of increasing complexity, from (1) code-opaque, over (2) Bag of Words, (3) off-the-shelve Transformer-based, to (4) a bespoke Transformer-based model, coined SuperPerforator. Our investigation shows the inherent difficulty of the performance prediction problem, which is characterized by a large imbalance of benign onto regressing changes. Our results also call into question the general applicability of Transformer-based architectures for performance prediction: an off-the-shelve CodeBERT-based approach had surprisingly poor performance; our highly customized SuperPerforator architecture initially achieved prediction performance that was just on par with simpler Bag of Words models, and only outperformed them for down-stream use cases. This ability of SuperPerforator to transfer to an application with few learning examples afforded an opportunity to deploy it in practice at Meta: it can act as a pre-filter to sort out changes that are unlikely to introduce a regression, truncating the space of changes to search a regression in by up to 43%, a 45x improvement over a random baseline. To gain further insight into SuperPerforator, we explored it via a series of experiments computing counterfactual explanations. These highlight which parts of a code change the model deems important, thereby validating the learned black-box model. △ Less

Submitted 22 May, 2023; v1 submitted 8 August, 2022; originally announced August 2022.

arXiv:2202.01021 [pdf, other]

doi 10.1109/ICSE-NIER55298.2022.9793523

Grammars for Free: Toward Grammar Inference for Ad Hoc Parsers

Authors: Michael Schröder, Jürgen Cito

Abstract: Ad hoc parsers are everywhere: they appear any time a string is split, looped over, interpreted, transformed, or otherwise processed. Every ad hoc parser gives rise to a language: the possibly infinite set of input strings that the program accepts without going wrong. Any language can be described by a formal grammar: a finite set of rules that can generate all strings of that language. But progra… ▽ More Ad hoc parsers are everywhere: they appear any time a string is split, looped over, interpreted, transformed, or otherwise processed. Every ad hoc parser gives rise to a language: the possibly infinite set of input strings that the program accepts without going wrong. Any language can be described by a formal grammar: a finite set of rules that can generate all strings of that language. But programmers do not write grammars for ad hoc parsers -- even though they would be eminently useful. Grammars can serve as documentation, aid program comprehension, generate test inputs, and allow reasoning about language-theoretic security. We propose an automatic grammar inference system for ad hoc parsers that would enable all of these use cases, in addition to opening up new possibilities in mining software repositories and bi-directional parser synthesis. △ Less

Submitted 2 February, 2022; originally announced February 2022.

Journal ref: 2022 IEEE/ACM 44th International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER)

arXiv:2111.05711 [pdf, other]

Counterfactual Explanations for Models of Code

Authors: Jürgen Cito, Isil Dillig, Vijayaraghavan Murali, Satish Chandra

Abstract: Machine learning (ML) models play an increasingly prevalent role in many software engineering tasks. However, because most models are now powered by opaque deep neural networks, it can be difficult for developers to understand why the model came to a certain conclusion and how to act upon the model's prediction. Motivated by this problem, this paper explores counterfactual explanations for models… ▽ More Machine learning (ML) models play an increasingly prevalent role in many software engineering tasks. However, because most models are now powered by opaque deep neural networks, it can be difficult for developers to understand why the model came to a certain conclusion and how to act upon the model's prediction. Motivated by this problem, this paper explores counterfactual explanations for models of source code. Such counterfactual explanations constitute minimal changes to the source code under which the model "changes its mind". We integrate counterfactual explanation generation to models of source code in a real-world setting. We describe considerations that impact both the ability to find realistic and plausible counterfactual explanations, as well as the usefulness of such explanation to the user of the model. In a series of experiments we investigate the efficacy of our approach on three different models, each based on a BERT-like architecture operating over source code. △ Less

Submitted 10 November, 2021; originally announced November 2021.

Comments: 10 pages, 6 listings, 2 algorithms, 2 tables, 1 figure

arXiv:2105.02023 [pdf, other]

Interactive Static Software Performance Analysis in the IDE

Authors: Aaron Beigelbeck, Maurício Aniche, Jürgen Cito

Abstract: Detecting performance issues due to suboptimal code during the development process can be a daunting task, especially when it comes to localizing them after noticing performance degradation after deployment. Static analysis has the potential to provide early feedback on performance problems to developers without having to run profilers with expensive (and often unavailable) performance tests. We d… ▽ More Detecting performance issues due to suboptimal code during the development process can be a daunting task, especially when it comes to localizing them after noticing performance degradation after deployment. Static analysis has the potential to provide early feedback on performance problems to developers without having to run profilers with expensive (and often unavailable) performance tests. We develop a VSCode tool that integrates the static performance analysis results from Infer via code annotations and decorations (surfacing complexity analysis results in context) and side panel views showing details and overviews (enabling explainability of the results). Additionally, we design our system for interactivity to allow for more responsiveness to code changes as they happen. We evaluate the efficacy of our tool by measuring the overhead that the static performance analysis integration introduces in the development workflow. Further, we report on a case study that illustrates how our system can be used to reason about software performance in the context of a real performance bug in the ElasticSearch open-source project. Demo video: https://www.youtube.com/watch?v=-GqPb_YZMOs Repository: https://github.com/ipa-lab/vscode-infer-performance △ Less

Submitted 4 May, 2021; originally announced May 2021.

arXiv:2103.15787 [pdf, other]

Meeting in the notebook: a notebook-based environment for micro-submissions in data science collaborations

Authors: Micah J. Smith, Jürgen Cito, Kalyan Veeramachaneni

Abstract: Developers in data science and other domains frequently use computational notebooks to create exploratory analyses and prototype models. However, they often struggle to incorporate existing software engineering tooling into these notebook-based workflows, leading to fragile development processes. We introduce Assemblé, a new development environment for collaborative data science projects, in which… ▽ More Developers in data science and other domains frequently use computational notebooks to create exploratory analyses and prototype models. However, they often struggle to incorporate existing software engineering tooling into these notebook-based workflows, leading to fragile development processes. We introduce Assemblé, a new development environment for collaborative data science projects, in which promising code fragments of data science pipelines can be contributed as pull requests to an upstream repository entirely from within JupyterLab, abstracting away low-level version control tool usage. We describe the design and implementation of Assemblé and report on a user study of 23 data scientists. △ Less

Submitted 29 March, 2021; originally announced March 2021.

arXiv:2012.10206 [pdf, other]

doi 10.1007/s10664-021-10036-y

An Empirical Investigation of Command-Line Customization

Authors: Michael Schröder, Jürgen Cito

Abstract: The interactive command line, also known as the shell, is a prominent mechanism used extensively by a wide range of software professionals (engineers, system administrators, data scientists, etc.). Shell customizations can therefore provide insight into the tasks they repeatedly perform, how well the standard environment supports those tasks, and ways in which the environment could be productively… ▽ More The interactive command line, also known as the shell, is a prominent mechanism used extensively by a wide range of software professionals (engineers, system administrators, data scientists, etc.). Shell customizations can therefore provide insight into the tasks they repeatedly perform, how well the standard environment supports those tasks, and ways in which the environment could be productively extended or modified. To characterize the patterns and complexities of command-line customization, we mined the collective knowledge of command-line users by analyzing more than 2.2 million shell alias definitions found on GitHub. Shell aliases allow command-line users to customize their environment by defining arbitrarily complex command substitutions. Using inductive coding methods, we found three types of aliases that each enable a number of customization practices: Shortcuts (for nicknaming commands, abbreviating subcommands, and bookmarking locations), Modifications (for substituting commands, overriding defaults, colorizing output, and elevating privilege), and Scripts (for transforming data and chaining subcommands). We conjecture that identifying common customization practices can point to particular usability issues within command-line programs, and that a deeper understanding of these practices can support researchers and tool developers in designing better user experiences. In addition to our analysis, we provide an extensive reproducibility package in the form of a curated dataset together with well-documented computational notebooks enabling further knowledge discovery and a basis for learning approaches to improve command-line workflows. △ Less

Submitted 6 August, 2021; v1 submitted 18 December, 2020; originally announced December 2020.

Journal ref: Empir Software Eng 27, 30 (2022)

arXiv:2012.07816 [pdf, other]

doi 10.1145/3479575

Enabling Collaborative Data Science Development with the Ballet Framework

Authors: Micah J. Smith, Jürgen Cito, Kelvin Lu, Kalyan Veeramachaneni

Abstract: While the open-source software development model has led to successful large-scale collaborations in building software systems, data science projects are frequently developed by individuals or small teams. We describe challenges to scaling data science collaborations and present a conceptual framework and ML programming model to address them. We instantiate these ideas in Ballet, a lightweight fra… ▽ More While the open-source software development model has led to successful large-scale collaborations in building software systems, data science projects are frequently developed by individuals or small teams. We describe challenges to scaling data science collaborations and present a conceptual framework and ML programming model to address them. We instantiate these ideas in Ballet, a lightweight framework for collaborative, open-source data science through a focus on feature engineering, and an accompanying cloud-based development environment. Using our framework, collaborators incrementally propose feature definitions to a repository which are each subjected to an ML performance evaluation and can be automatically merged into an executable feature engineering pipeline. We leverage Ballet to conduct a case study analysis of an income prediction problem with 27 collaborators, and discuss implications for future designers of collaborative projects. △ Less

Submitted 22 October, 2021; v1 submitted 14 December, 2020; originally announced December 2020.

Journal ref: Proc. ACM Hum.-Comput. Interact. 5, CSCW2, Article 431 (October 2021), 39 pages

arXiv:1907.06535 [pdf, other]

Characterizing Developer Use of Automatically Generated Patches

Authors: José Pablo Cambronero, Jiasi Shen, Jürgen Cito, Elena Glassman, Martin Rinard

Abstract: We present a study that characterizes the way developers use automatically generated patches when fixing software defects. Our study tasked two groups of developers with repairing defects in C programs. Both groups were provided with the defective line of code. One was also provided with five automatically generated and validated patches, all of which modified the defective line of code, and one o… ▽ More We present a study that characterizes the way developers use automatically generated patches when fixing software defects. Our study tasked two groups of developers with repairing defects in C programs. Both groups were provided with the defective line of code. One was also provided with five automatically generated and validated patches, all of which modified the defective line of code, and one of which was correct. Contrary to our initial expectations, the group with access to the generated patches did not produce more correct patches and did not produce patches in less time. We characterize the main behaviors observed in experimental subjects: a focus on understanding the defect and the relationship of the patches to the original source code. Based on this characterization, we highlight various potentially productive directions for future developer-centric automatic patch generation systems. △ Less

Submitted 22 November, 2019; v1 submitted 15 July, 2019; originally announced July 2019.

arXiv:1411.2429 [pdf, other]

Patterns in the Chaos - a Study of Performance Variation and Predictability in Public IaaS Clouds

Authors: Philipp Leitner, Juergen Cito

Abstract: Benchmarking the performance of public cloud providers is a common research topic. Previous research has already extensively evaluated the performance of different cloud platforms for different use cases, and under different constraints and experiment setups. In this paper, we present a principled, large-scale literature review to collect and codify existing research regarding the predictability o… ▽ More Benchmarking the performance of public cloud providers is a common research topic. Previous research has already extensively evaluated the performance of different cloud platforms for different use cases, and under different constraints and experiment setups. In this paper, we present a principled, large-scale literature review to collect and codify existing research regarding the predictability of performance in public Infrastructure-as-a-Service (IaaS) clouds. We formulate 15 hypotheses relating to the nature of performance variations in IaaS systems, to the factors of influence of performance variations, and how to compare different instance types. In a second step, we conduct extensive real-life experimentation on Amazon EC2 and Google Compute Engine to empirically validate those hypotheses. At the time of our research, performance in EC2 was substantially less predictable than in GCE. Further, we show that hardware heterogeneity is in practice less prevalent than anticipated by earlier research, while multi-tenancy has a dramatic impact on performance and predictability. △ Less

Submitted 19 January, 2016; v1 submitted 10 November, 2014; originally announced November 2014.

arXiv:1409.6502 [pdf, other]

The Making of Cloud Applications An Empirical Study on Software Development for the Cloud

Authors: Jürgen Cito, Philipp Leitner, Thomas Fritz, Harald C. Gall

Abstract: Cloud computing is gaining more and more traction as a deployment and provisioning model for software. While a large body of research already covers how to optimally operate a cloud system, we still lack insights into how professional software engineers actually use clouds, and how the cloud impacts development practices. This paper reports on the first systematic study on how software developers… ▽ More Cloud computing is gaining more and more traction as a deployment and provisioning model for software. While a large body of research already covers how to optimally operate a cloud system, we still lack insights into how professional software engineers actually use clouds, and how the cloud impacts development practices. This paper reports on the first systematic study on how software developers build applications in the cloud. We conducted a mixed-method study, consisting of qualitative interviews of 25 professional developers and a quantitative survey with 294 responses. Our results show that adopting the cloud has a profound impact throughout the software development process, as well as on how developers utilize tools and data in their daily work. Among other things, we found that (1) developers need better means to anticipate runtime problems and rigorously define metrics for improved fault localization and (2) the cloud offers an abundance of operational data, however, developers still often rely on their experience and intuition rather than utilizing metrics. From our findings, we extracted a set of guidelines for cloud development and identified challenges for researchers and tool vendors. △ Less

Submitted 17 March, 2015; v1 submitted 23 September, 2014; originally announced September 2014.

arXiv:1408.4565 [pdf, other]

Cloud WorkBench - Infrastructure-as-Code Based Cloud Benchmarking

Authors: Joel Scheuner, Philipp Leitner, Jurgen Cito, Harald Gall

Abstract: To optimally deploy their applications, users of Infrastructure-as-a-Service clouds are required to evaluate the costs and performance of different combinations of cloud configurations to find out which combination provides the best service level for their specific application. Unfortunately, benchmarking cloud services is cumbersome and error-prone. In this paper, we propose an architecture and c… ▽ More To optimally deploy their applications, users of Infrastructure-as-a-Service clouds are required to evaluate the costs and performance of different combinations of cloud configurations to find out which combination provides the best service level for their specific application. Unfortunately, benchmarking cloud services is cumbersome and error-prone. In this paper, we propose an architecture and concrete implementation of a cloud benchmarking Web service, which fosters the definition of reusable and representative benchmarks. In distinction to existing work, our system is based on the notion of Infrastructure-as-Code, which is a state of the art concept to define IT infrastructure in a reproducible, well-defined, and testable way. We demonstrate our system based on an illustrative case study, in which we measure and compare the disk IO speeds of different instance and storage types in Amazon EC2. △ Less

Submitted 20 August, 2014; originally announced August 2014.

Showing 1–17 of 17 results for author: Cito, J