-
ANVIL: Anomaly-based Vulnerability Identification without Labelled Training Data
Authors:
Weizhou Wang,
Eric Liu,
Xiangyu Guo,
David Lie
Abstract:
Supervised learning-based software vulnerability detectors often fall short due to the inadequate availability of labelled training data. In contrast, Large Language Models (LLMs) such as GPT-4 are not trained on labelled data, but when prompted to detect vulnerabilities, LLM prediction accuracy is only marginally better than random guessing. In this paper, we explore a different approach by reframing vulnerability detection as an anomaly detection problem. Since the vast majority of code does not contain vulnerabilities and LLMs are trained on massive amounts of such code, vulnerable code can be viewed as an anomaly from the LLM's predicted code distribution, freeing the model from the need for labelled data to provide a learnable representation of vulnerable code. Leveraging this perspective, we demonstrate that LLMs trained for code generation exhibit a significant gap in prediction accuracy when prompted to reconstruct vulnerable versus non-vulnerable code.
Using this insight, we implement ANVIL, a detector that identifies software vulnerabilities at line-level granularity. Our experiments explore the discriminating power of different anomaly scoring methods, as well as the sensitivity of ANVIL to context size. We also study the effectiveness of ANVIL on various LLM families, and conduct leakage experiments on vulnerabilities that were discovered after the knowledge cutoff of our evaluated LLMs. On a collection of vulnerabilities from the Magma benchmark, ANVIL outperforms state-of-the-art line-level vulnerability detectors, LineVul and LineVD, which have been trained with labelled data, despite ANVIL having never been trained with labelled vulnerabilities. Specifically, our approach achieves $1.62\times$ to $2.18\times$ better Top-5 accuracies and $1.02\times$ to $1.29\times$ better ROC scores on line-level vulnerability detection tasks.
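The reconstruction idea described above can be illustrated with a small sketch. This is not ANVIL's implementation: the `reconstruct` callback, the `toy_reconstruct` stand-in for an LLM, and the use of `difflib` string similarity as the anomaly score are all assumptions made for illustration. The abstract only states that several anomaly scoring methods were explored.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple

# Hypothetical callback: given all lines and the index of a line to hide,
# return the model's best guess for that line.
Reconstructor = Callable[[List[str], int], str]


def line_anomaly_scores(lines: List[str], reconstruct: Reconstructor) -> List[float]:
    """Score each line by how poorly the reconstruction matches the original."""
    scores = []
    for index, original in enumerate(lines):
        predicted = reconstruct(lines, index)
        similarity = SequenceMatcher(None, original.strip(), predicted.strip()).ratio()
        scores.append(1.0 - similarity)  # 0.0 means a perfect reconstruction
    return scores


def top_k_lines(lines: List[str], reconstruct: Reconstructor, k: int = 5) -> List[Tuple[int, float]]:
    """Return the k (line index, score) pairs with the highest anomaly score."""
    scores = line_anomaly_scores(lines, reconstruct)
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


def toy_reconstruct(lines: List[str], index: int) -> str:
    """Stand-in for an LLM: it 'expects' a bounds check on line 1 only."""
    expected = {1: "if (len < sizeof(buf)) memcpy(buf, src, len);"}
    return expected.get(index, lines[index])


code = [
    "char buf[64];",
    "memcpy(buf, src, len);",
    "return 0;",
]
```

Calling `top_k_lines(code, toy_reconstruct, k=1)` ranks the unchecked `memcpy` line first, because it is the only line whose reconstruction differs from the original. A real detector would mask the target line before querying the model rather than passing it in.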
Submitted 27 August, 2024;
originally announced August 2024.
-
LDPKiT: Recovering Utility in LDP Schemes by Training with Noise^2
Authors:
Kexin Li,
Yang Xi,
Aastha Mehta,
David Lie
Abstract:
The adoption of large cloud-based models for inference has been hampered by concerns about the privacy leakage of end-user data. One method to mitigate this leakage is to add local differentially private noise to queries before sending them to the cloud, but this degrades utility as a side effect. Our key insight is that knowledge available in the noisy labels returned from performing inference on noisy inputs can be aggregated and used to recover the correct labels. We implement this insight in LDPKiT, which stands for Local Differentially-Private and Utility-Preserving Inference via Knowledge Transfer. LDPKiT uses the noisy labels returned from querying a set of noised inputs to train a local model (noise^2), which is then used to perform inference on the original set of inputs. Our experiments on CIFAR-10, Fashion-MNIST, SVHN, and CARER NLP datasets demonstrate that LDPKiT can improve utility without compromising privacy. For instance, on CIFAR-10, compared to a standard $ε$-LDP scheme with $ε=15$, which provides a weak privacy guarantee, LDPKiT can achieve nearly the same accuracy (within 1% drop) with $ε=7$, offering an enhanced privacy guarantee. Moreover, the benefits of using LDPKiT increase at higher, more privacy-protective noise levels. For Fashion-MNIST and CARER, LDPKiT's accuracy on the sensitive dataset with $ε=7$ not only exceeds the average accuracy of the standard $ε$-LDP scheme with $ε=7$ by roughly 20% and 9% but also outperforms the standard $ε$-LDP scheme with $ε=15$, a scenario with less noise and minimal privacy protection. We also perform Zest distance measurements to demonstrate that the type of distillation performed by LDPKiT is different from a model extraction attack.
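The pipeline described above (noise the queries, collect noisy labels, train a local model, then run it on the original inputs) can be sketched on a toy problem. Everything concrete here is an assumption for illustration: the one-dimensional inputs in [0, 1], the Laplace mechanism with scale 1/ε, the threshold `remote_model`, and the class-centroid local model are not taken from the paper.

```python
import random
from typing import Tuple


def remote_model(x: float) -> int:
    """Hypothetical cloud model: a fixed threshold classifier."""
    return 1 if x > 0.5 else 0


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) sample, built as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def run_demo(epsilon: float = 7.0, n: int = 1000, seed: int = 0) -> Tuple[float, float]:
    rng = random.Random(seed)
    private = [rng.random() for _ in range(n)]  # sensitive inputs in [0, 1]
    scale = 1.0 / epsilon                       # assumes sensitivity 1
    noised = [x + laplace_noise(scale, rng) for x in private]

    # Only the noised inputs ever reach the remote model.
    noisy_labels = [remote_model(x) for x in noised]

    # Train a local model on (noised input, noisy label) pairs:
    # the threshold is the midpoint between the two class centroids.
    ones = [x for x, y in zip(noised, noisy_labels) if y == 1]
    zeros = [x for x, y in zip(noised, noisy_labels) if y == 0]
    threshold = (sum(ones) / len(ones) + sum(zeros) / len(zeros)) / 2

    # Local inference on the original, un-noised inputs.
    local_labels = [1 if x > threshold else 0 for x in private]

    # Ground truth is used for evaluation only.
    truth = [remote_model(x) for x in private]
    noisy_acc = sum(a == b for a, b in zip(noisy_labels, truth)) / n
    local_acc = sum(a == b for a, b in zip(local_labels, truth)) / n
    return noisy_acc, local_acc
```

`run_demo()` returns the accuracy of using the noisy labels directly and the accuracy of the locally trained model. In this toy setting the local model recovers most of the accuracy lost to noise, since the labels are correct for the noised inputs and so still trace the remote decision boundary.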
Submitted 25 May, 2024;
originally announced May 2024.
-
Maximizing Information Gain in Privacy-Aware Active Learning of Email Anomalies
Authors:
Mu-Huan Miles Chung,
Sharon Li,
Jaturong Kongmanee,
Lu Wang,
Yuhong Yang,
Calvin Giang,
Khilan Jerath,
Abhay Raman,
David Lie,
Mark Chignell
Abstract:
Redacted emails satisfy most privacy requirements but they make it more difficult to detect anomalous emails that may be indicative of data exfiltration. In this paper we develop an enhanced method of Active Learning using an information gain maximizing heuristic, and we evaluate its effectiveness in a real-world setting where only redacted versions of email could be labeled by human analysts due to privacy concerns. In the first case study we examined how Active Learning should be carried out. We found that model performance was best when a single highly skilled (in terms of the labelling task) analyst provided the labels. In the second case study we used confidence ratings to estimate the labeling uncertainty of analysts and then prioritized instances for labeling based on the expected information gain (the difference between model uncertainty and analyst uncertainty) that would be provided by labelling each instance. We found that the information gain maximizing heuristic improved model performance over existing sampling methods for Active Learning. Based on the results obtained, we recommend that analysts should be screened, and possibly trained, prior to implementation of Active Learning in cybersecurity applications. We also recommend that the information gain maximizing sample method (based on expert confidence) should be used in early stages of Active Learning, provided that well-calibrated confidence can be obtained. We also note that the expertise of analysts should be assessed prior to Active Learning, as we found that analysts with lower labelling skill had poorly calibrated (over-) confidence in their labels.
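The prioritization rule described above, expected information gain as model uncertainty minus analyst uncertainty, can be sketched as follows. Measuring both uncertainties with binary entropy, and treating an analyst's confidence rating as a probability in [0.5, 1], are assumptions made for this illustration. The abstract does not specify how the uncertainties are computed.

```python
import math
from typing import List


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) variable; 0 when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def expected_information_gain(model_prob: float, analyst_confidence: float) -> float:
    """Model uncertainty minus the (estimated) analyst uncertainty."""
    return binary_entropy(model_prob) - binary_entropy(analyst_confidence)


def pick_next(model_probs: List[float], analyst_confidences: List[float]) -> int:
    """Index of the instance whose label is expected to be most informative."""
    gains = [
        expected_information_gain(p, c)
        for p, c in zip(model_probs, analyst_confidences)
    ]
    return max(range(len(gains)), key=gains.__getitem__)
```

For example, `pick_next([0.5, 0.5, 0.9], [0.6, 0.95, 0.95])` selects index 1: the model is maximally unsure about the first two instances, but only for the second is the analyst expected to label confidently.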
Submitted 12 May, 2024;
originally announced May 2024.
-
Dumviri: Detecting Trackers and Mixed Trackers with a Breakage Detector
Authors:
He Shuang,
Lianying Zhao,
David Lie
Abstract:
Previous automatic tracker detection work lacks features to recognize web page breakage and often resorts to manual analysis to assess the breakage caused by blocking trackers.
We introduce Dumviri, which incorporates a breakage detector that can automatically detect web page breakage caused by erroneously blocking a resource that is needed by the page to function properly. This addition allows Dumviri to prevent functional resources from being misclassified as trackers and increases overall detection accuracy. We designed Dumviri to take differential features. We further find that these features are agnostic to analysis granularity and enable Dumviri to predict tracking resources at the request field granularity, allowing Dumviri to handle some mixed trackers.
Evaluating Dumviri on 15K pages shows its ability to replicate the labels of human-generated filter lists with an accuracy of 97.44%. Through a manual analysis, we found that Dumviri identified previously unreported trackers and its breakage detector can identify rules that cause web page breakage in commonly used filter lists like EasyPrivacy. In the case of mixed trackers, Dumviri, being the first automated mixed tracker detector, achieves a 79.09% accuracy. We have confirmed 22 previously unreported unique trackers and 26 unique mixed trackers. We promptly reported these findings to privacy developers, and we will publish our filter lists in uBlock Origin's extended syntax.
Submitted 12 February, 2024;
originally announced February 2024.
-
Calpric: Inclusive and Fine-grain Labeling of Privacy Policies with Crowdsourcing and Active Learning
Authors:
Wenjun Qiu,
David Lie,
Lisa Austin
Abstract:
A significant challenge to training accurate deep learning models on privacy policies is the cost and difficulty of obtaining a large and comprehensive set of training data. To address these challenges, we present Calpric, which combines automatic text selection and segmentation, active learning and the use of crowdsourced annotators to generate a large, balanced training set for privacy policies at low cost. Automated text selection and segmentation simplifies the labeling task, enabling untrained annotators from crowdsourcing platforms, like Amazon's Mechanical Turk, to be competitive with trained annotators, such as law students, and also reduces inter-annotator agreement, which decreases labeling cost. Having reliable labels for training enables the use of active learning, which uses fewer training samples to efficiently cover the input space, further reducing cost and improving class and data category balance in the data set. The combination of these techniques allows Calpric to produce models that are accurate over a wider range of data categories, and provide more detailed, fine-grain labels than previous work. Our crowdsourcing process enables Calpric to attain reliable labeled data at a cost of roughly $0.92-$1.71 per labeled text segment. Calpric's training process also generates a labeled data set of 16K privacy policy text segments across 9 data categories with balanced positive and negative samples.
Submitted 15 January, 2024;
originally announced January 2024.
-
Implementing Active Learning in Cybersecurity: Detecting Anomalies in Redacted Emails
Authors:
Mu-Huan Chung,
Lu Wang,
Sharon Li,
Yuhong Yang,
Calvin Giang,
Khilan Jerath,
Abhay Raman,
David Lie,
Mark Chignell
Abstract:
Research on email anomaly detection has typically relied on specially prepared datasets that may not adequately reflect the type of data that occurs in industry settings. In our research, at a major financial services company, privacy concerns prevented inspection of the bodies of emails and attachment details (although subject headings and attachment filenames were available). This made labeling possible anomalies in the resulting redacted emails more difficult. Another source of difficulty is the high volume of emails combined with the scarcity of resources making machine learning (ML) a necessity, but also creating a need for more efficient human training of ML models. Active learning (AL) has been proposed as a way to make human training of ML models more efficient. However, the implementation of Active Learning methods is a human-centered AI challenge due to potential human analyst uncertainty, and the labeling task can be further complicated in domains such as the cybersecurity domain (or healthcare, aviation, etc.) where mistakes in labeling can have highly adverse consequences. In this paper we present research results concerning the application of Active Learning to anomaly detection in redacted emails, comparing the utility of different methods for implementing active learning in this context. We evaluate different AL strategies and their impact on resulting model performance. We also examine how ratings of confidence that experts have in their labels can inform AL. The results obtained are discussed in terms of their implications for AL methodology and for the role of experts in model-assisted email anomaly screening.
Submitted 2 March, 2023; v1 submitted 1 March, 2023;
originally announced March 2023.
-
In Differential Privacy, There is Truth: On Vote Leakage in Ensemble Private Learning
Authors:
Jiaqi Wang,
Roei Schuster,
Ilia Shumailov,
David Lie,
Nicolas Papernot
Abstract:
When learning from sensitive data, care must be taken to ensure that training algorithms address privacy concerns. The canonical Private Aggregation of Teacher Ensembles, or PATE, computes output labels by aggregating the predictions of a (possibly distributed) collection of teacher models via a voting mechanism. The mechanism adds noise to attain a differential privacy guarantee with respect to the teachers' training data. In this work, we observe that this use of noise, which makes PATE predictions stochastic, enables new forms of leakage of sensitive information. For a given input, our adversary exploits this stochasticity to extract high-fidelity histograms of the votes submitted by the underlying teachers. From these histograms, the adversary can learn sensitive attributes of the input such as race, gender, or age. Although this attack does not directly violate the differential privacy guarantee, it clearly violates privacy norms and expectations, and would not be possible at all without the noise inserted to obtain differential privacy. In fact, counter-intuitively, the attack becomes easier as we add more noise to provide stronger differential privacy. We hope this encourages future work to consider privacy holistically rather than treat differential privacy as a panacea.
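A small simulation illustrates why the noise makes the teachers' votes observable. It is a sketch under stated assumptions, not the attack from the paper: the Gaussian noise, the vote counts, and the helper names `noisy_argmax` and `output_frequencies` are chosen only for illustration. Without noise, repeated queries always return the winning class. With noise, the frequency of each returned class depends on the underlying vote counts.

```python
import random
from collections import Counter
from typing import List


def noisy_argmax(votes: List[int], sigma: float, rng: random.Random) -> int:
    """Add independent Gaussian noise to each vote count and return the top class."""
    noisy = [v + rng.gauss(0.0, sigma) for v in votes]
    return max(range(len(votes)), key=noisy.__getitem__)


def output_frequencies(votes: List[int], sigma: float, queries: int, seed: int = 0) -> List[float]:
    """Fraction of repeated queries on the same input that return each class."""
    rng = random.Random(seed)
    counts = Counter(noisy_argmax(votes, sigma, rng) for _ in range(queries))
    return [counts[c] / queries for c in range(len(votes))]
```

With hypothetical votes `[60, 30, 10]`, `output_frequencies(votes, 0.0, 100)` returns only the winner, while `output_frequencies(votes, 40.0, 5000)` yields a distribution over all three classes whose ordering mirrors the vote counts. An adversary observing these frequencies learns more than the single aggregated label.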
Submitted 21 September, 2022;
originally announced September 2022.
-
On the Exploitability of Audio Machine Learning Pipelines to Surreptitious Adversarial Examples
Authors:
Adelin Travers,
Lorna Licollari,
Guanghan Wang,
Varun Chandrasekaran,
Adam Dziedzic,
David Lie,
Nicolas Papernot
Abstract:
Machine learning (ML) models are known to be vulnerable to adversarial examples. Applications of ML to voice biometrics authentication are no exception. Yet, the implications of audio adversarial examples on these real-world systems remain poorly understood given that most research targets limited defenders who can only listen to the audio samples. Conflating detectability of an attack with human perceptibility, research has focused on methods that aim to produce imperceptible adversarial examples which humans cannot distinguish from the corresponding benign samples. We argue that this perspective is coarse for two reasons: 1. Imperceptibility is impossible to verify; it would require an experimental process that encompasses variations in listener training, equipment, volume, ear sensitivity, types of background noise etc, and 2. It disregards pipeline-based detection clues that realistic defenders leverage. This results in adversarial examples that are ineffective in the presence of knowledgeable defenders. Thus, an adversary only needs an audio sample to be plausible to a human. We thus introduce surreptitious adversarial examples, a new class of attacks that evades both human and pipeline controls. In the white-box setting, we instantiate this class with a joint, multi-stage optimization attack. Using an Amazon Mechanical Turk user study, we show that this attack produces audio samples that are more surreptitious than previous attacks that aim solely for imperceptibility. Lastly we show that surreptitious adversarial examples are challenging to develop in the black-box setting.
Submitted 3 August, 2021;
originally announced August 2021.
-
Deep Active Learning with Crowdsourcing Data for Privacy Policy Classification
Authors:
Wenjun Qiu,
David Lie
Abstract:
Privacy policies are statements that notify users of the services' data practices. However, few users are willing to read through policy texts due to their length and complexity. While automated tools based on machine learning exist for privacy policy analysis, to achieve high classification accuracy, classifiers need to be trained on a large labeled dataset. Most existing policy corpora are labeled by skilled human annotators, requiring a significant amount of labor and effort. In this paper, we leverage active learning and crowdsourcing techniques to develop an automated classification tool named Calpric (Crowdsourcing Active Learning PRIvacy Policy Classifier), which is able to perform annotation equivalent to that done by skilled human annotators with high accuracy while minimizing the labeling cost. Specifically, active learning allows classifiers to proactively select the most informative segments to be labeled. On average, our model is able to achieve the same F1 score using only 62% of the original labeling effort. Calpric's use of active learning also addresses naturally occurring class imbalance in unlabeled privacy policy datasets as there are many more statements stating the collection of private information than stating the absence of collection. By selecting samples from the minority class for labeling, Calpric automatically creates a more balanced training set.
Submitted 6 August, 2020;
originally announced August 2020.
-
vWitness: Certifying Web Page Interactions with Computer Vision
Authors:
He Shuang,
Lianying Zhao,
David Lie
Abstract:
Web servers service client requests, some of which might cause the web server to perform security-sensitive operations (e.g. money transfer, voting). An attacker may thus forge or maliciously manipulate such requests by compromising a web client. Unfortunately, a web server has no way of knowing whether the client from which it receives a request has been compromised or not -- current "best practice" defenses such as user authentication or network encryption cannot aid a server as they all assume web client integrity. To address this shortcoming, we propose vWitness, which "witnesses" the interactions of a user with a web page and certifies whether they match a specification provided by the web server, enabling the web server to know that the web request is user-intended. The main challenge that vWitness overcomes is that even benign clients introduce unpredictable variations in the way they render web pages. vWitness differentiates between these benign variations and malicious manipulation using computer vision, allowing it to certify to the web server that 1) the web page user interface is properly displayed and 2) observed user interactions are used to construct the web request. Our vWitness prototype achieves compatibility with modern web pages, is resilient to adversarial example attacks and is accurate and performant -- vWitness achieves 99.97% accuracy and adds 197ms of overhead to the entire interaction session in the average case.
Submitted 4 July, 2023; v1 submitted 30 July, 2020;
originally announced July 2020.
-
Machine Unlearning
Authors:
Lucas Bourtoule,
Varun Chandrasekaran,
Christopher A. Choquette-Choo,
Hengrui Jia,
Adelin Travers,
Baiwu Zhang,
David Lie,
Nicolas Papernot
Abstract:
Once users have shared their data online, it is generally difficult for them to revoke access and ask for the data to be deleted. Machine learning (ML) exacerbates this problem because any model trained with said data may have memorized it, putting users at risk of a successful privacy attack exposing their information. Yet, having models unlearn is notoriously difficult. We introduce SISA training, a framework that expedites the unlearning process by strategically limiting the influence of a data point in the training procedure. While our framework is applicable to any learning algorithm, it is designed to achieve the largest improvements for stateful algorithms like stochastic gradient descent for deep neural networks. SISA training reduces the computational overhead associated with unlearning, even in the worst-case setting where unlearning requests are made uniformly across the training set. In some cases, the service provider may have a prior on the distribution of unlearning requests that will be issued by users. We may take this prior into account to partition and order data accordingly, and further decrease overhead from unlearning. Our evaluation spans several datasets from different domains, with corresponding motivations for unlearning. Under no distributional assumptions, for simple learning tasks, we observe that SISA training improves time to unlearn points from the Purchase dataset by 4.63x, and 2.45x for the SVHN dataset, over retraining from scratch. SISA training also provides a speed-up of 1.36x in retraining for complex learning tasks such as ImageNet classification; aided by transfer learning, this results in a small degradation in accuracy. Our work contributes to practical data governance in machine unlearning.
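One way to limit the influence of a data point, as described above, is to partition the training data so that each point affects only one constituent model. The sketch below shows that idea in miniature and is not the SISA implementation: the `ShardedEnsemble` class, the round-robin sharding, the use of a shard mean as the "model", and the averaging aggregation are all assumptions made for illustration.

```python
from statistics import mean
from typing import List


class ShardedEnsemble:
    """Toy ensemble in which each data point influences exactly one shard model."""

    def __init__(self, data: List[float], num_shards: int) -> None:
        # Round-robin partition of the training data (hypothetical choice).
        self.shards = [data[i::num_shards] for i in range(num_shards)]
        self.retrain_count = 0
        self.models = [self._train(shard) for shard in self.shards]

    def _train(self, shard: List[float]) -> float:
        """Stand-in for training: the 'model' is just the shard mean."""
        self.retrain_count += 1
        return mean(shard)

    def predict(self) -> float:
        """Aggregate the shard models by averaging their outputs."""
        return mean(self.models)

    def unlearn(self, value: float) -> int:
        """Remove one point and retrain only the shard that contained it."""
        for index, shard in enumerate(self.shards):
            if value in shard:
                shard.remove(value)
                self.models[index] = self._train(shard)
                return index
        raise ValueError("value not found in any shard")
```

With six points in three shards, unlearning one point triggers a single shard retraining rather than a full retraining over all data, which is the source of the speed-up in this toy setting. The sketch assumes a shard never becomes empty.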
Submitted 15 December, 2020; v1 submitted 8 December, 2019;
originally announced December 2019.
-
SoK: Hardware Security Support for Trustworthy Execution
Authors:
Lianying Zhao,
He Shuang,
Shengjie Xu,
Wei Huang,
Rongzhen Cui,
Pushkar Bettadpur,
David Lie
Abstract:
In recent years, there have emerged many new hardware mechanisms for improving the security of our computer systems. Hardware offers many advantages over pure software approaches: immutability of mechanisms to software attacks, better execution and power efficiency and a smaller interface allowing it to better maintain secrets. This has given birth to a plethora of hardware mechanisms providing trusted execution environments (TEEs), support for integrity checking and memory safety and widespread uses of hardware roots of trust.
In this paper, we systematize these approaches through the lens of abstraction. Abstraction is key to computing systems, and the interface between hardware and software contains many abstractions. We find that these abstractions, when poorly designed, can both obscure information that is needed for security enforcement, as well as reveal information that needs to be kept secret, leading to vulnerabilities. We summarize such vulnerabilities and discuss several research trends of this area.
Submitted 10 October, 2019;
originally announced October 2019.
-
MultiK: A Framework for Orchestrating Multiple Specialized Kernels
Authors:
Hsuan-Chi Kuo,
Akshith Gunasekaran,
Yeongjin Jang,
Sibin Mohan,
Rakesh B. Bobba,
David Lie,
Jesse Walker
Abstract:
We present MultiK, a Linux-based framework that reduces the attack surface for operating system kernels by reducing code bloat. MultiK "orchestrates" multiple kernels that are specialized for individual applications in a transparent manner. This framework is flexible enough to accommodate different kernel code reduction techniques and, most importantly, runs the specialized kernels with near-zero additional runtime overheads. MultiK avoids the overheads of virtualization and runs natively on the system. For instance, an Apache instance is shown to run on a kernel that has (a) 93.68% of its code reduced, (b) 19 of 23 known kernel vulnerabilities eliminated and (c) negligible performance overheads (0.19%). MultiK is a framework that can integrate with existing code reduction and OS security techniques. We demonstrate this by using D-KUT and S-KUT -- two methods to profile and eliminate unwanted kernel code. The whole process is transparent to the user applications because MultiK does not require a recompilation of the application.
Submitted 16 March, 2019;
originally announced March 2019.
-
Sound Patch Generation for Vulnerabilities
Authors:
Zhen Huang,
David Lie
Abstract:
Security vulnerabilities are among the most critical software defects in existence. As such, they require patches that are correct and quickly deployed. This motivates an automatic patch generation method that emphasizes both soundness and wide applicability. To address this challenge, we propose Senx, which uses three novel patch generation techniques to create patches for out-of-bounds read/write vulnerabilities. Senx uses symbolic execution to extract expressions from the source code of a target application to synthesize patches. To reduce the runtime overhead of patches, it uses loop cloning and access range analysis to analyze loops involved in these vulnerabilities and elevate patches outside of loops. For vulnerabilities that span multiple functions, Senx uses expression translation to translate expressions and place them in a function scope where all values are available to create the patch. This enables Senx to patch vulnerabilities with complex loops and interprocedural dependencies that previous semantics-based patch generation systems cannot handle.
We have implemented a prototype using this approach. Our evaluation shows that the patches generated by Senx successfully fix 76% of 42 real-world vulnerabilities from 11 applications including various tools or libraries for manipulating graphics/media files, a programming language interpreter, a relational database engine, a collection of programming tools for creating and managing binary programs, and a collection of basic file, shell, and text manipulation tools. All patches that Senx produces are sound, and Senx correctly aborts patch generations in cases where its analysis will fall short.
Submitted 11 June, 2018; v1 submitted 29 November, 2017;
originally announced November 2017.
-
Ocasta: Clustering Configuration Settings For Error Recovery
Authors:
Zhen Huang,
David Lie
Abstract:
Effective machine-aided diagnosis and repair of configuration errors continues to elude computer systems designers. Most of the literature targets errors that can be attributed to a single erroneous configuration setting. However, a recent study found that a significant number of configuration errors require fixing more than one setting together. To address this limitation, Ocasta statistically clusters dependent configuration settings based on the application's accesses to its configuration settings and uses the extracted clusters to fix configuration errors involving more than one configuration setting. Ocasta treats applications as black boxes and relies only on the ability to observe application accesses to their configuration settings.
We collected traces of real application usage from 24 Linux and 5 Windows desktop computers and found that Ocasta is able to correctly identify clusters with 88.6% accuracy. To demonstrate the effectiveness of Ocasta, we evaluated it on 16 real-world configuration errors of 11 Linux and Windows applications. Ocasta is able to successfully repair all evaluated configuration errors in 11 minutes on average and only requires the user to examine an average of 3 screenshots of the output of the application to confirm that the error is repaired. A user study we conducted shows that Ocasta is easy to use by both expert and non-expert users and is more efficient than manual configuration error troubleshooting.
Submitted 2 November, 2017;
originally announced November 2017.
-
SAIC: Identifying Configuration Files for System Configuration Management
Authors:
Zhen Huang,
David Lie
Abstract:
Systems can become misconfigured for a variety of reasons such as operator errors or buggy patches. When a misconfiguration is discovered, usually the first order of business is to restore availability, often by undoing the misconfiguration. To simplify this task, we propose the Statistical Analysis for Identifying Configuration Files (SAIC), which analyzes how the contents of a file change over time to automatically determine which files contain configuration state. In this way, SAIC reduces the number of files a user must manually examine during recovery and allows versioning file systems to make more efficient use of their versioning storage.
The two key insights that enable SAIC to identify configuration files are that configuration state must persist across executions of an application and that configuration state changes at a slower rate than other types of application state. SAIC applies these insights through a set of filters, which eliminate non-persistent files from consideration, and a novel similarity metric, which measures how similar a file's versions are to each other. Together, these two mechanisms enable SAIC to identify all 72 configuration files out of 2363 versioned files from 6 common applications in two user traces, while mistaking only 33 non-configuration files for configuration files. This allows a versioning file system to eliminate roughly 66% of non-configuration file versions from its logs, reducing the number of file versions that a user must try when recovering from a misconfiguration.
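As a quick sanity check, the file counts quoted above can be turned into precision and recall. The sketch below is only arithmetic on the figures in the abstract; the variable names are illustrative and are not taken from SAIC itself.

```python
# Figures quoted in the abstract above.
total_versioned_files = 2363
true_config_files = 72   # all of which SAIC identified
false_positives = 33     # non-configuration files flagged as configuration files

# Files SAIC reports as configuration files.
flagged = true_config_files + false_positives

# Precision: share of flagged files that really are configuration files.
precision = true_config_files / flagged

# Recall: share of real configuration files that were flagged (all of them).
recall = true_config_files / true_config_files

# Share of all versioned files a user would still need to examine.
remaining_fraction = flagged / total_versioned_files

print(f"flagged={flagged}, precision={precision:.3f}, "
      f"recall={recall:.1f}, remaining={remaining_fraction:.3f}")
```

Under these numbers a user examines 105 files instead of 2363, and roughly two out of three flagged files are true configuration files.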
Submitted 6 November, 2017;
originally announced November 2017.
-
BinPro: A Tool for Binary Source Code Provenance
Authors:
Dhaval Miyani,
Zhen Huang,
David Lie
Abstract:
Enforcing open source licenses such as the GNU General Public License (GPL), analyzing a binary for possible vulnerabilities, and code maintenance are all situations where it is useful to be able to determine the source code provenance of a binary. While previous work has either focused on computing binary-to-binary similarity or source-to-source similarity, BinPro is the first work we are aware of to tackle the problem of source-to-binary similarity. BinPro can match binaries with their source code even without knowing which compiler was used to produce the binary, or what optimization level was used with the compiler. To do this, BinPro utilizes machine learning to compute optimal code features for determining binary-to-source similarity and a static analysis pipeline to extract and compute similarity based on those features. Our experiments show that on average BinPro computes a similarity of 81% for matching binaries and source code of the same applications, and an average similarity of 25% for binaries and source code of similar but different applications. This shows that BinPro's similarity score is useful for determining if a binary was derived from a particular source code.
Submitted 2 November, 2017;
originally announced November 2017.
-
Talos: Neutralizing Vulnerabilities with Security Workarounds for Rapid Response
Authors:
Zhen Huang,
Mariana D'Angelo,
Dhaval Miyani,
David Lie
Abstract:
Considerable delays often exist between the discovery of a vulnerability and the issue of a patch. One way to mitigate this window of vulnerability is to use a configuration workaround, which prevents the vulnerable code from being executed at the cost of some lost functionality -- but only if one is available. Since program configurations are not specifically designed to mitigate software vulnerabilities, we find that they only cover 25.2% of vulnerabilities.
To minimize patch delay vulnerabilities and address the limitations of configuration workarounds, we propose Security Workarounds for Rapid Response (SWRRs), which are designed to neutralize security vulnerabilities in a timely, secure, and unobtrusive manner. Similar to configuration workarounds, SWRRs neutralize vulnerabilities by preventing vulnerable code from being executed at the cost of some lost functionality. However, the key difference is that SWRRs use existing error-handling code within programs, which enables them to be mechanically inserted with minimal knowledge of the program and minimal developer effort. This allows SWRRs to achieve high coverage while still being fast and easy to deploy.
We have designed and implemented Talos, a system that mechanically instruments SWRRs into a given program, and evaluated it on five popular Linux server programs. We run exploits against 11 real-world software vulnerabilities and show that SWRRs neutralize the vulnerabilities in all cases. Quantitative measurements on 320 SWRRs indicate that SWRRs instrumented by Talos can neutralize 75.1% of all potential vulnerabilities and incur a loss of functionality similar to configuration workarounds in 71.3% of those cases. Our overall conclusion is that automatically generated SWRRs can safely mitigate 2.1x more vulnerabilities, while only incurring a loss of functionality comparable to that of traditional configuration workarounds.
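One plausible reading of the 2.1x figure, offered here as an assumption rather than as the authors' stated derivation, is that it compares the share of vulnerabilities that SWRRs neutralize with comparable functionality loss against the 25.2% coverage of configuration workarounds. A short check of that arithmetic, with illustrative variable names:

```python
# Percentages quoted in the abstract above, expressed as fractions.
config_workaround_coverage = 0.252  # vulnerabilities covered by configuration workarounds
swrr_coverage = 0.751               # potential vulnerabilities neutralized by SWRRs
comparable_loss_share = 0.713       # share of those with loss similar to config workarounds

# Assumed interpretation: vulnerabilities neutralized by SWRRs with comparable loss.
swrr_comparable = swrr_coverage * comparable_loss_share

# Ratio against configuration-workaround coverage.
ratio = swrr_comparable / config_workaround_coverage

print(f"SWRRs with comparable loss: {swrr_comparable:.3f}, ratio: {ratio:.2f}")
```

Under this reading the ratio rounds to 2.1, which matches the figure stated in the abstract.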
Submitted 2 November, 2017;
originally announced November 2017.
-
Unity 2.0: Secure and Durable Personal Cloud Storage
Authors:
Beom Heyn Kim,
Wei Huang,
Afshar Ganjali,
David Lie
Abstract:
While personal cloud storage services such as Dropbox, OneDrive, Google Drive and iCloud have become very popular in recent years, these services offer few security guarantees to users. These cloud services are aimed at end users, whose applications often assume local file system storage, and thus require strongly consistent data. In addition, users usually access these services using personal computers and portable devices such as phones and tablets, which are upload-bandwidth constrained and in many cases battery powered. Unity is a system that provides confidentiality, integrity, durability and strong consistency while minimizing the upload bandwidth of its clients. We find that Unity consumes minimal upload bandwidth for compute-heavy workloads compared to NFS and Dropbox, while using a similar amount of upload bandwidth for write-heavy workloads relative to NBD. Although read-heavy workloads tend to consume more upload bandwidth with Unity, the amount is no more than an eighth of the size of the blocks replicated, and there is much room for optimization. Moreover, Unity provides the flexibility to maintain multiple DEs, allowing multiple devices to access the data concurrently with minimal lease-switch cost.
Submitted 10 October, 2017;
originally announced October 2017.
-
The Case for a Single System Image for Personal Devices
Authors:
Beom Heyn Kim,
Eyal de Lara,
David Lie
Abstract:
Computing technology has gotten cheaper and more powerful, allowing users to have a growing number of personal computing devices at their disposal. While this trend is beneficial for the user, it also creates a growing management burden. Each device must be managed independently, and users must repeat the same management tasks on each device, such as updating software, changing configurations, performing backups, and replicating data for availability. To prevent the management burden from increasing with the number of devices, we propose that all devices run a single system image called a personal computing image. Personal computing images export a device-specific user interface on each device, but provide a consistent view of application and operating state across all devices. As a result, management tasks can be performed once on any device and will be automatically propagated to all other devices belonging to the user. We discuss evolutionary steps that can be taken to achieve personal computing images for devices and elaborate on challenges that we believe building such systems will face.
Submitted 10 October, 2017;
originally announced October 2017.
-
Prochlo: Strong Privacy for Analytics in the Crowd
Authors:
Andrea Bittau,
Úlfar Erlingsson,
Petros Maniatis,
Ilya Mironov,
Ananth Raghunathan,
David Lie,
Mitch Rudominer,
Usharsee Kode,
Julien Tinnes,
Bernhard Seefeld
Abstract:
The large-scale monitoring of computer users' software activities has become commonplace, e.g., for application telemetry, error reporting, or demographic profiling. This paper describes a principled systems architecture---Encode, Shuffle, Analyze (ESA)---for performing such monitoring with high utility while also protecting user privacy. The ESA design, and its Prochlo implementation, are informed by our practical experiences with an existing, large deployment of privacy-preserving software monitoring.
(cont.; see the paper)
Submitted 2 October, 2017;
originally announced October 2017.
-
Glimmers: Resolving the Privacy/Trust Quagmire
Authors:
David Lie,
Petros Maniatis
Abstract:
Many successful services rely on trustworthy contributions from users. To establish that trust, such services often require access to privacy-sensitive information from users, thus creating a conflict between privacy and trust. Although it is likely impractical to expect both absolute privacy and trustworthiness at the same time, we argue that the current state of things, where individual privacy is usually sacrificed at the altar of trustworthy services, can be improved with a pragmatic Glimmer of Trust, which allows services to validate user contributions in a trustworthy way without forfeiting user privacy. We describe how trustworthy hardware such as Intel's SGX can be used client-side -- in contrast to much recent work exploring SGX in cloud services -- to realize the Glimmer architecture, and demonstrate how this realization is able to resolve the tension between privacy and trust in a variety of cases.
Submitted 23 February, 2017;
originally announced February 2017.
-
Automated Epilepsy Diagnosis Using Interictal Scalp EEG
Authors:
Forrest Sheng Bao,
Jue-Ming Gao,
Jing Hu,
Donald Y. -C. Lie,
Yuanlin Zhang,
K. J. Oommen
Abstract:
Over 50 million people worldwide suffer from epilepsy. Traditional diagnosis of epilepsy relies on tedious visual screening by highly trained clinicians of lengthy EEG recordings that contain seizure (ictal) activity. Nowadays, there are many automatic systems that can recognize seizure-related EEG signals to help the diagnosis. However, it is very costly and inconvenient to obtain long-term EEG data with seizure activity, especially in areas short of medical resources. We demonstrate in this paper that we can use interictal scalp EEG data, which is much easier to collect than ictal data, to automatically diagnose whether a person is epileptic. In our automated EEG recognition system, we extract three classes of features from the EEG data and build Probabilistic Neural Networks (PNNs) fed with these features. We optimize the feature extraction parameters and combine these PNNs through a voting mechanism. As a result, our system achieves an impressive 94.07% accuracy, which is very close to the reported recognition accuracy of experienced medical professionals.
Submitted 24 April, 2009; v1 submitted 24 April, 2009;
originally announced April 2009.
-
A New Approach to Automated Epileptic Diagnosis Using EEG and Probabilistic Neural Network
Authors:
Forrest Sheng Bao,
Donald Yu-Chun Lie,
Yuanlin Zhang
Abstract:
Epilepsy is one of the most common neurological disorders that greatly impair patients' daily lives. Traditional epileptic diagnosis relies on tedious visual screening by neurologists of lengthy EEG recordings, which requires the presence of seizure (ictal) activity. Nowadays, there are many systems that help neurologists quickly find interesting segments of the lengthy signal through automatic seizure detection. However, we notice that it is very difficult, if not impossible, to obtain long-term EEG data with seizure activity for epilepsy patients in areas lacking medical resources and trained neurologists. Therefore, we propose to study automated epileptic diagnosis using interictal EEG data, which is much easier to collect than ictal data. The authors are not aware of any report of an automated EEG diagnostic system that can accurately distinguish patients' interictal EEG from the EEG of normal people. The research presented in this paper, therefore, aims to develop an automated diagnostic system that can use interictal EEG data to diagnose whether a person is epileptic. Such a system should also detect seizure activity for further investigation by doctors and for potential patient monitoring. To develop such a system, we extract four classes of features from the EEG data and build a Probabilistic Neural Network (PNN) fed with these features. Leave-one-out cross-validation (LOO-CV) on a widely used epileptic-normal data set shows an impressive 99.5% accuracy of our system in distinguishing normal people's EEG from patients' interictal EEG. We also find that our system can be used in patient monitoring (seizure detection) and seizure focus localization, with 96.7% and 77.5% accuracy respectively on the data set.
Submitted 4 July, 2008; v1 submitted 21 April, 2008;
originally announced April 2008.