Search | arXiv e-print repository

Evaluating the Efficacy of Foundational Models: Advancing Benchmarking Practices to Enhance Fine-Tuning Decision-Making

Authors: Oluyemi Enoch Amujo, Shanchieh Jay Yang

Abstract: Recently, large language models (LLMs) have expanded into various domains. However, there remains a need to evaluate how these models perform when prompted with commonplace queries compared to domain-specific queries, which may be useful for benchmarking prior to fine-tuning for domain-specific downstream tasks. This study evaluates LLMs, specifically Gemma-2B and Gemma-7B, across diverse domains,… ▽ More Recently, large language models (LLMs) have expanded into various domains. However, there remains a need to evaluate how these models perform when prompted with commonplace queries compared to domain-specific queries, which may be useful for benchmarking prior to fine-tuning for domain-specific downstream tasks. This study evaluates LLMs, specifically Gemma-2B and Gemma-7B, across diverse domains, including cybersecurity, medicine, and finance, compared to common knowledge queries. This study utilizes a comprehensive methodology to assess foundational models, which includes problem formulation, data analysis, and the development of ThroughCut, a novel outlier detection technique that automatically identifies response throughput outliers based on their conciseness. This methodological rigor enhances the credibility of the presented evaluation frameworks. This study focused on assessing inference time, response length, throughput, quality, and resource utilization and investigated the correlations between these factors. The results indicate that model size and types of prompts used for inference significantly influenced response length and quality. In addition, common prompts, which include various types of queries, generate diverse and inconsistent responses at irregular intervals. In contrast, domain-specific prompts consistently generate concise responses within a reasonable time. Overall, this study underscores the need for comprehensive evaluation frameworks to enhance the reliability of benchmarking procedures in multidomain AI research. △ Less

Submitted 20 August, 2024; v1 submitted 25 June, 2024; originally announced July 2024.

Comments: 10 pages, 5 figures, 2 tables, and algorithms

arXiv:2401.00280 [pdf, other]

Advancing TTP Analysis: Harnessing the Power of Large Language Models with Retrieval Augmented Generation

Authors: Reza Fayyazi, Rozhina Taghdimi, Shanchieh Jay Yang

Abstract: Tactics, Techniques, and Procedures (TTPs) outline the methods attackers use to exploit vulnerabilities. The interpretation of TTPs in the MITRE ATT&CK framework can be challenging for cybersecurity practitioners due to presumed expertise and complex dependencies. Meanwhile, advancements with Large Language Models (LLMs) have led to recent surge in studies exploring its uses in cybersecurity opera… ▽ More Tactics, Techniques, and Procedures (TTPs) outline the methods attackers use to exploit vulnerabilities. The interpretation of TTPs in the MITRE ATT&CK framework can be challenging for cybersecurity practitioners due to presumed expertise and complex dependencies. Meanwhile, advancements with Large Language Models (LLMs) have led to recent surge in studies exploring its uses in cybersecurity operations. It is, however, unclear how LLMs can be used in an efficient and proper way to provide accurate responses for critical domains such as cybersecurity. This leads us to investigate how to better use two types of LLMs: small-scale encoder-only (e.g., RoBERTa) and larger decoder-only (e.g., GPT-3.5) LLMs to comprehend and summarize TTPs with the intended purposes (i.e., tactics) of a cyberattack procedure. This work studies and compares the uses of supervised fine-tuning (SFT) of encoder-only LLMs vs. Retrieval Augmented Generation (RAG) for decoder-only LLMs (without fine-tuning). Both SFT and RAG techniques presumably enhance the LLMs with relevant contexts for each cyberattack procedure. Our studies show decoder-only LLMs with RAG achieves better performance than encoder-only models with SFT, particularly when directly relevant context is extracted by RAG. The decoder-only results could suffer low `Precision' while achieving high `Recall'. Our findings further highlight a counter-intuitive observation that more generic prompts tend to yield better predictions of cyberattack tactics than those that are more specifically tailored. △ Less

Submitted 21 July, 2024; v1 submitted 30 December, 2023; originally announced January 2024.

arXiv:2312.03419 [pdf, other]

Synthesizing Physical Backdoor Datasets: An Automated Framework Leveraging Deep Generative Models

Authors: Sze Jue Yang, Chinh D. La, Quang H. Nguyen, Kok-Seng Wong, Anh Tuan Tran, Chee Seng Chan, Khoa D. Doan

Abstract: Backdoor attacks, representing an emerging threat to the integrity of deep neural networks, have garnered significant attention due to their ability to compromise deep learning systems clandestinely. While numerous backdoor attacks occur within the digital realm, their practical implementation in real-world prediction systems remains limited and vulnerable to disturbances in the physical world. Co… ▽ More Backdoor attacks, representing an emerging threat to the integrity of deep neural networks, have garnered significant attention due to their ability to compromise deep learning systems clandestinely. While numerous backdoor attacks occur within the digital realm, their practical implementation in real-world prediction systems remains limited and vulnerable to disturbances in the physical world. Consequently, this limitation has given rise to the development of physical backdoor attacks, where trigger objects manifest as physical entities within the real world. However, creating the requisite dataset to train or evaluate a physical backdoor model is a daunting task, limiting the backdoor researchers and practitioners from studying such physical attack scenarios. This paper unleashes a recipe that empowers backdoor researchers to effortlessly create a malicious, physical backdoor dataset based on advances in generative modeling. Particularly, this recipe involves 3 automatic modules: suggesting the suitable physical triggers, generating the poisoned candidate samples (either by synthesizing new samples or editing existing clean samples), and finally refining for the most plausible ones. As such, it effectively mitigates the perceived complexity associated with creating a physical backdoor dataset, transforming it from a daunting task into an attainable objective. Extensive experiment results show that datasets created by our "recipe" enable adversaries to achieve an impressive attack success rate on real physical world data and exhibit similar properties compared to previous physical backdoor attack studies. This paper offers researchers a valuable toolkit for studies of physical backdoors, all within the confines of their laboratories. △ Less

Submitted 15 March, 2024; v1 submitted 6 December, 2023; originally announced December 2023.

arXiv:2311.13778 [pdf]

Accurate Prediction of Experimental Band Gaps from Large Language Model-Based Data Extraction

Authors: Samuel J. Yang, Shutong Li, Subhashini Venugopalan, Vahe Tshitoyan, Muratahan Aykol, Amil Merchant, Ekin Dogus Cubuk, Gowoon Cheon

Abstract: Machine learning is transforming materials discovery by providing rapid predictions of material properties, which enables large-scale screening for target materials. However, such models require training data. While automated data extraction from scientific literature has potential, current auto-generated datasets often lack sufficient accuracy and critical structural and processing details of mat… ▽ More Machine learning is transforming materials discovery by providing rapid predictions of material properties, which enables large-scale screening for target materials. However, such models require training data. While automated data extraction from scientific literature has potential, current auto-generated datasets often lack sufficient accuracy and critical structural and processing details of materials that influence the properties. Using band gap as an example, we demonstrate Large language model (LLM)-prompt-based extraction yields an order of magnitude lower error rate. Combined with additional prompts to select a subset of experimentally measured properties from pure, single-crystalline bulk materials, this results in an automatically extracted dataset that's larger and more diverse than the largest existing human-curated database of experimental band gaps. Compared to the existing human-curated database, we show the model trained on our extracted database achieves a 19% reduction in the mean absolute error of predicted band gaps. Finally, we demonstrate that LLMs are able to train models predicting band gap on the extracted data, achieving an automated pipeline of data extraction to materials property prediction. △ Less

Submitted 22 November, 2023; originally announced November 2023.

arXiv:2310.00065 [pdf, ps, other]

Elucidating the impact of microstructure on mechanical properties of phase-segregated polyurea: Finite element modeling of molecular dynamics derived microstructures

Authors: Steven J. Yang, Stephanie I. Rosenbloom, Brett P. Fors, Meredith N. Silberstein

Abstract: Phase-segregated polyureas (PU) have received considerable interest due to their use as tough, impact-resistant coatings. Polyureas are favored for these applications due to their mechanical strain rate sensitivity and energy dissipation. Predicting and tailoring the mechanical response of PU remains challenging due to the complex interaction between its elastomeric and glassy phases. To elucidate… ▽ More Phase-segregated polyureas (PU) have received considerable interest due to their use as tough, impact-resistant coatings. Polyureas are favored for these applications due to their mechanical strain rate sensitivity and energy dissipation. Predicting and tailoring the mechanical response of PU remains challenging due to the complex interaction between its elastomeric and glassy phases. To elucidate the role of PU microstructure on its mechanical properties, we developed a finite element modeling framework in which each phase is represented by a volume fraction within a representative volume element (RVE). Critically, we used separate constitutive models to describe the elastomeric and glassy phases. We developed a plasticity-driven breakdown process in which we model the glassy phase disaggregating into a new phase. The overall contribution of each phase at a material point is determined by their respective volume fractions within the RVE. We applied our modeling methods to two compositions of PU with differing elastomeric segment lengths derived from oligoether diamines, Versalink P650 and P1000. Our simulations show that a combination of microstructural differences and elastomeric phase properties accounts for the difference in mechanical response between P650 and P1000. We show our model's ability to predict PU behavior in various loading conditions, including low-rate cyclic loading and monotonic loading over a wide range of strain rates. Our model produces microstructure transformations that mirror those indicated by small-angle X-ray scattering (SAXS) experiments. Fourier transform analysis of our RVEs reveals glassy phase fibrillation due to deformation, a finding consistent with SAXS experiments. △ Less

Submitted 29 September, 2023; originally announced October 2023.

arXiv:2308.16684 [pdf, other]

Everyone Can Attack: Repurpose Lossy Compression as a Natural Backdoor Attack

Authors: Sze Jue Yang, Quang Nguyen, Chee Seng Chan, Khoa D. Doan

Abstract: The vulnerabilities to backdoor attacks have recently threatened the trustworthiness of machine learning models in practical applications. Conventional wisdom suggests that not everyone can be an attacker since the process of designing the trigger generation algorithm often involves significant effort and extensive experimentation to ensure the attack's stealthiness and effectiveness. Alternativel… ▽ More The vulnerabilities to backdoor attacks have recently threatened the trustworthiness of machine learning models in practical applications. Conventional wisdom suggests that not everyone can be an attacker since the process of designing the trigger generation algorithm often involves significant effort and extensive experimentation to ensure the attack's stealthiness and effectiveness. Alternatively, this paper shows that there exists a more severe backdoor threat: anyone can exploit an easily-accessible algorithm for silent backdoor attacks. Specifically, this attacker can employ the widely-used lossy image compression from a plethora of compression tools to effortlessly inject a trigger pattern into an image without leaving any noticeable trace; i.e., the generated triggers are natural artifacts. One does not require extensive knowledge to click on the "convert" or "save as" button while using tools for lossy image compression. Via this attack, the adversary does not need to design a trigger generator as seen in prior works and only requires poisoning the data. Empirically, the proposed attack consistently achieves 100% attack success rate in several benchmark datasets such as MNIST, CIFAR-10, GTSRB and CelebA. More significantly, the proposed attack can still achieve almost 100% attack success rate with very small (approximately 10%) poisoning rates in the clean label setting. The generated trigger of the proposed attack using one lossy compression algorithm is also transferable across other related compression algorithms, exacerbating the severity of this backdoor threat. This work takes another crucial step toward understanding the extensive risks of backdoor attacks in practice, urging practitioners to investigate similar attacks and relevant backdoor mitigation methods. △ Less

Submitted 3 September, 2023; v1 submitted 31 August, 2023; originally announced August 2023.

Comments: 14 pages. This paper shows everyone can mount a powerful and stealthy backdoor attack with the widely-used lossy image compression

arXiv:2308.14376 [pdf, other]

Are Existing Out-Of-Distribution Techniques Suitable for Network Intrusion Detection?

Authors: Andrea Corsini, Shanchieh Jay Yang

Abstract: Machine learning (ML) has become increasingly popular in network intrusion detection. However, ML-based solutions always respond regardless of whether the input data reflects known patterns, a common issue across safety-critical applications. While several proposals exist for detecting Out-Of-Distribution (OOD) in other fields, it remains unclear whether these approaches can effectively identify n… ▽ More Machine learning (ML) has become increasingly popular in network intrusion detection. However, ML-based solutions always respond regardless of whether the input data reflects known patterns, a common issue across safety-critical applications. While several proposals exist for detecting Out-Of-Distribution (OOD) in other fields, it remains unclear whether these approaches can effectively identify new forms of intrusions for network security. New attacks, not necessarily affecting overall distributions, are not guaranteed to be clearly OOD as instead, images depicting new classes are in computer vision. In this work, we investigate whether existing OOD detectors from other fields allow the identification of unknown malicious traffic. We also explore whether more discriminative and semantically richer embedding spaces within models, such as those created with contrastive learning and multi-class tasks, benefit detection. Our investigation covers a set of six OOD techniques that employ different detection strategies. These techniques are applied to models trained in various ways and subsequently exposed to unknown malicious traffic from the same and different datasets (network environments). Our findings suggest that existing detectors can identify a consistent portion of new malicious traffic, and that improved embedding spaces enhance detection. We also demonstrate that simple combinations of certain detectors can identify almost 100% of malicious traffic in our tested scenarios. △ Less

Submitted 28 August, 2023; originally announced August 2023.

arXiv:2306.14062 [pdf, other]

On the Uses of Large Language Models to Interpret Ambiguous Cyberattack Descriptions

Authors: Reza Fayyazi, Shanchieh Jay Yang

Abstract: The volume, variety, and velocity of change in vulnerabilities and exploits have made incident threat analysis challenging with human expertise and experience along. Tactics, Techniques, and Procedures (TTPs) are to describe how and why attackers exploit vulnerabilities. However, a TTP description written by one security professional can be interpreted very differently by another, leading to confu… ▽ More The volume, variety, and velocity of change in vulnerabilities and exploits have made incident threat analysis challenging with human expertise and experience along. Tactics, Techniques, and Procedures (TTPs) are to describe how and why attackers exploit vulnerabilities. However, a TTP description written by one security professional can be interpreted very differently by another, leading to confusion in cybersecurity operations or even business, policy, and legal decisions. Meanwhile, advancements in AI have led to the increasing use of Natural Language Processing (NLP) algorithms to assist the various tasks in cyber operations. With the rise of Large Language Models (LLMs), NLP tasks have significantly improved because of the LLM's semantic understanding and scalability. This leads us to question how well LLMs can interpret TTPs or general cyberattack descriptions to inform analysts of the intended purposes of cyberattacks. We propose to analyze and compare the direct use of LLMs (e.g., GPT-3.5) versus supervised fine-tuning (SFT) of small-scale-LLMs (e.g., BERT) to study their capabilities in predicting ATT&CK tactics. Our results reveal that the small-scale-LLMs with SFT provide a more focused and clearer differentiation between the ATT&CK tactics (if such differentiation exists). On the other hand, direct use of LLMs offer a broader interpretation of cyberattack techniques. When treating more general cases, despite the power of LLMs, inherent ambiguity exists and limits their predictive power. We then summarize the challenges and recommend research directions on LLMs to treat the inherent ambiguity of TTP descriptions used in various cyber operations. △ Less

Submitted 22 August, 2023; v1 submitted 24 June, 2023; originally announced June 2023.

arXiv:2303.07533 [pdf, other]

Speech Intelligibility Classifiers from 550k Disordered Speech Samples

Authors: Subhashini Venugopalan, Jimmy Tobin, Samuel J. Yang, Katie Seaver, Richard J. N. Cave, Pan-Pan Jiang, Neil Zeghidour, Rus Heywood, Jordan Green, Michael P. Brenner

Abstract: We developed dysarthric speech intelligibility classifiers on 551,176 disordered speech samples contributed by a diverse set of 468 speakers, with a range of self-reported speaking disorders and rated for their overall intelligibility on a five-point scale. We trained three models following different deep learning approaches and evaluated them on ~94K utterances from 100 speakers. We further found… ▽ More We developed dysarthric speech intelligibility classifiers on 551,176 disordered speech samples contributed by a diverse set of 468 speakers, with a range of self-reported speaking disorders and rated for their overall intelligibility on a five-point scale. We trained three models following different deep learning approaches and evaluated them on ~94K utterances from 100 speakers. We further found the models to generalize well (without further training) on the TORGO database (100% accuracy), UASpeech (0.93 correlation), ALS-TDI PMP (0.81 AUC) datasets as well as on a dataset of realistic unprompted speech we gathered (106 dysarthric and 76 control speakers,~2300 samples). △ Less

Submitted 15 March, 2023; v1 submitted 13 March, 2023; originally announced March 2023.

Comments: ICASSP 2023 camera-ready

arXiv:2212.13941 [pdf, other]

HeATed Alert Triage (HeAT): Transferrable Learning to Extract Multistage Attack Campaigns

Authors: Stephen Moskal, Shanchieh Jay Yang

Abstract: With growing sophistication and volume of cyber attacks combined with complex network structures, it is becoming extremely difficult for security analysts to corroborate evidences to identify multistage campaigns on their network. This work develops HeAT (Heated Alert Triage): given a critical indicator of compromise (IoC), e.g., a severe IDS alert, HeAT produces a HeATed Attack Campaign (HAC) dep… ▽ More With growing sophistication and volume of cyber attacks combined with complex network structures, it is becoming extremely difficult for security analysts to corroborate evidences to identify multistage campaigns on their network. This work develops HeAT (Heated Alert Triage): given a critical indicator of compromise (IoC), e.g., a severe IDS alert, HeAT produces a HeATed Attack Campaign (HAC) depicting the multistage activities that led up to the critical event. We define the concept of "Alert Episode Heat" to represent the analysts opinion of how much an event contributes to the attack campaign of the critical IoC given their knowledge of the network and security expertise. Leveraging a network-agnostic feature set, HeAT learns the essence of analyst's assessment of "HeAT" for a small set of IoC's, and applies the learned model to extract insightful attack campaigns for IoC's not seen before, even across networks by transferring what have been learned. We demonstrate the capabilities of HeAT with data collected in Collegiate Penetration Testing Competition (CPTC) and through collaboration with a real-world SOC. We developed HeAT-Gain metrics to demonstrate how analysts may assess and benefit from the extracted attack campaigns in comparison to common practices where IP addresses are used to corroborate evidences. Our results demonstrates the practical uses of HeAT by finding campaigns that span across diverse attack stages, remove a significant volume of irrelevant alerts, and achieve coherency to the analyst's original assessments. △ Less

Submitted 28 December, 2022; originally announced December 2022.

arXiv:2107.02783 [pdf, other]

SAGE: Intrusion Alert-driven Attack Graph Extractor

Authors: Azqa Nadeem, Sicco Verwer, Shanchieh Jay Yang

Abstract: Attack graphs (AG) are used to assess pathways availed by cyber adversaries to penetrate a network. State-of-the-art approaches for AG generation focus mostly on deriving dependencies between system vulnerabilities based on network scans and expert knowledge. In real-world operations however, it is costly and ineffective to rely on constant vulnerability scanning and expert-crafted AGs. We propose… ▽ More Attack graphs (AG) are used to assess pathways availed by cyber adversaries to penetrate a network. State-of-the-art approaches for AG generation focus mostly on deriving dependencies between system vulnerabilities based on network scans and expert knowledge. In real-world operations however, it is costly and ineffective to rely on constant vulnerability scanning and expert-crafted AGs. We propose to automatically learn AGs based on actions observed through intrusion alerts, without prior expert knowledge. Specifically, we develop an unsupervised sequence learning system, SAGE, that leverages the temporal and probabilistic dependence between alerts in a suffix-based probabilistic deterministic finite automaton (S-PDFA) -- a model that accentuates infrequent severe alerts and summarizes paths leading to them. AGs are then derived from the S-PDFA on a per-objective, per-victim basis. Tested with intrusion alerts collected through Collegiate Penetration Testing Competition, SAGE compresses over 330k alerts into 93 AGs. These AGs reflect the strategies used by the participating teams. The AGs are succinct, interpretable, and capture behavioral dynamics, e.g., that attackers will often follow shorter paths to re-exploit objectives. △ Less

Submitted 14 October, 2021; v1 submitted 6 July, 2021; originally announced July 2021.

Comments: Appeared at VizSec '21 (proceedings) and KDD AI4Cyber '21 (without proceedings)

arXiv:2106.07961 [pdf, other]

doi 10.1145/3465481.3470065

On the Evaluation of Sequential Machine Learning for Network Intrusion Detection

Authors: Andrea Corsini, Shanchieh Jay Yang, Giovanni Apruzzese

Abstract: Recent advances in deep learning renewed the research interests in machine learning for Network Intrusion Detection Systems (NIDS). Specifically, attention has been given to sequential learning models, due to their ability to extract the temporal characteristics of Network traffic Flows (NetFlows), and use them for NIDS tasks. However, the applications of these sequential models often consist of t… ▽ More Recent advances in deep learning renewed the research interests in machine learning for Network Intrusion Detection Systems (NIDS). Specifically, attention has been given to sequential learning models, due to their ability to extract the temporal characteristics of Network traffic Flows (NetFlows), and use them for NIDS tasks. However, the applications of these sequential models often consist of transferring and adapting methodologies directly from other fields, without an in-depth investigation on how to leverage the specific circumstances of cybersecurity scenarios; moreover, there is a lack of comprehensive studies on sequential models that rely on NetFlow data, which presents significant advantages over traditional full packet captures. We tackle this problem in this paper. We propose a detailed methodology to extract temporal sequences of NetFlows that denote patterns of malicious activities. Then, we apply this methodology to compare the efficacy of sequential learning models against traditional static learning models. In particular, we perform a fair comparison of a `sequential' Long Short-Term Memory (LSTM) against a `static' Feedforward Neural Networks (FNN) in distinct environments represented by two well-known datasets for NIDS: the CICIDS2017 and the CTU13. Our results highlight that LSTM achieves comparable performance to FNN in the CICIDS2017 with over 99.5\% F1-score; while obtaining superior performance in the CTU13, with 95.7\% F1-score against 91.5\%. This paper thus paves the way to future applications of sequential learning models for NIDS. △ Less

Submitted 15 June, 2021; originally announced June 2021.

arXiv:2105.03855 [pdf]

GMOTE: Gaussian based minority oversampling technique for imbalanced classification adapting tail probability of outliers

Authors: Seung Jee Yang, Kyung Joon Cha

Abstract: Classification of imbalanced data is one of the common problems in the recent field of data mining. Imbalanced data substantially affects the performance of standard classification models. Data-level approaches mainly use the oversampling methods to solve the problem, such as synthetic minority oversampling Technique (SMOTE). However, since the methods such as SMOTE generate instances by linear in… ▽ More Classification of imbalanced data is one of the common problems in the recent field of data mining. Imbalanced data substantially affects the performance of standard classification models. Data-level approaches mainly use the oversampling methods to solve the problem, such as synthetic minority oversampling Technique (SMOTE). However, since the methods such as SMOTE generate instances by linear interpolation, synthetic data space may look like a polygonal. Also, the oversampling methods generate outliers of the minority class. In this paper, we proposed Gaussian based minority oversampling technique (GMOTE) with a statistical perspective for imbalanced datasets. To avoid linear interpolation and to consider outliers, this proposed method generates instances by the Gaussian Mixture Model. Motivated by clustering-based multivariate Gaussian outlier score (CMGOS), we propose to adapt tail probability of instances through the Mahalanobis distance to consider local outliers. The experiment was carried out on a representative set of benchmark datasets. The performance of the GMOTE is compared with other methods such as SMOTE. When the GMOTE is combined with classification and regression tree (CART) or support vector machine (SVM), it shows better accuracy and F1-Score. Experimental results demonstrate the robust performance. △ Less

Submitted 9 May, 2021; originally announced May 2021.

Comments: 20 pages, 6 figures

MSC Class: 62P99

arXiv:2105.01210 [pdf]

doi 10.1021/acs.chemmater.1c00267

Single-Crystalline Metallic Films Induced by van der Waals Epitaxy on Black Phosphorus

Authors: Yangjin Lee, Han-gyu Kim, Tae Keun Yun, Jong Chan Kim, Sol Lee, Sung Jin Yang, Myeongjin Jang, Donggyu Kim, Huije Ryu, Gwan-Hyoung Lee, Seongil Im, Hu Young Jeong, Hyoung Joon Choi, Kwanpyo Kim

Abstract: The properties of metal-semiconductor junctions are often unpredictable because of non-ideal interfacial structures, such as interfacial defects or chemical reactions introduced at junctions. Black phosphorus (BP), an elemental two-dimensional (2D) semiconducting crystal, possesses the puckered atomic structure with high chemical reactivity, and the establishment of a realistic atomic-scale pictur… ▽ More The properties of metal-semiconductor junctions are often unpredictable because of non-ideal interfacial structures, such as interfacial defects or chemical reactions introduced at junctions. Black phosphorus (BP), an elemental two-dimensional (2D) semiconducting crystal, possesses the puckered atomic structure with high chemical reactivity, and the establishment of a realistic atomic-scale picture of BP's interface toward metallic contact has remained elusive. Here we examine the interfacial structures and properties of physically-deposited metals of various kinds on BP. We find that Au, Ag, and Bi form single-crystalline films with (110) orientation through guided van der Waals epitaxy. Transmission electron microscopy and X-ray photoelectron spectroscopy confirm that atomically sharp van der Waals metal-BP interfaces forms with exceptional rotational alignment. Under a weak metal-BP interaction regime, the BP's puckered structure play an essential role in the adatom assembly process and can lead to the formation of a single crystal, which is supported by our theoretical analysis and calculations. The experimental survey also demonstrates that the BP-metal junctions can exhibit various types of interfacial structures depending on metals, such as the formation of polycrystalline microstructure or metal phosphides. This study provides a guideline for obtaining a realistic view on metal-2D semiconductor interfacial structures, especially for atomically puckered 2D crystals. △ Less

Submitted 3 May, 2021; originally announced May 2021.

Comments: 27 pages, 5 figures

Journal ref: Chemistry of Materials, 2021

arXiv:2103.13902 [pdf, other]

Near Real-time Learning and Extraction of Attack Models from Intrusion Alerts

Authors: Shanchieh Jay Yang, Ahmet Okutan, Gordon Werner, Shao-Hsuan Su, Ayush Goel, Nathan D. Cahill

Abstract: Critical and sophisticated cyberattacks often take multitudes of reconnaissance, exploitations, and obfuscation techniques to penetrate through well protected enterprise networks. The discovery and detection of attacks, though needing continuous efforts, is no longer sufficient. Security Operation Center (SOC) analysts are overwhelmed by the significant volume of intrusion alerts without being abl… ▽ More Critical and sophisticated cyberattacks often take multitudes of reconnaissance, exploitations, and obfuscation techniques to penetrate through well protected enterprise networks. The discovery and detection of attacks, though needing continuous efforts, is no longer sufficient. Security Operation Center (SOC) analysts are overwhelmed by the significant volume of intrusion alerts without being able to extract actionable intelligence. Recognizing this challenge, this paper describes the advances and findings through deploying ASSERT to process intrusion alerts from OmniSOC in collaboration with the Center for Applied Cybersecurity Research (CACR) at Indiana University. ASSERT utilizes information theoretic unsupervised learning to extract and update `attack models' in near real-time without expert knowledge. It consumes streaming intrusion alerts and generates a small number of statistical models for SOC analysts to comprehend ongoing and emerging attacks in a timely manner. This paper presents the architecture and key processes of ASSERT and discusses a few real-world attack models to highlight the use-cases that benefit SOC operations. The research team is developing a light-weight containerized ASSERT that will be shared through a public repository to help the community combat the overwhelming intrusion alerts. △ Less

Submitted 25 March, 2021; originally announced March 2021.

arXiv:2004.04306 [pdf, other]

Physics-enhanced machine learning for virtual fluorescence microscopy

Authors: Colin L. Cooke, Fanjie Kong, Amey Chaware, Kevin C. Zhou, Kanghyun Kim, Rong Xu, D. Michael Ando, Samuel J. Yang, Pavan Chandra Konda, Roarke Horstmeyer

Abstract: This paper introduces a new method of data-driven microscope design for virtual fluorescence microscopy. Our results show that by including a model of illumination within the first layers of a deep convolutional neural network, it is possible to learn task-specific LED patterns that substantially improve the ability to infer fluorescence image information from unstained transmission microscopy ima… ▽ More This paper introduces a new method of data-driven microscope design for virtual fluorescence microscopy. Our results show that by including a model of illumination within the first layers of a deep convolutional neural network, it is possible to learn task-specific LED patterns that substantially improve the ability to infer fluorescence image information from unstained transmission microscopy images. We validated our method on two different experimental setups, with different magnifications and different sample types, to show a consistent improvement in performance as compared to conventional illumination methods. Additionally, to understand the importance of learned illumination on inference task, we varied the dynamic range of the fluorescent image targets (from one to seven bits), and showed that the margin of improvement for learned patterns increased with the information content of the target. This work demonstrates the power of programmable optical elements at enabling better machine learning algorithm performance and at providing physical insight into next generation of machine-controlled imaging systems. △ Less

Submitted 21 April, 2020; v1 submitted 8 April, 2020; originally announced April 2020.

Comments: 12 pages, 13 figures

arXiv:2002.07838 [pdf, other]

Cyberattack Action-Intent-Framework for Mapping Intrusion Observables

Authors: Stephen Moskal, Shanchieh Jay Yang

Abstract: The techniques and tactics used by cyber adversaries are becoming more sophisticated, ironically, as defense getting stronger and the cost of a breach continuing to rise. Understanding the thought processes and behaviors of adversaries is extremely challenging as high profile or even amateur attackers have no incentive to share the trades associated with their illegal activities. One opportunity t… ▽ More The techniques and tactics used by cyber adversaries are becoming more sophisticated, ironically, as defense getting stronger and the cost of a breach continuing to rise. Understanding the thought processes and behaviors of adversaries is extremely challenging as high profile or even amateur attackers have no incentive to share the trades associated with their illegal activities. One opportunity to observe the actions the adversaries perform is through the use of Intrusion Detection Systems (IDS) which generate alerts in the event that suspicious behavior was detected. The alerts raised by these systems typically describe the suspicious actions via the form of attack 'signature', which do not necessarily reveal the true intent of the attacker performing the action. Meanwhile, several high level frameworks exist to describe the sequence or chain of action types an adversary might perform. These frameworks, however, do not connect the action types to observables of standard intrusion detection systems, nor describing the plausible intents of the adversarial actions. To address these gaps, this work proposes the Action-Intent Framework (AIF) to complement existing Cyber Attack Kill Chains and Attack Taxonomies. The AIF defines a set of Action-Intent States (AIS) at two levels of description: the Macro-AIS describes 'what' the attacker is trying to achieve and the Micro-AIS describes "how" the intended goal is achieved. A full description of both the Macro is provided along with a set of guiding principals of how the AIS is derived and added to the framework. △ Less

Submitted 21 February, 2020; v1 submitted 18 February, 2020; originally announced February 2020.

arXiv:1908.01219 [pdf, other]

On the Veracity of Cyber Intrusion Alerts Synthesized by Generative Adversarial Networks

Authors: Christopher Sweet, Stephen Moskal, Shanchieh Jay Yang

Abstract: Recreating cyber-attack alert data with a high level of fidelity is challenging due to the intricate interaction between features, non-homogeneity of alerts, and potential for rare yet critical samples. Generative Adversarial Networks (GANs) have been shown to effectively learn complex data distributions with the intent of creating increasingly realistic data. This paper presents the application o… ▽ More Recreating cyber-attack alert data with a high level of fidelity is challenging due to the intricate interaction between features, non-homogeneity of alerts, and potential for rare yet critical samples. Generative Adversarial Networks (GANs) have been shown to effectively learn complex data distributions with the intent of creating increasingly realistic data. This paper presents the application of GANs to cyber-attack alert data and shows that GANs not only successfully learn to generate realistic alerts, but also reveal feature dependencies within alerts. This is accomplished by reviewing the intersection of histograms for varying alert-feature combinations between the ground truth and generated datsets. Traditional statistical metrics, such as conditional and joint entropy, are also employed to verify the accuracy of these dependencies. Finally, it is shown that a Mutual Information constraint on the network can be used to increase the generation of low probability, critical, alert values. By mapping alerts to a set of attack stages it is shown that the output of these low probability alerts has a direct contextual meaning for Cyber Security analysts. Overall, this work provides the basis for generating new cyber intrusion alerts and provides evidence that synthesized alerts emulate critical dependencies from the source dataset. △ Less

Submitted 3 August, 2019; originally announced August 2019.

arXiv:1809.01562 [pdf, other]

Probabilistic Modeling and Inference for Obfuscated Cyber Attack Sequences

Authors: Haitao Du, Shanchieh Jay Yang

Abstract: A key element in defending computer networks is to recognize the types of cyber attacks based on the observed malicious activities. Obfuscation onto what could have been observed of an attack sequence may lead to mis-interpretation of its effect and intent, leading to ineffective defense or recovery deployments. This work develops probabilistic graphical models to generalize a few obfuscation tech… ▽ More A key element in defending computer networks is to recognize the types of cyber attacks based on the observed malicious activities. Obfuscation onto what could have been observed of an attack sequence may lead to mis-interpretation of its effect and intent, leading to ineffective defense or recovery deployments. This work develops probabilistic graphical models to generalize a few obfuscation techniques and to enable analyses of the Expected Classification Accuracy (ECA) as a result of these different obfuscation on various attack models. Determining the ECA is a NP-Hard problem due to the combinatorial number of possibilities. This paper presents several polynomial-time algorithms to find the theoretically bounded approximation of ECA under different attack obfuscation models. Comprehensive simulation shows the impact on ECA due to alteration, insertion and removal of attack action sequence, with increasing observation length, level of obfuscation and model complexity. △ Less

Submitted 5 September, 2018; originally announced September 2018.

arXiv:1804.07646 [pdf]

Toward Intelligent Autonomous Agents for Cyber Defense: Report of the 2017 Workshop by the North Atlantic Treaty Organization (NATO) Research Group IST-152-RTG

Authors: Alexander Kott, Ryan Thomas, Martin Drašar, Markus Kont, Alex Poylisher, Benjamin Blakely, Paul Theron, Nathaniel Evans, Nandi Leslie, Rajdeep Singh, Maria Rigaki, S Jay Yang, Benoit LeBlanc, Paul Losiewicz, Sylvain Hourlier, Misty Blowers, Hugh Harney, Gregory Wehner, Alessandro Guarino, Jana Komárková, James Rowell

Abstract: This report summarizes the discussions and findings of the Workshop on Intelligent Autonomous Agents for Cyber Defence and Resilience organized by the NATO research group IST-152-RTG. The workshop was held in Prague, Czech Republic, on 18-20 October 2017. There is a growing recognition that future cyber defense should involve extensive use of partially autonomous agents that actively patrol the fr… ▽ More This report summarizes the discussions and findings of the Workshop on Intelligent Autonomous Agents for Cyber Defence and Resilience organized by the NATO research group IST-152-RTG. The workshop was held in Prague, Czech Republic, on 18-20 October 2017. There is a growing recognition that future cyber defense should involve extensive use of partially autonomous agents that actively patrol the friendly network, and detect and react to hostile activities rapidly (far faster than human reaction time), before the hostile malware is able to inflict major damage, evade friendly agents, or destroy friendly agents. This requires cyber-defense agents with a significant degree of intelligence, autonomy, self-learning, and adaptability. The report focuses on the following questions: In what computing and tactical environments would such an agent operate? What data would be available for the agent to observe or ingest? What actions would the agent be able to take? How would such an agent plan a complex course of actions? Would the agent learn from its experiences, and how? How would the agent collaborate with humans? How can we ensure that the agent will not take undesirable destructive actions? Is it possible to help envision such an agent with a simple example? △ Less

Submitted 20 April, 2018; originally announced April 2018.

Report number: ARL-SR-0395

arXiv:1803.09560 [pdf, other]

Forecasting Cyber Attacks with Imbalanced Data Sets and Different Time Granularities

Authors: Ahmet Okutan, Shanchieh Jay Yang, Katie McConky

Abstract: If cyber incidents are predicted a reasonable amount of time before they occur, defensive actions to prevent their destructive effects could be planned. Unfortunately, most of the time we do not have enough observables of the malicious activities before they are already under way. Therefore, this work suggests to use unconventional signals extracted from various data sources with different time gr… ▽ More If cyber incidents are predicted a reasonable amount of time before they occur, defensive actions to prevent their destructive effects could be planned. Unfortunately, most of the time we do not have enough observables of the malicious activities before they are already under way. Therefore, this work suggests to use unconventional signals extracted from various data sources with different time granularities to predict cyber incidents for target entities. A Bayesian network is used to predict cyber attacks where the unconventional signals are used as indicative random variables. This work also develops a novel minority class over sampling technique to improve cyber attack prediction on imbalanced data sets. The results show that depending on the selected time granularity, the unconventional signals are able to predict cyber attacks for the anonimyzed target organization even though the signals are not explicitly related to that organization. Furthermore, the minority over sampling approach developed achieves better performance compared to the existing filtering techniques in the literature. △ Less

Submitted 26 March, 2018; originally announced March 2018.

arXiv:1711.00882 [pdf, other]

Correcting Nuisance Variation using Wasserstein Distance

Authors: Gil Tabak, Minjie Fan, Samuel J. Yang, Stephan Hoyer, Geoff Davis

Abstract: Profiling cellular phenotypes from microscopic imaging can provide meaningful biological information resulting from various factors affecting the cells. One motivating application is drug development: morphological cell features can be captured from images, from which similarities between different drug compounds applied at different doses can be quantified. The general approach is to find a funct… ▽ More Profiling cellular phenotypes from microscopic imaging can provide meaningful biological information resulting from various factors affecting the cells. One motivating application is drug development: morphological cell features can be captured from images, from which similarities between different drug compounds applied at different doses can be quantified. The general approach is to find a function mapping the images to an embedding space of manageable dimensionality whose geometry captures relevant features of the input images. An important known issue for such methods is separating relevant biological signal from nuisance variation. For example, the embedding vectors tend to be more correlated for cells that were cultured and imaged during the same week than for those from different weeks, despite having identical drug compounds applied in both cases. In this case, the particular batch in which a set of experiments were conducted constitutes the domain of the data; an ideal set of image embeddings should contain only the relevant biological information (e.g. drug effects). We develop a general framework for adjusting the image embeddings in order to `forget' domain-specific information while preserving relevant biological information. To achieve this, we minimize a loss function based on distances between marginal distributions (such as the Wasserstein distance) of embeddings across domains for each replicated treatment. For the dataset we present results with, the only replicated treatment happens to be the negative control treatment, for which we do not expect any treatment-induced cell morphology changes. We find that for our transformed embeddings (i) the underlying geometric structure is not only preserved but the embeddings also carry improved biological signal; and (ii) less domain-specific information is present. △ Less

Submitted 17 June, 2019; v1 submitted 2 November, 2017; originally announced November 2017.

Comments: 21 pages, 20 figures, 6 tables

Showing 1–22 of 22 results for author: Yang, S J