-
EMP: Effective Multidimensional Persistence for Graph Representation Learning
Authors:
Ignacio Segovia-Dominguez,
Yuzhou Chen,
Cuneyt G. Akcora,
Zhiwei Zhen,
Murat Kantarcioglu,
Yulia R. Gel,
Baris Coskunuzer
Abstract:
Topological data analysis (TDA) is gaining prominence across a wide spectrum of machine learning tasks that spans from manifold learning to graph classification. A pivotal technique within TDA is persistent homology (PH), which furnishes an exclusive topological imprint of data by tracing the evolution of latent structures as a scale parameter changes. Present PH tools are confined to analyzing data through a single filter parameter. However, many scenarios necessitate the consideration of multiple relevant parameters to attain finer insights into the data. We address this issue by introducing the Effective Multidimensional Persistence (EMP) framework. This framework empowers the exploration of data by simultaneously varying multiple scale parameters. The framework integrates descriptor functions into the analysis process, yielding a highly expressive data summary. It seamlessly integrates established single PH summaries into multidimensional counterparts like EMP Landscapes, Silhouettes, Images, and Surfaces. These summaries represent data's multidimensional aspects as matrices and arrays, aligning effectively with diverse ML models. We provide theoretical guarantees and stability proofs for EMP summaries. We demonstrate EMP's utility in graph classification tasks, showing its effectiveness. Results reveal that EMP enhances various single PH descriptors, outperforming cutting-edge methods on multiple benchmark datasets.
Submitted 23 January, 2024;
originally announced January 2024.
-
Using AI Uncertainty Quantification to Improve Human Decision-Making
Authors:
Laura R. Marusich,
Jonathan Z. Bakdash,
Yan Zhou,
Murat Kantarcioglu
Abstract:
AI Uncertainty Quantification (UQ) has the potential to improve human decision-making beyond AI predictions alone by providing additional probabilistic information to users. The majority of past research on AI and human decision-making has concentrated on model explainability and interpretability, with little focus on understanding the potential impact of UQ on human decision-making. We evaluated the impact on human decision-making for instance-level UQ, calibrated using a strict scoring rule, in two online behavioral experiments. In the first experiment, our results showed that UQ was beneficial for decision-making performance compared to only AI predictions. In the second experiment, we found UQ had generalizable benefits for decision-making across a variety of representations for probabilistic information. These results indicate that implementing high quality, instance-level UQ for AI may improve decision-making with real systems compared to AI predictions alone.
Submitted 6 February, 2024; v1 submitted 19 September, 2023;
originally announced September 2023.
-
Game Theory in Distributed Systems Security: Foundations, Challenges, and Future Directions
Authors:
Mustafa Abdallah,
Saurabh Bagchi,
Shaunak D. Bopardikar,
Kevin Chan,
Xing Gao,
Murat Kantarcioglu,
Congmiao Li,
Peng Liu,
Quanyan Zhu
Abstract:
Many of our critical infrastructure systems and personal computing systems have a distributed computing systems structure. The incentives to attack them have been growing rapidly as has their attack surface due to increasing levels of connectedness. Therefore, we feel it is time to bring in rigorous reasoning to secure such systems. The distributed system security and the game theory technical communities can come together to effectively address this challenge. In this article, we lay out the foundations from each that we can build upon to achieve our goals. Next, we describe a set of research challenges for the community, organized into three categories -- analytical, systems, and integration challenges, each with "short term" time horizon (2-3 years) and "long term" (5-10 years) items. This article was conceived of through a community discussion at the 2022 NSF SaTC PI meeting.
Submitted 28 May, 2024; v1 submitted 3 September, 2023;
originally announced September 2023.
-
Chainlet Orbits: Topological Address Embedding for the Bitcoin Blockchain
Authors:
Poupak Azad,
Baris Coskunuzer,
Murat Kantarcioglu,
Cuneyt Gurcan Akcora
Abstract:
The rise of cryptocurrencies like Bitcoin, which enable transactions with a degree of pseudonymity, has led to a surge in various illicit activities, including ransomware payments and transactions on darknet markets. These illegal activities often utilize Bitcoin as the preferred payment method. However, current tools for detecting illicit behavior either rely on a few heuristics and laborious data collection processes or employ computationally inefficient graph neural network (GNN) models that are challenging to interpret.
To overcome the computational and interpretability limitations of existing techniques, we introduce an effective solution called Chainlet Orbits. This approach embeds Bitcoin addresses by leveraging their topological characteristics in transactions. By employing our innovative address embedding, we investigate e-crime in Bitcoin networks by focusing on distinctive substructures that arise from illicit behavior.
The results of our node classification experiments demonstrate superior performance compared to state-of-the-art methods, including both topological and GNN-based approaches. Moreover, our approach enables the use of interpretable and explainable machine learning models in as little as 15 minutes for most days on the Bitcoin transaction network.
Submitted 18 May, 2023;
originally announced June 2023.
-
Interpreting GNN-based IDS Detections Using Provenance Graph Structural Features
Authors:
Kunal Mukherjee,
Joshua Wiedemeier,
Tianhao Wang,
Muhyun Kim,
Feng Chen,
Murat Kantarcioglu,
Kangkook Jee
Abstract:
The black-box nature of complex Neural Network (NN)-based models has hindered their widespread adoption in security domains due to the lack of logical explanations and actionable follow-ups for their predictions. To enhance the transparency and accountability of Graph Neural Network (GNN) security models used in system provenance analysis, we propose PROVEXPLAINER, a framework for projecting abstract GNN decision boundaries onto interpretable feature spaces.
We first replicate the decision-making process of GNN-based security models using simpler and explainable models such as Decision Trees (DTs). To maximize the accuracy and fidelity of the surrogate models, we propose novel graph structural features founded on classical graph theory and enhanced by extensive data study with security domain knowledge. Our graph structural features are closely tied to problem-space actions in the system provenance domain, which allows the detection results to be explained in descriptive, human language. PROVEXPLAINER allowed simple DT models to achieve 95% fidelity to the GNN on program classification tasks with general graph structural features, and 99% fidelity on malware detection tasks with a task-specific feature package tailored for direct interpretation. The explanations for malware classification are demonstrated with case studies of five real-world malware samples across three malware families.
Submitted 6 June, 2023; v1 submitted 1 June, 2023;
originally announced June 2023.
-
IoTFlowGenerator: Crafting Synthetic IoT Device Traffic Flows for Cyber Deception
Authors:
Joseph Bao,
Murat Kantarcioglu,
Yevgeniy Vorobeychik,
Charles Kamhoua
Abstract:
Over the years, honeypots emerged as an important security tool to understand attacker intent and deceive attackers to spend time and resources. Recently, honeypots are being deployed for Internet of things (IoT) devices to lure attackers, and learn their behavior. However, most of the existing IoT honeypots, even the high interaction ones, are easily detected by an attacker who can observe honeypot traffic due to lack of real network traffic originating from the honeypot. This implies that, to build better honeypots and enhance cyber deception capabilities, IoT honeypots need to generate realistic network traffic flows. To achieve this goal, we propose a novel deep learning based approach for generating traffic flows that mimic real network traffic due to user and IoT device interactions. A key technical challenge that our approach overcomes is scarcity of device-specific IoT traffic data to effectively train a generator. We address this challenge by leveraging a core generative adversarial learning algorithm for sequences along with domain specific knowledge common to IoT devices. Through an extensive experimental evaluation with 18 IoT devices, we demonstrate that the proposed synthetic IoT traffic generation tool significantly outperforms state of the art sequence and packet generators in remaining indistinguishable from real traffic even to an adaptive attacker.
Submitted 1 May, 2023;
originally announced May 2023.
-
Reduction Algorithms for Persistence Diagrams of Networks: CoralTDA and PrunIT
Authors:
Cuneyt Gurcan Akcora,
Murat Kantarcioglu,
Yulia R. Gel,
Baris Coskunuzer
Abstract:
Topological data analysis (TDA) delivers invaluable and complementary information on the intrinsic properties of data inaccessible to conventional methods. However, high computational costs remain the primary roadblock hindering the successful application of TDA in real-world studies, particularly with machine learning on large complex networks.
Indeed, most modern networks such as citation, blockchain, and online social networks often have hundreds of thousands of vertices, making the application of existing TDA methods infeasible. We develop two new, remarkably simple but effective algorithms to compute the exact persistence diagrams of large graphs to address this major TDA limitation. First, we prove that $(k+1)$-core of a graph $\mathcal{G}$ suffices to compute its $k^{th}$ persistence diagram, $PD_k(\mathcal{G})$. Second, we introduce a pruning algorithm for graphs to compute their persistence diagrams by removing the dominated vertices. Our experiments on large networks show that our novel approach can achieve computational gains up to 95%.
The developed framework provides the first bridge between the graph theory and TDA, with applications in machine learning of large complex networks. Our implementation is available at https://github.com/cakcora/PersistentHomologyWithCoralPrunit
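As a rough illustration of the first reduction described in this abstract, the following sketch extracts the k-core of a graph by repeatedly peeling vertices of degree less than k. It is a generic textbook peeling routine, not the authors' implementation (which is at the repository linked above); the function name `k_core` and the adjacency-dictionary input format are assumptions made for this example. Under the abstract's claim, calling it with `k + 1` would give the subgraph that suffices for the $k^{th}$ persistence diagram.

```python
def k_core(adj, k):
    """Return the vertex set of the k-core of an undirected graph.

    adj: dict mapping each vertex to the set of its neighbours (symmetric).
    The k-core is the maximal subgraph in which every vertex has degree >= k.
    """
    alive = set(adj)
    deg = {v: len(adj[v]) for v in adj}
    # Start with every vertex whose degree is already too small.
    stack = [v for v in alive if deg[v] < k]
    while stack:
        v = stack.pop()
        if v not in alive:
            continue  # already peeled via an earlier stack entry
        alive.remove(v)
        for u in adj[v]:
            if u in alive:
                deg[u] -= 1
                if deg[u] < k:
                    stack.append(u)
    return alive


# Hypothetical toy graph: a triangle (0, 1, 2) with a pendant vertex 3 on vertex 2.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
two_core = k_core(toy, 2)  # the pendant vertex is peeled; the triangle remains
```

On the toy graph, the 2-core keeps the triangle and drops the pendant vertex, so the 1-dimensional cycle is preserved while the graph shrinks.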
Submitted 24 November, 2022;
originally announced November 2022.
-
The Impact of Data Distribution on Fairness and Robustness in Federated Learning
Authors:
Mustafa Safa Ozdayi,
Murat Kantarcioglu
Abstract:
Federated Learning (FL) is a distributed machine learning protocol that allows a set of agents to collaboratively train a model without sharing their datasets. This makes FL particularly suitable for settings where data privacy is desired. However, it has been observed that the performance of FL is closely related to the similarity of the local data distributions of agents. Particularly, as the data distributions of agents differ, the accuracy of the trained models drops. In this work, we look at how variations in local data distributions affect the fairness and the robustness properties of the trained models in addition to the accuracy. Our experimental results indicate that the trained models exhibit higher bias and become more susceptible to attacks as local data distributions differ. Importantly, the degradation in fairness and robustness can be much more severe than the degradation in accuracy. Therefore, we reveal that small variations that have little impact on the accuracy could still be important if the trained model is to be deployed in a fairness/security critical context.
Submitted 29 November, 2021;
originally announced December 2021.
-
Multi-concept adversarial attacks
Authors:
Vibha Belavadi,
Yan Zhou,
Murat Kantarcioglu,
Bhavani M. Thuraisingham
Abstract:
As machine learning (ML) techniques are being increasingly used in many applications, their vulnerability to adversarial attacks becomes well-known. Test time attacks, usually launched by adding adversarial noise to test instances, have been shown effective against the deployed ML models. In practice, one test input may be leveraged by different ML models. Test time attacks targeting a single ML model often neglect their impact on other ML models. In this work, we empirically demonstrate that naively attacking the classifier learning one concept may negatively impact classifiers trained to learn other concepts. For example, for the online image classification scenario, when the Gender classifier is under attack, the (wearing) Glasses classifier is simultaneously attacked with the accuracy dropped from 98.69 to 88.42. This raises an interesting question: is it possible to attack one set of classifiers without impacting the other set that uses the same test instance? Answers to the above research question have interesting implications for protecting privacy against ML model misuse. Attacking ML models that pose unnecessary risks of privacy invasion can be an important tool for protecting individuals from harmful privacy exploitation. In this paper, we address the above research question by developing novel attack techniques that can simultaneously attack one set of ML models while preserving the accuracy of the other. In the case of linear classifiers, we provide a theoretical framework for finding an optimal solution to generate such adversarial examples. Using this theoretical framework, we develop a multi-concept attack strategy in the context of deep learning. Our results demonstrate that our techniques can successfully attack the target classes while protecting the protected classes in many different settings, which is not possible with the existing test-time attack-single strategies.
Submitted 19 October, 2021;
originally announced October 2021.
-
Learning Generative Deception Strategies in Combinatorial Masking Games
Authors:
Junlin Wu,
Charles Kamhoua,
Murat Kantarcioglu,
Yevgeniy Vorobeychik
Abstract:
Deception is a crucial tool in the cyberdefence repertoire, enabling defenders to leverage their informational advantage to reduce the likelihood of successful attacks. One way deception can be employed is through obscuring, or masking, some of the information about how systems are configured, increasing attacker's uncertainty about their targets. We present a novel game-theoretic model of the resulting defender-attacker interaction, where the defender chooses a subset of attributes to mask, while the attacker responds by choosing an exploit to execute. The strategies of both players have combinatorial structure with complex informational dependencies, and therefore even representing these strategies is not trivial. First, we show that the problem of computing an equilibrium of the resulting zero-sum defender-attacker game can be represented as a linear program with a combinatorial number of system configuration variables and constraints, and develop a constraint generation approach for solving this problem. Next, we present a novel highly scalable approach for approximately solving such games by representing the strategies of both players as neural networks. The key idea is to represent the defender's mixed strategy using a deep neural network generator, and then using alternating gradient-descent-ascent algorithm, analogous to the training of Generative Adversarial Networks. Our experiments, as well as a case study, demonstrate the efficacy of the proposed approach.
Submitted 17 June, 2022; v1 submitted 23 September, 2021;
originally announced September 2021.
-
Dynamically Adjusting Case Reporting Policy to Maximize Privacy and Utility in the Face of a Pandemic
Authors:
J. Thomas Brown,
Chao Yan,
Weiyi Xia,
Zhijun Yin,
Zhiyu Wan,
Aris Gkoulalas-Divanis,
Murat Kantarcioglu,
Bradley A. Malin
Abstract:
Supporting public health research and the public's situational awareness during a pandemic requires continuous dissemination of infectious disease surveillance data. Legislation, such as the Health Insurance Portability and Accountability Act of 1996 (HIPAA) and recent state-level regulations, permits sharing de-identified person-level data; however, current de-identification approaches are limited. Namely, they are inefficient, relying on retrospective disclosure risk assessments, and do not flex with changes in infection rates or population demographics over time. In this paper, we introduce a framework to dynamically adapt de-identification for near-real time sharing of person-level surveillance data. The framework leverages a simulation mechanism, capable of application at any geographic level, to forecast the re-identification risk of sharing the data under a wide range of generalization policies. The estimates inform weekly, prospective policy selection to maintain the proportion of records corresponding to a group size less than 11 (PK11) at or below 0.1. Fixing the policy at the start of each week facilitates timely dataset updates and supports sharing granular date information. We use August 2020 through October 2021 case data from Johns Hopkins University and the Centers for Disease Control and Prevention to demonstrate the framework's effectiveness in maintaining the PK11 threshold of 0.01. When sharing COVID-19 county-level case data across all US counties, the framework's approach meets the threshold for 96.2% of daily data releases, while a policy based on current de-identification techniques meets the threshold for 32.3%. Periodically adapting the data publication policies preserves privacy while enhancing public health utility through timely updates and sharing epidemiologically critical features.
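To make the PK11 measure in this abstract concrete, here is a minimal sketch that computes the proportion of records falling in a group smaller than 11. It assumes each record has already been reduced to a hashable tuple of generalized quasi-identifiers; the function name `pk`, the record layout, and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter


def pk(records, k=11):
    """Proportion of records whose group (identical tuple) has fewer than k members."""
    if not records:
        return 0.0
    group_sizes = Counter(records)
    # Count records, not groups: each record in a small group is at risk.
    at_risk = sum(1 for r in records if group_sizes[r] < k)
    return at_risk / len(records)


# Hypothetical toy data: (age band, sex) tuples after generalization.
records = [("30-39", "F")] * 12 + [("80+", "M")] * 3
pk11 = pk(records)  # 3 of the 15 records sit in a group smaller than 11
```

A policy check in this style would compare `pk11` against the chosen threshold; in the toy data the value is 3/15, so a coarser generalization of the small group would be needed to bring it down.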
Submitted 25 February, 2022; v1 submitted 21 June, 2021;
originally announced June 2021.
-
The Queen's Guard: A Secure Enforcement of Fine-grained Access Control In Distributed Data Analytics Platforms
Authors:
Fahad Shaon,
Sazzadur Rahaman,
Murat Kantarcioglu
Abstract:
Distributed data analytics platforms (e.g., Apache Spark, Hadoop) provide high-level APIs to programmatically write analytics tasks that are run in a distributed manner across multiple computing nodes. The design of these frameworks was primarily motivated by performance and usability. Thus, security takes a back seat. Consequently, they do not inherently support fine-grained access control or offer any plugin mechanism to enable it, making them risky to use in multi-tier organizational settings.
There have been attempts to build "add-on" solutions to enable fine-grained access control for distributed data analytics platforms. In this paper, first, we show that straightforward enforcement of "add-on" access control is insecure under adversarial code execution. Specifically, we show that an attacker can abuse platform-provided APIs to evade access controls without leaving any traces. Second, we designed a two-layered (i.e., proactive and reactive) defense system to protect against API abuses. On submission of a user code, our proactive security layer statically screens it to find potential attack signatures prior to its execution. The reactive security layer employs code instrumentation-based runtime checks and sandboxed execution to throttle any exploits at runtime. Next, we propose a new fine-grained access control framework with an enhanced policy language that supports map and filter primitives. Finally, we build a system named SecureDL with our new access control framework and defense system on top of Apache Spark, which ensures secure access control policy enforcement under adversaries capable of executing code.
To the best of our knowledge, this is the first fine-grained attribute-based access control framework for distributed data analytics platforms that is secure against platform API abuse attacks. Performance evaluation showed that the overhead due to added security is low.
Submitted 3 December, 2023; v1 submitted 24 June, 2021;
originally announced June 2021.
-
Fair Machine Learning under Limited Demographically Labeled Data
Authors:
Mustafa Safa Ozdayi,
Murat Kantarcioglu,
Rishabh Iyer
Abstract:
Research has shown that machine learning models might inherit and propagate undesired social biases encoded in the data. To address this problem, fair training algorithms have been developed. However, most algorithms assume we know demographic/sensitive data features such as gender and race. This assumption falls short in scenarios where collecting demographic information is not feasible due to privacy concerns and data protection policies. A recent line of work develops fair training methods that can function without any demographic feature on the data, collectively referred to as Rawlsian methods. Yet, we show in experiments that Rawlsian methods tend to exhibit relatively high bias. Given this, we look at the middle ground between the previous approaches, and consider a setting where we know the demographic attributes for only a small subset of our data. In such a setting, we design fair training algorithms which exhibit both good utility and low bias. In particular, we show that our techniques can train models to significantly outperform Rawlsian approaches even when 0.1% of demographic attributes are available in the training data. Furthermore, our main algorithm can accommodate multiple training objectives easily. We expand our main algorithm to achieve robustness to label noise in addition to fairness in the limited demographics setting to highlight that property as well.
Submitted 10 April, 2022; v1 submitted 3 June, 2021;
originally announced June 2021.
-
Topological Anomaly Detection in Dynamic Multilayer Blockchain Networks
Authors:
Dorcas Ofori-Boateng,
Ignacio Segovia Dominguez,
Murat Kantarcioglu,
Cuneyt G. Akcora,
Yulia R. Gel
Abstract:
Motivated by the recent surge of criminal activities with cross-cryptocurrency trades, we introduce a new topological perspective to structural anomaly detection in dynamic multilayer networks. We postulate that anomalies in the underlying blockchain transaction graph that are composed of multiple layers are likely to also be manifested in anomalous patterns of the network shape properties. As such, we invoke the machinery of clique persistent homology on graphs to systematically and efficiently track the evolution of the network shape and, as a result, to detect changes in the underlying network topology and geometry. We develop a new persistence summary for multilayer networks, called stacked persistence diagram, and prove its stability under input data perturbations. We validate our new topological anomaly detection framework in application to dynamic multilayer networks from the Ethereum Blockchain and the Ripple Credit Network, and demonstrate that our stacked PD approach substantially outperforms state-of-the-art techniques.
Submitted 6 July, 2021; v1 submitted 3 June, 2021;
originally announced June 2021.
-
Improving Fairness of AI Systems with Lossless De-biasing
Authors:
Yan Zhou,
Murat Kantarcioglu,
Chris Clifton
Abstract:
In today's society, AI systems are increasingly used to make critical decisions such as credit scoring and patient triage. However, the great convenience brought by AI systems comes with a troubling prevalence of bias against underrepresented groups. Mitigating bias in AI systems to increase overall fairness has emerged as an important challenge. Existing studies on mitigating bias in AI systems focus on eliminating sensitive demographic information embedded in data. Given the temporal and contextual complexity of conceptualizing fairness, lossy treatment of demographic information may contribute to an unnecessary trade-off between accuracy and fairness, especially when demographic attributes and class labels are correlated. In this paper, we present an information-lossless de-biasing technique that targets the scarcity of data in the disadvantaged group. Unlike the existing work, we demonstrate, both theoretically and empirically, that oversampling underrepresented groups can not only mitigate algorithmic bias in AI systems that consistently predict a favorable outcome for a certain group, but also improve overall accuracy by mitigating class imbalance within data that leads to a bias towards the majority class. We demonstrate the effectiveness of our technique on real datasets using a variety of fairness metrics.
Submitted 10 May, 2021;
originally announced May 2021.
-
Smart Vectorizations for Single and Multiparameter Persistence
Authors:
Baris Coskunuzer,
Cuneyt Gurcan Akcora,
Ignacio Segovia Dominguez,
Zhiwei Zhen,
Murat Kantarcioglu,
Yulia R. Gel
Abstract:
The machinery of topological data analysis becomes increasingly popular in a broad range of machine learning tasks, ranging from anomaly detection and manifold learning to graph classification. Persistent homology is one of the key approaches here, allowing us to systematically assess the evolution of various hidden patterns in the data as we vary a scale parameter. The extracted patterns, or homological features, along with information on how long such features persist throughout the considered filtration of a scale parameter, convey a critical insight into salient data characteristics and data organization.
In this work, we introduce two new and easily interpretable topological summaries for single and multi-parameter persistence, namely, saw functions and multi-persistence grid functions, respectively. Compared to the existing topological summaries which tend to assess the numbers of topological features and/or their lifespans at a given filtration step, our proposed saw and multi-persistence grid functions allow us to explicitly account for essential complementary information such as the numbers of births and deaths at each filtration step.
These new topological summaries can be regarded as the complexity measures of the evolving subspaces determined by the filtration and are of particular utility for applications of persistent homology on graphs. We derive theoretical guarantees on the stability of the new saw and multi-persistence grid functions and illustrate their applicability for graph classification tasks.
Submitted 10 April, 2021;
originally announced April 2021.
-
Blockchain Networks: Data Structures of Bitcoin, Monero, Zcash, Ethereum, Ripple and Iota
Authors:
Cuneyt Gurcan Akcora,
Murat Kantarcioglu,
Yulia R. Gel
Abstract:
Blockchain is an emerging technology that has enabled many applications, from cryptocurrencies to digital asset management and supply chains. Due to this surge of popularity, analyzing the data stored on blockchains poses a new critical challenge in data science.
To assist data scientists in various analytic tasks on a blockchain, in this tutorial, we provide a systematic and comprehensive overview of the fundamental elements of blockchain network models. We discuss how we can abstract blockchain data as various types of networks and further use such associated network abstractions to reap important insights on blockchains' structure, organization, and functionality.
Submitted 29 September, 2021; v1 submitted 15 March, 2021;
originally announced March 2021.
-
Improving Accuracy of Federated Learning in Non-IID Settings
Authors:
Mustafa Safa Ozdayi,
Murat Kantarcioglu,
Rishabh Iyer
Abstract:
Federated Learning (FL) is a decentralized machine learning protocol that allows a set of participating agents to collaboratively train a model without sharing their data. This makes FL particularly suitable for settings where data privacy is desired. However, it has been observed that the performance of FL is closely tied to the local data distributions of agents. In particular, in settings where local data distributions differ vastly among agents, FL performs rather poorly compared with centralized training. To address this problem, we hypothesize the reasons behind the performance degradation and develop techniques to address them. In this work, we identify four simple techniques that can improve the performance of trained models without incurring any additional communication overhead to FL, requiring only some light computation overhead on either the client or the server side. In our experimental analysis, a combination of our techniques improved the validation accuracy of a model trained via FL by more than 12% with respect to our baseline. This is about 5% less than the accuracy of the model trained on centralized data.
Submitted 14 October, 2020;
originally announced October 2020.
-
How to Not Get Caught When You Launder Money on Blockchain?
Authors:
Cuneyt G. Akcora,
Sudhanva Purusotham,
Yulia R. Gel,
Mitchell Krawiec-Thayer,
Murat Kantarcioglu
Abstract:
The number of blockchain users has grown tremendously in recent years. As an unintended consequence, e-crime transactions on blockchains have been on the rise. Consequently, public blockchains have become a hotbed of research for developing AI tools to detect and trace users and transactions that are related to e-crime.
We argue that following a few select strategies can make money laundering on blockchain virtually undetectable with most of the existing tools and algorithms. As a result, the effective combating of e-crime activities involving cryptocurrencies requires the development of novel analytic methodology in AI.
Submitted 21 September, 2020;
originally announced October 2020.
-
GOAT: GPU Outsourcing of Deep Learning Training With Asynchronous Probabilistic Integrity Verification Inside Trusted Execution Environment
Authors:
Aref Asvadishirehjini,
Murat Kantarcioglu,
Bradley Malin
Abstract:
Machine learning models based on Deep Neural Networks (DNNs) are increasingly deployed in a wide range of applications ranging from self-driving cars to COVID-19 treatment discovery. To support the computational power necessary to learn a DNN, cloud environments with dedicated hardware support have emerged as critical infrastructure. However, there are many integrity challenges associated with outsourcing computation. Various approaches have been developed to address these challenges, building on trusted execution environments (TEE). Yet, no existing approach scales up to support realistic integrity-preserving DNN model training for heavy workloads (deep architectures and millions of training examples) without sustaining a significant performance hit. To mitigate the time gap between pure TEE (full integrity) and pure GPU (no integrity), we combine random verification of selected computation steps with systematic adjustments of DNN hyper-parameters (e.g., a narrow gradient clipping range), hence limiting the attacker's ability to shift the model parameters significantly provided that the step is not selected for verification during its training phase. Experimental results show the new approach achieves 2X to 20X performance improvement over pure TEE based solution while guaranteeing a very high probability of integrity (e.g., 0.999) with respect to state-of-the-art DNN backdoor attacks.
Submitted 17 October, 2020;
originally announced October 2020.
-
BlockFLA: Accountable Federated Learning via Hybrid Blockchain Architecture
Authors:
Harsh Bimal Desai,
Mustafa Safa Ozdayi,
Murat Kantarcioglu
Abstract:
Federated Learning (FL) is a distributed and decentralized machine learning protocol. By executing FL, a set of agents can jointly train a model without sharing their datasets with each other or a third party. This makes FL particularly suitable for settings where data privacy is desired.
At the same time, concealing training data gives attackers an opportunity to inject backdoors into the trained model. It has been shown that an attacker can inject backdoors into the trained model during FL and then leverage the backdoor to make the model misclassify later. Several works have tried to alleviate this threat by designing robust aggregation functions. However, given that more sophisticated attacks that bypass the existing defenses are developed over time, we approach this problem from a complementary angle in this work. In particular, we aim to discourage backdoor attacks by detecting and punishing the attackers, possibly after the end of the training phase.
To this end, we develop a hybrid blockchain-based FL framework that uses smart contracts to automatically detect and punish the attackers via monetary penalties. Our framework is general in the sense that any aggregation function and any attacker detection algorithm can be plugged into it. We conduct experiments to demonstrate that our framework preserves the communication-efficient nature of FL, and provide empirical results to illustrate that it can successfully penalize attackers by leveraging our novel attacker detection algorithm.
Submitted 14 October, 2020;
originally announced October 2020.
-
Secure IoT Data Analytics in Cloud via Intel SGX
Authors:
Md Shihabul Islam,
Mustafa Safa Ozdayi,
Latifur Khan,
Murat Kantarcioglu
Abstract:
The growing adoption of IoT devices in our daily life is engendering a data deluge, mostly private information that needs careful maintenance and a secure storage system to ensure data integrity and protection. Also, the prodigious IoT ecosystem has provided users with opportunities to automate systems by interconnecting their devices and other services with rule-based programs. The cloud services that are used to store and process sensitive IoT data turn out to be vulnerable to outside threats. Hence, sensitive IoT data and rule-based programs need to be protected against cyberattacks. To address this important challenge, in this paper, we propose a framework to maintain the confidentiality and integrity of IoT data and rule-based program execution. We design the framework to preserve data privacy utilizing a Trusted Execution Environment (TEE) such as Intel SGX and an end-to-end data encryption mechanism. We evaluate the framework by securely executing rule-based programs in SGX with both simulated and real IoT device data.
Submitted 10 August, 2020;
originally announced August 2020.
-
Defending against Backdoors in Federated Learning with Robust Learning Rate
Authors:
Mustafa Safa Ozdayi,
Murat Kantarcioglu,
Yulia R. Gel
Abstract:
Federated learning (FL) allows a set of agents to collaboratively train a model without sharing their potentially sensitive data. This makes FL suitable for privacy-preserving applications. At the same time, FL is susceptible to adversarial attacks due to decentralized and unvetted data. One important line of attacks against FL is backdoor attacks. In a backdoor attack, an adversary tries to embed a backdoor functionality into the model during training that can later be activated to cause a desired misclassification. To prevent backdoor attacks, we propose a lightweight defense that requires minimal change to the FL protocol. At a high level, our defense is based on carefully adjusting the aggregation server's learning rate, per dimension and per round, based on the sign information of agents' updates. We first conjecture the necessary steps to carry out a successful backdoor attack in the FL setting, and then explicitly formulate the defense based on our conjecture. Through experiments, we provide empirical evidence that supports our conjecture, and we test our defense against backdoor attacks under different settings. We observe that either the backdoor is completely eliminated or its accuracy is significantly reduced. Overall, our experiments suggest that our defense significantly outperforms some of the recently proposed defenses in the literature. We achieve this with minimal influence on the accuracy of the trained models. In addition, we provide a convergence rate analysis for our proposed scheme.
Submitted 29 July, 2021; v1 submitted 7 July, 2020;
originally announced July 2020.
-
Does Explainable Artificial Intelligence Improve Human Decision-Making?
Authors:
Yasmeen Alufaisan,
Laura R. Marusich,
Jonathan Z. Bakdash,
Yan Zhou,
Murat Kantarcioglu
Abstract:
Explainable AI provides insight into the "why" for model predictions, offering potential for users to better understand and trust a model, and to recognize and correct AI predictions that are incorrect. Prior research on human and explainable AI interactions has focused on measures such as interpretability, trust, and usability of the explanation. Whether explainable AI can improve actual human decision-making and the ability to identify the problems with the underlying model are open questions. Using real datasets, we compare and evaluate objective human decision accuracy without AI (control), with an AI prediction (no explanation), and AI prediction with explanation. We find providing any kind of AI prediction tends to improve user decision accuracy, but no conclusive evidence that explainable AI has a meaningful impact. Moreover, we observed the strongest predictor for human decision accuracy was AI accuracy and that users were somewhat able to detect when the AI was correct versus incorrect, but this was not significantly affected by including an explanation. Our results indicate that, at least in some situations, the "why" information provided in explainable AI may not enhance user decision-making, and further research may be needed to understand how to integrate explainable AI into real systems.
Submitted 19 June, 2020;
originally announced June 2020.
-
Leveraging Blockchain for Immutable Logging and Querying Across Multiple Sites
Authors:
Mustafa Safa Ozdayi,
Murat Kantarcioglu,
Bradley Malin
Abstract:
Blockchain has emerged as a decentralized and distributed framework that enables tamper-resilience and, thus, practical immutability for stored data. This immutability property is important in scenarios where auditability is desired, such as in maintaining access logs for sensitive healthcare and biomedical data. However, the underlying data structure of blockchain, by default, does not provide capabilities to efficiently query the stored data. In this investigation, we show that it is possible to efficiently run complex audit queries over the access log data stored on blockchains by using additional key-value stores. This paper specifically reports on the approach we designed for the blockchain track of iDASH Privacy & Security Workshop 2018 competition. Particularly, we implemented our solution and compared its loading and query-response performance with SQLite, a commonly used relational database, using the data provided by the iDASH 2018 organizers. Depending on the query type and the data size, the run time difference between blockchain based query-response and SQLite based query-response ranged from 0.2 seconds to 6 seconds. A deeper inspection revealed that range queries were the bottleneck of our solution which, nevertheless, scales up linearly. Concretely, this investigation demonstrates that blockchain-based systems can provide reasonable query-response times to complex queries even if they only use simple key-value stores to manage their data. Consequently, we show that blockchains may be useful for maintaining data with auditability and immutability requirements across multiple sites.
Submitted 5 March, 2020; v1 submitted 13 January, 2020;
originally announced January 2020.
-
Dissecting Ethereum Blockchain Analytics: What We Learn from Topology and Geometry of Ethereum Graph
Authors:
Yitao Li,
Umar Islambekov,
Cuneyt Akcora,
Ekaterina Smirnova,
Yulia R. Gel,
Murat Kantarcioglu
Abstract:
Blockchain technology and, in particular, blockchain-based cryptocurrencies offer us information that has never been seen before in the financial world. In contrast to fiat currencies, all transactions of crypto-currencies and crypto-tokens are permanently recorded on distributed ledgers and are publicly available. As a result, this allows us to construct a transaction graph and to assess not only its organization but to glean relationships between transaction graph properties and crypto price dynamics. The ultimate goal of this paper is to facilitate our understanding on horizons and limitations of what can be learned on crypto-tokens from local topology and geometry of the Ethereum transaction network whose even global network properties remain scarcely explored. By introducing novel tools based on topological data analysis and functional data depth into Blockchain Data Analytics, we show that Ethereum network (one of the most popular blockchains for creating new crypto-tokens) can provide critical insights on price strikes of crypto-tokens that are otherwise largely inaccessible with conventional data sources and traditional analytic methods.
Submitted 20 December, 2019;
originally announced December 2019.
-
ChainNet: Learning on Blockchain Graphs with Topological Features
Authors:
Nazmiye Ceren Abay,
Cuneyt Gurcan Akcora,
Yulia R. Gel,
Umar D. Islambekov,
Murat Kantarcioglu,
Yahui Tian,
Bhavani Thuraisingham
Abstract:
With the emergence of blockchain technologies and the associated cryptocurrencies, such as Bitcoin, understanding network dynamics behind Blockchain graphs has become a rapidly evolving research direction. Unlike other financial networks, such as stock and currency trading, blockchain based cryptocurrencies have the entire transaction graph accessible to the public (i.e., all transactions can be downloaded and analyzed). A natural question is then to ask whether the dynamics of the transaction graph impacts the price of the underlying cryptocurrency. We show that standard graph features such as degree distribution of the transaction graph may not be sufficient to capture network dynamics and its potential impact on fluctuations of Bitcoin price. In contrast, the new graph associated topological features, computed using the tools of persistent homology, are found to exhibit a high utility for predicting Bitcoin price dynamics. Using the proposed persistent homology-based techniques, we offer a new elegant, easily extendable and computationally light approach for graph representation learning on Blockchain.
Submitted 18 August, 2019;
originally announced August 2019.
-
BitcoinHeist: Topological Data Analysis for Ransomware Detection on the Bitcoin Blockchain
Authors:
Cuneyt Gurcan Akcora,
Yitao Li,
Yulia R. Gel,
Murat Kantarcioglu
Abstract:
Proliferation of cryptocurrencies (e.g., Bitcoin) that allow pseudo-anonymous transactions, has made it easier for ransomware developers to demand ransom by encrypting sensitive user data. The recently revealed strikes of ransomware attacks have already resulted in significant economic losses and societal harm across different sectors, ranging from local governments to health care.
Most modern ransomware use Bitcoin for payments. However, although Bitcoin transactions are permanently recorded and publicly available, current approaches for detecting ransomware depend only on a couple of heuristics and/or tedious information gathering steps (e.g., running ransomware to collect ransomware related Bitcoin addresses). To our knowledge, none of the previous approaches have employed advanced data analytics techniques to automatically detect ransomware related transactions and malicious Bitcoin addresses.
By capitalizing on the recent advances in topological data analysis, we propose an efficient and tractable data analytics framework to automatically detect new malicious addresses in a ransomware family, given only limited records of previous transactions. Furthermore, our proposed techniques exhibit high utility to detect the emergence of new ransomware families, that is, ransomware with no previous records of transactions. Using the existing known ransomware data sets, we show that our proposed methodology provides significant improvements in precision and recall for ransomware transaction detection, compared to existing heuristic based approaches, and can be utilized to automate ransomware detection.
Submitted 18 June, 2019;
originally announced June 2019.
-
CryptoGuard: High Precision Detection of Cryptographic Vulnerabilities in Massive-sized Java Projects
Authors:
Sazzadur Rahaman,
Ya Xiao,
Sharmin Afrose,
Fahad Shaon,
Ke Tian,
Miles Frantz,
Danfeng Yao,
Murat Kantarcioglu
Abstract:
Cryptographic API misuses, such as exposed secrets, predictable random numbers, and vulnerable certificate verification, seriously threaten software security. The vision of automatically screening cryptographic API calls in massive-sized (e.g., millions of LoC) Java programs is not new. However, hindered by the practical difficulty of reducing false positives without compromising analysis quality, this goal has not been accomplished. State-of-the-art crypto API screening solutions are not designed to operate on a large scale.
Our technical innovation is a set of fast and highly accurate slicing algorithms. Our algorithms refine program slices by identifying language-specific irrelevant elements. The refinements reduce false alerts by 76% to 80% in our experiments. Running our tool, CryptoGuard, on 46 high-impact large-scale Apache projects and 6,181 Android apps generates many security insights. Our findings helped multiple popular Apache projects to harden their code, including Spark, Ranger, and Ofbiz. We also have made substantial progress towards the science of analysis in this space, including: i) manually analyzing 1,295 Apache alerts and confirming 1,277 true positives (98.61% precision), ii) creating a benchmark with 38-unit basic cases and 74-unit advanced cases, iii) performing an in-depth comparison with leading solutions including CrySL, SpotBugs, and Coverity. We are in the process of integrating CryptoGuard with the Software Assurance Marketplace (SWAMP).
Submitted 27 March, 2019; v1 submitted 18 June, 2018;
originally announced June 2018.
-
Breaking Transferability of Adversarial Samples with Randomness
Authors:
Yan Zhou,
Murat Kantarcioglu,
Bowei Xi
Abstract:
We investigate the role of transferability of adversarial attacks in the observed vulnerabilities of Deep Neural Networks (DNNs). We demonstrate that introducing randomness to the DNN models is sufficient to defeat adversarial attacks, given that the adversary does not have an unlimited attack budget. Instead of making one specific DNN model robust to perfect knowledge attacks (a.k.a. white-box attacks), creating randomness within an army of DNNs completely eliminates the possibility of perfect knowledge acquisition, resulting in a significantly more robust DNN ensemble against the strongest form of attacks. We also show that when the adversary has an unlimited budget of data perturbation, all defensive techniques would eventually break down as the budget increases. Therefore, it is important to understand the game saddle point where the adversary would not further pursue this endeavor.
Furthermore, we explore the relationship between attack severity and decision boundary robustness in the version space. We empirically demonstrate that by simply adding a small Gaussian random noise to the learned weights, a DNN model can increase its resilience to adversarial attacks by as much as 74.2%. More importantly, we show that by randomly activating/revealing a model from a pool of pre-trained DNNs at each query request, we can put a tremendous strain on the adversary's attack strategies. We compare our randomization techniques to the Ensemble Adversarial Training technique and show that our randomization techniques are superior under different attack budget constraints.
Submitted 17 June, 2018; v1 submitted 11 May, 2018;
originally announced May 2018.
-
Enforceable Data Sharing Agreements Using Smart Contracts
Authors:
Kevin Liu,
Harsh Desai,
Lalana Kagal,
Murat Kantarcioglu
Abstract:
As more and more data is collected for various reasons, the sharing of such data becomes paramount to increasing its value. Many applications ranging from smart cities to personalized health care require individuals and organizations to share data at an unprecedented scale. Data sharing is crucial in today's world, but due to privacy reasons, security concerns and regulation issues, the conditions under which the sharing occurs need to be carefully specified. Currently, this process is done by lawyers and requires the costly signing of legal agreements. In many cases, these data sharing agreements are hard to track, manage or enforce. In this work, we propose a novel alternative for tracking, managing and especially enforcing such data sharing agreements using smart contracts and blockchain technology. We design a framework that generates smart contracts from parameters based on legal data sharing agreements. The terms in these agreements are automatically enforced by the system. Monetary punishment can be employed using secure voting by external auditors to hold the violators accountable. Our experimental evaluation shows that our proposed framework is efficient and low-cost.
Submitted 27 April, 2018;
originally announced April 2018.
-
Adversarial Clustering: A Grid Based Clustering Algorithm Against Active Adversaries
Authors:
Wutao Wei,
Bowei Xi,
Murat Kantarcioglu
Abstract:
Nowadays more and more data are gathered for detecting and preventing cyber attacks. In cyber security applications, data analytics techniques have to deal with active adversaries that try to deceive the data analytics models and avoid being detected. The existence of such adversarial behavior motivates the development of robust and resilient adversarial learning techniques for various tasks. Most of the previous work focused on adversarial classification techniques, which assumed the existence of a reasonably large amount of carefully labeled data instances. However, in practice, labeling the data instances often requires costly and time-consuming human expertise and becomes a significant bottleneck. Meanwhile, a large number of unlabeled instances can also be used to understand the adversaries' behavior. To address the above mentioned challenges, in this paper, we develop a novel grid based adversarial clustering algorithm. Our adversarial clustering algorithm is able to identify the core normal regions, and to draw defensive walls around the centers of the normal objects utilizing game theoretic ideas. Our algorithm also identifies sub-clusters of attack objects, the overlapping areas within clusters, and outliers which may be potential anomalies.
Submitted 13 April, 2018;
originally announced April 2018.
-
Graph Based Proactive Secure Decomposition Algorithm for Context Dependent Attribute Based Inference Control Problem
Authors:
Ugur Turan,
Ismail H. Toroslu,
Murat Kantarcioglu
Abstract:
Relational DBMSs continue to dominate the database market, and the inference problem on the external schema of relational DBMSs is still an important issue in terms of data privacy. Especially over the last 10 years, external schema construction for application-specific database usage has become increasingly independent of the conceptual schema, as the definitions and implementations of views and procedures have been optimized. This paper offers an optimized decomposition strategy for the external schema, which concentrates on the privacy policy and the required associations of attributes for the intended user roles. The method proposed in this article performs a proactive decomposition of the external schema in order to satisfy both the forbidden and the required associations of attributes. Functional dependency constraints of a database schema can be represented as a graph in which vertices are attribute sets and edges are functional dependencies. In this representation, the inference problem can be defined as the process of searching for a subtree in the dependency graph that contains the attributes that need to be related. The optimized decomposition process aims to generate an external schema that guarantees the prevention of the inference of the forbidden attribute sets while guaranteeing the association of the required attribute sets, with a minimal loss of possible associations among other attributes, provided the inhibited and required attribute sets are consistent with each other. Our technique is purely proactive and can be viewed as a normalization process. Because external schema construction tools are independent of usage, it can easily be applied to any existing system without rewriting the data access layer of applications. Our extensive experimental analysis shows the effectiveness of this optimized proactive strategy for a wide variety of logical schema volumes.
Submitted 1 March, 2018;
originally announced March 2018.
-
Using Blockchain and smart contracts for secure data provenance management
Authors:
Aravind Ramachandran,
Dr. Murat Kantarcioglu
Abstract:
Blockchain technology has evolved from being an immutable ledger of transactions for cryptocurrencies to a programmable, interactive environment for building distributed, reliable applications. Although blockchain technology has been used to address various challenges, to our knowledge no previous work has focused on using blockchain to develop a secure and immutable scientific data provenance management framework that automatically verifies the provenance records. In this work, we leverage blockchain as a platform to facilitate trustworthy data provenance collection, verification, and management. The developed system utilizes smart contracts and the open provenance model (OPM) to record immutable data trails. We show that our proposed framework can efficiently and securely capture and validate provenance data, and prevent any malicious modification to the captured data as long as the majority of the participants are honest.
Submitted 28 September, 2017;
originally announced September 2017.
-
Blockchain: A Graph Primer
Authors:
Cuneyt Gurcan Akcora,
Yulia R. Gel,
Murat Kantarcioglu
Abstract:
Bitcoin and its underlying technology, blockchain, have gained significant popularity in recent years. Satoshi Nakamoto designed Bitcoin to enable a secure, distributed platform without the need for central authorities, and blockchain has been hailed as a paradigm that will be as impactful as Big Data, Cloud Computing, and Machine Learning.
Blockchain incorporates innovative ideas from various fields, such as public-key encryption and distributed systems. As a result, readers often encounter resources that explain Blockchain technology from a single perspective, leaving them with more questions than answers.
In this primer, we aim to provide a comprehensive view of blockchain. We will begin with a brief history and introduce the building blocks of the blockchain. As graph mining is a major area of blockchain analysis, we will delve into the graph-theoretical aspects of Blockchain technology. We will also discuss the future of blockchain and explain how extensions such as smart contracts and decentralized autonomous organizations will function.
Our goal is to provide a concise but complete description of blockchain technology that is accessible to readers with no prior expertise in the field.
Submitted 11 December, 2022; v1 submitted 10 August, 2017;
originally announced August 2017.
-
CheapSMC: A Framework to Minimize SMC Cost in Cloud
Authors:
Erman Pattuk,
Murat Kantarcioglu,
Huseyin Ulusoy,
Bradley Malin
Abstract:
Secure multi-party computation (SMC) techniques are becoming increasingly efficient and practical thanks to many recent improvements. Recent work has shown that different protocols implemented using different sharing mechanisms (e.g., boolean and arithmetic sharings) may have different computational and communication costs. Although some works automatically mix protocols of different sharing schemes to speed up execution, none of them provides a generic optimization framework to find the cheapest mixed-protocol SMC execution for cloud deployment.
In this work, we propose a generic SMC optimization framework, CheapSMC, that can use any mixed-protocol SMC circuit evaluation tool as a black box to find the cheapest SMC cloud deployment option. To find the cheapest SMC protocol, CheapSMC runs one-time benchmarks for the target cloud service and gathers performance statistics for basic circuit components. Using these performance statistics, the optimization layer of CheapSMC runs multiple heuristics to find the cheapest mixed-protocol circuit evaluation. The optimized circuit is then passed to a mixed-protocol SMC tool for actual executable generation. Our empirical results, gathered by running different case studies, show that significant cost savings can be achieved using our optimization framework.
Submitted 1 May, 2016;
originally announced May 2016.
-
A Distributed Framework for Scalable Search over Encrypted Documents
Authors:
Mehmet Kuzu,
Mohammad Saiful Islam,
Murat Kantarcioglu
Abstract:
Nowadays, huge numbers of documents are increasingly transferred to remote servers due to the appealing features of cloud computing. On the other hand, the privacy and security of sensitive information in untrusted cloud environments is a big concern. To alleviate such concerns, encryption of sensitive data before its transfer to the cloud has become an important risk mitigation option. Encrypted storage provides protection at the expense of a significant increase in data management complexity. For effective management, it is critical to provide efficient selective document retrieval capability on the encrypted collection. In fact, a considerable number of searchable symmetric encryption schemes have been designed in the literature to achieve this task. However, with the emergence of big data everywhere, available approaches are insufficient to address some crucial real-world problems such as scalability.
In this study, we focus on practical aspects of a secure keyword search mechanism over encrypted data on a real cloud infrastructure. First, we propose a provably secure distributed index along with a parallelizable retrieval technique that can easily scale to big data. Second, we integrate authorization into the search scheme to limit the information leakage in a multi-user setting where users are allowed to access only particular documents. Third, we offer efficient updates on the distributed secure index. In addition, we conduct an extensive empirical analysis on a real dataset to illustrate the efficiency of the proposed practical techniques.
Submitted 23 August, 2014;
originally announced August 2014.
-
Experiments in Information Sharing
Authors:
Nathan Berg,
Chunyu Chen,
Murat Kantarcioglu
Abstract:
This paper reports experimental data describing the dynamics of three key information-sharing outcomes: quantity of information shared, falsification and accuracy. The experimental design follows a formal model predicting that cooperative incentives are needed to motivate subsidiaries of large organizations to share information. Empirical reaction functions reveal how lagged values of information-sharing outcomes influence information sharing in the current round. Cooperative treatments pay bonuses to everyone if at least one individual (or subsidiary) achieves accuracy. Tournament treatments pay a single bonus to whoever achieves accuracy first. As expected, tournament incentives tend to reduce sharing, increase falsification and impair accuracy. Several surprises not predicted by the formal model emerge from the data. Conditional cooperation occurs regardless of the incentive scheme, implying that the mechanism through which incentives influence improvements in information sharing is indirect.
Submitted 22 May, 2013;
originally announced May 2013.
-
Efficient Query Verification on Outsourced Data: A Game-Theoretic Approach
Authors:
Robert Nix,
Murat Kantarcioglu
Abstract:
To save time and money, businesses and individuals have begun outsourcing their data and computations to cloud computing services. These entities would, however, like to ensure that the queries they request from the cloud services are being computed correctly. In this paper, we use the principles of economics and competition to vastly reduce the complexity of query verification on outsourced data. We consider two cases: first, the scenario where multiple non-colluding data outsourcing services exist, and second, the case where only a single outsourcing service exists. Using a game-theoretic model, we show that, given the proper incentive structure, we can effectively deter dishonest behavior on the part of the data outsourcing services with very few computational and monetary resources. We prove that the incentive for an outsourcing service to cheat can be reduced to zero. Finally, we show through extensive experimental evaluation that a simple verification method can achieve this reduction.
Submitted 7 February, 2012;
originally announced February 2012.
-
Secure Data Processing in a Hybrid Cloud
Authors:
Vaibhav Khadilkar,
Murat Kantarcioglu,
Bhavani Thuraisingham,
Sharad Mehrotra
Abstract:
Cloud computing has made it possible for a user to select a computing service precisely when needed. However, certain factors such as security of data and regulatory issues will impact a user's choice of using such a service. A solution to these problems is the use of a hybrid cloud that combines a user's local computing capabilities (for mission- or organization-critical tasks) with a public cloud (for less influential tasks). We foresee three challenges that must be overcome before the adoption of a hybrid cloud approach: 1) data design: How to partition relations in a hybrid cloud? The solution to this problem must account for the sensitivity of attributes in a relation as well as the workload of a user; 2) data security: How to protect a user's data in a public cloud with encryption while enabling query processing over this encrypted data? and 3) query processing: How to execute queries efficiently over both encrypted and unencrypted data? This paper addresses these challenges and incorporates their solutions into an add-on tool for a Hadoop- and Hive-based cloud computing infrastructure.
Submitted 10 May, 2011;
originally announced May 2011.