-
BenthicNet: A global compilation of seafloor images for deep learning applications
Authors:
Scott C. Lowe,
Benjamin Misiuk,
Isaac Xu,
Shakhboz Abdulazizov,
Amit R. Baroi,
Alex C. Bastos,
Merlin Best,
Vicki Ferrini,
Ariell Friedman,
Deborah Hart,
Ove Hoegh-Guldberg,
Daniel Ierodiaconou,
Julia Mackin-McLaughlin,
Kathryn Markey,
Pedro S. Menandro,
Jacquomo Monk,
Shreya Nemani,
John O'Brien,
Elizabeth Oh,
Luba Y. Reshitnyk,
Katleen Robert,
Chris M. Roelfsema,
Jessica A. Sameoto,
Alexandre C. G. Schimel,
Jordan A. Thomson, et al. (4 additional authors not shown)
Abstract:
Advances in underwater imaging enable the collection of extensive seafloor image datasets that are necessary for monitoring important benthic ecosystems. The ability to collect seafloor imagery has outpaced our capacity to analyze it, hindering expedient mobilization of this crucial environmental information. Recent machine learning approaches provide opportunities to increase the efficiency with which seafloor image datasets are analyzed, yet large and consistent datasets necessary to support development of such approaches are scarce. Here we present BenthicNet: a global compilation of seafloor imagery designed to support the training and evaluation of large-scale image recognition models. An initial set of over 11.4 million images was collected and curated to represent a diversity of seafloor environments using a representative subset of 1.3 million images. These are accompanied by 2.6 million annotations translated to the CATAMI scheme, which span 190,000 of the images. A large deep learning model was trained on this compilation and preliminary results suggest it has utility for automating large and small-scale image analysis tasks. The compilation and model are made openly available for use by the scientific community at https://doi.org/10.20383/103.0614.
Submitted 11 July, 2024; v1 submitted 8 May, 2024;
originally announced May 2024.
-
Hypergraph Topological Features for Autoencoder-Based Intrusion Detection for Cybersecurity Data
Authors:
Bill Kay,
Sinan G. Aksoy,
Molly Baird,
Daniel M. Best,
Helen Jenne,
Cliff Joslyn,
Christopher Potvin,
Gregory Henselman-Petrusek,
Garret Seppala,
Stephen J. Young,
Emilie Purvine
Abstract:
In this position paper, we argue that when hypergraphs are used to capture multi-way local relations of data, their resulting topological features describe global behaviour. Consequently, these features capture complex correlations that can then serve as high fidelity inputs to autoencoder-driven anomaly detection pipelines. We propose two such potential pipelines for cybersecurity data, one that uses an autoencoder directly to determine network intrusions, and one that de-noises input data for a persistent homology system, PHANTOM. We provide heuristic justification for the use of the methods described therein for an intrusion detection pipeline for cyber data. We conclude by showing a small example over synthetic cyber attack data.
Submitted 9 November, 2023;
originally announced December 2023.
-
A Material Lens on Coloniality in NLP
Authors:
William Held,
Camille Harris,
Michael Best,
Diyi Yang
Abstract:
Coloniality, the continuation of colonial harms beyond "official" colonization, has pervasive effects across society and scientific fields. Natural Language Processing (NLP) is no exception to this broad phenomenon. In this work, we argue that coloniality is implicitly embedded in and amplified by NLP data, algorithms, and software. We formalize this analysis using Actor-Network Theory (ANT): an approach to understanding social phenomena through the network of relationships between human stakeholders and technology. We use our Actor-Network to guide a quantitative survey of the geography of different phases of NLP research, providing evidence that inequality along colonial boundaries increases as NLP builds on itself. Based on this, we argue that combating coloniality in NLP requires not only changing current values but also active work to remove the accumulation of colonial ideals in our foundational data and algorithms.
Submitted 14 November, 2023;
originally announced November 2023.
-
Hate Speech Detection in Limited Data Contexts using Synthetic Data Generation
Authors:
Aman Khullar,
Daniel Nkemelu,
Cuong V. Nguyen,
Michael L. Best
Abstract:
A growing body of work has focused on text classification methods for detecting the increasing amount of hate speech posted online. This progress has been limited to only a select number of highly resourced languages, causing detection systems to either underperform or not exist in limited data contexts. This is largely caused by a lack of training data, which is expensive to collect and curate in these settings. In this work, we propose a data augmentation approach that addresses the problem of lack of data for online hate speech detection in limited data contexts using synthetic data generation techniques. Given a handful of hate speech examples in a high-resource language such as English, we present three methods to synthesize new examples of hate speech data in a target language that retain the hate sentiment in the original examples but transfer the hate targets. We apply our approach to generate training data for hate speech classification tasks in Hindi and Vietnamese. Our findings show that a model trained on synthetic data performs comparably to, and in some cases outperforms, a model trained only on the samples available in the target domain. This method can be adopted to bootstrap hate speech detection models from scratch in limited data contexts. As the growth of social media within these contexts continues to outstrip response efforts, this work furthers our capacities for detection, understanding, and response to hate speech.
Submitted 4 October, 2023;
originally announced October 2023.
-
Malicious Cyber Activity Detection Using Zigzag Persistence
Authors:
Audun Myers,
Alyson Bittner,
Sinan Aksoy,
Daniel M. Best,
Gregory Henselman-Petrusek,
Helen Jenne,
Cliff Joslyn,
Bill Kay,
Garret Seppala,
Stephen J. Young,
Emilie Purvine
Abstract:
In this study, we synthesize zigzag persistence from topological data analysis with autoencoder-based approaches to detect malicious cyber activity and derive analytic insights. Cybersecurity aims to safeguard computers, networks, and servers from various forms of malicious attacks, including network damage, data theft, and activity monitoring. Here we focus on the detection of malicious activity using log data. To do this, we consider the dynamics of the data by exploring the changing topology of a hypergraph representation, gaining insights into the underlying activity. Hypergraphs provide a natural representation of cyber log data by capturing complex interactions between processes. To study the changing topology, we use zigzag persistence, which captures how topological features persist at multiple dimensions over time. We observe that the resulting barcodes represent malicious activity differently than benign activity. To automate this detection, we implement an autoencoder trained on a vectorization of the resulting zigzag persistence barcodes. Our experimental results demonstrate the effectiveness of the autoencoder in detecting malicious activity in comparison to standard summary statistics. Overall, this study highlights the potential of zigzag persistence and its combination with temporal hypergraphs for analyzing cybersecurity log data and detecting malicious behavior.
Submitted 14 September, 2023;
originally announced September 2023.
-
Tackling Hate Speech in Low-resource Languages with Context Experts
Authors:
Daniel Nkemelu,
Harshil Shah,
Irfan Essa,
Michael L. Best
Abstract:
Given Myanmar's historical and socio-political context, hate speech spread on social media has escalated into offline unrest and violence. This paper presents findings from our remote study on the automatic detection of hate speech online in Myanmar. We argue that effectively addressing this problem will require community-based approaches that combine the knowledge of context experts with machine learning tools that can analyze the vast amount of data produced. To this end, we develop a systematic process to facilitate this collaboration covering key aspects of data collection, annotation, and model validation strategies. We highlight challenges in this area stemming from small and imbalanced datasets, the need to balance non-glamorous data work and stakeholder priorities, and closed data-sharing practices. Stemming from these findings, we discuss avenues for further work in developing and deploying hate speech detection systems for low-resource languages.
Submitted 29 March, 2023;
originally announced March 2023.
-
Why So Inflammatory? Explainability in Automatic Detection of Inflammatory Social Media Users
Authors:
Cuong Nguyen,
Daniel Nkemelu,
Ankit Mehta,
Michael Best
Abstract:
Hate speech and misinformation, spread over social networking services (SNS) such as Facebook and Twitter, have inflamed ethnic and political violence in countries across the globe. We argue that there is limited research on this problem within the context of the Global South and present an approach for tackling it. Prior work has shown how machine learning models built with user-level interaction features can effectively identify users who spread inflammatory content. While this technique is beneficial in low-resource language settings where linguistic resources such as ground truth data and processing capabilities are lacking, it is still unclear how these interaction features contribute to model performance. In this work, we investigate and show significant differences in interaction features between users who spread inflammatory content and others who do not, applying explainability tools to understand our trained model. We find that features with higher interaction significance (such as account age and activity count) show higher explanatory power than features with lower interaction significance (such as name length and whether the user has a location in their bio). Our work extends research directions that aim to understand the nature of inflammatory content in low-resource, high-risk contexts as the growth of social media use in the Global South outstrips moderation efforts.
Submitted 21 August, 2022;
originally announced August 2022.