-
BIOSCAN-5M: A Multimodal Dataset for Insect Biodiversity
Authors:
Zahra Gharaee,
Scott C. Lowe,
ZeMing Gong,
Pablo Millan Arias,
Nicholas Pellegrino,
Austin T. Wang,
Joakim Bruslund Haurum,
Iuliia Zarubiieva,
Lila Kari,
Dirk Steinke,
Graham W. Taylor,
Paul Fieguth,
Angel X. Chang
Abstract:
As part of an ongoing worldwide effort to comprehend and monitor insect biodiversity, this paper presents the BIOSCAN-5M Insect dataset to the machine learning community and establish several benchmark tasks. BIOSCAN-5M is a comprehensive dataset containing multi-modal information for over 5 million insect specimens, and it significantly expands existing image-based biological datasets by includin…
▽ More
As part of an ongoing worldwide effort to comprehend and monitor insect biodiversity, this paper presents the BIOSCAN-5M Insect dataset to the machine learning community and establish several benchmark tasks. BIOSCAN-5M is a comprehensive dataset containing multi-modal information for over 5 million insect specimens, and it significantly expands existing image-based biological datasets by including taxonomic labels, raw nucleotide barcode sequences, assigned barcode index numbers, and geographical information. We propose three benchmark experiments to demonstrate the impact of the multi-modal data types on the classification and clustering accuracy. First, we pretrain a masked language model on the DNA barcode sequences of the BIOSCAN-5M dataset, and demonstrate the impact of using this large reference library on species- and genus-level classification performance. Second, we propose a zero-shot transfer learning task applied to images and DNA barcodes to cluster feature embeddings obtained from self-supervised learning, to investigate whether meaningful clusters can be derived from these representation embeddings. Third, we benchmark multi-modality by performing contrastive learning on DNA barcodes, image data, and taxonomic information. This yields a general shared embedding space enabling taxonomic classification using multiple types of information and modalities. The code repository of the BIOSCAN-5M Insect dataset is available at https://github.com/zahrag/BIOSCAN-5M.
△ Less
Submitted 24 June, 2024; v1 submitted 18 June, 2024;
originally announced June 2024.
-
An Empirical Study into Clustering of Unseen Datasets with Self-Supervised Encoders
Authors:
Scott C. Lowe,
Joakim Bruslund Haurum,
Sageev Oore,
Thomas B. Moeslund,
Graham W. Taylor
Abstract:
Can pretrained models generalize to new datasets without any retraining? We deploy pretrained image models on datasets they were not trained for, and investigate whether their embeddings form meaningful clusters. Our suite of benchmarking experiments use encoders pretrained solely on ImageNet-1k with either supervised or self-supervised training techniques, deployed on image datasets that were not…
▽ More
Can pretrained models generalize to new datasets without any retraining? We deploy pretrained image models on datasets they were not trained for, and investigate whether their embeddings form meaningful clusters. Our suite of benchmarking experiments use encoders pretrained solely on ImageNet-1k with either supervised or self-supervised training techniques, deployed on image datasets that were not seen during training, and clustered with conventional clustering algorithms. This evaluation provides new insights into the embeddings of self-supervised models, which prioritize different features to supervised models. Supervised encoders typically offer more utility than SSL encoders within the training domain, and vice-versa far outside of it, however, fine-tuned encoders demonstrate the opposite trend. Clustering provides a way to evaluate the utility of self-supervised learned representations orthogonal to existing methods such as kNN. Additionally, we find the silhouette score when measured in a UMAP-reduced space is highly correlated with clustering performance, and can therefore be used as a proxy for clustering performance on data with no ground truth labels. Our code implementation is available at \url{https://github.com/scottclowe/zs-ssl-clustering/}.
△ Less
Submitted 4 June, 2024;
originally announced June 2024.
-
BIOSCAN-CLIP: Bridging Vision and Genomics for Biodiversity Monitoring at Scale
Authors:
ZeMing Gong,
Austin T. Wang,
Joakim Bruslund Haurum,
Scott C. Lowe,
Graham W. Taylor,
Angel X. Chang
Abstract:
Measuring biodiversity is crucial for understanding ecosystem health. While prior works have developed machine learning models for the taxonomic classification of photographic images and DNA separately, in this work, we introduce a multimodal approach combining both, using CLIP-style contrastive learning to align images, DNA barcodes, and textual data in a unified embedding space. This allows for…
▽ More
Measuring biodiversity is crucial for understanding ecosystem health. While prior works have developed machine learning models for the taxonomic classification of photographic images and DNA separately, in this work, we introduce a multimodal approach combining both, using CLIP-style contrastive learning to align images, DNA barcodes, and textual data in a unified embedding space. This allows for accurate classification of both known and unknown insect species without task-specific fine-tuning, leveraging contrastive learning for the first time to fuse DNA and image data. Our method surpasses previous single-modality approaches in accuracy by over 11% on zero-shot learning tasks, showcasing its effectiveness in biodiversity studies.
△ Less
Submitted 27 May, 2024;
originally announced May 2024.
-
BenthicNet: A global compilation of seafloor images for deep learning applications
Authors:
Scott C. Lowe,
Benjamin Misiuk,
Isaac Xu,
Shakhboz Abdulazizov,
Amit R. Baroi,
Alex C. Bastos,
Merlin Best,
Vicki Ferrini,
Ariell Friedman,
Deborah Hart,
Ove Hoegh-Guldberg,
Daniel Ierodiaconou,
Julia Mackin-McLaughlin,
Kathryn Markey,
Pedro S. Menandro,
Jacquomo Monk,
Shreya Nemani,
John O'Brien,
Elizabeth Oh,
Luba Y. Reshitnyk,
Katleen Robert,
Chris M. Roelfsema,
Jessica A. Sameoto,
Alexandre C. G. Schimel,
Jordan A. Thomson
, et al. (4 additional authors not shown)
Abstract:
Advances in underwater imaging enable the collection of extensive seafloor image datasets that are necessary for monitoring important benthic ecosystems. The ability to collect seafloor imagery has outpaced our capacity to analyze it, hindering expedient mobilization of this crucial environmental information. Recent machine learning approaches provide opportunities to increase the efficiency with…
▽ More
Advances in underwater imaging enable the collection of extensive seafloor image datasets that are necessary for monitoring important benthic ecosystems. The ability to collect seafloor imagery has outpaced our capacity to analyze it, hindering expedient mobilization of this crucial environmental information. Recent machine learning approaches provide opportunities to increase the efficiency with which seafloor image datasets are analyzed, yet large and consistent datasets necessary to support development of such approaches are scarce. Here we present BenthicNet: a global compilation of seafloor imagery designed to support the training and evaluation of large-scale image recognition models. An initial set of over 11.4 million images was collected and curated to represent a diversity of seafloor environments using a representative subset of 1.3 million images. These are accompanied by 2.6 million annotations translated to the CATAMI scheme, which span 190,000 of the images. A large deep learning model was trained on this compilation and preliminary results suggest it has utility for automating large and small-scale image analysis tasks. The compilation and model are made openly available for use by the scientific community at https://doi.org/10.20383/103.0614.
△ Less
Submitted 11 July, 2024; v1 submitted 8 May, 2024;
originally announced May 2024.
-
BarcodeBERT: Transformers for Biodiversity Analysis
Authors:
Pablo Millan Arias,
Niousha Sadjadi,
Monireh Safari,
ZeMing Gong,
Austin T. Wang,
Scott C. Lowe,
Joakim Bruslund Haurum,
Iuliia Zarubiieva,
Dirk Steinke,
Lila Kari,
Angel X. Chang,
Graham W. Taylor
Abstract:
Understanding biodiversity is a global challenge, in which DNA barcodes - short snippets of DNA that cluster by species - play a pivotal role. In particular, invertebrates, a highly diverse and under-explored group, pose unique taxonomic complexities. We explore machine learning approaches, comparing supervised CNNs, fine-tuned foundation models, and a DNA barcode-specific masking strategy across…
▽ More
Understanding biodiversity is a global challenge, in which DNA barcodes - short snippets of DNA that cluster by species - play a pivotal role. In particular, invertebrates, a highly diverse and under-explored group, pose unique taxonomic complexities. We explore machine learning approaches, comparing supervised CNNs, fine-tuned foundation models, and a DNA barcode-specific masking strategy across datasets of varying complexity. While simpler datasets and tasks favor supervised CNNs or fine-tuned transformers, challenging species-level identification demands a paradigm shift towards self-supervised pretraining. We propose BarcodeBERT, the first self-supervised method for general biodiversity analysis, leveraging a 1.5 M invertebrate DNA barcode reference library. This work highlights how dataset specifics and coverage impact model selection, and underscores the role of self-supervised pretraining in achieving high-accuracy DNA barcode-based identification at the species and genus level. Indeed, without the fine-tuning step, BarcodeBERT pretrained on a large DNA barcode dataset outperforms DNABERT and DNABERT-2 on multiple downstream classification tasks. The code repository is available at https://github.com/Kari-Genomics-Lab/BarcodeBERT
△ Less
Submitted 4 November, 2023;
originally announced November 2023.
-
A Step Towards Worldwide Biodiversity Assessment: The BIOSCAN-1M Insect Dataset
Authors:
Zahra Gharaee,
ZeMing Gong,
Nicholas Pellegrino,
Iuliia Zarubiieva,
Joakim Bruslund Haurum,
Scott C. Lowe,
Jaclyn T. A. McKeown,
Chris C. Y. Ho,
Joschka McLeod,
Yi-Yun C Wei,
Jireh Agda,
Sujeevan Ratnasingham,
Dirk Steinke,
Angel X. Chang,
Graham W. Taylor,
Paul Fieguth
Abstract:
In an effort to catalog insect biodiversity, we propose a new large dataset of hand-labelled insect images, the BIOSCAN-Insect Dataset. Each record is taxonomically classified by an expert, and also has associated genetic information including raw nucleotide barcode sequences and assigned barcode index numbers, which are genetically-based proxies for species classification. This paper presents a c…
▽ More
In an effort to catalog insect biodiversity, we propose a new large dataset of hand-labelled insect images, the BIOSCAN-Insect Dataset. Each record is taxonomically classified by an expert, and also has associated genetic information including raw nucleotide barcode sequences and assigned barcode index numbers, which are genetically-based proxies for species classification. This paper presents a curated million-image dataset, primarily to train computer-vision models capable of providing image-based taxonomic assessment, however, the dataset also presents compelling characteristics, the study of which would be of interest to the broader machine learning community. Driven by the biological nature inherent to the dataset, a characteristic long-tailed class-imbalance distribution is exhibited. Furthermore, taxonomic labelling is a hierarchical classification scheme, presenting a highly fine-grained classification problem at lower levels. Beyond spurring interest in biodiversity research within the machine learning community, progress on creating an image-based taxonomic classifier will also further the ultimate goal of all BIOSCAN research: to lay the foundation for a comprehensive survey of global biodiversity. This paper introduces the dataset and explores the classification task through the implementation and analysis of a baseline classifier.
△ Less
Submitted 13 November, 2023; v1 submitted 19 July, 2023;
originally announced July 2023.
-
Echofilter: A Deep Learning Segmentation Model Improves the Automation, Standardization, and Timeliness for Post-Processing Echosounder Data in Tidal Energy Streams
Authors:
Scott C. Lowe,
Louise P. McGarry,
Jessica Douglas,
Jason Newport,
Sageev Oore,
Christopher Whidden,
Daniel J. Hasselman
Abstract:
Understanding the abundance and distribution of fish in tidal energy streams is important to assess risks presented by introducing tidal energy devices to the habitat. However tidal current flows suitable for tidal energy are often highly turbulent, complicating the interpretation of echosounder data. The portion of the water column contaminated by returns from entrained air must be excluded from…
▽ More
Understanding the abundance and distribution of fish in tidal energy streams is important to assess risks presented by introducing tidal energy devices to the habitat. However tidal current flows suitable for tidal energy are often highly turbulent, complicating the interpretation of echosounder data. The portion of the water column contaminated by returns from entrained air must be excluded from data used for biological analyses. Application of a single conventional algorithm to identify the depth-of-penetration of entrained air is insufficient for a boundary that is discontinuous, depth-dynamic, porous, and varies with tidal flow speed.
Using a case study at a tidal energy demonstration site in the Bay of Fundy, we describe the development and application of a deep machine learning model with a U-Net based architecture. Our model, Echofilter, was highly responsive to the dynamic range of turbulence conditions and sensitive to the fine-scale nuances in the boundary position, producing an entrained-air boundary line with an average error of 0.33m on mobile downfacing and 0.5-1.0m on stationary upfacing data, less than half that of existing algorithmic solutions. The model's overall annotations had a high level of agreement with the human segmentation, with an intersection-over-union score of 99% for mobile downfacing recordings and 92-95% for stationary upfacing recordings. This resulted in a 50% reduction in the time required for manual edits when compared to the time required to manually edit the line placement produced by the currently available algorithms. Because of the improved initial automated placement, the implementation of the models permits an increase in the standardization and repeatability of line placement.
△ Less
Submitted 18 August, 2022; v1 submitted 19 February, 2022;
originally announced February 2022.
-
LogAvgExp Provides a Principled and Performant Global Pooling Operator
Authors:
Scott C. Lowe,
Thomas Trappenberg,
Sageev Oore
Abstract:
We seek to improve the pooling operation in neural networks, by applying a more theoretically justified operator. We demonstrate that LogSumExp provides a natural OR operator for logits. When one corrects for the number of elements inside the pooling operator, this becomes $\text{LogAvgExp} := \log(\text{mean}(\exp(x)))$. By introducing a single temperature parameter, LogAvgExp smoothly transition…
▽ More
We seek to improve the pooling operation in neural networks, by applying a more theoretically justified operator. We demonstrate that LogSumExp provides a natural OR operator for logits. When one corrects for the number of elements inside the pooling operator, this becomes $\text{LogAvgExp} := \log(\text{mean}(\exp(x)))$. By introducing a single temperature parameter, LogAvgExp smoothly transitions from the max of its operands to the mean (found at the limiting cases $t \to 0^+$ and $t \to +\infty$). We experimentally tested LogAvgExp, both with and without a learnable temperature parameter, in a variety of deep neural network architectures for computer vision.
△ Less
Submitted 2 November, 2021;
originally announced November 2021.
-
Logical Activation Functions: Logit-space equivalents of Probabilistic Boolean Operators
Authors:
Scott C. Lowe,
Robert Earle,
Jason d'Eon,
Thomas Trappenberg,
Sageev Oore
Abstract:
The choice of activation functions and their motivation is a long-standing issue within the neural network community. Neuronal representations within artificial neural networks are commonly understood as logits, representing the log-odds score of presence of features within the stimulus. We derive logit-space operators equivalent to probabilistic Boolean logic-gates AND, OR, and XNOR for independe…
▽ More
The choice of activation functions and their motivation is a long-standing issue within the neural network community. Neuronal representations within artificial neural networks are commonly understood as logits, representing the log-odds score of presence of features within the stimulus. We derive logit-space operators equivalent to probabilistic Boolean logic-gates AND, OR, and XNOR for independent probabilities. Such theories are important to formalize more complex dendritic operations in real neurons, and these operations can be used as activation functions within a neural network, introducing probabilistic Boolean-logic as the core operation of the neural network. Since these functions involve taking multiple exponents and logarithms, they are computationally expensive and not well suited to be directly used within neural networks. Consequently, we construct efficient approximations named $\text{AND}_\text{AIL}$ (the AND operator Approximate for Independent Logits), $\text{OR}_\text{AIL}$, and $\text{XNOR}_\text{AIL}$, which utilize only comparison and addition operations, have well-behaved gradients, and can be deployed as activation functions in neural networks. Like MaxOut, $\text{AND}_\text{AIL}$ and $\text{OR}_\text{AIL}$ are generalizations of ReLU to two-dimensions. While our primary aim is to formalize dendritic computations within a logit-space probabilistic-Boolean framework, we deploy these new activation functions, both in isolation and in conjunction to demonstrate their effectiveness on a variety of tasks including image classification, transfer learning, abstract reasoning, and compositional zero-shot learning.
△ Less
Submitted 29 November, 2022; v1 submitted 22 October, 2021;
originally announced October 2021.
-
Program synthesis performance constrained by non-linear spatial relations in Synthetic Visual Reasoning Test
Authors:
Lu Yihe,
Scott C. Lowe,
Penelope A. Lewis,
Mark C. W. van Rossum
Abstract:
Despite remarkable advances in automated visual recognition by machines, some visual tasks remain challenging for machines. Fleuret et al. (2011) introduced the Synthetic Visual Reasoning Test (SVRT) to highlight this point, which required classification of images consisting of randomly generated shapes based on hidden abstract rules using only a few examples. Ellis et al. (2015) demonstrated that…
▽ More
Despite remarkable advances in automated visual recognition by machines, some visual tasks remain challenging for machines. Fleuret et al. (2011) introduced the Synthetic Visual Reasoning Test (SVRT) to highlight this point, which required classification of images consisting of randomly generated shapes based on hidden abstract rules using only a few examples. Ellis et al. (2015) demonstrated that a program synthesis approach could solve some of the SVRT problems with unsupervised, few-shot learning, whereas they remained challenging for several convolutional neural networks trained with thousands of examples. Here we re-considered the human and machine experiments, because they followed different protocols and yielded different statistics. We thus proposed a quantitative reintepretation of the data between the protocols, so that we could make fair comparison between human and machine performance. We improved the program synthesis classifier by correcting the image parsings, and compared the results to the performance of other machine agents and human subjects. We grouped the SVRT problems into different types by the two aspects of the core characteristics for classification: shape specification and location relation. We found that the program synthesis classifier could not solve problems involving shape distances, because it relied on symbolic computation which scales poorly with input dimension and adding distances into such computation would increase the dimension combinatorially with the number of shapes in an image. Therefore, although the program synthesis classifier is capable of abstract reasoning, its performance is highly constrained by the accessible information in image parsings.
△ Less
Submitted 19 November, 2019; v1 submitted 18 November, 2019;
originally announced November 2019.
-
Exploring Conditioning for Generative Music Systems with Human-Interpretable Controls
Authors:
Nicholas Meade,
Nicholas Barreyre,
Scott C. Lowe,
Sageev Oore
Abstract:
Performance RNN is a machine-learning system designed primarily for the generation of solo piano performances using an event-based (rather than audio) representation. More specifically, Performance RNN is a long short-term memory (LSTM) based recurrent neural network that models polyphonic music with expressive timing and dynamics (Oore et al., 2018). The neural network uses a simple language mode…
▽ More
Performance RNN is a machine-learning system designed primarily for the generation of solo piano performances using an event-based (rather than audio) representation. More specifically, Performance RNN is a long short-term memory (LSTM) based recurrent neural network that models polyphonic music with expressive timing and dynamics (Oore et al., 2018). The neural network uses a simple language model based on the Musical Instrument Digital Interface (MIDI) file format. Performance RNN is trained on the e-Piano Junior Competition Dataset (International Piano e-Competition, 2018), a collection of solo piano performances by expert pianists. As an artistic tool, one of the limitations of the original model has been the lack of useable controls. The standard form of Performance RNN can generate interesting pieces, but little control is provided over what specifically is generated. This paper explores a set of conditioning-based controls used to influence the generation process.
△ Less
Submitted 3 August, 2019; v1 submitted 9 July, 2019;
originally announced July 2019.