-
REFORMS: Reporting Standards for Machine Learning Based Science
Authors:
Sayash Kapoor,
Emily Cantrell,
Kenny Peng,
Thanh Hien Pham,
Christopher A. Bail,
Odd Erik Gundersen,
Jake M. Hofman,
Jessica Hullman,
Michael A. Lones,
Momin M. Malik,
Priyanka Nanayakkara,
Russell A. Poldrack,
Inioluwa Deborah Raji,
Michael Roberts,
Matthew J. Salganik,
Marta Serra-Garcia,
Brandon M. Stewart,
Gilles Vandewiele,
Arvind Narayanan
Abstract:
Machine learning (ML) methods are proliferating in scientific research. However, the adoption of these methods has been accompanied by failures of validity, reproducibility, and generalizability. These failures can hinder scientific progress, lead to false consensus around invalid claims, and undermine the credibility of ML-based science. ML methods are often applied and fail in similar ways acros…
▽ More
Machine learning (ML) methods are proliferating in scientific research. However, the adoption of these methods has been accompanied by failures of validity, reproducibility, and generalizability. These failures can hinder scientific progress, lead to false consensus around invalid claims, and undermine the credibility of ML-based science. ML methods are often applied and fail in similar ways across disciplines. Motivated by this observation, our goal is to provide clear reporting standards for ML-based science. Drawing from an extensive review of past literature, we present the REFORMS checklist ($\textbf{Re}$porting Standards $\textbf{For}$ $\textbf{M}$achine Learning Based $\textbf{S}$cience). It consists of 32 questions and a paired set of guidelines. REFORMS was developed based on a consensus of 19 researchers across computer science, data science, mathematics, social sciences, and biomedical sciences. REFORMS can serve as a resource for researchers when designing and implementing a study, for referees when reviewing papers, and for journals when enforcing standards for transparency and reproducibility.
△ Less
Submitted 19 September, 2023; v1 submitted 15 August, 2023;
originally announced August 2023.
-
A Comparison of Neuroelectrophysiology Databases
Authors:
Priyanka Subash,
Alex Gray,
Misque Boswell,
Samantha L. Cohen,
Rachael Garner,
Sana Salehi,
Calvary Fisher,
Samuel Hobel,
Satrajit Ghosh,
Yaroslav Halchenko,
Benjamin Dichter,
Russell A. Poldrack,
Chris Markiewicz,
Dora Hermes,
Arnaud Delorme,
Scott Makeig,
Brendan Behan,
Alana Sparks,
Stephen R Arnott,
Zhengjia Wang,
John Magnotti,
Michael S. Beauchamp,
Nader Pouratian,
Arthur W. Toga,
Dominique Duncan
Abstract:
As data sharing has become more prevalent, three pillars - archives, standards, and analysis tools - have emerged as critical components in facilitating effective data sharing and collaboration. This paper compares four freely available intracranial neuroelectrophysiology data repositories: Data Archive for the BRAIN Initiative (DABI), Distributed Archives for Neurophysiology Data Integration (DAN…
▽ More
As data sharing has become more prevalent, three pillars - archives, standards, and analysis tools - have emerged as critical components in facilitating effective data sharing and collaboration. This paper compares four freely available intracranial neuroelectrophysiology data repositories: Data Archive for the BRAIN Initiative (DABI), Distributed Archives for Neurophysiology Data Integration (DANDI), OpenNeuro, and Brain-CODE. The aim of this review is to describe archives that provide researchers with tools to store, share, and reanalyze both human and non-human neurophysiology data based on criteria that are of interest to the neuroscientific community. The Brain Imaging Data Structure (BIDS) and Neurodata Without Borders (NWB) are utilized by these archives to make data more accessible to researchers by implementing a common standard. As the necessity for integrating large-scale analysis into data repository platforms continues to grow within the neuroscientific community, this article will highlight the various analytical and customizable tools developed within the chosen archives that may advance the field of neuroinformatics.
△ Less
Submitted 30 August, 2023; v1 submitted 26 June, 2023;
originally announced June 2023.
-
brainlife.io: A decentralized and open source cloud platform to support neuroscience research
Authors:
Soichi Hayashi,
Bradley A. Caron,
Anibal Sólon Heinsfeld,
Sophia Vinci-Booher,
Brent McPherson,
Daniel N. Bullock,
Giulia Bertò,
Guiomar Niso,
Sandra Hanekamp,
Daniel Levitas,
Kimberly Ray,
Anne MacKenzie,
Lindsey Kitchell,
Josiah K. Leong,
Filipi Nascimento-Silva,
Serge Koudoro,
Hanna Willis,
Jasleen K. Jolly,
Derek Pisner,
Taylor R. Zuidema,
Jan W. Kurzawski,
Kyriaki Mikellidou,
Aurore Bussalb,
Christopher Rorden,
Conner Victory
, et al. (39 additional authors not shown)
Abstract:
Neuroscience research has expanded dramatically over the past 30 years by advancing standardization and tool development to support rigor and transparency. Consequently, the complexity of the data pipeline has also increased, hindering access to FAIR (Findable, Accessible, Interoperabile, and Reusable) data analysis to portions of the worldwide research community. brainlife.io was developed to red…
▽ More
Neuroscience research has expanded dramatically over the past 30 years by advancing standardization and tool development to support rigor and transparency. Consequently, the complexity of the data pipeline has also increased, hindering access to FAIR (Findable, Accessible, Interoperabile, and Reusable) data analysis to portions of the worldwide research community. brainlife.io was developed to reduce these burdens and democratize modern neuroscience research across institutions and career levels. Using community software and hardware infrastructure, the platform provides open-source data standardization, management, visualization, and processing and simplifies the data pipeline. brainlife.io automatically tracks the provenance history of thousands of data objects, supporting simplicity, efficiency, and transparency in neuroscience research. Here brainlife.io's technology and data services are described and evaluated for validity, reliability, reproducibility, replicability, and scientific utility. Using data from 4 modalities and 3,200 participants, we demonstrate that brainlife.io's services produce outputs that adhere to best practices in modern neuroscience research.
△ Less
Submitted 11 August, 2023; v1 submitted 3 June, 2023;
originally announced June 2023.
-
AI-assisted coding: Experiments with GPT-4
Authors:
Russell A Poldrack,
Thomas Lu,
Gašper Beguš
Abstract:
Artificial intelligence (AI) tools based on large language models have acheived human-level performance on some computer programming tasks. We report several experiments using GPT-4 to generate computer code. These experiments demonstrate that AI code generation using the current generation of tools, while powerful, requires substantial human validation to ensure accurate performance. We also demo…
▽ More
Artificial intelligence (AI) tools based on large language models have acheived human-level performance on some computer programming tasks. We report several experiments using GPT-4 to generate computer code. These experiments demonstrate that AI code generation using the current generation of tools, while powerful, requires substantial human validation to ensure accurate performance. We also demonstrate that GPT-4 refactoring of existing code can significantly improve that code along several established metrics for code quality, and we show that GPT-4 can generate tests with substantial coverage, but that many of the tests fail when applied to the associated code. These findings suggest that while AI coding tools are very powerful, they still require humans in the loop to ensure validity and accuracy of the results.
△ Less
Submitted 25 April, 2023;
originally announced April 2023.
-
On the long-term archiving of research data
Authors:
Cyril Pernet,
Claus Svarer,
Ross Blair,
John D. Van Horn,
Russell A. Poldrack
Abstract:
Accessing research data at any time is what FAIR (Findable Accessible Interoperable Reusable) data sharing aims to achieve at scale. Yet, we argue that it is not sustainable to keep accumulating and maintaining all datasets for rapid access, considering the monetary and ecological cost of maintaining repositories. Here, we address the issue of cold data storage: when to dispose of data for offline…
▽ More
Accessing research data at any time is what FAIR (Findable Accessible Interoperable Reusable) data sharing aims to achieve at scale. Yet, we argue that it is not sustainable to keep accumulating and maintaining all datasets for rapid access, considering the monetary and ecological cost of maintaining repositories. Here, we address the issue of cold data storage: when to dispose of data for offline storage, how can this be done while maintaining FAIR principles and who should be responsible for cold archiving and long-term preservation.
△ Less
Submitted 3 January, 2023;
originally announced January 2023.
-
Differentiable programming for functional connectomics
Authors:
Rastko Ciric,
Armin W. Thomas,
Oscar Esteban,
Russell A. Poldrack
Abstract:
Mapping the functional connectome has the potential to uncover key insights into brain organisation. However, existing workflows for functional connectomics are limited in their adaptability to new data, and principled workflow design is a challenging combinatorial problem. We introduce a new analytic paradigm and software toolbox that implements common operations used in functional connectomics a…
▽ More
Mapping the functional connectome has the potential to uncover key insights into brain organisation. However, existing workflows for functional connectomics are limited in their adaptability to new data, and principled workflow design is a challenging combinatorial problem. We introduce a new analytic paradigm and software toolbox that implements common operations used in functional connectomics as fully differentiable processing blocks. Under this paradigm, workflow configurations exist as reparameterisations of a differentiable functional that interpolates them. The differentiable program that we envision occupies a niche midway between traditional pipelines and end-to-end neural networks, combining the glass-box tractability and domain knowledge of the former with the amenability to optimisation of the latter. In this preliminary work, we provide a proof of concept for differentiable connectomics, demonstrating the capacity of our processing blocks both to recapitulate canonical knowledge in neuroscience and to make new discoveries in an unsupervised setting. Our differentiable modules are competitive with state-of-the-art methods in problem domains including functional parcellation, denoising, and covariance modelling. Taken together, our results and software demonstrate the promise of differentiable programming for functional connectomics.
△ Less
Submitted 31 May, 2022;
originally announced June 2022.
-
Comparing interpretation methods in mental state decoding analyses with deep learning models
Authors:
Armin W. Thomas,
Christopher Ré,
Russell A. Poldrack
Abstract:
Deep learning (DL) models find increasing application in mental state decoding, where researchers seek to understand the mapping between mental states (e.g., perceiving fear or joy) and brain activity by identifying those brain regions (and networks) whose activity allows to accurately identify (i.e., decode) these states. Once a DL model has been trained to accurately decode a set of mental state…
▽ More
Deep learning (DL) models find increasing application in mental state decoding, where researchers seek to understand the mapping between mental states (e.g., perceiving fear or joy) and brain activity by identifying those brain regions (and networks) whose activity allows to accurately identify (i.e., decode) these states. Once a DL model has been trained to accurately decode a set of mental states, neuroimaging researchers often make use of interpretation methods from explainable artificial intelligence research to understand the model's learned mappings between mental states and brain activity. Here, we compare the explanation performance of prominent interpretation methods in a mental state decoding analysis of three functional Magnetic Resonance Imaging (fMRI) datasets. Our findings demonstrate a gradient between two key characteristics of an explanation in mental state decoding, namely, its biological plausibility and faithfulness: interpretation methods with high explanation faithfulness, which capture the model's decision process well, generally provide explanations that are biologically less plausible than the explanations of interpretation methods with less explanation faithfulness. Based on this finding, we provide specific recommendations for the application of interpretation methods in mental state decoding.
△ Less
Submitted 14 October, 2022; v1 submitted 31 May, 2022;
originally announced May 2022.
-
DeepDefacer: Automatic Removal of Facial Features via U-Net Image Segmentation
Authors:
Anish Khazane,
Julien Hoachuck,
Krzysztof J. Gorgolewski,
Russell A. Poldrack
Abstract:
Recent advancements in the field of magnetic resonance imaging (MRI) have enabled large-scale collaboration among clinicians and researchers for neuroimaging tasks. However, researchers are often forced to use outdated and slow software to anonymize MRI images for publication. These programs specifically perform expensive mathematical operations over 3D images that rapidly slow down anonymization…
▽ More
Recent advancements in the field of magnetic resonance imaging (MRI) have enabled large-scale collaboration among clinicians and researchers for neuroimaging tasks. However, researchers are often forced to use outdated and slow software to anonymize MRI images for publication. These programs specifically perform expensive mathematical operations over 3D images that rapidly slow down anonymization speed as an image's volume increases in size. In this paper, we introduce DeepDefacer, an application of deep learning to MRI anonymization that uses a streamlined 3D U-Net network to mask facial regions in MRI images with a significant increase in speed over traditional de-identification software. We train DeepDefacer on MRI images from the Brain Development Organization (IXI) and International Consortium for Brain Mapping (ICBM) and quantitatively evaluate our model against a baseline 3D U-Net model with regards to Dice, recall, and precision scores. We also evaluate DeepDefacer against Pydeface, a traditional defacing application, with regards to speed on a range of CPU and GPU devices and qualitatively evaluate our model's defaced output versus the ground truth images produced by Pydeface. We provide a link to a PyPi program at the end of this manuscript to encourage further research into the application of deep learning to MRI anonymization.
△ Less
Submitted 31 May, 2022;
originally announced May 2022.
-
Challenges for cognitive decoding using deep learning methods
Authors:
Armin W. Thomas,
Christopher Ré,
Russell A. Poldrack
Abstract:
In cognitive decoding, researchers aim to characterize a brain region's representations by identifying the cognitive states (e.g., accepting/rejecting a gamble) that can be identified from the region's activity. Deep learning (DL) methods are highly promising for cognitive decoding, with their unmatched ability to learn versatile representations of complex data. Yet, their widespread application i…
▽ More
In cognitive decoding, researchers aim to characterize a brain region's representations by identifying the cognitive states (e.g., accepting/rejecting a gamble) that can be identified from the region's activity. Deep learning (DL) methods are highly promising for cognitive decoding, with their unmatched ability to learn versatile representations of complex data. Yet, their widespread application in cognitive decoding is hindered by their general lack of interpretability as well as difficulties in applying them to small datasets and in ensuring their reproducibility and robustness. We propose to approach these challenges by leveraging recent advances in explainable artificial intelligence and transfer learning, while also providing specific recommendations on how to improve the reproducibility and robustness of DL modeling results.
△ Less
Submitted 16 August, 2021;
originally announced August 2021.
-
Computational and informatics advances for reproducible data analysis in neuroimaging
Authors:
Russell A. Poldrack,
Krzysztof J. Gorgolewski,
Gael Varoquaux
Abstract:
The reproducibility of scientific research has become a point of critical concern. We argue that openness and transparency are critical for reproducibility, and we outline an ecosystem for open and transparent science that has emerged within the human neuroimaging community. We discuss the range of open data sharing resources that have been developed for neuroimaging data, and the role of data sta…
▽ More
The reproducibility of scientific research has become a point of critical concern. We argue that openness and transparency are critical for reproducibility, and we outline an ecosystem for open and transparent science that has emerged within the human neuroimaging community. We discuss the range of open data sharing resources that have been developed for neuroimaging data, and the role of data standards (particularly the Brain Imaging Data Structure) in enabling the automated sharing, processing, and reuse of large neuroimaging datasets. We outline how the open-source Python language has provided the basis for a data science platform that enables reproducible data analysis and visualization. We also discuss how new advances in software engineering, such as containerization, provide the basis for greater reproducibility in data analysis. The emergence of this new ecosystem provides an example for many areas of science that are currently struggling with reproducibility.
△ Less
Submitted 24 September, 2018;
originally announced September 2018.
-
Text to brain: predicting the spatial distribution of neuroimaging observations from text reports
Authors:
Jérôme Dockès,
Demian Wassermann,
Russell Poldrack,
Fabian Suchanek,
Bertrand Thirion,
Gaël Varoquaux
Abstract:
Despite the digital nature of magnetic resonance imaging, the resulting observations are most frequently reported and stored in text documents. There is a trove of information untapped in medical health records, case reports, and medical publications. In this paper, we propose to mine brain medical publications to learn the spatial distribution associated with anatomical terms. The problem is form…
▽ More
Despite the digital nature of magnetic resonance imaging, the resulting observations are most frequently reported and stored in text documents. There is a trove of information untapped in medical health records, case reports, and medical publications. In this paper, we propose to mine brain medical publications to learn the spatial distribution associated with anatomical terms. The problem is formulated in terms of minimization of a risk on distributions which leads to a least-deviation cost function. An efficient algorithm in the dual then learns the mapping from documents to brain structures. Empirical results using coordinates extracted from the brain-imaging literature show that i) models must adapt to semantic variation in the terms used to describe a given anatomical structure, ii) voxel-wise parameterization leads to higher likelihood of locations reported in unseen documents, iii) least-deviation cost outperforms least-square. As a proof of concept for our method, we use our model of spatial distributions to predict the distribution of specific neurological conditions from text-only reports.
△ Less
Submitted 28 June, 2018; v1 submitted 4 June, 2018;
originally announced June 2018.
-
Information Projection and Approximate Inference for Structured Sparse Variables
Authors:
Rajiv Khanna,
Joydeep Ghosh,
Russell Poldrack,
Oluwasanmi Koyejo
Abstract:
Approximate inference via information projection has been recently introduced as a general-purpose approach for efficient probabilistic inference given sparse variables. This manuscript goes beyond classical sparsity by proposing efficient algorithms for approximate inference via information projection that are applicable to any structure on the set of variables that admits enumeration using a \em…
▽ More
Approximate inference via information projection has been recently introduced as a general-purpose approach for efficient probabilistic inference given sparse variables. This manuscript goes beyond classical sparsity by proposing efficient algorithms for approximate inference via information projection that are applicable to any structure on the set of variables that admits enumeration using a \emph{matroid}. We show that the resulting information projection can be reduced to combinatorial submodular optimization subject to matroid constraints. Further, leveraging recent advances in submodular optimization, we provide an efficient greedy algorithm with strong optimization-theoretic guarantees. The class of probabilistic models that can be expressed in this way is quite broad and, as we show, includes group sparse regression, group sparse principal components analysis and sparse canonical correlation analysis, among others. Moreover, empirical results on simulated data and high dimensional neuroimaging data highlight the superior performance of the information projection approach as compared to established baselines for a range of probabilistic models.
△ Less
Submitted 11 July, 2016;
originally announced July 2016.
-
A simple and provable algorithm for sparse diagonal CCA
Authors:
Megasthenis Asteris,
Anastasios Kyrillidis,
Oluwasanmi Koyejo,
Russell Poldrack
Abstract:
Given two sets of variables, derived from a common set of samples, sparse Canonical Correlation Analysis (CCA) seeks linear combinations of a small number of variables in each set, such that the induced canonical variables are maximally correlated. Sparse CCA is NP-hard.
We propose a novel combinatorial algorithm for sparse diagonal CCA, i.e., sparse CCA under the additional assumption that vari…
▽ More
Given two sets of variables, derived from a common set of samples, sparse Canonical Correlation Analysis (CCA) seeks linear combinations of a small number of variables in each set, such that the induced canonical variables are maximally correlated. Sparse CCA is NP-hard.
We propose a novel combinatorial algorithm for sparse diagonal CCA, i.e., sparse CCA under the additional assumption that variables within each set are standardized and uncorrelated. Our algorithm operates on a low rank approximation of the input data and its computational complexity scales linearly with the number of input variables. It is simple to implement, and parallelizable. In contrast to most existing approaches, our algorithm administers precise control on the sparsity of the extracted canonical vectors, and comes with theoretical data-dependent global approximation guarantees, that hinge on the spectrum of the input data. Finally, it can be straightforwardly adapted to other constrained variants of CCA enforcing structure beyond sparsity.
We empirically evaluate the proposed scheme and apply it on a real neuroimaging dataset to investigate associations between brain activity and behavior measurements.
△ Less
Submitted 28 May, 2016;
originally announced May 2016.