-
An Annotated Glossary for Data Commons, Data Meshes, and Other Data Platforms
Authors:
Robert L. Grossman
Abstract:
Cloud-based data commons, data meshes, data hubs, and other data platforms are important ways to manage, analyze and share data to accelerate research and to support reproducible research. This is an annotated glossary of some of the more common terms used in articles and discussions about these platforms.
Submitted 23 April, 2024;
originally announced April 2024.
-
Enhancing Instance-Level Image Classification with Set-Level Labels
Authors:
Renyu Zhang,
Aly A. Khan,
Yuxin Chen,
Robert L. Grossman
Abstract:
Instance-level image classification tasks have traditionally relied on single-instance labels to train models, e.g., few-shot learning and transfer learning. However, set-level coarse-grained labels that capture relationships among instances can provide richer information in real-world scenarios. In this paper, we present a novel approach to enhance instance-level image classification by leveraging set-level labels. We provide a theoretical analysis of the proposed method, including recognition conditions for fast excess risk rate, shedding light on the theoretical foundations of our approach. We conducted experiments on two distinct categories of datasets: natural image datasets and histopathology image datasets. Our experimental results demonstrate the effectiveness of our approach, showcasing improved classification performance compared to traditional single-instance label-based methods. Notably, our algorithm achieves a 13% improvement in classification accuracy compared to the strongest baseline on the histopathology image classification benchmarks. Importantly, our experimental findings align with the theoretical analysis, reinforcing the robustness and reliability of our proposed method. This work bridges the gap between instance-level and set-level image classification, offering a promising avenue for advancing the capabilities of image classification models with set-level coarse-grained labels.
Submitted 17 November, 2023; v1 submitted 8 November, 2023;
originally announced November 2023.
-
Principles and Guidelines for Sharing Biomedical Data for Secondary Use: The University of Chicago Perspective
Authors:
Robert L. Grossman,
Maryellen L. Giger,
Julie A. Johnson,
Jeremy D. Marks,
Jessica P. Ridgway,
Julian Solway,
Walter M. Stadler
Abstract:
Academic medical centers are generating an increasing amount of biomedical data and there is an increasing demand for biomedical data for research purposes by research projects, research consortia, companies, and other third parties. At the same time, as the number of patients grows and the amount of data per patient grows, there is an increasing possibility that some information about some patients may become available if the data is shared with third parties and the third parties have a data breach or violate the terms of the data use agreement. Balancing the importance of research that may result in improved patient outcomes with the importance of protecting patient data is challenging. The article discusses the principles, considerations about risks and mitigating risks, and guidelines used at the University of Chicago for making decisions about sharing biomedical data with third parties.
Submitted 5 February, 2023;
originally announced February 2023.
-
Ten Lessons for Data Sharing With a Data Commons
Authors:
Robert L. Grossman
Abstract:
A data commons is a cloud-based data platform with a governance structure that allows a community to manage, analyze and share its data. Data commons provide a research community with the ability to manage and analyze large datasets using the elastic scalability provided by cloud computing and to share data securely and compliantly, and, in this way, accelerate the pace of research. Over the past decade, a number of data commons have been developed and we discuss some of the lessons learned from this effort.
Submitted 22 July, 2022;
originally announced July 2022.
-
A Framework for the Interoperability of Cloud Platforms: Towards FAIR Data in SAFE Environments
Authors:
Robert L. Grossman,
Rebecca R. Boyles,
Brandi N. Davis-Dusenbery,
Amanda Haddock,
Allison P. Heath,
Brian D. O'Connor,
Adam C. Resnick,
Deanne M. Taylor,
Stan Ahalt
Abstract:
As the number of cloud platforms supporting scientific research grows, there is an increasing need to support interoperability between two or more cloud platforms, as a growing amount of data is being hosted in cloud-based platforms. A well accepted core concept is to make data in cloud platforms Findable, Accessible, Interoperable and Reusable (FAIR). We introduce a companion concept that applies to cloud-based computing environments that we call a Secure and Authorized FAIR Environment (SAFE). SAFE environments require data and platform governance structures and are designed to support the interoperability of sensitive or controlled access data, such as biomedical data. A SAFE environment is a cloud platform that has been approved through a defined data and platform governance process as authorized to hold data from another cloud platform and exposes appropriate APIs for the two platforms to interoperate.
Submitted 15 February, 2024; v1 submitted 9 March, 2022;
originally announced March 2022.
-
Scalable Batch-Mode Deep Bayesian Active Learning via Equivalence Class Annealing
Authors:
Renyu Zhang,
Aly A. Khan,
Robert L. Grossman,
Yuxin Chen
Abstract:
Active learning has demonstrated data efficiency in many fields. Existing active learning algorithms, especially in the context of batch-mode deep Bayesian active models, rely heavily on the quality of uncertainty estimations of the model, and are often challenging to scale to large batches. In this paper, we propose Batch-BALanCe, a scalable batch-mode active learning algorithm, which combines insights from decision-theoretic active learning, combinatorial information measure, and diversity sampling. At its core, Batch-BALanCe relies on a novel decision-theoretic acquisition function that facilitates differentiation among different equivalence classes. Intuitively, each equivalence class consists of hypotheses (e.g., posterior samples of deep neural networks) with similar predictions, and Batch-BALanCe adaptively adjusts the size of the equivalence classes as learning progresses. To scale up the computation of queries to large batches, we further propose an efficient batch-mode acquisition procedure, which aims to maximize a novel information measure defined through the acquisition function. We show that our algorithm can effectively handle realistic multi-class classification tasks, and achieves compelling performance on several benchmark datasets for active learning under both low- and large-batch regimes. Reference code is released at https://github.com/zhangrenyuuchicago/BALanCe.
Submitted 20 February, 2023; v1 submitted 27 December, 2021;
originally announced December 2021.
-
The realization of input-output maps using bialgebras
Authors:
Robert L. Grossman,
Richard G. Larson
Abstract:
We use the theory of bialgebras to provide the algebraic background for state space realization theorems for input-output maps of control systems. This allows us to consider from a common viewpoint classical results about formal state space realizations of nonlinear systems and more recent results involving analysis related to families of trees. If $H$ is a bialgebra, we say that $p \in H^*$ is differentially produced by the algebra $R$ with the augmentation $ε$ if there is a right $H$-module algebra structure on $R$ and there exists $f \in R$ satisfying $p(h) = ε(f \cdot h)$. We characterize those $p \in H^*$ which are differentially produced.
Submitted 18 July, 2020;
originally announced July 2020.
-
Data Lakes, Clouds and Commons: A Review of Platforms for Analyzing and Sharing Genomic Data
Authors:
Robert L. Grossman
Abstract:
Data commons collate data with cloud computing infrastructure and commonly used software services, tools and applications to create biomedical resources for the large-scale management, analysis, harmonization, and sharing of biomedical data. Over the past few years, data commons have been used to analyze, harmonize and share large scale genomics datasets. Data ecosystems can be built by interoperating multiple data commons. It can be quite labor intensive to curate, import and analyze the data in a data commons. Data lakes provide an alternative to data commons and simply provide access to data, with the data curation and analysis deferred until later and delegated to those that access the data. We review software platforms for managing, analyzing and sharing genomic data, with an emphasis on data commons, but also covering data ecosystems and data lakes.
Submitted 24 December, 2018; v1 submitted 5 September, 2018;
originally announced September 2018.
-
Detecting Spatial Patterns of Disease in Large Collections of Electronic Medical Records Using Neighbor-Based Bootstrapping (NB2)
Authors:
Maria T Patterson,
Robert L Grossman
Abstract:
We introduce a method called neighbor-based bootstrapping (NB2) that can be used to quantify the geospatial variation of a variable. We applied this method to an analysis of the incidence rates of disease from electronic medical record data (ICD-9 codes) for approximately 100 million individuals in the US over a period of 8 years. We considered the incidence rate of disease in each county and its geospatially contiguous neighbors and rank ordered diseases in terms of their degree of geospatial variation as quantified by the NB2 method.
We show that this method yields results in good agreement with established methods for detecting spatial autocorrelation (Moran's I method and kriging). Moreover, the NB2 method can be tuned to identify both large area and small area geospatial variations. This method also applies more generally in any parameter space that can be partitioned to consist of regions and their neighbors.
Submitted 5 March, 2017;
originally announced March 2017.
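The abstract above compares NB2 against Moran's I, an established measure of spatial autocorrelation. The listing does not give the NB2 procedure itself, so the following is only a minimal sketch of the comparison baseline (global Moran's I with binary contiguity weights), not the authors' method. The function name `morans_i`, the region names, and the toy values are all illustrative assumptions.

```python
def morans_i(values, neighbors):
    """Global Moran's I with binary contiguity weights (illustrative sketch).

    values:    dict mapping region -> observed value (e.g. an incidence rate)
    neighbors: dict mapping region -> set of contiguous regions (symmetric)
    """
    n = len(values)
    mean = sum(values.values()) / n
    dev = {region: v - mean for region, v in values.items()}

    # Total weight W: every ordered pair of neighbors contributes weight 1.
    w_total = sum(len(nbrs) for nbrs in neighbors.values())

    # Sum of cross-products of deviations over all ordered neighbor pairs.
    cross = sum(dev[r] * dev[s] for r, nbrs in neighbors.items() for s in nbrs)

    # Sum of squared deviations over all regions.
    denom = sum(d * d for d in dev.values())

    return (n / w_total) * (cross / denom)


# Hypothetical toy example: four "counties" arranged in a line A-B-C-D.
line = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
smooth = {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0}       # neighbors are similar
alternating = {"A": 1.0, "B": 4.0, "C": 1.0, "D": 4.0}  # neighbors are dissimilar
```

On this toy layout, `morans_i(smooth, line)` is positive and `morans_i(alternating, line)` is negative, which is the qualitative behavior expected of a spatial autocorrelation statistic.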
-
A Case for Data Commons: Towards Data Science as a Service
Authors:
Robert L. Grossman,
Allison Heath,
Mark Murphy,
Maria Patterson,
Walt Wells
Abstract:
As the amount of scientific data continues to grow at ever faster rates, the research community is increasingly in need of flexible computational infrastructure that can support the entirety of the data science lifecycle, including long-term data storage, data exploration and discovery services, and compute capabilities to support data analysis and re-analysis, as new data are added and as scientific pipelines are refined. We describe our experience developing data commons, interoperable infrastructure that co-locates data, storage, and compute with common analysis tools, and present several case studies. Across these case studies, several common requirements emerge, including the need for persistent digital identifier and metadata services, APIs, data portability, pay-for-compute capabilities, and data peering agreements between data commons. Though many challenges remain, including sustainability and developing appropriate standards, interoperable data commons bring us one step closer to effective Data Science as a Service for the scientific research community.
Submitted 9 April, 2016;
originally announced April 2016.
-
The Design of a Community Science Cloud: The Open Science Data Cloud Perspective
Authors:
Robert L. Grossman,
Matthew Greenway,
Allison P. Heath,
Ray Powell,
Rafael D. Suarez,
Walt Wells,
Kevin White,
Malcolm Atkinson,
Iraklis Klampanos,
Heidi L. Alvarez,
Christine Harvey,
Joe J. Mambretti
Abstract:
In this paper we describe the design and implementation of the Open Science Data Cloud, or OSDC. The goal of the OSDC is to provide petabyte-scale data cloud infrastructure and related services for scientists working with large quantities of data. Currently, the OSDC consists of more than 2000 cores and 2 PB of storage distributed across four data centers connected by 10G networks. We discuss some of the lessons learned during the past three years of operation and describe the software stacks used in the OSDC. We also describe some of the research projects in biology, the earth sciences, and social sciences enabled by the OSDC.
Submitted 3 January, 2016;
originally announced January 2016.
-
MalStone: Towards A Benchmark for Analytics on Large Data Clouds
Authors:
Collin Bennett,
Robert L. Grossman,
David Locke,
Jonathan Seidman,
Steve Vejcik
Abstract:
Developing data mining algorithms that are suitable for cloud computing platforms is currently an active area of research, as is developing cloud computing platforms appropriate for data mining. Currently, the most common benchmarks for cloud computing are the Terasort and related benchmarks. Although the Terasort Benchmark is quite useful, it was not designed for data mining per se. In this paper, we introduce a benchmark called MalStone that is specifically designed to measure the performance of cloud computing middleware that supports the type of data intensive computing common when building data mining models. We also introduce MalGen, which is a utility for generating data on clouds that can be used with MalStone.
Submitted 7 July, 2010;
originally announced July 2010.
-
State Space Realization Theorems For Data Mining
Authors:
Robert L Grossman,
Richard G Larson
Abstract:
In this paper, we consider formal series associated with events, profiles derived from events, and statistical models that make predictions about events. We prove theorems about realizations for these formal series using the language and tools of Hopf algebras.
Submitted 18 January, 2009;
originally announced January 2009.
-
Sector and Sphere: Towards Simplified Storage and Processing of Large Scale Distributed Data
Authors:
Yunhong Gu,
Robert L Grossman
Abstract:
Cloud computing has demonstrated that processing very large datasets over commodity clusters can be done simply given the right programming model and infrastructure. In this paper, we describe the design and implementation of the Sector storage cloud and the Sphere compute cloud. In contrast to existing storage and compute clouds, Sector can manage data not only within a data center, but also across geographically distributed data centers. Similarly, the Sphere compute cloud supports User Defined Functions (UDF) over data both within a data center and across data centers. As a special case, MapReduce style programming can be implemented in Sphere by using a Map UDF followed by a Reduce UDF. We describe some experimental studies comparing Sector/Sphere and Hadoop using the Terasort Benchmark. In these studies, Sector is about twice as fast as Hadoop. Sector/Sphere is open source.
Submitted 16 January, 2009; v1 submitted 6 September, 2008;
originally announced September 2008.
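The abstract above notes that MapReduce-style programming can be expressed as a Map UDF followed by a Reduce UDF. The sketch below illustrates that pattern in plain, single-process Python. It does not use or imitate the Sector/Sphere API; `map_udf`, `reduce_udf`, and `run_map_then_reduce` are hypothetical names chosen for this word-count illustration.

```python
from collections import defaultdict


def map_udf(record):
    """Hypothetical Map UDF: emit a (word, 1) pair for each word in a text record."""
    return [(word.lower(), 1) for word in record.split()]


def reduce_udf(key, values):
    """Hypothetical Reduce UDF: combine all values emitted for one key."""
    return key, sum(values)


def run_map_then_reduce(records, map_fn, reduce_fn):
    """Apply map_fn to every record, group the emitted pairs by key, then reduce.

    A real system would distribute the map, shuffle, and reduce stages across
    nodes; here everything runs in one process to show the data flow only.
    """
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            grouped[key].append(value)
    return dict(reduce_fn(key, values) for key, values in grouped.items())


counts = run_map_then_reduce(["a b a", "b c"], map_udf, reduce_udf)
```

The grouping step between the two UDFs plays the role of the shuffle phase; swapping in different UDFs changes the computation without changing the driver.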
-
Data Mining Using High Performance Data Clouds: Experimental Studies Using Sector and Sphere
Authors:
Robert L Grossman,
Yunhong Gu
Abstract:
We describe the design and implementation of a high performance cloud that we have used to archive, analyze and mine large distributed data sets. By a cloud, we mean an infrastructure that provides resources and/or services over the Internet. A storage cloud provides storage services, while a compute cloud provides compute services. We describe the design of the Sector storage cloud and how it provides the storage services required by the Sphere compute cloud. We also describe the programming paradigm supported by the Sphere compute cloud. Sector and Sphere are designed for analyzing large data sets using computer clusters connected with wide area high performance networks (for example, 10+ Gb/s). We describe a distributed data mining application that we have developed using Sector and Sphere. Finally, we describe some experimental studies comparing Sector/Sphere to Hadoop.
Submitted 21 August, 2008;
originally announced August 2008.
-
Compute and Storage Clouds Using Wide Area High Performance Networks
Authors:
Robert L. Grossman,
Yunhong Gu,
Michael Sabala,
Wanzhi Zhang
Abstract:
We describe a cloud based infrastructure that we have developed that is optimized for wide area, high performance networks and designed to support data mining applications. The infrastructure consists of a storage cloud called Sector and a compute cloud called Sphere. We describe two applications that we have built using the cloud and some experimental studies.
Submitted 13 August, 2008;
originally announced August 2008.
-
Hopf-algebraic structures of families of trees
Authors:
R. L. Grossman,
R. G. Larson
Abstract:
Description of cocommutative Hopf algebras associated with families of trees. Applications include Cayley's theorem on the number of rooted trees with n nodes, and Catalan's theorem on the number of rooted ordered trees with n nodes.
Submitted 24 November, 2007;
originally announced November 2007.
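The Catalan count mentioned in the abstract above can be checked numerically: the number of rooted ordered trees with n nodes equals the Catalan number C(n-1). The sketch below verifies this by a direct recurrence and is not the Hopf-algebraic argument of the paper; the function names are illustrative, and the Cayley count is not covered here.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def ordered_forests(m):
    """Number of ordered forests (sequences of rooted ordered trees) with m nodes."""
    if m == 0:
        return 1  # the empty forest
    # The first tree uses k nodes; the remaining trees form a forest on m - k nodes.
    return sum(ordered_trees(k) * ordered_forests(m - k) for k in range(1, m + 1))


@lru_cache(maxsize=None)
def ordered_trees(n):
    """Number of rooted ordered trees with n >= 1 nodes."""
    # Deleting the root leaves an ordered forest on the other n - 1 nodes.
    return ordered_forests(n - 1)


def catalan(k):
    """Closed form for the k-th Catalan number."""
    return comb(2 * k, k) // (k + 1)


tree_counts = [ordered_trees(n) for n in range(1, 8)]
catalan_counts = [catalan(n - 1) for n in range(1, 8)]
```

Both lists agree for the small cases computed here, which is a sanity check of the statement rather than a proof.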
-
An Overview of Hopf Algebras of Trees and Their Actions on Functions
Authors:
Robert L. Grossman,
Richard G. Larson
Abstract:
We provide an expository account of some of the Hopf algebras that can be defined using trees, labeled trees, ordered trees and heap ordered trees. We also describe some actions of these Hopf algebras on algebras of functions.
Submitted 24 November, 2007;
originally announced November 2007.
-
Hopf Algebras of Heap Ordered Trees and Permutations
Authors:
R. L. Grossman,
R. G. Larson
Abstract:
It is known that there is a Hopf algebra structure on the vector space with basis all heap-ordered trees. We give a new bialgebra structure on the space with basis all permutations and show that there is a direct bialgebra isomorphism between the Hopf algebra of heap-ordered trees and the bialgebra of permutations.
Submitted 14 November, 2007; v1 submitted 9 June, 2007;
originally announced June 2007.
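A counting check helps make the correspondence between heap-ordered trees and permutations in the abstract above plausible. The sketch assumes the convention that a heap-ordered tree on nodes 0..n is a rooted tree with root 0 in which every node's label is larger than its parent's, with children unordered. Under that assumption there are n! such trees, matching the number of permutations of n elements. The insertion-based map below is one simple set-level bijection chosen for illustration; it is not claimed to be the bialgebra isomorphism constructed in the paper.

```python
from itertools import product
from math import factorial


def heap_ordered_trees(n):
    """Yield parent tuples for heap-ordered trees on nodes 0..n (root is 0).

    parents[k - 1] is the parent of node k, and it may be any earlier node
    0..k-1, so the number of trees is 1 * 2 * ... * n = n!.
    """
    return product(*(range(k) for k in range(1, n + 1)))


def to_permutation(parents):
    """Illustrative bijection: insert k at list position parents[k - 1]."""
    perm = []
    for k, p in enumerate(parents, start=1):
        perm.insert(p, k)  # position p is valid because len(perm) == k - 1 >= p
    return tuple(perm)


trees = list(heap_ordered_trees(3))
perms = {to_permutation(t) for t in trees}
```

Distinct parent tuples give distinct permutations because the position of the largest element recovers the last choice, and so on recursively.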
-
Differential Algebra Structures on Families of Trees
Authors:
Robert L Grossman,
Richard G Larson
Abstract:
It is known that the vector space spanned by labeled rooted trees forms a Hopf algebra. Let k be a field and let R be a commutative k-algebra. Let H denote the Hopf algebra of rooted trees labeled using derivations D in Der(R). In this paper, we introduce a construction which gives R an H-module algebra structure and show this induces a differential algebra structure of H acting on R. The work here extends the notion of an R/k-bialgebra introduced by Nichols and Weisfeiler.
Submitted 31 August, 2004;
originally announced September 2004.