-
The Vertebrate Breed Ontology: Towards Effective Breed Data Standardization
Authors:
Kathleen R. Mullen,
Imke Tammen,
Nicolas A. Matentzoglu,
Marius Mather,
Christopher J. Mungall,
Melissa A. Haendel,
Frank W. Nicholas,
Sabrina Toro,
the Vertebrate Breed Ontology Consortium
Abstract:
Background: Limited universally adopted data standards in veterinary science hinder data interoperability and therefore integration and comparison; this ultimately impedes application of existing information-based tools to support advancement in veterinary diagnostics, treatments, and precision medicine.
Objectives: Creation of a Vertebrate Breed Ontology (VBO) as a single, coherent logic-based standard for documenting breed names in animal health, production and research-related records will improve data use capabilities in veterinary and comparative medicine.
Animals: No live animals were used in this study.
Methods: A list of breed names and related information was compiled from relevant sources, organizations, communities, and experts using manual and computational approaches to create VBO. Each breed is represented by a VBO term that includes all provenance and the breed's related information as metadata. VBO terms are classified using description logic to allow computational applications and Artificial Intelligence-readiness.
Results: VBO is an open, community-driven ontology representing over 19,000 livestock and companion animal breeds covering 41 species. Breeds are classified based on community and expert conventions (e.g., horse breed, cattle breed). This classification is supported by relations to the breeds' genus and species indicated by NCBI Taxonomy terms. Relationships between VBO terms, e.g. relating breeds to their foundation stock, provide additional context to support advanced data analytics. VBO term metadata includes common names and synonyms, breed identifiers or codes, and attributed cross-references to other databases.
Conclusion and clinical importance: Veterinary data interoperability and computability can be enhanced by the adoption of VBO as a source of standard breed names in databases and veterinary electronic health records.
Submitted 3 June, 2024;
originally announced June 2024.
-
Automated Annotation of Scientific Texts for ML-based Keyphrase Extraction and Validation
Authors:
Oluwamayowa O. Amusat,
Harshad Hegde,
Christopher J. Mungall,
Anna Giannakou,
Neil P. Byers,
Dan Gunter,
Kjiersten Fagnan,
Lavanya Ramakrishnan
Abstract:
Advanced omics technologies and facilities generate a wealth of valuable data daily; however, the data often lacks the essential metadata required for researchers to find and search them effectively. The lack of metadata poses a significant challenge in the utilization of these datasets. Machine learning-based metadata extraction techniques have emerged as a potentially viable approach to automatically annotating scientific datasets with the metadata necessary for enabling effective search. Text labeling, usually performed manually, plays a crucial role in validating machine-extracted metadata. However, manual labeling is time-consuming; thus, there is a need to develop automated text labeling techniques in order to accelerate the process of scientific innovation. This need is particularly urgent in fields such as environmental genomics and microbiome science, which have historically received less attention in terms of metadata curation and creation of gold-standard text mining datasets.
In this paper, we present two novel automated text labeling approaches for the validation of ML-generated metadata for unlabeled texts, with specific applications in environmental genomics. Our techniques show the potential of two new ways to leverage existing information about the unlabeled texts and the scientific domain. The first technique exploits relationships between different types of data sources related to the same research study, such as publications and proposals. The second technique takes advantage of domain-specific controlled vocabularies or ontologies. In this paper, we detail applying these approaches for ML-generated metadata validation. Our results show that the proposed label assignment approaches can generate both generic and highly-specific text labels for the unlabeled texts, with up to 44% of the labels matching with those suggested by an ML keyword extraction algorithm.
Submitted 8 November, 2023;
originally announced November 2023.
-
Gene Set Summarization using Large Language Models
Authors:
Marcin P. Joachimiak,
J. Harry Caufield,
Nomi L. Harris,
Hyeongsik Kim,
Christopher J. Mungall
Abstract:
Molecular biologists frequently interpret gene lists derived from high-throughput experiments and computational analysis. This is typically done as a statistical enrichment analysis that measures the over- or under-representation of biological function terms associated with genes or their properties, based on curated assertions from a knowledge base (KB) such as the Gene Ontology (GO). Interpreting gene lists can also be framed as a textual summarization task, enabling the use of Large Language Models (LLMs), potentially utilizing scientific texts directly and avoiding reliance on a KB.
We developed SPINDOCTOR (Structured Prompt Interpolation of Natural Language Descriptions of Controlled Terms for Ontology Reporting), a method that uses GPT models to perform gene set function summarization as a complement to standard enrichment analysis. This method can use different sources of gene functional information: (1) structured text derived from curated ontological KB annotations, (2) ontology-free narrative gene summaries, or (3) direct model retrieval.
We demonstrate that these methods are able to generate plausible and biologically valid summary GO term lists for gene sets. However, GPT-based approaches are unable to deliver reliable scores or p-values and often return terms that are not statistically significant. Crucially, these methods were rarely able to recapitulate the most precise and informative term from standard enrichment, likely due to an inability to generalize and reason using an ontology. Results are highly nondeterministic, with minor variations in prompt resulting in radically different term lists. Our results show that at this point, LLM-based methods are unsuitable as a replacement for standard term enrichment analysis and that manual curation of ontological assertions remains necessary.
Submitted 3 July, 2024; v1 submitted 20 May, 2023;
originally announced May 2023.
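The standard enrichment analysis that the abstract above uses as its baseline is commonly computed as a one-sided hypergeometric test for over-representation. The following is a minimal sketch of that general statistic, not the method or code used in the paper; the function name, parameter names, and the numbers in the usage line are illustrative assumptions.

```python
from math import comb


def hypergeom_enrichment_p(population: int, annotated: int, sample: int, hits: int) -> float:
    """Probability of seeing at least `hits` annotated genes in the gene list.

    population: number of genes in the background set
    annotated:  background genes annotated with the term of interest
    sample:     number of genes in the gene list being interpreted
    hits:       genes in the list annotated with the term
    """
    max_hits = min(annotated, sample)
    total = comb(population, sample)  # all possible gene lists of this size
    # Sum the upper tail P(X = k) for k = hits .. max_hits.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, max_hits + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 background genes, 200 annotated with a term,
# a 50-gene list containing 5 of the annotated genes.
p_value = hypergeom_enrichment_p(20_000, 200, 50, 5)
print(f"over-representation p-value: {p_value:.3g}")
```

In practice the resulting p-values are corrected for multiple testing across all terms; that step is omitted here for brevity.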
-
KG-Hub -- Building and Exchanging Biological Knowledge Graphs
Authors:
J Harry Caufield,
Tim Putman,
Kevin Schaper,
Deepak R Unni,
Harshad Hegde,
Tiffany J Callahan,
Luca Cappelletti,
Sierra AT Moxon,
Vida Ravanmehr,
Seth Carbon,
Lauren E Chan,
Katherina Cortes,
Kent A Shefchek,
Glass Elsarboukh,
James P Balhoff,
Tommaso Fontana,
Nicolas Matentzoglu,
Richard M Bruskiewich,
Anne E Thessen,
Nomi L Harris,
Monica C Munoz-Torres,
Melissa A Haendel,
Peter N Robinson,
Marcin P Joachimiak,
Christopher J Mungall
, et al. (1 additional author not shown)
Abstract:
Knowledge graphs (KGs) are a powerful approach for integrating heterogeneous data and making inferences in biology and many other domains, but a coherent solution for constructing, exchanging, and facilitating the downstream use of knowledge graphs is lacking. Here we present KG-Hub, a platform that enables standardized construction, exchange, and reuse of knowledge graphs. Features include a simple, modular extract-transform-load (ETL) pattern for producing graphs compliant with Biolink Model (a high-level data model for standardizing biological data), easy integration of any OBO (Open Biological and Biomedical Ontologies) ontology, cached downloads of upstream data sources, versioned and automatically updated builds with stable URLs, web-browsable storage of KG artifacts on cloud infrastructure, and easy reuse of transformed subgraphs across projects. Current KG-Hub projects span use cases including COVID-19 research, drug repurposing, microbial-environmental interactions, and rare disease research. KG-Hub is equipped with tooling to easily analyze and manipulate knowledge graphs. KG-Hub is also tightly integrated with graph machine learning (ML) tools which allow automated graph machine learning, including node embeddings and training of models for link prediction and node classification.
Submitted 31 January, 2023;
originally announced February 2023.
-
Perspectives for self-driving labs in synthetic biology
Authors:
Hector Garcia Martin,
Tijana Radivojevic,
Jeremy Zucker,
Kristofer Bouchard,
Jess Sustarich,
Sean Peisert,
Dan Arnold,
Nathan Hillson,
Gyorgy Babnigg,
Jose Manuel Marti,
Christopher J. Mungall,
Gregg T. Beckham,
Lucas Waldburger,
James Carothers,
ShivShankar Sundaram,
Deb Agarwal,
Blake A. Simmons,
Tyler Backman,
Deepanwita Banerjee,
Deepti Tanjore,
Lavanya Ramakrishnan,
Anup Singh
Abstract:
Self-driving labs (SDLs) combine fully automated experiments with artificial intelligence (AI) that decides the next set of experiments. Taken to their ultimate expression, SDLs could usher in a new paradigm of scientific research, where the world is probed, interpreted, and explained by machines for human benefit. While there are functioning SDLs in the fields of chemistry and materials science, we contend that synthetic biology provides a unique opportunity since the genome provides a single target for affecting the incredibly wide repertoire of biological cell behavior. However, the level of investment required for the creation of biological SDLs is only warranted if directed towards solving difficult and enabling biological questions. Here, we discuss challenges and opportunities in creating SDLs for synthetic biology.
Submitted 1 November, 2022; v1 submitted 14 October, 2022;
originally announced October 2022.
-
Creation and unification of development and life stage ontologies for animals
Authors:
Anne Niknejad,
Christopher J. Mungall,
David Osumi-Sutherland,
Marc Robinson-Rechavi,
Frederic B. Bastian
Abstract:
With the new era of genomics, an increasing number of animal species are amenable to large-scale data generation. This has led to the emergence of new multi-species ontologies to annotate and organize these data. While anatomy and cell types are well covered by these efforts, information regarding development and life stages is also critical in the annotation of animal data. Its lack can hamper our ability to answer comparative biology questions and to interpret functional results. We present here a collection of development and life stage ontologies for 21 animal species, and their merger into a common multi-species ontology. This work has allowed the integration and comparison of transcriptomics data in 52 animal species.
Submitted 24 June, 2022;
originally announced June 2022.
-
Guidelines for reporting cell types: the MIRACL standard
Authors:
Tiago Lubiana,
Paola Roncaglia,
Christopher J. Mungall,
Ellen M. Quardokus,
Joshua D. Fortriede,
David Osumi-Sutherland,
Alexander D. Diehl
Abstract:
Cell types are at the root of modern biology, and describing them is a core task of the Human Cell Atlas project. Surprisingly, there are no standards for reporting new cell types, leading to a gap between classes mentioned in biomedical literature and the Cell Ontology, the primary registry of cell types. Here we introduce the Minimal Information Reporting About a CelL (MIRACL) standard, a guideline for describing cell types alongside scientific articles. In a MIRACL sheet, authors organize a label, a diagnostic description, a taxon, an anatomical structure, and a parent cell class for each cell type of interest. The MIRACL standard bridges the gap between wet-lab researchers and ontologists, facilitating the integration of biomedical knowledge into ontologies and artificial intelligence systems.
Submitted 25 May, 2022; v1 submitted 18 April, 2022;
originally announced April 2022.
-
Recommendations for extending the GFF3 specification for improved interoperability of genomic data
Authors:
Surya Saha,
Scott Cain,
Ethalinda K. S. Cannon,
Nathan Dunn,
Andrew Farmer,
Zhi-Liang Hu,
Gareth Maslen,
Sierra Moxon,
Christopher J Mungall,
Rex Nelson,
Monica F. Poelchau
Abstract:
The GFF3 format is a common, flexible tab-delimited format representing the structure and function of genes or other mapped features (https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md). However, with increasing re-use of annotation data, this flexibility has become an obstacle for standardized downstream processing. Common software packages that export annotations in GFF3 format model the same data and metadata in different notations, which puts the burden on end-users to interpret the data model. The AgBioData consortium is a group of genomics, genetics and breeding databases and partners working towards shared practices and standards. Providing concrete guidelines for generating GFF3, and creating a standard representation of the most common biological data types would provide a major increase in efficiency for AgBioData databases and the genomics research community that use the GFF3 format in their daily operations. The AgBioData GFF3 working group has developed recommendations to solve common problems in the GFF3 format. We suggest improvements for each of the GFF3 fields, as well as the special cases of modeling functional annotations and standard protein-coding genes. We welcome further discussion of these recommendations. We ask the genomics and bioinformatics community to use the GitHub repository (https://github.com/NAL-i5K/AgBioData_GFF3_recommendation) to provide feedback via issues or pull requests.
Submitted 15 February, 2022;
originally announced February 2022.
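For readers unfamiliar with the format discussed in the abstract above, a GFF3 feature line has nine tab-delimited columns, the last of which holds semicolon-separated `tag=value` attributes with URL-style escaping. The sketch below parses one such line using only the baseline specification; it does not implement the AgBioData recommendations, and the function name and the example line are invented for illustration.

```python
from urllib.parse import unquote

# Column names as given in the GFF3 specification.
GFF3_COLUMNS = (
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
)


def parse_gff3_line(line: str) -> dict:
    """Parse a single GFF3 feature line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GFF3_COLUMNS):
        raise ValueError(f"expected 9 tab-delimited columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    # Coordinates are 1-based integers.
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Column 9: tag=value pairs; a tag may carry several comma-separated values.
    attributes = {}
    if record["attributes"] != ".":
        for pair in record["attributes"].split(";"):
            if not pair:
                continue
            tag, _, value = pair.partition("=")
            attributes[unquote(tag)] = [unquote(v) for v in value.split(",")]
    record["attributes"] = attributes
    return record


# Invented example line, not taken from any real annotation file.
example = "chr1\texample\tgene\t1000\t2000\t.\t+\t.\tID=gene0001;Name=demo%20gene"
feature = parse_gff3_line(example)
print(feature["type"], feature["start"], feature["end"], feature["attributes"])
```

Keeping every attribute value as a list, even when only one value is present, avoids special-casing multi-valued tags such as `Parent` downstream.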
-
GOTaxon: Representing the evolution of biological functions in the Gene Ontology
Authors:
Haiming Tang,
Christopher J Mungall,
Huaiyu Mi,
Paul D Thomas
Abstract:
The Gene Ontology aims to define the universe of functions known for gene products, at the molecular, cellular and organism levels. While the ontology is designed to cover all aspects of biology in a "species independent manner", the fact remains that many if not most biological functions are restricted in their taxonomic range. This is simply because functions evolve, i.e. like other biological characteristics they are gained and lost over evolutionary time. Here we introduce a general method of representing the evolutionary gain and loss of biological functions within the Gene Ontology. We then apply a variety of techniques, including manual curation, logical reasoning over the ontology structure, and previously published "taxon constraints" to assign evolutionary gain and loss events to the majority of terms in the GO. These gain and loss events now almost triple the number of terms with taxon constraints, and currently cover a total of 76% of GO terms, including 40% of molecular function terms, 78% of cellular component terms, and 89% of biological process terms.
Database URL: GOTaxon is freely available at https://github.com/haimingt/GOTaxonConstraint
Submitted 16 February, 2018;
originally announced February 2018.