-
Experimenting with Large Language Models and vector embeddings in NASA SciX
Authors: Sergi Blanco-Cuaresma, Ioana Ciucă, Alberto Accomazzi, Michael J. Kurtz, Edwin A. Henneken, Kelly E. Lockhart, Felix Grezes, Thomas Allen, Golnaz Shapurian, Carolyn S. Grant, Donna M. Thompson, Timothy W. Hostetler, Matthew R. Templeton, Shinyi Chen, Jennifer Koch, Taylor Jacovich, Daniel Chivvis, Fernanda de Macedo Alves, Jean-Claude Paquin, Jennifer Bartlett, Mugdha Polimera, Stephanie Jarmak
Abstract:
Open-source Large Language Models enable projects such as NASA SciX (i.e., NASA ADS) to think out of the box and try alternative approaches for information retrieval and data augmentation, while respecting data copyright and users' privacy. However, when large language models are directly prompted with questions without any context, they are prone to hallucination. At NASA SciX we have developed an experiment where we created semantic vectors for our large collection of abstracts and full-text content, and we designed a prompt system to ask questions using contextual chunks from our system. Based on a non-systematic human evaluation, the experiment shows a lower degree of hallucination and better responses when using Retrieval Augmented Generation. Further exploration is required to design new features and data augmentation processes at NASA SciX that leverage this technology while respecting the high level of trust and quality that the project holds.
Submitted 21 December, 2023;
originally announced December 2023.
-
Improving astroBERT using Semantic Textual Similarity
Authors: Felix Grezes, Thomas Allen, Sergi Blanco-Cuaresma, Alberto Accomazzi, Michael J. Kurtz, Golnaz Shapurian, Edwin Henneken, Carolyn S. Grant, Donna M. Thompson, Timothy W. Hostetler, Matthew R. Templeton, Kelly E. Lockhart, Shinyi Chen, Jennifer Koch, Taylor Jacovich, Pavlos Protopapas
Abstract:
The NASA Astrophysics Data System (ADS) is an essential tool for researchers that allows them to explore the astronomy and astrophysics scientific literature, but it has yet to exploit recent advances in natural language processing. At ADASS 2021, we introduced astroBERT, a machine learning language model tailored to the text used in astronomy papers in ADS. In this work we:
- announce the first public release of the astroBERT language model;
- show how astroBERT improves over existing public language models on astrophysics specific tasks;
- and detail how ADS plans to harness the unique structure of scientific papers, the citation graph and citation context, to further improve astroBERT.
Submitted 29 November, 2022;
originally announced December 2022.
-
Web accessibility trends and implementation in dynamic web applications
Authors: Timothy W. Hostetler, Shinyi Chen, Sergi Blanco-Cuaresma, Alberto Accomazzi, Michael J. Kurtz, Carolyn S. Grant, Edwin Henneken, Donna M. Thompson, Roman Chyla, Golnaz Shapurian, Matthew R. Templeton, Kelly E. Lockhart, Nemanja Martinovic, Stephen McDonald, Felix Grezes
Abstract:
The NASA Astrophysics Data System (ADS), a critical research service for the astrophysics community, strives to provide the most accessible and inclusive environment for the discovery and exploration of the astronomical literature. Part of this goal involves creating a digital platform that can accommodate everybody, including those with disabilities that would benefit from alternative ways to present the information provided by the website. NASA ADS follows the official Web Content Accessibility Guidelines (WCAG) standard for ensuring accessibility of all its applications, striving to exceed this standard where possible. Through the use of both internal audits and external expert review based on these guidelines, we have identified many areas for improving accessibility in our current web application, and have implemented a number of updates to the UI as a result of this. We present an overview of some current web accessibility trends, discuss our experience incorporating these trends in our web application, and discuss the lessons learned and recommendations for future projects.
Submitted 1 February, 2022;
originally announced February 2022.
-
Building astroBERT, a language model for Astronomy & Astrophysics
Authors: Felix Grezes, Sergi Blanco-Cuaresma, Alberto Accomazzi, Michael J. Kurtz, Golnaz Shapurian, Edwin Henneken, Carolyn S. Grant, Donna M. Thompson, Roman Chyla, Stephen McDonald, Timothy W. Hostetler, Matthew R. Templeton, Kelly E. Lockhart, Nemanja Martinovic, Shinyi Chen, Chris Tanner, Pavlos Protopapas
Abstract:
The existing search tools for exploring the NASA Astrophysics Data System (ADS) can be quite rich and empowering (e.g., similar and trending operators), but researchers are not yet allowed to fully leverage semantic search. For example, a query for "results from the Planck mission" should be able to distinguish between all the various meanings of Planck (person, mission, constant, institutions and more) without further clarification from the user. At ADS, we are applying modern machine learning and natural language processing techniques to our dataset of recent astronomy publications to train astroBERT, a deeply contextual language model based on research at Google. Using astroBERT, we aim to enrich the ADS dataset and improve its discoverability, and in particular we are developing our own named entity recognition tool. We present here our preliminary results and lessons learned.
Submitted 1 December, 2021;
originally announced December 2021.