Augmenting Large Language Models via Vector Embeddings to Improve Domain-specific Responsiveness

Nathan M Wolfrath; Nathaniel B Verhagen; Bradley H Crotty; Melek Somai; Anai N Kothari

doi:10.3791/66796

J Vis Exp. 2024 Dec 6:(214). doi: 10.3791/66796.

Authors

Nathan M Wolfrath¹, Nathaniel B Verhagen², Bradley H Crotty³, Melek Somai³, Anai N Kothari⁴

Affiliations

¹ Department of Surgery, Division of Surgical Oncology, Medical College of Wisconsin; Inception Health Labs, Medical College of Wisconsin.
² Department of Surgery, Division of Surgical Oncology, Medical College of Wisconsin.
³ Inception Health Labs, Medical College of Wisconsin.
⁴ Department of Surgery, Division of Surgical Oncology, Medical College of Wisconsin.

PMID: 39714043
DOI: 10.3791/66796

Abstract

Large language models (LLMs) have emerged as a popular resource for generating information relevant to a user query. Such models are created through a resource-intensive training process utilizing an extensive, static corpus of textual data. This static nature results in limitations for adoption in domains with rapidly changing knowledge, proprietary information, and sensitive data. In this work, methods are outlined for augmenting general-purpose LLMs, known as foundation models, with domain-specific information using an embeddings-based approach for incorporating up-to-date, peer-reviewed scientific manuscripts. This is achieved through open-source tools such as Llama-Index and publicly available models such as Llama-2 to maximize transparency, user privacy and control, and replicability. While scientific manuscripts are used as an example use case, this approach can be extended to any text data source. Additionally, methods for evaluating model performance following this enhancement are discussed. These methods enable the rapid development of LLM systems for highly specialized domains regardless of the comprehensiveness of information in the training corpus.
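
The embeddings-based augmentation described above can be illustrated with a minimal sketch. This is not the authors' protocol and does not use the Llama-Index API: the `embed` function is a toy bag-of-words stand-in for a real embedding model, and the names `embed`, `cosine`, `retrieve`, and `build_prompt` are hypothetical. It shows only the general idea: embed text chunks, rank them by similarity to the query, and prepend the best matches to the prompt sent to a foundation model.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector.

    Assumption: a real system would call a learned embedding model here.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, chunks: list[str], top_k: int = 2) -> str:
    """Assemble a prompt that places retrieved context before the question."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, top_k))
    return (
        "Use only the context below to answer.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
```

In this sketch the resulting prompt string would be passed to a locally hosted model; that step is omitted because it depends on the model and serving setup chosen.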

Publication types

Video-Audio Media

MeSH terms

Humans
Language
Natural Language Processing*