-
Comprehensive Study on German Language Models for Clinical and Biomedical Text Understanding
Authors:
Ahmad Idrissi-Yaghir,
Amin Dada,
Henning Schäfer,
Kamyar Arzideh,
Giulia Baldini,
Jan Trienes,
Max Hasin,
Jeanette Bewersdorff,
Cynthia S. Schmidt,
Marie Bauer,
Kaleb E. Smith,
Jiang Bian,
Yonghui Wu,
Jörg Schlötterer,
Torsten Zesch,
Peter A. Horn,
Christin Seifert,
Felix Nensa,
Jens Kleesiek,
Christoph M. Friedrich
Abstract:
Recent advances in natural language processing (NLP) can be largely attributed to the advent of pre-trained language models such as BERT and RoBERTa. While these models demonstrate remarkable performance on general datasets, they can struggle in specialized domains such as medicine, where unique domain-specific terminologies, domain-specific abbreviations, and varying document structures are commo…
▽ More
Recent advances in natural language processing (NLP) can be largely attributed to the advent of pre-trained language models such as BERT and RoBERTa. While these models demonstrate remarkable performance on general datasets, they can struggle in specialized domains such as medicine, where unique domain-specific terminologies, domain-specific abbreviations, and varying document structures are common. This paper explores strategies for adapting these models to domain-specific requirements, primarily through continuous pre-training on domain-specific data. We pre-trained several German medical language models on 2.4B tokens derived from translated public English medical data and 3B tokens of German clinical data. The resulting models were evaluated on various German downstream tasks, including named entity recognition (NER), multi-label classification, and extractive question answering. Our results suggest that models augmented by clinical and translation-based pre-training typically outperform general domain models in medical contexts. We conclude that continuous pre-training has demonstrated the ability to match or even exceed the performance of clinical models trained from scratch. Furthermore, pre-training on clinical data or leveraging translated texts have proven to be reliable methods for domain adaptation in medical NLP tasks.
△ Less
Submitted 8 May, 2024; v1 submitted 8 April, 2024;
originally announced April 2024.
-
CLUE: A Clinical Language Understanding Evaluation for LLMs
Authors:
Amin Dada,
Marie Bauer,
Amanda Butler Contreras,
Osman Alperen Koraş,
Constantin Marc Seibold,
Kaleb E Smith,
Jens Kleesiek
Abstract:
Large Language Models (LLMs) are expected to significantly contribute to patient care, diagnostics, and administrative processes. Emerging biomedical LLMs aim to address healthcare-specific challenges, including privacy demands and computational constraints. Assessing the models' suitability for this sensitive application area is of the utmost importance. However, evaluation has primarily been lim…
▽ More
Large Language Models (LLMs) are expected to significantly contribute to patient care, diagnostics, and administrative processes. Emerging biomedical LLMs aim to address healthcare-specific challenges, including privacy demands and computational constraints. Assessing the models' suitability for this sensitive application area is of the utmost importance. However, evaluation has primarily been limited to non-clinical tasks, which do not reflect the complexity of practical clinical applications. To fill this gap, we present the Clinical Language Understanding Evaluation (CLUE), a benchmark tailored to evaluate LLMs on clinical tasks. CLUE includes six tasks to test the practical applicability of LLMs in complex healthcare settings. Our evaluation includes a total of $25$ LLMs. In contrast to previous evaluations, CLUE shows a decrease in performance for nine out of twelve biomedical models. Our benchmark represents a step towards a standardized approach to evaluating and developing LLMs in healthcare to align future model development with the real-world needs of clinical application. We open-source all evaluation scripts and datasets for future research at https://github.com/TIO-IKIM/CLUE.
△ Less
Submitted 24 June, 2024; v1 submitted 5 April, 2024;
originally announced April 2024.
-
Improving Generalizability of Extracting Social Determinants of Health Using Large Language Models through Prompt-tuning
Authors:
Cheng Peng,
Zehao Yu,
Kaleb E Smith,
Wei-Hsuan Lo-Ciganic,
Jiang Bian,
Yonghui Wu
Abstract:
The progress in natural language processing (NLP) using large language models (LLMs) has greatly improved patient information extraction from clinical narratives. However, most methods based on the fine-tuning strategy have limited transfer learning ability for cross-domain applications. This study proposed a novel approach that employs a soft prompt-based learning architecture, which introduces t…
▽ More
The progress in natural language processing (NLP) using large language models (LLMs) has greatly improved patient information extraction from clinical narratives. However, most methods based on the fine-tuning strategy have limited transfer learning ability for cross-domain applications. This study proposed a novel approach that employs a soft prompt-based learning architecture, which introduces trainable prompts to guide LLMs toward desired outputs. We examined two types of LLM architectures, including encoder-only GatorTron and decoder-only GatorTronGPT, and evaluated their performance for the extraction of social determinants of health (SDoH) using a cross-institution dataset from the 2022 n2c2 challenge and a cross-disease dataset from the University of Florida (UF) Health. The results show that decoder-only LLMs with prompt tuning achieved better performance in cross-domain applications. GatorTronGPT achieved the best F1 scores for both datasets, outperforming traditional fine-tuned GatorTron by 8.9% and 21.8% in a cross-institution setting, and 5.5% and 14.5% in a cross-disease setting.
△ Less
Submitted 18 March, 2024;
originally announced March 2024.
-
Generative Large Language Models Are All-purpose Text Analytics Engines: Text-to-text Learning Is All Your Need
Authors:
Cheng Peng,
Xi Yang,
Aokun Chen,
Zehao Yu,
Kaleb E Smith,
Anthony B Costa,
Mona G Flores,
Jiang Bian,
Yonghui Wu
Abstract:
Objective To solve major clinical natural language processing (NLP) tasks using a unified text-to-text learning architecture based on a generative large language model (LLM) via prompt tuning. Methods We formulated 7 key clinical NLP tasks as text-to-text learning and solved them using one unified generative clinical LLM, GatorTronGPT, developed using GPT-3 architecture and trained with up to 20 b…
▽ More
Objective To solve major clinical natural language processing (NLP) tasks using a unified text-to-text learning architecture based on a generative large language model (LLM) via prompt tuning. Methods We formulated 7 key clinical NLP tasks as text-to-text learning and solved them using one unified generative clinical LLM, GatorTronGPT, developed using GPT-3 architecture and trained with up to 20 billion parameters. We adopted soft prompts (i.e., trainable vectors) with frozen LLM, where the LLM parameters were not updated (i.e., frozen) and only the vectors of soft prompts were updated, known as prompt tuning. We added additional soft prompts as a prefix to the input layer, which were optimized during the prompt tuning. We evaluated the proposed method using 7 clinical NLP tasks and compared them with previous task-specific solutions based on Transformer models. Results and Conclusion The proposed approach achieved state-of-the-art performance for 5 out of 7 major clinical NLP tasks using one unified generative LLM. Our approach outperformed previous task-specific transformer models by ~3% for concept extraction and 7% for relation extraction applied to social determinants of health, 3.4% for clinical concept normalization, 3.4~10% for clinical abbreviation disambiguation, and 5.5~9% for natural language inference. Our approach also outperformed a previously developed prompt-based machine reading comprehension (MRC) model, GatorTron-MRC, for clinical concept and relation extraction. The proposed approach can deliver the ``one model for all`` promise from training to deployment using a unified generative LLM.
△ Less
Submitted 10 December, 2023;
originally announced December 2023.
-
On the Impact of Cross-Domain Data on German Language Models
Authors:
Amin Dada,
Aokun Chen,
Cheng Peng,
Kaleb E Smith,
Ahmad Idrissi-Yaghir,
Constantin Marc Seibold,
Jianning Li,
Lars Heiliger,
Xi Yang,
Christoph M. Friedrich,
Daniel Truhn,
Jan Egger,
Jiang Bian,
Jens Kleesiek,
Yonghui Wu
Abstract:
Traditionally, large language models have been either trained on general web crawls or domain-specific data. However, recent successes of generative large language models, have shed light on the benefits of cross-domain datasets. To examine the significance of prioritizing data diversity over quality, we present a German dataset comprising texts from five domains, along with another dataset aimed…
▽ More
Traditionally, large language models have been either trained on general web crawls or domain-specific data. However, recent successes of generative large language models, have shed light on the benefits of cross-domain datasets. To examine the significance of prioritizing data diversity over quality, we present a German dataset comprising texts from five domains, along with another dataset aimed at containing high-quality data. Through training a series of models ranging between 122M and 750M parameters on both datasets, we conduct a comprehensive benchmark on multiple downstream tasks. Our findings demonstrate that the models trained on the cross-domain dataset outperform those trained on quality data alone, leading to improvements up to $4.45\%$ over the previous state-of-the-art. The models are available at https://huggingface.co/ikim-uk-essen
△ Less
Submitted 13 October, 2023; v1 submitted 11 October, 2023;
originally announced October 2023.
-
Model Tuning or Prompt Tuning? A Study of Large Language Models for Clinical Concept and Relation Extraction
Authors:
Cheng Peng,
Xi Yang,
Kaleb E Smith,
Zehao Yu,
Aokun Chen,
Jiang Bian,
Yonghui Wu
Abstract:
Objective To develop soft prompt-based learning algorithms for large language models (LLMs), examine the shape of prompts, prompt-tuning using frozen/unfrozen LLMs, transfer learning, and few-shot learning abilities. Methods We developed a soft prompt-based LLM model and compared 4 training strategies including (1) fine-tuning without prompts; (2) hard-prompt with unfrozen LLMs; (3) soft-prompt wi…
▽ More
Objective To develop soft prompt-based learning algorithms for large language models (LLMs), examine the shape of prompts, prompt-tuning using frozen/unfrozen LLMs, transfer learning, and few-shot learning abilities. Methods We developed a soft prompt-based LLM model and compared 4 training strategies including (1) fine-tuning without prompts; (2) hard-prompt with unfrozen LLMs; (3) soft-prompt with unfrozen LLMs; and (4) soft-prompt with frozen LLMs. We evaluated 7 pretrained LLMs using the 4 training strategies for clinical concept and relation extraction on two benchmark datasets. We evaluated the transfer learning ability of the prompt-based learning algorithms in a cross-institution setting. We also assessed the few-shot learning ability. Results and Conclusion When LLMs are unfrozen, GatorTron-3.9B with soft prompting achieves the best strict F1-scores of 0.9118 and 0.8604 for concept extraction, outperforming the traditional fine-tuning and hard prompt-based models by 0.6~3.1% and 1.2~2.9%, respectively; GatorTron-345M with soft prompting achieves the best F1-scores of 0.8332 and 0.7488 for end-to-end relation extraction, outperforming the other two models by 0.2~2% and 0.6~11.7%, respectively. When LLMs are frozen, small (i.e., 345 million parameters) LLMs have a big gap to be competitive with unfrozen models; scaling LLMs up to billions of parameters makes frozen LLMs competitive with unfrozen LLMs. For cross-institute evaluation, soft prompting with a frozen GatorTron-8.9B model achieved the best performance. This study demonstrates that (1) machines can learn soft prompts better than humans, (2) frozen LLMs have better few-shot learning ability and transfer learning ability to facilitate muti-institution applications, and (3) frozen LLMs require large models.
△ Less
Submitted 9 October, 2023;
originally announced October 2023.
-
A Study of Generative Large Language Model for Medical Research and Healthcare
Authors:
Cheng Peng,
Xi Yang,
Aokun Chen,
Kaleb E Smith,
Nima PourNejatian,
Anthony B Costa,
Cheryl Martin,
Mona G Flores,
Ying Zhang,
Tanja Magoc,
Gloria Lipori,
Duane A Mitchell,
Naykky S Ospina,
Mustafa M Ahmed,
William R Hogan,
Elizabeth A Shenkman,
Yi Guo,
Jiang Bian,
Yonghui Wu
Abstract:
There is enormous enthusiasm and concerns in using large language models (LLMs) in healthcare, yet current assumptions are all based on general-purpose LLMs such as ChatGPT. This study develops a clinical generative LLM, GatorTronGPT, using 277 billion words of mixed clinical and English text with a GPT-3 architecture of 20 billion parameters. GatorTronGPT improves biomedical natural language proc…
▽ More
There is enormous enthusiasm and concerns in using large language models (LLMs) in healthcare, yet current assumptions are all based on general-purpose LLMs such as ChatGPT. This study develops a clinical generative LLM, GatorTronGPT, using 277 billion words of mixed clinical and English text with a GPT-3 architecture of 20 billion parameters. GatorTronGPT improves biomedical natural language processing for medical research. Synthetic NLP models trained using GatorTronGPT generated text outperform NLP models trained using real-world clinical text. Physicians Turing test using 1 (worst) to 9 (best) scale shows that there is no significant difference in linguistic readability (p = 0.22; 6.57 of GatorTronGPT compared with 6.93 of human) and clinical relevance (p = 0.91; 7.0 of GatorTronGPT compared with 6.97 of human) and that physicians cannot differentiate them (p < 0.001). This study provides insights on the opportunities and challenges of LLMs for medical research and healthcare.
△ Less
Submitted 22 May, 2023;
originally announced May 2023.
-
Machine Learning Methods Applied to Cortico-Cortical Evoked Potentials Aid in Localizing Seizure Onset Zones
Authors:
Ian G. Malone,
Kaleb E. Smith,
Morgan E. Urdaneta,
Tyler S. Davis,
Daria Nesterovich Anderson,
Brian J. Phillip,
John D. Rolston,
Christopher R. Butson
Abstract:
Epilepsy affects millions of people, reducing quality of life and increasing risk of premature death. One-third of epilepsy cases are drug-resistant and require surgery for treatment, which necessitates localizing the seizure onset zone (SOZ) in the brain. Attempts have been made to use cortico-cortical evoked potentials (CCEPs) to improve SOZ localization but none have been successful enough for…
▽ More
Epilepsy affects millions of people, reducing quality of life and increasing risk of premature death. One-third of epilepsy cases are drug-resistant and require surgery for treatment, which necessitates localizing the seizure onset zone (SOZ) in the brain. Attempts have been made to use cortico-cortical evoked potentials (CCEPs) to improve SOZ localization but none have been successful enough for clinical adoption. Here, we compare the performance of ten machine learning classifiers in localizing SOZ from CCEP data. This preliminary study validates a novel application of machine learning, and the results establish our approach as a promising line of research that warrants further investigation. This work also serves to facilitate discussion and collaboration with fellow machine learning and/or epilepsy researchers.
△ Less
Submitted 14 November, 2022;
originally announced November 2022.
-
GatorTron: A Large Clinical Language Model to Unlock Patient Information from Unstructured Electronic Health Records
Authors:
Xi Yang,
Aokun Chen,
Nima PourNejatian,
Hoo Chang Shin,
Kaleb E Smith,
Christopher Parisien,
Colin Compas,
Cheryl Martin,
Mona G Flores,
Ying Zhang,
Tanja Magoc,
Christopher A Harle,
Gloria Lipori,
Duane A Mitchell,
William R Hogan,
Elizabeth A Shenkman,
Jiang Bian,
Yonghui Wu
Abstract:
There is an increasing interest in developing artificial intelligence (AI) systems to process and interpret electronic health records (EHRs). Natural language processing (NLP) powered by pretrained language models is the key technology for medical AI systems utilizing clinical narratives. However, there are few clinical language models, the largest of which trained in the clinical domain is compar…
▽ More
There is an increasing interest in developing artificial intelligence (AI) systems to process and interpret electronic health records (EHRs). Natural language processing (NLP) powered by pretrained language models is the key technology for medical AI systems utilizing clinical narratives. However, there are few clinical language models, the largest of which trained in the clinical domain is comparatively small at 110 million parameters (compared with billions of parameters in the general domain). It is not clear how large clinical language models with billions of parameters can help medical AI systems utilize unstructured EHRs. In this study, we develop from scratch a large clinical language model - GatorTron - using >90 billion words of text (including >82 billion words of de-identified clinical text) and systematically evaluate it on 5 clinical NLP tasks including clinical concept extraction, medical relation extraction, semantic textual similarity, natural language inference (NLI), and medical question answering (MQA). We examine how (1) scaling up the number of parameters and (2) scaling up the size of the training data could benefit these NLP tasks. GatorTron models scale up the clinical language model from 110 million to 8.9 billion parameters and improve 5 clinical NLP tasks (e.g., 9.6% and 9.5% improvement in accuracy for NLI and MQA), which can be applied to medical AI systems to improve healthcare delivery. The GatorTron models are publicly available at: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/clara/models/gatortron_og.
△ Less
Submitted 16 December, 2022; v1 submitted 2 February, 2022;
originally announced March 2022.
-
Building an AI-ready RSE Workforce
Authors:
Ying Zhang,
Matthew A. Gitzendanner,
Dan S. Maxwell,
Justin W. Richardson,
Kaleb E. Smith,
Eric A. Stubbs,
Brian J. Stucky,
Jingchao Zhang,
Erik Deumens
Abstract:
Artificial Intelligence has been transforming industries and academic research across the globe, and research software development is no exception. Machine learning and deep learning are being applied in every aspect of the research software development lifecycles, from new algorithm design paradigms to software development processes. In this paper, we discuss our views on today's challenges and o…
▽ More
Artificial Intelligence has been transforming industries and academic research across the globe, and research software development is no exception. Machine learning and deep learning are being applied in every aspect of the research software development lifecycles, from new algorithm design paradigms to software development processes. In this paper, we discuss our views on today's challenges and opportunities that AI has presented on research software development and engineers, and the approaches we, at the University of Florida, are taking to prepare our workforce for the new era of AI.
△ Less
Submitted 8 November, 2021;
originally announced November 2021.
-
A Spectral Enabled GAN for Time Series Data Generation
Authors:
Kaleb E. Smith,
Anthony O. Smith
Abstract:
Time dependent data is a main source of information in today's data driven world. Generating this type of data though has shown its challenges and made it an interesting research area in the field of generative machine learning. One such approach was that by Smith et al. who developed Time Series Generative Adversarial Network (TSGAN) which showed promising performance in generating time dependent…
▽ More
Time dependent data is a main source of information in today's data driven world. Generating this type of data though has shown its challenges and made it an interesting research area in the field of generative machine learning. One such approach was that by Smith et al. who developed Time Series Generative Adversarial Network (TSGAN) which showed promising performance in generating time dependent data and the ability of few shot generation though being flawed in certain aspects of training and learning. This paper looks to improve on the results from TSGAN and address those flaws by unifying the training of the independent networks in TSGAN and creating a dependency both in training and learning. This improvement, called unified TSGAN (uTSGAN) was tested and comapred both quantitatively and qualitatively to its predecessor on 70 benchmark time series data sets used in the community. uTSGAN showed to outperform TSGAN in 80\% of the data sets by the same number of training epochs and 60\% of the data sets in 3/4th the amount of training time or less while maintaining the few shot generation ability with better FID scores across those data sets.
△ Less
Submitted 2 March, 2021;
originally announced March 2021.
-
Conditional GAN for timeseries generation
Authors:
Kaleb E Smith,
Anthony O Smith
Abstract:
It is abundantly clear that time dependent data is a vital source of information in the world. The challenge has been for applications in machine learning to gain access to a considerable amount of quality data needed for algorithm development and analysis. Modeling synthetic data using a Generative Adversarial Network (GAN) has been at the heart of providing a viable solution. Our work focuses on…
▽ More
It is abundantly clear that time dependent data is a vital source of information in the world. The challenge has been for applications in machine learning to gain access to a considerable amount of quality data needed for algorithm development and analysis. Modeling synthetic data using a Generative Adversarial Network (GAN) has been at the heart of providing a viable solution. Our work focuses on one dimensional times series and explores the few shot approach, which is the ability of an algorithm to perform well with limited data. This work attempts to ease the frustration by proposing a new architecture, Time Series GAN (TSGAN), to model realistic time series data. We evaluate TSGAN on 70 data sets from a benchmark time series database. Our results demonstrate that TSGAN performs better than the competition both quantitatively using the Frechet Inception Score (FID) metric, and qualitatively when classification is used as the evaluation criteria.
△ Less
Submitted 29 June, 2020;
originally announced June 2020.