-
RadioRAG: Factual Large Language Models for Enhanced Diagnostics in Radiology Using Dynamic Retrieval Augmented Generation
Authors:
Soroosh Tayebi Arasteh,
Mahshad Lotfinia,
Keno Bressem,
Robert Siepmann,
Dyke Ferber,
Christiane Kuhl,
Jakob Nikolas Kather,
Sven Nebelung,
Daniel Truhn
Abstract:
Large language models (LLMs) have advanced the field of artificial intelligence (AI) in medicine. However, LLMs often generate outdated or inaccurate information based on static training datasets. Retrieval augmented generation (RAG) mitigates this by integrating outside data sources. While previous RAG systems used pre-assembled, fixed databases with limited flexibility, we have developed Radiology RAG (RadioRAG) as an end-to-end framework that retrieves data from authoritative radiologic online sources in real-time. RadioRAG is evaluated using a dedicated radiologic question-and-answer dataset (RadioQA). We evaluate the diagnostic accuracy of various LLMs when answering radiology-specific questions with and without access to additional online information via RAG. Using 80 questions from the RSNA Case Collection across radiologic subspecialties and 24 additional expert-curated questions, for which the correct gold-standard answers were available, LLMs (GPT-3.5-turbo, GPT-4, Mistral-7B, Mixtral-8x7B, and Llama3 [8B and 70B]) were prompted with and without RadioRAG. RadioRAG retrieved context-specific information from www.radiopaedia.org in real-time and incorporated it into its reply. RadioRAG consistently improved diagnostic accuracy across all LLMs, with relative improvements ranging from 2% to 54%. It matched or exceeded question answering without RAG across radiologic subspecialties, particularly in breast imaging and emergency radiology. However, the degree of improvement varied among models; GPT-3.5-turbo and Mixtral-8x7B-instruct-v0.1 saw notable gains, while Mistral-7B-instruct-v0.2 showed no improvement, highlighting variability in RadioRAG's effectiveness. LLMs benefit when provided access to domain-specific data beyond their training data. For radiology, RadioRAG establishes a robust framework that substantially improves diagnostic accuracy and factuality in radiological question answering.
Submitted 22 July, 2024;
originally announced July 2024.
-
Large Language Models Streamline Automated Machine Learning for Clinical Studies
Authors:
Soroosh Tayebi Arasteh,
Tianyu Han,
Mahshad Lotfinia,
Christiane Kuhl,
Jakob Nikolas Kather,
Daniel Truhn,
Sven Nebelung
Abstract:
A knowledge gap persists between machine learning (ML) developers (e.g., data scientists) and practitioners (e.g., clinicians), hampering the full utilization of ML for clinical data analysis. We investigated the potential of the ChatGPT Advanced Data Analysis (ADA), an extension of GPT-4, to bridge this gap and perform ML analyses efficiently. Real-world clinical datasets and study details from large trials across various medical specialties were presented to ChatGPT ADA without specific guidance. ChatGPT ADA autonomously developed state-of-the-art ML models based on the original study's training data to predict clinical outcomes such as cancer development, cancer progression, disease complications, or biomarkers such as pathogenic gene sequences. Following the re-implementation and optimization of the published models, the head-to-head comparison of the ChatGPT ADA-crafted ML models and their respective manually crafted counterparts revealed no significant differences in traditional performance metrics (P>0.071). Strikingly, the ChatGPT ADA-crafted ML models often outperformed their counterparts. In conclusion, ChatGPT ADA offers a promising avenue to democratize ML in medicine by simplifying complex data analyses, yet should enhance, not replace, specialized training and resources, to promote broader applications in medical research and practice.
Submitted 21 February, 2024; v1 submitted 27 August, 2023;
originally announced August 2023.
-
Preserving privacy in domain transfer of medical AI models comes at no performance costs: The integral role of differential privacy
Authors:
Soroosh Tayebi Arasteh,
Mahshad Lotfinia,
Teresa Nolte,
Marwin Saehn,
Peter Isfort,
Christiane Kuhl,
Sven Nebelung,
Georgios Kaissis,
Daniel Truhn
Abstract:
Developing robust and effective artificial intelligence (AI) models in medicine requires access to large amounts of patient data. The use of AI models solely trained on large multi-institutional datasets can help with this, yet the imperative to ensure data privacy remains, particularly as membership inference risks breaching patient confidentiality. As a proposed remedy, we advocate for the integration of differential privacy (DP). We specifically investigate the performance of models trained with DP as compared to models trained without DP on data from institutions that the model had not seen during its training (i.e., external validation), the situation that reflects the clinical use of AI models. By leveraging more than 590,000 chest radiographs from five institutions, we evaluated the efficacy of DP-enhanced domain transfer (DP-DT) in diagnosing cardiomegaly, pleural effusion, pneumonia, atelectasis, and in identifying healthy subjects. We juxtaposed DP-DT with non-DP-DT and examined diagnostic accuracy and demographic fairness using the area under the receiver operating characteristic curve (AUC) as the main metric, as well as accuracy, sensitivity, and specificity. Our results show that DP-DT, even with exceptionally high privacy levels (epsilon around 1), performs comparably to non-DP-DT (P>0.119 across all domains). Furthermore, DP-DT led to marginal AUC differences (less than 1%) for nearly all subgroups, relative to non-DP-DT. Despite consistent evidence suggesting that DP models induce significant performance degradation for on-domain applications, we show that off-domain performance is almost unaffected. Therefore, we ardently advocate for the adoption of DP in training diagnostic medical AI models, given its minimal impact on performance.
Submitted 7 December, 2023; v1 submitted 10 June, 2023;
originally announced June 2023.
-
How Will Your Tweet Be Received? Predicting the Sentiment Polarity of Tweet Replies
Authors:
Soroosh Tayebi Arasteh,
Mehrpad Monajem,
Vincent Christlein,
Philipp Heinrich,
Anguelos Nicolaou,
Hamidreza Naderi Boldaji,
Mahshad Lotfinia,
Stefan Evert
Abstract:
Twitter sentiment analysis, which often focuses on predicting the polarity of tweets, has attracted increasing attention in recent years, in particular with the rise of deep learning (DL). In this paper, we propose a new task: predicting the predominant sentiment among (first-order) replies to a given tweet. To this end, we created RETWEET, a large dataset of tweets and replies manually annotated with sentiment labels. As a strong baseline, we propose a two-stage DL-based method: first, we create automatically labeled training data by applying a standard sentiment classifier to tweet replies and aggregating its predictions for each original tweet; our rationale is that individual errors made by the classifier are likely to cancel out in the aggregation step. Second, we use the automatically labeled data for supervised training of a neural network to predict reply sentiment from the original tweets. The resulting classifier is evaluated on the new RETWEET dataset, showing promising results, especially considering that it has been trained without any manually labeled data. Both the dataset and the baseline implementation are publicly available.
Submitted 21 April, 2021;
originally announced April 2021.
-
Machine Learning-Based Generalized Model for Finite Element Analysis of Roll Deflection During the Austenitic Stainless Steel 316L Strip Rolling
Authors:
Mahshad Lotfinia,
Soroosh Tayebi Arasteh
Abstract:
During the strip rolling process, a considerable share of the forces from the material pressure causes elastic deformation of the work-roll, i.e., deflection. Uncontrolled work-roll deflection leads to high deviations from the permissible thickness of the plate along its width. In the context of Austenitic Stainless Steels (ASS), due to the instability of the Austenite phase at low temperatures, cold deformation leads to the production of Strain-Induced Martensite (SIM), which improves the mechanical properties. This hardens the ASS 316L during cold deformation and causes the stress-strain curve of the ASS 316L to behave non-linearly, which distinguishes it from other categories of steels. To account for this phenomenon, we propose to utilize a Machine Learning (ML) method to predict the flow stress of the ASS 316L during cold rolling more accurately. Furthermore, we conduct various mechanical tensile tests in order to obtain the required dataset, Stress316L, for training the neural network. Moreover, instead of using a constant value of flow stress during the multi-pass rolling process, we use a Finite Difference (FD) formulation of the equilibrium equation in order to account for the dynamic behavior of the flow stress, which leads to an estimate of the mean pressure that the strip exerts on the rolls during deformation. Finally, using Finite Element Analysis (FEA), the deflection of the work-roll tools is calculated. As a result, we end up with a generalized model for the calculation of the roll deflection, specific to the ASS 316L. To the best of our knowledge, this is the first model for ASS 316L which considers dynamic flow stress and SIM of the rolled plate, using FEM and an ML approach, which could contribute to the better design of the tools.
Submitted 24 April, 2022; v1 submitted 4 February, 2021;
originally announced February 2021.