Search | arXiv e-print repository

CXR-LLAVA: a multimodal large language model for interpreting chest X-ray images

Authors: Seowoo Lee, Jiwon Youn, Hyungjin Kim, Mansu Kim, Soon Ho Yoon

Abstract: Purpose: This study aimed to develop an open-source multimodal large language model (CXR-LLAVA) for interpreting chest X-ray images (CXRs), leveraging recent advances in large language models (LLMs) to potentially replicate the image interpretation skills of human radiologists. Materials and Methods: For training, we collected 592,580 publicly available CXRs, of which 374,881 had labels for certain radiographic abnormalities (Dataset 1) and 217,699 provided free-text radiology reports (Dataset 2). After pre-training a vision transformer with Dataset 1, we integrated it with an LLM influenced by the LLAVA network. Then, the model was fine-tuned, primarily using Dataset 2. The model's diagnostic performance for major pathological findings was evaluated, along with the acceptability of radiologic reports by human radiologists, to gauge its potential for autonomous reporting. Results: The model demonstrated impressive performance in test sets, achieving an average F1 score of 0.81 for six major pathological findings in the MIMIC internal test set and 0.62 for seven major pathological findings in the external test set. The model's F1 scores surpassed those of GPT-4-vision and Gemini-Pro-Vision in both test sets. In human radiologist evaluations of the external test set, the model achieved a 72.7% success rate in autonomous reporting, slightly below the 84.0% rate of ground truth reports. Conclusion: This study highlights the significant potential of multimodal LLMs for CXR interpretation, while also acknowledging the performance limitations. Despite these challenges, we believe that making our model open-source will catalyze further research, expanding its effectiveness and applicability in various clinical contexts. CXR-LLAVA is available at https://github.com/ECOFRI/CXR_LLAVA.

Submitted 14 January, 2024; v1 submitted 22 October, 2023; originally announced October 2023.
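
The CXR-LLAVA abstract reports an average F1 score across several pathological findings. As a minimal sketch of how such an average can be computed, the Python below assumes an unweighted (macro) mean over per-finding F1 scores; the abstract does not say which averaging the authors used. The finding names and counts are hypothetical and are not taken from the paper.

```python
def f1_score(tp, fp, fn):
    """F1 for one finding from true-positive, false-positive, and false-negative counts."""
    denom = 2 * tp + fp + fn
    # Convention chosen here: F1 is 0.0 when there are no positives at all.
    return 2 * tp / denom if denom else 0.0


def macro_f1(counts):
    """Unweighted mean of per-finding F1 scores (assumed averaging scheme)."""
    scores = [f1_score(tp, fp, fn) for tp, fp, fn in counts.values()]
    return sum(scores) / len(scores)


# Hypothetical (tp, fp, fn) counts per finding, for illustration only.
counts = {
    "finding_a": (80, 20, 20),
    "finding_b": (45, 5, 15),
    "finding_c": (30, 30, 10),
}

for name, (tp, fp, fn) in counts.items():
    print(f"{name}: F1 = {f1_score(tp, fp, fn):.3f}")
print(f"macro-averaged F1 = {macro_f1(counts):.3f}")
```

A macro average weights every finding equally regardless of prevalence, so a rare finding with a low F1 pulls the average down as much as a common one would.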

arXiv:2212.12170 [pdf, other]

Sense of Embodiment Inducement for People with Reduced Lower-body Mobility and Sensations with Partial-Visuomotor Stimulation

Authors: Hyuckjin Jang, Taehei Kim, Seo Young Oh, Jeongmi Lee, Sunghee Lee, Sang Ho Yoon

Abstract: To induce the Sense of Embodiment (SoE) on the virtual 3D avatar during a Virtual Reality (VR) walking scenario, VR interfaces have employed visuotactile or visuomotor approaches. However, people with reduced lower-body mobility and sensation (PRLMS) who are incapable of feeling or moving their legs would find this task extremely challenging. Here, we propose an upper-body motion tracking-based partial-visuomotor technique to induce SoE and positive feedback for PRLMS patients. We design partial-visuomotor stimulation consisting of two distinctive inputs (Button Control & Upper Motion tracking) and outputs (wheelchair motion & Gait Motion). A preliminary user study was conducted to explore subjective preference with qualitative feedback. From the qualitative study results, we observed a positive response to the partial-visuomotor stimulation regarding SoE in the asynchronous VR experience for PRLMS.

Submitted 23 December, 2022; originally announced December 2022.

Journal ref: ACM SIGGRAPH 2022 Emerging Technologies

arXiv:2211.11381 [pdf, other]

LISA: Localized Image Stylization with Audio via Implicit Neural Representation

Authors: Seung Hyun Lee, Chanyoung Kim, Wonmin Byeon, Sang Ho Yoon, Jinkyu Kim, Sangpil Kim

Abstract: We present a novel framework, Localized Image Stylization with Audio (LISA), which performs audio-driven localized image stylization. Sound often provides information about the specific context of the scene and is closely related to a certain part of the scene or object. However, existing image stylization works have focused on stylizing the entire image using an image or text input. Stylizing a particular part of the image based on audio input is natural but challenging. In this work, we propose a framework in which a user provides an audio input to localize the sound source in the input image and another to locally stylize the target object or scene. LISA first produces a delicate localization map with an audio-visual localization network by leveraging CLIP embedding space. We then utilize implicit neural representation (INR) along with the predicted localization map to stylize the target object or scene based on sound information. The proposed INR can manipulate the localized pixel values to be semantically consistent with the provided audio input. Through a series of experiments, we show that the proposed framework outperforms the other audio-guided stylization methods. Moreover, LISA constructs concise localization maps and naturally manipulates the target object or scene in accordance with the given audio input.

Submitted 21 November, 2022; originally announced November 2022.

arXiv:2208.14114 [pdf, other]

Robust Sound-Guided Image Manipulation

Authors: Seung Hyun Lee, Gyeongrok Oh, Wonmin Byeon, Sang Ho Yoon, Jinkyu Kim, Sangpil Kim

Abstract: Recent successes suggest that an image can be manipulated by a text prompt, e.g., a landscape scene on a sunny day is manipulated into the same scene on a rainy day driven by a text input "raining". These approaches often utilize a StyleCLIP-based image generator, which leverages multi-modal (text and image) embedding space. However, we observe that such text inputs are often bottlenecked in providing and synthesizing rich semantic cues, e.g., differentiating heavy rain from rain with thunderstorms. To address this issue, we advocate leveraging an additional modality, sound, which has notable advantages in image manipulation as it can convey more diverse semantic cues (vivid emotions or dynamic expressions of the natural world) than texts. In this paper, we propose a novel approach that first extends the image-text joint embedding space with sound and applies a direct latent optimization method to manipulate a given image based on audio input, e.g., the sound of rain. Our extensive experiments show that our sound-guided image manipulation approach produces semantically and visually more plausible manipulation results than the state-of-the-art text and sound-guided image manipulation methods, which are further confirmed by our human evaluations. Our downstream task evaluations also show that our learned image-text-sound joint embedding space effectively encodes sound inputs.

Submitted 24 April, 2023; v1 submitted 30 August, 2022; originally announced August 2022.

Comments: arXiv admin note: text overlap with arXiv:2112.00007

arXiv:2204.09273 [pdf, other]

Sound-Guided Semantic Video Generation

Authors: Seung Hyun Lee, Gyeongrok Oh, Wonmin Byeon, Chanyoung Kim, Won Jeong Ryoo, Sang Ho Yoon, Hyunjun Cho, Jihyun Bae, Jinkyu Kim, Sangpil Kim

Abstract: The recent success in StyleGAN demonstrates that pre-trained StyleGAN latent space is useful for realistic video generation. However, the generated motion in the video is usually not semantically meaningful due to the difficulty of determining the direction and magnitude in the StyleGAN latent space. In this paper, we propose a framework to generate realistic videos by leveraging multimodal (sound-image-text) embedding space. As sound provides the temporal contexts of the scene, our framework learns to generate a video that is semantically consistent with sound. First, our sound inversion module maps the audio directly into the StyleGAN latent space. We then incorporate the CLIP-based multimodal embedding space to further provide the audio-visual relationships. Finally, the proposed frame generator learns to find the trajectory in the latent space which is coherent with the corresponding sound and generates a video in a hierarchical manner. We provide the new high-resolution landscape video dataset (audio-visual pair) for the sound-guided video generation task. The experiments show that our model outperforms the state-of-the-art methods in terms of video quality. We further show several applications including image and video editing to verify the effectiveness of our method.

Submitted 21 October, 2022; v1 submitted 20 April, 2022; originally announced April 2022.

arXiv:2112.00007 [pdf, other]

Sound-Guided Semantic Image Manipulation

Authors: Seung Hyun Lee, Wonseok Roh, Wonmin Byeon, Sang Ho Yoon, Chan Young Kim, Jinkyu Kim, Sangpil Kim

Abstract: The recent success of the generative model shows that leveraging the multi-modal embedding space can manipulate an image using text information. However, manipulating an image with sources other than text, such as sound, is not easy due to the dynamic characteristics of the sources. In particular, sound can convey vivid emotions and dynamic expressions of the real world. Here, we propose a framework that directly encodes sound into the multi-modal (image-text) embedding space and manipulates an image from the space. Our audio encoder is trained to produce a latent representation from an audio input, which is forced to be aligned with image and text representations in the multi-modal embedding space. We use a direct latent optimization method based on aligned embeddings for sound-guided image manipulation. We verify the effectiveness of our sound-guided image manipulation quantitatively and qualitatively. We also show that our method can mix different modalities, i.e., text and audio, which enriches the variety of the image modification. The experiments on zero-shot audio classification and semantic-level image classification show that our proposed model outperforms other text and sound-guided state-of-the-art methods.

Submitted 30 November, 2021; originally announced December 2021.
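
The abstract above describes an audio encoder whose output is aligned with image and text representations in a shared embedding space. One common way to express such alignment is a contrastive (InfoNCE-style) objective over cosine similarities. The sketch below illustrates that general idea only; it is an assumption that a loss of this shape applies here, and it is not the authors' implementation. The function names, the temperature value, and the toy embeddings are all hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def contrastive_loss(audio_embs, text_embs, temperature=0.07):
    """One-directional (audio-to-text) InfoNCE loss.

    audio_embs[i] and text_embs[i] are assumed to describe the same sample;
    every other text embedding in the batch serves as a negative.
    """
    audio = [normalize(v) for v in audio_embs]
    text = [normalize(v) for v in text_embs]
    total = 0.0
    for i, a in enumerate(audio):
        logits = [sum(x * y for x, y in zip(a, t)) / temperature for t in text]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(audio)


# Toy 2-D embeddings (hypothetical): matched pairs vs. deliberately swapped pairs.
audio = [[1.0, 0.0], [0.0, 1.0]]
text_matched = [[1.0, 0.0], [0.0, 1.0]]
text_swapped = [[0.0, 1.0], [1.0, 0.0]]

print("matched loss:", contrastive_loss(audio, text_matched))
print("swapped loss:", contrastive_loss(audio, text_swapped))
```

The loss is near zero when each audio embedding points in the same direction as its paired text embedding and grows when the pairs are mismatched, which is the behavior an alignment objective of this kind relies on.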

Showing 1–6 of 6 results for author: Yoon, S H