Showing 1–50 of 74 results for author: Ro, Y M

  1. arXiv:2408.12114  [pdf, other]

    cs.CV

    SPARK: Multi-Vision Sensor Perception and Reasoning Benchmark for Large-scale Vision-Language Models

    Authors: Youngjoon Yu, Sangyun Chung, Byung-Kwan Lee, Yong Man Ro

    Abstract: Large-scale Vision-Language Models (LVLMs) have significantly advanced with text-aligned vision inputs. They have made remarkable progress in computer vision tasks by aligning text modality with vision inputs. There are also endeavors to incorporate multi-vision sensors beyond RGB, including thermal, depth, and medical X-ray images. However, we observe that current LVLMs view images taken from mul…

    Submitted 23 August, 2024; v1 submitted 21 August, 2024; originally announced August 2024.

    Comments: Codes and data are available at https://github.com/top-yun/SPARK

  2. arXiv:2406.12246  [pdf, other]

    cs.LG cs.CL cs.CV

    TroL: Traversal of Layers for Large Language and Vision Models

    Authors: Byung-Kwan Lee, Sangyun Chung, Chae Won Kim, Beomchan Park, Yong Man Ro

    Abstract: Large language and vision models (LLVMs) have been driven by the generalization power of large language models (LLMs) and the advent of visual instruction tuning. Along with scaling them up directly, these models enable LLVMs to showcase powerful vision language (VL) performances by covering diverse tasks via natural language instructions. However, existing open-source LLVMs that perform comparabl…

    Submitted 19 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

    Comments: Code is available in https://github.com/ByungKwanLee/TroL

  3. arXiv:2406.07867  [pdf, other]

    cs.CV cs.AI cs.HC

    Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation

    Authors: Se Jin Park, Chae Won Kim, Hyeongseop Rha, Minsu Kim, Joanna Hong, Jeong Hun Yeo, Yong Man Ro

    Abstract: In this paper, we introduce a novel Face-to-Face spoken dialogue model. It processes audio-visual speech from user input and generates audio-visual speech as the response, marking the initial step towards creating an avatar chatbot system without relying on intermediate text. To this end, we newly introduce MultiDialog, the first large-scale multimodal (i.e., audio and visual) spoken dialogue corp…

    Submitted 2 August, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

    Comments: Accepted to ACL 2024 (Oral)

  4. arXiv:2406.01920  [pdf, other]

    cs.CV cs.AI

    CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models

    Authors: Junho Kim, Hyunjun Kim, Yeonju Kim, Yong Man Ro

    Abstract: Large Multi-modal Models (LMMs) have recently demonstrated remarkable abilities in visual context understanding and coherent response generation. However, alongside these advancements, the issue of hallucinations has emerged as a significant challenge, producing erroneous responses that are unrelated to the visual contents. In this paper, we introduce a novel contrastive-based decoding method, COu…

    Submitted 3 June, 2024; originally announced June 2024.

    Comments: Project page: https://ivy-lvlm.github.io/CODE/

  5. arXiv:2405.15574  [pdf, other]

    cs.CV

    Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models

    Authors: Byung-Kwan Lee, Chae Won Kim, Beomchan Park, Yong Man Ro

    Abstract: The rapid development of large language and vision models (LLVMs) has been driven by advances in visual instruction tuning. Recently, open-source LLVMs have curated high-quality visual instruction tuning datasets and utilized additional vision encoders or multiple computer vision models in order to narrow the performance gap with powerful closed-source LLVMs. These advancements are attributed to m…

    Submitted 27 May, 2024; v1 submitted 24 May, 2024; originally announced May 2024.

    Comments: Code is available in https://github.com/ByungKwanLee/Meteor

  6. arXiv:2404.19299  [pdf, other]

    cs.CV

    Robust Pedestrian Detection via Constructing Versatile Pedestrian Knowledge Bank

    Authors: Sungjune Park, Hyunjun Kim, Yong Man Ro

    Abstract: Pedestrian detection is a crucial field of computer vision research which can be adopted in various real-world applications (e.g., self-driving systems). However, despite noticeable evolution of pedestrian detection, pedestrian representations learned within a detection framework are usually limited to particular scene data in which they were trained. Therefore, in this paper, we propose a novel a…

    Submitted 30 April, 2024; originally announced April 2024.

  7. arXiv:2403.15209  [pdf, other]

    cs.CV

    MSCoTDet: Language-driven Multi-modal Fusion for Improved Multispectral Pedestrian Detection

    Authors: Taeheon Kim, Sangyun Chung, Damin Yeom, Youngjoon Yu, Hak Gu Kim, Yong Man Ro

    Abstract: Multispectral pedestrian detection is attractive for around-the-clock applications due to the complementary information between RGB and thermal modalities. However, current models often fail to detect pedestrians in certain cases (e.g., thermal-obscured pedestrians), particularly due to the modality bias learned from statistically biased datasets. In this paper, we investigate how to mitigate moda…

    Submitted 29 May, 2024; v1 submitted 22 March, 2024; originally announced March 2024.

    Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

  8. arXiv:2403.13513  [pdf, other]

    cs.CV cs.AI cs.CL

    What if...?: Thinking Counterfactual Keywords Helps to Mitigate Hallucination in Large Multi-modal Models

    Authors: Junho Kim, Yeon Ju Kim, Yong Man Ro

    Abstract: This paper presents a way of enhancing the reliability of Large Multi-modal Models (LMMs) in addressing hallucination, where the models generate cross-modal inconsistent responses. Without additional training, we propose Counterfactual Inception, a novel method that implants counterfactual thinking into LMMs using self-generated counterfactual keywords. Our method is grounded in the concept of cou…

    Submitted 21 June, 2024; v1 submitted 20 March, 2024; originally announced March 2024.

    Comments: Project page: https://ivy-lvlm.github.io/Counterfactual-Inception/

  9. arXiv:2403.07508  [pdf, other]

    cs.CV

    MoAI: Mixture of All Intelligence for Large Language and Vision Models

    Authors: Byung-Kwan Lee, Beomchan Park, Chae Won Kim, Yong Man Ro

    Abstract: The rise of large language models (LLMs) and instruction tuning has led to the current trend of instruction-tuned large language and vision models (LLVMs). This trend involves either meticulously curating numerous instruction tuning datasets tailored to specific objectives or enlarging LLVMs to manage vast amounts of vision language (VL) data. However, current LLVMs have disregarded the detailed a…

    Submitted 17 July, 2024; v1 submitted 12 March, 2024; originally announced March 2024.

    Comments: ECCV 2024. Code available: https://github.com/ByungKwanLee/MoAI

  10. arXiv:2403.04212  [pdf, other]

    cs.CL

    Persona Extraction Through Semantic Similarity for Emotional Support Conversation Generation

    Authors: Seunghee Han, Se Jin Park, Chae Won Kim, Yong Man Ro

    Abstract: Providing emotional support through dialogue systems is becoming increasingly important in today's world, as it can support both mental health and social interactions in many conversation scenarios. Previous works have shown that using persona is effective for generating empathetic and supportive responses. They have often relied on pre-provided persona rather than inferring them during conversati…

    Submitted 6 March, 2024; originally announced March 2024.

    Comments: Accepted by ICASSP2024

  11. arXiv:2403.01300  [pdf, other]

    cs.CV

    Causal Mode Multiplexer: A Novel Framework for Unbiased Multispectral Pedestrian Detection

    Authors: Taeheon Kim, Sebin Shin, Youngjoon Yu, Hak Gu Kim, Yong Man Ro

    Abstract: RGBT multispectral pedestrian detection has emerged as a promising solution for safety-critical applications that require day/night operations. However, the modality bias problem remains unsolved as multispectral pedestrian detectors learn the statistical bias in datasets. Specifically, datasets in multispectral pedestrian detection mainly distribute between ROTO (day) and RXTO (night) data; the m…

    Submitted 5 April, 2024; v1 submitted 2 March, 2024; originally announced March 2024.

    Comments: CVPR2024

  12. arXiv:2402.16021  [pdf, other]

    cs.CL cs.AI cs.CV eess.AS

    TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages

    Authors: Minsu Kim, Jee-weon Jung, Hyeongseop Rha, Soumi Maiti, Siddhant Arora, Xuankai Chang, Shinji Watanabe, Yong Man Ro

    Abstract: The capability to jointly process multi-modal information is becoming an essential task. However, the limited number of paired multi-modal data and the large computational requirements in multi-modal learning hinder the development. We propose a novel Tri-Modal Translation (TMT) model that translates between arbitrary modalities spanning speech, image, and text. We introduce a novel viewpoint, whe…

    Submitted 25 February, 2024; originally announced February 2024.

  13. arXiv:2402.15151  [pdf, other]

    cs.CV cs.CL eess.AS eess.IV

    Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing

    Authors: Jeong Hun Yeo, Seunghee Han, Minsu Kim, Yong Man Ro

    Abstract: In visual speech processing, context modeling capability is one of the most important requirements due to the ambiguous nature of lip movements. For example, homophenes, words that share identical lip movements but produce different sounds, can be distinguished by considering the context. In this paper, we propose a novel framework, namely Visual Speech Processing incorporated with LLMs (VSP-LLM),…

    Submitted 13 May, 2024; v1 submitted 23 February, 2024; originally announced February 2024.

    Comments: An Erratum was added on the last page of this paper

  14. arXiv:2402.11248  [pdf, other]

    cs.CV

    CoLLaVO: Crayon Large Language and Vision mOdel

    Authors: Byung-Kwan Lee, Beomchan Park, Chae Won Kim, Yong Man Ro

    Abstract: The remarkable success of Large Language Models (LLMs) and instruction tuning drives the evolution of Vision Language Models (VLMs) towards a versatile general-purpose model. Yet, it remains unexplored whether current VLMs genuinely possess quality object-level image understanding capabilities determined from 'what objects are in the image?' or 'which object corresponds to a specified bounding box…

    Submitted 2 June, 2024; v1 submitted 17 February, 2024; originally announced February 2024.

    Comments: ACL 2024 Findings. Code available: https://github.com/ByungKwanLee/CoLLaVO

  15. arXiv:2401.09802  [pdf, other]

    eess.AS cs.CV cs.SD

    Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation

    Authors: Minsu Kim, Jeong Hun Yeo, Se Jin Park, Hyeongseop Rha, Yong Man Ro

    Abstract: This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model. As the massive multilingual modeling of visual data requires huge computational costs, we propose a novel training strategy, processing with visual speech units. Motivated by the recent success of the audio speech unit, we propose to use a visual speec…

    Submitted 18 July, 2024; v1 submitted 18 January, 2024; originally announced January 2024.

    Comments: ACMMM 2024

  16. arXiv:2312.02512  [pdf, other]

    cs.CV cs.AI cs.MM eess.AS

    AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation

    Authors: Jeongsoo Choi, Se Jin Park, Minsu Kim, Yong Man Ro

    Abstract: This paper proposes a novel direct Audio-Visual Speech to Audio-Visual Speech Translation (AV2AV) framework, where the input and output of the system are multimodal (i.e., audio and visual speech). With the proposed AV2AV, two key advantages can be brought: 1) We can perform real-like conversations with individuals worldwide in a virtual meeting by utilizing our own primary languages. In contrast…

    Submitted 26 March, 2024; v1 submitted 5 December, 2023; originally announced December 2023.

    Comments: CVPR 2024. Code & Demo: https://choijeongsoo.github.io/av2av

  17. arXiv:2311.01025  [pdf, other]

    cs.CV

    Integrating Language-Derived Appearance Elements with Visual Cues in Pedestrian Detection

    Authors: Sungjune Park, Hyunjun Kim, Yong Man Ro

    Abstract: Large language models (LLMs) have shown their capabilities in understanding contextual and semantic information regarding knowledge of instance appearances. In this paper, we introduce a novel approach to utilize the strengths of LLMs in understanding contextual appearance variations and to leverage this knowledge into a vision model (here, pedestrian detection). While pedestrian detection is cons…

    Submitted 30 April, 2024; v1 submitted 2 November, 2023; originally announced November 2023.

  18. arXiv:2310.14946  [pdf, other]

    cs.MM cs.SD eess.AS

    Intuitive Multilingual Audio-Visual Speech Recognition with a Single-Trained Model

    Authors: Joanna Hong, Se Jin Park, Yong Man Ro

    Abstract: We present a novel approach to multilingual audio-visual speech recognition tasks by introducing a single model on a multilingual dataset. Motivated by a human cognitive system where humans can intuitively distinguish different languages without any conscious effort or guidance, we propose a model that can capture which language is given as an input speech by distinguishing the inherent similariti…

    Submitted 23 October, 2023; originally announced October 2023.

    Comments: EMNLP 2023 Findings

  19. arXiv:2310.07379  [pdf, other]

    cs.CV cs.AI cs.LG

    Causal Unsupervised Semantic Segmentation

    Authors: Junho Kim, Byung-Kwan Lee, Yong Man Ro

    Abstract: Unsupervised semantic segmentation aims to achieve high-quality semantic grouping without human-labeled annotations. With the advent of self-supervised pre-training, various frameworks utilize the pre-trained features to train prediction heads for unsupervised dense prediction. However, a significant challenge in this unsupervised setup is determining the appropriate level of clustering required f…

    Submitted 11 October, 2023; originally announced October 2023.

    Comments: code available: https://github.com/ByungKwanLee/Causal-Unsupervised-Segmentation

  20. arXiv:2310.05934  [pdf, other]

    cs.CV cs.AI cs.MM eess.IV

    DF-3DFace: One-to-Many Speech Synchronized 3D Face Animation with Diffusion

    Authors: Se Jin Park, Joanna Hong, Minsu Kim, Yong Man Ro

    Abstract: Speech-driven 3D facial animation has gained significant attention for its ability to create realistic and expressive facial animations in 3D space based on speech. Learning-based methods have shown promising progress in achieving accurate facial motion synchronized with speech. However, one-to-many nature of speech-to-3D facial synthesis has not been fully explored: while the lip accurately synch…

    Submitted 23 August, 2023; originally announced October 2023.

  21. arXiv:2309.08535  [pdf, other]

    cs.CV cs.AI eess.AS

    Visual Speech Recognition for Languages with Limited Labeled Data using Automatic Labels from Whisper

    Authors: Jeong Hun Yeo, Minsu Kim, Shinji Watanabe, Yong Man Ro

    Abstract: This paper proposes a powerful Visual Speech Recognition (VSR) method for multiple languages, especially for low-resource languages that have a limited number of labeled data. Different from previous methods that tried to improve the VSR performance for the target language by using knowledge learned from other languages, we explore whether we can increase the amount of training data itself for the…

    Submitted 12 January, 2024; v1 submitted 15 September, 2023; originally announced September 2023.

    Comments: Accepted at ICASSP 2024

  22. arXiv:2309.08531  [pdf, other]

    cs.CV cs.CL eess.AS eess.IV

    Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-training and Multi-modal Tokens

    Authors: Minsu Kim, Jeongsoo Choi, Soumi Maiti, Jeong Hun Yeo, Shinji Watanabe, Yong Man Ro

    Abstract: In this paper, we propose methods to build a powerful and efficient Image-to-Speech captioning (Im2Sp) model. To this end, we start with importing the rich knowledge related to image comprehension and language modeling from a large-scale pre-trained vision-language model into Im2Sp. We set the output of the proposed Im2Sp as discretized speech units, i.e., the quantized speech features of a self-s…

    Submitted 15 September, 2023; originally announced September 2023.

  23. arXiv:2308.09311  [pdf, other]

    cs.CV cs.CL cs.SD eess.AS eess.IV

    Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge

    Authors: Minsu Kim, Jeong Hun Yeo, Jeongsoo Choi, Yong Man Ro

    Abstract: This paper proposes a novel lip reading framework, especially for low-resource languages, which has not been well addressed in the previous literature. Since low-resource languages do not have enough video-text paired data to train the model to have sufficient power to model lip movements and language, it is regarded as challenging to develop lip reading models for low-resource languages. In order…

    Submitted 12 January, 2024; v1 submitted 18 August, 2023; originally announced August 2023.

    Comments: Accepted at ICCV 2023

  24. arXiv:2308.07787  [pdf, other]

    cs.SD cs.CV cs.LG eess.AS

    DiffV2S: Diffusion-based Video-to-Speech Synthesis with Vision-guided Speaker Embedding

    Authors: Jeongsoo Choi, Joanna Hong, Yong Man Ro

    Abstract: Recent research has demonstrated impressive results in video-to-speech synthesis which involves reconstructing speech solely from visual input. However, previous works have struggled to accurately synthesize speech due to a lack of sufficient guidance for the model to infer the correct content with the appropriate sound. To resolve the issue, they have adopted an extra speaker embedding as a speak…

    Submitted 15 August, 2023; originally announced August 2023.

    Comments: ICCV 2023

  25. arXiv:2308.07593  [pdf, other]

    cs.CV cs.MM eess.AS eess.IV

    AKVSR: Audio Knowledge Empowered Visual Speech Recognition by Compressing Audio Knowledge of a Pretrained Model

    Authors: Jeong Hun Yeo, Minsu Kim, Jeongsoo Choi, Dae Hoe Kim, Yong Man Ro

    Abstract: Visual Speech Recognition (VSR) is the task of predicting spoken words from silent lip movements. VSR is regarded as a challenging task because of the insufficient information on lip movements. In this paper, we propose an Audio Knowledge empowered Visual Speech Recognition framework (AKVSR) to complement the insufficient speech information of visual modality by using audio modality. Different fro…

    Submitted 11 January, 2024; v1 submitted 15 August, 2023; originally announced August 2023.

    Comments: Accepted by IEEE Transactions on Multimedia

  26. arXiv:2308.01831  [pdf, other]

    cs.CL eess.AS eess.SP

    Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation

    Authors: Minsu Kim, Jeongsoo Choi, Dahun Kim, Yong Man Ro

    Abstract: This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation that can also benefit the transfer of pre-trained knowledge to text-based systems, text-to-speech synthesis and text-to-speech translation. To this end, we represent multilingual speech with speech units that are the discretized representations of speech features derived from a self-supervised…

    Submitted 18 August, 2024; v1 submitted 3 August, 2023; originally announced August 2023.

    Comments: TASLP

  27. arXiv:2307.07250  [pdf, other]

    cs.LG cs.AI cs.CV

    Mitigating Adversarial Vulnerability through Causal Parameter Estimation by Adversarial Double Machine Learning

    Authors: Byung-Kwan Lee, Junho Kim, Yong Man Ro

    Abstract: Adversarial examples derived from deliberately crafted perturbations on visual inputs can easily harm decision process of deep neural networks. To prevent potential threats, various adversarial training-based defense methods have grown rapidly and become a de facto standard approach for robustness. Despite recent competitive achievements, we observe that adversarial vulnerability varies across tar…

    Submitted 18 July, 2023; v1 submitted 14 July, 2023; originally announced July 2023.

    Comments: Accepted in ICCV 2023

  28. arXiv:2306.16003  [pdf, other]

    cs.GR cs.CV cs.SD eess.AS

    Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models

    Authors: Jeongsoo Choi, Minsu Kim, Se Jin Park, Yong Man Ro

    Abstract: In this paper, we present a method for reprogramming pre-trained audio-driven talking face synthesis models to operate in a text-driven manner. Consequently, we can easily generate face videos that articulate the provided textual sentences, eliminating the necessity of recording speech for each inference, as required in the audio-driven model. To this end, we propose to embed the input text into t…

    Submitted 18 January, 2024; v1 submitted 28 June, 2023; originally announced June 2023.

    Comments: ICASSP 2024

  29. Robust Proxy: Improving Adversarial Robustness by Robust Proxy Learning

    Authors: Hong Joo Lee, Yong Man Ro

    Abstract: Recently, it has been widely known that deep neural networks are highly vulnerable and easily broken by adversarial attacks. To mitigate the adversarial vulnerability, many defense algorithms have been proposed. Recently, to improve adversarial robustness, many works try to enhance feature representation by imposing more direct supervision on the discriminative feature. However, existing approache…

    Submitted 27 June, 2023; originally announced June 2023.

    Comments: Accepted at IEEE Transactions on Information Forensics and Security (TIFS)

  30. Advancing Adversarial Training by Injecting Booster Signal

    Authors: Hong Joo Lee, Youngjoon Yu, Yong Man Ro

    Abstract: Recent works have demonstrated that deep neural networks (DNNs) are highly vulnerable to adversarial attacks. To defend against adversarial attacks, many defense strategies have been proposed, among which adversarial training has been demonstrated to be the most effective strategy. However, it has been known that adversarial training sometimes hurts natural accuracy. Then, many works focus on opti…

    Submitted 27 June, 2023; originally announced June 2023.

    Comments: Accepted at IEEE Transactions on Neural Networks and Learning Systems

  31. arXiv:2305.19603  [pdf, other]

    cs.SD cs.CV eess.AS

    Intelligible Lip-to-Speech Synthesis with Speech Units

    Authors: Jeongsoo Choi, Minsu Kim, Yong Man Ro

    Abstract: In this paper, we propose a novel Lip-to-Speech synthesis (L2S) framework, for synthesizing intelligible speech from a silent lip movement video. Specifically, to complement the insufficient supervisory signal of the previous L2S model, we propose to use quantized self-supervised speech representations, named speech units, as an additional prediction target for the L2S model. Therefore, the propos…

    Submitted 31 May, 2023; originally announced May 2023.

    Comments: Interspeech 2023

  32. arXiv:2305.19556  [pdf, other]

    cs.CV cs.AI cs.SD eess.AS eess.IV

    Exploring Phonetic Context-Aware Lip-Sync For Talking Face Generation

    Authors: Se Jin Park, Minsu Kim, Jeongsoo Choi, Yong Man Ro

    Abstract: Talking face generation is the challenging task of synthesizing a natural and realistic face that requires accurate synchronization with a given audio. Due to co-articulation, where an isolated phone is influenced by the preceding or following phones, the articulation of a phone varies upon the phonetic context. Therefore, modeling lip motion with the phonetic context can generate more spatio-temp…

    Submitted 1 April, 2024; v1 submitted 31 May, 2023; originally announced May 2023.

    Comments: Accepted at ICASSP 2024

  33. arXiv:2305.04542  [pdf, other]

    cs.CV eess.AS

    Multi-Temporal Lip-Audio Memory for Visual Speech Recognition

    Authors: Jeong Hun Yeo, Minsu Kim, Yong Man Ro

    Abstract: Visual Speech Recognition (VSR) is a task to predict a sentence or word from lip movements. Some works have been recently presented which use audio signals to supplement visual information. However, existing methods utilize only limited information such as phoneme-level features and soft labels of Automatic Speech Recognition (ASR) networks. In this paper, we present a Multi-Temporal Lip-Audio Mem…

    Submitted 8 May, 2023; originally announced May 2023.

    Comments: Presented at ICASSP 2023

  34. arXiv:2303.08670  [pdf, other]

    cs.CV cs.AI cs.SD eess.AS

    Deep Visual Forced Alignment: Learning to Align Transcription with Talking Face Video

    Authors: Minsu Kim, Chae Won Kim, Yong Man Ro

    Abstract: Forced alignment refers to a technology that time-aligns a given transcription with a corresponding speech. However, as the forced alignment technologies have developed using speech audio, they might fail in alignment when the input speech audio is noise-corrupted or is not accessible. We focus on that there is another component that the speech can be inferred from, the speech video (i.e., talking…

    Submitted 26 February, 2023; originally announced March 2023.

    Comments: Accepted in AAAI2023

  35. arXiv:2303.08536  [pdf, other]

    cs.MM cs.CV cs.LG cs.SD eess.AS

    Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring

    Authors: Joanna Hong, Minsu Kim, Jeongsoo Choi, Yong Man Ro

    Abstract: This paper deals with Audio-Visual Speech Recognition (AVSR) under multimodal input corruption situations where audio inputs and visual inputs are both corrupted, which is not well addressed in previous research directions. Previous studies have focused on how to complement the corrupted audio inputs with the clean visual inputs with the assumption of the availability of clean visual inputs. Howev…

    Submitted 20 March, 2023; v1 submitted 15 March, 2023; originally announced March 2023.

    Comments: Accepted at CVPR 2023. Implementation available: https://github.com/joannahong/AV-RelScore

  36. arXiv:2303.01052  [pdf, other]

    cs.LG cs.AI cs.CV stat.ME

    Demystifying Causal Features on Adversarial Examples and Causal Inoculation for Robust Network by Adversarial Instrumental Variable Regression

    Authors: Junho Kim, Byung-Kwan Lee, Yong Man Ro

    Abstract: The origin of adversarial examples is still inexplicable in research fields, and it arouses arguments from various viewpoints, albeit comprehensive investigations. In this paper, we propose a way of delving into the unexpected vulnerability in adversarially trained networks from a causal perspective, namely adversarial instrumental variable (IV) regression. By deploying it, we estimate the causal…

    Submitted 2 March, 2023; originally announced March 2023.

    Comments: Accepted in CVPR 2023

  37. arXiv:2302.08841  [pdf, other]

    cs.SD cs.CV cs.LG cs.MM eess.AS

    Lip-to-Speech Synthesis in the Wild with Multi-task Learning

    Authors: Minsu Kim, Joanna Hong, Yong Man Ro

    Abstract: Recent studies have shown impressive performance in Lip-to-speech synthesis that aims to reconstruct speech from visual information alone. However, they have been suffering from synthesizing accurate speech in the wild, due to insufficient supervision for guiding the model to infer the correct content. Distinct from the previous methods, in this paper, we develop a powerful Lip2Speech method that…

    Submitted 17 February, 2023; originally announced February 2023.

    Comments: Accepted at ICASSP 2023. Demo available: https://github.com/joannahong/Lip-to-Speech-Synthesis-in-the-Wild

  38. arXiv:2302.08102  [pdf, other]

    cs.CL cs.AI cs.CV cs.SD eess.AS eess.IV

    Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition

    Authors: Minsu Kim, Hyung-Il Kim, Yong Man Ro

    Abstract: Visual Speech Recognition (VSR) aims to infer speech into text depending on lip movements alone. As it focuses on visual information to model the speech, its performance is inherently sensitive to personal lip appearances and movements, and this makes the VSR models show degraded performance when they are applied to unseen speakers. In this paper, to remedy the performance degradation of the VSR m…

    Submitted 16 February, 2023; originally announced February 2023.

  39. arXiv:2211.00924  [pdf, other]

    cs.CV cs.AI eess.IV

    SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory

    Authors: Se Jin Park, Minsu Kim, Joanna Hong, Jeongsoo Choi, Yong Man Ro

    Abstract: The challenge of talking face generation from speech lies in aligning two different modal information, audio and video, such that the mouth region corresponds to input audio. Previous methods either exploit audio-visual representation learning or leverage intermediate structural information such as landmarks and 3D models. However, they struggle to synthesize fine details of the lips varying at th…

    Submitted 2 November, 2022; v1 submitted 2 November, 2022; originally announced November 2022.

    Comments: Accepted at AAAI 2022 (Oral)

  40. arXiv:2210.13186  [pdf, other]

    cs.LG cs.AI cs.CV

    Meta Input: How to Leverage Off-the-Shelf Deep Neural Networks

    Authors: Minsu Kim, Youngjoon Yu, Sungjune Park, Yong Man Ro

    Abstract: These days, although deep neural networks (DNNs) have achieved a noticeable progress in a wide range of research area, it lacks the adaptability to be employed in the real-world applications because of the environment discrepancy problem. Such a problem originates from the difference between training and testing environments, and it is widely known that it causes serious performance degradation, w…

    Submitted 20 October, 2022; originally announced October 2022.

  41. arXiv:2209.07220  [pdf, other]

    cs.CV

    Face Shape-Guided Deep Feature Alignment for Face Recognition Robust to Face Misalignment

    Authors: Hyung-Il Kim, Kimin Yun, Yong Man Ro

    Abstract: For the past decades, face recognition (FR) has been actively studied in computer vision and pattern recognition society. Recently, due to the advances in deep learning, the FR technology shows high performance for most of the benchmark datasets. However, when the FR algorithm is applied to a real-world scenario, the performance has been known to be still unsatisfactory. This is mainly attributed…

    Submitted 15 September, 2022; originally announced September 2022.

    Comments: 14 pages, 9 figures

  42. arXiv:2208.04498  [pdf, other]

    cs.CV cs.AI eess.AS eess.IV

    Speaker-adaptive Lip Reading with User-dependent Padding

    Authors: Minsu Kim, Hyunjun Kim, Yong Man Ro

    Abstract: Lip reading aims to predict speech based on lip movements alone. As it focuses on visual information to model the speech, its performance is inherently sensitive to personal lip appearances and movements. This makes the lip reading models show degraded performance when they are applied to unseen speakers due to the mismatch between training and testing conditions. Speaker adaptation technique aims…

    Submitted 8 August, 2022; originally announced August 2022.

    Comments: Accepted at ECCV2022

  43. arXiv:2207.06020  [pdf, other]

    cs.SD cs.AI cs.CV cs.MM eess.AS eess.IV

    Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition

    Authors: Joanna Hong, Minsu Kim, Daehun Yoo, Yong Man Ro

    Abstract: This paper focuses on designing a noise-robust end-to-end Audio-Visual Speech Recognition (AVSR) system. To this end, we propose Visual Context-driven Audio Feature Enhancement module (V-CAFE) to enhance the input noisy audio speech with a help of audio-visual correspondence. The proposed V-CAFE is designed to capture the transition of lip movements, namely visual context and to generate a noise r…

    Submitted 13 July, 2022; originally announced July 2022.

    Comments: Accepted at Interspeech 2022

  44. arXiv:2206.07458  [pdf, other]

    cs.CV cs.LG cs.SD eess.AS

    VisageSynTalk: Unseen Speaker Video-to-Speech Synthesis via Speech-Visage Feature Selection

    Authors: Joanna Hong, Minsu Kim, Yong Man Ro

    Abstract: The goal of this work is to reconstruct speech from a silent talking face video. Recent studies have shown impressive performance on synthesizing speech from silent talking face videos. However, they have not explicitly considered on varying identity characteristics of different speakers, which place a challenge in the video-to-speech synthesis, and this becomes more critical in unseen-speaker set…

    Submitted 20 July, 2022; v1 submitted 15 June, 2022; originally announced June 2022.

    Comments: Accepted by ECCV 2022

  45. Defending Person Detection Against Adversarial Patch Attack by using Universal Defensive Frame

    Authors: Youngjoon Yu, Hong Joo Lee, Hakmin Lee, Yong Man Ro

    Abstract: Person detection has attracted great attention in the computer vision area and is an imperative element in human-centric computer vision. Although the predictive performances of person detection networks have been improved dramatically, they are vulnerable to adversarial patch attacks. Changing the pixels in a restricted region can easily fool the person detection network in safety-critical applic…

    Submitted 20 October, 2022; v1 submitted 27 April, 2022; originally announced April 2022.

    Comments: Accepted at IEEE Transactions on Image Processing (TIP), 2022

  46. arXiv:2204.02738  [pdf, other]

    cs.CV

    Masking Adversarial Damage: Finding Adversarial Saliency for Robust and Sparse Network

    Authors: Byung-Kwan Lee, Junho Kim, Yong Man Ro

    Abstract: Adversarial examples provoke weak reliability and potential security issues in deep neural networks. Although adversarial training has been widely studied to improve adversarial robustness, it works in an over-parameterized regime and requires high computations and large memory budgets. To bridge adversarial robustness and model compression, we propose a novel adversarial pruning method, Masking A…

    Submitted 6 April, 2022; originally announced April 2022.

    Comments: CVPR 2022

  47. arXiv:2204.02735  [pdf, other]

    cs.LG

    Distilling Robust and Non-Robust Features in Adversarial Examples by Information Bottleneck

    Authors: Junho Kim, Byung-Kwan Lee, Yong Man Ro

    Abstract: Adversarial examples, generated by carefully crafted perturbation, have attracted considerable attention in research fields. Recent works have argued that the existence of the robust and non-robust features is a primary cause of the adversarial examples, and investigated their internal interactions in the feature space. In this paper, we propose a way of explicitly distilling feature representatio…

    Submitted 6 April, 2022; originally announced April 2022.

    Comments: NeurIPS 2021

  48. arXiv:2204.01726  [pdf, other]

    cs.CV cs.AI eess.AS

    Lip to Speech Synthesis with Visual Context Attentional GAN

    Authors: Minsu Kim, Joanna Hong, Yong Man Ro

    Abstract: In this paper, we propose a novel lip-to-speech generative adversarial network, Visual Context Attentional GAN (VCA-GAN), which can jointly model local and global lip movements during speech synthesis. Specifically, the proposed VCA-GAN synthesizes the speech from local lip visual features by finding a mapping function of viseme-to-phoneme, while global visual context is embedded into the intermed…

    Submitted 4 April, 2022; originally announced April 2022.

    Comments: Published at NeurIPS 2021

  49. arXiv:2204.01725  [pdf, other]

    cs.CV

    Distinguishing Homophenes Using Multi-Head Visual-Audio Memory for Lip Reading

    Authors: Minsu Kim, Jeong Hun Yeo, Yong Man Ro

    Abstract: Recognizing speech from silent lip movement, which is called lip reading, is a challenging task due to 1) the inherent information insufficiency of lip movement to fully represent the speech, and 2) the existence of homophenes that have similar lip movement with different pronunciations. In this paper, we try to alleviate the aforementioned two challenges in lip reading by proposing a Multi-head V…

    Submitted 4 April, 2022; originally announced April 2022.

    Comments: Published at AAAI 2022

  50. arXiv:2204.01265  [pdf, other]

    cs.CV cs.AI cs.SD eess.AS

    Multi-modality Associative Bridging through Memory: Speech Sound Recollected from Face Video

    Authors: Minsu Kim, Joanna Hong, Se Jin Park, Yong Man Ro

    Abstract: In this paper, we introduce a novel audio-visual multi-modal bridging framework that can utilize both audio and visual information, even with uni-modal inputs. We exploit a memory network that stores source (i.e., visual) and target (i.e., audio) modal representations, where source modal representation is what we are given, and target modal representations are what we want to obtain from the memor…

    Submitted 4 April, 2022; originally announced April 2022.

    Comments: Published at ICCV 2021