Search | arXiv e-print repository

doi 10.1145/3610921

MicroCam: Leveraging Smartphone Microscope Camera for Context-Aware Contact Surface Sensing

Authors: Yongquan Hu, Hui-Shyong Yeo, Mingyue Yuan, Haoran Fan, Don Samitha Elvitigala, Wen Hu, Aaron Quigley

Abstract: The primary focus of this research is the discreet and subtle everyday contact interactions between mobile phones and their surrounding surfaces. Such interactions are anticipated to facilitate mobile context awareness, encompassing aspects such as dispensing medication updates, intelligently switching modes (e.g., silent mode), or initiating commands (e.g., deactivating an alarm). We introduce Mi… ▽ More The primary focus of this research is the discreet and subtle everyday contact interactions between mobile phones and their surrounding surfaces. Such interactions are anticipated to facilitate mobile context awareness, encompassing aspects such as dispensing medication updates, intelligently switching modes (e.g., silent mode), or initiating commands (e.g., deactivating an alarm). We introduce MicroCam, a contact-based sensing system that employs smartphone IMU data to detect the routine state of phone placement and utilizes a built-in microscope camera to capture intricate surface details. In particular, a natural dataset is collected to acquire authentic surface textures in situ for training and testing. Moreover, we optimize the deep neural network component of the algorithm, based on continual learning, to accurately discriminate between object categories (e.g., tables) and material constituents (e.g., wood). Experimental results highlight the superior accuracy, robustness and generalization of the proposed method. Lastly, we conducted a comprehensive discussion centered on our prototype, encompassing topics such as system performance and potential applications and scenarios. △ Less

Submitted 22 July, 2024; originally announced July 2024.

Comments: 28 pages

Journal ref: Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 7.3 (2023): 1-28

arXiv:2406.07867 [pdf, other]

Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation

Authors: Se Jin Park, Chae Won Kim, Hyeongseop Rha, Minsu Kim, Joanna Hong, Jeong Hun Yeo, Yong Man Ro

Abstract: In this paper, we introduce a novel Face-to-Face spoken dialogue model. It processes audio-visual speech from user input and generates audio-visual speech as the response, marking the initial step towards creating an avatar chatbot system without relying on intermediate text. To this end, we newly introduce MultiDialog, the first large-scale multimodal (i.e., audio and visual) spoken dialogue corp… ▽ More In this paper, we introduce a novel Face-to-Face spoken dialogue model. It processes audio-visual speech from user input and generates audio-visual speech as the response, marking the initial step towards creating an avatar chatbot system without relying on intermediate text. To this end, we newly introduce MultiDialog, the first large-scale multimodal (i.e., audio and visual) spoken dialogue corpus containing 340 hours of approximately 9,000 dialogues, recorded based on the open domain dialogue dataset, TopicalChat. The MultiDialog contains parallel audio-visual recordings of conversation partners acting according to the given script with emotion annotations, which we expect to open up research opportunities in multimodal synthesis. Our Face-to-Face spoken dialogue model incorporates a textually pretrained large language model and adapts it into the audio-visual spoken dialogue domain by incorporating speech-text joint pretraining. Through extensive experiments, we validate the effectiveness of our model in facilitating a face-to-face conversation. Demo and data are available at https://multidialog.github.io and https://huggingface.co/datasets/IVLLab/MultiDialog, respectively. △ Less

Submitted 2 August, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

Comments: Accepted to ACL 2024 (Oral)

arXiv:2404.15635 [pdf, other]

A Real-time Evaluation Framework for Pedestrian's Potential Risk at Non-Signalized Intersections Based on Predicted Post-Encroachment Time

Authors: Tengfeng Lin, Zhixiong Jin, Seongjin Choi, Hwasoo Yeo

Abstract: Addressing pedestrian safety at intersections is one of the paramount concerns in the field of transportation research, driven by the urgency of reducing traffic-related injuries and fatalities. With advances in computer vision technologies and predictive models, the pursuit of developing real-time proactive protection systems is increasingly recognized as vital to improving pedestrian safety at i… ▽ More Addressing pedestrian safety at intersections is one of the paramount concerns in the field of transportation research, driven by the urgency of reducing traffic-related injuries and fatalities. With advances in computer vision technologies and predictive models, the pursuit of developing real-time proactive protection systems is increasingly recognized as vital to improving pedestrian safety at intersections. The core of these protection systems lies in the prediction-based evaluation of pedestrian's potential risks, which plays a significant role in preventing the occurrence of accidents. The major challenges in the current prediction-based potential risk evaluation research can be summarized into three aspects: the inadequate progress in creating a real-time framework for the evaluation of pedestrian's potential risks, the absence of accurate and explainable safety indicators that can represent the potential risk, and the lack of tailor-made evaluation criteria specifically for each category of pedestrians. To address these research challenges, in this study, a framework with computer vision technologies and predictive models is developed to evaluate the potential risk of pedestrians in real time. Integral to this framework is a novel surrogate safety measure, the Predicted Post-Encroachment Time (P-PET), derived from deep learning models capable to predict the arrival time of pedestrians and vehicles at intersections. To further improve the effectiveness and reliability of pedestrian risk evaluation, we classify pedestrians into distinct categories and apply specific evaluation criteria for each group. The results demonstrate the framework's ability to effectively identify potential risks through the use of P-PET, indicating its feasibility for real-time applications and its improved performance in risk evaluation across different categories of pedestrians. △ Less

Submitted 24 April, 2024; originally announced April 2024.

arXiv:2402.15151 [pdf, other]

Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing

Authors: Jeong Hun Yeo, Seunghee Han, Minsu Kim, Yong Man Ro

Abstract: In visual speech processing, context modeling capability is one of the most important requirements due to the ambiguous nature of lip movements. For example, homophenes, words that share identical lip movements but produce different sounds, can be distinguished by considering the context. In this paper, we propose a novel framework, namely Visual Speech Processing incorporated with LLMs (VSP-LLM),… ▽ More In visual speech processing, context modeling capability is one of the most important requirements due to the ambiguous nature of lip movements. For example, homophenes, words that share identical lip movements but produce different sounds, can be distinguished by considering the context. In this paper, we propose a novel framework, namely Visual Speech Processing incorporated with LLMs (VSP-LLM), to maximize the context modeling ability by bringing the overwhelming power of LLMs. Specifically, VSP-LLM is designed to perform multi-tasks of visual speech recognition and translation, where the given instructions control the type of task. The input video is mapped to the input latent space of an LLM by employing a self-supervised visual speech model. Focused on the fact that there is redundant information in input frames, we propose a novel deduplication method that reduces the embedded visual features by employing visual speech units. Through the proposed deduplication and Low Rank Adaptation (LoRA), VSP-LLM can be trained in a computationally efficient manner. In the translation dataset, the MuAViC benchmark, we demonstrate that VSP-LLM trained on just 30 hours of labeled data can more effectively translate lip movements compared to the recent model trained with 433 hours of data. △ Less

Submitted 13 May, 2024; v1 submitted 23 February, 2024; originally announced February 2024.

Comments: An Erratum was added on the last page of this paper

arXiv:2401.10314 [pdf, other]

LangProp: A code optimization framework using Large Language Models applied to driving

Authors: Shu Ishida, Gianluca Corrado, George Fedoseev, Hudson Yeo, Lloyd Russell, Jamie Shotton, João F. Henriques, Anthony Hu

Abstract: We propose LangProp, a framework for iteratively optimizing code generated by large language models (LLMs), in both supervised and reinforcement learning settings. While LLMs can generate sensible coding solutions zero-shot, they are often sub-optimal. Especially for code generation tasks, it is likely that the initial code will fail on certain edge cases. LangProp automatically evaluates the code… ▽ More We propose LangProp, a framework for iteratively optimizing code generated by large language models (LLMs), in both supervised and reinforcement learning settings. While LLMs can generate sensible coding solutions zero-shot, they are often sub-optimal. Especially for code generation tasks, it is likely that the initial code will fail on certain edge cases. LangProp automatically evaluates the code performance on a dataset of input-output pairs, catches any exceptions, and feeds the results back to the LLM in the training loop, so that the LLM can iteratively improve the code it generates. By adopting a metric- and data-driven training paradigm for this code optimization procedure, one could easily adapt findings from traditional machine learning techniques such as imitation learning, DAgger, and reinforcement learning. We show LangProp's applicability to general domains such as Sudoku and CartPole, as well as demonstrate the first proof of concept of automated code optimization for autonomous driving in CARLA. We show that LangProp can generate interpretable and transparent policies that can be verified and improved in a metric- and data-driven way. Our code is available at https://github.com/shuishida/LangProp. △ Less

Submitted 3 May, 2024; v1 submitted 18 January, 2024; originally announced January 2024.

arXiv:2401.09802 [pdf, other]

Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation

Authors: Minsu Kim, Jeong Hun Yeo, Se Jin Park, Hyeongseop Rha, Yong Man Ro

Abstract: This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model. As the massive multilingual modeling of visual data requires huge computational costs, we propose a novel training strategy, processing with visual speech units. Motivated by the recent success of the audio speech unit, we propose to use a visual speec… ▽ More This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model. As the massive multilingual modeling of visual data requires huge computational costs, we propose a novel training strategy, processing with visual speech units. Motivated by the recent success of the audio speech unit, we propose to use a visual speech unit that can be obtained by discretizing the visual speech features extracted from the self-supervised visual speech model. Through analysis, we verify that the visual speech units mainly contain viseme information while suppressing non-linguistic information. By using the visual speech units as the inputs of our system, we propose to pre-train a VSR model to predict corresponding text outputs on multilingual data constructed by merging several VSR databases. As both the inputs (i.e., visual speech units) and outputs (i.e., text) are discrete, we can greatly improve the training efficiency compared to the standard VSR training. Specifically, the input data size is reduced to 0.016% of the original video inputs. In order to complement the insufficient visual information in speech recognition, we apply curriculum learning where the inputs of the system begin with audio-visual speech units and gradually change to visual speech units. After pre-training, the model is finetuned on continuous features. We set new state-of-the-art multilingual VSR performances by achieving comparable performances to the previous language-specific VSR models, with a single trained model. △ Less

Submitted 18 July, 2024; v1 submitted 18 January, 2024; originally announced January 2024.

Comments: ACMMM 2024

arXiv:2309.17080 [pdf, other]

GAIA-1: A Generative World Model for Autonomous Driving

Authors: Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, Gianluca Corrado

Abstract: Autonomous driving promises transformative improvements to transportation, but building systems capable of safely navigating the unstructured complexity of real-world scenarios remains challenging. A critical problem lies in effectively predicting the various potential outcomes that may emerge in response to the vehicle's actions as the world evolves. To address this challenge, we introduce GAIA… ▽ More Autonomous driving promises transformative improvements to transportation, but building systems capable of safely navigating the unstructured complexity of real-world scenarios remains challenging. A critical problem lies in effectively predicting the various potential outcomes that may emerge in response to the vehicle's actions as the world evolves. To address this challenge, we introduce GAIA-1 ('Generative AI for Autonomy'), a generative world model that leverages video, text, and action inputs to generate realistic driving scenarios while offering fine-grained control over ego-vehicle behavior and scene features. Our approach casts world modeling as an unsupervised sequence modeling problem by mapping the inputs to discrete tokens, and predicting the next token in the sequence. Emerging properties from our model include learning high-level structures and scene dynamics, contextual awareness, generalization, and understanding of geometry. The power of GAIA-1's learned representation that captures expectations of future events, combined with its ability to generate realistic samples, provides new possibilities for innovation in the field of autonomy, enabling enhanced and accelerated training of autonomous driving technology. △ Less

Submitted 29 September, 2023; originally announced September 2023.

Comments: Technical Report

arXiv:2309.08535 [pdf, other]

Visual Speech Recognition for Languages with Limited Labeled Data using Automatic Labels from Whisper

Authors: Jeong Hun Yeo, Minsu Kim, Shinji Watanabe, Yong Man Ro

Abstract: This paper proposes a powerful Visual Speech Recognition (VSR) method for multiple languages, especially for low-resource languages that have a limited number of labeled data. Different from previous methods that tried to improve the VSR performance for the target language by using knowledge learned from other languages, we explore whether we can increase the amount of training data itself for the… ▽ More This paper proposes a powerful Visual Speech Recognition (VSR) method for multiple languages, especially for low-resource languages that have a limited number of labeled data. Different from previous methods that tried to improve the VSR performance for the target language by using knowledge learned from other languages, we explore whether we can increase the amount of training data itself for the different languages without human intervention. To this end, we employ a Whisper model which can conduct both language identification and audio-based speech recognition. It serves to filter data of the desired languages and transcribe labels from the unannotated, multilingual audio-visual data pool. By comparing the performances of VSR models trained on automatic labels and the human-annotated labels, we show that we can achieve similar VSR performance to that of human-annotated labels even without utilizing human annotations. Through the automated labeling process, we label large-scale unlabeled multilingual databases, VoxCeleb2 and AVSpeech, producing 1,002 hours of data for four low VSR resource languages, French, Italian, Spanish, and Portuguese. With the automatic labels, we achieve new state-of-the-art performance on mTEDx in four languages, significantly surpassing the previous methods. The automatic labels are available online: https://github.com/JeongHun0716/Visual-Speech-Recognition-for-Low-Resource-Languages △ Less

Submitted 12 January, 2024; v1 submitted 15 September, 2023; originally announced September 2023.

Comments: Accepted at ICASSP 2024

arXiv:2309.08531 [pdf, other]

Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-training and Multi-modal Tokens

Authors: Minsu Kim, Jeongsoo Choi, Soumi Maiti, Jeong Hun Yeo, Shinji Watanabe, Yong Man Ro

Abstract: In this paper, we propose methods to build a powerful and efficient Image-to-Speech captioning (Im2Sp) model. To this end, we start with importing the rich knowledge related to image comprehension and language modeling from a large-scale pre-trained vision-language model into Im2Sp. We set the output of the proposed Im2Sp as discretized speech units, i.e., the quantized speech features of a self-s… ▽ More In this paper, we propose methods to build a powerful and efficient Image-to-Speech captioning (Im2Sp) model. To this end, we start with importing the rich knowledge related to image comprehension and language modeling from a large-scale pre-trained vision-language model into Im2Sp. We set the output of the proposed Im2Sp as discretized speech units, i.e., the quantized speech features of a self-supervised speech model. The speech units mainly contain linguistic information while suppressing other characteristics of speech. This allows us to incorporate the language modeling capability of the pre-trained vision-language model into the spoken language modeling of Im2Sp. With the vision-language pre-training strategy, we set new state-of-the-art Im2Sp performances on two widely used benchmark databases, COCO and Flickr8k. Then, we further improve the efficiency of the Im2Sp model. Similar to the speech unit case, we convert the original image into image units, which are derived through vector quantization of the raw image. With these image units, we can drastically reduce the required data storage for saving image data to just 0.8% when compared to the original image data in terms of bits. Demo page: https://ms-dot-k.github.io/Image-to-Speech-Captioning. △ Less

Submitted 15 September, 2023; originally announced September 2023.

arXiv:2308.09311 [pdf, other]

Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge

Authors: Minsu Kim, Jeong Hun Yeo, Jeongsoo Choi, Yong Man Ro

Abstract: This paper proposes a novel lip reading framework, especially for low-resource languages, which has not been well addressed in the previous literature. Since low-resource languages do not have enough video-text paired data to train the model to have sufficient power to model lip movements and language, it is regarded as challenging to develop lip reading models for low-resource languages. In order… ▽ More This paper proposes a novel lip reading framework, especially for low-resource languages, which has not been well addressed in the previous literature. Since low-resource languages do not have enough video-text paired data to train the model to have sufficient power to model lip movements and language, it is regarded as challenging to develop lip reading models for low-resource languages. In order to mitigate the challenge, we try to learn general speech knowledge, the ability to model lip movements, from a high-resource language through the prediction of speech units. It is known that different languages partially share common phonemes, thus general speech knowledge learned from one language can be extended to other languages. Then, we try to learn language-specific knowledge, the ability to model language, by proposing Language-specific Memory-augmented Decoder (LMDecoder). LMDecoder saves language-specific audio features into memory banks and can be trained on audio-text paired data which is more easily accessible than video-text paired data. Therefore, with LMDecoder, we can transform the input speech units into language-specific audio features and translate them into texts by utilizing the learned rich language knowledge. Finally, by combining general speech knowledge and language-specific knowledge, we can efficiently develop lip reading models even for low-resource languages. Through extensive experiments using five languages, English, Spanish, French, Italian, and Portuguese, the effectiveness of the proposed method is evaluated. △ Less

Submitted 12 January, 2024; v1 submitted 18 August, 2023; originally announced August 2023.

Comments: Accepted at ICCV 2023

arXiv:2308.07593 [pdf, other]

AKVSR: Audio Knowledge Empowered Visual Speech Recognition by Compressing Audio Knowledge of a Pretrained Model

Authors: Jeong Hun Yeo, Minsu Kim, Jeongsoo Choi, Dae Hoe Kim, Yong Man Ro

Abstract: Visual Speech Recognition (VSR) is the task of predicting spoken words from silent lip movements. VSR is regarded as a challenging task because of the insufficient information on lip movements. In this paper, we propose an Audio Knowledge empowered Visual Speech Recognition framework (AKVSR) to complement the insufficient speech information of visual modality by using audio modality. Different fro… ▽ More Visual Speech Recognition (VSR) is the task of predicting spoken words from silent lip movements. VSR is regarded as a challenging task because of the insufficient information on lip movements. In this paper, we propose an Audio Knowledge empowered Visual Speech Recognition framework (AKVSR) to complement the insufficient speech information of visual modality by using audio modality. Different from the previous methods, the proposed AKVSR 1) utilizes rich audio knowledge encoded by a large-scale pretrained audio model, 2) saves the linguistic information of audio knowledge in compact audio memory by discarding the non-linguistic information from the audio through quantization, and 3) includes Audio Bridging Module which can find the best-matched audio features from the compact audio memory, which makes our training possible without audio inputs, once after the compact audio memory is composed. We validate the effectiveness of the proposed method through extensive experiments, and achieve new state-of-the-art performances on the widely-used LRS3 dataset. △ Less

Submitted 11 January, 2024; v1 submitted 15 August, 2023; originally announced August 2023.

Comments: Accepted by IEEE Transactions on Multimedia

arXiv:2305.04542 [pdf, other]

Multi-Temporal Lip-Audio Memory for Visual Speech Recognition

Authors: Jeong Hun Yeo, Minsu Kim, Yong Man Ro

Abstract: Visual Speech Recognition (VSR) is a task to predict a sentence or word from lip movements. Some works have been recently presented which use audio signals to supplement visual information. However, existing methods utilize only limited information such as phoneme-level features and soft labels of Automatic Speech Recognition (ASR) networks. In this paper, we present a Multi-Temporal Lip-Audio Mem… ▽ More Visual Speech Recognition (VSR) is a task to predict a sentence or word from lip movements. Some works have been recently presented which use audio signals to supplement visual information. However, existing methods utilize only limited information such as phoneme-level features and soft labels of Automatic Speech Recognition (ASR) networks. In this paper, we present a Multi-Temporal Lip-Audio Memory (MTLAM) that makes the best use of audio signals to complement insufficient information of lip movements. The proposed method is mainly composed of two parts: 1) MTLAM saves multi-temporal audio features produced from short- and long-term audio signals, and the MTLAM memorizes a visual-to-audio mapping to load stored multi-temporal audio features from visual features at the inference phase. 2) We design an audio temporal model to produce multi-temporal audio features capturing the context of neighboring words. In addition, to construct effective visual-to-audio mapping, the audio temporal models can generate audio features time-aligned with visual features. Through extensive experiments, we validate the effectiveness of the MTLAM achieving state-of-the-art performances on two public VSR datasets. △ Less

Submitted 8 May, 2023; originally announced May 2023.

Comments: Presented at ICASSP 2023

arXiv:2303.15826 [pdf, other]

MS-MT: Multi-Scale Mean Teacher with Contrastive Unpaired Translation for Cross-Modality Vestibular Schwannoma and Cochlea Segmentation

Authors: Ziyuan Zhao, Kaixin Xu, Huai Zhe Yeo, Xulei Yang, Cuntai Guan

Abstract: Domain shift has been a long-standing issue for medical image segmentation. Recently, unsupervised domain adaptation (UDA) methods have achieved promising cross-modality segmentation performance by distilling knowledge from a label-rich source domain to a target domain without labels. In this work, we propose a multi-scale self-ensembling based UDA framework for automatic segmentation of two key b… ▽ More Domain shift has been a long-standing issue for medical image segmentation. Recently, unsupervised domain adaptation (UDA) methods have achieved promising cross-modality segmentation performance by distilling knowledge from a label-rich source domain to a target domain without labels. In this work, we propose a multi-scale self-ensembling based UDA framework for automatic segmentation of two key brain structures i.e., Vestibular Schwannoma (VS) and Cochlea on high-resolution T2 images. First, a segmentation-enhanced contrastive unpaired image translation module is designed for image-level domain adaptation from source T1 to target T2. Next, multi-scale deep supervision and consistency regularization are introduced to a mean teacher network for self-ensemble learning to further close the domain gap. Furthermore, self-training and intensity augmentation techniques are utilized to mitigate label scarcity and boost cross-modality segmentation performance. Our method demonstrates promising segmentation performance with a mean Dice score of 83.8% and 81.4% and an average asymmetric surface distance (ASSD) of 0.55 mm and 0.26 mm for the VS and Cochlea, respectively in the validation phase of the crossMoDA 2022 challenge. △ Less

Submitted 28 March, 2023; originally announced March 2023.

Comments: Accepted by BrainLes MICCAI proceedings (5th solution for MICCAI 2022 Cross-Modality Domain Adaptation (crossMoDA) Challenge)

arXiv:2210.07729 [pdf, other]

Model-Based Imitation Learning for Urban Driving

Authors: Anthony Hu, Gianluca Corrado, Nicolas Griffiths, Zak Murez, Corina Gurau, Hudson Yeo, Alex Kendall, Roberto Cipolla, Jamie Shotton

Abstract: An accurate model of the environment and the dynamic agents acting in it offers great potential for improving motion planning. We present MILE: a Model-based Imitation LEarning approach to jointly learn a model of the world and a policy for autonomous driving. Our method leverages 3D geometry as an inductive bias and learns a highly compact latent space directly from high-resolution videos of expe… ▽ More An accurate model of the environment and the dynamic agents acting in it offers great potential for improving motion planning. We present MILE: a Model-based Imitation LEarning approach to jointly learn a model of the world and a policy for autonomous driving. Our method leverages 3D geometry as an inductive bias and learns a highly compact latent space directly from high-resolution videos of expert demonstrations. Our model is trained on an offline corpus of urban driving data, without any online interaction with the environment. MILE improves upon prior state-of-the-art by 31% in driving score on the CARLA simulator when deployed in a completely new town and new weather conditions. Our model can predict diverse and plausible states and actions, that can be interpretably decoded to bird's-eye view semantic segmentation. Further, we demonstrate that it can execute complex driving manoeuvres from plans entirely predicted in imagination. Our approach is the first camera-only method that models static scene, dynamic scene, and ego-behaviour in an urban driving environment. The code and model weights are available at https://github.com/wayveai/mile. △ Less

Submitted 3 November, 2022; v1 submitted 14 October, 2022; originally announced October 2022.

Comments: NeurIPS 2022

arXiv:2204.02181 [pdf, other]

Vision Transformer Equipped with Neural Resizer on Facial Expression Recognition Task

Authors: Hyeonbin Hwang, Soyeon Kim, Wei-Jin Park, Jiho Seo, Kyungtae Ko, Hyeon Yeo

Abstract: When it comes to wild conditions, Facial Expression Recognition is often challenged with low-quality data and imbalanced, ambiguous labels. This field has much benefited from CNN based approaches; however, CNN models have structural limitation to see the facial regions in distant. As a remedy, Transformer has been introduced to vision fields with global receptive field, but requires adjusting inpu… ▽ More When it comes to wild conditions, Facial Expression Recognition is often challenged with low-quality data and imbalanced, ambiguous labels. This field has much benefited from CNN based approaches; however, CNN models have structural limitation to see the facial regions in distant. As a remedy, Transformer has been introduced to vision fields with global receptive field, but requires adjusting input spatial size to the pretrained models to enjoy their strong inductive bias at hands. We herein raise a question whether using the deterministic interpolation method is enough to feed low-resolution data to Transformer. In this work, we propose a novel training framework, Neural Resizer, to support Transformer by compensating information and downscaling in a data-driven manner trained with loss function balancing the noisiness and imbalance. Experiments show our Neural Resizer with F-PDLS loss function improves the performance with Transformer variants in general and nearly achieves the state-of-the-art performance. △ Less

Submitted 5 April, 2022; originally announced April 2022.

Comments: Accepted to IEEE ICASSP 2022

arXiv:2204.01725 [pdf, other]

Distinguishing Homophenes Using Multi-Head Visual-Audio Memory for Lip Reading

Authors: Minsu Kim, Jeong Hun Yeo, Yong Man Ro

Abstract: Recognizing speech from silent lip movement, which is called lip reading, is a challenging task due to 1) the inherent information insufficiency of lip movement to fully represent the speech, and 2) the existence of homophenes that have similar lip movement with different pronunciations. In this paper, we try to alleviate the aforementioned two challenges in lip reading by proposing a Multi-head V… ▽ More Recognizing speech from silent lip movement, which is called lip reading, is a challenging task due to 1) the inherent information insufficiency of lip movement to fully represent the speech, and 2) the existence of homophenes that have similar lip movement with different pronunciations. In this paper, we try to alleviate the aforementioned two challenges in lip reading by proposing a Multi-head Visual-audio Memory (MVM). Firstly, MVM is trained with audio-visual datasets and remembers audio representations by modelling the inter-relationships of paired audio-visual representations. At the inference stage, visual input alone can extract the saved audio representation from the memory by examining the learned inter-relationships. Therefore, the lip reading model can complement the insufficient visual information with the extracted audio representations. Secondly, MVM is composed of multi-head key memories for saving visual features and one value memory for saving audio knowledge, which is designed to distinguish the homophenes. With the multi-head key memories, MVM extracts possible candidate audio features from the memory, which allows the lip reading model to consider the possibility of which pronunciations can be represented from the input lip movement. This also can be viewed as an explicit implementation of the one-to-many mapping of viseme-to-phoneme. Moreover, MVM is employed in multi-temporal levels to consider the context when retrieving the memory and distinguish the homophenes. Extensive experimental results verify the effectiveness of the proposed method in lip reading and in distinguishing the homophenes. △ Less

Submitted 4 April, 2022; originally announced April 2022.

Comments: Published at AAAI 2022

arXiv:2201.05877 [pdf, other]

A Framework for Pedestrian Sub-classification and Arrival Time Prediction at Signalized Intersection Using Preprocessed Lidar Data

Authors: Tengfeng Lin, Zhixiong Jin, Seongjin Choi, Hwasoo Yeo

Abstract: The mortality rate for pedestrians using wheelchairs was 36% higher than the overall population pedestrian mortality rate. However, there is no data to clarify the pedestrians' categories in both fatal and nonfatal accidents, since police reports often do not keep a record of whether a victim was using a wheelchair or has a disability. Currently, real-time detection of vulnerable road users using… ▽ More The mortality rate for pedestrians using wheelchairs was 36% higher than the overall population pedestrian mortality rate. However, there is no data to clarify the pedestrians' categories in both fatal and nonfatal accidents, since police reports often do not keep a record of whether a victim was using a wheelchair or has a disability. Currently, real-time detection of vulnerable road users using advanced traffic sensors installed at the infrastructure side has a great potential to significantly improve traffic safety at the intersection. In this research, we develop a systematic framework with a combination of machine learning and deep learning models to distinguish disabled people from normal walk pedestrians and predict the time needed to reach the next side of the intersection. The proposed framework shows high performance both at vulnerable user classification and arrival time prediction accuracy. △ Less

Submitted 15 January, 2022; originally announced January 2022.

Comments: 15 pages, 11 figures, 4 tables

arXiv:2108.00439 [pdf, other]

doi 10.1016/j.trc.2022.103668

Transformer-based Map Matching Model with Limited Ground-Truth Data using Transfer-Learning Approach

Authors: Zhixiong Jin, Jiwon Kim, Hwasoo Yeo, Seongjin Choi

Abstract: In many spatial trajectory-based applications, it is necessary to map raw trajectory data points onto road networks in digital maps, which is commonly referred to as a map-matching process. While most previous map-matching methods have focused on using rule-based algorithms to deal with the map-matching problems, in this paper, we consider the map-matching task from the data-driven perspective, pr… ▽ More In many spatial trajectory-based applications, it is necessary to map raw trajectory data points onto road networks in digital maps, which is commonly referred to as a map-matching process. While most previous map-matching methods have focused on using rule-based algorithms to deal with the map-matching problems, in this paper, we consider the map-matching task from the data-driven perspective, proposing a deep learning-based map-matching model. We build a Transformer-based map-matching model with a transfer learning approach. We generate trajectory data to pre-train the Transformer model and then fine-tune the model with a limited number of ground-truth data to minimize the model development cost and reduce the real-to-virtual gap. Three metrics (Average Hamming Distance, F-score, and BLEU) at two levels (point and segment level) are used to evaluate the model performance. The results indicate that the proposed model outperforms existing models. Furthermore, we use the attention weights of the Transformer to plot the map-matching process and find how the model matches the road segments correctly. △ Less

Submitted 7 October, 2021; v1 submitted 1 August, 2021; originally announced August 2021.

Comments: 25 pages, 9 figures, 4 tables

arXiv:2107.12507 [pdf]

Analyzing vehicle pedestrian interactions combining data cube structure and predictive collision risk estimation model

Authors: Byeongjoon Noh, Hansaem Park, Hwasoo Yeo

Abstract: Traffic accidents are a threat to human lives, particularly pedestrians causing premature deaths. Therefore, it is necessary to devise systems to prevent accidents in advance and respond proactively, using potential risky situations as one of the surrogate safety measurements. This study introduces a new concept of a pedestrian safety system that combines the field and the centralized processes. T… ▽ More Traffic accidents are a threat to human lives, particularly pedestrians causing premature deaths. Therefore, it is necessary to devise systems to prevent accidents in advance and respond proactively, using potential risky situations as one of the surrogate safety measurements. This study introduces a new concept of a pedestrian safety system that combines the field and the centralized processes. The system can warn of upcoming risks immediately in the field and improve the safety of risk frequent areas by assessing the safety levels of roads without actual collisions. In particular, this study focuses on the latter by introducing a new analytical framework for a crosswalk safety assessment with behaviors of vehicle/pedestrian and environmental features. We obtain these behavioral features from actual traffic video footage in the city with complete automatic processing. The proposed framework mainly analyzes these behaviors in multidimensional perspectives by constructing a data cube structure, which combines the LSTM based predictive collision risk estimation model and the on line analytical processing operations. From the PCR estimation model, we categorize the severity of risks as four levels and apply the proposed framework to assess the crosswalk safety with behavioral features. Our analytic experiments are based on two scenarios, and the various descriptive results are harvested the movement patterns of vehicles and pedestrians by road environment and the relationships between risk levels and car speeds. Thus, the proposed framework can support decision makers by providing valuable information to improve pedestrian safety for future accidents, and it can help us better understand their behaviors near crosswalks proactively. In order to confirm the feasibility and applicability of the proposed framework, we implement and apply it to actual operating CCTVs in Osan City, Korea. △ Less

Submitted 26 July, 2021; originally announced July 2021.

Comments: 33 pages, 19 figures

arXiv:2107.03554 [pdf]

Automated Object Behavioral Feature Extraction for Potential Risk Analysis based on Video Sensor

Authors: Byeongjoon Noh, Dongho Ka, Wonjun Noh, Hwasoo Yeo

Abstract: Pedestrians are exposed to risk of death or serious injuries on roads, especially unsignalized crosswalks, for a variety of reasons. To date, an extensive variety of studies have reported on vision based traffic safety system. However, many studies required manual inspection of the volumes of traffic video to reliably obtain traffic related objects behavioral factors. In this paper, we propose an… ▽ More Pedestrians are exposed to risk of death or serious injuries on roads, especially unsignalized crosswalks, for a variety of reasons. To date, an extensive variety of studies have reported on vision based traffic safety system. However, many studies required manual inspection of the volumes of traffic video to reliably obtain traffic related objects behavioral factors. In this paper, we propose an automated and simpler system for effectively extracting object behavioral features from video sensors deployed on the road. We conduct basic statistical analysis on these features, and show how they can be useful for monitoring the traffic behavior on the road. We confirm the feasibility of the proposed system by applying our prototype to two unsignalized crosswalks in Osan city, South Korea. To conclude, we compare behaviors of vehicles and pedestrians in those two areas by simple statistical analysis. This study demonstrates the potential for a network of connected video sensors to provide actionable data for smart cities to improve pedestrian safety in dangerous road environments. △ Less

Submitted 25 October, 2021; v1 submitted 7 July, 2021; originally announced July 2021.

Comments: 6 pages, 9 figures

arXiv:2105.02582 [pdf]

Vision based Pedestrian Potential Risk Analysis based on Automated Behavior Feature Extraction for Smart and Safe City

Authors: Byeongjoon Noh, Dongho Ka, David Lee, Hwasoo Yeo

Abstract: Despite recent advances in vehicle safety technologies, road traffic accidents still pose a severe threat to human lives and have become a leading cause of premature deaths. In particular, crosswalks present a major threat to pedestrians, but we lack dense behavioral data to investigate the risks they face. Therefore, we propose a comprehensive analytical model for pedestrian potential risk using… ▽ More Despite recent advances in vehicle safety technologies, road traffic accidents still pose a severe threat to human lives and have become a leading cause of premature deaths. In particular, crosswalks present a major threat to pedestrians, but we lack dense behavioral data to investigate the risks they face. Therefore, we propose a comprehensive analytical model for pedestrian potential risk using video footage gathered by road security cameras deployed at such crossings. The proposed system automatically detects vehicles and pedestrians, calculates trajectories by frames, and extracts behavioral features affecting the likelihood of potentially dangerous scenes between these objects. Finally, we design a data cube model by using the large amount of the extracted features accumulated in a data warehouse to perform multidimensional analysis for potential risk scenes with levels of abstraction, but this is beyond the scope of this paper, and will be detailed in a future study. In our experiment, we focused on extracting the various behavioral features from multiple crosswalks, and visualizing and interpreting their behaviors and relationships among them by camera location to show how they may or may not contribute to potential risk. We validated feasibility and applicability by applying it in multiple crosswalks in Osan city, Korea. △ Less

Submitted 27 May, 2021; v1 submitted 6 May, 2021; originally announced May 2021.

Comments: 26 pages, 15 figures, 5 tables

arXiv:2105.02572 [pdf]

A novel method of predictive collision risk area estimation for proactive pedestrian accident prevention system in urban surveillance infrastructure

Authors: Byeongjoon Noh, Hwasoo Yeo

Abstract: Road traffic accidents, especially vehicle pedestrian collisions in crosswalk, globally pose a severe threat to human lives and have become a leading cause of premature deaths. In order to protect such vulnerable road users from collisions, it is necessary to recognize possible conflict in advance and warn to road users, not post facto. A breakthrough for proactively preventing pedestrian collisio… ▽ More Road traffic accidents, especially vehicle pedestrian collisions in crosswalk, globally pose a severe threat to human lives and have become a leading cause of premature deaths. In order to protect such vulnerable road users from collisions, it is necessary to recognize possible conflict in advance and warn to road users, not post facto. A breakthrough for proactively preventing pedestrian collisions is to recognize pedestrian's potential risks based on vision sensors such as CCTVs. In this study, we propose a predictive collision risk area estimation system at unsignalized crosswalks. The proposed system applied trajectories of vehicles and pedestrians from video footage after preprocessing, and then predicted their trajectories by using deep LSTM networks. With use of predicted trajectories, this system can infer collision risk areas statistically, further severity of levels is divided as danger, warning, and relative safe. In order to validate the feasibility and applicability of the proposed system, we applied it and assess the severity of potential risks in two unsignalized spots in Osan city, Korea. △ Less

Submitted 6 May, 2021; originally announced May 2021.

Comments: 26 pages, 17 figures, 5 tables

arXiv:2104.14285 [pdf, other]

Hybrid tracker based optimal path tracking system for complex road environments for autonomous driving

Authors: Eunbin Seo, Seunggi Lee, Gwanjun Shin, Hoyeong Yeo, Yongseob Lim, Gyeungho Choi

Abstract: Path tracking system plays a key technology in autonomous driving. The system should be driven accurately along the lane and be careful not to cause any inconvenience to passengers. To address such tasks, this paper proposes hybrid tracker based optimal path tracking system. By applying a deep learning based lane detection algorithm and a designated fast lane fitting algorithm, this paper develope… ▽ More Path tracking system plays a key technology in autonomous driving. The system should be driven accurately along the lane and be careful not to cause any inconvenience to passengers. To address such tasks, this paper proposes hybrid tracker based optimal path tracking system. By applying a deep learning based lane detection algorithm and a designated fast lane fitting algorithm, this paper developed a lane processing algorithm that shows a match rate with actual lanes with minimal computational cost. In addition, three modified path tracking algorithms were designed using the GPS based path or the vision based path. In the driving system, a match rate for the correct ideal path does not necessarily represent driving stability. This paper proposes hybrid tracker based optimal path tracking system by applying the concept of an observer that selects the optimal tracker appropriately in complex road environments. The driving stability has been studied in complex road environments such as straight road with multiple 3-way junctions, roundabouts, intersections, and tunnels. Consequently, the proposed system experimentally showed the high performance with consistent driving comfort by maintaining the vehicle within the lanes accurately even in the presence of high complexity of road conditions. Code will be available in https://github.com/DGIST-ARTIV. △ Less

Submitted 29 April, 2021; originally announced April 2021.

Comments: Submitted to IEEE Access This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2009.10868 [pdf, other]

A Real-Time Predictive Pedestrian Collision Warning Service for Cooperative Intelligent Transportation Systems Using 3D Pose Estimation

Authors: Ue-Hwan Kim, Dongho Ka, Hwasoo Yeo, Jong-Hwan Kim

Abstract: Minimizing traffic accidents between vehicles and pedestrians is one of the primary research goals in intelligent transportation systems. To achieve the goal, pedestrian orientation recognition and prediction of pedestrian's crossing or not-crossing intention play a central role. Contemporary approaches do not guarantee satisfactory performance due to limited field-of-view, lack of generalization,… ▽ More Minimizing traffic accidents between vehicles and pedestrians is one of the primary research goals in intelligent transportation systems. To achieve the goal, pedestrian orientation recognition and prediction of pedestrian's crossing or not-crossing intention play a central role. Contemporary approaches do not guarantee satisfactory performance due to limited field-of-view, lack of generalization, and high computational complexity. To overcome these limitations, we propose a real-time predictive pedestrian collision warning service (P2CWS) for two tasks: pedestrian orientation recognition (100.53 FPS) and intention prediction (35.76 FPS). Our framework obtains satisfying generalization over multiple sites because of the proposed site-independent features. At the center of the feature extraction lies 3D pose estimation. The 3D pose analysis enables robust and accurate recognition of pedestrian orientations and prediction of intentions over multiple sites. The proposed vision framework realizes 89.3% accuracy in the behavior recognition task on the TUD dataset without any training process and 91.28% accuracy in intention prediction on our dataset achieving new state-of-the-art performance. To contribute to the corresponding research community, we make our source codes public which are available at https://github.com/Uehwan/VisionForPedestrian △ Less

Submitted 21 February, 2022; v1 submitted 22 September, 2020; originally announced September 2020.

Comments: 12 pages, 8 figures, 4 tables

arXiv:2007.14189 [pdf, other]

doi 10.1016/j.trc.2021.103091

TrajGAIL: Generating Urban Vehicle Trajectories using Generative Adversarial Imitation Learning

Authors: Seongjin Choi, Jiwon Kim, Hwasoo Yeo

Abstract: Recently, an abundant amount of urban vehicle trajectory data has been collected in road networks. Many studies have used machine learning algorithms to analyze patterns in vehicle trajectories to predict location sequences of individual travelers. Unlike the previous studies that used a discriminative modeling approach, this research suggests a generative modeling approach to learn the underlying… ▽ More Recently, an abundant amount of urban vehicle trajectory data has been collected in road networks. Many studies have used machine learning algorithms to analyze patterns in vehicle trajectories to predict location sequences of individual travelers. Unlike the previous studies that used a discriminative modeling approach, this research suggests a generative modeling approach to learn the underlying distributions of urban vehicle trajectory data. A generative model for urban vehicle trajectories can better generalize from training data by learning the underlying distribution of the training data and, thus, produce synthetic vehicle trajectories similar to real vehicle trajectories with limited observations. Synthetic trajectories can provide solutions to data sparsity or data privacy issues in using location data. This research proposesTrajGAIL, a generative adversarial imitation learning framework for the urban vehicle trajectory generation. In TrajGAIL, learning location sequences in observed trajectories is formulated as an imitation learning problem in a partially observable Markov decision process. The model is trained by the generative adversarial framework, which uses the reward function from the adversarial discriminator. The model is tested with both simulation and real-world datasets, and the results show that the proposed model obtained significant performance gains compared to existing models in sequence modeling. △ Less

Submitted 15 January, 2021; v1 submitted 28 July, 2020; originally announced July 2020.

Comments: 25 pages, 10 figures, 2 table

Journal ref: Transportation Research Part C: Emerging Technologies Volume 128, July 2021, 103091

arXiv:2002.04406 [pdf, other]

doi 10.1109/ACCESS.2020.2999662

Traffic Data Imputation using Deep Convolutional Neural Networks

Authors: Ouafa Benkraouda, Bilal Thonnam Thodi, Hwasoo Yeo, Monica Menendez, Saif Eddin Jabari

Abstract: We propose a statistical learning-based traffic speed estimation method that uses sparse vehicle trajectory information. Using a convolutional encoder-decoder based architecture, we show that a well trained neural network can learn spatio-temporal traffic speed dynamics from time-space diagrams. We demonstrate this for a homogeneous road section using simulated vehicle trajectories and then valida… ▽ More We propose a statistical learning-based traffic speed estimation method that uses sparse vehicle trajectory information. Using a convolutional encoder-decoder based architecture, we show that a well trained neural network can learn spatio-temporal traffic speed dynamics from time-space diagrams. We demonstrate this for a homogeneous road section using simulated vehicle trajectories and then validate it using real-world data from NGSIM. Our results show that with probe vehicle penetration levels as low as 5\%, the proposed estimation method can provide a sound reconstruction of macroscopic traffic speeds and reproduce realistic shockwave patterns, implying applicability in a variety of traffic conditions. We further discuss the model's reconstruction mechanisms and confirm its ability to differentiate various traffic behaviors such as congested and free-flow traffic states, transition dynamics, and shockwave propagation. △ Less

Submitted 21 January, 2020; originally announced February 2020.

Journal ref: IEEE Access, 8, 2020, pp. 104740-104752

arXiv:1812.07151 [pdf, other]

Attention-based Recurrent Neural Network for Urban Vehicle Trajectory Prediction

Authors: Seongjin Choi, Jiwon Kim, Hwasoo Yeo

Abstract: With the increasing deployment of diverse positioning devices and location-based services, a huge amount of spatial and temporal information has been collected and accumulated as trajectory data. Among many applications, trajectory-based location prediction is gaining increasing attention because of its potential to improve the performance of many applications in multiple domains. This research fo… ▽ More With the increasing deployment of diverse positioning devices and location-based services, a huge amount of spatial and temporal information has been collected and accumulated as trajectory data. Among many applications, trajectory-based location prediction is gaining increasing attention because of its potential to improve the performance of many applications in multiple domains. This research focuses on trajectory sequence prediction methods using trajectory data obtained from the vehicles in urban traffic network. As Recurrent Neural Network(RNN) model is previously proposed, we propose an improved method of Attention-based Recurrent Neural Network model(ARNN) for urban vehicle trajectory prediction. We introduce attention mechanism into urban vehicle trajectory prediction to explain the impact of network-level traffic state information. The model is evaluated using the Bluetooth data of private vehicles collected in Brisbane, Australia with 5 metrics which are widely used in the sequence modeling. The proposed ARNN model shows significant performance improvement compared to the existing RNN models considering not only the cells to be visited but also the alignment of the cells in sequence. △ Less

Submitted 3 December, 2019; v1 submitted 17 December, 2018; originally announced December 2018.

arXiv:1803.01166 [pdf, other]

doi 10.1145/3173574.3173758

AdaM: Adapting Multi-User Interfaces for Collaborative Environments in Real-Time

Authors: Seonwook Park, Christoph Gebhardt, Roman Rädle, Anna Feit, Hana Vrzakova, Niraj Dayama, Hui-Shyong Yeo, Clemens Klokmose, Aaron Quigley, Antti Oulasvirta, Otmar Hilliges

Abstract: Developing cross-device multi-user interfaces (UIs) is a challenging problem. There are numerous ways in which content and interactivity can be distributed. However, good solutions must consider multiple users, their roles, their preferences and access rights, as well as device capabilities. Manual and rule-based solutions are tedious to create and do not scale to larger problems nor do they adapt… ▽ More Developing cross-device multi-user interfaces (UIs) is a challenging problem. There are numerous ways in which content and interactivity can be distributed. However, good solutions must consider multiple users, their roles, their preferences and access rights, as well as device capabilities. Manual and rule-based solutions are tedious to create and do not scale to larger problems nor do they adapt to dynamic changes, such as users leaving or joining an activity. In this paper, we cast the problem of UI distribution as an assignment problem and propose to solve it using combinatorial optimization. We present a mixed integer programming formulation which allows real-time applications in dynamically changing collaborative settings. It optimizes the allocation of UI elements based on device capabilities, user roles, preferences, and access rights. We present a proof-of-concept designer-in-the-loop tool, allowing for quick solution exploration. Finally, we compare our approach to traditional paper prototyping in a lab study. △ Less

Submitted 29 March, 2018; v1 submitted 3 March, 2018; originally announced March 2018.

Comments: formatting tweaks

ACM Class: H.5.m

arXiv:1708.07174 [pdf]

Understanding Modern Intrusion Detection Systems: A Survey

Authors: Liu Hua Yeo, Xiangdong Che, Shalini Lakkaraju

Abstract: Intrusion detection systems (IDS) help detect unauthorized activities or intrusions that may compromise the confidentiality, integrity or availability of a resource. This paper presents a general overview of IDSs, the way they are classified, and the different algorithms used to detect anomalous activities. It attempts to compare the various methods of intrusion techniques. It also describes the v… ▽ More Intrusion detection systems (IDS) help detect unauthorized activities or intrusions that may compromise the confidentiality, integrity or availability of a resource. This paper presents a general overview of IDSs, the way they are classified, and the different algorithms used to detect anomalous activities. It attempts to compare the various methods of intrusion techniques. It also describes the various approaches and the importance of IDSs in information security. △ Less

Submitted 30 November, 2017; v1 submitted 23 August, 2017; originally announced August 2017.

Comments: 9 pages, 5 figures

Showing 1–29 of 29 results for author: Yeo, H