Merging Multiple Datasets for Improved Appearance-Based Gaze Estimation

Abstract: Multiple datasets have been created for training and testing appearance-based gaze estimators. Intuitively, more data should lead to better performance. However, combining datasets to train a single estimator rarely improves gaze estimation performance. One reason may be differences in the experimental protocols used to obtain the gaze samples, resulting in differences in the distributions of head poses, gaze angles, illumination, etc. Another reason may be the inconsistency between methods used to define gaze angles (label mismatch). We propose two innovations to improve the performance of gaze estimation by leveraging multiple datasets: a change in the estimator architecture and the introduction of a gaze adaptation module. Most state-of-the-art estimators either merge information extracted from images of the two eyes and the entire face in parallel, or combine information from the eyes first and then with the face. Our proposed Two-stage Transformer-based Gaze-feature Fusion (TTGF) method uses transformers to merge information from each eye and the face separately and then merge across the two eyes. We argue that this improves head pose invariance since changes in head pose affect left and right eye images in different ways. Our proposed Gaze Adaptation Module (GAM) method handles annotation inconsistency by applying a Gaze Adaptation Module for each dataset to correct gaze estimates from a single shared estimator. This enables us to combine information across datasets despite differences in labeling. Our experiments show that these innovations improve gaze estimation performance over the SOTA both individually and collectively (by 10% to 20%). Our code is available at https://github.com/HKUST-NISL/GazeSetMerge.

Submitted 1 September, 2024; originally announced September 2024.

Comments: 14 pages

arXiv:2407.20556 [pdf, other]

Survey of Design Paradigms for Social Robots

Authors: Rita Frieske, Xiaoyu Mo, Yini Fang, Jay Nieles, Bertram E. Shi

Abstract: The demand for social robots in fields like healthcare, education, and entertainment increases due to their emotional adaptation features. These robots leverage multimodal communication, incorporating speech, facial expressions, and gestures to enhance user engagement and emotional support. The understanding of design paradigms of social robots is obstructed by the complexity of the system and the necessity to tune it to a specific task. This article provides a structured review of social robot design paradigms, categorizing them into cognitive architectures, role design models, linguistic models, communication flow, activity system models, and integrated design models. By breaking down the articles on social robot design and application based on these paradigms, we highlight the strengths and areas for improvement in current approaches. We further propose our original integrated design model that combines the most important aspects of the design of social robots. Our approach shows the importance of integrating operational, communicational, and emotional dimensions to create more adaptive and empathetic interactions between robots and humans.

Submitted 30 July, 2024; originally announced July 2024.

arXiv:2407.17772 [pdf, other]

ERIT Lightweight Multimodal Dataset for Elderly Emotion Recognition and Multimodal Fusion Evaluation

Authors: Rita Frieske, Bertram E. Shi

Abstract: ERIT is a novel multimodal dataset designed to facilitate research in lightweight multimodal fusion. It contains text and image data collected from videos of elderly individuals reacting to various situations, as well as seven emotion labels for each data sample. Because it uses labeled images of elderly users reacting emotionally, it also facilitates research on emotion recognition in an age group that is underrepresented in machine learning visual emotion recognition. The dataset is validated through comprehensive experiments indicating its importance in neural multimodal fusion research.

Submitted 25 July, 2024; originally announced July 2024.

arXiv:2401.01572 [pdf, other]

Hallucinations in Neural Automatic Speech Recognition: Identifying Errors and Hallucinatory Models

Authors: Rita Frieske, Bertram E. Shi

Abstract: Hallucinations are a type of output error produced by deep neural networks. While they have been studied in natural language processing, they have not been researched previously in automatic speech recognition. Here, we define hallucinations in ASR as transcriptions generated by a model that are semantically unrelated to the source utterance, yet still fluent and coherent. The similarity of hallucinations to probable natural language outputs of the model creates a danger of deception and impacts the credibility of the system. We show that commonly used metrics, such as word error rates, cannot differentiate between hallucinatory and non-hallucinatory models. To address this, we propose a perturbation-based method for assessing the susceptibility of an automatic speech recognition (ASR) model to hallucination at test time, which does not require access to the training dataset. We demonstrate that this method helps to distinguish between hallucinatory and non-hallucinatory models that have similar baseline word error rates. We further explore the relationship between the types of ASR errors and the types of dataset noise to determine what types of noise are most likely to create hallucinatory outputs. We devise a framework for identifying hallucinations by analysing their semantic connection with the ground truth and their fluency. Finally, we discover how to induce hallucinations with a random noise injection to the utterance.

Submitted 3 January, 2024; originally announced January 2024.

arXiv:2209.14645 [pdf, other]

Reducing Stress and Anxiety in the Metaverse: A Systematic Review of Meditation, Mindfulness and Virtual Reality

Authors: Xian Wang, Xiaoyu Mo, Mingming Fan, Lik-Hang Lee, Bertram E. Shi, Pan Hui

Abstract: Meditation, or mindfulness, is widely used to improve mental health. With the emergence of Virtual Reality technology, many studies have provided evidence that meditation with VR can bring health benefits. However, to our knowledge, there are no guidelines or comprehensive reviews in the literature on how to conduct such research in virtual reality. In order to understand the role of VR technology in meditation and future research opportunities, we conducted a systematic literature review in the IEEE and ACM databases. Our process yielded 19 eligible papers, on which we conducted a structured analysis. Through these papers, we characterize the state of the art in meditation type, design consideration, and VR and technology, and identify research opportunities and challenges for the future.

Submitted 29 September, 2022; originally announced September 2022.

arXiv:2201.03804 [pdf, other]

CI-AVSR: A Cantonese Audio-Visual Speech Dataset for In-car Command Recognition

Authors: Wenliang Dai, Samuel Cahyawijaya, Tiezheng Yu, Elham J. Barezi, Peng Xu, Cheuk Tung Shadow Yiu, Rita Frieske, Holy Lovenia, Genta Indra Winata, Qifeng Chen, Xiaojuan Ma, Bertram E. Shi, Pascale Fung

Abstract: With the rise of deep learning and intelligent vehicles, the smart assistant has become an essential in-car component to facilitate driving and provide extra functionalities. In-car smart assistants should be able to process general as well as car-related commands and perform corresponding actions, which eases driving and improves safety. However, there is a data scarcity issue for low resource languages, hindering the development of research and applications. In this paper, we introduce a new dataset, Cantonese In-car Audio-Visual Speech Recognition (CI-AVSR), for in-car command recognition in the Cantonese language with both video and audio data. It consists of 4,984 samples (8.3 hours) of 200 in-car commands recorded by 30 native Cantonese speakers. Furthermore, we augment our dataset using common in-car background noises to simulate real environments, producing a dataset 10 times larger than the collected one. We provide detailed statistics of both the clean and the augmented versions of our dataset. Moreover, we implement two multimodal baselines to demonstrate the validity of CI-AVSR. Experiment results show that leveraging the visual signal improves the overall performance of the model. Although our best model can achieve considerable quality on the clean test set, the speech recognition quality on the noisy data is still inferior and remains an extremely challenging task for real in-car speech recognition systems. The dataset and code will be released at https://github.com/HLTCHKUST/CI-AVSR.

Submitted 14 March, 2022; v1 submitted 11 January, 2022; originally announced January 2022.

Comments: 6 pages
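
The noise augmentation mentioned in the CI-AVSR abstract can be illustrated with a minimal sketch. The abstract does not say how noise was mixed in, so the signal-to-noise-ratio (SNR) formulation below is an assumption, and `mix_at_snr` is a hypothetical helper name rather than anything from the released code.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the result has the requested SNR in dB.

    Assumption: both inputs are equal-length sequences of float samples.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if noise_power == 0:
        raise ValueError("noise signal is silent")
    # SNR(dB) = 10 * log10(speech_power / scaled_noise_power)
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / noise_power)
    return [s + scale * n for s, n in zip(speech, noise)]


if __name__ == "__main__":
    clean = [1.0, -1.0, 1.0, -1.0]
    cabin_noise = [0.5, 0.5, -0.5, -0.5]
    print(mix_at_snr(clean, cabin_noise, snr_db=0.0))
```

Mixing each clean recording with several noise clips or SNR levels is one way an augmented set could grow to a multiple of the collected one.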

arXiv:2201.02419 [pdf, other]

Automatic Speech Recognition Datasets in Cantonese: A Survey and New Dataset

Authors: Tiezheng Yu, Rita Frieske, Peng Xu, Samuel Cahyawijaya, Cheuk Tung Shadow Yiu, Holy Lovenia, Wenliang Dai, Elham J. Barezi, Qifeng Chen, Xiaojuan Ma, Bertram E. Shi, Pascale Fung

Abstract: Automatic speech recognition (ASR) on low resource languages improves the access of linguistic minorities to technological advantages provided by artificial intelligence (AI). In this paper, we address the problem of data scarcity for the Hong Kong Cantonese language by creating a new Cantonese dataset. Our dataset, Multi-Domain Cantonese Corpus (MDCC), consists of 73.6 hours of clean read speech paired with transcripts, collected from Cantonese audiobooks from Hong Kong. It comprises philosophy, politics, education, culture, lifestyle and family domains, covering a wide range of topics. We also review all existing Cantonese datasets and analyze them according to their speech type, data source, total size and availability. We further conduct experiments with Fairseq S2T Transformer, a state-of-the-art ASR model, on the biggest existing dataset, Common Voice zh-HK, and our proposed MDCC, and the results show the effectiveness of our dataset. In addition, we create a powerful and robust Cantonese ASR model by applying multi-dataset learning on MDCC and Common Voice zh-HK.

Submitted 17 January, 2022; v1 submitted 7 January, 2022; originally announced January 2022.

arXiv:2112.06223 [pdf, other]

ASCEND: A Spontaneous Chinese-English Dataset for Code-switching in Multi-turn Conversation

Authors: Holy Lovenia, Samuel Cahyawijaya, Genta Indra Winata, Peng Xu, Xu Yan, Zihan Liu, Rita Frieske, Tiezheng Yu, Wenliang Dai, Elham J. Barezi, Qifeng Chen, Xiaojuan Ma, Bertram E. Shi, Pascale Fung

Abstract: Code-switching is a speech phenomenon occurring when a speaker switches language during a conversation. Despite the spontaneous nature of code-switching in conversational spoken language, most existing works collect code-switching data from read speech instead of spontaneous speech. ASCEND (A Spontaneous Chinese-English Dataset) is a high-quality Mandarin Chinese-English code-switching corpus built on spontaneous multi-turn conversational dialogue sources collected in Hong Kong. We report ASCEND's design and procedure for collecting the speech data, including annotations. ASCEND consists of 10.62 hours of clean speech, collected from 23 bilingual speakers of Chinese and English. Furthermore, we conduct baseline experiments using pre-trained wav2vec 2.0 models, achieving a best performance of 22.69% character error rate and 27.05% mixed error rate.

Submitted 3 May, 2022; v1 submitted 12 December, 2021; originally announced December 2021.

Journal ref: Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022)
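
For readers unfamiliar with the character error rate reported in the ASCEND abstract, the following is a minimal sketch of the usual definition: the Levenshtein edit distance between hypothesis and reference, divided by the reference length. It is not the authors' evaluation script, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance over characters, normalised by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    print(character_error_rate("abcd", "abxd"))
```

Because `edit_distance` accepts any sequence, the same routine also works on lists of tokens rather than raw characters.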

arXiv:2108.04228 [pdf, other]

Iterative Distillation for Better Uncertainty Estimates in Multitask Emotion Recognition

Authors: Didan Deng, Liang Wu, Bertram E. Shi

Abstract: When recognizing emotions, subtle nuances in displays of emotion generate ambiguity or uncertainty in emotion perception. Emotion uncertainty has been previously interpreted as inter-rater disagreement among multiple annotators. In this paper, we consider a more common and challenging scenario: modeling emotion uncertainty when only single emotion labels are available. From a Bayesian perspective, we propose to use deep ensembles to capture uncertainty for multiple emotion descriptors, i.e., action units, discrete expression labels and continuous descriptors. We further apply iterative self-distillation. Iterative distillation over multiple generations significantly improves performance in both emotion recognition and uncertainty estimation. Our method generates single student models that provide accurate estimates of uncertainty for in-domain samples and a student ensemble that can detect out-of-domain samples. Our experiments on emotion recognition and uncertainty estimation using the Aff-wild2 dataset demonstrate that our algorithm gives more reliable uncertainty estimates than both Temperature Scaling and Monte Carlo Dropout.

Submitted 17 October, 2021; v1 submitted 21 July, 2021; originally announced August 2021.

Comments: Accepted as a Workshop paper in ICCV2021 proceeding

arXiv:2103.05100 [pdf, other]

doi 10.1007/978-3-319-97628-0_7

Learning Hierarchical Integration of Foveal and Peripheral Vision for Vergence Control by Active Efficient Coding

Authors: Zhetuo Zhao, Jochen Triesch, Bertram E. Shi

Abstract: The active efficient coding (AEC) framework parsimoniously explains the joint development of visual processing and eye movements, e.g., the emergence of binocular disparity selective neurons and fusional vergence, the disjunctive eye movements that align left and right eye images. Vergence can be driven by information in both the fovea and periphery, which play complementary roles. The high resolution fovea can drive precise short range movements. The lower resolution periphery supports coarser long range movements. The fovea and periphery may also contain conflicting information, e.g. due to objects at different depths. While past AEC models did integrate peripheral and foveal information, they did not explicitly take into account these characteristics. We propose here a two-level hierarchical approach that does. The bottom level generates different vergence actions from foveal and peripheral regions. The top level selects one. We demonstrate that the hierarchical approach performs better than prior approaches in realistic environments, exhibiting better alignment and less oscillation.

Submitted 29 January, 2021; originally announced March 2021.

arXiv:2101.11391 [pdf, ps, other]

Self-Calibrating Active Binocular Vision via Active Efficient Coding with Deep Autoencoders

Authors: Charles Wilmot, Bertram E. Shi, Jochen Triesch

Abstract: We present a model of the self-calibration of active binocular vision comprising the simultaneous learning of visual representations, vergence, and pursuit eye movements. The model follows the principle of Active Efficient Coding (AEC), a recent extension of the classic Efficient Coding Hypothesis to active perception. In contrast to previous AEC models, the present model uses deep autoencoders to learn sensory representations. We also propose a new formulation of the intrinsic motivation signal that guides the learning of behavior. We demonstrate the performance of the model in simulations.

Submitted 27 January, 2021; originally announced January 2021.

arXiv:2101.05682 [pdf, other]

AVGCN: Trajectory Prediction using Graph Convolutional Networks Guided by Human Attention

Authors: Congcong Liu, Yuying Chen, Ming Liu, Bertram E. Shi

Abstract: Pedestrian trajectory prediction is a critical yet challenging task, especially for crowded scenes. We suggest that introducing an attention mechanism to infer the importance of different neighbors is critical for accurate trajectory prediction in scenes with varying crowd size. In this work, we propose a novel method, AVGCN, for trajectory prediction utilizing graph convolutional networks (GCN) based on human attention (A denotes attention, V denotes visual field constraints). First, we train an attention network that estimates the importance of neighboring pedestrians, using gaze data collected as subjects perform a bird's eye view crowd navigation task. Then, we incorporate the learned attention weights modulated by constraints on the pedestrian's visual field into a trajectory prediction network that uses a GCN to aggregate information from neighbors efficiently. AVGCN also considers the stochastic nature of pedestrian trajectories by taking advantage of variational trajectory prediction. Our approach achieves state-of-the-art performance on several trajectory prediction benchmarks, and the lowest average prediction error over all considered benchmarks.

Submitted 14 January, 2021; originally announced January 2021.

Comments: 7 pages, 4 figures

arXiv:2009.07140 [pdf, other]

HGCN-GJS: Hierarchical Graph Convolutional Network with Groupwise Joint Sampling for Trajectory Prediction

Authors: Yuying Chen, Congcong Liu, Xiaodong Mei, Bertram E. Shi, Ming Liu

Abstract: Accurate pedestrian trajectory prediction is of great importance for downstream tasks such as autonomous driving and mobile robot navigation. Fully investigating the social interactions within the crowd is crucial for accurate pedestrian trajectory prediction. However, most existing methods do not capture group level interactions well, focusing only on pairwise interactions and neglecting group-wise interactions. In this work, we propose a hierarchical graph convolutional network, HGCN-GJS, for trajectory prediction that makes good use of group level interactions within the crowd. Furthermore, we introduce a novel joint sampling scheme for modeling the joint distribution of multiple pedestrians in the future trajectories. Based on the group information, this scheme associates the trajectory of one person with the trajectory of other people in the group, but maintains the independence of the trajectories of outsiders. We demonstrate the performance of our network on several trajectory prediction datasets, achieving state-of-the-art results on all datasets considered.

Submitted 15 September, 2023; v1 submitted 15 September, 2020; originally announced September 2020.

Comments: 6 pages, 8 figures, accepted by IROS 2022

arXiv:2002.03557 [pdf, other]

Multitask Emotion Recognition with Incomplete Labels

Authors: Didan Deng, Zhaokang Chen, Bertram E. Shi

Abstract: We train a unified model to perform three tasks: facial action unit detection, expression classification, and valence-arousal estimation. We address two main challenges of learning the three tasks. First, most existing datasets are highly imbalanced. Second, most existing datasets do not contain labels for all three tasks. To tackle the first challenge, we apply data balancing techniques to experimental datasets. To tackle the second challenge, we propose an algorithm for the multitask model to learn from missing (incomplete) labels. This algorithm has two steps. We first train a teacher model to perform all three tasks, where each instance is trained by the ground truth label of its corresponding task. Secondly, we refer to the outputs of the teacher model as the soft labels. We use the soft labels and the ground truth to train the student model. We find that most of the student models outperform their teacher model on all the three tasks. Finally, we use model ensembling to boost performance further on the three tasks.

Submitted 10 March, 2020; v1 submitted 10 February, 2020; originally announced February 2020.

Comments: Accepted by FG2020

arXiv:2001.09284 [pdf, other]

doi 10.1109/TPAMI.2022.3148386

Towards High Performance Low Complexity Calibration in Appearance Based Gaze Estimation

Authors: Zhaokang Chen, Bertram E. Shi

Abstract: Appearance-based gaze estimation from RGB images provides relatively unconstrained gaze tracking. We have previously proposed a gaze decomposition method that decomposes the gaze angle into the sum of a subject-independent gaze estimate from the image and a subject-dependent bias. This paper extends that work with a more complete characterization of the interplay between the complexity of the calibration dataset and estimation accuracy. We analyze the effect of the number of gaze targets, the number of images used per gaze target and the number of head positions in calibration data using a new NISLGaze dataset, which is well suited for analyzing these effects as it includes more diversity in head positions and orientations for each subject than other datasets. A better understanding of these factors enables low complexity high performance calibration. Our results indicate that using only a single gaze target and single head position is sufficient to achieve high quality calibration, outperforming state-of-the-art methods by more than 6.3%. One of the surprising findings is that the same estimator yields the best performance both with and without calibration. To better understand the reasons, we provide a new theoretical analysis that specifies the conditions under which this can be expected.

Submitted 13 February, 2022; v1 submitted 25 January, 2020; originally announced January 2020.

Comments: Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)

arXiv:1909.10400 [pdf, other]

Robot Navigation in Crowds by Graph Convolutional Networks with Attention Learned from Human Gaze

Authors: Yuying Chen, Congcong Liu, Ming Liu, Bertram E. Shi

Abstract: Safe and efficient crowd navigation for mobile robots is a crucial yet challenging task. Previous work has shown the power of deep reinforcement learning frameworks to train efficient policies. However, their performance deteriorates when the crowd size grows. We suggest that this can be addressed by enabling the network to identify and pay attention to the humans in the crowd that are most critical to navigation. We propose a novel network utilizing a graph representation to learn the policy. We first train a graph convolutional network based on human gaze data that accurately predicts human attention to different agents in the crowd. Then we incorporate the learned attention into a graph-based reinforcement learning architecture. The proposed attention mechanism enables the assignment of meaningful weightings to the neighbors of the robot, and has the additional benefit of interpretability. Experiments on real-world dense pedestrian datasets with various crowd sizes demonstrate that our model outperforms state-of-the-art methods by 18.4% in task accomplishment and by 16.4% in time efficiency.

Submitted 23 September, 2019; originally announced September 2019.

Comments: 8 pages, 7 figures

arXiv:1905.04451 [pdf, other]

Offset Calibration for Appearance-Based Gaze Estimation via Gaze Decomposition

Authors: Zhaokang Chen, Bertram E. Shi

Abstract: Appearance-based gaze estimation provides relatively unconstrained gaze tracking. However, subject-independent models achieve limited accuracy partly due to individual variations. To improve estimation, we propose a novel gaze decomposition method and a single gaze point calibration method, motivated by our finding that the inter-subject squared bias exceeds the intra-subject variance for a subject-independent estimator. We decompose the gaze angle into a subject-dependent bias term and a subject-independent term between the gaze angle and the bias. The subject-independent term is estimated by a deep convolutional network. For calibration-free tracking, we set the subject-dependent bias term to zero. For single gaze point calibration, we estimate the bias from a few images taken as the subject gazes at a point. Experiments on three datasets indicate that as a calibration-free estimator, the proposed method outperforms the state-of-the-art methods by up to 10.0%. The proposed calibration method is robust and reduces estimation error significantly (up to 35.6%), achieving state-of-the-art performance for appearance-based eye trackers with calibration.

Submitted 9 January, 2020; v1 submitted 11 May, 2019; originally announced May 2019.

Comments: Accepted by WACV2020. This is not the camera-ready version
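
The single gaze point calibration described in the gaze decomposition abstract can be sketched as follows. The abstract says only that the bias is estimated from a few images of the subject looking at one point; taking the mean residual is an assumption made here for illustration, and the function names and the (yaw, pitch) representation in degrees are hypothetical.

```python
def estimate_bias(target_gaze, predictions):
    """Estimate a subject-dependent (yaw, pitch) bias from one calibration target.

    Assumption: the bias is the mean difference between the known target
    direction and the subject-independent predictions for a few images.
    """
    if not predictions:
        raise ValueError("need at least one calibration prediction")
    n = len(predictions)
    yaw = sum(target_gaze[0] - p[0] for p in predictions) / n
    pitch = sum(target_gaze[1] - p[1] for p in predictions) / n
    return (yaw, pitch)


def calibrated_gaze(prediction, bias=(0.0, 0.0)):
    """Gaze = subject-independent estimate + subject-dependent bias.

    Leaving the bias at zero corresponds to calibration-free tracking.
    """
    return (prediction[0] + bias[0], prediction[1] + bias[1])


if __name__ == "__main__":
    # Network outputs for two images of the subject looking at (12, -3) degrees.
    bias = estimate_bias((12.0, -3.0), [(9.0, -4.0), (11.0, -6.0)])
    print(bias, calibrated_gaze((5.0, 5.0), bias))
```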

arXiv:1904.08377 [pdf, other]

Gaze Training by Modulated Dropout Improves Imitation Learning

Authors: Yuying Chen, Congcong Liu, Lei Tai, Ming Liu, Bertram E. Shi

Abstract: Imitation learning by behavioral cloning is a prevalent method that has achieved some success in vision-based autonomous driving. The basic idea behind behavioral cloning is to have the neural network learn from observing a human expert's behavior. Typically, a convolutional neural network learns to predict the steering commands from raw driver-view images by mimicking the behaviors of human drivers. However, there are other cues, such as gaze behavior, available from human drivers that have yet to be exploited. Previous research has shown that novice human learners can benefit from observing experts' gaze patterns. We show here that deep neural networks can also profit from this. We propose a method, gaze-modulated dropout, for integrating this gaze information into a deep driving network implicitly rather than as an additional input. Our experimental results demonstrate that gaze-modulated dropout enhances the generalization capability of the network to unseen scenes. Prediction error in steering commands is reduced by 23.5% compared to uniform dropout. Running closed loop in the simulator, the gaze-modulated dropout net increased the average distance travelled between infractions by 58.5%. Consistent with these results, the gaze-modulated dropout net shows lower model uncertainty.

Submitted 16 August, 2019; v1 submitted 17 April, 2019; originally announced April 2019.

Comments: 6 pages, 4 figures

arXiv:1903.07296 [pdf, other]

Appearance-Based Gaze Estimation Using Dilated-Convolutions

Authors: Zhaokang Chen, Bertram E. Shi

Abstract: Appearance-based gaze estimation has attracted more and more attention because of its wide range of applications. The use of deep convolutional neural networks has improved the accuracy significantly. In order to improve the estimation accuracy further, we focus on extracting better features from eye images. Relatively large changes in gaze angles may result in relatively small changes in eye appearance. We argue that current architectures for gaze estimation may not be able to capture such small changes, as they apply multiple pooling layers or other downsampling layers so that the spatial resolution of the high-level layers is reduced significantly. To evaluate whether the use of features extracted at high resolution can benefit gaze estimation, we adopt dilated-convolutions to extract high-level features without reducing spatial resolution. In cross-subject experiments on the Columbia Gaze dataset for eye contact detection and the MPIIGaze dataset for 3D gaze vector regression, the resulting Dilated-Nets achieve significant (up to 20.8%) gains when compared to similar networks without dilated-convolutions. Our proposed Dilated-Net achieves state-of-the-art results on both the Columbia Gaze and the MPIIGaze datasets.

Submitted 18 March, 2019; originally announced March 2019.

Comments: 16 pages, 7 figures. To appear in ACCV2018
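
The idea in the abstract above, enlarging the receptive field without reducing spatial resolution, can be illustrated with a minimal one-dimensional sketch. This is not the authors' Dilated-Net; the function names `dilated_conv1d` and `receptive_field`, the zero padding, and the example kernel are all assumptions made for illustration.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Zero-padded 1-D dilated convolution (cross-correlation form).

    The output has the same length as the input, so no spatial
    resolution is lost, while the kernel taps are spread `dilation`
    samples apart.
    """
    k = len(kernel)
    span = dilation * (k - 1)        # distance between first and last tap
    pad_left = span // 2
    pad_right = span - pad_left
    padded = [0.0] * pad_left + [float(v) for v in signal] + [0.0] * pad_right
    return [
        sum(w * padded[i + j * dilation] for j, w in enumerate(kernel))
        for i in range(len(signal))
    ]


def receptive_field(kernel_size, dilations):
    """Receptive field (in samples) of a stack of stride-1 dilated layers."""
    rf = 1
    for d in dilations:
        rf += d * (kernel_size - 1)
    return rf


if __name__ == "__main__":
    x = list(range(8))
    y = dilated_conv1d(x, [1, 1, 1], dilation=2)
    print(len(x), len(y))                 # same length in and out
    print(receptive_field(3, [1, 2, 4]))  # three layers, growing dilation
```

In this sketch, stacking layers with dilations 1, 2 and 4 widens the receptive field at every layer while each layer still outputs one value per input sample, which is the property that pooling or strided layers give up.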

arXiv:1812.10071 [pdf, other]

Coupled Recurrent Network (CRN)

Authors: Lin Sun, Kui Jia, Yuejia Shen, Silvio Savarese, Dit Yan Yeung, Bertram E. Shi

Abstract: Many semantic video analysis tasks can benefit from multiple, heterogeneous signals. For example, in addition to the original RGB input sequences, sequences of optical flow are usually used to boost the performance of human action recognition in videos. To learn from these heterogeneous input sources, existing methods rely on two-stream architectural designs that contain independent, parallel streams of Recurrent Neural Networks (RNNs). However, two-stream RNNs do not fully exploit the reciprocal information contained in the multiple signals, let alone exploit it in a recurrent manner. To this end, we propose in this paper a novel recurrent architecture, termed Coupled Recurrent Network (CRN), to deal with multiple input sources. In CRN, the parallel streams of RNNs are coupled together. The key design element of CRN is a Recurrent Interpretation Block (RIB) that supports learning of reciprocal feature representations from multiple signals in a recurrent manner. Unlike RNNs, which stack the training loss at each time step or the last time step, we propose an effective and efficient training strategy for CRN. Experiments show the efficacy of the proposed CRN. In particular, we achieve the new state of the art on the benchmark datasets of human action recognition and multi-person pose estimation.

Submitted 25 March, 2019; v1 submitted 25 December, 2018; originally announced December 2018.

arXiv:1805.00625 [pdf, other]

Multimodal Utterance-level Affect Analysis using Visual, Audio and Text Features

Authors: Didan Deng, Yuqian Zhou, Jimin Pi, Bertram E. Shi

Abstract: The integration of information across multiple modalities and across time is a promising way to enhance the emotion recognition performance of affective systems. Much previous work has focused on instantaneous emotion recognition. The 2018 One-Minute Gradual-Emotion Recognition (OMG-Emotion) challenge, which was held in conjunction with the IEEE World Congress on Computational Intelligence, encouraged participants to address long-term emotion recognition by integrating cues from multiple modalities, including facial expression, audio and language. Intuitively, a multi-modal inference network should be able to leverage information from each modality and their correlations to improve recognition over that achievable by a single modality network. We describe here a multi-modal neural architecture that integrates visual information over time using an LSTM, and combines it with utterance level audio and text cues to recognize human sentiment from multimodal clips. Our model outperforms the unimodal baseline, achieving concordance correlation coefficients (CCC) of 0.400 on the arousal task and 0.353 on the valence task.

Submitted 4 May, 2018; v1 submitted 2 May, 2018; originally announced May 2018.

Comments: 5 pages, 1 figure, subject to the 2018 IJCNN challenge on One-Minute Gradual-Emotion Recognition
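
For readers unfamiliar with the metric quoted in the abstract above, the concordance correlation coefficient is commonly defined as follows. The symbols are assumed notation, not taken from the abstract: $x$ denotes the predictions, $y$ the ground-truth annotations, $\mu$ and $\sigma^2$ their means and variances, and $\rho$ their Pearson correlation.

```latex
\rho_c = \frac{2\,\rho\,\sigma_x\,\sigma_y}{\sigma_x^{2} + \sigma_y^{2} + (\mu_x - \mu_y)^{2}}
```

Unlike the Pearson correlation alone, $\rho_c$ is reduced by any offset between the means or mismatch between the variances, so it reaches 1 only when predictions and annotations agree exactly.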

arXiv:1708.09126 [pdf]

Photorealistic Facial Expression Synthesis by the Conditional Difference Adversarial Autoencoder

Authors: Yuqian Zhou, Bertram Emil Shi

Abstract: Photorealistic facial expression synthesis from a single face image can be widely applied to face recognition, data augmentation for emotion recognition or entertainment. This problem is challenging, in part due to a paucity of labeled facial expression data, making it difficult for algorithms to disambiguate changes due to identity and changes due to expression. In this paper, we propose the conditional difference adversarial autoencoder, CDAAE, for facial expression synthesis. The CDAAE takes a facial image of a previously unseen person and generates an image of that human face with a target emotion or facial action unit label. The CDAAE adds a feedforward path to an autoencoder structure connecting low level features at the encoder to features at the corresponding level at the decoder. It handles the problem of disambiguating changes due to identity and changes due to facial expression by learning to generate the difference between low-level features of images of the same person but with different facial expressions. The CDAAE structure can be used to generate novel expressions by combining and interpolating between facial expressions/action units within the training set. Our experimental results demonstrate that the CDAAE can preserve identity information when generating facial expressions for unseen subjects more faithfully than previous approaches. This is especially advantageous when training with small databases.

Submitted 30 August, 2017; originally announced August 2017.

Comments: Accepted by ACII2017

arXiv:1708.03958 [pdf, other]

Lattice Long Short-Term Memory for Human Action Recognition

Authors: Lin Sun, Kui Jia, Kevin Chen, Dit Yan Yeung, Bertram E. Shi, Silvio Savarese

Abstract: Human actions captured in video sequences are three-dimensional signals characterizing visual appearance and motion dynamics. To learn action patterns, existing methods adopt Convolutional and/or Recurrent Neural Networks (CNNs and RNNs). CNN based methods are effective in learning spatial appearances, but are limited in modeling long-term motion dynamics. RNNs, especially Long Short-Term Memory (LSTM), are able to learn temporal motion dynamics. However, naively applying RNNs to video sequences in a convolutional manner implicitly assumes that motions in videos are stationary across different spatial locations. This assumption is valid for short-term motions but invalid when the duration of the motion is long. In this work, we propose Lattice-LSTM (L2STM), which extends LSTM by learning independent hidden state transitions of memory cells for individual spatial locations. This method effectively enhances the ability to model dynamics across time and addresses the non-stationary issue of long-term motion dynamics without significantly increasing the model complexity. Additionally, we introduce a novel multi-modal training procedure for training our network. Unlike traditional two-stream architectures which use RGB and optical flow information as input, our two-stream model leverages both modalities to jointly train both input gates and both forget gates in the network rather than treating the two streams as separate entities with no information about the other. We apply this end-to-end system to benchmark datasets (UCF-101 and HMDB-51) of human action recognition. Experiments show that on both datasets, our proposed method outperforms all existing ones that are based on LSTM and/or CNNs of similar model complexities.

Submitted 13 August, 2017; originally announced August 2017.

Comments: ICCV2017

arXiv:1704.06399 [pdf]

doi 10.1080/10447318.2018.1452351

Using Variable Dwell Time to Accelerate Gaze-Based Web Browsing with Two-Step Selection

Authors: Zhaokang Chen, Bertram E. Shi

Abstract: In order to avoid the "Midas Touch" problem, gaze-based interfaces for selection often introduce a dwell time: a fixed amount of time the user must fixate upon an object before it is selected. Past interfaces have used a uniform dwell time across all objects. Here, we propose a gaze-based browser using a two-step selection policy with variable dwell time. In the first step, a command, e.g. "back" or "select", is chosen from a menu using a dwell time that is constant across the different commands. In the second step, if the "select" command is chosen, the user selects a hyperlink using a dwell time that varies between different hyperlinks. We assign shorter dwell times to more likely hyperlinks and longer dwell times to less likely hyperlinks. In order to infer the likelihood each hyperlink will be selected, we have developed a probabilistic model of natural gaze behavior while surfing the web. We have evaluated a number of heuristic and probabilistic methods for varying the dwell times using both simulation and experiment. Our results demonstrate that varying dwell time improves the user experience in comparison with fixed dwell time, resulting in fewer errors and increased speed. While all of the methods for varying dwell time resulted in improved performance, the probabilistic models yielded much greater gains than the simple heuristics. The best performing model reduces error rate by 50% compared to 100 ms uniform dwell time while maintaining a similar response time. It reduces response time by 60% compared to 300 ms uniform dwell time while maintaining a similar error rate.

Submitted 3 September, 2022; v1 submitted 21 April, 2017; originally announced April 2017.

Comments: This is an Accepted Manuscript of an article published by Taylor & Francis in the International Journal of Human-Computer Interaction on 30 March, 2018, available online: http://www.tandfonline.com/10.1080/10447318.2018.1452351 . For an eprint of the final published article, please access: https://www.tandfonline.com/eprint/T9d4cNwwRUqXPPiZYm8Z/full . Correct Figure 14

arXiv:1610.07129 [pdf, ps, other]

Developing and Assessing MATLAB Exercises for Active Concept Learning

Authors: S. H. Song, Marco Antonelli, Tony Fung, Brandon D. Armstrong, Amy Chong, Albert Lo, Bertram E. Shi

Abstract: New technologies, such as MOOCs, provide innovative methods to tackle new challenges in teaching and learning, such as globalization and changing contemporary culture, and to remove the limits of conventional classrooms. However, they also bring challenges in course delivery and assessment, due to factors such as less direct student-instructor interaction. These challenges are especially severe in engineering education, which relies heavily on experiential learning, such as computer simulations and laboratory exercises, to assist students in understanding concepts. As a result, effective design of experiential learning components is critical for engineering MOOCs. In this paper, we share our experience gained through developing and offering a MOOC on communication systems, with special focus on the development and assessment of MATLAB exercises for active concept learning. Our approach introduced students to concepts using learning components commonly provided by many MOOC platforms (e.g., online lectures and quizzes), and augmented the student experience with MATLAB-based computer simulations and exercises to enable more concrete and detailed understanding of the material. We describe here a systematic approach to MATLAB problem design and assessment, based on our experience with the MATLAB server provided by MathWorks and integrated with the edX MOOC platform. We discuss the effectiveness of the instructional methods as evaluated through students' learning performance. We analyze the impact of the course design tools from both the instructor and the student perspective.

Submitted 23 October, 2016; originally announced October 2016.

Comments: Submitted to IEEE Transactions on Education

arXiv:1606.06443 [pdf]

An active efficient coding model of the optokinetic nystagmus

Authors: Chong Zhang, Jochen Triesch, Bertram E. Shi

Abstract: Optokinetic nystagmus (OKN) is an involuntary eye movement responsible for stabilizing retinal images in the presence of relative motion between an observer and the environment. Fully understanding the development of optokinetic nystagmus requires a neurally plausible computational model that accounts for the neural development and the behavior. To date, work in this area has been limited. We propose a neurally plausible framework for the joint development of disparity and motion tuning in the visual cortex and of the optokinetic and vergence eye movements. This framework models the joint emergence of both perception and behavior, and accounts for the importance of the development of normal vergence control and binocular vision in achieving normal monocular OKN (mOKN) behaviors. Because the model includes behavior, we can simulate the same perturbations as performed in past experiments, such as artificially induced strabismus. The proposed model agrees both qualitatively and quantitatively with a number of findings from the literature on both binocular vision and the optokinetic reflex. Finally, our model also makes quantitative predictions about the OKN behavior using the same methods used to characterize the OKN in the experimental literature.

Submitted 11 October, 2016; v1 submitted 21 June, 2016; originally announced June 2016.

arXiv:1604.04327 [pdf]

Invariant feature extraction from event based stimuli

Authors: Thusitha N. Chandrapala, Bertram E. Shi

Abstract: We propose a novel architecture, the event-based GASSOM, for learning and extracting invariant representations from event streams originating from neuromorphic vision sensors. The framework is inspired by feed-forward cortical models for visual processing. The model, which is based on the concepts of sparsity and temporal slowness, is able to learn feature extractors that resemble neurons in the primary visual cortex. Layers of units in the proposed model can be cascaded to learn feature extractors with different levels of complexity and selectivity. We explore the applicability of the framework on real world tasks by using the learned network for object recognition. The proposed model achieves higher classification accuracy compared to other state-of-the-art event based processing methods. Our results also demonstrate the generality and robustness of the method, as the recognizers for different data sets and different tasks all used the same set of learned feature detectors, which were trained on data collected independently of the testing data.

Submitted 21 June, 2016; v1 submitted 14 April, 2016; originally announced April 2016.

Comments: 6 pages

arXiv:1510.00562 [pdf, other]

Human Action Recognition using Factorized Spatio-Temporal Convolutional Networks

Authors: Lin Sun, Kui Jia, Dit-Yan Yeung, Bertram E. Shi

Abstract: Human actions in video sequences are three-dimensional (3D) spatio-temporal signals characterizing both the visual appearance and motion dynamics of the involved humans and objects. Inspired by the success of convolutional neural networks (CNN) for image classification, recent attempts have been made to learn 3D CNNs for recognizing human actions in videos. However, partly due to the high complexity of training 3D convolution kernels and the need for large quantities of training videos, only limited success has been reported. This has motivated us to investigate in this paper a new deep architecture which can handle 3D signals more effectively. Specifically, we propose factorized spatio-temporal convolutional networks (FstCN) that factorize the original 3D convolution kernel learning as a sequential process of learning 2D spatial kernels in the lower layers (called spatial convolutional layers), followed by learning 1D temporal kernels in the upper layers (called temporal convolutional layers). We introduce a novel transformation and permutation operator to make factorization in FstCN possible. Moreover, to address the issue of sequence alignment, we propose an effective training and inference strategy based on sampling multiple video clips from a given action video sequence. We have tested FstCN on two commonly used benchmark datasets (UCF-101 and HMDB-51). Without using auxiliary training videos to boost the performance, FstCN outperforms existing CNN-based methods and achieves comparable performance with a recent method that benefits from using auxiliary training videos.

Submitted 2 October, 2015; originally announced October 2015.
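
The factorization described in the abstract above replaces one 3D kernel with a 2D spatial kernel followed by a 1D temporal kernel. The rough weight count below hints at why that can ease training. It is a simplified illustration only: it counts weights for a single input/output channel pair, ignores biases and the paper's transformation and permutation operator, and the function names and kernel sizes are assumptions.

```python
def full_3d_weights(k_t, k_h, k_w):
    """Weights in one dense 3D kernel of size k_t x k_h x k_w."""
    return k_t * k_h * k_w


def factorized_weights(k_t, k_h, k_w):
    """Weights in a 2D spatial kernel (k_h x k_w) plus a 1D temporal kernel (k_t)."""
    return k_h * k_w + k_t


if __name__ == "__main__":
    for size in (3, 5, 7):
        dense = full_3d_weights(size, size, size)
        factored = factorized_weights(size, size, size)
        print(f"kernel {size}^3: dense={dense}, factorized={factored}")
```

The gap widens quickly with kernel size, because the dense count grows with the product of the three dimensions while the factorized count grows with a product of two plus the third.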

arXiv:1402.3344 [pdf]

Intrinsically Motivated Learning of Visual Motion Perception and Smooth Pursuit

Authors: Chong Zhang, Yu Zhao, Jochen Triesch, Bertram E. Shi

Abstract: We extend the framework of efficient coding, which has been used to model the development of sensory processing in isolation, to model the development of the perception/action cycle. Our extension combines sparse coding and reinforcement learning so that sensory processing and behavior co-develop to optimize a shared intrinsic motivational signal: the fidelity of the neural encoding of the sensory input under resource constraints. Applying this framework to a model system consisting of an active eye behaving in a time varying environment, we find that this generic principle leads to the simultaneous development of both smooth pursuit behavior and model neurons whose properties are similar to those of primary visual cortical neurons selective for different directions of visual motion. We suggest that this general principle may form the basis for a unified and integrated explanation of many perception/action loops.

Submitted 24 February, 2014; v1 submitted 13 February, 2014; originally announced February 2014.

Comments: 6 pages, 5 figures

Showing 1–29 of 29 results for author: Shi, B E