Search | arXiv e-print repository

arXiv:2305.19493 [pdf]

MERLIon CCS Challenge Evaluation Plan

Authors: Leibny Paola Garcia Perera, Y. H. Victoria Chua, Hexin Liu, Fei Ting Woon, Andy W. H. Khong, Justin Dauwels, Sanjeev Khudanpur, Suzy J. Styles

Abstract: This paper introduces the inaugural Multilingual Everyday Recordings- Language Identification on Code-Switched Child-Directed Speech (MERLIon CCS) Challenge, focused on developing robust language identification and language diarization systems that are reliable for non-standard, accented, spontaneous code-switched, child-directed speech collected via Zoom. Aligning closely with the Interspeech 2023 theme, the main objectives of this inaugural challenge are to present a unique first-of-its-kind Zoom videocall dataset featuring English-Mandarin spontaneous code-switched child-directed speech, benchmark current and novel language identification and language diarization systems in a code-switching scenario including extremely short utterances, and test the robustness of such systems under accented speech. The MERLIon CCS challenge features two tasks: language identification (Task 1) and language diarization (Task 2). Two tracks, open and closed, are available for each task, differing by the volume of data systems can be trained on. This paper describes the dataset, dataset annotation protocol, challenge tasks, open and closed tracks, evaluation metrics, and evaluation protocol.

Submitted 30 May, 2023; originally announced May 2023.

Comments: Evaluation plan for Interspeech 2023 special session "MERLIon"

arXiv:2305.18925 [pdf, other]

Investigating model performance in language identification: beyond simple error statistics

Authors: Suzy J. Styles, Victoria Y. H. Chua, Fei Ting Woon, Hexin Liu, Leibny Paola Garcia Perera, Sanjeev Khudanpur, Andy W. H. Khong, Justin Dauwels

Abstract: Language development experts need tools that can automatically identify languages from fluent, conversational speech, and provide reliable estimates of usage rates at the level of an individual recording. However, language identification systems are typically evaluated on metrics such as equal error rate and balanced accuracy, applied at the level of an entire speech corpus. These overview metrics do not provide information about model performance at the level of individual speakers, recordings, or units of speech with different linguistic characteristics. Overview statistics may therefore mask systematic errors in model performance for some subsets of the data; consequently, models may perform worse on data derived from some subsets of human speakers, creating a kind of algorithmic bias. In the current paper, we investigate how well a number of language identification systems perform on individual recordings and speech units with different linguistic properties in the MERLIon CCS Challenge. The Challenge dataset features accented English-Mandarin code-switched child-directed speech.

Submitted 30 May, 2023; originally announced May 2023.

Comments: Accepted to Interspeech 2023, 5 pages, 5 figures

arXiv:2305.18881 [pdf, other]

MERLIon CCS Challenge: A English-Mandarin code-switching child-directed speech corpus for language identification and diarization

Authors: Victoria Y. H. Chua, Hexin Liu, Leibny Paola Garcia Perera, Fei Ting Woon, Jinyi Wong, Xiangyu Zhang, Sanjeev Khudanpur, Andy W. H. Khong, Justin Dauwels, Suzy J. Styles

Abstract: To enhance the reliability and robustness of language identification (LID) and language diarization (LD) systems for heterogeneous populations and scenarios, there is a need for speech processing models to be trained on datasets that feature diverse language registers and speech patterns. We present the MERLIon CCS challenge, featuring a first-of-its-kind Zoom video call dataset of parent-child shared book reading, of over 30 hours with over 300 recordings, annotated by multilingual transcribers using a high-fidelity linguistic transcription protocol. The audio corpus features spontaneous and in-the-wild English-Mandarin code-switching, child-directed speech in non-standard accents with diverse language-mixing patterns recorded in a variety of home environments. This report describes the corpus, as well as LID and LD results for our baseline and several systems submitted to the MERLIon CCS challenge using the corpus.

Submitted 30 May, 2023; originally announced May 2023.

Comments: Accepted by Interspeech 2023, 5 pages, 2 figures, 3 tables

arXiv:2208.02405 [pdf, other]

Transformer Convolutional Neural Networks for Automated Artifact Detection in Scalp EEG

Authors: Wei Yan Peh, Yuanyuan Yao, Justin Dauwels

Abstract: It is well known that electroencephalograms (EEGs) often contain artifacts due to muscle activity, eye blinks, and various other causes. Detecting such artifacts is an essential first step toward a correct interpretation of EEGs. Although much effort has been devoted to semi-automated and automated artifact detection in EEG, the problem of artifact detection remains challenging. In this paper, we propose a convolutional neural network (CNN) enhanced by transformers using belief matching (BM) loss for automated detection of five types of artifacts: chewing, electrode pop, eye movement, muscle, and shiver. Specifically, we apply these five detectors at individual EEG channels to distinguish artifacts from background EEG. Next, for each of these five types of artifacts, we combine the output of these channel-wise detectors to detect artifacts in multi-channel EEG segments. These segment-level classifiers can detect specific artifacts with a balanced accuracy (BAC) of 0.947, 0.735, 0.826, 0.857, and 0.655 for chewing, electrode pop, eye movement, muscle, and shiver artifacts, respectively. Finally, we combine the outputs of the five segment-level detectors to perform a combined binary classification (any artifact vs. background). The resulting detector achieves a sensitivity (SEN) of 60.4%, 51.8%, and 35.5%, at a specificity (SPE) of 95%, 97%, and 99%, respectively. This artifact detection module can reject artifact segments while only removing a small fraction of the background EEG, leading to a cleaner EEG for further analysis.

Submitted 3 August, 2022; originally announced August 2022.

Comments: This is an extension to a paper presented at the 2022 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC) Scottish Event Campus, Glasgow, UK, July 11-15, 2022

arXiv:2208.00025 [pdf, other]

Six-center Assessment of CNN-Transformer with Belief Matching Loss for Patient-independent Seizure Detection in EEG

Authors: Wei Yan Peh, Prasanth Thangavel, Yuanyuan Yao, John Thomas, Yee Leng Tan, Justin Dauwels

Abstract: Neurologists typically identify epileptic seizures from electroencephalograms (EEGs) by visual inspection. This process is often time-consuming, especially for EEG recordings that last hours or days. To expedite the process, a reliable, automated, and patient-independent seizure detector is essential. However, developing a patient-independent seizure detector is challenging as seizures exhibit diverse characteristics across patients and recording devices. In this study, we propose a patient-independent seizure detector to automatically detect seizures in both scalp EEG and intracranial EEG (iEEG). First, we deploy a convolutional neural network with transformers and belief matching loss to detect seizures in single-channel EEG segments. Next, we extract regional features from the channel-level outputs to detect seizures in multi-channel EEG segments. Then, we apply postprocessing filters to the segment-level outputs to determine seizures' start and end points in multi-channel EEGs. Finally, we introduce the minimum overlap evaluation scoring as an evaluation metric that accounts for minimum overlap between the detection and seizure, improving upon existing assessment metrics. We trained the seizure detector on the Temple University Hospital Seizure (TUH-SZ) dataset and evaluated it on five independent EEG datasets. We evaluate the systems with the following metrics: sensitivity (SEN), precision (PRE), and average and median false positive rate per hour (aFPR/h and mFPR/h).
Across four adult scalp EEG and iEEG datasets, we obtained SEN of 0.617-1.00, PRE of 0.534-1.00, aFPR/h of 0.425-2.002, and mFPR/h of 0-1.003. The proposed seizure detector can detect seizures in adult EEGs and takes less than 15 s for a 30-minute EEG. Hence, this system could help clinicians identify seizures reliably and quickly, leaving more time for devising proper treatment.

Submitted 22 November, 2022; v1 submitted 29 July, 2022; originally announced August 2022.

Comments: Submitting to IJNS

arXiv:2203.03218 [pdf, other]

Enhance Language Identification using Dual-mode Model with Knowledge Distillation

Authors: Hexin Liu, Leibny Paola Garcia Perera, Andy W. H. Khong, Justin Dauwels, Suzy J. Styles, Sanjeev Khudanpur

Abstract: In this paper, we propose to employ a dual-mode framework on the x-vector self-attention (XSA-LID) model with knowledge distillation (KD) to enhance its language identification (LID) performance for both long and short utterances. The dual-mode XSA-LID model is trained by jointly optimizing both the full and short modes with their respective inputs being the full-length speech and its short clip extracted by a specific Boolean mask, and KD is applied to further boost the performance on short utterances. In addition, we investigate the impact of clip-wise linguistic variability and lexical integrity for LID by analyzing the variation of LID performance in terms of the lengths and positions of the mimicked speech clips. We evaluated our approach on the MLS14 data from the NIST 2017 LRE. With the 3 s random-location Boolean mask, our proposed method achieved 19.23%, 21.52% and 8.37% relative improvement in average cost compared with the XSA-LID model on 3 s, 10 s, and 30 s speech, respectively.

Submitted 7 March, 2022; originally announced March 2022.

Comments: Submitted to Odyssey 2022

arXiv:2107.05318 [pdf, other]

R3L: Connecting Deep Reinforcement Learning to Recurrent Neural Networks for Image Denoising via Residual Recovery

Authors: Rongkai Zhang, Jiang Zhu, Zhiyuan Zha, Justin Dauwels, Bihan Wen

Abstract: State-of-the-art image denoisers exploit various types of deep neural networks via deterministic training. Alternatively, very recent works utilize deep reinforcement learning for restoring images with diverse or unknown corruptions. Though deep reinforcement learning can generate effective policy networks for operator selection or architecture search in image restoration, how it is connected to the classic deterministic training in solving inverse problems remains unclear. In this work, we propose a novel image denoising scheme via Residual Recovery using Reinforcement Learning, dubbed R3L. We show that R3L is equivalent to a deep recurrent neural network that is trained using a stochastic reward, in contrast to many popular denoisers using supervised learning with deterministic losses. To benchmark the effectiveness of reinforcement learning in R3L, we train a recurrent neural network with the same architecture for residual recovery using the deterministic loss, in order to analyze how the two different training strategies affect the denoising performance. With such a unified benchmarking system, we demonstrate that the proposed R3L has better generalizability and robustness in image denoising when the estimated noise level varies, compared to its counterparts using deterministic training, as well as various state-of-the-art image denoising algorithms.

Submitted 12 July, 2021; originally announced July 2021.

Comments: Accepted by ICIP 2021

arXiv:2009.13554 [pdf, other]

doi 10.1142/S0129065721500167

Multi-center validation study of automated classification of pathological slowing in adult scalp electroencephalograms via frequency features

Authors: Wei Yan Peh, John Thomas, Elham Bagheri, Rima Chaudhari, Sagar Karia, Rahul Rathakrishnan, Vinay Saini, Nilesh Shah, Rohit Srivastava, Yee-Leng Tan, Justin Dauwels

Abstract: Pathological slowing in the electroencephalogram (EEG) is widely investigated for the diagnosis of neurological disorders. Currently, the gold standard for slowing detection is the visual inspection of the EEG by experts, which is time-consuming and subjective. To address those issues, we propose three automated approaches to detect slowing in EEG: Threshold-based Detecting System (TDS), Shallow Learning-based Detecting System (SLDS), and Deep Learning-based Detecting System (DLDS). These systems are evaluated on channel-, segment- and EEG-level. The TDS, SLDS, and DLDS perform prediction via detecting slowing at individual channels, and those detections are arranged in histograms for detection of slowing at the segment- and EEG-level. We evaluate the systems through Leave-One-Subject-Out (LOSO) cross-validation (CV) and Leave-One-Institution-Out (LOIO) CV on four datasets from the US, Singapore, and India. The DLDS achieved the best overall results: LOIO CV mean balanced accuracy (BAC) of 71.9%, 75.5%, and 82.0% at channel-, segment- and EEG-level, and LOSO CV mean BAC of 73.6%, 77.2%, and 81.8% at channel-, segment-, and EEG-level. The channel- and segment-level performance is comparable to the intra-rater agreement (IRA) of an expert of 72.4% and 82%. The DLDS can process a 30-minute EEG in 4 seconds and can be deployed to assist clinicians in interpreting EEGs.

Submitted 26 January, 2021; v1 submitted 28 September, 2020; originally announced September 2020.

Comments: 24 pages. For submission to International Journal of Neural Systems (IJNS)

arXiv:2008.13443 [pdf, other]

doi 10.1016/j.commtr.2021.100008

On the Quality Requirements of Demand Prediction for Dynamic Public Transport

Authors: Inon Peled, Kelvin Lee, Yu Jiang, Justin Dauwels, Francisco C. Pereira

Abstract: As Public Transport (PT) becomes more dynamic and demand-responsive, it increasingly depends on predictions of transport demand. But how accurate need such predictions be for effective PT operation? We address this question through an experimental case study of PT trips in Metropolitan Copenhagen, Denmark, which we conduct independently of any specific prediction models. First, we simulate errors in demand prediction through unbiased noise distributions that vary considerably in shape. Using the noisy predictions, we then simulate and optimize demand-responsive PT fleets via a linear programming formulation and measure their performance. Our results suggest that the optimized performance is mainly affected by the skew of the noise distribution and the presence of infrequently large prediction errors. In particular, the optimized performance can improve under non-Gaussian vs. Gaussian noise. We also find that dynamic routing could reduce trip time by at least 23% vs. static routing. This reduction is estimated at 809,000 EUR/year in terms of Value of Travel Time Savings for the case study.

Submitted 6 November, 2021; v1 submitted 31 August, 2020; originally announced August 2020.

Comments: 26 pages, 9 tables, 6 figures

arXiv:1911.03667 [pdf, other]

Factored Latent-Dynamic Conditional Random Fields for Single and Multi-label Sequence Modeling

Authors: Satyajit Neogi, Justin Dauwels

Abstract: Conditional Random Fields (CRF) are frequently applied for labeling and segmenting sequence data. Morency et al. (2007) introduced hidden state variables in a labeled CRF structure in order to model the latent dynamics within class labels, thus improving the labeling performance. Such a model is known as Latent-Dynamic CRF (LDCRF). We present Factored LDCRF (FLDCRF), a structure that allows multiple latent dynamics of the class labels to interact with each other. Including such latent-dynamic interactions leads to improved labeling performance on single-label and multi-label sequence modeling tasks. We apply our FLDCRF models on two single-label (one nested cross-validation) and one multi-label sequence tagging (nested cross-validation) experiments across two different datasets - UCI gesture phase data and UCI opportunity data. FLDCRF outperforms all state-of-the-art sequence models, i.e., CRF, LDCRF, LSTM, LSTM-CRF, Factorial CRF, Coupled CRF and a multi-label LSTM model in all our experiments. In addition, LSTM-based models display inconsistent performance across validation and test data, which made it difficult to select models on validation data during our experiments. FLDCRF offers easier model selection, consistency across validation and test performance and lucid model intuition. FLDCRF is also much faster to train compared to LSTM, even without a GPU.
FLDCRF outperforms the best LSTM model by ~4% on a single-label task on UCI gesture phase data and outperforms LSTM performance by ~2% on average across nested cross-validation test sets on the multi-label sequence tagging experiment on UCI opportunity data. The idea of FLDCRF can be extended to joint (multi-agent interactions) and heterogeneous (discrete and continuous) state space models.

Submitted 12 November, 2019; v1 submitted 9 November, 2019; originally announced November 2019.

Comments: To be submitted to Journal of Machine Learning Research (JMLR)

arXiv:1907.11881 [pdf, other]

doi 10.1109/TITS.2020.2995166

Context Model for Pedestrian Intention Prediction using Factored Latent-Dynamic Conditional Random Fields

Authors: Satyajit Neogi, Michael Hoy, Kang Dang, Hang Yu, Justin Dauwels

Abstract: Smooth handling of pedestrian interactions is a key requirement for Autonomous Vehicles (AV) and Advanced Driver Assistance Systems (ADAS). Such systems call for early and accurate prediction of a pedestrian's crossing/not-crossing behaviour in front of the vehicle. Existing approaches to pedestrian behaviour prediction make use of pedestrian motion, the pedestrian's location in a scene and static context variables such as traffic lights, zebra crossings etc. We stress the necessity of early prediction for smooth operation of such systems. We introduce the influence of vehicle interactions on pedestrian intention for this purpose. In this paper, we show a discernible advance in prediction time aided by the inclusion of such vehicle interaction context. We apply our methods to two different datasets, one in-house collected - NTU dataset and another public real-life benchmark - JAAD dataset. We also propose a generic graphical model Factored Latent-Dynamic Conditional Random Fields (FLDCRF) for single and multi-label sequence prediction as well as joint interaction modeling tasks. FLDCRF outperforms Long Short-Term Memory (LSTM) networks across the datasets (~100 sequences per dataset) over identical time-series features. While the existing best system predicts pedestrian stopping behaviour with 70% accuracy 0.38 seconds before the actual events, our system achieves such accuracy at least 0.9 seconds on average before the actual events across datasets.

Submitted 15 September, 2020; v1 submitted 27 July, 2019; originally announced July 2019.

Comments: Accepted by IEEE Transactions on Intelligent Transportation Systems

arXiv:1907.05274 [pdf, other]

Affine Disentangled GAN for Interpretable and Robust AV Perception

Authors: Letao Liu, Martin Saerbeck, Justin Dauwels

Abstract: Autonomous vehicles (AV) have progressed rapidly with the advancements in computer vision algorithms. The deep convolutional neural network, as the main contributor to this advancement, has boosted the classification accuracy dramatically. However, the discovery of adversarial examples reveals the generalization gap between the dataset and the real world. Furthermore, affine transformations may also confuse computer vision based object detectors. The degradation of the perception system is undesirable for safety critical systems such as autonomous vehicles. In this paper, a deep learning system is proposed: Affine Disentangled GAN (ADIS-GAN), which is robust against affine transformations and adversarial attacks. It is demonstrated that conventional data augmentation for affine transformation and adversarial attacks are orthogonal, while ADIS-GAN can handle both attacks at the same time. Useful information such as image rotation angle and scaling factor is also generated in ADIS-GAN. On the MNIST dataset, ADIS-GAN can achieve over 98 percent classification accuracy within 30 degrees rotation, and over 90 percent classification accuracy against FGSM and PGD adversarial attacks.

Submitted 6 July, 2019; originally announced July 2019.

Showing 1–12 of 12 results for author: Dauwels, J