-
Robust Self Supervised Speech Embeddings for Child-Adult Classification in Interactions involving Children with Autism
Authors:
Rimita Lahiri,
Tiantian Feng,
Rajat Hebbar,
Catherine Lord,
So Hyun Kim,
Shrikanth Narayanan
Abstract:
We address the problem of detecting who spoke when in child-inclusive spoken interactions i.e., automatic child-adult speaker classification. Interactions involving children are richly heterogeneous due to developmental differences. The presence of neurodiversity e.g., due to Autism, contributes additional variability. We investigate the impact of additional pre-training with more unlabelled child…
▽ More
We address the problem of detecting who spoke when in child-inclusive spoken interactions i.e., automatic child-adult speaker classification. Interactions involving children are richly heterogeneous due to developmental differences. The presence of neurodiversity e.g., due to Autism, contributes additional variability. We investigate the impact of additional pre-training with more unlabelled child speech on the child-adult classification performance. We pre-train our model with child-inclusive interactions, following two recent self-supervision algorithms, Wav2vec 2.0 and WavLM, with a contrastive loss objective. We report 9 - 13% relative improvement over the state-of-the-art baseline with regards to classification F1 scores on two clinical interaction datasets involving children with Autism. We also analyze the impact of pre-training under different conditions by evaluating our model on interactions involving different subgroups of children based on various demographic factors.
△ Less
Submitted 31 July, 2023;
originally announced July 2023.
-
MESAHA-Net: Multi-Encoders based Self-Adaptive Hard Attention Network with Maximum Intensity Projections for Lung Nodule Segmentation in CT Scan
Authors:
Muhammad Usman,
Azka Rehman,
Abdullah Shahid,
Siddique Latif,
Shi Sub Byon,
Sung Hyun Kim,
Tariq Mahmood Khan,
Yeong Gil Shin
Abstract:
Accurate lung nodule segmentation is crucial for early-stage lung cancer diagnosis, as it can substantially enhance patient survival rates. Computed tomography (CT) images are widely employed for early diagnosis in lung nodule analysis. However, the heterogeneity of lung nodules, size diversity, and the complexity of the surrounding environment pose challenges for developing robust nodule segmenta…
▽ More
Accurate lung nodule segmentation is crucial for early-stage lung cancer diagnosis, as it can substantially enhance patient survival rates. Computed tomography (CT) images are widely employed for early diagnosis in lung nodule analysis. However, the heterogeneity of lung nodules, size diversity, and the complexity of the surrounding environment pose challenges for developing robust nodule segmentation methods. In this study, we propose an efficient end-to-end framework, the multi-encoder-based self-adaptive hard attention network (MESAHA-Net), for precise lung nodule segmentation in CT scans. MESAHA-Net comprises three encoding paths, an attention block, and a decoder block, facilitating the integration of three types of inputs: CT slice patches, forward and backward maximum intensity projection (MIP) images, and region of interest (ROI) masks encompassing the nodule. By employing a novel adaptive hard attention mechanism, MESAHA-Net iteratively performs slice-by-slice 2D segmentation of lung nodules, focusing on the nodule region in each slice to generate 3D volumetric segmentation of lung nodules. The proposed framework has been comprehensively evaluated on the LIDC-IDRI dataset, the largest publicly available dataset for lung nodule segmentation. The results demonstrate that our approach is highly robust for various lung nodule types, outperforming previous state-of-the-art techniques in terms of segmentation accuracy and computational complexity, rendering it suitable for real-time clinical implementation.
△ Less
Submitted 4 April, 2023;
originally announced April 2023.
-
Hierarchical control and learning of a foraging CyberOctopus
Authors:
Chia-Hsien Shih,
Noel Naughton,
Udit Halder,
Heng-Sheng Chang,
Seung Hyun Kim,
Rhanor Gillette,
Prashant G. Mehta,
Mattia Gazzola
Abstract:
Inspired by the unique neurophysiology of the octopus, we propose a hierarchical framework that simplifies the coordination of multiple soft arms by decomposing control into high-level decision making, low-level motor activation, and local reflexive behaviors via sensory feedback. When evaluated in the illustrative problem of a model octopus foraging for food, this hierarchical decomposition resul…
▽ More
Inspired by the unique neurophysiology of the octopus, we propose a hierarchical framework that simplifies the coordination of multiple soft arms by decomposing control into high-level decision making, low-level motor activation, and local reflexive behaviors via sensory feedback. When evaluated in the illustrative problem of a model octopus foraging for food, this hierarchical decomposition results in significant improvements relative to end-to-end methods. Performance is achieved through a mixed-modes approach, whereby qualitatively different tasks are addressed via complementary control schemes. Here, model-free reinforcement learning is employed for high-level decision-making, while model-based energy shaping takes care of arm-level motor execution. To render the pairing computationally tenable, a novel neural-network energy shaping (NN-ES) controller is developed, achieving accurate motions with time-to-solutions 200 times faster than previous attempts. Our hierarchical framework is then successfully deployed in increasingly challenging foraging scenarios, including an arena littered with obstacles in 3D space, demonstrating the viability of our approach.
△ Less
Submitted 11 February, 2023;
originally announced February 2023.
-
Significantly Improving Zero-Shot X-ray Pathology Classification via Fine-tuning Pre-trained Image-Text Encoders
Authors:
Jongseong Jang,
Daeun Kyung,
Seung Hwan Kim,
Honglak Lee,
Kyunghoon Bae,
Edward Choi
Abstract:
Deep neural networks have been successfully adopted to diverse domains including pathology classification based on medical images. However, large-scale and high-quality data to train powerful neural networks are rare in the medical domain as the labeling must be done by qualified experts. Researchers recently tackled this problem with some success by taking advantage of models pre-trained on large…
▽ More
Deep neural networks have been successfully adopted to diverse domains including pathology classification based on medical images. However, large-scale and high-quality data to train powerful neural networks are rare in the medical domain as the labeling must be done by qualified experts. Researchers recently tackled this problem with some success by taking advantage of models pre-trained on large-scale general domain data. Specifically, researchers took contrastive image-text encoders (e.g., CLIP) and fine-tuned it with chest X-ray images and paired reports to perform zero-shot pathology classification, thus completely removing the need for pathology-annotated images to train a classification model. Existing studies, however, fine-tuned the pre-trained model with the same contrastive learning objective, and failed to exploit the multi-labeled nature of medical image-report pairs. In this paper, we propose a new fine-tuning strategy based on sentence sampling and positive pair loss relaxation for improving the downstream zero-shot pathology classification performance, which can be applied to any pre-trained contrastive image-text encoders. Our method consistently showed dramatically improved zero-shot pathology classification performance on four different chest X-ray datasets and 3 different pre-trained models (5.77% average AUROC increase). In particular, fine-tuning CLIP with our method showed much comparable or marginally outperformed to board-certified radiologists (0.619 vs 0.625 in F1 score and 0.530 vs 0.544 in MCC) in zero-shot classification of five prominent diseases from the CheXpert dataset.
△ Less
Submitted 16 March, 2023; v1 submitted 14 December, 2022;
originally announced December 2022.
-
A Context-Aware Computational Approach for Measuring Vocal Entrainment in Dyadic Conversations
Authors:
Rimita Lahiri,
Md Nasir,
Catherine Lord,
So Hyun Kim,
Shrikanth Narayanan
Abstract:
Vocal entrainment is a social adaptation mechanism in human interaction, knowledge of which can offer useful insights to an individual's cognitive-behavioral characteristics. We propose a context-aware approach for measuring vocal entrainment in dyadic conversations. We use conformers(a combination of convolutional network and transformer) for capturing both short-term and long-term conversational…
▽ More
Vocal entrainment is a social adaptation mechanism in human interaction, knowledge of which can offer useful insights to an individual's cognitive-behavioral characteristics. We propose a context-aware approach for measuring vocal entrainment in dyadic conversations. We use conformers(a combination of convolutional network and transformer) for capturing both short-term and long-term conversational context to model entrainment patterns in interactions across different domains. Specifically we use cross-subject attention layers to learn intra- as well as inter-personal signals from dyadic conversations. We first validate the proposed method based on classification experiments to distinguish between real(consistent) and fake(inconsistent/shuffled) conversations. Experimental results on interactions involving individuals with Autism Spectrum Disorder also show evidence of a statistically-significant association between the introduced entrainment measure and clinical scores relevant to symptoms, including across gender and age groups.
△ Less
Submitted 6 November, 2022;
originally announced November 2022.
-
MEDS-Net: Self-Distilled Multi-Encoders Network with Bi-Direction Maximum Intensity projections for Lung Nodule Detection
Authors:
Muhammad Usman,
Azka Rehman,
Abdullah Shahid,
Siddique Latif,
Shi Sub Byon,
Byoung Dai Lee,
Sung Hyun Kim,
Byung il Lee,
Yeong Gil Shin
Abstract:
In this study, we propose a lung nodule detection scheme which fully incorporates the clinic workflow of radiologists. Particularly, we exploit Bi-Directional Maximum intensity projection (MIP) images of various thicknesses (i.e., 3, 5 and 10mm) along with a 3D patch of CT scan, consisting of 10 adjacent slices to feed into self-distillation-based Multi-Encoders Network (MEDS-Net). The proposed ar…
▽ More
In this study, we propose a lung nodule detection scheme which fully incorporates the clinic workflow of radiologists. Particularly, we exploit Bi-Directional Maximum intensity projection (MIP) images of various thicknesses (i.e., 3, 5 and 10mm) along with a 3D patch of CT scan, consisting of 10 adjacent slices to feed into self-distillation-based Multi-Encoders Network (MEDS-Net). The proposed architecture first condenses 3D patch input to three channels by using a dense block which consists of dense units which effectively examine the nodule presence from 2D axial slices. This condensed information, along with the forward and backward MIP images, is fed to three different encoders to learn the most meaningful representation, which is forwarded into the decoded block at various levels. At the decoder block, we employ a self-distillation mechanism by connecting the distillation block, which contains five lung nodule detectors. It helps to expedite the convergence and improves the learning ability of the proposed architecture. Finally, the proposed scheme reduces the false positives by complementing the main detector with auxiliary detectors. The proposed scheme has been rigorously evaluated on 888 scans of LUNA16 dataset and obtained a CPM score of 93.6\%. The results demonstrate that incorporating of bi-direction MIP images enables MEDS-Net to effectively distinguish nodules from surroundings which help to achieve the sensitivity of 91.5% and 92.8% with false positives rate of 0.25 and 0.5 per scan, respectively.
△ Less
Submitted 26 December, 2022; v1 submitted 30 October, 2022;
originally announced November 2022.
-
Dual-Stage Deeply Supervised Attention-based Convolutional Neural Networks for Mandibular Canal Segmentation in CBCT Scans
Authors:
Azka Rehman,
Muhammad Usman,
Rabeea Jawaid,
Amal Muhammad Saleem,
Shi Sub Byon,
Sung Hyun Kim,
Byoung Dai Lee,
Byung il Lee,
Yeong Gil Shin
Abstract:
Accurate segmentation of mandibular canals in lower jaws is important in dental implantology. Medical experts determine the implant position and dimensions manually from 3D CT images to avoid damaging the mandibular nerve inside the canal. In this paper, we propose a novel dual-stage deep learning-based scheme for the automatic segmentation of the mandibular canal. Particularly, we first enhance t…
▽ More
Accurate segmentation of mandibular canals in lower jaws is important in dental implantology. Medical experts determine the implant position and dimensions manually from 3D CT images to avoid damaging the mandibular nerve inside the canal. In this paper, we propose a novel dual-stage deep learning-based scheme for the automatic segmentation of the mandibular canal. Particularly, we first enhance the CBCT scans by employing the novel histogram-based dynamic windowing scheme, which improves the visibility of mandibular canals. After enhancement, we design 3D deeply supervised attention U-Net architecture for localizing the volumes of interest (VOIs), which contain the mandibular canals (i.e., left and right canals). Finally, we employed the multi-scale input residual U-Net architecture (MS-R-UNet) to segment the mandibular canals using VOIs accurately. The proposed method has been rigorously evaluated on 500 scans. The results demonstrate that our technique outperforms the current state-of-the-art segmentation performance and robustness methods.
△ Less
Submitted 2 November, 2022; v1 submitted 6 October, 2022;
originally announced October 2022.
-
Park-and-Ride Facility Location Selection under Nested Logit Demand Function
Authors:
Sang Hyun Kim,
Sangho Shim
Abstract:
Park-and-ride facilities are car parks where users can transfer to public transportation. Commuters can use P&R facilities or choose to travel by car to their destinations, and individual choice behavior is assumed to follow a logit model. The P&R facility location problem identifies locations for a fixed number of P&R facilities from among potential locations such that the number of users of the…
▽ More
Park-and-ride facilities are car parks where users can transfer to public transportation. Commuters can use P&R facilities or choose to travel by car to their destinations, and individual choice behavior is assumed to follow a logit model. The P&R facility location problem identifies locations for a fixed number of P&R facilities from among potential locations such that the number of users of the P&R facilities is maximized. This problem has previously been formalized under a multinomial logit demand function. However, as it imposes the strong condition of the independence of irrelevant alternatives, the MNL model is unable to represent the real-world P&R facility location problem exactly. Respecting the nested structure of individual choice behavior, we generalize the MNL model to a nested logit model and develop two computational methods -- neighborhood search and randomized rounding -- to solve large-scale P&R facility location problems under the NL demand function. The neighborhood search method first finds one feasible solution and then improves the feasible solution to the next one along an edge of the polyhedron whose vertices are the feasible solutions. The neighborhood search method is randomized to create an adaptive randomized rounding procedure. Computational experiments verified that the computational methods were able to solve the nonlinear optimization problem under the NL demand function on 1,000 medium-scale instances to exact optimality rapidly. Specifically, the methods were verified to solve the MNL model to exact optimality 10,000 times faster than the mixed integer linear programming formulation described in the literature. We performed additional computational experiments to assess the performance of our computational methods on a variety of large-scale instances. Our computational analysis also elucidates the difference between the MNL and NL models.
△ Less
Submitted 21 December, 2022; v1 submitted 17 November, 2021;
originally announced November 2021.
-
Meta-learning with Latent Space Clustering in Generative Adversarial Network for Speaker Diarization
Authors:
Monisankha Pal,
Manoj Kumar,
Raghuveer Peri,
Tae Jin Park,
So Hyun Kim,
Catherine Lord,
Somer Bishop,
Shrikanth Narayanan
Abstract:
The performance of most speaker diarization systems with x-vector embeddings is both vulnerable to noisy environments and lacks domain robustness. Earlier work on speaker diarization using generative adversarial network (GAN) with an encoder network (ClusterGAN) to project input x-vectors into a latent space has shown promising performance on meeting data. In this paper, we extend the ClusterGAN n…
▽ More
The performance of most speaker diarization systems with x-vector embeddings is both vulnerable to noisy environments and lacks domain robustness. Earlier work on speaker diarization using generative adversarial network (GAN) with an encoder network (ClusterGAN) to project input x-vectors into a latent space has shown promising performance on meeting data. In this paper, we extend the ClusterGAN network to improve diarization robustness and enable rapid generalization across various challenging domains. To this end, we fetch the pre-trained encoder from the ClusterGAN and fine-tune it by using prototypical loss (meta-ClusterGAN or MCGAN) under the meta-learning paradigm. Experiments are conducted on CALLHOME telephonic conversations, AMI meeting data, DIHARD II (dev set) which includes challenging multi-domain corpus, and two child-clinician interaction corpora (ADOS, BOSCC) related to the autism spectrum disorder domain. Extensive analyses of the experimental data are done to investigate the effectiveness of the proposed ClusterGAN and MCGAN embeddings over x-vectors. The results show that the proposed embeddings with normalized maximum eigengap spectral clustering (NME-SC) back-end consistently outperform Kaldi state-of-the-art z-vector diarization system. Finally, we employ embedding fusion with x-vectors to provide further improvement in diarization performance. We achieve a relative diarization error rate (DER) improvement of 6.67% to 53.93% on the aforementioned datasets using the proposed fused embeddings over x-vectors. Besides, the MCGAN embeddings provide better performance in the number of speakers estimation and short speech segment diarization as compared to x-vectors and ClusterGAN in telephonic data.
△ Less
Submitted 19 July, 2020;
originally announced July 2020.
-
Volumetric Lung Nodule Segmentation using Adaptive ROI with Multi-View Residual Learning
Authors:
Muhammad Usman,
Byoung-Dai Lee,
Shi Sub Byon,
Sung Hyun Kim,
Byung-ilLee
Abstract:
Accurate quantification of pulmonary nodules can greatly assist the early diagnosis of lung cancer, which can enhance patient survival possibilities. A number of nodule segmentation techniques have been proposed, however, all of the existing techniques rely on radiologist 3-D volume of interest (VOI) input or use the constant region of interest (ROI) and only investigate the presence of nodule vox…
▽ More
Accurate quantification of pulmonary nodules can greatly assist the early diagnosis of lung cancer, which can enhance patient survival possibilities. A number of nodule segmentation techniques have been proposed, however, all of the existing techniques rely on radiologist 3-D volume of interest (VOI) input or use the constant region of interest (ROI) and only investigate the presence of nodule voxels within the given VOI. Such approaches restrain the solutions to investigate the nodule presence outside the given VOI and also include the redundant structures into VOI, which may lead to inaccurate nodule segmentation. In this work, a novel semi-automated approach for 3-D segmentation of nodule in volumetric computerized tomography (CT) lung scans has been proposed. The proposed technique can be segregated into two stages, at the first stage, it takes a 2-D ROI containing the nodule as input and it performs patch-wise investigation along the axial axis with a novel adaptive ROI strategy. The adaptive ROI algorithm enables the solution to dynamically select the ROI for the surrounding slices to investigate the presence of nodule using deep residual U-Net architecture. The first stage provides the initial estimation of nodule which is further utilized to extract the VOI. At the second stage, the extracted VOI is further investigated along the coronal and sagittal axis with two different networks and finally, all the estimated masks are fed into the consensus module to produce the final volumetric segmentation of nodule. The proposed approach has been rigorously evaluated on the LIDC dataset, which is the largest publicly available dataset. The result suggests that the approach is significantly robust and accurate as compared to the previous state of the art techniques.
△ Less
Submitted 3 February, 2020; v1 submitted 31 December, 2019;
originally announced December 2019.
-
Meta-learning for robust child-adult classification from speech
Authors:
Nithin Rao Koluguri,
Manoj Kumar,
So Hyun Kim,
Catherine Lord,
Shrikanth Narayanan
Abstract:
Computational modeling of naturalistic conversations in clinical applications has seen growing interest in the past decade. An important use-case involves child-adult interactions within the autism diagnosis and intervention domain. In this paper, we address a specific sub-problem of speaker diarization, namely child-adult speaker classification in such dyadic conversations with specified roles. T…
▽ More
Computational modeling of naturalistic conversations in clinical applications has seen growing interest in the past decade. An important use-case involves child-adult interactions within the autism diagnosis and intervention domain. In this paper, we address a specific sub-problem of speaker diarization, namely child-adult speaker classification in such dyadic conversations with specified roles. Training a speaker classification system robust to speaker and channel conditions is challenging due to inherent variability in the speech within children and the adult interlocutors. In this work, we propose the use of meta-learning, in particular, prototypical networks which optimize a metric space across multiple tasks. By modeling every child-adult pair in the training set as a separate task during meta-training, we learn a representation with improved generalizability compared to conventional supervised learning. We demonstrate improvements over state-of-the-art speaker embeddings (x-vectors) under two evaluation settings: weakly supervised classification (up to 14.53% relative improvement in F1-scores) and clustering (up to relative 9.66% improvement in cluster purity). Our results show that protonets can potentially extract robust speaker embeddings for child-adult classification from speech.
△ Less
Submitted 28 October, 2019; v1 submitted 24 October, 2019;
originally announced October 2019.
-
Speaker diarization using latent space clustering in generative adversarial network
Authors:
Monisankha Pal,
Manoj Kumar,
Raghuveer Peri,
Tae Jin Park,
So Hyun Kim,
Catherine Lord,
Somer Bishop,
Shrikanth Narayanan
Abstract:
In this work, we propose deep latent space clustering for speaker diarization using generative adversarial network (GAN) backprojection with the help of an encoder network. The proposed diarization system is trained jointly with GAN loss, latent variable recovery loss, and a clustering-specific loss. It uses x-vector speaker embeddings at the input, while the latent variables are sampled from a co…
▽ More
In this work, we propose deep latent space clustering for speaker diarization using generative adversarial network (GAN) backprojection with the help of an encoder network. The proposed diarization system is trained jointly with GAN loss, latent variable recovery loss, and a clustering-specific loss. It uses x-vector speaker embeddings at the input, while the latent variables are sampled from a combination of continuous random variables and discrete one-hot encoded variables using the original speaker labels. We benchmark our proposed system on the AMI meeting corpus, and two child-clinician interaction corpora (ADOS and BOSCC) from the autism diagnosis domain. ADOS and BOSCC contain diagnostic and treatment outcome sessions respectively obtained in clinical settings for verbal children and adolescents with autism. Experimental results show that our proposed system significantly outperform the state-of-the-art x-vector based diarization system on these databases. Further, we perform embedding fusion with x-vectors to achieve a relative DER improvement of 31%, 36% and 49% on AMI eval, ADOS and BOSCC corpora respectively, when compared to the x-vector baseline using oracle speech segmentation.
△ Less
Submitted 24 October, 2019;
originally announced October 2019.
-
Robust Translational Force Control of Multi-Rotor UAV for Precise Acceleration Tracking
Authors:
Seung Jae Lee,
Seung Hyun Kim,
H. Jin Kim
Abstract:
In this paper, we introduce a translational force control method with disturbance observer (DOB)-based force disturbance cancellation for precise three-dimensional acceleration control of a multi-rotor UAV. The acceleration control of the multi-rotor requires conversion of the desired acceleration signal to the desired roll, pitch, and total thrust. But because the attitude dynamics and the thrust…
▽ More
In this paper, we introduce a translational force control method with disturbance observer (DOB)-based force disturbance cancellation for precise three-dimensional acceleration control of a multi-rotor UAV. The acceleration control of the multi-rotor requires conversion of the desired acceleration signal to the desired roll, pitch, and total thrust. But because the attitude dynamics and the thrust dynamics are different, simple kinematic signal conversion without consideration of those difference can cause serious performance degradation in acceleration tracking. Unlike most existing translational force control techniques that are based on such simple inversion, our new method allows controlling the acceleration of the multi-rotor more precisely by considering the dynamics of the multi-rotor during the kinematic inversion. By combining the DOB with the translational force system that includes the improved conversion technique, we achieve robustness with respect to the external force disturbances that hinders the accurate acceleration control. mu-analysis is performed to ensure the robust stability of the overall closed-loop system, considering the combined effect of various possible model uncertainties. Both simulation and experiment are conducted to validate the proposed technique, which confirms the satisfactory performance to track the desired acceleration of the multi-rotor.
△ Less
Submitted 14 August, 2019;
originally announced August 2019.
-
Airport Gate Scheduling for Passengers, Aircraft, and Operation
Authors:
Sang Hyun Kim,
Eric Feron,
John-Paul Clarke,
Aude Marzuoli,
Daniel Delahaye
Abstract:
Passengers' experience is becoming a key metric to evaluate the air transportation system's performance. Efficient and robust tools to handle airport operations are needed along with a better understanding of passengers' interests and concerns. Among various airport operations, this paper studies airport gate scheduling for improved passengers' experience. Three objectives accounting for passenger…
▽ More
Passengers' experience is becoming a key metric to evaluate the air transportation system's performance. Efficient and robust tools to handle airport operations are needed along with a better understanding of passengers' interests and concerns. Among various airport operations, this paper studies airport gate scheduling for improved passengers' experience. Three objectives accounting for passengers, aircraft, and operation are presented. Trade-offs between these objectives are analyzed, and a balancing objective function is proposed. The results show that the balanced objective can improve the efficiency of traffic flow in passenger terminals and on ramps, as well as the robustness of gate operations.
△ Less
Submitted 15 January, 2013;
originally announced January 2013.