-
WARM-3D: A Weakly-Supervised Sim2Real Domain Adaptation Framework for Roadside Monocular 3D Object Detection
Authors:
Xingcheng Zhou,
Deyu Fu,
Walter Zimmer,
Mingyu Liu,
Venkatnarayanan Lakshminarasimhan,
Leah Strand,
Alois C. Knoll
Abstract:
Existing roadside perception systems are limited by the absence of publicly available, large-scale, high-quality 3D datasets. Exploring the use of cost-effective, extensive synthetic datasets offers a viable solution to tackle this challenge and enhance the performance of roadside monocular 3D detection. In this study, we introduce the TUMTraf Synthetic Dataset, offering a diverse and substantial collection of high-quality 3D data to augment scarce real-world datasets. Besides, we present WARM-3D, a concise yet effective framework to aid the Sim2Real domain transfer for roadside monocular 3D detection. Our method leverages cheap synthetic datasets and 2D labels from an off-the-shelf 2D detector for weak supervision. We show that WARM-3D significantly enhances performance, achieving a +12.40% increase in mAP 3D over the baseline with only pseudo-2D supervision. With 2D GT as weak labels, WARM-3D even reaches performance close to the Oracle baseline. Moreover, WARM-3D improves the ability of 3D detectors to recognize unseen samples across various real-world environments, highlighting its potential for practical applications.
Submitted 30 July, 2024;
originally announced July 2024.
-
TUMTraf Event: Calibration and Fusion Resulting in a Dataset for Roadside Event-Based and RGB Cameras
Authors:
Christian Creß,
Walter Zimmer,
Nils Purschke,
Bach Ngoc Doan,
Sven Kirchner,
Venkatnarayanan Lakshminarasimhan,
Leah Strand,
Alois C. Knoll
Abstract:
Event-based cameras are predestined for Intelligent Transportation Systems (ITS). They provide very high temporal resolution and dynamic range, which can eliminate motion blur and improve detection performance at night. However, event-based images lack color and texture compared to images from a conventional RGB camera. Considering that, data fusion between event-based and conventional cameras can combine the strengths of both modalities. For this purpose, extrinsic calibration is necessary. To the best of our knowledge, no targetless calibration between event-based and RGB cameras can handle multiple moving objects, nor does data fusion optimized for the domain of roadside ITS exist. Furthermore, synchronized event-based and RGB camera datasets considering roadside perspective are not yet published. To fill these research gaps, based on our previous work, we extended our targetless calibration approach with clustering methods to handle multiple moving objects. Furthermore, we developed an early fusion, simple late fusion, and a novel spatiotemporal late fusion method. Lastly, we published the TUMTraf Event Dataset, which contains more than 4,111 synchronized event-based and RGB images with 50,496 labeled 2D boxes. During our extensive experiments, we verified the effectiveness of our calibration method with multiple moving objects. Furthermore, compared to a single RGB camera, we increased the detection performance by up to +9 % mAP in the day and up to +13 % mAP during the challenging night with our presented event-based sensor fusion methods. The TUMTraf Event Dataset is available at https://innovation-mobility.com/tumtraf-dataset.
Submitted 9 March, 2024; v1 submitted 16 January, 2024;
originally announced January 2024.
-
A9-Dataset: Multi-Sensor Infrastructure-Based Dataset for Mobility Research
Authors:
Christian Creß,
Walter Zimmer,
Leah Strand,
Venkatnarayanan Lakshminarasimhan,
Maximilian Fortkord,
Siyi Dai,
Alois Knoll
Abstract:
Data-intensive machine learning based techniques increasingly play a prominent role in the development of future mobility solutions - from driver assistance and automation functions in vehicles, to real-time traffic management systems realized through dedicated infrastructure. The availability of high quality real-world data is often an important prerequisite for the development and reliable deployment of such systems at large scale. Towards this endeavour, we present the A9-Dataset based on roadside sensor infrastructure from the 3 km long Providentia++ test field near Munich in Germany. The dataset includes anonymized and precision-timestamped multi-modal sensor and object data in high resolution, covering a variety of traffic situations. As part of the first set of data, which we describe in this paper, we provide camera and LiDAR frames from two overhead gantry bridges on the A9 autobahn with the corresponding objects labeled with 3D bounding boxes. The first set includes in total more than 1000 sensor frames and 14000 traffic objects. The dataset is available for download at https://a9-dataset.com.
Submitted 13 May, 2022; v1 submitted 13 April, 2022;
originally announced April 2022.
-
Improving Voice Trigger Detection with Metric Learning
Authors:
Prateeth Nayak,
Takuya Higuchi,
Anmol Gupta,
Shivesh Ranjan,
Stephen Shum,
Siddharth Sigtia,
Erik Marchi,
Varun Lakshminarasimhan,
Minsik Cho,
Saurabh Adya,
Chandra Dhir,
Ahmed Tewfik
Abstract:
Voice trigger detection is an important task, which enables activating a voice assistant when a target user speaks a keyword phrase. A detector is typically trained on speech data independent of speaker information and used for the voice trigger detection task. However, such a speaker independent voice trigger detector typically suffers from performance degradation on speech from underrepresented groups, such as accented speakers. In this work, we propose a novel voice trigger detector that can use a small number of utterances from a target speaker to improve detection accuracy. Our proposed model employs an encoder-decoder architecture. While the encoder performs speaker independent voice trigger detection, similar to the conventional detector, the decoder predicts a personalized embedding for each utterance. A personalized voice trigger score is then obtained as a similarity score between the embeddings of enrollment utterances and a test utterance. The personalized embedding allows adapting to target speaker's speech when computing the voice trigger score, hence improving voice trigger detection accuracy. Experimental results show that the proposed approach achieves a 38% relative reduction in a false rejection rate (FRR) compared to a baseline speaker independent voice trigger model.
Submitted 13 September, 2022; v1 submitted 5 April, 2022;
originally announced April 2022.
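The scoring step in the abstract above (a similarity score between the embeddings of enrollment utterances and a test utterance) could be sketched roughly as follows. This is only an illustration: the choice of cosine similarity, the averaging over enrollment utterances, and the names `cosine_similarity` and `personalized_score` are assumptions, not details taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_a * norm_b)


def personalized_score(enrollment_embeddings, test_embedding):
    """Average similarity of the test embedding to each enrollment embedding.

    Averaging is an assumption made for this sketch; the abstract only says
    that the score is a similarity between enrollment and test embeddings.
    """
    scores = [cosine_similarity(e, test_embedding) for e in enrollment_embeddings]
    return sum(scores) / len(scores)


# Hypothetical 2-D embeddings, purely for illustration.
enrollment = [[1.0, 0.0], [0.0, 1.0]]
score = personalized_score(enrollment, [1.0, 0.0])
```

A higher score would indicate that the test utterance is closer to the enrolled speaker's utterances in the embedding space.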
-
Federated Learning Framework Coping with Hierarchical Heterogeneity in Cooperative ITS
Authors:
Rui Song,
Liguo Zhou,
Venkatnarayanan Lakshminarasimhan,
Andreas Festag,
Alois Knoll
Abstract:
Deep learning is a key approach for the environment perception function of Cooperative Intelligent Transportation Systems (C-ITS) with autonomous vehicles and smart traffic infrastructure. In today's C-ITS, smart traffic participants are capable of timely generating and transmitting a large amount of data. However, these data cannot be used for model training directly due to privacy constraints. In this paper, we introduce a federated learning framework coping with Hierarchical Heterogeneity (H2-Fed), which can notably enhance the conventional pre-trained deep learning model. The framework exploits data from connected public traffic agents in vehicular networks without affecting user data privacy. By coordinating existing traffic infrastructure, including roadside units and road traffic clouds, the model parameters are efficiently disseminated by vehicular communications and hierarchically aggregated. Considering the individual heterogeneity of data distribution, computational and communication capabilities across traffic agents and roadside units, we employ a novel method that addresses the heterogeneity of different aggregation layers of the framework architecture, i.e., aggregation in layers of roadside units and cloud. The experimental results indicate that our method can well balance the learning accuracy and stability according to the knowledge of heterogeneity in current communication networks. Compared to other baseline approaches, the evaluation on federated datasets shows that our framework is more general and capable especially in application scenarios with low communication quality. Even when 90% of the agents are timely disconnected, the pre-trained deep learning model can still be forced to converge stably, and its accuracy can be enhanced from 68% to over 90% after convergence.
Submitted 28 July, 2022; v1 submitted 1 April, 2022;
originally announced April 2022.
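The two-level aggregation mentioned in the abstract above (first at roadside units, then in the cloud) could be pictured with a minimal sketch like the one below. It uses plain sample-count-weighted averaging in the style of FedAvg, which is an assumption; it does not reproduce the heterogeneity-aware method of H2-Fed, and the names `weighted_average` and `hierarchical_aggregate` are hypothetical.

```python
def weighted_average(models, weights):
    """Weighted average of parameter vectors (lists of floats)."""
    total = sum(weights)
    dim = len(models[0])
    return [sum(w * m[i] for m, w in zip(models, weights)) / total
            for i in range(dim)]


def hierarchical_aggregate(rsu_groups):
    """Aggregate agent models per roadside unit (RSU), then across RSUs.

    rsu_groups: one list per RSU, each holding (parameters, sample_count)
    tuples for the agents connected to that RSU. Weighting by sample count
    is an assumption of this sketch.
    """
    rsu_models, rsu_weights = [], []
    for group in rsu_groups:
        params = [p for p, _ in group]
        counts = [n for _, n in group]
        rsu_models.append(weighted_average(params, counts))  # RSU layer
        rsu_weights.append(sum(counts))
    return weighted_average(rsu_models, rsu_weights)  # cloud layer


# Two RSUs: the first serves two agents, the second serves one.
groups = [
    [([1.0, 2.0], 1), ([3.0, 4.0], 3)],
    [([5.0, 6.0], 4)],
]
global_model = hierarchical_aggregate(groups)
```

With sample-count weights at both layers, the result equals a single flat weighted average over all agents; a hierarchical scheme only behaves differently once the layers use different weighting or update rules.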
-
Whispered and Lombard Neural Speech Synthesis
Authors:
Qiong Hu,
Tobias Bleisch,
Petko Petkov,
Tuomo Raitio,
Erik Marchi,
Varun Lakshminarasimhan
Abstract:
It is desirable for a text-to-speech system to take into account the environment where synthetic speech is presented, and provide appropriate context-dependent output to the user. In this paper, we present and compare various approaches for generating different speaking styles, namely, normal, Lombard, and whisper speech, using only limited data. The following systems are proposed and assessed: 1) Pre-training and fine-tuning a model for each style. 2) Lombard and whisper speech conversion through a signal processing based approach. 3) Multi-style generation using a single model based on a speaker verification model. Our mean opinion score and AB preference listening tests show that 1) we can generate high quality speech through the pre-training/fine-tuning approach for all speaking styles. 2) Although our speaker verification (SV) model is not explicitly trained to discriminate different speaking styles, and no Lombard and whisper voice is used for pre-training this system, the SV model can be used as a style encoder for generating different style embeddings as input for the Tacotron system. We also show that the resulting synthetic Lombard speech has a significant positive impact on intelligibility gain.
Submitted 13 January, 2021;
originally announced January 2021.
-
Incorporating Bilingual Dictionaries for Low Resource Semi-Supervised Neural Machine Translation
Authors:
Sreyashi Nag,
Mihir Kale,
Varun Lakshminarasimhan,
Swapnil Singhavi
Abstract:
We explore ways of incorporating bilingual dictionaries to enable semi-supervised neural machine translation. Conventional back-translation methods have shown success in leveraging target side monolingual data. However, since the quality of back-translation models is tied to the size of the available parallel corpora, this could adversely impact the synthetically generated sentences in a low resource setting. We propose a simple data augmentation technique to address this shortcoming. We incorporate widely available bilingual dictionaries that yield word-by-word translations to generate synthetic sentences. This automatically expands the vocabulary of the model while maintaining high quality content. Our method shows an appreciable improvement in performance over strong baselines.
Submitted 4 April, 2020;
originally announced April 2020.
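The word-by-word augmentation described in the abstract above could look roughly like the sketch below. The translation direction, the handling of out-of-dictionary words, and the names `word_by_word_translate` and `make_synthetic_pairs` are assumptions made for illustration; the abstract does not specify them.

```python
def word_by_word_translate(sentence, dictionary, keep_unknown=True):
    """Translate a sentence token by token using a bilingual dictionary.

    Tokens missing from the dictionary are copied through unchanged when
    keep_unknown is True (an assumption), otherwise they are dropped.
    """
    out = []
    for token in sentence.split():
        translated = dictionary.get(token.lower())
        if translated is not None:
            out.append(translated)
        elif keep_unknown:
            out.append(token)
    return " ".join(out)


def make_synthetic_pairs(monolingual_sentences, dictionary):
    """Pair each monolingual sentence with its word-by-word translation.

    Returns (synthetic, original) tuples that could be added to a parallel
    training corpus; which side counts as source is an assumption here.
    """
    return [(word_by_word_translate(s, dictionary), s)
            for s in monolingual_sentences]


# Toy English-to-French dictionary, purely for illustration.
toy_dict = {"the": "le", "cat": "chat"}
pairs = make_synthetic_pairs(["the cat sleeps"], toy_dict)
```

Copying unknown tokens through keeps sentence length stable, at the cost of mixing languages in the synthetic side.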
-
Providentia -- A Large-Scale Sensor System for the Assistance of Autonomous Vehicles and Its Evaluation
Authors:
Annkathrin Krämmer,
Christoph Schöller,
Dhiraj Gulati,
Venkatnarayanan Lakshminarasimhan,
Franz Kurz,
Dominik Rosenbaum,
Claus Lenz,
Alois Knoll
Abstract:
The environmental perception of an autonomous vehicle is limited by its physical sensor ranges and algorithmic performance, as well as by occlusions that degrade its understanding of an ongoing traffic situation. This not only poses a significant threat to safety and limits driving speeds, but it can also lead to inconvenient maneuvers. Intelligent Infrastructure Systems can help to alleviate these problems. An Intelligent Infrastructure System can fill in the gaps in a vehicle's perception and extend its field of view by providing additional detailed information about its surroundings, in the form of a digital model of the current traffic situation, i.e. a digital twin. However, detailed descriptions of such systems and working prototypes demonstrating their feasibility are scarce. In this paper, we propose a hardware and software architecture that enables such a reliable Intelligent Infrastructure System to be built. We have implemented this system in the real world and demonstrate its ability to create an accurate digital twin of an extended highway stretch, thus enhancing an autonomous vehicle's perception beyond the limits of its on-board sensors. Furthermore, we evaluate the accuracy and reliability of the digital twin by using aerial images and earth observation methods for generating ground truth data.
Submitted 8 December, 2021; v1 submitted 16 June, 2019;
originally announced June 2019.
-
Efficient Low-rank Multimodal Fusion with Modality-Specific Factors
Authors:
Zhun Liu,
Ying Shen,
Varun Bharadhwaj Lakshminarasimhan,
Paul Pu Liang,
Amir Zadeh,
Louis-Philippe Morency
Abstract:
Multimodal research is an emerging field of artificial intelligence, and one of the main research problems in this field is multimodal fusion. The fusion of multimodal data is the process of integrating multiple unimodal representations into one compact multimodal representation. Previous research in this field has exploited the expressiveness of tensors for multimodal representation. However, these methods often suffer from exponential increase in dimensions and in computational complexity introduced by transformation of input into tensor. In this paper, we propose the Low-rank Multimodal Fusion method, which performs multimodal fusion using low-rank tensors to improve efficiency. We evaluate our model on three different tasks: multimodal sentiment analysis, speaker trait analysis, and emotion recognition. Our model achieves competitive results on all these tasks while drastically reducing computational complexity. Additional experiments also show that our model can perform robustly for a wide range of low-rank settings, and is indeed much more efficient in both training and inference compared to other methods that utilize tensor representations.
Submitted 31 May, 2018;
originally announced June 2018.