Showing 1–29 of 29 results for author: Dinkel, H

  1. arXiv:2406.13275

    cs.SD cs.CL eess.AS

    Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding

    Authors: Jizhong Liu, Gang Li, Junbo Zhang, Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Yujun Wang, Bin Wang

    Abstract: Automated audio captioning (AAC) is an audio-to-text task to describe audio contents in natural language. Recently, the advancements in large language models (LLMs), with improvements in training approaches for audio encoders, have opened up possibilities for improving AAC. Thus, we explore enhancing AAC from three aspects: 1) a pre-trained audio encoder via consistent ensemble distillation (CED)…

    Submitted 25 June, 2024; v1 submitted 19 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  2. arXiv:2406.07012

    cs.SD cs.CL eess.AS

    Bridging Language Gaps in Audio-Text Retrieval

    Authors: Zhiyong Yan, Heinrich Dinkel, Yongqing Wang, Jizhong Liu, Junbo Zhang, Yujun Wang, Bin Wang

    Abstract: Audio-text retrieval is a challenging task, requiring the search for an audio clip or a text caption within a database. The predominant focus of existing research on English descriptions poses a limitation on the applicability of such models, given the abundance of non-English content in real-world data. To address these linguistic disparities, we propose a language enhancement (LE), using a multi…

    Submitted 16 June, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

    Comments: Interspeech 2024

  3. arXiv:2406.06992

    cs.SD eess.AS

    Scaling up masked audio encoder learning for general audio classification

    Authors: Heinrich Dinkel, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang, Bin Wang

    Abstract: Despite progress in audio classification, a generalization gap remains between speech and other sound domains, such as environmental sounds and music. Models trained for speech tasks often fail to perform well on environmental or musical audio tasks, and vice versa. While self-supervised (SSL) audio representations offer an alternative, there has been limited exploration of scaling both model and…

    Submitted 13 June, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

    Comments: Interspeech 2024

  4. arXiv:2312.02396

    cs.RO cs.CV cs.LG

    Unsupervised Change Detection for Space Habitats Using 3D Point Clouds

    Authors: Jamie Santos, Holly Dinkel, Julia Di, Paulo V. K. Borges, Marina Moreira, Oleg Alexandrov, Brian Coltin, Trey Smith

    Abstract: This work presents an algorithm for scene change detection from point clouds to enable autonomous robotic caretaking in future space habitats. Autonomous robotic systems will help maintain future deep-space habitats, such as the Gateway space station, which will be uncrewed for extended periods. Existing scene analysis software used on the International Space Station (ISS) relies on manually-label…

    Submitted 5 August, 2024; v1 submitted 4 December, 2023; originally announced December 2023.

    Comments: 15 pages, 7 figures, Manuscript was presented at the AIAA SciTech Forum in Orlando, FL, USA, 8 - 12 January 2024. Video presentation: [https://www.youtube.com/watch?v=7WHp0dQYG4Y]. Code: [https://github.com/nasa/isaac/tree/master/anomaly/gmm-change-detection]

    Report number: AIAA 2024-1960

    Journal ref: AIAA SCITECH 2024 Forum

  5. Multi-Agent 3D Map Reconstruction and Change Detection in Microgravity with Free-Flying Robots

    Authors: Holly Dinkel, Julia Di, Jamie Santos, Keenan Albee, Paulo Borges, Marina Moreira, Oleg Alexandrov, Brian Coltin, Trey Smith

    Abstract: Assistive free-flyer robots autonomously caring for future crewed outposts -- such as NASA's Astrobee robots on the International Space Station (ISS) -- must be able to detect day-to-day interior changes to track inventory, detect and diagnose faults, and monitor the outpost status. This work presents a framework for multi-agent cooperative mapping and change detection to enable robotic maintenanc…

    Submitted 6 August, 2024; v1 submitted 4 November, 2023; originally announced November 2023.

    Comments: 11 pages, 8 figures, Manuscript presented at the 74th International Astronautical Congress, IAC 2023, Baku, Azerbaijan, 2 - 6 October 2023. Video presentation: [https://www.youtube.com/watch?v=VfjV-zwFEtU]. Code: [https://github.com/hollydinkel/astrobeecd]

    Journal ref: Acta Astronautica 223 (2024) 98-107

  6. arXiv:2310.13245

    cs.RO

    Simultaneous Shape Tracking of Multiple Deformable Linear Objects with Global-Local Topology Preservation

    Authors: Jingyi Xiang, Holly Dinkel

    Abstract: This work presents an algorithm for tracking the shape of multiple entangling Deformable Linear Objects (DLOs) from a sequence of RGB-D images. This algorithm runs in real-time and improves on previous single-DLO tracking approaches by enabling tracking of multiple objects. This is achieved using Global-Local Topology Preservation (GLTP). This work uses the geodesic distance in GLTP to define the…

    Submitted 23 October, 2023; v1 submitted 19 October, 2023; originally announced October 2023.

    Comments: 3 pages, 3 figures, presented at the 3rd Workshop on Representing and Manipulating Deformable Objects at the IEEE International Conference on Robotics and Automation. Video presentation [https://youtu.be/hfiqwMxitqA]. 3rd Workshop on Representing and Manipulating Deformable Objects [https://deformable-workshop.github.io/icra2023/]

  7. arXiv:2310.08233

    cs.RO cs.AI

    The Impact of Time Step Frequency on the Realism of Robotic Manipulation Simulation for Objects of Different Scales

    Authors: Minh Q. Ta, Holly Dinkel, Hameed Abdul-Rashid, Yangfei Dai, Jessica Myers, Tan Chen, Junyi Geng, Timothy Bretl

    Abstract: This work evaluates the impact of time step frequency and component scale on robotic manipulation simulation accuracy. Increasing the time step frequency for small-scale objects is shown to improve simulation accuracy. This simulation, demonstrating pre-assembly part picking for two object geometries, serves as a starting point for discussing how to improve Sim2Real transfer in robotic assembly pr…

    Submitted 12 October, 2023; originally announced October 2023.

    Comments: 3 pages, 3 figures, Best Poster Finalist at the 2023 Robotics and AI in Future Factory Workshop at the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Video presentation [https://www.youtube.com/watch?v=JOXrBpMmI0A]. Robotics and AI in Future Factory workshop [https://sites.google.com/view/robot-ai-future-factory/]

  8. arXiv:2308.11957

    cs.SD eess.AS

    CED: Consistent ensemble distillation for audio tagging

    Authors: Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Junbo Zhang, Yujun Wang

    Abstract: Augmentation and knowledge distillation (KD) are well-established techniques employed in audio classification tasks, aimed at enhancing performance and reducing model sizes on the widely recognized Audioset (AS) benchmark. Although both techniques are effective individually, their combined use, called consistent teaching, hasn't been explored before. This paper proposes CED, a simple training fram…

    Submitted 7 September, 2023; v1 submitted 23 August, 2023; originally announced August 2023.

  9. arXiv:2306.16241

    cs.SD eess.AS

    Focus on the Sound around You: Monaural Target Speaker Extraction via Distance and Speaker Information

    Authors: Jiuxin Lin, Peng Wang, Heinrich Dinkel, Jun Chen, Zhiyong Wu, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang

    Abstract: Previously, Target Speaker Extraction (TSE) has yielded outstanding performance in certain application scenarios for speech enhancement and source separation. However, obtaining auxiliary speaker-related information is still challenging in noisy environments with significant reverberation. Inspired by the recently proposed distance-based sound separation, we propose the near sound (NS) extractor,…

    Submitted 7 October, 2023; v1 submitted 28 June, 2023; originally announced June 2023.

    Comments: Proc. INTERSPEECH 2023, 2488-2492, doi: 10.21437/Interspeech.2023-218

  10. arXiv:2306.14170

    cs.MM cs.SD eess.AS

    AV-SepFormer: Cross-Attention SepFormer for Audio-Visual Target Speaker Extraction

    Authors: Jiuxin Lin, Xinyu Cai, Heinrich Dinkel, Jun Chen, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Zhiyong Wu, Yujun Wang, Helen Meng

    Abstract: Visual information can serve as an effective cue for target speaker extraction (TSE) and is vital to improving extraction performance. In this paper, we propose AV-SepFormer, a SepFormer-based attention dual-scale model that utilizes cross- and self-attention to fuse and model features from audio and visual. AV-SepFormer splits the audio feature into a number of chunks, equivalent to the length of…

    Submitted 25 June, 2023; originally announced June 2023.

    Comments: Accepted by ICASSP2023

  11. arXiv:2305.18794

    cs.SD eess.AS

    Understanding temporally weakly supervised training: A case study for keyword spotting

    Authors: Heinrich Dinkel, Weiji Zhuang, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang

    Abstract: The currently most prominent algorithm to train keyword spotting (KWS) models with deep neural networks (DNNs) requires strong supervision i.e., precise knowledge of the spoken keyword location in time. Thus, most KWS approaches treat the presence of redundant data, such as noise, within their training set as an obstacle. A common training paradigm to deal with data redundancies is to use temporal…

    Submitted 30 May, 2023; originally announced May 2023.

  12. arXiv:2305.17834

    cs.SD eess.AS

    Streaming Audio Transformers for Online Audio Tagging

    Authors: Heinrich Dinkel, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang, Bin Wang

    Abstract: Transformers have emerged as a prominent model framework for audio tagging (AT), boasting state-of-the-art (SOTA) performance on the widely-used Audioset dataset. However, their impressive performance often comes at the cost of high memory usage, slow inference speed, and considerable model delay, rendering them impractical for real-world AT applications. In this study, we introduce streaming audi…

    Submitted 10 June, 2024; v1 submitted 28 May, 2023; originally announced May 2023.

    Comments: Interspeech 2024

  13. arXiv:2303.01812

    cs.SD eess.AS

    Unified Keyword Spotting and Audio Tagging on Mobile Devices with Transformers

    Authors: Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Junbo Zhang, Yujun Wang

    Abstract: Keyword spotting (KWS) is a core human-machine-interaction front-end task for most modern intelligent assistants. Recently, a unified (UniKW-AT) framework has been proposed that adds additional capabilities in the form of audio tagging (AT) to a KWS model. However, previous work did not consider the real-world deployment of a UniKW-AT model, where factors such as model size and inference speed are…

    Submitted 3 March, 2023; originally announced March 2023.

    Comments: ICASSP 2023

  14. An empirical study of weakly supervised audio tagging embeddings for general audio representations

    Authors: Heinrich Dinkel, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang

    Abstract: We study the usability of pre-trained weakly supervised audio tagging (AT) models as feature extractors for general audio representations. We mainly analyze the feasibility of transferring those embeddings to other tasks within the speech and sound domains. Specifically, we benchmark weakly supervised pre-trained models (MobileNetV2 and EfficientNet-B0) against modern self-supervised learning meth…

    Submitted 29 September, 2022; originally announced September 2022.

    Comments: Odyssey 2022

  15. UniKW-AT: Unified Keyword Spotting and Audio Tagging

    Authors: Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Junbo Zhang, Yujun Wang

    Abstract: Within the audio research community and the industry, keyword spotting (KWS) and audio tagging (AT) are seen as two distinct tasks and research fields. However, from a technical point of view, both of these tasks are identical: they predict a label (keyword in KWS, sound event in AT) for some fixed-sized input audio segment. This work proposes UniKW-AT: An initial approach for jointly training bot…

    Submitted 22 September, 2022; originally announced September 2022.

    Comments: Accepted in Interspeech2022

  16. arXiv:2205.14340

    cs.RO eess.SY

    Insights from an Industrial Collaborative Assembly Project: Lessons in Research and Collaboration

    Authors: Tan Chen, Zhe Huang, James Motes, Junyi Geng, Quang Minh Ta, Holly Dinkel, Hameed Abdul-Rashid, Jessica Myers, Ye-Ji Mun, Wei-che Lin, Yuan-yung Huang, Sizhe Liu, Marco Morales, Nancy M. Amato, Katherine Driggs-Campbell, Timothy Bretl

    Abstract: Significant progress in robotics reveals new opportunities to advance manufacturing. Next-generation industrial automation will require both integration of distinct robotic technologies and their application to challenging industrial environments. This paper presents lessons from a collaborative assembly project between three academic research groups and an industry partner. The goal of the projec…

    Submitted 28 May, 2022; originally announced May 2022.

    Comments: Spotlight presentation at ICRA 2022 Workshop on Collaborative Robots and the Work of the Future (ICRA 2022 CoR-WotF); see the spotlight presentation at https://sites.google.com/view/icra22ws-cor-wotf/accepted-papers?authuser=0

  17. arXiv:2204.13430

    cs.SD eess.AS

    Pseudo strong labels for large scale weakly supervised audio tagging

    Authors: Heinrich Dinkel, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang

    Abstract: Large-scale audio tagging datasets inevitably contain imperfect labels, such as clip-wise annotated (temporally weak) tags with no exact on- and offsets, due to a high manual labeling cost. This work proposes pseudo strong labels (PSL), a simple label augmentation framework that enhances the supervision quality for large-scale weakly supervised audio tagging. A machine annotator is first trained o…

    Submitted 28 April, 2022; originally announced April 2022.

    Comments: Accepted by ICASSP 2022

  18. Voice activity detection in the wild: A data-driven approach using teacher-student training

    Authors: Heinrich Dinkel, Shuai Wang, Xuenan Xu, Mengyue Wu, Kai Yu

    Abstract: Voice activity detection is an essential pre-processing component for speech-related tasks such as automatic speech recognition (ASR). Traditional supervised VAD systems obtain frame-level labels from an ASR pipeline by using, e.g., a Hidden Markov model. These ASR models are commonly trained on clean and fully transcribed data, limiting VAD systems to be trained on clean or synthetically noised d…

    Submitted 9 May, 2021; originally announced May 2021.

    Journal ref: IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 1542-1555, 2021

  19. arXiv:2102.11474

    cs.SD eess.AS

    Text-to-Audio Grounding: Building Correspondence Between Captions and Sound Events

    Authors: Xuenan Xu, Heinrich Dinkel, Mengyue Wu, Kai Yu

    Abstract: Automated Audio Captioning is a cross-modal task, generating natural language descriptions to summarize the audio clips' sound events. However, grounding the actual sound events in the given audio based on its corresponding caption has not been investigated. This paper contributes an AudioGrounding dataset, which provides the correspondence between sound events and the captions provided in Audioca…

    Submitted 22 February, 2021; originally announced February 2021.

  20. arXiv:2102.11457

    cs.SD eess.AS

    Investigating Local and Global Information for Automated Audio Captioning with Transfer Learning

    Authors: Xuenan Xu, Heinrich Dinkel, Mengyue Wu, Zeyu Xie, Kai Yu

    Abstract: Automated audio captioning (AAC) aims at generating summarizing descriptions for audio clips. Multitudinous concepts are described in an audio caption, ranging from local information such as sound events to global information like acoustic scenery. Currently, the mainstream paradigm for AAC is the end-to-end encoder-decoder architecture, expecting the encoder to learn all levels of concepts embedd…

    Submitted 22 February, 2021; originally announced February 2021.

  21. Towards duration robust weakly supervised sound event detection

    Authors: Heinrich Dinkel, Mengyue Wu, Kai Yu

    Abstract: Sound event detection (SED) is the task of tagging the absence or presence of audio events and their corresponding interval within a given audio clip. While SED can be done using supervised machine learning, where training data is fully labeled with access to per event timestamps and duration, our work focuses on weakly-supervised sound event detection (WSSED), where prior knowledge about an event…

    Submitted 4 February, 2021; v1 submitted 19 January, 2021; originally announced January 2021.

  22. End-to-end spoofing detection with raw waveform CLDNNs

    Authors: Heinrich Dinkel, Nanxin Chen, Yanmin Qian, Kai Yu

    Abstract: Albeit recent progress in speaker verification generates powerful models, malicious attacks in the form of spoofed speech, are generally not coped with. Recent results in ASVSpoof2015 and BTAS2016 challenges indicate that spoof-aware features are a possible solution to this problem. Most successful methods in both challenges focus on spoof-aware features, rather than focusing on a powerful classif…

    Submitted 26 July, 2020; originally announced July 2020.

    Comments: 5 pages

    Journal ref: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

  23. arXiv:2007.06355

    cs.CV

    Multiple Sound Sources Localization from Coarse to Fine

    Authors: Rui Qian, Di Hu, Heinrich Dinkel, Mengyue Wu, Ning Xu, Weiyao Lin

    Abstract: How to visually localize multiple sound sources in unconstrained videos is a formidable problem, especially when lack of the pairwise sound-object annotations. To solve this problem, we develop a two-stage audiovisual learning framework that disentangles audio and visual representations of different categories from complex scenes, then performs cross-modal feature alignment in a coarse-to-fine man…

    Submitted 14 July, 2020; v1 submitted 13 July, 2020; originally announced July 2020.

    Comments: to appear in ECCV 2020

  24. arXiv:2003.12222

    cs.SD eess.AS

    Voice activity detection in the wild via weakly supervised sound event detection

    Authors: Heinrich Dinkel, Yefei Chen, Mengyue Wu, Kai Yu

    Abstract: Traditional supervised voice activity detection (VAD) methods work well in clean and controlled scenarios, with performance severely degrading in real-world applications. One possible bottleneck is that speech in the wild contains unpredictable noise types, hence frame-level label prediction is difficult, which is required for traditional supervised VAD training. In contrast, we propose a general-…

    Submitted 16 August, 2020; v1 submitted 26 March, 2020; originally announced March 2020.

    Comments: Accepted in Interspeech 2020

  25. arXiv:1910.13028

    cs.HC cs.SD eess.AS

    DEPA: Self-Supervised Audio Embedding for Depression Detection

    Authors: Pingyue Zhang, Mengyue Wu, Heinrich Dinkel, Kai Yu

    Abstract: Depression detection research has increased over the last few decades, one major bottleneck of which is the limited data availability and representation learning. Recently, self-supervised learning has seen success in pretraining text embeddings and has been applied broadly on related tasks with sparse data, while pretrained audio embeddings based on self-supervised learning are rarely investigate…

    Submitted 28 October, 2021; v1 submitted 28 October, 2019; originally announced October 2019.

    Journal ref: In Proceedings of the 29th ACM International Conference on Multimedia. Association for Computing Machinery, New York, NY, USA, 2021

  26. arXiv:1905.13448

    cs.SD cs.CL eess.AS

    Audio Caption in a Car Setting with a Sentence-Level Loss

    Authors: Xuenan Xu, Heinrich Dinkel, Mengyue Wu, Kai Yu

    Abstract: Captioning has attracted much attention in image and video understanding while a small amount of work examines audio captioning. This paper contributes a Mandarin-annotated dataset for audio captioning within a car scene. A sentence-level loss is proposed to be used in tandem with a GRU encoder-decoder model to generate captions with higher semantic similarity to human annotations. We evaluate the…

    Submitted 23 October, 2020; v1 submitted 31 May, 2019; originally announced May 2019.

  27. arXiv:1904.05154

    cs.LG cs.CL cs.HC

    Text-based depression detection on sparse data

    Authors: Heinrich Dinkel, Mengyue Wu, Kai Yu

    Abstract: Previous text-based depression detection is commonly based on large user-generated data. Sparse scenarios like clinical conversations are less investigated. This work proposes a text-based multi-task BGRU network with pretrained word embeddings to model patients' responses during clinical interviews. Our main approach uses a novel multi-task loss function, aiming at modeling both depression severi…

    Submitted 8 July, 2020; v1 submitted 7 April, 2019; originally announced April 2019.

  28. Duration robust weakly supervised sound event detection

    Authors: Heinrich Dinkel, Kai Yu

    Abstract: Task 4 of the DCASE2018 challenge demonstrated that substantially more research is needed for a real-world application of sound event detection. Analyzing the challenge results it can be seen that most successful models are biased towards predicting long (e.g., over 5s) clips. This work aims to investigate the performance impact of fixed-sized window median filter post-processing and advocate the…

    Submitted 26 January, 2020; v1 submitted 8 April, 2019; originally announced April 2019.

    Comments: Accepted by ICASSP2020

  29. arXiv:1902.09254

    cs.SD cs.CL eess.AS

    Audio Caption: Listen and Tell

    Authors: Mengyue Wu, Heinrich Dinkel, Kai Yu

    Abstract: Increasing amount of research has shed light on machine perception of audio events, most of which concerns detection and classification tasks. However, human-like perception of audio scenes involves not only detecting and classifying audio sounds, but also summarizing the relationship between different audio events. Comparable research such as image caption has been conducted, yet the audio field…

    Submitted 30 May, 2019; v1 submitted 25 February, 2019; originally announced February 2019.

    Comments: accepted by ICASSP2019