Showing 1–5 of 5 results for author: Schmid, C

Searching in archive eess.
  1. arXiv:2303.16501  [pdf, other]

    cs.CV cs.SD eess.AS

    AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR

    Authors: Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid

    Abstract: Audiovisual automatic speech recognition (AV-ASR) aims to improve the robustness of a speech recognition system by incorporating visual information. Training fully supervised multimodal models for this task from scratch, however, is limited by the need for large labelled audiovisual datasets (in each downstream domain of interest). We present AVFormer, a simple method for augmenting audio-only mode…

    Submitted 29 March, 2023; originally announced March 2023.

    Comments: CVPR 2023

  2. arXiv:2211.09966  [pdf, ps, other]

    cs.CV cs.MM cs.SD eess.AS eess.IV

    AVATAR submission to the Ego4D AV Transcription Challenge

    Authors: Paul Hongsuck Seo, Arsha Nagrani, Cordelia Schmid

    Abstract: In this report, we describe our submission to the Ego4D AudioVisual (AV) Speech Transcription Challenge 2022. Our pipeline is based on AVATAR, a state-of-the-art encoder-decoder model for AV-ASR that performs early fusion of spectrograms and RGB images. We describe the datasets, experimental settings and ablations. Our final method achieves a WER of 68.40 on the challenge test set, outperforming t…

    Submitted 17 November, 2022; originally announced November 2022.

  3. arXiv:2206.07684  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    AVATAR: Unconstrained Audiovisual Speech Recognition

    Authors: Valentin Gabeur, Paul Hongsuck Seo, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid

    Abstract: Audio-visual automatic speech recognition (AV-ASR) is an extension of ASR that incorporates visual cues, often from the movements of a speaker's mouth. Unlike works that simply focus on the lip motion, we investigate the contribution of entire visual frames (visual actions, objects, background etc.). This is particularly useful for unconstrained videos, where the speaker is not necessarily visible…

    Submitted 15 June, 2022; originally announced June 2022.

  4. arXiv:2204.00679  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    Learning Audio-Video Modalities from Image Captions

    Authors: Arsha Nagrani, Paul Hongsuck Seo, Bryan Seybold, Anja Hauth, Santiago Manen, Chen Sun, Cordelia Schmid

    Abstract: A major challenge in text-video and text-audio retrieval is the lack of large-scale training data. This is unlike image-captioning, where datasets are on the order of millions of samples. To close this gap we propose a new video mining pipeline which involves transferring captions from image captioning datasets to video clips with no additional manual effort. Using this pipeline, we create a new l…

    Submitted 1 April, 2022; originally announced April 2022.

  5. arXiv:1901.01342  [pdf, other]

    cs.CV cs.MM cs.SD eess.AS

    AVA-ActiveSpeaker: An Audio-Visual Dataset for Active Speaker Detection

    Authors: Joseph Roth, Sourish Chaudhuri, Ondrej Klejch, Radhika Marvin, Andrew Gallagher, Liat Kaver, Sharadh Ramaswamy, Arkadiusz Stopczynski, Cordelia Schmid, Zhonghua Xi, Caroline Pantofaru

    Abstract: Active speaker detection is an important component in video analysis algorithms for applications such as speaker diarization, video re-targeting for meetings, speech enhancement, and human-robot interaction. The absence of a large, carefully labeled audio-visual dataset for this task has constrained algorithm evaluations with respect to data diversity, environments, and accuracy. This has made com…

    Submitted 24 May, 2019; v1 submitted 4 January, 2019; originally announced January 2019.