Showing 1–50 of 81 results for author: Kuehne, H

  1. arXiv:2407.20034

    cs.CV

    MaskInversion: Localized Embeddings via Optimization of Explainability Maps

    Authors: Walid Bousselham, Sofian Chaybouti, Christian Rupprecht, Vittorio Ferrari, Hilde Kuehne

    Abstract: Vision-language foundation models such as CLIP have achieved tremendous results in global vision-language alignment, but still show some limitations in creating representations for specific image regions. To address this problem, we propose MaskInversion, a method that leverages the feature representations of pre-trained foundation models, such as CLIP, to generate a context-aware embedding for…

    Submitted 29 July, 2024; originally announced July 2024.

    Comments: Project page: https://walidbousselham.com/MaskInversion

  2. arXiv:2407.04082

    eess.AS

    DASS: Distilled Audio State Space Models Are Stronger and More Duration-Scalable Learners

    Authors: Saurabhchand Bhati, Yuan Gong, Leonid Karlinsky, Hilde Kuehne, Rogerio Feris, James Glass

    Abstract: State-space models (SSMs) have emerged as an alternative to Transformers for audio modeling due to their high computational efficiency with long inputs. While recent efforts on Audio SSMs have reported encouraging results, two main limitations remain: First, in 10-second short audio tagging tasks, Audio SSMs still underperform compared to Transformer-based models such as Audio Spectrogram Transfor…

    Submitted 4 July, 2024; originally announced July 2024.

  3. arXiv:2406.10082

    eess.AS cs.CV cs.SD

    Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation

    Authors: Andrew Rouditchenko, Yuan Gong, Samuel Thomas, Leonid Karlinsky, Hilde Kuehne, Rogerio Feris, James Glass

    Abstract: Audio-Visual Speech Recognition (AVSR) uses lip-based video to improve performance in noise. Since videos are harder to obtain than audio, the video training data of AVSR models is usually limited to a few thousand hours. In contrast, speech models such as Whisper are trained with hundreds of thousands of hours of data, and thus learn a better speech-to-text decoder. The huge training data differe…

    Submitted 14 June, 2024; originally announced June 2024.

    Comments: Interspeech 2024. Code https://github.com/roudimit/whisper-flamingo

  4. arXiv:2404.03214

    cs.CV

    LeGrad: An Explainability Method for Vision Transformers via Feature Formation Sensitivity

    Authors: Walid Bousselham, Angie Boggust, Sofian Chaybouti, Hendrik Strobelt, Hilde Kuehne

    Abstract: Vision Transformers (ViTs), with their ability to model long-range dependencies through self-attention mechanisms, have become a standard architecture in computer vision. However, the interpretability of these models remains a challenge. To address this, we propose LeGrad, an explainability method specifically designed for ViTs. LeGrad computes the gradient with respect to the attention maps of Vi…

    Submitted 4 April, 2024; originally announced April 2024.

    Comments: Code available at https://github.com/WalBouss/LeGrad

  5. arXiv:2403.11755

    cs.CV cs.AI cs.LG

    Meta-Prompting for Automating Zero-shot Visual Recognition with LLMs

    Authors: M. Jehanzeb Mirza, Leonid Karlinsky, Wei Lin, Sivan Doveh, Jakub Micorek, Mateusz Kozinski, Hilde Kuehne, Horst Possegger

    Abstract: Prompt ensembling of Large Language Model (LLM) generated category-specific prompts has emerged as an effective method to enhance zero-shot recognition ability of Vision-Language Models (VLMs). To obtain these category-specific prompts, the present methods rely on hand-crafting the prompts to the LLMs for generating VLM prompts for the downstream tasks. However, this requires manually composing th…

    Submitted 7 August, 2024; v1 submitted 18 March, 2024; originally announced March 2024.

    Comments: ECCV Camera Ready. Code & Data: https://jmiemirza.github.io/Meta-Prompting/

  6. arXiv:2402.08324

    cs.LG cs.AI

    Uncertainty Quantification via Stable Distribution Propagation

    Authors: Felix Petersen, Aashwin Mishra, Hilde Kuehne, Christian Borgelt, Oliver Deussen, Mikhail Yurochkin

    Abstract: We propose a new approach for propagating stable probability distributions through neural networks. Our method is based on local linearization, which we show to be an optimal approximation in terms of total variation distance for the ReLU non-linearity. This allows propagating Gaussian and Cauchy input uncertainties through neural networks to quantify their output uncertainties. To demonstrate the…

    Submitted 13 February, 2024; originally announced February 2024.

    Comments: Published at ICLR 2024, Code @ https://github.com/Felix-Petersen/distprop

  7. arXiv:2312.15289

    cs.CV cs.LG eess.IV

    Fréchet Wavelet Distance: A Domain-Agnostic Metric for Image Generation

    Authors: Lokesh Veeramacheneni, Moritz Wolter, Hildegard Kuehne, Juergen Gall

    Abstract: Modern metrics for generative learning like Fréchet Inception Distance (FID) demonstrate impressive performance. However, they suffer from various shortcomings, like a bias towards specific generators and datasets. To address this problem, we propose the Fréchet Wavelet Distance (FWD) as a domain-agnostic metric based on Wavelet Packet Transform ($W_p$). FWD provides a sight across a broad spectru…

    Submitted 10 June, 2024; v1 submitted 23 December, 2023; originally announced December 2023.

  8. arXiv:2312.00878

    cs.CV cs.AI

    Grounding Everything: Emerging Localization Properties in Vision-Language Transformers

    Authors: Walid Bousselham, Felix Petersen, Vittorio Ferrari, Hilde Kuehne

    Abstract: Vision-language foundation models have shown remarkable performance in various zero-shot settings such as image retrieval, classification, or captioning. But so far, those models seem to fall behind when it comes to zero-shot localization of referential expressions and objects in images. As a result, they need to be fine-tuned for this task. In this paper, we show that pretrained vision-language (…

    Submitted 14 December, 2023; v1 submitted 1 December, 2023; originally announced December 2023.

    Comments: Code available at https://github.com/WalBouss/GEM

  9. arXiv:2311.06231

    cs.CV

    Learning Human Action Recognition Representations Without Real Humans

    Authors: Howard Zhong, Samarth Mishra, Donghyun Kim, SouYoung Jin, Rameswar Panda, Hilde Kuehne, Leonid Karlinsky, Venkatesh Saligrama, Aude Oliva, Rogerio Feris

    Abstract: Pre-training on massive video datasets has become essential to achieve high action recognition performance on smaller downstream datasets. However, most large-scale video datasets contain images of people and hence are accompanied with issues related to privacy, ethics, and data protection, often preventing them from being publicly shared for reproducible research. Existing work has attempted to a…

    Submitted 10 November, 2023; originally announced November 2023.

    Comments: 19 pages, 7 figures, 2023 NeurIPS Datasets and Benchmarks Track

  10. arXiv:2310.04900

    cs.CV

    HowToCaption: Prompting LLMs to Transform Video Annotations at Scale

    Authors: Nina Shvetsova, Anna Kukleva, Xudong Hong, Christian Rupprecht, Bernt Schiele, Hilde Kuehne

    Abstract: Instructional videos are an excellent source for learning multimodal representations by leveraging video-subtitle pairs extracted with automatic speech recognition systems (ASR) from the audio signal in the videos. However, in contrast to human-annotated captions, both speech and subtitles naturally differ from the visual content of the videos and thus provide only noisy supervision for multimodal…

    Submitted 7 October, 2023; originally announced October 2023.

    Comments: https://github.com/ninatu/howtocaption

  11. arXiv:2309.08928

    cs.CV

    In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval

    Authors: Nina Shvetsova, Anna Kukleva, Bernt Schiele, Hilde Kuehne

    Abstract: Large-scale noisy web image-text datasets have been proven to be efficient for learning robust vision-language models. However, when transferring them to the task of video retrieval, models still need to be fine-tuned on hand-curated paired text-video data to adapt to the diverse styles of video descriptions. To address this problem without the need for hand-annotated pairs, we propose a new setti…

    Submitted 16 September, 2023; originally announced September 2023.

    Comments: Published at ICCV 2023, code: https://github.com/ninatu/in_style

  12. arXiv:2308.13077

    cs.CV

    Preserving Modality Structure Improves Multi-Modal Learning

    Authors: Swetha Sirnam, Mamshad Nayeem Rizve, Nina Shvetsova, Hilde Kuehne, Mubarak Shah

    Abstract: Self-supervised learning on large-scale multi-modal datasets allows learning semantically meaningful embeddings in a joint multi-modal representation space without relying on human annotations. These joint embeddings enable zero-shot cross-modal tasks like retrieval and classification. However, these methods often struggle to generalize well on out-of-domain data as they ignore the semantic struct…

    Submitted 24 August, 2023; originally announced August 2023.

    Comments: Accepted at ICCV 2023

  13. arXiv:2306.15521

    cs.CV

    What a MESS: Multi-Domain Evaluation of Zero-Shot Semantic Segmentation

    Authors: Benedikt Blumenstiel, Johannes Jakubik, Hilde Kühne, Michael Vössing

    Abstract: While semantic segmentation has seen tremendous improvements in the past, there are still significant labeling efforts necessary and the problem of limited generalization to classes that have not been present during training. To address this problem, zero-shot semantic segmentation makes use of large self-supervised vision-language models, allowing zero-shot transfer to unseen classes. In this wor…

    Submitted 16 December, 2023; v1 submitted 27 June, 2023; originally announced June 2023.

    Comments: 37th Conference on Neural Information Processing Systems (NeurIPS 2023) Track on Datasets and Benchmarks

  14. arXiv:2305.12606

    cs.CL cs.SD eess.AS

    Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages

    Authors: Andrew Rouditchenko, Sameer Khurana, Samuel Thomas, Rogerio Feris, Leonid Karlinsky, Hilde Kuehne, David Harwath, Brian Kingsbury, James Glass

    Abstract: Recent models such as XLS-R and Whisper have made multilingual speech technologies more accessible by pre-training on audio from around 100 spoken languages each. However, there are thousands of spoken languages worldwide, and adapting to new languages is an important problem. In this work, we aim to understand which model adapts better to languages unseen during pre-training. We fine-tune both mo…

    Submitted 30 May, 2023; v1 submitted 21 May, 2023; originally announced May 2023.

    Comments: Accepted at Interspeech 2023

  15. arXiv:2305.00604

    cs.LG cs.CV math.OC stat.ML

    ISAAC Newton: Input-based Approximate Curvature for Newton's Method

    Authors: Felix Petersen, Tobias Sutter, Christian Borgelt, Dongsung Huh, Hilde Kuehne, Yuekai Sun, Oliver Deussen

    Abstract: We present ISAAC (Input-baSed ApproximAte Curvature), a novel method that conditions the gradient using selected second-order information and has an asymptotically vanishing computational overhead, assuming a batch size smaller than the number of neurons. We show that it is possible to compute a good conditioner based on only the input to a respective layer without a substantial computational over…

    Submitted 30 April, 2023; originally announced May 2023.

    Comments: Published at ICLR 2023, Code @ https://github.com/Felix-Petersen/isaac, Video @ https://youtu.be/7RKRX-MdwqM

  16. arXiv:2304.13116

    cond-mat.str-el cond-mat.mtrl-sci

    Spin-liquid-like state in a square lattice antiferromagnet

    Authors: B. Sana, M. Barik, S. Lee, U. Jena, M. Baenitz, J. Sichelschmidt, S. Luther, H. Kuehne, K. Sethupathi, M. S. Ramachandra Rao, K. Y. Choi, P. Khuntia

    Abstract: Collective behavior of spins, frustration-induced strong quantum fluctuations and subtle interplay between competing degrees of freedom in quantum materials can lead to correlated quantum states with fractional excitations that are essential ingredients for establishing paradigmatic models and have immense potential for quantum technologies. Quenched randomness is a new paradigm in elucidating the…

    Submitted 25 April, 2023; originally announced April 2023.

  17. arXiv:2304.08682

    cs.CV

    Learning Situation Hyper-Graphs for Video Question Answering

    Authors: Aisha Urooj Khan, Hilde Kuehne, Bo Wu, Kim Chheu, Walid Bousselham, Chuang Gan, Niels Lobo, Mubarak Shah

    Abstract: Answering questions about complex situations in videos requires not only capturing the presence of actors, objects, and their relations but also the evolution of these relationships over time. A situation hyper-graph is a representation that describes situations as scene sub-graphs for video frames and hyper-edges for connected sub-graphs and has been proposed to capture all such information in a…

    Submitted 6 May, 2023; v1 submitted 17 April, 2023; originally announced April 2023.

  18. arXiv:2304.05088

    cs.CV cs.HC

    WEAR: An Outdoor Sports Dataset for Wearable and Egocentric Activity Recognition

    Authors: Marius Bock, Hilde Kuehne, Kristof Van Laerhoven, Michael Moeller

    Abstract: Though research has shown the complementarity of camera- and inertial-based data, datasets which offer both egocentric video and inertial-based sensor data remain scarce. In this paper, we introduce WEAR, an outdoor sports dataset for both vision- and inertial-based human activity recognition (HAR). The dataset comprises data from 18 participants performing a total of 18 different workout activiti…

    Submitted 21 November, 2023; v1 submitted 11 April, 2023; originally announced April 2023.

    Comments: 15 pages, 3 figures, 2 tables

  19. arXiv:2303.16990

    cs.CV

    What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions

    Authors: Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Daniel Kondermann, Samuel Thomas, Shih-Fu Chang, Rogerio Feris, James Glass, Hilde Kuehne

    Abstract: Spatio-temporal grounding describes the task of localizing events in space and time, e.g., in video data, based on verbal descriptions only. Models for this task are usually trained with human-annotated sentences and bounding box supervision. This work addresses this task from a multimodal supervision perspective, proposing a framework for spatio-temporal action grounding trained on loose video an…

    Submitted 28 May, 2024; v1 submitted 29 March, 2023; originally announced March 2023.

    Comments: To be presented at CVPR 2024. Project page: https://brian7685.github.io/STG/

  20. arXiv:2303.13664

    cs.CV cs.LG

    Temperature Schedules for Self-Supervised Contrastive Methods on Long-Tail Data

    Authors: Anna Kukleva, Moritz Böhle, Bernt Schiele, Hilde Kuehne, Christian Rupprecht

    Abstract: Most approaches for self-supervised learning (SSL) are optimised on curated balanced datasets, e.g. ImageNet, despite the fact that natural data usually exhibits long-tail distributions. In this paper, we analyse the behaviour of one of the most popular variants of SSL, i.e. contrastive methods, on long-tail data. In particular, we investigate the role of the temperature parameter $τ$ in the contr…

    Submitted 23 March, 2023; originally announced March 2023.

    Comments: ICLR 2023

  21. arXiv:2303.08914

    cs.CV

    MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge

    Authors: Wei Lin, Leonid Karlinsky, Nina Shvetsova, Horst Possegger, Mateusz Kozinski, Rameswar Panda, Rogerio Feris, Hilde Kuehne, Horst Bischof

    Abstract: Large scale Vision-Language (VL) models have shown tremendous success in aligning representations between visual and text modalities. This enables remarkable progress in zero-shot recognition, image generation & editing, and many other exciting tasks. However, VL models tend to over-represent objects while paying much less attention to verbs, and require additional tuning on video data for best ze…

    Submitted 22 July, 2023; v1 submitted 15 March, 2023; originally announced March 2023.

    Comments: Accepted at ICCV 2023

  22. arXiv:2303.05166

    cs.CV

    TAEC: Unsupervised Action Segmentation with Temporal-Aware Embedding and Clustering

    Authors: Wei Lin, Anna Kukleva, Horst Possegger, Hilde Kuehne, Horst Bischof

    Abstract: Temporal action segmentation in untrimmed videos has gained increased attention recently. However, annotating action classes and frame-wise boundaries is extremely time consuming and cost intensive, especially on large-scale datasets. To address this issue, we propose an unsupervised approach for learning action classes from untrimmed video sequences. In particular, we propose a temporal embedding…

    Submitted 9 March, 2023; originally announced March 2023.

    Comments: Computer Vision Winter Workshop 2023

  23. arXiv:2301.02009

    cs.CV

    Learning by Sorting: Self-supervised Learning with Group Ordering Constraints

    Authors: Nina Shvetsova, Felix Petersen, Anna Kukleva, Bernt Schiele, Hilde Kuehne

    Abstract: Contrastive learning has become an important tool in learning representations from unlabeled data mainly relying on the idea of minimizing distance between positive data pairs, e.g., views from the same images, and maximizing distance between negative data pairs, e.g., views from different images. This paper proposes a new variation of the contrastive learning objective, Group Ordering Constraints…

    Submitted 18 August, 2023; v1 submitted 5 January, 2023; originally announced January 2023.

    Comments: Published at ICCV 2023, Code @ https://github.com/ninatu/learning_by_sorting

  24. arXiv:2211.15393

    cs.CV

    Video Test-Time Adaptation for Action Recognition

    Authors: Wei Lin, Muhammad Jehanzeb Mirza, Mateusz Kozinski, Horst Possegger, Hilde Kuehne, Horst Bischof

    Abstract: Although action recognition systems can achieve top performance when evaluated on in-distribution test points, they are vulnerable to unanticipated distribution shifts in test data. However, test-time adaptation of video action recognition models against common distribution shifts has so far not been demonstrated. We propose to address this problem with an approach tailored to spatio-temporal mode…

    Submitted 20 March, 2023; v1 submitted 24 November, 2022; originally announced November 2022.

    Comments: Accepted at CVPR 2023

  25. arXiv:2210.08277

    cs.LG

    Deep Differentiable Logic Gate Networks

    Authors: Felix Petersen, Christian Borgelt, Hilde Kuehne, Oliver Deussen

    Abstract: Recently, research has increasingly focused on developing efficient neural network architectures. In this work, we explore logic gate networks for machine learning tasks by learning combinations of logic gates. These networks comprise logic gates such as "AND" and "XOR", which allow for very fast execution. The difficulty in learning logic gate networks is that they are conventionally non-differen…

    Submitted 15 October, 2022; originally announced October 2022.

    Comments: Published at NeurIPS 2022

  26. arXiv:2210.07839

    cs.MM cs.CV cs.SD eess.AS

    Contrastive Audio-Visual Masked Autoencoder

    Authors: Yuan Gong, Andrew Rouditchenko, Alexander H. Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, James Glass

    Abstract: In this paper, we first extend the recent Masked Auto-Encoder (MAE) model from a single modality to audio-visual multi-modalities. Subsequently, we propose the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE) by combining contrastive learning and masked data modeling, two major self-supervised learning frameworks, to learn a joint and coordinated audio-visual representation. Our experiments…

    Submitted 11 April, 2023; v1 submitted 2 October, 2022; originally announced October 2022.

    Comments: Accepted at ICLR 2023 as a notable top 25% paper. Code and pretrained models are at https://github.com/yuangongnd/cav-mae

  27. arXiv:2210.03625

    cs.CL cs.CV cs.MM

    C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval

    Authors: Andrew Rouditchenko, Yung-Sung Chuang, Nina Shvetsova, Samuel Thomas, Rogerio Feris, Brian Kingsbury, Leonid Karlinsky, David Harwath, Hilde Kuehne, James Glass

    Abstract: Multilingual text-video retrieval methods have improved significantly in recent years, but the performance for other languages lags behind English. We propose a Cross-Lingual Cross-Modal Knowledge Distillation method to improve multilingual text-video retrieval. Inspired by the fact that English text-video retrieval outperforms other languages, we train a student model using input text in differen…

    Submitted 9 May, 2023; v1 submitted 7 October, 2022; originally announced October 2022.

    Comments: Accepted at ICASSP 2023. The code, models, and dataset are available at https://github.com/roudimit/c2kd

  28. Field-tunable Berezinskii-Kosterlitz-Thouless correlations in a Heisenberg magnet

    Authors: D. Opherden, M. S. J. Tepaske, F. Bärtl, M. Weber, M. M. Turnbull, T. Lancaster, S. J. Blundell, M. Baenitz, J. Wosnitza, C. P. Landee, R. Moessner, D. J. Luitz, H. Kühne

    Abstract: We report the manifestation of field-induced Berezinskii-Kosterlitz-Thouless (BKT) correlations in the weakly coupled spin-1/2 Heisenberg layers of the molecular-based bulk material [Cu(pz)$_2$(2-HOpy)$_2$](PF$_6$)$_2$. Due to the moderate intralayer exchange coupling of $J/k_\mathrm{B} = 6.8$ K, the application of laboratory magnetic fields induces a substantial $XY$ anisotropy of the spin correl…

    Submitted 22 September, 2022; originally announced September 2022.

    Comments: 10 pages, 7 figures

  29. arXiv:2209.06103

    cs.CV cs.AI cs.CL

    VL-Taboo: An Analysis of Attribute-based Zero-shot Capabilities of Vision-Language Models

    Authors: Felix Vogel, Nina Shvetsova, Leonid Karlinsky, Hilde Kuehne

    Abstract: Vision-language models trained on large, randomly collected data had significant impact in many areas since they appeared. But as they show great performance in various fields, such as image-text-retrieval, their inner workings are still not fully understood. The current work analyses the true zero-shot capabilities of those models. We start from the analysis of the training corpus assessing to wh…

    Submitted 12 September, 2022; originally announced September 2022.

  30. arXiv:2208.01956

    cs.CV

    Augmentation Learning for Semi-Supervised Classification

    Authors: Tim Frommknecht, Pedro Alves Zipf, Quanfu Fan, Nina Shvetsova, Hilde Kuehne

    Abstract: Recently, a number of new Semi-Supervised Learning methods have emerged. As the accuracy for ImageNet and similar datasets increased over time, the performance on tasks beyond the classification of natural images is yet to be explored. Most Semi-Supervised Learning methods rely on a carefully manually designed data augmentation pipeline that is not transferable for learning on images of other doma…

    Submitted 3 August, 2022; originally announced August 2022.

    Comments: Accepted to GCPR 2022, 13 pages with 4 figures

  31. Structural 130-K Phase Transition and Emergence of a Two-Ion Kondo State in HT-Ce$_2$Rh$_2$Ga Explored by $^{69,71}$Ga Nuclear Quadrupole Resonance

    Authors: Sh. Yamamoto, T. Fujii, S. Luther, H. Yasuoka, H. Sakai, F. Bärtl, K. M. Ranjith, H. Rosner, J. Wosnitza, A. M. Strydom, H. Kühne, M. Baenitz

    Abstract: We have studied the microscopic magnetic properties, the nature of the 130-K phase transition, and the ground state in the recently synthesized compound Ce$_2$Rh$_2$Ga by use of $^{69,71}$Ga nuclear quadrupole resonance (NQR). The NQR spectra clearly show an unusual phase transition at $T_t$ $\sim$ 130 K yielding a splitting of the high-temperature single NQR line into two clearly resolved NQR lin…

    Submitted 11 July, 2022; originally announced July 2022.

    Comments: 12 pages, 7 figures

    Journal ref: Phys. Rev. B 106, 115125 (2022)

  32. arXiv:2207.02334

    cs.CV

    Weakly Supervised Grounding for VQA in Vision-Language Transformers

    Authors: Aisha Urooj Khan, Hilde Kuehne, Chuang Gan, Niels Da Vitoria Lobo, Mubarak Shah

    Abstract: Transformers for visual-language representation learning have been getting a lot of interest and shown tremendous performance on visual question answering (VQA) and grounding. But most systems that show good performance of those tasks still rely on pre-trained object detectors during training, which limits their applicability to the object classes available for those detectors. To mitigate this li…

    Submitted 5 July, 2022; originally announced July 2022.

    Comments: To appear at ECCV 2022

  33. arXiv:2206.07290

    cs.LG cs.CV

    Differentiable Top-k Classification Learning

    Authors: Felix Petersen, Hilde Kuehne, Christian Borgelt, Oliver Deussen

    Abstract: The top-k classification accuracy is one of the core metrics in machine learning. Here, k is conventionally a positive integer, such as 1 or 5, leading to top-1 or top-5 training objectives. In this work, we relax this assumption and optimize the model for multiple k simultaneously instead of using a single k. Leveraging recent advances in differentiable sorting and ranking, we propose a different…

    Submitted 15 June, 2022; originally announced June 2022.

    Comments: Published at ICML 2022, Code @ https://github.com/Felix-Petersen/difftopk

  34. arXiv:2203.16244

    cs.CV

    CycDA: Unsupervised Cycle Domain Adaptation from Image to Video

    Authors: Wei Lin, Anna Kukleva, Kunyang Sun, Horst Possegger, Hilde Kuehne, Horst Bischof

    Abstract: Although action recognition has achieved impressive results over recent years, both collection and annotation of video training data are still time-consuming and cost intensive. Therefore, image-to-video adaptation has been proposed to exploit labeling-free web image source for adapting on unlabeled target videos. This poses two major challenges: (1) spatial domain shift between web images and vid…

    Submitted 22 March, 2023; v1 submitted 30 March, 2022; originally announced March 2022.

    Comments: Accepted at ECCV2022. Supplementary included

  35. arXiv:2203.09630

    cs.LG cs.AI cs.IR stat.ML

    Monotonic Differentiable Sorting Networks

    Authors: Felix Petersen, Christian Borgelt, Hilde Kuehne, Oliver Deussen

    Abstract: Differentiable sorting algorithms allow training with sorting and ranking supervision, where only the ordering or ranking of samples is known. Various methods have been proposed to address this challenge, ranging from optimal transport-based differentiable Sinkhorn sorting algorithms to making classic sorting networks differentiable. One problem of current differentiable sorting methods is that th…

    Submitted 17 March, 2022; originally announced March 2022.

    Comments: Published at ICLR 2022, Code @ https://github.com/Felix-Petersen/diffsort, Video @ https://www.youtube.com/watch?v=Rl-sFaE1z4M

  36. Orbital-Induced Crossover of the Fulde-Ferrell-Larkin-Ovchinnikov Phase into Abrikosov-like States

    Authors: Tommy Kotte, Hannes Kühne, John A Schlueter, Gertrud Zwicknagl, J. Wosnitza

    Abstract: The Fulde-Ferrell-Larkin-Ovchinnikov (FFLO) state can emerge in superconductors for which the orbital critical field exceeds the Pauli limit. Here, we present angular-resolved specific-heat data of the quasi-two-dimensional organic superconductor $κ$-(ET)$_2$Cu(NCS)$_2$, with a focus on high fields in the regime of the FFLO transition. For an increasing out-of-plane tilt of the applied magnetic fi…

    Submitted 6 May, 2022; v1 submitted 6 January, 2022; originally announced January 2022.

    Comments: Main text: 6 Pages, 4 Figures Supplement: 11 Pages, 6 Figures

    Journal ref: Phys. Rev. B 106, L060503 (2022)

  37. arXiv:2112.04446

    cs.CV cs.CL cs.SD eess.AS

    Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval

    Authors: Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Hilde Kuehne

    Abstract: Multi-modal learning from video data has seen increased attention recently as it allows to train semantically meaningful embeddings without human annotation enabling tasks like zero-shot retrieval and classification. In this work, we present a multi-modal, modality agnostic fusion transformer approach that learns to exchange information between multiple modalities, such as video, audio, and text,…

    Submitted 18 August, 2022; v1 submitted 8 December, 2021; originally announced December 2021.

    Comments: CVPR2022. The final published version of the proceedings will be available on IEEE Xplore

  38. arXiv:2112.02300

    cs.CV

    Unsupervised Domain Generalization by Learning a Bridge Across Domains

    Authors: Sivan Harary, Eli Schwartz, Assaf Arbelle, Peter Staar, Shady Abu-Hussein, Elad Amrani, Roei Herzig, Amit Alfassy, Raja Giryes, Hilde Kuehne, Dina Katabi, Kate Saenko, Rogerio Feris, Leonid Karlinsky

    Abstract: The ability to generalize learned representations across significantly different visual domains, such as between real photos, clipart, paintings, and sketches, is a fundamental capacity of the human visual system. In this paper, different from most cross-domain works that utilize some (or full) source domain supervision, we approach a relatively new and very practical Unsupervised Domain Generaliz…

    Submitted 17 May, 2022; v1 submitted 4 December, 2021; originally announced December 2021.

  39. arXiv:2112.00775

    cs.CV

    Routing with Self-Attention for Multimodal Capsule Networks

    Authors: Kevin Duarte, Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Samuel Thomas, Alexander Liu, David Harwath, James Glass, Hilde Kuehne, Mubarak Shah

    Abstract: The task of multimodal learning has seen a growing interest recently as it allows for training neural architectures based on different modalities such as vision, text, and audio. One challenge in training such models is that they need to jointly learn semantic concepts and their relationships across different input representations. Capsule networks have been shown to perform well in context of cap…

    Submitted 1 December, 2021; originally announced December 2021.

  40. arXiv:2111.04823

    cs.CL cs.CV cs.MM cs.SD eess.AS eess.IV

    Cascaded Multilingual Audio-Visual Learning from Videos

    Authors: Andrew Rouditchenko, Angie Boggust, David Harwath, Samuel Thomas, Hilde Kuehne, Brian Chen, Rameswar Panda, Rogerio Feris, Brian Kingsbury, Michael Picheny, James Glass

    Abstract: In this paper, we explore self-supervised audio-visual models that learn from instructional videos. Prior work has shown that these models can relate spoken words and sounds to visual content after training on a large-scale dataset of videos, but they were only trained and evaluated on videos in English. To learn multilingual audio-visual representations, we propose a cascaded approach that levera…

    Submitted 8 November, 2021; originally announced November 2021.

    Comments: Presented at Interspeech 2021. This version contains updated results using the YouCook-Japanese dataset

  41. arXiv:2110.10784  [pdf, other]

    cs.CV cs.LG

    Style Agnostic 3D Reconstruction via Adversarial Style Transfer

    Authors: Felix Petersen, Bastian Goldluecke, Oliver Deussen, Hilde Kuehne

    Abstract: Reconstructing the 3D geometry of an object from an image is a major challenge in computer vision. Recently introduced differentiable renderers can be leveraged to learn the 3D geometry of objects from 2D images, but those approaches require additional supervision to enable the renderer to produce an output that can be compared to the input image. This can be scene information or constraints such…

    Submitted 20 October, 2021; originally announced October 2021.

    Comments: To be published at WACV 2022, Code @ https://github.com/Felix-Petersen/style-agnostic-3d-reconstruction

  42. arXiv:2110.05651  [pdf, other]

    cs.LG stat.ML

    Learning with Algorithmic Supervision via Continuous Relaxations

    Authors: Felix Petersen, Christian Borgelt, Hilde Kuehne, Oliver Deussen

    Abstract: The integration of algorithmic components into neural architectures has gained increased attention recently, as it allows training neural networks with new forms of supervision such as ordering constraints or silhouettes instead of using ground truth labels. Many approaches in the field focus on the continuous relaxation of a specific task and show promising results in this context. But the focus…

    Submitted 25 October, 2021; v1 submitted 11 October, 2021; originally announced October 2021.

    Comments: Published at NeurIPS 2021, Code @ https://github.com/Felix-Petersen/algovision, Video @ https://www.youtube.com/watch?v=01ENzpkjOCE

  43. The planar triangular $S=3/2$ magnet AgCrSe$_2$: magnetic frustration, short range correlations, and field tuned anisotropic cycloidal magnetic order

    Authors: M. Baenitz, M. M. Piva, S. Luther, J. Sichelschmidt, K. M. Ranjith, H. Dawczak-Dȩbicki, M. O. Ajeesh, S. -J. Kim, G. Siemann, C. Bigi, P. Manuel, D. Khalyavin, D. A. Sokolov, P. Mokhtari, H. Zhang, H. Yasuoka, P. D. C. King, G. Vinai, V. Polewczyk, P. Torelli, J. Wosnitza, U. Burkhardt, B. Schmidt, H. Rosner, S. Wirth, et al. (3 additional authors not shown)

    Abstract: Our studies evidence an anisotropic magnetic order below $T_N = 32$~K. Susceptibility data in small fields of about 1~T reveal an antiferromagnetic (AFM) order for $H \perp c$, whereas for $H \parallel c$ the data are reminiscent of a field-induced ferromagnetic (FM) structure. At low temperatures and for $H \perp c$, the field-dependent magnetization and AC susceptibility data evidence a metamagn…

    Submitted 6 September, 2021; originally announced September 2021.

  44. arXiv:2108.08165  [pdf, other]

    cs.CV

    Generalized and Incremental Few-Shot Learning by Explicit Learning and Calibration without Forgetting

    Authors: Anna Kukleva, Hilde Kuehne, Bernt Schiele

    Abstract: Both generalized and incremental few-shot learning have to deal with three major challenges: learning novel classes from only few samples per class, preventing catastrophic forgetting of base classes, and classifier calibration across novel and base classes. In this work we propose a three-stage framework that allows to explicitly and effectively address these challenges. While the first phase lea…

    Submitted 18 August, 2021; originally announced August 2021.

    Comments: ICCV 2021

  45. arXiv:2105.04836  [pdf, other]

    cs.CV

    Found a Reason for me? Weakly-supervised Grounded Visual Question Answering using Capsules

    Authors: Aisha Urooj Khan, Hilde Kuehne, Kevin Duarte, Chuang Gan, Niels Lobo, Mubarak Shah

    Abstract: The problem of grounding VQA tasks has seen an increased attention in the research community recently, with most attempts usually focusing on solving this task by using pretrained object detectors. However, pre-trained object detectors require bounding box annotations for detecting relevant objects in the vocabulary, which may not always be feasible for real-life large-scale applications. In this…

    Submitted 11 May, 2021; originally announced May 2021.

    Comments: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021

  46. arXiv:2105.04019  [pdf, other]

    cs.LG cs.IR

    Differentiable Sorting Networks for Scalable Sorting and Ranking Supervision

    Authors: Felix Petersen, Christian Borgelt, Hilde Kuehne, Oliver Deussen

    Abstract: Sorting and ranking supervision is a method for training neural networks end-to-end based on ordering constraints. That is, the ground truth order of sets of samples is known, while their absolute values remain unsupervised. For that, we propose differentiable sorting networks by relaxing their pairwise conditional swap operations. To address the problems of vanishing gradients and extensive blurr…

    Submitted 14 July, 2021; v1 submitted 9 May, 2021; originally announced May 2021.

    Comments: Published at ICML 2021, Code @ https://github.com/Felix-Petersen/diffsort, Video @ https://www.youtube.com/watch?v=38dvqdYEs1o

    Journal ref: PMLR 139:8546-8555, 2021

  47. arXiv:2105.00067  [pdf, other]

    cs.CV

    Unsupervised Discriminative Embedding for Sub-Action Learning in Complex Activities

    Authors: Sirnam Swetha, Hilde Kuehne, Yogesh S Rawat, Mubarak Shah

    Abstract: Action recognition and detection in the context of long untrimmed video sequences has seen an increased attention from the research community. However, annotation of complex activities is usually time consuming and challenging in practice. Therefore, recent works started to tackle the problem of unsupervised learning of sub-actions in complex activities. This paper proposes a novel approach for un…

    Submitted 30 April, 2021; originally announced May 2021.

  48. arXiv:2104.12671  [pdf, other]

    cs.CV

    Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos

    Authors: Brian Chen, Andrew Rouditchenko, Kevin Duarte, Hilde Kuehne, Samuel Thomas, Angie Boggust, Rameswar Panda, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Michael Picheny, Shih-Fu Chang

    Abstract: Multimodal self-supervised learning is getting more and more attention as it allows not only to train large networks without human supervision but also to search and retrieve data across various modalities. In this context, this paper proposes a self-supervised training framework that learns a common multimodal embedding space that, in addition to sharing representations across different modalitie…

    Submitted 3 September, 2021; v1 submitted 26 April, 2021; originally announced April 2021.

    Comments: To be presented at ICCV 2021

    Journal ref: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 8012-8021

  49. arXiv:2104.09829  [pdf, other]

    cs.CV

    Detector-Free Weakly Supervised Grounding by Separation

    Authors: Assaf Arbelle, Sivan Doveh, Amit Alfassy, Joseph Shtok, Guy Lev, Eli Schwartz, Hilde Kuehne, Hila Barak Levi, Prasanna Sattigeri, Rameswar Panda, Chun-Fu Chen, Alex Bronstein, Kate Saenko, Shimon Ullman, Raja Giryes, Rogerio Feris, Leonid Karlinsky

    Abstract: Nowadays, there is an abundance of data involving images and surrounding free-form text weakly corresponding to those images. Weakly Supervised phrase-Grounding (WSG) deals with the task of using this data to learn to localize (or to ground) arbitrary text phrases in images without any additional annotations. However, most recent SotA methods for WSG assume the existence of a pre-trained object de…

    Submitted 20 April, 2021; originally announced April 2021.

  50. arXiv:2101.01915  [pdf, ps, other]

    cond-mat.str-el

    Emergence of Frustrated Short-Range Order above Long-Range Order in the $S=1/2$ Kagome Antiferromagnet CaCu$_3$(OD)$_6$Cl$_2\cdot0.6$D$_2$O

    Authors: Yoshihiko Ihara, Kazuki Matsui, Yoshimitsu Kohama, Sven Luther, Daryna Opherden, Jochen Wosnitza, Hannes Kühne, Hiroyuki K. Yoshida

    Abstract: We report on the low-energy dynamics in the kagome antiferromagnet CaCu$_3$(OD)$_6$Cl$_2\cdot0.6$D$_2$O (Ca-kapellasite) as studied by use of $^2$D-NMR measurements. Previous $^{35}$Cl-NMR measurements revealed that the nuclear spin-lattice relaxation rate ($1/T_1$) shows two peaks at temperatures, $T^{\ast} = 7.2$ K and $T_s \simeq 25$ K. While the low-temperature peak at $T^{\ast}$ is ascribed t…

    Submitted 6 January, 2021; originally announced January 2021.

    Comments: 5 pages, 4 figures