Search | arXiv e-print repository

arXiv:2407.20034 [pdf, other]

MaskInversion: Localized Embeddings via Optimization of Explainability Maps

Authors: Walid Bousselham, Sofian Chaybouti, Christian Rupprecht, Vittorio Ferrari, Hilde Kuehne

Abstract: Vision-language foundation models such as CLIP have achieved tremendous results in global vision-language alignment, but still show some limitations in creating representations for specific image regions. To address this problem, we propose MaskInversion, a method that leverages the feature representations of pre-trained foundation models, such as CLIP, to generate a context-aware embedding for a query image region specified by a mask at test time. MaskInversion starts with initializing an embedding token and compares its explainability map, derived from the foundation model, to the query mask. The embedding token is then refined to approximate the query region by minimizing the discrepancy between its explainability map and the query mask. During this process, only the embedding vector is updated, while the underlying foundation model is kept frozen, allowing MaskInversion to be used with any pre-trained model. As deriving the explainability map involves computing its gradient, which can be expensive, we propose a gradient decomposition strategy that simplifies this computation. The learned region representation can be used for a broad range of tasks, including open-vocabulary class retrieval, referring expression comprehension, as well as for localized captioning and image generation. We evaluate the proposed method on all those tasks on several datasets such as PascalVOC, MSCOCO, RefCOCO, and OpenImagesV7 and show its capabilities compared to other SOTA approaches.

Submitted 29 July, 2024; originally announced July 2024.

Comments: Project page: https://walidbousselham.com/MaskInversion

arXiv:2407.04082 [pdf, other]

DASS: Distilled Audio State Space Models Are Stronger and More Duration-Scalable Learners

Authors: Saurabhchand Bhati, Yuan Gong, Leonid Karlinsky, Hilde Kuehne, Rogerio Feris, James Glass

Abstract: State-space models (SSMs) have emerged as an alternative to Transformers for audio modeling due to their high computational efficiency with long inputs. While recent efforts on Audio SSMs have reported encouraging results, two main limitations remain: First, in 10-second short audio tagging tasks, Audio SSMs still underperform compared to Transformer-based models such as Audio Spectrogram Transformer (AST). Second, although Audio SSMs theoretically support long audio inputs, their actual performance with long audio has not been thoroughly evaluated. To address these limitations, in this paper, 1) we applied knowledge distillation in audio state-space model training, resulting in a model called Knowledge Distilled Audio SSM (DASS). To the best of our knowledge, it is the first SSM that outperforms the Transformers on AudioSet and achieves an mAP of 47.6; and 2) we designed a new test called Audio Needle In A Haystack (Audio NIAH). We find that DASS, trained with only 10-second audio clips, can retrieve sound events in audio recordings up to 2.5 hours long, while the AST model fails when the input is just 50 seconds, demonstrating that SSMs are indeed more duration scalable.

Submitted 4 July, 2024; originally announced July 2024.

arXiv:2406.10082 [pdf, other]

Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation

Authors: Andrew Rouditchenko, Yuan Gong, Samuel Thomas, Leonid Karlinsky, Hilde Kuehne, Rogerio Feris, James Glass

Abstract: Audio-Visual Speech Recognition (AVSR) uses lip-based video to improve performance in noise. Since videos are harder to obtain than audio, the video training data of AVSR models is usually limited to a few thousand hours. In contrast, speech models such as Whisper are trained with hundreds of thousands of hours of data, and thus learn a better speech-to-text decoder. The huge training data difference motivates us to adapt Whisper to handle video inputs. Inspired by Flamingo which injects visual features into language models, we propose Whisper-Flamingo which integrates visual features into the Whisper speech recognition and translation model with gated cross attention. Our audio-visual Whisper-Flamingo outperforms audio-only Whisper on English speech recognition and En-X translation for 6 languages in noisy conditions. Moreover, Whisper-Flamingo is a versatile model and conducts all of these tasks using one set of parameters, while prior methods are trained separately on each language.

Submitted 14 June, 2024; originally announced June 2024.

Comments: Interspeech 2024. Code https://github.com/roudimit/whisper-flamingo
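
The gated cross attention mentioned in the Whisper-Flamingo abstract can be illustrated with a minimal sketch. It assumes a Flamingo-style tanh gate on a residual connection and a single attention head without learned projections; the function names and the toy vectors are hypothetical and are not taken from the paper's code.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention (no learned projections)."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def gated_cross_attention(audio, visual, alpha):
    """Residual update: audio + tanh(alpha) * CrossAttn(audio, visual).

    Assumption: alpha is a scalar gate parameter. With alpha = 0 the
    block is an identity map, so the audio-only model is unchanged.
    """
    attended = cross_attention(audio, visual, visual)
    gate = math.tanh(alpha)
    return [[a + gate * c for a, c in zip(a_row, c_row)]
            for a_row, c_row in zip(audio, attended)]


# Hypothetical toy features: 2 audio frames, 3 visual frames, dimension 2.
audio = [[1.0, 0.0], [0.0, 1.0]]
visual = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
closed = gated_cross_attention(audio, visual, alpha=0.0)
opened = gated_cross_attention(audio, visual, alpha=1.0)
```

With the gate at zero the output equals the audio input, which is why such a gate lets visual features be added to a pre-trained speech model without disturbing it at the start of training.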

arXiv:2404.03214 [pdf, other]

LeGrad: An Explainability Method for Vision Transformers via Feature Formation Sensitivity

Authors: Walid Bousselham, Angie Boggust, Sofian Chaybouti, Hendrik Strobelt, Hilde Kuehne

Abstract: Vision Transformers (ViTs), with their ability to model long-range dependencies through self-attention mechanisms, have become a standard architecture in computer vision. However, the interpretability of these models remains a challenge. To address this, we propose LeGrad, an explainability method specifically designed for ViTs. LeGrad computes the gradient with respect to the attention maps of ViT layers, considering the gradient itself as the explainability signal. We aggregate the signal over all layers, combining the activations of the last as well as intermediate tokens to produce the merged explainability map. This makes LeGrad a conceptually simple and easy-to-implement tool for enhancing the transparency of ViTs. We evaluate LeGrad in challenging segmentation, perturbation, and open-vocabulary settings, showcasing its versatility compared to other SotA explainability methods and demonstrating its superior spatial fidelity and robustness to perturbations. A demo and the code are available at https://github.com/WalBouss/LeGrad.

Submitted 4 April, 2024; originally announced April 2024.

Comments: Code available at https://github.com/WalBouss/LeGrad

arXiv:2403.11755 [pdf, other]

Meta-Prompting for Automating Zero-shot Visual Recognition with LLMs

Authors: M. Jehanzeb Mirza, Leonid Karlinsky, Wei Lin, Sivan Doveh, Jakub Micorek, Mateusz Kozinski, Hilde Kuehne, Horst Possegger

Abstract: Prompt ensembling of Large Language Model (LLM) generated category-specific prompts has emerged as an effective method to enhance zero-shot recognition ability of Vision-Language Models (VLMs). To obtain these category-specific prompts, the present methods rely on hand-crafting the prompts to the LLMs for generating VLM prompts for the downstream tasks. However, this requires manually composing these task-specific prompts, and even then they might not cover the diverse set of visual concepts and task-specific styles associated with the categories of interest. To effectively take humans out of the loop and completely automate the prompt generation process for zero-shot recognition, we propose Meta-Prompting for Visual Recognition (MPVR). Taking as input only minimal information about the target task, in the form of its short natural language description, and a list of associated class labels, MPVR automatically produces a diverse set of category-specific prompts resulting in a strong zero-shot classifier. MPVR generalizes effectively across various popular zero-shot image recognition benchmarks belonging to widely different domains when tested with multiple LLMs and VLMs. For example, MPVR obtains a zero-shot recognition improvement over CLIP by up to 19.8% and 18.2% (5.0% and 4.5% on average over 20 datasets) leveraging GPT and Mixtral LLMs, respectively.

Submitted 7 August, 2024; v1 submitted 18 March, 2024; originally announced March 2024.

Comments: ECCV Camera Ready. Code & Data: https://jmiemirza.github.io/Meta-Prompting/
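
Prompt ensembling, which the Meta-Prompting abstract builds on, can be sketched as follows. This is a generic illustration that assumes each prompt has already been encoded into a text embedding; the embeddings, labels, and function names below are hypothetical toy values, not outputs of CLIP or of MPVR.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def ensemble_class_embedding(prompt_embeddings):
    """Average the unit-normalized prompt embeddings of one class, then renormalize."""
    unit = [normalize(e) for e in prompt_embeddings]
    dim = len(unit[0])
    mean = [sum(e[j] for e in unit) / len(unit) for j in range(dim)]
    return normalize(mean)


def zero_shot_classify(image_embedding, class_prompt_embeddings):
    """Pick the class whose ensembled text embedding has the highest cosine similarity."""
    img = normalize(image_embedding)
    scores = {
        label: sum(a * b for a, b in zip(img, ensemble_class_embedding(embs)))
        for label, embs in class_prompt_embeddings.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical 2-D embeddings standing in for encoded category-specific prompts.
prompts = {
    "cat": [[1.0, 0.1], [0.9, 0.2]],
    "dog": [[0.1, 1.0], [0.2, 0.8]],
}
label, scores = zero_shot_classify([0.8, 0.3], prompts)
```

Averaging several prompt embeddings per class gives one classifier weight per class, so a larger and more diverse prompt set changes the classifier without any training.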

arXiv:2402.08324 [pdf, other]

Uncertainty Quantification via Stable Distribution Propagation

Authors: Felix Petersen, Aashwin Mishra, Hilde Kuehne, Christian Borgelt, Oliver Deussen, Mikhail Yurochkin

Abstract: We propose a new approach for propagating stable probability distributions through neural networks. Our method is based on local linearization, which we show to be an optimal approximation in terms of total variation distance for the ReLU non-linearity. This allows propagating Gaussian and Cauchy input uncertainties through neural networks to quantify their output uncertainties. To demonstrate the utility of propagating distributions, we apply the proposed method to predicting calibrated confidence intervals and selective prediction on out-of-distribution data. The results demonstrate a broad applicability of propagating distributions and show the advantages of our method over other approaches such as moment matching.

Submitted 13 February, 2024; originally announced February 2024.

Comments: Published at ICLR 2024, Code @ https://github.com/Felix-Petersen/distprop
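
A minimal sketch of the propagation idea in the abstract above, under simplifying assumptions that are not taken from the paper: inputs are independent, only per-unit marginals (location and scale) are tracked, and the ReLU is linearized at the location parameter. The function names are hypothetical.

```python
import math


def linear_gaussian(mu, sigma, W, b):
    """Marginals of y = W x + b for independent Gaussian inputs (exact per output)."""
    out_mu = [sum(w * m for w, m in zip(row, mu)) + bi for row, bi in zip(W, b)]
    out_sigma = [math.sqrt(sum((w * s) ** 2 for w, s in zip(row, sigma))) for row in W]
    return out_mu, out_sigma


def linear_cauchy(x0, gamma, W, b):
    """Marginals of y = W x + b for independent Cauchy inputs: scales add as |w| * gamma."""
    out_x0 = [sum(w * m for w, m in zip(row, x0)) + bi for row, bi in zip(W, b)]
    out_gamma = [sum(abs(w) * g for w, g in zip(row, gamma)) for row in W]
    return out_x0, out_gamma


def relu_local_linearization(loc, scale):
    """Linearize ReLU at the location: slope 1 where loc > 0, slope 0 elsewhere."""
    out_loc = [max(0.0, m) for m in loc]
    out_scale = [s if m > 0 else 0.0 for m, s in zip(loc, scale)]
    return out_loc, out_scale


# Hypothetical one-layer example.
W = [[1.0, 2.0], [-1.0, 0.0]]
b = [0.0, 0.0]
g_mu, g_sigma = linear_gaussian([1.0, 1.0], [3.0, 2.0], W, b)
g_mu, g_sigma = relu_local_linearization(g_mu, g_sigma)
c_x0, c_gamma = linear_cauchy([1.0, 1.0], [3.0, 2.0], W, b)
```

Both families are closed under affine maps, so the only approximation in this sketch is at the non-linearity, where the distribution is pushed through the tangent of the ReLU at its location.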

arXiv:2312.15289 [pdf, other]

Fréchet Wavelet Distance: A Domain-Agnostic Metric for Image Generation

Authors: Lokesh Veeramacheneni, Moritz Wolter, Hildegard Kuehne, Juergen Gall

Abstract: Modern metrics for generative learning like Fréchet Inception Distance (FID) demonstrate impressive performance. However, they suffer from various shortcomings, like a bias towards specific generators and datasets. To address this problem, we propose the Fréchet Wavelet Distance (FWD) as a domain-agnostic metric based on Wavelet Packet Transform ($W_p$). FWD provides insight across a broad spectrum of frequencies in images with a high resolution, while preserving both spatial and textural aspects. Specifically, we use $W_p$ to project generated and dataset images to packet coefficient space. Further, we compute Fréchet distance with the resultant coefficients to evaluate the quality of a generator. This metric is general-purpose and dataset-domain agnostic, as it does not rely on any pre-trained network while being more interpretable because of frequency band transparency. We conclude from an extensive evaluation of a wide variety of generators across various datasets that the proposed FWD is able to generalize and improve robustness to domain shift and various corruptions compared to other metrics.

Submitted 10 June, 2024; v1 submitted 23 December, 2023; originally announced December 2023.
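
The Fréchet distance step in the FWD abstract can be sketched under two assumptions that the abstract does not state: the coefficients are summarized by Gaussian statistics, as in FID, and covariances are treated as diagonal to keep the example dependency-free. The wavelet packet transform itself is not implemented; the feature vectors below are hypothetical stand-ins for packet coefficients.

```python
import math


def mean_and_var(samples):
    """Per-dimension mean and unbiased variance of a list of feature vectors."""
    n = len(samples)
    dim = len(samples[0])
    mean = [sum(s[j] for s in samples) / n for j in range(dim)]
    var = [sum((s[j] - mean[j]) ** 2 for s in samples) / (n - 1) for j in range(dim)]
    return mean, var


def frechet_distance_diag(mu1, var1, mu2, var2):
    """Squared Fréchet distance between Gaussians with diagonal covariances.

    General form: ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2)).
    For diagonal S1, S2 the trace term reduces to sum((sqrt(v1) - sqrt(v2))^2).
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    cov_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var1, var2))
    return mean_term + cov_term


# Hypothetical coefficient vectors for "real" and "generated" images.
real = [[1.0, 2.0], [3.0, 6.0]]
fake = [[2.0, 2.0], [4.0, 6.0]]
mu_r, var_r = mean_and_var(real)
mu_f, var_f = mean_and_var(fake)
distance = frechet_distance_diag(mu_r, var_r, mu_f, var_f)
```

A full implementation would use the matrix square root of the covariance product; the diagonal case shown here is the special case where that square root is elementwise.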

arXiv:2312.00878 [pdf, other]

Grounding Everything: Emerging Localization Properties in Vision-Language Transformers

Authors: Walid Bousselham, Felix Petersen, Vittorio Ferrari, Hilde Kuehne

Abstract: Vision-language foundation models have shown remarkable performance in various zero-shot settings such as image retrieval, classification, or captioning. But so far, those models seem to fall behind when it comes to zero-shot localization of referential expressions and objects in images. As a result, they need to be fine-tuned for this task. In this paper, we show that pretrained vision-language (VL) models allow for zero-shot open-vocabulary object localization without any fine-tuning. To leverage those capabilities, we propose a Grounding Everything Module (GEM) that generalizes the idea of value-value attention introduced by CLIPSurgery to a self-self attention path. We show that the concept of self-self attention corresponds to clustering, thus enforcing groups of tokens arising from the same object to be similar while preserving the alignment with the language space. To further guide the group formation, we propose a set of regularizations that allows the model to finally generalize across datasets and backbones. We evaluate the proposed GEM framework on various benchmark tasks and datasets for semantic segmentation. It shows that GEM not only outperforms other training-free open-vocabulary localization methods, but also achieves state-of-the-art results on the recently proposed OpenImagesV7 large-scale segmentation benchmark.

Submitted 14 December, 2023; v1 submitted 1 December, 2023; originally announced December 2023.

Comments: Code available at https://github.com/WalBouss/GEM

arXiv:2311.06231 [pdf, other]

Learning Human Action Recognition Representations Without Real Humans

Authors: Howard Zhong, Samarth Mishra, Donghyun Kim, SouYoung Jin, Rameswar Panda, Hilde Kuehne, Leonid Karlinsky, Venkatesh Saligrama, Aude Oliva, Rogerio Feris

Abstract: Pre-training on massive video datasets has become essential to achieve high action recognition performance on smaller downstream datasets. However, most large-scale video datasets contain images of people and hence are accompanied by issues related to privacy, ethics, and data protection, often preventing them from being publicly shared for reproducible research. Existing work has attempted to alleviate these problems by blurring faces, downsampling videos, or training on synthetic data. On the other hand, analysis on the transferability of privacy-preserving pre-trained models to downstream tasks has been limited. In this work, we study this problem by first asking the question: can we pre-train models for human action recognition with data that does not include real humans? To this end, we present, for the first time, a benchmark that leverages real-world videos with humans removed and synthetic data containing virtual humans to pre-train a model. We then evaluate the transferability of the representation learned on this data to a diverse set of downstream action recognition benchmarks. Furthermore, we propose a novel pre-training strategy, called Privacy-Preserving MAE-Align, to effectively combine synthetic data and human-removed real data. Our approach outperforms previous baselines by up to 5% and closes the performance gap between human and no-human action recognition representations on downstream tasks, for both linear probing and fine-tuning. Our benchmark, code, and models are available at https://github.com/howardzh01/PPMA.

Submitted 10 November, 2023; originally announced November 2023.

Comments: 19 pages, 7 figures, 2023 NeurIPS Datasets and Benchmarks Track

arXiv:2310.04900 [pdf, other]

HowToCaption: Prompting LLMs to Transform Video Annotations at Scale

Authors: Nina Shvetsova, Anna Kukleva, Xudong Hong, Christian Rupprecht, Bernt Schiele, Hilde Kuehne

Abstract: Instructional videos are an excellent source for learning multimodal representations by leveraging video-subtitle pairs extracted with automatic speech recognition systems (ASR) from the audio signal in the videos. However, in contrast to human-annotated captions, both speech and subtitles naturally differ from the visual content of the videos and thus provide only noisy supervision for multimodal learning. As a result, large-scale annotation-free web video training data remains sub-optimal for training text-video models. In this work, we propose to leverage the capability of large language models (LLMs) to obtain fine-grained video descriptions aligned with videos. Specifically, we prompt an LLM to create plausible video descriptions based on ASR narrations of the video for a large-scale instructional video dataset. To this end, we introduce a prompting method that is able to take into account a longer text of subtitles, allowing us to capture context beyond a single sentence. To align the captions to the video temporally, we prompt the LLM to generate timestamps for each produced caption based on the subtitles. In this way, we obtain human-style video captions at scale without human supervision. We apply our method to the subtitles of the HowTo100M dataset, creating a new large-scale dataset, HowToCaption. Our evaluation shows that the resulting captions not only significantly improve the performance over many different benchmark datasets for text-video retrieval but also lead to a disentangling of textual narration from the audio, boosting performance in text-video-audio tasks.

Submitted 7 October, 2023; originally announced October 2023.

Comments: https://github.com/ninatu/howtocaption

arXiv:2309.08928 [pdf, other]

In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval

Authors: Nina Shvetsova, Anna Kukleva, Bernt Schiele, Hilde Kuehne

Abstract: Large-scale noisy web image-text datasets have been proven to be efficient for learning robust vision-language models. However, when transferring them to the task of video retrieval, models still need to be fine-tuned on hand-curated paired text-video data to adapt to the diverse styles of video descriptions. To address this problem without the need for hand-annotated pairs, we propose a new setting, text-video retrieval with uncurated & unpaired data, that during training utilizes only text queries together with uncurated web videos without any paired text-video data. To this end, we propose an approach, In-Style, that learns the style of the text queries and transfers it to uncurated web videos. Moreover, to improve generalization, we show that one model can be trained with multiple text styles. To this end, we introduce a multi-style contrastive training procedure that improves the generalizability over several datasets simultaneously. We evaluate our model on retrieval performance over multiple datasets to demonstrate the advantages of our style transfer framework on the new task of uncurated & unpaired text-video retrieval and improve state-of-the-art performance on zero-shot text-video retrieval.

Submitted 16 September, 2023; originally announced September 2023.

Comments: Published at ICCV 2023, code: https://github.com/ninatu/in_style

arXiv:2308.13077 [pdf, other]

Preserving Modality Structure Improves Multi-Modal Learning

Authors: Swetha Sirnam, Mamshad Nayeem Rizve, Nina Shvetsova, Hilde Kuehne, Mubarak Shah

Abstract: Self-supervised learning on large-scale multi-modal datasets allows learning semantically meaningful embeddings in a joint multi-modal representation space without relying on human annotations. These joint embeddings enable zero-shot cross-modal tasks like retrieval and classification. However, these methods often struggle to generalize well on out-of-domain data as they ignore the semantic structure present in modality-specific embeddings. In this context, we propose a novel Semantic-Structure-Preserving Consistency approach to improve generalizability by preserving the modality-specific relationships in the joint embedding space. To capture modality-specific semantic relationships between samples, we propose to learn multiple anchors and represent the multifaceted relationship between samples with respect to their relationship with these anchors. To assign multiple anchors to each sample, we propose a novel Multi-Assignment Sinkhorn-Knopp algorithm. Our experimentation demonstrates that our proposed approach learns semantically meaningful anchors in a self-supervised manner. Furthermore, our evaluation on MSR-VTT and YouCook2 datasets demonstrates that our proposed multi-anchor assignment based solution achieves state-of-the-art performance and generalizes to both in- and out-of-domain datasets. Code: https://github.com/Swetha5/Multi_Sinkhorn_Knopp

Submitted 24 August, 2023; originally announced August 2023.

Comments: Accepted at ICCV 2023
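
For orientation, the classical Sinkhorn-Knopp iteration that the abstract above extends can be sketched as below. This is the standard single-assignment algorithm on a square score matrix, not the paper's Multi-Assignment variant; the temperature `epsilon`, the iteration count, and the toy scores are hypothetical choices.

```python
import math


def sinkhorn_knopp(scores, epsilon=0.5, n_iters=200):
    """Turn a square score matrix into an approximately doubly stochastic matrix.

    Rows could be samples and columns anchors; higher scores get more mass.
    """
    # Exponentiate so that every entry is strictly positive.
    P = [[math.exp(s / epsilon) for s in row] for row in scores]
    n_rows, n_cols = len(P), len(P[0])
    for _ in range(n_iters):
        # Row step: rescale each row to sum to 1.
        for i in range(n_rows):
            r = sum(P[i])
            P[i] = [p / r for p in P[i]]
        # Column step: rescale each column to sum to 1.
        for j in range(n_cols):
            c = sum(P[i][j] for i in range(n_rows))
            for i in range(n_rows):
                P[i][j] /= c
    return P


# Hypothetical sample-to-anchor similarity scores.
scores = [[0.9, 0.1, 0.3],
          [0.2, 0.8, 0.4],
          [0.3, 0.5, 0.7]]
P = sinkhorn_knopp(scores)
row_sums = [sum(row) for row in P]
col_sums = [sum(P[i][j] for i in range(3)) for j in range(3)]
```

The alternating normalization spreads assignments evenly over anchors, which prevents all samples from collapsing onto a single anchor.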

arXiv:2306.15521 [pdf, other]

What a MESS: Multi-Domain Evaluation of Zero-Shot Semantic Segmentation

Authors: Benedikt Blumenstiel, Johannes Jakubik, Hilde Kühne, Michael Vössing

Abstract: While semantic segmentation has seen tremendous improvements in the past, there are still significant labeling efforts necessary and the problem of limited generalization to classes that have not been present during training. To address this problem, zero-shot semantic segmentation makes use of large self-supervised vision-language models, allowing zero-shot transfer to unseen classes. In this work, we build a benchmark for Multi-domain Evaluation of Semantic Segmentation (MESS), which allows a holistic analysis of performance across a wide range of domain-specific datasets such as medicine, engineering, earth monitoring, biology, and agriculture. To do this, we reviewed 120 datasets, developed a taxonomy, and classified the datasets according to the developed taxonomy. We select a representative subset consisting of 22 datasets and propose it as the MESS benchmark. We evaluate eight recently published models on the proposed MESS benchmark and analyze characteristics for the performance of zero-shot transfer models. The toolkit is available at https://github.com/blumenstiel/MESS.

Submitted 16 December, 2023; v1 submitted 27 June, 2023; originally announced June 2023.

Comments: 37th Conference on Neural Information Processing Systems (NeurIPS 2023) Track on Datasets and Benchmarks

arXiv:2305.12606 [pdf, other]

Comparison of Multilingual Self-Supervised and Weakly-Supervised Speech Pre-Training for Adaptation to Unseen Languages

Authors: Andrew Rouditchenko, Sameer Khurana, Samuel Thomas, Rogerio Feris, Leonid Karlinsky, Hilde Kuehne, David Harwath, Brian Kingsbury, James Glass

Abstract: Recent models such as XLS-R and Whisper have made multilingual speech technologies more accessible by pre-training on audio from around 100 spoken languages each. However, there are thousands of spoken languages worldwide, and adapting to new languages is an important problem. In this work, we aim to understand which model adapts better to languages unseen during pre-training. We fine-tune both models on 13 unseen languages and 18 seen languages. Our results show that the number of hours seen per language and language family during pre-training is predictive of how the models compare, despite the significant differences in the pre-training methods.

Submitted 30 May, 2023; v1 submitted 21 May, 2023; originally announced May 2023.

Comments: Accepted at Interspeech 2023

arXiv:2305.00604 [pdf, other]

ISAAC Newton: Input-based Approximate Curvature for Newton's Method

Authors: Felix Petersen, Tobias Sutter, Christian Borgelt, Dongsung Huh, Hilde Kuehne, Yuekai Sun, Oliver Deussen

Abstract: We present ISAAC (Input-baSed ApproximAte Curvature), a novel method that conditions the gradient using selected second-order information and has an asymptotically vanishing computational overhead, assuming a batch size smaller than the number of neurons. We show that it is possible to compute a good conditioner based on only the input to a respective layer without a substantial computational overhead. The proposed method allows effective training even in small-batch stochastic regimes, which makes it competitive with first-order as well as second-order methods.

Submitted 30 April, 2023; originally announced May 2023.

Comments: Published at ICLR 2023, Code @ https://github.com/Felix-Petersen/isaac, Video @ https://youtu.be/7RKRX-MdwqM

arXiv:2304.13116 [pdf, other]

Spin-liquid-like state in a square lattice antiferromagnet

Authors: B. Sana, M. Barik, S. Lee, U. Jena, M. Baenitz, J. Sichelschmidt, S. Luther, H. Kuehne, K. Sethupathi, M. S. Ramachandra Rao, K. Y. Choi, P. Khuntia

Abstract: Collective behavior of spins, frustration-induced strong quantum fluctuations and subtle interplay between competing degrees of freedom in quantum materials can lead to correlated quantum states with fractional excitations that are essential ingredients for establishing paradigmatic models and have immense potential for quantum technologies. Quenched randomness is a new paradigm in elucidating the emergence of spin-liquid-like states in geometrically frustrated magnets. Herein, we report magnetization, specific heat, electron spin resonance, and muon spin resonance studies on a 3d-electron-based square lattice antiferromagnet Sr$_3$CuTa$_2$O$_9$. In this material, S = 1/2 Cu$^{2+}$ nearest-neighbor ions constitute a two-dimensional square lattice. The negative value of Curie-Weiss temperature, obtained from the Curie-Weiss fit of high-temperature magnetic susceptibility data, indicates the presence of antiferromagnetic interaction between Cu$^{2+}$ moments. Specific heat data show the absence of long-range magnetic ordering down to 64 mK despite a reasonably strong exchange interaction between Cu$^{2+}$ spins as reflected from a Curie-Weiss temperature of -27 K. The power-law behavior and the data collapse of specific heat and magnetization data evince the emergence of a random-singlet state in Sr$_3$CuTa$_2$O$_9$. The power-law-like spin auto-correlation function and the data collapse of muon polarization asymmetry with longitudinal field dependence of $t/(\mu_0 H)^{\gamma}$ lend further credence to the presence of a randomness-induced liquid-like state. Our results suggest that randomness induced by disorder is a viable route to realize a quantum spin-liquid-like state in this square lattice antiferromagnet.

Submitted 25 April, 2023; originally announced April 2023.

arXiv:2304.08682 [pdf, other]

Learning Situation Hyper-Graphs for Video Question Answering

Authors: Aisha Urooj Khan, Hilde Kuehne, Bo Wu, Kim Chheu, Walid Bousselham, Chuang Gan, Niels Lobo, Mubarak Shah

Abstract: Answering questions about complex situations in videos requires not only capturing the presence of actors, objects, and their relations but also the evolution of these relationships over time. A situation hyper-graph is a representation that describes situations as scene sub-graphs for video frames and hyper-edges for connected sub-graphs and has been proposed to capture all such information in a compact structured form. In this work, we propose an architecture for Video Question Answering (VQA) that enables answering questions related to video content by predicting situation hyper-graphs, coined Situation Hyper-Graph based Video Question Answering (SHG-VQA). To this end, we train a situation hyper-graph decoder to implicitly identify graph representations with actions and object/human-object relationships from the input video clip, and to use cross-attention between the predicted situation hyper-graphs and the question embedding to predict the correct answer. The proposed method is trained in an end-to-end manner and optimized by a VQA loss with the cross-entropy function and a Hungarian matching loss for the situation graph prediction. The effectiveness of the proposed architecture is extensively evaluated on two challenging benchmarks: AGQA and STAR. Our results show that learning the underlying situation hyper-graphs helps the system to significantly improve its performance for novel challenges of video question-answering tasks.

Submitted 6 May, 2023; v1 submitted 17 April, 2023; originally announced April 2023.

arXiv:2304.05088 [pdf, other]

WEAR: An Outdoor Sports Dataset for Wearable and Egocentric Activity Recognition

Authors: Marius Bock, Hilde Kuehne, Kristof Van Laerhoven, Michael Moeller

Abstract: Though research has shown the complementarity of camera- and inertial-based data, datasets which offer both egocentric video and inertial-based sensor data remain scarce. In this paper, we introduce WEAR, an outdoor sports dataset for both vision- and inertial-based human activity recognition (HAR). The dataset comprises data from 18 participants performing a total of 18 different workout activities with untrimmed inertial (acceleration) and camera (egocentric video) data recorded at 10 different outside locations. Unlike previous egocentric datasets, WEAR provides a challenging prediction scenario marked by purposely introduced activity variations as well as an overall small information overlap across modalities. Benchmark results obtained using each modality separately show that each modality interestingly offers complementary strengths and weaknesses in their prediction performance. Further, in light of the recent success of temporal action localization models following the architecture design of the ActionFormer, we demonstrate their versatility by applying them in a plain fashion using vision, inertial and combined (vision + inertial) features as input. Results demonstrate both the applicability of vision-based temporal action localization models for inertial data and fusing both modalities by means of simple concatenation, with the combined approach (vision + inertial features) being able to produce the highest mean average precision and close-to-best F1-score. The dataset and code to reproduce experiments are publicly available via: https://mariusbock.github.io/wear/

Submitted 21 November, 2023; v1 submitted 11 April, 2023; originally announced April 2023.

Comments: 15 pages, 3 figures, 2 tables

arXiv:2303.16990 [pdf, other]

What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions

Authors: Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Daniel Kondermann, Samuel Thomas, Shih-Fu Chang, Rogerio Feris, James Glass, Hilde Kuehne

Abstract: Spatio-temporal grounding describes the task of localizing events in space and time, e.g., in video data, based on verbal descriptions only. Models for this task are usually trained with human-annotated sentences and bounding box supervision. This work addresses this task from a multimodal supervision perspective, proposing a framework for spatio-temporal action grounding trained on loose video and subtitle supervision only, without human annotation. To this end, we combine local representation learning, which focuses on leveraging fine-grained spatial information, with a global representation encoding that captures higher-level representations and incorporates both in a joint approach. To evaluate this challenging task in a real-life setting, a new benchmark dataset is proposed providing dense spatio-temporal grounding annotations in long, untrimmed, multi-action instructional videos for over 5K events. We evaluate the proposed approach and other methods on the proposed and standard downstream tasks showing that our method improves over current baselines in various settings, including spatial, temporal, and untrimmed multi-action spatio-temporal grounding.

Submitted 28 May, 2024; v1 submitted 29 March, 2023; originally announced March 2023.

Comments: To be presented at CVPR 2024. Project page: https://brian7685.github.io/STG/

arXiv:2303.13664 [pdf, other]

Temperature Schedules for Self-Supervised Contrastive Methods on Long-Tail Data

Authors: Anna Kukleva, Moritz Böhle, Bernt Schiele, Hilde Kuehne, Christian Rupprecht

Abstract: Most approaches for self-supervised learning (SSL) are optimised on curated balanced datasets, e.g. ImageNet, despite the fact that natural data usually exhibits long-tail distributions. In this paper, we analyse the behaviour of one of the most popular variants of SSL, i.e. contrastive methods, on long-tail data. In particular, we investigate the role of the temperature parameter $τ$ in the contrastive loss, by analysing the loss through the lens of average distance maximisation, and find that a large $τ$ emphasises group-wise discrimination, whereas a small $τ$ leads to a higher degree of instance discrimination. While $τ$ has thus far been treated exclusively as a constant hyperparameter, in this work, we propose to employ a dynamic $τ$ and show that a simple cosine schedule can yield significant improvements in the learnt representations. Such a schedule results in a constant 'task switching' between an emphasis on instance discrimination and group-wise discrimination and thereby ensures that the model learns both group-wise features, as well as instance-specific details. Since frequent classes benefit from the former, while infrequent classes require the latter, we find this method to consistently improve separation between the classes in long-tail data without any additional computational cost.

Submitted 23 March, 2023; originally announced March 2023.

Comments: ICLR 2023
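
The abstract above describes replacing a constant temperature $τ$ with a cosine schedule that alternates between large and small values. A minimal sketch of one such schedule follows. The function name `cosine_temperature`, the bounds `tau_min` and `tau_max`, and the `period` parameter are illustrative assumptions; they are not values or an implementation taken from the paper.

```python
import math


def cosine_temperature(step, period, tau_min=0.1, tau_max=1.0):
    """Return a temperature that oscillates between tau_min and tau_max.

    All parameter names and default values are hypothetical; the abstract
    only states that a simple cosine schedule is used for tau.
    """
    # Phase advances by 2*pi every `period` steps, so the schedule repeats.
    phase = 2.0 * math.pi * step / period
    # cos(phase) lies in [-1, 1]; rescale it to [tau_min, tau_max].
    return tau_min + 0.5 * (tau_max - tau_min) * (1.0 + math.cos(phase))


# The schedule starts at tau_max (emphasis on group-wise discrimination)
# and reaches tau_min (emphasis on instance discrimination) at half a period.
schedule = [cosine_temperature(s, period=100) for s in range(200)]
```

Under these assumptions the temperature would be recomputed at every training step and passed to the contrastive loss in place of the usual constant.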

arXiv:2303.08914 [pdf, other]

MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge

Authors: Wei Lin, Leonid Karlinsky, Nina Shvetsova, Horst Possegger, Mateusz Kozinski, Rameswar Panda, Rogerio Feris, Hilde Kuehne, Horst Bischof

Abstract: Large scale Vision-Language (VL) models have shown tremendous success in aligning representations between visual and text modalities. This enables remarkable progress in zero-shot recognition, image generation & editing, and many other exciting tasks. However, VL models tend to over-represent objects while paying much less attention to verbs, and require additional tuning on video data for best zero-shot action recognition performance. While previous work relied on large-scale, fully-annotated data, in this work we propose an unsupervised approach. We adapt a VL model for zero-shot and few-shot action recognition using a collection of unlabeled videos and an unpaired action dictionary. Based on that, we leverage Large Language Models and VL models to build a text bag for each unlabeled video via matching, text expansion and captioning. We use those bags in a Multiple Instance Learning setup to adapt an image-text backbone to video data. Although finetuned on unlabeled video data, our resulting models demonstrate high transferability to numerous unseen zero-shot downstream tasks, improving the base VL model performance by up to 14%, and even comparing favorably to fully-supervised baselines in both zero-shot and few-shot video recognition transfer. The code will be released later at https://github.com/wlin-at/MAXI.

Submitted 22 July, 2023; v1 submitted 15 March, 2023; originally announced March 2023.

Comments: Accepted at ICCV 2023

arXiv:2303.05166 [pdf, other]

TAEC: Unsupervised Action Segmentation with Temporal-Aware Embedding and Clustering

Authors: Wei Lin, Anna Kukleva, Horst Possegger, Hilde Kuehne, Horst Bischof

Abstract: Temporal action segmentation in untrimmed videos has gained increased attention recently. However, annotating action classes and frame-wise boundaries is extremely time consuming and cost intensive, especially on large-scale datasets. To address this issue, we propose an unsupervised approach for learning action classes from untrimmed video sequences. In particular, we propose a temporal embedding network that combines relative time prediction, feature reconstruction, and sequence-to-sequence learning, to preserve the spatial layout and sequential nature of the video features. A two-step clustering pipeline on these embedded feature representations then allows us to enforce temporal consistency within, as well as across videos. Based on the identified clusters, we decode the video into coherent temporal segments that correspond to semantically meaningful action classes. Our evaluation on three challenging datasets shows the impact of each component and, furthermore, demonstrates our state-of-the-art unsupervised action segmentation results.

Submitted 9 March, 2023; originally announced March 2023.

Comments: Computer Vision Winter Workshop 2023

arXiv:2301.02009 [pdf, other]

Learning by Sorting: Self-supervised Learning with Group Ordering Constraints

Authors: Nina Shvetsova, Felix Petersen, Anna Kukleva, Bernt Schiele, Hilde Kuehne

Abstract: Contrastive learning has become an important tool in learning representations from unlabeled data mainly relying on the idea of minimizing distance between positive data pairs, e.g., views from the same images, and maximizing distance between negative data pairs, e.g., views from different images. This paper proposes a new variation of the contrastive learning objective, Group Ordering Constraints (GroCo), that leverages the idea of sorting the distances of positive and negative pairs and computing the respective loss based on how many positive pairs have a larger distance than the negative pairs, and thus are not ordered correctly. To this end, the GroCo loss is based on differentiable sorting networks, which enable training with sorting supervision by matching a differentiable permutation matrix, which is produced by sorting a given set of scores, to a respective ground truth permutation matrix. Applying this idea to groupwise pre-ordered inputs of multiple positive and negative pairs allows introducing the GroCo loss with implicit emphasis on strong positives and negatives, leading to better optimization of the local neighborhood. We evaluate the proposed formulation on various self-supervised learning benchmarks and show that it not only leads to improved results compared to vanilla contrastive learning but also shows competitive performance to comparable methods in linear probing and outperforms current methods in k-NN performance.

Submitted 18 August, 2023; v1 submitted 5 January, 2023; originally announced January 2023.

Comments: Published at ICCV 2023, Code @ https://github.com/ninatu/learning_by_sorting

arXiv:2211.15393 [pdf, other]

Video Test-Time Adaptation for Action Recognition

Authors: Wei Lin, Muhammad Jehanzeb Mirza, Mateusz Kozinski, Horst Possegger, Hilde Kuehne, Horst Bischof

Abstract: Although action recognition systems can achieve top performance when evaluated on in-distribution test points, they are vulnerable to unanticipated distribution shifts in test data. However, test-time adaptation of video action recognition models against common distribution shifts has so far not been demonstrated. We propose to address this problem with an approach tailored to spatio-temporal models that is capable of adaptation on a single video sample at a step. It consists of a feature distribution alignment technique that aligns online estimates of test set statistics towards the training statistics. We further enforce prediction consistency over temporally augmented views of the same test video sample. Evaluations on three benchmark action recognition datasets show that our proposed technique is architecture-agnostic and able to significantly boost the performance on both the state-of-the-art convolutional architecture TANet and the Video Swin Transformer. Our proposed method demonstrates a substantial performance gain over existing test-time adaptation approaches in both evaluations of a single distribution shift and the challenging case of random distribution shifts. Code will be available at https://github.com/wlin-at/ViTTA.

Submitted 20 March, 2023; v1 submitted 24 November, 2022; originally announced November 2022.

Comments: Accepted at CVPR 2023

arXiv:2210.08277 [pdf, other]

Deep Differentiable Logic Gate Networks

Authors: Felix Petersen, Christian Borgelt, Hilde Kuehne, Oliver Deussen

Abstract: Recently, research has increasingly focused on developing efficient neural network architectures. In this work, we explore logic gate networks for machine learning tasks by learning combinations of logic gates. These networks comprise logic gates such as "AND" and "XOR", which allow for very fast execution. The difficulty in learning logic gate networks is that they are conventionally non-differentiable and therefore do not allow training with gradient descent. Thus, to allow for effective training, we propose differentiable logic gate networks, an architecture that combines real-valued logics and a continuously parameterized relaxation of the network. The resulting discretized logic gate networks achieve fast inference speeds, e.g., beyond a million images of MNIST per second on a single CPU core.

Submitted 15 October, 2022; originally announced October 2022.

Comments: Published at NeurIPS 2022
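
The abstract above mentions two ingredients: real-valued logics and a continuously parameterized relaxation of the gate choice. The sketch below illustrates one common way to realise both, and it is an assumption rather than the paper's implementation: gates are evaluated with probabilistic real-valued forms (for example, AND as a product), and the choice of gate is relaxed to a softmax-weighted mixture over three candidate gates. The names `GATES`, `relaxed_gate`, and `discretize` are hypothetical.

```python
import math

# Real-valued (probabilistic) relaxations of Boolean gates for inputs in [0, 1].
# On inputs restricted to {0, 1} they reproduce the usual truth tables.
GATES = {
    "AND": lambda a, b: a * b,
    "OR": lambda a, b: a + b - a * b,
    "XOR": lambda a, b: a + b - 2.0 * a * b,
}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def relaxed_gate(a, b, logits):
    """Softmax-weighted mixture of all candidate gates (differentiable in logits)."""
    weights = softmax(logits)
    return sum(w * gate(a, b) for w, gate in zip(weights, GATES.values()))


def discretize(logits):
    """After training, keep only the gate with the largest logit."""
    names = list(GATES)
    return names[max(range(len(logits)), key=logits.__getitem__)]


# With uniform logits the output is the plain average of the three gates.
mixed = relaxed_gate(1.0, 0.0, [0.0, 0.0, 0.0])
chosen = discretize([0.1, 2.0, -1.0])
```

In this sketch the logits would be the trainable parameters during gradient descent, and `discretize` would then pick a single hard gate per node for fast inference.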

arXiv:2210.07839 [pdf, other]

Contrastive Audio-Visual Masked Autoencoder

Authors: Yuan Gong, Andrew Rouditchenko, Alexander H. Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, James Glass

Abstract: In this paper, we first extend the recent Masked Auto-Encoder (MAE) model from a single modality to audio-visual multi-modalities. Subsequently, we propose the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE) by combining contrastive learning and masked data modeling, two major self-supervised learning frameworks, to learn a joint and coordinated audio-visual representation. Our experiments show that the contrastive audio-visual correspondence learning objective not only enables the model to perform audio-visual retrieval tasks, but also helps the model learn a better joint representation. As a result, our fully self-supervised pretrained CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound, and is comparable with the previous best supervised pretrained model on AudioSet in the audio-visual event classification task. Code and pretrained models are at https://github.com/yuangongnd/cav-mae.

Submitted 11 April, 2023; v1 submitted 2 October, 2022; originally announced October 2022.

Comments: Accepted at ICLR 2023 as a notable top 25% paper. Code and pretrained models are at https://github.com/yuangongnd/cav-mae

arXiv:2210.03625 [pdf, other]

C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval

Authors: Andrew Rouditchenko, Yung-Sung Chuang, Nina Shvetsova, Samuel Thomas, Rogerio Feris, Brian Kingsbury, Leonid Karlinsky, David Harwath, Hilde Kuehne, James Glass

Abstract: Multilingual text-video retrieval methods have improved significantly in recent years, but the performance for other languages lags behind English. We propose a Cross-Lingual Cross-Modal Knowledge Distillation method to improve multilingual text-video retrieval. Inspired by the fact that English text-video retrieval outperforms other languages, we train a student model using input text in different languages to match the cross-modal predictions from teacher models using input text in English. We propose a cross entropy based objective which forces the distribution over the student's text-video similarity scores to be similar to those of the teacher models. We introduce a new multilingual video dataset, Multi-YouCook2, by translating the English captions in the YouCook2 video dataset to 8 other languages. Our method improves multilingual text-video retrieval performance on Multi-YouCook2 and several other datasets such as Multi-MSRVTT and VATEX. We also conducted an analysis on the effectiveness of different multilingual text models as teachers. The code, models, and dataset are available at https://github.com/roudimit/c2kd.

Submitted 9 May, 2023; v1 submitted 7 October, 2022; originally announced October 2022.

Comments: Accepted at ICASSP 2023. The code, models, and dataset are available at https://github.com/roudimit/c2kd

arXiv:2209.11085 [pdf, other]

doi 10.1103/PhysRevLett.130.086704

Field-tunable Berezinskii-Kosterlitz-Thouless correlations in a Heisenberg magnet

Authors: D. Opherden, M. S. J. Tepaske, F. Bärtl, M. Weber, M. M. Turnbull, T. Lancaster, S. J. Blundell, M. Baenitz, J. Wosnitza, C. P. Landee, R. Moessner, D. J. Luitz, H. Kühne

Abstract: We report the manifestation of field-induced Berezinskii-Kosterlitz-Thouless (BKT) correlations in the weakly coupled spin-1/2 Heisenberg layers of the molecular-based bulk material [Cu(pz)$_2$(2-HOpy)$_2$](PF$_6$)$_2$. Due to the moderate intralayer exchange coupling of $J/k_\mathrm{B} = 6.8$ K, the application of laboratory magnetic fields induces a substantial $XY$ anisotropy of the spin correlations. Crucially, this provides a significant BKT regime, as the tiny interlayer exchange $J^\prime / k_\mathrm{B} \approx 1$ mK only induces 3D correlations upon close approach to the BKT transition with its exponential growth in the spin-correlation length. We employ nuclear magnetic resonance and $μ^{+}$SR measurements to probe the spin correlations that determine the critical temperatures of the BKT transition as well as that of the onset of long-range order. Further, we perform stochastic series expansion quantum Monte Carlo simulations based on the experimentally determined model parameters. Finite-size scaling of the in-plane spin stiffness yields excellent agreement of critical temperatures between theory and experiment, providing clear evidence that the nonmonotonic magnetic phase diagram of [Cu(pz)$_2$(2-HOpy)$_2$](PF$_6$)$_2$ is determined by the field-tuned $XY$ anisotropy and the concomitant BKT physics.

Submitted 22 September, 2022; originally announced September 2022.

Comments: 10 pages, 7 figures

arXiv:2209.06103 [pdf, other]

VL-Taboo: An Analysis of Attribute-based Zero-shot Capabilities of Vision-Language Models

Authors: Felix Vogel, Nina Shvetsova, Leonid Karlinsky, Hilde Kuehne

Abstract: Vision-language models trained on large, randomly collected data have had a significant impact in many areas since they appeared. But while they show great performance in various fields, such as image-text retrieval, their inner workings are still not fully understood. The current work analyses the true zero-shot capabilities of those models. We start from the analysis of the training corpus, assessing to what extent (and which of) the test classes are really zero-shot and how this correlates with individual class performance. We follow up with the analysis of the attribute-based zero-shot learning capabilities of these models, evaluating how well this classical zero-shot notion emerges from large-scale webly supervision. We leverage the recently released LAION400M data corpus as well as the publicly available pretrained models of CLIP, OpenCLIP, and FLAVA, evaluating the attribute-based zero-shot capabilities on CUB and AWA2 benchmarks. Our analysis shows that: (i) most of the classes in popular zero-shot benchmarks are observed (a lot) during pre-training; (ii) zero-shot performance mainly comes out of models' capability of recognizing class labels, whenever they are present in the text, and a significantly lower performing capability of attribute-based zero-shot learning is only observed when class labels are not used; (iii) the number of the attributes used can have a significant effect on performance, and can easily cause a significant performance decrease.

Submitted 12 September, 2022; originally announced September 2022.

arXiv:2208.01956 [pdf, other]

Augmentation Learning for Semi-Supervised Classification

Authors: Tim Frommknecht, Pedro Alves Zipf, Quanfu Fan, Nina Shvetsova, Hilde Kuehne

Abstract: Recently, a number of new Semi-Supervised Learning methods have emerged. As the accuracy for ImageNet and similar datasets increased over time, the performance on tasks beyond the classification of natural images is yet to be explored. Most Semi-Supervised Learning methods rely on a carefully manually designed data augmentation pipeline that is not transferable for learning on images of other domains. In this work, we propose a Semi-Supervised Learning method that automatically selects the most effective data augmentation policy for a particular dataset. We build upon the Fixmatch method and extend it with meta-learning of augmentations. The augmentation is learned in additional training before the classification training and makes use of bi-level optimization, to optimize the augmentation policy and maximize accuracy. We evaluate our approach on two domain-specific datasets, containing satellite images and hand-drawn sketches, and obtain state-of-the-art results. We further investigate in an ablation the different parameters relevant for learning augmentation policies and show how policy learning can be used to adapt augmentations to datasets beyond ImageNet.

Submitted 3 August, 2022; originally announced August 2022.

Comments: Accepted to GCPR 2022, 13 pages with 4 figures

arXiv:2207.05148 [pdf, ps, other]

doi 10.1103/PhysRevB.106.115125

Structural 130-K Phase Transition and Emergence of a Two-Ion Kondo State in HT-Ce$_2$Rh$_2$Ga Explored by $^{69,71}$Ga Nuclear Quadrupole Resonance

Authors: Sh. Yamamoto, T. Fujii, S. Luther, H. Yasuoka, H. Sakai, F. Bärtl, K. M. Ranjith, H. Rosner, J. Wosnitza, A. M. Strydom, H. Kühne, M. Baenitz

Abstract: We have studied the microscopic magnetic properties, the nature of the 130-K phase transition, and the ground state in the recently synthesized compound Ce$_2$Rh$_2$Ga by use of $^{69,71}$Ga nuclear quadrupole resonance (NQR). The NQR spectra clearly show an unusual phase transition at $T_t$ $\sim$ 130 K yielding a splitting of the high-temperature single NQR line into two clearly resolved NQR lines, providing evidence for two crystallographically inequivalent Ga sites. The NQR frequencies are in good agreement with fully-relativistic calculations of the band structure. Our NQR results indicate the absence of magnetic or charge order down to 0.3 K. The temperature dependence of the spin-lattice relaxation rate, 1/$T_1$, shows three distinct regimes, with onset temperatures at $T_t$ and 2 K. The temperature-independent 1/$T_1$, observed between $T_t$ and 2 K, crosses over to a Korringa process, 1/$T_1$ $\propto$ $T$, below $\sim$ 2 K, which evidences a rare two-ion Kondo scenario: the system goes into a dense Kondo coherent state at 2.0 and 0.8 K for the two different Ga sites.

Submitted 11 July, 2022; originally announced July 2022.

Comments: 12 pages, 7 figures

Journal ref: Phys. Rev. B 106, 115125 (2022)

arXiv:2207.02334 [pdf, other]

Weakly Supervised Grounding for VQA in Vision-Language Transformers

Authors: Aisha Urooj Khan, Hilde Kuehne, Chuang Gan, Niels Da Vitoria Lobo, Mubarak Shah

Abstract: Transformers for visual-language representation learning have been getting a lot of interest and have shown tremendous performance on visual question answering (VQA) and grounding. But most systems that show good performance on those tasks still rely on pre-trained object detectors during training, which limits their applicability to the object classes available for those detectors. To mitigate this limitation, this paper focuses on the problem of weakly supervised grounding in the context of visual question answering in transformers. The approach leverages capsules by grouping each visual token in the visual encoder and uses activations from language self-attention layers as a text-guided selection module to mask those capsules before they are forwarded to the next layer. We evaluate our approach on the challenging GQA as well as VQA-HAT dataset for VQA grounding. Our experiments show that, while removing the information of masked objects from standard transformer architectures leads to a significant drop in performance, the integration of capsules significantly improves the grounding ability of such systems and provides new state-of-the-art results compared to other approaches in the field.

Submitted 5 July, 2022; originally announced July 2022.

Comments: To appear at ECCV 2022

arXiv:2206.07290 [pdf, other]

Differentiable Top-k Classification Learning

Authors: Felix Petersen, Hilde Kuehne, Christian Borgelt, Oliver Deussen

Abstract: The top-k classification accuracy is one of the core metrics in machine learning. Here, k is conventionally a positive integer, such as 1 or 5, leading to top-1 or top-5 training objectives. In this work, we relax this assumption and optimize the model for multiple k simultaneously instead of using a single k. Leveraging recent advances in differentiable sorting and ranking, we propose a differentiable top-k cross-entropy classification loss. This allows training the network while not only considering the top-1 prediction, but also, e.g., the top-2 and top-5 predictions. We evaluate the proposed loss function for fine-tuning on state-of-the-art architectures, as well as for training from scratch. We find that relaxing k does not only produce better top-5 accuracies, but also leads to top-1 accuracy improvements. When fine-tuning publicly available ImageNet models, we achieve a new state-of-the-art for these models.

Submitted 15 June, 2022; originally announced June 2022.

Comments: Published at ICML 2022, Code @ https://github.com/Felix-Petersen/difftopk

arXiv:2203.16244 [pdf, other]

CycDA: Unsupervised Cycle Domain Adaptation from Image to Video

Authors: Wei Lin, Anna Kukleva, Kunyang Sun, Horst Possegger, Hilde Kuehne, Horst Bischof

Abstract: Although action recognition has achieved impressive results over recent years, both collection and annotation of video training data are still time-consuming and cost intensive. Therefore, image-to-video adaptation has been proposed to exploit labeling-free web image source for adapting on unlabeled target videos. This poses two major challenges: (1) spatial domain shift between web images and video frames; (2) modality gap between image and video data. To address these challenges, we propose Cycle Domain Adaptation (CycDA), a cycle-based approach for unsupervised image-to-video domain adaptation by leveraging the joint spatial information in images and videos on the one hand and, on the other hand, training an independent spatio-temporal model to bridge the modality gap. We alternate between the spatial and spatio-temporal learning with knowledge transfer between the two in each cycle. We evaluate our approach on benchmark datasets for image-to-video as well as for mixed-source domain adaptation achieving state-of-the-art results and demonstrating the benefits of our cyclic adaptation. Code is available at https://github.com/wlin-at/CycDA.

Submitted 22 March, 2023; v1 submitted 30 March, 2022; originally announced March 2022.

Comments: Accepted at ECCV2022. Supplementary included

arXiv:2203.09630 [pdf, other]

Monotonic Differentiable Sorting Networks

Authors: Felix Petersen, Christian Borgelt, Hilde Kuehne, Oliver Deussen

Abstract: Differentiable sorting algorithms allow training with sorting and ranking supervision, where only the ordering or ranking of samples is known. Various methods have been proposed to address this challenge, ranging from optimal transport-based differentiable Sinkhorn sorting algorithms to making classic sorting networks differentiable. One problem of current differentiable sorting methods is that they are non-monotonic. To address this issue, we propose a novel relaxation of conditional swap operations that guarantees monotonicity in differentiable sorting networks. We introduce a family of sigmoid functions and prove that they produce differentiable sorting networks that are monotonic. Monotonicity ensures that the gradients always have the correct sign, which is an advantage in gradient-based optimization. We demonstrate that monotonic differentiable sorting networks improve upon previous differentiable sorting methods.

Submitted 17 March, 2022; originally announced March 2022.

Comments: Published at ICLR 2022, Code @ https://github.com/Felix-Petersen/diffsort, Video @ https://www.youtube.com/watch?v=Rl-sFaE1z4M

arXiv:2201.02138 [pdf, other]

doi 10.1103/PhysRevB.106.L060503

Orbital-Induced Crossover of the Fulde-Ferrell-Larkin-Ovchinnikov Phase into Abrikosov-like States

Authors: Tommy Kotte, Hannes Kühne, John A Schlueter, Gertrud Zwicknagl, J. Wosnitza

Abstract: The Fulde-Ferrell-Larkin-Ovchinnikov (FFLO) state can emerge in superconductors for which the orbital critical field exceeds the Pauli limit. Here, we present angular-resolved specific-heat data of the quasi-two-dimensional organic superconductor $κ$-(ET)$_2$Cu(NCS)$_2$, with a focus on high fields in the regime of the FFLO transition. For an increasing out-of-plane tilt of the applied magnetic field, which leads to an increase of orbital contributions, we found that the nature of the superconducting transition changes from second to first order and that a further transition appears within the high-field superconducting phase. However, the superconducting state above the Pauli limit is stable for field tilt of several degrees. Since any finite perpendicular component of the magnetic field necessarily leads to quantization of the orbital motion, the resulting vortex lattice states compete with the modulated order parameter of the FFLO state leading to complex high-field superconducting phases. By solving the linearized self-consistency equation within weak-coupling BCS theory, we show that our results are clear experimental evidence of an orbital-induced transformation of the FFLO order-parameter into Abrikosov-like states of higher Landau levels.

Submitted 6 May, 2022; v1 submitted 6 January, 2022; originally announced January 2022.

Comments: Main text: 6 Pages, 4 Figures Supplement: 11 Pages, 6 Figures

Journal ref: Phys. Rev. B 106, L060503 (2022)

arXiv:2112.04446 [pdf, other]

Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval

Authors: Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Hilde Kuehne

Abstract: Multi-modal learning from video data has seen increased attention recently as it allows to train semantically meaningful embeddings without human annotation enabling tasks like zero-shot retrieval and classification. In this work, we present a multi-modal, modality agnostic fusion transformer approach that learns to exchange information between multiple modalities, such as video, audio, and text, and integrate them into a joined multi-modal representation to obtain an embedding that aggregates multi-modal temporal information. We propose to train the system with a combinatorial loss on everything at once, single modalities as well as pairs of modalities, explicitly leaving out any add-ons such as position or modality encoding. At test time, the resulting model can process and fuse any number of input modalities. Moreover, the implicit properties of the transformer allow to process inputs of different lengths. To evaluate the proposed approach, we train the model on the large scale HowTo100M dataset and evaluate the resulting embedding space on four challenging benchmark datasets obtaining state-of-the-art results in zero-shot video retrieval and zero-shot video action localization.

Submitted 18 August, 2022; v1 submitted 8 December, 2021; originally announced December 2021.

Comments: CVPR2022. The final published version of the proceedings will be available on IEEE Xplore

arXiv:2112.02300 [pdf, other]

Unsupervised Domain Generalization by Learning a Bridge Across Domains

Authors: Sivan Harary, Eli Schwartz, Assaf Arbelle, Peter Staar, Shady Abu-Hussein, Elad Amrani, Roei Herzig, Amit Alfassy, Raja Giryes, Hilde Kuehne, Dina Katabi, Kate Saenko, Rogerio Feris, Leonid Karlinsky

Abstract: The ability to generalize learned representations across significantly different visual domains, such as between real photos, clipart, paintings, and sketches, is a fundamental capacity of the human visual system. In this paper, different from most cross-domain works that utilize some (or full) source domain supervision, we approach a relatively new and very practical Unsupervised Domain Generalization (UDG) setup of having no training supervision in neither source nor target domains. Our approach is based on self-supervised learning of a Bridge Across Domains (BrAD) - an auxiliary bridge domain accompanied by a set of semantics preserving visual (image-to-image) mappings to BrAD from each of the training domains. The BrAD and mappings to it are learned jointly (end-to-end) with a contrastive self-supervised representation model that semantically aligns each of the domains to its BrAD-projection, and hence implicitly drives all the domains (seen or unseen) to semantically align to each other. In this work, we show how using an edge-regularized BrAD our approach achieves significant gains across multiple benchmarks and a range of tasks, including UDG, Few-shot UDA, and unsupervised generalization across multi-domain datasets (including generalization to unseen domains and classes).

Submitted 17 May, 2022; v1 submitted 4 December, 2021; originally announced December 2021.

arXiv:2112.00775 [pdf, other]

Routing with Self-Attention for Multimodal Capsule Networks

Authors: Kevin Duarte, Brian Chen, Nina Shvetsova, Andrew Rouditchenko, Samuel Thomas, Alexander Liu, David Harwath, James Glass, Hilde Kuehne, Mubarak Shah

Abstract: The task of multimodal learning has seen a growing interest recently as it allows for training neural architectures based on different modalities such as vision, text, and audio. One challenge in training such models is that they need to jointly learn semantic concepts and their relationships across different input representations. Capsule networks have been shown to perform well in context of capturing the relation between low-level input features and higher-level concepts. However, capsules have so far mainly been used only in small-scale fully supervised settings due to the resource demand of conventional routing algorithms. We present a new multimodal capsule network that allows us to leverage the strength of capsules in the context of a multimodal learning framework on large amounts of video data. To adapt the capsules to large-scale input data, we propose a novel routing by self-attention mechanism that selects relevant capsules which are then used to generate a final joint multimodal feature representation. This allows not only for robust training with noisy video data, but also to scale up the size of the capsule network compared to traditional routing methods while still being computationally efficient. We evaluate the proposed architecture by pretraining it on a large-scale multimodal video dataset and applying it on four datasets in two challenging downstream tasks.
Results show that the proposed multimodal capsule network is not only able to improve results compared to other routing techniques, but also achieves competitive performance on the task of multimodal learning.

Submitted 1 December, 2021; originally announced December 2021.

arXiv:2111.04823 [pdf, other]

Cascaded Multilingual Audio-Visual Learning from Videos

Authors: Andrew Rouditchenko, Angie Boggust, David Harwath, Samuel Thomas, Hilde Kuehne, Brian Chen, Rameswar Panda, Rogerio Feris, Brian Kingsbury, Michael Picheny, James Glass

Abstract: In this paper, we explore self-supervised audio-visual models that learn from instructional videos. Prior work has shown that these models can relate spoken words and sounds to visual content after training on a large-scale dataset of videos, but they were only trained and evaluated on videos in English. To learn multilingual audio-visual representations, we propose a cascaded approach that leverages a model trained on English videos and applies it to audio-visual data in other languages, such as Japanese videos. With our cascaded approach, we show an improvement in retrieval performance of nearly 10x compared to training on the Japanese videos solely. We also apply the model trained on English videos to Japanese and Hindi spoken captions of images, achieving state-of-the-art performance.

Submitted 8 November, 2021; originally announced November 2021.

Comments: Presented at Interspeech 2021. This version contains updated results using the YouCook-Japanese dataset

arXiv:2110.10784 [pdf, other]

Style Agnostic 3D Reconstruction via Adversarial Style Transfer

Authors: Felix Petersen, Bastian Goldluecke, Oliver Deussen, Hilde Kuehne

Abstract: Reconstructing the 3D geometry of an object from an image is a major challenge in computer vision. Recently introduced differentiable renderers can be leveraged to learn the 3D geometry of objects from 2D images, but those approaches require additional supervision to enable the renderer to produce an output that can be compared to the input image. This can be scene information or constraints such as object silhouettes, uniform backgrounds, material, texture, and lighting. In this paper, we propose an approach that enables a differentiable rendering-based learning of 3D objects from images with backgrounds without the need for silhouette supervision. Instead of trying to render an image close to the input, we propose an adversarial style-transfer and domain adaptation pipeline that allows to translate the input image domain to the rendered image domain. This allows us to directly compare between a translated image and the differentiable rendering of a 3D object reconstruction in order to train the 3D object reconstruction network. We show that the approach learns 3D geometry from images with backgrounds and provides a better performance than constrained methods for single-view 3D object reconstruction on this task.

Submitted 20 October, 2021; originally announced October 2021.

Comments: To be published at WACV 2022, Code @ https://github.com/Felix-Petersen/style-agnostic-3d-reconstruction

arXiv:2110.05651 [pdf, other]

Learning with Algorithmic Supervision via Continuous Relaxations

Authors: Felix Petersen, Christian Borgelt, Hilde Kuehne, Oliver Deussen

Abstract: The integration of algorithmic components into neural architectures has gained increased attention recently, as it allows training neural networks with new forms of supervision such as ordering constraints or silhouettes instead of using ground truth labels. Many approaches in the field focus on the continuous relaxation of a specific task and show promising results in this context. But the focus on single tasks also limits the applicability of the proposed concepts to a narrow range of applications. In this work, we build on those ideas to propose an approach that allows to integrate algorithms into end-to-end trainable neural network architectures based on a general approximation of discrete conditions. To this end, we relax these conditions in control structures such as conditional statements, loops, and indexing, so that resulting algorithms are smoothly differentiable. To obtain meaningful gradients, each relevant variable is perturbed via logistic distributions and the expectation value under this perturbation is approximated. We evaluate the proposed continuous relaxation model on four challenging tasks and show that it can keep up with relaxations specifically designed for each individual task.

Submitted 25 October, 2021; v1 submitted 11 October, 2021; originally announced October 2021.

Comments: Published at NeurIPS 2021, Code @ https://github.com/Felix-Petersen/algovision, Video @ https://www.youtube.com/watch?v=01ENzpkjOCE

arXiv:2109.02582 [pdf, other]

doi 10.1103/PhysRevB.104.134410

The planar triangular $S=3/2$ magnet AgCrSe$_2$: magnetic frustration, short range correlations, and field tuned anisotropic cycloidal magnetic order

Authors: M. Baenitz, M. M. Piva, S. Luther, J. Sichelschmidt, K. M. Ranjith, H. Dawczak-Dȩbicki, M. O. Ajeesh, S. -J. Kim, G. Siemann, C. Bigi, P. Manuel, D. Khalyavin, D. A. Sokolov, P. Mokhtari, H. Zhang, H. Yasuoka, P. D. C. King, G. Vinai, V. Polewczyk, P. Torelli, J. Wosnitza, U. Burkhardt, B. Schmidt, H. Rosner, S. Wirth , et al. (3 additional authors not shown)

Abstract: Our studies evidence an anisotropic magnetic order below $T_N = 32$~K. Susceptibility data in small fields of about 1~T reveal an antiferromagnetic (AFM) order for $H \perp c$, whereas for $H \parallel c$ the data are reminiscent of a field-induced ferromagnetic (FM) structure. At low temperatures and for $H \perp c$, the field-dependent magnetization and AC susceptibility data evidence a metamagnetic transition at $H^+ = 5$~T, which is absent for $H \parallel c$. We assign this to a transition from a planar cycloidal spin structure at low fields to a planar fan-like arrangement above $H^+$. A fully FM polarized state is obtained above the saturation field of $H_{\perp S} = 23.7$~T at 2~K with a magnetization of $M_s = 2.8$~$μ_{\rm B}{\rm /Cr}$. For $H \parallel c$, $M(H)$ monotonously increases and saturates at the same $M_s$ value at $H_{\parallel S} = 25.1$~T at 4.2~K. Above $T_N $, the magnetic susceptibility and specific heat indicate signatures of two dimensional (2D) frustration related to the presence of planar ferromagnetic and antiferromagnetic exchange interactions. We found a pronounced nearly isotropic maximum in both properties at about $T^* = 45$~K, which is a clear fingerprint of short-range correlations and emergent spin fluctuations. Calculations based on a planar 2D Heisenberg model support our experimental findings and suggest a predominant FM exchange among nearest and AFM exchange among third-nearest neighbors.
Only a minor contribution might be assigned to the antisymmetric Dzyaloshinskii-Moriya interaction possible related to the non-centrosymmetric polar space group $R3m$. Due to these competing interactions, the magnetism in AgCrSe$_{2}$, in contrast to the oxygen based delafossites, can be tuned by relatively small, experimentally accessible, magnetic fields, allowing us to establish the complete anisotropic magnetic $H-T$ phase diagram in detail.

Submitted 6 September, 2021; originally announced September 2021.

arXiv:2108.08165 [pdf, other]

Generalized and Incremental Few-Shot Learning by Explicit Learning and Calibration without Forgetting

Authors: Anna Kukleva, Hilde Kuehne, Bernt Schiele

Abstract: Both generalized and incremental few-shot learning have to deal with three major challenges: learning novel classes from only few samples per class, preventing catastrophic forgetting of base classes, and classifier calibration across novel and base classes. In this work we propose a three-stage framework that allows to explicitly and effectively address these challenges. While the first phase learns base classes with many samples, the second phase learns a calibrated classifier for novel classes from few samples while also preventing catastrophic forgetting. In the final phase, calibration is achieved across all classes. We evaluate the proposed framework on four challenging benchmark datasets for image and video few-shot classification and obtain state-of-the-art results for both generalized and incremental few shot learning.

Submitted 18 August, 2021; originally announced August 2021.

Comments: ICCV 2021

arXiv:2105.04836 [pdf, other]

Found a Reason for me? Weakly-supervised Grounded Visual Question Answering using Capsules

Authors: Aisha Urooj Khan, Hilde Kuehne, Kevin Duarte, Chuang Gan, Niels Lobo, Mubarak Shah

Abstract: The problem of grounding VQA tasks has seen an increased attention in the research community recently, with most attempts usually focusing on solving this task by using pretrained object detectors. However, pre-trained object detectors require bounding box annotations for detecting relevant objects in the vocabulary, which may not always be feasible for real-life large-scale applications. In this paper, we focus on a more relaxed setting: the grounding of relevant visual entities in a weakly supervised manner by training on the VQA task alone. To address this problem, we propose a visual capsule module with a query-based selection mechanism of capsule features, that allows the model to focus on relevant regions based on the textual cues about visual information in the question. We show that integrating the proposed capsule module in existing VQA systems significantly improves their performance on the weakly supervised grounding task. Overall, we demonstrate the effectiveness of our approach on two state-of-the-art VQA systems, stacked NMN and MAC, on the CLEVR-Answers benchmark, our new evaluation set based on CLEVR scenes with ground truth bounding boxes for objects that are relevant for the correct answer, as well as on GQA, a real world VQA dataset with compositional questions. We show that the systems with the proposed capsule module consistently outperform the respective baseline systems in terms of answer grounding, while achieving comparable performance on VQA task.

Submitted 11 May, 2021; originally announced May 2021.

Comments: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021

arXiv:2105.04019 [pdf, other]

Differentiable Sorting Networks for Scalable Sorting and Ranking Supervision

Authors: Felix Petersen, Christian Borgelt, Hilde Kuehne, Oliver Deussen

Abstract: Sorting and ranking supervision is a method for training neural networks end-to-end based on ordering constraints. That is, the ground truth order of sets of samples is known, while their absolute values remain unsupervised. For that, we propose differentiable sorting networks by relaxing their pairwise conditional swap operations. To address the problems of vanishing gradients and extensive blurring that arise with larger numbers of layers, we propose mapping activations to regions with moderate gradients. We consider odd-even as well as bitonic sorting networks, which outperform existing relaxations of the sorting operation. We show that bitonic sorting networks can achieve stable training on large input sets of up to 1024 elements.

Submitted 14 July, 2021; v1 submitted 9 May, 2021; originally announced May 2021.

Comments: Published at ICML 2021, Code @ https://github.com/Felix-Petersen/diffsort, Video @ https://www.youtube.com/watch?v=38dvqdYEs1o

Journal ref: PMLR 139:8546-8555, 2021
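
Illustration: the relaxed conditional swap mentioned in the abstract above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the logistic weighting, the `steepness` parameter, and the function names `soft_swap` and `soft_odd_even_sort` are all hypothetical choices made for this example, and the activation-mapping step described in the abstract is omitted.

```python
import math


def soft_swap(a, b, steepness=10.0):
    """Relaxed conditional swap: returns (soft_min, soft_max) of a and b.

    Assumption: a logistic sigmoid of the scaled difference is used as the
    mixing weight. It is written via tanh for numerical stability.
    """
    z = steepness * (b - a)
    s = 0.5 * (1.0 + math.tanh(z / 2.0))  # equals 1 / (1 + exp(-z))
    soft_min = s * a + (1.0 - s) * b      # close to min(a, b) when |z| is large
    soft_max = s * b + (1.0 - s) * a      # close to max(a, b) when |z| is large
    return soft_min, soft_max


def soft_odd_even_sort(values, steepness=10.0):
    """Odd-even sorting network with n layers of relaxed swaps."""
    x = list(values)
    n = len(x)
    for layer in range(n):
        start = layer % 2  # alternate between pairs (0,1),(2,3),... and (1,2),(3,4),...
        for i in range(start, n - 1, 2):
            x[i], x[i + 1] = soft_swap(x[i], x[i + 1], steepness)
    return x


if __name__ == "__main__":
    print(soft_odd_even_sort([3.0, 1.0, 2.0], steepness=50.0))
```

Each relaxed swap preserves the sum of its two inputs, and a larger `steepness` moves the output closer to a hard sort at the cost of smaller gradients away from ties.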

arXiv:2105.00067 [pdf, other]

Unsupervised Discriminative Embedding for Sub-Action Learning in Complex Activities

Authors: Sirnam Swetha, Hilde Kuehne, Yogesh S Rawat, Mubarak Shah

Abstract: Action recognition and detection in the context of long untrimmed video sequences has seen an increased attention from the research community. However, annotation of complex activities is usually time consuming and challenging in practice. Therefore, recent works started to tackle the problem of unsupervised learning of sub-actions in complex activities. This paper proposes a novel approach for unsupervised sub-action learning in complex activities. The proposed method maps both visual and temporal representations to a latent space where the sub-actions are learnt discriminatively in an end-to-end fashion. To this end, we propose to learn sub-actions as latent concepts and a novel discriminative latent concept learning (DLCL) module aids in learning sub-actions. The proposed DLCL module lends on the idea of latent concepts to learn compact representations in the latent embedding space in an unsupervised way. The result is a set of latent vectors that can be interpreted as cluster centers in the embedding space. The latent space itself is formed by a joint visual and temporal embedding capturing the visual similarity and temporal ordering of the data. Our joint learning with discriminative latent concept module is novel which eliminates the need for explicit clustering. We validate our approach on three benchmark datasets and show that the proposed combination of visual-temporal embedding and discriminative latent concepts allow to learn robust action representations in an unsupervised setting.

Submitted 30 April, 2021; originally announced May 2021.

arXiv:2104.12671 [pdf, other]

Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos

Authors: Brian Chen, Andrew Rouditchenko, Kevin Duarte, Hilde Kuehne, Samuel Thomas, Angie Boggust, Rameswar Panda, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Michael Picheny, Shih-Fu Chang

Abstract: Multimodal self-supervised learning is getting more and more attention as it allows not only to train large networks without human supervision but also to search and retrieve data across various modalities. In this context, this paper proposes a self-supervised training framework that learns a common multimodal embedding space that, in addition to sharing representations across different modalities, enforces a grouping of semantically similar instances. To this end, we extend the concept of instance-level contrastive learning with a multimodal clustering step in the training pipeline to capture semantic similarities across modalities. The resulting embedding space enables retrieval of samples across all modalities, even from unseen datasets and different domains. To evaluate our approach, we train our model on the HowTo100M dataset and evaluate its zero-shot retrieval capabilities in two challenging domains, namely text-to-video retrieval, and temporal action localization, showing state-of-the-art results on four different datasets.

Submitted 3 September, 2021; v1 submitted 26 April, 2021; originally announced April 2021.

Comments: To be presented at ICCV 2021

Journal ref: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 8012-8021

arXiv:2104.09829 [pdf, other]

Detector-Free Weakly Supervised Grounding by Separation

Authors: Assaf Arbelle, Sivan Doveh, Amit Alfassy, Joseph Shtok, Guy Lev, Eli Schwartz, Hilde Kuehne, Hila Barak Levi, Prasanna Sattigeri, Rameswar Panda, Chun-Fu Chen, Alex Bronstein, Kate Saenko, Shimon Ullman, Raja Giryes, Rogerio Feris, Leonid Karlinsky

Abstract: Nowadays, there is an abundance of data involving images and surrounding free-form text weakly corresponding to those images. Weakly Supervised phrase-Grounding (WSG) deals with the task of using this data to learn to localize (or to ground) arbitrary text phrases in images without any additional annotations. However, most recent SotA methods for WSG assume the existence of a pre-trained object detector, relying on it to produce the ROIs for localization. In this work, we focus on the task of Detector-Free WSG (DF-WSG) to solve WSG without relying on a pre-trained detector. We directly learn everything from the images and associated free-form text pairs, thus potentially gaining an advantage on the categories unsupported by the detector. The key idea behind our proposed Grounding by Separation (GbS) method is synthesizing `text to image-regions' associations by random alpha-blending of arbitrary image pairs and using the corresponding texts of the pair as conditions to recover the alpha map from the blended image via a segmentation network. At test time, this allows using the query phrase as a condition for a non-blended query image, thus interpreting the test image as a composition of a region corresponding to the phrase and the complement region. Using this approach we demonstrate a significant accuracy improvement, of up to $8.5\%$ over previous DF-WSG SotA, for a range of benchmarks including Flickr30K, Visual Genome, and ReferIt, as well as a significant complementary improvement (above $7\%$) over the detector-based approaches for WSG.

Submitted 20 April, 2021; originally announced April 2021.

arXiv:2101.01915 [pdf, ps, other]

doi 10.7566/JPSJ.90.023703

Emergence of Frustrated Short-Range Order above Long-Range Order in the $S=1/2$ Kagome Antiferromagnet CaCu$_3$(OD)$_6$Cl$_2\cdot0.6$D$_2$O

Authors: Yoshihiko Ihara, Kazuki Matsui, Yoshimitsu Kohama, Sven Luther, Daryna Opherden, Jochen Wosnitza, Hannes Kühne, Hiroyuki K. Yoshida

Abstract: We report on the low-energy dynamics in the kagome antiferromagnet CaCu$_3$(OD)$_6$Cl$_2\cdot0.6$D$_2$O (Ca-kapellasite) as studied by use of $^2$D-NMR measurements. Previous $^{35}$Cl-NMR measurements revealed that the nuclear spin-lattice relaxation rate ($1/T_1$) shows two peaks at temperatures, $T^{\ast} = 7.2$ K and $T_s \simeq 25$ K. While the low-temperature peak at $T^{\ast}$ is ascribed to the critical fluctuations near the long-range magnetic ordering, the origin of the high-temperature peak has not been fully understood. From the $1/T_1$ measurements on the D sites at the OD groups (D$_{\rm OD}$), we find no peak at $T_s$, evidencing that the high-temperature peak is not related to the molecular dynamics of the OD groups. We discuss the possibility of a frustration-induced short-range ordered state below $T_s$ before the long-range order is stabilized by the Dzyaloshinskii-Moriya interaction. We also observed static internal fields at the D$_{\rm OD}$ site in the long-range ordered state below $T^{\ast}$, and confirm the previously proposed negative-chirality $q=0$ magnetic structure.

Submitted 6 January, 2021; originally announced January 2021.

Comments: 5 pages, 4 figures

Showing 1–50 of 81 results for author: Kuehne, H