Search | arXiv e-print repository

What's left can't be right -- The remaining positional incompetence of contrastive vision-language models

Authors: Nils Hoehing, Ellen Rushe, Anthony Ventresque

Abstract: Contrastive vision-language models like CLIP have been found to lack spatial understanding capabilities. In this paper we discuss the possible causes of this phenomenon by analysing both datasets and embedding space. By focusing on simple left-right positional relations, we show that this behaviour is entirely predictable, even with large-scale datasets, demonstrate that these relations can be taught using synthetic data and show that this approach can generalise well to natural images - improving the performance on left-right relations on Visual Genome Relations.

Submitted 19 November, 2023; originally announced November 2023.

arXiv:2306.17558

Towards the extraction of robust sign embeddings for low resource sign language recognition

Authors: Mathieu De Coster, Ellen Rushe, Ruth Holmes, Anthony Ventresque, Joni Dambre

Abstract: Isolated Sign Language Recognition (SLR) has mostly been applied on datasets containing signs executed slowly and clearly by a limited group of signers. In real-world scenarios, however, we are met with challenging visual conditions, coarticulated signing, small datasets, and the need for signer independent models. To tackle this difficult problem, we require a robust feature extractor to process the sign language videos. One could expect human pose estimators to be ideal candidates. However, due to a domain mismatch with their training sets and challenging poses in sign language, they lack robustness on sign language data and image-based models often still outperform keypoint-based models. Furthermore, whereas the common practice of transfer learning with image-based models yields even higher accuracy, keypoint-based models are typically trained from scratch on every SLR dataset. These factors limit their usefulness for SLR. From the existing literature, it is also not clear which, if any, pose estimator performs best for SLR. We compare the three most popular pose estimators for SLR: OpenPose, MMPose and MediaPipe. We show that through keypoint normalization, missing keypoint imputation, and learning a pose embedding, we can obtain significantly better results and enable transfer learning. We show that keypoint-based embeddings contain cross-lingual features: they can transfer between sign languages and achieve competitive performance even when fine-tuning only the classifier layer of an SLR model on a target sign language. We furthermore achieve better performance using fine-tuned transferred embeddings than models trained only on the target sign language. The embeddings can also be learned in a multilingual fashion. The application of these embeddings could prove particularly useful for low resource sign languages in the future.

Submitted 16 August, 2023; v1 submitted 30 June, 2023; originally announced June 2023.

arXiv:2006.01168

Deep Context-Aware Novelty Detection

Authors: Ellen Rushe, Brian Mac Namee

Abstract: A common assumption of novelty detection is that the distribution of both "normal" and "novel" data are static. This, however, is often not the case - for example scenarios where data evolves over time or scenarios in which the definition of normal and novel depends on contextual information, both leading to changes in these distributions. This can lead to significant difficulties when attempting to train a model on datasets where the distribution of normal data in one scenario is similar to that of novel data in another scenario. In this paper we propose a context-aware approach to novelty detection for deep autoencoders to address these difficulties. We create a semi-supervised network architecture that utilises auxiliary labels to reveal contextual information and allow the model to adapt to a variety of contexts in which the definitions of normal and novel change. We evaluate our approach on both image data and real world audio data displaying these characteristics and show that the performance of individually trained models can be achieved in a single model.

Submitted 6 December, 2020; v1 submitted 1 June, 2020; originally announced June 2020.

arXiv:1908.00148

Personalized, Health-Aware Recipe Recommendation: An Ensemble Topic Modeling Based Approach

Authors: Mansura A. Khan, Ellen Rushe, Barry Smyth, David Coyle

Abstract: Food choices are personal and complex and have a significant impact on our long-term health and quality of life. By helping users to make informed and satisfying decisions, Recommender Systems (RS) have the potential to support users in making healthier food choices. Intelligent user-modeling is a key challenge in achieving this potential. This paper investigates Ensemble Topic Modelling (EnsTM) based Feature Identification techniques for efficient user-modeling and recipe recommendation. It builds on findings in EnsTM to propose a reduced data representation format and a smart user-modeling strategy that makes capturing user preferences fast, efficient and interactive. This approach enables personalization, even in a cold-start scenario. This paper proposes two different EnsTM based recommenders and one Hybrid EnsTM based recommender. We compared all three EnsTM based variations through a user study with 48 participants, using a large-scale, real-world corpus of 230,876 recipes, and compared them against a conventional Content Based (CB) approach. EnsTM based recommenders performed significantly better than the CB approach. Besides acknowledging multi-domain contents such as taste, demographics and cost, our proposed approach also considers users' nutritional preferences and assists them in finding recipes under diverse nutritional categories. Furthermore, it provides excellent coverage and enables implicit understanding of users' food practices. Subsequent analysis also exposed a correlation between certain features and a healthier lifestyle.

Submitted 31 July, 2019; originally announced August 2019.

Comments: This is a pre-print version of the accepted full paper in the HealthRecsys2019 workshop (https://healthrecsys.github.io/2019/). The final version of the article will be published in the workshop proceedings.

ACM Class: I.2

Showing 1–4 of 4 results for author: Rushe, E