Search | arXiv e-print repository

Human-inspired Explanations for Vision Transformers and Convolutional Neural Networks

Authors: Mahadev Prasad Panda, Matteo Tiezzi, Martina Vilas, Gemma Roig, Bjoern M. Eskofier, Dario Zanca

Abstract: We introduce Foveation-based Explanations (FovEx), a novel human-inspired visual explainability (XAI) method for Deep Neural Networks. Our method achieves state-of-the-art performance on both transformer (on 4 out of 5 metrics) and convolutional models (on 3 out of 5 metrics), demonstrating its versatility. Furthermore, we show the alignment between the explanation map produced by FovEx and human… ▽ More We introduce Foveation-based Explanations (FovEx), a novel human-inspired visual explainability (XAI) method for Deep Neural Networks. Our method achieves state-of-the-art performance on both transformer (on 4 out of 5 metrics) and convolutional models (on 3 out of 5 metrics), demonstrating its versatility. Furthermore, we show the alignment between the explanation map produced by FovEx and human gaze patterns (+14\% in NSS compared to RISE, +203\% in NSS compared to gradCAM), enhancing our confidence in FovEx's ability to close the interpretation gap between humans and machines. △ Less

Submitted 20 August, 2024; v1 submitted 4 August, 2024; originally announced August 2024.

Comments: Accepted at the Human-inspired Computer Vision (HCV) ECCV 2024 Workshop as an extended abstract. A long version of the work can be found at arXiv:2408.02123v1

arXiv:2407.20024 [pdf, other]

Fairness Through Controlled (Un)Awareness in Node Embeddings

Authors: Dennis Vetter, Jasper Forth, Gemma Roig, Holger Dell

Abstract: Graph representation learning is central for the application of machine learning (ML) models to complex graphs, such as social networks. Ensuring `fair' representations is essential, due to the societal implications and the use of sensitive personal data. In this paper, we demonstrate how the parametrization of the \emph{CrossWalk} algorithm influences the ability to infer a sensitive attributes f… ▽ More Graph representation learning is central for the application of machine learning (ML) models to complex graphs, such as social networks. Ensuring `fair' representations is essential, due to the societal implications and the use of sensitive personal data. In this paper, we demonstrate how the parametrization of the \emph{CrossWalk} algorithm influences the ability to infer a sensitive attributes from node embeddings. By fine-tuning hyperparameters, we show that it is possible to either significantly enhance or obscure the detectability of these attributes. This functionality offers a valuable tool for improving the fairness of ML systems utilizing graph embeddings, making them adaptable to different fairness paradigms. △ Less

Submitted 29 July, 2024; originally announced July 2024.

Comments: Poster at ICML 2024 Workshop on the Next Generation of AI Safety

arXiv:2407.20013 [pdf, other]

Classification of freshwater snails of the genus Radomaniola with multimodal triplet networks

Authors: Dennis Vetter, Muhammad Ahsan, Diana Delicado, Thomas A. Neubauer, Thomas Wilke, Gemma Roig

Abstract: In this paper, we present our first proposal of a machine learning system for the classification of freshwater snails of the genus Radomaniola. We elaborate on the specific challenges encountered during system design, and how we tackled them; namely a small, very imbalanced dataset with a high number of classes and high visual similarity between classes. We then show how we employed triplet networ… ▽ More In this paper, we present our first proposal of a machine learning system for the classification of freshwater snails of the genus Radomaniola. We elaborate on the specific challenges encountered during system design, and how we tackled them; namely a small, very imbalanced dataset with a high number of classes and high visual similarity between classes. We then show how we employed triplet networks and the multiple input modalities of images, measurements, and genetic information to overcome these challenges and reach a performance comparable to that of a trained domain expert. △ Less

Submitted 30 July, 2024; v1 submitted 29 July, 2024; originally announced July 2024.

Comments: Spotlight at ICML 2024 AI for Science workshop

arXiv:2406.01352 [pdf, other]

Position: An Inner Interpretability Framework for AI Inspired by Lessons from Cognitive Neuroscience

Authors: Martina G. Vilas, Federico Adolfi, David Poeppel, Gemma Roig

Abstract: Inner Interpretability is a promising emerging field tasked with uncovering the inner mechanisms of AI systems, though how to develop these mechanistic theories is still much debated. Moreover, recent critiques raise issues that question its usefulness to advance the broader goals of AI. However, it has been overlooked that these issues resemble those that have been grappled with in another field:… ▽ More Inner Interpretability is a promising emerging field tasked with uncovering the inner mechanisms of AI systems, though how to develop these mechanistic theories is still much debated. Moreover, recent critiques raise issues that question its usefulness to advance the broader goals of AI. However, it has been overlooked that these issues resemble those that have been grappled with in another field: Cognitive Neuroscience. Here we draw the relevant connections and highlight lessons that can be transferred productively between fields. Based on these, we propose a general conceptual framework and give concrete methodological strategies for building mechanistic explanations in AI inner interpretability research. With this conceptual framework, Inner Interpretability can fend off critiques and position itself on a productive path to explain AI systems. △ Less

Submitted 31 July, 2024; v1 submitted 3 June, 2024; originally announced June 2024.

Comments: Accepted at ICML 2024

arXiv:2405.05143 [pdf, other]

Learning Object Semantic Similarity with Self-Supervision

Authors: Arthur Aubret, Timothy Schaumlöffel, Gemma Roig, Jochen Triesch

Abstract: Humans judge the similarity of two objects not just based on their visual appearance but also based on their semantic relatedness. However, it remains unclear how humans learn about semantic relationships between objects and categories. One important source of semantic knowledge is that semantically related objects frequently co-occur in the same context. For instance, forks and plates are perceiv… ▽ More Humans judge the similarity of two objects not just based on their visual appearance but also based on their semantic relatedness. However, it remains unclear how humans learn about semantic relationships between objects and categories. One important source of semantic knowledge is that semantically related objects frequently co-occur in the same context. For instance, forks and plates are perceived as similar, at least in part, because they are often experienced together in a ``kitchen" or ``eating'' context. Here, we investigate whether a bio-inspired learning principle exploiting such co-occurrence statistics suffices to learn a semantically structured object representation {\em de novo} from raw visual or combined visual and linguistic input. To this end, we simulate temporal sequences of visual experience by binding together short video clips of real-world scenes showing objects in different contexts. A bio-inspired neural network model aligns close-in-time visual representations while also aligning visual and category label representations to simulate visuo-language alignment. Our results show that our model clusters object representations based on their context, e.g. kitchen or bedroom, in particular in high-level layers of the network, akin to humans. In contrast, lower-level layers tend to better reflect object identity or category. To achieve this, the model exploits two distinct strategies: the visuo-language alignment ensures that different objects of the same category are represented similarly, whereas the temporal alignment leverages that objects from the same context are frequently seen in succession to make their representations more similar. Overall, our work suggests temporal and visuo-language alignment as plausible computational principles for explaining the origins of certain forms of semantic knowledge in humans. △ Less

Submitted 19 April, 2024; originally announced May 2024.

arXiv:2402.09464 [pdf]

doi 10.1109/BIBM58861.2023.10385662.

Different Algorithms (Might) Uncover Different Patterns: A Brain-Age Prediction Case Study

Authors: Tobias Ettling, Sari Saba-Sadiya, Gemma Roig

Abstract: Machine learning is a rapidly evolving field with a wide range of applications, including biological signal analysis, where novel algorithms often improve the state-of-the-art. However, robustness to algorithmic variability - measured by different algorithms, consistently uncovering similar findings - is seldom explored. In this paper we investigate whether established hypotheses in brain-age pred… ▽ More Machine learning is a rapidly evolving field with a wide range of applications, including biological signal analysis, where novel algorithms often improve the state-of-the-art. However, robustness to algorithmic variability - measured by different algorithms, consistently uncovering similar findings - is seldom explored. In this paper we investigate whether established hypotheses in brain-age prediction from EEG research validate across algorithms. First, we surveyed literature and identified various features known to be informative for brain-age prediction. We employed diverse feature extraction techniques, processing steps, and models, and utilized the interpretative power of SHapley Additive exPlanations (SHAP) values to align our findings with the existing research in the field. Few of our models achieved state-of-the-art performance on the specific data-set we utilized. Moreover, analysis demonstrated that while most models do uncover similar patterns in the EEG signals, some variability could still be observed. Finally, a few prominent findings could only be validated using specific models. We conclude by suggesting remedies to the potential implications of this lack of robustness to model variability. △ Less

Submitted 8 February, 2024; originally announced February 2024.

Journal ref: 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 4051-4058

arXiv:2312.04118 [pdf, other]

doi 10.1109/ICDL55364.2023.10364409

Caregiver Talk Shapes Toddler Vision: A Computational Study of Dyadic Play

Authors: Timothy Schaumlöffel, Arthur Aubret, Gemma Roig, Jochen Triesch

Abstract: Infants' ability to recognize and categorize objects develops gradually. The second year of life is marked by both the emergence of more semantic visual representations and a better understanding of word meaning. This suggests that language input may play an important role in shaping visual representations. However, even in suitable contexts for word learning like dyadic play sessions, caregivers… ▽ More Infants' ability to recognize and categorize objects develops gradually. The second year of life is marked by both the emergence of more semantic visual representations and a better understanding of word meaning. This suggests that language input may play an important role in shaping visual representations. However, even in suitable contexts for word learning like dyadic play sessions, caregivers utterances are sparse and ambiguous, often referring to objects that are different from the one to which the child attends. Here, we systematically investigate to what extent caregivers' utterances can nevertheless enhance visual representations. For this we propose a computational model of visual representation learning during dyadic play. We introduce a synthetic dataset of ego-centric images perceived by a toddler-agent that moves and rotates toy objects in different parts of its home environment while hearing caregivers' utterances, modeled as captions. We propose to model toddlers' learning as simultaneously aligning representations for 1) close-in-time images and 2) co-occurring images and utterances. We show that utterances with statistics matching those of real caregivers give rise to representations supporting improved category recognition. Our analysis reveals that a small decrease/increase in object-relevant naming frequencies can drastically impact the learned representations. This affects the attention on object names within an utterance, which is required for efficient visuo-linguistic alignment. Overall, our results support the hypothesis that caregivers' naming utterances can improve toddlers' visual representations. △ Less

Submitted 17 January, 2024; v1 submitted 7 December, 2023; originally announced December 2023.

Comments: Proceedings of the 2023 IEEE International Conference on Development and Learning (ICDL)

Journal ref: "Caregiver Talk Shapes Toddler Vision: A Computational Study of Dyadic Play," 2023 IEEE International Conference on Development and Learning (ICDL), Macau, China, 2023, pp. 67-72

arXiv:2311.02599 [pdf, other]

Learning Class and Domain Augmentations for Single-Source Open-Domain Generalization

Authors: Prathmesh Bele, Valay Bundele, Avigyan Bhattacharya, Ankit Jha, Gemma Roig, Biplab Banerjee

Abstract: Single-source open-domain generalization (SS-ODG) addresses the challenge of labeled source domains with supervision during training and unlabeled novel target domains during testing. The target domain includes both known classes from the source domain and samples from previously unseen classes. Existing techniques for SS-ODG primarily focus on calibrating source-domain classifiers to identify ope… ▽ More Single-source open-domain generalization (SS-ODG) addresses the challenge of labeled source domains with supervision during training and unlabeled novel target domains during testing. The target domain includes both known classes from the source domain and samples from previously unseen classes. Existing techniques for SS-ODG primarily focus on calibrating source-domain classifiers to identify open samples in the target domain. However, these methods struggle with visually fine-grained open-closed data, often misclassifying open samples as closed-set classes. Moreover, relying solely on a single source domain restricts the model's ability to generalize. To overcome these limitations, we propose a novel framework called SODG-Net that simultaneously synthesizes novel domains and generates pseudo-open samples using a learning-based objective, in contrast to the ad-hoc mixing strategies commonly found in the literature. Our approach enhances generalization by diversifying the styles of known class samples using a novel metric criterion and generates diverse pseudo-open samples to train a unified and confident multi-class classifier capable of handling both open and closed-set data. Extensive experimental evaluations conducted on multiple benchmarks consistently demonstrate the superior performance of SODG-Net compared to the literature. △ Less

Submitted 5 November, 2023; originally announced November 2023.

Comments: 11 pages, WACV 2024

arXiv:2310.18969 [pdf, other]

Analyzing Vision Transformers for Image Classification in Class Embedding Space

Authors: Martina G. Vilas, Timothy Schaumlöffel, Gemma Roig

Abstract: Despite the growing use of transformer models in computer vision, a mechanistic understanding of these networks is still needed. This work introduces a method to reverse-engineer Vision Transformers trained to solve image classification tasks. Inspired by previous research in NLP, we demonstrate how the inner representations at any level of the hierarchy can be projected onto the learned class emb… ▽ More Despite the growing use of transformer models in computer vision, a mechanistic understanding of these networks is still needed. This work introduces a method to reverse-engineer Vision Transformers trained to solve image classification tasks. Inspired by previous research in NLP, we demonstrate how the inner representations at any level of the hierarchy can be projected onto the learned class embedding space to uncover how these networks build categorical representations for their predictions. We use our framework to show how image tokens develop class-specific representations that depend on attention mechanisms and contextual information, and give insights on how self-attention and MLP layers differentially contribute to this categorical composition. We additionally demonstrate that this method (1) can be used to determine the parts of an image that would be important for detecting the class of interest, and (2) exhibits significant advantages over traditional linear probing approaches. Taken together, our results position our proposed framework as a powerful tool for mechanistic interpretability and explainability research. △ Less

Submitted 29 October, 2023; originally announced October 2023.

Comments: NeurIPS 2023

arXiv:2301.03198 [pdf]

The Algonauts Project 2023 Challenge: How the Human Brain Makes Sense of Natural Scenes

Authors: A. T. Gifford, B. Lahner, S. Saba-Sadiya, M. G. Vilas, A. Lascelles, A. Oliva, K. Kay, G. Roig, R. M. Cichy

Abstract: The sciences of biological and artificial intelligence are ever more intertwined. Neural computational principles inspire new intelligent machines, which are in turn used to advance theoretical understanding of the brain. To promote further exchange of ideas and collaboration between biological and artificial intelligence researchers, we introduce the 2023 installment of the Algonauts Project chal… ▽ More The sciences of biological and artificial intelligence are ever more intertwined. Neural computational principles inspire new intelligent machines, which are in turn used to advance theoretical understanding of the brain. To promote further exchange of ideas and collaboration between biological and artificial intelligence researchers, we introduce the 2023 installment of the Algonauts Project challenge: How the Human Brain Makes Sense of Natural Scenes (http://algonauts.csail.mit.edu). This installment prompts the fields of artificial and biological intelligence to come together towards building computational models of the visual brain using the largest and richest dataset of fMRI responses to visual scenes, the Natural Scenes Dataset (NSD). NSD provides high-quality fMRI responses to ~73,000 different naturalistic colored scenes, making it the ideal candidate for data-driven model building approaches promoted by the 2023 challenge. The challenge is open to all and makes results directly comparable and transparent through a public leaderboard automatically updated after each submission, thus allowing for rapid model development. We believe that the 2023 installment will spark symbiotic collaborations between biological and artificial intelligence scientists, leading to a deeper understanding of the brain through cutting-edge computational models and to novel ways of engineering artificial intelligent agents through inductive biases from biological systems. △ Less

Submitted 11 July, 2023; v1 submitted 9 January, 2023; originally announced January 2023.

Comments: 5 pages, 2 figures

arXiv:2208.09677 [pdf, other]

Net2Brain: A Toolbox to compare artificial vision models with human brain responses

Authors: Domenic Bersch, Kshitij Dwivedi, Martina Vilas, Radoslaw M. Cichy, Gemma Roig

Abstract: We introduce Net2Brain, a graphical and command-line user interface toolbox for comparing the representational spaces of artificial deep neural networks (DNNs) and human brain recordings. While different toolboxes facilitate only single functionalities or only focus on a small subset of supervised image classification models, Net2Brain allows the extraction of activations of more than 600 DNNs tra… ▽ More We introduce Net2Brain, a graphical and command-line user interface toolbox for comparing the representational spaces of artificial deep neural networks (DNNs) and human brain recordings. While different toolboxes facilitate only single functionalities or only focus on a small subset of supervised image classification models, Net2Brain allows the extraction of activations of more than 600 DNNs trained to perform a diverse range of vision-related tasks (e.g semantic segmentation, depth estimation, action recognition, etc.), over both image and video datasets. The toolbox computes the representational dissimilarity matrices (RDMs) over those activations and compares them to brain recordings using representational similarity analysis (RSA), weighted RSA, both in specific ROIs and with searchlight search. In addition, it is possible to add a new data set of stimuli and brain recordings to the toolbox for evaluation. We demonstrate the functionality and advantages of Net2Brain with an example showcasing how it can be used to test hypotheses of cognitive computational neuroscience. △ Less

Submitted 25 August, 2022; v1 submitted 20 August, 2022; originally announced August 2022.

Comments: 4 Pages, 3 figures, submitted and accepted to CCNeuro 2022. For associated repository, see https://github.com/ToastyDom/Net2Brain Update 1: Changed Citation

arXiv:2208.04608 [pdf, other]

Using Sentence Embeddings and Semantic Similarity for Seeking Consensus when Assessing Trustworthy AI

Authors: Dennis Vetter, Jesmin Jahan Tithi, Magnus Westerlund, Roberto V. Zicari, Gemma Roig

Abstract: Assessing the trustworthiness of artificial intelligence systems requires knowledge from many different disciplines. These disciplines do not necessarily share concepts between them and might use words with different meanings, or even use the same words differently. Additionally, experts from different disciplines might not be aware of specialized terms readily used in other disciplines. Therefore… ▽ More Assessing the trustworthiness of artificial intelligence systems requires knowledge from many different disciplines. These disciplines do not necessarily share concepts between them and might use words with different meanings, or even use the same words differently. Additionally, experts from different disciplines might not be aware of specialized terms readily used in other disciplines. Therefore, a core challenge of the assessment process is to identify when experts from different disciplines talk about the same problem but use different terminologies. In other words, the problem is to group problem descriptions (a.k.a. issues) with the same semantic meaning but described using slightly different terminologies. In this work, we show how we employed recent advances in natural language processing, namely sentence embeddings and semantic textual similarity, to support this identification process and to bridge communication gaps in interdisciplinary teams of experts assessing the trustworthiness of an artificial intelligence system used in healthcare. △ Less

Submitted 9 August, 2022; originally announced August 2022.

arXiv:2206.08500 [pdf, other]

What do navigation agents learn about their environment?

Authors: Kshitij Dwivedi, Gemma Roig, Aniruddha Kembhavi, Roozbeh Mottaghi

Abstract: Today's state of the art visual navigation agents typically consist of large deep learning models trained end to end. Such models offer little to no interpretability about the learned skills or the actions of the agent taken in response to its environment. While past works have explored interpreting deep learning models, little attention has been devoted to interpreting embodied AI systems, which… ▽ More Today's state of the art visual navigation agents typically consist of large deep learning models trained end to end. Such models offer little to no interpretability about the learned skills or the actions of the agent taken in response to its environment. While past works have explored interpreting deep learning models, little attention has been devoted to interpreting embodied AI systems, which often involve reasoning about the structure of the environment, target characteristics and the outcome of one's actions. In this paper, we introduce the Interpretability System for Embodied agEnts (iSEE) for Point Goal and Object Goal navigation agents. We use iSEE to probe the dynamic representations produced by these agents for the presence of information about the agent as well as the environment. We demonstrate interesting insights about navigation agents using iSEE, including the ability to encode reachable locations (to avoid obstacles), visibility of the target, progress from the initial spawn location as well as the dramatic effect on the behaviors of agents when we mask out critical individual neurons. The code is available at: https://github.com/allenai/iSEE △ Less

Submitted 16 June, 2022; originally announced June 2022.

Comments: CVPR 2022

arXiv:2202.10453 [pdf, other]

Predicting emotion from music videos: exploring the relative contribution of visual and auditory information to affective responses

Authors: Phoebe Chua, Dimos Makris, Dorien Herremans, Gemma Roig, Kat Agres

Abstract: Although media content is increasingly produced, distributed, and consumed in multiple combinations of modalities, how individual modalities contribute to the perceived emotion of a media item remains poorly understood. In this paper we present MusicVideos (MuVi), a novel dataset for affective multimedia content analysis to study how the auditory and visual modalities contribute to the perceived e… ▽ More Although media content is increasingly produced, distributed, and consumed in multiple combinations of modalities, how individual modalities contribute to the perceived emotion of a media item remains poorly understood. In this paper we present MusicVideos (MuVi), a novel dataset for affective multimedia content analysis to study how the auditory and visual modalities contribute to the perceived emotion of media. The data were collected by presenting music videos to participants in three conditions: music, visual, and audiovisual. Participants annotated the music videos for valence and arousal over time, as well as the overall emotion conveyed. We present detailed descriptive statistics for key measures in the dataset and the results of feature importance analyses for each condition. Finally, we propose a novel transfer learning architecture to train Predictive models Augmented with Isolated modality Ratings (PAIR) and demonstrate the potential of isolated modality ratings for enhancing multimodal emotion recognition. Our results suggest that perceptions of arousal are influenced primarily by auditory information, while perceptions of valence are more subjective and can be influenced by both visual and auditory information. The dataset is made publicly available. △ Less

Submitted 19 February, 2022; originally announced February 2022.

Comments: 16 pages with 9 figures

arXiv:2112.14316 [pdf, other]

FRIDA -- Generative Feature Replay for Incremental Domain Adaptation

Authors: Sayan Rakshit, Anwesh Mohanty, Ruchika Chavhan, Biplab Banerjee, Gemma Roig, Subhasis Chaudhuri

Abstract: We tackle the novel problem of incremental unsupervised domain adaptation (IDA) in this paper. We assume that a labeled source domain and different unlabeled target domains are incrementally observed with the constraint that data corresponding to the current domain is only available at a time. The goal is to preserve the accuracies for all the past domains while generalizing well for the current d… ▽ More We tackle the novel problem of incremental unsupervised domain adaptation (IDA) in this paper. We assume that a labeled source domain and different unlabeled target domains are incrementally observed with the constraint that data corresponding to the current domain is only available at a time. The goal is to preserve the accuracies for all the past domains while generalizing well for the current domain. The IDA setup suffers due to the abrupt differences among the domains and the unavailability of past data including the source domain. Inspired by the notion of generative feature replay, we propose a novel framework called Feature Replay based Incremental Domain Adaptation (FRIDA) which leverages a new incremental generative adversarial network (GAN) called domain-generic auxiliary classification GAN (DGAC-GAN) for producing domain-specific feature representations seamlessly. For domain alignment, we propose a simple extension of the popular domain adversarial neural network (DANN) called DANN-IB which encourages discriminative domain-invariant and task-relevant feature learning. Experimental results on Office-Home, Office-CalTech, and DomainNet datasets confirm that FRIDA maintains superior stability-plasticity trade-off than the literature. △ Less

Submitted 11 January, 2022; v1 submitted 28 December, 2021; originally announced December 2021.

Comments: Accepted at CVIU (7th January 2022)

arXiv:2104.13714 [pdf]

The Algonauts Project 2021 Challenge: How the Human Brain Makes Sense of a World in Motion

Authors: R. M. Cichy, K. Dwivedi, B. Lahner, A. Lascelles, P. Iamshchinina, M. Graumann, A. Andonian, N. A. R. Murty, K. Kay, G. Roig, A. Oliva

Abstract: The sciences of natural and artificial intelligence are fundamentally connected. Brain-inspired human-engineered AI are now the standard for predicting human brain responses during vision, and conversely, the brain continues to inspire invention in AI. To promote even deeper connections between these fields, we here release the 2021 edition of the Algonauts Project Challenge: How the Human Brain M… ▽ More The sciences of natural and artificial intelligence are fundamentally connected. Brain-inspired human-engineered AI are now the standard for predicting human brain responses during vision, and conversely, the brain continues to inspire invention in AI. To promote even deeper connections between these fields, we here release the 2021 edition of the Algonauts Project Challenge: How the Human Brain Makes Sense of a World in Motion (http://algonauts.csail.mit.edu/). We provide whole-brain fMRI responses recorded while 10 human participants viewed a rich set of over 1,000 short video clips depicting everyday events. The goal of the challenge is to accurately predict brain responses to these video clips. The format of our challenge ensures rapid development, makes results directly comparable and transparent, and is open to all. In this way it facilitates interdisciplinary collaboration towards a common goal of understanding visual intelligence. The 2021 Algonauts Project is conducted in collaboration with the Cognitive Computational Neuroscience (CCN) conference. △ Less

Submitted 28 April, 2021; originally announced April 2021.

Comments: 5 pages, 2 figures

arXiv:2010.11188 [pdf]

AttendAffectNet: Self-Attention based Networks for Predicting Affective Responses from Movies

Authors: Ha Thi Phuong Thao, Balamurali B. T., Dorien Herremans, Gemma Roig

Abstract: In this work, we propose different variants of the self-attention based network for emotion prediction from movies, which we call AttendAffectNet. We take both audio and video into account and incorporate the relation among multiple modalities by applying self-attention mechanism in a novel manner into the extracted features for emotion prediction. We compare it to the typically temporal integrati… ▽ More In this work, we propose different variants of the self-attention based network for emotion prediction from movies, which we call AttendAffectNet. We take both audio and video into account and incorporate the relation among multiple modalities by applying self-attention mechanism in a novel manner into the extracted features for emotion prediction. We compare it to the typically temporal integration of the self-attention based model, which in our case, allows to capture the relation of temporal representations of the movie while considering the sequential dependencies of emotion responses. We demonstrate the effectiveness of our proposed architectures on the extended COGNIMUSE dataset [1], [2] and the MediaEval 2016 Emotional Impact of Movies Task [3], which consist of movies with emotion annotations. Our results show that applying the self-attention mechanism on the different audio-visual features, rather than in the time domain, is more effective for emotion prediction. Our approach is also proven to outperform many state-ofthe-art models for emotion prediction. The code to reproduce our results with the models' implementation is available at: https://github.com/ivyha010/AttendAffectNet. △ Less

Submitted 21 October, 2020; originally announced October 2020.

Comments: 8 pages, 6 figures

Journal ref: Proceedings of the International Conference on Pattern Recognition (ICPR2020)

arXiv:2008.02107 [pdf, other]

Duality Diagram Similarity: a generic framework for initialization selection in task transfer learning

Authors: Kshitij Dwivedi, Jiahui Huang, Radoslaw Martin Cichy, Gemma Roig

Abstract: In this paper, we tackle an open research question in transfer learning, which is selecting a model initialization to achieve high performance on a new task, given several pre-trained models. We propose a new highly efficient and accurate approach based on duality diagram similarity (DDS) between deep neural networks (DNNs). DDS is a generic framework to represent and compare data of different fea… ▽ More In this paper, we tackle an open research question in transfer learning, which is selecting a model initialization to achieve high performance on a new task, given several pre-trained models. We propose a new highly efficient and accurate approach based on duality diagram similarity (DDS) between deep neural networks (DNNs). DDS is a generic framework to represent and compare data of different feature dimensions. We validate our approach on the Taskonomy dataset by measuring the correspondence between actual transfer learning performance rankings on 17 taskonomy tasks and predicted rankings. Computing DDS based ranking for $17\times17$ transfers requires less than 2 minutes and shows a high correlation ($0.86$) with actual transfer learning rankings, outperforming state-of-the-art methods by a large margin ($10\%$) on the Taskonomy benchmark. We also demonstrate the robustness of our model selection approach to a new task, namely Pascal VOC semantic segmentation. Additionally, we show that our method can be applied to select the best layer locations within a DNN for transfer learning on 2D, 3D and semantic tasks on NYUv2 and Pascal VOC datasets. △ Less

Submitted 5 August, 2020; originally announced August 2020.

Comments: accepted at ECCV 2020. Code available here: https://github.com/cvai-repo/duality-diagram-similarity

arXiv:2007.00083 [pdf, other]

Using Human Psychophysics to Evaluate Generalization in Scene Text Recognition Models

Authors: Sahar Siddiqui, Elena Sizikova, Gemma Roig, Najib J. Majaj, Denis G. Pelli

Abstract: Scene text recognition models have advanced greatly in recent years. Inspired by human reading we characterize two important scene text recognition models by measuring their domains i.e. the range of stimulus images that they can read. The domain specifies the ability of readers to generalize to different word lengths, fonts, and amounts of occlusion. These metrics identify strengths and weaknesse… ▽ More Scene text recognition models have advanced greatly in recent years. Inspired by human reading we characterize two important scene text recognition models by measuring their domains i.e. the range of stimulus images that they can read. The domain specifies the ability of readers to generalize to different word lengths, fonts, and amounts of occlusion. These metrics identify strengths and weaknesses of existing models. Relative to the attention-based (Attn) model, we discover that the connectionist temporal classification (CTC) model is more robust to noise and occlusion, and better at generalizing to different word lengths. Further, we show that in both models, adding noise to training images yields better generalization to occlusion. These results demonstrate the value of testing models till they break, complementing the traditional data science focus on optimizing performance. △ Less

Submitted 30 June, 2020; originally announced July 2020.

arXiv:2001.09988 [pdf, other]

Regression-based music emotion prediction using triplet neural networks

Authors: Kin Wai Cheuk, Yin-Jyun Luo, Balamurali B, T, Gemma Roig, Dorien Herremans

Abstract: In this paper, we adapt triplet neural networks (TNNs) to a regression task, music emotion prediction. Since TNNs were initially introduced for classification, and not for regression, we propose a mechanism that allows them to provide meaningful low dimensional representations for regression tasks. We then use these new representations as the input for regression algorithms such as support vector… ▽ More In this paper, we adapt triplet neural networks (TNNs) to a regression task, music emotion prediction. Since TNNs were initially introduced for classification, and not for regression, we propose a mechanism that allows them to provide meaningful low dimensional representations for regression tasks. We then use these new representations as the input for regression algorithms such as support vector machines and gradient boosting machines. To demonstrate the TNNs' effectiveness at creating meaningful representations, we compare them to different dimensionality reduction methods on music emotion prediction, i.e., predicting valence and arousal values from musical audio signals. Our results on the DEAM dataset show that by using TNNs we achieve 90% feature dimensionality reduction with a 9% improvement in valence prediction and 4% improvement in arousal prediction with respect to our baseline models (without TNN). Our TNN method outperforms other dimensionality reduction methods such as principal component analysis (PCA) and autoencoders (AE). This shows that, in addition to providing a compact latent space representation of audio features, the proposed approach has a higher performance than the baseline models. △ Less

Submitted 21 July, 2020; v1 submitted 24 January, 2020; originally announced January 2020.

Comments: Paper Accepted i nIJCNN 2020

Journal ref: IJCNN 2020

arXiv:1911.09326 [pdf, other]

LCD: Learned Cross-Domain Descriptors for 2D-3D Matching

Authors: Quang-Hieu Pham, Mikaela Angelina Uy, Binh-Son Hua, Duc Thanh Nguyen, Gemma Roig, Sai-Kit Yeung

Abstract: In this work, we present a novel method to learn a local cross-domain descriptor for 2D image and 3D point cloud matching. Our proposed method is a dual auto-encoder neural network that maps 2D and 3D input into a shared latent space representation. We show that such local cross-domain descriptors in the shared embedding are more discriminative than those obtained from individual training in 2D an… ▽ More In this work, we present a novel method to learn a local cross-domain descriptor for 2D image and 3D point cloud matching. Our proposed method is a dual auto-encoder neural network that maps 2D and 3D input into a shared latent space representation. We show that such local cross-domain descriptors in the shared embedding are more discriminative than those obtained from individual training in 2D and 3D domains. To facilitate the training process, we built a new dataset by collecting $\approx 1.4$ millions of 2D-3D correspondences with various lighting conditions and settings from publicly available RGB-D scenes. Our descriptor is evaluated in three main experiments: 2D-3D matching, cross-domain retrieval, and sparse-to-dense depth estimation. Experimental results confirm the robustness of our approach as well as its competitive performance not only in solving cross-domain tasks but also in being able to generalize to solve sole 2D and 3D tasks. Our dataset and code are released publicly at \url{https://hkust-vgd.github.io/lcd}. △ Less

Submitted 21 November, 2019; originally announced November 2019.

Comments: Accepted to AAAI 2020 (Oral)

arXiv:1910.10056 [pdf, other]

Predictive Coding Networks Meet Action Recognition

Authors: Xia Huang, Hossein Mousavi, Gemma Roig

Abstract: Action recognition is a key problem in computer vision that labels videos with a set of predefined actions. Capturing both, semantic content and motion, along the video frames is key to achieve high accuracy performance on this task. Most of the state-of-the-art methods rely on RGB frames for extracting the semantics and pre-computed optical flow fields as a motion cue. Then, both are combined usi… ▽ More Action recognition is a key problem in computer vision that labels videos with a set of predefined actions. Capturing both, semantic content and motion, along the video frames is key to achieve high accuracy performance on this task. Most of the state-of-the-art methods rely on RGB frames for extracting the semantics and pre-computed optical flow fields as a motion cue. Then, both are combined using deep neural networks. Yet, it has been argued that such models are not able to leverage the motion information extracted from the optical flow, but instead the optical flow allows for better recognition of people and objects in the video. This urges the need to explore different cues or models that can extract motion in a more informative fashion. To tackle this issue, we propose to explore the predictive coding network, so called PredNet, a recurrent neural network that propagates predictive coding errors across layers and time steps. We analyze whether PredNet can better capture motions in videos by estimating over time the representations extracted from pre-trained networks for action recognition. In this way, the model only relies on the video frames, and does not need pre-processed optical flows as input. We report the effectiveness of our proposed model on UCF101 and HMDB51 datasets. △ Less

Submitted 22 October, 2019; originally announced October 2019.

Comments: 6 pages

arXiv:1910.01463 [pdf, other]

Latent space representation for multi-target speaker detection and identification with a sparse dataset using Triplet neural networks

Authors: Kin Wai Cheuk, Balamurali B. T., Gemma Roig, Dorien Herremans

Abstract: We present an approach to tackle the speaker recognition problem using Triplet Neural Networks. Currently, the $i$-vector representation with probabilistic linear discriminant analysis (PLDA) is the most commonly used technique to solve this problem, due to high classification accuracy with a relatively short computation time. In this paper, we explore a neural network approach, namely Triplet Neu… ▽ More We present an approach to tackle the speaker recognition problem using Triplet Neural Networks. Currently, the $i$-vector representation with probabilistic linear discriminant analysis (PLDA) is the most commonly used technique to solve this problem, due to high classification accuracy with a relatively short computation time. In this paper, we explore a neural network approach, namely Triplet Neural Networks (TNNs), to built a latent space for different classifiers to solve the Multi-Target Speaker Detection and Identification Challenge Evaluation 2018 (MCE 2018) dataset. This training set contains $i$-vectors from 3,631 speakers, with only 3 samples for each speaker, thus making speaker recognition a challenging task. When using the train and development set for training both the TNN and baseline model (i.e., similarity evaluation directly on the $i$-vector representation), our proposed model outperforms the baseline by 23%. When reducing the training data to only using the train set, our method results in 309 confusions for the Multi-target speaker identification task, which is 46% better than the baseline model. These results show that the representational power of TNNs is especially evident when training on small datasets with few instances available per class. △ Less

Submitted 3 October, 2019; v1 submitted 1 October, 2019; originally announced October 2019.

Comments: Accepted for ASRU 2019

MSC Class: 68T10; 68Txx

Journal ref: Proceedings of IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019). Singapore. 2019

arXiv:1909.06957 [pdf, other]

Multimodal Deep Models for Predicting Affective Responses Evoked by Movies

Authors: Ha Thi Phuong Thao, Dorien Herremans, Gemma Roig

Abstract: The goal of this study is to develop and analyze multimodal models for predicting experienced affective responses of viewers watching movie clips. We develop hybrid multimodal prediction models based on both the video and audio of the clips. For the video content, we hypothesize that both image content and motion are crucial features for evoked emotion prediction. To capture such information, we e… ▽ More The goal of this study is to develop and analyze multimodal models for predicting experienced affective responses of viewers watching movie clips. We develop hybrid multimodal prediction models based on both the video and audio of the clips. For the video content, we hypothesize that both image content and motion are crucial features for evoked emotion prediction. To capture such information, we extract features from RGB frames and optical flow using pre-trained neural networks. For the audio model, we compute an enhanced set of low-level descriptors including intensity, loudness, cepstrum, linear predictor coefficients, pitch and voice quality. Both visual and audio features are then concatenated to create audio-visual features, which are used to predict the evoked emotion. To classify the movie clips into the corresponding affective response categories, we propose two approaches based on deep neural network models. The first one is based on fully connected layers without memory on the time component, the second incorporates the sequential dependency with a long short-term memory recurrent neural network (LSTM). We perform a thorough analysis of the importance of each feature set. Our experiments reveal that in our set-up, predicting emotions at each time step independently gives slightly better accuracy performance than with the LSTM. Interestingly, we also observe that the optical flow is more informative than the RGB in videos, and overall, models using audio features are more accurate than those based on video features when making the final prediction of evoked emotions. △ Less

Submitted 17 September, 2019; v1 submitted 15 September, 2019; originally announced September 2019.

Comments: 10 pages, 7 figures, Preprint accepted for publication in the Proceedings of the 2nd International Workshop on Computer Vision for Physiological Measurement as part of ICCV. Seoul, South Korea. 2019

MSC Class: 97R40; 68T45; 68Txx; 92B20

Journal ref: Proceedings of the 2nd International Workshop on Computer Vision for Physiological Measurement as part of ICCV. Seoul, South Korea. 2019

arXiv:1905.05675 [pdf, other]

The Algonauts Project: A Platform for Communication between the Sciences of Biological and Artificial Intelligence

Authors: Radoslaw Martin Cichy, Gemma Roig, Alex Andonian, Kshitij Dwivedi, Benjamin Lahner, Alex Lascelles, Yalda Mohsenzadeh, Kandan Ramakrishnan, Aude Oliva

Abstract: In the last decade, artificial intelligence (AI) models inspired by the brain have made unprecedented progress in performing real-world perceptual tasks like object classification and speech recognition. Recently, researchers of natural intelligence have begun using those AI models to explore how the brain performs such tasks. These developments suggest that future progress will benefit from incre… ▽ More In the last decade, artificial intelligence (AI) models inspired by the brain have made unprecedented progress in performing real-world perceptual tasks like object classification and speech recognition. Recently, researchers of natural intelligence have begun using those AI models to explore how the brain performs such tasks. These developments suggest that future progress will benefit from increased interaction between disciplines. Here we introduce the Algonauts Project as a structured and quantitative communication channel for interdisciplinary interaction between natural and artificial intelligence researchers. The project's core is an open challenge with a quantitative benchmark whose goal is to account for brain data through computational models. This project has the potential to provide better models of natural intelligence and to gather findings that advance AI. The 2019 Algonauts Project focuses on benchmarking computational models predicting human brain activity when people look at pictures of objects. The 2019 edition of the Algonauts Project is available online: http://algonauts.csail.mit.edu/. △ Less

Submitted 14 May, 2019; originally announced May 2019.

Comments: 4 pages, 2 figures

arXiv:1904.11740 [pdf, other]

Representation Similarity Analysis for Efficient Task taxonomy & Transfer Learning

Authors: Kshitij Dwivedi, Gemma Roig

Abstract: Transfer learning is widely used in deep neural network models when there are few labeled examples available. The common approach is to take a pre-trained network in a similar task and finetune the model parameters. This is usually done blindly without a pre-selection from a set of pre-trained models, or by finetuning a set of models trained on different tasks and selecting the best performing one… ▽ More Transfer learning is widely used in deep neural network models when there are few labeled examples available. The common approach is to take a pre-trained network in a similar task and finetune the model parameters. This is usually done blindly without a pre-selection from a set of pre-trained models, or by finetuning a set of models trained on different tasks and selecting the best performing one by cross-validation. We address this problem by proposing an approach to assess the relationship between visual tasks and their task-specific models. Our method uses Representation Similarity Analysis (RSA), which is commonly used to find a correlation between neuronal responses from brain data and models. With RSA we obtain a similarity score among tasks by computing correlations between models trained on different tasks. Our method is efficient as it requires only pre-trained models, and a few images with no further training. We demonstrate the effectiveness and efficiency of our method for generating task taxonomy on Taskonomy dataset. We next evaluate the relationship of RSA with the transfer learning performance on Taskonomy tasks and a new task: Pascal VOC semantic segmentation. Our results reveal that models trained on tasks with higher similarity score show higher transfer learning performance. Surprisingly, the best transfer learning result for Pascal VOC semantic segmentation is not obtained from the pre-trained model on semantic segmentation, probably due to the domain differences, and our method successfully selects the high performing models. △ Less

Submitted 26 April, 2019; originally announced April 2019.

Comments: Accepted at CVPR 2019. Code available at https://github.com/kshitijd20/RSA-CVPR19-release

arXiv:1904.09764 [pdf, other]

Deep Anchored Convolutional Neural Networks

Authors: Jiahui Huang, Kshitij Dwivedi, Gemma Roig

Abstract: Convolutional Neural Networks (CNNs) have been proven to be extremely successful at solving computer vision tasks. State-of-the-art methods favor such deep network architectures for its accuracy performance, with the cost of having massive number of parameters and high weights redundancy. Previous works have studied how to prune such CNNs weights. In this paper, we go to another extreme and analyz… ▽ More Convolutional Neural Networks (CNNs) have been proven to be extremely successful at solving computer vision tasks. State-of-the-art methods favor such deep network architectures for its accuracy performance, with the cost of having massive number of parameters and high weights redundancy. Previous works have studied how to prune such CNNs weights. In this paper, we go to another extreme and analyze the performance of a network stacked with a single convolution kernel across layers, as well as other weights sharing techniques. We name it Deep Anchored Convolutional Neural Network (DACNN). Sharing the same kernel weights across layers allows to reduce the model size tremendously, more precisely, the network is compressed in memory by a factor of L, where L is the desired depth of the network, disregarding the fully connected layer for prediction. The number of parameters in DACNN barely increases as the network grows deeper, which allows us to build deep DACNNs without any concern about memory costs. We also introduce a partial shared weights network (DACNN-mix) as well as an easy-plug-in module, coined regulators, to boost the performance of our architecture. We validated our idea on 3 datasets: CIFAR-10, CIFAR-100 and SVHN. Our results show that we can save massive amounts of memory with our model, while maintaining a high accuracy performance. △ Less

Submitted 22 April, 2019; originally announced April 2019.

Comments: This paper is accepted to 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

arXiv:1904.00699 [pdf, other]

JSIS3D: Joint Semantic-Instance Segmentation of 3D Point Clouds with Multi-Task Pointwise Networks and Multi-Value Conditional Random Fields

Authors: Quang-Hieu Pham, Duc Thanh Nguyen, Binh-Son Hua, Gemma Roig, Sai-Kit Yeung

Abstract: Deep learning techniques have become the to-go models for most vision-related tasks on 2D images. However, their power has not been fully realised on several tasks in 3D space, e.g., 3D scene understanding. In this work, we jointly address the problems of semantic and instance segmentation of 3D point clouds. Specifically, we develop a multi-task pointwise network that simultaneously performs two… ▽ More Deep learning techniques have become the to-go models for most vision-related tasks on 2D images. However, their power has not been fully realised on several tasks in 3D space, e.g., 3D scene understanding. In this work, we jointly address the problems of semantic and instance segmentation of 3D point clouds. Specifically, we develop a multi-task pointwise network that simultaneously performs two tasks: predicting the semantic classes of 3D points and embedding the points into high-dimensional vectors so that points of the same object instance are represented by similar embeddings. We then propose a multi-value conditional random field model to incorporate the semantic and instance labels and formulate the problem of semantic and instance segmentation as jointly optimising labels in the field model. The proposed method is thoroughly evaluated and compared with existing methods on different indoor scene datasets including S3DIS and SceneNN. Experimental results showed the robustness of the proposed joint semantic-instance segmentation scheme over its single components. Our method also achieved state-of-the-art performance on semantic segmentation. △ Less

Submitted 5 April, 2019; v1 submitted 1 April, 2019; originally announced April 2019.

Comments: CVPR 2019 (Oral). More information at https://pqhieu.github.io/cvpr19.html

arXiv:1808.07632 [pdf, other]

DOPING: Generative Data Augmentation for Unsupervised Anomaly Detection with GAN

Authors: Swee Kiat Lim, Yi Loo, Ngoc-Trung Tran, Ngai-Man Cheung, Gemma Roig, Yuval Elovici

Abstract: Recently, the introduction of the generative adversarial network (GAN) and its variants has enabled the generation of realistic synthetic samples, which has been used for enlarging training sets. Previous work primarily focused on data augmentation for semi-supervised and supervised tasks. In this paper, we instead focus on unsupervised anomaly detection and propose a novel generative data augment… ▽ More Recently, the introduction of the generative adversarial network (GAN) and its variants has enabled the generation of realistic synthetic samples, which has been used for enlarging training sets. Previous work primarily focused on data augmentation for semi-supervised and supervised tasks. In this paper, we instead focus on unsupervised anomaly detection and propose a novel generative data augmentation framework optimized for this task. In particular, we propose to oversample infrequent normal samples - normal samples that occur with small probability, e.g., rare normal events. We show that these samples are responsible for false positives in anomaly detection. However, oversampling of infrequent normal samples is challenging for real-world high-dimensional data with multimodal distributions. To address this challenge, we propose to use a GAN variant known as the adversarial autoencoder (AAE) to transform the high-dimensional multimodal data distributions into low-dimensional unimodal latent distributions with well-defined tail probability. Then, we systematically oversample at the `edge' of the latent distributions to increase the density of infrequent normal samples. We show that our oversampling pipeline is a unified one: it is generally applicable to datasets with different complex data distributions. To the best of our knowledge, our method is the first data augmentation technique focused on improving performance in unsupervised anomaly detection. We validate our method by demonstrating consistent improvements across several real-world datasets. △ Less

Submitted 23 August, 2018; v1 submitted 23 August, 2018; originally announced August 2018.

Comments: Published as a conference paper at ICDM 2018 (IEEE International Conference on Data Mining)

arXiv:1706.08616 [pdf, other]

Do Deep Neural Networks Suffer from Crowding?

Authors: Anna Volokitin, Gemma Roig, Tomaso Poggio

Abstract: Crowding is a visual effect suffered by humans, in which an object that can be recognized in isolation can no longer be recognized when other objects, called flankers, are placed close to it. In this work, we study the effect of crowding in artificial Deep Neural Networks for object recognition. We analyze both standard deep convolutional neural networks (DCNNs) as well as a new version of DCNNs w… ▽ More Crowding is a visual effect suffered by humans, in which an object that can be recognized in isolation can no longer be recognized when other objects, called flankers, are placed close to it. In this work, we study the effect of crowding in artificial Deep Neural Networks for object recognition. We analyze both standard deep convolutional neural networks (DCNNs) as well as a new version of DCNNs which is 1) multi-scale and 2) with size of the convolution filters change depending on the eccentricity wrt to the center of fixation. Such networks, that we call eccentricity-dependent, are a computational model of the feedforward path of the primate visual cortex. Our results reveal that the eccentricity-dependent model, trained on target objects in isolation, can recognize such targets in the presence of flankers, if the targets are near the center of the image, whereas DCNNs cannot. Also, for all tested networks, when trained on targets in isolation, we find that recognition accuracy of the networks decreases the closer the flankers are to the target and the more flankers there are. We find that visual similarity between the target and flankers also plays a role and that pooling in early layers of the network leads to more crowding. Additionally, we show that incorporating the flankers into the images of the training set does not improve performance with crowding. △ Less

Submitted 26 June, 2017; originally announced June 2017.

Comments: CBMM memo

Report number: 69

arXiv:1611.04353 [pdf, other]

Herding Generalizes Diverse M -Best Solutions

Authors: Ece Ozkan, Gemma Roig, Orcun Goksel, Xavier Boix

Abstract: We show that the algorithm to extract diverse M -solutions from a Conditional Random Field (called divMbest [1]) takes exactly the form of a Herding procedure [2], i.e. a deterministic dynamical system that produces a sequence of hypotheses that respect a set of observed moment constraints. This generalization enables us to invoke properties of Herding that show that divMbest enforces implausible… ▽ More We show that the algorithm to extract diverse M -solutions from a Conditional Random Field (called divMbest [1]) takes exactly the form of a Herding procedure [2], i.e. a deterministic dynamical system that produces a sequence of hypotheses that respect a set of observed moment constraints. This generalization enables us to invoke properties of Herding that show that divMbest enforces implausible constraints which may yield wrong assumptions for some problem settings. Our experiments in semantic segmentation demonstrate that seeing divMbest as an instance of Herding leads to better alternatives for the implausible constraints of divMbest. △ Less

Submitted 30 January, 2017; v1 submitted 14 November, 2016; originally announced November 2016.

Comments: 8 pages, 2 algorithms, 3 figures

arXiv:1511.06292 [pdf, other]

Foveation-based Mechanisms Alleviate Adversarial Examples

Authors: Yan Luo, Xavier Boix, Gemma Roig, Tomaso Poggio, Qi Zhao

Abstract: We show that adversarial examples, i.e., the visually imperceptible perturbations that result in Convolutional Neural Networks (CNNs) fail, can be alleviated with a mechanism based on foveations---applying the CNN in different image regions. To see this, first, we report results in ImageNet that lead to a revision of the hypothesis that adversarial perturbations are a consequence of CNNs acting as… ▽ More We show that adversarial examples, i.e., the visually imperceptible perturbations that result in Convolutional Neural Networks (CNNs) fail, can be alleviated with a mechanism based on foveations---applying the CNN in different image regions. To see this, first, we report results in ImageNet that lead to a revision of the hypothesis that adversarial perturbations are a consequence of CNNs acting as a linear classifier: CNNs act locally linearly to changes in the image regions with objects recognized by the CNN, and in other regions the CNN may act non-linearly. Then, we corroborate that when the neural responses are linear, applying the foveation mechanism to the adversarial example tends to significantly reduce the effect of the perturbation. This is because, hypothetically, the CNNs for ImageNet are robust to changes of scale and translation of the object produced by the foveation, but this property does not generalize to transformations of the perturbation. As a result, the accuracy after a foveation is almost the same as the accuracy of the CNN without the adversarial perturbation, even if the adversarial perturbation is calculated taking into account a foveation. △ Less

Submitted 19 January, 2016; v1 submitted 19 November, 2015; originally announced November 2015.

arXiv:1408.6963 [pdf, other]

Comment on "Ensemble Projection for Semi-supervised Image Classification"

Authors: Xavier Boix, Gemma Roig, Luc Van Gool

Abstract: In a series of papers by Dai and colleagues [1,2], a feature map (or kernel) was introduced for semi- and unsupervised learning. This feature map is build from the output of an ensemble of classifiers trained without using the ground-truth class labels. In this critique, we analyze the latest version of this series of papers, which is called Ensemble Projections [2]. We show that the results repor… ▽ More In a series of papers by Dai and colleagues [1,2], a feature map (or kernel) was introduced for semi- and unsupervised learning. This feature map is build from the output of an ensemble of classifiers trained without using the ground-truth class labels. In this critique, we analyze the latest version of this series of papers, which is called Ensemble Projections [2]. We show that the results reported in [2] were not well conducted, and that Ensemble Projections performs poorly for semi-supervised learning. △ Less

Submitted 29 August, 2014; originally announced August 2014.

arXiv:1309.3848 [pdf, other]

SEEDS: Superpixels Extracted via Energy-Driven Sampling

Authors: Michael Van den Bergh, Xavier Boix, Gemma Roig, Luc Van Gool

Abstract: Superpixel algorithms aim to over-segment the image by grouping pixels that belong to the same object. Many state-of-the-art superpixel algorithms rely on minimizing objective functions to enforce color ho- mogeneity. The optimization is accomplished by sophis- ticated methods that progressively build the superpix- els, typically by adding cuts or growing superpixels. As a result, they are computa… ▽ More Superpixel algorithms aim to over-segment the image by grouping pixels that belong to the same object. Many state-of-the-art superpixel algorithms rely on minimizing objective functions to enforce color ho- mogeneity. The optimization is accomplished by sophis- ticated methods that progressively build the superpix- els, typically by adding cuts or growing superpixels. As a result, they are computationally too expensive for real-time applications. We introduce a new approach based on a simple hill-climbing optimization. Starting from an initial superpixel partitioning, it continuously refines the superpixels by modifying the boundaries. We define a robust and fast to evaluate energy function, based on enforcing color similarity between the bound- aries and the superpixel color histogram. In a series of experiments, we show that we achieve an excellent com- promise between accuracy and efficiency. We are able to achieve a performance comparable to the state-of- the-art, but in real-time on a single Intel i7 CPU at 2.8GHz. △ Less

Submitted 16 September, 2013; originally announced September 2013.

arXiv:1307.5161 [pdf, other]

Random Binary Mappings for Kernel Learning and Efficient SVM

Authors: Gemma Roig, Xavier Boix, Luc Van Gool

Abstract: Support Vector Machines (SVMs) are powerful learners that have led to state-of-the-art results in various computer vision problems. SVMs suffer from various drawbacks in terms of selecting the right kernel, which depends on the image descriptors, as well as computational and memory efficiency. This paper introduces a novel kernel, which serves such issues well. The kernel is learned by exploiting… ▽ More Support Vector Machines (SVMs) are powerful learners that have led to state-of-the-art results in various computer vision problems. SVMs suffer from various drawbacks in terms of selecting the right kernel, which depends on the image descriptors, as well as computational and memory efficiency. This paper introduces a novel kernel, which serves such issues well. The kernel is learned by exploiting a large amount of low-complex, randomized binary mappings of the input feature. This leads to an efficient SVM, while also alleviating the task of kernel selection. We demonstrate the capabilities of our kernel on 6 standard vision benchmarks, in which we combine several common image descriptors, namely histograms (Flowers17 and Daimler), attribute-like descriptors (UCI, OSR, and a-VOC08), and Sparse Quantization (ImageNet). Results show that our kernel learning adapts well to the different descriptors types, achieving the performance of the kernels specifically tuned for each image descriptor, and with similar evaluation cost as efficient SVM methods. △ Less

Submitted 28 March, 2014; v1 submitted 19 July, 2013; originally announced July 2013.

Showing 1–35 of 35 results for author: Roig, G