Search | arXiv e-print repository

Synergy and Diversity in CLIP: Enhancing Performance Through Adaptive Backbone Ensembling

Authors: Cristian Rodriguez-Opazo, Ehsan Abbasnejad, Damien Teney, Edison Marrese-Taylor, Hamed Damirchi, Anton van den Hengel

Abstract: Contrastive Language-Image Pretraining (CLIP) stands out as a prominent method for image representation learning. Various architectures, from vision transformers (ViTs) to convolutional networks (ResNets), have been trained with CLIP to serve as general solutions to diverse vision tasks. This paper explores the differences across various CLIP-trained vision backbones. Despite using the same data and training objective, we find that these architectures have notably different representations, different classification performance across datasets, and different robustness properties to certain types of image perturbations. Our findings indicate a remarkable possible synergy across backbones by leveraging their respective strengths. In principle, classification accuracy could be improved by over 40 percentage points with an informed selection of the optimal backbone per test example. Using this insight, we develop a straightforward yet powerful approach to adaptively ensemble multiple backbones. The approach uses as few as one labeled example per class to tune the adaptive combination of backbones. On a large collection of datasets, the method achieves a remarkable increase in accuracy of up to 39.1% over the best single backbone, well beyond traditional ensembles.

Submitted 27 May, 2024; originally announced May 2024.

Comments: arXiv admin note: substantial text overlap with arXiv:2312.14400

arXiv:2312.14400 [pdf, other]

Unveiling Backbone Effects in CLIP: Exploring Representational Synergies and Variances

Authors: Cristian Rodriguez-Opazo, Edison Marrese-Taylor, Ehsan Abbasnejad, Hamed Damirchi, Ignacio M. Jara, Felipe Bravo-Marquez, Anton van den Hengel

Abstract: Contrastive Language-Image Pretraining (CLIP) stands out as a prominent method for image representation learning. Various neural architectures, spanning Transformer-based models like Vision Transformers (ViTs) to Convolutional Networks (ConvNets) like ResNets, are trained with CLIP and serve as universal backbones across diverse vision tasks. Despite utilizing the same data and training objectives, the effectiveness of representations learned by these architectures raises a critical question. Our investigation explores the differences in CLIP performance among these backbone architectures, revealing significant disparities in their classifications. Notably, normalizing these representations results in substantial performance variations. Our findings showcase a remarkable possible synergy between backbone predictions that could reach an improvement of over 20% through informed selection of the appropriate backbone. Moreover, we propose a simple yet effective approach to combine predictions from multiple backbones, leading to a notable performance boost of up to 6.34%. We will release the code for reproducing the results.

Submitted 21 December, 2023; originally announced December 2023.

arXiv:2311.17949 [pdf, other]

Zero-shot Retrieval: Augmenting Pre-trained Models with Search Engines

Authors: Hamed Damirchi, Cristian Rodríguez-Opazo, Ehsan Abbasnejad, Damien Teney, Javen Qinfeng Shi, Stephen Gould, Anton van den Hengel

Abstract: Large pre-trained models can dramatically reduce the amount of task-specific data required to solve a problem, but they often fail to capture domain-specific nuances out of the box. The Web likely contains the information necessary to excel on any specific application, but identifying the right data a priori is challenging. This paper shows how to leverage recent advances in NLP and multi-modal learning to augment a pre-trained model with search engine retrieval. We propose to retrieve useful data from the Web at test time based on test cases that the model is uncertain about. Different from existing retrieval-augmented approaches, we then update the model to address this underlying uncertainty. We demonstrate substantial improvements in zero-shot performance, e.g. a remarkable increase of 15 percentage points in accuracy on the Stanford Cars and Flowers datasets. We also present extensive experiments that explore the impact of noisy retrieval and different learning strategies.

Submitted 29 November, 2023; originally announced November 2023.

arXiv:2307.03786 [pdf, other]

Context-aware Pedestrian Trajectory Prediction with Multimodal Transformer

Authors: Haleh Damirchi, Michael Greenspan, Ali Etemad

Abstract: We propose a novel solution for predicting future trajectories of pedestrians. Our method uses a multimodal encoder-decoder transformer architecture, which takes as input both pedestrian locations and ego-vehicle speeds. Notably, our decoder predicts the entire future trajectory in a single pass and does not perform one-step-ahead prediction, which makes the method effective for embedded edge deployment. We perform detailed experiments and evaluate our method on two popular datasets, PIE and JAAD. Quantitative results demonstrate the superiority of our proposed model over the current state of the art: it consistently achieves the lowest error for three time horizons of 0.5, 1.0 and 1.5 seconds. Moreover, the proposed method is significantly faster than the state of the art on both PIE and JAAD. Lastly, ablation experiments demonstrate the impact of the key multimodal configuration of our method.

Submitted 7 July, 2023; originally announced July 2023.

arXiv:2306.01316 [pdf, other]

Independent Modular Networks

Authors: Hamed Damirchi, Forest Agostinelli, Pooyan Jamshidi

Abstract: Monolithic neural networks that make use of a single set of weights to learn useful representations for downstream tasks explicitly dismiss the compositional nature of data generation processes. This characteristic exists in data where every instance can be regarded as the combination of an identity concept, such as the shape of an object, combined with modifying concepts, such as orientation, color, and size. The dismissal of compositionality is especially detrimental in robotics, where state estimation relies heavily on the compositional nature of physical mechanisms (e.g., rotations and transformations) to model interactions. To accommodate this data characteristic, modular networks have been proposed. However, a lack of structure in each module's role, and modular network-specific issues such as module collapse have restricted their usability. We propose a modular network architecture that accommodates the mentioned decompositional concept by proposing a unique structure that splits the modules into predetermined roles. Additionally, we provide regularizations that improve the resiliency of the modular network to the problem of module collapse while improving the decomposition accuracy of the model.

Submitted 2 June, 2023; originally announced June 2023.

Comments: ICRA23 RAP4Robots Workshop

arXiv:2301.13090 [pdf, other]

doi 10.1016/j.cviu.2023.103722

Action Capsules: Human Skeleton Action Recognition

Authors: Ali Farajzadeh Bavil, Hamed Damirchi, Hamid D. Taghirad

Abstract: Due to the compact and rich high-level representations offered, skeleton-based human action recognition has recently become a highly active research topic. Previous studies have demonstrated that investigating joint relationships in spatial and temporal dimensions provides effective information critical to action recognition. However, effectively encoding global dependencies of joints during spatio-temporal feature extraction is still challenging. In this paper, we introduce Action Capsule, which identifies action-related key joints by considering the latent correlation of joints in a skeleton sequence. We show that, during inference, our end-to-end network pays attention to a set of joints specific to each action, whose encoded spatio-temporal features are aggregated to recognize the action. Additionally, the use of multiple stages of action capsules enhances the ability of the network to classify similar actions. Consequently, our network outperforms the state-of-the-art approaches on the N-UCLA dataset and obtains competitive results on the NTURGBD dataset. At the same time, our approach has significantly lower computational requirements based on GFLOPs measurements.

Submitted 30 January, 2023; originally announced January 2023.

Comments: 11 pages, 11 figures

Journal ref: Computer Vision and Image Understanding Volume 233, August 2023, 103722

arXiv:2202.09942 [pdf, other]

Multiscale Crowd Counting and Localization By Multitask Point Supervision

Authors: Mohsen Zand, Haleh Damirchi, Andrew Farley, Mahdiyar Molahasani, Michael Greenspan, Ali Etemad

Abstract: We propose a multitask approach for crowd counting and person localization in a unified framework. As the detection and localization tasks are well-correlated and can be jointly tackled, our model benefits from a multitask solution by learning multiscale representations of encoded crowd images, and subsequently fusing them. In contrast to the relatively more popular density-based methods, our model uses point supervision to allow for crowd locations to be accurately identified. We test our model on two popular crowd counting datasets, ShanghaiTech A and B, and demonstrate that our method achieves strong results on both counting and localization tasks, with MSE measures of 110.7 and 15.0 for crowd counting and AP measures of 0.71 and 0.75 for localization, on ShanghaiTech A and B respectively. Our detailed ablation experiments show the impact of our multiscale approach as well as the effectiveness of the fusion module embedded in our network. Our code is available at: https://github.com/RCVLab-AiimLab/crowd_counting.

Submitted 20 February, 2022; originally announced February 2022.

Comments: 4 pages + references, 3 figures, 2 tables, Accepted by ICASSP 2022 Conference

arXiv:2107.00366 [pdf, other]

A Consistency-Based Loss for Deep Odometry Through Uncertainty Propagation

Authors: Hamed Damirchi, Rooholla Khorrambakht, Hamid D. Taghirad, Behzad Moshiri

Abstract: The incremental poses computed through odometry can be integrated over time to calculate the pose of a device with respect to an initial location. The resulting global pose may be used to formulate a second, consistency-based loss term in a deep odometry setting. In such cases where multiple losses are imposed on a network, the uncertainty over each output can be derived to weigh the different loss terms in a maximum likelihood setting. However, when imposing a constraint on the integrated transformation, because only odometry is estimated at each iteration of the algorithm, there is no information about the uncertainty associated with the global pose to weigh the global loss term. In this paper, we associate uncertainties with the output poses of a deep odometry network and propagate the uncertainties through each iteration. Our goal is to use the estimated covariance matrix at each incremental step to weigh the loss at the corresponding step while weighing the global loss term using the compounded uncertainty. This formulation provides an adaptive method to weigh the incremental and integrated loss terms against each other, noting the increase in uncertainty as new estimates arrive. We provide quantitative and qualitative analysis of pose estimates and show that our method surpasses the accuracy of the state-of-the-art Visual Odometry approaches. Then, uncertainty estimates are evaluated and comparisons against fixed baselines are provided. Finally, the uncertainty values are used in a realistic example to show the effectiveness of uncertainty quantification for localization.

Submitted 1 July, 2021; originally announced July 2021.

Comments: 8 pages, 5 figures, 3 tables

ACM Class: I.2.9; I.2.10; I.5.1

arXiv:2101.07061 [pdf, other]

Deep Inertial Odometry with Accurate IMU Preintegration

Authors: Rooholla Khorrambakht, Chris Xiaoxuan Lu, Hamed Damirchi, Zhenghua Chen, Zhengguo Li

Abstract: Inertial Measurement Units (IMUs) are interoceptive modalities that provide ego-motion measurements independent of environmental factors. They are widely adopted in various autonomous systems. Motivated by the limitations in processing the noisy measurements from these sensors using their mathematical models, researchers have recently proposed various deep learning architectures to estimate inertial odometry in an end-to-end manner. Nevertheless, the high-frequency and redundant measurements from IMUs lead to long raw sequences to be processed. In this study, we aim to investigate the efficacy of accurate preintegration as a more realistic solution to the IMU motion model for deep inertial odometry (DIO). The resultant DIO is a fusion of model-driven and data-driven approaches. Accurate IMU preintegration has the potential to outperform the numerical approximation of the continuous IMU model used in existing DIOs. Experimental results validate the proposed DIO.

Submitted 18 January, 2021; originally announced January 2021.

arXiv:2011.08634 [pdf, other]

Exploring Self-Attention for Visual Odometry

Authors: Hamed Damirchi, Rooholla Khorrambakht, Hamid D. Taghirad

Abstract: Visual odometry networks commonly use pretrained optical flow networks in order to derive the ego-motion between consecutive frames. The features extracted by these networks represent the motion of all the pixels between frames. However, due to the existence of dynamic objects and texture-less surfaces in the scene, the motion information for every image region might not be reliable for inferring odometry, since dynamic objects are ineffective for deriving the incremental changes in position. Recent works in this area lack attention mechanisms in their structures to facilitate dynamic reweighting of the feature maps for extracting more refined egomotion information. In this paper, we explore the effectiveness of self-attention in visual odometry. We report qualitative and quantitative results against the SOTA methods. Furthermore, saliency-based studies alongside specially designed experiments are utilized to investigate the effect of self-attention on VO. Our experiments show that using self-attention allows for the extraction of better features while achieving a better odometry performance compared to networks that lack such structures.

Submitted 17 November, 2020; originally announced November 2020.

Comments: 8 pages, 7 figures, 1 table

arXiv:2007.03063 [pdf, other]

ARC-Net: Activity Recognition Through Capsules

Authors: Hamed Damirchi, Rooholla Khorrambakht, Hamid Taghirad

Abstract: Human Activity Recognition (HAR) is a challenging problem that needs more advanced solutions than handcrafted features to achieve a desirable performance. Deep learning has been proposed as a solution to obtain more accurate HAR systems that are robust against noise. In this paper, we introduce ARC-Net and propose the utilization of capsules to fuse the information from multiple inertial measurement units (IMUs) to predict the activity performed by the subject. We hypothesize that this network will be able to tune out the unnecessary information and will be able to make more accurate decisions through the iterative mechanism embedded in capsule networks. We provide heatmaps of the priors, learned by the network, to visualize the utilization of each of the data sources by the trained network. By using the proposed network, we were able to increase the accuracy of the state-of-the-art approaches by 2%. Furthermore, we investigate the directionality of the confusion matrices of our results and discuss the specificity of the activities based on the provided data.

Submitted 6 July, 2020; originally announced July 2020.

Comments: 6 pages, 6 figures

arXiv:2007.02929 [pdf, other]

IMU Preintegrated Features for Efficient Deep Inertial Odometry

Authors: R. Khorrambakht, H. Damirchi, H. D. Taghirad

Abstract: MEMS Inertial Measurement Units (IMUs) as ubiquitous proprioceptive motion measurement devices are available on various everyday gadgets and robotic platforms. Nevertheless, the direct inference of geometrical transformations or odometry based on these data alone is a challenging task. This is due to the hard-to-model imperfections and high noise characteristics of the sensor, which has motivated research in formulating the system as an end-to-end learning problem, where the motion patterns of the agent are exploited to facilitate better odometry estimates. However, this benefit comes at the cost of high computation and memory requirements, which makes deep inertial odometry unsuitable for low-power and edge applications. This paper attempts to address this conflict by proposing the IMU preintegrated features as a replacement for the raw IMU data in deep inertial odometry. Exploiting the manifold structure of the IMU motion model, these features provide a temporally compressed motion representation that preserves important geometrical information. We demonstrate the effectiveness and efficiency of this approach for the task of inertial odometry on two applications of pedestrian motion estimation and autonomous vehicles. We show a performance improvement compared to raw inputs while reducing the computational burdens. Additionally, we demonstrate the efficiency of this approach through an embedded implementation on a resource-constrained microcontroller.

Submitted 18 March, 2022; v1 submitted 6 July, 2020; originally announced July 2020.

ACM Class: C.3; C.4; H.1; I.2.9; I.5.4; J.3; J.2

Showing 1–12 of 12 results for author: Damirchi, H