-
Synergy and Diversity in CLIP: Enhancing Performance Through Adaptive Backbone Ensembling
Authors:
Cristian Rodriguez-Opazo,
Ehsan Abbasnejad,
Damien Teney,
Edison Marrese-Taylor,
Hamed Damirchi,
Anton van den Hengel
Abstract:
Contrastive Language-Image Pretraining (CLIP) stands out as a prominent method for image representation learning. Various architectures, from vision transformers (ViTs) to convolutional networks (ResNets) have been trained with CLIP to serve as general solutions to diverse vision tasks. This paper explores the differences across various CLIP-trained vision backbones. Despite using the same data an…
▽ More
Contrastive Language-Image Pretraining (CLIP) stands out as a prominent method for image representation learning. Various architectures, from vision transformers (ViTs) to convolutional networks (ResNets) have been trained with CLIP to serve as general solutions to diverse vision tasks. This paper explores the differences across various CLIP-trained vision backbones. Despite using the same data and training objective, we find that these architectures have notably different representations, different classification performance across datasets, and different robustness properties to certain types of image perturbations. Our findings indicate a remarkable possible synergy across backbones by leveraging their respective strengths. In principle, classification accuracy could be improved by over 40 percentage with an informed selection of the optimal backbone per test example.Using this insight, we develop a straightforward yet powerful approach to adaptively ensemble multiple backbones. The approach uses as few as one labeled example per class to tune the adaptive combination of backbones. On a large collection of datasets, the method achieves a remarkable increase in accuracy of up to 39.1% over the best single backbone, well beyond traditional ensembles
△ Less
Submitted 27 May, 2024;
originally announced May 2024.
-
Unveiling Backbone Effects in CLIP: Exploring Representational Synergies and Variances
Authors:
Cristian Rodriguez-Opazo,
Edison Marrese-Taylor,
Ehsan Abbasnejad,
Hamed Damirchi,
Ignacio M. Jara,
Felipe Bravo-Marquez,
Anton van den Hengel
Abstract:
Contrastive Language-Image Pretraining (CLIP) stands out as a prominent method for image representation learning. Various neural architectures, spanning Transformer-based models like Vision Transformers (ViTs) to Convolutional Networks (ConvNets) like ResNets, are trained with CLIP and serve as universal backbones across diverse vision tasks. Despite utilizing the same data and training objectives…
▽ More
Contrastive Language-Image Pretraining (CLIP) stands out as a prominent method for image representation learning. Various neural architectures, spanning Transformer-based models like Vision Transformers (ViTs) to Convolutional Networks (ConvNets) like ResNets, are trained with CLIP and serve as universal backbones across diverse vision tasks. Despite utilizing the same data and training objectives, the effectiveness of representations learned by these architectures raises a critical question. Our investigation explores the differences in CLIP performance among these backbone architectures, revealing significant disparities in their classifications. Notably, normalizing these representations results in substantial performance variations. Our findings showcase a remarkable possible synergy between backbone predictions that could reach an improvement of over 20% through informed selection of the appropriate backbone. Moreover, we propose a simple, yet effective approach to combine predictions from multiple backbones, leading to a notable performance boost of up to 6.34\%. We will release the code for reproducing the results.
△ Less
Submitted 21 December, 2023;
originally announced December 2023.
-
Zero-shot Retrieval: Augmenting Pre-trained Models with Search Engines
Authors:
Hamed Damirchi,
Cristian Rodríguez-Opazo,
Ehsan Abbasnejad,
Damien Teney,
Javen Qinfeng Shi,
Stephen Gould,
Anton van den Hengel
Abstract:
Large pre-trained models can dramatically reduce the amount of task-specific data required to solve a problem, but they often fail to capture domain-specific nuances out of the box. The Web likely contains the information necessary to excel on any specific application, but identifying the right data a priori is challenging. This paper shows how to leverage recent advances in NLP and multi-modal le…
▽ More
Large pre-trained models can dramatically reduce the amount of task-specific data required to solve a problem, but they often fail to capture domain-specific nuances out of the box. The Web likely contains the information necessary to excel on any specific application, but identifying the right data a priori is challenging. This paper shows how to leverage recent advances in NLP and multi-modal learning to augment a pre-trained model with search engine retrieval. We propose to retrieve useful data from the Web at test time based on test cases that the model is uncertain about. Different from existing retrieval-augmented approaches, we then update the model to address this underlying uncertainty. We demonstrate substantial improvements in zero-shot performance, e.g. a remarkable increase of 15 percentage points in accuracy on the Stanford Cars and Flowers datasets. We also present extensive experiments that explore the impact of noisy retrieval and different learning strategies.
△ Less
Submitted 29 November, 2023;
originally announced November 2023.
-
Context-aware Pedestrian Trajectory Prediction with Multimodal Transformer
Authors:
Haleh Damirchi,
Michael Greenspan,
Ali Etemad
Abstract:
We propose a novel solution for predicting future trajectories of pedestrians. Our method uses a multimodal encoder-decoder transformer architecture, which takes as input both pedestrian locations and ego-vehicle speeds. Notably, our decoder predicts the entire future trajectory in a single-pass and does not perform one-step-ahead prediction, which makes the method effective for embedded edge depl…
▽ More
We propose a novel solution for predicting future trajectories of pedestrians. Our method uses a multimodal encoder-decoder transformer architecture, which takes as input both pedestrian locations and ego-vehicle speeds. Notably, our decoder predicts the entire future trajectory in a single-pass and does not perform one-step-ahead prediction, which makes the method effective for embedded edge deployment. We perform detailed experiments and evaluate our method on two popular datasets, PIE and JAAD. Quantitative results demonstrate the superiority of our proposed model over the current state-of-the-art, which consistently achieves the lowest error for 3 time horizons of 0.5, 1.0 and 1.5 seconds. Moreover, the proposed method is significantly faster than the state-of-the-art for the two datasets of PIE and JAAD. Lastly, ablation experiments demonstrate the impact of the key multimodal configuration of our method.
△ Less
Submitted 7 July, 2023;
originally announced July 2023.
-
Independent Modular Networks
Authors:
Hamed Damirchi,
Forest Agostinelli,
Pooyan Jamshidi
Abstract:
Monolithic neural networks that make use of a single set of weights to learn useful representations for downstream tasks explicitly dismiss the compositional nature of data generation processes. This characteristic exists in data where every instance can be regarded as the combination of an identity concept, such as the shape of an object, combined with modifying concepts, such as orientation, col…
▽ More
Monolithic neural networks that make use of a single set of weights to learn useful representations for downstream tasks explicitly dismiss the compositional nature of data generation processes. This characteristic exists in data where every instance can be regarded as the combination of an identity concept, such as the shape of an object, combined with modifying concepts, such as orientation, color, and size. The dismissal of compositionality is especially detrimental in robotics, where state estimation relies heavily on the compositional nature of physical mechanisms (e.g., rotations and transformations) to model interactions. To accommodate this data characteristic, modular networks have been proposed. However, a lack of structure in each module's role, and modular network-specific issues such as module collapse have restricted their usability. We propose a modular network architecture that accommodates the mentioned decompositional concept by proposing a unique structure that splits the modules into predetermined roles. Additionally, we provide regularizations that improve the resiliency of the modular network to the problem of module collapse while improving the decomposition accuracy of the model.
△ Less
Submitted 2 June, 2023;
originally announced June 2023.
-
Action Capsules: Human Skeleton Action Recognition
Authors:
Ali Farajzadeh Bavil,
Hamed Damirchi,
Hamid D. Taghirad
Abstract:
Due to the compact and rich high-level representations offered, skeleton-based human action recognition has recently become a highly active research topic. Previous studies have demonstrated that investigating joint relationships in spatial and temporal dimensions provides effective information critical to action recognition. However, effectively encoding global dependencies of joints during spati…
▽ More
Due to the compact and rich high-level representations offered, skeleton-based human action recognition has recently become a highly active research topic. Previous studies have demonstrated that investigating joint relationships in spatial and temporal dimensions provides effective information critical to action recognition. However, effectively encoding global dependencies of joints during spatio-temporal feature extraction is still challenging. In this paper, we introduce Action Capsule which identifies action-related key joints by considering the latent correlation of joints in a skeleton sequence. We show that, during inference, our end-to-end network pays attention to a set of joints specific to each action, whose encoded spatio-temporal features are aggregated to recognize the action. Additionally, the use of multiple stages of action capsules enhances the ability of the network to classify similar actions. Consequently, our network outperforms the state-of-the-art approaches on the N-UCLA dataset and obtains competitive results on the NTURGBD dataset. This is while our approach has significantly lower computational requirements based on GFLOPs measurements.
△ Less
Submitted 30 January, 2023;
originally announced January 2023.
-
Multiscale Crowd Counting and Localization By Multitask Point Supervision
Authors:
Mohsen Zand,
Haleh Damirchi,
Andrew Farley,
Mahdiyar Molahasani,
Michael Greenspan,
Ali Etemad
Abstract:
We propose a multitask approach for crowd counting and person localization in a unified framework. As the detection and localization tasks are well-correlated and can be jointly tackled, our model benefits from a multitask solution by learning multiscale representations of encoded crowd images, and subsequently fusing them. In contrast to the relatively more popular density-based methods, our mode…
▽ More
We propose a multitask approach for crowd counting and person localization in a unified framework. As the detection and localization tasks are well-correlated and can be jointly tackled, our model benefits from a multitask solution by learning multiscale representations of encoded crowd images, and subsequently fusing them. In contrast to the relatively more popular density-based methods, our model uses point supervision to allow for crowd locations to be accurately identified. We test our model on two popular crowd counting datasets, ShanghaiTech A and B, and demonstrate that our method achieves strong results on both counting and localization tasks, with MSE measures of 110.7 and 15.0 for crowd counting and AP measures of 0.71 and 0.75 for localization, on ShanghaiTech A and B respectively. Our detailed ablation experiments show the impact of our multiscale approach as well as the effectiveness of the fusion module embedded in our network. Our code is available at: https://github.com/RCVLab-AiimLab/crowd_counting.
△ Less
Submitted 20 February, 2022;
originally announced February 2022.
-
A Consistency-Based Loss for Deep Odometry Through Uncertainty Propagation
Authors:
Hamed Damirchi,
Rooholla Khorrambakht,
Hamid D. Taghirad,
Behzad Moshiri
Abstract:
The incremental poses computed through odometry can be integrated over time to calculate the pose of a device with respect to an initial location. The resulting global pose may be used to formulate a second, consistency based, loss term in a deep odometry setting. In such cases where multiple losses are imposed on a network, the uncertainty over each output can be derived to weigh the different lo…
▽ More
The incremental poses computed through odometry can be integrated over time to calculate the pose of a device with respect to an initial location. The resulting global pose may be used to formulate a second, consistency based, loss term in a deep odometry setting. In such cases where multiple losses are imposed on a network, the uncertainty over each output can be derived to weigh the different loss terms in a maximum likelihood setting. However, when imposing a constraint on the integrated transformation, due to how only odometry is estimated at each iteration of the algorithm, there is no information about the uncertainty associated with the global pose to weigh the global loss term. In this paper, we associate uncertainties with the output poses of a deep odometry network and propagate the uncertainties through each iteration. Our goal is to use the estimated covariance matrix at each incremental step to weigh the loss at the corresponding step while weighting the global loss term using the compounded uncertainty. This formulation provides an adaptive method to weigh the incremental and integrated loss terms against each other, noting the increase in uncertainty as new estimates arrive. We provide quantitative and qualitative analysis of pose estimates and show that our method surpasses the accuracy of the state-of-the-art Visual Odometry approaches. Then, uncertainty estimates are evaluated and comparisons against fixed baselines are provided. Finally, the uncertainty values are used in a realistic example to show the effectiveness of uncertainty quantification for localization.
△ Less
Submitted 1 July, 2021;
originally announced July 2021.
-
Deep Inertial Odometry with Accurate IMU Preintegration
Authors:
Rooholla Khorrambakht,
Chris Xiaoxuan Lu,
Hamed Damirchi,
Zhenghua Chen,
Zhengguo Li
Abstract:
Inertial Measurement Units (IMUs) are interceptive modalities that provide ego-motion measurements independent of the environmental factors. They are widely adopted in various autonomous systems. Motivated by the limitations in processing the noisy measurements from these sensors using their mathematical models, researchers have recently proposed various deep learning architectures to estimate ine…
▽ More
Inertial Measurement Units (IMUs) are interceptive modalities that provide ego-motion measurements independent of the environmental factors. They are widely adopted in various autonomous systems. Motivated by the limitations in processing the noisy measurements from these sensors using their mathematical models, researchers have recently proposed various deep learning architectures to estimate inertial odometry in an end-to-end manner. Nevertheless, the high-frequency and redundant measurements from IMUs lead to long raw sequences to be processed. In this study, we aim to investigate the efficacy of accurate preintegration as a more realistic solution to the IMU motion model for deep inertial odometry (DIO) and the resultant DIO is a fusion of model-driven and data-driven approaches. The accurate IMU preintegration has the potential to outperform numerical approximation of the continuous IMU model used in the existing DIOs. Experimental results validate the proposed DIO.
△ Less
Submitted 18 January, 2021;
originally announced January 2021.
-
Exploring Self-Attention for Visual Odometry
Authors:
Hamed Damirchi,
Rooholla Khorrambakht,
Hamid D. Taghirad
Abstract:
Visual odometry networks commonly use pretrained optical flow networks in order to derive the ego-motion between consecutive frames. The features extracted by these networks represent the motion of all the pixels between frames. However, due to the existence of dynamic objects and texture-less surfaces in the scene, the motion information for every image region might not be reliable for inferring…
▽ More
Visual odometry networks commonly use pretrained optical flow networks in order to derive the ego-motion between consecutive frames. The features extracted by these networks represent the motion of all the pixels between frames. However, due to the existence of dynamic objects and texture-less surfaces in the scene, the motion information for every image region might not be reliable for inferring odometry due to the ineffectiveness of dynamic objects in derivation of the incremental changes in position. Recent works in this area lack attention mechanisms in their structures to facilitate dynamic reweighing of the feature maps for extracting more refined egomotion information. In this paper, we explore the effectiveness of self-attention in visual odometry. We report qualitative and quantitative results against the SOTA methods. Furthermore, saliency-based studies alongside specially designed experiments are utilized to investigate the effect of self-attention on VO. Our experiments show that using self-attention allows for the extraction of better features while achieving a better odometry performance compared to networks that lack such structures.
△ Less
Submitted 17 November, 2020;
originally announced November 2020.
-
ARC-Net: Activity Recognition Through Capsules
Authors:
Hamed Damirchi,
Rooholla Khorrambakht,
Hamid Taghirad
Abstract:
Human Activity Recognition (HAR) is a challenging problem that needs advanced solutions than using handcrafted features to achieve a desirable performance. Deep learning has been proposed as a solution to obtain more accurate HAR systems being robust against noise. In this paper, we introduce ARC-Net and propose the utilization of capsules to fuse the information from multiple inertial measurement…
▽ More
Human Activity Recognition (HAR) is a challenging problem that needs advanced solutions than using handcrafted features to achieve a desirable performance. Deep learning has been proposed as a solution to obtain more accurate HAR systems being robust against noise. In this paper, we introduce ARC-Net and propose the utilization of capsules to fuse the information from multiple inertial measurement units (IMUs) to predict the activity performed by the subject. We hypothesize that this network will be able to tune out the unnecessary information and will be able to make more accurate decisions through the iterative mechanism embedded in capsule networks. We provide heatmaps of the priors, learned by the network, to visualize the utilization of each of the data sources by the trained network. By using the proposed network, we were able to increase the accuracy of the state-of-the-art approaches by 2%. Furthermore, we investigate the directionality of the confusion matrices of our results and discuss the specificity of the activities based on the provided data.
△ Less
Submitted 6 July, 2020;
originally announced July 2020.
-
IMU Preintegrated Features for Efficient Deep Inertial Odometry
Authors:
R. Khorrambakht,
H. Damirchi,
H. D. Taghirad
Abstract:
MEMS Inertial Measurement Units (IMUs) as ubiquitous proprioceptive motion measurement devices are available on various everyday gadgets and robotic platforms. Nevertheless, the direct inference of geometrical transformations or odometry based on these data alone is a challenging task. This is due to the hard-to-model imperfections and high noise characteristics of the sensor, which has motivated…
▽ More
MEMS Inertial Measurement Units (IMUs) as ubiquitous proprioceptive motion measurement devices are available on various everyday gadgets and robotic platforms. Nevertheless, the direct inference of geometrical transformations or odometry based on these data alone is a challenging task. This is due to the hard-to-model imperfections and high noise characteristics of the sensor, which has motivated research in formulating the system as an end-to-end learning problem, where the motion patterns of the agent are exploited to facilitate better odometry estimates. However, this benefit comes at the cost of high computation and memory requirements, which makes deep inertial odometry unsuitable for low-power and edge applications. This paper attempts to address this conflict by proposing the IMU preintegrated features as a replacement for the raw IMU data in deep inertial odometry. Exploiting the manifold structure of the IMU motion model, these features provide a temporally compressed motion representation that preserves important geometrical information. We demonstrate the effectiveness and efficiency of this approach for the task of inertial odometry on two applications of pedestrian motion estimation and autonomous vehicles. We show a performance improvement compared to raw inputs while reducing the computational burdens. Additionally, we demonstrate the efficiency of this approach through an embedded implementation on a resource-constrained microcontroller.
△ Less
Submitted 18 March, 2022; v1 submitted 6 July, 2020;
originally announced July 2020.