doi 10.1109/FG59268.2024.10581905

FE-Adapter: Adapting Image-based Emotion Classifiers to Videos

Authors: Shreyank N Gowda, Boyan Gao, David A. Clifton

Abstract: Utilizing large pre-trained models for specific tasks has yielded impressive results. However, fully fine-tuning these increasingly large models is becoming prohibitively resource-intensive. This has led to a focus on more parameter-efficient transfer learning, primarily within the same modality. But this approach has limitations, particularly in video understanding where suitable pre-trained models are less common. Addressing this, our study introduces a novel cross-modality transfer learning approach from images to videos, which we call parameter-efficient image-to-video transfer learning. We present the Facial-Emotion Adapter (FE-Adapter), designed for efficient fine-tuning in video tasks. This adapter allows pre-trained image models, which traditionally lack temporal processing capabilities, to analyze dynamic video content efficiently. Notably, it uses about 15 times fewer parameters than previous methods, while improving accuracy. Our experiments in video emotion recognition demonstrate that the FE-Adapter can match or even surpass existing fine-tuning and video emotion models in both performance and efficiency. This breakthrough highlights the potential for cross-modality approaches in enhancing the capabilities of AI models, particularly in fields like video emotion analysis where the demand for efficiency and accuracy is constantly rising.

Submitted 5 August, 2024; originally announced August 2024.

arXiv:2408.00181 [pdf, other]

CC-SAM: SAM with Cross-feature Attention and Context for Ultrasound Image Segmentation

Authors: Shreyank N Gowda, David A. Clifton

Abstract: The Segment Anything Model (SAM) has achieved remarkable successes in the realm of natural image segmentation, but its deployment in the medical imaging sphere has encountered challenges. Specifically, the model struggles with medical images that feature low contrast, faint boundaries, intricate morphologies, and small-sized objects. To address these challenges and enhance SAM's performance in the medical domain, we introduce a comprehensive modification. Firstly, we incorporate a frozen Convolutional Neural Network (CNN) branch as an image encoder, which synergizes with SAM's original Vision Transformer (ViT) encoder through a novel variational attention fusion module. This integration bolsters the model's capability to capture local spatial information, which is often paramount in medical imagery. Moreover, to further optimize SAM for medical imaging, we introduce feature and position adapters within the ViT branch, refining the encoder's representations. We see that compared to current prompting strategies to fine-tune SAM for ultrasound medical segmentation, the use of text descriptions that serve as text prompts for SAM helps significantly improve the performance. Leveraging ChatGPT's natural language understanding capabilities, we generate prompts that offer contextual information and guidance to SAM, enabling it to better understand the nuances of ultrasound medical images and improve its segmentation accuracy. Our method, in its entirety, represents a significant stride towards making universal image segmentation models more adaptable and efficient in the medical domain.

Submitted 31 July, 2024; originally announced August 2024.

Comments: Accepted to ECCV 2024

arXiv:2407.16264 [pdf, other]

Masks and Manuscripts: Advancing Medical Pre-training with End-to-End Masking and Narrative Structuring

Authors: Shreyank N Gowda, David A. Clifton

Abstract: Contemporary medical contrastive learning faces challenges from inconsistent semantics and sample pair morphology, leading to dispersed and converging semantic shifts. The variability in text reports, due to multiple authors, complicates semantic consistency. To tackle these issues, we propose a two-step approach. Initially, text reports are converted into a standardized triplet format, laying the groundwork for our novel concept of "observations" and "verdicts". This approach refines the {Entity, Position, Exist} triplet into binary questions, guiding towards a clear "verdict". We also innovate in visual pre-training with a Meijering-based masking, focusing on features representative of medical images' local context. By integrating this with our text conversion method, our model advances cross-modal representation in a multimodal contrastive learning framework, setting new benchmarks in medical image analysis.

Submitted 23 July, 2024; originally announced July 2024.

Comments: Accepted in MICCAI-24

arXiv:2406.02929 [pdf, other]

Exploring Data Efficiency in Zero-Shot Learning with Diffusion Models

Authors: Zihan Ye, Shreyank N. Gowda, Xiaobo Jin, Xiaowei Huang, Haotian Xu, Yaochu Jin, Kaizhu Huang

Abstract: Zero-Shot Learning (ZSL) aims to enable classifiers to identify unseen classes by enhancing data efficiency at the class level. This is achieved by generating image features from pre-defined semantics of unseen classes. However, most current approaches heavily depend on the number of samples from seen classes, i.e. they do not consider instance-level effectiveness. In this paper, we demonstrate that limited seen examples generally result in deteriorated performance of generative models. To overcome these challenges, we propose ZeroDiff, a Diffusion-based Generative ZSL model. This unified framework incorporates diffusion models to improve data efficiency at both the class and instance levels. Specifically, for instance-level effectiveness, ZeroDiff utilizes a forward diffusion chain to transform limited data into an expanded set of noised data. For class-level effectiveness, we design a two-branch generation structure that consists of a Diffusion-based Feature Generator (DFG) and a Diffusion-based Representation Generator (DRG). DFG focuses on learning and sampling the distribution of cross-entropy-based features, whilst DRG learns the supervised contrastive-based representation to boost the zero-shot capabilities of DFG. Additionally, we employ three discriminators to evaluate generated features from various aspects and introduce a Wasserstein-distance-based mutual learning loss to transfer knowledge among discriminators, thereby enhancing guidance for generation. Demonstrated through extensive experiments on three popular ZSL benchmarks, our ZeroDiff not only achieves significant improvements over existing ZSL methods but also maintains robust performance even with scarce training data. Code will be released upon acceptance.
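
Illustration (not from the paper): the abstract mentions a forward diffusion chain that turns limited data into an expanded set of noised data. The sketch below shows what such a chain can look like under the standard DDPM closed form x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps. The linear beta schedule, the function names, and the toy feature vector are all assumptions made for this example; the abstract does not say which schedule or parameterization ZeroDiff uses.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Evenly spaced noise variances (a common DDPM default, assumed here)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t is the running product of (1 - beta_s) up to step t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def noised_copies(feature, alpha_bars, rng):
    """Return one noised copy of `feature` per timestep.

    Each copy is sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,
    with eps drawn independently from a standard normal.
    """
    copies = []
    for a_bar in alpha_bars:
        signal, noise = math.sqrt(a_bar), math.sqrt(1.0 - a_bar)
        copies.append([signal * x + noise * rng.gauss(0.0, 1.0) for x in feature])
    return copies


# One hypothetical 3-dimensional feature expanded into 10 noised versions.
rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(10))
expanded = noised_copies([0.5, -1.2, 3.0], alpha_bars, rng)
```

Early timesteps stay close to the original feature and later ones carry more noise, so a single seen example yields a graded family of training inputs.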

Submitted 5 June, 2024; originally announced June 2024.

arXiv:2404.09752 [pdf, other]

Can We Break Free from Strong Data Augmentations in Self-Supervised Learning?

Authors: Shruthi Gowda, Elahe Arani, Bahram Zonooz

Abstract: Self-supervised learning (SSL) has emerged as a promising solution for addressing the challenge of limited labeled data in deep neural networks (DNNs), offering scalability potential. However, the impact of design dependencies within the SSL framework remains insufficiently investigated. In this study, we comprehensively explore SSL behavior across a spectrum of augmentations, revealing their crucial role in shaping SSL model performance and learning mechanisms. Leveraging these insights, we propose a novel learning approach that integrates prior knowledge, with the aim of curtailing the need for extensive data augmentations and thereby amplifying the efficacy of learned representations. Notably, our findings underscore that SSL models imbued with prior knowledge exhibit reduced texture bias, diminished reliance on shortcuts and augmentations, and improved robustness against both natural and adversarial corruptions. These findings not only illuminate a new direction in SSL research, but also pave the way for enhancing DNN performance while concurrently alleviating the imperative for intensive data augmentation, thereby enhancing scalability and real-world problem-solving capabilities.

Submitted 15 April, 2024; originally announced April 2024.

arXiv:2402.10240 [pdf, other]

A Dynamical View of the Question of Why

Authors: Mehdi Fatemi, Sindhu Gowda

Abstract: We address causal reasoning in multivariate time series data generated by stochastic processes. Existing approaches are largely restricted to static settings, ignoring the continuity and emission of variations across time. In contrast, we propose a learning paradigm that directly establishes causation between events in the course of time. We present two key lemmas to compute causal contributions and frame them as reinforcement learning problems. Our approach offers formal and computational tools for uncovering and quantifying causal relationships in diffusion processes, subsuming various important settings such as discrete-time Markov decision processes. Finally, in fairly intricate experiments and through sheer learning, our framework reveals and quantifies causal links, which otherwise seem inexplicable.

Submitted 27 February, 2024; v1 submitted 14 February, 2024; originally announced February 2024.

Comments: Accepted at the Twelfth International Conference on Learning Representations (ICLR'24)

arXiv:2401.17883 [pdf, other]

Reimagining Reality: A Comprehensive Survey of Video Inpainting Techniques

Authors: Shreyank N Gowda, Yash Thakre, Shashank Narayana Gowda, Xiaobo Jin

Abstract: This paper offers a comprehensive analysis of recent advancements in video inpainting techniques, a critical subset of computer vision and artificial intelligence. As a process that restores or fills in missing or corrupted portions of video sequences with plausible content, video inpainting has evolved significantly with the advent of deep learning methodologies. Despite the plethora of existing methods and their swift development, the landscape remains complex, posing challenges to both novices and established researchers. Our study deconstructs major techniques, their underpinning theories, and their effective applications. Moreover, we conduct an exhaustive comparative study, centering on two often-overlooked dimensions: visual quality and computational efficiency. We adopt a human-centric approach to assess visual quality, enlisting a panel of annotators to evaluate the output of different video inpainting techniques. This provides a nuanced qualitative understanding that complements traditional quantitative metrics. Concurrently, we delve into the computational aspects, comparing inference times and memory demands across a standardized hardware setup. This analysis underscores the balance between quality and efficiency: a critical consideration for practical applications where resources may be constrained. By integrating human validation and computational resource comparison, this survey not only clarifies the present landscape of video inpainting techniques but also charts a course for future explorations in this vibrant and evolving field.

Submitted 31 January, 2024; originally announced January 2024.

arXiv:2401.14948 [pdf, other]

Conserve-Update-Revise to Cure Generalization and Robustness Trade-off in Adversarial Training

Authors: Shruthi Gowda, Bahram Zonooz, Elahe Arani

Abstract: Adversarial training improves the robustness of neural networks against adversarial attacks, albeit at the expense of the trade-off between standard and robust generalization. To unveil the underlying factors driving this phenomenon, we examine the layer-wise learning capabilities of neural networks during the transition from a standard to an adversarial setting. Our empirical findings demonstrate that selectively updating specific layers while preserving others can substantially enhance the network's learning capacity. We therefore propose CURE, a novel training framework that leverages a gradient prominence criterion to perform selective conservation, updating, and revision of weights. Importantly, CURE is designed to be dataset- and architecture-agnostic, ensuring its applicability across various scenarios. It effectively tackles both memorization and overfitting issues, thus improving the trade-off between robustness and generalization. This training approach also helps mitigate "robust overfitting". Furthermore, our study provides valuable insights into the mechanisms of selective adversarial training and offers a promising avenue for future research.

Submitted 26 January, 2024; originally announced January 2024.

Comments: Accepted as a conference paper at ICLR 2024

arXiv:2401.11406 [pdf, other]

Adversarial Augmentation Training Makes Action Recognition Models More Robust to Realistic Video Distribution Shifts

Authors: Kiyoon Kim, Shreyank N Gowda, Panagiotis Eustratiadis, Antreas Antoniou, Robert B Fisher

Abstract: Despite recent advances in video action recognition achieving strong performance on existing benchmarks, these models often lack robustness when faced with natural distribution shifts between training and test data. We propose two novel evaluation methods to assess model resilience to such distribution disparity. One method uses two different datasets collected from different sources and uses one for training and validation, and the other for testing. More precisely, we created dataset splits of HMDB-51 or UCF-101 for training, and Kinetics-400 for testing, using the subset of the classes that are overlapping in both train and test datasets. The other proposed method extracts the feature mean of each class from the target evaluation dataset's training data (i.e. class prototype) and estimates test video prediction as a cosine similarity score between each sample and the class prototype of each target class. This procedure does not alter model weights using the target dataset and does not require aligning overlapping classes of two different datasets, and is thus a very efficient method to test model robustness to distribution shifts without prior knowledge of the target distribution. We address the robustness problem by adversarial augmentation training - generating augmented views of videos that are "hard" for the classification model by applying gradient ascent on the augmentation parameters - as well as "curriculum" scheduling the strength of the video augmentations. We experimentally demonstrate the superior performance of the proposed adversarial augmentation approach over baselines across three state-of-the-art action recognition models - TSM, Video Swin Transformer, and Uniformer. The presented work provides critical insight into model robustness to distribution shifts and presents effective techniques to enhance video action recognition performance in a real-world deployment.
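
Illustration (not from the paper): the second evaluation method described in this abstract, class prototypes plus cosine similarity, can be sketched in a few lines. The feature vectors, labels, and function names below are hypothetical; in the paper's setting the features would come from a frozen action recognition model, which this example does not include.

```python
import math
from collections import defaultdict


def class_prototypes(features, labels):
    """Average the feature vectors of each class to get one prototype per class."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def predict(feature, prototypes):
    """Assign the class whose prototype is most similar to the test feature."""
    return max(prototypes, key=lambda label: cosine_similarity(feature, prototypes[label]))


# Toy 2-D "video features" from a target dataset's training split (made up).
train_feats = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.2, 0.8]]
train_labels = ["run", "run", "jump", "jump"]
protos = class_prototypes(train_feats, train_labels)
```

Because only feature means are computed, no model weights are updated on the target dataset, which matches the property the abstract highlights.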

Submitted 21 January, 2024; originally announced January 2024.

arXiv:2310.11341 [pdf, other]

Dual Cognitive Architecture: Incorporating Biases and Multi-Memory Systems for Lifelong Learning

Authors: Shruthi Gowda, Bahram Zonooz, Elahe Arani

Abstract: Artificial neural networks (ANNs) exhibit a narrow scope of expertise on stationary independent data. However, the data in the real world is continuous and dynamic, and ANNs must adapt to novel scenarios while also retaining the learned knowledge to become lifelong learners. The ability of humans to excel at these tasks can be attributed to multiple factors, including cognitive computational structures, cognitive biases, and the multi-memory systems in the brain. We incorporate key concepts from each of these to design a novel framework, Dual Cognitive Architecture (DUCA), which includes multiple sub-systems, implicit and explicit knowledge representation dichotomy, inductive bias, and a multi-memory system. The inductive bias learner within DUCA is instrumental in encoding shape information, effectively countering the tendency of ANNs to learn local textures. Simultaneously, the inclusion of a semantic memory submodule facilitates the gradual consolidation of knowledge, replicating the dynamics observed in fast and slow learning systems, reminiscent of the principles underpinning the complementary learning system in human cognition. DUCA shows improvement across different settings and datasets, and it also exhibits reduced task recency bias, without the need for extra information. To further test the versatility of lifelong learning methods on a challenging distribution shift, we introduce a novel domain-incremental dataset DN4IL. In addition to improving performance on existing benchmarks, DUCA also demonstrates superior performance on this complex dataset.

Submitted 17 October, 2023; originally announced October 2023.

Comments: Published in Transactions on Machine Learning Research (TMLR)

arXiv:2310.06522 [pdf, other]

Watt For What: Rethinking Deep Learning's Energy-Performance Relationship

Authors: Shreyank N Gowda, Xinyue Hao, Gen Li, Laura Sevilla-Lara, Shashank Narayana Gowda

Abstract: Deep learning models have revolutionized various fields, from image recognition to natural language processing, by achieving unprecedented levels of accuracy. However, their increasing energy consumption has raised concerns about their environmental impact, disadvantaging smaller entities in research and exacerbating global energy consumption. In this paper, we explore the trade-off between model accuracy and electricity consumption, proposing a metric that penalizes large consumption of electricity. We conduct a comprehensive study on the electricity consumption of various deep learning models across different GPUs, presenting a detailed analysis of their accuracy-efficiency trade-offs. By evaluating accuracy per unit of electricity consumed, we demonstrate how smaller, more energy-efficient models can significantly expedite research while mitigating environmental concerns. Our results highlight the potential for a more sustainable approach to deep learning, emphasizing the importance of optimizing models for efficiency. This research also contributes to a more equitable research landscape, where smaller entities can compete effectively with larger counterparts. This advocates for the adoption of efficient deep learning practices to reduce electricity consumption, safeguarding the environment for future generations whilst also helping ensure a fairer competitive landscape.
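
Illustration (not from the paper): the abstract speaks of evaluating accuracy per unit of electricity consumed but does not give the exact form of its proposed metric. The sketch below uses the simplest reading, a plain ratio of accuracy to kilowatt-hours, purely as an assumption. The function name, model names, and all numbers are invented for the example.

```python
def accuracy_per_kwh(accuracy, energy_kwh):
    """Plain ratio of accuracy to electricity used (assumed form, not the paper's metric)."""
    if energy_kwh > 0:
        return accuracy / energy_kwh
    raise ValueError("energy_kwh must be positive")


# Hypothetical models: (top-1 accuracy, training energy in kWh).
models = {
    "small_model": (0.76, 2.0),
    "large_model": (0.80, 20.0),
}

# Rank models from most to least accuracy obtained per kWh.
ranking = sorted(models, key=lambda name: accuracy_per_kwh(*models[name]), reverse=True)
```

Under this ratio the hypothetical small model ranks first despite its lower accuracy, which is the kind of trade-off the abstract describes.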

Submitted 10 October, 2023; originally announced October 2023.

arXiv:2309.17327 [pdf, other]

Telling Stories for Common Sense Zero-Shot Action Recognition

Authors: Shreyank N Gowda, Laura Sevilla-Lara

Abstract: Video understanding has long suffered from reliance on large labeled datasets, motivating research into zero-shot learning. Recent progress in language modeling presents opportunities to advance zero-shot video analysis, but constructing an effective semantic space relating action classes remains challenging. We address this by introducing a novel dataset, Stories, which contains rich textual descriptions for diverse action classes extracted from WikiHow articles. For each class, we extract multi-sentence narratives detailing the necessary steps, scenes, objects, and verbs that characterize the action. This contextual data enables modeling of nuanced relationships between actions, paving the way for zero-shot transfer. We also propose an approach that harnesses Stories to improve feature generation for training zero-shot classification. Without any target dataset fine-tuning, our method achieves new state-of-the-art on multiple benchmarks, improving top-1 accuracy by up to 6.1%. We believe Stories provides a valuable resource that can catalyze progress in zero-shot action recognition. The textual narratives forge connections between seen and unseen classes, overcoming the bottleneck of labeled data that has long impeded advancements in this exciting domain. The data can be found here: https://github.com/kini5gowda/Stories.

Submitted 29 September, 2023; originally announced September 2023.

arXiv:2309.01390 [pdf, other]

Bridging the Projection Gap: Overcoming Projection Bias Through Parameterized Distance Learning

Authors: Chong Zhang, Mingyu Jin, Qinkai Yu, Haochen Xue, Shreyank N Gowda, Xiaobo Jin

Abstract: Generalized zero-shot learning (GZSL) aims to recognize samples from both seen and unseen classes using only seen class samples for training. However, GZSL methods are prone to bias towards seen classes during inference due to the projection function being learned from seen classes. Most methods focus on learning an accurate projection, but bias in the projection is inevitable. We address this projection bias by proposing to learn a parameterized Mahalanobis distance metric for robust inference. Our key insight is that the distance computation during inference is critical, even with a biased projection. We make two main contributions - (1) We extend the VAEGAN (Variational Autoencoder & Generative Adversarial Networks) architecture with two branches to separately output the projection of samples from seen and unseen classes, enabling more robust distance learning. (2) We introduce a novel loss function to optimize the Mahalanobis distance representation and reduce projection bias. Extensive experiments on four datasets show that our approach outperforms state-of-the-art GZSL techniques with improvements of up to 3.5% on the harmonic mean metric.
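
Illustration (not from the paper): a parameterized Mahalanobis distance is commonly written as d(x, y) = sqrt((x - y)^T M (x - y)) with M = L^T L, so that learning L keeps M positive semi-definite. The sketch below computes that distance for a given L. The factorization, the function name, and the example matrices are assumptions; the abstract does not specify how the paper parameterizes or trains the metric.

```python
import math


def mahalanobis(x, y, L):
    """Distance sqrt((x - y)^T (L^T L) (x - y)), computed as the norm of L (x - y).

    L is a list of rows; in a learned metric its entries would be trainable.
    """
    diff = [a - b for a, b in zip(x, y)]
    projected = [sum(l_ij * d for l_ij, d in zip(row, diff)) for row in L]
    return math.sqrt(sum(p * p for p in projected))


# With L equal to the identity the metric reduces to Euclidean distance.
identity = [[1.0, 0.0], [0.0, 1.0]]
# A hypothetical learned L that weights the first dimension twice as heavily.
stretched = [[2.0, 0.0], [0.0, 1.0]]

euclidean = mahalanobis([0.0, 0.0], [3.0, 4.0], identity)
weighted = mahalanobis([0.0, 0.0], [3.0, 4.0], stretched)
```

Changing L changes which directions in feature space count as "far", which is how a learned metric can compensate for a biased projection at inference time.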

Submitted 2 April, 2024; v1 submitted 4 September, 2023; originally announced September 2023.

Comments: 18 pages, 9 figures

arXiv:2308.16041 [pdf, other]

From Pixels to Portraits: A Comprehensive Survey of Talking Head Generation Techniques and Applications

Authors: Shreyank N Gowda, Dheeraj Pandey, Shashank Narayana Gowda

Abstract: Recent advancements in deep learning and computer vision have led to a surge of interest in generating realistic talking heads. This paper presents a comprehensive survey of state-of-the-art methods for talking head generation. We systematically categorise them into four main approaches: image-driven, audio-driven, video-driven and others (including neural radiance fields (NeRF), and 3D-based methods). We provide an in-depth analysis of each method, highlighting their unique contributions, strengths, and limitations. Furthermore, we thoroughly compare publicly available models, evaluating them on key aspects such as inference time and human-rated quality of the generated outputs. Our aim is to provide a clear and concise overview of the current landscape in talking head generation, elucidating the relationships between different approaches and identifying promising directions for future research. This survey will serve as a valuable reference for researchers and practitioners interested in this rapidly evolving field.

Submitted 30 August, 2023; originally announced August 2023.

arXiv:2306.04822 [pdf, other]

Optimizing ViViT Training: Time and Memory Reduction for Action Recognition

Authors: Shreyank N Gowda, Anurag Arnab, Jonathan Huang

Abstract: In this paper, we address the challenges posed by the substantial training time and memory consumption associated with video transformers, focusing on the ViViT (Video Vision Transformer) model, in particular the Factorised Encoder version, as our baseline for action recognition tasks. The factorised encoder variant follows the late-fusion approach that is adopted by many state-of-the-art approaches. Despite standing out for its favorable speed/accuracy tradeoffs among the different variants of ViViT, its considerable training time and memory requirements still pose a significant barrier to entry. Our method is designed to lower this barrier and is based on the idea of freezing the spatial transformer during training. Done naively, this leads to a low-accuracy model. But we show that by (1) appropriately initializing the temporal transformer (a module responsible for processing temporal information) and (2) introducing a compact adapter model connecting frozen spatial representations (a module that selectively focuses on regions of the input image) to the temporal transformer, we can enjoy the benefits of freezing the spatial transformer without sacrificing accuracy. Through extensive experimentation over 6 benchmarks, we demonstrate that our proposed training strategy significantly reduces training costs (by ~50%) and memory consumption while maintaining or slightly improving performance by up to 1.79% compared to the baseline model. Our approach additionally unlocks the capability to utilize larger image transformer models as our spatial transformer and access more frames with the same memory consumption.

Submitted 7 June, 2023; originally announced June 2023.

arXiv:2304.06935 [pdf, ps, other]

Groebner.jl: A package for Gröbner bases computations in Julia

Authors: Alexander Demin, Shashi Gowda

Abstract: We present Groebner.jl, a Julia package for computing Groebner bases with the F4 algorithm. Groebner.jl is efficient, portable, open-source software. Groebner.jl works over integers modulo a prime and over the rationals, supports basic multi-threading, and specializes in computation in the degree reverse lexicographical monomial ordering. The implementation incorporates various symbolic computation techniques and leverages the Julia type system and tooling, which allows Groebner.jl to compete with the existing state of the art, in many instances outperform it, and exceed it in extensibility. Groebner.jl is freely available at https://github.com/sumiya11/Groebner.jl.

Submitted 12 February, 2024; v1 submitted 14 April, 2023; originally announced April 2023.

Comments: 10 pages

arXiv:2304.06672 [pdf, other]

LSFSL: Leveraging Shape Information in Few-shot Learning

Authors: Deepan Chakravarthi Padmanabhan, Shruthi Gowda, Elahe Arani, Bahram Zonooz

Abstract: Few-shot learning (FSL) techniques seek to learn the underlying patterns in data using fewer samples, analogous to how humans learn from limited experience. In this limited-data scenario, the challenges associated with deep neural networks, such as shortcut learning and texture bias behaviors, are further exacerbated. Moreover, the significance of addressing shortcut learning is not yet fully explored in the few-shot setup. To address these issues, we propose LSFSL, which enforces the model to learn more generalizable features utilizing the implicit prior information present in the data. Through comprehensive analyses, we demonstrate that LSFSL-trained models are less vulnerable to alteration in color schemes, statistical correlations, and adversarial perturbations leveraging the global semantics in the data. Our findings highlight the potential of incorporating relevant priors in few-shot approaches to increase robustness and generalization.

Submitted 13 April, 2023; originally announced April 2023.

Comments: Accepted at CVPR 2023 (2nd Workshop on Learning with Limited Labelled Data for Image and Video Understanding)

arXiv:2304.02846 [pdf, other]

Synthetic Sample Selection for Generalized Zero-Shot Learning

Authors: Shreyank N Gowda

Abstract: Generalized Zero-Shot Learning (GZSL) has emerged as a pivotal research domain in computer vision, owing to its capability to recognize objects that have not been seen during training. Despite the significant progress achieved by generative techniques in converting traditional GZSL to fully supervised learning, they tend to generate a large number of synthetic features that are often redundant, thereby increasing training time and decreasing accuracy. To address this issue, this paper proposes a novel approach for synthetic feature selection using reinforcement learning. In particular, we propose a transformer-based selector that is trained through proximal policy optimization (PPO) to select synthetic features based on the validation classification accuracy of the seen classes, which serves as a reward. The proposed method is model-agnostic and data-agnostic, making it applicable to both images and videos and versatile for diverse applications. Our experimental results demonstrate the superiority of our approach over existing feature-generating methods, yielding improved overall performance on multiple benchmarks.

Submitted 5 April, 2023; originally announced April 2023.

Comments: Paper accepted in CVPRW 2023

arXiv:2211.05229 [pdf]

Automatic Number Plate Recognition (ANPR) with YOLOv3-CNN

Authors: Rajdeep Adak, Abhishek Kumbhar, Rajas Pathare, Sagar Gowda

Abstract: We present a YOLOv3-CNN pipeline for detecting vehicles, segregation of number plates, and local storage of final recognized characters. Vehicle identification is performed under various image correction schemes to determine the effect of environmental factors (angle of perception, luminosity, motion-blurring, and multi-line custom font etc.). A YOLOv3 object detection model was trained to identify vehicles from a dataset of traffic images. A second YOLOv3 layer was trained to identify number plates from vehicle images. Based upon correction schemes, individual characters were segregated and verified against real-time data to calculate accuracy of this approach. While characters under direct view were recognized accurately, some numberplates affected by environmental factors had reduced levels of accuracy. We summarize the results under various environmental factors against real-time data and produce an overall accuracy of the pipeline model.

Submitted 7 November, 2022; originally announced November 2022.

Comments: 29 pages, 4 figures, 2 tables

arXiv:2209.15501 [pdf, other]

A Closer Look at Temporal Ordering in the Segmentation of Instructional Videos

Authors: Anil Batra, Shreyank N Gowda, Frank Keller, Laura Sevilla-Lara

Abstract: Understanding the steps required to perform a task is an important skill for AI systems. Learning these steps from instructional videos involves two subproblems: (i) identifying the temporal boundary of sequentially occurring segments and (ii) summarizing these steps in natural language. We refer to this task as Procedure Segmentation and Summarization (PSS). In this paper, we take a closer look at PSS and propose three fundamental improvements over current methods. The segmentation task is critical, as generating a correct summary requires each step of the procedure to be correctly identified. However, current segmentation metrics often overestimate the segmentation quality because they do not consider the temporal order of segments. In our first contribution, we propose a new segmentation metric that takes into account the order of segments, giving a more reliable measure of the accuracy of a given predicted segmentation. Current PSS methods are typically trained by proposing segments, matching them with the ground truth and computing a loss. However, much like segmentation metrics, existing matching algorithms do not consider the temporal order of the mapping between candidate segments and the ground truth. In our second contribution, we propose a matching algorithm that constrains the temporal order of segment mapping, and is also differentiable. Lastly, we introduce multi-modal feature training for PSS, which further improves segmentation.
We evaluate our approach on two instructional video datasets (YouCook2 and Tasty) and observe an improvement over the state-of-the-art of $\sim7\%$ and $\sim2.5\%$ for procedure segmentation and summarization, respectively.

Submitted 7 October, 2022; v1 submitted 30 September, 2022; originally announced September 2022.

Comments: Accepted at BMVC 2022

arXiv:2208.10895 [pdf, other]

A Comprehensive Study of Real-Time Object Detection Networks Across Multiple Domains: A Survey

Authors: Elahe Arani, Shruthi Gowda, Ratnajit Mukherjee, Omar Magdy, Senthilkumar Kathiresan, Bahram Zonooz

Abstract: Deep neural network based object detectors are continuously evolving and are used in a multitude of applications, each having its own set of requirements. While safety-critical applications need high accuracy and reliability, low-latency tasks need resource and energy-efficient networks. Real-time detectors, which are a necessity in high-impact real-world applications, are continuously proposed, but they overemphasize the improvements in accuracy and speed while other capabilities such as versatility, robustness, resource and energy efficiency are omitted. A reference benchmark for existing networks does not exist, nor does a standard evaluation guideline for designing new networks, which results in ambiguous and inconsistent comparisons. We, thus, conduct a comprehensive study on multiple real-time detectors (anchor-, keypoint-, and transformer-based) on a wide range of datasets and report results on an extensive set of metrics. We also study the impact of variables such as image size, anchor dimensions, confidence thresholds, and architecture layers on the overall performance. We analyze the robustness of detection networks against distribution shifts, natural corruptions, and adversarial attacks. Also, we provide a calibration analysis to gauge the reliability of the predictions. Finally, to highlight the real-world impact, we conduct two unique case studies, on autonomous driving and healthcare applications.
To further gauge the capability of networks in critical real-time applications, we report the performance after deploying the detection networks on edge devices. Our extensive empirical study can act as a guideline for the industrial community to make an informed choice on the existing networks. We also hope to inspire the research community towards a new direction in the design and evaluation of networks that focuses on a bigger and holistic overview for a far-reaching impact.

Submitted 14 February, 2023; v1 submitted 23 August, 2022; originally announced August 2022.

Comments: Published in Transactions on Machine Learning Research (TMLR) with Survey Certification

Journal ref: Transactions on Machine Learning Research, 2022

arXiv:2206.05846 [pdf, other]

InBiaseD: Inductive Bias Distillation to Improve Generalization and Robustness through Shape-awareness

Authors: Shruthi Gowda, Bahram Zonooz, Elahe Arani

Abstract: Humans rely less on spurious correlations and trivial cues, such as texture, compared to deep neural networks, which leads to better generalization and robustness. This can be attributed to the prior knowledge or the high-level cognitive inductive bias present in the brain. Therefore, introducing meaningful inductive bias to neural networks can help learn more generic and high-level representations and alleviate some of the shortcomings. We propose InBiaseD to distill inductive bias and bring shape-awareness to the neural networks. Our method includes a bias alignment objective that enforces the networks to learn more generic representations that are less vulnerable to unintended cues in the data, which results in improved generalization performance. InBiaseD is less susceptible to shortcut learning and also exhibits lower texture bias. The better representations also aid in improving robustness to adversarial attacks, and we hence plug InBiaseD seamlessly into the existing adversarial training schemes to show a better trade-off between generalization and robustness.

Submitted 12 June, 2022; originally announced June 2022.

Comments: Accepted at 1st Conference on Lifelong Learning Agents (CoLLAs 2022)

arXiv:2206.04790 [pdf, other]

Learn2Augment: Learning to Composite Videos for Data Augmentation in Action Recognition

Authors: Shreyank N Gowda, Marcus Rohrbach, Frank Keller, Laura Sevilla-Lara

Abstract: We address the problem of data augmentation for video action recognition. Standard augmentation strategies in video are hand-designed and sample the space of possible augmented data points either at random, without knowing which augmented points will be better, or through heuristics. We propose to learn what makes a good video for action recognition and select only high-quality samples for augmentation. In particular, we choose video compositing of a foreground and a background video as the data augmentation process, which results in diverse and realistic new samples. We learn which pairs of videos to augment without having to actually composite them. This reduces the space of possible augmentations, which has two advantages: it saves computational cost and increases the accuracy of the final trained classifier, as the augmented pairs are of higher quality than average. We present experimental results on the entire spectrum of training settings: few-shot, semi-supervised and fully supervised. We observe consistent improvements across all of them over prior work and baselines on Kinetics, UCF101, HMDB51, and achieve a new state-of-the-art on settings with limited data. We see improvements of up to 8.6% in the semi-supervised setting.

Submitted 23 July, 2022; v1 submitted 9 June, 2022; originally announced June 2022.

Comments: Accepted to ECCV-2022

arXiv:2201.12468 [pdf, ps, other]

Symbolic-Numeric Integration of Univariate Expressions based on Sparse Regression

Authors: Shahriar Iravanian, Carl Julius Martensen, Alessandro Cheli, Shashi Gowda, Anand Jain, Yingbo Ma, Chris Rackauckas

Abstract: Most computer algebra systems (CAS) support symbolic integration as core functionality. The majority of the integration packages use a combination of heuristic algebraic and rule-based (integration table) methods. In this paper, we present a hybrid (symbolic-numeric) methodology to calculate the indefinite integrals of univariate expressions. The primary motivation for this work is to add symbolic integration functionality to a modern CAS (the symbolic manipulation packages of SciML, the Scientific Machine Learning ecosystem of the Julia programming language), which is mainly designed toward numerical and machine learning applications and has a different set of features than traditional CAS. The symbolic part of our method is based on the combination of candidate terms generation (borrowed from the Homotopy operators theory) with rule-based expression transformations provided by the underlying CAS. The numeric part is based on sparse-regression, a component of Sparse Identification of Nonlinear Dynamics (SINDy) technique. We show that this system can solve a large variety of common integration problems using only a few dozen basic integration rules.

Submitted 6 February, 2022; v1 submitted 28 January, 2022; originally announced January 2022.

Comments: 8 pages. submitted to ISSAC 2022. Code at https://github.com/SciML/SymbolicNumericIntegration.jl

ACM Class: I.1.0; I.1.2

arXiv:2201.10394 [pdf, other]

Capturing Temporal Information in a Single Frame: Channel Sampling Strategies for Action Recognition

Authors: Kiyoon Kim, Shreyank N Gowda, Oisin Mac Aodha, Laura Sevilla-Lara

Abstract: We address the problem of capturing temporal information for video classification in 2D networks, without increasing their computational cost. Existing approaches focus on modifying the architecture of 2D networks (e.g. by including filters in the temporal dimension to turn them into 3D networks, or using optical flow, etc.), which increases computation cost. Instead, we propose a novel sampling strategy, where we re-order the channels of the input video, to capture short-term frame-to-frame changes. We observe that without bells and whistles, the proposed sampling strategy improves performance on multiple architectures (e.g. TSN, TRN, TSM, and MVFNet) and datasets (CATER, Something-Something-V1 and V2), up to 24% over the baseline of using the standard video input. In addition, our sampling strategies do not require training from scratch and do not increase the computational cost of training and testing. Given the generality of the results and the flexibility of the approach, we hope this can be widely useful to the video understanding community. Code is available on our website: https://github.com/kiyoon/channel_sampling.
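
Illustrative sketch (not from the paper): the abstract says the channels of the input video are re-ordered to capture short-term frame-to-frame changes, but it does not give the exact scheme. The Python below assumes one plausible variant, taking the R channel from frame t, G from frame t+1 and B from frame t+2, so that a 2D network still receives a 3-channel input. The function name `mix_channels` and the nested-list frame layout are hypothetical.

```python
from typing import List

# frame[channel][row][col], with channels assumed to be in R, G, B order
Frame = List[List[List[int]]]


def mix_channels(frames: List[Frame], t: int) -> Frame:
    """Build one 3-channel frame from three consecutive frames.

    Assumed scheme (not taken from the paper):
    R comes from frame t, G from frame t+1, B from frame t+2.
    """
    if t < 0 or t + 2 >= len(frames):
        raise IndexError("need three consecutive frames starting at t")
    return [frames[t][0], frames[t + 1][1], frames[t + 2][2]]


# Three tiny 1x2 frames; every pixel in frame k holds the value k.
video = [[[[k, k]] for _ in range(3)] for k in range(3)]
mixed = mix_channels(video, 0)
print(mixed)  # [[[0, 0]], [[1, 1]], [[2, 2]]]
```

Because the output keeps the usual 3-channel shape, a sketch like this would leave the network architecture untouched, which is consistent with the abstract's claim of no added computational cost.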

Submitted 10 October, 2022; v1 submitted 25 January, 2022; originally announced January 2022.

Comments: BMVC 2022

arXiv:2111.09243 [pdf, other]

An Investigation into Keystroke Dynamics and Heart Rate Variability as Indicators of Stress

Authors: Srijith Unni, Sushma Suryanarayana Gowda, Alan F. Smeaton

Abstract: Lifelogging has become a prominent research topic in recent years. Wearable sensors like Fitbits and smart watches are now increasingly popular for recording one's activities. Some researchers are also exploring keystroke dynamics for lifelogging. Keystroke dynamics refers to the process of measuring and assessing a person's typing rhythm on digital devices. A digital footprint is created when a user interacts with devices like keyboards, mobile phones or touch screen panels, and the timing of the keystrokes is unique to each individual though likely to be affected by factors such as fatigue, distraction or emotional stress. In this work we explore the relationship between keystroke dynamics, as measured by the timing for the top-10 most frequently occurring bigrams in English, and the emotional state and stress of an individual as measured by heart rate variability (HRV). We collected keystroke data using the Loggerman application while HRV was simultaneously gathered. With this data we performed an analysis to determine the relationship between variations in keystroke dynamics and variations in HRV. Our conclusion is that we need to use a more detailed representation of keystroke timing than the top-10 bigrams, probably personalised to each user.
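
Illustrative sketch (not from the paper): the abstract measures keystroke dynamics through bigram timing. The Python below shows one way such a feature could be computed, assuming a log of (key, press time in seconds) pairs and defining a bigram's timing as the gap between two consecutive key presses. Both the log format and that definition are assumptions; the Loggerman output format is not described in the abstract.

```python
from collections import defaultdict
from statistics import mean


def bigram_latencies(events):
    """Return the mean press-to-press gap (seconds) for each bigram.

    events: list of (key, press_time_seconds) tuples in typing order.
    This format is hypothetical, chosen only for illustration.
    """
    gaps = defaultdict(list)
    # Pair each key press with the one that follows it.
    for (k1, t1), (k2, t2) in zip(events, events[1:]):
        gaps[k1 + k2].append(t2 - t1)
    return {bigram: mean(values) for bigram, values in gaps.items()}


events = [("t", 0.0), ("h", 0.25), ("e", 0.5), ("t", 1.0), ("h", 1.5)]
latencies = bigram_latencies(events)
print(latencies["th"])  # 0.375
```

Restricting the resulting dictionary to a fixed list of frequent bigrams would give a per-session feature vector that could then be compared against HRV readings from the same time window.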

Submitted 17 November, 2021; originally announced November 2021.

Comments: 12 pages. To appear at MMM 2022, 28th International Conference on Multimedia Modeling, 5-8 April 2022, Phu Quoc, Vietnam

arXiv:2111.05191 [pdf, other]

Does Thermal data make the detection systems more reliable?

Authors: Shruthi Gowda, Bahram Zonooz, Elahe Arani

Abstract: Deep learning-based detection networks have made remarkable progress in autonomous driving systems (ADS). ADS should have reliable performance across a variety of ambient lighting and adverse weather conditions. However, luminance degradation and visual obstructions (such as glare, fog) result in poor quality images by the visual camera which leads to performance decline. To overcome these challenges, we explore the idea of leveraging a different data modality that is disparate yet complementary to the visual data. We propose a comprehensive detection system based on a multimodal-collaborative framework that learns from both RGB (from visual cameras) and thermal (from Infrared cameras) data. This framework trains two networks collaboratively and provides flexibility in learning optimal features of its own modality while also incorporating the complementary knowledge of the other. Our extensive empirical results show that while the improvement in accuracy is nominal, the value lies in challenging and extremely difficult edge cases which is crucial in safety-critical applications such as AD. We provide a holistic view of both merits and limitations of using a thermal imaging system in detection.

Submitted 9 November, 2021; originally announced November 2021.

Comments: Accepted at NeurIPS 2021 - ML4AD workshop (The code for this research is available at: https://github.com/NeurAI-Lab/MMC)

arXiv:2108.12510 [pdf, other]

doi 10.1145/3459637.3482380

Pulling Up by the Causal Bootstraps: Causal Data Augmentation for Pre-training Debiasing

Authors: Sindhu C. M. Gowda, Shalmali Joshi, Haoran Zhang, Marzyeh Ghassemi

Abstract: Machine learning models achieve state-of-the-art performance on many supervised learning tasks. However, prior evidence suggests that these models may learn to rely on shortcut biases or spurious correlations (intuitively, correlations that do not hold in the test as they hold in train) for good predictive performance. Such models cannot be trusted in deployment environments to provide accurate predictions. While viewing the problem from a causal lens is known to be useful, the seamless integration of causation techniques into machine learning pipelines remains cumbersome and expensive. In this work, we study and extend a causal pre-training debiasing technique called causal bootstrapping (CB) under five practical confounded-data generation-acquisition scenarios (with known and unknown confounding). Under these settings, we systematically investigate the effect of confounding bias on deep learning model performance, demonstrating their propensity to rely on shortcut biases when these biases are not properly accounted for. We demonstrate that such a causal pre-training technique can significantly outperform existing base practices to mitigate confounding bias on real-world domain generalization benchmarking tasks. This systematic investigation underlines the importance of accounting for the underlying data-generating mechanisms and fortifying data-preprocessing pipelines with a causal framework to develop methods robust to confounding biases.

Submitted 27 August, 2021; originally announced August 2021.

Comments: Published in CIKM 2021

arXiv:2107.13029 [pdf, other]

A New Split for Evaluating True Zero-Shot Action Recognition

Authors: Shreyank N Gowda, Laura Sevilla-Lara, Kiyoon Kim, Frank Keller, Marcus Rohrbach

Abstract: Zero-shot action recognition is the task of classifying action categories that are not available in the training set. In this setting, the standard evaluation protocol is to use existing action recognition datasets (e.g. UCF101) and randomly split the classes into seen and unseen. However, most recent work builds on representations pre-trained on the Kinetics dataset, where classes largely overlap with classes in the zero-shot evaluation datasets. As a result, classes which are supposed to be unseen are present during supervised pre-training, invalidating the condition of the zero-shot setting. A similar concern was noted several years ago for image-based zero-shot recognition but has not been considered by the zero-shot action recognition community. In this paper, we propose a new split for true zero-shot action recognition with no overlap between unseen test classes and training or pre-training classes. We benchmark several recent approaches on the proposed True Zero-Shot (TruZe) Split for UCF101 and HMDB51, with zero-shot and generalized zero-shot evaluation. In our extensive analysis, we find that our TruZe splits are significantly harder than comparable random splits as nothing is leaking from pre-training, i.e. unseen performance is consistently lower, up to 8.9% for zero-shot action recognition. In an additional evaluation we also find that similar issues exist in the splits used in few-shot action recognition; here we see differences of up to 17.1%.
We publish our splits and hope that our benchmark analysis will change how the field is evaluating zero- and few-shot action recognition moving forward.
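
Illustrative sketch (not from the paper): the core constraint described above is that unseen test classes must not overlap with pre-training classes. The Python below shows a first-pass filter based on exact matching of normalized class names. This is a simplification assumed here for illustration; the abstract does not say how overlap was determined, and the class names in the example are made up rather than real dataset labels.

```python
import re


def normalize(name: str) -> str:
    """Lowercase a class name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def split_classes(dataset_classes, pretraining_classes):
    """Separate classes that overlap with pre-training from unseen candidates.

    Uses exact matching on normalized names only (an assumed simplification).
    """
    pretrained = {normalize(c) for c in pretraining_classes}
    overlapping = [c for c in dataset_classes if normalize(c) in pretrained]
    unseen_candidates = [c for c in dataset_classes if normalize(c) not in pretrained]
    return overlapping, unseen_candidates


# Hypothetical class names, for illustration only.
overlapping, unseen_candidates = split_classes(
    ["Archery", "Apply Lipstick", "Biking"],
    ["archery", "riding a bike"],
)
print(overlapping)        # ['Archery']
print(unseen_candidates)  # ['Apply Lipstick', 'Biking']
```

Note that "Biking" survives the filter even though "riding a bike" names the same action. Name matching alone misses semantic overlap of this kind, so candidates produced this way would still need a semantic or manual check before being treated as truly unseen.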

Submitted 13 September, 2021; v1 submitted 27 July, 2021; originally announced July 2021.

Comments: Accepted to GCPR 2021

arXiv:2107.00443 [pdf, other]

Test Framework for a Virtual Competition Testbed

Authors: Liam Wellacott, Emilyann Nault, Ioannis Skottis, Alexandre Colle, Shreyank N Gowda, Pierre Nicolay, Emily Rolley-Parnell

Abstract: Virtual environments have been utilised in robotics research as a tool to assess systems before deploying them in the field. The COVID-19 pandemic has brought about additional motivation for the development of virtual benchmarks in order to aid in safe and productive development. In-person robotics competitions have also halted, thus limiting the scope of opportunities for students and researchers. We implemented the structure of a service robotics competition into an extendable and adaptable virtual scoring environment. The competition challenges the state of the art in home service robotics by presenting realistic household tasks for robots to complete. The virtual environment provides a foundation for competition teams to assess their systems when accessing the physical environment is not possible. We believe that utilising virtual environments as a means of assessment will lead to other benefits such as increased access and generalisation.

Submitted 1 July, 2021; originally announced July 2021.

arXiv:2106.02567 [pdf]

AI Driven Road Maintenance Inspection

Authors: Ratnajit Mukherjee, Haris Iqbal, Shabbir Marzban, Ahmed Badar, Terence Brouns, Shruthi Gowda, Elahe Arani, Bahram Zonooz

Abstract: Road infrastructure maintenance inspection is typically a labour-intensive and critical task to ensure the safety of all the road users. In this work, we propose a detailed methodology to use state-of-the-art techniques in artificial intelligence and computer vision to automate a sizeable portion of the maintenance inspection subtasks and reduce the labour costs. The proposed methodology uses state-of-the-art computer vision techniques such as object detection and semantic segmentation to automate inspections on primary road structures such as the road surface, markings, barriers (guardrails) and traffic signs. The models are mostly trained on commercially viable datasets and augmented with proprietary data. We demonstrate that our AI models can not only automate and scale maintenance inspections on primary road structures but also result in higher recall compared to traditional manual inspections.

Submitted 4 June, 2021; originally announced June 2021.

Comments: accepted at 27th ITS World Congress, 2021

arXiv:2105.05946 [pdf, other]

Composing Modeling and Simulation with Machine Learning in Julia

Authors: Chris Rackauckas, Ranjan Anantharaman, Alan Edelman, Shashi Gowda, Maja Gwozdz, Anand Jain, Chris Laughman, Yingbo Ma, Francesco Martinuzzi, Avik Pal, Utkarsh Rajput, Elliot Saba, Viral B. Shah

Abstract: In this paper we introduce JuliaSim, a high-performance programming environment designed to blend traditional modeling and simulation with machine learning. JuliaSim can build accelerated surrogates from component-based models, such as those conforming to the FMI standard, using continuous-time echo state networks (CTESN). The foundation of this environment, ModelingToolkit.jl, is an acausal modeling language which can compose the trained surrogates as components within its staged compilation process. As a complementary factor we present the JuliaSim model library, a standard library with differential-algebraic equations and pre-trained surrogates, which can be composed using the modeling system for design, optimization, and control. We demonstrate the effectiveness of the surrogate-accelerated modeling and simulation approach on HVAC dynamics by showing that the CTESN surrogates accurately capture the dynamics of an HVAC cycle at less than 4\% error while accelerating its simulation by 340x. We illustrate the use of surrogate acceleration in the design process via global optimization of simulation parameters using the embedded surrogate, yielding a speedup of two orders of magnitude to find the optimum. We showcase the surrogate deployed in a co-simulation loop, as a drop-in replacement for one of the coupled FMUs, allowing engineers to effectively explore the design space of a coupled system. Together this demonstrates a workflow for automating the integration of machine learning techniques into traditional modeling and simulation processes.

Submitted 12 May, 2021; originally announced May 2021.

arXiv:2105.03949 [pdf, other]

High-performance symbolic-numerics via multiple dispatch

Authors: Shashi Gowda, Yingbo Ma, Alessandro Cheli, Maja Gwozdz, Viral B. Shah, Alan Edelman, Christopher Rackauckas

Abstract: As mathematical computing becomes more democratized in high-level languages, high-performance symbolic-numeric systems are necessary for domain scientists and engineers to get the best performance out of their machine without deep knowledge of code optimization. Naturally, users need different term types either to have different algebraic properties for them, or to use efficient data structures. To this end, we developed Symbolics.jl, an extendable symbolic system which uses dynamic multiple dispatch to change behavior depending on the domain needs. In this work we detail an underlying abstract term interface which allows for speed without sacrificing generality. We show that by formalizing a generic API on actions independent of implementation, we can retroactively add optimized data structures to our system without changing the pre-existing term rewriters. We showcase how this can be used to optimize term construction and give a 113x acceleration on general symbolic transformations. Further, we show that such a generic API allows for complementary term-rewriting implementations. We demonstrate the ability to swap between classical term-rewriting simplifiers and e-graph-based term-rewriting simplifiers. We showcase an e-graph ruleset which minimizes the number of CPU cycles during expression evaluation, and demonstrate how it simplifies a real-world reaction-network simulation to halve the runtime. Additionally, we show a reaction-diffusion partial differential equation solver which is able to be automatically converted into symbolic expressions via multiple dispatch tracing, which is subsequently accelerated and parallelized to give a 157x simulation speedup. Together, this presents Symbolics.jl as a next-generation symbolic-numeric computing environment geared towards modeling and simulation.

Submitted 5 February, 2022; v1 submitted 9 May, 2021; originally announced May 2021.

ACM Class: D.3.3; I.1.1; I.1.3

arXiv:2103.05244 [pdf]

ModelingToolkit: A Composable Graph Transformation System For Equation-Based Modeling

Authors: Yingbo Ma, Shashi Gowda, Ranjan Anantharaman, Chris Laughman, Viral Shah, Chris Rackauckas

Abstract: Getting good performance out of numerical equation solvers requires that the user has provided stable and efficient functions representing their model. However, users should not be trusted to write good code. In this manuscript we describe ModelingToolkit (MTK), a symbolic equation-based modeling system which allows for composable transformations to generate stable, efficient, and parallelized model implementations. MTK blurs the lines of traditional symbolic computing by acting directly on a user's numerical code. We show the ability to apply graph algorithms for automatically parallelizing and performing index reduction on code written for differential-algebraic equation (DAE) solvers, "fixing" the performance and stability of the model without requiring any changes on the user's part. We demonstrate how composable model transformations can be combined with automated data-driven surrogate generation techniques, allowing machine learning methods to generate accelerated approximate models within an acausal modeling framework. These reduced models are shown to outperform the Dymola Modelica compiler on an HVAC model by 590x at 3% error. Together, this demonstrates MTK as a system for bringing the latest research in graph transformations directly to modeling applications.

Submitted 9 February, 2022; v1 submitted 9 March, 2021; originally announced March 2021.

Comments: 10 pages, 3 figures, 1 table

arXiv:2101.07042 [pdf, other]

CLASTER: Clustering with Reinforcement Learning for Zero-Shot Action Recognition

Authors: Shreyank N Gowda, Laura Sevilla-Lara, Frank Keller, Marcus Rohrbach

Abstract: Zero-shot action recognition is the task of recognizing action classes without visual examples, only with a semantic embedding which relates unseen to seen classes. The problem can be seen as learning a function which generalizes well to instances of unseen classes without losing discrimination between classes. Neural networks can model the complex boundaries between visual classes, which explains their success as supervised models. However, in zero-shot learning, these highly specialized class boundaries may not transfer well from seen to unseen classes. In this paper we propose a centroid-based representation, which clusters visual and semantic representation, considers all training samples at once, and in this way generalizes well to instances from unseen classes. We optimize the clustering using Reinforcement Learning which we show is critical for our approach to work. We call the proposed method CLASTER and observe that it consistently outperforms the state-of-the-art in all standard datasets, including UCF101, HMDB51 and Olympic Sports; both in the standard zero-shot evaluation and the generalized zero-shot learning. Further, we show that our model performs competitively in the image domain as well, outperforming the state-of-the-art in many settings.

Submitted 23 July, 2022; v1 submitted 18 January, 2021; originally announced January 2021.

Comments: Accepted to ECCV-22

arXiv:2012.10671 [pdf, other]

SMART Frame Selection for Action Recognition

Authors: Shreyank N Gowda, Marcus Rohrbach, Laura Sevilla-Lara

Abstract: Action recognition is computationally expensive. In this paper, we address the problem of frame selection to improve the accuracy of action recognition. In particular, we show that selecting good frames helps in action recognition performance even in the trimmed videos domain. Recent work has successfully leveraged frame selection for long, untrimmed videos, where much of the content is not relevant, and easy to discard. In this work, however, we focus on the more standard short, trimmed action recognition problem. We argue that good frame selection can not only reduce the computational cost of action recognition but also increase the accuracy by getting rid of frames that are hard to classify. In contrast to previous work, we propose a method that instead of selecting frames by considering one at a time, considers them jointly. This results in a more efficient selection, where good frames are more effectively distributed over the video, like snapshots that tell a story. We call the proposed frame selection SMART and we test it in combination with different backbone architectures and on multiple benchmarks (Kinetics, Something-something, UCF101). We show that the SMART frame selection consistently improves the accuracy compared to other frame selection strategies while reducing the computational cost by a factor of 4 to 10. Additionally, we show that when the primary goal is recognition performance, our selection strategy can improve over recent state-of-the-art models and frame selection strategies on various benchmarks (UCF101, HMDB51, FCVID, and ActivityNet).

Submitted 19 December, 2020; originally announced December 2020.

Comments: To be published in AAAI-21

arXiv:2010.04004 [pdf, other]

Accelerating Simulation of Stiff Nonlinear Systems using Continuous-Time Echo State Networks

Authors: Ranjan Anantharaman, Yingbo Ma, Shashi Gowda, Chris Laughman, Viral Shah, Alan Edelman, Chris Rackauckas

Abstract: Modern design, control, and optimization often require simulation of highly nonlinear models, leading to prohibitive computational costs. These costs can be amortized by evaluating a cheap surrogate of the full model. Here we present a general data-driven method, the continuous-time echo state network (CTESN), for generating surrogates of nonlinear ordinary differential equations with dynamics at widely separated timescales. We empirically demonstrate near-constant time performance using our CTESNs on a physically motivated scalable model of a heating system whose full execution time increases exponentially, while maintaining a relative error within 0.2%. We also show that our model captures fast transients as well as slow dynamics effectively, while other techniques such as physics informed neural networks have difficulties trying to train and predict the highly nonlinear behavior of these models.

Submitted 24 March, 2021; v1 submitted 7 October, 2020; originally announced October 2020.

arXiv:2005.13039 [pdf, other]

ALBA : Reinforcement Learning for Video Object Segmentation

Authors: Shreyank N Gowda, Panagiotis Eustratiadis, Timothy Hospedales, Laura Sevilla-Lara

Abstract: We consider the challenging problem of zero-shot video object segmentation (VOS). That is, segmenting and tracking multiple moving objects within a video fully automatically, without any manual initialization. We treat this as a grouping problem by exploiting object proposals and making a joint inference about grouping over both space and time. We propose a network architecture for tractably performing proposal selection and joint grouping. Crucially, we then show how to train this network with reinforcement learning so that it learns to perform the optimal non-myopic sequence of grouping decisions to segment the whole video. Unlike standard supervised techniques, this also enables us to directly optimize for the non-differentiable overlap-based metrics used to evaluate VOS. We show that the proposed method, which we call ALBA, outperforms the previous state-of-the-art on three benchmarks: DAVIS 2017 [2], FBMS [20] and Youtube-VOS [27].

Submitted 14 August, 2020; v1 submitted 26 May, 2020; originally announced May 2020.

arXiv:2003.05005 [pdf, other]

Using an ensemble color space model to tackle adversarial examples

Authors: Shreyank N Gowda, Chun Yuan

Abstract: Minute pixel changes in an image drastically change the prediction that the deep learning model makes. One of the most significant areas where this could cause problems, for instance, is autonomous driving. Many methods have been proposed to combat this with varying amounts of success. We propose a 3-step method for defending against such attacks. First, we denoise the image using statistical methods. Second, we show that adopting multiple color spaces in the same model can help us to fight these adversarial attacks further as each color space detects certain features explicit to itself. Finally, the feature maps generated are enlarged and sent back as an input to obtain even smaller features. We show that the proposed model does not need to be trained to defend against a particular type of attack and is inherently more robust to black-box, white-box, and grey-box adversarial attack techniques. In particular, the model is 56.12 percent more robust than compared models in case of white box attacks when the models are not subject to adversarial example training.

Submitted 10 March, 2020; originally announced March 2020.

arXiv:2002.02413 [pdf, other]

doi 10.1007/978-3-030-73973-7_30

StegColNet: Steganalysis based on an ensemble colorspace approach

Authors: Shreyank N Gowda, Chun Yuan

Abstract: Image steganography refers to the process of hiding information inside images. Steganalysis is the process of detecting a steganographic image. We introduce a steganalysis approach that uses an ensemble color space model to obtain a weighted concatenated feature activation map. The concatenated map helps to obtain certain features explicit to each color space. We use a levy-flight grey wolf optimization strategy to reduce the number of features selected in the map. We then use these features to classify the image into one of two classes: whether the given image has secret information stored or not. Extensive experiments have been done on a large scale dataset extracted from the Bossbase dataset. Also, we show that the model can be transferred to different datasets and perform extensive experiments on a mixture of datasets. Our results show that the proposed approach outperforms the recent state of the art deep learning steganalytical approaches by 2.32 percent on average for 0.2 bits per channel (bpc) and 1.87 percent on average for 0.4 bpc.

Submitted 16 October, 2020; v1 submitted 6 February, 2020; originally announced February 2020.

arXiv:1906.07421 [pdf]

Using colorization as a tool for automatic makeup suggestion

Authors: Shreyank Narayana Gowda

Abstract: Colorization is the method of converting an image in grayscale to a fully color image. There are multiple methods to do the same. Old school methods used machine learning algorithms and optimization techniques to suggest possible colors to use. With advances in the field of deep learning, colorization results have improved consistently with improvements in deep learning architectures. The latest development in the field of deep learning is the emergence of generative adversarial networks (GANs), which are used to generate information and not just predict or classify. As part of this report, 2 architectures of recent papers are reproduced along with a novel architecture being suggested for general colorization. Following this, we propose the use of colorization by generating makeup suggestions automatically on a face. To do this, a dataset consisting of 1000 images has been created. When an image of a person without makeup is sent to the model, the model first converts the image to grayscale and then passes it through the suggested GAN model. The output is a generated makeup suggestion. To develop this model, we need to tweak the general colorization model to deal only with faces of people.

Submitted 18 June, 2019; originally announced June 2019.

arXiv:1902.00267 [pdf, ps, other]

ColorNet: Investigating the importance of color spaces for image classification

Authors: Shreyank N Gowda, Chun Yuan

Abstract: Image classification is a fundamental application in computer vision. Recently, deeper networks and highly connected networks have shown state of the art performance for image classification tasks. Most datasets these days consist of a finite number of color images. These color images are taken as input in the form of RGB images and classification is done without modifying them. We explore the importance of color spaces and show that color spaces (essentially transformations of original RGB images) can significantly affect classification accuracy. Further, we show that certain classes of images are better represented in particular color spaces and for a dataset with a highly varying number of classes such as CIFAR and Imagenet, using a model that considers multiple color spaces within the same model gives excellent levels of accuracy. Also, we show that such a model, where the input is preprocessed into multiple color spaces simultaneously, needs far fewer parameters to obtain high accuracy for classification. For example, our model with 1.75M parameters significantly outperforms DenseNet 100-12 that has 12M parameters and gives results comparable to Densenet-BC-190-40 that has 25.6M parameters for classification of four competitive image classification datasets namely: CIFAR-10, CIFAR-100, SVHN and Imagenet. Our model essentially takes an RGB image as input, simultaneously converts the image into 7 different color spaces and uses these as inputs to individual densenets. We use small and wide densenets to reduce computation overhead and number of hyperparameters required. We obtain significant improvement on current state of the art results on these datasets as well.

Submitted 1 February, 2019; originally announced February 2019.

Journal ref: Asian Conference on Computer Vision 2018

arXiv:1706.07913 [pdf]

doi 10.1007/978-981-10-4859-3_20

Semi-supervised Text Categorization Using Recursive K-means Clustering

Authors: Harsha S. Gowda, Mahamad Suhil, D. S. Guru, Lavanya Narayana Raju

Abstract: In this paper, we present a semi-supervised learning algorithm for classification of text documents. A method of labeling unlabeled text documents is presented. The presented method is based on the divide-and-conquer principle. It uses a recursive K-means algorithm for partitioning both labeled and unlabeled data collections. The K-means algorithm is applied recursively on each partition until a desired level of partitioning is achieved such that each partition contains labeled documents of a single class. Once the desired clusters are obtained, the respective cluster centroids are considered as representatives of the clusters and the nearest neighbor rule is used for classifying an unknown text document. A series of experiments has been conducted to bring out the superiority of the proposed model over other recent state of the art models on the 20Newsgroups dataset.

Submitted 24 June, 2017; originally announced June 2017.

Comments: 11 Pages, 8 Figures, Conference: RTIP2R

arXiv:1706.07912 [pdf]

doi 10.1007/978-981-10-4859-3_19

Cluster Based Symbolic Representation for Skewed Text Categorization

Authors: Lavanya Narayana Raju, Mahamad Suhil, D S Guru, Harsha S Gowda

Abstract: In this work, a problem associated with imbalanced text corpora is addressed. A method of converting an imbalanced text corpus into a balanced one is presented. The presented method employs a clustering algorithm for conversion. Initially, to avoid the curse of dimensionality, an effective representation scheme based on a term class relevancy measure is adopted, which drastically reduces the dimension to the number of classes in the corpus. Subsequently, the samples of larger sized classes are grouped into a number of subclasses of smaller sizes to make the entire corpus balanced. Each subclass is then given a single symbolic vector representation by the use of interval valued features. This symbolic representation, in addition to being compact, helps in reducing the space requirement and also the classification time. The superiority of the proposed model has been empirically demonstrated on benchmark datasets, viz., Reuters 21578 and TDT2. Further, it has been compared against several other existing contemporary models, including a model based on support vector machines. The comparative analysis indicates that the proposed model outperforms the other existing models.

Submitted 24 June, 2017; originally announced June 2017.

Comments: 14 Pages, 15 Figures, 1 Table, Conference: RTIP2R

Showing 1–44 of 44 results for author: Gowda, S