Search | arXiv e-print repository

LoSA: Long-Short-range Adapter for Scaling End-to-End Temporal Action Localization

Authors: Akshita Gupta, Gaurav Mittal, Ahmed Magooda, Ye Yu, Graham W. Taylor, Mei Chen

Abstract: Temporal Action Localization (TAL) involves localizing and classifying action snippets in an untrimmed video. The emergence of large video foundation models has led RGB-only video backbones to outperform previous methods needing both RGB and optical flow modalities. Leveraging these large models is often limited to training only the TAL head due to the prohibitively large GPU memory required to ad… ▽ More Temporal Action Localization (TAL) involves localizing and classifying action snippets in an untrimmed video. The emergence of large video foundation models has led RGB-only video backbones to outperform previous methods needing both RGB and optical flow modalities. Leveraging these large models is often limited to training only the TAL head due to the prohibitively large GPU memory required to adapt the video backbone for TAL. To overcome this limitation, we introduce LoSA, the first memory-and-parameter-efficient backbone adapter designed specifically for TAL to handle untrimmed videos. LoSA specializes for TAL by introducing Long-Short-range Adapters that adapt the intermediate layers of the video backbone over different temporal ranges. These adapters run parallel to the video backbone to significantly reduce memory footprint. LoSA also includes Long-Short-range Gated Fusion that strategically combines the output of these adapters from the video backbone layers to enhance the video features provided to the TAL head. Experiments show that LoSA significantly outperforms all existing methods on standard TAL benchmarks, THUMOS-14 and ActivityNet-v1.3, by scaling end-to-end backbone adaptation to billion-parameter-plus models like VideoMAEv2~(ViT-g) and leveraging them beyond head-only transfer learning. △ Less

Submitted 6 August, 2024; v1 submitted 1 April, 2024; originally announced April 2024.

Comments: Under submission

arXiv:2402.18085 [pdf, other]

AI-assisted Tagging of Deepfake Audio Calls using Challenge-Response

Authors: Govind Mittal, Arthur Jakobsson, Kelly O. Marshall, Chinmay Hegde, Nasir Memon

Abstract: Scammers are aggressively leveraging AI voice-cloning technology for social engineering attacks, a situation significantly worsened by the advent of audio Real-time Deepfakes (RTDFs). RTDFs can clone a target's voice in real-time over phone calls, making these interactions highly interactive and thus far more convincing. Our research confidently addresses the gap in the existing literature on deep… ▽ More Scammers are aggressively leveraging AI voice-cloning technology for social engineering attacks, a situation significantly worsened by the advent of audio Real-time Deepfakes (RTDFs). RTDFs can clone a target's voice in real-time over phone calls, making these interactions highly interactive and thus far more convincing. Our research confidently addresses the gap in the existing literature on deepfake detection, which has largely been ineffective against RTDF threats. We introduce a robust challenge-response-based method to detect deepfake audio calls, pioneering a comprehensive taxonomy of audio challenges. Our evaluation pitches 20 prospective challenges against a leading voice-cloning system. We have compiled a novel open-source challenge dataset with contributions from 100 smartphone and desktop users, yielding 18,600 original and 1.6 million deepfake samples. Through rigorous machine and human evaluations of this dataset, we achieved a deepfake detection rate of 86% and an 80% AUC score, respectively. Notably, utilizing a set of 11 challenges significantly enhances detection capabilities. Our findings reveal that combining human intuition with machine precision offers complementary advantages. Consequently, we have developed an innovative human-AI collaborative system that melds human discernment with algorithmic accuracy, boosting final joint accuracy to 82.9%. This system highlights the significant advantage of AI-assisted pre-screening in call verification processes. Samples can be heard at https://mittalgovind.github.io/autch-samples/ △ Less

Submitted 28 February, 2024; originally announced February 2024.

Comments: Dataset will be made public by end of March 2024

arXiv:2308.01508 [pdf, other]

Circumventing Concept Erasure Methods For Text-to-Image Generative Models

Authors: Minh Pham, Kelly O. Marshall, Niv Cohen, Govind Mittal, Chinmay Hegde

Abstract: Text-to-image generative models can produce photo-realistic images for an extremely broad range of concepts, and their usage has proliferated widely among the general public. On the flip side, these models have numerous drawbacks, including their potential to generate images featuring sexually explicit content, mirror artistic styles without permission, or even hallucinate (or deepfake) the likene… ▽ More Text-to-image generative models can produce photo-realistic images for an extremely broad range of concepts, and their usage has proliferated widely among the general public. On the flip side, these models have numerous drawbacks, including their potential to generate images featuring sexually explicit content, mirror artistic styles without permission, or even hallucinate (or deepfake) the likenesses of celebrities. Consequently, various methods have been proposed in order to "erase" sensitive concepts from text-to-image models. In this work, we examine five recently proposed concept erasure methods, and show that targeted concepts are not fully excised from any of these methods. Specifically, we leverage the existence of special learned word embeddings that can retrieve "erased" concepts from the sanitized models with no alterations to their weights. Our results highlight the brittleness of post hoc concept erasure methods, and call into question their use in the algorithmic toolkit for AI safety. △ Less

Submitted 8 October, 2023; v1 submitted 2 August, 2023; originally announced August 2023.

arXiv:2307.12935 [pdf, other]

Rule By Example: Harnessing Logical Rules for Explainable Hate Speech Detection

Authors: Christopher Clarke, Matthew Hall, Gaurav Mittal, Ye Yu, Sandra Sajeev, Jason Mars, Mei Chen

Abstract: Classic approaches to content moderation typically apply a rule-based heuristic approach to flag content. While rules are easily customizable and intuitive for humans to interpret, they are inherently fragile and lack the flexibility or robustness needed to moderate the vast amount of undesirable content found online today. Recent advances in deep learning have demonstrated the promise of using hi… ▽ More Classic approaches to content moderation typically apply a rule-based heuristic approach to flag content. While rules are easily customizable and intuitive for humans to interpret, they are inherently fragile and lack the flexibility or robustness needed to moderate the vast amount of undesirable content found online today. Recent advances in deep learning have demonstrated the promise of using highly effective deep neural models to overcome these challenges. However, despite the improved performance, these data-driven models lack transparency and explainability, often leading to mistrust from everyday users and a lack of adoption by many platforms. In this paper, we present Rule By Example (RBE): a novel exemplar-based contrastive learning approach for learning from logical rules for the task of textual content moderation. RBE is capable of providing rule-grounded predictions, allowing for more explainable and customizable predictions compared to typical deep learning-based approaches. We demonstrate that our approach is capable of learning rich rule embedding representations using only a few data examples. Experimental results on 3 popular hate speech classification datasets show that RBE is able to outperform state-of-the-art deep learning classifiers as well as the use of rules in both supervised and unsupervised settings while providing explainable model predictions via rule-grounding. △ Less

Submitted 24 July, 2023; originally announced July 2023.

Comments: ACL 2023 Main Conference

arXiv:2307.08585 [pdf, other]

Identity-Preserving Aging of Face Images via Latent Diffusion Models

Authors: Sudipta Banerjee, Govind Mittal, Ameya Joshi, Chinmay Hegde, Nasir Memon

Abstract: The performance of automated face recognition systems is inevitably impacted by the facial aging process. However, high quality datasets of individuals collected over several years are typically small in scale. In this work, we propose, train, and validate the use of latent text-to-image diffusion models for synthetically aging and de-aging face images. Our models succeed with few-shot training, a… ▽ More The performance of automated face recognition systems is inevitably impacted by the facial aging process. However, high quality datasets of individuals collected over several years are typically small in scale. In this work, we propose, train, and validate the use of latent text-to-image diffusion models for synthetically aging and de-aging face images. Our models succeed with few-shot training, and have the added benefit of being controllable via intuitive textual prompting. We observe high degrees of visual realism in the generated images while maintaining biometric fidelity measured by commonly used metrics. We evaluate our method on two benchmark datasets (CelebA and AgeDB) and observe significant reduction (~44%) in the False Non-Match Rate compared to existing state-of the-art baselines. △ Less

Submitted 17 July, 2023; originally announced July 2023.

Comments: Accepted to appear in International Joint Conference in Biometrics (IJCB) 2023

arXiv:2306.16410 [pdf, other]

Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language

Authors: William Berrios, Gautam Mittal, Tristan Thrush, Douwe Kiela, Amanpreet Singh

Abstract: We propose LENS, a modular approach for tackling computer vision problems by leveraging the power of large language models (LLMs). Our system uses a language model to reason over outputs from a set of independent and highly descriptive vision modules that provide exhaustive information about an image. We evaluate the approach on pure computer vision settings such as zero- and few-shot object recog… ▽ More We propose LENS, a modular approach for tackling computer vision problems by leveraging the power of large language models (LLMs). Our system uses a language model to reason over outputs from a set of independent and highly descriptive vision modules that provide exhaustive information about an image. We evaluate the approach on pure computer vision settings such as zero- and few-shot object recognition, as well as on vision and language problems. LENS can be applied to any off-the-shelf LLM and we find that the LLMs with LENS perform highly competitively with much bigger and much more sophisticated systems, without any multimodal training whatsoever. We open-source our code at https://github.com/ContextualAI/lens and provide an interactive demo. △ Less

Submitted 28 June, 2023; originally announced June 2023.

arXiv:2305.10547 [pdf, other]

Rethinking Multimodal Content Moderation from an Asymmetric Angle with Mixed-modality

Authors: Jialin Yuan, Ye Yu, Gaurav Mittal, Matthew Hall, Sandra Sajeev, Mei Chen

Abstract: There is a rapidly growing need for multimodal content moderation (CM) as more and more content on social media is multimodal in nature. Existing unimodal CM systems may fail to catch harmful content that crosses modalities (e.g., memes or videos), which may lead to severe consequences. In this paper, we present a novel CM model, Asymmetric Mixed-Modal Moderation (AM3), to target multimodal and un… ▽ More There is a rapidly growing need for multimodal content moderation (CM) as more and more content on social media is multimodal in nature. Existing unimodal CM systems may fail to catch harmful content that crosses modalities (e.g., memes or videos), which may lead to severe consequences. In this paper, we present a novel CM model, Asymmetric Mixed-Modal Moderation (AM3), to target multimodal and unimodal CM tasks. Specifically, to address the asymmetry in semantics between vision and language, AM3 has a novel asymmetric fusion architecture that is designed to not only fuse the common knowledge in both modalities but also to exploit the unique information in each modality. Unlike previous works that focus on representing the two modalities into a similar feature space while overlooking the intrinsic difference between the information conveyed in multimodality and in unimodality (asymmetry in modalities), we propose a novel cross-modality contrastive loss to learn the unique knowledge that only appears in multimodality. This is critical as some harmful intent may only be conveyed through the intersection of both modalities. With extensive experiments, we show that AM3 outperforms all existing state-of-the-art methods on both multimodal and unimodal CM benchmarks. △ Less

Submitted 13 December, 2023; v1 submitted 17 May, 2023; originally announced May 2023.

Comments: Accepted at WACV 2024

arXiv:2210.06186 [pdf, other]

GOTCHA: Real-Time Video Deepfake Detection via Challenge-Response

Authors: Govind Mittal, Chinmay Hegde, Nasir Memon

Abstract: With the rise of AI-enabled Real-Time Deepfakes (RTDFs), the integrity of online video interactions has become a growing concern. RTDFs have now made it feasible to replace an imposter's face with their victim in live video interactions. Such advancement in deepfakes also coaxes detection to rise to the same standard. However, existing deepfake detection techniques are asynchronous and hence ill-s… ▽ More With the rise of AI-enabled Real-Time Deepfakes (RTDFs), the integrity of online video interactions has become a growing concern. RTDFs have now made it feasible to replace an imposter's face with their victim in live video interactions. Such advancement in deepfakes also coaxes detection to rise to the same standard. However, existing deepfake detection techniques are asynchronous and hence ill-suited for RTDFs. To bridge this gap, we propose a challenge-response approach that establishes authenticity in live settings. We focus on talking-head style video interaction and present a taxonomy of challenges that specifically target inherent limitations of RTDF generation pipelines. We evaluate representative examples from the taxonomy by collecting a unique dataset comprising eight challenges, which consistently and visibly degrades the quality of state-of-the-art deepfake generators. These results are corroborated both by humans and a new automated scoring function, leading to 88.6% and 80.1% AUC, respectively. The findings underscore the promising potential of challenge-response systems for explainable and scalable real-time deepfake detection in practical scenarios. We provide access to data and code at \url{https://github.com/mittalgovind/GOTCHA-Deepfakes}. △ Less

Submitted 23 May, 2024; v1 submitted 12 October, 2022; originally announced October 2022.

Comments: Accepted to IEEE Euro S&P 2024

arXiv:2208.01159 [pdf, other]

BATMAN: Bilateral Attention Transformer in Motion-Appearance Neighboring Space for Video Object Segmentation

Authors: Ye Yu, Jialin Yuan, Gaurav Mittal, Li Fuxin, Mei Chen

Abstract: Video Object Segmentation (VOS) is fundamental to video understanding. Transformer-based methods show significant performance improvement on semi-supervised VOS. However, existing work faces challenges segmenting visually similar objects in close proximity of each other. In this paper, we propose a novel Bilateral Attention Transformer in Motion-Appearance Neighboring space (BATMAN) for semi-super… ▽ More Video Object Segmentation (VOS) is fundamental to video understanding. Transformer-based methods show significant performance improvement on semi-supervised VOS. However, existing work faces challenges segmenting visually similar objects in close proximity of each other. In this paper, we propose a novel Bilateral Attention Transformer in Motion-Appearance Neighboring space (BATMAN) for semi-supervised VOS. It captures object motion in the video via a novel optical flow calibration module that fuses the segmentation mask with optical flow estimation to improve within-object optical flow smoothness and reduce noise at object boundaries. This calibrated optical flow is then employed in our novel bilateral attention, which computes the correspondence between the query and reference frames in the neighboring bilateral space considering both motion and appearance. Extensive experiments validate the effectiveness of BATMAN architecture by outperforming all existing state-of-the-art on all four popular VOS benchmarks: Youtube-VOS 2019 (85.0%), Youtube-VOS 2018 (85.3%), DAVIS 2017Val/Testdev (86.2%/82.2%), and DAVIS 2016 (92.5%). △ Less

Submitted 7 August, 2022; v1 submitted 1 August, 2022; originally announced August 2022.

Comments: Accepted by ECCV 2022 (Oral)

arXiv:2206.04668 [pdf, other]

GateHUB: Gated History Unit with Background Suppression for Online Action Detection

Authors: Junwen Chen, Gaurav Mittal, Ye Yu, Yu Kong, Mei Chen

Abstract: Online action detection is the task of predicting the action as soon as it happens in a streaming video. A major challenge is that the model does not have access to the future and has to solely rely on the history, i.e., the frames observed so far, to make predictions. It is therefore important to accentuate parts of the history that are more informative to the prediction of the current frame. We… ▽ More Online action detection is the task of predicting the action as soon as it happens in a streaming video. A major challenge is that the model does not have access to the future and has to solely rely on the history, i.e., the frames observed so far, to make predictions. It is therefore important to accentuate parts of the history that are more informative to the prediction of the current frame. We present GateHUB, Gated History Unit with Background Suppression, that comprises a novel position-guided gated cross-attention mechanism to enhance or suppress parts of the history as per how informative they are for current frame prediction. GateHUB further proposes Future-augmented History (FaH) to make history features more informative by using subsequently observed frames when available. In a single unified framework, GateHUB integrates the transformer's ability of long-range temporal modeling and the recurrent model's capacity to selectively encode relevant information. GateHUB also introduces a background suppression objective to further mitigate false positive background frames that closely resemble the action frames. Extensive validation on three benchmark datasets, THUMOS, TVSeries, and HDD, demonstrates that GateHUB significantly outperforms all existing methods and is also more efficient than the existing best work. Furthermore, a flow-free version of GateHUB is able to achieve higher or close accuracy at 2.8x higher frame rate compared to all existing methods that require both RGB and optical flow information for prediction. △ Less

Submitted 9 June, 2022; originally announced June 2022.

Comments: CVPR 2022

arXiv:2201.01441 [pdf, other]

doi 10.1145/3514221.3517885

Balsa: Learning a Query Optimizer Without Expert Demonstrations

Authors: Zongheng Yang, Wei-Lin Chiang, Sifei Luan, Gautam Mittal, Michael Luo, Ion Stoica

Abstract: Query optimizers are a performance-critical component in every database system. Due to their complexity, optimizers take experts months to write and years to refine. In this work, we demonstrate for the first time that learning to optimize queries without learning from an expert optimizer is both possible and efficient. We present Balsa, a query optimizer built by deep reinforcement learning. Bals… ▽ More Query optimizers are a performance-critical component in every database system. Due to their complexity, optimizers take experts months to write and years to refine. In this work, we demonstrate for the first time that learning to optimize queries without learning from an expert optimizer is both possible and efficient. We present Balsa, a query optimizer built by deep reinforcement learning. Balsa first learns basic knowledge from a simple, environment-agnostic simulator, followed by safe learning in real execution. On the Join Order Benchmark, Balsa matches the performance of two expert query optimizers, both open-source and commercial, with two hours of learning, and outperforms them by up to 2.8$\times$ in workload runtime after a few more hours. Balsa thus opens the possibility of automatically learning to optimize in future compute environments where expert-designed optimizers do not exist. △ Less

Submitted 3 May, 2022; v1 submitted 4 January, 2022; originally announced January 2022.

Comments: SIGMOD 2022; code released at: https://github.com/balsa-project/balsa/

arXiv:2110.12606 [pdf, other]

MUSE: Feature Self-Distillation with Mutual Information and Self-Information

Authors: Yu Gong, Ye Yu, Gaurav Mittal, Greg Mori, Mei Chen

Abstract: We present a novel information-theoretic approach to introduce dependency among features of a deep convolutional neural network (CNN). The core idea of our proposed method, called MUSE, is to combine MUtual information and SElf-information to jointly improve the expressivity of all features extracted from different layers in a CNN. We present two variants of the realization of MUSE -- Additive Inf… ▽ More We present a novel information-theoretic approach to introduce dependency among features of a deep convolutional neural network (CNN). The core idea of our proposed method, called MUSE, is to combine MUtual information and SElf-information to jointly improve the expressivity of all features extracted from different layers in a CNN. We present two variants of the realization of MUSE -- Additive Information and Multiplicative Information. Importantly, we argue and empirically demonstrate that MUSE, compared to other feature discrepancy functions, is a more functional proxy to introduce dependency and effectively improve the expressivity of all features in the knowledge distillation framework. MUSE achieves superior performance over a variety of popular architectures and feature discrepancy functions for self-distillation and online distillation, and performs competitively with the state-of-the-art methods for offline distillation. MUSE is also demonstrably versatile that enables it to be easily extended to CNN-based models on tasks other than image classification such as object detection. △ Less

Submitted 24 October, 2021; originally announced October 2021.

Comments: The 32nd British Machine Vision Conference (BMVC 2021)

arXiv:2109.15317 [pdf, other]

Unsupervised Few-Shot Action Recognition via Action-Appearance Aligned Meta-Adaptation

Authors: Jay Patravali, Gaurav Mittal, Ye Yu, Fuxin Li, Mei Chen

Abstract: We present MetaUVFS as the first Unsupervised Meta-learning algorithm for Video Few-Shot action recognition. MetaUVFS leverages over 550K unlabeled videos to train a two-stream 2D and 3D CNN architecture via contrastive learning to capture the appearance-specific spatial and action-specific spatio-temporal video features respectively. MetaUVFS comprises a novel Action-Appearance Aligned Meta-adapt… ▽ More We present MetaUVFS as the first Unsupervised Meta-learning algorithm for Video Few-Shot action recognition. MetaUVFS leverages over 550K unlabeled videos to train a two-stream 2D and 3D CNN architecture via contrastive learning to capture the appearance-specific spatial and action-specific spatio-temporal video features respectively. MetaUVFS comprises a novel Action-Appearance Aligned Meta-adaptation (A3M) module that learns to focus on the action-oriented video features in relation to the appearance features via explicit few-shot episodic meta-learning over unsupervised hard-mined episodes. Our action-appearance alignment and explicit few-shot learner conditions the unsupervised training to mimic the downstream few-shot task, enabling MetaUVFS to significantly outperform all unsupervised methods on few-shot benchmarks. Moreover, unlike previous few-shot action recognition methods that are supervised, MetaUVFS needs neither base-class labels nor a supervised pretrained backbone. Thus, we need to train MetaUVFS just once to perform competitively or sometimes even outperform state-of-the-art supervised methods on popular HMDB51, UCF101, and Kinetics100 few-shot datasets. △ Less

Submitted 11 October, 2021; v1 submitted 30 September, 2021; originally announced September 2021.

Comments: ICCV 2021 (Oral)

arXiv:2103.16091 [pdf, other]

Symbolic Music Generation with Diffusion Models

Authors: Gautam Mittal, Jesse Engel, Curtis Hawthorne, Ian Simon

Abstract: Score-based generative models and diffusion probabilistic models have been successful at generating high-quality samples in continuous domains such as images and audio. However, due to their Langevin-inspired sampling mechanisms, their application to discrete and sequential data has been limited. In this work, we present a technique for training diffusion models on sequential data by parameterizin… ▽ More Score-based generative models and diffusion probabilistic models have been successful at generating high-quality samples in continuous domains such as images and audio. However, due to their Langevin-inspired sampling mechanisms, their application to discrete and sequential data has been limited. In this work, we present a technique for training diffusion models on sequential data by parameterizing the discrete domain in the continuous latent space of a pre-trained variational autoencoder. Our method is non-autoregressive and learns to generate sequences of latent embeddings through the reverse process and offers parallel generation with a constant number of iterative refinement steps. We apply this technique to modeling symbolic music and show strong unconditional generation and post-hoc conditional infilling results compared to autoregressive language models operating over the same continuous embeddings. △ Less

Submitted 25 November, 2021; v1 submitted 30 March, 2021; originally announced March 2021.

Comments: ISMIR 2021

arXiv:2007.08428 [pdf, other]

On Adversarial Robustness: A Neural Architecture Search perspective

Authors: Chaitanya Devaguptapu, Devansh Agarwal, Gaurav Mittal, Pulkit Gopalani, Vineeth N Balasubramanian

Abstract: Adversarial robustness of deep learning models has gained much traction in the last few years. Various attacks and defenses are proposed to improve the adversarial robustness of modern-day deep learning architectures. While all these approaches help improve the robustness, one promising direction for improving adversarial robustness is unexplored, i.e., the complex topology of the neural network a… ▽ More Adversarial robustness of deep learning models has gained much traction in the last few years. Various attacks and defenses are proposed to improve the adversarial robustness of modern-day deep learning architectures. While all these approaches help improve the robustness, one promising direction for improving adversarial robustness is unexplored, i.e., the complex topology of the neural network architecture. In this work, we address the following question: Can the complex topology of a neural network give adversarial robustness without any form of adversarial training?. We answer this empirically by experimenting with different hand-crafted and NAS-based architectures. Our findings show that, for small-scale attacks, NAS-based architectures are more robust for small-scale datasets and simple tasks than hand-crafted architectures. However, as the size of the dataset or the complexity of task increases, hand-crafted architectures are more robust than NAS-based architectures. Our work is the first large-scale study to understand adversarial robustness purely from an architectural perspective. Our study shows that random sampling in the search space of DARTS (a popular NAS method) with simple ensembling can improve the robustness to PGD attack by nearly~12\%. We show that NAS, which is popular for achieving SoTA accuracy, can provide adversarial accuracy as a free add-on without any form of adversarial training. Our results show that leveraging the search space of NAS methods with methods like ensembles can be an excellent way to achieve adversarial robustness without any form of adversarial training. We also introduce a metric that can be used to calculate the trade-off between clean accuracy and adversarial robustness. Code and pre-trained models will be made available at \url{https://github.com/tdchaitanya/nas-robustness} △ Less

Submitted 26 August, 2021; v1 submitted 16 July, 2020; originally announced July 2020.

Comments: Accepted at the Workshop on Adversarial Robustness in Real-World, ICCV-2021 (previous version accepted at four ICLR-21 Workshops)

arXiv:2005.10524 [pdf, other]

HyperSTAR: Task-Aware Hyperparameters for Deep Networks

Authors: Gaurav Mittal, Chang Liu, Nikolaos Karianakis, Victor Fragoso, Mei Chen, Yun Fu

Abstract: While deep neural networks excel in solving visual recognition tasks, they require significant effort to find hyperparameters that make them work optimally. Hyperparameter Optimization (HPO) approaches have automated the process of finding good hyperparameters but they do not adapt to a given task (task-agnostic), making them computationally inefficient. To reduce HPO time, we present HyperSTAR (S… ▽ More While deep neural networks excel in solving visual recognition tasks, they require significant effort to find hyperparameters that make them work optimally. Hyperparameter Optimization (HPO) approaches have automated the process of finding good hyperparameters but they do not adapt to a given task (task-agnostic), making them computationally inefficient. To reduce HPO time, we present HyperSTAR (System for Task Aware Hyperparameter Recommendation), a task-aware method to warm-start HPO for deep neural networks. HyperSTAR ranks and recommends hyperparameters by predicting their performance conditioned on a joint dataset-hyperparameter space. It learns a dataset (task) representation along with the performance predictor directly from raw images in an end-to-end fashion. The recommendations, when integrated with an existing HPO method, make it task-aware and significantly reduce the time to achieve optimal performance. We conduct extensive experiments on 10 publicly available large-scale image classification datasets over two different network architectures, validating that HyperSTAR evaluates 50% less configurations to achieve the best performance compared to existing methods. We further demonstrate that HyperSTAR makes Hyperband (HB) task-aware, achieving the optimal accuracy in just 25% of the budget required by both vanilla HB and Bayesian Optimized HB~(BOHB). △ Less

Submitted 21 May, 2020; originally announced May 2020.

Comments: Published at CVPR 2020 (Oral)

arXiv:1910.04269 [pdf, other]

Spoken Language Identification using ConvNets

Authors: Sarthak, Shikhar Shukla, Govind Mittal

Abstract: Language Identification (LI) is an important first step in several speech processing systems. With a growing number of voice-based assistants, speech LI has emerged as a widely researched field. To approach the problem of identifying languages, we can either adopt an implicit approach where only the speech for a language is present or an explicit one where text is available with its corresponding… ▽ More Language Identification (LI) is an important first step in several speech processing systems. With a growing number of voice-based assistants, speech LI has emerged as a widely researched field. To approach the problem of identifying languages, we can either adopt an implicit approach where only the speech for a language is present or an explicit one where text is available with its corresponding transcript. This paper focuses on an implicit approach due to the absence of transcriptive data. This paper benchmarks existing models and proposes a new attention based model for language identification which uses log-Mel spectrogram images as input. We also present the effectiveness of raw waveforms as features to neural network models for LI tasks. For training and evaluation of models, we classified six languages (English, French, German, Spanish, Russian and Italian) with an accuracy of 95.4% and four languages (English, French, German, Spanish) with an accuracy of 96.3% obtained from the VoxForge dataset. This approach can further be scaled to incorporate more languages. △ Less

Submitted 9 October, 2019; originally announced October 2019.

Comments: 2019 European Conference on Ambient Intelligence

arXiv:1910.00726 [pdf, other]

Animating Face using Disentangled Audio Representations

Authors: Gaurav Mittal, Baoyuan Wang

Abstract: All previous methods for audio-driven talking head generation assume the input audio to be clean with a neutral tone. As we show empirically, one can easily break these systems by simply adding certain background noise to the utterance or changing its emotional tone (to such as sad). To make talking head generation robust to such variations, we propose an explicit audio representation learning fra… ▽ More All previous methods for audio-driven talking head generation assume the input audio to be clean with a neutral tone. As we show empirically, one can easily break these systems by simply adding certain background noise to the utterance or changing its emotional tone (to such as sad). To make talking head generation robust to such variations, we propose an explicit audio representation learning framework that disentangles audio sequences into various factors such as phonetic content, emotional tone, background noise and others. We conduct experiments to validate that conditioned on disentangled content representation, the generated mouth movement by our model is significantly more accurate than previous approaches (without disentangled learning) in the presence of noise and emotional variations. We further demonstrate that our framework is compatible with current state-of-the-art approaches by replacing their original audio learning component with ours. To our best knowledge, this is the first work which improves the performance of talking head generation from disentangled audio representation perspective, which is important for many real-world applications. △ Less

Submitted 1 October, 2019; originally announced October 2019.

Comments: Accepted at WACV 2020 (Winter conference on Applications of Computer Vision)

arXiv:1909.06781 [pdf, ps, other]

A Vector Space Approach to Generate Dynamic Keys for Hill Cipher

Authors: Sunil Kumar, Sandeep Kumar, Gaurav Mittal, Shiv Narain

Abstract: In this paper, a variant of the Hill cipher is proposed. In the classical Hill cipher, an invertible matrix is used for encryption but the scheme is vulnerable to the known-plaintext attack which can reveal the matrix. In our proposed cryptosystem, each plaintext block is encrypted by a new invertible key matrix that thwarts the known-plaintext attack. To generate the invertible matrices which ser… ▽ More In this paper, a variant of the Hill cipher is proposed. In the classical Hill cipher, an invertible matrix is used for encryption but the scheme is vulnerable to the known-plaintext attack which can reveal the matrix. In our proposed cryptosystem, each plaintext block is encrypted by a new invertible key matrix that thwarts the known-plaintext attack. To generate the invertible matrices which serve as the dynamic keys we make use of the vector spaces, randomly generated basis and non-singular linear transformation. Resulting cipher is secure against the known-plaintext attack. △ Less

Submitted 10 May, 2021; v1 submitted 15 September, 2019; originally announced September 2019.

MSC Class: 11T71; 94A60

arXiv:1908.06148 [pdf, other]

FiFTy: Large-scale File Fragment Type Identification using Neural Networks

Authors: Govind Mittal, Pawel Korus, Nasir Memon

Abstract: We present FiFTy, a modern file type identification tool for memory forensics and data carving. In contrast to previous approaches based on hand-crafted features, we design a compact neural network architecture, which uses a trainable embedding space, akin to successful natural language processing models. Our approach dispenses with explicit feature extraction which is a bottleneck in legacy syste… ▽ More We present FiFTy, a modern file type identification tool for memory forensics and data carving. In contrast to previous approaches based on hand-crafted features, we design a compact neural network architecture, which uses a trainable embedding space, akin to successful natural language processing models. Our approach dispenses with explicit feature extraction which is a bottleneck in legacy systems. We evaluate the proposed method on a novel dataset with 75 file types - the most diverse and balanced dataset reported to date. FiFTy consistently outperforms all baselines in terms of speed, accuracy and individual misclassification rates. We achieved an average accuracy of 77.5% with processing speed of approx 38 sec/GB, which is better and more than an order of magnitude faster than the previous state-of-the-art tool - Sceadan (69% at 9 min/GB). Our tool and the corresponding dataset are available publicly online. △ Less

Submitted 7 June, 2020; v1 submitted 16 August, 2019; originally announced August 2019.

Comments: Paper accepted for publication in the IEEE Transactions on Information Forensics and Security

arXiv:1905.03743 [pdf, other]

Interactive Image Generation Using Scene Graphs

Authors: Gaurav Mittal, Shubham Agrawal, Anuva Agarwal, Sushant Mehta, Tanya Marwah

Abstract: Recent years have witnessed some exciting developments in the domain of generating images from scene-based text descriptions. These approaches have primarily focused on generating images from a static text description and are limited to generating images in a single pass. They are unable to generate an image interactively based on an incrementally additive text description (something that is more… ▽ More Recent years have witnessed some exciting developments in the domain of generating images from scene-based text descriptions. These approaches have primarily focused on generating images from a static text description and are limited to generating images in a single pass. They are unable to generate an image interactively based on an incrementally additive text description (something that is more intuitive and similar to the way we describe an image). We propose a method to generate an image incrementally based on a sequence of graphs of scene descriptions (scene-graphs). We propose a recurrent network architecture that preserves the image content generated in previous steps and modifies the cumulative image as per the newly provided scene information. Our model utilizes Graph Convolutional Networks (GCN) to cater to variable-sized scene graphs along with Generative Adversarial image translation networks to generate realistic multi-object images without needing any intermediate supervision during training. We experiment with Coco-Stuff dataset which has multi-object images along with annotations describing the visual scene and show that our model significantly outperforms other approaches on the same dataset in generating visually consistent images for incrementally growing scene graphs. △ Less

Submitted 9 May, 2019; originally announced May 2019.

Comments: Published at ICLR 2019 Deep Generative Models for Highly Structured Data Workshop

arXiv:1708.05980 [pdf, other]

Attentive Semantic Video Generation using Captions

Authors: Tanya Marwah, Gaurav Mittal, Vineeth N. Balasubramanian

Abstract: This paper proposes a network architecture to perform variable length semantic video generation using captions. We adopt a new perspective towards video generation where we allow the captions to be combined with the long-term and short-term dependencies between video frames and thus generate a video in an incremental manner. Our experiments demonstrate our network architecture's ability to disting… ▽ More This paper proposes a network architecture to perform variable length semantic video generation using captions. We adopt a new perspective towards video generation where we allow the captions to be combined with the long-term and short-term dependencies between video frames and thus generate a video in an incremental manner. Our experiments demonstrate our network architecture's ability to distinguish between objects, actions and interactions in a video and combine them to generate videos for unseen captions. The network also exhibits the capability to perform spatio-temporal style transfer when asked to generate videos for a sequence of captions. We also show that the network's ability to learn a latent representation allows it generate videos in an unsupervised manner and perform other tasks such as action recognition. (Accepted in International Conference in Computer Vision (ICCV) 2017) △ Less

Submitted 21 October, 2017; v1 submitted 20 August, 2017; originally announced August 2017.

Journal ref: Presented at ICCV 2017 (International Conference on Computer Vision)

arXiv:1611.10314 [pdf, other]

doi 10.1145/3123266.3123309

Sync-DRAW: Automatic Video Generation using Deep Recurrent Attentive Architectures

Authors: Gaurav Mittal, Tanya Marwah, Vineeth N. Balasubramanian

Abstract: This paper introduces a novel approach for generating videos called Synchronized Deep Recurrent Attentive Writer (Sync-DRAW). Sync-DRAW can also perform text-to-video generation which, to the best of our knowledge, makes it the first approach of its kind. It combines a Variational Autoencoder~(VAE) with a Recurrent Attention Mechanism in a novel manner to create a temporally dependent sequence of… ▽ More This paper introduces a novel approach for generating videos called Synchronized Deep Recurrent Attentive Writer (Sync-DRAW). Sync-DRAW can also perform text-to-video generation which, to the best of our knowledge, makes it the first approach of its kind. It combines a Variational Autoencoder~(VAE) with a Recurrent Attention Mechanism in a novel manner to create a temporally dependent sequence of frames that are gradually formed over time. The recurrent attention mechanism in Sync-DRAW attends to each individual frame of the video in sychronization, while the VAE learns a latent distribution for the entire video at the global level. Our experiments with Bouncing MNIST, KTH and UCF-101 suggest that Sync-DRAW is efficient in learning the spatial and temporal information of the videos and generates frames with high structural integrity, and can generate videos from simple captions on these datasets. (Accepted as oral paper in ACM-Multimedia 2017) △ Less

Submitted 21 October, 2017; v1 submitted 30 November, 2016; originally announced November 2016.

arXiv:1308.1806 [pdf]

A Survey of Current Trends in Distributed, Grid and Cloud Computing

Authors: Gaurav Mittal, Dr. Nishtha Kesswani, Kuldeep Goswami

Abstract: Through the 1990s to 2012 the internet changed the world of computing drastically. It started its journey with parallel computing after it advanced to distributed computing and further to grid computing. And in present scenario it creates a new world which is pronounced as a Cloud Computing [1]. These all three terms have different meanings. Cloud computing is based on backward computing schemes l… ▽ More Through the 1990s to 2012 the internet changed the world of computing drastically. It started its journey with parallel computing after it advanced to distributed computing and further to grid computing. And in present scenario it creates a new world which is pronounced as a Cloud Computing [1]. These all three terms have different meanings. Cloud computing is based on backward computing schemes like cluster computing, distributed computing, grid computing and utility computing. The basic concept of cloud computing is virtualization. It provides virtual hardware and software resources to various requesting programs. This paper gives a detailed description about cluster computing, grid computing and cloud computing and gives an insight of some implementations of the same. We try to list the inspirations for the advent of all these technologies. We also account for some present scenario faults of grid computing and also discuss new cloud computing projects which are being managed by the Government of India for learning. The paper also reviews the existing work and covers (analytically), to some extent, some innovative ideas that can be implemented. △ Less

Submitted 8 August, 2013; originally announced August 2013.

Comments: 6 pages

Journal ref: International Journal of Advanced Studies in Computers, Science & Engineering (IJASCSE Vol 2, Issue 3, 2013)

Showing 1–24 of 24 results for author: Mittal, G