Search | arXiv e-print repository

ActAnywhere: Subject-Aware Video Background Generation

Authors: Boxiao Pan, Zhan Xu, Chun-Hao Paul Huang, Krishna Kumar Singh, Yang Zhou, Leonidas J. Guibas, Jimei Yang

Abstract: Generating video background that tailors to foreground subject motion is an important problem for the movie industry and visual effects community. This task involves synthesizing background that aligns with the motion and appearance of the foreground subject, while also complies with the artist's creative intention. We introduce ActAnywhere, a generative model that automates this process which tra… ▽ More Generating video background that tailors to foreground subject motion is an important problem for the movie industry and visual effects community. This task involves synthesizing background that aligns with the motion and appearance of the foreground subject, while also complies with the artist's creative intention. We introduce ActAnywhere, a generative model that automates this process which traditionally requires tedious manual efforts. Our model leverages the power of large-scale video diffusion models, and is specifically tailored for this task. ActAnywhere takes a sequence of foreground subject segmentation as input and an image that describes the desired scene as condition, to produce a coherent video with realistic foreground-background interactions while adhering to the condition frame. We train our model on a large-scale dataset of human-scene interaction videos. Extensive evaluations demonstrate the superior performance of our model, significantly outperforming baselines. Moreover, we show that ActAnywhere generalizes to diverse out-of-distribution samples, including non-human subjects. Please visit our project webpage at https://actanywhere.github.io. △ Less

Submitted 19 January, 2024; originally announced January 2024.

arXiv:2312.14985 [pdf, other]

UniHuman: A Unified Model for Editing Human Images in the Wild

Authors: Nannan Li, Qing Liu, Krishna Kumar Singh, Yilin Wang, Jianming Zhang, Bryan A. Plummer, Zhe Lin

Abstract: Human image editing includes tasks like changing a person's pose, their clothing, or editing the image according to a text prompt. However, prior work often tackles these tasks separately, overlooking the benefit of mutual reinforcement from learning them jointly. In this paper, we propose UniHuman, a unified model that addresses multiple facets of human image editing in real-world settings. To en… ▽ More Human image editing includes tasks like changing a person's pose, their clothing, or editing the image according to a text prompt. However, prior work often tackles these tasks separately, overlooking the benefit of mutual reinforcement from learning them jointly. In this paper, we propose UniHuman, a unified model that addresses multiple facets of human image editing in real-world settings. To enhance the model's generation quality and generalization capacity, we leverage guidance from human visual encoders and introduce a lightweight pose-warping module that can exploit different pose representations, accommodating unseen textures and patterns. Furthermore, to bridge the disparity between existing human editing benchmarks with real-world data, we curated 400K high-quality human image-text pairs for training and collected 2K human images for out-of-domain testing, both encompassing diverse clothing styles, backgrounds, and age groups. Experiments on both in-domain and out-of-domain test sets demonstrate that UniHuman outperforms task-specific models by a significant margin. In user studies, UniHuman is preferred by the users in an average of 77% of cases. Our project is available at https://github.com/NannanLi999/UniHuman. △ Less

Submitted 31 March, 2024; v1 submitted 22 December, 2023; originally announced December 2023.

Comments: Accepted to CVPR 2024

arXiv:2312.06712 [pdf, other]

Separate-and-Enhance: Compositional Finetuning for Text2Image Diffusion Models

Authors: Zhipeng Bao, Yijun Li, Krishna Kumar Singh, Yu-Xiong Wang, Martial Hebert

Abstract: Despite recent significant strides achieved by diffusion-based Text-to-Image (T2I) models, current systems are still less capable of ensuring decent compositional generation aligned with text prompts, particularly for the multi-object generation. This work illuminates the fundamental reasons for such misalignment, pinpointing issues related to low attention activation scores and mask overlaps. Whi… ▽ More Despite recent significant strides achieved by diffusion-based Text-to-Image (T2I) models, current systems are still less capable of ensuring decent compositional generation aligned with text prompts, particularly for the multi-object generation. This work illuminates the fundamental reasons for such misalignment, pinpointing issues related to low attention activation scores and mask overlaps. While previous research efforts have individually tackled these issues, we assert that a holistic approach is paramount. Thus, we propose two novel objectives, the Separate loss and the Enhance loss, that reduce object mask overlaps and maximize attention scores, respectively. Our method diverges from conventional test-time-adaptation techniques, focusing on finetuning critical parameters, which enhances scalability and generalizability. Comprehensive evaluations demonstrate the superior performance of our model in terms of image realism, text-image alignment, and adaptability, notably outperforming prominent baselines. Ultimately, this research paves the way for T2I diffusion models with enhanced compositional capacities and broader applicability. △ Less

Submitted 31 January, 2024; v1 submitted 10 December, 2023; originally announced December 2023.

arXiv:2307.01425 [pdf, other]

Consistent Multimodal Generation via A Unified GAN Framework

Authors: Zhen Zhu, Yijun Li, Weijie Lyu, Krishna Kumar Singh, Zhixin Shu, Soeren Pirk, Derek Hoiem

Abstract: We investigate how to generate multimodal image outputs, such as RGB, depth, and surface normals, with a single generative model. The challenge is to produce outputs that are realistic, and also consistent with each other. Our solution builds on the StyleGAN3 architecture, with a shared backbone and modality-specific branches in the last layers of the synthesis network, and we propose per-modality… ▽ More We investigate how to generate multimodal image outputs, such as RGB, depth, and surface normals, with a single generative model. The challenge is to produce outputs that are realistic, and also consistent with each other. Our solution builds on the StyleGAN3 architecture, with a shared backbone and modality-specific branches in the last layers of the synthesis network, and we propose per-modality fidelity discriminators and a cross-modality consistency discriminator. In experiments on the Stanford2D3D dataset, we demonstrate realistic and consistent generation of RGB, depth, and normal images. We also show a training recipe to easily extend our pretrained model on a new domain, even with a few pairwise data. We further evaluate the use of synthetically generated RGB and depth pairs for training or fine-tuning depth estimators. Code will be available at https://github.com/jessemelpolio/MultimodalGAN. △ Less

Submitted 3 July, 2023; originally announced July 2023.

Comments: In review

arXiv:2304.14406 [pdf, other]

Putting People in Their Place: Affordance-Aware Human Insertion into Scenes

Authors: Sumith Kulal, Tim Brooks, Alex Aiken, Jiajun Wu, Jimei Yang, Jingwan Lu, Alexei A. Efros, Krishna Kumar Singh

Abstract: We study the problem of inferring scene affordances by presenting a method for realistically inserting people into scenes. Given a scene image with a marked region and an image of a person, we insert the person into the scene while respecting the scene affordances. Our model can infer the set of realistic poses given the scene context, re-pose the reference person, and harmonize the composition. W… ▽ More We study the problem of inferring scene affordances by presenting a method for realistically inserting people into scenes. Given a scene image with a marked region and an image of a person, we insert the person into the scene while respecting the scene affordances. Our model can infer the set of realistic poses given the scene context, re-pose the reference person, and harmonize the composition. We set up the task in a self-supervised fashion by learning to re-pose humans in video clips. We train a large-scale diffusion model on a dataset of 2.4M video clips that produces diverse plausible poses while respecting the scene context. Given the learned human-scene composition, our model can also hallucinate realistic people and scenes when prompted without conditioning and also enables interactive editing. A quantitative evaluation shows that our method synthesizes more realistic human appearance and more natural human-scene interactions than prior work. △ Less

Submitted 27 April, 2023; originally announced April 2023.

Comments: CVPR 2023. Project page with code: https://sumith1896.github.io/affordance-insertion/

arXiv:2302.14368 [pdf, other]

Enhanced Controllability of Diffusion Models via Feature Disentanglement and Realism-Enhanced Sampling Methods

Authors: Wonwoong Cho, Hareesh Ravi, Midhun Harikumar, Vinh Khuc, Krishna Kumar Singh, Jingwan Lu, David I. Inouye, Ajinkya Kale

Abstract: As Diffusion Models have shown promising performance, a lot of efforts have been made to improve the controllability of Diffusion Models. However, how to train Diffusion Models to have the disentangled latent spaces and how to naturally incorporate the disentangled conditions during the sampling process have been underexplored. In this paper, we present a training framework for feature disentangle… ▽ More As Diffusion Models have shown promising performance, a lot of efforts have been made to improve the controllability of Diffusion Models. However, how to train Diffusion Models to have the disentangled latent spaces and how to naturally incorporate the disentangled conditions during the sampling process have been underexplored. In this paper, we present a training framework for feature disentanglement of Diffusion Models (FDiff). We further propose two sampling methods that can boost the realism of our Diffusion Models and also enhance the controllability. Concisely, we train Diffusion Models conditioned on two latent features, a spatial content mask, and a flattened style embedding. We rely on the inductive bias of the denoising process of Diffusion Models to encode pose/layout information in the content feature and semantic/style information in the style feature. Regarding the sampling methods, we first generalize Composable Diffusion Models (GCDM) by breaking the conditional independence assumption to allow for some dependence between conditional inputs, which is shown to be effective in realistic generation in our experiments. Second, we propose timestep-dependent weight scheduling for content and style features to further improve the performance. We also observe better controllability of our proposed methods compared to existing methods in image manipulation and image translation. △ Less

Submitted 23 July, 2024; v1 submitted 28 February, 2023; originally announced February 2023.

Comments: ECCV 2024; Code will be opened after a patent application is granted

arXiv:2302.12764 [pdf, other]

Modulating Pretrained Diffusion Models for Multimodal Image Synthesis

Authors: Cusuh Ham, James Hays, Jingwan Lu, Krishna Kumar Singh, Zhifei Zhang, Tobias Hinz

Abstract: We present multimodal conditioning modules (MCM) for enabling conditional image synthesis using pretrained diffusion models. Previous multimodal synthesis works rely on training networks from scratch or fine-tuning pretrained networks, both of which are computationally expensive for large, state-of-the-art diffusion models. Our method uses pretrained networks but \textit{does not require any updat… ▽ More We present multimodal conditioning modules (MCM) for enabling conditional image synthesis using pretrained diffusion models. Previous multimodal synthesis works rely on training networks from scratch or fine-tuning pretrained networks, both of which are computationally expensive for large, state-of-the-art diffusion models. Our method uses pretrained networks but \textit{does not require any updates to the diffusion network's parameters}. MCM is a small module trained to modulate the diffusion network's predictions during sampling using 2D modalities (e.g., semantic segmentation maps, sketches) that were unseen during the original training of the diffusion model. We show that MCM enables user control over the spatial layout of the image and leads to increased control over the image generation process. Training MCM is cheap as it does not require gradients from the original diffusion net, consists of only $\sim$1$\%$ of the number of parameters of the base diffusion model, and is trained using only a limited number of training examples. We evaluate our method on unconditional and text-conditional models to demonstrate the improved control over the generated images and their alignment with respect to the conditioning inputs. △ Less

Submitted 18 May, 2023; v1 submitted 24 February, 2023; originally announced February 2023.

Comments: SIGGRAPH Conference Proceedings 2023. Project page at https://mcm-diffusion.github.io

arXiv:2302.03027 [pdf, other]

Zero-shot Image-to-Image Translation

Authors: Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, Jun-Yan Zhu

Abstract: Large-scale text-to-image generative models have shown their remarkable ability to synthesize diverse and high-quality images. However, it is still challenging to directly apply these models for editing real images for two reasons. First, it is hard for users to come up with a perfect text prompt that accurately describes every visual detail in the input image. Second, while existing models can in… ▽ More Large-scale text-to-image generative models have shown their remarkable ability to synthesize diverse and high-quality images. However, it is still challenging to directly apply these models for editing real images for two reasons. First, it is hard for users to come up with a perfect text prompt that accurately describes every visual detail in the input image. Second, while existing models can introduce desirable changes in certain regions, they often dramatically alter the input content and introduce unexpected changes in unwanted regions. In this work, we propose pix2pix-zero, an image-to-image translation method that can preserve the content of the original image without manual prompting. We first automatically discover editing directions that reflect desired edits in the text embedding space. To preserve the general content structure after editing, we further propose cross-attention guidance, which aims to retain the cross-attention maps of the input image throughout the diffusion process. In addition, our method does not need additional training for these edits and can directly use the existing pre-trained text-to-image diffusion model. We conduct extensive experiments and show that our method outperforms existing and concurrent works for both real and synthetic image editing. △ Less

Submitted 6 February, 2023; originally announced February 2023.

Comments: website: https://pix2pixzero.github.io/

arXiv:2211.10157 [pdf, other]

UMFuse: Unified Multi View Fusion for Human Editing applications

Authors: Rishabh Jain, Mayur Hemani, Duygu Ceylan, Krishna Kumar Singh, Jingwan Lu, Mausoom Sarkar, Balaji Krishnamurthy

Abstract: Numerous pose-guided human editing methods have been explored by the vision community due to their extensive practical applications. However, most of these methods still use an image-to-image formulation in which a single image is given as input to produce an edited image as output. This objective becomes ill-defined in cases when the target pose differs significantly from the input pose. Existing… ▽ More Numerous pose-guided human editing methods have been explored by the vision community due to their extensive practical applications. However, most of these methods still use an image-to-image formulation in which a single image is given as input to produce an edited image as output. This objective becomes ill-defined in cases when the target pose differs significantly from the input pose. Existing methods then resort to in-painting or style transfer to handle occlusions and preserve content. In this paper, we explore the utilization of multiple views to minimize the issue of missing information and generate an accurate representation of the underlying human model. To fuse knowledge from multiple viewpoints, we design a multi-view fusion network that takes the pose key points and texture from multiple source images and generates an explainable per-pixel appearance retrieval map. Thereafter, the encodings from a separate network (trained on a single-view human reposing task) are merged in the latent space. This enables us to generate accurate, precise, and visually coherent images for different editing tasks. We show the application of our network on two newly proposed tasks - Multi-view human reposing and Mix&Match Human Image generation. Additionally, we study the limitations of single-view editing and scenarios in which multi-view provides a better alternative. △ Less

Submitted 28 March, 2023; v1 submitted 17 November, 2022; originally announced November 2022.

Comments: 8 pages, 6 figures

ACM Class: I.4; I.5

arXiv:2211.08540 [pdf, other]

VGFlow: Visibility guided Flow Network for Human Reposing

Authors: Rishabh Jain, Krishna Kumar Singh, Mayur Hemani, Jingwan Lu, Mausoom Sarkar, Duygu Ceylan, Balaji Krishnamurthy

Abstract: The task of human reposing involves generating a realistic image of a person standing in an arbitrary conceivable pose. There are multiple difficulties in generating perceptually accurate images, and existing methods suffer from limitations in preserving texture, maintaining pattern coherence, respecting cloth boundaries, handling occlusions, manipulating skin generation, etc. These difficulties a… ▽ More The task of human reposing involves generating a realistic image of a person standing in an arbitrary conceivable pose. There are multiple difficulties in generating perceptually accurate images, and existing methods suffer from limitations in preserving texture, maintaining pattern coherence, respecting cloth boundaries, handling occlusions, manipulating skin generation, etc. These difficulties are further exacerbated by the fact that the possible space of pose orientation for humans is large and variable, the nature of clothing items is highly non-rigid, and the diversity in body shape differs largely among the population. To alleviate these difficulties and synthesize perceptually accurate images, we propose VGFlow. Our model uses a visibility-guided flow module to disentangle the flow into visible and invisible parts of the target for simultaneous texture preservation and style manipulation. Furthermore, to tackle distinct body shapes and avoid network artifacts, we also incorporate a self-supervised patch-wise "realness" loss to improve the output. VGFlow achieves state-of-the-art results as observed qualitatively and quantitatively on different image quality metrics (SSIM, LPIPS, FID). △ Less

Submitted 28 March, 2023; v1 submitted 13 November, 2022; originally announced November 2022.

Comments: Selected for publication in CVPR2023

ACM Class: I.4; I.5

arXiv:2211.02707 [pdf, other]

Contrastive Learning for Diverse Disentangled Foreground Generation

Authors: Yuheng Li, Yijun Li, Jingwan Lu, Eli Shechtman, Yong Jae Lee, Krishna Kumar Singh

Abstract: We introduce a new method for diverse foreground generation with explicit control over various factors. Existing image inpainting based foreground generation methods often struggle to generate diverse results and rarely allow users to explicitly control specific factors of variation (e.g., varying the facial identity or expression for face inpainting results). We leverage contrastive learning with… ▽ More We introduce a new method for diverse foreground generation with explicit control over various factors. Existing image inpainting based foreground generation methods often struggle to generate diverse results and rarely allow users to explicitly control specific factors of variation (e.g., varying the facial identity or expression for face inpainting results). We leverage contrastive learning with latent codes to generate diverse foreground results for the same masked input. Specifically, we define two sets of latent codes, where one controls a pre-defined factor (``known''), and the other controls the remaining factors (``unknown''). The sampled latent codes from the two sets jointly bi-modulate the convolution kernels to guide the generator to synthesize diverse results. Experiments demonstrate the superiority of our method over state-of-the-arts in result diversity and generation controllability. △ Less

Submitted 4 November, 2022; originally announced November 2022.

Comments: ECCV 2022

arXiv:2208.01350 [pdf, ps, other]

Application of Blockchain Smart Contracts in E-Commerce and Government

Authors: Kamal Kishor Singh

Abstract: With technological advances and the establishment of e-commerce models, business challenges have shifted to online platforms. The promise of embedding self-executing and autonomous programs into blockchain technologies has attracted increased interest and its use in niche solutions. Using qualitative interviews, this paper sought the opinions of the eleven industry leaders regarding smart contract… ▽ More With technological advances and the establishment of e-commerce models, business challenges have shifted to online platforms. The promise of embedding self-executing and autonomous programs into blockchain technologies has attracted increased interest and its use in niche solutions. Using qualitative interviews, this paper sought the opinions of the eleven industry leaders regarding smart contracts. Findings reveal that the technology is gaining momentum in e-commerce, particularly in financial transfer, record-keeping, real estate, and property management, insurance, mortgage, supply chain management, data storage, authorization of credit, denaturalized intelligence, aviation sector, shipping of products, invoice financing and other domains. The significant benefits of widespread adoption and deployment of smart contracts include their capability to deliver decentralization, efficacy, cost-effectiveness, transparency, speed, autonomy, transparency, privacy, and security, encouraging the emergence of novel business models. Albeit these benefits that revolutionize online transactions, the technology faced multifaceted challenges. Smart technologies are only a decade old and are not advanced in security, transparency, cost-effectiveness, and regulatory framework. Furthermore, organizational, and technical challenges limit their deployment: incompatibility with legacy systems, scalability, bugs, speed, and lack of talent and understanding regarding smart contracts. Consequently, policymakers, developers, researchers, practitioners, and other stakeholders need to invest effort and time to foster the technologies and address pertinent issues to enable the global adoption of smart contracts by small and big businesses. △ Less

Submitted 2 August, 2022; originally announced August 2022.

arXiv:2206.08357 [pdf, other]

Spatially-Adaptive Multilayer Selection for GAN Inversion and Editing

Authors: Gaurav Parmar, Yijun Li, Jingwan Lu, Richard Zhang, Jun-Yan Zhu, Krishna Kumar Singh

Abstract: Existing GAN inversion and editing methods work well for aligned objects with a clean background, such as portraits and animal faces, but often struggle for more difficult categories with complex scene layouts and object occlusions, such as cars, animals, and outdoor images. We propose a new method to invert and edit such complex images in the latent space of GANs, such as StyleGAN2. Our key idea… ▽ More Existing GAN inversion and editing methods work well for aligned objects with a clean background, such as portraits and animal faces, but often struggle for more difficult categories with complex scene layouts and object occlusions, such as cars, animals, and outdoor images. We propose a new method to invert and edit such complex images in the latent space of GANs, such as StyleGAN2. Our key idea is to explore inversion with a collection of layers, spatially adapting the inversion process to the difficulty of the image. We learn to predict the "invertibility" of different image segments and project each segment into a latent layer. Easier regions can be inverted into an earlier layer in the generator's latent space, while more challenging regions can be inverted into a later feature space. Experiments show that our method obtains better inversion results compared to the recent approaches on complex categories, while maintaining downstream editability. Please refer to our project page at https://www.cs.cmu.edu/~SAMInversion. △ Less

Submitted 16 June, 2022; originally announced June 2022.

Comments: CVPR 2022. Github: https://github.com/adobe-research/sam_inversion Website: https://www.cs.cmu.edu/~SAMInversion

arXiv:2206.07331 [pdf]

ETMA: Efficient Transformer Based Multilevel Attention framework for Multimodal Fake News Detection

Authors: Ashima Yadav, Shivani Gaba, Haneef Khan, Ishan Budhiraja, Akansha Singh, Krishan Kant Singh

Abstract: In this new digital era, social media has created a severe impact on the lives of people. In recent times, fake news content on social media has become one of the major challenging problems for society. The dissemination of fabricated and false news articles includes multimodal data in the form of text and images. The previous methods have mainly focused on unimodal analysis. Moreover, for multimo… ▽ More In this new digital era, social media has created a severe impact on the lives of people. In recent times, fake news content on social media has become one of the major challenging problems for society. The dissemination of fabricated and false news articles includes multimodal data in the form of text and images. The previous methods have mainly focused on unimodal analysis. Moreover, for multimodal analysis, researchers fail to keep the unique characteristics corresponding to each modality. This paper aims to overcome these limitations by proposing an Efficient Transformer based Multilevel Attention (ETMA) framework for multimodal fake news detection, which comprises the following components: visual attention-based encoder, textual attention-based encoder, and joint attention-based learning. Each component utilizes the different forms of attention mechanism and uniquely deals with multimodal data to detect fraudulent content. The efficacy of the proposed network is validated by conducting several experiments on four real-world fake news datasets: Twitter, Jruvika Fake News Dataset, Pontes Fake News Dataset, and Risdal Fake News Dataset using multiple evaluation metrics. The results show that the proposed method outperforms the baseline methods on all four datasets. Further, the computation time of the model is also lower than the state-of-the-art methods. △ Less

Submitted 13 March, 2023; v1 submitted 15 June, 2022; originally announced June 2022.

Comments: Accepted for publication in IEEE Transactions on Computational Social Systems

arXiv:2203.14954 [pdf, other]

GIRAFFE HD: A High-Resolution 3D-aware Generative Model

Authors: Yang Xue, Yuheng Li, Krishna Kumar Singh, Yong Jae Lee

Abstract: 3D-aware generative models have shown that the introduction of 3D information can lead to more controllable image generation. In particular, the current state-of-the-art model GIRAFFE can control each object's rotation, translation, scale, and scene camera pose without corresponding supervision. However, GIRAFFE only operates well when the image resolution is low. We propose GIRAFFE HD, a high-res… ▽ More 3D-aware generative models have shown that the introduction of 3D information can lead to more controllable image generation. In particular, the current state-of-the-art model GIRAFFE can control each object's rotation, translation, scale, and scene camera pose without corresponding supervision. However, GIRAFFE only operates well when the image resolution is low. We propose GIRAFFE HD, a high-resolution 3D-aware generative model that inherits all of GIRAFFE's controllable features while generating high-quality, high-resolution images ($512^2$ resolution and above). The key idea is to leverage a style-based neural renderer, and to independently generate the foreground and background to force their disentanglement while imposing consistency constraints to stitch them together to composite a coherent final image. We demonstrate state-of-the-art 3D controllable high-resolution image generation on multiple natural image datasets. △ Less

Submitted 28 March, 2022; originally announced March 2022.

Comments: CVPR 2022

arXiv:2203.07293 [pdf, other]

InsetGAN for Full-Body Image Generation

Authors: Anna Frühstück, Krishna Kumar Singh, Eli Shechtman, Niloy J. Mitra, Peter Wonka, Jingwan Lu

Abstract: While GANs can produce photo-realistic images in ideal conditions for certain domains, the generation of full-body human images remains difficult due to the diversity of identities, hairstyles, clothing, and the variance in pose. Instead of modeling this complex domain with a single GAN, we propose a novel method to combine multiple pretrained GANs, where one GAN generates a global canvas (e.g., h… ▽ More While GANs can produce photo-realistic images in ideal conditions for certain domains, the generation of full-body human images remains difficult due to the diversity of identities, hairstyles, clothing, and the variance in pose. Instead of modeling this complex domain with a single GAN, we propose a novel method to combine multiple pretrained GANs, where one GAN generates a global canvas (e.g., human body) and a set of specialized GANs, or insets, focus on different parts (e.g., faces, shoes) that can be seamlessly inserted onto the global canvas. We model the problem as jointly exploring the respective latent spaces such that the generated images can be combined, by inserting the parts from the specialized generators onto the global canvas, without introducing seams. We demonstrate the setup by combining a full body GAN with a dedicated high-quality face GAN to produce plausible-looking humans. We evaluate our results with quantitative metrics and user studies. △ Less

Submitted 14 March, 2022; originally announced March 2022.

Comments: Project webpage and video available at http://afruehstueck.github.io/insetgan

arXiv:2111.05916 [pdf, other]

Dance In the Wild: Monocular Human Animation with Neural Dynamic Appearance Synthesis

Authors: Tuanfeng Y. Wang, Duygu Ceylan, Krishna Kumar Singh, Niloy J. Mitra

Abstract: Synthesizing dynamic appearances of humans in motion plays a central role in applications such as AR/VR and video editing. While many recent methods have been proposed to tackle this problem, handling loose garments with complex textures and high dynamic motion still remains challenging. In this paper, we propose a video based appearance synthesis method that tackles such challenges and demonstrat… ▽ More Synthesizing dynamic appearances of humans in motion plays a central role in applications such as AR/VR and video editing. While many recent methods have been proposed to tackle this problem, handling loose garments with complex textures and high dynamic motion still remains challenging. In this paper, we propose a video based appearance synthesis method that tackles such challenges and demonstrates high quality results for in-the-wild videos that have not been shown before. Specifically, we adopt a StyleGAN based architecture to the task of person specific video based motion retargeting. We introduce a novel motion signature that is used to modulate the generator weights to capture dynamic appearance changes as well as regularizing the single frame based pose estimates to improve temporal coherency. We evaluate our method on a set of challenging videos and show that our approach achieves state-of-the art performance both qualitatively and quantitatively. △ Less

Submitted 10 November, 2021; originally announced November 2021.

arXiv:2110.04281 [pdf, other]

Collaging Class-specific GANs for Semantic Image Synthesis

Authors: Yuheng Li, Yijun Li, Jingwan Lu, Eli Shechtman, Yong Jae Lee, Krishna Kumar Singh

Abstract: We propose a new approach for high resolution semantic image synthesis. It consists of one base image generator and multiple class-specific generators. The base generator generates high quality images based on a segmentation map. To further improve the quality of different objects, we create a bank of Generative Adversarial Networks (GANs) by separately training class-specific models. This has sev… ▽ More We propose a new approach for high resolution semantic image synthesis. It consists of one base image generator and multiple class-specific generators. The base generator generates high quality images based on a segmentation map. To further improve the quality of different objects, we create a bank of Generative Adversarial Networks (GANs) by separately training class-specific models. This has several benefits including -- dedicated weights for each class; centrally aligned data for each model; additional training data from other sources, potential of higher resolution and quality; and easy manipulation of a specific object in the scene. Experiments show that our approach can generate high quality images in high resolution while having flexibility of object-level control by using class-specific generators. △ Less

Submitted 8 October, 2021; originally announced October 2021.

Comments: ICCV 2021

arXiv:2104.05895 [pdf, other]

IMAGINE: Image Synthesis by Image-Guided Model Inversion

Authors: Pei Wang, Yijun Li, Krishna Kumar Singh, Jingwan Lu, Nuno Vasconcelos

Abstract: We introduce an inversion based method, denoted as IMAge-Guided model INvErsion (IMAGINE), to generate high-quality and diverse images from only a single training sample. We leverage the knowledge of image semantics from a pre-trained classifier to achieve plausible generations via matching multi-level feature representations in the classifier, associated with adversarial training with an external… ▽ More We introduce an inversion based method, denoted as IMAge-Guided model INvErsion (IMAGINE), to generate high-quality and diverse images from only a single training sample. We leverage the knowledge of image semantics from a pre-trained classifier to achieve plausible generations via matching multi-level feature representations in the classifier, associated with adversarial training with an external discriminator. IMAGINE enables the synthesis procedure to simultaneously 1) enforce semantic specificity constraints during the synthesis, 2) produce realistic images without generator training, and 3) give users intuitive control over the generation process. With extensive experimental results, we demonstrate qualitatively and quantitatively that IMAGINE performs favorably against state-of-the-art GAN-based and inversion-based methods, across three different image domains (i.e., objects, scenes, and textures). △ Less

Submitted 12 April, 2021; originally announced April 2021.

Comments: Published in CVPR2021

arXiv:2104.02052 [pdf, other]

Generating Furry Cars: Disentangling Object Shape & Appearance across Multiple Domains

Authors: Utkarsh Ojha, Krishna Kumar Singh, Yong Jae Lee

Abstract: We consider the novel task of learning disentangled representations of object shape and appearance across multiple domains (e.g., dogs and cars). The goal is to learn a generative model that learns an intermediate distribution, which borrows a subset of properties from each domain, enabling the generation of images that did not exist in any domain exclusively. This challenging problem requires an… ▽ More We consider the novel task of learning disentangled representations of object shape and appearance across multiple domains (e.g., dogs and cars). The goal is to learn a generative model that learns an intermediate distribution, which borrows a subset of properties from each domain, enabling the generation of images that did not exist in any domain exclusively. This challenging problem requires an accurate disentanglement of object shape, appearance, and background from each domain, so that the appearance and shape factors from the two domains can be interchanged. We augment an existing approach that can disentangle factors within a single domain but struggles to do so across domains. Our key technical contribution is to represent object appearance with a differentiable histogram of visual features, and to optimize the generator so that two images with the same latent appearance factor but different latent shape factors produce similar histograms. On multiple multi-domain datasets, we demonstrate our method leads to accurate and consistent appearance and shape transfer across domains. △ Less

Submitted 5 April, 2021; originally announced April 2021.

Comments: Camera ready version for ICLR 2021

arXiv:2001.03152 [pdf, other]

Don't Judge an Object by Its Context: Learning to Overcome Contextual Bias

Authors: Krishna Kumar Singh, Dhruv Mahajan, Kristen Grauman, Yong Jae Lee, Matt Feiszli, Deepti Ghadiyaram

Abstract: Existing models often leverage co-occurrences between objects and their context to improve recognition accuracy. However, strongly relying on context risks a model's generalizability, especially when typical co-occurrence patterns are absent. This work focuses on addressing such contextual biases to improve the robustness of the learnt feature representations. Our goal is to accurately recognize a… ▽ More Existing models often leverage co-occurrences between objects and their context to improve recognition accuracy. However, strongly relying on context risks a model's generalizability, especially when typical co-occurrence patterns are absent. This work focuses on addressing such contextual biases to improve the robustness of the learnt feature representations. Our goal is to accurately recognize a category in the absence of its context, without compromising on performance when it co-occurs with context. Our key idea is to decorrelate feature representations of a category from its co-occurring context. We achieve this by learning a feature subspace that explicitly represents categories occurring in the absence of context along side a joint feature subspace that represents both categories and context. Our very simple yet effective method is extensible to two multi-label tasks -- object and attribute classification. On 4 challenging datasets, we demonstrate the effectiveness of our method in reducing contextual bias. △ Less

Submitted 5 May, 2020; v1 submitted 9 January, 2020; originally announced January 2020.

Comments: CVPR 2020

arXiv:1911.11758 [pdf, other]

MixNMatch: Multifactor Disentanglement and Encoding for Conditional Image Generation

Authors: Yuheng Li, Krishna Kumar Singh, Utkarsh Ojha, Yong Jae Lee

Abstract: We present MixNMatch, a conditional generative model that learns to disentangle and encode background, object pose, shape, and texture from real images with minimal supervision, for mix-and-match image generation. We build upon FineGAN, an unconditional generative model, to learn the desired disentanglement and image generator, and leverage adversarial joint image-code distribution matching to lea… ▽ More We present MixNMatch, a conditional generative model that learns to disentangle and encode background, object pose, shape, and texture from real images with minimal supervision, for mix-and-match image generation. We build upon FineGAN, an unconditional generative model, to learn the desired disentanglement and image generator, and leverage adversarial joint image-code distribution matching to learn the latent factor encoders. MixNMatch requires bounding boxes during training to model background, but requires no other supervision. Through extensive experiments, we demonstrate MixNMatch's ability to accurately disentangle, encode, and combine multiple factors for mix-and-match image generation, including sketch2color, cartoon2img, and img2gif applications. Our code/models/demo can be found at https://github.com/Yuheng-Li/MixNMatch △ Less

Submitted 13 April, 2020; v1 submitted 26 November, 2019; originally announced November 2019.

Comments: CVPR 2020 camera ready

Journal ref: CVPR 2020

arXiv:1910.01112 [pdf, other]

Elastic-InfoGAN: Unsupervised Disentangled Representation Learning in Class-Imbalanced Data

Authors: Utkarsh Ojha, Krishna Kumar Singh, Cho-Jui Hsieh, Yong Jae Lee

Abstract: We propose a novel unsupervised generative model that learns to disentangle object identity from other low-level aspects in class-imbalanced data. We first investigate the issues surrounding the assumptions about uniformity made by InfoGAN, and demonstrate its ineffectiveness to properly disentangle object identity in imbalanced data. Our key idea is to make the discovery of the discrete latent fa… ▽ More We propose a novel unsupervised generative model that learns to disentangle object identity from other low-level aspects in class-imbalanced data. We first investigate the issues surrounding the assumptions about uniformity made by InfoGAN, and demonstrate its ineffectiveness to properly disentangle object identity in imbalanced data. Our key idea is to make the discovery of the discrete latent factor of variation invariant to identity-preserving transformations in real images, and use that as a signal to learn the appropriate latent distribution representing object identity. Experiments on both artificial (MNIST, 3D cars, 3D chairs, ShapeNet) and real-world (YouTube-Faces) imbalanced datasets demonstrate the effectiveness of our method in disentangling object identity as a latent factor of variation. △ Less

Submitted 30 October, 2020; v1 submitted 1 October, 2019; originally announced October 2019.

Comments: Camera ready version for NeurIPS 2020

arXiv:1811.11155 [pdf, other]

FineGAN: Unsupervised Hierarchical Disentanglement for Fine-Grained Object Generation and Discovery

Authors: Krishna Kumar Singh, Utkarsh Ojha, Yong Jae Lee

Abstract: We propose FineGAN, a novel unsupervised GAN framework, which disentangles the background, object shape, and object appearance to hierarchically generate images of fine-grained object categories. To disentangle the factors without supervision, our key idea is to use information theory to associate each factor to a latent code, and to condition the relationships between the codes in a specific way… ▽ More We propose FineGAN, a novel unsupervised GAN framework, which disentangles the background, object shape, and object appearance to hierarchically generate images of fine-grained object categories. To disentangle the factors without supervision, our key idea is to use information theory to associate each factor to a latent code, and to condition the relationships between the codes in a specific way to induce the desired hierarchy. Through extensive experiments, we show that FineGAN achieves the desired disentanglement to generate realistic and diverse images belonging to fine-grained classes of birds, dogs, and cars. Using FineGAN's automatically learned features, we also cluster real images as a first attempt at solving the novel problem of unsupervised fine-grained object category discovery. Our code/models/demo can be found at https://github.com/kkanshul/finegan △ Less

Submitted 9 April, 2019; v1 submitted 27 November, 2018; originally announced November 2018.

Journal ref: CVPR 2019 (Oral Presentation)

arXiv:1811.02545 [pdf, other]

Hide-and-Seek: A Data Augmentation Technique for Weakly-Supervised Localization and Beyond

Authors: Krishna Kumar Singh, Hao Yu, Aron Sarmasi, Gautam Pradeep, Yong Jae Lee

Abstract: We propose 'Hide-and-Seek' a general purpose data augmentation technique, which is complementary to existing data augmentation techniques and is beneficial for various visual recognition tasks. The key idea is to hide patches in a training image randomly, in order to force the network to seek other relevant content when the most discriminative content is hidden. Our approach only needs to modify t… ▽ More We propose 'Hide-and-Seek' a general purpose data augmentation technique, which is complementary to existing data augmentation techniques and is beneficial for various visual recognition tasks. The key idea is to hide patches in a training image randomly, in order to force the network to seek other relevant content when the most discriminative content is hidden. Our approach only needs to modify the input image and can work with any network to improve its performance. During testing, it does not need to hide any patches. The main advantage of Hide-and-Seek over existing data augmentation techniques is its ability to improve object localization accuracy in the weakly-supervised setting, and we therefore use this task to motivate the approach. However, Hide-and-Seek is not tied only to the image localization task, and can generalize to other forms of visual input like videos, as well as other recognition tasks like image classification, temporal action localization, semantic segmentation, emotion recognition, age/gender estimation, and person re-identification. We perform extensive experiments to showcase the advantage of Hide-and-Seek on these various visual recognition problems. △ Less

Submitted 6 November, 2018; originally announced November 2018.

Comments: TPAMI submission. This is a journal extension of our ICCV 2017 paper arXiv:1704.04232

arXiv:1804.01077 [pdf, other]

DOCK: Detecting Objects by transferring Common-sense Knowledge

Authors: Krishna Kumar Singh, Santosh Divvala, Ali Farhadi, Yong Jae Lee

Abstract: We present a scalable approach for Detecting Objects by transferring Common-sense Knowledge (DOCK) from source to target categories. In our setting, the training data for the source categories have bounding box annotations, while those for the target categories only have image-level annotations. Current state-of-the-art approaches focus on image-level visual or semantic similarity to adapt a detec… ▽ More We present a scalable approach for Detecting Objects by transferring Common-sense Knowledge (DOCK) from source to target categories. In our setting, the training data for the source categories have bounding box annotations, while those for the target categories only have image-level annotations. Current state-of-the-art approaches focus on image-level visual or semantic similarity to adapt a detector trained on the source categories to the new target categories. In contrast, our key idea is to (i) use similarity not at the image-level, but rather at the region-level, and (ii) leverage richer common-sense (based on attribute, spatial, etc.) to guide the algorithm towards learning the correct detections. We acquire such common-sense cues automatically from readily-available knowledge bases without any extra human effort. On the challenging MS COCO dataset, we find that common-sense knowledge can substantially improve detection performance over existing transfer-learning baselines. △ Less

Submitted 31 July, 2018; v1 submitted 3 April, 2018; originally announced April 2018.

Journal ref: ECCV, 2018

arXiv:1705.09275 [pdf, other]

Who Will Share My Image? Predicting the Content Diffusion Path in Online Social Networks

Authors: Wenjian Hu, Krishna Kumar Singh, Fanyi Xiao, Jinyoung Han, Chen-Nee Chuah, Yong Jae Lee

Abstract: Content popularity prediction has been extensively studied due to its importance and interest for both users and hosts of social media sites like Facebook, Instagram, Twitter, and Pinterest. However, existing work mainly focuses on modeling popularity using a single metric such as the total number of likes or shares. In this work, we propose Diffusion-LSTM, a memory-based deep recurrent network th… ▽ More Content popularity prediction has been extensively studied due to its importance and interest for both users and hosts of social media sites like Facebook, Instagram, Twitter, and Pinterest. However, existing work mainly focuses on modeling popularity using a single metric such as the total number of likes or shares. In this work, we propose Diffusion-LSTM, a memory-based deep recurrent network that learns to recursively predict the entire diffusion path of an image through a social network. By combining user social features and image features, and encoding the diffusion path taken thus far with an explicit memory cell, our model predicts the diffusion path of an image more accurately compared to alternate baselines that either encode only image or social features, or lack memory. By mapping individual users to user prototypes, our model can generalize to new users not seen during training. Finally, we demonstrate our model's capability of generating diffusion trees, and show that the generated trees closely resemble ground-truth trees. △ Less

Submitted 29 November, 2017; v1 submitted 25 May, 2017; originally announced May 2017.

Comments: 9 pages, 6 figures

arXiv:1704.06340 [pdf, other]

Identifying First-person Camera Wearers in Third-person Videos

Authors: Chenyou Fan, Jangwon Lee, Mingze Xu, Krishna Kumar Singh, Yong Jae Lee, David J. Crandall, Michael S. Ryoo

Abstract: We consider scenarios in which we wish to perform joint scene understanding, object tracking, activity recognition, and other tasks in environments in which multiple people are wearing body-worn cameras while a third-person static camera also captures the scene. To do this, we need to establish person-level correspondences across first- and third-person videos, which is challenging because the cam… ▽ More We consider scenarios in which we wish to perform joint scene understanding, object tracking, activity recognition, and other tasks in environments in which multiple people are wearing body-worn cameras while a third-person static camera also captures the scene. To do this, we need to establish person-level correspondences across first- and third-person videos, which is challenging because the camera wearer is not visible from his/her own egocentric video, preventing the use of direct feature matching. In this paper, we propose a new semi-Siamese Convolutional Neural Network architecture to address this novel challenge. We formulate the problem as learning a joint embedding space for first- and third-person videos that considers both spatial- and motion-domain cues. A new triplet loss function is designed to minimize the distance between correct first- and third-person matches while maximizing the distance between incorrect ones. This end-to-end approach performs significantly better than several baselines, in part by learning the first- and third-person features optimized for matching jointly with the distance measure itself. △ Less

Submitted 20 April, 2017; originally announced April 2017.

arXiv:1704.04232 [pdf, other]

Hide-and-Seek: Forcing a Network to be Meticulous for Weakly-supervised Object and Action Localization

Authors: Krishna Kumar Singh, Yong Jae Lee

Abstract: We propose `Hide-and-Seek', a weakly-supervised framework that aims to improve object localization in images and action localization in videos. Most existing weakly-supervised methods localize only the most discriminative parts of an object rather than all relevant parts, which leads to suboptimal performance. Our key idea is to hide patches in a training image randomly, forcing the network to see… ▽ More We propose `Hide-and-Seek', a weakly-supervised framework that aims to improve object localization in images and action localization in videos. Most existing weakly-supervised methods localize only the most discriminative parts of an object rather than all relevant parts, which leads to suboptimal performance. Our key idea is to hide patches in a training image randomly, forcing the network to seek other relevant parts when the most discriminative part is hidden. Our approach only needs to modify the input image and can work with any network designed for object localization. During testing, we do not need to hide any patches. Our Hide-and-Seek approach obtains superior performance compared to previous methods for weakly-supervised object localization on the ILSVRC dataset. We also demonstrate that our framework can be easily extended to weakly-supervised action localization. △ Less

Submitted 23 December, 2017; v1 submitted 13 April, 2017; originally announced April 2017.

Comments: Camera-Ready Version (ICCV 2017)

arXiv:1608.02676 [pdf, other]

End-to-End Localization and Ranking for Relative Attributes

Authors: Krishna Kumar Singh, Yong Jae Lee

Abstract: We propose an end-to-end deep convolutional network to simultaneously localize and rank relative visual attributes, given only weakly-supervised pairwise image comparisons. Unlike previous methods, our network jointly learns the attribute's features, localization, and ranker. The localization module of our network discovers the most informative image region for the attribute, which is then used by… ▽ More We propose an end-to-end deep convolutional network to simultaneously localize and rank relative visual attributes, given only weakly-supervised pairwise image comparisons. Unlike previous methods, our network jointly learns the attribute's features, localization, and ranker. The localization module of our network discovers the most informative image region for the attribute, which is then used by the ranking module to learn a ranking model of the attribute. Our end-to-end framework also significantly speeds up processing and is much faster than previous methods. We show state-of-the-art ranking results on various relative attribute datasets, and our qualitative localization results clearly demonstrate our network's ability to learn meaningful image patches. △ Less

Submitted 8 August, 2016; originally announced August 2016.

Comments: Appears in European Conference on Computer Vision (ECCV), 2016

arXiv:1604.05766 [pdf, other]

Track and Transfer: Watching Videos to Simulate Strong Human Supervision for Weakly-Supervised Object Detection

Authors: Krishna Kumar Singh, Fanyi Xiao, Yong Jae Lee

Abstract: The status quo approach to training object detectors requires expensive bounding box annotations. Our framework takes a markedly different direction: we transfer tracked object boxes from weakly-labeled videos to weakly-labeled images to automatically generate pseudo ground-truth boxes, which replace manually annotated bounding boxes. We first mine discriminative regions in the weakly-labeled imag… ▽ More The status quo approach to training object detectors requires expensive bounding box annotations. Our framework takes a markedly different direction: we transfer tracked object boxes from weakly-labeled videos to weakly-labeled images to automatically generate pseudo ground-truth boxes, which replace manually annotated bounding boxes. We first mine discriminative regions in the weakly-labeled image collection that frequently/rarely appear in the positive/negative images. We then match those regions to videos and retrieve the corresponding tracked object boxes. Finally, we design a hough transform algorithm to vote for the best box to serve as the pseudo GT for each image, and use them to train an object detector. Together, these lead to state-of-the-art weakly-supervised detection results on the PASCAL 2007 and 2010 datasets. △ Less

Submitted 19 April, 2016; originally announced April 2016.

Comments: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016

arXiv:1303.2171 [pdf, ps, other]

CPU and/or GPU: Revisiting the GPU Vs. CPU Myth

Authors: Kishore Kothapalli, Dip Sankar Banerjee, P. J. Narayanan, Surinder Sood, Aman Kumar Bahl, Shashank Sharma, Shrenik Lad, Krishna Kumar Singh, Kiran Matam, Sivaramakrishna Bharadwaj, Rohit Nigam, Parikshit Sakurikar, Aditya Deshpande, Ishan Misra, Siddharth Choudhary, Shubham Gupta

Abstract: Parallel computing using accelerators has gained widespread research attention in the past few years. In particular, using GPUs for general purpose computing has brought forth several success stories with respect to time taken, cost, power, and other metrics. However, accelerator based computing has signifi- cantly relegated the role of CPUs in computation. As CPUs evolve and also offer matching c… ▽ More Parallel computing using accelerators has gained widespread research attention in the past few years. In particular, using GPUs for general purpose computing has brought forth several success stories with respect to time taken, cost, power, and other metrics. However, accelerator based computing has signifi- cantly relegated the role of CPUs in computation. As CPUs evolve and also offer matching computational resources, it is important to also include CPUs in the computation. We call this the hybrid computing model. Indeed, most computer systems of the present age offer a degree of heterogeneity and therefore such a model is quite natural. We reevaluate the claim of a recent paper by Lee et al.(ISCA 2010). We argue that the right question arising out of Lee et al. (ISCA 2010) should be how to use a CPU+GPU platform efficiently, instead of whether one should use a CPU or a GPU exclusively. To this end, we experiment with a set of 13 diverse workloads ranging from databases, image processing, sparse matrix kernels, and graphs. We experiment with two different hybrid platforms: one consisting of a 6-core Intel i7-980X CPU and an NVidia Tesla T10 GPU, and another consisting of an Intel E7400 dual core CPU with an NVidia GT520 GPU. On both these platforms, we show that hybrid solutions offer good advantage over CPU or GPU alone solutions. On both these platforms, we also show that our solutions are 90% resource efficient on average. Our work therefore suggests that hybrid computing can offer tremendous advantages at not only research-scale platforms but also the more realistic scale systems with significant performance gains and resource efficiency to the large scale user community. △ Less

Submitted 9 March, 2013; originally announced March 2013.

Comments: 20 pages

Showing 1–32 of 32 results for author: Singh, K K