-
InstantFamily: Masked Attention for Zero-shot Multi-ID Image Generation
Authors:
Chanran Kim,
Jeongin Lee,
Shichang Joung,
Bongmo Kim,
Yeul-Min Baek
Abstract:
In the field of personalized image generation, the ability to create images preserving concepts has significantly improved. However, creating an image that naturally integrates multiple concepts in a cohesive and visually appealing composition remains challenging. This paper introduces "InstantFamily," an approach that employs a novel masked cross-attention mechanism and a multimodal embedding stack to achieve zero-shot multi-ID image generation. Our method effectively preserves ID as it utilizes global and local features from a pre-trained face recognition model integrated with text conditions. Additionally, our masked cross-attention mechanism enables precise control of multi-ID and composition in the generated images. We demonstrate the effectiveness of InstantFamily through experiments showing its dominance in generating images with multi-ID, while resolving well-known multi-ID generation problems. Additionally, our model achieves state-of-the-art performance in both single-ID and multi-ID preservation. Furthermore, our model scales well, preserving a greater number of IDs than it was originally trained with.
Submitted 30 April, 2024;
originally announced April 2024.
-
Emerging Property of Masked Token for Effective Pre-training
Authors:
Hyesong Choi,
Hunsang Lee,
Seyoung Joung,
Hyejin Park,
Jiyeong Kim,
Dongbo Min
Abstract:
Driven by the success of Masked Language Modeling (MLM), the realm of self-supervised learning for computer vision has been invigorated by the central role of Masked Image Modeling (MIM) in driving recent breakthroughs. Notwithstanding the achievements of MIM across various downstream tasks, its overall efficiency is occasionally hampered by the lengthy duration of the pre-training phase. This paper presents the optimization of masked tokens as a means of addressing this issue. We first explore the inherent properties that a masked token ought to possess. Among these properties, we principally focus on articulating and emphasizing the 'data singularity' attribute inherent in masked tokens. Through a comprehensive analysis of the heterogeneity between masked tokens and visible tokens within pre-trained models, we propose a novel approach termed masked token optimization (MTO), specifically designed to improve model efficiency through weight recalibration and the enhancement of the key property of masked tokens. The proposed method serves as an adaptable solution that seamlessly integrates into any MIM approach that leverages masked tokens. As a result, MTO achieves a considerable improvement in pre-training efficiency, with an approximately 50% reduction in the pre-training epochs required to attain the converged performance of recent approaches.
Submitted 12 April, 2024;
originally announced April 2024.
-
Learning Canonical 3D Object Representation for Fine-Grained Recognition
Authors:
Sunghun Joung,
Seungryong Kim,
Minsu Kim,
Ig-Jae Kim,
Kwanghoon Sohn
Abstract:
We propose a novel framework for fine-grained object recognition that learns to recover object variation in 3D space from a single image, trained on an image collection without using any ground-truth 3D annotation. We accomplish this by representing an object as a composition of 3D shape and its appearance, while eliminating the effect of camera viewpoint, in a canonical configuration. Unlike conventional methods modeling spatial variation in 2D images only, our method is capable of reconfiguring the appearance feature in a canonical 3D space, thus enabling the subsequent object classifier to be invariant under 3D geometric variation. Our representation also allows us to go beyond existing methods, by incorporating 3D shape variation as an additional cue for object recognition. To learn the model without ground-truth 3D annotation, we deploy a differentiable renderer in an analysis-by-synthesis framework. By incorporating 3D shape and appearance jointly in a deep representation, our method learns the discriminative representation of the object and achieves competitive performance on fine-grained image recognition and vehicle re-identification. We also demonstrate that the performance of 3D shape reconstruction is improved by learning fine-grained shape deformation in a boosting manner.
Submitted 10 August, 2021;
originally announced August 2021.
-
Cross-Domain Grouping and Alignment for Domain Adaptive Semantic Segmentation
Authors:
Minsu Kim,
Sunghun Joung,
Seungryong Kim,
JungIn Park,
Ig-Jae Kim,
Kwanghoon Sohn
Abstract:
Existing techniques to adapt semantic segmentation networks across the source and target domains within deep convolutional neural networks (CNNs) deal with all the samples from the two domains in a global or category-aware manner. They do not consider an inter-class variation within the target domain itself or estimated category, which limits their ability to encode domains having a multi-modal data distribution. To overcome this limitation, we introduce a learnable clustering module and a novel domain adaptation framework called cross-domain grouping and alignment. To cluster the samples across domains with the aim of maximizing the domain alignment without forgetting precise segmentation ability on the source domain, we present two loss functions, in particular, for encouraging semantic consistency and orthogonality among the clusters. We also present a loss to solve the class imbalance problem, which is another limitation of previous methods. Our experiments show that our method consistently boosts the adaptation performance in semantic segmentation, outperforming the state of the art on various domain adaptation settings.
Submitted 17 December, 2020; v1 submitted 15 December, 2020;
originally announced December 2020.
-
Cylindrical Convolutional Networks for Joint Object Detection and Viewpoint Estimation
Authors:
Sunghun Joung,
Seungryong Kim,
Hanjae Kim,
Minsu Kim,
Ig-Jae Kim,
Junghyun Cho,
Kwanghoon Sohn
Abstract:
Existing techniques to encode spatial invariance within deep convolutional neural networks only model 2D transformation fields. This does not account for the fact that objects in a 2D space are a projection of 3D ones, and thus they have limited ability to handle severe object viewpoint changes. To overcome this limitation, we introduce a learnable module, cylindrical convolutional networks (CCNs), that exploits a cylindrical representation of a convolutional kernel defined in the 3D space. CCNs extract a view-specific feature through a view-specific convolutional kernel to predict object category scores at each viewpoint. With the view-specific feature, we simultaneously determine object category and viewpoints using the proposed sinusoidal soft-argmax module. Our experiments demonstrate the effectiveness of the cylindrical convolutional networks on joint object detection and viewpoint estimation.
Submitted 25 March, 2020;
originally announced March 2020.
-
Deep neural network Grad-Shafranov solver constrained with measured magnetic signals
Authors:
Semin Joung,
Jaewook Kim,
Sehyun Kwak,
J. G. Bak,
S. G. Lee,
H. S. Han,
H. S. Kim,
Geunho Lee,
Daeho Kwon,
Y. -c. Ghim
Abstract:
A neural network solving the Grad-Shafranov equation constrained with measured magnetic signals to reconstruct magnetic equilibria in real time is developed. The database created to optimize the neural network's free parameters contains off-line EFIT results as the output of the network from 1,118 KSTAR experimental discharges of two different campaigns. Input data to the network consist of magnetic signals measured by a Rogowski coil (plasma current), magnetic pick-up coils (normal and tangential components of magnetic fields) and flux loops (poloidal magnetic fluxes). The developed neural networks fully reconstruct not only the poloidal flux function $\psi\left(R, Z\right)$ but also the toroidal current density function $j_\phi\left(R, Z\right)$ with off-line EFIT quality. To preserve robustness of the networks against a few missing input data, an imputation scheme is utilized to eliminate the need for additional training sets covering the large number of possible combinations of missing inputs.
Submitted 7 November, 2019;
originally announced November 2019.