AnyMaker: Zero-shot General Object Customization via Decoupled Dual-Level ID Injection

Lingjie Kong1 Kai Wu2∗ Xiaobin Hu2 Wenhui Han2 Jinlong Peng2
Chengming Xu2 Donghao Luo2 Jiangning Zhang2 Chengjie Wang2 Yanwei Fu1
1Fudan University, Shanghai, China    2Tencent Youtu Lab, Shanghai, China   
https://lingjiekong-fdu.github.io
Equal contribution. Corresponding author.
Abstract

Text-to-image based object customization, which aims to generate images with the same identity (ID) as objects of interest in accordance with text prompts and reference images, has made significant progress. However, recent customization research is dominated by specialized tasks, such as human customization or virtual try-on, leaving a gap in general object customization. To this end, we introduce AnyMaker, an innovative zero-shot object customization framework capable of generating general objects with high ID fidelity and flexible text editability. The efficacy of AnyMaker stems from its novel general ID extraction, dual-level ID injection, and ID-aware decoupling. Specifically, the general ID extraction module extracts sufficient ID information with an ensemble of self-supervised models to tackle the diverse customization tasks for general objects. Then, to provide the diffusion UNet with as much of the extracted ID information as possible without damaging text editability during generation, we design a global-local dual-level ID injection module, in which the global-level semantic ID is injected into text descriptions while the local-level ID details are injected directly into the model through newly added cross-attention modules. In addition, we propose an ID-aware decoupling module to disentangle ID-related information from non-ID elements in the extracted representations for high-fidelity generation of both identity and text descriptions. To validate our approach and advance research on general object customization, we create the first large-scale general ID dataset, the Multi-Category ID-Consistent (MC-IDC) dataset, with 315k text-image samples and 10k categories. Experiments show that AnyMaker presents remarkable performance in general object customization and outperforms specialized methods in corresponding tasks. Code and dataset will be released soon at https://github.com/LingjieKong-fdu/AnyMaker.

Figure 1: Diverse applications of AnyMaker. Given text prompts and a single reference image, AnyMaker can perform a variety of customizations for general objects with high ID fidelity and flexible text editability, without further fine-tuning.

1 Introduction

Recent advancements in diffusion-based text-to-image generative models [1, 2, 3] have enabled ordinary users to create high-fidelity photo-realistic images with minimal expertise. However, generating images of specific user-provided elements, such as a user's own dog, remains a considerable challenge [4, 5, 6, 7] due to the difficulty of preserving the ID of those elements. To address this limitation, object customization, which leverages reference images and textual descriptions to generate images that share the same identity as objects of interest, has become a crucial topic.

Current object customization methods fall into two main categories: object-specific methods and object-agnostic methods. Object-specific methods, such as DreamBooth [8] and DisenBooth [9], inject customized knowledge into pretrained diffusion models by learning additional reference-specific identifiers. While these methods can be adapted to various objects, they rely on time-consuming fine-tuning on a few reference images for every new object, which makes them inefficient. Object-agnostic methods, or zero-shot methods such as PhotoMaker [6] and InstantID [5], on the other hand, employ large-scale training datasets to generate customized images without further fine-tuning. However, these methods predominantly concentrate on specialized tasks that are only applicable to specific domains, such as human customization or virtual try-on. In this paper, we target zero-shot general object customization, which has seldom been studied in previous works.

To this end, we propose AnyMaker, an innovative zero-shot general object customization framework in text-to-image generation tasks, which can achieve high-fidelity customization for general objects given a text prompt and one single reference image without any test-time fine-tuning. The efficacy of AnyMaker comes from its novel general ID extraction, dual-level ID injection, and ID-aware decoupling.

Specifically, in contrast to prior approaches [6, 7, 5] that utilize a single model to extract object representations, we employ an ensemble of self-supervised models to extract more comprehensive information from general objects, ensuring that the extracted ID-related information is sufficient for high-ID-fidelity customization. Then, to integrate the extracted ID information into the diffusion UNet while preserving the text editability of the model, we design a global-local dual-level ID injection strategy. At the global level, we fuse the class tokens of the extracted feature with the class word (such as dog or cat) of the text description, injecting semantic-level ID information into the diffusion UNet without damaging the other elements in the text. At the local level, we inject the patch tokens of the extracted feature into the diffusion UNet through newly added cross-attention modules without compressing the feature dimension, ensuring that the model receives a substantial amount of ID-related details, which is crucial for achieving high-ID-fidelity generation. Furthermore, we find that the ID information is usually coupled with redundant non-ID elements such as the motions and orientations of objects. Such coupling can disrupt both the text editing ability and the ID retention ability of the model, and further hinder the diversity of generation. To address this issue, we design a novel ID-aware decoupling module at the training stage, compelling the model to discern the intrinsic properties of the ID information and to exclude non-ID information from the generation process.

Benefiting from the innovative ID processing modules mentioned above, AnyMaker realizes zero-shot customization for general objects with high ID fidelity and flexible text editability. We demonstrate various applications of AnyMaker in Fig.1, such as realistic human customization, cartoon character customization, virtual try-on, story generation, ID mixing, and so on.

To validate our method and facilitate the development of general object customization, we build the first large-scale general ID dataset, the Multi-Category ID-Consistent (MC-IDC) dataset, which contains 315k text-image samples with over 10k categories. Experiments demonstrate that AnyMaker achieves state-of-the-art generation results on general object customization and even outperforms task-specific methods in specialized domains like human customization and virtual try-on.

In summary, the contributions of our work are as follows:

  • We propose a novel zero-shot general object customization framework in text-to-image generation tasks, which is capable of customizing general objects with high ID fidelity and flexible text editability through innovative general ID extraction, dual-level ID injection and ID-aware decoupling.

  • We construct a large-scale Multi-Category ID-Consistent (MC-IDC) dataset to significantly advance the development of general object customization.

  • We validate the effectiveness of our method through extensive experiments, which demonstrate that AnyMaker delivers desirable performance in general object customization and surpasses specialized methods tailored for specific tasks.

2 Related Work

Diffusion models. Diffusion models have demonstrated their effectiveness for text-to-image generation. DDPM pioneered this field, using a diffusion and denoising process to create a mapping between Gaussian and image distributions. The Latent Diffusion Model (LDM) [10] took this a step further by applying the diffusion model in a latent space rather than pixel space, leading to the creation of text-to-image diffusion models like Stable Diffusion (SD), Midjourney, and DALLE-3 [11]. One main research problem for diffusion models is the network structure. Other than the original UNet, DiT [12] and PixArt-$\alpha$ [3] adopt a transformer as the backbone structure. Other researchers have focused on enhancing image fidelity for complicated concepts. By manipulating the cross-attention maps according to text prompts, [13] ensures that required objects are sufficiently generated in the images. [14] proposed to process the text prompts to decompose all target attributes, which can be further used to guide the denoising procedure with the help of different objective functions. To enhance content consistency, [15] adopted an inner localization method by calculating the cross attention. Our method is built on diffusion models. Different from these basic models for plain image generation, however, we focus on the more challenging topic of zero-shot customization of arbitrary objects.

Image customization. Image customization aims to generate images with the same objects of interest as the reference image [8, 16, 17, 18, 5]. Current methods can be generally grouped into two categories. The first is object-specific models, which finetune the pretrained diffusion models with reference images. Textual inversion [19] learns an extra word embedding for the target concept. DreamBooth [8] and DisenBooth [9] enhance such a pipeline with auxiliary objectives such as a prior loss, which ensures the retention of the model's original generative capabilities throughout the fine-tuning process. The second is object-agnostic models, which can directly process reference images without further fine-tuning. PhotoMaker [6], as a typical zero-shot method, is driven by training a person ID-extraction network, while [20] uses a face model with CLIP to extract ID features. With the help of ControlNet [16], InstantID [5] can control facial expression by leveraging a pre-trained face model to extract face embeddings. MagicCloth [21] utilizes a ControlNet-style garment extractor to encode garment details and fuses them through attention for clothes replacement or preview. In contrast to these works, we try to combine the versatility of object-specific models and the efficiency of object-agnostic models to present a method that can customize any object in a zero-shot manner, which has never been studied before.

Figure 2: Overview of AnyMaker. AnyMaker is a zero-shot text-to-image customization method for general objects, consisting of a general ID extractor, global-local dual-level ID injection, and an ID-aware decoupling module.

3 Method

We first outline the overall structure of AnyMaker in Sec. 3.1. Then, we detail the general ID extraction, dual-level ID injection and ID-aware decoupling in Sec. 3.2, Sec. 3.3 and Sec. 3.4 respectively.

3.1 Overview

Given a reference image and a text prompt, our goal is to generate images that share the same ID as the object of interest in the reference image, while modifying non-ID elements such as motions and backgrounds in accordance with the text prompt. To this end, we propose AnyMaker, an innovative zero-shot text-to-image customization framework for general objects, as summarized in Fig. 2. First, we utilize the general ID extractor to acquire sufficient ID information from the reference image with a segmentation mask of the object of interest. Next, we leverage the dual-level ID injection to endow the diffusion model with ID information at the global and local levels respectively, without damaging the text editing ability of the model. Additionally, we design a novel ID-aware decoupling module to distinguish the ID information from the coupled non-ID elements in the object representation, for higher ID fidelity and better text editing ability.

3.2 General ID Extraction

We utilize an ensemble of self-supervised models, consisting of DINOv2 [22] and MAE [23], to extract the representations of general objects in the reference image.

Previous works [6, 5] targeting customization tasks in specific domains, such as human faces, usually use a single model to extract object representations. However, our experiments demonstrate that a single pre-trained model, such as CLIP [24], DINOv2 [22] or MAE [23], cannot extract sufficient features for high-ID-fidelity generation of general objects, as shown in Fig. 4 and Tab. 2. Specifically, previous works [4, 5] demonstrate that features extracted by CLIP contain insufficient details, making CLIP less practical as an ID feature extractor, while DINOv2 is capable of providing desirable details due to its self-supervised contrastive learning strategy. However, we find that the features extracted by DINOv2 are color-insensitive because it is trained with augmentations such as ColorJitter. Consequently, solely using DINOv2 as an ID extractor would lead to incorrectly generated colors. MAE, on the other hand, is aware of color thanks to its reconstruction-related objective.

Therefore, we leverage DINOv2 and MAE, which compensate for each other, to extract sufficient and comprehensive information to handle the challenging customization task for general objects with various categories and diverse characteristics. To align the extracted features and facilitate the subsequent injection, we adopt a two-layer MLP. The process of extracting general object representations is as follows,

$$f_{dino}^{C}, f_{dino}^{P} = \mathrm{MLP}(F_{dino}(I_{ref} \odot M_{ref})), \tag{1}$$
$$f_{mae}^{C}, f_{mae}^{P} = \mathrm{MLP}(F_{mae}(I_{ref} \odot M_{ref})), \tag{2}$$

where $\odot$ represents element-wise multiplication, and $f_{dino}^{C}, f_{dino}^{P}, f_{mae}^{C}, f_{mae}^{P}$ represent the class tokens and patch tokens extracted by DINOv2 and MAE respectively. $F_{dino}$ and $F_{mae}$ represent the DINOv2 and MAE backbones. $I_{ref}$ represents the reference image, and $M_{ref}$ is the segmentation mask of the object of interest in the reference image. Consequently, tokens derived from our general ID extractor not only offer intricate details from DINOv2, but also have sufficient colour and structural information from MAE.
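To make the data flow of Eqs. (1)-(2) concrete, the following is a minimal NumPy sketch rather than the authors' implementation. The function `toy_backbone` is a hypothetical stand-in for the DINOv2 and MAE encoders (a random linear patch embedding with a mean-pooled class token), and all shapes, weights, and helper names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_backbone(image, proj, patch=4):
    """Hypothetical stand-in for a ViT encoder: returns [class token; patch tokens]."""
    h, w, c = image.shape
    # Cut the image into non-overlapping patch x patch tiles and flatten each tile.
    tiles = image.reshape(h // patch, patch, w // patch, patch, c)
    tiles = tiles.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)
    patch_tokens = tiles @ proj                              # (num_patches, dim)
    class_token = patch_tokens.mean(axis=0, keepdims=True)   # crude global summary
    return np.concatenate([class_token, patch_tokens], axis=0)

def mlp(x, w1, w2):
    """Two-layer MLP with ReLU that aligns features to a shared width."""
    return np.maximum(x @ w1, 0.0) @ w2

def extract_id(image, mask, proj, w1, w2):
    masked = image * mask[..., None]        # I_ref ⊙ M_ref keeps only the object
    tokens = mlp(toy_backbone(masked, proj), w1, w2)
    return tokens[0], tokens[1:]            # class token f^C, patch tokens f^P

def random_params():
    # Assumed sizes: 4*4*3 = 48 pixel values per patch, 32-d backbone, 24-d output.
    return (rng.normal(size=(48, 32)),
            rng.normal(size=(32, 64)),
            rng.normal(size=(64, 24)))

image = rng.random((16, 16, 3))
mask = np.zeros((16, 16))
mask[4:12, 4:12] = 1.0                      # toy segmentation mask of the object

dino_params, mae_params = random_params(), random_params()
f_dino_c, f_dino_p = extract_id(image, mask, *dino_params)
f_mae_c, f_mae_p = extract_id(image, mask, *mae_params)
print(f_dino_c.shape, f_dino_p.shape)
```

Because the mask is applied before encoding, pixels outside the object cannot influence the extracted tokens in this sketch, which is the role $M_{ref}$ plays in Eqs. (1)-(2).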

3.3 Dual-level ID Injection

We propose a dual-level ID injection mechanism to inject the global semantic ID information and local fine-grained ID information into the diffusion UNet respectively, ensuring both flexible text editability and high ID fidelity in the customization process for general objects.

Global ID injection. In text-to-image generation, text prompts are leveraged to guide diffusion models to generate diverse images. Nevertheless, diffusion models are incapable of accurately generating images with a specific object ID when relying solely on the text description. To inject the ID information of the objects of interest into the text, while simultaneously maintaining the editing capability of text prompts for other elements such as the background and motions, we design the global ID injection mechanism. Specifically, we merge the semantic-level class tokens $(f_{dino}^{C}, f_{mae}^{C})$ with the class word in the text embedding, such as "dog" or "cat",

$$f_{fuse}^{C} = \mathrm{MLP}(\mathrm{Concat}(f_{text}^{C}, f_{dino}^{C}, f_{mae}^{C})), \tag{3}$$

where $f_{fuse}^{C}$ is the fused class token, and $f_{text}^{C}$ represents the class word embedding of the object of interest. Then we insert the fused class token $f_{fuse}^{C}$ at the original position of the class word in the text embeddings, obtaining the global-level condition $c_g$ with ID-related information of the object of interest, as shown in Fig. 2. Following this, the global-level condition $c_g$ engages with the cross-attention module in the UNet, as in standard text-to-image models [10], as follows,

$$Z_g = \mathrm{Attention}(Q_g, K_g, V_g), \tag{4}$$

where $Q_g = Z_g W_q$, $K_g = c_g W_k$, $V_g = c_g W_v$ are the query, key, and value of the cross-attention module, in which $W_q$, $W_k$, $W_v$ are the weight matrices of the trainable linear projection layers, and $Z_g$ is the query feature containing the latent input information.
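The token replacement behind Eq. (3) can be sketched as below. This is an illustrative NumPy example under stated assumptions, not the released code: the prompt, the embedding width, the random weights, and the helper name `inject_global` are all hypothetical, and a real system would take `text_emb` from the text encoder of the diffusion model.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, hidden = 8, 16                     # assumed toy sizes

def mlp(x, w1, w2):
    """Two-layer MLP with ReLU."""
    return np.maximum(x @ w1, 0.0) @ w2

def inject_global(text_emb, class_idx, f_dino_c, f_mae_c, w1, w2):
    """Fuse the class-word embedding with both class tokens and write it back."""
    fused_in = np.concatenate([text_emb[class_idx], f_dino_c, f_mae_c])
    f_fuse_c = mlp(fused_in, w1, w2)    # Eq. (3)
    c_g = text_emb.copy()               # every other word embedding stays untouched
    c_g[class_idx] = f_fuse_c           # fused token takes the class word's position
    return c_g

prompt = ["a", "dog", "on", "the", "beach"]
class_idx = prompt.index("dog")
text_emb = rng.normal(size=(len(prompt), dim))        # stand-in text embeddings
f_dino_c, f_mae_c = rng.normal(size=dim), rng.normal(size=dim)
w1 = rng.normal(size=(3 * dim, hidden))
w2 = rng.normal(size=(hidden, dim))

c_g = inject_global(text_emb, class_idx, f_dino_c, f_mae_c, w1, w2)
print(c_g.shape)
```

Only the row of the class word changes, which is how the global condition can carry ID semantics while the embeddings that describe background or motion are left as the text encoder produced them.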

Local ID injection. Semantic-level information in the global-level condition $c_g$ contains relatively few details of the objects, and thus the model cannot handle complex customization tasks for general objects effectively when relying on global ID injection alone. To supply the model with sufficient ID-related details, we propose the local ID injection mechanism. Specifically, we first fuse the extracted patch tokens $f_{dino}^{P}$ and $f_{mae}^{P}$ with a two-layer MLP to densify the contained information into a unified representation $c_l$, as follows,

$$c_l = \mathrm{MLP}(f_{dino}^{P}) + \mathrm{MLP}(f_{mae}^{P}), \tag{5}$$

where $c_l$ serves as the local-level condition for the diffusion UNet. Then, we add one cross-attention module in each upblock of the diffusion UNet in order to inject the ID-related details into the model without compressing the dimensions of $c_l$, feeding the model with as much ID information as possible.

$$Z_l = \mathrm{Attention}(Q_l, K_l, V_l), \tag{6}$$

where $Q_l = Z_l W_q$, $K_l = c_l W_k$, $V_l = c_l W_v$ are the query, key, and value of the cross-attention module, in which $W_q$, $W_k$, $W_v$ are the weight matrices of the trainable linear projection layers, and $Z_l$ is the query feature containing the latent input information.

Through the dual-level ID injection, we provide the model with sufficient ID-related information, both at the semantic level and the detail level, on the premise of not impairing the text editing ability of the model.
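As a rough illustration of Eqs. (5)-(6), the sketch below fuses two sets of patch tokens and feeds the full token sequence to a single-head cross-attention layer. It assumes standard scaled dot-product attention, random weights, and toy sizes. How the attention output is merged back into the UNet features is not specified in this section, so that step is left out.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w1, w2):
    """Two-layer MLP with ReLU."""
    return np.maximum(x @ w1, 0.0) @ w2

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)   # subtract the row max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(z, cond, wq, wk, wv):
    """Single-head cross-attention: queries from latents, keys/values from cond."""
    q, k, v = z @ wq, cond @ wk, cond @ wv
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v, weights

# Assumed toy sizes: 16 patch tokens of width 24, 64 latent positions of width 32.
num_patches, tok_dim, lat_tokens, lat_dim = 16, 24, 64, 32
f_dino_p = rng.normal(size=(num_patches, tok_dim))
f_mae_p = rng.normal(size=(num_patches, tok_dim))
dino_mlp = (rng.normal(size=(tok_dim, 48)), rng.normal(size=(48, tok_dim)))
mae_mlp = (rng.normal(size=(tok_dim, 48)), rng.normal(size=(48, tok_dim)))

# Eq. (5): one MLP per encoder, summed; the number of patch tokens is preserved.
c_l = mlp(f_dino_p, *dino_mlp) + mlp(f_mae_p, *mae_mlp)

z = rng.normal(size=(lat_tokens, lat_dim))              # stand-in UNet latent features
wq = rng.normal(size=(lat_dim, lat_dim))
wk = rng.normal(size=(tok_dim, lat_dim))
wv = rng.normal(size=(tok_dim, lat_dim))
z_l, weights = cross_attention(z, c_l, wq, wk, wv)      # Eq. (6)
print(c_l.shape, z_l.shape)
```

Each latent position attends over all 16 patch tokens here, so no pooling or compression of $c_l$ takes place before the attention step.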

3.4 Training with ID-Aware Decoupling

The ID-related information injected in the diffusion model is usually entangled with redundant non-ID information such as motion, direction and size. Without proper guidance, the model would learn both kinds of information together, which may disrupt the text editing ability of the model and hinder the diversity of generation. For instance, in a reference image depicting "a standing person", the posture "standing" may also be extracted and injected into the diffusion model, which could impede a text prompt wishing to generate "a sitting person". Consequently, the generated images would closely resemble the provided reference images, thereby compromising the text editing ability and generation diversity.

To eliminate the negative impact brought by the coupling phenomenon, we take inspiration from previous works [9, 23] to propose an ID-aware decoupling module, which consists of a decoupling branch and a group of losses based on the normal generation branch at the training stage, guiding the model to discern the intrinsic properties of ID information.

Decoupling branch and normal branch. In the decoupling branch, we first extract the target image features $f_{tar}$ through CLIP [24] to make an image embedding prior [11]. Next, we mask out the ID information contained in the image embedding with a trainable feature mask $m_{id}$, so that the masked image feature $f_{msk} = f_{tar} \odot m_{id}$ contains only non-ID information of the target image, where $\odot$ represents element-wise multiplication. Then, we add the masked image feature $f_{msk}$ to the fused class token $f_{fuse}^{C}$ and feed the result to the diffusion UNet for generation, following the injection scheme described in Sec. 3.3. In the normal branch, by contrast, we inject $f_{fuse}^{C}$ into the model without $f_{msk}$. The generation process of the two branches can be illustrated as follows,

$$f_{msk} \oplus f_{fuse}^{C} \Rightarrow I_{tar}, \qquad f_{fuse}^{C} \Rightarrow I_{tar}, \qquad f_{msk} \perp f_{fuse}^{C}, \tag{7}$$

where $\oplus$ represents element-wise addition, $I_{tar}$ is the target image, and the symbol $\perp$ signifies that the features it connects are independent of each other and have no obvious correlation. Through training in the two branches, we push the non-ID information into the masked image feature $f_{msk}$ and only retain the ID-related information in the fused class token $f_{fuse}^{C}$.
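The two conditioning paths of Eq. (7) can be sketched in a few lines. The example is illustrative only and rests on assumptions not stated in the text: the mask is parameterised as a sigmoid over trainable logits, $f_{tar}$ and $f_{fuse}^{C}$ are taken to have the same width, and random vectors stand in for the CLIP feature and the fused class token.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8                                       # assumed shared feature width

f_tar = rng.normal(size=dim)                  # stand-in for the CLIP target-image feature
f_fuse_c = rng.normal(size=dim)               # stand-in for the fused class token

# Assumption: the trainable mask m_id is a sigmoid over learnable logits,
# so every entry lies in (0, 1) and softly gates one feature channel.
mask_logits = rng.normal(size=dim)
m_id = 1.0 / (1.0 + np.exp(-mask_logits))

f_msk = f_tar * m_id                          # f_msk = f_tar ⊙ m_id

cond_decouple = f_msk + f_fuse_c              # decoupling branch: f_msk ⊕ f_fuse^C
cond_normal = f_fuse_c                        # normal branch: f_fuse^C alone

print(cond_decouple.shape, cond_normal.shape)
```

Both conditions are trained to reconstruct the same target image, so whatever the normal branch cannot explain from $f_{fuse}^{C}$ alone has a dedicated channel in $f_{msk}$.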

Training strategy. We use three losses in our training stage to ensure that the ID-aware decoupling module takes effect. Specifically, we utilize the following denoising losses to train each branch,

$$\mathcal{L}_{decouple} = \|\epsilon - \epsilon_{\theta}(x_t, c_g, f_{msk}, c_l, t)\|^{2}, \tag{8}$$
$$\mathcal{L}_{normal} = \|\epsilon - \epsilon_{\theta}(x_t, c_g, c_l, t)\|^{2}, \tag{9}$$

where $t$ is the randomly sampled time step, $x_t$ represents the noisy latent of the target image, $\epsilon$ is the ground-truth noise, and $\epsilon_{\theta}$ is the predicted noise. $c_g$ and $c_l$ represent the global-level and local-level conditions acquired in Sec. 3.3. Additionally, we design a contrastive loss, ensuring that the masked image feature $f_{msk}$ captures non-ID information and the fused class token $f_{fuse}^{C}$ remains free of it, as below,

$$\mathcal{L}_{contrast} = Sim(f_{fuse}^{C}, f_{msk}), \tag{10}$$

where $Sim$ is instantiated as cosine similarity. In summary, the training loss is the weighted sum of the three loss functions above,

$$\mathcal{L} = \alpha_{1}\mathcal{L}_{normal} + \alpha_{2}\mathcal{L}_{decouple} + \alpha_{3}\mathcal{L}_{contrast}, \tag{11}$$

where $\alpha_{1}, \alpha_{2}, \alpha_{3}$ are hyperparameters that balance the three losses.
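As a minimal sketch of how Eqs. (8) to (11) combine, the snippet below computes the two denoising losses, the cosine-similarity contrastive term, and their weighted sum on plain Python lists. The function names, the flat-vector representation, and the example weights are assumptions made for illustration; the values of the weights used in training are not stated here, and a real implementation would operate on batched tensors.

```python
import math
from typing import Sequence, Tuple


def squared_l2(eps: Sequence[float], eps_pred: Sequence[float]) -> float:
    """Squared L2 distance ||eps - eps_pred||^2, as in Eqs. (8) and (9)."""
    return sum((a - b) ** 2 for a, b in zip(eps, eps_pred))


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity, the instantiation of Sim in Eq. (10)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def total_loss(
    eps: Sequence[float],
    eps_pred_normal: Sequence[float],    # prediction conditioned on (c_g, c_l)
    eps_pred_decouple: Sequence[float],  # prediction conditioned on (c_g, f_msk, c_l)
    f_fuse_cls: Sequence[float],         # fused class token f_fuse^C
    f_msk: Sequence[float],              # masked image feature
    alphas: Tuple[float, float, float] = (1.0, 1.0, 1.0),  # placeholder weights
) -> float:
    """Weighted sum of the three losses, as in Eq. (11)."""
    a1, a2, a3 = alphas
    l_normal = squared_l2(eps, eps_pred_normal)
    l_decouple = squared_l2(eps, eps_pred_decouple)
    l_contrast = cosine_similarity(f_fuse_cls, f_msk)
    return a1 * l_normal + a2 * l_decouple + a3 * l_contrast


loss = total_loss(
    eps=[1.0, 0.0],
    eps_pred_normal=[0.0, 0.0],
    eps_pred_decouple=[1.0, 2.0],
    f_fuse_cls=[1.0, 0.0],
    f_msk=[0.0, 1.0],
    alphas=(1.0, 0.5, 2.0),
)
```

Minimizing the contrastive term pushes the cosine similarity between the fused class token and the masked image feature down, which matches the stated goal of keeping non-ID information out of the ID representation.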

4 MC-IDC Dataset

We create the first large-scale general ID dataset, named the Multi-Category ID-Consistent (MC-IDC) dataset, designed for research on general object customization. We introduce the characteristics and the construction process of the dataset in Sec. 4.1 and Sec. 4.2, respectively. More details of the dataset are given in the Appendix.

4.1 Characteristics

Large-scale dataset. Our dataset consists of approximately 315,000 samples in total with more than 10,000 categories, covering various types such as human faces, animals, clothes, human-made tools, etc. Adequate training samples and diverse categories endow the model with general ID understanding. We introduce the main categories of the dataset and their quantities in the Appendix.

High quality. The average resolution of the images in our dataset is 1039×950, which supports high-quality generation in both the training and inference stages.

Reference-target image pair. Each sample consists of a reference image, a segmentation mask of the object of interest in the reference image, a target image, and a text caption of the target image. During training, the reference image with its segmentation mask provides the model with ID information, the text caption of the target image offers semantic-level guidance for generation, and the target image serves as the ground truth. In each sample, the reference image and the target image depict the same object in different states, which means that each reference-target image pair shares the same ID of the object of interest and has different non-ID elements such as motions and directions. Training with reference-target image pairs endows the model with the ID processing ability while not damaging the diversity of generation.
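The structure of one sample and the role each field plays during training can be sketched as a small record type. The class name, field names, and the use of file paths are hypothetical choices for illustration and are not taken from the released dataset format.

```python
from dataclasses import dataclass
from typing import Dict, Tuple, Union


@dataclass(frozen=True)
class IDSample:
    """One MC-IDC sample (field names are illustrative assumptions)."""
    reference_image: str  # path to the reference image
    reference_mask: str   # path to the mask of the object of interest
    target_image: str     # path to the target image (same object, different state)
    target_caption: str   # text caption of the target image


def training_roles(sample: IDSample) -> Dict[str, Union[str, Tuple[str, str]]]:
    """Map each field to the role described in the text."""
    return {
        "id_condition": (sample.reference_image, sample.reference_mask),
        "text_condition": sample.target_caption,
        "ground_truth": sample.target_image,
    }


sample = IDSample(
    reference_image="panda_ref.jpg",
    reference_mask="panda_ref_mask.png",
    target_image="panda_tgt.jpg",
    target_caption="A panda sitting on the grass.",
)
roles = training_roles(sample)
```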

4.2 Construction Pipeline

The construction pipeline of MC-IDC consists of four steps: data collection, instance detection and segmentation, image pair generation, and text prompt generation.

Data collection. The MC-IDC dataset draws on diverse sources, including web-crawled images, movies, and publicly available datasets. To facilitate the subsequent generation of image pairs, most of the data we collect are video or multi-view data. To ensure the high quality of our dataset, we discard images with a resolution lower than 300×300.

Instance detection and segmentation. For publicly available datasets or web-crawled images that do not provide segmentation annotations, we use an advanced instance segmentation model [25] for instance detection and segmentation. For the movie data, we extract frames at 10 fps and perform object tracking with [26] to establish the ID connection across frames.
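Extracting frames at a fixed rate amounts to choosing which native frame indices to keep. The sketch below shows one simple way to do this; the function name and the nearest-earlier-frame rounding are assumptions, and the decoding and tracking steps themselves are not shown.

```python
import math
from typing import List


def frame_indices_at_fps(
    num_frames: int, native_fps: float, target_fps: float = 10.0
) -> List[int]:
    """Indices of native frames to keep when subsampling to target_fps."""
    if num_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        return []
    # Never step by less than one frame, so no index is repeated.
    step = max(1.0, native_fps / target_fps)
    indices: List[int] = []
    k = 0
    while True:
        # Small epsilon guards against floating-point values like 2.9999999.
        idx = int(math.floor(k * step + 1e-9))
        if idx >= num_frames:
            break
        indices.append(idx)
        k += 1
    return indices


# A 3-second clip recorded at 30 fps yields 30 frames at 10 fps.
kept = frame_indices_at_fps(num_frames=90, native_fps=30.0)
```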

Image pair generation. For most data samples which are from a clip of video or a group of multi-view data, we randomly select two frames containing the same object and perform a random crop around the object to form the reference-target image pair. For the other data samples from the single image dataset, we perform random data augmentations on the single image two times, and take them as the reference image and the target image respectively.
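The pair generation step for video or multi-view data can be sketched as follows, assuming each tracked object is available as a list of (frame index, bounding box) entries. The margin-based crop policy, the `max_margin` parameter, and all names are hypothetical; the exact crop strategy is not specified in the text.

```python
import random
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def random_crop_around(
    box: Box, width: int, height: int, rng: random.Random, max_margin: float = 0.5
) -> Box:
    """Crop window that contains `box`, with a random margin on each side."""
    left, top, right, bottom = box
    box_w, box_h = right - left, bottom - top
    new_left = max(0, left - int(rng.uniform(0, max_margin) * box_w))
    new_top = max(0, top - int(rng.uniform(0, max_margin) * box_h))
    new_right = min(width, right + int(rng.uniform(0, max_margin) * box_w))
    new_bottom = min(height, bottom + int(rng.uniform(0, max_margin) * box_h))
    return (new_left, new_top, new_right, new_bottom)


def sample_pair(
    track: List[Tuple[int, Box]], width: int, height: int, rng: random.Random
) -> Tuple[Tuple[int, Box], Tuple[int, Box]]:
    """Pick two distinct frames of the same object and crop around it in each."""
    (ref_frame, ref_box), (tgt_frame, tgt_box) = rng.sample(track, 2)
    reference = (ref_frame, random_crop_around(ref_box, width, height, rng))
    target = (tgt_frame, random_crop_around(tgt_box, width, height, rng))
    return reference, target


rng = random.Random(0)
track = [(0, (40, 40, 80, 90)), (5, (50, 45, 95, 100)), (9, (30, 30, 70, 85))]
reference, target = sample_pair(track, width=200, height=150, rng=rng)
```

Because the two crops come from different frames, the pair shares the object identity while differing in non-ID elements such as pose and viewpoint.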

Text prompt generation. In each image pair, the target image is densely captioned with the state-of-the-art large vision language model [27], serving as the text prompt to guide the generation.

For more details about the MC-IDC, please refer to the Appendix.

Table 1: Quantitative comparison among zero-shot object customization methods in the general domain and in two popular specific domains, namely human customization and virtual try-on.
| Domains | Methods | FID ↓ | CLIP-i ↑ | CLIP-t ↑ | DINO-i ↑ | FaceSim ↑ | DiverSim-i ↓ |
|---|---|---|---|---|---|---|---|
| General objects | IP-Adapter | 70.32 | 77.18 | 28.03 | 44.94 | - | 84.43±0.66 |
| | Ours | 47.09 | 82.16 | 29.27 | 65.13 | - | 74.38±3.99 |
| Human customization | IP-Adapter | 102.69 | 72.17 | 29.32 | 41.44 | 65.51 | - |
| | PhotoMaker | 106.35 | 71.80 | 32.13 | 44.62 | 64.10 | - |
| | InstantID | 113.18 | 75.87 | 32.89 | 49.26 | 63.26 | - |
| | Ours | 86.40 | 79.60 | 30.88 | 57.44 | 78.54 | - |
| Virtual try-on | MagicClothing | 126.09 | 76.53 | 21.40 | 29.10 | - | 89.36±0.40 |
| | IP-Adapter | 104.47 | 81.99 | 25.03 | 59.39 | - | 71.28±5.06 |
| | Ours | 50.65 | 83.82 | 22.42 | 66.24 | - | 71.27±3.63 |

5 Experiment

5.1 Setup

Implementation details. To remain compatible with open communities such as Civitai.com, we employ SD1.5 [28], the most widely used public foundation model, as the backbone in our experiments. Because the resolutions of images from different sources vary, we use the multi-scale training mode proposed in [1]. We employ the same CLIP image encoder as in [28] in the ID-aware decoupling module, and use MAE [23] and DINOv2 [22] to extract ID features. During training, we set the learning rate to 1e-5 with a batch size of 32. We train for about 6 epochs on 32 V100 GPUs, which takes about 30 hours. We apply dropout with a ratio of 0.1 to the global-level condition $c_{g}$ and the local-level condition $c_{l}$, respectively. During inference, we use a 50-step denoising process and set the classifier-free guidance scale to 7.
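The two condition-related settings above, condition dropout during training and classifier-free guidance at inference, can be sketched as follows. Dropping the two conditions independently and representing a dropped condition as `None` are assumptions made for illustration; the guidance formula is the standard classifier-free guidance combination, shown here on flat lists rather than tensors.

```python
import random
from typing import List, Optional, Sequence, Tuple

Condition = Optional[Sequence[float]]


def drop_conditions(
    c_g: Condition, c_l: Condition, rng: random.Random, p: float = 0.1
) -> Tuple[Condition, Condition]:
    """Replace each condition by a null condition (None) with probability p.

    Assumption: the two conditions are dropped independently of each other.
    """
    c_g_out = None if rng.random() < p else c_g
    c_l_out = None if rng.random() < p else c_l
    return c_g_out, c_l_out


def classifier_free_guidance(
    eps_uncond: Sequence[float], eps_cond: Sequence[float], scale: float = 7.0
) -> List[float]:
    """Standard CFG: eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


rng = random.Random(0)
kept = drop_conditions([0.1, 0.2], [0.3], rng, p=0.0)      # nothing is dropped
dropped = drop_conditions([0.1, 0.2], [0.3], rng, p=1.0)   # everything is dropped
guided = classifier_free_guidance([0.0, 1.0], [1.0, 1.0], scale=7.0)
```

Training with occasional null conditions is what makes the unconditional prediction available at inference, so the two settings are used together.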

Competitors. To demonstrate the generality of AnyMaker, we compare it not only with previous work on general object customization, but also with task-specialized methods in two popular customization domains: human customization and virtual try-on. Specifically, we compare AnyMaker with IP-Adapter [7] on zero-shot general object customization. For human customization, we additionally choose the human-specialized methods PhotoMaker [6] and InstantID [5]. For virtual try-on, we compare with the recent state-of-the-art MagicClothing [21]. All of the methods mentioned above are zero-shot and can customize images from a single reference image. Therefore, we do not compare with object-specific methods such as DreamBooth [8] and DisenBooth [9], which typically require 3 to 5 reference images and additional fine-tuning.

Evaluation dataset. Our evaluation dataset contains 1000 text-image samples, including general objects, human data, and virtual try-on data with a ratio of 4:3:3. All samples in the evaluation dataset are randomly selected from the MC-IDC and do not appear during the training stage.
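As a quick consistency check, the 4:3:3 ratio over 1000 samples gives 400 general-object samples and 300 samples each for human data and virtual try-on data, which agrees with the breakdown in the Appendix (50 general-object categories with 8 samples each). The helper below is a hypothetical illustration of that arithmetic.

```python
from typing import List, Sequence


def split_by_ratio(total: int, ratio: Sequence[int]) -> List[int]:
    """Split `total` items according to an integer ratio; requires an exact split."""
    unit, remainder = divmod(total, sum(ratio))
    if remainder != 0:
        raise ValueError("total is not divisible by the sum of the ratio")
    return [unit * part for part in ratio]


general, human, try_on = split_by_ratio(1000, (4, 3, 3))
categories, samples_per_category = 50, 8  # as stated in Appendix A.2
```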

Evaluation metrics. Following PhotoMaker [6], we utilize DINO-i [22] and CLIP-i [29] to measure the ID fidelity and use CLIP-t to measure the prompt fidelity. We leverage FID [30] to assess the generation quality. For human customization, we additionally calculate the face similarity (FaceSim) with FaceNet [31] as commonly done in [6, 5]. Furthermore, we design a novel metric DiverSim-i, which calculates the average DINO similarity among the images generated according to diverse text prompts containing different scenarios. The lower the value of DiverSim-i, the stronger the ability of the model to generate diverse images in line with text prompts.
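One possible reading of DiverSim-i is the mean pairwise similarity among the features of images generated from the same reference under different scenario prompts. The sketch below follows that reading; the use of cosine similarity, averaging over all unordered pairs, and the function names are assumptions, and the feature vectors stand in for DINO image embeddings that would be computed elsewhere.

```python
import math
from itertools import combinations
from statistics import mean
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def diversim(features: Sequence[Sequence[float]]) -> float:
    """Mean pairwise similarity among generated-image features (lower = more diverse)."""
    pairs = list(combinations(features, 2))
    if not pairs:
        raise ValueError("need at least two generated images")
    return mean(cosine_similarity(u, v) for u, v in pairs)


# Toy features: two identical images and one orthogonal image.
score = diversim([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```

A model that ignores the scenario prompts would produce near-identical images and a score close to 1, while a model that follows them would produce a lower score.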

We introduce more details about the experimental setup in the Appendix.

Figure 3: Qualitative comparison on zero-shot general object customization and two specific popular tasks namely human customization and virtual try-on. The AnyMaker exhibits great ID-preserving ability with better text controls and more diverse generations on both general objects and specialized domains.

5.2 Quantitative Analysis

As shown in Tab. 1, AnyMaker outperforms the previous work on general object customization on all metrics, and achieves comparable or better performance than methods tailored for specialized tasks such as human customization and virtual try-on. Specifically, on general objects we achieve higher ID fidelity in terms of CLIP-i and DINO-i, better prompt fidelity in terms of CLIP-t, and higher generation quality as evaluated by FID. In the two specialized domains, AnyMaker obtains the best FID, CLIP-i, and DINO-i, as well as the best FaceSim for human customization, while its CLIP-t is below that of some competitors. Furthermore, our method exhibits better diversity across scenarios, as measured by DiverSim-i.

Table 2: Ablation study on the ID extraction methods.
| Methods | FID ↓ | CLIP-i ↑ | DINO-i ↑ |
|---|---|---|---|
| CLIP | 49.11 | 79.58 | 59.45 |
| DINO | 48.89 | 80.82 | 63.71 |
| MAE | 49.00 | 79.09 | 59.79 |
| Ours | 47.50 | 81.86 | 65.12 |

Table 3: Ablation study on the ID injection methods.
| Methods | FID ↓ | CLIP-i ↑ | DINO-i ↑ |
|---|---|---|---|
| Global | 49.78 | 78.76 | 60.67 |
| Local | 48.66 | 81.24 | 62.89 |
| Ours | 47.50 | 81.86 | 65.12 |

Table 4: Ablation study on the ID-aware decoupling module.
| Methods | FID ↓ | CLIP-i ↑ | DINO-i ↑ |
|---|---|---|---|
| w.o. decoupling | 47.50 | 81.86 | 65.12 |
| with decoupling | 47.09 | 82.16 | 65.13 |

5.3 Qualitative Analysis

AnyMaker exhibits outstanding capabilities for high-quality customization of general objects, and even beats task-specialized methods in specific domains such as human customization and virtual try-on, as shown in Fig. 3. Given a single reference image, our method generates high-ID-fidelity images while editing non-ID elements such as posture or background in accordance with text prompts, which allows it to generate diverse images covering various scenarios and postures. Furthermore, benefiting from the ID-aware decoupling module, our model maintains the ID fidelity of objects while diversifying non-ID elements such as motions and directions relative to the reference images, even without corresponding guidance from text prompts.

Figure 4: Ablation study on different ID extraction methods from the qualitative perspective. Our ID extractor has both the rich details from DINOv2 and the color information from MAE.
Figure 5: Ablation study on whether to add the ID-aware decoupling module from the qualitative perspective. The model with the ID-aware decoupling module demonstrates stronger text control ability, as shown by the blue text.

5.4 Ablation Study

Effectiveness of general ID Extraction. We compare our proposed general ID extractor with DINOv2 [22], MAE [23] and CLIP [29]. As in Tab. 2, our method performs the best, whether in terms of FID, which measures the quality of generation, or in terms of CLIP-i and DINO-i, which measure ID fidelity. As in Fig. 4, merely using DINOv2 as the extractor fails to maintain color consistency, while solely using MAE as the extractor lacks sufficient details. In contrast, our method combines the advantages of both DINOv2 and MAE, which enables our model to customize images with rich details and the consistent color of the object in reference images.

Benefits of dual-level ID injection. We explore the effectiveness of our dual-level ID injection module compared to only global-level injection and only local-level injection separately. As shown in Tab. 3, by combining global and local injection, AnyMaker achieves the best generation results both in quality as measured by FID and ID-fidelity as measured by CLIP-i and DINO-i.

Benefits of ID-aware decoupling. To verify the effectiveness of the ID-aware decoupling module, we compare models trained with and without it. As shown in Tab. 4, the model trained with the ID-aware decoupling module achieves slightly higher ID-fidelity scores in terms of CLIP-i and DINO-i, which suggests that the decoupling module helps the model discern the ID information embedded in object representations and thereby generate results with better ID consistency. Further, we visualize the results of models trained with and without decoupling in Fig. 5. The model trained with decoupling exhibits an enhanced ability to distinguish ID information from non-ID elements such as the motions (opening or shutting the mouth) and directions (front-facing or side-facing) of the objects of interest. This enhanced discrimination ability allows the model to mitigate the influence of non-ID information during generation, thereby better preserving the text editing capabilities.

6 Conclusion

We introduce AnyMaker, a zero-shot text-to-image framework for general object customization, which has seldom been studied in previous works. Our approach integrates novel general ID extraction, dual-level ID injection, and ID-aware decoupling modules, endowing the model with high ID fidelity without damaging the text editing ability. Experiments demonstrate that our method excels in general object customization with only a single reference image, and even outperforms task-specialized methods in some popular specific domains. Furthermore, we construct the first large-scale general ID dataset, MC-IDC, to set a new benchmark and promote future research on object customization.

References

  • [1] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
  • [2] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023.
  • [3] Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-$\alpha$: Fast training of diffusion transformer for photorealistic text-to-image synthesis, 2023.
  • [4] Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, and Hengshuang Zhao. Anydoor: Zero-shot object-level image customization, 2023.
  • [5] Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, Anthony Chen, Huaxia Li, Xu Tang, and Yao Hu. Instantid: Zero-shot identity-preserving generation in seconds, 2024.
  • [6] Zhen Li, Mingdeng Cao, Xintao Wang, Zhongang Qi, Ming-Ming Cheng, and Ying Shan. Photomaker: Customizing realistic human photos via stacked id embedding, 2023.
  • [7] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023.
  • [8] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023.
  • [9] Hong Chen, Yipeng Zhang, Simin Wu, Xin Wang, Xuguang Duan, Yuwei Zhou, and Wenwu Zhu. Disenbooth: Identity-preserving disentangled tuning for subject-driven text-to-image generation. In The Twelfth International Conference on Learning Representations, 2023.
  • [10] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  • [11] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
  • [12] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
  • [13] Hila Chefer, Shir Gur, and Lior Wolf. Transformer interpretability beyond attention visualization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 782–791, 2021.
  • [14] Songwei Ge, Taesung Park, Jun-Yan Zhu, and Jia-Bin Huang. Expressive text-to-image generation with rich text. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7545–7556, 2023.
  • [15] Yoad Tewel, Yoav Shalev, Idan Schwartz, and Lior Wolf. Zerocap: Zero-shot image-to-text generation for visual-semantic arithmetic. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17918–17928, 2022.
  • [16] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
  • [17] Xu Peng, Junwei Zhu, Boyuan Jiang, Ying Tai, Donghao Luo, Jiangning Zhang, Wei Lin, Taisong Jin, Chengjie Wang, and Rongrong Ji. Portraitbooth: A versatile portrait model for fast identity-preserved personalization. arXiv preprint arXiv:2312.06354, 2023.
  • [18] Gan Pei, Jiangning Zhang, Menghan Hu, Guangtao Zhai, Chengjie Wang, Zhenyu Zhang, Jian Yang, Chunhua Shen, and Dacheng Tao. Deepfake generation and detection: A benchmark and survey. arXiv preprint arXiv:2403.17881, 2024.
  • [19] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
  • [20] Jiehui Huang, Xiao Dong, Wenhui Song, Hanhui Li, Jun Zhou, Yuhao Cheng, Shutao Liao, Long Chen, Yiqiang Yan, Shengcai Liao, et al. Consistentid: Portrait generation with multimodal fine-grained identity preserving. arXiv preprint arXiv:2404.16771, 2024.
  • [21] Weifeng Chen, Tao Gu, Yuhao Xu, and Chengcai Chen. Magic clothing: Controllable garment-driven image synthesis. arXiv preprint arXiv:2404.09512, 2024.
  • [22] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski. Dinov2: Learning robust visual features without supervision, 2024.
  • [23] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners, 2021.
  • [24] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021.
  • [25] Wenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, et al. Internimage: Exploring large-scale vision foundation models with deformable convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14408–14419, 2023.
  • [26] Jinlong Peng, Changan Wang, Fangbin Wan, Yang Wu, Yabiao Wang, Ying Tai, Chengjie Wang, Jilin Li, Feiyue Huang, and Yanwei Fu. Chained-tracker: Chaining paired attentive regression results for end-to-end joint multiple-object detection and tracking. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16, pages 145–161. Springer, 2020.
  • [27] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
  • [28] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  • [29] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • [30] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
  • [31] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015.
  • [32] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728–1738, 2021.
  • [33] Ning Xu, Linjie Yang, Yuchen Fan, Dingcheng Yue, Yuchen Liang, Jianchao Yang, and Thomas Huang. Youtube-vos: A large-scale video object segmentation benchmark. arXiv preprint arXiv:1809.03327, 2018.
  • [34] Jiaxu Miao, Xiaohan Wang, Yu Wu, Wei Li, Xu Zhang, Yunchao Wei, and Yi Yang. Large-scale video panoptic segmentation in the wild: A benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21033–21043, 2022.
  • [35] Xianggang Yu, Mutian Xu, Yidan Zhang, Haolin Liu, Chongjie Ye, Yushuang Wu, Zizheng Yan, Chenming Zhu, Zhangyang Xiong, Tianyou Liang, et al. Mvimgnet: A large-scale dataset of multi-view images. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9150–9161, 2023.
  • [36] Xintong Han, Zuxuan Wu, Zhe Wu, Ruichi Yu, and Larry S Davis. Viton: An image-based virtual try-on network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7543–7552, 2018.
  • [37] Agrim Gupta, Piotr Dollar, and Ross Girshick. Lvis: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5356–5364, 2019.

Appendix A Appendix

A.1 Additional Details about MC-IDC

Data sources. The data sources of MC-IDC can be divided into three categories: public datasets, web-crawled images, and movies. We detail the statistical information about various data sources in Tab. A1.

Main categories. We record several main categories that appear most frequently in MC-IDC, as shown in Tab. A2.

Figure A1: Illustration of our general ID dataset, MC-IDC. In each sample, the reference image with the object mask provides ID information, the text prompt offers semantic-level guidance for generation, and the target image serves as the ground truth.

Dataset illustration. We present the overall illustration of the MC-IDC as shown in Fig. A1.

Table A1: Details about data sources of MC-IDC.
| Source | Dataset | Type | Image pair numbers |
|---|---|---|---|
| Public datasets | HumanFace [32] | Video | 55,830 |
| | VOS [33] | Video | 55,823 |
| | VIPSEG [34] | Video | 27,983 |
| | MVImgNet [35] | Multi-view image | 53,909 |
| | VITON [36] | Multi-view image | 20,000 |
| | LVIS [37] | Single image | 8,003 |
| Web-crawled images | - | Single image | 55,829 |
| Movies | - | Video | 38,405 |
Table A2: Sample numbers of main categories in MC-IDC.
| Categories | Images |
|---|---|
| man | 46,720 |
| woman | 26,670 |
| clothes | 20,040 |
| girl | 3,498 |
| panda | 3,007 |
| train | 2,286 |
| car | 1,974 |
| boy | 1,855 |
| dog | 1,820 |

A.2 Additional Details about Experiment Setup

Categories in evaluation dataset. The evaluation dataset can be divided into general objects, human data, and virtual try-on data. The human data and the virtual try-on data each contain 300 different samples. General objects in the evaluation dataset consist of 50 categories, each of which contains 8 diverse samples. We summarize the 50 categories in Tab. A3.

Table A3: Categories of general objects in the evaluation dataset.
| Winter melon | Cabbage | Vessel | Pillow | Screw driver |
| Pants | Computer mouse | Lipstick | Rice cooker | Toy figure |
| Clothing | Pineapple | Can | Plush toy | Grape |
| Toilet paper | Paper box | Skirt | Pawpaw | Ginger |
| Bowl | Train | Bottle | Cantaloupe | Sanitary napkin |
| Soccer | Bag | Umbrella | Hammer | Book |
| Flower | Shoe | Towel | Ashcan | Telephone |
| Faucet | Flowerpot | Motorcycle | Mug | Kiwi |
| Pot | Grapefruit | Jug | Car | Basket |
| Balloons | Tomato | Flashlight | Bagged snacks | Toy duck |

Text prompts for calculating DiverSim-i. We use diverse text prompts describing different scenarios to guide the generation, and calculate DiverSim-i among the generated images. We record the text prompts in Tab. A4.

Table A4: Text prompts for calculating DiverSim-i.
| Scenarios | Text prompts |
|---|---|
| Snow | Original text prompt + "The scene of the picture is in the snow." |
| | Original text prompt + "The background of the picture is in the snow." |
| Grass | Original text prompt + "The scene of the picture is on the grass." |
| | Original text prompt + "The background of the picture is on the grass." |
| Beach | Original text prompt + "The scene of the picture is on the beach." |
| | Original text prompt + "The background of the picture is on the beach." |
| Jungle | Original text prompt + "The scene of the picture is in the jungle." |
| | Original text prompt + "The background of the picture is in the jungle." |
| Eiffel Tower | Original text prompt + "The scene of the picture is beside the Eiffel Tower." |
| | Original text prompt + "The background of the picture is beside the Eiffel Tower." |
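The prompt construction in Tab. A4 can be sketched as a small helper that appends each scenario sentence to the original prompt. The function and constant names are hypothetical, and joining the two parts with a single space is an assumption about how the "+" in the table is realized.

```python
from typing import Dict, List, Tuple

# Scenario name -> location phrase, taken from Tab. A4.
SCENARIOS: Dict[str, str] = {
    "Snow": "in the snow",
    "Grass": "on the grass",
    "Beach": "on the beach",
    "Jungle": "in the jungle",
    "Eiffel Tower": "beside the Eiffel Tower",
}

# The two sentence templates used for every scenario in Tab. A4.
TEMPLATES: Tuple[str, str] = (
    "The scene of the picture is {place}.",
    "The background of the picture is {place}.",
)


def diversim_prompts(original_prompt: str) -> List[str]:
    """Build the ten scenario prompts for one original text prompt."""
    base = original_prompt.strip()
    return [
        f"{base} {template.format(place=place)}"
        for place in SCENARIOS.values()
        for template in TEMPLATES
    ]


prompts = diversim_prompts("A toy duck.")
```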

A.3 More Visual Results

AnyMaker exhibits outstanding performance on various applications, such as general object customization in Fig. A2, virtual try-on in Fig. A3, text-image ID mixing in Fig. A4, ID-consistent generation in Fig. A5, and story generation in Fig. A6.

Figure A2: Additional visual results of general object customization. Our AnyMaker is capable of generating high-ID-fidelity images given one single reference image, and simultaneously editing non-ID elements like posture or background in accordance with text prompts.
Figure A3: Additional visual results of virtual try-on. Given a piece of clothing, the AnyMaker can generate images of the clothing worn on a person.
Figure A4: Additional visual results of text-image ID mixing. If the category of the object of interest in the text prompt differs from that in the reference image, our AnyMaker can merge the two and form a new ID.
Figure A5: Additional visual results of ID-consistent generation. The AnyMaker can generate multiple ID-consistent images with diverse non-ID elements such as motions and orientations.
Figure A6: Additional visual results of story generation. Our AnyMaker can generate diverse images under the guidance of text prompts, while maintaining the same identity as the object of interest in the reference image, thereby enabling the creation of a cohesive narrative.