AnyMaker: Zero-shot General Object Customization via Decoupled Dual-Level ID Injection

Lingjie Kong¹ Kai Wu²∗ Xiaobin Hu² Wenhui Han² Jinlong Peng²
Chengming Xu² Donghao Luo² Jiangning Zhang² Chengjie Wang² Yanwei Fu¹
¹Fudan University, Shanghai, China ²Tencent Youtu Lab, Shanghai, China
https://lingjiekong-fdu.github.io Equal contribution. Corresponding author.

Abstract

Text-to-image object customization, which aims to generate images that share the identity (ID) of an object of interest in accordance with text prompts and reference images, has made significant progress. However, recent customization research is dominated by specialized tasks, such as human customization or virtual try-on, leaving a gap in general object customization. To this end, we introduce AnyMaker, an innovative zero-shot object customization framework capable of generating general objects with high ID fidelity and flexible text editability. The efficacy of AnyMaker stems from its novel general ID extraction, dual-level ID injection, and ID-aware decoupling. Specifically, the general ID extraction module extracts sufficient ID information with an ensemble of self-supervised models to tackle the diverse customization tasks for general objects. Then, to provide the diffusion UNet with as much of the extracted ID as possible without damaging text editability during generation, we design a global-local dual-level ID injection module, in which the global-level semantic ID is injected into text descriptions while the local-level ID details are injected directly into the model through newly added cross-attention modules. In addition, we propose an ID-aware decoupling module to disentangle ID-related information from non-ID elements in the extracted representations, so that generation stays faithful to both the identity and the text description. To validate our approach and boost research on general object customization, we create the first large-scale general ID dataset, the Multi-Category ID-Consistent (MC-IDC) dataset, with 315k text-image samples and 10k categories. Experiments show that AnyMaker delivers remarkable performance in general object customization and outperforms specialized methods on their corresponding tasks. Code and dataset will be released soon at https://github.com/LingjieKong-fdu/AnyMaker.

Figure 1: Diverse applications of AnyMaker. Given text prompts and a single reference image, AnyMaker can achieve various customizations of general objects with high ID fidelity and flexible text editability, without further fine-tuning.

1 Introduction

Recent advancements in diffusion-based text-to-image generative models [1, 2, 3] have enabled ordinary users to create high-fidelity photo-realistic images with minimal expertise. However, generating images of specific user-provided elements, such as a user's own dog, remains a considerable challenge [4, 5, 6, 7] due to the difficulty of preserving the ID of those elements. To address this limitation, object customization, which leverages reference images and textual descriptions to generate images that share the same identity as the objects of interest, has become a crucial topic.

Current object customization methods fall into two main categories: object-specific methods and object-agnostic methods. Object-specific methods, such as DreamBooth [8] and DisenBooth [9], inject customized knowledge into pretrained diffusion models by learning additional reference-specific identifiers. While these methods can be adapted to various objects, their effectiveness relies on time-consuming fine-tuning on a few reference images, which makes them inefficient. Object-agnostic methods, or zero-shot methods, such as PhotoMaker [6] and InstantID [5], on the other hand, employ large-scale training datasets to generate customized images without further fine-tuning. However, these methods predominantly concentrate on specialized tasks that are only applicable to specific domains, such as human customization or virtual try-on. In this paper, we target zero-shot general object customization, which has seldom been studied in previous works.

To this end, we propose AnyMaker, an innovative zero-shot general object customization framework in text-to-image generation tasks, which can achieve high-fidelity customization for general objects given a text prompt and one single reference image without any test-time fine-tuning. The efficacy of AnyMaker comes from its novel general ID extraction, dual-level ID injection, and ID-aware decoupling.

Specifically, in contrast to prior approaches [6, 7, 5] that use a single model to extract object representations, we employ an ensemble of self-supervised models to extract more comprehensive information from general objects, ensuring that the extracted ID-related information is sufficient for high-ID-fidelity customization. Then, to integrate the extracted ID information into the diffusion UNet while preserving the text editability of the model, we design a global-local dual-level ID injection strategy. At the global level, we fuse the class tokens of the extracted features with the class word (such as dog or cat) of the text description, injecting semantic-level ID information into the diffusion UNet without damaging the other elements of the text. At the local level, we inject the patch tokens of the extracted features into the diffusion UNet through newly added cross-attention modules without compressing the feature dimension, ensuring that the model receives a substantial amount of ID-related detail, which is crucial for high-ID-fidelity generation. Furthermore, we find that the ID information is usually coupled with redundant non-ID elements such as the motions and orientations of objects. Such coupling can disrupt the text editing ability and ID retention ability of the model, and further hinder the diversity of generation. To address this issue, we design a novel ID-aware decoupling module at the training stage, compelling the model to discern the intrinsic properties of the ID information and to exclude non-ID information from the generation process.

Benefiting from the innovative ID processing modules mentioned above, AnyMaker realizes zero-shot customization for general objects with high ID fidelity and flexible text editability. We demonstrate various applications of AnyMaker in Fig. 1, such as realistic human customization, cartoon character customization, virtual try-on, story generation, and ID mixing.

To validate our method and facilitate the development of general object customization, we build the first large-scale general ID dataset, the Multi-Category ID-Consistent (MC-IDC) dataset, which contains 315k text-image samples across more than 10k categories. Experiments demonstrate that AnyMaker achieves state-of-the-art generation results on general object customization and even outperforms task-specific methods in specialized domains like human customization and virtual try-on.

In summary, the contributions of our work are as follows:

• We propose a novel zero-shot general object customization framework for text-to-image generation, which is capable of customizing general objects with high ID fidelity and flexible text editability through innovative general ID extraction, dual-level ID injection, and ID-aware decoupling.

• We construct a large-scale Multi-Category ID-Consistent (MC-IDC) dataset to advance the development of general object customization.

• We validate the effectiveness of our method through extensive experiments, which demonstrate that AnyMaker delivers strong performance in general object customization and surpasses specialized methods tailored for specific tasks.

2 Related Work

Diffusion models. Diffusion models have demonstrated their effectiveness for text-to-image generation. DDPM pioneered this field, using a diffusion and denoising process to create a mapping between Gaussian and image distributions. The Latent Diffusion Model (LDM) [10] took this a step further by applying the diffusion model in a latent space rather than pixel space, leading to text-to-image diffusion models like Stable Diffusion (SD), Midjourney, and DALLE-3 [11]. One main research question for diffusion models is their architecture. In place of the original UNet, DiT [12] and PixArt-$\alpha$ [3] adopt a transformer as the backbone. Other researchers have focused on enhancing image fidelity for complicated concepts. By manipulating the cross-attention maps according to text prompts, [13] ensures that required objects are sufficiently generated in the images. [14] processes the text prompts to decompose all target attributes, which are then used to guide the denoising procedure with different objective functions. To enhance content consistency, [15] adopts an inner localization method based on cross-attention. Our method is built on diffusion models. Unlike these basic models for plain image generation, however, we focus on the more challenging topic of zero-shot customization of arbitrary objects.

Image customization. Image customization aims to generate images with the same objects of interest as the reference image [8, 16, 17, 18, 5]. Current methods can be grouped into two categories. The first is object-specific models, which fine-tune pretrained diffusion models with reference images. Textual Inversion [19] learns an extra word embedding for the target concept. DreamBooth [8] and DisenBooth [9] enhance such a pipeline with auxiliary objectives such as a prior loss, which ensures that the model retains its original generative capabilities throughout fine-tuning. The second is object-agnostic models, which can directly process reference images without further fine-tuning. PhotoMaker [6], a typical zero-shot method, trains a person ID-extraction network, while [20] uses a face model together with CLIP to extract ID features. With the help of ControlNet [16], InstantID [5] can control facial expression by leveraging a pre-trained face model to extract face embeddings. MagicClothing [21] utilizes a ControlNet-style garment extractor to encode garment details and fuses them through attention for clothes replacement or preview. In contrast to these works, we try to combine the versatility of object-specific models with the efficiency of object-agnostic models to present a method that can customize any object in a zero-shot manner, which has seldom been studied before.

3 Method

We first outline the overall structure of AnyMaker in Sec. 3.1. Then, we detail the general ID extraction, dual-level ID injection and ID-aware decoupling in Sec. 3.2, Sec. 3.3 and Sec. 3.4 respectively.

3.1 Overview

Given a reference image and a text prompt, our goal is to generate images that share the same ID as the object of interest in the reference image, while modifying non-ID elements such as motions and backgrounds in accordance with the text prompt. To this end, we propose AnyMaker, an innovative zero-shot text-to-image customization framework for general objects, as summarized in Fig. 2. First, we use the general ID extractor to acquire sufficient ID information from the reference image together with a segmentation mask of the object of interest. Next, we leverage the dual-level ID injection to endow the diffusion model with ID information at the global and local levels respectively, without damaging the text editing ability of the model. Additionally, we design a novel ID-aware decoupling module to distinguish the ID information from the coupled non-ID elements in the object representation, for higher ID fidelity and better text editing ability.

3.2 General ID Extraction

We utilize an ensemble of self-supervised models, consisting of DINOv2 [22] and MAE [23], to extract the representations of general objects in the reference image.

Previous works [6, 5] targeting customization tasks in specific domains, such as human faces, usually use a single model to extract object representations. However, our experiments demonstrate that a single pre-trained model, such as CLIP [24], DINOv2 [22] or MAE [23], cannot extract sufficient features for high-ID-fidelity generation of general objects, as shown in Fig. 4 and Tab. 2. Specifically, previous works [4, 5] demonstrate that features extracted by CLIP contain insufficient detail, making it less practical as an ID feature extractor, while DINOv2 is capable of providing the desired detail due to its self-supervised contrastive learning strategy. However, we find that the features extracted by DINOv2 are color-insensitive because it is trained with augmentations such as ColorJitter. Consequently, using DINOv2 alone as an ID extractor leads to incorrectly generated colors. MAE, on the other hand, is aware of color thanks to its reconstruction-based objective.

Therefore, we leverage DINOv2 and MAE, which compensate for each other, to extract sufficient and comprehensive information to handle the challenging customization task for general objects with various categories and diverse characteristics. To align the extracted features and facilitate the subsequent injection, we adopt a two-layer MLP. The process of extracting general object representations is as follows,

$$f_{dino}^{C},f_{dino}^{P}=MLP(F_{dino}(I_{ref}\odot M_{ref})), \tag{1}$$

$$f_{mae}^{C},f_{mae}^{P}=MLP(F_{mae}(I_{ref}\odot M_{ref})), \tag{2}$$

where $\odot$ represents element-wise multiplication, and $f_{dino}^{C},f_{dino}^{P},f_{mae}^{C},f_{mae}^{P}$ represent the class tokens and patch tokens extracted by DINOv2 and MAE respectively. $F_{dino}$ and $F_{mae}$ represent the DINOv2 and MAE backbones. $I_{ref}$ represents the reference image, and $M_{ref}$ is the segmentation mask of the object of interest in the reference image. Consequently, tokens derived from our general ID extractor not only offer intricate details from DINOv2, but also carry sufficient colour and structural information from MAE.
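The data flow of Eq. (1) and Eq. (2) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `toy_backbone` is a hypothetical stand-in for $F_{dino}$ or $F_{mae}$ (it is not a real ViT), the weights `w1` and `w2` are made up, and the MLP omits biases. It shows the masking step $I_{ref}\odot M_{ref}$, the split into one class token and several patch tokens, and the projection of every token through a two-layer MLP.

```python
def apply_mask(image, mask):
    """I_ref ⊙ M_ref: zero out every pixel outside the object of interest."""
    return [[p * m for p, m in zip(irow, mrow)] for irow, mrow in zip(image, mask)]

def toy_backbone(image, patch=2):
    """Hypothetical stand-in for F_dino / F_mae (NOT a real ViT).
    Each patch token is [mean, max] of a patch; the class token averages them."""
    h, w = len(image), len(image[0])
    patch_tokens = []
    for r in range(0, h, patch):
        for c in range(0, w, patch):
            vals = [image[i][j] for i in range(r, r + patch) for j in range(c, c + patch)]
            patch_tokens.append([sum(vals) / len(vals), max(vals)])
    dim = len(patch_tokens[0])
    class_token = [sum(t[k] for t in patch_tokens) / len(patch_tokens) for k in range(dim)]
    return class_token, patch_tokens

def linear(x, w):
    """x: vector of length d_in, w: d_in x d_out matrix (no bias)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]

def mlp(x, w1, w2):
    """Two-layer MLP with a ReLU in between."""
    hidden = [max(0.0, v) for v in linear(x, w1)]
    return linear(hidden, w2)

def extract_id(image, mask, w1, w2):
    """Masked image -> backbone -> MLP, returning (class token, patch tokens)."""
    cls, patches = toy_backbone(apply_mask(image, mask))
    return mlp(cls, w1, w2), [mlp(p, w1, w2) for p in patches]

image = [[1.0, 2.0, 3.0, 4.0],
         [5.0, 6.0, 7.0, 8.0],
         [9.0, 1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0, 7.0]]
mask = [[1, 1, 0, 0],   # the object occupies the top-left 2x2 patch only
        [1, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]
w1 = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]    # 2 -> 3, illustrative weights
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 -> 2, illustrative weights
f_c, f_p = extract_id(image, mask, w1, w2)
```

In this toy setup only the patch covered by the mask yields a non-zero token, which is the intended effect of masking: background pixels contribute nothing to the extracted ID features.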

3.3 Dual-level ID Injection

We propose a dual-level ID injection mechanism to inject the global semantic ID information and local fine-grained ID information into the diffusion UNet respectively, ensuring both flexible text editability and high ID fidelity in the customization process for general objects.

Global ID injection. In text-to-image generation, text prompts are leveraged to guide diffusion models to generate diverse images. Nevertheless, diffusion models cannot accurately generate images with a specific object ID by relying solely on the text description. To inject the ID information of the objects of interest into the text, while maintaining the editing capability of text prompts for other elements such as the background and motions, we design the global ID injection mechanism. Specifically, we merge the semantic-level class tokens $(f_{dino}^{C},f_{mae}^{C})$ with the class word in the text embedding, such as "dog" or "cat",

$$f_{fuse}^{C}=MLP(Concat(f_{text}^{C},f_{dino}^{C},f_{mae}^{C})), \tag{3}$$

where $f_{fuse}^{C}$ is the fused class token, and $f_{text}^{C}$ represents the class word embedding of the object of interest. Then we insert the fused class token $f_{fuse}^{C}$ at the original position of the class word in the text embeddings, obtaining the global-level condition $c_{g}$ with ID-related information of the object of interest, as shown in Fig. 2. Following this, the global-level condition $c_{g}$ engages with the cross-attention module in the UNet, as in standard text-to-image models [10], as follows,

$$Z_{g}=Attention(Q_{g},K_{g},V_{g}), \tag{4}$$

where $Q_{g}=Z_{g}W_{q}$, $K_{g}=c_{g}W_{k}$, $V_{g}=c_{g}W_{v}$ are the query, key, and value of the cross-attention module, in which $W_{q}$, $W_{k}$, $W_{v}$ are the weight matrices of the trainable linear projection layers, and $Z_{g}$ is the query feature containing the latent input information.
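As a rough illustration of Eq. (3) and the token replacement that produces $c_{g}$, the sketch below uses tiny 2-dimensional embeddings in plain Python. All values, the toy prompt, and the weights `w1` and `w2` are assumptions chosen so the result is easy to verify by hand; the real model operates on text-encoder embeddings and learned MLP weights.

```python
def linear(x, w):
    """x: vector of length d_in, w: d_in x d_out matrix (no bias)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]

def mlp(x, w1, w2):
    """Two-layer MLP with a ReLU in between."""
    hidden = [max(0.0, v) for v in linear(x, w1)]
    return linear(hidden, w2)

def fuse_class_token(f_text, f_dino, f_mae, w1, w2):
    """Eq. (3): MLP(Concat(f_text^C, f_dino^C, f_mae^C))."""
    return mlp(f_text + f_dino + f_mae, w1, w2)

def build_global_condition(text_embeddings, class_index, fused):
    """Copy the text embeddings and overwrite only the class-word position."""
    c_g = [list(tok) for tok in text_embeddings]
    c_g[class_index] = list(fused)
    return c_g

tokens = ["a", "dog", "on", "the", "beach"]           # toy prompt
text_emb = [[0.1, 0.0], [0.5, 0.5], [0.0, 0.2], [0.3, 0.1], [0.9, 0.4]]
class_index = tokens.index("dog")                     # position of the class word
f_dino_c = [1.0, 0.0]                                 # illustrative class tokens
f_mae_c = [0.0, 1.0]
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0],
      [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]             # 6 -> 2, illustrative
w2 = [[1.0, 0.0], [0.0, 1.0]]                         # 2 -> 2 identity, illustrative

fused = fuse_class_token(text_emb[class_index], f_dino_c, f_mae_c, w1, w2)
c_g = build_global_condition(text_emb, class_index, fused)
```

Only the embedding at the class-word position changes; the embeddings of "on", "the", and "beach" pass through untouched, which is why the rest of the prompt remains editable.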

Local ID injection. Semantic-level information in the global-level condition $c_{g}$ contains relatively few details of the objects, so the model cannot handle complex customization tasks for general objects effectively by relying on global ID injection alone. To supply the model with sufficient ID-related details, we propose the local ID injection mechanism. Specifically, we first fuse the extracted patch tokens $f_{dino}^{P}$ and $f_{mae}^{P}$ with two-layer MLPs to densify the contained information into a unified representation $c_{l}$, as follows,

$$c_{l}=MLP(f_{dino}^{P})+MLP(f_{mae}^{P}), \tag{5}$$

where $c_{l}$ serves as the local-level condition for the diffusion UNet. Then, we add one cross-attention module to each up block of the diffusion UNet in order to inject the ID-related details into the model without compressing the dimensions of $c_{l}$, feeding the model as much ID information as possible.

$$Z_{l}=Attention(Q_{l},K_{l},V_{l}), \tag{6}$$

where $Q_{l}=Z_{l}W_{q}$, $K_{l}=c_{l}W_{k}$, $V_{l}=c_{l}W_{v}$ are the query, key, and value of the cross-attention module, in which $W_{q}$, $W_{k}$, $W_{v}$ are the weight matrices of the trainable linear projection layers, and $Z_{l}$ is the query feature containing the latent input information.

Through the dual-level ID injection, we provide the model with sufficient ID-related information, both at the semantic level and the detail level, on the premise of not impairing the text editing ability of the model.
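The local path of Eq. (5) and Eq. (6) can be sketched as follows, again in plain Python. Several simplifications are assumptions of this sketch and not statements about the actual implementation: the two MLPs in Eq. (5) are replaced by identity maps, the attention is single-head scaled dot-product attention, and the projection matrices are identities so the numbers can be checked by hand.

```python
import math

def matmul(a, b):
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def softmax(row):
    m = max(row)                       # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def fuse_patch_tokens(p_dino, p_mae):
    """Eq. (5) with both MLPs taken as identity maps (a simplifying assumption)."""
    return [[a + b for a, b in zip(td, tm)] for td, tm in zip(p_dino, p_mae)]

def cross_attention(z, cond, w_q, w_k, w_v):
    """Eq. (6): queries come from the latent z, keys and values from the condition."""
    q, k, v = matmul(z, w_q), matmul(cond, w_k), matmul(cond, w_v)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights

identity = [[1.0, 0.0], [0.0, 1.0]]
z = [[1.0, 0.0], [0.0, 1.0]]                      # two latent positions
p_dino = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # three patch tokens per extractor
p_mae = [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0]]
c_l = fuse_patch_tokens(p_dino, p_mae)            # every fused token equals [1.0, 1.0]
out, attn = cross_attention(z, c_l, identity, identity, identity)
```

Note that `c_l` keeps one token per patch: nothing is pooled or compressed before attention, so each latent position can attend to every patch of the reference object.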

3.4 Training with ID-Aware Decoupling

The ID-related information injected in the diffusion model is usually entangled with redundant non-ID information such as motion, direction and size. Without proper guidance, the model would learn both kinds of information together, which may disrupt the text editing ability of the model and hinder the diversity of generation. For instance, in a reference image depicting "a standing person", the posture "standing" may also be extracted and injected into the diffusion model, which could impede a text prompt wishing to generate "a sitting person". Consequently, the generated images would closely resemble the provided reference images, thereby compromising the text editing ability and generation diversity.

To eliminate the negative impact brought by the coupling phenomenon, we take inspiration from previous works [9, 23] to propose an ID-aware decoupling module, which consists of a decoupling branch and a group of losses based on the normal generation branch at the training stage, guiding the model to discern the intrinsic properties of ID information.

Decoupling branch and normal branch. In the decoupling branch, we first extract the target image features $f_{tar}$ through CLIP [24] to form an image embedding prior [11]. Next, we mask out the ID information contained in the image embedding with a trainable feature mask $m_{id}$, so that the masked image feature $f_{msk}=f_{tar}\odot m_{id}$ contains only the non-ID information of the target image, where $\odot$ represents element-wise multiplication. Then, we add the masked image feature $f_{msk}$ to the fused class token $f_{fuse}^{C}$ and feed the result to the diffusion UNet for generation, following the injection scheme described in Sec. 3.3. In the normal branch, by contrast, we inject $f_{fuse}^{C}$ into the model without $f_{msk}$. The generation process of the two branches can be illustrated as follows,

$$f_{msk}\oplus f_{fuse}^{C}\Rightarrow I_{tar},\quad f_{fuse}^{C}\Rightarrow I_{tar},\quad f_{msk}\perp f_{fuse}^{C}, \tag{7}$$

where $\oplus$ represents element-wise addition and $I_{tar}$ is the target image, and the symbol $\perp$ signifies that the features it connects are independent of each other and have no obvious correlation. Through training in the two branches, we push the non-ID information into the masked image feature $f_{msk}$ and only retain the ID-related information in the fused class token $f_{fuse}^{C}$ .
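A minimal sketch of how the two branches build their class-token conditions is given below. The 4-dimensional vectors and the binary mask are assumptions for illustration: in the method the mask $m_{id}$ is trainable and the features come from CLIP and the fusion MLP, and the text does not say whether the mask is binary or soft.

```python
def hadamard(a, b):
    """Element-wise multiplication (the ⊙ operator)."""
    return [x * y for x, y in zip(a, b)]

def add(a, b):
    """Element-wise addition (the ⊕ operator)."""
    return [x + y for x, y in zip(a, b)]

f_tar = [0.8, -0.2, 0.5, 0.1]    # target image feature (illustrative values)
m_id = [0.0, 1.0, 1.0, 0.0]      # feature mask; trainable in the method, fixed here
f_fuse = [1.0, 1.0, 0.0, 0.0]    # fused class token (illustrative values)

f_msk = hadamard(f_tar, m_id)           # keeps only the unmasked dimensions of f_tar
decouple_condition = add(f_msk, f_fuse) # decoupling branch: f_msk ⊕ f_fuse
normal_condition = f_fuse               # normal branch: f_fuse alone
```

Both conditions are trained to reconstruct the same target image, so whatever the normal branch cannot express without $f_{msk}$ is pushed into the masked feature.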

Training strategy. We use three losses in our training stage to ensure that the ID-aware decoupling module takes effect. Specifically, we utilize the following denoising losses to train each branch,

$$\mathcal{L}_{decouple}=\|\epsilon-\epsilon_{\theta}\big(x_{t},c_{g},f_{msk},c_{l},t\big)\|^{2}, \tag{8}$$

$$\mathcal{L}_{normal}=\|\epsilon-\epsilon_{\theta}\big(x_{t},c_{g},c_{l},t\big)\|^{2}, \tag{9}$$

where $t$ is the randomly sampled time step and $x_{t}$ represents the noisy latent of the target image, $\epsilon$ is the ground truth noise and $\epsilon_{\theta}$ is the predicted noise. $c_{g}$ and $c_{l}$ represent the global-level and local-level conditions acquired in Sec. 3.3. Additionally, we design a contrastive loss, ensuring that the masked image feature $f_{msk}$ captures non-ID information and the fused class token $f_{fuse}^{C}$ remains free of it, as below,

$$\mathcal{L}_{contrast}=Sim(f_{fuse}^{C},f_{msk}), \tag{10}$$

where $Sim$ is instantiated as cosine similarity. In summary, the training loss is the sum of the three loss functions mentioned above,

$$\mathcal{L}=\alpha_{1}\mathcal{L}_{normal}+\alpha_{2}\mathcal{L}_{decouple}+\alpha_{3}\mathcal{L}_{contrast}, \tag{11}$$

where $\alpha_{1},\alpha_{2},\alpha_{3}$ represent hyperparameters to balance the three losses.
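To make Eqs. (8) to (11) concrete, the following sketch computes the three terms on toy vectors. The noise predictions are passed in directly instead of coming from a UNet, and the default weights `alphas=(1.0, 1.0, 1.0)` are placeholders, since the values of $\alpha_{1},\alpha_{2},\alpha_{3}$ are not given here.

```python
import math

def squared_l2(a, b):
    """||a - b||^2, the denoising loss used by both branches."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def total_loss(eps, eps_normal, eps_decouple, f_fuse, f_msk, alphas=(1.0, 1.0, 1.0)):
    """Eq. (11): weighted sum of the normal, decoupling, and contrastive terms."""
    a1, a2, a3 = alphas                                # placeholder weights
    loss_normal = squared_l2(eps, eps_normal)          # Eq. (9)
    loss_decouple = squared_l2(eps, eps_decouple)      # Eq. (8)
    loss_contrast = cosine_similarity(f_fuse, f_msk)   # Eq. (10)
    return a1 * loss_normal + a2 * loss_decouple + a3 * loss_contrast

eps = [1.0, 0.0]             # ground-truth noise (toy)
eps_normal = [0.5, 0.0]      # prediction of the normal branch (toy)
eps_decouple = [1.0, 0.5]    # prediction of the decoupling branch (toy)

orthogonal = total_loss(eps, eps_normal, eps_decouple, [1.0, 0.0], [0.0, 2.0])
aligned = total_loss(eps, eps_normal, eps_decouple, [1.0, 0.0], [1.0, 0.0])
```

With identical denoising errors, the loss is lower when $f_{fuse}^{C}$ and $f_{msk}$ are orthogonal than when they are aligned, which is how minimizing the cosine-similarity term discourages the two features from carrying the same information.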

4 MC-IDC Dataset

We create the first large-scale general ID dataset, the Multi-Category ID-Consistent (MC-IDC) dataset, designed for research on general object customization. We introduce the characteristics and construction process of the dataset in Sec. 4.1 and Sec. 4.2 respectively. More details of the dataset are given in the Appendix.

4.1 Characteristics

Large-scale dataset. Our dataset consists of approximately 315,000 samples in total with more than 10,000 categories, covering various types such as human faces, animals, clothes, human-made tools, etc. Adequate training samples and diverse categories endow the model with general ID understanding. We introduce the main categories of the dataset and their quantities in the Appendix.

High quality. The average resolution of the images in our dataset is 1039 $\times$ 950, which supports high-quality generation in both the training and inference stages.

Reference-target image pair. Each sample consists of a reference image, a segmentation mask of the object of interest in the reference image, a target image, and a text caption of the target image. During training, the reference image with its segmentation mask provides the model with ID information, the text caption of the target image offers semantic-level guidance for generation, and the target image serves as the ground truth. In each sample, the reference image and the target image depict the same object in different states, which means that each reference-target image pair shares the same ID of the object of interest and has different non-ID elements such as motions and directions. Training with reference-target image pairs endows the model with the ID processing ability while not damaging the diversity of generation.

4.2 Construction Pipeline

The construction pipeline of MC-IDC consists of four steps: data collection, instance detection and segmentation, image pair generation, and text prompt generation.

Data collection. The MC-IDC dataset draws on diverse sources, including web-crawled images, movies, and publicly available datasets. To facilitate the subsequent generation of image pairs, most of the data we collect are video or multi-view data. To ensure the high quality of our dataset, we discard images with a resolution lower than 300 $\times$ 300.

Instance detection and segmentation. For publicly available datasets or web-crawled images that do not provide segmentation annotations, we use an advanced instance segmentation model [25] for instance detection and segmentation. For movie data, we extract frames at 10 fps and perform object tracking with [26], aiming to establish the ID connection across frames.

Image pair generation. For most data samples which are from a clip of video or a group of multi-view data, we randomly select two frames containing the same object and perform a random crop around the object to form the reference-target image pair. For the other data samples from the single image dataset, we perform random data augmentations on the single image two times, and take them as the reference image and the target image respectively.
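The video and multi-view case of the pairing step can be sketched as below. The data layout (a track as a list of `(frame_index, box)` entries for one object ID), the helper names, and the margin-based cropping rule are all assumptions of this sketch; the text only specifies that two frames containing the same object are selected at random and randomly cropped around the object.

```python
import random

def random_crop_around(box, img_w, img_h, max_margin, rng):
    """Return a crop (x0, y0, x1, y1) that contains the object box,
    padded by a random margin on each side and clipped to the image."""
    x0, y0, x1, y1 = box
    return (max(0, x0 - rng.randint(0, max_margin)),
            max(0, y0 - rng.randint(0, max_margin)),
            min(img_w, x1 + rng.randint(0, max_margin)),
            min(img_h, y1 + rng.randint(0, max_margin)))

def make_pair(track, img_w, img_h, max_margin=32, rng=None):
    """track: list of (frame_index, box) entries for ONE tracked object ID."""
    rng = rng or random.Random()
    (ref_frame, ref_box), (tar_frame, tar_box) = rng.sample(track, 2)
    return {
        "reference": (ref_frame, random_crop_around(ref_box, img_w, img_h, max_margin, rng)),
        "target": (tar_frame, random_crop_around(tar_box, img_w, img_h, max_margin, rng)),
    }

# Hypothetical track of one object across three frames of a 100x100 video.
track = [(0, (10, 10, 50, 50)), (5, (20, 15, 60, 55)), (9, (30, 20, 70, 60))]
pair = make_pair(track, img_w=100, img_h=100, rng=random.Random(0))
```

Sampling without replacement guarantees that the reference and the target come from different frames, so the pair shares the object ID while differing in non-ID elements such as pose or viewpoint.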

Text prompt generation. In each image pair, the target image is densely captioned with the state-of-the-art large vision language model [27], serving as the text prompt to guide the generation.

For more details about the MC-IDC, please refer to the Appendix.

Table 1: Quantitative comparison among zero-shot object customization methods in general domain and two specific popular domains namely human customization and virtual try-on.

Domains	Methods	FID $\downarrow$	CLIP-i $\uparrow$	CLIP-t $\uparrow$	DINO-i $\uparrow$	FaceSim $\uparrow$	DiverSim-i $\downarrow$
General objects	IP-Adapter	70.32	77.18	28.03	44.94	-	84.43±0.66
General objects	Ours	47.09	82.16	29.27	65.13	-	74.38±3.99
Human customization	IP-Adapter	102.69	72.17	29.32	41.44	65.51	-
Human customization	PhotoMaker	106.35	71.80	32.13	44.62	64.10	-
Human customization	InstantID	113.18	75.87	32.89	49.26	63.26	-
Human customization	Ours	86.40	79.60	30.88	57.44	78.54	-
Virtual try-on	MagicClothing	126.09	76.53	21.40	29.10	-	89.36±0.40
Virtual try-on	IP-Adapter	104.47	81.99	25.03	59.39	-	71.28±5.06
Virtual try-on	Ours	50.65	83.82	22.42	66.24	-	71.27±3.63

5 Experiment

5.1 Setup

Implementation details. To be compatible with open communities such as Civitai.com, we employ SD1.5 [28], the most widely used public foundation model, as the backbone in our experiments. We use the multi-scale training mode proposed in [1], since the resolutions of images from different sources are diverse. We employ the same CLIP image encoder as in [28] in the ID-aware decoupling module. We use MAE [23] and DINOv2 [22] for extracting ID features. During training, we set the learning rate to 1e-5 with a batch size of 32. We train for about 6 epochs on 32 V100 GPUs, which takes about 30 hours. We apply dropout with a ratio of 0.1 to the global-level condition $c_{g}$ and the local-level condition $c_{l}$ respectively. During inference, we use a 50-step denoising process for generation and set the classifier-free guidance scale to 7.
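The two settings that tie training to inference, condition dropout and classifier-free guidance, can be sketched as follows. The guidance formula is the standard classifier-free guidance combination; how the dropped conditions are represented (here hypothetical `null_g` and `null_l` placeholders) and whether the two conditions are dropped independently are assumptions of this sketch.

```python
import random

def drop_conditions(c_g, c_l, null_g, null_l, p=0.1, rng=None):
    """Training time: independently replace each condition by a null
    placeholder with probability p (independence is an assumption)."""
    rng = rng or random
    if rng.random() < p:
        c_g = null_g
    if rng.random() < p:
        c_l = null_l
    return c_g, c_l

def classifier_free_guidance(eps_uncond, eps_cond, scale=7.0):
    """Inference time: move the prediction away from the unconditional
    output, toward the conditional one, by the guidance scale."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

guided = classifier_free_guidance([0.0, 1.0], [1.0, 1.0], scale=7.0)
unscaled = classifier_free_guidance([0.0, 1.0], [1.0, 1.0], scale=1.0)
kept = drop_conditions("c_g", "c_l", "null_g", "null_l", p=0.0)
dropped = drop_conditions("c_g", "c_l", "null_g", "null_l", p=1.0)
```

A scale of 1 reproduces the conditional prediction unchanged, while larger scales such as 7 amplify the difference between the conditional and unconditional predictions.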

Competitors. To demonstrate the generality of AnyMaker, we not only compare it with previous works on general object customization, but also with task-specialized methods in two popular customization domains, namely human customization and virtual try-on. Specifically, we compare AnyMaker with IP-Adapter [7] on zero-shot general object customization. On human customization, we additionally compare against the human-specialized methods PhotoMaker [6] and InstantID [5]. On virtual try-on, we compare against the recent state-of-the-art MagicClothing [21]. It is worth noting that all the methods mentioned above are zero-shot methods capable of customizing images from a single reference image. Therefore, in this paper, we do not compare against object-specific methods such as DreamBooth [8] and DisenBooth [9], which typically require 3 to 5 reference images and additional fine-tuning.

Evaluation dataset. Our evaluation dataset contains 1000 text-image samples, including general objects, human data, and virtual try-on data with a ratio of 4:3:3. All samples in the evaluation dataset are randomly selected from the MC-IDC and do not appear during the training stage.

Evaluation metrics. Following PhotoMaker [6], we utilize DINO-i [22] and CLIP-i [29] to measure the ID fidelity and use CLIP-t to measure the prompt fidelity. We leverage FID [30] to assess the generation quality. For human customization, we additionally calculate the face similarity (FaceSim) with FaceNet [31] as commonly done in [6, 5]. Furthermore, we design a novel metric DiverSim-i, which calculates the average DINO similarity among the images generated according to diverse text prompts containing different scenarios. The lower the value of DiverSim-i, the stronger the ability of the model to generate diverse images in line with text prompts.
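One plausible reading of DiverSim-i is sketched below: the mean pairwise similarity between feature vectors of images generated from different prompts. The use of cosine similarity, the averaging over all unordered pairs, and the scaling by 100 are assumptions of this sketch; in practice the feature vectors would be DINO image embeddings, whereas the 2-dimensional vectors here are toy values.

```python
import math
from itertools import combinations

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def diversim_i(features):
    """Mean pairwise cosine similarity (x100) over images generated from
    different prompts; lower values indicate more diverse generations."""
    pairs = list(combinations(features, 2))
    return 100.0 * sum(cosine(a, b) for a, b in pairs) / len(pairs)

collapsed = diversim_i([[1.0, 0.0], [1.0, 0.0]])             # identical images
varied = diversim_i([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])    # one of three pairs matches
```

A model that ignores the prompt and reproduces the reference image in every scenario would score close to 100, whereas generations that follow varied prompts lower the score.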

We introduce more details about the experimental setup in the Appendix.

5.2 Quantitative Analysis

AnyMaker outperforms previous work on general object customization on all metrics, and achieves comparable or better performance than methods tailored for specialized tasks such as human customization and virtual try-on, as shown in Tab. 1. Specifically, on general objects we achieve higher ID fidelity (CLIP-i and DINO-i), better prompt fidelity (CLIP-t), and higher generation quality (FID) than IP-Adapter. Furthermore, our method exhibits better diversity across scenarios, as measured by DiverSim-i.

Table 2: Ablation study on the ID extraction methods.

Methods	FID $\downarrow$	CLIP-i $\uparrow$	DINO-i $\uparrow$
CLIP	49.11	79.58	59.45
DINO	48.89	80.82	63.71
MAE	49.00	79.09	59.79
Ours	47.50	81.86	65.12

Table 3: Ablation study on the ID injection methods.

Methods	FID $\downarrow$	CLIP-i $\uparrow$	DINO-i $\uparrow$
Global	49.78	78.76	60.67
Local	48.66	81.24	62.89
Ours	47.50	81.86	65.12

Table 4: Ablation study on the ID-aware decoupling module.

Methods	FID $\downarrow$	CLIP-i $\uparrow$	DINO-i $\uparrow$
w.o. decoupling	47.50	81.86	65.12
with decoupling	47.09	82.16	65.13

5.3 Qualitative Analysis

AnyMaker exhibits outstanding capabilities for high-quality customization of general objects, and even beats task-specialized methods in specific domains such as human customization and virtual try-on, as shown in Fig. 3. Our method is capable of generating high-ID-fidelity images given a single reference image while editing non-ID elements like posture or background in accordance with text prompts, which enables the model to generate diverse images covering various scenarios and postures. Furthermore, benefiting from the ID-aware decoupling module, our model maintains the ID fidelity of objects while diversifying non-ID elements such as motions and directions relative to the reference images, even without corresponding guidance from the text prompts.

5.4 Ablation Study

Effectiveness of general ID Extraction. We compare our proposed general ID extractor with DINOv2 [22], MAE [23] and CLIP [29]. As in Tab. 2, our method performs the best, whether in terms of FID, which measures the quality of generation, or in terms of CLIP-i and DINO-i, which measure ID fidelity. As in Fig. 4, merely using DINOv2 as the extractor fails to maintain color consistency, while solely using MAE as the extractor lacks sufficient details. In contrast, our method combines the advantages of both DINOv2 and MAE, which enables our model to customize images with rich details and the consistent color of the object in reference images.

Benefits of dual-level ID injection. We explore the effectiveness of our dual-level ID injection module compared to only global-level injection and only local-level injection separately. As shown in Tab. 3, by combining global and local injection, AnyMaker achieves the best generation results both in quality as measured by FID and ID-fidelity as measured by CLIP-i and DINO-i.

Benefits of ID-aware decoupling. To verify the effectiveness of the ID-aware decoupling module, we compare models trained with and without it. As shown in Tab. 4, the model trained with the ID-aware decoupling module achieves slightly higher ID-fidelity scores (82.16 vs. 81.86 for CLIP-i and 65.13 vs. 65.12 for DINO-i), which suggests that the decoupling module helps the model discern the ID information embedded in object representations, thereby generating results with better ID consistency. Further, we visualize the results of models trained with and without decoupling in Fig. 5. The model trained with decoupling exhibits enhanced capabilities in distinguishing ID information from non-ID elements such as motions (mouth open or shut) and directions (front-facing or side-facing) of the objects of interest. This enhanced discrimination ability allows the model to mitigate the influence of non-ID information during generation, thereby better preserving its text editing capabilities.

6 Conclusion

We introduce AnyMaker, a zero-shot text-to-image framework for general object customization, a task that has seldom been studied in previous work. Our approach integrates novel general ID extraction, dual-level ID injection, and ID-aware decoupling modules, endowing the model with high ID fidelity without damaging its text editing ability. Experiments demonstrate that our method excels in general object customization with only a single reference image, and even outperforms task-specialized methods in some popular specific domains. Furthermore, we construct the first large-scale general ID dataset, MC-IDC, to set a new benchmark and promote future research on object customization.

References

[1] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
[2] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023.
[3] Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-$\alpha$: Fast training of diffusion transformer for photorealistic text-to-image synthesis, 2023.
[4] Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, and Hengshuang Zhao. Anydoor: Zero-shot object-level image customization, 2023.
[5] Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, Anthony Chen, Huaxia Li, Xu Tang, and Yao Hu. Instantid: Zero-shot identity-preserving generation in seconds, 2024.
[6] Zhen Li, Mingdeng Cao, Xintao Wang, Zhongang Qi, Ming-Ming Cheng, and Ying Shan. Photomaker: Customizing realistic human photos via stacked id embedding, 2023.
[7] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023.
[8] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023.
[9] Hong Chen, Yipeng Zhang, Simin Wu, Xin Wang, Xuguang Duan, Yuwei Zhou, and Wenwu Zhu. Disenbooth: Identity-preserving disentangled tuning for subject-driven text-to-image generation. In The Twelfth International Conference on Learning Representations, 2023.
[10] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
[11] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
[12] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
[13] Hila Chefer, Shir Gur, and Lior Wolf. Transformer interpretability beyond attention visualization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 782–791, 2021.
[14] Songwei Ge, Taesung Park, Jun-Yan Zhu, and Jia-Bin Huang. Expressive text-to-image generation with rich text. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7545–7556, 2023.
[15] Yoad Tewel, Yoav Shalev, Idan Schwartz, and Lior Wolf. Zerocap: Zero-shot image-to-text generation for visual-semantic arithmetic. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17918–17928, 2022.
[16] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
[17] Xu Peng, Junwei Zhu, Boyuan Jiang, Ying Tai, Donghao Luo, Jiangning Zhang, Wei Lin, Taisong Jin, Chengjie Wang, and Rongrong Ji. Portraitbooth: A versatile portrait model for fast identity-preserved personalization. arXiv preprint arXiv:2312.06354, 2023.
[18] Gan Pei, Jiangning Zhang, Menghan Hu, Guangtao Zhai, Chengjie Wang, Zhenyu Zhang, Jian Yang, Chunhua Shen, and Dacheng Tao. Deepfake generation and detection: A benchmark and survey. arXiv preprint arXiv:2403.17881, 2024.
[19] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
[20] Jiehui Huang, Xiao Dong, Wenhui Song, Hanhui Li, Jun Zhou, Yuhao Cheng, Shutao Liao, Long Chen, Yiqiang Yan, Shengcai Liao, et al. Consistentid: Portrait generation with multimodal fine-grained identity preserving. arXiv preprint arXiv:2404.16771, 2024.
[21] Weifeng Chen, Tao Gu, Yuhao Xu, and Chengcai Chen. Magic clothing: Controllable garment-driven image synthesis. arXiv preprint arXiv:2404.09512, 2024.
[22] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski. Dinov2: Learning robust visual features without supervision, 2024.
[23] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners, 2021.
[24] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021.
[25] Wenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, et al. Internimage: Exploring large-scale vision foundation models with deformable convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14408–14419, 2023.
[26] Jinlong Peng, Changan Wang, Fangbin Wan, Yang Wu, Yabiao Wang, Ying Tai, Chengjie Wang, Jilin Li, Feiyue Huang, and Yanwei Fu. Chained-tracker: Chaining paired attentive regression results for end-to-end joint multiple-object detection and tracking. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16, pages 145–161. Springer, 2020.
[27] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
[28] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
[29] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[30] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
[31] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015.
[32] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728–1738, 2021.
[33] Ning Xu, Linjie Yang, Yuchen Fan, Dingcheng Yue, Yuchen Liang, Jianchao Yang, and Thomas Huang. Youtube-vos: A large-scale video object segmentation benchmark. arXiv preprint arXiv:1809.03327, 2018.
[34] Jiaxu Miao, Xiaohan Wang, Yu Wu, Wei Li, Xu Zhang, Yunchao Wei, and Yi Yang. Large-scale video panoptic segmentation in the wild: A benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21033–21043, 2022.
[35] Xianggang Yu, Mutian Xu, Yidan Zhang, Haolin Liu, Chongjie Ye, Yushuang Wu, Zizheng Yan, Chenming Zhu, Zhangyang Xiong, Tianyou Liang, et al. Mvimgnet: A large-scale dataset of multi-view images. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9150–9161, 2023.
[36] Xintong Han, Zuxuan Wu, Zhe Wu, Ruichi Yu, and Larry S Davis. Viton: An image-based virtual try-on network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7543–7552, 2018.
[37] Agrim Gupta, Piotr Dollar, and Ross Girshick. Lvis: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5356–5364, 2019.

A Appendix

A.1 Additional Details about MC-IDC

Data sources. The data sources of MC-IDC can be divided into three categories: public datasets, web-crawled images, and movies. We detail the statistical information about various data sources in Tab. A1.

Main categories. We record several main categories that appear most frequently in MC-IDC, as shown in Tab. A2.

Dataset illustration. Fig. A1 gives an overall illustration of MC-IDC.

Table A1: Details about data sources of MC-IDC.

Source	Dataset	Type	Number of image pairs
Public datasets	HumanFace [32]	Video	55,830
	VOS [33]	Video	55,823
	VIPSEG [34]	Video	27,983
	MVImgNet [35]	Multi-view image	53,909
	VITON [36]	Multi-view image	20,000
	LVIS [37]	Single image	8,003
Web-crawled images	-	Single image	55,829
Movies	-	Video	38,405
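
As a quick consistency check, the per-source counts in Tab. A1 can be summed and grouped by data type. The snippet below is only an illustrative tally of the numbers listed in the table; the variable names are hypothetical. The total comes to 315,782 image pairs, in line with the roughly 315k samples reported for MC-IDC.

```python
# (dataset or source, data type, number of image pairs), copied from Tab. A1.
SOURCES = [
    ("HumanFace", "Video", 55_830),
    ("VOS", "Video", 55_823),
    ("VIPSEG", "Video", 27_983),
    ("MVImgNet", "Multi-view image", 53_909),
    ("VITON", "Multi-view image", 20_000),
    ("LVIS", "Single image", 8_003),
    ("Web-crawled images", "Single image", 55_829),
    ("Movies", "Video", 38_405),
]

# Overall total across all sources.
total_pairs = sum(count for _, _, count in SOURCES)

# Subtotals per data type.
pairs_by_type = {}
for _, data_type, count in SOURCES:
    pairs_by_type[data_type] = pairs_by_type.get(data_type, 0) + count

print(total_pairs)    # 315782
print(pairs_by_type)  # {'Video': 178041, 'Multi-view image': 73909, 'Single image': 63832}
```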

Table A2: Sample numbers of main categories in MC-IDC.

Category	Images
man	46,720
woman	26,670
clothes	20,040
girl	3,498
panda	3,007
train	2,286
car	1,974
boy	1,855
dog	1,820

A.2 Additional Details about Experiment Setup

Categories in evaluation dataset. The evaluation dataset can be divided into general objects, human data, and virtual try-on data. The human data and the virtual try-on data each contain 300 different samples. General objects in the evaluation dataset consist of 50 categories, each of which contains 8 diverse samples. We summarize the 50 categories in Tab. A3.

Table A3: Categories of general objects in the evaluation dataset.

Winter melon	Cabbage	Vessel	Pillow	Screw driver
Pants	Computer mouse	Lipstick	Rice cooker	Toy figure
Clothing	Pineapple	Can	Plush toy	Grape
Toilet paper	Paper box	Skirt	Pawpaw	Ginger
Bowl	Train	Bottle	Cantaloupe	Sanitary napkin
Soccer	Bag	Umbrella	Hammer	Book
Flower	Shoe	Towel	Ashcan	Telephone
Faucet	Flowerpot	Motorcycle	Mug	Kiwi
Pot	Grapefruit	Jug	Car	Basket
Balloons	Tomato	Flashlight	Bagged snacks	Toy duck

Text prompts for calculating DiverSim-i. We use diverse text prompts describing different scenarios to guide the generation, and calculate DiverSim-i among the generated images. We record the text prompts in Tab. A4.

Table A4: Text prompts for calculating DiverSim-i.

Scenarios	Text prompts
Snow	Original text prompt + "The scene of the picture is in the snow."
Snow	Original text prompt + "The background of the picture is in the snow."
Grass	Original text prompt + "The scene of the picture is on the grass."
Grass	Original text prompt + "The background of the picture is on the grass."
Beach	Original text prompt + "The scene of the picture is on the beach."
Beach	Original text prompt + "The background of the picture is on the beach."
Jungle	Original text prompt + "The scene of the picture is in the jungle."
Jungle	Original text prompt + "The background of the picture is in the jungle."
Eiffel Tower	Original text prompt + "The scene of the picture is beside the Eiffel Tower."
Eiffel Tower	Original text prompt + "The background of the picture is beside the Eiffel Tower."
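
The prompt construction in Tab. A4 can be sketched as follows: each of the five scenarios is paired with two sentence templates ("scene" and "background"), giving ten prompts per original text prompt. This is a minimal sketch; the function name and the choice to join the original prompt and the suffix with a single space are assumptions, since the table only writes the combination as "+".

```python
# Scenario name -> location phrase, as listed in Tab. A4.
SCENARIO_PLACES = {
    "Snow": "in the snow",
    "Grass": "on the grass",
    "Beach": "on the beach",
    "Jungle": "in the jungle",
    "Eiffel Tower": "beside the Eiffel Tower",
}

# The two sentence templates used for every scenario in Tab. A4.
TEMPLATES = (
    "The scene of the picture is {place}.",
    "The background of the picture is {place}.",
)

def build_diverse_prompts(original_prompt):
    """Return (scenario, prompt) pairs: original prompt plus each suffix.

    Assumption: the original prompt and the suffix are joined by one space.
    """
    prompts = []
    for scenario, place in SCENARIO_PLACES.items():
        for template in TEMPLATES:
            suffix = template.format(place=place)
            prompts.append((scenario, f"{original_prompt} {suffix}"))
    return prompts

# Hypothetical original prompt, for illustration only.
for scenario, prompt in build_diverse_prompts("A photo of a dog."):
    print(f"{scenario}: {prompt}")
```

Each reference object thus yields ten generated images across five distinct scenarios, which are the images among which DiverSim-i is calculated.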

A.3 More Visual Results

AnyMaker exhibits outstanding performance on various applications, such as general object customization in Fig. A2, virtual try-on in Fig. A3, text-image ID mixing in Fig. A4, ID-consistent generation in Fig. A5, and story generation in Fig. A6.