On the Limitation of Diffusion Models for Synthesizing Training Datasets

Yamaguchi, Shin'ya; Fukuda, Takuma

arXiv:2311.13090 (cs)

[Submitted on 22 Nov 2023]

Abstract: Synthetic samples from diffusion models are promising as replicas of real datasets for training discriminative models. However, we found that training on synthetic datasets degrades classification performance relative to training on real datasets, even when the samples come from state-of-the-art diffusion models. This means that modern diffusion models do not perfectly represent the data distribution for the purpose of replicating datasets for discriminative tasks. This paper investigates the gap between synthetic and real samples by analyzing synthetic samples reconstructed from real samples through the diffusion (forward) and reverse processes. By varying the time step at which the reverse process starts, we can control the trade-off between the information retained from the original real data and the information added by the diffusion model. By assessing the reconstructed samples and the models trained on them, we found that the synthetic data concentrate in the modes of the training data distribution as the number of reverse steps increases, and thus struggle to cover the outer edges of the distribution. Our findings imply that modern diffusion models cannot yet replicate the training data distribution perfectly, and that there is room to improve generative modeling for replicating training datasets.
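The reconstruction procedure described in the abstract (diffuse a real sample up to a chosen time step, then run the reverse process back to step 0) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: a DDPM-style formulation, a linear noise schedule with `T = 100` steps, one-dimensional Gaussian "real data" with parameters `MU` and `SIGMA`, and an analytic noise predictor `predict_noise` standing in for a trained network. Because the stand-in predictor is exact for this toy distribution, the sketch only shows how the starting step controls how much of the original sample survives; it does not reproduce the mode-concentration effect reported for trained models.

```python
import math
import random

T = 100  # number of diffusion steps (assumed for this toy example)

# Linear noise schedule (assumed); betas[t] is used at step t = 1..T.
betas = [0.0] + [1e-3 + (0.2 - 1e-3) * i / (T - 1) for i in range(T)]

# Cumulative products of (1 - beta); alpha_bars[0] = 1 means "no noise yet".
alpha_bars = [1.0]
for t in range(1, T + 1):
    alpha_bars.append(alpha_bars[-1] * (1.0 - betas[t]))

MU, SIGMA = 2.0, 0.5  # toy "real data" distribution N(MU, SIGMA^2) (assumed)


def predict_noise(x_t, t):
    """Exact E[eps | x_t] for Gaussian data; stands in for a trained network."""
    a = math.sqrt(alpha_bars[t])
    b = math.sqrt(1.0 - alpha_bars[t])
    return b * (x_t - a * MU) / (a * a * SIGMA * SIGMA + b * b)


def reconstruct(x0, t_start, rng):
    """Diffuse x0 to step t_start, then run the reverse process back to step 0."""
    a = math.sqrt(alpha_bars[t_start])
    b = math.sqrt(1.0 - alpha_bars[t_start])
    x = a * x0 + b * rng.gauss(0.0, 1.0)  # closed-form forward (diffusion) process
    for t in range(t_start, 0, -1):
        eps_hat = predict_noise(x, t)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps_hat) / math.sqrt(1.0 - betas[t])
        noise = rng.gauss(0.0, 1.0) if t > 1 else 0.0  # no noise on the last step
        x = mean + math.sqrt(betas[t]) * noise
    return x


def mean_squared_change(t_start, n=2000, seed=0):
    """Average squared difference between real samples and their reconstructions."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x0 = rng.gauss(MU, SIGMA)
        total += (reconstruct(x0, t_start, rng) - x0) ** 2
    return total / n


if __name__ == "__main__":
    for t_start in (0, 10, 50, 100):
        print(t_start, mean_squared_change(t_start))
```

With `t_start = 0` the sample is returned unchanged; as `t_start` grows, less of the original sample is retained and more of the output is determined by the model, which is the trade-off the abstract refers to.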

Comments:	NeurIPS 2023 SyntheticData4ML Workshop
Subjects:	Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2311.13090 [cs.AI]
	(or arXiv:2311.13090v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2311.13090

Submission history

From: Shin'ya Yamaguchi
[v1] Wed, 22 Nov 2023 01:42:23 UTC (3,806 KB)
