GIST: Greedy Independent Set Thresholding for Diverse Data Summarization

Fahrbach, Matthew; Ramalingam, Srikumar; Zadimoghaddam, Morteza; Ahmadian, Sara; Citovsky, Gui; DeSalvo, Giulia

arXiv:2405.18754 (cs)

[Submitted on 29 May 2024]

Abstract: We propose a novel subset selection task called min-distance diverse data summarization ($\textsf{MDDS}$), which has a wide variety of applications in machine learning, e.g., data sampling and feature selection. Given a set of points in a metric space, the goal is to maximize an objective that combines the total utility of the points and a diversity term that captures the minimum distance between any pair of selected points, subject to the constraint $|S| \le k$. For example, the points may correspond to training examples in a data sampling problem, e.g., learned embeddings of images extracted from a deep neural network. This work presents the $\texttt{GIST}$ algorithm, which achieves a $\frac{2}{3}$-approximation guarantee for $\textsf{MDDS}$ by approximating a series of maximum independent set problems with a bicriteria greedy algorithm. We also prove a complementary $(\frac{2}{3}+\varepsilon)$-hardness of approximation, for any $\varepsilon > 0$. Finally, we provide an empirical study that demonstrates that $\texttt{GIST}$ outperforms existing methods for $\textsf{MDDS}$ on synthetic data, and also in a real-world image classification experiment that studies single-shot subset selection for ImageNet.
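
The idea of thresholding on distance and greedily building an independent set can be illustrated with a minimal Python sketch. This is not the authors' implementation and makes no claim to the $\frac{2}{3}$ guarantee. The abstract does not give the exact objective, so the weighted sum with trade-off parameter `alpha`, the zero diversity for sets with fewer than two points, the Euclidean metric, the choice of candidate thresholds, and all function names below are assumptions made for illustration.

```python
import itertools
import math


def min_pairwise_distance(points, subset):
    """Diversity term: smallest distance between any two selected points."""
    if len(subset) < 2:
        return 0.0  # assumption: fewer than two points gives zero diversity
    return min(
        math.dist(points[i], points[j])
        for i, j in itertools.combinations(subset, 2)
    )


def objective(points, utility, subset, alpha):
    """Assumed objective: alpha * total utility + (1 - alpha) * min distance."""
    total_utility = sum(utility[i] for i in subset)
    diversity = min_pairwise_distance(points, subset)
    return alpha * total_utility + (1 - alpha) * diversity


def greedy_independent_set(points, utility, k, d):
    """Scan points by decreasing utility and keep a point only if it is at
    distance >= d from every point kept so far, stopping at k points.
    The kept points form an independent set in the graph that joins
    points closer than d."""
    chosen = []
    for i in sorted(range(len(points)), key=lambda idx: -utility[idx]):
        if len(chosen) == k:
            break
        if all(math.dist(points[i], points[j]) >= d for j in chosen):
            chosen.append(i)
    return chosen


def threshold_sweep(points, utility, k, alpha):
    """Try each candidate threshold and return the best subset found.
    Assumption: candidates are 0 plus every pairwise distance."""
    thresholds = {0.0}
    thresholds.update(
        math.dist(p, q) for p, q in itertools.combinations(points, 2)
    )
    best_subset, best_value = [], float("-inf")
    for d in sorted(thresholds):
        subset = greedy_independent_set(points, utility, k, d)
        value = objective(points, utility, subset, alpha)
        if value > best_value:
            best_subset, best_value = subset, value
    return best_subset, best_value


# Toy usage: four points on a line, with utilities favoring the first two.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (10.0, 0.0)]
utility = [1.0, 0.9, 0.5, 0.4]
subset, value = threshold_sweep(points, utility, k=2, alpha=0.5)
```

In this toy instance the two highest-utility points are nearly identical, so the sweep prefers a far-apart pair once the threshold is large enough. Enumerating all pairwise distances is quadratic in the number of points and is only meant for small examples.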

Comments:	15 pages, 1 figure
Subjects:	Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Cite as:	arXiv:2405.18754 [cs.DS]
	(or arXiv:2405.18754v1 [cs.DS] for this version)
	https://doi.org/10.48550/arXiv.2405.18754

Submission history

From: Matthew Fahrbach [view email]
[v1] Wed, 29 May 2024 04:39:24 UTC (730 KB)
