Metadata Archaeology: Unearthing Data Subsets by Leveraging Training Dynamics

Siddiqui, Shoaib Ahmed; Rajkumar, Nitarshan; Maharaj, Tegan; Krueger, David; Hooker, Sara

arXiv:2209.10015 (cs)

[Submitted on 20 Sep 2022]

Abstract: Modern machine learning research relies on relatively few carefully curated datasets. Even in these datasets, and typically in 'untidy' or raw data, practitioners are faced with significant issues of data quality and diversity which can be prohibitively labor-intensive to address. Existing methods for dealing with these challenges tend to make strong assumptions about the particular issues at play, and often require a priori knowledge or metadata such as domain labels. Our work is orthogonal to these methods: we instead focus on providing a unified and efficient framework for Metadata Archaeology, that is, uncovering and inferring metadata of examples in a dataset. We curate different subsets of data that might exist in a dataset (e.g. mislabeled, atypical, or out-of-distribution examples) using simple transformations, and leverage differences in learning dynamics between these probe suites to infer metadata of interest. Our method is on par with far more sophisticated mitigation methods across different tasks: identifying and correcting mislabeled examples, classifying minority-group samples, prioritizing points relevant for training, and enabling scalable human auditing of relevant examples.
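The idea of inferring metadata from learning dynamics can be illustrated with a small sketch. It is not taken from the abstract, which does not specify the signal or the classifier; the sketch assumes that each example is summarized by its per-epoch training loss and that an unlabeled example inherits the majority category of its nearest probe trajectories. The function name `infer_metadata`, the parameter `k`, and all trajectory values are hypothetical and hand-made for illustration, not results.

```python
import numpy as np


def infer_metadata(probe_trajs, probe_labels, query_trajs, k=3):
    """Label each query trajectory by majority vote over its k nearest
    probe trajectories (Euclidean distance). Illustrative assumption only."""
    probe_trajs = np.asarray(probe_trajs, dtype=float)
    query_trajs = np.asarray(query_trajs, dtype=float)
    probe_labels = np.asarray(probe_labels)

    # Pairwise distances, shape (n_query, n_probe).
    diffs = query_trajs[:, None, :] - probe_trajs[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)

    # Indices of the k closest probe examples for each query.
    nearest = np.argsort(dists, axis=1)[:, :k]

    predictions = []
    for row in nearest:
        labels, counts = np.unique(probe_labels[row], return_counts=True)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)


# Toy per-epoch losses (4 epochs). Made-up numbers, not measurements.
probe_trajs = [
    [2.3, 1.0, 0.3, 0.1],  # "typical": loss drops quickly
    [2.2, 0.9, 0.2, 0.1],
    [2.4, 1.1, 0.4, 0.1],
    [2.5, 2.4, 2.0, 1.2],  # "mislabeled": loss stays high for longer
    [2.6, 2.5, 2.1, 1.4],
    [2.4, 2.3, 1.9, 1.1],
]
probe_labels = ["typical"] * 3 + ["mislabeled"] * 3

query_trajs = [
    [2.3, 1.0, 0.25, 0.1],
    [2.5, 2.4, 2.0, 1.3],
]

predicted = infer_metadata(probe_trajs, probe_labels, query_trajs, k=3)
print(list(predicted))
```

In this toy setup the probe suites play the role of labeled reference sets, so adding a further metadata category would only require a further group of probe trajectories.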

Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2209.10015 [cs.LG]
	(or arXiv:2209.10015v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2209.10015

Submission history

From: Shoaib Ahmed Siddiqui
[v1] Tue, 20 Sep 2022 21:52:39 UTC (18,720 KB)
