Curating global datasets of structural linguistic features for independence

Anna Graff; Natalia Chousou-Polydouri; David Inman; Hedvig Skirgård; Marc Lischka; Taras Zakharko; Chiara Barbieri; Balthasar Bickel

doi:10.1038/s41597-024-04319-4

Sci Data. 2025 Jan 18;12(1):106. doi: 10.1038/s41597-024-04319-4.

Authors

Anna Graff¹ ², Natalia Chousou-Polydouri³ ⁴, David Inman³, Hedvig Skirgård⁵, Marc Lischka⁶, Taras Zakharko³, Chiara Barbieri³ ⁷ ⁸, Balthasar Bickel³

Affiliations

¹ Institute for the Interdisciplinary Study of Language Evolution (ISLE), University of Zurich, Zürich, Switzerland.
² Department of Evolutionary Biology and Environmental Studies, University of Zurich, Zürich, Switzerland.
³ Institute for the Interdisciplinary Study of Language Evolution (ISLE), University of Zurich, Zürich, Switzerland.
⁴ Institute for Mediterranean Studies, Foundation for Research and Technology - Hellas, Rethymno, Greece.
⁵ Department of Linguistic and Cultural Evolution, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany.
⁶ Institute of Mathematics, University of Zurich, Zürich, Switzerland.
⁷ Department of Evolutionary Biology and Environmental Studies, University of Zurich, Zürich, Switzerland.
⁸ Dipartimento di Scienze della vita e dell'ambiente, Università degli Studi di Cagliari, Cagliari, Italy.

PMID: 39827249

Abstract

The increasing availability of cross-linguistic databases documenting morphosyntactic, lexical and phonological features has led to a proliferation of studies that use such data to investigate language evolution and human history. However, most of these databases were not designed to ensure independence of features, so it is not valid to use all their features jointly in large-scale statistical analyses that assume independent inputs. Here, we curate published data from five large linguistic databases to generate two global-scale cross-linguistic datasets: GBI (from the Grambank dataset) and TLI (using inputs from the World Atlas of Language Structures, AUTOTYP, PHOIBLE and Lexibank). The datasets minimize logical dependencies between features, as well as forms of strong statistical dependency that go beyond phylogenetic and geographical signal. They are also made available in densified form, which reduces the proportion of missing data. We document our curation principles and workflows so that the framework can be reused with other inputs or other thresholds of independence. Our curation steps on both datasets reveal robust and comparable global patterns of structural linguistic diversity.

Publication types

Dataset

MeSH terms

Data Curation
Databases, Factual
Humans
Language
Linguistics*

Grants and funding

51NF40_180888/Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung (Swiss National Science Foundation)