• Letter
  • Open Access

Correlation dimension of natural language in a statistical manifold

Xin Du and Kumiko Tanaka-Ishii
Phys. Rev. Research 6, L022028 – Published 2 May 2024

Abstract

The correlation dimension of natural language is measured by applying the Grassberger-Procaccia algorithm to high-dimensional sequences produced by a large-scale language model. This method, previously studied only in a Euclidean space, is reformulated in a statistical manifold via the Fisher-Rao distance. Language exhibits a multifractal, with global self-similarity and a universal dimension around 6.5, which is smaller than those of simple discrete random sequences and larger than that of a Barabási-Albert process. Long memory is the key to producing self-similarity. Our method is applicable to any probabilistic model of real-world discrete sequences, and we show an application to music data.
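As a rough illustration of the procedure the abstract describes, the sketch below computes the Fisher-Rao distance between categorical distributions, evaluates the Grassberger-Procaccia correlation integral over all pairs of points, and estimates a dimension from the log-log slope. It is a minimal sketch, not the authors' implementation. It assumes each point is a probability distribution over a finite set of outcomes. The function names, the Dirichlet toy data standing in for language-model output distributions, and the choice of radii are all hypothetical.

```python
import numpy as np

def fisher_rao_distance(p, q):
    """Fisher-Rao distance between categorical distributions.

    Uses the closed form 2 * arccos(sum_i sqrt(p_i * q_i)).
    p and q may be 1-D vectors or 2-D arrays with one distribution per row.
    """
    bc = np.sqrt(p) @ np.sqrt(q).T            # Bhattacharyya coefficients
    return 2.0 * np.arccos(np.clip(bc, 0.0, 1.0))

def correlation_integral(points, radii):
    """Fraction of distinct point pairs closer than each radius."""
    d = fisher_rao_distance(points, points)
    upper = np.triu_indices(len(points), k=1)  # each unordered pair once
    pair_d = d[upper]
    return np.array([(pair_d < r).mean() for r in radii])

def correlation_dimension(points, radii):
    """Slope of log C(r) against log r, fitted by least squares."""
    c = correlation_integral(points, radii)
    mask = c > 0                               # log is undefined at C = 0
    slope, _ = np.polyfit(np.log(radii[mask]), np.log(c[mask]), 1)
    return slope

# Toy data (an assumption for illustration): 500 random distributions
# over 5 outcomes, in place of distributions from a language model.
rng = np.random.default_rng(0)
toy = rng.dirichlet(np.ones(5), size=500)
radii = np.logspace(-0.7, 0.0, 10)
dim = correlation_dimension(toy, radii)
```

In practice the slope should be read off only over the range of radii where the log-log curve is approximately linear. The fixed range used here is a placeholder, not a recommended setting.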

  • Received 25 July 2023
  • Revised 24 January 2024
  • Accepted 9 April 2024

DOI: https://doi.org/10.1103/PhysRevResearch.6.L022028

Published by the American Physical Society under the terms of the Creative Commons Attribution 4.0 International license. Further distribution of this work must maintain attribution to the author(s) and the published article's title, journal citation, and DOI.

Physics Subject Headings (PhySH)

Statistical Physics & Thermodynamics

Authors & Affiliations

Xin Du1,* and Kumiko Tanaka-Ishii2,†

  • 1Waseda Research Institute for Science and Engineering, Waseda University, 3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
  • 2Department of Computer Science and Engineering, School of Fundamental Science and Engineering, Waseda University, 3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan

Issue

Vol. 6, Iss. 2 — May - July 2024
