Establishing vocabulary tests as a benchmark for evaluating large language models

PLoS One. 2024 Dec 12;19(12):e0308259. doi: 10.1371/journal.pone.0308259. eCollection 2024.

Abstract

Vocabulary tests, once a cornerstone of language modeling evaluation, have been largely overlooked in the current landscape of Large Language Models (LLMs) like Llama 2, Mistral, and GPT. While most LLM evaluation benchmarks focus on specific tasks or domain-specific knowledge, they often neglect the fundamental linguistic aspects of language understanding. In this paper, we advocate for the revival of vocabulary tests as a valuable tool for assessing LLM performance. We evaluate seven LLMs using two vocabulary test formats across two languages and uncover surprising gaps in their lexical knowledge. These findings shed light on the intricacies of LLM word representations, their learning mechanisms, and performance variations across models and languages. Moreover, the ability to automatically generate and perform vocabulary tests offers new opportunities to expand the approach and provide a more complete picture of LLMs' language skills.
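The paper does not publish its test items here, but the idea of automatically generating and administering a vocabulary test to an LLM can be illustrated with a minimal sketch. The multiple-choice synonym format, the example item, and the ask_model() stub below are illustrative assumptions, not the authors' actual test formats or code.

    # Minimal sketch (not the authors' code): scoring an LLM on one
    # multiple-choice vocabulary item. Format, item, and ask_model()
    # are hypothetical placeholders.

    def format_item(word, options):
        """Build a prompt asking the model to pick the closest synonym."""
        letters = "ABCD"
        lines = [f"Which word is closest in meaning to '{word}'?"]
        lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
        lines.append("Answer with a single letter.")
        return "\n".join(lines)

    def ask_model(prompt):
        """Placeholder for a call to an LLM API or local model."""
        raise NotImplementedError("plug in your model call here")

    def score(reply, correct_letter):
        """Return 1 if the model's reply starts with the correct letter."""
        return int(reply.strip().upper().startswith(correct_letter))

    if __name__ == "__main__":
        prompt = format_item("lucid", ["clear", "heavy", "distant", "sour"])
        print(prompt)
        # reply = ask_model(prompt)
        # print(score(reply, "A"))

Aggregating such item-level scores over a large, automatically generated item bank is what would let this style of benchmark scale across models and languages.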

MeSH terms

  • Benchmarking*
  • Humans
  • Language Tests* / standards
  • Language*
  • Models, Theoretical
  • Vocabulary*

Grants and funding

This work was partially supported by the project CyberTutor: Asistente educativo personalizado basado en Grandes Modelos de Lenguaje (LLM) (a personalized educational assistant based on Large Language Models), funded by the “Primeros Proyectos” call from ETSIT, UPM; by the FUN4DATE (PID2022-136684OB-C22) and ENTRUDIT (TED2021-130118B-I00) projects funded by the Spanish Agencia Estatal de Investigación (AEI); by the Chips Act Joint Undertaking project SMARTY (Grant no. 101140087); and by the OpenAI API Research Access Program. The funders played no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.