Dealing with zero word frequencies: a review of the existing rules of thumb and a suggestion for an evidence-based choice

Behav Res Methods. 2013 Jun;45(2):422-30. doi: 10.3758/s13428-012-0270-5.

Abstract

In a critical review of the heuristics used to deal with zero word frequencies, we show that four are suboptimal, one is good, and one may be acceptable. The four suboptimal strategies are discarding words with zero frequencies, giving words with zero frequencies a very low frequency, adding 1 to the frequency per million, and making use of the Good-Turing algorithm. The good strategy is the Laplace transformation, which consists of adding 1 to each frequency count and increasing the total corpus size by the number of word types observed. A strategy that may be acceptable is to estimate the frequencies of the missing words on the basis of other corpora and then to increase the total corpus size by the summed estimated frequency of the missing words. A comparison with the lexical decision times of the English Lexicon Project and the British Lexicon Project suggests that the Laplace transformation gives the most useful estimates (in addition to being easy to calculate). Therefore, we recommend it to researchers.
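As a rough illustration of the recommended Laplace transformation, the sketch below applies add-one smoothing to a few made-up word counts and converts them to frequency per million. The corpus size, type count, and example words are assumptions chosen for illustration, not values taken from the paper.

    # A minimal sketch of the Laplace (add-one) transformation described in
    # the abstract: add 1 to every raw count and add the number of observed
    # word types to the corpus size before converting to frequency per million.
    # All numbers below are illustrative, not from the paper.

    def laplace_fpm(raw_count, corpus_size, n_types):
        """Laplace-smoothed frequency per million words."""
        return (raw_count + 1) / (corpus_size + n_types) * 1_000_000

    corpus_size = 51_000_000   # total corpus tokens (assumed figure)
    n_types = 300_000          # distinct word types observed (assumed figure)

    for word, count in [("the", 3_000_000), ("zygote", 12), ("unattested", 0)]:
        fpm = laplace_fpm(count, corpus_size, n_types)
        print(f"{word:>10}: {fpm:.4f} per million")

Note that a word with a raw count of 0 still receives a small nonzero frequency, so it can be retained in regression analyses rather than being discarded or assigned an arbitrary floor value.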

MeSH terms

  • Algorithms*
  • Behavioral Research / methods*
  • Choice Behavior
  • Evidence-Based Medicine
  • Humans
  • Language
  • Recognition, Psychology*
  • Vocabulary