The textual characteristics of traditional and Open Access scientific journals are similar

BMC Bioinformatics. 2009 Jun 15;10:183. doi: 10.1186/1471-2105-10-183.

Abstract

Background: Recent years have seen an increased amount of natural language processing (NLP) work on full-text biomedical journal publications. Much of this work is done with Open Access journal articles. Such work assumes that Open Access articles are representative of biomedical publications in general and that methods developed for analysis of Open Access full-text publications will generalize to the biomedical literature as a whole. If this assumption is wrong, the cost to the community will be large, including not just wasted resources, but also flawed science. This paper examines that assumption.

Results: We collected two sets of documents, one consisting only of Open Access publications and the other consisting only of traditional journal publications. We examined them for differences in surface linguistic structures that have obvious consequences for the ease or difficulty of natural language processing and for differences in semantic content as reflected in lexical items. Regarding surface linguistic structures, we examined the incidence of conjunctions, negation, passives, and pronominal anaphora, and found that the two collections did not differ. We also examined the distribution of sentence lengths and found that both collections were characterized by the same mode. Regarding lexical items, we found that the Kullback-Leibler divergence between the two collections was low, and was lower than the divergence between either collection and a reference corpus. Where small differences did exist, log likelihood analysis showed that they were primarily in the area of formatting and in specific named entities.
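The lexical comparison described above can be illustrated with a minimal sketch of Kullback-Leibler divergence between the unigram distributions of two token collections. The abstract does not state how the authors tokenized, smoothed, or chose a vocabulary, so the add-alpha smoothing, the shared-vocabulary choice, the function names (word_distribution, kl_divergence, corpus_kl), and the toy sentences below are all assumptions made for illustration, not the paper's method or data.

```python
import math
from collections import Counter


def word_distribution(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram probabilities over a shared vocabulary.

    Smoothing is an assumption of this sketch: it keeps every probability
    non-zero so the divergence is always finite.
    """
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """D_KL(P || Q) in bits; p and q must be defined over the same keys."""
    return sum(p[w] * math.log2(p[w] / q[w]) for w in p)


def corpus_kl(tokens_a, tokens_b, alpha=1.0):
    """KL divergence from collection A to collection B over their joint vocabulary."""
    vocab = set(tokens_a) | set(tokens_b)
    p = word_distribution(tokens_a, vocab, alpha)
    q = word_distribution(tokens_b, vocab, alpha)
    return kl_divergence(p, q)


# Hypothetical toy collections, invented for this example.
open_access = "the protein binds the receptor".split()
traditional = "the protein binds the ligand".split()
reference = "the market fell on the news".split()

print(corpus_kl(open_access, traditional))  # small: the collections share most words
print(corpus_kl(open_access, reference))    # larger: little lexical overlap
```

Note that KL divergence is asymmetric, so corpus_kl(a, b) and corpus_kl(b, a) generally differ; the abstract does not say which direction was reported or whether a symmetrized variant was used.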

Conclusion: We did not find structural or semantic differences between the Open Access and traditional journal collections.

Publication types

  • Research Support, N.I.H., Extramural

MeSH terms

  • Access to Information*
  • Biomedical Research
  • Databases, Bibliographic
  • Linguistics*
  • Natural Language Processing
  • Periodicals as Topic*