CoagMDB: a database analysis of missense mutations within four conserved domains in five vitamin K-dependent coagulation serine proteases using a text-mining tool

Hum Mutat. 2008 Mar;29(3):333-44. doi: 10.1002/humu.20629.

Abstract

Central repositories of mutations that combine structural, sequence, and phenotypic information in related proteins will facilitate the diagnosis and molecular understanding of diseases associated with them. Coagulation involves the sequential activation of serine proteases and regulators in order to yield stable blood clots while maintaining hemostasis. Five coagulation serine proteases, namely factor VII (F7), factor IX (F9), factor X (F10), protein C (PROC), and thrombin (F2), exhibit high sequence similarity, and all require vitamin K. All five were incorporated into an interactive database of mutations named CoagMDB (http://www.coagMDB.org; last accessed: 9 August 2007). The large number of mutations involved (especially for factor IX) and the increasing problem of out-of-date databases required the development of new database management tools. A text mining tool automatically scans full-length references to identify and extract mutations. High recall rates between 96 and 99% and precision rates of 87 to 93% were achieved. Text mining significantly reduces the time and expertise required to maintain the databases and offers a solution to the problem of locus-specific database management and upkeep. A total of 875 mutations were extracted from 1,279 literature sources. Of these, 116 correspond to Gla domains, 86 to the N-terminal EGF domain, 73 to the C-terminal EGF domain, and 477 to the serine protease domain. The combination of text mining and consensus domain structures enables mutations to be correlated with experimentally measurable phenotypes based on either low protein levels (Type I) or reduced functional activity (Type II). A tendency for the conservation of phenotype with structural location was identified.
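The abstract does not describe how the text mining tool works internally, so the following is only a minimal sketch of the general idea, not the CoagMDB implementation. It assumes mutations are written in three-letter form (for example "Arg333Gln"), extracts them with a regular expression, and scores the result against a hand-curated set using the standard definitions of precision and recall. The pattern, the function names, and the sample sentence are all hypothetical and were made up for illustration; none of them come from the database.

```python
import re

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

AA3 = "|".join(THREE_TO_ONE)
# Hypothetical pattern: matches "Arg333Gln" and "Arg-333-Gln" style mentions.
PATTERN = re.compile(rf"\b({AA3})-?(\d+)-?({AA3})\b")


def extract_mutations(text):
    """Return a set of (wild_type, position, mutant) tuples found in text."""
    found = set()
    for wild_type, position, mutant in PATTERN.findall(text):
        # A missense mutation replaces one residue with a different one,
        # so mentions such as "Val10Val" are skipped.
        if wild_type != mutant:
            found.add((THREE_TO_ONE[wild_type], int(position), THREE_TO_ONE[mutant]))
    return found


def precision_recall(predicted, gold):
    """Precision = TP / predicted, recall = TP / gold (0.0 if undefined)."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Invented sentence and invented curated set, for illustration only.
SAMPLE_TEXT = (
    "The patient carried Arg333Gln and Gly-60-Ser; "
    "the silent change Val10Val was also observed."
)
SAMPLE_GOLD = {("R", 333, "Q"), ("G", 60, "S"), ("C", 350, "S")}

if __name__ == "__main__":
    predicted = extract_mutations(SAMPLE_TEXT)
    print(sorted(predicted))
    print(precision_recall(predicted, SAMPLE_GOLD))
```

A real extractor would also need to handle one-letter notation, nucleotide-level descriptions, and differing residue numbering schemes, which is presumably where most precision errors arise; this sketch ignores all of these.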

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Amino Acid Sequence
  • Amino Acid Substitution
  • Blood Coagulation Factors / chemistry
  • Blood Coagulation Factors / genetics*
  • Consensus Sequence
  • Conserved Sequence / genetics
  • Databases, Genetic*
  • Factor IX / chemistry
  • Factor IX / genetics
  • Factor VII / chemistry
  • Factor VII / genetics
  • Factor X / chemistry
  • Factor X / genetics
  • Humans
  • Models, Genetic
  • Models, Molecular
  • Molecular Sequence Data
  • Mutation, Missense*
  • Natural Language Processing
  • Protein C / chemistry
  • Protein C / genetics
  • Protein Structure, Tertiary
  • Sequence Homology, Amino Acid
  • Serine Endopeptidases / chemistry
  • Serine Endopeptidases / genetics*
  • Thrombin / chemistry
  • Thrombin / genetics
  • Vitamin K / metabolism

Substances

  • Blood Coagulation Factors
  • Protein C
  • Vitamin K
  • Factor VII
  • Factor IX
  • Factor X
  • Serine Endopeptidases
  • Thrombin