Collagen is an important structural protein and the most abundant protein in mammals. Structural analysis of collagens is performed in several research fields. Fibrillar collagens consist almost entirely of continuous repeats of GXY, where G is glycine, X is often proline or alanine, and Y is often hydroxyproline or alanine. In the present study, the collagen structure was investigated in detail at the nucleotide, codon group, amino acid and target peptide level using sequence analyses. One of the most important findings was that a selection of codon groups is predominantly involved in amino acid changes between closely related collagens, whereas other change routes appear when collagens are less closely related. The findings of the sequence analyses were used to evaluate reported sequences of non-avian dinosaur species and database entries of duck and chicken collagen. The duck assessment was supported by an experimental data set, obtained by collagen extraction from duck skin followed by digestion and LC-MS analysis. Database entries of chicken and duck collagen 3α1 were found to contain unreliable features, such as missing parts, an interrupted GXY pattern and too many interspecies differences. As an example, the erroneous nature of one of these unreliable features was confirmed experimentally using LC-MS. Finally, dinosaur and bird collagen 1α1 were compared. The results show that a domain-specific proteogenomic analysis provides very useful information for assessing de novo sequencing results and database information of collagens. Furthermore, it offers deeper insight into the functional restrictions and routes of evolutionary divergence.
Keywords: Collagen; De novo sequencing; Domain-specific; GXY domain; LC-MS; Proteogenomics.
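As an illustration of the continuous GXY criterion mentioned above, the following minimal sketch flags positions where glycine is expected but absent. It is not the procedure used in the study; the function names, the `start` parameter and the example fragments are hypothetical, and sequences are assumed to be in one-letter code with hydroxyproline written as P.

```python
def gxy_breaks(sequence: str, start: int = 0) -> list[int]:
    """Return 0-based positions where a glycine is expected but not found.

    Assumes the GXY repeat begins at index `start` (hypothetical parameter),
    so every third residue from there on should be 'G'.
    """
    seq = sequence.upper()
    return [i for i in range(start, len(seq), 3) if seq[i] != "G"]


def is_continuous_gxy(sequence: str, start: int = 0) -> bool:
    """True if every third residue from `start` is glycine."""
    return not gxy_breaks(sequence, start)


# Hypothetical fragments, for illustration only.
intact = "GPPGAPGLP"
shifted = "GPPAPGLPG"  # one residue missing after the first triplet

print(is_continuous_gxy(intact))   # True
print(gxy_breaks(shifted))         # [3, 6]
```

A check of this kind only reports where the repeat is interrupted; whether an interruption reflects a database error or a genuine feature still has to be judged against related sequences or experimental data.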