In multi-domain proteins, the domains typically run end-to-end, that is, one domain follows the C-terminus of another domain. However, approximately 10% of multi-domain proteins are formed by insertion of one domain sequence into that of another domain. Detecting such insertions within protein sequences is a fundamental challenge in structural biology. The haloacid dehalogenase superfamily (HADSF) serves as a challenging model system wherein a variable cap domain (∼5–200 residues in length) accessorizes the ubiquitous Rossmann-fold core domain, with variations in insertion site and topology corresponding to different classes of cap types. Herein, we describe a comprehensive computational strategy, CapPredictor, for determining large, variable domain insertions in protein sequences. Using a novel sequence-alignment algorithm in conjunction with a structure-guided sequence profile from 154 core-domain-only structures, more than 40,000 HADSF member sequences were assigned cap types. The resulting data set afforded insight into HADSF evolution. Notably, a similar distribution of cap-type classes across different phyla was observed, indicating that all cap types existed in the last universal common ancestor. In addition, comparative analyses of the predicted cap-type and functional assignments showed that different cap types carry out similar chemistries. Thus, while cap domains play a role in substrate recognition and chemical reactivity, cap type does not strictly define functional class. Through this example, we have shown that CapPredictor is an effective new tool for the study of form and function in protein families where domain insertion occurs.
Keywords: domain-boundary prediction; protein evolution; sequence analysis; structure-function relationship.
© 2014 Wiley Periodicals, Inc.