This study addresses the relationship between structural and functional similarity in proteins. We introduce a novel method, tree based on root mean square deviation (T-RMSD), which uses variations in distance RMSD (dRMSD) to build fine-grained structure-based classifications of proteins. The main improvement of T-RMSD over similar methods, such as Dali, is its capacity to produce the equivalent of a bootstrap value for each cluster node. We validated our approach on two domain families that have been studied extensively for their roles in many biological and pathological pathways: the small GTPase RAS superfamily and the cysteine-rich domains (CRDs) associated with the tumor necrosis factor receptor (TNFR) family. Our analysis showed that T-RMSD is able to automatically recover and refine existing classifications. In the case of the small GTPase ARF subfamily, T-RMSD can distinguish GTP-bound from GDP-bound states, while in the case of CRDs it can identify two new subgroups associated with well-defined functional features (ligand binding and formation of the ligand pre-assembly complex). We show how hidden Markov models (HMMs) can be built on these new groups and propose a methodology for using these models simultaneously to perform fine-grained functional genomic annotation in the absence of known 3D structures. T-RMSD is open-source freeware incorporated in the T-Coffee package and is available online.
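To make the dRMSD quantity concrete, the following is a minimal sketch of a distance RMSD between two structures, using the commonly cited definition: the root mean square of the differences between corresponding intramolecular pairwise distances. It is an illustration only, not the T-RMSD implementation. The function names `pairwise_distances` and `drmsd` are hypothetical, and the inputs are assumed to be coordinates of equivalent atoms (for example, C-alpha atoms already matched by an alignment) listed in the same order.

```python
import math
from itertools import combinations


def pairwise_distances(coords):
    """Return all intramolecular distances d_ij (i < j) for one structure."""
    return [math.dist(a, b) for a, b in combinations(coords, 2)]


def drmsd(coords_a, coords_b):
    """Distance RMSD between two structures with matched atoms.

    Assumption: coords_a[k] and coords_b[k] are equivalent positions,
    each given as an (x, y, z) tuple.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must have the same number of atoms")
    if len(coords_a) < 2:
        raise ValueError("at least two atoms are needed to form a distance")

    dist_a = pairwise_distances(coords_a)
    dist_b = pairwise_distances(coords_b)

    # Mean squared difference over all N(N-1)/2 atom pairs.
    mean_sq = sum((x - y) ** 2 for x, y in zip(dist_a, dist_b)) / len(dist_a)
    return math.sqrt(mean_sq)


# Toy coordinates (not real protein data).
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
translated = [(x + 5.0, y + 5.0, z + 5.0) for x, y, z in reference]
stretched = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 1.0, 0.0)]

same = drmsd(reference, translated)   # rigid translation leaves distances unchanged
diff = drmsd(reference, stretched)    # one atom moved, so some distances change
```

Because only intramolecular distances are compared, this measure requires no superposition of the two structures: a rigid translation or rotation of either structure leaves the value unchanged.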
© 2010 Elsevier Ltd. All rights reserved.