High-throughput genomic technologies enable researchers to identify genes that are co-regulated with respect to specific experimental conditions. Numerous statistical approaches have been developed to identify differentially expressed genes. Because each approach can produce distinct gene sets, it is difficult for biologists to determine which statistical approach yields biologically relevant gene sets and is appropriate for their study. To address this issue, we implemented Latent Semantic Indexing (LSI) to determine the functional coherence of gene sets. An LSI model was built using over 1 million Medline abstracts for over 20,000 mouse and human genes annotated in Entrez Gene. The gene-to-gene LSI-derived similarities were used to calculate a literature cohesion p-value (LPv) for a given gene set using a Fisher's exact test. We tested this method against genes in more than 6,000 functional pathways annotated in Gene Ontology (GO) and found that approximately 75% of gene sets in GO biological process category and 90% of the gene sets in GO molecular function and cellular component categories were functionally cohesive (LPv<0.05). These results indicate that the LPv methodology is both robust and accurate. Application of this method to previously published microarray datasets demonstrated that LPv can be helpful in selecting the appropriate feature extraction methods. To enable real-time calculation of LPv for mouse or human gene sets, we developed a web tool called Gene-set Cohesion Analysis Tool (GCAT). GCAT can complement other gene set enrichment approaches by determining the overall functional cohesion of data sets, taking into account both explicit and implicit gene interactions reported in the biomedical literature.
Availability: GCAT is freely available at http://binf1.memphis.edu/gcat.