Established human breast cancer cell lines are widely used as experimental models in breast cancer research. While these cell lines and their variants share many phenotypic characteristics with human breast tumors, the extent to which they reflect the underlying molecular biology of breast cancer remains controversial. We explored this issue using a probabilistic rather than heuristic approach. Data from gene expression microarrays were used to compare the global structures of the transcriptomes of three estrogen receptor alpha positive (ER+) human breast cancer cell lines (MCF-7, T47D, ZR-75-1) and 13 human breast tumors (11 ER+; 2 ER-). Linear representations of the respective data structures were obtained by deriving the top M principal components (PCs) required to capture ≥80% of the cumulative variance for each data set. We then identified the genes most highly correlated with the M PCs (Pearson's correlation coefficient r ≥ 0.800) and found a group of 36 genes correlated with both the cell line (M = 5 PCs) and tumor (M = 6 PCs) data structures. In the breast tumor data, all 36 common genes were correlated with PC1. In the cell line data, 21/36 genes were correlated with PC1, 14/36 with PC2, and 1/36 with PC3. Genes important in defining the data structures include NF-κB p65, IGFBP-6, ornithine decarboxylase-1, and paxillin. When data from MDA-MB-435 xenografts (ER-) were included in the analysis, we were unable to find any genes common to these xenografts and the breast tumors. These data imply that MCF-7, T47D, and ZR-75-1 cells and ER+ breast tumors share substantial global similarities in the structures of their respective transcriptomes, and that these cell lines are good models in which to identify molecular events that are likely to be important in some ER+ human breast cancers.
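
The selection procedure described above (retain the top M PCs that reach 80% cumulative variance, then keep the genes with r ≥ 0.800 against any retained PC, then intersect the two gene sets) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the matrix orientation, preprocessing, or software, so the samples-by-genes layout, the function names `top_pcs`, `correlated_genes`, and `common_genes`, and the random demonstration data are all hypothetical.

```python
import numpy as np


def top_pcs(X, var_threshold=0.80):
    """Return (scores, M) for a samples x genes matrix X (assumed layout).

    M is the smallest number of leading PCs whose cumulative explained
    variance reaches var_threshold.
    """
    Xc = X - X.mean(axis=0)                      # center each gene
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    explained = s**2 / np.sum(s**2)              # variance fraction per PC
    M = int(np.searchsorted(np.cumsum(explained), var_threshold)) + 1
    M = min(M, len(s))                           # guard against rounding
    scores = U[:, :M] * s[:M]                    # sample scores on M PCs
    return scores, M


def correlated_genes(X, scores, r_min=0.800):
    """Indices of genes whose Pearson r with any retained PC is >= r_min.

    The signed threshold follows the abstract's wording; because the sign
    of a PC is arbitrary, some analyses use |r| instead.
    """
    Xc = X - X.mean(axis=0)
    Sc = scores - scores.mean(axis=0)
    denom = np.outer(np.linalg.norm(Xc, axis=0), np.linalg.norm(Sc, axis=0))
    with np.errstate(divide="ignore", invalid="ignore"):
        r = (Xc.T @ Sc) / denom                  # genes x M; NaN if constant
    return set(np.flatnonzero((r >= r_min).any(axis=1)).tolist())


def common_genes(X_a, X_b, var_threshold=0.80, r_min=0.800):
    """Genes selected in both data sets (columns must be the same genes)."""
    genes_a = correlated_genes(X_a, top_pcs(X_a, var_threshold)[0], r_min)
    genes_b = correlated_genes(X_b, top_pcs(X_b, var_threshold)[0], r_min)
    return genes_a & genes_b


# Synthetic demonstration only: random numbers, not data from the study.
rng = np.random.default_rng(0)
cell_lines = rng.normal(size=(9, 200))   # hypothetical: 9 arrays x 200 genes
tumors = rng.normal(size=(13, 200))      # hypothetical: 13 arrays x 200 genes
print("M (cell lines):", top_pcs(cell_lines)[1])
print("M (tumors):", top_pcs(tumors)[1])
print("common genes:", len(common_genes(cell_lines, tumors)))
```

Computing the PCs through a singular value decomposition of the centered matrix avoids forming a gene-by-gene covariance matrix, which matters when genes far outnumber samples, as is typical for microarray data.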