Background: A comprehensive understanding of intratumor heterogeneity requires identifying molecular markers that can differentiate subpopulations of cells and that also have clinical significance. One important tool for addressing this issue is single-cell RNA sequencing (scRNASeq), which quantifies the expression profiles of transcripts in individual cells within a population of cancer cells. Current studies use these expression profiles to group cells into subpopulations with clustering algorithms. In this study, we explore scRNASeq cancer data from a different perspective. We focus on scRNASeq data from cancer cells of a particular cancer type, where the cell type or subpopulation to which each cell belongs is known. We investigate whether the "cell type" of a cancer cell can be predicted from the expression profiles of a small set of transcripts.
Results: We outline a predictive analytics pipeline that accurately predicts 6 breast cancer cell types from single-cell gene expression profiles. Instead of building predictive models on the complete set of human transcripts, the pipeline first eliminates predictors with low expression and low variance. A multinomial penalized logistic regression then reduces the number of predictors to 308, of which 34 are long non-coding RNAs (lncRNAs). Tuning of the predictive models shows that support vector machines and neural networks are the most accurate, achieving prediction accuracies close to 98%. We also find that a mixture of protein-coding genes (PCGs) and lncRNAs predicts better than either set of transcripts used separately. A signature risk score derived from 65 PCG and 5 lncRNA predictors is associated with the survival of TCGA breast cancer patients. This association was maintained when the risk scores were generated from the 65 PCGs and the 5 lncRNAs separately. We further show that predictors restricted to a particular cell type serve as better prognostic markers for the respective patient subtype.
Conclusion: Our results show that, in general, the breast cancer cell type predictors are also associated with patient survival and hence have clinical significance.
Keywords: Breast cancer; HER2 positive; Logistic regression; Single cell sequencing.
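The three stages outlined in the Results (filtering low-expression and low-variance transcripts, selecting predictors with a multinomial penalized logistic regression, and classifying cell types with a support vector machine) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the study: the use of scikit-learn, the L1 penalty, the simulated data with 3 cell types, the thresholds `min_mean` and `min_var`, the penalty strength `C`, and all function names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def simulate_cells(n_per_type=100, n_types=3, seed=0):
    """Toy log-expression matrix (cells x transcripts) with known cell types."""
    rng = np.random.default_rng(seed)
    n_cells = n_per_type * n_types
    labels = np.repeat(np.arange(n_types), n_per_type)
    silent = rng.poisson(0.05, size=(n_cells, 50))   # barely expressed transcripts
    noise = rng.poisson(5.0, size=(n_cells, 120))    # expressed but uninformative
    markers = rng.poisson(5.0, size=(n_cells, 10 * n_types))
    for k in range(n_types):                         # 10 marker transcripts per type
        rows = labels == k
        boost = rng.poisson(20.0, size=(int(rows.sum()), 10))
        markers[rows, 10 * k:10 * (k + 1)] += boost
    counts = np.hstack([silent, noise, markers])
    return np.log1p(counts), labels


def filter_transcripts(X, min_mean=0.5, min_var=0.05):
    """Keep transcripts whose mean and variance both exceed the (assumed) thresholds."""
    return (X.mean(axis=0) > min_mean) & (X.var(axis=0) > min_var)


def select_with_l1_logistic(X, y, C=0.1):
    """Keep transcripts with a nonzero coefficient for at least one cell type."""
    model = LogisticRegression(penalty="l1", solver="saga", C=C, max_iter=5000)
    model.fit(X, y)
    return np.any(model.coef_ != 0, axis=0)


def run_pipeline(seed=0):
    X, y = simulate_cells(seed=seed)
    keep = filter_transcripts(X)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X[:, keep], y, test_size=0.3, stratify=y, random_state=seed
    )
    # Standardize using training cells only, so the test cells stay unseen.
    scaler = StandardScaler().fit(X_tr)
    X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)
    selected = select_with_l1_logistic(X_tr, y_tr)
    clf = SVC(kernel="linear").fit(X_tr[:, selected], y_tr)
    return {
        "n_filtered": int(keep.sum()),
        "n_selected": int(selected.sum()),
        "accuracy": float(clf.score(X_te[:, selected], y_te)),
    }


if __name__ == "__main__":
    print(run_pipeline(seed=0))
```

In this sketch the feature selection is fitted on the training cells only, so the held-out accuracy is not inflated by selecting predictors on the test cells. The simulated data are far easier to separate than real scRNASeq profiles, so the numbers it prints say nothing about the accuracies reported in the Results.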