Data from The Cancer Genome Atlas (TCGA) are now easily accessible through web-based platforms with tools to assess the prognostic value of molecular alterations. Pancreatic tumors are heterogeneous in biology and aggressiveness, ranging from the deadly adenocarcinoma (PDAC) to neuroendocrine tumors, which have a better prognosis. We assessed the availability of the pancreatic cancer TCGA data (TCGA_PAAD) in several repositories, and investigated the nature of each sample and how non-PDAC samples affect prognostic biomarker studies. While the clinical and genomic data (n = 185) were fairly consistent across all repositories, the number of RNAseq profiles varied from 176 to 185. Of the 185 RNAseq profiles, 35 (18.9%) corresponded to normal pancreas, inflamed pancreas, or non-PDAC neoplasms. This information was difficult to obtain. When gene expression was treated as a continuous variable, the expression of 5312 and 4221 genes was significantly associated with progression-free and overall survival, respectively. However, these associations came from the uncurated cohort: only 4 and 14 genes, respectively, had prognostic value in the PDAC-only cohort. Similarly, mutations in key genes and well-described miRNAs lost their prognostic significance in the PDAC-only cohort. We therefore propose a web-based application to assess biomarkers in the curated TCGA_PAAD dataset. In conclusion, curation of TCGA_PAAD is critical to avoid important biological and clinical biases introduced by non-PDAC samples.
Keywords: TCGA; curation; pancreatic cancer.