Developing a robust two-step machine learning multiclassification pipeline to predict primary site in head and neck carcinoma from lymph nodes

Jiaying Liu; Anna Corti; Giuseppina Calareso; Gaia Spadarella; Lisa Licitra; Valentina D A Corino; Luca Mainardi

doi:10.1016/j.heliyon.2024.e24377

Heliyon. 2024 Jan 12;10(2):e24377. doi: 10.1016/j.heliyon.2024.e24377. eCollection 2024 Jan 30.

Authors

Jiaying Liu¹, Anna Corti¹, Giuseppina Calareso², Gaia Spadarella³ ⁴, Lisa Licitra⁵ ⁶, Valentina D A Corino¹ ⁷, Luca Mainardi¹

Affiliations

¹ Department of Electronics, Information and Bioengineering, Politecnico di Milano, Milan, Italy.
² Radiology Department, Fondazione IRCCS Istituto Nazionale dei Tumori di Milano, Milan, Italy.
³ Postgraduation School in Radiodiagnostics, University of Milan, Italy.
⁴ Department of Clinical Medicine and Surgery, Federico II University, Naples, Italy.
⁵ Head and Neck Cancer Medical Oncology Department, Fondazione IRCCS Istituto Nazionale dei Tumori di Milano, Milan, Italy.
⁶ Department of Oncology and Hemato-Oncology, University of Milan, Italy.
⁷ Cardiotech Lab, Centro Cardiologico Monzino IRCCS, Milan, Italy.

Abstract

This study aimed to develop a robust multiclassification pipeline to determine the primary tumor location in patients with head and neck carcinoma of unknown primary using radiomics and machine learning techniques. The dataset included 400 head and neck cancer patients with primary tumors in the oropharynx (OPC, n = 162), nasopharynx (NPC, n = 137), oral cavity (OC, n = 63), and larynx and hypopharynx (HL, n = 38). Two radiomics-based multiclassification pipelines (P1 and P2) were developed. P1 identified the primary site directly, whereas P2 used a two-step approach: the first step reduced the number of classes by merging the two minority classes, which were then reclassified in the second step. Several correlation thresholds (0.75, 0.80, 0.85), feature selection methods (sequential forward/backward selection, sequential floating forward selection, neighborhood component analysis, and minimum redundancy maximum relevance), and classification models (neural network, decision tree, naïve Bayes, bagged trees, and support vector machine) were assessed. P2 outperformed P1, with the best results obtained by the support vector machine classifier using both radiomic and clinical features (accuracies of 75.3% (HL), 75.4% (OC), 71.3% (OPC), and 92.9% (NPC)). These results indicate that the two-step multiclassification pipeline integrating radiomics and clinical information is a promising approach for predicting the primary tumor site in carcinoma of unknown primary.
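The two-step idea behind P2 can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the two merged minority classes are HL and OC (inferred from the class counts only), uses a hypothetical merged label `HL+OC`, and substitutes a toy nearest-centroid classifier for the support vector machine and other models evaluated in the study. Feature selection and correlation filtering are omitted.

```python
from collections import defaultdict
from math import dist

MINORITY = {"HL", "OC"}   # assumption: the two smallest classes by sample count
MERGED = "HL+OC"          # hypothetical label for the merged class


class NearestCentroid:
    """Toy stand-in classifier; the study assessed SVMs, trees, and others."""

    def fit(self, X, y):
        groups = defaultdict(list)
        for row, label in zip(X, y):
            groups[label].append(row)
        # One centroid per class: the column-wise mean of its samples.
        self.centroids_ = {
            label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()
        }
        return self

    def predict(self, X):
        return [
            min(self.centroids_, key=lambda c: dist(row, self.centroids_[c]))
            for row in X
        ]


class TwoStepClassifier:
    """Step 1 separates majority classes from a merged minority class;
    step 2 splits the merged class back into its original labels."""

    def __init__(self, make_model=NearestCentroid, minority=MINORITY):
        self.make_model = make_model
        self.minority = set(minority)

    def fit(self, X, y):
        # Step 1: train on all samples with the minority labels merged.
        y_step1 = [MERGED if label in self.minority else label for label in y]
        self.step1_ = self.make_model().fit(X, y_step1)
        # Step 2: train only on minority samples with their original labels.
        X_step2 = [row for row, label in zip(X, y) if label in self.minority]
        y_step2 = [label for label in y if label in self.minority]
        self.step2_ = self.make_model().fit(X_step2, y_step2)
        return self

    def predict(self, X):
        predictions = []
        for row, label in zip(X, self.step1_.predict(X)):
            if label == MERGED:
                # Only samples routed to the merged class reach step 2.
                label = self.step2_.predict([row])[0]
            predictions.append(label)
        return predictions


# Synthetic two-feature data, purely illustrative (not radiomic features).
X_demo = [[0, 0], [0, 1], [10, 0], [10, 1], [0, 10], [1, 10], [3, 10], [4, 10]]
y_demo = ["OPC", "OPC", "NPC", "NPC", "HL", "HL", "OC", "OC"]
model = TwoStepClassifier().fit(X_demo, y_demo)
print(model.predict([[0, 0.2], [4, 9]]))
```

Merging the minority classes in the first step reduces the class imbalance faced by the first model, and the second model only has to separate two classes of comparable size.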

Keywords: Head and neck carcinoma of unknown primary; Machine learning; Multiclassification; Radiomics.