A Machine Learning Model for Predicting the HER2 Positive Expression of Breast Cancer Based on Clinicopathological and Imaging Features

Acad Radiol. 2025 Jan 20:S1076-6332(25)00001-7. doi: 10.1016/j.acra.2025.01.001. Online ahead of print.

Abstract

Rationale and objectives: To develop a machine learning (ML) model based on clinicopathological and imaging features for predicting human epidermal growth factor receptor 2 (HER2)-positive expression (HER2-p) in breast cancer (BC), and to compare its performance with that of a logistic regression (LR) model.

Materials and methods: A total of 2541 consecutive female patients with pathologically confirmed primary breast lesions were enrolled in this study. Based on chronological order, 2034 patients treated between January 2018 and December 2022 were designated as the retrospective development cohort, while 507 patients treated between January 2023 and May 2024 were designated as the prospective validation cohort. Within the development cohort, patients were randomly divided in an 8:2 ratio into a training cohort (n=1628) and a test cohort (n=406). Pretreatment mammography (MG) and breast MRI data, along with clinicopathological features, were recorded. Extreme Gradient Boosting (XGBoost) was used to select features associated with HER2 positivity in BC for an Artificial Neural Network (ANN) model, and multivariate LR analysis was used to select features for an LR model. Predictive performance was assessed using receiver operating characteristic (ROC) curve analysis.
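The cohort design described above (a chronological cut followed by a random 8:2 split of the development cohort) can be illustrated with a minimal sketch. It runs on synthetic records, not study data. The function name `split_cohorts`, the record layout, the seed, and the choice to floor the 20% test share are all assumptions made for illustration; the abstract does not state how the split was implemented, although flooring does reproduce the reported counts.

```python
import random
from datetime import date, timedelta


def split_cohorts(patients, cutoff=date(2023, 1, 1), seed=0):
    """Split (patient_id, treatment_date) records into train, test, validation.

    Records dated before `cutoff` form the development cohort, which is
    shuffled and divided 8:2; later records form the validation cohort.
    """
    development = [p for p in patients if p[1] < cutoff]
    validation = [p for p in patients if p[1] >= cutoff]

    shuffled = development[:]
    random.Random(seed).shuffle(shuffled)

    # Assumption: the 20% test share is floored (2034 // 5 == 406).
    n_test = len(shuffled) // 5
    return shuffled[n_test:], shuffled[:n_test], validation


# Synthetic records only: 2034 dated 2018-2022 and 507 dated from 2023 onward.
start_dev = date(2018, 1, 1)
start_val = date(2023, 1, 1)
patients = [(i, start_dev + timedelta(days=i % 1826)) for i in range(2034)]
patients += [(2034 + i, start_val + timedelta(days=i % 500)) for i in range(507)]

train_set, test_set, validation_set = split_cohorts(patients)
print(len(train_set), len(test_set), len(validation_set))
```

With these synthetic records the three cohorts contain 1628, 406, and 507 patients, matching the sizes reported in the abstract.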

Results: Following the application of Recursive Feature Elimination with Cross-Validation (RFE-CV) for feature dimensionality reduction, the XGBoost algorithm identified tumor size, suspicious calcifications, Ki-67 index, spiculation, and minimum apparent diffusion coefficient (minimum ADC) as the key features indicative of HER2-p in BC. The constructed ANN model consistently outperformed the LR model, achieving an area under the curve (AUC) of 0.853 (95% CI: 0.837-0.872) in the training cohort, 0.821 (95% CI: 0.798-0.853) in the test cohort, and 0.809 (95% CI: 0.776-0.841) in the validation cohort.
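For readers unfamiliar with how an AUC and its 95% CI are obtained, the following is a minimal sketch using a rank-based AUC and a percentile bootstrap. The abstract does not say which CI method the authors used, so the bootstrap is an assumption. The names `roc_auc` and `bootstrap_auc_ci` are hypothetical, and the labels and scores are synthetic, not study data.

```python
import random


def roc_auc(labels, scores):
    """AUC from the rank-sum of positives; tied scores share an average rank."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(lbl for _, lbl in pairs)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes are required")
    rank_sum = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        rank_sum += avg_rank * sum(lbl for _, lbl in pairs[i:j])
        i = j
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC (assumed method, not the authors')."""
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[k] for k in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain a single class
            aucs.append(roc_auc(ys, [scores[k] for k in idx]))
    aucs.sort()
    return aucs[int(alpha / 2 * n_boot)], aucs[int((1 - alpha / 2) * n_boot) - 1]


# Synthetic example: positives score higher on average than negatives.
rng = random.Random(42)
labels = [1 if rng.random() < 0.3 else 0 for _ in range(300)]
scores = [rng.gauss(1.0 if y else 0.0, 1.0) for y in labels]

auc = roc_auc(labels, scores)
ci_low, ci_high = bootstrap_auc_ci(labels, scores)
print(f"AUC {auc:.3f} (95% CI: {ci_low:.3f}-{ci_high:.3f})")
```

The printed line follows the same "AUC (95% CI: low-high)" pattern as the reported results, but its numbers come from the synthetic data and say nothing about the study's models.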

Conclusion: The ANN model, built using the significant feature subsets identified by the XGBoost algorithm with RFE-CV, demonstrates potential in predicting HER2-p in BC.

Keywords: Breast neoplasms; Human epidermal growth factor receptor 2; Machine learning; Magnetic resonance imaging; Mammography.