Objective: The purpose of our study was to use logistic regression to create a breast cancer risk estimation model, based on the descriptors of the National Mammography Database, that can aid decision making in the early detection of breast cancer.
Materials and methods: We built two logistic regression models from the mammography features and demographic data of 62,219 consecutive mammography records from 48,744 studies in 18,269 patients, reported using the Breast Imaging Reporting and Data System (BI-RADS) lexicon and the National Mammography Database format between April 5, 1999 and February 9, 2004. State cancer registry outcomes matched with our data served as the reference standard. The probability of cancer was the outcome in both models. Model 2 was built using all variables in Model 1 plus radiologists' BI-RADS assessment categories. We used 10-fold cross-validation to train and test each model and measured performance as the area under the receiver operating characteristic curve (A(z)). Both models were compared with the radiologists' BI-RADS assessments.
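The evaluation procedure described above can be illustrated with a minimal sketch in Python. It is not the study's implementation. The data are synthetic, the gradient-descent fitting routine, the record-level fold assignment, and the pooling of held-out predictions across folds before computing A(z) are all assumptions made for illustration, and every function name is hypothetical.

```python
import math
import random

def predict_proba(w, b, x):
    """Logistic model: estimated probability of the positive class for one record."""
    z = b + sum(wj * xj for wj, xj in zip(w, x))
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=200):
    """Fit weights and bias by batch gradient descent on the log-loss (illustrative only)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j in range(d):
                gw[j] += err * xi[j]
            gb += err
        w = [wj - lr * gj / n for wj, gj in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def cross_validated_scores(X, y, k=10, seed=0):
    """Return one held-out predicted probability per record using k-fold cross-validation."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = [0.0] * len(X)
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        w, b = fit_logistic([X[i] for i in train], [y[i] for i in train])
        for i in fold:
            scores[i] = predict_proba(w, b, X[i])
    return scores

def auc(scores, labels):
    """Area under the ROC curve as the probability that a positive outranks a negative."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Synthetic stand-in data (assumption): two numeric features, one informative.
rng = random.Random(42)
X, y = [], []
for _ in range(300):
    label = 1 if rng.random() < 0.3 else 0
    X.append([rng.gauss(1.5 * label, 1.0), rng.gauss(0.0, 1.0)])
    y.append(label)

scores = cross_validated_scores(X, y, k=10)
print("cross-validated AUC:", round(auc(scores, y), 3))
```

In this sketch, a second model in the spirit of Model 2 would simply add the BI-RADS assessment category as one more input column before calling `cross_validated_scores`.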
Results: Radiologists achieved an A(z) value of 0.939 ± 0.011. The A(z) was 0.927 ± 0.015 for Model 1 and 0.963 ± 0.009 for Model 2. At 90% specificity, the sensitivity of Model 2 (90%) was significantly better (p < 0.001) than that of radiologists (82%) and Model 1 (83%). At 85% sensitivity, the specificity of Model 2 (96%) was significantly better (p < 0.001) than that of radiologists (88%) and Model 1 (87%).
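The paired figures above (sensitivity at a fixed specificity, and specificity at a fixed sensitivity) are operating points read off a ROC curve. The sketch below shows one way such points could be computed from predicted probabilities and reference labels. The thresholding rule (a record is called positive when its score exceeds the threshold), the toy data, and the function names are assumptions for illustration, and the significance testing reported above is not reproduced.

```python
def operating_points(scores, labels):
    """List (sensitivity, specificity) pairs for every distinct score threshold."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    points = []
    # -inf covers the extreme case where every record is called positive.
    for t in [float("-inf")] + sorted(set(scores)):
        sens = sum(p > t for p in pos) / len(pos)    # true positive rate
        spec = sum(n <= t for n in neg) / len(neg)   # true negative rate
        points.append((sens, spec))
    return points

def sensitivity_at_specificity(scores, labels, target):
    """Best sensitivity among thresholds whose specificity is at least target."""
    return max(sens for sens, spec in operating_points(scores, labels) if spec >= target)

def specificity_at_sensitivity(scores, labels, target):
    """Best specificity among thresholds whose sensitivity is at least target."""
    return max(spec for sens, spec in operating_points(scores, labels) if sens >= target)

# Toy example (assumption): eight records, four malignant (1) and four benign (0).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0]
print(sensitivity_at_specificity(toy_scores, toy_labels, 0.75))  # 0.75
print(specificity_at_sensitivity(toy_scores, toy_labels, 1.0))   # 0.5
```

Applied to the same held-out scores for each reader or model, these two functions would yield comparisons of the kind reported at 90% specificity and 85% sensitivity.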
Conclusion: Our logistic regression model can effectively discriminate between benign and malignant breast disease and can identify the most important features associated with breast cancer.