Comparative Analysis of Transfer Learning Models for Breast Cancer Classification

Sania Eskandari¹, Ali Eslamian², Qiang Cheng²,³

¹ Department of Electrical and Computer Engineering, ² Department of Computer Science, ³ Institute for Biomedical Informatics
University of Kentucky, Lexington, USA
Email: {ses235,aes255,qch239}@uky.edu

Abstract

The classification of histopathological images is crucial for the early and precise detection of breast cancer. This study investigates the effectiveness of deep learning models in distinguishing between Invasive Ductal Carcinoma (IDC) and non-IDC tissue in histopathology slides. We conducted a comparative study of eight models: ResNet-50, DenseNet-121, ResNeXt-50, Vision Transformer (ViT), GoogLeNet (Inception v3), EfficientNet, MobileNet, and SqueezeNet, using a dataset of 277,524 image patches. Our contribution is a systematic assessment of the performance of each model. In particular, the attention-based ViT model achieved the highest validation accuracy, 93%, surpassing the convolutional networks. These results highlight the promise of advanced machine learning approaches in clinical settings, offering improved precision as well as efficiency in breast cancer diagnosis.

1 Introduction

1.1 Breast Cancer

Breast cancer (BC) is a malignant disease that arises in the breast tissue, usually starting with the unregulated growth of cells. The presence of malignant cells frequently results in the development of a discernible tumor, which can be detected by the utilization of imaging methods like mammography[1].

The timely identification of breast cancer is essential, although it poses difficulties due to the frequently asymptomatic presentation of the illness throughout its early phases. Diagnostic procedures, such as mammography, ultrasound, and biopsy, are crucial in differentiating between benign and malignant tumors[2].

Nevertheless, traditional manual diagnosis is time-consuming and necessitates the proficiency of exceptionally trained pathologists, who may still be susceptible to diagnostic inaccuracies as a result of human constraints and variations in expertise. In order to tackle these difficulties, Computer-Aided Diagnostic (CAD) systems have been created and proven to greatly assist in the diagnostic procedure[3].

The recent achievements of Convolutional Neural Networks (CNNs) in several fields of image classification have motivated their use in medical imaging, namely for the classification of histopathology images[4]. CNNs can learn and extract hierarchical features from images, which makes them well suited to identifying the intricate patterns associated with malignant tumors. The classification task assigns each image or image patch to its corresponding category, such as benign or malignant[5].

This automated methodology seeks to enhance the accuracy and effectiveness of breast cancer identification, potentially resulting in earlier detection and improved patient prognosis[6]. Araujo et al.[3] showed that CNNs are highly effective at categorizing breast cancer histology images, resulting in substantial enhancements compared to conventional approaches.

In addition, recent progress in deep learning has produced more advanced models such as ResNet, DenseNet, and Vision Transformers, which have shown promising results in medical image processing. These models use deeper architectures, residual connections, and attention mechanisms to improve accuracy [6].

Furthermore, alternative machine learning methodologies have been investigated alongside CNNs. Support Vector Machines (SVMs) have been employed in gene selection and cancer classification, showing the adaptability and potential of machine learning in oncology[2]. In addition, models such as EfficientNet and MobileNet have been adapted for medical purposes, offering a balance between accuracy and computational efficiency [7].

This research aims to identify the most effective approaches for classifying histopathology images of breast cancer by conducting a comparative analysis of eight advanced models: ResNet-50, DenseNet-121, ResNeXt-50, Vision Transformer (ViT), GoogLeNet (Inception v3), EfficientNet, MobileNet, and SqueezeNet. This undertaking aims to make a valuable contribution to the profession by improving the accuracy of diagnoses, ultimately leading to better outcomes for patients and potentially saving lives.

1.2 Motivation

Breast cancer continues to be a serious worldwide health concern, causing substantial rates of death and illness. According to statistics from 2012, breast cancer caused 522,000 deaths globally, and there were 1.68 million newly diagnosed cases [8]. In 2018, the World Cancer Research Fund reported over 2 million newly diagnosed instances of breast cancer, indicating a persistent increase in its occurrence. This significant rise highlights the urgent requirement for prompt identification and management to enhance rates of survival.

Timely identification of breast cancer is crucial, as it significantly improves the likelihood of effective therapy and prolonged survival. Research has indicated that over 95 percent of women who are diagnosed with early-stage breast cancer have a survival rate of five years or more [9, 10]. This statistic emphasizes the life-saving capacity of early diagnosis and the significance of creating dependable and effective diagnostic techniques.

The conventional approaches for diagnosing breast cancer, which mainly depend on manual examination by pathologists, are not only time-consuming but also susceptible to human error. Pathologists of different proficiency levels may produce inconsistent results, potentially leading to misdiagnoses. There is therefore an urgent need for automated, precise, and effective diagnostic methods to aid in the timely identification of breast cancer. Progress in machine learning and deep learning has created new opportunities for improving breast cancer diagnostics. CNNs have shown exceptional performance in a wide range of image classification applications, including medical imaging [4]. These models can learn complex patterns in histopathological images that may be difficult for humans to see, making them a useful tool for detecting cancerous cells.

Recent research has investigated the application of deep learning models in the classification of breast cancer histopathology images, yielding encouraging outcomes. Golatkar et al. [10] provided evidence of the efficacy of deep learning models in accurately categorizing breast cancer histology images, while Beam and Kohane [9] explored the transformative potential of big data and machine learning in the field of healthcare.

Furthermore, cutting-edge architectures like ResNet, DenseNet, and Vision Transformers have demonstrated exceptional efficacy in the field of medical image processing. Praveen et al. [11] emphasized the utilization of ResNet-32 and Fastai in the detection of ductal cancer from 2D tissue slides. Their approach resulted in notable enhancements in diagnostic precision. These developments demonstrate the potential of incorporating deep learning models into clinical practice to improve diagnostic accuracy and efficiency.

This project seeks to utilize advanced deep-learning models to create a CAD system, in response to the rising occurrence of breast cancer and the crucial significance of early detection. This research aims to determine the most effective methods for categorizing histopathology images of breast cancer by conducting a comparative examination of eight sophisticated models. The primary objective is to enhance the early identification and intervention process, consequently diminishing death rates and enhancing patient outcomes.

1.3 Related work

In recent years, there has been considerable interest in classifying breast histopathology images, and researchers have investigated several deep learning models to improve diagnostic accuracy. This section provides an overview of deep learning methods for classifying breast cancer, focusing on studies that compare multiple models[11]. Praveen et al. [11] developed a specialized ResNet-32 architecture that improves the accuracy of classifying breast histopathology images. Their study compared it against other CNN-based models, such as AlexNet, VGG, and Inception, and showed that ResNet-32 outperformed them, underlining the advantages of deeper architectures with residual connections[12].

Brennon et al. [12] performed a comparative analysis of two commonly used convolutional neural network architectures: the VGG network and the Residual network. ResNet integrates shortcut connections and batch normalization, which speed up training and improve the network’s capacity to generalize. The strong performance of their ensemble model, which merges VGG-based and ResNet-based classifiers, suggests that these architectures can complement each other to improve diagnostic accuracy[13]. Nawaz et al. [13] enhanced the DenseNet model, showing that deep learning techniques developed for natural image processing can reach strong results in medical image processing. Their model attained an accuracy of around 96 percent in multi-class breast cancer classification, surpassing the proficiency of human experts in the diagnostic domain. In addition, recent research has investigated the use of Vision Transformers (ViTs) for classifying medical images[14].

Dosovitskiy et al. [14] presented the Vision Transformer, a model that uses self-attention to process images with fewer built-in, image-specific assumptions than CNNs. It has shown favorable results in diverse image classification tasks, including medical imaging, owing to its capacity to capture long-range dependencies and contextual information.

Furthermore, the Inception network, proposed by Szegedy et al. [15], has been widely used in medical image analysis because of its ability to process information at several scales. Its architecture extracts features at various sizes, making it well suited to intricate image classification problems such as breast cancer detection. The EfficientNet model, presented by Tan and Le [7], has also been examined for medical imaging. EfficientNet employs a compound scaling technique to balance network depth, width, and resolution, resulting in improved performance and efficiency. Its use in classifying breast cancer has shown substantial improvements in both accuracy and computational efficiency.

To summarize, the existing research on breast cancer classification using deep learning models focuses on the ongoing progress and comparative evaluations of different architectures. The studies mentioned highlight the promise of deep learning in improving the accuracy and efficiency of breast cancer diagnosis, ranging from standard CNNs such as AlexNet and VGG to newer models like Vision Transformers and EfficientNet[16].

2 PROBLEM DEFINITION

The objective of this project is to develop an automated system for categorizing histopathological images of breast cancer, with a specific focus on differentiating between Invasive Ductal Carcinoma (IDC) and non-IDC tissue samples. The goal is to create a predictive model that can reliably classify each image in a dataset of n histopathological imaging patches. Each image is annotated with a binary classification: 0 denotes the lack of IDC and 1 denotes the presence of IDC.

The model is defined by weights learned from the provided data, with the aim of maximizing prediction accuracy on images that have not been seen before. The main objective is to improve performance indicators, such as accuracy, precision, recall, and F1-score, by improving the training approach. An essential component of this work is to ensure that the model generalizes to new, unfamiliar images while coping with problems such as class imbalance, differences in image quality, and other complexities inherent in the interpretation of medical images.
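The four indicators named above can be written in terms of the entries of a binary confusion matrix. The following is a minimal sketch rather than the evaluation code used in this study; the function name `binary_metrics` and the example counts are hypothetical and chosen only for illustration, with IDC (label 1) treated as the positive class.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive (IDC) class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # of the patches predicted IDC, the fraction that are IDC
    recall = tp / (tp + fn)     # of the true IDC patches, the fraction that are found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Hypothetical counts, not taken from the study.
acc, prec, rec, f1 = binary_metrics(tp=84, fp=10, fn=16, tn=190)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```

The corresponding values for the non-IDC class follow by swapping the roles of the two classes, that is, by calling `binary_metrics(tp=tn, fp=fn, fn=fp, tn=tp)`.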

To address this problem, we use deep learning models. This study compares several CNNs with the Vision Transformer (ViT) model. The ViT model stands out for its use of self-attention mechanisms, which allow it to capture long-range relationships and contextual information efficiently, beyond the capabilities of typical CNNs.

The main goal of this research is to identify or create a model that achieves the highest level of diagnostic accuracy, thereby enabling the prompt and dependable detection of breast cancer. Our objective is to assess the performance of these cutting-edge models in order to identify the most appropriate method for automating the classification of histopathology images. This will ultimately enhance diagnostic procedures in clinical environments.

3 DATASET

The Breast Histopathology Images dataset comprises histopathology images of breast biopsies, specifically covering instances of IDC, which is the most common subtype of breast cancer. Pathologists frequently concentrate on areas that contain IDC when assessing the level of aggressiveness in a whole-mount sample. Consequently, an essential first step in automating the assessment of aggressiveness is to identify precisely the areas of IDC within an entire slide.

The original dataset consists of 162 whole-mount slide images of breast cancer specimens, which were scanned at a magnification of 40x. A total of 277,524 patches, each 50 × 50 pixels, were extracted from these slides. The patches are labeled by the presence or absence of IDC: 198,738 patches are IDC negative and 78,786 patches are IDC positive.

Figure 1: Examples of histopathology images. Top row: non-IDC; bottom row: IDC. Images sourced from Kaggle [16].

In order to train and evaluate the model, the dataset was divided into three subsets: training data, validation data, and test data. This partitioning guarantees that the model may be trained with high efficiency, validated during the training phase to optimize hyperparameters, and assessed on a distinct dataset to evaluate its capacity to generalize. Table 1 presents a summary of how the samples are distributed among different subsets.

Table 1: Dataset Partitioning

Subset	IDC Negative	IDC Positive	Total
Training	138,288	55,978	194,266
Validation	39,748	15,760	55,508
Test	20,702	7,048	27,750
Total	198,738	78,786	277,524

Through careful and systematic preparation and division of the dataset, our goal is to ensure that our deep learning models are trained on a thorough and representative collection of data. This will enable them to perform strongly in diagnosing IDC in breast histopathology images.
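The proportions implied by Table 1 can be checked with a few lines of arithmetic. The sketch below uses only the counts listed in the table; the variable names are illustrative, and the "majority-class baseline" is added here purely as a reference point, not as a model evaluated in this study.

```python
# Patch counts copied from Table 1: (IDC negative, IDC positive)
SPLITS = {
    "training": (138_288, 55_978),
    "validation": (39_748, 15_760),
    "test": (20_702, 7_048),
}

total = sum(neg + pos for neg, pos in SPLITS.values())
split_fraction = {name: (neg + pos) / total for name, (neg, pos) in SPLITS.items()}
positive_fraction = {name: pos / (neg + pos) for name, (neg, pos) in SPLITS.items()}

# Accuracy of a trivial classifier that labels every patch IDC negative
negatives = sum(neg for neg, _ in SPLITS.values())
majority_baseline = negatives / total

for name in SPLITS:
    print(f"{name}: {split_fraction[name]:.1%} of patches, "
          f"{positive_fraction[name]:.1%} IDC positive")
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```

The counts correspond to roughly a 70/20/10 split. Because about 72% of all patches are IDC negative, a classifier that always predicts the negative class would already reach that accuracy, which is why per-class precision and recall are reported alongside overall accuracy.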

4 PROPOSED APPROACH

In deep learning for medical imaging, the choice of model architecture plays a crucial role in accurately classifying breast cancer from histopathological images. CNNs have shown exceptional effectiveness in image classification, which makes them especially well suited to medical imaging. These models learn spatial hierarchies of features automatically and adaptively from large datasets. As a result, they can classify breast cancer tumors with high accuracy from mammograms or histological images.

The objective of this study is to conduct a comparative analysis of seven modern CNN models and the attention-based Vision Transformer (ViT). The chosen models are ResNet-50, DenseNet-121, ResNeXt-50, Inception v3, EfficientNet, MobileNet, SqueezeNet, and ViT. Each of these models has distinct benefits and was selected for its likely efficacy in medical image analysis.

The main goal of this comparison investigation is to determine the model that provides the most accuracy in diagnosing IDC in breast histopathology images. Our goal is to assess the performance of these advanced models in a systematic manner to identify the most appropriate method for incorporating deep learning into clinical practice. This will improve the accuracy and efficiency of breast cancer diagnosis.

4.1 ResNet

ResNet is a deep learning architecture that introduced skip connections to improve gradient propagation during training. Instead of learning a direct input-output mapping, ResNet learns residuals, which are combined with the input to produce the output. This approach allows the network to be very deep, with over 100 layers, and still be trainable. The added depth enhances performance in tasks like image classification and object detection by capturing intricate details, while residual learning helps address the vanishing gradient problem[16].
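The effect of skip connections on gradient propagation can be illustrated with a deliberately simplified scalar model. The sketch below is a toy under stated assumptions: each layer is the linear map F(x) = slope · x with a small slope, whereas real networks use matrices and nonlinearities. The function name `gradient_through_stack` is hypothetical.

```python
def gradient_through_stack(layer_slope, depth, residual):
    """Derivative of the output with respect to the input for `depth` stacked scalar layers.

    Each layer applies F(x) = layer_slope * x. A plain layer outputs F(x);
    a residual layer outputs x + F(x), so its local derivative is 1 + layer_slope.
    """
    local = 1.0 + layer_slope if residual else layer_slope
    return local ** depth


plain = gradient_through_stack(0.01, depth=50, residual=False)
skip = gradient_through_stack(0.01, depth=50, residual=True)
print(f"plain stack: {plain:.3e}, residual stack: {skip:.3f}")
```

In this toy setting the gradient through 50 plain layers collapses towards zero, while the identity path in the residual layers keeps it of order one, which is the intuition behind residual learning easing the vanishing gradient problem.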

4.2 ResNeXt

ResNeXt enhances the ResNet architecture by introducing "cardinality," which increases the number of independent pathways within each residual block. Unlike traditional ResNets, ResNeXt processes input data through multiple paths or groups concurrently, allowing the model to capture a broader range of features and improve its representational power. This approach enhances the model’s ability to learn complex patterns, resulting in better performance across various tasks. By incorporating grouped convolutions, ResNeXt achieves greater accuracy and efficiency compared to standard ResNet models[17].
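The parameter saving from grouped convolutions follows from a simple count. The sketch below assumes a cardinality of 32 with 4 channels per group, as in the commonly used ResNeXt-50 32x4d configuration; the source does not state which variant was used, and the helper `conv_weights` is illustrative.

```python
def conv_weights(in_channels, out_channels, kernel_size, groups=1):
    """Number of weights (bias excluded) in a 2D convolution with the given grouping."""
    if in_channels % groups or out_channels % groups:
        raise ValueError("channel counts must be divisible by the number of groups")
    # Each output channel only sees the input channels of its own group.
    return (in_channels // groups) * out_channels * kernel_size * kernel_size


dense = conv_weights(128, 128, 3, groups=1)     # ordinary 3x3 convolution
grouped = conv_weights(128, 128, 3, groups=32)  # 32 parallel paths of 4 channels each
print(dense, grouped, dense // grouped)
```

For the same input and output width, the grouped layer needs 32 times fewer weights, which is what allows ResNeXt to widen its blocks at a comparable parameter budget.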

4.3 DenseNet

DenseNet introduces a connectivity architecture where each layer is directly connected to all preceding layers, ensuring efficient feature reuse and improved gradient flow. This dense connectivity allows the model to receive input from all earlier layers, effectively addressing the vanishing gradient problem and optimizing parameter usage. By concatenating feature maps instead of summing them, DenseNet reduces the overall parameter count while maintaining high performance, making it highly effective for tasks like image classification and segmentation[18].
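Because each layer concatenates its output to everything before it, the channel count inside a dense block grows linearly with depth. The sketch below traces this growth under assumed DenseNet-121 settings (a 64-channel stem, growth rate 32, blocks of 6, 12, 24, and 16 layers, and transition layers that halve the channel count); these values are not given in the source and are stated here as assumptions.

```python
def dense_block_channels(in_channels, num_layers, growth_rate):
    """Channels after a dense block: every layer concatenates `growth_rate` new feature maps."""
    return in_channels + num_layers * growth_rate


channels = 64                   # assumed stem output width
block_sizes = (6, 12, 24, 16)   # assumed DenseNet-121 block configuration
trace = []
for i, layers in enumerate(block_sizes):
    channels = dense_block_channels(channels, layers, growth_rate=32)
    trace.append(channels)
    if i < len(block_sizes) - 1:
        channels //= 2          # transition layer halves the channel count

print(trace)
```

Each individual layer only produces 32 new feature maps, which is how dense connectivity keeps the parameter count low while later layers still have access to all earlier features.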

4.4 Inception v3

Inception v3 is a deep learning architecture that extends the original Inception (GoogLeNet) model with several enhancements to improve performance and efficiency. It addresses the challenge of varying object sizes in images by employing a multi-scale approach within each module, enabling the model to capture features at different resolutions. Inception v3 introduces factorized convolutions, which break down larger convolutions into smaller ones, reducing computational load and improving efficiency. Additionally, the model incorporates batch normalization to stabilize training and label smoothing to reduce overconfidence and enhance generalization. The architecture, composed of multiple Inception modules with various convolutional filters and pooling operations, allows for simultaneous analysis of spatial features at different scales. This multi-scale processing capability, combined with the architectural improvements, makes Inception v3 particularly effective for medical image analysis tasks such as breast histopathology classification, by capturing intricate details and improving detection accuracy[15].
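Two of the ideas mentioned above, factorized convolutions and label smoothing, reduce to short calculations. The sketch below is illustrative only: the smoothing value of 0.1 is an assumption, not a setting reported in this study, and `smooth_labels` is a hypothetical helper.

```python
def smooth_labels(one_hot, epsilon):
    """Label smoothing: mix the one-hot target with a uniform distribution over classes."""
    k = len(one_hot)
    return [(1 - epsilon) * p + epsilon / k for p in one_hot]


# Weights per input/output channel pair
weights_5x5 = 5 * 5
weights_two_3x3 = 2 * (3 * 3)  # two stacked 3x3 convolutions cover the same 5x5 receptive field
saving = 1 - weights_two_3x3 / weights_5x5

target = smooth_labels([0.0, 1.0], epsilon=0.1)  # an IDC-positive patch in a two-class problem
print(f"weight saving: {saving:.0%}, smoothed target: {target}")
```

Replacing one 5x5 convolution by two 3x3 convolutions saves 28% of the weights, and the smoothed target moves a small amount of probability mass to the other class so that the network is not pushed towards fully confident outputs.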

4.5 EfficientNet

EfficientNet is a family of CNNs designed for high accuracy and computational efficiency, introduced by Tan and Le. It uses a compound scaling technique to scale the network’s depth, width, and resolution jointly, improving performance without unnecessary computational cost. Starting from a baseline model, EfficientNet-B0, which balances accuracy and efficiency, the family includes models B1 to B7, each scaled up systematically in depth, width, and resolution for improved accuracy. EfficientNet incorporates Mobile Inverted Bottleneck Convolution (MBConv) layers for mobile efficiency, uses the Swish activation function for smoother gradients, and applies dropout and stochastic depth to enhance generalization and prevent overfitting. These design choices allow EfficientNet to achieve top accuracy on benchmark datasets with fewer computational resources, making it particularly effective for tasks like breast cancer histopathology image classification[7].
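Compound scaling can be made concrete with the base coefficients reported by Tan and Le (1.2 for depth, 1.1 for width, 1.15 for resolution). The sketch below shows only the scaling rule; the multipliers of the released B1 to B7 models are rounded and need not match these values exactly, and the source does not state which EfficientNet variant was used here.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution bases from Tan and Le


def compound_scale(phi):
    """Depth, width and resolution multipliers for the compound coefficient `phi`."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi


# FLOPs grow roughly with depth * width^2 * resolution^2,
# so this product is the approximate cost factor per unit increase of phi.
flops_factor = ALPHA * BETA ** 2 * GAMMA ** 2
depth, width, resolution = compound_scale(phi=2)
print(f"cost factor per step: {flops_factor:.2f}")
print(f"phi=2 -> depth x{depth:.2f}, width x{width:.2f}, resolution x{resolution:.2f}")
```

The coefficients are chosen so that the cost factor is close to 2, meaning that each unit increase of the compound coefficient roughly doubles the computation.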

4.6 MobileNet

MobileNet is a category of efficient CNNs designed for mobile and embedded vision applications. The key innovation in MobileNet is the use of depthwise separable convolutions, which reduce the number of parameters and computational complexity compared to traditional convolutional layers. In this approach, spatial filtering and channel-wise operations are separated into two layers: depthwise convolutions, which apply a single filter per input channel to reduce computational load, and pointwise convolutions, which use a 1x1 filter to combine the depthwise results. This separation results in faster inference times and lower power consumption. MobileNet also incorporates width multipliers to adjust the number of channels per layer and resolution multipliers to control input image resolution, allowing for flexible scaling based on available computational resources. These features make MobileNet highly suitable for mobile and embedded devices, as well as for medical image analysis tasks like breast histopathology classification, where it provides accurate results within limited computational constraints[15].
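The saving from depthwise separable convolutions can be quantified by counting multiply-add operations. The layer sizes in the sketch below (3x3 kernel, 128 input channels, 256 output channels, 14 × 14 feature map) are arbitrary illustrative values, not settings from this study.

```python
def standard_conv_cost(k, m, n, f):
    """Multiply-adds of a k x k convolution with m input channels, n output channels, f x f output."""
    return k * k * m * n * f * f


def separable_conv_cost(k, m, n, f):
    """Depthwise k x k convolution followed by a pointwise 1x1 convolution."""
    depthwise = k * k * m * f * f  # one k x k filter per input channel
    pointwise = m * n * f * f      # 1x1 convolution that mixes the channels
    return depthwise + pointwise


ratio = separable_conv_cost(3, 128, 256, 14) / standard_conv_cost(3, 128, 256, 14)
print(f"separable / standard cost: {ratio:.3f}")
```

The ratio simplifies to 1/n + 1/k², so with 3x3 kernels the separable layer needs roughly one eighth to one ninth of the computation of a standard convolution.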

4.7 SqueezeNet

SqueezeNet is a highly efficient convolutional neural network designed to match the accuracy of AlexNet while significantly reducing the number of parameters. This compact architecture is well suited to resource-constrained environments, such as mobile and embedded systems. The key innovation in SqueezeNet is the "fire module," which consists of a squeeze layer and an expand layer. The squeeze layer uses 1x1 convolutions to reduce the number of input channels, lowering computational load and parameter count, while the expand layer combines 1x1 and 3x3 convolutions to expand the compressed representation, allowing the network to learn intricate features. SqueezeNet also relies on 1x1 convolutions and delayed downsampling to keep the model small without compromising accuracy. Additionally, it is highly compressible through methods like pruning, quantization, and Huffman coding, making it well suited to devices with limited storage and processing power. In medical image analysis, SqueezeNet’s efficiency and reduced parameter count make it a strong candidate for tasks like breast histopathology classification, providing accurate and efficient real-time diagnostics in resource-limited settings[20].
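The parameter saving of a fire module can be verified by a direct count. The sketch below assumes the sizes of the first fire module in the original SqueezeNet design (96 input channels, 16 squeeze filters, 64 + 64 expand filters); these sizes and the helper `fire_module_params` are assumptions made for illustration.

```python
def fire_module_params(in_channels, squeeze, expand1x1, expand3x3):
    """Weights plus biases of a fire module."""
    squeeze_params = in_channels * squeeze + squeeze      # 1x1 squeeze layer
    expand1_params = squeeze * expand1x1 + expand1x1      # 1x1 expand branch
    expand3_params = squeeze * expand3x3 * 9 + expand3x3  # 3x3 expand branch
    return squeeze_params + expand1_params + expand3_params


fire = fire_module_params(96, squeeze=16, expand1x1=64, expand3x3=64)
plain = 96 * 128 * 9 + 128  # a single 3x3 convolution mapping 96 to 128 channels
print(fire, plain, round(plain / fire, 1))
```

With these sizes the fire module produces the same 128 output channels as the plain 3x3 convolution with roughly nine times fewer parameters, because the expensive 3x3 filters only see the 16 squeezed channels.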

4.8 Vision Transformer

Transformer-based models have proven effective across various tasks in Natural Language Processing and Computer Vision, with the Vision Transformer (ViT) standing out for its performance in medical imaging [21, 14]. Unlike traditional CNNs, which excel at learning spatial hierarchies through convolutional layers, ViT applies the Transformer architecture, originally designed for sequential data like text, to image data. ViT divides an image into fixed-size patches, converts these patches into token embeddings, and adds positional embeddings to retain spatial information. These token embeddings are then processed by a transformer encoder, which uses self-attention to capture both global and local information within the image, resulting in robust feature representations.
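The patch tokenization step determines the length of the sequence seen by the transformer encoder. The sketch below assumes a 224 × 224 RGB input and 16 × 16 patches, a common ViT configuration; the source does not state the input resolution or patch size that was used, and `vit_token_shape` is a hypothetical helper.

```python
def vit_token_shape(image_size, patch_size, channels=3):
    """Number of patch tokens and the length of each flattened patch."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be a multiple of the patch size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    patch_dim = patch_size * patch_size * channels
    return num_patches, patch_dim


num_patches, patch_dim = vit_token_shape(image_size=224, patch_size=16)
sequence_length = num_patches + 1  # plus one learnable classification token
print(num_patches, patch_dim, sequence_length)
```

Under these assumed values a 50 × 50 histopathology patch would first have to be resized or padded, since 50 is not a multiple of 16.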

An advantage often cited for ViT is that its patch-based processing can accommodate images of varying dimensions, which makes it a versatile option for diverse image recognition tasks. In our experiments, we used the standard ViT model alongside the other models, all pre-trained on the ImageNet dataset. We then fine-tuned the models on our histopathology dataset to distinguish between samples with invasive ductal carcinoma (IDC) and those without IDC.

Table 2 presents a comprehensive overview of the performance outcomes of different deep learning models assessed in this research. It includes precise measurements such as precision, recall, F1-score, and total accuracy metrics for both non-IDC and IDC classifications. This thorough comparison evaluates the performance of the models, offering insights into their efficacy for classifying breast cancer histopathology images, and includes a ranking of their relative performance.

Table 2: Performance comparison of deep learning models for IDC classification in breast histopathology images

Methods	Accuracy	Precision (Class 0)	Precision (Class 1)	Recall (Class 0)	Recall (Class 1)	F1-Score (Class 0)	F1-Score (Class 1)	Rank (std)
ResNet-50	0.91	0.93	0.85	0.94	0.81	0.94	0.83	4.57 (0.051)
DenseNet-121	0.91	0.93	0.86	0.95	0.81	0.94	0.83	4.64 (0.056)
ResNeXt-50	0.91	0.93	0.85	0.94	0.82	0.94	0.84	5.14 (0.055)
Inception v3	0.91	0.93	0.84	0.94	0.84	0.94	0.84	4.57 (0.049)
EfficientNet	0.92	0.95	0.83	0.93	0.89	0.94	0.86	3.29 (0.044)
MobileNetV2	0.90	0.94	0.82	0.93	0.85	0.93	0.83	5.71 (0.051)
SqueezeNet	0.88	0.95	0.74	0.88	0.87	0.91	0.80	6.21 (0.070)
ViT	0.93	0.94	0.89	0.96	0.84	0.95	0.87	1.86 (0.045)
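Table 2 does not define the Rank (std) column. The sketch below tests one plausible reading: each model is ranked on each of the seven reported metrics, tied models share the average of the positions they occupy, and the seven ranks are then averaged. This reading is an inference from the numbers, not a procedure described in the text, and the helper `average_ranks` is hypothetical. The standard deviation in parentheses is not reconstructed.

```python
from statistics import mean

# Rows of Table 2: accuracy, precision (class 0, 1), recall (class 0, 1), F1 (class 0, 1)
TABLE2 = {
    "ResNet-50":    [0.91, 0.93, 0.85, 0.94, 0.81, 0.94, 0.83],
    "DenseNet-121": [0.91, 0.93, 0.86, 0.95, 0.81, 0.94, 0.83],
    "ResNeXt-50":   [0.91, 0.93, 0.85, 0.94, 0.82, 0.94, 0.84],
    "Inception v3": [0.91, 0.93, 0.84, 0.94, 0.84, 0.94, 0.84],
    "EfficientNet": [0.92, 0.95, 0.83, 0.93, 0.89, 0.94, 0.86],
    "MobileNetV2":  [0.90, 0.94, 0.82, 0.93, 0.85, 0.93, 0.83],
    "SqueezeNet":   [0.88, 0.95, 0.74, 0.88, 0.87, 0.91, 0.80],
    "ViT":          [0.93, 0.94, 0.89, 0.96, 0.84, 0.95, 0.87],
}


def average_ranks(table):
    """Mean rank of each model over all metrics (1 = best, ties share the average position)."""
    n_metrics = len(next(iter(table.values())))
    ranks = {name: [] for name in table}
    for j in range(n_metrics):
        column = [row[j] for row in table.values()]
        for name, row in table.items():
            better = sum(1 for value in column if value > row[j])
            tied = sum(1 for value in column if value == row[j])
            # Tied models occupy positions better+1 .. better+tied; use their mean.
            ranks[name].append(better + (tied + 1) / 2)
    return {name: mean(r) for name, r in ranks.items()}


avg_rank = average_ranks(TABLE2)
for name, value in sorted(avg_rank.items(), key=lambda item: item[1]):
    print(f"{name}: {value:.2f}")
```

Under this reading, the rounded table entries reproduce the listed ranks for ViT (1.86), EfficientNet (3.29), Inception v3 (4.57), DenseNet-121 (4.64), MobileNetV2 (5.71), and SqueezeNet (6.21). For ResNet-50 and ResNeXt-50 the recomputed values are 5.14 and 4.57, the reverse of the printed entries, which may reflect ranking on unrounded metrics.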

5 RESULTS AND OBSERVATIONS

5.1 Experimental Setup

Experiments were run on a workstation equipped with an AMD Ryzen Threadripper PRO 5965WX 24-core CPU, 62 GB of RAM, and an NVIDIA RTX A4500 GPU with 20 GB of memory. The models were implemented in Python with the PyTorch framework. The training parameters were selected to balance computational efficiency and model performance. All models were trained for 10 epochs. A batch size of 128 was used for training, and a batch size of 64 was used for validation and testing, to balance GPU memory consumption and processing efficiency.

The Adam optimizer, which adapts the learning rate for each parameter and thereby facilitates faster convergence, was used with a learning rate of 0.0001. The models were initialized with weights pre-trained on the ImageNet dataset to enable transfer learning. This approach reuses features learned from a large dataset, helping the models converge quickly on the specific task of classifying histopathological images. All models were evaluated under this common configuration so that the most effective architecture for classifying IDC in breast histopathology images could be determined.
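The training budget implied by these settings can be worked out from the split sizes in Table 1. The sketch below assumes that the final, partially filled batch of each pass is kept rather than dropped, which is not stated in the text; the constant names are illustrative.

```python
import math

TRAIN_PATCHES, VAL_PATCHES, TEST_PATCHES = 194_266, 55_508, 27_750  # totals from Table 1
EPOCHS, TRAIN_BATCH, EVAL_BATCH = 10, 128, 64
LEARNING_RATE = 1e-4  # Adam learning rate reported in the text

steps_per_epoch = math.ceil(TRAIN_PATCHES / TRAIN_BATCH)  # the last batch may be partial
total_updates = steps_per_epoch * EPOCHS
val_batches = math.ceil(VAL_PATCHES / EVAL_BATCH)
test_batches = math.ceil(TEST_PATCHES / EVAL_BATCH)

print(f"{steps_per_epoch} optimizer steps per epoch, {total_updates} in total")
print(f"{val_batches} validation batches, {test_batches} test batches")
```

Each model therefore receives on the order of fifteen thousand parameter updates, a modest budget that is workable because training starts from ImageNet pre-trained weights rather than from random initialization.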

5.2 Results and Observations

The ResNet-50 model was trained for a total of 10 epochs, with accuracy increasing and loss decreasing consistently during both training and validation. The model attained an overall accuracy of 91%. The non-IDC predictions had a precision of 0.93 and a recall of 0.94, while the IDC predictions had a precision of 0.85 and a recall of 0.81. The F1-scores for the non-IDC and IDC classes were 0.94 and 0.83, respectively. These values indicate a reasonable precision-recall tradeoff, which is crucial for accurate medical diagnosis.

Similarly, the DenseNet-121 model was trained for 10 epochs and attained an overall accuracy of 91%. The non-IDC predictions had a precision of 0.93 and a recall of 0.95, while the IDC predictions had a precision of 0.86 and a recall of 0.81. The F1-scores for the non-IDC and IDC classes were 0.94 and 0.83, respectively.

The ResNeXt-50 model, trained for 20 epochs, also attained an overall accuracy of 91%. The non-IDC predictions had a precision of 0.93 and a recall of 0.94, whereas the IDC predictions had a precision of 0.85 and a recall of 0.82. The F1-scores were 0.94 for the non-IDC class and 0.84 for the IDC class.

The GoogLeNet (Inception v3) model, trained for 10 epochs, achieved an overall accuracy of 91%. The non-IDC predictions had a precision of 0.93 and a recall of 0.94, while the IDC predictions had a precision of 0.84 and a recall of 0.83. The F1-scores for the non-IDC and IDC classes were 0.94 and 0.84, respectively, showing a good balance between precision and recall.

The EfficientNet model, trained for 10 epochs, attained the highest accuracy among the CNN models, reaching 92%. The non-IDC predictions had a precision of 0.95 and a recall of 0.93, while the IDC predictions had a precision of 0.83 and a recall of 0.89. The F1-scores were 0.94 for the non-IDC class and 0.86 for the IDC class.

The MobileNetV2 model, trained for 10 epochs, attained an accuracy of 90%. The non-IDC predictions had a precision of 0.94 and a recall of 0.93, while the IDC predictions had a precision of 0.82 and a recall of 0.85. The F1-scores were 0.93 for the non-IDC class and 0.83 for the IDC class, showing that the model remains effective even with limited computational resources.

The SqueezeNet model, trained for 10 epochs, attained an overall accuracy of 88%. The non-IDC predictions had a precision of 0.95 and a recall of 0.88, while the IDC predictions had a precision of 0.74 and a recall of 0.87. The F1-scores for the non-IDC and IDC classes were 0.91 and 0.80, respectively, reflecting its performance in a resource-limited setting.

The Vision Transformer (ViT) model achieved the highest overall accuracy of all models. It was trained for 10 epochs and reached an accuracy of 93%. The non-IDC predictions had a precision of 0.94 and a recall of 0.96, while the IDC predictions had a precision of 0.89 and a recall of 0.84. The F1-scores were 0.95 for the non-IDC class and 0.87 for the IDC class, the highest of any model for both classes.

5.3 Comparison and Analysis

The Vision Transformer (ViT) model showed improved performance in comparison to conventional CNN designs. ViT achieved a superior overall accuracy of 93%, surpassing other models that had accuracy rates ranging from 88% to 92%. The precision and recall values of ViT for non-cancerous (0.94 and 0.96, respectively) and malignant (0.89 and 0.84, respectively) predictions demonstrate a high degree of dependability in accurately distinguishing between the two classes. The F1-scores of 0.95 for non-cancerous and 0.87 for cancerous classifications highlight the strong ability of the system to properly handle both positive and negative samples.

Among the CNN models, EfficientNet achieved the highest accuracy, 92%, with strong precision and recall, although it fell slightly short of ViT. ResNet, DenseNet, ResNeXt, and GoogLeNet each attained 91% accuracy, with somewhat lower precision and recall than EfficientNet and ViT. MobileNetV2 and SqueezeNet, although computationally efficient, had lower accuracy (90% and 88%, respectively) and weaker precision and recall, making them less effective than the other models for this task.
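One way to summarize how evenly a model treats the two classes is the macro-averaged F1-score, the unweighted mean of the per-class F1-scores. The sketch below applies it to the three models whose per-class F1-scores are quoted above; macro-averaging is an illustrative summary added here, not a metric reported by the study, and the names `per_class_f1` and `macro_f1` are hypothetical.

```python
# Per-class F1-scores (non-cancerous, cancerous) as quoted in the text.
per_class_f1 = {
    "MobileNetV2": (0.93, 0.83),
    "SqueezeNet": (0.91, 0.80),
    "ViT": (0.95, 0.87),
}

def macro_f1(scores):
    """Unweighted mean of per-class F1-scores."""
    return sum(scores) / len(scores)

# Rank models from highest to lowest macro-averaged F1.
ranking = sorted(per_class_f1, key=lambda m: macro_f1(per_class_f1[m]), reverse=True)

for model in ranking:
    print(f"{model}: macro-F1 = {macro_f1(per_class_f1[model]):.3f}")
```

Under this summary the ordering of the three models matches their ordering by accuracy.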

The Vision Transformer (ViT) model was the most successful architecture in our experiments for classifying invasive ductal carcinoma (IDC) in breast histopathology images. It achieved the highest accuracy while maintaining well-balanced precision, recall, and F1-score. Its self-attention mechanism, which captures both global and local information within an image, gives it a distinct advantage over conventional CNN architectures.
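To make the role of self-attention concrete, the following is a minimal single-head scaled dot-product self-attention sketch in plain Python. It is a simplification, not the ViT implementation used in the study: queries, keys, and values are taken to be the patch embeddings themselves (a real ViT learns separate projection matrices), and the four 2-D "patch embeddings" are toy values chosen for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: identity projections, i.e. Q = K = V = tokens.
    Returns the attended outputs and the attention weight matrix.
    """
    d = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        # Similarity of this patch to every patch, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of all patch embeddings.
        outputs.append([sum(wj * v[i] for wj, v in zip(w, tokens)) for i in range(d)])
    return outputs, weights

# Four hypothetical 2-D patch embeddings (toy values, not from the paper).
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
outputs, weights = self_attention(patches)

for row in weights:
    print([round(w, 3) for w in row])
```

Every row of the weight matrix is strictly positive, so each patch draws on all other patches in a single layer, regardless of their distance in the image; a convolution, by contrast, only mixes information within its kernel window at each layer.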

6 Conclusion and Future Work

This study conducted a thorough comparative investigation of various deep learning models for classifying histopathology images of breast cancer, specifically differentiating between Invasive Ductal Carcinoma (IDC) and non-IDC tissue samples. The models assessed included ResNet-50, DenseNet-121, ResNeXt-50, GoogLeNet (Inception v3), EfficientNet, MobileNetV2, SqueezeNet, and Vision Transformer (ViT).

The experimental findings revealed that the ViT model outperformed conventional CNN architectures in terms of overall accuracy, precision, recall, and F1-score. The ViT model achieved an accuracy of 93%, with F1-scores of 0.95 for non-IDC and 0.87 for IDC, demonstrating its ability to balance precision and recall. EfficientNet showed the highest accuracy among the CNN models at 92%, followed closely by ResNet, DenseNet, ResNeXt, and GoogLeNet, each achieving 91% accuracy. MobileNetV2 and SqueezeNet, although computationally efficient, showed lower accuracy levels of 90% and 88%, respectively. The strong performance of the ViT model highlights the capacity of attention-based mechanisms to capture both global and local features essential for precise medical image interpretation. This work underscores the importance of using advanced deep learning architectures to enhance diagnostic accuracy in clinical settings, which may in turn lead to better patient outcomes.

While this study provides valuable insights, future research should explore advanced data augmentation techniques and preprocessing methods to improve model performance, particularly in addressing class imbalance and enhancing generalization. Additionally, a model based on Mamba will be explored for breast cancer diagnosis with mammography images to improve accuracy and efficiency [22]. Investigating ensemble approaches and developing real-time implementations of these models in clinical settings are also important areas for further research to assess practical utility and integration into diagnostic workflows.
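As one concrete way to address class imbalance, the sketch below computes inverse-frequency ("balanced") class weights, which can be supplied to a weighted loss during training. The per-class patch counts are those commonly reported for the public IDC patch dataset and are an assumption here, since this section does not state them; they sum to the 277,524 patches mentioned in the abstract. The function name `inverse_frequency_weights` is hypothetical.

```python
def inverse_frequency_weights(class_counts):
    """Weight each class by total / (n_classes * count).

    Rarer classes receive larger weights, so that each class
    contributes equally to a weighted loss in aggregate.
    """
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: total / (n_classes * count) for label, count in class_counts.items()}

# Assumed per-class patch counts (not stated in this section).
counts = {"non-IDC": 198_738, "IDC": 78_786}
weights = inverse_frequency_weights(counts)

for label, w in weights.items():
    print(f"{label}: weight = {w:.3f}")
```

With these assumed counts the IDC class receives roughly two and a half times the weight of the non-IDC class. This is the same heuristic as the "balanced" class-weight option found in common machine learning libraries; whether it improves IDC recall for the models studied here would need to be verified experimentally.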

Improving the interpretability of deep learning models is crucial for their acceptance and use in clinical practice. Future research should focus on developing methods to provide a deeper understanding of model decisions, enhancing transparency for medical practitioners. Exploring transfer learning and domain adaptation strategies could expand the applicability of these models to other types of cancer or medical imaging tasks. Using larger and more diverse datasets will help evaluate the models’ robustness and generalizability across different populations and imaging conditions. Integrating histopathological image analysis with additional clinical data, such as patient demographics and genetic information, has the potential to improve diagnostic accuracy and provide a more comprehensive understanding of the disease.

In summary, this study demonstrates that advanced deep learning models, particularly the Vision Transformer, can significantly improve the classification of breast cancer from histopathology images. Further exploration of this field, with a focus on the proposed future directions, holds promise for advancing medical image analysis and improving clinical outcomes.

References

[1] S. Ara, A. Das, and A. Dey, “Malignant and benign breast cancer classification using machine learning algorithms,” in 2021 International Conference on Artificial Intelligence (ICAI). IEEE, 2021, pp. 97–101.
[2] H. A. Le Thi, V. V. Nguyen, and S. Ouchani, “Gene selection for cancer classification using DCA,” in Advanced Data Mining and Applications: 4th International Conference, ADMA 2008, Chengdu, China, October 8-10, 2008. Proceedings 4. Springer, 2008, pp. 62–72.
[3] T. Araújo, G. Aresta, E. Castro, J. Rouco, P. Aguiar, C. Eloy, A. Polónia, and A. Campilho, “Classification of breast cancer histology images using convolutional neural networks,” PLoS ONE, vol. 12, no. 6, p. e0177544, 2017.
[4] F. A. Spanhol, L. S. Oliveira, C. Petitjean, and L. Heutte, “Breast cancer histopathological image classification using convolutional neural networks,” in 2016 International Joint Conference on Neural Networks (IJCNN). IEEE, 2016, pp. 2560–2567.
[5] M. Amrane, S. Oukid, I. Gagaoua, and T. Ensari, “Breast cancer classification using machine learning,” in 2018 Electric Electronics, Computer Science, Biomedical Engineerings’ Meeting (EBBT). IEEE, 2018, pp. 1–4.
[6] L. Shen, L. R. Margolies, J. H. Rothstein, E. Fluder, R. McBride, and W. Sieh, “Deep learning to improve breast cancer detection on screening mammography,” Scientific Reports, vol. 9, no. 1, p. 12495, 2019.
[7] M. Tan and Q. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” in International Conference on Machine Learning. PMLR, 2019, pp. 6105–6114.
[8] A. L. Beam and I. S. Kohane, “Big data and machine learning in health care,” JAMA, vol. 319, no. 13, pp. 1317–1318, 2018.
[9] ——, “Big data and machine learning in health care,” JAMA, vol. 319, no. 13, pp. 1317–1318, 2018.
[10] A. Golatkar, D. Anand, and A. Sethi, “Classification of breast cancer histology using deep learning,” in Image Analysis and Recognition: 15th International Conference, ICIAR 2018, Póvoa de Varzim, Portugal, June 27–29, 2018, Proceedings 15. Springer, 2018, pp. 837–844.
[11] S. P. Praveen, P. N. Srinivasu, J. Shafi, M. Wozniak, and M. F. Ijaz, “ResNet-32 and fastai for diagnoses of ductal carcinoma from 2D tissue slides,” Scientific Reports, vol. 12, no. 1, p. 20804, 2022.
[12] B. Maistry and A. E. Ezugwu, “Breast cancer detection and diagnosis: A comparative study of state-of-the-arts deep learning architectures,” arXiv preprint arXiv:2305.19937, 2023.
[13] M. Nawaz, A. A. Sewissy, and T. H. A. Soliman, “Multi-class breast cancer classification using deep learning convolutional neural network,” Int. J. Adv. Comput. Sci. Appl, vol. 9, no. 6, pp. 316–332, 2018.
[14] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
[15] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[16] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[17] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1492–1500.
[18] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[19] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “MobileNets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
[20] ——, “MobileNets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
[21] M. Fereidouni, A. Mosharrof, U. Farooq, and A. Siddique, “Proactive prioritization of app issues via contrastive learning,” in 2022 IEEE International Conference on Big Data (Big Data). IEEE, 2022, pp. 535–544.
[22] A. Nasiri-Sarvi, M. S. Hosseini, and H. Rivaz, “Vision mamba for classification of breast ultrasound images,” arXiv preprint arXiv:2407.03552, 2024.