A New Era in Computational Pathology: A Survey on Foundation and Vision-Language Models

Dibaloke Chanda (ORCID: 0000-0001-5993-659X),
Milan Aryal (ORCID: 0009-0005-1326-9804),
Nasim Yahya Soltani (ORCID: 0000-0002-4502-8715),
Masoud Ganji
Abstract

Recent advances in deep learning have completely transformed the domain of computational pathology (CPath), which in turn altered the diagnostic workflow of pathologists by integrating foundation models (FMs) and vision-language models (VLMs) in their assessment and decision-making process. FMs overcome the limitations of existing deep learning approaches in CPath by learning a representation space that can be adapted to a wide variety of downstream tasks without explicit supervision. VLMs allow pathology reports written in natural language to be used as a rich semantic information source to improve existing models as well as generate predictions in natural language form. In this survey, a holistic and systematic overview of recent innovations in FMs and VLMs in CPath is presented. Furthermore, the tools, datasets and training schemes for these models are summarized in addition to categorizing them into distinct groups. This extensive survey highlights the current trends in CPath and the way it is going to be transformed through FMs and VLMs in the future.

Index Terms:
Computational Pathology, Foundation Models, Multi-Modal, Vision-Language Models

I Introduction

In recent years there has been a surge of artificial intelligence (AI) based approaches [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18] in CPath owing to the wide adoption of digital slide scanners. As a result, large-scale curation and annotation [19, 20] of whole slide images (WSIs) has become possible, which provides adequate data to train these AI-based models. The goal of these models is to automate and expedite diagnosis and prognosis in CPath. The traditional diagnosis process in CPath is time-consuming and requires experts with extensive domain knowledge. In addition, the wide variation in pathology and the heterogeneity between tasks make it difficult to devise a unified general approach.

Several research studies have addressed this lack of a unified approach, and among the proposed methods, FMs have gained a lot of attention in recent years [21, 22, 23, 24, 25, 26]. FMs leverage self-supervised learning (SSL) [27] schemes to learn a rich representation in a task-agnostic manner. Owing to self-supervised pre-training (SSPT), FMs do not require large-scale annotated data, which are hard to come by in CPath. Furthermore, these models can be trained with a diverse selection of datasets containing tissue samples from different organs and associated with different cancer types, scanner types, etc. As a result, the pre-trained model can easily be used in a wide range of downstream tasks while remaining robust to extreme variation in tissue samples.

Figure 1: Number of publications in FMs and VLMs in pathology (from Google Scholar). The search was done with the keywords “vision-language” + “pathology” for VLMs statistics and “foundation models”+ “pathology” for FMs statistics.
Figure 2: Outline of the major challenges in CPath (challenges in data collection, challenges in data annotation, lack of diverse data, large number of tasks and challenges in deep learning architectures). Several causes and consequences for each challenge are outlined in addition to how FMs and VLMs address these challenges.

The impact of FMs in CPath can be amplified by integrating the power of VLMs [28], which surged in popularity after OpenAI introduced the contrastive language-image pre-training (CLIP) model [29]. Pathology reports, books, educational videos, etc. are a rich source of semantic information that VLMs can use to boost performance in ways that are not possible with vision-only models. When used in conjunction with FMs, they can act like AI pathologists capable of performing a vast array of tasks, as evidenced by recent research [25]. Even though CPath is a specialized field, the number of publications focusing on FMs and VLMs has increased sharply, as shown in Fig. 1, which indicates the future direction of this field.

To appreciate the full impact of FMs and VLMs in CPath, the major challenges in CPath are outlined below. In Fig. 2 key points of these challenges are summarized. All these challenges are addressed in some form by FMs and VLMs as mentioned in Fig. 2.

I-A Scope of the Review

In this review, the main emphasis is put on the application of FMs and VLMs in CPath, especially the details of their architectures and training schemes. Note that these two categories are not mutually exclusive: some research articles belong to both and are referred to as vision-language foundation models (VLFMs). In addition, details of multi-modal datasets are summarized with a focus on vision and language as the modalities.

TABLE I: Number of Surveyed articles published in top journals and conferences within 2023-2024
Journal/Conference Venue Counts
Nature/Nature Medicine 6  
CVPR 8  
NeurIPS 2  
ECCV 2  
MICCAI/MICCAI Workshop 4  
AAAI 1  
Elsevier/Springer 3  

Several self-imposed rules and restrictions were used as guidelines throughout the review to ensure the scope of the paper is maintained.

  1. First, articles that focus solely on pathology are included and articles that focus on other areas of the biomedical domain are excluded. As an example, articles like BiomedCLIP [30] with pathology as a subsection of the research are not included in the review.

  2. Secondly, only vision-language models are included and articles that use other modalities along with vision are excluded. As an example, along with vision, transcriptomics [31] can be used to solve pathological tasks. However, such papers are excluded to maintain the scope.

  3. Both peer-reviewed articles and pre-print articles are included in the survey. Among the peer-reviewed articles, a significant number of articles are published in top-tier journals and conferences as shown in Table I.

Figure 3: Visualization of the timeline of recently published work in CPath utilizing FMs and VLMs as well as multi-modal datasets. To maintain transparency, we clearly annotate research articles that have been peer-reviewed and articles that are available as pre-prints. Furthermore, high-impact pioneering research works published in prominent journals are highlighted. For pre-prints with multiple versions, the latest version and its corresponding date are used.

I-B Contribution and Organization

In the past few years, quite a few review articles [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18] have been published focusing on computational pathology. Most of these reviews cover digital and computational pathology and the application of deep learning as a whole rather than focusing on a specific subtopic. A handful of papers focus on a specific architecture such as MIL [32], graph-based models [33, 17], transformer-based models [13, 10], LLMs [18], etc. The contributions of this review article are as follows:

  1. To the best of our knowledge, this is the first review article to summarize recent advances (most articles from 2023-2024) in foundation and vision-language models (Sections III and IV). The closest peer-reviewed survey article [8] mostly provides a high-level overview without going into the details of foundation models.

  2. An exhaustive list of multi-modal datasets (Section II) in computational pathology that are being used or can be used in vision-language research is outlined.

  3. Given the diverse datasets and architectures, descriptions of individual research works are provided in tabular format (Table II, Table IV, Table V, Table VIII) so that they are easier for readers to follow.

  4. A categorized and annotated timeline of the surveyed articles (Fig. 3) is provided, which gives a clear idea of the evolution of FMs and VLMs in CPath.

The rest of the article is organized as follows. In Section II, existing multi-modal datasets are listed along with details about their source, pre-processing techniques, etc. In Section III, existing FMs in CPath are outlined and the vision and vision-language pre-training schemes for these FMs are described. In Section IV, an extensive list of VLMs in CPath is provided along with details about their architecture, utilized datasets and contribution. Finally, Section V concludes the paper.

II Multi-Modal Datasets in Pathology

In this section, a comprehensive summary of the existing multi-modal datasets available in CPath is provided. The modalities taken into account are vision and language. The one dataset with additional modalities, CR-PathNarratives, is included because it is referenced in many other VLM articles. All key information regarding each dataset is summarized in Table II.

Figure 4: Different components of multi-modal datasets in computational pathology: Type of datasets, source of data, annotation and pre-processing

To have a comprehensive understanding of the existing multi-modal datasets for pathology, three different components need to be considered as shown in Fig. 4.

TABLE II: Summary of Multi-Modal Datasets in Pathology
Dataset Type of Data Size Image and Text Source Method of Generation/Procurement Other Utilized Dataset/Models for Generation/Procurement Dataset Used By Variation/ Subset/ Extension Availability (Linked)
PathText [34] WSI and question-text pairs 1,041 WSI 9,009 WSI-text pairs WSI and pathology report from TCGA-BRCA Prompting LLMs, OCR, manual annotation, classifier MI-Gen [34]
WSI-VQA [35] WSI and Q/A pairs 977 WSIs 8,672 question-answer pairs (slide-level) with average 8.9 Q/A pairs per WSI WSI and pathology report from TCGA-BRCA Prompting LLMs and template matching heuristics Model/Framework: GPT-4 [36] Wsi2Text Transformer (W2T) [35] 4535 close-ended VQA subset and 4137 open-ended VQA subset
PathGen-1.6M [37] Image-caption pairs 1.6 million WSI and pathology report from TCGA Multi-agent collaboration and caption generation with LMMs Dataset: PathCap [38], Quilt-1M [39], OpenPath [40] Model/Framework: LLaVA-v1.5 [41], Vicuna [42] OpenCLIP [43], GPT-4 [36] PathGen-CLIP [37], PathGen-LLaVA [37] 200K instruction-tuning data (Extension)
PathMMU [44] Image and Q/A pairs 24,067 pathology images 33,428 Q/A PubMed, pathology atlas, educational videos, pathologist-shared images with explanations on Twitter Prompting GPT-4, Heuristics, Annotation by seven pathologists Dataset: Quilt-1M [39], OpenPath [40] Model/Framework: GPT-4 [36], YOLOv6 [45] PubMed, SocialPath EduContent, Atlas, PathCLS (Subsets)
Quilt-Instruct [46] Instruction-tuning question-answer pairs 107,131 histopathology-specific Q/A pairs Over 1,000 hours of 4,149 educational histopathology videos from YouTube Prompting GPT-4, hand-crafted algorithms for extraction of video frames and spatial annotation Dataset: Quilt-1M [39] Model/Framework: GPT-4 [36] QUILT-LLAVA[46] Quilt-VQA with 985 images 1,283 Q/A pairs where 940 are open-set and 343 closed-set
QUILT [39] Image-text pairs 437,878 images aligned with 802,144 text pairs 1,087 hours of 4,504 narrative educational histopathology videos from YouTube Prompting LLMs, hand-crafted algorithms, human knowledge databases (UMLS), automatic speech recognition Dataset: OpenPath [40], PubMed, LAION-5B [47] (for Quilt-1M) Model/Framework: Whisper [48], GPT-3.5, inaSpeechSegmenter [49], langdetect [50] QUILT-Net [39] Quilt-1M with 1 million image-text pairs with additional data from LAION, Twitter, and PubMed (Extension)
PathQABench [25] ROI-annotated WSI and Q/A 48 H&E WSIs (25 WSIs from PathQABench-Private 23 WSIs from PathQABench-Public) + 48 close-ended Q/A 115 open-ended Q/A PathQABench-Private from private in-house cases PathQABench-Public from TCGA cases Expert-pathologists curated and annotated PathChat[25] (Subsets) PathQABench-Public and PathQABench-Private PathChatInstruct 456,916 instruction-tuning dataset (Extension)
PathCap [38] Image-caption pairs 207K PubMed, internal pathology guidelines books, annotation by expert cytologists Parsing from PubMed, image processing with YOLOv7, ConvNeXt, PLIP, caption refinement and text processing with ChatGPT Dataset: PubMed Model/Framework: YOLOv7 [51], ConvNeXt [52] ChatGPT [53], PLIP [40] PathAsst [38] PathInstruct 180K pathology multimodal instruction-following samples (Extension)
OpenPath [40] Image-text pairs 208,414 image–text pairs 116,504 image–text pairs from Twitter posts, 59,869 image–text pairs from replies, 32,041 image–text pairs from LAION-5B Pathology image classifier to exclude non-pathology images, CLIP image embeddings with cosine similarity to create PathLAION, other hand-crafted heuristics Dataset: LAION-5B [47] Model/Framework: CLIP [29], langdetect [50] PLIP [40] PathLAION 32,041 pathology images from the LAION-5B dataset (Subset)
CR-PathNarratives [54] WSIs with multi-modal annotations 174 annotated colorectal WSIs Proprietary data from a hospital Annotation by eight pathologists which contains ROI, voice information and behavioral trajectory information of the annotators Model/Framework: PathNarrative annotation tool [54]
ARCH [55] Image-caption pairs 11,816 image-caption pairs PubMed and pathology textbooks Hand-crafted algorithms with tools like Pubmed Parser, Pdffigures 2.0 Model/Framework: Pubmed Parser [56] Pdffigures 2.0 [57]
PathVQA [58] Image and Q/A pairs 4,998 images 32,799 Q/A pairs Pathology textbooks, Pathology Education Informational Resource (PEIR) digital library Hand-crafted algorithms with tools like PyPDF2, PDFMiner, Stanford CoreNLP toolkit Model/Framework: PyPDF2, PDFMiner Stanford CoreNLP toolkit [59] 16,465 open-ended Q/A subset, 16,334 close-ended Q/A subset
Prov-Path [26] WSI and reports pairs 17,383 WSI-reports pairs Proprietary dataset from Providence Health System (PHS) K-means to generate four representative reports which are used to prompt GPT 3.5 to clean rest of the reports Model/Framework: GPT-3.5 Prov-GigaPath [26]
Dataset from CONCH  [24] Image-caption pairs 1,786,362 image-caption pairs PubMed, publicly available research articles, internal data from Mass General Brigham institution Hand-crafted workflow with YOLOv5, CLIP, BioGPT Model/Framework: YOLOv5 [60], CLIP BioGPT [61] CONCH [24] PMC-Path (data from PubMed) EDU (data extracted from educational notes)
Dataset from MI-Zero  [62] Image-caption pairs 33,480 image-caption pairs Publicly available educational resources combined with ARCH Hand-crafted algorithms to filter out non-pathology images Dataset: ARCH MI-Zero [62]
Availability: GitHub repositories (except for ARCH [55] and PathVQA [58]; direct dataset source is linked for these two papers) with the associated paper are linked if they are open-sourced and accessible. Some repositories only provide data generation and processing code (if the data is proprietary or requires an API) and some provide direct data sources through platforms such as Hugging Face Hub, Zenodo. Other Utilized Dataset/Models: Dataset/Models/Framework used only in the data generation/ processing part is mentioned. Downstream task dataset is not mentioned for this reason.

II-A Type of Datasets:

The first component is the type of data, which can be broadly classified into five categories. CR-PathNarratives [54] and Prov-Path [26] fall into the last category, which is dissimilar to the rest of the datasets. In CR-PathNarratives, WSIs are associated with specially annotated additional modalities: the voice and behavioral trajectory information of the pathologists who annotated the dataset. These additional modalities were collected using the PathNarrative multimodal interactive annotation tool proposed in the same research work. The Prov-Path dataset utilizes WSIs and the corresponding reports along with histopathology findings, cancer staging, genomic mutation profiles, etc., collected by Providence Health System (PHS).

Figure 5: Comparison between the size of different multi-modal datasets. The size of the bubbles indicates the size of the data set and the number of elements in a data set is mentioned on the right of the bubbles. Color is used to indicate the different types.

The rest of the datasets can be categorized into the remaining four categories. The image-caption/image-text pair category (PathGen-1.6M [37], Quilt-1M [39], OpenPath [40], PathCap [38], ARCH [55]) involves a low-to-medium quality image and an associated piece of text for that image. This text can be a short caption describing the image or a more elaborate description. ARCH is the earliest dataset in this category and used PubMed and pathology textbooks to extract the texts. PathGen-1.6M is the latest and largest dataset in this category, and unlike the others, its images are patches extracted from WSIs. In contrast, OpenPath is constructed from Twitter posts and replies, and Quilt-1M is constructed from frames extracted from educational pathology videos. The curation and annotation process for this category is comparatively easy because it can be automated with hand-crafted algorithms and heuristics. The trade-off is noisier data, since the automated process can leave many artifacts in the dataset.

The WSI VQA/text category (PathText [34], WSI-VQA [35], PathQABench [25]) contains question-and-answer pairs or texts associated with WSIs. Comparatively, the generation process for this category is more difficult and often involves prompting LLMs to extract information or format it according to a certain template. The most common data source is The Cancer Genome Atlas (TCGA) [19], which contains a large repository of WSI and patient report pairs. The VQA part can be of two types: close-ended and open-ended question-answer pairs. Close-ended question-answer pairs are of a multiple-choice or short-answer type. Open-ended question-answer pairs, on the other hand, contain answers in natural language form. Among the datasets in this category, PathQABench is unique as it contains ROI annotations of WSIs performed by expert pathologists. It has two subsets, PathQABench-Public and PathQABench-Private. The former is publicly available as it was constructed with TCGA WSIs and reports, and the latter was constructed with in-house data.
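The distinction between close-ended and open-ended items can be made concrete with a small record type. The following is only an illustrative sketch: the class `WsiQA`, its field names and the helper `exact_match_accuracy` are hypothetical and do not correspond to the schema of any dataset listed above. It assumes that close-ended items can be scored by exact string match, whereas open-ended answers would need a natural-language metric that is outside the scope of this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WsiQA:
    """One question-answer item attached to a whole slide image (hypothetical schema)."""
    slide_id: str                        # hypothetical identifier of the WSI
    question: str
    answer: str
    close_ended: bool                    # True: multiple-choice or short answer
    choices: Optional[List[str]] = None  # only set for multiple-choice items


def exact_match_accuracy(items: List[WsiQA], predictions: List[str]) -> Optional[float]:
    """Exact-match accuracy over the close-ended items only."""
    closed = [(it, p) for it, p in zip(items, predictions) if it.close_ended]
    if not closed:
        return None
    hits = sum(p.strip().lower() == it.answer.strip().lower() for it, p in closed)
    return hits / len(closed)


# Toy items with made-up content, for illustration only.
items = [
    WsiQA("slide-001", "Is tumor present?", "yes", True, ["yes", "no"]),
    WsiQA("slide-001", "What is the histological grade?", "2", True),
    WsiQA("slide-001", "Describe the tissue architecture.", "Free-text description.", False),
]
accuracy = exact_match_accuracy(items, ["Yes", "3", "Some generated description."])
```

Keeping the open-ended item in the same list but excluding it from exact-match scoring mirrors the way the two subsets are typically reported separately.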

The next category, VQA (PathMMU [44], PathVQA [58], Quilt-VQA [46]), is similar to the previous category as it also contains close-ended and open-ended question-answer pairs, but the associated images are low-to-medium-quality images rather than WSIs. Among these datasets, PathVQA is the first pathology-specific VQA dataset. PathMMU is the latest and largest dataset in this category and also provides explainability annotations with each answer. PathMMU used two earlier datasets from the image-caption/image-text category, OpenPath and Quilt-1M, to generate part of its question-answer pairs with images.

The final category, the instruction-tuning dataset (Quilt-Instruct [46], PathInstruct [38], PathChatInstruct [25], extension of PathGen-1.6M [37]), is a unique kind of dataset, as it is used to provide conversational ability to an existing multimodal model. The common workflow is to apply the instruction-tuning dataset in the last phase to fine-tune an already trained VLM. All of these datasets were created following the strategy described in LLaVA [41] or LLaVA-1.5 [63]. In Fig. 5 a visual comparison of the size of each dataset in each category is provided.

II-B Source of Data:

The second component to consider is the source of data which largely dictates the third component, the annotation and pre-processing. PubMed is a common data source containing pathology images and captions/text. However, the quality of the data is not as high as that of the TCGA repository that contains WSIs and corresponding pathological reports. Other high-quality data sources that contain WSIs and pathology reports are in-house proprietary datasets. A unique data source, OpenPath, contains pathology images and texts from Twitter posts and replies associated with a large pathology community.

Figure 6: Subsets of the Quilt-1M dataset and the corresponding number of image-text pairs for each subset.
Figure 7: Subsets of the PathMMU dataset and the proportion of each subset. The EduContent and SocialPath subsets are sourced from the Quilt-1M and OpenPath datasets.

This dataset was supplemented by pathology-specific data from the large-scale artificial intelligence open network (LAION) data repository. Pathology textbooks and atlases are also large knowledge sources that can be used to extract image-caption/text pairs. In a couple of recent studies [39, 46], educational histopathology videos on YouTube are used as the source of pathology image and text pairs. However, curation of this kind of dataset requires a series of hand-crafted algorithms and many external tools.

Based on the above discussion, it is apparent that there is a trade-off between the quality and volume of the data. Datasets like PathQABench are superior in terms of quality, as they were curated by expert pathologists. However, because this requires explicit manual annotation and supervision, such datasets are much harder to scale and hence small in size.

Another key point is that most of these datasets contain other datasets as subsets. An example is Quilt-1M, which contains OpenPath, PubMed and LAION as subsets in addition to the originally proposed Quilt dataset. A visualization of Quilt-1M and its subsets is provided in Fig. 6. Another such example is the PathMMU dataset (shown in Fig. 7), which contains five different subsets, PubMed, SocialPath, EduContent, Atlas and PathCLS, each containing data from different sources.

II-C Annotation and Pre-Processing:

The third component is the annotation and pre-processing steps of data curation and generation. Each surveyed article employs its own series of steps depending on the type of dataset, so no two complete annotation and pre-processing pipelines are identical. However, some specific steps are similar when the data source is the same. For example, all articles that use PubMed as a data source apply some kind of parsing process [56] to extract figures and texts. In addition, they employ light-weight classifiers and object detection architectures (YOLOv5 [60], YOLOv6 [45], YOLOv7 [51]) to distinguish between pathology and non-pathology images, detect and separate subfigures, etc. Another common approach is to prompt LLMs to format and refine captions/text or to structure extracted information according to a pre-defined template. These LLMs include general-purpose LLMs like GPT-4, GPT-3.5 and ChatGPT as well as specialized LLMs like BioGPT. Another widely used strategy is to take a trained CLIP-based model and use cosine similarity between embeddings as the metric to separate pathology from non-pathology images.
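The cosine-similarity filtering step can be sketched as follows. This is a minimal illustration and not the pipeline of any specific dataset: it assumes image embeddings have already been produced by some CLIP-style encoder, that a small set of known pathology images is available as reference, and that a candidate is kept when its best similarity to any reference exceeds a threshold. The function names and the threshold value of 0.8 are hypothetical.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a (n, d) and rows of b (m, d)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def filter_pathology_like(candidates: np.ndarray,
                          references: np.ndarray,
                          threshold: float = 0.8):
    """Return indices of candidates whose best match to a reference embedding
    reaches the (assumed) threshold, together with the best similarities."""
    best = cosine_similarity(candidates, references).max(axis=1)
    return np.flatnonzero(best >= threshold), best


# Toy 3-dimensional "embeddings" standing in for real encoder outputs.
references = np.array([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]])
candidates = np.array([[2.0, 0.1, 0.0],   # close to the references
                       [0.0, 1.0, 0.0],   # mostly orthogonal
                       [0.0, 0.0, 3.0]])  # orthogonal
kept, best = filter_pathology_like(candidates, references)
```

In practice the threshold would have to be tuned on a labelled validation subset, since too low a value admits non-pathology images and too high a value discards valid ones.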

Apart from the approaches mentioned above, there are a lot of other hand-crafted algorithms, heuristics and tools which are summarized in Table II.

III Foundation Model

Figure 8: A high-level visualization of different factors of variability in foundation models in terms of organs, stain types, scanner types, magnification levels, preservation methods, downstream tasks, task levels, pre-training tissue sample sources, etc. The works that put the most emphasis on ensuring variability are RudolfV [64] and PLUTO [65]. Among the tissue sample sources for pre-training, TCGA and PAIP are publicly accessible and the rest are proprietary.

In this section, an overview of existing FMs in CPath is provided. First, the characteristics of FMs are provided to remove any ambiguity for the later sections. Next, pre-training workflow and typical pre-training schemes are mentioned.

III-A Characteristics of FMs

A model can be classified as an FM if it has the following characteristics:

  1. The first characteristic common to all FMs is SSPT. The data used in the pre-training phase do not have any explicit label or annotation. Specifics of the SSL strategies are given in Section III-B.

  2. The training goal of FMs is not to solve any specific task but rather to learn a general and rich representation space. For VFMs, it is a vision representation space and for VLFMs it is a vision-language representation space. The training of FMs is termed “pre-training” because further training is required in later stages to optimize for a specific task.

  3. In CPath, FMs are trained using large and diverse datasets that encompass tissue samples from different organs and anatomic sites. In addition, some research [64, 65] puts effort into capturing diversity in terms of scanner types, magnification levels, stain types, preservation methods, etc. The idea is to capture the representation of a wide range of tissue and disease types and to ensure that the models are robust in downstream tasks. In Fig. 8 a high-level visualization is provided highlighting different aspects of FMs.

  4. Another characteristic is the size of the models, which typically have parameters on the scale of millions. A huge amount of computing resources, involving multiple GPUs, is put into training these models.

As shown in Fig. 3, the FMs are sectioned into three separate categories. One category encompasses VFMs [22] in CPath, the second category encompasses VLFMs in CPath and the last category utilizes these FMs by providing a benchmark [66, 67, 68, 69, 70], framework [71] or adapting FMs [72, 73, 74, 75, 76, 77]. Before going into the details of each category in the following section, an overview of the pre-training strategies of FMs is provided.

III-B Pre-training Workflow and Strategies

The typical pre-training workflow of FMs is shown in Fig. 9 which provides a high-level visualization of different phases in the workflow.

This single diagram shows the pre-training strategies for both VFMs and VLFMs. For VFMs, a vision module with some initial weights is used in pre-training. The term “vision module” is used to represent a wide range of vision architectures, including modified versions of these architectures with additional layers. The most common architectures are variants of the vision transformer (ViT) [78], which include ViT-S, ViT-B, ViT-L and ViT-H. There are also specialized architectures, such as those in BEPH [79] and Prov-GigaPath [26], which use the BEiTv2 and GigaPath architectures respectively. Note that some models initialize the architectures with ImageNet [80] weights to obtain an initial vision representation space that can be transformed through pre-training. In Table III, a summary of pre-training strategies in the different phases is outlined.

Figure 9: Typical pre-training workflow in FMs. In general it starts with a vision module and a language module with randomly initialized weights or pre-set weights. VFMs only go through unimodal vision pre-training to learn a vision representation space. On the other hand, VLFMs can optionally go through unimodal pre-training for their vision and language modules.
TABLE III: Different pre-training phases and Corresponding SSPT Strategies
Pre-training phase | Strategy
Unimodal vision pre-training | Self-Distillation: DINO [81], DINOv2 [82]
Unimodal vision pre-training | Contrastive Learning: MoCov3 [83]
Unimodal vision pre-training | Masked Image Modeling (MIM): MAE [84], iBOT (MIM + Self-Distillation) [85]
Unimodal language pre-training | No Specific Strategy
Vision-language pre-training | CoCa [86], CLIP [29]
TABLE IV: Summary of Vision Pre-Training Strategy and Vision Pre-Training Dataset of Foundation Models in Pathology
Model Vision Pre-Training Strategy Vision Pre-Training Dataset Type of Foundation Model Availability (Linked)
Category Approach Architecture Source Size Stain Additional Information
GPFM [22] MIM + Self-Distillation + (Expert Knowledge Distillation) Custom Custom architecture that uses other FMs including UNI, Phikon, CONCH 33 public datasets including TCGA, GTExPortal, PAIP, CPTAC, etc. Slides: 86,104 Tiles: 190,000,000 47 data sources 34 major tissue types Combines Knowledge of VFMs and VLFMs
Virchow 2 [21] Virchow 2G Self-Distillation DINOv2 ViT-H ViT-G Memorial Sloan Kettering Cancer Center (MSKCC) + Institutions worldwide Slides: 3,134,922 Tiles: H&E IHC 225,401 patients 493,332 cases 871,025 specimens Vision
Virchow [23] Self-Distillation DINOv2 ViT-H Memorial Sloan Kettering Cancer Center (MSKCC) Slides: 1,488,550 Tiles: 2 billion H&E 17 organs 119,629 patients 208,815 cases 392,268 specimens Vision
RudolfV [64] Self-Distillation DINOv2 ViT-L 108,433 slides from 15 labs across the EU and US + 26,565 slides from TCGA Slides: 133,998 Tiles: 1.2 billion H&E (68%) + IHC (15%) + other (17%) 14 organs 15 lab 58 tissue types 129 stains 6 scanner types FFPE and FF tissue Vision
PLUTO [65] Self-Distillation + MIM DINOv2 variation + MAE ViT-S variant FlexiViT-S [87] TCGA + Proprietary data from PathAI Slides: 158,000 Tiles: 195 million H&E + IHC + Other stains More than 30 diseases More than 12 scanners More than 100 stain types More than 4M pathologist pixel-level annotations Vision
Hibou [88] Self-Distillation DINOv2 ViT-B (for Hibou-B) ViT-L (for Hibou-L) Proprietary Data Slides: 936,441 H&E +202,464 non-H&E + 2,676 cytology slides Tiles: 1.2 billion patches for Hibou-L, 512 million patches for Hibou-B H&E + non H&E 306,400 cases Vision
BEPH [79] MIM BEiTv2 [89] with VQ-KD autoencoder and ViT-B encoder TCGA Slides: 11,760 Tiles: 11,774,353 H&E 32 cancer types Vision
UNI [90] Self-Distillation DINOv2 (also MoCov3 for comparison) ViT-L Proprietary Data from Massachusetts General Hospital (MGH), Brigham and Women’s Hospital (BWH), Genotype–Tissue Expression (GTEx) consortium Slides: 100,426 Tiles: over 100 million H&E 20 major tissue types Vision
3B-CPath [91] MIM and Self-Distillation MAE, DINO ViT-S, ViT-L Mount Sinai Health System (MSHS) Slides: 423,600 Tiles: 3.25 billion H&E 3 anatomic site 2 institutions Vision
PathoDuet [92] Contrastive Learning MoCov3 ViT-B TCGA Slides: 11,000 Tiles: 13,166,437 H&E Vision
Phikon [93] MIM+Self-Distillation iBOT ViT-S, ViT-B, ViT-L TCGA (Three variants TCGA-COAD, PanCancer4M, PanCancer40M) Slides: 6,093 Tiles: 43,374,634 (For PanCancer40M) H&E 16 cancer types 13 anatomic sites 5,558 patients (For PanCancer40M) Vision
CTransPath [94] Contrastive Learning MoCov3 variation (SRCL: semantically relevant contrastive learning ) Swin Transformer TCGA + PAIP Slides: 32,220 Tiles: 15.6 million H&E 32 cancer types 25 anatomic sites Vision
Prov-GigaPath [26] MIM and Self-Distillation MAE, DINOv2 GigaPath [26] (constructed with ViT and LongNet [95]) Providence Health System (PHS) Slides: 171,189 Tiles: 1.3 billion H&E IHC 30,000 patients 28 cancer centers 31 major tissue types Vision Language
CONCH [24] MIM + Self-Distillation iBOT ViT-B backbone in image encoder In-house dataset Slides: 21,442 Tiles: 1.3 billion Self-Distillation 350 cancer subtypes Vision Language

The term “unimodal” is used to signify a pre-training phase that uses a single modality out of vision and language. This is inherently different from vision-language pre-training strategies, which involve both vision and language modalities to learn a joint vision-language representation space. In the unimodal vision pre-training phase, three strategies are commonly used in CPath: self-distillation, contrastive learning and masked image modeling (MIM). Each approach has its own advantages and disadvantages; however, the most popular approach is self-distillation with no labels (specifically DINOv2), which serves as the SSL scheme in many recent FMs [23, 64, 90, 88]. The work from Lunit [96] benchmarks DINO, MoCov2, SwAV [97] and Barlow Twins [98], but concludes that there is no clear best SSL scheme. Lai et al. [99] is another work that performs a thorough evaluation of SSL schemes in CPath.
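To illustrate how a joint vision-language representation space can be learned, the following is a minimal NumPy sketch of a CLIP-style symmetric image-text contrastive loss. It is a simplified illustration rather than the training code of any surveyed model: the embeddings are assumed to come from separate image and text encoders, matching pairs are assumed to share the same row index in the batch, and the temperature is fixed at 0.07 for simplicity (in CLIP it is a learnable parameter). The function names are chosen here for illustration.

```python
import numpy as np


def log_softmax(x: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def clip_style_loss(image_emb: np.ndarray, text_emb: np.ndarray,
                    temperature: float = 0.07) -> float:
    """Symmetric cross-entropy over the image-text similarity matrix.
    Row i of image_emb is assumed to match row i of text_emb."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature
    idx = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return float(0.5 * (loss_i2t + loss_t2i))


# Toy check: perfectly aligned pairs versus deliberately mismatched pairs.
aligned = clip_style_loss(np.eye(4), np.eye(4))
shuffled = clip_style_loss(np.eye(4), np.roll(np.eye(4), 1, axis=0))
```

The loss is small when each image is most similar to its own text and large when the pairing is broken, which is the signal that pulls matching image and text embeddings together.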

Figure 10: Visualization of the self-distillation approach of unimodal vision pre-training scheme.

III-B1 Unimodal Vision Pre-training

TABLE V: Summary of Vision-Language Pre-Training and Instruction Tuning Phase of Foundation Models in Pathology
Model Vision Language Modules Vision-Language Pre-Training Phase Instruction Tuning Phase
Vision Module Language Module Additional Layer/Module Architecture / Framework Pre-Training Process Instruction Tuning Process
CONCH [24] An image encoder with ViT-B backbone with 12 transformer layers, 12 attention heads followed by two attentional pooler modules A GPT-style text encoder with 12 transformer layers A GPT-style multimodal decoder with 12 transformer layers CoCa Visual-language pre-training with image–text contrastive loss and the captioning loss according to CoCa framework
PLIP [40] An image encoder with ViT-B A text transformer as the text encoder with modifications mentioned in [100] CLIP Fine-tuning of CLIP model through contrastive learning using OpenPath dataset
PathChat [25] UNI [90] as vision encoder backbone Pre-trained Llama 2 [101] LLM which is a decoder-only transformer-based auto-regressive model A multimodal projector module to connect the outputs of the vision module to language module by projecting the visual tokens to the same dimension as the LLM’s embedding space for text tokens CoCa Vision-language pre-training according to CoCa framework with CONCH dataset LLaVa 1.5 [63] training approach First phase: Only the parameters of multi-modal projector is updated Second phase: Fine-tuned with instruction-following data
PathCLIP and PathAsst [38] Image encoder with ViT-B Text transformer as the text encoder with modifications mentioned in [100] Fully-connected layer after vision encoder to map the image embedding space to the corresponding language embedding CLIP Fine-tuning of CLIP model through contrastive learning using PathCap dataset First phase: Detailed description-based part is used to train the fully-connected layer connected to the vision encoder. Second phase: fine-tuned with instruction-following data via next word prediction
PRISM [102] Virchow as tile encoder, Perceiver [103] as slide-encoder BioGPT first 12 layers BioGPT last 12 layers as vision-language decoder CoCa Trained using contrastive loss and generative/ captioning loss
The self-distillation with no labels approach uses a student-teacher network to learn a rich vision representation space, as shown in Fig. 10. From a single image patch, two different views are generated by applying augmentations sampled from a set of possible augmentations (color jittering, Gaussian blur, solarization, etc.). The views are passed through the student and teacher networks, and the outputs of both networks are used to compute a cross-entropy loss, which is then used to update the parameters of the student network. The parameters of the teacher network are then updated through an exponential moving average (EMA) of the student network parameters. Among the surveyed articles, Virchow2, Virchow, RudolfV, Hibou and UNI use DINOv2 as the self-distillation approach. PLUTO takes a unique approach by integrating MAE and a Fourier loss term to get a custom variation of DINOv2.
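The student-teacher update described above can be illustrated with a minimal sketch. It is not the implementation of any surveyed model: the function names, the temperatures and the momentum value are illustrative assumptions, and details of DINO-style training such as output centering and multi-crop views are omitted.

```python
import math

def softmax(logits, temperature):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, t_student=0.1, t_teacher=0.04):
    """Cross-entropy H(teacher, student); the teacher output is a fixed target."""
    p_teacher = softmax(teacher_logits, t_teacher)
    p_student = softmax(student_logits, t_student)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))

def ema_update(teacher_params, student_params, momentum=0.996):
    """Teacher <- momentum * teacher + (1 - momentum) * student."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]

# Toy usage: logits that the two networks produce for two views of one patch.
loss = distillation_loss(student_logits=[1.0, 0.5, -0.2],
                         teacher_logits=[1.2, 0.3, -0.1])
# Only the student is trained by gradient descent; the teacher follows via EMA.
teacher = ema_update(teacher_params=[0.0, 1.0], student_params=[1.0, 0.0])
```

In this sketch the gradient of the loss would flow only into the student, while the teacher parameters change solely through `ema_update`.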

Figure 11: Visualization of the masked image modeling approach of unimodal vision pre-training scheme.
Figure 12: Visualization of the MoCo approach of unimodal vision pre-training scheme.

The second popular SSPT approach is MIM (visualized in Fig. 11), which has variants like MAE and iBOT used in FMs. In the MAE approach, a large, randomly selected portion of the image patches is masked out, and the patches that are not masked out are passed through an encoder, which generates latent representations of those patches. These representations are then passed through a decoder along with tokens for the masked-out regions to reconstruct the image, and a reconstruction loss is used to train the model. The iBOT approach also uses MIM but adds a self-distillation technique by leveraging a student-teacher network. The teacher network works as an online tokenizer, and the student network learns to predict masked patches with the help of the distilled knowledge of the teacher network.
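The masking and reconstruction steps of MAE can be sketched as follows. This is a simplified illustration under stated assumptions: the 75% mask ratio and the helper names are illustrative, the encoder and decoder are replaced by a stand-in prediction, and the loss is computed only over masked patches, as in the original MAE formulation.

```python
import random

def random_masking(num_patches, mask_ratio=0.75, seed=0):
    """Split patch indices into a visible set (fed to the encoder) and a masked set."""
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked

def reconstruction_loss(original, reconstructed, masked):
    """Mean squared error computed only over the masked patches."""
    errors = [(o - r) ** 2
              for i in masked
              for o, r in zip(original[i], reconstructed[i])]
    return sum(errors) / len(errors)

# Toy usage: 16 patches, each flattened to 4 pixel values.
patches = [[float(i)] * 4 for i in range(16)]
visible, masked = random_masking(len(patches))
prediction = [[0.0] * 4 for _ in patches]  # stand-in for a decoder output
loss = reconstruction_loss(patches, prediction, masked)
```

Because only the visible patches pass through the encoder, a high mask ratio also reduces the encoder's compute per image.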

Among the surveyed articles, 3B-CPath and Prov-GigaPath use MAE, but in conjunction with DINO and DINOv2, respectively. Research works utilizing the iBOT approach include Phikon and CONCH.

Another SSPT approach that is comparatively less popular in FMs for CPath is the contrastive learning framework proposed in MoCo [104] (visualized in Fig. 12). Over the years, the original MoCo approach has been revised in terms of architecture and training blueprint, resulting in MoCov2 [105] and MoCov3 [83].

Like the self-distillation approach, MoCo also utilizes two models: an encoder (with a query patch as input) and a momentum encoder (with key patches as input). The embeddings generated by the encoder and the momentum encoder are used to compute a similarity score, which in turn is used for contrastive loss computation. The computed loss is backpropagated to update the parameters of the encoder. The parameters of the momentum encoder are updated through a momentum-based update rule that utilizes the parameters of the encoder. Another innovation of MoCo was a feature queue, which ensures a large set of negative samples without holding the entire dataset in memory. This also ensures the negative samples do not go stale over the training period and that the model sees a diverse set of negative samples to learn a better vision representation. Among the surveyed articles, PathoDuet uses MoCov3, UNI uses MoCov3 for comparison and CTransPath performs a unique variation of MoCov3 called semantically relevant contrastive learning (SRCL).
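A minimal sketch of the three ingredients described above (contrastive loss, momentum update and feature queue) is given below. The helper names, the temperature, the momentum value and the toy two-dimensional embeddings are illustrative assumptions rather than settings taken from any surveyed model.

```python
import math
from collections import deque

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def info_nce(query, positive_key, negative_keys, temperature=0.07):
    """Contrastive loss: the positive key sits at index 0 among the logits."""
    q = normalize(query)
    keys = [normalize(positive_key)] + [normalize(k) for k in negative_keys]
    logits = [dot(q, k) / temperature for k in keys]
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[0]

def momentum_update(key_params, query_params, m=0.999):
    """Momentum encoder <- m * momentum encoder + (1 - m) * encoder."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

# A fixed-length queue: appending new keys evicts the oldest ones.
queue = deque(maxlen=4)
for key in ([0.0, 1.0], [-1.0, 0.0], [0.0, -1.0], [-1.0, -1.0], [-1.0, 1.0]):
    queue.append(key)

loss_aligned = info_nce([1.0, 0.0], [1.0, 0.1], list(queue))
loss_misaligned = info_nce([1.0, 0.0], [-1.0, 0.1], list(queue))
```

The loss is small when the query agrees with its positive key and large when it does not, which is the signal used to update the encoder.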

GPFM [22] is distinct from other approaches as it includes a novel expert knowledge distillation in addition to MIM and self-distillation by utilizing existing FMs UNI, Phikon and CONCH.

III-B2 Vision-Language Pre-training

After unimodal pre-training of the image module and text module, VLFMs go through vision-language SSPT to align the two modalities, i.e. to learn a joint vision-language representation space.

The most popular vision-language SSPT approach in CPath is the CLIP approach shown in Fig. 13.

The inputs to CLIP are a batch of image features encoded by an image encoder and a batch of text features encoded by a text encoder. The model is trained with a contrastive objective that pulls together each corresponding image feature and text feature in the representation space and pushes apart all other pairings. In other words, it maximizes the similarity (specifically cosine similarity) between each matching image-text pair while minimizing the cosine similarity with the other, non-matching embeddings in the batch.
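The contrastive objective can be sketched as a symmetric cross-entropy over the image-text similarity matrix. The snippet below is a simplified illustration: the fixed temperature and the toy three-dimensional features are assumptions, and CLIP itself learns the temperature during training.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where row i should select column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def clip_loss(image_features, text_features, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired image/text features."""
    images = [normalize(v) for v in image_features]
    texts = [normalize(v) for v in text_features]
    # logits[i][j] is the scaled cosine similarity of image i and text j.
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
              for img in images]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy batch of three pairs: matched pairs point in similar directions.
image_batch = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
text_batch = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
matched = clip_loss(image_batch, text_batch)
shuffled = clip_loss(image_batch, text_batch[1:] + text_batch[:1])
```

Here `matched` is lower than `shuffled`, since the diagonal of the similarity matrix holds the true pairs only in the first case.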

Figure 13: Visualization of the CLIP approach of the vision-language pre-training scheme.

The second widely used approach is the CoCa [86] framework shown in Fig. 14.

Figure 14: Visualization of the CoCa approach of the vision-language pre-training scheme.

CoCa contains three different modules: an image encoder, a unimodal text encoder and a multi-modal text decoder. It is trained by leveraging both a captioning loss, which follows a generative objective, and a contrastive loss, which follows a contrastive objective. The input image patches are passed through the image encoder to generate image embeddings, which the multi-modal text decoder attends to through cross-attention. The combination of generative and contrastive objectives forces the model to learn a rich vision-language representation space.
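How the two objectives are combined can be sketched as a weighted sum. The function names and the weights below are illustrative assumptions, and the contrastive term is taken as an already computed scalar.

```python
import math

def captioning_loss(token_probabilities):
    """Negative log-likelihood of the ground-truth caption tokens.

    Each entry is the probability the decoder assigns to the correct next token.
    """
    return -sum(math.log(p) for p in token_probabilities) / len(token_probabilities)

def coca_loss(contrastive, captioning, w_contrastive=1.0, w_captioning=2.0):
    """Weighted sum of the contrastive and the generative (captioning) objective."""
    return w_contrastive * contrastive + w_captioning * captioning

# Toy usage with a stand-in contrastive loss value of 0.5.
caption_term = captioning_loss([0.9, 0.8, 0.95])
total = coca_loss(contrastive=0.5, captioning=caption_term)
```

The relative weights control how strongly the model is pushed towards alignment versus caption generation.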

Details of vision modules, language modules, additional modules and specifics of the pre-training process of FMs are outlined in Table V.

One recent work, Zhou et al. [106] (https://github.com/MAGIC-AI4Med/KEP), proposes an entirely new approach to vision-language SSPT named knowledge enhanced pre-training (KEP) by utilizing PathKT, a pathology knowledge tree curated in the same work. This is different from the CLIP and CoCa pre-training techniques as it uses a novel knowledge encoder and knowledge distillation technique.

III-C Instruction-Tuning Phase

For CPath, the instruction-tuning phase provides the model with conversational ability, i.e. a user can prompt the model and the model responds according to the prompt. Visual instruction tuning adds the ability to provide an image in addition to natural language user prompts. Among the surveyed FMs, PathChat and PathAsst perform instruction tuning with the self-curated datasets PathChatInstruct and PathInstruct, respectively. Both works adopt the strategy employed by LLaVA and LLaVA-1.5, which are pioneering works in visual instruction tuning. A summary of the instruction tuning process is provided in Table V. For both PathChat and PathAsst, the instruction tuning phase is subdivided into two phases. In the first phase, the parameters of the vision encoder are frozen and the layer/module connected to the vision encoder (a multimodal projector for PathChat, fully connected layers for PathAsst) is trained. In the second phase, the model is fine-tuned with instruction-following data.
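The two-phase schedule can be summarized as a mapping from phase to trainable modules. The sketch below is hypothetical: the module names are placeholders, the first phase follows the description above, and the second phase follows the general LLaVA recipe (projector and LLM updated, vision encoder kept frozen), which is an assumption and not a confirmed detail of PathChat or PathAsst.

```python
def trainable_modules(phase):
    """Return which modules receive gradient updates in each phase (illustrative)."""
    if phase == 1:
        # Alignment phase: only the connector between vision encoder and LLM is trained.
        return {"vision_encoder": False, "projector": True, "llm": False}
    if phase == 2:
        # Instruction-following phase (assumed LLaVA-style): projector and LLM are trained.
        return {"vision_encoder": False, "projector": True, "llm": True}
    raise ValueError("phase must be 1 or 2")

for phase in (1, 2):
    active = [name for name, on in trainable_modules(phase).items() if on]
    print(f"phase {phase}: training {', '.join(active)}")
```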

Even though instruction tuning in CPath is relatively new, some recent works like PathInsight [107] specifically focus on this aspect of FMs. Another recent work, CLOVER [108] (https://github.com/JLINEkai/CLOVER), provides a framework for cost-effective instruction learning in pathology.

III-D Downstream Tasks

FMs have the ability to adapt to a vast array of tasks by utilizing the representation space learned during SSPT.

Figure 15: Adaptation and performance evaluation strategy for downstream tasks.

Note that FMs are not trained for any of these specific tasks during the pre-training phase. After pre-training, one of the following strategies is adopted to perform a specific task.

  1. Linear Probing: This is a commonly used technique where a linear classifier/regressor is trained on top of the pre-trained model. During the training of the linear layers, the parameters of the pre-trained model are kept frozen. The loss function and update rule are chosen according to the specifics of the task. This is a computationally cheap way to adapt to a downstream task as the parameters of the pre-trained model do not need to be updated.

  2. KNN Probing: This is another approach to adapt a foundation model to a specific downstream task. It applies the K-nearest neighbors algorithm to the embeddings of the frozen pre-trained model, so no parameters are trained.

  3. Fine-Tuning: This is similar to linear probing in that a classifier/regressor is added on top of the pre-trained model, but the major difference is that the parameters of the pre-trained model are also updated during fine-tuning. Hence, it is computationally much more costly than linear probing. This is sometimes also referred to as the supervised training phase. Note that, most of the time, annotation is only available at the slide level. Training that uses only slide-level labels is called weakly supervised training.
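The two probing strategies above can be sketched on top of frozen embeddings. The embeddings, labels and hyperparameters below are hypothetical, and the linear probe is a plain binary logistic regression chosen for brevity; fine-tuning would additionally update the encoder that produced the embeddings.

```python
import math
from collections import Counter

def knn_predict(query, embeddings, labels, k=3):
    """Majority vote among the k nearest frozen embeddings (Euclidean distance)."""
    ranked = sorted(range(len(embeddings)),
                    key=lambda i: math.dist(query, embeddings[i]))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

def train_linear_probe(embeddings, labels, lr=0.5, epochs=200):
    """Binary logistic regression on frozen embeddings; only w and b are updated."""
    w = [0.0] * len(embeddings[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(embeddings, labels):
            p = 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
            w = [wi - lr * (p - y) * xi for wi, xi in zip(w, x)]
            b -= lr * (p - y)
    return w, b

def linear_predict(x, w, b):
    return int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0)

# Hypothetical frozen embeddings from a pre-trained encoder (two classes).
train_x = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
train_y = [0, 0, 1, 1]
w, b = train_linear_probe(train_x, train_y)
```

Both probes leave the encoder untouched, which is why they are cheap compared with fine-tuning.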

TABLE VI: Different types of downstream tasks and corresponding research works
Downstream tasks Performed By
Cancer Detection [40],[102],[92],[91],[79],[23]
Tumor Detection [40], [38],[90]
Disease/Cancer/Tissue/Tumor/Molecular SubTyping [40], [24],[102],[26],[92],[90],[79],[64]
Cancer Grading [24],[90]
Image/Tissue/Tumor/Nuclear Segmentation [24],[90],[64]
Survival Prediction [79]
Text-to-Image Retrieval [40],[24]
Image-to-Text Retrieval [24]
Image-to-Image Retrieval [40],[90],[64],[22]
Image Captioning [24]
Pattern/Tissue/Image Classification [24],[38],[90]
Biomarker Prediction/Detection/Screening/Scoring [102],[90],[91],[64],[23]
Metastasis Detection [90]
Organ Transplant Assessment [90]
Mutation Detection/Prediction [26],[90],[91]
VQA [38],[25],[22]
Report Generation [102],[22]
Survival Analysis [22]
Conversational Agent [38],[25]

In Table VI, a summary of downstream tasks is provided along with research works performing these tasks.

Figure 16: Aggregate patch-level features through an aggregator network to get a slide-level representation and use it for weakly supervised training.
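One common way to realize such an aggregator is attention-based pooling, sketched below. This is an illustrative assumption rather than the aggregator of any specific surveyed model: the attention vector would normally be learned, and the names and toy features are hypothetical.

```python
import math

def attention_pool(patch_features, attention_vector):
    """Softmax-weighted average of patch features, giving one slide-level vector."""
    scores = [sum(a * f for a, f in zip(attention_vector, feat))
              for feat in patch_features]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(patch_features[0])
    slide = [sum(w * feat[d] for w, feat in zip(weights, patch_features))
             for d in range(dim)]
    return slide, weights

# Toy usage: three patch embeddings from a frozen encoder, one slide-level label.
patch_features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
slide_feature, weights = attention_pool(patch_features, attention_vector=[2.0, 0.0])
```

The slide-level vector can then be passed to a classifier trained with the slide-level label only.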

Another aspect to consider is how the performance evaluation is conducted. Among the surveyed articles, three different strategies are employed.

  1. Zero-Shot Evaluation: In zero-shot evaluation, the pre-trained model is used directly in a downstream task, without probing or fine-tuning it with any samples of the downstream dataset. This provides a direct assessment of the representation learned in the pre-training phase, i.e. it evaluates the quality of the embeddings generated by the pre-trained model. This is the most common approach among the surveyed articles.

  2. Few-Shot Evaluation: In few-shot evaluation, the pre-trained model sees only a few examples from the downstream task dataset.

  3. Simple-Shot Evaluation: This is a unique variation of the few-shot evaluation methods which has only been explored in UNI.
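For a vision-language model, zero-shot classification is typically carried out by comparing an image embedding against text-prompt embeddings of the candidate classes. The sketch below assumes this setup; the class names and embedding values are hypothetical stand-ins for the outputs of frozen image and text encoders.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def zero_shot_classify(image_embedding, prompt_embeddings):
    """Pick the class whose text-prompt embedding is most similar to the image."""
    return max(prompt_embeddings,
               key=lambda name: cosine(image_embedding, prompt_embeddings[name]))

# Hypothetical embeddings; in practice they come from the frozen encoders.
prompts = {"tumor": [0.9, 0.1, 0.2], "normal": [0.1, 0.9, 0.3]}
prediction = zero_shot_classify([0.8, 0.2, 0.1], prompts)
```

No parameter is updated and no labeled downstream sample is used, so the result reflects only the quality of the pre-trained representation.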

III-E Framework, Benchmarking and Adaptation of FMs

There are several research works in CPath that do not directly propose an FM but introduce frameworks for FMs, provide benchmarks and comparisons between FMs, or adapt FMs for efficient training.

TABLE VII: Evaluation datasets and the associated models utilizing the dataset along with the data source
Dataset Used By Source
PanNuke [109] [40]
DigestPath [110] [40], [24]
WSSS4LUAD [111] [40],[38],[24]
SkinCancer [112] [39]
PatchCamelyon (PCam) [113] [88],[64],[23],[39]
MHIST [114] [88],[64],[23],[39]
BACH [115] [79],[90],[39]
SICAP, SICAPv2 [116] [24],[39]
Databiox [117] [39]
RenalCell [118] [39]
Osteo [119] [39]
CoNSeP [120] [23]
KIMIA Path24C [121] [40]
DHMC [122] [24],[90]
EBRAINS [123] [24],[90]
AGGC [124] [24],[90]
PANDA [125] [24],[90]
MSK-IMPACT [126] [102],[23]
SegPath [127] [90]
BRACS [128] [90]
UniToPatho [129] [90]
HunCRC [130] [90]
BreakHis [131] [79]
MSI-CRC & MSI-STAD [132, 133] [88],[64]
TIL-DET [134] [88],[64]
MIDOG [135] [23]
CAMELYON16 [136] [79],[92],[90]
CAMELYON17-WILDS [137] [90],[23]
Kather colon [138] [40],[79],[92]
Kather colon [138]: CRC100K [38],[24],[90],[88],[64]
Kather colon [138]: NCT-CRC-HE-100K [79],[23],[39]
Kather colon [138]: NCT-CRC-HE-100K-NONORM [23]
LC25000 [139] [38],[79]
LC25000 [139]: LC25000Colon [38],[39]
LC25000 [139]: LC25000Lung [38],[39]
TCGA Uniform [140] [90]
TCGA CRC-MSI [90],[23]
TCGA-TILs [141] [90],[23]
TCGA [19]
TCGA [19]: TCGA BRCA [24],[102],[79],[88]
TCGA [19]: TCGA RCC [24],[79],[88],[92]
TCGA [19]: TCGA NSCLC [24],[102],[79],[88],[92]
The listed data sources are shared through platforms such as Zenodo, GitHub and official challenge websites. The TCGA dataset needs to be downloaded from the official portal.

III-E1 Frameworks:

eva [71, 142] (https://github.com/kaiko-ai/eva) is a framework for VFMs in CPath which abstracts away much of the complexity of VFMs. In addition, it facilitates the reproducibility of VFMs for fair comparison and provides an interface to evaluate models on publicly available downstream datasets. Note that this is ongoing work and more downstream datasets and models are being added.

III-E2 Benchmarking:

Another category is benchmark analysis of FMs [67], [66], [143] in CPath. The work done in [67] analyzes different FMs and evaluates their performance on 8 datasets. The other work [66] is more recent and analyzes the performance of FMs on a large and diverse dataset collected from two medical centers. The FMs were benchmarked on 3 broad downstream tasks: disease detection, biomarker prediction and treatment outcome prediction.

III-E3 Adaptation of FMs:

Another category is the adaptation of existing FMs to carry out tasks such as low-resource fine-tuning [72] and multi-modal prompt-tuning [73]. The work carried out in [72] fine-tunes an FM with a single GPU and shows that it can outperform SOTA feature extractors. PathoTune [73] adapts a visual or pathological FM to downstream tasks using multi-modal prompts.

Other than the surveyed FMs, there are more recent FMs in CPath like mSTAR [144]. However, as it includes RNA-Seq data in addition to pathology reports, it falls outside the scope of this paper. Another foundation model we do not include is H-optimus-0 [145] (https://huggingface.co/bioptimus/H-optimus-0), developed by Bioptimus, as it is not associated with a publication.

TABLE VIII: Summary of Vision-Language Models in Computational Pathology
Model Architecture and Utilized Models/Frameworks Dataset Contribution Availability
TraP-VQA [146] Question feature extraction with BioELMo [147] and BiLSTM Image feature extraction with ResNet-50 and CNN layers Transformer encoder to fuse question and image feature, transformer decoder to upsample fused features PathVQA Performs VQA Provides interpretability in text domain with SHAP [148] Provides interpretability in image domain with GradCAM [149]
FSWC [150] CLIP for image and text feature extraction GPT-4 to generate instance-level and slide-level prompt groups which introduce pathological prior knowledge into the model Camelyon 16 TCGA-Lung In-house cervical cancer dataset Performs few-shot weakly supervised WSI classification Proposes a two-level (instance-level and slide-level) prompt learning MIL framework named TOP Introduces a prompt guided instance pooling to generate slide-level feature
ViLA-MIL [151] GPT-3.5 as the frozen LLM which generates visual descriptive text for WSI at 5x and 10x resolution based on class level question prompt ResNet-50 as the image encoder and corresponding CLIP transformer is used as the text encoder Prototype-guided patch decoder which generates slide features and context-guided text decoder which generates text-features TIHD-RCC (in-house dataset) TCGA-RCC TCGA-Lung Performs WSI classification with a MIL framework that utilizes information from 5x and 10x. In addition, incorporates information from LLM generated visual descriptive text for both 5x and 10x scale WSIs Introduces a novel prototype-guided patch decoder that progressively aggregates the patch features Introduces a context-guided text decoder to refine the text prompt features by leveraging multi-granular image contexts
PathM3 [152] ViT-G as the image encoder Query-based transformer to fuse image embeddings of WSIs and corresponding captions Frozen LLM FlanT5 [153] for caption generation PatchGastric [154] Performs WSI classification and captioning through a multi-modal, multi-task, multi-instance learning framework Leverages limited WSI captions during training Develops multi-task joint learning inspired from [155]
HLSS [156] ResNet-50 as visual encoder with CLIP pre-training Language encoder from CLIP A positive pairing module (PPM) which consists of three parallel reshape layer followed by MLP A cross-modal alignment (CAM) module which computes cosine similarity OpenSRH [157] TCGA Performs hierarchical (patient-slide-patch hierarchy) text-to-vision alignment for SSL Proposes a positive pairing module (PPM) and a cross-modal alignment module (CAM)
FiVE [158] ResNet as image encoder following [159] Pre-trained BioClinicalBERT [160] as text encoder and utilizing LoRA [161] for fine-tuning GPT-4 to generate fine-grained pathological descriptions based on raw pathology reports and expert inquiries An instance aggregator module consisting of a self-attention module and a cross-attention module that fuses image instance embeddings and prompt embeddings to create bag-level feature Camelyon 16 TCGA-Lung Incorporates a patch-sampling strategy to optimize the training efficiency of the model Proposes a task-specific fine-grained semantics (TFS) module Performs zero shot histological subtype classification, few shot classification and supervised classification with pre-training
CPLIP [162] MI-Zero [62] to identify related prompts based on an image by utilizing a similarity metric GPT-3 to categorize and transform prompts into five variations PLIP to match the transformed prompts with corresponding images from OpenPath CRC100K WSSS4LUAD PanNuke DigestPath SICAP Camelyon 16 TCGA-BRCA TCGA-RCC TCGA-NSCLC Creates a pathology-specific dictionary using a range of publicly available online glossaries Introduces a many-to-many contrastive learning instead of traditional one-to-one contrastive learning
HistGen [163] GPT-4 to clean and summarize reports curated from TCGA A ViT-L model is pre-trained with 200 million patches by utilizing DINOv2 and used as feature extractor UBC-OCEAN [164, 165] TUPAC16 [166] Camelyon 16 and 17 TCGA-BRCA TCGA-STAD TCGA-KIRC TCGA-KIRP TCGA-LUAD TCGA-COADREAD Performs WSI report generation, cancer subtyping and survival analysis Develops a local-global hierarchical encoder module to capture features at different levels (region-to-slide) Develops a cross-modal context-aware learning module to align and ensure interaction between different modalities
CLIPath [167] ResNet-50 as the vision encoder in CLIP 4 fully-connected layers after the vision encoder Transformer as the text encoder in CLIP PatchCamelyon (PCam) MHIST Develops residual feature connection (RFC) to fine-tune CLIP with a small number of trainable parameters; RFC also fuses the existing knowledge from pre-trained CLIP and the newly learned knowledge related to pathological tasks
MI-Zero [62] HistPathGPT: Unimodal pre-training of GPT-style transformer (same architecture as GPT 2-medium [100]) which is used as text encoder. Additionally, BioClinicalBERT [168] and PubMedBERT [169] are considered as text encoders. Existing encoders like CTransPath [94] (which is based on Swin Transformer [170]) or ViT-S (ImageNet initialization or pre-trained with MoCov3) 3 WSI datasets from Brigham and Women’s Hospital (in-house dataset) named Independent BRCA, Independent NSCLC, Independent RCC TCGA-BRCA TCGA-NSCLC TCGA-RCC Performs cancer subtyping with a VLM Develops a custom pathology domain-specific pre-trained text encoder called HistPathGPT Introduces 33,480 image-caption pair dataset
HistoGPT [171] The vision module is based on CTransPath BioGPT as the language module Image features sampled (with Perceiver Resampler [103]) from the vision modules are integrated into the language module via interleaved gated cross-attention (XATTN) blocks [172] In-house dataset with 2 cohorts. Munich cohort with l 6,000 histology samples and Münster cohort with 1,300 histology samples with all samples stained with H&E Performs histopathology report generation that provides a description of WSIs with high fidelity Provides interpretability map which shows which word in the generated report corresponds to which region in a WSI
PathAlign [173] PathSSL patch encoder which is based on ViT-S architecture and pre-trained following the approach in [99] with masked siamese networks [174] SSL scheme Q-Former from BLIP-2 [175] as WSI-encoder Among two variants PathAlign-R and PathAlign-G, PathAlign-G uses PaLM-2 S [176] as the frozen LLM In-house de-identified dataset (DS1) of WSIs paired with reports collected from a teaching hospital. The stain type for the WSIs are H&E and IHC TCGA Performs WSI classification, image-to-text retrieval and text generation
MI-Gen [34] ResNet and ViT (ImageNet initialization or hierarchical SSPT with HIPT [177]) as visual encoder CNN layer in the hierarchical position-aware module (PAM) Transformer as encoder-decoder (both vanilla transformer and Mem-Transformer [178]). Additionally, CNN-RNN [179] and att-LSTM [180] are used for comparison PathText dataset which was created using WSIs and pathology reports from TCGA-BRCA Performs pathology report generation from WSI Additionally performs cancer subtyping and biomarker prediction Develops a hierarchical position-aware module (PAM) Introduces PathText dataset which contains 9,009 WSI-text pairs
W2T [35] ResNet-50, ViT-S pre-trained with DINO or scheme followed in HIPT as visual encoders to extract patch features Text Encoders: PubMedBERT and BioClinicalBERT Co-attention mapping in the decoder to align visual and text features WSI-VQA Curates a WSI VQA dataset called WSI-VQA with 977 WSIs and 8,672 Q/A Performs VQA with the proposed WSI-VQA dataset Co-attention mapping between word embeddings and WSIs are used for interpretability
PathGen-CLIP [37] PathGen-LLaVA OpenCLIP framework is utilized to train a pathology-specific CLIP model The vision encoder of LLaVA v1.5 is replaced with the trained pathology-specific CLIP model PathGen-CLIP is combined with Vicuna LLM PathGen-1.6M (for pre-training) PatchCamelyon (PCam) CRC-100K SICAPv2 BACH Osteo SkinCancer LC25000 Camelyon 16 and 17 Curates a large-scale dataset PathGen-1.6M with 1.6 million image-caption pairs Develops a pathology-specific large multi-modal model with the capability to adapt to various downstream tasks
Quilt-Net [39] CLIP model (based on OpenCLIP framework) ViT-B as the image encoder GPT-2 and PubMedBERT as text encoder Quilt-1M (for pre-training) PatchCamelyon (PCam) NCT-CRC-HE-100K SICAPv2 Databiox BACH Osteo Renal Cell SkinCancer MHIST LC25000 Curates a large-scale dataset Quilt-1M Performs zero-shot, linear-probing task and also cross-modal retrieval task Provides cross-modal attention mask with Grad-CAM
Guevara et al. [181] CLIP model HIPT as image encoder A transformer encoder to project encoded image embedding to match the dimension of text embedding PubMedBERT as text encoder For the caption model pre-trained Bio GPT 2 is used GPT-3.5-turbo [182] to clean and refine captions. Additionally, the captions were machine-translated from two languages to English using the approach in [183] Self-curated dataset of WSIs and captions. WSIs are of colon polyps and biopsies. The captions contain possible 5 diagnostic labels normal, hyperplasia, low-grade dysplasia, high-grade dysplasia, or adenocarcinoma Performs WSI classification and caption generation through the use of weakly supervised transformer-based models
Hu et al. [184] ResNet-50 pre-trained with BYOL [185] SSL scheme as patch encoder Anchor-based module as WSI encoder based on [186] which uses kernel attention transformer Prompt-based text encoder which utilizes self-attention structure in BERT [187] GastricADC [188] In-house gastric dataset containing 3598 WSIs and reports dataset Performs cross-modal retrieval tasks which includes image-to-image retrieval, image-to-text retrieval, text-to-image retrieval and text-to-text retrieval Proposes a histopathology language-image representation learning framework Develops a prompt-based text representation learning scheme
PEMP [189] CLIP as the backbone of the proposed model A self-attention layer An attention pooling layer In-house datasets TCGA-CESC Performs survival analysis, metastasis detection and cancer subtype classification Proposes a mechanism to introduce vision and text prior knowledge in the designed prompts (both static and learnable) at both patch and slide levels Develops a self-attention layer called a lightweight messenger and an attention pooling layer called summary layer
HistoCap [190] and HistoCapBERT ResNet-18 to encode thumbnail image LSTM decoder according to [180] as caption generator Pre-trained HIPT encoder for WSI encoding Trainable attention layer after HIPT encoder In HistoCapBERT the LSTM decoder is replaced by BioClinicalBERT Dataset from Genotype-Tissue Expression (GTEx) portal Performs caption generation given a thumbnail image and a WSI image
PathCap [191] ResNet-18 to encode thumbnail image ResNet-18 to encode WSI patches Trainable attention layer after the ResNet-18 used to encode WSI patches LSTM as caption generation module Dataset from Genotype-Tissue Expression (GTEx) portal Performs caption generation of histopathology images using multi-scale view (thumbnail of WSIs and WSIs)
Elbedwehy et al. [192] Vision encoders include VGG [193], ResNet, PVT [194], Swin-Large [170], ConvNexT-Large [52] Language decoders and pre-trained word embedding models include LSTM, RNN, Bi-directional RNN, BioLinkBERT-Large [195] PatchGastric Performs caption generation of WSIs
PromptBio [196] PLIP as coarse-grained pathology instance classifier which filters out patches given a patch and a prompt (keeps patches associated with cancer-associated stroma) IBMIL [197] to encode patches followed by a fully-connected layer GPT-4 to generate pathology descriptions Series of transformer and MLP layer which performs biomarker prediction TCGA-CRC The Clinical Proteomic Tumor Analysis Consortium (CPTAC) CRC dataset Performs biomarker prediction
SGMT [198] ResNet-18 (pre-trained on ImageNet) as patch encoder Transformer encoder to encode the patch features Transformer decoder as caption generation module PatchGastric Performs caption generation of WSIs Proposes a novel mechanism to use subtype prediction as a guiding mechanism for the caption generation task Develops a random sampling and voting strategy to select patches
Tsuneki et al. [154] EfficientNetB3 [199] and DenseNet121 [200] (pre-trained on ImageNet) as image encoder Average and global pooling layer after the image encoders RNN decoder layer for caption generation PatchGastric Performs caption generation of WSIs Curates a captioned dataset of 262,777 patches from 991 WSIs

IV Vision-Language Models

In this section, the recent works in CPath with VLMs are outlined. The architecture, datasets used and contributions of each research work are listed in Table VIII.

First, different categorizations of VLMs are provided to give insight into how VLMs are utilized to solve pathology-specific tasks. Second, the common architectural components in VLMs and the adopted strategies are summarized. Lastly, a brief overview is given of models that do not solve a direct pathology task but perform other types of vision-language tasks.

IV-A Categorization of VLMs

A categorization of VLMs in CPath can be done based on the reason for using the language modality. Some research works solve tasks like caption generation ([191, 154, 198, 190, 192]) or VQA ([146, 35]), which necessitate the use of both vision and language modalities because of the nature of the tasks. For VQA or caption generation, the output generated by the model needs to be in language form, i.e. the model needs to perform a language task. On the other hand, other research works use the language modality as a source of semantic information to be injected into the information gained from the vision modality. This additional semantic information can significantly boost the performance of the model beyond what would be possible with a vision-only model. As an example, MI-Zero [62] performs cancer subtyping, which is not a language task, but utilizes pathology reports curated from an in-house dataset and TCGA as a source of semantic information.

Figure 17: Visualization of categorization of VLMs and VLFMs. There are no clear distinctive criteria for a model to be classified as a VLFM or a VLM.

Another categorization can be made by comparing the training approach with that of FMs. VLFMs differ from traditional VLMs, which focus on solving one or two vision-language tasks such as caption generation. However, as FMs become more prevalent, recent works are shifting towards VLFMs. For example, PathGen-CLIP [37] and Quilt-Net [39] take an FM-like approach to training and adapt the resulting models to different downstream tasks. On the other end of the spectrum, there are VLMs that solve only a single task. As shown in Fig. 17, VLMs and VLFMs should be viewed as a spectrum: as the number of tasks and the pre-training data size increase, a model moves away from VLMs and towards VLFMs.

TABLE IX: Summary of pre-training strategies in VLMs
Strategy | Reference
Performs pre-training from scratch (vision or language module) | [39], [37]
Utilizes domain-specific vision module (HIPT, CTransPath, etc.) or language module (PubMedBERT, BioClinicalBERT, etc.) | [196], [190], [184], [181], [35], [34], [171]
Initializes with out-of-domain encoders (ImageNet pre-training, pre-trained CLIP, etc.) | [154], [198], [192], [156]

Owing to this, some VLMs follow the FM approach of pre-training the vision or text module. However, because pre-training from scratch requires a large amount of data and compute, other VLMs use domain-specific vision modules such as HIPT, CTransPath and IBMIL, or language modules such as PubMedBERT and BioClinicalBERT. This gives the vision or language module an initial representation space which might not be as rich as one obtained by pre-training from scratch but still boosts performance.

Figure 18: Amount of semantic information available through different text sources.

Another categorization can be made by analyzing how the language prior is incorporated into the model. Most works use text labels or captions curated from different sources, which provide the least amount of information. Recent works such as FSWC [150], ViLA-MIL [151] and PEMP [189] prompt LLMs to increase the amount of semantic information that can be gained from text labels or captions: the LLMs are asked to generate descriptions of a particular class label or of the morphological or textural patterns in patches. On the other end of the spectrum are works that use the full content of pathology reports, which contain far more detail. However, as shown in Fig. 18, the difficulty of curating such text, and of doing so at a scale sufficient to reach pathologist level, also increases.
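
As a rough illustration of the LLM-prompting idea described above, the following minimal sketch builds one prompt per class label and per visual aspect. The function name `build_description_prompts`, the prompt wording and the two class labels are hypothetical and are not taken from FSWC, ViLA-MIL or PEMP; no LLM is actually called here.

```python
def build_description_prompts(class_names, aspects=("morphological", "textural")):
    """Return, for every class label, one prompt per visual aspect.

    The wording below is a hypothetical template, not the prompt used by
    any of the surveyed works.
    """
    prompts = {}
    for name in class_names:
        prompts[name] = [
            f"Describe the {aspect} patterns a pathologist would expect "
            f"to see in an H&E-stained patch of {name}."
            for aspect in aspects
        ]
    return prompts


# Example class labels (assumed for illustration only).
prompts = build_description_prompts(
    ["lung adenocarcinoma", "lung squamous cell carcinoma"]
)
for label, label_prompts in prompts.items():
    for p in label_prompts:
        print(f"[{label}] {p}")
```

The descriptions returned by an LLM for such prompts would then replace or augment the bare class label on the text side of the model.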

IV-B Architectural components for VLMs in CPath

Even though a wide variety of vision and language modules is used in VLMs, common structures can be identified. Fig. 19 outlines the common architectural components used in the different stages. In the pre-processing stage, LLMs such as GPT-4, GPT-3.5, GPT-3.5-turbo and GPT-3 are used to clean and refine captions or are prompted to generate descriptions. Works that use LLMs in the pre-processing stage include HistGen, FSWC, ViLA-MIL, FiVE, CPLIP and PromptBio.

Another key component of the architecture is the vision module, or encoder, which converts WSI patches to image embeddings. As shown in Fig. 19, vision encoders fall into three groups. The first group consists of vanilla CNNs with ImageNet initialization or some form of pre-training on pathology data. Note that most VLMs using these architectures are not recent works (with the exception of Elbedwehy et al. [192]). The second group, used by recent VLMs, consists of ViTs and their variants. The third group consists of pathology-specific encoders optimized for encoding WSI patches. These are well suited to learning a rich representation without the cost of pre-training from scratch.

For language encoders, most works use BERT variants (PubMedBERT, BioClinicalBERT) or GPT variants (BioGPT, GPT-2), and most of these are pre-trained on datasets from the biomedical domain.

Another type of language module is the caption/text generation module, which generates text sequences rather than encoding text into embeddings. Earlier works used RNN, LSTM and Transformer decoders, while recent works use LLMs.

The most common vision-language alignment or fusion module is the CLIP model and its variations. However, some works come up with custom approaches to align vision and language representations.
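
To make the alignment step more concrete, the following is a minimal sketch of a CLIP-style symmetric contrastive loss written in plain Python. It assumes a batch in which the i-th image embedding is paired with the i-th text embedding. The function names, the toy two-dimensional embeddings and the fixed temperature of 0.07 are assumptions made for illustration; none of the surveyed models is reproduced here, and practical implementations operate on batched tensors with a learnable temperature.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over paired image/text embeddings."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # Cosine similarity between every image and every text, scaled by 1/T.
    logits = [
        [sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
        for i in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image view
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Toy check: matched pairs should give a lower loss than swapped pairs.
aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

Minimizing this loss pulls each patch embedding towards its paired text embedding and pushes it away from the other texts in the batch, which is what allows class descriptions to be compared with image embeddings at inference time.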

Figure 19: Common modules used in different stages of VLMs. The last stage is the vision-language alignment or fusion stage where the vision and language embeddings generated from the earlier stages are combined.

IV-C Related works for VLMs in CPath

Several works do not directly use VLMs to solve a task in the pathology domain, but investigate or use VLMs for other reasons. One example is Thota et al. [201] (https://github.com/jaiprakash1824/VLM_Adv_Attack), which investigates the effect of a Projected Gradient Descent (PGD) adversarial perturbation attack on the PLIP [40] architecture. Another work, by Lucassen et al. [202] (https://github.com/RTLucassen/report_preprocessing), provides a pathology report processing workflow that can be used for VLMs in CPath.
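
For readers unfamiliar with PGD, the sketch below shows the basic L-infinity update on a toy linear logistic classifier: take a signed gradient step that increases the loss, then project the result back into an eps-ball around the original input. The toy model, the function names and the values of `eps`, `alpha` and `steps` are assumptions chosen for illustration; this is not the experimental setup of Thota et al. [201] and it does not involve PLIP. Steps that are common in practice, such as a random start and clipping to the valid pixel range, are omitted.

```python
import math


def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def score(x, w, b):
    """Linear score w.x + b of the toy classifier."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def loss_gradient(x, y, w, b):
    """Gradient w.r.t. x of the logistic loss -log(sigmoid(y * score))."""
    coeff = -y / (1.0 + math.exp(y * score(x, w, b)))
    return [coeff * wi for wi in w]


def pgd_attack(x, y, w, b, eps=0.1, alpha=0.02, steps=10):
    """L-infinity PGD: ascend the loss, then project into the eps-ball."""
    x_adv = list(x)
    for _ in range(steps):
        grad = loss_gradient(x_adv, y, w, b)
        # Signed gradient ascent step on the loss.
        x_adv = [xi + alpha * sign(g) for xi, g in zip(x_adv, grad)]
        # Projection: keep every coordinate within eps of the original input.
        x_adv = [min(max(xa, x0 - eps), x0 + eps) for xa, x0 in zip(x_adv, x)]
    return x_adv


# Toy input with true label y = +1 (all values assumed for illustration).
x, y, w, b = [0.5, -0.2], 1, [1.0, -2.0], 0.0
x_adv = pgd_attack(x, y, w, b)
print(score(x, w, b), score(x_adv, w, b))
```

The adversarial input stays within `eps` of the original in every coordinate while the score of the true class drops, which is the effect such robustness studies measure on real models.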

Another category of VLMs in CPath that does not directly solve pathology-specific tasks is text-guided diffusion models for image generation. However, these models can perform tasks like virtual stain transfer, which can later be used to solve a pathology-specific task. Another use case is extending dataset size with generated synthetic images. One such work is PathLDM [203] (https://github.com/cvlab-stonybrook/PathLDM), which performs text-to-image generation on the TCGA-BRCA dataset.

V Conclusion

In recent years, the number of works in CPath using FMs and VLMs has increased significantly, which indicates how CPath will evolve over the next few years. Many of these works invest significant computing resources in training FMs on massive pre-training datasets or devise novel strategies to inject more language prior knowledge into VLMs. In the near future, VLFMs, which bring together the benefits of both FMs and VLMs, are expected to be the dominant type of model. This review provides a comprehensive overview of these models to aid future researchers.

References

  • [1] C. L. Srinidhi, O. Ciga, and A. L. Martel, “Deep neural network models for computational histopathology: A survey,” Medical image analysis, vol. 67, p. 101813, 2021.
  • [2] S. Morales, K. Engan, and V. Naranjo, “Artificial intelligence in computational pathology–challenges and future directions,” Digital Signal Processing, vol. 119, p. 103196, 2021.
  • [3] A. Echle, N. T. Rindtorff, T. J. Brinker, T. Luedde, A. T. Pearson, and J. N. Kather, “Deep learning in cancer pathology: a new generation of clinical biomarkers,” British journal of cancer, vol. 124, no. 4, pp. 686–696, 2021.
  • [4] J. Van der Laak, G. Litjens, and F. Ciompi, “Deep learning in histopathology: the path to the clinic,” Nature medicine, vol. 27, no. 5, pp. 775–784, 2021.
  • [5] M. M. Abdelsamea, U. Zidan, Z. Senousy, M. M. Gaber, E. Rakha, and M. Ilyas, “A survey on artificial intelligence in histopathology image analysis,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 12, no. 6, p. e1474, 2022.
  • [6] I. Kim, K. Kang, Y. Song, and T.-J. Kim, “Application of artificial intelligence in pathology: trends and challenges,” Diagnostics, vol. 12, no. 11, p. 2794, 2022.
  • [7] A. Shmatko, N. Ghaffari Laleh, M. Gerstung, and J. N. Kather, “Artificial intelligence in histopathology: enhancing cancer research and clinical oncology,” Nature cancer, vol. 3, no. 9, pp. 1026–1038, 2022.
  • [8] A. Waqas, M. M. Bui, E. F. Glassy, I. El Naqa, P. Borkowski, A. A. Borkowski, and G. Rasool, “Revolutionizing digital pathology with the power of generative artificial intelligence and foundation models,” Laboratory Investigation, p. 100255, 2023.
  • [9] A. Asif, K. Rajpoot, S. Graham, D. Snead, F. Minhas, and N. Rajpoot, “Unleashing the potential of AI for pathology: challenges and recommendations,” The Journal of Pathology, vol. 260, no. 5, pp. 564–577, 2023.
  • [10] C. C. Atabansi, J. Nie, H. Liu, Q. Song, L. Yan, and X. Zhou, “A survey of transformer applications for histopathological image analysis: New developments and future directions,” BioMedical Engineering OnLine, vol. 22, no. 1, p. 96, 2023.
  • [11] M. Cooper, Z. Ji, and R. G. Krishnan, “Machine learning in computational histopathology: Challenges and opportunities,” Genes, Chromosomes and Cancer, vol. 62, no. 9, pp. 540–556, 2023.
  • [12] A. H. Song, G. Jaume, D. F. Williamson, M. Y. Lu, A. Vaidya, T. R. Miller, and F. Mahmood, “Artificial intelligence for digital and computational pathology,” Nature Reviews Bioengineering, vol. 1, no. 12, pp. 930–949, 2023.
  • [13] H. Xu, Q. Xu, F. Cong, J. Kang, C. Han, Z. Liu, A. Madabhushi, and C. Lu, “Vision transformers for computational histopathology,” IEEE Reviews in Biomedical Engineering, 2023.
  • [14] C. D. Bahadir, M. Omar, J. Rosenthal, L. Marchionni, B. Liechty, D. J. Pisapia, and M. R. Sabuncu, “Artificial intelligence applications in histopathology,” Nature Reviews Electrical Engineering, vol. 1, no. 2, pp. 93–108, 2024.
  • [15] C. McGenity, E. L. Clarke, C. Jennings, G. Matthews, C. Cartlidge, H. Freduah-Agyemang, D. D. Stocken, and D. Treanor, “Artificial intelligence in digital pathology: a systematic review and meta-analysis of diagnostic test accuracy,” npj Digital Medicine, vol. 7, no. 1, p. 114, 2024.
  • [16] M. S. Hosseini, B. E. Bejnordi, V. Q.-H. Trinh, L. Chan, D. Hasan, X. Li, S. Yang, T. Kim, H. Zhang, T. Wu et al., “Computational pathology: a survey review and the way forward,” Journal of Pathology Informatics, p. 100357, 2024.
  • [17] S. Brussee, G. Buzzanca, A. M. Schrader, and J. Kers, “Graph neural networks in histopathology: Emerging trends and future directions,” arXiv preprint arXiv:2406.12808, 2024.
  • [18] J. Cheng, “Applications of large language models in pathology,” Bioengineering, vol. 11, no. 4, p. 342, 2024.
  • [19] J. N. Weinstein, E. A. Collisson, G. B. Mills, K. R. Shaw, B. A. Ozenberger, K. Ellrott, I. Shmulevich, C. Sander, and J. M. Stuart, “The cancer genome atlas pan-cancer analysis project,” Nature genetics, vol. 45, no. 10, pp. 1113–1120, 2013.
  • [20] Y. Kang, Y. J. Kim, S. Park, G. Ro, C. Hong, H. Jang, S. Cho, W. J. Hong, D. U. Kang, J. Chun et al., “Development and operation of a digital platform for sharing pathology image data,” BMC Medical Informatics and Decision Making, vol. 21, pp. 1–8, 2021.
  • [21] E. Zimmermann, E. Vorontsov, J. Viret, A. Casson, M. Zelechowski, G. Shaikovski, N. Tenenholtz, J. Hall, T. Fuchs, N. Fusi et al., “Virchow 2: Scaling self-supervised mixed magnification models in pathology,” arXiv preprint arXiv:2408.00738, 2024.
  • [22] J. Ma, Z. Guo, F. Zhou, Y. Wang, Y. Xu, Y. Cai, Z. Zhu, C. Jin, Y. Lin, X. Jiang, A. Han et al., “Towards a generalizable pathology foundation model via unified knowledge distillation,” arXiv preprint arXiv:2407.18449, 2024.
  • [23] E. Vorontsov, A. Bozkurt, A. Casson, G. Shaikovski, M. Zelechowski, K. Severson, E. Zimmermann, J. Hall, N. Tenenholtz, N. Fusi et al., “A foundation model for clinical-grade computational pathology and rare cancers detection,” Nature Medicine, pp. 1–12, 2024.
  • [24] M. Y. Lu, B. Chen, D. F. Williamson, R. J. Chen, I. Liang, T. Ding, G. Jaume, I. Odintsov, L. P. Le, G. Gerber et al., “A visual-language foundation model for computational pathology,” Nature Medicine, vol. 30, no. 3, pp. 863–874, 2024.
  • [25] M. Y. Lu, B. Chen, D. F. Williamson, R. J. Chen, M. Zhao, A. K. Chow, K. Ikemura, A. Kim, D. Pouli, A. Patel et al., “A multimodal generative AI copilot for human pathology,” Nature, pp. 1–3, 2024.
  • [26] H. Xu, N. Usuyama, J. Bagga, S. Zhang, R. Rao, T. Naumann, C. Wong, Z. Gero, J. González, Y. Gu et al., “A whole-slide foundation model for digital pathology from real-world data,” Nature, pp. 1–8, 2024.
  • [27] J. Gui, T. Chen, J. Zhang, Q. Cao, Z. Sun, H. Luo, and D. Tao, “A survey on self-supervised learning: Algorithms, applications, and future trends,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
  • [28] J. Zhang, J. Huang, S. Jin, and S. Lu, “Vision-language models for vision tasks: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
  • [29] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning.   PMLR, 2021, pp. 8748–8763.
  • [30] S. Zhang, Y. Xu, N. Usuyama, H. Xu, J. Bagga, R. Tinn, S. Preston, R. Rao, M. Wei, N. Valluri et al., “BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs,” arXiv preprint arXiv:2303.00915, 2023.
  • [31] G. Jaume, A. Vaidya, R. J. Chen, D. F. Williamson, P. P. Liang, and F. Mahmood, “Modeling dense multimodal interactions between biological pathways and histology for survival prediction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 11 579–11 590.
  • [32] M. Gadermayr and M. Tschuchnig, “Multiple instance learning for digital pathology: A review of the state-of-the-art, limitations & future potential,” Computerized Medical Imaging and Graphics, p. 102337, 2024.
  • [33] D. Ahmedt-Aristizabal, M. A. Armin, S. Denman, C. Fookes, and L. Petersson, “A survey on graph-based deep learning for computational histopathology,” Computerized Medical Imaging and Graphics, vol. 95, p. 102027, 2022.
  • [34] P. Chen, H. Li, C. Zhu, S. Zheng, and L. Yang, “MI-Gen: Multiple instance generation of pathology reports for gigapixel whole-slide images,” arXiv preprint arXiv:2311.16480, 2023.
  • [35] P. Chen, C. Zhu, S. Zheng, H. Li, and L. Yang, “WSI-VQA: Interpreting whole slide images by generative visual question answering,” arXiv preprint arXiv:2407.05603, 2024.
  • [36] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
  • [37] Y. Sun, Y. Zhang, Y. Si, C. Zhu, Z. Shui, K. Zhang, J. Li, X. Lyu, T. Lin, and L. Yang, “PathGen-1.6M: 1.6 million pathology image-text pairs generation through multi-agent collaboration,” arXiv preprint arXiv:2407.00203, 2024.
  • [38] Y. Sun, C. Zhu, S. Zheng, K. Zhang, L. Sun, Z. Shui, Y. Zhang, H. Li, and L. Yang, “PathAsst: A generative foundation AI assistant towards artificial general intelligence of pathology,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 5, 2024, pp. 5034–5042.
  • [39] W. Ikezogwo, S. Seyfioglu, F. Ghezloo, D. Geva, F. Sheikh Mohammed, P. K. Anand, R. Krishna, and L. Shapiro, “Quilt-1M: One million image-text pairs for histopathology,” Advances in neural information processing systems, vol. 36, 2024.
  • [40] Z. Huang, F. Bianchi, M. Yuksekgonul, T. J. Montine, and J. Zou, “A visual–language foundation model for pathology image analysis using medical twitter,” Nature medicine, vol. 29, no. 9, pp. 2307–2316, 2023.
  • [41] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” Advances in neural information processing systems, vol. 36, 2024.
  • [42] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, I. Stoica, and E. P. Xing, “Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality,” March 2023. [Online]. Available: https://lmsys.org/blog/2023-03-30-vicuna/
  • [43] G. Ilharco, M. Wortsman, R. Wightman, C. Gordon, N. Carlini, R. Taori, A. Dave, V. Shankar, H. Namkoong, J. Miller, H. Hajishirzi, A. Farhadi, and L. Schmidt, “OpenCLIP,” Jul. 2021. [Online]. Available: https://doi.org/10.5281/zenodo.5143773
  • [44] Y. Sun, H. Wu, C. Zhu, S. Zheng, Q. Chen, K. Zhang, Y. Zhang, X. Lan, M. Zheng, J. Li et al., “PathMMU: A massive multimodal expert-level benchmark for understanding and reasoning in pathology,” arXiv preprint arXiv:2401.16355, 2024.
  • [45] C. Li, L. Li, H. Jiang, K. Weng, Y. Geng, L. Li, Z. Ke, Q. Li, M. Cheng, W. Nie et al., “YOLOv6: A single-stage object detection framework for industrial applications,” arXiv preprint arXiv:2209.02976, 2022.
  • [46] M. S. Seyfioglu, W. O. Ikezogwo, F. Ghezloo, R. Krishna, and L. Shapiro, “Quilt-LLaVA: Visual instruction tuning by extracting localized narratives from open-source histopathology videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 13 183–13 192.
  • [47] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman et al., “Laion-5B: An open large-scale dataset for training next generation image-text models,” Advances in Neural Information Processing Systems, vol. 35, pp. 25 278–25 294, 2022.
  • [48] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International conference on machine learning.   PMLR, 2023, pp. 28 492–28 518.
  • [49] D. Doukhan, J. Carrive, F. Vallet, A. Larcher, and S. Meignier, “An open-source speaker gender detection framework for monitoring gender equality,” in 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP).   IEEE, 2018, pp. 5214–5218.
  • [50] S. Nakatani, “langdetect,” https://pypi.org/project/langdetect/, 2015.
  • [51] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, “YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 7464–7475.
  • [52] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie, “A convnet for the 2020s,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 11 976–11 986.
  • [53] OpenAI, “Introducing ChatGPT,” 2023, available at https://openai.com/blog/chatgpt.
  • [54] H. Zhang, Y. He, X. Wu, P. Huang, W. Qin, F. Wang, J. Ye, X. Huang, Y. Liao, H. Chen et al., “PathNarratives: Data annotation for pathological human-AI collaborative diagnosis,” Frontiers in Medicine, vol. 9, p. 1070072, 2023.
  • [55] J. Gamper and N. Rajpoot, “Multiple instance captioning: Learning representations from histopathology textbooks and articles,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 16 549–16 559.
  • [56] T. Achakulvisut, D. Acuna, and K. Kording, “Pubmed parser: A python parser for pubmed open-access xml subset and medline xml dataset xml dataset,” Journal of Open Source Software, vol. 5, no. 46, p. 1979, 2020. [Online]. Available: https://doi.org/10.21105/joss.01979
  • [57] C. Clark and S. Divvala, “Pdffigures 2.0: Mining figures from research papers,” in Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries, 2016, pp. 143–152.
  • [58] X. He, Y. Zhang, L. Mou, E. Xing, and P. Xie, “PathVQA: 30000+ questions for medical visual question answering,” arXiv preprint arXiv:2003.10286, 2020.
  • [59] D. Klein and C. D. Manning, “Accurate unlexicalized parsing,” in Proceedings of the 41st annual meeting of the association for computational linguistics, 2003, pp. 423–430.
  • [60] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
  • [61] R. Luo, L. Sun, Y. Xia, T. Qin, S. Zhang, H. Poon, and T.-Y. Liu, “BioGPT: generative pre-trained transformer for biomedical text generation and mining,” Briefings in bioinformatics, vol. 23, no. 6, p. bbac409, 2022.
  • [62] M. Y. Lu, B. Chen, A. Zhang, D. F. Williamson, R. J. Chen, T. Ding, L. P. Le, Y.-S. Chuang, and F. Mahmood, “Visual language pretrained multiple instance zero-shot transfer for histopathology images,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 19 764–19 775.
  • [63] H. Liu, C. Li, Y. Li, and Y. J. Lee, “Improved baselines with visual instruction tuning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 26 296–26 306.
  • [64] J. Dippel, B. Feulner, T. Winterhoff, S. Schallenberg, G. Dernbach, A. Kunft, S. Tietz, P. Jurmeister, D. Horst, L. Ruff et al., “RudolfV: a foundation model by pathologists for pathologists,” arXiv preprint arXiv:2401.04079, 2024.
  • [65] D. Juyal, H. Padigela, C. Shah, D. Shenker, N. Harguindeguy, Y. Liu, B. Martin, Y. Zhang, M. Nercessian, M. Markey et al., “PLUTO: Pathology-universal transformer,” arXiv preprint arXiv:2405.07905, 2024.
  • [66] G. Campanella, S. Chen, R. Verma, J. Zeng, A. Stock, M. Croken, B. Veremis, A. Elmas, K.-l. Huang, R. Kwan et al., “A clinical benchmark of public self-supervised pathology foundation models,” arXiv preprint arXiv:2407.06508, 2024.
  • [67] S. Alfasly, P. Nejat, S. Hemati, J. Khan, I. Lahr, A. Alsaafin, A. Shafique, N. Comfere, D. Murphree, C. Meroueh et al., “Foundation models for histopathology—fanfare or flair,” Mayo Clinic Proceedings: Digital Health, vol. 2, no. 1, pp. 165–174, 2024.
  • [68] S. Zheng, X. Cui, Y. Sun, J. Li, H. Li, Y. Zhang, P. Chen, X. Jing, Z. Ye, and L. Yang, “Benchmarking pathclip for pathology image analysis,” Journal of Imaging Informatics in Medicine, pp. 1–17, 2024.
  • [69] W. Aswolinskiy, M. Paulikat, and C. Aichmueller, “Impact of layer selection in histopathology foundation models on downstream task performance,” in Medical Imaging with Deep Learning, 2024.
  • [70] M. Mallya, A. K. Mirabadi, H. Farahani, and A. Bashashati, “Benchmarking histopathology foundation models for ovarian cancer bevacizumab treatment response prediction from whole slide images,” arXiv preprint arXiv:2407.20596, 2024.
  • [71] N. Aben, E. D. de Jong, I. Gatopoulos, N. Känzig, M. Karasikov, A. Lagré, R. Moser, J. van Doorn, F. Tang et al., “Towards large-scale training of pathology foundation models,” arXiv preprint arXiv:2404.15217, 2024.
  • [72] B. Roth, V. Koch, S. J. Wagner, J. A. Schnabel, C. Marr, and T. Peng, “Low-resource finetuning of foundation models beats state-of-the-art in histopathology,” arXiv preprint arXiv:2401.04720, 2024.
  • [73] J. Lu, F. Yan, X. Zhang, Y. Gao, and S. Zhang, “PathoTune: Adapting visual foundation model to pathological specialists,” arXiv preprint arXiv:2403.16497, 2024.
  • [74] D. Ferber, G. Wölflein, I. C. Wiest, M. Ligero, S. Sainath, N. G. Laleh, O. S. El Nahhas, G. Müller-Franzes, D. Jäger, D. Truhn et al., “In-context learning enables multimodal large language models to classify cancer pathology images,” arXiv preprint arXiv:2403.07407, 2024.
  • [75] C. Yin, S. Liu, K. Zhou, V. W.-S. Wong, and P. C. Yuen, “Prompting vision foundation models for pathology image analysis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 11 292–11 301.
  • [76] Y. Zhang, J. Gao, M. Zhou, X. Wang, Y. Qiao, S. Zhang, and D. Wang, “Text-guided foundation model adaptation for pathological image classification,” in International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2023, pp. 272–282.
  • [77] W. Tang, F. Zhou, S. Huang, X. Zhu, Y. Zhang, and B. Liu, “Feature re-embedding: Towards foundation model-level performance in computational pathology,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 11 343–11 352.
  • [78] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
  • [79] Z. Yang, T. Wei, Y. Liang, X. Yuan, R. Gao, Y. Xia, J. Zhou, Y. Zhang, and Z. Yu, “A foundation model for generalizable cancer diagnosis and survival prediction from histopathological images,” bioRxiv, pp. 2024–05, 2024.
  • [80] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
  • [81] M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self-supervised vision transformers,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 9650–9660.
  • [82] M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby et al., “Dinov2: Learning robust visual features without supervision,” arXiv preprint arXiv:2304.07193, 2023.
  • [83] X. Chen, S. Xie, and K. He, “An empirical study of training self-supervised vision transformers,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 9640–9649.
  • [84] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 16 000–16 009.
  • [85] J. Zhou, C. Wei, H. Wang, W. Shen, C. Xie, A. Yuille, and T. Kong, “iBOT: Image BERT pre-training with online tokenizer,” arXiv preprint arXiv:2111.07832, 2021.
  • [86] J. Yu, Z. Wang, V. Vasudevan, L. Yeung, M. Seyedhosseini, and Y. Wu, “CoCa: Contrastive captioners are image-text foundation models,” arXiv preprint arXiv:2205.01917, 2022.
  • [87] L. Beyer, P. Izmailov, A. Kolesnikov, M. Caron, S. Kornblith, X. Zhai, M. Minderer, M. Tschannen, I. Alabdulmohsin, and F. Pavetic, “Flexivit: One model for all patch sizes,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14 496–14 506.
  • [88] D. Nechaev, A. Pchelnikov, and E. Ivanova, “Hibou: A family of foundational vision transformers for pathology,” arXiv preprint arXiv:2406.05074, 2024.
  • [89] Z. Peng, L. Dong, H. Bao, Q. Ye, and F. Wei, “BEiT v2: Masked image modeling with vector-quantized visual tokenizers,” arXiv preprint arXiv:2208.06366, 2022.
  • [90] R. J. Chen, T. Ding, M. Y. Lu, D. F. Williamson, G. Jaume, A. H. Song, B. Chen, A. Zhang, D. Shao, M. Shaban et al., “Towards a general-purpose foundation model for computational pathology,” Nature Medicine, vol. 30, no. 3, pp. 850–862, 2024.
  • [91] G. Campanella, C. Vanderbilt, and T. Fuchs, “Computational pathology at health system scale–self-supervised foundation models from billions of images,” in AAAI 2024 Spring Symposium on Clinical Foundation Models, 2024.
  • [92] S. Hua, F. Yan, T. Shen, and X. Zhang, “PathoDuet: Foundation models for pathological slide analysis of H&E and IHC stains,” arXiv preprint arXiv:2312.09894, 2023.
  • [93] A. Filiot, R. Ghermi, A. Olivier, P. Jacob, L. Fidon, A. Mac Kain, C. Saillard, and J.-B. Schiratti, “Scaling self-supervised learning for histopathology with masked image modeling,” medRxiv, pp. 2023–07, 2023.
  • [94] X. Wang, S. Yang, J. Zhang, M. Wang, J. Zhang, W. Yang, J. Huang, and X. Han, “Transformer-based unsupervised contrastive learning for histopathological image classification,” Medical image analysis, vol. 81, p. 102559, 2022.
  • [95] J. Ding, S. Ma, L. Dong, X. Zhang, S. Huang, W. Wang, N. Zheng, and F. Wei, “LongNet: Scaling transformers to 1,000,000,000 tokens,” arXiv preprint arXiv:2307.02486, 2023.
  • [96] M. Kang, H. Song, S. Park, D. Yoo, and S. Pereira, “Benchmarking self-supervised learning on diverse pathology datasets,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 3344–3354.
  • [97] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin, “Unsupervised learning of visual features by contrasting cluster assignments,” Advances in neural information processing systems, vol. 33, pp. 9912–9924, 2020.
  • [98] J. Zbontar, L. Jing, I. Misra, Y. LeCun, and S. Deny, “Barlow twins: Self-supervised learning via redundancy reduction,” in International conference on machine learning.   PMLR, 2021, pp. 12 310–12 320.
  • [99] J. Lai, F. Ahmed, S. Vijay, T. Jaroensri, J. Loo, S. Vyawahare, S. Agarwal, F. Jamil, Y. Matias, G. S. Corrado et al., “Domain-specific optimization and diverse evaluation of self-supervised models for histopathology,” arXiv preprint arXiv:2310.13259, 2023.
  • [100] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019.
  • [101] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
  • [102] G. Shaikovski, A. Casson, K. Severson, E. Zimmermann, Y. K. Wang, J. D. Kunz, J. A. Retamero, G. Oakley, D. Klimstra, C. Kanan et al., “Prism: A multi-modal generative foundation model for slide-level histopathology,” arXiv preprint arXiv:2405.10254, 2024.
  • [103] A. Jaegle, F. Gimeno, A. Brock, O. Vinyals, A. Zisserman, and J. Carreira, “Perceiver: General perception with iterative attention,” in International conference on machine learning.   PMLR, 2021, pp. 4651–4664.
  • [104] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 9729–9738.
  • [105] X. Chen, H. Fan, R. Girshick, and K. He, “Improved baselines with momentum contrastive learning,” arXiv preprint arXiv:2003.04297, 2020.
  • [106] X. Zhou, X. Zhang, C. Wu, Y. Zhang, W. Xie, and Y. Wang, “Knowledge-enhanced visual-language pretraining for computational pathology,” arXiv preprint arXiv:2404.09942, 2024.
  • [107] X. Wu, R. Xu, P. Wei, W. Qin, P. Huang, Z. Li, and L. Luo, “Pathinsight: Instruction tuning of multimodal datasets and models for intelligence assisted diagnosis in histopathology,” arXiv preprint arXiv:2408.07037, 2024.
  • [108] K. Chen, M. Liu, F. Yan, L. Ma, X. Shi, L. Wang, X. Wang, L. Zhu, Z. Wang, M. Zhou et al., “Cost-effective instruction learning for pathology vision and language analysis,” arXiv preprint arXiv:2407.17734, 2024.
  • [109] J. Gamper, N. Alemi Koohbanani, K. Benet, A. Khuram, and N. Rajpoot, “PanNuke: an open pan-cancer histology dataset for nuclei instance segmentation and classification,” in Digital Pathology: 15th European Congress, ECDP 2019, Warwick, UK, April 10–13, 2019, Proceedings 15.   Springer, 2019, pp. 11–19.
  • [110] Q. Da, X. Huang, Z. Li, Y. Zuo, C. Zhang, J. Liu, W. Chen, J. Li, D. Xu, Z. Hu et al., “DigestPath: A benchmark dataset with challenge review for the pathological detection and segmentation of digestive-system,” Medical Image Analysis, vol. 80, p. 102485, 2022.
  • [111] C. Han, X. Pan, L. Yan, H. Lin, B. Li, S. Yao, S. Lv, Z. Shi, J. Mai, J. Lin et al., “WSSS4LUAD: Grand challenge on weakly-supervised tissue semantic segmentation for lung adenocarcinoma,” arXiv preprint arXiv:2204.06455, 2022.
  • [112] K. Kriegsmann, F. Lobers, C. Zgorzelski, J. Kriegsmann, C. Janßen, R. R. Meliß, T. Muley, U. Sack, G. Steinbuss, and M. Kriegsmann, “Deep learning for the detection of anatomical tissue structures and neoplasms of the skin on scanned histopathological tissue sections,” Frontiers in Oncology, vol. 12, p. 1022967, 2022.
  • [113] B. S. Veeling, J. Linmans, J. Winkens, T. Cohen, and M. Welling, “Rotation equivariant CNNs for digital pathology,” in Medical Image Computing and Computer Assisted Intervention–MICCAI 2018: 21st International Conference, Granada, Spain, September 16-20, 2018, Proceedings, Part II 11.   Springer, 2018, pp. 210–218.
  • [114] J. Wei, A. Suriawinata, B. Ren, X. Liu, M. Lisovsky, L. Vaickus, C. Brown, M. Baker, N. Tomita, L. Torresani et al., “A petri dish for histopathology image analysis,” in International Conference on Artificial Intelligence in Medicine.   Springer, 2021, pp. 11–24.
  • [115] G. Aresta, T. Araújo, S. Kwok, S. S. Chennamsetty, M. Safwan, V. Alex, B. Marami, M. Prastawa, M. Chan, M. Donovan et al., “BACH: Grand challenge on breast cancer histology images,” Medical image analysis, vol. 56, pp. 122–139, 2019.
  • [116] J. Silva-Rodríguez, A. Colomer, M. A. Sales, R. Molina, and V. Naranjo, “Going deeper through the Gleason scoring scale: An automatic end-to-end system for histology prostate grading and cribriform pattern detection,” Computer methods and programs in biomedicine, vol. 195, p. 105637, 2020.
  • [117] H. Bolhasani, E. Amjadi, M. Tabatabaeian, and S. J. Jassbi, “A histopathological image dataset for grading breast invasive ductal carcinomas,” Informatics in Medicine Unlocked, vol. 19, p. 100341, 2020.
  • [118] O. Brummer, P. Pölönen, S. Mustjoki, and O. Brück, “Integrative analysis of histological textures and lymphocyte infiltration in renal cell carcinoma using deep learning,” bioRxiv, pp. 2022–08, 2022.
  • [119] H. B. Arunachalam, R. Mishra, O. Daescu, K. Cederberg, D. Rakheja, A. Sengupta, D. Leonard, R. Hallac, and P. Leavey, “Viable and necrotic tumor assessment from whole slide images of osteosarcoma using machine-learning and deep-learning models,” PLoS ONE, vol. 14, no. 4, p. e0210706, 2019.
  • [120] S. Graham, Q. D. Vu, S. E. A. Raza, A. Azam, Y. W. Tsang, J. T. Kwak, and N. Rajpoot, “HoVer-Net: Simultaneous segmentation and classification of nuclei in multi-tissue histology images,” Medical image analysis, vol. 58, p. 101563, 2019.
  • [121] S. Shafiei, M. Babaie, S. Kalra, and H. R. Tizhoosh, “Colored Kimia Path24 dataset: configurations and benchmarks with deep embeddings,” arXiv preprint arXiv:2102.07611, 2021.
  • [122] J. W. Wei, L. J. Tafe, Y. A. Linnik, L. J. Vaickus, N. Tomita, and S. Hassanpour, “Pathologist-level classification of histologic patterns on resected lung adenocarcinoma slides with deep neural networks,” Scientific reports, vol. 9, no. 1, p. 3358, 2019.
  • [123] T. Roetzer-Pejrimovsky, A.-C. Moser, B. Atli, C. C. Vogel, P. A. Mercea, R. Prihoda, E. Gelpi, C. Haberler, R. Höftberger, J. A. Hainfellner et al., “The digital brain tumour atlas, an open histopathology resource,” Scientific Data, vol. 9, no. 1, p. 55, 2022.
  • [124] X. Huo, O. KokHaur, K. W. Lau, L. Gole, D. Young, C. L. Tan, C. Zhang, Y. Zhang, X. Zhu, L. Li et al., “Comprehensive AI model development for Gleason grading: From scanning, cloud-based annotation to pathologist-AI interaction,” 2022.
  • [125] W. Bulten, K. Kartasalo, P.-H. C. Chen, P. Ström, H. Pinckaers, K. Nagpal, Y. Cai, D. F. Steiner, H. Van Boven, R. Vink et al., “Artificial intelligence for diagnosis and Gleason grading of prostate cancer: the PANDA challenge,” Nature medicine, vol. 28, no. 1, pp. 154–163, 2022.
  • [126] D. T. Cheng, T. N. Mitchell, A. Zehir, R. H. Shah, R. Benayed, A. Syed, R. Chandramohan, Z. Y. Liu, H. H. Won, S. N. Scott et al., “Memorial Sloan Kettering-integrated mutation profiling of actionable cancer targets (MSK-IMPACT): a hybridization capture-based next-generation sequencing clinical assay for solid tumor molecular oncology,” The Journal of molecular diagnostics, vol. 17, no. 3, pp. 251–264, 2015.
  • [127] D. Komura, T. Onoyama, K. Shinbo, H. Odaka, M. Hayakawa, M. Ochi, R. R. Herdiantoputri, H. Endo, H. Katoh, T. Ikeda et al., “Restaining-based annotation for cancer histology segmentation to overcome annotation-related limitations among pathologists,” Patterns, vol. 4, no. 2, 2023.
  • [128] N. Brancati, A. M. Anniciello, P. Pati, D. Riccio, G. Scognamiglio, G. Jaume, G. De Pietro, M. Di Bonito, A. Foncubierta, G. Botti et al., “BRACS: A dataset for breast carcinoma subtyping in H&E histology images,” Database, vol. 2022, p. baac093, 2022.
  • [129] C. A. Barbano, D. Perlo, E. Tartaglione, A. Fiandrotti, L. Bertero, P. Cassoni, and M. Grangetto, “UniToPatho, a labeled histopathological dataset for colorectal polyps classification and adenoma dysplasia grading,” in 2021 IEEE International Conference on Image Processing (ICIP).   IEEE, 2021, pp. 76–80.
  • [130] B. Á. Pataki, A. Olar, D. Ribli, A. Pesti, E. Kontsek, B. Gyöngyösi, Á. Bilecz, T. Kovács, K. A. Kovács, Z. Kramer et al., “HunCRC: annotated pathological slides to enhance deep learning applications in colorectal cancer screening,” Scientific Data, vol. 9, no. 1, p. 370, 2022.
  • [131] F. A. Spanhol, L. S. Oliveira, C. Petitjean, and L. Heutte, “A dataset for breast cancer histopathological image classification,” IEEE transactions on biomedical engineering, vol. 63, no. 7, pp. 1455–1462, 2015.
  • [132] J. R. Kaczmarzyk, R. Gupta, T. M. Kurc, S. Abousamra, J. H. Saltz, and P. K. Koo, “ChampKit: A framework for rapid evaluation of deep neural networks for patch-based histopathology classification,” Computer methods and programs in biomedicine, vol. 239, p. 107631, 2023.
  • [133] J. N. Kather, A. T. Pearson, N. Halama, D. Jäger, J. Krause, S. H. Loosen, A. Marx, P. Boor, F. Tacke, U. P. Neumann et al., “Deep learning can predict microsatellite instability directly from histology in gastrointestinal cancer,” Nature medicine, vol. 25, no. 7, pp. 1054–1056, 2019.
  • [134] S. Abousamra, R. Gupta, L. Hou, R. Batiste, T. Zhao, A. Shankar, A. Rao, C. Chen, D. Samaras, T. Kurc et al., “Deep learning-based mapping of tumor infiltrating lymphocytes in whole slide images of 23 types of cancer,” Frontiers in oncology, vol. 11, p. 806603, 2022.
  • [135] M. Aubreville, F. Wilm, N. Stathonikos, K. Breininger, T. A. Donovan, S. Jabari, M. Veta, J. Ganz, J. Ammeling, P. J. van Diest et al., “A comprehensive multi-domain dataset for mitotic figure detection,” Scientific data, vol. 10, no. 1, p. 484, 2023.
  • [136] B. E. Bejnordi, M. Veta, P. J. Van Diest, B. Van Ginneken, N. Karssemeijer, G. Litjens, J. A. Van Der Laak, M. Hermsen, Q. F. Manson, M. Balkenhol et al., “Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer,” JAMA, vol. 318, no. 22, pp. 2199–2210, 2017.
  • [137] P. W. Koh, S. Sagawa, H. Marklund, S. M. Xie, M. Zhang, A. Balsubramani, W. Hu, M. Yasunaga, R. L. Phillips, I. Gao et al., “WILDS: A benchmark of in-the-wild distribution shifts,” in International conference on machine learning.   PMLR, 2021, pp. 5637–5664.
  • [138] J. N. Kather, J. Krisam, P. Charoentong, T. Luedde, E. Herpel, C.-A. Weis, T. Gaiser, A. Marx, N. A. Valous, D. Ferber et al., “Predicting survival from colorectal cancer histology slides using deep learning: A retrospective multicenter study,” PLoS medicine, vol. 16, no. 1, p. e1002730, 2019.
  • [139] A. A. Borkowski, M. M. Bui, L. B. Thomas, C. P. Wilson, L. A. DeLand, and S. M. Mastorides, “Lung and colon cancer histopathological image dataset (LC25000),” arXiv preprint arXiv:1912.12142, 2019.
  • [140] D. Komura, A. Kawabe, K. Fukuta, K. Sano, T. Umezaki, H. Koda, R. Suzuki, K. Tominaga, M. Ochi, H. Konishi et al., “Universal encoding of pan-cancer histology by deep texture representations,” Cell Reports, vol. 38, no. 9, 2022.
  • [141] J. Saltz, R. Gupta, L. Hou, T. Kurc, P. Singh, V. Nguyen, D. Samaras, K. R. Shroyer, T. Zhao, R. Batiste et al., “Spatial organization and molecular correlation of tumor-infiltrating lymphocytes using deep learning on pathology images,” Cell reports, vol. 23, no. 1, pp. 181–193, 2018.
  • [142] I. Gatopoulos, N. Känzig, R. Moser, S. Otálora et al., “eva: Evaluation framework for pathology foundation models,” in Medical Imaging with Deep Learning, 2024.
  • [143] G. Wölflein, D. Ferber, A. R. Meneghetti, O. S. M. E. Nahhas, D. Truhn, Z. I. Carrero, D. J. Harrison, O. Arandjelović, and J. N. Kather, “Benchmarking pathology feature extractors for whole slide image classification,” 2024, arXiv:2311.11772 [cs.CV]. [Online]. Available: https://arxiv.org/abs/2311.11772
  • [144] Y. Xu, Y. Wang, F. Zhou, J. Ma, S. Yang, H. Lin, X. Wang, J. Wang, L. Liang, A. Han et al., “A multimodal knowledge-enhanced whole-slide pathology foundation model,” arXiv preprint arXiv:2407.15362, 2024.
  • [145] C. Saillard, R. Jenatton, F. Llinares-López, Z. Mariet, D. Cahané, E. Durand, and J.-P. Vert, “H-optimus-0,” 2024. [Online]. Available: https://github.com/bioptimus/releases/tree/main/models/h-optimus/v0
  • [146] U. Naseem, M. Khushi, and J. Kim, “Vision-language transformer for interpretable pathology visual question answering,” IEEE Journal of Biomedical and Health Informatics, vol. 27, no. 4, pp. 1681–1690, 2022.
  • [147] Q. Jin, B. Dhingra, W. Cohen, and X. Lu, “Probing biomedical embeddings from language models,” in Proceedings of the 3rd Workshop on Evaluating Vector Space Representations for NLP, 2019, pp. 82–89.
  • [148] S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,” Advances in neural information processing systems, vol. 30, 2017.
  • [149] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-CAM: Visual explanations from deep networks via gradient-based localization,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 618–626.
  • [150] L. Qu, K. Fu, M. Wang, Z. Song et al., “The rise of AI language pathologists: Exploring two-level prompt learning for few-shot weakly-supervised whole slide image classification,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [151] J. Shi, C. Li, T. Gong, Y. Zheng, and H. Fu, “ViLa-MIL: Dual-scale vision-language multiple instance learning for whole slide image classification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 11 248–11 258.
  • [152] Q. Zhou, W. Zhong, Y. Guo, M. Xiao, H. Ma, and J. Huang, “PathM3: A multimodal multi-task multiple instance learning framework for whole slide image classification and captioning,” arXiv preprint arXiv:2403.08967, 2024.
  • [153] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma et al., “Scaling instruction-finetuned language models,” Journal of Machine Learning Research, vol. 25, no. 70, pp. 1–53, 2024.
  • [154] M. Tsuneki and F. Kanavati, “Inference of captions from histopathological patches,” in International Conference on Medical Imaging with Deep Learning.   PMLR, 2022, pp. 1235–1250.
  • [155] X. Wang, Y. Peng, L. Lu, Z. Lu, and R. M. Summers, “TieNet: Text-image embedding network for common thorax disease classification and reporting in chest X-rays,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 9049–9058.
  • [156] H. Watawana, K. Ranasinghe, T. Mahmood, M. Naseer, S. Khan, and F. S. Khan, “Hierarchical text-to-vision self supervised alignment for improved histopathology representation learning,” arXiv preprint arXiv:2403.14616, 2024.
  • [157] C. Jiang, A. Chowdury, X. Hou, A. Kondepudi, C. Freudiger, K. Conway, S. Camelo-Piragua, D. Orringer, H. Lee, and T. Hollon, “OpenSRH: optimizing brain tumor surgery using intraoperative stimulated Raman histology,” Advances in neural information processing systems, vol. 35, pp. 28 502–28 516, 2022.
  • [158] H. Li, Y. Chen, Y. Chen, R. Yu, W. Yang, L. Wang, B. Ding, and Y. Han, “Generalizable whole slide image classification with fine-grained visual-semantic interaction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 11 398–11 407.
  • [159] B. Li, Y. Li, and K. W. Eliceiri, “Dual-stream multiple instance learning network for whole slide image classification with self-supervised contrastive learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 14 318–14 328.
  • [160] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz et al., “HuggingFace’s Transformers: State-of-the-art natural language processing,” arXiv preprint arXiv:1910.03771, 2019.
  • [161] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” arXiv preprint arXiv:2106.09685, 2021.
  • [162] S. Javed, A. Mahmood, I. I. Ganapathi, F. A. Dharejo, N. Werghi, and M. Bennamoun, “CPLIP: Zero-shot learning for histopathology with comprehensive vision-language alignment,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 11 450–11 459.
  • [163] Z. Guo, J. Ma, Y. Xu, Y. Wang, L. Wang, and H. Chen, “HistGen: Histopathology report generation via local-global feature encoding and cross-modal context interaction,” arXiv preprint arXiv:2403.05396, 2024.
  • [164] M. Asadi-Aghbolaghi, H. Farahani, A. Zhang, A. Akbari, S. Kim, A. Chow, S. Dane, O. C. Consortium, O. Consortium, D. G Huntsman et al., “Machine learning-driven histotype diagnosis of ovarian carcinoma: Insights from the OCEAN AI challenge,” medRxiv, pp. 2024–04, 2024.
  • [165] H. Farahani, J. Boschman, D. Farnell, A. Darbandsari, A. Zhang, P. Ahmadvand, S. J. Jones, D. Huntsman, M. Köbel, C. B. Gilks et al., “Deep learning-based histotype diagnosis of ovarian carcinoma whole-slide pathology images,” Modern Pathology, vol. 35, no. 12, pp. 1983–1990, 2022.
  • [166] M. Veta, Y. J. Heng, N. Stathonikos, B. E. Bejnordi, F. Beca, T. Wollmann, K. Rohr, M. A. Shah, D. Wang, M. Rousson et al., “Predicting breast tumor proliferation from whole-slide images: the TUPAC16 challenge,” Medical image analysis, vol. 54, pp. 111–121, 2019.
  • [167] Z. Lai, Z. Li, L. C. Oliveira, J. Chauhan, B. N. Dugger, and C.-N. Chuah, “CLIPath: Fine-tune CLIP with visual feature fusion for pathology image analysis towards minimizing data collection efforts,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 2374–2380.
  • [168] E. Alsentzer, J. R. Murphy, W. Boag, W.-H. Weng, D. Jin, T. Naumann, and M. McDermott, “Publicly available clinical BERT embeddings,” arXiv preprint arXiv:1904.03323, 2019.
  • [169] Y. Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu, T. Naumann, J. Gao, and H. Poon, “Domain-specific language model pretraining for biomedical natural language processing,” ACM Transactions on Computing for Healthcare (HEALTH), vol. 3, no. 1, pp. 1–23, 2021.
  • [170] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 10 012–10 022.
  • [171] M. Tran, P. Schmidle, S. J. Wagner, V. Koch, B. Novotny, V. Lupperger, A. Feuchtinger, A. Böhner, R. Kaczmarczyk, T. Biedermann et al., “Generating clinical-grade pathology reports from gigapixel whole slide images with HistoGPT,” medRxiv, pp. 2024–03, 2024.
  • [172] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds et al., “Flamingo: a visual language model for few-shot learning,” Advances in neural information processing systems, vol. 35, pp. 23 716–23 736, 2022.
  • [173] F. Ahmed, A. Sellergren, L. Yang, S. Xu, B. Babenko, A. Ward, N. Olson, A. Mohtashamian, Y. Matias, G. S. Corrado et al., “PathAlign: A vision-language model for whole slide images in histopathology,” arXiv preprint arXiv:2406.19578, 2024.
  • [174] M. Assran, M. Caron, I. Misra, P. Bojanowski, F. Bordes, P. Vincent, A. Joulin, M. Rabbat, and N. Ballas, “Masked siamese networks for label-efficient learning,” in European Conference on Computer Vision.   Springer, 2022, pp. 456–473.
  • [175] J. Li, D. Li, S. Savarese, and S. Hoi, “BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” in International conference on machine learning.   PMLR, 2023, pp. 19 730–19 742.
  • [176] R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen et al., “PaLM 2 technical report,” arXiv preprint arXiv:2305.10403, 2023.
  • [177] R. J. Chen, C. Chen, Y. Li, T. Y. Chen, A. D. Trister, R. G. Krishnan, and F. Mahmood, “Scaling vision transformers to gigapixel images via hierarchical self-supervised learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16 144–16 155.
  • [178] Z. Chen, Y. Song, T.-H. Chang, and X. Wan, “Generating radiology reports via memory-driven transformer,” arXiv preprint arXiv:2010.16056, 2020.
  • [179] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3156–3164.
  • [180] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in International conference on machine learning.   PMLR, 2015, pp. 2048–2057.
  • [181] B. C. Guevara, N. Marini, S. Marchesin, W. Aswolinskiy, R.-J. Schlimbach, D. Podareanu, and F. Ciompi, “Caption generation from histopathology whole-slide images using pre-trained transformers,” in Medical Imaging with Deep Learning, short paper track, 2023.
  • [182] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language models to follow instructions with human feedback,” Advances in neural information processing systems, vol. 35, pp. 27 730–27 744, 2022.
  • [183] J. Tiedemann and S. Thottingal, “OPUS-MT–building open translation services for the world,” in Proceedings of the 22nd annual conference of the European Association for Machine Translation, 2020, pp. 479–480.
  • [184] D. Hu, Z. Jiang, J. Shi, F. Xie, K. Wu, K. Tang, M. Cao, J. Huai, and Y. Zheng, “Histopathology language-image representation learning for fine-grained digital pathology cross-modal retrieval,” Medical Image Analysis, vol. 95, p. 103163, 2024.
  • [185] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar et al., “Bootstrap your own latent: A new approach to self-supervised learning,” Advances in neural information processing systems, vol. 33, pp. 21 271–21 284, 2020.
  • [186] Y. Zheng, J. Li, J. Shi, F. Xie, J. Huai, M. Cao, and Z. Jiang, “Kernel attention transformer for histopathology whole slide image analysis and assistant cancer diagnosis,” IEEE Transactions on Medical Imaging, vol. 42, no. 9, pp. 2726–2739, 2023.
  • [187] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
  • [188] H. Kervadec, J. Bouchtiba, C. Desrosiers, E. Granger, J. Dolz, and I. Ayed, “International conference on medical imaging with deep learning,” 2019.
  • [189] L. Qu, D. Yang, D. Huang, Q. Guo, R. Luo, S. Zhang, and X. Wang, “Pathology-knowledge enhanced multi-instance prompt learning for few-shot whole slide image classification,” arXiv preprint arXiv:2407.10814, 2024.
  • [190] S. Sengupta and D. E. Brown, “Automatic report generation for histopathology images using pre-trained vision transformers,” arXiv preprint arXiv:2311.06176, 2023.
  • [191] R. Zhang, C. Weber, R. Grossman, and A. A. Khan, “Evaluating and interpreting caption prediction for histopathology images,” in Machine Learning for Healthcare Conference.   PMLR, 2020, pp. 418–435.
  • [192] S. Elbedwehy, T. Medhat, T. Hamza, and M. F. Alrahmawy, “Enhanced descriptive captioning model for histopathological patches,” Multimedia Tools and Applications, vol. 83, no. 12, pp. 36 645–36 664, 2024.
  • [193] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [194] W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao, “PVT v2: Improved baselines with pyramid vision transformer,” Computational Visual Media, vol. 8, no. 3, pp. 415–424, 2022.
  • [195] M. Yasunaga, J. Leskovec, and P. Liang, “LinkBERT: Pretraining language models with document links,” arXiv preprint arXiv:2203.15827, 2022.
  • [196] L. Zhang, B. Yun, X. Xie, Q. Li, X. Li, and Y. Wang, “Prompting whole slide image based genetic biomarker prediction,” arXiv preprint arXiv:2407.09540, 2024.
  • [197] T. Lin, Z. Yu, H. Hu, Y. Xu, and C.-W. Chen, “Interventional bag multi-instance learning on whole-slide pathological images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 19 830–19 839.
  • [198] W. Qin, R. Xu, P. Huang, X. Wu, H. Zhang, and L. Luo, “What a whole slide image can tell? subtype-guided masked transformer for pathological image captioning,” arXiv preprint arXiv:2310.20607, 2023.
  • [199] M. Tan and Q. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” in International conference on machine learning.   PMLR, 2019, pp. 6105–6114.
  • [200] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4700–4708.
  • [201] P. Thota, J. Veerla, P. Guttikonda, M. S. Nasr, S. Nilizadeh, and J. M. Luber, “Demonstration of an adversarial attack against a multimodal vision language model for pathology imaging,” in 21st IEEE International Symposium on Biomedical Imaging (ISBI 2024).   IEEE, 2024.
  • [202] R. Lucassen, T. van de Luijtgaarden, S. Moonemans, W. Blokx, and M. Veta, “Preprocessing pathology reports for vision-language model development,” in MICCAI Workshop on Computational Pathology with Multimodal Data (COMPAYL), 2024.
  • [203] S. Yellapragada, A. Graikos, P. Prasanna, T. Kurc, J. Saltz, and D. Samaras, “PathLDM: Text conditioned latent diffusion model for histopathology,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 5182–5191.