Transformer-based Single-Cell Language Model: A Survey

Wei Lan; Guohang He, The Guangxi Key Laboratory of Multimedia Communications and Network Technology, School of Computer, Electronic and Information, Guangxi University, Nanning, China; Mingyang Liu, The School of Computer and Electronic Information, Guangxi University, Nanning, China; Qingfeng Chen, The School of Computer, Electronic and Information, Guangxi University, Nanning, China; Junyue Cao, The College of Life Science and Technology, Guangxi University, Nanning, China; and Wei Peng, the Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, China

(2018)

Abstract.

Transformers have achieved remarkable success in natural language processing owing to their outstanding parallel processing capabilities and highly flexible attention mechanism. In addition, a growing number of transformer-based studies have been proposed to model single-cell data. In this review, we systematically summarize transformer-based single-cell language models and their applications. First, we provide a detailed introduction to the structure and principles of transformers. Then, we review single-cell language models and large language models for single-cell data analysis. Moreover, we explore the datasets and applications of single-cell language models in downstream tasks such as batch correction, cell clustering, cell type annotation, gene regulatory network inference and perturbation response. Further, we discuss the challenges of single-cell language models and provide promising research directions. We hope this review will serve as an up-to-date reference for researchers interested in single-cell language models.

Language Model; transformer; deep learning; single-cell data


1. Introduction

Single-cell research has shown tremendous potential across a variety of fields including genetics, immunology and oncology. By utilizing single-cell RNA sequencing data for cluster analysis and the identification of cell subtypes, it is possible to accurately categorize cell populations and reveal crucial information about cell interactions and the structure of tissues (lan2023multiview, ). Exploring gene expression, gene function and gene-gene interaction at the single-cell level helps to unveil the deep mechanisms of cellular heterogeneity within tissues (unveiling, ; landeep, ). Single-cell research is critically important for understanding fundamental biological processes and provides significant insights for the diagnosis of diseases (12, ). Single-cell data are usually large and high-dimensional and contain complex information. Moreover, single-cell data originating from the same tissue are heterogeneous.

In the early stages, traditional machine learning methods, such as n-gram (ngram, ) and Hidden Markov Models (HMM) (hmm, ), were widely used for cell annotation and protein prediction. With the development of machine learning technology, more sophisticated algorithms were applied to single-cell research (machine, ). Subsequently, deep learning models, including Recurrent Neural Networks (RNN) (3, ) and Convolutional Neural Networks (CNN) (4, ), were used for the analysis of single-cell data. Currently, the transformer developed by Google has become the most popular language model architecture (10, ). The transformer can process an entire sentence at once during training and effectively capture long-distance dependencies within sequences through the self-attention mechanism (41, ). This capability enables transformers to effectively explore various types of single-cell data, which has led an increasing number of researchers to apply transformers in single-cell research (osnet, ).

This review introduces the main modules of the transformer in the second section. Then, we provide an overview and analysis of existing single-cell language models in the third section and showcase downstream tasks accomplished by single-cell language models in the fourth section. Finally, we discuss the challenges and opportunities of transformer-based single-cell language models in the fifth section. We hope this review will help readers interested in understanding single-cell language models.

Figure 1. The structure of transformers

2. Transformer

The transformer requires extensive training on large text corpora. It is usually trained in a self-supervised manner, which enables language models to perform classification and generation (25, ). For instance, transformer-based language models can automatically extract key information from text, generate new text and answer user queries in question answering. These achievements are credited to the ability of transformers to learn long-term dependencies in language and to train in parallel across all tokens of a sequence, which improves both the throughput of sentence processing and the capacity to capture correlations across the whole sequence. The structure of the transformer is depicted in Fig. 1.

The transformer has demonstrated excellent performance both when trained from scratch and when pre-trained. Transformer-XL (20, ) introduces a segment-level recurrence mechanism and relative positional encoding. It captures dependencies beyond a fixed length while maintaining temporal continuity, which addresses context fragmentation. Reformer (23, ) reduces the complexity of attention computation and uses reversible residual layers instead of standard residual layers to achieve higher memory efficiency and alleviate pressure on computing resources. In addition, pre-training can reduce dependence on annotated data, thus lowering the training cost of the transformer (14, ). GPT (15, ) employs multiple layers of transformer decoders and performs unsupervised language modeling during pre-training to learn semantic and syntactic knowledge from text. BERT (16, ) is a model pre-trained on large datasets. It uses bidirectional transformers and a masking mechanism to consider context from both the left and right sides of the input sequence simultaneously. Owing to the success of these models, many derived models have emerged. XLNet (19, ) is a pre-trained model based on Transformer-XL that achieves bidirectional learning of context. It uses an autoregressive strategy, which helps the model avoid the inconsistency between pre-training and fine-tuning. RoBERTa (17, ) is based on BERT and achieves enhanced training performance by utilizing dynamic masking.

2.1. Encoder and decoder

The transformer is primarily composed of encoders and decoders, in which each sub-layer is wrapped in a residual connection followed by layer normalization. The outputs of the two sub-layers are defined as follows:

$$\mathrm{LayerNorm}(X+\mathrm{MultiHead}(X)) \tag{1}$$

$$\mathrm{LayerNorm}(X+\mathrm{FFN}(X)) \tag{2}$$

where the X in formula (1) denotes the input embedding. It is processed by the multi-head self-attention mechanism (MultiHead), the result is added to the original X and the sum is layer-normalized to obtain the X in formula (2). Then the X in formula (2) is processed by the position-wise feed-forward network (FFN) in the same way. Layer normalization computes the mean and variance over the features of each input and uses them to normalize it, which stabilizes training and leads to more accurate results (30, ). The encoder stacks multiple identical layers to gradually extract semantic information from the input sequence and encode it into a series of hidden vectors. The decoder is responsible for transforming the hidden representations generated by the encoder into an output sequence. It is trained autoregressively. During training the decoder has access to the entire target sequence, so without any restriction it could see the tokens it is supposed to predict, which would reduce its prediction accuracy at inference time. To address this issue, the decoder uses a masked self-attention mechanism in its first sub-layer, which prevents each position from attending to later positions. The output of the masked self-attention is then combined with the hidden vectors provided by the encoder before entering the next layer. Finally, the decoder generates the vectors of the sequence step by step and transforms them into the final output sequence through a linear transformation and the softmax function.
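As a minimal sketch of formulas (1) and (2) and of the decoder mask, the following plain-Python code applies a residual connection and layer normalization around an arbitrary sub-layer. The function names are illustrative, the learnable gain and bias of layer normalization are omitted, and lists are used instead of tensors purely to keep the example self-contained.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance over its features.
    The learnable scale and shift parameters are left out of this sketch."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_block(tokens, sublayer):
    """Compute LayerNorm(X + sublayer(X)) for a list of token vectors.
    `sublayer` stands in for MultiHead (formula 1) or FFN (formula 2)."""
    out = sublayer(tokens)
    return [layer_norm([a + b for a, b in zip(x, y)]) for x, y in zip(tokens, out)]

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]
```

In a decoder, attention scores at positions where the mask is False would be set to a large negative value before the softmax, so that each token only depends on earlier tokens.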

2.2. Multi-head self-attention mechanism

The multi-head self-attention mechanism comprises multiple self-attention modules. It helps the model determine the importance of different parts of the input during training: it adjusts the weights at different positions by calculating the correlations between each input position and all other positions. The self-attention mechanism is defined as follows:

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{\mathrm{T}}}{\sqrt{d_{\mathrm{k}}}}\right)V \tag{3}$$

where $d_{\mathrm{k}}$ represents the dimensionality of the key vectors. $Q$, $K$ and $V$ are the query, key and value matrices, and $K^{\mathrm{T}}$ is the transpose of $K$. The dot product of $Q$ and $K^{\mathrm{T}}$ measures the similarity between the current word vector and the other word vectors. Dividing this value by $\sqrt{d_{\mathrm{k}}}$ and applying the softmax function yields the weight coefficients. The weight coefficients are then multiplied by $V$ to obtain the attention value. The multi-head attention mechanism is defined as follows:
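Formula (3) can be sketched in a few lines of plain Python. The helper names below are illustrative, matrices are represented as lists of rows, and a practical implementation would rely on a tensor library instead.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention(q, k, v):
    """Scaled dot-product attention; returns (output, weights)."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))                 # similarity of each query to each key
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights               # weighted sum of the value vectors
```

Each row of `weights` sums to one, so every output vector is a convex combination of the rows of `v`.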

$$\mathrm{M}(Q,K,V)=\mathrm{C}(\mathrm{head}_{1},\ldots,\mathrm{head}_{h})W^{\mathrm{O}} \tag{4}$$

$$\mathrm{head}_{i}=\mathrm{Attention}(QW_{i}^{\mathrm{Q}},KW_{i}^{\mathrm{K}},VW_{i}^{\mathrm{V}}) \tag{5}$$

where M denotes the multi-head attention output (MultiHead in formula (1)) and C denotes the concatenation function. $\mathrm{head}_{i}$ represents the self-attention module of the $i$-th head. $W^{\mathrm{O}}$ is the output weight matrix that weights and combines the attention values of all heads. $W^{\mathrm{Q}}$, $W^{\mathrm{K}}$ and $W^{\mathrm{V}}$ denote the projection weight matrices. Each input embedding vector is multiplied by them to obtain the corresponding $Q$, $K$ and $V$ matrices. They are updated by backpropagation during training, and each self-attention module has its own $W^{\mathrm{Q}}$, $W^{\mathrm{K}}$ and $W^{\mathrm{V}}$. The multi-head attention value is obtained by multiplying the concatenated attention values by $W^{\mathrm{O}}$.
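To make the roles of the weight matrices concrete, the shapes involved in formulas (4) and (5) can be written out as follows. The symbols $n$ (sequence length) and $d_{\mathrm{v}}$ (value dimension) are introduced here for illustration and do not appear in the text above; a common choice, assumed here and not mandated by the formulas, is $d_{\mathrm{k}}=d_{\mathrm{v}}=d_{\mathrm{model}}/h$.

```latex
\begin{aligned}
Q, K, V &\in \mathbb{R}^{n \times d_{\mathrm{model}}}, \\
W_{i}^{\mathrm{Q}}, W_{i}^{\mathrm{K}} &\in \mathbb{R}^{d_{\mathrm{model}} \times d_{\mathrm{k}}}, \qquad
W_{i}^{\mathrm{V}} \in \mathbb{R}^{d_{\mathrm{model}} \times d_{\mathrm{v}}}, \\
\mathrm{head}_{i} &\in \mathbb{R}^{n \times d_{\mathrm{v}}}, \\
\mathrm{C}(\mathrm{head}_{1},\ldots,\mathrm{head}_{h}) &\in \mathbb{R}^{n \times h d_{\mathrm{v}}}, \\
W^{\mathrm{O}} &\in \mathbb{R}^{h d_{\mathrm{v}} \times d_{\mathrm{model}}}, \qquad
\mathrm{M}(Q,K,V) \in \mathbb{R}^{n \times d_{\mathrm{model}}}.
\end{aligned}
```

Under this choice the output has the same shape as the input, which is what allows the residual addition in formula (1).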

2.3. Position encoding

The position encoding adds positional information to the embedding vectors of the input tokens in the transformer. It is defined as follows:

$$\mathrm{PE}_{(pos,2i)}=\sin\left(pos/10000^{2i/d_{\mathrm{model}}}\right) \tag{6}$$

$$\mathrm{PE}_{(pos,2i+1)}=\cos\left(pos/10000^{2i/d_{\mathrm{model}}}\right) \tag{7}$$

where pos is the position index, i is the dimension index and $d_{\mathrm{model}}$ is the size of the hidden layer. The sine and cosine values for each pos and i are calculated separately using the PE function and then merged into a position encoding vector. This ensures that the embedding vector of each token contains not only semantic information but also the position of the token in the input sequence. In addition, relative position encoding has been proposed (35, ). It enables transformers to better understand the positional information of the input sequence, thereby enhancing the performance and generalization capability of the model.
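Formulas (6) and (7) can be sketched directly in plain Python. The function name and the list-of-lists layout are illustrative choices.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        # `even` plays the role of 2i in formulas (6) and (7)
        for even in range(0, d_model, 2):
            angle = pos / (10000 ** (even / d_model))
            pe[pos][even] = math.sin(angle)          # even dimensions use sine
            if even + 1 < d_model:
                pe[pos][even + 1] = math.cos(angle)  # odd dimensions use cosine
    return pe
```

The resulting row for position `pos` is added element-wise to the token embedding at that position before the first transformer layer.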

2.4. Position-wise feed-forward networks

The position-wise feed-forward network (FFN) acts as a small multi-layer perceptron that is applied to each position separately in every encoder and decoder layer (104, ). It is defined as follows:

$$\mathrm{FFN}(x)=\max(0,xW_{1}+b_{1})W_{2}+b_{2} \tag{8}$$

where $W_{1}$, $b_{1}$, $W_{2}$ and $b_{2}$ are parameters learned during training. The FFN first applies a linear operation to the input to increase its dimension and then applies the ReLU activation function to learn more complex feature information. Finally, a second linear operation reduces the dimension back to the original one. This helps to enhance the generalization capability of the features.
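A minimal sketch of formula (8) for a single position vector is given below. The weight matrices are passed as lists of rows, so `w1` has one row per input feature and one column per hidden unit; this layout is an assumption of the example.

```python
def ffn(x, w1, b1, w2, b2):
    """Position-wise feed-forward network: max(0, x W1 + b1) W2 + b2."""
    # Expand to the hidden dimension and apply ReLU
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
              for col, b in zip(zip(*w1), b1)]
    # Project back to the original dimension
    return [sum(h * w for h, w in zip(hidden, col)) + b
            for col, b in zip(zip(*w2), b2)]
```

The same weights are reused for every position in the sequence, which is why the network is called position-wise.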

3. The Application of Transformer in Single-cell

We categorize transformer-based approaches to single-cell data analysis into single-cell language models and single-cell large language models, depending on whether they use pre-training. These models effectively analyze single-cell omics data by utilizing the unique feature representation of transformers.

3.1. Single-cell language model

This section introduces the structural design and optimization of current single-cell language models. These models are built on the transformer framework. They have been used to analyze various types of single-cell datasets, including single-cell transcriptomics, spatial transcriptomics, epigenomics and multi-omics.

3.1.1. Single-cell language model based on single-cell transcriptomics

TransCluster (65, ) is a transformer-based model for analyzing scRNA-seq data. It demonstrates that transformers can be used for scRNA-seq analysis. It utilizes Linear Discriminant Analysis (LDA) (125, ) to obtain input embeddings for the transformer. Then, a CNN is applied to the output of the transformer to predict cell types. In addition, scTransSort (scTransSort, ) also combines transformers and CNN. It uses a CNN to convert the gene embeddings of each cell into multiple two-dimensional matrix blocks. Each matrix block represents a token and these tokens are processed by 12 transformer layers. Finally, a linear classifier uses the output features of the transformer to predict the cell type. CIForm (66, ) is inspired by the application of transformers in computer vision (CV). In its gene embedding module, the expression vector of each cell is divided into equally sized sub-vectors. These sub-vectors are combined with positional embeddings and fed into the transformer. An average pooling layer then averages the output sub-vectors to obtain the final result. STGRNS (101, ) is an interpretable transformer-based model. It proposes the Gene Expression Motif (GEM) technique to preprocess scRNA-seq data. The combination of GEM and transformers gives STGRNS stronger interpretability. In contrast to STGRNS, T-GEM (116, ) enhances model interpretability by replacing the weights in the transformer with gene-related weights. It obtains attention values for different genes and then uses these attention values for the classification task.
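The sub-vector tokenization and average pooling described for CIForm can be illustrated with a small sketch. This is not the CIForm implementation: the function names, the zero padding of the last sub-vector and the choice of `patch_size` are assumptions made for the example.

```python
def split_into_subvectors(expression, patch_size):
    """Cut one cell's expression vector into equally sized sub-vectors (tokens).
    The tail is zero-padded so every token has length patch_size (an assumption)."""
    padded = list(expression)
    remainder = len(padded) % patch_size
    if remainder:
        padded += [0.0] * (patch_size - remainder)
    return [padded[i:i + patch_size] for i in range(0, len(padded), patch_size)]

def mean_pool(token_outputs):
    """Average the per-token output vectors into a single cell-level vector."""
    n = len(token_outputs)
    return [sum(col) / n for col in zip(*token_outputs)]
```

Each sub-vector would receive a positional embedding before entering the transformer, and the pooled vector would be passed to a classifier.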

3.1.2. Single-cell language model based on single-cell spatial omics and epigenomics

PROTRAIT (67, ) is a transformer-based model for analyzing scATAC-seq data. It uses one-hot encoding to map input sequences into a latent space. When the sequence length is below a predefined threshold, the one-hot encoding is transformed into motif embeddings through convolutional layers. For sequences longer than the threshold, an alternating combination of convolutional and pooling layers is used to obtain the motif embeddings. The embeddings, together with absolute positional information, are then passed into the transformer for further processing. The output features of the transformer are used for cell classification. TransformerST (transformerST, ) constructs a variational transformer framework for data representation and employs CNNs as both the encoder and the decoder. It introduces a graph transformer between the encoder and decoder to analyze spatial transcriptomics data. By constructing an undirected graph, the graph transformer is able to learn nonlinear mappings and aggregate neighbor relationships. This makes high-resolution reconstruction of gene expression possible.

3.1.3. Single-cell language model based on single-cell multi-omics

scMVP (scMVP, ) is a transformer-based deep generative model specifically designed for the simultaneous analysis of scRNA-seq and scATAC-seq data. The model establishes two independent channels at the encoder and decoder layers for processing scRNA data and scATAC data. The scRNA channel adopts the masked attention mechanism, while the scATAC channel employs the self-attention mechanism. Subsequently, the outputs of the two channels are combined, and the mean and variance of the common latent variables are obtained through a shared linear layer. scMoFormer (scMoFormer, ) is a transformer-based multimodal model that uses a heterogeneous graph to model single-cell data. It constructs a multimodal heterogeneous graph containing three types of nodes: cells, genes and proteins. The training framework uses three transformers, each dedicated to extracting the data representation of the corresponding modality. Finally, a multi-layer fully connected network predicts the target protein expression level of each cell. DeepMAPS (113, ) introduces the heterogeneous graph transformer (HGT) framework. It constructs a heterogeneous graph from a cell-gene matrix. Then, the entire heterogeneous graph is divided into multiple subgraphs and HGT is applied to these subgraphs. Subgraph sampling is performed through a sparse-based feature selection method. During training, node information is updated over multiple iterations and training on different subgraphs shares the same set of parameters. After training on all subgraphs is completed, HGT is applied to the entire heterogeneous graph to obtain data features. MarsGT (marsgt, ) extends DeepMAPS. The heterogeneous graph of MarsGT is constructed from the cell-gene matrix, the gene-peak matrix and the cell-peak matrix. Compared to DeepMAPS, the added peak nodes allow it to better capture features of single-cell data from the perspective of regulatory networks. During the subgraph sampling stage, a probability-based subgraph sampling method is employed to select genes and regulatory regions associated with rare cells. Then, the model is trained on the subgraphs using transformers. After the trained weights are obtained, the resulting model is applied to the entire graph for further training.

3.2. Single-cell large language model

Currently, large language models are also being applied to the single-cell domain, with GPT and BERT as the leading representative architectures. This section introduces the current single-cell large language models.

3.2.1. Single-cell large language model based on single-cell transcriptomics

scBERT (69, ) is the first single-cell pre-trained model built on the BERT architecture. The structure of scBERT is shown in Fig. 2. Its training procedure is designed to reduce artificial biases and overfitting in order to enhance the generalization capability of the model. To capture the similarity between genes, scBERT employs gene2vec (71, ) to obtain an embedding for each gene. The input embedding combines this gene embedding with an expression embedding, so that it captures relationships between genes as well as their expression levels. This design allows scBERT to transform gene expression information into transformer input more effectively and to generate cell-specific embeddings. Since most scRNA-seq profiles contain far more genes than the 512-token input limit of standard transformers, scBERT uses the Performer, which approximates self-attention with a linear attention mechanism based on low-rank random feature mapping and thus reduces computational complexity. This enables scBERT to take over 16,000 genes as input when processing long sequences. In addition, scBERT provides interpretability by using attention weights to reflect the contribution of genes and analyzing the highlighted genes with Enrichr.

scFoundation (90, ) is a large pre-trained transformer-based model with about 100 million parameters. Its embedding module produces the final embeddings with positional information. In addition, scFoundation adopts an asymmetric encoder-decoder architecture. In the encoding phase, it processes only the non-zero, non-masked expressed genes to reduce computational cost. In the decoding phase, it restores the zero-expressed and masked genes to learn relationships among all genes. A read-depth-aware task is used as the pre-training strategy, as illustrated in Fig. 3. It harmonizes read-depth differences across cells, which makes the model more consistent and precise when dealing with cells of varying sequencing depth.

3.2.2. Single-cell large language model based on single-cell multi-omics

scGPT (89, ) is the first transformer-based single-cell foundation model to undergo generative pre-training on over 33 million cells. The model draws inspiration from GPT. The structure of scGPT is depicted in Fig. 4. scGPT treats genes as tokens and uses a condition token to represent the positional information of genes. In addition, it employs value binning to address differences in expression scale between sequencing batches. scGPT uses stacked transformer layers and Flash-Attention (97, ) to handle single-cell multi-omics data. Flash-Attention effectively alleviates the sequence length limitation and reduces computational cost. In terms of interpretability, scGPT learns to focus on key genes through pre-training on a large amount of single-cell data, which gives it more comprehensive interpretability. While scGPT demonstrates impressive performance, it still has some shortcomings. It proves competitive in low-data settings, but it requires careful consideration of experimental conditions in zero-shot settings. Moreover, the current pre-training methods may lack universal applicability.
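The idea behind value binning can be illustrated with a simple per-cell rank-based scheme. This is only a sketch under stated assumptions and not the scGPT implementation: the function name, the rule that zeros keep bin 0 and the rank-fraction bin assignment are choices made for the example.

```python
import bisect
import math

def bin_expression(values, n_bins):
    """Map each non-zero value of one cell to a bin in 1..n_bins according to
    its rank among that cell's non-zero values; zeros stay in bin 0."""
    nonzero = sorted(v for v in values if v > 0)
    if not nonzero:
        return [0] * len(values)
    binned = []
    for v in values:
        if v <= 0:
            binned.append(0)
        else:
            # fraction of the cell's non-zero values that are <= v
            fraction = bisect.bisect_right(nonzero, v) / len(nonzero)
            binned.append(max(1, math.ceil(fraction * n_bins)))
    return binned
```

Because the bins depend only on the ordering of values within a cell, two cells whose counts differ by a batch-specific scaling factor receive the same binned representation.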

CellPLM (80, ) is the first transformer-based single-cell pre-trained model that considers the relationships between cells. The structure of CellPLM is depicted in Fig. 5. It establishes a gene expression embedder for processing input data. The embedder initializes an embedding vector for each type of gene and filters out unmeasured genes and randomly masked genes. It aggregates gene embeddings according to their expression levels in each cell and transforms them into a suitable input for the transformer. These expression embeddings are then fed into an encoder-decoder structure with a latent space between the encoder and the decoder. The encoder comprises N transformer blocks. However, the computational complexity of transformers grows quadratically with the input length, which results in significant computational costs (lan2023benchmarking, ). CellPLM therefore replaces the transformer with a variant called Flowformer (82, ) to resolve the input constraints and computational complexity problems associated with the transformer. To capture cell-cell relationships and the spatial positions of individual cells more effectively, CellPLM incorporates spatially resolved transcriptomic (SRT) data into the encoder for training. SRT data contain positional information, which is encoded as position embeddings. The position embeddings are combined with the expression embeddings to obtain the final input embedding. In the latent space, a Gaussian mixture model is employed. The decoder employs several feedforward layers (FFLayers) to process the latent space vectors and retrieves the batch embedding of each cell from a learnable lookup table.

3.2.3. Single-cell large language model based on gene expression ranking

tGPT (50, ) is an autoregressive, unsupervised transformer-based model. It uses the ranking of gene expression to predict the index of the next gene. Gene expression ranking provides the relative position of genes and is more suitable for large-scale gene screening and comparative analysis. However, this strategy may only consider genes with higher expression levels and neglect the specific information contained in low-expression genes. The structure of tGPT is depicted in Fig. 6. tGPT predefines a maximum input length: any part of the input sequence exceeding this limit is truncated, while shorter sequences are padded with 0. During training, it combines gene token embeddings with positional encoding embeddings. The final embedding passes through 8 transformer modules to extract features from single-cell sequences.
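The rank-based input construction with truncation and padding can be sketched as follows. This is an illustration and not the tGPT code: it assumes genes are identified by positive integer ids so that 0 can serve as the padding id, that unexpressed genes are dropped, and that ties are broken by gene id.

```python
def rank_tokens(expression, max_len, pad_id=0):
    """Turn one cell into a fixed-length sequence of gene ids ordered by
    descending expression. `expression` maps gene id -> expression value."""
    expressed = [(gene, value) for gene, value in expression.items() if value > 0]
    expressed.sort(key=lambda gv: (-gv[1], gv[0]))    # highest expression first
    tokens = [gene for gene, _ in expressed][:max_len]  # truncate long cells
    return tokens + [pad_id] * (max_len - len(tokens))  # pad short cells with 0
```

An autoregressive model trained on such sequences predicts the id of the next gene given the higher-ranked genes that precede it.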

Cell2Sentence (C2S) (121, ) is obtained by fine-tuning the pre-trained GPT-2 model on text sequences of gene names. The order of gene names is determined by the expression ranking of each gene, and C2S uses these gene name sequences, called cell sentences, as its input. Through fine-tuning, C2S is capable of generating new cell sentences and converting them back into gene expression vectors. This reverse conversion retains the key information of the original data in most cases, so information loss is kept small. The method enables transformers to acquire information about single-cell data, but the sequence conversion operation often results in higher computational costs.
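A toy version of the forward conversion from an expression profile to a cell sentence is sketched below, together with the rank information that can be read back from a sentence. The function names, the tie-breaking rule and the optional `top_k` cutoff are assumptions of this example, and the step that maps ranks back to expression values is deliberately left out.

```python
def cell_to_sentence(expression, top_k=None):
    """Write one cell as gene names ordered by descending expression.
    `expression` maps gene name -> expression value; unexpressed genes are dropped."""
    genes = sorted((g for g, v in expression.items() if v > 0),
                   key=lambda g: (-expression[g], g))
    return " ".join(genes[:top_k])

def sentence_to_ranks(sentence):
    """Recover the rank (1 = most expressed) of every gene in a cell sentence."""
    return {gene: rank for rank, gene in enumerate(sentence.split(), start=1)}
```

The sentence keeps the ordering of genes but not their absolute expression values, which is why converting back to an expression vector is only approximate.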

4. Downstream task analysis

Transformer-based single-cell language models have been applied to various downstream tasks, including batch correction, cell clustering, cell type annotation, gene network inference and perturbation response prediction. The datasets used for these downstream tasks are primarily obtained from databases such as TCGA (TCGA, ) and GEO (geo, ). The details are shown in Table 1.

Table 1. Details of the downstream tasks (single-cell large language models are marked with an asterisk)

Downstream tasks	Model	Metrics	Dataset
Batch Correction	tGPT*	kBET	HCA(137, )
	scGPT*	ASWbatch, GraphConn	COVID-19(130, ), PBMC 10(131, ), Perirhinal Cortex(132, )
Cell Clustering	scMVP	ARI	Paired-seq cell line data(pair, ), SNARE-seq cell line data(SNARE-seq, )
	tGPT*	ARI, NMI, FMI	HCA(137, ), HCL(134, ), TCGA(138, ), Macaque Retina(202, ), the GTEx(135, ), Tabula Muris(136, )
	CellPLM*	ARI, NMI	public dataset (publicdata, )
	DeepMAPS	ASW, ARI	PBMC(131, ), lung tumor leukocytes CITE-seq dataset(lung, )
Cell Type Annotation	TransCluster	F1-score, Precision, Recall, MCC	Shao(shao, ), Baron(baron, )
	PROTRAIT	ARI, AMI	sci-ATAC human atlas(acat, )
	scBERT*	Accuracy, ARI, F1-score	Baron(baron, ), Muraro(muraro, ), Segerstolpe(segerstolpe, ), Xin(xin, )
	scGPT*	Accuracy, Precision, Recall, F1-score	hPancreas(139, ), multiple sclerosis(140, ), tumor-infiltrating myeloid(mye, )
	CellPLM*	Precision, F1-score	hPancreas(139, ), multiple sclerosis(140, )
Gene network inference	DeepMAPS	Closeness centrality, Eigenvector centrality, Functional enrichment analysis	Reactome(reactome, ), DoRothEA(roroth, ), TRRUST v2(trrust, )
	scGPT*	Pathway enrichment analysis	Immune Human(144, ), ChIP-Atlas database(chip, ), Adamson(146, )
Perturbation Prediction	scFoundation*	MSE	Dixit(147, ), Adamson(146, ), Norman(148, )
	scGPT*	PCC	Adamson(146, ), Norman(148, )
	CellPLM*	RMSE	Adamson(146, ), Norman(148, )

4.1. Batch correction

With the increasing quantity of single-cell data, the variability between different batches has become an increasingly significant source of interference in data analysis, and improving the effectiveness of batch correction has become an urgent challenge. Three key metrics are used to evaluate batch correction: the k-nearest neighbor batch effect test (kBET) (kbet, ), the Average Silhouette Width for batch correction (ASWbatch) (ASW, ) and the Graph Connectivity measurement (GraphConn) (graphconnectivity, ). kBET assesses the effectiveness of correction by comparing the distribution of cells within and between batches. Its acceptance rate reflects the uniformity of cell distribution after correction. A higher acceptance rate indicates that cells from different batches are well mixed, i.e. technical batch effects are reduced. ASWbatch originates from the concept of silhouette width in cluster analysis. It measures how well batches are mixed within clusters after batch effects are removed. GraphConn evaluates the connectivity between cells in the dataset after batch correction. It quantifies how well cells of the same type remain connected in the neighborhood graph after correction, which reflects the reduction of batch effects.
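As an illustration of a graph-based batch correction metric, the sketch below assumes that GraphConn is computed as the average, over cell types, of the fraction of cells that lie in the largest connected component of the neighborhood subgraph restricted to that cell type. The data layout (dictionaries for the adjacency list and the labels, with symmetric neighbor lists) is an assumption of the example.

```python
def graph_connectivity(neighbors, labels):
    """neighbors: dict cell -> list of neighboring cells (undirected kNN graph).
    labels: dict cell -> cell type. Returns a score in (0, 1]."""
    scores = []
    for label in set(labels.values()):
        members = {c for c, l in labels.items() if l == label}
        seen, largest = set(), 0
        for start in members:
            if start in seen:
                continue
            # depth-first search restricted to cells of this type
            stack, size = [start], 0
            seen.add(start)
            while stack:
                node = stack.pop()
                size += 1
                for nb in neighbors.get(node, ()):
                    if nb in members and nb not in seen:
                        seen.add(nb)
                        stack.append(nb)
            largest = max(largest, size)
        scores.append(largest / len(members))
    return sum(scores) / len(scores)
```

A score of 1 means that, for every cell type, all cells of that type are mutually reachable in the corrected embedding's neighborhood graph, regardless of batch.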

tGPT (50, ) adopts the ranking of gene expression to avoid interference from the actual expression values of Highly Variable Genes (HVGs) and from batch information during training. It is trained on the HCA dataset (137, ) and uses the kBET acceptance rate to reflect the magnitude of differences between batches. In addition, tGPT is applied to data from an immune checkpoint blockade (ICB) clinical trial. By quantifying the expression features of different attention heads, it is demonstrated that these attention heads have prognostic significance in this clinical trial. scGPT (89, ) conducts batch effect experiments by fine-tuning pre-trained models. To quantify batch correction performance, scGPT calculates the average silhouette width ($ASW_{batch}$) and the graph connectivity measure (GraphConn) (song2023benchmarking, ). It computes AvgBATCH (i.e. the average of ASWbatch and GraphConn) to summarize batch correction performance. scGPT evaluates batch correction performance on three datasets: COVID-19 (130, ), PBMC 10 (131, ) and Perirhinal Cortex (132, ). It is compared against three methods: Seurat (seruat, ), Harmony (ham, ) and scVI (scvi, ). scGPT achieves the best AvgBATCH value on all three datasets. However, scGPT does not achieve excellent batch effect correction in zero-shot settings (110, ).

4.2. Cell clustering

The goal of cell clustering analysis is to group cells based on their gene expression patterns. Commonly used metrics for evaluating the accuracy of clustering results include the Adjusted Rand Index (ARI) (ARI, ), Average Silhouette Width (ASW) (ASW, ) and Normalized Mutual Information (NMI) (NMI, ). ARI adjusts the Rand Index by comparing the observed pair-wise concordance to the concordance expected by chance and yields a measure of clustering consistency. ASW calculates the silhouette width of each sample, which compares how similar the sample is to its own cluster with how similar it is to other clusters. It offers an intuitive evaluation of clustering results. NMI normalizes the mutual information to eliminate the influence of the number of clusters and the total number of samples. This makes it useful for comparing clustering results under different parameter settings.
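The chance adjustment in ARI can be made concrete with a short sketch based on the standard pair-counting formulation. The function name is illustrative, and the code assumes at least two samples.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(true_labels, pred_labels):
    """ARI = (observed pair agreement - expected) / (maximum - expected)."""
    n = len(true_labels)
    # pairs of samples that share both the true label and the predicted cluster
    sum_ij = sum(comb(c, 2) for c in Counter(zip(true_labels, pred_labels)).values())
    sum_a = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_b = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_a * sum_b / comb(n, 2)   # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:               # degenerate case, e.g. a single cluster
        return 1.0
    return (sum_ij - expected) / (max_index - expected)
```

The score is 1 for identical partitions (up to a renaming of clusters), close to 0 for random assignments, and can be negative when the agreement is worse than chance.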

scMVP (scMVP, ) employs a joint deep learning model to learn features from both scATAC data and scRNA data. It is trained on Paired-seq cell line data (pair, ) and SNARE-seq cell line data (SNARE-seq, ). Then it uses UMAP visualization to inspect the resulting cell clusters. It successfully identifies different numbers of cell subpopulations and effectively separates them in the integrated scRNA-seq and scATAC-seq data, which confirms its effectiveness in cell clustering analysis. tGPT (50, ) is applicable to large-scale tissue samples through pre-training. It partitions samples into distinct clusters that correspond to different organs. It is trained on six datasets: HCA (137, ), HCL (134, ), TCGA (138, ), Macaque Retina (202, ), GTEx (135, ) and Tabula Muris (136, ). The experimental results demonstrate that it achieves excellent performance in cell clustering tasks. CellPLM (80, ) conducts unsupervised clustering analysis by extracting cell embedding vectors from the dataset without fine-tuning. It performs zero-shot clustering experiments on a public dataset (publicdata, ) and is compared with PCA, Geneformer and scGPT. In these experiments, it achieves the highest ARI and NMI. DeepMAPS (113, ) validates cell clustering on ten single-cell multi-omics datasets. It is trained with 36 parameter combinations and compared with Seurat, MOFA+ (mofa+, ), TotalVI (totalVI, ) and Harmony. In all experiments, DeepMAPS achieves the best ARI and ASW. Furthermore, DeepMAPS performs single-cell multi-omics integration analysis on the PBMC dataset (131, ) and the CITE-seq dataset of lung tumor leukocytes (lung, ). It successfully identifies 13 cell types, which validates its effectiveness.

4.3. Cell type annotation

Cell type annotation refers to assigning known cell type labels to each cell or cell cluster, which aids in gaining a deeper understanding of the biological significance of the cells (lan2020ldicdl, ). Commonly used metrics for evaluating annotation performance include precision, recall, accuracy and the F1 score (lan2024jlonmfsc, ). Precision is the proportion of samples predicted as a given category that truly belong to that category. Recall is the proportion of samples truly belonging to a category that the model correctly predicts as that category. Accuracy is the ratio of correctly classified samples to the total number of samples. The F1 score is the harmonic mean of precision and recall and is used to comprehensively assess the performance of the model.
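
The definitions above can be sketched for a single cell type as follows. This is a minimal example using only the Python standard library; the function names and the toy labels are assumptions introduced for illustration and are not taken from any of the reviewed models.

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall and F1 score for one cell type (one-vs-rest)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of cells whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy annotation result for six cells.
truth = ["T", "T", "B", "B", "B", "NK"]
predicted = ["T", "B", "B", "B", "T", "NK"]
print(per_class_metrics(truth, predicted, "B"))
print(accuracy(truth, predicted))
```

Per-class scores are usually averaged over all cell types (macro averaging) when rare cell types matter, since accuracy alone can be dominated by abundant types.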

TransCluster (65, ) is the first model to apply transformers to cell type annotation. It is trained on the Shao dataset (shao, ) and the Baron dataset (baron, ) and demonstrates efficient performance in cell type prediction tasks. PROTRAIT (67, ) is trained on the sci-ATAC human atlas (acat, ) and generates cell embeddings that reflect the distribution of scATAC-seq data. It then uses the k-nearest neighbors (KNN) algorithm for cell type annotation. scBERT (69, ) is pre-trained on 9 scRNA-seq datasets, and the pre-trained model is then fine-tuned. Finally, it uses the K-means algorithm to annotate cell types. scBERT performs cell annotation tasks on the Baron dataset (baron, ), the Muraro dataset (muraro, ), the Segerstolpe dataset (segerstolpe, ) and the Xin dataset (xin, ). Both scGPT (89, ) and CellPLM (80, ) are trained on the hPancreas dataset (139, ) and the multiple sclerosis (MS) dataset (140, ) to perform the cell annotation task. scGPT applies normalization, log transformation and binning to gene expression values and then achieves cell type annotation through fine-tuning. In addition, scGPT is trained on the tumor-infiltrating myeloid dataset (Mye.) (mye, ) and evaluated on query partitions of three previously unseen cancer types. The results indicate that scGPT distinguishes immune cell subtypes with high accuracy. CellPLM adds a feedforward layer during fine-tuning and uses the standard cross-entropy loss function. Fine-tuned CellPLM exhibits a significant improvement in F1 score and precision on the two datasets compared with CellPLM trained from scratch.
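
To make the embedding-then-KNN style of annotation concrete, the sketch below labels a query cell by majority vote among its nearest reference cells in embedding space. It is a generic illustration rather than the implementation of PROTRAIT or any other reviewed model: the two-dimensional embeddings, the Euclidean distance and the function name knn_annotate are assumptions made for this example.

```python
from collections import Counter
from math import dist


def knn_annotate(query, ref_embeddings, ref_labels, k=3):
    """Assign the majority label among the k nearest reference cells."""
    # Rank reference cells by Euclidean distance to the query embedding.
    order = sorted(range(len(ref_embeddings)),
                   key=lambda i: dist(query, ref_embeddings[i]))
    votes = Counter(ref_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy reference atlas: two well-separated groups of labelled cells.
reference = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
             (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
labels = ["T cell"] * 3 + ["B cell"] * 3

print(knn_annotate((0.2, 0.1), reference, labels))
print(knn_annotate((5.5, 5.2), reference, labels))
```

In practice the quality of such annotation depends mostly on the embedding: the vote is only meaningful if cells of the same type lie close together in the learned space.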

4.4. Gene network inference

Gene network inference analysis reveals regulatory associations between genes by comparing gene expression patterns under different conditions. Currently, single-cell language models based on transformers have introduced innovative perspectives to the study of gene regulatory networks (lan2022detecting, ). Centrality score metrics, including closeness centrality (CC) and eigenvector centrality (EC) (centrality, ), are used in the experiments on single-cell language models. CC assesses the average distance from a gene node to the other gene nodes in the network. EC considers not only the number of connections of a gene node but also the importance of the other gene nodes that it is connected to. In addition, functional enrichment analysis (functional, ) and pathway enrichment analysis (path, ) are employed in experimental analysis. Functional enrichment analysis aims to identify biological functions or processes that are significantly enriched in a set of genes. Pathway enrichment analysis is similar to functional enrichment analysis but focuses more on known biochemical pathways (lan2024lgcda, ). It aims to understand in depth how genes function through synergistic interactions within specific biological pathways.
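
The two centrality scores can be illustrated on a small undirected gene network. The sketch below is a minimal standard-library implementation written for this example; the gene names and edges are hypothetical. Closeness is computed as the reciprocal of the average shortest-path distance, and eigenvector centrality by power iteration on the adjacency matrix plus the identity, a common shift that keeps the iteration from oscillating on bipartite graphs.

```python
from collections import deque
from math import sqrt


def closeness_centrality(adj, node):
    """Reciprocal of the average shortest-path distance to reachable nodes."""
    distance = {node: 0}
    queue = deque([node])
    while queue:  # breadth-first search over unweighted edges
        u = queue.popleft()
        for v in adj[u]:
            if v not in distance:
                distance[v] = distance[u] + 1
                queue.append(v)
    total = sum(distance.values())
    return (len(distance) - 1) / total if total else 0.0


def eigenvector_centrality(adj, iterations=200):
    """Power iteration on (A + I); scores are L2-normalized."""
    score = {n: 1.0 for n in adj}
    for _ in range(iterations):
        # A node inherits importance from itself and from its neighbours.
        new = {n: score[n] + sum(score[m] for m in adj[n]) for n in adj}
        norm = sqrt(sum(v * v for v in new.values())) or 1.0
        score = {n: v / norm for n, v in new.items()}
    return score


# Hypothetical network: TF1 regulates G1, G2 and G3; G3 is linked to G4.
network = {
    "TF1": ["G1", "G2", "G3"],
    "G1": ["TF1"],
    "G2": ["TF1"],
    "G3": ["TF1", "G4"],
    "G4": ["G3"],
}
print(closeness_centrality(network, "TF1"))
print(eigenvector_centrality(network))
```

In this toy network the hub TF1 scores highest on both measures, while G3 outranks G1 on eigenvector centrality because it has an additional neighbour.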

DeepMAPS (113, ) uses the Steiner Forest Problem (SFP) to identify genes that contribute significantly to cell cluster features and constructs a gene correlation network. It defines sets of genes regulated by the same transcription factor (TF) as regulons and compares regulon activities between cell clusters. Then, it selects regulons with significantly higher activity scores as cell-type-specific regulons and constructs gene regulatory networks (GRNs) based on cell cluster regulons. After constructing GRNs, DeepMAPS conducts functional enrichment analysis. Specifically, it employs hypergeometric tests to compare the intersection of GRN results with regulons in the database and evaluates whether the predicted regulons in the GRN are enriched for the same functions or pathways as known regulons. DeepMAPS is trained on single-cell multi-omics datasets from the 10x database. The experiments on DeepMAPS demonstrate that its GRNs contain a greater number of unique TFs and cell-type-specific regulons, which are enriched in specific functions or pathways. In addition, scGPT (89, ) demonstrates high interpretability in gene regulatory network experiments. Pre-training enables scGPT to emphasize genes with intricate relationships, which improves its interpretability. On the Human Leukocyte Antigen (HLA) dataset, scGPT forms an HLA gene network through zero-shot learning. On the Immune Human dataset, fine-tuned scGPT generates CD gene networks and visualizes the gene information. scGPT performs pathway enrichment analysis using the Reactome database (reactome, ). It successfully validates the extracted gene program and identifies 22 additional pathways. These experiments demonstrate the ability of scGPT to capture complex gene relationships. Through pre-training and fine-tuning, scGPT achieves stronger generalization capabilities.
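
The hypergeometric test used in such enrichment analyses can be sketched as a one-sided tail probability: given a gene universe, a known regulon or pathway, and a predicted gene set, it measures how likely an overlap at least as large as the observed one is under random selection. The function name and the toy numbers below are assumptions for illustration and do not reproduce the DeepMAPS pipeline.

```python
from math import comb


def hypergeom_enrichment_pvalue(population, annotated, selected, overlap):
    """P(X >= overlap) when `selected` genes are drawn without replacement
    from `population` genes, of which `annotated` belong to the known set."""
    upper = min(annotated, selected)
    total = comb(population, selected)
    tail = sum(comb(annotated, k) * comb(population - annotated, selected - k)
               for k in range(overlap, upper + 1))
    return tail / total


# Toy numbers: 10 genes in total, 4 in a known regulon,
# 3 genes predicted, 2 of which fall inside the known regulon.
print(hypergeom_enrichment_pvalue(population=10, annotated=4,
                                  selected=3, overlap=2))
```

A small p-value indicates that the predicted set overlaps the known set more than chance would explain. When many regulons or pathways are tested, the resulting p-values are normally corrected for multiple testing.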

4.5. Perturbation responses

Single-cell perturbation prediction experiments aim to predict and analyze the biological responses of cells to external stimuli or changes introduced into single cells (lan2022kgancda, ). Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) have become two important metrics for evaluating how well a model predicts the responses of cells to specific perturbations (landeep, ). MSE is the average squared difference between the predicted and the observed responses of single cells to a perturbation. A lower MSE value indicates that the predictions of the model are more consistent with the observed values. RMSE is the square root of MSE and provides an error measure in the same units as the original data. It directly reflects the magnitude of the prediction error.
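
Both metrics can be sketched in a few lines. In the example below, the optional genes argument restricts the error to a subset of gene indices, for instance a set of differentially expressed genes; this argument, the function names and the toy values are assumptions made for illustration.

```python
from math import sqrt


def mse(predicted, observed, genes=None):
    """Mean squared error over all genes, or over the given gene indices."""
    indices = list(range(len(observed)) if genes is None else genes)
    return sum((predicted[i] - observed[i]) ** 2 for i in indices) / len(indices)


def rmse(predicted, observed, genes=None):
    """Square root of the MSE, in the units of the expression values."""
    return sqrt(mse(predicted, observed, genes))


# Toy post-perturbation expression for three genes.
predicted_expression = [1.0, 2.0, 3.0]
observed_expression = [1.0, 2.0, 5.0]
print(mse(predicted_expression, observed_expression))
print(rmse(predicted_expression, observed_expression, genes=[2]))
```

Restricting the error to strongly responding genes avoids a situation where the many unaffected genes dominate the average and mask poor predictions on the genes of interest.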

In perturbation response prediction experiments, scFoundation (90, ) is combined with GEARS (129, ) to construct a personalized gene co-expression graph for each cell, which significantly improves the accuracy of gene perturbation predictions. It is evaluated on three datasets, namely the Dixit dataset (147, ), the Adamson dataset (146, ) and the Norman dataset (148, ), and obtains lower Mean Squared Error (MSE) values. In addition, scGPT (89, ) initializes fine-tuning with the pre-trained parameters of the embedding and transformer layers. The fine-tuning process uses genes with both zero and non-zero expression. scGPT is compared with GEARS and CPA (151, ) on the Adamson dataset and the Norman dataset, where it accurately predicts the expression changes of the top 20 differentially expressed (DE) genes. During fine-tuning, CellPLM (80, ) initializes all components except the decoder with pre-trained weights. CellPLM is compared with GEARS and scGen (152, ) on the Adamson dataset and the Norman dataset. It conducts two types of experiments (single-gene perturbation and double-gene perturbation) on the Norman dataset and only single-gene perturbation on the Adamson dataset. In each experiment, CellPLM exhibits a lower Root Mean Squared Error (RMSE) than GEARS and scGen.

5. Challenges and prospects

In the field of single-cell research, transformers contribute to a deeper understanding of vast and complex datasets and enhance the simulation and comprehension of cellular processes. In this section, we discuss the challenges encountered by transformer-based single-cell language models. We focus primarily on their limitations, including the handling of long sequence data, overfitting risks in pre-training, computational requirements and interpretability. In addition, we analyze some potential future research directions.

5.1. Sequence data processing

Transformer-based single-cell language models have strong representational capabilities on single-cell sequence data. However, single-cell sequence data often contains excessively long sequences (lan2022sciac, ). Because the cost of self-attention grows quadratically with sequence length, this leads to a rapid increase in the computational complexity of these models. In addition, single-cell data with long sequences may contain more complex gene relationships. Nevertheless, the self-attention mechanism of transformers tends to capture dependencies between adjacent positions in the sequence, which may cause the model to ignore some key gene information. scBERT (69, ) adopts a variant of the transformer called Performer to address this problem. It uses the low-rank attention mechanism of Performer to avoid over-focusing on dependencies between adjacent positions. When dealing with sparse DNA sequences, the attention mechanism of Performer may exhibit better robustness. Although Performer achieves good results, its low-rank attention mechanism brings certain challenges in terms of data precision and sensitivity to model parameters. In addition, Performer is not always more effective than the standard transformer across datasets and tasks. However, it is undeniable that transformer variants have brought new insights to the research of single-cell language models.
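
The cost difference between full and low-rank attention can be illustrated with a small sketch. Full self-attention scores every pair of tokens, so its work grows with the square of the sequence length. Kernelized variants instead summarize the keys and values once and reuse that summary for every query. The example below is not the Performer algorithm itself: as a simplifying assumption it replaces Performer's random-feature approximation of softmax with a fixed positive feature map, elu(x) + 1, and all function names are hypothetical. It only demonstrates that reordering the matrix products yields the same output without forming the pairwise score matrix.

```python
from math import exp


def feature_map(x):
    """elu(x) + 1, a positive feature map (stand-in for random features)."""
    return [v + 1.0 if v > 0 else exp(v) for v in x]


def quadratic_attention(Q, K, V):
    """Kernel attention that scores every (query, key) pair: O(n^2) work."""
    out = []
    for q in Q:
        fq = feature_map(q)
        weights = [sum(a * b for a, b in zip(fq, feature_map(k))) for k in K]
        z = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / z
                    for j in range(len(V[0]))])
    return out


def linear_attention(Q, K, V):
    """Same output, but keys and values are summarized once: O(n) work."""
    d, m = len(K[0]), len(V[0])
    kv = [[0.0] * m for _ in range(d)]  # sum over tokens of phi(k) v^T
    ksum = [0.0] * d                    # sum over tokens of phi(k)
    for k, v in zip(K, V):
        fk = feature_map(k)
        for i in range(d):
            ksum[i] += fk[i]
            for j in range(m):
                kv[i][j] += fk[i] * v[j]
    out = []
    for q in Q:
        fq = feature_map(q)
        z = sum(fq[i] * ksum[i] for i in range(d))
        out.append([sum(fq[i] * kv[i][j] for i in range(d)) / z
                    for j in range(m)])
    return out


Q = [[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]]
K = [[0.2, 0.1], [-0.3, 0.5], [0.7, -0.1]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(quadratic_attention(Q, K, V))
print(linear_attention(Q, K, V))
```

For a cell represented by tens of thousands of gene tokens, avoiding the pairwise score matrix is what makes attention over the full gene set tractable. The price is that the kernel only approximates softmax attention, which is consistent with the precision concerns noted above.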

5.2. Overfitting risks in pre-training

Although transformer-based single-cell language models increasingly adopt pre-training techniques, the analysis of overfitting in these pre-trained models is relatively limited. Single-cell data is characterized by its diversity of types, and different types of single-cell data may vary significantly. This may lead to an imbalanced distribution of pre-training samples and potentially cause overfitting on smaller datasets. To address this issue, data augmentation techniques can be introduced into pre-training. Currently, Generative Adversarial Networks (GANs) have shown promising results in single-cell data augmentation (Gan, ). By using GANs to generate synthetic data samples that are similar to the original data, the diversity of the dataset can be effectively increased, which can mitigate the overfitting problems caused by data imbalance. In addition, interpolating and extrapolating between original single-cell data samples can also be considered. Generating new samples with methods such as linear interpolation, polynomial interpolation or deep learning models increases the quantity and diversity of the data and can further enhance the generalization capability and robustness of models. We believe that incorporating these methods into the pre-training process of single-cell language models may help address the issue of overfitting.
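
As a minimal sketch of the interpolation idea, the example below creates synthetic expression profiles as convex combinations of two randomly chosen cells. The function names, the uniform mixing weight and the toy profiles are assumptions made for illustration. In practice, one would likely restrict interpolation to cells of the same type or batch so that the synthetic profiles remain biologically plausible.

```python
import random


def interpolate_cells(cell_a, cell_b, alpha):
    """Linear interpolation between two expression profiles (0 <= alpha <= 1)."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(cell_a, cell_b)]


def augment(cells, n_new, seed=0):
    """Create n_new synthetic cells by mixing random pairs of real cells."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(cells, 2)
        synthetic.append(interpolate_cells(a, b, rng.random()))
    return synthetic


# Toy expression profiles of three cells over two genes.
cells = [[0.0, 2.0], [4.0, 6.0], [2.0, 1.0]]
print(interpolate_cells(cells[0], cells[1], 0.25))
print(len(augment(cells, n_new=5)))
```

Because each synthetic profile lies between two observed profiles, this kind of augmentation densifies under-represented regions of the data but cannot introduce expression patterns that were never observed.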

5.3. Computational requirements

Currently, transformer-based single-cell language models for single-cell multi-omics research are still in their early stages. Future work may involve incorporating more omics data in the pre-training phase to study single-cell multimodal tasks. However, incorporating more omics data further enlarges the scale of the data and raises computational costs. Recently, the combination of recurrent neural networks and transformers has reduced computational costs by speeding up the training of transformers (31, ). This method could be considered for application in single-cell language models. In addition, the parallel computing capabilities of transformers still face challenges. Although the attention weights for all positions can be computed together as matrix products, their number grows quadratically with sequence length, so long single-cell sequences strain memory and limit how much can be processed in parallel. When processing batch data, the sequence lengths of different single-cell samples may vary, which requires padding or bucketing and increases the complexity of parallel computing. In the future, improving the parallel computing capabilities of single-cell language models may become increasingly critical.
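
The batching issue caused by varying sequence lengths is commonly handled by padding each sequence to the longest one in the batch and passing a mask that tells the attention layers which positions are real. The sketch below illustrates this under stated assumptions: cells are represented as lists of integer gene token ids, id 0 is reserved for padding, and the function name pad_batch is hypothetical.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token sequences to a common length and build an attention mask.

    The mask holds 1 for real tokens and 0 for padding positions.
    """
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences]
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


# Three cells with 3, 1 and 2 expressed genes (hypothetical token ids).
batch = [[5, 3, 9], [7], [2, 4]]
tokens, attention_mask = pad_batch(batch)
print(tokens)
print(attention_mask)
```

Padding wastes computation on the masked positions, so batches are often assembled from cells of similar length to keep this overhead small.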

Table 2. Links to the code repositories of the models.

Model	Input data type	Code repository address
TransCluster	scRNA-seq	https://github.com/Danica123/TransCluster.git
scTransSort	scRNA-seq	https://github.com/jiaojiao-123/scTransSort
CIForm	scRNA-seq	https://github.com/zhanglab-wbgcas/CIForm
STGRNS	scRNA-seq	https://github.com/zhanglab-wbgcas/STGRNS
T-GEM	scRNA-seq, Transcriptomics(the pan-cancer RNA-Seq)	https://github.com/TingheZhang/TGEM
PROTRAIT	scATAC-seq	https://github.com/ZhangLab312/PROTRAIT
scMVP	scRNA-seq, scATAC-seq	https://github.com/bm2-lab/scMVP
scMoFormer	scRNA-seq, Proteomics	https://github.com/OmicsML/scMoFormer
DeepMAPS	scRNA-seq, scATAC-seq, CITE-seq	https://github.com/OSU-BMBL/deepmaps
MarsGT	scRNA-seq, scATAC-seq	https://github.com/mtduan/marsgt
scBERT	scRNA-seq	https://github.com/TencentAILabHealthcare/scBERT
scFoundation	scRNA-seq	https://github.com/biomapresearch/scFoundation
scGPT	scRNA-seq	https://github.com/bowang-lab/scGPT
CellPLM	scRNA-seq, Spatial Transcriptomics, Perturb-seq	https://github.com/OmicsML/CellPLM
tGPT	scRNA-seq	https://github.com/deeplearningplus/tGPT
Cell2Sentence	scRNA-seq	https://github.com/vandijklab/cell2sentence-ft

5.4. Interpretability

Transformer-based single-cell language models offer advantages in terms of interpretability. They can assign different weights to genes while processing sequence data and thereby identify the key features in the representation. In single-cell research, this capability is crucial for understanding complex biological processes such as gene expression, protein interactions and gene regulation (lan2024deepkegg, ). In addition, single-cell data is highly complex and diverse, and each cell potentially exhibits unique gene expression patterns (120, ). Through the self-attention mechanism, transformers indicate which key features drive a prediction. This helps biologists understand how models assign weights to different genes or cells and gain insights into gene expression patterns. Although transformer-based single-cell language models have achieved good results, the models themselves are still trained as black boxes, which inevitably affects their application in clinical settings. Therefore, improving the interpretability of single-cell language models remains a challenging research problem.
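
One simple way such attention-based gene weights can be read out is sketched below: raw attention scores are turned into weights with a row-wise softmax, and genes are ranked by the average attention they receive from all genes. This is an illustrative simplification rather than the procedure of any reviewed model. The score matrix and gene names are hypothetical, and treating attention weights as a proxy for importance is itself an assumption that does not always hold.

```python
from math import exp


def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    peak = max(row)
    exps = [exp(v - peak) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def rank_genes_by_attention(scores, genes):
    """Rank genes by the mean attention weight they receive."""
    weights = [softmax(row) for row in scores]  # row i: attention paid by gene i
    n = len(genes)
    received = [sum(row[j] for row in weights) / n for j in range(n)]
    return sorted(zip(genes, received), key=lambda item: item[1], reverse=True)


# Hypothetical raw attention scores among three genes.
raw_scores = [[0.0, 2.0, 0.0],
              [0.0, 2.0, 0.0],
              [0.0, 2.0, 0.0]]
print(rank_genes_by_attention(raw_scores, ["GeneA", "GeneB", "GeneC"]))
```

In a real model the scores would be averaged over heads, layers and many cells before ranking, and the resulting gene lists would be checked against enrichment analyses or known markers.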

5.5. Validation analysis

The single-cell language models and single-cell large language models discussed in this paper have demonstrated promising results in experiments. Some of these models have been subjected to benchmark experiments (106, ; 118, ; 119, ), which reveal that different models exhibit varying performance across different tasks. These models have been shown to be capable of integrating representations from diverse single-cell omics data. In particular, pre-trained models like scGPT have shown remarkable performance in gene function prediction tasks and achieve good results even without fine-tuning. However, the application of single-cell language models and single-cell large language models is still in its early stages and their generalizability faces certain challenges. In addition, comparisons with some of the latest methods, such as Sccross (sccross, ) and ctpredictor (ctpredictor, ), will also help to promote research progress. Therefore, we provide accessible links to the experimental code of the single-cell language models; please refer to Table 2 for details. We hope these resources can provide some assistance to researchers interested in this field.

6. Conclusion

Transformer-based single-cell language models have shown promising results in single-cell data analysis. In this review, we provide a detailed overview of single-cell language models and single-cell large language models. We summarize the methods of these models as well as their applications in downstream tasks. While these models may not achieve optimal performance on certain evaluation metrics, they hold potential for contributions and applications in single-cell research. They open new possibilities for research and applications in the field and present significant avenues for further development. We think that potential areas for improvement include refining data preprocessing methods, reducing computational costs, enhancing model interpretability and optimizing the transfer learning process. In-depth investigations into these directions will facilitate more effective utilization of various types of single-cell data. This review aims to provide an overview of single-cell language models, and we hope it promotes progress in the field of single-cell research.

Acknowledgements

This work was partially supported by the National Natural Science Foundation of China (No. 62072124), the Natural Science Foundation of Guangxi (No. 2023JJG170006), the Natural Science and Technology Innovation Development Foundation of Guangxi University (No. 2022BZRC009), the CAAI-Huawei MindSpore Open Fund (No. CAAIXSJLJJ-2022-022A), the Project of Guangxi Key Laboratory of Eye Health (No. GXYJK-202407) and the Project of Guangxi Health Commission eye and related diseases artificial intelligence screen technology key laboratory (No. GXYAI-202402).

References

[1] Wei Lan, Tianchuan Yang, Qingfeng Chen, Shichao Zhang, Yi Dong, Huiyu Zhou, and Yi Pan. Multiview subspace clustering via low-rank symmetric affinity graph. IEEE Transactions on Neural Networks and Learning Systems, 2023.
[2] Kiran Makhani, Xiuhui Yang, France Dierick, Nivetha Subramaniam, Natascha Gagnon, Talin Ebrahimian, Hao Wu, Jun Ding, and Koren K Mann. Unveiling the impact of arsenic toxicity on immune cells in atherosclerotic plaques: Insights from single-cell multi-omics profiling. bioRxiv, pages 2023–11, 2023.
[3] Wei Lan, Jianwei Chen, Mingyang Liu, Qingfeng Chen, Jin Liu, Jianxin Wang, and Yi-Ping Phoebe Chen. Deep imputation bi-stochastic graph regularized matrix factorization for clustering single-cell rna-sequencing data. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2024.
[4] Tim Stuart and Rahul Satija. Integrative single-cell analysis. Nature reviews genetics, 20(5):257–272, 2019.
[5] William B Cavnar, John M Trenkle, et al. N-gram-based text categorization. In Proceedings of SDAIR-94, 3rd annual symposium on document analysis and information retrieval, pages 161–175. Las Vegas, NV, 1994.
[6] Sean R Eddy. Hidden markov models. Current opinion in structural biology, 6(3):361–365, 1996.
[7] Raphael Petegrosso, Zhuliu Li, and Rui Kuang. Machine learning and statistical methods for clustering single-cell rna-sequencing data. Briefings in bioinformatics, 21(4):1209–1223, 2020.
[8] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.
[9] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[10] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
[11] Zhigang Dai, Bolun Cai, Yugeng Lin, and Junying Chen. Up-detr: Unsupervised pre-training for object detection with transformers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1601–1610, 2021.
[12] Zhisheng Wang, Zihan Deng, Fenglin Liu, Yixing Huang, Haijun Yu, and Junning Cui. Osnet & mneto: Two types of general reconstruction architectures for linear computed tomography in multi-scenarios. arXiv preprint arXiv:2309.11858, 2023.
[13] Andy T Liu, Shang-Wen Li, and Hung-yi Lee. Tera: Self-supervised learning of transformer encoder representation for speech. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:2351–2366, 2021.
[14] Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.
[15] Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020.
[16] Tom Young, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria. Recent trends in deep learning based natural language processing. IEEE Computational Intelligence Magazine, 13(3):55–75, 2018.
[17] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018.
[18] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[19] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R Salakhutdinov, and Quoc V Le. Xlnet: Generalized autoregressive pretraining for language understanding. Advances in neural information processing systems, 32, 2019.
[20] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
[21] Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tieyan Liu. On layer normalization in the transformer architecture. In International Conference on Machine Learning, pages 10524–10533. PMLR, 2020.
[22] Kan Wu, Houwen Peng, Minghao Chen, Jianlong Fu, and Hongyang Chao. Rethinking and improving relative position encoding for vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10033–10041, 2021.
[23] Oludare Isaac Abiodun, Aman Jantan, Abiodun Esther Omolara, Kemi Victoria Dada, Abubakar Malah Umar, Okafor Uchenwa Linus, Humaira Arshad, Abdullahi Aminu Kazaure, Usman Gana, and Muhammad Ubale Kiru. Comprehensive review of artificial neural network applications to pattern recognition. IEEE access, 7:158820–158846, 2019.
[24] Tao Song, Huanhuan Dai, Shuang Wang, Gan Wang, Xudong Zhang, Ying Zhang, and Linfang Jiao. Transcluster: A cell-type identification method for single-cell rna-seq data using deep learning based on transformer. Frontiers in Genetics, 13:1038919, 2022.
[25] Trevor Hastie, Robert Tibshirani, and Jerome H Friedman. The elements of statistical learning: data mining, inference, and prediction, volume 2. Springer, 2009.
[26] Linfang Jiao, Gan Wang, Huanhuan Dai, Xue Li, Shuang Wang, and Tao Song. sctranssort: Transformers for intelligent annotation of cell types by gene embeddings. Biomolecules, 13(4):611, 2023.
[27] Jing Xu, Aidi Zhang, Fang Liu, Liang Chen, and Xiujun Zhang. Ciform as a transformer-based model for cell-type annotation of large-scale single-cell rna-seq data. Briefings in Bioinformatics, page bbad195, 2023.
[28] Jing Xu, Aidi Zhang, Fang Liu, and Xiujun Zhang. Stgrns: an interpretable transformer-based method for inferring gene regulatory networks from single-cell transcriptomic data. Bioinformatics, 39(4):btad165, 2023.
[29] Ting-He Zhang, Md Musaddaqul Hasib, Yu-Chiao Chiu, Zhi-Feng Han, Yu-Fang Jin, Mario Flores, Yidong Chen, and Yufei Huang. Transformer for gene expression modeling (t-gem): An interpretable deep learning model for gene expression-based phenotype predictions. Cancers, 14(19):4763, 2022.
[30] Zixuan Wang, Yongqing Zhang, Yun Yu, Junming Zhang, Yuhang Liu, and Quan Zou. A unified deep learning framework for single-cell atac-seq analysis based on ProdDep transformer encoder. International Journal of Molecular Sciences, 24(5):4784, 2023.
[31] Chongyue Zhao, Zhongli Xu, Xinjun Wang, Kong Chen, Heng Huang, and Wei Chen. Transformer enables reference free and unsupervised analysis of spatial transcriptomics. bioRxiv, pages 2022–08, 2022.
[32] Gaoyang Li, Shaliu Fu, Shuguang Wang, Chenyu Zhu, Bin Duan, Chen Tang, Xiaohan Chen, Guohui Chuai, Ping Wang, and Qi Liu. A deep generative model for multi-view profiling of single-cell rna-seq and atac-seq data. Genome biology, 23(1):20, 2022.
[33] Wenzhuo Tang, Hongzhi Wen, Renming Liu, Jiayuan Ding, Wei Jin, Yuying Xie, Hui Liu, and Jiliang Tang. Single-cell multimodal prediction via transformers. arXiv preprint arXiv:2303.00233, 2023.
[34] Anjun Ma, Xiaoying Wang, Cankun Wang, Jingxian Li, Tong Xiao, Juexing Wang, Yang Li, Yuntao Liu, Yuzhou Chang, Duolin Wang, et al. Deepmaps: Single-cell biological network inference using heterogeneous graph transformer. bioRxiv, pages 2021–10, 2021.
[35] Xiaoying Wang, Maoteng Duan, Jingxian Li, Anjun Ma, Gang Xin, Dong Xu, Zihai Li, Bingqiang Liu, and Qin Ma. Marsgt: Multi-omics analysis for rare population inference using single-cell graph transformer. Nature Communications, 15(1):338, 2024.
[36] Fan Yang, Wenchuan Wang, Fang Wang, Yuan Fang, Duyu Tang, Junzhou Huang, Hui Lu, and Jianhua Yao. scbert as a large-scale pretrained deep language model for cell type annotation of single-cell rna-seq data. Nature Machine Intelligence, 4(10):852–866, 2022.
[37] Quan Zou, Pengwei Xing, Leyi Wei, and Bin Liu. Gene2vec: gene subsequence embedding for prediction of mammalian n6-methyladenosine sites from mrna. Rna, 25(2):205–218, 2019.
[38] Minsheng Hao, Jing Gong, Xin Zeng, Chiming Liu, Yucheng Guo, Xingyi Cheng, Taifeng Wang, Jianzhu Ma, Le Song, and Xuegong Zhang. Large scale foundation model on single-cell transcriptomics. bioRxiv, pages 2023–05, 2023.
[39] Haotian Cui, Chloe Wang, Hassaan Maan, and Bo Wang. scgpt: Towards building a foundation model for single-cell multi-omics using generative ai. bioRxiv, 2023.
[40] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.
[41] Hongzhi Wen, Wenzhuo Tang, Xinnan Dai, Jiayuan Ding, Wei Jin, Yuying Xie, and Jiliang Tang. Cellplm: Pre-training of cell language model beyond single cells. bioRxiv, pages 2023–10, 2023.
[42] Wei Lan, Yi Dong, Hongyu Zhang, Chunling Li, Qingfeng Chen, Jin Liu, Jianxin Wang, and Yi-Ping Phoebe Chen. Benchmarking of computational methods for predicting circrna-disease associations. Briefings in Bioinformatics, 24(1):bbac613, 2023.
[43] Zhaoyang Huang, Xiaoyu Shi, Chao Zhang, Qiang Wang, Ka Chun Cheung, Hongwei Qin, Jifeng Dai, and Hongsheng Li. Flowformer: A transformer architecture for optical flow. In European Conference on Computer Vision, pages 668–685. Springer, 2022.
[44] Hongru Shen, Jilei Liu, Jiani Hu, Xilin Shen, Chao Zhang, Dan Wu, Mengyao Feng, Meng Yang, Yang Li, Yichen Yang, et al. Generative pretraining from large-scale transcriptomes for single-cell deciphering. iScience, 26(5), 2023.
[45] Daniel Levine, Syed Asad Rizvi, Sacha Lévy, Nazreen Pallikkavaliyaveetil, Ruiming Wu, Insu Han, Zihe Zheng, Antonio Henrique de Oliveira Fonseca, Xingyu Chen, Sina Ghadermarzi, et al. Cell2sentence: Teaching large language models the language of biology. bioRxiv, pages 2023–09, 2023.
[46] Katarzyna Tomczak, Patrycja Czerwińska, and Maciej Wiznerowicz. Review the cancer genome atlas (tcga): an immeasurable source of knowledge. Contemporary Oncology/Współczesna Onkologia, 2015(1):68–77, 2015.
[47] Tanya Barrett, Stephen E Wilhite, Pierre Ledoux, Carlos Evangelista, Irene F Kim, Maxim Tomashevsky, Kimberly A Marshall, Katherine H Phillippy, Patti M Sherman, Michelle Holko, et al. Ncbi geo: archive for functional genomics data sets—update. Nucleic acids research, 41(D1):D991–D995, 2012.
[48] Aviv Regev, Sarah Teichmann, Orit Rozenblatt-Rosen, Michael Stubbington, Kristin Ardlie, Ido Amit, Paola Arlotta, Gary Bader, Christophe Benoist, Moshe Biton, et al. The human cell atlas white paper. arXiv preprint arXiv:1810.05192, 2018.
[49] Mohammad Lotfollahi, Mohsen Naghipourfar, Malte D Luecken, Matin Khajavi, Maren Büttner, Marco Wagenstetter, Žiga Avsec, Adam Gayoso, Nir Yosef, Marta Interlandi, et al. Mapping single-cell data to reference atlases by transfer learning. Nature biotechnology, 40(1):121–130, 2022.
[50] Adam Gayoso, Romain Lopez, Galen Xing, Pierre Boyeau, Valeh Valiollah Pour Amiri, Justin Hong, Katherine Wu, Michael Jayasuriya, Edouard Mehlman, Maxime Langevin, et al. A python library for probabilistic analysis of single-cell omics data. Nature biotechnology, 40(2):163–166, 2022.
[51] Kimberly Siletti, Rebecca Hodge, Alejandro Mossi Albiach, Ka Wai Lee, Song-Lin Ding, Lijuan Hu, Peter Lönnerberg, Trygve Bakken, Tamara Casper, Michael Clark, et al. Transcriptomic diversity of cell types across the adult human brain. Science, 382(6667):eadd7046, 2023.
[52] Chenxu Zhu, Miao Yu, Hui Huang, Ivan Juric, Armen Abnousi, Rong Hu, Jacinta Lucero, M Margarita Behrens, Ming Hu, and Bing Ren. An ultra high-throughput method for single-cell joint analysis of open chromatin and transcriptome. Nature structural & molecular biology, 26(11):1063–1070, 2019.
[53] Song Chen, Blue B Lake, and Kun Zhang. High-throughput sequencing of the transcriptome and chromatin accessibility in the same cell. Nature biotechnology, 37(12):1452–1457, 2019.
[54] Xiaoping Han, Ziming Zhou, Lijiang Fei, Huiyu Sun, Renying Wang, Yao Chen, Haide Chen, Jingjing Wang, Huanna Tang, Wenhao Ge, et al. Construction of a human cell landscape at single-cell level. Nature, 581(7808):303–309, 2020.
[55] Vésteinn Thorsson, David L Gibbs, Scott D Brown, Denise Wolf, Dante S Bortone, Tai-Hsien Ou Yang, Eduard Porta-Pardo, Galen F Gao, Christopher L Plaisier, James A Eddy, et al. The immune landscape of cancer. Immunity, 48(4):812–830, 2018.
[56] Yi-Rong Peng, Karthik Shekhar, Wenjun Yan, Dustin Herrmann, Anna Sappington, Gregory S Bryman, Tavé van Zyl, Michael Tri H Do, Aviv Regev, and Joshua R Sanes. Molecular classification and comparative taxonomics of foveal and peripheral cells in primate retina. Cell, 176(5):1222–1237, 2019.
[57] GTEx Consortium. Erratum: Genetic effects on gene expression across human tissues. Nature, 553(7689):530–530, 2018.
[58] Nicholas Schaum, Jim Karkanias, Norma F Neff, Andrew P May, Stephen R Quake, Tony Wyss-Coray, Spyros Darmanis, Joshua Batson, Olga Botvinnik, Michelle B Chen, et al. Single-cell transcriptomics of 20 mouse organs creates a tabula muris: The tabula muris consortium. Nature, 562(7727):367, 2018.
[59] Yanming Li, Pingping Ren, Ashley Dawson, Hernan G Vasquez, Waleed Ageedi, Chen Zhang, Wei Luo, Rui Chen, Yumei Li, Sangbae Kim, et al. Single-cell transcriptome analysis reveals dynamic cell populations and differential gene expression patterns in control and aneurysmal human aortic tissue. Circulation, 142(14):1374–1388, 2020.
[60] Terkild B Buus, Alberto Herrera, Ellie Ivanova, Eleni Mimitou, Anthony Cheng, Ramin S Herati, Thales Papagiannakopoulos, Peter Smibert, Niels Odum, and Sergei B Koralov. Improving oligo-conjugated antibody signal in multimodal single-cell analysis. Elife, 10:e61973, 2021.
[61] Xin Shao, Haihong Yang, Xiang Zhuang, Jie Liao, Penghui Yang, Junyun Cheng, Xiaoyan Lu, Huajun Chen, and Xiaohui Fan. scdeepsort: a pre-trained cell-type annotation method for single-cell transcriptomics using deep learning with a weighted graph neural network. Nucleic acids research, 49(21):e122–e122, 2021.
[62] Jian Hu, Xiangjie Li, Gang Hu, Yafei Lyu, Katalin Susztak, and Mingyao Li. Iterative transfer learning with neural network for clustering and cell type classification in single-cell rna-seq analysis. Nature machine intelligence, 2(10):607–618, 2020.
[63] Kai Zhang, James D Hocker, Michael Miller, Xiaomeng Hou, Joshua Chiou, Olivier B Poirion, Yunjiang Qiu, Yang E Li, Kyle J Gaulton, Allen Wang, et al. A single-cell atlas of chromatin accessibility in the human genome. Cell, 184(24):5985–6001, 2021.
[64] Mauro J Muraro, Gitanjali Dharmadhikari, Dominic Grün, Nathalie Groen, Tim Dielen, Erik Jansen, Leon Van Gurp, Marten A Engelse, Francoise Carlotti, Eelco Jp De Koning, et al. A single-cell transcriptome atlas of the human pancreas. Cell systems, 3(4):385–394, 2016.
[65] Åsa Segerstolpe, Athanasia Palasantza, Pernilla Eliasson, Eva-Marie Andersson, Anne-Christine Andréasson, Xiaoyan Sun, Simone Picelli, Alan Sabirsh, Maryam Clausen, Magnus K Bjursell, et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell metabolism, 24(4):593–607, 2016.
[66] Yurong Xin, Jinrang Kim, Haruka Okamoto, Min Ni, Yi Wei, Christina Adler, Andrew J Murphy, George D Yancopoulos, Calvin Lin, and Jesper Gromada. RNA sequencing of single human islet cells reveals type 2 diabetes genes. Cell metabolism, 24(4):608–615, 2016.
[67] Jiawei Chen, Hao Xu, Wanyu Tao, Zhaoxiong Chen, Yuxuan Zhao, and Jing-Dong J Han. Transformer for one stop interpretable cell type annotation. Nature Communications, 14(1):223, 2023.
[68] Lucas Schirmer, Dmitry Velmeshev, Staffan Holmqvist, Max Kaufmann, Sebastian Werneburg, Diane Jung, Stephanie Vistnes, John H Stockley, Adam Young, Maike Steindel, et al. Neuronal vulnerability and multilineage diversity in multiple sclerosis. Nature, 573(7772):75–82, 2019.
[69] Sijin Cheng, Ziyi Li, Ranran Gao, Baocai Xing, Yunong Gao, Yu Yang, Shishang Qin, Lei Zhang, Hanqiang Ouyang, Peng Du, et al. A pan-cancer single-cell transcriptional atlas of tumor infiltrating myeloid cells. Cell, 184(3):792–809, 2021.
[70] Imre Vastrik, Peter D’Eustachio, Esther Schmidt, Geeta Joshi-Tope, Gopal Gopinath, David Croft, Bernard de Bono, Marc Gillespie, Bijay Jassal, Suzanna Lewis, et al. Reactome: a knowledge base of biologic pathways and processes. Genome biology, 8:1–13, 2007.
[71] Luz Garcia-Alonso, Christian H Holland, Mahmoud M Ibrahim, Denes Turei, and Julio Saez-Rodriguez. Benchmark and integration of resources for the estimation of human transcription factor activities. Genome research, 29(8):1363–1375, 2019.
[72] Heonjong Han, Jae-Won Cho, Sangyoung Lee, Ayoung Yun, Hyojin Kim, Dasom Bae, Sunmo Yang, Chan Yeong Kim, Muyoung Lee, Eunbeen Kim, et al. TRRUST v2: an expanded reference database of human and mouse transcriptional regulatory interactions. Nucleic acids research, 46(D1):D380–D386, 2018.
[73] Malte D Luecken, Maren Büttner, Kridsadakorn Chaichoompu, Anna Danese, Marta Interlandi, Michaela F Müller, Daniel C Strobl, Luke Zappia, Martin Dugas, Maria Colomé-Tatché, et al. Benchmarking atlas-level data integration in single-cell genomics. Nature methods, 19(1):41–50, 2022.
[74] Zhaonan Zou, Tazro Ohta, Fumihito Miura, and Shinya Oki. ChIP-Atlas 2021 update: a data-mining suite for exploring epigenomic landscapes by fully integrating ChIP-seq, ATAC-seq and Bisulfite-seq data. Nucleic acids research, 50(W1):W175–W182, 2022.
[75] Britt Adamson, Thomas M Norman, Marco Jost, Min Y Cho, James K Nuñez, Yuwen Chen, Jacqueline E Villalta, Luke A Gilbert, Max A Horlbeck, Marco Y Hein, et al. A multiplexed single-cell CRISPR screening platform enables systematic dissection of the unfolded protein response. Cell, 167(7):1867–1882, 2016.
[76] Atray Dixit, Oren Parnas, Biyu Li, Jenny Chen, Charles P Fulco, Livnat Jerby-Arnon, Nemanja D Marjanovic, Danielle Dionne, Tyler Burks, Raktima Raychowdhury, et al. Perturb-seq: dissecting molecular circuits with scalable single-cell RNA profiling of pooled genetic screens. Cell, 167(7):1853–1866, 2016.
[77] Thomas M Norman, Max A Horlbeck, Joseph M Replogle, Alex Y Ge, Albert Xu, Marco Jost, Luke A Gilbert, and Jonathan S Weissman. Exploring genetic interaction manifolds constructed from rich phenotypes. bioRxiv, page 601096, 2019.
[78] Maren Büttner, Zhichao Miao, F Alexander Wolf, Sarah A Teichmann, and Fabian J Theis. A test metric for assessing single-cell RNA-seq batch correction. Nature methods, 16(1):43–49, 2019.
[79] Peter J Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics, 20:53–65, 1987.
[80] Roberto Navigli and Mirella Lapata. An experimental study of graph connectivity for unsupervised word sense disambiguation. IEEE transactions on pattern analysis and machine intelligence, 32(4):678–692, 2009.
[81] Yuyao Song, Zhichao Miao, Alvis Brazma, and Irene Papatheodorou. Benchmarking strategies for cross-species integration of single-cell RNA sequencing data. Nature Communications, 14(1):6495, 2023.
[82] Rahul Satija, Jeffrey A Farrell, David Gennert, Alexander F Schier, and Aviv Regev. Spatial reconstruction of single-cell gene expression data. Nature biotechnology, 33(5):495–502, 2015.
[83] Ilya Korsunsky, Nghia Millard, Jean Fan, Kamil Slowikowski, Fan Zhang, Kevin Wei, Yuriy Baglaenko, Michael Brenner, Po-ru Loh, and Soumya Raychaudhuri. Fast, sensitive and accurate integration of single-cell data with Harmony. Nature methods, 16(12):1289–1296, 2019.
[84] Romain Lopez, Jeffrey Regier, Michael B Cole, Michael I Jordan, and Nir Yosef. Deep generative modeling for single-cell transcriptomics. Nature methods, 15(12):1053–1058, 2018.
[85] Kasia Zofia Kedzierska, Lorin Crawford, Ava Pardis Amini, and Alex X Lu. Assessing the limits of zero-shot foundation models in single-cell biology. bioRxiv, pages 2023–10, 2023.
[86] Lawrence Hubert and Phipps Arabie. Comparing partitions. Journal of classification, 2:193–218, 1985.
[87] Josien PW Pluim, JB Antoine Maintz, and Max A Viergever. Mutual-information-based registration of medical images: a survey. IEEE transactions on medical imaging, 22(8):986–1004, 2003.
[88] Ricard Argelaguet, Damien Arnol, Danila Bredikhin, Yonatan Deloro, Britta Velten, John C Marioni, and Oliver Stegle. MOFA+: a statistical framework for comprehensive integration of multi-modal single-cell data. Genome biology, 21:1–17, 2020.
[89] Adam Gayoso, Zoë Steier, Romain Lopez, Jeffrey Regier, Kristopher L Nazor, Aaron Streets, and Nir Yosef. Joint probabilistic modeling of single-cell multi-omic data with totalVI. Nature methods, 18(3):272–282, 2021.
[90] Wei Lan, Dehuan Lai, Qingfeng Chen, Ximin Wu, Baoshan Chen, Jin Liu, Jianxin Wang, and Yi-Ping Phoebe Chen. LDICDL: LncRNA-disease association identification based on collaborative deep learning. IEEE/ACM transactions on computational biology and bioinformatics, 19(3):1715–1723, 2020.
[91] Wei Lan, Mingyang Liu, Jianwei Chen, Jin Ye, Ruiqing Zheng, Xiaoshu Zhu, and Wei Peng. JLONMFSC: Clustering scRNA-seq data based on joint learning of non-negative matrix factorization and subspace clustering. Methods, 222:1–9, 2024.
[92] Wei Lan, Jianwei Chen, Qingfeng Chen, Jin Liu, Jianxin Wang, and Yi-Ping Phoebe Chen. Detecting cell type from single cell RNA sequencing based on deep bi-stochastic graph regularized matrix factorization. bioRxiv, pages 2022–05, 2022.
[93] Zelin Wan, Yash Mahajan, Beom Woo Kang, Terrence J Moore, and Jin-Hee Cho. A survey on centrality metrics and their network resilience analysis. IEEE Access, 9:104773–104819, 2021.
[94] Da Wei Huang, Brad T Sherman, and Richard A Lempicki. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic acids research, 37(1):1–13, 2009.
[95] Marta Paczkowska, Jonathan Barenboim, Nardnisa Sintupisut, Natalie S Fox, Helen Zhu, Diala Abd-Rabbo, Miles W Mee, Paul C Boutros, Jüri Reimand, et al. Integrative pathway enrichment analysis of multivariate omics data. Nature communications, 11(1):735, 2020.
[96] Wei Lan, Chunling Li, Qingfeng Chen, Ning Yu, Yi Pan, Yu Zheng, and Yi-Ping Phoebe Chen. LGCDA: Predicting circRNA-disease association based on fusion of local and global features. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2024.
[97] Wei Lan, Yi Dong, Qingfeng Chen, Ruiqing Zheng, Jin Liu, Yi Pan, and Yi-Ping Phoebe Chen. KGANCDA: predicting circRNA-disease associations based on knowledge graph attention network. Briefings in Bioinformatics, 23(1):bbab494, 2022.
[98] Yusuf Roohani, Kexin Huang, and Jure Leskovec. GEARS: Predicting transcriptional outcomes of novel multi-gene perturbations. bioRxiv, pages 2022–07, 2022.
[99] Mohammad Lotfollahi, Anna Klimovskaia Susmelj, Carlo De Donno, Leon Hetzel, Yuge Ji, Ignacio L Ibarra, Sanjay R Srivatsan, Mohsen Naghipourfar, Riza M Daza, Beth Martin, et al. Predicting cellular responses to complex perturbations in high-throughput screens. Molecular Systems Biology, page e11517, 2023.
[100] Mohammad Lotfollahi, F Alexander Wolf, and Fabian J Theis. scGen predicts single-cell perturbation responses. Nature methods, 16(8):715–721, 2019.
[101] Wei Lan, Xun Sun, Qingfeng Chen, Jin Ye, Xiaoshu Zhu, and Yi Pan. scIAC: clustering scATAC-seq data based on Student’s t-distribution similarity imputation and denoising autoencoder. In 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 206–211. IEEE, 2022.
[102] Mohamed Marouf, Pierre Machart, Vikas Bansal, Christoph Kilian, Daniel S Magruder, Christian F Krebs, and Stefan Bonn. Realistic in silico generation and augmentation of single-cell RNA-seq data using generative adversarial networks. Nature communications, 11(1):166, 2020.
[103] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International conference on machine learning, pages 5156–5165. PMLR, 2020.
[104] Wei Lan, Haibo Liao, Qingfeng Chen, Lingzhi Zhu, Yi Pan, and Yi-Ping Phoebe Chen. DeepKEGG: a multi-omics data integration framework with biological insights for cancer recurrence prediction and biomarker discovery. Briefings in Bioinformatics, 25(3):bbae185, 2024.
[105] Martin Treppner, Harald Binder, and Moritz Hess. Interpretable generative deep learning: an illustration with single cell gene expression data. Human Genetics, 141(9):1481–1498, 2022.
[106] Tianyu Liu, Kexing Li, Yuge Wang, Hongyu Li, and Hongyu Zhao. Evaluating the utilities of large language models in single-cell data analysis. bioRxiv, pages 2023–09, 2023.
[107] Abdel Rahman Alsabbagh, Albert Maillo Ruiz de Infante, David Gomez-Cabrero, Narsis Kiani, Sumeer Ahmad Khan, and Jesper N Tegner. Foundation models meet imbalanced single-cell data when learning cell type annotations. bioRxiv, pages 2023–10, 2023.
[108] Rebecca Boiarsky, Nalini M Singh, Alejandro Buendia, Gad Getz, and David Sontag. A deep dive into single-cell RNA sequencing foundation models. bioRxiv, pages 2023–10, 2023.
[109] Xiuhui Yang, Koren K Mann, Hao Wu, and Jun Ding. scCross: A deep generative model for unifying single-cell multi-omics with seamless integration, cross-modal generation, and in-silico exploration. bioRxiv, pages 2023–11, 2023.
[110] Zhenqi Shi and Hao Wu. CTPredictor: A comprehensive and robust framework for predicting cell types by integrating multi-scale features from single-cell Hi-C data. Computers in Biology and Medicine, 173:108336, 2024.