ILMCNet: A Deep Neural Network Model That Uses PLM to Process Features and Employs CRF to Predict Protein Secondary Structure

Genes (Basel). 2024 Oct 21;15(10):1350. doi: 10.3390/genes15101350.

Abstract

Background: Protein secondary structure prediction (PSSP) is a critical task in computational biology, pivotal for understanding protein function and advancing medical diagnostics. Recently, approaches that integrate multiple amino acid sequence features have gained significant attention in PSSP research.

Objectives: We aim to automatically extract additional features represented by evolutionary information from a large number of sequences while simultaneously incorporating positional information for more comprehensive sequence features. Additionally, we consider the interdependence between secondary structures during the prediction stage.

Methods: To this end, we propose a deep neural network model, ILMCNet, which utilizes a language model and Conditional Random Field (CRF). Protein language models (PLMs) pre-trained on sequences from multiple large databases can provide sequence features that incorporate evolutionary information. ILMCNet uses positional encoding to ensure that the input features include positional information. To better utilize these features, we propose a hybrid network architecture that employs a Transformer Encoder to enhance features and integrates a feature extraction module combining a Convolutional Neural Network (CNN) with a Bidirectional Long Short-Term Memory Network (BiLSTM). This design enables deep extraction of localized features while capturing global bidirectional information. In the prediction stage, ILMCNet employs CRF to capture the interdependencies between secondary structures.

Results: Experimental results on benchmark datasets such as CB513, TS115, NEW364, CASP11, and CASP12 demonstrate that the prediction performance of our method surpasses that of comparable approaches.

Conclusions: This study proposes a new approach to PSSP research and is expected to play an important role in other protein-related research fields, such as protein tertiary structure prediction.

Keywords: conditional random field; hybrid neural network architecture; protein secondary structure; unsupervised protein language model.

MeSH terms

  • Algorithms
  • Amino Acid Sequence / genetics
  • Computational Biology* / methods
  • Databases, Protein
  • Deep Learning
  • Neural Networks, Computer*
  • Protein Structure, Secondary*
  • Proteins* / chemistry
  • Proteins* / genetics
  • Sequence Analysis, Protein / methods

Substances

  • Proteins