Machine Learning Techniques to Infer Protein Structure and Function from Sequences: A Comprehensive Review

Methods Mol Biol. 2025;2867:79-104. doi: 10.1007/978-1-0716-4196-5_5.

Abstract

Elucidating protein structure and function is central to understanding biological processes and to drug discovery. With the exponential growth of protein sequence data, machine learning techniques have emerged as powerful tools for predicting protein characteristics from sequence alone. This review provides a comprehensive overview of the importance and application of machine learning in inferring protein structure and function. We discuss various machine learning approaches, focusing primarily on convolutional neural networks and natural language processing, and their use in predicting protein secondary and tertiary structure, residue-residue contacts, protein function, and subcellular localization. We also highlight the challenges of applying machine learning in this context, such as the availability of high-quality training datasets and the interpretability of models, and we survey recent progress in the development of more complex deep learning architectures. Overall, this review underscores the significance of machine learning in advancing our understanding of protein structure and function, and its potential to transform drug discovery and personalized medicine.

Keywords: Convolutional neural networks; Machine learning techniques; Natural language processing; Protein function; Protein sequence data; Protein structure.

Publication types

  • Review

MeSH terms

  • Computational Biology / methods
  • Databases, Protein
  • Deep Learning
  • Humans
  • Machine Learning*
  • Natural Language Processing
  • Neural Networks, Computer*
  • Protein Conformation
  • Proteins* / chemistry
  • Structure-Activity Relationship

Substances

  • Proteins