PairK: Pairwise k-mer alignment for quantifying protein motif conservation in disordered regions

Jackson C Halpin; Amy E Keating

doi:10.1002/pro.70004

Protein Sci. 2025 Jan;34(1):e70004. doi: 10.1002/pro.70004.

Authors

Jackson C Halpin¹, Amy E Keating¹ ² ³

Affiliations

¹ Department of Biology, MIT, Cambridge, Massachusetts, USA.
² Department of Biological Engineering, MIT, Cambridge, Massachusetts, USA.
³ Koch Institute for Integrative Cancer Research, Cambridge, Massachusetts, USA.

PMID: 39720898
PMCID: PMC11669117 (available on 2025-12-25)
DOI: 10.1002/pro.70004

Abstract

Protein-protein interactions are often mediated by a modular peptide recognition domain binding to a short linear motif (SLiM) in the disordered region of another protein. To understand the features of SLiMs that are important for binding and to identify motif instances that are important for biological function, it is useful to examine the evolutionary conservation of motifs across homologous proteins. However, the intrinsically disordered regions (IDRs) in which SLiMs reside evolve rapidly. Consequently, multiple sequence alignment (MSA) of IDRs often misaligns SLiMs and underestimates their conservation. We present PairK (pairwise k-mer alignment), an MSA-free method to align and quantify the relative local conservation of subsequences within an IDR. Lacking a ground truth for conservation, we tested PairK on the task of distinguishing biologically important motif instances from background motifs, under the assumption that biologically important motifs are more conserved. The method outperforms both standard MSA-based conservation scores and a modern LLM-based conservation score predictor. PairK can quantify conservation over wider phylogenetic distances than MSAs, indicating that some SLiMs are more conserved than MSA-based metrics imply. PairK is available as an open-source Python package at https://github.com/jacksonh1/pairk. It is designed to be easily adapted for use with other SLiM tools and for diverse applications.
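
To make the idea of MSA-free pairwise k-mer alignment concrete, the following is a minimal illustrative sketch, not the PairK implementation and not the pairk package API. It assumes ungapped matching, a simple identity score (the published method may use a different scoring scheme), and a z-score over all k-mers in the query IDR as the measure of relative local conservation. All function names here are hypothetical.

```python
from statistics import mean, pstdev


def best_kmer_match(query_kmer, homolog_idr):
    """Return (score, start, kmer) for the best ungapped match, or None.

    Assumption: the score is the number of identical positions.
    Ties are resolved in favour of the earliest window.
    """
    k = len(query_kmer)
    best = None
    for start in range(len(homolog_idr) - k + 1):
        candidate = homolog_idr[start:start + k]
        score = sum(a == b for a, b in zip(query_kmer, candidate))
        if best is None or score > best[0]:
            best = (score, start, candidate)
    return best  # None when the homolog IDR is shorter than k


def pairwise_kmer_alignment(query_idr, query_start, k, homolog_idrs):
    """Match one query k-mer against each homolog IDR independently."""
    query_kmer = query_idr[query_start:query_start + k]
    return {name: best_kmer_match(query_kmer, seq)
            for name, seq in homolog_idrs.items()}


def relative_conservation(query_idr, k, homolog_idrs):
    """Z-score each query k-mer's mean best-match score within the IDR."""
    scores = []
    for start in range(len(query_idr) - k + 1):
        matches = pairwise_kmer_alignment(query_idr, start, k, homolog_idrs)
        found = [m[0] for m in matches.values() if m is not None]
        scores.append(mean(found) if found else 0.0)
    if not scores:
        return []
    mu, sigma = mean(scores), pstdev(scores)
    # A flat score profile carries no relative signal, so report zeros.
    return [(s - mu) / sigma if sigma > 0 else 0.0 for s in scores]
```

Because each homolog is searched independently, a motif that has shifted position within a rapidly evolving IDR can still be paired with the query k-mer, which is the situation in which a column-based MSA score would tend to underestimate conservation.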

Keywords: conservation; intrinsically disordered proteins; multiple sequence alignment; short linear motif.

MeSH terms

Algorithms
Amino Acid Motifs*
Conserved Sequence
Intrinsically Disordered Proteins / chemistry
Proteins / chemistry
Sequence Alignment*
Sequence Analysis, Protein / methods
Software

Substances

Intrinsically Disordered Proteins
Proteins

Grants and funding

NH/NIH HHS/United States