HSMotifDiscover: identification of motifs in sequences composed of non-single-letter elements

Vinod Kumar Singh; Rohan Misra; Steven C Almo; Ulrich G Steidl; Hannes E Bülow; Deyou Zheng

doi:10.1093/bioinformatics/btac437

Bioinformatics. 2022 Aug 10;38(16):4036-4038. doi: 10.1093/bioinformatics/btac437.

Authors

Vinod Kumar Singh¹, Rohan Misra¹, Steven C Almo², Ulrich G Steidl³, Hannes E Bülow¹,⁴, Deyou Zheng¹,⁴,⁵

Affiliations

¹ Department of Genetics, Albert Einstein College of Medicine, Bronx, NY 10461, USA.
² Department of Biochemistry, Albert Einstein College of Medicine, Bronx, NY 10461, USA.
³ Department of Cell Biology, Albert Einstein College of Medicine, Bronx, NY 10461, USA.
⁴ Department of Neuroscience, Albert Einstein College of Medicine, Bronx, NY 10461, USA.
⁵ Department of Neurology, Albert Einstein College of Medicine, Bronx, NY 10461, USA.

Abstract

Summary: The functional substrings of a biopolymer sequence define the specificity of its interactions with other biomolecules and are often referred to as motifs. Computational algorithms and software have been developed extensively for finding such motifs in sequences whose individual elements are single characters, such as DNA and protein sequences. However, there are more complex scenarios in which motifs exist in non-single-letter contexts, e.g. preferred patterns of chemical modifications on proteins, DNA, RNA or polysaccharides. To search for such motifs, we describe a new method that converts the modified sequence elements to representative single-letter codes and then uses a modified Gibbs-sampling algorithm to define the position-specific scoring matrix representing the motif(s). As a proof of principle, we describe the implementation and application of an R package for discovering heparan sulfate (HS) motifs in glycan sequences, which are important in regulating protein-protein interactions. This software can be valuable for analyzing high-throughput glycoprotein binding data using microarrays with HS oligosaccharides or other biological polymers.
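
The two steps named in the summary (recoding multi-character elements as single letters, then Gibbs sampling a position-specific scoring matrix) can be illustrated with a minimal Python sketch. This is not the HSMotifDiscover implementation, which is an R package using a modified Gibbs sampler; it is a plain site sampler with a uniform background, and the function names `encode`, `build_pssm` and `gibbs_motif`, the "-" separator and the monosaccharide tokens are all assumptions made for illustration.

```python
import random
from collections import Counter

def encode(sequences, sep="-"):
    """Map each multi-character element to one letter (assumes at most 26 distinct elements)."""
    alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    codebook = {}
    encoded = []
    for seq in sequences:
        letters = []
        for token in seq.split(sep):
            if token not in codebook:
                # Assign the next unused letter in order of first appearance.
                codebook[token] = alphabet[len(codebook)]
            letters.append(codebook[token])
        encoded.append("".join(letters))
    return encoded, codebook

def build_pssm(windows, symbols, pseudocount=1.0):
    """Column-wise symbol probabilities with additive smoothing."""
    total = len(windows) + pseudocount * len(symbols)
    pssm = []
    for j in range(len(windows[0])):
        counts = Counter(w[j] for w in windows)
        pssm.append({s: (counts[s] + pseudocount) / total for s in symbols})
    return pssm

def gibbs_motif(seqs, width, iterations=200, seed=0):
    """Plain Gibbs site sampler: one motif site per sequence, uniform background.

    Needs at least two sequences, each at least `width` letters long.
    """
    rng = random.Random(seed)
    symbols = sorted(set("".join(seqs)))
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for i, seq in enumerate(seqs):
            # Build the PSSM from every sequence except the one being resampled.
            others = [s[p:p + width]
                      for k, (s, p) in enumerate(zip(seqs, starts)) if k != i]
            pssm = build_pssm(others, symbols)
            # Weight each candidate start by the probability of its window.
            weights = []
            for p in range(len(seq) - width + 1):
                w = 1.0
                for j in range(width):
                    w *= pssm[j][seq[p + j]]
                weights.append(w)
            starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    windows = [s[p:p + width] for s, p in zip(seqs, starts)]
    return starts, build_pssm(windows, symbols)

if __name__ == "__main__":
    # Illustrative glycan-like sequences; the token names are placeholders.
    glycans = [
        "GlcA-GlcNS6S-IdoA2S-GlcNS6S-GlcA",
        "GlcNAc-GlcNS6S-IdoA2S-GlcNS6S",
        "GlcNS6S-IdoA2S-GlcNS6S-GlcA-GlcNAc",
    ]
    encoded, codebook = encode(glycans)
    starts, pssm = gibbs_motif(encoded, width=3)
    print(codebook, encoded, starts)
```

The codebook returned by `encode` can be inverted to translate the letters of a discovered motif back into the original elements. Because the sampler is stochastic, different seeds may settle on different sites, so in practice one would run it several times and compare the resulting matrices.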

Availability and implementation: HSMotifDiscover is freely available as an open source R package released under an MIT license at https://github.com/bioinfoDZ/HSMotifDiscover and also available in the form of an app at https://hsmotifdiscover.shinyapps.io/HSMotifDiscover_ShinyApp/.

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

Research Support, N.I.H., Extramural

MeSH terms

Algorithms*
Amino Acid Sequence
DNA / chemistry
Proteins / chemistry
Software*

Substances

Proteins
DNA

Grants and funding

U01 CA241981/CA/NCI NIH HHS/United States