Kernel-based logistic regression model for protein sequence without vectorialization

Biostatistics. 2015 Jul;16(3):480-92. doi: 10.1093/biostatistics/kxu056. Epub 2014 Dec 22.

Abstract

Protein sequence data arise increasingly often in vaccine and infectious disease research. Such data are discrete, high-dimensional, and complex. We propose to study the impact of protein sequences on binary outcomes using a kernel-based logistic regression model, which models the effect of protein through a random effect whose variance-covariance matrix is mostly determined by a kernel function. We propose a novel, biologically motivated, profile hidden Markov model (HMM)-based mutual information (MI) kernel. Hypothesis testing can be carried out using the maximum of the score statistics and a parametric bootstrap procedure. To improve the power of the test, we propose intuitive modifications to the test statistic. We show through simulation studies that the profile HMM-based MI kernel can be substantially more powerful than competing kernels, and that the modified test statistics bring incremental gains in power. We use the proposed methods to investigate two problems from HIV-1 vaccine research: (1) identifying segments of HIV-1 envelope (Env) protein that confer resistance to neutralizing antibody and (2) identifying segments of Env that are associated with attenuation of protective vaccine effect by antibodies of isotype A in the RV144 vaccine trial.
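The testing procedure summarized above (a score statistic per kernel, the maximum over candidate kernels, and a parametric bootstrap for the null distribution) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: the function names are hypothetical, a simple exponential Hamming-distance kernel stands in for the profile HMM-based MI kernel, the candidate kernels are indexed by a hypothetical tuning parameter `rho`, and each score statistic is standardized by its own bootstrap mean and standard deviation before the maximum is taken.

```python
import numpy as np

def fit_null_logistic(X, y, n_iter=50, tol=1e-8):
    """Fit logistic regression under the null (no sequence effect) by Newton/IRLS.
    Returns the fitted null probabilities."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ beta))
        w = mu * (1.0 - mu)
        step = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (y - mu))
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return 1.0 / (1.0 + np.exp(-X @ beta))

def hamming_kernel(seqs, rho):
    """Placeholder kernel (an assumption, not the MI kernel): exp(-rho * Hamming distance)
    between aligned, equal-length sequences."""
    A = np.array([list(s) for s in seqs])
    d = (A[:, None, :] != A[None, :, :]).sum(axis=2)
    return np.exp(-rho * d)

def score_stats(y, mu, kernels):
    """Variance-component score statistic r' K r for each candidate kernel."""
    r = y - mu
    return np.array([r @ K @ r for K in kernels])

def max_score_test(X, y, kernels, n_boot=200, seed=0):
    """Maximum of standardized score statistics with a parametric bootstrap p-value."""
    rng = np.random.default_rng(seed)
    mu0 = fit_null_logistic(X, y)
    obs = score_stats(y, mu0, kernels)
    boot = np.empty((n_boot, len(kernels)))
    for b in range(n_boot):
        # Simulate outcomes from the fitted null model, then refit and rescore.
        y_b = rng.binomial(1, mu0).astype(float)
        boot[b] = score_stats(y_b, fit_null_logistic(X, y_b), kernels)
    center, scale = boot.mean(axis=0), boot.std(axis=0)
    t_obs = np.max((obs - center) / scale)
    t_boot = np.max((boot - center) / scale, axis=1)
    p_value = (1 + np.sum(t_boot >= t_obs)) / (1 + n_boot)
    return t_obs, p_value

# Toy data: random sequences and outcomes generated with no sequence effect.
rng = np.random.default_rng(1)
seqs = ["".join(rng.choice(list("ACDE"), size=8)) for _ in range(40)]
X = np.ones((40, 1))  # intercept-only null model
y = rng.binomial(1, 0.5, size=40).astype(float)
kernels = [hamming_kernel(seqs, rho) for rho in (0.25, 0.5, 1.0, 2.0)]
t_obs, p = max_score_test(X, y, kernels, n_boot=200)
print(f"max standardized score = {t_obs:.3f}, bootstrap p-value = {p:.3f}")
```

Taking the maximum over a grid of kernels addresses the situation named in the keywords as the Davies problem, where a kernel parameter is not identifiable under the null hypothesis; the bootstrap then calibrates the maximum rather than any single score statistic.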

Keywords: Davies problem; Kernel methods; Maximum of score statistics.

Publication types

  • Research Support, N.I.H., Intramural
  • Research Support, Non-U.S. Gov't
  • Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

  • AIDS Vaccines / genetics
  • AIDS Vaccines / immunology
  • Antibodies, Neutralizing / immunology
  • Biostatistics
  • Computer Simulation
  • HIV Antibodies / immunology
  • HIV-1 / genetics
  • HIV-1 / immunology
  • Humans
  • Immunoglobulin A / immunology
  • Immunoglobulin G / immunology
  • Logistic Models*
  • Markov Chains
  • Models, Statistical
  • Sequence Analysis, Protein / statistics & numerical data*
  • env Gene Products, Human Immunodeficiency Virus / genetics
  • env Gene Products, Human Immunodeficiency Virus / immunology

Substances

  • AIDS Vaccines
  • Antibodies, Neutralizing
  • HIV Antibodies
  • Immunoglobulin A
  • Immunoglobulin G
  • env Gene Products, Human Immunodeficiency Virus