Proportion-based normalizations outperform compositional data transformations in machine learning applications

Microbiome. 2024 Mar 5;12(1):45. doi: 10.1186/s40168-023-01747-z.

Abstract

Background: Normalization, as a pre-processing step, can significantly affect the resolution of machine learning analysis for microbiome studies, and many normalization schemes are available to choose from. In this study, we examined compositionally aware algorithms including the additive log ratio (alr), the centered log ratio (clr), and a recent evolution of the isometric log ratio (ilr) in the form of balance trees made with the PhILR R package. We also examined compositionally naïve approaches: raw count tables and several transformations based on relative abundance, namely proportions, the Hellinger transformation, and a transformation based on the logarithm of proportions (which we call "lognorm").
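To make the distinction between these transformation families concrete, the following is a minimal sketch of four of them applied to one sample's count vector, using their standard textbook definitions. It is not the authors' code. The pseudocount of 0.5, the choice of the last taxon as the alr reference, and all function names are assumptions made for illustration. The ilr (PhILR) and "lognorm" transformations are omitted because this abstract does not give their exact definitions.

```python
import math

# Assumption: a pseudocount is added before taking logs, because zero
# counts make log-ratio transforms undefined. The value 0.5 is illustrative.
PSEUDOCOUNT = 0.5


def proportions(counts):
    """Relative abundance: each count divided by the sample's read depth."""
    total = sum(counts)
    return [c / total for c in counts]


def hellinger(counts):
    """Hellinger transformation: square root of the relative abundances."""
    return [math.sqrt(p) for p in proportions(counts)]


def clr(counts, pseudocount=PSEUDOCOUNT):
    """Centered log ratio: log counts minus the mean log count,
    i.e. the log of each part over the geometric mean of all parts."""
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def alr(counts, ref=-1, pseudocount=PSEUDOCOUNT):
    """Additive log ratio: log of each part over one reference part.
    The reference part itself is dropped from the output."""
    logs = [math.log(c + pseudocount) for c in counts]
    ref_idx = ref % len(logs)
    return [v - logs[ref_idx] for i, v in enumerate(logs) if i != ref_idx]


# Hypothetical sample with four taxa, one of them unobserved.
sample = [120, 30, 0, 50]

print("proportions:", proportions(sample))
print("hellinger:  ", hellinger(sample))
print("clr:        ", clr(sample))
print("alr:        ", alr(sample))
```

In this sketch, proportions and Hellinger only rescale each sample by its read depth, whereas clr and alr express every taxon relative to other taxa in the same sample. That is the sense in which the latter are "compositionally aware".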

Results: In our evaluation, we used 65 metadata variables culled from four publicly available datasets at the amplicon sequence variant (ASV) level with a random forest machine learning algorithm. We found that different common pre-processing steps in the creation of the balance trees made very little difference in overall performance. Overall, the compositionally aware data transformations such as alr, clr, and ilr (PhILR) generally performed slightly worse than, or only as well as, compositionally naïve transformations. However, relative abundance-based transformations outperformed most other transformations by a small but reliably statistically significant margin.

Conclusions: Our results suggest that, when preparing data for machine learning, minimizing the complexity of transformations while correcting for read depth may be generally preferable to more sophisticated, but more complex, transformations that attempt to better correct for compositionality. Video Abstract.

Keywords: Compositional data; High-throughput nucleotide sequencing; Machine learning; Metagenomics; Normalization; PhILR; Random forest; Statistical data interpretation; Transformation.

Publication types

  • Video-Audio Media

MeSH terms

  • Algorithms*
  • Machine Learning
  • Microbiota* / genetics