SEMbap: Bow-free covariance search and data de-correlation

Mario Grassi; Barbara Tarantino

doi:10.1371/journal.pcbi.1012448

PLoS Comput Biol. 2024 Sep 11;20(9):e1012448. doi: 10.1371/journal.pcbi.1012448. eCollection 2024 Sep.

Authors

Mario Grassi¹, Barbara Tarantino¹

Affiliation

¹ Department of Brain and Behavioral Sciences, University of Pavia, Pavia, Italy.

Abstract

Large-scale studies of gene expression are commonly influenced by biological and technical sources of expression variation, including batch effects, sample characteristics, and environmental impacts. Learning the causal relationships between observable variables may be challenging in the presence of unobserved confounders, and many high-dimensional regression techniques may perform worse under such confounding. Controlling for unobserved confounding variables is therefore essential, and many deconfounding methods have been suggested for application in a variety of situations. The main contribution of this article is the development of a two-stage deconfounding procedure, called SEMbap(), based on Bow-free Acyclic Paths (BAP) search and developed within the framework of Structural Equation Models (SEM). In the first stage, an exhaustive search of missing edges with significant covariance is performed via Shipley d-separation tests; in the second stage, either a Constrained Gaussian Graphical Model (CGGM) is fitted or a low-dimensional representation of the bow-free edge structure is obtained via Graph Laplacian Principal Component Analysis (gLPCA). We compare four popular deconfounding methods with the BAP search approach in applications on simulated and observed expression data. In the simulations, different structures of the hidden covariance matrix have been replicated. Compared to existing methods, the BAP search algorithm is able to correctly identify hidden confounding whilst controlling the false positive rate and achieving good fitting and perturbation metrics.
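
To make the first stage more concrete, the following is a minimal Python sketch of a d-separation style search over missing edges. It is not the SEMbap() implementation (which is an R function); it only illustrates the general idea under stated assumptions: nodes are indexed in topological order, each missing edge (i, j) is tested by the partial correlation of i and j given the union of their parents (the usual Shipley basis set), p-values come from a Fisher z test, and significance is decided with a Bonferroni correction. The function names, the toy graph, the coefficients, and the Bonferroni choice are all hypothetical and chosen for illustration.

```python
import math
import random


def corr_matrix(columns):
    """Pearson correlation matrix for a list of equal-length data columns."""
    n = len(columns[0])
    centered = []
    for col in columns:
        mean = sum(col) / n
        centered.append([v - mean for v in col])
    norms = [math.sqrt(sum(v * v for v in col)) for col in centered]
    p = len(columns)
    return [[sum(a * b for a, b in zip(centered[i], centered[j]))
             / (norms[i] * norms[j]) for j in range(p)] for i in range(p)]


def partial_corr(R, i, j, cond):
    """Partial correlation of i and j given cond, via the standard recursion.

    Cost grows exponentially with len(cond); fine for small toy graphs only.
    """
    if not cond:
        return R[i][j]
    k, rest = cond[0], cond[1:]
    rij = partial_corr(R, i, j, rest)
    rik = partial_corr(R, i, k, rest)
    rjk = partial_corr(R, j, k, rest)
    return (rij - rik * rjk) / math.sqrt((1 - rik ** 2) * (1 - rjk ** 2))


def fisher_z_pvalue(r, n, k):
    """Two-sided p-value for a partial correlation with k conditioning variables."""
    z = math.atanh(r) * math.sqrt(n - k - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def missing_edge_tests(columns, parents):
    """Test every non-adjacent pair (i < j) given the union of their parents."""
    n = len(columns[0])
    R = corr_matrix(columns)
    results = {}
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            if i in parents[j] or j in parents[i]:
                continue  # edge present in the DAG, nothing to test
            cond = sorted((set(parents[i]) | set(parents[j])) - {i, j})
            r = partial_corr(R, i, j, cond)
            results[(i, j)] = fisher_z_pvalue(r, n, len(cond))
    return results


# Toy data (assumption): DAG 0 -> 1, 0 -> 2, 2 -> 3, plus a hidden
# confounder h acting on nodes 1 and 2 that the DAG does not represent.
rng = random.Random(0)
n = 500
x0, x1, x2, x3 = [], [], [], []
for _ in range(n):
    h = rng.gauss(0, 1)
    a = rng.gauss(0, 1)
    b = 0.8 * a + h + rng.gauss(0, 1)
    c = 0.8 * a + h + rng.gauss(0, 1)
    d = 0.8 * c + rng.gauss(0, 1)
    x0.append(a)
    x1.append(b)
    x2.append(c)
    x3.append(d)

parents = {0: [], 1: [0], 2: [0], 3: [2]}
results = missing_edge_tests([x0, x1, x2, x3], parents)

# Bonferroni threshold (assumption; other corrections could be used instead).
alpha = 0.05
flagged = [pair for pair, pval in results.items() if pval < alpha / len(results)]
print(flagged)
```

In this toy setting the pair (1, 2) shares the hidden confounder, so its covariance given node 0 should be flagged as a candidate bidirected (bow-free) edge. The second stage, fitting a CGGM or extracting gLPCA components from the flagged edges, is not covered by this sketch.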

Copyright: © 2024 Grassi, Tarantino. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

MeSH terms

Algorithms*
Computational Biology* / methods
Computer Simulation
Correlation of Data
Gene Expression Profiling / methods
Gene Expression Profiling / statistics & numerical data
Humans
Models, Statistical
Normal Distribution
Principal Component Analysis

Grants and funding

The author(s) received no specific funding for this work.