
Supplement

Bayesian nonparametric mixtures of categorical directed graphs for heterogeneous causal inference

Federico Castelletti and Laura Ferrini
Department of Statistical Sciences Università Cattolica del Sacro Cuore Milan

Abstract

Quantifying causal effects of exposures on outcomes, such as a treatment and a disease respectively, is a crucial issue in medical science for the administration of effective therapies. Importantly, any related causal analysis should account for all those variables, e.g. clinical features, that can act as risk factors involved in the occurrence of a disease. In addition, the selection of targeted strategies for therapy administration requires quantifying such treatment effects at the personalized level rather than at the population level. We address these issues by proposing a methodology based on categorical Directed Acyclic Graphs (DAGs), which provide an effective tool to infer causal relationships and causal effects between variables. We account for population heterogeneity by considering a Dirichlet Process mixture of categorical DAGs, which clusters individuals into homogeneous groups characterized by common causal structures, dependence parameters and causal effects. We develop computational strategies for Bayesian posterior inference, from which a battery of causal effects at subject-specific level is recovered. Our methodology is evaluated through simulations and applied to a dataset of breast cancer patients to investigate cardiotoxic side effects that can be induced by the administered anticancer therapies.

Keywords: Breast cancer; Clustering; Personalized medicine; Subject-specific graph.

1 Introduction

1.1 Motivation and framework

Estimating cause-and-effect relations between variables is a pervasive issue in many applied domains, primarily medical science. Typically in this setting, interest lies in measuring the (direct or indirect) effect of a therapy on the progression of a disease. Our methodology is motivated by a dataset of patients diagnosed with breast cancer and treated with different oncological therapies. In this context, the protein Human Epidermal growth factor Receptor 2 (HER2) has been identified as one of the main drivers of tumor progression and growth. Recent studies have shown that therapies targeting HER2 have a strong antitumor effect, improving overall and progression-free survival. These therapies are commonly based on both monoclonal antibodies, such as trastuzumab, and anticancer drugs, in particular anthracyclines; see Slamon et al. (1987) and Katzorke et al. (2013). However, they can cause cardiotoxicity as a side effect, with consequent heart failure and loss of left ventricular contractile function (Dempke et al., 2023; Bowles et al., 2012). Establishing the (causal) effect of anti-HER2 therapies on cardiotoxicity is therefore of key importance for the administration of appropriate anticancer treatments, and for developing strategies to prevent and detect cardiotoxicity in high-risk patients. In addition, there exist several factors, such as advanced age, hypertension, valvulopathy and arrhythmia, that predispose to cardiotoxicity; see in particular Dempsey et al. (2021), Lotrionte et al. (2013) and references therein. Accordingly, a related causal-effect analysis should account for all those clinical features that may act as risk factors in the occurrence of cardiotoxicity.

When several variables are entertained, as in the framework above, one should account for possible interactions/dependencies between them in order to provide a coherent quantification of causal effects. Typically, however, such a dependence structure is unknown, or can be only partially specified from experts’ knowledge, and one needs to learn it from the data. Graphical models based on Directed Acyclic Graphs (DAGs) offer a powerful tool for this structure learning task. Importantly for our purposes, DAGs allow one to properly define the causal effect on a target variable of interest induced by a hypothetical intervention on another variable in the system. Additionally, such a causal effect can be learned from observational data alone, under suitable causal assumptions on the data generating process (Pearl, 2000). An important issue in this general framework is, however, heterogeneity, which implies that causal effects may vary across individuals as a consequence of an existing, yet unknown, clustering structure in the population. Causal-inference methodologies accounting for heterogeneity can provide a more reliable quantification of treatment effects across patients, leading to personalized strategies for the administration of therapies; see in particular Ma et al. (2015) for an overview.

1.2 Related work

Available methods for clustering multivariate (categorical) data include distance- and model-based approaches. Among the former, k-modes (Huang, 1998) is the most popular methodology; it is based on a modified version of the k-means algorithm (MacQueen, 1967) and implements a dissimilarity measure between modes of categorical variables computed across clusters. On the other hand, poLCA (polytomous variable Latent Class Analysis) (Linzer and Lewis, 2011) is a model-based method which applies to a collection of categorical random variables. In the model, each mixture component corresponds to a multivariate categorical distribution built under the assumption of independence between marginal distributions and for a known number of components in the mixture. Importantly, however, none of these methods accounts for possible dependence relationships between variables in the underlying multivariate statistical model, which instead represents a distinctive feature of our methodology. Model-based clustering methods have been extensively developed in the Bayesian literature from both a finite and infinite mixture-model perspective; see in particular Frühwirth-Schnatter et al. (2021), Argiento and De Iorio (2022) and references therein for a review and connections between the two approaches. In a multivariate Gaussian framework, infinite-mixture models based on a Dirichlet Process (DP) prior are considered by Rodríguez et al. (2011) and Castelletti and Consonni (2023) for clustering and structure learning of undirected and directed graphs respectively. Recently, Argiento et al. (2022) proposed a finite-mixture model specifically designed for multivariate unordered categorical data. This is based on a newly-introduced class of Hamming distributions which is assumed for each categorical variable marginally, and from which a joint distribution over $q$ variables is built under the assumption of (local) independence. Finally, Malsiner-Walli et al. (2024) proposed a two-layer mixture model which also allows for associations among categorical variables within each mixture component. Such dependencies arise from a second-layer mixture, assumed within each component of the main mixture model, rather than from a multivariate model with allied dependence parameter, which is instead a distinctive feature of our method for causal discovery and inference.

The literature on heterogeneous causal inference has grown extensively in recent years, particularly under the potential outcome framework (Rubin, 2005). Assuming the existence of latent sub-groups of individuals in the population, this issue is addressed through the definition and estimation of group-specific causal effects, known as Conditional Average Treatment Effects (CATEs). Machine learning methods are employed to identify the underlying clustering structure by stratification of subjects based on the levels of available covariates; see Dominici et al. (2021) for a review, Athey and Imbens (2016), Hahn et al. (2020) and Bargagli Stoffi et al. (2022), the latter proposing a method which is specifically designed to handle imperfect compliance. Recent methods also aim at improving the interpretability of causal results. An instance in this direction is the Causal Rule Ensemble (CRE) method (Lee et al., 2021), which adopts multiple trees to identify patterns of heterogeneity in the data and to ensure stability in sub-group identification. Other approaches to heterogeneous causal inference using counterfactuals are based on Bayesian nonparametric methods; see for instance Linero and Antonelli (2022) and references therein. Among these, Zorzetto et al. (2024) employ a Dependent Probit Stick-Breaking mixture model to simultaneously impute the missing outcomes (counterfactuals) and identify mutually exclusive groups, thus allowing the estimation of causal effects in the presence of population heterogeneity. Still in a potential outcome framework, Roy et al. (2016) adopt marginal structural models and implement a dependent Dirichlet Process (DP) prior for the evaluation of heterogeneous causal effects of treatments on survival outcomes; Oganisian et al. (2021) instead consider a DP mixture of zero-inflated regression models for pathological data exhibiting excesses of zeros.
DP priors for clustering and heterogeneous causal inference are also employed by Castelletti and Consonni (2023) in a multivariate framework based on Gaussian graphical models where causal effects are defined and estimated according to do-calculus theory (Pearl, 2000).

1.3 Contribution and structure of the paper

We propose a Bayesian methodology based on an infinite mixture of categorical DAGs for causal discovery and causal effect estimation in the presence of heterogeneous data. Our model allows for the presence of latent sub-groups of individuals/patients in the sample, each characterized by a possibly different causal structure and battery of related causal-effect parameters. Specifically, we assume that the multivariate distribution of the observables belongs to a Dirichlet Process (DP) mixture of categorical DAG models. Each mixture component reflects a factorization of the sampling distribution satisfying a set of conditional independencies imposed by the DAG. Under the latter, causal effects between variables are then defined according to do-calculus. With regard to the DP prior, we define a baseline measure over the space of priors on $(\mathcal{D},\bm{\theta})$, where $\mathcal{D}$ is a DAG and $\bm{\theta}$ the parameter of a categorical DAG model. We employ a constructive procedure based on local and global parameter independence to assign priors to DAG parameters, providing closed-form expressions for both the prior and posterior predictive of DAGs, as well as for the posterior distribution of DAG parameters. We then leverage these results to develop a computational scheme for posterior inference of our DP model. When applied to breast cancer data, our methodology ultimately allows us to quantify treatment effects of assigned therapies w.r.t. the occurrence of cardiotoxicity at subject-specific levels, thus leading to a more reliable decision process for the development of personalized therapies.

The rest of the paper is organized as follows. In Section 2 we provide some background material on DAGs and causal effects within a categorical modelling framework. In Section 3 we introduce our mixture model based on a DP prior, for which we detail the construction of the baseline mixing measure over the space of DAGs and allied parameters. We then describe a Markov Chain Monte Carlo (MCMC) strategy for posterior inference in Section 4. In Section 5 we evaluate our methodology relative to the tasks of clustering and causal discovery through extensive simulation studies, which include comparisons with alternative state-of-the-art methods. Section 6 is devoted to the analysis of breast cancer data and includes our causal-effect analysis to evaluate heterogeneous side effects of anti-HER2 therapies with respect to the occurrence of cardiotoxicity. We finally provide a discussion of our methodology in Section 7, together with possible future developments. Some technical results, including the computation of prior and posterior predictive distributions required by our posterior sampler, are reported in the Supplementary Material.

2 Background

2.1 Categorical DAG models

Consider a Directed Acyclic Graph (DAG) $\mathcal{D}=(V,E)$ with set of nodes $V=\{1,\dots,q\}$ and set of directed edges $E\subseteq V\times V$. For a given $\mathcal{D}$, if $(u,v)\in E$, we say that $u$ is a parent of $v$ and let $\mathrm{pa}_{\mathcal{D}}(v)$ be the set of all parents of $v$ in $\mathcal{D}$. Moreover, we let $\mathrm{fa}_{\mathcal{D}}(v):=\{v\}\cup\mathrm{pa}_{\mathcal{D}}(v)$ be the family of node $v$ in the DAG. Consider now a collection of random variables $X=(X_{1},\dots,X_{q})$, such as clinical features that can be measured on patients, and binary categorical variables indicating the administration of a therapy and the absence/presence of a disease. In the following, we assume that each $X_{j}$, $j\in V$, is categorical with set of levels $\mathcal{X}_{j}$ and let $x_{j}\in\mathcal{X}_{j}$ be one of its levels. Accordingly, $X\in{\mathcal{X}}:=\times_{j\in V}\mathcal{X}_{j}$, whose generic element is $x\in{\mathcal{X}}$. In addition, if for any $S\subset V$ we let $X_{S}=(X_{j},j\in S)$, then $X_{S}\in{\mathcal{X}}_{S}:=\times_{j\in S}\mathcal{X}_{j}$, with $x_{S}\in{\mathcal{X}}_{S}$. Under $\mathcal{D}$, the joint probability $p(x)=\text{Pr}(X_{1}=x_{1},\dots,X_{q}=x_{q})$ admits the factorization

p(x)=\prod_{j=1}^{q}\Pr(X_{j}=x_{j}\,|\,X_{\mathrm{pa}(j)}=x_{\mathrm{pa}(j)}).

(1)

For the remainder of this section we omit DAG $\mathcal{D}$ from our notation and reason conditionally on a fixed DAG. Let now $\theta_{s}^{S}=\Pr(X_{S}=s)$ , $s\in{\mathcal{X}}_{S}$ , be a marginal probability for variables in $S\subseteq V$ . Moreover, let $\theta^{j\,|\,\mathrm{pa}(j)}_{m\,|\,s}=\Pr(X_{j}=m\,|\,X_{\mathrm{pa}(j)}=s)$ be a conditional probability for $X_{j}$ given configuration (level) $s$ of $X_{\mathrm{pa}(j)}$ , with $m\in{\mathcal{X}}_{j},s\in{\mathcal{X}}_{\mathrm{pa}(j)}$ . Consider $n$ observations from $X$ , $\bm{x}^{(1)},\dots,\bm{x}^{(n)}$ , where each $\bm{x}^{(i)}=(x^{(i)}_{1},\dots,x^{(i)}_{q})^{\top}$ , and $\bm{x}^{(i)}\in{\mathcal{X}}$ , $i=1,\dots,n$ . Also, let $\bm{x}_{S}^{(i)}$ be the sub-vector of $\bm{x}^{(i)}$ with components indexed by $S\subset V$ . If we collect the $\bm{x}^{(i)}$ ’s into an $(n,q)$ data matrix $\bm{X}$ , then the likelihood function can be written as

\begin{aligned}p(\bm{X}\,|\,\bm{\theta})&=\prod_{i=1}^{n}\left\{\prod_{x\in{\mathcal{X}}}\left\{p(\bm{x}^{(i)}\,|\,\bm{\theta})\right\}^{\mathbbm{1}\{\bm{x}^{(i)}=x\}}\right\}\\ &=\prod_{j=1}^{q}\left\{\prod_{s\in{\mathcal{X}}_{\mathrm{pa}(j)}}\left\{\prod_{m\in\mathcal{X}_{j}}\left\{\theta^{j\,|\,\mathrm{pa}(j)}_{m\,|\,s}\right\}^{n^{\mathrm{fa}(j)}_{(m,s)}}\right\}\right\},\end{aligned}

(2)

now emphasizing the dependence on the DAG-parameter $\bm{\theta}$ (corresponding to the collection of conditional probabilities in the equation) and where $n^{\mathrm{fa}(j)}_{(m,s)}=\sum_{i=1}^{n}\mathbbm{1}\left\{\bm{x}^{(i)}_{\mathrm{fa}(j)}=(m,s)\right\}$ is the number of observations for which $X_{\mathrm{fa}(j)}=(m,s)$. See also Castelletti et al. (2024) for further notation on categorical DAG models.
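As an illustration of the second line of Equation (2), the following minimal Python sketch accumulates the family counts $n^{\mathrm{fa}(j)}_{(m,s)}$ and evaluates the log-likelihood from them. The function names and the data layout (rows as tuples, `parents` as a dictionary from node to parent tuple, `theta[j][s][m]` for the conditional probabilities) are assumptions made for this example and are not part of the original methodology.

```python
import math
from collections import Counter


def family_counts(data, parents):
    """Count n^{fa(j)}_{(m,s)}: rows with X_j = m and X_pa(j) = s, for every node j."""
    counts = {j: Counter() for j in parents}
    for row in data:
        for j, pa in parents.items():
            s = tuple(row[u] for u in pa)  # parent configuration of node j
            counts[j][(row[j], s)] += 1
    return counts


def log_likelihood(data, parents, theta):
    """Log of Equation (2); theta[j][s][m] = Pr(X_j = m | X_pa(j) = s)."""
    counts = family_counts(data, parents)
    return sum(
        n * math.log(theta[j][s][m])
        for j, c in counts.items()
        for (m, s), n in c.items()
    )
```

Only configurations that actually occur in the data contribute, since unobserved configurations have a zero exponent in (2).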

2.2 Causal effects for categorical DAGs

For a given collection of random variables whose multivariate distribution factorizes according to a DAG, we now focus on the causal effect of an intervention on $X_{h}$ , $h\in V$ , on a response variable of interest, say $X_{j}:=Y$ , $j\neq h$ . In practice, such an intervention corresponds to assigning a treatment to an individual, equivalently fixing $X_{h}=\tilde{x}$ , where $X_{h}$ is typically an exposure of $Y$ , and this action can be denoted using Pearl’s do-operator $\textnormal{do}(X_{h}=\tilde{x})$ (Pearl, 2003). This implies a change in the observational distribution (1), leading to the so-called post-intervention distribution

p(x\,|\,\textnormal{do}(X_{h}=\tilde{x}))=\begin{cases}\prod\limits_{j\neq h}p\big{(}X_{j}=x_{j}\,|\,X_{\mathrm{pa}(j)}=x_{\mathrm{pa}(j)}\big{)}&\textrm{if}\ X_{h}=\tilde{x}\\ \,\,0&\textrm{otherwise}.\end{cases}

(3)

Assuming for simplicity that both $X_{h}$ and $Y$ are binary variables with levels in $\{0,1\}$, the causal effect of $\textnormal{do}(X_{h}=\tilde{x})$ on $Y$ can be defined as

c_{y,h}\vcentcolon=\mathbb{E}\big{[}Y\,|\,\textnormal{do}(X_{h}=1)\big{]}-\mathbb{E}\big{[}Y\,|\,\textnormal{do}(X_{h}=0)\big{]};

(4)

see Pearl (2003). More generally, if $X_{h}$ is polytomous with levels labeled as $\{0,1,\dots,L\}$, one can define a battery of causal effects by considering $X_{h}=l$, for each $l=1,\dots,L$, in the first expectation of Equation (4). According to the definition above, $c_{y,h}$ involves a (marginal) post-intervention distribution of $Y$. However, because of (3), the latter can be expressed in terms of observational distributions, simply by conditioning and then marginalizing w.r.t. a valid adjustment set $Z\subset X$; see Pearl (2003). A common choice for such an adjustment set is $Z=X_{\mathrm{pa}(h)}$, namely the parents of $X_{h}$, leading to

\begin{aligned}c_{y,h}=&\sum_{s\in{\mathcal{X}}_{\mathrm{pa}(h)}}\mathbb{E}\big{(}Y\,|\,X_{h}=1,X_{\mathrm{pa}(h)}=s\big{)}\Pr\big{(}X_{\mathrm{pa}(h)}=s\big{)}\\ -&\sum_{s\in{\mathcal{X}}_{\mathrm{pa}(h)}}\mathbb{E}\big{(}Y\,|\,X_{h}=0,X_{\mathrm{pa}(h)}=s\big{)}\Pr\big{(}X_{\mathrm{pa}(h)}=s\big{)};\end{aligned}

(5)

see in particular Theorem 3.2.3 in Pearl (2000). Under model (2), the causal effect in (5) can be expressed as a function of the DAG parameter $\bm{\theta}$ as

\displaystyle\gamma_{y,h}(\bm{\theta})=\sum_{s\in\mathcal{X}_{\mathrm{pa}(h)}}\left\{\left(\theta^{Y\,|\,\mathrm{fa}(h)}_{1\,|\,(1,s)}-\theta^{Y\,|\,\mathrm{fa}(h)}_{1\,|\,(0,s)}\right)\theta^{\mathrm{pa}(h)}_{s}\right\}.

(6)
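To make the parent-adjustment formula concrete, here is a small Python sketch that evaluates the sum in (6) from a joint probability table. It assumes binary $X_{h}$ and $Y$ with levels in $\{0,1\}$; the function name `causal_effect`, the dictionary representation of the joint distribution and the positivity check are choices made for this illustration only.

```python
def causal_effect(joint, y, h, pa_h):
    """Parent-adjusted causal effect of binary X_h on binary Y.

    joint: dict mapping full configurations (tuples) to probabilities.
    y, h:  positions of the response and of the intervened variable.
    pa_h:  tuple with the positions of the parents of X_h.
    """
    def prob(event):
        # Marginal probability of the event {X_i = v for (i, v) in event}.
        return sum(p for x, p in joint.items()
                   if all(x[i] == v for i, v in event.items()))

    effect = 0.0
    for s in sorted({tuple(x[u] for u in pa_h) for x in joint}):
        pa_event = dict(zip(pa_h, s))
        p_s = prob(pa_event)                  # Pr(X_pa(h) = s)
        for level, sign in ((1, 1.0), (0, -1.0)):
            cond = {**pa_event, h: level}
            p_cond = prob(cond)
            if p_cond == 0.0:
                raise ValueError("positivity violated for configuration %r" % (cond,))
            # Pr(Y = 1 | X_h = level, X_pa(h) = s), weighted by Pr(X_pa(h) = s)
            effect += sign * prob({**cond, y: 1}) / p_cond * p_s
    return effect
```

Passing an empty `pa_h` returns the unadjusted difference of conditional expectations, which generally differs from the causal effect when $X_{h}$ has parents that also influence $Y$.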

3 DP mixture of categorical DAG models

In this section we introduce our Dirichlet Process (DP) mixture of categorical DAG models. This can be written using the following hierarchical structure

\begin{aligned}\bm{x}^{(i)}\,|\,\bm{\theta}_{i},\mathcal{D}_{i}&\sim p(\bm{x}^{(i)}\,|\,\bm{\theta}_{i},\mathcal{D}_{i})\\ (\bm{\theta}_{i},\mathcal{D}_{i})\,|\,H&\sim H\\ H&\sim DP(M_{0},\alpha)\end{aligned}

(7)

where $DP(M_{0},\alpha)$ denotes a DP prior with baseline $M_{0}$ and concentration parameter $\alpha$ (Ferguson, 1973), and we now emphasize the dependence on DAG $\mathcal{D}_{i}$ in the sampling distribution $p(\bm{x}^{(i)}\,|\,\bm{\theta}_{i},\mathcal{D}_{i})$ .

A property of model (7) is that it induces a partition of the observations $\bm{x}^{(1)},\dots,\bm{x}^{(n)}$ into clusters, with individuals assigned to the same cluster sharing the same DAG $\mathcal{D}$ and DAG parameter $\bm{\theta}$ . Moreover, the expected number of clusters is controlled by $\alpha$ : each observation $\bm{x}^{(i)}$ is associated with its own $(\bm{\theta}_{i},\mathcal{D}_{i})$ -parameter as $\alpha\rightarrow\infty$ ; on the contrary, if $\alpha\rightarrow 0$ , then all observations are assigned to the same cluster, leading to a standard categorical DAG model (Castelletti et al., 2024); see also Müller and Rodriguez (2013) for related properties of the DP prior.

Let now $K\leq n$ be the number of unique values among $(\bm{\theta}_{1},\mathcal{D}_{1}),\dots,(\bm{\theta}_{n},\mathcal{D}_{n})$ , and $\{\xi_{i}\}_{i=1}^{n}$ a sequence of (cluster) indicator variables such that $\xi_{i}\in\{1,\dots,K\}$ and $(\bm{\theta}_{i},\mathcal{D}_{i})=(\bm{\theta}_{\xi_{i}},\mathcal{D}_{\xi_{i}})$ . Conditionally on $\{\xi_{i}\}_{i=1}^{n}$ , observations are i.i.d. within each cluster, so that the likelihood can be written as

\begin{aligned}p\left(\bm{X}\,|\,\{\xi_{i}\}_{i=1}^{n},\{\bm{\theta}_{i}\}_{i=1}^{n},\{\mathcal{D}_{i}\}_{i=1}^{n}\right)&=\prod_{k=1}^{K}\left\{\prod_{i:\xi_{i}=k}p\left(\bm{x}^{(i)}\,|\,\bm{\theta}_{\xi_{i}},\mathcal{D}_{\xi_{i}}\right)\right\}\\ &=\prod_{k=1}^{K}p\big{(}\bm{X}^{(k)}\,|\,\bm{\theta}_{k},\mathcal{D}_{k}\big{)},\end{aligned}

(8)

with $p\big{(}\bm{X}^{(k)}\,|\,\bm{\theta}_{k},\mathcal{D}_{k}\big{)}$ as in Equation (2), and where $\bm{X}^{(k)}$ is the $(n_{k},q)$ matrix collecting all observations $\bm{x}^{(i)}$ such that $\xi_{i}=k$. Additionally, a generic count involved in the $k$-th component above will be denoted as $\prescript{}{k}{n}^{\mathrm{fa}(j)}_{(m,s)}=\sum_{i:\xi_{i}=k}\mathbbm{1}\big{\{}\bm{x}^{(i)}_{\mathrm{fa}(j)}=(m,s)\big{\}}$, which corresponds to the number of observations in cluster $k$ for which the level taken by variables $X_{\mathrm{fa}(j)}$ is equal to $(m,s)$. An alternative representation of the DP prior is based on the so-called stick-breaking process (Sethuraman, 1994). Accordingly, $H$ can be written in the form

H=\sum_{k=1}^{\infty}\omega_{k}\delta_{(\bm{\theta}_{k},\mathcal{D}_{k})}

(9)

where $\delta_{(\bm{\theta}_{k},\mathcal{D}_{k})}$ is a degenerate probability measure placing all of its mass on $(\bm{\theta}_{k},\mathcal{D}_{k})$ and $\left\{(\bm{\theta}_{k},\mathcal{D}_{k})\right\}_{k=1}^{\infty}\overset{\textnormal{iid}}{\sim}M_{0}$. Moreover, the weights $\{\omega_{k}\}_{k=1}^{\infty}$ satisfy $\omega_{1}=v_{1}$ and $\omega_{k}=v_{k}\prod_{h<k}(1-v_{h})$ for $k>1$, where $\{v_{k}\}_{k=1}^{\infty}\overset{\textnormal{iid}}{\sim}\textnormal{Beta}(1,\alpha)$, with $\alpha$ the concentration parameter of the DP prior. In the following we will assign $\alpha\sim\textnormal{Gamma}(c,d)$ following Escobar and West (1994).
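The stick-breaking construction of the weights can be sketched in a few lines of Python. The truncation level is an assumption introduced here purely to obtain a finite list (the representation in (9) is infinite), and the function name is hypothetical.

```python
import random


def stick_breaking_weights(alpha, truncation, rng=None):
    """Draw the first `truncation` stick-breaking weights of a DP with concentration alpha.

    Returns the weights and the leftover stick length, i.e. the mass
    carried by all components beyond the truncation level.
    """
    rng = rng or random.Random()
    weights, remaining = [], 1.0
    for _ in range(truncation):
        v = rng.betavariate(1.0, alpha)   # v_k ~ Beta(1, alpha)
        weights.append(v * remaining)     # omega_k = v_k * prod_{h<k} (1 - v_h)
        remaining *= 1.0 - v
    return weights, remaining
```

Small values of `alpha` concentrate most of the mass on the first few weights, while large values spread it over many components, in line with the role of $\alpha$ in controlling the expected number of clusters.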

In the next sections we detail the construction of the baseline $M_{0}$ . This is structured as $M_{0}=p(\bm{\theta}\,|\,\mathcal{D})p(\mathcal{D})$ , where the former term corresponds to a prior on the DAG parameter $\bm{\theta}$ conditionally on DAG $\mathcal{D}$ , while the latter is a marginal prior over DAGs.

3.1 Baseline on DAG parameter

Conditionally on DAG $\mathcal{D}$, we first assign a prior $p(\bm{\theta}\,|\,\mathcal{D})$. To this end, consider for each node $j\in\{1,\dots,q\}$ and $s\in{\mathcal{X}}_{\mathrm{pa}(j)}$ the parameter $\big{(}\theta^{j\,|\,\mathrm{pa}(j)}_{m\,|\,s},m\in\mathcal{X}_{j}\big{)}:=\bm{\theta}^{j\,|\,\mathrm{pa}(j)}_{s}$, corresponding to a $|\mathcal{X}_{j}|$-dimensional vector collecting conditional probabilities for variable $X_{j}$, given a configuration $s$ of its parents $X_{\mathrm{pa}(j)}$. We assign to each $\bm{\theta}^{j\,|\,\mathrm{pa}(j)}_{s}$ a Dirichlet prior with hyper-parameter $\bm{a}^{j\,|\,\mathrm{pa}(j)}_{s}=\big{(}a^{j\,|\,\mathrm{pa}(j)}_{m\,|\,s},m\in\mathcal{X}_{j}\big{)}$, written as $\bm{\theta}^{j\,|\,\mathrm{pa}(j)}_{s}\sim\textnormal{Dirichlet}(\bm{a}^{j\,|\,\mathrm{pa}(j)}_{s})$, whose p.d.f. is

\displaystyle p\big{(}\bm{\theta}^{j\,|\,\mathrm{pa}(j)}_{s}\big{)}=h\big{(}\bm{a}^{j\,|\,\mathrm{pa}(j)}_{s}\big{)}\prod_{m\in\mathcal{X}_{j}}\left\{\theta^{j\,|\,\mathrm{pa}(j)}_{m\,|\,s}\right\}^{a^{j\,|\,\mathrm{pa}(j)}_{m\,|\,s}-1},

(10)

and where $h\big{(}\bm{a}^{j\,|\,\mathrm{pa}(j)}_{s}\big{)}$ is the prior normalizing constant. Let now $\bm{\theta}^{j\,|\,\mathrm{pa}(j)}=\big{(}\bm{\theta}^{j\,|\,\mathrm{pa}(j)}_{s},s\in{\mathcal{X}}_{\mathrm{pa}(j)}\big{)}$. By assuming global and local parameter independence (Geiger and Heckerman, 1997), respectively $\bot\bot_{j}\bm{\theta}^{j\,|\,\mathrm{pa}(j)}$ and $\bot\bot_{s}\bm{\theta}^{j\,|\,\mathrm{pa}(j)}_{s}$, a joint prior on $\bm{\theta}=\big{\{}\bm{\theta}^{j\,|\,\mathrm{pa}(j)},j\in V\big{\}}$ can be written as

\displaystyle p(\bm{\theta})=\prod_{j=1}^{q}\left\{\prod_{s\in{\mathcal{X}}_{\mathrm{pa}(j)}}p\big{(}\bm{\theta}^{j\,|\,\mathrm{pa}(j)}_{s}\big{)}\right\}.

(11)

In what follows we implement the default choice $a^{j\,|\,\mathrm{pa}(j)}_{m\,|\,s}=a/|{\mathcal{X}}_{\mathrm{fa}(j)}|$ , $a>0$ , leading to the Bayesian Dirichlet Equivalent uniform (BDEu) score (Heckerman et al., 1995), which guarantees that Markov equivalent DAGs are assigned the same marginal likelihood; see also Castelletti et al. (2024).

Also notice that the resulting prior is conjugate with the likelihood (2) since, for a generic dataset $\bm{X}$, $\bm{\theta}^{j\,|\,\mathrm{pa}(j)}_{s}\,|\,\bm{X}\sim\textnormal{Dirichlet}\big{(}\bm{a}^{j\,|\,\mathrm{pa}(j)}_{s}+\bm{n}_{s}^{\mathrm{fa}(j)}\big{)}$ with $\bm{n}_{s}^{\mathrm{fa}(j)}=\big{(}n_{(m,s)}^{\mathrm{fa}(j)},m\in\mathcal{X}_{j}\big{)}$. Accordingly, the posterior of $\bm{\theta}$ is

\displaystyle p(\bm{\theta}\,|\,\bm{X})=\prod_{j=1}^{q}\left\{\prod_{s\in{\mathcal{X}}_{\mathrm{pa}(j)}}p\big{(}\bm{\theta}^{j\,|\,\mathrm{pa}(j)}_{s}\,|\,\bm{X}\big{)}\right\},

(12)

with each term corresponding to a Dirichlet p.d.f., so that direct sampling from $p(\bm{\theta}\,|\,\bm{X})$ is possible. Finally, under the same prior, a marginal (i.e. integrated w.r.t. $\bm{\theta}$) likelihood $m(\bm{X}\,|\,\mathcal{D})=\int p(\bm{X}\,|\,\bm{\theta},\mathcal{D})p(\bm{\theta}\,|\,\mathcal{D})\,d\bm{\theta}$ is available and admits the factorization

m(\bm{X}\,|\,\mathcal{D})=\prod_{j=1}^{q}m(\bm{X}_{j}\,|\,\bm{X}_{\mathrm{pa}(j)}),

(13)

with

m\big{(}\bm{X}_{j}\,|\,\bm{X}_{\mathrm{pa}(j)}\big{)}=\prod_{s\in{\mathcal{X}}_{\mathrm{pa}(j)}}\frac{h\big{(}\bm{a}^{j\,|\,\mathrm{pa}(j)}_{s}\big{)}}{h\big{(}\bm{a}^{j\,|\,\mathrm{pa}(j)}_{s}+\bm{n}^{\mathrm{fa}(j)}_{s}\big{)}}

(14)

and where $h\big{(}\bm{a}^{j\,|\,\mathrm{pa}(j)}_{s}+\bm{n}^{\mathrm{fa}(j)}_{s}\big{)}$ is the posterior normalizing constant. See also the Supplementary Material (Section 1) for full details.
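The factorized marginal likelihood in (13) and (14) can be evaluated on the log scale as in the following Python sketch. It assumes the default hyper-parameter choice $a^{j\,|\,\mathrm{pa}(j)}_{m\,|\,s}=a/|{\mathcal{X}}_{\mathrm{fa}(j)}|$ and uses the standard Dirichlet normalizing constant $h(\bm{a})=\Gamma(\sum_{m}a_{m})/\prod_{m}\Gamma(a_{m})$; function names and the data layout (`levels[j]` listing the levels of $X_{j}$) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def log_h(a):
    """Log normalizing constant of a Dirichlet p.d.f. with parameter vector a."""
    return math.lgamma(sum(a)) - sum(math.lgamma(v) for v in a)


def log_marginal_node(data, j, pa, levels, a=1.0):
    """log m(X_j | X_pa(j)) as in Equation (14), with a_{m|s} = a / |X_fa(j)|."""
    n_fa_levels = len(levels[j]) * math.prod(len(levels[u]) for u in pa)
    a_ms = a / n_fa_levels
    counts = defaultdict(Counter)          # counts[s][m] = n^{fa(j)}_{(m,s)}
    for row in data:
        counts[tuple(row[u] for u in pa)][row[j]] += 1
    prior = [a_ms] * len(levels[j])
    total = 0.0
    # Parent configurations with no observations contribute a ratio equal to 1.
    for c in counts.values():
        posterior = [a_ms + c[m] for m in levels[j]]
        total += log_h(prior) - log_h(posterior)
    return total


def log_marginal_dag(data, parents, levels, a=1.0):
    """log m(X | D) as in Equation (13): sum of node-wise terms."""
    return sum(log_marginal_node(data, j, pa, levels, a) for j, pa in parents.items())
```

Working with `math.lgamma` avoids overflow of the Gamma function for clusters with many observations.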

3.2 Baseline on DAGs

Let $\mathcal{S}_{q}$ be the (discrete) space of all DAGs with $q$ nodes. Additionally, we can restrict $\mathcal{S}_{q}$ to a subset of DAGs satisfying some structural constraints, typically edge orientations that can be postulated in advance based on the specific real-data problem. For instance, in our application to breast cancer data we regard age as an exogenous variable and forbid any incoming edge to it from other variables; conversely, we regard the occurrence of cardiotoxic side effects as a response variable and accordingly forbid any outgoing edge from it. Each DAG $\mathcal{D}=(V,E)$ in $\mathcal{S}_{q}$ can be represented through a 0-1 adjacency matrix $\bm{A}^{\mathcal{D}}$, whose $(u,v)$-element is $\bm{A}^{\mathcal{D}}_{u,v}=1$ if $(u,v)\in E$, and $0$ otherwise. Additionally, let $\bm{S}^{\mathcal{D}}$ be the adjacency matrix of the skeleton of $\mathcal{D}$, namely the undirected graph obtained from $\mathcal{D}$ by disregarding edge orientations. We assign, for each $u>v$, $\bm{S}^{\mathcal{D}}_{u,v}\,|\,\pi\overset{\textnormal{iid}}{\sim}\textnormal{Ber}(\pi)$, where $\pi$ is a prior probability of edge inclusion. We then assume hierarchically $\pi\sim\textnormal{Beta}(a,b)$, leading to the integrated prior on DAG $\mathcal{D}$

p(\mathcal{D})\propto p\big{(}\bm{S}^{\mathcal{D}}\big{)}=\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\cdot\frac{\Gamma\big{(}|\bm{S}^{\mathcal{D}}|+a\big{)}\Gamma\big{(}q(q-1)/2-|\bm{S}^{\mathcal{D}}|+b\big{)}}{\Gamma\big{(}q(q-1)/2+a+b\big{)}},

where $|\bm{S}^{\mathcal{D}}|$ is the number of non-null elements in $\bm{S}^{\mathcal{D}}$, corresponding to the number of edges in $\mathcal{D}$, and $q(q-1)/2$ is the maximum number of edges in a DAG having $q$ nodes. Sampling from the baseline over DAGs, as required by our MCMC sampler (Section 4), is possible through an acceptance-rejection algorithm over the space $\mathcal{S}_{q}$; see also the Supplementary Material (Section 2).
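Since the integrated prior depends on the DAG only through its number of edges, it is cheap to evaluate. A minimal Python sketch follows; it returns the log of $p(\bm{S}^{\mathcal{D}})$, which is proportional to (not equal to) $p(\mathcal{D})$, and the function name and default hyper-parameters are assumptions of this example.

```python
import math


def log_skeleton_prior(n_edges, q, a=1.0, b=1.0):
    """Log Beta-Bernoulli prior p(S^D) for a skeleton with n_edges edges on q nodes."""
    max_edges = q * (q - 1) // 2
    return (
        math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
        + math.lgamma(n_edges + a)
        + math.lgamma(max_edges - n_edges + b)
        - math.lgamma(max_edges + a + b)
    )
```

In a Metropolis-Hastings ratio only differences of this quantity between two DAGs are needed, so the missing normalizing constant over $\mathcal{S}_{q}$ is immaterial.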

4 Posterior inference

In this section we detail our Markov Chain Monte Carlo (MCMC) strategy for posterior inference of a DP mixture of categorical DAGs. This is based on a collapsed sampler with DAG parameters integrated out, and which accordingly approximates a marginal posterior over DAGs and cluster indicators $\xi_{1},\dots,\xi_{n}$ . Such output allows for inference about the clustering structure and/or the graphical structures associated with the clusters. In a second step, DAG parameters can be sampled conditionally on $\xi_{1},\dots,\xi_{n}$ based on Equation (12).

4.1 MCMC scheme

The structure of our baseline measure (Section 3.1) is such that we can integrate out the DAG parameter $\bm{\theta}$ , which allows for the implementation of a collapsed sampler approximating the marginal posterior of $\{\xi_{i}\}_{i=1}^{n},\alpha,\{\mathcal{D}_{k}\}_{k=1}^{K},K$ . The resulting scheme has a Gibbs-sampling structure as Algorithm 2 in Neal (2000) and implements the following steps.

4.1.1 Update of cluster indicators

The full conditional of $\xi_{i}$ is

\displaystyle p(\xi_{i}=k\,|\,\{\bm{x}^{(l)}:l\neq i,\xi_{l}=k\},\mathcal{D}_{k})\propto\begin{cases}n^{-i}_{k}\ p(\bm{x}^{(i)}\,|\,\{\bm{x}^{(l)}:l\neq i,\xi_{l}=k\},\mathcal{D}_{k})&k=1,\dots,K\\ \alpha\ p(\bm{x}^{(i)}\,|\,\mathcal{D}_{k})&k=K+1,\end{cases}

(15)

which corresponds to the probability that subject $i$ is assigned to cluster $k$, conditionally on all the observations currently assigned to that cluster, and on $\mathcal{D}_{k}$. In particular, for a non-empty cluster $k=1,\dots,K$, the full conditional is proportional to the product of two terms: the number of observations belonging to cluster $k$ (excluding observation $i$ if it is currently assigned to $k$), $n^{-i}_{k}=\sum_{l\neq i}\mathbbm{1}\{\xi_{l}=k\}$, and the posterior predictive distribution $p(\bm{x}^{(i)}\,|\,\{\bm{x}^{(l)}:l\neq i,\xi_{l}=k\},\mathcal{D}_{k})$ evaluated at $\bm{x}^{(i)}$. For the latter, we provide in the following proposition a simple closed-form expression.

Proposition 4.1 (Posterior predictive - non-empty cluster).

For a given cluster $k$ , consider the data matrix $\bm{X}^{(k)}$ collecting the $n_{k}$ observations $\big{\{}\bm{x}^{(l)}:\xi_{l}=k\big{\}}$ and an observation $\bm{x}^{(i)}$ . Then, the posterior predictive of $\bm{x}^{(i)}$ given $\{\bm{x}^{(l)}:l\neq i,\xi_{l}=k\}$ is

\displaystyle p(\bm{x}^{(i)}\,|\,\{\bm{x}^{(l)}:l\neq i,\xi_{l}=k\},\mathcal{D}_{k})=\prod_{j=1}^{q}\left\{\frac{a/|{\mathcal{X}}_{\mathrm{fa}(j)}|+\prescript{}{k}{n}^{\mathrm{fa}(j)}_{(\tilde{m}_{j},\tilde{s}_{j})}-\mathbbm{1}\{\xi_{i}=k\}}{a/|{\mathcal{X}}_{\mathrm{pa}(j)}|+\prescript{}{k}{n}^{\mathrm{pa}(j)}_{\tilde{s}_{j}}-\mathbbm{1}\{\xi_{i}=k\}}\right\}

(16)

where $\tilde{m}_{j}=\bm{x}^{(i)}_{j},\tilde{s}_{j}=\bm{x}^{(i)}_{\mathrm{pa}(j)}$ and

\prescript{}{k}{n}^{\mathrm{fa}(j)}_{(\tilde{m}_{j},\tilde{s}_{j})}=\sum_{l:\xi_{l}=k}\mathbbm{1}\big{\{}\bm{x}^{(l)}_{\mathrm{fa}(j)}=(\tilde{m}_{j},\tilde{s}_{j})\big{\}},\quad\prescript{}{k}{n}^{\mathrm{pa}(j)}_{\tilde{s}_{j}}=\sum_{l:\xi_{l}=k}\mathbbm{1}\big{\{}\bm{x}^{(l)}_{\mathrm{pa}(j)}=\tilde{s}_{j}\big{\}}.

Proof.

See Supplementary Material. ∎

The second expression of (15) considers the case of a (new) empty cluster $k=K+1$, where the DAG $\mathcal{D}_{K+1}$ is sampled from the baseline over $\mathcal{S}_{q}$. In such a case, the full conditional is proportional to the product of the concentration parameter $\alpha$ and a posterior predictive which reduces to the marginal likelihood (prior predictive) of a cluster containing subject $i$ only. A related closed-form expression is provided by the following proposition.

Proposition 4.2 (Posterior predictive - empty cluster).

For a new cluster $k=K+1$ , the posterior predictive of $\bm{x}^{(i)}$ coincides with the marginal likelihood and is given by

p(\bm{x}^{(i)}\,|\,\mathcal{D}_{k})=\prod_{j=1}^{q}\frac{1}{|{\mathcal{X}}_{j}% |}.

(17)

Proof.

See Supplementary Material. ∎
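Propositions 4.1 and 4.2 together give everything needed to evaluate the full conditional (15). The Python sketch below is one possible way to do so; it assumes that the rows passed for each cluster already exclude observation $i$ (so the indicator corrections in (16) are not needed), and all names (`predictive`, `assignment_probs`, `levels`, `new_dag`) are hypothetical.

```python
import math


def predictive(x, cluster_rows, parents, levels, a=1.0):
    """Posterior predictive (16) of x given the rows of a cluster (x itself excluded).

    With an empty cluster this reduces to the prior predictive (17).
    """
    prob = 1.0
    for j, pa in parents.items():
        n_pa_levels = math.prod(len(levels[u]) for u in pa)   # |X_pa(j)|
        n_fa_levels = n_pa_levels * len(levels[j])            # |X_fa(j)|
        s = tuple(x[u] for u in pa)
        n_pa = sum(1 for r in cluster_rows if tuple(r[u] for u in pa) == s)
        n_fa = sum(1 for r in cluster_rows
                   if tuple(r[u] for u in pa) == s and r[j] == x[j])
        prob *= (a / n_fa_levels + n_fa) / (a / n_pa_levels + n_pa)
    return prob


def assignment_probs(x, clusters, dags, new_dag, levels, alpha, a=1.0):
    """Normalized version of (15): K existing clusters followed by a new one."""
    weights = [len(rows) * predictive(x, rows, dag, levels, a)
               for rows, dag in zip(clusters, dags)]
    weights.append(alpha * predictive(x, [], new_dag, levels, a))
    total = sum(weights)
    return [w / total for w in weights]
```

A new cluster indicator can then be drawn, for example, with `random.choices(range(len(p)), weights=p)`, where `p` is the returned list.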

4.1.2 Update of $\alpha$

Under the DP prior, the full conditional distribution of $\alpha$ coincides with $p(\alpha\,|\,K)\propto p(K\,|\,\alpha)p(\alpha)$ , where in particular

p(K\,|\,\alpha)\propto c_{n}(K)\alpha^{K}\frac{\Gamma(\alpha)}{\Gamma(\alpha+n)}

is the prior on the number of clusters induced by the DP and $c_{n}(K)$ is a normalizing constant not involving $\alpha$. Sampling from $p(\alpha\,|\,K)$ can be done by augmenting the distribution through an auxiliary variable $\eta$ with $\eta\,|\,\alpha,K\sim\textnormal{Beta}(\alpha+1,n)$. It can be shown (Escobar and West, 1994) that under the prior $\alpha\sim\textnormal{Gamma}(c,d)$ the full conditional of $\alpha\,|\,K,\eta$ corresponds to a mixture of Gamma distributions, specifically

\alpha\,|\,\eta,K\sim g\cdot\textnormal{Gamma}(c+K,d-\log\eta)+(1-g)\cdot\textnormal{Gamma}(c+K-1,d-\log\eta),

where $g/(1-g)=(c+K-1)/\{n(d-\log\eta)\}$ .
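The update of $\alpha$ can be sketched in a few lines of Python. The sketch assumes the auxiliary-variable construction of Escobar and West, with $\eta\,|\,\alpha,K\sim\textnormal{Beta}(\alpha+1,n)$ and a shape-rate parametrization of the Gamma distribution. The function name `update_alpha` is a hypothetical choice for this example.

```python
import math
import random


def update_alpha(alpha: float, K: int, n: int, c: float, d: float,
                 rng: random.Random) -> float:
    """One draw of alpha | K via the auxiliary variable eta (assumed scheme)."""
    # Step 1: auxiliary variable eta | alpha, K ~ Beta(alpha + 1, n)
    eta = rng.betavariate(alpha + 1.0, n)
    rate = d - math.log(eta)                 # common rate of both components
    # Step 2: mixture weight g, from g / (1 - g) = (c + K - 1) / {n (d - log eta)}
    odds = (c + K - 1.0) / (n * rate)
    g = odds / (1.0 + odds)
    # Step 3: pick a component and sample; gammavariate takes (shape, scale)
    shape = c + K if rng.random() < g else c + K - 1.0
    return rng.gammavariate(shape, 1.0 / rate)
```

Since `random.gammavariate` is parametrized by shape and scale, the rate $d-\log\eta$ enters through its reciprocal.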

4.1.3 Update of DAGs and sampling of DAG parameters

Let $K$ be the number of clusters and $\xi_{1},\dots,\xi_{n}$ the cluster indicators, with each $\xi_{i}\in\{1,\dots,K\}$ . For a given $k\in\{1,\dots,K\}$ , let $\{\bm{x}^{(i)}:\xi_{i}=k\}$ be the set of observations currently assigned to cluster $k$ , and $\bm{X}^{(k)}$ the implied $(n_{k},q)$ data matrix; see also Equation (8). Without loss of generality, consider a generic cluster and, for simplicity, omit the index $k$ from $\mathcal{D}_{k}$ and $\bm{X}^{(k)}$ . The update of DAG $\mathcal{D}$ is performed through a Metropolis-Hastings step: a DAG $\widetilde{\mathcal{D}}$ is sampled from a proposal distribution $q(\widetilde{\mathcal{D}}\,|\,\mathcal{D})$ , conditionally on the current DAG $\mathcal{D}$ , and is accepted with probability $\alpha_{\widetilde{\mathcal{D}}}=\min\{1,r_{\widetilde{\mathcal{D}}}\}$ , with

r_{\widetilde{\mathcal{D}}}=\frac{m(\bm{X}\,|\,\widetilde{\mathcal{D}})}{m(\bm{X}\,|\,\mathcal{D})}\cdot\frac{p(\widetilde{\mathcal{D}})}{p(\mathcal{D})}\cdot\frac{q(\mathcal{D}\,|\,\widetilde{\mathcal{D}})}{q(\widetilde{\mathcal{D}}\,|\,\mathcal{D})},

(18)

and $m(\bm{X}\,|\,\widetilde{\mathcal{D}})$ as in Equation (13); see also the Supplementary Material for full details.
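The accept/reject decision based on the ratio in Equation (18) is usually evaluated on the log scale for numerical stability. The following sketch assumes that the log marginal likelihoods, log priors and log proposal probabilities have already been computed elsewhere; the function name and argument names are hypothetical.

```python
import math
import random


def mh_accept(log_m_cur: float, log_m_prop: float,
              log_prior_cur: float, log_prior_prop: float,
              log_q_forward: float, log_q_reverse: float,
              rng: random.Random) -> bool:
    """Accept the proposed DAG with probability min{1, r} as in Eq. (18).

    log_q_forward = log q(D_prop | D_cur), log_q_reverse = log q(D_cur | D_prop).
    """
    log_r = ((log_m_prop - log_m_cur)
             + (log_prior_prop - log_prior_cur)
             + (log_q_reverse - log_q_forward))
    accept_prob = math.exp(min(0.0, log_r))  # min{1, r}, computed stably
    return rng.random() < accept_prob
```

If the proposal is accepted, the chain moves to $\widetilde{\mathcal{D}}$; otherwise it stays at the current DAG $\mathcal{D}$.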

Finally, conditionally on DAGs $\mathcal{D}_{1},\dots,\mathcal{D}_{K}$ and indicators $\xi_{1},\dots,\xi_{n}$ , we can sample each DAG parameter $\bm{\theta}_{k}$ based on Equation (12), corresponding to the posterior of $\bm{\theta}_{k}$ , which is available in closed form as a product of Dirichlet densities; see also Section 3.1.
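A single factor of this product can be sampled as in the sketch below. It assumes, consistently with the counts appearing in Equation (16), that the posterior of the conditional probabilities of node $j$ given a parent configuration $s$ is Dirichlet with parameters $a/|\mathcal{X}_{\mathrm{fa}(j)}|+n^{\mathrm{fa}(j)}_{(m,s)}$ over the levels $m$; the exact form is given by Equation (12) of the paper, and the function name is hypothetical.

```python
import random
from typing import List


def sample_theta_row(counts: List[int], a: float, size_fa: int,
                     rng: random.Random) -> List[float]:
    """Draw theta^{j|pa(j)}_{.|s} from a Dirichlet (assumed posterior form).

    counts[m] : number of cluster observations with X_j = m and X_pa(j) = s
    size_fa   : |X_fa(j)|, so each prior weight is a / size_fa
    """
    # A Dirichlet draw is a vector of independent Gamma draws, normalized.
    gammas = [rng.gammavariate(a / size_fa + n_m, 1.0) for n_m in counts]
    total = sum(gammas)
    return [g / total for g in gammas]
```

Repeating the draw over all nodes $j$ and parent configurations $s$ yields one realization of $\bm{\theta}_{k}$.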

4.2 Posterior summaries

The output of the MCMC scheme is a collection of cluster indicators, DAGs and DAG parameters approximately drawn from the posterior distribution of our DP mixture model. Starting from such output, we can provide posterior summaries regarding clustering, DAG structures, as well as DAG-model parameters. Specifically, for each MCMC iteration $s=1,\dots,S$ , let $K^{(s)}$ be the number of clusters, and let $\xi_{i}^{(s)}$ , $i=1,\dots,n$ , $\mathcal{D}_{k}^{(s)}$ and $\bm{\theta}_{k}^{(s)}$ , $k=1,\dots,K^{(s)}$ , be the corresponding realizations of the three sets of parameters. For clustering purposes, we first recover an $(n,n)$ posterior similarity matrix $\bm{S}$ , with $(i,i^{\prime})$ -element $\bm{S}_{i,i^{\prime}}$ corresponding to the (estimated) posterior probability that individuals $i$ and $i^{\prime}$ are assigned to the same cluster, namely

\widehat{p}(\xi_{i}=\xi_{i^{\prime}}\,|\,\bm{X})=\frac{1}{S}\sum_{s=1}^{S}\mathbbm{1}\left\{\xi_{i}^{(s)}=\xi_{i^{\prime}}^{(s)}\right\}.

(19)

A point estimate of the clustering structure, $\widehat{\bm{c}}$ , can be recovered by assigning individuals $i$ and $i^{\prime}$ to the same cluster if $\widehat{p}(\xi_{i}=\xi_{i^{\prime}}\,|\,\bm{X})$ exceeds a given threshold, say $z=0.5$ . As an alternative, a clustering estimate can be obtained following Wade and Ghahramani (2018) as the partition minimizing the expected Variation of Information (VI); see also Section 5.
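The estimator in Equation (19) amounts to counting co-clustering events across MCMC draws. A minimal sketch follows, assuming that the draws of the cluster indicators are stored as a list of label vectors; the storage layout and the function name are hypothetical.

```python
from typing import List, Sequence


def similarity_matrix(xi_draws: Sequence[Sequence[int]]) -> List[List[float]]:
    """Posterior similarity matrix of Eq. (19).

    xi_draws[s][i] is the cluster label of subject i at MCMC iteration s.
    """
    S = len(xi_draws)
    n = len(xi_draws[0])
    counts = [[0] * n for _ in range(n)]
    for xi in xi_draws:
        for i in range(n):
            for ip in range(n):
                if xi[i] == xi[ip]:
                    counts[i][ip] += 1
    # Relative frequency of co-clustering for each pair (i, i')
    return [[c / S for c in row] for row in counts]
```

Because only equality of labels within each draw is used, the matrix is invariant to relabeling of clusters across iterations.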

From the same MCMC output, we can recover for each subject $i$ a $(q,q)$ matrix collecting estimates of the Posterior Probabilities of edge Inclusion (PPIs). For a given subject $i$ and edge $u\rightarrow v,u\neq v$ , its PPI is estimated as

\widehat{p}_{i}(u\rightarrow v\,|\,\bm{X})=\frac{1}{S}\sum_{s=1}^{S}\mathbbm{1}\left\{u\rightarrow v\in\mathcal{D}^{(s)}_{\xi_{i}^{(s)}}\right\},

(20)

corresponding to the proportion of DAGs $\mathcal{D}^{(s)}_{\xi_{i}^{(s)}}$ , $s=1,\dots,S$ , visited by the chain for subject $i$ that contain the directed edge $u\rightarrow v$ . Finally, a graph estimate at subject-specific level, say $\widehat{\mathcal{D}}_{i}$ , can be obtained by including those edges for which $\widehat{p}_{i}(u\rightarrow v\,|\,\bm{X})>z$ for $z\in(0,1)$ , e.g. $z=0.5$ .
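The subject-specific PPIs of Equation (20) and the thresholded graph estimate can be computed as in the following sketch. It assumes that each sampled DAG is stored as a set of directed edges `(u, v)` indexed by cluster label; this encoding and the function names are hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple

Edge = Tuple[int, int]


def subject_ppi(i: int, xi_draws: Sequence[Sequence[int]],
                dag_draws: Sequence[Dict[int, Set[Edge]]], q: int) -> List[List[float]]:
    """PPI matrix of Eq. (20) for subject i.

    dag_draws[s][k] is the edge set of the DAG of cluster k at iteration s.
    """
    S = len(xi_draws)
    counts = [[0] * q for _ in range(q)]
    for xi, dags in zip(xi_draws, dag_draws):
        for (u, v) in dags[xi[i]]:   # DAG of the cluster subject i belongs to
            counts[u][v] += 1
    return [[c / S for c in row] for row in counts]


def threshold_graph(ppi: List[List[float]], z: float = 0.5) -> Set[Edge]:
    """Keep the edges whose estimated PPI exceeds the threshold z."""
    return {(u, v) for u, row in enumerate(ppi) for v, p in enumerate(row) if p > z}
```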

Recall now the definition of causal effect $\gamma_{y,h}(\bm{\theta})$ provided in Equation (2.2) and assume that the intervened variable and response, $X_{h}$ and $Y$ respectively, are given so that we can omit them from the notation. A Bayesian Model Averaging (BMA) estimate of $\gamma_{i}$ , the subject-specific causal effect, for $i\in\{1,\dots,n\}$ , is given by

\widehat{\gamma}_{i}=\frac{1}{S}\sum_{s=1}^{S}\gamma_{i}^{(s)}\left(\bm{\theta}_{\xi_{i}^{(s)}}\right).

(21)

The resulting collection $\{\widehat{\gamma}_{1},\dots,\widehat{\gamma}_{n}\}$ provides estimates of causal effects at individual level, which also naturally account for DAG-model uncertainty through BMA.
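Equation (21) is an average, over MCMC draws, of the causal effect attached to the cluster that subject $i$ occupies at each iteration. The sketch below assumes that the cluster-level causal effects have already been computed from each sampled pair of DAG and parameter; the layout of `gamma_draws` and the function name are hypothetical.

```python
from typing import Dict, List, Sequence


def bma_effects(xi_draws: Sequence[Sequence[int]],
                gamma_draws: Sequence[Dict[int, float]]) -> List[float]:
    """BMA estimates of subject-specific causal effects, Eq. (21).

    gamma_draws[s][k] is the causal effect implied by (D_k, theta_k) at iteration s.
    """
    S = len(xi_draws)
    n = len(xi_draws[0])
    return [
        sum(gamma_draws[s][xi_draws[s][i]] for s in range(S)) / S
        for i in range(n)
    ]
```

Subjects that are always allocated together receive identical estimates, while subjects whose allocation varies across draws receive a weighted combination of cluster-level effects.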

5 Simulations

We conduct simulation studies to evaluate the performance of our methodology relative to the tasks of clustering and structure learning and compare it with alternative methods for clustering multivariate categorical data. We consider settings with $q=10$ nodes, number of clusters $K=2$ and sample sizes $n_{1}=n_{2}$ that range in $\{100,200,500\}$ . We generate the two DAGs $\mathcal{D}_{1}$ and $\mathcal{D}_{2}$ independently, by fixing a probability of edge inclusion $\pi=0.2$ , so that the two clusters differ in general by the dependence structure among variables. The two categorical datasets $\bm{X}^{(1)},\bm{X}^{(2)}$ are built by discretization of latent Gaussian observations as detailed in the Supplementary Material (Section 4). Importantly, discretization is based on a collection of thresholds $g_{j}\in(-\infty,+\infty)$ , that we randomly draw from a $\textnormal{Unif}(\hat{z}_{j,\alpha},\hat{z}_{j,1-\alpha})$ , where $\hat{z}_{j,\alpha}$ denotes the quantile of order $\alpha$ in the empirical distribution of latent variable $Z_{j}$ , independently across $j$ and for each $k=1,2$ ; see again our Supplementary Material. Throughout this section, $\alpha$ denotes this quantile level and not the DP concentration parameter. We consider $\alpha\in\{0.1,0.4\}$ , which implies different degrees of similarity among marginal distributions between clusters. For benchmark methods that do not consider a dependence structure between variables we expect a lower ability in recovering the true clustering when $\alpha=0.4$ , namely when marginal distributions are more similar across clusters. Finally, under each scenario, we generate $N=40$ independent replicates, each consisting of $K=2$ cluster-specific datasets.
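The thresholding step can be illustrated for a single latent variable as follows. This is only a sketch under stated assumptions: it produces a binary variable, codes level 1 as $Z_{j}\geq g_{j}$, and uses a simple order-statistic quantile. The actual generation scheme is the one detailed in the Supplementary Material, and the function names are hypothetical.

```python
import random
from typing import List, Tuple


def empirical_quantile(values: List[float], p: float) -> float:
    """Order-statistic quantile of level p (a simple convention, assumed here)."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, max(0, int(p * len(ordered))))
    return ordered[idx]


def discretize(z_column: List[float], alpha: float,
               rng: random.Random) -> Tuple[List[int], float]:
    """Binarize latent draws with a threshold g ~ Unif(z_alpha, z_{1-alpha})."""
    lo = empirical_quantile(z_column, alpha)
    hi = empirical_quantile(z_column, 1.0 - alpha)
    g = rng.uniform(lo, hi)
    return [1 if z >= g else 0 for z in z_column], g
```

With $\alpha=0.4$ the threshold is confined to a narrow central interval, so the resulting marginal frequencies are close to one half in both clusters; with $\alpha=0.1$ they can differ much more.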

5.1 Clustering

We evaluate the clustering performance of our method (DAG mixture) w.r.t. state-of-the-art approaches and specifically the Latent Class Model (LCM) (Goodman, 1974; Linzer and Lewis, 2011) and K-modes (Huang, 1998). For both methods, we input the number of clusters as $K=2$ . Additionally, to emphasize the contribution of a DAG-based model on cluster identification, we also implement a No DAG strategy, where for each group the DAG is assumed to be known and corresponds to an empty graph. Performances are assessed by comparing the true partition $\bm{c}$ with the estimated partitions $\widehat{\bm{c}}$ based on the Variation of Information (VI). Lower values of the metric correspond to better performances. In addition, we expect the scenario with $\alpha=0.1$ to be characterized by overall better performances than the one with $\alpha=0.4$ , because in the latter the difference between the two clusters is mainly due to the dependence structure among the variables. Results are summarized in the boxplots of Figure 1.

Figure 1: Simulations. Distribution (across $40$ replicates) of Variation of Information, for the two different simulation scenarios: $\alpha=0.1$ and $\alpha=0.4$ . Methods under comparison are: Latent Class Model (LCM), K-modes, No DAG, and our DP mixture of DAGs (DAG mixture).

As it appears, all methods tend to improve as the sample size $n_{k}$ grows, with our DAG mixture model clearly outperforming all the benchmarks under all scenarios.
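For reference, the Variation of Information between two partitions can be computed from their contingency table as $H(\bm{c})+H(\widehat{\bm{c}})-2I(\bm{c},\widehat{\bm{c}})$ . The sketch below uses natural logarithms, which is a choice made for this example (other bases only rescale the metric); the function name is hypothetical.

```python
import math
from collections import Counter
from typing import Sequence


def variation_of_information(c_true: Sequence[int], c_est: Sequence[int]) -> float:
    """VI between two partitions given as label vectors of equal length."""
    n = len(c_true)
    joint = Counter(zip(c_true, c_est))   # contingency table
    p = Counter(c_true)                   # block sizes of the true partition
    q = Counter(c_est)                    # block sizes of the estimated partition
    vi = 0.0
    for (u, v), n_uv in joint.items():
        p_uv = n_uv / n
        vi -= p_uv * (math.log(p_uv / (p[u] / n)) + math.log(p_uv / (q[v] / n)))
    return vi
```

The metric equals zero when the two partitions coincide up to a relabeling of the clusters.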

5.2 Structure learning

We now assess the ability of our method to recover the graphical structure underlying each cluster. To this end, we consider the Structural Hamming Distance (SHD), which represents the number of modifications (edge insertions, edge removals, edge reversals) that are needed to transform the estimated DAG $\widehat{\mathcal{D}}$ into the true DAG $\mathcal{D}$ . Specifically, we compare each subject-specific estimated DAG $\widehat{\mathcal{D}}_{i},i=1,\dots,n$ with $\mathcal{D}_{\xi_{i}}$ , where $\xi_{i}$ is the true class membership. In addition, we include the Oracle version of our method, in which the true clustering is assumed to be known, and a “one-group” naive strategy (No mixture), which instead assigns all subjects to the same cluster and therefore disregards heterogeneity. Results, for each scenario defined by $\alpha$ and $n_{k}$ , are summarized in Figure 2. The No mixture strategy, which neglects the clustering structure in the data, performs worse than the other two methods under all scenarios, and its performance worsens as the sample size $n_{k}$ increases. Conversely, the Oracle version of our method performs slightly better than our DP mixture method, a behavior which is more evident under scenario $\alpha=0.4$ , where clustering is indeed more difficult. Finally, both the Oracle and the DP mixture methods improve their performance as $n_{k}$ grows.
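The SHD between two DAGs can be computed directly from their edge sets, counting a reversed edge as a single modification. The following sketch assumes that both graphs are acyclic and encoded as sets of directed edges `(u, v)`; the encoding and the function name are hypothetical.

```python
from typing import Iterable, Tuple

Edge = Tuple[int, int]


def shd(est_edges: Iterable[Edge], true_edges: Iterable[Edge]) -> int:
    """Structural Hamming Distance between an estimated and a true DAG."""
    est, true = set(est_edges), set(true_edges)
    extra = est - true      # edges to remove or reverse
    missing = true - est    # edges to insert or obtain by reversal
    # A reversal appears once in `extra` and once in `missing`; count it once.
    reversals = {(u, v) for (u, v) in extra if (v, u) in missing}
    return len(extra) + len(missing) - len(reversals)
```

For example, an estimate with edges $0\rightarrow 1$ and $1\rightarrow 2$ compared with a true DAG with edges $1\rightarrow 0$ and $2\rightarrow 3$ requires one reversal, one removal and one insertion.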

6 Analysis of breast cancer data

6.1 Dataset and model implementation

In this section we analyse a dataset of $n=404$ women diagnosed with HER2+ breast cancer and treated with potentially cardiotoxic therapies based on monoclonal antibodies (trastuzumab) and chemotherapy drugs (anthracyclines). Variables in the dataset include: demographic and physical features, such as age, height and weight (expressed in terms of Body Mass Index, BMI, and included through a three-level categorical variable); risk factors, such as diagnosis of hypertension (HTA), dyslipidemia (DL), diabetes mellitus (DM), smoking (smoker and ex smoker); past cardiac diseases, namely cardiac insufficiency (CIprev), ischemic cardiomyopathy (ICMprev), arrhythmia (ARRprev), valvulopathy (VALVprev), valve surgery (valvsurgprev). In addition, the dataset provides information regarding treatments, antiHER2 monoclonal therapy (antiHER2) and/or anthracyclines (AC), that were administered to patients. Finally, the target variable corresponds to Cancer Therapy-Related Cardiac Dysfunction (CTRCD), a binary outcome indicating the occurrence (1) or not (0) of cardiac dysfunction. The original dataset is provided as a supplement to Piñeiro-Lamas et al. (2023) and included as supplementary material to our paper. Notably, all variables are categorical, with the exception of age and heart rate, each of which has been discretized into two dummy variables, corresponding to middle vs low, and high vs low. While in general the cardiotoxic effects of the available oncological therapies have been established in the literature, the occurrence of CTRCD can still vary substantially among patients because of both observed features (such as risk factors) and unobserved characteristics. Accordingly, it is of interest to quantify causal effects w.r.t. the occurrence of CTRCD at individual level, which is crucial for the development and administration of appropriate antiHER2 therapies.

Given the structure of the dataset, we constrain the adjacency matrix of DAGs in such a way that CTRCD can only (potentially) be a response, i.e. no outgoing edges are allowed, and treat age, BMI, smoker, ex smoker as exogenous variables, i.e. with no incoming edges from other nodes, while possible links/dependencies between them are allowed. Moreover, we assume that the absence/presence of risk factors can imply the administration of a therapy (AC, antiHER2), while the converse is not possible. We implement our mixture model by running the MCMC scheme for $S=100000$ iterations, which include a burn-in period of $10000$ draws that are discarded from the posterior analysis. With regard to hyperparameters, we fix $c=3,d=1$ in the Gamma prior on the DP precision parameter $\alpha$ . The common hyperparameter $a$ on the collection of Dirichlet priors on $\bm{\theta}^{j\,|\,\mathrm{pa}(j)}_{s}$ is instead fixed as $a=1$ , while in the hierarchical prior on DAGs we fix $a=1$ , $b=2q$ , reflecting an a priori assumption of sparsity in the graph space. To assess the convergence of our algorithm we also run two independent MCMC chains; results suggest an overall agreement in terms of clustering (evaluated through posterior similarity matrices), structure learning (based on estimated PPIs) and posterior distribution of causal-effect parameters. Results relative to all such quantities are presented and discussed in the following sections.

6.2 Clustering

We summarize the clustering structure learned by our model by building an $(n,n)$ posterior similarity matrix; see Equation (19). From the latter, we recover a point estimate of the clustering based on the minimum posterior expectation of VI (Wade and Ghahramani, 2018); see also Section 4.2. As a result, we obtain two clusters, that we label as $\widehat{\bm{c}}_{1}$ and $\widehat{\bm{c}}_{2}$ , whose sizes are $n_{1}=101$ and $n_{2}=303$ respectively. The posterior similarity matrix is represented as a heatmap in Figure 3, with individuals arranged according to the estimated clusters, specifically those assigned to $\widehat{\bm{c}}_{1}$ first and then those in $\widehat{\bm{c}}_{2}$ . The two-cluster structure is evident from the matrix, since the estimated probabilities of membership to the same group are close to one for individuals assigned to the same estimated cluster and close to zero for individuals assigned to different estimated clusters.

We then investigate differences between the estimated clusters by comparing the empirical (marginal) distribution of each variable across $\widehat{\bm{c}}_{1}$ and $\widehat{\bm{c}}_{2}$ . For each cluster, we provide a graphical representation based on a spider plot, which includes for each categorical (binary) variable the proportion (percentage values) of observations corresponding to level labeled as $1$ of the variable. These values are reported as colored points joined by lines in each graph; see Figure 4. Additionally, each plot includes the same proportions as obtained from the pooled sample, namely when no clustering is considered, that are instead represented with grey dots joined by grey lines.

While for cluster 2 (right-side plot) the cluster-proportions are almost aligned with those computed on the pooled dataset, cluster 1 presents a few peculiarities. In particular, patients included in cluster 1 are in general older, as it appears from the higher frequency associated to variable age 2, and characterized by a higher BMI. Additionally, the proportion of patients suffering from hypertension (HTA), dyslipidemia (DL), and diabetes mellitus (DM) is in general higher in cluster 1 than in both the pooled dataset and cluster 2. We emphasize that the results above capture differences between estimated clusters that are reflected in the marginal (empirical) distribution of the variables. Importantly however, differences may also emerge from the joint distribution of the variables, and specifically in the dependence (DAG) structure for patients assigned to different estimated clusters. To this end, in the next section we provide results relative to structure learning, which is carried out at subject-specific level.

6.3 Structure learning

Following Equation (20), we provide for each subject $i=1,\dots,n$ an estimate of the Posterior Probabilities of Inclusion (PPIs), that we collect in a $(q,q)$ matrix. Because of the structure of our baseline measure, we expect individuals that with high probability are assigned to the same cluster (Figure 3) to share similar dependence structures, and in particular similar PPIs. Conversely, differences in the underlying graphical structure are expected for individuals assigned to distinct estimated clusters. For two randomly chosen subjects, whose membership is estimated to be cluster 1 and cluster 2 respectively, the corresponding PPIs are reported as heatmaps in Figure 5. The underlying dependence structures are both characterized by sparsity. This is much more evident in the subject from cluster 1, where PPIs are more uniform and there are no edges whose PPI exceeds the $0.5$ threshold. By contrast, the heatmap from cluster 2 shows a few variables that are more strongly related to the outcome CTRCD, specifically AC, together with several risk factors, in particular heart rhythm, VALVprev, ARRprev. Accordingly, we expect such differences to imply heterogeneous causal effects between variables. We present some of these results in the next section.

6.4 Causal effects

The ultimate goal of our analysis is to quantify the (causal) effect of anticancer therapies on the occurrence of cardiotoxicity. In particular, patients in the study were treated with therapies based on either anthracycline (AC) or trastuzumab (antiHER2). We then consider the causal effect on the occurrence of cardiotoxicity (variable CTRCD) implied by the administration of AC and antiHER2 therapies. To this end, we recover from our MCMC output a posterior distribution for each causal-effect parameter and each subject $i=1,\dots,n$ , that we summarize through BMA estimates following Equation (21). Our results are summarized in the two scatterplots of Figure 6, each reporting BMA causal-effect estimates (y-axis) computed across individuals (x-axis) and with subjects arranged according to the estimated clustering with two groups (Section 6.2). One can appreciate the heterogeneity in the estimates, with individuals assigned to the same cluster sharing similar values, except for a few patients in each group. Interestingly, these subjects are also characterized by a higher uncertainty in cluster allocation between groups 1 and 2; see in particular the posterior similarity matrix in Figure 3. As an interesting result, AC and antiHER2 treatments in general increase the probability of CTRCD occurrence for individuals assigned to cluster 2, while the effect is less pronounced, or even null, for cluster 1. Notably, cluster 1 is characterized by older patients and by a higher prevalence of some risk factors; see in particular Figure 4. Accordingly, in such patients, the occurrence of cardiotoxicity might be due to the presence of such risk factors, which may cause cardiac diseases, rather than to the therapy itself. Therefore, the direct effect of AC and antiHER2 therapies is lower in comparison with the same estimates in cluster 2.

In addition, to emphasize the role played by population heterogeneity in causal effect estimation, we compare our results with those based on an alternative One-group naive strategy, corresponding to a standard categorical DAG model in which all individuals are assigned to the same cluster. In such a case, causal effect estimates are uniform across subjects and are included as horizontal lines in the two plots. Each resulting estimate is approximately an average of cluster-specific causal estimates, suggesting that a causal effect analysis that disregards population heterogeneity would overestimate the risk of CTRCD development for some individuals and underestimate it for others.

7 Discussion

We proposed a modeling framework for structure learning and causal inference under heterogeneity that applies to multivariate categorical data. Our methodology is based on a Dirichlet Process (DP) mixture of categorical Directed Acyclic Graphs (DAGs), which allows us to cluster subjects characterized by similar patterns of dependencies into homogeneous groups, and to provide estimates of causal effects at personalized, namely subject-specific, level. When adopted for clustering purposes, our method clearly outperforms benchmark strategies that do not account for dependence relations in the joint distribution of variables. Most importantly, our causal-effect analysis shows that approaches neglecting possible population heterogeneity can provide misleading estimates of causal effects. With regard to our application to breast cancer data, the probability of cardiotoxic side effects implied by anticancer therapies could be underestimated for several patients, with serious consequences in clinical decision making for optimal therapy administration.

A possible extension of our model could be the analysis of mixed multivariate data, which comprise both quantitative and categorical measurements. Specifically in a biomedical setting, one can collect, besides categorical clinical features, the expression levels of genes that are involved in the progression of the disease. Causal inference methods that integrate such biological information can provide a more precise quantification of causal effects for the development of personalized therapies. More generally, multivariate models that can manage mixed data would be valuable for clustering purposes too, as several real-world applications involve data of various types. Mixed data represent an interesting framework for graphical modelling which has been addressed in a few works, although without accounting for population heterogeneity; see for instance Castelletti (2024).

The breast cancer dataset provided by Piñeiro-Lamas et al. (2023) includes Tissue Doppler Imaging (TDI) data, which measure the rate of contraction and relaxation of the cardiac muscle. TDI measurements can be treated as functional data which, if included in our analysis, could help identify sub-groups of patients who experienced different side effects of anti-HER2 therapies depending on the presence of cardiac dysfunctions. The inclusion of functional variables in our framework presents challenges that are related to the development of a DAG-based model. As a starting point, Qiao et al. (2019) propose a frequentist method to learn dependencies across functional variables, which is based on undirected graphical models and lasso-type penalization techniques.

SUPPLEMENTARY MATERIAL

Supplementary Materials comprise theoretical results on the computation of prior and posterior predictive distributions, details on sampling from the baseline over DAGs, and data generation for our simulation studies.

References

Argiento and De Iorio (2022) Argiento, R. and M. De Iorio (2022). Is Infinity that Far? A Bayesian Nonparametric Perspective of Finite Mixture Models. The Annals of Statistics 50(5), 2641 – 2663.
Argiento et al. (2022) Argiento, R., E. Filippi-Mazzola, and L. Paci (2022). Model-based Clustering of Categorical Data based on the Hamming Distance. arXiv preprint.
Athey and Imbens (2016) Athey, S. and G. Imbens (2016). Recursive Partitioning for Heterogeneous Causal Effects. Proceedings of the National Academy of Sciences 113(27), 7353 – 7360.
Bargagli Stoffi et al. (2022) Bargagli Stoffi, F., K. De Witte, and G. Gnecco (2022). Heterogeneous Causal Effects with Imperfect Compliance: A Bayesian Machine Learning Approach. The Annals of Applied Statistics 16(3), 1986 – 2009.
Bowles et al. (2012) Bowles, E. J. A., R. Wellman, H. S. Feigelson, A. A. Onitilo, A. N. Freedman, T. Delate, L. A. Allen, L. Nekhlyudov, K. A. B. Goddard, R. L. Davis, L. A. Habel, M. U. Yood, C. Mccarty, D. J. Magid, E. H. Wagner, and P. S. Team (2012, 09). Risk of Heart Failure in Breast Cancer Patients After Anthracycline and Trastuzumab Treatment: A Retrospective Cohort Study. JNCI: Journal of the National Cancer Institute 104(17), 1293 – 1305.
Castelletti (2024) Castelletti, F. (2024). Learning Bayesian Networks: A Copula Approach for Mixed-Type Data. Psychometrika 89, 658 – 686.
Castelletti and Consonni (2023) Castelletti, F. and G. Consonni (2023). Bayesian Graphical Modeling for Heterogeneous Causal Effects. Statistics in Medicine 42, 15–32.
Castelletti et al. (2024) Castelletti, F., G. Consonni, and M. L. Della Vedova (2024, 07). Joint Structure Learning and Causal Effect Estimation for Categorical Graphical Models. Biometrics 80(3).
Dempke et al. (2023) Dempke, W. C., R. Zielinski, C. Winkler, S. Silberman, S. Reuther, and W. Priebe (2023). Anthracycline-Induced Cardiotoxicity – Are we About to Clear this Hurdle? European Journal of Cancer 185, 94 – 104.
Dempsey et al. (2021) Dempsey, N., A. Rosenthal, N. Dabas, Y. Kropotova, M. Lippman, and N. H. Bishopric (2021, 07). Trastuzumab-Induced Cardiotoxicity: A Review of Clinical Risk Factors, Pharmacologic Prevention, and Cardiotoxicity of other HER2-Directed Therapies. Breast Cancer Research and Treatment 188(17), 21 – 36.
Dominici et al. (2021) Dominici, F., F. J. Bargagli Stoffi, and F. Mealli (2021). From Controlled to Undisciplined Data: Estimating Causal Effects in the Era of Data Science Using a Potential Outcome Framework. Harvard Data Science Review 3(3).
Escobar and West (1994) Escobar, M. and M. West (1994). Bayesian Density Estimation and Inference Using Mixtures. Journal of the American Statistical Association 90(430), 577 – 588.
Ferguson (1973) Ferguson, T. S. (1973). A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics 1(2), 209 – 230.
Frühwirth-Schnatter et al. (2021) Frühwirth-Schnatter, S., G. Malsiner-Walli, and B. Grün (2021). Generalized Mixtures of Finite Mixtures and Telescoping Sampling. Bayesian Analysis 16(4), 1279 – 1307.
Geiger and Heckerman (1997) Geiger, D. and D. Heckerman (1997, 02). A Characterization of the Dirichlet Distribution through Global and Local Parameter Independence. Annals of Statistics 25, 1344 – 1369.
Goodman (1974) Goodman, L. A. (1974). Exploratory Latent Structure Analysis Using Both Identifiable and Unidentifiable Models. Biometrika 61(2), 215 – 231.
Hahn et al. (2020) Hahn, P. R., J. S. Murray, and C. M. Carvalho (2020). Bayesian Regression Tree Models for Causal Inference: Regularization, Confounding, and Heterogeneous Effects (with Discussion). Bayesian Analysis 15(3), 965 – 1056.
Heckerman et al. (1995) Heckerman, D., D. Geiger, and D. M. Chickering (1995). Learning Bayesian Networks: The Combination of Knowledge and Statistical Data. Machine Learning 20(3), 197 – 243.
Huang (1998) Huang, Z. (1998). Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values. Data Mining and Knowledge Discovery 2, 283 – 304.
Katzorke et al. (2013) Katzorke, N., B. Kathrin Rack, L. Haeberle, J. Katharina Neugebauer, C. Anna Melcher, C. Hagenbeck, H. Forstbauer, H. Ulrich Ulmer, U. Soeling, R. Kreienberg, T. N. Fehm, A. Schneeweiss, M. W. Beckmann, P. A. Fasching, and W. Janni (2013). Prognostic Value of HER2 on Breast Cancer Survival. Journal of Clinical Oncology 31(15), 600 – 640.
Lee et al. (2021) Lee, K., F. Bargagli Stoffi, and F. Dominici (2021). Causal Rule Ensemble: Interpretable Inference of Heterogeneous Treatment Effects. arXiv preprint.
Linero and Antonelli (2022) Linero, A. R. and J. L. Antonelli (2022). The How and Why of Bayesian Nonparametric Causal Inference. Wiley Interdisciplinary Reviews: Computational Statistics 15.
Linzer and Lewis (2011) Linzer, D. A. and J. B. Lewis (2011). poLCA: An R Package for Polytomous Variable Latent Class Analysis. Journal of Statistical Software 42(10), 1 – 29.
Lotrionte et al. (2013) Lotrionte, M., G. Biondi-Zoccai, A. Abbate, G. Lanzetta, F. D’Ascenzo, V. Malavasi, M. Peruzzi, G. Frati, and G. Palazzoni (2013). Review and Meta-Analysis of Incidence and Clinical Predictors of Anthracycline Cardiotoxicity. The American Journal of Cardiology 112(12), 1980 – 1984.
Ma et al. (2015) Ma, J., B. Hobbs, and F. Stingo (2015). Statistical Methods for Establishing Personalized Treatment Rules in Oncology. BioMed Research International 2015(1), 670691.
MacQueen (1967) MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Volume 1: Statistics, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pp. 281 – 297. University of California Press.
Malsiner-Walli et al. (2024) Malsiner-Walli, G., B. Grün, and S. Frühwirth-Schnatter (2024). Without Pain – Clustering Categorical Data Using a Bayesian Mixture of Finite Mixtures of Latent Class Analysis Models. arXiv preprint.
Müller and Rodriguez (2013) Müller, P. and A. Rodriguez (2013). Dirichlet process. In Nonparametric Bayesian Inference, Volume 9, pp. 23 – 42. Institute of Mathematical Statistics.
Neal (2000) Neal, R. M. (2000). Markov Chain Sampling Methods for Dirichlet Process Mixture Models. Journal of Computational and Graphical Statistics 9(2), 249 – 265.
Oganisian et al. (2021) Oganisian, A., N. Mitra, and J. A. Roy (2021). A Bayesian Nonparametric Model for Zero-Inflated Outcomes: Prediction, Clustering, and Causal Estimation. Biometrics 77(1), 125 – 135.
Pearl (2000) Pearl, J. (2000). Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge.
Pearl (2003) Pearl, J. (2003). Statistics and Causal Inference: A Review. Sociedad de Estadistica e Investigacion Operativa 12, 101–165.
Piñeiro-Lamas et al. (2023) Piñeiro-Lamas, B., A. López-Cheda, R. Cao, L. Ramos-Alonso, G. González-Barbeito, C. Barbeito-Caamaño, and A. Bouzas-Mosquera (2023). A Cardiotoxicity Dataset for Breast Cancer Patients. Scientific Data 10(1), 527.
Qiao et al. (2019) Qiao, X., S. Guo, and G. M. James (2019). Functional Graphical Models. Journal of the American Statistical Association 114(525), 211 – 222.
Rodríguez et al. (2011) Rodríguez, A., A. Lenkoski, and A. Dobra (2011). Sparse Covariance Estimation in Heterogeneous Samples. Electronic Journal of Statistics 5, 981 – 1014.
Roy et al. (2016) Roy, J., K. J. Lum, and M. J. Daniels (2016). A Bayesian Nonparametric Approach to Marginal Structural Models for Point Treatments and a Continuous or Survival Outcome. Biostatistics 18(1), 32 – 47.
Rubin (2005) Rubin, D. B. (2005). Causal Inference Using Potential Outcomes: Design, Modeling, Decisions. Journal of the American Statistical Association 100(469), 322 – 331.
Sethuraman (1994) Sethuraman, J. (1994). A Constructive Definition of Dirichlet Priors. Statistica Sinica 4(2), 639 – 650.
Slamon et al. (1987) Slamon, D. J., G. M. Clark, S. G. Wong, W. J. Levin, A. Ullrich, and W. L. McGuire (1987). Human Breast Cancer: Correlation of Relapse and Survival with Amplification of the HER-2/Neu Oncogene. Science 235(4785), 177 – 182.
Wade and Ghahramani (2018) Wade, S. and Z. Ghahramani (2018). Bayesian Cluster Analysis: Point Estimation and Credible Balls (with Discussion). Bayesian Analysis 13(2), 559 – 626.
Zorzetto et al. (2024) Zorzetto, D., F. Bargagli Stoffi, A. Canale, and F. Dominici (2024). Confounder-Dependent Bayesian Mixture Model: Characterizing Heterogeneity of Causal Effects in Air Pollution Epidemiology. Biometrics 80(2).

p(\bm{X}\,|\,\bm{\theta})=\prod_{i=1}^{n}\left\{\prod_{x\in{\mathcal{X}}}\left\{p(\bm{x}^{(i)}\,|\,\bm{\theta})\right\}^{\mathbbm{1}\{\bm{x}^{(i)}=x\}}\right\}=\prod_{j=1}^{q}\left\{\prod_{s\in{\mathcal{X}}_{\mathrm{pa}(j)}}\left\{\prod_{m\in\mathcal{X}_{j}}\left\{\theta^{j\,|\,\mathrm{pa}(j)}_{m\,|\,s}\right\}^{n^{\mathrm{fa}(j)}_{(m,s)}}\right\}\right\},

(2)

p\left(\bm{X}\,|\,\{\xi_{i}\}_{i=1}^{n},\{\bm{\theta}_{i}\}_{i=1}^{n},\{\mathcal{D}_{i}\}_{i=1}^{n}\right)=\prod_{k=1}^{K}\left\{\prod_{i:\xi_{i}=k}p\left(\bm{x}^{(i)}\,|\,\bm{\theta}_{\xi_{i}},\mathcal{D}_{\xi_{i}}\right)\right\}=\prod_{k=1}^{K}p\big{(}\bm{X}^{(k)}\,|\,\bm{\theta}_{k},\mathcal{D}_{k}\big{)},

(8)
