\externaldocument

Supplement

Bayesian nonparametric mixtures of categorical directed graphs for heterogeneous causal inference

Federico Castelletti and Laura Ferrini
Department of Statistical Sciences
Università Cattolica del Sacro Cuore Milan
Abstract

Quantifying causal effects of exposures on outcomes, such as a treatment and a disease respectively, is a crucial issue in medical science for the administration of effective therapies. Importantly, any related causal analysis should account for all those variables, e.g. clinical features, that can act as risk factors involved in the occurrence of a disease. In addition, the selection of targeted strategies for therapy administration requires to quantify such treatment effects at personalized level rather than at population level. We address these issues by proposing a methodology based on categorical Directed Acyclic Graphs (DAGs) which provide an effective tool to infer causal relationships and causal effects between variables. We account for population heterogeneity by considering a Dirichlet Process mixture of categorical DAGs, which clusters individuals into homogeneous groups characterized by common causal structures, dependence parameters and causal effects. We develop computational strategies for Bayesian posterior inference, from which a battery of causal effects at subject-specific level is recovered. Our methodology is evaluated through simulations and applied to a dataset of breast cancer patients to investigate cardiotoxic side effects that can be induced by the administrated anticancer therapies.


Keywords: Breast cancer; Clustering; Personalized medicine; Subject-specific graph.

1 Introduction

1.1 Motivation and framework

Estimating cause-and-effect relations between variables is a pervasive issue in many applied domains and primarily medical science. Typically in this setting, interest lies in measuring the (direct or indirect) effect of a therapy on the progression of a disease. Our methodology is motivated by a dataset of patients diagnosed with breast cancer and treated with different oncological therapies. In this context, the protein Human Epidermal growth factor Receptor 2 (HER2) has been identified as one of the main responsibles of tumor progression and growth. Recent studies have shown that therapies targeting HER2 have a strong antitumor effect, improving the overall and progression-free survival. These therapies are commonly based on both monoclonal antibodies, such as trastuzumab, as well as anticancer drugs, in particular antracyclines; see Slamon et al. (1987) and Katzorke et al. (2013). However, they can cause cardiotoxicity as a side effect, with consequent heart failure and loss of left ventricular contractile function (Dempke et al., 2023; Bowles et al., 2012). Establishing the (causal) effect of anti-HER2 therapies on cardiotoxicity is therefore of key importance for the administration of appropriate anticancer treatments, and to develop strategies for preventing and detecting cardiotoxiciy in high-risk patients. In addition, there exist several factors, such as advanced age, hypertension, valvulopathy and arrhythmia, that predispose to cardiotoxicity; see in particular Dempsey et al. (2021), Lotrionte et al. (2013) and references therein. Accordingly, in a related causal-effect analysis one should account for all those clinical features that may act as risk factors in the occurrence of cardiotoxicity.

When several variables are entertained, as in the framework above, one should account for possible interactions/dependencies between them in order to provide a coherent quantification of causal effects. Typically however, such dependence structure is unknown, or can be partially drawn only based on experts’ knowledge and one need to learn it from the data. Graphical models based on Directed Acyclic Graphs (DAGs) offer a powerful tool for this structure learning task. Importantly to our purposes, DAGs allow to properly define the causal effect on a target variable of interest induced by a hypothetical intervention on another variable in the system. Additionally, learning such causal effect can be achieved from observational data alone, under suitable causal assumptions on the data generating process (Pearl, 2000). An important issue in this general framework is however represented by heterogeneity, which implies that causal effects may vary across individuals, as the consequence of an existing, yet unknown, clustering structure in the population. Causal-inference methodologies accounting for heterogeneity can provide a more reliable quantification of treatment effects across patients, leading to personalized strategies for the administration of therapies; see in particular Ma et al. (2015) for an overview.

1.2 Related work

Available methods for clustering multivariate (categorical) data include distance- and model-based approaches. Among the first, k-modes (Huang, 1998) is the most popular methodology, which is based on a modified version of the k-means algorithm (MacQueen, 1967) and implements a dissimilarity measure between modes of categorical variables computed across clusters. On the other side, poLCA (polytomous variable Latent Class Analysis) (Linzer and Lewis, 2011) is a model-based method which applies to a collection of categorical random variables. In the model, each mixture component corresponds to a multivariate categorical distribution built under the assumption of independence between marginal distributions and for a known number of components in the mixture. Importantly however, none of these methods accounts for possible dependence relationships between variables in the underlying multivariate statistical model, which instead represents a peculiar feature of our methodology. Model-based clustering methods have been extensively developed in the Bayesian literature from both a finite and infinite mixture-model perspective; see in particular Frühwirth-Schnatter et al. (2021), Argiento and De Iorio (2022) and references therein for a review and connections between the two approaches. In a multivariate Gaussian framework, infinite-mixture models based on a Dirichlet Process (DP) prior are considered by Rodríguez et al. (2011) and Castelletti and Consonni (2023) for clustering and structure learning of undirected and directed graphs respectively. Recently, Argiento et al. (2022) proposed a finite-mixture model specifically designed for multivariate unordered categorical data. This is based on a newly-introduced class of Hamming distributions which is assumed for each categorical variable marginally, and from which a joint distribution over q𝑞qitalic_q variables is built under the assumption of (local) independence. Finally, Malsiner-Walli et al. (2024) proposed a two-layer mixture model which also allows for associations among categorical variables within each mixture-component. Such dependencies arise from a second-layer mixture, assumed within each component of the main mixture model, rather than a multivariate model with allied dependence parameter, which is instead a distinctive feature of our method for causal discovery and inference.

The literature on heterogeneous causal inference has grown extensively in the last years, particularly under the potential outcome framework (Rubin, 2005). Assuming the existence of latent sub-groups of individuals in the population, this issue is addressed through the definition and estimation of group-specific causal effects, known as Conditional Average Treatment Effects (CATEs). Machine learning methods are employed to identify the underlying clustering structure by stratification of subjects based on the levels of available covariates; see Dominici et al. (2021) for a review, Athey and Imbens (2016), Hahn et al. (2020) and Bargagli Stoffi et al. (2022), the latter proposing a method which is specifically designed to handle imperfect compliance. Recent methods also aim at improving interpretability of causal results. An instance in this direction is the Causal Rule Ensamble (CRE) method (Lee et al., 2021), which adopts multiple trees to identify patterns of heterogeneity in the data and to ensure stability in sub-group identification. Other approaches to heterogeneous causal inference using counterfactuals are based on Bayesian nonparametric methods; see for instance Linero and Antonelli (2022) and references therein. Among these, Zorzetto et al. (2024) employ a Dependent Probit Stick-Breaking mixture model to simultaneously impute the missing outcomes (counterfactuals), and to identify mutually exclusive groups, thus allowing to estimate causal effects in the presence of population heterogeneity. Still in a potential outcome framework, Roy et al. (2016) adopt marginal structural models and implement a dependent Dirichlet Process (DP) prior for the evaluation of heterogeneous causal effects of treatments on survival outcomes; Oganisian et al. (2021) instead consider a DP mixture of zero inflated regression models for pathological data exhibiting excesses of zeros. DP priors for clustering and heterogeneous causal inference are also employed by Castelletti and Consonni (2023) in a multivariate framework based on Gaussian graphical models where causal effects are defined and estimated according to do-calculus theory (Pearl, 2000).

1.3 Contribution and structure of the paper

We propose a Bayesian methodology based on a infinite mixture of categorical DAGs for causal discovery and causal effect estimation in the presence of heterogeneous data. Our model allows for the presence of latent sub-groups of individual/patients in the sample, each characterized by a possibly different causal structure and battery of related causal-effect parameters. Specifically, we assume that the multivariate distribution of the observables belongs to a Dirichlet Process (DP) mixture of categorical DAG models. Each mixture component reflects a factorization of the sampling distribution satisfying a set of conditional independencies imposed by the DAG. Under the latter, causal effects between variables are then defined according to do-calculus. With regard to the DP prior, we define a baseline measure over the space of priors on (𝒟,𝜽)𝒟𝜽(\mathcal{D},\bm{\theta})( caligraphic_D , bold_italic_θ ) where 𝒟𝒟\mathcal{D}caligraphic_D is a DAG and 𝜽𝜽\bm{\theta}bold_italic_θ the parameter of a categorical DAG model. We employ a constructive procedure based on local and global parameter independence to assign priors to DAG parameters, providing closed-form expressions for both the prior and posterior predictive of DAGs, as well as for the posterior distribution of DAG-parameters. We then leverage these results to develop a computational scheme for posterior inference of our DP model. When applied to breast cancer data, our methodology ultimately allows to quantify treatment effects of assigned therapies w.r.t. the occurrence of cardiotoxicity at subject-specific levels, thus leading to a more reliable decision process for the development of personalized therapies.

The rest of the paper is organized as follows. In Section 2 we provide some background material on DAGs and causal effects within a categorical modelling framework. In Section 3 we introduce our mixture model based on a DP prior, for which we detail the construction of the baseline mixing measure over the space of DAGs and allied parameters. We then describe a Markov Chain Monte Carlo (MCMC) strategy for posterior inference in Section 4. In Section 5 we evaluate our methodology relative to the tasks of clustering and causal discovery through extensive simulation studies, which include comparisons with alternative state-of-the-art methods. Section 6 is devoted to the analysis of breast cancer data and includes our causal-effect analysis to evaluate heterogeneous side effects of anti-HER2 therapies with respect to the occurrence of cardiotoxicity. We finally provide a discussion to our methodology in Section 7, together with possible future developments. Some technical results, including the computation of prior and posterior predictive distributions required by our posterior sampler, are reported in the Supplementary Material.

2 Background

2.1 Categorical DAG models

Consider a Directed Acyclic Graph (DAG) 𝒟=(V,E)𝒟𝑉𝐸\mathcal{D}=(V,E)caligraphic_D = ( italic_V , italic_E ) with set of nodes V={1,,q}𝑉1𝑞V=\{1,\dots,q\}italic_V = { 1 , … , italic_q } and set of directed edges EV×V𝐸𝑉𝑉E\subseteq V\times Vitalic_E ⊆ italic_V × italic_V. For a given 𝒟𝒟\mathcal{D}caligraphic_D, if (u,v)E𝑢𝑣𝐸(u,v)\in E( italic_u , italic_v ) ∈ italic_E, we say that u𝑢uitalic_u is a parent of v𝑣vitalic_v and let pa𝒟(v)subscriptpa𝒟𝑣\mathrm{pa}_{\mathcal{D}}(v)roman_pa start_POSTSUBSCRIPT caligraphic_D end_POSTSUBSCRIPT ( italic_v ) be the set of all parents of v𝑣vitalic_v in 𝒟𝒟\mathcal{D}caligraphic_D. Moreover, we let fa𝒟(v):=vpa𝒟(v)assignsubscriptfa𝒟𝑣𝑣subscriptpa𝒟𝑣\mathrm{fa}_{\mathcal{D}}(v):=v\cup\mathrm{pa}_{\mathcal{D}}(v)roman_fa start_POSTSUBSCRIPT caligraphic_D end_POSTSUBSCRIPT ( italic_v ) := italic_v ∪ roman_pa start_POSTSUBSCRIPT caligraphic_D end_POSTSUBSCRIPT ( italic_v ) be the family of node v𝑣vitalic_v in the DAG. Consider now a collection of random variables X=(X1,,Xq)𝑋subscript𝑋1subscript𝑋𝑞X=(X_{1},\dots,X_{q})italic_X = ( italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_X start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT ), such as clinical features that can be measured on patients, and binary categorical variables indicating the administration of a therapy and the absence/presence of a disease. In the following, we assume that each Xjsubscript𝑋𝑗X_{j}italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT, jV𝑗𝑉j\in Vitalic_j ∈ italic_V, is categorical with set of levels 𝒳jsubscript𝒳𝑗\mathcal{X}_{j}caligraphic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT and let xj𝒳jsubscript𝑥𝑗subscript𝒳𝑗x_{j}\in\mathcal{X}_{j}italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∈ caligraphic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT be one of its levels. Accordingly, X𝒳:=×jV𝒳jX\in{\mathcal{X}}:=\times_{j\in V}\mathcal{X}_{j}italic_X ∈ caligraphic_X := × start_POSTSUBSCRIPT italic_j ∈ italic_V end_POSTSUBSCRIPT caligraphic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT, whose generic element is x𝒳𝑥𝒳x\in{\mathcal{X}}italic_x ∈ caligraphic_X. In addition, if for any SV𝑆𝑉S\subset Vitalic_S ⊂ italic_V we let XS=(Xj,jS)subscript𝑋𝑆subscript𝑋𝑗𝑗𝑆X_{S}=(X_{j},j\in S)italic_X start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT = ( italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , italic_j ∈ italic_S ), then XS𝒳:=×jS𝒳jX_{S}\in{\mathcal{X}}:=\times_{j\in S}\mathcal{X}_{j}italic_X start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT ∈ caligraphic_X := × start_POSTSUBSCRIPT italic_j ∈ italic_S end_POSTSUBSCRIPT caligraphic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT, with xS𝒳Ssubscript𝑥𝑆subscript𝒳𝑆x_{S}\in{\mathcal{X}}_{S}italic_x start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT ∈ caligraphic_X start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT. Under 𝒟𝒟\mathcal{D}caligraphic_D, the joint probability p(x)=Pr(X1=x1,,Xq=xq)𝑝𝑥Prformulae-sequencesubscript𝑋1subscript𝑥1subscript𝑋𝑞subscript𝑥𝑞p(x)=\text{Pr}(X_{1}=x_{1},\dots,X_{q}=x_{q})italic_p ( italic_x ) = Pr ( italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_X start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT = italic_x start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT ) admits the factorization

p(x)=j=1qPr(Xj=xj|Xpa(j)=xpa(j)).𝑝𝑥superscriptsubscriptproduct𝑗1𝑞Prsubscript𝑋𝑗conditionalsubscript𝑥𝑗subscript𝑋pa𝑗subscript𝑥pa𝑗p(x)=\prod_{j=1}^{q}\Pr(X_{j}=x_{j}\,|\,X_{\mathrm{pa}(j)}=x_{\mathrm{pa}(j)}).italic_p ( italic_x ) = ∏ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_q end_POSTSUPERSCRIPT roman_Pr ( italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT | italic_X start_POSTSUBSCRIPT roman_pa ( italic_j ) end_POSTSUBSCRIPT = italic_x start_POSTSUBSCRIPT roman_pa ( italic_j ) end_POSTSUBSCRIPT ) . (1)

For the remainder of this section we omit DAG 𝒟𝒟\mathcal{D}caligraphic_D from our notation and reason conditionally on a fixed DAG. Let now θsS=Pr(XS=s)superscriptsubscript𝜃𝑠𝑆Prsubscript𝑋𝑆𝑠\theta_{s}^{S}=\Pr(X_{S}=s)italic_θ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_S end_POSTSUPERSCRIPT = roman_Pr ( italic_X start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT = italic_s ), s𝒳S𝑠subscript𝒳𝑆s\in{\mathcal{X}}_{S}italic_s ∈ caligraphic_X start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT, be a marginal probability for variables in SV𝑆𝑉S\subseteq Vitalic_S ⊆ italic_V. Moreover, let θm|sj|pa(j)=Pr(Xj=m|Xpa(j)=s)subscriptsuperscript𝜃conditional𝑗pa𝑗conditional𝑚𝑠Prsubscript𝑋𝑗conditional𝑚subscript𝑋pa𝑗𝑠\theta^{j\,|\,\mathrm{pa}(j)}_{m\,|\,s}=\Pr(X_{j}=m\,|\,X_{\mathrm{pa}(j)}=s)italic_θ start_POSTSUPERSCRIPT italic_j | roman_pa ( italic_j ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_m | italic_s end_POSTSUBSCRIPT = roman_Pr ( italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = italic_m | italic_X start_POSTSUBSCRIPT roman_pa ( italic_j ) end_POSTSUBSCRIPT = italic_s ) be a conditional probability for Xjsubscript𝑋𝑗X_{j}italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT given configuration (level) s𝑠sitalic_s of Xpa(j)subscript𝑋pa𝑗X_{\mathrm{pa}(j)}italic_X start_POSTSUBSCRIPT roman_pa ( italic_j ) end_POSTSUBSCRIPT, with m𝒳j,s𝒳pa(j)formulae-sequence𝑚subscript𝒳𝑗𝑠subscript𝒳pa𝑗m\in{\mathcal{X}}_{j},s\in{\mathcal{X}}_{\mathrm{pa}(j)}italic_m ∈ caligraphic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , italic_s ∈ caligraphic_X start_POSTSUBSCRIPT roman_pa ( italic_j ) end_POSTSUBSCRIPT. Consider n𝑛nitalic_n observations from X𝑋Xitalic_X, 𝒙(1),,𝒙(n)superscript𝒙1superscript𝒙𝑛\bm{x}^{(1)},\dots,\bm{x}^{(n)}bold_italic_x start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT , … , bold_italic_x start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT, where each 𝒙(i)=(x1(i),,xq(i))superscript𝒙𝑖superscriptsubscriptsuperscript𝑥𝑖1subscriptsuperscript𝑥𝑖𝑞top\bm{x}^{(i)}=(x^{(i)}_{1},\dots,x^{(i)}_{q})^{\top}bold_italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT = ( italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT, and 𝒙(i)𝒳superscript𝒙𝑖𝒳\bm{x}^{(i)}\in{\mathcal{X}}bold_italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT ∈ caligraphic_X, i=1,,n𝑖1𝑛i=1,\dots,nitalic_i = 1 , … , italic_n. Also, let 𝒙S(i)superscriptsubscript𝒙𝑆𝑖\bm{x}_{S}^{(i)}bold_italic_x start_POSTSUBSCRIPT italic_S end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT be the sub-vector of 𝒙(i)superscript𝒙𝑖\bm{x}^{(i)}bold_italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT with components indexed by SV𝑆𝑉S\subset Vitalic_S ⊂ italic_V. If we collect the 𝒙(i)superscript𝒙𝑖\bm{x}^{(i)}bold_italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT’s into an (n,q)𝑛𝑞(n,q)( italic_n , italic_q ) data matrix 𝑿𝑿\bm{X}bold_italic_X, then the likelihood function can be written as

p(𝑿|𝜽)𝑝conditional𝑿𝜽\displaystyle p(\bm{X}\,|\,\bm{\theta})italic_p ( bold_italic_X | bold_italic_θ ) =i=1n{x𝒳{p(𝒙(i)|𝜽)}𝟙{𝒙(i)=x}}absentsuperscriptsubscriptproduct𝑖1𝑛subscriptproduct𝑥𝒳superscript𝑝conditionalsuperscript𝒙𝑖𝜽1superscript𝒙𝑖𝑥\displaystyle=\prod_{i=1}^{n}\left\{\prod_{x\in{\mathcal{X}}}\left\{p(\bm{x}^{% (i)}\,|\,\bm{\theta})\right\}^{\mathbbm{1}\{\bm{x}^{(i)}=x\}}\right\}= ∏ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT { ∏ start_POSTSUBSCRIPT italic_x ∈ caligraphic_X end_POSTSUBSCRIPT { italic_p ( bold_italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | bold_italic_θ ) } start_POSTSUPERSCRIPT blackboard_1 { bold_italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT = italic_x } end_POSTSUPERSCRIPT } (2)
=j=1q{s𝒳pa(j){m𝒳j{θm|sj|pa(j)}n(m,s)fa(j)}},absentsuperscriptsubscriptproduct𝑗1𝑞subscriptproduct𝑠subscript𝒳pa𝑗subscriptproduct𝑚subscript𝒳𝑗superscriptsubscriptsuperscript𝜃conditional𝑗pa𝑗conditional𝑚𝑠subscriptsuperscript𝑛fa𝑗𝑚𝑠\displaystyle=\prod_{j=1}^{q}\left\{\prod_{s\in{\mathcal{X}}_{\mathrm{pa}(j)}}% \left\{\prod_{m\in\mathcal{X}_{j}}\left\{\theta^{j\,|\,\mathrm{pa}(j)}_{m\,|\,% s}\right\}^{n^{\mathrm{fa}(j)}_{(m,s)}}\right\}\right\},= ∏ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_q end_POSTSUPERSCRIPT { ∏ start_POSTSUBSCRIPT italic_s ∈ caligraphic_X start_POSTSUBSCRIPT roman_pa ( italic_j ) end_POSTSUBSCRIPT end_POSTSUBSCRIPT { ∏ start_POSTSUBSCRIPT italic_m ∈ caligraphic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT { italic_θ start_POSTSUPERSCRIPT italic_j | roman_pa ( italic_j ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_m | italic_s end_POSTSUBSCRIPT } start_POSTSUPERSCRIPT italic_n start_POSTSUPERSCRIPT roman_fa ( italic_j ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT ( italic_m , italic_s ) end_POSTSUBSCRIPT end_POSTSUPERSCRIPT } } ,

now emphasizing the dependence on the DAG-parameter 𝜽𝜽\bm{\theta}bold_italic_θ (corresponding to the collection of conditional probabilities in the equation) and where n(m,s)fa(j)=i=1n𝟙{𝒙fa(j)(i)=(m,s)}subscriptsuperscript𝑛fa𝑗𝑚𝑠superscriptsubscript𝑖1𝑛1subscriptsuperscript𝒙𝑖fa𝑗𝑚𝑠n^{\mathrm{fa}(j)}_{(m,s)}=\sum_{i=1}^{n}\mathbbm{1}\left\{\bm{x}^{(i)}_{% \mathrm{fa}(j)}=(m,s)\right\}italic_n start_POSTSUPERSCRIPT roman_fa ( italic_j ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT ( italic_m , italic_s ) end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT blackboard_1 { bold_italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_fa ( italic_j ) end_POSTSUBSCRIPT = ( italic_m , italic_s ) } is the number of observations for which Xfa(j)=(m,s)subscript𝑋fa𝑗𝑚𝑠X_{\mathrm{fa}(j)}=(m,s)italic_X start_POSTSUBSCRIPT roman_fa ( italic_j ) end_POSTSUBSCRIPT = ( italic_m , italic_s ). See also Castelletti et al. (2024) for further notation on categorical DAG models.

2.2 Causal effects for categorical DAGs

For a given collection of random variables whose multivariate distribution factorizes according to a DAG, we now focus on the causal effect of an intervention on Xhsubscript𝑋X_{h}italic_X start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT, hV𝑉h\in Vitalic_h ∈ italic_V, on a response variable of interest, say Xj:=Yassignsubscript𝑋𝑗𝑌X_{j}:=Yitalic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT := italic_Y, jh𝑗j\neq hitalic_j ≠ italic_h. In practice, such an intervention corresponds to assigning a treatment to an individual, equivalently fixing Xh=x~subscript𝑋~𝑥X_{h}=\tilde{x}italic_X start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT = over~ start_ARG italic_x end_ARG, where Xhsubscript𝑋X_{h}italic_X start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT is typically an exposure of Y𝑌Yitalic_Y, and this action can be denoted using Pearl’s do-operator do(Xh=x~)dosubscript𝑋~𝑥\textnormal{do}(X_{h}=\tilde{x})do ( italic_X start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT = over~ start_ARG italic_x end_ARG ) (Pearl, 2003). This implies a change in the observational distribution (1), leading to the so-called post-intervention distribution

p(x|do(Xh=x~))={jhp(Xj=xj|Xpa(j)=xpa(j))ifXh=x~  0otherwise.𝑝conditional𝑥dosubscript𝑋~𝑥casessubscriptproduct𝑗𝑝subscript𝑋𝑗conditionalsubscript𝑥𝑗subscript𝑋pa𝑗subscript𝑥pa𝑗ifsubscript𝑋~𝑥  0otherwisep(x\,|\,\textnormal{do}(X_{h}=\tilde{x}))=\begin{cases}\prod\limits_{j\neq h}p% \big{(}X_{j}=x_{j}\,|\,X_{\mathrm{pa}(j)}=x_{\mathrm{pa}(j)}\big{)}&\textrm{if% }\ X_{h}=\tilde{x}\\ \,\,0&\textrm{otherwise}.\end{cases}italic_p ( italic_x | do ( italic_X start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT = over~ start_ARG italic_x end_ARG ) ) = { start_ROW start_CELL ∏ start_POSTSUBSCRIPT italic_j ≠ italic_h end_POSTSUBSCRIPT italic_p ( italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT | italic_X start_POSTSUBSCRIPT roman_pa ( italic_j ) end_POSTSUBSCRIPT = italic_x start_POSTSUBSCRIPT roman_pa ( italic_j ) end_POSTSUBSCRIPT ) end_CELL start_CELL if italic_X start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT = over~ start_ARG italic_x end_ARG end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL otherwise . end_CELL end_ROW (3)

Assuming for simplicity that both Xhsubscript𝑋X_{h}italic_X start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT and Y𝑌Yitalic_Y are binary variables with levels in {0,1}01\{0,1\}{ 0 , 1 }, the causal effect of do{Xh=x~}dosubscript𝑋~𝑥\textnormal{do}\{X_{h}=\tilde{x}\}do { italic_X start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT = over~ start_ARG italic_x end_ARG } on Y𝑌Yitalic_Y can be defined as

cy,h:=𝔼[Y|do(Xh=1)]𝔼[Y|do(Xh=0)];assignsubscript𝑐𝑦𝔼delimited-[]conditional𝑌dosubscript𝑋1𝔼delimited-[]conditional𝑌dosubscript𝑋0c_{y,h}\vcentcolon=\mathbb{E}\big{[}Y\,|\,\textnormal{do}(X_{h}=1)\big{]}-% \mathbb{E}\big{[}Y\,|\,\textnormal{do}(X_{h}=0)\big{]};italic_c start_POSTSUBSCRIPT italic_y , italic_h end_POSTSUBSCRIPT := blackboard_E [ italic_Y | do ( italic_X start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT = 1 ) ] - blackboard_E [ italic_Y | do ( italic_X start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT = 0 ) ] ; (4)

see Pearl (2003). More in general, if Xhsubscript𝑋X_{h}italic_X start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT is polytomous with levels labeled as {0,1,,L}01𝐿\{0,1,\dots,L\}{ 0 , 1 , … , italic_L }, one can define a battery of causal effects by considering Xh=lsubscript𝑋𝑙X_{h}=litalic_X start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT = italic_l, for each l=1,,L𝑙1𝐿l=1,\dots,Litalic_l = 1 , … , italic_L in the first expectation of Equation (4). According to the definition above, cy,hsubscript𝑐𝑦c_{y,h}italic_c start_POSTSUBSCRIPT italic_y , italic_h end_POSTSUBSCRIPT involves a (marginal) post-intervention distribution of Y𝑌Yitalic_Y. However, because of (3), the latter can be expressed in terms of observational distributions, simply by conditioning and then marginalizing w.r.t. a valid adjustment set ZX𝑍𝑋Z\subset Xitalic_Z ⊂ italic_X; see Pearl (2003). A common choice for such an adjustment set is Z=Xpa(h)𝑍subscript𝑋paZ=X_{\mathrm{pa}(h)}italic_Z = italic_X start_POSTSUBSCRIPT roman_pa ( italic_h ) end_POSTSUBSCRIPT, namely the parents of Xhsubscript𝑋X_{h}italic_X start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT, leading to

cy,h=s𝒳pa(h)𝔼(Y|\displaystyle c_{y,h}=\sum_{s\in{\mathcal{X}}_{\mathrm{pa}(h)}}\mathbb{E}\big{% (}Y\,|italic_c start_POSTSUBSCRIPT italic_y , italic_h end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_s ∈ caligraphic_X start_POSTSUBSCRIPT roman_pa ( italic_h ) end_POSTSUBSCRIPT end_POSTSUBSCRIPT blackboard_E ( italic_Y | Xh=1,Xpa(h)=s)Pr(Xpa(h)=s)\displaystyle X_{h}=1,X_{\mathrm{pa}(h)}=s\big{)}\Pr\big{(}X_{\mathrm{pa}(h)}=% s\big{)}italic_X start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT = 1 , italic_X start_POSTSUBSCRIPT roman_pa ( italic_h ) end_POSTSUBSCRIPT = italic_s ) roman_Pr ( italic_X start_POSTSUBSCRIPT roman_pa ( italic_h ) end_POSTSUBSCRIPT = italic_s ) (5)
\displaystyle-- s𝒳pa(h)𝔼(Y|Xh=0,Xpa(h)=s)Pr(Xpa(h)=s);subscript𝑠subscript𝒳pa𝔼formulae-sequenceconditional𝑌subscript𝑋0subscript𝑋pa𝑠Prsubscript𝑋pa𝑠\displaystyle\sum_{s\in{\mathcal{X}}_{\mathrm{pa}(h)}}\mathbb{E}\big{(}Y\,|\,X% _{h}=0,X_{\mathrm{pa}(h)}=s\big{)}\Pr\big{(}X_{\mathrm{pa}(h)}=s\big{)};∑ start_POSTSUBSCRIPT italic_s ∈ caligraphic_X start_POSTSUBSCRIPT roman_pa ( italic_h ) end_POSTSUBSCRIPT end_POSTSUBSCRIPT blackboard_E ( italic_Y | italic_X start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT = 0 , italic_X start_POSTSUBSCRIPT roman_pa ( italic_h ) end_POSTSUBSCRIPT = italic_s ) roman_Pr ( italic_X start_POSTSUBSCRIPT roman_pa ( italic_h ) end_POSTSUBSCRIPT = italic_s ) ;

see in particular Theorem 3.2.3 in Pearl (2000). Under model (2), the causal effect in (5) can be expressed as a function of the DAG parameter 𝜽𝜽\bm{\theta}bold_italic_θ as

γy,h(𝜽)=s𝒳pa(h){(θ1|(1,s)Y|fa(h)θ1|(0,s)Y|fa(h))θspa(h)}.subscript𝛾𝑦𝜽subscript𝑠subscript𝒳pasubscriptsuperscript𝜃conditional𝑌faconditional11𝑠subscriptsuperscript𝜃conditional𝑌faconditional10𝑠subscriptsuperscript𝜃pa𝑠\displaystyle\gamma_{y,h}(\bm{\theta})=\sum_{s\in\mathcal{X}_{\mathrm{pa}(h)}}% \left\{\left(\theta^{Y\,|\,\mathrm{fa}(h)}_{1\,|\,(1,s)}-\theta^{Y\,|\,\mathrm% {fa}(h)}_{1|(0,s)}\right)\theta^{\mathrm{pa}(h)}_{s}\right\}.italic_γ start_POSTSUBSCRIPT italic_y , italic_h end_POSTSUBSCRIPT ( bold_italic_θ ) = ∑ start_POSTSUBSCRIPT italic_s ∈ caligraphic_X start_POSTSUBSCRIPT roman_pa ( italic_h ) end_POSTSUBSCRIPT end_POSTSUBSCRIPT { ( italic_θ start_POSTSUPERSCRIPT italic_Y | roman_fa ( italic_h ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 | ( 1 , italic_s ) end_POSTSUBSCRIPT - italic_θ start_POSTSUPERSCRIPT italic_Y | roman_fa ( italic_h ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT 1 | ( 0 , italic_s ) end_POSTSUBSCRIPT ) italic_θ start_POSTSUPERSCRIPT roman_pa ( italic_h ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT } . (6)

3 DP mixture of categorical DAG models

In this section we introduce our Dirichlet Process (DP) mixture of categorical DAG models. This can be written using the following hierarchical structure

𝒙(i)|𝜽i,𝒟iconditionalsuperscript𝒙𝑖subscript𝜽𝑖subscript𝒟𝑖\displaystyle\bm{x}^{(i)}\,|\,\bm{\theta}_{i},\mathcal{D}_{i}bold_italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | bold_italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT p(𝒙(i)|𝜽i,𝒟i)similar-toabsent𝑝conditionalsuperscript𝒙𝑖subscript𝜽𝑖subscript𝒟𝑖\displaystyle\sim p(\bm{x}^{(i)}\,|\,\bm{\theta}_{i},\mathcal{D}_{i})∼ italic_p ( bold_italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | bold_italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) (7)
(𝜽i,𝒟i)|Hconditionalsubscript𝜽𝑖subscript𝒟𝑖𝐻\displaystyle(\bm{\theta}_{i},\mathcal{D}_{i})\,|\,H( bold_italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) | italic_H Hsimilar-toabsent𝐻\displaystyle\sim H∼ italic_H
H𝐻\displaystyle Hitalic_H DP(M0,α)similar-toabsent𝐷𝑃subscript𝑀0𝛼\displaystyle\sim DP(M_{0},\alpha)∼ italic_D italic_P ( italic_M start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_α )

where DP(M0,α)𝐷𝑃subscript𝑀0𝛼DP(M_{0},\alpha)italic_D italic_P ( italic_M start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_α ) denotes a DP prior with baseline M0subscript𝑀0M_{0}italic_M start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT and concentration parameter α𝛼\alphaitalic_α (Ferguson, 1973), and we now emphasize the dependence on DAG 𝒟isubscript𝒟𝑖\mathcal{D}_{i}caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT in the sampling distribution p(𝒙(i)|𝜽i,𝒟i)𝑝conditionalsuperscript𝒙𝑖subscript𝜽𝑖subscript𝒟𝑖p(\bm{x}^{(i)}\,|\,\bm{\theta}_{i},\mathcal{D}_{i})italic_p ( bold_italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | bold_italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ).

A property of model (7) is that it induces a partition of the observations 𝒙(1),,𝒙(n)superscript𝒙1superscript𝒙𝑛\bm{x}^{(1)},\dots,\bm{x}^{(n)}bold_italic_x start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT , … , bold_italic_x start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT into clusters, with individuals assigned to the same cluster sharing the same DAG 𝒟𝒟\mathcal{D}caligraphic_D and DAG parameter 𝜽𝜽\bm{\theta}bold_italic_θ. Moreover, the expected number of clusters is controlled by α𝛼\alphaitalic_α: each observation 𝒙(i)superscript𝒙𝑖\bm{x}^{(i)}bold_italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT is associated with its own (𝜽i,𝒟i)subscript𝜽𝑖subscript𝒟𝑖(\bm{\theta}_{i},\mathcal{D}_{i})( bold_italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT )-parameter as α𝛼\alpha\rightarrow\inftyitalic_α → ∞; on the contrary, if α0𝛼0\alpha\rightarrow 0italic_α → 0, then all observations are assigned to the same cluster, leading to a standard categorical DAG model (Castelletti et al., 2024); see also Müller and Rodriguez (2013) for related properties of the DP prior.

Let now Kn𝐾𝑛K\leq nitalic_K ≤ italic_n be the number of unique values among (𝜽1,𝒟1),,(𝜽n,𝒟n)subscript𝜽1subscript𝒟1subscript𝜽𝑛subscript𝒟𝑛(\bm{\theta}_{1},\mathcal{D}_{1}),\dots,(\bm{\theta}_{n},\mathcal{D}_{n})( bold_italic_θ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , caligraphic_D start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) , … , ( bold_italic_θ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT , caligraphic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ), and {ξi}i=1nsuperscriptsubscriptsubscript𝜉𝑖𝑖1𝑛\{\xi_{i}\}_{i=1}^{n}{ italic_ξ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT a sequence of (cluster) indicator variables such that ξi{1,,K}subscript𝜉𝑖1𝐾\xi_{i}\in\{1,\dots,K\}italic_ξ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ { 1 , … , italic_K } and (𝜽i,𝒟i)=(𝜽ξi,𝒟ξi)subscript𝜽𝑖subscript𝒟𝑖subscript𝜽subscript𝜉𝑖subscript𝒟subscript𝜉𝑖(\bm{\theta}_{i},\mathcal{D}_{i})=(\bm{\theta}_{\xi_{i}},\mathcal{D}_{\xi_{i}})( bold_italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) = ( bold_italic_θ start_POSTSUBSCRIPT italic_ξ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT , caligraphic_D start_POSTSUBSCRIPT italic_ξ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT ). Conditionally on {ξi}i=1nsuperscriptsubscriptsubscript𝜉𝑖𝑖1𝑛\{\xi_{i}\}_{i=1}^{n}{ italic_ξ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT, observations are i.i.d. within each cluster, so that the likelihood can be written as

p(𝑿|{ξi}i=1n,{𝜽i}i=1n,{𝒟i}i=1n)𝑝conditional𝑿superscriptsubscriptsubscript𝜉𝑖𝑖1𝑛superscriptsubscriptsubscript𝜽𝑖𝑖1𝑛superscriptsubscriptsubscript𝒟𝑖𝑖1𝑛\displaystyle p\left(\bm{X}\,|\,\{\xi_{i}\}_{i=1}^{n},\{\bm{\theta}_{i}\}_{i=1% }^{n},\{\mathcal{D}_{i}\}_{i=1}^{n}\right)italic_p ( bold_italic_X | { italic_ξ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT , { bold_italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT , { caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ) =k=1K{i:ξi=kp(𝒙(i)|𝜽ξi,𝒟ξi)}absentsuperscriptsubscriptproduct𝑘1𝐾subscriptproduct:𝑖subscript𝜉𝑖𝑘𝑝conditionalsuperscript𝒙𝑖subscript𝜽subscript𝜉𝑖subscript𝒟subscript𝜉𝑖\displaystyle=\prod_{k=1}^{K}\left\{\prod_{i:\xi_{i}=k}p\left(\bm{x}^{(i)}\,|% \,\bm{\theta}_{\xi_{i}},\mathcal{D}_{\xi_{i}}\right)\right\}= ∏ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT { ∏ start_POSTSUBSCRIPT italic_i : italic_ξ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_k end_POSTSUBSCRIPT italic_p ( bold_italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | bold_italic_θ start_POSTSUBSCRIPT italic_ξ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT , caligraphic_D start_POSTSUBSCRIPT italic_ξ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) } (8)
=k=1Kp(𝑿(k)|𝜽k,𝒟k),absentsuperscriptsubscriptproduct𝑘1𝐾𝑝conditionalsuperscript𝑿𝑘subscript𝜽𝑘subscript𝒟𝑘\displaystyle=\prod_{k=1}^{K}p\big{(}\bm{X}^{(k)}\,|\,\bm{\theta}_{k},\mathcal% {D}_{k}\big{)},= ∏ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT italic_p ( bold_italic_X start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT | bold_italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , caligraphic_D start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ,

with p(𝑿(k)|𝜽k,𝒟k)𝑝conditionalsuperscript𝑿𝑘subscript𝜽𝑘subscript𝒟𝑘p\big{(}\bm{X}^{(k)}\,|\,\bm{\theta}_{k},\mathcal{D}_{k}\big{)}italic_p ( bold_italic_X start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT | bold_italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , caligraphic_D start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) as in Equation (2), and where 𝑿(k)superscript𝑿𝑘\bm{X}^{(k)}bold_italic_X start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT is the (nk,q)subscript𝑛𝑘𝑞(n_{k},q)( italic_n start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_q ) matrix collecting all observations 𝒙(i)superscript𝒙𝑖\bm{x}^{(i)}bold_italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT such that ξi=ksubscript𝜉𝑖𝑘\xi_{i}=kitalic_ξ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_k. Additionally, a generic count involved in the k𝑘kitalic_k-th component above will be denoted as n(m,s)fa(j)k=i:ξi=k𝟙{𝒙fa(j)(i)=(m,s)}subscriptsubscriptsuperscript𝑛fa𝑗𝑚𝑠𝑘subscript:𝑖subscript𝜉𝑖𝑘1subscriptsuperscript𝒙𝑖fa𝑗𝑚𝑠\prescript{}{k}{n}^{\mathrm{fa}(j)}_{(m,s)}=\sum_{i:\xi_{i}=k}\mathbbm{1}\big{% \{}\bm{x}^{(i)}_{\mathrm{fa}(j)}=(m,s)\big{\}}start_FLOATSUBSCRIPT italic_k end_FLOATSUBSCRIPT italic_n start_POSTSUPERSCRIPT roman_fa ( italic_j ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT ( italic_m , italic_s ) end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_i : italic_ξ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_k end_POSTSUBSCRIPT blackboard_1 { bold_italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_fa ( italic_j ) end_POSTSUBSCRIPT = ( italic_m , italic_s ) }, which corresponds to the number of observations in cluster k𝑘kitalic_k for which the level taken by variables Xfa(j)subscript𝑋fa𝑗X_{\mathrm{fa}(j)}italic_X start_POSTSUBSCRIPT roman_fa ( italic_j ) end_POSTSUBSCRIPT is equal to (m,s)𝑚𝑠(m,s)( italic_m , italic_s ). An alternative representation of the DP prior is based on the so-called stick-breaking process (Sethuraman, 1994). Accordingly, H𝐻Hitalic_H can be written in the form

H=k=1ωkδ(𝜽k,𝒟k)𝐻superscriptsubscript𝑘1subscript𝜔𝑘subscript𝛿subscript𝜽𝑘subscript𝒟𝑘H=\sum_{k=1}^{\infty}\omega_{k}\delta_{(\bm{\theta}_{k},\mathcal{D}_{k})}italic_H = ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∞ end_POSTSUPERSCRIPT italic_ω start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_δ start_POSTSUBSCRIPT ( bold_italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , caligraphic_D start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT (9)

where δ(𝜽k,𝒟k)subscript𝛿subscript𝜽𝑘subscript𝒟𝑘\delta_{(\bm{\theta}_{k},\mathcal{D}_{k})}italic_δ start_POSTSUBSCRIPT ( bold_italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , caligraphic_D start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT is a degenerate probability measure placing all of its mass on {𝜽k,𝒟k}subscript𝜽𝑘subscript𝒟𝑘\left\{\bm{\theta}_{k},\mathcal{D}_{k}\right\}{ bold_italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , caligraphic_D start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT } and {𝜽k,𝒟k}k=1iidM0superscriptsubscriptsubscript𝜽𝑘subscript𝒟𝑘𝑘1iidsimilar-tosubscript𝑀0\left\{\bm{\theta}_{k},\mathcal{D}_{k}\right\}_{k=1}^{\infty}\overset{% \textnormal{iid}}{\sim}M_{0}{ bold_italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , caligraphic_D start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∞ end_POSTSUPERSCRIPT overiid start_ARG ∼ end_ARG italic_M start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT. Moreover, the weights {ωk}k=1superscriptsubscriptsubscript𝜔𝑘𝑘1\{\omega_{k}\}_{k=1}^{\infty}{ italic_ω start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∞ end_POSTSUPERSCRIPT satisfy ω1=v1subscript𝜔1subscript𝑣1\omega_{1}=v_{1}italic_ω start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = italic_v start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, and ωk=vkh<k(1vh)subscript𝜔𝑘subscript𝑣𝑘subscriptproduct𝑘1subscript𝑣\omega_{k}=v_{k}\prod_{h<k}(1-v_{h})italic_ω start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = italic_v start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∏ start_POSTSUBSCRIPT italic_h < italic_k end_POSTSUBSCRIPT ( 1 - italic_v start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT ), where {vk}k=1iidBeta(1,α)superscriptsubscriptsubscript𝑣𝑘𝑘1iidsimilar-toBeta1𝛼\{v_{k}\}_{k=1}^{\infty}\overset{\textnormal{iid}}{\sim}\textnormal{Beta}(1,\alpha){ italic_v start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∞ end_POSTSUPERSCRIPT overiid start_ARG ∼ end_ARG Beta ( 1 , italic_α ), with α𝛼\alphaitalic_α the concentration parameter of the DP prior. In the following we will assign αGamma(c,d)similar-to𝛼Gamma𝑐𝑑\alpha\sim\textnormal{Gamma}(c,d)italic_α ∼ Gamma ( italic_c , italic_d ) following Escobar and West (1994).

In the next sections we detail the construction of the baseline M0subscript𝑀0M_{0}italic_M start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT. This is structured as M0=p(𝜽|𝒟)p(𝒟)subscript𝑀0𝑝conditional𝜽𝒟𝑝𝒟M_{0}=p(\bm{\theta}\,|\,\mathcal{D})p(\mathcal{D})italic_M start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT = italic_p ( bold_italic_θ | caligraphic_D ) italic_p ( caligraphic_D ), where the former term corresponds to a prior on the DAG parameter 𝜽𝜽\bm{\theta}bold_italic_θ conditionally on DAG 𝒟𝒟\mathcal{D}caligraphic_D, while the latter is a marginal prior over DAGs.

3.1 Baseline on DAG parameter

Conditionally on DAG 𝒟𝒟\mathcal{D}caligraphic_D, we first assign a prior p(𝜽|𝒟)𝑝conditional𝜽𝒟p(\bm{\theta}\,|\,\mathcal{D})italic_p ( bold_italic_θ | caligraphic_D ). To this end, consider for each node j{1,,q}𝑗1𝑞j\in\{1,\dots,q\}italic_j ∈ { 1 , … , italic_q } and s𝒳pa(j)𝑠subscript𝒳pa𝑗s\in{\mathcal{X}}_{\mathrm{pa}(j)}italic_s ∈ caligraphic_X start_POSTSUBSCRIPT roman_pa ( italic_j ) end_POSTSUBSCRIPT the parameter (θm|sj|pa(j),m𝒳j):=𝜽sj|pa(j)assignsubscriptsuperscript𝜃conditional𝑗pa𝑗conditional𝑚𝑠𝑚subscript𝒳𝑗subscriptsuperscript𝜽conditional𝑗pa𝑗𝑠\big{(}\theta^{j\,|\,\mathrm{pa}(j)}_{m\,|\,s},m\in\mathcal{X}_{j}\big{)}:=\bm% {\theta}^{j\,|\,\mathrm{pa}(j)}_{s}( italic_θ start_POSTSUPERSCRIPT italic_j | roman_pa ( italic_j ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_m | italic_s end_POSTSUBSCRIPT , italic_m ∈ caligraphic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) := bold_italic_θ start_POSTSUPERSCRIPT italic_j | roman_pa ( italic_j ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT corresponding to a |𝒳j|subscript𝒳𝑗|\mathcal{X}_{j}|| caligraphic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT |-dimensional vector collecting conditional probabilities for variable Xjsubscript𝑋𝑗X_{j}italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT, given a configuration s𝑠sitalic_s of its parents Xpa(j)subscript𝑋pa𝑗X_{\mathrm{pa}(j)}italic_X start_POSTSUBSCRIPT roman_pa ( italic_j ) end_POSTSUBSCRIPT. We assign to each 𝜽sj|pa(j)subscriptsuperscript𝜽conditional𝑗pa𝑗𝑠\bm{\theta}^{j\,|\,\mathrm{pa}(j)}_{s}bold_italic_θ start_POSTSUPERSCRIPT italic_j | roman_pa ( italic_j ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT a Dirichlet prior with hyper-parameter 𝒂sj|pa(j)=(am|sj|pa(j),m𝒳j)subscriptsuperscript𝒂conditional𝑗pa𝑗𝑠subscriptsuperscript𝑎conditional𝑗pa𝑗conditional𝑚𝑠𝑚subscript𝒳𝑗\bm{a}^{j\,|\,\mathrm{pa}(j)}_{s}=\big{(}a^{j\,|\,\mathrm{pa}(j)}_{m\,|\,s},m% \in\mathcal{X}_{j}\big{)}bold_italic_a start_POSTSUPERSCRIPT italic_j | roman_pa ( italic_j ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT = ( italic_a start_POSTSUPERSCRIPT italic_j | roman_pa ( italic_j ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_m | italic_s end_POSTSUBSCRIPT , italic_m ∈ caligraphic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ), written as 𝜽sj|pa(j)Dirichlet(𝒂sj|pa(j))similar-tosubscriptsuperscript𝜽conditional𝑗pa𝑗𝑠Dirichletsubscriptsuperscript𝒂conditional𝑗pa𝑗𝑠\bm{\theta}^{j\,|\,\mathrm{pa}(j)}_{s}\sim\textnormal{Dirichlet}(\bm{a}^{j\,|% \,\mathrm{pa}(j)}_{s})bold_italic_θ start_POSTSUPERSCRIPT italic_j | roman_pa ( italic_j ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ∼ Dirichlet ( bold_italic_a start_POSTSUPERSCRIPT italic_j | roman_pa ( italic_j ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ), whose p.d.f. is

p(𝜽sj|pa(j))=h(𝒂sj|pa(j))m𝒳j{θm|sj|pa(j)}am|sj|pa(j)1,𝑝subscriptsuperscript𝜽conditional𝑗pa𝑗𝑠subscriptsuperscript𝒂conditional𝑗pa𝑗𝑠subscriptproduct𝑚subscript𝒳𝑗superscriptsubscriptsuperscript𝜃conditional𝑗pa𝑗conditional𝑚𝑠subscriptsuperscript𝑎conditional𝑗pa𝑗conditional𝑚𝑠1\displaystyle p\big{(}\bm{\theta}^{j\,|\,\mathrm{pa}(j)}_{s}\big{)}=h\big{(}% \bm{a}^{j\,|\,\mathrm{pa}(j)}_{s}\big{)}\prod_{m\in\mathcal{X}_{j}}\left\{% \theta^{j\,|\,\mathrm{pa}(j)}_{m\,|\,s}\right\}^{a^{j\,|\,\mathrm{pa}(j)}_{m\,% |\,s}-1},italic_p ( bold_italic_θ start_POSTSUPERSCRIPT italic_j | roman_pa ( italic_j ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ) = italic_h ( bold_italic_a start_POSTSUPERSCRIPT italic_j | roman_pa ( italic_j ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ) ∏ start_POSTSUBSCRIPT italic_m ∈ caligraphic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT { italic_θ start_POSTSUPERSCRIPT italic_j | roman_pa ( italic_j ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_m | italic_s end_POSTSUBSCRIPT } start_POSTSUPERSCRIPT italic_a start_POSTSUPERSCRIPT italic_j | roman_pa ( italic_j ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_m | italic_s end_POSTSUBSCRIPT - 1 end_POSTSUPERSCRIPT , (10)

and where h(𝒂sj|pa(j))subscriptsuperscript𝒂conditional𝑗pa𝑗𝑠h\big{(}\bm{a}^{j\,|\,\mathrm{pa}(j)}_{s}\big{)}italic_h ( bold_italic_a start_POSTSUPERSCRIPT italic_j | roman_pa ( italic_j ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ) is the prior normalizing constant. Let now 𝜽j|pa(j)=(𝜽sj|pa(j),s𝒳pa(j))superscript𝜽conditional𝑗pa𝑗subscriptsuperscript𝜽conditional𝑗pa𝑗𝑠𝑠subscript𝒳pa𝑗\bm{\theta}^{j\,|\,\mathrm{pa}(j)}=\big{(}\bm{\theta}^{j\,|\,\mathrm{pa}(j)}_{% s},s\in{\mathcal{X}}_{\mathrm{pa}(j)}\big{)}bold_italic_θ start_POSTSUPERSCRIPT italic_j | roman_pa ( italic_j ) end_POSTSUPERSCRIPT = ( bold_italic_θ start_POSTSUPERSCRIPT italic_j | roman_pa ( italic_j ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT , italic_s ∈ caligraphic_X start_POSTSUBSCRIPT roman_pa ( italic_j ) end_POSTSUBSCRIPT ). By assuming global and local parameter independence (Geiger and Heckerman, 1997), respectively j𝜽j|pa(j)\bot\bot_{j}\bm{\theta}^{j\,|\,\mathrm{pa}(j)}⊥ ⊥ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT bold_italic_θ start_POSTSUPERSCRIPT italic_j | roman_pa ( italic_j ) end_POSTSUPERSCRIPT and s𝜽sj|pa(j)\bot\bot_{s}\bm{\theta}^{j\,|\,\mathrm{pa}(j)}_{s}⊥ ⊥ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT bold_italic_θ start_POSTSUPERSCRIPT italic_j | roman_pa ( italic_j ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT, a joint prior on 𝜽={𝜽j|pa(j),jV}𝜽superscript𝜽conditional𝑗pa𝑗𝑗𝑉\bm{\theta}=\big{\{}\bm{\theta}^{j\,|\,\mathrm{pa}(j)},j\in V\big{\}}bold_italic_θ = { bold_italic_θ start_POSTSUPERSCRIPT italic_j | roman_pa ( italic_j ) end_POSTSUPERSCRIPT , italic_j ∈ italic_V } can be written as

p(𝜽)=j=1q{s𝒳pa(j)p(𝜽sj|pa(j))}.𝑝𝜽superscriptsubscriptproduct𝑗1𝑞subscriptproduct𝑠subscript𝒳pa𝑗𝑝subscriptsuperscript𝜽conditional𝑗pa𝑗𝑠\displaystyle p(\bm{\theta})=\prod_{j=1}^{q}\left\{\prod_{s\in{\mathcal{X}}_{% \mathrm{pa}(j)}}p\big{(}\bm{\theta}^{j\,|\,\mathrm{pa}(j)}_{s}\big{)}\right\}.italic_p ( bold_italic_θ ) = ∏ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_q end_POSTSUPERSCRIPT { ∏ start_POSTSUBSCRIPT italic_s ∈ caligraphic_X start_POSTSUBSCRIPT roman_pa ( italic_j ) end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_p ( bold_italic_θ start_POSTSUPERSCRIPT italic_j | roman_pa ( italic_j ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ) } . (11)

In what follows we implement the default choice am|sj|pa(j)=a/|𝒳fa(j)|subscriptsuperscript𝑎conditional𝑗pa𝑗conditional𝑚𝑠𝑎subscript𝒳fa𝑗a^{j\,|\,\mathrm{pa}(j)}_{m\,|\,s}=a/|{\mathcal{X}}_{\mathrm{fa}(j)}|italic_a start_POSTSUPERSCRIPT italic_j | roman_pa ( italic_j ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_m | italic_s end_POSTSUBSCRIPT = italic_a / | caligraphic_X start_POSTSUBSCRIPT roman_fa ( italic_j ) end_POSTSUBSCRIPT |, a>0𝑎0a>0italic_a > 0, leading to the Bayesian Dirichlet Equivalent uniform (BDEu) score (Heckerman et al., 1995), which guarantees that Markov equivalent DAGs are assigned the same marginal likelihood; see also Castelletti et al. (2024).

Also notice that the resulting prior is conjugate with the likelihood (2) since, for generic dataset 𝑿𝑿\bm{X}bold_italic_X, 𝜽sj|pa(j)|𝑿Dirichlet(𝒂sj|pa(j)+𝒏sfa(j))similar-toconditionalsubscriptsuperscript𝜽conditional𝑗pa𝑗𝑠𝑿Dirichletsubscriptsuperscript𝒂conditional𝑗pa𝑗𝑠superscriptsubscript𝒏𝑠fa𝑗\bm{\theta}^{j\,|\,\mathrm{pa}(j)}_{s}\,|\,\bm{X}\sim\textnormal{Dirichlet}% \big{(}\bm{a}^{j\,|\,\mathrm{pa}(j)}_{s}+\bm{n}_{s}^{\mathrm{fa}(j)}\big{)}bold_italic_θ start_POSTSUPERSCRIPT italic_j | roman_pa ( italic_j ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT | bold_italic_X ∼ Dirichlet ( bold_italic_a start_POSTSUPERSCRIPT italic_j | roman_pa ( italic_j ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT + bold_italic_n start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT roman_fa ( italic_j ) end_POSTSUPERSCRIPT ) with 𝒏sfa(j)=(n(m,s)fa(j),m𝒳j)superscriptsubscript𝒏𝑠fa𝑗superscriptsubscript𝑛𝑚𝑠fa𝑗𝑚subscript𝒳𝑗\bm{n}_{s}^{\mathrm{fa}(j)}=\big{(}n_{(m,s)}^{\mathrm{fa}(j)},m\in\mathcal{X}_% {j}\big{)}bold_italic_n start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT roman_fa ( italic_j ) end_POSTSUPERSCRIPT = ( italic_n start_POSTSUBSCRIPT ( italic_m , italic_s ) end_POSTSUBSCRIPT start_POSTSUPERSCRIPT roman_fa ( italic_j ) end_POSTSUPERSCRIPT , italic_m ∈ caligraphic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ). Accordingly, the posterior of 𝜽𝜽\bm{\theta}bold_italic_θ is

p(𝜽|𝑿)=j=1q{s𝒳pa(j)p(𝜽sj|pa(j)|𝑿)},𝑝conditional𝜽𝑿superscriptsubscriptproduct𝑗1𝑞subscriptproduct𝑠subscript𝒳pa𝑗𝑝conditionalsubscriptsuperscript𝜽conditional𝑗pa𝑗𝑠𝑿\displaystyle p(\bm{\theta}\,|\,\bm{X})=\prod_{j=1}^{q}\left\{\prod_{s\in{% \mathcal{X}}_{\mathrm{pa}(j)}}p\big{(}\bm{\theta}^{j\,|\,\mathrm{pa}(j)}_{s}\,% |\,\bm{X}\big{)}\right\},italic_p ( bold_italic_θ | bold_italic_X ) = ∏ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_q end_POSTSUPERSCRIPT { ∏ start_POSTSUBSCRIPT italic_s ∈ caligraphic_X start_POSTSUBSCRIPT roman_pa ( italic_j ) end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_p ( bold_italic_θ start_POSTSUPERSCRIPT italic_j | roman_pa ( italic_j ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT | bold_italic_X ) } , (12)

with each term corresponding to a Dirichlet p.d.f., so that direct sampling from p(𝜽|𝑿)𝑝conditional𝜽𝑿p(\bm{\theta}\,|\,\bm{X})italic_p ( bold_italic_θ | bold_italic_X ) is possible. Finally, under the same prior, a marginal (i.e. integrated w.r.t. to 𝜽𝜽\bm{\theta}bold_italic_θ) likelihood m(𝑿|𝒟)=p(𝑿|𝜽,𝒟)p(𝜽|𝒟)𝑑𝜽𝑚conditional𝑿𝒟𝑝conditional𝑿𝜽𝒟𝑝conditional𝜽𝒟differential-d𝜽m(\bm{X}\,|\,\mathcal{D})=\int p(\bm{X}\,|\,\bm{\theta},\mathcal{D})p(\bm{% \theta}\,|\,\mathcal{D})\,d\bm{\theta}italic_m ( bold_italic_X | caligraphic_D ) = ∫ italic_p ( bold_italic_X | bold_italic_θ , caligraphic_D ) italic_p ( bold_italic_θ | caligraphic_D ) italic_d bold_italic_θ is available and admits the factorization

m(𝑿|𝒟)=j=1qm(𝑿j|𝑿pa(j)),𝑚conditional𝑿𝒟superscriptsubscriptproduct𝑗1𝑞𝑚conditionalsubscript𝑿𝑗subscript𝑿pa𝑗m(\bm{X}\,|\,\mathcal{D})=\prod_{j=1}^{q}m(\bm{X}_{j}\,|\,\bm{X}_{\mathrm{pa}(% j)}),italic_m ( bold_italic_X | caligraphic_D ) = ∏ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_q end_POSTSUPERSCRIPT italic_m ( bold_italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT | bold_italic_X start_POSTSUBSCRIPT roman_pa ( italic_j ) end_POSTSUBSCRIPT ) , (13)

with

m(𝑿j|𝑿pa(j))=s𝒳pa(j)h(𝒂sj|pa(j))h(𝒂sj|pa(j)+𝒏sfa(j))𝑚conditionalsubscript𝑿𝑗subscript𝑿pa𝑗subscriptproduct𝑠subscript𝒳pa𝑗subscriptsuperscript𝒂conditional𝑗pa𝑗𝑠subscriptsuperscript𝒂conditional𝑗pa𝑗𝑠subscriptsuperscript𝒏fa𝑗𝑠m\big{(}\bm{X}_{j}\,|\,\bm{X}_{\mathrm{pa}(j)}\big{)}=\prod_{s\in{\mathcal{X}}% _{\mathrm{pa}(j)}}\frac{h\big{(}\bm{a}^{j\,|\,\mathrm{pa}(j)}_{s}\big{)}}{h% \big{(}\bm{a}^{j\,|\,\mathrm{pa}(j)}_{s}+\bm{n}^{\mathrm{fa}(j)}_{s}\big{)}}italic_m ( bold_italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT | bold_italic_X start_POSTSUBSCRIPT roman_pa ( italic_j ) end_POSTSUBSCRIPT ) = ∏ start_POSTSUBSCRIPT italic_s ∈ caligraphic_X start_POSTSUBSCRIPT roman_pa ( italic_j ) end_POSTSUBSCRIPT end_POSTSUBSCRIPT divide start_ARG italic_h ( bold_italic_a start_POSTSUPERSCRIPT italic_j | roman_pa ( italic_j ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ) end_ARG start_ARG italic_h ( bold_italic_a start_POSTSUPERSCRIPT italic_j | roman_pa ( italic_j ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT + bold_italic_n start_POSTSUPERSCRIPT roman_fa ( italic_j ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ) end_ARG (14)

and where h(𝒂sj|pa(j)+𝒏sfa(j))subscriptsuperscript𝒂conditional𝑗pa𝑗𝑠subscriptsuperscript𝒏fa𝑗𝑠h\big{(}\bm{a}^{j\,|\,\mathrm{pa}(j)}_{s}+\bm{n}^{\mathrm{fa}(j)}_{s}\big{)}italic_h ( bold_italic_a start_POSTSUPERSCRIPT italic_j | roman_pa ( italic_j ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT + bold_italic_n start_POSTSUPERSCRIPT roman_fa ( italic_j ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ) is the posterior normalizing constant. See also the Supplementary Material (Section 1) for full details.

3.2 Baseline on DAGs

Let 𝒮qsubscript𝒮𝑞\mathcal{S}_{q}caligraphic_S start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT be the (discrete) space of all DAGs with q𝑞qitalic_q nodes. Additionally, we can restrict 𝒮qsubscript𝒮𝑞\mathcal{S}_{q}caligraphic_S start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT to a subset of DAGs satisfying some structural constraints, typically edge orientations that can be postulated in advance based on the specific real-data problem. As an instance, in our application to breast cancer data we regard age as an exogenous variable and forbid any incoming edge to it from other variables; conversely, we regard the occurrence of cardiotoxic side effect as a response variable and accordingly forbid any outgoing edge from it. Each DAG 𝒟=(V,E)𝒟𝑉𝐸\mathcal{D}=(V,E)caligraphic_D = ( italic_V , italic_E ) in 𝒮qsubscript𝒮𝑞\mathcal{S}_{q}caligraphic_S start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT can be represented through a 0-1 adjacency matrix 𝑨𝒟superscript𝑨𝒟\bm{A}^{\mathcal{D}}bold_italic_A start_POSTSUPERSCRIPT caligraphic_D end_POSTSUPERSCRIPT, whose (u,v)𝑢𝑣(u,v)( italic_u , italic_v )-element 𝑨u,v𝒟=1subscriptsuperscript𝑨𝒟𝑢𝑣1\bm{A}^{\mathcal{D}}_{u,v}=1bold_italic_A start_POSTSUPERSCRIPT caligraphic_D end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_u , italic_v end_POSTSUBSCRIPT = 1 if (u,v)E𝑢𝑣𝐸(u,v)\in E( italic_u , italic_v ) ∈ italic_E, 00 otherwise. Additionally, let 𝑺𝒟superscript𝑺𝒟\bm{S}^{\mathcal{D}}bold_italic_S start_POSTSUPERSCRIPT caligraphic_D end_POSTSUPERSCRIPT, be the adjacency matrix of the skeleton of 𝒟𝒟\mathcal{D}caligraphic_D, namely the undirected graph obtained from 𝒟𝒟\mathcal{D}caligraphic_D by disregarding edges orientations. We assign for each u>v𝑢𝑣u>vitalic_u > italic_v, 𝑺u,v𝒟|πiidBer(π)conditionalsubscriptsuperscript𝑺𝒟𝑢𝑣𝜋iidsimilar-toBer𝜋\bm{S}^{\mathcal{D}}_{u,v}\,|\,\pi\overset{\textnormal{{iid}}}{\sim}% \textnormal{Ber}(\pi)bold_italic_S start_POSTSUPERSCRIPT caligraphic_D end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_u , italic_v end_POSTSUBSCRIPT | italic_π overiid start_ARG ∼ end_ARG Ber ( italic_π ) where π𝜋\piitalic_π is a prior probability of edge inclusion. We then assume hierarchically πBeta(a,b)similar-to𝜋Beta𝑎𝑏\pi\sim\textnormal{Beta}(a,b)italic_π ∼ Beta ( italic_a , italic_b ), leading to the integrated prior on DAG 𝒟𝒟\mathcal{D}caligraphic_D

p(𝒟)p(𝑺𝒟)=Γ(a+b)Γ(a)Γ(b)Γ(|𝑺𝒟|+a)Γ(q(q1)/2|𝑺𝒟|+b)Γ(q(q1)/2+a+b),proportional-to𝑝𝒟𝑝superscript𝑺𝒟Γ𝑎𝑏Γ𝑎Γ𝑏Γsuperscript𝑺𝒟𝑎Γ𝑞𝑞12superscript𝑺𝒟𝑏Γ𝑞𝑞12𝑎𝑏p(\mathcal{D})\propto p\big{(}\bm{S}^{\mathcal{D}}\big{)}=\frac{\Gamma(a+b)}{% \Gamma(a)\Gamma(b)}\cdot\frac{\Gamma\big{(}|\bm{S}^{\mathcal{D}}|+a\big{)}% \Gamma\big{(}q(q-1)/2-|\bm{S}^{\mathcal{D}}|+b\big{)}}{\Gamma\big{(}q(q-1)/2+a% +b\big{)}},italic_p ( caligraphic_D ) ∝ italic_p ( bold_italic_S start_POSTSUPERSCRIPT caligraphic_D end_POSTSUPERSCRIPT ) = divide start_ARG roman_Γ ( italic_a + italic_b ) end_ARG start_ARG roman_Γ ( italic_a ) roman_Γ ( italic_b ) end_ARG ⋅ divide start_ARG roman_Γ ( | bold_italic_S start_POSTSUPERSCRIPT caligraphic_D end_POSTSUPERSCRIPT | + italic_a ) roman_Γ ( italic_q ( italic_q - 1 ) / 2 - | bold_italic_S start_POSTSUPERSCRIPT caligraphic_D end_POSTSUPERSCRIPT | + italic_b ) end_ARG start_ARG roman_Γ ( italic_q ( italic_q - 1 ) / 2 + italic_a + italic_b ) end_ARG ,

where |𝑺𝒟|superscript𝑺𝒟|\bm{S}^{\mathcal{D}}|| bold_italic_S start_POSTSUPERSCRIPT caligraphic_D end_POSTSUPERSCRIPT | the number of non-null elements in 𝑺𝒟superscript𝑺𝒟\bm{S}^{\mathcal{D}}bold_italic_S start_POSTSUPERSCRIPT caligraphic_D end_POSTSUPERSCRIPT, corresponding to the number of edges in 𝒟𝒟\mathcal{D}caligraphic_D, and q(q1)/2𝑞𝑞12q(q-1)/2italic_q ( italic_q - 1 ) / 2 is the maximum number of edges in a DAG having q𝑞qitalic_q nodes. Sampling from the baseline over DAGs, as required by our MCMC sampler (Section 4), is possible through an acceptance-rejection algorithm over the space 𝒮qsubscript𝒮𝑞\mathcal{S}_{q}caligraphic_S start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT; see also the Supplementary Material (Section 2).

4 Posterior inference

In this section we detail our Markov Chain Monte Carlo (MCMC) strategy for posterior inference of a DP mixture of categorical DAGs. This is based on a collapsed sampler with DAG parameters integrated out, and which accordingly approximates a marginal posterior over DAGs and cluster indicators ξ1,,ξnsubscript𝜉1subscript𝜉𝑛\xi_{1},\dots,\xi_{n}italic_ξ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_ξ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT. Such output allows for inference about the clustering structure and/or the graphical structures associated with the clusters. In a second step, DAG parameters can be sampled conditionally on ξ1,,ξnsubscript𝜉1subscript𝜉𝑛\xi_{1},\dots,\xi_{n}italic_ξ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_ξ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT based on Equation (12).

4.1 MCMC scheme

The structure of our baseline measure (Section 3.1) is such that we can integrate out the DAG parameter 𝜽𝜽\bm{\theta}bold_italic_θ, which allows for the implementation of a collapsed sampler approximating the marginal posterior of {ξi}i=1n,α,{𝒟k}k=1K,Ksuperscriptsubscriptsubscript𝜉𝑖𝑖1𝑛𝛼superscriptsubscriptsubscript𝒟𝑘𝑘1𝐾𝐾\{\xi_{i}\}_{i=1}^{n},\alpha,\{\mathcal{D}_{k}\}_{k=1}^{K},K{ italic_ξ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT , italic_α , { caligraphic_D start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT } start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT , italic_K. The resulting scheme has a Gibbs-sampling structure as Algorithm 2 in Neal (2000) and implements the following steps.

4.1.1 Update of cluster indicators

The full conditional of ξisubscript𝜉𝑖\xi_{i}italic_ξ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is

p(ξi=k|{𝒙(l):li,ξl=k},𝒟k){nkip(𝒙(i)|{𝒙(l):li,ξl=k},𝒟k)k=1,,Kαp(𝒙(i)|𝒟k)k=K+1,proportional-to𝑝subscript𝜉𝑖conditional𝑘conditional-setsuperscript𝒙𝑙formulae-sequence𝑙𝑖subscript𝜉𝑙𝑘subscript𝒟𝑘casessubscriptsuperscript𝑛𝑖𝑘𝑝conditionalsuperscript𝒙𝑖conditional-setsuperscript𝒙𝑙formulae-sequence𝑙𝑖subscript𝜉𝑙𝑘subscript𝒟𝑘𝑘1𝐾𝛼𝑝conditionalsuperscript𝒙𝑖subscript𝒟𝑘𝑘𝐾1\displaystyle p(\xi_{i}=k\,|\,\{\bm{x}^{(l)}:l\neq i,\xi_{l}=k\},\mathcal{D}_{% k})\propto\begin{cases}n^{-i}_{k}\ p(\bm{x}^{(i)}\,|\,\{\bm{x}^{(l)}:l\neq i,% \xi_{l}=k\},\mathcal{D}_{k})&k=1,\dots,K\\ \alpha\ p(\bm{x}^{(i)}\,|\,\mathcal{D}_{k})&k=K+1,\end{cases}italic_p ( italic_ξ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_k | { bold_italic_x start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT : italic_l ≠ italic_i , italic_ξ start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT = italic_k } , caligraphic_D start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ∝ { start_ROW start_CELL italic_n start_POSTSUPERSCRIPT - italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_p ( bold_italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | { bold_italic_x start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT : italic_l ≠ italic_i , italic_ξ start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT = italic_k } , caligraphic_D start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) end_CELL start_CELL italic_k = 1 , … , italic_K end_CELL end_ROW start_ROW start_CELL italic_α italic_p ( bold_italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | caligraphic_D start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) end_CELL start_CELL italic_k = italic_K + 1 , end_CELL end_ROW (15)

which corresponds to the probability that subject i𝑖iitalic_i is assigned to cluster k𝑘kitalic_k, conditionally on all the observations currently assigned to that cluster, and on 𝒟ksubscript𝒟𝑘\mathcal{D}_{k}caligraphic_D start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT. In particular, for a non empty cluster k=1,,K𝑘1𝐾k=1,\dots,Kitalic_k = 1 , … , italic_K, the full conditional is proportional to the product between two terms: the number of observations belonging to cluster k𝑘kitalic_k (possibly excluding observation i𝑖iitalic_i), nki=li𝟙{ξl=k}subscriptsuperscript𝑛𝑖𝑘subscript𝑙𝑖1subscript𝜉𝑙𝑘n^{-i}_{k}=\sum_{l\neq i}\mathbbm{1}\{\xi_{l}=k\}italic_n start_POSTSUPERSCRIPT - italic_i end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_l ≠ italic_i end_POSTSUBSCRIPT blackboard_1 { italic_ξ start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT = italic_k }, and the posterior predictive distribution p(𝒙(i)|{𝒙(l):li,ξl=k},𝒟k)𝑝conditionalsuperscript𝒙𝑖conditional-setsuperscript𝒙𝑙formulae-sequence𝑙𝑖subscript𝜉𝑙𝑘subscript𝒟𝑘p(\bm{x}^{(i)}\,|\,\{\bm{x}^{(l)}:l\neq i,\xi_{l}=k\},\mathcal{D}_{k})italic_p ( bold_italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | { bold_italic_x start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT : italic_l ≠ italic_i , italic_ξ start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT = italic_k } , caligraphic_D start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) evaluated at 𝒙(i)superscript𝒙𝑖\bm{x}^{(i)}bold_italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT. For the latter, we provide in the following proposition a simple closed-form expression.

Proposition 4.1 (Posterior predictive - non-empty cluster).

For a given cluster k𝑘kitalic_k, consider the data matrix 𝐗(k)superscript𝐗𝑘\bm{X}^{(k)}bold_italic_X start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT collecting the nksubscript𝑛𝑘n_{k}italic_n start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT observations {𝐱(l):ξl=k}conditional-setsuperscript𝐱𝑙subscript𝜉𝑙𝑘\big{\{}\bm{x}^{(l)}:\xi_{l}=k\big{\}}{ bold_italic_x start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT : italic_ξ start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT = italic_k } and an observation 𝐱(i)superscript𝐱𝑖\bm{x}^{(i)}bold_italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT. Then, the posterior predictive of 𝐱(i)superscript𝐱𝑖\bm{x}^{(i)}bold_italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT given {𝐱(l):li,ξl=k}conditional-setsuperscript𝐱𝑙formulae-sequence𝑙𝑖subscript𝜉𝑙𝑘\{\bm{x}^{(l)}:l\neq i,\xi_{l}=k\}{ bold_italic_x start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT : italic_l ≠ italic_i , italic_ξ start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT = italic_k } is

p(𝒙(i)|{𝒙(l):li,ξl=k},𝒟k)=j=1q{a/|𝒳fa(j)|+n(m~j,s~j)fa(j)k𝟙{ξi=k}a/|𝒳pa(j)|+ns~jpa(j)k𝟙{ξi=k}}𝑝conditionalsuperscript𝒙𝑖conditional-setsuperscript𝒙𝑙formulae-sequence𝑙𝑖subscript𝜉𝑙𝑘subscript𝒟𝑘superscriptsubscriptproduct𝑗1𝑞𝑎subscript𝒳fa𝑗subscriptsubscriptsuperscript𝑛fa𝑗subscript~𝑚𝑗subscript~𝑠𝑗𝑘1subscript𝜉𝑖𝑘𝑎subscript𝒳pa𝑗subscriptsubscriptsuperscript𝑛pa𝑗subscript~𝑠𝑗𝑘1subscript𝜉𝑖𝑘\displaystyle p(\bm{x}^{(i)}\,|\,\{\bm{x}^{(l)}:l\neq i,\xi_{l}=k\},\mathcal{D% }_{k})=\prod_{j=1}^{q}\left\{\frac{a/|{\mathcal{X}}_{\mathrm{fa}(j)}|+% \prescript{}{k}{n}^{\mathrm{fa}(j)}_{(\tilde{m}_{j},\tilde{s}_{j})}-\mathbbm{1% }\{\xi_{i}=k\}}{a/|{\mathcal{X}}_{\mathrm{pa}(j)}|+\prescript{}{k}{n}^{\mathrm% {pa}(j)}_{\tilde{s}_{j}}-\mathbbm{1}\{\xi_{i}=k\}}\right\}italic_p ( bold_italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | { bold_italic_x start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT : italic_l ≠ italic_i , italic_ξ start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT = italic_k } , caligraphic_D start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) = ∏ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_q end_POSTSUPERSCRIPT { divide start_ARG italic_a / | caligraphic_X start_POSTSUBSCRIPT roman_fa ( italic_j ) end_POSTSUBSCRIPT | + start_FLOATSUBSCRIPT italic_k end_FLOATSUBSCRIPT italic_n start_POSTSUPERSCRIPT roman_fa ( italic_j ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT ( over~ start_ARG italic_m end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , over~ start_ARG italic_s end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT - blackboard_1 { italic_ξ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_k } end_ARG start_ARG italic_a / | caligraphic_X start_POSTSUBSCRIPT roman_pa ( italic_j ) end_POSTSUBSCRIPT | + start_FLOATSUBSCRIPT italic_k end_FLOATSUBSCRIPT italic_n start_POSTSUPERSCRIPT roman_pa ( italic_j ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT over~ start_ARG italic_s end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT - blackboard_1 { italic_ξ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_k } end_ARG } (16)

where m~j=𝐱j(i),s~j=𝐱pa(j)(i)formulae-sequencesubscript~𝑚𝑗subscriptsuperscript𝐱𝑖𝑗subscript~𝑠𝑗subscriptsuperscript𝐱𝑖pa𝑗\tilde{m}_{j}=\bm{x}^{(i)}_{j},\tilde{s}_{j}=\bm{x}^{(i)}_{\mathrm{pa}(j)}over~ start_ARG italic_m end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = bold_italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , over~ start_ARG italic_s end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT = bold_italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_pa ( italic_j ) end_POSTSUBSCRIPT and

n(m~j,s~j)fa(j)k=l:ξl=k𝟙{𝒙fa(j)(l)=(m~j,s~j)},ns~jpa(j)k=l:ξl=k𝟙{𝒙pa(j)(l)=s~j}.formulae-sequencesubscriptsubscriptsuperscript𝑛fa𝑗subscript~𝑚𝑗subscript~𝑠𝑗𝑘subscript:𝑙subscript𝜉𝑙𝑘1subscriptsuperscript𝒙𝑙fa𝑗subscript~𝑚𝑗subscript~𝑠𝑗subscriptsubscriptsuperscript𝑛pa𝑗subscript~𝑠𝑗𝑘subscript:𝑙subscript𝜉𝑙𝑘1subscriptsuperscript𝒙𝑙pa𝑗subscript~𝑠𝑗\prescript{}{k}{n}^{\mathrm{fa}(j)}_{(\tilde{m}_{j},\tilde{s}_{j})}=\sum_{l:% \xi_{l}=k}\mathbbm{1}\big{\{}\bm{x}^{(l)}_{\mathrm{fa}(j)}=(\tilde{m}_{j},% \tilde{s}_{j})\big{\}},\quad\prescript{}{k}{n}^{\mathrm{pa}(j)}_{\tilde{s}_{j}% }=\sum_{l:\xi_{l}=k}\mathbbm{1}\big{\{}\bm{x}^{(l)}_{\mathrm{pa}(j)}=\tilde{s}% _{j}\big{\}}.start_FLOATSUBSCRIPT italic_k end_FLOATSUBSCRIPT italic_n start_POSTSUPERSCRIPT roman_fa ( italic_j ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT ( over~ start_ARG italic_m end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , over~ start_ARG italic_s end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_l : italic_ξ start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT = italic_k end_POSTSUBSCRIPT blackboard_1 { bold_italic_x start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_fa ( italic_j ) end_POSTSUBSCRIPT = ( over~ start_ARG italic_m end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , over~ start_ARG italic_s end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) } , start_FLOATSUBSCRIPT italic_k end_FLOATSUBSCRIPT italic_n start_POSTSUPERSCRIPT roman_pa ( italic_j ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT over~ start_ARG italic_s end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_l : italic_ξ start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT = italic_k end_POSTSUBSCRIPT blackboard_1 { bold_italic_x start_POSTSUPERSCRIPT ( italic_l ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT roman_pa ( italic_j ) end_POSTSUBSCRIPT = over~ start_ARG italic_s end_ARG start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT } .
Proof.

See Supplementary Material. ∎

The second expression of (15) considers the case of a (new) empty cluster k=K+1𝑘𝐾1k=K+1italic_k = italic_K + 1, where the DAG 𝒟K+1subscript𝒟𝐾1\mathcal{D}_{K+1}caligraphic_D start_POSTSUBSCRIPT italic_K + 1 end_POSTSUBSCRIPT is sampled from the baseline over 𝒮qsubscript𝒮𝑞\mathcal{S}_{q}caligraphic_S start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT. In such case, the full conditional is proportional to the product of the concentration parameter α𝛼\alphaitalic_α and a posterior predictive which reduces to the marginal likelihood (prior predictive) of a cluster containing subject i𝑖iitalic_i only. A related closed-form expression is provided by the following proposition.

Proposition 4.2 (Posterior predictive - empty cluster).

For a new cluster k=K+1𝑘𝐾1k=K+1italic_k = italic_K + 1, the posterior predictive of 𝐱(i)superscript𝐱𝑖\bm{x}^{(i)}bold_italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT coincides with the marginal likelihood and is given by

p(𝒙(i)|𝒟k)=j=1q1|𝒳j|.𝑝conditionalsuperscript𝒙𝑖subscript𝒟𝑘superscriptsubscriptproduct𝑗1𝑞1subscript𝒳𝑗p(\bm{x}^{(i)}\,|\,\mathcal{D}_{k})=\prod_{j=1}^{q}\frac{1}{|{\mathcal{X}}_{j}% |}.italic_p ( bold_italic_x start_POSTSUPERSCRIPT ( italic_i ) end_POSTSUPERSCRIPT | caligraphic_D start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) = ∏ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_q end_POSTSUPERSCRIPT divide start_ARG 1 end_ARG start_ARG | caligraphic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT | end_ARG . (17)
Proof.

See Supplementary Material. ∎

4.1.2 Update of α𝛼\alphaitalic_α

Under the DP prior, the full conditional distribution of α𝛼\alphaitalic_α coincides with p(α|K)p(K|α)p(α)proportional-to𝑝conditional𝛼𝐾𝑝conditional𝐾𝛼𝑝𝛼p(\alpha\,|\,K)\propto p(K\,|\,\alpha)p(\alpha)italic_p ( italic_α | italic_K ) ∝ italic_p ( italic_K | italic_α ) italic_p ( italic_α ), where in particular

p(K|α)cn(K)αKΓ(α)Γ(α+n)proportional-to𝑝conditional𝐾𝛼subscript𝑐𝑛𝐾superscript𝛼𝐾Γ𝛼Γ𝛼𝑛p(K\,|\,\alpha)\propto c_{n}(K)\alpha^{K}\frac{\Gamma(\alpha)}{\Gamma(\alpha+n)}italic_p ( italic_K | italic_α ) ∝ italic_c start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_K ) italic_α start_POSTSUPERSCRIPT italic_K end_POSTSUPERSCRIPT divide start_ARG roman_Γ ( italic_α ) end_ARG start_ARG roman_Γ ( italic_α + italic_n ) end_ARG

is the prior on the number of clusters induced by the DP and cn(K)subscript𝑐𝑛𝐾c_{n}(K)italic_c start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ( italic_K ) is a normalizing constant not involving α𝛼\alphaitalic_α. Sampling from p(α|K)𝑝conditional𝛼𝐾p(\alpha\,|\,K)italic_p ( italic_α | italic_K ) can be done by augmenting the distribution through an auxiliary variable ηBeta(1,α)similar-to𝜂Beta1𝛼\eta\sim\textnormal{Beta}(1,\alpha)italic_η ∼ Beta ( 1 , italic_α ). It can be shown (Escobar and West, 1994) that under the prior αGamma(c,d)similar-to𝛼Gamma𝑐𝑑\alpha\sim\textnormal{Gamma}(c,d)italic_α ∼ Gamma ( italic_c , italic_d ) the full conditional of α|K,ηconditional𝛼𝐾𝜂\alpha\,|\,K,\etaitalic_α | italic_K , italic_η corresponds to a mixture of Gamma distributions, specifically

α|η,KgGamma(c+K,dlogη)+(1g)Gamma(c+K1,dlogη),similar-toconditional𝛼𝜂𝐾𝑔Gamma𝑐𝐾𝑑𝑙𝑜𝑔𝜂1𝑔Gamma𝑐𝐾1𝑑𝑙𝑜𝑔𝜂\alpha\,|\,\eta,K\sim g\cdot\textnormal{Gamma}(c+K,d-log\eta)+(1-g)\cdot% \textnormal{Gamma}(c+K-1,d-log\eta),italic_α | italic_η , italic_K ∼ italic_g ⋅ Gamma ( italic_c + italic_K , italic_d - italic_l italic_o italic_g italic_η ) + ( 1 - italic_g ) ⋅ Gamma ( italic_c + italic_K - 1 , italic_d - italic_l italic_o italic_g italic_η ) ,

where g/(1g)=(c+K1)/n(dlogη)𝑔1𝑔𝑐𝐾1𝑛𝑑𝑙𝑜𝑔𝜂g/(1-g)=(c+K-1)/n(d-log\ \eta)italic_g / ( 1 - italic_g ) = ( italic_c + italic_K - 1 ) / italic_n ( italic_d - italic_l italic_o italic_g italic_η ).

4.1.3 Update of DAGs and sampling of DAG parameters

Let K𝐾Kitalic_K be the number of clusters and ξ1,,ξnsubscript𝜉1subscript𝜉𝑛\xi_{1},\dots,\xi_{n}italic_ξ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_ξ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT the cluster indicators, with each ξi{1,,K}subscript𝜉𝑖1𝐾\xi_{i}\in\{1,\dots,K\}italic_ξ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ { 1 , … , italic_K }. For a given k{1,,K}𝑘1𝐾k\in\{1,\dots,K\}italic_k ∈ { 1 , … , italic_K }, let {𝒙i:ξi=k}conditional-setsubscript𝒙𝑖subscript𝜉𝑖𝑘\{\bm{x}_{i}:\xi_{i}=k\}{ bold_italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT : italic_ξ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_k } be the set of observations currently assigned to cluster k𝑘kitalic_k, and 𝑿(k)superscript𝑿𝑘\bm{X}^{(k)}bold_italic_X start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT the implied (nk,q)subscript𝑛𝑘𝑞(n_{k},q)( italic_n start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_q ) data matrix; see also Equation (8). Without loss of generality, consider a generic cluster and omit for simplicity subscripts k𝑘kitalic_k from 𝒟ksubscript𝒟𝑘\mathcal{D}_{k}caligraphic_D start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT and 𝑿(k)superscript𝑿𝑘\bm{X}^{(k)}bold_italic_X start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT. Update of DAG 𝒟𝒟\mathcal{D}caligraphic_D is performed through a Metropolis Hastings step where a DAG 𝒟~~𝒟\widetilde{\mathcal{D}}over~ start_ARG caligraphic_D end_ARG is sampled from a proposal distribution q(𝒟~|𝒟)𝑞conditional~𝒟𝒟q(\widetilde{\mathcal{D}}\,|\,\mathcal{D})italic_q ( over~ start_ARG caligraphic_D end_ARG | caligraphic_D ) conditionally on a current DAG 𝒟𝒟\mathcal{D}caligraphic_D and it is accepted with probability α𝒟~=min{1;r𝒟~}subscript𝛼~𝒟1subscript𝑟~𝒟\alpha_{\widetilde{\mathcal{D}}}=\min\{1;r_{\widetilde{\mathcal{D}}}\}italic_α start_POSTSUBSCRIPT over~ start_ARG caligraphic_D end_ARG end_POSTSUBSCRIPT = roman_min { 1 ; italic_r start_POSTSUBSCRIPT over~ start_ARG caligraphic_D end_ARG end_POSTSUBSCRIPT } with

r𝒟~=m(𝑿|𝒟~)m(𝑿|𝒟)p(𝒟~)p(𝒟)q(𝒟|𝒟~)q(𝒟~|𝒟),subscript𝑟~𝒟𝑚conditional𝑿~𝒟𝑚conditional𝑿𝒟𝑝~𝒟𝑝𝒟𝑞conditional𝒟~𝒟𝑞conditional~𝒟𝒟r_{\widetilde{\mathcal{D}}}=\frac{m(\bm{X}\,|\,\widetilde{\mathcal{D}})}{m(\bm% {X}\,|\,\mathcal{D})}\cdot\frac{p(\widetilde{\mathcal{D}})}{p(\mathcal{D})}% \cdot\frac{q(\mathcal{D}\,|\,\widetilde{\mathcal{D}})}{q(\widetilde{\mathcal{D% }}\,|\,\mathcal{D})},italic_r start_POSTSUBSCRIPT over~ start_ARG caligraphic_D end_ARG end_POSTSUBSCRIPT = divide start_ARG italic_m ( bold_italic_X | over~ start_ARG caligraphic_D end_ARG ) end_ARG start_ARG italic_m ( bold_italic_X | caligraphic_D ) end_ARG ⋅ divide start_ARG italic_p ( over~ start_ARG caligraphic_D end_ARG ) end_ARG start_ARG italic_p ( caligraphic_D ) end_ARG ⋅ divide start_ARG italic_q ( caligraphic_D | over~ start_ARG caligraphic_D end_ARG ) end_ARG start_ARG italic_q ( over~ start_ARG caligraphic_D end_ARG | caligraphic_D ) end_ARG , (18)

and m(𝑿|𝒟~)𝑚conditional𝑿~𝒟m(\bm{X}\,|\,\widetilde{\mathcal{D}})italic_m ( bold_italic_X | over~ start_ARG caligraphic_D end_ARG ) as in Equation (13); see also the Supplementary Material for full details.

Finally, conditionally on DAGs 𝒟1,,𝒟Ksubscript𝒟1subscript𝒟𝐾\mathcal{D}_{1},\dots,\mathcal{D}_{K}caligraphic_D start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , caligraphic_D start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT and indicators ξ1,,ξnsubscript𝜉1subscript𝜉𝑛\xi_{1},\dots,\xi_{n}italic_ξ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_ξ start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT, we can sample each DAG parameter 𝜽ksubscript𝜽𝑘\bm{\theta}_{k}bold_italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT based on Equation (12), corresponding to the posterior of 𝜽ksubscript𝜽𝑘\bm{\theta}_{k}bold_italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT which is available in closed-form as a product of Dirichlet probability functions; see also Section 3.1.

4.2 Posterior summaries

Output of the MCMC scheme is a collection of cluster indicators, DAGs and DAG parameters approximately drawn from the posterior distribution of our DP mixture model. Starting from such output, we can provide posterior summaries regarding clustering, DAG structures, as well as DAG-model parameters. Specifically, let K(s)superscript𝐾𝑠K^{(s)}italic_K start_POSTSUPERSCRIPT ( italic_s ) end_POSTSUPERSCRIPT be the number of clusters at MCMC iteration s𝑠sitalic_s, ξi(s)superscriptsubscript𝜉𝑖𝑠\xi_{i}^{(s)}italic_ξ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_s ) end_POSTSUPERSCRIPT, i=1,,n𝑖1𝑛i=1,\dots,nitalic_i = 1 , … , italic_n, 𝒟k(s)superscriptsubscript𝒟𝑘𝑠\mathcal{D}_{k}^{(s)}caligraphic_D start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_s ) end_POSTSUPERSCRIPT and 𝜽k(s)superscriptsubscript𝜽𝑘𝑠\bm{\theta}_{k}^{(s)}bold_italic_θ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_s ) end_POSTSUPERSCRIPT, k=1,,K(s)𝑘1superscript𝐾𝑠k=1,\dots,K^{(s)}italic_k = 1 , … , italic_K start_POSTSUPERSCRIPT ( italic_s ) end_POSTSUPERSCRIPT, be the corresponding realizations of the three sets of parameters. For clustering purposes, we first recover an (n,n)𝑛𝑛(n,n)( italic_n , italic_n ) posterior similarity matrix 𝑺𝑺\bm{S}bold_italic_S, with (i,i)𝑖superscript𝑖(i,i^{\prime})( italic_i , italic_i start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT )-element 𝑺i,isubscript𝑺𝑖superscript𝑖\bm{S}_{i,i^{\prime}}bold_italic_S start_POSTSUBSCRIPT italic_i , italic_i start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT corresponding to the (estimated) posterior probability that individuals i𝑖iitalic_i and isuperscript𝑖i^{\prime}italic_i start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT are assigned to the same cluster, namely

p^(ξi=ξi|𝑿)=1Ss=1S𝟙{ξi(s)=ξi(s)}.^𝑝subscript𝜉𝑖conditionalsubscript𝜉superscript𝑖𝑿1𝑆superscriptsubscript𝑠1𝑆1superscriptsubscript𝜉𝑖𝑠superscriptsubscript𝜉superscript𝑖𝑠\widehat{p}(\xi_{i}=\xi_{i^{\prime}}\,|\,\bm{X})=\frac{1}{S}\sum_{s=1}^{S}% \mathbbm{1}\left\{\xi_{i}^{(s)}=\xi_{i^{\prime}}^{(s)}\right\}.over^ start_ARG italic_p end_ARG ( italic_ξ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_ξ start_POSTSUBSCRIPT italic_i start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT | bold_italic_X ) = divide start_ARG 1 end_ARG start_ARG italic_S end_ARG ∑ start_POSTSUBSCRIPT italic_s = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_S end_POSTSUPERSCRIPT blackboard_1 { italic_ξ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_s ) end_POSTSUPERSCRIPT = italic_ξ start_POSTSUBSCRIPT italic_i start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_s ) end_POSTSUPERSCRIPT } . (19)

A point estimate of the clustering structure, 𝒄^^𝒄\widehat{\bm{c}}over^ start_ARG bold_italic_c end_ARG, can be recovered by assigning individuals i𝑖iitalic_i and isuperscript𝑖i^{\prime}italic_i start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT to the same cluster if p^(ξi=ξi|𝑿)^𝑝subscript𝜉𝑖conditionalsubscript𝜉superscript𝑖𝑿\widehat{p}(\xi_{i}=\xi_{i^{\prime}}\,|\,\bm{X})over^ start_ARG italic_p end_ARG ( italic_ξ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_ξ start_POSTSUBSCRIPT italic_i start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT | bold_italic_X ) exceeds a given threshold, say z=0.5𝑧0.5z=0.5italic_z = 0.5. As an alternative, a clustering estimate can be obtained following Wade and Ghahramani (2018) as the partition minimizing the expected Variation of Information (VI); see also Section 5.

From the same MCMC output, we can recover for each subject i𝑖iitalic_i a (q,q)𝑞𝑞(q,q)( italic_q , italic_q ) matrix collecting estimates of the Posterior Probabilities of edge Inclusion (PPIs). For a given subject i𝑖iitalic_i and edge uv,uvformulae-sequence𝑢𝑣𝑢𝑣u\rightarrow v,u\neq vitalic_u → italic_v , italic_u ≠ italic_v, its PPI is estimated as

p^i(uv|𝑿)=1Ss=1S𝟙{uv𝒟ξi(s)(s)},subscript^𝑝𝑖𝑢conditional𝑣𝑿1𝑆superscriptsubscript𝑠1𝑆1𝑢𝑣subscriptsuperscript𝒟𝑠superscriptsubscript𝜉𝑖𝑠\widehat{p}_{i}(u\rightarrow v\,|\,\bm{X})=\frac{1}{S}\sum_{s=1}^{S}\mathbbm{1% }\left\{u\rightarrow v\in\mathcal{D}^{(s)}_{\xi_{i}^{(s)}}\right\},over^ start_ARG italic_p end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_u → italic_v | bold_italic_X ) = divide start_ARG 1 end_ARG start_ARG italic_S end_ARG ∑ start_POSTSUBSCRIPT italic_s = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_S end_POSTSUPERSCRIPT blackboard_1 { italic_u → italic_v ∈ caligraphic_D start_POSTSUPERSCRIPT ( italic_s ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_ξ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_s ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT } , (20)

corresponding to the proportion of DAGs 𝒟i(s)superscriptsubscript𝒟𝑖𝑠\mathcal{D}_{i}^{(s)}caligraphic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_s ) end_POSTSUPERSCRIPT in the chain containing the directed edge uv𝑢𝑣u\rightarrow vitalic_u → italic_v. Finally, a graph estimate at subject-specific level, say 𝒟^isubscript^𝒟𝑖\widehat{\mathcal{D}}_{i}over^ start_ARG caligraphic_D end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, can be obtained by including those edges for which p^i(uv|𝑿)>zsubscript^𝑝𝑖𝑢conditional𝑣𝑿𝑧\widehat{p}_{i}(u\rightarrow v\,|\,\bm{X})>zover^ start_ARG italic_p end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_u → italic_v | bold_italic_X ) > italic_z for z(0,1)𝑧01z\in(0,1)italic_z ∈ ( 0 , 1 ), e.g z=0.5𝑧0.5z=0.5italic_z = 0.5.

Recall now the definition of causal effect γy,h(𝜽)subscript𝛾𝑦𝜽\gamma_{y,h}(\bm{\theta})italic_γ start_POSTSUBSCRIPT italic_y , italic_h end_POSTSUBSCRIPT ( bold_italic_θ ) provided in Equation (2.2) and assume that the intervened variable and response, Xhsubscript𝑋X_{h}italic_X start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT and Y𝑌Yitalic_Y respectively, are given so that we can omit them from the notation. A Bayesian Model Averaging (BMA) estimate of γisubscript𝛾𝑖\gamma_{i}italic_γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, the subject-specific causal effect, for i{1,,n}𝑖1𝑛i\in\{1,\dots,n\}italic_i ∈ { 1 , … , italic_n }, is given by

γi^=1Ss=1Sγi(s)(𝜽ξi(s)).^subscript𝛾𝑖1𝑆superscriptsubscript𝑠1𝑆superscriptsubscript𝛾𝑖𝑠subscript𝜽superscriptsubscript𝜉𝑖𝑠\widehat{\gamma_{i}}=\frac{1}{S}\sum_{s=1}^{S}\gamma_{i}^{(s)}\left(\bm{\theta% }_{\xi_{i}^{(s)}}\right).over^ start_ARG italic_γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG = divide start_ARG 1 end_ARG start_ARG italic_S end_ARG ∑ start_POSTSUBSCRIPT italic_s = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_S end_POSTSUPERSCRIPT italic_γ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_s ) end_POSTSUPERSCRIPT ( bold_italic_θ start_POSTSUBSCRIPT italic_ξ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_s ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT ) . (21)

The resulting collection {γ1^,,γ^n}^subscript𝛾1subscript^𝛾𝑛\{\widehat{\gamma_{1}},\dots,\widehat{\gamma}_{n}\}{ over^ start_ARG italic_γ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG , … , over^ start_ARG italic_γ end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT } provides estimates of causal effects at individual level, which also naturally account for DAG-model uncertainty through BMA.

5 Simulations

We conduct simulation studies to evaluate the performance of our methodology relative to the tasks of clustering and structure learning and compare it with alternative methods for clustering multivariate categorical data. We consider settings with q=10𝑞10q=10italic_q = 10 nodes, number of clusters K=2𝐾2K=2italic_K = 2 and sample sizes n1=n2subscript𝑛1subscript𝑛2n_{1}=n_{2}italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = italic_n start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT that we range in {100,200,500}100200500\{100,200,500\}{ 100 , 200 , 500 }. We generate the two DAGs 𝒟1subscript𝒟1\mathcal{D}_{1}caligraphic_D start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and 𝒟2subscript𝒟2\mathcal{D}_{2}caligraphic_D start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT independently, so that the two clusters differ in general by the dependence structure among variables, and by fixing a probability of edge inclusion π=0.2𝜋0.2\pi=0.2italic_π = 0.2. The two categorical datasets 𝑿(1),𝑿(2)superscript𝑿1superscript𝑿2\bm{X}^{(1)},\bm{X}^{(2)}bold_italic_X start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT , bold_italic_X start_POSTSUPERSCRIPT ( 2 ) end_POSTSUPERSCRIPT are built by discretization of latent Gaussian observations as detailed in the Supplementary Material (Section 4). Importantly, discretization is based on a collection of thresholds gj(,+)subscript𝑔𝑗g_{j}\in(-\infty,+\infty)italic_g start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∈ ( - ∞ , + ∞ ), that we randomly draw from a Unif(z^j,α,z^j,1α)Unifsubscript^𝑧𝑗𝛼subscript^𝑧𝑗1𝛼\textnormal{Unif}(\hat{z}_{j,\alpha},\hat{z}_{j,1-\alpha})Unif ( over^ start_ARG italic_z end_ARG start_POSTSUBSCRIPT italic_j , italic_α end_POSTSUBSCRIPT , over^ start_ARG italic_z end_ARG start_POSTSUBSCRIPT italic_j , 1 - italic_α end_POSTSUBSCRIPT ), where z^j,αsubscript^𝑧𝑗𝛼\hat{z}_{j,\alpha}over^ start_ARG italic_z end_ARG start_POSTSUBSCRIPT italic_j , italic_α end_POSTSUBSCRIPT denotes the quantile of order α𝛼\alphaitalic_α in the empirical distribution of latent variable Zjsubscript𝑍𝑗Z_{j}italic_Z start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT, independently across j𝑗jitalic_j and for each k=1,2𝑘12k=1,2italic_k = 1 , 2; see again our Supplementary Material. We consider α{0.1,0.4}𝛼0.10.4\alpha\in\{0.1,0.4\}italic_α ∈ { 0.1 , 0.4 } which implies different degrees of similarity among marginal distributions between clusters. For benchmark methods that do not consider a dependence structure between variables we expect a lower ability in recovering the true clustering when α=0.4𝛼0.4\alpha=0.4italic_α = 0.4, namely when marginal distributions are more similar across clusters. Finally, under each scenario, a collection of N=40𝑁40N=40italic_N = 40 multiple (K=2𝐾2K=2italic_K = 2) datasets is generated.

5.1 Clustering

We evaluate the clustering performance of our method (DAG mixture) w.r.t. state-of-the-art approaches and specifically the Latent Class Model (LCM) (Goodman, 1974; Linzer and Lewis, 2011) and K-modes (Huang, 1998). For both methods, we input the number of clusters as K=2𝐾2K=2italic_K = 2. Additionally, to emphasize the contribution of a DAG-based model on cluster identification, we also implement a No DAG strategy, where for each group the DAG is assumed to be known and corresponds to an empty graph. Performances are assessed by comparing the true partition 𝒄𝒄\bm{c}bold_italic_c with the estimated partitions 𝒄^^𝒄\widehat{\bm{c}}over^ start_ARG bold_italic_c end_ARG based on the Variation of Information (VI). Lower values of the metric correspond to better performances. In addition, we expect scenario with α=0.1𝛼0.1\alpha=0.1italic_α = 0.1 to be characterized by overall better performances than α=0.4𝛼0.4\alpha=0.4italic_α = 0.4, because in the latter the difference between the two clusters is mainly due to the dependency structure among the variables. Results are summarized in the boxplots of Figure 1.

Refer to caption
Refer to caption
Figure 1: Simulations. Distribution (across 40404040 replicates) of Variation of Information, for the two different simulation scenarios: α=0.1𝛼0.1\alpha=0.1italic_α = 0.1 and α=0.4𝛼0.4\alpha=0.4italic_α = 0.4. Methods under comparison are: Latent Class Model (LCM), K-modes, No DAG, and our DP mixture of DAGs (DAG mixture).

As it appears, all methods tend to improve as the sample size nksubscript𝑛𝑘n_{k}italic_n start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT grows, with our DAG mixture model clearly outperforming all the benchmarks under all scenarios.

5.2 Structure learning

We now assess the ability of our method in recovering the graphical structure underlying each cluster. To this end, we consider the Structural Hamming Distance (SHD), which represents the number of modifications (edge insertions, edge removals, edge reversal) that are needed to transform the estimated DAG 𝒟^^𝒟\widehat{\mathcal{D}}over^ start_ARG caligraphic_D end_ARG into the true DAG 𝒟𝒟\mathcal{D}caligraphic_D. Specifically, we compare each subject-specific estimated DAG 𝒟^i,i=1,,nformulae-sequencesubscript^𝒟𝑖𝑖1𝑛\widehat{\mathcal{D}}_{i},i=1,\dots,nover^ start_ARG caligraphic_D end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_i = 1 , … , italic_n with 𝒟ξisubscript𝒟subscript𝜉𝑖\mathcal{D}_{\xi_{i}}caligraphic_D start_POSTSUBSCRIPT italic_ξ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT, where ξisubscript𝜉𝑖\xi_{i}italic_ξ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the true class-membership. In addition, we include the Oracle version of our method, in which the true clustering is assumed to be known, and a “one-group” naive strategy (No mixture), which instead assigns all subjects to the same cluster and therefore disregard heterogeneity. Results, for each scenario defined by α𝛼\alphaitalic_α and nksubscript𝑛𝑘n_{k}italic_n start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT, are summarized in Figure 2. The No mixture strategy, which neglects the clustering structure in the data, performs worse than the other two methods under all scenarios and with a worsen performance as the sample size nksubscript𝑛𝑘n_{k}italic_n start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT increases. By converse, the Oracle version of our method performs slightly better than our DP mixture method, a behavior which is more evident under scenario α=0.4𝛼0.4\alpha=0.4italic_α = 0.4 where clustering is indeed more difficult. Finally, both methods improve their performance as nksubscript𝑛𝑘n_{k}italic_n start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT grows.

Refer to caption
Refer to caption
Figure 2: Simulations. Distribution (across 40404040 replicates) of the SHD between estimated DAGs and true DAGs. Methods under comparison are: No mixture, the Oracle version of our method (Oracle) and our DP mixture of DAGs (DAG mixture).

6 Analysis of breast cancer data

6.1 Dataset and model implementation

In this section we analyse a dataset of n=404𝑛404n=404italic_n = 404 women diagnosed with HER2+ breast cancer and treated with potentially cardiotoxic therapies based on monoclonal antibodies (trastuzumab) and chemotherapy drugs (antracyclines). Variables in the dataset include: demographic and physical features, such as age, height and weight (expressed in terms of Body Mass Index, BMI, and included through a three-level categorical variable); risk factors, such as diagnosis of hypertension (HTA), dyslipidemia (DL), diabetes mellitus (DM), smoking (smoker and ex smoker); past cardiac diseases, namely cardiac insufficiency (CIprev), ischemic cardiomyopathy (ICMprev), arrhythmia (ARRprev), valvulopathy (VALVprev), valve surgery (valvsurgprev). In addition, the dataset provides information regarding treatments, antiHER2 monoclonal therapy (antiHER2) and/or antracyclines (AC), that were administrated to patients. Finally, the target variable corresponds to Cancer Therapy-Related Cardiac Dysfunction (CTRCD), a binary outcome indicating the occurrence (1) or not (0) of cardiac dysfunction. The original dataset is provided as a supplement to Piñeiro-Lamas et al. (2023) and included as supplementary material to our paper. Notably, all variables are categorical, with the exception of age and heart rate which have been discretized into two dummy variables, corresponding to middle vs low, and high vs low. While in general the cardiotoxic effects of the available oncological therapies have been established in the literature, still, the occurrence of CTRCD can vary substantially among patients because of both observed features (such as risk factors) or even unobserved characteristics. Accordingly, it is of interest to quantify causal effects w.r.t. the occurrence of CTRCD at individual-level, which is crucial for the development and administration of appropriate antiHER2 therapies.

Given the structure of the dataset, we constrain the adjacency matrix of DAGs in such a way that CTRCD can only (potentially) be a response, i.e. no outgoing edges are allowed, and treat age, BMI, smoker, ex smoker as exogenous variables, i.e. with no incoming edges from other nodes, while possible links/dependencies between them are allowed. Moreover, we assume that the absence/presence of risk factors can imply the administration of a therapy (AC, antiHER2), while the converse is not possible. We implement our mixture model by running the MCMC scheme for S=100000𝑆100000S=100000italic_S = 100000 iterations, which include a burn-in period of 10000100001000010000 draws that are discarded from the posterior analysis. With regard to hyperparameters, we fix c=3,d=1formulae-sequence𝑐3𝑑1c=3,d=1italic_c = 3 , italic_d = 1 in the Gamma prior on the DP precision parameter α𝛼\alphaitalic_α. The common hyperparameter a𝑎aitalic_a on the collection of Dirichlet priors on 𝜽sj|pa(j)subscriptsuperscript𝜽conditional𝑗pa𝑗𝑠\bm{\theta}^{j\,|\,\mathrm{pa}(j)}_{s}bold_italic_θ start_POSTSUPERSCRIPT italic_j | roman_pa ( italic_j ) end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT is instead fixed as a=1𝑎1a=1italic_a = 1, while in the hierarchical prior on DAGs we fix a=1𝑎1a=1italic_a = 1, b=2q𝑏2𝑞b=2qitalic_b = 2 italic_q, reflecting an a priori assumption of sparsity in the graph space. To assess the convergence of our algorithm we also run two independent MCMC chains; results suggest an overall agreement in terms of clustering (evaluated through posterior similarity matrices), structure learning (based on estimated PPIs) and posterior distribution of causal-effect parameters. Results relative to all such quantities are presented discussed in the following sections.

6.2 Clustering

We summarize the clustering structure learned by our model by building an (n,n)𝑛𝑛(n,n)( italic_n , italic_n ) posterior similarity matrix; see Equation (19). From the latter, we recover a point estimate of the clustering based on the minimum posterior expectation of VI (Wade and Ghahramani, 2018); see also Section 4.2. As the result, we obtain two clusters, that we label as 𝒄^1subscript^𝒄1\widehat{\bm{c}}_{1}over^ start_ARG bold_italic_c end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and 𝒄^2subscript^𝒄2\widehat{\bm{c}}_{2}over^ start_ARG bold_italic_c end_ARG start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, whose sizes are n1=101subscript𝑛1101n_{1}=101italic_n start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = 101 and n2=303subscript𝑛2303n_{2}=303italic_n start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT = 303 respectively. The posterior similarity matrix is represented as a heatmap in Figure 3, with individuals arranged according to the estimated clusters, specifically those assigned to 𝒄^1subscript^𝒄1\widehat{\bm{c}}_{1}over^ start_ARG bold_italic_c end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT first and then those in 𝒄^2subscript^𝒄2\widehat{\bm{c}}_{2}over^ start_ARG bold_italic_c end_ARG start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT. The two-cluster structure is pretty evident from the matrix since the probabilities of membership to the same group approach value one (zero) for individuals assigned to the same (to a different) estimated cluster.

Refer to caption
Figure 3: Breast cancer data. Estimated posterior similarity matrix, with individuals arranged according to the clustering structure estimated via the posterior expectation of VI criterion (cluster labeled as 𝒄^1subscript^𝒄1\widehat{\bm{c}}_{1}over^ start_ARG bold_italic_c end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and 𝒄^2subscript^𝒄2\widehat{\bm{c}}_{2}over^ start_ARG bold_italic_c end_ARG start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT respectively).

We then investigate differences between the estimated clusters by comparing the empirical (marginal) distribution of each variable across 𝒄^1subscript^𝒄1\widehat{\bm{c}}_{1}over^ start_ARG bold_italic_c end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and 𝒄^2subscript^𝒄2\widehat{\bm{c}}_{2}over^ start_ARG bold_italic_c end_ARG start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT. For each cluster, we provide a graphical representation based on a spider plot, which includes for each categorical (binary) variable the proportion (percentage values) of observations corresponding to level labeled as 1111 of the variable. These values are reported as colored points joined by lines in each graph; see Figure 4. Additionally, each plot includes the same proportions as obtained from the pooled sample, namely when no clustering is considered, that are instead represented with grey dots joined by grey lines.

Refer to caption
Refer to caption
Figure 4: Breast cancer data. Spider plots for the comparison of the empirical marginal distribution of each variable between estimated cluster (colored line) and pooled dataset (gray line). Left and right panels corresponding to clusters 1 and 2 respectively.

While for cluster 2 (right-side plot) the cluster-proportions are almost aligned with those computed on the pooled dataset, cluster 1 presents a few peculiarities. In particular, patients included in cluster 1 are in general older, as it appears from the higher frequency associated to variable age 2, and characterized by a higher BMI index. Additionally, the proportion of patients suffering from hypertension (HTA), dyslipidemia (DL), and diabetes mellitus (DM) is in general higher in comparison with the pooled dataset, than those in cluster 2. We emphasize that the results above allow to capture differences between estimated clusters that are reflected in the marginal (empirical) distribution of the variables. Importantly however, differences may emerge from the joint distribution of the variables, and specifically in the dependence (DAG) structure for patients assigned to different estimated clusters. To this end, in the next section we provide results relative to structure learning, which is carried out at subject-specific level.

6.3 Structure learning

Following Equation (20), we provide for each subject i=1,,n𝑖1𝑛i=1,\dots,nitalic_i = 1 , … , italic_n an estimate of the Posterior Probabilities of Inclusion (PPIs), that we collect in a (q,q)𝑞𝑞(q,q)( italic_q , italic_q ) matrix. Because of the structure of our baseline measure, we expect individuals that with high probability are assigned to the same cluster (Figure 3) to share similar dependence structures, and in particular similar PPIs. By converse, differences in the underlying graphical structure are expected for individuals assigned to distinct estimated clusters. For two randomly chosen subjects, whose membership is estimated to be cluster 1 and cluster 2 respectively, the corresponding PPIs are reported as heatmaps in Figure 5. The underlying dependence structures are both characterized by sparsity. This is much more evident in subject from cluster 1, where PPIs are more uniform and there are no edges whose PPI exceeds the 0.50.50.50.5 threshold. Differently, the heatmap from cluster 2 shows a few variables that are more strongly related to the outcome CTRCD, specifically AC, together with several risk factors, in particular hearth rhythm, VALVprev, ARRprev. Accordingly, we expect such differences to imply heterogeneous causal effects between variables. We present some of these results in the next section.

Refer to caption
Refer to caption
Figure 5: Breast cancer data. Heat maps of posterior probabilities of edge inclusion for two subject-specific graphs. Left map corresponds to one subjects in estimated cluster 1; right map to one subject in cluster 2.

6.4 Causal effects

The ultimate goal of our analysis is to quantify the (causal) effect of anticancer therapies on the occurrence of cardiotoxicity. In particular, patients in the study were treated with therapies based on either antracycline (AC) or trastuzumab (antiHER2). We then consider the causal effect on the occurrence of cardiotoxicity (variable CTRCD) implied by the administration of AC and antiHER2 therapies. To this end, we recover from our MCMC output a posterior distribution for each causal-effect parameter and each subject i=1,,n𝑖1𝑛i=1,\dots,nitalic_i = 1 , … , italic_n, that we summarize through BMA estimates following Equation (21). Our results are summarized in the two scatterplots of Figure 6, each reporting BMA causal-effect estimates (y-axis) computed across individuals (x-axis) and with subjects arranged according to the estimated clustering with two groups (Section 6.2). One can appreciate the heterogeneity in the estimates, with individuals assigned to the same cluster sharing similar values, except for a few patients in each group. Interestingly, these subjects are also characterized by a higher uncertainty in cluster allocation between either group 1 or 2; see in particular the posterior similarity matrix in Figure 1. As an interesting result, AC and antiHER2 treatments in general increase the probability of CTRCD occurrence for individuals assigned to cluster 2, while the effect is less pronounced, or even null, for cluster 1. Notably, cluster 1 is characterized by older patients and with a higher prevalence of some risk factors; see in particular Figure 4. Accordingly, in such patients, the occurrence of cardiotoxicity might be due to the presence of such risk factors, that may cause cardiac diseases, rather than implied by the therapy itself. Therefore, the direct effect of AC and antiHER2 therapies is lower in comparison with the same estimates in cluster 2.

In addition, to emphasize the role played by population heterogeneity in causal effect estimation, we compare our results with those based on an alternative One-group naive strategy, corresponding to a standard categorical DAG model in which all individuals are assigned to the same cluster. In such case, causal effect estimates are uniform across subjects and are included as horizontal lines in the two plots. Each resulting estimate is approximately an average of cluster-specific causal estimates, suggesting that a causal effect analysis that disregards population heterogeneity would over- and under- estimate the risk of CTRCD development across individuals.

Refer to caption
(a)
Refer to caption
Figure 6: Breast cancer data. BMA estimates of subject-specific causal effects, with individuals arranged according to the estimated clustering structure (cluster labeled as 𝒄^1subscript^𝒄1\widehat{\bm{c}}_{1}over^ start_ARG bold_italic_c end_ARG start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT in green and 𝒄^2subscript^𝒄2\widehat{\bm{c}}_{2}over^ start_ARG bold_italic_c end_ARG start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT in orange). Blue lines correspond to causal-effect estimated obtained under a One-naive strategy disregarding heterogeneity.

7 Discussion

We proposed a modeling framework for structure learning and causal inference under heterogeneity that applies to multivariate categorical data. Our methodology is based on a Dirichlet Process (DP) mixture of categorical Directed Acyclic Graphs (DAGs) which allows to cluster subjects characterized by similar patterns of dependencies into homogeneous groups, and to provide estimates of causal effects at personalized, namely subject-specific, level. When adopted for clustering purposes, our method clearly outperforms benchmark strategies that do not account for dependence relations in the joint distribution of variables. Most importantly, our causal-effect analysis shows that approaches neglecting possible population heterogeneity can provide misleading estimates of causal effects. With regard to our application to breast cancer data, the probability of cardiotoxic side-effects implied by anticancer therapies could be underestimated for several patients, with serious consequences in clinical decision making for optimal therapies’ administration.

A possible extension of our model could be the analysis of mixed multivariate data, which comprise both quantitative and categorical measurements. Specifically in a biomedical setting, one can collect besides categorical clinical features the expression levels of genes that are involved in the progression of the disease. Causal inference methods that integrate such biological information can provide a more precise quantification of causal effects for the development of personalized therapies. More in general, multivariate models that can manage mixed data would be valuable for clustering purposes too, as several real-world applications frequently involve data of various types. Mixed-data represent an interesting framework for graphical modelling which has been addressed in a few works, although without accounting for population heterogeneity; see for instance Castelletti (2024).

The breast cancer dataset provided by Piñeiro-Lamas et al. (2023) includes Tissue Doppler Imaging (TDI) data, which measure the rate of contraction and relaxation of the cardiac muscle. TDI measurements can be treated as functional data, which whenever included in our analysis could help identifying sub-groups of patients who experienced different side effects of anti-HER2 therapies depending on the presence of cardiac dysfunctions. The inclusion of functional variables in our framework presents challenges that are related to the development of a DAG-based model. As a starting point, Qiao et al. (2019) propose a frequentist method to learn dependencies across functional variables, which is based on undirected graphical models and lasso-type penalization techniques.


SUPPLEMENTARY MATERIAL

Supplementary Materials comprise theoretical results on the computation of prior and posterior predictive distributions, details on sampling from the baseline over DAGs, and data generation for our simulation studies.

References

  • Argiento and De Iorio (2022) Argiento, R. and M. De Iorio (2022). Is Infinity that Far? A Bayesian Nonparametric Perspective of Finite Mixture Models. The Annals of Statistics 50(5), 2641 – 2663.
  • Argiento et al. (2022) Argiento, R., E. Filippi-Mazzola, and L. Paci (2022). Model-based Clustering of Categorical Data based on the Hamming Distance. arXiv preprint.
  • Athey and Imbens (2016) Athey, S. and G. Imbens (2016). Recursive Partitioning for Heterogeneous Causal Effects. Proceedings of the National Academy of Sciences 113, 965 – 2020.
  • Bargagli Stoffi et al. (2022) Bargagli Stoffi, F., K. De Witte, and G. Gnecco (2022). Heterogeneous Causal Effects with Imperfect Compliance: A Bayesian Machine Learning Approach. The Annals of Applied Statistics 16(3), 1986 – 2009.
  • Bowles et al. (2012) Bowles, E. J. A., R. Wellman, H. S. Feigelson, A. A. Onitilo, A. N. Freedman, T. Delate, L. A. Allen, L. Nekhlyudov, K. A. B. Goddard, R. L. Davis, L. A. Habel, M. U. Yood, C. Mccarty, D. J. Magid, E. H. Wagner, and P. S. Team (2012, 09). Risk of Heart Failure in Breast Cancer Patients After Anthracycline and Trastuzumab Treatment: A Retrospective Cohort Study. JNCI: Journal of the National Cancer Institute 104(17), 1293 – 1305.
  • Castelletti (2024) Castelletti, F. (2024). Learning Bayesian Networks: A Copula Approach for Mixed-Type Data. Psychometrika 89, 658 – 686.
  • Castelletti and Consonni (2023) Castelletti, F. and G. Consonni (2023). Bayesian Graphical Modeling for Heterogeneous Causal Effects. Statistics in Medicine 42, 15–32.
  • Castelletti et al. (2024) Castelletti, F., G. Consonni, and M. L. Della Vedova (2024, 07). Joint Structure Learning and Causal Effect Estimation for Categorical Graphical Models. Biometrics 80(3).
  • Dempke et al. (2023) Dempke, W. C., R. Zielinski, C. Winkler, S. Silberman, S. Reuther, and W. Priebe (2023). Anthracycline-Induced Cardiotoxicity – Are we About to Clear this Hurdle? European Journal of Cancer 185, 94 – 104.
  • Dempsey et al. (2021) Dempsey, N., A. Rosenthal, N. Dabas, Y. Kropotova, M. Lippman, and N. H. Bishopric (2021, 07). Trastuzumab-Induced Cardiotoxicity: A Review of Clinical Risk Factors, Pharmacologic Prevention, and Cardiotoxicity of other HER2-Directed Therapies. Breast Cancer Research and Treatment 188(17), 21 – 36.
  • Dominici et al. (2021) Dominici, F., F. J. Bargagli Stoffi, and F. Mealli (2021). From Controlled to Undisciplined Data: Estimating Causal Effects in the Era of Data Science Using a Potential Outcome Framework. Harvard Data Science Review 3(3).
  • Escobar and West (1994) Escobar, M. and M. West (1994). Bayesian Density Estimation and Inference Using Mixtures. Journal of the American Statistical Association 90(430), 577 – 588.
  • Ferguson (1973) Ferguson, T. S. (1973). A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics 1(2), 209 – 230.
  • Frühwirth-Schnatter et al. (2021) Frühwirth-Schnatter, S., G. Malsiner-Walli, and B. Grün (2021). Generalized Mixtures of Finite Mixtures and Telescoping Sampling. Bayesian Analysis 16(4), 1279 – 1307.
  • Geiger and Heckerman (1997) Geiger, D. and D. Heckerman (1997, 02). A Characterization of the Dirichlet Distribution through Global and Local Parameter Independence. Annals of Statistics 25, 1344 – 1369.
  • Goodman (1974) Goodman, L. A. (1974). Exploratory Latent Structure Analysis Using Both Identifiable and Unidentifiable Models. Biometrika 61(2), 215 – 231.
  • Hahn et al. (2020) Hahn, P. R., J. S. Murray, and C. M. Carvalho (2020). Bayesian Regression Tree Models for Causal Inference: Regularization, Confounding, and Heterogeneous Effects (with Discussion). Bayesian Analysis 15(3), 965 – 2020.
  • Heckerman et al. (1995) Heckerman, D., D. Geiger, and D. M. Chickering (1995). Learning Bayesian Networks: The Combination of Knowledge and Statistical Data. Machine Learning 20(3), 197 – 243.
  • Huang (1998) Huang, Z. (1998). Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values. Data Mining and Knowledge Discovery 2, 283 – 304.
  • Katzorke et al. (2013) Katzorke, N., B. Kathrin Rack, L. Haeberle, J. Katharina Neugebauer, C. Anna Melcher, C. Hagenbeck, H. Forstbauer, H. Ulrich Ulmer, U. Soeling, R. Kreienberg, T. N. Fehm, A. Schneeweiss, M. W. Beckmann, P. A. Fasching, and W. Janni (2013). Prognostic Value of HER2 on Breast Cancer Survival. Journal of Clinical Oncology 31(15), 600 – 640.
  • Lee et al. (2021) Lee, K., F. Bargagli Stoffi, and F. Dominici (2021). Causal Rule Ensemble: Interpretable Inference of Heterogeneous Treatment Effects. arXiv preprint.
  • Linero and Antonelli (2022) Linero, A. R. and J. L. Antonelli (2022). The How and Why of Bayesian Nonparametric Causal Inference. Wiley Interdisciplinary Reviews: Computational Statistics 15.
  • Linzer and Lewis (2011) Linzer, D. A. and J. B. Lewis (2011). poLCA: An R Package for Polytomous Variable Latent Class Analysis. Journal of Statistical Software 42(10), 1 – 29.
  • Lotrionte et al. (2013) Lotrionte, M., G. Biondi-Zoccai, A. Abbate, G. Lanzetta, F. D’Ascenzo, V. Malavasi, M. Peruzzi, G. Frati, and G. Palazzoni (2013). Review and Meta-Analysis of Incidence and Clinical Predictors of Anthracycline Cardiotoxicity. The American Journal of Cardiology 112(12), 1980 – 1984.
  • Ma et al. (2015) Ma, J., B. Hobbs, and F. Stingo (2015). Statistical Methods for Establishing Personalized Treatment Rules in Oncology. BioMed Research International 2015(1), 670691.
  • MacQueen (1967) MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Volume 1: Statistics, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pp.  281 – 297. University of California Press.
  • Malsiner-Walli et al. (2024) Malsiner-Walli, G., B. Grün, and S. Frühwirth-Schnatter (2024). Without Pain – Clustering Categorical Data Using a Bayesian Mixture of Finite Mixtures of Latent Class Analysis Models. arXiv preprint.
  • Müller and Rodriguez (2013) Müller, P. and A. Rodriguez (2013). Dirichlet process. In Nonparametric Bayesian Inference, Volume 9, pp.  23 – 42. Institute of Mathematical Statistics.
  • Neal (2000) Neal, R. M. (2000). Markov Chain Sampling Methods for Dirichlet Process Mixture Models. Journal of Computational and Graphical Statistics 9(2), 249 – 265.
  • Oganisian et al. (2021) Oganisian, A., N. Mitra, and J. A. Roy (2021). A Bayesian Nonparametric Model for Zero-Inflated Outcomes: Prediction, Clustering, and Causal Estimation. Biometrics 77(1), 125 – 135.
  • Pearl (2000) Pearl, J. (2000). Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge.
  • Pearl (2003) Pearl, J. (2003). Statistics and Causal Inference: A Review. Sociedad de Estadistica e Investigacion Operativa 12, 101–165.
  • Piñeiro-Lamas et al. (2023) Piñeiro-Lamas, B., A. López-Cheda, R. Cao, L. Ramos-Alonso, G. González-Barbeito, C. Barbeito-Caamaño, and A. Bouzas-Mosquera (2023). A Cardiotoxicity Dataset for Breast Cancer Patients. Scientific Data 10(1), 527.
  • Qiao et al. (2019) Qiao, X., S. Guo, and G. M. James (2019). Functional Graphical Models. Journal of the American Statistical Association 525, 211 – 222.
  • Rodríguez et al. (2011) Rodríguez, A., A. Lenkoski, and A. Dobra (2011). Sparse Covariance Estimation in Heterogeneous Samples. Electronic Journal of Statistics 5, 981 – 1014.
  • Roy et al. (2016) Roy, J., K. J. Lum, and M. J. Daniels (2016). A Bayesian Nonparametric Approach to Marginal Structural Models for Point Treatments and a Continuous or Survival Outcome. Biostatistics 18(1), 32 – 47.
  • Rubin (2005) Rubin, D. B. (2005). Causal Inference Using Potential Outcomes: Design, Modeling, Decisions. Journal of the American Statistical Association 100(469), 322 – 331.
  • Sethuraman (1994) Sethuraman, J. (1994). A Constructive Definition of Dirichlet Priors. Statistica Sinica 4(2), 639 – 650.
  • Slamon et al. (1987) Slamon, D. J., G. M. Clark, S. G. Wong, W. J. Levin, A. Ullrich, and W. L. McGuire (1987). Human Breast Cancer: Correlation of Relapse and Survival with Amplification of the HER-2/Neu Oncogene. Science 235(4785), 177 – 182.
  • Wade and Ghahramani (2018) Wade, S. and Z. Ghahramani (2018). Bayesian Cluster Analysis: Point Estimation and Credible Balls (with Discussion). Bayesian Analysis 13(2), 559 – 626.
  • Zorzetto et al. (2024) Zorzetto, D., F. Bargagli Stoffi, A. Canale, and F. Dominici (2024). Confounder-Dependent Bayesian Mixture Model: Characterizing Heterogeneity of Causal Effects in Air Pollution Epidemiology. Biometrics 80(2).