Parameter identification in linear non-Gaussian causal models under general confounding

Daniele Tramontano Technical University of Munich; School of Computation, Information and Technology, Germany Mathias Drton Technical University of Munich; School of Computation, Information and Technology, Germany Jalal Etesami Technical University of Munich; School of Computation, Information and Technology, Germany

Abstract

Linear non-Gaussian causal models postulate that each random variable is a linear function of parent variables and non-Gaussian exogenous error terms. We study identification of the linear coefficients when such models contain latent variables. Our focus is on the commonly studied acyclic setting, where each model corresponds to a directed acyclic graph (DAG). For this case, prior literature has demonstrated that connections to overcomplete independent component analysis yield effective criteria to decide parameter identifiability in latent variable models. However, this connection is based on the assumption that the observed variables linearly depend on the latent variables. Departing from this assumption, we treat models that allow for arbitrary non-linear latent confounding. Our main result is a graphical criterion that is necessary and sufficient for deciding the generic identifiability of direct causal effects. Moreover, we provide an algorithmic implementation of the criterion with a run time that is polynomial in the number of observed variables. Finally, we report on estimation heuristics based on the identification result, explore a generalization to models with feedback loops, and provide new results on the identifiability of the causal graph.

Keywords: Causal effect, graphical model, independent component analysis, latent variable model, structural causal model

1 Introduction

Graphical models, directed or undirected, are ubiquitous in modern statistical practice for modeling multivariate distributions; see, e.g., Lauritzen (1996); Maathuis et al. (2019). In particular, structural equation models (SEMs) associated with directed acyclic graphs (DAGs) provide a concise and effective way of stating the additional assumptions necessary to identify the causal parameters of interest (Pearl, 2009; Spirtes et al., 2000). As argued in Pearl (2017), understanding a causal phenomenon for linear SEMs is often a necessary step towards a generalized understanding of the same in a nonparametric framework.

The specific framework we focus on in this paper is that of linear non-Gaussian models that allow for general, possibly non-linear, confounding. Each such model is naturally associated with an acyclic directed mixed graph (ADMG); compare Richardson et al. (2023) and references therein. Let $X=(X_{v})_{v\in V}$ be a collection of observed random variables, and let $\mathcal{G}=(V,E_{\rightarrow{}},E_{\leftrightarrow{}})$ be an ADMG whose vertices index the random variables $X_{v}$ and which contains two sets of edges $E_{\rightarrow{}},E_{\leftrightarrow{}}\subseteq V\times V\setminus\{(v,v):v\in V\}$. The edges in $E_{\rightarrow{}}$ are directed, and we depict them by $u\xrightarrow[]{}v$. Those in $E_{\leftrightarrow{}}$ are bidirected and depicted by $u\xleftrightarrow{}v$. The model encoded by $\mathcal{G}$ posits that

X_{v}=\sum_{w\,:\,w\to v\in E_{\rightarrow{}}}\lambda_{wv}X_{w}+\varepsilon_{v},\quad v\in V,

(1.1)

where possible latent confounding is subsumed in the error variables $\varepsilon_{v}$ , which are allowed to be dependent in accordance with the bidirected edges in $E_{\leftrightarrow{}}$ . In particular, if $v\xleftrightarrow{}w\in E_{\leftrightarrow{}}$ , then the two errors $\varepsilon_{v}$ and $\varepsilon_{w}$ may be (arbitrarily) dependent.

Example 1.1.

To illustrate the above definition, consider the graph from Fig. 1. It specifies an instrumental variable model for the joint distribution of three observed variables. The equations from (1.1) take the form:

\begin{aligned}X_{1}&=\varepsilon_{1},\\ X_{2}&=\lambda_{12}X_{1}+\varepsilon_{2},\\ X_{3}&=\lambda_{23}X_{2}+\varepsilon_{3},\end{aligned}

(1.2)

where $\varepsilon_{1}$ is independent of $(\varepsilon_{2},\varepsilon_{3})$ but $\varepsilon_{2}$ and $\varepsilon_{3}$ may be dependent.

Figure 1: Instrumental variable graph based on Evans and Ringel (1999).

In this paper, we treat the problem of deciding which of the direct causal effects $\lambda_{wv}$ in (1.1) can be identified from the joint distribution of the vector of observed variables $X$. Our main results give a complete graphical characterization in terms of the ADMG $\mathcal{G}$, an efficient algorithm to check the resulting identifiability criterion, and a simple practical estimation method that is based on empirical measures of dependence among estimates of the errors $(\varepsilon_{v})_{v\in V}$. We should highlight that our characterization targets generic identifiability, which is the notion most suitable for problems such as the instrumental variable model from Example 1.1. There, the key coefficient of interest $\lambda_{23}$ is identified as a ratio of covariances, $\text{Cov}[X_{1},X_{3}]/\text{Cov}[X_{1},X_{2}]$, but only if the denominator is nonzero, which requires the genericity constraint that $\lambda_{12}\neq 0$. As part of our results, we also develop a framework for formulating genericity conditions on the infinite-dimensional set of non-Gaussian error distributions, which we justify via cumulants truncated at arbitrary order.
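The covariance-ratio identity for Example 1.1 can be checked numerically. The following is a minimal simulation sketch, not the estimation method of this paper: the coefficient values, the uniform error distributions, the hypothetical latent variable `h`, and the helper `cov` are all assumptions chosen for illustration. The confounder enters $\varepsilon_{3}$ non-linearly, so a plain regression of $X_{3}$ on $X_{2}$ is biased while the ratio of covariances is not.

```python
import random

def cov(xs, ys):
    """Empirical covariance of two equally long samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

random.seed(0)
lam12, lam23 = 1.5, 0.7   # illustrative values, not taken from the paper
n = 100_000

x1, x2, x3 = [], [], []
for _ in range(n):
    h = random.uniform(-1, 1)                   # hypothetical latent confounder
    e1 = random.uniform(-1, 1)                  # independent of (e2, e3)
    e2 = h + 0.5 * random.uniform(-1, 1)        # e2 depends on h linearly
    e3 = h ** 3 + 0.5 * random.uniform(-1, 1)   # e3 depends on h non-linearly
    a = e1
    b = lam12 * a + e2
    c = lam23 * b + e3
    x1.append(a)
    x2.append(b)
    x3.append(c)

iv_estimate = cov(x1, x3) / cov(x1, x2)     # requires lam12 != 0
ols_estimate = cov(x2, x3) / cov(x2, x2)    # biased because e2 and e3 are dependent
print(iv_estimate, ols_estimate)
```

With $\lambda_{12}$ set to zero the denominator of `iv_estimate` would be a sample estimate of zero, which is the non-generic case excluded above.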

1.1 Related Work

For fully nonparametric SEMs, the ID algorithm (Shpitser and Pearl, 2006; Kivva et al., 2022; Shpitser, 2023; Kivva et al., 2023a) is sound and complete for determining global identifiability in a given ADMG. In contrast to the generic setting treated in this paper, global identifiability requires identifiability under every single distribution in the model. It follows from the work of Drton et al. (2011) that the graphical criterion underpinning the ID algorithm also applies to global identifiability within linear Gaussian models. However, the graphical prerequisites for achieving global identification are frequently overly restrictive. For example, any ADMG containing a bow (a pair of nodes $u,v\in V$ such that $u\to v,u\xleftrightarrow{}v\in\mathcal{G}$) would fail to meet the criteria for global identifiability, thus overlooking significant scenarios such as the IV model illustrated in Fig. 1. Consequently, when studying linear SEMs, researchers have shifted their focus towards generic identifiability results, for which much progress has been made recently, but for which a complete characterization is still lacking; see, e.g., Kumor et al. (2020); Barber et al. (2022).

Non-Gaussianity of the error term has been extensively employed to achieve identifiability of the graphical structure of causal models; see Shimizu (2022) for a recent account. In contrast, its application to causal effect identification has received little attention (Salehkaleybar et al., 2020; Kivva et al., 2023b; Shuai et al., 2023). All the works just mentioned explicitly model the confounding as linear. This approach, on the one hand, allows one to draw on the vast literature on overcomplete independent component analysis (OICA) in order to obtain stronger identifiability results (Eriksson and Koivunen, 2004). On the other hand, however, it restricts the possible confounding structures. Moreover, as opposed to ICA in the fully observed case, overcomplete ICA is not separable (Eriksson and Koivunen, 2004, Thm. 4), implying that the only algorithms for solving OICA that come with theoretical guarantees require making parametric distributional assumptions and using ad-hoc EM-type algorithms to solve the optimization problem; see, e.g., Lewicki and Sejnowski (2000).

The work most similar to ours in terms of distributional assumptions is that of Wang and Drton (2023), which focuses on structural identifiability. Wang and Drton (2023) note that bow-free acyclic graphs are identifiable from observational data and provide an estimation algorithm for such graphs. Liu et al. (2021) extended the algorithm to learn graphs with multi-directed edges.

1.2 Organization of the Paper

The rest of the paper is organized as follows. Section 1.3 contains standard graphical model notation used in the rest of the paper. In Section 2, we formally define the identifiability problem we study. Section 3 contains the main results of our work; we provide a necessary and sufficient graphical condition for generic identifiability in the model under study. In Section 4, we prove that our criterion can be certified in polynomial time in the size of the graph. Section 5 contains a detailed analysis of the genericity assumption. In Section 6, we apply our results from Section 3 to the identifiability of the causal graph, providing new insights on the model equivalence of two ADMGs. In Section 7, we provide partial results about the identification for cyclic models. In Section 8, we note that when the identification criterion is met, the parameters can be estimated as the solution to a suitable optimization problem, and we present a simulation study to assess the performance of the estimation method. In Section 9, we draw final conclusions and suggest future research directions. Appendix A contains further preliminary material and details of the proofs.

1.3 Notation

Figure 2: An acyclic directed mixed graph (ADMG) with 4 nodes.

A mixed graph is a triple $\mathcal{G}=(V,E_{\rightarrow{}},E_{\leftrightarrow{}})$, where $E_{\rightarrow{}},E_{\leftrightarrow{}}\subseteq V\times V\setminus\{(v,v):v\in V\}$. We depict the pairs in $E_{\rightarrow{}}$ by $u\xrightarrow[]{}v$ and the ones in $E_{\leftrightarrow{}}$ by $u\xleftrightarrow{}v$; we refer to them as directed and bidirected edges, respectively. For an example, see Fig. 2.

Let $u,v\in V$ be two vertices in $\mathcal{G}$. A directed path from $u$ to $v$ is a sequence of nodes $\pi=(u=v_{0},\dots,v_{k}=v)$ such that $v_{i}\to v_{i+1}\in E_{\rightarrow{}}$ for all $i=0,\dots,k-1$. This includes the case $k=0$, where $u=v$ and the path has no edges; we call such a path trivial. We denote by $\mathcal{P}(u,v)$ the set of all directed paths from $u$ to $v$.

A directed cycle is a non-trivial directed path from a node $u$ to itself. The graph $\mathcal{G}$ is acyclic if it contains no directed cycles; we refer to this class of graphs as acyclic directed mixed graphs (ADMGs). Fig. 2 shows one instance. If the graph is acyclic, we can define a causal order on the nodes of $\mathcal{G}$, that is, a total order $\leq$ on $V$ such that $u\leq v$ whenever $\mathcal{P}(u,v)\neq\emptyset$. When considering parameter matrices associated to $\mathcal{G}$, we will typically fix a causal order $\leq$ and assume that the vertices in $V$ are enumerated as $v_{1},\dots,v_{p}$ with $i\leq j$ whenever $v_{i}\leq v_{j}$; compare Fig. 2.

We will consider the following genealogical relations that are commonly used to indicate relationships between the vertices of an ADMG (parents, ancestors, children, descendants, siblings):

$\displaystyle\mathop{\rm pa}\nolimits(v)$	$\displaystyle:=\{u\in V\>:\>u\to v\in\mathcal{G}\},\qquad$	$\displaystyle\mathop{\rm an}\nolimits(v):=\{u\in V\>:\>\mathcal{P}(u,v)\neq\emptyset\},$
$\displaystyle\mathop{\rm ch}\nolimits(v)$	$\displaystyle:=\{u\in V\>:\>v\to u\in\mathcal{G}\},\qquad$	$\displaystyle\mathop{\rm de}\nolimits(v):=\{u\in V\>:\>\mathcal{P}(v,u)\neq\emptyset\},$
$\displaystyle\mathop{\rm sib}\nolimits(v)$	$\displaystyle:=\{u\in V\>:\>u\xleftrightarrow{}v\in\mathcal{G}\},\qquad$	$\displaystyle\mathop{\rm Sib}\nolimits(v):=\mathop{\rm sib}\nolimits(v)\cup\{v\}.$

Note that $v\in\mathop{\rm an}\nolimits(v)$ and $v\in\mathop{\rm de}\nolimits(v)$ , via trivial paths. For a subset of vertices $U\subseteq V$ , we define $\mathop{\rm pa}\nolimits(U)=\cup_{u\in U}\mathop{\rm pa}\nolimits(u)$ and make the analogous convention for the other relations.
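The genealogical relations above translate directly into set computations. The sketch below is an illustration only: the edge sets are an assumption about the graph in Fig. 2, inferred from the independence statements and sibling sets quoted in Examples 2.1 and 3.1, and the function names simply mirror the notation of this section.

```python
# Edge sets assumed for the ADMG in Fig. 2 (inferred from Examples 2.1 and 3.1).
V = [1, 2, 3, 4]
DIRECTED = {(1, 2), (1, 3), (2, 4), (3, 4)}
BIDIRECTED = {(1, 2), (2, 4), (3, 4)}

def pa(v):
    """Parents: u with u -> v."""
    return {u for (u, w) in DIRECTED if w == v}

def ch(v):
    """Children: w with v -> w."""
    return {w for (u, w) in DIRECTED if u == v}

def an(v):
    """Ancestors of v, including v itself via the trivial path."""
    result, stack = {v}, [v]
    while stack:
        for u in pa(stack.pop()):
            if u not in result:
                result.add(u)
                stack.append(u)
    return result

def de(v):
    """Descendants of v, including v itself."""
    return {u for u in V if v in an(u)}

def sib(v):
    """Siblings: nodes joined to v by a bidirected edge."""
    return {a if b == v else b for (a, b) in BIDIRECTED if v in (a, b)}

def Sib(v):
    """Siblings of v together with v."""
    return sib(v) | {v}

for v in V:
    print(v, sorted(pa(v)), sorted(an(v)), sorted(de(v)), sorted(Sib(v)))
```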

Let $U=\{u_{1},\dots,u_{n}\}$ and $W=\{w_{1},\dots,w_{n}\}$ be two subsets of $V$ that have the same cardinality $n$ , and for which we have fixed an ordering of their elements. Let $S_{n}$ be the symmetric group on $[n]=\{1,\dots,n\}$ . We say that $\Pi=(\pi_{1},\dots,\pi_{n})$ is a system of paths between $U$ and $W$ , if there exists a permutation $\sigma_{\Pi}\in S_{n}$ such that $\pi_{k}\in\mathcal{P}(u_{k},w_{\sigma_{\Pi}(k)})$ for every $k\in[n]$ . We denote the set of all such systems by $\mathcal{P}(U,W)$ . A system $\Pi\in\mathcal{P}(U,W)$ is called non-intersecting if $\pi_{k}\cap\pi_{l}=\emptyset$ for $k\neq l$ . The set of all non-intersecting systems in $\mathcal{P}(U,W)$ is denoted by $\tilde{\mathcal{P}}(U,W)$ ; see Fig. 3 for an example.

Figure 3: (a) A non-intersecting system of paths from $\{u_{1},u_{2}\}$ to $\{w_{1},w_{2}\}$. (b) Two intersecting paths with node $c$ in their intersection.

When connecting a graph $\mathcal{G}$ to a statistical model, we will introduce a matrix of parameters whose entries act as weights on the directed edges. We will write $\mathbb{R}^{\mathcal{G}_{D}}$ for the set of $p\times p$ real matrices $\Lambda=(\lambda_{uv})$ such that $\lambda_{uv}=0$ if $u\to v\notin E_{\rightarrow{}}$. When $\mathcal{G}$ is acyclic, as we assume throughout this work, the matrix $I-\Lambda$ is invertible for all $\Lambda\in\mathbb{R}^{\mathcal{G}_{D}}$; here, $I$ denotes the identity matrix. Indeed, when the nodes of $\mathcal{G}$ are ordered according to a causal order, $(I-\Lambda)^{T}$ is lower triangular with all ones on the diagonal and $\det(I-\Lambda)=1$. We define $B_{\Lambda}:=(I-\Lambda)^{-T}$. Later, in Section 7, we briefly discuss the identifiability problem in cyclic graphs, where the invertibility of $I-\Lambda$ becomes a modeling assumption.

Finally, let $U$ and $W$ be subsets of the row and column sets of a matrix $A$ , respectively. We denote the submatrix containing only the rows in $U$ and the columns in $W$ as $A_{U,W}$ .

2 Linear Mixed Graph Models and Identifiability

Let $\mathcal{G}=(V,E_{\rightarrow{}},E_{\leftrightarrow{}})$ be an ADMG, and let $\mathcal{G}_{D}=(V,E_{\rightarrow{}})$ and $\mathcal{G}_{B}=(V,E_{\leftrightarrow{}})$ be its directed and bidirected subgraphs, respectively. We say that a subset $C\subseteq V$ is connected in the bidirected part $\mathcal{G}_{B}$ if every pair of vertices $u,v\in C$ is joined by a path in $\mathcal{G}_{B}$ , where every vertex on the path is in $C$ . On a fixed probability space, let $\varepsilon=(\varepsilon_{1},\dots,\varepsilon_{p})$ be a random vector taking values in $\mathbb{R}^{p}$ and satisfying the connected set Markov property with respect to $\mathcal{G}_{B}$ (Richardson, 2003; Drton and Richardson, 2008), that is,

\varepsilon_{C}\perp\!\!\!\perp\varepsilon_{V\setminus\mathop{\rm Sib}\nolimits(C)}\ \text{for all}\ \emptyset\neq C\subset V,\ C\text{ connected in }\mathcal{G}_{B}.

We denote the set of all such random vectors by $\mathcal{M}(\mathcal{G}_{B})$ . Note that the connected set Markov property implies but is generally stronger than requiring that $\varepsilon_{u}\perp\!\!\!\perp\varepsilon_{v}$ for $u,v$ non-adjacent in $\mathcal{G}_{B}$ .
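For a small graph, the independence statements required by the connected set Markov property can be listed by brute force. The sketch below enumerates every non-empty connected set $C$ and reports the pair $(C, V\setminus\mathop{\rm Sib}\nolimits(C))$ whenever the second set is non-empty. The bidirected edge set is an assumption (the bidirected part of Fig. 2 as inferred from the examples in this paper), and the helper names are hypothetical.

```python
from itertools import combinations

V = [1, 2, 3, 4]
BIDIRECTED = {(1, 2), (2, 4), (3, 4)}  # assumed bidirected part of Fig. 2

def sib(v):
    """Nodes joined to v by a bidirected edge."""
    return {a if b == v else b for (a, b) in BIDIRECTED if v in (a, b)}

def Sib_set(C):
    """Sib(C): union over c in C of sib(c) together with c itself."""
    out = set(C)
    for c in C:
        out |= sib(c)
    return out

def is_connected(C):
    """True if every pair in C is joined by a bidirected path inside C."""
    C = set(C)
    start = next(iter(C))
    seen, stack = {start}, [start]
    while stack:
        for u in sib(stack.pop()) & C:
            if u not in seen:
                seen.add(u)
                stack.append(u)
    return seen == C

def markov_statements():
    """Pairs (C, V minus Sib(C)) with C connected and a non-empty second set."""
    stmts = []
    for k in range(1, len(V) + 1):
        for C in combinations(V, k):
            rest = set(V) - Sib_set(C)
            if is_connected(C) and rest:
                stmts.append((set(C), rest))
    return stmts

for C, rest in markov_statements():
    print(f"eps_{sorted(C)} independent of eps_{sorted(rest)}")
```

The enumeration is exponential in $|V|$ and is meant only to make the definition concrete.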

Definition 2.1.

The linear structural equation model $\mathcal{M}(\mathcal{G})$ corresponding to a mixed graph $\mathcal{G}$ is the set of all $p$ -variate real random vectors $X$ (on our fixed probability space) that solve the equation system

\displaystyle X=\Lambda^{T}\cdot X+\varepsilon\iff X=(I-\Lambda)^{-T}\cdot\varepsilon=B_{\Lambda}\cdot\varepsilon,

for a choice of $\Lambda\in\mathbb{R}^{\mathcal{G}_{D}}$ and $\varepsilon\in\mathcal{M}(\mathcal{G}_{B})$ . The model $\mathcal{M}(\mathcal{G})$ is thus parametrized by the map

	$\displaystyle\Phi_{\mathcal{G}}:\mathbb{R}^{\mathcal{G}_{D}}\times\mathcal{M}(\mathcal{G}_{B})$	$\displaystyle\xrightarrow{}\mathcal{M}(\mathcal{G})$
	$\displaystyle(\Lambda,\varepsilon)$	$\displaystyle\mapsto(I-\Lambda)^{-T}\varepsilon.$

Example 2.1.

Let $\mathcal{G}$ be the ADMG from Fig. 2. The set $\mathcal{M}(\mathcal{G}_{B})$ contains all the random vectors $\varepsilon=(\varepsilon_{1},\varepsilon_{2},\varepsilon_{3},\varepsilon_{4})$ such that $(\varepsilon_{1},\varepsilon_{2})\perp\!\!\!\perp\varepsilon_{3}$ and $\varepsilon_{1}\perp\!\!\!\perp(\varepsilon_{3},\varepsilon_{4})$. The space $\mathbb{R}^{\mathcal{G}_{D}}$ consists of all matrices of the form

\Lambda=\begin{bmatrix}0&\lambda_{12}&\lambda_{13}&0\\ 0&0&0&\lambda_{24}\\ 0&0&0&\lambda_{34}\\ 0&0&0&0\end{bmatrix}.

Accordingly, we have

B_{\Lambda}:=(I-\Lambda)^{-T}=\begin{bmatrix}1&0&0&0\\ \lambda_{12}&1&0&0\\ \lambda_{13}&0&1&0\\ \lambda_{12}\lambda_{24}+\lambda_{13}\lambda_{34}&\lambda_{24}&\lambda_{34}&1\end{bmatrix}.
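The closed form of $B_{\Lambda}$ in this example can be reproduced with exact arithmetic. The sketch below uses the fact that $\Lambda$ is nilpotent for an acyclic graph, so that $(I-\Lambda)^{-1}=I+\Lambda+\dots+\Lambda^{p-1}$. The numeric edge weights and the function name `path_matrix` are illustrative assumptions.

```python
from fractions import Fraction

def path_matrix(lam, p):
    """Return B = (I - Lambda)^{-T} for a nilpotent Lambda.

    lam maps a directed edge (u, v), with nodes labelled 1..p, to its weight.
    Uses (I - Lambda)^{-1} = I + Lambda + ... + Lambda^{p-1}.
    """
    L = [[Fraction(lam.get((i + 1, j + 1), 0)) for j in range(p)] for i in range(p)]
    total = [[Fraction(int(i == j)) for j in range(p)] for i in range(p)]
    power = [row[:] for row in total]
    for _ in range(p - 1):
        power = [[sum(power[i][k] * L[k][j] for k in range(p)) for j in range(p)]
                 for i in range(p)]
        total = [[total[i][j] + power[i][j] for j in range(p)] for i in range(p)]
    # Transpose to obtain (I - Lambda)^{-T}.
    return [[total[j][i] for j in range(p)] for i in range(p)]

# Illustrative weights on the four directed edges of the example.
lam = {(1, 2): 2, (1, 3): 3, (2, 4): 5, (3, 4): 7}
B = path_matrix(lam, 4)
for row in B:
    print([int(x) for x in row])
```

The bottom-left entry equals $\lambda_{12}\lambda_{24}+\lambda_{13}\lambda_{34}$, the sum of the weights of the two directed paths from node 1 to node 4.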

In this paper, we are concerned with parameter identifiability. In other words, we ask under which conditions on $\mathcal{G}$ , the distribution of $X\in\mathcal{M}(\mathcal{G})$ uniquely determines entries of the coefficient matrix $\Lambda\in\mathbb{R}^{\mathcal{G}_{D}}$ in the representation $X=(I-\Lambda)^{-T}\cdot\varepsilon$ . While we will not emphasize this in the sequel, the unique determination of all entries of $\Lambda$ also entails unique recovery of the distribution of $\varepsilon=(I-\Lambda)^{T}X$ .

As noted earlier, our interest is in a generic notion of identifiability, so we ask:

Problem.

Under which graphical conditions on $\mathcal{G}$ is a set of entries $\lambda_{uv}$ of the parameter matrix $\Lambda$ generically identifiable?

To detail the problem, we make the following definition and then firm up the involved notion of genericity.

Definition 2.2.

We define the fiber of an element $X\in\mathcal{M}(\mathcal{G})$ with respect to $\Phi_{\mathcal{G}}$ as the set

\Phi_{\mathcal{G}}^{-1}(X):=\{(\Lambda,\varepsilon)\in\mathbb{R}^{\mathcal{G}_{D}}\times\mathcal{M}(\mathcal{G}_{B})\;:\;\Phi_{\mathcal{G}}(\Lambda,\varepsilon)\stackrel{{\scriptstyle d}}{{=}}X\},

(2.1)

where $\stackrel{{\scriptstyle d}}{{=}}$ denotes equality in distribution. Let $\mathrm{P}_{\mathbb{R}^{\mathcal{G}}}(\Phi_{\mathcal{G}}^{-1}(X))$ be the projection of the set onto $\mathbb{R}^{\mathcal{G}_{D}}$ . A parameter given by a function $f(\Lambda)$ is generically identifiable if any generic choice of $(\Lambda,\varepsilon)$ yields a random vector $X=(I-\Lambda)^{-T}\varepsilon$ for which it holds that

f(\tilde{\Lambda})=f(\Lambda)\ \text{for all}\ \tilde{\Lambda}\in\mathrm{P}_{\mathbb{R}^{\mathcal{G}}}(\Phi^{-1}_{\mathcal{G}}(X)).

Requiring genericity of $\Lambda$ will mean that we exclude a fixed Lebesgue null set of $\mathbb{R}^{\mathcal{G}_{D}}$ . For instance, in the instrumental variable (IV) example depicted in Fig. 1, the unknown coefficients are $(\lambda_{12},\lambda_{23})$ and $\mathbb{R}^{\mathcal{G}_{D}}\equiv\mathbb{R}^{2}$ . Coefficient $\lambda_{23}$ is identifiable outside the null set given by $\lambda_{12}=0$ , i.e., we exclude the case that the instrument ( $X_{1}$ ) does not affect the exposure ( $X_{2}$ ).

While excluding null sets of a finite-dimensional space is a standard approach in related literature (Drton, 2018, §9), speaking of a generic choice of $\varepsilon$ requires clarification as Definition 2.1 is nonparametric with respect to the distribution of the errors $\varepsilon$ . Indeed, our genericity concept has a very specific meaning, namely, that the distribution of $\varepsilon$ satisfies the following assumption.

Assumption 1.

Let $\varepsilon\in\mathcal{M}(\mathcal{G}_{B})$. For every two vectors $a_{1},a_{2}\in\mathbb{R}^{V}$, it holds that $a_{1}^{T}\varepsilon\perp\!\!\!\perp a_{2}^{T}\varepsilon$ implies that $a_{1u}\cdot a_{2v}=0$ whenever $u=v$ or $u\xleftrightarrow{}v\in E_{\leftrightarrow{}}$.

Our genericity assumption is natural in view of the Darmois-Skitovich theorem. Indeed, this theorem amounts to exactly the statement that if $\mathcal{G}_{B}$ is the empty graph (i.e., has no edges), then Assumption 1 holds for every random vector that has at most one normally distributed coordinate. To further justify our assumption, we present in Section 5 a detailed study of two different classes of submodels for which we show that indeed only a lower-dimensional set of distributions is excluded by our assumption. One class of submodels is built by assuming the existence of moments up to an arbitrary but fixed order. The other class is built by assuming linearity of confounding.

For the remainder of this work, whenever we use the term generic, it is implied that the result holds for any matrix $\Lambda$ outside of a fixed Lebesgue measure zero subset of $\mathbb{R}^{\mathcal{G}_{D}}$ and for any $\varepsilon\in\mathcal{M}(\mathcal{G}_{B})$ that satisfies Assumption 1.

3 Necessary and Sufficient Conditions for Generic Identifiability of Direct Causal Effects

Let $\mathcal{G}=(V,E_{\rightarrow{}},E_{\leftrightarrow{}})$ be an ADMG, and let $X$ be a random vector in the model $\mathcal{M}(\mathcal{G})$ , i.e.,

X=\Phi_{\mathcal{G}}(\Lambda,\varepsilon)\ \text{for}\ (\Lambda,\varepsilon)\in\mathbb{R}^{\mathcal{G}_{D}}\times\mathcal{M}(\mathcal{G}_{B}).

Suppose $X$ can be generated using another pair $(\tilde{\Lambda},\tilde{\varepsilon})\in\Phi_{\mathcal{G}}^{-1}(X)$ . From the definition of the fiber in Eq. 2.1, one can see that

\tilde{\varepsilon}\stackrel{{\scriptstyle d}}{{=}}(I-\tilde{\Lambda})^{T}(I-\Lambda)^{-T}\varepsilon=(I-\tilde{\Lambda})^{T}B_{\Lambda}\varepsilon=:A\varepsilon.

(3.1)

The next result shows that the entries of matrix $A=(I-\tilde{\Lambda})^{T}B_{\Lambda}$ can be fully specified as a function of both $\tilde{\Lambda}$ and $B_{\Lambda}$ through the ancestral relations among the nodes of $\mathcal{G}$ .

Lemma 3.1.

The entries of matrix $A$ defined in Eq. 3.1 can be written as

a_{vu}=b_{vu}-\sum_{w\in\mathop{\rm pa}\nolimits(v)\cap\mathop{\rm de}\nolimits(u)}\tilde{\lambda}_{wv}b_{wu}.

(3.2)

In particular, we have $a_{vu}=0$ if $v\notin\mathop{\rm de}\nolimits(u)$ , and $a_{uu}=1$ for every $u\in V$ .

Proof.

Writing the product of matrices explicitly, we get

a_{vu}=b_{vu}-\sum_{w\in V}\tilde{\lambda}_{wv}b_{wu}.

From the definition of $\mathbb{R}^{\mathcal{G}_{D}}$ , we know that $\tilde{\lambda}_{wv}=0$ if $v\notin\mathop{\rm ch}\nolimits(w)$ , while it holds that $b_{wu}=0$ if $w\notin\mathop{\rm de}\nolimits(u)$ , from which the claim follows. To see that $b_{wu}=0$ if $w\notin\mathop{\rm de}\nolimits(u)$ , note that $B_{\Lambda}$ is a path matrix for the directed part $\mathcal{G}_{D}$ , as we detail in Lemma A.1 in the Appendix. ∎
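Lemma 3.1 can be sanity-checked numerically by forming $A=(I-\tilde{\Lambda})^{T}B_{\Lambda}$ directly and comparing each entry with the right-hand side of Eq. 3.2. The sketch below does this with exact rational arithmetic; the directed edges are those of Example 2.1, while the numeric values of $\Lambda$ and $\tilde{\Lambda}$ and all helper names are arbitrary choices made for illustration.

```python
from fractions import Fraction as F

p = 4
lam = {(1, 2): 2, (1, 3): 3, (2, 4): 5, (3, 4): 7}       # Lambda (illustrative)
lam_t = {(1, 2): -1, (1, 3): 4, (2, 4): 6, (3, 4): -2}   # Lambda-tilde (illustrative)
edges = set(lam)

def mat(weights):
    return [[F(weights.get((i, j), 0)) for j in range(1, p + 1)] for i in range(1, p + 1)]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(p)) for j in range(p)] for i in range(p)]

def transpose(X):
    return [list(r) for r in zip(*X)]

def de(u):
    """Descendants of u in the directed part, including u itself."""
    out, stack = {u}, [u]
    while stack:
        x = stack.pop()
        for (a, b) in edges:
            if a == x and b not in out:
                out.add(b)
                stack.append(b)
    return out

Id = mat({(i, i): 1 for i in range(1, p + 1)})

# B_Lambda = (I - Lambda)^{-T} via the finite Neumann series of the nilpotent Lambda.
L, inv, power = mat(lam), [r[:] for r in Id], [r[:] for r in Id]
for _ in range(p - 1):
    power = matmul(power, L)
    inv = [[inv[i][j] + power[i][j] for j in range(p)] for i in range(p)]
B = transpose(inv)

# A = (I - Lambda-tilde)^T B_Lambda, as in Eq. 3.1.
Lt = mat(lam_t)
A = matmul(transpose([[Id[i][j] - Lt[i][j] for j in range(p)] for i in range(p)]), B)

def a_formula(v, u):
    """Right-hand side of Eq. 3.2, summing over pa(v) intersected with de(u)."""
    ws = [w for w in range(1, p + 1) if (w, v) in edges and w in de(u)]
    return B[v - 1][u - 1] - sum(Lt[w - 1][v - 1] * B[w - 1][u - 1] for w in ws)

agree = all(A[v - 1][u - 1] == a_formula(v, u)
            for v in range(1, p + 1) for u in range(1, p + 1))
print(agree, [A[i][i] for i in range(p)])
```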

Definition 3.1.

The set of removable ancestors of a node $v\in V$ is defined as

R_{v}:=\{u\in\mathop{\rm an}\nolimits(v)\>:\>\mathop{\rm Sib}\nolimits(u)\setminus\mathop{\rm Sib}\nolimits(v)\neq\emptyset\}=\mathop{\rm Sib}\nolimits(V\setminus\mathop{\rm Sib}\nolimits(v))\cap\mathop{\rm an}\nolimits(v).

Clearly, $v\notin R_{v}$ .

Example 3.1.

Consider the graph in Fig. 2. In this graph, the only strict ancestor of $v_{2}$ is $v_{1}$, which has only $v_{2}$ as its sibling. Hence, $R_{v_{2}}=\emptyset$ because $\mathop{\rm Sib}\nolimits(v_{1})\setminus\mathop{\rm Sib}\nolimits(v_{2})=\{v_{1},v_{2}\}\setminus\{v_{1},v_{2},v_{4}\}=\emptyset$. On the other hand, $R_{v_{4}}=\{v_{1},v_{2}\}$ because $v_{1}$ belongs to both $\mathop{\rm Sib}\nolimits(v_{2})\setminus\mathop{\rm Sib}\nolimits(v_{4})$ and $\mathop{\rm Sib}\nolimits(v_{1})\setminus\mathop{\rm Sib}\nolimits(v_{4})$.
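The sets of removable ancestors in this example can be recomputed mechanically from Definition 3.1. In the sketch below, the edge sets are an assumption about Fig. 2, chosen to be consistent with the sibling sets quoted in this example, and `removable_ancestors` is a hypothetical helper name.

```python
V = [1, 2, 3, 4]
DIRECTED = {(1, 2), (1, 3), (2, 4), (3, 4)}   # assumed directed edges of Fig. 2
BIDIRECTED = {(1, 2), (2, 4), (3, 4)}         # assumed bidirected edges of Fig. 2

def an(v):
    """Ancestors of v, including v itself."""
    out, stack = {v}, [v]
    while stack:
        x = stack.pop()
        for (a, b) in DIRECTED:
            if b == x and a not in out:
                out.add(a)
                stack.append(a)
    return out

def Sib(v):
    """Siblings of v together with v."""
    return {v} | {a if b == v else b for (a, b) in BIDIRECTED if v in (a, b)}

def removable_ancestors(v):
    """R_v = {u in an(v) : Sib(u) is not contained in Sib(v)}."""
    return {u for u in an(v) if Sib(u) - Sib(v)}

for v in V:
    print(v, sorted(removable_ancestors(v)))
```

Under these assumed edges, node 3 is an ancestor of node 4 that is not removable, since its only sibling is node 4 itself.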

Using the concept of removable ancestors, the next result introduces a linear system of equations whose solution space fully characterizes the parameter matrices in $\mathrm{P}_{\mathbb{R}^{\mathcal{G}}}(\Phi^{-1}_{\mathcal{G}}(X))$ .

Lemma 3.2.

Let $X=\Phi_{\mathcal{G}}(\Lambda,\varepsilon)$ for a generic choice of parameters $(\Lambda,\varepsilon)\in\mathbb{R}^{\mathcal{G}_{D}}\times\mathcal{M}(\mathcal{G}_{B})$. The matrix $\tilde{\Lambda}\in\mathbb{R}^{\mathcal{G}_{D}}$ belongs to $\mathrm{P}_{\mathbb{R}^{\mathcal{G}}}(\Phi_{\mathcal{G}}^{-1}(X))$ if and only if it is a solution to the following linear system of equations:

\underbrace{[(B_{\Lambda})_{\mathop{\rm pa}\nolimits(v),R_{v}}]^{T}}_{(B_{\Lambda})^{v}}\cdot\tilde{\lambda}_{\mathop{\rm pa}\nolimits(v),v}=[(B_{\Lambda})_{v,R_{v}}]^{T},\quad\forall v\in V.

(3.3)

Proof.

We start by showing the direct implication. Eq. 3.1 shows that for every $u_{0}\in V\setminus\{v\}$ we can write $\tilde{\varepsilon}_{\{v,u_{0}\}}=A_{\{v,u_{0}\},V}\cdot\varepsilon$, where $A_{\{v,u_{0}\},V}$ denotes the rows of $A$ corresponding to nodes $\{v,u_{0}\}$. Since $\tilde{\varepsilon}\in\mathcal{M}(\mathcal{G}_{B})$, it holds that $\tilde{\varepsilon}_{v}\perp\!\!\!\perp\tilde{\varepsilon}_{u_{0}}$ for every $u_{0}\notin\mathop{\rm sib}\nolimits(v)$ and, thus, Assumption 1 implies

	$\displaystyle a_{vv_{0}}\cdot a_{u_{0}v_{0}}$	$\displaystyle=0,\qquad$	$\displaystyle\forall\,v_{0}\in V,$		(3.4)
	$\displaystyle a_{vv_{0}}\cdot a_{u_{0}v_{1}}$	$\displaystyle=0,\qquad$	$\displaystyle\forall\,v_{0}\xleftrightarrow{}v_{1}\in\mathcal{G}.$		(3.5)

If $u\notin\mathop{\rm sib}\nolimits(v)$ , considering $u_{0}=v_{0}=u$ in Eq. 3.4 yields $a_{vu}\cdot a_{uu}=a_{vu}=0$ , where we used the fact that $a_{uu}=1$ as a result of Lemma 3.1. Again, from Lemma 3.1, writing $a_{vu}$ explicitly, we get

b_{vu}=\sum_{w\in\mathop{\rm pa}\nolimits(v)\cap\mathop{\rm de}\nolimits(u)}\tilde{\lambda}_{wv}b_{wu}=[(B_{\Lambda})_{\mathop{\rm pa}\nolimits(v),u}]^{T}\cdot\tilde{\lambda}_{\mathop{\rm pa}\nolimits(v),v}.

(3.6)

Now, let $u\in\mathop{\rm sib}\nolimits(v)$, and let $w\in\mathop{\rm sib}\nolimits(u)\setminus\mathop{\rm Sib}\nolimits(v)$. Considering Eq. 3.5 with $u_{0}=v_{1}=w$ and $v_{0}=u$, we get $a_{vu}\cdot a_{ww}=a_{vu}=0$. Proceeding as above, for $u\in\mathop{\rm an}\nolimits(v)$ this yields Eq. 3.6, as claimed.

For the reverse implication, consider $\tilde{\Lambda}\in\mathbb{R}^{\mathcal{G}_{D}}$ such that each one of its column vectors $\tilde{\lambda}_{\mathop{\rm pa}\nolimits(v),v}$ is a solution of Eq. 3.3. Define $\tilde{\varepsilon}:=A\cdot\varepsilon$, where the matrix $A$ is defined in Eq. 3.1. By the definition of $\tilde{\varepsilon}$, we have $\Phi_{\mathcal{G}}(\Lambda,\varepsilon)\stackrel{{\scriptstyle d}}{{=}}\Phi_{\mathcal{G}}(\tilde{\Lambda},\tilde{\varepsilon})$, so it remains to prove $\tilde{\varepsilon}\in\mathcal{M}(\mathcal{G}_{B})$, that is, $\tilde{\varepsilon}$ satisfies the connected set Markov property with respect to $\mathcal{G}_{B}$. Let $C\subseteq V$. Then it is easy to see that

\tilde{\varepsilon}_{C}=A_{C,D(C)}\cdot\varepsilon_{D(C)},

where $D(C)$ is the set of non-removable ancestors of $C$, that is,

D(C):=C\cup\bigcup_{v\in C}(\mathop{\rm an}\nolimits(v)\setminus R_{v})\subseteq\mathop{\rm Sib}\nolimits(C).

Here, the last set inclusion comes from the definition of $R_{v}$. Hence, to prove that $\tilde{\varepsilon}$ satisfies the connected set Markov property, we need to show that $\tilde{\varepsilon}_{C}\perp\!\!\!\perp\tilde{\varepsilon}_{V\setminus\mathop{\rm Sib}\nolimits(C)}$ whenever $C$ is a connected subset of $\mathcal{G}_{B}$. For this, it suffices to show that $\varepsilon_{D(C)}\perp\!\!\!\perp\varepsilon_{D(V\setminus\mathop{\rm Sib}\nolimits(C))}$. We will argue that this is indeed the case for $C$ connected by showing i) $D(C)$ is connected and ii) $D(V\setminus\mathop{\rm Sib}\nolimits(C))\subseteq V\setminus\mathop{\rm Sib}\nolimits(D(C))$. The asserted result will then follow from the fact that $\varepsilon$ satisfies the connected set Markov property with respect to $\mathcal{G}_{B}$.

i) To show that $D(C)$ is connected, consider $u,v\in D(C)$. From the definition of $D(C)$, one can see that there are $u_{0},v_{0}\in C$ such that $u_{0}\in\mathop{\rm Sib}\nolimits(u)$ and $v_{0}\in\mathop{\rm Sib}\nolimits(v)$. Because $C$ is connected, a bidirected path joining $u_{0}$ and $v_{0}$ exists over $C$, and we can extend the path to suitably join $u$ and $v$ as well.

ii) Notice that $\mathop{\rm Sib}\nolimits(D(C))\subseteq\mathop{\rm Sib}\nolimits(C)$. This is because if $w\in\mathop{\rm Sib}\nolimits(D(C))$, either $w\in\mathop{\rm Sib}\nolimits(C)$ or there is $v\in C$ such that $w\in\mathop{\rm Sib}\nolimits(\mathop{\rm an}\nolimits(v)\setminus R_{v})$, which again implies that $w\in\mathop{\rm Sib}\nolimits(C)$ by the definition of $R_{v}$. This implies that in order to prove $D(V\setminus\mathop{\rm Sib}\nolimits(C))\subseteq V\setminus\mathop{\rm Sib}\nolimits(D(C))$, we only need to show $D(V\setminus\mathop{\rm Sib}\nolimits(C))\cap\mathop{\rm Sib}\nolimits(C)=\emptyset$. Suppose there exists $u\in D(V\setminus\mathop{\rm Sib}\nolimits(C))\cap\mathop{\rm Sib}\nolimits(C)$. Then there are $v\in C$ and $w\in V\setminus\mathop{\rm Sib}\nolimits(C)$ such that $v\xleftrightarrow{}u\xleftrightarrow{}w\in\mathcal{G}_{B}$ and $u\notin R_{w}$. By the definition of $R_{w}$, this implies $v\in\mathop{\rm sib}\nolimits(w)$, which is impossible as $w\in V\setminus\mathop{\rm Sib}\nolimits(C)$. This concludes the proof. ∎
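As a worked instance of the linear system in Eq. 3.3, take $v=v_{4}$ in the graph of Fig. 2, with $\mathop{\rm pa}\nolimits(v_{4})=\{v_{2},v_{3}\}$ and $R_{v_{4}}=\{v_{1},v_{2}\}$ as in Example 3.1, and read the entries of $B_{\Lambda}$ from Example 2.1. The display below is a derivation sketch assembled from those two examples rather than a statement taken from the original text.

```latex
\begin{bmatrix}\lambda_{12} & \lambda_{13}\\ 1 & 0\end{bmatrix}
\begin{bmatrix}\tilde{\lambda}_{24}\\ \tilde{\lambda}_{34}\end{bmatrix}
=
\begin{bmatrix}\lambda_{12}\lambda_{24}+\lambda_{13}\lambda_{34}\\ \lambda_{24}\end{bmatrix}
\quad\Longrightarrow\quad
\tilde{\lambda}_{24}=\lambda_{24},
\qquad
\lambda_{13}\,(\tilde{\lambda}_{34}-\lambda_{34})=0 .
```

The coefficient matrix has determinant $-\lambda_{13}$, so the solution is unique exactly when $\lambda_{13}\neq 0$, which is a genericity condition on $\Lambda$. For $v=v_{2}$, in contrast, $R_{v_{2}}=\emptyset$ and the system imposes no constraint on $\tilde{\lambda}_{12}$.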

Definition 3.2.

Let $v\in V$ , and let $Q\subseteq(\mathop{\rm pa}\nolimits(v)\cup\{v\})$ . We define the $v$ -rank of $Q$ as

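r^{v}_{Q}:=\max\{\,k\in\{0,\dots,|Q|\}\>:\>\text{there exist}\ (I,P)\in 2^{R_{v}}\times 2^{Q}\ \text{with}\ |I|=|P|=k\ \text{and}\ \tilde{\mathcal{P}}(I,P)\neq\emptyset\,\},

(3.7)

where $2^{S}$ denotes the power set of $S$. Recall that $\tilde{\mathcal{P}}(I,P)$ is the set of non-intersecting systems of paths between $I$ and $P$. The value $k=0$ is always attained by $I=P=\emptyset$, so $r^{v}_{Q}=0$ when no larger $k$ qualifies.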

Notice that from Eq. 3.7 it is immediate that

r^{v}_{\mathop{\rm pa}\nolimits(v)\setminus Q}\geq r^{v}_{\mathop{\rm pa}\nolimits(v)}-|Q|.

The following theorem, which constitutes our main identifiability result, shows that this lower bound for $r^{v}_{\mathop{\rm pa}\nolimits(v)\setminus Q}$ is attained if and only if $\lambda_{Q,v}$ is generically identifiable. The theorem is based on characterizing the solution set of the linear system in Eq. 3.3, which, by the previous lemma, describes $\mathrm{P}_{\mathbb{R}^{\mathcal{G}}}(\Phi^{-1}_{\mathcal{G}}(X))$.

Theorem 3.3.

Let $v\in V$, and let $Q\subseteq\mathop{\rm pa}\nolimits(v)$. The vector $\lambda_{Q,v}$ is generically identifiable if and only if $r^{v}_{\mathop{\rm pa}\nolimits(v)\setminus Q}=r^{v}_{\mathop{\rm pa}\nolimits(v)}-|Q|$, where $r^{v}_{Q}$ is defined in Eq. 3.7.

Proof.

The vector $\lambda_{Q,v}$ is identifiable if and only if $\tilde{\lambda}_{Q,v}=\lambda_{Q,v}$ for every $\tilde{\Lambda}\in\mathrm{P}_{\mathbb{R}^{\mathcal{G}}}(\Phi^{-1}_{\mathcal{G}}(X))$. We know from Lemma 3.2 that $\tilde{\lambda}_{\mathop{\rm pa}\nolimits(v),v}$ is a solution of the linear system given in Eq. 3.3 for every such matrix $\tilde{\Lambda}$. Hence, if we define

	$\displaystyle S^{v}$	$\displaystyle:=\{\tilde{\lambda}_{\mathop{\rm pa}\nolimits(v),v}\in\mathbb{R}^{|\mathop{\rm pa}\nolimits(v)|}\>:\>[(B_{\Lambda})_{\mathop{\rm pa}\nolimits(v),R_{v}}]^{T}\cdot\tilde{\lambda}_{\mathop{\rm pa}\nolimits(v),v}=[(B_{\Lambda})_{v,R_{v}}]^{T}\},$
	$\displaystyle S^{v}_{Q}$	$\displaystyle:=\{\tilde{\lambda}_{\mathop{\rm pa}\nolimits(v),v}\in S^{v}\>:\>\tilde{\lambda}_{Q,v}=\lambda_{Q,v}\},$

then $\lambda_{Q,v}$ is identifiable if and only if $S^{v}_{Q}=S^{v}$. By definition, $S^{v}_{Q}$ is an affine subspace of $S^{v}$, so the two are equal if and only if they have the same dimension.

We can write $S^{v}_{Q}$ as the solution space of the following linear system

\underbrace{\begin{bmatrix}(I_{p})_{Q,\mathop{\rm pa}\nolimits(v)}\\ (B_{\Lambda})^{v}\end{bmatrix}}_{(B_{\Lambda})^{v}_{Q}}\cdot\tilde{\lambda}_{\mathop{\rm pa}\nolimits(v),v}=\begin{bmatrix}\lambda_{Q,v}\\ [(B_{\Lambda})_{v,R_{v}}]^{T}\end{bmatrix},

(3.8)

where $I_{p}$ is the $p\times p$ identity matrix, and $(B_{\Lambda})^{v}$ is defined in Eq. 3.3. We know that the solution space of Eq. 3.8 is not empty since the true coefficient vector $\lambda_{\mathop{\rm pa}\nolimits(v),v}$ belongs to it. Hence, we have $\dim(S^{v}_{Q})=|\mathop{\rm pa}\nolimits(v)|-\operatorname{rank}((B_{\Lambda})^{v}_{Q})$, which implies

\dim(S^{v}_{Q})=\dim(S^{v})\iff\operatorname{rank}((B_{\Lambda})^{v}_{Q})=\operatorname{rank}((B_{\Lambda})^{v}).

From the definition of $(B_{\Lambda})^{v}_{Q}$ in Eq. 3.8 one can easily see that

\operatorname{rank}((B_{\Lambda})^{v}_{Q})=\operatorname{rank}([(B_{\Lambda})^{v}_{R_{v},\mathop{\rm pa}\nolimits(v)\setminus Q}])+|Q|=\operatorname{rank}([(B_{\Lambda})_{\mathop{\rm pa}\nolimits(v)\setminus Q,R_{v}}]^{T})+|Q|.

Finally, we have

\dim(S^{v}_{Q})=\dim(S^{v})\iff\operatorname{rank}([(B_{\Lambda})_{\mathop{\rm pa}\nolimits(v)\setminus Q,R_{v}}]^{T})=\operatorname{rank}([(B_{\Lambda})_{\mathop{\rm pa}\nolimits(v),R_{v}}]^{T})-|Q|,

which concludes the proof by noticing that, by Lemma A.2, $r^{v}_{Q}$ is generically equal to $\operatorname{rank}([(B_{\Lambda})_{Q,R_{v}}]^{T})$ for every subset $Q$ of $\mathop{\rm pa}\nolimits(v)$. ∎
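The rank bookkeeping in the proof can be checked on a toy instance. The following is a minimal sketch in exact rational arithmetic; the matrix `M` is a hypothetical stand-in for $(B_{\Lambda})^{v}$ (rows indexed by $R_{v}$, columns by $\mathop{\rm pa}\nolimits(v)$) and is not derived from any particular graph, and the helper names are ours.

```python
from fractions import Fraction

def rank(rows):
    """Rank of a matrix (list of rows) by exact Gaussian elimination."""
    m = [[Fraction(x) for x in r] for r in rows]
    ncols = len(m[0]) if m else 0
    rk = 0
    for c in range(ncols):
        piv = next((i for i in range(rk, len(m)) if m[i][c] != 0), None)
        if piv is None:
            continue
        m[rk], m[piv] = m[piv], m[rk]
        for i in range(len(m)):
            if i != rk and m[i][c] != 0:
                f = m[i][c] / m[rk][c]
                m[i] = [a - f * b for a, b in zip(m[i], m[rk])]
        rk += 1
    return rk

def stacked(M, Q):
    """Matrix of Eq. 3.8: unit rows selecting the coordinates in Q, on top of M."""
    n = len(M[0])
    return [[1 if j == q else 0 for j in range(n)] for q in Q] + M

def drop_columns(M, Q):
    """Keep only the columns indexed by pa(v) minus Q."""
    return [[x for j, x in enumerate(row) if j not in Q] for row in M]

def identifiable(M, Q):
    """Rank condition from the last display of the proof."""
    return rank(drop_columns(M, Q)) == rank(M) - len(Q)

# Hypothetical example with |R_v| = 2 and three parents (columns 0, 1, 2).
M = [[1, 0, 0],
     [0, 1, 1]]
```

In this toy matrix the first row pins down coordinate 0, so `identifiable(M, [0])` holds, whereas coordinates 1 and 2 only enter through their sum and are not identifiable individually.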

Example 3.2.

Consider again the graph in Fig. 2, as in Example 3.1. We have $R_{v_{2}}=\emptyset$, implying that the parameter $\lambda_{v_{1}v_{2}}$ is not identifiable. In contrast, $R_{v_{4}}=\{v_{1},v_{2}\}$, and there is a system of non-intersecting paths from $R_{v_{4}}$ to $\mathop{\rm pa}\nolimits(v_{4})=\{v_{2},v_{3}\}$ given by $\pi_{1}=(v_{1},v_{3})$ and $\pi_{2}=(v_{2},v_{2})$. This implies that the vector $\lambda_{\mathop{\rm pa}\nolimits(v_{4}),v_{4}}$ is identifiable.

The following theorem characterizes the situations in which the whole matrix $\Lambda$ is identifiable.

Theorem 3.4.

The matrix $\Lambda$ is generically identifiable if and only if for every node $v\in V$ , there is a subset $I_{v}$ of $R_{v}$ of size $|\mathop{\rm pa}\nolimits(v)|$ such that there is a system of non-intersecting paths from $I_{v}$ to $\mathop{\rm pa}\nolimits(v)$ .

Proof.

The matrix $\Lambda$ is identifiable if and only if all of its columns are, so we get the statement by applying Theorem 3.3 to each of the columns, with $Q=\mathop{\rm pa}\nolimits(v)$ . ∎

Remark 3.1.

It is noteworthy that Assumption 1 is used only for proving the direct implication of Lemma 3.2. This implies that the necessity of the graphical condition in Theorem 3.4 also holds if the model were extended by not requiring Assumption 1 to hold.

Remark 3.2.

A direct consequence of Lemma 3.2 is that if the matrix $\Lambda$ is not generically identifiable, the fiber $\mathrm{P}_{\mathbb{R}^{\mathcal{G}}}(\Phi_{\mathcal{G}}^{-1}(\Phi_{\mathcal{G}}(\Lambda,\varepsilon)))$ has infinite cardinality. This implies that in our setting, there are no ADMGs that are $k$-to-one with finite $k>1$. This is in contrast with the linear Gaussian case; see, e.g., Foygel et al. (2012, Ex. 8).

4 Certifying Identifiability

Verifying directly whether the condition of Theorem 3.3 is satisfied can be computationally challenging. Following the approach of Brito (2004) and Foygel et al. (2012), we now introduce an alternative approach that can verify the identifiability condition of Theorem 3.3 in polynomial time in the size of the graph via a maximum flow reformulation.

For the sake of completeness, we first revisit the definition of the maximum flow problem; further details are available in Cormen et al. (2009, §26). Subsequently, we introduce our reformulation.

The proofs of the results presented in this section can be found in Section B.1.

4.1 The Maximum Flow Problem

Let $G=(V,D)$ be a directed graph with source node $s\in V$ and sink node $t\in V$ . Let $c_{V}:V\to\mathbb{R}_{\geq 0}$ be a node capacity function, and let $c_{D}:D\to\mathbb{R}_{\geq 0}$ be an edge capacity function. A flow on $G$ is a function $f:D\to\mathbb{R}_{\geq 0}$ satisfying

\begin{aligned}
\sum_{w\in\mathop{\rm ch}\nolimits(v)}f(v,w)=\sum_{u\in\mathop{\rm pa}\nolimits(v)}f(u,v)&\leq c_{V}(v),\quad\forall v\in V\setminus\{s,t\},\\
f(u,v)&\leq c_{D}(u,v),\quad\forall u\to v\in D.
\end{aligned}

(4.1)

The size of a flow $f$ is defined as

|f|:=\sum_{w\in\mathop{\rm ch}\nolimits(s)}f(s,w)=\sum_{u\in\mathop{\rm pa}\nolimits(t)}f(u,t).

(4.2)

The max-flow problem on $(G,s,t,c_{V},c_{D})$ is the problem of finding a flow $f$ whose size $|f|$ is maximum.
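To make the definition concrete, the following minimal sketch solves a max-flow problem with both node and edge capacities. It uses the standard node-splitting reduction (each node becomes an "in" and an "out" copy joined by an edge carrying the node capacity) together with breadth-first augmenting paths. This is a textbook routine chosen for readability, not the algorithm behind the run times quoted later in this section; the function name and the example graph are ours.

```python
from collections import deque

INF = float("inf")

def max_flow(nodes, edges, node_cap, s, t):
    """Max flow from s to t with node and edge capacities.

    nodes: iterable of hashable names; edges: dict {(u, v): capacity};
    node_cap: dict {node: capacity}, missing entries count as unbounded.
    """
    cap, adj = {}, {}

    def add(a, b, c):
        cap[(a, b)] = cap.get((a, b), 0) + c
        cap.setdefault((b, a), 0)          # residual (reverse) edge
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    for v in nodes:                         # node capacity as an in->out edge
        add((v, "in"), (v, "out"), node_cap.get(v, INF))
    for (u, v), c in edges.items():
        add((u, "out"), (v, "in"), c)

    source, sink = (s, "in"), (t, "out")
    total = 0
    while True:
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            a = queue.popleft()
            for b in adj[a]:
                if b not in parent and cap[(a, b)] > 0:
                    parent[b] = a
                    queue.append(b)
        if sink not in parent:
            return total
        path, b = [], sink
        while parent[b] is not None:
            path.append((parent[b], b))
            b = parent[b]
        delta = min(cap[e] for e in path)   # bottleneck of the augmenting path
        if delta == INF:
            return INF                      # unbounded flow
        for a, b in path:
            cap[(a, b)] -= delta
            cap[(b, a)] += delta
        total += delta

# Hypothetical example: unit node capacities on a, b, c; unbounded edges.
nodes = ["s", "a", "b", "c", "t"]
edges = {("s", "a"): INF, ("s", "b"): INF, ("a", "c"): INF,
         ("b", "c"): INF, ("c", "t"): INF}
unit = {"a": 1, "b": 1, "c": 1}
```

In this example every path to `t` passes through `c`, so the maximum flow has size 1; adding a direct edge from `b` to `t` raises it to 2.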

4.2 Deciding Generic Identifiability

For every node $v\in V$ and every $Q\subseteq\mathop{\rm pa}\nolimits(v)$ , let $G^{v}_{Q}=(V^{v}_{Q},E^{v}_{Q})$ be defined as follows:

\begin{aligned}
V^{v}_{Q}&:=\mathop{\rm an}\nolimits(v)\cup\{s_{v},t_{v}\},\\
E^{v}_{Q}&:=\{s_{v}\to u\>:\>u\in R_{v}\}\cup\{u\to t_{v}\>:\>u\in Q\}\cup\{u\to w\>:\>u\to w\in\mathcal{G}\},
\end{aligned}

where $s_{v}$ and $t_{v}$ are, respectively, newly introduced source and sink nodes. The edge capacity is $\infty$ for all the edges. The node capacity is $\infty$ for both the sink and the source, and $1$ , otherwise. We denote the maximum size of any flow on $G^{v}_{Q}$ by $\,\operatorname{max-flow}{(G^{v}_{Q})}$ .

Lemma 4.1.

It holds that $\operatorname{max-flow}{(G^{v}_{Q})}=r^{v}_{Q}$ .

Theorem 4.2.

Given a mixed graph $\mathcal{G}=(V,E_{\rightarrow{}},E_{\leftrightarrow{}})$, a node $v\in V$, and any $Q\subseteq\mathop{\rm pa}\nolimits(v)$, the generic identifiability of $\lambda_{Q,v}$ holds if and only if $\operatorname{max-flow}{(G^{v}_{\mathop{\rm pa}\nolimits(v)\setminus Q})}=\operatorname{max-flow}{(G^{v}_{\mathop{\rm pa}\nolimits(v)})}-|Q|$, which can be certified in $\mathcal{O}(|V|^{2+o(1)})$ time.

Theorem 4.3.

Given a mixed graph $\mathcal{G}=(V,E_{\rightarrow{}},E_{\leftrightarrow{}})$, the generic identifiability of $\Lambda$ holds if and only if $\operatorname{max-flow}{(G^{v}_{\mathop{\rm pa}\nolimits(v)})}=|\mathop{\rm pa}\nolimits(v)|$ for all $v\in V$, which can be certified in $\mathcal{O}(|V|^{3+o(1)})$ time.
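A minimal sketch of the resulting certification procedure is given below. It combines Theorem 3.3 with Lemma 4.1, i.e., it compares $\operatorname{max-flow}(G^{v}_{\mathop{\rm pa}\nolimits(v)\setminus Q})$ with $\operatorname{max-flow}(G^{v}_{\mathop{\rm pa}\nolimits(v)})-|Q|$. Several points are assumptions of the sketch: the set $R_{v}$ is passed in by the caller rather than computed from the bidirected part, only edges with both endpoints in $\mathop{\rm an}\nolimits(v)$ are kept, the infinite edge capacities are replaced by a large finite constant, and the simple augmenting-path routine does not attain the run times stated in the theorems. The edge list is our reading of the examples, since Fig. 2 is not reproduced here.

```python
from collections import deque

def ancestors(directed, v):
    """All nodes with a directed path to v, together with v itself."""
    parents = {}
    for a, b in directed:
        parents.setdefault(b, set()).add(a)
    seen, stack = {v}, [v]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

def max_flow_gvq(directed, R_v, v, Q):
    """Size of a maximum flow on G^v_Q (unit node capacities, node splitting)."""
    nodes = ancestors(directed, v)
    big = len(nodes) + 1                    # finite stand-in for infinity
    cap, adj = {}, {}

    def add(a, b, c):
        cap[(a, b)] = cap.get((a, b), 0) + c
        cap.setdefault((b, a), 0)
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    for x in nodes:
        add((x, "in"), (x, "out"), 1)       # node capacity 1
    for a, b in directed:
        if a in nodes and b in nodes:
            add((a, "out"), (b, "in"), big)
    for u in R_v:
        if u in nodes:
            add("s", (u, "in"), big)        # s_v -> u for u in R_v
    for u in Q:
        add((u, "out"), "t", big)           # u -> t_v for u in Q

    flow = 0
    while True:
        parent = {"s": None}
        queue = deque(["s"])
        while queue and "t" not in parent:
            a = queue.popleft()
            for b in adj.get(a, ()):
                if b not in parent and cap[(a, b)] > 0:
                    parent[b] = a
                    queue.append(b)
        if "t" not in parent:
            return flow
        b = "t"
        while parent[b] is not None:        # every augmentation carries one unit
            a = parent[b]
            cap[(a, b)] -= 1
            cap[(b, a)] += 1
            b = a
        flow += 1

def generically_identifiable(directed, R_v, v, Q):
    """Rank condition of Theorem 3.3, evaluated through max-flow values."""
    pa = {a for a, b in directed if b == v}
    Q = set(Q)
    return (max_flow_gvq(directed, R_v, v, pa - Q)
            == max_flow_gvq(directed, R_v, v, pa) - len(Q))

# Assumed directed part, with R_{v2} = {} and R_{v4} = {v1, v2} as in Example 3.2.
directed = [(1, 2), (1, 3), (2, 4), (3, 4)]
```

With these inputs, `generically_identifiable(directed, {1, 2}, 4, {2, 3})` holds while `generically_identifiable(directed, set(), 2, {1})` does not, in line with Examples 3.2 and 4.1.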

Example 4.1.

Fig. 4 illustrates the maximum flows obtained when the criterion from Theorem 4.2 is applied to two of the nodes of the ADMG in Fig. 2.

$G^{v_{2}}_{\mathop{\rm pa}\nolimits(v_{2})}:$ The graph is constructed for parameter $\lambda_{12}$ . The only flow on $G^{v_{2}}_{\mathop{\rm pa}\nolimits(v_{2})}$ is the trivial flow setting all edges to $0$ . Hence, $\lambda_{12}$ is not identifiable.

$G^{v_{4}}_{\mathop{\rm pa}\nolimits(v_{4})}:$ The graph is constructed for parameter $\Lambda_{\{2,3\},4}$. The figure displays a flow on $G^{v_{4}}_{\mathop{\rm pa}\nolimits(v_{4})}$ of size $|\mathop{\rm pa}\nolimits(v_{4})|=2$. Consequently, the parameters $\lambda_{24}$ and $\lambda_{34}$ are identifiable.

Figure 4: Two maximum flow problems corresponding to the ADMG of Fig. 2.

5 The Genericity Condition for the Error Distribution

The idea underlying Assumption 1 is that it should not be possible to linearly disentangle a general dependence between two errors $\varepsilon_{u}$ and $\varepsilon_{v}$. In other words, if two different linear combinations of $\varepsilon$ are independent, then at least one of them cannot have any signal coming from $(\varepsilon_{u},\varepsilon_{v})$. The purpose of this section is to prove that this is indeed the case for two tractable subfamilies of joint distributions for the errors. Specifically, Section 5.1 considers the setting in which dependence is generated through linear latent factor models, and Section 5.2 treats distributions with finite moments.

5.1 Linear Factor Models

Assume that the error vector $\varepsilon$ is generated according to a sparse factor model that respects the Markov property of the bidirected part $\mathcal{G}_{B}$ of a given ADMG $\mathcal{G}$ . Define a latent factor graph for $\mathcal{G}_{B}$ to be any DAG $\mathcal{L}=(V\cup L,E_{\mathcal{L}})$ , in which the latent nodes $L$ are source nodes and whose latent projection (see Verma and Pearl (1990, Sec. 3)) on the nodes in $V$ is equal to $\mathcal{G}_{B}$ . Define $\mathcal{M}(k)$ to be the set of $k$ -dimensional random vectors with independent and non-Gaussian components. Then, the sparse factor model associated to $\mathcal{L}$ is the set of random vectors

\mathcal{M}^{\mathcal{L}}(\mathcal{G}_{B})=\{\varepsilon\in\mathcal{M}(\mathcal{G}_{B})\>:\>\exists\,\eta\in\mathcal{M}(|V|+|L|),\,\,H\in\mathbb{R}^{\mathcal{L}},\,\,\varepsilon=H_{L,V}^{T}\cdot\eta_{L}+\eta_{V}\}.

(5.1)

Theorem 5.1.

Let $\mathcal{L}=(V\cup L,E_{\mathcal{L}})$ be a latent factor graph for $\mathcal{G}_{B}$, and for any subset $C\subset V$ define $L_{C}:=\{l\in L\>:\>\mathop{\rm ch}\nolimits_{\mathcal{L}}(l)\subseteq C\}$. If for every edge $u\xleftrightarrow{}v\in\mathcal{G}_{B}$ there is a clique $C_{uv}$ (a subset of $V$ for which every pair of nodes is adjacent) in $\mathcal{G}_{B}$ such that $|L_{C_{uv}}|\geq|C_{uv}|-1$, then $\varepsilon$ satisfies Assumption 1 for Lebesgue-almost every matrix $H\in\mathbb{R}^{\mathcal{L}}$.

Proof.

Let $a_{1},a_{2}\in\mathbb{R}^{V}$ , and consider $\varepsilon=H_{L,V}^{T}\cdot\eta_{L}+\eta_{V}$ as in Eq. 5.1. Applying the Darmois-Skitovich theorem (Comon and Jutten, 2010, Thm. 9.5) to $a_{1}^{T}\varepsilon$ and $a_{2}^{T}\varepsilon$ , we obtain that

a_{1s}\cdot a_{2s}=0,\quad\forall s\in V,

(5.2)

(a_{1}^{T}H_{L,V}^{T})_{l}\cdot(a_{2}^{T}H_{L,V}^{T})_{l}=0,\quad\forall l\in L.

(5.3)

Note that Eq. 5.2 already gives the part of the claim referring to the case $u=v$ in Assumption 1. It remains to consider the case of two nodes $u,v$ that are adjacent in $\mathcal{G}_{B}$.

Let $u\xleftrightarrow{}v\in\mathcal{G}_{B}$ , and assume for contradiction that $a_{1u}\cdot a_{2v}\neq 0$ . Consider a clique $C_{uv}$ as in the statement of the theorem. The vector $a_{C_{u,v}}:=(a_{1,C_{u,v}},a_{2,C_{u,v}})$ is a solution of the following system of quadratic equations:

\left(\sum_{c\in C_{uv}}a_{1c}H_{cl}\right)\cdot\left(\sum_{c\in C_{uv}}a_{2c}H_{cl}\right)=0,\,\quad l\in L_{C_{u,v}};

we denote the system by $\mathcal{S}_{C_{u,v}}$. Notice that from Eq. 5.2 we know that the vector $a_{C_{u,v}}$ has at most $|C_{u,v}|$ non-zero entries. We now show that, for a generic choice of the entries of $H$, $\mathcal{S}_{C_{u,v}}$ does not admit solutions with $a_{1u}\cdot a_{2v}\neq 0$. Following the case distinctions resulting from the vanishing of the first or the second factor in the equations in (5.3), the solution set of $\mathcal{S}_{C_{u,v}}$ can be written as the union of the solution sets of $2^{|L_{C_{u,v}}|}$ homogeneous linear systems. Each of these linear systems can be characterized by a partition of $L_{C_{u,v}}$ defined as follows:

L_{1}:=\{l\in L_{C_{u,v}}\>:\>(a_{1}^{T}H_{L,V}^{T})_{l}=0\},\,\quad L_{2}:=L_{C_{u,v}}\setminus L_{1}.

We denote by $\mathcal{S}_{1}$ and $\mathcal{S}_{2}$ the linear systems associated to $L_{1}$ and $L_{2}$ , respectively. Define $V_{1}=\{v\in C_{u,v}\>:\>a_{1v}=0\}$ and $V_{2}=\{v\in C_{u,v}\>:\>a_{2v}=0\}$ .

If $V_{1}\cap V_{2}\neq\emptyset$, the vector $a_{C_{u,v}}$ has at most $|C_{u,v}|-1$ non-zero entries, implying that $\mathcal{S}_{1}\cup\mathcal{S}_{2}$ has $|L_{C_{u,v}}|$ equations and $|C_{u,v}|-1$ parameters. If $|L_{C_{u,v}}|\geq|C_{u,v}|-1$, for a generic choice of the entries of $H$, such a system admits only the 0 solution (Okamoto, 1973, Lemma). Hence, the assumption that $a_{1u}\cdot a_{2v}\neq 0$ leads to a contradiction.

For $V_{1}\cap V_{2}=\emptyset$, we now show that either $\mathcal{S}_{1}$ or $\mathcal{S}_{2}$ admits only the 0 solution. Notice that since $a_{1u}\cdot a_{2v}\neq 0$ we have $V_{1},V_{2}\neq\emptyset$. This implies that both $\mathcal{S}_{1}$ and $\mathcal{S}_{2}$ can have a non-zero solution for a generic choice of the entries of $H$ only if

|L_{1}|\leq|V_{1}|-1,\,\quad|L_{2}|\leq|V_{2}|-1.

This would lead to $|L_{C_{u,v}}|=|L_{1}|+|L_{2}|\leq|V_{1}|+|V_{2}|-2=|C_{u,v}|-2$ , which contradicts $|L_{C_{u,v}}|\geq|C_{u,v}|-1$ . ∎
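For small graphs, the combinatorial condition of Theorem 5.1 can be checked by brute force. The sketch below enumerates, for every bidirected edge, all cliques containing both endpoints and tests whether $|L_{C}|\geq|C|-1$ holds for at least one of them. Requiring the clique to contain $u$ and $v$ is our reading of the notation $C_{uv}$; the function name and the two example latent factor graphs are hypothetical, and the enumeration is exponential in $|V|$.

```python
from itertools import combinations

def violating_edges(V, bidirected, latent_children):
    """Bidirected edges for which no clique C containing both endpoints
    satisfies |L_C| >= |C| - 1, where L_C = {l : ch(l) is a subset of C}.

    V: list of observed nodes; bidirected: list of pairs (u, v);
    latent_children: dict {latent: iterable of observed children}.
    """
    edges = {frozenset(e) for e in bidirected}

    def is_clique(C):
        return all(frozenset(pair) in edges for pair in combinations(C, 2))

    def num_latents_inside(C):
        return sum(1 for ch in latent_children.values() if set(ch).issubset(C))

    bad = []
    for e in edges:
        others = [w for w in V if w not in e]
        found = any(
            is_clique(C) and num_latents_inside(C) >= len(C) - 1
            for k in range(len(others) + 1)
            for extra in combinations(others, k)
            for C in [set(e) | set(extra)]
        )
        if not found:
            bad.append(tuple(sorted(e)))
    return sorted(bad)

# One latent per bidirected edge, as in a canonical-DAG style construction.
canonical = violating_edges([1, 2, 3], [(1, 2), (2, 3)],
                            {"l12": [1, 2], "l23": [2, 3]})

# A single latent shared by four nodes whose bidirected graph is complete.
k4_edges = list(combinations([1, 2, 3, 4], 2))
single = violating_edges([1, 2, 3, 4], k4_edges, {"l1": [1, 2, 3, 4]})
```

In the first configuration every edge passes the test with the two-node clique, while in the second no clique has enough latent parents, so all six edges are reported.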

Corollary 5.2.

Let $\mathcal{D}(\mathcal{G}_{B})$ be the canonical DAG associated to $\mathcal{G}_{B}$; see Richardson and Spirtes (2002, §6). Then Assumption 1 is satisfied for a generic choice of parameters of $\mathcal{M}^{\mathcal{D}(\mathcal{G}_{B})}(\mathcal{G}_{B})$.

Remark 5.1.

We point out that in the non-Gaussian setting, sparse linear factor analysis models have testable implications; see, e.g., Ardiyansyah and Sodomaco (2023); Xie et al. (2023); Schkoda and Drton (2023). Hence, the failure of Assumption 1 could, in principle, be tested by testing all linear factor models leading to the failure.

Example 5.1.

We borrow the example in Fig. 5 from Barber et al. (2022, Fig. 5). Notice that in the proof of our main result, Assumption 1 is used only for matrices with a specific structure, described in Lemma 3.1. Therefore, we focus on this type of matrix. In particular, we will consider the matrix

A=\begin{pmatrix}a_{1}^{T}\\ a_{2}^{T}\end{pmatrix}=\begin{pmatrix}1&0&0&0&0\\ 0&0&a_{53}&a_{54}&1\end{pmatrix},

(5.4)

and the bidirected graph $\mathcal{G}_{B}$ , corresponding to $\mathcal{G}$ in Fig. 5 with respect to the latent factor models $\mathcal{L},\mathcal{L}_{1},\mathcal{L}_{2}$ given in Fig. 5, and Fig. 6.

1.

Consider the pair $v_{2}\xleftrightarrow{}v_{3}\in\mathcal{G}_{B}$. The only latent parent of both nodes in $\mathcal{L}$ is $l_{1}$, and $\mathop{\rm ch}\nolimits(l_{1})=\{v_{1},v_{2},v_{3},v_{4}\}$. This means that the only clique we can consider is $C_{v_{2}v_{3}}=\{v_{1},v_{2},v_{3},v_{4}\}$, for which $|L_{C_{v_{2}v_{3}}}|=1<|C_{v_{2}v_{3}}|-1$; hence the condition in Theorem 5.1 is violated. We now show that Eq. 5.3 has a nonzero solution. Indeed, the only latent variable for which the system is not trivially satisfied is $l_{1}$, implying that any solution of the equation $a_{53}H_{3l_{1}}+a_{54}H_{4l_{1}}=0$ is also a solution of Eq. 5.3.

Figure 5: A latent factor model under which Assumption 1 does not hold and the corresponding latent projection.

2.

It is straightforward to see that after adding a latent node $l_{4}$ to the graph as in the graph $\mathcal{L}_{1}$ in Fig. 6, the condition of Theorem 5.1 is still not satisfied. However, in this case, Assumption 1 cannot be violated by the matrices described in Eq. 5.4. To see this, consider Eq. 5.3 for the latent variables $l_{1}$ and $l_{4}$, which leads to the following system of equations for $(a_{53},a_{54})$:

\begin{cases}H_{1l_{1}}(a_{53}H_{3l_{1}}+a_{54}H_{4l_{1}})&=0,\\ H_{1l_{4}}(a_{53}H_{3l_{4}})&=0.\end{cases}

For a generic choice of $H$, the only solution to this system of equations is $(a_{53},a_{54})=(0,0)$.

Figure 6: Two latent factor models with the same latent projection as in Fig. 5, under which Assumption 1 holds generically for matrices as in Eq. 5.4. The graph on the left does not satisfy the condition of Theorem 5.1, while the graph on the right does. Hence, the condition introduced in Theorem 5.1 is sufficient but not necessary.

3.

The graph $\mathcal{L}_{2}$ satisfies the hypothesis of Theorem 5.1; hence it does not violate Assumption 1.

5.2 Random Variables with Finite Moments

We now turn to a setting where the error vector has finite moments up to a suitable order. As we show in Theorem 5.5 below, the distributions at which Assumption 1 fails define a set of moments, or equivalently cumulants, that forms a Lebesgue null set among all possible moments/cumulants up to the considered truncation order. The proofs for the results presented in this section can be found in Section B.2.

Definition 5.1.

The $k$ -th cumulant tensor of a random vector $\varepsilon=(\varepsilon_{1},\dots,\varepsilon_{p})$ is the $k$ -way tensor in $\mathbb{R}^{p\times\dots\times p}\equiv(\mathbb{R}^{p})^{k}$ whose entry in position $(i_{1},\dots,i_{k})$ is the joint cumulant

\mathcal{C}^{(k)}(\varepsilon)_{i_{1},\dots,i_{k}}:=\sum_{(A_{1},\dots,A_{L})}(-1)^{L-1}(L-1)!\,\mathbb{E}\bigg[\prod_{j\in A_{1}}\varepsilon_{j}\bigg]\cdots\mathbb{E}\bigg[\prod_{j\in A_{L}}\varepsilon_{j}\bigg],

where the sum is taken over all partitions $(A_{1},\dots,A_{L})$ of the multiset $\{i_{1},\dots,i_{k}\}$ .

Cumulant tensors are symmetric, i.e.,

\mathcal{C}^{(k)}(\varepsilon)_{i_{1},\dots,i_{k}}=\mathcal{C}^{(k)}(\varepsilon)_{i_{\sigma(1)},\dots,i_{\sigma(k)}}\quad\forall\sigma\in S_{k},

where $S_{k}$ is the symmetric group on $[k]=\{1,\dots,k\}$ . We write $\operatorname{Sym}_{k}(p)$ for the subspace of symmetric tensors in $(\mathbb{R}^{p})^{k}$ .

Assumption 1 involves linear combinations of the entries of a random vector. We will, thus, have to consider cumulants after linear transformation, for which we can leverage the following fact.

Lemma 5.3 (Comon and Jutten (2010), §5, Eq. 5.8).

Let $\varepsilon=(\varepsilon_{1},\dots,\varepsilon_{p})$ be any $p$ -variate random vector, and $A\in\mathbb{R}^{s\times p}$ for any $s\in\mathbb{N}$ , then

\mathcal{C}^{(k)}(A\cdot\varepsilon)_{i_{1},\dots,i_{k}}=\sum_{j_{1},\dots,j_{k}}\mathcal{C}^{(k)}(\varepsilon)_{j_{1},\dots,j_{k}}a_{i_{1}j_{1}}\cdots a_{i_{k}j_{k}}.
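The partition formula of Definition 5.1 and the multilinearity in Lemma 5.3 can be verified exactly on a finitely supported distribution. The sketch below is only an illustration under our own conventions: a distribution is a list of pairs (probability, vector), all arithmetic is done with rationals, and the matrix acts as $(A\varepsilon)_{i}=\sum_{j}a_{ij}\varepsilon_{j}$. The two-point marginals and the matrix `A` are hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import factorial

def partitions(items):
    """Yield all set partitions of a list of distinct positions."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        yield [[first]] + part
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]

def moment(dist, idx):
    """E[prod_{j in idx} X_j] for a finite distribution [(prob, vector), ...]."""
    total = Fraction(0)
    for prob, x in dist:
        term = Fraction(prob)
        for j in idx:
            term *= x[j]
        total += term
    return total

def cumulant(dist, idx):
    """Joint cumulant with index tuple idx, via the partition formula."""
    total = Fraction(0)
    for part in partitions(list(range(len(idx)))):
        L = len(part)
        term = Fraction((-1) ** (L - 1) * factorial(L - 1))
        for block in part:
            term *= moment(dist, [idx[pos] for pos in block])
        total += term
    return total

def transform(dist, A):
    """Distribution of A * X."""
    return [(p, [sum(a * xj for a, xj in zip(row, x)) for row in A])
            for p, x in dist]

def multilinear_rhs(dist, A, idx):
    """Right-hand side of Lemma 5.3."""
    total = Fraction(0)
    for js in product(range(len(A[0])), repeat=len(idx)):
        term = cumulant(dist, js)
        for i, j in zip(idx, js):
            term *= A[i][j]
        total += term
    return total

# Two independent, centered, skewed two-point variables (hypothetical).
e1 = [(Fraction(2, 3), -1), (Fraction(1, 3), 2)]
e2 = [(Fraction(3, 4), -1), (Fraction(1, 4), 3)]
dist = [(p * q, [x, y]) for p, x in e1 for q, y in e2]
A = [[1, 2], [3, -1]]
```

Because the two components are independent, mixed cumulants such as `cumulant(dist, (0, 0, 1))` vanish, which is the zero pattern that the sets $\mathcal{M}^{(k)}(\mathcal{G}_{B})$ introduced below encode for index sets that are not connected in $\mathcal{G}_{B}$.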

In order to justify Assumption 1, we wish to offer statements of its generic validity. Our strategy to do so in the present context is to consider cumulants up to a suitable truncation order $k$. In the remainder of this section we consider a mixed graph $\mathcal{G}$ with $p$ nodes, which we label by taking the vertex set to be $V=[p]$.

Definition 5.2.

Let $\mathcal{M}_{\infty}(\mathcal{G}_{B})$ be the subset of $\mathcal{M}(\mathcal{G}_{B})$ yielding distributions with all moments finite. For any integer $k\geq 2$ , let

\mathcal{M}^{(k)}(\mathcal{G}_{B})=\left\{\mathcal{C}^{(k)}\in\operatorname{Sym}_{k}(p)\>:\>\mathcal{C}^{(k)}_{i_{1},\dots,i_{k}}=0\text{ if }\{i_{1},\dots,i_{k}\}\text{ is not connected in }\mathcal{G}_{B}\right\}.

Moreover, we let

\mathcal{M}^{\leq k}(\mathcal{G}_{B})=\mathcal{M}^{(2)}(\mathcal{G}_{B})\times\cdots\times\mathcal{M}^{(k)}(\mathcal{G}_{B}).

Lemma 5.4.

Fix any integer $k\geq 2$.

(i)

The map $\phi^{k}:\mathcal{M}_{\infty}(\mathcal{G}_{B})\to\mathcal{M}^{(k)}(\mathcal{G}_{B})$ that sends random vectors with all moments finite to their $k$-th cumulant tensors is well-defined in the sense that $\phi^{k}(\mathcal{M}_{\infty}(\mathcal{G}_{B}))\subseteq\mathcal{M}^{(k)}(\mathcal{G}_{B})$.

(ii)

Define the map $\phi^{\leq k}=(\phi^{l})_{l\leq k}:\mathcal{M}_{\infty}(\mathcal{G}_{B})\to\mathcal{M}^{\leq k}(\mathcal{G}_{B})$. Then $\phi^{\leq k}(\mathcal{M}_{\infty}(\mathcal{G}_{B}))$ is a full-dimensional subset of $\mathcal{M}^{\leq k}(\mathcal{G}_{B})$.

Theorem 5.5.

Let

\kappa(\mathcal{G}_{B}):=\{A=(a_{ij})\in\mathbb{R}^{2\times p}\>:\>a_{1i}\cdot a_{2j}=0\text{ if }u_{i}\xleftrightarrow{}u_{j}\in\mathcal{G}_{B}\text{ or }i=j\}.

For every $\varepsilon\in\mathcal{M}_{\infty}(\mathcal{G}_{B})$, define $\kappa(\varepsilon)=\{A\in\mathbb{R}^{2\times p}\>:\>(A\varepsilon)_{1}\perp\!\!\!\perp(A\varepsilon)_{2}\}$, and let $\mathcal{S}(\mathcal{G}_{B})=\{\varepsilon\in\mathcal{M}_{\infty}(\mathcal{G}_{B})\>:\>\kappa(\varepsilon)\setminus\kappa(\mathcal{G}_{B})\neq\emptyset\}$, which is precisely the set of distributions for which Assumption 1 fails. Then there is a positive integer $k\leq 2(p+1)$ such that $\phi^{\leq k}(\mathcal{S}(\mathcal{G}_{B}))$ is a Lebesgue measure 0 subset of $\mathcal{M}^{\leq k}(\mathcal{G}_{B})$.

We remark that, for simplicity, we stated Theorem 5.5 for distributions with finite moments of any order. However, we only needed the first $2(p+1)$ moments to be finite.

Example 5.2.

One simple type of exceptional distribution for which Assumption 1 fails to hold consists of distributions obtained as linear transformations of independent non-Gaussian variables. For example, let $U_{1},U_{2}$ be two independent standard normal random variables, let $V_{i}=\sqrt[3]{U_{i}}$ for $i=1,2$, and let $X=B\cdot(V_{1},V_{2})$ for any invertible $2\times 2$ matrix $B$. Then $X\in\mathcal{M}_{\infty}(\mathcal{G}_{B})$ for $\mathcal{G}_{B}=\{\{1,2\},\{1\xleftrightarrow{}2\}\}$, but by construction $V=B^{-1}\cdot X$, and the fact that $V_{1}\perp\!\!\!\perp V_{2}$ implies $X\in\mathcal{S}(\mathcal{G}_{B})$. As noted in Schkoda and Drton (2023), linear transformations of independent components form a null set already when considering cumulants of order up to $k\leq 3$.

Remark 5.2.

Theorem 5.5 is of independent interest given the recent scholarly attention to generalizations of ICA that can deal with dependent error terms; see, e.g., Mesters and Zwiernik (2022); Garrote-López and Stephenson (2024); Wang and Seigal (2024). Indeed, if we consider $\mathcal{G}_{B}$ to be the empty graph, then Theorem 5.5 reduces to a generic version of the classical Darmois-Skitovich theorem that underlies ICA theory (Comon and Jutten, 2010, Thm. 9.5). From this perspective, Theorem 5.5 provides a generic generalization of the Darmois-Skitovich theorem to the case where the independence structure of the sources is more complex. A consequence of Examples 5.1-5.2 is that a generalization of the Darmois-Skitovich theorem that holds globally, i.e., for every non-Gaussian distribution, cannot be achieved.

6 Structure Identifiability

Up to this point, we have studied a parameter identifiability problem, assuming that the mixed graph $\mathcal{G}$ is given. However, there are numerous practical problems in which one needs to infer the graph from data. This model selection problem is often referred to as structure learning or causal discovery (Drton and Maathuis, 2017). Therefore, it is important to understand whether, within the model class under consideration, the graph is identifiable. If this is not the case, one is forced to either restrict the class of graphs under consideration, as done by Wang and Drton (2023), or to learn an equivalence class of graphs (Peters et al., 2017, §6). The problem that we consider in this section is as follows.

Problem (Model equivalence).

Given two ADMGs $\mathcal{G}$ and $\tilde{\mathcal{G}}$, is it true that $\mathcal{M}(\mathcal{G})\!=\!\mathcal{M}(\tilde{\mathcal{G}})\!\iff\!\mathcal{G}\!=\!\tilde{\mathcal{G}}$?

When $\mathcal{M}(\mathcal{G})=\mathcal{M}(\tilde{\mathcal{G}})$ , we say that $\mathcal{G}$ and $\tilde{\mathcal{G}}$ are model equivalent. The equivalence class of an ADMG $\mathcal{G}$ is the set of all the ADMGs $\tilde{\mathcal{G}}$ that are model equivalent to it.

In the next result, we prove that the model equivalence of two arbitrary ADMGs can be certified by solving a system of quadratic equations. For graphs of small size, such a system can be solved with computer algebra software, in the same spirit as in García-Puente et al. (2010). In Example 6.1, we use this result to characterize the equivalence class of the IV graph depicted in Fig. 1.

Theorem 6.1.

Let $\mathcal{G}$ and $\tilde{\mathcal{G}}$ be two arbitrary ADMGs with the same vertex set $V$ . Then $\mathcal{M}(\mathcal{G})\subseteq\mathcal{M}(\tilde{\mathcal{G}})$ if and only if for every $\Lambda\in\mathbb{R}^{\mathcal{G}}$ , and every $u\xleftrightarrow{}v\notin\tilde{\mathcal{G}}$ the following system of equations has a solution in $\tilde{\Lambda}\in\mathbb{R}^{\tilde{\mathcal{G}}}$ :

((I-\tilde{\Lambda})^{T}B_{\Lambda})_{uw_{1}}\cdot((I-\tilde{\Lambda})^{T}B_{\Lambda})_{vw_{2}}=0,\quad\forall w_{1},w_{2}\text{ with }w_{1}=w_{2}\text{ or }w_{1}\xleftrightarrow{}w_{2}\in\mathcal{G}.

(6.1)

Proof.

By definition, $\mathcal{M}(\mathcal{G})\subseteq\mathcal{M}(\tilde{\mathcal{G}})$ if and only if for every $X=(I-\Lambda)^{-T}\varepsilon\in\mathcal{M}(\mathcal{G})$, we have $X\stackrel{d}{=}(I-\tilde{\Lambda})^{-T}\tilde{\varepsilon}$ for some $\tilde{\Lambda}\in\mathbb{R}^{\tilde{\mathcal{G}}}$ and $\tilde{\varepsilon}\in\mathcal{M}(\tilde{\mathcal{G}}_{B})$. This implies

\tilde{\varepsilon}\stackrel{d}{=}(I-\tilde{\Lambda})^{T}(I-\Lambda)^{-T}\varepsilon=\underbrace{(I-\tilde{\Lambda})^{T}B_{\Lambda}}_{A}\varepsilon.

(6.2)

In particular, Eq. 6.2 has to hold for the generic elements $\varepsilon\in\mathcal{M}(\mathcal{G}_{B})$ satisfying Assumption 1. By applying Assumption 1 to every pair $u\xleftrightarrow{}v\notin\tilde{\mathcal{G}}$, we conclude that the condition is necessary for the model inclusion. In order to prove its sufficiency, we must demonstrate that $\tilde{\varepsilon}$ satisfies the connected set Markov property with respect to $\tilde{\mathcal{G}}$.

For every $v\in V$, let $D(v)=\{u\in V\>:\>a_{vu}\neq 0\}$, and $D(C):=\cup_{v\in C}D(v)$ for $C\subseteq V$. We need to show that $\tilde{\varepsilon}_{C}\perp\!\!\!\perp\tilde{\varepsilon}_{V\setminus\mathop{\rm Sib}\nolimits(C)}$ for every $C\subseteq V$ that is connected in $\tilde{\mathcal{G}}_{B}$. If $D(V\setminus\mathop{\rm Sib}\nolimits(C))\subseteq V\setminus\mathop{\rm Sib}\nolimits(D(C))$ in $\mathcal{G}$, then the result follows from the connected set Markov property of $\varepsilon$ with respect to $\mathcal{G}$. Assume, for contradiction, that $D(V\setminus\mathop{\rm Sib}\nolimits(C))\cap\mathop{\rm Sib}\nolimits(D(C))\neq\emptyset$. Then there exist $w_{1}\in D(V\setminus\mathop{\rm Sib}\nolimits(C))$, $w_{2}\in D(C)$, $u\in V\setminus\mathop{\rm Sib}\nolimits(C)$, and $v\in C$ such that $w_{1}\xleftrightarrow{}w_{2}$ in $\mathcal{G}$ or $w_{1}=w_{2}$, and $a_{uw_{1}}\cdot a_{vw_{2}}\neq 0$, which contradicts Eq. 6.1 since $u\xleftrightarrow{}v\notin\tilde{\mathcal{G}}$. ∎

The next result applies Theorem 3.3 to graphically characterize when two ADMGs that only differ in one directed edge are model equivalent.

Theorem 6.2.

Let $\mathcal{G}=(V,E_{\rightarrow{}},E_{\leftrightarrow{}})$ be an ADMG, $e=u_{0}\to v_{0}\in\mathcal{G}$ , and $\mathcal{G}^{e}=(V,E_{\rightarrow{}}\setminus\{e\},E_{\leftrightarrow{}})$ . Then $\mathcal{M}(\mathcal{G}^{e})=\mathcal{M}(\mathcal{G})$ if and only if $\lambda_{u_{0}v_{0}}$ is not identifiable in $\mathcal{G}$ .

Proof.

Since $\mathcal{G}^{e}$ is a subgraph of $\mathcal{G}$, we always have $\mathcal{M}(\mathcal{G}^{e})\subseteq\mathcal{M}(\mathcal{G})$. Hence, we only need to show that $\mathcal{M}(\mathcal{G}^{e})\supseteq\mathcal{M}(\mathcal{G})$ if and only if $\lambda_{u_{0}v_{0}}$ is not identifiable in $\mathcal{G}$.

Let $X=(I-\Lambda)^{-T}\varepsilon\in\mathcal{M}(\mathcal{G})$ . Then $X\in\mathcal{M}(\mathcal{G}^{e})$ if and only if $X\stackrel{{\scriptstyle d}}{{=}}(I-\tilde{\Lambda})^{-T}\tilde{\varepsilon}$ for $\tilde{\Lambda}\in\mathbb{R}^{\mathcal{G}^{e}}$ and $\tilde{\varepsilon}\in\mathcal{M}(\mathcal{G}_{B})$ . This implies

\tilde{\varepsilon}\stackrel{d}{=}(I-\tilde{\Lambda})^{T}(I-\Lambda)^{-T}\varepsilon=\underbrace{(I-\tilde{\Lambda})^{T}B_{\Lambda}}_{A}\varepsilon.

(6.3)

Following the steps of the proof in Lemma 3.1, we get

a_{uv}=b_{uv}-\sum_{w\in\mathop{\rm pa}\nolimits(u)^{e}\cap\mathop{\rm de}\nolimits(v)}\tilde{\lambda}_{wu}b_{wv},

where, for any node $v$, $\mathop{\rm pa}\nolimits(v)^{e}$ denotes the parent set of $v$ in $\mathcal{G}^{e}$.

Since $\mathcal{G}$ and $\mathcal{G}^{e}$ have the same bidirected part, we can repeat the same proof as for Lemma 3.2, concluding that a matrix $A$ as in Eq. 6.3 can exist if and only if, for every $v\in V$ , the following system has a solution:

[(B_{\Lambda})_{\mathop{\rm pa}\nolimits(v)^{e},R_{v}}]^{T}\cdot\tilde{\lambda}_{\mathop{\rm pa}\nolimits(v)^{e},v}=[(B_{\Lambda})_{v,R_{v}}]^{T}.

(6.4)

For $v\neq v_{0}$, we have $\mathop{\rm pa}\nolimits(v)^{e}=\mathop{\rm pa}\nolimits(v)$. Hence, the system in Eq. 6.4 always has a solution, given by $\tilde{\lambda}_{\mathop{\rm pa}\nolimits(v)^{e},v}=\lambda_{\mathop{\rm pa}\nolimits(v),v}$.

For $v=v_{0}$, we have $\mathop{\rm pa}\nolimits(v_{0})^{e}=\mathop{\rm pa}\nolimits(v_{0})\setminus\{u_{0}\}$. The system in Eq. 6.4 has a solution if and only if $\operatorname{rank}((B_{\Lambda})_{\mathop{\rm pa}\nolimits(v_{0})\setminus\{u_{0}\},R_{v_{0}}})=\operatorname{rank}((B_{\Lambda})_{\mathop{\rm pa}\nolimits(v_{0})\setminus\{u_{0}\}\cup\{v_{0}\},R_{v_{0}}})$. Let $\Pi=(\pi_{0},\dots,\pi_{k})$ be a system of non-intersecting paths from $I\subseteq R_{v_{0}}$ to $J\subseteq\mathop{\rm pa}\nolimits(v_{0})$. If $u_{0}\in\Pi$, let $\pi_{0}$ be the path that ends at $u_{0}$, and let $\pi_{0}^{*}$ be the path obtained by concatenating $\pi_{0}$ with the edge $u_{0}\to v_{0}$. By construction, $\Pi^{*}=(\pi_{0}^{*},\dots,\pi_{k})$ is a system of non-intersecting paths from $I$ to $J\setminus\{u_{0}\}\cup\{v_{0}\}\subseteq\mathop{\rm pa}\nolimits(v_{0})\setminus\{u_{0}\}\cup\{v_{0}\}$. This proves that $r^{v_{0}}_{\mathop{\rm pa}\nolimits(v_{0})\setminus\{u_{0}\}\cup\{v_{0}\}}=r^{v_{0}}_{\mathop{\rm pa}\nolimits(v_{0})}$, which implies $\operatorname{rank}((B_{\Lambda})_{\mathop{\rm pa}\nolimits(v_{0})\setminus\{u_{0}\}\cup\{v_{0}\},R_{v_{0}}})=\operatorname{rank}((B_{\Lambda})_{\mathop{\rm pa}\nolimits(v_{0}),R_{v_{0}}})$.

From Theorem 3.3, we know that $\operatorname{rank}((B_{\Lambda})_{\mathop{\rm pa}\nolimits(v_{0})\setminus\{u_{0}\},R_{v_{0}}})\geq\operatorname{rank}((B_{\Lambda})_{\mathop{\rm pa}\nolimits(v_{0}),R_{v_{0}}})-1$ and that equality holds if and only if $\lambda_{u_{0}v_{0}}$ is identifiable. Finally, we can write

\begin{aligned}
\operatorname{rank}((B_{\Lambda})_{\mathop{\rm pa}\nolimits(v_{0}),R_{v_{0}}})-1&\leq\operatorname{rank}((B_{\Lambda})_{\mathop{\rm pa}\nolimits(v_{0})\setminus\{u_{0}\},R_{v_{0}}})\\
&\leq\operatorname{rank}((B_{\Lambda})_{\mathop{\rm pa}\nolimits(v_{0})\setminus\{u_{0}\}\cup\{v_{0}\},R_{v_{0}}})=\operatorname{rank}((B_{\Lambda})_{\mathop{\rm pa}\nolimits(v_{0}),R_{v_{0}}}).
\end{aligned}

This concludes the proof by noticing that $\operatorname{rank}((B_{\Lambda})_{\mathop{\rm pa}\nolimits(v_{0})\setminus\{u_{0}\},R_{v_{0}}})=\operatorname{rank}((B_{\Lambda})_{\mathop{\rm pa}\nolimits(v_{0})\setminus\{u_{0}\}\cup\{v_{0}\},R_{v_{0}}})$ if and only if the first inequality is strict. ∎

It is well known that the presence of a valid instrumental variable is sufficient for estimating the causal effect from a treatment to an outcome (Wright, 1928, App. B). However, testing from data that an instrument is valid is a much more involved task. Indeed, Gunsilius (2021) shows that in the nonparametric case for continuous treatment, the IV model does not impose any constraint on the observed distribution. Developing tests for the validity of an instrument under different parametric assumptions is an active and important area of research; see, e.g., Pearl (1995); Silva and Shimizu (2017); Xie et al. (2022). The next example shows that, unlike the nonparametric case, the IV model does impose constraints on the observed distribution in linear models. However, our results prove that these constraints are not sufficient for testing the validity of an instrument.

Example 6.1 (Instrumental Validity).

Figure 7: The IV graph (top row in the middle) with its equivalence class.

Let $\tilde{\mathcal{G}}_{IV}$ be the graph on the top left in Fig. 7. Applying Theorem 6.2 to this graph, and the edges $I\to Y$ and $T\to Y$ , one can see that $\tilde{\mathcal{G}}_{IV},\tilde{\mathcal{G}}_{IV}^{I\to Y}$ , and $\tilde{\mathcal{G}}_{IV}^{T\to Y}$ are all model equivalent. Applying the same argument to the graph on the bottom left in Fig. 7, and the edges $Y\to T$ and $I\to T$ , we obtain that all the graphs depicted in Fig. 7 are model equivalent. Furthermore, by verifying the conditions of Theorem 6.1 for all ADMGs with three nodes using the software Macaulay2 (Grayson and Stillman, 2023), we confirm that the graphs depicted in Fig. 7 are the only ones in the equivalence class of the IV graph.

In particular, the equivalence of the IV graph $\mathcal{G}_{IV}$ and $\tilde{\mathcal{G}}_{IV}$ implies that the so-called exclusion restriction (Lousdal, 2018) is not testable within our model class.

Moreover, through direct computation (for instance, by applying Robeva and Seby (2021, Cor. 21) to any of the graphs in Fig. 7), it can be shown that all the graphs in the equivalence class impose the following moment constraints on the observed distribution

\Sigma_{IT}\mathcal{T}_{III}-\Sigma_{II}\mathcal{T}_{IIT}=\Sigma_{IY}\mathcal{T}_{III}-\Sigma_{II}\mathcal{T}_{IIY}=0,

where $\Sigma_{X_{1}X_{2}}=\mathbb{E}(X_{1}X_{2})$ , and $\mathcal{T}_{X_{1}X_{2}X_{3}}=\mathbb{E}(X_{1}X_{2}X_{3})$ .
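These two constraints can be verified exactly on a small discrete instance. The sketch below assumes the usual IV structure $I\to T\to Y$ with $T\leftrightarrow Y$ (Fig. 1 is not reproduced here), an instrument error that is independent of the centered and mutually dependent pair $(\varepsilon_{T},\varepsilon_{Y})$, and an optional direct effect `c` of $I$ on $Y$ to mimic a graph with the additional edge $I\to Y$. The error distributions and coefficient values are hypothetical.

```python
from fractions import Fraction as F
from itertools import product

# Instrument error: skewed two-point distribution, list of (prob, value).
eps_I = [(F(2, 3), -1), (F(1, 3), 2)]
# Confounded errors (eps_T, eps_Y): centered, dependent, list of (prob, pair).
eps_TY = [(F(1, 2), (1, 1)), (F(1, 4), (-2, 0)), (F(1, 4), (0, -2))]

def observed(a, b, c=0):
    """Exact joint law of (I, T, Y) with T = a*I + eps_T, Y = b*T + c*I + eps_Y."""
    out = []
    for (p, i), (q, (et, ey)) in product(eps_I, eps_TY):
        t = a * i + et
        y = b * t + c * i + ey
        out.append((p * q, {"I": i, "T": t, "Y": y}))
    return out

def E(dist, *names):
    """Raw moment E[prod of the named coordinates]."""
    total = F(0)
    for p, x in dist:
        term = p
        for n in names:
            term *= x[n]
        total += term
    return total

def constraints(dist):
    """Left-hand sides of the two moment constraints."""
    s_ii, t_iii = E(dist, "I", "I"), E(dist, "I", "I", "I")
    c1 = E(dist, "I", "T") * t_iii - s_ii * E(dist, "I", "I", "T")
    c2 = E(dist, "I", "Y") * t_iii - s_ii * E(dist, "I", "I", "Y")
    return c1, c2

iv = constraints(observed(F(3, 2), F(-2)))
iv_direct = constraints(observed(F(3, 2), F(-2), c=F(5, 7)))
```

Both `iv` and `iv_direct` evaluate to `(0, 0)`, which illustrates that the constraints hold with or without the direct effect of $I$ on $Y$ and therefore cannot be used to test the exclusion restriction.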

7 Cyclic Graphs

Up to this point, we have exclusively studied acyclic models. This assumption has allowed us to obtain a complete characterization of the identifiable parameters. In this section, we relax this assumption and show that the proposed graphical criterion in Section 3 remains a necessary condition but is no longer sufficient. Moreover, we will provide a complete characterization of parameter identifiability for a special sub-class of cyclic graphs.

The first issue that one encounters when dealing with cyclic models is that the matrix $(I-\Lambda)$ might not be invertible. In that case, the assignment $X=\Lambda^{T}\cdot X+\varepsilon$ does not induce a unique solution for $X$. Hence, we need to restrict our attention to a subset of $\mathbb{R}^{\mathcal{G}_{D}}$, namely, the set $\mathbb{R}^{\mathcal{G}_{D}}_{\mathrm{reg}}:=\{\Lambda\in\mathbb{R}^{\mathcal{G}_{D}}\>:\>\det(I-\Lambda)\neq 0\}$. In this section, we focus on the identifiability of the matrix $\Lambda$ and reformulate the problem as follows.

Definition 7.1.

Define the parametrization map

\Phi_{\mathcal{G}_{\mathrm{reg}}}:\mathbb{R}^{\mathcal{G}_{D}}_{\mathrm{reg}}\times\mathcal{M}(\mathcal{G}_{B})\to\mathcal{M}(\mathcal{G}),\qquad(\Lambda,\varepsilon)\mapsto(I-\Lambda)^{-T}\varepsilon,

and for every $X\in\mathcal{M}(\mathcal{G})$, let the fiber of $X$ with respect to $\Phi_{\mathcal{G}_{\mathrm{reg}}}$ be

\Phi_{\mathcal{G}_{\mathrm{reg}}}^{-1}(X):=\{(\Lambda,\varepsilon)\in\mathbb{R}^{\mathcal{G}_{D}}_{\mathrm{reg}}\times\mathcal{M}(\mathcal{G}_{B})\>:\>\Phi_{\mathcal{G}_{\mathrm{reg}}}(\Lambda,\varepsilon)\stackrel{d}{=}X\}.

(7.1)

For a generic choice of $(\Lambda,\varepsilon)$, let $X=\Phi_{\mathcal{G}_{\mathrm{reg}}}(\Lambda,\varepsilon)$. We say that the graph $\mathcal{G}$ is generically identifiable if $\mathrm{P}_{\mathbb{R}^{\mathcal{G}}_{\mathrm{reg}}}(\Phi_{\mathcal{G}_{\mathrm{reg}}}^{-1}(X))=\{\Lambda\}$.
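To make the role of the regularity condition concrete, the following sketch evaluates the map $(\Lambda,\varepsilon)\mapsto(I-\Lambda)^{-T}\varepsilon$ for a single realization of $\varepsilon$ by solving $(I-\Lambda)^{T}x=\varepsilon$, and refuses to do so when $I-\Lambda$ is numerically singular. The function names `solve_linear` and `phi`, the tolerance, and the example weights are hypothetical and chosen only for illustration.

```python
def solve_linear(A, b, tol=1e-12):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < tol:
            raise ValueError("I - Lambda is (numerically) singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def phi(Lam, eps):
    """Return x with x = Lam^T x + eps, i.e. x = (I - Lam)^{-T} eps.

    Lam[i][j] is the weight of the directed edge i -> j.
    """
    p = len(eps)
    # Entry (j, i) of (I - Lam)^T equals delta_ij - Lam[i][j].
    A = [[(1.0 if i == j else 0.0) - Lam[i][j] for i in range(p)] for j in range(p)]
    return solve_linear(A, eps)


# A 3-cycle 0 -> 1 -> 2 -> 0 with det(I - Lam) = 1 - 0.5 * (-0.7) * 0.3 != 0.
Lam = [[0.0, 0.5, 0.0], [0.0, 0.0, -0.7], [0.3, 0.0, 0.0]]
eps = [1.0, -2.0, 0.5]
x = phi(Lam, eps)
print(x)
```

For a 2-cycle whose two weights multiply to one, for example `[[0.0, 2.0], [0.5, 0.0]]`, the call raises `ValueError`, which corresponds to a parameter outside $\mathbb{R}^{\mathcal{G}_{D}}_{\mathrm{reg}}$.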

Lemma 7.1.

Let $X=\Phi_{\mathcal{G}_{\mathrm{reg}}}(\Lambda,\varepsilon)$ for a generic choice of parameters $(\Lambda,\varepsilon)\in\mathbb{R}^{\mathcal{G}_{D}}_{\mathrm{reg}}\times\mathcal{M}(\mathcal{G}_{B})$. The matrix $\tilde{\Lambda}\in\mathbb{R}^{\mathcal{G}_{D}}_{\mathrm{reg}}$ belongs to $\mathrm{P}_{\mathbb{R}^{\mathcal{G}}_{\mathrm{reg}}}(\Phi_{\mathcal{G}_{\mathrm{reg}}}^{-1}(X))$ if it is a solution to the following linear system of equations:

\underbrace{[(B_{\Lambda})_{\mathop{\rm pa}\nolimits(v),R_{v}}]^{T}}_{(B_{\Lambda})^{v}}\cdot\tilde{\lambda}_{\mathop{\rm pa}\nolimits(v),v}=[(B_{\Lambda})_{v,R_{v}}]^{T},\quad\forall v\in V.

(7.2)

Proof.

It suffices to notice that for the reverse implication of the proof of Lemma 3.2, we never used the acyclicity of the graph. Hence, the same proof applies. ∎

Theorem 7.2.

If a mixed graph $\mathcal{G}$ is identifiable, then for every $v\in V$ , there exists a subset $I_{v}$ of $R_{v}$ with the size $|\text{pa}(v)|$ , such that there is a system of non-intersecting paths from $I_{v}$ to $\text{pa}(v)$ .

Proof.

If there is a $v\in V$ for which no such subset $I_{v}$ exists, then the matrix $(B_{\Lambda})^{v}$ is rank deficient; hence there is $\tilde{\lambda}_{\mathop{\rm pa}\nolimits(v),v}\neq\lambda_{\mathop{\rm pa}\nolimits(v),v}$ that solves Eq. 7.2. The matrix $\tilde{\Lambda}$, obtained from $\Lambda$ by substituting the column corresponding to $v$ with $\tilde{\lambda}_{\mathop{\rm pa}\nolimits(v),v}$, belongs to $\mathrm{P}_{\mathbb{R}^{\mathcal{G}}_{\mathrm{reg}}}(\Phi_{\mathcal{G}_{\mathrm{reg}}}^{-1}(X))$ according to Lemma 7.1. Hence, the matrix $\Lambda$ is not identifiable. ∎

Example 7.1 (A non-identifiable cyclic graph).

For the graph in Fig. 8, we have $\mathop{\rm pa}\nolimits(v_{2})=\{v_{1},v_{3}\}$ , and $I_{v_{2}}=\{v_{1}\}$ . Considering $v=v_{2}$ in Theorem 7.2, we can see that the matrix $\Lambda$ is not identifiable for this cyclic graph.

Figure 8: A non-identifiable cyclic graph.

Example 7.2 (Non-sufficiency of the graphical criterion).

Let $\mathcal{G}_{2}$ be the 2-cycle in Fig. 9. The matrix $A$ of Eq. 3.1 will have the following form:

A=\begin{bmatrix}b_{v_{1}v_{1}}-\tilde{\lambda}_{v_{2}v_{1}}b_{v_{2}v_{1}}&b_{v_{1}v_{2}}-\tilde{\lambda}_{v_{2}v_{1}}b_{v_{2}v_{2}}\\ b_{v_{2}v_{1}}-\tilde{\lambda}_{v_{1}v_{2}}b_{v_{1}v_{1}}&b_{v_{2}v_{2}}-\tilde{\lambda}_{v_{1}v_{2}}b_{v_{1}v_{2}}\end{bmatrix}.

From 1 and the fact that there are no bidirected edges in the graph, we should have

a_{v_{1}v_{1}}\cdot a_{v_{2}v_{1}}=a_{v_{1}v_{2}}\cdot a_{v_{2}v_{2}}=0.

Since the graph has cycles, we cannot rule out the possibility that the diagonal entries of $A$ are equal to zero. Hence, a valid solution is

\tilde{\lambda}_{v_{1}v_{2}}=\frac{b_{v_{2}v_{2}}}{b_{v_{1}v_{2}}}=\frac{1}{\lambda_{v_{2}v_{1}}},\quad\tilde{\lambda}_{v_{2}v_{1}}=\frac{b_{v_{1}v_{1}}}{b_{v_{2}v_{1}}}=\frac{1}{\lambda_{v_{1}v_{2}}}.

This implies that the observed vector $X=(X_{1},X_{2})$ can be written in at least two different ways,

\begin{bmatrix}1&-\lambda_{v_{2}v_{1}}\\ -\lambda_{v_{1}v_{2}}&1\end{bmatrix}^{-1}\begin{bmatrix}\varepsilon_{1}\\ \varepsilon_{2}\end{bmatrix},\quad\begin{bmatrix}1&-1/\lambda_{v_{1}v_{2}}\\ -1/\lambda_{v_{2}v_{1}}&1\end{bmatrix}^{-1}\begin{bmatrix}(-1/\lambda_{v_{1}v_{2}})\varepsilon_{2}\\ (-1/\lambda_{v_{2}v_{1}})\varepsilon_{1}\end{bmatrix},

that are both compatible with the graph $\mathcal{G}_{2}$ . In other words, the matrix $\Lambda$ is not identifiable.
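The two representations can be verified numerically. In the sketch below, which uses hypothetical weights and error values, the observation $(X_{1},X_{2})$ is generated from the structural equations $X_{1}=\lambda_{v_{2}v_{1}}X_{2}+\varepsilon_{1}$ and $X_{2}=\lambda_{v_{1}v_{2}}X_{1}+\varepsilon_{2}$, and the residuals under the alternative coefficients $1/\lambda_{v_{2}v_{1}}$ and $1/\lambda_{v_{1}v_{2}}$ are compared with the rescaled, swapped errors.

```python
def two_cycle_solution(lam12, lam21, eps1, eps2):
    """Solve x1 = lam21 * x2 + eps1 and x2 = lam12 * x1 + eps2 for (x1, x2)."""
    det = 1.0 - lam12 * lam21
    if det == 0.0:
        raise ValueError("I - Lambda is singular")
    return (eps1 + lam21 * eps2) / det, (eps2 + lam12 * eps1) / det


def residuals(lam12, lam21, x1, x2):
    """Errors implied by the coefficients (lam12, lam21) for the observation (x1, x2)."""
    return x1 - lam21 * x2, x2 - lam12 * x1


lam12, lam21 = 0.5, -0.8        # hypothetical edge weights v1 -> v2 and v2 -> v1
eps1, eps2 = 1.3, -0.4          # one realization of the independent errors
x1, x2 = two_cycle_solution(lam12, lam21, eps1, eps2)

# Alternative parameters: the weight of v1 -> v2 becomes 1/lam21, that of v2 -> v1 becomes 1/lam12.
alt_res = residuals(1.0 / lam21, 1.0 / lam12, x1, x2)
expected = (-eps2 / lam12, -eps1 / lam21)
print(alt_res, expected)
```

Since the alternative residuals are rescaled copies of $\varepsilon_{2}$ and $\varepsilon_{1}$, they are independent whenever the original errors are, so the two parameter choices cannot be told apart from the distribution of $X$.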

Figure 9: On the left, a 2-cycle. On the right, a $k$-cycle.

Lemma 7.3.

The $k$ -cycle, depicted in Fig. 9, is generically identifiable if and only if $k\geq 3$ .

Proof.

From Example 7.2, we already know that $\mathcal{G}_{2}$ is not identifiable. Hence, it is only left to show that $\mathcal{G}_{k}$ is identifiable for $k\geq 3$ . Herein, we present a proof for the case $k=3$ , as for larger cases, a similar argument would hold.

The matrix $A$ of Eq. 3.1 will have the following shape:

A=\begin{bmatrix}b_{v_{1}v_{1}}-\tilde{\lambda}_{v_{3}v_{1}}b_{v_{3}v_{1}}&b_{v_{1}v_{2}}-\tilde{\lambda}_{v_{3}v_{1}}b_{v_{3}v_{2}}&b_{v_{1}v_{3}}-\tilde{\lambda}_{v_{3}v_{1}}b_{v_{3}v_{3}}\\ b_{v_{2}v_{1}}-\tilde{\lambda}_{v_{1}v_{2}}b_{v_{1}v_{1}}&b_{v_{2}v_{2}}-\tilde{\lambda}_{v_{1}v_{2}}b_{v_{1}v_{2}}&b_{v_{2}v_{3}}-\tilde{\lambda}_{v_{1}v_{2}}b_{v_{1}v_{3}}\\ b_{v_{3}v_{1}}-\tilde{\lambda}_{v_{2}v_{3}}b_{v_{2}v_{1}}&b_{v_{3}v_{2}}-\tilde{\lambda}_{v_{2}v_{3}}b_{v_{2}v_{2}}&b_{v_{3}v_{3}}-\tilde{\lambda}_{v_{2}v_{3}}b_{v_{2}v_{3}}\end{bmatrix}.

(7.3)

From 1 and the fact that there are no bidirected edges in the graph, we should have

\begin{cases}a_{v_{1}v_{1}}\cdot a_{v_{2}v_{1}}=a_{v_{1}v_{1}}\cdot a_{v_{3}v_{1}}=0\\ a_{v_{1}v_{2}}\cdot a_{v_{2}v_{2}}=a_{v_{3}v_{2}}\cdot a_{v_{2}v_{2}}=0\\ a_{v_{1}v_{3}}\cdot a_{v_{3}v_{3}}=a_{v_{2}v_{3}}\cdot a_{v_{3}v_{3}}=0\\ a_{v_{1}v_{3}}\cdot a_{v_{2}v_{3}}=a_{v_{1}v_{2}}\cdot a_{v_{3}v_{2}}=a_{v_{2}v_{1}}\cdot a_{v_{3}v_{1}}=0.\end{cases}

(7.4)

If all the non-diagonal entries of $A$ are set to zero, we find $\tilde{\Lambda}=\Lambda$. We now show that this is the only solution of the system. Assume $a_{v_{2}v_{1}}\neq 0$; by Eq. 7.4, this implies $a_{v_{1}v_{1}}=0$. This leads to

\tilde{\lambda}_{v_{3}v_{1}}=\frac{b_{v_{1}v_{1}}}{b_{v_{3}v_{1}}},

plugging this value in Eq. 7.3, we obtain

A=\begin{bmatrix}0&b_{v_{1}v_{2}}-({b_{v_{1}v_{1}}}/{b_{v_{3}v_{1}}})b_{v_{3}v_{2}}&b_{v_{1}v_{3}}-({b_{v_{1}v_{1}}}/{b_{v_{3}v_{1}}})b_{v_{3}v_{3}}\\ b_{v_{2}v_{1}}-\tilde{\lambda}_{v_{1}v_{2}}b_{v_{1}v_{1}}&b_{v_{2}v_{2}}-\tilde{\lambda}_{v_{1}v_{2}}b_{v_{1}v_{2}}&b_{v_{2}v_{3}}-\tilde{\lambda}_{v_{1}v_{2}}b_{v_{1}v_{3}}\\ b_{v_{3}v_{1}}-\tilde{\lambda}_{v_{2}v_{3}}b_{v_{2}v_{1}}&b_{v_{3}v_{2}}-\tilde{\lambda}_{v_{2}v_{3}}b_{v_{2}v_{2}}&b_{v_{3}v_{3}}-\tilde{\lambda}_{v_{2}v_{3}}b_{v_{2}v_{3}}\end{bmatrix}.

(7.5)

Writing explicitly the first row of the matrix in Eq. 7.5, one can see that

a_{v_{1}v_{2}}=-\frac{\det((B_{\Lambda})_{\{1,3\},\{1,2\}})}{b_{v_{3}v_{1}}},\quad a_{v_{1}v_{3}}=-\frac{\det((B_{\Lambda})_{\{1,3\},\{1,3\}})}{b_{v_{3}v_{1}}},

and both these quantities are different from zero for a generic choice of parameters in $\mathbb{R}^{\mathcal{G}_{3}}_{\mathrm{reg}}$ , see Lemma A.2. Having $a_{v_{1}v_{2}}\neq 0$ , Eq. 7.4 implies that $a_{v_{2}v_{2}}=0$ and following the same argument as above, we conclude that $a_{v_{2}v_{3}}\neq 0$ . Finally, we have $a_{v_{1}v_{3}}\cdot a_{v_{2}v_{3}}\neq 0$ since both terms are non-zero, which contradicts Eq. 7.4. This proves that for a generic choice of parameters, the only solution of Eq. 7.4 is given by $\tilde{\Lambda}=\Lambda$ . In other words, the matrix $\Lambda$ is generically identifiable. ∎

Remark 7.1.

It is noteworthy that Drton et al. (2011, Lemma 9) prove that $k$ -cycles with $k\geq 3$ are not generically identifiable from the covariance matrix alone.

Theorem 7.4.

Let $\mathcal{G}=(V,E_{\rightarrow{}},E_{\leftrightarrow{}}=\emptyset)$ be a directed graph, such that $V=C_{1}\dot{\cup}\cdots\dot{\cup}C_{n}$ , with $C_{i}$ being a $k_{i}$ -cycle, and $\mathop{\rm pa}\nolimits(C_{i})\subseteq\bigcup_{j=0}^{i}C_{j}$ . Then $\mathcal{G}$ is generically identifiable if and only if for every cycle $C=\{v_{1},v_{2}\}$ of size $2$ , we have $\mathop{\rm pa}\nolimits(C)\setminus C=\mathop{\rm pa}\nolimits(v_{i})\setminus C$ for $i\in\{1,2\}$ .

Proof.

See Section B.3. ∎

8 Computational Experiments

8.1 Certifying Identifiability

We implemented the criterion from Theorem 4.3 using the algorithm of Dinits (1970) to solve the maximum flow problem. This algorithm has a complexity of $\mathcal{O}(|V|^{4})$; consequently, our implementation of the criterion has a complexity of $\mathcal{O}(|V|^{5})$. We then determine the proportion of identifiable graphs among randomly sampled ADMGs with $p\in\{25,50\}$ nodes and $e=p,2p,\dots,10p$ edges. For each setup, we randomly sampled $5000$ graphs. More details on how the graphs were generated are given in Section C.1.
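Path-system conditions such as the one in Theorem 7.2 ask for directed paths between two node sets that do not intersect, which reduces to a maximum flow problem by the standard node-splitting construction. The sketch below illustrates this reduction only and is not the implementation used in the experiments: it interprets "non-intersecting" as vertex-disjoint, uses plain breadth-first augmenting paths (in the style of Edmonds and Karp) instead of the algorithm of Dinits, takes the two node sets as given inputs rather than deriving them from the criterion of Theorem 4.3, assumes there are no self-loops, and uses hypothetical function names.

```python
from collections import deque


def max_vertex_disjoint_paths(n_nodes, edges, sources, targets):
    """Maximum number of vertex-disjoint directed paths from `sources` to `targets`.

    Node v is split into v_in = 2*v and v_out = 2*v + 1, joined by a
    unit-capacity arc, so that every node lies on at most one path.
    """
    s, t = 2 * n_nodes, 2 * n_nodes + 1
    cap = {}                                   # residual capacities
    adj = {i: set() for i in range(2 * n_nodes + 2)}

    def add_arc(a, b):
        cap[(a, b)] = cap.get((a, b), 0) + 1
        cap.setdefault((b, a), 0)              # reverse arc for the residual graph
        adj[a].add(b)
        adj[b].add(a)

    for v in range(n_nodes):
        add_arc(2 * v, 2 * v + 1)              # node capacity one
    for u, v in edges:
        add_arc(2 * u + 1, 2 * v)              # directed edge u -> v
    for u in sources:
        add_arc(s, 2 * u)
    for v in targets:
        add_arc(2 * v + 1, t)

    flow = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:       # breadth-first search in the residual graph
            a = queue.popleft()
            for b in adj[a]:
                if b not in parent and cap[(a, b)] > 0:
                    parent[b] = a
                    queue.append(b)
        if t not in parent:
            return flow
        b = t
        while parent[b] is not None:           # push one unit along the path found
            a = parent[b]
            cap[(a, b)] -= 1
            cap[(b, a)] += 1
            b = a
        flow += 1


bottleneck = [(0, 2), (1, 2), (2, 3), (2, 4)]
print(max_vertex_disjoint_paths(5, bottleneck, [0, 1], [3, 4]))
print(max_vertex_disjoint_paths(5, bottleneck + [(1, 4)], [0, 1], [3, 4]))
```

In the first graph every path from $\{0,1\}$ to $\{3,4\}$ passes through node 2, so only one path fits; adding the edge $1\to 4$ allows the two disjoint paths $0\to 2\to 3$ and $1\to 4$.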

The proportions are displayed in Fig. 10. We observe that for the given sampling scheme, most graphs yield identifiable models. The proportion of identifiable models remains similar across the two considered dimensions.

Figure 10: Proportion of ADMGs for which every entry of the matrix $\Lambda$ is generically identifiable.

8.2 Causal Effect Estimation

Herein, we present an optimization problem that can be used to infer the identifiable causal effects.

Lemma 8.1.

Let $X=\Phi_{\mathcal{G}}(\Lambda,\varepsilon)\in\mathcal{M}(\mathcal{G})$ , for a generic choice of $(\Lambda,\varepsilon)$ , then $\tilde{\Lambda}\in\mathbb{R}^{\mathcal{G}_{D}}$ is a solution of Eq. 3.3 if and only if it is a solution of the optimization problem

\min_{\tilde{\Lambda}\in\mathbb{R}^{\mathcal{G}_{D}}}\left\{\sum_{u\xleftrightarrow{}v\notin\mathcal{G}}\mu\left([(I-\tilde{\Lambda})\cdot X]_{u},[(I-\tilde{\Lambda})\cdot X]_{v}\right)\right\},

(8.1)

where $\mu(\cdot,\cdot)$ is any consistent measure of dependence, i.e., any nonnegative function that takes as input two random variables and returns zero if and only if the random variables are independent.

Proof.

See Section B.4. ∎

For practical estimation, we may form an empirical version of the problem in Eq. 8.1 by replacing the dependence measure $\mu$ with a suitable consistent estimate. One natural choice for $\mu$ is mutual information (Cover and Thomas, 2006, §8.6). However, the most popular estimator of mutual information is based on a k-nearest neighbor clustering of the sample, which would result in a non-smooth optimization problem (Kraskov et al., 2004). Several alternatives to mutual information have been proposed in the literature (Székely et al., 2007; Geenens and Lafaye de Micheaux, 2022; Shi et al., 2022). In particular, the Hilbert-Schmidt independence criterion (HSIC) (Gretton et al., 2007) has been extensively applied in causal inference (Mooij et al., 2009; Saengkyongam et al., 2022). For our empirical study, we used the HSIC, but other measures of dependence can also be used.

When the underlying kernel is characteristic, the HSIC provides a measure of dependence that vanishes if and only if the variables for which it is computed are independent (Fukumizu et al., 2007, §2.2 and Thm. 1). Moreover, a consistent estimator for the HSIC (Gretton et al., 2007) is given by

\widehat{\operatorname{HSIC}}_{n}(X,Y):=\frac{1}{n^{2}}\operatorname{tr}(K_{X}HK_{Y}H),

where $H_{i,j}=\delta_{i,j}-1/n$, and $K_{X}$ and $K_{Y}$ denote the respective Gram matrices.
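As an illustration of the estimator, the following sketch computes $\operatorname{tr}(K_{X}HK_{Y}H)/n^{2}$ for two univariate samples. It is a plain standard-library version written for readability, not the code used in the experiments; the RBF bandwidth of one, the sample size, and the function names are assumptions made for this example. The trace is evaluated through the identity $\operatorname{tr}(K_{X}HK_{Y}H)=\operatorname{tr}((HK_{X}H)K_{Y})$, which avoids forming matrix products.

```python
import math
import random


def rbf_gram(values, bandwidth=1.0):
    """Gram matrix of the RBF kernel exp(-(a - b)^2 / (2 * bandwidth^2))."""
    scale = 2.0 * bandwidth ** 2
    return [[math.exp(-(a - b) ** 2 / scale) for b in values] for a in values]


def double_center(K):
    """Return H K H for H = I - (1/n) 1 1^T."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + total for j in range(n)] for i in range(n)]


def hsic(x, y, gram=rbf_gram):
    """Empirical HSIC tr(K_X H K_Y H) / n^2 for two samples of equal length."""
    n = len(x)
    Kx_centered = double_center(gram(x))
    Ky = gram(y)
    # tr((H K_X H) K_Y) = sum_ij (H K_X H)_ij (K_Y)_ji
    return sum(Kx_centered[i][j] * Ky[j][i] for i in range(n) for j in range(n)) / n ** 2


rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(150)]
y_dependent = [v * v + 0.1 * rng.gauss(0.0, 1.0) for v in x]   # uncorrelated with x, but dependent
y_independent = [rng.gauss(0.0, 1.0) for _ in range(150)]
h_dep = hsic(x, y_dependent)
h_ind = hsic(x, y_independent)
print(h_dep, h_ind)
```

The pair $(x,x^{2})$ is uncorrelated for symmetric $x$, so the example also illustrates why a covariance-based measure would not suffice here.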

For a fixed graph $\mathcal{G}$ , and a given sample matrix $X\in\mathbb{R}^{n\times p}$ , we estimate $\Lambda$ as a solution to the following optimization problem

\min_{\tilde{\Lambda}\in\mathbb{R}^{\mathcal{G}_{D}}}\left\{\sum_{u\xleftrightarrow{}v\notin\mathcal{G}}\widehat{\operatorname{HSIC}}_{n}\left([(I-\tilde{\Lambda})\cdot X]_{u},[(I-\tilde{\Lambda})\cdot X]_{v}\right)\right\}.

(8.2)

We used the L-BFGS method of Liu and Nocedal (1989) to solve the above optimization problem. We considered two types of kernels in our experiments: radial basis function (RBF) kernels, the results of which are presented in Fig. 13 in Section C.2, and polynomial kernels, the results of which are depicted in Fig. 12. More details on the data generation, as well as additional experiments with different error distributions, can be found in Section C.1. It is noteworthy that in our experiments the polynomial kernels of degree 2 (Schölkopf and Smola, 2018, §2.3) provide better estimates, and the results depend less on the initialization than with the RBF kernels.
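To convey the idea behind Eq. 8.2 in the simplest possible case, the sketch below considers the two-node graph $X_{1}\to X_{2}$ without a bidirected edge, so that the objective has a single term and a single unknown coefficient. It deliberately departs from the setup described above: a grid search replaces L-BFGS, the kernel is the degree-2 polynomial kernel $k(a,b)=(1+ab)^{2}$, the errors are uniform, and the sample size, grid, and function names are hypothetical choices.

```python
import random


def poly2_gram(values):
    """Gram matrix of the degree-2 polynomial kernel k(a, b) = (1 + a*b)^2."""
    return [[(1.0 + a * b) ** 2 for b in values] for a in values]


def center(K):
    """Return H K H for a symmetric Gram matrix K and H = I - (1/n) 1 1^T."""
    n = len(K)
    row = [sum(r) / n for r in K]              # equals the column means by symmetry
    total = sum(row) / n
    return [[K[i][j] - row[i] - row[j] + total for j in range(n)] for i in range(n)]


def hsic_from_centered(Kx_centered, y):
    """tr((H K_X H) K_Y) / n^2, with K_Y built from the sample y."""
    n = len(y)
    Ky = poly2_gram(y)
    return sum(Kx_centered[i][j] * Ky[i][j] for i in range(n) for j in range(n)) / n ** 2


def estimate_edge(x1, x2, grid):
    """Grid value lam minimising the empirical HSIC between x1 and x2 - lam * x1."""
    Kx = center(poly2_gram(x1))                # does not depend on lam, so compute it once

    def objective(lam):
        resid = [b - lam * a for a, b in zip(x1, x2)]
        return hsic_from_centered(Kx, resid)

    return min(grid, key=objective)


rng = random.Random(1)
true_lam = 1.0
x1 = [rng.uniform(-2.0, 2.0) for _ in range(150)]
x2 = [true_lam * a + rng.uniform(-1.0, 1.0) for a in x1]
grid = [k / 10 for k in range(0, 21)]          # candidate coefficients 0.0, 0.1, ..., 2.0
lam_hat = estimate_edge(x1, x2, grid)
print(lam_hat)
```

A grid search is only feasible for one or two coefficients; for larger graphs a gradient-based method is the natural choice, and the objective then sums one HSIC term per pair of nodes that is not joined by a bidirected edge.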

Figure 11: Double confounder graph.

In Fig. 12, we report the performance of our method on the IV graph (Fig. 1), the ADMG shown in Fig. 11, and the 3-cycle (Fig. 9). We used the normalized Frobenius loss between the estimated matrix $\hat{\Lambda}$ and the true matrix $\Lambda$, i.e., $||\hat{\Lambda}-\Lambda||_{F}/||\Lambda||_{F}$, as our loss function and reported the mean loss over fifty randomly sampled $\Lambda$. We compared our method against the Empirical Likelihood (EL) estimator proposed in Wang and Drton (2017). Note that for the IV graph, the parameters are identifiable from the covariance matrix. Therefore, the EL estimator, which is a covariance-based method, outperforms our method, as the covariance matrix estimator is more sample-efficient than the HSIC estimator. In contrast, for the ADMG in Fig. 11 and the 3-cycle, the performance of the EL estimator does not improve with the sample size. This is due to the fact that the parameters of these two mixed graphs cannot be determined solely from the covariance matrix; see Foygel et al. (2012, Prop. 2) and Drton et al. (2011, Lemma 9). When initialized at the regression coefficients, the performance of our proposed estimator improves with the sample size. This indicates that the potential numerical issues arising from the non-convexity of the objective function and the estimation errors become less relevant as the sample size grows.
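For reference, the loss $||\hat{\Lambda}-\Lambda||_{F}/||\Lambda||_{F}$ can be computed as in the following short sketch; the function name and the example matrices are hypothetical.

```python
import math


def normalized_frobenius_loss(estimate, truth):
    """Return ||estimate - truth||_F / ||truth||_F for matrices given as lists of rows."""
    num = math.sqrt(sum((e - t) ** 2
                        for row_e, row_t in zip(estimate, truth)
                        for e, t in zip(row_e, row_t)))
    den = math.sqrt(sum(t ** 2 for row in truth for t in row))
    if den == 0.0:
        raise ValueError("the true matrix must be non-zero")
    return num / den


truth = [[0.0, 2.0], [0.0, 0.0]]
estimate = [[0.0, 3.0], [0.0, 0.0]]
loss = normalized_frobenius_loss(estimate, truth)
print(loss)
```

The normalization makes losses comparable across randomly sampled $\Lambda$ of different magnitudes.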

9 Conclusions

In this work, we studied the generic identifiability of direct causal effects in linear structural equation models with dependent errors. For acyclic models, we obtained a complete graphical characterization of the identifiable causal effects, with a graphical criterion that can be checked in polynomial time in the size of the graph. For cyclic models, we proved that the same graphical conditions are necessary for identifiability, and we provided counter-examples to show that they are not sufficient. For a smaller family of cyclic models, we provided a complete graphical characterization of the identifiable effects. A complete characterization of the identifiability for cyclic models, however, involves additional mathematical subtleties and is left as a problem for future work.

We also discussed the identifiability of the causal graph. For this problem, we provided an algorithm to test the model equivalence of two arbitrary ADMGs and a graphical characterization of the model equivalence for two graphs that only differ in the presence of a directed edge.

Most of the literature on identifiability in linear structural equation models leverages specific moment equations to obtain identifiability results. In this work, we follow a different approach and exploit the information contained in the whole distribution, explicitly leveraging the independence relations dictated by the missing bidirected edges in the graph. To the best of our knowledge, our work is the first to follow this route in this generality. In an initial exploration of parameter estimation we showed that estimates obtained by minimizing structurally absent dependences can be useful.

To conclude, we highlight possible future directions.

Beyond Observational Data.

In this paper, we considered the situation when only observational data is available. Recent identification results that additionally consider information from interventional datasets have been proposed for non-parametric models (Lee et al., 2020; Kivva et al., 2022). Extending our results to these setups can be seen as a natural future direction.

Non-linear Models.

In the graphical models literature, different non-parametric assumptions on the functional relations among the variables have been used to guarantee the identifiability of the causal structure (Peters et al., 2017, §7.1). Similar assumptions have been used to prove the identifiability of the causal effect under specific causal assumptions, e.g., Imbens and Newey (2009). However, a general graphical criterion for identification in non-linear structural equation models is currently missing. We believe that the ideas we propose in this work admit suitable extensions to these more general settings.

Structure Identifiability.

In Section 6, we provided a graphical characterization for the model equivalence of two ADMGs that only differ in the presence of an edge. A graphical characterization for the model equivalence of two arbitrary ADMGs is still an open problem. Its solution would be relevant for the development of algorithms for causal discovery from observational data that work under minimal linearity assumptions.

References

Ardiyansyah and Sodomaco (2023) Muhammad Ardiyansyah and Luca Sodomaco. Dimensions of higher order factor analysis models. Algebr. Stat., 14(1):91–108, 2023.
Barber et al. (2022) Rina Foygel Barber, Mathias Drton, Nils Sturma, and Luca Weihs. Half-trek criterion for identifiability of latent variable models. Ann. Statist., 50(6):3174–3196, 2022.
Brito (2004) Carlos Brito. Graphical models for identification in structural equation models. Ph.D. thesis, UCLA Computer Science Dept., 2004.
Chen et al. (2022) Li Chen, Rasmus Kyng, Yang P. Liu, Richard Peng, Maximilian Probst Gutenberg, and Sushant Sachdeva. Maximum flow and minimum-cost flow in almost-linear time. In 63rd IEEE Annual Symposium on Foundations of Computer Science, FOCS 2022, Denver, CO, USA, October 31 - November 3, 2022, pages 612–623. IEEE, 2022.
Comon and Jutten (2010) Pierre Comon and Christian Jutten. Handbook of Blind Source Separation: Independent Component Analysis and Applications. Academic Press, Inc., USA, 1st edition, 2010.
Cormen et al. (2009) Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to algorithms. MIT Press, Cambridge, MA, third edition, 2009.
Cover and Thomas (2006) Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, USA, 2006.
Cox et al. (2015) David A. Cox, John Little, and Donal O’Shea. Ideals, varieties, and algorithms. Undergraduate Texts in Mathematics. Springer, Cham, fourth edition, 2015. An introduction to computational algebraic geometry and commutative algebra.
di Dio and Schmüdgen (2022) Philipp J. di Dio and Konrad Schmüdgen. The multidimensional truncated moment problem: the moment cone. J. Math. Anal. Appl., 511(1):Paper No. 126066, 38, 2022.
Dinits (1970) E. A. Dinits. Algorithm for solution of a problem of maximum flow in a network with power estimation. Sov. Math., Dokl., 11:1277–1280, 1970. ISSN 0197-6788.
Drton (2018) Mathias Drton. Algebraic problems in structural equation modeling. In The 50th anniversary of Gröbner bases, volume 77 of Adv. Stud. Pure Math., pages 35–86. Math. Soc. Japan, Tokyo, 2018.
Drton and Maathuis (2017) Mathias Drton and Marloes H. Maathuis. Structure learning in graphical modeling. Annu. Rev. Stat. Appl., 4:365–393, 2017.
Drton and Richardson (2008) Mathias Drton and Thomas S. Richardson. Binary models for marginal independence. J. R. Stat. Soc. Ser. B Stat. Methodol., 70(2):287–309, 2008.
Drton et al. (2011) Mathias Drton, Rina Foygel, and Seth Sullivant. Global identifiability of linear structural equation models. Ann. Statist., 39(2):865–886, 2011.
Eriksson and Koivunen (2004) Jan Eriksson and Visa Koivunen. Identifiability, separability, and uniqueness of linear ICA models. IEEE Signal Process. Lett., 11(7):601–604, 2004.
Evans and Ringel (1999) William N. Evans and Jeanne S. Ringel. Can higher cigarette taxes improve birth outcomes? Journal of Public Economics, 72(1):135–154, 1999.
Foygel et al. (2012) Rina Foygel, Jan Draisma, and Mathias Drton. Half-trek criterion for generic identifiability of linear structural equation models. Ann. Statist., 40(3):1682–1713, 2012.
Fukumizu et al. (2007) Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, and Bernhard Schölkopf. Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 3-6, 2007, pages 489–496. Curran Associates, Inc., 2007.
García-Puente et al. (2010) Luis D. García-Puente, Sarah Spielvogel, and Seth Sullivant. Identifying causal effects with computer algebra. In UAI 2010, Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, Catalina Island, CA, USA, July 8-11, 2010, pages 193–200. AUAI Press, 2010.
Garrote-López and Stephenson (2024) Marina Garrote-López and Monroe Stephenson. Cumulant tensors in partitioned independent component analysis. arXiv:2402.10089, 2024.
Geenens and Lafaye de Micheaux (2022) Gery Geenens and Pierre Lafaye de Micheaux. The Hellinger correlation. J. Amer. Statist. Assoc., 117(538):639–653, 2022.
Grayson and Stillman (2023) Daniel R. Grayson and Michael E. Stillman. Macaulay2, a software system for research in algebraic geometry. Available at http://www2.macaulay2.com, 2023.
Gretton et al. (2007) Arthur Gretton, Kenji Fukumizu, Choon Hui Teo, Le Song, Bernhard Schölkopf, and Alexander J. Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 3-6, 2007, pages 585–592. Curran Associates, Inc., 2007.
Gunsilius (2021) F. F. Gunsilius. Nontestability of instrument validity under continuous treatments. Biometrika, 108(4):989–995, 2021.
Imbens and Newey (2009) Guido Imbens and Whitney Newey. Identification and estimation of triangular simultaneous equations models without additivity. Econometrica, 77(5):1481–1512, 2009.
Kivva et al. (2022) Yaroslav Kivva, Ehsan Mokhtarian, Jalal Etesami, and Negar Kiyavash. Revisiting the general identifiability problem. In Uncertainty in Artificial Intelligence, Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence, UAI 2022, 1-5 August 2022, Eindhoven, The Netherlands, volume 180 of Proceedings of Machine Learning Research, pages 1022–1030. PMLR, 2022.
Kivva et al. (2023a) Yaroslav Kivva, Jalal Etesami, and Negar Kiyavash. On identifiability of conditional causal effects. In Uncertainty in Artificial Intelligence, UAI 2023, July 31 - 4 August 2023, Pittsburgh, PA, USA, volume 216 of Proceedings of Machine Learning Research, pages 1078–1086. PMLR, 2023a.
Kivva et al. (2023b) Yaroslav Kivva, Saber Salehkaleybar, and Negar Kiyavash. A cross-moment approach for causal effect estimation. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023b.
Kraskov et al. (2004) Alexander Kraskov, Harald Stögbauer, and Peter Grassberger. Estimating mutual information. Phys. Rev. E (3), 69(6):066138, 16, 2004.
Kumor et al. (2020) Daniel Kumor, Carlos Cinelli, and Elias Bareinboim. Efficient identification in linear structural causal models with auxiliary cutsets. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 5501–5510. PMLR, 2020.
Lauritzen (1996) Steffen L. Lauritzen. Graphical Models. Oxford University Press, 1996.
Lee et al. (2020) Sanghack Lee, Juan D. Correa, and Elias Bareinboim. General identifiability with arbitrary surrogate experiments. In Ryan P. Adams and Vibhav Gogate, editors, Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, volume 115 of Proceedings of Machine Learning Research, pages 389–398. PMLR, 2020.
Lewicki and Sejnowski (2000) Michael S. Lewicki and Terrence J. Sejnowski. Learning overcomplete representations. Neural Comput., 12(2):337–365, 2000.
Liu and Nocedal (1989) Dong C. Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Math. Programming, 45(3):503–528, 1989.
Liu et al. (2021) Yiheng Liu, Elina Robeva, and Huanqing Wang. Learning linear non-Gaussian graphical models with multidirected edges. J. Causal Inference, 9(1):250–263, 2021.
Lousdal (2018) Mette Lise Lousdal. An introduction to instrumental variable assumptions, validation and estimation. Emerging themes in epidemiology, 15(1):1, 2018.
Maathuis et al. (2019) Marloes Maathuis, Mathias Drton, Steffen Lauritzen, and Martin Wainwright, editors. Handbook of Graphical Models. Chapman & Hall/CRC Handbooks of Modern Statistical Methods. CRC Press, Boca Raton, FL, 2019.
McCullagh (1987) Peter McCullagh. Tensor methods in statistics. Monographs on Statistics and Applied Probability. Chapman & Hall, London, 1987.
Mesters and Zwiernik (2022) Geert Mesters and Piotr Zwiernik. Non-independent components analysis. arXiv:2206.13668, 2022.
Michałek and Sturmfels (2021) Mateusz Michałek and Bernd Sturmfels. Invitation to nonlinear algebra, volume 211 of Graduate Studies in Mathematics. American Mathematical Society, Providence, RI, 2021.
Mooij et al. (2009) Joris M. Mooij, Dominik Janzing, Jonas Peters, and Bernhard Schölkopf. Regression by dependence minimization and its application to causal inference in additive noise models. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, Montreal, Quebec, Canada, June 14-18, 2009, volume 382 of ACM International Conference Proceeding Series, pages 745–752. ACM, 2009.
Okamoto (1973) Masashi Okamoto. Distinctness of the eigenvalues of a quadratic form in a multivariate sample. Ann. Statist., 1:763–765, 1973.
Pearl (1995) Judea Pearl. On the testability of causal models with latent and instrumental variables. In Philippe Besnard and Steve Hanks, editors, UAI ’95: Proceedings of the Eleventh Annual Conference on Uncertainty in Artificial Intelligence, Montreal, Quebec, Canada, August 18-20, 1995, pages 435–443. Morgan Kaufmann, 1995.
Pearl (2009) Judea Pearl. Causality. Cambridge University Press, Cambridge, second edition, 2009. Models, reasoning, and inference.
Pearl (2017) Judea Pearl. A linear ‘microscope’ for interventions and counterfactuals. J. Causal Inference, 5(1):Art. No. 20170003, 2017.
Peters et al. (2017) Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of Causal Inference. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA, 2017. Foundations and learning algorithms.
Richardson (2003) Thomas Richardson. Markov properties for acyclic directed mixed graphs. Scand. J. Statist., 30(1):145–157, 2003.
Richardson and Spirtes (2002) Thomas Richardson and Peter Spirtes. Ancestral graph Markov models. Ann. Statist., 30(4):962–1030, 2002.
Richardson et al. (2023) Thomas S. Richardson, Robin J. Evans, James M. Robins, and Ilya Shpitser. Nested Markov properties for acyclic directed mixed graphs. Ann. Statist., 51(1):334–361, 2023.
Robeva and Seby (2021) Elina Robeva and Jean-Baptiste Seby. Multi-trek separation in linear structural equation models. SIAM Journal on Applied Algebra and Geometry, 5(2):278–303, 2021.
Saengkyongam et al. (2022) Sorawit Saengkyongam, Leonard Henckel, Niklas Pfister, and Jonas Peters. Exploiting independent instruments: Identification and distribution generalization. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 18935–18958. PMLR, 2022.
Salehkaleybar et al. (2020) Saber Salehkaleybar, AmirEmad Ghassami, Negar Kiyavash, and Kun Zhang. Learning linear non-Gaussian causal models in the presence of latent variables. J. Mach. Learn. Res., 21:Paper No. 39, 24, 2020.
Schkoda and Drton (2023) Daniela Schkoda and Mathias Drton. Goodness-of-fit tests for linear non-Gaussian structural equation models. arXiv:2311.04585, 2023.
Schölkopf and Smola (2018) Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. The MIT Press, 2018.
Shi et al. (2022) Hongjian Shi, Mathias Drton, and Fang Han. On the power of Chatterjee’s rank correlation. Biometrika, 109(2):317–333, 2022.
Shimizu (2022) Shōhei Shimizu. Statistical Causal Discovery: LiNGAM Approach. Springer, 2022.
Shpitser (2023) Ilya Shpitser. When does the id algorithm fail? arXiv:2307.03750, 2023.
Shpitser and Pearl (2006) Ilya Shpitser and Judea Pearl. Identification of joint interventional distributions in recursive semi-Markovian causal models. In Proceedings of the 21st National Conference on Artificial Intelligence - Volume 2, AAAI’06, pages 1219–1226. AAAI Press, 2006.
Shuai et al. (2023) Kang Shuai, Shanshan Luo, Yue Zhang, Feng Xie, and Yangbo He. Identification and estimation of causal effects using non-gaussianity and auxiliary covariates. arXiv:2304.14895, 2023.
Silva and Shimizu (2017) Ricardo Silva and Shohei Shimizu. Learning instrumental variables with structural and non-Gaussianity assumptions. J. Mach. Learn. Res., 18:Paper No. 120, 49, 2017.
Spirtes et al. (2000) Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, prediction, and search. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA, second edition, 2000. With additional material by David Heckerman, Christopher Meek, Gregory F. Cooper and Thomas Richardson, A Bradford Book.
Sullivant et al. (2010) Seth Sullivant, Kelli Talaska, and Jan Draisma. Trek separation for Gaussian graphical models. Ann. Statist., 38(3):1665–1685, 2010.
Székely et al. (2007) Gábor J. Székely, Maria L. Rizzo, and Nail K. Bakirov. Measuring and testing dependence by correlation of distances. Ann. Statist., 35(6):2769–2794, 2007.
Verma and Pearl (1990) Thomas Verma and Judea Pearl. Equivalence and synthesis of causal models. In Proceedings of the Sixth Annual Conference on Uncertainty in Artificial Intelligence, UAI ’90, pages 255–270, USA, 1990. Elsevier Science Inc.
Wang and Seigal (2024) Kexin Wang and Anna Seigal. Identifiability of overcomplete independent component analysis. arXiv:2401.14709, 2024.
Wang and Drton (2017) Y. Samuel Wang and Mathias Drton. Empirical likelihood for linear structural equation models with dependent errors. Stat, 6:434–447, 2017.
Wang and Drton (2023) Y. Samuel Wang and Mathias Drton. Causal discovery with unobserved confounding and non-Gaussian data. J. Mach. Learn. Res., 24:Paper No. [271], 61, 2023.
Wright (1928) P.G. Wright. The Tariff on Animal and Vegetable Oils. Investigations in international commercial policies. Macmillan, 1928.
Xie et al. (2022) Feng Xie, Yangbo He, Zhi Geng, Zhengming Chen, Ru Hou, and Kun Zhang. Testability of instrumental variables in linear non-Gaussian acyclic causal models. Entropy, 24(4):Paper No. 512, 19, 2022.
Xie et al. (2023) Feng Xie, Biwei Huang, Zhengming Chen, Ruichu Cai, Clark Glymour, Zhi Geng, and Kun Zhang. Generalized independent noise condition for estimating causal structure with latent variables. arXiv:2308.06718, 2023.

Appendix A Notions of Non-Linear Algebra

In this section, we give the basic definitions of non-linear algebra that we need for the proofs; we refer the interested reader to Cox et al. (2015); Michałek and Sturmfels (2021) for more details.

Definition A.1.

For every natural number $n$ , we denote the ring of polynomials in $n$ variables $x_{1},\dots,x_{n}$ by $\mathbb{R}[x_{1},\dots,x_{n}]$ . Let $S$ be a, possibly infinite, subset of $\mathbb{R}[x_{1},\dots,x_{n}]$ . The affine variety associated to $S$ is defined as $\mathcal{V}(S)=\{x\in\mathbb{R}^{n}\>:\>f(x)=0,\,\forall f\in S\}$ . The vanishing ideal associated to a variety $\mathcal{V}\subseteq\mathbb{R}^{n}$ is $\mathcal{I}(\mathcal{V})=\{f\in\mathbb{R}[x_{1},\dots,x_{n}]\>:\>f(x)=0,\,\forall x\in\mathcal{V}\}$ , and the coordinate ring of $\mathcal{V}$ is defined as $\mathbb{R}[\mathcal{V}]=\mathbb{R}[x_{1},\dots,x_{n}]/\mathcal{I}(\mathcal{V})$ .

Lemma A.1.

Sullivant et al. (2010, Prop. 3.1) Let $\mathcal{G}_{D}$ be any directed graph. For every $\Lambda\in\mathbb{R}^{\mathcal{G}_{D}}_{\mathrm{reg}}$ we have

(B_{\Lambda})_{uv}=(I-\Lambda)^{-T}_{uv}=\sum_{P\in\mathcal{P}(v,u)}\lambda^{P},

in particular $(B_{\Lambda})_{uv}=0$ if $u\notin\mathop{\rm de}\nolimits(v)$ .
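The path expansion in Lemma A.1 can be checked numerically on a small example. The sketch below is only an illustration: the four-node DAG and its edge weights are hypothetical, and the convention that `Lam[u, v]` stores $\lambda_{uv}$ for the edge $u\to v$ is an assumption chosen to match $B_{\Lambda}=(I-\Lambda)^{-T}$.

```python
import numpy as np

# Hypothetical DAG on nodes 0..3; Lam[u, v] holds lambda_{uv} for the edge u -> v.
edges = {(0, 1): 0.5, (0, 2): -1.2, (1, 2): 0.7, (2, 3): 2.0, (1, 3): -0.3}
p = 4
Lam = np.zeros((p, p))
for (u, v), w in edges.items():
    Lam[u, v] = w

# B_Lambda = (I - Lambda)^{-T}
B = np.linalg.inv(np.eye(p) - Lam).T

def path_sum(src, dst):
    """Sum, over all directed paths from src to dst, of the product of edge weights."""
    if src == dst:
        return 1.0  # the trivial path contributes the empty product
    return sum(w * path_sum(v, dst) for (u, v), w in edges.items() if u == src)

# Entry (u, v) sums over the directed paths from v to u, as in Lemma A.1.
path_sums = np.array([[path_sum(v, u) for v in range(p)] for u in range(p)])
```

Here `B` and `path_sums` agree up to floating-point error, and `B[u, v]` vanishes whenever `u` is not a descendant of `v`.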

Lemma A.2.

Sullivant et al. (2010, Lem. 3.3), Foygel et al. (Supplement 2012, Lem. 1) Let $\mathcal{G}_{D}$ be any directed graph, and let $I,J$ be two subsets of $V$ of the same size. Then for every $\Lambda\in\mathbb{R}^{\mathcal{G}_{D}}_{\mathrm{reg}}$ we have:

\det((B_{\Lambda})_{I,J})=\det((I-\Lambda)^{-T}_{I,J})=\sum_{\Pi\in\mathcal{P}(I,J)}|\sigma_{\Pi}|\lambda^{\Pi}=\sum_{\Pi\in\tilde{\mathcal{P}}(I,J)}|\sigma_{\Pi}|\lambda^{\Pi},

where $|\sigma_{\Pi}|$ denotes the sign of the permutation. In particular, $\tilde{\mathcal{P}}(I,J)=\emptyset$ implies $\det((B_{\Lambda})_{I,J})=0$ . The reverse implication holds for a generic choice of $\Lambda\in\mathbb{R}^{\mathcal{G}_{D}}_{\mathrm{reg}}$ , i.e., for any $\Lambda$ outside a Lebesgue measure 0 subset of $\mathbb{R}^{\mathcal{G}_{D}}_{\mathrm{reg}}$ .
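As a hedged illustration of the determinantal statement, the following sketch compares two hypothetical five-node DAGs. It works with $(I-\Lambda)^{-1}$, whose $(i,j)$ entry sums over directed paths from $i$ to $j$ (the transpose of $B_{\Lambda}$), so the index convention is an assumption of the example rather than the notation of the lemma. In the first graph every path from $\{0,1\}$ to $\{3,4\}$ passes through node 2, so no non-intersecting path system exists; the second graph adds the edge $1\to 4$.

```python
import numpy as np

p = 5
rng = np.random.default_rng(0)

def path_matrix(edge_list):
    """Return (I - Lambda)^{-1} for random weights on the given edges u -> v."""
    Lam = np.zeros((p, p))
    for u, v in edge_list:
        Lam[u, v] = rng.uniform(0.5, 2.0)
    return np.linalg.inv(np.eye(p) - Lam)

I, J = [0, 1], [3, 4]

# Every path from I to J runs through node 2: no non-intersecting system.
bottleneck = [(0, 2), (1, 2), (2, 3), (2, 4)]
det_bottleneck = np.linalg.det(path_matrix(bottleneck)[np.ix_(I, J)])

# Adding 1 -> 4 creates the non-intersecting system {0 -> 2 -> 3, 1 -> 4}.
disjoint = bottleneck + [(1, 4)]
det_disjoint = np.linalg.det(path_matrix(disjoint)[np.ix_(I, J)])
```

The first determinant vanishes up to rounding for every choice of weights, while the second equals the product of the weights on $0\to 2$, $2\to 3$ and $1\to 4$ and is therefore nonzero for generic weights.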

Appendix B Proofs

B.1 Proofs for Section 4

Proof of Lemma 4.1.

First, notice that for all practical purposes, we can consider the edge capacity to be $|\mathop{\rm pa}\nolimits(v)|$ instead of $\infty$ ; this implies that we can exploit additional properties of maximum flow problems with integer values.

We are going to show that to every flow $f$ in $G^{v}_{Q}$ of size $k$ with integer values, we can associate a pair $(I_{f},P_{f})\in 2^{R_{v}}\times 2^{Q}$ and a system of paths $\Pi_{f}\in\tilde{\mathcal{P}}(I_{f},P_{f})$ such that $|I_{f}|=k$ , and vice versa. That is, to every pair $(I,P)\in 2^{R_{v}}\times 2^{Q}$ and system of paths $\Pi=(\pi_{1},\dots,\pi_{k})\in\tilde{\mathcal{P}}(I,P)$ such that $|I|=k$ , we can associate an integer flow $f_{I,P}$ of size $k$ .

Let us first consider a flow $f$ with integer values in $G^{v}_{Q}$ . Since the capacity of each node that is not a source or a sink is $1$ , we can restrict the image of $f$ to be $\{0,1\}$ . Define $I_{f}=\{u\in R_{v}\>:\>f(s_{v},u)=1\}$ . Since the size of the flow is $k$ , we have $|I_{f}|=k$ . For every $u\in I_{f}$ consider the path $\pi_{u}:u=:u_{0},u_{1},\dots,u_{k_{u}}$ such that $f(u_{i},u_{i+1})=1$ and $u_{k_{u}}\in Q$ . This is well defined since for every $u_{i}\in V_{v}\setminus\{s_{v},t_{v}\}$ there is at most one other $u_{i+1}\in V_{v}\setminus\{s_{v},t_{v}\}$ such that $f(u_{i},u_{i+1})=1$ , and if one assumes that there is a $u_{i-1}$ such that $f(u_{i-1},u_{i})=1$ , then the existence of $u_{i+1}$ is guaranteed by the first equality in Eq. 4.1. Let $P_{f}=\{u_{k_{u}}\>:\>u\in I_{f}\}$ . By contradiction, we prove that $\Pi=(\pi_{u}\>:\>u\in I_{f})\in\tilde{\mathcal{P}}(I_{f},P_{f})$ . Suppose two paths in $\Pi$ intersect at a node $u$ . This implies that there are $u_{0}\neq u_{1}\in V_{v}$ such that $f(u_{0},u)=f(u_{1},u)=1$ , hence $\sum_{w\in V_{v}}f(w,u)\geq 2>c_{V}(u)$ , which violates Eq. 4.1.

For the other implication, consider $(I,P)\in 2^{R_{v}}\times 2^{Q}$ and a system of paths $\Pi=(\pi_{1},\dots,\pi_{k})\in\tilde{\mathcal{P}}(I,P)$ such that $|I|=k$ . We define $f_{I,P}$ as follows:

f_{I,P}(u,w)=\begin{cases}1,\qquad&\text{if }u=s_{v}\text{ and }w\in I,\text{ or }u\in P\text{ and }w=t_{v},\\ 1,\qquad&\text{if }\exists\,j\in[k]\>:\>u\to w\in\pi_{j},\\ 0,\qquad&\text{otherwise}.\end{cases}

(B.1)

We need to show that $f_{I,P}$ satisfies Eq. 4.1. Since the capacity of each edge is infinity, we only need to check the first inequality; this holds because $\Pi$ is a non-intersecting system of paths, and so each node has at most one incoming edge and at most one outgoing edge on which the flow is different from 0. By plugging Eq. B.1 into Eq. 4.2, it is straightforward to show that $|f_{I,P}|=k$ .

To conclude the proof, we need to show that the maximum flow problem $G^{v}_{Q}$ has a solution with integer values. This is ensured by applying Cormen et al. (2009, Thm. 26.10) and the fact that all the capacities in $G^{v}_{Q}$ are integers. ∎
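The correspondence between integer flows and non-intersecting path systems can be illustrated with a generic unit-vertex-capacity maximum flow computation. The sketch below is not the construction of $G^{v}_{Q}$ from Section 4, which is not reproduced here; the function name `max_disjoint_paths`, the node-splitting encoding, and the use of a plain breadth-first augmenting-path scheme (rather than the almost-linear-time algorithm of Chen et al. (2022)) are all assumptions made for illustration.

```python
from collections import defaultdict, deque

def max_disjoint_paths(edges, sources, sinks):
    """Maximum number of vertex-disjoint directed paths from `sources` to `sinks`.

    Each vertex x is split into ("in", x) -> ("out", x) with capacity 1, which
    enforces the unit vertex capacity. Original edges also get capacity 1,
    which is harmless because a unit-capacity vertex cannot carry more flow.
    """
    cap = defaultdict(int)
    adj = defaultdict(set)

    def add(a, b):
        cap[(a, b)] += 1
        adj[a].add(b)
        adj[b].add(a)  # residual direction

    nodes = {x for e in edges for x in e} | set(sources) | set(sinks)
    for x in nodes:
        add(("in", x), ("out", x))
    for u, w in edges:
        add(("out", u), ("in", w))
    s, t = "s", "t"
    for x in sources:
        add(s, ("in", x))
    for x in sinks:
        add(("out", x), t)

    flow = 0
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            a = queue.popleft()
            for b in adj[a]:
                if b not in parent and cap[(a, b)] > 0:
                    parent[b] = a
                    queue.append(b)
        if t not in parent:
            return flow
        # Push one unit of flow along the path that was found.
        b = t
        while parent[b] is not None:
            a = parent[b]
            cap[(a, b)] -= 1
            cap[(b, a)] += 1
            b = a
        flow += 1
```

For instance, on the edges `[(0, 2), (1, 2), (2, 3), (2, 4)]` with sources `[0, 1]` and sinks `[3, 4]`, all paths share node 2 and the flow value is 1; adding the edge `(1, 4)` raises it to 2.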

Proof of Theorem 4.2.

Chen et al. (2022) proved that the complexity of any maximum flow problem $(G,s,t,c_{V},c_{D})$ is almost linear in the number of edges in the graph $G$ . For every node $v\in V$ and $Q\subseteq\mathop{\rm pa}\nolimits(v)$ , to certify the identifiability of $\lambda_{Q,v}$ , one needs to solve the maximum flow problems $G^{v}_{Q}$ and $G^{v}_{\mathop{\rm pa}\nolimits(v)}$ and then check whether the difference of the sizes of the corresponding maximum flows is $|Q|$ . Since both $G^{v}_{Q}$ and $G^{v}_{\mathop{\rm pa}\nolimits(v)}$ have at most $2(|V|+1)$ nodes, and hence $\mathcal{O}(|V|^{2})$ edges, the overall complexity is $\mathcal{O}(|V|^{2+o(1)})$ . ∎

Proof of Theorem 4.3.

To certify the identifiability of all the directed edges, i.e., the whole matrix, one needs to solve the maximum flow problem $G^{v}_{\mathop{\rm pa}\nolimits(v)}$ for every $v$ in $V$ and check whether the maximum flow has the size $|\mathop{\rm pa}\nolimits(v)|$ . This adds a multiplicative factor $|V|$ to the result of Theorem 4.2 which leads to $\mathcal{O}(|V|^{3+o(1)})$ . ∎

B.2 Proofs for Section 5.2

In the sequel, we will consider $\mathcal{M}^{\leq k}(\mathcal{G}_{B})$ as a variety in the space given by the Cartesian product of the symmetric tensor spaces $(\operatorname{Sym}_{l}(\mathbb{R}^{p}))_{2\leq l\leq k}$ , which is isomorphic to $\mathbb{R}^{\sum_{s=2}^{k}\binom{p+s-1}{s}}$ . We denote the corresponding coordinate ring as $\mathbb{R}[\mathcal{M}^{\leq k}(\mathcal{G}_{B})]$ . For every $k$ -tuple $(i_{1},\dots,i_{k})$ , we denote by $\mathcal{M}^{(k)}_{\setminus(i_{1},\dots,i_{k})}(\mathcal{G}_{B})$ the projection of $\mathcal{M}^{(k)}(\mathcal{G}_{B})$ on the coordinates not corresponding to the entry $(i_{1},\dots,i_{k})$ .

Proof of Lemma 5.4.

The fact that $\phi^{k}$ is well defined is a consequence of Comon and Jutten (2010, Prop. 3.1).

There is a one-to-one linear transformation between cumulants and moments; see, e.g., McCullagh (1987, §2.3); hence, it is enough to prove the result for the corresponding set of moments. It is known that the set of symmetric tensors that can be generated as a moment of a distribution is a full-dimensional convex cone in the space of symmetric tensors; see, e.g., di Dio and Schmüdgen (2022, Lem. 3.3). Hence, the same result holds for the set of cumulants. The set $\phi^{\leq k}(\mathcal{M}_{\infty}(\mathcal{G}_{B}))$ is the projection of this convex cone along the coordinate axes corresponding to connected subsets of $\mathcal{G}_{B}$ , and so is itself a full-dimensional convex cone in $\mathcal{M}^{\leq k}(\mathcal{G}_{B})$ . ∎

Lemma B.1.

Let $\varepsilon\in\mathcal{M}_{\infty}(\mathcal{G}_{B})$ , and $A\in\mathbb{R}^{2\times p}$ , then

\mathcal{C}^{(k)}(A\cdot\varepsilon)_{i_{1},\dots,i_{k}}=\sum_{\begin{subarray}{c}\{j_{1},\dots,j_{k}\}\text{ is }\\ \text{connected in }\mathcal{G}_{B}\end{subarray}}\mathcal{C}^{(k)}(\varepsilon)_{j_{1},\dots,j_{k}}a_{i_{1}j_{1}}\cdots a_{i_{k}j_{k}}.

Proof.

A direct consequence of Lemma 5.3 and Lemma 5.4. ∎
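The multilinearity of cumulants that underlies Lemma B.1 can be verified numerically for order $k=3$, where the cumulant coincides with the third central moment. The sketch below is an illustration under stated assumptions: the sample size, the way dependence is injected between two coordinates, and the helper name `third_cumulant` are hypothetical. It checks the unrestricted multilinear identity for empirical cumulants; the restriction of the sum to connected index sets in the lemma reflects the population-level fact that the remaining cumulants of $\varepsilon$ vanish.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 2000, 3
eps = rng.laplace(size=(n, p))
eps[:, 1] += 0.8 * eps[:, 0] ** 2   # make two coordinates (nonlinearly) dependent
A = rng.normal(size=(2, p))

def third_cumulant(Z):
    """Empirical third-order cumulant tensor (equal to the third central moment)."""
    Zc = Z - Z.mean(axis=0)
    return np.einsum("na,nb,nc->abc", Zc, Zc, Zc) / Z.shape[0]

# Left-hand side: cumulant tensor of the transformed sample A . eps, shape (2, 2, 2).
lhs = third_cumulant(eps @ A.T)

# Right-hand side: sum over (j1, j2, j3) of C(eps)_{j1 j2 j3} a_{i1 j1} a_{i2 j2} a_{i3 j3}.
rhs = np.einsum("abc,ia,jb,kc->ijk", third_cumulant(eps), A, A, A)
```

Both tensors agree up to floating-point error, since centering and the empirical third moment commute with the linear map $A$.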

Proof of Theorem 5.5.

From Lemma 5.4, we know that $\dim(\phi^{\leq k}(\mathcal{M}_{\infty}(\mathcal{G}_{B})))=\dim(\mathcal{M}^{% \leq k}(\mathcal{G}_{B}))$ . Hence, it is enough to show that $\phi^{\leq k}(\mathcal{S}(\mathcal{G}_{B}))$ lies in a subvariety of $\mathcal{M}^{\leq k}(\mathcal{G}_{B})$ of strictly smaller dimension, see e.g., Okamoto (1973, Lemma).

Notice that we can write

\mathcal{S}(\mathcal{G}_{B})=\left(\bigcup_{i\in[p]}\mathcal{S}_{i}(\mathcal{G}_{B})\right)\cup\left(\bigcup_{u_{i}\leftrightarrow u_{j}\in\mathcal{G}_{B}}\mathcal{S}_{i\leftrightarrow j}(\mathcal{G}_{B})\right),

where

\kappa_{i}(\varepsilon)=\{A\in\kappa(\varepsilon)\>:\>a_{1i}\cdot a_{2i}\neq 0\},\qquad\mathcal{S}_{i}(\mathcal{G}_{B})=\{\varepsilon\in\mathcal{S}(\mathcal{G}_{B})\>:\>\kappa_{i}(\varepsilon)\neq\emptyset\},

while $\kappa_{i\leftrightarrow j}(\varepsilon)$ and $\mathcal{S}_{i\leftrightarrow j}(\mathcal{G}_{B})$ are defined in a similar way. Hence, it is enough to prove that both $\phi^{\leq k}(\mathcal{S}_{i}(\mathcal{G}_{B}))$ and $\phi^{\leq k}(\mathcal{S}_{i\leftrightarrow j}(\mathcal{G}_{B}))$ are Lebesgue measure 0 subsets of $\mathcal{M}^{\leq k}(\mathcal{G}_{B})$ for $k$ large enough.

We start by bounding the dimension of $\mathcal{S}_{i}(\mathcal{G}_{B})$ . For every $\varepsilon\in\mathcal{S}_{i}(\mathcal{G}_{B})$ , every $A\in\kappa_{i}(\varepsilon)$ , and every $0\neq s,t\in\mathbb{N}$ we can use Lemma B.1 to write

0=\mathcal{C}^{(s+t)}(A\cdot\varepsilon)_{\underbrace{1,\dots,1}_{s},\underbrace{2,\dots,2}_{t}}=\mathcal{C}^{(s+t)}(\varepsilon)_{i,\dots,i}a_{1i}^{s}\cdot a^{t}_{2i}+r^{s+t}_{\setminus i}(\varepsilon,A),

where $r^{s+t}_{\setminus i}(\varepsilon,A)$ is a non-zero polynomial in $\mathbb{R}[\mathcal{M}_{\setminus(i,\dots,i)}^{(s+t)}(\mathcal{G}_{B}),a_{l,j}\>:\>l\in[2],j\in[p]]$ . Notice that for the first equality we used that $(A\varepsilon)_{1}$ and $(A\varepsilon)_{2}$ are independent. This implies that we can write

\mathcal{C}^{(s+t)}(\varepsilon)_{i,\dots,i}=\phi^{(s+t)}(\varepsilon)_{(i,\dots,i)}=-{r^{s+t}_{\setminus i}\left(\phi^{(s+t)}(\varepsilon)_{\setminus(i,\dots,i)},A\right)}/{a_{1i}^{s}\cdot a^{t}_{2i}}.

(B.2)

We can define a rational map $\psi^{s,t}_{i}:\mathcal{M}_{\setminus(i,\dots,i)}^{(s+t)}(\mathcal{G}_{B})\times\mathbb{R}^{2p}\to\mathcal{M}^{(s+t)}(\mathcal{G}_{B})$ in the following way

\psi^{s,t}_{i}(\mathcal{C}^{s+t},A)_{i_{1},\dots,i_{s+t}}:=\begin{cases}{-r^{s+t}_{\setminus i}(\mathcal{C}^{s+t}_{\setminus(i,\dots,i)},A)}/{a_{1i}^{s}\cdot a^{t}_{2i}},&\text{if }(i_{1},\dots,i_{s+t})=(i,\dots,i),\\ \mathcal{C}^{s+t}_{i_{1},\dots,i_{s+t}},&\text{otherwise},\end{cases}

see, e.g., Cox et al. (2015, §5) for the definition of rational map.

What Eq. B.2 shows is that

\phi^{(s+t)}(\mathcal{S}_{i}(\mathcal{G}_{B}))\subseteq\psi^{s,t}_{i}(\mathcal{M}_{\setminus(i,\dots,i)}^{(s+t)}(\mathcal{G}_{B})\times\mathbb{R}^{2p})\subseteq\mathcal{M}^{(s+t)}(\mathcal{G}_{B}).

Let us consider $\psi^{\leq k}_{i}:\mathcal{M}_{\setminus(i,\dots,i)}^{\leq k}(\mathcal{G}_{B})\times\mathbb{R}^{2p}\to\mathcal{M}^{\leq k}(\mathcal{G}_{B})$ whose component of order $k_{0}$ is given by $\psi^{k_{0}-1,1}_{i}$ , i.e., $\pi_{k_{0}}\circ\psi^{\leq k}_{i}=\psi^{k_{0}-1,1}_{i}$ on the coordinates of order $k_{0}$ , where $\pi_{k_{0}}$ is the projection of $\mathcal{M}^{\leq k}$ onto $\mathcal{M}^{(k_{0})}$ , for every $k_{0}\leq k$ . Again, we have $\phi^{\leq k}(\mathcal{S}_{i}(\mathcal{G}_{B}))\subseteq\psi^{\leq k}_{i}(\mathcal{M}_{\setminus(i,\dots,i)}^{\leq k}(\mathcal{G}_{B})\times\mathbb{R}^{2p})\subseteq\mathcal{M}^{\leq k}(\mathcal{G}_{B})$ , which concludes this case by noticing that

$\displaystyle\dim(\phi^{\leq k}(\mathcal{S}_{i}(\mathcal{G}_{B})))\leq\dim\big(\psi^{\leq k}_{i}(\mathcal{M}_{\setminus(i,\dots,i)}^{\leq k}(\mathcal{G}_{B})\times\mathbb{R}^{2p})\big)$
$\displaystyle\leq\dim(\mathbb{R}^{2p})+\dim(\mathcal{M}_{\setminus(i,\dots,i)}^{\leq k}(\mathcal{G}_{B}))\leq 2p+\dim(\mathcal{M}^{\leq k}(\mathcal{G}_{B}))-(k-1),$

which is strictly smaller than $\dim(\mathcal{M}^{\leq k}(\mathcal{G}_{B}))$ if $k\geq 2(p+1)$ .
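The arithmetic behind the threshold on $k$ can be spelled out as follows. This is only a restatement of the bound above, under the reading that exactly one diagonal coordinate $(i,\dots,i)$ is dropped for each order $2\leq l\leq k$ , which accounts for the term $k-1$ .

```latex
\dim(\mathcal{M}_{\setminus(i,\dots,i)}^{\leq k}(\mathcal{G}_{B}))
  =\dim(\mathcal{M}^{\leq k}(\mathcal{G}_{B}))-(k-1),
\qquad
2p-(k-1)<0\iff k>2p+1\iff k\geq 2(p+1).
```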

In order to prove the result for $\mathcal{S}_{i\leftrightarrow{}j}(\mathcal{G}_{B})$ , we first notice that we can always write

\mathcal{S}_{i\leftrightarrow j}(\mathcal{G}_{B})=\left(\mathcal{S}_{i\leftrightarrow j}(\mathcal{G}_{B})\cap\bigcup_{i\in[p]}\mathcal{S}_{i}(\mathcal{G}_{B})\right)\dot{\cup}\left(\mathcal{S}_{i\leftrightarrow j}(\mathcal{G}_{B})\setminus\bigcup_{i\in[p]}\mathcal{S}_{i}(\mathcal{G}_{B})\right).

Since we have already bounded the dimension of $\mathcal{S}_{i\leftrightarrow j}(\mathcal{G}_{B})\cap\bigcup_{i\in[p]}\mathcal{S}_{i}(\mathcal{G}_{B})$ , to conclude the proof we only need to bound the dimension of

\tilde{\mathcal{S}}_{i\leftrightarrow j}(\mathcal{G}_{B}):=\mathcal{S}_{i\leftrightarrow j}(\mathcal{G}_{B})\setminus\bigcup_{i\in[p]}\mathcal{S}_{i}(\mathcal{G}_{B}).

For every $\varepsilon\in\tilde{\mathcal{S}}_{i\leftrightarrow j}(\mathcal{G}_{B})$ , every $A\in\kappa_{i\leftrightarrow j}(\varepsilon)$ , and every $2\leq k\in\mathbb{N}$ we can use Lemma B.1 to write

0=\mathcal{C}^{(k)}(A\cdot\varepsilon)_{1,\dots,1,2}=\mathcal{C}^{(k)}(\varepsilon)_{i,\dots,i,j}a_{1i}^{k-1}\cdot a_{2j}+r^{k}_{\setminus i\leftrightarrow j}(\varepsilon,A),

where, to simplify the formula given in Lemma B.1, we used that $a_{1i}\cdot a_{2i}=0$ for every $i$ , which is a consequence of $\varepsilon\notin\cup_{i\in[p]}\mathcal{S}_{i}(\mathcal{G}_{B})$ . This allows us to write

\mathcal{C}^{(k)}(\varepsilon)_{i,\dots,i,j}=\phi^{k}(\varepsilon)_{(i,\dots,i,j)}=-{r^{k}_{\setminus i\leftrightarrow j}\left(\phi^{k}(\varepsilon)_{\setminus(i,\dots,i,j)},A\right)}/{a_{1i}^{k-1}\cdot a_{2j}}.

The rest of the proof follows verbatim the case of $\mathcal{S}_{i}(\mathcal{G}_{B})$ . ∎

B.3 Proofs for Section 7

Lemma B.2.

Let $\mathcal{G}=(V,E_{\rightarrow},E_{\leftrightarrow})$ be a mixed graph. Assume the vertex set can be partitioned as $V=C_{1}\dot{\cup}\cdots\dot{\cup}C_{n}$ , with $C_{i}$ being a $k_{i}$ -cycle, and $\mathop{\rm pa}\nolimits(C_{i})\subseteq\bigcup_{j=1}^{i}C_{j}$ , where $\dot{\cup}$ denotes the union of disjoint sets. Then, $\mathcal{G}$ is generically identifiable if and only if $\lambda_{C_{i},v}$ is identifiable for every $i\in[n]$ and $v\in C_{i}$ , and the graphical criterion in Theorem 7.2 is satisfied.

Proof.

If the matrix $\Lambda$ is identifiable, then by definition, all of its columns are also identifiable, and from Theorem 7.2, we know that the graphical condition is satisfied. We now prove that the reverse implication is also true.

By plugging in $\lambda_{C_{i},v}$ instead of $\tilde{\lambda}_{C_{i},v}$ in Eq. 3.2, one can see that the matrix $A$ has the following shape

\begin{bmatrix}I_{k_{1}}&0&\cdots&0\\ A_{C_{2},C_{1}}&I_{k_{2}}&\cdots&0\\ \vdots&\vdots&\ddots&\vdots\\ A_{C_{n},C_{1}}&A_{C_{n},C_{2}}&\cdots&I_{k_{n}}\\ \end{bmatrix}.

In particular, we have $a_{v,v}=1$ for every $v\in V$ . The same proof as in Lemma 3.2 applies. ∎

Theorem B.3 (Theorem 7.4).

Let $\mathcal{G}=(V,E_{\rightarrow},E_{\leftrightarrow}=\emptyset)$ be a directed graph such that $V=C_{1}\dot{\cup}\cdots\dot{\cup}C_{n}$ , with $C_{i}$ being a $k_{i}$ -cycle, and $\mathop{\rm pa}\nolimits(C_{i})\subseteq\bigcup_{j=1}^{i}C_{j}$ . Then $\mathcal{G}$ is generically identifiable if and only if for every cycle $C=\{v_{1},v_{2}\}$ of size $2$ , we have $\mathop{\rm pa}\nolimits(C)\setminus C=\mathop{\rm pa}\nolimits(v_{i})\setminus C$ for $i\in\{1,2\}$ .

Proof of Theorem 7.4.

We know from Lemma 7.3 that if $k_{i}\neq 2$ then $\lambda_{C_{i},v}$ is identifiable. If the set $S=\{i\in[n]\>:\>k_{i}=2\}$ is empty then we know from Lemma B.2 and the fact that $E_{\leftrightarrow{}}=\emptyset$ that $\Lambda$ is identifiable. Otherwise, let $m=\min S$ and $C_{m}=\{v_{1},v_{2}\}$ .

We know from Example 7.2 that we can choose $(\tilde{\lambda}_{v_{1}v_{2}},\tilde{\lambda}_{v_{2}v_{1}})=({b_{v_{2}v_{2}}}/{b_{v_{1}v_{2}}},{b_{v_{1}v_{1}}}/{b_{v_{2}v_{1}}})$ . If $m=1$ , letting $\tilde{\lambda}_{u,v}=\lambda_{u,v}$ for $v\notin C_{1}$ , the matrix $A$ of Eq. 5.4 will have the following shape

\begin{bmatrix}0&\det((B_{\Lambda})_{\{v_{1},v_{2}\},\{v_{1},v_{2}\}})/b_{v_{2}v_{1}}&0&\cdots&0\\ \det((B_{\Lambda})_{\{v_{1},v_{2}\},\{v_{1},v_{2}\}})/b_{v_{1}v_{2}}&0&0&\cdots&0\\ 0&0&I_{k_{2}}&\cdots&0\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ 0&0&0&\cdots&I_{k_{n}}\\ \end{bmatrix},

which satisfies all the constraints imposed by 1, proving that $\Lambda$ is not identifiable in this case.

If $m>1$ , we know that $\lambda_{C_{i},v}$ is identifiable for every $i<m$ , hence the matrix $A$ will be as follows

\begin{bmatrix}I_{k_{1}}&0&\cdots&\cdots&\cdots&\cdots&0\\ 0&I_{k_{2}}&\cdots&\cdots&\cdots&\cdots&0\\ \vdots&\vdots&\ddots&\ddots&\ddots&\vdots&\vdots\\ A_{v_{1},C_{1}}&A_{v_{1},C_{2}}&\cdots&0&\det((B_{\Lambda})_{\{v_{1},v_{2}\},\{v_{1},v_{2}\}})/b_{v_{2}v_{1}}&\cdots&0\\ A_{v_{2},C_{1}}&A_{v_{2},C_{2}}&\cdots&\det((B_{\Lambda})_{\{v_{1},v_{2}\},\{v_{1},v_{2}\}})/b_{v_{1}v_{2}}&0&\cdots&0\\ \vdots&\vdots&\vdots&\ddots&\ddots&\ddots&\vdots\\ A_{C_{n},C_{1}}&A_{C_{n},C_{2}}&\cdots&\cdots&\cdots&\cdots&A_{C_{n},C_{n}}\end{bmatrix}.

This implies that in order for the matrix $A$ to satisfy the conditions in 1 for every pair of nodes, we must have $A_{C_{m},\mathop{\rm an}\nolimits(C_{m})\setminus C_{m}}=0$ . This can happen if and only if $A_{C_{m},\mathop{\rm pa}\nolimits^{*}(C_{m})}=0$ , where $\mathop{\rm pa}\nolimits^{*}(C_{m})=\mathop{\rm pa}\nolimits(C_{m})\setminus C_{m}$ . Writing $A_{v_{1},\mathop{\rm pa}\nolimits^{*}(C_{m})}=0$ explicitly, we arrive at the following linear system

(B_{\Lambda})_{\mathop{\rm pa}\nolimits(v_{1})\setminus\{v_{2}\},\mathop{\rm pa}\nolimits^{*}(C_{m})}^{T}\cdot\tilde{\lambda}_{\mathop{\rm pa}\nolimits(v_{1})\setminus\{v_{2}\},v_{1}}=(B_{\Lambda})_{v_{1},\mathop{\rm pa}\nolimits^{*}(C_{m})}^{T}-\frac{1}{\lambda_{v_{2}v_{1}}}(B_{\Lambda})_{v_{2},\mathop{\rm pa}\nolimits^{*}(C_{m})}^{T}.

(B.3)

We know that the system

(B_{\Lambda})_{\mathop{\rm pa}\nolimits(v_{1})\setminus\{v_{2}\},\mathop{\rm pa}\nolimits^{*}(C_{m})}^{T}\cdot\tilde{\lambda}_{\mathop{\rm pa}\nolimits(v_{1})\setminus\{v_{2}\},v_{1}}=(B_{\Lambda})_{v_{1},\mathop{\rm pa}\nolimits^{*}(C_{m})}^{T}

always has a solution given by $\lambda_{\mathop{\rm pa}\nolimits(v_{1})\setminus\{v_{2}\},v_{1}}$ . Hence, the system in Eq. B.3 has a solution if and only if the system

(B_{\Lambda})_{\mathop{\rm pa}\nolimits(v_{1})\setminus\{v_{2}\},\mathop{\rm pa}\nolimits^{*}(C_{m})}^{T}\cdot\tilde{\lambda}_{\mathop{\rm pa}\nolimits(v_{1})\setminus\{v_{2}\},v_{1}}=(B_{\Lambda})_{v_{2},\mathop{\rm pa}\nolimits^{*}(C_{m})}^{T}

(B.4)

has one. Using $B_{\Lambda}=(I-\Lambda)^{-T}$ and $\lambda_{v_{1},\mathop{\rm pa}\nolimits^{*}(C_{m})}=0$ , we can write

(B_{\Lambda})_{v_{2},\mathop{\rm pa}\nolimits^{*}(C_{m})}^{T}=(B_{\Lambda})_{\mathop{\rm pa}\nolimits(v_{2})\setminus\{v_{1}\},\mathop{\rm pa}\nolimits^{*}(C_{m})}^{T}\cdot\lambda_{\mathop{\rm pa}\nolimits(v_{2})\setminus\{v_{1}\},v_{2}}.

This implies that the system in Eq. B.4 has solutions for a generic choice of $\lambda_{\mathop{\rm pa}\nolimits(v_{2})\setminus\{v_{1}\},v_{2}}$ if and only if the row space of $(B_{\Lambda})_{\mathop{\rm pa}\nolimits(v_{1})\setminus\{v_{2}\},\mathop{\rm pa}\nolimits^{*}(C_{m})}$ contains the row space of $(B_{\Lambda})_{\mathop{\rm pa}\nolimits(v_{2})\setminus\{v_{1}\},\mathop{\rm pa}\nolimits^{*}(C_{m})}$ . That is, if

\operatorname{rank}((B_{\Lambda})_{\mathop{\rm pa}\nolimits(v_{1})\setminus\{v_{2}\},\mathop{\rm pa}\nolimits^{*}(C_{m})})=\operatorname{rank}((B_{\Lambda})_{\mathop{\rm pa}\nolimits^{*}(C_{m}),\mathop{\rm pa}\nolimits^{*}(C_{m})}).

From Lemma A.2, one can see that this is possible if and only if the graphical condition of the theorem is satisfied. ∎

B.4 Proofs for Section 8.2

Proof of Lemma 8.1.

Let us denote the value of the objective function in the optimization problem of Eq. 8.1 for a matrix $\tilde{\Lambda}\in\mathbb{R}^{\mathcal{G}_{D}}$ by $O(\tilde{\Lambda})$ . By definition of the map $\Phi_{\mathcal{G}}$ we have $(I-\Lambda)^{T}\cdot X=\varepsilon$ , which implies $O(\Lambda)=0$ . Hence, $\tilde{\Lambda}$ minimizes Eq. 8.1 if and only if $O(\tilde{\Lambda})=0$ , that is, if and only if

\tilde{\varepsilon}=(I-\tilde{\Lambda})^{T}\cdot X=(I-\tilde{\Lambda})^{T}B_{\Lambda}\cdot\varepsilon=A\cdot\varepsilon\in\mathcal{M}(\mathcal{G}_{B}),

and we know from Lemma 3.2, that this is the case if and only if $\tilde{\Lambda}$ satisfies Eq. 3.3. ∎

Appendix C Details for Experiments

C.1 Data Generation

Identification.

For fixed $p$ and $e$ , the ADMGs for the experiments in Section 8.1 are generated as follows:

1. sample a random integer $e_{d}$ in $\{1,\dots,e\}$ ,
2. let $\mathcal{G}_{D}$ be a randomly generated DAG with $p$ nodes and $e_{d}$ edges,
3. let $\mathcal{G}_{B}$ be a randomly generated undirected graph with $p$ nodes and $e-e_{d}$ edges,
4. define $\mathcal{G}$ as $([p],\mathcal{G}_{D},\mathcal{G}_{B})$ .
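The four steps above can be sketched as follows. The source does not specify how the random DAG and the random undirected graph are drawn, so the sampling scheme here (uniform sampling of node pairs, oriented along a random node order) and the function name `random_admg` are assumptions made purely for illustration. The sketch also assumes $e\leq\binom{p}{2}$ and allows a directed and a bidirected edge between the same pair of nodes.

```python
import itertools
import random

def random_admg(p, e, seed=None):
    """Sample (directed_edges, bidirected_edges) on nodes 0..p-1 with e edges in total."""
    rng = random.Random(seed)
    pairs = list(itertools.combinations(range(p), 2))
    # Step 1: e_d uniform in {1, ..., e}.
    e_d = rng.randint(1, e)
    # Step 2: orient each sampled pair along a random node order, so the result is acyclic.
    order = list(range(p))
    rng.shuffle(order)
    directed = [(order[a], order[b]) for a, b in rng.sample(pairs, e_d)]
    # Step 3: bidirected part, stored as unordered pairs (a, b) with a < b.
    bidirected = rng.sample(pairs, e - e_d)
    # Step 4: the ADMG is ([p], G_D, G_B).
    return directed, bidirected
```

For example, `random_admg(6, 8, seed=0)` returns two edge lists whose lengths sum to 8, with the directed part acyclic by construction.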

Estimation.

The data for the experiments in Section 8.2 are generated as follows:

1. for every $v\in V$ , we sample $\eta_{v}$ from a Laplace distribution with mean zero and standard deviation $s_{v}\sim\text{U}(0.2,3)$ ,
2. for every $u\xleftrightarrow{}v\in\mathcal{G}_{B}$ , we sample two independent random vectors $\eta^{1}_{uv},\eta^{2}_{uv}$ , again from Laplace distributions with mean zero and standard deviations $s^{1}_{uv},s^{2}_{uv}\sim\text{U}(0.2,3)$ ,
3. for every $v\in V$ , we set $\varepsilon_{v}=\eta_{v}+\sum_{u\xleftrightarrow{}v\in\mathcal{G}_{B}}(w^{v,1}_{uv}\eta^{1}_{uv}+w^{v,2}_{uv}\eta^{2}_{uv})$ , where $w^{v,1}_{uv},w^{v,2}_{uv}\sim\text{U}(-5,5)$ ,
4. for every $u\to v\in\mathcal{G}$ , we draw $\lambda_{uv}\sim\text{U}(-5,5)$ , and set $X_{v}=\sum_{u\in\mathop{\rm pa}\nolimits(v)}\lambda_{uv}X_{u}+\varepsilon_{v}$ .
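A minimal sketch of this data-generating process is given below, assuming the graph is passed as two edge lists over nodes $0,\dots,p-1$ and that `Lam[u, v]` stores $\lambda_{uv}$ ; the function name `simulate` and its signature are hypothetical. A Laplace distribution with standard deviation $s$ has scale parameter $s/\sqrt{2}$ , which is what the helper uses.

```python
import numpy as np

def simulate(directed, bidirected, p, n, seed=0):
    """Draw n samples of X following steps 1-4; returns (X, eps, Lam)."""
    rng = np.random.default_rng(seed)

    def laplace(size):
        s = rng.uniform(0.2, 3)                         # standard deviation s ~ U(0.2, 3)
        return rng.laplace(0.0, s / np.sqrt(2), size)   # Laplace scale b = s / sqrt(2)

    # Step 1: independent components eta_v, one column per node.
    eps = np.column_stack([laplace(n) for _ in range(p)])
    # Steps 2-3: each bidirected edge adds two shared sources to both endpoints.
    for u, v in bidirected:
        eta1, eta2 = laplace(n), laplace(n)
        for node in (u, v):
            w1, w2 = rng.uniform(-5, 5, size=2)
            eps[:, node] += w1 * eta1 + w2 * eta2
    # Step 4: direct effects lambda_uv ~ U(-5, 5).
    Lam = np.zeros((p, p))
    for u, v in directed:
        Lam[u, v] = rng.uniform(-5, 5)
    # X_v = sum_u lambda_uv X_u + eps_v, i.e. X = eps (I - Lam)^{-1} with samples as rows.
    X = eps @ np.linalg.inv(np.eye(p) - Lam)
    return X, eps, Lam
```

Calling `simulate([(0, 1), (1, 2)], [(0, 2)], p=3, n=500)` yields a sample matrix `X` of shape `(500, 3)` that satisfies the structural equations row by row.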

C.2 Additional Experiments

Fig. 13 shows the performance of our methods when using the RBF kernel, with bandwidth computed using the median heuristic. We see that compared to the polynomial kernel, this choice seems to suffer more from the non-convexity of the objective function. In contrast, it provides a better estimate when initialized at the true parameter value.
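For reference, one common form of the median heuristic sets the RBF bandwidth to the median pairwise distance of the sample. The sketch below illustrates that convention only; conventions differ across implementations (e.g., in whether squared distances or a factor of 2 are used), the exact variant used in the experiments is not stated here, and the helper name `rbf_gram` is hypothetical.

```python
import numpy as np

def rbf_gram(x):
    """RBF Gram matrix exp(-||x_i - x_j||^2 / (2 sigma^2)) with median-heuristic sigma."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    # Squared Euclidean distances between all pairs of samples.
    sq = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    # Median of the pairwise distances over distinct pairs.
    dists = np.sqrt(sq[np.triu_indices(len(x), k=1)])
    sigma = np.median(dists)
    return np.exp(-sq / (2 * sigma**2)), sigma
```

For the sample `[0.0, 1.0, 3.0]` the pairwise distances are 1, 3 and 2, so the bandwidth is 2.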

The results for the same data generating process, but using a uniform distribution for the error terms, are shown in Fig. 14.