Parameter identification in linear non-Gaussian causal models under general confounding
Abstract
Linear non-Gaussian causal models postulate that each random variable is a linear function of parent variables and non-Gaussian exogenous error terms. We study identification of the linear coefficients when such models contain latent variables. Our focus is on the commonly studied acyclic setting, where each model corresponds to a directed acyclic graph (DAG). For this case, prior literature has demonstrated that connections to overcomplete independent component analysis yield effective criteria to decide parameter identifiability in latent variable models. However, this connection is based on the assumption that the observed variables linearly depend on the latent variables. Departing from this assumption, we treat models that allow for arbitrary non-linear latent confounding. Our main result is a graphical criterion that is necessary and sufficient for deciding the generic identifiability of direct causal effects. Moreover, we provide an algorithmic implementation of the criterion with a run time that is polynomial in the number of observed variables. Finally, we report on estimation heuristics based on the identification result, explore a generalization to models with feedback loops, and provide new results on the identifiability of the causal graph.
Keywords: Causal effect, graphical model, independent component analysis, latent variable model, structural causal model
1 Introduction
Graphical models, directed or undirected, are ubiquitous in modern statistical practice for modeling multivariate distributions; see, e.g., Lauritzen (1996); Maathuis et al. (2019). In particular, structural equation models (SEMs) associated with directed acyclic graphs (DAGs) provide a concise and effective way of stating the additional assumptions necessary to identify the causal parameters of interest (Pearl, 2009; Spirtes et al., 2000). As argued in Pearl (2017), understanding a causal phenomenon for linear SEMs is often a necessary step towards a generalized understanding of the same in a nonparametric framework.
The specific framework we focus on in this paper is that of linear non-Gaussian models that allow for general, possibly non-linear, confounding. Each such model is naturally associated to an acyclic directed mixed graph (ADMG); compare Richardson et al. (2023) and references therein. Let be a collection of observed random variables, and let be an ADMG whose vertices index the random variables and which contains two types of edge sets . The edges in are directed, and we depict them by . Those in are bidirected and depicted by . The model encoded by posits that
(1.1) |
where possible latent confounding is subsumed in the error variables , which are allowed to be dependent in accordance with the bidirected edges in . In particular, if , then the two errors and may be (arbitrarily) dependent.
Example 1.1.
In this paper, we treat the problem of deciding which of the direct causal effects in (1.1) can be identified from the joint distribution of the vector of observed variables . Our main results give a complete graphical characterization in terms of the ADMG , an efficient algorithm to check the resulting identifiability criterion, and a simple practical estimation method that is based on empirical measures of dependence among estimates of the errors . We should highlight that our characterization targets generic identifiability, which is the notion most suitable for problems such as the instrumental variable model from Example 1.1. There, the key coefficient of interest is identified as a ratio of covariances, , but only if the denominator is nonzero, which requires the genericity constraint that . As part of our results, we also develop a framework for formulating genericity conditions for the infinite-dimensional set of non-Gaussian error distributions, which we justify via cumulants truncated at arbitrary order.
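As a minimal numerical sketch of the ratio-of-covariances identification just mentioned, the following simulates an instrumental variable model with a hidden confounder that enters both errors non-linearly. The variable names (`z`, `t`, `y`), the coefficient values, and the confounding mechanism are illustrative assumptions and are not taken from Example 1.1.

```python
import numpy as np

def simulate_iv(n, a, b, rng):
    """Draw n samples from Z -> T -> Y, with T and Y confounded by a hidden H."""
    z = rng.uniform(-1.0, 1.0, size=n)                 # instrument (non-Gaussian)
    h = rng.exponential(1.0, size=n)                   # hidden confounder
    eps_t = h + rng.uniform(-1.0, 1.0, size=n)         # error of T, depends on H
    eps_y = np.exp(-h) + rng.laplace(0.0, 1.0, size=n) # error of Y, non-linear in H
    t = a * z + eps_t
    y = b * t + eps_y
    return z, t, y

def iv_ratio(z, t, y):
    """Ratio of covariances cov(Z, Y) / cov(Z, T)."""
    c = np.cov(np.vstack([z, t, y]))
    return c[0, 2] / c[0, 1]

rng = np.random.default_rng(0)
a_true, b_true = 1.5, -0.7          # a_true != 0 is the genericity condition
z, t, y = simulate_iv(100_000, a_true, b_true, rng)

b_iv = iv_ratio(z, t, y)                          # consistent for b_true
b_ols = np.cov(t, y)[0, 1] / np.var(t, ddof=1)    # biased by the confounder
print("IV estimate: ", b_iv)
print("OLS estimate:", b_ols)
```

If `a_true` is set to zero, the denominator `cov(Z, T)` vanishes in the population and the ratio becomes unstable, which is the non-generic case excluded in the text.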
1.1 Related Work
For fully nonparametric SEMs, the ID algorithm (Shpitser and Pearl, 2006; Kivva et al., 2022; Shpitser, 2023; Kivva et al., 2023a) is sound and complete for determining global identifiability in a given ADMG. In contrast to the generic setting treated in this paper, global identifiability requires identifiability under every single distribution in the model. It follows from the work of Drton et al. (2011) that the graphical criterion underpinning the ID algorithm also applies to global identifiability within linear Gaussian models. However, the graphical prerequisites for achieving global identification are frequently overly restrictive. For example, any ADMG containing a bow (a pair of nodes, , such that ) would fail to meet the criteria for global identifiability, thus overlooking significant scenarios such as the IV model illustrated in Fig. 1. Consequently, when studying linear SEMs, researchers have shifted their focus towards generic identifiability results, for which much progress has been made recently, but for which a complete characterization is still lacking; see, e.g., Kumor et al. (2020); Barber et al. (2022).
Non-Gaussianity of the error term has been extensively employed to achieve identifiability of the graphical structure of causal models; see Shimizu (2022) for a recent account. In contrast, its application to causal effect identification has received little attention (Salehkaleybar et al., 2020; Kivva et al., 2023b; Shuai et al., 2023). All the works just mentioned explicitly model the confounding as linear. This approach, on the one hand, allows one to draw on the vast literature on overcomplete independent component analysis (OICA) in order to obtain stronger identifiability results (Eriksson and Koivunen, 2004). On the other hand, however, it restricts the possible confounding structures. Moreover, as opposed to ICA in the fully observed case, overcomplete ICA is not separable (Eriksson and Koivunen, 2004, Thm. 4), implying that the only algorithms for solving OICA that come with theoretical guarantees require making parametric distributional assumptions and using ad-hoc EM-type algorithms to solve the optimization problem; see, e.g., Lewicki and Sejnowski (2000).
The work most similar to ours in terms of distributional assumptions is that of Wang and Drton (2023), which focuses on structural identifiability. Wang and Drton (2023) note that bow-free acyclic graphs are identifiable from observational data and provide an estimation algorithm for such graphs. Liu et al. (2021) extended the algorithm to learn graphs with multi-directed edges.
1.2 Organization of the Paper
The rest of the paper is organized as follows. Section 1.3 contains standard graphical model notation used in the rest of the paper. In Section 2, we formally define the identifiability problem we study. Section 3 contains the main results of our work; we provide a necessary and sufficient graphical condition for generic identifiability in the model under study. In Section 4, we prove that our criterion can be certified in polynomial time in the size of the graph. Section 5 contains a detailed analysis of the genericity assumption. In Section 6, we apply our results from Section 3 to the identifiability of the causal graph, providing new insights on the model equivalence of two ADMGs. In Section 7, we provide partial results about the identification for cyclic models. In Section 8, we note that when the identification criterion is met, the parameters can be estimated as the solution to a suitable optimization problem, and we present a simulation study to assess the performance of the estimation method. In Section 9, we draw final conclusions and suggest future research directions. Appendix A contains further preliminary material and details of the proofs.
1.3 Notation
A mixed graph is a triple , where . We depict the pairs in by and the ones in by ; we refer to them as directed and bidirected edges, respectively. For an example, see Fig. 2.
Let be two vertices in . A directed path from to is a sequence of nodes such that for all . This includes the case , where and the path has no edges; we call such a path trivial. We denote by the set of all directed paths from to .
A directed cycle is a non-trivial directed path from a node to itself. The graph is acyclic if it contains no directed cycles; we refer to this class of graphs as acyclic directed mixed graphs (ADMGs). Fig. 2 shows one instance. If the graph is acyclic, we can define a causal order on the nodes of , that is, a total order on such that whenever . When considering parameter matrices associated to , we will typically fix a causal order and assume that the vertices in are enumerated as with ; compare Fig. 2.
We will consider the following genealogical relations that are commonly used to indicate relationships between the vertices of an ADMG (parents, ancestors, children, descendants, siblings):
Note that and , via trivial paths. For a subset of vertices , we define and make the analogous convention for the other relations.
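The genealogical relations can be computed directly from the two edge sets. The sketch below is one way to do so; the function name `relations`, the edge-list encoding, and the small example graph are assumptions made for illustration. In line with the convention stated above, each node is counted among its own ancestors and descendants via the trivial path; whether a node is counted among its own siblings is not assumed here.

```python
def relations(vertices, directed, bidirected):
    """Return dicts pa, ch, an, de, sib for a mixed graph.

    directed:   iterable of pairs (i, j) encoding i -> j
    bidirected: iterable of pairs (i, j) encoding i <-> j
    an[v] and de[v] contain v itself (trivial path).
    """
    pa = {v: set() for v in vertices}
    ch = {v: set() for v in vertices}
    sib = {v: set() for v in vertices}
    for i, j in directed:
        pa[j].add(i)
        ch[i].add(j)
    for i, j in bidirected:
        sib[i].add(j)
        sib[j].add(i)

    def closure(v, step):
        # All nodes reachable from v by repeatedly applying `step`.
        seen, stack = {v}, [v]
        while stack:
            u = stack.pop()
            for w in step[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        return seen

    an = {v: closure(v, pa) for v in vertices}
    de = {v: closure(v, ch) for v in vertices}
    return pa, ch, an, de, sib

# Hypothetical instrumental variable graph: 1 -> 2 -> 3 with 2 <-> 3.
pa, ch, an, de, sib = relations([1, 2, 3], [(1, 2), (2, 3)], [(2, 3)])
print(an[3], de[1], sib[2])
```

Relations for a set of vertices are then obtained as unions, for example `set().union(*(an[v] for v in A))`.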
Let and be two subsets of that have the same cardinality , and for which we have fixed an ordering of their elements. Let be the symmetric group on . We say that is a system of paths between and , if there exists a permutation such that for every . We denote the set of all such systems by . A system is called non-intersecting if for . The set of all non-intersecting systems in is denoted by ; see Fig. 3 for an example.
When connecting a graph to a statistical model, we will introduce a matrix of parameters whose entries act as weights on the directed edges. We will write for the set of real matrices such that if . When is acyclic, as we assume throughout this work, the matrix is invertible for all ; here, denotes the identity matrix. Indeed, when the nodes of are ordered according to a causal order, is lower triangular with all ones on the diagonal and . We define . Later in Section 7, we briefly discuss the identifiability problem in cyclic graphs, where the invertibility of becomes a modeling assumption.
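The invertibility claim can be checked numerically on a small example. The sketch below assumes a particular indexing convention that is not confirmed by the text: `Lam[i, j]` is the weight of the edge i -> j, nodes are numbered in a causal order, and the matrix under consideration is the identity minus the transpose of `Lam`. The graph and the weights are hypothetical.

```python
import numpy as np

p = 4
Lam = np.zeros((p, p))
# Hypothetical DAG on nodes 0, 1, 2, 3 (already in causal order).
Lam[0, 1] = 0.8
Lam[0, 2] = 0.5
Lam[1, 2] = -1.2
Lam[2, 3] = 2.0

M = np.eye(p) - Lam.T            # lower triangular with unit diagonal
det = np.linalg.det(M)           # equals 1 up to rounding
inv = np.linalg.inv(M)

# Lam.T is nilpotent for an acyclic graph, so the inverse is a finite
# Neumann series: I + Lam.T + (Lam.T)^2 + ... + (Lam.T)^(p-1).
series = sum(np.linalg.matrix_power(Lam.T, k) for k in range(p))

# Entry [3, 0] of the inverse sums the edge-weight products over all
# directed paths from node 0 to node 3: 0->1->2->3 and 0->2->3.
path_sum = Lam[0, 1] * Lam[1, 2] * Lam[2, 3] + Lam[0, 2] * Lam[2, 3]
print(det, inv[3, 0], path_sum)
```

Under this convention the inverse acts as a path matrix for the directed part of the graph, which is the property used later in the proof of Lemma 3.1.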
Finally, let and be subsets of the row and column sets of a matrix , respectively. We denote the submatrix containing only the rows in and the columns in as .
2 Linear Mixed Graph Models and Identifiability
Let be an ADMG, and let and be its directed and bidirected subgraphs, respectively. We say that a subset is connected in the bidirected part if every pair of vertices is joined by a path in , where every vertex on the path is in . On a fixed probability space, let be a random vector taking values in and satisfying the connected set Markov property with respect to (Richardson, 2003; Drton and Richardson, 2008), that is,
We denote the set of all such random vectors by . Note that the connected set Markov property implies but is generally stronger than requiring that for non-adjacent in .
Definition 2.1.
The linear structural equation model corresponding to a mixed graph is the set of all -variate real random vectors (on our fixed probability space) that solve the equation system
for a choice of and . The model is thus parametrized by the map
Example 2.1.
Let be the ADMG from Fig. 2. The set contains all the random vectors such that and . The space is comprised of all matrices of the shape:
Accordingly, we have
In this paper, we are concerned with parameter identifiability. In other words, we ask under which conditions on , the distribution of uniquely determines entries of the coefficient matrix in the representation . While we will not emphasize this in the sequel, the unique determination of all entries of also entails unique recovery of the distribution of .
As noted earlier, our interest is in a generic notion of identifiability, so we ask:
Problem.
Under which graphical conditions on is a set of entries of the parameter matrix generically identifiable?
To detail the problem, we make the following definition and then firm up the involved notion of genericity.
Definition 2.2.
We define the fiber of an element with respect to as the set
(2.1) |
where denotes equality in distribution. Let be the projection of the set onto . A parameter given by a function is generically identifiable if any generic choice of yields a random vector for which it holds that
Requiring genericity of will mean that we exclude a fixed Lebesgue null set of . For instance, in the instrumental variable (IV) example depicted in Fig. 1, the unknown coefficients are and . Coefficient is identifiable outside the null set given by , i.e., we exclude the case that the instrument () does not affect the exposure ().
While excluding null sets of a finite-dimensional space is a standard approach in related literature (Drton, 2018, §9), speaking of a generic choice of requires clarification as Definition 2.1 is nonparametric with respect to the distribution of the errors . Indeed, our genericity concept has a very specific meaning, namely, that the distribution of satisfies the following assumption.
Assumption 1.
Let . For every two vectors , it holds that implies that whenever or .
Our genericity assumption is natural in view of the Darmois-Skitovich theorem. Indeed, this theorem amounts to exactly the statement that if is the empty graph (i.e., has no edges), then Assumption 1 holds for every random vector that has at most one normally distributed coordinate. To further justify our assumption, we present in Section 5 a detailed study of two different classes of submodels for which we show that indeed only a lower-dimensional set of distributions is excluded by our assumption. One class of submodels is built by assuming the existence of moments up to an arbitrary but fixed order. The other class is built by assuming linearity of confounding.
For the remainder of this work, whenever we use the term generic, it is implied that the result holds for any matrix outside of a fixed Lebesgue measure zero subset of and for any that satisfies Assumption 1.
3 Necessary and Sufficient Conditions for Generic Identifiability of Direct Causal Effects
Let be an ADMG, and let be a random vector in the model , i.e.,
Suppose can be generated using another pair . From the definition of the fiber in Eq. 2.1, one can see that
(3.1) |
The next result shows that the entries of matrix can be fully specified as a function of both and through the ancestral relations among the nodes of .
Lemma 3.1.
The entries of matrix defined in Eq. 3.1 can be written as
(3.2) |
In particular, we have if , and for every .
Proof.
Writing the product of matrices explicitly, we get
From the definition of , we know that if , while it holds that if , from which the claim follows. To see that if , note that is a path matrix for the directed part , as we detail in Lemma A.1 in the Appendix. ∎
Definition 3.1.
The set of removable ancestors of a node is defined as
Clearly, .
Example 3.1.
Consider the graph in Fig. 2. In this graph, the only strict ancestor of is , which has only as its sibling. Hence, . On the other hand, because belongs to both and .
Using the concept of removable ancestors, the next result introduces a linear system of equations whose solution space fully characterizes the parameter matrices in .
Lemma 3.2.
Let for a generic choice of parameters . The matrix belongs to if and only if it is a solution to the following linear system of equations:
(3.3) |
Proof.
We start by showing the direct implication. Eq. 3.1 shows that for every we can write , where denotes the rows of corresponding to nodes . Since , it holds that for every and, thus, Assumption 1 implies
(3.4) | |||||
(3.5) |
If , considering in Eq. 3.4 yields , where we used the fact that as a result of Lemma 3.1. Again, from Lemma 3.1, writing explicitly, we get
(3.6) |
Now, let , and let . Considering Eq. 3.5 with , and we get . Proceeding as above, this yields that leads to Eq. 3.6, as claimed.
For the reverse implication, consider such that each one of its column vectors is a solution of Eq. 3.3. Define , where the matrix is defined in Eq. 3.1. By the definition of , we have , so it remains to prove , that is, satisfies the connected set Markov property with respect to . Let . Then it is easy to see that
where is the set of non-removable ancestors of , so
Here, the last set inclusion comes from the definition of . Hence, to prove that satisfies the connected set Markov property, we need to show that whenever is a connected subset of . For this, it suffices to show that . We will argue that this is indeed the case for connected by showing i) is connected and ii) . The asserted result will then follow from the fact that satisfies the connected set Markov property with respect to .
i) To show that is connected consider . From the definition of , one can see that there are such that and . Because is connected, a bidirected path joining and exists over , and we can extend the path to suitably join and as well.
ii) Notice that . This is because if , either or there is such that which again implies that by the definition of . This implies that in order to prove , we only need to show . Suppose there exists , then there are and such that , and . By the definition of , this implies which is impossible as . This concludes the proof. ∎
Definition 3.2.
Let , and let . We define the -rank of as
(3.7) |
where denotes the power set of . Recall that is a set of non-intersecting systems of paths.
Notice that from Eq. 3.7 it is immediate that
The following theorem, which constitutes our main identifiability result, shows that this lower bound for is reached if and only if is generically identifiable. The theorem is based on characterizing the linear subspace of the solution set of Eq. 3.3, which describes based on the previous lemma.
Theorem 3.3.
Let , and let . The vector is generically identifiable if and only if , where is defined in Eq. 3.7.
Proof.
The vector is identifiable if and only if , for every . We know from Lemma 3.2 that is a solution of the linear system given in Eq. 3.3 for every such matrix . Hence, if we define
then is identifiable if and only if . By definition, is a linear subspace of , so the two are equal if and only if they have the same dimension.
We can write as the solution space of the following linear system
(3.8) |
where is the identity matrix, and is defined in Eq. 3.3. We know that the solution space of Eq. 3.8 is not empty since belongs to it. Hence, we have , which implies
From the definition of in Eq. 3.8 one can easily see that
Finally, we have
which concludes the proof by noticing that from Lemma A.2, we have is generically equal to for every subset of . ∎
Example 3.2.
Consider again the graph in Fig. 2, as in Example 3.1. We have , implying that the parameter is not identifiable. In contrast, , and there is a system of non-intersecting paths from to given by and . This implies that the vector is identifiable.
The following theorem characterizes the situations in which the whole matrix is identifiable.
Theorem 3.4.
The matrix is generically identifiable if and only if for every node , there is a subset of of size such that there is a system of non-intersecting paths from to .
Proof.
The matrix is identifiable if and only if all of its columns are, so we get the statement by applying Theorem 3.3 to each of the columns, with . ∎
Remark 3.1.
It is noteworthy that Assumption 1 is used only for proving the direct implication of Lemma 3.2. This implies that the necessity of the graphical condition in Theorem 3.4 also holds if the model is extended by no longer requiring Assumption 1 to hold.
Remark 3.2.
4 Certifying Identifiability
Verifying directly whether the condition of Theorem 3.3 is satisfied can be computationally challenging. Following the approach of Brito (2004) and Foygel et al. (2012), we now introduce an alternative approach that can verify the identifiability condition of Theorem 3.3 in polynomial time in the size of the graph via a maximum flow reformulation.
For the sake of completeness, we first revisit the definition of the maximum flow problem; further details are available in Cormen et al. (2009, §26). Subsequently, we introduce our reformulation.
The proofs of the results presented in this section can be found in Section B.1.
4.1 The Maximum Flow Problem
Let be a directed graph with source node and sink node . Let be a node capacity function, and let be an edge capacity function. A flow on is a function satisfying
(4.1) | ||||
The size of a flow is defined as
(4.2) |
The max-flow problem on is the problem of finding a flow whose size is maximum.
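A max-flow problem with both node and edge capacities can be reduced to the standard edge-capacitated problem by splitting every node into an "in" copy and an "out" copy joined by an edge that carries the node capacity. The sketch below applies this textbook reduction and then runs breadth-first augmenting paths (Edmonds-Karp). The function name, the toy graph, and the numeric capacities are assumptions chosen for illustration; with unit capacities on the internal nodes, the flow size counts vertex-disjoint directed paths from source to sink.

```python
from collections import defaultdict, deque

def max_flow_node_caps(nodes, edges, source, sink, node_cap, edge_cap):
    """Size of a maximum flow with finite integer node and edge capacities."""
    res = defaultdict(lambda: defaultdict(int))   # residual capacities
    for v in nodes:                               # split v into (v,in) -> (v,out)
        res[(v, "in")][(v, "out")] += node_cap[v]
        res[(v, "out")][(v, "in")] += 0
    for u, v in edges:
        res[(u, "out")][(v, "in")] += edge_cap[(u, v)]
        res[(v, "in")][(u, "out")] += 0
    s, t = (source, "in"), (sink, "out")
    total = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            a = queue.popleft()
            for b, cap in res[a].items():
                if cap > 0 and b not in parent:
                    parent[b] = a
                    queue.append(b)
        if t not in parent:
            return total
        # Bottleneck capacity along the path found.
        bottleneck, b = None, t
        while parent[b] is not None:
            cap = res[parent[b]][b]
            bottleneck = cap if bottleneck is None else min(bottleneck, cap)
            b = parent[b]
        # Push the bottleneck amount and update residual capacities.
        b = t
        while parent[b] is not None:
            a = parent[b]
            res[a][b] -= bottleneck
            res[b][a] += bottleneck
            b = a
        total += bottleneck

# Toy graph (hypothetical): unit capacity on internal nodes a, b, c.
nodes = ["s", "a", "b", "c", "t"]
edges = [("s", "a"), ("s", "b"), ("a", "c"), ("b", "c"), ("c", "t"), ("b", "t")]
BIG = len(nodes)                                  # effectively unbounded here
node_cap = {"s": BIG, "t": BIG, "a": 1, "b": 1, "c": 1}
edge_cap = {e: BIG for e in edges}
flow_full = max_flow_node_caps(nodes, edges, "s", "t", node_cap, edge_cap)
edges_cut = [e for e in edges if e != ("b", "t")]
flow_cut = max_flow_node_caps(nodes, edges_cut, "s", "t", node_cap,
                              {e: BIG for e in edges_cut})
print(flow_full, flow_cut)
```

In the toy graph, the paths s, a, c, t and s, b, t share no internal node; once the edge from b to t is removed, every path must pass through c, so only one unit of flow remains.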
4.2 Deciding Generic Identifiability
For every node and every , let be defined as follows:
where and are, respectively, newly introduced source and sink nodes. The edge capacity is for all the edges. The node capacity is for both the sink and the source, and , otherwise. We denote the maximum size of any flow on by .
Lemma 4.1.
It holds that .
Theorem 4.2.
Given a mixed graph , a node , and any , the generic identifiability of holds if and only if , which can be certified in time.
Theorem 4.3.
Given a mixed graph , the generic identifiability of holds if and only if for all , which can be certified in time.
Example 4.1.
Fig. 4 illustrates the maximum flows obtained when applying the criterion from Theorem 4.2 to two of the nodes of the ADMG in Fig. 2.
The graph is constructed for parameter . The only flow on is the trivial flow setting all edges to . Hence, is not identifiable.
The graph is constructed for parameter . The figure displays a flow on of size . Consequently, the parameters and are identifiable.
5 The Genericity Condition for the Error Distribution
The idea underlying Assumption 1 is that it should not be possible to linearly disentangle a general dependence between two errors and . In other words, if two different linear combinations of are independent, then at least one of them cannot have any signal coming from . The purpose of this section is to prove that this fact is indeed true for two tractable subfamilies of joint distributions for the errors. Specifically, Section 5.1 considers the setting in which dependence is generated through linear latent factor models, and Section 5.2 treats distributions with finite moments.
5.1 Linear Factor Models
Assume that the error vector is generated according to a sparse factor model that respects the Markov property of the bidirected part of a given ADMG . Define a latent factor graph for to be any DAG , in which the latent nodes are source nodes and whose latent projection (see Verma and Pearl (1990, Sec. 3)) on the nodes in is equal to . Define to be the set of -dimensional random vectors with independent and non-Gaussian components. Then, the sparse factor model associated to is the set of random vectors
(5.1) |
Theorem 5.1.
Let be a latent factor graph for , and for any subset define . If for every edge there is a clique (a subset of for which every pair of nodes is adjacent) in such that , then satisfies Assumption 1 for Lebesgue-almost every matrix .
Proof.
Let , and consider as in Eq. 5.1. Applying the Darmois-Skitovich theorem (Comon and Jutten, 2010, Thm. 9.5) to and , we obtain that
(5.2) | |||||
(5.3) |
Note that Eq. 5.2 already gives the part of the claim referring to the case in Assumption 1. It remains to consider the case of two nodes that are adjacent in .
Let , and assume for contradiction that . Consider a clique as in the statement of the theorem. The vector is a solution of the following system of quadratic equations:
we denote the system by . Notice that from Eq. 5.2 we know that the vector has at most non-zero entries. We now show that, for a generic choice of the entries of , does not admit solutions with . Following the case distinctions resulting from the vanishing of the first or the second factor in the equations in (5.2), the solution set of can be written as the union of the solution sets of homogeneous linear systems. Each of these linear systems can be characterized by a partition of defined as follows:
We denote by and the linear systems associated to and , respectively. Define and .
If , the vector has at most non-zero entries, implying that has equation and parameters. If , for a generic choice of the entries of , such a system admits only the 0 solution (Okamoto, 1973, Lemma). Hence, the assumption that leads to a contradiction.
For , we now show that either or admits only the 0 solution. Notice that since we have . This implies that both and can have a non-zero solution for a generic choice of the entries of only if
This would lead to , which contradicts . ∎
Corollary 5.2.
Remark 5.1.
We point out that in the non-Gaussian setting, sparse linear factor analysis models have testable implications; see, e.g., Ardiyansyah and Sodomaco (2023); Xie et al. (2023); Schkoda and Drton (2023). Hence, the failure of Assumption 1 could, in principle, be tested by testing all linear factor models leading to the failure.
Example 5.1.
We borrow the example in Fig. 5 from Barber et al. (2022, Fig. 5). Notice that in the proof of our main result, Assumption 1 is used only for matrices with a specific structure, described in Lemma 3.1. Therefore, we focus on this type of matrices. In particular, we will consider the matrix
(5.4) |
and the bidirected graph , corresponding to in Fig. 5 with respect to the latent factor models given in Fig. 5, and Fig. 6.
1. Consider the pair, , the only latent parent of both in is and . This means that the only clique we can consider is and , hence the condition in Theorem 5.1 is violated. Now, we will show that Eq. 5.3 has a nonzero solution. Indeed, the only latent variable for which the system is not trivially satisfied is , implying that any solution of the equation , is also a solution of Eq. 5.3.
2. It is straightforward to see that after adding a latent node to the graph as in the graph in Fig. 6, the condition of Theorem 5.1 is still not satisfied. However, in this case, Assumption 1 cannot be violated by the matrices described in Eq. 5.4. To see this, consider Eq. 5.3 for the latent variables and , which leads to the following system of equations for .
Clearly, the only solution to this system of equations is .
3. The graph satisfies the hypothesis of Theorem 5.1; hence it does not violate Assumption 1.
5.2 Random Variables with Finite Moments
We now turn to a setting where the error vector has finite moments up to a suitable order. As we show in Theorem 5.5 below, the distributions at which Assumption 1 fails define a set of moments, or also cumulants, that form a Lebesgue null set in all possible moments/cumulants up to the considered truncation order. The proofs for the results presented in this section can be found in Section B.2.
Definition 5.1.
The -th cumulant tensor of a random vector is the -way tensor in whose entry in position is the joint cumulant
where the sum is taken over all partitions of the multiset .
Cumulant tensors are symmetric, i.e.,
where is the symmetric group on . We write for the subspace of symmetric tensors in .
Assumption 1 involves linear combinations of the entries of a random vector. We will, thus, have to consider cumulants after linear transformation, for which we can leverage the following fact.
Lemma 5.3 (Comon and Jutten (2010), §5, Eq. 5.8).
Let be any -variate random vector, and for any , then
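The behavior of cumulant tensors under linear maps can be illustrated at order three, where the joint cumulant coincides with the third central moment. The sketch below checks empirically that the third-order cumulant tensor of a transformed sample equals the multilinear action of the matrix on the original tensor. The function names, the convention that the transformed vector is `a` times the original vector, and the exponential test data are assumptions made for illustration.

```python
import numpy as np

def third_cumulant(x):
    """Empirical third-order cumulant tensor of samples x with shape (n, p)."""
    xc = x - x.mean(axis=0)                       # center each coordinate
    return np.einsum("ni,nj,nk->ijk", xc, xc, xc) / x.shape[0]

def multilinear(tensor, a):
    """Apply the matrix a along each of the three modes of a 3-way tensor."""
    return np.einsum("ia,jb,kc,abc->ijk", a, a, a, tensor)

rng = np.random.default_rng(1)
x = rng.exponential(size=(5000, 3))               # skewed, non-Gaussian sample
a = rng.normal(size=(2, 3))                       # arbitrary 2 x 3 matrix

lhs = third_cumulant(x @ a.T)                     # cumulant of the transformed data
rhs = multilinear(third_cumulant(x), a)           # transformed cumulant
print(np.max(np.abs(lhs - rhs)))
```

For order three the identity holds exactly at the sample level, up to floating-point rounding, because centering commutes with the linear map. The resulting tensor is also symmetric under permutations of its indices.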
In order to justify Assumption 1, we wish to offer statements of its generic validity. Our strategy to do so in the present context is to consider cumulants up to a suitable truncation order . In the remainder of this section we consider a mixed graph with nodes, which we label by taking the vertex set to be .
Definition 5.2.
Let be the subset of yielding distributions with all moments finite. For any integer , let
Moreover, we let
Lemma 5.4.
Fix any integer .
(i) The map that sends random vectors with all moments finite to their -th cumulant tensors is well-defined in the sense that .
(ii) Define the map . Then is a full dimensional subset of .
Theorem 5.5.
Let
For every , define , and let , which is precisely the set of distributions for which Assumption 1 fails. Then there is a positive integer such that is a Lebesgue measure 0 subset of .
We remark that, for simplicity, we stated Theorem 5.5 for distributions with finite moments of any order. However, we only needed the first moments to be finite.
Example 5.2.
One simple type of exceptional distribution for which Assumption 1 fails to hold consists of distributions that are obtained as linear transformations of independent non-Gaussian variables. For example, let be two independent, standard univariate normal distributions, let for , and let for any invertible 2 by 2 matrix . Then for , but by construction , and the fact that implies . As noted in Schkoda and Drton (2023), linear transformations of independent components form a null set already when considering cumulants of order up to .
Remark 5.2.
Theorem 5.5 is of independent interest given the recent scholarly attention to generalizations of ICA that can deal with dependent error terms; see, e.g., Mesters and Zwiernik (2022); Garrote-López and Stephenson (2024); Wang and Seigal (2024). Indeed, if we consider to be the empty graph, then Theorem 5.5 reduces to a generic version of the classical Darmois-Skitovich theorem that underlies ICA theory (Comon and Jutten, 2010, Thm. 9.5). From this perspective, Theorem 5.5 provides a generic generalization of the Darmois-Skitovich theorem to the case where the independence structure of the sources is more complex. A consequence of Examples 5.1-5.2 is that a generalization of the Darmois-Skitovich theorem that holds globally, i.e., for every non-Gaussian distribution, cannot be achieved.
6 Structure Identifiability
Up to this point, we have studied a parameter identifiability problem, assuming that the mixed graph is given. However, there are numerous practical problems in which one needs to infer the graph from data. This model selection problem is often referred to as structure learning or causal discovery (Drton and Maathuis, 2017). Therefore, it is important to understand whether, within the model class under consideration, the graph is identifiable. If this is not the case, one is forced to either restrict the class of graphs under consideration, as done by Wang and Drton (2023), or to learn an equivalence class of graphs (Peters et al., 2017, §6). The problem that we consider in this section is as follows.
Problem (Model equivalence).
Given two ADMGs and , is it true that ?
When , we say that and are model equivalent. The equivalence class of an ADMG is the set of all the ADMGs that are model equivalent to it.
In the next result, we prove that the model equivalence of two arbitrary ADMGs can be certified by solving a system of quadratic equations. For graphs of small size, such a system can be solved with computer algebra software, in the same spirit as in García-Puente et al. (2010). In Example 6.1, we use this result to characterize the equivalence class of the IV graph depicted in Fig. 1.
Theorem 6.1.
Let and be two arbitrary ADMGs with the same vertex set . Then if and only if for every , and every the following system of equations has a solution in :
(6.1) |
Proof.
By definition, if and only if for every , we have for some and . This implies
(6.2) |
In particular, Eq. 6.2 has to hold for the generic elements satisfying Assumption 1. By applying Assumption 1 to every pair , we conclude that the condition is necessary for the model inclusion. In order to prove its sufficiency, we must demonstrate that satisfies the connected set Markov property with respect to .
For every , let , and for . We need to show that for every that is connected in . If in , then the result follows from the connected set Markov property of with respect to . Assume, for contradiction, that , indicating that there exist and such that in or , and , which contradicts Eq. 6.1 since . ∎
The next result applies Theorem 3.3 to graphically characterize when two ADMGs that only differ in one directed edge are model equivalent.
Theorem 6.2.
Let be an ADMG, , and . Then if and only if is not identifiable in .
Proof.
Since is a subgraph of , we always have . Hence, we only need to show that if and only if is not identifiable in .
Let . Then if and only if for and . This implies
(6.3) |
Following the steps of the proof in Lemma 3.1, we get
where is the parent set of in .
Since and have the same bidirected part, we can repeat the same proof as for Lemma 3.2, concluding that a matrix as in Eq. 6.3 can exist if and only if, for every , the following system has a solution:
(6.4) |
For , we have . Hence, the system in Eq. 6.4 always has a solution given by .
For , we have . The system in Eq. 6.4 has a solution if and only if . Let be a system of paths without intersection from to . If , let be the path that ends at , and be the path obtained by concatenating with the edge . By construction, is a system of non-intersecting paths from to . This proves that , which implies .
From Theorem 3.3, we know that and that equality holds if and only if is identifiable. Finally, we can write
This concludes the proof by noticing that if and only if the first inequality is strict. ∎
It is well known that the presence of a valid instrumental variable is sufficient for estimating the causal effect from a treatment to an outcome (Wright, 1928, App. B). However, testing from data that an instrument is valid is a much more involved task. Indeed, Gunsilius (2021) shows that in the nonparametric case for continuous treatment, the IV model does not impose any constraint on the observed distribution. Developing tests for the validity of an instrument under different parametric assumptions is an active and important area of research; see, e.g., Pearl (1995); Silva and Shimizu (2017); Xie et al. (2022). The next example shows that, unlike the nonparametric case, the IV model does impose constraints on the observed distribution in linear models. However, our results prove that these constraints are not sufficient for testing the validity of an instrument.
Example 6.1 (Instrumental Validity).
Let be the graph on the top left in Fig. 7. Applying Theorem 6.2 to this graph, and the edges and , one can see that , and are all model equivalent. Applying the same argument to the graph on the bottom left in Fig. 7, and the edges and , we obtain that all the graphs depicted in Fig. 7 are model equivalent. Furthermore, by verifying the conditions of Theorem 6.1 for all ADMGs with three nodes using the software Macaulay2 (Grayson and Stillman, 2023), we confirm that the graphs depicted in Fig. 7 are the only ones in the equivalence class of the IV graph.
In particular, the equivalence of the IV graph and implies that the so-called exclusion restriction (Lousdal, 2018) is not testable within our model class.
7 Cyclic Graphs
Up to this point, we have exclusively studied acyclic models. This assumption has allowed us to obtain a complete characterization of the identifiable parameters. In this section, we relax this assumption and show that the proposed graphical criterion in Section 3 remains a necessary condition but is no longer sufficient. Moreover, we will provide a complete characterization of parameter identifiability for a special sub-class of cyclic graphs.
The first issue that one encounters when dealing with cyclic models is that the matrix might not be invertible. This implies that the assignment does not induce a unique solution for . Hence, we need to restrict our attention to a subset of , namely, the set . In this section, we focus on the identifiability of the matrix and reformulate the problem as follows.
Definition 7.1.
Define the parametrization map
and for every , let the fiber of with respect to be
(7.1) |
For any generic choice of , let . We say that the graph is generically identifiable if .
Lemma 7.1.
Let for a generic choice of parameters . The matrix belongs to if it is a solution to the following linear system of equations:
(7.2) |
Proof.
It suffices to notice that for the reverse implication of the proof of Lemma 3.2, we never used the acyclicity of the graph. Hence, the same proof applies. ∎
Theorem 7.2.
If a mixed graph is identifiable, then for every , there exists a subset of with the size , such that there is a system of non-intersecting paths from to .
Proof.
Example 7.1 (A non-identifiable cyclic graph).
For the graph in Fig. 8, we have , and . Considering in Theorem 7.2, we can see that the matrix is not identifiable for this cyclic graph.
Example 7.2 (Non-sufficiency of the graphical criterion).
Let be the 2-cycle in Fig. 9. The matrix of Eq. 3.1 will have the following form:
From 1 and the fact that there are no bidirected edges in the graph, we should have
Since the graph has cycles, we cannot rule out the possibility that the diagonal entries of are equal to zero. Hence, a valid solution is
This implies that the observed vector can be written in at least two different ways,
that are both compatible with the graph . In other words, the matrix is not identifiable.
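As a numerical illustration of the non-identifiability discussed in Example 7.2, the following sketch simulates a 2-cycle and checks that a second coefficient matrix, supported on the same two edges, reproduces the same observations with errors that are again independent. The convention that the observed vector satisfies X = Λ^T X + ε with Λ[i, j] the weight of the edge i → j, the specific edge weights, and the uniform error distribution are assumptions made for this illustration only; they are not taken from the displayed matrices of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
lam12, lam21 = 0.5, -0.8  # hypothetical weights of the edges 1 -> 2 and 2 -> 1

# Independent non-Gaussian (uniform) errors, one column per variable.
eps = rng.uniform(-1.0, 1.0, size=(n, 2))

# Assumed convention: X = Lam^T X + eps, i.e. in row form X = eps (I - Lam)^{-1}.
Lam = np.array([[0.0, lam12],
                [lam21, 0.0]])
X = eps @ np.linalg.inv(np.eye(2) - Lam)

# Alternative coefficients on the same two edges: invert each structural equation.
Lam_alt = np.array([[0.0, 1.0 / lam21],
                    [1.0 / lam12, 0.0]])

# Errors implied by the alternative coefficients for the same observations.
eps_alt = X @ (np.eye(2) - Lam_alt)

# The implied errors are the original ones, swapped and rescaled, hence independent.
print(np.allclose(eps_alt[:, 0], -eps[:, 1] / lam12))
print(np.allclose(eps_alt[:, 1], -eps[:, 0] / lam21))
```

For a 2-cycle, det(I − Λ) = 1 − λ12 λ21, so both coefficient matrices used here give an invertible I − Λ. Since the alternative errors are a permuted and rescaled copy of the original independent errors, the two representations cannot be distinguished from the distribution of the observed vector.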
Lemma 7.3.
The -cycle, depicted in Fig. 9, is generically identifiable if and only if .
Proof.
From Example 7.2, we already know that is not identifiable. Hence, it is only left to show that is identifiable for . Herein, we present a proof for the case , as for larger cases, a similar argument would hold.
The matrix of Eq. 3.1, will have the following shape:
(7.3) |
From 1 and the fact that there are no bidirected edges in the graph, we should have
(7.4) |
If all the non-diagonal entries of are set to zero, we find . We now show that this is the only solution for the system. Assume , this implies . This leads to
plugging this value in Eq. 7.3, we obtain
(7.5) |
Writing explicitly the first row of the matrix in Eq. 7.5, one can see that
and both these quantities are different from zero for a generic choice of parameters in , see Lemma A.2. Having , Eq. 7.4 implies that and following the same argument as above, we conclude that . Finally, we have since both terms are non-zero, which contradicts Eq. 7.4. This proves that for a generic choice of parameters, the only solution of Eq. 7.4 is given by . In other words, the matrix is generically identifiable. ∎
Remark 7.1.
It is noteworthy that Drton et al. (2011, Lemma 9) prove that -cycles with are not generically identifiable from the covariance matrix alone.
Theorem 7.4.
Let be a directed graph, such that , with being a -cycle, and . Then is generically identifiable if and only if for every cycle of size , we have for .
Proof.
8 Computational Experiments
8.1 Certifying Identifiability
We implemented the criterion from Theorem 4.3 using the algorithm of Dinits (1970) to solve the maximum flow problem. It operates with a complexity of . Consequently, the algorithm we implemented has a complexity of . We then determine the proportion of identifiable randomly sampled ADMGs of size and with edges. For each setup, we randomly sampled graphs. More details on how the graphs were generated are given in Section C.1.
The proportions are displayed in Fig. 10. We observe that for the given sampling scheme, most graphs yield identifiable models. The proportion of identifiable models remains similar across the two considered dimensions.
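The basic computational primitive behind such a check is counting non-intersecting directed paths between two vertex sets, which reduces to a maximum flow problem with unit node capacities. The sketch below is a minimal illustration of that reduction, not the implementation used for the experiments: it uses plain breadth-first augmenting paths rather than the algorithm of Dinits (1970), the function name `max_disjoint_paths` is hypothetical, and the choice of source and target sets required by Theorem 4.3 is not reproduced here.

```python
from collections import deque


def max_disjoint_paths(nodes, edges, sources, targets):
    """Largest number of vertex-disjoint directed paths from `sources` to `targets`."""
    cap = {}  # residual capacities
    adj = {}  # adjacency of the residual network

    def add_arc(u, v, c):
        cap[(u, v)] = cap.get((u, v), 0) + c
        cap.setdefault((v, u), 0)  # residual (backward) arc
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Node splitting: v becomes (v, "in") -> (v, "out") with capacity 1,
    # so that at most one path can pass through v.
    for v in nodes:
        add_arc((v, "in"), (v, "out"), 1)
    for u, v in edges:
        add_arc((u, "out"), (v, "in"), 1)
    for s in set(sources):
        add_arc("SRC", (s, "in"), 1)
    for t in set(targets):
        add_arc((t, "out"), "SNK", 1)

    flow = 0
    while True:
        # Breadth-first search for an augmenting path in the residual network.
        parent = {"SRC": None}
        queue = deque(["SRC"])
        while queue and "SNK" not in parent:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if "SNK" not in parent:
            return flow
        # Every augmenting path crosses a unit-capacity node arc, so it carries one unit.
        v = "SNK"
        while parent[v] is not None:
            u = parent[v]
            cap[(u, v)] -= 1
            cap[(v, u)] += 1
            v = u
        flow += 1


# Chain 1 -> 2 -> 3: every path from {1, 2} to {2, 3} uses node 2.
print(max_disjoint_paths([1, 2, 3], [(1, 2), (2, 3)], [1, 2], [2, 3]))
```

A vertex that is both a source and a target is counted as a trivial path of length zero, and self-loops are assumed to be absent. Replacing the breadth-first search by a blocking-flow routine would change the running time but not the value returned.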
8.2 Causal Effect Estimation
Herein, we present an optimization problem that can be used to infer the identifiable causal effects.
Lemma 8.1.
Let , for a generic choice of , then is a solution of Eq. 3.3 if and only if it is a solution of the optimization problem
(8.1) |
where is any consistent measure of dependence, i.e., any nonnegative function that takes as input two random variables and returns zero if and only if the random variables are independent.
Proof.
For practical estimation we may form an empirical version of the problem in Eq. 8.1 by replacing the dependence measure with suitable consistent estimates. One natural choice for is mutual information (Cover and Thomas, 2006, §8.6). However, the most popular estimator for the mutual information is based on a k-nearest neighbor clustering of the sample, which would result in a non-smooth optimization problem (Kraskov et al., 2004). Several alternatives to mutual information have been proposed in the literature (Székely et al., 2007; Geenens and Lafaye de Micheaux, 2022; Shi et al., 2022). In particular, the Hilbert-Schmidt independence criterion (HSIC) (Gretton et al., 2007) has been extensively applied in causal inference (Mooij et al., 2009; Saengkyongam et al., 2022). For our empirical study, we used the HSIC, but other measures of dependence can also be used.
When the underlying kernel is characteristic, the HSIC provides a measure of dependence that vanishes if and only if the variables for which it is computed are independent (Fukumizu et al., 2007, §2.2 and Thm. 1). Moreover, a consistent estimator for the HSIC (Gretton et al., 2007) is given by
where and , and denote the respective Gram matrices.
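For concreteness, a commonly used form of the empirical HSIC is the biased estimate tr(KHLH)/n², where K and L are the Gram matrices of the two samples and H = I − (1/n)11ᵀ is the centering matrix. The sketch below implements this form for one-dimensional samples with a polynomial kernel of degree 2. Whether the normalization matches the one in the display above is an assumption, and the function names `poly_kernel` and `hsic` are hypothetical.

```python
import numpy as np


def poly_kernel(x, degree=2, c=1.0):
    """Gram matrix of the polynomial kernel (x_i * x_j + c) ** degree for a 1-D sample."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    return (x @ x.T + c) ** degree


def hsic(x, y, kernel=poly_kernel):
    """Biased empirical HSIC, trace(K H L H) / n**2, with H the centering matrix."""
    n = len(x)
    K, L = kernel(x), kernel(y)
    H = np.eye(n) - np.ones((n, n)) / n
    return float(np.trace(K @ H @ L @ H)) / n**2


x = np.linspace(-1.0, 1.0, 50)
print(hsic(x, x))            # strictly positive: x depends on itself
print(hsic(x, np.ones(50)))  # numerically zero: a constant carries no dependence
```

The estimate is symmetric in its two arguments and nonnegative up to rounding error. Note that a polynomial kernel of finite degree is not characteristic, so with this kernel a vanishing HSIC only rules out dependence visible in the corresponding low-order moments.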
For a fixed graph , and a given sample matrix , we estimate as a solution to the following optimization problem
(8.2) |
We used the L-BFGS method of Liu and Nocedal (1989) to solve the above optimization problem. We considered two types of kernels in our experiments: radial basis function (RBF) kernels, the results of which are presented in Fig. 13 in Section C.2, and polynomial kernels, the results of which are depicted in Fig. 12. More details on the data generation, as well as additional experiments with different error distributions, can be found in Section C.1. It is noteworthy that in our experiments the polynomial kernels of degree 2 (Schölkopf and Smola, 2018, §2.3) provide better estimates, and the results rely less on the initialization than with the RBF kernels.
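To make the estimation procedure concrete, the following sketch applies the same idea to simulated data from an IV graph Z → X → Y with a latent confounder of X and Y. It is an illustration under stated assumptions rather than the experimental code: the objective is a simplified pairwise version (the sum of the HSIC between Z and each of the two residuals), the data-generating choices and all variable names are hypothetical, and the optimizer is SciPy's `L-BFGS-B` with finite-difference gradients.

```python
import numpy as np
from scipy.optimize import minimize


def hsic_poly2(x, y):
    """Biased HSIC estimate with the degree-2 polynomial kernel (uv + 1) ** 2."""
    n = len(x)
    K = (np.outer(x, x) + 1.0) ** 2
    L = (np.outer(y, y) + 1.0) ** 2
    # Double-centering K equals H K H; then trace(K H L H) = sum(Kc * L) for symmetric L.
    Kc = K - K.mean(axis=0, keepdims=True) - K.mean(axis=1, keepdims=True) + K.mean()
    return float((Kc * L).sum()) / n**2


# Simulated IV graph with a latent confounder U entering Y non-linearly.
rng = np.random.default_rng(1)
n = 300
a_true, b_true = 1.0, 0.5  # hypothetical edge weights Z -> X and X -> Y
Z = rng.uniform(-1, 1, n)
U = rng.uniform(-1, 1, n)
X = a_true * Z + U + rng.uniform(-1, 1, n)
Y = b_true * X + U + U**2 + rng.uniform(-1, 1, n)
Z, X, Y = Z - Z.mean(), X - X.mean(), Y - Y.mean()


def objective(theta):
    """Dependence between Z and the residuals of X and Y implied by theta = (a, b)."""
    a, b = theta
    return hsic_poly2(Z, X - a * Z) + hsic_poly2(Z, Y - b * X)


# Initialize at the least-squares regression coefficients.
theta0 = np.array([Z @ X / (Z @ Z), X @ Y / (X @ X)])
res = minimize(objective, theta0, method="L-BFGS-B")
a_hat, b_hat = res.x
print("initial:", theta0, "objective:", objective(theta0))
print("estimate:", res.x, "objective:", res.fun)
```

The regression coefficient of Y on X is biased by the confounder, whereas the residual Y − bX is independent of Z only at the true value of b, which is what the objective exploits. The objective is non-convex in general, so the result may depend on the initialization.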
In Fig. 12, we report the performance of our method on the IV graph (Fig. 1), the ADMG shown in Fig. 11, and the 3-cycle (Fig. 9). We used the normalized Frobenius loss between the estimated matrix , and the true matrix , i.e., , as our loss function and reported the mean loss over fifty randomly sampled . We compared our method against the Empirical Likelihood (EL) estimator proposed in Wang and Drton (2017). Note that for the IV graph, the parameters are identifiable from the covariance matrix. Therefore, the EL estimator, which is a covariance-based method, outperforms our method as the covariance matrix estimator is more sample-efficient than the HSIC estimator. In contrast, for the ADMG in Fig. 11 and the 3-cycle, the performance of the EL estimator does not improve with the sample size. This is due to the fact that the parameters of these two mixed graphs cannot be determined solely from the covariance matrix; see Foygel et al. (2012, Prop. 2) and Drton et al. (2011, Lemma 9). When initialized at the regression coefficients, the performance of our proposed estimator improves with the sample size. This indicates that the potential numerical issues arising from the non-convexity of the objective function and the estimation errors become less relevant as the sample size grows.
9 Conclusions
In this work, we studied the generic identifiability of direct causal effects in linear structural equation models with dependent errors. For acyclic models, we obtained a complete graphical characterization of the identifiable causal effects, with a graphical criterion that can be checked in polynomial time in the size of the graph. For cyclic models, we proved that the same graphical conditions are necessary for identifiability, and we provided counter-examples to show that they are not sufficient. For a smaller family of cyclic models, we provided a complete graphical characterization of the identifiable effects. A complete characterization of the identifiability for cyclic models, however, involves additional mathematical subtleties and is left as a problem for future work.
We also discussed the identifiability of the causal graph. For this problem, we provided an algorithm to test the model equivalence of two arbitrary ADMGs and a graphical characterization of the model equivalence for two graphs that only differ in the presence of a directed edge.
Most of the literature on identifiability in linear structural equation models leverages specific moment equations to obtain identifiability results. In this work, we follow a different approach and exploit the information contained in the whole distribution, explicitly leveraging the independence relations dictated by the missing bidirected edges in the graph. To the best of our knowledge, our work is the first to follow this route in this generality. In an initial exploration of parameter estimation we showed that estimates obtained by minimizing structurally absent dependences can be useful.
To conclude, we highlight possible future directions.
Beyond Observational Data.
In this paper, we considered the situation when only observational data is available. Recent identification results that additionally consider information from interventional datasets have been proposed for non-parametric models (Lee et al., 2020; Kivva et al., 2022). Extending our results to these setups can be seen as a natural future direction.
Non-linear Models.
In the graphical models literature, different non-parametric assumptions on the functional relations among the variables have been used to guarantee the identifiability of the causal structure (Peters et al., 2017, §7.1). Similar assumptions have been used to prove the identifiability of the causal effect under specific causal assumptions, e.g., Imbens and Newey (2009). However, a general graphical criterion for identification in non-linear structural equation models is currently missing. We believe that the ideas we propose in this work admit suitable extensions to these more general settings.
Structure Identifiability.
In Section 6, we provided a graphical characterization for the model equivalence of two ADMGs that only differ in the presence of an edge. A graphical characterization for the model equivalence of two arbitrary ADMGs is still an open problem. Its solution would be relevant for the development of algorithms for causal discovery from observational data that work under minimal linearity assumptions.
References
- Ardiyansyah and Sodomaco (2023) Muhammad Ardiyansyah and Luca Sodomaco. Dimensions of higher order factor analysis models. Algebr. Stat., 14(1):91–108, 2023.
- Barber et al. (2022) Rina Foygel Barber, Mathias Drton, Nils Sturma, and Luca Weihs. Half-trek criterion for identifiability of latent variable models. Ann. Statist., 50(6):3174–3196, 2022.
- Brito (2004) Carlos Brito. Graphical models for identification in structural equation models. Ph.D. thesis, UCLA Computer Science Dept., 2004.
- Chen et al. (2022) Li Chen, Rasmus Kyng, Yang P. Liu, Richard Peng, Maximilian Probst Gutenberg, and Sushant Sachdeva. Maximum flow and minimum-cost flow in almost-linear time. In 63rd IEEE Annual Symposium on Foundations of Computer Science, FOCS 2022, Denver, CO, USA, October 31 - November 3, 2022, pages 612–623. IEEE, 2022.
- Comon and Jutten (2010) Pierre Comon and Christian Jutten. Handbook of Blind Source Separation: Independent Component Analysis and Applications. Academic Press, Inc., USA, 1st edition, 2010.
- Cormen et al. (2009) Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to algorithms. MIT Press, Cambridge, MA, third edition, 2009.
- Cover and Thomas (2006) Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, USA, 2006.
- Cox et al. (2015) David A. Cox, John Little, and Donal O’Shea. Ideals, varieties, and algorithms. Undergraduate Texts in Mathematics. Springer, Cham, fourth edition, 2015. An introduction to computational algebraic geometry and commutative algebra.
- di Dio and Schmüdgen (2022) Philipp J. di Dio and Konrad Schmüdgen. The multidimensional truncated moment problem: the moment cone. J. Math. Anal. Appl., 511(1):Paper No. 126066, 38, 2022.
- Dinits (1970) E. A. Dinits. Algorithm for solution of a problem of maximum flow in a network with power estimation. Sov. Math., Dokl., 11:1277–1280, 1970. ISSN 0197-6788.
- Drton (2018) Mathias Drton. Algebraic problems in structural equation modeling. In The 50th anniversary of Gröbner bases, volume 77 of Adv. Stud. Pure Math., pages 35–86. Math. Soc. Japan, Tokyo, 2018.
- Drton and Maathuis (2017) Mathias Drton and Marloes H. Maathuis. Structure learning in graphical modeling. Annu. Rev. Stat. Appl., 4:365–393, 2017.
- Drton and Richardson (2008) Mathias Drton and Thomas S. Richardson. Binary models for marginal independence. J. R. Stat. Soc. Ser. B Stat. Methodol., 70(2):287–309, 2008.
- Drton et al. (2011) Mathias Drton, Rina Foygel, and Seth Sullivant. Global identifiability of linear structural equation models. Ann. Statist., 39(2):865–886, 2011.
- Eriksson and Koivunen (2004) Jan Eriksson and Visa Koivunen. Identifiability, separability, and uniqueness of linear ICA models. IEEE Signal Process. Lett., 11(7):601–604, 2004.
- Evans and Ringel (1999) William N. Evans and Jeanne S. Ringel. Can higher cigarette taxes improve birth outcomes? Journal of Public Economics, 72(1):135–154, 1999.
- Foygel et al. (2012) Rina Foygel, Jan Draisma, and Mathias Drton. Half-trek criterion for generic identifiability of linear structural equation models. Ann. Statist., 40(3):1682–1713, 2012.
- Fukumizu et al. (2007) Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, and Bernhard Schölkopf. Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 3-6, 2007, pages 489–496. Curran Associates, Inc., 2007.
- García-Puente et al. (2010) Luis D. García-Puente, Sarah Spielvogel, and Seth Sullivant. Identifying causal effects with computer algebra. In UAI 2010, Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, Catalina Island, CA, USA, July 8-11, 2010, pages 193–200. AUAI Press, 2010.
- Garrote-López and Stephenson (2024) Marina Garrote-López and Monroe Stephenson. Cumulant tensors in partitioned independent component analysis. arXiv:2402.10089, 2024.
- Geenens and Lafaye de Micheaux (2022) Gery Geenens and Pierre Lafaye de Micheaux. The Hellinger correlation. J. Amer. Statist. Assoc., 117(538):639–653, 2022.
- Grayson and Stillman (2023) Daniel R. Grayson and Michael E. Stillman. Macaulay2, a software system for research in algebraic geometry. Available at http://www2.macaulay2.com, 2023.
- Gretton et al. (2007) Arthur Gretton, Kenji Fukumizu, Choon Hui Teo, Le Song, Bernhard Schölkopf, and Alexander J. Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 3-6, 2007, pages 585–592. Curran Associates, Inc., 2007.
- Gunsilius (2021) F. F. Gunsilius. Nontestability of instrument validity under continuous treatments. Biometrika, 108(4):989–995, 2021.
- Imbens and Newey (2009) Guido Imbens and Whitney Newey. Identification and estimation of triangular simultaneous equations models without additivity. Econometrica, 77(5):1481–1512, 2009.
- Kivva et al. (2022) Yaroslav Kivva, Ehsan Mokhtarian, Jalal Etesami, and Negar Kiyavash. Revisiting the general identifiability problem. In Uncertainty in Artificial Intelligence, Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence, UAI 2022, 1-5 August 2022, Eindhoven, The Netherlands, volume 180 of Proceedings of Machine Learning Research, pages 1022–1030. PMLR, 2022.
- Kivva et al. (2023a) Yaroslav Kivva, Jalal Etesami, and Negar Kiyavash. On identifiability of conditional causal effects. In Uncertainty in Artificial Intelligence, UAI 2023, July 31 - 4 August 2023, Pittsburgh, PA, USA, volume 216 of Proceedings of Machine Learning Research, pages 1078–1086. PMLR, 2023a.
- Kivva et al. (2023b) Yaroslav Kivva, Saber Salehkaleybar, and Negar Kiyavash. A cross-moment approach for causal effect estimation. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023b.
- Kraskov et al. (2004) Alexander Kraskov, Harald Stögbauer, and Peter Grassberger. Estimating mutual information. Phys. Rev. E (3), 69(6):066138, 16, 2004.
- Kumor et al. (2020) Daniel Kumor, Carlos Cinelli, and Elias Bareinboim. Efficient identification in linear structural causal models with auxiliary cutsets. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 5501–5510. PMLR, 2020.
- Lauritzen (1996) Steffen L. Lauritzen. Graphical Models. Oxford University Press, 1996.
- Lee et al. (2020) Sanghack Lee, Juan D. Correa, and Elias Bareinboim. General identifiability with arbitrary surrogate experiments. In Ryan P. Adams and Vibhav Gogate, editors, Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, volume 115 of Proceedings of Machine Learning Research, pages 389–398. PMLR, 2020.
- Lewicki and Sejnowski (2000) Michael S. Lewicki and Terrence J. Sejnowski. Learning overcomplete representations. Neural Comput., 12(2):337–365, 2000.
- Liu and Nocedal (1989) Dong C. Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Math. Programming, 45(3):503–528, 1989.
- Liu et al. (2021) Yiheng Liu, Elina Robeva, and Huanqing Wang. Learning linear non-Gaussian graphical models with multidirected edges. J. Causal Inference, 9(1):250–263, 2021.
- Lousdal (2018) Mette Lise Lousdal. An introduction to instrumental variable assumptions, validation and estimation. Emerging themes in epidemiology, 15(1):1, 2018.
- Maathuis et al. (2019) Marloes Maathuis, Mathias Drton, Steffen Lauritzen, and Martin Wainwright, editors. Handbook of Graphical Models. Chapman & Hall/CRC Handbooks of Modern Statistical Methods. CRC Press, Boca Raton, FL, 2019.
- McCullagh (1987) Peter McCullagh. Tensor methods in statistics. Monographs on Statistics and Applied Probability. Chapman & Hall, London, 1987.
- Mesters and Zwiernik (2022) Geert Mesters and Piotr Zwiernik. Non-independent components analysis. arXiv:2206.13668, 2022.
- Michałek and Sturmfels (2021) Mateusz Michałek and Bernd Sturmfels. Invitation to nonlinear algebra, volume 211 of Graduate Studies in Mathematics. American Mathematical Society, Providence, RI, 2021.
- Mooij et al. (2009) Joris M. Mooij, Dominik Janzing, Jonas Peters, and Bernhard Schölkopf. Regression by dependence minimization and its application to causal inference in additive noise models. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, Montreal, Quebec, Canada, June 14-18, 2009, volume 382 of ACM International Conference Proceeding Series, pages 745–752. ACM, 2009.
- Okamoto (1973) Masashi Okamoto. Distinctness of the eigenvalues of a quadratic form in a multivariate sample. Ann. Statist., 1:763–765, 1973.
- Pearl (1995) Judea Pearl. On the testability of causal models with latent and instrumental variables. In Philippe Besnard and Steve Hanks, editors, UAI ’95: Proceedings of the Eleventh Annual Conference on Uncertainty in Artificial Intelligence, Montreal, Quebec, Canada, August 18-20, 1995, pages 435–443. Morgan Kaufmann, 1995.
- Pearl (2009) Judea Pearl. Causality. Cambridge University Press, Cambridge, second edition, 2009. Models, reasoning, and inference.
- Pearl (2017) Judea Pearl. A linear ‘microscope’ for interventions and counterfactuals. J. Causal Inference, 5(1):Art. No. 20170003, 2017.
- Peters et al. (2017) Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of Causal Inference. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA, 2017. Foundations and learning algorithms.
- Richardson (2003) Thomas Richardson. Markov properties for acyclic directed mixed graphs. Scand. J. Statist., 30(1):145–157, 2003.
- Richardson and Spirtes (2002) Thomas Richardson and Peter Spirtes. Ancestral graph Markov models. Ann. Statist., 30(4):962–1030, 2002.
- Richardson et al. (2023) Thomas S. Richardson, Robin J. Evans, James M. Robins, and Ilya Shpitser. Nested Markov properties for acyclic directed mixed graphs. Ann. Statist., 51(1):334–361, 2023.
- Robeva and Seby (2021) Elina Robeva and Jean-Baptiste Seby. Multi-trek separation in linear structural equation models. SIAM Journal on Applied Algebra and Geometry, 5(2):278–303, 2021.
- Saengkyongam et al. (2022) Sorawit Saengkyongam, Leonard Henckel, Niklas Pfister, and Jonas Peters. Exploiting independent instruments: Identification and distribution generalization. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 18935–18958. PMLR, 2022.
- Salehkaleybar et al. (2020) Saber Salehkaleybar, AmirEmad Ghassami, Negar Kiyavash, and Kun Zhang. Learning linear non-Gaussian causal models in the presence of latent variables. J. Mach. Learn. Res., 21:Paper No. 39, 24, 2020.
- Schkoda and Drton (2023) Daniela Schkoda and Mathias Drton. Goodness-of-fit tests for linear non-Gaussian structural equation models. arXiv:2311.04585, 2023.
- Schölkopf and Smola (2018) Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. The MIT Press, 2018.
- Shi et al. (2022) Hongjian Shi, Mathias Drton, and Fang Han. On the power of Chatterjee’s rank correlation. Biometrika, 109(2):317–333, 2022.
- Shimizu (2022) Shōhei Shimizu. Statistical Causal Discovery: LiNGAM Approach. Springer, 2022.
- Shpitser (2023) Ilya Shpitser. When does the id algorithm fail? arXiv:2307.03750, 2023.
- Shpitser and Pearl (2006) Ilya Shpitser and Judea Pearl. Identification of joint interventional distributions in recursive semi-markovian causal models. In Proceedings of the 21st National Conference on Artificial Intelligence - Volume 2, AAAI’06, page 1219–1226. AAAI Press, 2006.
- Shuai et al. (2023) Kang Shuai, Shanshan Luo, Yue Zhang, Feng Xie, and Yangbo He. Identification and estimation of causal effects using non-gaussianity and auxiliary covariates. arXiv:2304.14895, 2023.
- Silva and Shimizu (2017) Ricardo Silva and Shohei Shimizu. Learning instrumental variables with structural and non-Gaussianity assumptions. J. Mach. Learn. Res., 18:Paper No. 120, 49, 2017.
- Spirtes et al. (2000) Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, prediction, and search. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA, second edition, 2000. With additional material by David Heckerman, Christopher Meek, Gregory F. Cooper and Thomas Richardson, A Bradford Book.
- Sullivant et al. (2010) Seth Sullivant, Kelli Talaska, and Jan Draisma. Trek separation for Gaussian graphical models. Ann. Statist., 38(3):1665–1685, 2010.
- Székely et al. (2007) Gábor J. Székely, Maria L. Rizzo, and Nail K. Bakirov. Measuring and testing dependence by correlation of distances. Ann. Statist., 35(6):2769–2794, 2007.
- Verma and Pearl (1990) Thomas Verma and Judea Pearl. Equivalence and synthesis of causal models. In Proceedings of the Sixth Annual Conference on Uncertainty in Artificial Intelligence, UAI ’90, page 255–270, USA, 1990. Elsevier Science Inc.
- Wang and Seigal (2024) Kexin Wang and Anna Seigal. Identifiability of overcomplete independent component analysis. arXiv:2401.14709, 2024.
- Wang and Drton (2017) Y. Samuel Wang and Mathias Drton. Empirical likelihood for linear structural equation models with dependent errors. Stat, 6:434–447, 2017.
- Wang and Drton (2023) Y. Samuel Wang and Mathias Drton. Causal discovery with unobserved confounding and non-Gaussian data. J. Mach. Learn. Res., 24:Paper No. 271, 61, 2023.
- Wright (1928) P.G. Wright. The Tariff on Animal and Vegetable Oils. Investigations in international commercial policies. Macmillan, 1928.
- Xie et al. (2022) Feng Xie, Yangbo He, Zhi Geng, Zhengming Chen, Ru Hou, and Kun Zhang. Testability of instrumental variables in linear non-Gaussian acyclic causal models. Entropy, 24(4):Paper No. 512, 19, 2022.
- Xie et al. (2023) Feng Xie, Biwei Huang, Zhengming Chen, Ruichu Cai, Clark Glymour, Zhi Geng, and Kun Zhang. Generalized independent noise condition for estimating causal structure with latent variables. arXiv:2308.06718, 2023.
Appendix A Notions of Non-Linear Algebra
In this section, we give the basic definitions of non-linear algebra that we will need for the proofs; we refer the interested reader to Cox et al. (2015); Michałek and Sturmfels (2021) for more details.
Definition A.1.
For every natural number , we denote the ring of polynomials in variables by . Let be a, possibly infinite, subset of . The affine variety associated to is defined as . The vanishing ideal associated to a variety is , and the coordinate ring of is defined as .
Lemma A.1.
Lemma A.2.
(Sullivant et al., 2010, Lem. 3.3; Foygel et al., 2012, Supplement, Lem. 1) Let be any directed graph, and let be two subsets of of the same size. Then for every we have:
where denotes the sign of the permutation. In particular, implies . The reverse implication holds for a generic choice of , i.e., for any outside a Lebesgue measure 0 subset of .
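The link between minors and non-intersecting path systems can be checked numerically on a small example. The sketch below assumes that the lemma concerns minors of (I − Λ)^{-1}, whose (i, j) entry sums the weights of all directed paths from i to j, with Λ[i, j] the weight of the edge i → j; the graph, the edge weights, and the helper name `path_minor` are hypothetical choices for illustration.

```python
import numpy as np


def path_minor(n, weights, S, T):
    """Determinant of the submatrix of (I - Lam)^{-1} with rows S and columns T (0-based)."""
    Lam = np.zeros((n, n))
    for (i, j), w in weights.items():
        Lam[i, j] = w  # Lam[i, j] is the weight of the edge i -> j
    M = np.linalg.inv(np.eye(n) - Lam)
    return np.linalg.det(M[np.ix_(S, T)])


# Edges 0 -> 2, 1 -> 2, 2 -> 3, 2 -> 4: every path from {0, 1} to {3, 4} uses node 2,
# so no system of two non-intersecting paths exists and the minor vanishes.
w = {(0, 2): 0.7, (1, 2): -1.3, (2, 3): 0.4, (2, 4): 1.9}
bottleneck = path_minor(5, w, [0, 1], [3, 4])

# Adding the edge 1 -> 4 creates the non-intersecting system {0 -> 2 -> 3, 1 -> 4},
# and the minor equals the product of its edge weights, 0.7 * 0.4 * 0.6.
w2 = dict(w)
w2[(1, 4)] = 0.6
open_minor = path_minor(5, w2, [0, 1], [3, 4])

print(bottleneck, open_minor)
```

In the second graph the intersecting path systems cancel in pairs, so only the single non-intersecting system contributes to the determinant. For generic weights, a zero minor therefore signals the absence of any non-intersecting system, which is the direction of the lemma that requires genericity.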
Appendix B Proofs
B.1 Proofs for Section 4
Proof of Lemma 4.1.
First, notice that for all practical purposes, we can consider the edge capacity to be instead of ; this implies that we can exploit additional properties of maximum flow problems with integer values.
We are going to show that to every flow in of size with integer values, we can associate and a system of paths such that , and vice versa. That is, for every pair and a system of paths such that , we can associate an integer flow of size .
Let us first consider a flow with integer value in . Since the capacity of each node, that is not a sink or a source, is , we can restrict the image of to be . Define . Since the size of the flow is , we have . For every consider the path such that and . This is well defined since for every there is at most one other such that , and if one assumes that there is an such that then the existence of is guaranteed from the first equality in Eq. 4.1. Let . By contradiction, we prove that . Suppose two paths in intersect at a node . This implies that there are such that , hence . We obtain a violation of Eq. 4.1.
For the other implication, consider and a system of paths such that . We define as follows:
(B.1) |
We need to show that satisfies Eq. 4.1. Since the capacity of each edge is infinity, we only need to check the first inequality; this holds because is a non-intersecting system of paths, and so each node has at most one incoming and one outgoing edge for which the flow is different from 0. By directly plugging Eq. B.1 into Eq. 4.2, it is straightforward to show that .
To conclude the proof, we need to show that there is a solution to with integer values. This is ensured by applying Cormen et al. (2009, Thm. 26.10) and the fact that all the capacities in are integers. ∎
Proof of Theorem 4.2.
Chen et al. (2022) proved that the complexity of any maximum flow problem is almost linear in the number of edges in the graph . For every node and to certify the identifiability of , one needs to solve the maximum flow problems and and then check whether the difference of the sizes of the corresponding maximum flows is . Since both and have at most nodes, the overall complexity is . ∎
Proof of Theorem 4.3.
To certify the identifiability of all the directed edges, i.e., the whole matrix, one needs to solve the maximum flow problem for every in and check whether the maximum flow has the size . This adds a multiplicative factor to the result of Theorem 4.2 which leads to . ∎
B.2 Proofs for Section 5.2
In the sequel, we will consider as a variety in the space given by the Cartesian product of the symmetric tensor spaces , which is isomorphic to . We denote the corresponding coordinate ring as . For every -tuple , we denote by the projection of on the coordinates not corresponding to the entry .
Proof of Lemma 5.4.
The fact that is well defined is a consequence of Comon and Jutten (2010, Prop. 3.1).
There is a one-to-one linear transformation between cumulants and moments; see, e.g., McCullagh (1987, §2.3); hence, it is enough to prove the result for the corresponding set of moments. It is known that the set of symmetric tensors that can be generated as a moment of a distribution is a full-dimensional convex cone in the space of symmetric tensors; see, e.g., di Dio and Schmüdgen (2022, Lem. 3.3). Hence, the same result holds for the set of cumulants; is the projection of this convex cone along the coordinate axes corresponding to connected subsets of , so is itself a full-dimensional convex cone in . ∎
Lemma B.1.
Let , and , then
Proof of Theorem 5.5.
From Lemma 5.4, we know that . Hence, it is enough to show that lies in a subvariety of of strictly smaller dimension; see, e.g., Okamoto (1973, Lemma).
Notice that we can write
where
while are defined in a similar way. Hence, it is enough to prove that both and are Lebesgue measure 0 subsets of for high enough.
We start with by bounding the dimension of . For every , every , and every we can use Lemma B.1 to write
where is a non-zero polynomial in ; note that for the first equality we used that and are independent. This implies that we can write
(B.2) |
We can define a rational map in the following way
see, e.g., Cox et al. (2015, §5) for the definition of rational map.
What Eq. B.2 shows is that
Let us consider such that , where is the projection of onto for every . Again, we have , which concludes the proof by noticing that
which is strictly smaller than if .
In order to prove the result for , we first notice that we can always write
Since we have already bounded the dimension of , to conclude the proof we only need to bound the dimension of
For every , and any , and every we can use Lemma B.1 to write
where we used that for every , which is a consequence of , to simplify the formula given in Lemma B.1. This allows us to write
The rest of the proof follows verbatim the case of . ∎
B.3 Proofs for Section 7
Lemma B.2.
Let be a mixed graph. Assume the vertex set can be partitioned as , with being a -cycle, and , where denotes the union of disjoint sets. Then, is generically identifiable if and only if is identifiable for every and , and the graphical criterion in Theorem 7.2 is satisfied.
Proof.
If the matrix is identifiable, then by definition, all of its columns are also identifiable, and from Theorem 7.2, we know that the graphical condition is satisfied. We now prove that the reverse implication is also true.
By plugging in instead of in Eq. 3.2, one can see that the matrix has the following shape
In particular, we have for every . The same proof as in Lemma 3.2 applies. ∎
Theorem B.3 (Theorem 7.4).
Let be a directed graph such that , with being a -cycle, and . Then is generically identifiable if and only if for every cycle of size , we have for .
Proof of Theorem 7.4.
We know from Lemma 7.3 that if then is identifiable. If the set is empty then we know from Lemma B.2 and the fact that that is identifiable. Otherwise, let and .
We know from Example 7.2 that we can choose . If , letting for , the matrix of Eq. 5.4 will have the following shape
which satisfies all the constraints imposed by 1, proving that is not identifiable in this case.
If , we know that is identifiable for every , hence the matrix will be as follows
This implies that, in order for the matrix to satisfy the conditions in 1 for every pair of nodes, we must have . This can happen if and only if , where . Writing explicitly, we arrive at the following linear system
(B.3) |
We know that the system
always has a solution given by . Hence, the system in Eq. B.3 has a solution if and only if the system
(B.4) |
has one. Using and , we can write
This implies that the system in Eq. B.4 has solutions for a generic choice of if and only if the row space of contains the row space of . That is, if
From Lemma A.2, one can see that this is possible if and only if the graphical condition of the theorem is satisfied. ∎
B.4 Proofs for Section 8.2
Proof of Lemma 8.1.
Let us denote the value of the objective function in the optimization problem of Eq. 8.1 for a matrix by . By definition of the map we have , which implies . Hence, minimizes Eq. 8.1 if and only if , that is, if and only if
and we know from Lemma 3.2 that this is the case if and only if satisfies Eq. 3.3. ∎
Appendix C Details for Experiments
C.1 Data Generation
Identification.
For fixed and , the ADMGs for the experiments in Section 8.1 are generated as follows:
1. sample a random integer in ,
2. let be a randomly generated DAG with nodes and edges,
3. let be a randomly generated undirected graph with nodes and edges,
4. define as .
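The four steps above can be sketched as follows. This is an illustration under stated assumptions, not the generator used for the experiments: the function names are hypothetical, the numbers of directed and bidirected edges are passed in by the caller because the range of the random integer in step 1 is not reproduced here, and "randomly generated" is read as drawing node pairs uniformly without replacement.

```python
import random

def random_admg(p, n_directed, n_bidirected, seed=None):
    """Return (directed, bidirected) edge sets on nodes 0..p-1.

    Assumption: directed edges follow a random topological order, which
    guarantees acyclicity; bidirected edges are unordered node pairs.
    """
    rng = random.Random(seed)
    order = list(range(p))
    rng.shuffle(order)  # random topological order for the DAG part
    pairs = [(i, j) for i in range(p) for j in range(i + 1, p)]
    directed = {(order[i], order[j]) for i, j in rng.sample(pairs, n_directed)}
    bidirected = {frozenset(e) for e in rng.sample(pairs, n_bidirected)}
    return directed, bidirected

def is_acyclic(p, directed):
    """Kahn-style check that the directed part contains no cycle."""
    indegree = [0] * p
    for _, v in directed:
        indegree[v] += 1
    stack = [v for v in range(p) if indegree[v] == 0]
    removed = 0
    while stack:
        u = stack.pop()
        removed += 1
        for a, b in directed:
            if a == u:
                indegree[b] -= 1
                if indegree[b] == 0:
                    stack.append(b)
    return removed == p
```

The pair of edge sets returned by `random_admg` plays the role of the mixed graph in step 4: the directed part is a DAG by construction, and the bidirected part is drawn independently of it.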
Estimation.
The data for the experiments in Section 8.2 are generated as follows:
1. for every we sample from a Laplace distribution with mean zero and standard deviation ,
2. for every we sample two independent random vectors , again with standard deviations ,
3. for every , we have , where ,
4. for every , , and .
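Step 1 specifies the Laplace distribution by its standard deviation rather than by its scale parameter. A minimal sketch of that conversion is given below; the function name and the seed argument are assumptions for illustration, and the construction of the confounding terms in steps 2 to 4 is not reproduced. The sketch uses two standard facts: a Laplace distribution with scale b has variance 2b², and the difference of two independent exponential variables with mean b is Laplace distributed with mean zero and scale b.

```python
import math
import random

def sample_laplace(n, std, seed=None):
    """Draw n zero-mean Laplace samples with the given standard deviation."""
    rng = random.Random(seed)
    scale = std / math.sqrt(2.0)  # variance of Laplace(0, b) is 2 * b**2
    rate = 1.0 / scale            # expovariate takes the rate, i.e. 1 / mean
    return [rng.expovariate(rate) - rng.expovariate(rate) for _ in range(n)]
```

With a large sample, the empirical mean should be close to zero and the empirical standard deviation close to `std`.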
C.2 Additional Experiments.
Fig. 13 shows the performance of our methods when using the RBF kernel, with bandwidth computed using the median heuristic. We see that compared to the polynomial kernel, this choice seems to suffer more from the non-convexity of the objective function. In contrast, it provides a better estimate when initialized at the true parameter value.
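For readers unfamiliar with the median heuristic, a minimal sketch follows. It assumes one common convention, namely that the bandwidth is the median of the pairwise Euclidean distances between sample points and that the kernel is exp(-||x - y||² / (2 · bandwidth²)); other variants exist, and the exact variant used for Fig. 13 is not specified here. The function names are hypothetical.

```python
import math
import statistics

def median_heuristic_bandwidth(points):
    """Median of pairwise Euclidean distances (one common convention)."""
    dists = [
        math.dist(a, b)
        for i, a in enumerate(points)
        for b in points[i + 1:]
    ]
    return statistics.median(dists)

def rbf_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel value for two points given as coordinate tuples."""
    return math.exp(-math.dist(x, y) ** 2 / (2.0 * bandwidth ** 2))
```

As a usage example, for the points `(0,)`, `(1,)`, and `(3,)` the pairwise distances are 1, 3, and 2, so the bandwidth under this convention is 2.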
The results for the same data generating process, but using a uniform distribution for the error terms, are shown in Fig. 14.