On seeded subgraph-to-subgraph matching: The ssSGM Algorithm and matchability information theory

Authors: Lingyao Meng, Mengqi Lou, Jianyu Lin, Vince Lyzinski, Donniell E. Fishkind

Abstract: The subgraph-subgraph matching problem is, given a pair of graphs and a positive integer $K$, to find $K$ vertices in the first graph, $K$ vertices in the second graph, and a bijection between them, so as to minimize the number of adjacency disagreements across the bijection; it is ``seeded'' if some of this bijection is fixed. The problem is intractable, and we present the ssSGM algorithm, which uses Frank-Wolfe methodology to efficiently find an approximate solution. Then, in the context of a generalized correlated random Bernoulli graph model, in which the pair of graphs naturally have a core of $K$ matched pairs of vertices, we provide and prove mild conditions for the subgraph-subgraph matching problem solution to almost always be the correct $K$ matched pairs of vertices.

Submitted 6 June, 2023; originally announced June 2023.

Comments: 27 pages, 16 figures

MSC Class: 05C60; 05C80; 90C35

arXiv:2103.00624 [pdf, other]

doi 10.1007/s41109-021-00398-z

The Phantom Alignment Strength Conjecture: Practical use of graph matching alignment strength to indicate a meaningful graph match

Authors: Donniell E. Fishkind, Felix Parker, Hamilton Sawczuk, Lingyao Meng, Eric Bridgeford, Avanti Athreya, Carey E. Priebe, Vince Lyzinski

Abstract: The alignment strength of a graph matching is a quantity that gives the practitioner a measure of the correlation of the two graphs, and it can also give the practitioner a sense for whether the graph matching algorithm found the true matching. Unfortunately, when a graph matching algorithm fails to find the truth because of weak signal, there may be "phantom alignment strength" from meaningless matchings that, by random noise, have fewer disagreements than average (sometimes substantially fewer); this alignment strength may give the misleading appearance of significance. A practitioner needs to know what level of alignment strength may be phantom alignment strength and what level indicates that the graph matching algorithm obtained the true matching and is a meaningful measure of the graph correlation. The {\it Phantom Alignment Strength Conjecture} introduced here provides a principled and practical means to approach this issue. We provide empirical evidence for the conjecture, and explore its consequences.

Submitted 23 August, 2021; v1 submitted 28 February, 2021; originally announced March 2021.

arXiv:2002.09976 [pdf, ps, other]

On a complete and sufficient statistic for the correlated Bernoulli random graph model

Authors: Donniell E. Fishkind, Avanti Athreya, Lingyao Meng, Vince Lyzinski, Carey E. Priebe

Abstract: Inference on vertex-aligned graphs is of wide theoretical and practical importance. There are, however, few flexible and tractable statistical models for correlated graphs, and even fewer comprehensive approaches to parametric inference on data arising from such graphs. In this paper, we consider the correlated Bernoulli random graph model (allowing different Bernoulli coefficients and edge correlations for different pairs of vertices), and we introduce a new variance-reducing technique -- called \emph{balancing} -- that can refine estimators for model parameters. Specifically, we construct a disagreement statistic and show that it is complete and sufficient; balancing can be interpreted as Rao-Blackwellization with this disagreement statistic. We show that for unbiased estimators of functions of model parameters, balancing generates uniformly minimum variance unbiased estimators (UMVUEs). However, even when unbiased estimators for model parameters do {\em not} exist -- which, as we prove, is the case with both the heterogeneity correlation and the total correlation parameters -- balancing is still useful, and lowers mean squared error. In particular, we demonstrate how balancing can improve the efficiency of the alignment strength estimator for the total correlation, a parameter that plays a critical role in graph matchability and graph matching runtime complexity.

Submitted 30 March, 2021; v1 submitted 23 February, 2020; originally announced February 2020.

arXiv:1808.08502 [pdf, other]

Alignment Strength and Correlation for Graphs

Authors: Donniell E. Fishkind, Lingyao Meng, Ao Sun, Carey E. Priebe, Vince Lyzinski

Abstract: When two graphs have a correlated Bernoulli distribution, we prove that the alignment strength of their natural bijection strongly converges to a novel measure of graph correlation $ρ_T$ that neatly combines intergraph with intragraph distribution parameters. Within broad families of the random graph parameter settings, we illustrate that exact graph matching runtime and also matchability are both functions of $ρ_T$, with thresholding behavior starkly illustrated in matchability.

Submitted 17 January, 2020; v1 submitted 25 August, 2018; originally announced August 2018.

MSC Class: 05C80; 05C60; 90C35

arXiv:1709.05454 [pdf, other]

Statistical inference on random dot product graphs: a survey

Authors: Avanti Athreya, Donniell E. Fishkind, Keith Levin, Vince Lyzinski, Youngser Park, Yichen Qin, Daniel L. Sussman, Minh Tang, Joshua T. Vogelstein, Carey E. Priebe

Abstract: The random dot product graph (RDPG) is an independent-edge random graph that is analytically tractable and, simultaneously, either encompasses or can successfully approximate a wide range of random graphs, from relatively simple stochastic block models to complex latent position graphs. In this survey paper, we describe a comprehensive paradigm for statistical inference on random dot product graphs, a paradigm centered on spectral embeddings of adjacency and Laplacian matrices. We examine the analogues, in graph inference, of several canonical tenets of classical Euclidean inference: in particular, we summarize a body of existing results on the consistency and asymptotic normality of the adjacency and Laplacian spectral embeddings, and the role these spectral embeddings can play in the construction of single- and multi-sample hypothesis tests for graph data. We investigate several real-world applications, including community detection and classification in large social networks and the determination of functional and biologically relevant network properties from an exploratory data analysis of the Drosophila connectome. We outline requisite background and current open problems in spectral graph inference.

Submitted 16 September, 2017; originally announced September 2017.

Comments: An expository survey paper on a comprehensive paradigm for inference for random dot product graphs, centered on graph adjacency and Laplacian spectral embeddings. Paper outlines requisite background; summarizes theory, methodology, and applications from previous and ongoing work; and closes with a discussion of several open problems

MSC Class: 62FXX; 62GXX; 62HXX; 05CXX

Journal ref: Journal of Machine Learning Research, 2018

arXiv:1607.01369 [pdf, other]

On the Consistency of the Likelihood Maximization Vertex Nomination Scheme: Bridging the Gap Between Maximum Likelihood Estimation and Graph Matching

Authors: Vince Lyzinski, Keith Levin, Donniell E. Fishkind, Carey E. Priebe

Abstract: Given a graph in which a few vertices are deemed interesting a priori, the vertex nomination task is to order the remaining vertices into a nomination list such that there is a concentration of interesting vertices at the top of the list. Previous work has yielded several approaches to this problem, with theoretical results in the setting where the graph is drawn from a stochastic block model (SBM), including a vertex nomination analogue of the Bayes optimal classifier. In this paper, we prove that maximum likelihood (ML)-based vertex nomination is consistent, in the sense that the performance of the ML-based scheme asymptotically matches that of the Bayes optimal scheme. We prove theorems of this form both when model parameters are known and unknown. Additionally, we introduce and prove consistency of a related, more scalable restricted-focus ML vertex nomination scheme. Finally, we incorporate vertex and edge features into ML-based vertex nomination and briefly explore the empirical effectiveness of this approach.

Submitted 27 August, 2016; v1 submitted 5 July, 2016; originally announced July 2016.

arXiv:1312.2638 [pdf, ps, other]

doi 10.1214/15-AOAS834

Vertex nomination schemes for membership prediction

Authors: D. E. Fishkind, V. Lyzinski, H. Pao, L. Chen, C. E. Priebe

Abstract: Suppose that a graph is realized from a stochastic block model where one of the blocks is of interest, but many or all of the vertices' block labels are unobserved. The task is to order the vertices with unobserved block labels into a ``nomination list'' such that, with high probability, vertices from the interesting block are concentrated near the list's beginning. We propose several vertex nomination schemes. Our basic - but principled - setting and development yields a best nomination scheme (which is a Bayes-Optimal analogue), and also a likelihood maximization nomination scheme that is practical to implement when there are a thousand vertices, and which is empirically near-optimal when the number of vertices is small enough to allow comparison to the best nomination scheme. We then illustrate the robustness of the likelihood maximization nomination scheme to the modeling challenges inherent in real data, using examples which include a social network involving human trafficking, the Enron Graph, a worm brain connectome and a political blog network.

Submitted 17 November, 2015; v1 submitted 9 December, 2013; originally announced December 2013.

Comments: Published at http://dx.doi.org/10.1214/15-AOAS834 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOAS-AOAS834

Journal ref: Annals of Applied Statistics 2015, Vol. 9, No. 3, 1510-1532

arXiv:1310.1297 [pdf, other]

Spectral Clustering for Divide-and-Conquer Graph Matching

Authors: Vince Lyzinski, Daniel L. Sussman, Donniell E. Fishkind, Henry Pao, Li Chen, Joshua T. Vogelstein, Youngser Park, Carey E. Priebe

Abstract: We present a parallelized bijective graph matching algorithm that leverages seeds and is designed to match very large graphs. Our algorithm combines spectral graph embedding with existing state-of-the-art seeded graph matching procedures. We justify our approach by proving that modestly correlated, large stochastic block model random graphs are correctly matched utilizing very few seeds through our divide-and-conquer procedure. We also demonstrate the effectiveness of our approach in matching very large graphs in simulated and real data examples, showing up to a factor of 8 improvement in runtime with minimal sacrifice in accuracy.

Submitted 12 March, 2015; v1 submitted 4 October, 2013; originally announced October 2013.

Comments: 32 pages, 8 figures

arXiv:1304.7844 [pdf, other]

Seeded graph matching for correlated Erdős-Rényi graphs

Authors: Vince Lyzinski, Donniell E. Fishkind, Carey E. Priebe

Abstract: Graph matching is an important problem in machine learning and pattern recognition. Herein, we present theoretical and practical results on the consistency of graph matching for estimating a latent alignment function between the vertex sets of two graphs, as well as subsequent algorithmic implications when the latent alignment is partially observed. In the correlated Erdős-Rényi graph setting, we prove that graph matching provides a strongly consistent estimate of the latent alignment in the presence of even modest correlation. We then investigate a tractable, restricted-focus version of graph matching, which is only concerned with adjacency involving vertices in a partial observation of the latent alignment; we prove that a logarithmic number of vertices whose alignment is known is sufficient for this restricted-focus version of graph matching to yield a strongly consistent estimate of the latent alignment of the remaining vertices. We show how Frank-Wolfe methodology for approximate graph matching, when there is a partially observed latent alignment, inherently incorporates this restricted focus graph matching. Lastly, we illustrate the relationship between seeded graph matching and restricted-focus graph matching by means of an illuminating example from human connectomics.

Submitted 1 August, 2014; v1 submitted 29 April, 2013; originally announced April 2013.

Comments: 28 pages, 5 figures

arXiv:1301.1954 [pdf, other]

doi 10.1007/s00357-016-9203-9

On the Incommensurability Phenomenon

Authors: Donniell E. Fishkind, Cencheng Shen, Youngser Park, Carey E. Priebe

Abstract: Suppose that two large, multi-dimensional data sets are each noisy measurements of the same underlying random process, and principal components analysis is performed separately on the data sets to reduce their dimensionality. In some circumstances it may happen that the two lower-dimensional data sets have an inordinately large Procrustean fitting-error between them. The purpose of this manuscript is to quantify this "incommensurability phenomenon." In particular, under specified conditions, the square Procrustean fitting-error of the two normalized lower-dimensional data sets is (asymptotically) a convex combination (via a correlation parameter) of the Hausdorff distance between the projection subspaces and the maximum possible value of the square Procrustean fitting-error for normalized data. We show how this gives rise to the incommensurability phenomenon, and we employ illustrative simulations as well as a real data experiment to explore how the incommensurability phenomenon may have an appreciable impact.

Submitted 6 February, 2015; v1 submitted 9 January, 2013; originally announced January 2013.

Journal ref: Journal of Classification 33(2), 185-209, 2016

arXiv:1209.0367 [pdf, other]

Seeded Graph Matching

Authors: Donniell E. Fishkind, Sancar Adali, Heather G. Patsolic, Lingyao Meng, Digvijay Singh, Vince Lyzinski, Carey E. Priebe

Abstract: Given two graphs, the graph matching problem is to align the two vertex sets so as to minimize the number of adjacency disagreements between the two graphs. The seeded graph matching problem is the graph matching problem when we are first given a partial alignment that we are tasked with completing. In this paper, we modify the state-of-the-art approximate graph matching algorithm "FAQ" of Vogelstein et al. (2015) to make it a fast approximate seeded graph matching algorithm, adapt its applicability to include graphs with differently sized vertex sets, and extend the algorithm so as to provide, for each individual vertex, a nomination list of likely matches. We demonstrate the effectiveness of our algorithm via simulation and real data experiments; indeed, knowledge of even a few seeds can be extremely effective when our seeded graph matching algorithm is used to recover a naturally existing alignment that is only partially observed.

Submitted 10 April, 2018; v1 submitted 3 September, 2012; originally announced September 2012.

Comments: 24 pages, 10 figures

arXiv:1208.4125 [pdf, ps, other]

Counting Spanning Trees of Threshold Graphs

Authors: Stephen R. Chestnut, Donniell E. Fishkind

Abstract: Cayley's formula states that there are $n^{n-2}$ spanning trees in the complete graph on $n$ vertices; it has been proved in more than a dozen different ways over its 150-year history. The complete graphs are a special case of threshold graphs, and using Merris' Theorem and the Matrix Tree Theorem, there is a strikingly simple formula for counting the number of spanning trees in a threshold graph on $n$ vertices; it is simply the product, over $i=2,3,\ldots,n-1$, of the number of vertices of degree at least $i$. In this manuscript, we provide a direct combinatorial proof for this formula which does not use the Matrix Tree Theorem; the proof is an extension of Joyal's proof for Cayley's formula. Then we apply this methodology to give a formula for the number of spanning trees in any difference graph.

Submitted 8 January, 2013; v1 submitted 20 August, 2012; originally announced August 2012.

Comments: 14 pages, 5 figures

MSC Class: 05A19

arXiv:1205.0309 [pdf, other]

Consistent adjacency-spectral partitioning for the stochastic block model when the model parameters are unknown

Authors: Donniell E. Fishkind, Daniel L. Sussman, Minh Tang, Joshua T. Vogelstein, Carey E. Priebe

Abstract: For random graphs distributed according to a stochastic block model, we consider the inferential task of partitioning vertices into blocks using spectral techniques. Spectral partitioning using the normalized Laplacian and the adjacency matrix have both been shown to be consistent as the number of vertices tends to infinity. Importantly, both procedures require that the number of blocks and the rank of the communication probability matrix are known, even as the rest of the parameters may be unknown. In this article, we prove that the (suitably modified) adjacency-spectral partitioning procedure, requiring only an upper bound on the rank of the communication probability matrix, is consistent. Indeed, this result demonstrates a robustness to model mis-specification; an overestimate of the rank may impose a moderate performance penalty, but the procedure is still consistent. Furthermore, we extend this procedure to the setting where adjacencies may have multiple modalities and we allow for either directed or undirected graphs.

Submitted 21 August, 2012; v1 submitted 1 May, 2012; originally announced May 2012.

Comments: 26 pages, 2 figures

arXiv:1112.5507 [pdf, other]

Fast Approximate Quadratic Programming for Large (Brain) Graph Matching

Authors: Joshua T. Vogelstein, John M. Conroy, Vince Lyzinski, Louis J. Podrazik, Steven G. Kratzer, Eric T. Harley, Donniell E. Fishkind, R. Jacob Vogelstein, Carey E. Priebe

Abstract: Quadratic assignment problems (QAPs) arise in a wide variety of domains, ranging from operations research to graph theory to computer vision to neuroscience. In the age of big data, graph-valued data is becoming more prominent, and with it, a desire to run algorithms on ever larger graphs. Because QAP is NP-hard, exact algorithms are intractable. Approximate algorithms necessarily employ an accuracy/efficiency trade-off. We developed a fast approximate quadratic assignment algorithm (FAQ). FAQ finds a local optimum in (worst case) time cubic in the number of vertices, similar to other approximate QAP algorithms. We demonstrate empirically that our algorithm is faster and achieves a lower objective value on over 80% of the suite of QAP benchmarks, compared with the previous state-of-the-art. Applying the algorithms to our motivating example, matching C. elegans connectomes (brain-graphs), we find that FAQ achieves the optimal performance in record time, whereas none of the others even find the optimum.

Submitted 13 September, 2014; v1 submitted 22 December, 2011; originally announced December 2011.

Comments: 17 pages, 5 figures, 2 tables

arXiv:1108.2228 [pdf, other]

A consistent adjacency spectral embedding for stochastic blockmodel graphs

Authors: Daniel L. Sussman, Minh Tang, Donniell E. Fishkind, Carey E. Priebe

Abstract: We present a method to estimate block membership of nodes in a random graph generated by a stochastic blockmodel. We use an embedding procedure motivated by the random dot product graph model, a particular example of the latent position model. The embedding associates each node with a vector; these vectors are clustered via minimization of a square error criterion. We prove that this method is consistent for assigning nodes to blocks, as only a negligible number of nodes will be mis-assigned. We prove consistency of the method for directed and undirected graphs. The consistent block assignment makes possible consistent parameter estimation for a stochastic blockmodel. We extend the result in the setting where the number of blocks grows slowly with the number of nodes. Our method is also computationally feasible even for very large graphs. We compare our method to Laplacian spectral clustering through analysis of simulated data and a graph derived from Wikipedia documents.

Submitted 27 April, 2012; v1 submitted 10 August, 2011; originally announced August 2011.

Comments: 21 pages

Showing 1–15 of 15 results for author: Fishkind, D E