-
Graphical Modelling without Independence Assumptions for Uncentered Data
Authors:
Bailey Andrew,
David R. Westhead,
Luisa Cutillo
Abstract:
The independence assumption is a useful tool to increase the tractability of one's modelling framework. However, this assumption does not match reality; failing to take dependencies into account can cause models to fail dramatically. The field of multi-axis graphical modelling (also called multi-way modelling, Kronecker-separable modelling) has seen growth over the past decade, but these models re…
▽ More
The independence assumption is a useful tool to increase the tractability of one's modelling framework. However, this assumption does not match reality; failing to take dependencies into account can cause models to fail dramatically. The field of multi-axis graphical modelling (also called multi-way modelling, Kronecker-separable modelling) has seen growth over the past decade, but these models require that the data have zero mean. In the multi-axis case, inference is typically done in the single sample scenario, making mean inference impossible.
In this paper, we demonstrate how the zero-mean assumption can cause egregious modelling errors, as well as propose a relaxation to the zero-mean assumption that allows the avoidance of such errors. Specifically, we propose the "Kronecker-sum-structured mean" assumption, which leads to models with nonconvex-but-unimodal log-likelihoods that can be solved efficiently with coordinate descent.
△ Less
Submitted 5 August, 2024;
originally announced August 2024.
-
Making Multi-Axis Gaussian Graphical Models Scalable to Millions of Samples and Features
Authors:
Bailey Andrew,
David R. Westhead,
Luisa Cutillo
Abstract:
Gaussian graphical models can be used to extract conditional dependencies between the features of the dataset. This is often done by making an independence assumption about the samples, but this assumption is rarely satisfied in reality. However, state-of-the-art approaches that avoid this assumption are not scalable, with $O(n^3)$ runtime and $O(n^2)$ space complexity. In this paper, we introduce…
▽ More
Gaussian graphical models can be used to extract conditional dependencies between the features of the dataset. This is often done by making an independence assumption about the samples, but this assumption is rarely satisfied in reality. However, state-of-the-art approaches that avoid this assumption are not scalable, with $O(n^3)$ runtime and $O(n^2)$ space complexity. In this paper, we introduce a method that has $O(n^2)$ runtime and $O(n)$ space complexity, without assuming independence.
We validate our model on both synthetic and real-world datasets, showing that our method's accuracy is comparable to that of prior work We demonstrate that our approach can be used on unprecedentedly large datasets, such as a real-world 1,000,000-cell scRNA-seq dataset; this was impossible with previous approaches. Our method maintains the flexibility of prior work, such as the ability to handle multi-modal tensor-variate datasets and the ability to work with data of arbitrary marginal distributions. An additional advantage of our method is that, unlike prior work, our hyperparameters are easily interpretable.
△ Less
Submitted 29 July, 2024;
originally announced July 2024.
-
Detecting Throat Cancer from Speech Signals using Machine Learning: A Scoping Literature Review
Authors:
Mary Paterson,
James Moor,
Luisa Cutillo
Abstract:
Introduction: Cases of throat cancer are rising worldwide. With survival decreasing significantly at later stages, early detection is vital. Artificial intelligence (AI) and machine learning (ML) have the potential to detect throat cancer from patient speech, facilitating earlier diagnosis and reducing the burden on overstretched healthcare systems. However, no comprehensive review has explored th…
▽ More
Introduction: Cases of throat cancer are rising worldwide. With survival decreasing significantly at later stages, early detection is vital. Artificial intelligence (AI) and machine learning (ML) have the potential to detect throat cancer from patient speech, facilitating earlier diagnosis and reducing the burden on overstretched healthcare systems. However, no comprehensive review has explored the use of AI and ML for detecting throat cancer from speech. This review aims to fill this gap by evaluating how these technologies perform and identifying issues that need to be addressed in future research. Materials and Methods: We conducted a scoping literature review across three databases: Scopus,Web of Science, and PubMed. We included articles that classified speech using machine learning and specified the inclusion of throat cancer patients in their data. Articles were categorized based on whether they performed binary or multi-class classification. Results: We found 27 articles fitting our inclusion criteria, 12 performing binary classification, 13 performing multi-class classification, and two that do both binary and multiclass classification. The most common classification method used was neural networks, and the most frequently extracted feature was mel-spectrograms. We also documented pre-processing methods and classifier performance. We compared each article against the TRIPOD-AI checklist, which showed a significant lack of open science, with only one article sharing code and only three using open-access data. Conclusion: Open-source code is essential for external validation and further development in this field. Our review indicates that no single method or specific feature consistently outperforms others in detecting throat cancer from speech. Future research should focus on standardizing methodologies and improving the reproducibility of results.
△ Less
Submitted 24 July, 2024; v1 submitted 18 July, 2023;
originally announced July 2023.
-
GmGM: a Fast Multi-Axis Gaussian Graphical Model
Authors:
Bailey Andrew,
David Westhead,
Luisa Cutillo
Abstract:
This paper introduces the Gaussian multi-Graphical Model, a model to construct sparse graph representations of matrix- and tensor-variate data. We generalize prior work in this area by simultaneously learning this representation across several tensors that share axes, which is necessary to allow the analysis of multimodal datasets such as those encountered in multi-omics. Our algorithm uses only a…
▽ More
This paper introduces the Gaussian multi-Graphical Model, a model to construct sparse graph representations of matrix- and tensor-variate data. We generalize prior work in this area by simultaneously learning this representation across several tensors that share axes, which is necessary to allow the analysis of multimodal datasets such as those encountered in multi-omics. Our algorithm uses only a single eigendecomposition per axis, achieving an order of magnitude speedup over prior work in the ungeneralized case. This allows the use of our methodology on large multi-modal datasets such as single-cell multi-omics data, which was challenging with previous approaches. We validate our model on synthetic data and five real-world datasets.
△ Less
Submitted 27 February, 2024; v1 submitted 5 November, 2022;
originally announced November 2022.
-
Scalable Bigraphical Lasso: Two-way Sparse Network Inference for Count Data
Authors:
Sijia Li,
Martín López-García,
Neil D. Lawrence,
Luisa Cutillo
Abstract:
Classically, statistical datasets have a larger number of data points than features ($n > p$). The standard model of classical statistics caters for the case where data points are considered conditionally independent given the parameters. However, for $n\approx p$ or $p > n$ such models are poorly determined. Kalaitzis et al. (2013) introduced the Bigraphical Lasso, an estimator for sparse precisi…
▽ More
Classically, statistical datasets have a larger number of data points than features ($n > p$). The standard model of classical statistics caters for the case where data points are considered conditionally independent given the parameters. However, for $n\approx p$ or $p > n$ such models are poorly determined. Kalaitzis et al. (2013) introduced the Bigraphical Lasso, an estimator for sparse precision matrices based on the Cartesian product of graphs. Unfortunately, the original Bigraphical Lasso algorithm is not applicable in case of large p and n due to memory requirements. We exploit eigenvalue decomposition of the Cartesian product graph to present a more efficient version of the algorithm which reduces memory requirements from $O(n^2p^2)$ to $O(n^2 + p^2)$. Many datasets in different application fields, such as biology, medicine and social science, come with count data, for which Gaussian based models are not applicable. Our multi-way network inference approach can be used for discrete data. Our methodology accounts for the dependencies across both instances and features, reduces the computational complexity for high dimensional data and enables to deal with both discrete and continuous data. Numerical studies on both synthetic and real datasets are presented to showcase the performance of our method.
△ Less
Submitted 15 March, 2022;
originally announced March 2022.
-
Extending Bootstrap AMG for Clustering of Attributed Graphs
Authors:
Pasqua D'Ambra,
Panayot S. Vassilevski,
Luisa Cutillo
Abstract:
In this paper we propose a new approach to detect clusters in undirected graphs with attributed vertices. We incorporate structural and attribute similarities between the vertices in an augmented graph by creating additional vertices and edges as proposed in [1, 2]. The augmented graph is then embedded in a Euclidean space associated to its Laplacian and we cluster vertices via a modified K-means…
▽ More
In this paper we propose a new approach to detect clusters in undirected graphs with attributed vertices. We incorporate structural and attribute similarities between the vertices in an augmented graph by creating additional vertices and edges as proposed in [1, 2]. The augmented graph is then embedded in a Euclidean space associated to its Laplacian and we cluster vertices via a modified K-means algorithm, using a new vector-valued distance in the embedding space. Main novelty of our method, which can be classified as an early fusion method, i.e., a method in which additional information on vertices are fused to the structure information before applying clustering, is the interpretation of attributes as new realizations of graph vertices, which can be dealt with as coordinate vectors in a related Euclidean space. This allows us to extend a scalable generalized spectral clustering procedure which substitutes graph Laplacian eigenvectors with some vectors, named algebraically smooth vectors, obtained by a linear-time complexity Algebraic MultiGrid (AMG) method. We discuss the performance of our proposed clustering method by comparison with recent literature approaches and public available results. Extensive experiments on different types of synthetic datasets and real-world attributed graphs show that our new algorithm, embedding attributes information in the clustering, outperforms structure-only-based methods, when the attributed network has an ambiguous structure. Furthermore, our new method largely outperforms the method which originally proposed the graph augmentation, showing that our embedding strategy and vector-valued distance are very effective in taking advantages from the augmented-graph representation.
△ Less
Submitted 5 February, 2023; v1 submitted 20 September, 2021;
originally announced September 2021.
-
ROBustness In Network (robin): an R package for Comparison and Validation of communities
Authors:
Valeria Policastro,
Dario Righelli,
Annamaria Carissimo,
Luisa Cutillo,
Italia De Feis
Abstract:
In network analysis, many community detection algorithms have been developed, however, their implementation leaves unaddressed the question of the statistical validation of the results. Here we present robin(ROBustness In Network), an R package to assess the robustness of the community structure of a network found by one or more methods to give indications about their reliability. The procedure in…
▽ More
In network analysis, many community detection algorithms have been developed, however, their implementation leaves unaddressed the question of the statistical validation of the results. Here we present robin(ROBustness In Network), an R package to assess the robustness of the community structure of a network found by one or more methods to give indications about their reliability. The procedure initially detects if the community structure found by a set of algorithms is statistically significant and then compares two selected detection algorithms on the same graph to choose the one that better fits the network of interest. We demonstrate the use of our package on the American College Football benchmark dataset.
△ Less
Submitted 5 February, 2021;
originally announced February 2021.
-
On community structure validation in real networks
Authors:
Mirko Signorelli,
Luisa Cutillo
Abstract:
Community structure is a commonly observed feature of real networks. The term refers to the presence in a network of groups of nodes (communities) that feature high internal connectivity, but are poorly connected between each other. Whereas the issue of community detection has been addressed in several works, the problem of validating a partition of nodes as a good community structure for a real n…
▽ More
Community structure is a commonly observed feature of real networks. The term refers to the presence in a network of groups of nodes (communities) that feature high internal connectivity, but are poorly connected between each other. Whereas the issue of community detection has been addressed in several works, the problem of validating a partition of nodes as a good community structure for a real network has received considerably less attention and remains an open issue. We propose a set of indices for community structure validation of network partitions that are based on an hypothesis testing procedure that assesses the distribution of links between and within communities. Using both simulations and real data, we illustrate how the proposed indices can be employed to compare the adequacy of different partitions of nodes as community structures in a given network, to assess whether two networks share the same or similar community structures, and to evaluate the performance of different network clustering algorithms.
△ Less
Submitted 6 October, 2021; v1 submitted 18 October, 2017;
originally announced October 2017.
-
Validation of community robustness
Authors:
Annamaria Carissimo,
Luisa Cutillo,
Italia Defeis
Abstract:
The large amount of work on community detection and its applications leaves unaddressed one important question: the statistical validation of the results. In this paper we present a methodology able to clearly detect if the community structure found by some algorithms is statistically significant or is a result of chance, merely due to edge positions in the network. Given a community detection met…
▽ More
The large amount of work on community detection and its applications leaves unaddressed one important question: the statistical validation of the results. In this paper we present a methodology able to clearly detect if the community structure found by some algorithms is statistically significant or is a result of chance, merely due to edge positions in the network. Given a community detection method and a network of interest, our proposal examines the stability of the partition recovered against random perturbations of the original graph structure. To address this issue, we specify a perturbation strategy and a null model to build a set of procedures based on a special measure of clustering distance, namely Variation of Information, using tools set up for functional data analysis. The procedures determine whether the obtained clustering departs significantly from the null model. This strongly supports the robustness against perturbation of the algorithm used to identify the community structure. We show the results obtained with the proposed technique on simulated and real datasets.
△ Less
Submitted 17 October, 2016;
originally announced October 2016.