Search | arXiv e-print repository

Learning Physics for Unveiling Hidden Earthquake Ground Motions via Conditional Generative Modeling

Authors: Pu Ren, Rie Nakata, Maxime Lacour, Ilan Naiman, Nori Nakata, Jialin Song, Zhengfa Bi, Osman Asif Malik, Dmitriy Morozov, Omri Azencot, N. Benjamin Erichson, Michael W. Mahoney

Abstract: Predicting high-fidelity ground motions for future earthquakes is crucial for seismic hazard assessment and infrastructure resilience. Conventional empirical simulations suffer from sparse sensor distribution and geographically localized earthquake locations, while physics-based methods are computationally intensive and require accurate representations of Earth structures and earthquake sources. We propose a novel artificial intelligence (AI) simulator, Conditional Generative Modeling for Ground Motion (CGM-GM), to synthesize high-frequency and spatially continuous earthquake ground motion waveforms. CGM-GM leverages earthquake magnitudes and geographic coordinates of earthquakes and sensors as inputs, learning complex wave physics and Earth heterogeneities, without explicit physics constraints. This is achieved through a probabilistic autoencoder that captures latent distributions in the time-frequency domain and variational sequential models for prior and posterior distributions. We evaluate the performance of CGM-GM using small-magnitude earthquake records from the San Francisco Bay Area, a region with high seismic risks. CGM-GM demonstrates a strong potential for outperforming a state-of-the-art non-ergodic empirical ground motion model and shows great promise in seismology and beyond.

Submitted 21 July, 2024; originally announced July 2024.

arXiv:2404.03591 [pdf, other]

Wilkins: HPC In Situ Workflows Made Easy

Authors: Orcun Yildiz, Dmitriy Morozov, Arnur Nigmetov, Bogdan Nicolae, Tom Peterka

Abstract: In situ approaches can accelerate the pace of scientific discoveries by allowing scientists to perform data analysis at simulation time. Current in situ workflow systems, however, face challenges in handling the growing complexity and diverse computational requirements of scientific tasks. In this work, we present Wilkins, an in situ workflow system that is designed for ease-of-use while providing scalable and efficient execution of workflow tasks. Wilkins provides a flexible workflow description interface, employs a high-performance data transport layer based on HDF5, and supports tasks with disparate data rates by providing a flow control mechanism. Wilkins seamlessly couples scientific tasks that already use HDF5, without requiring task code modifications. We demonstrate the above features using both synthetic benchmarks and two science use cases in materials science and cosmology.

Submitted 4 April, 2024; originally announced April 2024.

arXiv:2402.15734 [pdf, other]

Data-Efficient Operator Learning via Unsupervised Pretraining and In-Context Learning

Authors: Wuyang Chen, Jialin Song, Pu Ren, Shashank Subramanian, Dmitriy Morozov, Michael W. Mahoney

Abstract: Recent years have witnessed the promise of coupling machine learning methods and physical domain-specific insights for solving scientific problems based on partial differential equations (PDEs). However, being data-intensive, these methods still require a large amount of PDE data. This reintroduces the need for expensive numerical PDE solutions, partially undermining the original goal of avoiding these expensive simulations. In this work, seeking data efficiency, we design unsupervised pretraining for PDE operator learning. To reduce the need for training data with heavy simulation costs, we mine unlabeled PDE data without simulated solutions, and pretrain neural operators with physics-inspired reconstruction-based proxy tasks. To improve out-of-distribution performance, we further assist neural operators in flexibly leveraging in-context learning methods, without incurring extra training costs or designs. Extensive empirical evaluations on a diverse set of PDEs demonstrate that our method is highly data-efficient, more generalizable, and even outperforms conventional vision-pretrained models.

Submitted 13 June, 2024; v1 submitted 24 February, 2024; originally announced February 2024.

arXiv:2312.10700 [pdf, other]

Cross-Domain Robustness of Transformer-based Keyphrase Generation

Authors: Anna Glazkova, Dmitry Morozov

Abstract: Modern models for text generation show state-of-the-art results in many natural language processing tasks. In this work, we explore the effectiveness of abstractive text summarization models for keyphrase selection. A list of keyphrases is an important element of a text in databases and repositories of electronic documents. In our experiments, abstractive text summarization models fine-tuned for keyphrase generation show quite high results for a target text corpus. However, in most cases, the zero-shot performance on other corpora and domains is significantly lower. We investigate cross-domain limitations of abstractive text summarization models for keyphrase generation. We present an evaluation of the fine-tuned BART models for the keyphrase selection task across six benchmark corpora for keyphrase extraction including scientific texts from two domains and news texts. We explore the role of transfer learning between different domains to improve the BART model performance on small text corpora. Our experiments show that preliminary fine-tuning on out-of-domain corpora can be effective under conditions of a limited number of samples.

Submitted 17 December, 2023; originally announced December 2023.

Comments: Presented at the XXV International Conference "Data Analytics and Management in Data Intensive Domains" (DAMDID/RCDL), October 2023

MSC Class: 68T50 ACM Class: I.2.7; I.7.m; H.3.3

arXiv:2310.01698 [pdf, other]

Robustifying State-space Models for Long Sequences via Approximate Diagonalization

Authors: Annan Yu, Arnur Nigmetov, Dmitriy Morozov, Michael W. Mahoney, N. Benjamin Erichson

Abstract: State-space models (SSMs) have recently emerged as a framework for learning long-range sequence tasks. An example is the structured state-space sequence (S4) layer, which uses the diagonal-plus-low-rank structure of the HiPPO initialization framework. However, the complicated structure of the S4 layer poses challenges; and, in an effort to address these challenges, models such as S4D and S5 have considered a purely diagonal structure. This choice simplifies the implementation, improves computational efficiency, and allows channel communication. However, diagonalizing the HiPPO framework is itself an ill-posed problem. In this paper, we propose a general solution for this and related ill-posed diagonalization problems in machine learning. We introduce a generic, backward-stable "perturb-then-diagonalize" (PTD) methodology, which is based on the pseudospectral theory of non-normal operators, and which may be interpreted as the approximate diagonalization of the non-normal matrices defining SSMs. Based on this, we introduce the S4-PTD and S5-PTD models. Through theoretical analysis of the transfer functions of different initialization schemes, we demonstrate that the S4-PTD/S5-PTD initialization strongly converges to the HiPPO framework, while the S4D/S5 initialization only achieves weak convergences. As a result, our new models show resilience to Fourier-mode noise-perturbed inputs, a crucial property not achieved by the S4D/S5 models. In addition to improved robustness, our S5-PTD model averages 87.6% accuracy on the Long-Range Arena benchmark, demonstrating that the PTD methodology helps to improve the accuracy of deep learning models.

Submitted 2 October, 2023; originally announced October 2023.

arXiv:2306.00258 [pdf, other]

Towards Foundation Models for Scientific Machine Learning: Characterizing Scaling and Transfer Behavior

Authors: Shashank Subramanian, Peter Harrington, Kurt Keutzer, Wahid Bhimji, Dmitriy Morozov, Michael Mahoney, Amir Gholami

Abstract: Pre-trained machine learning (ML) models have shown great performance for a wide range of applications, in particular in natural language processing (NLP) and computer vision (CV). Here, we study how pre-training could be used for scientific machine learning (SciML) applications, specifically in the context of transfer learning. We study the transfer behavior of these models as (i) the pre-trained model size is scaled, (ii) the downstream training dataset size is scaled, (iii) the physics parameters are systematically pushed out of distribution, and (iv) how a single model pre-trained on a mixture of different physics problems can be adapted to various downstream applications. We find that, when fine-tuned appropriately, transfer learning can help reach desired accuracy levels with orders of magnitude fewer downstream examples (across different tasks that can even be out-of-distribution) than training from scratch, with consistent behavior across a wide range of downstream examples. We also find that fine-tuning these models yields more performance gains as model size increases, compared to training from scratch on new downstream tasks. These results hold for a broad range of PDE learning tasks. All in all, our results demonstrate the potential of the "pre-train and fine-tune" paradigm for SciML problems, demonstrating a path towards building SciML foundation models. We open-source our code for reproducibility.

Submitted 31 May, 2023; originally announced June 2023.

Comments: 16 pages, 11 figures

Journal ref: NeurIPS 2023

arXiv:2301.10838 [pdf, other]

Fast Merge Tree Computation via SYCL

Authors: Arnur Nigmetov, Dmitriy Morozov

Abstract: A merge tree is a topological descriptor of a real-valued function. Merge trees are used in visualization and topological data analysis, either directly or as a means to another end: computing a 0-dimensional persistence diagram, identifying connected components, performing topological simplification, etc. Scientific computing relies more and more on GPUs to achieve fast, scalable computation. For efficiency, data analysis should take place at the same location as the main computation, which motivates interest in parallel algorithms and portable software for merge trees that can run not only on a CPU, but also on a GPU, or other types of accelerators. The SYCL standard defines a programming model that allows the same code, written in standard C++, to compile targets for multiple parallel backends (CPUs via OpenMP or TBB, NVIDIA GPUs via CUDA, AMD GPUs via ROCm, Intel GPUs via Level Zero, FPGAs). In this paper, we adapt the triplet merge tree algorithm to SYCL and compare our implementation with the VTK-m implementation, which is the only other implementation of merge trees for GPUs that we know of.

Submitted 27 January, 2023; v1 submitted 25 January, 2023; originally announced January 2023.

Comments: Topological Data Analysis and Visualization (TopoInVis) 2022 v2 -- corrected Acknowledgements
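
Illustrative sketch (not from the paper): the abstract above notes that a merge tree can be used to compute a 0-dimensional persistence diagram. The snippet below shows that idea on the simplest possible domain, a function sampled on a path graph, using a sequential union-find sweep and the elder rule. It is not the triplet merge tree algorithm and does not use SYCL; the function name `sublevel_persistence_0d` and the sample values are assumptions made for illustration.

```python
def sublevel_persistence_0d(values):
    """Return sorted (birth, death) pairs of 0-dimensional sublevel-set
    persistence for a function sampled on a path graph, where vertex i
    is adjacent to vertex i + 1. Hypothetical helper for illustration."""
    n = len(values)
    if n == 0:
        return []
    # Sweep vertices from lowest to highest value (ties broken by index).
    order = sorted(range(n), key=lambda i: (values[i], i))
    parent = [None] * n   # None means "not yet reached by the sweep"
    root_min = {}         # component root -> vertex where the component was born

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    pairs = []
    for v in order:
        parent[v] = v
        root_min[v] = v
        for u in (v - 1, v + 1):
            if 0 <= u < n and parent[u] is not None:
                ru, rv = find(u), find(v)
                if ru == rv:
                    continue
                mu, mv = root_min[ru], root_min[rv]
                # Elder rule: the component with the higher minimum dies at v.
                if (values[mu], mu) <= (values[mv], mv):
                    older, younger = ru, rv
                else:
                    older, younger = rv, ru
                born = root_min[younger]
                if values[born] < values[v]:  # skip zero-persistence pairs
                    pairs.append((values[born], values[v]))
                parent[younger] = older
    # The component of the global minimum never dies.
    pairs.append((values[order[0]], float("inf")))
    return sorted(pairs)


if __name__ == "__main__":
    print(sublevel_persistence_0d([1, 3, 0, 4, 2, 5]))
```

Each finite pair records a local minimum (birth) and the saddle value at which its component merges into an older one (death), which is the information encoded by the branches of a merge tree.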

arXiv:2209.03791 [pdf, other]

doi 10.1134/S1995080223010134

Applying Transformer-based Text Summarization for Keyphrase Generation

Authors: Anna Glazkova, Dmitry Morozov

Abstract: Keyphrases are crucial for searching and systematizing scholarly documents. Most current methods for keyphrase extraction are aimed at the extraction of the most significant words in the text. But in practice, the list of keyphrases often includes words that do not appear in the text explicitly. In this case, the list of keyphrases represents an abstractive summary of the source text. In this paper, we experiment with popular transformer-based models for abstractive text summarization using four benchmark datasets for keyphrase extraction. We compare the results obtained with the results of common unsupervised and supervised methods for keyphrase extraction. Our evaluation shows that summarization models are quite effective in generating keyphrases in terms of the full-match F1-score and BERTScore. However, they produce a lot of words that are absent in the author's list of keyphrases, which makes summarization models ineffective in terms of ROUGE-1. We also investigate several ordering strategies to concatenate target keyphrases. The results showed that the choice of strategy affects the performance of keyphrase generation.

Submitted 6 October, 2022; v1 submitted 8 September, 2022; originally announced September 2022.

Comments: 15 pages, 4 figures. DAMDID-2022

MSC Class: 68T50 ACM Class: I.2.7; I.7.m; H.3.3

Journal ref: Lobachevskii J Math 44, 123-136 (2023)

arXiv:2203.16748 [pdf, other]

Topological Optimization with Big Steps

Authors: Arnur Nigmetov, Dmitriy Morozov

Abstract: Using persistent homology to guide optimization has emerged as a novel application of topological data analysis. Existing methods treat persistence calculation as a black box and backpropagate gradients only onto the simplices involved in particular pairs. We show how the cycles and chains used in the persistence calculation can be used to prescribe gradients to larger subsets of the domain. In particular, we show that in a special case, which serves as a building block for general losses, the problem can be solved exactly in linear time. This relies on another contribution of this paper, which eliminates the need to examine a factorial number of permutations of simplices with the same value. We present empirical experiments that show the practical benefits of our algorithm: the number of steps required for the optimization is reduced by an order of magnitude.

Submitted 2 November, 2023; v1 submitted 30 March, 2022; originally announced March 2022.

Comments: 26 pages, 29 figures. Updated version (section on consistency of critical sets, more experiments) accepted to DCG

arXiv:2112.03980 [pdf, other]

Output-sensitive Computation of Generalized Persistence Diagrams for 2-filtrations

Authors: Dmitriy Morozov, Amit Patel

Abstract: When persistence diagrams are formalized as the Möbius inversion of the birth-death function, they naturally generalize to the multi-parameter setting and enjoy many of the key properties, such as stability, that we expect in applications. The direct definition in the 2-parameter setting, and the corresponding brute-force algorithm to compute them, require $\Omega(n^4)$ operations. But the size of the generalized persistence diagram, $C$, can be as low as linear (and as high as cubic). We elucidate a connection between the 2-parameter and the ordinary 1-parameter settings, which allows us to design an output-sensitive algorithm, whose running time is in $O(n^3 + Cn)$.

Submitted 16 May, 2023; v1 submitted 7 December, 2021; originally announced December 2021.

Comments: Major revision. The exposition is greatly simplified and background section is expanded

arXiv:2104.04739 [pdf, ps, other]

doi 10.18653/v1/2021.semeval-1.124

MIPT-NSU-UTMN at SemEval-2021 Task 5: Ensembling Learning with Pre-trained Language Models for Toxic Spans Detection

Authors: Mikhail Kotyushev, Anna Glazkova, Dmitry Morozov

Abstract: This paper describes our system for SemEval-2021 Task 5 on Toxic Spans Detection. We developed ensemble models using BERT-based neural architectures and post-processing to combine tokens into spans. We evaluated several pre-trained language models using various ensemble techniques for toxic span identification and achieved sizable improvements over our baseline fine-tuned BERT models. Finally, our system obtained an F1-score of 67.55% on test data.

Submitted 10 April, 2021; originally announced April 2021.

Comments: Accepted at SemEval-2021 Workshop, ACL-IJCNLP 2021

MSC Class: 68T50 ACM Class: I.2.7; I.7.m; H.3.3

Journal ref: Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), pp. 913-918, 2021

arXiv:2011.05290 [pdf, other]

Topological Regularization via Persistence-Sensitive Optimization

Authors: Arnur Nigmetov, Aditi S. Krishnapriyan, Nicole Sanderson, Dmitriy Morozov

Abstract: Optimization, a key tool in machine learning and statistics, relies on regularization to reduce overfitting. Traditional regularization methods control a norm of the solution to ensure its smoothness. Recently, topological methods have emerged as a way to provide a more precise and expressive control over the solution, relying on persistent homology to quantify and reduce its roughness. All such existing techniques back-propagate gradients through the persistence diagram, which is a summary of the topological features of a function. Their downside is that they provide information only at the critical points of the function. We propose a method that instead builds on persistence-sensitive simplification and translates the required changes to the persistence diagram into changes on large subsets of the domain, including both critical and regular points. This approach enables a faster and more precise topological regularization, the benefits of which we illustrate with experimental evidence.

Submitted 10 November, 2020; originally announced November 2020.

Comments: The first two authors contributed equally to this work

arXiv:2010.16027 [pdf, other]

PersGNN: Applying Topological Data Analysis and Geometric Deep Learning to Structure-Based Protein Function Prediction

Authors: Nicolas Swenson, Aditi S. Krishnapriyan, Aydin Buluc, Dmitriy Morozov, Katherine Yelick

Abstract: Understanding protein structure-function relationships is a key challenge in computational biology, with applications across the biotechnology and pharmaceutical industries. While it is known that protein structure directly impacts protein function, many functional prediction tasks use only protein sequence. In this work, we isolate protein structure to make functional annotations for proteins in the Protein Data Bank in order to study the expressiveness of different structure-based prediction schemes. We present PersGNN - an end-to-end trainable deep learning model that combines graph representation learning with topological data analysis to capture a complex set of both local and global structural features. While variations of these techniques have been successfully applied to proteins before, we demonstrate that our hybridized approach, PersGNN, outperforms either method on its own as well as a baseline neural network that learns from the same information. PersGNN achieves a 9.3% boost in area under the precision recall curve (AUPR) compared to the best individual model, as well as high F1 scores across different gene ontology categories, indicating the transferability of this approach.

Submitted 29 October, 2020; originally announced October 2020.

Comments: The first two authors contributed equally to this work

arXiv:2010.00532 [pdf, other]

Machine learning with persistent homology and chemical word embeddings improves prediction accuracy and interpretability in metal-organic frameworks

Authors: Aditi S. Krishnapriyan, Joseph Montoya, Maciej Haranczyk, Jens Hummelshøj, Dmitriy Morozov

Abstract: Machine learning has emerged as a powerful approach in materials discovery. Its major challenge is selecting features that create interpretable representations of materials, useful across multiple prediction tasks. We introduce an end-to-end machine learning model that automatically generates descriptors that capture a complex representation of a material's structure and chemistry. This approach builds on computational topology techniques (namely, persistent homology) and word embeddings from natural language processing. It automatically encapsulates geometric and chemical information directly from the material system. We demonstrate our approach on multiple nanoporous metal-organic framework datasets by predicting methane and carbon dioxide adsorption across different conditions. Our results show considerable improvement in both accuracy and transferability across targets compared to models constructed from the commonly-used, manually-curated features, consistently achieving an average 25-30% decrease in root-mean-squared-deviation and an average increase of 40-50% in $R^2$ scores. A key advantage of our approach is interpretability: Our model identifies the pores that correlate best to adsorption at different pressures, which contributes to understanding atomic-level structure-property relationships for materials design.

Submitted 31 March, 2021; v1 submitted 1 October, 2020; originally announced October 2020.

Comments: 14 pages main text, 8 figures

arXiv:2001.05972 [pdf, other]

Topological Descriptors Help Predict Guest Adsorption in Nanoporous Materials

Authors: Aditi S. Krishnapriyan, Maciej Haranczyk, Dmitriy Morozov

Abstract: Machine learning has emerged as an attractive alternative to experiments and simulations for predicting material properties. Usually, such an approach relies on specific domain knowledge for feature design: each learning target requires careful selection of features that an expert recognizes as important for the specific task. The major drawback of this approach is that computation of only a few structural features has been implemented so far, and it is difficult to tell a priori which features are important for a particular application. The latter problem has been empirically observed for predictors of guest uptake in nanoporous materials: local and global porosity features become dominant descriptors at low and high pressures, respectively. We investigate a feature representation of materials using tools from topological data analysis. Specifically, we use persistent homology to describe the geometry of nanoporous materials at various scales. We combine our topological descriptor with traditional structural features and investigate the relative importance of each to the prediction tasks. We demonstrate an application of this feature representation by predicting methane adsorption in zeolites, for pressures in the range of 1-200 bar. Our results not only show a considerable improvement compared to the baseline, but they also highlight that topological features capture information complementary to the structural features: this is especially important for the adsorption at low pressure, a task particularly difficult for the traditional features. Furthermore, by investigating the importance of individual topological features in the adsorption model, we are able to pinpoint the location of the pores that correlate best to adsorption at different pressures, contributing to our atom-level understanding of structure-property relationships.

Submitted 6 March, 2020; v1 submitted 16 January, 2020; originally announced January 2020.

Comments: 14 pages, 7 figures

arXiv:1910.14499 [pdf, other]

Data-driven model for hydraulic fracturing design optimization: focus on building digital database and production forecast

Authors: A. D. Morozov, D. O. Popkov, V. M. Duplyakov, R. F. Mutalova, A. A. Osiptsov, A. L. Vainshtein, E. V. Burnaev, E. V. Shel, G. V. Paderin

Abstract: The growing number of hydraulic fracturing (HF) jobs over the past two decades has resulted in a significant amount of measured data available for the development of predictive models via machine learning (ML). In multistage fractured completions, post-fracturing production analysis reveals that different stages produce very non-uniformly due to a combination of geomechanics and fracturing design factors. Hence, there is significant room for improvement of current design practices. The workflow is essentially split into two stages. As a result of the first stage, the present paper summarizes the efforts to create a digital database of field data from several thousand multistage HF jobs on wells from circa 20 different oilfields in Western Siberia, Russia. In terms of the number of points (fracturing jobs), the present database is a rare case of a representative dataset of about 5000 data points. Each point in the database contains the vector of 92 input variables (the reservoir, well and frac design parameters) and the vector of production data, which is characterized by 16 parameters, including the target, cumulative oil production. Data preparation has been done using various ML techniques: the problem of missing values in the database is solved with collaborative filtering for data imputation; outliers are removed using visualisation of the cluster data structure with the t-SNE algorithm. The production forecast problem is solved with the CatBoost algorithm. The prediction capability of the model, measured with the coefficient of determination ($R^2$), reached 0.815. The inverse problem (selecting an optimum set of fracturing design parameters to maximize production) will be considered in the second part of the study to be published in another paper, along with a recommendation system for advising DESC and production stimulation engineers on an optimized fracturing design.

Submitted 18 July, 2020; v1 submitted 28 October, 2019; originally announced October 2019.
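
Illustrative sketch (not from the paper): the abstract above reports model quality with the coefficient of determination, R^2 = 1 - SS_res / SS_tot. The snippet below computes that quantity in plain Python. The function name `r2_score` and the sample numbers are assumptions for illustration and are unrelated to the paper's data or its 0.815 result.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Hypothetical helper for illustration; raises ValueError when the inputs
    are empty, differ in length, or the targets have zero variance."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    mean = sum(y_true) / len(y_true)
    # Residual sum of squares: error left after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    observed = [1.0, 2.0, 3.0, 4.0]    # made-up targets
    predicted = [1.5, 2.0, 3.0, 3.5]   # made-up predictions
    print(r2_score(observed, predicted))
```

A value of 1 means the predictions match the targets exactly, and 0 means the model does no better than predicting the mean of the targets.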

arXiv:1809.09955 [pdf, other]

Knowledge extraction, modeling and formalization: EEG case study

Authors: Dmitry Morozov, Mario Lezoche, Hervé Panetto

Abstract: Formal Concept Analysis (FCA) is a well-established method for data analysis which finds many applications in data mining. Its extension to complex data representation formats brought a wave of new applications to problems such as gene expression mining, prediction of toxicity of chemical compounds or clustering of sequences in process event logs. Inspired by this work, our research adopts its model and designs an experiment for mining electroencephalographic recordings for patterns of sleep spindles. The contribution of this paper lies in the specification of the discretization procedure and the architecture of the FCA experiment. We also provide some reflection on the related research papers.

Submitted 11 September, 2018; originally announced September 2018.

Comments: arXiv admin note: text overlap with arXiv:1506.05018 by other authors

arXiv:1710.10769 [pdf, other]

Communication-Avoiding Optimization Methods for Distributed Massive-Scale Sparse Inverse Covariance Estimation

Authors: Penporn Koanantakool, Alnur Ali, Ariful Azad, Aydin Buluc, Dmitriy Morozov, Leonid Oliker, Katherine Yelick, Sang-Yun Oh

Abstract: Across a variety of scientific disciplines, sparse inverse covariance estimation is a popular tool for capturing the underlying dependency relationships in multivariate data. Unfortunately, most estimators are not scalable enough to handle the sizes of modern high-dimensional data sets (often on the order of terabytes), and assume Gaussian samples. To address these deficiencies, we introduce HP-CONCORD, a highly scalable optimization method for estimating a sparse inverse covariance matrix based on a regularized pseudolikelihood framework, without assuming Gaussianity. Our parallel proximal gradient method uses a novel communication-avoiding linear algebra algorithm and runs across a multi-node cluster with up to 1k nodes (24k cores), achieving parallel scalability on problems with up to ~819 billion parameters (1.28 million dimensions); even on a single node, HP-CONCORD demonstrates scalability, outperforming a state-of-the-art method. We also use HP-CONCORD to estimate the underlying dependency structure of the brain from fMRI data, and use the result to identify functional regions automatically. The results show good agreement with a clustering from the neuroscience literature.

Submitted 8 April, 2018; v1 submitted 30 October, 2017; originally announced October 2017.

Comments: Main paper: 15 pages, appendix: 24 pages

Journal ref: Artificial Intelligence and Statistics vol. 84 1376-1386 (2018)

arXiv:1606.03357 [pdf, other]

Geometry Helps to Compare Persistence Diagrams

Authors: Michael Kerber, Dmitriy Morozov, Arnur Nigmetov

Abstract: Exploiting geometric structure to improve the asymptotic complexity of discrete assignment problems is a well-studied subject. In contrast, the practical advantages of using geometry for such problems have not been explored. We implement geometric variants of the Hopcroft-Karp algorithm for bottleneck matching (based on previous work by Efrat et al.) and of the auction algorithm by Bertsekas for Wasserstein distance computation. Both implementations use k-d trees to replace a linear scan with a geometric proximity query. Our interest in this problem stems from the desire to compute distances between persistence diagrams, a problem that comes up frequently in topological data analysis. We show that our geometric matching algorithms lead to a substantial performance gain, both in running time and in memory consumption, over their purely combinatorial counterparts. Moreover, our implementation significantly outperforms the only other implementation available for comparing persistence diagrams.

Submitted 10 June, 2016; originally announced June 2016.

Comments: 20 pages, 10 figures; extended version of paper published in ALENEX 2016

ACM Class: G.4; G.2.2

arXiv:1107.5665 [pdf, other]

doi 10.1088/0266-5611/27/12/124003

Dualities in persistent (co)homology

Authors: Vin de Silva, Dmitriy Morozov, Mikael Vejdemo-Johansson

Abstract: We consider sequences of absolute and relative homology and cohomology groups that arise naturally for a filtered cell complex. We establish algebraic relationships between their persistence modules, and show that they contain equivalent information. We explain how one can use the existing algorithm for persistent homology to process any of the four modules, and relate it to a recently introduced persistent cohomology algorithm. We present experimental evidence for the practical efficiency of the latter algorithm.

Submitted 28 July, 2011; originally announced July 2011.

Comments: 16 pages, 3 figures, submitted to the Inverse Problems special issue on Topological Data Analysis

arXiv:1102.4972 [pdf, ps, other]

Witnessed k-Distance

Authors: Leonidas J. Guibas, Quentin Mérigot, Dmitriy Morozov

Abstract: Distance function to a compact set plays a central role in several areas of computational geometry. Methods that rely on it are robust to the perturbations of the data by the Hausdorff noise, but fail in the presence of outliers. The recently introduced distance to a measure offers a solution by extending the distance function framework to reasoning about the geometry of probability measures, while maintaining theoretical guarantees about the quality of the inferred information. A combinatorial explosion hinders working with distance to a measure as an ordinary (power) distance function. In this paper, we analyze an approximation scheme that keeps the representation linear in the size of the input, while maintaining the guarantees on the inference quality close to those for the exact (but costly) representation.

Submitted 24 February, 2011; originally announced February 2011.

arXiv:1102.3389 [pdf, ps, other]

Homology and Robustness of Level and Interlevel Sets

Authors: Paul Bendich, Herbert Edelsbrunner, Dmitriy Morozov, Amit Patel

Abstract: Given a function $f: \mathbb{X} \to \mathbb{R}$ on a topological space, we consider the preimages of intervals and their homology groups and show how to read the ranks of these groups from the extended persistence diagram of $f$. In addition, we quantify the robustness of the homology classes under perturbations of $f$ using well groups, and we show how to read the ranks of these groups from the same extended persistence diagram. The special case $\mathbb{X} = \mathbb{R}^3$ has ramifications in the fields of medical imaging and scientific visualization.

Submitted 16 February, 2011; originally announced February 2011.

arXiv:0911.2142 [pdf, ps, other]

Quantifying Transversality by Measuring the Robustness of Intersections

Authors: Herbert Edelsbrunner, Dmitriy Morozov, Amit Patel

Abstract: By definition, transverse intersections are stable under infinitesimal perturbations. Using persistent homology, we extend this notion to a measure. Given a space of perturbations, we assign to each homology class of the intersection its robustness, the magnitude of a perturbation in this space necessary to kill it, and prove that robustness is stable. Among the applications of this result is a stable notion of robustness for fixed points of continuous mappings and a statement of stability for contours of smooth mappings.

Submitted 20 April, 2010; v1 submitted 11 November, 2009; originally announced November 2009.

Showing 1–23 of 23 results for author: Morozov, D