-
Diffusion Models in Bioinformatics: A New Wave of Deep Learning Revolution in Action
Authors:
Zhiye Guo,
Jian Liu,
Yanli Wang,
Mengrui Chen,
Duolin Wang,
Dong Xu,
Jianlin Cheng
Abstract:
Denoising diffusion models have emerged as one of the most powerful generative models in recent years. They have achieved remarkable success in many fields, such as computer vision, natural language processing (NLP), and bioinformatics. Although there are a few excellent reviews on diffusion models and their applications in computer vision and NLP, there is a lack of an overview of their applications in bioinformatics. This review aims to provide a thorough overview of the applications of diffusion models in bioinformatics to aid their further development in bioinformatics and computational biology. We start with an introduction to the key concepts and theoretical foundations of three cornerstone diffusion modeling frameworks (denoising diffusion probabilistic models, noise-conditioned scoring networks, and stochastic differential equations), followed by a comprehensive description of diffusion models employed in the different domains of bioinformatics, including cryo-EM data enhancement, single-cell data analysis, protein design and generation, drug and small molecule design, and protein-ligand interaction. The review concludes with a summary of potential new developments and applications of diffusion models in bioinformatics.
Submitted 13 February, 2023;
originally announced February 2023.
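As an illustration of the first framework named in the abstract above (denoising diffusion probabilistic models), the following is a minimal sketch of the standard closed-form forward noising step, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps. It is not taken from the review itself: the linear beta schedule, its endpoint values, the number of steps, and all function names are assumptions chosen for illustration.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (assumed schedule and endpoints)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alpha(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bar, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def forward_diffuse(x0, t, alpha_bar, rng):
    """Sample x_t from q(x_t | x_0) in one step; also return the noise used."""
    signal_scale = math.sqrt(alpha_bar[t])
    noise_scale = math.sqrt(1.0 - alpha_bar[t])
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [signal_scale * x + noise_scale * e for x, e in zip(x0, eps)]
    return xt, eps


betas = linear_beta_schedule(1000)
alpha_bar = cumulative_alpha(betas)
rng = random.Random(0)            # fixed seed so the sketch is repeatable
x0 = [1.0, -0.5, 0.25, 2.0]       # hypothetical toy data point
xt, eps = forward_diffuse(x0, 999, alpha_bar, rng)
```

At the last step alpha_bar is close to zero, so x_t is dominated by the Gaussian noise term; a trained denoiser would be asked to predict eps from x_t and t.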
-
Mid-Infrared Photothermal-Fluorescence in Situ Hybridization for Functional Analysis and Genetic Identification of Single Cells
Authors:
Yeran Bai,
Zhongyue Guo,
Fátima C. Pereira,
Michael Wagner,
Ji-Xin Cheng
Abstract:
Simultaneous identification and metabolic analysis of microbes with single-cell resolution and high throughput is necessary to answer the question of "who eats what, when, and where" in complex microbial communities. Here, we present a mid-infrared photothermal-fluorescence in situ hybridization (MIP-FISH) platform that enables direct bridging of genotype and phenotype. Through multiple improvements of MIP imaging, the sensitive detection of isotopically-labelled compounds incorporated into proteins of individual bacterial cells became possible, while simultaneous detection of FISH labelling with rRNA-targeted probes enabled the identification of the analyzed cells. In proof-of-concept experiments, we showed that the clear spectral red shift in the protein amide I region due to incorporation of $^{13}$C atoms originating from $^{13}$C-labelled-glucose can be exploited by MIP-FISH to discriminate and identify $^{13}$C-labelled bacterial cells within a complex human gut microbiome sample. The presented methods open new opportunities for single-cell structure-function analyses for microbiology.
Submitted 6 September, 2022;
originally announced September 2022.
-
Graph-based Molecular Representation Learning
Authors:
Zhichun Guo,
Kehan Guo,
Bozhao Nan,
Yijun Tian,
Roshni G. Iyer,
Yihong Ma,
Olaf Wiest,
Xiangliang Zhang,
Wei Wang,
Chuxu Zhang,
Nitesh V. Chawla
Abstract:
Molecular representation learning (MRL) is a key step to build the connection between machine learning and chemical science. In particular, it encodes molecules as numerical vectors preserving the molecular structures and features, on top of which the downstream tasks (e.g., property prediction) can be performed. Recently, MRL has achieved considerable progress, especially in methods based on deep molecular graph learning. In this survey, we systematically review these graph-based molecular representation techniques, especially the methods incorporating chemical domain knowledge. Specifically, we first introduce the features of 2D and 3D molecular graphs. Then we summarize and categorize MRL methods into three groups based on their input. Furthermore, we discuss some typical chemical applications supported by MRL. To facilitate studies in this fast-developing area, we also list the benchmarks and commonly used datasets in the paper. Finally, we share our thoughts on future research directions.
Submitted 28 November, 2023; v1 submitted 8 July, 2022;
originally announced July 2022.
-
Free energy landscape of two-state protein Acylphosphatase with large contact order revealed by force-dependent folding and unfolding dynamics
Authors:
Xuening Ma,
Hao Sun,
Haiyan Hong,
Zilong Guo,
Huanhuan Su,
Hu Chen
Abstract:
Acylphosphatase (AcP) is a small protein with 98 amino acid residues that catalyzes the hydrolysis of carboxyl-phosphate bonds. AcP is a typical two-state protein with a slow folding rate due to the relatively large contact order of its native structure. The mechanical properties and unfolding behavior of AcP have been studied by atomic force microscopy, but its folding and unfolding dynamics at low forces have not been reported. Here, using stable magnetic tweezers, we measured the force-dependent folding rates over a force range from 1 pN to 3 pN, and unfolding rates from 15 pN to 40 pN. The obtained unfolding rates show different force sensitivities at forces below and above ~27 pN, which determines a free energy landscape with two energy barriers. Our results indicate that the free energy landscapes of small globular proteins generally have a Bactrian camel shape, and that the large contact order of the native state produces a high barrier that dominates at low forces.
Submitted 11 March, 2022;
originally announced March 2022.
-
Multiscale Wavelet Transfer Entropy with Application to Corticomuscular Coupling Analysis
Authors:
Zhenghao Guo,
Verity M. McClelland,
Osvaldo Simeone,
Kerry R. Mills,
Zoran Cvetkovic
Abstract:
Objective: Functional coupling between the motor cortex and muscle activity is commonly detected and quantified by cortico-muscular coherence (CMC) or Granger causality (GC) analysis, which are applicable only to linear couplings and are not sufficiently sensitive: some healthy subjects show no significant CMC and GC, and yet have good motor skills. The objective of this work is to develop measures of functional cortico-muscular coupling that have improved sensitivity and are capable of detecting both linear and non-linear interactions. Methods: A multiscale wavelet transfer entropy (TE) methodology is proposed. The methodology relies on a dyadic stationary wavelet transform to decompose electroencephalogram (EEG) and electromyogram (EMG) signals into functional bands of neural oscillations. Then, it applies TE analysis based on a range of embedding delay vectors to detect and quantify intra- and cross-frequency band cortico-muscular coupling at different time scales. Results: Our experiments with neurophysiological signals substantiate the potential of the developed methodologies for detecting and quantifying information flow between EEG and EMG signals for subjects with and without significant CMC or GC, including non-linear cross-frequency interactions, and interactions across different temporal scales. The obtained results are in agreement with the underlying sensorimotor neurophysiology. Conclusion: These findings suggest that the concept of multiscale wavelet TE provides a comprehensive framework for analysing cortex-muscle interactions. Significance: The proposed methodologies will enable developing novel insights into movement control and neurophysiological processes more generally.
Submitted 9 August, 2021;
originally announced August 2021.
-
Study on the multiple characteristics of M3 generation of pea mutants obtained by neutron irradiation
Authors:
Dapeng Xu,
Ze'en Yao,
Jianbin Pan,
Huyuan Feng,
Zhiqi Guo,
Xiaolong Lu
Abstract:
Irradiation breeding is an important technique in the effort to solve food shortages and improve the quality of agricultural products. In this study, a field test was carried out on the M3 generation of two mutant pea plants obtained from previous neutron irradiation of pea seeds. The relationship between agronomic characteristics and yields of the mutants was investigated. Moreover, differences in physiological and biochemical properties and seed nutrients were analyzed. The results demonstrated that the plant height, effective pods per plant, and yield per plant of mutant Leaf-M1 were 45.0%, 43.2%, and 50.9% higher, respectively, than those of the control group. Further analysis attributed the increase in yield per plant to the increased branching number. The yield per plant of mutant Leaf-M2 was 7.8% higher than that of the control group, which could be related to the increased chlorophyll content in the leaves. The increase in yield per plant differed significantly between the two mutants owing to morphological variation between them. There were significant differences in SOD activity and MDA content between the two mutants and the control, indicating that the physiological regulation of the two mutants also changed. In addition, the iron content of the seeds of the two mutants was about 10.9% lower than that of the control seeds, a significant difference. These findings indicate that the mutants Leaf-M1 and Leaf-M2 have breeding value and material value for molecular biological studies.
Submitted 1 June, 2021;
originally announced June 2021.
-
MODMA dataset: a Multi-modal Open Dataset for Mental-disorder Analysis
Authors:
Hanshu Cai,
Yiwen Gao,
Shuting Sun,
Na Li,
Fuze Tian,
Han Xiao,
Jianxiu Li,
Zhengwu Yang,
Xiaowei Li,
Qinglin Zhao,
Zhenyu Liu,
Zhijun Yao,
Minqiang Yang,
Hong Peng,
Jing Zhu,
Xiaowei Zhang,
Guoping Gao,
Fang Zheng,
Rui Li,
Zhihua Guo,
Rong Ma,
Jing Yang,
Lan Zhang,
Xiping Hu,
Yumin Li
, et al. (1 additional author not shown)
Abstract:
According to the World Health Organization, the number of mental disorder patients, especially depression patients, has grown rapidly, and mental disorders have become a leading contributor to the global burden of disease. However, the present common practice of depression diagnosis is based on interviews and clinical scales carried out by doctors, which is both labor-intensive and time-consuming. One important reason is the lack of physiological indicators for mental disorders. With the rise of tools such as data mining and artificial intelligence, using physiological data to explore possible new physiological indicators of mental disorders and creating new applications for mental disorder diagnosis has become a hot research topic. However, good-quality physiological data for mental disorder patients are hard to acquire. We present a multi-modal open dataset for mental-disorder analysis. The dataset includes EEG and audio data from clinically depressed patients and matching normal controls. All our patients were carefully diagnosed and selected by professional psychiatrists in hospitals. The EEG dataset includes not only data collected using a traditional elastic cap mounted with 128 electrodes, but also data from a novel wearable 3-electrode EEG collector for pervasive applications. The 128-electrode EEG signals of 53 subjects were recorded both in resting state and under stimulation; the 3-electrode EEG signals of 55 subjects were recorded in resting state; the audio data of 52 subjects were recorded during interviewing, reading, and picture description. We encourage other researchers in the field to use the dataset for testing their methods of mental-disorder analysis.
Submitted 4 March, 2020; v1 submitted 20 February, 2020;
originally announced February 2020.
-
Phylogenomic Analyses of Large-scale Nuclear Genes Provide New Insights into the Evolutionary Relationships within the Rosids
Authors:
Lei Zhao,
Xia Li,
Ning Zhang,
Shu-Dong Zhang,
Ting-Shuang Yi,
Hong Ma,
Zhen-Hua Guo,
De-Zhu Li
Abstract:
The Rosids is one of the largest groups of flowering plants, with 140 families and ~70,000 species. Previous phylogenetic studies of the rosids have primarily utilized organelle genes that likely differ in evolutionary histories from nuclear genes. To better understand the evolutionary history of rosids, it is necessary to investigate their phylogenetic relationships using nuclear genes. Here, we employed large-scale phylogenomic datasets composed of nuclear genes, including 891 clusters of putative orthologous genes. Combined with comprehensive taxon sampling covering 63 species representing 14 out of the 17 orders, we reconstructed the rosids phylogeny with coalescence and concatenation methods, yielding similar tree topologies from all datasets. However, these topologies did not agree on the placement of Zygophyllales. Through comprehensive analyses, we found that missing data and gene tree heterogeneity were potential factors that may mislead concatenation methods, in particular, large amounts of missing data under high gene tree heterogeneity. Our results provided new insights into the deep phylogenetic relationships of the rosids, and demonstrated that coalescence methods may effectively resolve the phylogenetic relationships of the rosids with missing data under high gene tree heterogeneity.
Submitted 30 June, 2016;
originally announced June 2016.
-
A Collaboration Network Model Of Cytokine-Protein Network
Authors:
Sheng-Rong Zou,
Ta Zhou,
Yu-Jing Peng,
Zhong-Wei Guo,
Chang-gui Gu,
Da-Ren He
Abstract:
Complex networks provide a new view for the investigation of immune systems. In this paper, we collect data from the STRING database and present a model based on cooperation network theory. The cytokine-protein network model we consider consists of two kinds of nodes: immune cytokine types, which act as acts, and protein types, which act as actors. From the act degree distribution, which is well described by typical shifted power law (SPL) functions, we find that HRAS, TNFRSF13C, S100A8, S100A1, MAPK8, S100A7, LIF, CCL4, and CXCL13 collaborate heavily with other proteins. This reveals that these mediators are important in the cytokine-protein network for regulating immune activity. The dyad act degree distribution is another important property of a generalized collaboration network. A dyad is a pair of proteins that appear together in one cytokine collaboration relationship. The dyad act degree distribution is also well described by typical SPL functions. The average shortest path length is 1.29. These results show that this model describes cytokine-protein collaboration well.
Submitted 5 December, 2007;
originally announced December 2007.
-
An Empirical Study of Immune System Based On Bipartite Network
Authors:
Sheng-Rong Zou,
Yu-Jing Peng,
Zhong-Wei Guo,
Ta Zhou,
Chang-gui Gu,
Da-Ren He
Abstract:
The immune system is the most important defense system against human pathogens. In this paper, we present an immune model based on bipartite graph theory. We collect data from the COPE database and construct an immune cell-mediator network. The act degree distribution of this network is shown to follow a power law with an index of 1.8. From our analysis, we found that some high-degree mediators are very important in regulating immune activity, such as TNF-alpha, IL-8, TNF-alpha receptors, CCL5, IL-6, IL-2 receptors, TNF-beta receptors, TNF-beta, IL-4 receptors, IL-1 beta, and CD54. We also found that the assortativity of the immune system is -0.27, which reveals that the immune system is a non-social network. Finally, we found that the similarity of the network is 0.13: any two cells are similar only to a small extent, which reveals that many cells have unique features. The results show that this model can describe the immune system comprehensively.
Submitted 5 December, 2007;
originally announced December 2007.
-
A Brand-new Research Method of Neuroendocrine System
Authors:
Sheng-Rong Zou,
Zhong-Wei Guo,
Yu-Jing Peng,
Ta Zhou,
Chang-Gui Gu,
Da-Ren He
Abstract:
In this paper, we present the results of an empirical investigation of the neuroendocrine system using bipartite graphs. This neuroendocrine network model can describe the structural characteristics of the neuroendocrine system. The act degree distribution and cumulative act degree distribution take the so-called shifted power law (SPL) functional form. In the neuroendocrine network, the act degree is the number of cells that secrete a single mediator; bFGF (basic fibroblast growth factor), an important mitogenic cytokine, has the largest act degree, followed by TGF-beta, IL-6, IL1-beta, VEGF, IGF-1, and so on. These mediators are critical in the neuroendocrine system for maintaining bodily health, emotional stability, and endocrine harmony. The average act degree of the neuroendocrine network is h = 3.01, which means that each mediator is secreted by three cells on average. The similarity, which stands for the average probability that neuroendocrine cells secrete the same mediators, is s = 0.14. Our results may be used in research on the medical treatment of neuroendocrine diseases.
Submitted 2 December, 2007;
originally announced December 2007.
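The act degree used in the abstract above (the number of cells that secrete a given mediator) and its average h can be computed directly from a list of cell-mediator edges. The sketch below is only an illustration of that definition: the edge list is hypothetical toy data, not the authors' dataset, and the function names are assumptions.

```python
from collections import defaultdict


def act_degrees(edges):
    """Map each mediator to the number of distinct cells that secrete it."""
    cells_by_mediator = defaultdict(set)
    for cell, mediator in edges:
        cells_by_mediator[mediator].add(cell)
    return {mediator: len(cells) for mediator, cells in cells_by_mediator.items()}


def average_act_degree(edges):
    """Mean act degree over all mediators (the quantity h in the abstract)."""
    degrees = act_degrees(edges)
    return sum(degrees.values()) / len(degrees)


# Hypothetical toy bipartite network: (cell, mediator) pairs.
edges = [
    ("cell_A", "bFGF"),
    ("cell_B", "bFGF"),
    ("cell_C", "bFGF"),
    ("cell_A", "TGF-beta"),
    ("cell_B", "TGF-beta"),
    ("cell_C", "IL-6"),
]

degrees = act_degrees(edges)
h = average_act_degree(edges)
```

In this toy network bFGF is secreted by three cells, TGF-beta by two, and IL-6 by one, so the average act degree is 2.0. Storing cells in a set makes repeated edges count only once.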