Search | arXiv e-print repository

Instruction-Based Molecular Graph Generation with Unified Text-Graph Diffusion Model

Authors: Yuran Xiang, Haiteng Zhao, Chang Ma, Zhi-Hong Deng

Abstract: Recent advancements in computational chemistry have increasingly focused on synthesizing molecules based on textual instructions. Integrating graph generation with these instructions is complex, leading most current methods to use molecular sequences with pre-trained large language models. In response to this challenge, we propose a novel framework, named… ▽ More Recent advancements in computational chemistry have increasingly focused on synthesizing molecules based on textual instructions. Integrating graph generation with these instructions is complex, leading most current methods to use molecular sequences with pre-trained large language models. In response to this challenge, we propose a novel framework, named $\textbf{UTGDiff (Unified Text-Graph Diffusion Model)}$, which utilizes language models for discrete graph diffusion to generate molecular graphs from instructions. UTGDiff features a unified text-graph transformer as the denoising network, derived from pre-trained language models and minimally modified to process graph data through attention bias. Our experimental results demonstrate that UTGDiff consistently outperforms sequence-based baselines in tasks involving instruction-based molecule generation and editing, achieving superior performance with fewer parameters given an equivalent level of pretraining corpus. Our code is availble at https://github.com/ran1812/UTGDiff. △ Less

Submitted 19 August, 2024; originally announced August 2024.

arXiv:2407.14020 [pdf, other]

NeuroBind: Towards Unified Multimodal Representations for Neural Signals

Authors: Fengyu Yang, Chao Feng, Daniel Wang, Tianye Wang, Ziyao Zeng, Zhiyang Xu, Hyoungseob Park, Pengliang Ji, Hanbin Zhao, Yuanning Li, Alex Wong

Abstract: Understanding neural activity and information representation is crucial for advancing knowledge of brain function and cognition. Neural activity, measured through techniques like electrophysiology and neuroimaging, reflects various aspects of information processing. Recent advances in deep neural networks offer new approaches to analyzing these signals using pre-trained models. However, challenges… ▽ More Understanding neural activity and information representation is crucial for advancing knowledge of brain function and cognition. Neural activity, measured through techniques like electrophysiology and neuroimaging, reflects various aspects of information processing. Recent advances in deep neural networks offer new approaches to analyzing these signals using pre-trained models. However, challenges arise due to discrepancies between different neural signal modalities and the limited scale of high-quality neural data. To address these challenges, we present NeuroBind, a general representation that unifies multiple brain signal types, including EEG, fMRI, calcium imaging, and spiking data. To achieve this, we align neural signals in these image-paired neural datasets to pre-trained vision-language embeddings. Neurobind is the first model that studies different neural modalities interconnectedly and is able to leverage high-resource modality models for various neuroscience tasks. We also showed that by combining information from different neural signal modalities, NeuroBind enhances downstream performance, demonstrating the effectiveness of the complementary strengths of different neural modalities. As a result, we can leverage multiple types of neural signals mapped to the same space to improve downstream tasks, and demonstrate the complementary strengths of different neural modalities. This approach holds significant potential for advancing neuroscience research, improving AI systems, and developing neuroprosthetics and brain-computer interfaces. △ Less

Submitted 19 July, 2024; originally announced July 2024.

arXiv:2407.07930 [pdf]

Token-Mol 1.0: Tokenized drug design with large language model

Authors: Jike Wang, Rui Qin, Mingyang Wang, Meijing Fang, Yangyang Zhang, Yuchen Zhu, Qun Su, Qiaolin Gou, Chao Shen, Odin Zhang, Zhenxing Wu, Dejun Jiang, Xujun Zhang, Huifeng Zhao, Xiaozhe Wan, Zhourui Wu, Liwei Liu, Yu Kang, Chang-Yu Hsieh, Tingjun Hou

Abstract: Significant interests have recently risen in leveraging sequence-based large language models (LLMs) for drug design. However, most current applications of LLMs in drug discovery lack the ability to comprehend three-dimensional (3D) structures, thereby limiting their effectiveness in tasks that explicitly involve molecular conformations. In this study, we introduced Token-Mol, a token-only 3D drug… ▽ More Significant interests have recently risen in leveraging sequence-based large language models (LLMs) for drug design. However, most current applications of LLMs in drug discovery lack the ability to comprehend three-dimensional (3D) structures, thereby limiting their effectiveness in tasks that explicitly involve molecular conformations. In this study, we introduced Token-Mol, a token-only 3D drug design model. This model encodes all molecular information, including 2D and 3D structures, as well as molecular property data, into tokens, which transforms classification and regression tasks in drug discovery into probabilistic prediction problems, thereby enabling learning through a unified paradigm. Token-Mol is built on the transformer decoder architecture and trained using random causal masking techniques. Additionally, we proposed the Gaussian cross-entropy (GCE) loss function to overcome the challenges in regression tasks, significantly enhancing the capacity of LLMs to learn continuous numerical values. Through a combination of fine-tuning and reinforcement learning (RL), Token-Mol achieves performance comparable to or surpassing existing task-specific methods across various downstream tasks, including pocket-based molecular generation, conformation generation, and molecular property prediction. Compared to existing molecular pre-trained models, Token-Mol exhibits superior proficiency in handling a wider range of downstream tasks essential for drug design. Notably, our approach improves regression task accuracy by approximately 30% compared to similar token-only methods. Token-Mol overcomes the precision limitations of token-only models and has the potential to integrate seamlessly with general models such as ChatGPT, paving the way for the development of a universal artificial intelligence drug design model that facilitates rapid and high-quality drug design by experts. △ Less

Submitted 19 August, 2024; v1 submitted 10 July, 2024; originally announced July 2024.

arXiv:2406.15534 [pdf, other]

Geneverse: A collection of Open-source Multimodal Large Language Models for Genomic and Proteomic Research

Authors: Tianyu Liu, Yijia Xiao, Xiao Luo, Hua Xu, W. Jim Zheng, Hongyu Zhao

Abstract: The applications of large language models (LLMs) are promising for biomedical and healthcare research. Despite the availability of open-source LLMs trained using a wide range of biomedical data, current research on the applications of LLMs to genomics and proteomics is still limited. To fill this gap, we propose a collection of finetuned LLMs and multimodal LLMs (MLLMs), known as Geneverse, for th… ▽ More The applications of large language models (LLMs) are promising for biomedical and healthcare research. Despite the availability of open-source LLMs trained using a wide range of biomedical data, current research on the applications of LLMs to genomics and proteomics is still limited. To fill this gap, we propose a collection of finetuned LLMs and multimodal LLMs (MLLMs), known as Geneverse, for three novel tasks in genomic and proteomic research. The models in Geneverse are trained and evaluated based on domain-specific datasets, and we use advanced parameter-efficient finetuning techniques to achieve the model adaptation for tasks including the generation of descriptions for gene functions, protein function inference from its structure, and marker gene selection from spatial transcriptomic data. We demonstrate that adapted LLMs and MLLMs perform well for these tasks and may outperform closed-source large-scale models based on our evaluations focusing on both truthfulness and structural correctness. All of the training strategies and base models we used are freely accessible. △ Less

Submitted 21 June, 2024; originally announced June 2024.

Comments: 8 pages

arXiv:2405.06642 [pdf, other]

PPFlow: Target-aware Peptide Design with Torsional Flow Matching

Authors: Haitao Lin, Odin Zhang, Huifeng Zhao, Dejun Jiang, Lirong Wu, Zicheng Liu, Yufei Huang, Stan Z. Li

Abstract: Therapeutic peptides have proven to have great pharmaceutical value and potential in recent decades. However, methods of AI-assisted peptide drug discovery are not fully explored. To fill the gap, we propose a target-aware peptide design method called \textsc{PPFlow}, based on conditional flow matching on torus manifolds, to model the internal geometries of torsion angles for the peptide structure… ▽ More Therapeutic peptides have proven to have great pharmaceutical value and potential in recent decades. However, methods of AI-assisted peptide drug discovery are not fully explored. To fill the gap, we propose a target-aware peptide design method called \textsc{PPFlow}, based on conditional flow matching on torus manifolds, to model the internal geometries of torsion angles for the peptide structure design. Besides, we establish a protein-peptide binding dataset named PPBench2024 to fill the void of massive data for the task of structure-based peptide drug design and to allow the training of deep learning methods. Extensive experiments show that PPFlow reaches state-of-the-art performance in tasks of peptide drug generation and optimization in comparison with baseline models, and can be generalized to other tasks including docking and side-chain packing. △ Less

Submitted 16 June, 2024; v1 submitted 5 March, 2024; originally announced May 2024.

Comments: 18 pages

arXiv:2404.19230 [pdf]

Deep Lead Optimization: Leveraging Generative AI for Structural Modification

Authors: Odin Zhang, Haitao Lin, Hui Zhang, Huifeng Zhao, Yufei Huang, Yuansheng Huang, Dejun Jiang, Chang-yu Hsieh, Peichen Pan, Tingjun Hou

Abstract: The idea of using deep-learning-based molecular generation to accelerate discovery of drug candidates has attracted extraordinary attention, and many deep generative models have been developed for automated drug design, termed molecular generation. In general, molecular generation encompasses two main strategies: de novo design, which generates novel molecular structures from scratch, and lead opt… ▽ More The idea of using deep-learning-based molecular generation to accelerate discovery of drug candidates has attracted extraordinary attention, and many deep generative models have been developed for automated drug design, termed molecular generation. In general, molecular generation encompasses two main strategies: de novo design, which generates novel molecular structures from scratch, and lead optimization, which refines existing molecules into drug candidates. Among them, lead optimization plays an important role in real-world drug design. For example, it can enable the development of me-better drugs that are chemically distinct yet more effective than the original drugs. It can also facilitate fragment-based drug design, transforming virtual-screened small ligands with low affinity into first-in-class medicines. Despite its importance, automated lead optimization remains underexplored compared to the well-established de novo generative models, due to its reliance on complex biological and chemical knowledge. To bridge this gap, we conduct a systematic review of traditional computational methods for lead optimization, organizing these strategies into four principal sub-tasks with defined inputs and outputs. This review delves into the basic concepts, goals, conventional CADD techniques, and recent advancements in AIDD. Additionally, we introduce a unified perspective based on constrained subgraph generation to harmonize the methodologies of de novo design and lead optimization. Through this lens, de novo design can incorporate strategies from lead optimization to address the challenge of generating hard-to-synthesize molecules; inversely, lead optimization can benefit from the innovations in de novo design by approaching it as a task of generating molecules conditioned on certain substructures. △ Less

Submitted 29 April, 2024; originally announced April 2024.

arXiv:2404.00014 [pdf]

Deep Geometry Handling and Fragment-wise Molecular 3D Graph Generation

Authors: Odin Zhang, Yufei Huang, Shichen Cheng, Mengyao Yu, Xujun Zhang, Haitao Lin, Yundian Zeng, Mingyang Wang, Zhenxing Wu, Huifeng Zhao, Zaixi Zhang, Chenqing Hua, Yu Kang, Sunliang Cui, Peichen Pan, Chang-Yu Hsieh, Tingjun Hou

Abstract: Most earlier 3D structure-based molecular generation approaches follow an atom-wise paradigm, incrementally adding atoms to a partially built molecular fragment within protein pockets. These methods, while effective in designing tightly bound ligands, often overlook other essential properties such as synthesizability. The fragment-wise generation paradigm offers a promising solution. However, a co… ▽ More Most earlier 3D structure-based molecular generation approaches follow an atom-wise paradigm, incrementally adding atoms to a partially built molecular fragment within protein pockets. These methods, while effective in designing tightly bound ligands, often overlook other essential properties such as synthesizability. The fragment-wise generation paradigm offers a promising solution. However, a common challenge across both atom-wise and fragment-wise methods lies in their limited ability to co-design plausible chemical and geometrical structures, resulting in distorted conformations. In response to this challenge, we introduce the Deep Geometry Handling protocol, a more abstract design that extends the design focus beyond the model architecture. Through a comprehensive review of existing geometry-related models and their protocols, we propose a novel hybrid strategy, culminating in the development of FragGen - a geometry-reliable, fragment-wise molecular generation method. FragGen marks a significant leap forward in the quality of generated geometry and the synthesis accessibility of molecules. The efficacy of FragGen is further validated by its successful application in designing type II kinase inhibitors at the nanomolar level. △ Less

Submitted 15 March, 2024; originally announced April 2024.

arXiv:2312.17670 [pdf, other]

Benchmarking the CoW with the TopCoW Challenge: Topology-Aware Anatomical Segmentation of the Circle of Willis for CTA and MRA

Authors: Kaiyuan Yang, Fabio Musio, Yihui Ma, Norman Juchler, Johannes C. Paetzold, Rami Al-Maskari, Luciano Höher, Hongwei Bran Li, Ibrahim Ethem Hamamci, Anjany Sekuboyina, Suprosanna Shit, Houjing Huang, Chinmay Prabhakar, Ezequiel de la Rosa, Diana Waldmannstetter, Florian Kofler, Fernando Navarro, Martin Menten, Ivan Ezhov, Daniel Rueckert, Iris Vos, Ynte Ruigrok, Birgitta Velthuis, Hugo Kuijf, Julien Hämmerli , et al. (59 additional authors not shown)

Abstract: The Circle of Willis (CoW) is an important network of arteries connecting major circulations of the brain. Its vascular architecture is believed to affect the risk, severity, and clinical outcome of serious neuro-vascular diseases. However, characterizing the highly variable CoW anatomy is still a manual and time-consuming expert task. The CoW is usually imaged by two angiographic imaging modaliti… ▽ More The Circle of Willis (CoW) is an important network of arteries connecting major circulations of the brain. Its vascular architecture is believed to affect the risk, severity, and clinical outcome of serious neuro-vascular diseases. However, characterizing the highly variable CoW anatomy is still a manual and time-consuming expert task. The CoW is usually imaged by two angiographic imaging modalities, magnetic resonance angiography (MRA) and computed tomography angiography (CTA), but there exist limited public datasets with annotations on CoW anatomy, especially for CTA. Therefore we organized the TopCoW Challenge in 2023 with the release of an annotated CoW dataset. The TopCoW dataset was the first public dataset with voxel-level annotations for thirteen possible CoW vessel components, enabled by virtual-reality (VR) technology. It was also the first large dataset with paired MRA and CTA from the same patients. TopCoW challenge formalized the CoW characterization problem as a multiclass anatomical segmentation task with an emphasis on topological metrics. We invited submissions worldwide for the CoW segmentation task, which attracted over 140 registered participants from four continents. The top performing teams managed to segment many CoW components to Dice scores around 90%, but with lower scores for communicating arteries and rare variants. There were also topological mistakes for predictions with high Dice scores. Additional topological analysis revealed further areas for improvement in detecting certain CoW components and matching CoW variant topology accurately. TopCoW represented a first attempt at benchmarking the CoW anatomical segmentation task for MRA and CTA, both morphologically and topologically. △ Less

Submitted 29 April, 2024; v1 submitted 29 December, 2023; originally announced December 2023.

Comments: 24 pages, 11 figures, 9 tables. Summary Paper for the MICCAI TopCoW 2023 Challenge

arXiv:2310.17112 [pdf, other]

Modeling and Analysis of the Epidemic-Behavior Co-evolution Dynamics with User Irrationality

Authors: Wenxiang Dong, H. Vicky Zhao

Abstract: During a public health crisis like COVID-19, individuals' adoption of protective behaviors, such as self-isolation and wearing masks, can significantly impact the spread of the disease. In the meanwhile, the spread of the disease can also influence individuals' behavioral choices. Moreover, when facing uncertain losses, individuals' decisions tend to be irrational. Therefore, it is critical to stu… ▽ More During a public health crisis like COVID-19, individuals' adoption of protective behaviors, such as self-isolation and wearing masks, can significantly impact the spread of the disease. In the meanwhile, the spread of the disease can also influence individuals' behavioral choices. Moreover, when facing uncertain losses, individuals' decisions tend to be irrational. Therefore, it is critical to study individuals' irrational behavior choices in the context of a pandemic. In this paper, we propose an epidemic-behavior co-evolution model that captures the dynamic interplay between individual decision-making and disease spread. To account for irrational decision-making, we incorporate the Prospect Theory in our individual behavior modeling. We conduct a theoretical analysis of the model, examining the steady states that emerge from the co-evolutionary process. We use simulations to validate our theoretical findings and gain further insights. This investigation aims to enhance our understanding of the complex dynamics between individual behavior and disease spread during a pandemic. △ Less

Submitted 25 October, 2023; originally announced October 2023.

arXiv:2310.02275 [pdf, other]

MuSe-GNN: Learning Unified Gene Representation From Multimodal Biological Graph Data

Authors: Tianyu Liu, Yuge Wang, Rex Ying, Hongyu Zhao

Abstract: Discovering genes with similar functions across diverse biomedical contexts poses a significant challenge in gene representation learning due to data heterogeneity. In this study, we resolve this problem by introducing a novel model called Multimodal Similarity Learning Graph Neural Network, which combines Multimodal Machine Learning and Deep Graph Neural Networks to learn gene representations fro… ▽ More Discovering genes with similar functions across diverse biomedical contexts poses a significant challenge in gene representation learning due to data heterogeneity. In this study, we resolve this problem by introducing a novel model called Multimodal Similarity Learning Graph Neural Network, which combines Multimodal Machine Learning and Deep Graph Neural Networks to learn gene representations from single-cell sequencing and spatial transcriptomic data. Leveraging 82 training datasets from 10 tissues, three sequencing techniques, and three species, we create informative graph structures for model training and gene representations generation, while incorporating regularization with weighted similarity learning and contrastive learning to learn cross-data gene-gene relationships. This novel design ensures that we can offer gene representations containing functional similarity across different contexts in a joint space. Comprehensive benchmarking analysis shows our model's capacity to effectively capture gene function similarity across multiple modalities, outperforming state-of-the-art methods in gene representation learning by up to 97.5%. Moreover, we employ bioinformatics tools in conjunction with gene representations to uncover pathway enrichment, regulation causal networks, and functions of disease-associated or dosage-sensitive genes. Therefore, our model efficiently produces unified gene representations for the analysis of gene functions, tissue functions, diseases, and species evolution. △ Less

Submitted 29 September, 2023; originally announced October 2023.

Comments: Accepted by NeurIPS 2023

arXiv:2309.15397 [pdf, ps, other]

Short-Term Postsynaptic Plasticity Facilitates Predictive Tracking in Continuous Attractors

Authors: Huilin Zhao, Sungchil Yang, Chi Chung Alan Fung

Abstract: The N-methyl-D-aspartate receptor (NMDAR) is a crucial component of synaptic transmission, and its dysfunction is implicated in many neurological diseases and psychiatric conditions. NMDAR-based short-term postsynaptic plasticity (STPP) is a newly discovered postsynaptic response facilitation mechanism. Our group has suggested that long-lasting glutamate binding of NMDAR allows input information t… ▽ More The N-methyl-D-aspartate receptor (NMDAR) is a crucial component of synaptic transmission, and its dysfunction is implicated in many neurological diseases and psychiatric conditions. NMDAR-based short-term postsynaptic plasticity (STPP) is a newly discovered postsynaptic response facilitation mechanism. Our group has suggested that long-lasting glutamate binding of NMDAR allows input information to be held for up to 500 ms or longer in brain slices, which contributes to response facilitation. However, the implications of STPP in the dynamics of neuronal populations remain unknown. In this study, we implemented STPP in a continuous attractor neural network (CANN) model to describe the neural information encoded in neuronal populations. Unlike short-term facilitation, which is a kind of presynaptic plasticity, the temporally enhanced synaptic efficacy induced by STPP destabilizes the network state of the CANN by increasing the mobility of the system. This nontrivial dynamical effect enables a CANN with STPP to track a moving stimulus predictively, i.e., the network state responds to the anticipated stimulus. Our findings reveal a novel STPP-based mechanism for sensory prediction that can help develop brain-inspired computational algorithms for prediction. △ Less

Submitted 27 September, 2023; originally announced September 2023.

Comments: 29 pages, 9 figures

arXiv:2309.08616 [pdf]

Introduction of accelerated BOIN design and facilitation of its application

Authors: Masahiro Kojima, Wu Wende, Henry Zhao

Abstract: Purpose: During discussions at the Data Science Roundtable meeting in Japan, there were instances where the adoption of the BOIN design was declined, attributed to the extension of study duration and increased sample size in comparison to the 3+3 design. We introduce an accelerated BOIN design aimed at completing a clinical phase I trial at a pace comparable to the 3+3 design. Additionally, we int… ▽ More Purpose: During discussions at the Data Science Roundtable meeting in Japan, there were instances where the adoption of the BOIN design was declined, attributed to the extension of study duration and increased sample size in comparison to the 3+3 design. We introduce an accelerated BOIN design aimed at completing a clinical phase I trial at a pace comparable to the 3+3 design. Additionally, we introduce how we could have applied the BOIN design within our company, which predominantly utilized the 3+3 design for most of its clinical oncology dose escalation trials. Methods: The accelerated BOIN design is adaptable by using efficiently designated stopping criterion for the existing BOIN framework. Our approach is to terminate the dose escalation study if the number of evaluable patients treated at the current dose reaches 6 and the decision is to stay at the current dose for the next cohort of patients. In addition, for lower dosage levels, considering a cohort size smaller than 3 may be feasible when there are no safety concerns from non-clinical studies. We demonstrate the accelerated BOIN design using a case study and subsequently evaluate the performance of our proposed design through a simulation study. Results: In the simulation study, the average difference in the percentage of correct MTD selection between the accelerated BOIN design and the standard BOIN design was -2.43%, the average study duration and the average sample size of the accelerated BOIN design was reduced by 14.8 months and 9.22 months, respectively, compared with the standard BOIN design. Conclusion: We conclude that our proposed accelerated BOIN design not only provides superior operating characteristics but also enables the study to be completed as fast as the 3+3 design. △ Less

Submitted 31 August, 2023; originally announced September 2023.

arXiv:2308.02172 [pdf]

Delete: Deep Lead Optimization Enveloped in Protein Pocket through Unified Deleting Strategies and a Structure-aware Network

Authors: Haotian Zhang, Huifeng Zhao, Xujun Zhang, Qun Su, Hongyan Du, Chao Shen, Zhe Wang, Dan Li, Peichen Pan, Guangyong Chen, Yu Kang, Chang-yu Hsieh, Tingjun Hou

Abstract: Drug discovery is a highly complicated process, and it is unfeasible to fully commit it to the recently developed molecular generation methods. Deep learning-based lead optimization takes expert knowledge as a starting point, learning from numerous historical cases about how to modify the structure for better drug-forming properties. However, compared with the more established de novo generation s… ▽ More Drug discovery is a highly complicated process, and it is unfeasible to fully commit it to the recently developed molecular generation methods. Deep learning-based lead optimization takes expert knowledge as a starting point, learning from numerous historical cases about how to modify the structure for better drug-forming properties. However, compared with the more established de novo generation schemes, lead optimization is still an area that requires further exploration. Previously developed models are often limited to resolving one (or few) certain subtask(s) of lead optimization, and most of them can only generate the two-dimensional structures of molecules while disregarding the vital protein-ligand interactions based on the three-dimensional binding poses. To address these challenges, we present a novel tool for lead optimization, named Delete (Deep lead optimization enveloped in protein pocket). Our model can handle all subtasks of lead optimization involving fragment growing, linking, and replacement through a unified deleting (masking) strategy, and is aware of the intricate pocket-ligand interactions through the geometric design of networks. Statistical evaluations and case studies conducted on individual subtasks demonstrate that Delete has a significant ability to produce molecules with superior binding affinities to protein targets and reasonable drug-likeness from given fragments or atoms. This feature may assist medicinal chemists in developing not only me-too/me-better products from existing drugs but also hit-to-lead for first-in-class drugs in a highly efficient manner. △ Less

Submitted 4 August, 2023; originally announced August 2023.

arXiv:2306.13089 [pdf, other]

GIMLET: A Unified Graph-Text Model for Instruction-Based Molecule Zero-Shot Learning

Authors: Haiteng Zhao, Shengchao Liu, Chang Ma, Hannan Xu, Jie Fu, Zhi-Hong Deng, Lingpeng Kong, Qi Liu

Abstract: Molecule property prediction has gained significant attention in recent years. The main bottleneck is the label insufficiency caused by expensive lab experiments. In order to alleviate this issue and to better leverage textual knowledge for tasks, this study investigates the feasibility of employing natural language instructions to accomplish molecule-related tasks in a zero-shot setting. We disco… ▽ More Molecule property prediction has gained significant attention in recent years. The main bottleneck is the label insufficiency caused by expensive lab experiments. In order to alleviate this issue and to better leverage textual knowledge for tasks, this study investigates the feasibility of employing natural language instructions to accomplish molecule-related tasks in a zero-shot setting. We discover that existing molecule-text models perform poorly in this setting due to inadequate treatment of instructions and limited capacity for graphs. To overcome these issues, we propose GIMLET, which unifies language models for both graph and text data. By adopting generalized position embedding, our model is extended to encode both graph structures and instruction text without additional graph encoding modules. GIMLET also decouples encoding of the graph from tasks instructions in the attention mechanism, enhancing the generalization of graph features across novel tasks. We construct a dataset consisting of more than two thousand molecule tasks with corresponding instructions derived from task descriptions. We pretrain GIMLET on the molecule tasks along with instructions, enabling the model to transfer effectively to a broad range of tasks. Experimental results demonstrate that GIMLET significantly outperforms molecule-text baselines in instruction-based zero-shot learning, even achieving closed results to supervised GNN models on tasks such as toxcast and muv. △ Less

Submitted 22 October, 2023; v1 submitted 28 May, 2023; originally announced June 2023.

arXiv:2306.07812 [pdf, other]

doi 10.1145/3580305.3599252

Automated 3D Pre-Training for Molecular Property Prediction

Authors: Xu Wang, Huan Zhao, Weiwei Tu, Quanming Yao

Abstract: Molecular property prediction is an important problem in drug discovery and materials science. As geometric structures have been demonstrated necessary for molecular property prediction, 3D information has been combined with various graph learning methods to boost prediction performance. However, obtaining the geometric structure of molecules is not feasible in many real-world applications due to… ▽ More Molecular property prediction is an important problem in drug discovery and materials science. As geometric structures have been demonstrated necessary for molecular property prediction, 3D information has been combined with various graph learning methods to boost prediction performance. However, obtaining the geometric structure of molecules is not feasible in many real-world applications due to the high computational cost. In this work, we propose a novel 3D pre-training framework (dubbed 3D PGT), which pre-trains a model on 3D molecular graphs, and then fine-tunes it on molecular graphs without 3D structures. Based on fact that bond length, bond angle, and dihedral angle are three basic geometric descriptors corresponding to a complete molecular 3D conformer, we first develop a multi-task generative pre-train framework based on these three attributes. Next, to automatically fuse these three generative tasks, we design a surrogate metric using the \textit{total energy} to search for weight distribution of the three pretext task since total energy corresponding to the quality of 3D conformer.Extensive experiments on 2D molecular graphs are conducted to demonstrate the accuracy, efficiency and generalization ability of the proposed 3D PGT compared to various pre-training baselines. △ Less

Submitted 2 July, 2023; v1 submitted 13 June, 2023; originally announced June 2023.

arXiv:2304.10494 [pdf]

Infinite Physical Monkey: Do Deep Learning Methods Really Perform Better in Conformation Generation?

Authors: Haotian Zhang, Jintu Zhang, Huifeng Zhao, Dejun Jiang, Yafeng Deng

Abstract: Conformation Generation is a fundamental problem in drug discovery and cheminformatics. And organic molecule conformation generation, particularly in vacuum and protein pocket environments, is most relevant to drug design. Recently, with the development of geometric neural networks, the data-driven schemes have been successfully applied in this field, both for molecular conformation generation (in… ▽ More Conformation Generation is a fundamental problem in drug discovery and cheminformatics. And organic molecule conformation generation, particularly in vacuum and protein pocket environments, is most relevant to drug design. Recently, with the development of geometric neural networks, the data-driven schemes have been successfully applied in this field, both for molecular conformation generation (in vacuum) and binding pose generation (in protein pocket). The former beats the traditional ETKDG method, while the latter achieves similar accuracy compared with the widely used molecular docking software. Although these methods have shown promising results, some researchers have recently questioned whether deep learning (DL) methods perform better in molecular conformation generation via a parameter-free method. To our surprise, what they have designed is some kind analogous to the famous infinite monkey theorem, the monkeys that are even equipped with physics education. To discuss the feasibility of their proving, we constructed a real infinite stochastic monkey for molecular conformation generation, showing that even with a more stochastic sampler for geometry generation, the coverage of the benchmark QM-computed conformations are higher than those of most DL-based methods. By extending their physical monkey algorithm for binding pose prediction, we also discover that the successful docking rate also achieves near-best performance among existing DL-based docking models. Thus, though their conclusions are right, their proof process needs more concern. △ Less

Submitted 7 March, 2023; originally announced April 2023.

arXiv:2302.12563 [pdf, other]

Retrieved Sequence Augmentation for Protein Representation Learning

Authors: Chang Ma, Haiteng Zhao, Lin Zheng, Jiayi Xin, Qintong Li, Lijun Wu, Zhihong Deng, Yang Lu, Qi Liu, Lingpeng Kong

Abstract: Protein language models have excelled in a variety of tasks, ranging from structure prediction to protein engineering. However, proteins are highly diverse in functions and structures, and current state-of-the-art models including the latest version of AlphaFold rely on Multiple Sequence Alignments (MSA) to feed in the evolutionary knowledge. Despite their success, heavy computational overheads, a… ▽ More Protein language models have excelled in a variety of tasks, ranging from structure prediction to protein engineering. However, proteins are highly diverse in functions and structures, and current state-of-the-art models including the latest version of AlphaFold rely on Multiple Sequence Alignments (MSA) to feed in the evolutionary knowledge. Despite their success, heavy computational overheads, as well as the de novo and orphan proteins remain great challenges in protein representation learning. In this work, we show that MSAaugmented models inherently belong to retrievalaugmented methods. Motivated by this finding, we introduce Retrieved Sequence Augmentation(RSA) for protein representation learning without additional alignment or pre-processing. RSA links query protein sequences to a set of sequences with similar structures or properties in the database and combines these sequences for downstream prediction. We show that protein language models benefit from the retrieval enhancement on both structure prediction and property prediction tasks, with a 5% improvement on MSA Transformer on average while being 373 times faster. In addition, we show that our model can transfer to new protein domains better and outperforms MSA Transformer on de novo protein prediction. Our study fills a much-encountered gap in protein prediction and brings us a step closer to demystifying the domain knowledge needed to understand protein sequences. Code is available on https://github.com/HKUNLP/RSA. △ Less

Submitted 24 February, 2023; originally announced February 2023.

arXiv:2302.11662 [pdf]

eQTL Studies: from Bulk Tissues to Single Cells

Authors: Jingfei Zhang, Hongyu Zhao

Abstract: An expression quantitative trait locus (eQTL) is a chromosomal region where genetic variants are associated with the expression levels of certain genes that can be both nearby or distant. The identifications of eQTLs for different tissues, cell types, and contexts have led to better understanding of the dynamic regulations of gene expressions and implications of functional genes and variants for c… ▽ More An expression quantitative trait locus (eQTL) is a chromosomal region where genetic variants are associated with the expression levels of certain genes that can be both nearby or distant. The identifications of eQTLs for different tissues, cell types, and contexts have led to better understanding of the dynamic regulations of gene expressions and implications of functional genes and variants for complex traits and diseases. Although most eQTL studies to date have been performed on data collected from bulk tissues, recent studies have demonstrated the importance of cell-type-specific and context-dependent gene regulations in biological processes and disease mechanisms. In this review, we discuss statistical methods that have been developed to enable the detections of cell-type-specific and context-dependent eQTLs from bulk tissues, purified cell types, and single cells. We also discuss the limitations of the current methods and future research opportunities. △ Less

Submitted 22 February, 2023; originally announced February 2023.

arXiv:2210.08749 [pdf, other]

A Transformer-based Generative Model for De Novo Molecular Design

Authors: Wenlu Wang, Ye Wang, Honggang Zhao, Simone Sciabola

Abstract: In the scope of drug discovery, the molecular design aims to identify novel compounds from the chemical space where the potential drug-like molecules are estimated to be in the order of 10^60 - 10^100. Since this search task is computationally intractable due to the unbounded search space, deep learning draws a lot of attention as a new way of generating unseen molecules. As we seek compounds with… ▽ More In the scope of drug discovery, the molecular design aims to identify novel compounds from the chemical space where the potential drug-like molecules are estimated to be in the order of 10^60 - 10^100. Since this search task is computationally intractable due to the unbounded search space, deep learning draws a lot of attention as a new way of generating unseen molecules. As we seek compounds with specific target proteins, we propose a Transformer-based deep model for de novo target-specific molecular design. The proposed method is capable of generating both drug-like compounds (without specified targets) and target-specific compounds. The latter are generated by enforcing different keys and values of the multi-head attention for each target. In this way, we allow the generation of SMILES strings to be conditional on the specified target. Experimental results demonstrate that our method is capable of generating both valid drug-like compounds and target-specific compounds. Moreover, the sampled compounds from conditional model largely occupy the real target-specific molecules' chemical space and also cover a significant fraction of novel compounds. △ Less

Submitted 22 October, 2022; v1 submitted 17 October, 2022; originally announced October 2022.

arXiv:2201.01855 [pdf, other]

Graph Neural Networks for Double-Strand DNA Breaks Prediction

Authors: XU Wang, Huan Zhao, Weiwei TU, Hao Li, Yu Sun, Xiaochen Bo

Abstract: Double-strand DNA breaks (DSBs) are a form of DNA damage that can cause abnormal chromosomal rearrangements. Recent technologies based on high-throughput experiments have obvious high costs and technical challenges.Therefore, we design a graph neural network based method to predict DSBs (GraphDSB), using DNA sequence features and chromosome structure information. In order to improve the expression… ▽ More Double-strand DNA breaks (DSBs) are a form of DNA damage that can cause abnormal chromosomal rearrangements. Recent technologies based on high-throughput experiments have obvious high costs and technical challenges.Therefore, we design a graph neural network based method to predict DSBs (GraphDSB), using DNA sequence features and chromosome structure information. In order to improve the expression ability of the model, we introduce Jumping Knowledge architecture and several effective structural encoding methods. The contribution of structural information to the prediction of DSBs is verified by the experiments on datasets from normal human epidermal keratinocytes (NHEK) and chronic myeloid leukemia cell line (K562), and the ablation studies further demonstrate the effectiveness of the designed components in the proposed GraphDSB framework. Finally, we use GNNExplainer to analyze the contribution of node features and topology to DSBs prediction, and proved the high contribution of 5-mer DNA sequence features and two chromatin interaction modes. △ Less

Submitted 4 January, 2022; originally announced January 2022.

arXiv:2111.15411 [pdf]

EndHiC: assemble large contigs into chromosomal-level scaffolds using the Hi-C links from contig ends

Authors: Sen Wang, Hengchao Wang, Fan Jiang, Anqi Wang, Hangwei Liu, Hanbo Zhao, Boyuan Yang, Dong Xu, Yan Zhang, Wei Fan

Abstract: Motivation: The application of PacBio HiFi and ultra-long ONT reads have achieved huge progress in the contig-level assembly, but it is still challenging to assemble large contigs into chromosomes with available Hi-C scaffolding software, which all compute the contact value between contigs using the Hi-C links from the whole contig regions. As the Hi-C links of two adjacent contigs concentrate onl… ▽ More Motivation: The application of PacBio HiFi and ultra-long ONT reads have achieved huge progress in the contig-level assembly, but it is still challenging to assemble large contigs into chromosomes with available Hi-C scaffolding software, which all compute the contact value between contigs using the Hi-C links from the whole contig regions. As the Hi-C links of two adjacent contigs concentrate only at the neighbor ends of the contigs, larger contig size will reduce the power to differentiate adjacent (signal) and non-adjacent (noise) contig linkages, leading to a higher rate of mis-assembly. Results: We present a software package EndHiC, which is suitable to assemble large contigs (> 1-Mb) into chromosomal-level scaffolds, using Hi-C links from only the contig end regions instead of the whole contig regions. Benefiting from the increased signal to noise ratio, EndHiC achieves much higher scaffolding accuracy compared to existing software LACHESIS, ALLHiC, and 3D-DNA. Moreover, EndHiC has few parameters, runs 10-1000 times faster than existing software, needs trivial memory, provides robustness evaluation, and allows graphic viewing of the scaffold results. The high scaffolding accuracy and user-friendly interface of EndHiC, liberate the users from labor-intensive manual checks and revision works. Availability and implementation: EndHiC is written in Perl, and is freely available at https://github.com/fanagislab/EndHiC. Contact: [email protected] and [email protected] Supplementary information: Supplementary data are available at Bioinformatics online. △ Less

Submitted 30 November, 2021; originally announced November 2021.

Comments: 25 pages, 1 figure, 6 supplemental figures, and 6 supplemental Tables

arXiv:2101.03784 [pdf]

Estimate Metabolite Taxonomy and Structure with a Fragment-Centered Database and Fragment Network

Authors: Hansen Zhao, Xu Zhao, Huan Yao, Jiaxin Feng, Sichun Zhang, Xinrong Zhang

Abstract: Metabolite structure identification has become the major bottleneck of the mass spectrometry based metabolomics research. Till now, number of mass spectra databases and search algorithms have been developed to address this issue. However, two critical problems still exist: the low chemical component record coverage in databases and significant MS/MS spectra variations related to experiment equipme… ▽ More Metabolite structure identification has become the major bottleneck of the mass spectrometry based metabolomics research. Till now, number of mass spectra databases and search algorithms have been developed to address this issue. However, two critical problems still exist: the low chemical component record coverage in databases and significant MS/MS spectra variations related to experiment equipment and parameter settings. In this work, we considered the molecule fragment as basic building blocks of the metabolic components which had relatively consistent signatures in MS/MS spectra. And from a bottom-up point of view, we built a fragment centered database, MSFragDB, by reorganizing the data from the Human Metabolome Database (HMDB) and developed an intensity-free searching algorithm to search and rank the most relative metabolite according to the users' input. We also proposed the concept of fragment network, a graph structure that encoded the relationship between the molecule fragments to find close motif that indicated a specific chemical structure. Although based on the same dataset as the HMDB, validation results implied that the MSFragDB had a higher hit ratio and furthermore, estimated possible taxonomy that a query spectrum belongs to when the corresponding chemical component was missing in the database. Aid by the Fragment Network, the MSFragDB was also proved to be able to estimate the right structure while the MS/MS spectrum suffers from the precursor-contamination. The strategy proposed is general and can be adopted in existing databases. We believe MSFragDB and Fragment Network can improve the performance of structure identification with existing data. The beta version of the database is freely available at www.xrzhanglab.com/msfragdb/. △ Less

Submitted 11 January, 2021; originally announced January 2021.

arXiv:2011.10955 [pdf, other]

doi 10.1007/s11571-021-09731-9

Autonomous learning of nonlocal stochastic neuron dynamics

Authors: Tyler E. Maltba, Hongli Zhao, Daniel M. Tartakovsky

Abstract: Neuronal dynamics is driven by externally imposed or internally generated random excitations/noise, and is often described by systems of random or stochastic ordinary differential equations. Such systems admit a distribution of solutions, which is (partially) characterized by the single-time joint probability density function (PDF) of system states. It can be used to calculate such information-the… ▽ More Neuronal dynamics is driven by externally imposed or internally generated random excitations/noise, and is often described by systems of random or stochastic ordinary differential equations. Such systems admit a distribution of solutions, which is (partially) characterized by the single-time joint probability density function (PDF) of system states. It can be used to calculate such information-theoretic quantities as the mutual information between the stochastic stimulus and various internal states of the neuron (e.g., membrane potential), as well as various spiking statistics. When random excitations are modeled as Gaussian white noise, the joint PDF of neuron states satisfies exactly a Fokker-Planck equation. However, most biologically plausible noise sources are correlated (colored). In this case, the resulting PDF equations require a closure approximation. We propose two methods for closing such equations: a modified nonlocal large-eddy-diffusivity closure and a data-driven closure relying on sparse regression to learn relevant features. The closures are tested for the stochastic non-spiking leaky integrate-and-fire and FitzHugh-Nagumo (FHN) neurons driven by sine-Wiener noise. Mutual information and total correlation between the random stimulus and the internal states of the neuron are calculated for the FHN neuron. △ Less

Submitted 7 September, 2021; v1 submitted 22 November, 2020; originally announced November 2020.

Comments: 28 pages, 12 figures, First author: Tyler E. Maltba, Corresponding author: Daniel M. Tartakovsky

Journal ref: Cogn Neurodyn 16, 683-705 (2022)

arXiv:2007.09524 [pdf, ps, other]

A Manifold Proximal Linear Method for Sparse Spectral Clustering with Application to Single-Cell RNA Sequencing Data Analysis

Authors: Zhongruo Wang, Bingyuan Liu, Shixiang Chen, Shiqian Ma, Lingzhou Xue, Hongyu Zhao

Abstract: Spectral clustering is one of the fundamental unsupervised learning methods widely used in data analysis. Sparse spectral clustering (SSC) imposes sparsity to the spectral clustering and it improves the interpretability of the model. This paper considers a widely adopted model for SSC, which can be formulated as an optimization problem over the Stiefel manifold with nonsmooth and nonconvex objecti… ▽ More Spectral clustering is one of the fundamental unsupervised learning methods widely used in data analysis. Sparse spectral clustering (SSC) imposes sparsity to the spectral clustering and it improves the interpretability of the model. This paper considers a widely adopted model for SSC, which can be formulated as an optimization problem over the Stiefel manifold with nonsmooth and nonconvex objective. Such an optimization problem is very challenging to solve. Existing methods usually solve its convex relaxation or need to smooth its nonsmooth part using certain smoothing techniques. In this paper, we propose a manifold proximal linear method (ManPL) that solves the original SSC formulation. We also extend the algorithm to solve the multiple-kernel SSC problems, for which an alternating ManPL algorithm is proposed. Convergence and iteration complexity results of the proposed methods are established. We demonstrate the advantage of our proposed methods over existing methods via the single-cell RNA sequencing data analysis. △ Less

Submitted 30 October, 2020; v1 submitted 18 July, 2020; originally announced July 2020.

arXiv:2006.08355 [pdf, other]

A Two-Phase Dynamic Contagion Model for COVID-19

Authors: Zezhun Chen, Angelos Dassios, Valerie Kuan, Jia Wei Lim, Yan Qu, Budhi Surya, Hongbiao Zhao

Abstract: In this paper, we propose a continuous-time stochastic intensity model, namely, two-phase dynamic contagion process(2P-DCP), for modelling the epidemic contagion of COVID-19 and investigating the lockdown effect based on the dynamic contagion model introduced by Dassios and Zhao (2011). It allows randomness to the infectivity of individuals rather than a constant reproduction number as assumed by… ▽ More In this paper, we propose a continuous-time stochastic intensity model, namely, two-phase dynamic contagion process(2P-DCP), for modelling the epidemic contagion of COVID-19 and investigating the lockdown effect based on the dynamic contagion model introduced by Dassios and Zhao (2011). It allows randomness to the infectivity of individuals rather than a constant reproduction number as assumed by standard models. Key epidemiological quantities, such as the distribution of final epidemic size and expected epidemic duration, are derived and estimated based on real data for various regions and countries. The associated time lag of the effect of intervention in each country or region is estimated. Our results are consistent with the incubation time of COVID-19 found by recent medical study. We demonstrate that our model could potentially be a valuable tool in the modeling of COVID-19. More importantly, the proposed model of 2P-DCP could also be used as an important tool in epidemiological modelling as this type of contagion models with very simple structures is adequate to describe the evolution of regional epidemic and worldwide pandemic. △ Less

Submitted 11 June, 2020; originally announced June 2020.

Comments: 29 pages, 14 figures

MSC Class: 60G55(Primary); 60J75(Secondary)

arXiv:2005.05549 [pdf, other]

Staggered Release Policies for COVID-19 Control: Costs and Benefits of Sequentially Relaxing Restrictions by Age

Authors: Henry Zhao, Zhilan Feng, Carlos Castillo-Chavez, Simon A. Levin

Abstract: Strong social distancing restrictions have been crucial to controlling the COVID-19 outbreak thus far, and the next question is when and how to relax these restrictions. A sequential timing of relaxing restrictions across groups is explored in order to identify policies that simultaneously reduce health risks and economic stagnation relative to current policies. The goal will be to mitigate health… ▽ More Strong social distancing restrictions have been crucial to controlling the COVID-19 outbreak thus far, and the next question is when and how to relax these restrictions. A sequential timing of relaxing restrictions across groups is explored in order to identify policies that simultaneously reduce health risks and economic stagnation relative to current policies. The goal will be to mitigate health risks, particularly among the most fragile sub-populations, while also managing the deleterious effect of restrictions on economic activity. The results of this paper show that a properly constructed sequential release of age-defined subgroups from strict social distancing protocols can lead to lower overall fatality rates than the simultaneous release of all individuals after a lockdown. The optimal release policy, in terms of minimizing overall death rate, must be sequential in nature, and it is important to properly time each step of the staggered release. This model allows for testing of various timing choices for staggered release policies, which can provide insights that may be helpful in the design, testing, and planning of disease management policies for the ongoing COVID-19 pandemic and future outbreaks. △ Less

Submitted 12 May, 2020; originally announced May 2020.

Comments: 22 pages (including Appendix), 10 figures

arXiv:1902.03429 [pdf]

Clustering Bioactive Molecules in 3D Chemical Space with Unsupervised Deep Learning

Authors: Chu Qin, Ying Tan, Shang Ying Chen, Xian Zeng, Xingxing Qi, Tian Jin, Huan Shi, Yiwei Wan, Yu Chen, Jingfeng Li, Weidong He, Yali Wang, Peng Zhang, Feng Zhu, Hongping Zhao, Yuyang Jiang, Yuzong Chen

Abstract: Unsupervised clustering has broad applications in data stratification, pattern investigation and new discovery beyond existing knowledge. In particular, clustering of bioactive molecules facilitates chemical space mapping, structure-activity studies, and drug discovery. These tasks, conventionally conducted by similarity-based methods, are complicated by data complexity and diversity. We ex-plored… ▽ More Unsupervised clustering has broad applications in data stratification, pattern investigation and new discovery beyond existing knowledge. In particular, clustering of bioactive molecules facilitates chemical space mapping, structure-activity studies, and drug discovery. These tasks, conventionally conducted by similarity-based methods, are complicated by data complexity and diversity. We ex-plored the superior learning capability of deep autoencoders for unsupervised clustering of 1.39 mil-lion bioactive molecules into band-clusters in a 3-dimensional latent chemical space. These band-clusters, displayed by a space-navigation simulation software, band molecules of selected bioactivity classes into individual band-clusters possessing unique sets of common sub-structural features beyond structural similarity. These sub-structural features form the frameworks of the literature-reported pharmacophores and privileged fragments. Within each band-cluster, molecules are further banded into selected sub-regions with respect to their bioactivity target, sub-structural features and molecular scaffolds. Our method is potentially applicable for big data clustering tasks of different fields. △ Less

Submitted 9 February, 2019; originally announced February 2019.

arXiv:1803.06393 [pdf, other]

Phylogeny-based tumor subclone identification using a Bayesian feature allocation model

Authors: Li Zeng, Joshua L. Warren, Hongyu Zhao

Abstract: Tumor cells acquire different genetic alterations during the course of evolution in cancer patients. As a result of competition and selection, only a few subgroups of cells with distinct genotypes survive. These subgroups of cells are often referred to as subclones. In recent years, many statistical and computational methods have been developed to identify tumor subclones, leading to biologically… ▽ More Tumor cells acquire different genetic alterations during the course of evolution in cancer patients. As a result of competition and selection, only a few subgroups of cells with distinct genotypes survive. These subgroups of cells are often referred to as subclones. In recent years, many statistical and computational methods have been developed to identify tumor subclones, leading to biologically significant discoveries and shedding light on tumor progression, metastasis, drug resistance and other processes. However, most existing methods are either not able to infer the phylogenetic structure among subclones, or not able to incorporate copy number variations (CNV). In this article, we propose SIFA (tumor Subclone Identification by Feature Allocation), a Bayesian model which takes into account both CNV and tumor phylogeny structure to infer tumor subclones. We compare the performance of SIFA with two other commonly used methods using simulation studies with varying sequencing depth, evolutionary tree size, and tree complexity. SIFA consistently yields better results in terms of Rand Index and cellularity estimation accuracy. The usefulness of SIFA is also demonstrated through its application to whole genome sequencing (WGS) samples from four patients in a breast cancer study. △ Less

Submitted 16 March, 2018; originally announced March 2018.

Comments: 35 pages, 11 figures

arXiv:1803.03910 [pdf, other]

A pathway-based kernel boosting method for sample classification using genomic data

Authors: Li Zeng, Zhaolong Yu, Hongyu Zhao

Abstract: The analysis of cancer genomic data has long suffered "the curse of dimensionality". Sample sizes for most cancer genomic studies are a few hundreds at most while there are tens of thousands of genomic features studied. Various methods have been proposed to leverage prior biological knowledge, such as pathways, to more effectively analyze cancer genomic data. Most of the methods focus on testing m… ▽ More The analysis of cancer genomic data has long suffered "the curse of dimensionality". Sample sizes for most cancer genomic studies are a few hundreds at most while there are tens of thousands of genomic features studied. Various methods have been proposed to leverage prior biological knowledge, such as pathways, to more effectively analyze cancer genomic data. Most of the methods focus on testing marginal significance of the associations between pathways and clinical phenotypes. They can identify relevant pathways, but do not involve predictive modeling. In this article, we propose a Pathway-based Kernel Boosting (PKB) method for integrating gene pathway information for sample classification, where we use kernel functions calculated from each pathway as base learners and learn the weights through iterative optimization of the classification loss function. We apply PKB and several competing methods to three cancer studies with pathological and clinical information, including tumor grade, stage, tumor sites, and metastasis status. Our results show that PKB outperforms other methods, and identifies pathways relevant to the outcome variables. △ Less

Submitted 11 March, 2018; originally announced March 2018.

arXiv:1712.04386 [pdf, other]

Hawkes Processes for Invasive Species Modeling and Management

Authors: Amrita Gupta, Mehrdad Farajtabar, Bistra Dilkina, Hongyuan Zha

Abstract: The spread of invasive species to new areas threatens the stability of ecosystems and causes major economic losses in agriculture and forestry. We propose a novel approach to minimizing the spread of an invasive species given a limited intervention budget. We first model invasive species propagation using Hawkes processes, and then derive closed-form expressions for characterizing the effect of an… ▽ More The spread of invasive species to new areas threatens the stability of ecosystems and causes major economic losses in agriculture and forestry. We propose a novel approach to minimizing the spread of an invasive species given a limited intervention budget. We first model invasive species propagation using Hawkes processes, and then derive closed-form expressions for characterizing the effect of an intervention action on the invasion process. We use this to obtain an optimal intervention plan based on an integer programming formulation, and compare the optimal plan against several ecologically-motivated heuristic strategies used in practice. We present an empirical study of two variants of the invasive control problem: minimizing the final rate of invasions, and minimizing the number of invasions at the end of a given time horizon. Our results show that the optimized intervention achieves nearly the same level of control that would be attained by completely eradicating the species, with a 20% cost saving. Additionally, we design a heuristic intervention strategy based on a combination of the density and life stage of the invasive individuals, and find that it comes surprisingly close to the optimized strategy, suggesting that this could serve as a good rule of thumb in invasive species management. △ Less

Submitted 12 December, 2017; originally announced December 2017.

arXiv:1710.04173 [pdf, other]

Structural Stability of Lexical Semantic Spaces: Nouns in Chinese and French

Authors: Sabine Ploux, Rui Wang, ZhengFeng Zhong, Hai Zhao, Yang Xin, Bao-Liang Lu

Abstract: Many studies in the neurosciences have dealt with the semantic processing of words or categories, but few have looked into the semantic organization of the lexicon thought as a system. The present study was designed to try to move towards this goal, using both electrophysiological and corpus-based data, and to compare two languages from different families: French and Mandarin Chinese. We conduct… ▽ More Many studies in the neurosciences have dealt with the semantic processing of words or categories, but few have looked into the semantic organization of the lexicon thought as a system. The present study was designed to try to move towards this goal, using both electrophysiological and corpus-based data, and to compare two languages from different families: French and Mandarin Chinese. We conducted an EEG-based semantic-decision experiment using 240 words from eight categories (clothing, parts of a house, tools, vehicles, fruits/vegetables, animals, body parts, and people) as the material. A data-analysis method (correspondence analysis) commonly used in computational linguistics was applied to the electrophysiological signals. The present cross-language comparison indicated stability for the following aspects of the languages' lexical semantic organizations: (1) the living/nonliving distinction, which showed up as a main factor for both languages; (2) greater dispersion of the living categories as compared to the nonliving ones; (3) prototypicality of the \emph{animals} category within the living categories, and with respect to the living/nonliving distinction; and (4) the existence of a person-centered reference gradient. Our electrophysiological analysis indicated stability of the networks at play in each of these processes. Stability was also observed in the data taken from word usage in the languages (synonyms and associated words obtained from textual corpora). △ Less

Submitted 11 October, 2017; originally announced October 2017.

Comments: 17 pages, 4 figures

arXiv:1701.02287 [pdf]

doi 10.1016/j.bpj.2017.02.020

Sedimentation of Reversibly Interacting Macromolecules with Changes in Fluorescence Quantum Yield

Authors: Sumit K. Chaturvedi, Huaying Zhao, Peter Schuck

Abstract: Sedimentation velocity analytical ultracentrifugation with fluorescence detection has emerged as a powerful method for the study of interacting systems of macromolecules. It combines picomolar sensitivity with high hydrodynamic resolution, and can be carried out with photoswitchable fluorophores for multi-component discrimination, to determine the stoichiometry, affinity, and shape of macromolecul… ▽ More Sedimentation velocity analytical ultracentrifugation with fluorescence detection has emerged as a powerful method for the study of interacting systems of macromolecules. It combines picomolar sensitivity with high hydrodynamic resolution, and can be carried out with photoswitchable fluorophores for multi-component discrimination, to determine the stoichiometry, affinity, and shape of macromolecular complexes with dissociation equilibrium constants from picomolar to micromolar. A popular approach for data interpretation is the determination of the binding affinity by isotherms of weight-average sedimentation coefficients, sw. A prevailing dogma in sedimentation analysis is that the weight-average sedimentation coefficient from the transport method corresponds to the signal- and population-weighted average of all species. We show that this does not always hold true for systems that exhibit significant signal changes with complex formation - properties that may be readily encountered in practice, e.g., from a change in fluorescence quantum yield. Coupled transport in the reaction boundary of rapidly reversible systems can make significant contributions to the observed migration in a way that cannot be accounted for in the standard population-based average. Effective particle theory provides a simple physical picture for the reaction-coupled migration process. On this basis we develop a more general binding model that converges to the well-known form of sw with constant signals, but can account simultaneously for hydrodynamic co-transport in the presence of changes in fluorescence quantum yield. We believe this will be useful when studying interacting systems exhibiting fluorescence quenching, enhancement or Forster resonance energy transfer with transport methods. △ Less

Submitted 21 February, 2017; v1 submitted 9 January, 2017; originally announced January 2017.

Comments: 22 pages, 5 figures

arXiv:1602.07181 [pdf]

doi 10.1371/journal.pone.0155201

3D-Printing for Analytical Ultracentrifugation

Authors: Abhiksha Desai, Jonathan Krynitsky, Thomas J. Pohida, Huaying Zhao, Peter Schuck

Abstract: Analytical ultracentrifugation (AUC) is a classical technique of physical biochemistry providing information on size, shape, and interactions of macromolecules from the analysis of their migration in centrifugal fields while free in solution. A key mechanical element in AUC is the centerpiece, a component of the sample cell assembly that is mounted between the optical windows to allow imaging and… ▽ More Analytical ultracentrifugation (AUC) is a classical technique of physical biochemistry providing information on size, shape, and interactions of macromolecules from the analysis of their migration in centrifugal fields while free in solution. A key mechanical element in AUC is the centerpiece, a component of the sample cell assembly that is mounted between the optical windows to allow imaging and to seal the sample solution column against high vacuum while exposed to gravitational forces in excess of 300,000 g. For sedimentation velocity it needs to be precisely sector-shaped to allow unimpeded radial macromolecular migration. During the history of AUC a great variety of centerpiece designs have been developed for different types of experiments. Here, we report that centerpieces can now be readily fabricated by 3D printing at low cost, from a variety of materials, and with customized designs. The new centerpieces can exhibit sufficient mechanical stability to withstand the gravitational forces at the highest rotor speeds and be sufficiently precise for sedimentation equilibrium and sedimentation velocity experiments. Sedimentation velocity experiments with bovine serum albumin as a reference molecule in 3D printed centerpieces with standard double-sector design result in sedimentation boundaries virtually indistinguishable from those in commercial double-sector epoxy centerpieces, with sedimentation coefficients well within the range of published values. The statistical error of the measurement is slightly above that obtained with commercial epoxy, but still below 1%. Facilitated by modern open-source design and fabrication paradigms, we believe 3D printed centerpieces and AUC accessories can spawn a variety of improvements in AUC experimental design, efficiency and resource allocation. △ Less

Submitted 23 February, 2016; originally announced February 2016.

Comments: 25 pages, 6 figures

Journal ref: PLoS One. 2016; 11(8):e0155201 PMID: 27525659

arXiv:1407.8382 [pdf, ps, other]

doi 10.1214/14-AOAS724

Detection boundary and Higher Criticism approach for rare and weak genetic effects

Authors: Zheyang Wu, Yiming Sun, Shiquan He, Judy Cho, Hongyu Zhao, Jiashun Jin

Abstract: Genome-wide association studies (GWAS) have identified many genetic factors underlying complex human traits. However, these factors have explained only a small fraction of these traits' genetic heritability. It is argued that many more genetic factors remain undiscovered. These genetic factors likely are weakly associated at the population level and sparsely distributed across the genome. In this… ▽ More Genome-wide association studies (GWAS) have identified many genetic factors underlying complex human traits. However, these factors have explained only a small fraction of these traits' genetic heritability. It is argued that many more genetic factors remain undiscovered. These genetic factors likely are weakly associated at the population level and sparsely distributed across the genome. In this paper, we adapt the recent innovations on Tukey's Higher Criticism (Tukey [The Higher Criticism (1976) Princeton Univ.]; Donoho and Jin [Ann. Statist. 32 (2004) 962-994]) to SNP-set analysis of GWAS, and develop a new theoretical framework in large-scale inference to assess the joint significance of such rare and weak effects for a quantitative trait. In the core of our theory is the so-called detection boundary, a curve in the two-dimensional phase space that quantifies the rarity and strength of genetic effects. Above the detection boundary, the overall effects of genetic factors are strong enough for reliable detection. Below the detection boundary, the genetic factors are simply too rare and too weak for reliable detection. We show that the HC-type methods are optimal in that they reliably yield detection once the parameters of the genetic effects fall above the detection boundary and that many commonly used SNP-set methods are suboptimal. The superior performance of the HC-type approach is demonstrated through simulations and the analysis of a GWAS data set of Crohn's disease. △ Less

Submitted 31 July, 2014; originally announced July 2014.

Comments: Published in at http://dx.doi.org/10.1214/14-AOAS724 the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOAS-AOAS724

Journal ref: Annals of Applied Statistics 2014, Vol. 8, No. 2, 824-851

arXiv:1310.7249 [pdf, other]

A Spatial Simulation Approach to Account for Protein Structure When Identifying Non-Random Somatic Mutations

Authors: Gregory Ryslik, Yuwei Cheng, Kei-Hoi Cheung, Robert Bjornson, Daniel Zelterman, Yorgo Modis, Hongyu Zhao

Abstract: Background: Current research suggests that a small set of "driver" mutations are responsible for tumorigenesis while a larger body of "passenger" mutations occurs in the tumor but does not progress the disease. Due to recent pharmacological successes in treating cancers caused by driver mutations, a variety of of methodologies that attempt to identify such mutations have been developed. Based on t… ▽ More Background: Current research suggests that a small set of "driver" mutations are responsible for tumorigenesis while a larger body of "passenger" mutations occurs in the tumor but does not progress the disease. Due to recent pharmacological successes in treating cancers caused by driver mutations, a variety of of methodologies that attempt to identify such mutations have been developed. Based on the hypothesis that driver mutations tend to cluster in key regions of the protein, the development of cluster identification algorithms has become critical. Results: We have developed a novel methodology, SpacePAC (Spatial Protein Amino acid Clustering), that identifies mutational clustering by considering the protein tertiary structure directly in 3D space. By combining the mutational data in the Catalogue of Somatic Mutations in Cancer (COSMIC) and the spatial information in the Protein Data Bank (PDB), SpacePAC is able to identify novel mutation clusters in many proteins such as FGFR3 and CHRM2. In addition, SpacePAC is better able to localize the most significant mutational hotspots as demonstrated in the cases of BRAF and ALK. The R package is available on Bioconductor at: http://www.bioconductor.org/packages/release/bioc/html/SpacePAC.html Conclusion: SpacePAC adds a valuable tool to the identification of mutational clusters while considering protein tertiary structure △ Less

Submitted 28 October, 2013; v1 submitted 27 October, 2013; originally announced October 2013.

Comments: 16 pages, 8 Figures, 4 Tables

arXiv:1309.5337 [pdf, other]

Change Point Analysis of Histone Modifications Reveals Epigenetic Blocks Linking to Physical Domains

Authors: Mengjie Chen, Haifan Lin, Hongyu Zhao

Abstract: Histone modification is a vital epigenetic mechanism for transcriptional control in eukaryotes. High-throughput techniques have enabled whole-genome analysis of histone modifications in recent years. However, most studies assume one combination of histone modification invariantly translates to one transcriptional output regardless of local chromatin environment. In this study we hypothesize that,… ▽ More Histone modification is a vital epigenetic mechanism for transcriptional control in eukaryotes. High-throughput techniques have enabled whole-genome analysis of histone modifications in recent years. However, most studies assume one combination of histone modification invariantly translates to one transcriptional output regardless of local chromatin environment. In this study we hypothesize that, the genome is organized into local domains that manifest similar enrichment pattern of histone modification, which leads to orchestrated regulation of expression of genes with relevant bio- logical functions. We propose a multivariate Bayesian Change Point (BCP) model to segment the Drosophila melanogaster genome into consecutive blocks on the basis of combinatorial patterns of histone marks. By modeling the sparse distribution of histone marks across the chromosome with a zero-inflated Gaussian mixture, our partitions capture local BLOCKs that manifest relatively homogeneous enrichment pattern of histone modifications. We further characterized BLOCKs by their transcription levels, distribution of genes, degree of co-regulation and GO enrichment. Our results demonstrate that these BLOCKs, although inferred merely from histone modifications, reveal strong relevance with physical domains, which suggest their important roles in chromatin organization and coordinated gene regulation. △ Less

Submitted 9 May, 2014; v1 submitted 20 September, 2013; originally announced September 2013.

Comments: 23 pages, 6 figures

arXiv:1307.8229 [pdf, other]

Posterior Contraction Rates of the Phylogenetic Indian Buffet Processes

Authors: Mengjie Chen, Chao Gao, Hongyu Zhao

Abstract: By expressing prior distributions as general stochastic processes, nonparametric Bayesian methods provide a flexible way to incorporate prior knowledge and constrain the latent structure in statistical inference. The Indian buffet process (IBP) is such an example that can be used to define a prior distribution on infinite binary features, where the exchangeability among subjects is assumed. The ph… ▽ More By expressing prior distributions as general stochastic processes, nonparametric Bayesian methods provide a flexible way to incorporate prior knowledge and constrain the latent structure in statistical inference. The Indian buffet process (IBP) is such an example that can be used to define a prior distribution on infinite binary features, where the exchangeability among subjects is assumed. The phylogenetic Indian buffet process (pIBP), a derivative of IBP, enables the modeling of non-exchangeability among subjects through a stochastic process on a rooted tree, which is similar to that used in phylogenetics, to describe relationships among the subjects. In this paper, we study the theoretical properties of IBP and pIBP under a binary factor model. We establish the posterior contraction rates for both IBP and pIBP and substantiate the theoretical results through simulation studies. This is the first work addressing the frequentist property of the posterior behaviors of IBP and pIBP. We also demonstrated its practical usefulness by applying pIBP prior to a real data example arising in the field of cancer genomics where the exchangeability among subjects is violated. △ Less

Submitted 19 May, 2015; v1 submitted 31 July, 2013; originally announced July 2013.

arXiv:1304.7417 [pdf, other]

Improving genetic risk prediction by leveraging pleiotropy

Authors: Cong Li, Can Yang, Joel Gelernter, Hongyu Zhao

Abstract: An important task of human genetics studies is to accurately predict disease risks in individuals based on genetic markers, which allows for identifying individuals at high disease risks, and facilitating their disease treatment and prevention. Although hundreds of genome-wide association studies (GWAS) have been conducted on many complex human traits in recent years, there has been only limited s… ▽ More An important task of human genetics studies is to accurately predict disease risks in individuals based on genetic markers, which allows for identifying individuals at high disease risks, and facilitating their disease treatment and prevention. Although hundreds of genome-wide association studies (GWAS) have been conducted on many complex human traits in recent years, there has been only limited success in translating these GWAS data into clinically useful risk prediction models. The predictive capability of GWAS data is largely bottlenecked by the available training sample size due to the presence of numerous variants carrying only small to modest effects. Recent studies have shown that different human traits may share common genetic bases. Therefore, an attractive strategy to increase the training sample size and hence improve the prediction accuracy is to integrate data of genetically correlated phenotypes. Yet the utility of genetic correlation in risk prediction has not been explored in the literature. In this paper, we analyzed GWAS data for bipolar and related disorders (BARD) and schizophrenia (SZ) with a bivariate ridge regression method, and found that jointly predicting the two phenotypes could substantially increase prediction accuracy as measured by the AUC (area under the receiver operating characteristic curve). We also found similar prediction accuracy improvements when we jointly analyzed GWAS data for Crohn's disease (CD) and ulcerative colitis (UC). The empirical observations were substantiated through our comprehensive simulation studies, suggesting that a gain in prediction accuracy can be obtained by combining phenotypes with relatively high genetic correlations. Through both real data and simulation studies, we demonstrated pleiotropy as a valuable asset that opens up a new opportunity to improve genetic risk prediction in the future. △ Less

Submitted 19 August, 2013; v1 submitted 27 April, 2013; originally announced April 2013.

arXiv:1303.5889 [pdf, other]

A Graph Theoretic Approach to Utilizing Protein Structure to Identify Non-Random Somatic Mutations

Authors: Gregory Ryslik, Yuwei Cheng, Kei-Hoi Cheung, Yorgo Modis, Hongyu Zhao

Abstract: Background: It is well known that the development of cancer is caused by the accumulation of somatic mutations within the genome. For oncogenes specifically, current research suggests that there is a small set of "driver" mutations that are primarily responsible for tumorigenesis. Further, due to some recent pharmacological successes in treating these driver mutations and their resulting tumors, a… ▽ More Background: It is well known that the development of cancer is caused by the accumulation of somatic mutations within the genome. For oncogenes specifically, current research suggests that there is a small set of "driver" mutations that are primarily responsible for tumorigenesis. Further, due to some recent pharmacological successes in treating these driver mutations and their resulting tumors, a variety of methods have been developed to identify potential driver mutations using methods such as machine learning and mutational clustering. We propose a novel methodology that increases our power to identify mutational clusters by taking into account protein tertiary structure via a graph theoretical approach. Results: We have designed and implemented GraphPAC (Graph Protein Amino Acid Clustering) to identify mutational clustering while considering protein spatial structure. Using GraphPAC, we are able to detect novel clusters in proteins that are known to exhibit mutation clustering as well as identify clusters in proteins without evidence of prior clustering based on current methods. Specifically, by utilizing the spatial information available in the Protein Data Bank (PDB) along with the mutational data in the Catalogue of Somatic Mutations in Cancer (COSMIC), GraphPAC identifies new mutational clusters in well known oncogenes such as EGFR and KRAS. Further, by utilizing graph theory to account for the tertiary structure, GraphPAC identifies clusters in DPP4, NRP1 and other proteins not identified by existing methods. The R package is available at: http://bioconductor.org/packages/release/bioc/html/GraphPAC.html Conclusion: GraphPAC provides an alternative to iPAC and an extension to current methodology when identifying potential activating driver mutations by utilizing a graph theoretic approach when considering protein tertiary structure. △ Less

Submitted 12 July, 2013; v1 submitted 23 March, 2013; originally announced March 2013.

Comments: 25 pages, 8 figures, 3 Tables

arXiv:1302.6977 [pdf, ps, other]

doi 10.1186/1471-2105-14-190

Utilizing Protein Structure to Identify Non-Random Somatic Mutations

Authors: Gregory Ryslik, Yuwei Cheng, Kei-Hoi Cheung, Yorgo Modis, Hongyu Zhao

Abstract: Motivation: Human cancer is caused by the accumulation of somatic mutations in tumor suppressors and oncogenes within the genome. In the case of oncogenes, recent theory suggests that there are only a few key "driver" mutations responsible for tumorigenesis. As there have been significant pharmacological successes in developing drugs that treat cancers that carry these driver mutations, several me… ▽ More Motivation: Human cancer is caused by the accumulation of somatic mutations in tumor suppressors and oncogenes within the genome. In the case of oncogenes, recent theory suggests that there are only a few key "driver" mutations responsible for tumorigenesis. As there have been significant pharmacological successes in developing drugs that treat cancers that carry these driver mutations, several methods that rely on mutational clustering have been developed to identify them. However, these methods consider proteins as a single strand without taking their spatial structures into account. We propose a new methodology that incorporates protein tertiary structure in order to increase our power when identifying mutation clustering. Results: We have developed a novel algorithm, iPAC: identification of Protein Amino acid Clustering, for the identification of non-random somatic mutations in proteins that takes into account the three dimensional protein structure. By using the tertiary information, we are able to detect both novel clusters in proteins that are known to exhibit mutation clustering as well as identify clusters in proteins without evidence of clustering based on existing methods. For example, by combining the data in the Protein Data Bank (PDB) and the Catalogue of Somatic Mutations in Cancer, our algorithm identifies new mutational clusters in well known cancer proteins such as KRAS and PI3KCa. Further, by utilizing the tertiary structure, our algorithm also identifies clusters in EGFR, EIF2AK2, and other proteins that are not identified by current methodology. △ Less

Submitted 27 February, 2013; originally announced February 2013.

arXiv:0810.1968 [pdf, other]

A Simulation of Blood Cells in Branching Capillaries

Authors: Amir H. G. Isfahani, Hong Zhao, Jonathan B. Freund

Abstract: The multi-cellular hydrodynamic interactions play a critical role in the phenomenology of blood flow in the microcirculation. A fast algorithm has been developed to simulate large numbers of cells modeled as elastic thin membranes. For red blood cells, which are the dominant component in blood, the membrane has strong resistance to surface dilatation but is flexible in bending. Our numerical met… ▽ More The multi-cellular hydrodynamic interactions play a critical role in the phenomenology of blood flow in the microcirculation. A fast algorithm has been developed to simulate large numbers of cells modeled as elastic thin membranes. For red blood cells, which are the dominant component in blood, the membrane has strong resistance to surface dilatation but is flexible in bending. Our numerical method solves the boundary integral equations built upon Green's functions for Stokes flow in periodic domains. This fluid dynamics video is an example of the capabilities of this model in handling complex geometries with a multitude of different cells. The capillary branch geometries have been modeled based upon observed capillary networks. The diameter of the branches varies between 10-20 mum. A constant mean pressure gradient drives the flow. For the purpose of this fluid dynamics video, the red blood cells are initiated as biconcave discs and white blood cells and platelets are initiated as spheres and ellipsoids respectively. The hematocrit as well as the average velocity vary in the different branches due to the nonlinear interactions of the cells. Physiologically, red blood cells enclose a concentrated solution of hemoglobin which is about five times more viscous than plasma but for this simulation the viscosity of the fluid inside and outside of red blood cells have been taken to be the same. △ Less

Submitted 10 October, 2008; originally announced October 2008.

Showing 1–41 of 41 results for author: Zhao, H