Search | arXiv e-print repository

Molecular Graph Representation Learning Integrating Large Language Models with Domain-specific Small Models

Authors: Tianyu Zhang, Yuxiang Ren, Chengbin Hou, Hairong Lv, Xuegong Zhang

Abstract: Molecular property prediction is a crucial foundation for drug discovery. In recent years, pre-trained deep learning models have been widely applied to this task. Some approaches that incorporate prior biological domain knowledge into the pre-training framework have achieved impressive results. However, these methods heavily rely on biochemical experts, and retrieving and summarizing vast amounts of domain knowledge literature is both time-consuming and expensive. Large Language Models (LLMs) have demonstrated remarkable performance in understanding and efficiently providing general knowledge. Nevertheless, they occasionally exhibit hallucinations and lack precision in generating domain-specific knowledge. Conversely, Domain-specific Small Models (DSMs) possess rich domain knowledge and can accurately calculate molecular domain-related metrics. However, due to their limited model size and singular functionality, they lack the breadth of knowledge necessary for comprehensive representation learning. To leverage the advantages of both approaches in molecular property prediction, we propose a novel Molecular Graph representation learning framework that integrates Large language models and Domain-specific small models (MolGraph-LarDo). Technically, we design a two-stage prompt strategy where DSMs are introduced to calibrate the knowledge provided by LLMs, enhancing the accuracy of domain-specific information and thus enabling LLMs to generate more precise textual descriptions for molecular samples. Subsequently, we employ a multi-modal alignment method to coordinate various modalities, including molecular graphs and their corresponding descriptive texts, to guide the pre-training of molecular representations. Extensive experiments demonstrate the effectiveness of the proposed method.

Submitted 19 August, 2024; originally announced August 2024.

arXiv:2407.13637 [pdf]

Autonomous self-evolving research on biomedical data: the DREAM paradigm

Authors: Luojia Deng, Yijie Wu, Yongyong Ren, Hui Lu

Abstract: In contemporary biomedical research, the efficiency of data-driven approaches is hindered by large data volumes, tool selection complexity, and human resource limitations, necessitating the development of fully autonomous research systems to meet complex analytical needs. Such a system should include the ability to autonomously generate research questions, write analytical code, configure the computational environment, judge and interpret the results, and iteratively generate in-depth questions or solutions, all without human intervention. Here we developed DREAM, the first biomedical Data-dRiven self-Evolving Autonomous systeM, which can independently conduct scientific research without human involvement. Utilizing a clinical dataset and two omics datasets, DREAM demonstrated its ability to raise and deepen scientific questions, with difficulty scores for clinical data questions surpassing top published articles by 5.7% and outperforming GPT-4 and bioinformatics graduate students by 58.6% and 56.0%, respectively. Overall, DREAM has a success rate of 80% in autonomous clinical data mining. Humans can also participate in different steps of DREAM to achieve more personalized goals. After evolution, 10% of the questions exceeded the average scores of top published article questions on originality and complexity. In the autonomous environment configuration of the eight bioinformatics workflows, DREAM exhibited an 88% success rate, whereas GPT-4 failed to configure any workflows. On the clinical dataset, DREAM was over 10,000 times more efficient than the average scientist with a single computer core, and capable of revealing new discoveries. As a self-evolving autonomous research system, DREAM provides an efficient and reliable solution for future biomedical research. This paradigm may also have a revolutionary impact on other data-driven scientific research fields.

Submitted 10 August, 2024; v1 submitted 18 July, 2024; originally announced July 2024.

Comments: 11 pages, 4 figures, content added, typos in figure corrected, references revised and font changed

arXiv:2406.10391 [pdf, other]

BEACON: Benchmark for Comprehensive RNA Tasks and Language Models

Authors: Yuchen Ren, Zhiyuan Chen, Lifeng Qiao, Hongtai Jing, Yuchen Cai, Sheng Xu, Peng Ye, Xinzhu Ma, Siqi Sun, Hongliang Yan, Dong Yuan, Wanli Ouyang, Xihui Liu

Abstract: RNA plays a pivotal role in translating genetic instructions into functional outcomes, underscoring its importance in biological processes and disease mechanisms. Despite the emergence of numerous deep learning approaches for RNA, particularly universal RNA language models, there remains a significant lack of standardized benchmarks to assess the effectiveness of these methods. In this study, we introduce the first comprehensive RNA benchmark BEACON (BEnchmArk for COmprehensive RNA Task and Language Models). First, BEACON comprises 13 distinct tasks derived from extensive previous work covering structural analysis, functional studies, and engineering applications, enabling a comprehensive assessment of the performance of methods on various RNA understanding tasks. Second, we examine a range of models, including traditional approaches like CNNs, as well as advanced RNA foundation models based on language models, offering valuable insights into the task-specific performances of these models. Third, we investigate the vital RNA language model components from the tokenizer and positional encoding aspects. Notably, our findings emphasize the superiority of single nucleotide tokenization and the effectiveness of Attention with Linear Biases (ALiBi) over traditional positional encoding methods. Based on these insights, a simple yet strong baseline called BEACON-B is proposed, which can achieve outstanding performance with limited data and computational resources. The datasets and source code of our benchmark are available at https://github.com/terry-r123/RNABenchmark.

Submitted 14 June, 2024; originally announced June 2024.

arXiv:2403.03425 [pdf, other]

Sculpting Molecules in 3D: A Flexible Substructure Aware Framework for Text-Oriented Molecular Optimization

Authors: Kaiwei Zhang, Yange Lin, Guangcheng Wu, Yuxiang Ren, Xuecang Zhang, Bo Wang, Xiaoyu Zhang, Weitao Du

Abstract: The integration of deep learning, particularly AI-Generated Content, with high-quality data derived from ab initio calculations has emerged as a promising avenue for transforming the landscape of scientific research. However, the challenge of designing molecular drugs or materials that incorporate multi-modality prior knowledge remains a critical and complex undertaking. Specifically, achieving a practical molecular design necessitates not only meeting the diversity requirements but also addressing structural and textural constraints with various symmetries outlined by domain experts. In this article, we present an innovative approach to tackle this inverse design problem by formulating it as a multi-modality guidance generation/optimization task. Our proposed solution involves a textural-structure alignment symmetric diffusion framework for the implementation of molecular generation/optimization tasks, namely 3DToMolo. 3DToMolo aims to harmonize diverse modalities, aligning them seamlessly to produce molecular structures that adhere to the symmetric structural and textural constraints specified by experts in the field. Experimental trials across three guidance generation settings have shown a superior hit generation performance compared to state-of-the-art methodologies. Moreover, 3DToMolo demonstrates the capability to generate novel molecules, incorporating specified target substructures, without the need for prior knowledge. This work not only holds general significance for the advancement of deep learning methodologies but also paves the way for a transformative shift in molecular design strategies. 3DToMolo creates opportunities for a more nuanced and effective exploration of the vast chemical space, opening new frontiers in the development of molecular entities with tailored properties and functionalities.

Submitted 5 March, 2024; originally announced March 2024.

arXiv:2309.17366 [pdf, other]

3D-Mol: A Novel Contrastive Learning Framework for Molecular Property Prediction with 3D Information

Authors: Taojie Kuang, Yiming Ren, Zhixiang Ren

Abstract: Molecular property prediction, crucial for early drug candidate screening and optimization, has seen advancements with deep learning-based methods. While deep learning-based methods have advanced considerably, they often fall short in fully leveraging 3D spatial information. Specifically, current molecular encoding techniques tend to inadequately extract spatial information, leading to ambiguous representations where a single one might represent multiple distinct molecules. Moreover, existing molecular modeling methods focus predominantly on the most stable 3D conformations, neglecting other viable conformations present in reality. To address these issues, we propose 3D-Mol, a novel approach designed for more accurate spatial structure representation. It deconstructs molecules into three hierarchical graphs to better extract geometric information. Additionally, 3D-Mol leverages contrastive learning for pretraining on 20 million unlabeled data, treating their conformations with identical topological structures as weighted positive pairs and contrasting ones as negatives, based on the similarity of their 3D conformation descriptors and fingerprints. We compare 3D-Mol with various state-of-the-art baselines on 7 benchmarks and demonstrate our outstanding performance.

Submitted 27 June, 2024; v1 submitted 28 September, 2023; originally announced September 2023.

arXiv:2308.06911 [pdf, other]

doi 10.1016/j.compbiomed.2024.108073

GIT-Mol: A Multi-modal Large Language Model for Molecular Science with Graph, Image, and Text

Authors: Pengfei Liu, Yiming Ren, Jun Tao, Zhixiang Ren

Abstract: Large language models have made significant strides in natural language processing, enabling innovative applications in molecular science by processing textual representations of molecules. However, most existing language models cannot capture the rich information with complex molecular structures or images. In this paper, we introduce GIT-Mol, a multi-modal large language model that integrates the Graph, Image, and Text information. To facilitate the integration of multi-modal molecular data, we propose GIT-Former, a novel architecture that is capable of aligning all modalities into a unified latent space. We achieve a 5%-10% accuracy increase in properties prediction and a 20.2% boost in molecule generation validity compared to the baselines. With the any-to-language molecular translation strategy, our model has the potential to perform more downstream tasks, such as compound name recognition and chemical reaction prediction.

Submitted 6 February, 2024; v1 submitted 13 August, 2023; originally announced August 2023.

Comments: The article has been accepted by Computers in Biology and Medicine, with 14 pages and 4 figures

Journal ref: Computers in Biology and Medicine, 108073, 2024, ISSN 0010-4825

arXiv:2308.01921 [pdf, other]

Transferable Graph Neural Fingerprint Models for Quick Response to Future Bio-Threats

Authors: Wei Chen, Yihui Ren, Ai Kagawa, Matthew R. Carbone, Samuel Yen-Chi Chen, Xiaohui Qu, Shinjae Yoo, Austin Clyde, Arvind Ramanathan, Rick L. Stevens, Hubertus J. J. van Dam, Deyu Lu

Abstract: Fast screening of drug molecules based on the ligand binding affinity is an important step in the drug discovery pipeline. Graph neural fingerprint is a promising method for developing molecular docking surrogates with high throughput and great fidelity. In this study, we built a COVID-19 drug docking dataset of about 300,000 drug candidates on 23 coronavirus protein targets. With this dataset, we trained graph neural fingerprint docking models for high-throughput virtual COVID-19 drug screening. The graph neural fingerprint models yield high prediction accuracy on docking scores with the mean squared error lower than 0.21 kcal/mol for most of the docking targets, showing significant improvement over conventional circular fingerprint methods. To make the neural fingerprints transferable for unknown targets, we also propose a transferable graph neural fingerprint method trained on multiple targets. With comparable accuracy to target-specific graph neural fingerprint models, the transferable model exhibits superb training and data efficiency. We highlight that the impact of this study extends beyond the COVID-19 dataset, as our approach for fast virtual ligand screening can be easily adapted and integrated into a general machine learning-accelerated pipeline to battle future bio-threats.

Submitted 14 September, 2023; v1 submitted 17 July, 2023; originally announced August 2023.

Comments: 8 pages, 5 figures, 2 tables, accepted by ICLMA2023

ACM Class: I.2.1

arXiv:2307.15719 [pdf]

Identifying acute illness phenotypes via deep temporal interpolation and clustering network on physiologic signatures

Authors: Yuanfang Ren, Yanjun Li, Tyler J. Loftus, Jeremy Balch, Kenneth L. Abbott, Shounak Datta, Matthew M. Ruppert, Ziyuan Guan, Benjamin Shickel, Parisa Rashidi, Tezcan Ozrazgat-Baslanti, Azra Bihorac

Abstract: Initial hours of hospital admission impact clinical trajectory, but early clinical decisions often suffer due to data paucity. With clustering analysis for vital signs within six hours of admission, patient phenotypes with distinct pathophysiological signatures and outcomes may support early clinical decisions. We created a single-center, longitudinal EHR dataset for 75,762 adults admitted to a tertiary care center for 6+ hours. We proposed a deep temporal interpolation and clustering network to extract latent representations from sparse, irregularly sampled vital sign data and derived distinct patient phenotypes in a training cohort (n=41,502). Model and hyper-parameters were chosen based on a validation cohort (n=17,415). Test cohort (n=16,845) was used to analyze reproducibility and correlation with biomarkers. The training, validation, and testing cohorts had similar distributions of age (54-55 yrs), sex (55% female), race, comorbidities, and illness severity. Four clusters were identified. Phenotype A (18%) had most comorbid disease with higher rate of prolonged respiratory insufficiency, acute kidney injury, sepsis, and three-year mortality. Phenotypes B (33%) and C (31%) had diffuse patterns of mild organ dysfunction. Phenotype B had favorable short-term outcomes but second-highest three-year mortality. Phenotype C had favorable clinical outcomes. Phenotype D (17%) had early/persistent hypotension, high rate of early surgery, and substantial biomarker rate of inflammation but second-lowest three-year mortality. After comparing phenotypes' SOFA scores, clustering results did not simply repeat other acuity assessments. In a heterogeneous cohort, four phenotypes with distinct categories of disease and outcomes were identified by a deep temporal interpolation and clustering network. This tool may impact triage decisions and clinical decision-support under time constraints.

Submitted 27 July, 2023; originally announced July 2023.

Comments: 28 pages (79 pages incl. supp. material), 4 figures, 2 tables, 19 supplementary figures, 9 supplementary tables

arXiv:2303.06071 [pdf]

Clinical Courses of Acute Kidney Injury in Hospitalized Patients: A Multistate Analysis

Authors: Esra Adiyeke, Yuanfang Ren, Ziyuan Guan, Matthew M. Ruppert, Parisa Rashidi, Azra Bihorac, Tezcan Ozrazgat-Baslanti

Abstract: Objectives: We aim to quantify longitudinal acute kidney injury (AKI) trajectories and to describe transitions through progressing and recovery states and outcomes among hospitalized patients using multistate models. Methods: In this large, longitudinal cohort study, 138,449 adult patients admitted to a quaternary care hospital between 2012 and 2019 were staged based on Kidney Disease: Improving Global Outcomes serum creatinine criteria for the first 14 days of their hospital stay. We fit multistate models to estimate probability of being in a certain clinical state at a given time after entering each one of the AKI stages. We investigated the effects of selected variables on transition rates via Cox proportional hazards regression models. Results: Twenty percent of hospitalized encounters (49,325/246,964) had AKI; among patients with AKI, 66% had Stage 1 AKI, 18% had Stage 2 AKI, and 17% had AKI Stage 3 with or without RRT. At seven days following Stage 1 AKI, 69% (95% confidence interval [CI]: 68.8%-70.5%) were either resolved to No AKI or discharged, while smaller proportions of recovery (26.8%, 95% CI: 26.1%-27.5%) and discharge (17.4%, 95% CI: 16.8%-18.0%) were observed following AKI Stage 2. At 14 days following Stage 1 AKI, patients with more frail conditions (Charlson comorbidity index greater than or equal to 3 and had prolonged ICU stay) had lower proportion of transitioning to No AKI or discharge states. Discussion: Multistate analyses showed that the majority of Stage 2 and higher severity AKI patients could not resolve within seven days; therefore, strategies preventing the persistence or progression of AKI would contribute to the patients' life quality. Conclusions: We demonstrate multistate modeling framework's utility as a mechanism for a better understanding of the clinical course of AKI with the potential to facilitate treatment and resource planning.

Submitted 8 March, 2023; originally announced March 2023.

arXiv:2303.05504 [pdf]

Computable Phenotypes to Characterize Changing Patient Brain Dysfunction in the Intensive Care Unit

Authors: Yuanfang Ren, Tyler J. Loftus, Ziyuan Guan, Rayon Uddin, Benjamin Shickel, Carolina B. Maciel, Katharina Busl, Parisa Rashidi, Azra Bihorac, Tezcan Ozrazgat-Baslanti

Abstract: In the United States, more than 5 million patients are admitted annually to ICUs, with ICU mortality of 10%-29% and costs over $82 billion. Acute brain dysfunction status, delirium, is often underdiagnosed or undervalued. This study's objective was to develop automated computable phenotypes for acute brain dysfunction states and describe transitions among brain dysfunction states to illustrate the clinical trajectories of ICU patients. We created two single-center, longitudinal EHR datasets for 48,817 adult patients admitted to an ICU at UFH Gainesville (GNV) and Jacksonville (JAX). We developed algorithms to quantify acute brain dysfunction status including coma, delirium, normal, or death at 12-hour intervals of each ICU admission and to identify acute brain dysfunction phenotypes using continuous acute brain dysfunction status and k-means clustering approach. There were 49,770 admissions for 37,835 patients in UFH GNV dataset and 18,472 admissions for 10,982 patients in UFH JAX dataset. In total, 18% of patients had coma as the worst brain dysfunction status; every 12 hours, around 4%-7% would transit to delirium, 22%-25% would recover, 3%-4% would expire, and 67%-68% would remain in a coma in the ICU. Additionally, 7% of patients had delirium as the worst brain dysfunction status; around 6%-7% would transit to coma, 40%-42% would be no delirium, 1% would expire, and 51%-52% would remain delirium in the ICU. There were three phenotypes: persistent coma/delirium, persistently normal, and transition from coma/delirium to normal almost exclusively in first 48 hours after ICU admission. We developed phenotyping scoring algorithms that determined acute brain dysfunction status every 12 hours while admitted to the ICU. This approach may be useful in developing prognostic and decision-support tools to aid patients and clinicians in decision-making on resource use and escalation of care.

Submitted 9 March, 2023; originally announced March 2023.

Comments: 21 pages, 5 figures, 3 tables, 1 eTable

arXiv:2211.06262

Principles for generation of reverberation

Authors: Yi Ren, Yanyang Xiao, Guo-Qiang Bi, Pek-Ming Lau

Abstract: In modern neuroscience, memory has been postulated to be stored in neural circuits as sequential spike trains, and reverberation is one specific example. Former research has made much progress on describing the phenomenon. However, the mechanism of reverberation has remained unclear. In this study, combining electrophysiological recording and numerical simulation, we first confirmed a formerly unrealized neuron property that is necessary for the burst generation in reverberation. Secondly, we found the mechanism of sequential pattern generation, which is clearly explained by network topology and asynchronous neurotransmitter release. In addition, we developed a pipeline that can design a network to fire in a manually set order. Thirdly, we explored the dynamics of STDP learning and traced the effects of the STDP rule in reverberation. With these understandings, we developed an STDP-based learning rule which could drive the network to remember any presupposed sequence. These results indicated that neuron circuit can remember malformation through STDP rule. This information is stored in synapse connections. In this way, animals remember information as spike sequence patterns.

Submitted 28 November, 2022; v1 submitted 11 November, 2022; originally announced November 2022.

Comments: There are some mistakes in the figure and expression. We need a bit long time to fix it, and it's my duty to stop an immature and misleading paper spread on the internet.

arXiv:2008.06642 [pdf, other]

Group Testing Enables Asymptomatic Screening for COVID-19 Mitigation: Feasibility and Optimal Pool Size Selection with Dilution Effects

Authors: Yifan Lin, Yuxuan Ren, Jingyuan Wan, Massey Cashore, Jiayue Wan, Yujia Zhang, Peter Frazier, Enlu Zhou

Abstract: Repeated asymptomatic screening for SARS-CoV-2 promises to control spread of the virus but would require too many resources to implement at scale. Group testing is promising for screening more people with fewer test resources: multiple samples tested together in one pool can be excluded with one negative test result. Existing approaches to group testing design for SARS-CoV-2 asymptomatic screening, however, do not consider dilution effects: that false negatives become more common with larger pools. As a consequence, they may recommend pool sizes that are too large or misestimate the benefits of screening. Modeling dilution effects, we derive closed-form expressions for the expected number of tests and false negative/positives per person screened under two popular group testing methods: the linear and square array methods. We find that test error correlation induced by a common viral load across an individual's samples results in many fewer false negatives than would be expected from less realistic but more widely assumed independent errors. This insight also suggests that false positives can be controlled through repeated tests without significantly increasing false negatives. Using these closed-form expressions to trace a Pareto frontier over error rates and tests, we design testing protocols for repeated asymptomatic screening of a large population. We minimize disease prevalence by optimizing time-varying pool sizes and screening frequency constrained by daily test capacity and a false positive limit. This provides a testing protocol practitioners can use for mitigating COVID-19. In a case study, we demonstrate the effectiveness of this methodology in controlling spread.

Submitted 16 November, 2020; v1 submitted 14 August, 2020; originally announced August 2020.
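
The pool-size trade-off described in the abstract above can be illustrated with a minimal sketch. It assumes classic two-stage (Dorfman) pooling with a perfect test, which is a deliberate simplification: it ignores the dilution effects and test errors that the paper models, and it is not the authors' closed-form expressions for the linear or square array methods. The function names `expected_tests_per_person` and `best_pool_size` are hypothetical, chosen for illustration only.

```python
def expected_tests_per_person(prevalence: float, pool_size: int) -> float:
    """Expected tests per person under two-stage pooling with a perfect test.

    Assumption (not from the paper): every pool is tested once, and each
    member of a positive pool is then retested individually.
    """
    if pool_size < 1:
        raise ValueError("pool_size must be at least 1")
    if pool_size == 1:
        # No pooling: every person gets exactly one individual test.
        return 1.0
    # Probability that all samples in a pool are negative.
    p_pool_negative = (1.0 - prevalence) ** pool_size
    # One pooled test shared by pool_size people, plus one retest per
    # person whenever the pool comes back positive.
    return 1.0 / pool_size + (1.0 - p_pool_negative)


def best_pool_size(prevalence: float, max_pool_size: int = 50) -> int:
    """Pool size in 1..max_pool_size that minimizes expected tests per person."""
    return min(
        range(1, max_pool_size + 1),
        key=lambda n: expected_tests_per_person(prevalence, n),
    )


if __name__ == "__main__":
    for p in (0.001, 0.01, 0.05):
        n = best_pool_size(p)
        print(p, n, round(expected_tests_per_person(p, n), 4))
```

Under this idealized model the best pool size grows as prevalence falls. Once larger pools also raise the false-negative rate, as the abstract argues, the preferred pool size would be expected to be smaller than this sketch suggests.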

arXiv:2006.02431 [pdf, other]

Targeting SARS-CoV-2 with AI- and HPC-enabled Lead Generation: A First Data Release

Authors: Yadu Babuji, Ben Blaiszik, Tom Brettin, Kyle Chard, Ryan Chard, Austin Clyde, Ian Foster, Zhi Hong, Shantenu Jha, Zhuozhao Li, Xuefeng Liu, Arvind Ramanathan, Yi Ren, Nicholaus Saint, Marcus Schwarting, Rick Stevens, Hubertus van Dam, Rick Wagner

Abstract: Researchers across the globe are seeking to rapidly repurpose existing drugs or discover new drugs to counter the novel coronavirus disease (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). One promising approach is to train machine learning (ML) and artificial intelligence (AI) tools to screen large numbers of small molecules. As a contribution to that effort, we are aggregating numerous small molecules from a variety of sources, using high-performance computing (HPC) to compute diverse properties of those molecules, using the computed properties to train ML/AI models, and then using the resulting models for screening. In this first data release, we make available 23 datasets collected from community sources representing over 4.2 B molecules enriched with pre-computed: 1) molecular fingerprints to aid similarity searches, 2) 2D images of molecules to enable exploration and application of image-based deep learning methods, and 3) 2D and 3D molecular descriptors to speed development of machine learning models. This data release encompasses structural information on the 4.2 B molecules and 60 TB of pre-computed data. Future releases will expand the data to include more detailed molecular simulations, computed models, and other products.

Submitted 27 May, 2020; originally announced June 2020.

Comments: 11 pages, 5 figures

arXiv:2005.05163 [pdf]

Computable Phenotypes of Patient Acuity in the Intensive Care Unit

Authors: Yuanfang Ren, Jeremy Balch, Kenneth L. Abbott, Tyler J. Loftus, Benjamin Shickel, Parisa Rashidi, Azra Bihorac, Tezcan Ozrazgat-Baslanti

Abstract: Continuous monitoring and patient acuity assessments are key aspects of Intensive Care Unit (ICU) practice, but both are limited by time constraints imposed on healthcare providers. Moreover, anticipating clinical trajectories remains imprecise. The objectives of this study are to (1) develop an electronic phenotype of acuity using automated variable retrieval within the electronic health records and (2) describe transitions between acuity states that illustrate the clinical trajectories of ICU patients. We gathered two single-center, longitudinal electronic health record datasets for 51,372 adult ICU patients admitted to the University of Florida Health (UFH) Gainesville (GNV) and Jacksonville (JAX). We developed algorithms to quantify acuity status at four-hour intervals for each ICU admission and identify acuity phenotypes using continuous acuity status and k-means clustering approach. 51,073 admissions for 38,749 patients in the UFH GNV dataset and 22,219 admissions for 12,623 patients in the UFH JAX dataset had at least one ICU stay lasting more than four hours. There were three phenotypes: persistently stable, persistently unstable, and transitioning from unstable to stable. For stable patients, approximately 0.7%-1.7% would transition to unstable, 0.02%-0.1% would expire, 1.2%-3.4% would be discharged, and the remaining 96%-97% would remain stable in the ICU every four hours. For unstable patients, approximately 6%-10% would transition to stable, 0.4%-0.5% would expire, and the remaining 89%-93% would remain unstable in the ICU in the next four hours. We developed phenotyping algorithms for patient acuity status every four hours while admitted to the ICU. This approach may be useful in developing prognostic and clinical decision-support tools to aid patients, caregivers, and providers in shared decision-making processes regarding escalation of care and patient values.

Submitted 1 November, 2023; v1 submitted 27 April, 2020; originally announced May 2020.

arXiv:2004.13066 [pdf]

Application of Deep Interpolation Network for Clustering of Physiologic Time Series

Authors: Yanjun Li, Yuanfang Ren, Tyler J. Loftus, Shounak Datta, M. Ruppert, Ziyuan Guan, Dapeng Wu, Parisa Rashidi, Tezcan Ozrazgat-Baslanti, Azra Bihorac

Abstract: Background: During the early stages of hospital admission, clinicians must use limited information to make diagnostic and treatment decisions as patient acuity evolves. However, vital sign time series from patients are commonly both sparse and irregularly collected, which poses a significant challenge for machine and deep learning techniques that aim to analyze these data and help clinicians improve health outcomes. To deal with this problem, we propose a novel deep interpolation network to extract latent representations from sparse and irregularly sampled time-series vital signs measured within six hours of hospital admission. Methods: We created a single-center longitudinal dataset of electronic health record data for all (n=75,762) adult patient admissions to a tertiary care center lasting six hours or longer, using 55% of the dataset for training, 23% for validation, and 22% for testing. All raw time series within six hours of hospital admission were extracted for six vital signs (systolic blood pressure, diastolic blood pressure, heart rate, temperature, blood oxygen saturation, and respiratory rate). A deep interpolation network is proposed to learn from such irregular and sparse multivariate time series data and to extract fixed low-dimensional latent patterns. We use the k-means clustering algorithm to group the patient admissions into seven clusters. Findings: Training, validation, and testing cohorts had similar age (55-57 years), sex (55% female), and admission vital signs. Seven distinct clusters were identified. Interpretation: In a heterogeneous cohort of hospitalized patients, a deep interpolation network extracted representations from vital sign data measured within six hours of hospital admission. This approach may have important implications for clinical decision-support under time constraints and uncertainty.

Submitted 27 April, 2020; originally announced April 2020.

arXiv:2003.11817 [pdf]

Estimation of genome size using k-mer frequencies from corrected long reads

Authors: Hengchao Wang, Bo Liu, Yan Zhang, Fan Jiang, Yuwei Ren, Lijuan Yin, Hangwei Liu, Sen Wang, Wei Fan

Abstract: Third-generation long-read sequencing technologies, such as PacBio and Nanopore, have great advantages over second-generation Illumina sequencing in de novo assembly studies. However, due to their inherently low base accuracy, third-generation sequencing data cannot be used for k-mer counting or for estimating genomic profiles based on k-mer frequencies. Thus, in current genome projects, second-generation data is also necessary for accurately determining genome size and other genomic characteristics. We show that corrected third-generation data can be used to count k-mer frequencies and estimate genome size reliably, in place of second-generation data. Therefore, future genome projects can depend on only one sequencing technology to finish both assembly and k-mer analysis, which will largely decrease sequencing cost in both time and money. Moreover, we present a fast, lightweight tool, kmerfreq, and use it to perform all the k-mer counting tasks in this work. We have demonstrated that corrected third-generation sequencing data can be used to estimate genome size and developed a new open-source C/C++ k-mer counting tool, kmerfreq, which is freely available at https://github.com/fanagislab/kmerfreq.

Submitted 26 March, 2020; originally announced March 2020.

Comments: 24 pages in total, including main text and supplement. 1 main-text figure, 1 table, 3 supplemental figures, 8 supplemental tables

arXiv:1912.12587 [pdf]

Highly fluorescent copper nanoclusters for sensing and bioimaging

Authors: Yu An, Ying Ren, Jing Tang, Jun Chen, Baisong Chang

Abstract: Metal nanoclusters (NCs), typically consisting of a few to tens of metal atoms, bridge the gap between organometallic compounds and crystalline metal nanoparticles. As their size approaches the Fermi wavelength of electrons, metal NCs exhibit discrete energy levels, which in turn results in the emergence of intriguing physical and chemical (or physicochemical) properties, especially strong fluorescence. In the past few decades, dramatic growth has been witnessed in the development of different types of noble metal NCs (mainly AuNCs and AgNCs). However, compared with noble metals, copper is a relatively earth-abundant and cost-effective metal. Theoretical and experimental studies have shown that copper NCs (CuNCs) possess unique catalytic and photoluminescent properties. In this context, CuNCs are emerging as a new class of nontoxic, economical, and effective phosphors and catalysts, drawing significant interest across the life and medical sciences. To highlight these achievements, this review begins by providing an overview of a multitude of factors that play central roles in the fluorescence of CuNCs. Additionally, a critical perspective on how the aggregation of CuNCs can efficiently improve the stability, tunability, and intensity of their fluorescence is also discussed. Next, we present representative applications of CuNCs in detection and bioimaging. Finally, we outline current challenges and our perspective on the development of CuNCs.

Submitted 29 December, 2019; originally announced December 2019.

arXiv:1906.10006 [pdf, ps, other]

Cooperativity, Absolute Interaction, and Algebraic Optimization

Authors: Nidhi Kaihnsa, Yue Ren, Mohab Safey El Din, Johannes W. R. Martini

Abstract: We consider a measure of cooperativity based on the minimal absolute interaction required to generate an observed titration behavior. We describe the corresponding algebraic optimization problem and show how it can be solved using the nonlinear algebra tool \texttt{SCIP}. Moreover, we compute the minimal absolute interactions for various binding polynomials that describe the oxygen binding of various hemoglobins under different conditions. While the calculated minimal absolute interactions are consistent with the expected outcome of the chemical modifications, they rank the cooperativity of the molecules differently than the maximal Hill slope does.

Submitted 24 June, 2019; originally announced June 2019.

Comments: 21 pages

arXiv:1711.06865 [pdf, ps, other]

Decoupled molecules with binding polynomials of bidegree (n,2)

Authors: Yue Ren, Johannes W. R. Martini, Jacinta Torres

Abstract: We present a result on the number of decoupled molecules for systems binding two different types of ligands. In the case of $n$ and $2$ binding sites respectively, we show that, generically, there are $2(n!)^{2}$ decoupled molecules with the same binding polynomial. For molecules with more binding sites for the second ligand, we provide computational results.

Submitted 18 November, 2017; originally announced November 2017.

Comments: 18 pages, 8 figures

MSC Class: 92C40; 65H10; 68W30

Journal ref: Journal of Mathematical Biology (2019) https://doi.org/10.1007/s00285-018-1295-x

arXiv:1710.10391 [pdf]

Introduction and reconciliation of the ROS and aging paradoxes

Authors: Yaguang Ren, Chao Zhang

Abstract: This paper suggests that aging is influenced jointly by pro-aging factors such as ROS and anti-aging factors such as protective responses. The anti-aging effect may be a side effect of retrograde responses triggered by adverse circumstances. ROS may be more closely correlated with metabolism than with aging.

Submitted 28 October, 2017; originally announced October 2017.

arXiv:1708.02626 [pdf, other]

A combinatorial method for connecting BHV spaces representing different numbers of taxa

Authors: Yingying Ren, Sihan Zha, Jingwen Bi, José A. Sanchez, Cara Monical, Michelle Delcourt, Rosemary K. Guzman, Ruth Davidson

Abstract: The phylogenetic tree space introduced by Billera, Holmes, and Vogtmann (BHV tree space) is a CAT(0) continuous space that represents trees with edge weights with an intrinsic geodesic distance measure. The geodesic distance measure unique to BHV tree space is well known to be computable in polynomial time, which makes it a potentially powerful tool for optimization problems in phylogenetics and phylogenomics. Specifically, there is significant interest in comparing and combining phylogenetic trees. For example, BHV tree space has been shown to be potentially useful in tree summary and consensus methods, which require combining trees with different numbers of leaves. Yet an open problem is to transition between BHV tree spaces of different maximal dimension, where each maximal dimension corresponds to the complete set of edge-weighted trees with a fixed number of leaves. We show a combinatorial method to transition between copies of BHV tree spaces in which trees with different numbers of taxa can be studied, derived from its topological structure and geometric properties. This method removes obstacles for embedding problems such as supertree and consensus methods in the BHV tree space framework.

Submitted 3 December, 2017; v1 submitted 8 August, 2017; originally announced August 2017.

Comments: Updated section on applications and link to github software release

MSC Class: 46N60; 37F20; 90C57; 97K20; 05C05; 92B10

arXiv:1704.06086 [pdf]

Do ROS really slow down aging in C. elegans?

Authors: Yaguang Ren, Sixi Chen, Mengmeng Ma, Congjie Zhang, Kejie Wang, Feng Li, Wenxuan Guo, Jiatao Huang, Chao Zhang

Abstract: The view that ROS slow down aging is gaining popularity. Here we propose that aging is slowed down by secondary responses rather than by ROS.

Submitted 26 July, 2017; v1 submitted 20 April, 2017; originally announced April 2017.

arXiv:1312.0329 [pdf]

Cellphone based Portable Bacteria Pre-Concentrating microfluidic Sensor and Impedance Sensing System

Authors: Jing Jiang, Xinhao Wang, Ran Chao, Yukun Ren, Chengpeng Hu, Zhida Xu, Gang Logan Liu

Abstract: Portable low-cost sensors and sensing systems for the identification and quantitative measurement of bacteria in field water are critical in preventing drinking water from being contaminated by bacteria. In this article, we report the design, fabrication and testing of a low-cost, miniaturized and sensitive bacteria sensor based on electrical impedance spectroscopy, using a smartphone as the platform. Our microfluidic design enabled pre-concentration of the bacteria, which lowered the detection limit to 10 bacterial cells per milliliter. We envision that our demonstrated smartphone-based sensing system will realize highly sensitive and rapid in-field quantification of multiple species of bacteria and pathogens.

Submitted 1 December, 2013; originally announced December 2013.

Comments: 15 pages, 5 figures, accepted in Sensors and Actuators B: Chemical

arXiv:1307.4147 [pdf]

Exploring the mechanisms of protein folding

Authors: Ji Xu, Mengzhi Han, Ying Ren, Jinghai Li

Abstract: Neither of the two prevalent theories, namely thermodynamic stability and kinetic stability, provides a comprehensive understanding of protein folding. The thermodynamic theory is misleading because it assumes that free energy is the exclusive dominant mechanism of protein folding, and attributes the structural transition from one characteristic state to another to energy barriers. Conversely, the concept of kinetic stability overemphasizes dominant mechanisms that are related to kinetic factors. This article explores the stability condition of protein structures from the viewpoint of meso-science, paying attention to the compromise in the competition between minimum free energy and other dominant mechanisms. Based on our study of complex systems, we propose that protein folding is a meso-scale, dissipative, nonlinear and non-equilibrium process that is dominated by the compromise between free energy and other dominant mechanisms such as environmental factors. Consequently, a protein shows dynamic structures, featuring characteristic states that appear alternately and dynamically, only one of which is the state with minimum free energy. To provide evidence for this concept, we analyzed the time series of energetic and structural changes of three simulations of protein folding/unfolding. Our results indicate that thorough consideration of the multiple dynamic characteristic structures generated by multiple mechanisms may be the key to understanding protein folding.

Submitted 18 July, 2013; v1 submitted 15 July, 2013; originally announced July 2013.

Comments: 19 pages, 9 figures

Showing 1–24 of 24 results for author: Ren, Y