Search | arXiv e-print repository

Machine Learning Driven Biomarker Selection for Medical Diagnosis

Authors: Divyagna Bavikadi, Ayushi Agarwal, Shashank Ganta, Yunro Chung, Lusheng Song, Ji Qiu, Paulo Shakarian

Abstract: Recent advances in experimental methods have enabled researchers to collect data on thousands of analytes simultaneously. This has led to correlational studies that associated molecular measurements with diseases such as Alzheimer's, Liver, and Gastric Cancer. However, the use of thousands of biomarkers selected from the analytes is not practical for real-world medical diagnosis and is likely unde… ▽ More Recent advances in experimental methods have enabled researchers to collect data on thousands of analytes simultaneously. This has led to correlational studies that associated molecular measurements with diseases such as Alzheimer's, Liver, and Gastric Cancer. However, the use of thousands of biomarkers selected from the analytes is not practical for real-world medical diagnosis and is likely undesirable due to potentially formed spurious correlations. In this study, we evaluate 4 different methods for biomarker selection and 4 different machine learning (ML) classifiers for identifying correlations, evaluating 16 approaches in all. We found that contemporary methods outperform previously reported logistic regression in cases where 3 and 10 biomarkers are permitted. When specificity is fixed at 0.9, ML approaches produced a sensitivity of 0.240 (3 biomarkers) and 0.520 (10 biomarkers), while standard logistic regression provided a sensitivity of 0.000 (3 biomarkers) and 0.040 (10 biomarkers). We also noted that causal-based methods for biomarker selection proved to be the most performant when fewer biomarkers were permitted, while univariate feature selection was the most performant when a greater number of biomarkers were permitted. △ Less

Submitted 15 May, 2024; originally announced May 2024.

arXiv:2402.12993 [pdf, other]

An Autonomous Large Language Model Agent for Chemical Literature Data Mining

Authors: Kexin Chen, Hanqun Cao, Junyou Li, Yuyang Du, Menghao Guo, Xin Zeng, Lanqing Li, Jiezhong Qiu, Pheng Ann Heng, Guangyong Chen

Abstract: Chemical synthesis, which is crucial for advancing material synthesis and drug discovery, impacts various sectors including environmental science and healthcare. The rise of technology in chemistry has generated extensive chemical data, challenging researchers to discern patterns and refine synthesis processes. Artificial intelligence (AI) helps by analyzing data to optimize synthesis and increase… ▽ More Chemical synthesis, which is crucial for advancing material synthesis and drug discovery, impacts various sectors including environmental science and healthcare. The rise of technology in chemistry has generated extensive chemical data, challenging researchers to discern patterns and refine synthesis processes. Artificial intelligence (AI) helps by analyzing data to optimize synthesis and increase yields. However, AI faces challenges in processing literature data due to the unstructured format and diverse writing style of chemical literature. To overcome these difficulties, we introduce an end-to-end AI agent framework capable of high-fidelity extraction from extensive chemical literature. This AI agent employs large language models (LLMs) for prompt generation and iterative optimization. It functions as a chemistry assistant, automating data collection and analysis, thereby saving manpower and enhancing performance. Our framework's efficacy is evaluated using accuracy, recall, and F1 score of reaction condition data, and we compared our method with human experts in terms of content correctness and time efficiency. The proposed approach marks a significant advancement in automating chemical literature extraction and demonstrates the potential for AI to revolutionize data management and utilization in chemistry. △ Less

Submitted 20 February, 2024; originally announced February 2024.

arXiv:2401.10009 [pdf, other]

An optimization-based equilibrium measure describes non-equilibrium steady state dynamics: application to edge of chaos

Authors: Junbin Qiu, Haiping Huang

Abstract: Understanding neural dynamics is a central topic in machine learning, non-linear physics and neuroscience. However, the dynamics is non-linear, stochastic and particularly non-gradient, i.e., the driving force can not be written as gradient of a potential. These features make analytic studies very challenging. The common tool is the path integral approach or dynamical mean-field theory, but the dr… ▽ More Understanding neural dynamics is a central topic in machine learning, non-linear physics and neuroscience. However, the dynamics is non-linear, stochastic and particularly non-gradient, i.e., the driving force can not be written as gradient of a potential. These features make analytic studies very challenging. The common tool is the path integral approach or dynamical mean-field theory, but the drawback is that one has to solve the integro-differential or dynamical mean-field equations, which is computationally expensive and has no closed form solutions in general. From the aspect of associated Fokker-Planck equation, the steady state solution is generally unknown. Here, we treat searching for the steady states as an optimization problem, and construct an approximate potential related to the speed of the dynamics, and find that searching for the ground state of this potential is equivalent to running an approximate stochastic gradient dynamics or Langevin dynamics. Only in the zero temperature limit, the distribution of the original steady states can be achieved. The resultant stationary state of the dynamics follows exactly the canonical Boltzmann measure. Within this framework, the quenched disorder intrinsic in the neural networks can be averaged out by applying the replica method, which leads naturally to order parameters for the non-equilibrium steady states. Our theory reproduces the well-known result of edge-of-chaos, and further the order parameters characterizing the continuous transition are derived, and the order parameters are explained as fluctuations and responses of the steady states. Our method thus opens the door to analytically study the steady state landscape of the deterministic or stochastic high dimensional dynamics. △ Less

Submitted 7 June, 2024; v1 submitted 18 January, 2024; originally announced January 2024.

Comments: 21 pages, 9 figures, revised version 2

arXiv:2401.06173 [pdf, other]

Tree Search-Based Evolutionary Bandits for Protein Sequence Optimization

Authors: Jiahao Qiu, Hui Yuan, Jinghong Zhang, Wentao Chen, Huazheng Wang, Mengdi Wang

Abstract: While modern biotechnologies allow synthesizing new proteins and function measurements at scale, efficiently exploring a protein sequence space and engineering it remains a daunting task due to the vast sequence space of any given protein. Protein engineering is typically conducted through an iterative process of adding mutations to the wild-type or lead sequences, recombination of mutations, and… ▽ More While modern biotechnologies allow synthesizing new proteins and function measurements at scale, efficiently exploring a protein sequence space and engineering it remains a daunting task due to the vast sequence space of any given protein. Protein engineering is typically conducted through an iterative process of adding mutations to the wild-type or lead sequences, recombination of mutations, and running new rounds of screening. To enhance the efficiency of such a process, we propose a tree search-based bandit learning method, which expands a tree starting from the initial sequence with the guidance of a bandit machine learning model. Under simplified assumptions and a Gaussian Process prior, we provide theoretical analysis and a Bayesian regret bound, demonstrating that the combination of local search and bandit learning method can efficiently discover a near-optimal design. The full algorithm is compatible with a suite of randomized tree search heuristics, machine learning models, pre-trained embeddings, and bandit techniques. We test various instances of the algorithm across benchmark protein datasets using simulated screens. Experiment results demonstrate that the algorithm is both sample-efficient and able to find top designs using reasonably small mutation counts. △ Less

Submitted 8 January, 2024; originally announced January 2024.

Comments: AAAI 2024

arXiv:2309.04106 [pdf, other]

doi 10.1103/PhysRevE.109.044309

Meta predictive learning model of languages in neural circuits

Authors: Chan Li, Junbin Qiu, Haiping Huang

Abstract: Large language models based on self-attention mechanisms have achieved astonishing performances not only in natural language itself, but also in a variety of tasks of different nature. However, regarding processing language, our human brain may not operate using the same principle. Then, a debate is established on the connection between brain computation and artificial self-supervision adopted in… ▽ More Large language models based on self-attention mechanisms have achieved astonishing performances not only in natural language itself, but also in a variety of tasks of different nature. However, regarding processing language, our human brain may not operate using the same principle. Then, a debate is established on the connection between brain computation and artificial self-supervision adopted in large language models. One of most influential hypothesis in brain computation is the predictive coding framework, which proposes to minimize the prediction error by local learning. However, the role of predictive coding and the associated credit assignment in language processing remains unknown. Here, we propose a mean-field learning model within the predictive coding framework, assuming that the synaptic weight of each connection follows a spike and slab distribution, and only the distribution, rather than specific weights, is trained. This meta predictive learning is successfully validated on classifying handwritten digits where pixels are input to the network in sequence, and moreover on the toy and real language corpus. Our model reveals that most of the connections become deterministic after learning, while the output connections have a higher level of variability. The performance of the resulting network ensemble changes continuously with data load, further improving with more training data, in analogy with the emergent behavior of large language models. Therefore, our model provides a starting point to investigate the connection among brain computation, next-token prediction and general intelligence. △ Less

Submitted 9 October, 2023; v1 submitted 7 September, 2023; originally announced September 2023.

Comments: 31 pages, 6 figures, codes are available in the main text with the link

Journal ref: Phys. Rev. E 109, 044309 (2024)

arXiv:2306.04886 [pdf, other]

Multi-task Bioassay Pre-training for Protein-ligand Binding Affinity Prediction

Authors: Jiaxian Yan, Zhaofeng Ye, Ziyi Yang, Chengqiang Lu, Shengyu Zhang, Qi Liu, Jiezhong Qiu

Abstract: Protein-ligand binding affinity (PLBA) prediction is the fundamental task in drug discovery. Recently, various deep learning-based models predict binding affinity by incorporating the three-dimensional structure of protein-ligand complexes as input and achieving astounding progress. However, due to the scarcity of high-quality training data, the generalization ability of current models is still li… ▽ More Protein-ligand binding affinity (PLBA) prediction is the fundamental task in drug discovery. Recently, various deep learning-based models predict binding affinity by incorporating the three-dimensional structure of protein-ligand complexes as input and achieving astounding progress. However, due to the scarcity of high-quality training data, the generalization ability of current models is still limited. In addition, different bioassays use varying affinity measurement labels (i.e., IC50, Ki, Kd), and different experimental conditions inevitably introduce systematic noise, which poses a significant challenge to constructing high-precision affinity prediction models. To address these issues, we (1) propose Multi-task Bioassay Pre-training (MBP), a pre-training framework for structure-based PLBA prediction; (2) construct a pre-training dataset called ChEMBL-Dock with more than 300k experimentally measured affinity labels and about 2.8M docked three-dimensional structures. By introducing multi-task pre-training to treat the prediction of different affinity labels as different tasks and classifying relative rankings between samples from the same bioassay, MBP learns robust and transferrable structural knowledge from our new ChEMBL-Dock dataset with varied and noisy labels. Experiments substantiate the capability of MBP as a general framework that can improve and be tailored to mainstream structure-based PLBA prediction tasks. To the best of our knowledge, MBP is the first affinity pre-training model and shows great potential for future development. △ Less

Submitted 20 December, 2023; v1 submitted 7 June, 2023; originally announced June 2023.

Comments: 21 pages, 7 figures

arXiv:2208.06348 [pdf, other]

Can Brain Signals Reveal Inner Alignment with Human Languages?

Authors: William Han, Jielin Qiu, Jiacheng Zhu, Mengdi Xu, Douglas Weber, Bo Li, Ding Zhao

Abstract: Brain Signals, such as Electroencephalography (EEG), and human languages have been widely explored independently for many downstream tasks, however, the connection between them has not been well explored. In this study, we explore the relationship and dependency between EEG and language. To study at the representation level, we introduced \textbf{MTAM}, a \textbf{M}ultimodal \textbf{T}ransformer \… ▽ More Brain Signals, such as Electroencephalography (EEG), and human languages have been widely explored independently for many downstream tasks, however, the connection between them has not been well explored. In this study, we explore the relationship and dependency between EEG and language. To study at the representation level, we introduced \textbf{MTAM}, a \textbf{M}ultimodal \textbf{T}ransformer \textbf{A}lignment \textbf{M}odel, to observe coordinated representations between the two modalities. We used various relationship alignment-seeking techniques, such as Canonical Correlation Analysis and Wasserstein Distance, as loss functions to transfigure features. On downstream applications, sentiment analysis and relation detection, we achieved new state-of-the-art results on two datasets, ZuCo and K-EmoCon. Our method achieved an F1-score improvement of 1.7% on K-EmoCon and 9.3% on Zuco datasets for sentiment analysis, and 7.4% on ZuCo for relation detection. In addition, we provide interpretations of the performance improvement: (1) feature distribution shows the effectiveness of the alignment module for discovering and encoding the relationship between EEG and language; (2) alignment weights show the influence of different language semantics as well as EEG frequency features; (3) brain topographical maps provide an intuitive demonstration of the connectivity in the brain regions. Our code is available at \url{https://github.com/Jason-Qiu/EEG_Language_Alignment}. △ Less

Submitted 4 May, 2024; v1 submitted 10 August, 2022; originally announced August 2022.

Comments: EMNLP 2023 Findings

arXiv:2205.04871 [pdf, other]

doi 10.1038/s41398-022-02162-y

Accelerated functional brain aging in major depressive disorder: evidence from a large scale fMRI analysis of Chinese participants

Authors: Yunsong Luo, Wenyu Chen, Jiang Qiu, Tao Jia

Abstract: Major depressive disorder (MDD) is one of the most common mental health conditions that has been intensively investigated for its association with brain atrophy and mortality. Recent studies reveal that the deviation between the predicted and the chronological age can be a marker of accelerated brain aging to characterize MDD. However, current conclusions are usually drawn based on structural MRI… ▽ More Major depressive disorder (MDD) is one of the most common mental health conditions that has been intensively investigated for its association with brain atrophy and mortality. Recent studies reveal that the deviation between the predicted and the chronological age can be a marker of accelerated brain aging to characterize MDD. However, current conclusions are usually drawn based on structural MRI information collected from Caucasian participants. The universality of this biomarker needs to be further validated by subjects with different ethnic/racial backgrounds and by different types of data. Here we make use of the REST-meta-MDD, a large scale resting-state fMRI dataset collected from multiple cohort participants in China. We develop a stacking machine learning model based on 1101 healthy controls, which estimates a subject's chronological age from fMRI with promising accuracy. The trained model is then applied to 1276 MDD patients from 24 sites. We observe that MDD patients exhibit a $+4.43$ years ($\text{$p$} < 0.0001$, $\text{Cohen's $d$} = 0.35$, $\text{95\% CI}:1.86 - 3.91$) higher brain-predicted age difference (brain-PAD) compared to controls. In the MDD subgroup, we observe a statistically significant $+2.09$ years ($\text{$p$} < 0.05$, $\text{Cohen's $d$} = 0.134483$) brain-PAD in antidepressant users compared to medication-free patients. The statistical relationship observed is further checked by three different machine learning algorithms. The positive brain-PAD observed in participants in China confirms the presence of accelerated brain aging in MDD patients. The utilization of functional brain connectivity for age estimation verifies existing findings from a new dimension. △ Less

Submitted 8 May, 2022; originally announced May 2022.

Comments: 32 pages,13 figures

Journal ref: Transl Psychiatry 12, 397 (2022)

arXiv:2108.07435 [pdf, other]

Modeling Protein Using Large-scale Pretrain Language Model

Authors: Yijia Xiao, Jiezhong Qiu, Ziang Li, Chang-Yu Hsieh, Jie Tang

Abstract: Protein is linked to almost every life process. Therefore, analyzing the biological structure and property of protein sequences is critical to the exploration of life, as well as disease detection and drug discovery. Traditional protein analysis methods tend to be labor-intensive and time-consuming. The emergence of deep learning models makes modeling data patterns in large quantities of data poss… ▽ More Protein is linked to almost every life process. Therefore, analyzing the biological structure and property of protein sequences is critical to the exploration of life, as well as disease detection and drug discovery. Traditional protein analysis methods tend to be labor-intensive and time-consuming. The emergence of deep learning models makes modeling data patterns in large quantities of data possible. Interdisciplinary researchers have begun to leverage deep learning methods to model large biological datasets, e.g. using long short-term memory and convolutional neural network for protein sequence classification. After millions of years of evolution, evolutionary information is encoded in protein sequences. Inspired by the similarity between natural language and protein sequences, we use large-scale language models to model evolutionary-scale protein sequences, encoding protein biology information in representation. Significant improvements are observed in both token-level and sequence-level tasks, demonstrating that our large-scale model can accurately capture evolution information from pretraining on evolutionary-scale individual sequences. Our code and model are available at https://github.com/THUDM/ProteinLM. △ Less

Submitted 7 December, 2021; v1 submitted 17 August, 2021; originally announced August 2021.

Comments: Accepted paper in Pretrain@KDD 2021 (The International Workshop on Pretraining: Algorithms, Architectures, and Applications)

arXiv:2103.14321 [pdf, other]

Online Learning Koopman operator for closed-loop electrical neurostimulation in epilepsy

Authors: Zhichao Liang, Zixiang Luo, Keyin Liu, Jingwei Qiu, Quanying Liu

Abstract: Electrical neuromodulation as a palliative treatment has been increasingly used in the control of epilepsy. However, current neuromodulations commonly implement predetermined actuation strategies and lack the capability of self-adaptively adjusting stimulation inputs. In this work, rooted in optimal control theory, we propose a Koopman-MPC framework for real-time closed-loop electrical neuromodula… ▽ More Electrical neuromodulation as a palliative treatment has been increasingly used in the control of epilepsy. However, current neuromodulations commonly implement predetermined actuation strategies and lack the capability of self-adaptively adjusting stimulation inputs. In this work, rooted in optimal control theory, we propose a Koopman-MPC framework for real-time closed-loop electrical neuromodulation in epilepsy, which integrates i) a deep Koopman operator based dynamical model to predict the temporal evolution of epileptic EEG with an approximate finite-dimensional linear dynamics and ii) a model predictive control (MPC) module to design optimal seizure suppression strategies. The Koopman operator based linear dynamical model is embedded in the latent state space of the autoencoder neural network, in which we can approximate and update the Koopman operator online. The linear dynamical property of the Koopman operator ensures the convexity of the optimization problem for subsequent MPC control. The proposed deep Koopman operator model shows greater predictive capability than the baseline models (e.g., vector autoregressive model, kernel based method and recurrent neural network (RNN)) in both synthetic and real epileptic EEG data. Moreover, compared with the RNN-MPC framework, our Koopman-MPC framework can suppress seizure dynamics with better computational efficiency in both the Jansen-Rit model and the Epileptor model. Koopman-MPC framework opens a new window for model-based closed-loop neuromodulation and sheds light on nonlinear neurodynamics and feedback control policies. △ Less

Submitted 20 August, 2022; v1 submitted 26 March, 2021; originally announced March 2021.

arXiv:1103.5181 [pdf]

doi 10.1371/journal.pone.0017243

Finding Complex Biological Relationships in Recent PubMed Articles Using Bio-LDA

Authors: Huijun Wang, Ying Ding, Jie Tang, Xiao Dong, Bing He, Judy Qiu, David J. Wild

Abstract: The overwhelming amount of available scholarly literature in the life sciences poses significant challenges to scientists wishing to keep up with important developments related to their research, but also provides a useful resource for the discovery of recent information concerning genes, diseases, compounds and the interactions between them. In this paper, we describe an algorithm called Bio-LDA… ▽ More The overwhelming amount of available scholarly literature in the life sciences poses significant challenges to scientists wishing to keep up with important developments related to their research, but also provides a useful resource for the discovery of recent information concerning genes, diseases, compounds and the interactions between them. In this paper, we describe an algorithm called Bio-LDA that uses extracted biological terminology to automatically identify latent topics, and provides a variety of measures to uncover putative relations among topics and bio-terms. Relationships identified using those approaches are combined with existing data in life science datasets to provide additional insight. Three case studies demonstrate the utility of the Bio-LDA model, including association predication, association search and connectivity map generation. This combined approach offers new opportunities for knowledge discovery in many areas of biology including target identification, lead hopping and drug repurposing. △ Less

Submitted 27 March, 2011; originally announced March 2011.

Comments: 14 pages, 8 figures, 10 tables

Journal ref: PLoS ONE (2011) 6(3): e17243

arXiv:q-bio/0610040 [pdf, ps, other]

Metric learning pairwise kernel for graph inference

Authors: Jean-Philippe Vert, Jian Qiu, William Stafford Noble

Abstract: Much recent work in bioinformatics has focused on the inference of various types of biological networks, representing gene regulation, metabolic processes, protein-protein interactions, etc. A common setting involves inferring network edges in a supervised fashion from a set of high-confidence edges, possibly characterized by multiple, heterogeneous data sets (protein sequence, gene expression,… ▽ More Much recent work in bioinformatics has focused on the inference of various types of biological networks, representing gene regulation, metabolic processes, protein-protein interactions, etc. A common setting involves inferring network edges in a supervised fashion from a set of high-confidence edges, possibly characterized by multiple, heterogeneous data sets (protein sequence, gene expression, etc.). Here, we distinguish between two modes of inference in this setting: direct inference based upon similarities between nodes joined by an edge, and indirect inference based upon similarities between one pair of nodes and another pair of nodes. We propose a supervised approach for the direct case by translating it into a distance metric learning problem. A relaxation of the resulting convex optimization problem leads to the support vector machine (SVM) algorithm with a particular kernel for pairs, which we call the metric learning pairwise kernel (MLPK). We demonstrate, using several real biological networks, that this direct approach often improves upon the state-of-the-art SVM for indirect inference with the tensor product pairwise kernel. △ Less

Submitted 21 October, 2006; originally announced October 2006.

Showing 1–12 of 12 results for author: Qiu, J