-
BEND: Benchmarking DNA Language Models on biologically meaningful tasks
Authors:
Frederikke Isa Marin,
Felix Teufel,
Marc Horlacher,
Dennis Madsen,
Dennis Pultz,
Ole Winther,
Wouter Boomsma
Abstract:
The genome sequence contains the blueprint for governing cellular processes. While the availability of genomes has vastly increased over the last decades, experimental annotation of the various functional, non-coding and regulatory elements encoded in the DNA sequence remains both expensive and challenging. This has sparked interest in unsupervised language modeling of genomic DNA, a paradigm that has seen great success for protein sequence data. Although various DNA language models have been proposed, evaluation tasks often differ between individual works, and might not fully recapitulate the fundamental challenges of genome annotation, including the length, scale and sparsity of the data. In this study, we introduce BEND, a Benchmark for DNA language models, featuring a collection of realistic and biologically meaningful downstream tasks defined on the human genome. We find that embeddings from current DNA LMs can approach performance of expert methods on some tasks, but only capture limited information about long-range features. BEND is available at https://github.com/frederikkemarin/BEND.
Submitted 9 April, 2024; v1 submitted 21 November, 2023;
originally announced November 2023.
-
Decoding ChatGPT: A Taxonomy of Existing Research, Current Challenges, and Possible Future Directions
Authors:
Shahab Saquib Sohail,
Faiza Farhat,
Yassine Himeur,
Mohammad Nadeem,
Dag Øivind Madsen,
Yashbir Singh,
Shadi Atalla,
Wathiq Mansoor
Abstract:
Chat Generative Pre-trained Transformer (ChatGPT) has gained significant interest and attention since its launch in November 2022. It has shown impressive performance in various domains, including passing exams and creative writing. However, challenges and concerns related to biases and trust persist. In this work, we present a comprehensive review of over 100 Scopus-indexed publications on ChatGPT, aiming to provide a taxonomy of ChatGPT research and explore its applications. We critically analyze the existing literature, identifying common approaches employed in the studies. Additionally, we investigate diverse application areas where ChatGPT has found utility, such as healthcare, marketing and financial services, software engineering, academic and scientific writing, research and education, environmental science, and natural language processing. Through examining these applications, we gain valuable insights into the potential of ChatGPT in addressing real-world challenges. We also discuss crucial issues related to ChatGPT, including biases and trustworthiness, emphasizing the need for further research and development in these areas. Furthermore, we identify potential future directions for ChatGPT research, proposing solutions to current challenges and speculating on expected advancements. By fully leveraging the capabilities of ChatGPT, we can unlock its potential across various domains, leading to advancements in conversational AI and transformative impacts in society.
Submitted 25 August, 2023; v1 submitted 26 July, 2023;
originally announced July 2023.
-
Hypergraphs for multiscale cycles in structured data
Authors:
Agnese Barbensi,
Iris H. R. Yoon,
Christian Degnbol Madsen,
Deborah O. Ajayi,
Michael P. H. Stumpf,
Heather A. Harrington
Abstract:
Scientific data has been growing in both size and complexity across the modern physical, engineering, life and social sciences. Spatial structure, for example, is a hallmark of many of the most important real-world complex systems, but its analysis is fraught with statistical challenges. Topological data analysis can provide a powerful computational window on complex systems. Here we present a framework to extend and interpret persistent homology summaries to analyse spatial data across multiple scales. We introduce hyperTDA, a topological pipeline that unifies local (e.g. geodesic) and global (e.g. Euclidean) metrics without losing spatial information, even in the presence of noise. Homology generators offer an elegant and flexible description of spatial structures and can capture the information computed by persistent homology in an interpretable way. Here the information computed by persistent homology is transformed into a weighted hypergraph, where hyperedges correspond to homology generators. We consider different choices of generators (e.g. matroid or minimal) and find that centrality and community detection are robust to either choice. We compare hyperTDA to existing geometric measures and validate its robustness to noise. We demonstrate the power of computing higher-order topological structures on spatial curves arising frequently in ecology, biophysics, and biology, but also in high-dimensional financial datasets. We find that hyperTDA can select between synthetic trajectories from the landmark 2020 AnDi challenge and quantify movements of different animal species, even when data is limited.
Submitted 14 October, 2022;
originally announced October 2022.
-
GiNGR: Generalized Iterative Non-Rigid Point Cloud and Surface Registration Using Gaussian Process Regression
Authors:
Dennis Madsen,
Jonathan Aellen,
Andreas Morel-Forster,
Thomas Vetter,
Marcel Lüthi
Abstract:
In this paper, we unify popular non-rigid registration methods for point sets and surfaces under our general framework, GiNGR. GiNGR builds upon Gaussian Process Morphable Models (GPMM) and hence separates modeling the deformation prior from model adaptation for registration. In addition, it provides explainable hyperparameters, multi-resolution registration, trivial inclusion of expert annotation, and the ability to use and combine analytical and statistical deformation priors. But more importantly, the reformulation allows for a direct comparison of registration methods. Instead of using a general solver in the optimization step, we show how Gaussian process regression (GPR) can iteratively warp a reference onto a target, leading to smooth deformations following the prior for any dense, sparse, or partial estimated correspondences in a principled way. We show how the popular CPD and ICP algorithms can be directly explained with GiNGR. Furthermore, we show how existing algorithms in the GiNGR framework can perform probabilistic registration to obtain a distribution of different registrations instead of a single best registration. This can be used to analyze the uncertainty, e.g. when registering partial observations. GiNGR is publicly available and fully modular to allow for domain-specific prior construction.
Submitted 18 March, 2022;
originally announced March 2022.
-
A Closest Point Proposal for MCMC-based Probabilistic Surface Registration
Authors:
Dennis Madsen,
Andreas Morel-Forster,
Patrick Kahr,
Dana Rahbani,
Thomas Vetter,
Marcel Lüthi
Abstract:
We propose to view non-rigid surface registration as a probabilistic inference problem. Given a target surface, we estimate the posterior distribution of surface registrations. We demonstrate how the posterior distribution can be used to build shape models that generalize better and show how to visualize the uncertainty in the established correspondence. Furthermore, in a reconstruction task, we show how to estimate the posterior distribution of missing data without assuming a fixed point-to-point correspondence.
We introduce the closest-point proposal for the Metropolis-Hastings algorithm. Compared to a random-walk strategy, our proposal overcomes the limitation of slow convergence. As the algorithm decouples inference from modeling the posterior using a propose-and-verify scheme, we show how to choose different distance measures for the likelihood model.
All presented results are fully reproducible using publicly available data and our open-source implementation of the registration framework.
Submitted 17 July, 2020; v1 submitted 2 July, 2019;
originally announced July 2019.