Developing machine-learning-based amyloidogenicity predictors with Cross-Beta DB

Valentin Gonay; Michael P Dunne; Javier Caceres-Delpiano; Andrey V Kajava

doi:10.1002/alz.14510

Alzheimers Dement. 2025 Jan 8. doi: 10.1002/alz.14510. Online ahead of print.

Authors

Valentin Gonay¹,², Michael P Dunne², Javier Caceres-Delpiano³, Andrey V Kajava¹

Affiliations

¹ CRBM UMR 5237 CNRS, Université Montpellier, Montpellier, France.
² PROTERA SAS, Paris, France.
³ PROTERA (GEAEnzymes SpA), Santiago, Chile.

PMID: 39776173
DOI: 10.1002/alz.14510

Abstract

Introduction: The importance of protein amyloidogenesis, associated with various diseases and functional roles, has driven the creation of computational predictors of amyloidogenicity. The accuracy of these predictors, particularly those utilizing artificial intelligence technologies, heavily depends on the quality of the data.

Methods: We built Cross-Beta DB, a database containing high-quality data on known cross-β amyloids formed under natural conditions. We used it to train and benchmark several machine-learning (ML) algorithms to predict the amyloid-forming potential of proteins.

Results: We developed the Cross-Beta predictor using an Extra Trees ML algorithm. It outperforms existing amyloid predictors, achieving the highest F1 score (0.852) and accuracy (0.844).
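For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how accuracy and F1 score are computed for a binary amyloid / non-amyloid classifier. The function name `binary_metrics` and the toy labels are assumptions made for illustration only; they are not taken from the paper, its benchmark set, or the Cross-Beta predictor's code.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall, f1) for 0/1 labels.

    Label 1 is taken to mean "amyloid-forming" and 0 "non-amyloid"
    (an assumed encoding, chosen for this illustration).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1


# Toy labels invented for this example (not data from Cross-Beta DB).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

accuracy, precision, recall, f1 = binary_metrics(y_true, y_pred)
print(f"accuracy={accuracy:.3f} f1={f1:.3f}")
```

Accuracy counts all correct calls equally, whereas F1 reflects only performance on the positive (amyloid) class, which is one reason benchmarks commonly report the two together.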

Discussion: The development of the Cross-Beta DB database and a new ML-based Cross-Beta predictor may enable the creation of personalized risk profiles for neurodegenerative diseases and other amyloidoses, especially as genome sequencing becomes more affordable.

Highlights: Accuracy of ML-based predictors depends on the quality of training data. We built Cross-Beta DB, a database of high-quality data on naturally occurring amyloids. Using these data, we developed an amyloid predictor that outperforms other predictors. This computational tool enables the creation of risk profiles for neurodegenerative diseases.

Keywords: GWAS; amyloidosis; artificial intelligence; computational methods; cross‐β structure; database; machine‐learning.