Introduction: The importance of protein amyloidogenesis, which is associated with various diseases and functional roles, has driven the creation of computational predictors of amyloidogenicity. The accuracy of these predictors, particularly those utilizing artificial intelligence technologies, depends heavily on the quality of their training data.
Methods: We built Cross-Beta DB, a database containing high-quality data on known cross-β amyloids formed under natural conditions. We used it to train and benchmark several machine-learning (ML) algorithms that predict the amyloid-forming potential of proteins.
Results: We developed the Cross-Beta predictor using an Extra Trees ML algorithm. It outperforms existing amyloid predictors, achieving the highest F1 score (0.852) and accuracy (0.844).
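For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how F1 score and accuracy are conventionally computed from binary confusion-matrix counts. The function names and the example counts are hypothetical and chosen only for illustration; they are not taken from the Cross-Beta benchmark.

```python
def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Fraction of all predictions (amyloid or non-amyloid) that are correct."""
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall for the positive (amyloid) class."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical counts, NOT results from the paper:
    # 40 true positives, 10 false positives, 45 true negatives, 5 false negatives.
    tp, fp, tn, fn = 40, 10, 45, 5
    print(f"accuracy = {accuracy(tp, fp, tn, fn):.3f}")
    print(f"F1       = {f1_score(tp, fp, fn):.3f}")
```

Because F1 ignores true negatives, it is less flattering than accuracy when non-amyloid sequences dominate a benchmark, which is one reason the two metrics are typically reported together.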
Discussion: The development of the Cross-Beta DB database and a new ML-based Cross-Beta predictor may enable the creation of personalized risk profiles for neurodegenerative diseases and other amyloidoses, especially as genome sequencing becomes more affordable.
Highlights: Accuracy of ML-based predictors depends on the quality of training data. We built Cross-Beta DB, a database of high-quality data on naturally occurring amyloids. Using this data, we developed an amyloid predictor that outperforms other predictors. This computational tool enables the creation of risk profiles for neurodegenerative diseases.
Keywords: GWAS; amyloidosis; artificial intelligence; computational methods; cross‐β structure; database; machine‐learning.
© 2024 The Author(s). Alzheimer's & Dementia published by Wiley Periodicals LLC on behalf of Alzheimer's Association.