Identifying reports of randomized controlled trials (RCTs) via a hybrid machine learning and crowdsourcing approach

Byron C Wallace; Anna Noel-Storr; Iain J Marshall; Aaron M Cohen; Neil R Smalheiser; James Thomas

doi:10.1093/jamia/ocx053

J Am Med Inform Assoc. 2017 Nov 1;24(6):1165-1168. doi: 10.1093/jamia/ocx053.

Authors

Byron C Wallace¹, Anna Noel-Storr², Iain J Marshall³, Aaron M Cohen⁴, Neil R Smalheiser⁵, James Thomas⁶

Affiliations

¹ College of Computer and Information Science, Northeastern University, Boston MA, USA.
² Radcliffe Department of Medicine, University of Oxford, Oxford, UK.
³ Department of Primary Care and Public Health Sciences, King's College London, London, UK.
⁴ Department of Medical Informatics and Clinical Epidemiology, Oregon Health and Science University, Portland, OR, USA.
⁵ Department of Psychiatry and Psychiatric Institute, University of Illinois College of Medicine, Chicago, IL, USA.
⁶ EPPI-Centre, Department of Social Science, University College London, London, UK.

Abstract

Objectives: Identifying all published reports of randomized controlled trials (RCTs) is an important aim, but it requires extensive manual effort to separate RCTs from non-RCTs, even using current machine learning (ML) approaches. We aimed to make this process more efficient via a hybrid approach using both crowdsourcing and ML.

Methods: We trained a classifier to discriminate between citations that describe RCTs and those that do not. We then adopted a simple strategy of automatically excluding citations deemed very unlikely to be RCTs by the classifier and deferring to crowdworkers otherwise.
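The routing rule described in the Methods can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `triage`, the score scale (a probability-like value in [0, 1]), and the threshold of 0.05 are all assumptions made for the example, and the scores are made up.

```python
from typing import List, Sequence, Tuple


def triage(scores: Sequence[float], threshold: float) -> Tuple[List[int], List[int]]:
    """Split citation indices into (auto_excluded, sent_to_crowd).

    `scores` are hypothetical classifier outputs, where higher means
    "more likely to describe an RCT". Citations scoring below `threshold`
    are excluded automatically; all others are deferred to crowdworkers.
    """
    auto_excluded: List[int] = []
    sent_to_crowd: List[int] = []
    for index, score in enumerate(scores):
        if score < threshold:
            auto_excluded.append(index)
        else:
            sent_to_crowd.append(index)
    return auto_excluded, sent_to_crowd


# Made-up scores for five citations (illustrative only).
scores = [0.01, 0.92, 0.03, 0.40, 0.002]
auto_excluded, sent_to_crowd = triage(scores, threshold=0.05)
print("auto-excluded:", auto_excluded)
print("sent to crowd:", sent_to_crowd)
```

Under this scheme a lower threshold sends more citations to the crowd, which favors recall, while a higher threshold saves more manual effort.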

Results: Combining ML and crowdsourcing provides a highly sensitive RCT identification strategy (our estimates suggest 95%-99% recall) with substantially less effort (we observed a reduction of around 60%-80%) than relying on manual screening alone.
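The two quantities reported above, recall and effort reduction, can be computed from reference labels and routing decisions roughly as follows. This is a sketch under stated assumptions: the function name and the toy data are hypothetical, the numbers are not the study's results, and the recall figure assumes crowdworkers correctly identify every RCT they are shown, so it is an upper bound for the hybrid pipeline.

```python
from typing import Sequence, Tuple


def recall_and_effort_reduction(
    is_rct: Sequence[bool], sent_to_crowd: Sequence[bool]
) -> Tuple[float, float]:
    """Return (recall, effort_reduction) for a hybrid screening run.

    recall           = RCTs that reached the crowd / all RCTs
    effort_reduction = 1 - (citations sent to crowd / all citations)
    """
    if len(is_rct) != len(sent_to_crowd) or not is_rct:
        raise ValueError("inputs must be non-empty and of equal length")
    total_rcts = sum(is_rct)
    found = sum(1 for label, sent in zip(is_rct, sent_to_crowd) if label and sent)
    recall = found / total_rcts if total_rcts else float("nan")
    effort_reduction = 1 - sum(sent_to_crowd) / len(sent_to_crowd)
    return recall, effort_reduction


# Toy data for ten citations (made up for illustration).
is_rct = [True, False, False, True, False, False, False, False, True, False]
sent = [True, False, False, True, True, False, False, False, False, False]
recall, reduction = recall_and_effort_reduction(is_rct, sent)
print(f"recall={recall:.2f}, effort reduction={reduction:.2f}")
```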

Conclusions: Hybrid crowd-ML strategies warrant further exploration for biomedical curation/annotation tasks.

Keywords: crowdsourcing; evidence-based medicine; human computation; machine learning; natural language processing.

MeSH terms

Biomedical Research
Crowdsourcing*
Databases, Bibliographic
Information Storage and Retrieval / methods*
Machine Learning*
Natural Language Processing
ROC Curve
Randomized Controlled Trials as Topic*
Review Literature as Topic
Support Vector Machine
