Addressing pandemic-wide systematic errors in the SARS-CoV-2 phylogeny

Martin Hunt; Angie S Hinrichs; Daniel Anderson; Lily Karim; Bethany L Dearlove; Jeff Knaggs; Bede Constantinides; Philip W Fowler; Gillian Rodger; Teresa Street; Sheila Lumley; Hermione Webster; Theo Sanderson; Christopher Ruis; Benjamin Kotzen; Nicola de Maio; Lucas N Amenga-Etego; Dominic S Y Amuzu; Martin Avaro; Gordon A Awandare; Reuben Ayivor-Djanie; Timothy Barkham; Matthew Bashton; Elizabeth M Batty; Yaw Bediako; Denise De Belder; Estefania Benedetti; Andreas Bergthaler; Stefan A Boers; Josefina Campos; Rosina Afua Ampomah Carr; Yuan Yi Constance Chen; Facundo Cuba; Maria Elena Dattero; Wanwisa Dejnirattisai; Alexander Dilthey; Kwabena Obeng Duedu; Lukas Endler; Ilka Engelmann; Ngiambudulu M Francisco; Jonas Fuchs; Etienne Z Gnimpieba; Soraya Groc; Jones Gyamfi; Dennis Heemskerk; Torsten Houwaart; Nei-Yuan Hsiao; Matthew Huska; Martin Hölzer; Arash Iranzadeh; Hanna Jarva; Chandima Jeewandara; Bani Jolly; Rageema Joseph; Ravi Kant; Karrie Ko Kwan Ki; Satu Kurkela; Maija Lappalainen; Marie Lataretu; Jacob Lemieux; Chang Liu; Gathsaurie Neelika Malavige; Tapfumanei Mashe; Juthathip Mongkolsapaya; Brigitte Montes; Jose Arturo Molina Mora; Collins M Morang'a; Bernard Mvula; Niranjan Nagarajan; Andrew Nelson; Joyce M Ngoi; Joana Paula da Paixão; Marcus Panning; Tomas Poklepovich; Peter K Quashie; Diyanath Ranasinghe; Mara Russo; James Emmanuel San; Nicholas D Sanderson; Vinod Scaria; Gavin Screaton; October Michael Sessions; Tarja Sironen; Abay Sisay; Darren Smith; Teemu Smura; Piyada Supasa; Chayaporn Suphavilai; Jeremy Swann; Houriiyah Tegally; Bryan Tegomoh; Olli Vapalahti; Andreas Walker; Robert J Wilkinson; Carolyn Williamson; Xavier Zair; IMSSC2 Laboratory Network Consortium; Tulio de Oliveira; Timothy Ea Peto; Derrick Crook; Russell Corbett-Detig; Zamin Iqbal

doi:10.1101/2024.04.29.591666


bioRxiv [Preprint]. 2024 Nov 5:2024.04.29.591666. doi: 10.1101/2024.04.29.591666.

Authors

Martin Hunt^{1,2,3,4}, Angie S Hinrichs⁵, Daniel Anderson¹, Lily Karim^{5,6}, Bethany L Dearlove⁷, Jeff Knaggs^{1,2,3,4}, Bede Constantinides^{2,4}, Philip W Fowler^{2,3,4}, Gillian Rodger^{2,4}, Teresa Street^{2,3}, Sheila Lumley^{2,8}, Hermione Webster², Theo Sanderson⁹, Christopher Ruis^{10,11}, Benjamin Kotzen¹², Nicola de Maio¹, Lucas N Amenga-Etego¹³, Dominic S Y Amuzu¹³, Martin Avaro¹⁴, Gordon A Awandare¹³, Reuben Ayivor-Djanie^{15,16}, Timothy Barkham¹⁷, Matthew Bashton¹⁸, Elizabeth M Batty^{19,20}, Yaw Bediako¹³, Denise De Belder²¹, Estefania Benedetti¹⁴, Andreas Bergthaler⁷, Stefan A Boers²², Josefina Campos²¹, Rosina Afua Ampomah Carr^{16,23}, Yuan Yi Constance Chen¹⁷, Facundo Cuba²¹, Maria Elena Dattero¹⁴, Wanwisa Dejnirattisai²⁴, Alexander Dilthey²⁵, Kwabena Obeng Duedu^{16,26}, Lukas Endler⁷, Ilka Engelmann²⁷, Ngiambudulu M Francisco²⁸, Jonas Fuchs²⁹, Etienne Z Gnimpieba³⁰, Soraya Groc³¹, Jones Gyamfi^{16,32}, Dennis Heemskerk²², Torsten Houwaart²⁵, Nei-Yuan Hsiao³³, Matthew Huska³⁴, Martin Hölzer³⁴, Arash Iranzadeh³⁵, Hanna Jarva³⁶, Chandima Jeewandara³⁷, Bani Jolly^{38,39}, Rageema Joseph³⁵, Ravi Kant^{40,41,42}, Karrie Ko Kwan Ki⁴³, Satu Kurkela³⁶, Maija Lappalainen³⁶, Marie Lataretu³⁴, Jacob Lemieux¹², Chang Liu^{44,45}, Gathsaurie Neelika Malavige³⁷, Tapfumanei Mashe⁴⁶, Juthathip Mongkolsapaya^{20,44,45}, Brigitte Montes³¹, Jose Arturo Molina Mora⁴⁷, Collins M Morang'a¹³, Bernard Mvula⁴⁸, Niranjan Nagarajan^{49,50}, Andrew Nelson⁵¹, Joyce M Ngoi¹³, Joana Paula da Paixão²⁸, Marcus Panning²⁹, Tomas Poklepovich²¹, Peter K Quashie¹³, Diyanath Ranasinghe³⁷, Mara Russo¹⁴, James Emmanuel San^{52,53}, Nicholas D Sanderson^{2,3}, Vinod Scaria^{39,54}, Gavin Screaton², October Michael Sessions⁵⁵, Tarja Sironen^{40,41}, Abay Sisay⁵⁶, Darren Smith¹⁸, Teemu Smura^{40,41}, Piyada Supasa^{44,45}, Chayaporn Suphavilai⁴⁹, Jeremy Swann², Houriiyah Tegally⁵⁷, Bryan Tegomoh^{58,59,60}, Olli Vapalahti^{40,41}, Andreas Walker⁶¹, Robert J Wilkinson^{9,62,63}, Carolyn Williamson³⁵, Xavier Zair⁵⁵; IMSSC2 Laboratory Network Consortium; Tulio de Oliveira^{57,64}, Timothy Ea Peto², Derrick Crook², Russell Corbett-Detig^{5,6}, Zamin Iqbal^{1,65}

Affiliations

¹ European Molecular Biology Laboratory - European Bioinformatics Institute, Hinxton, UK.
² Nuffield Department of Medicine, University of Oxford, Oxford, UK.
³ National Institute of Health Research Oxford Biomedical Research Centre, John Radcliffe Hospital, Headley Way, Oxford, UK.
⁴ Health Protection Research Unit in Healthcare Associated Infections and Antimicrobial Resistance, University of Oxford, Oxford, UK.
⁵ Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA.
⁶ Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA.
⁷ Institute for Hygiene and Applied Immunology, Center for Pathophysiology, Infectiology and Immunology, Medical University of Vienna, Vienna 1090, Austria.
⁸ Department of Infectious Diseases and Microbiology, John Radcliffe Hospital, Oxford, UK.
⁹ Francis Crick Institute, London, UK.
¹⁰ Victor Phillip Dahdaleh Heart & Lung Research Institute, University of Cambridge, Cambridge, UK.
¹¹ Department of Veterinary Medicine, University of Cambridge, Cambridge, UK.
¹² Department of Infectious Diseases, Massachusetts General Hospital, Boston, Massachusetts, USA.
¹³ West African Centre for Cell Biology of Infectious Pathogens (WACCBIP), University of Ghana, Accra, Ghana.
¹⁴ Servicio de Virus Respiratorios, Instituto Nacional Enfermedades Infecciosas, ANLIS "Dr. Carlos G. Malbrán", Buenos Aires, Argentina.
¹⁵ Laboratory for Medical Biotechnology and Biomanufacturing, International Centre for Genetic Engineering and Biotechnology, Trieste, Italy.
¹⁶ Department of Biomedical Sciences, University of Health and Allied Sciences, Ho, Ghana.
¹⁷ Tan Tock Seng Hospital, Singapore.
¹⁸ The Hub for Biotechnology in the Built Environment, Department of Applied Sciences, Faculty of Health and Life Sciences, Northumbria University, Newcastle upon Tyne, NE1 8ST, UK.
¹⁹ Centre for Tropical Medicine and Global Health, Nuffield Department of Medicine, University of Oxford, Oxford, UK.
²⁰ Mahidol-Oxford Tropical Medicine Research Unit, Bangkok, Thailand.
²¹ Unidad Operativa Centro Nacional de Genómica y Bioinformática, ANLIS "Dr. Carlos G. Malbrán", Buenos Aires, Argentina.
²² Dept. Medical Microbiology, Leiden University Medical Center, Albinusdreef 2, 2333 ZA, Leiden, The Netherlands.
²³ Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA.
²⁴ Division of Emerging Infectious Disease, Research Department, Faculty of Medicine Siriraj Hospital, Mahidol University, Bangkoknoi, Bangkok 10700, Thailand.
²⁵ Institute of Medical Microbiology and Hospital Hygiene, University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Düsseldorf, Germany.
²⁶ College of Life Sciences, Birmingham City University, Birmingham, UK.
²⁷ Pathogenesis and Control of Chronic and Emerging Infections, Univ Montpellier, INSERM, Etablissement Français du Sang, Virology Laboratory, CHU Montpellier, Montpellier, France.
²⁸ Grupo de Investigação Microbiana e Imunológica, Instituto Nacional de Investigação em Saúde (National Institute for Health Research), Luanda, Angola.
²⁹ Institute of Virology, Freiburg University Medical Center, Faculty of Medicine, University of Freiburg, Freiburg, Germany.
³⁰ Biomedical Engineering Department, University of South Dakota, Sioux Falls, SD 57107.
³¹ Virology Laboratory, CHU Montpellier, Montpellier, France.
³² School of Health and Life Sciences, Teesside University, Middlesbrough, UK.
³³ Division of Medical Virology, University of Cape Town and National Health Laboratory Service.
³⁴ Genome Competence Center (MF1), Robert Koch Institute, Nordufer 20, 13353 Berlin, Germany.
³⁵ Computational Biology Division, University of Cape Town.
³⁶ HUS Diagnostic Center, Clinical Microbiology, University of Helsinki and Helsinki University Hospital, Helsinki, Finland.
³⁷ Allergy Immunology and Cell Biology Unit, Department of Immunology and Molecular Medicine, University of Sri Jayewardenepura, Nugegoda, Sri Lanka.
³⁸ Karkinos Healthcare Private Limited (KHPL), Aurbis Business Parks, Bellandur, Bengaluru, Karnataka, 560103, India.
³⁹ Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, Uttar Pradesh, India.
⁴⁰ Department of Veterinary Biosciences, University of Helsinki, 00014 Helsinki, Finland.
⁴¹ Department of Virology, University of Helsinki, 00014 Helsinki, Finland.
⁴² Department of Tropical Parasitology, Institute of Maritime and Tropical Medicine, Medical University of Gdansk, 81-519 Gdynia, Poland.
⁴³ Department of Microbiology, Singapore General Hospital, Singapore.
⁴⁴ Chinese Academy of Medical Science (CAMS) Oxford Institute (COI), University of Oxford, Oxford, UK.
⁴⁵ Wellcome Centre for Human Genetics, Nuffield Department of Medicine, University of Oxford, Oxford, UK.
⁴⁶ Health System Strengthening Unit, World Health Organisation, Harare, Zimbabwe.
⁴⁷ Centro de investigación en Enfermedades Tropicales & Facultad de Microbiología, Universidad de Costa Rica, Costa Rica.
⁴⁸ Public Health Institute of Malawi, Ministry of Health, Malawi.
⁴⁹ Genome Institute of Singapore, Agency for Science, Technology and Research (A*STAR), Singapore.
⁵⁰ Yong Loo Lin School of Medicine, National University of Singapore, Singapore.
⁵¹ Department of Applied Sciences, Faculty of Health and Life Sciences, Northumbria University, Newcastle upon Tyne, NE1 8ST, UK.
⁵² Duke Human Vaccine Institute, Duke University, Durham, NC 27710.
⁵³ University of KwaZulu Natal, Durban, South Africa, 4001.
⁵⁴ Vishwanath Cancer Care Foundation (VCCF), Neelkanth Business Park Kirol Village, West Mumbai, Maharashtra, 400086, India.
⁵⁵ Saw Swee Hock School of Public Health, National University of Singapore.
⁵⁶ Department of Medical Laboratory Sciences, College of Health Sciences, Addis Ababa University, P.O.Box 1176, Addis Ababa, Ethiopia.
⁵⁷ Centre for Epidemic Response and Innovation (CERI), Stellenbosch University, South Africa.
⁵⁸ Centre de Coordination des Opérations d'Urgences de Santé Publique, Ministere de Sante Publique, Cameroun.
⁵⁹ University of California, Berkeley, Berkeley, California, USA.
⁶⁰ Nebraska Department of Health and Human Services, Lincoln, Nebraska, USA.
⁶¹ Institute of Virology, University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Düsseldorf, Germany.
⁶² Centre for Infectious Diseases Research in Africa, University of Cape Town.
⁶³ Imperial College London, UK.
⁶⁴ KwaZulu-Natal Research Innovation and Sequencing Platform (KRISP), University of KwaZulu-Natal, South Africa.
⁶⁵ Milner Centre for Evolution, University of Bath, UK.

Abstract

The SARS-CoV-2 genome occupies a unique place in infection biology: it is the most highly sequenced genome on earth (making up over 20% of public sequencing datasets), with fine-scale information on sampling date and geography, and it has been subject to unprecedentedly intense analysis. As a result, these phylogenetic data are an incredibly valuable resource for science and public health. However, the vast majority of the data was sequenced by tiling amplicons across the full genome, with amplicon schemes that changed over the pandemic as mutations in the viral genome interacted with primer binding sites. In combination with the disparate set of genome assembly workflows and the lack of consistent quality control (QC) processes, this means the current genomes have many systematic errors that have evolved with the virus and amplicon schemes. These errors have significant impacts on the phylogeny, and therefore, over the last few years, many thousands of hours of researchers' time have been spent "eyeballing" trees, looking for artefacts, and then patching the tree. Given the huge value of this dataset, we set out to reprocess the complete set of public raw sequence data in a rigorous amplicon-aware manner and build a cleaner phylogeny. Here we provide a global tree of 4,471,579 samples, built from a consistently assembled set of high-quality consensus sequences from all available public data as of June 2024, viewable at https://viridian.taxonium.org. Each genome was constructed using a novel assembly tool called Viridian (https://github.com/iqbal-lab-org/viridian), developed specifically to process amplicon sequence data, eliminating artefactual errors and masking the genome at low-quality positions. We provide simulation and empirical validation of the methodology, and quantify the improvement in the phylogeny. We hope the tree, consensus sequences and Viridian will be a valuable resource for researchers.

Publication types

Preprint
