Constructing training sets for genomic selection to identify superior genotypes in candidate populations

Szu-Ping Chen; Wen-Hsiu Sung; Chen-Tuo Liao

doi:10.1007/s00122-024-04766-y

Theor Appl Genet. 2024 Nov 17;137(12):270. doi: 10.1007/s00122-024-04766-y.

Authors

Szu-Ping Chen¹,², Wen-Hsiu Sung¹, Chen-Tuo Liao³

Affiliations

¹ Department of Agronomy, National Taiwan University, Taipei, Taiwan.
² Section of Plant Breeding and Genetics, Cornell University, Ithaca, NY, USA.
³ Department of Agronomy, National Taiwan University, Taipei, Taiwan.

PMID: 39550734
DOI: 10.1007/s00122-024-04766-y

Abstract

Approaches for constructing training sets in genomic selection are proposed to efficiently identify top-performing genotypes from a breeding population. Identifying superior genotypes from a candidate population is a key objective in plant breeding programs. This study evaluates several methods for training set optimization in genomic selection, with the goal of discovering top-performing genotypes in a breeding population more efficiently. In addition, two approaches inspired by classical optimal design criteria are proposed to expand the search space for the best genotypes, and they are compared with methods that focus on maximizing the accuracy of breeding value prediction. Three evaluation metrics, normalized discounted cumulative gain, Spearman's rank correlation, and Pearson's correlation, are used to assess performance in both simulation studies and real trait analyses. Overall, for candidate populations lacking a strong subpopulation structure, a ridge regression-based method, referred to as $\mathrm{MSPE}^{\mathrm{Ridge}}$, is recommended. For candidate populations with a strong subpopulation structure, a heuristic-based version of the generalized coefficient of determination ($\mathrm{CD}_{\mathrm{mean(v2)}}$) and a D-optimality-like method that maximizes overall genomic variation ($\mathrm{GV}_{\mathrm{overall}}$) are the preferred approaches for identifying superior genotypes. For populations with a large number of candidates, a proposed ranking method ($\mathrm{GV}_{\mathrm{average}}$) can first be used to down-scale the candidate population, after which a heuristic-based method is employed to identify the best genotypes. Notably, the proposed $\mathrm{CD}_{\mathrm{mean(v2)}}$ has been verified to be equivalent to the original version, known as $\mathrm{CD}_{\mathrm{mean}}$, but its implementation is much more computationally efficient.
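
The three evaluation metrics named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas used in the study, so this sketch assumes a common formulation of normalized discounted cumulative gain with a linear gain and a $1/\log_2(i+1)$ discount at rank $i$; the function names, the cutoff `k`, and the toy breeding values are hypothetical and are not taken from the paper.

```python
import math


def pearson(x, y):
    """Pearson's correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            result[order[pos]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rank correlation: Pearson's correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def ndcg_at_k(true_values, predicted_values, k):
    """NDCG@k under the assumed linear gain and 1/log2(i + 1) discount.

    Genotypes are ordered by predicted value; the discounted sum of their
    true values is divided by the best achievable discounted sum.
    Assumes the top-k true values are positive.
    """
    discount = [1 / math.log2(i + 2) for i in range(k)]
    top_predicted = sorted(
        range(len(true_values)),
        key=lambda i: predicted_values[i],
        reverse=True,
    )[:k]
    ideal = sorted(true_values, reverse=True)[:k]
    dcg = sum(true_values[i] * d for i, d in zip(top_predicted, discount))
    idcg = sum(v * d for v, d in zip(ideal, discount))
    return dcg / idcg


# Hypothetical true and predicted breeding values for five candidates.
true_bv = [3.0, 1.0, 2.0, 5.0, 4.0]
pred_bv = [2.5, 1.5, 1.0, 4.0, 5.0]

print("Pearson :", pearson(true_bv, pred_bv))
print("Spearman:", spearman(true_bv, pred_bv))
print("NDCG@2  :", ndcg_at_k(true_bv, pred_bv, k=2))
```

Under this formulation, the two correlations summarize agreement across the whole candidate population, whereas NDCG@k weights only the top of the predicted ranking, which is why it is the more direct measure when the goal is to find the best genotypes.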

MeSH terms

Algorithms
Computer Simulation
Genetics, Population
Genome, Plant
Genomics / methods
Genotype*
Models, Genetic*
Phenotype
Plant Breeding* / methods
Selection, Genetic*

Grants and funding

NSTC 112-2118-M-002-003-MY2/Ministry of Science and Technology, Taiwan