Dissecting the Genetic Architecture of Biofuel-Related Traits in a Sorghum Breeding Population

Abstract

In sorghum [Sorghum bicolor (L.) Moench], hybrid cultivars for the biofuel industry are desired. Along with selection based on testcross performance, evaluation of the breeding population per se is also important for the success of hybrid breeding. In addition to additive genetic effects, non-additive (i.e., dominance and epistatic) effects are expected to contribute to the performance of early generations. Unfortunately, studies on early generations in sorghum breeding programs are limited. In this study, we analyzed a breeding population for bioenergy sorghum, which was previously developed based on testcross performance, to compare genomic selection models both trained on and evaluated for the per se performance of the 3rd generation S0 individuals. Of over 200 ancestral inbred accessions in the base population, only 13 founders contributed to the 3rd generation as progenitors. Compared to the founders, the performances of the population per se were improved for target traits. The total genetic variance within the S0 generation progenies themselves for all traits was mainly additive, although non-additive variances contributed to each trait to some extent. For genomic selection, linear regression models explicitly considering all genetic components showed a higher predictive ability than other linear and non-linear models. Although the number and effect distribution of underlying loci was different among the traits, the influence of priors for marker effects was relatively small. These results indicate the importance of considering non-additive effects for dissecting the genetic architecture of early breeding generations and predicting the performance per se.

Sorghum [Sorghum bicolor (L.) Moench] is a promising bioenergy crop (Regassa and Wortmann 2014). Commercial F1 hybrid sorghums for biofuel production need to exhibit superiority in multiple traits, such as biomass, sugar content, and stress tolerance. In hybrid breeding, testcrosses are generally used to evaluate progeny performance. Many studies have focused on the relationship between the performance of partially or completely inbred lines and their testcrosses to a common unrelated tester to predict the effectiveness of selection on line per se performance for improving testcross performance. In maize, the correlation between line per se and testcross performance was intermediate to high for some traits, but small for grain yield (Mihaljevic et al. 2005). Similarly, other studies in maize and rye revealed that the correlations between phenotypes measured in lines per se and their testcrosses were often small for complex traits (Bekavac et al. 2008; Falke et al. 2010; Miedaner et al. 2014).

Although the breeding value of lines or individual breeding candidates for testcross performance is the most important selection criterion in hybrid breeding, the characteristics of the per se are also considered because they impact the efficiency of F1 seed production. For example, dwarf genotypes, which are easier to handle and more resistant to lodging than taller ones, are often utilized as the seed parents in hybrid sorghum (Pedersen et al. 2013). Cytoplasmic male sterility (CMS) is also necessary for hybrid seed parents because sorghum is a predominantly self-pollinated crop (Rooney 2007). Conversely, the pollen parent needs the complementary phenotypes, in which multiple traits (e.g., culm length, panicle length, and heading days) are suitable for hybridization. In a sorghum hybrid breeding program, only candidates with the suitability as hybrid parents are preselected in the advanced generations (e.g., F5 lines) before the evaluation based on testcross (Rooney 2007). On the other hand, for utilizing testcrosses for selection beginning in early generations, the evaluation of the breeding population per se is appropriately incorporated into the breeding process.

In addition to the additive genetic effects for parents, the performance of early generations per se can be affected by non-additive effects, especially dominance due to high heterozygosity before inbreeding has advanced. In sorghum, biomass-related traits (e.g., plant height) showed a considerable level of dominance while others were primarily additive (Felderhoff et al. 2012). In addition to additive and dominance effects, epistasis also contributed to various traits for bioenergy sorghum (Shiringani et al. 2010; Shiringani and Friedt 2011). The degree of non-additive effects has been examined in various traits across species (Yu et al. 1997; Lu et al. 2003; Frascaroli et al. 2007; Jiang et al. 2017). However, the contribution of non-additive effects in early generations of inbreeding in populations derived from several selection cycles is poorly understood. In contrast to genetic mapping populations (e.g., F2), breeding populations also have other complications (e.g., unequal allele frequencies) that need to be considered in dissecting the genetic architecture (Würschum 2012). Therefore, statistical models considering the genetic properties of breeding populations with limited inbreeding are necessary.

Genomic prediction (GP) was proposed to evaluate genetic potentials by the regression of target traits on genome-wide dense markers (Meuwissen et al. 2001). In plant and animal breeding, GP models have primarily been based on additive effects although non-parametric regression models considering non-additive effects are also used (Hayes and Goddard 2010; Jannink et al. 2010). Recently, the importance of non-additive effects in GP was considered (Varona et al. 2018). Some studies showed that GP models explicitly including non-additive effects improved prediction accuracy (Su et al. 2012; Nishio and Satoh 2014; Jiang and Reif 2015; Vitezica et al. 2017). The use of GP models accounting for only additive effects is not suitable when genetic architecture is predominantly regulated by non-additive effects (Howard et al. 2014). Alves et al. (2019) suggested that Bayesian GP models have the advantage of dissecting complex genetic architecture regulated by non-additive effects.

Bayesian GP models can be applied for a genome-wide association study (GWAS) (Fernando and Garrick 2013). In Bayesian GP models, different priors for marker effects, e.g., Bayesian ridge regression (BRR) and BayesA, are utilized for dealing with various genetic architectures of traits (reviewed by Gianola 2013). In particular, the number of QTL is an important factor in addition to the number of independent chromosome segments for selecting the appropriate prior (Daetwyler et al. 2010). For example, Wolc et al. (2016) showed that BayesB was suitable in the presence of QTL with large effects. The suitability of priors depends on multiple factors (e.g., heritability, marker density, and the training dataset size), and therefore the optimal settings for prior are generally unknown (de los Campos et al. 2013).

The objective of this study is to give insights into the genetics of important agronomic performance per se in a non-inbred (S0) generation of a bioenergy sorghum population that was selected based on testcross performance. The primary objectives of this research are: i) description of the genetic architecture of the S0 generation per se performance, ii) utilization of GP for evaluating the performance of the breeding population per se, and iii) comparison among models and priors for target traits. Finally, we discuss the importance and reliability of modeling non-additive effects for non-inbred and early inbreeding generations in plant breeding programs.

Materials and MethodsMating design of breeding population

The long-term goal of our project is the breeding of bioenergy sorghum. F1 hybrid cultivars for bioenergy sorghum require both high biomass and high sugar contents. The mating design used to derive the sorghum breeding population is described in Figure 1A. We use the term “family” as a full-sib family derived from the same parents, specifically for segregating generations. The term “individual” is used to describe each plant. The term “genotype” is used to identify genetically different plants. Genotype may refer to a single individual of a breeding family or a genetically uniform inbred accession or F1 hybrid.

Figure 1Figure 1Figure 1

(A) The mating design for the sorghum breeding population. (B) The genetic proportion of 13 remaining founders in 260 breeding families (3rd generation). The 260 families (each row) are divided into 13 groups (G01-G13) based on the proportion of eight progenitors (columns). (C) Principal component analysis on marker genotype (the two testers and base population). The testers (CMS-A and -B) and 13 remaining founders are shown in addition to the other base accessions.

The breeding population used in this study corresponds to the 3rd intermating generation derived from the base population (the 0th generation). To combine diverse genetic variations derived from the base population, intercrossing was performed in each generation. In theory, the maximum number of founders per family was eight in the 3rd generation. For selection, at least one tester was crossed with each breeding candidate to check the progeny performance. In the selection, total weight (TW) and the Brix value of culm juice (BR) of testcrosses were considered as the main target traits, although other traits were also considered secondarily (see Phenotypic data). The selection criteria were determined comprehensively because the decision of the best testcross across multiple traits was generally difficult (i.e., it depended on the breeder’s eye to some extent).

Breeding populations

A base population (the 0th generation) was composed of 243 inbred accessions, which had been obtained from public genebanks (Table S1). For the next population (the 1st generation), we selected 29 accessions from the base population based on the progeny performance with each tester separately. Four accessions that were not tested in this project were added to the 29 accessions. Therefore, a total of 33 accessions was used as the parents for the 1st generation. We performed 23 intercrosses among the 33 accessions, which corresponds to the 1st generation (23 intra-population F1 hybrids). Of the 23 F1 hybrids, 9 F1 hybrids were selected for the next population (the 2nd generation). All crosses genetically segregate starting in the 2nd generation. The 2nd generation included a total of 11 families, which were produced by 10 intercrosses among 8 F1 hybrids (S0 progenies) in addition to a family derived from the selfing of an F1 hybrid (S1 progenies). In this study, we used all 11 families of the 2nd generation as the parents for the next population without selection. The next population (3rd) was derived from intercrosses among different families of the 2nd generation. A few exceptions were derived from crosses within the same family (full-sib mating). A total of 137 individuals (average of 12.5 sibs per family in the 2nd generation) was used as the parents for intercrossing, resulting in 260 full-sib families in the 3rd generation.

Intercrossing was limited to the combinations within the group that had been selected for performance on each tester. This means that a selected genotype (or family) for a particular tester was intercrossed with another genotype (family) selected for performance in combination with the same tester.

Test populations

Test populations were evaluated only for the selection of parents for the next breeding populations. In other words, test populations per se were not directly used as parents for the breeding populations. For selection, we performed testcrosses with two testers with cytoplasmic male sterility (CMS), CMS-A and CMS-B. The 1st test population corresponds to the F1 hybrids between the two testers and the 0th generation. Based on the performance of the F1 hybrids, the parental inbred accessions were selected for each tester.

The 2nd test population corresponds to the three-way crosses between each tester and the 1st generation. We created two sub-populations of selected F1s corresponding to the best families in combination with each of the two testers. Subsequent testcrossing on advanced generations involved the same tester used to initially select the breeding families. For example, when we selected multiple inbreds based on their testcross performance with the CMS-A tester in the 1st test population, we evaluated the testcross between CMS-A and the F1 hybrid among the selected inbreds in the 2nd test population. Because the 2nd test population genetically segregates, the progeny rows for each testcross family were grown and evaluated. Considering the testcross performance (e.g., the average and variance of phenotypic values within each testcross family, the suitability across target traits, and the degree of lodging and diseases), we applied the family selection to the 2nd test population and selected families (i.e., the parental F1 hybrids) for the next breeding population (the 2nd generation).

Phenotypic data

We performed a field trial of the S0 plants of the 3rd generation per se (not as testcrosses) from June to September in 2017. The single experiment field was located at Corerepe, Sinaloa, Mexico (25° 37′ N, 108° 43′ W). We germinated seedlings in a greenhouse for three weeks before transplanting to the field. During the field trial, we adjusted the amount of irrigated water using drip irrigation. The fertilizer level was the standard for high-biomass sorghum (N:P:K = 17:60:50 kg ha-1). The space between ridges was 1 m with 15-cm distances between individuals in the same ridge. We divided the field into two blocks for allocating each breeding family. A total of 260 breeding families was randomly assigned to each plot across two blocks without replication. For check plots, the ancestral accessions that contributed to the 3rd generation were incorporated into each block with at least a plot. In addition, 26 genotypes (F1 hybrids), which included 24 superior testcrosses generally selected from the 1st test population and two high-biomass varieties provided by EARTHNOTE Co. Ltd., were replicated within blocks. Each plot included five individuals. For the breeding families, all five individuals within a plot were measured for the phenotypes, resulting in 1,300 individuals (260 families × 5 individuals). However, only 1,020 individuals (259 families, averagely 3.9 individuals per family) were used in this study due to missing data. For the ancestral accessions and checks, only two healthy individuals in a plot were measured. In this study, we evaluated six important traits for bioenergy sorghum (culm length [cm], CL; total biomass weight [kg transformed to natural logarithm], TW; the brix value of culm juice [%], BR; culm diameter [mm], CD; culm number [number in natural logarithm], CN; panicle length [cm], PL). All traits had been considered for each selection, mainly focusing on the performance of TW and BR.

To consider the field heterogeneity between the two experimental blocks, we calculated the adjusted phenotypic values using the following formula:Embedded ImageEmbedded Imagewhere Embedded ImageEmbedded Image is the phenotypic value of the ith genotypes on the kth block, μ is an intercept, and Embedded ImageEmbedded Image is the effect of the ith genotype, which is treated as random for unreplicated individuals of the breeding population and as fixed for replicated ancestral accessions and checks (Kempton and Gleeson 1996), Embedded ImageEmbedded Image is the random effect of the kth block [where Embedded ImageEmbedded Image], and Embedded ImageEmbedded Image is the residual [where Embedded ImageEmbedded Image].

The adjusted phenotypic value of the ith genotype (Embedded ImageEmbedded Imagei) was calculated as Embedded ImageEmbedded Image, where Embedded ImageEmbedded Image is the estimated mean value and Embedded ImageEmbedded Image is the best linear unbiased prediction of the ith genotype. The adjusted phenotypic values (Embedded ImageEmbedded Image) were used as the response variable for the following Bayesian regression models. This model was implemented using the package RAINBOWR (Hamazaki and Iwata 2020) in R (R Core Team 2019).

Marker data

DNA extraction and library preparation followed the procedure by Kobayashi et al. (2017). Although both the founder accessions and the breeding population were genotyped by restriction site-associated DNA sequencing (RAD-seq) (Baird et al. 2008), different restriction enzyme pairs (BglII and MseI for the former, and BglII and EcoRI for the latter) were used due to a procedural reason. Therefore, we obtained different marker datasets for the founder accessions and the breeding population, respectively. We treated each marker dataset independently for the analysis.

Each marker dataset was available through the following procedures. We mapped the RAD reads to the sorghum reference genome sequence (Sbicolor_313_v3.0) (McCormick et al. 2018) using BWA version 0.7.15 (Li and Durbin 2009). We carried out variant calling using UnifiedGenotyper implemented in the Genome Analysis Toolkit version 3.5 (McKenna et al. 2010) and obtained the raw variant call format (VCF) output. Using VCF tools (Danecek et al. 2011), we chose only variant sites fulfilling the following conditions: mean depth range (3–30), missing score (≤5%), minor allele frequency (≥5%), quality value (>20), and bi-allelic single nucleotide polymorphism (excluding variant sites derived from insertions and deletions). Besides, only highly homozygous variant sites (the heterozygosity rate was less than 5%) were selected for the dataset of the founder inbred accessions. We imputed missing genotypes using Beagle 4.0 (Browning and Browning 2007). Of the highly linked variant sites (r2 ≥ 0.95 between two variant sites) on the same chromosome, only the first variant site in the VCF file (which was near the zero position on a chromosome) were kept. Finally, 6,410 (for the founder accessions) and 3,260 markers (the 3rd breeding population) remained for the following analyses, respectively.

Principal component analysis (PCA) on each marker dataset was independently carried out using the function “prcomp” in R.

Bayesian regression models

The Bayesian regression models used in this study can be classified into four categories: the additive linear model (A), the additive-dominance linear model (AD), the additive-dominance-epistasis linear model (ADE), and the Gaussian kernel model (GK). We adjusted marker genotype scores and calculated genomic relationship matrices using the natural and orthogonal interactions (NOIA) approach (Vitezica et al. 2017). The NOIA model is based on the genotypic frequency (not allele frequency), which can be applied also for populations without assuming a Hardy–Weinberg equilibrium, such as our breeding population. Here, we will briefly explain the statistical models.

The A model can be written as the next formula:Embedded ImageEmbedded Image(1)where Embedded ImageEmbedded Image is the adjusted phenotypic value of the ith genotype in the breeding population, μ is the mean value across all genotypes, L is the total number of markers, Embedded ImageEmbedded Image is the marker coefficient of the ith genotype at the zth marker for the additive effect, Embedded ImageEmbedded Image is the additive effect of the zth marker, and Embedded ImageEmbedded Image is the residual [where Embedded ImageEmbedded Image].

The AD model can be described as the extended form of the A model:Embedded ImageEmbedded Image(2)where Embedded ImageEmbedded Image is the marker coefficient of the ith genotype at the zth marker for the dominance effect, and Embedded ImageEmbedded Image is the dominance effect of the zth marker.

The AD model can be further extended to the ADE model by incorporating the first-order epistasis terms:Embedded ImageEmbedded Image(3)where Embedded ImageEmbedded Image, Embedded ImageEmbedded Image, and Embedded ImageEmbedded Image are the additive×additive, additive×dominance (including dominance×additive), and dominance×dominance epistatic effects between the zth and wth markers, respectively.

The epistatic effect terms are equivalent to the random effects that follow the multivariate Gaussian distributions, whose variance–covariance matrices are proportional to the Hadamard products of the corresponding relationship matrices (Jiang and Reif 2015). To derive epistatic matrices, the additive (A) and dominance (D) relationship matrices were first calculated using each marker coefficient based on the genotypic frequency (Vitezica et al. 2017). Using A and D, the additive×additive (Embedded ImageEmbedded Image), additive×dominance (Embedded ImageEmbedded Image, including dominance×additive), and dominance×dominance (Embedded ImageEmbedded Image) epistatic relationship matrices can be described as Embedded ImageEmbedded Image and Embedded ImageEmbedded Image where Embedded ImageEmbedded Image represents the Hadamard product of two matrices, X and Y, Embedded ImageEmbedded Image is the trace, and n is the number of diagonal elements (i.e., the number of genotypes). In this study, we fitted the ADE model using the epistatic relationship matrices (Embedded ImageEmbedded Image, Embedded ImageEmbedded Image, and Embedded ImageEmbedded Image) for the epistatic effect terms.

In the A, AD, and ADE models, four priors (BRR, BayesA, BayesB, and BayesC) were used to estimate a (additive marker effects) and d (dominance marker effects) (Meuwissen et al. 2001; Habier et al. 2011; Gianola 2013). In the AD and ADE models, the combination of the same priors for a and d was utilized (two different priors together were not examined).

Both additive and non-additive effects can be implicitly captured by the reproducing kernel Hilbert spaces (RKHS) regression based on a Gaussian kernel (GK) (Gianola and van Kaam 2008). The GK model can be written as Embedded ImageEmbedded Image where Embedded ImageEmbedded Image is the random effect of the ith individual [where Embedded ImageEmbedded Image]. The Gaussian kernel is calculated as Embedded ImageEmbedded Image where S is the squared-Euclidean distance matrix between genotypes in the breeding population, and h is the bandwidth parameter for adjusting the genetic covariance. To optimize the value of h in GP, we used the approach based on the restricted maximum-likelihood (REML) in each training dataset (Endelman 2011).

All Bayesian regression models were performed using Markov Chain Monte Carlo (MCMC) implementations in the R package BGLR (Pérez and de los Campos 2014). For the posterior density, the total iteration of the sampler is 30,000 and the number of discarded samples (as burn-in) is 15,000, which showed consistent results with more MCMC samples (300,000 iterations with 150,000 discards).

Estimation of variance components

For estimating genetic variance components, we calculated genotypic values in each MCMC sample after burn-in (Lehermeier et al. 2017; Alves et al. 2019). The additive genotypic value (Embedded ImageEmbedded Image) can be calculated using the following formula:Embedded ImageEmbedded Imagewhere Embedded ImageEmbedded Image is the estimated additive value of the ith genotype, and Embedded ImageEmbedded Image is the estimated additive effect of the zth marker. Similarly, the dominance genetic value (Embedded ImageEmbedded Image) was also calculated in the AD and ADE models. The three epistatic genetic values (Embedded ImageEmbedded Image) were implicitly estimated in the ADE models. The total genetic value (Embedded ImageEmbedded Image) is the sum of these genetic values. The total genetic variance (Embedded ImageEmbedded Image) and variance components (Embedded ImageEmbedded Image) were calculated as the variance of estimated values across all genotypes in each MCMC sample (Alves et al. 2019).

Genome-wide association studies (GWAS)

For GWAS, we also used each MCMC sample after burn-in. In Bayesian whole-genome regression models, the estimated effect of any single marker can be small because of the correlation among adjacent markers. Fernando and Garrick (2013) applied the genomic window approach, which calculated the regional genetic variances using markers included in a genomic window. We calculated each regional genetic variance using a 1 Mb sliding window without overlaps. The formula for the regional additive variance in each MCMC sample can be as follows:Embedded ImageEmbedded ImageEmbedded ImageEmbedded Imagewhere Embedded ImageEmbedded Image is the estimated additive genotypic value of the ith individual at the qth region, and Embedded ImageEmbedded Image is the number of markers included in the qth region. The relative additive variance at the qth region (Embedded ImageEmbedded Image) for the total genetic variance Embedded ImageEmbedded Image was estimated as follows:Embedded ImageEmbedded ImageThe regional dominance variance can also be calculated using a similar procedure. Although any priors for marker effects can be used for GWAS, we described only the results of the ADE model with BayesB for the GWAS. In this study, we inferred regions with over 1% of the total genetic variance in each MCMC sample as being associated with target traits. The window posterior probability of association (WPPA) is calculated by counting the number of MCMC samples over the threshold for the total number of samples (Fernando and Garrick 2013).

For variance estimation and GWAS, Bayesian regression models were fitted as the full model using all genotypes of the breeding population.

Genomic prediction

The predictive ability of the Bayesian regression models was evaluated by a fivefold cross-validation approach. We randomly divided the breeding population into five subsets. Of these five subsets, four subsets were used for training the model, and the remaining subset was validated using the trained model. This process was repeated until all subsets were validated, which corresponded to a single replication. The correlation coefficient (r) between the adjusted phenotypic values (Embedded ImageEmbedded Image) and predicted values (Embedded ImageEmbedded Image) was recorded in each replication. We carried out the fivefold cross-validation approach for each model and each trait with 20 replications. After the values of the correlation coefficient were corrected using the Fisher z-transformation for model comparison, Tukey’s test (P < 0.01) was performed using the R package agricolae (de Mendiburu 2019).

Data availability

RAD-seq data have been submitted to the NCBI Sequence Read Archive with the BioProject PRJNA614576. All supplemental materials are available at FigShare, including the phenotype, genotype, and ancestry data used in this study. Table S1 contains the results of GWAS. File S1 contains information on base accessions. File S2 contains phenotype data. File S3 contains genotype data. File S4 contains ancestry data. Other information is also available upon request. Supplemental material available at figshare: https://doi.org/10.25387/g3.12674369.

ResultsDeveloping a sorghum breeding population

We developed a sorghum breeding population by the procedure summarized in Figure 1A. Of over 200 accessions in the base population (0th generation), only 13 founders contributed to the ancestry of the 3rd generation (Figure 1B). We classified 260 breeding families in the 3rd generation into 13 groups (G01–G13) based on their relationships to the founders. The genetic contribution of each founder in the ancestry ranged from 12.5 to 25.0% in the breeding families, except a family in G12 in which F170 and F247 each contributed 50.0%. G12 and G13 had been selected based on the testcross with CMS-A while we had selected the other groups for CMS-B. Only one founder (F210) was included in the ancestry of both subpopulations selected on the basis of performance with each tester. The remaining 13 founders in the 3rd generation represented a good sample of genetic variations of the base population (Figure 1C).

Phenotypic variation of the breeding population

Compared to the founder accessions, each breeding group showed relatively high performances for most traits (Figure 2). The breeding groups, except for G13, had greater TW than did the founders. All groups also showed a higher performance than the founders in BR. Within the breeding population, the phenotypic variations were unique to each group. G12 showed the best performance in TW, which might be mainly due to the improvement of CD and CN. In G06, the improvement of all traits progressed si

Comments (0)

No login
gif