How Good Are Predictions of the Effects of Selective Sweeps on Levels of Neutral Diversity? [Population and Evolutionary Genetics]

MAYNARD SMITH and Haigh (1974) introduced the concept of hitchhiking into population genetics, showing how the spread of a favorable mutation reduces the level of neutral variability at a linked locus. Nearly 20 years later, it was shown that selection against recurrent deleterious mutations can also reduce variability, by the hitchhiking process known as background selection (BGS) (Charlesworth et al. 1993). It is, therefore, preferable to use the term “selective sweep” (Berry et al. 1991) for the hitchhiking effects of favorable mutations. There is now a large theoretical and empirical literature on both types of hitchhiking, recently reviewed by Walsh and Lynch (2018) and Stephan (2019). With sufficiently weak selection, recurrent partially recessive deleterious mutations can sometimes increase variability at linked sites, because fluctuations in their frequencies due to genetic drift create associative overdominance (Zhao and Charlesworth 2016; Becher et al. 2020; Gilbert et al. 2020).

These theoretical studies have provided the basis for methods for inferring the nature and parameters of selection from population genomic data, recently reviewed by Booker et al. (2017). Several recent studies have concluded that the level of DNA sequence variability in a species is often much smaller than would be expected in the absence of selection (Corbett-Detig et al. 2015; Elyashiv et al. 2016; Campos et al. 2017; Comeron 2017), especially for synonymous sites in coding sequences, reflecting the effects of both selective sweeps and BGS. However, estimates of the parameters involved differ substantially among different studies. There is also an ongoing debate about the extent to which the level of genetic variability in a species is controlled by classical genetic drift, reflecting its population size, or by the effects of selection in removing variability. The possibility that the effects of selective sweeps dominate over drift was originally raised by Maynard Smith and Haigh (1974), and later advocated by Kaplan et al. (1989) and Gillespie (2002); see Kern and Hahn (2018) and Jensen et al. (2019) for recent discussions of this question.

The model of Maynard Smith and Haigh (1974) assumed that the trajectory of the selectively favored allele was purely deterministic. Kaplan et al. (1989) developed a representation of the dual processes of recombination and coalescence during a sweep, which allowed for stochastic effects on the frequency of the selected allele when it is either rare or very common. This approach enabled calculations of the effect of a sweep on both pairwise diversity and the site frequency spectrum, but did not provide simple formulae. Explicit formulae for the effect of a sweep on pairwise diversity that removed the assumption of a purely deterministic trajectory were derived by Stephan et al. (1992) using diffusion equations. Barton (1998, 2000) developed an alternative approach using a combination of branching processes and diffusion equations, from which the properties of a postsweep sample of n alleles could be calculated. Kaplan et al. (1989), Stephan et al. (1992), Wiehe and Stephan (1993), Barton (2000), Kim and Stephan (2000) and Gillespie (2002) also analyzed the effects of recurrent selective sweeps, treating coalescent events caused by classical genetic drift and by sweeps as competing exponential processes. All of these approaches assumed either a haploid population or an autosomal locus with semidominant fitness effects.

A great simplification in such calculations was achieved by the following approximation, proposed by Barton (1998, 2000) and extended by Durrett and Schweinsberg (2004); see also Kim (2006) and Coop and Ralph (2012). This approach is based on two assumptions. The first is that the fixation of a favorable mutation happens so fast that nonrecombinant alleles at a linked neutral site, sampled after the completion of the sweep, effectively coalesce instantaneously. The second is that linkage is sufficiently tight that, at most, a single recombination event occurs during the sweep, placing a neutral site onto a wild-type background with which it remains associated throughout the sweep. These assumptions mean that the gene genealogy for a set of alleles sampled immediately after a sweep, and that failed to recombine onto the wild-type background, has a “star-like” shape. The reduction in diversity and site frequency spectrum at the neutral site can then be calculated in a straightforward fashion (Barton 2000; Durrett and Schweinsberg 2004; Kim 2006; Coop and Ralph 2012; Weissman and Barton 2012). This approximation provides the basis for detecting recent sweeps in the programs SweepFinder (Nielsen et al. 2005) and SweeD (Pavlidis et al. 2013). It can readily be incorporated into models of recurrent selective sweeps (Barton 2000; Weissman and Barton 2012; Berg and Coop 2015; Elyashiv et al. 2016; Campos et al. 2017; Campos and Charlesworth 2019), which has stimulated the development of methods for estimating the parameters of recurrent sweeps from population genomic data (Elyashiv et al. 2016; Campos et al. 2017; Campos and Charlesworth 2019).

This approach is likely to be accurate for favorable mutations that are sufficiently strongly selected that their time to fixation is short compared with the expected neutral coalescent time of 2Ne generations (where Ne is the effective population size), provided that the ratio of the recombination rate to the selection coefficient is sufficiently small. Conversely, when this ratio is large, a sweep will have a negligible effect on variability. There is a need to examine the properties of sweeps when the selection and recombination parameters do not meet these conditions, especially as recent population genomic analyses suggest that there may be important contributions from relatively weakly selected favorable mutations, which take as long as 10% or more of the neutral coalescent time to become fixed (Sella et al. 2009; Keightley et al. 2016; Chen et al. 2020). In such cases, the time to coalescence during the sweep cannot necessarily be neglected, and the assumption that a pair of nonrecombined alleles are identical in state leads to an underestimate of diversity at the end of the sweep, especially with very low rates of recombination. In contrast, coalescence during the sweep competes with recombination, so that calculating the probability that one of a pair of alleles recombines onto the wild-type background without including the probability that they have escaped prior coalescence underestimates the effect of a sweep (Barton 1998). More generally, when the assumption that the duration of a sweep is negligible compared with the neutral coalescent time is invalid, the mean coalescent time of a pair of alleles cannot accurately be calculated simply from the probability that they escape recombination onto the wild-type background.

The present paper describes a general analytical model of selective sweep effects on the mean time to coalescence of a pair of alleles at a linked neutral locus (which determines the expected pairwise neutral diversity), for the case of weak selection at a single locus, where the selection coefficient is sufficiently small that a differential equation can be used instead of a difference equation. This is based on a recent study of the expected time to fixation of a favorable mutation in a single population (Charlesworth 2020), which provided a general framework for analyzing both autosomal and sex-linked inheritance with arbitrary levels of inbreeding and dominance. There are, of course, other statistics of importance for population genetic inferences, such as the effect of sweeps on site frequency spectra. Results on these are hard to obtain analytically without the use of the star phylogeny approximation (Barton 2000; Durrett and Schweinsberg 2004; Kim 2006; Coop and Ralph 2012), and are therefore not considered in this paper.

The resulting formulae, which include a heuristic treatment of multiple recombination events during a sweep, enable predictions of the effects on diversity of both a single sweep and recurrent selective sweeps, and allow for the action of BGS as well as sweeps. They apply to cases when the product of Ne and the strength of selection is sufficiently large that the expected trajectory of allele frequency change at the selected locus is close to the deterministic predictions, except for allele frequencies close to 0 or 1. Hartfield and Bataillon (2020) have recently presented similar results for an autosomal locus with coalescence during a sweep, in the case of a single sweep in the absence of BGS, but without modeling multiple recombination events. Only hard sweeps will be considered here, although it is straightforward to extend the models to soft sweeps by the approach of Berg and Coop (2015) and Hartfield and Bataillon (2020). The validity of the approximations is tested against computer simulations, including those of Campos and Charlesworth (2019) and Hartfield and Bataillon (2020). For the sake of brevity, these papers will be referred to as CC and HB, respectively.

Methods

Simulating the effect of a single sweep

The algorithm described by Equation 27 of Tajima (1990) was used to calculate the effects of a sweep on pairwise diversity at a neutral locus with an arbitrary degree of linkage to a selected locus with two alleles, A1 and A2, where A2 is the selectively favored allele. A Wright–Fisher population with constant size N was assumed. The equations provide three coupled, forward-in-time recurrence relations for the expected diversities at the neutral locus for pairs of alleles carrying either A1 or A2, and for the divergence between A1 and A2 alleles. These are conditioned on a given generation-by-generation trajectory of allele frequencies at the selected locus, and assume an infinite sites model of mutation and drift (Kimura 1971).

The initial conditions for a simulation run were that a single A2 allele was introduced into the population, with zero expected pairwise diversity at the associated neutral locus; the expected pairwise diversity among A1 alleles and the divergence between A1 and A2 were equal to those for an equilibrium population in the absence of selection, θ = 4Nu, where u is the neutral mutation rate. Since only diversities relative to θ are of interest here, θ was set to 0.001 in order to satisfy the infinite sites assumption for the neutral locus. The expected change in the frequency q of A2 in a given generation for an assigned selection model was calculated using the standard discrete-generation selection formulation (see the section Theoretical results—single sweep for details of the models of selection). Binomial sampling using the frequency after selection and 2N as parameters was used to obtain the value of q in the next generation. Equation 27 of Tajima (1990) was applied to the old value of q in order to obtain the state of the neutral locus in the new generation.

This procedure was repeated generation by generation until A2 was lost or fixed; only runs in which A2 was fixed were retained, and the value of the pairwise diversity among A2 alleles at the time of its fixation was determined. This gives the expected diversity after a sweep conditional on a given trajectory, so that an estimate of the overall expected diversity relative to θ can be found by taking the mean over a large number of replicate simulations. It was found that 100 replicates were sufficient to produce a standard error of 2% or less of the mean. The value of N was chosen so that the selection coefficient s for a given value of the scaled selection parameter γ = 2Ns was sufficiently small that terms of order $s^2$ could be neglected, to satisfy the assumptions of the model described in the section Theoretical results—single sweep.
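The trajectory part of this procedure can be sketched as follows. This is a minimal illustration, not the program used for the results reported here: it covers only the selected locus (the neutral-locus recursions of Tajima 1990 are not reproduced), and the fitness scheme, the function names `next_expected_freq` and `sweep_trajectory`, and the parameter values are assumptions made for the example.

```python
import numpy as np

def next_expected_freq(q, s, h):
    """Expected frequency of A2 after one generation of selection.

    Assumed fitnesses for this sketch: A1A1 = 1, A1A2 = 1 + h*s,
    A2A2 = 1 + s, at an autosomal locus with random mating.
    """
    p = 1.0 - q
    mean_fitness = p * p + 2.0 * p * q * (1.0 + h * s) + q * q * (1.0 + s)
    return (q * q * (1.0 + s) + p * q * (1.0 + h * s)) / mean_fitness

def sweep_trajectory(N, s, h, rng, max_tries=100_000):
    """Return one generation-by-generation trajectory of q ending in fixation.

    Each attempt starts from a single A2 copy, q = 1/(2N). Selection is
    followed by binomial sampling of 2N gametes. Attempts in which A2 is
    lost are discarded, so the result is conditioned on fixation.
    """
    for _ in range(max_tries):
        q = 1.0 / (2 * N)
        path = [q]
        while 0.0 < q < 1.0:
            q_after_selection = next_expected_freq(q, s, h)
            q = rng.binomial(2 * N, q_after_selection) / (2 * N)
            path.append(q)
        if q == 1.0:
            return np.array(path)
    raise RuntimeError("no fixation observed; increase max_tries")

# Hypothetical parameter values: gamma = 2Ns = 100, semidominant selection.
rng = np.random.default_rng(1)
traj = sweep_trajectory(N=500, s=0.1, h=0.5, rng=rng)
```

Averaging a post-sweep statistic over many such retained trajectories corresponds to the replicate mean described above.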

Recurrent sweeps: simulation methods

For checking the theoretical predictions concerning recurrent sweeps, the simulation results described in CC were used. These involved groups of linked autosomal genes separated by 2 kb of selectively neutral intergenic sequence, with all UTR sites and 70% of nonsynonymous (NS) sites subject to both positive and negative selection, and the same selection parameters for 5′ and 3′ UTRs (see Figure 1 of CC). There were five exons of 300 basepairs (bp) each, interrupted by four introns of 100 bp. The lengths of the 5′ and 3′ UTRs were 190 and 280 bp, respectively. The selection coefficients for favorable and deleterious mutations at the NS and UTR sites, and the proportions of mutations at these sites that were favorable, were chosen to match the values inferred by Campos et al. (2017) from the relation between the synonymous diversity of a gene and its rate of protein sequence evolution. Both favorable and deleterious mutations were assumed to be semidominant.

Five different rates of reciprocal crossing over (CO) were used to model recombination, which were chosen to be multiples of the approximate standard autosomal recombination rate in Drosophila melanogaster, adjusted by a factor of 1/2 to take into account the absence of recombinational exchange in males (Campos et al. 2017): $0.5 \times 10^{-8}$, $1 \times 10^{-8}$, $1.5 \times 10^{-8}$, $2 \times 10^{-8}$, and $2.5 \times 10^{-8}$ per bp per generation (equivalent to 0.5, 1, 1.5, 2, and 2.5 cM/Mb), where $10^{-8}$ is the mean rate across the genome.

The simulations were run with and without BGS acting on both NS and UTR sites, and with and without noncrossover associated gene conversion events. Cases with gene conversion assumed a rate of initiation of conversion events of $1 \times 10^{-8}$ per bp per generation for autosomes (after correcting for the lack of gene conversion in males), and a mean tract length of 440 bp, with an exponential distribution of tract lengths.

Recurrent sweeps at multiple sites: numerical predictions based on analytical formulae

A single gene is considered in the analytical models, so that a linear genetic map can be assumed, because there is a negligible frequency of double crossovers. The CO contribution to the frequency of recombination between a neutral site and a selected site is rcz, where z is the physical distance between the two sites in basepairs and rc is the CO rate per bp.

An important point regarding the cases with gene conversion should be noted here. CC stated that, because the simulation program they used (SLiM 1.8) modeled gene conversion by considering only events that are initiated on one side of a given nucleotide site, the rate of initiation of a gene conversion tract covering this site is one-half of that used in the standard formula for the frequency of recombination caused by gene conversion; see Equation 1 of Frisse et al. (2001). However, this statement is incorrect, because it overlooks the fact that the standard model of gene conversion assumes that there are equal probabilities of a tract moving toward and away from the site. If tracts are constrained to move in one direction, the net probability that a tract started at a random point moves toward a given site is the same as in the standard formula, for a given probability of initiation of a tract.

Since no derivation of the formula of Frisse et al. (2001) appears to have been given, one is provided in File S1, section 1, which makes this point explicit (Equation S5 is equivalent to the formula in question). Gene conversion tract lengths are assumed to be exponentially distributed, with a mean tract length of dg, and a probability of initiation rg. It follows that the effective rates of initiation of gene conversion events (rg) used in the theoretical calculations in CC should have been twice the values that were used there. Diversity values were thus underestimated by these calculations, because there was more recombination than was included in the predictions. The correct theoretical results for sweep effects are presented here.
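The combined crossover and gene conversion contributions can be illustrated with a short sketch. It assumes that the standard formula takes the form r(z) = rc·z + 2·rg·dg·[1 − exp(−z/dg)], which is the form consistent with the doubling of rg described above; the function name `recombination_frequency` and the numerical values are illustrative only.

```python
import math

def recombination_frequency(z, r_c, r_g, d_g):
    """Frequency of recombination between two sites z bp apart.

    Assumed form (see lead-in): a linear crossover term plus a gene
    conversion term for exponentially distributed tract lengths with
    mean d_g and initiation probability r_g per bp.
    """
    crossover = r_c * z
    conversion = 2.0 * r_g * d_g * (1.0 - math.exp(-z / d_g))
    return crossover + conversion

# Illustrative values: r_c = r_g = 1e-8 per bp, mean tract length 440 bp.
for z in (100, 1000, 10000):
    print(z, recombination_frequency(z, 1e-8, 1e-8, 440.0))
```

For sites much further apart than the mean tract length, the conversion term saturates at 2·rg·dg, so gene conversion matters most for closely linked sites within a gene.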

The effects of selective sweeps on neutral sites within a gene were obtained by summing the expected effects of substitutions at each NS and UTR site in the gene on a given neutral site (synonymous site), assuming that every third basepair in an exon is a neutral site, with the other two (NS) sites being subject to selection, as described by Campos et al. (2017). This differs from the SLiM procedure of randomly assigning selection status to exonic sites, with a probability ps of being under selection (ps = 0.7 in the simulations used in CC). To correct for this, the overall rate of NS substitutions per NS site was adjusted by multiplication by 0.7 × 1.5. Furthermore, to correct for the effects of interference among co-occurring favorable mutations in reducing their probabilities of fixation, their predicted rates of substitution were multiplied by a factor of 0.95, following the procedure in CC.
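The arithmetic behind these corrections can be written out as follows, on the assumption that the factor of 1.5 is the reciprocal of the fraction (2/3) of exonic sites treated as NS sites in the analytical model, and that the two corrections are applied multiplicatively for NS sites:

$$\frac{p_s}{2/3} = 0.7 \times 1.5 = 1.05, \qquad 1.05 \times 0.95 = 0.9975$$

On this reading, the net adjustment to the rate of favorable NS substitutions is close to one.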

In order to speed up the computations, mean values of the variables used to calculate the effects of sweeps on neutral diversity were calculated by thinning the neutral sites by considering only a subset of them, starting with the first codon at the 5′ end of the gene. For the results reported here, 10% of all neutral sites were used to calculate the values of the variables. Comparisons with results from using all sites showed a negligible effect of using this thinning procedure.

Background selection effects on diversity for autosomes and X chromosomes for genes in regions with different CO rates were calculated as described in sections S9 and S10 of File S1 of CC, which included estimates of the effects of BGS caused by selectively constrained noncoding sequences as well as coding sequences, derived from Charlesworth (2012). If gene conversion was absent, the correction factors for gene conversion used to calculate these effects were omitted.

Data availability statement

The author states that all data necessary for confirming the conclusions presented in the article are represented fully within the article. The author states that no new data or reagents were generated by this research. Details of some of the mathematical derivations are described in the Supplementary Information, File S1 on Figshare. The codes for the computer programs used to implement the analytical models described below are available in the Supplementary Information, File S2 on Figshare. The detailed statistics for the results of the computer simulations shown in Figure 3 were provided in Files S2–S3 of Campos and Charlesworth (2019). Supplemental material available at figshare: https://doi.org/10.25386/genetics.13136012.

Results

The effect of a single sweep on expected nucleotide site diversity

Theoretical results—single sweep:

The aim of this section is to obtain an expression for the mean coalescent time at a neutral site linked to a selected locus, at the time of fixation of the selectively favored allele; under the infinite sites model, this yields the expected pairwise diversity at the neutral site. All times are expressed on the coalescent timescale of 2Ne generations, where Ne is the neutral effective population size for the genetic system under consideration (autosomal or X-linked loci, random mating, or partial inbreeding). If we use Ne0 to denote the value of Ne for a randomly mating population with autosomal inheritance, Ne for a given genetic system can be written as kNe0, where k depends on the details of the system in question (Wright 1939, 1969; Crow and Kimura 1970; Charlesworth and Charlesworth 2010). For example, with an autosomal locus in a partially inbreeding population with Wright’s fixation index F > 0, we have k ≈ 1/(1 + F) under a wide range of conditions (Pollak 1987; Nordborg 1997; Laporte and Charlesworth 2002). In addition, following Kim and Stephan (2000) and CC, if BGS is operating, it is assumed that, for purely neutral processes, Ne can be replaced by the quantity B1Ne, where B1 measures the effect of BGS on the mean neutral coalescent time of a pair of alleles. The effect of BGS on the fixation probabilities of favorable mutations is likely to be somewhat less than that for neutral processes, so that a second coefficient, B2, should ideally be used as a multiplier of Ne, where B2 = B1/λ (λ ≤ 1). As discussed in CC, B1 can be determined analytically for a given genetic model, whereas B2 usually requires simulations, so it is often more convenient to use B1 for both purposes, although this procedure introduces some inaccuracies.

As has been discussed in previous treatments of sweeps, there are two stochastic phases during the spread of a favorable mutation, A2, in competition with a wild-type allele, A1. A detailed analysis of these stochastic phases for the general model of selection used here is given by Charlesworth (2020). In the first phase, the frequency of A2 is so low that it is subject to random fluctuations that can lead to the loss of A2 from the population. Provided that the product of Ne and the selection coefficient for homozygotes for the favorable allele (s) is >>1, a mutation that survives this phase will enter the deterministic phase, where it has a negligible probability of loss, and in which its trajectory of allele frequency change is well approximated by the deterministic selection equation (Equation 6 below). When A2 reaches a frequency close to 1, A1 is now vulnerable to stochastic loss, so that there is a second stochastic phase. Formulae for the frequencies of A2 at the boundaries of the two stochastic phases, q1 and q2, are given by Charlesworth (2020), together with expressions for the durations of the stochastic and deterministic phases. For mutations with intermediate levels of dominance, q1, 1 – q2 and the durations of the two stochastic phases are all of the order of 1/(2Nes), measured on the coalescent timescale of 2Ne generations.

If q2 is close to 1, A2 has only a small chance of encountering an A1 allele, so that there is a negligible chance that a neutral site in a haplotype carrying A2 will recombine onto an A1 background during the final stochastic phase. In addition, the rate of coalescence within haplotypes carrying A2 is then close to the neutral value, and so does not greatly affect the mean time to coalescence of a pair of alleles sampled after the end of the sweep compared with neutral expectation. Under these conditions, the second stochastic phase has little effect on the mean coalescent time of the alleles compared with neutral expectation. Provided that the duration of the first stochastic phase on the coalescent time scale is <<1 (i.e., q1 is close to 0), this phase will also have a minimal impact on the mean coalescent time of such a pair of alleles. Accurate approximations for the effect of a single sweep on diversity can, therefore, usually be obtained by treating the beginning and end of the deterministic phase as equivalent to that for the sweep as a whole, as discussed by Charlesworth (2020).

The general framework presented in HB can then be used to determine the effect of a sweep on pairwise diversity, extended to include a more general model of selection as well as the possibility of BGS effects, and using analytical expressions for probabilities of coalescence and recombination during the sweep rather than numerical evaluations. This approach assumes that all evolutionary forces are weak (i.e., second order terms in changes in allele frequencies and linkage disequilibrium can be neglected), so that a continuous time scale approximation can be applied.

Let Td be the duration of the deterministic phase, defined as the period between frequencies q1 and q2 as given by Charlesworth (2020). With BGS, the terms in Ne in the relevant expressions are each to be multiplied by B2, as was done in CC. For a pair of haplotypes that carry the favorable allele A2 at the end of the sweep, the rate of coalescence at a time T back from this time point is $[B_1 q(T)]^{-1}$, where q(T) is the frequency of A2 at time T. The rate at which a linked neutral site recombines from A2 onto the wild-type background at time T is ρ[1 – q(T)] = ρp(T), where ρ = 2Ner is the scaled recombination rate and r is the absolute recombination rate between the selected and neutral loci. With inbreeding and/or sex-linkage, r differs from its random mating autosomal value, r0, such that r = cr0, where c is a function of the genetic system and mating system. For example, with autosomal inheritance with partial inbreeding, c ≈ 1 – 2F + φ, where φ is the joint probability of identity by descent at a pair of neutral loci (Roze 2009; Hartfield and Bataillon 2020). Unless both r0 and F are sufficiently large that their second-order terms cannot be neglected, we have c ≈ 1 – F (Nordborg 1997; Charlesworth and Charlesworth 2010, p. 381). The exact value of φ is determined by the mating system; in the case of self-fertilization, Equation 1 of HB gives an expression for φ as a function of r0 and the rate of self-fertilization, which is used in the calculations presented here.

Under these assumptions, the probability density function (p.d.f.) for a coalescent event at time T for a pair of alleles sampled at the end of the sweep is:

$$f_c(T) = [B_1 q(T)]^{-1}\, P_{nc}(T)\, P_{nr}(T) \tag{1}$$

where Pnc(T) is the probability of no coalescence by time T in the absence of recombination, and Pnr(T) is the probability that neither allele has recombined onto the wild-type background by time T, in the absence of coalescence.

Similarly, the p.d.f. for the event that one of the two sampled haplotypes recombines onto the wild-type background at time T (assuming that r is sufficiently small that simultaneous recombination events can be ignored) is given by:

$$f_r(T) = 2\rho\, p(T)\, P_{nc}(T)\, P_{nr}(T) \tag{2}$$

We therefore have:

$$P_{nc}(T) = \exp\left[-B_1^{-1}\int_0^T q(x)^{-1}\,dx\right] \tag{3}$$

$$P_{nr}(T) = \exp\left[-2\rho\int_0^T p(x)\,dx\right] \tag{4}$$

The net probability that the pair of sampled alleles coalesce during the deterministic phase of the sweep is given by:

$$P_{c1} = B_1^{-1}\int_0^{T_d} q(T)^{-1}\, P_{nc}(T)\, P_{nr}(T)\,dT \tag{5a}$$

If it is assumed that haplotypes that have neither recombined nor coalesced during the sweep coalesce with probability one at the start of the sweep, there is an additional contribution to the coalescence probability, given by:

$$P_{c2} = P_{nc}(T_d)\, P_{nr}(T_d) \tag{5b}$$

The net probability of coalescence caused by the sweep is thus:

$$P_{cs} = P_{c1} + P_{c2} \tag{5c}$$

These equations are simple in form, but getting explicit formulae is made difficult by the nonlinearity of the equation for the rate of change of q under selection. Following Charlesworth (2020), for the case of weak selection (when terms of order $s^2$ can be ignored) we can write the forward-in-time selection equation as:

$$\frac{dq}{d\tilde{T}} = \gamma\, p\, q\, (a + bq) \tag{6}$$

where tildes are used to denote time measured from the start of the sweep; γ = 2Nes is the scaled selection coefficient for A2A2, assigning a fitness of 1 to A1A1 and an increase in relative fitness of s to A2A2. Here, a and b depend on the dominance coefficient h and fixation index F, the genetic and mating systems, and the sex-specificity of fitness effects (Glémin 2012; Charlesworth 2020). For example, for an autosomal locus, the weak selection approximation gives a = F + (1 – F)h and b = (1 – F)(1 – 2h).

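For a > 0 and a + b > 0, corresponding to intermediate levels of dominance, integration of Equation 6 yields the following expression for the expectation of the duration of the deterministic phase, Td (Charlesworth 2020):

$$T_d \approx \gamma^{-1}\left[\frac{1}{a}\ln\left(\frac{q_2}{q_1}\right) + \frac{1}{a+b}\ln\left(\frac{p_1}{p_2}\right) + \frac{b}{a(a+b)}\ln\left(\frac{a+bq_1}{a+bq_2}\right)\right] \tag{7}$$

Here, q1 ≈ 1/(2aγ) and p2 ≈ 1/[2(a + b)γ] (Charlesworth 2020).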
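As a check on this expression, the sketch below evaluates Td from the closed form obtained by integrating dT = dq/[γpq(a + bq)] between q1 and q2 by partial fractions, and compares it with a direct midpoint-rule integration. The function names are hypothetical, the boundary frequencies q1 ≈ 1/(2aγ) and p2 ≈ 1/[2(a + b)γ] are taken from the text, and BGS is ignored (B2 = 1).

```python
import math

def dominance_coefficients(h, F=0.0):
    """a and b for an autosomal locus under the weak-selection approximation."""
    return F + (1.0 - F) * h, (1.0 - F) * (1.0 - 2.0 * h)

def deterministic_duration(gamma, a, b):
    """Closed-form T_d in units of 2Ne generations; requires a > 0, a + b > 0."""
    q1 = 1.0 / (2.0 * a * gamma)
    p2 = 1.0 / (2.0 * (a + b) * gamma)
    q2, p1 = 1.0 - p2, 1.0 - q1
    total = math.log(q2 / q1) / a + math.log(p1 / p2) / (a + b)
    total += b / (a * (a + b)) * math.log((a + b * q1) / (a + b * q2))
    return total / gamma

def deterministic_duration_numeric(gamma, a, b, steps=200_000):
    """Midpoint-rule integral of dq / [gamma * p * q * (a + b*q)] from q1 to q2."""
    q1 = 1.0 / (2.0 * a * gamma)
    q2 = 1.0 - 1.0 / (2.0 * (a + b) * gamma)
    width = (q2 - q1) / steps
    total = 0.0
    for i in range(steps):
        q = q1 + (i + 0.5) * width
        total += width / (gamma * (1.0 - q) * q * (a + b * q))
    return total

# Illustrative case: partially recessive favorable allele, h = 0.25, gamma = 500.
a, b = dominance_coefficients(0.25)
print(deterministic_duration(500, a, b), deterministic_duration_numeric(500, a, b))
```

For semidominant selection (a = 1/2, b = 0) the closed form reduces to 4 ln(q2/q1)/γ, which is approximately 4 ln(γ)/γ for large γ.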

Similar expressions are available for the cases when a = 0 (complete recessivity) or a + b = 0 (complete dominance), as described by Charlesworth (2020); see Equations A1b and A1c, respectively.

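Using Equation 6, we can write T as a monotonic function of q, T(q). Substituting q for T and using the relation $dT = -d\tilde{T} = -dq/[\gamma p q (a + bq)]$, Equations 3, 4, and 5a then become:

$$P_{nc}(q) = \exp\left[-(B_1\gamma)^{-1}\int_q^{q_2}\frac{dx}{(1-x)\,x^2\,(a+bx)}\right] \tag{8a}$$

$$P_{nr}(q) = \exp\left[-2\rho\gamma^{-1}\int_q^{q_2}\frac{dx}{x\,(a+bx)}\right] \tag{8b}$$

$$P_{c1} = (B_1\gamma)^{-1}\int_{q_1}^{q_2}\frac{P_{nc}(q)\,P_{nr}(q)}{(1-q)\,q^2\,(a+bq)}\,dq \tag{8c}$$

Explicit formulae for Pnc(q) and Pnr(q) are given in the Appendix (Equations A2 and A3).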

Substituting q1 for q in Equations 8a and 8b, Equation 5b can be written as:

$$P_{c2} = P_{nc}(q_1)\,P_{nr}(q_1) \tag{8d}$$

The net expected pairwise coalescence time associated with the sweep, Ts, includes a contribution from the case when no coalescence occurs until the start of the sweep, given by the product of Pc2 and Td, and a contribution from coalescent events that occur during the sweep, denoted by Tc. We have:

$$T_s = P_{c2}\,T_d + T_c \tag{9a}$$

where

$$T_c = (B_1\gamma)^{-1}\int_{q_1}^{q_2}\frac{T(q)\,P_{nc}(q)\,P_{nr}(q)}{(1-q)\,q^2\,(a+bq)}\,dq \tag{9b}$$

and T(q) is the time to reach frequency q of A2, given by Equations A1.
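The probabilities entering these expressions can be evaluated by simple quadrature. The sketch below is an illustration under stated assumptions rather than the program in File S2: it takes the coalescence rate as 1/[B1q], the pairwise recombination rate as 2ρ(1 – q), and dT = dq/[γ(1 – q)q(a + bq)], uses a grid that is uniform on the logit scale of q, and returns the probabilities of coalescence during the sweep, of coalescence at its start, and of recombination. All function names are hypothetical.

```python
import numpy as np

def trapezoid(y, x):
    """Trapezoidal integral of y over the (possibly uneven) grid x."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def cumulative_from_top(y, x):
    """Integral of y from each grid point up to the last grid point."""
    pieces = 0.5 * (y[1:] + y[:-1]) * np.diff(x)
    return np.concatenate((np.cumsum(pieces[::-1])[::-1], [0.0]))

def sweep_probabilities(gamma, rho, a, b, B1=1.0, n=40_001):
    """Return (P_c1, P_c2, P_r) for a > 0 and a + b > 0."""
    q1 = 1.0 / (2.0 * a * gamma)
    q2 = 1.0 - 1.0 / (2.0 * (a + b) * gamma)
    # Uniform spacing in logit(q) resolves both ends of the trajectory.
    u = np.linspace(np.log(q1 / (1.0 - q1)), np.log(q2 / (1.0 - q2)), n)
    q = 1.0 / (1.0 + np.exp(-u))
    # Event rates per unit of allele frequency (rate per unit time * dT/dq).
    coal = 1.0 / (B1 * gamma * (1.0 - q) * q**2 * (a + b * q))
    rec = 2.0 * rho / (gamma * q * (a + b * q))
    P_nc = np.exp(-cumulative_from_top(coal, q))  # no coalescence back to q
    P_nr = np.exp(-cumulative_from_top(rec, q))   # no recombination back to q
    P_c1 = trapezoid(coal * P_nc * P_nr, q)
    P_r = trapezoid(rec * P_nc * P_nr, q)
    P_c2 = float(P_nc[0] * P_nr[0])
    return P_c1, P_c2, P_r

# Illustrative values: semidominant selection, gamma = 1000, rho = 5.
print(sweep_probabilities(1000.0, 5.0, 0.5, 0.0))
```

Because coalescence during the sweep, recombination, and coalescence at the start of the sweep are the only possible outcomes in this treatment, the three returned probabilities should sum to one, which provides a convenient numerical check.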

Results with only a single recombination event:

The possibility of recombination back onto the background of A2, examined in CC, is ignored for the present, as is the possibility of a second recombination event from A2 onto A1. From Equation 2, the probability of at least one recombination event is given by:

$$P_r = 2\rho\int_0^{T_d} p(T)\,P_{nc}(T)\,P_{nr}(T)\,dT = 2\rho\gamma^{-1}\int_{q_1}^{q_2}\frac{P_{nc}(q)\,P_{nr}(q)}{q\,(a+bq)}\,dq \tag{10}$$

Using Equations 6 and 10 and A1-A3, Pr can be expressed explicitly in terms of ρ, γ, a, and b, but the resulting expression has to be evaluated numerically.

The net expected pairwise coalescence time in the presence of BGS under this set of assumptions is given by B1Pr + Ts. Under the infinite sites model (Kimura 1971), the expected reduction in pairwise nucleotide site diversity for alleles sampled at the end of the sweep, relative to its value in the absence of selection (θ), is given by:

$$-\Delta\pi = 1 - P_r - B_1^{-1}\,T_s \tag{11a}$$

Equation 9 of HB for the case of a hard sweep is equivalent to Equation 11a without the term in Ts. In addition, if Ts and the probability of coalescence during the sweep are both negligible, it is easily seen that Pr ≈ 1 – Pnr(Td), yielding the following result for the star phylogeny approximation (Barton 1998, 2000; Durrett and Schweinsberg 2004; Weissman and Barton 2012):

$$-\Delta\pi \approx P_{nr}(T_d) \tag{11b}$$

In the case of an autosomal locus with random mating and semidominant selection (h = 0.5), this yields the following convenient formula:

$$-\Delta\pi \approx \gamma^{-4r/s} \tag{11c}$$

As mentioned in the Introduction, this formula has been used in several methods for making inferences from population genomic data.
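The star phylogeny approximation is easy to evaluate directly. The sketch below assumes that the probability of no recombination over the deterministic phase is exp{−(2ρ/γ)∫dq/[q(a + bq)]} between q1 and q2, which integrates to a power of q2(a + bq1)/[q1(a + bq2)], and that the semidominant autosomal case reduces to γ raised to the power −4r/s (with ρ/γ = r/s). The function names are hypothetical.

```python
import math

def star_reduction_general(gamma, rho, a, b):
    """P_nr over the whole deterministic phase, for a > 0 and a + b > 0."""
    q1 = 1.0 / (2.0 * a * gamma)
    q2 = 1.0 - 1.0 / (2.0 * (a + b) * gamma)
    ratio = q2 * (a + b * q1) / (q1 * (a + b * q2))
    return ratio ** (-2.0 * rho / (a * gamma))

def star_reduction_semidominant(gamma, r_over_s):
    """Convenient form for an autosomal locus with h = 0.5 and random mating."""
    return gamma ** (-4.0 * r_over_s)

# Illustrative values: gamma = 1000 and r/s = 0.01 (so rho = 10).
print(star_reduction_general(1000.0, 10.0, 0.5, 0.0))
print(star_reduction_semidominant(1000.0, 0.01))
```

The two functions differ only through the use of q2/q1 = γ − 1 rather than γ in the semidominant case, so they agree closely when γ is large.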

The importance of coalescence during a sweep:

These results bring out the potential importance of considering coalescence during a sweep, as opposed to the coalescence of nonrecombined alleles at the start of a sweep. Consider the case with incomplete dominance (a ≠ 0). The probability of no coalescence during the sweep conditional on no recombination, Pnc(q1), is given by Equation A2a with q = q1, where q1 ≈ $(2a\gamma)^{-1}$ (Charlesworth 2020). Somewhat surprisingly, for large γ this expression becomes independent of a and γ, provided that $a^{-2} \ll \gamma$, and approaches $e^{-2}$ ≈ 0.135, so that the probability of coalescence during a sweep in the absence of recombination is ∼0.865 (see the Appendix). With low rates of recombination, there is thus a high probability of coalescence during the sweep itself, in contrast to what is assumed in Equations 11b and 11c. If such a coalescent event is not preceded by a recombination event, the mean coalescent time will thus be smaller than predicted by these equations.
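A heuristic sketch of why this limit arises (not the Appendix derivation) is as follows, assuming B1 = 1 and using dT = dq/[γ(1 – q)q(a + bq)] together with a coalescence rate of 1/q. For small q1 the integral is dominated by the term in $1/(ax^2)$ near its lower limit:

$$-\ln P_{nc}(q_1) = \frac{1}{\gamma}\int_{q_1}^{q_2}\frac{dx}{(1-x)\,x^2\,(a+bx)} \approx \frac{1}{\gamma}\int_{q_1}^{q_2}\frac{dx}{a\,x^2} \approx \frac{1}{a\gamma q_1} = 2$$

where the last step uses $q_1 \approx (2a\gamma)^{-1}$. The neglected terms are of order $\ln(\gamma)/\gamma$, multiplied by factors that depend on a and b.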

This raises the question of the magnitude of Ts in the more exact treatment. While Equation 9 can only be evaluated exactly by numerical integration, a rough estimate of Ts for the case of no recombination can be obtained as follows (this is the maximum value, as the terms involving the probability of no recombination must decrease with the frequency of recombination). By the above result for Pnc(q1), the first term in Equation 9 is approximately $e^{-2}T_d$. The second term is equivalent to the mean coalescent time associated with events during the sweep; by the argument presented in section S3 of File S1 in CC, this is approximately equal to the harmonic mean of 1/q between q1 and q2. Equation S10 of CC for this quantity can be generalized as shown in the Appendix, with the result that the expected coalescent time associated with the sweep (Tc) is approximately $\tfrac{1}{2}T_d$ for large γ, giving Ts ≈ 0.635Td.

Table 1 and Supplemental Material, Table S1 of File S1 compare the results from numerical integrations with this approximation; as expected from the assumptions made in deriving this approximation, it is most accurate when γ is large and a is not too close to 1. Overall, for low frequencies of recombination, Ts is a non-negligible fraction of Td, but decreases toward zero with increasing rates of recombination, as would be expected.

Table 1. Parameters describing the effect of a single sweep

Multiple recombination events:

Finally, the problem of multiple recombination events needs to be considered. In principle, this problem can be dealt with on the lines of Equation 10, but this involves multiple integrals of increasing complexity as more and more possible events are considered. The following heuristic argument can be used instead. A first approximation is to assume that, if the frequency of recombination is sufficiently high, multiple recombination events are associated with a coalescent time equal to that of an unswept background, B1. In contrast, a single recombinant event is associated with a mean coalescent time of B1 + Td, since the recombinant cannot coalesce with the nonrecombinant haplotype until the end of the sweep. If the probability of a single recombinant event is denoted by Prs, Equation 11a for the reduction in diversity (−Δπ) is replaced by:

$$-\Delta\pi \approx 1 - B_1 P_r - P_{rs} T_d - T_s \tag{12}$$

Prs is given by the probability of a recombination event that is followed by no further recombination events. This event requires both the recombinant A1 haplotype (whose rate of recombination at an A2 frequency of x is ρx) and the nonrecombinant A2 haplotype (whose rate of recombination is ρ[1 – x]) to fail to recombine.

We thus have:

[Equation 13a: not recoverable from the source]

where Pnr(q1, q) is the probability of no further recombination after an A2 frequency of q, given by:

[Equation 13b: not recoverable from the source]

However, Equation 12 ignores the fact that there is a time-lag until the initial recombination event, whose expectation, conditioned on the occurrence of the initial recombination event, is denoted by Tr. This lag contributes to the time to coalescence of multiple recombinant alleles, causing the reduction in diversity to be smaller than predicted by Equation 12. The probability of multiple recombination events is (Pr – Prs), so that a better approximation is to deduct (Pr – Prs)Tr from the reduction in diversity (−Δπ) predicted by Equation 12, giving:

$$-\Delta\pi \approx 1 - B_1 P_r - P_{rs} T_d - (P_r - P_{rs})\,T_r - T_s \tag{14a}$$

where

[Equation 14b: not recoverable from the source]

The integral for Tr can be expressed in terms of ρ, γ, a, and b, on the same lines as for Equation 10.

Equations 14a and 14b are likely to overestimate the effect of recombination on the sweep effect, as complete randomization of the sampled pair of haplotypes is unlikely to be achieved, whereas Equation 11a clearly underestimates it; Equation 12 should produce an intermediate prediction. The correct result should thus lie between the predictions of Equation 11a and Equation 14a. When the ratio of the rate of recombination to the selection coefficient, r/s, is <<1, all three expressions agree, and predict a slightly smaller sweep effect than Equation 9 of HB.
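As a rough illustration of how the three predictions are ordered, the sketch below evaluates them for made-up inputs. It reads the verbal description above literally: Equation 11a as one minus the net coalescence time B1Pr + Ts, Equation 12 with an additional Td attached to single recombinants, and Equation 14a with a further (Pr – Prs)Tr deducted. The function names and numerical inputs are hypothetical; the probabilities and times are taken as given rather than computed from the integrals in the Appendix.

```python
def reduction_eq11a(b1, p_r, t_s):
    """Reduction in diversity when recombinants coalesce at time B1."""
    return 1.0 - b1 * p_r - t_s


def reduction_eq12(b1, p_r, p_rs, t_d, t_s):
    """As Equation 11a, but single recombinants wait an extra Td."""
    return reduction_eq11a(b1, p_r, t_s) - p_rs * t_d


def reduction_eq14a(b1, p_r, p_rs, t_d, t_r, t_s):
    """As Equation 12, with the lag Tr added for multiple recombinants."""
    return reduction_eq12(b1, p_r, p_rs, t_d, t_s) - (p_r - p_rs) * t_r


# Hypothetical inputs, with times in units of the neutral coalescent time.
B1, P_R, P_RS = 1.0, 0.6, 0.4   # no BGS; P_RS cannot exceed P_R
T_D, T_R, T_S = 0.05, 0.02, 0.01

r11a = reduction_eq11a(B1, P_R, T_S)
r12 = reduction_eq12(B1, P_R, P_RS, T_D, T_S)
r14a = reduction_eq14a(B1, P_R, P_RS, T_D, T_R, T_S)

# Equation 11a gives the largest sweep effect, Equation 14a the smallest.
print(r11a >= r12 >= r14a)
```

Because Prs ≤ Pr and all times are non-negative, the ordering holds for any admissible inputs, which is why the correct result is expected to be bracketed by the first and last of these expressions.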

Comparisons with simulation results

Numerical results for Equation 11 can be obtained by numerical integration of the formulae given in the Appendix. For speed of computation, Simpson’s rule with n + 1 points was used here; this method approximates the integral of a function by a weighted sum of discrete values of the integrand over n equally spaced subdivisions of the range of the function (Atkinson 1989). It was found that n = 200 usually gave values that were close to those for a more exact method of integration; for the results in the figures in this section, n = 2000 was used. Background selection effects are ignored here, so that B1 and B2 are set to 1. Simulation results for hard sweeps for an autosomal locus with random mating were obtained using the algorithm of Tajima (1990) (see the Methods section), providing a basis for comparison with the predictions based on Equations 11a and 14 (denoted by C1 and C2, respectively), and on the star phylogeny approximation that ignores coalescence of nonrecombined alleles during the sweep, Equation 11b (NC). The results are shown in
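The integration scheme described in this section can be sketched as follows. This is a minimal composite Simpson's rule over n equal subdivisions (n + 1 points), written for illustration only; it is not the code used for the results, and the integrand below is an arbitrary stand-in with a known integral rather than one of the Appendix formulae.

```python
import math


def simpson(f, lower, upper, n):
    """Composite Simpson's rule with n equal subdivisions (n must be even)."""
    if n <= 0 or n % 2 != 0:
        raise ValueError("n must be a positive even integer")
    h = (upper - lower) / n
    total = f(lower) + f(upper)
    for i in range(1, n):
        # Interior points alternate between weights 4 (odd i) and 2 (even i).
        weight = 4 if i % 2 == 1 else 2
        total += weight * f(lower + i * h)
    return total * h / 3.0


def integrand(x):
    # Stand-in integrand: exp(-x), whose integral on [0, 1] is 1 - exp(-1).
    return math.exp(-x)


exact = 1.0 - math.exp(-1.0)
err_200 = abs(simpson(integrand, 0.0, 1.0, 200) - exact)
err_2000 = abs(simpson(integrand, 0.0, 1.0, 2000) - exact)
print(err_200, err_2000)
```

For a smooth integrand such as this one, the error falls roughly as the fourth power of the step size, so n = 200 is already close to the exact value and n = 2000 mainly adds a safety margin.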
