Background selection and hitchhiking effect

Background selection and hitchhiking effect

We are searching data for your request:

Forums and discussions:
Manuals and reference books:
Data from registers:
Wait the end of the search in all databases.
Upon completion, a link will appear to access the found materials.

As I read the maintenance or removal of an allele from a population is dependent upon the alleles in its linkage group; so, when an advantageous mutation occurs in the population, variation in nearby alleles that linked to that mutated allele is reduced to fix (maintain) the mutated allele in the population (hitchhiking effect). This is true for deleterious mutations and background selection. But, I cannot understand how background selection can remove the deleterious mutation from the population? Could you please kindly explain to me about it?


The phrasing in your question is a little unclear which makes it hard to know exactly what is unclear to you. But I'll give it a try hoping that it helps.

Background selection (BGS) is the process by which deleterious mutations reduce genetic diversity at nearby loci. When you say

I cannot understand how background selection can remove the deleterious mutation from the population

I would like to highlight that BGS does not remove the deleterious mutations per say. BGS is caused by deleterious mutations and has for consequence to reduce genetic diversity whether this diversity represent any variant for fitness.

An easy way to get an intuition of this effect is to think of the effective population size. In a Wright-Fisher population, where all individuals have the same probability to reproduce, the effective population size is equal to the population size. Imagine now that half of the individuals in the population carry a mutation that make them completely sterile. If this is the case, then the effective population size is reduced to half the population size. As the effective population is smaller, so is the genetic diversity. Hence, deleterious mutations (and fitness variance in general) reduce the effective population (in comparison to the actual population size).

Now, a single deleterious mutation that disappears in the next generation will affect the entire genome equally. However, repetitive deleterious mutations at specific loci (conserved regions) that stay around for a few generations will affect the effective population size at this region more than at other regions. Yes, the effective population size, for a population, can vary along the genome.

The strength of BGS to affect various statistics such as expected heterozygosity, $D_{XY}$ or $F_{ST}$ is hard to estimate as both hitchhiking (selective sweeps) and BGS leave very similar genetic signature. See for example Matthey-Doret and Whitlock (2018; bioRxiv) (shameless self-citation).

Genetic hitchhiking and the dynamic buildup of genomic divergence during speciation with gene flow

A major issue in evolutionary biology is explaining patterns of differentiation observed in population genomic data, as divergence can be due to both direct selection on a locus and genetic hitchhiking. "Divergence hitchhiking" (DH) theory postulates that divergent selection on a locus reduces gene flow at physically linked sites, facilitating the formation of localized clusters of tightly linked, diverged loci. "Genome hitchhiking" (GH) theory emphasizes genome-wide effects of divergent selection. Past theoretical investigations of DH and GH focused on static snapshots of divergence. Here, we used simulations assessing a variety of strengths of selection, migration rates, population sizes, and mutation rates to investigate the relative importance of direct selection, GH, and DH in facilitating the dynamic buildup of genomic divergence as speciation proceeds through time. When divergently selected mutations were limiting, GH promoted divergence, but DH had little measurable effect. When populations were small and divergently selected mutations were common, DH enhanced the accumulation of weakly selected mutations, but this contributed little to reproductive isolation. In general, GH promoted reproductive isolation by reducing effective migration rates below that due to direct selection alone, and was important for genome-wide "congealing" or "coupling" of differentiation (F(ST)) across loci as speciation progressed.

Keywords: De novo speciation divergent selection effective migration evolutionary theory genome architecture individual-based models primary contact.

© 2013 The Author(s). Evolution © 2013 The Society for the Study of Evolution.


Accumulating evidence from ecological and genomic surveys of microbial diversity indicates that archaea and bacteria (collectively, prokaryotes) in nature are organized into genotypic clusters that largely coincide with distinct ecological characteristics [1]. How such patterns of microbial diversity are formed and maintained is an open question in microbial ecology. It is generally agreed that restriction of recombination and balancing selection between genotypic clusters are necessary for stable, sympatric coexistence of multiple distinct clusters [1-4]. More controversial are the roles played by selection and recombination in the formation of such clusters [5,6].

A prominent concept, known as the ecotype model, posits a central role for positive selection and restricted recombination for cluster formation [2]. According to this model, when positive selection causes fixation of a beneficial gene (allele) at one locus in the genome within a population, it also entails fixation at all other loci because recombination is not frequent enough to unlink the beneficial gene from the rest of the genome (Figure 1a). This phenomenon is known as genome-wide selective sweep [7] or genetic hitchhiking [8]. Genome-wide selective sweeps can repeatedly occur in a population adapting to a new environment, each time purging within-population genetic diversity, a phenomenon known as periodic selection [9]. Periodic selection makes a population genetically cohesive and distinct from other populations, leading to cluster formation [5].

Schematic diagram showing the mode of selective sweeps under various scenarios. A closed curve within a cell denotes a prokaryote genome. The different line styles of the genomes indicate the presence of neutral diversity. Green triangles denote the ecologically advantageous allele that spreads through the population. White triangles denote the wild-type allele at the same locus. Other symbols on the genomes denote genes that determine susceptibility to viruses. (a) A genome-wide selective sweep occurs in the absence of NFDS without (frequent) recombination. Neutral diversity is lost after the fixation of the beneficial allele. (b) A gene-specific selective sweep occurs in the presence of NFDS with recombination only at the locus where the ecologically beneficial allele appears. Neutral diversity is maintained after the fixation of the beneficial allele. (c) A genome-wide selective sweep occurs in the presence of NFDS with recombination only at the locus that determines susceptibility to viruses. Neutral diversity is lost. NFDS, negative frequency-dependent selection.

Recently, this common view of prokaryote evolution has been challenged by the study of Shapiro et al. [10], which explored the spread of adaptive genes through natural populations of ocean microbes, Vibrio cyclitrophicus. This study has shown that adaptive genes spread through microbial populations via horizontal gene transfer (i.e., recombination) without purging genome-wide diversity that was present before the selective sweep. Accordingly, this mode of evolution is denoted gene-specific selective sweep. Under this evolutionary regime, ecologically differentiated individuals are not genetically differentiated at the vast majority of polymorphic loci. Thus, cluster formation rests on establishment of recombination barriers between ecologically differentiated populations [10,11], a situation similar to that of sexually reproducing eukaryotes (e.g., [12-14]). Gene-specific selective sweeps have additional implications for the ecology and evolution of prokaryotes. In particular, under this scenario, genetic diversity within a population is protected from genome-wide selective sweeps that could potentially exert major effects on ecological characteristics of populations such as primary productivity [15]. Moreover, under the gene-specific sweep model, the evolutionary history of a population cannot be described by a single line of succession of common ancestors, in contrast to the implications of the ecotype model [16].

The gene-specific selective sweeps in prokaryotes not only challenge the common view of prokaryote evolution, but also are puzzling with respect to the underlying mechanism. At face value, a gene-specific selective sweep implies that recombination is so frequent as to unlink ecologically beneficial alleles from the rest of the genome before they rise to high frequencies [17,18]. This apparently would require recombination rates far higher than those currently inferred from the available data from many prokaryotes [17,19-22] (see also ‘Discussion’). Although one cannot completely exclude the possibility that current methods underestimate recombination rates by several orders of magnitude, it seems worthwhile to seek an alternative explanation as suggested by the following consideration. The concept of periodic selection, on which the ecotype model is based, was originally derived from experimental evolution of pure bacterial cultures in an isolated environment [9]. Most microbes in the wild do not evolve under such controlled conditions. Rather, they struggle for existence amidst a highly heterogeneous, dynamic ecological community including other evolving microbes, hosts, predators, viruses and plasmids. Ecological interactions with these diverse biological entities substantially impact the course of microbial evolution as indicated by recent work on experimental coevolution [23].

A general mechanism by which such an impact can be made is negative frequency-dependent selection (NFDS), which is a type of selection that favors rare phenotypes in a population [24]. NFDS can be caused by various ecological interactions such as evasion of parasites (also known as the kill-the-winner dynamics) and attack on competitors (e.g., antibiotics production) as well as social interactions such as provision of public goods (e.g., siderophores and virulence factors) [24-28]. NFDS can generate and maintain genetic diversity within a population [29], with the implication that genes adaptive for every individual would spread through a population via recombination, hence gene-specific selective sweeps (Figure 1b) (NFDS has been suggested as a potential cause of gene sweeps by Shapiro et al. [10] related ideas have been explored by Maynard Smith [30], and Majewski and Cohan [31] see Discussion). According to this scenario, gene-specific selective sweeps would be a general phenomenon because many, if not most, free-living prokaryotes are likely to evolve under NFDS caused, in particular, by the ubiquitous viruses [32].

A potential problem with the NFDS scenario, however, is that the loci involved in these interactions often are located in genomic islands that appear to experience significantly elevated recombination rates [24,27,33-35]. For example, the O-antigen, the outermost part of lipopolysaccharide protruding from the surface of Gram-negative bacteria, is a typical virus receptor [27]. Genes encoding the O-antigen are likely to evolve under NFDS as attested by their high variability among closely related bacteria [27]. These genes typically are clustered in genomic islands that undergo frequent horizontal gene transfer [36]. In the Vibrio splendidus genome, O-antigen-encoding regions contain conserved signal sequences known as JUMP sites, which are exclusively found in these regions and are thought to be involved in natural transformation [37]. Other examples include genomic islands of Prochlorococcus cyanobacteria, which encompass genes for various metabolite transporters (potential targets of virus recognition) as well as many tRNA genes, repeat elements and integrases, which can enhance the rate of recombination in these islands [38]. Furthermore, genes encoding various antiviral defense mechanisms, such as restriction-modification systems, also form clusters known as defense islands, which are significantly co-localized with genes encoding transposons and prophage components [35]. Finally, genes encoding synthetases of secondary metabolites that can act as public goods have been shown to appear in mobile regions of prokaryotic genomes. For example, in marine vibrios, toxin-coding genes have been found in genomic islands [39]. Also, secreted virulence factor genes are typically located in the hyper-recombinant regions of many prokaryotic genomes [40]. As a consequence of elevated recombination rates, the loci affected by NFDS can be unlinked from the rest of the genome, and accordingly, NFDS would be unable to prevent genome-wide selective sweeps driven by other adaptive alleles (Figure 1c) [24]. Thus, evaluation of the potential effect of NFDS on fixation of beneficial alleles requires consideration of such biased recombination rates.

Here, we investigate whether and under what conditions NFDS can cause gene-specific selective sweeps in the presence of elevated recombination rates at the loci affected by NFDS. Using mathematical modeling, we show that NFDS indeed can cause gene-specific selective sweeps in large prokaryotic populations, but only when the basal recombination rate is sufficiently low, apparently contradicting the intuition that high recombination rates are required for gene-specific selective sweeps.


The predicted effect of BGS in a multisite context can be described by the quantity B = exp(–E), where B is the ratio of expected neutral diversity at a focal neutral site under BGS to its value in the absence of BGS (which is equivalent to the corresponding ratio of mean coalescence times), and E is the sum of the effects of each selected site (Hudson and Kaplan 1995 Nordborg et al. 1996 Santiago and Caballero 1998). We assume a genomic region containing many genes, with selected sites that are continuously distributed with constant density, as in Model 3 of Charlesworth (2012b). We distinguish between NS sites and UTRs. This is, of course, a somewhat crude approximation, given that our genic model includes neutrally evolving intronic and intergenic sequences. For simplicity, we describe the case of autosomal inheritance, but parallel results hold for X-linked loci, with the appropriate changes in selection, mutation, and recombination parameters.

We model both reciprocal exchange via crossing over and noncrossover-associated gene conversion. We assume that the main contribution from gene conversion comes from sites that are sufficiently distant that gene conversion causes recombination between them at a fixed rate g = rgdg (rg is the rate of initiation of gene conversion events in females and dg is the mean tract length). This is the limiting value of the general expression for the rate of recombination due to gene conversion for sites separated by z base pairs, g[1 – exp (– z/dg)] (Langley et al. 2000 Frisse et al. 2001), after correcting for the lack of gene conversion in male meiosis.

Because SLiM assumes no crossover interference, the relationship between the frequency of crossing over and map distance in the simulations follows the Haldane mapping function (Haldane 1919), such that the frequency of crossing over between a pair of sites separated by z base pairs is given by: (1) where rc is the rate of crossing over per base pair.

The net frequency of recombination between the sites is r(z) = g + c(z). The predicted value of E for a given selection coefficient, t = hs, against heterozygous carriers of a deleterious mutation, Et, is given by Equations S1–S5 in section S1 of File S1. To obtain the final value of E, this equation is numerically integrated over the probability distribution of t values for NS and UTR sites separately, with total deleterious mutation rates UN and UU for NS and UTR sites, respectively, giving values EN and EU for the corresponding BGS effects.

To mimic the simulation results, we assume a γ distribution with a shape parameter of 0.3. As in previous studies, we ignore all deleterious mutations with a scaled selection coefficient γ = 2Nes below a critical value γc, to deal with the problem that very weakly selected mutations are subject to drift and contribute little to BGS effects (Nordborg et al. 1996). Following Nordborg et al. (1996) and Campos et al. (2017), we set γc = 5, and the γ distributions for both NS and UTR mutations were truncated accordingly. Numerical results for the integral of the kernel of the γ distribution from γc to infinity allow the proportion of mutations that exceed γc to be calculated these are denoted by PN and PU for NS and UTR sites, respectively. With the parameters used in the simulations of autosomes, this gives PN = 0.871 and PU = 0.694. The final value for E is PNEN + PUEU, from which B can be obtained as exp(– E).

Various methods have been used to predict the approximate effect of a single SSW on diversity statistics at a partially linked neutral site in a randomly mating population, as well as for the associated distortion of the neutral site frequency spectrum at segregating sites (Maynard Smith and Haigh 1974 Kaplan et al. 1989 Stephan et al. 1992 Barton 1998, 2000 Gillespie 2000, 2001 Durrett and Schweinsberg 2004 Kim 2006 Pfaffelhuber et al. 2006 Coop and Ralph 2012 Bossert and Pfaffelhuber 2013). Here, we present a simple heuristic derivation of the effect of a sweep on the pairwise neutral nucleotide site diversity, π, based on a combination of coalescent process and diffusion equation approaches. Following earlier approaches, we obtain the probability that a neutral lineage associated with a favorable allele at the end of a sweep was also associated with it at the start of the sweep, rather with the wild-type allele at the selected locus.

We consider separately the deterministic and stochastic phases of the spread of a favorable mutation, which were identified early in the history of the study of sweeps (Maynard Smith and Haigh 1974 Kaplan et al. 1989 Stephan et al. 1992 Barton 1998). The initial spread of a favorable allele A2 from a frequency of 1/(2N) is subject to large stochastic effects. With semidominance, the probability that A2 survives this effectively neutral period is approximately Q = Nes/N in a large population (Kimura 1964), assuming that the scaled selection coefficient, γ = 2Nes, is much greater than one (s is the selective advantage to homozygotes for the favorable mutation). As pointed out by Maynard Smith (1976), the overall expected frequency of A2 during this quasi-neutral phase (including losses) is approximately 1/(2N), after which it starts to behave deterministically. The expected frequency of A2 at the end of the quasi-neutral phase, conditioning on its surviving with probability Q, is thus 1/(2NQ) = γ −1 . More rigorously, Martin and Lambert (2015) have used branching process theory to show that the frequency of A2 at the end of the first stochastic phase is exponentially distributed, with mean γ −1 and variance γ −2 .

In the presence of BGS, we follow Kim and Stephan (2000) and assume that Ne in the formula for fixation probability is multiplied by a constant, B (see above). As shown below, this constant is somewhat different for the effect of BGS on purely neutral processes, such as the level of neutral variability, and for the effect of BGS on the fixation of favorable mutations, since selected variants are more resistant to the effects of BGS than neutral variants (Johnson and Barton 2002). We denote these two constants by B1 and B2, respectively, and write λ for the ratio B1/B2. The critical frequency at which A2 can be treated as behaving deterministically is then (B2γ) −1 , using the argument in the preceding paragraph. When A2 reaches a frequency close to 1, there is a second stochastic phase in which it drifts to fixation fairly rapidly, as described below. We assume that all other effects of BGS are similar to those for neutral variability, with B1 as the factor that multiplies Ne.

The expectation of the time spent in the deterministic phase can be found as follows. As described by Ewens (2004, page 169), a semidominant favorable allele has the property that the expected time spent in a small interval of allele frequency q to q + dq is the same as the time spent in the interval 1 – q to 1 – q – dq. This implies that the expected time that A2 spends between 1/(2N) and (B2γ) −1 is the same as the expected time it spends between 1 – (B2γ) −1 and 1 – 1/(2N), so that q during the deterministic phase can conveniently be treated as lying between (B2γ) −1 and 1 – (B2γ) −1 . Using the solution of the deterministic selection equation dq/dt = 1/2 spq for a semidominant allele (Haldane 1924), the expected time spent in this interval (expressed in units of coalescent time, 2Ne generations) ≈ 2γ −1 ln(B2 2 γ 2 ) = 4γ −1 ln(B2γ).

he expected times spent in the two stochastic phases can be found as follows. Using Equation 16 of Kimura and Ohta (1973) and the fact that Ne is multiplied by B1 to take BGS into account, the expected first passage time of a neutral allele from initial frequency 1/(2N) to a frequency q is: (2) For q << 1, this time is approximately equal to B1q, so that the additional expected time spent in the first stochastic phase is approximately λγ −1 . By the above symmetry argument, the same applies to the time between 1 – (B2γ) −1 and 1–1/(2N). The total expected time to fixation of A2 when γ >> 1 is thus: (3) This expression is very close to Equation A17 of Hermisson and Pennings (2005) for the case with B1 = B2 = 1, which was derived directly from the diffusion equation for the mean sojourn time of a favorable mutation in a finite population.

As far as the effect of a substitution on neutral diversity is concerned, we note that the rate (in units of coalescent time) at which a neutral lineage that is associated with A2 at time T recombines onto a background of A1 is p(T)ρ, where p(T) is the frequency of the wild-type allele at time T and ρ = 2Ner is the scaled recombination rate. Here, T = 0 at the time of fixation of the favorable allele and T = Ts at the time when it arose in the population. From the symmetry of the selection equation, the mean frequency of A1 over the deterministic phase is 0.5, so that that ρ should be discounted by a factor of 1/2 during this part of the process (note that this argument ignores the possibility of coalescence competing with recombination during the sweep, as was pointed out to us by Matthew Hartfield a more rigorous treatment that includes such competition will be presented elsewhere).

For sample paths in which A2 reaches the critical frequency (B2γ) −1 , the expected duration of the first stochastic phase is equal to the expected value of the first passage time to this frequency, λγ −1 , and its variance is λ 2 γ −2 /3 (File S1, section S2). During this period, a single lineage recombines with A1 haplotypes at a rate close to ρ, since A1 dominates the population, thus contributing 2ρ (λγ −1 + δTs1) to the mean number of recombination events, where δTs1 is the departure of the duration of the first stochastic phase from its expectation. The final stochastic phase has effectively zero probability of contributing to recombination, due to the prevalence of the favored allele, and can be ignored for this purpose.

We exploit the fact that the frequency of A2 at the end of the first stochastic phase is exponentially distributed (Martin and Lambert 2015) to show that the variance of the time to fixation caused by fluctuations in the initial frequency of A2 at the start of the deterministic phase yields an additional variance term of 16(γ) −2 (File S1, section S2). Since this phase has a mean frequency of A2 of 0.5, the relevant product of the recombination rate and a fluctuation in deterministic sweep time (δTs2) is ρ δTs2 rather than 2ρ δTs2.

The probability Pcs that the two sampled haplotypes coalesce as a result of the sweep is equivalent to the probability that neither member of a pair of haplotypes sampled at time T = 0 recombined onto an A1 background, provided that the sweep durations are so short that no coalescence can occur among nonrecombined haplotypes during the sweep (Wiehe and Stephan 1993). This probability is given by the first term of a Poisson distribution, whose mean is equal to the expected number of recombination events over the duration of a substitution. We thus have: (4) The term (B2γ) −4 r / s in the third line of Equation 4 is the deterministic phase contribution to the effect of a sweep, first derived by Barton (2000, 1998) for the case of B1 = B2 = 1, using a more rigorous approach. It has been used in several subsequent studies (Weissman and Barton 2012 Elyashiv et al. 2016 Campos et al. 2017). The last term is second-order in r/s and is likely to be of minor importance, since sweeps only have substantial effects on variability when r/s << 1. The second term has a somewhat larger effect e.g., with no BGS, γ = 100, and r/s = 0.1, it reduces Pcs from 0.158 to 0.130. An extension to these results is described in section S3 of File S1 (Equation S20), which allows for multiple recombination events that bring a recombined lineage back onto an A2 background.

Sweeps at multiple sites:

We now consider the effects of recurrent sweeps at multiple sites. We use the standard assumption that substitutions of favorable alleles are sufficiently rare that their effects on a given site can be treated as mutually exclusive events (Kaplan et al. 1989 Wiehe and Stephan 1993 Kim and Stephan 2000 Kim 2006). We consider only a single gene, which is reasonable for favorable mutations whose selection coefficients are less than the rate of recombination between sites in different genes, as is usually the case here. These procedures are supported by our simulation results, except for cases with very low rates of recombination.

We use the expression for the probability of a sweep-induced coalescent derived above (Equation 4) to obtain an approximate expression for the net rate of coalescent events experienced at a given neutral site (in units of 2Ne generations), due to recurrent SSWs at NS and UTR sites: (5) where νa and νu are the rates (in units of coalescent time) at which substitutions of favorable mutations occur at NS and UTR sites, respectively Pcs Ni and Pcs Uj are the rates of sweep-induced coalescent events induced by the ith NS site and jth UTR site, respectively, which can be obtained from Equation 4 and Equation S20. The summations are taken over all the sites in the gene that are under selection. The notation S −1 is used to denote the reciprocal of the expected time to coalescence due to sweeps, S, where subscripts a and u denote NS and UTR mutations, respectively.

If we assume that the fixation probability of a favorable mutation in the presence of BGS is discounted by a factor of B2 compared with the standard value (see above), we have: (6a) (6b) where u is the mutation rate per nucleotide site, and pa and pu are the proportions of all new NS and UTR mutations, respectively, that are selectively favored.

Because we are confining ourselves to a single gene, a linear genetic map can be assumed. The crossing over contribution to ri is then given by rczi, where zi is the physical distance between the neutral and selected sites, and rc is the rate of crossing over per base pair. There is also a contribution from gene conversion, as described in the section on modeling BGS. The summation formula used in the sweep calculations assumes that every third base pair in an exon is a neutral site, with the other two being subject to selection (Campos et al. 2017). This differs from the SLiM procedure of randomly assigning selection status to exonic sites, with a probability ps of being under selection (ps = 0.7 in the simulations used here). To correct for this, the overall rate of NS substitutions in Equation 5 was adjusted by multiplying by 0.7 × 1.5.

Following Kaplan et al. (1989), Wiehe and Stephan (1993), and Kim and Stephan (2000), coalescent events caused by SSWs and coalescent events caused by neutral drift can be considered as competing exponential processes with rates S −1 and B1 −1 , respectively, on the coalescent timescale of 2Ne generations. Under the infinite sites model (Kimura 1971), the expected diversity at a neutral nucleotide site, relative to its value in the absence of selection at linked sites (θ = 4Neu), can then be written as the expected time to coalescence when time is measured in units of 2Ne generations: (7) The simulation results for synonymous site nucleotide site diversities were presented as mean values over all genes in the region simulated. Since we are modeling only a single gene, the mean of π/θ in Equation 7 over all synonymous sites in a gene should be used for comparison with the simulation results. In practice, the values obtained by substituting the mean value of S −1 across synonymous sites into Equation 7 give almost identical results, and this is used for the results described below.

The effect of sweep duration on mean coalescent time:

Equation 7 assumes that the duration of a sweep is negligible in comparison to the times between successive sweeps and to the mean neutral coalescent time 2Ne, so that sweeps can be treated as point events. However, this assumption is violated if selection is sufficiently weak. For example, with γ = 250, the deterministic component of the duration of an adaptive substitution given by Equation 3 is ∼10% of the coalescent time. Assuming that the entire time between sweep-induced coalescent events is available for neutral coalescent events can cause an underestimation of the effects of sweeps when these are sufficiently frequent.

Here, we develop an alternative approach that uses the mean diversity between successive substitutions as an estimate of the expected diversity under recurrent sweeps. This is likely to overestimate the effects of sweeps compared with the mean for randomly sampled time points, but the simulation results described below show that the resulting expressions (Equation 12) provide a good fit. We assume that adaptive substitutions occur in a gene at a constant rate ω per unit of coalescent time, given by the sum over the rates per site for the NS and UTR sites in the gene. This quantity can be found from Equation 6, by multiplying νa by 70% of the number of NS sites in a gene and νu by the number of UTR sites. We then look back in time, and evaluate the time average of the divergence of π/θ from its equilibrium value over the period since the previous substitution.

To do this, we denote the expected neutral diversity at a neutral site immediately after a substitution by π0, and the expected neutral diversity at the time of initiation of a new substitution by π1. We have: (8) where D is the probability that each member of a pair of lineages carrying the favorable mutation has failed to recombine during the substitution, conditioned on the completion of a substitution. Because the expected reduction in neutral diversity due to recurrent sweeps is S −1 , we have D = (ωS) −1 , thereby establishing the relationship between π0 and π1 (the assumption that the coalescence time for the pair of swept lineages is zero is relaxed below).

Under the infinite sites model (with θ << 1), the equilibrium diversity in the absence of sweeps is B1θ. In this case, the standard formula for the rate of approach of neutral diversity to its equilibrium value (Malécot 1969, p.40 Wiehe and Stephan 1993, Equation 6a) gives the following expression for the diversity at a time T after a substitution: (9) (the factor of B1 −1 in the exponent reflects the reduction in Ne caused by BGS, resulting in a corresponding acceleration in the rate of approach to equilibrium).

The expected diversity over the relevant period, π, is thus given by: (10) Formulae for I(ω, B1) are derived in File S1, section 5.

In the absence of any recovery of diversity during the sweep itself, Equations 8–11 yield the final expression: (12a) In the limit as ω approaches zero, ωI and A both tend to 0, and AD tends to B1S −1 . The value of π/θ for small ω is thus approximately 1/(B1 + S −1 ), corresponding to Equation 7.

To allow for a nonzero mean time Tcs to coalescence during the sweep, the postsweep diversity π0 is modified by adding DTcsθ to Equation 8, where Tcs is given by Equation S11 (this is an underestimation, since it ignores recombination during the sweep). This adds a small additional component to Equation 12a, giving: (12b) Equations 12a and 12b assume that the sample is taken in an interval between two successful sweeps. A correction can be applied to take into account the possibility that a sample is taken during a sweep this effect is expected to be small unless sweep-induced coalescents are very frequent and the time occupied by a sweep is relatively large compared with the neutral coalescent time (File S1, section S6).

Continuum approximation for effects of recurrent sweeps:

A useful approximation can be obtained by treating a gene as a continuum, following Wiehe and Stephan (1993), Coop and Ralph (2012), and Weissman and Barton (2012). We correct for the effect of introns simply by reducing the density of NS sites in the coding sequence. This is done by multiplying the density within exons by the fraction of the sites that are exons among the total length of exons, introns, and UTRs. In addition, we approximate the effect of gene conversion by writing the net recombination rate between sites separated by z base pairs as (rc + gc)z when z ≤ dg, and as rc z + g (where g = gcdg) when z > dg (Andolfatto and Nordborg 1998). The resulting expressions for sweep effects are derived in File S1, section S7. These do not include any corrections for multiple recombination events, or for the variances in the first stochastic phase and deterministic phase durations, since these make the integrations analytically intractable.

Simulation results

Effects of BGS alone:

Figure 2 shows the simulation and theoretical results for B1, the ratio of the mean synonymous site nucleotide diversity (π) to its value without selection (θ) in the absence of SSWs, using the gene model in Figure 1. Chromosomal regions containing 70 and 210 genes, with and without gene conversion at the standard rate, were modeled. The mean value of θ from simulations of neutral mutations in the absence of selection at linked sites was 0.0228, with 95% C.I. (0.0227, 0.0229), which is slightly lower than the theoretical value on the infinite sites model (0.0239), presumably due to the slight deviations from the infinite sites assumption in SLiM. The ratios of the mean simulated synonymous site diversities to 0.02283 were used for the estimates of B1. Table S1 of File S1 shows more detailed results for the autosomal case, as well as for the model of X-linked loci described in Table 1.

The effect of background selection on neutral diversity at autosomal loci. The bars show values of B1 = π/θ, where π is the mean diversity at synonymous sites and θ is the value in the absence of selection, in relation to the rate of crossing over relative to the standard value given in Table 1 (x-axes). Results with and without gene conversion under the standard parameters are displayed. Both the simulation results (red and green bars), with error bars indicating 95% C.I.s, and the theoretical predictions (blue and white bars) are shown.

Overall, there is a fairly good fit between the theoretical predictions and the simulation results, although the theoretical values of B1 are mostly slightly smaller than the simulation values, probably because intergenic sequences have been ignored. However, if the additional term in E contributed from neutral mutations that arise in repulsion from a linked deleterious mutation (Equations S1b, S5d and S5e) is ignored, the fits are much less good, especially with the larger numbers of genes. For example, with 210 genes and gene conversion, the predicted B1 values are 0.583, 0.696, 0.739, 0.762, and 0.776 for crossover rate factors of 0.5, 1, 1.5, 2, and 2.5, respectively the last value is 18% larger than when the additional term is included.

Similarly, use of a linear relationship between physical distance and map distance, which has been assumed in most theoretical models of BGS, generally gives a poorer fit to the results for the higher rates of crossing over (Table S2 of File S1), except when the number of genes and the map length of the region are both small, reflecting the effect of double crossing over in reducing the net rate of recombination between distant sites. Nonetheless, the fit is surprisingly good overall indeed, the linear map predictions often provide a better fit to the simulation results for the cases with 20 and 70 genes. The implications of these effects of the inclusion of the repulsion mutations, and the differences between the linear and Haldane maps, are considered in the Discussion.

Effects of BGS on the rate of fixation of favorable mutations:

The main goal of our work is to analyze the joint effects on neutral diversity of BGS and SSWs, and the extent to which these can be predicted by the relatively simple Equations 7 and 12. A core assumption behind these equations is that the fixation probability of a new favorable mutation is affected by BGS as though Ne is multiplied by a factor that is close to the value that applies to neutral diversity (Kim and Stephan 2000).

We have tested this assumption by comparing the mean numbers of fixations of favorable mutations observed over the last 20,000 (8N) generations of the simulations, both without BGS and with BGS. The ratio of these means provides a measure of B (B2) that can be compared to the value of B estimated from neutral diversity (B1). There are two reasons why we would not expect perfect agreement. First, a sufficiently strongly selected favorable variant could resist elimination due to its association with deleterious mutations, and instead might drag one or more of them to high frequencies or fixation (Johnson and Barton 2002 Hartfield and Otto 2011). Second, the incursion of selectively favorable mutations may perturb linked deleterious mutations away from their equilibrium, even if they do not cause their fixation.

Such Hill–Robertson interference effects (Hill and Robertson 1966 Felsenstein 1974) reduce the Ne experienced by deleterious mutations and hence their nucleotide site diversity, which is correlated with the mean number of segregating deleterious mutations. This reduction in the number of segregating deleterious mutations reduces the effects of BGS on incoming favorable mutations. For both these reasons, B1 is likely to be smaller than B2. Table S3 of File S1 provides evidence that the mean number of segregating deleterious mutations is indeed reduced by SSWs, except for the cases with no crossing over, where the rate of sweeps is greatly reduced compared with cases with crossing over.

The results for autosomal loci in Figure 3 show that BGS has a substantial effect on the rate of adaptive substitutions (Table S4 of File S1 presents more detailed results for autosomal and X-linked loci). The most extreme case is when there is no crossing over, a regime in which the efficacy of BGS is undermined by Hill–Robertson interference among the deleterious mutations, so that the assumptions underlying the BGS equations tested in the previous section are violated (McVean and Charlesworth 2000 Comeron and Kreitman 2002 Kaiser and Charlesworth 2009 Seger et al. 2010 Good et al. 2014 Hough et al. 2017). For example, B1 for 70 genes with gene conversion is 0.086, close to the value found by Kaiser and Charlesworth (2009) for a similar sized region, whereas the standard BGS prediction is 0.0004. In contrast, the B2 values for favorable NS and UTR mutations are 0.26 and 0.28, respectively, around three times greater. This still represents a massive reduction in the efficacy of selection on favorable mutations, consistent with the evidence that their rates of substitution in noncrossover regions of the Drosophila genome are much lower than elsewhere (Charlesworth and Campos 2014).

The effect of background selection (BGS) on the numbers of fixations of favorable nonsynonymous (NS) and UTR mutations at autosomal loci, in relation to the relative rate of crossing over (x-axes). The blue and red bars shows the numbers of fixations over the last 20,000 generations of the simulations (with 95% C.I.s), expressed as ratios of fixations to the number of genes simulated. Blue and red bars show the results for simulations with and without BGS, respectively. The corresponding ratios of numbers of fixations (B2) with and without BGS (green bars), and ratios of neutral diversities with and without BGS (white bars), obtained from the simulations (B1) are also shown. The standard gene conversion parameters are used.

For the other rates of crossing over, there is much closer agreement between the two estimations of B, although we always have B1 > B2. The discrepancy is largest for crossover rates of one-half the standard value, and seems to level off after the standard rate. As might be expected, it is smaller in the presence of gene conversion.

Effects of interference among favorable mutations on their rates of substitution:

With no recombination, Hill–Robertson interference among adaptive substitutions is likely to be important, and makes analytical models of substitution rates much harder to develop. The effects of such interference can be predicted using the approximate Equation 4 of Neher (2013), which is based on Equation 39 of Desai and Fisher (2007). When this is adapted for the case of diploids with semidominance with s >> Ub, the rate of substitution of favorable mutations, ω, is equal to 0.5s ln(Ns)/[ln(2Ub/s)] 2 , where s is the homozygous selection coefficient for a favorable mutation and Ub is the net mutation rate to favorable mutations for the region. Combining NS and UTR mutations (these have similar selection coefficients in our simulations), and putting s = 0.05, Ub = 0.00436, N = 2500, and ω = 0.00406, the ratio of ω to the baseline substitution rate in the absence of interference is 0.163.

The observed ratio of the rates of substitution for relative rates of crossing over of 0 and 2.5, with 70 genes, and no gene conversion and no BGS, was equal to 0.235, suggesting that the effect of interference is overpredicted by the approximation. Gene conversion increases the ratio to 0.570, so that it greatly reduces interference when crossing over is absent. BGS thus seems to play a more important role than SSWs in reducing the rate of substitution of favorable mutations when crossing over is absent, especially in the presence of gene conversion, as was suggested by Campos et al. (2014). The properties of genomic regions with very low rates of crossing over will be analyzed in more detail in a later publication.

In the absence of BGS, but with nonzero rates of crossing over, Figure 3 and Figure S4 show little effect of the crossing over rate on the rate of fixation of favorable mutations. At first sight, this suggests that there is little interference among selectively favorable mutations, with a rate of crossing over of one-half or more of the standard rate. However, there is indirect evidence for such interference effects, from estimates of the extent of underdispersion of the numbers of adaptive substitutions observed over the last 8N generations of the simulations compared with the expectation for a Poisson distribution, as described in File S1, section S8. Here, underdispersion is measured by the ratio of the variance to the mean of the number of substitutions over the period of observation (Sellers and Morris 2017).

This analysis shows that interference causes a small loss of substitutions, leading to a reduction in the extent of the reduction in diversity caused by SSWs for the cases with crossing over, with ∼5.5% of substitutions being lost due to interference. An approximate correction for interference can be made by multiplying the substitution rates for both NS and UTR mutations by the estimated proportion of substitutions that survive interference, although this ignores some of the complexities associated with the effects of interference on diversity (Kim and Stephan 2003 Chevin et al. 2008). In addition, it should be noted that the existence of underdispersion implies that the Poisson model of sweeps that is usually assumed is not exact, as pointed out by Gillespie (2001), introducing a further source of error into the predictions.

Effects of SSWs on neutral diversity:

This section is concerned with four main questions. First, to what extent does treating sweeps as point events affect the predictions of models of recurrent sweeps? Second, how well does the integral approximation for SSWs perform (Equations S24–S33) compared with the more exact summation formulae (Equations 5 and 6)? Third, how well do the competing coalescent process approximations for the joint effects of BGS and SSWs perform when the various corrections described above have been included? Finally, is less accuracy obtained by using the neutral BGS value (B1) instead of B2 in the formulae for the effect of BGS on the fixation probability of a favorable mutation?

Figure 4 presents the mean values of synonymous site diversities relative to the value in the absence of selection for simulations with 70 autosomal genes, together with the predictions for the integral and summation formulae, with and without the corrections described (the correction for interference was applied to all these cases). In the case of the corrected summation formulae, all the corrections described above were applied for the integral results, only the corrections for expected sweep duration and interference were used. More detailed results for autosomal and X-linked genes are shown in Table S6 of File S1.

The effects of selective sweeps on mean diversity at autosomal synonymous sites in relation to the relative rate of crossing over (x-axes), with and without background selection (BGS), or gene conversion at the standard rate. The red bars shows the simulation values of mean synonymous diversities relative to the values without any selection (π/θ), with 95% C.I.s. The green and blue bars are the results for the integral approximations, with and without the corrections (corr.) described in the text, respectively. The gray and white bars are results for the summation formulae (sum.), with and without the corrections.

Concerning the first point, diversities are considerably overpredicted by the uncorrected values from Equation 5 (which included the correction for interference) by up to 20%, with the lowest rate of crossing over used in Figure 4 and Table S6. This shows that treating recurrent sweeps as point events can produce significant errors, especially when crossing over is infrequent and there is no gene conversion.

For the second point, the agreement between the integral and summation results is surprisingly good overall. The largest discrepancies occur when the rate of crossing over is low, and gene conversion and BGS are absent, when they are of the order of 7.6% of the lower value.

For the third point, the agreement between the simulation means and the predictions with the corrections is generally very good, although the integral results underpredict diversity by ∼20% for the autosomal case with the lowest rate of crossing over, no gene conversion, and no BGS. If the correction for interference is not applied, lower diversities are predicted, which sometimes give better agreement with the simulation results, but the effects are not major (Table S7). The main contribution to the improvements in fit from the other corrections comes from the sweep duration, as can be seen from results where one or both of the other factors (multiple recombination events and coalescence during a sweep), as well as interference, are omitted (Table S7). Omission of the correction for coalescence during sweeps usually has the next largest effect, mainly because it reduces the contribution to coalescent time from samples taken during sweeps (section S6 of File S1). Overall, omission of all the corrections except that for sweep duration produces remarkably good results.

With respect to the fourth point above, the fits with B1 alone are good, except for the lowest rate of crossing over and no gene conversion (an error of 9% in Figure 4). Overall, it seems that relatively little is to be gained by using B2.

The predictions of the effects of SSWs use a single gene model, which assumes that the effects of sweeps with the parameters assumed here are localized to single gene regions. The simulation results with sweeps alone in regions with crossing over (File S2) show that there is no noticeable effect of the numbers of genes on the mean synonymous site diversities, consistent with this assumption. This is not surprising, given that the expected reduction in diversity at a neutral site due to a single sweep at recombination distance r is approximately γ – 4 r / s , where γ and s are the scaled and absolute selection coefficients for the favorable allele, respectively. With the values of γ and s for autosomal NS mutations assumed here (250 and 1 × 10 −4 for natural populations, respectively), an effective crossing over rate of 1 × 10 −8 , and a distance of 2000 bp between sites (the minimum for sites in separate genes), the expected reduction in diversity with no gene conversion is 250 (–0.8) = 0.01, which is essentially trivial.

This conclusion does not apply in the absence of recombination, which has been studied theoretically by Kim and Stephan (2003) and Weissman and Hallatschek (2014). In this case, the simulation results displayed in File S2 show that there is a large effect of the number of genes. With no crossing over, gene conversion or BGS, the mean autosomal diversities relative to neutral expectation were 0.0819, 0.0700, and 0.0675 for 70, 140, and 210 genes, respectively. These results can be compared to the predictions from the approximate Equation 5 of Weissman and Hallatschek (2014), modified for diploidy with semidominance, which gives the absolute neutral nucleotide diversity with recurrent sweeps as 8μ ln[2ln(γ)/Ub]/s. The resulting predicted values are 0.195, 0.183, and 0.176, respectively.

As was also found by Weissman and Hallatschek (2014), the theoretical results considerably overpredict diversity. Gene conversion greatly reduces the effects of sweeps, with relative diversities of 0.130, 0.090, and 0.0832 in the absence of BGS. BGS has a much greater effect on diversity than sweeps when crossing over is absent. With gene conversion, it gives relative diversity values of 0.0867, 0.0429, and 0.0293 for 70, 140, and 210 genes, respectively. Essentially the same values are seen with both BGS and SSWs, reflecting the fact that the rate of sweeps is greatly reduced in the presence of BGS (see Figure 3). The predicted relative diversity value for a 70-gene region with no crossing over is quite close that observed for the fourth chromosome of D. melanogaster, which has a similar number of genes (Campos et al. 2014), suggesting that diversity in noncrossover regions of the genome is strongly influenced by BGS, as was also inferred by Hough et al. (2017) for the case of the newly evolved Y chromosome of Rumex.

Results and Discussion

Effects of SNP Numbers, Density, and Genome Size on Inference under Neutral Equilibrium

The accuracy and performance of demographic inference were evaluated using two popular methods, MSMC ( Schiffels and Durbin 2014) and fastsimcoal2 ( Excoffier et al. 2013). In order to assess performance, it was first necessary to determine how much genomic information is required to make accurate inference when the assumptions of neutrality are met. Chromosomal segments of varying sizes (1 Mb, 10 Mb, 50 Mb, 200 Mb, and 1 Gb) were simulated under neutrality and demographic equilibrium (i.e., a constant population size of 5,000 diploid individuals) with 100 independent replicates each. For each replicate, this amounted to the mean [SD] number of segregating sites for each diploid individual being 1,944 [283], 9,996 [418], 40,046 [957], and 200,245 [1,887] for 50 diploid individuals, these values were 10,354 [225], 51,863 [567], 207,118 [1,139], and 1,035,393 [2,476] for 10 Mb, 50 Mb, 200 Mb, and 1 Gb, respectively. Use of MSMC resulted in incorrect inferences for all segments smaller than 1 Gb ( supplementary figs. 1 and 2 , Supplementary Material online). Specifically, very strong recent growth was inferred instead of demographic equilibrium, although ancestral population sizes were correctly estimated. In addition, when two or four diploid genomes were used for inference, MSMC again inferred a recent many-fold growth for all segment sizes even when the true model was equilibrium, but performed well when using one diploid genome with large segments ( supplementary figs. 1 and 2 , Supplementary Material online). These results suggest caution when performing inference with MSMC on smaller regions or genomes, specifically when the number of SNPs is less than ∼200,000 per single diploid individual. Extra caution should be used when interpreting population size changes inferred by MSMC when using more than one diploid individual.

When using fastsimcoal2 to perform demographic inference, parameters were accurately estimated for all chromosomal segment sizes when the correct model (i.e., equilibrium) was specified ( supplementary table 1 , Supplementary Material online). However, when model selection was performed using a choice of four models (equilibrium, instantaneous size change, exponential size change, and instantaneous bottleneck), the correct model was chosen more often (∼30% of replicates) when the simulated chromosome sizes were small (1 and 10 Mb), whereas an alternative model of either instantaneous size change or instant bottleneck was increasingly preferred for larger regions ( supplementary tables 2 and 3 , Supplementary Material online), although the estimates of ancestral sizes were correct. This finding suggests that the nonindependence of SNPs may result in model mis-identification. Indeed, since the model choice procedure assumes that SNPs are independent, the true number of independent SNPs is overestimated, which results in an overestimation in the confidence of the model choice with an increasing amount of data. However, it is interesting to note that the parameter values underlying the non-constant size preferred model were often pointing towards a constant population size (see below). When model selection was performed using sparser SNP densities (i.e., 1 SNP per 5, 50, or 100 kb), the correct model was recovered for longer chromosomes up to 200 Mb ( supplementary tables 2 and 3 and figs. 3 and 4, Supplementary Material online), although model selection was slightly less accurate for smaller chromosomes due to the decrease in the total amount of data. As suspected, the biases introduced by the nonindependence of SNPs were found to be concordant with the level of linkage disequilibrium among SNPs used for the analysis (for ten SNP windows, in which SNPs were separated by 50 kb [100 kb], mean r 2 = 0.027 (0.020), compared with the all-SNP mean r 2 of 0.118, and to the completely unlinked SNPs mean r 2 of 0.010 supplementary table 4 , Supplementary Material online). Additionally, AIC performed on partially linked SNPs may impose an insufficient penalty on a larger number of parameters, resulting in an undesirable preference for parameter-rich models. We found that implementing a more severe penalty improved inference considerably, even for 1-Gb chromosome sizes ( supplementary tables 5 and 6 , Supplementary Material online). This model selection performance, the potential corrections related to increased penalties, as well as the total number of SNPs and SNP thinning, should be investigated on a case-by-case basis in empirical applications, owing to the contribution of multiple underlying parameters (e.g., chromosome length, recombination rates, and SNP densities).

In the light of this performance assessment, all further analyses were restricted to characterizing demographic inference on data that far exceeded 1 Gb and roughly matched the structure and size of the human genome—for every diploid individual, 22 chromosomes (autosomes) of size 150 Mb each were simulated, which amounted to roughly 3 Gb of total sequence. Ten independent replicates of each parameter combination were performed throughout, and inference utilized one and fifty diploid individuals for MSMC and fastsimcoal2, respectively.

Effect of the Strength of Purifying Selection on Demographic Inference

In order to test the accuracy of demographic inference in the presence of BGS, all 22 chromosomes were simulated with exons of size 350 bp each, with varying sizes of introns and intergenic regions (see Materials and Methods) in order to vary the fraction (5%, 10%, and 20%) of the genome under selection. Because the strength of selection acting on deleterious mutations affects the distance over which the effects of BGS extend, demographic inference was evaluated for various DFEs ( table 1). The DFE was modeled as a discrete distribution with four fixed classes: 0 ≤ 2 N anc s < 1 ⁠ , 1 ≤ 2 N anc s < 10 ⁠ , 10 ≤ 2 N anc s < 100 , and 100 ≤ 2 N anc s < 2 N anc ⁠ , where N anc is the ancestral effective population size and s is the reduction in the fitness of the homozygous mutant relative to wildtype. The fitness effects of mutations were uniformly distributed within each bin, and assumed to be semidominant, following a multiplicative fitness model for multiple loci the DFE shape was altered by varying the proportion of mutations belonging to each class, given by f0, f1, f2, and f3, respectively (see Materials and Methods). Three DFEs highly skewed towards a particular class were initially used to assess the impact of the strength of selection on demographic inference (with the remaining mutations equally distributed among the other three classes): DFE1: a DFE in which 70% of mutations have weakly deleterious fitness effects (i.e., f1 = 0.7) DFE2: a DFE in which 70% of mutations have moderately deleterious fitness effects (i.e., f2 = 0.7) and DFE3: a DFE in which 70% of mutations have strongly deleterious fitness effects (i.e., f3 = 0.7). A DFE with equal proportions of all deleterious classes (i.e., DFE4: f 0 = f 1 = f 2 = f 3 = 0.25 ⁠ ) was also simulated to evaluate the combined effect of different selective strengths. In addition, two bimodal DFEs consisting of only the neutral and the strongly deleterious class of mutations were simulated to characterize the role of strongly deleterious mutations (DFE5: a DFE in which 50% of mutations have strongly deleterious effects (i.e., f3 = 0.5) with the remaining being neutral and DFE6: a DFE in which 30% of mutations were strongly deleterious (i.e., f3 = 0.3) with the remaining being neutral).

Proportion ( ⁠ f i ⁠ ) of Mutations in Each Class of the Discrete Distribution of Fitness Effects (DFE) Simulated in This Study.

Looking Back at Our Ancestors

About 10,000 years ago, our human ancestors learned the art of agriculture and subsequently started to domesticate animals. The domestication of cows in Europe allowed these people to use cow's milk for nutrition. Over time, those individuals who had the allele to make lactase possessed the favorable trait over those who could not digest the cow's milk.

A selective sweep occurred for the Europeans and the ability to get nutrition from milk and milk products was highly positively selected. Therefore, the majority of Europeans possessed the ability to make lactase. Other genes hitchhiked along with this selection. In fact, researchers estimate that about a million base pairs of DNA hitchhiked along with the sequence that coded for the lactase enzyme.


The neutral theory of molecular evolution predicts that the amount of neutral polymorphisms within a species will increase proportionally with the census population size (Nc). However, this prediction has not been borne out in practice: while the range of Nc spans many orders of magnitude, levels of genetic diversity within species fall in a comparatively narrow range. Although theoretical arguments have invoked the increased efficacy of natural selection in larger populations to explain this discrepancy, few direct empirical tests of this hypothesis have been conducted. In this work, we provide a direct test of this hypothesis using population genomic data from a wide range of taxonomically diverse species. To do this, we relied on the fact that the impact of natural selection on linked neutral diversity depends on the local recombinational environment. In regions of relatively low recombination, selected variants affect more neutral sites through linkage, and the resulting correlation between recombination and polymorphism allows a quantitative assessment of the magnitude of the impact of selection on linked neutral diversity. By comparing whole genome polymorphism data and genetic maps using a coalescent modeling framework, we estimate the degree to which natural selection reduces linked neutral diversity for 40 species of obligately sexual eukaryotes. We then show that the magnitude of the impact of natural selection is positively correlated with Nc, based on body size and species range as proxies for census population size. These results demonstrate that natural selection removes more variation at linked neutral sites in species with large Nc than those with small Nc and provides direct empirical evidence that natural selection constrains levels of neutral genetic diversity across many species. This implies that natural selection may provide an explanation for this longstanding paradox of population genetics.


Overinterpretation of results and storytelling

Identifying genomic regions that have undergone recent and strong positive selection is an important challenge of modern evolutionary biology. Neutral evolutionary processes, such as random genetic drift enhanced by population size changes and/or gene flow, increase the rate of false positives and make it more challenging to detect genomic regions which have been targeted by positive selection. Frequently, additional validity of results is provided by the fact that loci identified by selective sweep scans ‘make sense’. Pavlidis et al. [87] showed that such an approach of perceiving an increased validity of results, simply because they make sense can be dramatically misleading. They designed a simple simulation experiment, in which a neutrally evolved X-chromosome of D. melanogaster is scanned for selective sweeps. Then, they performed a literature mining for the (by definition false positive) identified selective sweep targets. They showed that by means of gene ontology it would make perfect sense to identify such targets even though they are false positives. The study by Pavlidis et al. [87] showed that interpretation of the results should be treated very carefully and overinterpretation should be avoided.

Combining methods to decrease the false positive rate

To increase the validity of selective sweep scans, analyses typically consist of a multitude of neutrality tests. The rationale is that ‘the more tests agree on an outcome, e.g., selection, the more plausible this outcome is’. The problem with this, however, is that the outcome of different neutrality tests are usually correlated, since they depend profoundly on the underlying coalescent tree. Consider a neutrally evolved genomic region that is characterized by an exceptional ‘sweep-like’ collection of coalescent trees. Several neutrality tests will give a good signal for a selective sweep in this region. For instance, assume a set of unbalanced trees, such as those shown in Fig. 6, where all lineages except for one coalesce relatively fast on one side of the tree. Tajima’s D assumes extreme values because of the skewed SFS. The same is true for SweeD and SweepFinder. Furthermore, since the tree is unbalanced with long internal branches, LD is increased locally. The number of polymorphic sites might be reduced since the total tree length is reduced. Thus, independently applying several neutrality tests and then showing that several of them reject neutrality (or showing only those that reject neutrality) should be avoided. A better practice is to combine the tests in a unified framework and not independently. For example, [55, 88, 89] used supervised learning algorithms and several neutrality tests (variables) to classify genomic regions as either neutral or selected. Any correlation between the variables is incorporated implicitly in the learning algorithms and does not affect the accuracy of the classifier. Since, however, a large number of simulations is typically required for the execution of the learning algorithms, the running time of such approaches increases considerably.

An unbalanced genealogy with several short external branches can generate extreme values for a multitude of neutrality tests

The need for high performance

Driven by the advent of DNA sequencing, several projects have focused on sequencing whole genomes from various species in the past years. This has led to the discovery of thousands of new SNPs and the availability of a plethora of datasets that are suitable for population genetics analyses. As more genomes are being sequenced, contributing to the increasing dataset sizes, the computational demands for the respective analyses increase as well. This poses a challenge to existing and future software tools as High Performance Computing (HPC) techniques are becoming a prerequisite for conducting large-scale analyses.

Reducing execution times and enabling processing of large-scale datasets on limited hardware resources, such as off-the-shelf workstations, requires source codes to abide by several basic HPC principles. For instance, understanding how memory accesses affect performance, or which scheduling/communication strategy among multiple cores is the most efficient for a particular task, can substantially reduce execution times by allowing the software to utilize the hardware resources in current x 86 processors in the most effective way. With Moore’s law being continued in the form of an increasing number of cores per processor and an increasing width for vector registers Footnote 1 , not employing multithreading Footnote 2 and/or vector intrinsic instructions in newly developed tools can lead to significant underutilization of processors.

However, although optimization techniques such as kernel vectorization have the potential to accelerate processing, the nature of operations and the computational demands of the target task for performance improvement need to be carefully examined. For instance, a recent study [90] revealed that in order to achieve high-performance for large-scale LD computations that comprise thousands of sequences and SNPs, vector intrinsics must be avoided. This is due to the fact that the computational bottleneck in LD-based analyses for large sample sizes is the enumeration of ancestral and derived alleles in SNPs. This operation is efficiently implemented via the use of an intrinsic population count command, which however operates only on regular registers, i.e., 32- or 64-bit words. Deploying vector intrinsics for LD leads to poorer performance due to increased data preparation times (storing and retrieving words in vector registers).

In addition to software-level optimizations for faster completion of bioinformatics analyses, a variety of hardware-accelerated solutions have also been proposed in the previous years. Hardware platforms, such as Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs), have been widely targeted for the acceleration of large-scale analyses, and a variety of bioinformatics algorithms have been successfully ported on these architectures, from sequence alignment kernels [91] and phylogenetic tree scoring functions [92, 93] to large-scale LD computations [90] and epistasis detection in Genome Wide Association Studies [94].

Why do species get a thin slice of π? Revisiting Lewontin’s Paradox of Variation

Under neutral theory, the level of polymorphism in an equilibrium population is expected to increase with population size. However, observed levels of diversity across metazoans vary only two orders of magnitude, while census population sizes (Nc) are expected to vary over several. This unexpectedly narrow range of diversity is a longstanding enigma in evolutionary genetics known as Lewontin’s Paradox of Variation (1974). Since Lewontin’s observation, it has been argued that selection constrains diversity across species, yet tests of this hypothesis seem to fall short of explaining the orders-of-magnitude reduction in diversity observed in nature. In this work, I revisit Lewontin’s Paradox and assess whether current models of linked selection are likely to constrain diversity to this extent. To quantify the discrepancy between pairwise diversity and census population sizes across species, I combine genetic data from 172 metazoan taxa with estimates of census sizes from geographic occurrence data and population densities estimated from body mass. Next, I fit the relationship between previously-published estimates of genomic diversity and these approximate census sizes to quantify Lewontin’s Paradox. While previous across-taxa population genetic studies have avoided accounting for phylogenetic non-independence, I use phylogenetic comparative methods to investigate the diversity census size relationship, estimate phylogenetic signal, and explore how diversity changes along the phylogeny. I consider whether the reduction in diversity predicted by models of recurrent hitch-hiking and background selection could explain the observed pattern of diversity across species. Since the impact of linked selection is mediated by recombination map length, I also investigate how map lengths vary with census sizes. I find species with large census sizes have shorter map lengths, leading these species to experience greater reductions in diversity due to linked selection. Even after using high estimates of the strength of sweeps and background selection, I find linked selection likely cannot explain the shortfall between predicted and observed diversity levels across metazoan species. Furthermore, the predicted diversity under linked selection does not fit the observed diversity–census-size relationship, implying that processes other than background selection and recurrent hitchhiking must be limiting diversity.


Many of the ideas for this review were first formulated from discussions between co-authors during our symposium on ‘The genomic landscape of speciation’ at ESEB 2015, Lausanne, Switzerland. We were kindly sponsored by Floragenex, Oregon, USA, and also by Stab Vida, Portugal. We are grateful to Mike Ritchie, Jeffrey Feder and an anonymous reviewer for their comments on an earlier draft of this manuscript. Mark Ravinet was funded by a JSPS Postdoctoral Fellowship for Foreign Researchers and by the Norwegian Research Council. RF was funded by FCT under the Programa Operacional Potencial Humano – Quadro de Referência Estratégico Nacional from the European Social Fund and the Portuguese Ministério da Educação e Ciência (SFRH/BPD/89313/2012, PTDC/BIA-EVF/113805/2009 and FCOMP-01-0124-FEDER-014272) as well as by the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 706376. JG was funded by a postdoctoral fellowship from Xunta de Galicia (Modalidade B). AMW and RKB are funded by NERC. BM and MRaf are supported by the Centre for Marine Evolutionary Biology, University of Gothenburg, Sweden. MRaf is additionally supported by the Adlerbert Research Foundation. NB is funded by ANR (HYSEA project, ANR-12-BSV7-0011).

Appendix S2 Parameter choices.

Please note: The publisher is not responsible for the content or functionality of any supporting information supplied by the authors. Any queries (other than missing content) should be directed to the corresponding author for the article.

Watch the video: Background selection and patterns of molecular evolution and variation, part 1 of 2 (August 2022).