# Understanding F-statistics in population genetics


I am reading the classic Weir and Cockerham 1984 paper about \$F_{ST}\$ estimation. At the beginning (first page, right column), they define 3 statistics.

• \$F\$ is the correlation of genes within individuals ("inbreeding")

• \$\theta\$ is the correlation of genes of different individuals in the same population ("coancestry")

• \$f\$ is the correlation of genes within individuals within populations.

They also state that the 3 statistics are related by

\$\$f = (F-\theta)(1-\theta)\$\$

I don't quite understand those 3 statistics and especially I don't understand why this relationship holds true. Can you help me with that?

I am a bit wobbly on the subject, but I think the most important bit of information is that they are re-parametrising Wright's (1951) hierarchical analysis of variation, known as "F-statistics," "hierarchical partitioning of variation," or "population parameters," depending on whom you ask. The parameters correspond as follows (at the bottom of p. 1358): Fit = F, Fis = f, Fst = θ.

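The relationship arises given some assumptions. Crucially here, if Fis (or f) is a measure of departure from Hardy-Weinberg proportions, and all populations depart from them identically, then Fit = 1 - Hi/Ht, where Hi is the observed heterozygosity of individuals and Ht is the expected heterozygosity of the total population. It follows that 1 - Fit = Hi/Ht. Introducing Hs, the expected heterozygosity within subpopulations, we can rewrite this as Hi/Ht = (Hi/Hs)(Hs/Ht).

Since 1 - Fis = Hi/Hs and 1 - Fst = Hs/Ht, you can (maybe) see that 1 - Fit = (1 - Fis)(1 - Fst). Substituting, 1 - F = (1 - f)(1 - θ).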

(I realise that this is not a complete answer, but you can rearrange it with some algebra to get the Weir&Cockerham equation, I think).

[Update Oct 25, 2016]: it eventually yields f = (F-θ)/(1-θ). I think the posted question (above) contains a typo--specifically a missing division operator. Perhaps someone missed the stroke on a typewriter in the original paper?
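For anyone who wants to check the algebra numerically, here is a minimal sketch. It assumes the heterozygosity-based definitions used in the answer above (Fis = 1 - Hi/Hs, Fst = 1 - Hs/Ht, Fit = 1 - Hi/Ht); the function name and the three input values are made up for illustration and are not taken from Weir and Cockerham.

```python
def f_statistics(h_i, h_s, h_t):
    """Return (f, theta, F) from three heterozygosities.

    h_i: observed heterozygosity of individuals
    h_s: expected heterozygosity within subpopulations
    h_t: expected heterozygosity in the total population
    """
    f_is = 1.0 - h_i / h_s   # f, within-population inbreeding
    f_st = 1.0 - h_s / h_t   # theta, coancestry
    f_it = 1.0 - h_i / h_t   # F, total inbreeding
    return f_is, f_st, f_it


# Hypothetical heterozygosities, chosen only for illustration.
f, theta, F = f_statistics(0.30, 0.40, 0.50)

# The multiplicative identity and its rearranged form (with the division).
lhs = 1.0 - F
rhs = (1.0 - f) * (1.0 - theta)
f_from_identity = (F - theta) / (1.0 - theta)

print(f, theta, F)
print(abs(lhs - rhs) < 1e-12, abs(f - f_from_identity) < 1e-12)
```

With these inputs the rearranged form recovers f only when the division by (1 - θ) is present, which supports reading the equation in the question as a typo.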

## What Use Is Population Genetics?

The Genetics Society of America’s Thomas Hunt Morgan Medal is awarded to an individual GSA member for lifetime achievement in the field of genetics. For over 40 years, 2015 recipient Brian Charlesworth has been a leader in both theoretical and empirical evolutionary genetics, making substantial contributions to our understanding of how evolution acts on genetic variation. Some of the areas in which Charlesworth’s research has been most influential are the evolution of sex chromosomes, transposable elements, deleterious mutations, sexual reproduction, and life history. He also developed the influential theory of background selection, whereby the recurrent elimination of deleterious mutations reduces variation at linked sites, providing a general explanation for the correlation between recombination rate and genetic variation.

I am grateful to the Genetics Society of America for honoring me with the Thomas Hunt Morgan Medal and for inviting me to contribute this essay. I have spent nearly 50 years doing research in population genetics. This branch of genetics uses knowledge of the rules of inheritance to predict how the genetic composition of a population will change under the forces of evolution and compares the predictions to relevant data. As our knowledge of how genomes are organized and function has increased, so has the range of problems confronted by population geneticists. We are, however, a relatively small part of the genetics community, and sometimes it seems that our field is regarded as less important than those branches of genetics concerned with the properties of cells and individual organisms.

I will take this opportunity to explain why I believe that population genetics is useful to a broad range of biologists. The fundamental importance of population genetics is the basic insights it provides into the mechanisms of evolution, some of which are far from intuitively obvious. Many of these insights came from the work of the first generation of population geneticists, notably Fisher, Haldane, and Wright. Their mathematical models showed that, contrary to what was believed by the majority of biologists in the 1920s, natural selection operating on Mendelian variation can cause evolutionary change at rates sufficient to explain historical patterns of evolution. This led to the modern synthesis of evolution (Provine 1971). No one can claim to understand how evolution works without some basic understanding of classical population genetics; those who do run the risk of making mistakes such as asserting that rapid evolutionary change is most likely to occur in small founder populations (Mayr 1954).


The modern synthesis is getting on for 80 years old, so this argument will probably not convince skeptical molecular geneticists that population genetics has a lot to offer the modern biologist. I provide two examples of the useful role that population genetic studies can play. First, one of the most notable discoveries of the past 40 years was the finding that the genomes of most species contain families of transposable elements (TEs) with the capacity to make new copies that insert elsewhere in the genome (Shapiro 1983). This led to two schools of thought about why they are present in the genome. One claimed that TEs are maintained because they confer benefits on the host by producing adaptively useful mutations (Syvanen 1984); the other believed that they are parasites, maintained by their ability to replicate within the genome despite potentially deleterious fitness effects of TE insertions (Doolittle and Sapienza 1980; Orgel and Crick 1980).

The second hypothesis can be tested by comparing population genetic predictions with the results of TE surveys within populations. In the early 1980s, Chuck Langley, myself and several collaborators tried to do just this, using populations of Drosophila melanogaster (Charlesworth and Langley 1989). The models predicted that most Drosophila TEs should be found at low population frequencies at their insertion sites. This is so because D. melanogaster populations have large effective sizes (Ne). Ne is essentially the number of individuals that genetically contribute to the next generation. Large Ne means that a very small selection pressure can keep deleterious elements at low frequencies. This is a consequence of one of the most important findings of classical population genetics: the fate of a variant in a population is determined by the product of Ne and the strength of selection (Fisher 1930; Kimura 1962). If, for example, Ne is 1000, a mutation that reduces fitness relative to wild type by 0.001 will be eliminated from the population with near certainty.
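The dependence on the product of Ne and the selection coefficient can be illustrated with the standard diffusion approximation for the fixation probability of a new mutation. The sketch below is not taken from the essay: it assumes a semidominant mutation with selection coefficient s, a single initial copy, and a census size equal to Ne, and the function name is hypothetical.

```python
import math


def fixation_probability(ne, s):
    """Diffusion approximation for the fixation probability of a new mutation.

    Assumptions (illustrative only): semidominant selection coefficient s,
    one initial copy, and census size equal to the effective size ne.
    """
    p0 = 1.0 / (2.0 * ne)          # initial frequency of a single new copy
    if abs(s) < 1e-12:
        return p0                  # neutral limit
    # u = (1 - exp(-4*Ne*s*p0)) / (1 - exp(-4*Ne*s)), written with expm1
    # so that very small exponents do not lose precision.
    return math.expm1(-4.0 * ne * s * p0) / math.expm1(-4.0 * ne * s)


if __name__ == "__main__":
    ne = 1000
    neutral = fixation_probability(ne, 0.0)
    for s in (-0.01, -0.001, -0.0001, 0.0, 0.001):
        u = fixation_probability(ne, s)
        print(f"Ne*s = {ne * s:+.1f}  u = {u:.3e}  relative to neutral = {u / neutral:.3f}")
```

Holding Ne * s constant while changing Ne and s separately leaves the fixation probability relative to neutrality almost unchanged, which is the sense in which the product governs the outcome. Very large values of 4 * Ne * |s| would overflow the exponential in this simple form.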

Using the crude tools then available (restriction mapping of cloned genomic regions and in situ hybridization of labeled TE probes to polytene chromosomes), we found that nearly all TEs are indeed present at low frequencies in the population (Charlesworth and Langley 1989). Most of the exceptions to this rule were found in genomic regions in which little crossing over occurs (Maside et al. 2005). This is consistent with Chuck’s proposal that a major contributor to the removal of TEs from the population is selection against aneuploid progeny created by crossing over among homologous TEs at different locations in the genome (Langley et al. 1988). It is now a familiar finding that nonrecombining genomes or genomic regions tend to be full of TEs and other kinds of repetitive sequences; the population genetic reasons for this, discussed by Charlesworth et al. (1994), are perhaps not so familiar.

Modern genomic methods provide much more powerful means for identifying TE insertions. Recent population surveys using these methods have confirmed the older findings: most TEs in Drosophila are present at low frequencies, and there is statistical evidence for selection against insertions (Barron et al. 2014). This is consistent with the existence of elaborate molecular mechanisms for repressing TE activity, such as the Piwi-interacting RNA (piRNA) pathway of animals (Senti and Brennecke 2010); there would be no reason to evolve such mechanisms if TEs were harmless. In a few cases, TEs have swept to high frequencies or fixation, and there is convincing evidence that at least some of these events are associated with increased fitness caused by the TE insertions themselves (Barron et al. 2014). These cases do not contradict the intragenomic parasite hypothesis for the maintenance of TEs; favorable mutations induced by TEs are too rare to outweigh the elimination of deleterious insertions unless new insertions continually replace those that are lost.

From the theory of aging, to the degeneration of Y chromosomes, to the dynamics of transposable elements, our understanding of the genetic basis of evolution is deeper and richer as a result of Charlesworth’s many contributions to the field. —Charles Langley, University of California, Davis

My other example is a population genetics discovery about a fundamental biological process: the PRDM9 protein involved in establishing recombination hot spots in humans. This was enabled by the revolution in population genetics brought about by coalescence theory (Hudson 1990), which is a powerful tool for looking at the statistical properties of a sample from a population under the hypothesis of selective neutrality. The basic idea is simple: if we sample two homologous, nonrecombining haploid genomes (e.g., mitochondrial DNA) from a large population, there is a probability of 1/(2Ne) that they are derived from the same parental genome in the preceding generation; i.e., they coalesce (Ne is the effective population size for the genome region in question). If they fail to coalesce in that generation, there is a probability of 1/(2Ne) that they coalesce one generation further back, and so on. If n genomes are sampled, there is a bifurcating tree connecting them back to their common ancestor. The size and shape of this tree are highly random, so genetically independent components of the genome experience different trees, even if they share the same Ne. The properties of sequence variability in the sample can be modeled by throwing mutations at random onto the tree (Hudson 1990).
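The generation-by-generation argument for two sampled genomes can be sketched as a small simulation. This is only an illustration of the verbal description above, using the per-generation coalescence probability 1/(2Ne) as stated; the function names, the value of Ne, and the number of replicates are arbitrary choices.

```python
import random


def coalescence_time(ne, rng):
    """Generations back until two lineages share a parent.

    Each generation the pair coalesces with probability 1/(2*ne),
    so the waiting time is geometrically distributed.
    """
    p = 1.0 / (2.0 * ne)
    generations = 1
    while rng.random() >= p:   # failed to coalesce, go one generation further back
        generations += 1
    return generations


def mean_coalescence_time(ne, replicates=10000, seed=1):
    """Average waiting time over independent replicates (independent loci)."""
    rng = random.Random(seed)
    total = sum(coalescence_time(ne, rng) for _ in range(replicates))
    return total / replicates


if __name__ == "__main__":
    ne = 50
    print("simulated mean:", mean_coalescence_time(ne))
    print("expected mean (2 * Ne):", 2 * ne)
```

The individual waiting times vary widely around their mean of 2Ne generations, which is one way to see why independent parts of the genome have very different trees even when they share the same Ne.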

Recombination causes different sites in the genome to experience different trees, but closely linked sites have much more similar trees than independent sites. At the level of sequence variability, close linkage results in nonrandom associations between neutral variants, known as linkage disequilibrium (LD). The extent of LD among neutral variants at different sites is determined by the product of Ne and the frequency of recombination between them, c (Ohta and Kimura 1971; McVean 2002). Richard Hudson proposed a statistical method for estimating Nec from data on variants at multiple sites across the genome (Hudson 2001) that was implemented in a widely used computer program LDhat by Gil McVean and colleagues (McVean et al. 2002). Applications to large data sets on human sequence variability showed that the genome is full of recombination hot spots and cold spots, consistent with previous molecular genetic studies of specific loci (Myers et al. 2005). Most recombination occurs in hot spots and very little in between them, accounting for the fact that there is almost complete LD over tens or even hundreds of kilobases in humans. The identification of a large number of hot spots led to the discovery of a sequence motif bound by a zinc finger protein, PRDM9, at about the same time that mouse geneticists also discovered that PRDM9 promotes recombination (McVean and Myers 2010; Baudat et al. 2014). These discoveries have led to many interesting observations, such as associations between PRDM9 variants in humans and individual variation in recombination rates, generating an ongoing research program of great scientific interest (Baudat et al. 2014).

With the ever-increasing use of genomic data, I am confident that many more such fruitful interactions between molecular and population genetics will take place. A take-home message is that more needs to be done to integrate training in population, molecular, and computational approaches to provide the next generation of researchers with the broad range of knowledge they will need.

## Molecular systematics and population genetics of biological invasions: towards a better understanding of invasive species management

Dr J. Le Roux, Centre of Excellence for Invasion Biology, University of Stellenbosch, Natural Sciences Building, Matieland 7602, South Africa. Email: [email protected]

Department of Tropical Plant and Soil Sciences, University of Hawaii at Manoa, Honolulu, Hawaii, USA

Present address: Centre of Excellence for Invasion Biology, University of Stellenbosch, Natural Sciences Building, Matieland 7602, South Africa


### Abstract

The study of population genetics of invasive species offers opportunities to investigate rapid evolutionary processes at work, and while the ecology of biological invasions has enjoyed extensive attention in the past, the recentness of molecular techniques makes their application in invasion ecology a fairly new approach. Despite this, molecular biology has already proved powerful in inferring aspects relevant not only to the evolutionary biologist but also to those concerned with invasive species management. Here, we review the different molecular markers routinely used in such studies and their application(s) in addressing different questions in invasion ecology. We then review the current literature on molecular genetic studies aimed at improving management and the understanding of invasive species by resolving taxonomic issues, elucidating geographical sources of invaders, detecting hybridisation and introgression, tracking dispersal and spread, and assessing the importance of genetic diversity in invasion success. Finally, we make some suggestions for future research efforts in molecular ecology of biological invasions.

## PhD positions in Population Genetics

### Job description

Over the past years, Vienna has developed into one of the leading centres of population genetics. The Vienna Graduate School of Population Genetics has been founded to provide a training opportunity for PhD students to build on this excellent on-site expertise.

We invite applications from highly motivated and outstanding students with a love for evolutionary research and a background in one of the following disciplines: evolutionary genetics, functional genetics, theoretical or experimental population genetics, bioinformatics, mathematics, statistics.

• Evolution from de novo mutations - influence of elevated mutation rates.
• Evolution of sex-specific neuronal signaling.
• Genome evolution in columbines.
• Inference of selection signatures from time-series data.
• Long-term dynamics of local Drosophila populations.
• Molecular genetics of epigenetics.
• Seed ecology.
• Structural variation and genome evolution.

Only complete applications (application form, CV, motivation letter, university certificates, indication of the two preferred topics in a single pdf) received by May 16, 2021 will be considered. Two letters of recommendation need to be sent directly by the referees.

Depending on the project, PhD degrees will be awarded either in genetics, mathematics or statistics. PhD students will receive a monthly salary based on currently 2.237,60 before tax according to the regulations of the Austrian Science Fund (FWF).

## Population Dynamics and Regulation

The logistic model of population growth, while valid in many natural populations and a useful model, is a simplification of real-world population dynamics. Implicit in the model is that the carrying capacity of the environment does not change, which is not the case. The carrying capacity varies annually. For example, some summers are hot and dry whereas others are cold and wet; in many areas, the carrying capacity during the winter is much lower than it is during the summer. Also, natural events such as earthquakes, volcanoes, and fires can alter an environment and hence its carrying capacity. Additionally, populations do not usually exist in isolation. They share the environment with other species, competing with them for the same resources (interspecific competition). These factors are also important to understanding how a specific population will grow.
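The effect of a changing carrying capacity can be sketched with a discrete-time version of the logistic model in which K is supplied as a function of time. This is a simplified illustration, not a model from the text: the update rule, the growth rate, and the alternating "summer" and "winter" capacities are all assumed values.

```python
def simulate_logistic(n0, r, carrying_capacity, steps):
    """Discrete-time logistic growth with a time-dependent carrying capacity.

    n0: initial population size
    r: per-step intrinsic growth rate
    carrying_capacity: function mapping the time step t to K(t)
    steps: number of time steps to simulate
    """
    n = n0
    trajectory = [n]
    for t in range(steps):
        k = carrying_capacity(t)
        n = n + r * n * (1.0 - n / k)   # logistic increment for this step
        n = max(n, 0.0)                 # population size cannot go negative
        trajectory.append(n)
    return trajectory


# Constant environment: the population settles at K.
constant = simulate_logistic(10.0, 0.3, lambda t: 500.0, 100)

# Alternating environment (hypothetical values): K is 500 in even steps
# ("summer") and 300 in odd steps ("winter").
seasonal = simulate_logistic(10.0, 0.3, lambda t: 500.0 if t % 2 == 0 else 300.0, 100)

print(round(constant[-1], 2))
print([round(x, 1) for x in seasonal[-4:]])
```

In the constant case the population converges on the single carrying capacity, while in the alternating case it never settles and keeps tracking a moving target between the two values.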

Population growth is regulated in a variety of ways. These are grouped into density-dependent factors, in which the density of the population affects growth rate and mortality, and density-independent factors, which cause mortality in a population regardless of population density. Wildlife biologists, in particular, want to understand both types because this helps them manage populations and prevent extinction or overpopulation.

## Insight into rare and common diseases from natural selection

Genes associated with Mendelian or complex diseases would be expected to be subject to unequal selective pressures. We can therefore use selection signatures to predict the involvement of genes in human disease [11, 12, 32, 37, 115, 163]. Mendelian disorders are typically severe, compromising survival and reproduction, and are caused by highly penetrant, rare deleterious mutations. Mendelian disease genes should therefore fit the mutation–selection balance model, with an equilibrium between the rate of mutation and the rate of risk allele removal by purifying selection [12]. The use of population genetics models is less straightforward when it comes to predicting the genes involved in complex disease risk. Models of adaptive evolution based on positive or balancing selection apply to a few Mendelian traits or disorders, most notably, but not exclusively, those related to malaria resistance (reviewed in [76, 98]). However, the complex patterns of inheritance observed for common diseases, including incomplete penetrance, late onset and gene-by-environment interactions, make it more difficult to decipher the connection between disease risk and fitness [12].
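The mutation–selection balance mentioned above can be made concrete with the standard textbook approximations for the equilibrium frequency of a deleterious allele. The sketch below is an illustration under stated assumptions, not a method from the cited studies: it assumes an infinite, randomly mating population, a mutation rate much smaller than the selection against carriers, and hypothetical parameter values.

```python
import math


def equilibrium_frequency(mu, s, h):
    """Approximate equilibrium frequency of a deleterious allele.

    mu: mutation rate towards the deleterious allele per generation
    s: selection coefficient against homozygotes
    h: dominance coefficient (heterozygote fitness is 1 - h*s)

    Uses q ~ mu / (h*s) when heterozygotes are affected and
    q ~ sqrt(mu / s) for a fully recessive allele (h == 0).
    """
    if h == 0:
        return math.sqrt(mu / s)
    return mu / (h * s)


if __name__ == "__main__":
    mu = 1e-6  # hypothetical per-locus mutation rate
    for s, h in [(1.0, 0.5), (0.1, 0.5), (0.1, 0.0)]:
        q = equilibrium_frequency(mu, s, h)
        print(f"s = {s}, h = {h}: q is about {q:.2e}")
```

Under these approximations, stronger selection or greater dominance pushes the equilibrium frequency lower, which is why highly penetrant, severe mutations are expected to stay rare.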

### Purifying selection, rare variants, and severe disorders

According to population genetics theory, strongly deleterious mutations are rapidly removed from the population by purifying selection, whereas mildly deleterious mutations generally remain present, albeit at low frequencies, depending on population sizes and fitness effects. Genome-wide studies are providing increasing amounts of support for these predictions, as “essential” genes—identified as such on the basis of association with Mendelian diseases or experimental evidence from model organisms—are enriched in signs of purifying selection [32, 37, 115, 164]. Purifying selection has also been shown to be widespread in regulatory variation, acting against variants with large effects on transcription, conserved noncoding regions of the genome, and genes that are central in regulatory and protein–protein interaction networks [8, 10, 165–171].

Mutations associated with Mendelian diseases or with deleterious effects on the phenotype of the organism are generally rare and display familial segregation, but such mutations may also be restricted to specific populations [11]. This restriction, in some cases, may be due to a selective advantage provided by the disease risk allele (e.g., the sickle cell allele in populations exposed to malaria [98]), but it mostly reflects a departure from the mutation–selection balance. Small population sizes or specific demographic events may randomly increase the frequency of some disease risk alleles, because too little time has elapsed for purifying selection to remove them from the population, as observed in French Canadians, Ashkenazi Jews, or Finns [11, 66, 67].

According to these principles of population genetics, searches for genes or functional elements evolving under strong purifying selection can be used to identify the genes of major relevance for survival, mutations of which are likely to impair function and lead to severe clinical phenotypes. In this context, the immune response and host defense functions appear to be the prime targets of purifying selection [37, 95, 102]. For example, a recent study based on whole-genome sequences from the 1000 Genomes Project estimated the degree to which purifying selection acted on 1500 innate immunity genes. The genes of this class, taken as a whole, were found to have evolved under globally stronger purifying selection than the rest of the protein-coding genome [95]. This study also assessed the strength of selective constraints in the different innate immunity modules, organizing these constraints into a hierarchy of biological relevance, and providing information about the degree to which the corresponding genes were essential or redundant [95].

Population genetics has also facilitated the identification of immune system genes and signaling pathways that fulfill essential, non-redundant functions in host defense, variants of which are associated with severe, life-threatening infectious diseases (for examples, see [94, 95, 101, 106], and for reviews [29, 103, 172, 173]). This is well illustrated by the cases of STAT1 and TRAF3; they belong to the 1% of genes presenting the strongest signals of purifying selection at the genome-wide level [95], and mutations in these genes have been associated with severe viral and bacterial diseases, Mendelian susceptibility to mycobacterial disease, and herpes simplex virus 1 encephalitis [174, 175]. Using the paradigm of immunity and infectious disease risk, these studies highlight the value of population genetics as a complement to clinical and epidemiological genetic studies, for determining the biological relevance of human genes in natura and in predicting their involvement in human disease [29, 103, 173, 176].

### Genetic adaptation, common variants, and complex disease

The relationship between selection and complex disease risk is less clear than for Mendelian disorders, but patterns are beginning to emerge. Genes associated with complex disease display signs of less pervasive purifying selection than Mendelian disease genes [32, 173], and are generally enriched in signals of positive selection [23, 28, 32, 37, 110, 122, 169]. There is also increasing evidence to suggest that genetic adaptations can alter complex disease susceptibility, and the population distribution of common susceptibility alleles is unlikely to result from neutral processes alone [12, 91, 177–179]. For example, the difference in susceptibility to hypertension and metabolic disorders between populations is thought to result from past adaptation to different environmental pressures [91, 179, 180]. Another study characterized the structure of complex genetic risk for 102 diseases in the context of human migration [178]. Differences between populations in the genetic risk of diseases such as type 2 diabetes, biliary liver cirrhosis, inflammatory bowel disease, systemic lupus erythematosus, and vitiligo could not be explained by simple genetic drift, providing evidence of a role for past genetic adaptation [178]. Likewise, Grossman and coworkers found overlaps between their candidate positively selected regions and genes associated with traits or diseases in GWAS [28], including height, and multiple regions associated with infectious and autoimmune disease risks, including tuberculosis and leprosy.

Like purifying selection, positive selection is prevalent among genes related to immunity and host defense [24, 37, 95, 109, 112, 115, 181]. Notable examples of immunity-related genes evolving in an adaptive manner, through different forms of positive or balancing selection, and reported to be associated with complex traits or diseases include: TLR1 and TLR5, which have selection signals that seem to be related to decreases in NF-kB signaling in Europe and Africa, respectively [28, 94, 95]; many genes involved in malaria resistance in Africa and Southeast Asia [98, 100]; type-III interferon genes in Europeans and Asians, related to higher levels of spontaneous viral clearance [101, 182]; LARGE and IL21, which have been implicated in Lassa fever infectivity and immunity in West Africans [181]; and components of the NF-kB signaling pathway and inflammasome activation related to cholera resistance in a population from the Ganges river delta [97]. These cases of selection related to infectious disease and many others (see [29–31, 96, 103] for reviews and references therein) indicate that the pressures imposed by infectious disease agents have been paramount among the different threats faced by humans [183]. They also highlight the value of population genetics approaches in elucidating the variants and mechanisms underlying complex disease risk.

### Changes in selective pressures and advantageous/deleterious variants

Most of the rare and common variants associated with susceptibility to disease in modern populations have emerged through neutral processes [184]. However, there is increasing evidence to suggest that, following changes in environmental variables or human lifestyle, alleles that were previously adaptive can become “maladaptive” and associated with disease risk [12, 13, 29, 30, 105]. For example, according to the popular “thrifty genotype” hypothesis based on epidemiological data, the high prevalence of type 2 diabetes and obesity in modern societies results from the selection of alleles associated with efficient fat and carbohydrate storage during periods of famine in the past. Increases in food abundance and a sedentary lifestyle have rendered these alleles detrimental [185]. The strongest evidence that past selection can lead to present-day maladaptation and disease susceptibility is provided by infectious and inflammatory disorders [12, 29–31, 77, 105]. According to the hygiene hypothesis, decreases in the diversity of the microbes we are exposed to, following improvements in hygiene and the introduction of antibiotics and vaccines, have led to an imbalance in the immune response, with alleles that helped us to fight infection in the past now being associated with a higher risk of inflammation or autoimmunity [105].

Population genetics studies have provided strong support for the hygiene hypothesis, by showing that genetic variants associated with susceptibility to certain autoimmune, inflammatory, or allergic diseases, such as inflammatory bowel disease, celiac disease, type 1 diabetes, multiple sclerosis, and psoriasis, also display strong positive selection signals [29, 30, 106, 186–188]. For example, genes conferring susceptibility to inflammatory diseases have been shown to be enriched in positive selection signals, with the selected loci forming a highly interconnected protein–protein interaction network, suggesting that a shared molecular function was adaptive in the past but now affects susceptibility to various inflammatory diseases [187]. Greater protection against pathogens is thought to be the most likely driver of past selection, but it has been suggested that other traits, such as anti-inflammatory conditions in utero, skin color, and hypoxic responses, might account for the past selective advantage of variants, contributing to the higher frequencies of chronic disease risk alleles in current populations [30]. Additional molecular, clinical, and epidemiological studies are required to support this hypothesis, but these observations highlight, more generally, the evolutionary trade-offs between past selection and current disease risk in the context of changes in environmental pressures and human lifestyle.

## 4. Discussion

Assessment of genetic diversity and population structure of G. kola in Benin is important for the management and conservation of the species. This study provides the first-ever molecular assessment of G. kola diversity and population structure through a genome-wide SNP dataset. In this study, 12,585 informative DArTseq SNP markers were identified across 100 G. kola accessions. The PIC values were useful in studying the level of polymorphism among the accessions. The average PIC of 0.3 indicates that the markers are moderately informative. The average PIC is also close to the value obtained for SNP markers identified using GBS in studies of rice [29] and wheat [30] diversity. A previous study on G. kola in Nigeria using RAPD markers reported a PIC value of 0.93 [31]. The bi-allelic nature of DArT-SNP markers, for which the maximum value for PIC is 0.5, compared to multi-allelic RAPD markers with a maximum PIC value of 1 [32], can explain the difference in PIC values observed in the two marker systems. Low to moderate levels of observed heterozygosity (0.223–0.248) in the analyzed populations in this study point to high heterozygote deficiency. In contrast, very high heterozygosity was observed in G. kola accessions from Nigeria [31]. The low heterozygosity may be due to a severe bottleneck event that occurred during domestication and selection [33]. In Benin, G. kola is only found in home gardens, farms, and fallows and therefore may have been subjected to voluntary or involuntary selection. The results of this study are consistent with other studies reporting a reduction in genetic diversity in domesticated crops as compared to their wild progenitors [29, 34]. Heterozygote deficit has been observed in many threatened species such as Cycas balansae [35], Glyptostrobus pensilis [36], and Pulsatilla patens [18]. Furthermore, the results of this study also reveal very high levels of inbreeding (FIS = 0.781–0.848). This could be attributed to self-pollination in G. kola populations [18]. Indeed, G. kola is a dioecious species and the ability of G. kola to mate with half sibs may have resulted in inbreeding among closely related individuals. Low genetic diversity and inbreeding depression have been observed in many studies on threatened species as a consequence of a decrease in population size [18, 37]. This is likely the case in the studied populations of G. kola in Benin, which confirms the recent report of the species' disappearance in the wild [2], with a limited number of accessions found in some populations.

Population differentiation is important for understanding the relative effect of evolutionary gene flow, mating system, selection, adaptation, and genetic drift on populations [38]. Pairwise genetic differentiation estimates (FST) values is a measure of population substructure and is useful in examining the overall genetic differentiation/divergence among populations. The FST values below 0.05 indicate low genetic differentiation, while values between 0.05–0.15, 0.15–0.25, and above 0.25 indicate moderate, high, and very high genetic differentiation respectively [39]. In the present study, pairwise FST showed low but significant (p < 0.05) differentiation among the studied populations. It was also observed that genetic variation was mainly found within populations (97.86%). This could be due to the small distribution range of G. kola in Benin and the short distances between the studied populations, which facilitate gene flow between populations. In addition, G. kola populations in Benin are under heavy anthropogenic exploitation [2], and human activities significantly affect the dynamics of genetic differentiation [40]. Discriminant analysis of principal components (DAPC) and the UPGMA analyses partitioned the 100 G. kola individuals into two principal genetic clusters. Hierarchical clustering analysis performed on G. kola accessions in Nigeria also revealed two clusters [31]. However, in the present study, an admixture of almost all the populations was noted within the two clusters. The presence of admixture within the two genetic clusters implied the lack of any discernable population structure [41], thus further indicating that interbreeding or sharing of alleles has occurred between the populations. An admixture analysis was performed with the program admixture, which reduces the false-positive rates, corrects for bias toward spurious admixture, and allows identification of different mating systems in structured as well as unstructured populations [42]. 
A finding of K = 1 suggests the accessions in this study are actually part of the large, non-contiguous, single population with low genetic differentiation and high gene flow.
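The FST thresholds quoted above can be written down as a small lookup. This is only a sketch: the function name `classify_fst` is hypothetical, the way the boundary values 0.15 and 0.25 are assigned is an assumption (the text gives ranges that share endpoints), and treating the non-within-population share of variance as an FST-like quantity is also an assumption made purely for illustration.

```python
def classify_fst(fst: float) -> str:
    """Return the qualitative differentiation category for an FST value.

    Thresholds follow the text (0.05, 0.15, 0.25). How the boundary values
    themselves are assigned is an assumption made here.
    """
    if fst < 0.05:
        return "low"
    if fst < 0.15:
        return "moderate"
    if fst <= 0.25:
        return "high"
    return "very high"


# Hypothetical illustration: if 97.86% of the variance is within populations,
# the remainder (about 0.0214) is treated here as the among-population share.
among_population_share = 1 - 0.9786
category = classify_fst(among_population_share)
print(f"{among_population_share:.4f} -> {category}")
```

Under these assumptions the remaining share of about 0.02 falls in the "low" category, which agrees with the low pairwise differentiation reported in the paragraph.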


## Motivation

This is, most of all, not a book about R. This is also not a “Population Genetics in R” textbook. It is a book about how we do population genetic analyses, for which R is a tool that allows us to reach beyond the limitations of point-and-click interfaces. As a field, Population Genetics has a broad set of textbooks describing the underlying theory. As a student, I cut my teeth on the texts of Hartl (1981) and Hartl & Clark (1997), and I have used other great texts such as Hamilton (2011) and Hedrick (2009) in the classroom to teach the subject for the last decade. In late 2015, there are a host of texts available to the student of population genetics (Amazon lists 150 different books under the search term “Population Genetics textbook”), so why do another one? What I have found is that while the theory behind this discipline has been well developed, its application has been largely neglected.

As a new graduate student, fresh out of my first population genetics course, I felt armed with the understanding of how microevolutionary processes influence the distribution of alleles within and among populations. What I wasn’t prepared for was sitting in front of the computer, looking at a few thousand individuals assayed for several different loci and actually ‘doing’ population genetics. All of those textbooks provided me with what is expected and the theory behind it, though they often fell short on teaching me how I could apply those inferences to data I actually collect. If you are a theoretical population geneticist, those texts and your ability to integrate mathematical equations will provide you a research lifetime of work. However, if you are a practitioner who uses population genetic tools to answer conservation, management, or ecologically inspired questions, the evolutionary expectations of population genetic processes will most likely not be as important as directly estimating inbreeding, exploring ongoing connectivity, or determining genetic granularity of existing populations. This is the focus of this textbook, a seemingly uninhabited niche in the knowledge ecosystem of graduate level population genetics.

This text was developed out of a graduate course in Population Genetics that I’ve been teaching at Virginia Commonwealth University since 2005. This text uses R and many additional libraries available within the R ecosystem to illustrate how to perform specific types of analyses and what kind of biological inferences we can gain from them. In the process, we cover materials that are commonly needed in the application of population genetic analysis, such as spatial autocorrelation, paternity analysis, and the use of permutation, while at the same time highlighting logistical challenges commonly encountered in analyzing real data, such as incomplete sampling, missing data, and rarefaction.

## 31: Population Genetics and Evolution

Topics covered: Population Genetics and Evolution

Instructor: Prof. Martin Polz, Guest Lecturer


So, for today's lecture, as you can see up there, the topic is molecular evolution and ecology.

And what I mean by this, it's basically the study or what we try to figure out in molecular evolution and ecology is what genes or gene sequences can tell us about the evolution and ultimately also the ecology of organisms in the environment. And it's particularly relevant for thinking about microorganisms, prokaryotes and the environment.

And I hope I can actually convince you today of that.

This is interesting. The topics that I want to cover today are, first of all, I want to review a little bit what we know about life on Earth, sort of give an overview of the evolution of life on Earth. Then, I want to go into a specific topic that's of particular relevance for the evolution of eukaryotes.

That's the endosymbiosis theory. And then I'll explain how we can use gene sequences to actually reconstruct events that have happened a very, very long time ago.

OK, so we'll look at what we call molecular phylogenies, which is the use of gene sequences to reconstruct the evolutionary history of organisms on Earth. Derived from that, we'll look at what we call the tree of life. That's sort of the big picture overview of the evolutionary relationships of all organisms on the planet. And then finally, I'll introduce you to a topic called molecular ecology. Again, that's how we can use gene sequences to learn something about the diversity of microorganisms in the environment. That leads us then, next time, when I come back on Monday, into this big topic of environmental genomics, how we can actually expand this analysis to learn much more about organisms in the environment. So, first of all, let's look at life on Earth. Does anybody know how old we think Earth is? Say again? Yeah, 4.5 to 4.6, I have in my notes 4.6. So, Earth is thought to have originated about 4.6 billion years ago. When did the first solid rocks appear on Earth? So, when was the surface kind of solidified? Anybody know? About 3.9 billion years ago, OK?

And when do we think life started to develop on the planet?

Any ideas? Take a guess. Two? One? 3.5 billion years ago, OK? So, this is really remarkable.

We think it didn't, I mean, of course it took a long time because we're talking about millions of years and hundreds of millions of years, but still, if you look at the big picture, it didn't actually take life that long to evolve on the planet. So, why do we think that is the case? What's the evidence for that? Well, when we look into sedimentary rocks, so old rocks that arose from sediments, what you find around this time is that chemicals start to appear, organic molecules that really resemble organic molecules in modern life.

So, we have sort of chemical tracers, or chemical fossils.

So, tracers that indicate the presence of organisms. But what we also find is so-called micro-fossils, and I have a picture of that here, where when you actually take rocks and slice them into very, very thin slices, you can put them under specific microscopes.

And what you then find is that many rocks that are very, very old, have those kinds of inclusions in them.

And these things really resemble very much modern prokaryotic cells, modern bacterial cells, for example. And so, those micro-fossils are generally taken as an indication, also, that life was already present during those times. Now, when we take a quick sort of overview of the evolution of life on the planet, again this graph here summarizes sort of the last 4 billion years or so, since life originated. We see that there was a period of chemical evolution, and then somewhere here in that region, it's, of course, not really well understood when exactly that happened, the origin of life is placed.

But I want to alert you to a couple of really, really critical steps here that are shown on this graph which we'll actually talk more about.

It is thought that life very early on split into three major lineages: the bacteria, the archaea, and what is called here the nuclear line. And I'll come back to that in a minute or so.

Then, a further major event, which you may remember, is that oxygenic photosynthesis actually evolved, which means that cyanobacteria evolved that started to produce oxygen as a byproduct of photosynthesis. And that really fundamentally changed the chemistry of the Earth. It actually became an oxidizing atmosphere. And what you see here is, once the oxygen concentration goes over a certain level, it allowed the development of an ozone shield. Now, what does that mean?

What was the critical significance of the presence of an ozone shield?

Does anybody know? What does it block out? Anybody remember that?

What's the big significance of the ozone hole over Antarctica, for example? It allows UV radiation to hit the Earth's surface, and in fact if there were no ozone, the UV radiation would be so strong that there would be no life possible on land.

So, once the ozone shield actually developed, organisms could conquer, basically, the land's surface and settle on the land surface.

And this, then, is thought to be at least correlated with the development of endosymbiosis. And I'll explain what I mean by that. But it basically led to the origin of modern eukaryotes, so your ancestors essentially. But there was still a long time, obviously, until humans appeared. We have here the origin of animals and metazoans, and then the age of the dinosaurs is already a very small blip here on this graph. And humans don't even get featured on that because we are so recent. So, but what I want to show you here is that three major lineages evolved early on. These are the bacteria, archaea, and what we call a nuclear lineage. And the significance of that nuclear lineage is that it basically combined with bacteria to form the modern eukaryotic cell. So, the eukarya, or eukaryotes as they're also called. And it was this combination that we call the endosymbiosis event. I want to explain this a little bit more, and then I'll show you finally why we actually know that those things are very likely to have occurred a long time ago.

Yes? It means the bacteria and the nuclear lineages combined to form a eukaryote, OK? And I'm actually going to explain this on the slide here. So, if you have any more questions after that, please let me know. So, again, this shows you this early evolution, this early split into archaea, bacteria, and this sort of nuclear line. It is thought that this nuclear line was single-celled organisms that increased in cell size, and then developed or partitioned the DNA into a nucleus, basically. So exactly how you find it in modern eukaryotic cells.

But then what happened is the cell took up a bacterial cell, and over time this bacterial cell became a symbiont.

In fact it became the mitochondrion. And so what this mitochondrion now does in the modern eukaryotic cell, as you all know, is it really took over the energy metabolism. So, the proto-eukaryotic cell took up a heterotrophic bacterium that formed the mitochondrion.

And this ultimately then gave rise to protozoa and to modern-day animals. But there was a secondary symbiotic event.

This cell, once it had taken up a heterotrophic bacterium, took up an autotrophic bacterium, a cyanobacterium, an oxygenic photosynthesizer. And this actually led to the development of modern algae and modern plants.

So what we can say is that mitochondria are ancient heterotrophic bacteria, and the chloroplasts are ancient cyanobacteria, so, oxygenic, photosynthetic bacteria. And these obviously have coevolved with their hosts to then form animals and finally plants.

So now, obviously we are talking here about events that happened a very, very long time ago. And so, the big question is really how do we really know this? This takes me to the third topic, which is that of molecular evolution. So, we can state the problem again, and that is, very simply put, evolution is incredibly slow, OK? And therefore, its processes are not directly observable.

And we need to actually use inference techniques to reconstruct evolutionary processes. Now, what do we use when we want to reconstruct the evolutionary history of animals and plants usually?

Anybody? Fossils. Exactly. So you take a shovel, essentially, and dig down into the different layers.

And there are different techniques by which you can actually determine the age of different sedimentary rocks. And then, if you're lucky and you find enough fossils of a particular lineage, you can reconstruct the evolution of the lineage. I'm sure you all have seen the example of the horse, for example, where we have actually quite good evidence of what ancient horses looked like.

And we can reconstruct the sequence of events that led to the evolution of modern-day horses. Now, you can imagine, though, that when we talk about such ancient events like these, there really is no fossil record. OK, so what people have figured out, then, and that was really a stroke of genius that came about in the late 60s, is that DNA molecules can act as evolutionary chronometers.

OK, now what do I mean by that?

I mean that you can take DNA sequences or gene sequences from different kinds of organisms. Based on those gene sequences you can reconstruct the relationships to each other. You can determine whether two organisms are closely related or whether they are only very distantly related. And the underlying mechanism of that, is that mutations happen with a certain probability all the time.

So, the idea is that as time passes, DNA molecules will change.

So they will accumulate, actually, mutations, and the idea is that the amount of change in a particular DNA sequence is proportional to the time of separate evolution of two different lineages or two different organisms.

So, the amount of change is more or less proportional to the time since the last common ancestor.

So, let me explain how this is actually done.

What you really need in order to do this is genes that are related to each other, OK? So, the genes need to be universally distributed. That means all organisms that you want to compare need to have this type of gene. And, those genes need to have conserved function.

And these genes we can then compare to each other, and I will explain how this is actually done. Any questions so far?

OK, so the example that I actually want to bring is the 16S ribosomal RNA genes.

We oftentimes abbreviate this rRNA. Now, does anybody remember what the ribosomal RNAs are and do? What's the ribosome? Yes?

Right, and what does it do? Exactly, it's the location where messenger RNA is translated into protein.

Now, the ribosomal RNAs are an integral part of the ribosome.

They play both a catalytic role as well as a structural role in the ribosome. And so, fundamentally, because this is such a fundamental organelle, all living organisms possess it.

So, all organisms have it. So this allows us to use these genes to really compare all living organisms to each other.

OK, so this is a very important point.

I wanted to show you a, OK, if it wakes up. There we go.

An example of these ribosomal RNA genes, now this is actually, what you see here is a secondary structure of the actual RNA, the ribosomal RNA. Now, these molecules have a secondary structure because they play a catalytic and structural role.

And so, the really amazing thing is when you look at the structure, the structure determines really the function of those molecules in different organisms. And then look at this.

We have here a bacterium, and here an archaeon. Now, if you think back to the first couple of slides, what I showed you is that those organisms have not shared a common evolutionary history for about four, or so, billion years, or 3 billion years, excuse me. But, if you just glance very quickly at the structures, you see that they look very similar to each other. So, there's an indication that the function of those molecules is really very highly conserved.

However, when you actually look at the sequences in detail, what you'll find is that there's different regions.

And I've given some examples here, denoted by A, B, C in those molecules. And these different regions of the molecules are really the key to their usefulness in figuring out the evolution and ecology of many organisms.

The regions denoted by A here are sequence stretches that are the same in all living organisms.

So they are universally conserved, which means that if you get a mutation in a gene in that particular region, you are dead. OK, that's why it's conserved essentially.

Then we have those regions B where the length is conserved, but the sequence is not. So, there are sequence changes allowed, but the length needs to be conserved. And then there are the regions C where neither length nor sequence is actually conserved, and where we get a lot of variation. So, let me write this down. We have three types of sequence stretches.

We have A, what I called the universally conserved sequences. We have B where length, but not sequence is conserved. And, we have C where neither length nor sequence is actually conserved.

And the first two types of sequence stretches are very important in figuring out the phylogeny, or the evolutionary relationships amongst organisms. Whereas the sequence stretches of type C, because they vary so dramatically, are very important in identifying organisms.

So what can we actually now do with those sequences?

Well, the first step is we need to generate an alignment.

OK, and this is actually shown here, where each row denotes a gene from a particular organism.

OK, so these are all abbreviated here.

These actually aren't ribosomal RNA genes, but other genes.

And what you will see here is that we can recognize those different regions that I've pointed out before. You have the regions A, which tell you which nucleotides line up with each other, so you use these sort of as an anchor because the sequences never vary amongst organisms. And then there are the sequence regions B, where you line up stretches that vary in sequence but not in length. Now, why is this important?

It's important because you have in each column the nucleotides that have originated from a common ancestral nucleotide, and whose variation over time you can actually monitor.
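To make the column idea concrete, here is a minimal sketch that flags which columns of an alignment are identical across all sequences, which is roughly what the universally conserved type A stretches look like. The function name `column_conservation` and the toy sequences are hypothetical, and real alignments also contain gaps and ambiguity codes that this sketch treats as ordinary characters.

```python
def column_conservation(alignment):
    """Return one boolean per alignment column: True if all residues match."""
    lengths = {len(seq) for seq in alignment}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    # zip(*alignment) walks the alignment column by column.
    return [len(set(column)) == 1 for column in zip(*alignment)]


# Hypothetical toy alignment: only the fourth column (index 3) varies.
toy_alignment = ["ACGTTGCA", "ACGATGCA", "ACGCTGCA"]
flags = column_conservation(toy_alignment)
conserved_positions = [i for i, same in enumerate(flags) if same]
print(conserved_positions)
```

Runs of conserved columns are the anchors used to line sequences up, while the variable columns carry the signal that the later distance calculation counts.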

Is everybody with that? Any questions? OK, great.

The second step, then, is the calculation of a similarity.

And this is shown here. Again, we have a very simplified alignment now of four different organisms. Here, we have the sequences that we want to compare. And what you'll see is that they're overall very similar, but there are different sort of nucleotides. And so, what we simply do is for each pair of sequence combinations, we calculate the sequence similarity value. So, what you see is that you have 12 nucleotides, and the first pair differs in three nucleotides. OK, so that tells us, or it's called actually a distance here, I'm sorry. Let me write this down here.

It's simply one minus the similarity, of course. So basically, between A and B, a quarter of the nucleotides differ, whereas between A and C, a third of the nucleotides differ, and so on. OK, so you do this for each pair of sequences. The third step, then, is to calculate the correction for multiple mutations affecting the same nucleotides.

Now, you can imagine that over time there's a probability that a particular nucleotide mutates, say, twice. So, in the first instance it may change from an A to a G, but then it changes to a C.
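The pairwise distance from the second step can be sketched as a count of mismatched columns divided by the alignment length. The sequences below are hypothetical 12-nucleotide examples, chosen only so that one pair differs at three sites and another at four, as in the lecture's description; they are not the sequences on the slide.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)


# Hypothetical aligned sequences, 12 nucleotides each.
org_a = "ACGTACGTACGT"
org_b = "ACGTACGTATTA"  # differs from org_a at 3 positions
org_c = "TCGAACGAACGA"  # differs from org_a at 4 positions

dist_ab = p_distance(org_a, org_b)  # 3 / 12
dist_ac = p_distance(org_a, org_c)  # 4 / 12
print(dist_ab, dist_ac)
```

The similarity is then just `1 - p_distance(...)`, matching the "one minus the similarity" remark.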

But when you look at the modern-day sequences, you don't know that this actually happened. And so there's ways to statistically estimate what the likelihood is that a sequence actually contains such multiple events.

OK, and this we call a corrected evolutionary distance, then. And what you will note is that the corrected evolutionary distance is invariably larger than the actual observed one.
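The lecture does not say which correction is used. One common choice, shown here purely as an assumption, is the Jukes-Cantor model, which treats all substitutions as equally likely and converts an observed fraction of differing sites p into a corrected distance d = -(3/4) ln(1 - 4p/3). The function name `jukes_cantor` is hypothetical.

```python
import math


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor corrected distance for an observed difference fraction p.

    Assumes equal rates for all substitution types; only defined for
    0 <= p < 0.75, because 0.75 is the saturation point under this model.
    """
    if not 0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


observed = [0.25, 1 / 3]  # the two observed fractions mentioned in the lecture
corrected = [jukes_cantor(p) for p in observed]
for p, d in zip(observed, corrected):
    print(f"observed {p:.3f} -> corrected {d:.3f}")
```

For any observed difference greater than zero the corrected value comes out larger than the observed one, which is the behaviour described above; an observed quarter of differing sites corrects to roughly 0.30 under this model.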

Now, what can we do with those distances? We can constrain them into a best fit tree of relationships.

So, we can draw what we call a best fit tree.

That's shown here. We have our four organisms, but when you look at those branches of the tree what you'll see is that they add up roughly to the correct evolutionary distance here.

So, between A and B we have 0. 3 and 0.08, which roughly gives you 0.3 here, OK, whereas between A and C the tree is constrained such that we have 0.31, and here 0. 5, and so overall you roughly get the distance here that we have calculated. And so what this means is that you have ordered the organisms by their calculated evolutionary distance. And so you have now obtained, actually, a very intuitive picture of the relationship of organisms to each other, where A and B are obviously the most closely related ones, and A and D are the most distantly related.
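The lecture does not name the tree-building method. As one simple illustration of turning a distance matrix into a tree, here is a sketch of UPGMA (average-linkage clustering), which is an assumption on my part and additionally assumes clock-like rates; other methods such as neighbor joining are often preferred in practice. The distance values are hypothetical, not the ones on the slide.

```python
def upgma(names, matrix):
    """Build a UPGMA tree from a symmetric distance matrix.

    Returns nested tuples (left, right, height), where height is the
    distance from the internal node down to its leaves.
    """
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (tree, size)
    dist = {(i, j): matrix[i][j]
            for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        i, j = min(dist, key=dist.get)  # closest pair of clusters
        tree_i, n_i = clusters.pop(i)
        tree_j, n_j = clusters.pop(j)
        height = dist[(i, j)] / 2
        new_dist = {}
        for k in clusters:
            d_ik = dist[(min(i, k), max(i, k))]
            d_jk = dist[(min(j, k), max(j, k))]
            # Size-weighted average distance to the merged cluster.
            new_dist[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        dist = {pair: d for pair, d in dist.items()
                if i not in pair and j not in pair}
        dist.update(new_dist)
        clusters[next_id] = ((tree_i, tree_j, round(height, 4)), n_i + n_j)
        next_id += 1
    return next(iter(clusters.values()))[0]


# Hypothetical corrected distances for four organisms A, B, C, D.
names = ["A", "B", "C", "D"]
matrix = [
    [0.00, 0.30, 0.44, 0.60],
    [0.30, 0.00, 0.44, 0.60],
    [0.44, 0.44, 0.00, 0.60],
    [0.60, 0.60, 0.60, 0.00],
]
tree = upgma(names, matrix)
print(tree)
```

In the resulting tree A and B join first at height 0.15, so the path between them sums to 0.30, the input distance. That is the sense in which the branch lengths "add up" to the calculated distances, although with real data the fit is only approximate.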

Is everybody with it? Any questions? OK, now, this best fit tree is what we call a phylogeny.

Now, excuse me, these techniques really revolutionized the study of evolutionary relationships, and one of the things that they allowed us to do is to construct universal phylogenetic trees, or what we can also call the tree of life. And I will show you this on the next slide, and then I want to make a few general statements about this.

So first of all, when you analyze all known organisms, and obviously that would be a big task, so representatives of all known organisms, what you'll find is that, indeed, we have three major lineages: the bacteria, the archaea, and the eukarya. OK, so we have what we call three domains of life: the archaea, bacteria, and the eukarya.

So, this really is the evidence that life really split very, very early on into those three lineages that I showed you before.

Interestingly, two of those major domains here are prokaryotic, OK? So, two of the domains are prokaryotes. Moreover, if you actually look at the types of organisms that are on here, you'll notice that even on the eukaryotic side of the tree, most of the organisms here are actually microbial. So, they are single-celled organisms, and that means that most of the life on the planet is microbial.

The vast diversity of organisms on the planet are microorganisms.

So, we can say that most life is microbial.

And when you, then, look at analyses of mitochondria and chloroplasts, which all have their own genetic machinery, and therefore also their own ribosomes, you'll see that the mitochondrion, OK, and the chloroplasts both tree within the bacteria. So, we really have an amazing confirmation of this endosymbiont theory, which was actually developed in the absence of gene sequences by some Russian scientists in the early 20th century. So, we have that mitochondria and chloroplasts tree within bacteria, and this really supports the endosymbiont theory. So really, you could say eukaryotes are really just walking, and swimming, and flying incubators for bacteria, right? So, just hosts for microorganisms.

OK, so basically you can, what you should take home from this is the three domains of life. Two are prokaryotic, and even more so most of the diversity that we find is actually microbial, and then finally the endosymbiont theory is actually confirmed by those phylogenies. Now, what I want to cover in the remaining time, is how we can actually use now those sequences to learn something about organisms in the environment.

That's the topic of molecular ecology.

To introduce this, I just want to show you a couple slides that really sort of capture what the big problem is that we're facing here. Now, when we look at the abundance of prokaryotic cells in different types of environments, what we see is that there is an enormous number of different prokaryotes out there.

This summarizes, here, different types of environments. We have the marine environment, freshwater environment, sediment and soils, subsurface sediments, and animal guts.

And this number here gives you the average number of prokaryotic cells either per milliliter or per gram. And here we have the total number of cells, obtained by multiplying the average number with the total volume of the particular environment.

So what you can see is that in the marine environment, we have on average half a million cells per milliliter of water, OK? In freshwater, we have about a million cells.

What is that telling you? There's a ton of prokaryotes out there. When you go swimming and you take a little gulp of water, you've probably eaten several million prokaryotes, but it's nothing to worry about, because what this also tells us is that very, very few prokaryotes out there are really pathogens, because otherwise you'd be sick all the time.

Now, in sediments and soils, in as little as a gram you have five times 10^9 prokaryotic cells almost. 5 billion prokaryotic cells are out there, and even in very, very deep sediments that reach down to 3,000 m, you have a substantial number of prokaryotic cells.

Well, and here's your guts, 10^5 times 10^6 gives you 10^11 per gram. So again, you're just a walking incubator for a very complex microbial community. Here's the global abundance. You see that the deep subsurface sediments and the marine environment are probably, in terms of numbers at least, the most important microbial environments. Now, faced with this enormous abundance of prokaryotes out there, a very important question is how many different kinds of them are out there? Or, how diverse are prokaryotes in the environment? That's important if you want to figure out their function in the environment, and want to understand also their evolution. And what I want to show you here is that we've gone through an amazing development in our understanding of prokaryotic diversity in the environment over the last 10 to 15 years or so. Who knows about E. O. Wilson here? One person? So, he wrote a very famous book on biodiversity, which was published in 1988, where he tried to summarize, really, how diverse the known organisms are on the planet, and he also tried to extrapolate to the total diversity.

And what you see is that he came up with about 1.4 million different species here, mostly dominated by insects. That's the big section here on this pie chart. The plants: very important.

And if you look, the prokaryotes feature with about 3, 00 different species. So, in 1988 we thought there were very few prokaryotic species out there. If you look about 10 years into the future and take the assessment here, and this just exemplifies how the thinking has changed, you see that we think now that there are about 11 million different species out there, and that the vast majority of them are prokaryotic, OK, 10 million. So, this big part of the pie chart is really the prokaryotic diversity. Now, what really has changed is that we've actually started to use molecular techniques to determine the diversity of prokaryotes in the environment.

So molecular ecology is really the use of gene sequences obtained directly from the environment to learn about the prokaryotic diversity out there. Now, this slide just quickly summarizes this. Basically, the idea is that you go out into the environment and collect either water or soil samples that, as I just showed you, invariably contain a lot of different prokaryotic cells. You then lyse the cells and purify their DNA. And so you end up with a mixture of DNA that represents the organisms out there, and then you can use universal PCR primers to actually amplify ribosomal RNA genes from all the organisms that are present in your samples.

Now, why can you use universal PCR primers? Well, they target the regions denoted A that I showed you before.

Those regions in the genes are invariant amongst all organisms.

You guys all remember how PCR works, right? We covered this.

OK? Yes? No? Who doesn't? You don't? All right, come to the board. Just kidding. OK, you should look it up. I don't have time to cover this, unfortunately, but basically it's a technique that allows you to amplify specific types of genes a million- to a billion-fold. And once you have done this, what you can do is purify the genes on gels, and then separate them by cloning them into individual plasmids. Those plasmids are inserted into E. coli cells, and the E. coli cells are then individually grown up so that each culture contains only a single plasmid, and you can then sequence these ribosomal DNAs, or ribosomal RNA genes, from those clones.

And so, you have obtained a library of the ribosomal RNA genes from the environment. So, we use environmental ribosomal RNA gene libraries from which we then can actually compare how many different types of genes are out there.

So let me show you an example of this. What we have done recently is one of the first really comprehensive samplings of coastal bacterioplankton, which means the bacteria that are present free living in ocean water. And so, we've done this, we've collected all those clones, and then basically we constructed those phylogenetic trees that I showed you before, which really allow us to see how many different types are out there, and how closely related they are to one another. And this is an environment that you might think is very simple, because it's just the water column, right? Not much structure in there.

We found over 1500 bacterial 16S ribosomal RNA sequences to occur, so an enormous diversity of prokaryotes, of bacteria, in that particular environment. And the important point is that when you actually look at a collection of such studies as the one I just showed you, what you find is that the vast majority of microorganisms in the environment have never been cultured. So traditionally what we do, of course, to learn about microorganisms, when you grow E. coli, or so, is you throw them onto culture plates.

You make lots of different cells, and that allows you to study some of their properties. But when you look, for example, at results from the ocean, this summarizes now coastal and open ocean environments, again, the bacterioplankton is those free-floating bacterial cells in the water.

And you compare this to what we've actually been able to culture from those environments. What you see is that you have some dominant groups here. They have all funny names, most of them, because they're just clones and clone libraries.

But these are the dominant groups that show up in clone libraries.

Here's their relative representation in different clone libraries from a variety of environments. And so here you have one very important one, the SAR11 group, or this one, the SAR86, that always show up in clone libraries.

But we've never seen them in culture, so the important point to realize here is that whenever we go out, we find a great diversity of bacteria out there, but we have no idea what they actually do.

And this is one of the big questions that we need to answer to understand, really, how the planet actually works. What are those uncultured microorganisms out in the environment really doing, and what is their importance? And we'll talk about this next time.

We're going to talk about environmental genomics because essentially what we can do now is we have techniques available that allow us to isolate at least large fragments of the genomes, sequence those, and look at what kinds of genes they have present.

And that allows us, then, to infer some of their function in the biogeochemical cycles in the environment.

OK, so with this I'm going to close today unless you have any more questions.
