# What is the difference between nucleotide polymorphism (θ) and nucleotide diversity (π)?

We are searching data for your request:

Forums and discussions:
Manuals and reference books:
Data from registers:
Wait the end of the search in all databases.
Upon completion, a link will appear to access the found materials.

In my population genetics book (see reference at bottom) they define them as:

• Nucleotide polymorphism (θ): proportion of nucleotide sites that are expected to be polymorphic in any sample of size 5 from this region. Of the genome. \$hat{θ}\$ equals the proportion of nucleotide polymorphism observes in the sample (S) divided by \$a_1 = sum_{i=1}^{n-1} frac{1}{I}\$. If n= 5, \$a_1 = frac{1}{1}+ frac{1}{2}+ frac{1}{3}+ frac{1}{4}+ frac{1}{5}=2.083\$
• Nucleotide diversity is the average proportion of nucleotide differences between all possible pairs of sequences in the sample. In R, I came up with that code which is in accordance with what is in the book.

``data # Only polymorphisms total.snp # This is the total number of sites that were looked (e.g. 16 might be polymorphic over the 500 sites that we've looked at. So 484 sites are monomorphic) n = nrow(data) # Number of samples pwcomp = n*(n-1)/2 # Number of pairwise comparisons for(i in 1:n.col){ # Compute the number of differences in the samples that are polymorphic t.v = as.vector(table(data[,i])) z = outer(t.v,t.v,'*') temp= c(temp,sum(z[lower.tri(z)])) } pi.hat = sum(temp)/(pwcomp*total.snp) # Nb of different pairwise comparison/(nb all pairwise comparison * total nb of loci (polymorphic or not))``

In the book, they say

On the other hand, there is a theoretical relation between θ and π that is expected under simplifying assumption that the alleles are invisible to natural selection.

In summary θ = π with this assumption and large sample sizes.

1. So what is the difference in the
2. Why are we calculating both (what could they tell us)?

Hartl, D. L., & Clark, A. G. (1997). Principles of Population Genetics (3rd ed.). Sinauer Associates Incorporated.

You are also confusing definition of these terms with their expected values (typically computed from coalescent theory). Let's look at an example.

Consider a haploid population of 6 individuals. We will represent these individuals with 5 sites that can take values 0 or 1.

``00000 00001 00000 00010 00010 00000``

Polymorphism

In this population there are \$S=2\$ polymorphic sites. In other words, there are two sites for which different individuals take different values in the population. Over these 5 sites, the fraction of polymorphic sites is \$frac{2}{5}\$. You use \$ heta\$ for this quantity but to me, \$ heta\$ is typically used as a short hand for \$2 N mu\$ (or \$4 N mu\$ for diploid populations).

Under an infinite allele model, in a single panmictic haploid population (without selection and other assumptions), \$E[S] = 2 N mu a\$, where \$a = sum_{i=1}^{k-1} frac{1}{i}\$, where \$k\$ is the sample size.

Genetic diversity

Genetic diversity is also known as expected heterozygosity. While it is most often called \$pi\$, I will call it \$H\$ to avoid confusion. Imagine, you randomly sample two individuals in the population, what is the probability that there have different alleles at a given locus?

• At locus 1, \$H = 0\$
• At locus 2, \$H = 0\$
• At locus 3, \$H = 0\$
• At locus 4, \$H = 2 frac{2}{5} frac{3}{5} = frac{12}{25}\$
• At locus 5, \$H = 2 frac{1}{5} frac{4}{5} = frac{8}{25}\$

You can average over all loci if you want.

The expected heterozygosity in panmictic population (without selection and other assumptions), \$E[H] = frac{4 N mu}{1 + 4 N mu}\$. You should notice a relationship between \$H\$ and \$F_{ST}\$ here!

Nucleotide diversity

The nucleotide diversity is the equivalent of \$D\$ from Nei and Li (1979) and is highly related to \$D_{XY}\$ (Nei, 1987). Once you've managed to demonstrate that the expected coalesence time is \$N\$ (I won't do the demonstration here), then under an infinite site model, the number of mismatch between two individuals is logically \$E[D] = 2 N mu\$ (it is the number of mutations that has happened in both lineages since the the coalescence event). For diploid population, multiply everything by 2.

Relatedness between \$D\$ and \$S\$

Under a neutral model \$frac{S}{a} = D\$. Actually the standardize difference between \$frac{S}{a}\$ and \$D\$ is what is used in Tajima's \$D\$ tests.

Source of information

Hartl and Clarck is good IMO but I would also like to recommend reading Population Genetics: A Concise Guide by Gillespie who explain all of that wonderfully.

## Nuclear Gene Variation in Wild Brown Rats

Although the brown rat (Rattus norvegicus) is widely used as a model mammal throughout biological sciences, little is known about genetic variation in wild rat populations or the relationship of commonly used inbred strains to their wild relatives. We sampled wild brown rats from the species’ presumed ancestral range in NW China and from a derived population in the UK and estimated nucleotide diversity and population subdivision, based on the sequences of 30 autosomal protein-coding loci. Neutral genetic diversity was close to 0.2% in both populations, which is about five times lower than diversity at the orthologous sites in a population of wild house mice from the species’ putative ancestral range in India. We found significant population differentiation between UK and Chinese populations, as assessed by Fst and the program STRUCTURE. Based on synonymous diversity and divergence between the brown rat and house mouse, we estimate that the recent effective population size in brown rats is approximately 130,000 (approximate 95% confidence interval 85,000-184,000), about fivefold lower than wild house mice.

The brown rat is the leading animal model in physiology and pharmacology research, and, after the house mouse, is the most widely studied model mammal in genetics. Yet, little is known about the origin of inbred rat strains or the genetic relationship between inbred strains and wild rats. Studies of the mitochondrial genome (Brown and Simpson 1981 Li et al. 1999 Lin et al. 2012), allozymes (Bender et al. 1985 Cramer et al. 1988), and random amplified polymorphic DNA markers (Jiang et al. 2005) hint at the presence of substantial genetic diversity in wild brown rat populations. Geographic variation for morphological traits in Chinese populations has been reported (Wu 1982), and four subspecies have been recognized on the basis of variation in morphology (Wu 1982 Wilson and Reeder 2005). However, the amount of genetic variation in nuclear genes and the extent of geographical differentiation among populations in nature are not known.

There is substantially more information on genetic diversity among inbred laboratory rat strains. Surveys of microsatellites (Canzian 1997 Thomas et al. 2003) and single nucleotide polymorphisms [SNPs (Smits et al. 2004 STAR Consortium 2008)] suggest that inbred strains are genetically diverse and distinct from Brown Norway strains, which are assumed to be the most closely related strains to wild brown rats (Thomas et al. 2003). Brown Norway is the reference strain for the rat genome project (Gibbs et al. 2004). Rattus species most likely originated in Southeast Asia, which is the center of current rat species diversity (Rowe et al. 2011). Rattus norvegicus is assumed to have evolved on the plains of Asia in NW China and Mongolia, where wild brown rats are still found in what is presumed to be their native habitat. Although the black rat (Ratrrus rattus) is known from antiquity in Europe (Barnett 2001), R. norvegicus is believed to have reached Europe much later, probably between the 16th and 18th centuries, from where it has spread worldwide and largely displaced R. rattus in temperate regions.

In contrast to wild rats, there is a good deal of information on the genetic diversity in wild house mice (Mus musculus). House mouse subspecies vary in their mean nucleotide diversity (Baines and Harr 2007 Salcedo et al. 2007) and the extent of gene flow between the different subspecies varies across the genome (Teeter et al. 2008). Nuclear gene diversity is greatest in populations from their putative ancestral ranges, which are Iran and NW India in the cases of M. m. domesticus and M. m. castaneus, respectively (Baines and Harr 2007). NW India is also believed to be the ancestral range of the species complex as a whole (Din et al. 1996). Diversity from the putative ancestral range of M. m. musculus (NW Afghanistan) is currently unknown (Baines and Harr 2007). Synonymous site diversity of protein-coding genes in M. m. castaneus and M. m. domesticus ancestral populations is more than eightfold and threefold greater, respectively, than observed in human populations [(Baines and Harr 2007 Halligan et al. 2010) D. L. Halligan, A. Kousathanas, R. W. Ness, B. Harr, L. Eöry, H. Li, T. M. Keane, D. J. Adams, and P. D. Keightley, unpublished results], presumably a consequence of a substantially higher effective population sizes in wild mice. We therefore expected to see a similar pattern in brown rats, with diversity greatest in individuals from the ancestral range, and overall diversity levels comparable with mice, under the expectation that the rat ancestral effective population size is likely to have been very large. Here, we report the first survey of nuclear gene sequence diversity in wild brown rats. We sequenced 30 autosomal loci in a sample of individuals from the species’ presumed ancestral range in NW China and from a derived population in the UK.

## Analyses of sequence polymorphism and haplotype diversity of LEAFY genes revealed post-domestication selection in the Chinese elite maize inbred lines

Post-domestication selection refers to the artificial selection on the loci controlling important agronomic traits during the process of genetic improvement in a population. The maize genes Zfl1 and Zfl2, duplicate orthologs of Arabidopsis LEAFY, are key regulators in plant branching, inflorescence and flower development, and reproduction. In this study, the full gene sequences of Zfl1 and Zfl2 from 62 Chinese elite inbred lines were amplified to evaluate their nucleotide polymorphisms and haplotype diversities. A total of 254 and 192 variants that included SNPs and indels were identified from the full sequences of Zfl1 and Zfl2, respectively. Although most of the variants were found to be located in the non-coding regions, the polymorphisms of CDS sequences classified Zfl1 into 16 haplotypes encoding 16 different proteins and Zfl2 into 18 haplotypes encoding eight different proteins. The population of Huangzaosi and its derived lines showed statistically significant signals of post-domestication selection on the Zfl1 CDS sequences, as well as lower nucleotide polymorphism and haplotype diversity than the whole set. However, the Zfl2 locus was only selected for in the heterotic group Reid. Further evidence revealed that at least 17 recombination events contributed to the genetic and haplotype diversities at the Zfl1 locus and 16 recombination events at the Zfl2 locus.

## Nucleotide Diversity and Association Genetics of Xyloglucan Endotransglycosylase/hydrolase (XTH) and Cellulose Synthase (CesA) Genes in Neolamarckia cadamba

1283bp) and CesA (778bp) DNA sequences and further associates those SNPs with basic wood density. Primers were designed in flanking the partial XTH and CesA genes from 15 N. cadamba trees. The amplified DNA fragments were sequenced and the basic wood density measurements were determined for each tree. The sequence variation analyses revealed that 34 SNPs (2.65% occurrence) and 3 SNPs (0.39% occurrence) were found in 15 partial genomic DNA sequences of NcXTH1 and NcCesA1, respectively. All the SNPs were discovered in both exon and intron regions. NcXTH1 examined sites showed higher nucleotide diversities of π = 0.00402 and θ w = 8.919 when compared to NcCesA1 (π = 0.00127 θ w = 0.9226). The LD decayed slowly with distance of polymorphic sites in a linear pattern with the mean R 2 value of 0.000687. Association genetics study showed that 2 SNPs from NcXTH1 genes were significantly associated with basic wood density (p<0.05) of N. cadamba . Once the gene-associated SNP markers in NcXTH1 genes are validated, it could be potentially used as a tool in Gene-Assisted Selection (GAS) of N. cadamba trees. This study has also demonstrated that the candidate-gene based association genetics is a powerful approach to dissect complex adaptive traits for organism lacking a genome sequence or reference genomic resources.

S.Y. Tiong, W.S. Ho, S.L. Pang and J. Ismail, 2014. Nucleotide Diversity and Association Genetics of Xyloglucan Endotransglycosylase/hydrolase (XTH) and Cellulose Synthase (CesA) Genes in Neolamarckia cadamba . Journal of Biological Sciences, 14: 267-275.

Neolamarckia cadamba or locally known as Kelampayan is one of the evergreen and fast growing tropical forest tree species under Rubiaceae family (Joker, 2000). This species is naturally distributed throughout India, Pakistan, Sri Lanka, Burma, Malaysia, Cambodia, Thailand and Laos. The wood is used to make plywood, pulp, paper, boxes, furniture and more. Moreover, N. cadamba also served as medicinal plant for traditional curing using its leaf, bark and roof (Patel and Kumar, 2008 Mondal et al ., 2009 Bussa and Pinnapareddy, 2010). Despite it has high economic value, studies on N. cadamba at the molecular level are still limited to date. Recently, a Kelampayan tree transcriptome database (NcdbEST) had been established to provide useful genomics information and resources for researchers to deeply explore the genomics basic of the N. cadamba (Ho et al ., 2010).

Molecular study of wood is important to develop the fundamental study of wood biology and genetics in order to produce ‘designer wood’ or wood of desired characteristics. DNA markers are now widely used in molecular characterization because DNA markers can characterize cultivars, provenances or genotypes precisely and enable the measurement of genetic relationships (Narayanan et al ., 2007 Lau et al ., 2009). Association genetics is the approach that identifies statistical association between phenotypic trait variations and allelic polymorphism in targeted genes. The association study in plants are generally highlights in the utilization of a wide range of commercial and fitness traits mapped with the well-characterized candidate genes. For instance in maize, candidate genes were associated with digestibility (Guillet-Claude et al ., 2004), maysin synthesis (Szalma et al ., 2005) and kernel composition and starch production (Wilson et al ., 2004). In forest tree species, Eucalyptus was the first study carried out to find the relationship between microfibril angle and cinnamoyl CoA reductase (CCR) gene using SNP-association approach (Thumma et al ., 2005). The first association genetic study using multigenes in forest tree species, Pinus radiata was carried out using 58 SNPs from 20 wood and drought-related candidate genes (Gonzalez-Martinez et al ., 2007). Dillon et al . (2010) also examined the allelic variation that affects Pinus radiata wood properties using 36 cell wall candidate genes. Gene-associated SNPs found in cinnamate 4-hydroxylase (C4H) and cinnamyl alcohol dehydrogenase (CAD) genes from Acacia mangium also showed their significant relationship with wood density, specific gravity and cell wall thickness (Tchin et al ., 2011).

SNP is the most commonly used new generation molecular marker in breeding programme because it is considered as the most abundant resource for genetic variation s studies and are distributed throughout the genome (Halushka et al ., 1999 Kwok et al ., 1994). According to Wang et al . (1998), SNPs occurs once every 500-1000 bp when comparing two chromosomes. Other estimation also includes one SNP occurs every 100-300 bp in any genome (Gupta et al ., 2001). SNP is believed to have a relatively stable inheritance as compared to other marker systems. SNP with the low mutation rate makes it an excellent marker to study complex genetic traits, such as wood formation and as a tool for understanding genome evolution (Syvanen, 2001). SNP is also preferred to be used because it serves as a direct marker with the sequence information of the exact nature of the allelic variant is provided (Gupta et al ., 2001). Technological improvements have made SNP and indel as attractive markers for high-throughput marker-assisted breeding, EST-mapping and the integration of genetic and physical maps (Nasu et al ., 2002 Rafalski, 2002 Gonzalez-Martinez et al ., 2007 Ibitoye and Akin-Idowu, 2010 Thomson et al ., 2010).

Wood is made up of secondary xylem tissues and has a chemical complex of cellulose, lignin, hemicelluloses and extractives. The synthesis and development process of cell walls are very important in wood formation. Xyloglucan endotransglycosylase/hydrolase (XTH) and cellulose synthase (CesA) are two key proteins that play essential role in primary and secondary cell walls, respectively. CesA proteins catalyze cellulose polymerization (Somerville, 2006) and are believed to act as a central catalyst in the generation of plant cell wall biomass (Kumar et al ., 2009). XTH is considered as a key agent to regulate cell wall expansion and is believed to be responsible for the incorporation of newly synthesized xyloglucan (XG) into the wall matrix (Darley et al ., 2001).

The selection of high quality wood traits at the early stage is important to produce high quality woods at a low cost (starting modal), energy (labor) and land usage to maximize the commercial value. Association genetics is becoming one of the effective approaches to identify the association between variation in genetic and phenotype. Therefore, identification of gene-associated SNPs that correlated with wood properties (basic wood density) may benefits in marker-assisted breeding programme for plantation establishment. Hence, the objectives of this study were to identify SNPs in partial genomic DNA sequence of XTH and CesA genes and to identify the genetic association of XTH and CesA genes with basic wood density of N. cadamba .

Plant materials: A total of 15 kelampayan trees were selected randomly from the natural stands in Kota Samarahan, Sarawak, Malaysia in October 2011. The inner bark issues were collected and total genomic DNA was isolated by using a modified Doyle and Doyle (1990) protocol. The isolated DNA was purified by using the Wizard Genomic DNA Purification Kit (Promega, USA) according to manufacturer’s protocol.

PCR amplifications: Primer pair that includes the flanking of complete coding sequence in XTH genomic DNA was designed based on the full-length NcXTH1 cDNA (GenBank accession number: JX134619). For CesA gene, primer pair flanking partial genomic DNA that includes HVRII region was designed based on the complete coding sequence of NcCesA1 (GenBank accession number: JX134621). All the primers were designed using Primer Premier 5.0 software (PREMIER Biosoft International, USA) as shown in Table 1. PCR reactions were carried out in a total 25 μL reaction volume containing 1x PCR buffer, 0.2 mM dNTPs mix (Invitrogen, Brazil), 1.5 mM MgCl 2 , 10 pmol of forward and reverse primers each, 1.0 U Taq DNA polymerase (Invitrogen, Brazil) and 30 ng of DNA template. The thermal cycling profile was programmed at 94°C for 2 min, 35 cycles at 94°C for 30 sec, 57°C (XTH amplification) or 64°C (CesA amplification) for 45 sec, 72°C for 30 seconds and finally extension at 72°C for 7 min.

Cloning and sequencing: Amplicons were purified using QIAquick ® Gel Extraction Kit (QIAGEN, Germany) and ligated into pGEM ® -T Easy Vector System (Promega, USA). Clones with targeted inserts were identified by carrying out colony PCR using M13 universal primer pair. Plasmids of positive clones were isolated by using the Wizard Plus SV Minipreps DNA Purification System (Promega, USA). Purified plasmids were then sent for sequencing using 3730xl DNA Analyzer (Applied Biosystem, USA).

Sequence variation analysis: Base calling and vector sequences were removed using Chromas version 2.33 (Technelysium, AU). Each of the sequence was verified and checked for their homology using BLASTn provided by NCBI. The sequences were aligned using CLC Main Workbench version 5.0 software (CLC bio, Denmark) to identify single nucleotide polymorphisms. All sequence polymorphisms detected then were visually rechecked from chromatograms. Later, the open reading frame sequences for each gene were translated into amino acid sequences by using ExPASy translate tool (http://web.expasy.org/translate/). Protein sequence alignment was done using CLC Main Workbench version 5.0 (CLC bio, Denmark) to determine the synonymous and non-synonymous mutations.

Basic wood density measurement: Wood cores of 5 mm diameter and 8 cm long were sampled from the trunk of each selected N. cadamba trees using an increment borer at the height of about 1.3-1.4 m. A replicate was taken for each sample. Wood cores were kept in different labeled tubes placed on ice to maintain the constant humidity for measurement of green value. Green volume of each wood core was taken by measuring the water displaced in the wood core. According to Phytagora’s theorem, water has a density of 1. Therefore, the measured weight of displaced water is equal to the sample’s volume. Oven-dry weight was measured from the same sample by drying it in a well ventilated oven at 103°C until it achieves constant weight (Chave, 2005). Basic wood density was calculated as oven dry weight per green volume.

Nucleotide diversity and association genetics analysis: DNA Sequence Polymorphism version 5 (DnaSP v.5.0) software (Librado and Rozas, 2009) was used to estimate the average number of nucleotide diversity per site (π), θ values per site, synonymous and non-synonymous nucleotide polymorphisms and Tajima’s D test. Statistical test carried out by DnaSP do not include the insertion and deletion site mutations. TASSEL version 3.0 (Bradbury et al ., 2007) was used to calculate the R 2 value to determine the correlation of SNPs. Linkage disequilibrium (LD) graph was plotted using R 2 values against pairwise polymorphic nucleotide distances. General Linear Model (GLM) was also tested using 1,000 permutation test for each SNPs with respect to basic wood density. SNPs that scored at p-value lower than 0.05 considered to be significantly associated with its trait.

SNP discovery: A total of 34 SNPs were found in NcXTH1 genes (

1,290 bp) which corresponds to one SNP in every 38 bp (2.65% occurrence). Similar levels of SNP frequencies were found in other studies. Tchin et al . (2012) showed that one SNP occurs in every 59 bp in C4H gene in N. cadamba . Krutovsky and Neale (2005) also found that one SNP in every 46 bp in 18 Douglas fir wood quality candidate genes. Another study that used 35 cell wall compound genes recorded one SNP in every 54 bp (Kelleher et al ., 2012). The large difference of the SNP occurrence might due to the differences in species and number of samples used. For instance, in a study by Dillon et al . (2010), one SNP was recorded in every 466 bp of XTH gene with only five samples of Pinus radiate .

Out of the 34 polymorphic sites, 20 substitution SNPs and 3 indels were located in the coding regions (exons) and another 9 substitution SNPs as well as 2 indels were located in the non-coding regions (introns and untranslated regions). In this study, more sequence polymorphism was found in the coding sequence (856 bp) because the numbers of sites studied were longer than the non-coding region (434 bp). However, the SNP frequency in non-coding region was slightly higher (39 bp/SNP) than in coding region (37 bp/SNP). Overall, 24 out of 34 (71%) of the SNPs in NcXTH1 were singletons (locus with only one sequence showing a mutated nucleotide) which most of them were located in exons (Table 2). There were only 5 parsimony informative SNPs (locus with two or more sequences owned a nucleotide variant) found in the full-length DNA sequence of NcXTH1.

Full-length NcXTH1 genomic DNA was found to have an equal number of synonymous and non-synonymous SNPs with the proportion of 0.060 each. Although most of the non-synonymous SNPs were found in the coding region, the occurrence of non-synonymous SNPs was highest in non-coding region (1.84%). Nucleotide diversity in NcXTH1 was parallel with the findings of Krutovsky and Neale (2005), where most of the SNPs were singletons, synonymous and found mostly in the non-coding region in 18 wood quality-related candidate genes in Douglas fir trees. Most of the SNPs were caused by transitional mutation (71%), especially the A-G substitution (Table 2). A study by Tchin et al . (2012) also shows a higher frequency in transition mutations (55%) in another two wood formation candidate genes, cinnamate 4-hydroxylase (C4H) and cinnamyl alcohol dehydrogenase (CAD) in N. cadamba . Tchin et al . (2011) also found that A-G transition mutations were frequently found (83%) in partial DNA sequences of CAD in A. mangium superbulk trees.

Frequency of SNP in partial DNA sequences of NcCesA1 gene was lower than in NcXTH1 with one SNP detected in every 259 bp. SNP occurrence of CesA gene in N. cadamba was predicted to be lower than SNP found in CesA1 of Shorea parvifolia , where one SNP was detected in every 115 bp of DNA sequences with only the used of five samples (Seng et al ., 2011). On the other hand, NcCesA1 showed higher SNP frequency than in CesA genes of Pinus radiate . Dillon et al . (2010) reported that one SNP was detected in every 403 and 374 bp DNA sequences of CesA1 and CesA3, respectively, in Pinus radiate . In this study, very low SNP frequency was detected in the coding region, with only 1 SNP detected in every 531 bp.

Table 3 shows the summary of SNP distribution in partial NcCesA1 DNA sequences (778 bp). A total of three SNPs found in NcCesA1 were caused by substitution mutation and no Indel was detected. Two out of three SNPs located in non-coding regions. In reverse with NcXTH1, more transversion SNPs found in NcCesA1 with one G-C and G-T substitution detected each. All of the three SNPs found in NcCesA1 partial DNA sequences were synonymous, with higher frequency in the non-coding region (0.80%). Meanwhile, only one singleton SNP was found in the exon of NcCesA1 while two parsimony informative SNPs found in the non-coding regions.

Nucleotide diversity analysis: Sequence statistical analysis of NcXTH1 and NcCesA1 using DnaSP v.5.0 software (Librado and Rozas, 2009) was examined on all the sequences (sites) excluding gaps and missing data. All of the calculation and analysis did not include Indels. Alignment of 15 NcXTH1 full-length DNA sequences contained 13 sites with alignment gaps and thus, 1,277 out of 1,290 sites of NcXTH1 full-length DNA sequences were examined. In NcCesA1 partial DNA sequences alignment, no gap was found and all of the 778 sites were included in the analysis. Nucleotide diversity, Tajima’s test and minimum number of recombination events of NcXTH1 and NcCesA1 are summarized in Table 4.

Nucleotide diversity (π) is the average number of nucleotide differences per site between two sequences and the population mutation parameter is represented by θ. Nucleotide diversity of 15 DNA sequences of NcXTH1 was estimated to be higher than NcCesA1 at 0.00402 and 0.00127, respectively. Overall, all of the θ values per site calculated from number of nucleotide diversity (π), total number of mutation (n), or polymorphic sites (S) in NcXTH1 were much more higher than in NcCesA1 (Table 4). The θ value per sequence estimated from S(θ w ) of NcXTH1 (8.919) was almost 10-fold higher than in NcCesA1 (0.9226). This was caused by the higher frequency of SNP in NcXTH1 than in NcCesA1. θ w value of both genes examined in N. cadamba were predicted to be much more higher than other wood-related candidate genes that scored θ w lower than 0.02 (Brown et al ., 2004 Krutovsky and Neale, 2005).

Tajima’s D statistic reflects the difference between nucleotide diversity, π and theta (θ) per sequence from number of polymorphic sites (S), θ w . The value of D is expected close to zero at equilibrium between genetic drift and selectively neutral mutation (Brown et al ., 2004). In NcXTH1, almost all of the SNPs showed significant Tajima’s D value (less than 0.05), except those at silent (synonymous and non-coding) sites. The negative values (-1.667 to -2.008) indicate an excess of low frequency variants are segregating in NcXTH1 gene. Similarly, most of the Tajima’s D values were negative in wood quality-related candidate genes in Pseudotsuga menziesii (Krutovsky and Neale, 2005) and in Pinus taeda (Brown et al ., 2004). In contrast to NcXTH1, all of the Tajima’s D values tested in NcCesA1 failed to show significant results with an overall positive value of 0.2187.

Linkage disequilibrium (LD): A considerable amount of Linkage Disequilibrium (LD) was found within NcXTH1 and NcCesA1 sequences using TASSEL software (Bradbury et al ., 2007). A total of 496 pairwise comparisons were estimated from the examined sites. More than 80% of the paired sites showed LD statistically significant by a two-sided Fisher’s Exact test. R 2 values were calculated which represents the correlation between alleles at two loci and is informative for evaluating the resolution of association approaches (Bradbury et al ., 2007). In this study, the mean R 2 value for 496 estimates of LD was 0.000687. Average R 2 value of N. cadamba was far lower than in another timber species balsam poplar (0.52 Olson et al ., 2010) and in soybean (0.36 Zhu et al ., 2003).

Averaged across all loci, appreciable decay of LD to the values <0.10 within 350 bp was detected when R 2 was plotted against distance between sites in base pairs for NcXTH1 and NcCesA1 sequenced gene fragments (Fig. 1). As compared to other species, N. cadamba do not show significant decay when only two genes were used in LD estimation. Nucleotide studies on other species showed that LD decayed more than 50% including loblolly pine (Brown et al ., 2004), soybean (Zhu et al ., 2003) and Douglas fir tree (Krutovsky and Neale, 2005). According to a study by Nordborg (2000), LD decays more rapidly in outcrossing species as compared to selfing species because recombination is more effective in outcrossing species.

Basic wood density: The diameter at breast height (dbh) of N. cadamba trees was in the range of 24.0 cm to 102.0 cm with an average of 39.2 cm (Table 5). Most of the N. cadamba trees are having the diameter between 24.0 to 50.0 cm, except for sample NcMT1 (102.0 cm). The green weight, oven dry weight and green volume of each replicate were taken and recorded to calculate the mean basic wood density. On average, the basic wood density of N. cadamba was 368.93 kg m -3 . Wood sample NcMT1 recorded the highest basic density (444.03 kg m -3 ) while NcHT4 recorded the lowest (300.25 kg m -3 ) among 15 samples.

Basic density was determined because it is one of the most important criteria in breeding programmes to evaluate wood potential for pulp, paper or wood industries (Seca and Domingues, 2006).

Association genetics study: Out of 34 SNPs detected in NcXTH1, two of them (SNP8 and SNP29) showed significant association (p<0.05) with the basic wood density at position 118 and 1,173 bp, respectively (Fig. 2). Both markers were located in exons and caused by synonymous T-C transition mutation. None of the non-synonymous SNP was found to be significantly associated with the basic density. The associated-SNP8 and-SNP29 were found to occur in the individual tree NcMT1 that had the highest wood density (444.03 kg m -3 ). A similar study also had been carried out on the association of SNP markers in another two wood candidate genes (C4H and CAD) with the basic wood density (Tchin et al ., 2011, 2012). In this study, two SNPs were also found to be significantly associated with the basic wood density of N. cadamba and A. mangium . This supports the findings that SNP8 and SNP29 could potentially be developed into SNP markers for superior N. cadamba tree selection once validated in the future.

Previous studies have shown that there is a strong relationship between wood quality-related candidate genes with the wood properties (Plomion et al ., 2000 Wegrzyn et al ., 2010 Tchin et al ., 2011, 2012). Plomion et al . (2000) had shown that the growth of maritime pines was affected by the biochemical content, such as lignin and cellulose content, where these chemical contents were controlled by its genetics. Association genetics of traits controlling lignin and cellulose biosynthesis in P. trichocarpa also have been demonstrated by the rapid decay of within-genes (SuSy1 and C4H1) LD and the high coverage of amplicons across each gene (Wegrzyn et al ., 2010). These data reflected that numerous identified polymorphisms are in close proximity to the causative SNPs.

In conclusion, this study has demonstrated that association genetics is a powerful tool to discover naturally occurring allelic variations in genes associated with economically important traits by harnessing the genetic variation at the population level.

The gene-associated SNP identified in XTH gene of N. cadamba could be potentially used as a tool in Gene-Assisted Selection (GAS) of N. cadamba after validated in breeding programmes to produce quality planting materials that are of faster growth, high-yield and high wood quality and also adapted to local conditions, so that we may achieve economic benefits of great significance.

The authors would like to thank all the lab assistants and foresters involved in this research project for their excellent field assistance in sample collection. This study is part of the joint Industry-University Partnership Programme, a research programme funded by the Sarawak Forestry Corporation and Universiti Malaysia Sarawak (Grant No. 02(DPI09)832/2012(1)), RACE/a(2)884/20121(02) and GL(F07)/06/2013/STA-UNIMAS(06)).

2: Brown, G.R., G.P. Gill, R.J. Kuntz, C.H. Langley and D.B. Neale, 2004. Nucleotide diversity and linkage disequilibrium in loblolly pine. Proc. Natl. Acad. Sci. USA., 101: 15255-15260.

3: Bussa, S.K. and J. Pinnapareddy, 2010. Antidiabetic activity of stem bark of Neolamarckiacadamba in alloxan induced diabetic rats. Int. J. Pharm. Technol., 2: 314-324.

4: Chave, J., 2005. Measuring wood density for tropical forest trees: A field manual for the CTFS sites. Lab. Evolution et Diversite Biologique, University Paul Sabatier, France, February 16, 2005, pp: 1-7.

5: Darley, C.P., A.M. Forrester and S.J. McQueen-Mason, 2001. The molecular basis of plant cell wall extension. Plant Molecular Biol., 47: 179-195.
CrossRef |

6: Dillon, S. K., M. Nolan, W. Li, C. Bell, H. X. Wu and S. G. Southerton, 2010. Allelic variation in cell wall candidate genes affecting solid wood properties in natural populations and land races of Pinus radiate. Genetics, 185: 1477-1487.
CrossRef | PubMed | Direct Link |

7: Doyle, J.J. and J.L. Doyle, 1990. Isolation of plant DNA from fresh tissue. Focus, 12: 13-15.

8: Gonzalez-Martinez, S.C., N.C. Wheeler, E. Ersoz, C.D. Nelson and D.B. Neale, 2007. Association genetics in Pinus taeda L. I. wood property traits. Genetics, 175: 399-409.
CrossRef | PubMed |

9: Guillet-Claude, C., C. Birolleau-Touchard, D. Manicacci, P. M. Rogowsky and J. Rigau et al., 2004. Nucleotide diversity of the ZmPox3 maize peroxidase gene: Relationships between a MITE insertion in exon 2 and the variation in forage maize digestibility. BMC Genetics, 10.1186/1471-2156-5-19

10: Gupta, P.K., J.K. Roy and M. Prasad, 2001. Single nucleotide polymorphisms: A new paradigm for molecular marker technology and DNA polymorphism detection with emphasis on their use in plants. Curr. Sci., 80: 524-535.

11: Halushka, M.K., J.B. Fan, K. Bentley, L. Hsie and N. Shen et al., 1999. Patterns of single-nucleotide polymorphisms in candidate genes for blood-pressure homeostasis. Nat. Genet., 22: 239-247.

12: Ho, W.S., S.L. Pang, P.S. Lai, S.Y. Tiong and S.L. Phui et al ., 2010. Genomics studies on plantation tree species in Sarawak. Proceedings of the International Symposium on Forestry and Forest Products 2010: Addressing the Global Concerns and Changing Societal Needs, October 5-7, 2010, Kuala Lumpur, Malaysia, pp: 172-182.

13: Seng, H.W., P.S. Ling, P. Lau and I. Jusoh, 2011. Sequence variation in the Cellulose synthase (SpCesA1) gene from Shorea parvifolia ssp. parvifolia mother trees. Pertanika J. Trop. Agric. Sci., 34: 317-323.

14: Ibitoye, D.O. and P.E. Akin-Idowu, 2010. Marker-Assisted-Selection (MAS): A fast track to increase genetic gain in horticultural crop breeding. Afr. J. Biotechnol., 9: 8889-8895.

15: Joker, D., 2000. Neolamarckia cadamba (Roxb.) Bosser ( Anthocephalus chinensis (Lam.) A. Rich. ex Walp.). Seed Leaflet No. 17, September 2000, Danida Forest Seed Centre, Denmark, pp: 1-2.

16: Kelleher, C.T., J. Wilkin, J. Zhuang, A.J. Cortes and A.L.P. Quintero et al., 2012. SNP discovery, gene diversity and linkage disequilibrium in wild populations of Populus tremuloides. Tree Genet. Genomes, 8: 821-829.
CrossRef |

17: Krutovsky, K.V. and D.B. Neale, 2005. Nucleotide diversity and linkage disequilibrium in cold-hardiness and wood quality-related candidate genes in Douglas fir. Genetics, 171: 2029-2041.
CrossRef |

18: Kumar, M., S. Thammannagowda, V. Bulone, V. Chiang and K.H. Han et al., 2009. An update on the nomenclature for the cellulose synthase genes in Populus. Trends Plant Sci., 14: 248-254.
CrossRef | PubMed | Direct Link |

19: Kwok, P.Y., C. Carlson, T.D. Yager, W. Ankener and D.A. Nickerson, 1994. Comparative analysis of human DNA variations by fluorescence-based sequencing of PCR products. Genomics, 23: 138-144.
CrossRef | PubMed | Direct Link |

20: Lau, E.T., W.S. Ho and A. Julaihi, 2009. Molecular cloning of cellulose synthase gene, SpCesA1 from developing xylem of Shorea parvifolia spp. Parvifolia. Biotechnology, 8: 416-424.

21: Librado, P. and J. Rozas, 2009. DnaSP v5: A software for comprehensive analysis of DNA polymorphism data. Bioinformatics, 25: 1451-1452.
CrossRef | PubMed | Direct Link |

22: Mondal, S., G.K. Dash and S. Acharyya, 2009. Analgesic, anti-inflammatory and antipyretic studies of Neolamarckia cadamba barks. J. Pharm. Res., 2: 1133-1136.

23: Narayanan, C., S.A. Wali, R. Shukla, R. Kumar, A.K. Mandal and S.A. Ansari, 2007. RAPD and ISSR markers for molecular charazterization of teak (Tectona grandis) plus trees. J. Trop.Sci., 19: 218-225.

24: Nasu, S., J. Suzuki, R. Ohta, K. Hasegawa and R. Yuia et al., 2002. Search for and analysis if single nucleotide polymorphisms (SNPs) in rice (Oryza sativa, Oryzarufipogon) and establishment of SNP markers. DNA Res., 9: 163-171.
CrossRef | PubMed | Direct Link |

25: Nordborg, M., 2000. Linkage disequilibrium, gene trees and selfing: An ancestral recombination graph with partial self-fertilization. Genetics, 154: 923-929.

26: Olson, M.S., A.L. Robertson, N.Takebayashi, S. Silim, W.R. Schroeder and P. Tiffin, 2010. Nucleotide diversity and linkage disequilibrium in balsam poplar (Populus balsamifera). New Phytol., 186: 526-536.

27: Patel, D. and V. Kumar, 2008. Pharmacognostical studies of Neolamarckia cadamba (Roxb.) Bosser leaf. Int. J. Green Pharm., 2: 26-27.
CrossRef |

28: Plomion, C., C. Pionneau, J. Brach, P. Costa and H. Bailleres, 2000. Compression wood-responsive proteins in developing xylem of maritime pine (Pinus pinaster Ait.). Plant Physiol., 123: 959-969.
PubMed |

29: Rafalski, A., 2002. Applications of single nucleotide polymorphisms in crop genetics. Curr. Opin. Plant Biol., 5: 94-100.

30: Seca, A.M.L. and F.M.J. Domingues, 2006. Basic density and pulp yield relationship with some chemical parameters in eucalyptus trees. Pesquisa Agropecuaria Bras., 41: 1687-1691.

31: Somerville, C., 2006. Cellulose synthesis in higher plants. Annu. Rev. Cell Dev. Biol., 22: 53-78.

## Acknowledgments

We thank Stephen Keller, Maren Friesen, and Sergey Nuzhdin for discussions, Jean-Marie Prosperi and Magalie Delalande for the development and management of the M. truncatula germplasm collection (seeds available at http://www1.montpellier.inra.fr/BRC-MTR/), and Thierry Huguet and M. El Arbi for development of some M. truncatula germplasm. This work was carried out in part using computing resources at the University of Minnesota Supercomputing Institute and was funded by National Science Foundation Grant 0820005.

## Materials and Methods

### Samples and sequencing

Leaf samples were collected from 24 genotypes of P. tremula and 24 genotypes of P. tremuloides (Table S1). Genomic DNA was extracted from leaf samples, and paired-end sequencing libraries with insert sizes of 650 bp were constructed for all genotypes. Whole-genome sequencing with a minimum expected depth of 20× was performed on the Illumina HiSequation 2000 platform at the Science for Life Laboratory, Stockholm, and 2× 100-bp paired-end reads were generated for all genotypes. Two samples of P. tremuloides failed to yield the expected coverage and were therefore removed from subsequent analyses. We obtained publicly available short-read Illumina data of 24 P. trichocarpa individuals from the National Center for Biotechnology Information (NCBI) Short Read Archive (SRA) (Table S1). Individuals were selected to have a similar read depth as the samples of the two aspen species. The accession numbers of P. trichocarpa samples can be found in Evans et al. (2014). All analyses are thus based on data from 24 P. tremula, 22 P. tremuloides, and 24 P. trichocarpa genotypes.

Prior to read alignment, we used Trimmomatic ( Lohse et al. 2012) to remove adapter sequences from reads. Since the quality of reads always drops toward the end of reads, we used Trimmomatic to cut off bases from the start and/or end of reads when the quality values were <20. If the length of the processed reads was reduced to <36 bases after trimming, reads were completely discarded. FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) was used to check and compare the per-base sequence quality between the raw sequence data and the filtered data. After quality control, all paired-end and orphaned single-end reads from each sample were mapped to the P. trichocarpa version 3 (v3.0) genome ( Tuskan et al. 2006) using BWA-MEM with default parameters in bwa-0.7.10 ( Li 2013).

### Single nucleotide polymorphism and genotype calling

We implemented two complementary bioinformatics approaches: First, many studies have pointed out the bias inherent in population genetic estimates using genotype calling approach from NGS data ( Nielsen et al. 2011 Nevado et al. 2014). Single- or multiple-sample genotype calling can result in a bias in the estimation of site frequency spectrum (SFS), as the former usually leads to overestimation of rare variants, whereas the latter often leads to the opposite ( Nielsen et al. 2011). Therefore, in this study we employed a method implemented in the software package—Analysis of Next-Generation Sequencing Data (ANGSD v0.602) ( Korneliussen et al. 2014)—to estimate the SFS and all population genetic statistics derived from the SFS without calling genotypes. Second, for those analyses that require accurate single nucleotide polymorphism (SNP) and genotype calls, we performed SNP calling with HaplotypeCaller of the GATK v3.2.2 ( DePristo et al. 2011), which called SNPs and indels simultaneously via local reassembly of haplotypes for each individual and created single-sample genomic VCFs (gVCFs). gVCFs in GATK were then used to merge multi-sample records together, correct genotype likelihoods, re-genotype the newly merged record, and perform re-annotation. The following filtering steps were then used to reduce the number of false-positive SNPs and retain high-quality SNPs: (1) We removed all SNPs that overlapped with sites excluded by all previous filtering criteria. (2) We retained only biallelic SNPs with a distance of >5 bp away from any indels. (3) We treated genotypes with a genotype quality score (GQ) <10 as missing and then removed those SNPs with a genotype missing rate >20%. (4) We removed SNPs that showed significant deviation from Hardy–Weinberg Equilibrium (P < 0.001). After all filtering, 8,502,169 SNPs were detected among the three Populus species and were used in downstream analyses.

### Population structure

We used fourfold synonymous SNPs with minor allele frequency >0.1 to perform population structure analyses with ADMIXTURE ( Alexander et al. 2009). We ran ADMIXTURE on all the sampled individuals among species and on the samples within each species separately. The number of genetic clusters (K) varied from 1 to 6. The most likely number of genetic clusters was selected by minimizing the cross-validation error in ADMIXTURE.

### Diversity and divergence: related summary statistics

For nucleotide diversity and divergence estimates, only the reads with mapping quality >30 and the bases with a quality score >20 were used in all downstream analyses with ANGSD ( Korneliussen et al. 2014). First, we used the -doSaf implementation in ANGSD to calculate the site-allele-frequency likelihood based on the SAMTools genotype likelihood model ( Li et al. 2009). Then, we used the -realSFS implementation in ANGSD to obtain an optimized folded global SFS using the expectation maximization algorithm for each species. Based on the global SFS, we used the -doThetas function in ANGSD to estimate the per-site nucleotide diversity from posterior probability of allele frequency based on a maximum-likelihood approach ( Kim et al. 2011). Two standard estimates of nucleotide diversity, the average pairwise nucleotide diversity (Θπ) ( Tajima 1989) and the proportion of segregating sites (ΘW) ( Watterson 1975), and one neutrality statistic test, Tajima’s D ( Tajima 1989), were summarized along all 19 chromosomes using nonoverlapping sliding windows of 100 kilobases (kbp) and 1 megabases (Mb). Windows with <10% of covered sites left from previous quality filtering steps were excluded. In the end, 3340 100-kbp and 343 1-Mb windows, with an average of 50,538 and 455,910 covered bases per window, respectively, were included.

All these statistics were also calculated for each type of functional element (0-fold nonsynonymous, fourfold synonymous, intron, 3′ UTR, 5′ UTR, and intergenic sites) over nonoverlapping 100-kbp and 1-Mb windows in all three Populus species. The category of gene models followed the gene annotation of P. trichocarpa version 3.0 ( Tuskan et al. 2006). For protein-coding genes, we included only genes with at least 90% of covered sites left from previous filtering steps to ensure that the three species have the same gene structures. For regions overlapped by different transcripts in each gene, we classified each site according to the following hierarchy (from highest to lowest): coding regions (CDS), 3′ UTR, 5′ UTR, and intron. Thus, if a site resides in a 3′ UTR in one transcript and in a CDS for another, the site was classified as CDS. We used the transcript with the highest content of protein-coding sites to categorize synonymous and nonsynonymous sites within each gene. A total of 16.52, 3.4, 7.19, 4.02, 31.89, and 73.46 Mb was partitioned into 0-fold nonsynonymous (where all DNA sequence changes lead to protein sequence changes), fourfold synonymous (where all DNA sequence changes lead to the same protein sequences), 3′ UTR, 5′ UTR, intron, and intergenic categories.

### Linkage disequilibrium and population-scaled recombination rate

A total of 1,409,377 SNPs, 1,263,661 SNPs, and 710,332 SNPs with minor allele frequency >10% were used for the analysis of linkage disequilibrium (LD) and population-scaled recombination rate (ρ) in P. tremula, P. tremuloides, and P. trichocarpa, respectively. To estimate and compare the rate of LD decay among the three Populus species, we first used PLINK 1.9 ( Purcell et al. 2007) to randomly thin the number of SNPs to 100,000 in each species. We then calculated the squared correlation coefficients (r 2 ) between all pairs of SNPs that were within a distance of 50 kbp using PLINK 1.9. The decay of LD against physical distance was estimated using nonlinear regression of pairwise r 2 vs. the physical distance between sites in base pairs ( Remington et al. 2001).

We estimated the population-scaled recombination rate ρ using the Interval program of LDhat 2.2 ( McVean et al. 2004) with 1,000,000 MCMC iterations sampling every 2000 iterations and a block penalty parameter of 5. The first 100,000 iterations of the MCMC iterations were discarded as a burn-in. We then calculated the scaled value of ρ in each 100-kbp and 1-Mb window by averaging over all SNPs in that window. Only windows with >10,000 (in 100-kbp windows) and 100,000 sites (in 1-Mb windows) and 100 SNPs left from previous filtering steps were used for the estimation of ρ.

### Estimating the distribution of fitness effects of new amino acid mutations and the proportion of adaptive amino acid substitutions

We generated the folded SFS in each species for a class of selected sites (0-fold nonsynonymous sites) and a class of putatively neutral reference sites (fourfold synonymous sites) from SNP data using a custom Perl script. We employed a maximum-likelihood (ML) approach as implemented in the program distribution of fitness effects (DFE)-α ( Keightley and Eyre-Walker 2007 Eyre-Walker and Keightley 2009) to fit a demographic model with a step of population size change to the neutral SFS. Fitness effects of new deleterious mutations at the selected site class were sampled from a gamma distribution after incorporating the estimated parameters for the demographic model. This method assumes that fitness effects of new mutations at neutral sites are zero and unconditionally deleterious at selected sites since it assumes that advantageous mutations are too rare to contribute to polymorphism ( Keightley and Eyre-Walker 2007). We report the proportion of amino acid mutations falling into different effective strengths of selection (Nes) range: 0–1, 1–10, and >10, respectively.

From the estimated DFE, the proportion of adaptive amino acid substitutions (α) and the relative rate (ω) of adaptive substitution at 0-fold nonsynonymous sites were estimated using the method of Eyre-Walker and Keightley (2009). This method explicitly accounts for past changes in population size and the presence of slightly deleterious mutations. Among the total of 8,502,169 SNPs detected by GATK, on average <1% were shared between either of the two aspen species and P. trichocarpa (Figure S2). We therefore used the aspen species and P. trichocarpa as each other’s outgroup species to calculate between-species nucleotide divergence at fourfold synonymous and 0-fold nonsynonymous sites since it is unlikely to be influenced by shared ancestral polymorphisms. Jukes–Cantor multiple hits correction was applied to the divergence estimates ( Jukes and Cantor 1969). For the parameters of Nes, α, and ω, we generated 200 bootstrap replicates by resampling randomly across all SNPs in each site class using R ( R Develpment Core Team 2014). We excluded the top and bottom 2.5% of bootstrap replicates and used the remainder to represent the 95% confidence intervals for each parameter.

### Genomic correlates of diversity

To examine the factors influencing levels of neutral polymorphism in all three Populus species, we first assumed that the fourfold synonymous sites in genic regions were selectively neutral as every possible mutation at fourfold degenerate sites is synonymous. In the following we refer to the pairwise nucleotide diversity at fourfold synonymous sites (θfourfold) as “neutral polymorphism.” As a comparison to genic region, we also estimated levels of nucleotide diversity at intergenic sites (θIntergenic). Then we tabulated several other genomic features within each 100-kbp and 1-Mb window that may correlate with patterns of polymorphism. First, we summarized population-scaled recombination rate (ρ) as described above for each species. Second, we tabulated GC content as the fraction of bases where the reference sequence (P. trichocarpa v3.0) was a G or a C. Third, we measured the gene density as the number of functional genes within each window according to the gene annotation of P. trichocarpa version 3.0. Any portion of a gene that fell within a window was counted as a full gene. Fourth, we accounted for the variation of mutation rate by calculating the number of fixed differences per neutral site (either fourfold synonymous sites or intergenic sites) between aspen and P. trichocarpa within each window, which was performed in the ngsTools ( Fumagalli et al. 2014). The reason why we used divergence between aspen and P. trichocarpa to measure mutation rate is because they are distantly related ( Wang et al. 2014) and the estimate of divergence is unlikely to be influenced by shared ancestral polymorphisms between species as shown above. Fifth, we tabulated the number of covered bases in each window as those left from all previous filtering criteria.

We used Spearman’s rank-order correlation test to examine pairwise correlations between the variables of interest. To account for the autocorrelation between variables, we further calculated partial correlations between the variables of interest by removing the confounding effects of other variables ( Kim and Soojin 2007). All statistical tests were performed using R version 3.2.0 unless stated otherwise.

### Data availability

All newly generated Illumina reads of 24 P. tremula and 22 P. tremuloides from this study have been submitted to the SRA at NCBI. All accession numbers can be found in Table S1.

## Abstract

Recent developments have led to an enormous increase of publicly available large genomic data, including complete genomes. The 1000 Genomes Project was a major contributor, releasing the results of sequencing a large number of individual genomes, and allowing for a myriad of large scale studies on human genetic variation. However, the tools currently available are insufficient when the goal concerns some analyses of data sets encompassing more than hundreds of base pairs and when considering haplotype sequences of single nucleotide polymorphisms (SNPs). Here, we present a new and potent tool to deal with large data sets allowing the computation of a variety of summary statistics of population genetic data, increasing the speed of data analysis.

Citation: Soares I, Moleirinho A, Oliveira GNP, Amorim A (2015) DivStat: A User-Friendly Tool for Single Nucleotide Polymorphism Analysis of Genomic Diversity. PLoS ONE 10(3): e0119851. https://doi.org/10.1371/journal.pone.0119851

Academic Editor: Swarup Kumar Parida, National Institute of Plant Genome Research (NIPGR), INDIA

Received: June 19, 2014 Accepted: January 18, 2015 Published: March 10, 2015

Copyright: © 2015 Soares et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited

Data Availability: DivStat software is freely available to download for MACINTOSH, UNIX and WINDOWS systems at the websites https://www.mediafire.com/folder/za5wjlcoc5oi1/DivStat and http://www.portugene.com/DivStat.html, accompanied by tutorials, example files and installation details.

Funding: This work was supported by the Portuguese Foundation for Science and Technology (FCT) fellowship (SFRH/BD/73508/2010) to A. M. IPATIMUP is an Associate Laboratory of the Portuguese Ministry of Science, Technology and Higher Education and is partly supported by FCT. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

## Nucleotide diversity in starch synthase IIa and validation of single nucleotide polymorphisms in relation to starch gelatinization temperature and other physicochemical properties in rice (Oryza sativa L.)

The characteristics of starch, such as gelatinization temperature (GT), apparent amylose content (AAC), pasting temperature (PT) and other physicochemical properties, determine the quality of various products of rice, e.g., eating, cooking and processing qualities. The GT of rice flour is controlled by the alk locus, which has been co-mapped to the starch synthase IIa (SSIIa) locus. In this study, we sequenced a 2,051 bp DNA fragment spanning part of intron 6, exon 7, intron 7, exon 8 and part of 3′ untranslated region of SSIIa for 30 rice varieties with diverse geographical distribution and variation in starch physicochemical properties. A total of 24 single nucleotide polymorphisms (SNPs) and one insertion/deletion (InDel) were identified, which could be classified into nine haplotypes. The mean pairwise nucleotide diversity π was 0.00292, and Watterson’s estimator θ was 0.00296 in this collection of rice germplasm. Tajima’s D test for selection showed no significant deviation from the neutral expectation (D = − 0.04612, P > 0.10). However, significant associations were found between seven of the SNPs and peak GT (T p) at P < 0.05, of which two contiguous SNPs (GC/TT) showed a very strong association with T p (P < 0.0001). With some rare exception, this GC/TT polymorphism alone can differentiate rice varieties with high or intermediate GT (possessing the GC allele) from those with low GT (possessing the TT allele). In contrast, none of these SNPs or InDel was significantly associated with amylose content. A further 509 rice varieties with known physicochemical properties (e.g., AAC and PT) and known alleles of other starch synthesizing genes were genotyped for the SSIIa GC/TT alleles. Association analysis indicated that 82% of the total variation of AAC in these samples could be explained by a (CT)n simple sequence repeat (SSR) and a G/T SNP of Waxy gene (Wx), and 62.4% of the total variation of PT could be explained by the GC/TT polymorphism. An additional association analysis was performed between these molecular markers and the thermal and retrogradation properties for a subset of 245 samples from the 509 rice varieties. The SSIIa GC/TT polymorphism explained more than 60% of the total variation in thermal properties, whereas the SSR and SNP of Wx gene explained as much as the SSIIa GC/TT of the total variation in retrogradation properties. Our study provides further support for the utilization of the GC/TT polymorphism in SSIIa. As shown in our study of 509 rice varieties, the GC/TT SNP could differentiate rice with high or intermediate GT from those with low GT in about 90% of cases. Using four primers in a single PCR reaction, the GC/TT polymorphism can be surveyed on a large scale. Thus, this SNP polymorphism can be very useful in marker-assisted selection for the improvement of GT and other physicochemical properties of rice.

This is a preview of subscription content, access via your institution.

## Introduction

Crop domestication is a complex process mediated by a series of phenotypic changes to improve cultivation, harvesting, and consumption. Rice (Oryza sativa L.) is one of the earliest domesticated crop species, and the genetic diversity of cultivated rice has been reduced by up to 80% from that of the wild ancestor during the domestication and artificial selection processes [1]. The most extreme loss of diversity is found in modern high-yielding rice varieties, and this has serious consequences for disease susceptibility and adaptation to changing environments [2]. By contrast, rice landraces, which originated and evolved in the field over millennia via selective breeding by farmers, have retained genetic variation [3]. This variation has important implications in rice breeding by providing new genes/alleles for crop improvement. However, with the development of modern agriculture, a large number of local landraces have been replaced with modern varieties introduced over the past 40 years [4]. In China, rice landraces are no longer planted in most provinces, with the exception of some ethnic minority regions, such as areas of Yunnan and Guizhou Provinces.

Selection by various ethnic groups inhabiting areas of different altitudes and climatic conditions and with different cultivation methods, cultures, and traditions has contributed to rice crop diversity in Yunnan. Accordingly, Yunnan is one of the largest centers of genetic diversity for rice worldwide [5–7]. Rice landraces in Yunnan are widely distributed in a region from 21°8′ 32′′ N to 29°11′ 18′′ N and 97°31′ 39′′ E to 106°11′ 47′′ E, and are planted at various altitudes and under diverse climatic conditions [7]. Landraces grow from altitudes of approximately 76 m in Hekou county, Honghe prefecture in the southeastern part of Yunnan to 2,700 m in Weixi county, Diqing prefecture [7]. The wide distribution of rice landraces provides an excellent opportunity for studies of genetic structure and differentiation patterns of rice landraces along the altitudinal range as well as the role of altitude in shaping population genetic structure. Several studies have examined genetic differentiation and the distribution of rice landraces along altitudinal gradients in Yunnan based on phenotype traits or insertion/deletion (indel) molecular markers [8–10]. According to a previous study, both indica and japonica rice varieties are cultivated in Yunnan [6], and the distribution of rice landraces can be artificially categorized into three rice cultivation regions as follows: (1) the indica belt at altitudes of below 1,400 m, (2) the mixed indica and japonica belt at altitudes of between 1,400 and 1,600 m, and (3) the japonica belt at altitudes of above 1,600 m [8]. However, little is known about how altitudinal variation affects population genetic structure. Furthermore, it is not clear whether rice landraces in Yunnan exhibit isolation by altitude. These questions need to be explored using DNA data.

Among various molecular marker types, simple sequence repeat (SSR) markers are commonly used to estimate genetic diversity, population structure, and differentiation in numerous plant species [11]. Furthermore, with the development of DNA sequencing technology, multi-locus DNA sequences have been successfully used to estimate genetic diversity and for phylogenetic analyses [12–15]. DNA sequence differences directly reflect genetic differences accordingly, sequence comparison is an ideal method for revealing genetic diversity and differentiation.

We collected diverse rice landraces from Yunnan, China along a wide range of altitudes and analyzed gene sequences and SSR markers. Our objectives were to examine the genetic structure of rice landraces and to assess the role of isolation by altitude in genetic differentiation.

## Discussion

### Genetic diversity and population demography in P. krempfii

Our analysis revealed extremely low levels of nucleotide polymorphism in P. krempfii. For nuclear loci, the mean silent nucleotide diversity in P. krempfii (πs = 0.0015 θws = 0.0020) was comparable with those found in Pinus cembra (πs = 0.0024 θws = 0.0024) (Mosca et al. 2012 ), but much lower than those in other pines (Fig. 3 Table S7). For mtDNA, we did not detect any polymorphism across ten mtDNA regions (approximately 10 kbp). Although low nucleotide variation for mtDNA has been observed in conifers, some of the mtDNA regions analyzed in this study have been widely used in previous population studies in pines, and various levels of polymorphism have been reported for most pine species, including those with limited range of distribution (Chiang et al. 2006 Eckert et al. 2008 Wang et al. 2011 ). For example, Eckert et al. ( 2008 ) detected 14 mitotypes in Pinus balfouriana, a California endemic pine with only two disjunct populations, based on four mtDNA fragments involved in this study. For the cpSSR, haplotype diversity (He = 0.911) detected in P. krempfii was high, as observed in most pine species (Höhn et al. 2005 Petit et al. 2005 Wang et al. 2011 , 2013 ). The contrasting levels of genetic diversity between cpSSR and mt- and nuclear DNA sequences observed in P. krempfii can be due to the different mutation rates between genomic regions. In pine species, the mutation rate for length variation at cpSSR loci (3.2–7.9 × 10 −5 ) was 5–6 orders of magnitude higher than the substitution rates in mt- (4 × 10 −11 ) and nuclear DNA (7 × 10 −10 ) sequences (Provan et al. 1999 Mower et al. 2007 Willyard et al. 2007 ). The asymmetric diversity between genetic markers has been observed in other pines such as P. cembra using cpSSR (He = 0.917) and nuclear DNA sequences (πs = 0.0024) (Höhn et al. 2005 Mosca et al. 2012 ). The difference in diversity between cpSSR relative to mt and nuclear loci could also be due to varied demographic and selective histories of different genomes. For example, during range fragmentation, the loss of the cpDNA diversity in single spatially isolated population could be compensated by efficient pollen flow from adjacent populations, whereas isolated populations may experience stronger bottleneck on mt genome due to limited seed dispersal. Natural selection could reduce the genetic diversity of functional nuclear loci, but may not affect neutral cpSSR loci. In summary, the nucleotide polymorphism in nuclear and mt genomes of P. krempfii was lowest among the pine species studied so far, whereas high genetic diversity was observed at cpSSR loci possibly due to the hypervariable nature of the SSR markers.

We detected low (5.2%) but significant differentiation among the extant populations of P. krempfii. Even within each region, FST values (3.8–7.8%) were also significant. This level of differentiation is comparable to pine species with wide distribution ranges (Wang et al. 1991 Ma et al. 2006 Pyhäjärvi et al. 2007 ). Due to lack of geographical barrier between sampled populations of P. krempfii, the population differentiation in this species could have been caused by fragmented nature of its distribution. Unlike most other pines, P. krempfii does not form pure stands, and individual populations consist of small groups of trees and/or solitary individuals dispersed among dense thicket of other tree species (Nguyen and Thomas 2004 ). These conditions are likely to limit dispersal of its pollen and seed, and contribute to differentiation between local populations. Furthermore, P. krempfii is distributed in a wet rainforest environment, which could preclude efficient wind pollination (Turner 2001 ). High humidity dampens pollen grains, and heavy rains wash away pollen from the air. In summary, despite relative proximity of individual populations, the low population density of P. krempfii and humid environment have prevented gene flow and led to certain degree of population differentiation. These findings indicate that even species with a very limited distribution may harbor genetically differentiated populations.

Approximate Bayesian computation simulations using nuclear loci suggested that P. krempfii experienced an exponential population growth, which started approximately 2450 years ago. Population expansion was also supported by mismatch distribution test based on the cpDNA data. The timing of the population growth in P. krempfii was much later than those in other Eurasian pines, which were dated at a few hundred thousand years ago, for example, P. densata from the Tibetan Plateau (Gao et al. 2012 ) and P. sylvestris from Europe (Pyhäjärvi et al. 2007 ). Thus, the population expansion revealed in P. krempfii might have been induced by regional climate changes or human activates, rather than global climate fluctuations during the Pleistocene. Assuming generation time of P. krempfii as 50 years, the population growth lasted for only 49 generations in this species. This episode of population expansion is too short to allow for accumulation of extensive polymorphism. Moreover, the habitat of P. krempfii has deteriorated and become fragmented in the last decades (Nguyen and Thomas 2004 ), which could have resulted in the reduction and fragmentation of P. krempfii populations. As suggested by earlier studies, the models implemented and explored in ABC and mismatch distribution analyses are most likely too simplistic (Ingvarsson 2008 Gao et al. 2012 ). Presumably, P. krempfii has gone through repeated population size expansions and contractions, and the most recent population decline was not revealed by the current simulations. The reduction in population size and population fragmentation could decrease the frequency of rare alleles in a very short time (Ellstrand and Elam 1993 ).

The extremely low nucleotide diversity detected in P. krempfii is nearly 2–8 times lower than in most other Eurasian pines (Fig. 3 Table S7) and is consistent with its small population size (1.43 × 10 4 ). ABC analyses suggested that P. krempfii has maintained an extremely small ancestral population, comprising only a few hundred individuals (455) for more than 2.8 Myr before entering the population growth phase. This situation of P. krempfii with small ancestral population size is different from that of other relic gymnosperms such as Ginkgo biloba and Cathaya argyrophylla, which have been abundant and widespread before glaciations (Wang and Ge 2006 Gong et al. 2008 ). Brodribb and Feild ( 2008 ) speculated that competition from angiosperms and subtropical podocarps could have limited the success of P. krempfii.

Small population size has two important genetic consequences. One is loss of genetic diversity due to genetic drift. Another is increased inbreeding, which leads to higher levels of homozygosity and mortality caused by lethal or semi-lethal alleles. The inbreeding coefficient (FIS = 0.26) in P. krempfii is much higher than those in other pines such as P. pinaster (0.069) (Eveno et al. 2008 ). In this study, we collected cones from most of the sampled individuals and found that practically all of the seeds were empty. Although the mating system of P. krempfii has not been studied, pine species are self-compatible and it is well known that the presence of empty seed in this group of conifers is a sure indicator of increased levels of inbreeding (Karkkainen et al. 1996 ). Therefore, apart from the loss of diversity due to genetic drift, the extant populations of P. krempfii may be also suffering from additional loss due to inbreeding. Future study on the mating system of P. krempfii could reveal the true impact of inbreeding on the loss of genetic diversity in this species.

Pinus krempfii is thought to be an ancient relict (Nguyen and Thomas 2004 ). It is the only extant species in subsection Krempfianae and diverged from other pines more than 10 million years ago (Willyard et al. 2007 ). The unique morphology, physiology, anatomy, limited distribution range, and distinct habitat also indicate that this species has been isolated from the other pines for a long time. Long-term isolation together with small population size could have enhanced the impact of genetic drift and inbreeding in P. krempfii, resulting in severe reduction in genetic diversity.

Nucleotide diversity could also be reduced by selective sweeps that diminish variation at and around particular genes or by purifying selection against deleterious mutations closely linked to neutral variants (Hahn 2008 ). However, we did not find strong evidence for selection at any of the analyzed loci. The rapid decrease of LD over distance also suggested limited effects of genetic hitchhiking. Therefore, while selection may partly explain the low levels of nucleotide variation at several loci, it does not seem to be sufficient to explain the low levels of variation across nuclear loci included in our study.

### Conservation implications

The low nucleotide polymorphism, restricted distribution, and high ratio of empty seeds in P. krempfii suggest the species is exposed to a considerable risk of extinction. Although the most extant populations of P. krempfii are currently under legal protection in national parks in Vietnam, they face serious threat and risk of extinction by stochastic processes because of their small size. Population size is the most important of the five criteria for listing species as endangered under the International Union for the Conservation of Nature and Natural Resources (IUCN) system (http://www.iucn.org/), and the loss of genetic variation may decrease the potential for a species to persist in the face of biotic and abiotic changes. Thus, efforts should be made to increase the genetic diversity and population size of P. krempfii. Pinus krempfii does not form pure forests and typically occurs as small groups of 10–30 trees in dense subtropical forests (Nguyen and Thomas 2004 ). The persistence and regeneration of this species are highly reliant on the subtropical forest environment. For example, the seedlings and saplings of P. krempfii were restricted to shade environment under the forest canopy (Nguyen and Thomas 2004 ). Unfortunately, there is a continuing decline in the extent and quality of its habitats due to human activities (e.g., war in the 1960s and the clearance of land for agriculture) and climate changes in recent decades (Nguyen and Thomas 2004 ). The loss of habitat could have decreased the population size of P. krempfii in the past and would prevent recovery of population in the future. Therefore, the first effort to recover the extant P. krempfii population should be the protection and restoration of the habitat that P. krempfii is adapted to.

The in situ conservation alone, however, cannot conserve and recover the species because of the restricted distribution of P. krempfii. Therefore, ex situ conservation should also be given high priority to offset the habitat deterioration and fragmentation. In this regard, introductions can be designed to establish self-sustaining wild populations, and this practice should be carried out in suitable habitats.

The high ratio of empty seed detected in P. krempfii suggests that this species is suffering from inbreeding depression. Thus, traditional breeding practices such as controlled crosses between genetically distinct populations, even between individuals of the same stand, could be helpful to restore and enrich genetic diversity in P. krempfii. Controlled crosses are important genetic tools for both breeding and conservation of wild populations of economically and ecologically important plant species. Although population differentiation was low in P. krempfii, some population pairs (e.g., Bidoup vs. Cong Troi 103) showed considerable divergence (data not shown). Therefore, controlled crosses between these populations seem reasonable, and a seed orchard could be established for production of genetically improved seeds of P. krempfii, but the potential benefits should be evaluated prior to full implementation. Future studies should employ both genomic and ecological data to better understand the evolutionary history of P. krempfii and to make additional conservation efforts (e.g., outcrossing assessment and population viability analysis) to develop better quantitative recovery criteria for this species.

1. Gotilar

2. Cocytus

it will be interesting.

3. Sazragore

You allow the mistake. I offer to discuss it. Write to me in PM, we will handle it.

4. Leo

It above my understanding!

5. Hevataneo

This is the simply excellent sentence