10.9: Genomic Libraries - Biology

A genomic library might be a tube full of recombinant bacteriophage. The library is made to contain a representation of all possible fragments of a genome. Bacteriophage are often used to clone genomic DNA fragments because:

  • phage genomes are bigger than plasmids and can be engineered to remove a large amount of DNA that is not needed for infection and replication in bacterial host cells.
  • the missing DNA can thus be replaced by foreign insert DNA fragments as long as 18-20 kbp (kilobase pairs), nearly 20X as long as typical cDNA inserts in plasmids.
  • purified phage coat proteins can be mixed with the recombined phage DNA to make infectious phage particles that would infect host bacteria, replicate lots of new recombinant phage, and then lyse the cells to release the phage.

The need for vectors like bacteriophage that can accommodate long inserts becomes obvious from the following bit of math. A typical mammalian genome consists of more than 2 billion base pairs. Inserts in plasmids are very short, rarely exceeding 1000 base pairs. Dividing 2,000,000,000 by 1000, you get 2 million, the minimum number of plasmid clones that must be screened to find a sequence of interest. In fact, because cloned fragments are random and overlapping, you would need many more than this number of clones to find a gene (or parts of one!). Of course, part of the solution to this “needle in a haystack” dilemma is to clone larger DNA inserts in more accommodating vectors.
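The arithmetic above can be sketched in a few lines of Python. The function names are illustrative, and the second function uses the Clarke-Carbon estimate, N = ln(1 - P) / ln(1 - f), which is not part of the original text; it is added here only to show why "many more" clones are needed when fragments are sampled at random.

```python
import math

def min_clones(genome_bp: int, insert_bp: int) -> int:
    """Fewest clones that could tile the genome if no two inserts overlapped."""
    return math.ceil(genome_bp / insert_bp)

def clones_for_probability(genome_bp: int, insert_bp: int, probability: float) -> int:
    """Clarke-Carbon estimate (an addition, not from the text):
    N = ln(1 - P) / ln(1 - f), where f is the fraction of the genome in one insert."""
    fraction = insert_bp / genome_bp
    return math.ceil(math.log(1 - probability) / math.log(1 - fraction))

genome = 2_000_000_000  # "more than 2 billion base pairs", rounded down

print(min_clones(genome, 1_000))    # 2000000 plasmid-sized inserts
print(min_clones(genome, 20_000))   # 100000 phage-sized inserts
# Clones needed for a 99% chance that a given sequence is in a phage library:
print(clones_for_probability(genome, 20_000, 0.99))
```

Moving from 1 kbp to 20 kbp inserts cuts the minimum from 2 million to 100,000 clones, while the random-sampling estimate shows that a realistic library must be several times larger than that minimum.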

From this brief description, you may recognize the common strategy for genetically engineering a cloning vector: determine the minimum properties that your vector must have and remove non-essential DNA sequences. Consider the Yeast Artificial Chromosome (YAC), hosted by (replicated in) yeast cells. YACs can accept humongous foreign DNA inserts! This is because a chromosome that will replicate in a yeast cell requires one centromere and two telomeres… and little else!

Recall that telomeres are needed in replication to keep the chromosome from shortening during replication of the DNA. The centromere is needed to attach chromatids to spindle fibers so that they can separate during anaphase in mitosis (and meiosis). So along with a centromere and two telomeres, just include restriction sites to enable recombination with inserts as long as 2000 kbp. That’s a YAC! The tough part of course is keeping a 2000 kbp long DNA fragment intact long enough to get it into the YAC.

However a vector is engineered and chosen, sequencing its insert can tell us many things. Genomic sequences can show us how a gene is regulated by revealing known regulatory DNA sequences and uncovering new ones. They can tell us what other genes are nearby, and where genes are on chromosomes. Genomic DNA sequences from one species can probe for similar sequences in other species, and comparative sequence analysis can then tell us a great deal about gene evolution and the evolution of species.

One early surprise from gene sequencing studies was that we share many common genes and DNA sequences with other species, from yeast to worms to flies… and of course vertebrates and our more closely related mammal friends. You may already know that the chimpanzee’s and our genomes are 99% similar. Moreover, we have already seen comparative sequence analysis showing how proteins with different functions nevertheless share structural domains.

Let’s look at cloning a genomic library in phage. As you will see, the principles are similar to cloning a foreign DNA into a plasmid, or in fact any other vector, but the numbers and details used here exemplify cloning in phage.

A. Preparing Genomic DNA of a Specific Length for Cloning

To begin with, high molecular weight genomic DNA (i.e., long DNA molecules) is isolated, purified and then digested with a restriction enzyme. Usually, the digest is partial, aiming to generate overlapping DNA fragments of random length. When the digest is electrophoresed on agarose gels, the DNA (stained with ethidium bromide, a fluorescent dye that binds to DNA) looks like a bright smear on the gel. All of the DNA could be recombined with suitably digested vector DNA. But, to further reduce the number of clones to be screened for a sequence of interest, early cloners would generate a Southern blot (named after Edward Southern, the inventor of the technique) to determine the size of genomic DNA fragments most likely to contain a desired gene.

Beginning with a gel of genomic DNA restriction digests, the Southern blot protocol is illustrated below

To summarize the steps:

a) Digest genomic DNA with one or more restriction endonucleases.

b) Run the digest products on an agarose gel to separate fragments by size (length). The DNA appears as a smear when stained with a fluorescent dye.

c) Place a filter on the gel. The DNA transfers (blots) to the filter over a period of, e.g., 24 hours.

d) Remove the blotted filter and place it in a bag containing a solution that can denature the DNA.

e) Add radioactive probe (e.g., cDNA) containing the gene or sequence of interest. The probe hybridizes (binds) to complementary genomic sequences on the filter.

f) Prepare an autoradiograph of the filter and see a ‘band’ representing the size of genomic fragments of DNA that include the sequence of interest.
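Steps (a) and (b) can be mimicked in silico. The sketch below is a simplification under stated assumptions: it performs a complete (not partial) digest of a linear sequence, uses the EcoRI recognition site GAATTC (cut between G and A) as the default enzyme, and runs on a made-up toy sequence. The function name and parameters are hypothetical.

```python
def digest(sequence: str, site: str = "GAATTC", cut_offset: int = 1) -> list[int]:
    """Return fragment lengths after cutting a linear sequence at every site.

    cut_offset is the number of bases of the site left on the upstream
    fragment (1 for EcoRI, which cuts G^AATTC).
    """
    cut_positions = []
    start = sequence.find(site)
    while start != -1:
        cut_positions.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    boundaries = [0] + cut_positions + [len(sequence)]
    return [end - begin for begin, end in zip(boundaries, boundaries[1:])]

# Toy sequence (invented for illustration) with two EcoRI sites.
toy = "AAAGAATTCCCCCCGAATTCTT"
fragments = digest(toy)
print(fragments)                        # [4, 11, 7]
# A gel separates by size, so the "lane" is the fragments sorted by length.
print(sorted(fragments, reverse=True))  # [11, 7, 4]
```

With real genomic DNA the number of fragments is so large that the sorted lengths run together, which is why the stained gel shows a smear rather than discrete bands.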

Once you know the size (or size range) of restriction digest fragments that contain the DNA you want to study, you are ready to:

a) run another gel of digested genomic DNA.

b) cut out the piece of gel containing the fragments that ‘lit up’ with your probe in the autoradiograph.

c) remove (elute) the DNA from the gel piece into a suitable buffer

d) prepare the DNA for insertion into (recombination with) a genomic cloning vector

B. Recombining Size-Restricted Genomic DNA with Phage DNA

After elution of restriction digested DNA fragments of the right size range from the gels, the DNA is mixed with compatibly digested phage DNA at concentrations that favor the formation of H-bonds between the ends of the phage DNA and the genomic fragments. Addition of DNA ligase covalently links the recombined DNA molecules. These steps are abbreviated in the illustration below.

The recombinant phage that are made next will contain sequences that become the genomic library.

C. Creating Infectious Viral Particles with Recombinant Phage DNA

The next step is to package the recombined phage DNA with added purified viral coat proteins to make infectious phage particles (below)

Packaged phage are added to a culture tube full of host bacteria (typically E. coli). After infection, the recombinant DNA enters the cells where it replicates and directs the production of new phage that eventually lyse the host cell (illustrated below).

The recombined vector can also be introduced directly into the host cells by transfection (which is to phage DNA what transformation is to plasmid DNA). Whether by infection or transfection, the recombinant phage DNA ends up in host cells which produce new phage that eventually lyse the host cell. The released phages go on to infect more host cells until all cells have lysed. What remains is a tube full of lysate containing cell debris and lots of recombinant phage particles.

D. A Note About Some Other Vectors

We’ve seen that phage vectors accommodate larger foreign DNA inserts than plasmid vectors, and YACs even more…, and that for larger genomes, the goal is to choose a vector able to house larger fragments of ‘foreign’ DNA so that you end up screening fewer clones. Given a large enough eukaryotic genome, it may be necessary to screen more than a hundred thousand clones in a phage-based genomic library. Apart from size-selection of genomic fragments before inserting them into a vector, selecting the appropriate vector is just as important. The table below lists commonly used vectors and the sizes of inserts they will accept.

Vector Type                                 Insert Size (thousands of bases)
Plasmids                                    up to 15
Phage lambda (λ)                            up to 25
Cosmids                                     up to 45
Bacteriophage P1                            70 to 100
P1 artificial chromosomes (PACs)            130 to 150
Bacterial artificial chromosomes (BACs)     120 to 300
Yeast artificial chromosomes (YACs)         250 to 2000
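As a small illustration of how the table might be used, the sketch below looks up which vector types accept a fragment of a given size. The capacities are copied from the table; treating "up to N" as a range starting at 0 is an assumption, and the dictionary and function names are hypothetical.

```python
# Insert capacities in kb, copied from the table above.
# Assumption: rows that say "up to N" are treated as the range 0..N.
VECTOR_CAPACITY_KB = {
    "Plasmid": (0, 15),
    "Phage lambda": (0, 25),
    "Cosmid": (0, 45),
    "Bacteriophage P1": (70, 100),
    "PAC": (130, 150),
    "BAC": (120, 300),
    "YAC": (250, 2000),
}

def vectors_for_insert(insert_kb: float) -> list[str]:
    """List the vector types whose stated capacity range includes insert_kb."""
    return [
        name
        for name, (low, high) in VECTOR_CAPACITY_KB.items()
        if low <= insert_kb <= high
    ]

print(vectors_for_insert(20))    # ['Phage lambda', 'Cosmid']
print(vectors_for_insert(140))   # ['PAC', 'BAC']
print(vectors_for_insert(1000))  # ['YAC']
```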

We will continue this example by screening a phage lysate genomic library for a recombinant phage with a genomic sequence of interest.

E. Screening a Genomic Library; Titering Recombinant Phage Clones

A phage lysate is titered on a bacterial lawn to determine how many virus particles are present. A bacterial lawn is made by plating so many bacteria on the agar plate that they simply grow together rather than as separate colonies. In a typical titration, a lysate might be diluted 10-fold with a suitable medium and this dilution is further diluted 10-fold… and so on. Such serial 10X dilutions are then spread over bacterial (e.g., E. coli) lawns. What happens on such a culture plate?

Let’s say that when 10 μl of one of the dilutions are spread on the bacterial lawn, they infect 500 E. coli cells on the bacterial lawn. After a day or so, there will be small clearings in the lawn called plaques…, 500 of them in this example. These are 500 tiny clear spaces on the bacterial lawn created by the lysis of first one infected cell, and then progressively more and more cells neighboring the original infected cell. Each plaque is thus a clone of a single virus, and each virus particle in a plaque contains a copy of the same recombinant phage DNA molecule (below).

If you actually counted 500 plaques on the agar plate, then there must have been 500 virus particles in the 10 μl seeded onto the lawn. And, if this plate was the fourth dilution in a 10-fold serial dilution protocol, the lysate had been diluted 10,000-fold (10 X 10 X 10 X 10), so there must have been 5,000,000 (500 X 10,000) phage particles in 10 μl of the original undiluted lysate.
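The titer calculation is a plaque count multiplied by the total dilution factor, where each 10-fold step multiplies the factor by 10. A minimal sketch, with hypothetical function names:

```python
def phage_in_plated_volume(plaques: int, dilution_steps: int, fold_per_step: int = 10) -> int:
    """Phage particles in the same volume of undiluted lysate.

    Each serial step dilutes by fold_per_step, so the total dilution
    factor is fold_per_step ** dilution_steps.
    """
    return plaques * fold_per_step ** dilution_steps

def titer_per_ml(plaques: int, dilution_steps: int, plated_ul: float,
                 fold_per_step: int = 10) -> float:
    """Plaque-forming units per milliliter of undiluted lysate."""
    return phage_in_plated_volume(plaques, dilution_steps, fold_per_step) * 1000 / plated_ul

# 500 plaques from 10 microliters of the fourth 10-fold dilution:
print(phage_in_plated_volume(500, 4))   # 5000000
print(titer_per_ml(500, 4, 10))         # 500000000.0
```

Expressing the result per milliliter makes titers comparable between plates that were seeded with different volumes.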

F. Screening a Genomic Library; Probing the Genomic Library

In order to represent a complete genomic library, it is likely that many plates of such a dilution (~500 plaques per plate) will have to be created and then screened for a plaque containing a gene of interest. But, if only size-selected fragments were inserted into the phage vectors in the first place, the plaques represent only a partial genomic library, so fewer clones need to be screened to find the sequence of interest. For either kind of library, the next step is to make replica filters of the plaques. Replica plating of plaques is similar to making a replica filter of bacterial colonies. While much of the phage DNA in a plaque is encased in viral proteins, there will also be DNA on the plaque replicas that was never packaged into viral particles. The filters can be treated to denature the latter DNA and then directly hybridized to a probe with a known sequence. In the early days of cloning, the probe for screening a genomic library was usually an already isolated and sequenced cDNA clone, either from the same species as the genomic library, or from a cDNA library of a related species. After soaking the filters in a radioactively labeled probe, X-ray film is placed over the filter, exposed and developed. Black spots will form where the film lay over a plaque containing genomic DNA complementary to the radioactive probe. In the example illustrated below, a globin cDNA might have been used to probe the genomic library (globin genes were among the first to be cloned!).

G. Isolating a Gene for Further Study

Cloned genomic DNA fragments are much longer than any gene of interest, and always longer than any cDNA from a cDNA library. They are also embedded in a genome that is thousands of times as long as the gene itself, making the selection of an appropriate vector necessary. If the genome can be screened among a reasonable number of cloned phage (~100,000 plaques for instance), the one plaque producing a positive signal on the autoradiograph would be further studied.

This plaque should contain the gene of interest. The next step is to find the gene within a genomic clone that can be as much as 20 kbp long. The traditional strategy is to purify the cloned DNA, subject it to restriction endonuclease digestion, and separate the digest products by agarose gel electrophoresis. Using Southern blotting, the separated DNA fragments are denatured and blotted to a nylon filter. The filter is then probed with the same tagged probe used to find the positive clone (plaque). The smallest DNA fragment containing the gene of interest can itself be subcloned in a suitable vector, and grown to provide enough DNA for further study of the gene.

Using seasonal genomic changes to understand historical adaptation to new environments: Parallel selection on stickleback in highly-variable estuaries

Alan Garcia-Elfring, Department of Biology, Redpath Museum, McGill University, Montreal, QC, Canada.

Department of Biology, Redpath Museum, McGill University, Montreal, QC, Canada

McGill University Genome Center, McGill University, Montreal, QC, Canada

Department of Biology, Redpath Museum, McGill University, Montreal, QC, Canada

Department of Ecology and Evolutionary Biology, University of California, Santa Cruz, CA, USA

Department of Ecology and Evolutionary Biology, University of California, Santa Cruz, CA, USA

Department of Biology, Redpath Museum, McGill University, Montreal, QC, Canada

Department of Biology, Redpath Museum, McGill University, Montreal, QC, Canada

Department of Biology, Redpath Museum, McGill University, Montreal, QC, Canada


Parallel evolution is considered strong evidence for natural selection. However, few studies have investigated the process of parallel selection as it plays out in real time. The common approach is to study historical signatures of selection in populations already well adapted to different environments. Here, to document selection under natural conditions, we study six populations of threespine stickleback (Gasterosteus aculeatus) inhabiting bar-built estuaries that undergo seasonal cycles of environmental changes. Estuaries are periodically isolated from the ocean due to sandbar formation during dry summer months, with concurrent environmental shifts that resemble the long-term changes associated with postglacial colonization of freshwater habitats by marine populations. We used pooled whole-genome sequencing to track seasonal allele frequency changes in six of these populations and search for signatures of natural selection. We found consistent changes in allele frequency across estuaries, suggesting a potential role for parallel selection. Functional enrichment among candidate genes included transmembrane ion transport and calcium binding, which are important for osmoregulation and ion balance. The genomic changes that occur in threespine stickleback from bar-built estuaries could provide a glimpse into the early stages of adaptation that have occurred in many historical marine to freshwater transitions.
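The idea of tracking seasonal allele frequency changes in pooled sequencing data can be illustrated with a toy calculation. This is only a sketch under loud assumptions: the abstract does not describe the statistical tests used, the read counts below are invented, and the function names are hypothetical. In pooled sequencing, an allele's frequency is commonly estimated as the fraction of reads carrying it.

```python
def allele_frequency(alt_reads: int, total_reads: int) -> float:
    """Estimate allele frequency in a pool as the fraction of reads with the allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return alt_reads / total_reads

def frequency_change(early: tuple[int, int], late: tuple[int, int]) -> float:
    """Change in estimated frequency between two (alt_reads, total_reads) samples."""
    return allele_frequency(*late) - allele_frequency(*early)

def same_direction(changes: list[float]) -> bool:
    """True if the allele moved in the same direction in every population."""
    return all(c > 0 for c in changes) or all(c < 0 for c in changes)

# Invented counts for one SNP in six populations: (early season, late season).
samples = [
    ((30, 100), (45, 100)),
    ((22, 80), (40, 90)),
    ((15, 60), (27, 60)),
    ((40, 120), (66, 120)),
    ((18, 70), (35, 70)),
    ((25, 100), (38, 95)),
]
changes = [frequency_change(early, late) for early, late in samples]
print(same_direction(changes))  # True
```

A shift in the same direction across independent estuaries is the kind of pattern the abstract describes as consistent; a real analysis would also need to account for sampling noise from read depth and pool size.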

Optimizing recombineering in Corynebacterium glutamicum

Correspondence Anthony J. Sinskey, Department of Biology, Massachusetts Institute of Technology, 77 Massachusetts Ave., Bldg. 68, Cambridge, MA 02139-4301, USA.

Department of Biology, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA

Disruptive & Sustainable Technologies for Agricultural Precision, Singapore-MIT Alliance for Research and Technology, Singapore, Singapore

Department of Biology, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA

Disruptive & Sustainable Technologies for Agricultural Precision, Singapore-MIT Alliance for Research and Technology, Singapore, Singapore

Department of Biology, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA

Disruptive & Sustainable Technologies for Agricultural Precision, Singapore-MIT Alliance for Research and Technology, Singapore, Singapore

Department of Biology, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA

Disruptive & Sustainable Technologies for Agricultural Precision, Singapore-MIT Alliance for Research and Technology, Singapore, Singapore

Cheng Li and Charles A. Swofford contributed equally to this study.


Owing to the increasing demand for amino acids and valuable commodities that can be produced by Corynebacterium glutamicum, there is a pressing need for new rapid genome engineering tools that improve the speed and efficiency of genomic insertions, deletions, and mutations. Recombineering using the λ Red system in Escherichia coli has proven very successful at genetically modifying this organism in a quick and efficient manner, suggesting that optimizing a recombineering system for C. glutamicum will also improve the speed for genomic modifications. Here, we maximized the recombineering efficiency in C. glutamicum by testing the efficacy of seven different recombinase/exonuclease pairs for integrating single-stranded DNA and double-stranded DNA (dsDNA) into the genome. By optimizing the homologous arm length and the amount of dsDNA transformed, as well as eliminating codon bias, a dsDNA recombineering efficiency of 13,250 transformed colonies/10^9 viable cells was achieved, the highest efficiency currently reported in the literature. Using this optimized system, over 40,000 bp could be deleted in one transformation step. This recombineering strategy will greatly improve the speed of genetic modifications in C. glutamicum and assist other systems, such as clustered regularly interspaced short palindromic repeats and multiplexed automated genome engineering, in improving targeted genome editing.

Results and Discussion

Genome assembly and annotation

For H. armigera, the final assembly freeze (‘csiro4bp’) has 997 scaffolds covering a total of 337 Mb and including 37 Mb of gaps. The N50 is 1.00 Mb, and the mean scaffold length is 338 kb (Table 1). This assembly was selected from several that were generated based on contig and scaffold length and integrity and gene assembly quality for a set of test genes. For H. zea, the final assembly freeze (‘csirohz5p5’) has 2975 scaffolds covering a total of 341 Mb, including 34 Mb of gaps. The N50 is 201 kb, and the mean scaffold length is 115 kb (Table 1). These overall genome sizes are very close to those previously determined by flow cytometry for these and closely related heliothine species [38]. However, they are smaller than those estimated from genome data for the original lepidopteran model genome, the silkworm Bombyx mori (431.7 Mb) [39] and its relative, the tobacco hornworm Manduca sexta (419 Mb) [40]. The N50 statistic for H. armigera in particular compares well to other lepidopteran draft assemblies, although the B. mori assembly has a significant proportion of the genome in larger scaffolds (Table 1).
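For readers unfamiliar with the N50 statistic, the sketch below computes it from a list of scaffold lengths: it is the length of the scaffold at which the running total, taken from longest to shortest, first reaches half of the assembly. The toy lengths are invented; only the totals and scaffold counts in the final lines come from the text.

```python
def n50(lengths: list[int]) -> int:
    """Length L such that scaffolds of length >= L hold at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no scaffold lengths given")

# Toy assembly (invented lengths, arbitrary units): total 300, half is 150.
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_scaffolds))  # 70, because 80 + 70 reaches 150

# Mean scaffold length follows directly from the totals quoted in the text.
print(round(337e6 / 997 / 1e3))   # 338 (kb, H. armigera)
print(round(341e6 / 2975 / 1e3))  # 115 (kb, H. zea)
```

Because N50 weights long scaffolds heavily, it can be far larger than the mean, as with the 1.00 Mb N50 against a 338 kb mean for H. armigera.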

Automated annotation of the H. armigera genome followed by some manual correction by domain experts (see below) yielded a final official gene set (OGS2) of 17,086 genes (Additional file 1: Table S1). This gene set was then used to derive a final OGS (OGS2) containing 15,200 good-quality gene models for H. zea (Additional file 1: Table S1). Orthologues of another 1192 H. armigera gene models were present as poor-quality models (i.e. much shorter than expected from their H. armigera orthologues) in the available H. zea assemblies and transcriptome data, making a total of 16,392 H. armigera genes for which orthologues could be identified in the H. zea genome. This left 694 H. armigera genes for which no H. zea orthologues were found. In the H. zea assemblies, on the other hand, 410 gene models more than 100 codons in length were identified that had no apparent H. armigera orthologue but these were generally incomplete models that lacked start codons. Nor could any of the very few Pfam domains that were found among the latter gene models be assigned to any of the major manually annotated gene families. These latter H. zea models were therefore not analysed further.

Application of the Benchmarking Universal Single-Copy Orthologues (BUSCO) pipeline [41] showed that the two Helicoverpa OGS2s compare well for completeness with the other lepidopteran genomes analysed. In particular, the H. armigera genome scored more highly on both the genome and protein analyses for genes present than do either of the well-characterised B. mori or M. sexta genomes (Table 1).

Nearly 83% (14,155) of the 17,086 genes identified in the H. armigera genome could be functionally annotated by searches against B. mori and Drosophila melanogaster proteome databases as matching proteins with functions described as other than “uncharacterised”. Most of these also have InterProScan domains or Gene Ontology (GO) annotations (Table 1; Additional file 2: Table S2).

Orthologue mapping of the 17,086 H. armigera genes with the 15,007 National Center for Biotechnology Information (NCBI) Gnomon models for B. mori identified 10,612 direct orthologues. Of the genes in either of these species without direct orthologues in the other, 3043 of the H. armigera genes and 2479 of those from B. mori have GO annotations. For the B. mori genes with no H. armigera orthologue, the major over-represented annotations are chromatin structure and organisation, and DNA replication, with some genes also relating to chorion production (Fig. 1). In contrast, the H. armigera genes without known orthologues in B. mori are over-represented with annotations of signal transduction and sensory perception relating to taste and smell (corresponding to those terms labelled G protein coupled receptor signaling pathway), proteolysis and detoxification.

GO term analyses of gene gain/loss events in H. armigera vs B. mori. The left panel shows GO terms enriched in the H. armigera gene set vs B. mori, and the right panel shows those enriched in the B. mori gene set vs H. armigera

GO annotations were found for 237 of the 694 H. armigera genes without an identifiable match in the H. zea genome. The GO annotations most over-represented among these genes involved sensory perception and signal transduction of taste or smell (Additional file 3: Figure S1). Analysis of the 1192 genes present in H. armigera but with poor models in the H. zea genome showed that only those associated with retrotransposon-coding sequences were enriched; this is consistent with these genes lying in poorly assembled genomic regions rather than belonging to any biologically distinct functional group.

Using RepeatModeler, we estimated that the H. armigera and H. zea genomes contain 14.6% (49 Mb) and 16.0% (53 Mb) repeats, respectively, which was significantly less than the 35% repetitive sequence found in the B. mori genome and the 25% repetitive sequence found in the postman butterfly Heliconius melpomene by equivalent methods (Table 1; Additional file 4: Table S3). Most (84%) of the repeats in both Helicoverpa genomes consisted of unclassified repeats, with less than 1% of each genome consisting of simple repeats or low-complexity regions. A total of 682 unique complex repeats were found in H. armigera, and 97 of these had similarities to Dfam hidden Markov models (HMMs) [42] from other species. In concordance with Coates et al. [38], who identified 794 transposable elements (TEs) among bacterial artificial chromosome (BAC) clones from H. zea, a little over half of all TEs identified were type I elements (retrotransposed) in H. armigera (53%) and H. zea (also 53%), and about half of those were long interspersed nuclear elements (LINEs) (Additional file 4: Table S3). Gypsy elements were the most numerous long terminal repeat (LTR) elements identified in both genomes, although LTR elements were less abundant in H. zea than in H. armigera, possibly reflecting poorer genome assembly quality. For both genomes, the most abundant of the type II elements (DNA transposon-like) that could be classified belonged to the hAT family.

An extensive microRNA (miRNA) catalogue has been developed for B. mori [43,44,45] and (as of August 2016) contains 563 mature miRNA sequences, the most for any insect. Two recent papers have also identified miRNAs in H. armigera [46, 47]. We have identified 301 potential miRNAs in H. armigera by combining the ones previously identified for this species and those identified through our own sequencing of small RNAs (Additional file 5: Table S4). Of these, 134 appear to be conserved (E value ≤ 0.001) between H. armigera and B. mori, and 251 and 232 of them, respectively, could be found in our H. armigera and H. zea assemblies, although these numbers dropped to 183 and 161, respectively, when only perfect matches were allowed. Several of the H. armigera and H. zea miRNAs occur within 1 kb of others, but there is only one cluster of more than two (H. armigera scaffold_103; H. zea scaffold_688).

Genome organisation

We next investigated the proportion of the H. armigera genome showing syntenic relationships with B. mori chromosomes. We found that 569 H. armigera scaffolds (93% of the assembled genome) carried at least two contiguous H. armigera genes which had identifiable orthologues on the same B. mori chromosome, and so could be used in this analysis. Of these scaffolds, 536 only contained genes with orthologues on the same B. mori chromosome (Additional file 3: Figure S2). The remaining scaffolds contained two or three discrete blocks of synteny mapping to different chromosomes and may therefore represent non-syntenous relationships or misassemblies. The 536 scaffolds above represent 75.6% of the assembled genome and indicate a very high level of synteny across these two widely separated lepidopterans. This bears out the conclusions of high conservation of macro and micro synteny in Lepidoptera from other studies [48,49,50].
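The scaffold classification described above can be sketched as follows. This is a simplified reading of the text, not the authors' pipeline: each scaffold is represented as its genes in physical order, each mapped to the chromosome of its orthologue (or None if no orthologue was found), and a gene without an orthologue is assumed to interrupt contiguity. All names and example data are hypothetical.

```python
from itertools import groupby
from typing import Optional

def synteny_blocks(ortholog_chroms: list[Optional[str]]) -> list[str]:
    """Chromosomes of runs of >= 2 adjacent genes whose orthologues share a chromosome."""
    blocks = []
    for chrom, run in groupby(ortholog_chroms):
        if chrom is not None and len(list(run)) >= 2:
            blocks.append(chrom)
    return blocks

def classify_scaffold(ortholog_chroms: list[Optional[str]]) -> str:
    """'single' if all blocks map to one chromosome, 'multiple' if to several,
    'uninformative' if the scaffold has no block of two contiguous genes."""
    chroms = set(synteny_blocks(ortholog_chroms))
    if not chroms:
        return "uninformative"
    return "single" if len(chroms) == 1 else "multiple"

print(classify_scaffold(["chr1", "chr1", "chr1"]))                # single
print(classify_scaffold(["chr1", "chr1", None, "chr5", "chr5"]))  # multiple
print(classify_scaffold(["chr1", None, "chr2"]))                  # uninformative
```

Scaffolds in the "multiple" class are the ones the text flags as possible non-syntenous relationships or misassemblies.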

We then investigated the synteny between the two heliothine assemblies. Of the 2975 scaffolds in the considerably more fragmented H. zea assembly, 2367 had good-quality gene models corresponding to H. armigera genes. A total of 1761 of these scaffolds (83% of the assembled H. zea genome) each contained at least two contiguous genes forming a synteny block with an H. armigera scaffold (Additional file 3: Figure S2). As with the H. armigera/B. mori comparison above, most of the 1761 scaffolds (1512, covering 62% of the assembled genome) correspond to a single H. armigera scaffold, with the remainder (249, covering 21% of the genome) comprising multiple distinct blocks of synteny to different H. armigera scaffolds. As above, the latter could indicate either non-syntenous relationships or misassemblies. Notwithstanding the limitations due to the more fragmented H. zea genome, these analyses again indicate a high level of synteny between the species.

Annotation of gene families related to detoxification, digestion, chemosensation and defense

The gene families involved in detoxification, digestion and chemoreception were manually checked and annotated following application of an EXONERATE-based dedicated pipeline using all available sequences and complementary DNAs (cDNAs) to augment the automatically generated models. This yielded a total of 908 H. armigera and 832 H. zea genes. Other automatically generated gene models were manually annotated as belonging to gene families concerned with stress response and immunity, as well as to cuticular protein, ribosomal protein and transcription factor families. Additional file 6: Table S5 gives the names and locations of the total of 2378 H. armigera and 2269 H. zea genes processed in these ways.

The five major detoxification gene families (cytochrome P450s (P450s), carboxyl/cholinesterases (CCEs), glutathione S-transferases (GSTs), uridine diphosphate (UDP)-glucuronosyltransferases (UGTs) and ATP-binding cassette transporters (ABCs)) are very similar in size in H. armigera and H. zea (Table 2; Additional file 4: Sections 1–5). The slightly greater numbers recovered in the former species might be due in part to the higher quality of the assembly for that species. We also compared these numbers with those obtained with the same curation pipeline for the monophagous B. mori and the pest species M. sexta, which is oligophagous on Solanaceae (see Additional file 4: Sections 1–5) and, for the P450s, CCEs and GSTs, also for another pest, the diamondback moth Plutella xylostella, which is oligophagous on Brassicaceae (see Additional file 4: Sections 1–3). Relatively little difference from these other species was evident for the ABCs and UGTs, but quite large differences were found for the other detoxification families. The numbers of genes encoding P450s, CCEs and GSTs in the two heliothines are similar to or slightly larger than those of one of the other pest species, M. sexta, but substantially larger than those in B. mori and the other pest, P. xylostella: twice as large in the case of the GSTs and 20–40% larger in the case of the P450s and CCEs.

Notably, the differences in the H. armigera P450s, CCEs and GSTs are largely reflected in those of their subgroups that are generally associated with xenobiotic detoxification — the P450 clans 3 and 4, the detoxification and digestive CCE clades and the GST delta and sigma classes [51,52,53] (Fig. 2). Of particular note is the large cluster of CCEs in clade 1, with 21 genes for H. armigera, all located in one cluster of duplicated genes on scaffold_0. Twenty genes from this clade were also recovered from H. zea, and 26 from M. sexta, but only eight from B. mori (Additional file 4: Section 2). There were also large P450 clusters: the CYP340K cluster (10 genes) on scaffold_107 and the CYP340H cluster (six genes) on scaffold_371, both in clan 4, plus the clan 3 CYP6AE genes (11) on scaffold_33. Excepting the relatively low numbers for P. xylostella, the differences in P450s, CCEs and GSTs are consistent with the hypothesised positive relationship of detoxification gene number to host range [11], with the net difference of the heliothines from B. mori and P. xylostella across the three families being at least 50 genes (Additional file 4: Sections 1–3).

Phylogenetic, physical and transcriptional relationships within the major detoxification gene clusters. Selected clades of P450s, GSTs and CCEs, containing genes associated with detoxification functions, are shown. Clades discussed more extensively in the text are highlighted in red. Further details about the gene names and their associated OGS numbers are given in Additional file 4: Sections 1–3. Bars below the gene names indicate genes within a distinctive genomic cluster on a specific scaffold with the number shown; see Additional file 4: Sections 1–3 for further details. The clade 1 CCEs are specifically indicated. The phylogenetic order shown does not reflect the physical order of genes within a cluster. Expression is given as fragments per kilobase of transcript per million mapped reads (FPKM) for the tissue/developmental stage transcriptomes and log2(fold change) (logFC) for the host-response transcriptomes.
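The two expression units named in the caption can be computed as sketched below. FPKM follows directly from its definition; the pseudocount in the fold-change function is an assumption added here to avoid division by zero, not something stated in the source, and the example numbers are invented.

```python
import math

def fpkm(fragments: float, transcript_length_bp: float, total_mapped_reads: float) -> float:
    """Fragments per kilobase of transcript per million mapped reads."""
    kilobases = transcript_length_bp / 1e3
    millions_mapped = total_mapped_reads / 1e6
    return fragments / (kilobases * millions_mapped)

def log2_fold_change(treated: float, control: float, pseudocount: float = 1.0) -> float:
    """log2 of the expression ratio; the pseudocount is an assumption, not from the source."""
    return math.log2((treated + pseudocount) / (control + pseudocount))

# Invented example: 500 fragments on a 2 kb transcript in a 20-million-read library.
print(fpkm(500, 2_000, 20_000_000))   # 12.5
# Expression of 31 versus 7 gives (31 + 1) / (7 + 1) = 4, so logFC is 2.
print(log2_fold_change(31, 7))        # 2.0
```

Normalising by transcript length and library size lets expression be compared between genes and between samples, while the log2 scale makes a doubling and a halving symmetric (+1 and -1).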

Consistent with their role in host use, the digestive proteases and neutral lipases are also similar in number in H. armigera and H. zea, and more numerous in both than in B. mori (Table 2) (comparable quality annotations not being available for M. sexta or P. xylostella). The differences are again substantial: ~200% in the case of the trypsins and neutral lipases, and ~50% for the chymotrypsins, giving well over a 50-gene difference in total. As above, many of the differences can be attributed to amplifications of particular gene clusters (Fig. 3; Additional file 4: Section 6). In H. armigera, there are 29 clade 1 trypsin genes, with 28 in a single genomic cluster, and 26 clade 1 chymotrypsin genes in a single genomic cluster (Fig. 3; Additional file 4: Section 6). While the largest cluster of acid lipases comprises just five genes, there are several expanded clusters of neutral lipases, the largest three containing 13, seven and five genes, respectively (Fig. 3 (showing two of these clusters); Additional file 4: Section 7).

Phylogenetic, physical and transcriptional relationships within the major digestion gene clusters. Selected clades of serine proteases and lipases containing genes associated with digestive functions are shown. For the serine proteases, chymotrypsins (on the left) and trypsins (right) are shown as a single tree; the neutral and acid lipases are shown separately. Clades discussed more extensively in the text are highlighted in red. Further details about the gene names and their associated OGS numbers are given in Additional file 4: Sections 6, 7. Bars below the gene names indicate genes within a distinctive genomic cluster on a specific scaffold with the number shown; see Additional file 4: Sections 6, 7 for further details. The clade 1 chymotrypsins and trypsins are specifically indicated; for the latter, no single scaffold is shown because the cluster spans scaffolds 306, 5027, 842 and 194. The phylogenetic order shown does not reflect the physical order of genes within a cluster. Expression is given as FPKM for the tissue/developmental stage transcriptomes and logFC for the host-response transcriptomes

Only one of the four families of chemosensory proteins, the gustatory receptors (GRs), showed large differences in number between the four species (Table 2; Additional file 4: Section 8, and see also [54]). In this case, H. armigera had 28% more genes than H. zea (213 vs 166, respectively), far more than would be expected simply from the difference between the two species in assembly quality. This concurs with the finding above that the GO terms most enriched among the H. armigera genes without H. zea equivalents included sensory perception and signal transduction of taste or smell. In fact, 47 (20%) of the 237 genes in this category for which we found GO terms were GRs. H. armigera also had about three times as many GRs as B. mori, and four times as many as M. sexta (213 vs 69 and 45, respectively). The difference from B. mori is again consistent with the enrichment of GO terms concerned with sensory perception and signal transduction related to taste or smell found among the H. armigera genes without equivalents in B. mori, as discussed above for Fig. 1. Notably, the oligophagous M. sexta has even fewer GR genes than B. mori; we do not know why this is so.

Few differences were evident among the two heliothines and B. mori in the numbers of genes involved in stress response and immunity (Additional file 4: Section 9) or in groups of genes important for larval growth, such as the cuticular proteins and transcription factors (Additional file 4: Section 10). The largest single cluster of duplicated genes we found anywhere in the H. armigera genome involved 60 cuticular protein RR-2 genes, the corresponding clusters in H. zea and B. mori comprising 58 and 54 genes, respectively (Additional file 4: Section 10). Full details of the genes in these families and functional classifications are provided in Additional file 6: Table S5.

Evolutionary analyses of major gene family expansions in H. armigera and H. zea

Phylogenetic analysis revealed several major duplication events of detoxification and digestion-associated genes within the heliothine lineage which pre-dated the divergence of the two species but nevertheless occurred relatively recently within this lineage. For example, radiations of 11 CYP6AEs in clan 3, 25 CYP340s and 15 CYP4s in clan 4 (Additional file 4: Section 1), 15 of the clade 1 CCEs (Additional file 4: Section 2) and 23 each of the clade 1 trypsins and chymotrypsins (Additional file 4: Section 6) were found in the heliothine lineage. Many of these duplicated genes have been associated with rapid amino acid sequence divergence; for example, divergences within the three large clusters (i.e. clade 1 in each case) of CCEs, trypsins and chymotrypsins in H. armigera have resulted in identity ranges of 45–91%, 47–95% and 48–98%, respectively. Dating analyses using the Bayesian Markov chain Monte Carlo (MCMC) method in Bayesian evolutionary analysis by sampling trees (BEAST) v2.4.3 [55] showed that most of the duplications occurred from more than 1.5 to about 7 Mya (Additional file 4: Table S6; Additional file 7). This range pre-dates the estimate by Mallet et al. [25] and Behere et al. [26] of around 1.5 Mya for the divergence of H. armigera and H. zea, a date supported by our analysis below.

Phylogenetic analyses of the GRs (Additional file 4: Section 8) showed that the very large numbers of those genes in the heliothines compared to B. mori were also largely due to recent amplifications within the heliothine lineage. On the other hand, the larger number of GRs in H. armigera than H. zea could be attributed to the loss of genes in the H. zea lineage, since our divergence dating puts those amplifications earlier than the H. zea/H. armigera split. Furthermore, the fact that 12 of the 20 genes among the 2269 manually curated H. zea gene models which had internal stop codons were GRs (cf. none in H. armigera; Additional file 4: Section 8) suggests that the process of GR gene loss in H. zea may be ongoing.

We next carried out several analyses on the evolutionary changes in the above major gene families. As noted, a large body of empirical evidence from a wide range of insect species enables us to partition the clades within the P450, CCE and GST families into those that have been recurrently associated with detoxification functions and those for which there is little or no empirical evidence of such functions. Nine of the H. armigera genes in the detoxification lineages, but none of the genes in the other lineages, were found to be missing in the H. zea assembly. We then compared the rates of amino acid sequence divergence between the two heliothines for P450, CCE and GST genes in these two sorts of lineages. The Ka/Ks statistics showed that the lineages directly associated with detoxification functions generally diverged in amino acid sequence more rapidly between the two heliothines than did other lineages in these families (Table 2). Finally, we used Tajima’s relative rate test to screen for heterogeneity in rates of amino acid sequence divergence among closely related paralogues in these lineages (Table 3; Additional file 4: Table S7), finding that 42% (19/45) of the pairs in the detoxification lineages yielded significantly different rates, whereas only 14% (2/14) of pairs in other lineages in these families did so. Significant differences in rates were also observed for several major digestive clades, particularly among the chymotrypsins, and for several GR lineages (Additional file 4: Table S7).
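As an illustration of the relative rate screen described above, the following is a minimal sketch of Tajima's relative rate test for one pair of paralogues and an outgroup. It is not the authors' pipeline: the function name, the gap handling and the toy sequences are assumptions made for this example. The test counts sites unique to each of the two ingroup sequences (m1 and m2) and compares them with a chi-square statistic on one degree of freedom.

```python
import math

def tajima_relative_rate(seq1, seq2, outgroup):
    """Tajima's relative rate test on three aligned sequences.

    m1 counts sites where only seq1 differs (seq2 matches the outgroup),
    m2 counts sites where only seq2 differs (seq1 matches the outgroup).
    Under equal rates, (m1 - m2)^2 / (m1 + m2) follows a chi-square
    distribution with 1 degree of freedom.
    """
    if not (len(seq1) == len(seq2) == len(outgroup)):
        raise ValueError("sequences must be aligned to the same length")
    m1 = m2 = 0
    for a, b, o in zip(seq1, seq2, outgroup):
        if "-" in (a, b, o):
            continue  # assumption: alignment columns with gaps are skipped
        if a != b and b == o:
            m1 += 1
        elif a != b and a == o:
            m2 += 1
    if m1 + m2 == 0:
        return m1, m2, 0.0, 1.0
    chi2 = (m1 - m2) ** 2 / (m1 + m2)
    # Survival function of chi-square with 1 df: erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return m1, m2, chi2, p_value

# Hypothetical toy alignment (not real H. armigera sequences)
m1, m2, chi2, p = tajima_relative_rate("AYDEWGHQKV", "ACDEFGHIKL", "ACDEFGHIKL")
print(m1, m2, chi2, round(p, 4))
```

In a screen like the one reported, each paralogue pair would be run through such a test and the fraction of pairs with a significant p-value tallied per lineage class.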

Overall, the picture emerging from the evolutionary analyses is of extensive recent amplification and rapid sequence divergence among several clades of the detoxification, dietary and GR gene families in the heliothine lineage prior to the H. armigera/H. zea split, with the subsequent loss of some detoxification and more GR genes in H. zea. We propose that the gene amplification and diversification prior to the split reflect the emergence of this highly polyphagous branch of the heliothine megapest lineage, while the subsequent loss of genes in H. zea reflects its contraction to a somewhat narrower host range than that of H. armigera. We do not know how their host species differed in pre-agricultural times, but, notwithstanding considerable overlap, there are now some differences between them. Cunningham and Zalucki [27] list hosts from 68 plant families for H. armigera but only from 29 families for H. zea. Many papers on the ecology of H. zea cite its heavy dependence on maize, soy and, in some cases, their wild relatives [56,57,58,59,60,61], while some major papers on H. armigera [57, 62, 63] stress that large populations of the species live on diverse wild hosts outside agricultural areas.

Transcriptomic profiles of the detoxification and digestive genes across tissues and developmental stages

A profile of tissue/stage-specific gene expression was built up from 31 RNA-seq-based transcriptomes from either whole animals or specific tissues/body parts, with 15 of the latter being from fifth instar larvae and 12 from adults (Additional file 4: Table S8). These included tissues important in sensing, detoxification or digestion in adults (antennae and tarsi of each sex) and larvae (mouthparts, salivary gland, gut, tubules, fat body and epidermis). Transcripts from a total of 13,099 genes were detected at levels sufficient to analyse, including 303 of the 353 genes from the detoxification families and 145 of the 193 from the digestion families above (see Additional file 4: Sections 1–7 for full details); the chemosensory genes generally showed too little expression for meaningful analyses.

The results for the P450 clans, CCE clades and GST classes most often associated with detoxification and/or where we found the largest differences in gene number between the species above are summarised in Fig. 2. Relatively high expression (fragments per kilobase of transcript per million mapped reads (FPKM) > 30) was found for many of the CYP6s and CYP9s in various detoxification and digestion-related tissues and for some of the CYP4s in various detoxification-related tissues; for one particular clade of delta GSTs and most of the sigma GSTs in most detoxification and digestive tissues; and for about half of the CCEs in clades 1, 6 and 16, mostly in digestive tissues, principally fifth instar midguts. The ABC transporters were expressed in most tissues screened, with one particular lineage (the ABCG subfamily) expressed at higher levels in several detoxification-related tissues and also salivary glands, while relatively high UGT expression was found for the UGT-40 lineage in various detoxification and digestive tissues (Additional file 4: Sections 4, 5).
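For readers unfamiliar with the two expression units used in Figs. 2 and 3, the sketch below shows how FPKM and log2 fold change are conventionally computed. The function names and the pseudocount are assumptions for illustration; the authors' exact normalisation is described in their Methods and may differ.

```python
import math

def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)

def log2_fold_change(treatment, control, pseudocount=1.0):
    """log2 ratio of expression on a host plant versus the control diet.

    The pseudocount (an assumption here) avoids division by zero for
    genes with no expression in one condition.
    """
    return math.log2((treatment + pseudocount) / (control + pseudocount))

# Hypothetical values: 600 fragments on a 2 kb transcript in a library of
# 10 million mapped fragments sits exactly at the FPKM = 30 threshold.
print(fpkm(600, 2000, 10_000_000))
print(log2_fold_change(7.0, 1.0))
```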

For the digestion-related families, Fig. 3 shows that expression of most midgut proteases was high in fifth instar midguts and to a lesser extent foreguts, with little expression elsewhere. Interestingly, as was the case with the clade 1 CCEs, particular subclades of the clade 1 trypsins and chymotrypsins were only expressed at low levels in any of the digestive (or detoxification) tissues. The lipases showed a more complex pattern of expression, with the galactolipases among the neutral lipases (the clusters containing HarmLipases 33–37 and 66–71) and a recently diverged cluster of acid lipases (HarmLipases 24–28) among the minority heavily expressed in mid- or foregut. On the other hand, the medium- (8–16 residues) and large- (21–26 residues) lidded neutral lipases (HarmLipases 09, 40, 54–56, 04 and 77, and 02, 03, 38 and 93; i.e. groups 5, 7 and 8b respectively in Additional file 4: Section 7), as well as several triacylglycerol and miscellaneous other lipases, were expressed in a range of other tissues (mainly fat body, salivary gland, silk gland and cuticle).

Larval growth and transcriptomic responses of the detoxification and digestion genes on different hosts

H. armigera larvae were raised on seven different species of host plant known to differ in their quality as hosts [64] plus the soy-based standard laboratory diet used in the first transcriptomics experiment above. The laboratory colony is normally maintained on the standard diet, but remains capable of completing its life cycle on host plants such as cotton [65]. Use of this colony allows ready comparison of the responses to different host plants at the whole genome level.

The experiment was designed to measure developmental time to, and weight and gene expression profiles at, a specific developmental stage, i.e. instar 4 plus 1 day. All hosts allowed larvae to develop to this point. There were large differences in the performance of the larvae on the eight diets, with mean development time to harvest varying between 7 and 15 days and mean weight at harvest varying between 13 and 150 mg (Fig. 4). The laboratory diet was clearly the most favourable, with the larvae developing relatively rapidly and growing to the largest size, while Arabidopsis was clearly the poorest, giving the longest development time for a very low larval weight. Maize and green bean yielded midrange values for both measures. Cotton and Capsicum produced relatively small but rapidly developing larvae, whereas tomato and tobacco produced relatively large but slowly developing larvae. It is of interest that the diet allowing most rapid completion of development was in fact cotton; this was also found to be the case by Liu et al. [64].

Effects of rearing diet on development time and weight gain. The mean weights and development times with their standard errors are plotted for larvae from each diet

Gene expression was then profiled at the defined developmental point. Read mapping of RNA-seq data for the whole fourth instar larvae to the OGS2 yielded data for 11,213 genes at analysable levels (a minimum level of 5 reads per million across three libraries). Differential expression (DE) on plant hosts compared to the control diet was then calculated for each of these genes, with 1882 found to be differentially expressed on at least one host (Additional file 8: Table S9). These 1882 genes included 185 of the 546 genes in the detoxification and digestion-related families above (analysable data having been obtained for 452 of the 546). This was a highly significant, greater than threefold enrichment (hypergeometric test p = 1.5 × 10⁻⁴⁸) of these families compared to their representation in the genome overall. The 185 DE genes included approximately one-third of each of the detoxification and digestion sets. The chemosensory proteins were only poorly represented among the 11,213 genes with analysable data; only 10 GRs were analysable and none of them were differentially expressed.
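The enrichment statistic quoted above comes from a hypergeometric test. A minimal sketch of such a test is given below, using only the Python standard library. The function names are hypothetical, and because this section does not state the exact background gene set the authors used, the usage lines rely on toy numbers rather than attempting to reproduce the reported p-value.

```python
from math import comb

def hypergeom_upper_tail(population, family_size, draws, family_hits):
    """P(X >= family_hits) when `draws` genes are sampled without
    replacement from `population` genes, of which `family_size`
    belong to the gene families of interest."""
    upper = min(family_size, draws)
    numerator = sum(
        comb(family_size, i) * comb(population - family_size, draws - i)
        for i in range(family_hits, upper + 1)
    )
    return numerator / comb(population, draws)

def fold_enrichment(population, family_size, draws, family_hits):
    """Observed family fraction among the draws over the background fraction."""
    return (family_hits / draws) / (family_size / population)

# Toy example: 10 genes, 4 in the family, 3 called DE, 2 of those in the family.
print(hypergeom_upper_tail(10, 4, 3, 2))
print(fold_enrichment(10, 4, 3, 2))
```

Here `draws` plays the role of the DE gene set and `family_hits` the DE genes that fall in the detoxification and digestion families; exact integer arithmetic via `math.comb` avoids the rounding problems of naive factorial ratios.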

Initial analysis of DE genes in the major detoxification and digestion-related gene families (Figs. 2 and 3) found wide variation in transcriptional responses among both the hosts and the genes. Nevertheless, some clear patterns emerged. Most of the genes in the five detoxification families were upregulated on the least favoured diet, Arabidopsis, and for four of these families most of the genes screened were downregulated on cotton. For the P450s and CCEs, tobacco also elicited a broadly similar upregulation response to Arabidopsis. For the GSTs, most genes were downregulated on every host other than Arabidopsis, with maize eliciting the most frequent downregulated response. The UGTs also produced downregulated responses on several hosts other than Arabidopsis, but in this case maize elicited some upregulated responses. Most ABC transporters were upregulated on every host other than cotton and to a lesser extent Capsicum.

Many of the genes in the five detoxification-related families which were most prone to differential regulation across the various hosts occurred in physical clusters. These genes included the CYP340K cluster on scaffold_107, the CYP340H cluster on scaffold_371, the CYP341 genes on scaffold_21, the clade 1 esterases mentioned above and a large cluster of 13 UGT33 genes on scaffold_562. Many others, although not always physically clustered, were nevertheless closely related in a phylogenetic sense, for example, the GSTD1n, GSTS2, ABCB and ABCC lineages. In a few of these cases, such as the CYP340 and 341 clusters and the GSTD1n lineage, some of the genes within each cluster/lineage showed similar patterns of DE. However, in most cases, different genes within each cluster or lineage reacted differently to the different hosts. Thus, considerable regulatory evolution has accompanied the diversification of coding sequences within these clusters and lineages.

Importantly, many of the genes in the detoxification families most prone to DE on the various host plants were not necessarily ones that had been heavily expressed in the tissues related to detoxification or digestion on the laboratory diet. Genes prone to host plant-related DE that had been highly expressed in the tissues on the laboratory diet included some CYP6s, CYP337s and delta GSTs. However, genes prone to DE on the different hosts that had shown little expression in the tissues on the laboratory diet included several CYP340s, clade 1 CCEs, ABCs and UGTs (Fig. 2). This accords with empirical evidence that many detoxification genes are inducible in response to xenobiotic exposure [51,52,53].

Many of the midgut proteases also showed DE on different host plants (Fig. 3). Overall, the proteases were more likely to be downregulated on the host plants compared to the protein-rich soy-based laboratory diet, this effect being most pronounced on green bean, cotton and Arabidopsis. These downregulatory responses were most evident in certain regions of the clade 1 trypsin and chymotrypsin clusters. On the other hand, Capsicum and to a lesser extent tobacco elicited several upregulatory responses in other regions of these two clusters, with some specific genes, e.g. Try116 and Try118, showing divergent responses on green bean and Capsicum. For Capsicum and to a lesser extent tomato, upregulatory responses were also evident in the cluster of seven trypsin genes on scaffold_9. Coordinated changes across several hosts were evident for Tryp114–120 within the clade 1 trypsin cluster but, as with the detoxification genes above, even closely linked genes within genomic clusters generally diverged in their transcriptional responses across the panel of diets.

Many of the acid lipases, but only a phylogenetically restricted minority of the neutral lipases (clades 1 and 2, each with nine genes), also showed significant DE across the various diets (Fig. 3). In contrast to the proteases, the diet-responsive lipases were most often upregulated on the host plants as opposed to the laboratory diet, which is consistent with the fact that laboratory diets generally have higher levels of free fatty acids than the host plants [66]. Interestingly, tobacco, Arabidopsis and to a lesser extent green bean elicited similar responses from many of the genes in both sets of lipases. Otherwise, however, the lipases showed a diversity of host responses more akin to the diversity seen in the other gene families above. Thus, there were relatively few cases of closely related lipase genes within clusters showing the same expression profiles across the various diets and, as with the other systems above, those that did generally involved the most recently diverged clusters (e.g. the neutral lipases HarmLipases 82–84; 67, 69 and 70; and 66, 71 and 72; Additional file 4: Section 7).

Fewer genes implicated in growth and morphogenesis and stress responses showed DE across the hosts (Additional file 4: Sections 9, 10) than did the families above, although some involved in growth and morphogenesis showed DE on cotton and Arabidopsis, and some stress response genes showed DE on Capsicum. The cotton-specific expression changes may be due to the faster rate of developmental stage progression on this host, meaning that more gene families, pathways and networks show variable expression at any particular time point.

Overall, most (1199) of the total set of 1882 DE genes across the genome were only identified as DE on a single diet, suggesting a specific response to the particular characteristics of the host plant (Fig. 5). Each host plant elicited DE in at least 200 genes, with cotton, Arabidopsis and Capsicum each affecting more than 600. The most common shared responses involved genes that were differentially expressed on cotton and Capsicum (124 genes) and to a lesser extent on Arabidopsis and tobacco (58 genes). Notably, Arabidopsis and tobacco were the poorest hosts (long developmental time and low larval weight), and cotton and Capsicum were also relatively inefficiently used (shorter developmental time, but still relatively low weight gain) (Fig. 4).

Numbers of genes differentially expressed on each of the different diets. The seven diets are listed at the bottom of the figure, with the total numbers of DE genes on each diet shown by the horizontal histogram at the lower left. The main histogram shows the number of DE genes summed for each diet individually and for various diet combinations. The diets for which each number is calculated are denoted by black dots, representing either a single diet plant or a combination of multiple different diets. See also Additional file 3: Figure S3 for a principal component analysis showing the relationships among the transcriptional responses to the different diets

Integrating the tissue/developmental stage and host-response transcriptomics

Two weighted gene co-expression networks were constructed, one for each of the tissue/developmental stage and host-response data sets, using sets of 13,099 and 7977 rigorously filtered genes, respectively (see Methods). Each network assigned each gene in the data set to a co-expression module containing genes with the most similar expression profiles to it.
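To make the idea of a weighted co-expression network concrete, the sketch below computes a soft-thresholded adjacency between gene expression profiles, the usual first step before genes are grouped into modules. This assumes the common unsigned construction, adjacency = |Pearson r|^β; the value β = 6, the function names and the toy profiles are illustrative assumptions, and the authors' actual parameters are given in their Methods.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def soft_threshold_adjacency(profiles, beta=6):
    """Unsigned weighted adjacency |r|**beta for every gene pair.

    `profiles` maps gene name -> list of expression values across samples.
    Raising |r| to a power keeps strong correlations and suppresses weak ones.
    """
    genes = sorted(profiles)
    return {
        (g, h): abs(pearson(profiles[g], profiles[h])) ** beta
        for i, g in enumerate(genes)
        for h in genes[i + 1:]
    }

# Hypothetical expression profiles across four samples
toy = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 1.0, 3.0, 2.0],
}
for pair, weight in soft_threshold_adjacency(toy).items():
    print(pair, round(weight, 6))
```

Module detection would then cluster genes on a dissimilarity derived from these weights, which is the step that assigns each gene to a single co-expression module.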

Five of the 47 co-expression modules recovered from the tissue/developmental stage network were highly enriched for genes among the 1882 identified above as differentially expressed in response to diet; 529 of the 1456 genes in these five modules were among the 1882 DE genes (Fig. 6). These five modules highlight the important tissues involved in that response, with, as expected, tissues implicated in detoxification and digestion being strongly represented: four of these modules contained genes expressed specifically in the larval fore/midgut (T1), the Malpighian tubules (T2), the fat body (T3) or in all detoxification/digestion tissues (T4). The fifth module (T5) corresponds to genes expressed in the sensory apparatus (larval antenna/mouthparts and adult antennae/tarsus), highlighting that sensory/behavioural responses play a key role in host plant adaptation in H. armigera [27].

Expression profiles for selected co-expression modules from the tissue/developmental stage transcriptomic experiment that are enriched for diet-responsive genes. The five modules for which expression profiles are shown are those most enriched for genes called as DE in the host-response experiment (see text). Expression (FPKM) profiles for each module are shown on the left, with the tissue types (see text) identified by colour as in the legend. The composition of each module is described in the central panels, showing the total number (N) of genes per module, the number that are DE, the number in all diet co-expression modules (DM) and the number in the major gene family (GF) classes defined by the key below. Major functions enriched in each module are noted on the right of the figure

The host-response co-expression network yielded 37 modules, of which nine were enriched for genes in the 1882 DE gene set above (675 of the 1485 genes in these nine modules being DE genes) and are therefore most likely to contain networks of genes involved in host response (Fig. 7). Four (D8, D10, D21 and D25) of these nine modules were also significantly enriched for the 546 genes in the families identified a priori as containing general detoxification (D10) and digestion (D8, specifically protease) related functions (Fig. 7), as was one further module, D37 (Additional file 4: Table S10a; Additional file 9: Table S10b). Five of the nine modules (D8, D10 and D25 again, as well as D23 and D24) were also significantly enriched for the 1456 genes in the five stage/tissue co-expression modules involving tissues with detoxification- and digestion-related functions (Additional file 4: Table S10a), consistent with these modules’ enrichment for DE genes. Three further diet modules were identified as also enriched for genes in these developmental modules, one of which (D37, the other two being D3 and D32), as noted, had also been enriched for the 546 a priori identified genes in detoxification/digestion gene families (Additional file 4: Table S10a). D37 is of particular note, being specifically enriched (27 of its 32 members) for midgut trypsin and chymotrypsin sequences in the two large clusters shown in Fig. 3; while expressed at relatively low levels on the control laboratory diet, these genes were all upregulated on several of the plant hosts.

Expression profiles for selected co-expression modules from the host-response transcriptomic experiment. The eight modules for which expression profiles are shown are those most enriched for DE genes. Four of these modules (see text) are also significantly enriched in genes from the detoxification- and digestion-related families. Expression (log2FC) profiles for each module are shown on the left. The composition of each module is described in the central panels, showing the total number (N) of genes per module, the number that are DE, the number in the five tissue/developmental stage modules T1–T5 (TM) and the number in the major gene family (GF) classes defined by the key below. Major functions enriched in each module are noted on the right of the figure. See Additional file 4: Section 11 for more detailed analyses of the host-response network including aspects illustrated by the co-expression modules D20 and D3

Unsurprisingly, the three diet modules D8, D10 and D25, which were significantly enriched for all three sets of genes above (i.e. the 1882 DE genes, the 546 in the key gene families and 1456 in the five key tissue/developmental stage modules), were all over-represented with GO terms covering functional annotations such as catabolism, amylase, endopeptidase, carboxylester hydrolase and monooxygenase (Additional file 3: Figure S4). D25 alone contains 11 P450s from clans 3 and 4, 10 CCEs, including six from clade 1, nine UGTs, two delta class GSTs, a trypsin and a lipase. Notably also the transcription factors in these modules, three each in D8 and D10 and one more in D25 (Additional file 4: Section 11), are candidates for the crucial upstream regulatory roles controlling host responses (see also Additional file 4: Section 10; Additional file 10). The plants on which these modules with significant numbers of the transcription factors (e.g. D8 and D10) were most upregulated (cotton, Capsicum and Arabidopsis) were among the most problematic or inefficiently used of the hosts tested.

Taken together, the expression data illustrate the considerable extent to which the H. armigera larval host response involves coordinated expression, on a tissue-specific basis, of specific genes, including a significant number of those in the major detoxification- and digestion-related families. Further, the diversity of co-expression patterns across the different host plants emphasises the transcriptomic plasticity of H. armigera larvae. It will be of great interest now to test whether H. zea shows comparable levels of transcriptomic plasticity on similar hosts.

Resequencing data

Whole genome sequence data from a total of four H. armigera lines and five H. zea lines/individuals were analysed to further investigate the genetic relationships between the two species. In addition to the reference lines for the two species, from Australia and North America, respectively, the sample included two Chinese and one African-derived H. armigera lines and four H. zea individuals from North America. Single-nucleotide polymorphisms (SNPs) in the nine resequenced genomes were called in two ways, one from each of the two species’ reference sequences.

When the SNPs were called from the H. armigera reference sequence, a multi-dimensional scaling (MDS) analysis placed the resequenced genomes for each species very close to each other and well separated from the other species, but the H. armigera reference line was well separated from both these groups, albeit closer to the other H. armigera than the H. zea samples (Fig. 8a). When the SNPs were called from the H. zea reference line, the MDS placed all five H. zea sequences close to one another and well separated from all the H. armigera samples, but the latter could then be separated in the second MDS dimension, with one Chinese sequence (SW) slightly removed from both the other Chinese sequence (AY) and the African-collected laboratory strain (SCD) (Fig. 8b). The separation of the H. armigera reference from the other H. armigera lines (Fig. 8a) probably reflects the fact that the H. armigera reference line represents a distinct subspecies, H. armigera conferta, which is present only in Australia, New Zealand and some south-west Pacific islands [23, 37]. Notwithstanding their differing geographic ranges, both subspecies are found in a very wide range of ecological habitats, and there is no evidence as yet that they differ in their ability to inhabit any specific ecology [27, 57, 63, 67]. Whole genome sequences of comparable quality of the two H. armigera subspecies will be needed to identify particular genome sequences distinguishing the two.

Population structure. Results of MDS analyses, using (a) H. armigera and (b) H. zea as the reference strain. The proportion of variance explained by each dimension is given as a percentage on the axis label. To include the reference strains on these plots, genotypes for each reference strain were recoded as 0/0

With both MDS analyses supporting the view that H. armigera and H. zea are indeed separate species, we next estimated the date of the divergence between H. armigera and H. zea by conducting a coalescence analysis using sequence data for 16 recently diverged loci (Additional file 3: Figure S5; Additional files 11 and 12). The resulting tree, with H. punctigera as the outgroup, confirmed H. armigera and H. zea as two distinct species. The divergence dates between the three species were then estimated by applying the coalescence to the 12 most rapidly evolving of the 16 genes [68]. We calculated that H. armigera and H. zea diverged 1.4 ± 0.1 Mya, their lineage and that leading to H. punctigera diverged 2.8 ± 0.2 Mya and the Australian H. armigera lineage diverged from the other analysed H. armigera lineages 0.9 ± 0.1 Mya. Our coalescent analyses are therefore consistent with the general assumption in indicating that all our H. zea lines diverged from H. armigera prior to the divergence among the sequenced H. armigera lines (although Leite et al. [20] had suggested H. zea was the basal lineage). The estimate for the H. armigera/H. zea split agrees well with previous estimates of around 1.5 Mya for this date, based on biochemical genetics [25] and mitochondrial DNA (mtDNA) phylogenies [26] using a mutation rate estimate of 2% per million years in Drosophila mitochondrial DNA [69]. We find no evidence for introgression between the species since. Our estimates also concur with those of Cho et al. [12] in placing H. punctigera basal to the H. armigera/H. zea lineage, although the date of this divergence has not previously been estimated.

Estimates of genome-wide diversity (pi) were consistently about twice as high within the resequenced H. armigera genomes as they were within the resequenced H. zea genomes (Additional file 3: Figure S6), regardless of which species was used as the reference. Interestingly, however, the H. armigera sequences showed lower diversity values for non-synonymous sites compared with synonymous sites than did H. zea (Additional file 3: Figures S6, S7). Thus, although there was greater heterozygosity overall in the H. armigera samples, their non-synonymous sites showed more evidence of selective constraint than did the H. zea samples. Note that the absolute values for diversity shown in Additional file 3: Figure S6 (~0.015 for H. armigera and ~0.004 for H. zea) are lower than those reported by others (e.g. see [37, 70]), probably due to the more stringent filtering used to allow us to compare individuals from the two species (see Methods). Nevertheless, the relative levels of polymorphism are consistent across all these studies.

Consistent with the estimates of heterozygosity, Bayesian skyline plot analysis using the resequencing data consistently estimated a much (~10×) greater contemporary effective population size for H. armigera than for H. zea (Ne ~2.5 × 10⁸ and 2.5 × 10⁷, respectively). In addition, our estimates of effective population size change through time indicated an expansion in H. armigera around 6–8 Mya. By contrast, the effective population size of H. zea increased only slowly from about 1.5 Mya. All these values were obtained using the corresponding reference genomes to call the SNPs, but essentially the same results were obtained whichever reference genome was used (data not shown).

We found small but significant positive correlations between H. armigera and H. zea in the pattern of variation in pi across their genomes. This was true for both their synonymous and non-synonymous sites, although the correlation was slightly stronger for the synonymous sites (rho = 0.421 cf. 0.387, p < 0.001 for both; Additional file 3: Figure S7). This difference is to be expected, as lineage-specific selective pressures will result in greater diversity between the species at non-synonymous sites. The size of the correlations seen for both the synonymous and non-synonymous sites implies that, while a large proportion of variance in diversity across genomic bins is shared across the two species, the majority (~0.6) of this variance is in fact not shared between them.

Candidate insecticide resistance genes

Paralleling its greater host range, H. armigera is also considerably more prone to develop insecticide resistance than H. zea, even though many populations of both are heavily exposed to insecticides [30, 71]. H. armigera has developed resistance to many chemical insecticides, including organochlorines, organophosphates, carbamates and pyrethroids (see [30, 72,73,74] for reviews), and, more recently, to the Cry1Ab, Cry1Ac and Cry2Ab Bt toxins delivered through transgenic crops [75]. By contrast, in H. zea significant levels of resistance have only been found for organochlorines and pyrethroids and, even then, relatively infrequently [30].

In most of the H. armigera cases at least one of the underlying mechanisms is known, but specific mutations explaining some of the resistance have only been identified for three of them, specifically the metabolic resistance to pyrethroids and the Cry1Ab and Cry2Ab resistances [31, 32, 76, 77]. However, in several of the other cases bioassay and biochemical information on the resistance in H. armigera or H. zea, together with precedent molecular studies from other species, indicate the genes likely to be involved. We therefore screened our sequence data for the presence of intact copies of those genes, their expression profiles and mutations recurrently found to confer resistance in other species. The reference Australian H. armigera colony and the resequenced African strain are known to be susceptible to most if not all the insecticides above, but the two Chinese lines could be resistant to pyrethroids and possibly other chemical insecticides [71, 78]. The Chinese AY line had also been shown to be resistant to the Cry1Ac Bt toxin [79]. The reference H. zea line is susceptible to all the insecticides above, and the resequenced lines were also derived from populations known not to have any significant resistances. The results of our screens are detailed in Additional file 4: Section 12 and summarised below.

Resistance due to insensitive target sites has been demonstrated for organochlorines, organophosphates and pyrethroids in H. armigera. These resistances would be expected to involve gamma-aminobutyric acid (GABA)-gated chloride ion channels, acetylcholinesterase-1 or possibly acetylcholinesterase-2, and voltage-gated sodium channels, respectively. We found good models of the key genes, with wild-type sequences lacking known resistance mutations, in both species. The transcriptome data show them to be well expressed in neural tissue. Both H. armigera and H. zea were found to have orthologues of certain additional GABA-gated chloride ion channel genes found in other Lepidoptera. Although these genes have sequence variations at locations associated with resistance mutations in other insects, none of these changes in Lepidoptera have been associated with resistance (Additional file 4: Section 12).

Resistance due to enhanced metabolism of the insecticide has been demonstrated for organophosphates and pyrethroids in H. armigera. The organophosphate resistance is correlated with the upregulation of several clade 1 carboxylesterases [80], particularly CCE001g, but which of the overexpressed CCEs actually causes the resistance remains unknown. The pyrethroid resistance is mainly caused by enhanced P450-mediated metabolism, and much of this is due to novel CYP337B3 genes resulting from fusions of parts of the adjacent CYP337B1 and CYP337B2 genes through unequal crossing over [76, 81]. Although CYP337B3 alleles have been identified at various frequencies in populations around the world, there was no evidence, either from screening for reads that cross the fusion junction or from read densities for the constituent sequences, for their existence in any of the sequenced lines for either species. Another P450 gene that is interesting in relation to insecticide resistance is the CYP6AE14 gene. This P450 was originally implicated in the metabolism of a particular insecticidal compound produced by cotton (gossypol) but is now thought to have a more general role in detoxifying various plant defense chemistries [82,83,84]. Notably, we find no evidence of the CYP6AE14 gene in any of our H. zea genome or transcriptome data.

Several molecular mechanisms have been reported for resistances to Bt toxins in H. armigera. They involve disruptions to the cadherin [31] or ABCC2 transporter [77] proteins in the larval midgut for the Cry1Ab/c toxins, and to ABCA2 proteins for the Cry2Ab toxin [32]. All these resistance mutations are recessive. We find intact gene models for these genes in both reference genomes and the resequenced lines. Although the AY strain is known to be resistant to Cry1Ac, that resistance is dominant [79] and therefore likely to be due to mutation in an unknown gene different from those mentioned above.

The genomes of both species therefore contain good models of the genes encoding the target sites for several classes of chemical insecticides and Bt toxins for which target site resistance has been reported in H. armigera or other species. This would be expected given the known essential neurological functions of the chemical insecticide targets and the indications of important functions for the Bt targets provided by the fitness costs in the absence of Bt commonly associated with Bt resistance mutants [85]. Notably, however, we found two presence/absence differences in genes implicated in metabolic resistance to chemical insecticides or plant toxins in H. armigera. In both cases, as described above, the gene has been found in H. armigera populations but not in our H. zea data. One is the chimeric CYP337B3 gene, and the other is CYP6AE14. These cases may represent benefits to H. armigera from specific neofunctionalisations enabled by the extensive duplication of its detoxification genes. Also relevant here is our evidence for this species’ diverse upregulatory responses of various detoxification genes to different hosts. Given emerging evidence for similar sorts of upregulatory responses to various insecticides [72], and the abilities of some of the detoxification enzymes to bind/transform a wide range of insecticides [86,87,88], its unusually large repertoire of detoxification enzymes may provide H. armigera with a high level of metabolic tolerance to many insecticidal chemistries.

Amplifying and Using the Library

Once you receive your library from Addgene, check the library information to see if it should be amplified before you conduct your screen. If you need to amplify the library, please refer to the depositor’s protocol for the best results.

For some libraries, plasmid DNA can be delivered directly to the cells of interest. With others, notably pooled lentiviral plasmid libraries, the plasmids must first be used to make virus. This pooled virus is subsequently used to deliver the plasmids to the cells of interest. In either case, next-generation sequencing of the maxiprep DNA is recommended to verify that the library is complete - an incomplete library may lead to false positives or false negatives in later experiments, and can also negatively affect data reproducibility.



We sampled leaves from accessions at the Cacao Research Unit at the University of West Indies and CATIE in Costa Rica (Supplementary Table 1).

DNA extraction and sequencing libraries preparation

Samples processed at Stanford University were prepared as follows:

DNA was extracted using ZR Plant/Seed DNA MiniPrep™ (Zymo Research Inc). Approximately 3 g of leaf material per extraction per sample was cut and placed in homogenization tubes with ceramic pearls and lysis buffer. Samples were homogenized in a FastPrep-24™ (MP Biomedicals, LLC) placed in a cold room at 4 °C for 60 s at a speed of 4.5 m/s. If the tissue was not homogenized thoroughly, it was homogenized for an additional 20–40 s at the same speed. DNA was quantified using a Qubit™ 3.0 fluorometer (ThermoFisher Scientific) with a dsDNA HS Assay Kit. Additionally, overall quality of extracted DNA was assessed with 2% E-Gel (Invitrogen, Carlsbad, CA). Most of the samples were prepared using Nextera DNA Sample Preparation Kits (Epicentre, Chicago, IL, USA) and NEBnext® Ultra DNA Library Prep Kit for Illumina (New England BioLabs, Inc). The remaining samples were prepared by first shearing the genomic DNA using a M220 Focused-ultrasonicator™ (Covaris Inc) and then using the NEBnext® Ultra DNA Library Prep Kit for Illumina (New England BioLabs, Inc). Libraries were quantified on an Agilent 2100 Bioanalyzer High Sensitivity DNA chip for concentration and size distribution, pooled in sets of 3–4 per batch, and sequenced on the HiSeq 2000/2500 platform at the Stanford Sequencing Service Center (100 cycles, paired read mode).

Samples processed at Indiana University were prepared as follows:

DNA was extracted using a protocol customized for enrichment of high molecular weight DNA from cacao leaves. Approximately 450 mg of leaf material per sample was ground to powder under liquid N2 using mortar and pestle. Tissue powder was homogenized and washed twice by vortexing in 3 ml of ice cold 100 mM HEPES, 0.1% PVP-40, 4% β-mercaptoethanol, followed by centrifugation at 7000 rpm in an Eppendorf F35-6-30 rotor. Nuclei were extracted from tissue pellets on ice in 50 mM Tris-Cl pH 8.0, 50 mM EDTA, and 50 mM NaCl with 15% sucrose, and centrifuged at 3600 rpm to pellet trace cellular debris. Nuclei were lysed at 70 °C for 15 min in 20 mM Tris-Cl pH 8.0, 10 mM EDTA with the addition of SDS to a final concentration of 1.5%. Protein was precipitated on ice with the addition of NH4OAc to a final concentration of 2.7 M and pelleted twice by centrifugation at 7000 rpm. DNA was precipitated using gentle inversion in an equal volume of cold isopropanol, followed by centrifugation at 7000 rpm. DNA pellets were washed in 70% ethanol and resuspended in 10 mM Tris-Cl, 1 mM EDTA using wide bore pipette tips. DNA quality and quantity in the high molecular weight fraction (24 to ≥ 60 kb) was assessed by migration on Genomic DNA Screen Tape, Agilent TapeStation 2200 Software (A.01.04) (Agilent) and secondarily quantified by fluorimetry using the dsDNA HS Assay Kit (Invitrogen) with a Qubit™ 2.0 fluorometer (ThermoFisher). Sequencing libraries were prepared either as unamplified NGS libraries, using the PCR-free DNA library kit (KAPA), or as minimally amplified libraries, using the TruSeq DNA Sample Prep Kit (Illumina) with four cycles of PCR, at the Roy J. Carver Biotechnology Center, University of Illinois at Urbana–Champaign (UIUC). All library preparation steps followed the manufacturer's instructions, with the exception that after shearing for minimally amplified libraries, DNA was cleaned through a Zymo column and size selected to retain only 400–600 bp fragments. All libraries were evaluated for quality using an Agilent 2100 Bioanalyzer High Sensitivity DNA Assay (Agilent), quantified by qPCR, pooled in sets of 12 at equimolar concentration, and sequenced as paired 2 × 161 nt reads on a UIUC HiSeq2500 instrument using HiSeq SBS sequencing kit version 4. Fastq files were generated with CASAVA 1.8.2.

Read processing and SNP identification

The Illumina data were basecalled using Illumina software CASAVA 1.8.2, and sequences were demultiplexed with a requirement of full match of the six-nucleotide index that was used for library preparation. Samples prepared using Nextera were hard clipped 13 nt from the 5’ end. Following demultiplexing, raw sequence data were analyzed for quality using FastQC [62]. We performed adaptive quality trimming (setting a quality threshold of 25) and additional hard trimming of the reads based on stabilization of the base composition on the 5’ end of the sequences using TrimGalore! and cutadapt [63, 64]. Sets of reads from individual samples were mapped to the Matina-v1.1 reference genome [21], using the Burrows-Wheeler Aligner BWA [65] with relaxed conditions for the editing distance (0.06), as it was expected that T. cacao has a high genetic diversity. Aligned sam files were preprocessed prior to performing SNP identification with Samtools/Picard Tools and Bamtools [66, 67, 68] to mark duplicates, fix mate pair information, correct unmapped reads flags, and obtain overall mapping statistics. We followed recommendations of the Genome Analysis Toolkit to perform base quality recalibration and local realignment to minimize false positives during the SNP calling procedure [69]. Finally, we performed genotype calling using the Real Time Genomics population analysis tool to speed the process of SNP identification [70]. Variants were also called with GATK, and a suitable subset of SNPs was kept after a combination of Variant Quality Score Recalibration (VQSR) and hard filters that included thresholds in coverage (maximum coverage = 200*50×), quality by depth (QD 2) estimated from the division of variant confidence by unfiltered depth of non-reference samples, Fisher strand test (FS 50), and the root mean square of the mapping quality across samples (MQ 30). Variants identified were phased, per population, using shapeit v2.12 on the subset of variants with minor allele frequency (MAF) > 0.05 [71, 72]. The phasing was performed per chromosome for the ten main chromosomes using only biallelic sites.

Identified SNPs were annotated using SNPEff [73]. For this, we used the current gene annotation from the Matina-v1.1 reference genome [21] to construct a new database for Theobroma cacao. This database was used to annotate the observed polymorphisms following their potential effect on gene expression and functionality according to their position with respect to the coding regions.

Population genetic analyses

We characterized the distribution of genetic variation in the populations using two estimators: Watterson’s theta (θw) [74] and the number of pairwise differences per site (π) [75]. We used vcftools [76] to estimate both statistics in windows of 1 kb. Generalized linear models to explain the differences in diversity among populations are explained in the section Distribution of Genetic variation among genetic groups in the Supplementary text.
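As a rough illustration of the two diversity statistics named above, the following sketch computes π and Watterson's θ per site from a toy set of aligned haplotypes. It is not the vcftools implementation: the function names, the four-site window and the haplotype strings are invented for the example, and missing data are assumed to be absent.

```python
from itertools import combinations

def nucleotide_diversity(haplotypes, n_sites):
    """pi: mean number of pairwise differences, per site."""
    pairs = list(combinations(haplotypes, 2))
    total_diffs = sum(
        sum(a != b for a, b in zip(h1, h2)) for h1, h2 in pairs
    )
    return total_diffs / len(pairs) / n_sites

def watterson_theta(haplotypes, n_sites):
    """theta_w: segregating sites S divided by a_n = sum_{i=1}^{n-1} 1/i, per site."""
    n = len(haplotypes)
    segregating = sum(len(set(column)) > 1 for column in zip(*haplotypes))
    a_n = sum(1.0 / i for i in range(1, n))
    return segregating / a_n / n_sites

# Hypothetical 4-site window sampled in four haplotypes (toy data only).
haps = ["AAGT", "AAGT", "ATGT", "ATGC"]
pi = nucleotide_diversity(haps, n_sites=4)
theta_w = watterson_theta(haps, n_sites=4)
```

In a real analysis the window would be 1 kb, as in the text, and the per-window values would be averaged across the genome.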

We used ADMIXTURE [24], an implementation of an approach similar to the well-known STRUCTURE [77]. Based on an expectation–maximization algorithm, ADMIXTURE uses a maximum likelihood-based approach to assign ancestry genome wide and visualize the genetic structure of the T. cacao populations. A cross-validation procedure is employed to select the most likely number of clusters that explains the structure of the data [24]. We filtered our data and restricted our analysis to SNPs with minor allele frequency over 5%, and we also pruned the data for LD as the approximations assume unlinked loci. For this, we used vcftools [76] to estimate LD (r²) scores for each pair of SNPs in windows of 2000 SNPs and excluded one of the pair if r² > 0.45. The windows were selected with 500 SNPs of overlap. The final dataset contained 63,374 SNPs. We analyzed this dataset using ADMIXTURE and set 2–18 ancestral populations (K = 2 to K = 18) in 100 replicates. We checked for convergence of individual ADMIXTURE runs at each K by evaluating the maximum difference in log likelihood scores in fractions of runs with the highest log likelihood scores at each K. We assume that a global log likelihood maximum was reached at a given K if at least 10% of the runs with the highest score show minimal variation in log likelihood scores and present consistent assignment to the groups. It has been shown [78] that a threshold of 5 log likelihood units is conservative enough to assure similar results to those obtained with CLUMPP [79]. In addition to the admixture analysis, we performed a multidimensional scaling analysis on the same set of SNPs employed for ADMIXTURE. First, we normalized the data (centered and standardized) following previous recommendations [80] and performed MDS analyses using Singular Value Decomposition on the normalized data using the cmdscale function in R.
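The LD-pruning step can be pictured with a small sketch. The greedy strategy below (keep a SNP unless it is correlated above the threshold with an already kept SNP in the same window) is an assumption made for illustration and is not necessarily how the authors combined the vcftools output; the function name and toy genotype matrix are hypothetical, and all SNPs are assumed to be polymorphic, as they would be after the MAF filter.

```python
import numpy as np

def ld_prune(genotypes, r2_threshold=0.45, window=2000):
    """Greedy LD pruning on an (n_samples, n_snps) matrix of 0/1/2 allele counts.

    Returns the column indices of the SNPs that are kept.
    """
    g = np.asarray(genotypes, dtype=float)
    keep = []
    for j in range(g.shape[1]):
        linked = False
        for k in keep:
            if j - k >= window:
                continue  # outside the window, so the pair is not compared
            r = np.corrcoef(g[:, j], g[:, k])[0, 1]
            if r * r > r2_threshold:
                linked = True
                break
        if not linked:
            keep.append(j)
    return keep

# Toy data: SNP 1 duplicates SNP 0 (r^2 = 1), SNPs 2 and 3 are only weakly correlated.
toy_genotypes = [
    [0, 0, 2, 1],
    [1, 1, 0, 0],
    [2, 2, 1, 2],
    [0, 0, 2, 0],
    [1, 1, 0, 2],
]
kept = ld_prune(toy_genotypes)
```

Here r² is the squared Pearson correlation of genotype counts, a common stand-in for haplotype-based r² when phase is unknown.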

We measured population differentiation resulting from restrictions in gene flow between populations using Weir and Cockerham’s FST estimator [81] in windows of 5 kb, after filtering out low-frequency alleles. To summarize the genome-wide differentiation among populations, we estimated the mean of FST estimators across windows and standard error for every pair of comparisons.

The map and location of populations in South America were created using ggmaps in R. The maps used in ggmaps were obtained from Google Maps (open access source), and the diamonds used for the positioning of the populations were enlarged in Illustrator.

We fitted a generalized linear model to explain the differences in genetic diversity along the Pacific/Atlantic axis of genetic differentiation captured in the second component of a multidimensional scaling. For this, we estimated the centroids for PC1 and PC2 of the data presented in Fig. 1b. These centroids were used as predictors (βi) to explain the differences in mean genetic diversity per population (measured as π, Y in the following model) under a simple linear model with a Gaussian family \(Y = \beta_0 + \beta_i + \epsilon\). Admixed individuals were excluded from the analysis.

We used the model-based approach implemented in TreeMix [26] to infer the relationships between the ten main groups and to identify signatures of domestication.

We used two methods to infer the demographic history of populations using individual genomes and small sets of individuals per population. First, we used the pairwise sequentially Markovian coalescent as implemented in PSMC [28]; second, we used SMC++, a likelihood-free method that can leverage information from multiple individuals from the population (as opposed to PSMC) to infer population size changes in the past [29]. We assumed a mutation rate μ = 7.1 × 10⁻⁹ mutations × bp⁻¹ × gen⁻¹ [82, 83]. We also examined the effect of uncertainty in mutation rates by including analyses following recent work suggesting that mutation rates could be half of that estimated previously, on the order of 3.1 × 10⁻⁹ mutations × bp⁻¹ × gen⁻¹ [84]. Additional details are provided in the Supplementary text. We assumed a generation time of 5 years, based on the observation that it takes 5 years on average to go from seed to seed in cacao. The figures describing the evolutionary history inferred with PSMC were obtained by fitting a smoothing spline across the individual histories inferred for each sample from the same population.

We estimated inbreeding using a simple moment estimator F = 1 – Hetobs/Heexp [85] to assess the magnitude of inbreeding experienced by individuals in each population. We then addressed the impact of historical population size on estimated inbreeding using an ANOVA to compare the estimated inbreeding F-statistics among populations.
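A minimal sketch of this moment estimator follows. It assumes that the expected heterozygosity at each site is the Hardy-Weinberg value 2p(1 − p) computed from a population allele frequency, which the text does not spell out; the function name and the toy genotypes are hypothetical.

```python
def inbreeding_coefficient(genotypes, allele_freqs):
    """F = 1 - (observed heterozygous sites) / (expected heterozygous sites).

    genotypes: alternate-allele counts (0, 1 or 2) for one individual.
    allele_freqs: population alternate-allele frequency at the same sites.
    """
    observed_het = sum(1 for g in genotypes if g == 1)
    # Hardy-Weinberg expectation per site (assumption of this sketch).
    expected_het = sum(2 * p * (1 - p) for p in allele_freqs)
    return 1 - observed_het / expected_het

freqs = [0.5, 0.5, 0.5, 0.5]                                 # toy frequencies
f_outbred = inbreeding_coefficient([1, 0, 2, 1], freqs)      # 2 observed, 2 expected
f_partial = inbreeding_coefficient([1, 0, 0, 0], freqs)      # 1 observed, 2 expected
f_inbred = inbreeding_coefficient([0, 2, 0, 2], freqs)       # 0 observed, 2 expected
```

Values near 0 indicate heterozygosity in line with random mating, and values near 1 indicate an almost fully homozygous individual.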

The association between effective population size and inbreeding was examined with a generalized linear model of the form \(Y = \beta_0 + \beta_i + \epsilon\), where Y is the inbreeding coefficient F, β0 is the intercept, and βi is the effect of effective population size. As a predictor, we used the harmonic mean of the effective population sizes estimated under the PSMC model for each population under the population genetic assumption that the smallest population size experienced by the population will strongly influence the magnitude of drift.
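The choice of the harmonic mean can be made concrete with a short sketch: a single small population size pulls the harmonic mean far below the arithmetic mean. The sizes used here are invented for illustration and are not estimates from the study.

```python
def harmonic_mean(sizes):
    """Harmonic mean of a sequence of positive population sizes."""
    return len(sizes) / sum(1.0 / n for n in sizes)

# Hypothetical trajectory: two large epochs and one bottleneck.
ne_through_time = [1000, 1000, 10]
hm = harmonic_mean(ne_through_time)
am = sum(ne_through_time) / len(ne_through_time)
```

For these toy values the harmonic mean is about 29, while the arithmetic mean is 670, which is why the bottleneck dominates the expected magnitude of drift.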

Using the inferred relationships among populations obtained with TreeMix, we selected the population most closely related to domesticated Criollo (the Curaray population) to perform detailed demographic analyses and infer time of divergence between populations and demographic trajectories for the populations. We used an approximation based on the comparison of the observed site frequency spectrum and simulations in a maximum likelihood framework to decide which model better explained the data, as implemented in the program δaδi [33]. We informed the three main models tested (see Supplementary Information) with the aid of the PSMC results. Akaike information criteria and magnitude of the residuals were employed for model selection. For the estimation of confidence intervals, we performed 1000 bootstraps of the observed dataset and performed estimations using the selected demographic model. Additional details on the estimation of confidence intervals, uncertainty of the generation time and mutation rates, and detailed analysis of the likelihood surface for parameters of interest are provided in the Supplementary Information.

Regions under selection were inferred by analyzing departures from the site frequency spectrum. Analyses performed with XP-CLR [45] allowed us to detect local deviations from the genome-wide site frequency spectrum. For this, we set fixed windows of 0.05 cM for 200 SNPs and grid size of 2 kb. For these analyses, we used the Curaray population as reference and took the top 1% windows with significant XP-CLR score. In addition, we selected those windows of 5 kb in which FST values corresponded to the top 1% of the distribution to examine regions of the genome which potentially present higher differentiation than expected.

Cost-of-domestication analysis

We inferred deleterious and tolerated effects for non-synonymous mutations using a method that uses phylogenetic conservatism. To deploy this method as implemented in Sorting Intolerant from Tolerant (SIFT) 4G [58], we built a custom database of predictions for all possible non-synonymous SNPs using SIFT4G for T. cacao. SIFT outputs a SIFT score for each amino acid substitution; the score ranges from 0 to 1. The amino acid substitution is predicted deleterious if the score is ≤ 0.05 and tolerated if the score is > 0.05.

We used a log-linear model to test for differences in the number of deleterious and tolerated mutations between Criollo and Amelonado. Amelonado was chosen because of the similar levels of inbreeding observed. Because of the differences in sample size, we estimated for each population the number of deleterious and tolerated mutations at three different allele frequency classes: rare (0–0.25), intermediate (0.25–0.375), and frequent (0.375–0.5). This model allowed us to test for general trends in the data and show that there is a significant difference in the number of deleterious mutations among Criollo and Amelonado along binned classes of minor allele frequency. A post-hoc analysis was done with the Mantel–Haenszel test to test for specific effects. See Supplementary information for additional details on the implementation.
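The classification and binning described above can be sketched as a small tabulation step that would feed such a log-linear model. The SIFT threshold and the class boundaries come from the text; which side of each boundary is inclusive is an assumption, and the function names and toy variants are hypothetical.

```python
from collections import Counter

SIFT_THRESHOLD = 0.05  # score <= 0.05 is predicted deleterious, per the text

def sift_call(score):
    """Label an amino acid substitution from its SIFT score."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("SIFT scores range from 0 to 1")
    return "deleterious" if score <= SIFT_THRESHOLD else "tolerated"

def frequency_class(maf):
    """Bin a minor allele frequency; lower bounds are inclusive (assumption)."""
    if not 0.0 <= maf <= 0.5:
        raise ValueError("minor allele frequency must lie in [0, 0.5]")
    if maf < 0.25:
        return "rare"
    if maf < 0.375:
        return "intermediate"
    return "frequent"

def tabulate(variants):
    """Count variants per (frequency class, SIFT call); variants are (score, maf) pairs."""
    return Counter((frequency_class(maf), sift_call(score)) for score, maf in variants)

# Toy variants: (SIFT score, minor allele frequency).
toy_variants = [(0.01, 0.10), (0.05, 0.30), (0.20, 0.30), (0.90, 0.45), (0.00, 0.45)]
counts = tabulate(toy_variants)
```

One such table per population gives the cell counts compared across Criollo and Amelonado.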

Finally, we genotyped an additional set of 151 accessions using a customized chip of 15 K SNPs specific for cacao that was developed in parallel to this work using the novel variants identified in a subset of the accessions [61]. We intersected the genotyped set with the 79 accessions from this work that clearly belong to each one of the putative genetically differentiated populations and performed a supervised ancestry analysis in ADMIXTURE with conditions similar to those explained previously. We measured productivity (in kg × ha⁻¹ × yr⁻¹) in all 151 accessions. The impact of the accumulation of deleterious mutations on productivity was assessed by fitting a generalized linear model to explain productivity as a function of Criollo ancestry after correcting for inbreeding (see Supplementary Information for more details). We built a generalized linear model with a Gaussian family of the form:

Y = β0 + β1 + β2 + ε, where Y corresponds to the yield, β0 corresponds to the intercept, β1 corresponds to the proportion of Criollo ancestry, and β2 is the coefficient of inbreeding F estimated for each individual.
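A Gaussian-family model with an identity link is equivalent to ordinary least squares, so the fit can be sketched with a plain least-squares solve. The sketch assumes the model has the usual form Y = β0 + β1·(Criollo ancestry) + β2·F + ε, with the predictors multiplying the coefficients; the function name is hypothetical and the numbers are synthetic, not data from the study.

```python
import numpy as np

def fit_gaussian_glm(yield_values, ancestry, inbreeding):
    """Least-squares estimates (b0, b1, b2) for yield ~ ancestry + inbreeding."""
    design = np.column_stack([np.ones(len(yield_values)), ancestry, inbreeding])
    coef, *_ = np.linalg.lstsq(design, np.asarray(yield_values, dtype=float), rcond=None)
    return coef

# Synthetic, noise-free toy data generated from known coefficients.
ancestry = np.array([0.0, 0.2, 0.5, 0.8, 1.0])
inbreeding = np.array([0.1, 0.3, 0.2, 0.6, 0.4])
toy_yield = 1000.0 - 400.0 * ancestry - 200.0 * inbreeding

b0, b1, b2 = fit_gaussian_glm(toy_yield, ancestry, inbreeding)
```

With real data, a negative estimate for the ancestry coefficient after accounting for F would correspond to the yield penalty the analysis is testing for.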

We compared the estimates obtained when Criollo ancestry is used versus those obtained when Amelonado ancestry is used as a predictor to test for the specific effect of domestication and not just inbreeding. Supplementary Figure 13 shows the results of association analysis between Amelonado ancestry and productivity. Additional details about the analysis are provided in the Supplementary Information.

Code availability

The computer code is available by OEC via a github repository oeco28/Cacao_Genomics at

NGS Workflow Steps

The next-generation sequencing workflow contains three basic steps: library preparation, sequencing, and data analysis. Learn the basics of each step and discover how to plan your NGS workflow.

Preparing for the NGS Workflow

Before starting the next-generation sequencing workflow, isolate and purify your nucleic acid. Some DNA extraction methods can introduce inhibitors, which can negatively affect the enzymatic reactions that occur in the NGS workflow. For best results, use an extraction protocol optimized for your sample type. For RNA sequencing experiments, convert RNA to cDNA by reverse transcription.

After extraction, most NGS workflows require a QC step. We recommend using UV spectrophotometry for purity assessment and fluorometric methods for nucleic acid quantitation.

Step 1 in NGS Workflow: Library Prep

Library preparation is crucial to the success of your NGS workflow. This step prepares DNA or RNA samples to be compatible with a sequencer. Sequencing libraries are typically created by fragmenting DNA and adding specialized adapters to both ends. In the Illumina sequencing workflow, these adapters contain complementary sequences that allow the DNA fragments to bind to the flow cell. Fragments can then be amplified and purified.

To save resources, multiple libraries can be pooled together and sequenced in the same run—a process known as multiplexing. During adapter ligation, unique index sequences, or “barcodes,” are added to each library. These barcodes are used to distinguish between the libraries during data analysis.
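As a rough sketch of how barcodes separate pooled libraries during analysis, the function below assigns reads to samples by exact match of the index sequence. The barcodes, sample names and reads are hypothetical, and production demultiplexers typically also handle mismatches and quality filtering.

```python
def demultiplex(reads, barcode_to_sample):
    """Assign (index_sequence, read_sequence) pairs to samples by exact index match."""
    bins = {sample: [] for sample in barcode_to_sample.values()}
    unassigned = []
    for index_seq, read_seq in reads:
        sample = barcode_to_sample.get(index_seq)
        if sample is None:
            unassigned.append(read_seq)  # index does not match any known barcode
        else:
            bins[sample].append(read_seq)
    return bins, unassigned

# Hypothetical six-nucleotide indexes and reads.
barcodes = {"ACGTAC": "library_A", "TTGCAA": "library_B"}
pooled_reads = [
    ("ACGTAC", "GATTACA"),
    ("TTGCAA", "CCGGTTA"),
    ("ACGTAC", "TTTTGGG"),
    ("ACGTAA", "AAAAAAA"),  # one mismatch, so it stays unassigned
]
by_sample, leftover = demultiplex(pooled_reads, barcodes)
```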

Step 2 in NGS Workflow: Sequencing

During the sequencing step of the NGS workflow, libraries are loaded onto a flow cell and placed on the sequencer. The clusters of DNA fragments are amplified in a process called cluster generation, resulting in millions of copies of single-stranded DNA. On most Illumina sequencing instruments, clustering occurs automatically.

In a process called sequencing by synthesis (SBS), chemically modified nucleotides bind to the DNA template strand through natural complementarity. Each nucleotide contains a fluorescent tag and a reversible terminator that blocks incorporation of the next base. The fluorescent signal indicates which nucleotide has been added, and the terminator is cleaved so the next base can bind.

After reading the forward DNA strand, the reads are washed away, and the process repeats for the reverse strand. This method is called paired-end sequencing.

Step 3 in NGS Workflow: Data Analysis

After sequencing, the instrument software identifies nucleotides (a process called base calling) and the predicted accuracy of those base calls. During data analysis, you can import your sequencing data into a standard analysis tool or set up your own pipeline.

Today, you can use intuitive data analysis apps to analyze NGS data without bioinformatics training or additional lab staff. These tools provide sequence alignment, variant calling, data visualization, or interpretation.

Microbial Whole-Genome Sequencing

Microbial whole-genome sequencing can be used to identify pathogens, compare genomes, and analyze antimicrobial resistance. Our featured NGS workflow for this application describes the recommended steps. The entire workflow proceeds from DNA to data in less than 24 hours.

DNA Isolation

Use an extraction kit to isolate DNA from microbial colonies without introducing inhibitors. We recommend using glass beads. Assess purity using UV spectrophotometry and quantitate DNA using fluorometric methods.

Library Prep

Prepare and quantify libraries following the protocol listed in the Illumina DNA Prep Guide. You can also perform an optional library quality check using the Agilent 2100 Bioanalyzer or Advanced Analytical Fragment Analyzer. You’ll need:

2.5 hours
Estimated DNA input: 1–500 ng


Sequence libraries in a 2 × 150 bp run following the protocol listed in the iSeq 100 System Guide. You’ll need:

19.5 hours
Estimated output: 1.2 Gb per 2 × 150 bp run
Samples per run: 4–5 samples assuming 5 Mb at 50× coverage
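The samples-per-run figure follows from simple arithmetic on the numbers quoted above: run output divided by the bases needed per genome at the target coverage. A short sketch, with a hypothetical function name:

```python
def samples_per_run(run_output_bp, genome_size_bp, coverage):
    """Number of genomes of a given size that fit in one run at the target coverage."""
    return run_output_bp / (genome_size_bp * coverage)

# Figures from the text: 1.2 Gb output, 5 Mb genome, 50x coverage.
n_samples = samples_per_run(1.2e9, 5e6, 50)
```

This gives 4.8, which matches the quoted range of 4–5 samples; larger genomes or deeper coverage reduce the number proportionally.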

Data Analysis

Analyze data using the BWA Aligner app and visualize data using the Integrative Genomics Viewer app in BaseSpace Sequence Hub. You’ll need:


Department of Geriatrics, University of California, UC San Diego, 9500 Gilman Drive, #9111, La Jolla, CA, 92093-9111, USA

Department of Medicine, Division of Hematology/Oncology, and Center for Personalized Cancer Therapy, University of California, Moores Cancer Center, San Diego, USA

Sadakatsu Ikeda & Razelle Kurzrock

Tokyo Medical and Dental University, Tokyo, Japan

Kaiser Permanente Southern California, San Diego, USA


GK and SI drafted the manuscript. RK checked the manuscript and made critical changes and contributions. All authors read and approved the final manuscript.

Genetic Perturbation Platform

The Genetic Perturbation Platform, formerly known as the RNA interference (RNAi) Platform, supports functional investigations of the mammalian genome that can reveal how genetic alterations lead to changes in phenotype.

To enable those investigations, the platform develops technologies for perturbing genes, including libraries of CRISPR/Cas9 constructs, short hairpin RNAs (shRNAs), and open reading frames (ORFs) to edit, knock down, or overexpress genes, respectively.

In addition to developing technologies for perturbing genes, the platform also works to improve the effectiveness of the techniques, enhance delivery methods, and create infrastructure and resources to enable their use on large and small scales. Importantly, the team also assists collaborators in experimental planning and execution, helping them to choose the best model system and most appropriate readout to assess effects of the chosen perturbation.

The platform’s scope has grown significantly since its inception, when it grew out of The RNAi Consortium (TRC), a collaborative effort of six academic research institutions and five leading life sciences organizations. The Genetic Perturbation Platform and TRC worked as an integrated team to develop the materials and technology to enable and enhance RNAi as a tool for mammalian genetic screening. The materials and knowledge generated by this team were made available to the entire scientific community. The platform has since expanded to include varied forms of perturbing gene function, including a significant focus on CRISPR/Cas9 constructs.

The Genetic Perturbation Platform is directed by David Root, a physical chemist with significant experience in cell-based screening and building lentiviral-based libraries.

The platform's major activities include:

Platform scientists are creating genome-scale libraries of RNAi reagents, targeting virtually all human and mouse genes, which are carried in lentiviral vectors that allow this library to be introduced into a wide range of cell types. They have also developed a streamlined production process for rapid expansion of the library.

The platform is also building genome-scale libraries of CRISPR/Cas9 reagents, to knock out genes or perform genome editing.

In 2011 the RNAi Platform, in collaboration with the Broad Institute Cancer Program and the Center for Cancer Systems Biology at the Dana-Farber Cancer Institute, made publicly available the human ORFeome library V8.1, providing an additional tool for manipulating the human genome.

Platform scientists are developing and optimizing methods for using the library in high-throughput screening. This has enabled the first arrayed RNAi genetic screens by lentiviral infection as well as pooled screening approaches.

Platform scientists are evaluating the performance of the entire library using quantitative PCR to measure knockdown of the target transcript, to create the first fully validated lentiviral RNAi library. They are also testing and optimizing the performance of the library under different conditions, including different cell lines, timepoints, and multiplicity of infection (MOI).
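The text does not say how knockdown is computed from the quantitative PCR data. One common convention is the 2^-ΔΔCt method, which the sketch below assumes, along with perfect amplification efficiency (a doubling per cycle) and a reference gene used for normalization. The function name and all Ct values are hypothetical and chosen only for illustration.

```python
def percent_knockdown(ct_target_kd: float, ct_ref_kd: float,
                      ct_target_ctrl: float, ct_ref_ctrl: float) -> float:
    """Percent knockdown of a target transcript by the 2^-ddCt method.

    Assumes 100% PCR efficiency; this is an illustrative convention,
    not the platform's documented procedure.
    """
    # Normalize the target to the reference gene within each sample.
    delta_ct_kd = ct_target_kd - ct_ref_kd
    delta_ct_ctrl = ct_target_ctrl - ct_ref_ctrl
    # Compare the knockdown sample with the control sample.
    delta_delta_ct = delta_ct_kd - delta_ct_ctrl
    # Fraction of transcript remaining relative to control.
    remaining = 2.0 ** (-delta_delta_ct)
    return (1.0 - remaining) * 100.0


# Hypothetical Ct values: the target appears two cycles later after knockdown.
knockdown = percent_knockdown(ct_target_kd=26.0, ct_ref_kd=18.0,
                              ct_target_ctrl=24.0, ct_ref_ctrl=18.0)
print(f"Knockdown: {knockdown:.0f}%")
```

With these made-up values, a two-cycle shift corresponds to one quarter of the transcript remaining, or 75% knockdown.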

New versions of the library are being developed, as well as new methods to screen large subsets of the existing library.

Informatics scientists in the platform are developing archival and analytical tools necessary to ensure the utility of this library for all scientists.

