Information

How to identify an unknown species from a .fasta file containing its genome sequences

How to identify an unknown species from a .fasta file containing its genome sequences



We are searching data for your request:

Forums and discussions:
Manuals and reference books:
Data from registers:
Wait the end of the search in all databases.
Upon completion, a link will appear to access the found materials.

I am a Biochemist that is unfamiliar with bioinformatic tools and new to academia as a whole.

I am currently using ILLUMINA PE DNA sequence data, which I trimmed (Trimmomatic), corrected (Rcorrector) and assembled (SPAdes). I am now interested in using the genetic sequences from my contigs to identify the original organism (I know it is a bacterium).

I have tried using ncbi's BLASTn on one of my contigs with default parameters, on nucleotide collection database. The contig length is about 360 000 bases long. I let the query run for 30 minutes before stopping, nothing came of it.

What would be the best way to go about doing this?

Edit: For context, these DNA sequences are from genetic material that was extracted from bacteria from a soil sample. Using selective growth environment, we assured the organisms are indeed bacteria (further tests were done to determine other characteristics but this is besides the point).


I would suggest using a tool designed for this purpose such as GTDB-tk (slow but informative), mash (very fast but very reference-dependent), or something similar.


MiRDeep2 documentation

miRDeep2 is a software package for identification of novel and known miRNAs in deep sequencing data. Furthermore, it can be used for miRNA expression profiling across samples. Last, a new module for preprocessing of raw Illumina sequencing data produces files for downstream analysis with the miRDeep2 or quantifier module. Colorspace sequencing data is currently not supported by the preprocessing module but it is planed to be implemented. Preprocessing is performed with the mapper.pl script. Quantification and expression profiling is done by the quantifier.pl script. miRNA identification is done by the miRDeep2.pl script.


About the tools

Kraken and Centrifuge are two powerful metagenomics tools developed by the Salzberg lab at Johns Hopkins Center for Computational Biology . Both tools are highly accurate and are orders of magnitude faster than BLAST-based algorithms.

With earlier tools, researchers were faced with a trade off between reasonable analysis time frame and accuracy. BLAST-based tools, while accurate, are very slow and computationally expensive. To perform more sophisticated metagenomic experiments, researchers need tools that can quickly and accurately derive useful information from a tangled web of sequencing reads.

Kraken and Centrifuge both use a reference database to classify sequences to their taxonomic ranks. Kraken searches for exact k-mer matches in a given sequence against the database to identify a taxonomic label for the sequence. It relies on a precomputed database of k-mers (substrings of fixed length k ) and the lowest common ancestors of the reference genomes containing those k-mers. Kraken is able to classify 4.1 million 100-base-pair reads per minute at accuracy comparable to the fastest BLAST program. 4

Table 1. Classification sensitivity and precision for Centrifuge, Kraken, and MegaBLAST using simulated reads. In Centrifuge, only uniquely classified reads were used to compute accuracy. To measure speed, 10 million reads were used for Centrifuge and Kraken and 100,000 reads were used for MegaBLAST. All programs were run on a Linux System with 1TB of RAM using one CPU (2.1 GHz Intel Xeon). Table reproduced from Kim et al. Genome Biology 2016. 5

A drawback of Kraken is its substantial memory requirement. For a database of 4,278 prokaryotic genomes, Kraken requires 93GB of memory, effectively restricting its use to high-performance clusters.

Centrifuge was developed to address this drawback. By exchanging k-mer based indexing with a strategy based on the Burrows-Wheeler transform and Ferragina-Manzini index, Centrifuge significantly reduces the storage space for the indexed reference genomes, while still allowing quick exact search of k-mers for a given sequence. 5

The reduced storage space translates to a lower memory footprint in Centrifuge. A centrifuge index of the full NCBI non-redundant nucleotide sequence database requires only 69 GB, and includes full taxonomical mapping of the sequences in the sample.

Centrifuge maintains high levels of accuracy and can be effectively run on a personal computer. Unlike Kraken, Centrifuge can assign a sequence to multiple taxonomic categories, allowing more specificity at the genus and species level (Table 1).


Materials and Methods

Plasmid database

All 9,351 available plasmid sequences were downloaded in April 2017 from the NCBI RefSeq database. As the dataset also contained partial plasmid sequences and antimicrobial resistance genes, we filtered it to only include sequences whose identifiers contained “plasmid” and not “gene,” “partial,” “incomplete” or “putative”. The downloaded multi-FASTA file was separated into the individual FASTA files of each plasmid, which were converted to k-mer lists using the GenomeTester4 toolkit. A text-format index file was created that connects the names of plasmid k-mer list files to their FASTA identifiers. The final plasmid database contained 8,514 plasmid sequences. The database built with k = 20, used in most of the tests, is available from https://figshare.com/s/5f7b924544839f7d6e59 or http://bioinfo.ut.ee/plasmidseeker/ uncompressed size on the disk is 8.8 GB.

Generating simulated WGS samples for optimal k-mer length selection

Four samples were generated, each containing reads from a bacterium (Escherichia coli O157:H7 str. Sakai (NC_002695.1), Pseudomonas aeruginosa UCBPP-PA14 (CP000438.1), Acinetobacter baumannii YU-R612 (CP014215.1), Staphylococcus aureus subsp. aureus MRSA252 (BX571856.1)) and 1–3 plasmids (Table 1). The sample of S. aureus was used as a negative control and was without plasmids. MetaSim, a next-generation sequencing simulator (Richter et al., 2011), was used to generate synthetic 80 bp reads (MetaSim cmd -mg errormodel-80bp.mconf -f [clone length] -r [number of reads]). Empirical error model for 80 bp reads was downloaded from https://ab.inf.uni-tuebingen.de/software/metasim/errormodel-80bp.mconf. In case of plasmid sequences shorter than 3,000 bp, clone length was set to 500. A total of 500,000 reads were generated for each plasmid and 1,000,000 for each bacterial genome. Each simulated sample consisted of 1,000,000 bacterial reads and a corresponding number of plasmid reads to ensure specific plasmid copy number in the sample. The number of plasmid reads added was calculated as follows: (number of bacterial reads × read length ÷ bacterial full genome length) × (desired plasmid copy number × plasmid length ÷ read length).

Strain Plasmid Plasmid length (bp) Real copy number Predicted copy number
Pseudomonas aeruginosa (sim) pNOR-2000 21,880 20.00 19.50
pUM505 123,322 5.00 5.38
pKLC102 103,532 5.00 5.75
Acinetobacter baumannii (sim) pABIR 29,823 10.00 10.15
Escherichia coli (sim) pOSAK1 3,306 10.00 10.33
pO157 92,721 2.00 2.11
Providencia stuartii P. stuartii strain ATCC 33672 plasmid 48,844 3.02 2.89
Citrobacter freundii pKEC-a3c 272,297 5.23 5.17
pCAV1335-5410 5,410 14.00 14.00
Corynebacterium callunae pCC1 4,084 11.75 11.90
pCC2 82,716 0.87 0.90

“Sim” represents a simulated sample. For real WGS samples, real copy number is given as estimated by Antipov et al. (2016). For real copy number calculations in simulated samples, see “Materials and Methods”.

Generating simulated WGS samples for mutational analysis

Eight samples were made with MetaSim (settings as above), each containing 80 bp reads from a bacterium (E. coli or P. aeruginosa, same as above) and a plasmid: pOSAK1 (3,306 bp) in case of E. coli and pUM505 (123,322 bp) in case of P. aeruginosa. Random substitution mutations were induced to plasmid reference sequences using an in-house script (PlasmidSeeker GitHub, “make_mutations.pl”). Average mutation rates per bp were as follows: 1/1,000, 1/300, 1/100, 1/30. Relative copy numbers of the plasmid and the bacterium were 10 to 1 in all samples, respectively. Number of bacterial reads in all samples was 1,000,000.

Generating simulated WGS samples for detectable k-mer fraction analysis

Sixteen samples were made with MetaSim (settings and error profile as above), each containing 80 bp reads from a bacterium (E. coli, P. aeruginosa, S. aureus or A. baumannii, same as above). Number of reads in each sample was equal to the respective genome length divided by the read length (80 bp) times the desired coverage (1, 3, 5 or 7). Theoretical distribution was assumed to follow Poisson distribution, read length was equal to the read length in simulated samples (80 bp) and the average error rate was 0.01/bp.

Escherichia coli samples and analysis with PlasmidSeeker and plasmidSPAdes

We used three E. coli WGS samples with sequence type 131 (Accession numbers as follows: EC1, ERR1937840 EC2, ERR1937914 EC3, ERR1937841 see also Table S3). PlasmidSeeker was run with F and C thresholds set to 80%. PlasmidSPAdes was run with default settings.

Statistical test for comparing isolate and plasmid copy numbers

A part or the whole plasmid can integrate into the bacterial genome. Additionally, bacterial genomes might contain k-mers that are also present in plasmids, just by chance. Therefore, the bacterial isolate genome may contribute k-mers to the fraction of plasmid k-mers. If the copy number of a plasmid is not significantly different from the chromosomal copy number, most of the detected plasmid k-mers might originate from the bacterial genome. Therefore, to detect whether k-mers are actually from a plasmid or from the bacterial chromosome, we test the hypothesis: H 0 : Cov bacteria = Cov plasmid H 1 : Cov bacteria ≠ Cov plasmid

Covbacteria is the expected coverage of unique chromosomal k-mers and Covplasmid is the expected coverage of unique plasmid k-mers. We assume that the copy number of the plasmid is usually different from the copy number of the bacterial chromosome. Therefore, the expected coverage of plasmid k-mers is also different from the expected coverage of chromosomal k-mers. Hence, accepting the hypothesis H1 leads to the conclusion that at least some of the reads containing reference plasmid k-mers come from a plasmid.

To test the hypothesis, we fitted a mixture of negative binomial distributions to describe the distribution of the k-mer frequencies (we used the mixture distribution to allow some k-mers to be missing or to have increased copy number as k-mers might originate from repeat regions). We used separate mixtures to describe the plasmid and chromosomal k-mer frequencies. The method used assumes at least 70% of k-mers to be unique in the sequenced isolate and plasmid. We fitted two models by the maximum likelihood method. In one model, we restricted the expected coverage of unique k-mers to be the same for both plasmid and chromosomal k-mers. The second model allows the expected coverages to be different. We used a likelihood ratio type test to compare the two models. Copy number was considered significantly different at threshold p < 0.05 corrected with Bonferroni in case of multiple tests.


Results

An overview of lincRNA identification with Evolinc-I

Evolinc-I validation

After establishing a workflow using the most commonly accepted parameters for defining a lincRNA (detailed in Section Materials and Methods), we wanted to evaluate its efficiency at distinguishing between unknown or novel protein-coding genes and non-coding loci. For this, we used a random set of 5,000 protein-coding transcripts selected from the TAIR10 annotation to determine Evolinc-I's false discovery rate (FDR i.e., protein-coding transcripts erroneously classified as lincRNAs). ORFs for this test dataset ranged in length from 303 to 4,182 nts, with an average ORF of 1,131 nts (File S3). Because Evolinc is designed to automatically remove transcripts that map back to known genes, we removed these 5,000 genes from the reference genome annotation file, and then generated a transcript assembly file from RNA-seq data where these 5,000 genes were known to be expressed. We fed the transcript assembly file to Evolinc-I. Out of 5,000 protein-coding genes, only 11 were categorized as non-coding by Evolinc-I (0.22% FDR File S3). Further investigation of the 11 loci revealed that they were predominantly low coverage transcripts with ORFs capable of producing polypeptides 㺐, but 𼄀 amino acids (aa). Moreover, low read coverage for these transcripts led to incomplete transcript assembly. Together these factors were responsible for the miss-annotation of these loci as non-coding. Importantly, our results indicate that read depth and transcript assembly settings impact lincRNA identification, a finding also noted by Cabili et al. (2011). Therefore, exploring transcript assembly parameters may be necessary prior to running Evolinc-I. In sum, Evolinc-I has a low FDR that can be further reduced by increasing read per base coverage thresholds during transcript assembly as performed in Cabili et al. (2011).

We determined the overlap of Evolinc predicted lincRNAs with previously published datasets from humans and Arabidopsis, following as closely as possible the methods published for each dataset. We first used Evolinc-I to identify lincRNAs from an RNA-seq dataset generated by Liu et al. (2012) in Arabidopsis (File S1). From nearly one billion reads generated from four different tissues (siliques, flowers, leaves, and roots), Liu et al. (2012) identified 278 lincRNAs (based on the TAIR9 reference genome annotation). Using the Liu et al. (2012) SRA data, we mapped RNA-seq reads and assembled transcripts with Tophat2 and Cufflinks2 in the DE. From these transcripts, Evolinc-I, identified 571 lincRNAs. We then reconciled the lincRNAs identified in Liu et al. (Liu-lincRNAs) with those from Evolinc-I (Evolinc-lincRNAs), by identifying overlapping genomic coordinates for lincRNAs from the two datasets using the Bedtools suite (Quinlan and Hall, 2010). Of the 278 Liu-lincRNAs, 261 were also recovered by Evolinc-I (Table S1). Cufflinks failed to assemble the 17 unrecovered Liu-lincRNAs, due to low coverage, and thus differences in recovery for these loci reflect differences in the Cufflinks parameters employed.

The Arabidopsis genome reference has been updated since Liu et al. (2012), from TAIR9 to TAIR10 (Lamesch et al., 2012). We also ran Evolinc-I with the TAIR10 annotation and found that only 198 of the 261 Liu-lincRNAs were still considered intergenic (Figure ​ (Figure1B). 1B ). The remaining 63 were reclassified as overlapping a known gene (either sense overlapping transcript, SOT, or antisense overlapping transcript, AOT). This highlights an important aspect of Evolinc-I. While Evolinc-I is able to identify long non-coding RNAs without a genome annotation, genome annotation quality can impact whether an lncRNA is considered intergenic vs. AOT or SOT. In sum, 198 of the 571 lincRNAs identified by Evolinc-I correspond to a previously identified Liu-lincRNA (Figure ​ (Figure1B 1B ).

Of the 571 lincRNAs identified by Evolinc-I, 373 were not classified as lincRNAs by Liu et al. (2012). Evolinc-I removes transcripts that overlap with the 5′ and 3′ UTRs of a known gene, whereas Liu et al. (2012) removed transcripts that were within 500 bp of a known gene (Liu et al., 2012). This difference in the operational definition of intergenic space accounts for the omission of 197 Evolinc-lincRNAs from the Liu et al. (2012) lincRNA catalog. In addition, Evolinc-I removes transcripts with high similarity to transposable elements, but not tandem di- or tri-nucleotide repeats. We could see no biological reason for excluding these simple repeat containing transcripts, and in fact, transcripts with simple tandem repeats have been attributed to disease phenotypes and therefore might be of particular interest (Usdin, 2008). The inclusion of these transcripts accounts for 106 of the unique Evolinc-lincRNAs.

Finally, 70 of the 571 Evolinc-lincRNAs were entirely novel, and did not correspond to any known Liu-lincRNA or gene within the TAIR10 genome annotation. To determine whether these represented bona fide transcripts, we tested expression of a subset (n = 20) of single and multi-exon putative lincRNAs by RT-PCR using RNA extracted from two different tissues (seedlings and flowers, Figure S1). We considered expression to be positive if we recovered a band in two different tissues or in the same tissue but from different biological replicates. We recovered evidence of expression for 18 of these putative lincRNAs out of 20 tested. Based on these data we conclude that a majority of the 70 novel lincRNAs identified by Evolinc-I for Arabidopsis are likely to reflect bona fide transcripts, and thus valid lincRNA candidates.

We next compared Evolinc-I against a well-annotated set of human lincRNAs characterized by Cabili et al. (2011). Cabili et al. (2011) used RNA-seq data from 24 different tissues and cell types, along with multiple selection criteria to identify a “gold standard” reference set of 4,662 lincRNAs. We assembled transcripts from RNA-seq data for seven of these tissues (File S1) using Cufflinks under the assembly parameters and read-per-base coverage cut-offs of Cabili et al. (2011) (see Section Materials and Methods). We then fed these transcripts to Evolinc-I. To directly compare Evolinc-I identified lincRNAs with the Cabili et al. (2011) reference dataset (Cabili-lincRNAs), we used the BED files generated by Evolinc-I to identify a subset of 360 multi-exon putative lincRNAs that were observed in at least two tissues (consistent with criteria employed in Cabili et al. (2011) when using a single transcript assembler). We then asked whether these 360 Evolinc-I lincRNAs were found in either the Cabili-lincRNAs, or the hg19 human reference annotation (UCSC). A total of 317 (88%) of the Evolinc-I lincRNAs matched known lincRNAs from the two annotation sources (Figure ​ (Figure1C). 1C ). The remaining 43 transcripts (12% of the 360 tested) passed all other criteria laid out by Cabili et al. (2011) and therefore may be bona fide lincRNAs, but will require further testing.

Evolution of lincRNA loci with Evolinc-II

Evolinc-II validation

Evolinc-II is an automated and improved version of a workflow we previously used to determine the depth to which Liu-lincRNAs (Liu et al., 2012) were conserved in other species of the Brassicaceae (Nelson et al., 2016). The Evolinc-II workflow is outlined in Figure ​ Figure2A. 2A . While most Liu-lincRNAs were restricted to Arabidopsis, or shared only by Arabidopsis and A. lyrata, 3% were conserved across the family, indicating that the lincRNA-encoding locus was present in the common ancestor of all Brassicaceae

54 MYA (Beilstein et al., 2010). We used Evolinc-II to recapitulate our previous analysis in three ways. First, to provide replicates for statistical analysis, we randomly divided the 5,361 Liu-lincRNAs into 200-sequence groups prior to Evolinc-II analysis (n = 27 Figure ​ Figure2B 2B and Figure S2B). Second, we performed a separate comparison by dividing the Liu-lincRNAs based upon chromosomal location (n = 5). Lastly, we used Evolinc-II to search for sequence homologs using the complete Liu-lincRNA dataset but querying with varying E-value cutoffs (E-20, E-15, E-10, E-05, and E-01). This analysis allowed us to test the impact of the requirement for reciprocity on the recovery of putative homologs under different E-value criteria (Figure ​ (Figure2B 2B and Figure S2D). The number of sequence homologs increased for each decrement in BLAST stringency (Figure S2D), indicating that a significant number of putative homologs fulfill the reciprocity requirement even as sequence similarity decreases. The percentage of sequence homologs retrieved by Evolinc-II was statistically indistinguishable for lincRNAs assigned to groups, chromosomes, or the average from all E-value cutoffs (Figure ​ (Figure2B 2B and Figure S2C). Thus, Evolinc-II is a robust method to identify sets of lincRNAs that are conserved across a user-defined set of species, such as the Brassicaceae.

In addition to identifying sets of conserved lincRNAs, Evolinc-II also highlights conserved regions within each query lincRNA. To demonstrate these features, we scanned through the Liu-lincRNA Evolinc-II summary statistics file (at 1E-10 File S4) to identify a conserved lincRNA. At1NC023160 is conserved as a single copy locus in eight of the 10 species we examined. It was identified by Liu et al. (2012) based on both RNA-seq and tiling array data, as well as validated by Evolinc-I. During the comparative analyses, Evolinc-II generates a query-centric coordinate file that allows the user to visualize within a genome browser (e.g., JBrowse Buels et al., 2016) what regions of the query lincRNA are most conserved. Using this query-centric coordinate file, we examined the 332 nt At1NC023160 locus in the CoGe genome browser and determined that the 3′ end was most highly conserved (Figure ​ (Figure2C). 2C ). We used the MAFFT multiple sequence alignment generated by Evolinc-II for At1NC023160 to perform structure prediction with RNAalifold (Figure S3A Lorenz et al., 2011). The structural prediction based on the multiple sequence alignment had a greater base pair probability score and lower minimum free energy than the structure inferred from the Arabidopsis lincRNA alone (Figures S3B,C). Conserved regions of a lincRNA serve as potential targets for disruption via genome editing techniques, thereby facilitating its functional dissection.

Using Evolinc-II to infer the evolution of the human telomerase RNA locus TERC

In addition to exploring the evolutionary history of a lincRNA catalog, Evolinc-II is an effective tool to infer the evolution of individual lincRNA loci. To showcase the insights Evolinc-II can provide for datasets composed of a small number of lincRNAs, we focused on the well-characterized human lincRNA, TERC. TERC is the RNA subunit of the ribonucleoprotein complex telomerase that is essential for chromosome end maintenance in stem cells, germ-line cells, and single-cell eukaryotes (Theimer and Feigon, 2006 Blackburn and Collins, 2011 Zhang et al., 2011). TERC is functionally conserved across almost all eukarya, but is highly sequence divergent. Building on work performed by Chen et al. (2000) we used Evolinc-II to examine the evolutionary history of the human TERC locus in 26 mammalian species that last shared a common ancestor between 100 and 130 MYA (Figure ​ (Figure3 3 Glazko, 2003 Arnason et al., 2008).

Evolinc-II analysis of the human TERC locus in mammals. Species tree of 25 species (Ornithoryhnchus anatinus not shown) within class Mammalia with duplication (D) or loss (L) events hung on the tree (left). A micro-synteny profile is shown to the right for each species, showing the TERC locus in red, and adjacent protein-coding genes in black. Direction of each gene is indicated with arrows. The mouse and rat TERC loci are indicated by blue arrows to represent the poor sequence similarity between these two loci and human TERC. Divergence times are approximate and extracted from Arnason et al. (2008). A key is shown below, with gene names indicated. To regenerate micro-synteny analyses with CoGe (genomeevolution.org) for all species on the tree click on the following links: https://genomevolution.org/r/lxvp, https://genomevolution.org/r/lxvo, https://genomevolution.org/r/lxvn, https://genomevolution.org/r/lxz6.

Evolinc-II identified a human TERC sequence homolog in 23 of the 26 species examined (Figure ​ (Figure3 3 raw output shown in Figure S4). We were unable to identify a human TERC homolog in Ornithoryhnchus anatinus (platypus), representing the earliest diverging lineage within class Mammalia, using our search criteria. In addition, Mus musculus (mouse) and Rattus norvegicus (rat) were also lacking a human TERC homolog. However, close relatives of mouse and rat, such as Ictidomys tridecemlineatus (squirrel) and Oryctolagus cuniculus (rabbit) retained clear human TERC sequence homologs, suggesting that loss of the human TERC-like locus is restricted to the Muridae (mouse/rat family). This is in agreement with the previous identification of the mouse TERC, which exhibits much lower sequence similarity with the human TERC than do other mammals (Chen et al., 2000). All identified human TERC homologs also share synteny, suggesting similar evolutionary origins for this locus throughout mammals (Figure ​ (Figure3). 3 ). Evolinc-II also identified lineage-specific duplication events for the human TERC-like locus in the orangutan, lemur, and galago genomes (Figure ​ (Figure3), 3 ), similar to previous observations in pig and cow (Chen et al., 2000). In sum, Evolinc-II can be applied to both large and small datasets to uncover patterns of duplication, loss, and conservation across large phylogenetic distances.


Reference proteomes

The one gene one protein proteome sets are compiled from species sourced from complete genomes submitted to INSDC with gene model annotations from:

The gene2acc, fasta and idmapping files for individual species are available for download here:
https://ftp.ebi.ac.uk/pub/databases/reference_proteomes/QfO

SeqXML versions are documented by our partners and they are available here:
https://www.seqxml.org/xml/Reference_proteomes.html

Current Composition of Primary Protein Sets

The following table describes the status of the species:

Species Number of Genes/Proteins
UP000007062 7165 ANOGA Anopheles gambiae 15553
UP000000798 224324 AQUAE Aquifex aeolicus 1553
UP000006548 3702 ARATH Arabidopsis thaliana 27468
UP000001570 224308 BACSU Bacillus subtilis 4260
UP000001414 226186 BACTN Bacteroides thetaiotaomicron 4782
UP000007241 684364 BATDJ Batrachochytrium dendrobatidis 8610
UP000009136 9913 BOVIN Bos taurus 23847
UP000002526 224911 BRADU Bradyrhizobium diazoefficiens 8253
UP000001554 7739 BRAFL Branchiostoma floridae 28542
UP000001940 6239 CAEEL Caenorhabditis elegans 19813
UP000000559 237561 CANAL Candida albicans 6035
UP000002254 9615 CANLF Canis lupus 20654
UP000000431 272561 CHLTR Chlamydia trachomatis 895
UP000006906 3055 CHLRE Chlamydomonas reinhardtii 17614
UP000002008 324602 CHLAA Chloroflexus aurantiacus 3850
UP000008144 7719 CIOIN Ciona intestinalis 16678
UP000002149 214684 CRYNJ Cryptococcus neoformans 6604
UP000000437 7955 DANRE Danio rerio 25706
UP000002524 243230 DEIRA Deinococcus radiodurans 3085
UP000007719 515635 DICTD Dictyoglomus turgidum 1743
UP000002195 44689 DICDI Dictyostelium discoideum 12728
UP000000803 7227 DROME Drosophila melanogaster 13821
UP000000625 83333 ECOLI Escherichia coli 4392
UP000002521 190304 FUSNN Fusobacterium nucleatum 2046
UP000000539 9031 CHICK Gallus gallus 18113
UP000000577 243231 GEOSL Geobacter sulfurreducens 3402
UP000001548 184922 GIAIC Giardia intestinalis 4900
UP000000557 251221 GLOVI Gloeobacter violaceus 4406
UP000001519 9595 GORGO Gorilla gorilla 21789
UP000000554 64091 HALSA Halobacterium salinarum 2423
UP000000429 85962 HELPY Helicobacter pylori 1552
UP000015101 6412 HELRO Helobdella robusta 23328
UP000005640 9606 HUMAN Homo sapiens 20600
UP000001555 6945 IXOSC Ixodes scapularis 20489
UP000001686 374847 KORCO Korarchaeum cryptofilum 1602
UP000000542 5664 LEIMA Leishmania major 8038
UP000018468 7918 LEPOC Lepisosteus oculatus 18321
UP000001408 189518 LEPIN Leptospira interrogans 3676
UP000000805 243232 METJA Methanocaldococcus jannaschii 1787
UP000002487 188937 METAC Methanosarcina acetivorans 4468
UP000002280 13616 MONDO Monodelphis domestica 21225
UP000001357 81824 MONBE Monosiga brevicollis 9188
UP000000589 10090 MOUSE Mus musculus 22001
UP000001584 83332 MYCTU Mycobacterium tuberculosis 3993
UP000000807 243273 MYCGE Mycoplasma genitalium 483
UP000000425 122586 NEIMB Neisseria meningitidis 2001
UP000001593 45351 NEMVE Nematostella vectensis 24428
UP000002530 330879 ASPFU Neosartorya fumigata 9647
UP000001805 367110 NEUCR Neurospora crassa 9759
UP000000792 436308 NITMS Nitrosopumilus maritimus 1795
UP000059680 39947 ORYSJ Oryza sativa 43672
UP000001038 8090 ORYLA Oryzias latipes 23617
UP000002277 9598 PANTR Pan troglodytes 23053
UP000000600 5888 PARTE Paramecium tetraurelia 39461
UP000001055 321614 PHANO Phaeosphaeria nodorum 15997
UP000006727 3218 PHYPA Physcomitrella patens 31365
UP000005238 164328 PHYRM Phytophthora ramorum 15349
UP000001450 36329 PLAF7 Plasmodium falciparum 5383
UP000002438 208964 PSEAE Pseudomonas aeruginosa 5564
UP000008783 418459 PUCGT Puccinia graminis 15688
UP000002494 10116 RAT Rattus norvegicus 21588
UP000001025 243090 RHOBA Rhodopirellula baltica 7271
UP000002311 559292 YEAST Saccharomyces cerevisiae 6050
UP000002485 284812 SCHPO Schizosaccharomyces pombe 5138
UP000001312 665079 SCLS1 Sclerotinia sclerotiorum 14445
UP000001973 100226 STRCO Streptomyces coelicolor 8034
UP000001974 273057 SULSO Sulfolobus solfataricus 2938
UP000001425 1111708 SYNY3 Synechocystis sp. 3507
UP000001449 35128 THAPS Thalassiosira pseudonana 11717
UP000000536 69014 THEKO Thermococcus kodakarensis 2301
UP000000718 289376 THEYD Thermodesulfovibrio yellowstonii 1982
UP000008183 243274 THEMA Thermotoga maritima 1852
UP000007266 7070 TRICA Tribolium castaneum 16568
UP000001542 5722 TRIVA Trichomonas vaginalis 50190
UP000000561 237631 USTMA Ustilago maydis 6788
UP000008143 8364 XENTR Xenopus tropicalis 22514
UP000001300 284591 YARLI Yarrowia lipolytica 6449
UP000007305 4577 MAIZE Zea mays 39399

Gene mapping files (*.gene2acc)

Column 1 is a unique gene symbol that is chosen with the following order of preference from the annotation found in:

  1. Model Organism Database (MOD)
  2. Ensembl or Ensembl Genomes database
  3. UniProt Ordered Locus Name (OLN)
  4. UniProt Open Reading Frame (ORF)
  5. UniProt Gene Name

A dash symbol (-) is used when the gene encoding a protein is unknown.

Column 2 is the UniProtKB accession or isoform identifier for the given gene symbol. This column may have redundancy when two or more genes have identical translations.

Column 3 is the gene symbol of the canonical accession used to represent the respective gene group and the first row of the sequence is the canonical one.

Protein FASTA files (*.fasta and *_additional.fasta)

These files, composed of canonical and additional sequences, are non-redundant FASTA sets for the sequences of each reference proteome. The additional set contains isoform/variant sequences for a given gene, and its FASTA header indicates the corresponding canonical sequence ("Isoform of . "). The FASTA format is the standard UniProtKB format.

For further references about the standard UniProtKB format, please see:

>sp|Q9H6Y5|MAGIX_HUMAN PDZ domain-containing protein MAGIX OS=Homo sapiens OX=9606 GN=MAGIX PE=1 SV=4 MEPRTGGAANPKGSRGSRGPSPLAGPSARQLLARLDARPLAARAAVDVAALVRRAGATLR LRRKEAVSVLDSADIEVTDSRLPHATIVDHRPQHRWLETCNAPPQLIQGKAHSAPKPSQA SGHFSVELVRGYAGFGLTLGGGRDVAGDTPLAVRGLLKDGPAQRCGRLEVGDVVLHINGE STQGLTHAQAVERIRAGGPQLHLVIRRPLETHPGKPRGVGEPRKGVVPSWPDRSPDPGGP EVTGSRSSSTSLVQHPPSRTTLKKTRGSPEPSPEAAADGPTVSPPERRAEDPNDQIPGSP GPWLVPSEERLSRALGVRGAAQFAQEMAAGRRRH

>sp|Q9H6Y5-2|MAGIX-2_HUMAN Isoform of Q9H6Y5, Isoform 2 of PDZ domain-containing protein MAGIX OS=Homo sapiens OX=9606 GN=MAGIX PE=1 SV=4 MPLLWITGPRYHLILLSEASCLRANYVHLCPLFQHRWLETCNAPPQLIQGKAHSAPKPSQ ASGHFSVELVRGYAGFGLTLGGGRDVAGDTPLAVRGLLKDGPAQRCGRLEVGDVVLHING ESTQGLTHAQAVERIRAGGPQLHLVIRRPLETHPGKPRGVGEPRKGVVPSWPDRSPDPGG PEVTGSRSSSTSLVQHPPSRTTLKKTRGSPEPSPEAAADGPTVSPPERRAEDPNDQIPGS PGPWLVPSEERLSRALGVRGAAQFAQEMAAGRRRH >tr|C9J123|C9J123_HUMAN Isoform of Q9H6Y5, PDZ domain-containing protein MAGIX (Fragment) OS=Homo sapiens OX=9606 GN=MAGIX PE=1 SV=2 MSPNSPLHCFYLPAVSVLDSADIEVTDSRLPHATIVDHRPQVGDLVLHINGESTQGLTHA QAVERIRAGGPQLHLVIRRPLETHPGKPRGVGEPRKGVDRSPDPGGPEVTGSRSSSTSLV QHPPSRTTLKKTRGSPEPSPEAA

Coding DNA Sequence FASTA files (*_DNA.fasta)

These files contain the coding DNA sequences (CDS) for the protein sequences where it was possible. The format is as in the following example (UP000005640_9606_DNA.fasta):

>sp|A0A183|ENSP00000411070 ATGTCACAGCAGAAGCAGCAATCTTGGAAGCCTCCAAATGTTCCCAAATGCTCCCCTCCC CAAAGATCAAACCCCTGCCTAGCTCCCTACTCGACTCCTTGTGGTGCTCCCCATTCAGAA GGTTGTCATTCCAGTTCCCAAAGGCCTGAGGTTCAGAAGCCTAGGAGGGCTCGTCAAAAG CTGCGCTGCCTAAGTAGGGGCACAACCTACCACTGCAAAGAGGAAGAGTGTGAAGGCGAC TGA

The 3 fields of the FASTA header are:

  1. sp (Swiss-Prot reviewed) or tr (TrEMBL)
  2. UniProtKB Accession
  3. EMBL Protein ID or Ensembl/Ensembl Genome ID

Unsuccessful Coding DNA Sequence mapping files (*_DNA.miss)

For the species that did not have a perfect mapping for all protein sequences to a CDS, these files contain the entries that could not be mapped. The format is as in the following example (UP000005640_9606_DNA.miss):

sp A6NF01 CAUTION: Could be the product of a pseudogene sp A4QN01 NOT_ANNOTATED_CDS

  1. "sp" (Swiss-Prot reviewed) or "tr" (TrEMBL)
  2. UniProtKB accession
  3. Reason why the protein could not be mapped to a CDS

Database mapping files (*.idmapping)

These files contain mappings from UniProtKB to other databases for each reference proteome. The format consists of three tab-separated columns:

  1. UniProtKB accession
  2. ID_type:
    • Database name as shown in UniProtKB cross-references and supported by the ID mapping tool on the UniProt web site (https://www.uniprot.org/mapping )
  3. ID:
    • Identifier in the cross-referenced database.

The xml files contain all the information from fasta (canonical and additional), idmapping and CDS in SeqXML format (see https://seqxml.org .)
E.g. (from UP000005640_9606.xml, header and one entry)

<?xml version="1.0" encoding="utf-8"?> <seqXML xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" speciesName="Homo sapiens" xsi:noNamespaceSchemaLocation="http://www.seqxml.org/0.4/seqxml.xsd" seqXMLversion="0.4" sourceVersion="2016_04" source="QfO http://w ww.ebi.ac.uk/reference_proteomes/" ncbiTaxID="9606"> <entry source="UniProtKB"> <description>Immunoglobulin lambda variable 4-69</description> <AAseq>MAWTPLLFLTLLLHCTGSLSQLVLTQSPSASASLGASVKLTCTLSSGHSSYAIAWHQQQPEKGPRYLMKLNSDGSHSKGDGIPDRFSGSSSGAERYLTISSLQSEDEADYYCQTWGTGI</AAseq> <DBRef source="UniProtKB-ID"></DBRef> <DBRef source="Gene_Name"></DBRef> <DBRef source="GI"></DBRef> <DBRef source="UniRef100"></DBRef> <DBRef source="UniRef90"></DBRef> <DBRef source="UniRef50"></DBRef> <DBRef source="UniParc"></DBRef> <DBRef source="EMBL"></DBRef> <DBRef source="EMBL-CDS"></DBRef> <DBRef source="NCBI_TaxID"></DBRef> <DBRef source="Ensembl"></DBRef> <DBRef source="Ensembl_TRS"></DBRef> <DBRef source="Ensembl_PRO"></DBRef> <DBRef source="UCSC"></DBRef> <DBRef source="EuPathDB"></DBRef> <DBRef source="GeneCards"></DBRef> <DBRef source="HGNC"></DBRef> <DBRef source="neXtProt"></DBRef> <DBRef source="GeneTree"></DBRef> <DBRef source="OMA"></DBRef> <DBRef source="CRC64"></DBRef> <property name="GN" value="IGLV4-69"></property> <property name="SV" value="1"></property> <property name="DNAsource" value="ENSP00000374817"></property> <property name="ensemblVersion" value="91,38"></property> <property name="UPID" value="UP000005640"></property> <property name="DNAseq" value="ATGGCTTGGACCCCACTCCTCTTCCTCACCCTCCTCCTCCACTGCACAGGGTCTCTCTCCCAGCTTGTGCTGACTCAATCGCCCTCTGCCTCTGCCTCCCTGGGAGCCTCGGTCAAGCTCACCTGCACTCTGAGCAGTGGGCACAGCAGCTACGCCATCGCAT GGCATCAGCAGCAGCCAGAGAAGGGCCCTCGGTACTTGATGAAGCTTAACAGTGATGGCAGCCACAGCAAGGGGGACGGGATCCCTGATCGCTTCTCAGGCTCCAGCTCTGGGGCTGAGCGCTACCTCACCATCTCCAGCCTCCAGTCTGAGGATGAGGCTGACTATTACTGTCAGACCTGGGGCACTGGCATTCA"></property> <property name="PE" value="1"></property> </entry>

Toni Gabaldón, Christophe Dessimoz, Julie Huxley-Jones, Albert J Vilella,Erik LL Sonnhammer and Suzanna Lewis

Genome Biology 2009, 10:403

Published: 29 September 2009

Christophe Dessimoz, Toni Gabaldón, David S. Roos, Erik LL Sonnhammer, Javier Herrero and the Quest for Orthologs consortium

Bioinformatics 2012, 28:900

Published: 12 February 2012

Erik LL Sonnhammer, Toni Gabaldón, Alan W Sousa da Silva, Maria Martin, Marc Robinson-Rechavi, Brigitte Boeckmann, Paul D Thomas, Christophe Dessimoz and the Quest for Orthologs consortium

Bioinformatics 2014, 30:2993

Adrian M Altenhoff, Brigitte Boeckmann, Salvador Capella-Gutierrez, Daniel A Dalquen, Todd DeLuca, Kristoffer Forslund, Jaime Huerta-Cepas, Benjamin Linard, Cécile Pereira, Leszek P Pryszcz, Fabian Schreiber, Alan Sousa da Silva, Damian Szklarczyk, Clément-Marie Train, Peer Bork, Odile Lecompte, Christian von Mering, Ioannis Xenarios, Kimmen Sjölander, Lars Juhl Jensen, Maria J Martin, Matthieu Muffato, Quest for Orthologs consortium, Toni Gabaldón, Suzanna E Lewis, Paul D Thomas, Erik Sonnhammer and Christophe Dessimoz


How to identify an unknown species from a .fasta file containing its genome sequences - Biology

PlantClusterFinder (version 1.3)

PlantClusterFinder (PCF) detects metabolic gene clusters in a sequenced genome. It uses a gene location file provided by the user (see below) and a PGDB created with Pathway Tools as well as further information (see below) to identify enzyme-coding genes (metabolic genes) located together on a chromosome. Initially only continous stretches of metabolic genes lying directly next to each other are allowed. This condition is relaxed by iteratively increasing the intervening (non-metabolic) gene size by one. Several criteria to select for clusters are provided. In addition to this, clusters can be prevented from forming by a section of criteria. Details of PCF (version 1.0) can be found in PMID: 28228535.

The major differences between this version (1.3) and previous versions (1.0 and 1.2) are:

Authors: Pascal Schlapfer, December 2017 Bo Xue, December 2017

Mandatory inputs: all files need full path, order does not matter.

Outputs: None (Two files are generated: vGeneOutputFile, containing all the information about genes and vClusterOutputFile, containing all the information about clusters)


16S rRNA Sequencing: A PCR-based Technique to Identify Bacterial Species

Planet Earth is a habitat for millions of bacterial species, each of which has specific characteristics. Identification of bacterial species is widely used in microbial ecology to determine biodiversity of environmental samples and medical microbiology to diagnose infected patients. Bacteria can be classified using conventional microbiology methods, such as microscopy, growth on specific media, biochemical and serological tests, and antibiotic sensitivity assays. In recent decades, molecular microbiology methods have revolutionized bacterial identification. A popular method is 16S ribosomal RNA (rRNA) gene sequencing. This method is not only faster and more accurate than conventional methods, but also allows identification of strains that are difficult to grow in laboratory conditions. Furthermore, differentiation of strains at the molecular level enables discrimination between phenotypically identical bacteria (1-4).

16S rRNA joins with a complex of 19 proteins to form a 30S subunit of the bacterial ribosome (5). It is encoded by the 16S rRNA gene, which is present and highly conserved in all bacteria due to its essential function in ribosome assembly however, it also contains variable regions which may serve as fingerprints for particular species. These features have made the 16S rRNA gene an ideal genetic fragment to be used in identification, comparison, and phylogenetic classification of bacteria (6).

16S rRNA gene sequencing is based on the polymerase chain reaction (PCR) (7-8) followed by DNA sequencing (9). PCR is a molecular biology method used to amplify specific fragments of DNA through a series of cycles that include:

i) Denaturation of a double stranded DNA template
ii) Annealing of primers (short oligonucleotides) that are complementary to the template
iii) Extension of primers by the DNA polymerase enzyme, which synthesizes a new DNA strand

A schematic overview of the method is shown in Figure 1.


Figure 1: Schematic overview of the PCR reaction. Please click here to view a larger version of this figure.

There are several factors that are important for a successful PCR reaction, one of which is quality of the DNA template. Isolation of chromosomal DNA from bacteria can be performed using standard protocols or commercial kits. Special care should be taken to obtain DNA that is free of contaminants that can inhibit the PCR reaction.

Conserved regions of the 16S rRNA gene permit the design of universal primer pairs (one forward and one reverse) that can bind to and amplify the target region in any bacterial species. The target region can vary in size. While some primer pairs can amplify most of the 16S rRNA gene, others amplify only parts of it. Examples of commonly used primers are shown in Table 1 and their binding sites are depicted in Figure 2.

Primer name Sequence (5'𔾷') Forward/ reverse Reference
8F b) AGAGTTTGATCCTGGCTCAG forward -1
27F AGAGTTTGATCMTGGCTCAG forward -10
515F GTGCCAGCMGCCGCGGTAA forward -11
911R GCCCCCGTCAATTCMTTTGA reverse -12
1391R GACGGGCGGTGTGTRCA reverse -11
1492R GGTTACCTTGTTACGACTT reverse -11

Table 1: Examples of standard oligonucleotides used in amplification of 16S rRNA genes a) .
a) The expected lengths of the PCR product generated using the different primer combinations can be estimated by calculating the distance between the binding sites for the forward and the reverse primer (see Figure 2), e.g. the size of the PCR product using primer pair 8F-1492R is

1500 bp, and for primer pair 27F-911R

900 bp.
b) also known as fD1


Figure 2: Representative figure of the 16S rRNA sequence and the primer-binding sites. Conserved regions are colored in grey and variable regions are filled with diagonal lines. To allow for the highest resolution, primer 8F and 1492R (name based on location on rRNA sequence) are used to amplify the whole sequence, allowing for the sequencing of several variable regions of the gene. Please click here to view a larger version of this figure.

Cycling conditions for PCR (i.e. the temperature and time required for the DNA to be denatured, annealed with primers, and synthesized) are dependent on the type of polymerase that is used and the properties of the primers. It is recommended to follow the manufacturer's guidelines for a particular polymerase.

After the PCR program is completed, the products are analyzed by agarose gel electrophoresis. A successful PCR yields a single band of expected size. The product must be purified prior to sequencing to remove residual primers, deoxyribonucleotides, polymerase, and buffer which were present in the PCR reaction. The purified DNA fragments are usually sent for sequencing to commercial sequencing services however, some institutions perform DNA sequencing at their own core facilities.

The DNA sequence is automatically generated from a DNA chromatogram by a computer and must be carefully checked for quality, as manual editing is sometimes needed. Following this step, the gene sequence is compared with sequences deposited in the 16S rRNA database. The regions of similarity are identified, and the most similar sequences are delivered.

Procedure

  1. While handling microorganisms, it is required to follow good microbiological practice. All microorganisms, especially unknown samples, should be treated as potential pathogens. Follow aseptic technique to avoid contaminating the samples, researchers, or the laboratory. Wash hands before and after handling bacteria, use gloves, and wear protective clothing.
  2. Carry out a risk assessment for the experimental protocol for the genomic DNA isolation and PCR product purification. Some reagents may be harmful!
  3. Pure culture is essential for the 16S rRNA sequencing. Before proceeding to isolation of genomic DNA, make sure the starting material is entirely pure. This can be done by streak plating to isolate individual colonies. These can be further grown streaked on plates individually, or in broth, if needed.
  4. Laboratory equipment required:
    1. Thermal cycler for PCR. The function of the thermal cycler is to raise and lower temperature according to a set program. While creating the program you will be asked to enter the temperature and time values for every PCR step as well as total number of cycles.
    2. Agarose gel electrophoresis system. It is used to separate DNA fragments based on their size and charge. In this protocol, agarose gel electrophoresis will be used to visualize the quality of isolated genomic DNA and PCR products.

    Note: The demonstrated protocol applies to 16S rRNA gene sequencing from a pure culture of bacteria. It does not apply to metagenomic studies.

    1. Culturing bacteria for isolation of genomic DNA (gDNA).
      1. Grow your microorganism on a suitable medium. Both liquid and solid media can be used in this step. Choose conditions that yield the best growth. While planning the experiment, keep in mind that slow-growing bacteria may need several days to reach the late-log/stationary growth phase. In this protocol, Bacillus subtilis 168 was grown in lysogeny broth (LB) overnight in a shaking incubator set at 200 rpm, 37°C.
      1. If bacteria were grown on solid medium, scrape some cells using a sterile loop and resuspend them in 1 mL of distilled water
      2. If bacteria were grown in liquid medium, use approximately 1.5 mL of an overnight culture.
      3. Pellet the cells by centrifugation (1 minute, 12,000 - 16,000 × g), remove the supernatant, and use the cells for gDNA isolation using a commercial kit or standard protocols [e.g. CTAB total DNA preparation (13) or phenol-chloroform extraction (14)]. Here, a commercial kit was used to isolate gDNA from 1.5 mL of B. subtilis 168 overnight culture, OD600 = 1.5.
        Note 1: For some Gram-negative bacteria this step can be omitted and replaced by simple release of DNA from cells by boiling. Resuspend bacterial pellet in distilled water and incubate in a heating block set at 100 °C for 10 minutes.
        Note 2: Gram-positive bacterial cells are difficult to disrupt. It is therefore recommended to choose a gDNA isolation method or kit that is dedicated to isolation from this group of bacteria.
      1. Check the quality of the isolated gDNA by agarose gel electrophoresis. First, mix 5 µL of the isolated gDNA with 1 µL of the loading dye (6x), and load the sample on a 0.8% agarose gel that contains a DNA staining reagent.
      2. Load a molecular mass standard and run the electrophoresis until the dye front reaches the bottom of the gel.
      3. Once the electrophoresis is completed, visualize the gel on a suitable transilluminator (either UV or blue light). gDNA appears as a thick high molecular band (above 10 kb). An example of the gDNA quality check is shown in Figure 3.
      4. If the gDNA passes the quality control (i.e. the high molecular band is present and there is little-to-no smearing of the gDNA), dilute your gDNA serially by first labelling 3 microcentrifuge tubes as follows: ൒x", 蔴x" and 񓐈x".
      5. Pipette 90 µL of sterile distilled water into each of the 3 tubes.
      6. Take 10 µL of the gDNA solution and add it to the tube marked ൒x".
      7. Pipet the whole volume (i.e. 100 µL) up and down thoroughly to ensure the solution is mixed uniformly. Then, take 10 µL of the solution from this tube and transfer it to the tube marked 蔴x".
      8. Mix as described before and repeat the same procedure by transferring 10 µL of the solution from tube 蔴x" to the tube 񓐈x". These dilutions will be used as template in the PCR reaction.


      Figure 3: Agarose gel electrophoresis of gDNA isolated from Bacillus subtilis. Lane 1: M - molecular mass marker (from top to bottom: 10000 bp, 8000 bp, 6000 bp, 5000 bp, 4000 bp, 3500 bp, 3000 bp, 2500 bp, 2000 bp, 1500 bp, 1000 bp). Lane 2: gDNA - genomic DNA isolated from Bacillus subtilis. Please click here to view a larger version of this figure.

      1. Amplification of the 16S rRNA gene by PCR.
        Note: The PCR protocol below is optimized for a particular DNA polymerase and primer pair 8F - 1492R (see Table 1). Optimization of the protocol is required for each polymerase and primer pair.
        1. Thaw all reagents on ice.
        2. Prepare the PCR master mix as shown in Table 2. Since the DNA polymerase is active at room temperature, the reaction setup must be performed on ice, i.e. the PCR tubes and the reaction components should be kept on ice all the time. Prepare one reaction per each gDNA sample and one reaction for negative control. Negative control is a PCR mix without the gDNA template and is used to ensure that the other components of the reaction are not contaminated.
          Note: In case of multiple samples, a master mix is commonly prepared. Master mix is a solution containing all the reaction components except the template. It helps to omit repetitive pipetting, avoid pipetting error, and ensures high consistency between the samples. To prepare master mix, multiply the volume of each component (except the DNA template) by the number of samples tested. Mix all the components in microcentrifuge tube and pipet the whole volume up and down several times.
        3. Aliquot 49 µL of the master mix into the individual PCR tubes.
        4. Add 1 µL template into tubes with master mix. For negative control add 1 µL of sterile water. To ensure that the components are well mixed, gently pipet the mix up and down

        Table 2: PCR reaction components. * use the 10x, 100x or 1000x diluted gDNA from step 2.3.

        Step Temperature Time Cycles
        Initial denaturation 98°C 30 sec
        Denaturation 98°C 10 sec 25-30
        Annealing 60°C 30 sec
        Extension 72°C 45 sec
        Final extension 72°C 7 min
        Hold 4°C

        Table 3: PCR program for the amplification of the 16S rRNA gene.


        Figure 4: Agarose gel electrophoresis of PCR products amplified using primers 8F and 1492R and gDNA as a template. The gDNA sample from B. subtilis (see Figure 3) was diluted 10, 100 and 1000 times in order to test for the best outcome. Lane 1: M - molecular mass marker (from top to bottom: 10000 bp, 8000 bp, 6000 bp, 5000 bp, 4000 bp, 3500 bp, 3000 bp, 2500 bp, 2000 bp, 1500 bp, 1000bp, 750 bp, 500 bp, 250 bp). Lane 2: PCR reaction with 10x diluted template. Lane 3: PCR reaction with 100x diluted template. Lane 4: PCR reaction with 1000x diluted template. Lane 5: (C-) - negative control (reaction without the DNA template). Please click here to view a larger version of this figure.

        3. Data analysis and results

        Note: The PCR product is sequenced using the forward (here 8F) and the reverse (here 1492R) primers. Therefore, two sets of data sequence are generated, one for the forward and one for the reverse primer. For each sequence at least two types of file are generated: i) a text file containing the DNA sequence and ii) a DNA chromatogram, which shows the quality of the sequencing run.

        1. For the forward primer, open the chromatogram, and carefully examine the sequence. An ideal chromatogram for a quality sequence should have evenly spaced peaks and little or no background signals (Figure 5A).
        2. If the chromatogram is not high quality, the sequence should be discarded, or the sequence text file should be revised according to the following:
          1. The presence of double peaks throughout the chromatogram indicates the presence of multiple DNA templates. This can be the case if the bacterial culture was not pure. Such a sequence should be discarded (Figure 5B).
          2. An ambiguous chromatogram might arise from the presence of different colored peaks in the same location. One of the most common errors is the presence of two different colored peaks in the same position and improper assigning of the bases by the sequencing software (Figure 5C). Manually correct any incorrectly assigned nucleotides and edit them in the text file.
          3. Low resolution chromatograms can result in "broad peaks" that often cause mis-counting of the nucleotides in these regions (Figure 5D). This error is difficult to correct, and therefore possible mismatches in the further alignment step should not be treated as reliable.
          4. Poor chromatogram reading quality and the presence of multiple peaks is commonly seen at the 5' and 3' ends of the sequence. Some sequencer software removes these low-quality fragments automatically (Figure 5E), and the nucleotides are not included in the text file. If your sequence was not truncated automatically, determine the low-quality fragments (e.g. weak signal, overlapping peaks, loss of resolution) at the ends and remove the respective bases from the text file.


          Figure 5: Examples of DNA sequencing troubleshooting. A) An example of a quality chromatogram sequence (evenly-spaced, unambiguous peaks). B) Poor quality sequence that usually occurs at the beginning of the chromatogram. The grey-zone area is considered low quality and automatically removed by the sequencing software. More bases can be trimmed manually. C) Presence of double-peaks (indicated by arrows). A nucleotide that is indicated by the red arrow has been read by the sequencer as "T" (red peak), but the blue peak is stronger, and it can also be interpreted as "C". D) Overlapping peaks indicate DNA contamination (i.e. more than one template). E) Loss of resolution and so called "broad peaks" (marked by rectangle) that prevent reliable base-calling. Please click here to view a larger version of this figure.

          1. Repeat 3.1 and 3.2 for the reverse primer.
          2. Finally, assemble the forward and reverse sequences into one contiguous sequence. A good sequencing run yields a sequence of up to 1100 bp. Considering that the PCR product is

          1. Merge the two sequences using the DNA sequence assembly program, e.g. a free tool such as CAP3 (http://doua.prabi.fr/software/cap3) (15).
          2. Insert the two sequences in FASTA format into the indicated box. Click the "Submit" button and wait for the results to return.
          3. To view the assembled sequence press "Contigs" in the result tab. To view the details of the alignment press "Assembly details".
            Note 1: If CAP3 software is used for contig assembly there is no need to convert the reverse primer sequence into reverse-complementary however, this step might be needed if another program is used.
            Note 2: FASTA format is a text-based format to represent the nucleotide sequence. The first line (the description line) in a FASTA file starts with a symbol ">" followed by the name or a unique identifier of the sequence. Following the description line is the nucleotide sequence. Paste your sequences in the following format:

          1. Select the "Nucleotide BLAST" tool to compare your sequence to the database.
          2. Enter your sequence (the contig assembled in 3.5) into the "Query sequence" text box, then select the database ൘S rRNA sequences (Bacteria and Archea)" in the scroll down menu.
          3. Press the "BLAST" button on the bottom of the page. The most similar sequences will be returned. An example BLAST result is shown in Figure 6. In the presented experiment the top hit is B. subtilis strain 168, showing 100% identity with the sequence available in the BLAST database.
          4. If the top hit does not show 100% identity, go to the alignment and check for mismatches. Upon clicking on the top hit, you will be directed to the details of the alignment. Aligned nucleotides will be joined by short vertical lines while mismatched nucleotides have a gap between them. Return to the chromatogram that you received from the sequencing company and revise the sequence once again with the focus on the mismatched region. Correct the sequence if any additional errors are found. Run BLAST again using the corrected sequence.


          Figure 6: Example of the nucleotide BLAST outcome. 16S rRNA gene sequence from pure culture of B. subtilis 168 was used as a query sequence. The top hit shows 100% identity (underlined) to the B. subtilis strain 168, as expected. Please click here to view a larger version of this figure.

          Earth is home to millions of bacterial species, each with unique characteristics. Identifying these species is critical in evaluating environmental samples. Doctors also need to distinguish different bacterial species to diagnose infected patients.

          To identify bacteria, a variety of techniques can be employed, including microscopic observation of morphology or growth on a specific media to observe colony morphology. Genetic analysis, another technique for identifying bacteria has grown in popularity in recent years, due in part to 16S ribosomal RNA gene sequencing.

          The bacterial ribosome is a protein RNA complex consisting of two subunits. The 30S subunit, the smaller of these two subunits, contains 16S rRNA, which is encoded by the 16S rRNA gene contained within the genomic DNA. Specific regions of 16S rRNA are highly conserved, due to their essential function in ribosome assembly. While other regions, less critical to function, may vary among bacterial species. The variable regions in 16S rRNA, can serve as unique molecular fingerprints for bacterial species, allowing us to distinguish phenotypically identical strains.

          After obtaining a quality sample of gDNA, PCR of the 16S rRNA encoding gene can begin. PCR is a commonly used molecular biology method, consisting of cycles of denaturation of the double-stranded DNA template, annealing of universal primer pairs, which amplify highly conserved regions of the gene, and the extension of primers by DNA polymerase. While some primers amplify most of the 16S rRNA encoding gene, others only amplify fragments of it. After PCR, the products can be analyzed via agarose gel electrophoresis. If amplification was successful, the gel should contain a single band of an expected size, depending upon the primer pair used, up to 1500bp, the approximate length of the 16S rRNA gene.

          After purification and sequencing, the obtained sequences can then be entered into the BLAST database, where they can be compared with reference 16S rRNA sequences. As this database returns matches based on the highest similarity, this allows confirmation of the identity of the bacteria of interest. In this video, you will observe 16S rRNA gene sequencing, including PCR, DNA sequence analysis and editing, sequence assembly and database searching.

          When handling microorganisms, it is essential to follow good microbiological practice, including using aseptic technique and wearing appropriate personal protective equipment. After performing an appropriate risk assessment for the microorganism or environmental sample of interest, obtain a test culture. In this example, a pure culture of Bacillus subtilis is used.

          To begin, grow your microorganism on a suitable medium in the appropriate conditions. In this example, Bacillus subtilis 168 is grown in LB broth overnight in a shaking incubator set to 200 rpm at 37 degrees Celsius. Next, use a commercially available kit to isolate genomic DNA or gDNA from 1.5 milliliters of the B. subtilis overnight culture.

          To check the quality of the isolated DNA, first mix five microliters of the isolated gDNA with one microliter of DNA gel loading dye. Then, load the sample onto a 0.8% agarose gel, containing DNA staining reagent, such as SYBR safe or EtBr. After this, load a one kilobase molecular mass standard onto the gel, and run the electrophoresis until the front dye is approximately 0.5 centimeters from the bottom of the gel. Once the gel electrophoresis is complete, visualize the gel on a blue light transilluminator. The gDNA should appear as a thick band, above 10 kilobase in size and have minimal smearing.

          After this, to create serial dilutions of the gDNA, label three microcentrifuge tubes as 10X, 100X, and 1000X. Then, use a pipette to dispense 90 microliters of sterile distilled water into each of the tubes. Next, add 10 microliters of the gDNA solution to the 10X tube. Pipette the whole volume up and down to ensure the solution is mixed thoroughly. Then, remove 10 microliters of the solution from the 10X tube and transfer this to the 100X tube. Mix the solution as previously described. Finally, transfer 10 microliters of the solution in the 100X tube, to the 1000X tube.

          To begin the PCR protocol, thaw the necessary reagents on ice. Then, prepare the PCR master mix. Since the DNA polymerase is active at room temperature, the reaction set up must occur on ice. Aliquot 49 microliters of the master mix into each of the PCR tubes. Then, add one microliter of template to each of the experimental tubes and one microliter of sterile water to the negative control tube, pipetting up and down to mix. After this, set the PCR machine according to the program described in the table. Place the tubes into the thermocycler and start the program.

          Once the program is complete, examine the quality of your product via agarose gel electrophoresis, as previously demonstrated. A successful reaction using the described protocol should yield a single band of approximately 1.5 kilobase. In this example, the sample containing 100X diluted gDNA yielded the highest quality product. Next, purify the best PCR product, in this case, the 100X gDNA, with a commercially available kit. Now the PCR product can be sent for sequencing.

          In this example, the PCR product is sequenced using forward and reverse primers. Thus, two data sets, each containing a DNA sequence and a DNA chromatogram, are generated: one for the forward primer and the other for the reverse primer. First, examine the chromatograms generated from each primer. An ideal chromatogram should have evenly spaced peaks with little to no background signals.

          If the chromatograms display double peaks, multiple DNA templates may have been present in the PCR products and the sequence should be discarded. If the chromatograms contained peaks of different colors in the same location, the sequencing software likely miscalled nucleotides. This error can be manually identified and corrected in the text file. The presence of broad peaks in the chromatogram indicates a loss of resolution, which causes miscounting of the nucleotides in the associated regions. This error is difficult to correct and mismatches in any of the subsequent steps should be treated as unreliable. Poor chromatogram reading quality, indicated by the presence of multiple peaks, usually occurs at the five prime and three prime ends of the sequence. Some sequencing programs remove these low quality sections automatically. If your sequence was not truncated automatically, identify the low quality fragments and remove their respective bases from the text file.

          Use a DNA assembly program to assemble the two primer sequences into one continuous sequence. Remember, sequences obtained using forward and reverse primers should partially overlap. In the DNA assembly program, insert the two sequences in FASTA format into the appropriate box. Then, click the submit button and wait for the program to return the results.

          To view the assembled sequence, click on Contigs in the results tab. Then, to view the details of the alignment, select assembly details. Navigate to the website for the basic local alignment search tool, or BLAST, and select the nucleotide BLAST tool to compare your sequence to the database. Enter your sequence into the query sequence text box and select the appropriate database in the scroll down menu. Finally, click the BLAST button on the bottom of the page, and wait for the tool to return the most similar sequences from the database.

          In this example, the top hit is B. subtilis strain 168, showing 100% identity with the sequence in the BLAST database. If the top hit does not show 100% identity to your expected species or strain, click on the sequence which most closely matches your query to see the details of the alignment. Aligned nucleotides will be joined by short vertical lines and mismatched nucleotides will have gaps between them. Focusing on the identified mismatched regions, revise the sequence and repeat the BLAST search if desired.

          Subscription Required. Please recommend JoVE to your librarian.

          Applications and Summary

          Identifying bacterial species is important for different researchers, as well as for those in healthcare. 16S rRNA sequencing was initially used by researchers to determine phylogenetic relationships between bacteria. In time, it has been implemented in metagenomic studies to determine biodiversity of environmental samples and in clinical laboratories as a method to identify potential pathogens. It enables a quick and accurate identification of bacteria present in clinical samples, facilitating earlier diagnosis and faster treatment of patients.

          Subscription Required. Please recommend JoVE to your librarian.

          References

          1. Weisburg, W.G., Barns, S.M., Pelletier, D.A. and Lane D.J. 16S ribosomal DNA amplification for phylogenetic study. J Bacteriol.173 (2): 697-703. (1991)
          2. Drancourt, M., Bollet, C., Carlioz, A., Martelin, R., Gayral, J.P., Raoult D. 16S ribosomal DNA sequence analysis of a large collection of environmental and clinical unidentifiable bacterial isolates. J Clin Microbiol.38 (10):3623-3630. (2000)
          3. Woo, P.C., Lau, S.K., Teng, J.L., Tse, H., Yuen, K.Y. Then and now: use of 16S rDNA gene sequencing for bacterial identification and discovery of novel bacteria in clinical microbiology laboratories. Clin Microbiol Infect.14 (10):908-934. (2008)
          4. Tang, Y.W., Ellis, N.M., Hopkins, M.K., Smith, D.H., Dodge, D.E., Persing, D.H. Comparison of phenotypic and genotypic techniques for identification of unusual aerobic pathogenic gram-negative bacilli. J Clin Microbiol.36 (12):3674-3679. (1998)
          5. Tsiboli, P., Herfurth, E., Choli, T. Purification and characterization of the 30S ribosomal proteins from the bacterium Thermus thermophilus. Eur J Biochem.226 (1):169-177. (1994)
          6. Woese, C.R. Bacterial evolution. Microbiol Rev.51 (2):221-271. (1987)
          7. Bartlett, J.M., Stirling, D. A short history of the polymerase chain reaction. Methods Mol Biol.226:3-6. (2003)
          8. Wilson, K.H., Blitchington, R.B., Greene, R.C. Amplification of bacterial 16S ribosomal DNA with polymerase chain reaction. J Clin Microbiol.28 (9):1942-1946. (1990)
          9. Shendure, J., Balasubramanian, S., Church, G.M., Gilbert, W., Rogers, J., Schloss, J.A., Waterston, R.H. (2017) DNA sequencing at 40: past, present and future. Nature.550:345-353.
          10. Lane, D.J. 16S/23S rRNA sequencing. (1991) In Nucleic acid techniques in bacterial systematics. (Goodfellow, M. and Stackebrandt, E., eds.) p.115-175. Wiley and Sons, Chichester, United Kingdom.
          11. Turner, S., Pryer, K.M., Miao, V.P., Palmer, J.D. (1999) Investigating deep phylogenetic relationships among cyanobacteria and plastids by small subunit rRNA sequence analysis. J Eukaryot Microbiol.46:327-338.
          12. Fredricks, D.N., Relman, D.A. (1998) Improved amplification of microbial DNA from blood cultures by removal of the PCR inhibitor sodium polyanetholesulfonate. J Clin Microbiol.36:2810-2816.
          13. Wilson, K. Preparation of genomic DNA from bacteria. (2001) Curr Protoc Mol Biol. Chapter 2:Unit 2.4.
          14. Wright, M. H., Adelskov, J., Greene, A.C. (2017) Bacterial DNA extraction using individual enzymes and phenol/chloroform separation. J Microbiol Biol Educ.18:18.2.48.
          15. Huang, X., Madan, A. (1999). CAP3: A DNA sequence assembly program. Genome Res.9:868-877.

          Transcript

          Earth is home to millions of bacterial species, each with unique characteristics. Identifying these species is critical in evaluating environmental samples. Doctors also need to distinguish different bacterial species to diagnose infected patients.

          To identify bacteria, a variety of techniques can be employed, including microscopic observation of morphology or growth on a specific media to observe colony morphology. Genetic analysis, another technique for identifying bacteria has grown in popularity in recent years, due in part to 16S ribosomal RNA gene sequencing.

          The bacterial ribosome is a protein RNA complex consisting of two subunits. The 30S subunit, the smaller of these two subunits, contains 16S rRNA, which is encoded by the 16S rRNA gene contained within the genomic DNA. Specific regions of 16S rRNA are highly conserved, due to their essential function in ribosome assembly. While other regions, less critical to function, may vary among bacterial species. The variable regions in 16S rRNA, can serve as unique molecular fingerprints for bacterial species, allowing us to distinguish phenotypically identical strains.

          After obtaining a quality sample of gDNA, PCR of the 16S rRNA encoding gene can begin. PCR is a commonly used molecular biology method, consisting of cycles of denaturation of the double-stranded DNA template, annealing of universal primer pairs, which amplify highly conserved regions of the gene, and the extension of primers by DNA polymerase. While some primers amplify most of the 16S rRNA encoding gene, others only amplify fragments of it. After PCR, the products can be analyzed via agarose gel electrophoresis. If amplification was successful, the gel should contain a single band of an expected size, depending upon the primer pair used, up to 1500bp, the approximate length of the 16S rRNA gene.

          After purification and sequencing, the obtained sequences can then be entered into the BLAST database, where they can be compared with reference 16S rRNA sequences. As this database returns matches based on the highest similarity, this allows confirmation of the identity of the bacteria of interest. In this video, you will observe 16S rRNA gene sequencing, including PCR, DNA sequence analysis and editing, sequence assembly and database searching.

          When handling microorganisms, it is essential to follow good microbiological practice, including using aseptic technique and wearing appropriate personal protective equipment. After performing an appropriate risk assessment for the microorganism or environmental sample of interest, obtain a test culture. In this example, a pure culture of Bacillus subtilis is used.

          To begin, grow your microorganism on a suitable medium in the appropriate conditions. In this example, Bacillus subtilis 168 is grown in LB broth overnight in a shaking incubator set to 200 rpm at 37 degrees Celsius. Next, use a commercially available kit to isolate genomic DNA or gDNA from 1.5 milliliters of the B. subtilis overnight culture.

          To check the quality of the isolated DNA, first mix five microliters of the isolated gDNA with one microliter of DNA gel loading dye. Then, load the sample onto a 0.8% agarose gel, containing DNA staining reagent, such as SYBR safe or EtBr. After this, load a one kilobase molecular mass standard onto the gel, and run the electrophoresis until the front dye is approximately 0.5 centimeters from the bottom of the gel. Once the gel electrophoresis is complete, visualize the gel on a blue light transilluminator. The gDNA should appear as a thick band, above 10 kilobase in size and have minimal smearing.

          After this, to create serial dilutions of the gDNA, label three microcentrifuge tubes as 10X, 100X, and 1000X. Then, use a pipette to dispense 90 microliters of sterile distilled water into each of the tubes. Next, add 10 microliters of the gDNA solution to the 10X tube. Pipette the whole volume up and down to ensure the solution is mixed thoroughly. Then, remove 10 microliters of the solution from the 10X tube and transfer this to the 100X tube. Mix the solution as previously described. Finally, transfer 10 microliters of the solution in the 100X tube, to the 1000X tube.

          To begin the PCR protocol, thaw the necessary reagents on ice. Then, prepare the PCR master mix. Since the DNA polymerase is active at room temperature, the reaction set up must occur on ice. Aliquot 49 microliters of the master mix into each of the PCR tubes. Then, add one microliter of template to each of the experimental tubes and one microliter of sterile water to the negative control tube, pipetting up and down to mix. After this, set the PCR machine according to the program described in the table. Place the tubes into the thermocycler and start the program.

          Once the program is complete, examine the quality of your product via agarose gel electrophoresis, as previously demonstrated. A successful reaction using the described protocol should yield a single band of approximately 1.5 kilobase. In this example, the sample containing 100X diluted gDNA yielded the highest quality product. Next, purify the best PCR product, in this case, the 100X gDNA, with a commercially available kit. Now the PCR product can be sent for sequencing.

          In this example, the PCR product is sequenced using forward and reverse primers. Thus, two data sets, each containing a DNA sequence and a DNA chromatogram, are generated: one for the forward primer and the other for the reverse primer. First, examine the chromatograms generated from each primer. An ideal chromatogram should have evenly spaced peaks with little to no background signals.

          If the chromatograms display double peaks, multiple DNA templates may have been present in the PCR products and the sequence should be discarded. If the chromatograms contained peaks of different colors in the same location, the sequencing software likely miscalled nucleotides. This error can be manually identified and corrected in the text file. The presence of broad peaks in the chromatogram indicates a loss of resolution, which causes miscounting of the nucleotides in the associated regions. This error is difficult to correct and mismatches in any of the subsequent steps should be treated as unreliable. Poor chromatogram reading quality, indicated by the presence of multiple peaks, usually occurs at the five prime and three prime ends of the sequence. Some sequencing programs remove these low quality sections automatically. If your sequence was not truncated automatically, identify the low quality fragments and remove their respective bases from the text file.

          Use a DNA assembly program to assemble the two primer sequences into one continuous sequence. Remember, sequences obtained using forward and reverse primers should partially overlap. In the DNA assembly program, insert the two sequences in FASTA format into the appropriate box. Then, click the submit button and wait for the program to return the results.

          To view the assembled sequence, click on Contigs in the results tab. Then, to view the details of the alignment, select assembly details. Navigate to the website for the basic local alignment search tool, or BLAST, and select the nucleotide BLAST tool to compare your sequence to the database. Enter your sequence into the query sequence text box and select the appropriate database in the scroll down menu. Finally, click the BLAST button on the bottom of the page, and wait for the tool to return the most similar sequences from the database.

          In this example, the top hit is B. subtilis strain 168, showing 100% identity with the sequence in the BLAST database. If the top hit does not show 100% identity to your expected species or strain, click on the sequence which most closely matches your query to see the details of the alignment. Aligned nucleotides will be joined by short vertical lines and mismatched nucleotides will have gaps between them. Focusing on the identified mismatched regions, revise the sequence and repeat the BLAST search if desired.


          Background

          Right after the death of an organism, microbial communities colonize the decomposing tissues and together with enzymes from the organism they start degrading the DNA molecules [1,2,3]. DNA degradation is dependent on time and environmental variables such as temperature but also humidity and acidity [4]. Even though the specific model for DNA decay is still debated and it is likely multifactorial [4], the consequence is that ancient remains typically contain very few molecules of endogenous DNA and these sequences are characterized by short fragment sizes [5].

          A second major challenge of ancient DNA research is contamination from exogenous sources [6, 7]. Environmental DNA molecules in the soil matrix the ancient sample was recovered from can easily overwhelm the small amounts of endogenous DNA. This is also true for DNA from people who collected and handled the samples in the field and/or museum collections [8, 9]. While the use of Polymerase Chain Reaction (PCR) technology allowed ancient DNA research to overcome low concentration problems, the sensitivity of the PCR has made it very difficult to avoid introducing modern contaminant sequences among the authentic ancient DNA [10].

          In the last decade, together with more refined DNA extraction and laboratory methods tailored to efficiently retrieve very short and scarce DNA sequences [5, 11], it has become possible to obtain massive amounts of sequences from ancient material using high-throughput sequencing technologies. These technologies have allowed the recovery of hundreds of ancient human (reviewed in [12]) and other high quality ancient faunal genomes such as those from horses [13], wooly mammoths [14], and bears [15]. However, the challenges from exogenous contamination remain and have sparked a search for computational methods to identify and monitor contaminant DNA sequences in ancient sequencing datasets.

          Aside from the short fragment size, the other most notable characteristic of ancient DNA is post-mortem damage. After death, the repairing mechanisms of DNA damage such as hydrolysis and oxidation stop functioning, and this damage accumulates in predictable patterns [16, 17] The most common ancient DNA damage is deamination of cytosines to uracils in the overhangs of fragmented DNA molecules [16, 18, 19]. This results in an excess of C to T substitutions in the 5′ end (and G to A in the 3′ end) of ancient DNA sequences. Since this feature is very common in sequences derived from ancient DNA sources and absent in younger samples, it has been widely used as a key criteria to authenticate ancient DNA experiments [5, 20].

          In modern-day ancient DNA studies, exogenous sequences are differentiated from real ancient sequences from the source organism by mapping all sequences to a reference genome and keeping only those that result in alignments with less than a defined number of differences [21, 22]. This approach to circumvent environmental contamination has gained general acceptance, and currently exogenous contaminants are at most considered problematic due to their consumption of sequencing capacity. However, the probability of spurious alignments from exogenous sequences occurring by chance increases with decreasing sequence length [23]. In order to avoid these, thresholds for minimum fragment length, that still allow for enough specificity of the alignments, are used [24,25,26].

          Modern human contamination is especially problematic for human palaeogenomic studies since ancient, anatomically modern humans typically fall within the variation of modern humans [27, 28]. This has led to the development of a plethora of methods aimed at computationally quantifying and monitoring exogenous contamination in ancient human DNA datasets [29]. However, the number of methods that allow for the effective exclusion of this type of contamination remains limited. For example, Skoglund et al. [30] used the differential empirical distributions of post-mortem damage (PMD) scores, based on both base quality scores and their level of polymorphism with respect to the reference genome, to differentiate DNA sequences from ancient and modern samples. The PMD scores in a contaminated ancient sample could then be used to successfully identify and separate the sequences that are most likely to have originated from an ancient template molecule from the contaminant ones. Even though this method can allow for the enrichment of the proportion of ancient sequences several-fold in respect to the contaminant sequences, the amount of data lost in the process is very large (45–90%) depending on the age of the ancient sample [30].

          Here we use competitive mapping to investigate the presence of exogenous sequences in ancient sequencing files to evaluate the pervasiveness of human contamination in ancient faunal DNA studies. Previous ancient DNA studies have used similar strategies, i.e. mapping the sequenced ancient DNA data to several reference sequences at the same time, to identify target microbial species (e.g. [31, 32]). We use competitive mapping to identify the levels of contamination in ancient faunal sequencing files and characterize the exogenous sequences by using summary statistics to compare them to those of authentic ancient DNA. We then present this strategy as a simple and fast method that enables the conservative removal of human contamination from ancient faunal datasets with a limited loss of true ancient DNA sequences.


          Acknowledgements

          We are grateful to Melanie Kuhlmann for excellent technical assistance. We thank Sebastian Packheiser who contributed to cloning and plant transformation. This work was supported by the Ministry of Education and Science (BMBF) grant “AnnoBeet: Annotation des Genoms der Zuckerrübe unter Berücksichtigung von Genfunktionen und struktureller Variabilität für Nutzung von Genomdaten in der Pflanzenbiotechnologie" (FKZ 0315962 A). A further "thanks" to our industrial partners KWS Saat AG and Syngenta Seeds GmbH. We acknowledge support for the Article Processing Charge by the German Research Foundation and the Open Access Publication Fund of Bielefeld University Library.


          Watch the video: How to Convert to and from a FASTA Formatted File (August 2022).