Predicting and identifying microbes and enzyme DNA sequences with metabolic prediction
I am presently working on the metagenomics of coal biomethanation by a bacterial consortium.

I have the sequencing results (Illumina). The dataset is huge, and I can't predict anything from the raw sequence. I have gone through different databases like MetaCyc, BioCyc, etc. Please help me: how can I draw inferences about the metabolic enzymes that are involved in the biomethanation of coal?


The first step after sequencing is finding probable genes. After that, the genes and their proteins can be classified into protein families. That is about the most you can do with completely unknown genes. Nowadays it is possible to predict the final structure using contact maps (if no homologous structure is known), but in many cases this will still leave you unclear about ligands. So, the final step is to clarify the function with biochemical lab methods.

So, if you are stuck with a huge sequence, first try to find the genes in it.

http://en.wikipedia.org/wiki/Gene_prediction

For subsequent annotation/classification, I recommend InterPro and PROSITE.

http://www.ebi.ac.uk/Tools/pfa/iprscan/

http://prosite.expasy.org/scanprosite/
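If it helps, here is a rough sketch of that two-step workflow in Python, assuming the Prodigal gene finder and a local InterProScan installation are on your PATH; the file names are placeholders for your own data:

# Minimal sketch: predict genes, then classify the proteins by domain content.
import subprocess

ASSEMBLY = "contigs.fasta"          # your assembled metagenome (placeholder)
PROTEINS = "predicted_proteins.faa"

# 1. Predict genes; '-p meta' selects Prodigal's metagenome mode.
subprocess.run(
    ["prodigal", "-i", ASSEMBLY, "-a", PROTEINS, "-p", "meta"],
    check=True,
)

# 2. Annotate the predicted proteins against InterPro member databases
#    (which include PROSITE profiles); output is a TSV of domain hits.
subprocess.run(
    ["interproscan.sh", "-i", PROTEINS, "-f", "TSV", "-o", "interpro_hits.tsv"],
    check=True,
)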


Well, first off, we don't know how you sequenced the data. What did you actually sequence? Did you look at transcriptional activity using RNA-seq, or did you do full genome sequencing? Were there any paired-end reads? How did you create your library? Did you enrich the bacterial consortium for biomethanation activity?

Where do your reads map? The consortium complicates things, but you will likely need to build a quality contig library before you do anything else. Without a doubt, much of your library will not map to anything interesting, but you should still annotate the library by seeing where the reads map. Identifying ORFs will be useful.

Likely, you will have a set of genes that are involved in biomethanation. You should BLAST the bejesus out of your contig library for hits. Do they match any of your ORFs? Going the other way, do your ORFs match any of the genes in EcoCyc or BioCyc? Do you need to use a larger database?
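Something like the following would get you started on that BLAST step, assuming the NCBI BLAST+ tools are installed; the FASTA file names and the identity cutoff are placeholders, and the reference set is whatever curated methanogenesis proteins you pull from KEGG/BioCyc:

# Sketch: screen predicted ORF proteins against a curated reference set.
import subprocess

ORFS = "predicted_proteins.faa"        # ORFs from your contig library
REFERENCE = "methanogenesis_refs.faa"  # curated reference proteins (assumed)

subprocess.run(["makeblastdb", "-in", REFERENCE, "-dbtype", "prot"], check=True)
subprocess.run(
    ["blastp", "-query", ORFS, "-db", REFERENCE,
     "-outfmt", "6", "-evalue", "1e-10", "-out", "orf_hits.tsv"],
    check=True,
)

# Each line of orf_hits.tsv is a tab-separated hit: query, subject,
# %identity, length, mismatches, gaps, qstart, qend, sstart, send,
# e-value, bitscore.
with open("orf_hits.tsv") as hits:
    for line in hits:
        query, subject, pident = line.split("\t")[:3]
        if float(pident) > 40.0:       # arbitrary illustrative cutoff
            print(query, "->", subject, pident)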


Predicting and identifying microbes and enzyme DNA sequences with metabolic prediction - Biology

Microorganisms can be classified on the basis of cell structure, cellular metabolism, or on differences in cell components.

Learning Objectives

Distinguish between phenotypic characteristics for Bacteria, Archaea and Eukaryotes

Key Takeaways

Key Points

  • The relationship between the three domains (Bacteria, Archaea, and Eukaryota) is of central importance for understanding the origin of life. Most of the metabolic pathways are common between Archaea and Bacteria, while most genes involved in genome expression are common between Archaea and Eukarya.
  • Microorganisms are very diverse. They include bacteria, fungi, algae, and protozoa, as well as microscopic plants and animals. Single-celled microorganisms were the first forms of life to develop on earth, approximately 3 billion–4 billion years ago.
  • The Gram stain characterizes bacteria based on the structural characteristics of their cell walls. By combining morphology and Gram staining, most bacteria can be classified as belonging to one of four groups (Gram-positive cocci, Gram-positive bacilli, Gram-negative cocci, and Gram-negative bacilli).
  • There are some basic differences between Bacteria, Archaea, and Eukaryotes in cell morphology and structure which aid in phenotypic classification and identification.

Key Terms

  • Gram stain: A method of differentiating bacterial species into two large groups (Gram-positive and Gram-negative).
  • microorganism: An organism that is too small to be seen by the unaided eye, especially a single-celled organism, such as a bacterium.
  • domain: In the three-domain system, one of three taxa at that rank: Bacteria, Archaea, or Eukaryota.

Microorganisms are very diverse. They include bacteria, fungi, algae, and protozoa, as well as microscopic plants (green algae) and animals (such as rotifers and planarians). Most microorganisms are unicellular (single-celled), but this is not universal.

Single-celled microorganisms were the first forms of life to develop on earth, approximately 3 billion–4 billion years ago. Further evolution was slow, and for about 3 billion years in the Precambrian eon, all organisms were microscopic. So, for most of the history of life on earth the only forms of life were microorganisms. Bacteria, algae, and fungi have been identified in amber that is 220 million years old, which shows that the morphology of microorganisms has changed little since the Triassic period. When, at the end of the 19th century, information began to accumulate about the diversity within the bacterial world, scientists started to include the bacteria in phylogenetic schemes to explain how life on earth may have developed. Some of the early phylogenetic trees of the prokaryote world were morphology-based. Others were based on the then-current ideas on the presumed conditions on our planet at the time that life first developed.

Microorganisms tend to have a relatively rapid evolution. Most microorganisms can reproduce rapidly, and microbes such as bacteria can also freely exchange genes through conjugation, transformation, and transduction, even between widely-divergent species. This horizontal gene transfer, coupled with a high mutation rate and many other means of genetic variation, allows microorganisms to swiftly evolve (via natural selection) to survive in new environments and respond to environmental stresses.

The relationship between the three domains (Bacteria, Archaea, and Eukaryota) is of central importance for understanding the origin of life. Most of the metabolic pathways, which comprise the majority of an organism’s genes, are common between Archaea and Bacteria, while most genes involved in genome expression are common between Archaea and Eukarya. Within prokaryotes, archaeal cell structure is most similar to that of Gram-positive bacteria.

Phenotypic Methods of Classifying and Identifying Microorganisms

Classification seeks to describe the diversity of bacterial species by naming and grouping organisms based on similarities. Microorganisms can be classified on the basis of cell structure, cellular metabolism, or on differences in cell components such as DNA, fatty acids, pigments, antigens, and quinones.

Bacterial Morphology: Basic morphological differences between bacteria. The most often found forms and their associations.

There are some basic differences between Bacteria, Archaea, and Eukaryotes in cell morphology and structure which aid in phenotypic classification and identification:

The relative sizes of prokaryotic cells: Relative scales of eukaryotes, prokaryotes, viruses, proteins and atoms (logarithmic scale).

  • Bacteria: lack membrane-bound organelles and can function and reproduce as individual cells, but often aggregate in multicellular colonies. Their genome is usually a single loop of DNA, although they can also harbor small pieces of DNA called plasmids. These plasmids can be transferred between cells through bacterial conjugation. Bacteria are surrounded by a cell wall, which provides strength and rigidity to their cells.
  • Archaea: In the past, the differences between bacteria and archaea were not recognized and archaea were classified with bacteria as part of the kingdom Monera. Archaea are also single-celled organisms that lack nuclei. Archaea in fact differ from bacteria in both their genetics and biochemistry. While bacterial cell membranes are made from phosphoglycerides with ester bonds, archaean membranes are made of ether lipids.
  • Eukaryotes: Unlike bacteria and archaea, eukaryotes contain organelles such as the cell nucleus, the Golgi apparatus, and mitochondria in their cells. Like bacteria, plant cells have cell walls and contain organelles such as chloroplasts in addition to the organelles in other eukaryotes.

The Gram stain, developed in 1884 by Hans Christian Gram, characterizes bacteria based on the structural characteristics of their cell walls. The thick layers of peptidoglycan in the “Gram-positive” cell wall stain purple, while the thin “Gram-negative” cell wall appears pink. By combining morphology and Gram-staining, most bacteria can be classified as belonging to one of four groups (Gram-positive cocci, Gram-positive bacilli, Gram-negative cocci, and Gram-negative bacilli). Some organisms are best identified by stains other than the Gram stain, particularly mycobacteria or Nocardia, which show acid-fastness on Ziehl–Neelsen or similar stains. Other organisms may need to be identified by their growth in special media, or by other techniques, such as serology.

Gram-positive bacteria: Streptococcus mutans visualized with a Gram stain.

While these schemes allowed the identification and classification of bacterial strains, it was unclear whether these differences represented variation between distinct species or between strains of the same species. This uncertainty was due to the lack of distinct structures in most bacteria, as well as lateral gene transfer between unrelated species. Due to lateral gene transfer, some closely related bacteria can have very different morphologies and metabolisms. To overcome this uncertainty, modern bacterial classification emphasizes molecular systematics, using genetic techniques such as guanine cytosine ratio determination, genome-genome hybridization, as well as sequencing genes that have not undergone extensive lateral gene transfer, such as the rRNA gene.



Results

Predicting candidate sequences for orphan enzymes based on (meta)genomic and metabolic pathway neighbours

We first identified 555 orphan enzymes that operate in metabolic pathways (i.e., connected to at least one other enzyme by a common compound) by analysing the KEGG database (Kanehisa et al, 2008) (Figure 1). After identifying the EC numbers of the pathway neighbours of these orphan ECs, we retrieved all genes with the same EC number from the 338 prokaryotic genomes of the STRING7 resource (von Mering et al, 2007). For the genes in the 63 metagenomes, EC numbers were assigned via a best BLAST match to KEGG orthologous groups (see Materials and methods and Supplementary Section 1). As neighbouring prokaryotic genes are often involved in the same metabolic pathway, we analysed the genomic neighbourhood and retrieved gene sequences of relevant genome neighbours as candidate genes for the orphan enzymes. Using genomic data we extracted roughly 400 000 candidate genes, and roughly 97 000 from metagenomic data (Supplementary dataset 1).
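As a purely illustrative sketch of this first screening step (all identifiers and data below are invented stand-ins for KEGG/STRING content, not the paper's actual data structures): for an orphan EC number, look up its pathway neighbours, find genes annotated with those neighbour ECs, and collect their genome neighbours as candidates.

# Toy sketch of candidate retrieval via pathway and genome neighbours.
pathway_edges = {                  # EC -> ECs connected by a shared compound
    "2.6.1.38": {"2.6.1.1", "4.3.1.3"},
}
genes_by_ec = {                    # EC -> genes carrying that annotation
    "2.6.1.1": ["g017"], "4.3.1.3": ["g204"],
}
genome_neighbours = {              # gene -> nearby genes on the chromosome
    "g017": ["g016", "g018"], "g204": ["g203", "g205"],
}

def candidates_for(orphan_ec: str) -> set:
    out = set()
    for neighbour_ec in pathway_edges.get(orphan_ec, ()):
        for gene in genes_by_ec.get(neighbour_ec, ()):
            out.update(genome_neighbours.get(gene, ()))
    return out

print(candidates_for("2.6.1.38"))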

To quantify the likelihood that a specific candidate gene performs the function of the orphan enzyme, we developed a scoring scheme based on four parameters: (1) the genome neighbourhood score (NBH), which measures the distance between two neighbouring genes as well as the evolutionary conservation of the synteny; this metric captures the biological phenomenon that functionally associated genes are usually clustered in conserved operon structures. (2) The co-occurrence score (COR), which measures how often two genes occur within the same genome; this metric reflects the tendency for members of the same pathway to appear in genomes together. (3) The pathway neighbour score (PNE), which normalizes for the varying numbers of pathway neighbours of the orphan enzyme. (4) The signature domain score (DOM), which indicates whether candidate proteins contain domain(s) that are unique to enzymes catalysing reactions similar to that of the orphan enzyme (sharing the same first three EC numbers).
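To make the scoring scheme concrete, here is an illustrative Python sketch (not the paper's implementation); it simply bins a candidate by the four scores in the way the benchmarking below classifies predictions, with invented gene names and values:

# Toy sketch: discretize the four scores (NBH, COR, PNE, DOM) into bins.
from dataclasses import dataclass

@dataclass
class Candidate:
    gene_id: str
    nbh: float   # genome neighbourhood score
    cor: float   # co-occurrence score
    pne: int     # number of supporting pathway neighbours
    dom: int     # 1 if a signature domain is present, else 0

def score_bin(c: Candidate) -> tuple:
    """Bins like those in Figure 2: NBH thresholds 0.4-0.9,
    COR thresholds 0.1-0.6, PNE as 1 versus 2 or more."""
    nbh_bin = max(t for t in (0.0, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9) if c.nbh >= t)
    cor_bin = max(t for t in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6) if c.cor >= t)
    pne_bin = 1 if c.pne == 1 else 2
    return (nbh_bin, cor_bin, pne_bin, c.dom)

candidates = [Candidate("geneA", nbh=0.85, cor=0.42, pne=3, dom=1),
              Candidate("geneB", nbh=0.45, cor=0.12, pne=1, dom=0)]
for c in candidates:
    print(c.gene_id, score_bin(c))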

Benchmarking revealed that high-confidence candidate sequences can be obtained for over 100 orphan enzymes

To assess the accuracy of our pipeline and to determine the best combination of the four scoring parameters, we benchmarked our predictions using 100 sets of 350 randomly selected enzymes from the KEGG database (that have corresponding sequences) (Figure 2). We considered each of these to be orphan enzymes, applied the newly developed pipeline and then assigned the candidate genes a set of four scores, one for each of the parameters (NBH, COR, PNE and DOM). We classified the predictions according to their four scores and then, to estimate the accuracy of each scoring parameter or combination of parameters, we calculated the proportion of the predictions that were assigned to the correct EC number. First, to understand the predictive power of each of the four scoring parameters, we benchmarked each parameter separately, using the genomic and metagenomic data (Figure 2B). Predictions from the genome data illustrate that the co-occurrence score is the best predictor and correlates most strongly with the overall accuracy. The parameter COR also works well in metagenomic data, but for more than 30% of the metagenomic sequences, phylogenetic profiles could not be constructed due to a lack of sequence similarity to currently available data. Here, the signature domains allowed many predictions (Figure 2B). Second, we performed benchmarking for each combination of the four scoring parameters. Although each individual scoring parameter works to some extent, benchmarking clearly shows that integrating the four parameters is better than using any one parameter in isolation (Figure 2A). Finally, we assembled a set of high-confidence predictions from all of the parameter combinations that yielded an accuracy greater than 70% (Figure 2A), resulting in predicted sequences for 131 orphan enzymes (Supplementary Table 2 and Supplementary datasets 2 and 3). For some of the parameter combinations, even more than 90% accuracy is expected.

Benchmarking of the scoring parameters. (A) Accuracy plot derived from genomic (red) and metagenomic data (blue) using the combination of neighbourhood score (NBH), co-occurrence (COR), signature domains (DOM) and pathway neighbours (PNE). Each candidate gene/neighbouring gene pair was assigned a score for NBH and COR. Each candidate gene was also assigned a PNE and DOM score. The predictions were classified according to their four scores: NBH (≥0.4, ≥0.5, ≥0.6, ≥0.7, ≥0.8, ≥0.9), COR (≥0.1, ≥0.2, ≥0.3, ≥0.4, ≥0.5, ≥0.6), DOM (0 or 1) and PNE (1, or 2 or more). Then, for each combination of scoring parameters, the number of correct and incorrect EC number assignments was calculated in order to determine the accuracy of each parameter combination. In total, 100 randomized datasets were generated to benchmark the prediction pipeline. Each point represents all predictions from a specific combination of the four parameters (center). The horizontal axis indicates the positive predictive value (PPV), calculated as the number of true positives (TP) over the sum of TP and false positives (FP). The vertical axis indicates the number of predictable enzymes. The yellow-shaded area represents the high-confidence set of predictions, assembled from the union of all points yielding greater than 70% accuracy. (B) Accuracy plot for each separate parameter calculated using genomic or metagenomic data. The colour and size of the points represent the intensity of the scores. The grey dots indicate the combined plot in (A).
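A minimal sketch of this benchmarking logic, with invented data, is shown below: predictions are grouped by their score-parameter bin, PPV = TP / (TP + FP) is computed per bin, and bins above 70% form the high-confidence set.

# Toy sketch of per-bin accuracy (PPV) computation.
from collections import defaultdict

# (bin, predicted_EC, true_EC) triples from the randomized benchmark sets
predictions = [
    (("nbh>=0.8", "cor>=0.4"), "2.6.1.14", "2.6.1.14"),
    (("nbh>=0.8", "cor>=0.4"), "2.6.1.38", "2.6.1.38"),
    (("nbh>=0.4", "cor>=0.1"), "1.1.1.1",  "4.2.1.20"),
]

counts = defaultdict(lambda: [0, 0])   # bin -> [TP, FP]
for bin_key, predicted, true in predictions:
    counts[bin_key][0 if predicted == true else 1] += 1

high_confidence = []
for bin_key, (tp, fp) in counts.items():
    ppv = tp / (tp + fp)
    print(bin_key, "PPV =", round(ppv, 2))
    if ppv > 0.70:
        high_confidence.append(bin_key)
print("high-confidence bins:", high_confidence)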

We then manually investigated the 131 orphan enzymes with high-confidence predictions in more detail. Reconciliation with additional databases and literature searches revealed that 26 out of these 131 already have a sequence deposited in the curated Swissprot database or in the literature (Supplementary Figure 4 and Supplementary Tables 3 and 4). For 17 of the 26 (65%) database sequences there was homology to sequences from EC numbers that agreed up to at least the first digit (Supplementary Figure 5). Our candidate sequences that have no orthology to the sequences in the database may represent alternative orthologous groups catalysing the same reaction, as about 70% of the EC numbers in KEGG are encoded by more than one orthologous group (Supplementary Figure 3A). Therefore, we do not consider these to be mispredictions, but they can no longer be called orphan enzymes, although none of these sequences are indicated in the enzyme-specific databases ExPASy-ENZYME or KEGG. The activities of the remaining 105 orphan enzymes range from core metabolism, such as nucleotide metabolism, to peripheral pathways (Figure 3A, Supplementary Figure 6), and we could assign over 16 000 sequences to these.

Breakdown of the predicted enzymes. (A) The number of EC numbers for which candidate genes can be predicted using parameter combinations with greater than 70% accuracy. Red indicates candidate genes that were derived only from genomic data, and blue indicates candidate genes that were derived only from metagenomic data. (B) The pie charts represent the proportion of the gene candidates that have an unknown function versus a current annotation, for genes from genomic (red) and metagenomic (blue) data. The striped area represents genes that were detected only in genomic or metagenomic data, whereas the genes represented by the solid colours were identified in both genomic and metagenomic data. (C, left) The novelty of the predictions is illustrated at the enzyme level and the gene level. The enzymes were categorized into three categories: (1) all candidate genes for that enzyme are currently annotated as functionally unknown (yellow), (2) some (usually most) of the candidate genes for the enzyme are functionally unknown while others are annotated with an EC number (yellow/green) and (3) all of the candidate genes for that enzyme have a current EC annotation (green). The candidate genes are then divided into functionally unknown (yellow) and currently annotated (green). (C, right) For the 40% of the candidate genes that are currently annotated, we illustrate the level of agreement between our predicted EC number and the current annotation. We overlaid this with similar data from KEGG, as over 30% of the OGs in KEGG are assigned to multiple EC numbers (Supplementary Figure 2). White bars represent multifunctionality of enzymatic activity in the original KEGG data, and green bars the currently annotated candidate genes.

Experimental confirmation of the predicted enzymatic function for two candidate sequences

After determining that our pipeline can yield high-confidence predictions of candidate sequences for orphan enzymes, we performed experimental confirmations. We assessed the ease of experimental validation (e.g., access to gDNA) for some of the high-confidence predictions. Out of 45 corresponding EC numbers, 15 sequences were amenable to cloning, and 7 were chosen for functional validation based on the commercial availability of the reactants as well as the ability to monitor the substrates and products using available analytical methods. Of the six proteins that were successfully heterologously expressed, the proposed function was verified for two enzymes (Supplementary Section 5).

We succeeded in experimentally verifying the correct function of candidate sequences for EC 2.6.1.14 (asparagine oxo-acid transaminase, Figure 4A left) and EC 2.6.1.38 (histidine transaminase, Figure 4A right), demonstrating the reliability of this prediction pipeline. Using the prediction pipeline, we retrieved candidate sequences for these two enzymes from genomic (EC 2.6.1.14) or genomic and metagenomic data (EC 2.6.1.38) (Figure 4B). Candidate sequences were heterologously expressed, and in assays containing the purified candidate proteins and the substrates, the expected reaction products were unambiguously identified using a combination of LC/MS and MS/MS (Figure 4C and Supplementary Figures 11 and following; see Supplementary Section 5 for details). Concerning the four other candidate proteins (for EC 2.1.1.19, 2.1.1.68, 2.3.1.32 and 2.7.1.28), neither product formation nor substrate consumption was detected in enzymatic assays by LC/MS. For EC 2.7.1.28, a peak of very slight intensity with an m/z consistent with that of one of the products, D-glyceraldehyde-3-phosphate, could be detected. Nevertheless, the LC/MS analyses did not allow us to confirm the predicted activity, as the substrate D-glyceraldehyde could never be detected, and neither ATP consumption nor ADP formation could be established. In addition, two different continuous spectrophotometric assays were set up to try to confirm the predicted activity. In the first, the production of ADP was coupled to the consumption of NADH, using commercial pyruvate kinase and lactate dehydrogenase, along with phosphoenolpyruvate. In the second, the production of glyceraldehyde-3-phosphate was coupled to the production of NADH using commercial glyceraldehyde-3-phosphate dehydrogenase. In both cases, the assays were inconclusive. However, as detailed in the Discussion, there can be many difficulties in the experimental validation of an enzyme's function; therefore, absence of evidence is not necessarily evidence of absence.

Orphan enzymes with experimental validation. (A) The chemical reactions catalysed by the two orphan enzymes for which candidate sequences were experimentally validated. (B) Metabolic pathway neighbours and genome neighbours of the orphan enzymes. (C) Extracted ion chromatogram (EIC) and MS/MS plots supporting the identity of the expected reaction products.

Assessing functional novelty and multifunctionality for the candidate sequences

After the benchmarking and experimental validations showed the reliability of the pipeline, we examined the validated orphan enzymes and their corresponding genes in more detail. As expected from the benchmarking, the number of enzymes for which candidate sequences can be predicted was greater for genomic than for metagenomic data (Figure 3A). This is due in part to the short length of contigs in metagenomic data, which reduces the number of genomic neighbours available for the first screen of our pipeline. For 48 enzymes, candidate sequences were predicted from both metagenomic and genomic data. However, for 13 orphan enzymes we found candidate sequences only in metagenomic data, exemplifying the ability of this pipeline to detect sequences from bacteria in environmental samples. One example is biotin-CoA synthetase (EC 6.2.1.11), found in the gut metagenomes. This prediction is supported by the fact that bacterial synthesis and degradation of biotin is known to be important in the human large intestine (Said, 2009; Arumugam et al, 2011).

As many as 9884 of the individual candidate sequences (about 60%) are annotated as 'function unknown', 'hypothetical' or similar (Figure 3B), and assigning them to orphan activities thus provides functional annotations that can be further propagated into newly sequenced genomes through the use of homology-based annotation methods. An even higher fraction of unannotated sequences predicted to code for orphan enzymes can be found in metagenomic data (Figure 3B).

Overall, 40% of the candidate sequences are already annotated with an EC number (Figure 3C). We believe that the vast majority of these imply multifunctionality, as this is a common attribute of enzymes (Nobeli et al, 2009). Indeed, over 30% of the genes in the KEGG database are assigned to more than one EC number (Supplementary Figure 3B). Of these multifunctional enzymes in KEGG, about 30% are assigned to EC numbers that agree up to 3 digits, while another 50% have no agreement between the different EC numbers. Our candidate sequences that have a current annotation and are potentially multifunctional show a similar trend in the level of agreement between the assigned and predicted EC numbers (Figure 3C). It is therefore plausible that these genes with current annotations represent multifunctional enzymes, although we cannot rule out either mispredictions from our pipeline or errors in the current annotations due to the automatic nature of most genome annotations.

In addition to coupling unannotated sequences to specific functions, our predictions also provided putative functions for certain Domains of Unknown Function (DUF domains). The prediction pipeline led to the identification of five DUF domains that are unique to candidates of orphan enzymes. For example, DUF2254 is only present in genes predicted to encode the orphan EC 2.4.2.15, guanosine phosphorylase (Supplementary Table 5). As a byproduct of our pipeline, we also identified 150 DUF domains that are unique to specific non-orphan EC numbers yet had not been annotated so far (Supplementary Table 6); these should improve various studies that use domain databases such as Pfam or SMART (Finn et al, 2010; Letunic et al, 2012).

High-confidence predictions yield putative sequences for enzymes with commercial and biotechnological applications

Some orphan enzymes from our high-confidence predictions have potential commercial or medical applications. For example, EC 2.8.1.5, thiosulphate-dithiol sulphurtransferase, is involved in sulphur metabolic pathways that are essential in many pathogenic bacteria but not present in humans, and could therefore provide drug targets. In addition, four of the orphan enzymes with very high scores could be utilized for the synthesis of commercially available nutraceuticals, one could be used in the food industry, and another two have applications in bioremediation (Supplementary Table 7). Furthermore, candidate genes were predicted for phenylpyruvate decarboxylase (EC 4.1.1.43), using a parameter combination with 80% accuracy; this enzyme converts phenylpyruvate to phenylacetaldehyde, the first and crucial step in the synthesis of branched-chain higher alcohols as biofuels (Atsumi et al, 2008). The genes that our analysis linked to phenylpyruvate decarboxylase represent a valuable repertoire for efficient production of biofuels. All of the predictions and sequences are available at our website ( http://www.bork.embl.de/∼yamada/orphan_enzymes/).

Orphan enzyme reactions improve the accuracy of genome-scale metabolic models

To measure the impact of our findings on genome-scale metabolic models, we analysed the reactions represented in the 120 metabolic models obtained from the Model SEED database (Henry et al, 2010) (Supplementary Table 8) and determined whether any of them contained orphan enzymes for which we have reliable predictions. For most of the metabolic models, the reactions encoded by the orphan enzymes were not included, and thereby represent novel reactions. For each model, there were around 40 novel reactions, averaging about 5–10% of total reactions (Figure 5). Interestingly, this trend was observed for manually reconstructed models as well as for automatically reconstructed models. For example, in the most recent reconstruction for Escherichia coli (Orth et al, 2011), 49 novel reactions (from parameter combinations with estimated accuracy >70%) could be added to the model, while only 1 reaction in the current model represents one of these orphan enzymes (Supplementary Table 9). The fact that these orphan enzymes are not represented in the metabolic models shows that the completeness of these reconstructions is heavily reliant on the current annotation quality, and thus considerably affected by orphan enzymes.

Enrichment of genome-scale metabolic models by orphan enzymes. The barplot shows the number of reactions in 120 publicly available genome-scale metabolic models from Model SEED (Henry et al, 2010) (white) and novel enzymatic reactions for these models predicted by our pipeline with over 70% accuracy (red). Current gaps in terms of enzyme-catalysed reactions are also shown (blue). The line graph plots the fraction of novel reactions contributed by orphan enzymes. Only the 10 models with the highest fraction of novel reactions are shown. The histogram in the lower right shows the distribution of the novel fraction for the 120 SEED models used in this study (Supplementary Table 7).
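To make the model-enrichment step concrete, here is a hedged sketch using COBRApy; the metabolite IDs, reaction ID, stoichiometry and gene name are invented for illustration and are not taken from the paper.

# Sketch: add a predicted orphan-enzyme reaction to a genome-scale model.
import cobra

model = cobra.Model("example_model")
asn = cobra.Metabolite("asn__L_c", name="L-asparagine", compartment="c")
oxo = cobra.Metabolite("2oba_c", name="2-oxo acid (placeholder)", compartment="c")

rxn = cobra.Reaction("ASN_TA")               # hypothetical ID for EC 2.6.1.14
rxn.add_metabolites({asn: -1.0, oxo: 1.0})   # deliberately simplified stoichiometry
rxn.gene_reaction_rule = "candidate_gene_1"  # the predicted candidate gene
model.add_reactions([rxn])

print(len(model.reactions), "reaction(s) after adding the orphan reaction")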

To estimate the impact of the novel reactions on flux simulations using these models, we performed flux coupling analysis (FCA) (Burgard et al, 2004), before and after adding the corresponding novel orphan enzyme reactions into the models. Comparative FCA helped us to systematically elucidate the effects of adding new reactions on the topology of flux connectivity at the whole-network scale (see Materials and methods). In the case of the latest (manually curated) E. coli model (Orth et al, 2011), a large fraction (16%) of dependency relationships between the fluxes were altered following the addition of 49 novel reactions (Supplementary Figure 9). In general, the addition of the new reactions led to a decrease in the number of coupled reactions. For example, changes were detected in vitamin biosynthesis pathways where the addition of the orphan reactions led to a decrease in the number of fully coupled reactions (reaction pairs for which the corresponding fluxes are directly proportional to each other). This trend shows that the new reactions are relatively well embedded within the existing network and provide additional branches for flux routing.

Then, to establish whether adding the orphan enzyme reactions to the current models improves their accuracy, we determined whether the updated models were better at predicting gene essentiality. For a subset of the 72 SEED models tested, there was at least one gene for which the prediction changed from essential to non-essential, with the largest change being 26 genes in the case of Salmonella typhimurium. For the remaining models, no change in essentiality predictions was observed following the addition of the orphan enzyme reactions (Supplementary Figure 10). Addition of new reactions to a model can change the existing predictions in two different ways: (i) false essential predictions can then be correctly predicted as non-essential, and/or (ii) some of the true essential predictions are subsequently wrongly predicted as non-essential. To determine whether the observed changes in essentiality predictions were biologically meaningful, we compared the experimentally determined essentiality status of the genes to the essentiality status predicted from the models with and without the orphan enzyme reactions. Four of the species probed in our study had genome-wide gene-essentiality data available. For the Bacillus subtilis model, no changes were predicted for gene essentiality following the addition of the corresponding orphan enzyme reactions. However, for the other three species, E. coli K-12, Campylobacter jejuni subsp. jejuni NCTC 11168 and Helicobacter pylori J99, predictions for a total of 15 genes changed to non-essential due to the addition of the orphan enzyme reactions. All of these changes to non-essential were found to be consistent with the results from experimental genome-wide knock-out data, illustrating that the addition of the orphan enzyme reactions to the metabolic models made them more accurate for gene knock-out analyses (Figure 6B).

Gene-essentiality predictions for genome-scale metabolic models including orphan enzymes. (A) Distribution of the number of genes for which the computational prediction changed from essential to non-essential across 72 genome-scale metabolic models (Supplementary Table 8). (B) Comparison of the gene-essentiality predictions from the models with/without orphan enzymes to essentiality derived from experimental data. Only genes for which addition of the orphan enzymes altered the existing predictions are shown.
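A sketch of how such an essentiality comparison could be run with COBRApy is shown below; model_without and model_with stand for a model before and after adding the orphan reactions, and the growth threshold is an assumption, not a value from the paper.

# Sketch: compare predicted essential-gene sets of two model versions.
import cobra

def essential_genes(model: cobra.Model, threshold: float = 1e-6) -> set:
    """Call a gene essential if its knockout abolishes predicted growth."""
    essential = set()
    for gene in model.genes:
        with model:                    # model changes are reverted on exit
            gene.knock_out()
            growth = model.optimize().objective_value or 0.0
        if growth < threshold:
            essential.add(gene.id)
    return essential

# flipped = essential_genes(model_without) - essential_genes(model_with)
# 'flipped' then holds genes whose prediction changed from essential to
# non-essential after the orphan reactions were added.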


MATERIALS AND METHODS

Glimmer

Glimmer's salient feature is its use of interpolated Markov models (IMMs) for capturing gene composition ( 18). IMMs are variable-order Markov chain models that maximize the model order for each specific oligonucleotide window based on the amount of training data available. IMMs then interpolate the nucleotide distributions between the chosen order and one greater. Thus, IMMs construct the most sophisticated composition model that the training data sequences support. To segment the sequence into coding and non-coding sequence, Glimmer uses a flexible ORF-based framework that incorporates knowledge of how prokaryotic genes can overlap and upstream features of translation initiation sites (TIS) like the ribosomal binding site (RBS). Glimmer extracts every sufficiently long ORF from the sequence and scores it by the log-likelihood ratio of generating the ORF between models trained on coding versus non-coding sequence. The features included in the log-likelihood ratio are composition via the IMMs, RBS via a position weight matrix (PWM) and start codon usage. For simplicity, features are assumed to be independent so that the overall score can be computed as a sum of the individual feature log-likelihood ratios. A dynamic programming algorithm finds the set of ORFs with maximum score subject to the constraint that genes cannot overlap for more than a certain threshold, e.g. 30 bp.
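As an illustration of this final selection step, the simplified Python sketch below performs weighted-interval-scheduling dynamic programming over scored ORFs with a maximum pairwise overlap; real Glimmer also tracks strands and reading frames, which are omitted here.

# Toy sketch: choose the maximum-score set of ORFs whose pairwise
# overlap never exceeds a threshold (cf. the 30 bp constraint above).
from bisect import bisect_right

MAX_OVERLAP = 30  # bp

def select_orfs(orfs):
    """orfs: list of (start, end, score); returns the chosen subset."""
    orfs = sorted(orfs, key=lambda o: o[1])            # sort by end
    ends = [o[1] for o in orfs]
    best = [(0.0, [])]                                 # best over first i ORFs
    for i, (start, end, score) in enumerate(orfs):
        # index of the last ORF compatible with this one
        j = bisect_right(ends, start + MAX_OVERLAP, 0, i)
        take_score = best[j][0] + score
        if take_score > best[i][0]:
            best.append((take_score, best[j][1] + [(start, end, score)]))
        else:
            best.append(best[i])
    return best[-1][1]

print(select_orfs([(0, 300, 5.0), (280, 600, 4.0), (590, 900, 3.0)]))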

Additional features

Glimmer is ineffective on metagenomic sequences because its gene composition model is trained under the assumption that the sequences all originated from a single genome. Recent approaches both relax this assumption and add new features used to discriminate between coding and non-coding sequence. One approach called MetaGeneAnnotator (MGA) uses a similar framework to Glimmer by scoring ORFs and choosing a high scoring set using dynamic programming ( 25). MGA incorporates additional gene features, of which we add three—ORF length, adjacent gene orientation, and adjacent gene distance—to Glimmer. Below, we describe how to compute models for these features given an annotated genome. In the sections to follow, we further explain how such genomes are obtained.

First, we seek probability distributions for the length of coding and non-coding ORFs. For the coding model, our sample data are the lengths of annotated genes in the training genome. For the non-coding model, the lengths of non-coding ORFs that meet a minimum length threshold (75 bp) and a maximum overlap threshold with a gene (30 bp) are considered. One can estimate the distributions using a non-parametric method based on the histogram of lengths or a parametric method where one assumes a well-studied probability distribution and computes the maximum likelihood parameters ( 38). We use both methods to obtain our estimate. Where training data are plentiful, such as for common gene sizes, a non-parametric approach (such as kernel smoothing) offers greater modeling accuracy than any parameterized distribution. But when data are sparse, such as for very long ORFs, the non-parametric approach fails. For example, we cannot assign a useful probability to an ORF larger than any in our training set though it should obviously receive a large log-likelihood ratio score. A parameterized distribution can assign meaningful probabilities to ORFs of any length. We analyzed a number of distributions and found that a Gamma distribution most accurately modeled the gene length distributions examined and produced the highest accuracy gene predictions.

To combine the two versions, we use a histogram after kernel smoothing with a Gaussian kernel ( 38) for the first quartile (as determined by the raw counts), a Gamma distribution with maximum likelihood parameters for the last quartile and a linear combination of the two with a linearly changing coefficient in between (e.g. Figure 1). Performance was robust to other blending schemes and to the points at which the model changes. We score an ORF with the log-likelihood ratio that the feature was generated by the coding versus non-coding model and add it to the ORF's overall score.
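A minimal Python sketch of this blended length model, assuming NumPy and SciPy and using synthetic training lengths rather than real annotations, might look like this:

# Sketch: KDE where data are dense, Gamma fit where sparse, linear blend.
import numpy as np
from scipy import stats

# synthetic stand-in for annotated gene lengths from a training genome
lengths = np.random.gamma(shape=2.0, scale=150.0, size=5000)

kde = stats.gaussian_kde(lengths)                    # smoothed histogram
shape, loc, scale = stats.gamma.fit(lengths, floc=0) # ML Gamma parameters
q1, q3 = np.percentile(lengths, [25, 75])            # blending breakpoints

def length_density(x: float) -> float:
    """Coding-length density: KDE below Q1, Gamma above Q3, blend between."""
    gamma_pdf = stats.gamma.pdf(x, shape, loc=loc, scale=scale)
    if x <= q1:
        return float(kde(x))
    if x >= q3:
        return gamma_pdf
    w = (x - q1) / (q3 - q1)                         # linearly changing weight
    return (1 - w) * float(kde(x)) + w * gamma_pdf

# The same construction on non-coding ORF lengths gives the denominator
# of the log-likelihood ratio used to score an ORF's length.
print(length_density(250.0))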

Distributions for coding and non-coding ORF lengths (in amino acids) from Deinococcus radiodurans R1 estimated using the Gamma distribution (Gamma), a smoothed histogram (Hist), and a blend of the two (Blend) that uses the histogram model for the first quartile, the Gamma model for the last quartile, and a linear combination in between. The Hist model offers greater accuracy for short and medium-sized ORFs (e.g. the deviation from Gamma at 200 bp in the coding plot), but is useless for very long ORFs, which Gamma models more effectively. The shapes of the D. radiodurans length distributions are typical of the prokaryotic genomes examined, but Glimmer-MG estimates the distributions for each genome individually.


ORFs truncated by the end of their fragments require an adjustment to the length model. We know that the total length of a truncated ORF with X bp on a fragment is at least X and should therefore be scored higher than a complete X bp ORF. We accomplish this by modeling the joint distribution of the length and the presence of start and stop codons ( Supplementary Methods ).

Features computed on pairs of adjacent genes also capture useful information. For example, genes are frequently arranged nearby in the same orientation to form transcriptional units called operons ( 39). Alternatively, consecutive genes with opposing ‘head-to-head’ orientations (where the 5′-ends of the genes are adjacent) tend to be further apart to allow room for each gene's respective RBS. We added two features of adjacent genes: their orientation with respect to each other and the distance between them. Again, we need distributions for coding and non-coding ORFs to score a pair of adjacent genes by their log-likelihood ratio. The gene model uses all adjacent pairs of annotated genes. For the non-coding model, we consider pairs including non-coding ORFs that satisfy the length and overlap constraints with their adjacent annotated genes.

For adjacent gene orientation, we count the number of times each adjacent arrangement appears in the training data and normalize the counts to probabilities. The adjacent gene distance model is estimated similarly to the gene length models described above. However, common parameterized distributions were not a good fit for the distances so we rely solely on a smoothed histogram. Because one gene's start codon often overlaps another gene's stop codon due to shared nucleotides, we do not smooth the histogram for distances implying overlapping start or stop codons. We incorporate these features during Glimmer's dynamic programming algorithm for choosing ORFs by adding the log-likelihood ratios when linking an ORF to its previous adjacent ORF.
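The orientation part of this model reduces to counting arrangements and taking a log-likelihood ratio; the toy Python sketch below illustrates this with invented training pairs.

# Toy sketch of the adjacent-gene orientation model:
# '>>' co-directed, '><' head-to-head (5' ends adjacent), '<>' tail-to-tail.
import math
from collections import Counter

coding_pairs = ['>>', '>>', '>>', '><', '<>', '>>']   # from annotated genes
noncoding_pairs = ['><', '<>', '>>', '><']            # from non-coding ORFs

def to_probs(labels):
    counts = Counter(labels)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

p_coding = to_probs(coding_pairs)
p_noncoding = to_probs(noncoding_pairs)

def orientation_llr(arrangement: str) -> float:
    """Log-likelihood ratio added in the DP when linking adjacent ORFs."""
    return math.log(p_coding[arrangement] / p_noncoding[arrangement])

print(orientation_llr('>>'))   # co-directed pairs score positively here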

Classification

All previously published approaches to metagenomic gene prediction parameterize the gene composition models as a function of the sequence GC-content. For example, MetaGeneMark computes (offline) a logistic regression for each dicodon frequency as a function of GC-content for a large set of training genomes and sets its hidden Markov model parameters (online) according to the GC-content of the metagenomic sequence ( 28). For whole genomes, gene composition model training has traditionally been performed on annotated close evolutionary relatives rather than genomes with similar GC-content ( 40). Many methods for assigning a taxonomic classification to a metagenomic sequence are currently available ( 29–32). Here, we suggest using one of these methods called Phymm ( 29), rather than GC-content, to find evolutionary relatives of the metagenomic sequences on which to train. Phymm trains an IMM on every reference genome in GenBank ( 41), scores each input sequence with all IMMs and assigns a classification at each taxonomic level according to the reference genome of the highest scoring IMM. Phymm's IMMs are single-periodic and trained on all genomic sequence, in contrast to Glimmer's IMMs which are three-periodic and trained only on coding sequences.

Thus, before predicting genes, we run Phymm on the input sequences to score each sequence with each reference IMM. To train the gene prediction models, we use gene annotations for the genomes corresponding to the highest scoring IMMs. These annotations are taken from NCBI's RefSeq database ( 42). Though classification with Phymm is very accurate, the highest scoring IMM is rarely from the sequence's exact source genome. For this reason, we found that training over multiple genomes (e.g. three) captured a broader signal that improved prediction accuracy. Though most of the training can be performed offline, the models over multiple genomes must be combined online for each sequence. Features such as the length, start codon and adjacent gene distributions are easy to combine across multiple training genomes by simply summing the feature counts.

IMMs cannot be combined quickly, and saving trained IMMs for all combinations of two or three genomes would require too much disk space. In practice, pairs of genomes with similar composition are far more likely to be top classification hits together and we can restrict our offline training to only these pairs ( Supplementary Methods ).

Glimmer-MG's RBS model trains using ELPH (http://cbcb.umd.edu/software/ELPH), a motif finder based on Gibbs sampling, to learn a 6-bp PWM from the 25-bp upstream of every gene in the training set. We train these PWMs offline for each individual reference genome, but like the other features, RBS modeling for metagenomic sequences benefits from the broader signal obtained by combining over multiple training genomes. Averaging PWMs for the top three Phymm classifications can be done quickly, but dilutes the signal. Instead, we generalized the RBS model in Glimmer-MG to score the upstream region of each start codon using a mixture of PWMs in equal proportions. Thus, a gene's RBS score is the probability that the best 6 bp motif in the 25-bp upstream of the start codon was generated by a mixture of three PWMs normalized by a null model based on GC-content to a log-likelihood ratio.
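To illustrate the mixture scoring, the sketch below slides a 6-bp window over a 25-bp upstream region and scores the best window against an equal-weight mixture of PWMs normalized by a GC null model; the PWMs here are made-up toys, not ELPH output.

# Toy sketch of the mixture-of-PWMs RBS score.
import math

# Each PWM: 6 positions, each a dict of base -> probability
PWM_AGGAGG = [{b: (0.85 if b == t else 0.05) for b in "ACGT"}
              for t in "AGGAGG"]
pwms = [PWM_AGGAGG, PWM_AGGAGG, PWM_AGGAGG]   # top-3 classifications (toy)

def null_prob(base: str, gc: float) -> float:
    return gc / 2 if base in "GC" else (1 - gc) / 2

def rbs_score(upstream: str, gc: float) -> float:
    """Best-window log-likelihood ratio: PWM mixture versus GC null."""
    best = -math.inf
    for i in range(len(upstream) - 5):
        window = upstream[i:i + 6]
        mix = sum(
            math.prod(pwm[j][b] for j, b in enumerate(window))
            for pwm in pwms
        ) / len(pwms)
        null = math.prod(null_prob(b, gc) for b in window)
        best = max(best, math.log(mix / null))
    return best

print(rbs_score("TTTTAGGAGGTTTTTTTTTTATGAA", gc=0.5))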

Two interesting cases warrant further discussion. First, a novel sequence may not be phylogenetically related to any known reference genome in the database. Here, Phymm's highest scoring IMMs will merely represent the reference genomes with most similar nucleotide composition. Prior work demonstrating the relationship between even simple nucleotide composition statistics and prediction model parameters supports the validity of this strategy ( 24–28). In addition, we did not detect a significant relationship between prediction accuracy and the divergence of a sequence from the reference genome database ( Supplementary Figure S2 ). Second, some sequences will contain horizontally transferred genes. While single genome gene prediction typically cannot implement a model general enough to predict these genes accurately, Glimmer-MG is more robust because Phymm will likely ‘mis-classify’ the sequence containing the gene by scoring the sequence more highly with IMMs more representative of the genome from which the gene was transferred than the sequence's true source genome.

Clustering

The following prediction pipeline has been applied successfully on whole prokaryotic genomes. First, train models on a finished and annotated close evolutionary relative. Make initial predictions, but then retrain the models on them and make a final set of predictions ( 40). By using Phymm to find training genomes, we replicate the first step in this pipeline for application to metagenomics. However, retraining on the entire set of sequences would combine genes from many different organisms and yield a non-specific and ineffective model. If the sequences could be separated by their source genome, retraining could be applied.

We accomplish this goal using Scimm, an unsupervised clustering method for metagenomic sequences that models each cluster with a single-periodic IMM ( 34). After initially partitioning the sequences into a specified number of clusters, Scimm repeats the following three steps until the clusters are stable: train IMMs on the sequences assigned to their corresponding clusters, score each sequence using each cluster IMM and reassign each sequence to the cluster corresponding to its highest scoring IMM. While Scimm may not partition the sequences exactly by their source organism, the mistakes that it tends to make do not create significant problems for retraining gene prediction models. In cases where Scimm merges sequences from two organisms together, they are nearly always phylogenetically related at the family level ( 34). Though some families can be quite diverse, this shared phylogenetic relationship, combined with the nucleotide composition similarity that Scimm more directly identifies, is encouraging. Scimm sometimes separates sequences from a single organism into multiple clusters, but this occurs most often for highly abundant organisms, in which case there will usually still be enough training data in each cluster to be informative. The Phymm classifications that have already been obtained imply an initial clustering at a specified taxonomic level (e.g. family), which can be used as an initial partition for the iterative clustering optimization in a mode of the program referred to as PhyScimm ( 34). Using PhyScimm also implicitly chooses the number of clusters, removing this free variable.
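The clustering loop itself is simple to sketch; below is a toy Python version in which a 3-mer composition model stands in for Scimm's single-periodic IMMs. All sequences and parameters are illustrative.

# Toy Scimm-like loop: train models, score, reassign, repeat until stable.
import math
import random
from collections import Counter

def train(seqs, k=3):
    counts = Counter(s[i:i + k] for s in seqs for i in range(len(s) - k + 1))
    total = sum(counts.values())
    return {kmer: (c + 1) / (total + 4 ** k) for kmer, c in counts.items()}

def log_score(seq, model, k=3):
    floor = 1 / 4 ** k
    return sum(math.log(model.get(seq[i:i + k], floor))
               for i in range(len(seq) - k + 1))

def scimm_like(seqs, n_clusters=2, iters=20, seed=0):
    random.seed(seed)
    assign = [random.randrange(n_clusters) for _ in seqs]
    for _ in range(iters):
        models = [train([s for s, a in zip(seqs, assign) if a == c])
                  for c in range(n_clusters)]
        new = [max(range(n_clusters), key=lambda c: log_score(s, models[c]))
               for s in seqs]
        if new == assign:          # clusters are stable
            break
        assign = new
    return assign

print(scimm_like(["ATATATATATAT", "ATATATATAT", "GCGCGCGCGCGC", "GCGCGCGC"]))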

After clustering the sequences, we focus on each cluster individually to retrain the coding IMM, RBS and start codon models before making the final predictions within that cluster. The ORF length and adjacent ORF feature distributions are more difficult to estimate from short sequence fragments, so we continue to learn them using the Phymm classifications to whole annotated genomes. If the cluster is too small, retraining may not have enough data to capture the gene features, and prediction accuracy may decrease. We tested various thresholds and requiring at least 80 Kb of predicted coding sequence for retraining produced the highest accuracy predictions. For clusters with less, we do not retrain and instead finalize the gene predictions from the initial iteration. Accuracy may also decrease if the cluster is heterogeneous and does not effectively model some of its sequences. For each sequence, we compute the ratio between the likelihood that the cluster IMM versus its top scoring Phymm IMM generated the sequence. If the ratio is too low, we assume that the cluster does not represent this sequence well enough and finalize its initial predictions. The full pipeline for metagenomic gene prediction is depicted in Figure 2.

Glimmer-MG pipeline. First, we classify the sequences with Phymm in order to find related reference genomes on which to train the feature models. We use these to make initial gene predictions. Next, we cluster the sequences with Scimm, starting from an initial partition given by the Phymm classifications. Within each cluster, we retrain the models on the initial predictions before using all information to make the final set of predictions.


Sequencing errors

Gene prediction on raw sequencing reads or contigs with low coverage must contend with sequencing errors. The most prevalent type of error made by the 454 sequencing technology is an insertion or deletion (indel) at a homopolymer run. Indels cause major problems for gene prediction by shifting the coding frame of the true gene, making it impossible for a method without a model for these errors to predict it exactly. When Glimmer-MG encounters a shifted gene, the most frequent outcome is two predictions, each of which covers half of the gene up to the point of the indel and then beyond ( Figure 3). Such predictions have limited utility.

Indel errors. Depicted above is a common case where indel sequencing errors disrupt a gene prediction. This simulated 454 read of 526 bp falls within a gene in the forward direction, but has an insertion at position 207 and a deletion at position 480. Without modeling sequencing errors, Glimmer-MG begins to predict the gene correctly (shown in green), but is shifted into the wrong frame by the insertion (shown in red) and soon hits a stop codon. Downstream, Glimmer-MG makes another prediction in the correct coding frame, but it too is forced into the wrong frame by the deletion. By allowing Glimmer-MG to predict frameshifts from sequencing errors, the prediction follows the coding frame nearly perfectly: the insertion site is exactly predicted and the deletion site is off by only 19 bp.


While the problems caused by sequencing errors have been known for some time ( 22, 23), only recently has a good solution been published in the program FragGeneScan ( 26). FragGeneScan uses a hidden Markov model in which each of the three positions within a codon is represented by a model state, but allows irregular transitions between the codon states that imply the presence of an indel in the sequence. On simulated sequences containing errors, FragGeneScan achieves far greater accuracy than previous methods that ignore the possibility of errors.

Since Glimmer-MG uses an ORF-based approach to gene prediction, we must take a more ad hoc approach to building an error model into the algorithm. First, we address 454 indel errors. When Glimmer-MG is scoring the composition of an ORF using the coding and non-coding IMMs, we allow branching into alternative reading frames. More specifically, we traverse the sequence and identify low-quality base calls (defined below) that are strong candidates for a sequencing error. At these positions, we split the ORF into three branches. One branch scores the ORF as is. The other two switch into different frames to finish scoring, implying an insertion and deletion prediction. ORFs that change frames are penalized by the log-likelihood ratio of the predicted correction's probability to the original base call probability. A maximum of two indel predictions per ORF is used to limit the computation time. After scoring all ORFs, ORFs with the same start and stop codon (but potentially different combinations of interior indels) are clustered and only the highest scoring version is kept. All remaining ORFs are pushed to the dynamic programming stage where the set of genes with maximum score subject to overlap constraints is chosen. However, the algorithm is further constrained to disallow an indel prediction in a region of overlapping genes.

Focusing on low-quality base calls (typically <5–10% of the sequence) makes the computation feasible. If quality values are available for the sequences, either from the raw read output or the consensus stage of an assembler, Glimmer-MG uses them and designates base calls less than a quality value threshold as potential branch sites. For 454 sequences that are missing quality values, we designate the final base of homopolymer runs longer than a length threshold as potential branch sites.

In Illumina reads, indels are rare, and the primary errors are substitutions ( 44). Most substitutions do not affect a start or stop codon and are nearly irrelevant to the gene prediction. We focus on the most detrimental error, which is a substitution that converts a regular codon to a stop codon, thus prematurely truncating the gene. To predict such errors, we consider substitution errors to remove each stop codon in the sequence. That is, for every ORF, we consider an altered ORF where the previous stop codon did not exist, thus combining the current ORF with the previous one in the same frame. Similarly to the 454 error model, we penalize these altered ORFs with the log-likelihood ratio (based on the quality values) comparing the probability that the stop codon contains a sequencing error that changed it from a regular codon to the probability that it truly is a stop codon. All normal and altered ORFs are considered during the dynamic programming stage to choose the maximum scoring set of ORFs.
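As a rough sketch of such a penalty (the exact form Glimmer-MG uses may differ), the function below converts the Phred qualities of the three stop-codon bases into a log-likelihood ratio for reading through the stop:

# Sketch: quality-based penalty for assuming a stop codon is a miscall.
import math

def phred_to_perror(q: int) -> float:
    return 10 ** (-q / 10)

def stop_removal_penalty(qualities: tuple[int, int, int]) -> float:
    """Log-likelihood ratio (<= 0) added when an ORF 'reads through' a stop:
    P(at least one of the three bases is miscalled) vs P(all correct)."""
    p_all_correct = math.prod(1 - phred_to_perror(q) for q in qualities)
    p_any_error = 1 - p_all_correct
    return math.log(p_any_error / p_all_correct)

print(stop_removal_penalty((30, 30, 30)))   # high quality: strong penalty
print(stop_removal_penalty((8, 10, 12)))    # low quality: mild penalty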

Whole genomes

Although we implemented the additional gene features with metagenomics in mind, they improve accuracy on whole genomes as well. In Glimmer3.0, the following pipeline was recommended ( 20). First, using a program called long-orfs, find long non-overlapping ORFs in the sequence with amino acid composition that is typical of prokaryotic genomes. Train the coding IMM on these sequences, and predict genes on the genome. On the initial predictions, train the RBS and start codon models. Finally, make a second set of gene predictions incorporating the new models.

We recommend a similar scheme for a new whole-genome pipeline, designated as Glimmer3.1. As before, we use long-orfs to train an IMM and predict an initial set of genes. Without a length model, these initial predictions tend to include many erroneous small gene predictions. We use a log-likelihood ratio threshold to filter out the lowest scoring ones. On the remaining genes, we retrain all models—IMM, RBS, start codons, length and adjacency features—before predicting again. To eliminate any remaining bias from the initial prediction and filtering, we retrain and predict one final time.

The preceding pipeline is unsupervised, but we can do slightly better on average by following Glimmer-MG and using GenBank reference genomes. In this pipeline, we first classify our new genome with Phymm to find similar reference genomes. Alternatively, a researcher may be able to specify these genomes based on prior knowledge. We train the RBS, start codon, length and adjacency models from the RefSeq annotations of these similar genomes as described above. For the gene IMM, accuracy is better if we use long-orfs compared with an IMM trained on related reference genomes. After making initial predictions, we retrain the IMM, RBS and start codon models before predicting genes a final time.

Simulated metagenomes

We constructed simulated datasets from 1206 prokaryote genomes in GenBank ( 41) as of November 2010. Since Glimmer-MG involves clustering the sequences, it is important to have realistic simulated metagenomes. For each metagenome, we randomly chose 50 organisms and included all chromosomes and plasmids. We sampled organism abundances from the Pareto distribution, a power-law probability distribution that has previously been used for modeling metagenomes ( 45). Reference genomes included in the metagenome were removed from Phymm's database so that the sequences appeared novel and unknown. To simulate a single read, we selected a chromosome or plasmid with probability proportional to the product of its length and the organism's abundance and then chose a random position and orientation from that sequence. To enable comparison between experiments with different read lengths and error rates, we simulated 20 metagenomes (i.e. organisms, abundances, read positions and read orientations) and used them to derive each experiment's dataset. We labeled the reads using RefSeq gene annotations ( 42), excluding genes described as hypothetical proteins.
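A condensed Python sketch of this sampling scheme, with made-up genome lengths and parameters, is shown below:

# Sketch: Pareto abundances; reads sampled by length-times-abundance weight.
import numpy as np

rng = np.random.default_rng(42)
genomes = {"orgA": 4_600_000, "orgB": 2_100_000, "orgC": 5_400_000}

names = list(genomes)
abundances = rng.pareto(a=1.5, size=len(names)) + 1   # power-law abundances
weights = np.array([genomes[n] for n in names]) * abundances
weights /= weights.sum()

def simulate_read(read_len=500):
    org = rng.choice(names, p=weights)
    start = rng.integers(0, genomes[org] - read_len)
    strand = rng.choice(["+", "-"])
    return org, int(start), strand

print([simulate_read() for _ in range(3)])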

In experiments where we considered sequencing errors, we focused on three prevailing technologies. Two varieties of high-throughput, short-read technology with very different characteristics have become ubiquitous tools for sequencing genomes, including metagenomics ( 46). The Illumina platform generates reads of 35–150 bp with sequencing errors consisting almost entirely of substitutions ( 44). The 454 platform generates reads of 400–550 bp in which indels account for nearly all errors ( 47). Sanger sequencing, with read lengths of 600–1000 bp and both substitution and indel errors, is less popular in recent studies owing to its greater expense and lower throughput. We include it both because previous programs were designed and tested with this technology in mind and because Sanger reads resemble contigs assembled from the more prevalent short-read technologies, in terms of both length and the tendency of errors to occur at the fragment ends.

To imitate Sanger reads, we used the lengths and quality values of real reads from the NCBI Trace Archive ( 41) as templates. That is, for each fragment simulated from a genome as described above, we randomly chose a real Sanger read from our set to determine the length and quality values of the simulated read. We then injected errors into the read according to the quality values, using a ratio of five substitutions per indel. To achieve a specific error rate for a dataset, we multiplied the probability of error at every base by a factor defined by the desired rate. To imitate the Illumina platform, we similarly used real 124 bp reads as templates to obtain quality values, but injected only substitution errors. For 454 reads, we used the read simulator FlowSim, which closely replicates the stochastic 454 sequencing process to generate sequences and their quality values ( 43). We conservatively quality-trimmed all read ends to avoid large segments of erroneous sequence.
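
A minimal sketch of quality-driven substitution injection, as used for the Illumina-style reads, might look as follows. Scaling the per-base error probability by a rate factor follows the description above; the uniform choice among the three alternative bases is our own simplifying assumption:

```python
import random

def inject_substitutions(read, quals, rate_factor=1.0, seed=0):
    """Inject substitution errors into a read according to its Phred
    quality values (a sketch, not the authors' simulator).

    read: DNA string; quals: per-base Phred scores from a real template
    read; rate_factor: multiplier used to tune the overall error rate.
    """
    rng = random.Random(seed)
    bases = "ACGT"
    out = []
    for base, q in zip(read, quals):
        # Phred Q gives error probability 10^(-Q/10), scaled by the factor.
        p_err = min(1.0, rate_factor * 10 ** (-q / 10.0))
        if rng.random() < p_err:
            # Substitute a uniformly chosen different base.
            out.append(rng.choice([b for b in bases if b != base]))
        else:
            out.append(base)
    return "".join(out)

# Low-quality positions (Q=2) are far more likely to be mutated.
print(inject_substitutions("ACGTACGT", [30, 2, 2, 30, 2, 2, 30, 2], rate_factor=2.0))
```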

Accuracy

We computed accuracy in a few ways to capture the multiple goals of gene prediction. Sensitivity is the ratio of true positive predictions to the number of true genes, and precision is the ratio of true positive predictions to the number of predicted genes. Since the RefSeq annotations tend to be incomplete after the removal of hypothetical proteins, which are unconfirmed computational predictions, we consider sensitivity the more important measure, as ‘false positive’ predictions may actually be real genes. For this reason, the precision values in our experiments are artificially low and should be interpreted carefully. For all experiments, we computed the sensitivity and precision of the 5′- and 3′-ends of the genes separately. Because a gene has a single, well-defined 3′ site, 3′ prediction is generally given more attention. In contrast, there are frequently many choices for the 5′-end of a gene and a paucity of sequence information to discriminate between them. Adding to the difficulty, most of the 5′ annotations even in the high-quality RefSeq database are unverified.

In experiments with sequencing errors, indels shift the gene's frame, and substitutions can corrupt the start and stop codons. To measure the ability of the gene predictor to follow the coding frame, we compute sensitivity and precision at the nucleotide level. That is, every nucleotide is considered a unit, and a true positive prediction must annotate the nucleotide as coding in the correct frame. A gene prediction that is correct up to a sequencing indel but continues in the wrong frame beyond it receives partial credit, whereas a prediction that identifies the error location and shifts its frame accordingly receives full credit.
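
This metric can be expressed compactly in code; the sketch below is our own formulation of the frame-aware nucleotide-level counts, with illustrative data structures:

```python
def frame_aware_accuracy(true_coding, pred_coding):
    """Nucleotide-level sensitivity and precision where a true positive must
    be annotated as coding in the correct frame.

    true_coding, pred_coding: dicts mapping genomic position -> frame
    (0, 1 or 2) for positions annotated/predicted as coding.
    """
    tp = sum(1 for pos, frame in pred_coding.items()
             if true_coding.get(pos) == frame)
    sensitivity = tp / len(true_coding) if true_coding else 0.0
    precision = tp / len(pred_coding) if pred_coding else 0.0
    return sensitivity, precision

# A prediction that drifts into the wrong frame after an indel loses credit
# only for the out-of-frame positions, giving it partial credit overall.
truth = {i: i % 3 for i in range(30)}
pred = {i: (i % 3 if i < 15 else (i + 1) % 3) for i in range(30)}
print(frame_aware_accuracy(truth, pred))  # (0.5, 0.5)
```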


Discussion

Here we have described a global strategy to predict candidate sequences for orphan enzymes. Candidate sequences were obtained using a combination of metabolic pathway adjacency and genomic neighbourhood information. Overall, a lower proportion of candidate sequences was obtained from metagenomic data than from genomic data, but this might only be due to the restrictions we had to impose: Sanger and 454 samples that have low coverage of the respective genomes. Although many novel enzymes and organisms may be represented in metagenomic samples, the human gut and marine metagenomes that we used are complex communities with hundreds of species (Qin et al, 2010) and a long tail of low-abundance organisms (Arumugam et al, 2011), thereby limiting the coverage of each individual genome and thus the extent of assembly. Consequently, the majority of the contigs that we analysed contained only two genes, limiting the number of neighbouring gene pairs that can be detected (Supplementary Figures 7 and 8). Although some available metagenomic datasets contain a large number of long contigs, these are usually dominated by a few genomes and thus would not offer access to an increased number of genomes (Tyson et al, 2004; Garcia Martin et al, 2006). In the future, contigs will become longer as read lengths increase and assembly algorithms improve, enhancing the ability of this pipeline to make predictions from metagenomic data and allowing greater access to novel activities hidden in environmental samples.

In addition to the benchmarking, we supported our predictions with experimental validation of the proposed enzymatic function for two out of six heterologously expressed candidate proteins. This ratio of experimental successes is lower than the 70% expected accuracy; however, we would not expect it to match the theoretical prediction accuracy, because validating a specific enzymatic function is a complex process involving many variables. First, an enzyme may be purified in soluble form yet become inactive during purification through improper handling or exposure to unfavourable conditions such as oxygen. In addition, the proteins purified in this study were histidine-tagged (his-tagged), as many heterologously expressed proteins are, and a terminal his-tag can dramatically decrease the activity of a protein (Kadas et al, 2008) or render it totally inactive (Albermann et al, 2000; Halliwell et al, 2001). Moreover, many variables must be optimized for the enzymatic activity tests: a given assay may succeed only after adjusting the buffer type, buffer pH, cofactors, incubation time, incubation temperature or the analytical methods used. For example, in assay optimization trials for EC 2.6.1.38, changing the LC/MS mobile phase from 10 mM ammonium acetate to water increased the peak area of the product glutamate more than 11-fold (Supplementary Figure 16). However, there is a practical limit to how many permutations of experimental conditions can be attempted, and further optimization is feasible only if the initial screening assay is close to the optimal conditions. Nevertheless, the two validations in hand are a proof of principle for our approach, and even without further experimental validation the benchmarks indicated high-accuracy candidate sequences for 131 orphan enzymes, more than a third of the tractable enzymes stored in pathway databases.

To assess the impact of this expanded enzyme knowledge on systems biology, we compared the currently available genome-scale metabolic models with and without the addition of the high-confidence orphan enzyme predictions. Gene-knockout simulations showed that some genes considered essential in the current models became non-essential after the addition of the orphan enzymes. These additions increased the accuracy of the models, as all genes whose predicted essentiality changed now agree with the experimentally determined essentiality status. Interestingly, several of the reactions for which the essential-to-non-essential predictions changed were reactions introduced by the automated gap-filling procedure during the reconstruction process. This observation suggests that the orphan enzyme reactions will influence not only the model simulations but likely also the gap-filling procedure, and thereby the reaction content of the final model, beyond the simple addition of a few new reactions. Taken together, the percentage of novel reactions, the FCA results and the improved gene-essentiality predictions mean that our findings will improve the automatic as well as the manual reconstruction process for genome-scale metabolic models and applications thereof (Oberhardt et al, 2009).
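
As an illustration, gene-knockout simulations of this kind can be run with a constraint-based modeling library such as COBRApy. The sketch below is not the authors' procedure; the model file name and the 5% growth threshold are placeholders:

```python
# A sketch using the COBRApy library (https://opencobra.github.io/cobrapy/);
# the SBML filename and the growth cutoff are hypothetical.
from cobra.io import read_sbml_model
from cobra.flux_analysis import single_gene_deletion

model = read_sbml_model("model_with_orphan_reactions.xml")  # placeholder file
wild_type_growth = model.optimize().objective_value

# Knock out each gene in turn; single_gene_deletion returns a DataFrame
# with the growth rate of each deletion mutant.
results = single_gene_deletion(model)

# Flag genes whose loss reduces growth below 5% of wild type as essential.
essential = results[results["growth"] < 0.05 * wild_type_growth]
print(f"{len(essential)} genes predicted essential")
```

Running this on two versions of a model, with and without the orphan enzyme reactions, and comparing the two essential sets reproduces the kind of essentiality shift described above.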

About 70% of the orphan enzymes in KEGG do not have pathway neighbours and are thus not amenable to our current pipeline (Figure 1). However, in the future, our candidate gene identification pipeline could be modified to identify other genes that might be functionally related to the orphan enzymes through the integration of genome-scale functional data, such as gene lethality screens (Nichols et al, 2011), genetic interactions (Costanzo et al, 2010) or gene-expression profiles. Candidate genes could then be retrieved by searching the genomic neighbourhood of the orthologs of genes that are functionally related to the orphan enzymes. Furthermore, the current pipeline is only applicable to prokaryotic genomes, although it could be extended to partially analyse fungal genomes, as certain secondary metabolite pathways are known to be organized in gene clusters (Regueira et al, 2011).

The linkage of sequences to these orphan functions means that these functions can now be exploited in genome-, transcriptome- and proteome-based methods. Here we illustrated the impact on genome-scale metabolic models. This benefit will propagate into many different biological systems, as these sequences will act as bait allowing newly sequenced genomes to be ascribed these functions through homology-based annotation. This is the first systematic approach to retrieve sequences for many orphan enzymes, and the developed computational framework can be applied to additional genomes and metagenomes as they are sequenced.


Scientists predict academic achievement from DNA alone

Scientists from King's College London have used a new genetic scoring technique to predict academic achievement from DNA alone. This is the strongest prediction from DNA of a behavioural measure to date.

The research shows that a genetic score comprising 20,000 DNA variants explains almost 10 per cent of the differences in children's educational attainment at the age of 16. DNA alone therefore provides a much better prediction of academic achievement than gender or even 'grit', a personality trait thought to measure perseverance and passion for long-term goals.

Published today in Molecular Psychiatry, these findings mark a 'tipping point' in predicting academic achievement and could help with identifying children who are at greater risk of having learning difficulties.

Previous twin studies have found that 60 per cent of the differences in individuals' educational achievement are due to differences in DNA. Whilst this may seem far from the 10 per cent predicted in this study, the authors note that twin studies capture the sum total of all genetic effects, including common and rare variants, interactions between genes, and gene-environment interactions. Twin studies can therefore tell us the overall genetic influence on a trait in a population. Polygenic scores, however, estimate genetic influence from common variants only, which explains the discrepancy between these DNA-based studies and twin studies (10 per cent vs 60 per cent).

As human traits are so complex and influenced by thousands of gene variants of very small effect, it is useful to consider the joint effects of all of these trait-associated variants, and this principle underlies the polygenic score method. The value of polygenic scores is that they allow us to estimate genetic effects for academic achievement, or any other trait, at an individual level, based on a person's DNA.

Calculating an individual's polygenic score requires information from a genome-wide association study (GWAS) that finds specific genetic variants linked to particular traits, in this case academic achievement. Some of these genetic variants, known as single nucleotide polymorphisms (SNPs), are more strongly associated with the trait, and some are less strongly associated. In a polygenic score, the effects of these SNPs are weighted by the strength of association and then summed into a single score, so that people with many SNPs related to academic achievement will have a higher polygenic score and higher academic achievement, whereas people with fewer associated SNPs will have a lower score and lower levels of academic achievement.
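
In code, the principle reduces to a weighted sum. The sketch below uses hypothetical SNP identifiers and effect sizes purely for illustration:

```python
def polygenic_score(genotypes, effect_sizes):
    """Compute a polygenic score as the effect-size-weighted sum of
    trait-associated allele counts.

    genotypes: dict mapping SNP id -> allele dosage (0, 1 or 2 copies
    of the trait-associated allele).
    effect_sizes: dict mapping SNP id -> GWAS effect-size estimate (beta).
    """
    return sum(dosage * effect_sizes[snp]
               for snp, dosage in genotypes.items()
               if snp in effect_sizes)

# Toy example with three hypothetical SNPs.
betas = {"rs1": 0.02, "rs2": -0.01, "rs3": 0.005}
person = {"rs1": 2, "rs2": 0, "rs3": 1}
print(polygenic_score(person, betas))  # 0.045
```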

This new King's research is based on a recent GWAS that examined almost 10 million SNPs and identified 74 genetic variants significantly associated with years of completed education. 'Years of education' was used as a proxy measure for educational achievement and related traits.

Using the GWAS to guide their selection of DNA variants, the researchers measured academic achievement in Mathematics and English at ages 7, 12 and 16 (GCSE), in a sample of 5,825 unrelated individuals from the Twins Early Development Study (TEDS).

Their findings show that differences in students' educational achievement are strongly affected by DNA differences: on average, those with a higher polygenic score obtained a grade between A and B at GCSE at age 16, whereas those with a lower score obtained an entire grade below. In addition, 65 per cent of people in the higher polygenic group went on to do A-levels, whereas only 35 per cent of the lower group did so.

Saskia Selzam, first author from the MRC Social, Genetic & Developmental Psychiatry (SGDP) Centre at King's College London, said: 'We believe that, very soon, polygenic scores will be used to identify individuals who are at greater risk of having learning difficulties.

'Through polygenic scoring, we found that almost 10 per cent of the differences between children's achievement is due to DNA alone. 10 per cent is a long way from 100 per cent but it is a lot better than we usually do in predicting behaviour. For instance, when we think about differences between boys and girls in maths, gender explains around one per cent of the variance. Another example is 'grit', which describes the perseverance of an individual, and only predicts around five per cent of the variance in educational achievement.'

Professor Robert Plomin, senior author of the study, also from the MRC SGDP Centre at King's College London, added: 'We are at a tipping point for predicting individuals' educational strengths and weaknesses from their DNA.

'Polygenic scores could be used to give us information about whether a child may develop learning problems later on, and these details could guide additional support that is tailored to a child's individual needs. We believe personalised support of this nature could help to prevent later developmental difficulties.'

The Twins Early Development Study (TEDS) is supported by a programme grant from the Medical Research Council.


Discussion and Conclusion

The main motivation for this work was the unavailability of any specialized database providing comprehensive information on enzymes involved in different types of biofuel production. Mining the literature revealed that only a limited number of enzymes involved in biofuel production are currently known, from a limited number of genomes. Therefore, as the first step, we constructed the ‘BioFuelDB’ knowledgebase of all enzymes involved in biofuel production reported in the available literature. However, the limited repertoire of these enzymes becomes a limitation when selecting enzyme variants that can perform the desired reaction under industrial conditions, which may not be optimal for a given enzyme, leading to decreased efficiency (Bhardwaj Ajay, Zenone & Chen, 2015). Thus, to explore efficient and novel variants of enzymes involved in different steps of biofuel production, we developed the Benz tool, which identifies novel homologues of known biofuel enzyme sequences in sequenced genomes and metagenomes. The hybrid approach incorporating the HMMER 3.0 and RAPSearch2 programs provides high accuracy and high speed for the prediction of biofuel enzymes. Combining two different methods proves useful: the homology-based method RAPSearch2 enables the identification of close homologues of the known biofuel enzymes, whereas the profile-based method HMMER 3.0 helps to identify remote homologues with low sequence identity.
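
One way such a hybrid could be realized is by taking the union of the two tools' significant hits. The sketch below assumes hmmsearch's --tblout table and a BLAST-style tabular (.m8) file; the E-value cutoffs are illustrative, and note that some RAPSearch2 versions report log10(E-value) rather than a raw E-value in the .m8 column, in which case the cutoff must be adjusted:

```python
def parse_hmmer_tblout(path, evalue_cutoff=1e-5):
    """Collect target sequence names from an hmmsearch --tblout file
    whose full-sequence E-value (column 5) passes the cutoff."""
    hits = set()
    with open(path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            fields = line.split()
            if float(fields[4]) <= evalue_cutoff:
                hits.add(fields[0])
    return hits

def parse_m8(path, evalue_cutoff=1e-5):
    """Collect query names from a BLAST-tabular (.m8) file whose
    E-value (column 11) passes the cutoff."""
    hits = set()
    with open(path) as fh:
        for line in fh:
            fields = line.rstrip().split("\t")
            if float(fields[10]) <= evalue_cutoff:
                hits.add(fields[0])
    return hits

# Union of profile-based hits (remote homologues, via HMMER) and
# similarity-based hits (close homologues, via RAPSearch2); the file
# names are placeholders.
candidates = parse_hmmer_tblout("hmmer.tbl") | parse_m8("rapsearch.m8")
```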

In the present scenario, metagenomic data generated from different environments, comprising sequences from culturable and unculturable microbial genomes, can be mined to expand the repertoire of biofuel enzymes by revealing novel biofuel enzymes as well as functional variants of existing ones. In this study, the identification of 153,754 enzymes from 23 metagenomes indicates the potential of exploiting metagenomic data from several hundreds of metagenomes. Furthermore, metagenomes are so rich in microbial diversity and functional genes that a novel variant of a given enzyme is almost certain to be found (Sharma et al., 2010). Thus, mining novel homologues of biofuel enzymes from the metagenomic data of different environments enables the identification of novel variants that work under a wide range of conditions, thereby improving the enzyme repertoire.




Performance evaluation

A good evaluation standard is crucial for assessing the utility of a model, and different indicators reveal the advantages and disadvantages of a model from different perspectives. Sensitivity (Sn), specificity (Sp), accuracy (Acc) and the Matthews correlation coefficient (MCC) are used to evaluate models in machine learning (Chu et al., 2019; Deng et al., 2020; Gong et al., 2019; Jin et al., 2019; Shan et al., 2019; Su et al., 2019a, 2019b; Wei et al., 2018a, 2018b; Xu et al., 2018a, 2018b, 2018c; Zhang et al., 2019a, 2019b). These metrics are formulated as follows:

$$\mathrm{Sn} = \frac{TP}{TP + FN} \tag{7}$$

$$\mathrm{Sp} = \frac{TN}{TN + FP} \tag{8}$$

$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN} \tag{9}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{10}$$

TP, TN, FP and FN denote true positives, true negatives, false positives and false negatives, respectively; Sn, Sp, Acc and MCC are calculated from these counts. In addition, the AUC (area under the ROC curve) was used to evaluate our model (Cheng & Hu, 2018; Cheng et al., 2018b; Ding, Tang & Guo, 2019a, 2019b; Shen et al., 2019). The hyperparameters used in the subsequent experiments are recorded in Table 2.
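
For reference, these metrics are straightforward to compute from the four confusion-matrix counts. The sketch below simply restates Equations (7)–(10) and is not the authors' code:

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Compute Sn, Sp, Acc and MCC from confusion-matrix counts,
    following Equations (7)-(10) above."""
    sn = tp / (tp + fn)                       # sensitivity, Eq. (7)
    sp = tn / (tn + fp)                       # specificity, Eq. (8)
    acc = (tp + tn) / (tp + tn + fp + fn)     # accuracy, Eq. (9)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # Eq. (10)
    return sn, sp, acc, mcc

print(classification_metrics(tp=80, tn=90, fp=10, fn=20))
# (0.8, 0.9, 0.85, ~0.70)
```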

