
How can I accurately predict proper transgenic protein function/structure from different species?


I am currently building a synthetic system in E. coli and often need to use genes from other, distantly related prokaryotes. Most of my colleagues and I resort to educated trial and error. Besides the obvious considerations (removal of introns from eukaryotic genes, codon optimization, genetic relatedness), is there any predictable or semi-rational way of knowing whether a transgene from a different organism will function properly in a given host?


By looking at the sequence only? This is an unsolved problem.

It's not clear to me exactly what you are looking for, but here are some thoughts…

On the relatively narrow question of whether any gene you plug into a bacterium, yeast, mouse, goat, or other transgenic organism will work: the RNA may not be translated into protein at detectable levels, or at levels high enough to get the desired result. Changing the DNA sequence of the gene for better transcription or translation is a common and often necessary experimental step.

On the broader systems biology level, cofactors and substrates that are necessary for the function of the gene product may have to be engineered into the host organism. That is also an open question.

All this assumes that the gene in question codes for an enzyme that makes a new metabolic product, or a different amount of an existing one, in the host organism, which is a common synthetic biology target. If you are adding signaling pathway components or tinkering with a more specific mechanism in the cell, getting the function you desire could be a lot of work. For example, if you were inserting a new ribosomal component or a gene related to mitosis, it would be hard to predict a priori how it would function in the new cell, or whether it would simply kill it.

Also, Olga Troyanskaya at Princeton CS is working on the related problem of inferring the function of a gene from its sequence by cross-referencing other known data. Electronic functional inference is difficult, especially when we have no complete functional description of the genes of any living organism to start with.


The formation of protein-protein complexes is essential for proteins to perform their physiological functions in the cell. Mutations that prevent the proper formation of the correct complexes can have serious consequences for the associated cellular processes. Since experimental determination of protein-protein binding affinity remains difficult when performed on a large scale, computational methods for predicting the consequences of mutations on binding affinity are highly desirable. We show that a scoring function based on interface structure profiles collected from analogous protein-protein interactions in the PDB is a powerful predictor of protein binding affinity changes upon mutation. As a standalone feature, the difference between the interface profile scores of the mutant and wild-type proteins has an accuracy equivalent to the best all-atom potentials, despite being two orders of magnitude faster to compute once the profile has been constructed. Due to its unique sensitivity in collecting the evolutionary profiles of analogous binding interactions and the high speed of calculation, the interface profile score has additional advantages as a complementary feature to combine with physics-based potentials for improving the accuracy of composite scoring approaches. By incorporating the sequence-derived and residue-level coarse-grained potentials with the interface structure profile score, a composite model was constructed through random forest training, which generates a Pearson correlation coefficient >0.8 between the predicted and observed binding free-energy changes upon mutation. This accuracy is comparable to, and in most cases exceeds, that of the current best methods, but does not require high-resolution full-atomic models of the mutant structures. The binding interface profiling approach should find useful application in human-disease mutation recognition and protein interface design studies.


Summary

The LRK10-like receptor kinases (LRK10L-RLKs) are ubiquitously present in higher plants, but knowledge of their expression and function is still limited. Here, we report expression and functional analysis of TtdLRK10L-1, a typical LRK10L-RLK in durum wheat (Triticum turgidum L. ssp. durum). The introns of TtdLRK10L-1 contained multiple kinds of predicted cis-elements. To investigate the potential effect of these cis-elements on TtdLRK10L-1 expression and function, two types of transgenic wheat lines were prepared, which expressed a GFP-tagged TtdLRK10L-1 protein (TtdLRK10L-1:GFP) from the cDNA or genomic DNA (gDNA) sequence of TtdLRK10L-1 under the native promoter. TtdLRK10L-1:GFP expression was up-regulated by the powdery mildew pathogen Blumeria graminis f. sp. tritici (Bgt) in both types of transgenic plants, with the scale of the elevation being much stronger in the gDNA lines. Both types of transgenic plants exhibited enhanced resistance to Bgt infection relative to wild type control. Notably, the Bgt defence activated in the gDNA lines was significantly stronger than that in the cDNA lines. Further analysis revealed that a putative MYB transcription factor binding site (MYB-BS, CAGTTA) located in TtdLRK10L-1 intron I was critical for the efficient expression and function of TtdLRK10L-1 in Bgt defence. This MYB-BS could also increase the activity of a superpromoter widely used in ectopic gene expression studies in plants. Together, our results deepen the understanding of the expression and functional characteristics of LRK10L-RLKs. TtdLRK10L-1 is likely useful for further dissecting the molecular processes underlying wheat defence against Bgt and for developing Bgt resistant wheat crops.


Data availability

A detailed description of the datasets used in each part of the study is in the corresponding section of Supplementary Methods. Specifically, the Drosophila epigenetics datasets used in this study were generated by the modENCODE consortium, available online (http://data.modencode.org). The mouse epigenetics datasets were generated by the ENCODE and Roadmap Epigenomics consortia, available online (https://www.encodeproject.org). We downloaded the Drosophila STARR-seq data (ref. 28) and the mouse FIREWACh data (ref. 32) from previous studies. Results from transgenic-mouse enhancer assays were generated by the Pennacchio lab at LBNL. Experimental results are summarized in Supplementary Tables 4–9, with the mouse images and additional details available on the VISTA Enhancer Browser (https://enhancer.lbl.gov). The human-cell-line enhancer reporter assay results were generated by the Sutton lab at Yale University. Experimental results are summarized in Supplementary Table 10. More detailed results for each cell line are available in Supplementary Data 1.


Micronutrient Malnutrition Associated Health Problems

Human bodies are complicated, and they need two types of nutrients for proper functioning and survival: micronutrients and macronutrients. The basis of this division is the quantity of a nutrient that the body needs. Micronutrients are required in small quantities and macronutrients in large.

Micronutrients play an important role in the human body and are involved in mental and physical development (White and Broadley, 2005). Many micronutrients work as cofactors in the proper functioning of different enzymes in the human body and thereby regulate vital functions and metabolic processes. Their deficiencies adversely affect more than 2 billion individuals, or one in three persons all over the world (Welch and Graham, 2004). These deficiencies occur when the intake and absorption of minerals and vitamins are too poor to sustain good health and development. According to the United Nations World Health Organization, the main challenge in developing countries is not famine, but poor nutrition, and the absence of nutrients essential for the growth and maintenance of important functions. The reasons for malnutrition are inadequate macronutrient consumption, disease, and other factors such as household food security, health services, maternal and childcare factors, and the environment. The problem of malnutrition is further magnified by an increasing world population, which will reach 8 billion by 2030. The bulk of this rise (93%) will occur in the developing world (Cheema et al., 2008). Micronutrients are not produced in the body and must be derived from the diet. Crucial micronutrients include iodine, iron, zinc, and vitamins (A, B, and C). Although any individual can encounter micronutrient deficiency, the risk is highest in pregnant women and children. This is due not only to low dietary intake but also to the higher physiological demands of pregnancy and childhood development. Almost 38% of pregnant women and 43% of pre-school children suffer from micronutrient deficiencies worldwide. More than 30% of the world's population is affected by hidden hunger. Deficiencies in micronutrients like iodine, iron, zinc, and vitamin A can have a devastating effect on health (Cheema et al., 2008).

In the human body, iron is present in every cell and plays an important role in various cell functions. As a key component of hemoglobin, iron's most important function is the transport of oxygen from the lungs to tissues. Moreover, iron is also part of many enzymes that perform vital cell functions (Jimenez et al., 2015). In developing countries, iron deficiency is the most common condition and is the leading cause of anemia, which especially affects young women and children. According to the World Health Organization, over 2 billion people are affected by anemia worldwide, exhibiting symptoms of tiredness and problems in metabolism. Anemia is the primary clinical classification of iron deficiency in half of the population (Benoist et al., 2008). About 30�% of preschool children and pregnant women suffer from iron deficiencies in developed countries. The number of people suffering from iron deficiencies in developing countries is even higher (Lucca et al., 2006). Anemia is the most prevalent condition caused by iron deficiency; however, iron deficiency may also result in other complications such as fatigue, hair loss, pagophagia, pallor, and restless leg syndrome. Severe or untreated iron deficiency may lead to morbidity and death (Dosman et al., 2012; Miller, 2013).

The human body requires a variety of minerals and vitamins to remain healthy. Zinc is one of the essential minerals that the body needs for various biological processes, such as cell division, cell growth, and immune function (Maret and Sandstead, 2006). The body does not require a large amount of zinc; however, unlike fat-soluble vitamins, zinc is not stored in the body for long periods of time. There is therefore a constant need for a zinc-enriched diet to prevent deficiency (Frassinetti et al., 2006). Worldwide, about 1.1 billion people are affected by zinc deficiency due to poor dietary intake (Kumssa et al., 2015). Zinc deficiency is related to many conditions, including night blindness, weight loss, impaired taste acuity, emotional disturbance, dermatitis, delayed wound healing, poor appetite, alopecia, and poor immunity (Evans, 1986).

Iodine

Iodine is an essential mineral for human health, as it is required for the biosynthesis of the thyroid hormones triiodothyronine (T3) and thyroxine (T4). Globally, more than 2 billion people are affected by an insufficient intake of iodine (Delange, 1994; Zimmermann and Boelaert, 2015). T3 and T4 have a vital role in the regulation of metabolism. Iodine deficiency results in a decreased production of these hormones that eventually causes the enlargement of thyroid tissue, a condition known as goiter. As of 2010, more than 187 million individuals were affected by goiter due to iodine deficiency (Greer et al., 1968; Vos et al., 2012). Moreover, iodine deficiency during pregnancy may result in impaired neurodevelopment of the offspring, whereas during childhood it affects somatic growth and cognitive functions (Zimmermann and Boelaert, 2015).

Vitamin A

Vitamin A, a fat-soluble vitamin, is required for a healthy immune system, growth of epithelial cells, eyesight, reproduction, and regulation of genes (Beyer, 2010). Vitamin A deficiency is most prevalent among preschool-aged children, especially in developing countries. It affects almost 100� million children throughout the world and every year 20,000�,000 preschool children lose their sight. Among pregnant women, its deficiency also causes night blindness, maternal mortality, and other poor consequences in pregnancy and lactation. Vitamin A is essential for the normal functioning of the visual system, epithelial integrity, immunity, reproduction, and the maintenance of cell growth and function. Many developing countries depend on plant foods to meet their vitamin A requirement (Simpson et al., 2011).

Vitamin B

Vitamin B, which is water-soluble, has eight forms: vitamin B1, B2, B3, B5, B6, B8, B9, and B12. All these forms act as co-factors in different metabolic mechanisms, such as carbohydrate metabolism and protein synthesis. Since each form of vitamin B is involved in different mechanisms, they all have different deficiency symptoms. Vitamin B6, for example, is necessary for protein metabolism, a healthy immune system, the formation of neurotransmitters, and the synthesis of enzymes required during the synthesis of other types of vitamins. However, humans are unable to synthesize it and depend on plants. Unfortunately, the rate of vitamin B6 deficiencies is increasing. Some of the symptoms of vitamin B6 deficiency include skin inflammation, a weak immune system, fatigue, and depression (Bryan et al., 2002).

Vitamin C

Vitamin C is also a water-soluble vitamin and is mostly obtained from plant sources. It is very well-known for its role in boosting the immune system, especially against allergies due to its antioxidant properties. It also acts as a co-factor in the synthesis of collagen, cholesterol, and certain amino acids (Perez-Massot et al., 2013; Maggini et al., 2017). It is also involved in energy metabolism. Its deficiency results in joint pains, bone and connective tissue disorders, poor healing, and a weak immune system (Maggini et al., 2017).

Vitamin E

Vitamin E is another fat-soluble vitamin obtained from food sources rich in oil content such as peanuts, sunflower, soybean, and maize. It can be stored in the fat reserves of the body and thus is not required in the daily diet. The recommended dietary intake is 15�.4 mg. It is an antioxidant, helps in the regulation of membrane lipid-packaging, prevents platelet aggregation, helps in eyesight, and is required for the prevention of multiple diseases, such as cancer and cardiovascular diseases. Vitamin E deficiency normally occurs in people with disorders of fat metabolism and can result in muscle weakness, hemolytic anemia, immune system changes, and neurological and ophthalmological disorders (Fitzpatrick et al., 2012; Rizvi et al., 2014).


Methods

Protein Expression and Purification.

cDNA clones for human C/EBPβ and ATF4 were obtained from the Dharmacon mammalian clone collection. The full-length protein-coding regions were cloned into pET expression vectors containing a C-terminal His-tag. Proteins were expressed in competent cells supplying additional rare tRNAs (Rosetta DE3; Novagen) and purified using TALON Metal Affinity Resins (Clontech). For p53, WT (amino acids 1–393) and the C-terminal truncated (∆30, amino acids 1–363) p53 proteins were expressed and purified as previously described (25).

SELEX-Seq and Library Preparation.

EMSAs for the human bZIP proteins and extraction of bound DNA were performed as described previously (13, 21). Purified bound DNA was amplified using a 15-cycle PCR protocol using Phusion polymerase (New England Biolabs) and overhang primers adding the Illumina adapter sites. During each round, a unique Illumina identifier was added in a five-cycle PCR assay, for 20 cycles of PCR in total. The indexed libraries were gel-purified as described previously (13, 21). R0 and R1 indexed experiments were pooled and sequenced using the v2 high-output 75 cycles kit on an Illumina NextSeq Series desktop sequencer. R1 SELEX-seq for MAX protein was performed as described previously (22) and sequenced with Illumina’s HiSeq system at the New York Genome Center.

Hox Protein Purification and EMSA Assays.

EMSAs were performed as described previously (13). Proteins were purified as His-tagged fusions from BL21 cells. The UbxIVa isoform was used, and the HM isoform of Hth was copurified in complex with His-tagged Exd protein. Probe sequences used in the assay can be found in Dataset S2. Images were taken using a Typhoon scanner and processed using ImageJ (NIH).

Competitive EMSA.

Binding reactions were performed with 50 nM UbxIVa and 200 nM HM-Exd protein. 32P-radiolabeled probe (2 nM) was used in each reaction. The concentrations of low- and high-affinity competitor probes ranged from 2 to 781 nM. Normalized data (fraction bound) from the competition EMSAs (Dataset S2) were fit to competitor concentrations with a sigmoidal dose–response curve using nonlinear least squares with the appropriate start conditions (43). The reported IC50 errors are fit-derived uncertainties. The data and dose–response curves were rescaled such that the parameter b = 1 (compare equation 7 of ref. 43).

E3N WT site 2 EMSA.

Probe (6 nM) was used for the binding reactions. HM-Exd was used at a concentration of 500 nM. UbxIVa concentration ranged from 100 to 500 nM for WT and below nonspecific probes to 30–100 nM for the increased affinity probe.

Fly Strains and Crosses.

D. melanogaster strains were maintained under standard laboratory conditions. All enhancer constructs were cloned into the placZattB expression construct with an hsp70 promoter. Transgenic enhancer constructs were created by Rainbow Transgenic Flies, Inc. and were integrated at the attP2 landing site.

Embryo Manipulations.

Embryos were raised at 25 °C and were fixed and stained according to standard protocols. LacZ protein was detected using an anti–β-Gal antibody (1:1,000; Promega). Detection of primary antibodies was done using secondary antibodies labeled with Alexa Fluor dyes (1:500; Invitrogen).

Microscopy.

Each series of experiments to measure protein levels was performed entirely in parallel. Embryo collections, fixations, staining, and image acquisitions were performed side by side in identical conditions. Confocal exposures were identical for each series and were set to not exceed the 255 maximum level. Series of images were acquired over a 1-d time frame to minimize any signal loss or aberration. Confocal images were obtained on a Leica DM5500 Q Microscope with an ACS APO 20×/0.60 IMM CORR lens and Leica Microsystems LAS AP software. Sum projections of confocal stacks were assembled, embryos were scaled to match sizes, background was subtracted using a 50-pixel rolling-ball radius, and plot profiles of fluorescence intensity were analyzed using ImageJ software (https://imagej.nih.gov/ij/).

NRLB Model of R0 Bias.

To parameterize the biases in the initial (R0) library with probe sequences with an L-bp-long variable region, we maximize the following likelihood function: $L = \prod_S f_0(S)^{y_0(S)}$.

Here, the product runs over all $4^L$ possible probes $S$, while $y_0(S)$ denotes the observed count in R0. The predicted frequency of probe $S$ in R0 is given by $f_0(S) = w_0(S)/Z_0$, where $w_0(S) = \exp\left(\sum_\phi \beta_\phi X_\phi(S)\right)$ is the Boltzmann weight and $Z_0 = \sum_S w_0(S)$ is the partition function. Our assumption is that the R0 biases are due to an accumulation of processes (oligomer synthesis, Klenow double-stranding, and PCR amplification) that are each translationally invariant within the probe but depend on local sequence context. Assuming independence between the successive positions along the probe in each process leads naturally to the log-linear (i.e., multiplicative) form of the R0 bias model above; this form is also mathematically convenient, as it enables dynamic programming. The set of model features $\phi$ encompasses all oligomers of length $k$ (or “k-mers”). $X_\phi(S)$ represents the number of times k-mer $\phi$ occurs in sequence $S$, taking into account $k - 1$ flanking bases up- and downstream of the variable region on the forward strand. $Z_0$ is computed using dynamic programming techniques. We fit the model parameters $\beta_\phi$ by maximizing the multinomial likelihood $L(\beta)$ using the limited memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) algorithm (44). The optimal $k$ is selected using cross-validation. Further information is provided in SI Appendix, Supplemental Methods.
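The log-linear R0 bias model described above can be illustrated with a minimal Python sketch. It is not the NRLB implementation: it obtains Z0 by brute-force enumeration of all probes (feasible only for very short variable regions) instead of dynamic programming, and the function names `kmer_counts`, `r0_frequencies`, and `log_likelihood` are hypothetical names chosen for this illustration.

```python
import math
from itertools import product


def kmer_counts(seq, k):
    """Count how often each k-mer occurs in seq (forward strand only)."""
    counts = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        counts[kmer] = counts.get(kmer, 0) + 1
    return counts


def r0_frequencies(beta, L, k, left_flank="", right_flank=""):
    """Return f0(S) = w0(S) / Z0 for every probe S with an L-bp variable region.

    beta maps k-mers to parameters; k-mers missing from beta count as 0.
    Only k - 1 flanking bases on each side can contribute k-mers.
    Z0 is computed by enumerating all 4**L probes (an assumption made
    here for clarity; it does not scale to realistic L).
    """
    lf = left_flank[-(k - 1):] if k > 1 else ""
    rf = right_flank[:k - 1] if k > 1 else ""
    weights = {}
    for bases in product("ACGT", repeat=L):
        S = "".join(bases)
        counts = kmer_counts(lf + S + rf, k)
        log_w = sum(beta.get(kmer, 0.0) * n for kmer, n in counts.items())
        weights[S] = math.exp(log_w)  # Boltzmann weight w0(S)
    Z0 = sum(weights.values())  # partition function
    return {S: w / Z0 for S, w in weights.items()}


def log_likelihood(freqs, observed):
    """Log of prod_S f0(S)**y0(S) for observed counts y0(S)."""
    return sum(y * math.log(freqs[S]) for S, y in observed.items())
```

With all parameters equal to zero the model reduces to a uniform library, which is a convenient sanity check before fitting.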

NRLB Model of R1 Probe Selection.

To infer the protein-DNA recognition model based on the trends seen in the selected (R1) library, we maximize the following likelihood function: $L = \prod_S f_1(S)^{y_1(S)}$.

Again, the product runs over all $4^L$ possible probes $S$, while $y_1(S)$ denotes the observed count in R1 (or a later round, if necessary). The predicted frequency of probe $S$ in R1 is given by $f_1(S) = w_1(S)/Z_1$, where $w_1(S) = f_0(S)\left(\sum_m \sum_v e^{\Delta\Delta G(S_v)/RT} + e^{\beta_{ns}}\right)$; here, the additional sum is over binding modes $m$, and $Z_1 = \sum_S w_1(S)$ is the partition function. The views $v$ now include both the forward and reverse orientation and can extend into the up- and downstream regions flanking the variable region. While the combined length of the variable region and relevant flanking regions is unlimited in principle, our current code uses an efficient binary representation of a DNA sequence that limits it to 32 bp. As with NRLB’s R0 bias model, the partition function $Z_1$ is evaluated using dynamic programming techniques. We fit the model parameters by maximizing the multinomial likelihood $L(\beta)$ using L-BFGS (44). Due to the redundant parameterization of the model, the likelihood is invariant to changes in the parameters in certain directions (the “null space”). Different model fits can be compared by projecting out components in this null space. Further information is provided in SI Appendix, Supplemental Methods.
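As a rough illustration of the R1 selection model, the sketch below computes the weight of a probe for a single binding mode with mononucleotide features. It makes several simplifying assumptions that are not in the text: views are confined to the variable region, the normalization runs over an explicit list of probes and not over all possible probes, and the exponent uses the sign convention as written above. The names `r1_weight` and `r1_frequencies` are hypothetical.

```python
import math

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def view_score(window, ddg_over_rt):
    """Sum of per-position ddG/RT terms for one view (mononucleotide model)."""
    return sum(ddg_over_rt[i][base] for i, base in enumerate(window))


def r1_weight(S, f0, ddg_over_rt, beta_ns):
    """w1(S) = f0(S) * (sum over views of exp(ddG/RT) + exp(beta_ns)).

    ddg_over_rt is a list of dicts, one per footprint position, mapping
    each base to its ddG/RT contribution. Both orientations are scanned.
    """
    K = len(ddg_over_rt)
    total = 0.0
    for strand in (S, reverse_complement(S)):
        for i in range(len(strand) - K + 1):
            total += math.exp(view_score(strand[i:i + K], ddg_over_rt))
    return f0 * (total + math.exp(beta_ns))


def r1_frequencies(probes_f0, ddg_over_rt, beta_ns):
    """Normalize the weights over the supplied probes (a stand-in for Z1)."""
    w = {S: r1_weight(S, f0, ddg_over_rt, beta_ns) for S, f0 in probes_f0.items()}
    Z1 = sum(w.values())
    return {S: x / Z1 for S, x in w.items()}
```

Because the R0 frequency enters as a multiplicative prefactor, a probe that was rare before selection stays down-weighted even if it contains a strong site.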

NRLB Model Construction.

Various settings were used to construct the NRLB models used in this study; a detailed summary can be found in Dataset S1. All individual NRLB model fits are unseeded and start from all parameters set equal to zero. Further optimization is achieved by shifting the free energy parameters of converged models at all positions by ±1 bp and refitting. Optionally, dinucleotide parameters, initially set to zero, are introduced for the best mononucleotide model fit. When multiple binding modes are used, only a single mode is learned initially and additional modes are added sequentially. Model footprints were increased until the additional parameters were uninformative. In general, models with the highest likelihood were chosen.

Hox data.

For Hox monomers, 13-bp footprints were considered to account for four additional flanking bases on either side of the 5-bp “core” region from Slattery et al. (13). For Exd-Hox heterodimers, 18-bp footprints were considered for the Exd-Hox modes to account for three additional flanking bases on either side of the 12-bp core region defined by Slattery et al. (13). Multimode models were manually selected that contained the largest number of interpretable modes representing Exd monomer, Hox monomer, and Exd-Hox heterodimer binding with the smallest footprint size.

Max data.

Fourteen base pairs was chosen as the footprint size for fits to HT-SELEX and SELEX-seq data, as it appeared to capture all of the specificity. However, for fits to SMiLE-seq data, models with 8-bp footprints have the best likelihood, as the 32-bp limitation of our code prevents fitting more than 1 bp into the flanking regions.

ATF4 and C/EBPβ data.

Fourteen base pairs was chosen for the model footprint size as it appeared to capture most of the specificity. Multimode fits were used on the C/EBPβ dataset to remove additional sequence bias.

P53 data.

Twenty-four base pairs was chosen as the footprint size as it appeared to capture all of the specificity. Fits to the WT p53 dataset required three binding modes to fit to the data and produce a viable motif.

NRLB Model Construction for HT-SELEX Data.

NRLB models were built for 30 of the 35 HT-SELEX datasets used in the DeepBind study (30) (European Nucleotide Archive identifiers ERP001824 and ERP001826). Of the five that were excluded, three did not have R0 data (BHLHE41, CTCF, and PRDM1), while two others used variable regions longer than 32 bp (ELK4 and HNF4A), the limit imposed by our current implementation of NRLB. R0 bias models were built for each unique probe design; we used 2-mer models as these had robust cross-validation performance for most TFs (R0 library size for HT-SELEX is vastly smaller than SELEX-seq libraries). We built selection models with mononucleotide features and nonspecific binding. For all TFs, models were constructed for footprint sizes from 8 to 15 bp and a maximum overlap with the constant flanks of 0–5 bp (a total of 48 hyperparameter combinations). Longer footprints were tested if there appeared to be additional specificity outside the 15-mer (EBF1: 8–16 bp, NFE2: 8–17 bp, PAX5: 8–19 bp, and ZNF143: 8–20 bp). In probes with a 30-bp variable region, the overlap with the flanking regions was restricted to 1 bp. Reverse complement symmetry was enforced only for factors from the following TF families: bHLH (45), bZIP (28), and AP-2 (46) (Dataset S3). Sequence bias frequently produced suboptimal models (compare SI Appendix, Fig. S11A), and it was therefore necessary to employ multiple binding modes; all modes shared the same footprint length and symmetry status. In some cases, contaminants and/or poor enrichment forced the use of later round data (compare SI Appendix, Fig. S11B); in these cases, later rounds were treated in the same way as R1 data. Unlike other factors, Max was fit using criteria designed to align its model with that derived from SELEX-seq data (Dataset S1).

Selection of HT-SELEX Models.

As noted by others (30), HT-SELEX datasets (23) can be subject to contamination and sequence bias (compare SI Appendix, Fig. S11). Consequently, simply using likelihood as the criterion for selecting the best R1 single-mode model from among all footprint and flank hyperparameter combinations discussed above often yields motifs that are incorrect. To automate the selection of an appropriate model for each TF in a way that does not consider classification performance on ChIP-seq data, we settled on the following procedure. First, we defined a “viable” model as one that satisfied these criteria: (i) the highest-affinity sequence matches the relevant consensus sequence found in literature up to a 1-bp mismatch (compare SI Appendix, Fig. S11B and Dataset S3); (ii) the model contains at least three consecutive positions of considerable specificity ([ΔΔGmax − ΔΔGmin]/RT > 3 for mononucleotide features) (compare SI Appendix, Fig. S11C); and (iii) if multiple binding modes were fit simultaneously, only the primary mode (the one with the highest relative affinity) is used. Next, starting with R1 data for a given TF, single-mode models for each footprint size and flank hyperparameter combination that were deemed viable were ranked by likelihood. If no viable models were found, the number of binding modes was incremented by one and the process was repeated. If no viable models were found using three binding modes, the enrichment round was incremented by one and the number of binding modes was reset to one. The first viable motif thus selected for each TF was used in all subsequent analyses.
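The selection procedure above is essentially a nested search, and a minimal Python sketch may make the order of operations easier to follow. The data layout (a list of dicts with the keys `round`, `modes`, `viable`, and `log_likelihood`) and all function names are assumptions made for this illustration. The consensus check is simplified to a Hamming distance between equal-length sequences, which may differ from how the authors aligned motifs to literature consensus sequences.

```python
def has_specific_run(ddg_over_rt, min_run=3, threshold=3.0):
    """Criterion (ii): at least min_run consecutive positions whose
    (max - min) ddG/RT across the four bases exceeds threshold."""
    run = 0
    for position in ddg_over_rt:
        spread = max(position.values()) - min(position.values())
        run = run + 1 if spread > threshold else 0
        if run >= min_run:
            return True
    return False


def within_one_mismatch(best_seq, consensus):
    """Criterion (i), simplified: equal-length sequences differing at <= 1 base."""
    if len(best_seq) != len(consensus):
        return False
    return sum(a != b for a, b in zip(best_seq, consensus)) <= 1


def select_model(fits, max_modes=3):
    """Return the first viable fit, searching rounds in increasing order and,
    within each round, 1..max_modes binding modes; ties within a
    (round, modes) cell are broken by highest likelihood."""
    rounds = sorted({fit["round"] for fit in fits})
    for rnd in rounds:
        for n_modes in range(1, max_modes + 1):
            viable = [
                fit for fit in fits
                if fit["round"] == rnd and fit["modes"] == n_modes and fit["viable"]
            ]
            if viable:
                return max(viable, key=lambda fit: fit["log_likelihood"])
    return None  # no viable model in any round
```

Note that a viable two-mode model from R1 is preferred over a higher-likelihood single-mode model from a later round, because the search exhausts all mode counts before moving to the next round.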

Visualization of Dinucleotide Models.

Models with dinucleotide features were summarized in terms of the model-predicted relative affinity of all sequences a single point mutation away from the highest-affinity sequence and visualized as an energy logo (19), which was created using the LogoGenerator tool from the REDUCE Suite (reducesuite.bussemakerlab.org). The highest-affinity sequence was determined using a tailor-made dynamic programming algorithm.

Observed and Predicted Sequencing Rate Comparisons.

These comparisons assume that the observed SELEX read counts follow a Poisson distribution whose rate parameter λ (normalized for library size) is determined by the model in question. As such, for a given probe, the predicted sequencing rate and variance are both λ. In practice, there are many more possible SELEX probes than reads, resulting in most probes never being observed (or only once), making it impossible to compute the observed sequencing rate and variance for each probe. To compare observed and predicted sequencing rates in practice, we aggregate probes into bins by their model-predicted sequencing rates λ. Computing the observed sequencing rate then requires knowledge of the number of probes and their total sequenced count within each bin. Depending on the dataset and model, slight variations in the computation of the observed sequencing rate are required. Once computed, comparing observed and predicted sequencing rates is trivial.

R0 bias models.

Predicted sequencing rates were explicitly computed for the entire universe of $4^{16}$ unique probes for both the NRLB R0 bias model and the Markov model method of Slattery et al. (13). To predict these rates, the Java code underlying the R/Bioconductor package SELEX version 1.6.0 was used to build and run a fifth-order Markov model on R0 SELEX-seq data from Slattery et al. (13). The existing NRLB Java framework was used to do the same. Further analysis computed the number of probes observed twice (n2), once (n1), or not at all (n0) in each bin and compared the ratios n1/n0 and n2/n0 with expectation. For Poisson random variables, the expected values of these ratios are λ and $\lambda^2/2$, respectively.
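The expected ratios follow directly from the Poisson probability mass function, since P(1)/P(0) = λ and P(2)/P(0) = λ²/2. The short Python sketch below checks this and shows how the observed ratios could be tallied for one bin; the function names and the example counts are illustrative assumptions, not data from the study.

```python
import math


def poisson_pmf(n, lam):
    """P(N = n) for a Poisson variable with rate lam."""
    return math.exp(-lam) * lam ** n / math.factorial(n)


def expected_ratios(lam):
    """Expected n1/n0 and n2/n0 for probes that share the rate lam."""
    p0, p1, p2 = (poisson_pmf(n, lam) for n in range(3))
    return p1 / p0, p2 / p0


def observed_ratios(counts):
    """n1/n0 and n2/n0 from per-probe read counts within one rate bin.

    counts must include the zeros, i.e. one entry per probe in the bin.
    """
    n0 = sum(1 for c in counts if c == 0)
    n1 = sum(1 for c in counts if c == 1)
    n2 = sum(1 for c in counts if c == 2)
    return n1 / n0, n2 / n0
```

For example, `expected_ratios(0.5)` gives ratios of 0.5 and 0.125, which can be compared with `observed_ratios` applied to the counts in the bin whose predicted rate is 0.5.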

HT-SELEX R1 comparisons.

In general, the exact enumeration technique used for the R0 analysis described above is not feasible for most widely used SELEX library designs. To avoid the need to explicitly evaluate the sequencing rates of all probes, an adaptive version of the Wang–Landau algorithm (47) was used to compute an approximate density of states (DOS) for NRLB and DeepBind algorithms trained on HT-SELEX data. This allowed us to achieve unbiased estimates of the number of probes in each sequencing rate bin. As inputs, the Wang–Landau algorithm used the raw DeepBind probe scores, the probe binding affinity as estimated only by the raw NRLB binding model, or the overall NRLB probe score f1(S) (which includes the R0 bias model).

Prediction of R1 Oligomer Counts.

The R/Bioconductor package SELEX version 1.6.0 (bioconductor.org/packages/SELEX) was used to determine the observed R1 count for all 10mers. For each 10mer occurring at least 100 times, a predicted count was computed by summing the predicted frequency of all probes containing it at any offset and then multiplying by the total number of reads in R1. Observed and predicted count values were compared using a linear fit.

Scoring Genomic Sequences with NRLB.

For an NRLB model with footprint K and a target sequence of length L, relative affinity scores were computed at all 2(L − K + 1) views in the forward and reverse directions. If included, the nonspecific binding term inferred on SELEX-seq data was rescaled by explicitly considering the effective length of the DNA ligands in each technology, without adjustable parameters. Total affinity for the target sequence is the sum of all affinity contributions. ΔΔG/RT for the target sequence is the logarithm of this sum.
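A minimal sketch of this scoring scheme, assuming a single-mode mononucleotide model stored as one dict of ΔΔG/RT values per footprint position, is given below. The function name `score_sequence` and the `ns_affinity` argument (a nonspecific term assumed to be already rescaled) are hypothetical, and the exponent follows the sign convention implied by the text, in which affinity is the exponential of ΔΔG/RT.

```python
import math

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def score_sequence(target, ddg_over_rt, ns_affinity=0.0):
    """Score a target of length L with a footprint-K model.

    Returns (number of views, total relative affinity, log of the total).
    Views are all K-bp windows on the forward strand and on its reverse
    complement, giving 2 * (L - K + 1) views in total.
    """
    K = len(ddg_over_rt)
    affinities = []
    for strand in (target, target.translate(COMPLEMENT)[::-1]):
        for i in range(len(strand) - K + 1):
            window = strand[i:i + K]
            energy = sum(ddg_over_rt[j][base] for j, base in enumerate(window))
            affinities.append(math.exp(energy))  # affinity of this view
    total = sum(affinities) + ns_affinity
    return len(affinities), total, math.log(total)
```

Summing affinities before taking the logarithm means that several moderate sites in one target can together outscore a single strong site.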

Exd-Hox analysis.

Dinucleotide NRLB models (18-bp, single-mode) for Exd-UbxIVa and Exd-Scr were truncated to the 12-bp central core region (13), and then used to score all possible 12-mers (compare SI Appendix, Fig. S5).

D. melanogaster enhancer element analysis.

All relative affinity predictions were rescaled by the highest-affinity sequence in the D. melanogaster genome as predicted by the same model (compare Figs. 5 A and B and 6A and SI Appendix, Fig. S17A).

Scoring Sequences with DeepBind.

DNA sequences were scored using the v0.11 scoring tool available at tools.genes.toronto.edu/deepbind/download.html and the interactive database located at tools.genes.toronto.edu/deepbind/. The raw score was used in further analyses, as this value corresponds to ΔΔG/RT. To construct the histograms required for the analysis in SI Appendix, Figs. S13 and S15, we modified the C code of the DeepBind scoring tool to implement the Wang–Landau algorithm (47).

Comparison with MITOMI Binding Free Energy.

MITOMI ligand sequences were scored using NRLB and DeepBind models to obtain predicted ΔΔG/RT values as described above, which were then compared with MITOMI observed ΔΔG/RT values using a linear fit. Scores were shifted such that the target sequence with the highest score was set to ΔΔG/RT = 0.

ChIP-Seq Peak Classification.

NRLB and DeepBind models for 30 TFs in the HT-SELEX dataset (Dataset S3) were compared using AUC metrics. For NRLB, only the primary binding mode was used to score sequences, even if multiple binding modes had been used during the fit to HT-SELEX data. Positive and negative sets were constructed in three different ways: (i) the “DeepBind method” used the same 500 positive and 500 shuffled negative sequences derived from ENCODE ChIP-seq datasets as Alipanahi et al. (30) for each TF; (ii) the “ENCODE Top 500 method” used the same ENCODE ChIP-seq datasets as Alipanahi et al. (30) but restricted the analysis to the 500 highest peaks; and (iii) the “ENCODE Bottom 500 method” used the 500 lowest peaks among those with a significant quality value (qValue). For the last two methods, positive sequences were defined as a 101-bp window centered around the midpoint of each peak following Bell et al. (48); for each positive sequence, two corresponding negative sequences were defined as 101-bp windows centered exactly one peak’s width upstream or downstream of the peak midpoint. Since this yields 500 positive and 1,000 negative sequences, we use the area under the precision-recall curve to quantify classification performance.
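One common estimator of the area under the precision-recall curve is average precision, sketched below in plain Python. Whether the authors used this estimator or an interpolated variant is not stated in the text, so treat the choice as an assumption; the function name and toy data are hypothetical, and tied scores are not handled specially.

```python
def average_precision(scores, labels):
    """Average precision for binary labels (1 = positive, 0 = negative).

    Sequences are ranked by decreasing score; the precision at the rank of
    each positive is averaged over all positives.
    """
    n_positive = sum(labels)
    if n_positive == 0:
        raise ValueError("at least one positive sequence is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / n_positive


# Hypothetical scores for two positive and two negative windows.
print(average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
```

With one positive for every two negatives, a classifier that ranks at random has an expected precision near 1/3, which is the baseline to keep in mind when reading such values.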

Quantitative Validation of HT-SELEX Models.

Quantitative comparisons for 27 of the 30 NRLB and DeepBind models used in the ChIP-seq classification task were run on R1 HT-SELEX data from the more deeply sequenced technical replicate (24) of the original dataset (23) (European Nucleotide Archive identifier PRJEB14744). The three models that were excluded did not have R1 data in this newly sequenced replicate (E2F1, ELF1, and SP1). For the comparisons, it was unknown how much of the flanking regions the DeepBind model was trained on; to account for this, all probe scores were computed including 10-bp flanking regions. In the analyses below, either the raw DeepBind probe scores or the log of the total probe binding affinity as predicted by the reduced NRLB binding model (no R0 bias) was used.

Density plots.

The predicted DOS was computed using the Wang–Landau algorithm (discussed above). The observed R0 and R1 histograms were computed by binning the observed reads using the score of the respective model.

R0/R1 enrichment.

The binned counts from the density plots were used to compute the log ratio of the R1 and R0 counts (y axis enrichment) and compared with the expected enrichment (x axis computed model score). As there is an overall scaling factor between the model scores and the observed enrichment that is unknown, the computed enrichment values are rescaled so as to minimize the root-mean-square deviation between observed and predicted enrichment.

Observed/expected sequencing rate.

The binned counts and the predicted DOS from the density plots were used to compute the observed/expected sequencing rate following the method described above. For the final, optimal (full) NRLB model comparison, the NRLB model with the R0 bias term was used to compute a probe score only over the variable region and the flank length the model was trained on.

Identification of Validated Hox Binding Sites.

We curated 96 functionally validated Hox and Exd-Hox binding sites in 21 different enhancer elements in D. melanogaster based on available reporter data from 31 studies (36) (Dataset S4). The genomic context of a binding site was determined based on the most minimal enhancer element used in the reporter assay, and genomic coordinates were standardized to release 5 (dm3) of the D. melanogaster genome using DNA sequence information reported in the studies. Partial matches to the entire validated binding site sequence were used to identify binding site offsets within the enhancer elements. To account for variation in the position of the 12-bp core binding region within NRLB models, and for experimental error in identifying the true location of the binding site within the enhancer, any model-predicted site overlapping a region extending K − 1 nucleotides up- and downstream of an experimentally validated binding site was considered a match, where K denotes the footprint of the model. Any model-predicted site outside of this extended region was considered a false-positive result.

Enhancer elements were scored using mononucleotide and dinucleotide NRLB models as described above. By default, the appropriate Hox monomer model (SI Appendix, Fig. S8) was used unless the study stated that both Exd and Hox regulated the target; if so, the appropriate Exd-Hox heterodimer model among the multiple binding modes in the model was used (SI Appendix, Fig. S8 and Dataset S4). To account for variations in local protein concentration, all affinities within an enhancer element were normalized to the highest-affinity sequence in the particular enhancer (resulting in the normalized affinities varying between 0 and 1 for all sites in all enhancers). Potential binding windows in the element were considered functionally important if their normalized affinity was at or above a threshold T. Precision and recall were computed for all enhancer elements for all values of T between 0 and 1. A similar analysis was performed to assess the performance of sequence gazing methods. The consensus TTWATK was used for Hox sites, and TGAYNNAY was used for Exd-Hox sites; the former was derived by us from bacterial one-hybrid results (37), and the latter was adopted from the method of Slattery et al. (13). Sites were deemed functional if they matched the consensus. In the absence of a thresholding parameter, only a single precision and recall pair was computed.
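The thresholding and matching rules described in this section can be sketched for a single enhancer as below. This is a simplified reading rather than the authors' implementation: the function names and toy numbers are hypothetical, window positions are 0-based start offsets, validated sites are inclusive (start, end) pairs, and a called window counts as a match if it overlaps a validated site extended by K − 1 nucleotides on each side.

```python
def normalize_to_max(affinities):
    """Rescale affinities so the strongest window in the enhancer equals 1."""
    top = max(affinities)
    return [a / top for a in affinities]


def precision_recall(starts, affinities, validated_sites, k, threshold):
    """Precision and recall of windows whose normalized affinity is >= threshold."""
    normalized = normalize_to_max(affinities)
    called = [s for s, a in zip(starts, normalized) if a >= threshold]

    def matches(start, site):
        site_start, site_end = site
        # Window [start, start + k - 1] versus the site extended by k - 1 on each side.
        return start <= site_end + (k - 1) and start + k - 1 >= site_start - (k - 1)

    true_positives = sum(
        any(matches(s, site) for site in validated_sites) for s in called
    )
    recovered = sum(
        any(matches(s, site) for s in called) for site in validated_sites
    )
    precision = true_positives / len(called) if called else float("nan")
    recall = recovered / len(validated_sites)
    return precision, recall


# Hypothetical enhancer with four candidate windows and one validated site.
print(precision_recall([0, 10, 20, 30], [2.0, 1.0, 4.0, 0.4], [(21, 24)], k=4, threshold=0.5))
```

Sweeping `threshold` from 0 to 1 and repeating the calculation gives the precision-recall pairs described in the text.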

Reporter Assay Analysis.

The significance of potential low-affinity sites was established using Mann–Whitney U tests on the recorded intensities (Dataset S5). The cumulative affinity of the various E3N and 7H sequences used in the reporter assays was computed by summing relative affinity over all views on the E3N and 7H genomic regions as scored by the single 18-bp heterodimer mode from a multiple binding mode fit for Exd-UbxIVa (SI Appendix, Fig. S8). The logarithm base 10 of the E3N reporter intensity values was fit to the rescaled total affinities using linear regression. The E3N and 7H reporter intensity values were also fit to a logistic model of expression saturation using nonlinear least squares; parameter values were checked for significance using an F-test.

Data and Software Availability.

SELEX data.

The SELEX-seq data for human Max, ATF4, C/EBPβ, ATF4, and C/EBPβ full-length WT p53 and ∆30 p53 generated as part of this study will be made available in Gene Expression Omnibus (GEO).

NRLB models.

The NRLB models for more than 50 TFs described here (SI Appendix, Figs. S7, S11, and S12), along with tools for scoring any sequence or genome of interest using an NRLB model, will be made available as an R package via Bioconductor.

NRLB software.

NRLB was implemented entirely in Java. The Java source code and associated R functions for visualizing models and scoring sequences will be made available via GitHub. As designed, NRLB can be run on any machine that has Java installed, but will run slowly unless multithreading is enabled. Runtimes are also highly dependent on the number of reads and the complexity of the model: a single-mode, nucleotide-only model for MAX fit to HT-SELEX data (∼63 thousand reads) can take seconds to fit and uses roughly 2 GB of RAM on a standard MacBook, while a three-mode dinucleotide model for Exd-Pb on SELEX-seq data (∼19 million reads) can take more than 10 h on a server with Dual Xeon Processors and 24 GB of RAM.


3 Results

3.1 Validation on publicly available datasets

AgMata was validated on the two datasets described above.

Table 1 shows the performance of AgMata and state-of-the-art tools on the amyl33 dataset. AgMata has an area under the ROC curve of 0.707 on this dataset. While AgMata provides continuous scores for each residue in the target protein, we applied a threshold to turn it into a binary classifier so that it could be compared fairly with the state-of-the-art tools. We followed the approach described in Walsh et al. (2014), setting the threshold so that the specificity on the amyl33 dataset is as close as possible to 85%.

| Method | Sen | Spe | BAC | MCC |
| --- | --- | --- | --- | --- |
| Aggrescan | 35.37 | 79.26 | 57.32 | 0.13 |
| FoldAmyloid | 20.71 | 86.97 | 53.84 | 0.08 |
| Tango | 13.67 | **95.57** | 54.62 | 0.14 |
| AMYLPRED2* | 39.27 | 84.48 | 61.875 | 0.22 |
| MetAmyl (high specificity)* | 39.05 | 83.14 | 61.10 | 0.19 |
| MetAmyl (global accuracy)* | **52.46** | 70.73 | 61.60 | 0.17 |
| FishAmyloid* | 13.73 | 93.68 | 53.71 | 0.10 |
| PASTA 2 (90 specificity) | 30.24 | 90.00 | 60.12 | 0.22 |
| PASTA 2 (85 specificity) | 40.87 | 84.95 | 62.91 | 0.24 |
| AgMata | 42.86 | 84.44 | **63.65** | **0.25** |

Note: Methods marked with * are supervised and are expected to perform better. Sen is the sensitivity, Spe the specificity, BAC the balanced accuracy and MCC the Matthews correlation coefficient. The highest scores of every column are reported in bold.

The balanced accuracy is the average of the sensitivity and specificity scores and is not affected by class imbalance in the dataset. The Matthews correlation coefficient, which best summarizes the confusion matrix (Powers, 2011), shows a 92% increase in performance of AgMata with respect to Tango, one of the most used unsupervised methods. AgMata even improves the quality of the predictions by 13 to 250% with respect to supervised machine learning methods (marked with a star), which have been directly trained on aggregation data. AgMata performs essentially on par with PASTA2 on this dataset.
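For readers who want to recompute these summary statistics, the sketch below derives sensitivity, specificity, balanced accuracy and the Matthews correlation coefficient from a confusion matrix. The function name and the example counts are hypothetical; only the formulas are standard.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, balanced accuracy and MCC from raw counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    balanced_accuracy = (sensitivity + specificity) / 2
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "sen": sensitivity,
        "spe": specificity,
        "bac": balanced_accuracy,
        "mcc": mcc,
    }


# Hypothetical per-residue counts with far more negatives than positives.
print(confusion_metrics(tp=40, fp=150, tn=850, fn=60))
```

As a consistency check on Table 1, averaging the AgMata sensitivity and specificity, (42.86 + 84.44) / 2, gives the reported balanced accuracy of 63.65.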

In addition, this increase in performance is obtained without using any structural or evolutionary information. These types of information are not always available, and this improves the general applicability of AgMata (Orlando et al., 2016). Unlike most of the available tools, AgMata also takes into consideration the full sequence at once. This is essential in the aggregation problem, since the mechanism is often driven by the interaction and cooperation of residues in distant regions of the proteins.

The improvement in performance is confirmed by the AmyProFiltered dataset (Table 2). In this case, we could only compare with PASTA2 and Tango, as we were not able to obtain or run large scale predictions with the other predictors.

Performances of the state-of-the-art methods for aggregation prediction on the AmyProFiltered dataset

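| Method | Sen | Spe | Acc | Pre | MCC | AUC |
| --- | --- | --- | --- | --- | --- | --- |
| PASTA 2 | 32.81 | 84.99 | **76.03** | 31.19 | 0.1746 | 0.604 |
| Tango | 17.56 | **85.0** | 73.85 | 18.8 | 0.0262 | 0.512 |
| AgMata | **39.4** | 82.03 | 74.71 | **31.25** | **0.196** | **0.641** |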

Note: Sensitivity (Sen), specificity (Spe), accuracy (Acc), precision (Pre), Matthews correlation coefficient (MCC) and area under the ROC curve (AUC) are indicated. The thresholds for the PASTA2 and the Tango predictions have been selected in order to obtain a specificity as close to 85% as possible. The threshold used for AgMata is the one selected for the amyl33 dataset. The highest scores of every column are reported in bold.

Supplementary Figure S3 shows the time required for the prediction of the aggregation propensity of a protein as a function of its length.

3.2 Insights about ataxin-3 aggregation

Human ataxin-3 is a protein involved in Machado–Joseph disease, a neurodegenerative affliction belonging to the restricted group of polyglutamine expansion disorders. A common feature of these inherited diseases is the high tendency of the protein to self-assemble and form aggregates and amyloid fibers, in vitro and in the cellular milieu. The protein is composed of different domains: the N-terminal Josephin domain (180 residues) is followed by a flexible tail containing two ubiquitin-interacting motifs, the expandable polyQ stretch (Masino et al., 2003; Scarff et al., 2013), and typically a third ubiquitin-interacting motif. The size of the polyQ segment is variable in the normal population, but when extended beyond a specific threshold (>55Q), it becomes pathogenic. While the complete mechanism of aggregation is still not fully understood, the most supported hypothesis suggests that self-assembly of the Josephin domain mediates the initial stages of aggregation of normal and polyglutamine-expanded ataxin-3 (Ellisdon et al., 2007; Gales et al., 2005; Masino et al., 2004). This initial aggregation step is concomitant with the conversion into beta-rich structures that generate a surface suitable for the formation of long homopolymers with amyloid-like characteristics (Silva et al., 2018). The expansion of the polyglutamine fragment accelerates ataxin-3 self-assembly by increasing the mobility of a central helical region within the Josephin domain (Lupton et al., 2015; Scarff et al., 2015). The disease-related protein forms mature and SDS-resistant fibrils in a second aggregation step strictly dependent on the expanded polyQ tract (Ellisdon et al., 2007; Lupton et al., 2015; Scarff et al., 2015).

Figure 1 shows the predicted beta-aggregation propensity of this protein. The regions with high propensity are situated in the N-terminal Josephin domain, with the highest scoring residues visualized in the structure of this domain, which was not used in the prediction.

The relation between ataxin-3 predicted aggregation propensity and the structure of the N-terminal Josephin domain. The highlighted amino acids correspond to the highest scoring residues in the three predicted peaks, with three amino acids shown for the broader second peak

Despite not using any structural or evolutionary information, AgMata identifies amino acid residues with key roles that are distant in the sequence but share the same structural environment. Interestingly, they form part of a hydrophobic patch that partially overlaps with the ubiquitin-binding surface, which contains residues previously shown to play a role in aggregation (Masino et al., 2004). In particular, the peaks around I77 and L93 correlate well with previous reports showing the involvement of the region 73–96 in aggregation and in the formation of the fibril core (Lupton et al., 2015; Scarff et al., 2015). L93 and I77 have also been studied by other authors in mutagenesis experiments: mutations L93A, I77A (Lupton et al., 2015) and the double mutation I77K/Q78K (Masino et al., 2011) are able to reduce the pathological aggregation of human ataxin-3. In addition, the peptides corresponding to region 153–167 are able to form amyloid fibrils in isolation (Lupton et al., 2015). Other mutations that have been experimentally investigated are W87K and S81A (Lupton et al., 2015; Masino et al., 2011; Saunders et al., 2011). Figure 2 shows the differences in the predicted aggregation profile between the wild-type and some of these mutations that decrease aggregation.

AgMata predictions for mutations of ataxin-3 that have been experimentally verified as decreasing aggregation. The red line represents the aggregation propensity of the wild-type protein, while the blue represents the mutant. Plots (A), (E) and (F) report mutations on the isolated Josephin domain, the others (plots B, C and D) are on the full length ataxin-3 with a 64 residue Q-tail. Ataxin64Q and the Josephin domain have been aligned in the plots to allow a better comparison of the peaks. (Color version of this figure is available at Bioinformatics online.)

AgMata predicts reduced beta-aggregation for all of them. If we use the sum of all residues as a total score, then aggregation propensity is reduced by 65, 29, 67, 32, 8 and 4% for the mutations I77K/Q78K, W87K, I77A, S81A, L93A and S81A/R103G, respectively. It is also interesting that especially the first and the third peaks (the ones closest to I77 and F163) are always strongly reduced in intensity, except for the L93A and S81A/R103G cases. L93A does show a marked decrease in the second peak, which almost completely disappears. Supplementary Figures S1 and S2 report the effect of the mutations in accordance with TANGO and PASTA2 predictions.
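The total-score comparison described above amounts to summing the per-residue propensities of each profile and expressing the difference relative to the wild type. The sketch below uses made-up profile values and a hypothetical function name; it does not call AgMata and does not reproduce the reported percentages.

```python
def percent_reduction(wild_type_profile, mutant_profile):
    """Percent decrease of the summed per-residue score, relative to wild type."""
    wild_type_total = sum(wild_type_profile)
    mutant_total = sum(mutant_profile)
    return 100.0 * (wild_type_total - mutant_total) / wild_type_total


# Hypothetical four-residue profiles; real profiles span the whole protein.
wild_type = [0.2, 0.8, 0.5, 0.5]
mutant = [0.1, 0.4, 0.25, 0.25]
print(percent_reduction(wild_type, mutant))
```

A negative result would indicate that the mutant has a higher summed propensity than the wild type.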

Supplementary Table S1 reports all the known experimentally investigated mutations retrieved in literature and the predicted overall variation in the aggregation propensity.

While most of the mutations reported to increase fibril formation are predicted to have little effect on the protein behavior, G159A significantly increases the aggregation propensity. This is also the only reported mutation to increase the SDS solubility midpoint (Supplementary Table S1). There are, however, mispredicted mutations that highlight the complexity of the molecular events leading to aggregation. For example, L169H increases stabilization but AgMata assesses it as decreasing aggregation. This type of effect does not involve a change in the interaction potentials between different parts of the region, and is therefore not taken into consideration by the model. This residue is close to Leu 89, which is important for aggregation. The L169H mutant could therefore destabilize interactions and so lead to protein misfolding or increased exposure of the aggregating region. The V79A mutant is next to a flexible loop connecting to a helix hairpin, and is also predicted to decrease aggregation although it in fact increases it. This could be due to interactions involving the full ataxin-3 protein, as this helix hairpin is suspected to be involved in the aggregation process (Sanfelice et al., 2014), and the V79A mutation could change its interactions. These examples highlight the complexity of the molecular processes involved in aggregation, and the importance of taking into account as many factors as possible when predicting changes in molecular behavior.


Genetic Models of Schizophrenia

Mikhail V. Pletnikov, in Progress in Brain Research, 2009

Inducible expression of mutant DISC1

We generated a mouse model of conditional and inducible expression of human mutant DISC1 using the Tet-off system (Pletnikov et al., 2008). Mutant DISC1 is a hypothetical protein product of the balanced t(1;11) chromosomal translocation identified in a Scottish pedigree with a high load of major mental disorders, including schizophrenia and major depression (Millar et al., 2001; Ishizuka et al., 2006; Chubb et al., 2008). Fine mapping and cloning have identified a disrupted gene on chromosome 1, hence the name DISC1. As the breakpoint is in the middle of the open reading frame, the translocation is hypothesized to produce the truncated N-terminus product, mutant DISC1 (Millar et al., 2001). The identifiable mutation that is strongly associated with major mental diseases makes DISC1 and the mutant protein product interesting and attractive candidates for studying the neurobiology of psychiatric disorders (Ross et al., 2006). There are several examples of how similar functional mutations have helped to shed light on the molecular mechanisms of neurodegenerative diseases, including familial forms of Parkinson’s disease and Alzheimer’s disease (e.g., Davidzon et al., 2006; Piscopo et al., 2008).

Recent studies have implicated DISC1 in neuronal development, neuronal migration, and synaptogenesis (Ishizuka et al., 2006; Ross et al., 2006; Camargo et al., 2007). They have also suggested that mutant DISC1 may interfere with the functions of normal wild-type (WT) DISC1 via dominant-negative mechanisms, leading to loss-of-function of DISC1 (Kamiya et al., 2005). Thus, we generated a transgenic mouse model of inducible and conditional expression of mutant human DISC1 to study the molecular mechanisms whereby this protein affects neurodevelopment.

Our inducible DISC1 mouse model is a standard bi-transgenic Tet-off system (Fig. 1). In order to turn off transgene expression, DOX is added to mouse food or drinking water. As transcription of tTA is regulated by the α-calmodulin kinase II (CAMKII) promoter, expression of mutant DISC1 is present in neurons of the olfactory bulbs, cortex, hippocampus, and striatum but not the cerebellum. Expression of mutant DISC1 was found to start prenatally, as early as embryonic day (E) 15 as detected by western blot and E9 when assayed by RT-PCR (unpublished data). Thus, our model provides the opportunity to regulate both prenatal and postnatal expression of mutant DISC1.

The initial characterization of our model has included evaluation of the neurobehavioral effects of mutant DISC1 when its expression was present throughout the entire life span of mice. Expression of mutant DISC1 was on the mixed SJLB6CBA background (Pletnikov et al., 2008). We found that expression of mutant DISC1 was associated with increased spontaneous locomotor activity in male but not female mice, decreased social interaction and increased aggressive behavior in male mice when measured in open field test, and decreased spatial recognition memory in Morris water maze in female mice only despite comparable rates of learning between mutant and control mice. These alterations are reminiscent of positive and negative symptoms, and cognitive impairments seen in schizophrenia (Ross et al., 2006). No effects of mutant DISC1 were found in pre-pulse inhibition (PPI) of the acoustic startle and novelty-induced activity in open field.

These behavioral alterations were accompanied by enlargement of the lateral ventricle, the most consistent structural pathology seen in schizophrenic patients (Vita et al., 2006; Pagsberg et al., 2007). Ventricular enlargement can be partly explained by attenuated dendritic arborization found in primary cortical neurons derived from mutant DISC1 embryos, in line with human postmortem studies showing decreased dendritic length and dendritic arborization in certain cortical areas (Glantz et al., 2000). Our biochemical assays showed that the effects of mutant DISC1 may be mediated by binding of mutant DISC1 to endogenous mouse Disc1, producing decreased levels of mouse Disc1 and its interacting partner, Lis1, which have been implicated in the molecular mechanisms of neuronal maturation (Morris et al., 2003; Ozeki et al., 2003).

The main drawback of the study is that mutant DISC1 was expressed steadily throughout the entire life. Thus, the contribution of prenatal vs. postnatal periods remained unclear. Our recent experiments with regulation of expression of mutant DISC1 have demonstrated that prenatal and postnatal expression selectively affected different neurobehavioral phenotypes, suggesting the effects of mutant DISC1 may vary across neurodevelopment (manuscript in revision).


Abstract

The prediction of interresidue contacts and distances from coevolutionary data using deep learning has considerably advanced protein structure prediction. Here, we build on these advances by developing a deep residual network for predicting interresidue orientations, in addition to distances, and a Rosetta-constrained energy-minimization protocol for rapidly and accurately generating structure models guided by these restraints. In benchmark tests on 13th Community-Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction (CASP13)- and Continuous Automated Model Evaluation (CAMEO)-derived sets, the method outperforms all previously described structure-prediction methods. Although trained entirely on native proteins, the network consistently assigns higher probability to de novo-designed proteins, identifying the key fold-determining residues and providing an independent quantitative measure of the “ideality” of a protein structure. The method promises to be useful for a broad range of protein structure prediction and design problems.

Clear progress in protein structure prediction was evident in the recent 13th Community-Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction (CASP13) structure-prediction challenge (1). Multiple groups showed that application of deep learning-based methods to the protein structure-prediction problem makes it possible to generate fold-level accuracy models of proteins lacking homologs in the Protein Data Bank (PDB) (2) directly from multiple sequence alignments (MSAs) (3–6). In particular, AlphaFold (A7D) from DeepMind (7) and Xu with RaptorX (4) showed that distances between residues (not just the presence or absence of a contact) could be accurately predicted by deep learning on residue-coevolution data. The 3 top-performing groups (A7D, Zhang-Server, and RaptorX) all used deep residual-convolutional networks with dilation, with input coevolutionary coupling features derived from MSAs, either using pseudolikelihood or by covariance matrix inversion. Because these deep learning-based methods produce more complete and accurate predicted distance information, 3-dimensional (3D) structures can be generated by direct optimization. For example, Xu (4) used Crystallography and NMR System (CNS) (8) and the AlphaFold group (7) used gradient descent following conversion of the predicted distances into smooth restraints. Progress was also evident in protein structure refinement at CASP13 using energy-guided refinement (9–11).

In this work, we integrate and build upon the CASP13 advances. Through extension of deep learning-based prediction to interresidue orientations in addition to distances, and the development of a Rosetta-based optimization method that supplements the predicted restraints with components of the Rosetta energy function, we show that still more accurate models can be generated. We also explore applications of the model to the protein design problem. To facilitate further development in this rapidly moving field, we make all of the codes for the improved method available.


Results and discussion

Experiment setup

Given an unknown sequence, the objective is to determine whether the sequence is an adaptor protein, so the task can be treated as a supervised classification problem. We labeled adaptor proteins as positive data (“Positive”) and non-adaptor proteins as negative data (“Negative”). We applied 5-fold cross-validation on our training dataset with hyper-parameter optimization techniques. Finally, the independent dataset was used to evaluate the correctness of our model and to check for overfitting.

Our proposed RNN model was implemented using the PyTorch library with a Titan Xp GPU. We trained the RNN model from scratch using the Adam optimizer for 30 epochs. The learning rate was fixed to 1 × 10^−4 throughout the training process. Due to the significant imbalance in the sample numbers of adaptor proteins and non-adaptor proteins in the dataset, we adopted a weighted binary cross-entropy loss in the training process. The weighting factors were the inverse class frequencies.
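To make the weighting scheme concrete, here is a minimal dependency-free sketch of a binary cross-entropy loss weighted by inverse class frequency. It is not the authors' PyTorch code: the function names are hypothetical, and defining the weight of a class as N / N_class is one reasonable reading of "inverse class frequency" that the text does not spell out.

```python
import math


def inverse_frequency_weights(labels):
    """Weight each class by N / N_class (an assumed normalization)."""
    n_total = len(labels)
    n_positive = sum(labels)
    n_negative = n_total - n_positive
    return {1: n_total / n_positive, 0: n_total / n_negative}


def weighted_bce(labels, probabilities, weights, eps=1e-12):
    """Mean weighted binary cross-entropy over all samples."""
    total = 0.0
    for label, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        loss = -(label * math.log(p) + (1 - label) * math.log(1.0 - p))
        total += weights[label] * loss
    return total / len(labels)


# Hypothetical batch: one adaptor protein among three non-adaptor proteins.
labels = [1, 0, 0, 0]
weights = inverse_frequency_weights(labels)
print(weights)
print(weighted_bce(labels, [0.5, 0.5, 0.5, 0.5], weights))
```

With these weights the single positive sample contributes as much to the loss as the three negatives combined, which is the intended effect of the reweighting.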

Sensitivity, specificity, accuracy, and MCC (Matthews correlation coefficient) were used to measure the prediction performance. TP, FP, TN, and FN are true positives, false positives, true negatives, and false negatives, respectively.
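The equations themselves are not included in this excerpt. For reference, the standard definitions of these four metrics in terms of TP, FP, TN and FN are given below; they are assumed to match the ones the authors used.

```latex
\begin{align}
\text{Sensitivity} &= \frac{TP}{TP + FN} \\
\text{Specificity} &= \frac{TN}{TN + FP} \\
\text{Accuracy}    &= \frac{TP + TN}{TP + FP + TN + FN} \\
\text{MCC}         &= \frac{TP \times TN - FP \times FN}
                           {\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\end{align}
```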