  • Publication
    Pairwise Interaction Field Neural Networks For Drug Discovery
    (University College Dublin. School of Computer Science and Informatics, 2013-06)
    Automatically mapping small, drug-like molecules into their biological activity is an open problem in chemoinformatics. Numerous approaches to solve the problem have been attempted, which typically rely on different machine learning tools and, critically, depend on how a molecule is represented (be it as a one-dimensional string, a two-dimensional graph, its three-dimensional structure, or a feature vector of some kind). In fact, arguably the most critical bottleneck in the process is how to encode the molecule in a way that is both informative and can be dealt with by the machine learning algorithms downstream. Recently we have introduced an algorithm which entirely does away with this complex, error-prone and time-consuming encoding step by automatically finding an optimal code for a molecule represented as a two-dimensional graph. In this report we introduce a model which we have recently developed (Neural Network Pairwise Interaction Fields) to extend this same approach to molecules represented as their three-dimensional structures. We benchmark the algorithm on a number of public data sets. While our tests confirm that three-dimensional representations are generally less informative than two-dimensional ones (possibly because the former are generally the result of a prediction process, and as such contain noise), the algorithm we introduce compares well with the state of the art in 3D-based prediction, in spite of not requiring any prior knowledge about the domain, or prior encoding of the molecule.
  • Publication
    Template-based Recognition of Natively Disordered Regions in Proteins
    (University College Dublin. School of Computer Science and Informatics, 2012-04)
    Disordered proteins are increasingly recognised as a fundamental component of the cellular machinery. Parallel to this, the prediction of protein disorder by computational means has emerged as an aid to the investigation of protein functions. Although predictors of disorder have met with considerable success, it is increasingly clear that further improvements are most likely to come from additional sources of information, to complement patterns extracted from the primary sequence of a protein. In this article, a system for the prediction of protein disorder that relies both on sequence information and on structural information from homologous proteins of known structure (templates) is described. Structural information is introduced directly (as a further input to the predictor) and indirectly through highly reliable template-based predictions of structural features of the protein. The predictive system, based on Support Vector Machines, is tested by rigorous 5-fold cross validation on a large, non-redundant set of proteins extracted from the Protein Data Bank. In these tests the introduction of structural information, which is carefully weighed based on sequence identity between homologues and query, results in large improvements in prediction accuracy. The method, when re-trained on a 2004 version of the PDB, clearly outperforms the algorithms that ranked top at the 2006 CASP competition.
  • Publication
    Protein Backbone Angle Prediction in Multidimensional φ-ψ Space
    (University College Dublin. School of Computer Science and Informatics, 2006-01-20)
    A significant step towards establishing the structure and function of a protein is the prediction of the local conformation of the polypeptide chain. In this article we present systems for the prediction of 3 new alphabets of local structural motifs. The motifs are built by applying multidimensional scaling (MDS) and clustering to pair-wise angular distances for multiple φ-ψ angle values collected from high-resolution protein structures. The predictive systems, based on ensembles of bidirectional recurrent neural network architectures, and trained on a large non-redundant set of protein structures, achieve 72%, 66% and 60% correct structural motif prediction on an independent test set for di-peptides (6 classes), tripeptides (8 classes) and tetra-peptides (14 classes), respectively, 28-30% above base-line statistical predictors. To demonstrate that structural motif predictions contain relevant structural information, we build a further system, based on ensembles of two-layered bidirectional recurrent neural networks, to map structural motif predictions into traditional 3-class (helix, strand, coil) secondary structure. This system achieves 79.5% correct prediction using the “hard” CASP 3-class assignment, and 81.4% with a more lenient assignment, outperforming a sophisticated state-of-the-art predictor (Porter) trained in the same experimental conditions. All the predictive systems will be provided free of charge to academic users and made publicly available at the address http://distill.ucd.ie/.
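The abstract above builds its motif alphabets from pairwise angular distances between sets of φ-ψ values, but it does not state the distance formula. The following is a minimal sketch of one plausible choice, assuming angles in degrees and a Euclidean norm over wrapped per-angle differences; the function names `angle_diff` and `phi_psi_distance` are hypothetical and not taken from the paper.

```python
import math


def angle_diff(a: float, b: float) -> float:
    """Smallest absolute difference between two angles in degrees (0 to 180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def phi_psi_distance(x, y) -> float:
    """Distance between two fragments given as equal-length lists of (phi, psi).

    Assumption: Euclidean norm over wrapped angle differences. The paper's
    actual distance may differ.
    """
    if len(x) != len(y):
        raise ValueError("fragments must contain the same number of residues")
    total = 0.0
    for (phi1, psi1), (phi2, psi2) in zip(x, y):
        total += angle_diff(phi1, phi2) ** 2 + angle_diff(psi1, psi2) ** 2
    return math.sqrt(total)
```

Wrapping matters because dihedral angles are periodic: 170° and -170° are only 20° apart, not 340°. A matrix of such distances is the kind of input that multidimensional scaling and clustering would then operate on.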
  • Publication
    Ab initio and homology based prediction of protein domains by recursive neural networks
    Background: Proteins, especially larger ones, are often composed of individual evolutionary units, domains, which have their own function and structural fold. Predicting domains is an important intermediate step in protein analyses, including the prediction of protein structures. Results: We describe novel systems for the prediction of protein domain boundaries powered by Recursive Neural Networks. The systems rely on a combination of primary sequence and evolutionary information, predictions of structural features such as secondary structure, solvent accessibility and residue contact maps, and structural templates, both annotated for domains (from the SCOP dataset) and unannotated (from the PDB). We gauge the contribution of contact maps, and PDB and SCOP templates independently and for different ranges of template quality. We find that accurately predicted contact maps are informative for the prediction of domain boundaries, while the same is not true for contact maps predicted ab initio. We also find that gap information from PDB templates is informative, but, not surprisingly, less than SCOP annotations. We test both systems trained on templates of all qualities, and systems trained only on templates of marginal similarity to the query (less than 25% sequence identity). While the first batch of systems produces near perfect predictions in the presence of fair to good templates, the second batch outperforms or matches ab initio predictors down to essentially any level of template quality. We test all systems in 5-fold cross-validation on a large non-redundant set of multi-domain and single domain proteins. The final predictors are state-of-the-art, with a template-less prediction boundary recall of 50.8% (precision 38.7%) within ± 20 residues and a single domain recall of 80.3% (precision 78.1%). The SCOP-based predictors achieve a boundary recall of 74% (precision 77.1%) again within ± 20 residues, and classify single domain proteins as such in over 85% of cases, when we allow a mix of bad and good quality templates. If we only allow marginal templates (max 25% sequence identity to the query) the scores remain high, with boundary recall and precision of 59% and 66.3%, and 80% of all single domain proteins predicted correctly. Conclusion: The systems presented here may prove useful in large-scale annotation of protein domains in proteins of unknown structure. The methods are available as public web servers at the address: http://distill.ucd.ie/shandy/ and we plan on running them on a multi-genomic scale and making the results public in the near future.
      Scopus© Citations 14
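The abstract above reports boundary recall and precision "within ± 20 residues" without spelling out the matching rule. The sketch below illustrates one common reading, assuming a true boundary counts as recalled when any prediction falls within the window, and a prediction counts as correct when it falls within the window of any true boundary. The function name `boundary_scores` and this matching rule are assumptions, not the authors' evaluation code.

```python
def boundary_scores(true_bounds, pred_bounds, window=20):
    """Return (recall, precision) for domain boundary positions.

    true_bounds, pred_bounds: lists of residue indices.
    window: tolerance in residues (assumed symmetric, +/- window).
    """
    # A true boundary is recalled if some prediction lies within the window.
    recalled = sum(
        1 for t in true_bounds
        if any(abs(t - p) <= window for p in pred_bounds)
    )
    # A predicted boundary is correct if some true boundary lies within the window.
    correct = sum(
        1 for p in pred_bounds
        if any(abs(p - t) <= window for t in true_bounds)
    )
    recall = recalled / len(true_bounds) if true_bounds else 0.0
    precision = correct / len(pred_bounds) if pred_bounds else 0.0
    return recall, precision
```

For example, with true boundaries at residues 100 and 250 and predictions at 110 and 400, one of the two true boundaries is recovered and one of the two predictions is correct, so both scores are 0.5. A stricter one-to-one matching between predictions and true boundaries would give lower scores when several predictions cluster around one boundary.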
  • Publication
    Potential utility of docking to identify protein-peptide binding regions
    (University College Dublin. School of Computer Science and Informatics, 2013-05)
    Disordered regions of proteins often bind to structured domains, mediating interactions within and between proteins. However, it is difficult to identify a priori the short regions involved in binding. We set out to determine if docking peptides to peptide binding domains would assist in these predictions. First, we investigated the docking of known short peptides to their native and non-native peptide binding domains. We then investigated the docking of overlapping peptides adjacent to the native peptide. We found only weak discrimination of docking scores between native peptide and adjacent peptides in this context, with similar results for both ordered and disordered regions. Finally, we trained a bidirectional recurrent neural network using as input the peptide sequence, predicted secondary structure, Vina docking score and Pepsite score. We conclude that docking has only modest power to define the location of a peptide within a larger protein region known to contain it. However, this information can be used in training machine learning methods which may allow for the identification of peptide binding regions within a protein sequence.
  • Publication
    Prediction of short linear protein binding regions
    Short linear motifs in proteins (typically 3–12 residues in length) play key roles in protein–protein interactions by frequently binding specifically to peptide binding domains within interacting proteins. Their tendency to be found in disordered segments of proteins has meant that they have often been overlooked. Here we present SLiMPred (short linear motif predictor), the first general de novo method designed to computationally predict such regions in protein primary sequences independent of experimentally defined homologs and interactors. The method applies machine learning techniques to predict new motifs based on annotated instances from the Eukaryotic Linear Motif database, as well as structural, biophysical, and biochemical features derived from the protein primary sequence. We have integrated these data sources and benchmarked the predictive accuracy of the method, and found that it performs equivalently to a predictor of protein binding regions in disordered regions, in addition to having predictive power for other classes of motif sites such as polyproline II helix motifs and short linear motifs lying in ordered regions. It will be useful in predicting peptides involved in potential protein associations and will aid in the functional characterization of proteins, especially of proteins lacking experimental information on structures and interactions. We conclude that, despite the diversity of motif sequences and structures, SLiMPred is a valuable tool for prioritizing potential interaction motifs in proteins.
      Scopus© Citations 58
  • Publication
    Accurate prediction of protein secondary structure and solvent accessibility by consensus combiners of sequence and structure information
    Background: Structural properties of proteins such as secondary structure and solvent accessibility contribute to three-dimensional structure prediction, not only in the ab initio case but also when homology information to known structures is available. Structural properties are also routinely used in protein analysis even when homology is available, largely because homology modelling is lower throughput than, say, secondary structure prediction. Nonetheless, predictors of secondary structure and solvent accessibility are virtually always ab initio. Results: Here we develop high-throughput machine learning systems for the prediction of protein secondary structure and solvent accessibility that exploit homology to proteins of known structure, where available, in the form of simple structural frequency profiles extracted from sets of PDB templates. We compare these systems to their state-of-the-art ab initio counterparts, and with a number of baselines in which secondary structures and solvent accessibilities are extracted directly from the templates. We show that structural information from templates greatly improves secondary structure and solvent accessibility prediction quality, and that, on average, the systems significantly enrich the information contained in the templates. For sequence similarity exceeding 30%, secondary structure prediction quality is approximately 90%, close to its theoretical maximum, and 2-class solvent accessibility roughly 85%. Gains are robust with respect to template selection noise, and significant for marginal sequence similarity and for short alignments, supporting the claim that these improved predictions may prove beneficial beyond the case in which clear homology is available. Conclusion: The predictive systems are publicly available at the address http://distill.ucd.ie.
      Scopus© Citations 95
  • Publication
    De Novo Protein Subcellular Localization Prediction by N-to-1 Neural Networks
    Knowledge of the subcellular location of a protein provides valuable information about its function and possible interaction with other proteins. In the post-genomic era, fast and accurate predictors of subcellular location are required if this abundance of sequence data is to be fully exploited. We have developed a subcellular localization predictor (SCLpred) which assigns a protein to one of four classes for animals and fungi and five classes for plants (secretory pathway, cytoplasm, nucleus, mitochondrion and chloroplast) using high throughput machine learning techniques trained on large non-redundant sets of protein sequences. The algorithm powering SCLpred is a novel Neural Network (N-to-1 Neural Network, or N1-NN) which is capable of mapping whole sequences into single properties (a functional class, in this work) without resorting to predefined transformations, but rather by adaptively compressing the sequence into a hidden feature vector. We benchmark SCLpred against other publicly available predictors using two benchmarks including a new subset of Swiss-Prot release 57. We show that SCLpred compares favourably to the other state-of-the-art predictors. Moreover, the N1-NN algorithm is fully general and may be applied to a host of problems of similar shape, that is, in which a whole sequence needs to be mapped into a fixed-size array of properties, and the adaptive compression it operates may even shed light on the space of protein sequences.
      Scopus© Citations 3
  • Publication
    SCLpred-EMS: Subcellular localization prediction of endomembrane system and secretory pathway proteins by Deep N-to-1 Convolutional Neural Networks
    Motivation: The subcellular location of a protein can provide useful information for protein function prediction and drug design. Experimentally determining the subcellular location of a protein is an expensive and time-consuming task. Therefore, various computer-based tools have been developed, mostly using machine learning algorithms, to predict the subcellular location of proteins. Results: Here, we present a neural network-based algorithm for protein subcellular location prediction. We introduce SCLpred-EMS, a subcellular localization predictor powered by an ensemble of Deep N-to-1 Convolutional Neural Networks. SCLpred-EMS classifies a protein into one of two classes, the endomembrane system and secretory pathway versus all others, with a Matthews correlation coefficient of 0.75-0.86, outperforming the other state-of-the-art web servers we tested. Contact: catherine.mooney@ucd.ie
      Scopus© Citations 13
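The abstract above reports performance as a Matthews correlation coefficient (MCC) on a two-class problem. As a reminder of what that figure measures, here is a minimal sketch of the standard binary MCC computed from confusion-matrix counts. The function name `mcc` is chosen for illustration, and returning 0.0 when the denominator is zero is a common convention assumed here rather than anything stated in the abstract.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary classifier.

    tp, tn, fp, fn: counts of true positives, true negatives,
    false positives and false negatives.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: undefined MCC is reported as 0.0.
        return 0.0
    return (tp * tn - fp * fn) / denom
```

MCC ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (perfect), and unlike plain accuracy it stays informative when the two classes are very unequal in size, which is typical when one class is "all other locations".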
  • Publication
    Deeper Profiles and Cascaded Recurrent and Convolutional Neural Networks for state-of-the-art Protein Secondary Structure Prediction
    Protein Secondary Structure prediction has been a central topic of research in Bioinformatics for decades. In spite of this, even the most sophisticated ab initio SS predictors are not able to reach the theoretical limit of three-state prediction accuracy (88-90%), while only a few predict more than the 3 traditional Helix, Strand and Coil classes. In this study we present tests on different models trained both on single sequence and evolutionary profile-based inputs and develop a new state-of-the-art system with Porter 5. Porter 5 is composed of ensembles of cascaded Bidirectional Recurrent Neural Networks and Convolutional Neural Networks, incorporates new input encoding techniques and is trained on a large set of protein structures. Porter 5 achieves 84% accuracy (81% SOV) when tested on 3 classes and 73% accuracy (70% SOV) on 8 classes on a large independent set. In our tests Porter 5 is 2% more accurate than its previous version and outperforms or matches the most recent predictors of secondary structure we tested. When Porter 5 is retrained on SCOPe based sets that eliminate homology between training/testing samples we obtain similar results. Porter is available as a web server and standalone program at http://distilldeep.ucd.ie/porter/ alongside all the datasets and alignments.
      Scopus© Citations 41