  • Publication
    Potential utility of docking to identify protein-peptide binding regions
    (University College Dublin. School of Computer Science and Informatics, 2013-05)
    Disordered regions of proteins often bind to structured domains, mediating interactions within and between proteins. However, it is difficult to identify a priori the short regions involved in binding. We set out to determine if docking peptides to peptide binding domains would assist in these predictions. First, we investigated the docking of known short peptides to their native and non-native peptide binding domains. We then investigated the docking of overlapping peptides adjacent to the native peptide. We found only weak discrimination of docking scores between native peptide and adjacent peptides in this context, with similar results for both ordered and disordered regions. Finally, we trained a bidirectional recurrent neural network using as input the peptide sequence, predicted secondary structure, Vina docking score and Pepsite score. We conclude that docking has only modest power to define the location of a peptide within a larger protein region known to contain it. However, this information can be used in training machine learning methods which may allow for the identification of peptide binding regions within a protein sequence.
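    The "overlapping peptides" step in the abstract above can be pictured as a sliding window over the protein sequence. The following is a minimal sketch only: the function name, the window length and the step size are illustrative assumptions, not values taken from the publication.

```python
def overlapping_peptides(sequence, length, step=1):
    """Return (start_index, peptide) pairs for each window of `length` residues.

    `length` and `step` are illustrative parameters; the publication does not
    state the values it used.
    """
    if length <= 0 or step <= 0:
        raise ValueError("length and step must be positive")
    last_start = len(sequence) - length
    # range() is empty when the sequence is shorter than the window.
    return [(start, sequence[start:start + length])
            for start in range(0, last_start + 1, step)]


if __name__ == "__main__":
    for start, peptide in overlapping_peptides("ACDEFGH", length=4, step=1):
        print(start, peptide)
```

    Each window could then be docked and scored independently, so that the score of the native peptide can be compared with the scores of its neighbours.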
  • Publication
    CPPpred: prediction of cell penetrating peptides
    Summary: Cell penetrating peptides (CPPs) are attracting much attention as a means of overcoming the inherently poor cellular uptake of various bioactive molecules. Here, we introduce CPPpred, a web server for the prediction of CPPs using an N-to-1 neural network. The server takes one or more peptide sequences, between 5 and 30 amino acids in length, as input and returns a prediction of how likely each peptide is to be cell penetrating. CPPpred was developed with redundancy-reduced training and test sets, offering an advantage over the only other currently available CPP prediction method.
  • Publication
    In Silico Protein Motif Discovery and Structural Analysis
    A wealth of in silico tools is available for protein motif discovery and structural analysis. The aim of this chapter is to collect some of the most common and useful tools and to guide the biologist in their use. A detailed explanation is provided for the use of Distill, a suite of web servers for the prediction of protein structural features and the prediction of full-atom 3D models from a protein sequence. Besides this, we also provide pointers to many other tools available for motif discovery and secondary and tertiary structure prediction from a primary amino acid sequence. The prediction of protein intrinsic disorder and the prediction of functional sites and SLiMs are also briefly discussed. Given that user queries vary greatly in size, scope and character, the trade-offs in speed, accuracy and scale need to be considered when choosing which methods to adopt.
  • Publication
    Protein Structure Annotations
    (Springer, 2019-03-28)
    This chapter aims to introduce the specifics of protein structure annotations and their fundamental position in structural bioinformatics and in bioinformatics in general. Proteins are profoundly characterised by their structure in every aspect of their functioning and, while over the last decades there has been a close to exponential growth of known protein sequences, the growth of known protein structures has been closer to linear because of the high complexity and cost of determining them. Thus, protein structure predictors are among the most thoroughly assessed tools in bioinformatics (in venues such as CASP or CAMEO) because they allow the structural study of proteins on a large scale. This chapter presents the key types of protein structure annotation and the methods and algorithms for predicting them, with the aim of giving both a historical perspective on their development and a snapshot of their current state of the art. From one-dimensional protein annotations – i.e. secondary structure, solvent accessibility and torsion angles – to more complex and informative two-dimensional protein abstractions, i.e. contact maps, both mature and currently developing methods for protein structure annotations are introduced. The aim of this overview is to facilitate the adoption and development of state-of-the-art protein structural predictors. Particular attention is given to some of the best performing and freely available web servers and standalone programmes to predict protein structure annotations.
  • Publication
    SCL-Epred: A generalised de novo eukaryotic protein subcellular localisation predictor
    Knowledge of the subcellular location of a protein provides valuable information about its function, possible interaction with other proteins and drug targetability, among other things. The experimental determination of a protein's location in the cell is expensive, time consuming and open to human error. Fast and accurate predictors of subcellular location have an important role to play if the abundance of sequence data which is now available is to be fully exploited. In the post-genomic era, genomes in many diverse organisms are available. Many of these organisms are important in human and veterinary disease and fall outside of the well-studied plant, animal and fungi groups. We have developed a general eukaryotic subcellular localisation predictor (SCL-Epred) which predicts the location of eukaryotic proteins into three classes which are important, in particular, for determining the drug targetability of a protein - secreted proteins, membrane proteins and proteins that are neither secreted nor membrane. The algorithm powering SCL-Epred is an N-to-1 neural network and is trained on very large non-redundant sets of protein sequences. SCL-Epred performs well on training data, achieving a Q of 86% and a generalised correlation of 0.75 when tested in tenfold cross-validation on a set of 15,202 redundancy-reduced protein sequences. The three-class accuracy of SCL-Epred and LocTree2, and in particular a consensus predictor comprising both methods, surpasses that of other widely used predictors when benchmarked using a large redundancy-reduced independent test set of 562 proteins. SCL-Epred is publicly available at
  • Publication
    Deeper Profiles and Cascaded Recurrent and Convolutional Neural Networks for state-of-the-art Protein Secondary Structure Prediction
    Protein Secondary Structure prediction has been a central topic of research in Bioinformatics for decades. In spite of this, even the most sophisticated ab initio SS predictors are not able to reach the theoretical limit of three-state prediction accuracy (88-90%), while only a few predict more than the 3 traditional Helix, Strand and Coil classes. In this study we present tests on different models trained both on single sequence and evolutionary profile-based inputs and develop a new state-of-the-art system with Porter 5. Porter 5 is composed of ensembles of cascaded Bidirectional Recurrent Neural Networks and Convolutional Neural Networks, incorporates new input encoding techniques and is trained on a large set of protein structures. Porter 5 achieves 84% accuracy (81% SOV) when tested on 3 classes and 73% accuracy (70% SOV) on 8 classes on a large independent set. In our tests Porter 5 is 2% more accurate than its previous version and outperforms or matches the most recent predictors of secondary structure we tested. When Porter 5 is retrained on SCOPe based sets that eliminate homology between training/testing samples we obtain similar results. Porter is available as a web server and standalone program at alongside all the datasets and alignments.
  • Publication
    SCLpred : protein subcellular localization prediction by N-to-1 neural networks
    (Oxford University Press, 2011-08-27)
    Knowledge of the subcellular location of a protein provides valuable information about its function and possible interaction with other proteins. In the post-genomic era, fast and accurate predictors of subcellular location are required if this abundance of sequence data is to be fully exploited. We have developed a subcellular localization predictor (SCLpred), which predicts the location of a protein into four classes for animals and fungi and five classes for plants (secreted, cytoplasm, nucleus, mitochondrion and chloroplast) using machine learning models trained on large non-redundant sets of protein sequences. The algorithm powering SCLpred is a novel Neural Network (N-to-1 Neural Network, or N1-NN) we have developed, which is capable of mapping whole sequences into single properties (a functional class, in this work) without resorting to predefined transformations, but rather by adaptively compressing the sequence into a hidden feature vector. We benchmark SCLpred against other publicly available predictors using two benchmarks including a new subset of Swiss-Prot Release 2010_06. We show that SCLpred surpasses the state of the art. The N1-NN algorithm is fully general and may be applied to a host of problems of similar shape, that is, in which a whole sequence needs to be mapped into a fixed-size array of properties, and the adaptive compression it operates may shed light on the space of protein sequences. The predictive systems described in this article are publicly available as a web server at
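    The idea of "adaptively compressing the sequence into a hidden feature vector" can be illustrated with a small forward pass. This is a rough sketch under stated assumptions, not the SCLpred implementation: the window size, layer sizes, one-hot encoding, averaging over windows and random (untrained) weights are all hypothetical choices made for illustration.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def encode_window(sequence, centre, half_width):
    """One-hot encode the residues around `centre`; off-sequence positions are zeros."""
    vector = []
    for pos in range(centre - half_width, centre + half_width + 1):
        residue = sequence[pos] if 0 <= pos < len(sequence) else None
        vector.extend(1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS)
    return vector


def dense(weights, biases, inputs, activation):
    """Apply one fully connected layer."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def random_layer(n_out, n_in, rng):
    """Hypothetical untrained layer with small random weights and zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def n_to_1_forward(sequence, params, half_width=2):
    """Map a sequence of any length to a fixed-size feature vector and class probabilities."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    (w_hidden, b_hidden), (w_out, b_out) = params
    feature = [0.0] * len(b_hidden)
    # The same hidden layer is applied to every window, and the results are pooled.
    for centre in range(len(sequence)):
        hidden = dense(w_hidden, b_hidden,
                       encode_window(sequence, centre, half_width), math.tanh)
        feature = [f + h for f, h in zip(feature, hidden)]
    # Assumption: pool by averaging so the vector does not grow with sequence length.
    feature = [f / len(sequence) for f in feature]
    logits = dense(w_out, b_out, feature, lambda z: z)
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return feature, [e / total for e in exps]


if __name__ == "__main__":
    rng = random.Random(0)
    window_inputs = 5 * len(AMINO_ACIDS)  # half_width=2 gives 5 residues per window
    params = (random_layer(8, window_inputs, rng), random_layer(4, 8, rng))
    feature, probabilities = n_to_1_forward("MKTAYIAKQR", params)
    print(len(feature), len(probabilities))
```

    The point of the sketch is the shape of the computation: sequences of different lengths all yield a feature vector of the same size, which a final layer maps to class probabilities.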
  • Publication
    Ab initio and homology based prediction of protein domains by recursive neural networks
    Background: Proteins, especially larger ones, are often composed of individual evolutionary units, domains, which have their own function and structural fold. Predicting domains is an important intermediate step in protein analyses, including the prediction of protein structures. Results: We describe novel systems for the prediction of protein domain boundaries powered by Recursive Neural Networks. The systems rely on a combination of primary sequence and evolutionary information, predictions of structural features such as secondary structure, solvent accessibility and residue contact maps, and structural templates, both annotated for domains (from the SCOP dataset) and unannotated (from the PDB). We gauge the contribution of contact maps, and PDB and SCOP templates independently and for different ranges of template quality. We find that accurately predicted contact maps are informative for the prediction of domain boundaries, while the same is not true for contact maps predicted ab initio. We also find that gap information from PDB templates is informative, but, not surprisingly, less than SCOP annotations. We test both systems trained on templates of all qualities, and systems trained only on templates of marginal similarity to the query (less than 25% sequence identity). While the first batch of systems produces near perfect predictions in the presence of fair to good templates, the second batch outperforms or matches ab initio predictors down to essentially any level of template quality. We test all systems in 5-fold cross-validation on a large non-redundant set of multi-domain and single domain proteins. The final predictors are state-of-the-art, with a template-less prediction boundary recall of 50.8% (precision 38.7%) within ± 20 residues and a single domain recall of 80.3% (precision 78.1%).
The SCOP-based predictors achieve a boundary recall of 74% (precision 77.1%) again within ± 20 residues, and classify single domain proteins as such in over 85% of cases, when we allow a mix of bad and good quality templates. If we only allow marginal templates (max 25% sequence identity to the query) the scores remain high, with boundary recall and precision of 59% and 66.3%, and 80% of all single domain proteins predicted correctly. Conclusion: The systems presented here may prove useful in large-scale annotation of protein domains in proteins of unknown structure. The methods are available as public web servers at the address: and we plan on running them on a multi-genomic scale and make the results public in the near future.
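    Boundary recall and precision "within ± 20 residues" can be computed along the following lines. This is a sketch of one plausible definition only: the abstract does not spell out the matching protocol, so the rule used here (a boundary counts if any boundary in the other set lies within the tolerance) and the function name are assumptions.

```python
def boundary_scores(true_boundaries, predicted_boundaries, tolerance=20):
    """Return (recall, precision) for domain boundary positions.

    Assumed rule: a true boundary is recalled if any prediction lies within
    `tolerance` residues of it, and a prediction is correct if any true
    boundary lies within `tolerance` residues of it.
    """
    recalled = sum(
        1 for t in true_boundaries
        if any(abs(t - p) <= tolerance for p in predicted_boundaries)
    )
    correct = sum(
        1 for p in predicted_boundaries
        if any(abs(p - t) <= tolerance for t in true_boundaries)
    )
    recall = recalled / len(true_boundaries) if true_boundaries else 0.0
    precision = correct / len(predicted_boundaries) if predicted_boundaries else 0.0
    return recall, precision


if __name__ == "__main__":
    # Hypothetical positions: one of two true boundaries is found,
    # and one of two predictions is correct.
    print(boundary_scores([100, 250], [110, 400]))
```

    Under this rule, several predictions clustered around one true boundary all count as correct; a stricter one-to-one matching would lower precision in that case.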
  • Publication
    Towards the Improved Discovery and Design of Functional Peptides: Common Features of Diverse Classes Permit Generalized Prediction of Bioactivity
    The conventional wisdom is that certain classes of bioactive peptides have specific structural features that endow their particular functions. Accordingly, predictions of bioactivity have focused on particular subgroups, such as antimicrobial peptides. We hypothesized that bioactive peptides may share more general features, and assessed this by contrasting the predictive power of existing antimicrobial predictors as well as a novel general predictor, PeptideRanker, across different classes of peptides. We observed that existing antimicrobial predictors had reasonable predictive power to identify peptides of certain other classes, i.e. toxin and venom peptides. We trained two general predictors of peptide bioactivity, one focused on short peptides (4-20 amino acids) and one focused on long peptides (>20 amino acids). These general predictors had performance that was typically as good as, or better than, that of specific predictors. We noted some striking differences in the features of short peptide and long peptide predictions; in particular, high scoring short peptides favour phenylalanine. This is consistent with the hypothesis that short and long peptides have different functional constraints, perhaps reflecting the difficulty for typical short peptides in supporting independent tertiary structure. We conclude that there are general shared features of bioactive peptides across different functional classes, indicating that computational prediction may accelerate the discovery of novel bioactive peptides and aid in the improved design of existing peptides, across many functional classes. An implementation of the predictive method, PeptideRanker, may be used to identify among a set of peptides those that may be more likely to be bioactive.
  • Publication
    SCLpred-EMS: Subcellular localization prediction of endomembrane system and secretory pathway proteins by Deep N-to-1 Convolutional Neural Networks
    Motivation: The subcellular location of a protein can provide useful information for protein function prediction and drug design. Experimentally determining the subcellular location of a protein is an expensive and time-consuming task. Therefore, various computer-based tools have been developed, mostly using machine learning algorithms, to predict the subcellular location of proteins. Results: Here, we present a neural network-based algorithm for protein subcellular location prediction. We introduce SCLpred-EMS, a subcellular localization predictor powered by an ensemble of Deep N-to-1 Convolutional Neural Networks. SCLpred-EMS predicts the subcellular location of a protein into two classes, the endomembrane system and secretory pathway versus all others, with a Matthews correlation coefficient of 0.75-0.86, outperforming the other state-of-the-art web servers we tested. Contact:
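    For a two-class problem such as the one described above, the Matthews correlation coefficient is computed from the four confusion-matrix counts. The sketch below uses the standard formula; the function name and the example counts are hypothetical and are not results from the publication.

```python
import math


def mcc_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Returns 0.0 when any marginal total is zero, a common convention for
    the otherwise undefined case.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(round(mcc_from_counts(tp=90, tn=80, fp=20, fn=10), 4))
```

    The coefficient ranges from -1 (complete disagreement) through 0 (no better than chance) to 1 (perfect prediction), and unlike plain accuracy it stays informative when the two classes differ greatly in size.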