Computer Science and Informatics Technical Reports

A technical report series created to coincide with the launch of the UCD School of Computer Science and Informatics. These can be downloaded freely. Queries about the technical report series should be addressed to Alexey Lastovetsky.

Recent Submissions

  • Publication
    Pairwise Interaction Field Neural Networks For Drug Discovery
    (University College Dublin. School of Computer Science and Informatics, 2013-06)
    Automatically mapping small, drug-like molecules into their biological activity is an open problem in cheminformatics. Numerous approaches to solve the problem have been attempted, which typically rely on different machine learning tools and, critically, depend on how a molecule is represented (be it as a one-dimensional string, a two-dimensional graph, its three-dimensional structure, or a feature vector of some kind). In fact, arguably the most critical bottleneck in the process is how to encode the molecule in a way that is both informative and can be dealt with by the machine learning algorithms downstream. Recently we have introduced an algorithm which entirely does away with this complex, error-prone and time-consuming encoding step by automatically finding an optimal code for a molecule represented as a two-dimensional graph. In this report we introduce a model which we have recently developed (Neural Network Pairwise Interaction Fields) to extend this same approach to molecules represented as their three-dimensional structures. We benchmark the algorithm on a number of public data sets. While our tests confirm that three-dimensional representations are generally less informative than two-dimensional ones (possibly because the former are generally the result of a prediction process, and as such contain noise), the algorithm we introduce compares well with the state of the art in 3D-based prediction, in spite of not requiring any prior knowledge about the domain, or prior encoding of the molecule.
  • Publication
    Potential utility of docking to identify protein-peptide binding regions
    (University College Dublin. School of Computer Science and Informatics, 2013-05)
    Disordered regions of proteins often bind to structured domains, mediating interactions within and between proteins. However, it is difficult to identify a priori the short regions involved in binding. We set out to determine if docking peptides to peptide binding domains would assist in these predictions. First, we investigated the docking of known short peptides to their native and non-native peptide binding domains. We then investigated the docking of overlapping peptides adjacent to the native peptide. We found only weak discrimination of docking scores between native peptide and adjacent peptides in this context, with similar results for both ordered and disordered regions. Finally, we trained a bidirectional recurrent neural network using as input the peptide sequence, predicted secondary structure, Vina docking score and Pepsite score. We conclude that docking has only modest power to define the location of a peptide within a larger protein region known to contain it. However, this information can be used in training machine learning methods which may allow for the identification of peptide binding regions within a protein sequence.
  • Publication
    Template-based Recognition of Natively Disordered Regions in Proteins
    (University College Dublin. School of Computer Science and Informatics, 2012-04)
    Disordered proteins are increasingly recognised as a fundamental component of the cellular machinery. Parallel to this, the prediction of protein disorder by computational means has emerged as an aid to the investigation of protein functions. Although predictors of disorder have met with considerable success, it is increasingly clear that further improvements are most likely to come from additional sources of information, to complement patterns extracted from the primary sequence of a protein. In this article, a system for the prediction of protein disorder that relies both on sequence information and on structural information from homologous proteins of known structure (templates) is described. Structural information is introduced directly (as a further input to the predictor) and indirectly through highly reliable template-based predictions of structural features of the protein. The predictive system, based on Support Vector Machines, is tested by rigorous 5-fold cross validation on a large, non-redundant set of proteins extracted from the Protein Data Bank. In these tests the introduction of structural information, which is carefully weighed based on sequence identity between homologues and query, results in large improvements in prediction accuracy. The method, when re-trained on a 2004 version of the PDB, clearly outperforms the algorithms that ranked top at the 2006 CASP competition.
  • Publication
    High-Level Data Partitioning for Parallel Computing on Heterogeneous Hierarchical HPC Platforms
    (University College Dublin. School of Computer Science and Informatics, 2011)
    The current state and foreseeable future of high performance scientific computing (HPC) can be described in three words: heterogeneous, parallel and distributed. These three simple words have a great impact on the architecture and design of HPC platforms and the creation and execution of efficient algorithms and programs designed to run on them. As a result of the inherent heterogeneity, parallelism and distribution which promises to continue to pervade scientific computing in the coming years, the issue of data distribution and therefore data partitioning is unavoidable. This data distribution and partitioning is due to the inherent parallelism of almost all scientific computing platforms. Cluster computing has become all but ubiquitous with the development of clusters of clusters and grids becoming increasingly popular. Even at a lower level, high performance symmetric multiprocessor (SMP) machines, General Purpose Graphical Processing Unit (GPGPU) computing, and multiprocessor parallel machines play an important role. At a very low level, multicore technology is now widespread, increasing in heterogeneity, and promises to be omnipresent in the near future. The prospect of prevalent manycore architectures will inevitably bring yet more heterogeneity. Scientific computing is undergoing a paradigm shift like none before. Only a decade ago most high performance scientific architectures were homogeneous in design and heterogeneity was seen as a difficult and somewhat limiting feature of some architectures. However this past decade has seen the rapid development of architectures designed not only to exploit heterogeneity but architectures designed to be heterogeneous. Grid and massively distributed computing has led the way on this front. The current shift is moving from this to architectures that are not heterogeneous by definition, but heterogeneous by necessity. 
Cloud and exascale computing architectures and platforms are not designed to be heterogeneous as much as they are heterogeneous by definition. Indeed such architectures cannot be homogeneous on any large (and useful) scale. In fact more and more researchers see heterogeneity as the natural state of computing. Further to hardware advances, scientific problems have become so large that the use of more than one of any of the above platforms in parallel has become necessary, if not unavoidable. Problems such as climatology and projects including the Large Hadron Collider necessitate the use of extreme-scale parallel platforms, often encompassing more than one geographically central supercomputer or cluster. Even at the core level large amounts of information must be shared efficiently. One of the greatest difficulties in solving problems on such architectures is the distribution of data between the different components in a way that optimizes runtime. There have been numerous algorithms developed to do so over the years. Most seek to optimize runtime by reducing the total volume of communication between processing entities. Much research has been conducted to do so between distinct processors or nodes, less so between distributed clusters. This report presents new data partitioning algorithms for matrix and linear algebra operations. These algorithms would in fact work with little or no modification for any application with similar communication patterns. In practice these partitionings distribute data between a small number of computing entities, each of which can have great computational power themselves, and an even greater aggregate power. These partitionings may also be deployed in a hierarchical manner, which allows the flexibility to be employed in a great range of problem domains and computational platforms. 
These partitionings, in hybrid form, working together with more traditional partitionings, minimize the total volume of communication between entities in a manner proven to be optimal. This is done regardless of the power ratio that exists between the entities, thus minimizing execution time. There is also no restriction on the algorithms or methods employed on the clusters themselves locally, thus maximizing flexibility. Finally, most heterogeneous algorithms and partitionings are designed by modifying existing homogeneous ones. With this in mind the ultimate contribution of this report is to demonstrate that non-traditional and perhaps unintuitive algorithms and partitionings designed with heterogeneity in mind from the start can result in better, and in many cases optimal, algorithms and partitionings for heterogeneous platforms. The importance of this given the current outlook for, and trends in, the future of high performance scientific computing is obvious.
  • Publication
    The Quantitative Estimation of Asynchrony Among Concurrent Speakers
    (University College Dublin. School of Computer Science and Informatics, 2008-03-17)
    A novel method for the estimation of asynchrony between two speakers reading together is proposed. Previous estimates of asynchrony were based only on the pointwise measurement of lag. We here adapt the well-known method of dynamic time warping to align two utterances. The resulting warp path allows a quantitative estimate of asynchrony. Illustrative examples are provided, which demonstrate that the novel method can distinguish synchronization performance in a variety of speaking conditions.
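
    As an illustration of the idea in the abstract above, the following is a minimal sketch of textbook dynamic time warping on two one-dimensional feature sequences, followed by one possible asynchrony summary taken from the warp path. It is not the report's implementation: the function names, the absolute-difference local cost, the mean-absolute-lag summary and the 10 ms frame period are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def dtw_path(a: Sequence[float], b: Sequence[float]) -> List[Tuple[int, int]]:
    """Return a minimum-cost warp path aligning a and b (both non-empty).

    Local cost is |a[i] - b[j]| (an assumption; real speech features
    would normally be vectors with a vector distance).
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],  # step in both
                                 cost[i - 1][j],      # step in a only
                                 cost[i][j - 1])      # step in b only

    # Backtrack from the end to the start, preferring the cheapest predecessor.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
        path.append((i - 1, j - 1))
    path.reverse()
    return path


def mean_abs_lag(path: List[Tuple[int, int]], frame_period: float = 0.01) -> float:
    """Mean absolute lag along the warp path, in seconds.

    frame_period is a hypothetical analysis step (10 ms here).
    """
    return frame_period * sum(abs(i - j) for i, j in path) / len(path)


if __name__ == "__main__":
    speaker_a = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0]
    speaker_b = [0.0, 0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0]  # same contour, one frame late
    print(mean_abs_lag(dtw_path(speaker_a, speaker_b)))
```

    Unlike a single pointwise lag measurement, the warp path gives a lag value at every aligned frame, so other summaries (median, maximum, or lag as a function of time) could be derived from the same path.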