- Publication: Parallel Basic Linear Algebra Subprograms for Heterogeneous Computational Clusters of Multicore Processors (University College Dublin. School of Computer Science and Informatics, 2009)
In this document, we describe two strategies for distributing computations that can be used to implement parallel solvers for dense linear algebra problems on Heterogeneous Computational Clusters of Multicore Processors (HCoMs). These strategies are called the Heterogeneous Process Distribution Strategy (HPS) and the Heterogeneous Data Distribution Strategy (HDS). They are not novel and have already been researched thoroughly. However, the advent of multicores necessitates enhancements to them. We conduct experiments using six applications that use the various distribution strategies to perform parallel matrix-matrix multiplication (PMM) on a local HCoM. The first application calls the ScaLAPACK PBLAS routine PDGEMM, which uses the traditional homogeneous strategy of distribution of computations. The second application is an MPI application that uses HDS to perform the PMM. It requires one input: the two-dimensional processor grid arrangement to use during the execution of the PMM. The third application is also an MPI application, but it uses HPS to perform the PMM. It requires two inputs: the number of threads to run per process and the two-dimensional process grid arrangement to use during the execution of the PMM. The fourth application is the HeteroMPI application using the HDS strategy. It calls the HeteroMPI group management routines to determine the optimal two-dimensional processor grid arrangement and uses it during the execution of the PMM. The fifth application is the HeteroMPI application using the HPS strategy. It calls the HeteroMPI group management routines to determine the optimal two-dimensional process grid arrangement and uses it during the execution of the PMM. The final application is the Heterogeneous ScaLAPACK application, which applies the HPS strategy and reuses the ScaLAPACK PBLAS routine PDGEMM. For the last two applications, the number of threads to run per process must be preconfigured.
We compare the results of executing these six applications. The results reveal that the two strategies can compete with each other. The MPI applications employing HDS perform the best since they fully exploit the increased thread-level parallelism (TLP) provided by the multicore processors. However, for large problem sizes, the non-Cartesian nature of the data distribution may lead to excessive communications that can be very expensive. For such cases, the HPS strategy has been shown to equal and even outperform the HDS strategy. We also conclude that HeteroMPI is a valuable tool for implementing heterogeneous parallel algorithms on HCoMs because it provides features that determine optimal values of algorithmic parameters such as the total number of processors and the 2D processor grid arrangement.
- Publication: Theoretical Results on Optimal Partitioning for Matrix-Matrix Multiplication with Two Processors (University College Dublin. School of Computer Science and Informatics, 2011-09)
In this report, we consider a simple but important linear algebra kernel, matrix-matrix multiplication. Building multi-core processors based on heterogeneous cores is an important current trend. In this context, it is of great interest to study optimal matrix partitioning algorithms for small cases (i.e., a small number of cores). The general case, with relatively high numbers of heterogeneous resources, is now well understood; however, the problem is in general NP-complete when one aims at balancing the load while minimizing communications. Nonetheless, several approximation algorithms have been successfully designed. Moreover, the negative complexity results do not apply to the case of very few heterogeneous cores. Additionally, the case of a small number of processors is useful as a model for heterogeneous clusters and clusters of clusters. In this paper, we provide a complete study of the case of two heterogeneous resources and prove that, in this case, the optimal partitioning is based on non-standard decomposition techniques.
- Publication: A Parallel Algorithm for the Solution of the Deconvolution Problem in Heterogeneous Networks (University College Dublin. School of Computer Science and Informatics, 2005-12-19)
In this work, we present two parallel algorithms for the solution of a given least squares problem with structured matrices. This problem arises in many applications, most of them related to digital signal processing; an example is given. Both parallel algorithms have been designed to speed up the sequential one in a heterogeneous network of computers. They differ in the approach followed to implement parallel algorithms on heterogeneous networks of computers, known as the HeHo and HoHe strategies. However, our study goes beyond the practical usefulness of our heterogeneous parallel application. On the one hand, the results obtained validate the recently developed HeteroMPI as a very useful tool for programming heterogeneous parallel algorithms. On the other hand, although HeteroMPI was initially designed to apply the HeHo strategy, we propose a way in which this tool can be used with the HoHe strategy. The pros and cons of using HeteroMPI for both strategies are studied in depth through the application example.
- Publication: Heterogeneous PBLAS: A Set of Parallel Basic Linear Algebra Subprograms for Heterogeneous Computational Clusters (University College Dublin. School of Computer Science and Informatics, 2008)
We present a package, called Heterogeneous PBLAS (HeteroPBLAS), which is built on top of PBLAS and provides optimized parallel basic linear algebra subprograms for Heterogeneous Computational Clusters. We present the user interface and the software hierarchy of the first research implementation of HeteroPBLAS. This is the first step towards the development of a parallel linear algebra package for Heterogeneous Computational Clusters. We demonstrate the efficiency of the HeteroPBLAS programs on a homogeneous computing cluster and a heterogeneous computing cluster.
- Publication: A Non-Intrusive and Incremental Approach to Enabling Direct Communications in RPC-based Grid Programming Systems (University College Dublin. School of Computer Science and Informatics, 2005-04)
This paper advocates a non-intrusive and incremental approach to enabling existing Grid programming systems with new features. In particular, it presents a software component that enables NetSolve applications to use direct communications between remote tasks. The software component is a supplementary one that works on top of the basic NetSolve system. Its design also allows remote tasks to be freely mixed in a single application, regardless of whether each particular task is enabled for direct communications. Experiments with this software are also presented.
- Publication: SmartGridRPC: The new RPC model for high performance Grid computing (University College Dublin. School of Computer Science and Informatics, 2009-10)
The paper presents the SmartGridRPC model, an extension of the GridRPC model that aims to achieve higher performance. The traditional GridRPC provides a programming model and API for mapping individual tasks of an application in a distributed Grid environment, and is based on the client-server model characterised by the star network topology. SmartGridRPC provides a programming model and API for mapping a group of tasks of an application in a distributed Grid environment, and is based on the fully connected network topology. The SmartGridRPC programming model and API, its implementation in SmartGridSolve and its performance advantages over the GridRPC model are outlined in this paper. In addition, experimental results using a real-world application are presented.
- Publication: Grid-Enabled Hydropad: a Scientific Application for Benchmarking GridRPC-Based Programming Systems (University College Dublin. School of Computer Science and Informatics, 2008-12-12)
GridRPC is a standard API that allows an application to easily interface with a Grid environment. It implements a remote procedure call with a single task map and a client-server communication model. In addition to non-performance-related benefits, scientific applications with large computation tasks and small communication tasks can also obtain important performance gains by being implemented in GridRPC. However, such convenient applications are not representative of the majority of scientific applications and therefore cannot serve as fair benchmarks for comparing the performance of different GridRPC-based systems. In this paper, we present Hydropad, a real-life astrophysical simulation composed of tasks that have a balanced ratio between computation and communication. While Hydropad is not the ideal application for obtaining performance benefits from GridRPC middleware, we show how even its performance can be improved by using GridSolve and SmartGridSolve. We believe that the Grid-enabled Hydropad is a good candidate application to benchmark GridRPC-based programming systems in order to justify their use for high performance scientific computing.
- Publication: A Comparative Study of Methods for Measurement of Energy of Computing
Energy of computing is a serious environmental concern and mitigating it is an important technological challenge. Accurate measurement of energy consumption during an application execution is key to application-level energy minimization techniques. There are three popular approaches to providing it: (a) system-level physical measurements using external power meters; (b) measurements using on-chip power sensors; and (c) energy predictive models. In this work, we present a comprehensive study comparing the accuracy of state-of-the-art on-chip power sensors and energy predictive models against system-level physical measurements using external power meters, which we consider to be the ground truth.
We show that the average error of the dynamic energy profiles obtained using on-chip power sensors can be as high as 73% and the maximum reaches 300% for two scientific applications, matrix-matrix multiplication and 2D fast Fourier transform, for a wide range of problem sizes. The applications are executed on three modern Intel multicore CPUs, two Nvidia GPUs and an Intel Xeon Phi accelerator. The average error of the energy predictive models employing performance monitoring counters (PMCs) as predictor variables can be as high as 32% and the maximum reaches 100% for a diverse set of seventeen benchmarks executed on two Intel multicore CPUs (one Haswell and the other Skylake). We also demonstrate that using inaccurate energy measurements provided by on-chip sensors for dynamic energy optimization can result in significant energy losses of up to 84%. We show that, owing to the nature of the deviations of the energy measurements provided by on-chip sensors from the ground truth, calibration cannot improve the accuracy of the on-chip sensors to an extent that would allow them to be used in the optimization of applications for dynamic energy. Finally, we present the lessons learned, our recommendations for the use of on-chip sensors and energy predictive models, and future directions.
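As a rough illustration of how average and maximum error figures of this kind can be computed, the sketch below compares an energy profile from one measurement method against a ground-truth profile, point by point. It assumes the error is the absolute relative deviation from the ground truth, which the abstract does not state explicitly. The function names and the sample values are hypothetical and are not taken from the study.

```python
def percentage_errors(sensor_joules, meter_joules):
    """Return per-point relative errors (in percent) of sensor readings
    against the external power meter, which is treated as ground truth."""
    if len(sensor_joules) != len(meter_joules):
        raise ValueError("profiles must have the same number of points")
    return [abs(s - m) / m * 100.0 for s, m in zip(sensor_joules, meter_joules)]


def summarize(errors):
    """Return (average, maximum) of a list of percentage errors."""
    return sum(errors) / len(errors), max(errors)


# Illustrative values only; they are not measurements from the study.
meter = [100.0, 200.0, 400.0]    # ground-truth dynamic energy per problem size
sensor = [150.0, 180.0, 400.0]   # on-chip sensor readings for the same runs
avg_err, max_err = summarize(percentage_errors(sensor, meter))
```

Reporting both the average and the maximum matters here because a profile can look acceptable on average while still containing individual points that would mislead an optimizer.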
- Publication: How Algorithm Definition Language (ADL) Improves the Performance of SmartGridSolve Applications (University College Dublin. School of Computer Science and Informatics, 2009-07)
In this paper, we study the importance of languages for the specification of algorithms in high performance Grid computing. We present one such language, the Algorithm Definition Language (ADL), designed and implemented for use in conjunction with SmartGridSolve. We demonstrate that the use of this type of language can significantly improve the performance of Grid applications. We discuss how ADL can be used to improve the execution of some typical algorithms that use conditional statements, iterative computations and adaptive methods. We present experimental results demonstrating significant performance gains due to the use of ADL.
- Publication: Modeling Performance of Many-to-One Collective Communication Operations in Heterogeneous Clusters (University College Dublin. School of Computer Science and Informatics, 2006-05-30)
This paper presents a performance model for Many-to-One communications on a dedicated heterogeneous cluster of workstations based on a switched Ethernet network. The study finds that Many-to-One communication is more complex than One-to-Many and Point-to-Point communications, as its execution time does not depend linearly, or even continuously, on message size. It displays a very high jump in execution time for a significant range of message sizes. As a result, the proposed model is divided into three parts. The first part covers small messages, for which the model is linear; the second part models the congestion region; and the last part covers large messages, where linearity resumes. The accuracy of the proposed model is validated by experiments on various platforms with different MPI implementations.
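The three-part structure described above can be pictured as a piecewise function of message size. The sketch below is only a schematic of that shape, not the model from the paper: the thresholds `m1` and `m2`, the linear coefficients, and the single constant used for the congestion region are all hypothetical placeholders, since the abstract does not give the actual formulas.

```python
def many_to_one_time(msg_bytes, m1, m2, small, large, congestion_time):
    """Piecewise estimate of Many-to-One execution time.

    small and large are (latency, per_byte_cost) pairs for the two linear
    regions; m1 and m2 bound the congestion region. All parameters are
    hypothetical values that would have to be fitted to measurements.
    """
    if msg_bytes < m1:
        # Small messages: linear dependence on message size.
        latency, per_byte = small
        return latency + per_byte * msg_bytes
    if msg_bytes <= m2:
        # Congestion region: a single elevated value stands in for the
        # jump in execution time (a placeholder, not the paper's model).
        return congestion_time
    # Large messages: linearity resumes, possibly with other coefficients.
    latency, per_byte = large
    return latency + per_byte * msg_bytes
```

For example, with `m1=1000`, `m2=5000`, `small=(1.0, 0.5)`, `large=(2.0, 0.25)` and `congestion_time=999.0`, a 100-byte message falls in the first linear region, a 2000-byte message in the congestion region, and an 8000-byte message in the second linear region.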