Now showing 1 - 10 of 15
  • Publication
    Parallel Basic Linear Algebra Subprograms for Heterogeneous Computational Clusters of Multicore Processors
    (University College Dublin. School of Computer Science and Informatics, 2009) ; ;
    In this document, we describe two strategies of distribution of computations that can be used to implement parallel solvers for dense linear algebra problems for Heterogeneous Computational Clusters of Multicore Processors (HCoMs). These strategies are called Heterogeneous Process Distribution Strategy (HPS) and Heterogeneous Data Distribution Strategy (HDS). They are not novel and have already been researched thoroughly. However, the advent of multicores necessitates enhancements to them. We conduct experiments using six applications utilizing the various distribution strategies to perform parallel matrix-matrix multiplication (PMM) on a local HCoM. The first application calls ScaLAPACK PBLAS routine PDGEMM, which uses the traditional homogeneous strategy of distribution of computations. The second application is an MPI application, which utilizes HDS to perform the PMM. The application requires an input, which is the two-dimensional processor grid arrangement to use during the execution of the PMM. The third application is also an MPI application but that uses HPS to perform the PMM. The application requires two inputs, which are the number of threads to run per process and the two-dimensional process grid arrangement to use during the execution of the PMM. The fourth application is the HeteroMPI application using the HDS strategy. It calls the HeteroMPI group management routines to determine the optimal two-dimensional processor grid arrangement and uses it during the execution of the PMM. The fifth application is the HeteroMPI application using the HPS strategy. It calls the HeteroMPI group management routines to determine the optimal twodimensional process grid arrangement and uses it during the execution of the PMM. The final application is the Heterogeneous ScaLAPACK application, which applies the HPS strategy and reuses the ScaLAPACK PBLAS routine PDGEMM. For the last two applications, the number of threads to run per process must be preconfigured. We compare the results of execution of these six applications. The results reveal that the two strategies can compete with each other. The MPI applications employing HDS perform the best since they fully exploit the increased thread-level parallelism (TLP) provided by the multicore processors. However, for large problem sizes, the non-cartesian nature of the data distribution may lead to excessive communications that can be very expensive. For such cases, the HPS strategy has been shown to equal and even out-perform the HDS strategy. We also conclude that HeteroMPI is a valuable tool to implement heterogeneous parallel algorithms on HCoMs because it provides desirable features that determine optimal values of the algorithmic parameters such as the total number of processors and the 2D processor grid arrangement.
      20
  • Publication
    A Parallel Algorithm for the Solution of the Deconvolution Problem in Heterogeneous Networks
    (University College Dublin. School of Computer Science and Informatics, 2005-12-19) ; ;
    In this work we present two parallel algorithms for the solution of a given least squares problem with structured matrices. This problem arises in many applications most related to digital signal processing, an example is given. Both parallel algorithms have been designed to speed–up the sequential one in a heterogeneous network of computers. They differ from the approximation followed to implement parallel algorithms on heterogeneous networks of computers known as HeHo and HoHe strategies. However, our study goes beyond the practical usefulness of our heterogeneous parallel application. One one hand, the results obtained validates the recent developed HeteroMPI as a very useful tool for programming heterogeneous parallel algorithms. On the other hand, although HeteroMPI has initially been designed to apply the HeHo strategy, we propose a way this tool can be used in the HoHe strategy. Pros and cons of the use of HeteroMPI for both strategies will be deeply study through the application example.
      58
  • Publication
    How Algorithm Definition Language (ADL) Improves the Performance of SmartGridSolve Applications
    (University College Dublin. School of Computer Science and Informatics, 2009-07) ; ;
    In this paper, we study the importance of languages for the specification of algorithms in high performance Grid computing. We present one such language, the Algorithm Definition Language (ADL), designed and implemented for the use in conjunction with SmartGridSolve. We demonstrate that the use of this type of language can significantly improve the performance of Grid applications. We discuss how ADL can be used to improve the execution of some typical algorithms that use conditional statements, iterative computations and adaptive methods. We present experimental results demonstrating significant performance gains due to the use of ADL.
      91
  • Publication
    Theoretical Results on Optimal Partitoning for Matrix-Matrix Multiplication with Two Processors
    (University College Dublin. School of Computer Science and Informatics, 2011-09) ;
    In this report, we consider a simple but important linear algebra kernel, matrix-matrix multiplication. Building multi-core processors based on heterogeneous cores is an important current trend. In this context, it is of great interest to study optimal matrix partitioning algorithms for small cases (i.e. small number of cores). Indeed, the general case, with relatively high numbers of heterogeneous resources is now well understood, however the problem is in general NP-Complete when one aims at balancing the load while minimizing the communications. Nonetheless several approximation algorithms have been successfully designed. Nevertheless, negative complexity results do not apply for very few heterogeneous cores. Additionally, the case of a small number of processors is useful as a model for heterogeneous clusters and clusters of clusters. In this paper, we provide a complete study of 2 heterogeneous resources and we prove that in this case, the optimal partitioning is based on non-standard decomposition techniques.
      20
  • Publication
    Heterogeneous PBLAS: A Set of Parallel Basic Linear Algebra Subprograms for Heterogeneous Computational Clusters
    (University College Dublin. School of Computer Science and Informatics, 2008) ; ;
    We present a package, called Heterogeneous PBLAS (HeteroPBLAS), which is built on top of PBLAS and provides optimized parallel basic linear algebra subprograms for Heterogeneous Computational Clusters. We present the user interface and the software hierarchy of the first research implementation of HeteroPBLAS. This is the first step towards the development of a parallel linear algebra package for Heterogeneous Computational Clusters. We demonstrate the efficiency of the HeteroPBLAS programs on a homogeneous computing cluster and a heterogeneous computing cluster.
      22
  • Publication
    Modeling Performance of Many-to-One Collective Communication Operations in Heterogeneous Clusters
    (University College Dublin. School of Computer Science and Informatics, 2006-05-30) ; ;
    This paper presents a performance model for Many-to-One type communications on a dedicated heterogeneous cluster of workstations based on a switched Ethernet network. This study finds that Many-to-One communication is more complex than One-to-Many and Point-to-Point communications as it does not show a linear or even continuous dependence of the execution time on message sizes. It displays a very high jump in execution time for a significant range of message sizes. As a result, the proposed model is divided into three parts. The first part is for small sized messages whose model is linear, the second part models the congestion region, and the last part is for large message sizes where linearity resumes. The proposed model is validated for accuracy by the experiments on various platforms with different MPI implementations.
      17
  • Publication
    Partitioning for Parallel Matrix-Matrix Multiplication with Heterogeneous Processors: The Optimal Solution
    The problem of matrix partitioning for parallel matrix-matrix multiplication on heterogeneous processors has been extensively studied since the mid 1990s. During this time, previous research focused mainly on the design of efficient partitioning algorithms, optimally or sub-optimally partitioning matrices into rectangles. The optimality of the rectangular partitioning shape itself has never been studied or even seriously questioned. The accepted approach is that consideration of non-rectangular shapes will not significantly improve the optimality of the solution, but can significantly complicate the partitioning problem, which is already NP-complete even for the restricted case of rectangular shapes. There is no published research, however, supporting this approach. The shape of the globally optimal partitioning, and how the best rectangular partitioning compares with this global optimum, are still wide open problems. Solution of these problems will decide if new partitioning algorithms searching for truly optimal, and not necessarily rectangular, solutions are needed. This paper presents the first results of our research on the problem of optimal partitioning shapes for parallel matrix-matrix multiplication on heterogeneous processors. Namely, the case of two interconnected processors is comprehensively studied. We prove that, depending on performance characteristics of the processors and the communication link, the globally optimal partitioning will have one of just two well-specified shapes, one of which is rectangular and the other is non-rectangular. The theoretical analysis is conducted using an original mathematical technique proposed in the paper. It is shown that the technique can also be applied in the case of arbitrary numbers of processors. While comprehensive analysis of the cases of three and more processors is more complicated and the subject for future work, the paper does prove the optimality of some particular non-rectangular partitioning shapes f- r some combinations of performance characteristics of heterogeneous processors and communication links. The paper also presents experimental results demonstrating that the optimal non-rectangular partitioning can significantly outperform the optimal rectangular one on real-life heterogeneous HPC platforms.
      266Scopus© Citations 12
  • Publication
    Matrix Multiplication on Two Interconnected Processors
    This paper presents a new partitioning algorithm to perform matrix multiplication on two interconnected heterogeneous processors. Data is partitioned in a way which minimizes the total volume of communication between the processors compared to more general partitionings, resulting in a lower total execution time whenever the power ratio between the processors is greater than 3:1. The algorithm has interesting and important applicability, particularly as the top-level partitioning in a hierarchal algorithm that is to perform matrix multiplication on two interconnected clusters of computers.
      265Scopus© Citations 9
  • Publication
    Towards Data Partitioning for Parallel Computing on Three Interconnected Clusters
    We present a new data partitioning strategy for parallel computing on three interconnected clusters. This partitioning has two advantages over existing partitionings. First it can reduce communication time due to a lower total volume of communication and a more efficient communication schedule. When the network topology is a linear array this partitioning always results in a lower total volume of communication compared to existing partitionings, provided the most powerful node is at the center of the array. When the topology is fully connected this partitioning results in a lower total volume of communication for all but a few power ratios. Second, it allows for the overlapping of communication and computation. These two inherent advantages work together to reduce overall execution time significantly.
      250Scopus© Citations 11
  • Publication
    A Non-Intrusive and Incremental Approach to Enabling Direct Communications in RPC-based Grid Programming Systems
    (University College Dublin. School of Computer Science and Informatics, 2005-04)
    This paper advocates a non-intrusive and incremental approach to enabling existing Grid programming systems with new features. In particular, it presents a software component enabling NetSolve applications with direct communications between remote tasks. The software component is a supplementary one working on the top of the basic NetSolve system. Its design also allows remote tasks to be freely mixed in a single application, independent on whether each particular task is enabled for direct communications or not. Experiments with this software are also presented.
      55