  • Publication
    A Geometric Distance Oracle for Large Real-World Graphs
    Many graph processing algorithms require determination of shortest-path distances between arbitrary numbers of node pairs. Since computation of exact distances between all node pairs of a large graph, e.g., 10M nodes and up, is prohibitively expensive in both computational time and storage space, distance approximation is often used in place of exact computation. In this paper, we present a novel and scalable distance oracle that leverages the hyperbolic core of large real-world graphs for fast and scalable distance approximation. We show empirically that the proposed oracle significantly outperforms prior oracles on a random set of test cases drawn from public domain graph libraries. There are two sets of prior work against which we benchmark our approach. The first set, which often outperforms other oracles, employs embedding of the graph into low-dimensional Euclidean spaces with carefully constructed hyperbolic distances, but provides no guarantees on the distance estimation error. The second set leverages Gromov-type tree contraction of the graph, with the additive error guaranteed not to exceed $2\delta\log{n}$, where $\delta$ is the hyperbolic constant of the graph. We show that our proposed oracle 1) is significantly faster than those oracles that use hyperbolic embedding (first set) with similar approximation error and, perhaps surprisingly, 2) exhibits substantially lower average estimation error compared to Gromov-like tree contractions (second set). We substantiate our claims through numerical computations on a dozen real-world networks and synthetic test cases from multiple domains, ranging in size from tens of thousands to tens of millions of nodes.
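For orientation, the general idea of a distance oracle (precompute a small structure, then answer distance queries from it quickly) can be sketched with a plain landmark scheme. This is not the hyperbolic-core oracle proposed in the paper; it is a generic, simpler technique, and the function names and the toy graph are illustrative assumptions.

```python
from collections import deque


def bfs_distances(adj, source):
    """Return hop distances from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def build_oracle(adj, landmarks):
    """Precompute one BFS distance table per landmark (hypothetical helper)."""
    return {l: bfs_distances(adj, l) for l in landmarks}


def estimate_distance(oracle, u, v):
    """Upper bound on d(u, v): the shortest detour through any landmark."""
    best = float("inf")
    for table in oracle.values():
        if u in table and v in table:
            best = min(best, table[u] + table[v])
    return best


# Toy undirected path graph 0-1-2-3-4 with node 2 as the only landmark.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
oracle = build_oracle(adj, landmarks=[2])
far_estimate = estimate_distance(oracle, 0, 4)   # landmark lies on the path
near_estimate = estimate_distance(oracle, 0, 1)  # landmark is off the path
```

By the triangle inequality the estimate never underestimates the true distance. It is exact when some landmark lies on a shortest path between the two nodes and can overestimate otherwise, which is the kind of estimation error the abstract compares across oracles.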
  • Publication
    Scalable Disambiguation System Capturing Individualities of Mentions
    Entity disambiguation, or mapping a phrase to its canonical representation in a knowledge base, is a fundamental step in many natural language processing applications. Existing techniques based on global ranking models fail to capture the individual peculiarities of the words and hence struggle to meet the accuracy-time requirements of many real-world applications. In this paper, we propose a new system that learns specialized features and models for disambiguating each ambiguous phrase in the English language. We train and validate the hundreds of thousands of learning models for this purpose using a Wikipedia hyperlink dataset with more than 170 million labelled annotations. The computationally intensive training required for this approach can be distributed over a cluster. In addition, our approach supports fast queries and efficient updates, and its accuracy compares favorably with other state-of-the-art disambiguation systems.
    Scopus© Citations 2  494
  • Publication
    Analysis of the Semi-synchronous Approach to Large-scale Parallel Community Finding
    Community finding in graphs is the process of identifying highly cohesive vertex subsets. Recently the vertex-centric approach has been found effective for scalable graph processing and is implemented in systems such as GraphLab and Pregel. In the vertex-centric approach, the analysis is decomposed into a set of local computations at each vertex of the graph, with results propagated to neighbours along the vertex's edges. Many community finding algorithms are amenable to this approach as they are based on the optimisation of an objective through a process of iterative local update (ILU), in which vertices are successively moved to the community of one of their neighbours in order to achieve the highest local gain in the quality of the objective. The sequential processing of such iterative algorithms generally benefits from an asynchronous approach, where a vertex update uses the most recent state as generated by the previous update of vertices in its neighbourhood. When vertices are distributed over a parallel machine, the asynchronous approach can encounter race conditions that degrade its performance and destroy the consistency of the results. Alternatively, a semi-synchronous approach ensures that only non-conflicting vertices are updated simultaneously. In this paper we study the semi-synchronous approach to ILU algorithms for community finding on social networks. Because of the heavy-tailed vertex distribution, the order in which vertex updates are applied in asynchronous ILU can greatly impact both convergence time and quality of the found communities. We study the impact of ordering on the distributed label propagation and modularity maximisation algorithms implemented on a shared-memory multicore architecture. We demonstrate that the semi-synchronous ILU approach is competitive in time and quality with the asynchronous approach, while allowing the analyst to maintain consistent control over update ordering. Thus, our implementation results in more robust and predictable performance and provides control over the order in which the node labels are updated, which is crucial to obtaining the correct trade-off between running time and quality of communities on many graph classes.
    Scopus© Citations 3  400
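The semi-synchronous idea in the abstract above (update only non-conflicting vertices at the same time) can be sketched for label propagation as follows. The sketch assumes that "non-conflicting" is realised through a greedy graph colouring, so that vertices of one colour share no edge; that choice, the function names, and the tie-breaking rule are illustrative assumptions, not the paper's implementation, and the code runs sequentially rather than on a multicore machine.

```python
from collections import Counter


def greedy_coloring(adj):
    """Assign each vertex the smallest colour unused by its coloured neighbours."""
    color = {}
    for u in sorted(adj):
        used = {color[v] for v in adj[u] if v in color}
        c = 0
        while c in used:
            c += 1
        color[u] = c
    return color


def semi_sync_label_propagation(adj, max_rounds=100):
    """Label propagation that updates one colour class at a time."""
    color = greedy_coloring(adj)
    classes = {}
    for u, c in color.items():
        classes.setdefault(c, []).append(u)
    labels = {u: u for u in adj}  # every vertex starts in its own community
    for _ in range(max_rounds):
        changed = False
        for c in sorted(classes):
            # Vertices in one class are pairwise non-adjacent, so each update
            # reads only labels owned by other classes; these updates could
            # therefore be computed in parallel without races.
            updates = {}
            for u in classes[c]:
                if not adj[u]:
                    continue
                counts = Counter(labels[v] for v in adj[u])
                top = max(counts.values())
                best = [label for label, k in counts.items() if k == top]
                # Keep the current label on ties; otherwise take the smallest
                # most frequent label so the result is deterministic.
                if labels[u] not in best:
                    updates[u] = min(best)
            if updates:
                labels.update(updates)
                changed = True
        if not changed:
            break
    return labels


# Two triangles (0,1,2) and (3,4,5) joined by the bridge 2-3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
labels = semi_sync_label_propagation(adj)
```

On this toy graph the two triangles end up with different labels. The order in which colour classes are visited is fixed by `sorted(classes)`, which gives a simple instance of the explicit control over update ordering that the abstract emphasises.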
  • Publication
    Processing Large Graphs: Representations, Storage, Systems and Algorithms
    Analyzing and processing large graphs is of fundamental importance for an ever-growing number of applications. Significant advancements in the last few years, on both the systems and the algorithmic side, have made graph processing increasingly scalable and efficient. Often, these advances are still not well known or well understood outside the systems and algorithms communities. In particular, there is very little understanding of the various trade-offs involved in the usage of particular combinations of algorithms, data structures, and systems. This tutorial will have a particular focus on this aspect, imparting theoretical knowledge intertwined with hands-on experience. Since there is no clearly winning system/algorithm combination that performs best on all the different metrics, it is of utmost importance to understand the pros and cons of the various alternatives. The tutorial will enable application developers in industry and academia, students as well as researchers, to make corresponding decisions in an informed way. Participants require neither any particular prior knowledge apart from a basic understanding of core computer science concepts, nor any special equipment apart from their laptop. After a general introduction, we will describe the critical dimensions that need to be tackled together to effectively and efficiently overcome problems in large graph processing: data representation, data storage, acceleration via multi-core programming, and horizontally scalable graph-processing infrastructures. Thereafter, we will provide an overview of existing graph-processing systems and graph databases. This will be followed by hands-on experiences with popular representatives of such systems. Finally, we will provide a detailed description of algorithms used in these systems for fundamental problems like shortest paths and PageRank, how they are implemented, and how this affects the overall performance. We will also cover basic data structures such as distance oracles that can be built on these systems to efficiently answer distance queries for real-world graphs.
    Scopus© Citations 3  237
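As a small illustration of one of the fundamental algorithms named in the tutorial abstract above, PageRank can be computed by power iteration on an adjacency-list representation. This is a minimal single-machine sketch, not how any particular graph-processing system implements it; the damping factor of 0.85, the tolerance, and the toy graph are assumptions chosen for illustration.

```python
def pagerank(out_links, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on a dict mapping node -> list of successors.

    Every link target must also appear as a key of out_links.
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without out-links is spread uniformly.
        dangling = sum(rank[u] for u in nodes if not out_links[u])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {u: base for u in nodes}
        for u in nodes:
            if out_links[u]:
                share = damping * rank[u] / len(out_links[u])
                for v in out_links[u]:
                    new_rank[v] += share
        delta = sum(abs(new_rank[u] - rank[u]) for u in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Toy directed graph: a links to b and c, b links to c, c links back to a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

The adjacency-list dictionary is one of several possible data representations; swapping it for a compressed sparse row layout or a distributed vertex-centric formulation changes memory use and running time, which is the kind of trade-off the tutorial sets out to explain.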