Now showing 1 - 10 of 35
  • Publication
    Improved external memory BFS implementations
    (Society for Industrial and Applied Mathematics Philadelphia, 2007-01-06) ; ;
    Breadth first search (BPS) traversal on massive graphs in external memory was considered non-viable until recently, because of the large number of I/Os it incurs. Ajwani et al. [3] showed that the randomized variant of the o(n) I/O algorithm of Mehlhorn and Meyer [24] (MM.BFS) can compute the BPS level decomposition for large graphs (around a billion edges) in a few hours for small diameter graphs and a few days for large diameter graphs. We improve upon their implementation of this algorithm by reducing the overhead associated with each BPS level, thereby improving the results for large diameter graphs which are more difficult for BPS traversal in external memory. Also, we present the implementation of the deterministic variant of MM.BFS and show that in most cases, it outperforms the randomized variant. The running time for BPS traversal is further improved with a heuristic that preserves the worst case guarantees of MM_BFS, Together, they reduce the time for BFS on large diameter graphs from days shown in [3] to hours. In particular, on line graphs with random layout on disks, our implementation of the deterministic variant of MM_BFS with the proposed heuristic is more than 75 times faster than the previous best result for the randomized variant of MM.BFS in [3].
  • Publication
    Breadth first search on massive graphs
    (American Mathematical Society, 2009) ; ; ;
    We consider the problem of Breadth First Search (BFS) traversal on massive sparse undirected graphs. Despite the existence of simple linear time algorithms in the RAM model, it was considered non-viable for massive graphs because of the I/O cost it incurs. Munagala and Ranade [29] and later Mehlhorn and Meyer [27] gave efficient algorithms (refered to as MR BFS and MM BFS, respectively) for computing BFS level decompositions in an external memory model. Ajwani et al. [3] implemented MR BFS and the randomized variant of MM BFS using the external memory library STXXL and gave a comparative study of the two algorithms on various graph classes. In this paper, we review and extend that result demonstrating the effectiveness and viability of the BFS implementations on various other synthetic and real world benchmarks. Furthermore, we present the implementation of the deterministic variant of MM BFS and show that in most cases, it outperforms the randomized variant.
  • Publication
    Automated Knowledge Hierarchy Assessment
    (CEUR Workshop Proceedings, 2018-07-12) ; ; ;
    Automated construction of knowledge hierarchies is gaining increasing attention to tackle the infeasibility of manually extracting and semantically linking millions of concepts. With the evolution of knowledge hierarchies, there is a need for measures to assess its temporal evolution, quantifying the similarities between different versions and identifying the relative growth of different subgraphs in the knowledge hierarchy. This work proposes a principled and scalable similarity measure, based on Katz similarity between concept nodes, for comparing knowledge hierarchies, modeled as generic Directed Acyclic Graphs (DAGs).
  • Publication
    A Geometric Distance Oracle for Large Real-World Graphs
    Many graph processing algorithms require determination of shortest-path distances between arbitrary numbers of node pairs. Since computation of exact distances between all node-pairs of a large graph, e.g., 10M nodes and up, is prohibitively expensive both in computational time and storage space, distance approximation is often used in place of exact computation. In this paper, we present a novel and scalable distance oracle that leverages the hyperbolic core of real-world large graphs for fast and scalable distance approximation. We show empirically that the proposed oracle significantly outperforms prior oracles on a random set of test cases drawn from public domain graph libraries. There are two sets of prior work against which we benchmark our approach. The first set, which often outperforms other oracles, employs embedding of the graph into low dimensional Euclidean spaces with carefully constructed hyperbolic distances, but provides no guarantees on the distance estimation error. The second set leverages Gromov-type tree contraction of the graph with the additive error guaranteed not to exceed $2\delta\log{n}$, where $\delta$ is the hyperbolic constant of the graph. We show that our proposed oracle 1) is significantly faster than those oracles that use hyperbolic embedding (first set) with similar approximation error and, perhaps surprisingly, 2) exhibits substantially lower average estimation error compared to Gromov-like tree contractions (second set). We substantiate our claims through numerical computations on a collection of a dozen real world networks and synthetic test cases from multiple domains, ranging in size from 10s of thousand to 10s of millions of nodes.
  • Publication
    A Network Configuration Algorithm Based on Optimization of Kirchhoff Index
    Traditionally, a parallel application is partitioned, mapped and then routed on a network of compute nodes where the topology of the interconnection network is fixed and known beforehand. Such a topology often comes with redundant links to accommodate the communication patterns of a wide range of applications. With recent advances in technology for optical circuit switches, it is now possible to construct a network with much fewer links, and to make the link endpoints configurable to suit the communication pattern of a given application. While this is economical (saving both links and the power to run them), it raises the difficult problem of how to configure the network and how to reconfigure it quickly when the application's communication pattern changes. In this paper, we propose the Kirchhoff index (KI) of a certain weighted graph related to the interconnection network as a proxy for its communication throughput. Our usage of this metric is based on a theoretical analogy between resistances in an electrical network and communication loads in the interconnection network. We show how mathematical techniques for reducing KI can be used to configure a network in a dramatically shorter time as compared to the current state-of-the-art scheme.
      350Scopus© Citations 5
  • Publication
    Enriching Taxonomies With Functional Domain Knowledge
    The rising need to harvest domain specific knowledge in several applications is largely limited by the ability to dynamically grow structured knowledge representations, due to the increasing emergence of new concepts and their semantic relationships with existing ones. Such enrichment of existing hierarchical knowledge sources with new information to better model the "changing world" presents two-fold challenges: (1) Detection of previously unknown entities or concepts, and (2) Insertion of the new concepts into the knowledge structure, respecting the semantic integrity of the created relationships. To this end we propose a novel framework, ETF, to enrich large-scale, generic taxonomies with new concepts from resources such as news and research publications. Our approach learns a high-dimensional embedding for the existing concepts of the taxonomy, as well as for the new concepts. During the insertion of a new concept, this embedding is used to identify semantically similar neighborhoods within the existing taxonomy. The potential parent-child relationships linking the new concepts to the existing ones are then predicted using a set of semantic and graph features. Extensive evaluation of ETF on large, real-world taxonomies of Wikipedia and WordNet showcase more than 5% F1-score improvements compared to state-of-the-art baselines. We further demonstrate that ETF can accurately categorize newly emerging concepts and question-answer pairs across different domains.
      643Scopus© Citations 27
  • Publication
    Processing Large Graphs: Representations, Storage, Systems and Algorithms
    Analyzing and processing large graphs is of fundamental importance for an ever-growing number of applications. Significant advancements in the last few years at both, systems and algorithmic side, let graph processing become increasingly scalable and efficient. Often, these advances are still not well-known and well-understood outside the systems and algorithms communities. In particular, there is very little understanding of the various trade-offs involved in the usage of particular combinations of algorithms, data structures, and systems. This tutorial will have a particular focus on this aspect, imparting theoretical knowledge intertwined with hands-on experience. Since there is no clearly winning system/algorithm combination that performs best on all the different metrics, it is of utmost importance to understand the pros and cons of the various alternatives. The tutorial will enable application developers in industry and academics, students as well as researchers to make corresponding decisions in an informed way. The participants do neither require any particular a-priori knowledge apart from a basic understanding of core computer science concepts, nor any special equipment apart from their laptop. After a general introduction, we will describe the critical dimensions that need to be tackled together to effectively and efficiently overcome problems in large graph processing: data representation, data storage, acceleration via multi-core programming, and horizontally scalable graph-processing infrastructures. Thereafter, we will provide an overview of existing graph-processing systems and graph databases. This will be followed by hands-on experiences with popular representatives of such systems. Finally, we will provide a detailed description of algorithms used in these systems for fundamental problems like shortest paths and Pagerank, how they are implemented, and how this affects the overall performance. We will also cover basic data structures such as distance oracles that can be built on these systems to efficiently answer distance queries for real-world graphs.
      182Scopus© Citations 3
  • Publication
    Automated assessment of knowledge hierarchy evolution: comparing directed acyclic graphs
    Automated construction of knowledge hierarchies from huge data corpora is gaining increasing attention in recent years, in order to tackle the infeasibility of manually extracting and semantically linking millions of concepts. As a knowledge hierarchy evolves with these automated techniques, there is a need for measures to assess its temporal evolution, quantifying the similarities between different versions and identifying the relative growth of different subgraphs in the knowledge hierarchy. In this paper, we focus on measures that leverage structural properties of the knowledge hierarchy graph to assess the temporal changes. We propose a principled and scalable similarity measure, based on Katz similarity between concept nodes, for comparing different versions of a knowledge hierarchy, modeled as a generic directed acyclic graph. We present theoretical analysis to depict that the proposed measure accurately captures the salient properties of taxonomic hierarchies, assesses changes in the ordering of nodes, along with the logical subsumption of relationships among concepts. We also present a linear time variant of the measure, and show that our measures, unlike previous approaches, are tunable to cater to diverse application needs. We further show that our measure provides interpretability, thereby identifying the key structural and logical difference in the hierarchies. Experiments on a real DBpedia and biological knowledge hierarchy showcase that our measures accurately capture structural similarity, while providing enhanced scalability and tunability. Also, we demonstrate that the temporal evolution of different subgraphs in this knowledge hierarchy, as captured purely by our structural measure, corresponds well with the known disruptions in the related subject areas.
      507Scopus© Citations 5
  • Publication
    ANNOTATE: orgANizing uNstructured cOntenTs viA Topic labEls
    With the advent of Big Data paradigm, filtering, retrieval, and linking of unstructured multi-modal data has become a necessity. Assigning topic labels to contents, that accurately capture the meaning and contextual information, is a fundamental problem in organizing unstructured data. The usage of manually-assigned tags for this purpose introduces inconsistencies because of different »surface forms». On the other hand, existing automated approaches either use hierarchical multi-label classification, or are unsupervised and rely on (undirected) graph measures leveraging taxonomies. While the former requires large training data set to learn the characteristics of each topic class, the latter lacks the flexibility to learn broad range of related topics and are less accurate. We propose a novel framework, ANNOTATE based on a small set of features and directed traversal of taxonomies to learn a broad spectrum of related topics using limited training data. We also show that our approach provides accurate labels for several domains without the need for re-training. For instance, the framework, trained on a small set of BBC news articles, exhibits close matches to user-generated tags for Quora documents. Experimental results, on the same model, for news classification and identifying aspects of Amazon product reviews, based on Amazon Mechanical Turk evaluation show our approach to be significantly better than state-of-the-art. We further present real-life case studies of our proposed framework for automatically tagging Quora posts, and topically segmenting, indexing and linking related YouTube videos (using our publicly available Chrome browser extension).
  • Publication
    Geometric Algorithms for Private-Cache Chip Multiprocessors
    We study techniques for obtaining efficient algorithms for geometric problems on private-cache chip multiprocessors.
      257Scopus© Citations 11