  • Publication
    Towards Quantifying the Distance between Opinions
    Increasingly, critical decisions in public policy, governance, and business strategy rely on a deeper understanding of the needs and opinions of constituent members (e.g., citizens, shareholders). While it has become easier to collect a large number of opinions on a topic, there is a growing need for automated tools to help navigate the space of opinions. In such contexts, understanding and quantifying the similarity between opinions is key. We find that measures based solely on text similarity or on overall sentiment often fail to effectively capture the distance between opinions. Thus, we propose a new distance measure for capturing the similarity between opinions that leverages a nuanced observation: similar opinions express similar sentiment polarity on specific relevant entities-of-interest. Specifically, in an unsupervised setting, our distance measure achieves significantly better Adjusted Rand Index scores (up to 56x) and Silhouette coefficients (up to 21x) compared to existing approaches. Similarly, in a supervised setting, our opinion distance measure achieves considerably better accuracy (up to a 20% increase) compared to existing approaches that rely on text similarity, stance similarity, and sentiment similarity.
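    A minimal sketch of the core intuition, not the paper's exact measure: represent each opinion as a map from entity-of-interest to sentiment polarity (entity extraction and polarity scoring are assumed to happen upstream) and compare per-entity polarities.

```python
# Minimal sketch of the intuition behind the proposed measure: two opinions
# are close when they assign similar sentiment polarity to the same
# entities-of-interest. Entity extraction and polarity scoring are assumed
# given; this is NOT the paper's exact distance.

def opinion_distance(a: dict, b: dict) -> float:
    """Each opinion maps entity-of-interest -> polarity in [-1, 1].
    Entities mentioned by only one opinion count with polarity 0 there."""
    entities = set(a) | set(b)
    if not entities:
        return 0.0
    # Mean per-entity polarity gap, normalized to [0, 1].
    return sum(abs(a.get(e, 0.0) - b.get(e, 0.0)) for e in entities) / (2 * len(entities))

# Agreement on the entity -> small distance regardless of wording;
# opposite polarity on the same entity -> large distance.
print(opinion_distance({"tax hike": -0.8}, {"tax hike": -0.6}))  # ≈ 0.1
print(opinion_distance({"tax hike": -0.8}, {"tax hike": +0.7}))  # ≈ 0.75
```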
  • Publication
    Engineering a Topological Sorting Algorithm for Massive Graphs
    (Society for Industrial and Applied Mathematics, 2011-12-01)
    We present an I/O-efficient algorithm for topologically sorting directed acyclic graphs (DAGs). No provably I/O-efficient algorithm for this problem is known, and likewise the performance of our algorithm, which we call IterTS, may be poor in the worst case. However, our experiments show that IterTS achieves good performance in practice. The strategy of IterTS can be summarized as follows. We call an edge satisfied if its tail has a smaller number than its head. A numbering satisfying at least half the edges in the DAG is easy to find: a random numbering is expected to have this property. IterTS starts with such a numbering and then iteratively corrects it to satisfy more and more edges until all edges are satisfied. To evaluate IterTS, we compared its running time to those of three competitors: PeelTS, an I/O-efficient implementation of the standard strategy of iteratively removing sources and sinks; ReachTS, an I/O-efficient implementation of a recent parallel divide-and-conquer algorithm based on reachability queries; and SeTS, standard DFS-based topological sorting built on top of a semi-external DFS algorithm. In our evaluation on various types of input graphs, IterTS consistently outperformed PeelTS and ReachTS, by at least an order of magnitude in most cases. SeTS outperformed IterTS on most graphs whose vertex sets fit in memory. However, IterTS often came close to the running time of SeTS on these inputs and, more importantly, SeTS was not able to process graphs whose vertex sets exceeded the size of main memory, while IterTS processed such inputs efficiently.
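    For intuition, a tiny in-memory illustration of IterTS's starting point, with none of the paper's I/O-efficient machinery: an edge (u, v) is satisfied when num[u] < num[v], and a uniformly random numbering satisfies half the edges in expectation.

```python
# In-memory illustration of IterTS's starting point (none of the paper's
# I/O-efficient machinery): an edge (u, v) is "satisfied" if num[u] < num[v],
# and a uniformly random numbering satisfies half the edges in expectation.
import random

def random_numbering(n: int) -> list:
    num = list(range(n))
    random.shuffle(num)
    return num

def satisfied_fraction(edges, num) -> float:
    return sum(num[u] < num[v] for u, v in edges) / len(edges)

# Small DAG; averaging over many random numberings gives roughly 0.5,
# the guarantee IterTS then improves iteratively until all edges hold.
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4), (3, 5)]
trials = [satisfied_fraction(edges, random_numbering(6)) for _ in range(1000)]
print(sum(trials) / len(trials))  # ≈ 0.5
```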
  • Publication
    Conflict-free coloring for rectangle ranges using O(n^.382) colors
    Given a set of points $P \subseteq \mathbb{R}^2$, a conflict-free coloring of $P$ w.r.t. rectangle ranges is an assignment of colors to points of $P$ such that each non-empty axis-parallel rectangle $T$ in the plane contains a point whose color is distinct from that of all other points in $P \cap T$. This notion has been the subject of recent interest and is motivated by frequency assignment in wireless cellular networks: one naturally would like to minimize the number of frequencies (colors) assigned to base stations (points), such that within any range (for instance, a rectangle) there is no interference. We show that any set of $n$ points in $\mathbb{R}^2$ can be conflict-free colored with $\tilde{O}(n^{\beta+\epsilon})$ colors in expected polynomial time, for any arbitrarily small $\epsilon > 0$ and $\beta = \frac{3-\sqrt{5}}{2} < 0.382$. This improves upon the previously known bound of $O(\sqrt{n \log\log n / \log n})$.
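    To make the definition concrete, a brute-force validity checker for toy inputs (the function name is ours): it suffices to test rectangles whose sides pass through input coordinates, since those realize every distinct subset $P \cap T$.

```python
# Brute-force check that a coloring is conflict-free w.r.t. axis-parallel
# rectangles (toy inputs only). Rectangles whose sides pass through input
# coordinates realize every distinct subset P ∩ T, so testing them suffices.
from itertools import combinations_with_replacement

def is_conflict_free(points, color):
    xs = sorted({x for x, _ in points})
    ys = sorted({y for _, y in points})
    for x1, x2 in combinations_with_replacement(xs, 2):
        for y1, y2 in combinations_with_replacement(ys, 2):
            inside = [p for p in points if x1 <= p[0] <= x2 and y1 <= p[1] <= y2]
            if not inside:
                continue
            counts = {}
            for p in inside:
                counts[color[p]] = counts.get(color[p], 0) + 1
            if 1 not in counts.values():  # no uniquely colored point in P ∩ T
                return False
    return True

pts = [(0, 0), (1, 0), (2, 0)]
print(is_conflict_free(pts, {(0, 0): 1, (1, 0): 2, (2, 0): 1}))  # True
print(is_conflict_free(pts, {(0, 0): 1, (1, 0): 1, (2, 0): 1}))  # False
```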
  • Publication
    I/O-efficient approximation of graph diameters by parallel cluster growing - A first experimental study
    A fundamental step in the analysis of a massive graph is to compute its diameter. In the RAM model, the diameter of a connected undirected unweighted graph can be efficiently 2-approximated using a Breadth-First Search (BFS) traversal from an arbitrary node. However, if the graph is stored on disk, even an external-memory BFS traversal is prohibitive, owing to the large number of I/Os it incurs. Meyer (2008) proposed a parametrized algorithm to compute an approximation of the graph diameter with fewer I/Os than required for an exact BFS traversal of the graph. The approach is based on growing clusters around randomly chosen vertices 'in parallel' until their fringes meet. We present an implementation of this algorithm and compare it with some simple heuristics and external-memory BFS in order to determine the trade-off between the approximation ratio and the running time achievable in practice. Our experiments show that, with carefully chosen parameters, the new approach is indeed capable of producing surprisingly good diameter approximations in shorter time. We also confirm experimentally that there are graph classes where the parametrized approach runs into bad approximation ratios, just as the theoretical analysis in (Meyer, 2008) suggests.
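    For reference, a minimal sketch of the RAM-model baseline mentioned above: one BFS from an arbitrary node gives an eccentricity ecc with ecc ≤ diameter ≤ 2·ecc.

```python
# The RAM-model baseline referenced above: a single BFS from an arbitrary
# node of a connected, undirected, unweighted graph yields an eccentricity
# ecc with ecc <= diameter <= 2*ecc, i.e., a 2-approximation.
from collections import deque

def eccentricity(adj, src) -> int:
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return max(dist.values())

# Path 0-1-2-3-4 has diameter 4; BFS from the middle node reports ecc = 2,
# so [ecc, 2*ecc] = [2, 4] brackets the true diameter.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(eccentricity(path, 2))  # 2
```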
  • Publication
    Automated Knowledge Hierarchy Assessment
    (CEUR Workshop Proceedings, 2018-07-12)
    Automated construction of knowledge hierarchies is gaining increasing attention as a way to tackle the infeasibility of manually extracting and semantically linking millions of concepts. As knowledge hierarchies evolve, there is a need for measures to assess their temporal evolution, quantifying the similarities between different versions and identifying the relative growth of different subgraphs in the knowledge hierarchy. This work proposes a principled and scalable similarity measure, based on the Katz similarity between concept nodes, for comparing knowledge hierarchies, modeled as generic Directed Acyclic Graphs (DAGs).
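    A sketch of the Katz index the measure builds on, in its generic textbook form rather than the paper's full hierarchy-comparison pipeline:

```python
# The Katz index underlying the measure, in its generic form (not the
# paper's full pipeline): K = sum_{l>=1} beta^l A^l = (I - beta*A)^{-1} - I,
# which scores each pair of concept nodes by a decayed count of all
# directed paths between them; on a DAG the series is finite.
import numpy as np

def katz_matrix(A: np.ndarray, beta: float = 0.1) -> np.ndarray:
    n = A.shape[0]
    return np.linalg.inv(np.eye(n) - beta * A) - np.eye(n)

# Tiny 3-concept DAG: 0 -> 1 -> 2 and 0 -> 2.
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [0, 0, 0]], dtype=float)
print(katz_matrix(A)[0, 2])  # beta + beta^2 = 0.11: direct edge + 2-hop path
```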
  • Publication
    A Geometric Distance Oracle for Large Real-World Graphs
    Many graph processing algorithms require determination of shortest-path distances between arbitrary pairs of nodes. Since computation of exact distances between all node pairs of a large graph, e.g., 10M nodes and up, is prohibitively expensive both in computational time and storage space, distance approximation is often used in place of exact computation. In this paper, we present a novel and scalable distance oracle that leverages the hyperbolic core of real-world large graphs for fast and scalable distance approximation. We show empirically that the proposed oracle significantly outperforms prior oracles on a random set of test cases drawn from public-domain graph libraries. There are two sets of prior work against which we benchmark our approach. The first set, which often outperforms other oracles, employs embedding of the graph into low-dimensional Euclidean spaces with carefully constructed hyperbolic distances, but provides no guarantees on the distance estimation error. The second set leverages Gromov-type tree contraction of the graph, with the additive error guaranteed not to exceed $2\delta\log{n}$, where $\delta$ is the hyperbolic constant of the graph. We show that our proposed oracle 1) is significantly faster than the oracles that use hyperbolic embedding (first set) with similar approximation error and, perhaps surprisingly, 2) exhibits substantially lower average estimation error compared to Gromov-like tree contractions (second set). We substantiate our claims through numerical computations on a collection of a dozen real-world networks and synthetic test cases from multiple domains, ranging in size from tens of thousands to tens of millions of nodes.
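    A toy illustration of the tree-oracle flavor of the second set of prior work, emphatically not the paper's construction: precompute one BFS tree, then answer queries from tree paths; for graphs with small hyperbolicity $\delta$, tree approximations of this kind carry additive error on the order of $\delta \log n$.

```python
# Toy illustration of the tree-based oracle flavor (NOT the paper's
# construction): precompute one BFS tree, answer distance queries from
# tree paths in time proportional to the tree depth.
from collections import deque

def bfs_tree(adj, root):
    parent, depth = {root: root}, {root: 0}
    q = deque([root])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in parent:
                parent[w], depth[w] = u, depth[u] + 1
                q.append(w)
    return parent, depth

def tree_distance(parent, depth, u, v):
    du, dv, a, b = depth[u], depth[v], u, v
    while a != b:  # climb to the lowest common ancestor
        if depth[a] >= depth[b]:
            a = parent[a]
        else:
            b = parent[b]
    return du + dv - 2 * depth[a]

# 4-cycle: the tree estimate for the pair cut by the missing tree edge
# overshoots (3 vs. true distance 1), illustrating the additive error.
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
parent, depth = bfs_tree(cycle, 0)
print(tree_distance(parent, depth, 2, 3))  # 3 (graph distance is 1)
```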
  • Publication
    Average-case analysis of incremental topological ordering
    (Elsevier, 2010-02-28)
    Many applications, such as pointer analysis and incremental compilation, require maintaining a topological ordering of the nodes of a directed acyclic graph (DAG) under dynamic updates. All known algorithms for this problem are either only analyzed for worst-case insertion sequences or only evaluated experimentally on random DAGs. We present the first average-case analysis of incremental topological ordering algorithms. We prove an expected runtime of $O(n^2\,\mathrm{polylog}(n))$ under insertion of the edges of a complete DAG in a random order for the algorithms of Alpern et al. (1990) [4], Katriel and Bodlaender (2006) [18], and Pearce and Kelly.
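    A compact in-memory sketch in the spirit of the Pearce-Kelly algorithm, one of the algorithms analyzed (a simplified illustration of ours, not their implementation): when an inserted edge (u, v) violates the current order, only the affected region between the positions of v and u is repaired.

```python
# Simplified sketch in the spirit of Pearce-Kelly incremental topological
# ordering (an illustration, not the analyzed implementations): repair only
# the "affected region" between the positions of v and u.

class IncrementalTopo:
    def __init__(self, n):
        self.succ = [[] for _ in range(n)]
        self.pred = [[] for _ in range(n)]
        self.ord = list(range(n))  # node -> position in the current order

    def insert(self, u, v):
        self.succ[u].append(v)
        self.pred[v].append(u)
        lo, hi = self.ord[v], self.ord[u]
        if hi < lo:
            return  # ord[u] < ord[v] already holds
        fwd = self._reach(v, self.succ, lambda w: self.ord[w] <= hi)
        if u in fwd:
            raise ValueError("insertion closes a cycle")  # simplified check
        bwd = self._reach(u, self.pred, lambda w: self.ord[w] >= lo)
        slots = sorted(self.ord[w] for w in fwd | bwd)
        nodes = (sorted(bwd, key=self.ord.__getitem__)
                 + sorted(fwd, key=self.ord.__getitem__))
        for w, pos in zip(nodes, slots):  # reuse the same set of positions
            self.ord[w] = pos

    def _reach(self, s, nbrs, keep):
        seen, stack = {s}, [s]
        while stack:
            for w in nbrs[stack.pop()]:
                if w not in seen and keep(w):
                    seen.add(w)
                    stack.append(w)
        return seen

t = IncrementalTopo(3)
t.insert(2, 0)
t.insert(0, 1)
print(t.ord)  # [1, 2, 0]: positions giving the valid order 2, 0, 1
```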
  • Publication
    I/O-Optimal Distribution Sweeping on Private-Cache Chip Multiprocessors
    The parallel external memory (PEM) model has been used as a basis for the design and analysis of a wide range of algorithms for private-cache multi-core architectures. As a tool for developing geometric algorithms in this model, a parallel version of the I/O-efficient distribution sweeping framework was introduced recently, and a number of algorithms for problems on axis-aligned objects were obtained using this framework. The obtained algorithms were efficient but not optimal. In this paper, we improve the framework to obtain algorithms with the optimal I/O complexity of $O(\mathrm{sort}_P(N) + K/(PB))$ for a number of problems on axis-aligned objects, where $P$ denotes the number of cores/processors, $B$ denotes the number of elements that fit in a cache line, $N$ and $K$ denote the sizes of the input and output, respectively, and $\mathrm{sort}_P(N)$ denotes the I/O complexity of sorting $N$ items using $P$ processors in the PEM model. To obtain the above improvement, we present a new one-dimensional batched range counting algorithm on a sorted list of ranges and points that achieves an I/O complexity of $O((N + K)/(PB))$, where $K$ is the sum of the counts of all the ranges. The key to achieving efficient load balancing among the processors in this algorithm is a new method to count the output without enumerating it, which might be of independent interest.
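    A sketch of the sequential analogue of the counting idea (the paper's contribution is the parallel, I/O-optimal version): return the counts via binary search, without ever materializing the $K$ range-point incidences.

```python
# Sequential analogue of the new building block: batched 1-D range counting
# over a sorted list of points, returning counts without enumerating the K
# range-point incidences. (The paper's contribution is doing this
# I/O-optimally across P processors; this only shows count-don't-enumerate.)
import bisect

def batched_range_counts(points, ranges):
    pts = sorted(points)
    return [bisect.bisect_right(pts, hi) - bisect.bisect_left(pts, lo)
            for lo, hi in ranges]

print(batched_range_counts([1, 3, 5, 7, 9], [(2, 8), (0, 10), (4, 4)]))
# [3, 5, 0]: each count via two binary searches, no incidence list built
```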
  • Publication
    Learning fine-grained search space pruning and heuristics for combinatorial optimization
    Combinatorial optimization problems arise naturally in a wide range of applications from diverse domains. Many of these problems are NP-hard, and designing efficient heuristics for them requires considerable time, effort, and experimentation. On the other hand, the number of optimization problems in industry continues to grow. In recent years, machine learning techniques have been explored to address this gap. In this paper, we propose a novel framework for leveraging machine learning techniques to scale up exact combinatorial optimization algorithms. In contrast to existing approaches based on deep learning, reinforcement learning, and restricted Boltzmann machines that attempt to directly learn the output of the optimization problem from its input (with limited success), our framework learns the relatively simpler task of pruning elements in order to reduce the size of the problem instances. In addition, our framework uses only interpretable learning models based on intuitive local features, so the learning process provides deeper insights into the optimization problem and the instance class, which can be used for designing better heuristics. For the classical maximum clique enumeration problem, we show that our framework can prune a large fraction of the input graph (around 99% of nodes in the case of sparse graphs) and still detect almost all of the maximum cliques. Overall, this results in severalfold speedups of state-of-the-art algorithms. Furthermore, the classification model used in our framework highlights that the chi-squared value of neighborhood degree has a statistically significant correlation with the presence of a node in a maximum clique, particularly in dense graphs, which constitute a significant challenge for modern solvers. We leverage this insight to design a novel heuristic, which we call ALTHEA, for the maximum clique detection problem, outperforming the state-of-the-art for dense graphs.
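    A sketch of the prune-then-solve pipeline with an illustrative chi-squared-style local feature; the paper's exact feature definition and learned classifier are not reproduced here, and the scoring function below is our own stand-in.

```python
# Sketch of the prune-then-solve idea with an illustrative local feature:
# score each node by a chi-squared-style statistic of its neighbors' degrees
# against the mean degree, drop low-scoring nodes, then run an exact clique
# solver on the remainder. The scoring function is a stand-in, not the
# paper's learned model.
import networkx as nx

def chi2_score(G: nx.Graph, v) -> float:
    mean_deg = 2 * G.number_of_edges() / G.number_of_nodes()
    return sum((G.degree(u) - mean_deg) ** 2 / mean_deg for u in G[v])

def prune(G: nx.Graph, keep_fraction: float = 0.2) -> nx.Graph:
    scored = sorted(G.nodes, key=lambda v: chi2_score(G, v), reverse=True)
    return G.subgraph(scored[: max(1, int(keep_fraction * len(scored)))])

# Planted 5-clique in sparse noise: the clique tends to survive pruning,
# so the exact solver runs on a much smaller instance.
G = nx.gnp_random_graph(200, 0.02, seed=1)
G.add_edges_from((i, j) for i in range(5) for j in range(i + 1, 5))
H = prune(G)
print(len(max(nx.find_cliques(H), key=len)))  # 5 if the clique survives
```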
  • Publication
    A computational study of external-memory BFS algorithms
    Breadth-First Search (BFS) traversal is an archetype for many important graph problems. However, computing a BFS level decomposition for massive graphs was so far considered nonviable because of the large number of I/Os it incurs. This paper presents the first experimental evaluation of recent external-memory BFS algorithms for general graphs. With our STXXL-based implementations exploiting pipelining and disk parallelism, we were able to compute the BFS level decomposition of a web-crawl-based graph of around 130 million nodes and 1.4 billion edges in less than 4 hours using a single disk and 2.3 hours using four disks. We demonstrate that some rather simple external-memory algorithms perform significantly better (minutes as compared to hours) than internal-memory BFS, even if more than half of the input resides internally.
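    For reference, a minimal in-memory sketch of the quantity being computed: the BFS level decomposition assigns each node its hop distance from a source, partitioning the graph into levels. This is exactly the computation that becomes I/O-bound at billions of edges.

```python
# The quantity the evaluated algorithms compute, in its simple in-memory
# form: partition the nodes into BFS levels L_0, L_1, ... by hop distance
# from a source. At billions of edges this loop becomes I/O-bound,
# motivating the external-memory algorithms.
from collections import deque

def bfs_levels(adj, src):
    dist = {src: 0}
    levels = [[src]]
    q = deque([src])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                if dist[w] == len(levels):
                    levels.append([])
                levels[dist[w]].append(w)
                q.append(w)
    return levels

adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
print(bfs_levels(adj, 0))  # [[0], [1, 2], [3]]
```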