  • Publication
    Dimensionality Reduction and Visualisation Tools for Voting Record
    (CEUR Workshop Proceedings, 2016-09-21)
    Recorded votes in legislative bodies are an important source of data for political scientists. Voting records can be used to describe parliamentary processes, identify ideological divides between members and reveal the strength of party cohesion. We explore the problem of working with vote data using popular dimensionality reduction techniques and cluster validation methods, as an alternative to more traditional scaling techniques. We present results of dimensionality reduction techniques applied to votes from the 6th and 7th European Parliaments, covering activity from 2004 to 2014.
  • Publication
    Score Normalization and Aggregation for Active Learning in Multi-label Classification
    (University College Dublin. School of Computer Science and Informatics, 2010-02)
    Active learning is useful in situations where labeled data is scarce, unlabeled data is available, and labeling a large number of examples is costly or impractical. These techniques help by identifying a minimal set of examples to label that will support the training of an effective classifier. Thus active learning is particularly relevant for the automation of annotation tasks in multimedia. In this paper we consider the problem of employing active learning for the assignment of multiple annotations or “tags” to images in personal image collections. This form of multi-label classification has received a lot of attention in recent years; however, active multi-label classification is still a new research area. The main challenge in active multi-label classification is the selection of unlabeled examples that will be informative for all tags under consideration. This selection task proves surprisingly difficult, primarily because of the paucity of labeled data available. In this paper we present some solutions to this problem based on aggregated rankings from classifiers for individual tags.
  • Publication
    ThemeCrowds: Multiresolution Summaries of Twitter Usage
    (University College Dublin. School of Computer Science and Informatics, 2011-06)
    Users of social media sites, such as Twitter, rapidly generate large volumes of text content on a daily basis. Visual summaries are needed to understand what groups of people are saying collectively in this unstructured text data. Users will typically discuss a wide variety of topics, where the number of authors talking about a specific topic can quickly grow or diminish over time, and what the collective is saying about the subject can shift as a situation develops. In this paper, we present a technique that summarises what collections of Twitter users are saying about certain topics over time. As the correct resolution for inspecting the data is unknown in advance, the users are clustered hierarchically over a fixed time interval based on the similarity of their posts. The visualisation technique takes this data structure as its input. Given a topic, it finds the correct resolution of users at each time interval and provides tags to summarise what the collective is discussing. The technique is tested on three microblogging corpora, consisting of up to tens of millions of tweets and over a million users. We provide some preliminary user feedback from a research group interested in the area of social media analysis, where this tool could be applied.
  • Publication
    Taking the pulse of the web : assessing sentiment on topics in online media
    The task of identifying sentiment trends in the popular media has long been of interest to analysts and pundits. Until recently, this task has required professional annotators to manually inspect individual articles in order to identify their polarity. With the increased availability of large volumes of online news content via syndicated feeds, researchers have begun to examine ways to automate aspects of this process. In this work, we describe a sentiment analysis system that uses crowdsourcing to gather non-expert annotations for economic news articles. By using these annotations in conjunction with a supervised machine learning strategy, we can generalize to label a much larger set of articles, allowing us to effectively track sentiment in different news sources over time.
  • Publication
    Distortion as a validation criterion in the identification of suspicious reviews
    (University College Dublin. School of Computer Science and Informatics, 2010-05-02)
    Assessing the trustworthiness of reviews is a key issue for the maintainers of opinion sites such as TripAdvisor. In this paper we propose a distortion criterion for assessing the impact of methods for uncovering suspicious hotel reviews in TripAdvisor. The principle is that dishonest reviews will distort the overall popularity ranking for a collection of hotels. Thus a mechanism that deletes dishonest reviews will distort the popularity ranking significantly, when compared with the removal of a similar set of reviews at random. This distortion can be quantified by comparing popularity rankings before and after deletion, using rank correlation. We present an evaluation of this strategy in the assessment of shill detection mechanisms on a dataset of hotel reviews collected from TripAdvisor.
  • Publication
    Adaptive Representations for Tracking Breaking News on Twitter
    Twitter is often the most up-to-date source for finding and tracking breaking news stories. Therefore, there is considerable interest in developing filters for tweet streams in order to track and summarize stories. This is a non-trivial text analytics task as tweets are short, and standard text similarity metrics often fail as stories evolve over time. In this paper we examine the effectiveness of adaptive text similarity mechanisms for tracking and summarizing breaking news stories. We evaluate the effectiveness of these mechanisms on a number of recent news events for which manually curated timelines are available. Assessments based on the ROUGE metric indicate that an adaptive similarity mechanism is best suited for tracking evolving stories on Twitter.
  • Publication
    Multi-View Clustering for Mining Heterogeneous Social Network Data
    (University College Dublin. School of Computer Science and Informatics, 2009-03)
    Uncovering community structure is a core challenge in social network analysis. This is a significant challenge for large networks where there is a single type of relation in the network (e.g. friend or knows). In practice there may be other types of relation, for instance demographic or geographic information, that also reveal network structure. Uncovering structure in such multi-relational networks presents a greater challenge due to the difficulty of integrating information from different, often discordant views. In this paper we describe a system for performing cluster analysis on heterogeneous multi-view data, and present an analysis of the research themes in a bibliographic literature network, based on the integration of both co-citation links and text similarity relationships between papers in the network.
  • Publication
    Time Series Analysis of VLE Activity Data
    Virtual Learning Environments (VLEs), such as Moodle, are purpose-built platforms in which teachers and students interact to exchange, review, and submit learning material and information. In this paper, we examine a complex VLE dataset from a large Irish university in an attempt to characterize student behavior with respect to deadlines and grades. We demonstrate that, by clustering activity profiles represented as time series using Dynamic Time Warping, we can uncover meaningful clusters of students exhibiting similar behaviors even in a sparsely-populated system. We use these clusters to identify distinct activity patterns among students, such as Procrastinators, Strugglers, and Experts. These patterns can provide us with an insight into the behavior of students, and ultimately help institutions to exploit deployed learning platforms so as to better structure their courses.
  • Publication
    Identifying representative textual sources in blog networks
    (University College Dublin. School of Computer Science and Informatics, 2011-02)
    We apply methods from social network analysis and visualization to facilitate a study of the Irish blogosphere from a cultural studies perspective. We focus on solving the practical issues that arise when the goal is to perform textual analysis of the corpus produced by a network of bloggers. Previous studies into blogging networks have noted difficulties arising when trying to identify the extent and boundaries of these networks. As a response to calls for increasingly data-led approaches in media and cultural studies, we discuss a variety of social network analysis methods that can be used to identify which blogs can be seen as members of a posited "Irish blogging network". We identify hub blogs, communities of sites corresponding to different topics, and representative bloggers within these communities. Based on this study, we propose a set of analysis guidelines for researchers who wish to map out blogging networks.
  • Publication
    Community Finding in Large Social Networks Through Problem Decomposition
    (University College Dublin. School of Computer Science and Informatics, 2008-08)
    The identification of cohesive communities is a key process in social network analysis. However, the algorithms that are effective for finding communities do not scale well to very large problems, as their time complexity is worse than linear in the number of edges in the graph. This is an important issue for those interested in applying social network analysis techniques to very large networks, such as networks of mobile phone subscribers. In this respect the contributions of this report are two-fold. First we demonstrate these scaling issues using a prominent community-finding algorithm as a case study. We then show that a two-stage process, whereby the network is first decomposed into manageable subnetworks using a multilevel graph partitioning procedure, is effective in finding communities in networks with more than 10^6 nodes.