  • Publication
    Constructing Subsumption Hierarchies of Web Queries
    In this work, we present an approach for automatically identifying subsumption relations between web queries, a difficult (due to feature sparseness and ambiguity), but extremely useful task for many applications, ranging from user profiling and semantic enhancement of query logs, to traffic minimisation in distributed search environments (e.g., federations of digital libraries or cloud-based systems). We start by matching each query to the topics of a comprehensive web directory, and use these topics to apply query expansion in an iterative fashion. Subsequently, all expanded queries are mapped onto the DMOZ hierarchy, and the resulting subsumption relations are directly inferred from the directory structure once conflicts in the hierarchy are resolved. We evaluate our technique on real-world queries, and show that our approach is effective under all settings.
  • Publication
    Greenwatch-shing: Using AI to Detect Greenwashing
    (Institute of Certified Public Accountants in Ireland, 2020-06)
    The rise of fake news on the Internet has shown that society has few defence mechanisms to cope with misinformation as well as limited abilities to regulate its attention towards inaccurate or sometimes outright false claims, no matter the topic of such content.
  • Publication
    Hashtagger+: Efficient High-Coverage Social Tagging of Streaming News
    News and social media now play a synergistic role and neither domain can be grasped in isolation. On one hand, platforms such as Twitter have taken a central role in the dissemination and consumption of news. On the other hand, news editors rely on social media for following their audience's attention and for crowd-sourcing news stories. Twitter hashtags function as a key connection between Twitter crowds and the news media, by naturally naming and contextualizing stories, grouping the discussion of news and marking topic trends. In this work we propose Hashtagger+, an efficient learning-to-rank framework for merging news and social streams in real-time, by recommending Twitter hashtags to news articles. We provide an extensive study of different approaches for streaming hashtag recommendation, and show that pointwise learning-to-rank is more effective than multi-class classification as well as more complex learning-to-rank approaches. We improve the efficiency and coverage of a state-of-the-art hashtag recommendation model by proposing new techniques for data collection and feature computation. In our comprehensive evaluation on real data we show that we drastically outperform the accuracy and efficiency of prior methods. Our prototype system delivers recommendations in under 1 minute, with a Precision@1 of 94% and article coverage of 80%. This is an order of magnitude faster than prior approaches, and brings improvements of 5% in precision and 20% in coverage. By effectively linking the news stream to the social stream via the recommended hashtags, we open the door to solving many challenging problems related to story detection and tracking. To showcase this potential, we present an application of our recommendations to automated news story tracking via social tags. Our recommendation framework is implemented in a real-time Web system available from
  • Publication
    Large Scale Identification and Categorization of Protein Sequences Using Structured Logistic Regression
    (Public Library of Science, 2014-01-20)
    Background: Structured Logistic Regression (SLR) is a newly developed machine learning tool first proposed in the context of text categorization. Current availability of extensive protein sequence databases calls for an automated method to reliably classify sequences and SLR seems well-suited for this task. The classification of P-type ATPases, a large family of ATP-driven membrane pumps transporting essential cations, was selected as a test-case that would generate important biological information as well as provide a proof-of-concept for the application of SLR to a large scale bioinformatics problem. Results: Using SLR, we have built classifiers to identify and automatically categorize P-type ATPases into one of 11 pre-defined classes. The SLR-classifiers are compared to a Hidden Markov Model approach and shown to be highly accurate and scalable. Representing the bulk of currently known sequences, we analysed 9.3 million sequences in the UniProtKB and attempted to classify a large number of P-type ATPases. To examine the distribution of pumps on organisms, we also applied SLR to 1,123 complete genomes from the Entrez genome database. Finally, we analysed the predicted membrane topology of the identified P-type ATPases. Conclusions: Using the SLR-based classification tool we are able to run a large scale study of P-type ATPases. This study provides proof-of-concept for the application of SLR to a bioinformatics problem and the analysis of P-type ATPases pinpoints new and interesting targets for further biochemical characterization and structural analysis.
  • Publication
    Topy: Real-time Story Tracking via Social Tags
    The Topy system automates real-time story tracking by utilizing crowd-sourced tagging on social media platforms. Topy employs a state-of-the-art Twitter hashtag recommender to continuously annotate news articles with hashtags, a rich meta-data source that allows connecting articles under drastically different timelines than typical keyword-based story tracking systems. Employing social tags for story tracking has the following advantages: (1) social annotation of news enables the detection of emerging concepts and topic drift in a story; (2) hashtags go beyond topics by grouping articles based on connected themes (e.g., #rip, #blacklivesmatter, #icantbreath); (3) hashtags link articles that focus on subplots of the same story (e.g., #palmyra, #isis, #refugeecrisis).
  • Publication
    SocialTree: Socially Augmented Structured Summaries of News Stories
    News story understanding entails having an effective summary of a related group of articles that may span different time ranges, involve different topics and entities, and have connections to other stories. In this work, we present an approach to efficiently extract structured summaries of news stories by augmenting news media with the structure of social discourse as reflected in social media in the form of social tags. Existing event detection, topic-modeling, clustering and summarization methods yield news story summaries based only on noun phrases and named entities. These representations are sensitive to the article wording and the keyword extraction algorithm. Moreover, keyword-based representations are rarely helpful for highlighting the inter-story connections or for reflecting the inner structure of the news story because of high word ambiguity and clutter from the large variety of keywords describing news stories. Our method combines the news and social media domains to create structured summaries of news stories in the form of hierarchies of keywords and social tags, named SocialTree. We show that the properties of social tags can be exploited to augment the construction of hierarchical summaries of news stories and to alleviate the weaknesses of existing keyword-based representations. In our quantitative and qualitative evaluation the proposed method strongly outperforms the state-of-the-art with regard to both coverage and informativeness of the summaries.
  • Publication
    Be In The Know: Connecting News Articles to Relevant Twitter Conversations
    In this paper we propose a framework for tracking and automatically connecting news articles to Twitter conversations as captured by Twitter hashtags. For example, such a system could alert journalists about news that gets a lot of Twitter reaction, so they can investigate those conversations for new developments in the story, promote their article to a set of interested consumers, or discover general sentiment towards the story. Mapping articles to hashtags is nevertheless challenging, due to the different language styles of articles versus tweets, the streaming aspect, and user behavior when marking tweet terms as hashtags. We track the Irish Times RSS feed and a focused Twitter stream over a two-month period, and present a system that assigns hashtags to each article, based on its Twitter echo. We propose a machine learning approach for classifying article-hashtag pairs. Our empirical study shows that our system delivers high precision for this task.
  • Publication
    Background Knowledge Injection for Interpretable Sequence Classification
    Sequence classification is the supervised learning task of building models that predict class labels of unseen sequences of symbols. Although accuracy is paramount, in certain scenarios interpretability is a must. Unfortunately, such a trade-off is often hard to achieve since we lack human-independent interpretability metrics. We introduce a novel sequence learning algorithm that combines (i) linear classifiers, which are known to strike a good balance between predictive power and interpretability, and (ii) background knowledge embeddings. We extend the classic subsequence feature space with groups of symbols which are generated by background knowledge injected via word or graph embeddings, and use this new feature space to learn a linear classifier. We also present a new measure to evaluate the interpretability of a set of symbolic features based on the symbol embeddings. Experiments on human activity recognition from wearables and amino acid sequence classification show that our classification approach preserves predictive power, while delivering more interpretable models.
  • Publication
    A Large-Scale Multi-Document Summarization Dataset from the Wikipedia Current Events Portal
    Multi-document summarization (MDS) aims to compress the content in large document collections into short summaries and has important applications in story clustering for newsfeeds, presentation of search results, and timeline generation. However, there is a lack of datasets that realistically address such use cases at a scale large enough for training supervised models for this task. This work presents a new dataset for MDS that is large both in the total number of document clusters and in the size of individual clusters. We build this dataset by leveraging the Wikipedia Current Events Portal (WCEP), which provides concise and neutral human-written summaries of news events, with links to external source articles. We also automatically extend these source articles by looking for related articles in the Common Crawl archive. We provide a quantitative analysis of the dataset and empirical results for several state-of-the-art MDS techniques.
  • Publication
    Event Detection in Twitter using Aggressive Filtering and Hierarchical Tweet Clustering
    Twitter has become as much a news medium as a social network, and much research has turned to analysing its content for tracking real-world events, from politics to sports and natural disasters. This paper describes the techniques we employed for the SNOW Data Challenge 2014, described in [16]. We show that aggressive filtering of tweets based on length and structure, combined with hierarchical clustering of tweets and ranking of the resulting clusters, achieves encouraging results. We present empirical results and discussion for two different Twitter streams, focusing on the US presidential elections in 2012 and on the events around Ukraine, Syria and Bitcoin in February 2014.