  • Publication
    Real time News Story Detection and Tracking with Hashtags
    Topic Detection and Tracking (TDT) is an important research topic in data mining and information retrieval and has been explored for many years. Most studies have approached the problem from the event-tracking point of view. We argue that defining stories as events does not reflect the full picture. In this work we propose a story tracking method built on crowd-tagging in social media, where news articles are labeled with hashtags in real time. The social tags act as rich metadata for news articles, with the advantage that, if carefully employed, they can capture emerging concepts and address concept drift in a story. We present an approach for employing social tags for story detection and tracking and show initial empirical results. We compare our method to classic keyword query retrieval and discuss an example of story tracking over time.
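    As a concrete illustration of the idea described above, the following sketch (not the authors' implementation; the article schema, threshold and data are invented for illustration) tracks a story through crowd-assigned hashtags and absorbs frequently co-occurring new tags, so the tracked tag set can follow concept drift:

      from collections import Counter

      def track_story(articles, seed_tags, absorb_threshold=3):
          """articles: time-ordered dicts with 'title' and 'hashtags' (hypothetical schema).
          Returns the matched articles and the evolved story tag set."""
          story_tags = set(seed_tags)
          cooccurrence = Counter()              # tags seen alongside the story tags
          matched = []
          for article in articles:
              tags = set(article["hashtags"])
              if tags & story_tags:             # article shares at least one story tag
                  matched.append(article)
                  for tag in tags - story_tags:
                      cooccurrence[tag] += 1
                      if cooccurrence[tag] >= absorb_threshold:
                          story_tags.add(tag)   # emerging tag becomes part of the story
          return matched, story_tags

      stream = [
          {"title": "Flooding hits the coast", "hashtags": ["#storm", "#flooding"]},
          {"title": "Storm damage assessed",   "hashtags": ["#storm", "#cleanup"]},
          {"title": "Unrelated sports news",   "hashtags": ["#football"]},
      ]
      matched, tags = track_story(stream, seed_tags={"#storm"}, absorb_threshold=1)
      print([a["title"] for a in matched], tags)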
  • Publication
    Background Knowledge Injection for Interpretable Sequence Classification
    Sequence classification is the supervised learning task of building models that predict class labels of unseen sequences of symbols. Although accuracy is paramount, in certain scenarios interpretability is a must. Unfortunately, such a trade-off is often hard to achieve, since we lack human-independent interpretability metrics. We introduce a novel sequence learning algorithm that combines (i) linear classifiers, which are known to strike a good balance between predictive power and interpretability, and (ii) background knowledge embeddings. We extend the classic subsequence feature space with groups of symbols which are generated by background knowledge injected via word or graph embeddings, and use this new feature space to learn a linear classifier. We also present a new measure to evaluate the interpretability of a set of symbolic features based on the symbol embeddings. Experiments on human activity recognition from wearables and amino acid sequence classification show that our classification approach preserves predictive power while delivering more interpretable models.
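    A minimal sketch of the feature-space idea in the abstract above, under toy assumptions (the embeddings, similarity threshold and activity data are invented; scikit-learn is assumed available): symbols that are close in an embedding space are grouped, group counts are appended to the plain symbol counts, and a linear classifier is fit on the extended features.

      import numpy as np
      from sklearn.linear_model import LogisticRegression

      # Toy symbol embeddings; in the paper these come from word or graph embeddings.
      emb = {"walk": [1.0, 0.1], "run": [0.9, 0.2], "sit": [0.0, 1.0], "lie": [0.1, 0.9]}
      symbols = list(emb)

      def embedding_groups(threshold=0.95):
          """Group symbols whose embeddings have cosine similarity above the threshold."""
          groups = []
          for i, a in enumerate(symbols):
              for b in symbols[i + 1:]:
                  va, vb = np.array(emb[a]), np.array(emb[b])
                  cos = va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb))
                  if cos >= threshold:
                      groups.append({a, b})
          return groups

      groups = embedding_groups()

      def featurize(sequence):
          """Counts of individual symbols plus counts of embedding-derived symbol groups."""
          sym_counts = [sequence.count(s) for s in symbols]
          grp_counts = [sum(sequence.count(s) for s in g) for g in groups]
          return sym_counts + grp_counts

      X = np.array([featurize(s) for s in (["walk", "run", "run"], ["sit", "lie", "sit"],
                                           ["run", "walk"], ["lie", "sit"])])
      y = np.array([1, 0, 1, 0])                # e.g. "moving" vs "resting" activities
      clf = LogisticRegression().fit(X, y)
      print(clf.predict(np.array([featurize(["walk", "sit", "run"])])))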
  • Publication
    Greenwatch-shing: Using AI to Detect Greenwashing
    (Institute of Certified Public Accountants in Ireland, 2020-06)
    The rise of fake news on the Internet has shown that society has few defence mechanisms to cope with misinformation, as well as a limited ability to regulate its attention towards inaccurate or sometimes outright false claims, no matter the topic of such content.
  • Publication
    Interpretable Time Series Classification using Linear Models and Multi-resolution Multi-domain Symbolic Representations
    The time series classification literature has expanded rapidly over the last decade, with many new classification approaches published each year. Prior research has mostly focused on improving the accuracy and efficiency of classifiers, with interpretability being somewhat neglected. This aspect of classifiers has become critical for many application domains and the introduction of the EU GDPR legislation in 2018 is likely to further emphasize the importance of interpretable learning algorithms. Currently, state-of-the-art classification accuracy is achieved with very complex models based on large ensembles (COTE) or deep neural networks (FCN). These approaches are not efficient with regard to either time or space, are difficult to interpret and cannot be applied to variable-length time series, requiring the original series to be pre-processed to a fixed length. In this paper we propose new time series classification algorithms to address these gaps. Our approach is based on symbolic representations of time series, efficient sequence mining algorithms and linear classification models. Our linear models are as accurate as deep learning models but are more efficient regarding running time and memory, can work with variable-length time series and can be interpreted by highlighting the discriminative symbolic features on the original time series. We advance the state of the art in time series classification by proposing new algorithms built using the following three key ideas: (1) Multiple resolutions of symbolic representations: we combine symbolic representations obtained using different parameters, rather than one fixed representation (e.g., multiple SAX representations); (2) Multiple domain representations: we combine symbolic representations in time (e.g., SAX) and frequency (e.g., SFA) domains, to be more robust across problem types; (3) Efficient navigation in a huge symbolic-words space: we extend a symbolic sequence classifier (SEQL) to work with multiple symbolic representations and use its greedy feature selection strategy to effectively filter the best features for each representation. We show that our multi-resolution multi-domain linear classifier (mtSS-SEQL+LR) achieves a similar accuracy to the state-of-the-art COTE ensemble, and to recent deep learning methods (FCN, ResNet), but uses a fraction of the time and memory required by either COTE or deep models. To further analyse the interpretability of our classifier, we present a case study on a human motion dataset collected by the authors. We discuss the accuracy, efficiency and interpretability of our proposed algorithms and release all the results, source code and data to encourage reproducibility.
      Scopus© Citations: 72
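    To make one of the building blocks above concrete, here is a minimal sketch of a SAX-style symbolic transformation (z-normalisation, piecewise aggregation, Gaussian breakpoints). It is illustrative only: the paper combines many such representations (SAX and SFA, at multiple parameter settings) with a SEQL-based linear classifier, which is not shown here. NumPy and SciPy are assumed available.

      import numpy as np
      from scipy.stats import norm

      def sax_word(series, n_segments=4, alphabet="abcd"):
          """Convert a 1-D series into a short symbolic word (SAX-style)."""
          x = np.asarray(series, dtype=float)
          x = (x - x.mean()) / (x.std() + 1e-8)          # z-normalise
          segments = np.array_split(x, n_segments)        # piecewise aggregate approximation
          paa = np.array([seg.mean() for seg in segments])
          # Breakpoints splitting the standard normal into |alphabet| equiprobable bins.
          breakpoints = norm.ppf(np.linspace(0, 1, len(alphabet) + 1)[1:-1])
          return "".join(alphabet[np.searchsorted(breakpoints, v)] for v in paa)

      print(sax_word([1, 2, 3, 4, 8, 9, 4, 3]))           # -> 'abdb'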
  • Publication
    SocialTree: Socially Augmented Structured Summaries of News Stories
    News story understanding entails having an effective summary of a related group of articles that may span different time ranges, involve different topics and entities, and have connections to other stories. In this work, we present an approach to efficiently extract structured summaries of news stories by augmenting news media with the structure of social discourse as reflected in social media in the form of social tags. Existing event detection, topic-modeling, clustering and summarization methods yield news story summaries based only on noun phrases and named entities. These representations are sensitive to the article wording and the keyword extraction algorithm. Moreover, keyword-based representations are rarely helpful for highlighting the inter-story connections or for reflecting the inner structure of the news story because of high word ambiguity and clutter from the large variety of keywords describing news stories. Our method combines the news and social media domains to create structured summaries of news stories in the form of hierarchies of keywords and social tags, named SocialTree. We show that the properties of social tags can be exploited to augment the construction of hierarchical summaries of news stories and to alleviate the weaknesses of existing keyword-based representations. In our quantitative and qualitative evaluation the proposed method strongly outperforms the state-of-the-art with regard to both coverage and informativeness of the summaries.
      Scopus© Citations: 3
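    The following is a minimal, illustrative sketch (not the SocialTree algorithm itself, and with invented tags and a made-up threshold) of one common heuristic for arranging social tags into a hierarchy: a rarer tag is attached under a more frequent tag when most of its occurrences co-occur with that tag.

      from collections import Counter
      from itertools import combinations

      docs = [                                   # hypothetical articles with their social tags
          {"#brexit", "#eu", "#trade"},
          {"#brexit", "#eu"},
          {"#brexit", "#border"},
          {"#eu", "#trade"},
      ]

      freq = Counter(tag for d in docs for tag in d)
      pair = Counter()
      for d in docs:
          for a, b in combinations(sorted(d), 2):
              pair[(a, b)] += 1

      parent = {}
      for (a, b), n in pair.items():
          general, specific = (a, b) if freq[a] >= freq[b] else (b, a)
          if n / freq[specific] >= 0.6:          # specific tag mostly appears with the general one
              parent[specific] = general

      for child, par in sorted(parent.items()):
          print(f"{par} -> {child}")             # edges of the resulting tag hierarchy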
  • Publication
    Hashtagger+: Efficient High-Coverage Social Tagging of Streaming News
    News and social media now play a synergistic role and neither domain can be grasped in isolation. On one hand, platforms such as Twitter have taken a central role in the dissemination and consumption of news. On the other hand, news editors rely on social media for following their audiences' attention and for crowd-sourcing news stories. Twitter hashtags function as a key connection between Twitter crowds and the news media, by naturally naming and contextualizing stories, grouping the discussion of news and marking topic trends. In this work we propose Hashtagger+, an efficient learning-to-rank framework for merging news and social streams in real time, by recommending Twitter hashtags to news articles. We provide an extensive study of different approaches for streaming hashtag recommendation, and show that pointwise learning-to-rank is more effective than multi-class classification as well as more complex learning-to-rank approaches. We improve the efficiency and coverage of a state-of-the-art hashtag recommendation model by proposing new techniques for data collection and feature computation. In our comprehensive evaluation on real data we show that we drastically outperform prior methods in both accuracy and efficiency. Our prototype system delivers recommendations in under 1 minute, with a Precision@1 of 94% and article coverage of 80%. This is an order of magnitude faster than prior approaches, and brings improvements of 5% in precision and 20% in coverage. By effectively linking the news stream to the social stream via the recommended hashtags, we open the door to solving many challenging problems related to story detection and tracking. To showcase this potential, we present an application of our recommendations to automated news story tracking via social tags. Our recommendation framework is implemented in a real-time Web system available from insight4news.ucd.ie.
      Scopus© Citations: 23
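    A minimal sketch of the pointwise learning-to-rank idea from the abstract above: each (article, candidate hashtag) pair gets a feature vector, a probabilistic classifier is trained on relevant/irrelevant pairs, and candidates are ranked by their predicted relevance. The features, training pairs and candidate set are toy assumptions (the real system uses a much richer streaming pipeline); scikit-learn is assumed available.

      import numpy as np
      from sklearn.linear_model import LogisticRegression

      def features(article_words, tag, tag_tweet_words, tag_recent_count):
          """One feature vector per (article, candidate hashtag) pair."""
          overlap = len(article_words & tag_tweet_words) / (len(article_words) or 1)
          name_in_article = float(tag.strip("#") in article_words)
          return [overlap, name_in_article, np.log1p(tag_recent_count)]

      # Hypothetical training pairs: (features, was this hashtag relevant to the article?)
      X_train = np.array([
          features({"storm", "flood", "coast"}, "#storm", {"storm", "wind"}, 120),
          features({"storm", "flood", "coast"}, "#football", {"goal", "match"}, 300),
          features({"election", "vote"}, "#election", {"vote", "poll"}, 80),
          features({"election", "vote"}, "#recipe", {"cake", "oven"}, 10),
      ])
      y_train = np.array([1, 0, 1, 0])
      model = LogisticRegression().fit(X_train, y_train)

      # At recommendation time: score every candidate hashtag for the article and rank.
      article = {"flood", "rain", "storm"}
      candidates = {"#storm": ({"storm", "rain"}, 150), "#football": ({"goal"}, 400)}
      scores = {tag: model.predict_proba(np.array([features(article, tag, words, cnt)]))[0, 1]
                for tag, (words, cnt) in candidates.items()}
      print(sorted(scores, key=scores.get, reverse=True))   # hashtags ranked by relevance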
  • Publication
    Large Scale Identification and Categorization of Protein Sequences Using Structured Logistic Regression
    (Public Library of Science, 2014-01-20)
    Background: Structured Logistic Regression (SLR) is a newly developed machine learning tool first proposed in the context of text categorization. The current availability of extensive protein sequence databases calls for an automated method to reliably classify sequences, and SLR seems well-suited for this task. The classification of P-type ATPases, a large family of ATP-driven membrane pumps transporting essential cations, was selected as a test case that would generate important biological information as well as provide a proof of concept for the application of SLR to a large-scale bioinformatics problem.
    Results: Using SLR, we have built classifiers to identify and automatically categorize P-type ATPases into one of 11 pre-defined classes. The SLR classifiers are compared to a Hidden Markov Model approach and shown to be highly accurate and scalable. We analysed 9.3 million sequences in UniProtKB, representing the bulk of currently known sequences, and attempted to classify a large number of P-type ATPases. To examine the distribution of pumps across organisms, we also applied SLR to 1,123 complete genomes from the Entrez genome database. Finally, we analysed the predicted membrane topology of the identified P-type ATPases.
    Conclusions: Using the SLR-based classification tool we are able to run a large-scale study of P-type ATPases. This study provides proof of concept for the application of SLR to a bioinformatics problem, and the analysis of P-type ATPases pinpoints new and interesting targets for further biochemical characterization and structural analysis.
      Scopus© Citations: 12
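    A simplified sketch in the spirit of the study above: amino acid sequences are classified with character n-gram features and a plain (not the paper's structured) logistic regression. The sequences and class labels are toy placeholders; scikit-learn is assumed available.

      from sklearn.feature_extraction.text import CountVectorizer
      from sklearn.linear_model import LogisticRegression
      from sklearn.pipeline import make_pipeline

      seqs = ["MGKLSALDE", "MGKLSAIDE", "TTPQRWWCN", "TTPQRWYCN"]   # toy amino acid sequences
      labels = ["class_A", "class_A", "class_B", "class_B"]         # e.g. two ATPase subtypes

      model = make_pipeline(
          CountVectorizer(analyzer="char", ngram_range=(2, 3)),     # overlapping 2- and 3-mers
          LogisticRegression(max_iter=1000),
      )
      model.fit(seqs, labels)
      print(model.predict(["MGKLSACDE"]))        # -> ['class_A'] on this toy data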
  • Publication
    Analyzing the impact of electricity price forecasting on energy cost-aware scheduling
    Energy cost-aware scheduling, i.e., scheduling that adapts to real-time energy price volatility, can save large energy consumers millions of dollars every year in electricity costs. Energy price forecasting, coupled with energy price-aware scheduling, is a step toward this goal. In this work, we study cost-aware schedules and the effect of various price forecasting schemes on the end schedule cost. We show that simply optimizing price forecasts based on classical regression error metrics (e.g., Mean Squared Error) does not work well for scheduling. Price forecasts that do result in significantly better schedules optimize a combination of metrics, each having a different impact on the end schedule cost. For example, both price estimation and price ranking are important for scheduling, but they carry different weight. We consider day-ahead energy price forecasting using the Irish Single Electricity Market as a case study, and test our price forecasts for two real-world scheduling applications: animal feed manufacturing and home energy management systems. We show that price forecasts that co-optimize price estimation and price ranking result in significant energy-cost savings. We believe our results are relevant for many real-life scheduling applications that are currently plagued by very large energy bills.
      Scopus© Citations: 19
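    A minimal sketch of the evaluation idea above: a day-ahead forecast is judged both by estimation error (MSE) and by how well it ranks the hours by price, since a scheduler mostly cares about which hours are cheapest. The prices and the 0.5/0.5 weighting are purely illustrative assumptions; NumPy and SciPy are assumed available.

      import numpy as np
      from scipy.stats import kendalltau

      actual   = np.array([45.0, 52.0, 61.0, 58.0, 49.0, 43.0])   # actual hourly prices
      forecast = np.array([47.0, 50.0, 64.0, 55.0, 50.0, 41.0])   # forecast hourly prices

      mse = np.mean((actual - forecast) ** 2)        # estimation quality
      tau, _ = kendalltau(actual, forecast)          # ranking quality (rank agreement)

      # Combined score trading off estimation accuracy and ranking quality.
      score = 0.5 * mse + 0.5 * (1.0 - tau)
      print(f"MSE={mse:.2f}, Kendall tau={tau:.2f}, combined={score:.2f}")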
  • Publication
    Insight4News: Connecting News to Relevant Social Conversations
    We present the Insight4News system that connects news articles to social conversations, as echoed in microblogs such as Twitter. Insight4News tracks feeds from mainstream media (e.g., BBC, Irish Times), extracts relevant topics that summarize the tweet activity around each article, recommends relevant hashtags, and presents complementary views and statistics on the tweet activity, related news articles, and the timeline of the story with regard to the Twitter reaction. The user can track their own news article or a topic-focused Twitter stream. While many systems tap into the social knowledge of Twitter to help users stay on top of the information wave, none is available for connecting news to relevant Twitter content on a large scale, in real time, with high precision and recall. Insight4News builds on our award-winning Twitter topic detection approach and several machine learning components to deliver news in a social context.
      Scopus© Citations: 6
  • Publication
    Learning-to-Rank for Real-Time High-Precision Hashtag Recommendation for Streaming News
    We address the problem of real-time recommendation of streaming Twitter hashtags to an incoming stream of news articles. The technical challenge can be framed as large-scale topic classification where the set of topics (i.e., hashtags) is huge and highly dynamic. Our main applications come from digital journalism, e.g., for promoting original content to Twitter communities and for social indexing of news to enable better retrieval, story tracking and summarisation. In contrast to state-of-the-art methods that focus on modelling each individual hashtag as a topic, we propose a learning-to-rank approach for modelling hashtag relevance, and present methods to extract time-aware features from highly dynamic content. We present the data collection and processing pipeline, as well as our methodology for achieving low-latency, high-precision recommendations. Our empirical results show that our method outperforms the state of the art, delivering more than 80% precision. Our techniques are implemented in a real-time system, and are currently under user trial with a major news organisation.
      Scopus© Citations: 30
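    A minimal sketch of one kind of time-aware feature mentioned above: an exponentially decayed count of how often a hashtag has recently been seen, so stale hashtags fade while currently trending ones score highly. The half-life and the event stream are illustrative assumptions, not the paper's feature set.

      import math

      class DecayedCount:
          """Counter whose value halves every `half_life` seconds."""
          def __init__(self, half_life=3600.0):
              self.rate = math.log(2) / half_life
              self.value = 0.0
              self.last_t = 0.0

          def update(self, t, increment=1.0):
              self.value *= math.exp(-self.rate * (t - self.last_t))  # decay the old mass
              self.value += increment
              self.last_t = t

          def value_at(self, t):
              return self.value * math.exp(-self.rate * (t - self.last_t))

      events = [("#storm", 0), ("#storm", 600), ("#election", 700), ("#storm", 7200)]
      counts = {}
      for tag, t in events:                       # (hashtag, timestamp-in-seconds) stream
          counts.setdefault(tag, DecayedCount()).update(t)

      now = max(t for _, t in events)
      for tag, c in counts.items():
          print(tag, round(c.value_at(now), 2))   # recency-weighted activity per hashtag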