- Publication: Extending Jensen Shannon Divergence to Compare Multiple Corpora
  Investigating public discourse on social media platforms has proven a viable way to reflect the impacts of political issues. In this paper we frame this as a corpus comparison problem in which the online discussions of different groups are treated as different corpora to be compared. We propose an extended version of the Jensen-Shannon divergence measure to compare multiple corpora and use the FP-growth algorithm to mix unigrams and bigrams in this comparison. We also propose a set of visualizations that illustrate the results of this analysis. To demonstrate these approaches we compare the Twitter discourse surrounding Brexit in Ireland and Great Britain across a 14-week period.
- Publication: A Sentence-level Hierarchical BERT Model for Document Classification with Limited Labelled Data
  Training deep learning models with limited labelled data is an attractive scenario for many NLP tasks, including document classification. Since the emergence of BERT, deep learning language models can achieve reasonably good performance in document classification with few labelled instances, but there is little evidence for the utility of applying BERT-like models to long document classification. This work introduces a long-text-specific model, the Hierarchical BERT Model (HBM), that learns sentence-level features of the text and works well in scenarios with limited labelled data. Evaluation experiments demonstrate that HBM can achieve higher performance in document classification than the previous state-of-the-art methods with only 50 to 200 labelled instances, especially when documents are long. As an additional benefit, a user study shows that the salient sentences identified by a trained HBM are useful as explanations for labelling documents.
- Publication: Effect of Combination of HBM and Certainty Sampling on Workload of Semi-Automated Grey Literature Screening
  With the rapid increase of unstructured text data, grey literature has become an important source of information to support research and innovation activities. In this paper, we propose a novel semi-automated grey literature screening approach that combines a Hierarchical BERT Model (HBM) with active learning to reduce the human workload in grey literature screening. Evaluations over three real-world grey literature datasets demonstrate that the proposed approach can save up to 64.88% of the human screening workload while maintaining high screening accuracy. We also demonstrate how the use of HBM allows salient sentences within grey literature documents to be selected and highlighted to support workers in screening tasks.
- Publication: Supervised and Unsupervised Text Mining for Grey Literature Screening (University College Dublin. School of Computer Science, 2021)
  ORCID: 0000-0001-7149-6961
  The increasing recognition of the value of Open Innovation (OI) and the Multi-actor Approach (MAA) in research and innovation activities highlights the need for an efficient and effective process for searching and extracting knowledge from a wide range of different sources. For example, knowledge is required from academic sources but also from practitioners and intermediaries such as businesses, advisors, policymakers and non-government organisations. While knowledge from academic sources can be relatively easily accessed through peer-reviewed publications, knowledge from other sources may be more widely dispersed. This highlights the potential value of exploring and exploiting grey literature, that is, information produced by organisations where publishing and distributing is not the primary focus, to support research and innovation activities. However, this is not easy given the lack of structure in grey literature, as well as the potentially large amount of irrelevant data that is likely to be included in any grey literature collection. As such, machine-learning-based text mining approaches can be used to facilitate the exploration and exploitation of grey literature, and thus to enhance research and innovation activities. Although it is one of the most important sectors in Ireland, the agri-food sector underperforms in innovation activities in comparison to other sectors. Therefore, this thesis proposes using text mining approaches to fuel the advance of research and innovation activities in the agri-food sector. There are many challenges in applying text mining approaches to grey literature to support research and innovation activities. In this thesis, we focus on two aspects: using semi-supervised approaches to assist innovation scholars in grey literature screening, and using unsupervised corpus comparison to support grey literature content analysis.
  To semi-automate grey literature screening, we reframe it as a problem of using active learning for grey literature classification. First, we explore the most suitable text representation technique for active learning, as text representations play an important role in the performance of an active learning system. To this end, we conduct a benchmark experiment comparing the effectiveness of different text representations in the active learning context, focusing especially on more recent high-performing transformer-based text representations. Furthermore, we incorporate fine-tuning into active learning to improve the performance of the transformer-based text representations. A feature of grey literature compared to other texts is that it is unstructured and often includes long texts, so it is crucial to design a text representation that is suitable for grey literature and that also works well in the active learning context, where labelled data is scarce. Therefore, we develop the Hierarchical BERT Model (HBM) and combine it with certainty sampling. Experiments demonstrate that HBM outperforms state-of-the-art methods when labelled data is scarce, and that it works well with certainty sampling to reduce the workload associated with screening grey literature.
  For corpus comparison, we first compare the variants of Jensen-Shannon divergence (JSD) in the literature and identify JSD-pechenick as the appropriate variant to use in corpus comparison. Then we extend JSD-pechenick to enable a multi-corpus comparison. Lastly, we develop a Multi-corpus Topic-based Corpus Comparison (MTCC) approach by integrating topic modelling into corpus comparison. Based on these findings, we propose a pipeline that uses HBM+certainty and MTCC to support innovation scholars in exploring and exploiting agri-food innovation-related grey literature datasets.
- Publication: A Topic-Based Approach to Multiple Corpus Comparison
  Corpus comparison techniques are often used to compare different types of online media, for example social media posts and news articles. Most corpus comparison algorithms operate at a word level and show results as lists of individual discriminating words, which makes identifying larger underlying differences between corpora challenging. Most corpus comparison techniques also work on pairs of corpora and do not easily extend to multiple corpora. To counter these issues, we introduce Multi-corpus Topic-based Corpus Comparison (MTCC), a corpus comparison approach that works at a topic level and that can compare multiple corpora at once. Experiments on multiple real-world datasets are carried out to demonstrate the effectiveness of MTCC and to compare the usefulness of different statistical discrimination metrics; the χ² and Jensen-Shannon divergence metrics are shown to work well. Finally, we demonstrate the usefulness of reporting corpus comparison results via topics rather than individual words. Overall, we show that the topic-level MTCC approach can capture the differences between multiple corpora and show the results in a more meaningful and interpretable way than approaches that operate at a word level.
- Publication: Diverging Divergences: Examining Variants of Jensen Shannon Divergence for Corpus Comparison Tasks
  Jensen-Shannon divergence (JSD) is a distribution similarity measurement widely used in natural language processing. In corpus comparison tasks, where keywords are extracted to reveal the divergence between different corpora (for example, social media posts from proponents of different views on a political issue), two variants of JSD have emerged in the literature. One of these uses a weighting based on the relative sizes of the corpora being compared. In this paper we argue that this weighting is unnecessary and, in fact, can lead to misleading results. We recommend that this weighted version is not used. We base this recommendation on an analysis of the JSD variants and experiments showing how they impact corpus comparison results as the relative sizes of the corpora being compared change.
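
The abstracts above mention JSD with and without corpus-size weighting, and its extension to more than two corpora, but this listing does not give the formulas. As a rough illustration only, the sketch below uses the standard generalized form of JSD (entropy of the weighted mixture minus the weighted mean of the individual entropies). It is not necessarily the exact formulation used in these publications, and the function names `entropy`, `jsd` and `word_distributions` are hypothetical.

```python
import math
from collections import Counter


def entropy(p):
    """Shannon entropy in bits of a probability vector."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def jsd(distributions, weights=None):
    """Generalized Jensen-Shannon divergence of n distributions.

    Assumption: the textbook form H(sum_i w_i P_i) - sum_i w_i H(P_i).
    With weights=None every corpus gets weight 1/n (the unweighted variant);
    passing relative corpus sizes gives a size-weighted variant.
    """
    n = len(distributions)
    if weights is None:
        weights = [1.0 / n] * n
    size = len(distributions[0])
    mixture = [
        sum(w * d[i] for w, d in zip(weights, distributions))
        for i in range(size)
    ]
    weighted_entropies = sum(w * entropy(d) for w, d in zip(weights, distributions))
    return entropy(mixture) - weighted_entropies


def word_distributions(corpora):
    """Turn tokenized corpora into word distributions over a shared vocabulary."""
    counts = [Counter(tokens) for tokens in corpora]
    vocab = sorted(set().union(*counts))
    dists = [[c[w] / sum(c.values()) for w in vocab] for c in counts]
    return vocab, dists
```

For example, with one toy corpus of nine tokens and another of a single token, `jsd(dists)` uses equal weights, while `jsd(dists, weights=[0.9, 0.1])` weights by relative corpus size. The two calls return different values for the same pair of word distributions, which is the kind of size dependence the last abstract warns about.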