A Topic-Based Approach to Multiple Corpus Comparison
Author(s)
Date Issued
2019-12-06
Date Available
2024-02-09T16:47:57Z
Abstract
Corpus comparison techniques are often used to compare different types of online media, for example social media posts and news articles. Most corpus comparison algorithms operate at the word level and present results as lists of individual discriminating words, which makes identifying larger underlying differences between corpora challenging. Most corpus comparison techniques also work on pairs of corpora and do not easily extend to multiple corpora. To counter these issues, we introduce Multi-corpus Topic-based Corpus Comparison (MTCC), a corpus comparison approach that works at the topic level and can compare multiple corpora at once. Experiments on multiple real-world datasets are carried out to demonstrate the effectiveness of MTCC and to compare the usefulness of different statistical discrimination metrics; the χ² and Jensen-Shannon Divergence metrics are shown to work well. Finally, we demonstrate the usefulness of reporting corpus comparison results via topics rather than individual words. Overall, we show that the topic-level MTCC approach can capture the differences between multiple corpora and present the results in a more meaningful and interpretable way than approaches that operate at the word level.
Sponsorship
Teagasc
Type of Material
Conference Publication
Publisher
CEUR Workshop Proceedings
Series
CEUR Workshop Proceedings
2563
Copyright (Published Version)
2019 the Authors
Web versions
Language
English
Status of Item
Peer reviewed
Journal
Curry, E., Keane, M., Adegboyega, O., and Salwala, D. (eds.). AICS 2019: Proceedings for the 27th AIAI Irish Conference on Artificial Intelligence and Cognitive Science NUI Galway: Galway, Ireland, December 5-6th, 2019
Conference Details
The 27th AIAI Irish Conference on Artificial Intelligence and Cognitive Science (AICS 2019), Galway, Ireland, 5-6 December 2019
ISSN
1613-0073
This item is made available under a Creative Commons License
File(s)
Name
aics_8.pdf
Size
576.6 KB
Format
Adobe PDF
Checksum (MD5)
826af6765b3dae50e46fe4ae8624a1e0
Owning collection