Now showing 1 - 2 of 2
- PublicationDiverging Divergences: Examining Variants of Jensen Shannon Divergence for Corpus Comparison TasksJensen-Shannon divergence (JSD) is a distribution similarity measurement widely used in natural language processing. In corpus comparison tasks, where keywords are extracted to reveal the divergence between different corpora (for example, social media posts from proponents of different views on a political issue), two variants of JSD have emerged in the literature. One of these uses a weighting based on the relative sizes of the corpora being compared. In this paper we argue that this weighting is unnecessary and, in fact, can lead to misleading results. We recommend that this weighted version is not used. We base this recommendation on an analysis of the JSD variants and experiments showing how they impact corpus comparison results as the relative sizes of the corpora being compared change.
- PublicationExtending Jensen Shannon Divergence to Compare Multiple CorporaInvestigating public discourse on social media platforms has proven a viable way to reflect the impacts of political issues. In this paper we frame this as a corpus comparison problem in which the online discussion of different groups are treated as different corpora to be compared. We propose an extended version of the Jensen-Shannon divergence measure to compare multiple corpora and use the FP-growth algorithm to mix unigrams and bigrams in this comparison. We also propose a set of visualizations that can illustrate the results of this analysis. To demonstrate these approaches we compare the Twitter discourse surrounding Brexit in Ireland and Great Britain across a 14 week time period.