An Analysis of the Coherence of Descriptors in Topic Modeling
|Title:||An Analysis of the Coherence of Descriptors in Topic Modeling||Authors:||O'Callaghan, Derek
|Permanent link:||http://hdl.handle.net/10197/6482||Date:||1-Aug-2015||Abstract:||In recent years, topic modeling has become an established method in the analysis of text corpora, with probabilistic techniques such as latent Dirichlet allocation (LDA) commonly employed for this purpose. However, it might be argued that adequate attention is often not paid to the issue of topic coherence, the semantic interpretability of the top terms usually used to describe discovered topics. Nevertheless, a number of studies have proposed measures for analyzing such coherence, where these have been largely focused on topics found by LDA, with matrix decomposition techniques such as Non-negative Matrix Factorization (NMF) being somewhat overlooked in comparison. This motivates the current work, where we compare and analyze topics found by popular variants of both NMF and LDA in multiple corpora in terms of both their coherence and associated generality, using a combination of existing and new measures, including one based on distributional semantics. Two out of three coherence measures find NMF to regularly produce more coherent topics, with higher levels of generality and redundancy observed with the LDA topic descriptors. In all cases, we observe that the associated term weighting strategy plays a major role. The results observed with NMF suggest that this may be a more suitable topic modeling method when analyzing certain corpora, such as those associated with niche or non-mainstream domains.||Type of material:||Journal Article||Publisher:||Elsevier||Copyright (published version):||2015 Elsevier||Keywords:||Machine learning;Statistics;Topic modeling;Topic coherence;LDA;NMF||DOI:||10.1016/j.eswa.2015.02.055||Language:||en||Status of Item:||Peer reviewed|
|Appears in Collections:||Computer Science Research Collection|
Insight Research Collection
Show full item record
This item is available under the Attribution-NonCommercial-NoDerivs 3.0 Ireland. No item may be reproduced for commercial purposes. For other possible restrictions on use please refer to the publisher's URL where this is made available, or to notes contained in the item itself. Other terms may apply.