Now showing 1 - 2 of 2
- PublicationWeak Supervision for Semi-Supervised Topic Modeling via Word EmbeddingsSemi-supervised algorithms have been shown to improve the results of topic modeling when applied to unstructured text corpora. However, sufficient supervision is not always available. This paper proposes a new process, Weak+, suitable for use in semi-supervised topic modeling via matrix factorization, when limited supervision is available. This process uses word embeddings to provide additional weakly-labeled data, which can result in improved topic modeling performance.
- PublicationFinding Niche Topics using Semi-Supervised Topic Modeling via Word EmbeddingsTopic modeling techniques generally focus on the discovery of the predominant thematic structures in text corpora. In contrast, a niche topic is made up of a small number of documents related to a common theme. Such a topic may have so few documents relative to the overall corpus size that it fails to be identified when using standard techniques. This paper proposes a new process, called Niche+, for finding these kinds of niche topics. It assumes interactions with a user who can provide a strictly limited level of supervision, which is subsequently employed in semi-supervised matrix factorization. Furthermore, word embeddings are used to provide additional weakly-labeled data. Experimental results show that documents in niche topics can be successfully identified using Niche+. These results are further supported via a use case that explores a real-world company email database.