ANNOTATE: orgANizing uNstructured cOntenTs viA Topic labEls

Files in This Item:
File: ajwani_bigdata18.pdf
Size: 693.19 kB
Format: Adobe PDF
Title: ANNOTATE: orgANizing uNstructured cOntenTs viA Topic labEls
Authors: Ajwani, Deepak; Taneva, Bilyana; Dutta, Sourav; et al.
Permanent link: http://hdl.handle.net/10197/9888
Date: 13-Dec-2018
Online since: 2019-04-10T11:06:51Z
Abstract: With the advent of the Big Data paradigm, filtering, retrieval, and linking of unstructured multi-modal data have become a necessity. Assigning topic labels to contents that accurately capture their meaning and contextual information is a fundamental problem in organizing unstructured data. The use of manually assigned tags for this purpose introduces inconsistencies because of different "surface forms". On the other hand, existing automated approaches either use hierarchical multi-label classification or are unsupervised and rely on (undirected) graph measures leveraging taxonomies. While the former require a large training data set to learn the characteristics of each topic class, the latter lack the flexibility to learn a broad range of related topics and are less accurate. We propose a novel framework, ANNOTATE, based on a small set of features and directed traversal of taxonomies, to learn a broad spectrum of related topics using limited training data. We also show that our approach provides accurate labels for several domains without the need for re-training. For instance, the framework, trained on a small set of BBC news articles, exhibits close matches to user-generated tags for Quora documents. Experimental results on the same model for news classification and for identifying aspects of Amazon product reviews, based on Amazon Mechanical Turk evaluation, show our approach to be significantly better than the state of the art. We further present real-life case studies of our proposed framework for automatically tagging Quora posts and for topically segmenting, indexing, and linking related YouTube videos (using our publicly available Chrome browser extension).
Type of material: Conference Publication
Publisher: IEEE
Start page: 1699
End page: 1708
Copyright (published version): 2018 IEEE
Keywords: Taxonomy; Labeling; Semantics; Encyclopedias; Electronic publishing; Internet
DOI: 10.1109/BigData.2018.8622647
Other versions: http://cci.drexel.edu/bigdata/bigdata2018/
Language: en
Status of Item: Not peer reviewed
Is part of: Abe, N., Liu, H., Pu, C. et al. (eds.). Proceedings: 2018 IEEE International Conference on Big Data, Dec 10 - Dec 13, 2018, Seattle, WA, USA
Conference Details: 2018 IEEE International Conference on Big Data, Seattle, United States of America, 10-13 December 2018
ISBN: 9781538650356
Appears in Collections: Computer Science Research Collection

This item is available under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Ireland licence. No item may be reproduced for commercial purposes. For other possible restrictions on use, please refer to the publisher's URL where this is made available, or to notes contained in the item itself. Other terms may apply.