A Large-Scale Multi-Document Summarization Dataset from the Wikipedia Current Events Portal
Files in This Item:
File | Description | Size | Format |
---|---|---|---|
2005.10070v1.pdf | | 435.69 kB | Adobe PDF |

Field | Value |
---|---|
Title: | A Large-Scale Multi-Document Summarization Dataset from the Wikipedia Current Events Portal |
Authors: | Ghalandari, Demian Gholipour; Hokamp, Chris; Pham, Nghia The; Glover, John; Ifrim, Georgiana |
Permanent link: | http://hdl.handle.net/10197/12036 |
Date: | 10-Jul-2020 |
Online since: | 2021-03-11T16:05:39Z |
Abstract: | Multi-document summarization (MDS) aims to compress the content in large document collections into short summaries and has important applications in story clustering for newsfeeds, presentation of search results, and timeline generation. However, there is a lack of datasets that realistically address such use cases at a scale large enough for training supervised models for this task. This work presents a new dataset for MDS that is large both in the total number of document clusters and in the size of individual clusters. We build this dataset by leveraging the Wikipedia Current Events Portal (WCEP), which provides concise and neutral human-written summaries of news events, with links to external source articles. We also automatically extend these source articles by looking for related articles in the Common Crawl archive. We provide a quantitative analysis of the dataset and empirical results for several state-of-the-art MDS techniques. |
Funding Details: | Irish Research Council; Science Foundation Ireland; Aylien Ltd. |
Type of material: | Conference Publication |
Keywords: | Multi-document summarization; News events; Deep learning methods |
DOI: | 10.18653/v1/2020.acl-main.120 |
Other versions: | https://acl2020.org/ |
Language: | en |
Status of Item: | Peer reviewed |
Is part of: | Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics |
Conference Details: | The 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), Online, 5-10 July 2020 |
This item is made available under a Creative Commons License: | https://creativecommons.org/licenses/by/3.0/ie/ |
Appears in Collections: | Computer Science Research Collection; Insight Research Collection |