A Large-Scale Multi-Document Summarization Dataset from the Wikipedia Current Events Portal
File(s)
File | Description | Size | Format |
---|---|---|---|
2005.10070v1.pdf | | 435.69 KB | |
Date Issued
10 July 2020
Date Available
11 March 2021, 16:05:39 UTC
Abstract
Multi-document summarization (MDS) aims to compress the content in large document collections into short summaries and has important applications in story clustering for newsfeeds, presentation of search results, and timeline generation. However, there is a lack of datasets that realistically address such use cases at a scale large enough for training supervised models for this task. This work presents a new dataset for MDS that is large both in the total number of document clusters and in the size of individual clusters. We build this dataset by leveraging the Wikipedia Current Events Portal (WCEP), which provides concise and neutral human-written summaries of news events, with links to external source articles. We also automatically extend these source articles by looking for related articles in the Common Crawl archive. We provide a quantitative analysis of the dataset and empirical results for several state-of-the-art MDS techniques.
Sponsorship
Irish Research Council
Science Foundation Ireland
Other Sponsorship
Aylien Ltd.
Type of Material
Conference Publication
Language
English
Status of Item
Peer reviewed
Part of
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Description
The 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), Online, 5-10 July 2020
This item is made available under a Creative Commons License