HaRD: a heterogeneity-aware replica deletion for HDFS

Ciritoglu, Hilmi Egemen; Murphy, John; Thorpe, Christina

doi:10.1186/s40537-019-0256-6

Author(s)

Ciritoglu, Hilmi Egemen

Murphy, John

Thorpe, Christina

URI

http://hdl.handle.net/10197/11328

Date Issued

2019-10-21

Date Available

2020-03-20T13:14:59Z

Abstract

The Hadoop Distributed File System (HDFS) stores very large datasets reliably on clusters of commodity machines. HDFS uses replication to serve data requested by clients with high throughput. Data replication is a trade-off between better data availability and higher disk usage. Recent studies propose data replication management frameworks that alter the replication factor of files dynamically in response to the popularity of the data, keeping more replicas for in-demand data to enhance the overall performance of the system. When data becomes less popular, these schemes reduce the replication factor, which changes how data is distributed across the cluster and leads to an unbalanced data distribution. Such an unbalanced distribution causes hot spots, low data locality and excessive network usage in the cluster. In this work, we first confirm that reducing the replication factor causes unbalanced data distribution when using Hadoop’s default replica deletion scheme. Then, we show that even keeping the data distribution balanced with WBRD, the data-distribution-aware replica deletion scheme that we proposed in previous work, performs sub-optimally on heterogeneous clusters. To overcome this issue, we propose a heterogeneity-aware replica deletion scheme (HaRD). HaRD considers the nodes’ processing capabilities when deleting replicas; hence it stores more replicas on the more powerful nodes. We implemented HaRD on top of HDFS and conducted a performance evaluation on a 23-node dedicated heterogeneous cluster. Our results show that HaRD reduced execution time by up to 60% compared to Hadoop and by up to 17% compared to WBRD.
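
The idea in the abstract, deleting replicas so that more powerful nodes end up holding more of the data, can be illustrated with a small sketch. This is not the authors' implementation and does not use any HDFS API. The function names, the per-node `capability` score, and the rule "delete from the node with the highest block count per unit of capability" are all assumptions made for illustration only.

```python
from typing import Dict, List


def pick_replica_to_delete(
    replica_nodes: List[str],
    block_counts: Dict[str, int],
    capability: Dict[str, float],
) -> str:
    """Choose the node that should lose its replica of one block.

    Hypothetical rule: the node holding the most blocks relative to its
    capability score is the most overloaded, so its replica is removed.
    Capability scores are assumed to be positive numbers.
    """
    if not replica_nodes:
        raise ValueError("block has no replicas")
    # Ties are broken by node name so the choice is deterministic.
    return max(
        replica_nodes,
        key=lambda node: (block_counts[node] / capability[node], node),
    )


def reduce_replication(
    replica_nodes: List[str],
    block_counts: Dict[str, int],
    capability: Dict[str, float],
    target: int,
) -> List[str]:
    """Drop replicas one at a time until only `target` remain.

    `block_counts` is updated in place so later decisions see the
    effect of earlier deletions.
    """
    remaining = list(replica_nodes)
    while len(remaining) > target:
        victim = pick_replica_to_delete(remaining, block_counts, capability)
        remaining.remove(victim)
        block_counts[victim] -= 1
    return remaining


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the paper.
    counts = {"fast": 120, "mid": 100, "slow": 90}
    scores = {"fast": 2.0, "mid": 1.0, "slow": 0.5}
    print(reduce_replication(["fast", "mid", "slow"], counts, scores, 1))
```

With these illustrative numbers the weakest node gives up its replica first even though it holds the fewest blocks, which is the behaviour a purely count-based balancing rule would not produce.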

Sponsorship

European Commission - European Regional Development Fund

Science Foundation Ireland

Type of Material

Journal Article

Publisher

Springer

Journal

Journal of Big Data

Volume

6

Issue

1

Copyright (Published Version)

2019 the Authors

Subjects

Hadoop distributed fi...

Replication factor

Replica management fr...

Software performance

DOI

10.1186/s40537-019-0256-6

Dataset(s)

https://www.ncdc.noaa.gov/cdo-web/datasets

Language

English

Status of Item

Peer reviewed

ISSN

2196-1115

This item is made available under a Creative Commons License

https://creativecommons.org/licenses/by-nc-nd/3.0/ie/

Name

s40537-019-0256-6(1).pdf

Size

1.57 MB

Format

Adobe PDF

Checksum (MD5)

c403e15b53d9a53abea3689a08709294

Owning collection

Computer Science Research Collection
