Study of Distributed Dynamic Clustering Framework for Spatial Data Mining
Author(s)
Date Issued
2017
Date Available
2019-05-22T12:59:17Z
Abstract
The amount of data generated per year will reach more than 44,000 billion gigabytes in 2020, ten times more than in 2003, and this growth is likely to continue according to current trends. This means that more than 10,000 gigabytes of data per person per year are generated by daily life. Hence the term "Big Data" was introduced. Big Data refers to very large datasets that are collected from different fields, are heterogeneous, and continue to grow at a rapid pace. Analysing and extracting relevant information from these datasets is one of the biggest challenges, because they require huge storage capacity, processing power, and efficient mining algorithms that deal not only with size but also with heterogeneity, noise, and learning capacity. These needs call for architectural modifications in data storage and data management, as well as the development of new algorithms for efficient Big Data mining. In fact, the analysis of Big Data requires powerful, scalable, and accurate data analytics techniques that traditional data mining and machine learning do not offer as a whole. Therefore, new data analytics frameworks are needed to deal with Big Data challenges such as the volume, velocity, veracity, and variety of the data. Distributed data mining constitutes a promising approach for Big Data analytics, as datasets are usually produced in distributed locations, and processing them at their local sites significantly reduces response times, communication, etc. In this thesis, we developed and implemented a data mining framework that can analyse Big Data within a reasonable response time, produce accurate results, and use existing computing and storage infrastructure, such as cloud computing. The framework is distributed and deals with issues of high-performance computing. The proposed approach was developed and implemented for spatial data mining.
It is general, can handle very large datasets, and deals with the heterogeneity and velocity of the data. The approach consists of two phases. The first phase generates local models, and the second aggregates the local results to obtain global models. It is capable of analysing the datasets located at each site using different clustering techniques. The aggregation phase is designed in such a way that the final clusters are compact and accurate, while the overall process is efficient in time and memory allocation. The approach was thoroughly tested and compared to well-known clustering algorithms. The results show that the approach not only produces high-quality results compared to existing approaches but also achieves super-linear speed-up and scales very well by taking advantage of the Hadoop MapReduce paradigm.
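The two-phase structure described in the abstract can be illustrated with a minimal sketch. This is not the thesis's algorithm: the greedy "leader" clustering, the centroid-distance merge rule, the `radius` parameter, and the function names `local_model` and `aggregate` are all assumptions chosen only to show how each site might summarise its own data and how those summaries might then be merged without moving raw points.

```python
from math import dist

def local_model(points, radius):
    """Phase 1 (hypothetical, runs on one site): greedy leader clustering.

    Returns a compact summary, a list of (centroid, count) pairs,
    so that raw points never have to leave the site.
    """
    clusters = []  # each entry: [sum_x, sum_y, count]
    for x, y in points:
        for c in clusters:
            centroid = (c[0] / c[2], c[1] / c[2])
            if dist(centroid, (x, y)) <= radius:
                c[0] += x
                c[1] += y
                c[2] += 1
                break
        else:
            # No existing cluster is close enough: start a new one.
            clusters.append([x, y, 1])
    return [((sx / n, sy / n), n) for sx, sy, n in clusters]

def aggregate(local_models, radius):
    """Phase 2 (hypothetical): merge local summaries with nearby centroids.

    Centroids are combined as count-weighted means.
    """
    merged = []  # each entry: [weighted_sum_x, weighted_sum_y, count]
    for model in local_models:
        for (cx, cy), n in model:
            for g in merged:
                centroid = (g[0] / g[2], g[1] / g[2])
                if dist(centroid, (cx, cy)) <= radius:
                    g[0] += cx * n
                    g[1] += cy * n
                    g[2] += n
                    break
            else:
                merged.append([cx * n, cy * n, n])
    return [((sx / n, sy / n), n) for sx, sy, n in merged]

# Two made-up sites, each holding a few 2-D spatial points.
site_a = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0)]
site_b = [(0.1, 0.3), (5.2, 4.9), (9.0, 0.0)]

models = [local_model(site_a, 1.0), local_model(site_b, 1.0)]
global_model = aggregate(models, 1.0)
```

In this toy setting only the per-site summaries are exchanged, which is the property that makes the local phase straightforward to run in parallel, for example as independent map tasks followed by a reduce step.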
Type of Material
Doctoral Thesis
Publisher
University College Dublin. School of Computer Science
Qualification Name
Ph.D.
Copyright (Published Version)
2017 the author
Language
English
Status of Item
Peer reviewed
This item is made available under a Creative Commons License
File(s)
Name
Bendechache_ucd_5090D_10195.pdf
Size
10.19 MB
Format
Adobe PDF
Checksum (MD5)
dde525f81fbe16c10b9b91c85c680f47