A Parallel Clustering Analysis Based on Hadoop Multi-Node and Apache Mahout

Noor S. Sagheer; Suhad A. Yousif

doi:10.24996/ijs.2021.62.7.32

Authors

Noor S. Sagheer, Department of Computer Science, Al-Nahrain University, Baghdad, Iraq
Suhad A. Yousif, Department of Computer Science, Al-Nahrain University, Baghdad, Iraq

DOI:

https://doi.org/10.24996/ijs.2021.62.7.32

Keywords:

Big Data, Hadoop, Mahout, Predictive Analytics, Parallel K-means

Abstract

Conventional clustering algorithms cannot cope with managing and analyzing the rapidly growing volume of data generated from different sources. Parallel clustering is one robust solution to this problem. The Apache Hadoop ecosystem provides the capability to store and process data in a distributed and parallel fashion. In this paper, a parallel model is designed to run the k-means clustering algorithm in the Apache Hadoop ecosystem on three connected nodes: one server (name) node and two client (data) nodes. The aim is to reduce the time needed to process a massive healthcare insurance dataset of 11 GB using the machine learning algorithms provided by the Mahout framework. The experimental results show that the proposed model can process large datasets efficiently. The parallel k-means algorithm outperforms the sequential k-means algorithm in execution time: processing the 11 GB dataset takes around 1.847 hours with the parallel k-means algorithm, compared with 68.567 hours with the sequential k-means algorithm. We conclude that as the number of nodes in the parallel system increases, the computation time of the proposed algorithm decreases.
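
To illustrate the idea behind parallel k-means, the following is a minimal Python sketch of one iteration written in a map/reduce style, together with the speedup implied by the reported run times. It is not the Mahout implementation and not the authors' code. The function names (`map_partition`, `reduce_partials`, `kmeans_iteration`, `speedup`) and the toy data are hypothetical and chosen only for illustration, and each in-memory list stands in for the data block held by one data node.

```python
from math import dist  # Euclidean distance, Python 3.8+


def map_partition(points, centroids):
    """Map step: for one partition, build per-cluster partial sums and counts."""
    partial = {}
    for p in points:
        # Index of the nearest centroid for this point.
        k = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        s, n = partial.get(k, ([0.0] * len(p), 0))
        partial[k] = ([a + b for a, b in zip(s, p)], n + 1)
    return partial


def reduce_partials(partials, centroids):
    """Reduce step: merge the partial sums of all partitions into new centroids."""
    totals = {}
    for partial in partials:
        for k, (s, n) in partial.items():
            ts, tn = totals.get(k, ([0.0] * len(s), 0))
            totals[k] = ([a + b for a, b in zip(ts, s)], tn + n)
    # A cluster that received no points keeps its previous centroid.
    return [
        [v / totals[k][1] for v in totals[k][0]] if k in totals else list(c)
        for k, c in enumerate(centroids)
    ]


def kmeans_iteration(partitions, centroids):
    """One k-means iteration; here the map calls run one after another,
    whereas on a cluster each partition would be processed on its own node."""
    partials = [map_partition(part, centroids) for part in partitions]
    return reduce_partials(partials, centroids)


def speedup(sequential_hours, parallel_hours):
    """Ratio of sequential to parallel execution time."""
    return sequential_hours / parallel_hours


if __name__ == "__main__":
    # Toy data: two partitions standing in for two data nodes (assumption).
    partitions = [
        [(0.0, 0.0), (0.0, 2.0)],
        [(10.0, 10.0), (10.0, 12.0)],
    ]
    centroids = [(0.0, 0.0), (10.0, 10.0)]
    print(kmeans_iteration(partitions, centroids))
    # Run times reported in the abstract for the 11 GB dataset.
    print(round(speedup(68.567, 1.847), 2))
```

Because the map step only emits per-cluster sums and counts, the data passed to the reduce step scales with the number of clusters rather than the number of points, which is why the assignment work can be split across data nodes. With the run times reported above, the ratio works out to roughly 37.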