Comparing K-Means, Nearest Neighbor, and Lloyd's Clustering Algorithms

Shaymaa Qasim Noor; Tareef Kamil Mustafa

doi:10.24996/ijs.2024.65.11.40

Authors

Shaymaa Qasim Noor, Department of Computer Science, College of Science, University of Baghdad, Baghdad, Iraq, https://orcid.org/0009-0006-0752-6509
Tareef Kamil Mustafa, Department of Computer Science, College of Science, University of Baghdad, Baghdad, Iraq

DOI:

https://doi.org/10.24996/ijs.2024.65.11.40

Keywords:

Data Mining, Cluster, K-means Clustering, Lloyd's Algorithm, Nearest Neighbour Algorithm

Abstract

Clustering is an unsupervised learning method that organizes items into groups based on their properties, such that items in the same group are similar and items in different groups are distinct. Its primary benefit is that, with little or no prior information, interesting patterns and structures can be discovered directly from very large data sets. This study explores and evaluates three representative algorithms, K-means, the nearest neighbor algorithm, and Lloyd's algorithm, based on their basic strategies. The algorithms proved highly efficient in classifying data: K-means performed well when data points were generated and classified into 10 groups; the nearest neighbor algorithm proved highly effective in predicting groups for new data points in light of pre-existing groups; and Lloyd's algorithm achieved high results by iterating several times to reach the target groups. Using the random function, 100 random values for inputs (x) and 100 for outputs (y) are generated, and these values are paired as points. Calculations of the mean and standard deviation for the data set indicate that 68% of the data points will fall within one standard deviation, 95% within two standard deviations, and 99.7% within three standard deviations. In the nearest neighbor algorithm, the confidence that data points fall within a specified number of standard deviations is determined by the coverage factor, or k value. For k = 10, 97% of the data points are expected to fall within one standard deviation. With Lloyd's algorithm, a normal distribution curve appears when calibration or measurement data are generated.
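
The procedure summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that Python's standard `random` module stands in for "the random function" and that values are drawn from a standard normal distribution, which the abstract does not state, so that the standard-deviation percentages are meaningful. All function names (`generate_points`, `lloyd_kmeans`, `nearest_neighbor_label`, `fraction_within`) are hypothetical and chosen only for this illustration.

```python
import random
import statistics


def generate_points(n=100, seed=0):
    """Draw n (x, y) points; the normal distribution is an assumption."""
    rng = random.Random(seed)
    return [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n)]


def sq_dist(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def assign(points, centroids):
    """Label each point with the index of its closest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
        for p in points
    ]


def lloyd_kmeans(points, k=10, max_iter=100, seed=0):
    """K-means via Lloyd's iteration: alternate assignment and mean update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids picked from the data
    for _ in range(max_iter):
        labels = assign(points, centroids)
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append((
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                ))
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        if new_centroids == centroids:  # no centroid moved: converged
            break
        centroids = new_centroids
    return centroids, assign(points, centroids)


def nearest_neighbor_label(new_point, points, labels):
    """Give a new point the group of its single closest existing point."""
    i = min(range(len(points)), key=lambda idx: sq_dist(new_point, points[idx]))
    return labels[i]


def fraction_within(values, n_sigma):
    """Share of values within n_sigma standard deviations of the mean."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return sum(abs(v - mu) <= n_sigma * sigma for v in values) / len(values)


if __name__ == "__main__":
    pts = generate_points(100)
    cents, labs = lloyd_kmeans(pts, k=10)
    print("group of new point:", nearest_neighbor_label((0.1, -0.2), pts, labs))
    xs = [x for x, _ in pts]
    for s in (1, 2, 3):
        print(f"within {s} sd:", fraction_within(xs, s))
```

With only 100 samples, the fractions printed by `fraction_within` will approximate, not exactly match, the 68%, 95%, and 99.7% figures, which describe an ideal normal distribution.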