Application of Data Mining and Imputation Algorithms for Missing Value Handling: A Study Case Car Evaluation Dataset

Wahyu Widyananda; Muhammad Fauzan Edy Purnomo; Muhammad Aswin; Panca Mudjirahardjo; Sholeh Hadi Pramono

doi:10.24996/ijs.2023.64.5.32

Authors

Wahyu Widyananda, Electrical Engineering Department, Brawijaya University, East Java, Indonesia, https://orcid.org/0000-0003-2550-4240
Muhammad Fauzan Edy Purnomo, Electrical Engineering Department, Brawijaya University, East Java, Indonesia
Muhammad Aswin, Electrical Engineering Department, Brawijaya University, East Java, Indonesia
Panca Mudjirahardjo, Electrical Engineering Department, Brawijaya University, East Java, Indonesia
Sholeh Hadi Pramono, Electrical Engineering Department, Brawijaya University, East Java, Indonesia

DOI:

https://doi.org/10.24996/ijs.2023.64.5.32

Keywords:

C5.0, k-NNI, Data Mining, Missing Value Handling, R Studio

Abstract

Data mining is the process of analyzing large amounts of data with software to find patterns or rules that are expected to provide knowledge to support decisions. However, missing values in the data often lead to a loss of information. The purpose of this study is to improve the performance of classification on data with missing values. Testing is carried out using the Car Evaluation dataset from the UCI Machine Learning Repository, with RStudio and RapidMiner used to run the algorithms. The study analyzes the tested parameters to measure algorithm performance across four test variations: (1) C5.0, C4.5, and k-NN at a 0% missing rate; (2) C5.0, C4.5, and k-NN at 5–50% missing rates; (3) C5.0 + k-NNI, C4.5 + k-NNI, and k-NN + k-NNI at 5–50% missing rates; and (4) C5.0 + CMI, C4.5 + CMI, and k-NN + CMI at 5–50% missing rates. The results show that C5.0 with k-NNI produces better classification accuracy than the other tested combinations of imputation and classification algorithms. For example, with 35% of the dataset missing, this method obtains 93.40% validation accuracy and 92% test accuracy. C5.0 with k-NNI also offers faster processing times than the other tested methods.
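
As an illustration of the imputation step only, the following is a minimal sketch of k-nearest-neighbour imputation (k-NNI) for categorical attributes such as those in the Car Evaluation dataset. It is not the authors' RStudio or RapidMiner implementation. The function names (`hamming`, `knn_impute`), the use of Hamming distance, the majority vote among neighbours, and the toy rows are all assumptions made for this example.

```python
from collections import Counter

MISSING = None  # assumed marker for a missing categorical value


def hamming(a, b, skip):
    """Count mismatching attributes between rows a and b.

    Column `skip` (the one being imputed) and any column where either
    value is missing are ignored.
    """
    return sum(
        1
        for j, (x, y) in enumerate(zip(a, b))
        if j != skip and x is not MISSING and y is not MISSING and x != y
    )


def knn_impute(rows, k=3):
    """Return a copy of `rows` with each missing cell filled by the most
    common value of that column among the k nearest donor rows."""
    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not MISSING:
                continue
            # Donors are other rows that have an observed value in column j.
            donors = [r for idx, r in enumerate(rows)
                      if idx != i and r[j] is not MISSING]
            if not donors:
                continue  # no donor available; leave the gap as it is
            donors.sort(key=lambda r: hamming(row, r, j))
            votes = Counter(r[j] for r in donors[:k])
            imputed[i][j] = votes.most_common(1)[0][0]
    return imputed


# Toy rows (made up for illustration): buying, maint, doors, persons,
# lug_boot, safety.
rows = [
    ("high", "high", "4", "more", "big", "high"),
    ("high", "high", "4", "more", "big", MISSING),
    ("high", "med", "4", "more", "big", "high"),
    ("low", "low", "2", "2", "small", "low"),
    ("low", "low", "2", "2", "small", "low"),
]
filled = knn_impute(rows, k=2)
print(filled[1])  # ['high', 'high', '4', 'more', 'big', 'high']
```

In this sketch the missing safety value of the second row is taken from its two closest rows, both of which have safety "high". Distances are computed on the original rows, so already imputed cells never influence later imputations. Skipping missing columns in the distance is a simplification that can favour sparse donor rows. The completed table could then be passed to any classifier, such as a decision tree.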