Gender Classification Based on Iraqi Names Using Machine Learning

Huda Hallawi; Ahmed F Almukhtar; Dhamyaa A. Nasrawi; Ali Durr Salah; Tariq Zaid Faisal

doi:10.24996/ijs.2024.65.11.42

Authors

Huda Hallawi, Department of Information Technology, College of Computers Science & Information Technology, University of Kerbala, Karbala, Iraq
Ahmed F Almukhtar, Department of Information Technology, College of Computers Science & Information Technology, University of Kerbala, Karbala, Iraq
Dhamyaa A. Nasrawi, Department of Computer Science, College of Computers Science & Information Technology, University of Kerbala, Karbala, Iraq
Ali Durr Salah, Department of Computer Science, College of Computers Science & Information Technology, University of Kerbala, Karbala, Iraq
Tariq Zaid Faisal, Department of Computer Science, College of Computers Science & Information Technology, University of Kerbala, Karbala, Iraq

DOI:

https://doi.org/10.24996/ijs.2024.65.11.42

Keywords:

Gender classification, machine learning techniques, Iraqi names, multi features, unique dataset

Abstract

In machine learning, classification is the task of building a model that predicts the class of an element from its attributes and a set of examples. This work aims to classify people by gender based on their names. Two models were developed. The first is based on a single feature, the name itself. The second is built on nine features derived from the name: is_longname, is_vowelend, is_vowelbegin, 2_gramend, 2_grambegin, 1_gramend, 1_grambegin, is_contain_abo, and is_contain_abed. Two datasets were used: the first was collected from the Ministry of Labor and Social Affairs, and the second was gathered from the Iraqi university website. Both datasets contain many unusual Iraqi names as well as spelling errors, which pose a real challenge for classification. Five machine learning methods were applied and tested within the developed models: Random Forest, Naive Bayes, Logistic Regression, Multilayer Perceptron, and Extreme Gradient Boost. The experimental results show higher accuracy when the models are applied to the original dataset, which includes names and their frequencies. The Multilayer Perceptron achieved 97% accuracy in the single-feature model, and Extreme Gradient Boost achieved 97% accuracy in the multi-feature model. By contrast, accuracy does not exceed 79% when the models are applied to the unique dataset (names without their frequencies).
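
As an illustration of the nine name-derived features listed in the abstract, the following minimal Python sketch shows one way they might be computed. The abstract does not define the features precisely, so several details here are assumptions rather than the authors' method: names are taken to be Latin-transliterated, the vowel set is `aeiou`, the long-name threshold `LONG_NAME_MIN_LENGTH` is a hypothetical value, and `is_contain_abo` / `is_contain_abed` are read as simple substring checks. The function name `extract_features` is likewise illustrative.

```python
# Illustrative sketch only; the definitions below are assumptions,
# not the feature definitions used in the paper.

VOWELS = set("aeiou")       # assumption: names are Latin-transliterated
LONG_NAME_MIN_LENGTH = 6    # hypothetical threshold for "long" names


def extract_features(name: str) -> dict:
    """Map a single name to the nine features named in the abstract."""
    n = name.strip().lower()
    if not n:
        raise ValueError("name must be non-empty")
    return {
        "is_longname": len(n) >= LONG_NAME_MIN_LENGTH,
        "is_vowelend": n[-1] in VOWELS,
        "is_vowelbegin": n[0] in VOWELS,
        "2_gramend": n[-2:],      # last two characters
        "2_grambegin": n[:2],     # first two characters
        "1_gramend": n[-1],       # last character
        "1_grambegin": n[0],      # first character
        "is_contain_abo": "abo" in n,
        "is_contain_abed": "abed" in n,
    }
```

Under these assumptions, each name becomes a small dictionary of boolean and character n-gram features. The n-gram features are categorical, so they would need to be encoded (for example, one-hot encoded) before being passed to classifiers such as those named in the abstract.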