Gender Classification Based on Iraqi Names Using Machine Learning
DOI:
https://doi.org/10.24996/ijs.2024.65.11.42
Keywords:
Gender classification, machine learning techniques, Iraqi names, multi features, unique dataset
Abstract
In machine learning, classification is the task of building a model that predicts the class of an element from its attributes, given a set of examples. This work aims to classify people by gender based on their names. Two models were developed. The first is based on a single feature, the name itself. The second is built upon nine features derived from the name: is_longname, is_vowelend, is_vowelbegin, 2_gramend, 2_grambegin, 1_gramend, 1_grambegin, is_contain_abo, and is_contain_abed. Two datasets were used: the first was collected from the Ministry of Labor and Social Affairs, and the second was gathered from the Iraqi university website. Both datasets contain many unusual Iraqi names as well as spelling errors, which pose a real challenge for classification. Five machine learning methods were applied and tested within the developed models: Random Forest, Naive Bayes, Logistic Regression, Multilayer Perceptron, and Extreme Gradient Boost. The experimental results show higher accuracy when the models are applied to the original dataset, which includes names and their frequencies. The Multilayer Perceptron achieved 97% accuracy in the single-feature model, while Extreme Gradient Boost achieved 97% accuracy in the multi-feature model. In contrast, accuracy does not exceed 79% when the models are applied to the unique dataset (names without their frequencies).
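The nine name-derived features listed in the abstract could be computed along the following lines. This is only a minimal sketch, not the authors' implementation: the abstract gives the feature names but not their definitions, so the vowel set, the length threshold (`LONG_NAME_MIN_LENGTH`), the use of Latin-transliterated names, and the substrings "abo" and "abed" are all assumptions made for illustration, as is the function name `extract_features`.

```python
# Hypothetical definitions: the abstract names the features but does not define them.
VOWELS = set("aeiou")        # assumed vowel set for transliterated names
LONG_NAME_MIN_LENGTH = 6     # assumed threshold for "long" names


def extract_features(name: str) -> dict:
    """Derive the nine features named in the abstract from a single name."""
    n = name.strip().lower()
    if not n:
        raise ValueError("name must be non-empty")
    return {
        "is_longname": len(n) >= LONG_NAME_MIN_LENGTH,
        "is_vowelend": n[-1] in VOWELS,
        "is_vowelbegin": n[0] in VOWELS,
        "2_gramend": n[-2:],      # last two characters
        "2_grambegin": n[:2],     # first two characters
        "1_gramend": n[-1],       # last character
        "1_grambegin": n[0],      # first character
        "is_contain_abo": "abo" in n,    # assumed substring match
        "is_contain_abed": "abed" in n,  # assumed substring match
    }
```

A dictionary of this shape mixes boolean and string values, so it would need to be encoded numerically (for example with scikit-learn's `DictVectorizer`) before being passed to any of the five classifiers mentioned above.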
License
Copyright (c) 2024 Iraqi Journal of Science
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.