The Effect of False Predictions of Machine Learning on the Security of the Big Data Environment

Ammar Hatem Farhan; Omar Salah F. Shareef; Rehab Flaih Hasan

doi:10.24996/ijs.2025.66.1.29

Authors

Ammar Hatem Farhan, Computer Center, University of Fallujah, Anbar, Iraq
Omar Salah F. Shareef, Computer Center, University of Fallujah, Anbar, Iraq, https://orcid.org/0000-0002-2171-5203
Rehab Flaih Hasan, Computer Sciences Department, University of Technology, Baghdad, Iraq

DOI:

https://doi.org/10.24996/ijs.2025.66.1.29

Keywords:

SQLI, Logistic Regression, Big Data, Confidentiality, Integrity, Availability

Abstract

The exchange of data between customers and organizations has become a major target for hackers who seek illegal access to this data, compromising the three main components of information security: confidentiality, integrity, and availability (CIA). Structured query language injection (SQLI) is one of the most common forms of cyberattack. However, most previous research has examined only SQLI attacks that target web-based applications. Little attention has been paid to the time needed to classify the type of SQLI payload sent by the client within the vast amounts of data needed to build machine learning models. Additionally, no study has examined the risk of machine learning models making false predictions and how such errors affect the three information security principles.

To address this gap, this research aims to create a model that serves as an intermediate protective interface linking the client layer and the database server, protecting their communication against SQLI attacks. It also aims to shorten the time required to identify the client's request type and, finally, to study the impact of false predictions of machine learning algorithms on CIA. The proposed method trains a model using logistic regression (LR) with the Spark ML library, which processes big data containing SQL payloads (both harmful and benign).
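The idea of a logistic regression classifier acting as a gate between the client and the database server can be illustrated with a minimal sketch. It is not the authors' implementation: it uses plain Python rather than Spark ML, and the toy payloads, the hand-picked features, and the names `features`, `train`, and `gate` are all assumptions made for illustration.

```python
import math
import re

# Hypothetical toy training set: (payload, label), 1 = malicious, 0 = benign.
TRAIN = [
    ("' OR '1'='1", 1),
    ("1; DROP TABLE users --", 1),
    ("admin' --", 1),
    ("1 UNION SELECT password FROM users", 1),
    ("alice", 0),
    ("42", 0),
    ("new york", 0),
    ("john.smith@example.com", 0),
]

KEYWORDS = re.compile(r"\b(union|select|drop|or|and)\b")


def features(payload):
    """Map a payload to a small hand-picked feature vector (an assumption)."""
    text = payload.lower()
    return [
        1.0,                                  # bias term
        float(text.count("'")),               # number of single quotes
        1.0 if "--" in text else 0.0,         # SQL line comment present
        1.0 if ";" in text else 0.0,          # statement separator present
        float(len(KEYWORDS.findall(text))),   # count of selected SQL keywords
    ]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train(samples, lr=0.5, epochs=3000):
    """Fit logistic regression weights with batch gradient descent."""
    xs = [features(p) for p, _ in samples]
    ys = [float(y) for _, y in samples]
    n = len(xs)
    w = [0.0] * len(xs[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj / n
        w = [wi - lr * gj for wi, gj in zip(w, grad)]
    return w


def predict_proba(payload, w):
    """Estimated probability that the payload is malicious."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, features(payload))))


def gate(payload, w, threshold=0.5):
    """Decide whether to forward the request to the database or block it."""
    return "block" if predict_proba(payload, w) >= threshold else "forward"


WEIGHTS = train(TRAIN)
print(gate("x' OR 'a'='a", WEIGHTS))
print(gate("bob", WEIGHTS))
```

In a big data setting the same structure would presumably be expressed as a Spark ML pipeline (feature extraction followed by a logistic regression stage); the gate itself only needs the fitted model and a decision threshold.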

Compared with previous studies, the proposed model achieved outstanding results, with an accuracy of 98.10%, a precision of 98.13%, a recall of 98.10%, and an F1 score of 98.10%. The results also showed that the time needed to detect and prevent such attacks was only 0.09 seconds.
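For readers unfamiliar with the four reported metrics (accuracy, precision, recall, and F1), the sketch below shows how they follow from the counts of correct and false predictions. The confusion-matrix counts are hypothetical and are not taken from the paper. The comments linking each error type to a CIA property reflect one common reading, stated here as an assumption rather than as the paper's finding: a false negative lets an attack reach the database, and a false positive blocks a legitimate request.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute standard metrics, treating 'malicious' as the positive class.

    tp: attacks correctly blocked
    fp: benign requests wrongly blocked (assumed to hurt availability)
    fn: attacks wrongly forwarded (assumed to hurt confidentiality/integrity)
    tn: benign requests correctly forwarded
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for illustration only.
metrics = binary_metrics(tp=90, fp=5, fn=10, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these made-up counts, the 10 false negatives are the requests that would reach the database despite being attacks, while the 5 false positives are legitimate requests that would be refused.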