Comparing the Random Forest vs. Extreme Gradient Boosting using Cuckoo Search Optimizer for Detecting Arabic Cyberbullying

Cyberbullying is a major online problem, but it is not a new phenomenon: it existed in a traditional form before the emergence of social networks. It has many consequences, including emotional and physiological effects such as depression and anxiety. Given the prevalence of this phenomenon, its importance to society, and its negative impact on all age groups, especially adolescents, this work aims to build a model that detects cyberbullying in Arabic-language comments on social media (Twitter) using the Extreme Gradient Boosting (XGBoost) and Random Forest methods. After a series of pre-processing steps, the classification accuracy for these comments was 0.861 with XGBoost and 0.849 with Random Forest. The results were then improved by using an optimization algorithm, cuckoo search, to tune the hyperparameters of both methods. The improvement was clearest for Random Forest, which reached the same accuracy as XGBoost, 0.867.


Introduction
The Internet is considered one of the daily necessities in the lives of all people. Moreover, access to the Internet is easy, without any limitations of distance or time, so getting information is very easy for anyone [1]. Easy access to the Internet gives social media the opportunity to expand its use among all Internet users, especially teenagers, who consider social media a recent trend for occupying their spare time with activities and events in the electronic space. Although the Internet is considered harmless for users, the flexibility that exists on the Internet may be a major factor in the gradual emergence of problems such as cyberbullying [2], which has recently been considered a health and national issue [3], [4]. Cyberbullying is not a new phenomenon. Rather, it existed in a traditional form before the emergence of social networks, for example, face-to-face between the bully and the victim [5] and [6]. Cyberbullying differs from this traditional form because the incident takes place in a broader field, cyberspace, so detecting bullies is difficult because the parties involved in the incident are hard to identify [7] and [8]. Among the most important consequences of cyberbullying are emotional and physiological effects, for example, depression, anxiety, panic attacks, lack of self-esteem, and low self-confidence [9] and [10]. However, the harshest and most serious consequence of cyberbullying is suicide [11] and [12]. Given the spread of this phenomenon and the importance of the subject to society, this work aims to build a model that detects cyberbullying in Arabic comments on social media using machine learning methods, and to improve the results of this model with an optimization algorithm, cuckoo search (CS).

Related Work
In [13], the goal is to collect a dataset of tweets and to evaluate and categorize them using various machine learning methods. The performance of the classifiers is shown to vary depending on the size of the data collection. Naive Bayes and ID3 performed better with balanced datasets. With imbalanced datasets, other classifiers (K-NN, Decision Tree, RF) performed better, with DT and K-NN showing superior results. With balanced datasets, the highest accuracy is 39.4%, but with unbalanced datasets, it is 82.7%.
In [14], rhetorical strategies in Arabic are defined as forms of linguistic expression that communicate thoughts and sentiments through written or spoken texts. The authors developed an XGBoost classifier for classifying multi-layer Arabic pictorial texts, a specialized research point in the Arabic language. This Arabic picture collection was used to design, train, and test the XGBoost classifier (AFC). The XGBoost classifier achieved an F1 score of 88%.
In [15], the authors use machine learning to detect fear in response to government attempts to combat the epidemic, based on social media input. Sentiment analysis is used to identify anxiety based on positive and negative input from Internet users. K-NN, Bernoulli, Decision Tree, SVM, RF, and XGBoost are among the machine learning algorithms used. The sample data was obtained by crawling YouTube comments. The highest accuracy was obtained by a random forest with count-vector feature extraction and with TF-IDF, at 84.99% and 82.%, respectively. K-NN is reported as the most accurate in testing, whereas XGBoost has the best recall.
[16] presented Arabic cyberbullying detection using Arabic sentiment analysis. The researchers used tools such as AraBully Keywords to collect data from Twitter for their study. The total number of tweets was 17,748, and after pre-processing the researchers used SVM with the WEKA and Python tools in the classification phase. They found that when using Light Stemmer, WEKA correctly classified 15,252.6312 tweets (85.49%), and when using ArabicStemmerKhoja, WEKA correctly classified 15,154 tweets (85.3843%), while Python correctly classified 14,908.32 tweets (84.03%).
In [17], the researchers present a method for categorizing textual tweets in the Arabic language into five separate groups based on their linguistic traits and content. The Support Vector Machine (SVM), Gaussian Naive Bayes (GNB), and Random Forest (RF) were tested. When employed with stemming and term frequency-inverse document frequency (TF-IDF), the RF and the SVM with a radial basis function (RBF) kernel performed similarly well, with macro-F1 scores ranging from 98.09% to 98.14%.
In [18], the purpose is to label a news item automatically based on its lexical properties. Two large datasets were compiled from several Arab news sources. The authors used a collection of 10 shallow learning classifiers on the single-label dataset. In addition, all of the tested classifiers were incorporated into an aggregate model that used the majority voting approach. The classifiers' performance on the first dataset varied from 87.7% (AdaBoost) to 97.9% across the shallow learning and multi-label deep learning algorithms. The accuracy of XGBoost was 84.7%, while the accuracy of logistic regression was 81.3%. The accuracy on the second dataset was 94.85% after it was examined by various deep learning neural networks. The Convolutional Gated Recurrent Unit (CGRU) was shown to be the top multi-label classifier.

Data Collection
The raw dataset in this work was obtained from [16] and stored in a comma-separated values (CSV) file. It contains Twitter comments written in Arabic. Some of them use words that refer to cyberbullying, such as ugly descriptions, offensive words in expressing an opinion, or descriptions of people with negative qualities. Others use positive words, for example praising specific content, speaking well of religions, or praising the performance of well-known personalities. The total number of tweets is 17,748, of which 14,178 are cyberbullying tweets and 3,570 are non-cyberbullying tweets. The dataset is therefore considered unbalanced.

Imbalanced Class Distribution
Imbalanced class distribution means that the classes do not contain equal numbers of instances. Machine learning algorithms work well when the classes have almost the same number of instances [19]. There are various techniques for handling imbalanced classes. One is downsampling, done by randomly picking instances from the largest class so that its size becomes nearly equal to that of the smallest class. The second is to resample only the training set by oversampling [19] and [20]; upsampling is done using the Synthetic Minority Oversampling Technique (SMOTE) [21]. As a third technique, we combine upsampling and downsampling.
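The two rebalancing ideas can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names `downsample` and `smote_like` are hypothetical, samples are assumed to be numeric feature vectors, and the interpolation below shows the core SMOTE idea rather than the exact procedure or library used in this work.

```python
import random


def downsample(samples, labels, seed=0):
    """Randomly keep only as many samples per class as the smallest class has."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = min(len(group) for group in by_class.values())
    out_x, out_y = [], []
    for y, group in by_class.items():
        out_x.extend(rng.sample(group, target))  # random pick without replacement
        out_y.extend([y] * target)
    return out_x, out_y


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points by interpolating between a
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)

    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: dist2(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position on the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic
```

An up-down combination would apply `smote_like` to the minority class and `downsample` to the result; as noted above, oversampling should touch only the training portion of the data.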

Methodology and Experiments
Detecting cyberbullying involves a binary classification task related to comments and distinguishing whether it is cyberbullying (denoted by C) or non-cyberbullying (denoted by N).
The general steps of the proposed model start with collecting the dataset and rebalancing it through three techniques (downsampling, upsampling, and an up-down sampling technique that combines the first two). The data is then pre-processed in seven steps:
1-Remove duplicates: This process deletes any duplicated tweets in the dataset.
2-Tokenization: This is the process of splitting tweets into words called tokens, separated by commas [22].
3-Normalization: This is the process of transforming a text into a canonical form by removing noise, such as dates, whitespace, abbreviations, acronyms, and diacritics [23], [24].
4-Stop word removal: This process filters out unnecessary data. The removal of this unnecessary data does not affect the general meaning of the text [25]. This work obtained a list of Arabic stop words from [16] and stored it in an Excel sheet (.xlsx).
5-Stemming: This is a normalization technique in which a list of distinct words is converted into shortened root words to eliminate redundancy. This is done by removing their prefixes and suffixes. Root stemming and light stemming are the two types of stemming in the Arabic language [26].
6-Lemmatization: This reduces words to their base form, taking their morphological analysis into account.
7-Padding: Since some sentences are long and others are short, zero post-padding with a maximum length equal to the longest sentence in the dataset is used to equalize the length of sentences [27].
Table (1) shows examples for each pre-processing step. The pre-processed data is split into training and testing sets. Then the feature extraction method is applied. The resulting features are classified and evaluated using the random forest and XGBoost methods before and after applying the cuckoo search optimizer. Figure (1) shows the proposed model used in this work.
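A minimal sketch of steps 1 to 4 and step 7 follows, using only the Python standard library. All function names are hypothetical, the character ranges are an assumption about what counts as noise, and stemming and lemmatization are left out because they require an Arabic morphological tool that is not specified here.

```python
import re

# Arabic diacritic marks (U+064B to U+0652) and the tatweel character (U+0640).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")
# Anything that is not an Arabic letter or whitespace is treated as noise.
NON_LETTERS = re.compile(r"[^\u0621-\u064A\s]")


def remove_duplicates(tweets):
    """Step 1: drop repeated tweets, keeping first occurrences in order."""
    return list(dict.fromkeys(tweets))


def normalize(text):
    """Step 3: strip diacritics, digits, punctuation, and extra whitespace."""
    text = DIACRITICS.sub("", text)
    text = NON_LETTERS.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def pad_sequences(sequences, pad_value=0):
    """Step 7: append zeros so every sequence is as long as the longest one."""
    max_len = max(len(s) for s in sequences)
    return [s + [pad_value] * (max_len - len(s)) for s in sequences]


def preprocess(tweets, stop_words):
    """Run steps 1-4 and 7, returning padded integer sequences and the vocabulary."""
    vocab, encoded = {}, []
    for tweet in remove_duplicates(tweets):
        tokens = []
        for raw in tweet.split():                  # step 2: tokenization
            tokens.extend(normalize(raw).split())  # step 3: normalization
        tokens = [t for t in tokens if t not in stop_words]  # step 4
        # Map each distinct token to a positive integer id (0 is reserved for padding).
        encoded.append([vocab.setdefault(t, len(vocab) + 1) for t in tokens])
    return pad_sequences(encoded), vocab
```

Reserving 0 for padding keeps padded positions distinguishable from real tokens in later stages.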

Feature extraction
Feature extraction is considered an important step in data mining and information retrieval. It transforms unstructured text into structured information that the computer can distinguish and process [28]. This work applied term frequency-inverse document frequency (TF-IDF) for the feature extraction process, as shown in Eq. (1), which uses the TF and IDF of each word in the document after obtaining vectors for all words. In Eq. (2), TF(w) signifies the frequency of the word w in the document; count(w) and count(wn) denote the number of samples including the word w in the dataset, and n denotes the number of samples containing the word w in a corpus, respectively [29]. In Eq. (3), IDF(w) denotes the inverse document frequency of the word w.
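For reference, a common textbook form of TF-IDF is given below. It is an assumption about the general shape of the weighting and may differ in detail from Eqs. (1) to (3); the symbols d (a document), N (the total number of documents), and n_w (the number of documents containing w) are introduced here only for illustration.

```latex
\mathrm{TF}(w, d) = \frac{\mathrm{count}(w, d)}{\sum_{w'} \mathrm{count}(w', d)}, \qquad
\mathrm{IDF}(w) = \log \frac{N}{n_w}, \qquad
\text{TF-IDF}(w, d) = \mathrm{TF}(w, d) \cdot \mathrm{IDF}(w)
```

Under this form, a word that appears in every document has IDF(w) = log 1 = 0 and therefore contributes nothing to the feature vector, while a word concentrated in a few documents receives a high weight.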

Machine Learning Algorithm
Machine learning (ML) refers to a computer's ability to teach itself how to make decisions using both available data and experience [30]. The available data is called "training data". Decisions to be made in ML are either classification or prediction of new things or data. A computer classifies new data or objects based on learning algorithms. If the training data is labelled by human experts, then all algorithms that depend on this type of data are called supervised learning algorithms [31]. There are many machine learning algorithms available; those used in this work are as follows:
1-Decision Tree (DT): It is one of the most well-known classification algorithms in supervised learning. It builds a prediction tree structure from annotated training datasets using information-entropy concepts [32]. DT learners take a set of labelled data and classify it using a divide-and-conquer approach. Each tree consists of leaves representing a classification class and branches representing a feature checked from the training data [33].
2-Random Forest (RF): It is an ensemble learning algorithm [34], proposed by Breiman (2001) [35]. It is composed of multiple DTs that are trained independently on random subsets of the data. RF builds many tree models from the training data, and these trees are used for prediction in later phases. Unlike a single DT, RF is less prone to the overfitting problem found in DT [36]. In its process, various training set samples are randomly created from the main training set using a bootstrap sampling method, and a DT is then constructed on each new training set. Each DT is built from a random subset of feature columns combined with a random sample of row observations. All these DTs together make up the RF [35]. RF can be used in classification and regression problems [37].
3-XGBoost (Extreme Gradient Boosting): It is an ensemble learning algorithm [38] proposed by Chen in 2016 [39] and considered one of the effective ML algorithms, with DT algorithms as its central unit. It is considered more accurate than a single DT algorithm. The XGBoost algorithm uses successive training rounds on the dataset to merge weak predictors (DTs) into a stronger predictor [35]. XGBoost is a robust ML algorithm for both classification [40] and regression [41].
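The bootstrap-and-vote idea behind RF can be illustrated with a toy ensemble of one-level trees (decision stumps). This is a simplified sketch under stated assumptions, not the classifier used in this work: the names `fit_stump`, `fit_forest`, and `predict` are hypothetical, each tree sees only one randomly chosen feature, and real RF implementations grow much deeper trees.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature):
    """One-level tree: split `feature` at the threshold with the fewest errors."""
    best = None
    for threshold in sorted({r[feature] for r in rows}):
        left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
        right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda r: left_label if r[feature] <= threshold else right_label


def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample of rows and one random column."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        feature = rng.randrange(n_features)         # random column variable
        trees.append(fit_stump([rows[i] for i in idx],
                               [labels[i] for i in idx], feature))
    return trees


def predict(trees, row):
    """Majority vote over all trees."""
    return Counter(tree(row) for tree in trees).most_common(1)[0][0]
```

Boosting methods such as XGBoost differ in that the trees are built one after another, each correcting the errors of the previous ones, rather than independently as here.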

Evaluation Metric
The aim of the proposed algorithm is to minimize the classification error rate over the validation set of the given training data [42]. This performance was assessed by accuracy and the F1-score [43]. Table (2) shows their formulas.
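Both metrics follow from the confusion counts of a binary classifier. The sketch below uses the standard definitions; the function name `accuracy_and_f1` is hypothetical, and treating the cyberbullying class C as the positive class is an assumption.

```python
def accuracy_and_f1(y_true, y_pred, positive="C"):
    """Return (accuracy, F1) for binary labels, with `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, f1
```

On an unbalanced dataset such as the raw corpus, F1 is the more informative of the two, because a classifier that always predicts the majority class can still reach a high accuracy.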

Hyperparameter Optimization
The handling of hyperparameters in machine learning allows for controlling the performance of the algorithm on the dataset. Hyperparameter optimization adjusts a set of hyperparameters of the learning algorithm to improve the performance of the ML model [44]. Because of the different types of hyperparameters, each ML algorithm has a different tuning process for its hyperparameters [45].

Cuckoo Search Algorithm
Cuckoo search is a meta-heuristic algorithm inspired by the cuckoo bird. This bird lays its eggs in the nest of another host bird because it never builds its own nest. If the host bird recognizes that the eggs are not its own, it either throws them out of its nest or abandons the nest and builds a new one. Each egg in a nest represents a solution, and a cuckoo egg represents a new and potentially better solution. The new solution is derived from an existing one with some differences [46]. There are three main rules in CS [47].
First: each cuckoo selects a nest randomly to lay its eggs in.
Second: the number of available host nests is constant, and nests with the top-quality eggs carry over to the next generations.
Third: if the host bird distinguishes the cuckoo egg, it can dispose of the egg or leave the nest and build a new one.
The number of host nests is constant, and the probability that an egg laid by a cuckoo is distinguished by the host bird is pc ∈ [0, 1]. Figure (2) shows the change in the value of the parameters using cuckoo search.
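The three rules can be sketched as a small minimization routine. This is a simplified illustration under stated assumptions: the function name `cuckoo_search` and its parameters are hypothetical, new eggs are generated with a Gaussian perturbation for brevity (cuckoo search is commonly described with Lévy flights), and each non-best nest is abandoned independently with probability `pc`.

```python
import random


def cuckoo_search(objective, bounds, n_nests=15, pc=0.25, n_iter=100,
                  step=0.1, seed=0):
    """Minimise `objective` over the box `bounds` with a simplified cuckoo search.

    Returns (best_solution, best_score, best_score_per_iteration).
    """
    rng = random.Random(seed)

    def random_nest():
        return [rng.uniform(lo, hi) for lo, hi in bounds]

    def clip(x):
        return [min(max(v, lo), hi) for v, (lo, hi) in zip(x, bounds)]

    nests = [random_nest() for _ in range(n_nests)]
    scores = [objective(n) for n in nests]
    history = []
    for _ in range(n_iter):
        # Rule 1: a new egg (a perturbed copy of an existing solution)
        # is laid in a randomly chosen nest and kept only if it is better.
        source = rng.randrange(n_nests)
        egg = clip([v + rng.gauss(0, step * (hi - lo))
                    for v, (lo, hi) in zip(nests[source], bounds)])
        target = rng.randrange(n_nests)
        egg_score = objective(egg)
        if egg_score < scores[target]:
            nests[target], scores[target] = egg, egg_score

        # Rule 2: the best nest always carries over to the next generation.
        best = min(range(n_nests), key=scores.__getitem__)
        # Rule 3: every other nest is discovered with probability pc
        # and replaced by a newly built random nest.
        for i in range(n_nests):
            if i != best and rng.random() < pc:
                nests[i] = random_nest()
                scores[i] = objective(nests[i])

        best = min(range(n_nests), key=scores.__getitem__)
        history.append(scores[best])
    return nests[best], scores[best], history
```

For hyperparameter tuning, `objective` could return one minus the cross-validated accuracy of a classifier trained with the hyperparameter values in the candidate vector, with `bounds` taken from the hyperparameter ranges.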

Hyperparameters in XGBoost and RF
XGBoost and RF classifiers have many kinds of hyperparameters that need to be determined and passed to the optimization algorithm (CS). Table 3 shows the RF and XGBoost classifiers' hyperparameters, ranges, and default values [48], [40].

Experimental Setup
Before feature extraction and classification, the dataset must be split into (80-20)% or (70-30)% train-test datasets. This division allows the model to be trained and then tested on unseen data. Each training and test set was passed separately through the feature extraction process and then to the classifier stage. Figure (3) shows the corpora used in the experiments and the number of instances in each of the four corpora: Raw dataset (R), Downsampling dataset (D), Upsampling dataset (U), and Up+Down sampling dataset (UD).

Figure 3: Main Corpora and Number of Instances
All experiments in this work used stratified 4-fold cross-validation for the 70-30% train-test split and 5-fold cross-validation for the 80-20% train-test split, with an inner loop (over the training set) and an outer loop (over the main dataset).
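Stratification means that every fold keeps roughly the same class ratio as the full dataset. A minimal sketch of one level of such a split is shown below; the function names are hypothetical, and the inner loop described above would simply repeat the same procedure on each training portion.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign sample indices to k folds so that each fold keeps the class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across the folds.
        for pos, i in enumerate(indices):
            folds[pos % k].append(i)
    return folds


def train_test_splits(labels, k, seed=0):
    """Yield (train_indices, test_indices), holding out one fold at a time."""
    folds = stratified_folds(labels, k, seed)
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test
```

With k = 5, each iteration holds out one fifth of the samples, which corresponds to an 80-20% train-test split.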

Results and Discussion
A total of 24 sets of experiments were trained and tested on different corpora using two classifiers. Tables (4) and (5) show the classification performance (accuracy, F1-score) on the different corpora for XGBoost and RF, respectively. In Table (4), classification was done with an XGBoost classifier on the four corpora with light stemming, root stemming, and lemmatization. The highest accuracy was obtained with root stemming on the R corpus, while the D corpus gave almost the lowest accuracy, which means that downsampling does not work well with this type of dataset and classifier. In Table (5), classification was done with an RF classifier on the four corpora with light stemming, root stemming, and lemmatization. The highest accuracy was obtained with root stemming on the UD corpus, while the R and D corpora gave nearly the lowest accuracy.
After the classification process and before the optimization process, the range of each hyperparameter in both classifiers must be determined. Table (6) shows the range for each hyperparameter in both XGBoost and RF. The optimization process then applies the CS algorithm to the R and UD corpora, since they show higher accuracy with a smaller number of iterations. Table (7) shows that, for the XGBoost classifier, the highest accuracy with 80% training data was achieved on the R corpus with lemmatization, with a scale-pos-weight of 1.171, a gamma of 1.665, a min-child-weight of 0.268, and an eta of 0.735; the best accuracy with 70% training data was achieved on the R corpus with a light stemmer. As shown in Table (8), for the RF classifier the highest accuracy was achieved on the UD corpus with light stemming for both 80% and 70% training data, with max depth values of 1000 and 879, respectively.
Following all of the previous experiments with various corpora and comparisons with previous works, Table (9) shows that, without the cuckoo search optimizer, the best accuracy is achieved by XGBoost. Cuckoo search improves the accuracy of both classifiers, and they reach the same accuracy. RF responded strongly to the CS optimizer, which raised its accuracy to equal that of XGBoost.

Conclusion
In this work, the unbalanced dataset problem was addressed by producing three corpora, U, D, and UD, for upsampling, downsampling, and up-down sampling, respectively. The D corpus shows almost the lowest accuracy with the XGBoost classifier, since much information was lost during the rebalancing operation.
In the RF classifier, the R corpus shows the worst accuracy, while the UD corpus shows the best accuracy. This indicates that the random forest algorithm is inefficient with unbalanced datasets. In the XGBoost classifier, the R corpus shows the best accuracy, which indicates that the XGBoost algorithm is robust to the unbalanced dataset. The cuckoo search optimizer improves the accuracy of both RF and XGBoost to 0.867.

Figure 1: The proposed model

Figure 2: Change Value of the Parameters Using a Cuckoo Search

Table 3: Hyperparameters in XGBoost and RF

Table 5: RF classifier on average of 5-fold cross-validation

Table 6: The Range of Hyperparameters

Table 7: Optimization Stage with XGBoost Classifier

Table 8: Optimization Stage with RF Classifier

Table 9: Comparison with Previous Work