An Integrated Information Gain with A Black Hole Algorithm for Feature Selection: A Case Study of E-mail Spam Filtering

The current issues in spam email detection systems are directly related to spam email classification's low accuracy and feature selection's high dimensionality. However, in machine learning (ML), feature selection (FS) as a global optimization strategy reduces data redundancy and produces a collection of precise and acceptable outcomes. A black hole algorithm-based FS algorithm is suggested in this paper for reducing the dimensionality of features and improving the accuracy of spam email classification. Each star's features are represented in binary form, with the features being transformed to binary using a sigmoid function. The proposed Binary Black Hole Algorithm (BBH) searches the feature space for the best feature subsets, and feature selection is based on a fitness function that is proportional to the accuracy achieved using a Naive Bayesian Classifier (NBC). When measuring the performance of the BBH with the SpamBase dataset, the performance of the classifier and the dimension of the selected feature vector used as a classifier input are considered. The experiments revealed that the BBH can produce good FS results even with a small set of selected features. This shows that when utilizing the NBC-based BBH, good spam email categorization accuracy is possible.


Introduction
Email is widely regarded as the most dependable and effective mode of communication, but it has recently become a major target for cyberattacks. Spam or junk emails account for a significant portion of these attacks, as they are distributed through various protocols such as the simple mail transfer protocol (SMTP) [1] [2]. As a consequence, spam emails may pose a risk to government institutions [3] [4]. In general, spam email detection relies on correctly classifying emails into spam and non-spam categories.
The majority of contemporary spam detection frameworks use machine learning approaches to classify spam emails [5][6][7]. However, selecting the ideal input feature subsets for the classifiers, which is done through an FS process, is a serious issue in email categorization. Meanwhile, most classifiers, such as the Artificial Neural Network (ANN), Support Vector Machine (SVM), and NBC, suffer from the problem of excessive data dimensionality, which is related to the FS process [7][8][9][10][11][12][13]. High data dimensionality is thought to be mitigated by restricting the feature space and lowering the message's huge number of features. Irrelevant features can affect categorization accuracy. They can also affect the amount of time it takes to train a classifier, the cost of features, and the number of learning instances needed [14], [15].
Nature was the main source of inspiration for researchers in developing different types of optimization algorithms [16]. Nature-inspired algorithms have been used to improve the performance of various machine learning models [17][18][19][20][21]. Swarm-based and evolutionary methods such as Ant Colony Optimization (ACO) [22][23][24], Genetic Algorithm (GA) [25][26][27], Artificial Bee Colony (ABC) [28], [29], Particle Swarm Optimization (PSO) [30], [31], Bat Algorithm (BA) [32][33][34], and Harmony Search Algorithm (HSA) have recently been used to solve FS problems [35]. The Black Hole algorithm (BH) has been developed recently for solving different optimization problems. It simulates the natural phenomenon of a black hole in the universe [36].
ISSN: 0067-2904
In [37], the problem of feature selection was addressed using a novel hybrid model that combines the whale optimization algorithm and the flower pollination algorithm. This model is founded on the idea of opposition-based learning. Experiments were conducted in two phases to assess the performance of that algorithm. In the first phase, ten feature selection datasets from the UCI data repository were used. In the second phase, the algorithm was tested on spam email detection. According to the first phase's results, the method outperformed other basic meta-heuristic algorithms in terms of average selection size and classification accuracy on the 10 UCI datasets. The results of the second phase also demonstrated that the algorithm was able to detect spam emails accurately [37].
The authors of [38] investigated the potential effects of adversarial scenarios on the safety of machine learning-based systems such as email spam filters. They created and tested three intrusive strategies, namely synonym replacement, ham-word injection, and spam word spacing, using natural language processing (NLP) and the Bayesian model as examples. The adversarial examples and results suggest that these techniques are effective in fooling the machine learning models [38].
Three effective binary solutions to the FS problem were described in [39], all of which were based on the Symbiotic Organisms Search (SOS) algorithm. For the binarization of the SOS in the first and second techniques, several S-shaped and V-shaped transfer functions were employed; these techniques were referred to as BSOSS and BSOSV. In the third technique, two new operators known as the binary mutualism phase (BMP) and the binary commensalism phase (BCP) were presented for the binarization of the SOS, resulting in the Efficient Binary SOS (EBSOS). Eighteen standard UCI datasets were used to test the suggested methods, which were then compared with basic and prominent meta-heuristic algorithms [39].
In [40], email spam detection was applied to both text and image datasets. The primary contribution of that work is the development of adaptive capsule networks and multi-objective feature selection for email spam detection. Two feature extraction methods, Term Variance (TV) and Term Frequency-Inverse Document Frequency (TF-IDF), are used when working with text datasets, while Fisher Discriminant Analysis (FDA), Walsh-Hadamard Transform (WHT), and color correlograms are utilized when working with image datasets. The hybrid meta-heuristic method Grey-Sail Fish Optimization (G-SFO) performs the multi-objective feature selection, because the extracted feature vectors are long and the training complexity needs to be reduced [40].
The primary goal of this study is to use a hybrid soft computing model to improve the classification performance of an e-mail spam filtering system. The proposed model consists of a machine learning classifier, the Naïve Bayesian Classifier (NBC), for classifying the e-mails, and a hybrid filter-wrapper feature selection algorithm, which is based on the Information Gain (IG) and Black Hole (BH) algorithms.
The remainder of the paper is organized as follows: The usual BH and NBC are described in Section 2, whereas the proposed algorithm is explained in Section 3. The experimental results are depicted in Section 4. Finally, the study's conclusion is presented in the final part.

Black hole algorithm
A black hole (BH) is a region in space and time (x, y, and t) that has a very strong gravitational field from which nothing can escape. Based on the general relativity concept, a sufficiently compact mass is required to deform space-time to generate a BH. Surrounding the BH is a mathematically defined surface known as the event horizon, which marks the point of no return, as anything that gets closer to it or crosses the Schwarzschild radius gets drawn into the BH and disappears forever. In short, a BH is a region of space with so much mass that nothing, including light, can escape its gravitational pull.
The three main components of the BH algorithm are as follows. First, there is the black hole, which represents the best candidate (or solution) among all the options at any given iteration. Second, there are the "stars," which represent the other candidate solutions. The BH is not created at random; it is one of the real candidates in the population. Finally, in the movement component, all candidates are moved towards the black hole based on their present location and a random number. The BH algorithm has been successfully applied to a variety of optimization problems, including training a CNN [41], solving the traveling salesman problem (TSP) [42], and exploring the stars via Levy Flight for global optimization problems and data clustering [43].

The proposed algorithm
The proposed algorithm consists of two main phases. In the first phase, the IG method is executed. For each feature in the dataset, a specific weight is calculated using Equation (1). Equation (1) consists of three main parts. In the first part, the overall entropy for the training set is calculated. In the second and third parts, the entropy of the feature is calculated. Therefore, the equation can be simplified as follows:
$IG(T, a) = H(T) - H(T \mid a)$ (2)
where $H$ denotes the information entropy, $T$ represents the training set, and $a$ is the feature being weighted. After calculating the IG values for all features in the dataset, the features are sorted in ascending order. The features with low IG values are removed (unselected), while the rest are kept for the next stage. A threshold value is used to determine the unselected features.
The second phase, which is the binary BH algorithm, is executed after the first phase is done. As stated previously, the BH algorithm is a population-based algorithm that consists of a number of stars, each of which contains a number of dimensions based on the nature of the optimization problem itself. In this study, the number of dimensions is equal to the number of features remaining after applying IG. Moreover, each star holds two representations of its solution. The first represents the solution in a continuous form, while the second represents it in a binary form that is converted from the first. Figure 1 illustrates the solution representation of each star in the population.
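The initial position of each star is generated as follows:
$x_i = lower + rand \times (upper - lower)$ (3)
where $x_i$ represents an individual star in the population, $rand$ is a random number in [0, 1], and upper and lower represent the boundaries of the search space. Equation (3) generates the positions of each star; the values of the positions are continuous and must be converted into binary form in order to determine the selected and unselected features. The binary sequence is generated by comparing the values of the positions with the threshold 0.5, as follows:
$b_{ij} = \begin{cases} 1, & x_{ij} > 0.5 \\ 0, & \text{otherwise} \end{cases}$ (4)
where $x_{ij}$ is the $j$-th position value of star $x_i$ and $b_{ij}$ is its binary counterpart; 0 represents a removed or unselected feature and 1 represents a selected feature.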
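The initialization and binarization of a star can be sketched as below. This is a minimal illustration rather than the authors' implementation: the function names, the default bounds of 0 and 1, and the strict comparison with 0.5 are assumptions, and the value 45 is simply the 57 SPAMBASE features minus the 12 removed by IG.

```python
import random

def init_star(num_features, lower=0.0, upper=1.0, rng=random):
    """Draw one continuous position value per feature within [lower, upper]."""
    # Assumption: positions are uniform random values between the two bounds.
    return [lower + rng.random() * (upper - lower) for _ in range(num_features)]

def binarize(position, threshold=0.5):
    """Map continuous positions to a 0/1 mask (1 = selected, 0 = unselected)."""
    # Assumption: a value must be strictly greater than the threshold to be selected.
    return [1 if value > threshold else 0 for value in position]

# One star over the 45 features that remain after the IG filter (57 - 12).
star = init_star(num_features=45, rng=random.Random(0))
mask = binarize(star)
```

Both forms are kept for each star: the continuous one is what the search moves, and the binary one is what the classifier sees.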

4-Calculate the fitness Function:
The fitness function of the proposed algorithm aims to reduce the classification error rate over the validation set of the supplied training data, as indicated in Equation (5), while increasing the number of non-selected (irrelevant) features. A classifier must be used to calculate the fitness function; in this study, the Naïve Bayesian Classifier (NBC) is used.
$Fitness = 100 - Accuracy$ (5)
where Accuracy denotes the classification accuracy rate of the classifier, in percent, obtained with 5-fold cross-validation.
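A wrapper fitness of this kind can be sketched as follows. The sketch makes several assumptions: `fit_predict` is a hypothetical callable standing in for training the NBC and predicting labels, the folds are formed by simple interleaving without shuffling, and an empty feature subset is given the worst possible cost.

```python
def select_columns(rows, mask):
    """Keep only the feature columns whose mask bit is 1."""
    return [[v for v, keep in zip(row, mask) if keep] for row in rows]

def k_fold_accuracy(X, y, fit_predict, k=5):
    """Accuracy in percent, accumulated over k interleaved folds."""
    n = len(X)
    correct = 0
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        # fit_predict is a hypothetical stand-in for "train NBC, then predict".
        preds = fit_predict([X[i] for i in train_idx],
                            [y[i] for i in train_idx],
                            [X[i] for i in test_idx])
        correct += sum(1 for p, i in zip(preds, test_idx) if p == y[i])
    return 100.0 * correct / n

def fitness(mask, X, y, fit_predict, k=5):
    """Error rate in percent, i.e. 100 minus the cross-validated accuracy."""
    if not any(mask):
        return 100.0  # assumption: an empty subset gets the worst cost
    return 100.0 - k_fold_accuracy(select_columns(X, mask), y, fit_predict, k)
```

Because the search minimizes this value, a star whose mask yields higher cross-validated accuracy has a lower cost.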

5-Determine the Black Hole
In each iteration, the best solution with the lowest error should be determined and set as the black hole (BH).

6-Update the position of each star
As illustrated in Figure 1, the solution of each star is represented in two forms. In order to move the stars in the search space, the first array, which is in continuous form, should be updated via the following equation:
$x_i^{new} = x_i^{old} + rand \times (BH - x_i^{old})$ (6)
where BH represents the black hole, or the best solution in the current iteration, $rand$ is a random number in [0, 1], and $x_i^{new}$ and $x_i^{old}$ represent the new and the old position of the star $x_i$, respectively.
The updated position should be re-converted into binary form because the binary values, or the sequence of 0s and 1s, have changed. The continuous values should be converted using Equation (4). Then, the star should be re-evaluated using the fitness function (Step 4).
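The movement step can be sketched as below. It assumes that a single random number is drawn per star (drawing one per dimension would be an equally plausible reading) and that the binary form is obtained with the 0.5 threshold; the function names are illustrative.

```python
import random

def move_star(position, black_hole, rng=random):
    """Pull a star's continuous position towards the black hole."""
    step = rng.random()  # assumption: one random number per star
    return [x + step * (bh - x) for x, bh in zip(position, black_hole)]

def to_mask(position, threshold=0.5):
    """Re-derive the binary form after the continuous position has changed."""
    return [1 if value > threshold else 0 for value in position]
```

After `move_star`, the new mask from `to_mask` is what gets passed to the fitness function again.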

7-Calculate and Check the Event Horizon (𝑅)
In this step, the new positions of all stars are evaluated to determine whether they have crossed the event horizon. The event horizon ($R$) is calculated in each iteration based on the cost of the black hole (which should be reset using Step 5) using the following equation:
$R = \frac{f_{BH}}{\sum_{i=1}^{N} f_i}$ (7)
where $f_{BH}$ is the cost of the black hole, $f_i$ is the cost of the $i$-th star, and $N$ is the number of stars. The cost of each star in the population is compared with the value of $R$, and any star with a lower cost than $R$ is eliminated and regenerated using initialization (Step 1).
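A minimal sketch of this check is given below, assuming the event horizon is the black hole's cost divided by the summed cost of the population. It follows the rule as worded in the text, comparing each star's cost with the event horizon; a formulation that compares a distance to the black hole instead would need a different test. The function names are illustrative.

```python
def event_horizon(bh_cost, star_costs):
    """Ratio of the black hole's cost to the total cost of the population."""
    total = sum(star_costs)
    return bh_cost / total if total > 0 else 0.0

def stars_to_regenerate(star_costs, radius):
    """Indices of stars whose cost is below the event horizon."""
    return [i for i, cost in enumerate(star_costs) if cost < radius]
```

The returned indices identify the stars that would be replaced by freshly initialized ones.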

8-Check the Stop Condition
The stop condition in the proposed BBH algorithm is the number of iterations, which is a fixed number. The algorithm stops when it reaches that number; otherwise, it executes the movement and regeneration steps again. To be more specific, if the number of loops is still lower than the number of iterations, then go to Step 5.

9-Calculate the final results using the evaluation metrics, and print the final results.
The steps above are summarized in the flowchart of the proposed algorithm, given in Figure 2.
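The overall loop can also be sketched in code. This is an illustrative outline under stated assumptions, not the authors' implementation: `cost_fn` is a hypothetical callable that maps a 0/1 mask to an error value (for example, 100 minus the NBC accuracy), the bounds default to 0 and 1, one random number is drawn per star, and the black hole itself is never regenerated.

```python
import random

def binary_black_hole(cost_fn, num_features, num_stars=10, iterations=100,
                      lower=0.0, upper=1.0, seed=None):
    """Return (best_mask, best_cost) found by a binary Black Hole search."""
    rng = random.Random(seed)

    def new_star():
        return [lower + rng.random() * (upper - lower) for _ in range(num_features)]

    def to_mask(position):
        return [1 if value > 0.5 else 0 for value in position]

    def best_index(costs):
        return min(range(len(costs)), key=costs.__getitem__)

    # Initialize the population and evaluate every star.
    stars = [new_star() for _ in range(num_stars)]
    costs = [cost_fn(to_mask(star)) for star in stars]
    bh = best_index(costs)
    best_mask, best_cost = to_mask(stars[bh]), costs[bh]

    for _ in range(iterations):
        # Step 5: the star with the lowest cost is the black hole.
        bh = best_index(costs)
        # Step 6: move every other star towards the black hole and re-evaluate.
        for i in range(num_stars):
            if i == bh:
                continue
            step = rng.random()
            stars[i] = [x + step * (b - x) for x, b in zip(stars[i], stars[bh])]
            costs[i] = cost_fn(to_mask(stars[i]))
        # Step 7: reset the black hole, compute the event horizon, regenerate.
        bh = best_index(costs)
        total = sum(costs)
        radius = costs[bh] / total if total > 0 else 0.0
        for i in range(num_stars):
            if i != bh and costs[i] < radius:
                stars[i] = new_star()
                costs[i] = cost_fn(to_mask(stars[i]))
        # Remember the best solution seen so far.
        bh = best_index(costs)
        if costs[bh] < best_cost:
            best_mask, best_cost = to_mask(stars[bh]), costs[bh]

    return best_mask, best_cost
```

A call such as `binary_black_hole(cost_fn, num_features=45, num_stars=50, iterations=300)` would mirror the largest configuration mentioned in the experiments, with `cost_fn` supplied by the user.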

Results and evaluation
The dataset used for evaluating the proposed filtering system is SPAMBASE, which is a very popular dataset. It consists of around 4600 samples (emails), divided into two main classes: spam (1813 ≈ 39%) and not-spam (2788 ≈ 61%). Each sample or email has already been processed and converted into a feature vector. In other words, the feature extraction step has already been completed, and the resulting vectors contain 57 attributes for each sample.
As explained in the previous section, the proposed filtering system consists of two main stages. In the first stage, the information gain method, which is a filter method, is used to calculate the weights of all features, while the second stage is the BBH algorithm. The results of the first stage are displayed in Table 1. Using the backward feature selection method, the features with weights lower than the threshold value are removed, while the rest of the features are kept for the next stage. The value of the threshold, determined using the trial-and-error method, is 0.06. There are 12 features with values lower than this threshold. These features are F12, F13, F14, F33, F36, F38, F39, F40, F45, F49, F50, and F51. The remaining features are used in the following stage (BBH algorithm).
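The first-stage filter can be sketched as follows, computing the information gain of a feature as the entropy of the class labels minus the conditional entropy given that feature. The sketch assumes discrete feature values; continuous attributes such as those in SPAMBASE would first need to be discretized, a step the text does not describe. The function names are illustrative, and the default threshold is the 0.06 reported above.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the entropy remaining once the feature is known."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

def filter_by_ig(columns, labels, threshold=0.06):
    """Return the indices of features whose weight reaches the threshold, plus all weights."""
    weights = [information_gain(col, labels) for col in columns]
    kept = [i for i, w in enumerate(weights) if w >= threshold]
    return kept, weights
```

Here `columns` holds one list of values per feature, so the kept indices identify the features passed on to the BBH stage.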
As mentioned in the previous section, each experiment has been executed 10 times for different numbers of iterations and stars. Thus, there are 5 × 3 × 10 = 150 runs. The results are illustrated in Figure 3 (a to e). Each panel presents the accuracy of all runs (10 times) for 100, 200, and 300 iterations and a specific number of stars. Furthermore, each panel depicts the overall accuracy average.
In general, the suggested algorithm outperformed the original NBC in terms of accuracy, as BBH assisted NBC by selecting the most relevant features. Based on all 57 features, NBC's original accuracy was about 79.41%, whereas BBH's lowest result was around 88%.
Figure 3 shows that the number of stars has a great impact on the searching process of the BBH algorithm; as the number of stars increased, the classification accuracy increased as well. The main reason is that the possibility of finding better results is higher when more stars are utilized for the searching process. For example, if the population size is equal to 40 stars, then there are 40 possible solutions that are trying to reach better positions in a single iteration. These figures also showed that the number of iterations has an effect on reaching the best solutions; as the number of iterations increased, the algorithm reached better classification accuracy.
The best results obtained by BBH are compared with other email spam filtering approaches. These related works were chosen with care because they rely on the same dataset. Table 3 presents the results. The algorithms used for the comparison are: support vector machine (SVM), K-nearest neighbor (KNN), ant colony optimization (ACO), genetic algorithm (GA), particle swarm optimization (PSO), negative selection algorithm (NSA), and distinguish feature selection (DFS) algorithm. The first three models are used to classify the SPAMBASE dataset based on all features (i.e., features = 57), while the rest of the models were enhanced using feature selection algorithms.

Conclusion
In this study, the Black Hole algorithm was used to select the most relevant features that will enhance the accuracy and prediction performance of the NBC. The Black Hole algorithm was combined with information gain before converting its positions to binary (0 represents unselected features and 1 represents selected features). The NBC was used in the proposed algorithm as a fitness function for evaluating the solutions. The proposed algorithm was able to enhance the classification accuracy up to around 91% when the number of stars was equal to 50. For future studies, the proposed algorithm can be used for solving different feature selection problems, such as those in network intrusion detection systems or different medical datasets, such as those for heart disease or diabetes.

Figure 1: Solution representation for the stars

Figure 2: The flowchart of the proposed algorithm

Figure 3: The results. (a) Results obtained using 10 stars; (b) results obtained using 20 stars; (c) results obtained using 30 stars; (d) results obtained using 40 stars; (e) results obtained using 50 stars.

Table 1: Results of Information Gain for all features

Table 2 below presents the other evaluation metrics, which are the precision, recall, and F-measure. It can be seen from the table that the algorithm is stable and produces high precision and recall.
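For reference, these three metrics can be computed from true and predicted labels as in the sketch below. It assumes spam is the positive class, encoded as 1, and uses the standard definitions; the function name is illustrative.

```python
def precision_recall_f_measure(y_true, y_pred, positive=1):
    """Precision, recall, and F-measure for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # The F-measure is the harmonic mean of precision and recall.
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure
```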

Table 3: Comparison results