SMS Spam Detection Using Multiple Linear Regression and Extreme Learning Machines

With the growth in the use of mobile phones, people have become increasingly interested in the Short Message Service (SMS) as their preferred communication service. The popularity of SMS has also given rise to SMS spam, which refers to any unwanted message sent to a mobile phone as a text. Spam may cause many problems, such as traffic bottlenecks or the theft of important user information. This paper presents a new model that extracts seven features from each message and then applies Multiple Linear Regression (MLR) to assign a weight to each extracted feature. The message features are fed into an Extreme Learning Machine (ELM) to determine whether the message is spam or ham. To evaluate the proposed model, the UCI benchmark dataset was used. The proposed model produced recall, precision, F-measure, and accuracy values of 98.7%, 93.3%, 95.9%


Introduction
The Short Message Service (SMS) has become one of the most common means of communication among millions of people as a result of the advancement of mobile communication technologies and the expansion of mobile phone capabilities, with messages transmitted over standard communication protocols [1]. There are numerous reasons for the widespread use of SMS messages. Most users read SMS every day, whereas emails may go unread for several days. SMS is inexpensive and, in some cases, free. Finally, the number of mobile phone users has increased dramatically in many countries, reaching several million in the United States and China [2].
Spam is defined as unsolicited and unwanted electronic messages containing potentially malicious content [3]. Email spam and SMS spam are both types of spam. Email spam is sent via the internet, whereas SMS spam is typically sent via mobile devices [4]. A spammer is an individual or group of individuals who collectively send unsolicited messages using multiple methods of communication [5]. Spam messages come in a variety of forms. Some are marketing communications that try to advertise something. Other kinds of spam can transmit malware or fool a receiver into giving away personal information, which leads to the theft of important data such as passwords, private photos, and credit card details. In addition, spam messages have a number of drawbacks, such as traffic throttling and the consumption of storage space and computational power [6,7].
With the increased use of text messaging, the problem of SMS spam is becoming more prevalent. There are a number of security methods available to combat SMS spam, but they are not yet mature [8]. Many Android apps exist on the Play Store to block spam texts; however, most consumers are unaware of them. Apart from apps, the available filtering approaches are primarily focused on email spam, as email spam is a well-known problem. However, with the growth of mobile devices, SMS spam has become one of the most common concerns [9]. Because of the low cost of SMS, it can be considered the simplest way for attackers to phish important information from mobile devices. Attackers are trying to discover new ways to steal sensitive information from users via SMS because SMS is the most common method of communication. Phishing can be accomplished by sending a malicious link via SMS and inviting the recipient to open the link [10].
There are many reasons for the increase in SMS spam. Since there are no applications that can filter spam SMS for both senders and recipients, spam messages may be delivered unfiltered to users' mobile phones. Unlike e-mail and other online communication services, SMS does not need the Internet to reach the user reliably. Another reason is the availability of inexpensive bulk SMS senders that can deliver large numbers of SMS simultaneously [11]. To address the problem of SMS spam, spam messages must be identified as soon as they arrive.
This paper introduces a new model that uses MLR to weight features and ELM to classify messages as spam or ham (legitimate messages). The following are the main contributions:
1. Determine the importance of features in detecting spam messages by assigning a weight to each feature using MLR.
2. Investigate the impact on classification performance of combining the proposed MLR-based feature weighting with ELM.
The paper is divided into five sections. Section 1 is the introduction. Section 2 reviews related work on spam detection. Section 3 introduces the proposed spam detection model. Section 4 presents the experimental results. Finally, Section 5 concludes the paper.

Related Work
Recently, researchers have made many efforts to develop a variety of approaches to identify SMS spam messages. However, many of the publicly available SMS spam filtering methods are still in the early stages of classification and are not very mature or reliable [12]. Generally, SMS spam detection approaches have been divided into two categories: content-based approaches and collaborative-based approaches. The content-based approach analyzes the content of SMS text messages to decide whether a message is spam or not. Some techniques use the full set of extracted features to detect spam SMS, while others use feature selection methods for detection purposes. The collaborative-based approach depends on usage and user experience [13].
Suleiman and Al-Naymat (2017) [14] developed a classifier that uses H2O as a platform for comparing different machine learning algorithms. The proposed methods are based on deep learning, random forests, and Naive Bayes. The UCI dataset was used in the experiment. The random forest gave the best results, with precision, recall, F-measure, and accuracy of 96%, 86%, 91%, and 97.7%, respectively. Popovac et al. (2018) [15] proposed a CNN-based architecture with one layer of convolution and pooling to filter SMS spam. They achieved an accuracy of 98.4%. In addition, Kaliyar et al. (2018) [16] proposed a model that uses different machine learning algorithms to filter messages from many varieties of English, including Singaporean, American, and Indian English. The model attained a high degree of precision for Indian English. Pumrapee et al. (2019) [17] proposed an SMS spam detection method based on long short-term memory (LSTM) and gated recurrent unit (GRU) algorithms. The experimental results show that the accuracy of the model reached 98%. Sjarif (2019) [18] developed a method based on computing the term frequency-inverse document frequency (TF-IDF) and applying several machine learning algorithms to classify messages as spam or ham. The Random Forest algorithm outperformed the other algorithms with an accuracy of 97.50%.
A new method was introduced by Kumar (2020) [19] that classifies a message as ham or spam and then goes further to identify smishing messages among the spam messages. The proposed model is based on two main phases. The first selects eleven features from each message to classify the message as "ham" or "spam". The second phase is applied only to spam messages in order to separate smishing messages from other spam. The random forest gave the best results among many classification algorithms, with 94.9% accuracy.
Hameed and Ali (2021) [20] developed a new method based on binary particle swarm optimization (PSO) to select the best fuzzy rules. A set of six features was extracted from an SMS spam dataset and introduced as input to the fuzzy system to generate a set of suitable rules for classification purposes. Finally, binary PSO was used to pick the best rules, which decreases the complexity and improves the system's performance. The experimental results showed that the recall, precision, F-measure, and accuracy of the proposed system were 98.8%, 90.8%, 94.6%, and 98.5%, respectively.

Proposed Spam Detection Model
The details of the employed dataset, preprocessing, and feature extraction technique are explained first, followed by the proposed strategy for addressing the SMS spam problem.

Dataset Collection
This paper uses the widely adopted UCI repository dataset for performance evaluation. The dataset consists of 5,574 English SMS messages classified as spam or ham. The number of spam and ham messages is shown in Table 1. The dataset contains two columns: the message string and its class (ham or spam). These messages were collected from the National University of Singapore [21,22].
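The two-column layout described above can be read with a few lines of Python. This is a minimal sketch, assuming each record is stored as one line of text with the class label first and a tab before the message; that file layout and the function name `parse_sms_lines` are assumptions made for illustration and are not stated in the paper.

```python
def parse_sms_lines(lines):
    """Parse 'label<TAB>message' lines into (messages, labels).

    Labels are encoded as 1 for spam and 0 for ham.
    The tab-separated, label-first layout is an assumption.
    """
    messages, labels = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        label, _, text = line.partition("\t")
        if label not in ("ham", "spam"):
            raise ValueError(f"unexpected label: {label!r}")
        messages.append(text)
        labels.append(1 if label == "spam" else 0)
    return messages, labels
```

Any iterable of strings works as input, so an open text file handle can be passed directly.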

Preprocessing and Feature Extraction
To prepare the dataset, tokenization, stop word removal, and stemming were applied. Then the feature extraction process started. Feature extraction is the process of converting the actual data into numerical values that can be processed while keeping the dataset information. The feature extraction stage plays an important role in the detection of spam messages because the selection of features can have a considerable impact on the success of machine learning approaches. As a result, finding the most helpful features that can effectively classify SMS spam messages is, in most situations, a difficult task. In this paper, a set of seven features F={f1,f2,…,f7} is extracted from each message, as follows:
1. Message length: The number of characters in the message. SMS spam tends to be longer than ham.
2. Number of words: A spam message usually contains more words than a ham message.
3. Uniform Resource Locators (URLs): The existence of URLs in the message. The value of this feature is either 1 or 0.
4. Special characters: The number of special characters in the message, such as "+", "&", "!", and "$". Spam messages usually contain special characters since spammers use them for various aims.
5. Capitalized words: A spam message usually contains capitalized words to gain the user's attention.
6. Service words: Spammers use words like "join" or "loan" to trick users into joining their services or helping them to get loans.
7. Dot symbol: The existence of the dot symbol in a message is a good sign of a ham message, since the dot is used to separate sentences and in chatting.
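The seven features can be sketched in Python as follows. This is an illustrative sketch rather than the authors' implementation: it works on the raw message text, the URL pattern is a simple heuristic, the service-word and special-character sets contain only the examples named above, and counting fully uppercase words for f5 is an assumption.

```python
import re

# Hypothetical lists: the text names only "join"/"loan" and "+", "&", "!", "$".
SERVICE_WORDS = {"join", "loan"}
SPECIAL_CHARS = set("+&!$")
# Simple URL heuristic (an assumption, not the paper's rule).
URL_PATTERN = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)


def extract_features(message):
    """Return the seven features [f1, ..., f7] for one SMS message."""
    words = message.split()
    # Strip surrounding punctuation so "loan!" matches "loan".
    stripped = [w.strip(".,!?;:\"'()") for w in words]
    return [
        float(len(message)),                                        # f1: message length
        float(len(words)),                                          # f2: number of words
        1.0 if URL_PATTERN.search(message) else 0.0,                # f3: URL present
        float(sum(ch in SPECIAL_CHARS for ch in message)),          # f4: special characters
        float(sum(w.isupper() and len(w) > 1 for w in stripped)),   # f5: capitalized words
        float(sum(w.lower() in SERVICE_WORDS for w in stripped)),   # f6: service words
        1.0 if "." in message else 0.0,                             # f7: dot symbol present
    ]
```

Stacking the returned lists for all messages gives the feature matrix with one row per message and seven columns.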

MLR-based Feature weighting
Regression analysis is a statistical technique for estimating the relationship between variables in a cause-and-effect relationship. MLR helps in assigning a weight to each feature, thus improving the system's performance. In MLR, there is one dependent variable and more than one independent variable [23,24]. Equation (1) illustrates the formula for MLR, where [Y] is the output vector and [F] contains the extracted features (independent variables). Equation (2) is used to compute the weights of the features.
The regression model can also be represented in matrix form, where n is the number of messages used to train the model; Y is the desired output, whose value is 1 for spam and 0 for ham; (F1, F2, …, F7) are the seven extracted features; and W represents the feature weights to be estimated. The final feature values are computed as illustrated in Equation (3).
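The weighting step can be sketched in Python under two assumptions. First, the model is taken to have the form $Y = FW$ with no intercept, so that the weights are the ordinary least squares solution $W = (F^{\top}F)^{-1}F^{\top}Y$. Second, Equation (3) is taken to scale each feature column by its weight, as described in the feature analysis. The function names and the use of NumPy are illustrative choices, not taken from the paper.

```python
import numpy as np


def mlr_feature_weights(F, y):
    """Least-squares weights W minimizing ||F W - y||^2 (no intercept assumed).

    F: (n, 7) feature matrix, y: (n,) labels with 1 = spam, 0 = ham.
    """
    W, *_ = np.linalg.lstsq(F, y, rcond=None)
    return W


def apply_weights(F, W):
    """Scale each feature column by its weight (assumed form of Equation (3))."""
    return F * W  # broadcasting multiplies column i by W[i]
```

`np.linalg.lstsq` is used instead of forming the inverse explicitly because it remains stable when feature columns are strongly correlated, as message length and word count are likely to be.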

ELM for Spam Detection
Extreme Learning Machine (ELM) was proposed by Huang in 2006 as a new learning algorithm. ELM is a training algorithm for a single-hidden-layer feedforward neural network (SLFN) [25]. Unlike gradient-based methods, the weights between the input layer neurons and the hidden layer neurons in the ELM algorithm are randomly assigned and frozen during the training phase [26]. Generally, ELM consists of two main phases: the first creates the hidden layer output matrix with random hidden neurons, and the second finds the output connections [27]. Compared with the backpropagation (BP) neural algorithm, ELM has advantages in convergence speed and in avoiding local optima [25]. The ELM can be expressed as in Equation (4).
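Assuming the standard SLFN formulation of ELM, which matches the symbols defined in the next paragraph, Equation (4) can be written as below. The network output $o_j$ and the number of training messages $N$ are symbols introduced here for illustration.

```latex
\sum_{i=1}^{l} \beta_i \, g\left(w_i \cdot x_j + b_i\right) = o_j,
\qquad j = 1, \dots, N \tag{4}
```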
Here, n is the number of input nodes, l is the number of hidden nodes, and m is the number of output nodes; xj is the input vector; wi is the weight vector between the input nodes and the ith hidden node; g is the activation function, which is usually nonlinear; bi is the bias of the ith hidden node; and βi is the output-layer weight of the ith hidden node.
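The two ELM phases can be sketched in Python as follows. This is a minimal illustration, assuming a sigmoid activation, normally distributed random input weights, a Moore-Penrose pseudoinverse for the output weights, and a 0.5 decision threshold; none of these choices, nor the class name `ELM`, is specified in the paper.

```python
import numpy as np


class ELM:
    """Single-hidden-layer network with random, frozen input weights."""

    def __init__(self, n_hidden=50, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # Hidden layer output g(w_i . x_j + b_i) with a sigmoid g (assumed).
        return 1.0 / (1.0 + np.exp(-(X @ self.w + self.b)))

    def fit(self, X, y):
        # Phase 1: draw input weights and biases at random and keep them fixed.
        self.w = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        # Phase 2: solve H @ beta = y for the output weights in closed form.
        self.beta = np.linalg.pinv(H) @ y
        return self

    def predict(self, X):
        # Label a message as spam (1) when the network output reaches 0.5.
        return (self._hidden(X) @ self.beta >= 0.5).astype(int)
```

Because only `beta` is learned, training reduces to one linear solve, which is the source of the speed advantage over iterative backpropagation.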

Experimental Results
This section evaluates the proposed model and compares its results with those of state-of-the-art models. The model is assessed using recall, precision, F-measure, and accuracy, which are computed from the following counts:
TP is the number of spam messages classified as spam.
TN is the number of ham messages classified as ham.
FN is the number of spam messages classified as ham.
FP is the number of ham messages classified as spam.
Also, the Receiver Operating Characteristic (ROC) curve is calculated. In a ROC curve, the true positive rate (recall) is plotted on the y-axis, whereas the false positive rate (FPR) is plotted on the x-axis. A ROC curve is used to assess the classifier's performance. Equation (9) explains how to calculate the FPR.
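The evaluation metrics can be computed from the four counts as in the following Python sketch. It assumes the standard definitions: recall = TP/(TP+FN), precision = TP/(TP+FP), F-measure as the harmonic mean of precision and recall, accuracy = (TP+TN)/(TP+TN+FP+FN), and the false positive rate FP/(FP+TN) that conventionally forms the x-axis of a ROC curve. The function name and the convention of returning 0.0 for an empty denominator are illustrative choices.

```python
def classification_metrics(y_true, y_pred):
    """Compute metrics with spam = 1 as the positive class and ham = 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Return 0.0 instead of failing when a denominator is empty.
        return num / den if den else 0.0

    recall = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return {
        "recall": recall,
        "precision": precision,
        "f_measure": ratio(2 * precision * recall, precision + recall),
        "accuracy": ratio(tp + tn, len(pairs)),
        "fpr": ratio(fp, fp + tn),
    }
```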

Feature Analysis
The performance of the proposed model was evaluated using a number of tests. The main objective is to measure the model's ability to detect SMS spam and to compare it with other classification algorithms. Initially, a preprocessing step was performed to prepare the dataset, and then a set of seven features was extracted based on spam and ham message behaviors. Then, MLR was applied to compute the weight of each feature. Finally, the proposed model classified each message as either spam or ham. Figure 1 shows the distribution of the most frequent words that appear in ham and spam messages. It is clear from the figure that there is an overlap between spam and ham words, which makes distinguishing between them difficult.
Figure 1: Most frequent words in spam and ham messages.
As is known, different features have varying degrees of significance for a given learning problem. Some features are less significant, while others are critical. Instead of relying on feature selection to identify the most significant features, supplying the algorithm with the importance of each feature can increase its speed and accuracy. MLR was applied to satisfy two objectives: first, to compute feature weights, which show the significance of each feature in the detection model; second, to use these weights to assign new values to the features, as illustrated in Equation (3), which helps to increase the separation between spam and ham features. Figure 2 shows the weight of each feature in the proposed model. A prediction failure occurs in a classification algorithm when the features of different classes are close to each other. Thus, increasing the separation between features improves the accuracy of the classifier. The proposed method increases the separation between ham and spam features by multiplying each feature by its weight. As shown in Figure 2, features that identify spam messages have higher weights than features that identify ham messages. Therefore, multiplying the features by their weights increases the separation, which leads to more accurate results. It is clear from Figure 2 that the message length feature has the highest weight among all the features. Other important features include the URL, the number of words, capitalized words, special characters, service words, and the dot symbol.

Comparisons with Existing Approaches
The model was evaluated using 8-fold cross-validation to estimate how accurate the predictions will be in practice. In each fold, the model was trained on the training portion and then evaluated on the held-out testing portion using all the evaluation metrics. The final values were obtained by taking the mean over the folds. Figure 3 shows the ROC curve, where the Area Under the Curve (AUC) equals 0.984. It is clear from the figure that the classifier performs well and achieves a clear split between spam and ham messages.
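The cross-validation procedure can be sketched in Python as follows. The sketch assumes the messages are shuffled once with a fixed seed and split into eight nearly equal folds, and it averages accuracy only, for brevity. The helper names `k_fold_indices` and `cross_validate` and the `fit_predict` callback are hypothetical and are not taken from the paper.

```python
import numpy as np


def k_fold_indices(n_samples, k=8, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)


def cross_validate(fit_predict, X, y, k=8, seed=0):
    """Return the mean test accuracy over k folds.

    fit_predict(X_train, y_train, X_test) must return predicted labels
    for X_test; it stands in for any classifier.
    """
    folds = k_fold_indices(len(y), k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the i-th, then test on the i-th.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        y_pred = fit_predict(X[train_idx], y[train_idx], X[test_idx])
        scores.append(np.mean(y_pred == y[test_idx]))
    return float(np.mean(scores))
```

The other metrics can be averaged in the same way by replacing the accuracy line with the metric of interest.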
To validate the results of the proposed model, a comparative analysis was carried out between existing state-of-the-art models and the proposed model, as shown in Table 2. The proposed model outperforms the other state-of-the-art models, especially in terms of accuracy. There are two main reasons behind the good results obtained. First, the new strategy of assigning feature weights based on MLR has a direct effect on the separation of the features that distinguish between spam and ham messages. Second, the ELM has good classification performance. As shown in Table 2, the proposed model tends to be more effective at handling spam messages and less effective at handling ham messages because most of the selected features work toward identifying spam messages.

Conclusion and Future Work
SMS is one of the most common ways for people to communicate. There are two categories of SMS messages: spam and ham. This paper introduces a new model to detect SMS spam messages based on feature weighting and ELM. MLR is used to weight the seven extracted features, and ELM is used to determine whether SMS messages are spam or ham. According to the weighting algorithm, the message length has the highest weight among the features, while the dot symbol has the lowest. Furthermore, the results show that the proposed method outperforms other state-of-the-art models, particularly in terms of accuracy, which is regarded as the most important factor in determining classification algorithm performance.
A limitation of this work is that some features depend on the text messages being written in English. As a result, future research should consider using a similar approach to filter spam and ham text messages written in other languages by replacing these features with language-independent ones. In addition, the dataset used in the experiment contains only a few thousand messages. In the future, a larger dataset can be used to further validate the performance of the model.

Figure 2: Feature weights of the proposed model.

Figure 3: ROC curve for the proposed model.

Table 1: Number of ham and spam messages in the dataset

Table 2: Comparisons between the proposed model and other models