A Smishing Detection Method Based on SMS Contents Analysis and URL Inspection Using Google Engine and VirusTotal

Smishing is the delivery of phishing content to mobile users via a short message service (SMS). SMS allows cybercriminals to reach out to mobile end users in a new way, attempting to deliver phishing messages, mobile malware, and online scams that appear to be from a trusted brand. This paper proposes a new method for detecting smishing by combining two detection methods. The first method is uniform resource locators (URL) analysis, which employs a novel combination of the Google engine and VirusTotal. The second method involves examining SMS content to extract efficient features and classify messages as ham or smishing based on keywords contained within them using four well-known classifiers: support vector machine (SVM), random forest (RF), adaptive boosting (AdaBoost), and extreme gradient boosting (XGBoost). The best results of the proposed method were 98.5%, 96.9%, 93.1%, and 95.05% in terms of accuracy, precision, detection rate, and F1-score, respectively. Furthermore, the evaluation results of the proposed method outperformed the state-of-the-art and showed that the proposed method is effective in detecting smishing messages.


Introduction
Phishing is a harmful attack used to gain access to online users' sensitive financial or private data through illegitimate websites that appear authentic. Social engineering techniques are commonly used in phishing attacks to divert victims to malicious websites. Typically, an e-mail that appears to come from a trusted source is sent to users, encouraging them to change their login information by following a hyperlink [1]. The attack uses deceptive techniques to trick internet users into disclosing personal information, including usernames, passwords, credit card details, and bank account information, in the belief that the website is legitimate [2]. As shown in Figure 1 [3], mobile phone usage has been rising. This has led to an increase in information crime; one such crime is smishing. It is a form of spam that has a significant negative impact on many users' everyday lives: users waste a lot of time dealing with spam, which attracts attention but may include unanticipated dangerous attachments that can badly compromise the user's system [4]. A smishing SMS, for example, informs the recipient that they have won a prize or a sum of money, or that they need to resolve an issue with their bank card or electronic account. Short message service (SMS) is one of the most popular communication methods [5]. Attackers prefer text messages for targeting victims because they can reach a large number of people with a low-cost SMS subscription. These messages contain a link to malware or to phishing websites that ask the user for sensitive information. In the malware case, the malware is downloaded to the user's mobile device and then performs malicious operations on it [6].
The unstructured nature of SMS text message data and the nonlinearity involved in interpreting it make distinguishing between phishing and legitimate SMS a challenging task. Smishing detection models based on checking the legitimacy of Uniform Resource Locators (URLs) and analyzing SMS content are proposed in this paper using a variety of machine learning algorithms. The following are the main contributions of this paper:
• Proposing a new method that combines the Google engine and VirusTotal to examine the authenticity of the URL in the SMS.
• Examining text messages to extract several features capable of distinguishing smishing messages from legitimate SMS by adopting TF-IDF with a new strategy.
• Applying different machine learning algorithms to judge the performance of the proposed smishing detection.
The remainder of the paper is structured as follows: Section 2 presents anti-smishing-related works. Section 3 explains the preliminary concepts. Section 4 presents the proposed anti-smishing model. Section 5 provides and explains the research results. Section 6 concludes and presents future work.

Related work
Researchers have proposed several approaches to combat smishing attacks, including content-based, URL behavior analysis, and heuristic techniques. Some of these works are discussed below. Mishra and Soni [6] presented an approach based on the combination of URL behavior analysis and message content for smishing detection. The system uses SMS content analysis, a machine learning classifier, and an examination of URL behavior for phishing SMS classification. In the first phase, the presence of email IDs, phone numbers, or URLs is discovered by filtering the content of the text messages. To calculate word frequency, they used Term Frequency-Inverse Document Frequency (TF-IDF), and to classify the smishing messages, a OneVsRest classifier was used. The benefit of analyzing URLs is that Android Application Package (APK) downloads are detected; at the same time, the source code is inspected to see whether a form tag exists. Joo et al. [7] proposed a smishing detection system to inspect and block phishing SMS. The presence of a URL in the message is examined. Their system includes four parts: the SMS monitor, analyzer, determinant, and a database. The researchers applied a Naïve Bayesian classifier (NB) to distinguish phishing SMS from legitimate ones.
A combination of content-based and machine-learning algorithms for a smishing detection system was suggested by Sonowal and Kuppusamy [8]. Dimensionality reduction based on the Pearson correlation coefficient was used to reduce the number of features. The system extracted 39 features, of which 20 discriminative features were selected. Jain and Gupta [9] proposed content-based filtering with a rule-based approach. The researchers implemented nine rules and three algorithms for message classification: Repeated Incremental Pruning To Produce Error Reduction (RIPPER), Decision Tree (DT), and PRISM. The results were positive, and the system can detect zero-day attacks. A model of smishing detection was suggested by Goel and Jain [10]. The authors implemented NB to distinguish smishing messages from legitimate ones. The messages were converted to a standard format using text normalization techniques, and the system also checked URLs, phone numbers, and APK downloads. The URL blacklist used in this model is of limited effectiveness because malicious URLs are frequently updated.
A heuristic-based algorithm was introduced by Jain and Gupta [11] for smishing detection with the use of feature selection and machine learning algorithms. The system selects ten features by analyzing the content of the messages and classifies the messages using classification algorithms.
A system based on a combination of the heuristic method and content-based feature extraction with machine learning classifiers was proposed by Jain et al. [12], using a two-phase classification. The first phase distinguishes spam from ham. The second phase filters smishing messages, so the system can detect both spam and smishing messages. Feature selection is also applied to extract relevant features using Information Gain (IG), selecting 11 and 4 features for spam and smishing, respectively.
Sonowal [13] offered a combination of content feature extraction and four correlation algorithms for ranking features, namely Spearman's correlation, Pearson rank correlation, point biserial rank correlation, and Kendall rank correlation. The system achieved 98.40% accuracy with the AdaBoost classifier.
Another smishing detection model, introduced by Mishra and Soni [14], consisted of a domain checking phase and an SMS classification phase. The first phase checks the authenticity of the URL in the SMS, which leads to phishing detection, and the second phase processes the text content of the messages by extracting discriminant features. The proposed work used the Backpropagation (BP) algorithm, RF, NB, and DT for message classification. The system obtained 97.93% accuracy.
A content-based model was suggested by Ulfath et al. [15]. They developed an automated system able to differentiate smishing messages from legitimate ones. The proposed work has multiple steps, including feature extraction and selection and machine learning classification with Extreme Gradient Boosting (XGBoost), RF, Classification And Regression Tree (CART), SVM, and AdaBoost. SVM was ranked above the other classifiers because it showed the best result with the minimum number of features. Shravasti and Chavan [16] proposed a smishing detection model based on artificial intelligence. The suggested model begins with pre-processing and extracting some effective features (term function, URL, email address, mobile number, number of characters, and currency symbol). Finally, classification techniques such as Long Short-Term Memory Recurrent Model (LSTM), K-Neighbors (KNN), Stochastic Gradient Descent (SGD), DT, NB, and RF are used to separate smishing messages from legitimate ones. In this model, the LSTM showed the best accuracy of 95.11%. "SM Detector" was introduced by Ghourabi [17] as an anti-smishing mechanism in the mobile environment. The proposed system consists of three consecutive parts. The first part uses the VirusTotal API to check the authenticity of URLs. The second part searches for blacklisted words or numbers in the message's content by applying regular expressions. The last part is the core of the work and uses the BERT classification method. This method achieved 99.63% accuracy on both Arabic and English datasets. Jain et al. [18] proposed an intelligent system to detect smishing using a URL classifier and a text classifier. The authors used two datasets, one for smishing text and one for URLs, and applied an oversampling technique for data balancing. The overall accuracy of the proposed approach was 99.03%, with a precision of 98.94%.
The limitation of the works presented in [6]-[9], [11]-[15], [16], and [18] was that they did not verify the validity of the URL. However, the works presented in [14] and [17] attempt to avoid this limitation by detecting URL legitimacy using either the Google engine or VirusTotal.
This paper attempts to circumvent the limitations of the previous works by proposing a new method for URL inspection that combines two inspection techniques: the Google engine and VirusTotal. This is followed by SMS classification. Table 1 compares the proposed method with other smishing detection methods from various perspectives. The domain names are verified by Google, while VirusTotal determines whether the SMS URLs are malicious or not, and APK download detection is used for checking file contents. Content analysis for extracting features and feature selection are taken into account because they have an impact on smishing detection. Finally, a heuristic method depends on distinctive features from both smishing and legitimate SMSs.

Preliminary concepts
The following subsections provide a background relating to chi-square, and machine learning algorithms including SVM, RF, adaptive boosting (AdaBoost), and XGBoost.

Chi-square
The chi-square ($\chi^2$) test is used in statistics to determine the independence of two events. The events X and Y are considered independent when Eq. (1) is satisfied [19]. Chi-square is used to see if the observed data matches the expected data, as described in Eq. (2).
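For reference, the two statements above are usually written as follows. These are the standard textbook forms, which Eq. (1) and Eq. (2) are assumed to correspond to; the symbols $O_k$ and $E_k$ (observed and expected frequency in cell $k$ of the contingency table) are notation chosen here, not taken from the paper.

```latex
% Assumed form of Eq. (1): independence of events X and Y
P(X \cap Y) = P(X)\,P(Y)

% Assumed form of Eq. (2): chi-square statistic over all cells k
\chi^2 = \sum_{k} \frac{(O_k - E_k)^2}{E_k}
```

A value of $\chi^2$ near zero means the observed counts are close to what independence would predict; larger values indicate stronger dependence.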

Machine learning
Machine learning algorithms are computational processes that use input data to perform desired tasks without being explicitly programmed to do so. These algorithms are "soft-coded" in the sense that they automatically change or adapt their architecture through iteration to perform the desired task. Training is the adaptation process in which samples of input data are given together with the desired results. The algorithm then generalizes, not only to achieve the desired result when the training input is presented but also to produce the desired result when new, unseen data is presented [20].
Machine learning uses a variety of algorithms to address data issues. Data scientists emphasize that no single algorithm works well for every situation. The type of algorithm used depends on the type of problem being solved, the number of variables, the type of model that works best, etc. [21].
In this paper, four well-known machine learning algorithms are adopted, namely support vector machine, random forest, adaptive boosting, and extreme gradient boosting.

Support vector machine
Support Vector Machine is a popular and effective machine learning algorithm. SVM is based on the structural risk minimization criterion and seeks the optimal separating hyperplane with the largest separating margin. It improves the learning machine's generalization ability and handles problems such as nonlinear separation, high-dimensional data, and classification tasks that lack prior knowledge [22].
The following two points summarize its main concept. First, it builds a nonlinear kernel function that represents the inner product of the feature space, which corresponds to a nonlinear mapping of the data from the input space into a potentially high-dimensional feature space. Thus, a linear algorithm can be used to analyze the nonlinear properties of samples in the feature space. Second, it applies the structural risk minimization principle from statistical learning theory by constructing the optimal hyperplane with the greatest margin between the two classes [23].
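The first point can be illustrated with the radial basis function (RBF) kernel, the kernel the experiments later report using. The sketch below is a minimal illustration rather than the paper's implementation; the function name `rbf_kernel` and the value of `gamma` are assumptions made for the example.

```python
import math


def rbf_kernel(x, z, gamma=0.5):
    """Return K(x, z) = exp(-gamma * ||x - z||^2).

    The value equals an inner product in an implicit high-dimensional
    feature space, so the mapping itself never has to be computed.
    `gamma` is an illustrative choice, not a value from the paper.
    """
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


if __name__ == "__main__":
    print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))  # identical points
    print(rbf_kernel([0.0, 0.0], [1.0, 1.0]))  # points at squared distance 2
```

Identical points give the maximum similarity of 1, and the similarity decays toward 0 as the points move apart.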

Random forest
Breiman proposed the idea of RF in 2001 [24]. An RF is a set of tree predictors in which each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. As the number of trees in a forest grows large, the generalization error converges to a limit. The generalization error of a forest of tree classifiers is determined by the strength of the individual trees in the forest and the correlation between them [25].

Adaptive boosting
Adaptive boosting was first proposed in 1995 by Yoav Freund and Robert Schapire. They motivated the algorithm with the analogy of a horse-racing gambler. A new gambler asks experienced gamblers how to select the best horse to bet on. They, in turn, offer useful suggestions based on their own experiences.
The AdaBoost algorithm generates a set of weak learners by keeping a collection of weights over the training data and adaptively adjusting them after each weak learning cycle. The weights of training samples misclassified by the current weak learner are increased, while the weights of correctly classified samples are decreased [26,27].
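The reweighting step can be sketched as follows. This is a minimal illustration of one round of the standard discrete AdaBoost update, under the assumption that this is the variant meant; the function name `adaboost_reweight` and the example data are hypothetical.

```python
import math


def adaboost_reweight(weights, correct):
    """Perform one round of the discrete AdaBoost weight update.

    weights: current sample weights, summing to 1.
    correct: one boolean per sample, True if the current weak learner
             classified that sample correctly.
    Returns (alpha, new_weights), where alpha is the learner's vote weight.
    """
    # Weighted error of the current weak learner.
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    if not 0.0 < error < 0.5:
        raise ValueError("weak learner must have weighted error in (0, 0.5)")
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Misclassified samples are scaled up, correct ones scaled down.
    updated = [w * math.exp(-alpha if ok else alpha)
               for w, ok in zip(weights, correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


if __name__ == "__main__":
    alpha, new_weights = adaboost_reweight([0.25] * 4, [True, True, True, False])
    print(alpha, new_weights)
```

With four equally weighted samples and one mistake, the misclassified sample ends up holding half of the total weight, so the next weak learner is pushed to get it right.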

Extreme gradient boosting
Extreme Gradient Boosting is a scalable tree boosting method designed for computational efficiency and economical use of memory. It applies to both regression and classification problems. It creates a weak learner at each step and adds it to the overall model. When the weak learner at each step is determined by the gradient direction of the loss function, the resulting models are called Gradient Boosting Machines (GBM) [28].

The proposed smishing detector
The main concept of the proposed model is to use two analysis phases to differentiate between smishing messages and ham messages. The first phase uses the Google engine and the VirusTotal API to identify malicious URLs, while the second phase uses machine learning algorithms to identify suspicious content that was not detected in the first phase. Consider the SMS collection $S$ of $N$ messages, $S = \{s_1, s_2, \dots, s_N\}$. Each message $s_i$ is composed of words, numbers, and so on. In addition, a label $y_i$ is associated with each message $s_i$ (i.e., there is a label vector $Y = \{y_1, y_2, \dots, y_N\}$). The proposed smishing detector model categorizes the SMS $s_i$ as ham or smishing depending on the analysis of the behaviour of the URL in the SMS and of its contents. Moreover, to detect smishing, it is necessary to identify discriminative features that distinguish smishing from ham. To support the machine learning algorithms, a set of $n$ features $F = \{f_1, f_2, \dots, f_n\}$ is extracted from $S$.

URL inspection
A smishing attack can be difficult to detect, especially because both legitimate and smishing messages use shortened URLs. Therefore, a new method is proposed that combines the Google engine and VirusTotal for inspecting URLs. To the best of our knowledge, this is the first time a URL has been investigated using the Google engine in combination with the VirusTotal API to identify malicious functionality.
A new regular expression is proposed to describe a URL search pattern. The proposed regular expression that can effectively extract URLs from SMS is (http[s]?\S+)|(HTTP[s]?\S+)|(www.\S+)|(WWW.\S+). The existence of a URL is checked for each message $s_i \in S$. If no URL exists, the message is passed to the content analysis phase. Otherwise, the URL is extracted and inspected by the Google search engine and the VirusTotal API. Algorithm 1 clarifies the URL inspection phase.
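The extraction step can be sketched with Python's standard `re` module. The pattern below is the one quoted above, read on the assumption that each `S+` stands for `\S+` (one or more non-whitespace characters); the function name `extract_urls` is hypothetical.

```python
import re

# Pattern from the text, assuming "\S+" throughout. The dot after
# "www" is unescaped, as quoted, so it matches any single character.
URL_PATTERN = re.compile(r"(http[s]?\S+)|(HTTP[s]?\S+)|(www.\S+)|(WWW.\S+)")


def extract_urls(message):
    """Return every substring of `message` matched by the URL pattern."""
    return [match.group(0) for match in URL_PATTERN.finditer(message)]


if __name__ == "__main__":
    print(extract_urls("Claim your prize at http://bit.ly/abc now"))
    print(extract_urls("Visit www.example.com today"))
    print(extract_urls("See you at 5"))
```

Because `\S+` runs to the next whitespace, punctuation that directly follows a URL is captured with it, so a production version would probably trim trailing punctuation.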
The first inspection of the URL is performed by the Google engine. To validate the URL, its domain name is extracted. In addition, the Natural Language Toolkit (NLTK) is used, through TextBlob, to extract all nouns in the message. The extracted nouns and domain are submitted to the Google engine. The top five Google search results are selected and compared to the extracted domain name and the nouns. The second inspection is performed by the VirusTotal API, which analyses the behaviour of the URLs in $s_i$. VirusTotal is a web service that analyzes URLs and files to detect suspicious or malicious content. It reports whether the extracted URLs are malicious by comparing them with URL databases maintained by antivirus companies such as Bitdefender and Kaspersky. If the URL is not found in the top Google search results or is declared malicious by VirusTotal, the message is considered smishing. Otherwise, the message is passed to the next phase, content analysis.
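The decision rule of this phase can be sketched as below. To avoid guessing at any real service API, the two lookups are passed in as plain callables: `search_top_results` and `count_flagging_vendors` are hypothetical stand-ins for the Google search and the VirusTotal query, and matching search results by domain equality is a simplifying assumption.

```python
from urllib.parse import urlparse


def domain_of(url):
    """Return the lower-cased host of `url` without a leading 'www.'."""
    if "://" not in url:
        url = "http://" + url  # urlparse needs a scheme to find the host
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def inspect_url(url, nouns, search_top_results, count_flagging_vendors, top_k=5):
    """Label a message 'smishing' or pass it on to 'content_analysis'.

    search_top_results(query) -> list of result URLs (hypothetical stand-in).
    count_flagging_vendors(url) -> number of vendors flagging the URL
    (hypothetical stand-in).
    """
    domain = domain_of(url)
    query = " ".join([domain, *nouns])
    results = search_top_results(query)[:top_k]
    found_by_search = any(domain_of(result) == domain for result in results)
    flagged = count_flagging_vendors(url) != 0
    if not found_by_search or flagged:
        return "smishing"
    return "content_analysis"


if __name__ == "__main__":
    fake_search = lambda query: ["https://www.mybank.com/login"]
    print(inspect_url("http://mybank.com/offer", ["bank"], fake_search, lambda u: 0))
    print(inspect_url("http://evil.example/x", ["bank"], fake_search, lambda u: 0))
```

Injecting the two lookups keeps the rule itself testable offline; real clients would replace the lambdas.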

Preprocessing
The first important step in SMS content analysis is preprocessing, which prepares the message for analysis. Preprocessing involves the following steps:
1. Token identification: the message is divided into tokens, which are delimited by spaces.
2. Stopword exclusion: the stopwords are removed from the set of tokens identified in the previous step, and a list of keywords is generated. In addition, all punctuation is removed.
3. Stem generation: the tokens are then stemmed to identify their root form, which increases the frequency of the words. For example, the words "studying" and "studied" are converted to the word "study".
4. Currency symbols, numbers, phone numbers, email IDs, and URLs are converted to specific words, as shown in Table 2, so that they can be processed effectively by feature extraction and their weights in the messages are increased.
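The four steps above can be sketched as a small pipeline. Everything specific here is an assumption made for illustration: the placeholder words (`urllink`, `emailid`, and so on) stand in for the entries of Table 2, the stopword list is a tiny sample, and `toy_stem` is a crude stand-in for a real stemmer. The replacements run first so that punctuation removal does not break URLs and email addresses apart.

```python
import re
import string

# Tiny illustrative stopword list (assumption, not the paper's list).
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "you", "your", "at"}

# Placeholder words are hypothetical stand-ins for Table 2.
REPLACEMENTS = [
    (re.compile(r"https?://\S+|www\.\S+"), " urllink "),
    (re.compile(r"\S+@\S+\.\S+"), " emailid "),
    (re.compile(r"[$£€]"), " currencysymbol "),
    (re.compile(r"\+?\d[\d\-]{7,}\d"), " phonenumber "),
    (re.compile(r"\d+"), " number "),
]


def toy_stem(token):
    """Very rough suffix stripping; a real stemmer would replace this."""
    if token.endswith("ied") and len(token) > 4:
        return token[:-3] + "y"   # studied -> study
    if token.endswith("ing") and len(token) > 5:
        return token[:-3]         # studying -> study
    if token.endswith("ed") and len(token) > 4:
        return token[:-2]
    return token


def preprocess(message):
    """Return the list of stemmed keywords for one SMS."""
    text = message.lower()
    for pattern, word in REPLACEMENTS:
        text = pattern.sub(word, text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return [toy_stem(token) for token in tokens if token not in STOPWORDS]


if __name__ == "__main__":
    print(preprocess("You won $1000! Visit http://bit.ly/x to claim"))
```

Mapping every URL, amount, or phone number to one shared word means these cues accumulate frequency across messages instead of appearing as thousands of unique tokens.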

Feature extraction
After preprocessing, the collection of SMS, $S$, can be represented by $m$ different terms, which are referred to as $T = \{t_1, t_2, t_3, \dots, t_m\}$. A new approach for feature extraction, coined UTF-IDF, is proposed, in which features for each $s_i \in S$, $1 \le i \le N$, are extracted using term frequency-inverse document frequency in three cases: word unigrams, bigrams, and the combination of unigrams and bigrams. In the UTF-IDF approach, the dataset $S$ is divided into two sets depending on the message label. In other words, two sets are drawn from the set $S$: the first set, $S_h$, contains the ham messages, and the second set, $S_s$, contains the smishing messages.
Then, the correlations between the terms' behavior and the significance of specific phrases are examined by computing the frequency of word-based unigrams and bigrams for each sentence in $S_h$ and $S_s$. In this paper, the top 1000 terms were considered. As a result, $F_h = \{f_{h1}, f_{h2}, \dots, f_{h1000}\}$ and $F_s = \{f_{s1}, f_{s2}, \dots, f_{s1000}\}$ are generated, which represent the feature vectors for $S_h$ and $S_s$, respectively. Following that, the combination of $F_h$ and $F_s$ is calculated as $F_{hs} = F_h \cup F_s$ using feature union, which represents the feature vector for $S$. Finally, the TF-IDF for each unigram and bigram $t_j$ in $s_i$ is calculated, as in Eq. (3), to produce a vector of term scores for each sentence in $S$. The resulting scores form the feature vector for $S$. The extracted features are then fed into different classifiers to be trained. TF($t_j$, $s_i$) is a metric used to determine how frequently a term $t_j$ appears in sentence $s_i$.

Algorithm 1 (URL inspection, steps 5 to 8, continued):
5 (continued): Set $G$ ← true
6: Check the URL by VirusTotal with the URL parameter. Get the total number of security vendors that flagged the URL and save it as $V$. If $G$ = false or $V$ ≠ 0, set Status ← smishing
7: End for
8: End
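The UTF-IDF idea (per-class vocabularies, their union, then TF-IDF over the whole collection) can be sketched in plain Python. This is a minimal sketch under stated assumptions: term frequency is a raw count, IDF is taken as log(N / df), and all function names are hypothetical; the paper's Eq. (3) may use a different weighting variant.

```python
import math
from collections import Counter


def ngrams(tokens):
    """Return the unigrams plus the word bigrams of one tokenized message."""
    return list(tokens) + [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def top_terms(messages, k):
    """Return the k most frequent unigrams and bigrams in `messages`."""
    counts = Counter(term for tokens in messages for term in ngrams(tokens))
    return [term for term, _ in counts.most_common(k)]


def utf_idf(messages, labels, k=1000):
    """Build the union vocabulary per class, then score every message."""
    ham = [m for m, y in zip(messages, labels) if y == "ham"]
    smishing = [m for m, y in zip(messages, labels) if y == "smishing"]
    # Feature union of the top-k terms of each class.
    vocabulary = sorted(set(top_terms(ham, k)) | set(top_terms(smishing, k)))
    documents = [Counter(ngrams(m)) for m in messages]
    n_docs = len(documents)
    # Document frequency is at least 1 for every vocabulary term.
    idf = {t: math.log(n_docs / sum(1 for d in documents if t in d))
           for t in vocabulary}
    scores = [[d[t] * idf[t] for t in vocabulary] for d in documents]
    return vocabulary, scores


if __name__ == "__main__":
    msgs = [["win", "prize", "now"], ["see", "you", "now"], ["win", "cash"]]
    vocab, vectors = utf_idf(msgs, ["smishing", "ham", "smishing"])
    print(len(vocab), len(vectors))
```

Selecting the top terms separately per class keeps terms that are frequent in the small smishing class from being crowded out by the much larger ham class.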

Feature selection
In the smishing detection process, feature selection is a crucial phase, since the performance of the model can be degraded by irrelevant features. In this paper, chi-square is used to identify the most important features, which increases the detection rate and accuracy of smishing detection in addition to reducing computation time. For each feature $f_i$, the chi-square value is calculated, and the features are then sorted in descending order of that value. The feature with the highest chi-square value is the most dependent on the output label and has the greatest impact on determining the output. Algorithm 2 clarifies the adopted feature selection algorithm.
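The ranking described above can be sketched with a 2x2 contingency table per feature (term present or absent, against ham or smishing). This is an illustrative reading of the procedure, not the paper's code; the function names and the presence-based encoding of features are assumptions.

```python
def chi_square(observed):
    """Chi-square statistic of a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count under independence of feature and label.
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                statistic += (count - expected) ** 2 / expected
    return statistic


def rank_features(presence, labels):
    """Sort feature names by chi-square score, highest first.

    presence: dict mapping feature name -> list of booleans, one per message.
    labels:   list of 'ham' / 'smishing', one per message.
    """
    scores = {}
    for feature, present in presence.items():
        table = [[0, 0], [0, 0]]  # rows: present/absent, cols: ham/smishing
        for is_present, label in zip(present, labels):
            table[0 if is_present else 1][0 if label == "ham" else 1] += 1
        scores[feature] = chi_square(table)
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    labels = ["ham", "ham", "smishing", "smishing"]
    presence = {"hello": [True, False, True, False],
                "prize": [False, False, True, True]}
    print(rank_features(presence, labels))
```

Keeping the first 1000 entries of the returned ranking would correspond to the top-1000 selection described in the text.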

Algorithm 2 (feature selection, steps 6 to 9):
6: Check if the features are correlated. If p_value < $\alpha$, then the two features are correlated.
7: Sort the correlated features in descending order.
8: Select the feature set $\mathcal{F}$ such that it contains the top 1000 values.
9: End

Machine learning algorithms have been extensively studied in SMS classification. Four well-known classification algorithms are used in this paper for detecting smishing: SVM, RF, AdaBoost, and XGBoost. Algorithm 3 demonstrates the process of classification of SMS.

Experimental results
In this paper, we used the SMS spam collection dataset from the UCI machine learning repository [29]. This dataset contains 5572 messages, of which 4825 are classified as "ham" (legitimate SMS) and 747 as spam. In addition, 120 phishing SMS from Pinterest were employed [30]. Since no smishing dataset has been published, Pinterest's smishing images are converted to text, and the smishing messages are extracted from the SMS spam collection dataset; together they produce a dataset consisting of 867 smishing and 4825 ham messages. Stratified 3-fold cross-validation is used to evaluate the proposed model. Here, the dataset is split into three folds, each fold having an equal proportion of messages with a particular label. One fold acts as the testing set and the other two folds act as the training set. The iteration continues until all folds have been used as the testing set.
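The stratified splitting described above can be sketched as follows. This is a minimal deterministic version for illustration (no shuffling, round-robin assignment per label); the function names are hypothetical, and the paper's experiments presumably relied on a library routine.

```python
from collections import defaultdict


def stratified_folds(labels, k=3):
    """Split sample indices into k folds with near-equal label proportions."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_label.values():
        # Deal the samples of each label round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def train_test_splits(labels, k=3):
    """Yield (train_indices, test_indices), using each fold once for testing."""
    folds = stratified_folds(labels, k)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    labels = ["ham"] * 9 + ["smishing"] * 3
    for train, test in train_test_splits(labels):
        print(len(train), len(test))
```

Applied to the dataset above (4825 ham, 867 smishing), each of the three folds would hold 1608 or 1609 ham messages and exactly 289 smishing messages.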
Algorithm 3 (SMS classification, steps 4 to 6):
4: Calculate TF-IDF scores for each feature in $F_{hs}$ using Eq. (3) and add the corresponding TF-IDF scores to produce the TF-IDF feature set.
5: Apply chi-square to the TF-IDF feature set and select the top 1000 features to be fed to the classifier.
6: Train the model with the SMS messages, $S$, using one of the adopted classifiers: SVM, RF, XGBoost, or AdaBoost.

Furthermore, the accuracy (Acc), precision (P), detection rate (DR), and F1-score measures were used to evaluate the proposed smishing model's performance. The experiments were carried out on a PC with an Intel Core 7 Duo 2.90 GHz processor, 8 GB RAM, and the 64-bit Microsoft Windows 10 operating system. Python 3.9, with PyCharm, was used as the programming language. After extensive testing, the following tunable parameters of the utilized machine learning algorithms were determined. The regularization parameter of SVM was set to 1, and the radial basis function was used as the kernel. The number of RF trees was set to 2000, while the number of AdaBoost trees and the learning rate were set to 100 and 1, respectively. Finally, for XGBoost, the number of trees, the maximum depth of a tree, and the learning rate were set to 5000, 5, and 0.01, respectively.
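The four evaluation measures named in this section can be computed from confusion-matrix counts as sketched below. The definitions are the standard ones, with detection rate taken to mean recall (an assumption, since the text does not define it); the function name and the example counts are hypothetical and are not results from the paper.

```python
def evaluate(tp, fp, tn, fn):
    """Return accuracy, precision, detection rate, and F1-score.

    tp: smishing messages classified as smishing
    fp: ham messages classified as smishing
    tn: ham messages classified as ham
    fn: smishing messages classified as ham
    """
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    detection_rate = tp / (tp + fn)  # assumed to be recall
    f1 = 2 * precision * detection_rate / (precision + detection_rate)
    return {"accuracy": accuracy, "precision": precision,
            "detection_rate": detection_rate, "f1": f1}


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the formulas.
    print(evaluate(tp=90, fp=10, tn=880, fn=20))
```

Because ham messages far outnumber smishing messages, accuracy alone can look high even when many smishing messages are missed, which is why precision, detection rate, and F1-score are reported alongside it.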
A comparison of the impact of combining the Google engine with VirusTotal for URL inspection versus the Google engine alone, as used in [14], and VirusTotal alone, as used in [17], is shown in Table 3. When the two techniques are combined to detect the maliciousness of URLs, the inspection is improved: the number of detected smishing messages increases. This reflects the beneficial effect of the collaboration between the Google engine and VirusTotal on smishing detection, because the Google engine detects smishing messages that VirusTotal cannot detect, and vice versa. To demonstrate the effectiveness of using UTF-IDF during the feature extraction process, a comparison has been made between the accuracy obtained using UTF-IDF, which depends on splitting the dataset into two sets (smishing and ham), and that obtained using standard TF-IDF, which operates on the entire set. Figures 2-4 depict the accuracy results of UTF-IDF against TF-IDF for unigrams, bigrams, and the combination of unigrams and bigrams.
In most cases, using UTF-IDF gives better accuracy than using TF-IDF. The reason is that when the data is divided by message type and the frequency of each term is calculated, the importance of the features is preserved relative to the type of message, and the weight of the features is determined by what is contained in the dataset based on the label. Furthermore, the results show that the chi-square feature selection method has a positive impact on the performance of the classifiers. To confirm the results of the experiment, the results of the proposed model are compared with the previous research in [15], as reported in Table 4. The results reveal that the proposed method outperforms [15] in all measures. In addition, the proposed model uses fewer features than [15] and still outperforms it. This reflects that the proposed smishing model has a higher degree of discrimination between smishing and ham, because its extracted features are better able than those of [15] to distinguish smishing from ham. As a result, we conclude that the proposed model can effectively detect phishing SMS.
The proposed smishing detection model can be evaluated further by plotting the receiver operating characteristic (ROC) curve and calculating the Area Under the ROC Curve (AUC), which measures the degree of distinction. Fig. 5 depicts the ROC curve and AUC of SVM and XGBoost. SVM and XGBoost were chosen because SVM's performance was the best, both in [15] and in the proposed smishing detection, while XGBoost's performance was the worst. The figures clearly show that the proposed smishing model has a higher degree of discrimination between smishing and ham: the AUC of SVM in the proposed smishing detection equals 0.9907, whereas the AUC of SVM in [15] was 0.9894. Furthermore, the AUC of XGBoost in the proposed smishing detection equals 0.9836, whereas the AUC of XGBoost in [15] was 0.9773. This is because the extracted features of the proposed smishing model have greater strength than those of [15] in distinguishing smishing from ham.
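The AUC values quoted above have a direct interpretation: the probability that a randomly chosen smishing message receives a higher classifier score than a randomly chosen ham message. The sketch below computes AUC from that definition; it is an illustration with hypothetical scores, not the procedure used to produce Fig. 5.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: classifier scores, higher meaning 'more likely smishing'.
    labels: 1 for smishing, 0 for ham.
    Each (smishing, ham) pair counts 1 if the smishing score is higher,
    0.5 on a tie, and 0 otherwise.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Hypothetical scores for two smishing and two ham messages.
    print(auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))
```

The pairwise loop is quadratic in the number of messages, which is acceptable for a sketch; a rank-based computation would scale better on the full dataset.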

Conclusion
The popularity of smartphones and their constant connection to the World Wide Web make these devices vulnerable to smishing, which is a serious attack on mobile devices. This paper introduces a security model that combines different analysis methods to detect malicious content in SMS. The model consists of investigating malicious URLs and analyzing SMS content. The Google search engine was used together with VirusTotal to verify URLs and determine their malicious intent. The combination is more effective at inspecting URLs than either the Google search engine alone or VirusTotal alone. The crucial part of content analysis is to separate smishing from ham messages. This is accomplished by extracting the essential features and selecting the relevant ones. Four machine learning algorithms were used in this paper: SVM, RF, AdaBoost, and XGBoost. SVM is superior to the other algorithms, with an accuracy of 0.985229, due to its effectiveness in high-dimensional feature spaces. Furthermore, the proposed model outperforms the existing work in the field.
For future work, a mobile application for detecting smishing and protecting a smartphone can be developed. In addition, the number of smishing messages is less than the number of ham messages, resulting in an unbalanced class problem, which can be solved by either acquiring

Figure 1: Number of Smartphone Users from 2016 to 2026

Where TF: number of times the term $t_j$ appears in $s_i$, and IDF = log (

Figure 5: The ROC Curve Result. (a) SVM ROC of Both [15] and the Proposed Detection Model, (b) XGBoost ROC of Both [15] and the Proposed Detection Model

Table 1: Comparison of the Proposed Model with Some Smishing Detection Models in the Literature

Algorithm 1 (URL inspection, steps 3 to 5):
3: Extract the URL from $s_i$ if it exists and save it as URL. Extract the domain name from the URL and save it as $D$.
4: Extract the nouns from $s_i$ and save them as nouns. Set $Q$ ← Concat($D$, nouns).
5: Check the URL by Google Search with the $Q$ parameter. Set $R$ ← Google Search with $Q$.

SMS content analysis
SMS content analysis consists of four components: pre-processing, feature extraction, searching for the best feature set using chi-square, and finally, SMS classification. Feature extraction creates a feature vector by extracting new features from SMS. The feature vector is passed to chi-square to search for feature relevance. After the features are ordered by score and the highest-scoring ones are selected, SMS classification algorithms are used to detect smishing.

Table 2: Specific Words That Replace the Original Tokens

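Algorithm 2 (feature selection, steps 1 and 2):
1: Calculate the observed frequency by generating a contingency table, $O$, that contains $r$ rows and $c$ columns, where each cell contains the frequency with which feature $f_i$ belongs to ham or smishing, $\forall i \in \{1, \dots, n\}$.
2: Calculate the expected frequency by generating a contingency table, $E$.

Algorithm 3 (SMS classification, steps 1 to 3):
1: Extract word-based unigrams and bigrams for $S_h$ and $S_s$ and add them to the feature sets $U_h$ and $U_s$, respectively.
2: Calculate the frequency of the extracted feature sets $U_h$ and $U_s$ and choose the top 2000 features to generate two sets $F_h$ and $F_s$: $F_h \leftarrow \{f_{h1}, f_{h2}, \dots, f_{h1000}\}$, $F_s \leftarrow \{f_{s1}, f_{s2}, \dots, f_{s1000}\}$.
3: Combine the two sets $F_h$ and $F_s$ to create a vocabulary feature set $F_{hs}$: $F_{hs} \leftarrow F_h \cup F_s$.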

Table 3: The Number of Smishing Messages Detected by Google Engine, VirusTotal, and the Proposed URL Inspection