Review of Smishing Detection Via Machine Learning

Smishing is a cybercriminal attack targeting mobile Short Message Service (SMS) devices that contains a malicious link, phone number


Introduction
Mobile phone usage has increased, which has led to an increase in cybercrime. Smishing is one such crime. It is a type of spam that has a significant negative impact on many users' daily lives. Users waste a lot of time processing spam that contains unexpected dangerous attachments designed to compromise their systems [1]. Information security, which deals with controlling and preventing unauthorized access to secure data, is a major concern in daily life [2]. Phishing is currently one of the most serious risks in networked environments. It is a cybercrime in which malicious links are sent via spam or social networks to trick users and gain access to personal information such as usernames and passwords. Phishing scams can allow attackers to make money or other profits [3]. Smishing is phishing carried out through the Short Message Service (SMS) to steal sensitive user information. In smishing, the attackers target mobile users via text messages delivered to their phones. These messages include a link to malware or to phishing websites that request sensitive data from the user. In the malware case, the malware is downloaded to the user's phone and then performs malicious operations on the device. Attackers prefer text messages because they can reach a huge number of users with an inexpensive SMS subscription [4]. Furthermore, the smaller display of a mobile phone makes it hard for users to read the Uniform Resource Locator (URL) and review its suspicious features. In addition, mobile users' lack of knowledge, insecure user behavior, and frequent user logins make mobile phones vulnerable to smishing attacks and loss of sensitive data.
This paper presents a detailed study of anti-smishing techniques for mobile device security. The review's main contributions can be summarized as follows:
• Present a review of the main and most recent research advances in anti-smishing methods in the literature, with their drawbacks and results.
• Investigate potential solutions to smishing attack problems from various perspectives, such as SMS content, URL analysis, and the combination of URL analysis and SMS content.
The rest of this paper is organized as follows. The smishing mechanisms are described in Section 2. Section 3 presents various anti-smishing techniques, followed by a discussion. Finally, the major findings of this review are summarized.

Smishing mechanisms
Smishing messages created by attackers usually use compelling words such as congratulations, wins, prizes, and gifts. These trick the user into contacting the attacker by clicking the link, dialing the phone number, or writing to the email address provided in the SMS. The process of a smishing attack, as shown in Figure 1, begins with an SMS message from the attacker containing one of three items: a URL, an email ID, or a phone number. If a URL is included in the SMS, simply clicking the link will redirect the user to a dangerous website. Next, a web form is opened for the victim, offering a gift and a promise related to the victim's points of interest. The victim is asked to enter personal information, such as bank information, to obtain it. Alternatively, the malicious message may contain a website link that causes the victim's device to download a file.
For an SMS containing an email ID or phone number, the attacker uses that phone number or email ID to trick the victim into making contact. If the victim contacts the attacker in any way, the attacker will ask the victim to disclose personal information [4,5].
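The three contact points described above (URL, email ID, phone number) are what most detectors look for first. The following is a minimal sketch of that extraction step, assuming simple regular expressions; the patterns, the function name `extract_contact_points`, and the sample message are illustrative assumptions and are not taken from any of the reviewed systems.

```python
import re

# Illustrative patterns only (assumptions); real SMS traffic needs broader coverage.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{7,}\d")


def extract_contact_points(sms: str) -> dict:
    """Return the URLs, email IDs, and phone numbers found in an SMS body."""
    urls = URL_RE.findall(sms)
    emails = EMAIL_RE.findall(sms)
    # Blank out URLs and emails first so digits inside them
    # are not mistaken for phone numbers.
    remainder = EMAIL_RE.sub(" ", URL_RE.sub(" ", sms))
    phones = [match.strip() for match in PHONE_RE.findall(remainder)]
    return {"urls": urls, "emails": emails, "phones": phones}


# Hypothetical message using a reserved test domain.
sample = ("Congratulations! You won a prize. "
          "Claim at http://example.test/win or call +1 555 010 9999")
found = extract_contact_points(sample)
```

A message with none of the three contact points gives the victim no way to reach the attacker, which is why several of the reviewed systems use this check as a first filter.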

Anti-Smishing techniques
Researchers have proposed several approaches to combat smishing attacks, including content-based, URL behavior analysis, and heuristic techniques. Figure 3 depicts the distribution of the three existing approaches for all publications presented in this review. A smishing system was suggested by Joo et al. [6] to detect and block smishing messages by checking whether a URL exists in the message. The system consists of four parts: SMS monitor, analyzer, determinant, and database. The system can efficiently extract nouns using a morphological analyzer, and a Naïve Bayesian classifier was used to distinguish smishing messages from legitimate ones. A smishing detection model that relies on a combination of content-based analysis and four well-known machine learning algorithms for detecting smishing messages was proposed by Sonowal and Kuppusamy [7]. Thirty-nine features were extracted and, to reduce dimensionality, the Pearson correlation coefficient was used to select the 20 best features. The model was validated by experiments on both English and non-English datasets. The model's accuracy was 96.40% when applied to an English dataset and 90.33% when applied to a non-English dataset. The model obtained a 96.16% accuracy after feature selection.
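Correlation-based feature selection of the kind mentioned above can be illustrated with a short sketch. It ranks each feature by the absolute Pearson correlation between its values and the class label and keeps the strongest ones. The feature names and toy data are hypothetical, and this is only one plausible reading of the selection step, not the implementation used in [7].

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def top_k_features(columns, labels, k):
    """Rank feature columns by |r| with the label and keep the k strongest."""
    scores = {name: abs(pearson(values, labels)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data: 1 = smishing, 0 = legitimate. Feature names are hypothetical.
labels = [1, 1, 1, 0, 0, 0]
columns = {
    "has_url": [1, 1, 1, 0, 0, 0],
    "has_currency_symbol": [1, 1, 0, 0, 0, 1],
    "message_length": [120, 80, 95, 100, 90, 110],
}
selected = top_k_features(columns, labels, k=2)
```

In this toy data `has_url` tracks the label perfectly and `message_length` is almost uncorrelated with it, so the latter is the feature that gets dropped.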
A rule-based technique with content-based filtering was presented by Jain and Gupta [8]. The authors proposed nine rules and three algorithms, Repeated Incremental Pruning to Produce Error Reduction (RIPPER), Decision Tree (DT), and PRISM, to distinguish smishing messages from legitimate messages. The obtained results were promising, and the proposed method could also detect zero-day attacks. In terms of True Positive Rate (TPR), RIPPER outperformed DT and PRISM, which showed 90.88% and 72.65%, respectively. In terms of True Negative Rate (TNR), RIPPER showed 99.01%, while DT and PRISM obtained 99.17% and 99.93%, respectively.
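Since TPR and TNR recur throughout this review, a short sketch of how they are computed from predictions may help. TPR is TP / (TP + FN) and TNR is TN / (TN + FP), with smishing taken as the positive class. The labels below are toy values chosen for illustration, not data from [8].

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FN, TN, FP, treating `positive` as the smishing class."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, tn, fp


def tpr_tnr(y_true, y_pred, positive=1):
    """True Positive Rate (recall on smishing) and True Negative Rate."""
    tp, fn, tn, fp = confusion_counts(y_true, y_pred, positive)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return tpr, tnr


# Toy labels: 4 smishing and 6 legitimate messages.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
tpr, tnr = tpr_tnr(y_true, y_pred)
```

A classifier can score a very high TNR while missing many smishing messages, which is why the two rates are reported separately.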
Goel and Jain [9] proposed a three-phase smishing detection model. The first phase is the SMS analysis phase, in which the URL analyzer checks for the presence of URLs in text messages, the mobile number analyzer checks whether the phone number in the SMS is on the blacklist, and the self-answering message analyzer searches for messages that require a "yes" response to register for a service. The second phase is SMS normalization, which replaces disguised, noisy text with its familiar normalized form. The third phase is SMS classification, which includes preprocessing and classification using a Naïve Bayesian (NB) classifier. Wu et al. [10] proposed an anti-smishing framework with three types of features: 32 token features, 50 topic features, and 93 LIWC features. To resolve the class imbalance in the data, the Adaptive Synthetic (ADASYN) oversampling method was used. Because there were so many features, the Binary Particle Swarm Optimization (BPSO) method was used to reduce the feature dimensions and identify feature combinations. For distinguishing smishing from legitimate messages, the Random Forest (RF) classification method was used. In terms of accuracy, the proposed technique achieved 99.01%.
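The normalization phase described above can be pictured as a dictionary lookup that maps disguised tokens back to their familiar forms. The sketch below assumes a tiny hand-written lookup table; the table entries and the function name `normalize_sms` are hypothetical and do not reproduce the normalization resources used in [9].

```python
import re

# Hypothetical lookup table (an assumption for illustration); a real system
# would rely on a much larger normalization dictionary.
NORMALIZATION_TABLE = {
    "u": "you",
    "ur": "your",
    "pls": "please",
    "txt": "text",
    "fr33": "free",
    "w1n": "win",
}


def normalize_sms(text: str) -> str:
    """Lowercase the message and replace known disguised tokens."""
    def replace_token(match):
        token = match.group(0)
        return NORMALIZATION_TABLE.get(token, token)

    # \w+ matches runs of letters, digits, and underscores, so punctuation
    # and spacing are left untouched.
    return re.sub(r"\w+", replace_token, text.lower())


normalized = normalize_sms("U can w1n a FR33 gift, pls txt YES")
```

Normalizing before classification means that "fr33" and "free" contribute to the same word count instead of being treated as unrelated tokens.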
Mahmood and Hameed
Iraqi Journal of Science, 2023, Vol. 64, No. 8, pp: 4244-59

A technique based on message content and URL behavior analysis for detecting smishing was introduced by Mishra and Soni [4]. They proposed a smishing detection system that uses SMS content analysis, a machine learning classifier, and inspection of URL behavior to classify smishing messages. The first phase filters the content of text messages by detecting the presence of email IDs, phone numbers, or URLs in messages. Word occurrences were weighted using Term Frequency-Inverse Document Frequency (TF-IDF), and a OneVsRest classifier was used to distinguish between smishing and legitimate messages. Analyzing URL behavior can help detect APK downloads. Meanwhile, the URL source code is also examined to see whether a form tag is present.
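To make the TF-IDF weighting concrete, the following sketch computes it by hand for three toy messages. It assumes the plain textbook definition (term frequency times log of N over document frequency); library implementations usually apply smoothing and normalization, so their numbers differ. The messages and the function name `tf_idf` are illustrative assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many messages each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy messages, not taken from any dataset in the reviewed papers.
docs = [
    "claim your free prize now",
    "see you at lunch",
    "your prize is waiting claim now",
]
weights = tf_idf(docs)
```

In the first message, "free" appears in no other message and so receives a higher weight than "your", which appears in two of the three.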
[11] proposed a content-based technique for the automatic detection of smishing using the machine learning algorithms Support Vector Machine (SVM), Logistic Regression (LR), and Random Forest (RF). The core of the proposed work consists of preprocessing, feature extraction, and classification. To assess the performance of the proposed model, a large dataset containing smishing and legitimate messages was used. The experimental results show that legitimate and smishing messages were classified with a high success rate. The SVM, RF, and LR classifiers achieve precision of 88%, 88%, and 89%, recall of 98%, 98%, and 96%, and F1-scores of 92.7%, 92.7%, and 92.4%, respectively.
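The F1-score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). As a quick consistency check, the sketch below recomputes F1 from the precision and recall values quoted above; the helper name `f1_score` is our own.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as quoted in the text for [11].
reported = {"SVM": (0.88, 0.98), "RF": (0.88, 0.98), "LR": (0.89, 0.96)}
f1_percent = {
    name: round(100 * f1_score(p, r), 1) for name, (p, r) in reported.items()
}
```

The recomputed values agree with the quoted F1-scores of 92.7%, 92.7%, and 92.4%.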
Jain and Gupta [12] proposed a heuristic-based algorithm that detects smishing messages using feature selection and machine learning algorithms. Ten features from the smishing message were selected by analyzing the content of the message and using a classification algorithm to distinguish the messages depending on the selected features. The suggested algorithm has an overall accuracy of 98.74%, a TPR of 94.20%, and a TNR of 99.08% for smishing detection using an RF classifier.
A smishing detector was proposed by Mishra and Soni based on a content-based approach and a URL-based method. The system consists of four modules. The content of the message is processed by the SMS Content Analyzer, while the URL filter, APK download, and source code modules examine URL behavior. The Naive Bayes classifier was used as the machine learning algorithm to distinguish smishing messages from legitimate messages and showed an accuracy of 96.29% after the evaluation of all four modules [13].
A combination of the heuristic method and content-based feature extraction with machine learning classifiers, namely Naive Bayes (NB), Neural Network (NN), and LR, was introduced by Jain et al. [14] to detect smishing messages. The system was divided into two phases. In the first phase, spam messages were separated from ham messages. In the second phase, the smishing messages were distinguished from the other spam messages. This approach can therefore detect both spam and smishing messages. The proposed approach used 11 basic features to exclude spam messages and 4 features to filter smishing messages, with Information Gain (IG) used for feature selection to reduce redundancy. The simulation results show that, using an NN classifier, the proposed method can recognize spam messages with an accuracy of 94.9% and identify smishing messages with an accuracy of 96%.
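Information Gain measures how much knowing a feature reduces the entropy of the class label: IG = H(labels) minus the weighted entropy of the labels within each feature value. The sketch below is a minimal illustration on toy data; the feature names are hypothetical and the code is not the selection procedure of [14].

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature, labels):
    """Entropy of the labels minus their entropy after splitting on the feature."""
    n = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [lab for f, lab in zip(feature, labels) if f == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


# Toy data: 1 = smishing, 0 = other spam. Feature names are hypothetical.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
has_url = [1, 1, 1, 1, 0, 0, 0, 0]        # separates the classes perfectly
has_greeting = [1, 0, 1, 0, 1, 0, 1, 0]   # independent of the class
ig_url = information_gain(has_url, labels)
ig_greeting = information_gain(has_greeting, labels)
```

A feature with gain near zero, such as `has_greeting` here, adds nothing the classifier can use, so discarding it reduces redundancy without hurting accuracy.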
Sonowal [15] proposed a method for detecting smishing messages based on content feature extraction and four correlation algorithms for ranking features, namely Spearman's correlation, Pearson rank correlation, point biserial rank correlation, and Kendall rank correlation. The Kendall ranking algorithm reduced the feature dimension by 61.53% and achieved an accuracy of 98.40% using the AdaBoost classifier.
Mishra and Soni [16] offered a smishing detection model that consists of two phases: the domain checking phase and the SMS classification phase. The first one looks at the authenticity of the SMS URL, which is an important part of SMS phishing detection. The second phase is the SMS classification phase, which looks at the textual content of the message and chooses the five most effective features from the text messages to allow machine learning categorization with a small number of features. Finally, the proposed system classifies messages using Backpropagation (BP) algorithm, RF, NB, and DT. The obtained accuracy was 97.93%. A smishing detection model based on a content-based approach was developed by [17]. They have developed an automated strategy that effectively distinguishes between legitimate and fake messages. They performed a feature extraction method, followed by a feature selection method, and analyzed the work by machine learning classifiers XGBoost, RF, Classification and Regression Tree (CART), SVM, and AdaBoost. SVM outperforms the others with accuracy, precision, recall, and F1-scores of 98.39%, 98.37%, 99.79%, and 99.08%, respectively.
A content-based approach used artificial intelligence for smishing detection. First, the message is preprocessed and features such as term function, URL, email address, mobile number, number of characters, and currency symbol are extracted. These features are provided to the classifier to distinguish smishing messages from legitimate messages. Several classifiers were used, including the Long Short-Term Memory (LSTM) recurrent model, K-Nearest Neighbors (KNN), Stochastic Gradient Descent (SGD), DT, NB, and RF, and it was found that LSTM achieved 95.11%, 94.88%, 91.07%, and 99.03% for accuracy, F1-score, recall, and precision, respectively [18]. Ghourabi [19] proposed an "SM Detector" for smishing detection in the mobile environment. This system consists of three parts. The first uses the VirusTotal API to analyze the authenticity of the URL, and the second uses regular expressions to analyze the content of the message for blacklisted words or numbers. The last part uses the BERT classification method to distinguish spam messages from legitimate messages. The system also includes a mobile app that allows users to monitor their SMS and report smishing texts. Its main advantage is that it can handle mixed text messages written in Arabic or English. On both Arabic and English datasets, the accuracy was 99.63%. A smishing detection model based on a content-based approach was developed by Boukari et al. [20]. They developed an automated strategy that effectively distinguishes between legitimate and fake messages. They performed feature extraction, followed by feature selection, and analyzed the work using machine learning classifiers. SVM performed best, achieving superior accuracy while reducing feature dimensions. The method can also detect phishing and vishing scams.
An SMS phishing detection technique was proposed by Mishra and Soni [5] that used a neural network to extract 7 significant features and it showed the best results for detecting smishing. The overall accuracy of the NN-based 'Smishing Detector' model outperforms the results of the same model using machine learning methods. The comparison shows that the NN achieved greater accuracy, with a 1.11% difference. The NN achieved 97.40% accuracy and TPR and TNR of 92.37% and 97.91%, respectively.
Akande et al. [21] proposed a mobile application for detecting smishing attacks based on rule-based RIPPER and C4.5 classifiers. The rule-based classifiers were used to generate rules for identifying and distinguishing spam from ham, and a mobile application was developed to use the rule-based approach to detect smishing. An Application Programming Interface (API) was developed to intercept incoming SMS. The use of the C4.5 PART algorithm improved the efficiency of the rule-based method significantly. The correct classification rate was determined to be 98.42%. However, 1.58% of cases were labeled incorrectly.
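Rule learners such as RIPPER and PART produce an ordered list of if-then rules ending in a default rule, and a message receives the label of the first rule that fires. The sketch below shows how such a list might be applied once learned. The two rules are hand-written placeholders invented for illustration; they are not the rules generated in [21].

```python
import re

# Hypothetical ordered rule list (assumption): each entry is a description
# and a predicate over the lowercased message text.
RULES = [
    (
        "URL together with a prize keyword",
        lambda s: bool(re.search(r"https?://", s))
        and bool(re.search(r"\b(win|won|prize|gift)\b", s)),
    ),
    (
        "request for credentials",
        lambda s: bool(re.search(r"\b(password|pin|account number)\b", s)),
    ),
]


def classify(sms: str):
    """Return (label, rule description) for the first rule that fires."""
    text = sms.lower()
    for description, predicate in RULES:
        if predicate(text):
            return "smishing", description
    return "ham", "default rule"


flagged = classify("You WON a prize! Visit http://example.test")
clean = classify("Lunch at noon?")
```

Because each decision names the rule that produced it, a rule list is easy to audit on a phone, which is one plausible reason for choosing it in a mobile application.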
Phadke and Thorpe [22] developed a new app to detect Smishing attacks on Android smartphones by incorporating current phishing Application Program Interfaces (APIs) into a prototype application. The system is designed to run in the background and determine whether the URL in the text message is phishing or not. Five freely available APIs were tested on a 1500 URL dataset to determine their accuracy and latency. The VirusTotal API has the highest detection rate of 99.27% with a response time of 12-15 seconds per query for the security sensitive application. Furthermore, for the time-sensitive application, the Safe-Browsing API has an 87% accuracy and a response time of 0.15ms.
A technique was offered by Mambina et al. [23] that used a machine-learning-based approach for classifying smishing SMS messages. The best model, with an accuracy score of 99.86%, was a hybrid of Extra Tree classifier feature selection and RF employing Term Frequency-Inverse Document Frequency (TF-IDF) vectorization. The results obtained were compared to a baseline multinomial Naïve Bayes model. Furthermore, a comparison with a group of other classifiers was performed. The lowest false positive and false negative counts were 2 and 4, respectively, and the model achieved a Log-Loss of 0.04. Jain et al. [24] presented an effective method for analyzing the text content and URL in SMS. They combined the URL phishing classifier with the text classifier to increase accuracy because some SMS include the URL with no or very little content. The TF-IDF weighting scheme was used to locate unusual terms in a document, and two datasets were used for the text and URL phishing classifiers. Also, to balance the training data, an oversampling method was introduced. The proposed system was able to detect smishing SMS with accuracy and precision of 99.03% and 98.94%, respectively. Figure 4 shows how many times classifiers were used in different research papers from 2017 to 2022. It demonstrates that the most commonly used classifier is random forest. Table 1 presents comparative studies by various recent researchers from different perspectives based on the approach, dataset, feature extraction, feature selection, and classifier used, followed by the results and limitations.
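Log-Loss, reported above alongside accuracy, penalizes a classifier for being confidently wrong: it is the negative mean of y·log(p) + (1 − y)·log(1 − p) over all messages, where p is the predicted probability of the smishing class. The sketch below computes it for two sets of toy predictions; the numbers are illustrative and unrelated to the results in [23].

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Binary cross-entropy averaged over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


# Toy predictions: both models classify every message correctly at a 0.5
# threshold, but the first is far more confident.
y_true = [1, 0, 1, 0]
confident = log_loss(y_true, [0.95, 0.05, 0.90, 0.10])
uncertain = log_loss(y_true, [0.60, 0.40, 0.60, 0.40])
```

Both toy models have identical accuracy, yet the confident one has the lower Log-Loss, which is the extra information this metric adds.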

Discussion
Users can take several precautions against smishing. First, the user should recognize typical smishing cues, such as abbreviations in SMS, leet words in both SMS and URLs, emotional phrases, misspelled words in the message body or URL, shortcodes, and impersonal wording, since attackers mostly use impersonal messages so that they can be sent to the largest number of victims. Second, the user should install protective applications from trusted sources, because several attacks can be launched without the user's knowledge when clicking on a malicious link downloads malware, which ends up harming the user's information. Finally, it is important to check the permissions of applications before downloading them to determine whether their requests for access to SMS, contacts, camera, etc. are legitimate.
As mentioned in the previous section, the most commonly used classifier in SMS classification is RF. SMS content analysis is the most commonly used strategy in many approaches. Several studies use blacklists or whitelists to validate URLs, phone numbers, and email addresses. However, whitelisting cannot be used to detect smishing because it cannot recognize updated harmful features of URLs. Also, since blacklisting cannot detect zero-day phishing tactics, blacklists must be regularly updated.
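The limitation of list-based validation can be seen in a few lines of code. The sketch below checks a URL's host against a blacklist and a whitelist; the list contents, the domain names, and the function name `check_url` are placeholders chosen for illustration.

```python
from urllib.parse import urlparse

# Placeholder lists (assumptions); real deployments pull these from
# regularly updated feeds.
BLACKLIST = {"malicious.example"}
WHITELIST = {"bank.example"}


def check_url(url: str) -> str:
    """Classify a URL as 'blocked', 'trusted', or 'unknown' by its host name."""
    # hostname is None when the URL has no scheme, so fall back to "".
    host = (urlparse(url).hostname or "").lower()
    if host in BLACKLIST:
        return "blocked"
    if host in WHITELIST:
        return "trusted"
    return "unknown"


known_bad = check_url("http://malicious.example/login")
known_good = check_url("https://bank.example/account")
zero_day = check_url("http://brand-new-phish.example/login")
```

A newly registered phishing domain falls into the "unknown" branch until someone adds it to the blacklist, which is the zero-day gap noted above and the reason list checks are usually paired with content or URL behavior analysis.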

Conclusions
In the era of advanced cybercrimes and attacks, attackers intend to gather consumer information as quickly as possible. Therefore, they send SMS messages to mobile phones. The small size of a mobile phone's display, phone users' lack of understanding of security programs, and the habit of opening messages from unknown sources all work in the attacker's favor; in particular, the small display prevents the user from seeing the entire harmful link. The attacker sends a message to the user's phone that contains a malicious link that redirects the user to a malicious website, where they are asked to submit personal information. This type of cyber-attack is referred to as smishing.
Combating smishing messages necessitates user education. The key premise that emerges from this review is that different approaches can play an important role in detecting smishing. In addition, a comparison of various approaches for distinguishing smishing messages from genuine messages, as well as their results and limitations, was provided. Recent research indicates that combining URL behavior analysis and SMS content analysis with a large dataset is the best strategy for combating smishing. This can guide researchers interested in developing more effective anti-smishing methods in the future.