Network Traffic Prediction Based on Time Series Modeling

Predicting the network traffic of web pages has received increasing attention in recent years. Modeling traffic helps in devising strategies for distributing network loads and in identifying user behaviors and malicious traffic.


Introduction
With the advancement of information technology, network resource allocation has become an important research topic in recent years [1]. An optimal resource allocation mechanism ensures that important or high-priority traffic is not delayed or dropped when the network is overloaded or congested, while keeping the network operating efficiently [2,3]. The development and maturity of network traffic prediction technology make it possible to drive dynamic resource allocation with accurate traffic forecasts [4]. Depending on the application, network traffic forecasting is usually divided into short-term and long-term forecasting. Long-term forecasting typically relies on historical data collected over long periods, at daily or monthly granularity, to analyze, model, and predict future traffic; it emphasizes the accuracy of the trend rather than the absolute accuracy of each predicted value [5]. Short-term forecasting requires real-time performance and predicts network traffic over seconds or even shorter intervals. Modern network management methods can control network traffic dynamically, but analyzing traffic in real time and making instantaneous decisions may be infeasible, so long-term network traffic modeling is the more practical route to better network management [6,7]. To predict network traffic systematically, machine learning models can be trained on historical traffic and used to forecast future traffic. In this paper, four machine learning algorithms (XGBoost, logistic regression, linear regression, and random forest) were applied, and the performance of each model was evaluated using the symmetric mean absolute percentage error (SMAPE) and the mean absolute percentage error (MAPE). Features specific to Wikipedia's network traffic were extracted and then used to generalize the data, train the models, and compare their performance.

Related work
Several models have been proposed for network traffic modeling and prediction that apply supervised machine learning algorithms, used either dynamically or statically. Most attempts to model network traffic operate in real time, typically using memory-based prediction algorithms. This section summarizes work related to network traffic modeling. To predict network traffic, the researchers in [8] used the time-series web traffic dataset of Wikipedia articles and built a time-series model with an RNN seq2seq architecture, evaluating the model's overall performance and accuracy with the symmetric mean absolute percentage error (SMAPE). The authors of [9] predicted encrypted user traffic: they created a representative traffic dataset, including video and web traffic, and compared two models (ARIMA and LSTM). The results showed the superiority of LSTM in both accuracy and time.
In [10], the researchers presented NetScrapper, a flow-based network traffic classifier for online applications. NetScrapper classifies 53 web apps, including Amazon, YouTube, Google, and Twitter, using three machine learning models: K-Nearest Neighbors (KNN), Random Forest (RF), and Artificial Neural Network (ANN). The network traffic dataset contains 3,577,296 flow packets, each described by 87 features. In [11], long short-term memory (LSTM) and the online sequential extreme learning machine (OS-ELM) were applied to real traffic from a Chilean ISP to predict network traffic; the results showed that OS-ELM is superior to LSTM in computational cost. To classify network traffic by the applications in use [12], the researchers proposed an intelligent traffic management model using deep learning that includes multiple decision-tree-based models.
The proposed model deploys a blending ensemble-learning method that merges tree-based classifiers to increase generalization accuracy. In [13], multivariate time series were studied using time-series analysis techniques such as clustering and sequencing to create sequence models with a long short-term memory (LSTM) architecture. The study was applied to the Wikipedia web page traffic dataset, which contains about 145,000 web pages and their corresponding traffic from July 2015 to December 2016. In [14], an updated version of the 2018-2020 Wikipedia page views dataset was used together with an LSTM neural network trained with distributed asynchronous training; the predictive model adopted the Downpour strategy to achieve parallel training.

Machine learning model
This study used four models for time-series modeling to predict network traffic. Two modeling concepts were used: regression-based modeling (logistic regression and linear regression) and ensemble learning with both bagging and boosting (Random Forest and XGBoost).

Logistic Regression
Logistic regression was chosen because it is simple, easy to implement and interpret, and has low computational requirements. It is also one of the most widely used supervised learning algorithms today. Because of these properties, logistic regression is a strong option for problems involving large volumes of data and quick, possibly automated decisions, such as network traffic [15,16]. The logistic regression model links the likelihood of an outcome to a set of potential predictor variables, as shown in the following equation:

ln(p / (1 − p)) = β0 + β1x1 + β2x2 + … + βkxk

where p is the probability of the page being visited, β0 is an intercept term, and β1, …, βk are the coefficients associated with the variables x1, …, xk.
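The logistic link above can be sketched in a few lines of NumPy; the coefficient values below are hypothetical, chosen only to illustrate how the intercept and per-variable coefficients enter the model:

```python
import numpy as np

def logistic_model(x, beta0, beta):
    """Probability from the logistic link: p = 1 / (1 + exp(-(beta0 + x . beta)))."""
    return 1.0 / (1.0 + np.exp(-(beta0 + x @ beta)))

# Hypothetical coefficients for two traffic features (e.g. hour of day, recent views).
beta0 = -1.0
beta = np.array([0.8, 0.05])
p = logistic_model(np.array([2.0, 10.0]), beta0, beta)  # a probability in (0, 1)
```

With a zero linear predictor the model returns exactly 0.5, which is the decision boundary of the classifier.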

Linear Regression
A collection of independent variables is used to estimate a dependent response variable with the linear regression approach. The method attempts to find regressor coefficients β that best represent a linear relationship between the response variable Y and the regression variables X. For n observations, each consisting of k regression variables, the model can be described by the following equation [17]:

yi = β0 + β1xi1 + β2xi2 + … + βkxik + εi,  i = 1, …, n

where εi is the error term of observation i.
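A minimal sketch of fitting such a model by ordinary least squares with NumPy, using synthetic data (a column of ones supplies the intercept β0):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3
X = rng.normal(size=(n, k))
beta_true = np.array([1.5, 2.0, -1.0, 0.5])       # [intercept, beta1..beta3]
Xd = np.column_stack([np.ones(n), X])             # design matrix with intercept column
y = Xd @ beta_true + 0.01 * rng.normal(size=n)    # small error term epsilon

beta_hat, *_ = np.linalg.lstsq(Xd, y, rcond=None)  # OLS estimate of the coefficients
```

With such low noise, the recovered coefficients land very close to the true values.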

Random Forest
Random Forest constructs decision trees using a random approach: each tree is trained on randomly sampled objects and randomly selected features, a technique known as the random subspace method. A forecast can then be made by combining the results obtained from the individual trees [18]. The outcome can be determined in various ways, for example by a simple majority vote or by averaging. Such a random strategy lowers the model error, i.e., the spread of the model's predictions [19].
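A sketch of this idea with scikit-learn's `RandomForestRegressor` on a synthetic daily-views series (the lag-7 feature construction is an assumption for illustration, not the paper's exact feature set):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
# Synthetic daily-views series; predict each day from the previous 7 days (lag features).
series = 100 + 20 * np.sin(np.arange(400) / 10.0) + rng.normal(scale=2.0, size=400)
lags = 7
X = np.stack([series[i:i + lags] for i in range(len(series) - lags)])
y = series[lags:]

# max_features="sqrt" enables the random-subspace feature sampling described above.
model = RandomForestRegressor(n_estimators=200, max_features="sqrt", random_state=0)
model.fit(X[:300], y[:300])      # train on the first 300 days
pred = model.predict(X[300:])    # forecast the remaining days (average over trees)
```

Each tree sees a bootstrap sample of the rows and a random subset of the lag features at every split; the forest's prediction is the average of the trees' outputs.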

XGBoost
Chen and Guestrin presented extreme gradient boosting (XGBoost) in 2016. This approach enhances the gradient-boosting calculation of the objective function and saves computation time; parallel computation is exploited automatically during the training phase, so that large data science problems can be addressed quickly and precisely [20]. XGBoost's main premise is to learn new features by adding trees one at a time, fitting each new tree to the residuals of the current prediction, and computing a score for each sample. The final prediction score of a sample is obtained by aggregating the scores of all trees [21]. The framework of XGBoost is briefly described in the following paragraphs.
The estimated output ŷ of the gradient boosting tree model can be expressed as the sum of the prediction scores of all trees:

ŷi = Σk=1..K fk(xi),  fk ∈ T   (3)

where T is the space of regression trees, K is the number of regression trees, and xi represents the features of sample i. Within each tree, every leaf node j carries a prediction score wj, also known as the leaf weight.
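The additive scheme in Eq. (3) can be sketched by hand: each new tree is fit to the residuals of the running ensemble, and the final score is the sum over all trees. This is a simplified gradient-boosting sketch using scikit-learn decision trees, not the full XGBoost objective with its regularization terms:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=300)

trees, lr = [], 0.3                      # lr is the shrinkage (learning rate)
pred = np.zeros(len(y))
for k in range(50):
    tree = DecisionTreeRegressor(max_depth=3, random_state=k)
    tree.fit(X, y - pred)                # fit the residuals of the current ensemble
    trees.append(tree)
    pred += lr * tree.predict(X)         # y_hat_i = sum over k of lr * f_k(x_i)
```

After 50 rounds the summed tree scores fit the noisy sine curve closely, illustrating how the per-tree leaf weights accumulate into the final prediction.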

Methods
The dataset was obtained, initialized, and divided into a training set and a test set to predict network traffic via time series. The selected models were applied, and the performance of each model was measured using the mean absolute percentage error and the symmetric mean absolute percentage error.

Dataset
The dataset used in this paper is the Google web traffic time-series forecasting dataset, consisting of the traffic of 141,385 Wikipedia articles. The dataset includes a chronological ordering field representing the multiple time series; each time series represents the number of daily views of a different Wikipedia article, from 7/1/2015 to 9/10/2017 [13,14,22,23]. Figure 1 shows samples of the daily views of different Wikipedia articles.
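The raw file lays the series out wide (one row per article, one column per date). A hypothetical two-article miniature of that layout, reshaped into a long time-series format with pandas, might look like:

```python
import pandas as pd

# Hypothetical miniature of the wide Wikipedia web-traffic layout:
# one row per article page, one column per date.
wide = pd.DataFrame({
    "Page": ["Article_A", "Article_B"],
    "2015-07-01": [120, 40],
    "2015-07-02": [130, 35],
})

# Melt to long format: one row per (page, date) pair with its view count.
long = wide.melt(id_vars="Page", var_name="date", value_name="views")
long["date"] = pd.to_datetime(long["date"])
long = long.sort_values(["Page", "date"]).reset_index(drop=True)
```

The long format is the natural shape for building lag features and chronological train/test splits.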

Proposed framework
The framework includes data loading, setting the target values, and data preprocessing, which includes data generalization to reduce the effect of outliers and overfitting. The dataset was then divided into training data and test data in a ratio of (33:67), the selected models were trained, and the performance of each model was measured, as shown in the framework figure.
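A minimal sketch of the preprocessing and split steps. The paper does not name the exact generalization transform, so a log1p transform is assumed here to illustrate damping outliers; the 33% is taken as the test portion, and the day index stands in for the real feature set:

```python
import numpy as np
from sklearn.model_selection import train_test_split

views = np.array([0, 10, 25, 10000, 30, 12, 18, 22, 19, 500])  # raw daily views with outliers
target = np.log1p(views)            # assumed generalization step: compresses extreme values
X = np.arange(len(views)).reshape(-1, 1)   # placeholder feature: day index

# Chronological 33:67 split (shuffle=False keeps the time order intact).
X_train, X_test, y_train, y_test = train_test_split(
    X, target, test_size=0.33, shuffle=False)
```

Keeping `shuffle=False` matters for time series: the model must be tested on days that come after all of its training days.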

Evaluation criteria and experimental results
The results were analyzed and the proposed models trained using open-source Python tools. To evaluate the models and compare the results, the data were divided into two groups: the first for training and the second for testing. To evaluate the performance of the selected models, two metrics, the mean absolute percentage error (MAPE) and the symmetric mean absolute percentage error (SMAPE) [24,25], were used:

MAPE = (100/n) Σt=1..n |At − Pt| / |At|

SMAPE = (100/n) Σt=1..n |At − Pt| / ((|At| + |Pt|) / 2)

where n is the number of test instances, At is the actual value, and Pt is the predicted value.
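Both metrics are a few lines of NumPy:

```python
import numpy as np

def mape(actual, pred):
    """Mean absolute percentage error, in percent."""
    actual, pred = np.asarray(actual, float), np.asarray(pred, float)
    return 100.0 * np.mean(np.abs(actual - pred) / np.abs(actual))

def smape(actual, pred):
    """Symmetric mean absolute percentage error, in percent."""
    actual, pred = np.asarray(actual, float), np.asarray(pred, float)
    return 100.0 * np.mean(np.abs(actual - pred) / ((np.abs(actual) + np.abs(pred)) / 2.0))
```

For example, with At = [100, 200] and Pt = [110, 180], MAPE is 10.0 while SMAPE is about 10.03; the two agree closely when errors are small relative to the values.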
After training, the models were tested. Table (1) compares the performance of the selected models. Figures (2, 3, 4) display the prediction results against the true values, along with the prediction intervals for the time series and the anomalies.

Discussion of results
Comparing the performance of the selected models in Table 1, linear regression achieved the best performance on both metrics, scoring 19.88 MAPE and 20.06 SMAPE. Both results demonstrate the feasibility of modeling network traffic. Logistic regression followed linear regression, while the performance of Random Forest and XGBoost was lower on both metrics. The results indicate that network traffic can be modeled and that linear regression models are better suited to this type of time series. When displaying the predicted values against the real values, linear regression (Figure 3) performed best compared with the other models over the full length of the test examples.
After analyzing the results, the linear regression model proved the best. To assess the effectiveness of this method, it is compared in Table 4 with previous work that used the same dataset; the comparison shows that linear regression performed best.

Conclusion
Traffic modeling is a critical task in network management and cybersecurity. In this work, network traffic was modeled with four machine learning algorithms: logistic regression, linear regression, Random Forest, and XGBoost. The study found that supervised machine learning algorithms can model network traffic as a time series, and that the extracted features can be modeled effectively. The best performance was achieved by the linear regression algorithm, with 19.88 MAPE and 20.06 SMAPE.