Quality of Experience Measurement for Video Streaming Based On Adaptive Neural Fuzzy Inference System

Technological developments in recent years have increased network access speeds, allowing a huge number of users to watch videos online. Video streaming is one of the most popular applications in networked systems. Quality of Experience (QoE) measurement for transmitted video streams must cope with data-transmission problems such as packet loss and delay, which can degrade video quality. We have developed an objective video quality measurement algorithm that uses different features affecting video quality. The proposed algorithm estimates subjective video quality with suitable accuracy. In this work, a QoE estimation metric for video streaming services is presented; the proposed metric does not require information about the original video. It predicts the QoE of videos from extracted features of two types, pixel-based and network-based, which are used to train an Adaptive Neural Fuzzy Inference System (ANFIS) to estimate the video QoE.


INTRODUCTION
The increase of internet access speed has led to a large amount of online video and a growing number of consumers with their own multimedia devices, allowing users to watch online video anytime and anywhere at specific resolutions. The success and acceptability of these new digital service offerings depend on the experience and satisfaction of customers while video streams are transported over the network infrastructure; at the same time, service impairments such as delay, packet loss, and jitter can severely damage the audiovisual quality as perceived by the end users.
The provider must ensure that customers receive appropriate quality at all times, and a variety of methods for measuring video quality aim to understand and ensure a good user experience. In recent years, the term Quality of Experience (QoE) has emerged to represent user satisfaction with the quality of the displayed content, matching the computed quality to people's opinions (bad, poor, fair, good, and excellent); it depends on the experience of the human visual system (HVS) with the video content [1].
It is important to differentiate between the terms Quality of Service (QoS) and Quality of Experience (QoE): the first refers to the performance of IP-based networks and services, while the second depends on the degree of the end user's delight with the application, service, or system [1,2]. From the network provider's viewpoint, video quality must be measured along the delivery path, so that any fault can be detected easily. At measurement points in the network, QoS metric parameters such as packet loss, jitter, loss ratio, and arrival time are used.
QoS-based metric models can be deployed at any point in the network, but they give no idea of the user's impression of the quality of the video being played. Consequently, QoS models are limited in predicting the video quality as perceived by the end user. There are two ways to measure video Quality of Experience: (1) Subjective video quality assessment, which rates a video sequence on a scale (excellent, good, fair, poor, and bad); these linguistic variables reflect people's instant impression of a video sequence. (2) Objective video quality metrics, which typically assess the decoded video as seen by the end user.
Objective video quality metrics can be divided into three types: (1) Full Reference (FR) metrics, which measure video quality by comparing the decoded video with the original video. (2) Reduced Reference (RR) metrics, which are like FR metrics in that they use some information about the original video, but resemble NR metrics in that they do not need the complete original video. (3) No Reference (NR) metrics, which measure the perceived quality using only the decoded video. NR metrics that can be applied to any video stream are really difficult to design because they require accurate models of human vision, object recognition, and quality judgment. There is ongoing research in this field, and this paper is part of it [3,4].
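As a minimal illustration of an FR metric, the widely used PSNR compares a decoded frame against its reference pixel by pixel; this sketch is our own illustration, not part of the proposed system:

```python
import numpy as np

def psnr(reference: np.ndarray, decoded: np.ndarray, peak: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio: a classic full-reference quality metric."""
    mse = np.mean((reference.astype(np.float64) - decoded.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(peak ** 2 / mse)

# Toy 8-bit grayscale frames: the decoded frame differs by a constant offset of 10.
ref = np.full((4, 4), 100, dtype=np.uint8)
dec = np.full((4, 4), 110, dtype=np.uint8)
print(round(psnr(ref, dec), 2))  # → 28.13
```

An RR or NR metric, by contrast, would have to work without the `reference` argument, which is why such metrics need models of perceived quality rather than a simple pixel comparison.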

Related work
No-reference video quality of experience estimation is an important and challenging task which must be considered an essential part of future interactive multimedia communication applications such as IPTV or online games [5]. Researchers have also tried to develop relationships between the quality of experience and media-layer parameters, and different measures/metrics can be used to estimate video quality [5].
In [6], Bao et al. present a new method of QoE estimation based on a fuzzy clustering heuristic algorithm, focused on the service score at the server side. The server side is responsible for collecting network QoS parameters and other information, saving this information in a large database, and using it in a heuristic rule model to predict the user score. This process is called fuzzy clustering analysis; it generates a service QoE score that is fed back to the clients.
In [7], Mocanu et al. present a new metric that measures user dissatisfaction, which does not always correspond to averaged scores. They do this using a deep learning framework (deep belief networks) and model both the average scores and the user dissatisfaction levels.
In [8], Kawano et al. propose a no-reference QoE measurement system comprising a training part and a test part. In the training phase, they calculate the sensitivity of low-quality coded videos from features such as blockiness, blurriness, edge, and continuity, and then rank these features using the Principal Component Analysis (PCA) method.

The Proposal System
The main objective is to design an efficient objective metric to measure video quality of experience that can predict the MOS of streamed video. The proposed NR-QoE measurement system consists of two stages: first, the system extracts different features (network-based, pixel-based) from the streaming video; second, the system is trained using a neural network and a fuzzy system. The proposed video QoE system consists of four steps: the first step is data set selection, the second is feature extraction, the third is the training phase, and the fourth is the testing phase.
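The four steps above can be sketched as a small driver; the `Sample` type and the trivial mean-MOS stand-in for the trained model are hypothetical placeholders, not the authors' actual code:

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Sample:
    features: list   # network-based + pixel-based feature vector (steps 1-2)
    mos: float       # subjective Mean Opinion Score (1..5)

def train_model(train_set):
    """Placeholder for the ANFIS training phase (step 3): a trivial
    mean-MOS predictor stands in for the real learner."""
    avg = mean(s.mos for s in train_set)
    return lambda features: avg

def run_pipeline(train_set, test_set):
    # Steps 1-2 (dataset selection, feature extraction) are assumed done
    # upstream; this covers step 3 (training) and step 4 (testing).
    model = train_model(train_set)
    return [model(s.features) for s in test_set]

train = [Sample([0.1, 2.0], 4.0), Sample([0.3, 1.5], 2.0)]
test = [Sample([0.2, 1.8], 3.0)]
print(run_pipeline(train, test))  # → [3.0]
```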

3.1 Data Set Selection
This work uses a number of video-sequence datasets with a good distribution of spatial and temporal properties to obtain the best result from the proposed system. The following video quality of experience databases were used to test our models:

ReTRiEVED Video Quality Database
The database contains eight source videos of different content. Using VLC (VideoLAN) and the NetEm network emulator to adjust the network parameters jitter, packet loss rate, throughput, and delay, 184 test videos were generated from the reference videos. The reference videos are used for the evaluation, where the Mean Opinion Score (MOS) and Difference Mean Opinion Score (DMOS) serve as quality assessment metrics [9,10,11]. Figure-1 shows frames of the original videos of the database used.

The LIVE-Netflix Video Quality of Experience database
It has 14 reference videos at 1080p resolution, used to generate 112 distorted videos at 24, 25, and 30 fps, evaluated by over 55 human subjects on a mobile device; the Mean Opinion Scores (MOS) are obtained from the subjective evaluations [12,13]. Figure-2 shows frames of the original videos of the database used.

3.2 Feature Extraction
The feature extraction stage is the most important phase in the prediction system: it extracts features from the raw data and uses the relevant ones for classification purposes. Many features can be extracted, but not all of them give good results, so dimensionality reduction must be applied by pruning the bad ones. This is an important step because of technical limits on memory and processing time.
In this proposed system we extract two types of features from the video frame sequence:
1. Network-based features, also referred to as QoS parameters. Most video characteristics can be taken from the application layer, such as bitrate, frame bitrate, and pixel bitrate, while other features result from network or channel performance, such as Packet Loss Rate (PLR), frame bitrate, pixel bitrate, packet structure, video codec, and video content type; we can also calculate the average PLR, average frame rate, and average bit rate [5]. A sample of the extracted features is shown in Table-I.
2. Pixel-based features, obtained by applying the feature extraction process to the video database: blocking, average blocking, blurring, average blurring, blur ratio, average luminance, average noise, and noise ratio. Pixel-based features have a high correlation with the Human Visual System (HVS) and consequently with the end user's impression of the video streaming service; they eventually improve the accuracy of the prediction model [7,12,13]. From the distorted videos we extract the Spatial perceptual Information (SI) and Temporal perceptual Information (TI). These two features capture the complexity of the videos, which depends on the amount of distortion, and this information is useful in the proposed system. We calculate the SI and TI values for each frame individually and then average them over the frames; equation 1 is used to calculate SI and equations 2 and 3 to calculate TI [14].
SI = max_time { sd_(i,j)[ Sobel(Fr_n) ] }    (1)
TI = max_time { sd_(i,j)[ M_n(i, j) ] }    (2)
M_n(i, j) = Fr_n(i, j) − Fr_(n−1)(i, j)    (3)
SI represents the edge content, calculated by applying a Sobel filter and taking the standard deviation (sd) of the frame pixels. TI captures the largest video motion [15], computed as the maximum over the video sequence of the standard deviation of M_n, the pixel difference between successive frames at the same location. Luminance and the average luminance difference are used to measure perceptual quality impairment; equation 4 is used to calculate the luminance of a single pixel [16], and equation 5 the average luminance difference [17,18]. The extracted features are summarized in Table-II.
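The SI/TI computation above can be sketched with a straightforward NumPy implementation of the ITU-T P.910-style definitions; the helper names are our own:

```python
import numpy as np

def sobel_magnitude(frame: np.ndarray) -> np.ndarray:
    """Gradient magnitude via 3x3 Sobel kernels (valid region only)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
    ky = kx.T
    f = frame.astype(np.float64)
    h, w = f.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = f[i:i + 3, j:j + 3]
            gx[i, j] = np.sum(patch * kx)
            gy[i, j] = np.sum(patch * ky)
    return np.hypot(gx, gy)

def si_ti(frames):
    """frames: list of 2-D grayscale arrays (one video sequence).
    SI = max over time of sd(Sobel(frame));
    TI = max over time of sd(frame-to-frame pixel difference M_n)."""
    si = max(float(np.std(sobel_magnitude(f))) for f in frames)
    ti = max(float(np.std(frames[n].astype(np.float64) -
                          frames[n - 1].astype(np.float64)))
             for n in range(1, len(frames)))
    return si, ti

# A flat sequence has no edges and no motion; a moving diagonal has both.
frames = [np.zeros((8, 8)), np.eye(8) * 50.0]
print(si_ti(frames))
```

The per-frame averages used by the proposed system would be computed the same way, replacing the `max` over time with a mean.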

Classification and Clustering
Classification and clustering are the most important parts of the prediction process, in addition to feature extraction; the previous phases should be designed with the aim of making the classification phase succeed. The classifier bases its decision on the feature-vector values extracted from the video sequence to be classified, taking them as input and assigning the sequence a suitable MOS [19]. Clustering and classification are similar concepts: the main job of the clustering process is to group the feature vectors into clusters in which the similarity between patterns within a cluster is stronger than between clusters [20].

Training phase
1. The first level: train the system with a Neural Network algorithm to build a cluster module that detects the videos with the lowest MOS values.
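The clustering idea, grouping feature vectors so that within-cluster similarity is stronger than between-cluster similarity, can be illustrated with a minimal k-means routine (a generic sketch, not the authors' exact algorithm):

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Minimal k-means sketch: initialise centroids from the first k
    vectors, then alternate nearest-centroid assignment and centroid update."""
    centroids = X[:k].astype(np.float64).copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Squared distance from every vector to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centroids[c] = X[labels == c].mean(axis=0)
    return labels, centroids

# Two tight groups of 2-D feature vectors end up in two separate clusters.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels, _ = kmeans(X, 2)
print(labels)
```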
The main idea of the proposed system is to use five back-propagation neural networks, each responsible for calculating one MOS score value (1 = worst, 2 = poor, 3 = fair, 4 = good, and 5 = excellent); these values are the inputs to the fuzzy inference system. Each neural network has three layers (input, hidden, and output). The back-propagation algorithm uses local gradient values to update the weights: first, the local gradient is calculated for each output neuron and the weights are updated; then, backtracking from the hidden-layer neurons toward the input neurons, the local gradients at the hidden neurons are calculated. The procedure is repeated, updating the weights along two paths: from the output neurons to the hidden neurons, and from the hidden neurons to the input neurons.
In the feedforward phase, the input layer has units (X_1 … X_10) representing the feature values received from the previous phase and distributes them to the hidden layer, which has units (Z_1, …, Z_15). The hidden units then compute the activation function and send the results to the output layer.
In the training cycle, each output unit compares its computed activation with its target value to determine the associated error for that pattern with that unit. During the backpropagation phase of learning, signals are sent in the reverse direction [21,22].
The proposed system uses the Back Propagation (BP) algorithm because of its accuracy, as it allows the network to learn and improve itself. The network in this system contains 10 units in the input layer, corresponding to the extracted features, 15 units in the hidden layer, and one unit in the output layer, representing the predicted Mean Opinion Score (Figure-4).
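The 10-15-1 back-propagation network described above can be sketched as a toy NumPy implementation; the learning rate, sigmoid hidden activation, linear output, and weight initialisation are our assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Network shape from the text: 10 input units (extracted features),
# 15 hidden units, 1 output unit (predicted MOS).
W1 = rng.normal(0, 0.5, (10, 15))
b1 = np.zeros(15)
W2 = rng.normal(0, 0.5, (15, 1))
b2 = np.zeros(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x):
    z = sigmoid(x @ W1 + b1)   # hidden activations (feedforward phase)
    y = z @ W2 + b2            # linear output: the MOS estimate
    return z, y

def backprop_step(x, target, lr=0.05):
    """One BP step: local gradient at the output neuron first, then
    propagated back to the hidden-layer and input-layer weights."""
    global W1, b1, W2, b2
    z, y = forward(x)
    d_out = y - target                        # local gradient at output
    d_hidden = (d_out @ W2.T) * z * (1 - z)   # local gradients at hidden units
    W2 -= lr * np.outer(z, d_out)
    b2 -= lr * d_out
    W1 -= lr * np.outer(x, d_hidden)
    b1 -= lr * d_hidden

# Toy training: learn to map one fixed feature vector to MOS = 4.0.
x = rng.uniform(0, 1, 10)
for _ in range(2000):
    backprop_step(x, np.array([4.0]))
print(float(forward(x)[1]))  # close to 4.0
```

In the proposed system, five such networks run in parallel, one per MOS level, and their outputs feed the fuzzy inference stage.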

CONCLUSIONS
In this work, the quality of experience of streaming video is predicted using the ANFIS model. This model can easily use video parameters and control the output.
It is shown that the ANFIS model is compatible with the experimental results. The model, which combines the learning ability and parallel computation of five back-propagation networks (a supervised learning method) with a fuzzy inference system, is a hybrid artificial intelligence method. It has the advantages of both ANN and FL methods and offers an opportunity to solve critical and complex problems.