The Performance Differences between Using Recurrent Neural Networks and Feedforward Neural Network in Sentiment Analysis Problem

With the widespread use of the internet, and of social media in particular, an unprecedented quantity of information has become available that spans many fields of study, such as psychology, entertainment, sociology, business, news, politics, and other cultural fields. Data mining methodologies applied to social media offer an interesting view of human behaviour and interaction. This paper demonstrates the application and accuracy of sentiment analysis using a traditional feedforward neural network and two recurrent neural networks, the gated recurrent unit (GRU) and long short-term memory (LSTM), in order to identify the differences between them. To test the system's performance, a set of tests is applied to two public datasets. The first dataset is collected from IMDB and contains movie reviews written in long English sentences, whereas the second is a collection of tweets obtained by keyword search through the Twitter Search API; these tweets are written in English in short sentences. In this work, a pre-processing operation is added to the system, and a set of tests is conducted to evaluate the effect of this addition on the performance of the whole system. The results obtained with the traditional feedforward neural network are poor and do not serve the purpose of the analysis, because the network cannot retain information over the long term and therefore loses efficiency. In contrast, the results obtained with GRU and LSTM are relatively good and do serve the purpose of the analysis. A recurrent neural network has been built so that any type of text data can be fed to it to obtain the sentiment polarity through multiple deep operations that depend on the extracted information.


Introduction
Sentiment analysis is one of the most common applications of text analytics and is used in a wide range of areas, such as tutorials, mobile applications, and web sites. These applications analyse sentiment from various text sources, ranging from company surveys such as Google opinions to movie reviews such as those on the Internet Movie Database (IMDB). Sentiment analysis is currently widely applied on commercial websites and social media platforms such as Twitter, Instagram, and Facebook, as well as on movie and product review websites, to determine people's opinions [1].
Sentiment analysis, also commonly called opinion mining, can be described as the process of applying NLP techniques, linguistics, dictionary resources, and machine learning to extract information such as modality, emotion, and mood. The extracted information is then used to calculate the sentiment of the text document [2]. From this sentiment, it can be determined whether the document expresses a negative, positive, or neutral opinion. More advanced analyses also address more complex emotions such as anger, sadness, and sarcasm. Many methods are applied to this task, such as recurrent neural networks, hybrid classification, and deep convolutional neural networks [3].
At present, sentiment analysis is a subject of great interest because it has many practical applications. With the help of sentiment analysis systems, unstructured information can be automatically converted into structured information about public opinion on brands, services, products, protocols, or any subject that people might talk about [4]. This information can be very valuable for commercial applications such as marketing analysis, public relations, product reviews, net promoter scoring, product feedback, and customer service [4].
A previous work [5] introduced a recurrent architecture called the quasi-recurrent neural network (QRNN). It is an approach to neural sequence modelling that alternates convolutional layers, which apply in parallel across time steps, with a minimalist recurrent pooling function that applies in parallel across channels. Despite lacking trainable recurrent layers, the QRNN showed higher accuracy than LSTM with the same hidden dimension. Another work [6] proposed a network called PBAN, based on a bidirectional GRU (Bi-GRU), which not only focuses on the position information of aspect terms but also models the relation between the aspect terms and the sentence using a bidirectional attention mechanism. Experimental results on the SemEval 2014 datasets demonstrated the effectiveness of the proposed PBAN network. The basic idea of PBAN is to use the position embedding of aspect terms when determining the attention weights. The authors of a previous study [7] suggested various LSTM architectures for sentiment analysis of movie reviews. Their results indicated that the LSTM RNN performs more effectively than the classical RNN and deep neural networks for sentiment analysis. They first evaluated simple LSTM models and then added LSTM layers one at a time, which increased the accuracy. Finally, bidirectional LSTM layers were introduced to capture information in both the forward and backward directions. Other authors [8] suggested a network named CA-LSTM that incorporates preceding tweets for sentiment classification. This context attention-based long short-term memory network uses a hierarchical structure to model the microblog sequence and assigns different weights to tweets and words using an attention mechanism.
Another article [9] proposed a mechanism for creating a model capable of predicting learning performance, extracting learning features, and explaining the results. First, a general learning feature verification approach was established to convert the raw data from e-learning systems into groups of separate learning features. The authors then presented a developed parallel neural network to produce the prediction results. In this paper, deep neural networks are adopted to solve classification problems related to social data. Deep learning is a machine learning technique that uses multiple layers of non-linear information processing for supervised feature extraction, as well as for classification and pattern analysis [10]. Deep neural networks and recurrent neural networks have been applied in fields such as speech recognition, computer vision, and natural language processing (NLP) [11].

Materials and Methods
The workflow of the proposed system consists of three main phases: (a) the preparation phase, (b) the neural network phase, and (c) the garbage word removal phase. Each phase consists of several stages. The general structure of the proposed system is shown in Figure-

Preparation (Pre-processing) Phase
Preparation operations are among the most essential operations in any data mining system, because the efficiency of the system depends on the quality of the input files in terms of word separators, word length, noise, etc. The preparation stage comprises a series of text manipulation steps that make the incoming data stream more suitable for extracting the relevant information.

Neural Network Phase
This section presents the proposed systems, which analyse sentiment by training on one set of sentences and testing on another, using several types of neural networks to obtain the experimental results. This phase is divided into two sections:

Feed-forward Neural Network
The feed-forward neural network is one of the most common neural networks applied to classification and regression problems. In this network, information passes through each layer in the forward direction only. The first layer, called the input layer, receives the inputs; the middle layer, called the hidden layer, performs the computations. The computed information is then passed to the output layer to generate the result. As for the activation functions, the derivative of the hyperbolic tangent is used for weighting the connections from the input nodes to the hidden nodes, and the derivative of the logistic sigmoid is used to update the hidden signal that is passed to the output nodes [12,13].
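The layer structure described above can be illustrated with a short Python sketch of a single forward pass. It assumes a hyperbolic tangent on the hidden layer and a logistic sigmoid on the output layer, which is the reading suggested by the two derivatives named in the text; the layer sizes, the random weights, and the function names `sigmoid` and `forward` are hypothetical and are not taken from the original C# implementation.

```python
import math
import random

def sigmoid(x):
    """Logistic (uni-polar) sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, w_hidden, b_hidden, w_out, b_out):
    """One forward pass: input layer -> tanh hidden layer -> sigmoid output layer."""
    hidden = [
        math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    outputs = [
        sigmoid(sum(w * h for w, h in zip(row, hidden)) + b)
        for row, b in zip(w_out, b_out)
    ]
    return hidden, outputs

# Hypothetical sizes: 4 input features, 3 hidden nodes, 2 output nodes.
# The weights are random placeholders, not trained values.
rng = random.Random(0)
n_in, n_hidden, n_out = 4, 3, 2
w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
b_hidden = [0.0] * n_hidden
w_out = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
b_out = [0.0] * n_out

hidden, outputs = forward([1.0, 0.0, 1.0, 0.0], w_hidden, b_hidden, w_out, b_out)
```

Information flows in one direction only: no value computed for one sentence is carried over to the next.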
Training a neural network is the process of finding a suitable set of weight values for the network nodes, so that the resulting classification decisions are very close to the target values [Sca92]. The most widespread algorithm for training feed-forward networks is backpropagation. Supervised training was used in this work [14][15][16].
After the words extracted from the dataset are sorted in descending order of frequency, each word is given a unique number (index), and every sentence is converted from a sequence of words into a sequence of numbers. Then the most frequent words, between 50 and 300 of them, are selected and stored in what is called the map variable. The words can be ordered in ascending or descending order of their frequency.
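The indexing step above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: whitespace tokenization, lower-casing, 1-based indices, and the function names `build_word_index` and `encode` are all hypothetical, and the toy sentences stand in for the real datasets.

```python
from collections import Counter

def build_word_index(sentences, vocab_size):
    """Rank words by descending frequency and index the vocab_size most frequent ones."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    ranked = counts.most_common(vocab_size)  # sorted by descending count
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}

def encode(sentence, word_index):
    """Replace each mapped word by its index; words outside the map are dropped."""
    return [word_index[w] for w in sentence.lower().split() if w in word_index]

# Toy data; the paper selects between 50 and 300 of the most frequent words.
sentences = ["good good movie", "good movie", "terrible"]
word_index = build_word_index(sentences, vocab_size=2)
encoded = [encode(s, word_index) for s in sentences]
```

With a small map, a sentence that contains none of the selected words becomes an empty list, as happens to the last toy sentence here.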
In this experiment, 80% of the sentences were used for training and 20% for testing. Feedforward neural networks were created, trained, and tested on the training set. The derivative of the hyperbolic tangent function was used for weighting the input to the hidden nodes, whereas the derivative of the logistic sigmoid function (uni-polar sigmoid function) was used to update the hidden signal output that is sent to the output nodes. Back-propagation was used to train the feed-forward neural networks by updating the weights towards better values. Error performance was measured with the mean square error (MSE). Visual Studio 2010 was used as the platform, with C# as the programming language.
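The quantities named in this paragraph can be written down compactly. The following Python sketch shows the two activation derivatives, the mean square error, and an 80/20 split; it is only an illustration, not the original C# code, and the function names as well as the choice to split without shuffling are assumptions.

```python
def tanh_derivative(y):
    """Derivative of tanh, expressed through its output y = tanh(x)."""
    return 1.0 - y * y

def sigmoid_derivative(y):
    """Derivative of the logistic sigmoid, expressed through its output y."""
    return y * (1.0 - y)

def mean_squared_error(targets, outputs):
    """Average of the squared differences between targets and network outputs."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / len(targets)

def split_80_20(samples):
    """First 80% of the samples for training, remaining 20% for testing."""
    cut = len(samples) * 8 // 10
    return samples[:cut], samples[cut:]

train, test = split_80_20(list(range(10)))
error = mean_squared_error([1.0, 0.0], [0.5, 0.5])
```

An output of 0.5 on both classes, as in the last line, corresponds to a network that is undecided between the two polarities.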

Recurrent Neural Networks
Recurrent neural network structures range from fully interconnected to partially connected networks, including multilayer feedforward networks with distinct input and output layers [17]. Fully connected networks do not have distinct input layers of nodes, and every node receives input from all other nodes. Feedback from a node to itself is also possible, as shown in Figure-3 [18].

Figure 3-Completely Recurrent Network
A recurrent neural network handles variable-length sequences by means of a recurrent hidden state whose activation at each time step depends on that of the previous time step. In general, given a sequence $A = (A_1, A_2, \ldots, A_t)$, the recurrent hidden state $Z_t$ is updated by [19]:
$$Z_t = \theta(Z_{t-1}, A_t)$$
where $\theta$ is a nonlinear function, for example the composition of a logistic sigmoid with an affine transformation. Optionally, the recurrent neural network may produce an output $B = (B_1, B_2, \ldots, B_t)$, which may also be of variable length [20]. Two types of recurrent neural networks are used in this paper: LSTM and GRU.
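The recurrent update described in this section can be sketched in Python for the example given in the text, where the nonlinearity is a logistic sigmoid applied to an affine transformation. The weight matrices `W` and `U`, the zero initial state, and the function names are assumptions made for illustration only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def rnn_step(z_prev, a_t, W, U):
    """Z_t = sigmoid(W A_t + U Z_{t-1}): an affine map followed by a nonlinearity."""
    pre = [p + q for p, q in zip(matvec(W, a_t), matvec(U, z_prev))]
    return [sigmoid(x) for x in pre]

def run_rnn(sequence, W, U, hidden_size):
    """Process a sequence of any length, starting from an all-zero hidden state."""
    z = [0.0] * hidden_size
    for a_t in sequence:
        z = rnn_step(z, a_t, W, U)
    return z

# Hypothetical weights: 2 input features, 2 hidden units.
W = [[0.5, -0.3], [0.1, 0.8]]
U = [[0.2, 0.0], [0.0, 0.2]]
short_state = run_rnn([[1.0, 0.0]], W, U, hidden_size=2)
long_state = run_rnn([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], W, U, hidden_size=2)
```

The same weights handle a sequence of one step and a sequence of three steps, which is what allows the network to process sentences of different lengths.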

Long Short Term Memory (LSTM)
LSTM is designed to overcome the difficulties faced by traditional recurrent networks by employing gates. An LSTM unit has three gates: the input, output, and forget gates. The input gate determines how much of the newly computed state for the current input is let through, the forget gate determines how much of the previous state is kept, and the output gate determines how much of the internal state is exposed to the subsequent step. In addition, the LSTM unit has an internal memory that determines how the new input and the previous memory are combined: it is the previous memory multiplied by the forget gate plus the newly computed hidden state multiplied by the input gate. This gating mechanism allows the LSTM to learn long-term dependencies. The graphical summary of LSTM is shown in Figure- [20].
Unlike the plain recurrent unit, which simply computes a weighted sum of the input signal and applies a nonlinear function, each $j$-th LSTM unit maintains a memory $c_t^j$ at time $t$. The output, or activation, of the LSTM unit is then calculated by [20][21][22]:
$$Z_t^j = o_t^j \tanh(c_t^j)$$
where $o_t^j$ is the output gate, which modulates the amount of memory content exposure. The output gate is calculated by [20]:
$$o_t^j = \alpha(W_o A_t + U_o Z_{t-1} + V_o c_t)^j$$
where $\alpha$ is a sigmoid function and $V_o$ is a diagonal matrix. The memory unit [21] is updated by partially forgetting the existing memory and adding a new memory content $\tilde{c}_t^j$:
$$c_t^j = g_t^j c_{t-1}^j + k_t^j \tilde{c}_t^j$$
The new memory content is:
$$\tilde{c}_t^j = \tanh(W_c A_t + U_c Z_{t-1})^j$$
The extent to which the existing memory is forgotten is modulated by the forget gate $g_t^j$ [20], and the degree to which the new memory content is added to the memory cell is modulated by the input gate $k_t^j$ [20], as follows:
$$g_t^j = \alpha(W_g A_t + U_g Z_{t-1} + V_g c_{t-1})^j$$
$$k_t^j = \alpha(W_k A_t + U_k Z_{t-1} + V_k c_{t-1})^j$$
Note that $V_g$ and $V_k$ are diagonal matrices. Unlike the classical recurrent unit, which overwrites its content at every time step, the LSTM unit can decide through its gates how long to keep the existing memory. In addition, if the LSTM unit detects an important feature of the input sequence at an early stage, it can easily carry this information over a long distance [21].
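A single LSTM step with the three gates described in this section can be sketched in Python as follows. This is an illustrative sketch only: the parameter names (`Wg`, `Ug`, `Vg` for the forget gate, `Wk`, `Uk`, `Vk` for the input gate, `Wo`, `Uo`, `Vo` for the output gate, `Wc`, `Uc` for the new memory content) are assumptions, the diagonal matrices are stored as plain vectors, and the all-zero weights are placeholders chosen so that the result is easy to check by hand.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, U, a_t, z_prev):
    """Return W a_t + U z_prev; matrices are lists of rows."""
    return [
        sum(w * a for w, a in zip(w_row, a_t))
        + sum(u * z for u, z in zip(u_row, z_prev))
        for w_row, u_row in zip(W, U)
    ]

def gate(W, U, v_diag, a_t, z_prev, memory):
    """Sigmoid gate with a diagonal memory term (v_diag holds the diagonal)."""
    pre = affine(W, U, a_t, z_prev)
    return [sigmoid(x + v * m) for x, v, m in zip(pre, v_diag, memory)]

def lstm_step(a_t, z_prev, c_prev, p):
    """One LSTM step: returns the new activation and the new memory."""
    forget = gate(p["Wg"], p["Ug"], p["Vg"], a_t, z_prev, c_prev)
    inp = gate(p["Wk"], p["Uk"], p["Vk"], a_t, z_prev, c_prev)
    candidate = [math.tanh(x) for x in affine(p["Wc"], p["Uc"], a_t, z_prev)]
    # Keep part of the old memory and add part of the new content.
    c_t = [f * c + i * n for f, c, i, n in zip(forget, c_prev, inp, candidate)]
    out = gate(p["Wo"], p["Uo"], p["Vo"], a_t, z_prev, c_t)
    z_t = [o * math.tanh(c) for o, c in zip(out, c_t)]
    return z_t, c_t

# Placeholder parameters: one input feature, one unit, all weights zero,
# so every gate evaluates to 0.5 and the new memory content to 0.
params = {name: [[0.0]] for name in ("Wg", "Ug", "Wk", "Uk", "Wc", "Uc", "Wo", "Uo")}
params.update({"Vg": [0.0], "Vk": [0.0], "Vo": [0.0]})
z_t, c_t = lstm_step([1.0], [0.0], [1.0], params)
```

With these placeholder weights, half of the previous memory survives the step, which shows how the forget gate scales the stored content instead of overwriting it.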

Gated Recurrent Unit (GRU)
The gated recurrent unit is an advanced version of the common recurrent network that relies on a reset gate and an update gate. Essentially, these are two vectors that determine what information should be passed to the output. The special feature of these two vectors is that they can be trained to retain information from long ago without it being washed out over time, or to remove information that is irrelevant to the prediction. The update gate defines how much of the previous state is kept, and the reset gate is applied to the previous hidden state. In short, the reset gate determines how to combine the new input with the previous memory [20]. The graphical summary of GRU is shown in Figure-
The GRU activation $Z_t^j$ at time $t$ is a linear interpolation between the previous activation $Z_{t-1}^j$ and the candidate activation $\tilde{Z}_t^j$ [20]:
$$Z_t^j = (1 - u_t^j) Z_{t-1}^j + u_t^j \tilde{Z}_t^j$$
where $u_t^j$ is the update gate, which determines how much the unit updates its activation, or content. The update gate is calculated by [22]:
$$u_t^j = \alpha(W_u A_t + U_u Z_{t-1})^j$$
This procedure of taking a linear sum between the existing state and the newly computed state is similar to that of the LSTM unit. The GRU, however, has no mechanism to control the degree to which its state is exposed. The candidate activation $\tilde{Z}_t^j$ is calculated in the same way as in the common recurrent unit [20,22], that is:
$$\tilde{Z}_t^j = \tanh(W A_t + U(r_t \odot Z_{t-1}))^j$$
where $r_t$ is a set of reset gates and $\odot$ represents element-wise multiplication. The reset gate makes the unit forget the previously computed state, as if it were reading the first symbol of an input sequence. The reset gate is calculated in the same way as the update gate, as follows [20]:
$$r_t^j = \alpha(W_r A_t + U_r Z_{t-1})^j$$
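The update and reset gates described in this section can be illustrated with a minimal Python sketch of one GRU step. The parameter names (`Wu`, `Uu` for the update gate, `Wr`, `Ur` for the reset gate, `W`, `U` for the candidate activation) are assumptions, and the all-zero weights are placeholders chosen so that the result can be verified by hand.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, U, a_t, z):
    """Return W a_t + U z; matrices are lists of rows."""
    return [
        sum(w * a for w, a in zip(w_row, a_t))
        + sum(u * s for u, s in zip(u_row, z))
        for w_row, u_row in zip(W, U)
    ]

def gru_step(a_t, z_prev, p):
    """One GRU step: returns the new activation."""
    update = [sigmoid(x) for x in affine(p["Wu"], p["Uu"], a_t, z_prev)]
    reset = [sigmoid(x) for x in affine(p["Wr"], p["Ur"], a_t, z_prev)]
    # The reset gate scales the previous state element-wise before it is reused.
    gated_prev = [r * z for r, z in zip(reset, z_prev)]
    candidate = [math.tanh(x) for x in affine(p["W"], p["U"], a_t, gated_prev)]
    # Linear interpolation between the previous and the candidate activation.
    return [(1.0 - u) * z + u * c for u, z, c in zip(update, z_prev, candidate)]

# Placeholder parameters: one input feature, one unit, all weights zero,
# so both gates evaluate to 0.5 and the candidate activation to 0.
params = {name: [[0.0]] for name in ("Wu", "Uu", "Wr", "Ur", "W", "U")}
z_t = gru_step([1.0], [0.8], params)
```

Unlike the LSTM sketch, there is no separate memory and no output gate: the whole state is returned at every step.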

Remove Garbage Words Phase
The aim of the garbage word removal phase is to reduce the running time by removing the words that appear with low frequency, and to check whether removing them affects the accuracy. Low-frequency words are removed from the dataset to reduce its size and the processing time, while the accuracy is monitored to see whether it is affected.
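The removal of low-frequency words can be sketched in Python as follows. The text does not state the exact criterion used, so the `min_count` threshold, the function name `remove_rare_words`, and the toy sentences are assumptions made for illustration.

```python
from collections import Counter

def remove_rare_words(sentences, min_count):
    """Drop every word that occurs fewer than min_count times in the whole dataset."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return [[w for w in sentence if counts[w] >= min_count] for sentence in sentences]

def vocabulary_size(sentences):
    """Number of distinct words in the dataset."""
    return len({word for sentence in sentences for word in sentence})

# Toy tokenized data standing in for the real datasets.
tokenized = [["great", "film"], ["great", "plot", "twist"], ["dull", "film"]]
cleaned = remove_rare_words(tokenized, min_count=2)
size_before = vocabulary_size(tokenized)
size_after = vocabulary_size(cleaned)
```

The vocabulary shrinks, which reduces the data size and the processing time, but a word such as "dull" that carries sentiment can be lost as well, which is why the accuracy has to be monitored.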

Results
In this section, the results of the conducted tests are presented and discussed to evaluate the performance of the established system. Two datasets are used for training and testing the system proposed in this paper. The first dataset is the movie review dataset, written in English in long sentences, which consists of about 25000 sentences divided into 12500 positive and 12500 negative phrases collected from IMDB. The second dataset is the Twitter dataset, collected by keyword search via the Twitter Search API. These tweets are written in English in short sentences and amount to about 50,000 sentences.

Feedforward Results
The movie review dataset was fed to the feed-forward implementation. A total of 25000 sentences were used from the set, 12500 being positive and 12500 being negative. When 50 words were used to create the numerical vector, 24842 sentences were produced, while the other sentences were empty. When 100 words were used, 24888 sentences were produced, whereas using 200 words resulted in 24889 sentences. In the experiment, networks with 50, 150, 100 and 200 neurons, respectively, were created, trained on the training set, and tested on the training set, with a learning rate of 0.0001 and a momentum of 0.0001. The classification accuracy of the training ranged from 40% to 53%. Figure-6 shows the results of using 100 words (24888 sentences) and 100 neurons. If a sentence contained only positive or only negative words, the output neurons produced results close to 100%, but if a sentence contained both a positive and a negative word, the output neurons were closer to 50%, indicating uncertainty and that the sentence could be classified incorrectly. Limited memory became a problem: when more sentences were used, the number of words increased, and there was not enough memory to accommodate the data structures needed for training the neural network.

GRU and LSTM Results
The RNN is one of the most widely used deep learning techniques for sentiment analysis, and it is particularly used with large datasets. Python was used for the tests in this part of the work, and the output is reported over multiple iterations (epochs) to give a better view of the accuracy at different intervals. The accuracy was found to be around 0.88, which is a good result for a system to which any type of text data can be fed to obtain the sentiment and its accuracy. Table-1 shows the training and testing accuracy of GRU and LSTM for each dataset produced by the garbage word removal operation from the first dataset. Each dataset in Table-1 corresponds to a ratio of removed words. Table-2 shows the training and testing accuracy of GRU and LSTM for each dataset produced by the garbage word removal operation from the second dataset.

Result Analysis
a. Differences Between Feed-Forward and Recurrent NN
The results in Section 3 showed that the traditional neural network cannot be trained on a large dataset such as the one used in this work, because it cannot retain information over the long term, which leads to a loss of efficiency. The training accuracy was about 50% to 53%, which is very low: since the polarity is either negative or positive, a random guess is already correct with a probability of 50%. In contrast, the GRU and LSTM units gave the best results on both datasets. GRU and LSTM units can solve the memory issue because of their gates, such as the update gate, which determines how much a unit updates its activation, or content, and thus stores information. GRU and LSTM units maintain the existing content and add the new content on top of it.
b. Differences Between GRU and LSTM
The differences between the accuracies of the two methods (GRU and LSTM) are shown in Figures-7 and 8 for the first and second datasets, respectively.

Figure 7-Differences between (GRU & LSTM) for 1st Dataset
For the first dataset, the differences are summarized in equation (14), which indicates that y is stronger than x by 1.2083, where the x-axis represents the GRU accuracies and the y-axis represents the LSTM accuracies.

Figure 8-Differences between (GRU & LSTM) for 2nd Dataset
For the second dataset, the differences are summarized in equation (15), which indicates that y is stronger than x by 2.9823, where the x-axis represents the GRU accuracies and the y-axis represents the LSTM accuracies.

Conclusions and Future Work
In this paper, a system architecture was presented that can be trained on short and long texts for sentiment analysis and tested using different datasets, of different sizes, stored as text files (.txt). Several conclusions have been drawn from the results obtained with the proposed system. Two approaches to sentiment analysis were considered. Firstly, the performance of the proposed approaches was tested against the traditional feedforward NN classifier. Secondly, the recurrent neural network models (GRU and LSTM) use recurrent layers to extract effective features. Therefore, deep learning can considerably improve sentiment analysis. The results showed that the traditional neural network cannot be trained on large datasets such as those used in this work, because it cannot retain information over the long term, which leads to a loss of efficiency. The results also showed that the GRU and LSTM units gave the best results on both datasets, since their deep learning principle can solve the memory and information storage issues. Afterwards, we highlighted the importance of focusing on the key information of an input sequence at the word-feature level by removing the words that appear rarely (garbage words). In the first dataset, which consists of long sentences, removing the garbage words reduced the accuracy by about 18% to 22%, whereas in the second dataset, which consists of short tweets, the accuracy was reduced by about 4%. This is useful for social media data, because removing the garbage words reduces the data size and therefore increases the speed of analysis. The effect arises because large text data needs more time to analyse.
For future work, the processing of individual words can be replaced by the processing of pairs or triples of words, which involves tasks such as understanding the relations between words and determining the decisive words that occur most often in positive or negative sentences. The system can also be developed further by using a synonym system (such as WordNet in Python). To tackle the memory space issue, words that have the same meaning could be combined: using an indexing tree, the required words and then their synonyms would be determined. In addition, recurrent neural networks could be merged with convolutional neural networks (CNNs) to obtain better results.