Word Embedding Methods for Word Representation in Deep Learning for Natural Language Processing

Natural Language Processing (NLP) deals with analysing, understanding and generating language the way humans do. One of the challenges of NLP is training computers to learn and use a language as humans do. Every training session consists of several types of sentences with different contexts and linguistic structures. The meaning of a sentence depends on the actual meanings of its main words together with their correct positions; the same word can act as a noun, an adjective or another part of speech depending on its position. In NLP, word embedding is a powerful method that is trained on a large collection of texts and encodes general semantic and syntactic information about words. Choosing the right word embedding generates more efficient results than other choices. Most papers use pretrained word embedding vectors in deep learning for NLP processing, but the major issue with pretrained word embedding vectors is that they cannot be used for all types of NLP processing. In this paper, a local word embedding vector formation process is proposed and a comparison between pretrained and local word embedding vectors for the Bengali language is shown. The Keras framework in Python is used for the local word embedding implementation, and the analysis section of this paper shows that the proposed model produces 87.84% accuracy, which is better than the 86.75% accuracy of the fastText pretrained word embedding vectors. Using this proposed method, NLP researchers working on the Bengali language can easily build specific word embedding vectors for word representation.


Introduction
Word embedding is one of the most important topics in natural language processing. Also known as distributed word representation, it is used to represent words in natural language processing and information retrieval applications [1][2][3][4][5][6]. Most machine learning algorithms, as well as deep learning algorithms, cannot process strings or plain text in their raw form. Their input requires numbers or lists of numbers to perform any type of job such as regression, classification, etc. Word embedding generally maps a word to a vector using a dictionary. Word embeddings or distributional vectors follow the distributional hypothesis, according to which words with similar meanings occur in similar contexts. Distributional vectors try to capture the properties of adjacent words. Generally, word embedding is used as the first data processing layer in a machine learning or deep learning model [7][8][9][10][11]. A valid word vector can be any set of numbers, provided the vocabulary captures specific meanings, relationships between words, etc. In word embedding, every word has a unique vector, and embeddings are multidimensional vectors, typically 50 to 500 in length.

The simplest word embedding scheme is one-hot encoding, where the embedding space has the same size as the total number of words in the vocabulary, as shown in Figure 1. The main drawback of one-hot encoding is that the dimension size depends linearly on the vocabulary size, which consumes a huge amount of memory: if the vocabulary size increases, the number of dimensions increases linearly. To reduce the dimension size, we can use categories for the whole vocabulary, as shown in Figure 2, where similar words have similar embeddings and the embedding matrix contains less empty space (fewer zeros). Creating an N-dimensional word embedding vector which captures the relationships between similar words and neighbouring words is very difficult for most researchers. They therefore use pretrained word embedding vectors such as Word2Vec created by Google, the GloVe word embedding vectors created by Stanford, or the fastText word embedding vectors created by Facebook. Each of these pretrained word embeddings has its own algorithm for creating the vectors. When creating word embeddings locally, we have to consider the similarity between recognized words and the relationships between words.

Figure 2 - Lower-dimensional Word Embedding
The main goal of this research is to discuss how to create local word embedding methods for Bengali text processing. Word embedding methods are very familiar in English language processing, but in the Bengali language the usage of these methods is very rare. The key contributions of our research are:
- Identify all stop words in the Bengali language
- Perform a frequency-based local word embedding generation process for the Bengali language
- Perform a prediction-based local word embedding generation process
- Apply Random Forest machine learning classifiers to the local word embedding vectors and the pretrained word embedding vectors
- Make a comparison between the local and pretrained word embedding vectors

Hinton [7] (1986) proposed the first linear relational word representation scheme, using binary relations between words and learning distributed representations. Other researchers then improved word embedding vectors by adding different factors. Mikolov et al. (2013c) [2,12] proposed a word embedding method based on input layer weights which captures syntactic and semantic structure, where relation-specific vector offsets are used to characterize semantic relationships. Their example demonstrates the male-female relationship learned by the word vectors through the vector equation "king - queen = man - woman". Hermann and Blunsom worked on the multilingual setting, learning distributed representations of input sentences, that is, the same sentences in different languages [13,14,15]. Kiros [16] explained a different notion of learning from context sentences using a recurrent neural network. Yih [17] proposed a method for text similarity measurement where short texts are represented by TF-IDF vectors. Hill [18] also presented a linear model in which the relationship between sentences is not considered. Global Vectors for word representation (GloVe) is another vector representation for words, based on an unsupervised learning algorithm, which was created for the English language [19]. Most pretrained word embeddings are based on the English language. Recently, Facebook introduced new word embeddings named fastText which provide word vectors for 157 languages [20]. FastText also has word vectors for the Bengali language, and each vector is 300 dimensions in size. However, the major problems of using fastText are the large data size, which needs huge memory and a highly configurable machine for processing, and the limited number of Bengali words. All of the word embedding related research either used pretrained word embedding vectors, used recurrent neural networks to skip word pre-processing, or used its own word embeddings with limited properties. So, in this paper the word embedding creation process for the Bengali language is discussed, along with how to use pretrained word vectors for the Bengali language. Kumar et al. [21] discussed pretrained word embeddings for 14 languages including Bengali. They applied different word embedding methods to the 14 languages and reported their performance results, but they did not discuss the local embedding generation process for these languages; they used pretrained word embedding methods to generate word vectors for a specific language.

Word Embedding Method
Creating a local word embedding method depends on the aims of our NLP research. Word embedding methods can be broadly classified into two categories: frequency-based embedding and prediction-based embedding. Frequency-based embedding methods are easy to understand and are mainly used for text classification, sentiment analysis and many more tasks [22]. There are several frequency-based methods, such as the Count Vector, the TF-IDF Vector and the Co-occurrence Matrix. On the other hand, prediction-based word embedding methods predict a target word by mapping the words in the vocabulary. The most widely used prediction-based methods are the Continuous Bag of Words (CBOW) and Skip-Gram models. Both of these techniques learn weights by applying a backpropagation neural network [23]. Short discussions of the different word embedding methods and their use cases follow.

Count Vector
This method learns the vocabulary from all of the documents and then forms a matrix by counting the number of times each word occurs in each document. If the total number of unique words is T and the number of documents is D, then the count matrix size will be T × D, as shown in Figure 3. Using the count vector method, anyone can prepare an embedding vector by choosing high-frequency words from the vocabulary list.
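As a minimal sketch of this method, the snippet below builds a count matrix with scikit-learn's CountVectorizer; the three short documents are illustrative placeholders, not the corpus used in this paper, and note that scikit-learn returns the matrix as D × T, the transpose of the T × D layout described above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three short illustrative documents; the real input would be the
# pre-processed Bengali text files described later in the paper.
docs = [
    "the team won the match",
    "the match was cancelled",
    "the team played well today",
]

vectorizer = CountVectorizer()                 # learns the vocabulary from all documents
count_matrix = vectorizer.fit_transform(docs)  # D x T matrix of raw term counts

print(vectorizer.get_feature_names_out())      # the T unique words
print(count_matrix.toarray())                  # counts of each word in each document
```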

TF-IDF Vector
Term Frequency-Inverse Document Frequency (TF-IDF) [24] reflects how important a word is to a specific document within a collection of documents, and it is used in information retrieval and text mining. The formal equation of TF-IDF is:

TF-IDF(t, d) = TF(t, d) × IDF(t)    (1)

where TF(t, d) = (number of times term t appears in a document d) / (number of terms in the document d) and IDF(t) = log((number of documents N) / (number of documents in which term t has appeared)). If a word appears in all the documents, then the value of IDF will be 0 and the word is probably not relevant to any particular document.
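To make equation (1) concrete, the short sketch below computes TF, IDF and their product by hand on a toy corpus; the documents and terms are illustrative only.

```python
import math

# Toy corpus of tokenised documents, used only to illustrate equation (1).
docs = [
    ["cricket", "match", "today"],
    ["football", "match", "tomorrow"],
    ["cricket", "match", "practice"],
]

def tf(term, doc):
    # (number of times term t appears in document d) / (number of terms in d)
    return doc.count(term) / len(doc)

def idf(term, docs):
    # log(N / number of documents in which term t has appeared)
    n_docs_with_term = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_docs_with_term)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tf_idf("cricket", docs[0], docs))  # in 2 of 3 documents -> positive weight
print(tf_idf("match", docs[0], docs))    # in every document -> IDF = log(1) = 0
```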

Co-occurrence Matrix
The co-occurrence matrix is used to identify semantic relationships between words through matrix factorization, which is a common approach for word embedding and can be solved efficiently. This matrix is computed by counting how two or more words occur together in a given corpus.
The matrix entry counts neighbouring words: it represents how many times a word follows the current word [25].
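A minimal sketch of building such a matrix is shown below, assuming a window of one following word; the tokens and the window size are illustrative choices, not taken from the paper's corpus.

```python
from collections import defaultdict

# Toy token sequence; in practice this would be the pre-processed Bengali corpus.
tokens = ["the", "team", "won", "the", "match", "the", "team", "played"]

window = 1  # assumption: count only the word immediately following the current word
cooc = defaultdict(lambda: defaultdict(int))

for i, word in enumerate(tokens):
    for j in range(i + 1, min(i + 1 + window, len(tokens))):
        cooc[word][tokens[j]] += 1   # how many times tokens[j] follows word

print(dict(cooc["the"]))  # {'team': 2, 'match': 1}
```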

Continuous Bag of Words (CBOW)
CBOW is used to predict a word by learning from its context, which is very effective for finding a missing word in a corpus. The context words become the neural network input layer and the missing word is predicted at the output layer. The error between the predicted output and the actual missing word is used to readjust the weights. The architecture of CBOW is shown in Figure 4(a), where several hidden layers can be used between the input and output layers in order to maximize the conditional probability of the actual output word given the input words.

Skip Gram Model
This model is the complete reverse of the continuous bag of words model: it is used to predict the target context by learning from individual words. A word becomes the neural network input layer and the context of that word is predicted at the output layer. The error between the predicted context and the actual context is used to readjust the weights. The architecture of Skip-Gram is shown in Figure 4(b).
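To illustrate what the Skip-Gram model is actually trained on, the sketch below enumerates the (input word, context word) pairs produced by a sliding window; the toy sentence and the window size of 2 are illustrative assumptions.

```python
# Enumerate the (input word, context word) training pairs for Skip-Gram.
sentence = ["the", "team", "won", "the", "match"]
window = 2   # context taken from two positions on each side of the input word

pairs = []
for i, word in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((word, sentence[j]))   # input word -> one of its context words

print(pairs[:5])
# [('the', 'team'), ('the', 'won'), ('team', 'the'), ('team', 'won'), ('team', 'the')]
```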

FastText
The fastText [20] pretrained embedding vectors are a large collection of word vectors with four dimension sizes: 50, 100, 200 and 300. FastText uses the Skip-Gram and CBOW models for generating the word matrix. In the word-to-vector generation process, fastText sets the minimum frequency count to 2, which means that if an important word appears only once in a whole document it is automatically discarded from the vector generation process.
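For completeness, a sketch of loading these pretrained Bengali vectors with gensim is shown below; the file name cc.bn.300.vec (the text-format download from https://fasttext.cc/docs/en/crawl-vectors.html) and the gensim 4 attribute names are assumptions about the local setup, not necessarily the loading procedure used in this paper.

```python
# Sketch: loading the pretrained fastText Bengali vectors with gensim, assuming
# the text-format file cc.bn.300.vec has already been downloaded and unpacked.
from gensim.models import KeyedVectors

bn_vectors = KeyedVectors.load_word2vec_format("cc.bn.300.vec", binary=False)

print(bn_vectors.vector_size)        # 300 dimensions per word
print(len(bn_vectors.key_to_index))  # number of Bengali words covered by fastText
# bn_vectors["<bengali word>"] would return the 300-dimensional numpy vector
```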

Performance Analysis and Discussion
For experimental purposes, a sports-related Bengali dataset collected from an open-source free Bengali dataset corpus [26] is used, which contains a total of 12,086 text files, and each file contains more than 6,500 words. Thus, almost 78,500,000 Bengali words were processed to create the word embedding vectors, which act as the first input layer of the deep learning neural network. Most Bengali researchers use count vectors to count word frequencies for their research, but the limitation is the huge memory size and the time consumed when processing Bengali text. We used the Python Anaconda 3.6 machine learning platform and the Jupyter Notebook tool for this implementation.

Pre-processing
Before implementing a word embedding method, we need to preprocess our corpus, because the collected data is not suitable for processing: it contains a huge number of punctuation marks, inflected word forms, stop words, etc. Preprocessing, as shown in Figure 5, is the most important step during implementation. First, we have to clean the data by removing punctuation and stop words and by stemming.

Tokenization and Punctuation Removal
Tokenization means breaking up a given sentence into smaller meaningful units. Each unit is called a token, which may be a number, a punctuation mark or a word. Words are identified based on spaces, and unnecessary items attached to words, such as punctuation marks, hash tags, emojis, emoticons, etc., are removed.
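A minimal sketch of this step is given below; the regular expression and the example sentence are illustrative, not the exact cleaning rules used in this paper.

```python
import re

# Sketch: whitespace tokenisation followed by stripping punctuation marks, hash
# tags, emoji and similar symbols from each token.  The regular expression keeps
# only word characters (which in Python 3 also covers Bengali letters).
def tokenize(sentence):
    tokens = sentence.split()                # words are identified by spaces
    cleaned = []
    for token in tokens:
        token = re.sub(r"[^\w]", "", token)  # drop punctuation, '#', emoji, emoticons
        if token:                            # skip tokens that became empty
            cleaned.append(token)
    return cleaned

print(tokenize("You are still talking riddles, the real work has not started yet!"))
# ['You', 'are', 'still', 'talking', 'riddles', 'the', 'real', 'work', 'has', 'not', 'started', 'yet']
```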

Stop Word Removal
Every text contains some unimportant words, known as stop words. These words have no importance when processing documents. In the English language, stop words are "the", "a", "an", "of", "my", etc. Similarly, in the Bengali language, stop words include "এই", "তাই", "অথএফ", "অথচ", etc., as shown in Table 1. These stop words were identified after analysing huge Bangla datasets over a long period. For example, consider the sentence "You are still talking riddles, the real work has not started yet", where the stop words are "you", "are", "still", "the", "has", "not". After translating the sentence into Bengali it becomes "আনি কিন্ তু এখন েঁ য়ারি কয কথা ফরছ ন আর কাজটি এখনো ু যু কয ন নাই", where the stop words are different from those of English. After removing the stop words, the token list becomes [ েঁ য়ারি, কথা, ফরছ ন, আর, কাজটি, ু যু , নাই], which contains the most meaningful words of this sentence.
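The filtering itself is a simple membership test against the stop-word list, as in the sketch below; the stop-word set shown is an English placeholder standing in for the Bengali list of Table 1, used only so the example stays self-contained.

```python
# Sketch: removing stop words from a token list.  STOP_WORDS is a placeholder
# for the Bengali stop-word list summarised in Table 1.
STOP_WORDS = {"you", "are", "still", "the", "has", "not", "yet"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["You", "are", "still", "talking", "riddles", "the", "real",
          "work", "has", "not", "started", "yet"]
print(remove_stop_words(tokens))
# ['talking', 'riddles', 'real', 'work', 'started']
```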

Stemming
The process of reducing the variations of a word to its root is called stemming. There can be different forms of a word based on the context in which it is used. For example, for the words "কযা", "কযছি", "কযছিরাভ", "কযছির ", "কয ছ", "কয ছি", etc., "কয" is the root word. The Python Regular Expression library was used for reducing these variations.
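A hedged sketch of such a rule-based reduction with the re module is shown below; the suffix list is a romanised, hypothetical placeholder rather than the actual Bengali suffix rules used in this paper.

```python
import re

# Sketch: a small rule-based stemmer that strips verb suffixes with the re
# module.  The suffix list is a hypothetical, romanised placeholder; real rules
# would be built from Bengali suffix patterns like those shown above.
SUFFIXES = ["chilam", "chile", "chi", "lam", "bo"]

def stem(word):
    for suffix in sorted(SUFFIXES, key=len, reverse=True):   # try longest suffixes first
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return re.sub(suffix + "$", "", word)             # cut the suffix off the end
    return word

print(stem("korchilam"))  # -> 'kor' (root form), using the romanised placeholder rules
```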

TF-IDF word embedding method
The TF-IDF word embedding method was discussed in Section 3, so only the implementation process is discussed here. The Scikit-learn library in Python provides a TfidfVectorizer class to create TF-IDF word embedding vectors. First, TfidfVectorizer is imported from sklearn.feature_extraction.text, then fit and transform are called to calculate the TF-IDF scores for the text. Finally, the top 5 words are printed, as shown in Table 2, with their weights across the given documents. The TF-IDF operation produces the following results:
- Total unique words: 5,430
- Highest word weight: 0.67518
- Lowest word weight: 0.00451
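A minimal sketch of that procedure is shown below; the placeholder documents and the way the top words are ranked are illustrative assumptions rather than the exact script behind Table 2.

```python
# Sketch of the implementation steps described above; two English placeholder
# documents stand in for the pre-processed Bengali corpus so the snippet runs
# on its own.
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "the team won the cricket match",
    "the football match was cancelled",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)   # fit the vocabulary and compute TF-IDF scores

words = vectorizer.get_feature_names_out()
weights = tfidf_matrix.toarray().max(axis=0)          # best score of each word in any document

print("Total unique words:", len(words))

# Top 5 words by weight, analogous to Table 2.
for word, weight in sorted(zip(words, weights), key=lambda x: x[1], reverse=True)[:5]:
    print(word, round(float(weight), 5))
```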

Local Word Embedding using Skip Gram
A Word2Vec embedding can be generated using either the Continuous Bag-of-Words or the Skip-Gram model, as discussed in Section 3. In this paper, the Skip-Gram model with negative sampling was chosen for local word vector generation and implemented in Python using NumPy, and the Keras Python framework was then used to implement the deep learning neural network for NLP processing. After pre-processing, a clean dataset is obtained; then the values of some hyperparameters such as the learning rate, number of epochs, embedding size, window size, etc. are set, the training data is generated by building the vocabulary, and dictionaries that map words to ids and vice versa are built. The Skip-Gram model is then trained on the vocabulary by forward propagation and backpropagation. Finally, the word vectors and their similar words in the word embedding are obtained. The full process is shown in Figure 6, and a sketch of the pipeline is given below.

Table 3 shows the data demography used to check the performance of the proposed local word embedding against the pretrained fastText word embedding. The whole dataset was pre-processed and trimmed before being sent to the word vector generation process. The complete dataset was categorized into 5 different features, and the proposed local word embedding model was applied separately to each feature, as shown in Table 4. For training and testing purposes, 80% of the dataset was used to train the model and the remaining 20% was used to evaluate the model using the Random Forest [27] machine learning classifier.
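As a hedged sketch of this pipeline, the snippet below builds a small Skip-Gram model with negative sampling in Keras and recovers the embedding matrix; the hyperparameter values and the placeholder corpus_ids are illustrative assumptions, not the exact configuration of Figure 6.

```python
import numpy as np
from tensorflow.keras.layers import Dense, Dot, Embedding, Flatten, Input
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.sequence import skipgrams

# Hyperparameters (illustrative values, not the ones tuned in this paper).
vocab_size = 5000     # vocabulary size after pre-processing
embed_dim = 100       # embedding size
window_size = 2

# Two embedding lookups whose dot product scores a (target, context) pair.
target_in = Input(shape=(1,))
context_in = Input(shape=(1,))
target_emb = Embedding(vocab_size, embed_dim, name="word_embedding")(target_in)
context_emb = Embedding(vocab_size, embed_dim)(context_in)
score = Flatten()(Dot(axes=-1)([target_emb, context_emb]))
output = Dense(1, activation="sigmoid")(score)   # 1 = real context pair, 0 = negative sample

model = Model([target_in, context_in], output)
model.compile(optimizer="adam", loss="binary_crossentropy")

# Training pairs with negative samples; corpus_ids is a placeholder for the
# Bengali corpus already encoded as word-id sequences.
corpus_ids = [[1, 2, 3, 4, 2, 5], [3, 1, 4, 2]]
pairs, labels = [], []
for sentence in corpus_ids:
    p, l = skipgrams(sentence, vocabulary_size=vocab_size,
                     window_size=window_size, negative_samples=1.0)
    pairs += p
    labels += l

pairs = np.array(pairs)
labels = np.array(labels).reshape(-1, 1)
model.fit([pairs[:, 0:1], pairs[:, 1:2]], labels, epochs=5, batch_size=32, verbose=0)

# The learned word vectors live in the named embedding layer.
word_vectors = model.get_layer("word_embedding").get_weights()[0]   # (vocab_size, embed_dim)
print(word_vectors.shape)
```

The resulting word_vectors matrix plays the role of the local embedding that is subsequently fed, together with the 80/20 split, to the Random Forest classifier described above.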

Experimental Evaluation
The model has been trained on five separate features, F1 to F5, as shown in Table 4. A word vector equation (2) was evaluated, and the probable result is shown in Table 5: both the proposed word embedding and the fastText word embedding produce the closest neighbours to this equation, and the top 5 closest neighbours are shown in Table 5 with their probability scores. The local word embedding model produces probabilities that are semantically related to the other words, and its accuracy is better than that of the pretrained fastText word vectors.

The performance results of the proposed and fastText word embeddings are shown in Table 6. The local word embedding model produces different scores for the different features, and the F1 feature produces better performance than the other four features. The proposed model achieves a maximum accuracy score of 87.84% for feature F1. The fastText pretrained word embedding model produces 86.75% accuracy for feature F1, which is lower than the proposed word embedding model. Figure 8 shows the graphical representation of the proposed and fastText models. Figure 9 shows the F1 scores of the local word embedding model and the fastText pretrained word embedding model. The first and third features of both models produce higher F1 scores, and the proposed model's F1 score is higher than that of the fastText model.

The detailed prediction results are shown in Table 7 and their graphical representation in Figure 10. The positive prediction of the proposed model is 48.21% of the total test dataset, which means it can correctly identify 48.21% of the dataset as the positive class out of the 53% of the dataset that is positive. Only 5.05% of the dataset was actually positive but was predicted by the proposed model as the negative class. For the negative class, the proposed model correctly predicts 39.33% out of the 47% of the total dataset that is negative. Compared to fastText, our proposed model has higher true positive and true negative scores than the fastText model, while the false positive and false negative scores of the fastText model are higher than those of the proposed local word embedding model. The fastText word vectors are generic and produce good results in most cases, but in Bengali language processing a local word vector can be built based on requirements and shows better performance than other pretrained word vectors.

Conclusion
This research explores different word embedding methods that can be used to create local word embedding vectors for Bengali language processing, which is very important for deep learning neural networks. We applied the two most useful word embedding methods for generating embedding vectors: the TF-IDF method, which counts word frequencies and is used for classification and clustering in NLP sentiment analysis, and the Word2Vec method, which predicts words and is used for regression or prediction analysis. After generating the word embedding vectors, we fed these vectors to the deep learning neural network as the input layer. We then used several hidden layers as well as CNN layers to successfully train the machine on this word-embedded dataset. Finally, we compared the proposed local word embedding model with the fastText pretrained word embedding model. The experimental results show that the proposed model's accuracy is 87.84% whereas the fastText model's accuracy is 86.75% for Bengali language processing. In the future, we plan to identify all stop words during pre-processing and to consider misspellings and out-of-vocabulary words when calculating their word vectors for Bangla language processing.