Comprehensive Processing for Arabic Texts to Extract Their Roots

Arabic is a highly inflectional language in which a single root can yield many word forms with different interpretations. The inflection arises from prefixes, suffixes and infix vowels that combine through complex processes, and Arabic has no standard way of finding roots. Words therefore require careful processing in information retrieval solutions, and until now there has been no standard approach that attains the fully proper root. Studies of Arabic words show that around 99% are derived from bilateral, trilateral and quadrilateral roots. Stemming a word in order to extract its root is the process of removing all additional affixes. Where a word can be matched against a list of proper names, its affixes are removed according to patterns and rules, with reference to root dictionaries. This research proposes a new series of steps that handles affixes, vowels and patterns through three stages of stemming. If a match is not found, vowels are replaced and patterns are readjusted for a further check; if there is still no match, the word is kept unmodified. Search engines, indexing, file classification, clustering and similar applications depend on better root extraction, and this research introduces recommendations and solutions that contribute to improving Arabic root extraction. The comprehensive processing is applied gradually to a general collection of documents and reaches a root extraction accuracy of up to 96%.


Introduction
Searching in the world of technology demands considerable effort to make Information Retrieval (IR) succeed. In Text Mining (TM) on Arabic, several works have investigated the effect of using stems or roots instead of words on Text Classification (TC) performance. Compared with other languages, the technology for Arabic text is weak, and we seek to develop and improve it. This research develops the technology of Arabic root extraction. Automatic extraction of roots from Arabic texts is important, alongside other techniques, for the following reasons:
1. Most research on Arabic text has dealt with trilateral [1] and quadrilateral [2] roots, but not with all levels of bilateral, trilateral, quadrilateral, five-letter and six-letter roots.
2. Root extraction gives valuable support to many natural language processing applications such as information retrieval. A stemmer eliminates the longest suffix and the longest prefix and then matches [3] the remaining word against verbal and noun patterns.
3. Such corpora provide researchers with resources for different computational tasks in many fields of informatics and linguistics, such as root dictionaries, information retrieval [4], and more.
4. Document categorization: using the Arabic root extraction algorithm to obtain the relevant root helps in classification.
5. Studying this singularity in terms of estimating associative relationships between roots and all their potential occurrences with multiple morpho-phonetic patterns [5].
6. Improving the performance of search engines over the entire corpus, together with a post-retrieval document browsing technique.
The number of documents that represent the accumulated knowledge of organizations is also growing quickly, and efficient access to these documents is becoming vital. This requires new developments in technology fields such as:
a. Searching: querying and ranking of search results.
b. Optimizing information representation and storage [6].
c. Classification and clustering of documents [7].
Search engines (Google, Yahoo …) are the common gateways to huge collections of electronic text and relevant information. Compared with other languages, Arabic needs considerably more effort in information retrieval to catch up with better-supported languages.

RELATED WORK AND BACKGROUND
This research aims to improve Arabic root extraction, a process that can be carried out through various steps. However, affixation in Arabic does not follow regular rules, which makes its morphological analysis hard. The processing must therefore be organized around the three types of affixes: prefixes are the extra letters added at the beginning of the word, infixes are the extra letters added in the middle of the word, and suffixes are the extra letters added at the end of the word.

Background
Arabic is described as an algebraic language, which makes its morphological analysis very difficult [1,8,9]. It is considered a very challenging language because of its complex linguistic structure and its highly derivational nature, in which morphology plays a very important role.
Arabic words are divided into two types, nouns and verbs, which are derived from a closed set of words generated from a set of roots. Modern Arabic uses around 7,000 roots, and traditional (past) Arabic does not exceed 10,000 roots. The Arabic language can build 168,924 roots.
The root levels depend on the number of letters in each word and are distributed approximately as follows:
The problem addressed by this study stems from the importance of scientific technologies that seek fundamental solutions to support the technical development of the Arabic language. Applying morphological inflection alone does not handle all affixes, and the affixes are not classified and ordered as appendages, which means that tens of words in a selected text will not find their way to a root. In addition, diacritical marks play a large role in distinguishing word meaning, and extraction may yield a different root depending on them.

Related Works:
Although considerable research on root-based methods, stemming stages and morphological analysis [1] has accumulated for the Arabic language, no standard root extraction algorithm has yet emerged.
Mohammed Aljlayl [8] observes that many variant word senses are based on an identical root; thus, a root-based algorithm creates invalid conflation classes that result in an ambiguous query, which degrades performance by adding extraneous terms. The automatic rule-based stemming algorithm is not as aggressive as the root extraction algorithm.
Larkey [1] identifies four different approaches to Arabic stemming: manually constructed dictionaries, algorithmic light stemming, which removes prefixes and suffixes, morphological analysis, which attempts to find roots, and statistical stemmers, which group word variants. Khoja [9] developed an algorithm that removes prefixes and suffixes, checking all the time that it is not removing part of the root, and then matches the remaining word against patterns of the same length to extract the root; however, it does not take most of Arabic morphology into account.
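The pattern-matching step described for Khoja's approach can be illustrated with a minimal Python sketch. It assumes that patterns are written with the conventional placeholder letters ف, ع and ل marking the root positions, and that every other pattern letter must match the word literally. The function name match_pattern and the sample patterns are illustrative assumptions and are not taken from [9].

```python
# Placeholder letters that mark root positions in a pattern (assumption:
# the conventional f-'-l template letters are used for this purpose).
ROOT_SLOTS = set("فعل")


def match_pattern(word, pattern):
    """Return the letters of `word` that sit on root slots of `pattern`.

    Returns None when the lengths differ or a literal pattern letter
    does not match the corresponding letter of the word.
    """
    if len(word) != len(pattern):
        return None
    root = []
    for w, p in zip(word, pattern):
        if p in ROOT_SLOTS:
            root.append(w)      # root position: keep the word's letter
        elif w != p:
            return None         # literal position: must match exactly
    return "".join(root)


# Hypothetical usage with two common patterns.
print(match_pattern("كاتب", "فاعل"))
print(match_pattern("مكتوب", "مفعول"))
```

A limitation of this sketch is that a pattern cannot contain a literal ل, because that letter is reserved as a root slot; a fuller implementation would need a separate slot notation.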
The algorithm of Al-Fedaghi and Al-Anzi [10] tries to find the root of a word by matching the word against all possible patterns with all possible affixes attached to it. Anne N. De Roeck and Waleed Al-Fares [11] present a morphologically sensitive clustering algorithm for identifying Arabic roots, which clusters Arabic words into categories so that they can be ranked to find the appropriate root.
The algorithm of Beesley [12] removes the longest possible prefix and then extracts the root by checking the first five letters of the word. It is based on the assumption that the root must appear in the first five letters of the word.
Abderahim [13] presents a system composed of two modules. The first performs an out-of-context analysis that segments each word of the sentence into its elementary morphological units to obtain the root. The second uses the context to identify the correct root among all the possible roots of the word, which gives good results.
Dilekh [14] proposes removing the prefixes before the suffixes to reach the stem and then applying the patterns list, without pattern matching. This method does not always give better results.

Text Collection and Function Root List

Document Collection
To support and test the work, 350 Arabic texts (about 33,800 words) were collected, arbitrarily chosen from various online Arabic newspapers, magazines, academic and other sources published online, across ten subject categories: Politics, Economics, Human Rights, Sports, Science, Health, Arts, Agriculture, Law, and Religion. No restriction is placed on the words or texts; the algorithm applies generally to any document or collection.

Root lists
The root lists cover more than 99% of the root types, from two to six letters. They are divided by root length as follows:
Two-letter roots (bilateral). The Shada mark ( ّ ) means that the letter is repeated.
Three-letter roots (trilateral), around 3/4 of the roots.
Four-letter roots (quadrilateral).
Five-letter roots (pentalateral).
Six-letter roots (hexalateral).
These root classifications make it easier to remove affixes and to match the exact root from the lists, which helps to improve the percentage of proper roots extracted.
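One way to organize root lists of this kind is to index them by letter count, so that a candidate is only compared against roots of its own length. The Python sketch below assumes a small, hypothetical set of sample roots; the names build_root_index and is_root are illustrative and the paper's actual root lists are not reproduced.

```python
from collections import defaultdict


def build_root_index(roots):
    """Group roots by letter count (two to six letters)."""
    index = defaultdict(set)
    for root in roots:
        if 2 <= len(root) <= 6:
            index[len(root)].add(root)
    return index


def is_root(candidate, index):
    """Check a candidate only against the list for its own length."""
    return candidate in index.get(len(candidate), set())


# Hypothetical sample roots, chosen only to illustrate the lookup.
SAMPLE_ROOTS = ["مد", "درس", "كتب", "ترجم"]
index = build_root_index(SAMPLE_ROOTS)

print(is_root("درس", index))     # a three-letter entry in the sample list
print(is_root("مدرسة", index))   # a derived word, not in the sample list
```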

Root Extractor Approach
Belal Iraqi Journal of Science, 2019, Vol. 60, No. 6, pp: 1404-1411
Arabic contains three genders (much like English): masculine, feminine and neuter. It also contains three persons: one to describe the speaker, one to describe the person being addressed and one to describe the person who is not present. Arabic [9] differs from Indo-European languages in that it has three numbers instead of the more common two. As well as singular and plural, there is also the dual, which is used for describing the actions of two people. We have based our Arabic tag set on this system of language teaching. As such, we distinguish between the three moods of the verb (indicative, subjunctive, and jussive).

Solid Letters (SL)
Solid letters are the essential letters of a word and are never removed from it. If a solid letter appears as the first letter, the word has no prefix, and if it appears as the last letter, the word has no suffix.
Distinguishing these letters allows the exact root to be extracted directly and without doubt. Our algorithm begins by removing the light prefix (LP), then all suffixes, then the remaining prefixes, and at each step it tries to match against the suggested solid letters (SL). There are 16 solid letters; they are never affixes, which means that they are essential letters of the root. The solid letters are listed in Tables-(1, 2). The LP needs rules to exclude it: if the word is longer than six letters, directly remove the suggested letters ( ‫ي،‬ ،ٚ ‫ي‬ ،ْ ،َ ‫خ،‬ ‫ا،‬ ‫ب،‬ ‫.)ن،‬ This rule also depends on the length of the prefix, from one to four letters.
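The solid-letter rule can be sketched in Python as a simple check on the first and last letters of a word. The set of 16 letters below is a placeholder assumed for illustration (letters that are not commonly used as affixes); the paper's own tables are not reproduced here and may differ, and the name affix_hints is hypothetical.

```python
# Placeholder set of 16 letters, assumed for illustration only.
HYPOTHETICAL_SOLID_LETTERS = set("ثجحخدذرزشصضطظعغق")


def affix_hints(word, solid_letters=HYPOTHETICAL_SOLID_LETTERS):
    """Return (may_have_prefix, may_have_suffix) for a word.

    A solid first letter rules out a prefix; a solid last letter
    rules out a suffix.
    """
    may_have_prefix = word[0] not in solid_letters
    may_have_suffix = word[-1] not in solid_letters
    return may_have_prefix, may_have_suffix


print(affix_hints("خروج"))    # solid first and last letters
print(affix_hints("يدرسون"))  # neither end is solid
```

A stemmer could use these hints to skip the prefix or suffix stage entirely for words whose first or last letter is solid.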

The Approach
Until now, there has been no standard algorithm that extracts all the roots from a text. The potential here is to enhance the root extraction algorithm and make it more accurate. The research builds on the rule-based root extraction approach and on morphological analysis to extract the root of a given Arabic word, using stem levels and the most popular patterns.
The strength of this research is that it applies to general document collections. It develops previously proposed algorithms, runs experiments on them, and improves some of their parts by constructing a computer program that processes Arabic texts. The letters are rearranged into solid, prefix, infix and suffix letters, and the processing of any word depends on the place of the affix. Affix removal and/or pattern application is executed as follows (each step is compared with the root dictionary):
1- Remove non-Arabic letters and one-letter words from the text [1,15].
2- Apply the three levels of stemming, which remove affixes to reach the root:
a) Stem1: removing light prefixes (LP).
b) Stem2: removing all suffixes, then the remaining prefixes.
c) Stem3: removing infixes.
3- Compare the patterns [5,9] with the word.
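The steps above can be sketched in Python as a staged pipeline that consults the root dictionary after each stage. This is a simplified illustration under stated assumptions: the affix lists, the root dictionary and all function names are hypothetical, each stage removes at most one affix, the infix stage crudely drops every vowel letter, and the pattern comparison of step 3 is omitted.

```python
import re

# Hypothetical resources for illustration; the paper's tables and root
# dictionary are not reproduced here.
ROOTS = {"درس", "كتب", "علم"}
LIGHT_PREFIXES = ["وال", "بال", "فال", "كال", "ال"]  # longest first
SUFFIXES = ["ون", "ات", "ين", "ة"]
PREFIXES = ["ي", "ت", "م"]
INFIX_LETTERS = set("اوي")


def tokenize(text):
    """Step 1: keep runs of basic Arabic letters, drop one-letter tokens."""
    return [t for t in re.findall(r"[\u0621-\u064A]+", text) if len(t) > 1]


def strip_prefix(word, prefixes):
    """Remove the first listed prefix that leaves at least two letters."""
    for p in prefixes:
        if word.startswith(p) and len(word) - len(p) >= 2:
            return word[len(p):]
    return word


def strip_suffix(word, suffixes):
    """Remove the first listed suffix that leaves at least two letters."""
    for s in suffixes:
        if word.endswith(s) and len(word) - len(s) >= 2:
            return word[:-len(s)]
    return word


def extract_root(word, roots=ROOTS):
    """Run Stem1 to Stem3, checking the dictionary after each stage."""
    stages = [
        lambda w: strip_prefix(w, LIGHT_PREFIXES),                    # Stem1
        lambda w: strip_prefix(strip_suffix(w, SUFFIXES), PREFIXES),  # Stem2
        lambda w: "".join(c for c in w if c not in INFIX_LETTERS),    # Stem3
    ]
    if word in roots:
        return word
    for stage in stages:
        word = stage(word)
        if word in roots:
            return word
    return None  # no match: the caller keeps the original word unmodified


for token in tokenize("والمدرسون و test 123 الكاتب"):
    print(token, extract_root(token))
```

Checking the dictionary after every stage means the pipeline stops as soon as a known root appears, so later and more aggressive stages only run on words that still have no match.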

Experimental Results:
Root extraction from a general collection is the orientation on which the algorithm is built. Arabic is a highly inflectional language in which predefined morphological patterns are used to find the root.
The root is necessary in IR: it allows terms that have similar meanings and small differences in morphological form to be combined under a single root, and it therefore improves the quality of retrieval. The algorithm is effective at handling affixes by place. The procedure starts by removing most prefixes while applying SL, and then handles suffixes before the remaining prefixes during the stemming process. The third stage of stemming extracts the root by removing the infixes and may use replacement between vowels.
Many root algorithms have been developed for a wide range of root extraction tasks. The aim of our experiments is to evaluate different methods of obtaining the proper root. A series of experiments was conducted to show the effect of each stemming stage, morphological inflection, replacement of infix vowels and, finally, solid letters on obtaining the proper root from the word. Solid letters (SL) help to find the root directly. Non-SL letters are removed with reference to the root dictionary and to the rules for excluding letters from the root. For example, the word"ُ‫"ذراح‬ the letters (َ ، ‫ح‬ ، ‫ر‬ ) are solid and give the root ‫رحُ)‬ ), ‫"عٕٛاْ"‬ the letters "ْ ، ْ ، ‫ع‬ " are solid and give the root ‫,)عٕٓ(‬ ‫"يرذارضْٛ"‬ the letters " ، ‫د‬ ‫ش‬ ، ‫ر‬ " are solid and give the root ‫,)درش(‬ ‫"اٌعّاٌمح"‬ the letters ‫ق"‬ ‫ي،‬ ، َ ، ‫ع‬ " are solid and give the root ‫.)عٍّك(‬ The SL algorithm gives a direct solution for extracting the proper root: the solid letters are determined from Tables-(1, 2), depending on the letter positions given in Tables-(3, 4).
Replacement between the vowels ( ‫ي‬ ، ٚ ، ‫)ا‬ plays a role in finding the root, because a vowel may be an essential letter of the root. If the word is not a root and contains a vowel, the algorithm applies replacement to obtain the proper root. The replacement priority [16] when comparing with the root dictionary is first to try replacing the letter "ٚ" with ‫ي"‬ ، ‫ا‬ " in any position of the suggested root. If that fails, the next priority is the first and last letters of the root, where " ‫ا‬ " is replaced with ‫ي"‬ ، ٚ " and the result is compared with the root dictionary; for the second letter, it is better to try replacing " ‫ي‬ " with " ٚ ، ‫ا‬ ". In addition, replace ‫يٛ"‬ ، ‫"اٚ‬ or ‫ٚي"‬ ، ‫"ٚا‬ by "ٚ" if it is not at the start of the word or if the word is longer than four letters.
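The idea of vowel replacement can be sketched in Python as generating variants of a candidate in which one weak letter (ا, و or ي) is swapped for another, and accepting the first variant found in the root dictionary. This sketch does not implement the replacement priorities described above; it tries positions from left to right. The function names and the sample roots are assumptions made for illustration.

```python
WEAK_LETTERS = "اوي"


def vowel_variants(candidate):
    """Yield forms with exactly one weak letter swapped for another."""
    for i, ch in enumerate(candidate):
        if ch in WEAK_LETTERS:
            for repl in WEAK_LETTERS:
                if repl != ch:
                    yield candidate[:i] + repl + candidate[i + 1:]


def match_with_replacement(candidate, roots):
    """Return the candidate, or its first variant, found in `roots`."""
    if candidate in roots:
        return candidate
    for variant in vowel_variants(candidate):
        if variant in roots:
            return variant
    return None  # no variant matches: the word is left unmodified


# Hypothetical root dictionary for the example.
SAMPLE_ROOTS = {"قول", "بيع"}
print(match_with_replacement("قال", SAMPLE_ROOTS))
print(match_with_replacement("باع", SAMPLE_ROOTS))
```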
There are many words for which the algorithm does not give the proper root, giving more than one root or a wrong root. The problem is that the same word can give different roots, and vice versa. For example, for " ‫ي‬ ‫عذ‬ " the root may be ‫عٛد"‬ " meaning back and music machine, ‫,"عذد"‬ meaning number and repair, or ‫"ٚعذ"‬ meaning promise.
The Shada( ّ) above a letter means an extra letter, that is, "repeat the letter", but most texts, especially electronic texts, do not use the Shada. A bilateral word needs an added letter to become a root. Hundreds of roots need this processing: the word is reduced to its bilateral form and then an extra letter is added. For example, ‫"يحثُٙ"‬ back to " ‫"حة‬ then add extra second letter become ‫,"حثة"‬ ‫"االِرذاد"‬ back to " ‫"ِذ‬ then ‫."ِذد"‬ The algorithm processes nouns with a suggested lookup table. The table contains names, cities, places, things and so on. Generally, noun affixes occur at the beginning (prefixes) and the end (suffixes) of nouns. The algorithm can only remove LP ( ‫واي‬ ‫ٚاي،‬ ‫فاي،‬ ‫تاي،‬ ‫اي،‬ ) and suffixes such as ( ‫يح،‬ ‫ي،‬ ‫)ج‬ to exclude the noun from the list.
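The bilateral case described above, where a two-letter stem is completed by repeating its second letter because the Shada is usually not written, can be sketched in Python as follows. The function name, the sample roots and the sample stems are assumptions chosen for illustration.

```python
def expand_bilateral(stem, roots):
    """Repeat the second letter of a two-letter stem and look it up.

    Returns the three-letter form when it is in `roots`, otherwise None.
    """
    if len(stem) != 2:
        return None
    candidate = stem + stem[-1]  # the unwritten Shada doubles this letter
    return candidate if candidate in roots else None


# Hypothetical root dictionary for the example.
SAMPLE_ROOTS = {"مدد", "حبب"}
print(expand_bilateral("مد", SAMPLE_ROOTS))
print(expand_bilateral("حب", SAMPLE_ROOTS))
```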
All the previous suggestions can be compared, word by word, with other algorithms, Khoja [9] and Ghwanmeh [2], as shown in Table-6 and Table-7. Even with all the previous suggestions, a word sometimes does not lead to the proper root. One or more of the following reasons can mislead the extraction of the exact root:
1. The huge number of rules that may be needed to find the root.
2. Diacritics on one word give different roots.
3. Affix complexity.
4. The conflict of anomaly rules.
5. One root may represent different meanings.

Conclusion
Many root algorithms can be employed in Arabic text pre-processing to obtain roots. The complexity lies in the nature of the Arabic language. Our algorithm improves exact root extraction by: applying all root levels (two to six letters); changing the methods of excluding affixes and of replacement between vowels; using solid letters (SL) for direct extraction; and removing LP, suffixes, prefixes and then infixes, in that order, which gives better results.
Tables-(1, 2, 3, 4, 5) are indicators for words that have uniform affixes. Patterns are used if the root is not extracted by table lookup. Expert judgment was used to evaluate the results. The algorithm successfully extracts the proper root with an accuracy rate of up to 96%. The strength of the algorithm is its use of new root extraction methods. The research shows that it is difficult to extract the fully appropriate root for a general collection. The reasons are the huge number of rules, diacritics that make one word give different roots, affix complexity and the conflict of anomaly rules. Future work needs more help from specialist experts in the Arabic language to improve the percentage and accuracy of appropriate roots extracted from words.

Corpus size and Computer program
The Arabic collection of 350 documents (about 33,800 words) was arbitrarily chosen from various online Arabic newspapers. Our proposed stemming algorithm is implemented in the Visual Basic programming language. The system accepts a text file that contains the Arabic words and produces the roots of those words. Examples of the system's output results are shown in Table-