Automatic Numeral Recognition System Using Local Statistical and Geometrical Features

Optical Character Recognition (OCR) research spans computer vision, artificial intelligence, and pattern recognition. Character recognition has attracted considerable attention in the last decade because of its wide range of applications, including multiple-choice test data, business documents (e.g., ID cards, bank notes, and passports), and automatic number plate recognition. This paper introduces an automatic recognition system for printed numerals. The system is based on extracting local statistical and geometrical features from the text image. These features form a vector of eight values extracted from each digit: two local statistical features (A, A_th) and six local geometrical features (P_1, P_2, P_3, P_4, P_5, and P_6). The resulting database consists of 1120 statistical and geometrical features. For recognition, the features of the test image are compared with the features of all images saved in the database using the Minimum Distance (MD). All digits (0-9) were identified with 100% accuracy. The average computational time required to recognize a numeral at any font size is 0.06879 seconds.


Introduction
Recognition of printed numerals is an active research field that accompanies technological development because of its applications in administrative and digital governance tasks, such as currency and passport recognition, reading of electricity and water bills, zip codes, commercial bar codes, and bank statements [1]. Documents containing printed numerals or characters are scanned, and the OCR engine then reads the images and converts them into ASCII data [2]. In general, character recognition techniques fall into two types: off-line and on-line recognition. The first type has two sub-divisions: (a) Optical Character Recognition (OCR) and (b) Manual Character Recognition (MCR) [3]. Various recognition tasks have been reviewed, for example handwritten numerals [4,5], mixed printed and handwritten Kannada numerals [6], Arabic writings and scripts [7], Arabic numbers [8], Vietnamese character recognition [9], and license plate recognition [10]. All of these studies aim for the best possible recognition accuracy, but their feature extraction and classification methods vary.
Researchers in artificial intelligence often seek formulas or models that classify and distinguish printed numerals accurately. Since numbers appear in most of our daily dealings, many researchers are interested in the identification of printed numbers. This work aims to design a graphical user interface that recognizes and categorizes a set of digits (0-9) based on structural features. However, a digit may fail to be recognized because of differences in font size, boldness, or font style. To address this problem, we developed a scheme for the automatic reading of numerals. The developed technique is expected to recognize digits of varying sizes. Furthermore, this study develops a framework for identifying printed numerals with a simple and somewhat novel methodology that relies primarily on careful feature extraction followed by the construction of a database.
To achieve the objective of the research, we divided the image of the numeral into six local statistical parts and extracted six features (P_1, P_2, P_3, P_4, P_5, and P_6). Two geometrical features (A and A_th), which represent the area of the digit before and after the thinning process, respectively, are also used. These features are therefore proportional to the font size of the Arabic numerals. The generated database consists of 1120 features of printed Arabic numerals in the Times New Roman font style. The creation of the database depends on the structure of the thinned digits during feature extraction. Classification relies on the minimum distance classifier.
Research on digit recognition began in the late 1980s. Since then, different recognition methods have been proposed. Several approaches have been employed to increase recognition accuracy, such as the decision tree classifier [4], the minimum distance classifier [3], and CNN architectures [11]. Previous studies have introduced and developed various techniques for recognizing printed and handwritten characters. Most of these studies identify the features of each symbol and store them in a database, which is later used to identify and classify unknown symbols. Some of these studies are summarized below.
Zhou et al. [12] proposed a system for hand-printed numeral recognition based on two algorithms: hole detection and contour concavity detection. They then applied these algorithms to a practical automatic form reading system using Transym Optical Character Recognition (TOCR V1.0). TOCR automatically reads simple characters, handwritten numbers, or printed data. They collected test data while implementing the system in China and also used the CENPARMI database, and their proposal succeeded in handwriting recognition. Dhandra et al. [13] proposed a technique using Euler numbers that recognizes English numerals in 17 different font styles and in various sizes (8-72 points). Their technique superimposes a small rectangular bounding box over an isolated digit image and crops it. They then found the exterior pixel densities in each direction (bottom, top, left, and right), estimated the ratios of these densities to the full area of the cropped numeral image, and recorded the findings in a feature vector. They tested 3200 numeral images, including 160 for training, and the classification accuracy of their method was about 99.78%.
Montazer et al. [14] proposed a holistic technique, including a neuro-fuzzy inference engine, to recognize Farsi numerals. They combined the Mamdani fuzzy rule inference engine with a multilayer perceptron neural network to learn the features of numerals in 33 different Farsi fonts. The suggested strategy was able to identify Farsi numerical characters more comprehensively, with recognition rates greater than 97%. Das et al. [15] exploited a genetic algorithm (GA) to classify the CMATERdb handwritten Bangla numeral data set (6000 images of unconstrained, isolated handwritten Bangla numerals). They presented a methodology for creating local regions of dynamically varying heights and widths. The GA is then applied to the created regions to select the optimal ones, from which a set of optimal features is extracted. They used an SVM for classification and achieved an accuracy of 97%. Hassanpour and Samadiani [16] applied SOM neural networks to recognize English numerals in 30 different font styles and in sizes ranging from 10 to 28 points. They preprocessed the data by rotating and normalizing it, split the image horizontally into two halves, and then extracted the features by sweeping the data along the rows and columns of each portion of the image. These features remain the same for a digit even across varied fonts. They used a similarity scale to define the category of data in a neural network, and the SOM neural network to recognize and classify digits. Approximately 2000 numeral images in the database were tested, and the recognition accuracy was around 99.55%.
Jasim et al. [17] presented a system to recognize Arabic handwritten numerals (0-9). Their system, called Handwritten Numeral Recognition Using Fuzzy Logic (HRUFL), covered the steps from preprocessing to recognition: converting the image to binary, denoising, thinning, isolating numerals, standardization, and feature extraction. It achieved a recognition rate of 94%. In other work, Damayanti et al. [18] addressed license plate recognition by extracting the area feature for characters (A to Z) and numerals (0 to 9). They used the K-NN method to recognize the plate based on the values of the extracted features. However, the computational time needed during recognition is about 0.6 seconds for an area of 10 × 10 pixels and 1.3 seconds for an area of 20 × 20 pixels. The real size of the image is 300 × 720 pixels, so recognizing the whole image would take longer. The accuracy of license plate recognition was about 68.57%, while the accuracy of character recognition on the license plate was approximately 92.72%. Many research domains have employed thinning as a preprocessing step, including numeral recognition [15], character recognition [19], image coding and biometrics [20], and fingerprint identification [19], where the result of thinning is used to determine the spin direction of the fingerprint, as in [20,21].
The remainder of this paper is organized as follows. Section 2 summarizes the methodology for creating the database and the numeral recognition system, which includes image acquisition, preprocessing, feature extraction, and numeral recognition. Section 3 introduces the graphical user interface design. Section 4 covers the recognition results and analysis. Finally, Section 5 presents the conclusion.

Methodology
Character recognition is a procedure that enables a reading machine or computer to understand printed or written characters in different languages. Whether they are numbers, letters, or symbols, they are converted into a form that can be handled and modified by the computer. Generally, numeral recognition techniques are divided into two basic groups: (i) methods that rely on feature extraction and (ii) OCR methods. The feature-based technique extracts geometric features, statistical features, or both, while the OCR technique calculates either the Euclidean distance or the mean square error (MSE) between the original image of the symbol and the images of symbols in the database to find the best match.
Figure 1 depicts the proposed automatic recognition system, each stage of which is explained in detail in the subsequent paragraphs. The system consists of three main phases. The first phase is preprocessing, which includes conversion to binary, segmentation, thinning, and normalization. The second phase is feature computation, and the third is numeral recognition. This methodology is able to identify the numerals in the document, distinguishing the category of each digit according to its position.

[Figure 1: Flowchart of the proposed automatic recognition system. Recoverable stage labels: image acquisition, segmentation, thinning, normalization.]

Images Acquisition
The Snipping Tool provided within the Windows Accessories environment and Microsoft Word documents were used in the image acquisition stage to obtain grayscale images of Arabic numerals (0-9) in JPEG format. The numerals are typed in a certain font size within the Word document, enclosed by a rectangle using the Snipping Tool, and then the image of the numerals is cropped and saved. Figure 2 displays some of the images that were collected. These images were acquired in the Times New Roman font. Image dimensions vary by font size: for example, 198×41, 228×46, 283×57, 390×95, 425×70, 431×111, and 626×136 pixels for font sizes 16, 22, 28, 36, 42, 50, and 72, respectively. Thus, 70 images of varied sizes were used to create the database. The test data consists of 26 test images (100 numerals). The font sizes in the test data vary, as does the number of digits contained in each image, which ranges from one to nine, as detailed in Table 1.

Table 1: Number of test images versus the number of digits contained in each image.

No. of images
No. of digits contained

Pre-Processing
Because all numeral images are grayscale and very clear, they only need to be converted into binary images in order to segment them and separate them from the background. Grayscale images are converted to binary images using a threshold, a functionality that is available in MATLAB. For segmentation, the MATLAB function regionprops is used. It separates a binary image into multiple targets, with the option of isolating each target by enclosing it in a rectangle. Naturally, the segmented numerals have different dimensions depending on their geometric shape and size. We therefore resized all rectangles (the segmented numerals) to 100 × 60 pixels. Thinning is a typical approach for identifying the geometric features of objects, such as numbers, by reducing their thickness. It is one of the morphological operations that apply to binary images. The image is thinned with the MATLAB command bwmorph(BW, operation, Inf), where Inf means that the operation is repeated until the image stops changing. Figure 4 shows the results of normalization and thinning.
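The binarization, segmentation, and resizing steps described above can be sketched in plain Python. This is only an illustrative sketch, not the authors' MATLAB code: the fixed threshold of 128, the 8-connected labelling, the nearest-neighbour resizing, and all function names are assumptions introduced here. Thinning is not included.

```python
from collections import deque

def binarize(gray, threshold=128):
    """Dark ink on light paper: pixels darker than the (assumed) threshold become 1."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]

def bounding_boxes(binary):
    """Find 8-connected foreground regions and return their bounding boxes,
    sorted left to right. Each box is (top, left, bottom, right), inclusive."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for r in range(h):
        for c in range(w):
            if binary[r][c] == 0 or seen[r][c]:
                continue
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:  # breadth-first flood fill of one region
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return sorted(boxes, key=lambda b: b[1])

def crop_and_resize(binary, box, out_h=100, out_w=60):
    """Crop one bounding box and rescale it with nearest-neighbour sampling."""
    top, left, bottom, right = box
    src_h, src_w = bottom - top + 1, right - left + 1
    return [[binary[top + (r * src_h) // out_h][left + (c * src_w) // out_w]
             for c in range(out_w)] for r in range(out_h)]

# Toy grayscale image containing two separate dark blobs.
gray = [
    [255, 255, 255, 255, 255, 255, 255],
    [255,   0, 255, 255,   0,   0, 255],
    [255,   0, 255, 255,   0,   0, 255],
    [255,   0, 255, 255, 255, 255, 255],
]
binary = binarize(gray)
boxes = bounding_boxes(binary)
digits = [crop_and_resize(binary, box) for box in boxes]
```

Sorting the boxes by their left edge keeps the digits in reading order, which matters when an image contains a multi-digit number.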

Calculation of Features
Feature vectors collect features that carry as much useful information about an image as feasible. For effective classification, we chose a set of nontraditional features to distinguish samples that have similar shapes, such as 5, 6, and 9, as well as 0. For each segmented numeral, a vector of eight features is used in this study, V_f = [A, A_th, P_1, P_2, P_3, P_4, P_5, P_6],
where A is the numeral's area, i.e., its number of pixels; A_th is the numeral's area after thinning (number of edge pixels); and P_1 to P_6 are the numbers of pixels in each of the six parts of the numeral, as shown in Figure 5.
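A minimal sketch of how such a feature vector could be computed is given below. Because Figure 5 is not reproduced here, the layout of the six parts is an assumption: the sketch uses a grid of 3 rows by 2 columns in row-major order and counts the part pixels on the binary (unthinned) image. The function name and its parameters are hypothetical.

```python
def extract_features(binary, thinned, rows=3, cols=2):
    """Return [A, A_th, P_1, ..., P_6] for one normalized digit image.

    ASSUMPTIONS (not confirmed by the source): the six parts form a
    rows x cols grid in row-major order, and the part counts are taken
    from the binary image rather than the thinned one.
    """
    h, w = len(binary), len(binary[0])
    area = sum(map(sum, binary))        # A: foreground pixels before thinning
    thin_area = sum(map(sum, thinned))  # A_th: foreground pixels after thinning
    parts = []
    for i in range(rows):
        for j in range(cols):
            r0, r1 = i * h // rows, (i + 1) * h // rows
            c0, c1 = j * w // cols, (j + 1) * w // cols
            parts.append(sum(binary[r][c]
                             for r in range(r0, r1)
                             for c in range(c0, c1)))
    return [area, thin_area] + parts

# Toy 6 x 4 digit: a vertical bar two pixels wide, thinned to one pixel.
binary = [[0, 1, 1, 0] for _ in range(6)]
thinned = [[0, 1, 0, 0] for _ in range(6)]
features = extract_features(binary, thinned)
```

With this layout the six part counts always sum to A, which gives a quick sanity check on any extracted vector.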

Characters Recognition
The identification of numerals in the text image is the final stage of the proposed system. In numeral recognition, the primary procedure is to convert the numeral image to a text file, which can be edited and used by any other program or application as needed.
The vector of coefficients (V_C) for each character image is generated after obtaining a text image of numerals, which represents the test image. In this paper, the Minimum Distance (MD) technique is used. It calculates the Euclidean distance between the generated vector and each vector in the database, and the minimum value of MD indicates the closest match for the tested numeral. In this way, the numeral is recognized.
The database (DB) consists of 1120 features and includes the ten digits (0-9) in two font styles (Times New Roman, 70 images, and Calibri, 70 images). Each digit has eight features, as previously mentioned (V_C = [A, A_th, P_1, P_2, P_3, P_4, P_5, P_6]). The following relationship is used to normalize the feature values in the database to the range 0-1:

$$ V_C^{norm}(i, j) = \frac{V_C(i, j) - V_C(i, j)_{\min}}{V_C(i, j)_{\max} - V_C(i, j)_{\min}} \qquad (1) $$

where V_C(i, j)_max is the maximum value of the j-th feature and V_C(i, j)_min is the minimum value of the j-th feature.
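The scaling of each feature to the range 0-1 can be illustrated with a short sketch, assuming standard min-max normalization applied per feature (per column) across all database entries. The function name and the toy values are hypothetical.

```python
def min_max_normalize(db):
    """Scale every feature (column) of the database to the range 0-1.

    Returns the scaled rows plus the per-feature minima and maxima, so that
    a test vector can later be scaled with the same constants.
    """
    columns = list(zip(*db))
    lo = [min(col) for col in columns]
    hi = [max(col) for col in columns]
    scaled = [[(v - l) / (h - l) if h > l else 0.0  # constant feature maps to 0
               for v, l, h in zip(row, lo, hi)]
              for row in db]
    return scaled, lo, hi

# Toy database: three entries with two features each (made-up values).
db = [[10, 2], [20, 4], [30, 8]]
scaled, lo, hi = min_max_normalize(db)
```

Keeping the minima and maxima matters in practice: the features of a test image have to be scaled with the database constants, not with their own, before distances are compared.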
Recognition is performed using Eq. 2 [22]:

$$ MD_{in} = \sqrt{\sum_{j=1}^{C} \left( V(i, j) - DB(i, j) \right)^{2}} \qquad (2) $$

where MD_in is the distance between the image of the entered numeral and those kept in the database, C denotes the total number of features, V(i, j) is the j-th feature of the i-th entered image, and DB(i, j) denotes the j-th feature of the numeral images kept in the database.
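A minimum distance classifier of this kind can be sketched as follows, assuming the plain Euclidean distance over normalized feature vectors. The function names, labels, and two-feature toy vectors are hypothetical and only illustrate the nearest-match rule.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def classify(test_vec, db_vectors, db_labels):
    """Return the label of the closest database entry and its distance (MD)."""
    distances = [euclidean(test_vec, ref) for ref in db_vectors]
    best = min(range(len(distances)), key=distances.__getitem__)
    return db_labels[best], distances[best]

# Toy database with two normalized features per entry (made-up values).
db_vectors = [[0.0, 0.0], [1.0, 1.0], [0.5, 0.2]]
db_labels = ["0", "1", "7"]
label, md = classify([0.6, 0.3], db_vectors, db_labels)
```

In the toy data the test vector lies closest to the third entry, so the sketch returns the label "7" together with the corresponding distance.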

Graphical User Design
A Graphical User Interface (GUI) is an expandable graphical gateway that uses icons, signs, or other visual pointers to connect and interact with electronic devices. It is composed of a set of commands. In this paper, a GUI for digit recognition was designed in MATLAB. The steps for operating the GUI of the proposed system are as follows:
1. When the program is run, the main screen appears with all buttons inactive (see Figure 6-a).
2. When the select image upload button is pressed, a file selection dialog is displayed for choosing the text image file to be recognized (Figure 6-b).

Experimental Results
For the numeral recognition model, a database of 1120 features was constructed based on eight parameters for feature extraction. Many images of characters were tested; these images differ from each other in font size and in the number of digits each image contains. To complete the numeral recognition system, several experimental steps were adopted during the research. Figure 8 depicts some of the preparation measures and tests performed on the implemented recognition system. Equation 1 was applied to test pictures that had multiple number orders in one image (ones, tens, hundreds, ..., billions) and different font sizes. According to the test and training outcomes for all text images, our system can identify and differentiate printed numerals regardless of font size or style, and all the numerals in the text images are correctly identified. Figure 7 illustrates some samples of the program running on various entered images, as well as how the recognition system's output is saved in a vector named OUT. The suggested approach determines all digits with higher accuracy and in less time than the method of [18]. The average processing time required to recognize numerals is shown in Table 2. True positives (TP) and true negatives (TN) are both correct classifications. A false positive (FP) occurs when a sample that is actually negative is predicted as positive. A false negative (FN) occurs when a positive sample is incorrectly predicted as negative.
The metric used to evaluate the suggested approach is precision, which relates the test results that correctly signal a condition (the existence of a true condition) to those that signal it incorrectly (the presence of a false condition). It can be expressed as:

$$ \text{Precision} = \frac{TP}{TP + FP} $$

The test data consists of 26 test images containing 100 numerals in total. For each test digit (0-9), the MD of the entered test characters is computed and compared with the entries saved in the database to show the correctness and robustness of the proposed approach. Table 3 lists the 10 lowest MD values in increasing order. The font types (TN) and (CB) stand for Times New Roman and Calibri, respectively. In the examination field, true or false entries signify a match or a non-match.
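The evaluation can be illustrated with a short sketch that assumes the standard definitions of precision and accuracy in terms of TP, TN, FP, and FN. Treating each correctly recognized numeral as a true positive and each wrongly assigned label as a false positive is an assumption made here for illustration, and the function names are hypothetical.

```python
def precision(tp, fp):
    """Share of positive predictions that are correct: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def accuracy(tp, tn, fp, fn):
    """Share of all predictions that are correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0

# 100 test numerals, all recognized correctly (as reported in the text).
p_all_correct = precision(tp=100, fp=0)

# Made-up counter-example: 9 correct and 1 wrong label out of 10.
p_one_error = precision(tp=9, fp=1)
```

Under these assumptions, 100 correct recognitions with no wrong labels give a precision of 1.0, in line with the reported 100%.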
Table 3 contains the experimental results. Correct recognition is indicated by the lowest MD value, i.e., the closest match between the features of the entered image and those saved in the database. In this study, the results for 10 × 10 numerals are shown to confirm the effectiveness of the suggested approach. In all instances, the numeral was identified correctly. Table 4 compares the proposed method's results with those of other widely used recognition approaches.

Conclusion
This paper presents a new technique for recognizing Arabic numerals printed on a document image. First, the images are processed in several stages (segmentation, thinning, normalization, and feature-vector calculation) to extract features and generate a database. A variety of images of numerals (0-9) were acquired with the Snipping Tool, which is available under Windows. These images vary in font style (Calibri and Times New Roman), font size (16, 22, 28, 36, 42, 50, and 72), and the number of digits contained in each image. A new approach has been taken to extract effective features for the Arabic numeral recognition system. The method for calculating features includes two stages: (a) thinning and normalization, and (b) feature-vector calculation. Both are described in Section 2.3. In addition, a Graphical User Interface (GUI) was designed to display the identification result, as detailed in Section 3.
The minimum distance classifier, which measures the difference between the feature values of an entered digit and those saved in the database, has been used to differentiate numerals. The least MD value indicates the closest match for the test character, by which the numeral is recognized. The technique demonstrated a very promising result, with 100% overall accuracy. The average time needed to recognize each digit in any font size is 0.06879 seconds.
Although the entered test characters (Times New Roman and Calibri) differ from those saved in the database (Times New Roman only), the results show that 28 test numerals were matched to Calibri, all of which were correct.

Figure 2: Samples of data set: Arabic numeral (0-9) with Times New Roman font style in different sizes.
Figure 3-a and 3-b show a sample of numerals binarized and segmented, respectively.

Figure 4: Result of the normalization and thinning.

3. Pressing the (process) button converts the text image into binary, as shown in Figure 6-c.
4. The segmentation button segments the text image and displays the result (Figure 6-d).
5. The OUT button performs the numeral recognition and displays the result (Figure 6-e).
6. The reset button deletes the previous steps and initializes the recognition system, i.e., returns to the main screen, as shown in Figure 6-f.

Table 2: The average computational time required to recognize numerals at various font sizes.

Table 4: Comparison of the proposed method to other commonly used methods.