Statistical Analysis of COVID-19 Data in Iraq

This paper presents a statistical analysis of COVID-19 data in Iraq. The data comprise daily cases and deaths from the outbreak of the pandemic in Iraq in February 2020 until 28 June 2022. Several probability distributions are fitted to the data in order to identify the most appropriate distribution for both daily cases and daily deaths. The statistical analysis includes parameter estimation, goodness-of-fit tests, and illustrative probability plots. The generalized extreme value and the generalized Pareto distributions were found to provide the closest fit for both daily cases and deaths. However, both were rejected by the goodness-of-fit tests, owing to the high variability of the data.


1. Introduction
The spread of the COVID-19 pandemic has greatly affected people's lives all over the world. The outbreak and spread of the pandemic vary from country to country, so statistical models should be evaluated separately for each country. Fitting a probability model to the rapid spread of the COVID-19 pandemic, both locally and globally, is therefore highly recommended. During the past two years, a great number of researchers have focused their efforts on helping to control this pandemic. Yonar, H. et al. [1] collected datasets on the number of COVID-19 cases in eight selected countries. Cases were modeled using curve-fitting models, namely the Box-Jenkins (ARIMA) time series model, and forecast using the Brown/Holt linear exponential smoothing method. Zhao, J. et al. [2] compared the COVID-19 pandemic dynamics of two neighboring Asian countries, Iran and Pakistan. They developed a new statistical model that provides the best fit to the COVID-19 daily death data in the two countries. Xia, J., Bin, Z. and Jinming, C. [3] studied an infectious disease dynamics model together with a time series model to detect the trend and provide short-term predictions of the transmission of the COVID-19 pandemic. Woody, S. et al. [4] built a model using a nonparametric regression technique called locally weighted polynomial regression (LOWESS). They incorporated a set of predictors based on mobile phone social distancing data, which provides well-informed predictions of COVID-19 death rates in the United States. Shukur, S. D. and Kadhim, T. H. [5] applied time series analysis to model and forecast COVID-19 daily deaths in Iraq. They found that the series of coronavirus deaths in Iraq was identified as a Threshold GARCH(1,1) model. In addition, the Holt-Winters additive method was the best of the exponential smoothing methods for forecasting.
Further, a number of recent studies on modeling COVID-19 pandemic data are based on extreme value theory and machine learning methods [6,7,8,9,10].
This study aims to provide a statistical model that best fits the daily cases and deaths of the COVID-19 pandemic in Iraq. A variety of statistical distributions are selected and fitted to the data, and the four distributions that best fit both the daily cases dataset and the daily deaths dataset are presented.

Identification of Probability Models
This section introduces the probability distributions applied in this study, which are the best-fitted distributions for the data [11,12].

Exponential Distribution
The probability density function is
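For reference, in the standard rate parameterization (an assumption: the symbol $\lambda$ and the absence of a location shift are not confirmed by the source), the density and the corresponding distribution function can be written as:

```latex
f(x) = \lambda e^{-\lambda x}, \qquad
F(x) = 1 - e^{-\lambda x}, \qquad x \ge 0,\ \lambda > 0 .
```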

Gamma Distribution
The probability density function is where Γ is the gamma function, and Γ is the incomplete gamma function.
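For reference, a standard two-parameter form with shape $\alpha$ and scale $\beta$ is sketched below. The symbols and the omission of a location parameter are assumptions; $\gamma(\alpha, z)$ denotes the lower incomplete gamma function.

```latex
f(x) = \frac{x^{\alpha-1} e^{-x/\beta}}{\beta^{\alpha}\,\Gamma(\alpha)}, \qquad
F(x) = \frac{\gamma(\alpha,\, x/\beta)}{\Gamma(\alpha)}, \qquad
\gamma(\alpha, z) = \int_0^{z} t^{\alpha-1} e^{-t}\,dt, \qquad
x > 0,\ \alpha > 0,\ \beta > 0 .
```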

Weibull Distribution
The probability density function is The cumulative distribution function is
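For reference, the usual two-parameter Weibull form with shape $\alpha$ and scale $\beta$ is given below. The symbols are assumptions, and a three-parameter variant would replace $x$ with $x - \gamma$ for a location $\gamma$.

```latex
f(x) = \frac{\alpha}{\beta}\left(\frac{x}{\beta}\right)^{\alpha-1}
       \exp\!\left[-\left(\frac{x}{\beta}\right)^{\alpha}\right], \qquad
F(x) = 1 - \exp\!\left[-\left(\frac{x}{\beta}\right)^{\alpha}\right], \qquad
x \ge 0,\ \alpha > 0,\ \beta > 0 .
```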

Generalized Extreme Value Distribution
The probability density function is The cumulative distribution function is
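For reference, a common parameterization with location $\mu$, scale $\sigma > 0$ and shape $\xi$ is sketched below. The symbols and the sign convention for the shape are assumptions; some software uses the opposite sign.

```latex
z = \frac{x-\mu}{\sigma}, \qquad
F(x) = \begin{cases}
\exp\!\left(-(1+\xi z)^{-1/\xi}\right), & \xi \neq 0,\ 1+\xi z > 0,\\[4pt]
\exp\!\left(-e^{-z}\right), & \xi = 0,
\end{cases}
\qquad
f(x) = \begin{cases}
\dfrac{1}{\sigma}\,(1+\xi z)^{-1-1/\xi}\exp\!\left(-(1+\xi z)^{-1/\xi}\right), & \xi \neq 0,\\[8pt]
\dfrac{1}{\sigma}\, e^{-z}\exp\!\left(-e^{-z}\right), & \xi = 0 .
\end{cases}
```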

Generalized Pareto Distribution
The probability density function is The cumulative distribution function is
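For reference, a common parameterization with location (threshold) $\mu$, scale $\sigma > 0$ and shape $\xi$ is sketched below; the symbols are assumptions. The support is $z \ge 0$ when $\xi \ge 0$ and $0 \le z \le -1/\xi$ when $\xi < 0$.

```latex
z = \frac{x-\mu}{\sigma}, \qquad
F(x) = \begin{cases}
1-(1+\xi z)^{-1/\xi}, & \xi \neq 0,\\[4pt]
1-e^{-z}, & \xi = 0,
\end{cases}
\qquad
f(x) = \begin{cases}
\dfrac{1}{\sigma}\,(1+\xi z)^{-1-1/\xi}, & \xi \neq 0,\\[8pt]
\dfrac{1}{\sigma}\, e^{-z}, & \xi = 0 .
\end{cases}
```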

The Goodness of Fit Tests
Goodness-of-fit tests measure the compatibility of a random sample with a theoretical probability distribution [13]. Three goodness-of-fit tests are applied in this study. They share the following null and alternative hypotheses:
H0: The data follow the specified theoretical distribution.
H1: The data do not follow the specified theoretical distribution.

The Kolmogorov-Smirnov Test
Assume that we have a random sample $X_1, \ldots, X_n$ from some continuous distribution with CDF $F(x)$. The empirical CDF is denoted by
$$F_n(x) = \frac{1}{n}\,\#\{\, i : X_i \le x \,\}.$$
The Kolmogorov-Smirnov statistic $D$ is based on the largest vertical difference between the CDF $F(x)$ and the empirical CDF $F_n(x)$. That is,
$$D = \sup_x \left| F_n(x) - F(x) \right|.$$
The null hypothesis is rejected at the selected significance level $\alpha$ if the calculated statistic $D$ exceeds the critical value obtained from a table.
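The computation of the Kolmogorov-Smirnov statistic can be illustrated with a minimal sketch. The function names `ks_statistic` and `exponential_cdf`, the sample values, and the choice of an exponential null distribution are assumptions made for illustration; none of them come from the study's data or software.

```python
import math


def ks_statistic(sample, cdf):
    """Largest vertical gap between the empirical CDF and a hypothesized CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        fx = cdf(x)
        # The empirical CDF jumps from (i-1)/n to i/n at x, so both sides
        # of the jump have to be compared with F(x).
        d = max(d, i / n - fx, fx - (i - 1) / n)
    return d


def exponential_cdf(rate):
    """CDF of an exponential distribution with the given rate (illustrative)."""
    return lambda x: 1.0 - math.exp(-rate * x) if x >= 0 else 0.0


sample = [0.2, 0.5, 0.9, 1.4, 2.3]        # made-up values, not the WHO data
rate_hat = len(sample) / sum(sample)       # exponential MLE: 1 / sample mean
d = ks_statistic(sample, exponential_cdf(rate_hat))
print(f"D = {d:.4f}")
```

When the parameters are estimated from the same sample, as here, tabulated Kolmogorov-Smirnov critical values are only approximate and tend to be conservative.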

The Anderson-Darling Test
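The Anderson-Darling test gives more weight to the tails than the Kolmogorov-Smirnov test. It is a general test that compares the fit of an observed CDF to an expected CDF. The Anderson-Darling test statistic $A^2$ is given as
$$A^2 = -n - \frac{1}{n}\sum_{i=1}^{n}(2i-1)\left[\ln F\!\left(X_{(i)}\right) + \ln\!\left(1 - F\!\left(X_{(n-i+1)}\right)\right)\right],$$
where $X_{(1)} \le \cdots \le X_{(n)}$ are the ordered observations. The null hypothesis is rejected at the selected significance level $\alpha$ if the test statistic $A^2$ exceeds the critical value obtained from a table.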
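A minimal sketch of the Anderson-Darling statistic in its standard textbook form follows. The function name `anderson_darling` and the example values are assumptions for illustration; the hypothesized CDF must return values strictly between 0 and 1 at every observation, otherwise the logarithms are undefined.

```python
import math


def anderson_darling(sample, cdf):
    """Anderson-Darling A^2 for a fully specified continuous CDF."""
    xs = sorted(sample)
    n = len(xs)
    total = 0.0
    for i in range(1, n + 1):
        f_low = cdf(xs[i - 1])    # F at the i-th smallest observation
        f_high = cdf(xs[n - i])   # F at the (n - i + 1)-th smallest observation
        total += (2 * i - 1) * (math.log(f_low) + math.log(1.0 - f_high))
    return -n - total / n


# Made-up sample tested against a uniform(0, 1) CDF.
a2 = anderson_darling([0.25, 0.5, 0.75], lambda x: x)
print(f"A^2 = {a2:.4f}")
```

The logarithmic terms grow quickly as $F$ approaches 0 or 1, which is how the statistic places more weight on the tails.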

The Chi-Squared Test
The Chi-Squared test is used to determine whether the sample data follow a specified distribution. This test is applied to binned data, so the value of the test statistic depends on how the data were binned. Note that this test is available for continuous sample data only. The Chi-Squared statistic is defined as
$$\chi^2 = \sum_{i=1}^{k}\frac{(O_i - E_i)^2}{E_i},$$
where $k$ is the number of bins, calculated from the sample size $n$ as $k = 1 + \log_2 n$. The observed frequency for bin $i$ is $O_i$, and the expected frequency for bin $i$ is $E_i = n\,[F(x_2) - F(x_1)]$, where $F$ is the CDF of the hypothesized distribution and $x_1$, $x_2$ are the limits of bin $i$. The null hypothesis is rejected at the selected significance level $\alpha$ if the $\chi^2$ test statistic is greater than the critical value $\chi^2_{1-\alpha,\,k-1}$ of the chi-squared distribution.
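The dependence of the Chi-Squared statistic on the binning can be seen in a small sketch. Equal-width bins between the sample minimum and maximum, open-ended outer bins, and truncating $1 + \log_2 n$ to an integer are assumptions of this illustration; the binning used in the study is not stated. The names `chi_squared_statistic` and `uniform_cdf` and the sample values are hypothetical.

```python
import math


def chi_squared_statistic(sample, cdf, k=None):
    """Pearson chi-squared statistic of `sample` against a hypothesized CDF."""
    n = len(sample)
    if k is None:
        k = int(1 + math.log2(n))  # bin-count rule from the text, truncated
    lo, hi = min(sample), max(sample)
    width = (hi - lo) / k          # assumes the sample is not constant
    edges = [lo + j * width for j in range(k + 1)]

    observed = [0] * k
    for x in sample:
        j = min(int((x - lo) / width), k - 1)  # the maximum goes in the last bin
        observed[j] += 1

    # Assumption: the outer bins extend to -inf and +inf, so that the
    # expected counts add up to n.
    cum = [0.0] + [cdf(e) for e in edges[1:-1]] + [1.0]
    stat = 0.0
    for j in range(k):
        expected = n * (cum[j + 1] - cum[j])   # must be positive in every bin
        stat += (observed[j] - expected) ** 2 / expected
    return stat, k


def uniform_cdf(x):
    """CDF of the uniform(0, 1) distribution."""
    return min(max(x, 0.0), 1.0)


sample = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]   # made-up values
stat, k = chi_squared_statistic(sample, uniform_cdf, k=2)
print(f"chi-squared = {stat:.4f} with k = {k} bins")
```

Calling the function with a different `k` on the same sample generally gives a different statistic, which is the binning sensitivity described above.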

Data Description
The total numbers of daily cases and deaths in Iraq were collected from the WHO statistics [14] for the period from February 2020 until 28 June 2022. Tables 1-2 show summary statistics and percentile values for the daily cases and deaths, respectively. Tables 1-2 show high variability in the data, particularly in the number of daily cases. The coefficients of variation are close to one, since the means and standard deviations are of similar magnitude. A coefficient of variation near one is characteristic of the exponential distribution, which indicates that the data vary exponentially. Plots of COVID-19 daily cases and deaths are presented in Figures 1-2. Figure 1 shows that the number of daily cases takes the form of successive waves of increasing cases over time. Figure 2 reveals that the number of daily deaths increased sharply at the beginning of the pandemic and decreased gradually toward the end of the period. The reduction in the number of daily deaths is attributed to the effect of vaccination.
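The coefficient of variation mentioned above is the ratio of the standard deviation to the mean. A minimal sketch of the calculation follows; the function name and the values are hypothetical and are not the figures from Tables 1-2.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the sample mean."""
    return statistics.stdev(values) / statistics.mean(values)


daily_counts = [2, 4, 4, 4, 5, 5, 7, 9]   # made-up counts, not the WHO data
cv = coefficient_of_variation(daily_counts)
print(f"CV = {cv:.3f}")
```

For an exponential distribution the mean and the standard deviation are equal, so its theoretical coefficient of variation is exactly one.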

Data Analysis
In this section, we seek the probability distribution that best fits the COVID-19 daily cases and deaths in Iraq. First, we select a set of well-known probability distributions, fit them to our data, and identify the four distributions ranked highest by the three goodness-of-fit tests. The analysis is carried out with the aid of the Easy-Fit software.
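A comparable fit-and-rank workflow can be sketched with SciPy. This is not the Easy-Fit procedure used in the study: the candidate list, the fixed zero location for four of the families, the ranking by the Kolmogorov-Smirnov statistic alone, and the synthetic data are all assumptions of this sketch.

```python
import numpy as np
from scipy import stats

# Candidate families named in the text, paired with fit options.
# Fixing loc = 0 for four of the families is an assumption of this sketch.
CANDIDATES = {
    "exponential": (stats.expon, {"floc": 0}),
    "gamma": (stats.gamma, {"floc": 0}),
    "Weibull": (stats.weibull_min, {"floc": 0}),
    "generalized extreme value": (stats.genextreme, {}),  # SciPy shape c = -xi
    "generalized Pareto": (stats.genpareto, {"floc": 0}),
}


def rank_by_ks(data):
    """Fit each candidate by maximum likelihood and rank by the K-S statistic."""
    results = []
    for name, (dist, fit_kwargs) in CANDIDATES.items():
        params = dist.fit(data, **fit_kwargs)   # (shapes..., loc, scale)
        res = stats.kstest(data, dist.cdf, args=params)
        results.append((name, res.statistic, res.pvalue))
    return sorted(results, key=lambda r: r[1])  # smallest D ranks first


rng = np.random.default_rng(0)
synthetic = rng.exponential(scale=100.0, size=300)  # placeholder, NOT the WHO data
ranking = rank_by_ks(synthetic)
for name, d, p in ranking:
    print(f"{name:26s} D = {d:.4f}  p = {p:.4f}")
```

Because the parameters are estimated from the same data that are tested, the reported p-values are optimistic and are best read as a relative ranking rather than as exact significance levels.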

Analysis of Daily cases
The results of the analysis are presented in Tables 3-4. Maximum likelihood estimates of the parameters are presented in Table 3, and the ranking of each distribution according to the three goodness-of-fit test statistics is presented in Table 4. The best-fitted distributions for the daily cases data are the generalized extreme value and the generalized Pareto distributions. However, all of the best-fitted distributions are rejected by the goodness-of-fit tests. Figure 3 shows the PP-plot for the best-fitted distributions. In Figure 3, the points for the generalized extreme value and the generalized Pareto distributions are approximately lined up on the probability plot, which indicates a good fit.
Figure 3: A PP Plot for the best-fitted models of daily cases
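How points come to be lined up on a P-P plot can be sketched as follows: each ordered observation is paired with an empirical probability and with the probability assigned to it by the fitted CDF. The plotting position $(i - 0.5)/n$ is an assumption, since other conventions exist, and the function name and values are hypothetical.

```python
def pp_points(sample, cdf):
    """Pairs (empirical probability, fitted probability) for a P-P plot."""
    xs = sorted(sample)
    n = len(xs)
    return [((i - 0.5) / n, cdf(x)) for i, x in enumerate(xs, start=1)]


# Made-up sample checked against a uniform(0, 1) CDF.
points = pp_points([0.75, 0.25], lambda x: x)
max_gap = max(abs(emp - fit) for emp, fit in points)
print(points, max_gap)
```

The closer the pairs lie to the 45-degree line, that is, the smaller `max_gap`, the better the fitted distribution reproduces the empirical probabilities.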

Analysis of Daily Deaths
The results of the analysis are presented in Tables 5-6. Maximum likelihood estimates of the parameters are presented in Table 5, and the ranking of each distribution according to the three goodness-of-fit test statistics is presented in Table 6. The best-fitted distributions for the daily deaths data are the same as for the daily cases data: the generalized extreme value and the generalized Pareto distributions. Again, all of the best-fitted distributions are rejected by the goodness-of-fit tests. Figure 4 shows the PP-plot for the fitted distributions. In Figure 4, the points for the generalized extreme value and the generalized Pareto distributions are approximately lined up on the probability plot, which indicates a good fit.
Figure 4: A PP Plot for the best-fitted models of daily deaths

Conclusions
The plot of the COVID-19 daily cases in Figure 1 reveals that, during the study period, the pandemic went through four waves, of which the third was the most severe in terms of the number of infected cases. On the other hand, the plot of the COVID-19 daily deaths in Figure 2 reveals a dramatic increase in the number of deaths during the first wave. However, daily deaths decreased noticeably toward the end of the period, which is attributed to the effect of vaccination.
Based on the PP plots presented in Figures 3 and 4 and the goodness-of-fit tests presented in Tables 4 and 6, the fitting results show that the generalized extreme value distribution and the generalized Pareto distribution were the best-fitted models for both daily cases and deaths.
However, the null hypothesis presented in section 3 is rejected for all the suggested distributions by the goodness of fit test statistics.
Accordingly, it is recommended to carry out the statistical analysis on extreme value data rather than on the original daily data, since the generalized extreme value distribution is designed for block maxima and the generalized Pareto distribution for exceedances over a threshold.
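The two kinds of extreme value data referred to in this recommendation can be extracted from a daily series as sketched below. The weekly block size, the threshold, the function names and the series are hypothetical choices for illustration only.

```python
def block_maxima(series, block_size):
    """Maximum of each complete, non-overlapping block (input for a GEV fit)."""
    last_start = len(series) - block_size
    return [max(series[i:i + block_size])
            for i in range(0, last_start + 1, block_size)]


def exceedances(series, threshold):
    """Amounts by which observations exceed a threshold (input for a GPD fit)."""
    return [x - threshold for x in series if x > threshold]


daily = [3, 7, 2, 9, 4, 1, 8, 6, 5, 10, 0, 2, 11, 3]   # made-up counts
weekly_max = block_maxima(daily, block_size=7)
over_8 = exceedances(daily, threshold=8)
print(weekly_max, over_8)
```

A trailing incomplete block is dropped in this sketch. In practice, both the block size and the threshold need to be chosen with care, because they trade bias against the number of extreme observations retained.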