Temporal Video Segmentation Using Optical Flow Estimation

Shot boundary detection is the process of segmenting a video into basic units, known as shots, by locating the transition frames between them. Much research has been conducted on detecting shot boundaries accurately. However, speeding up shot detection while preserving high accuracy still needs improvement. This paper introduces a new method that finds abrupt shot boundaries in a video with high accuracy and lower computational cost. The proposed method consists of two stages. First, projection features are used to distinguish non-boundary transitions from candidate transitions that may contain an abrupt boundary. Only candidate transitions are retained for the next stage, so detection is accelerated by reducing the detection scope. In the second stage, the candidate segments are refined using a motion feature derived from the optical flow to remove non-boundary frames. The results show that the proposed method achieves excellent detection accuracy (an F-Score of 0.98) and effectively speeds up the detection process. In addition, a comparative analysis confirms the superior performance of the proposed method over other methods.


Introduction
Video is one of the most important forms of multimedia. It is widely used in applications such as education, medicine, surveillance, sports and entertainment. Video is a multi-dimensional signal structured as a series of still images called frames. The change of frame content over time produces smooth movement for the observer. Successive frames without significant changes in visual content are grouped together into so-called shots. Each video shot represents a series of frames captured by a single camera, while a set of shots with semantically related content is called a scene [1]. The aim of temporal video segmentation is to split a video into shots, which serve as meaningful units for locating the desired information in a large amount of video data. It is the first step of many video processing applications. The essence of temporal video segmentation techniques (also known as shot boundary detection techniques) is to discover the shot boundaries. The boundaries are the discontinuity frame(s) that describe the transition from one shot to the next [2]. The types of shot transitions in videos can be classified into the following categories [3,4]:
- Abrupt transition (hard transition): a sudden change of content from one shot to the next, which means that the shots are separated by a single frame.
- Gradual transition (soft transition): a gradual change of content from one shot to the next, which means that the shots are separated by several frames with very closely related visual content. The gradual transition can be further categorized as follows:
  - Fade-in: a new shot appears by gradually increasing the brightness of a black frame.
  - Fade-out: a shot disappears by gradually decreasing the brightness of the frame.
  - Dissolve: the current shot overlaps with the incoming one.
Examples of different types of shot transitions are shown in Figure 1.
The main challenges of shot boundary detection are camera and object movements, because they may alter the video content dramatically, producing an effect similar to a transition and causing incorrect detection of shot transitions. Additionally, most algorithms rely on frame-by-frame comparison to discover the shot boundaries, which causes a high computational cost. A shot detection algorithm should be able to reduce computation costs and recognize the shot boundaries without misinterpreting object or camera motion as a shot transition [5,6]. This paper presents a new method to detect abrupt transitions between successive shots in a video with high detection accuracy and the lowest possible execution time, regardless of the complexity of the video in terms of camera operations, motion effects, and illumination changes. The paper is organized as follows: the literature review is discussed in the next section. Projection features are described in section 3. Optical flow estimation is presented in section 4. The proposed method is described in section 5. The experimental results are discussed in section 6. Finally, conclusions are given in section 7.

Literature Review
Video segmentation is widely used as the first step in many applications, such as video categorization, video indexing, video summarization and content-based video retrieval. Many researchers have tried to detect shot transitions by comparing consecutive frames, focusing on the visual features of the video frames. First, the features are extracted from each frame; then distance or similarity measures are computed and compared to a specific threshold to identify the abrupt transition [7]. The histogram, calculated in a color space or on gray levels, is a commonly used feature in shot detection techniques. Liu and Dai [8] presented a method of shot boundary detection using a grey model based on set sequence (denoted SGM) and Hue Saturation Intensity (HSI). The frame was converted to HSI, and then a histogram was computed and sampled by SGM. Abrupt shots were detected based on the absolute mean error with thresholds. Hong Shao et al. [9] exploited a color histogram in Hue Saturation Value (HSV) space together with Histogram of Gradient (HOG) features to detect abrupt shots. In order to decrease the influence of camera/object motion, Kar and Kanungo [10] used absolute sum gradient orientation features with a threshold generated from local and global thresholds to detect abrupt shots. Thounaojam et al. [11] employed features extracted from the Gray Level Co-occurrence Matrix (GLCM) and used a correlation measure to compute the similarity between two consecutive GLCMs of the video. Zajic et al. [12] formed a structure to measure color and texture differences between frames of a decoded video sequence for abrupt transition detection. These methods are robust to camera/object motion, but require significant computational power. Local features extracted from the video frames can also be employed to discover shot boundaries. Santos and Pedrini [13] introduced a method for detecting shot boundaries by using the Weber local descriptor with an adaptive threshold.
Hannane et al. [14] designed a shot detection method that combined local and global features by using the distribution histogram of SIFT points for consecutive frames with an adaptive threshold to detect abrupt and gradual transitions. Other authors [15,16] proposed shot boundary detection methods based on fuzzy logic, an important tool in computing research. Several works in the literature employed machine learning techniques. For example, Mondal et al. [17] used a Least Squares Support Vector Machine (LS-SVM), a modified version of SVM. Features were extracted using the multiscale geometric analysis of the nonsubsampled contourlet transform and reduced using principal component analysis. Using LS-SVM and the feature vectors, the frames were classified into non-transition, abrupt transition or gradual transition classes. Convolutional Neural Networks (CNNs) were utilized by Xu et al. [18]. Shot boundary positions were determined using a candidate segment selection method and adaptive thresholds. Then, a CNN was applied to extract representative features of the frames in the candidate segments. Finally, shot transitions were obtained by using a pattern matching method. These methods are effective and achieve high accuracy, but their execution speed is slow.

Hato
Iraqi Journal of Science, 2021, Vol. 62, No. 11, pp: 4181-4194
Researchers have tried to achieve both greater accuracy and higher execution speed. Lu and Shi [19] achieved an important step forward in reducing the execution time of shot boundary detection techniques. They adopted Singular Value Decomposition (SVD) and candidate segment selection to reduce the execution time of shot boundary detection. First, the candidate segments that may contain shot boundaries were identified. Next, color histograms were extracted as the feature matrix of the candidate segment frames. SVD was then applied to reduce the feature dimension. Finally, shot transitions were determined using a pattern matching method. Dhiman et al. [20] used a pixel-based technique with candidate segment selection to accelerate the detection of shot boundaries. Hato and Abdulmunem [21] provided a fast method for shot boundary detection. First, SURF features were extracted from half of the video frames. Then, feature matching was performed using a distance function. Finally, the similarities were calculated and compared to a global threshold. This review of shot boundary detection techniques shows that the main challenges are fast object movement, large camera motion and computational cost. Some techniques achieve high accuracy at the cost of execution time. Hence, shot boundary detection techniques still need to provide accurate results with faster execution. The proposed method therefore aims to achieve high detection accuracy while maintaining low computation costs.

Projections
The projections of an image are one-dimensional representations of its contents, calculated along the coordinate axes. The horizontal projection is the sum of the pixel values in each image row, while the vertical projection is the sum of the pixel values in each image column. For an image I(r, c) of size M×N, the horizontal projection HP and the vertical projection VP are defined as [22]:

$$HP(r) = \sum_{c=1}^{N} I(r, c), \qquad r = 1, \dots, M \qquad (1)$$

$$VP(c) = \sum_{r=1}^{M} I(r, c), \qquad c = 1, \dots, N \qquad (2)$$

The horizontal projection has length M and the vertical projection has length N, corresponding to the height and width of the image, respectively. Projections are often used to quickly analyze the image structure and isolate its component parts. Additionally, projections provide shape information that is useful in applications such as text detection and character recognition, where the objects of interest can be normalized with respect to size [22].
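As an illustration of the two projections, a minimal NumPy sketch is given below. The function names and the toy 2×3 image are illustrative assumptions, not part of the original work.

```python
import numpy as np

def horizontal_projection(image):
    """Sum of the pixel values in each row: length M for an M x N image."""
    return np.asarray(image, dtype=np.float64).sum(axis=1)

def vertical_projection(image):
    """Sum of the pixel values in each column: length N for an M x N image."""
    return np.asarray(image, dtype=np.float64).sum(axis=0)

# Toy 2 x 3 "image" (M = 2 rows, N = 3 columns), chosen only for illustration.
image = np.array([[1, 2, 3],
                  [4, 5, 6]])

hp = horizontal_projection(image)  # one value per row
vp = vertical_projection(image)    # one value per column
print(hp, vp)
```

For the toy image, the horizontal projection has two entries (one per row) and the vertical projection has three (one per column), matching the lengths M and N stated above.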

Optical Flow Estimation
Optical flow can be defined as the distribution of apparent motion velocities in captured image data. These motions are estimated by comparing two images (or two frames in the case of video) that are captured at two different times, or at exactly the same time using two cameras. In general, optical flow consists of two-dimensional vectors, where each vector represents the displacement of a pixel from the first frame to the next frame [23]. Consider a pixel I(x, y, t) in the first frame, which moves to the next frame in time ∆t. Let (∆x, ∆y) be the pixel displacement from the first frame to the next frame, so that the pixel in the next frame will be [24]:

$$I(x, y, t) = I(x + \Delta x,\; y + \Delta y,\; t + \Delta t) \qquad (3)$$

The differential method (partial derivatives) is used to find the pixel displacement with respect to the temporal and spatial coordinates, as follows:

$$I(x + \Delta x,\; y + \Delta y,\; t + \Delta t) = I(x, y, t) + \frac{\partial I}{\partial x}\Delta x + \frac{\partial I}{\partial y}\Delta y + \frac{\partial I}{\partial t}\Delta t + c \qquad (4)$$

where I(x, y, t) is the pixel at location (x, y, t), c is a real-valued constant, and ∆x, ∆y, ∆t are the movements between the two frames. Combining Equation (4) with Equation (3) and neglecting c gives Equation (5):

$$\frac{\partial I}{\partial x}\Delta x + \frac{\partial I}{\partial y}\Delta y + \frac{\partial I}{\partial t}\Delta t = 0 \qquad (5)$$

Dividing by ∆t results in:

$$\frac{\partial I}{\partial x}V_x + \frac{\partial I}{\partial y}V_y + \frac{\partial I}{\partial t} = 0 \qquad (6)$$

where $V_x = \Delta x / \Delta t$ and $V_y = \Delta y / \Delta t$ are the velocity or optical flow components of I(x, y, t), and $\partial I / \partial x$, $\partial I / \partial y$ and $\partial I / \partial t$ are the frame gradients at (x, y, t). Equation (6) is a single equation in the two unknowns V_x and V_y, so it cannot be solved on its own. Therefore, several differential methods, such as Farnebäck and Lucas-Kanade, introduce additional constraints to solve this problem [24]. The Farnebäck algorithm estimates the motion between two frames by using a polynomial expansion to approximate the neighborhood of each pixel. More details about this method can be found elsewhere [23].
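To make the underdetermined nature of Equation (6) concrete, the sketch below stacks the equation over all pixels and solves it in the least-squares sense, under the assumption that every pixel shares one flow vector. This is the idea behind Lucas-Kanade (applied here to a single global window), not the Farnebäck algorithm used later in the paper. The synthetic pattern, the function names and the chosen displacement are illustrative assumptions.

```python
import numpy as np

def estimate_global_flow(frame1, frame2):
    """Least-squares solution of Ix*Vx + Iy*Vy = -It over all pixels.

    Assumes a single flow vector (Vx, Vy) for the whole frame, so the many
    per-pixel copies of Equation (6) form an overdetermined linear system.
    """
    frame1 = np.asarray(frame1, dtype=np.float64)
    frame2 = np.asarray(frame2, dtype=np.float64)
    iy, ix = np.gradient(frame1)      # axis 0 is y (rows), axis 1 is x (columns)
    it = frame2 - frame1              # temporal gradient for a unit time step
    a = np.stack([ix.ravel(), iy.ravel()], axis=1)
    b = -it.ravel()
    (vx, vy), *_ = np.linalg.lstsq(a, b, rcond=None)
    return float(vx), float(vy)

def pattern(x, y):
    """Smooth synthetic brightness pattern used only for this demonstration."""
    return np.sin(0.2 * x) + np.cos(0.15 * y)

y, x = np.mgrid[0:64, 0:64].astype(np.float64)
true_vx, true_vy = 0.3, -0.2                   # sub-pixel shift (assumed values)
frame1 = pattern(x, y)
frame2 = pattern(x - true_vx, y - true_vy)     # same pattern moved by (vx, vy)

vx, vy = estimate_global_flow(frame1, frame2)
print(vx, vy)  # should be close to the true shift
```

A dense method such as Farnebäck instead produces one flow vector per pixel, which is what the second stage of the proposed method relies on.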

Proposed Video Segmentation Method
A simple and effective method is proposed for detecting abrupt shot transitions between video frames. Candidate transitions are first detected by analyzing the changing pattern of the projection features. A motion feature, estimated by optical flow, is then used to filter the candidate transitions. The proposed method is composed of two stages: candidate transition detection and abrupt transition specifying.

Candidates Transition Detection
An abrupt transition can be detected by looking for two completely different sequential frames. In other words, let (F_1, F_2, F_3, ..., F_N) be the video frames; F_i is considered an abrupt transition if the dissimilarity between F_i and F_{i+1} is high. For abrupt shot detection, extracting the optical flow field from all frames in the video and performing feature matching is computationally very expensive. Therefore, candidate frames are first detected from the video based on projection features, because these features reduce the information redundancy of video frames. The Horizontal Projection (HP) is computed for all video frames as a preliminary detection of abrupt transitions, so each video frame is represented by an HP vector of length M. Higher-resolution images produce a large number of features, which are recommended for image classification and retrieval. For shot boundary detection, however, high-resolution images are not mandatory; a smaller number of features is enough to decide whether two consecutive frames are different or similar. All frames are pre-processed by resizing them to 256×256 and converting them into grayscale images before extracting the projection features. The extracted features remain discriminative at low resolution while keeping the computation time low, as shown later in the experimental results. Detection of candidate transitions focuses on the differences between consecutive frames and the similarities between these difference values in the nearest neighborhood. The Mean Absolute Difference (MAD) measure is used to calculate the difference between two consecutive HP vectors. Whenever the difference between two successive HP vectors is higher than the predefined threshold TH1, a candidate boundary is recorded. The TH1 value can be adjusted experimentally to give high precision or high recall.
However, while a lower value of TH1 gives high recall, it also results in lower precision because ordinary changes in video content are detected as well. On the other hand, a high value of TH1 gives high precision but low recall. To overcome this problem, TH1 is set to a relatively low value in order to detect all abrupt boundaries, and the resulting false detections are removed by the second stage of the proposed method. The condition for detecting the candidate transition C_j is defined as:

$$C_j = i \quad \text{if } MAD(i) > TH1, \qquad i = 1, \dots, FNum - 1 \qquad (7)$$

where MAD(i) is:

$$MAD(i) = \frac{1}{M} \sum_{r=1}^{M} \left| HP_i(r) - HP_{i+1}(r) \right| \qquad (8)$$

C_j: the frame index i of the jth candidate transition. j: the index of the candidate transition. i: the index of the video frame. FNum: the total number of frames in the video. HP_i: the horizontal projection vector of length M for video frame F_i. M: the number of rows in the video frame (HP length). HP_i(r): the sum of the rth row of frame F_i.
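The candidate detection stage can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: frames are already grayscale arrays of equal size, indices are 0-based (the text uses 1-based indices), and the function names and toy frames are hypothetical. The appropriate scale of TH1 depends on how pixel values are normalised, which the text does not specify.

```python
import numpy as np

def mean_absolute_difference(hp_a, hp_b):
    """Mean absolute difference between two horizontal projection vectors."""
    hp_a = np.asarray(hp_a, dtype=np.float64)
    hp_b = np.asarray(hp_b, dtype=np.float64)
    return float(np.mean(np.abs(hp_a - hp_b)))

def detect_candidates(frames, th1):
    """Return the 0-based indices i where MAD(HP_i, HP_{i+1}) exceeds th1.

    `frames` is a sequence of equally sized 2-D grayscale arrays.
    """
    projections = [np.asarray(f, dtype=np.float64).sum(axis=1) for f in frames]
    candidates = []
    for i in range(len(projections) - 1):
        if mean_absolute_difference(projections[i], projections[i + 1]) > th1:
            candidates.append(i)
    return candidates

# Toy sequence: three dark 4 x 4 frames followed by two bright ones,
# so the only large change is between frame 2 and frame 3.
frames = [np.zeros((4, 4))] * 3 + [np.full((4, 4), 10.0)] * 2
candidates = detect_candidates(frames, th1=1.8)
print(candidates)
```

Only the projection vectors are compared at this stage, so each frame pair costs O(M) operations once the projections are available; the more expensive optical flow estimation is reserved for the returned candidates.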

Abrupt Transition Specifying
Two differently structured frames may have the same projection. The motion feature is therefore used to increase the discrimination ability of the proposed method and eliminate false detections. The second stage of the proposed method applies optical flow estimation to produce the optical flow field. The Farnebäck algorithm is used as the optical flow computation algorithm because it provides a dense and more stable estimation at a reasonable computational cost. The optical flow algorithm estimates the direction and speed of moving objects in the video. It produces orientation and magnitude matrices, which are computed from the velocity vectors. Discontinuities in the optical flow help to capture the temporal variation between successive frames. A number of video files were analyzed using the standard deviation of the magnitude matrices of the estimated optical flow fields in order to identify the pattern of an abrupt transition. It was found that an abrupt transition has a relatively large standard deviation, whereas a normal (non-boundary) transition has a small standard deviation. The optical flow field is therefore employed to determine the discontinuity between successive frames. Let CT = (c_1, c_2, c_3, ..., c_j) represent the indexes of the candidate transition frames. Optical flow estimation is applied between the frames from F_{c_j - n} to F_{c_j + n}, where n is the number of frames considered before and after c_j; n is set to two in the experiments. Then the standard deviation SD of each magnitude matrix of the optical flow field is computed, giving SD = (sd_{i-2}, sd_{i-1}, sd_i, sd_{i+1}, sd_{i+2}). The determination of an abrupt transition is based on comparing the SD values: if the frame F_{c_j} is an abrupt transition, then its sd_i value should be larger than the sd values of its nearest neighbor frames.
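The comparison of standard deviations described above can be sketched as follows. The sketch assumes that a dense flow field is already available as an H×W×2 array of (V_x, V_y) components, for example from a Farnebäck implementation; the function names and the numeric values are illustrative assumptions.

```python
import numpy as np

def flow_magnitude_std(flow):
    """Standard deviation of the magnitude matrix of a dense flow field.

    `flow` is an H x W x 2 array holding the (Vx, Vy) components per pixel.
    """
    flow = np.asarray(flow, dtype=np.float64)
    magnitude = np.hypot(flow[..., 0], flow[..., 1])
    return float(magnitude.std())

def is_local_peak(sd_window):
    """True if the centre value is strictly larger than all of its neighbours.

    `sd_window` holds the sd values around one candidate, e.g.
    (sd_{i-2}, sd_{i-1}, sd_i, sd_{i+1}, sd_{i+2}) for n = 2.
    """
    sd = list(sd_window)
    centre = len(sd) // 2
    return all(sd[centre] > v for k, v in enumerate(sd) if k != centre)

# Tiny 2 x 2 flow field: half the pixels move by (3, 4), half do not move.
flow = np.array([[[3.0, 4.0], [0.0, 0.0]],
                 [[3.0, 4.0], [0.0, 0.0]]])
sd_value = flow_magnitude_std(flow)

# Hypothetical sd windows for two candidates (values invented for illustration).
peak = is_local_peak([0.4, 0.6, 7.9, 0.5, 0.3])      # centre dominates
not_peak = is_local_peak([0.4, 8.2, 7.9, 0.5, 0.3])  # a neighbour is larger
print(sd_value, peak, not_peak)
```

Using a strict inequality means that a candidate tied with a neighbour is not accepted; whether ties should count is a design choice that the text leaves open.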
It was observed from the results that most motion and lighting effects were suppressed, but some motion effects could still be classified as candidate abrupt transitions. Hence, sd_i must not only be the largest value in SD but must also exceed an experimentally determined threshold TH2, to reduce the detection of false boundaries caused by high activity of video components. Equation (9) must be satisfied to confirm a candidate abrupt transition; otherwise the potential transition is ignored:

$$AT_k = i \quad \text{if } sd_i = \mathrm{Max}(SD) \ \text{and} \ sd_i > TH2 \qquad (9)$$

AT_k: the frame index i of the kth abrupt transition. k: the index of the abrupt transition. Max(SD): the maximum value of SD.

Experimental Results
[Table 1]
Performance was evaluated on the basis of the Recall (R), Precision (P) and F-Score (F) measures. These measures can be defined as follows [25]:

$$R = \frac{True}{True + Miss} \qquad (10)$$

$$P = \frac{True}{True + False} \qquad (11)$$

$$F = \frac{2 \times R \times P}{R + P} \qquad (12)$$

where True represents the correct detection of a transition, False represents the false detection of a transition, and Miss represents the missed detection of a transition. Execution time was also used to evaluate performance; it measures how quickly the proposed method can produce its results. A video file was provided to the proposed method as input. The identified transition frames were compared with the corresponding ground truth to identify the True, Miss and False boundaries. The code was implemented in MATLAB R2018a on an HP laptop with an Intel Core i7-5500U processor clocked at 4.40 GHz. TH1 and TH2 were set to 1.8 and 10, respectively. Table 2 summarizes the performance of the proposed method according to the R, P and F measures and the execution time. It is clear from Table 2 that the R, P and F measures are very high, indicating the high performance of the proposed method. In addition, the proposed method has a low execution time. The abrupt boundaries between shots were recognized in the presence of illumination variations and camera/object motion, which are the main sources of misleading abrupt boundaries. As a further illustration, Figure 2 shows various samples of frame sequences with disturbance factors, including illumination variations, fast object motion and camera motion, which are correctly detected as non-abrupt transitions. This indicates that the proposed algorithm is robust to illumination changes and camera/object motion.
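For reference, the Recall, Precision and F-Score measures used in this section can be computed from the True, False and Miss counts as in the short sketch below. The function name and the example counts are hypothetical and are not taken from the reported tables.

```python
def evaluate_detection(true_count, false_count, miss_count):
    """Return (recall, precision, f_score) from detection counts.

    true_count:  correctly detected transitions
    false_count: falsely detected transitions
    miss_count:  missed transitions
    """
    detected = true_count + false_count
    actual = true_count + miss_count
    recall = true_count / actual if actual else 0.0
    precision = true_count / detected if detected else 0.0
    if recall + precision == 0:
        return recall, precision, 0.0
    f_score = 2 * recall * precision / (recall + precision)
    return recall, precision, f_score

# Hypothetical counts, chosen only to illustrate the arithmetic.
recall, precision, f_score = evaluate_detection(true_count=49,
                                                false_count=1,
                                                miss_count=1)
print(recall, precision, f_score)
```

The guards against empty denominators are an added convenience for videos with no detections or no ground-truth transitions; the definitions in the text do not address that case.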
As mentioned earlier, the resolution of the video frames was reduced to 256×256. The time required to compute the optical flow field from an original-size frame is greater than the total time required to resize the frame and extract the optical flow field from the lower-resolution frame. In addition, the projection features are also faster to extract from lower-resolution frames. Figure 3 visualizes the time required for optical flow estimation and projection feature extraction before and after resizing the selected frames used as test material.
It is observed from the timing measurements that the 125×125 frame size consumes less time than the other frame sizes. In order to demonstrate the computational gain obtained by resizing frames while maintaining detection quality, two short videos (VF1 and VF9) were taken as illustrative examples. The proposed method was applied before and after resizing the video frames. The results are summarized in Table 3. Although the execution time was lowest when the frame size was 125×125, the accuracy was also lowest, as shown by the F values in Table 3. Therefore, the frame size was set to 256×256, since the execution time was clearly reduced without affecting the performance of the proposed method. As explained above, computational complexity is an important aspect of effective shot detection methods. Detecting candidate transitions reduces the computational cost by avoiding optical flow estimation on all frames. Running the proposed method without candidate transition detection would be very expensive. To give a clear overall indication of the efficiency of the proposed method, its performance without candidate transition detection was evaluated. In addition, the performance of the proposed method with candidate transition detection only was evaluated, where only the projection feature differences between frames were used. The performance evaluation according to R, P, F and time cost is reported in Table 4. With reference to Table 4, several important observations can be made. For example, using candidate transition detection only is superior, in terms of execution time alone, to using abrupt transition specifying only; however, the performance decreases by around 10% according to the F measure. On the other hand, the proposed method required significant execution time when only the abrupt transition specifying stage was used.
This variation in time cost is mainly due to the difference in the time required to extract the projection features and the optical flow field. When comparing the results of Table 2 and Table 4, the two-stage integration of the proposed method is shown to have the most efficient performance in terms of R, P, F and time cost. A comparison between the proposed method and other existing methods was performed to further demonstrate the capability of the proposed method. The comparison was made with the methods presented in [19,20]. Video files VF1 to VF8 were used for the comparison, in accordance with the protocol of [20], where Dhiman et al. presented a quantitative comparison with the method of [19] according to R, P, F and time cost. The same protocol was adopted for the proposed method. The performance comparisons of the proposed method with the existing methods in terms of time cost, R, P and F are tabulated in Table 5 and Table 6. For better comparison, bar graph representations of the obtained results are also provided in Figure 4 to Figure 7.

Video Files | Method in [19] | Method in [20] | Proposed Method

It is noticeable that the proposed method had a lower execution time than the other methods and could still detect the abrupt shots at high performance. The results of the comparative analysis showed the superior performance of the proposed method versus the other current methods.

Conclusions
A new method is presented in this paper to address the problem of balancing the speed and accuracy of shot boundary detection. The proposed method is based on candidate transition selection, which discards most non-boundary frames before abrupt transition detection, helping to reduce the execution time and detect abrupt shots effectively. First, projection features are employed to select the candidate transitions that may contain abrupt boundaries. Then the optical flow field is estimated for the candidate transitions and their adjacent frames to filter the candidate transitions and determine the abrupt boundaries by recognizing discontinuities in the optical flow field. Experimental results showed the effectiveness of the proposed method, which achieved excellent accuracy in terms of Recall, Precision and F-measure and noticeably increased the detection speed. The performance of the proposed method is satisfactory for recognizing abrupt shot transitions in the presence of motion and illumination changes in video sequences. Furthermore, the presented method outperforms the existing methods it was compared with.