Deep Learning Techniques for Video Summarization Based on Object Detection

Shadan Abdul Haleem; Eman Hato

doi:10.24996/ijs.2026.67.5.37

Authors

Shadan Abdul Haleem, Department of Computer Science, College of Science, Mustansiriyah University, Baghdad, Iraq, https://orcid.org/0009-0005-0964-5539
Eman Hato, Department of Computer Science, College of Science, Mustansiriyah University, Baghdad, Iraq

DOI:

https://doi.org/10.24996/ijs.2026.67.5.37

Keywords:

Deep Learning, Video Summarization, Keyframe Selection, Object Detection, Clustering Algorithm

Abstract

With the rapid growth of video content, effective video summarization methods are essential. This paper introduces a new deep learning framework for video summarization based on object detection. First, YOLOv8 identifies objects in each frame from every 15-frame sequence. The detected objects are then cropped and resized, and their features are extracted with a Residual Neural Network (ResNet-50). Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) then groups the objects into clusters. Finally, keyframes are randomly selected from each object cluster to create a concise summary. The main contributions are the identification of video objects, such as people and vehicles, to retain the most informative content, and the generation of a video summary that significantly reduces the original length while preserving a diverse range of video content. The framework’s performance was evaluated on the SumMe dataset, with accuracy and F1-score as the key metrics. Results show an overall detection accuracy of 0.8988 and an F1-score of 0.9451. The method produced very short summaries, reducing viewing time by an average of 95% relative to the original videos while maintaining summary reliability.
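The final step of the pipeline, random keyframe selection from each object cluster, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the detection, feature extraction, and clustering stages have already produced one cluster label per cropped object, with -1 marking noise as HDBSCAN does. The function names select_keyframes and length_reduction, the per_cluster and seed parameters, the choice to skip noise objects, and the frame-count definition of length reduction are all hypothetical choices made for illustration.

```python
import random
from collections import defaultdict


def select_keyframes(frame_ids, labels, per_cluster=1, seed=0):
    """Randomly pick keyframes from each cluster of detected objects.

    frame_ids[i] is the index of the frame that object i was cropped from.
    labels[i] is the cluster label of object i (-1 marks noise).
    """
    rng = random.Random(seed)

    # Collect the distinct frames in which each cluster appears.
    frames_by_cluster = defaultdict(set)
    for frame_id, label in zip(frame_ids, labels):
        if label == -1:
            # Assumption: noise objects contribute no keyframes.
            continue
        frames_by_cluster[label].add(frame_id)

    # Draw up to per_cluster random frames from every cluster.
    keyframes = set()
    for label in sorted(frames_by_cluster):
        candidates = sorted(frames_by_cluster[label])
        k = min(per_cluster, len(candidates))
        keyframes.update(rng.sample(candidates, k))
    return sorted(keyframes)


def length_reduction(num_keyframes, num_total_frames):
    """Fraction of the original frames left out of the summary (assumed definition)."""
    return 1.0 - num_keyframes / num_total_frames


# Toy input: seven detected objects drawn from a hypothetical 75-frame video.
frame_ids = [0, 0, 15, 30, 30, 45, 60]
labels = [0, 1, 0, 1, -1, 2, 2]

keyframes = select_keyframes(frame_ids, labels)
reduction = length_reduction(len(keyframes), num_total_frames=75)
```

Because two clusters may select the same frame, the summary can contain fewer keyframes than there are clusters. Each cluster is still represented by at least one selected frame.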