Automatic Image and Video Tagging Survey

Marking content with descriptive terms that depict the image content is called “tagging.”


Introduction
Image tagging involves analyzing the objects inside an image and assigning a tag that properly depicts the image content. Video tagging, on the other hand, is the process of adding a tag to each keyframe in the video [1,2]. Image tagging makes internet searching easier, enables the quick organization of a tremendous number of images, and makes them easily accessible. Due to the significant expansion of multimedia content already available and continuously uploaded and shared on social media platforms, machine learning algorithms play a significant role in making such information easier to find and link [3]. This review covers the goals of each paper, the datasets, and the techniques used by different researchers to improve image and video tagging efficiency, and it also reports the results of each paper. The rest of this paper is organized as follows: Section 2 presents the most common datasets used in the image and video tagging domains as well as the well-known evaluation metrics. Section 3 reviews, in summarized tables, the most common methods in image and video tagging. Section 4 contains the discussion, and the conclusion is presented in Section 5.

Datasets & Evaluation Metrics
2.1 Datasets
Multiple researchers use tagging in various domains, such as developing an automatic attendance system for college students [4], video tagging of elderly activities in the K-Log center for Alzheimer's patients [5], and movie segmentation [6]. For this kind of research, a special dataset was collected to fulfill the main objective of the work [7]. For tagging as the primary objective, the researchers used the following datasets: Corel5K, NUS-WIDE, YouTube-8M, YFCC100M, ESP Game, IAPRTC-12.5, Tencent Advertisement Video, Chicago Face Database (CFD), and Event. Table 1 describes the datasets in detail.

Table 1: Datasets used in image and video tagging research (dataset name, number of images/videos, description)

Corel5K [1], 5,000 images: The average manual annotation per image is 3.5 keywords drawn from 260 predetermined terms.

NUS-WIDE [8], 269,648 images: A real-world web image dataset created by the National University of Singapore. (1) 269,648 images joined with 5,018 tags. (2) Low-level features include a colored histogram, an edge direction histogram, wavelet texture, block-wise color moments, and a bag of words based on SIFT descriptors extracted from the images. (3) A ground truth of 81 concepts is supplied for evaluation purposes.

YFCC100M [9], nearly 100 million media items: A subset of Flickr images combined with hashtags and GPS coordinates.

Tencent Advertisement Video [10], 10,000 videos: 500 videos, labeled using timestamps, are used for training and 500 videos for testing. The average video length is 42.74 ± 14.16 seconds. Each scene carries a series of tags representing the classes it belongs to, and there is no overlap between scenes.

IAPRTC-12.5 [11,12], 19,627 images: Tags are inferred from the captions associated with each image. The dataset covers sports, actions, people, animals, cities, landscapes, and many other aspects.

ESP Game [11,12], 20,770 images: Diverse image types, such as logos, drawings, and personal photos. A total of 268 tags are included in the dataset.

Video dataset (name not given in source), videos: The dataset contains 208,978 video frames with an average length per video of 72.14 seconds.
Approximately 25 research papers, published between 2008 and 2022, were analyzed in this survey. All of the analyzed papers fall within the domain of image and video tagging. The most frequently used datasets were Corel5K and NUS-WIDE due to the diversity of classes in each dataset. In addition, each dataset is provided with manually tagged labels, which eases the training of a tagging system. Moreover, a valuable set of low-level features was extracted and provided with the datasets to increase the accuracy of the tagging process.
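As an illustration of the kind of low-level features distributed with these datasets (not the datasets' own extraction code), the following sketch computes a global color histogram and a simple edge direction histogram with OpenCV; the bin counts, the edge threshold, and the file name are illustrative assumptions.

```python
# Illustrative sketch (not the datasets' official feature extractors): computes a
# global color histogram and a coarse edge-direction histogram for one image.
import cv2
import numpy as np

def color_histogram(img_bgr, bins=8):
    """Concatenated per-channel color histogram, L1-normalized."""
    feats = []
    for ch in range(3):
        h = cv2.calcHist([img_bgr], [ch], None, [bins], [0, 256]).flatten()
        feats.append(h)
    v = np.concatenate(feats)
    return v / (v.sum() + 1e-8)

def edge_direction_histogram(img_bgr, bins=18):
    """Histogram of gradient orientations on the strongest edge pixels (assumed binning)."""
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    mag, ang = cv2.cartToPolar(gx, gy, angleInDegrees=True)
    edge_mask = mag > np.percentile(mag, 90)      # keep the strongest 10% as "edges"
    h, _ = np.histogram(ang[edge_mask], bins=bins, range=(0, 360))
    return h / (h.sum() + 1e-8)

if __name__ == "__main__":
    img = cv2.imread("example.jpg")               # hypothetical file name
    feature = np.concatenate([color_histogram(img), edge_direction_histogram(img)])
    print(feature.shape)
```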

2.2 Evaluation Metrics
Most researchers have recently used per-image metrics, such as precision (Eq. 1), recall (Eq. 2), F1-measure (Eq. 3), accuracy (Eq. 4), and mean average precision (mAP) (Eq. 5), to evaluate tagging performance [12,15]. The metric values are averaged over all the images in the test dataset to obtain average per-image metrics [14]. The definitions of the per-image metrics are as follows:

$\text{Precision} = \frac{TP}{TP + FP}$  (1)

$\text{Recall} = \frac{TP}{TP + FN}$  (2)

$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$  (3)

$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$  (4)

$mAP = \frac{1}{k}\sum_{i=1}^{k} AP_i$  (5)

where TP is the number of tags that are predicted by the model and correctly match the ground truth, TN is the number of tags that are neither predicted by the model nor part of the ground truth, FP is the number of tags predicted by the model but not included in the ground truth, and FN is the number of tags that are not predicted by the model but are part of the ground truth. The F1-measure combines precision and recall, $AP_i$ is the average precision of class $i$, and $k$ is the number of classes.
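A minimal sketch of the per-image metrics defined above for multi-label tagging follows; the tag vocabulary and the example tags are illustrative, not taken from any surveyed dataset.

```python
# Minimal sketch of the per-image metrics (Eqs. 1-4) for multi-label tagging.
# The vocabulary and example tags below are illustrative.
def per_image_metrics(predicted, ground_truth, vocabulary):
    pred, gt = set(predicted), set(ground_truth)
    tp = len(pred & gt)                          # predicted and in the ground truth
    fp = len(pred - gt)                          # predicted but not in the ground truth
    fn = len(gt - pred)                          # in the ground truth but not predicted
    tn = len(set(vocabulary) - pred - gt)        # neither predicted nor in the ground truth
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy

vocab = ["sky", "beach", "people", "car", "dog"]
print(per_image_metrics(["sky", "beach", "car"], ["sky", "beach", "people"], vocab))
# -> (0.666..., 0.666..., 0.666..., 0.6)
```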

Literature review
Many researchers use diverse methods and a tremendous number of features and techniques for image and video tagging. Borth, Damian, et al. (2008) used the six Tamura features, namely contrast, directionality, coarseness, line-likeness, regularity, and roughness, for automatic video tagging [16]. One of the interesting methods is implicit tagging, introduced by J. Jiao and M. Pantic in 2010, which tags multimedia data based on a user's nonverbal reactions, such as facial expressions and head gestures; nineteen facial points were tracked to capture facial expressions and used to judge the explicit tagging [17]. Other researchers, such as Yang et al. (2011), transferred knowledge between images and videos, with tags assigned by using structures embedded within both the image and video spaces [15]. Binti Zakaria (2012) used a City Landscape Identifier (CLI) to represent image content by exploiting the edge direction and then developed a classifier for automatically tagging images with "buildings" or "non-buildings" tags. Gomez, Raul, et al. (2020) used a large set of images, tags, and geographical coordinates to build a model for tagging and retrieving images when the query combines a hashtag and location information. To reduce the effort and working time of the caregivers (CGs) who log and monitor Alzheimer's Disease (AD) patients, a multi-modal fusion based on machine-assisted human tagging of videos and an object detection model was introduced by Lee, Chanwoong, et al. in 2020 [5]. The authors of [6] predicted the most relevant tags for a movie and segmented it at the viewer's choice, and [4] proposed a real-time attendance system based on image tagging with the LBPH algorithm to overcome the time wasted in queues for biometrics or face scanning.
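As a concrete illustration of the LBPH-based face tagging mentioned for the attendance system in [4], the following hedged sketch trains OpenCV's LBPH face recognizer and uses its prediction to tag a face crop with a name; the file paths, label-to-name map, and confidence threshold are assumptions, not details from the cited paper.

```python
# Hedged sketch of LBPH-based face tagging (requires opencv-contrib-python).
# Paths, the label-to-name map, and the confidence threshold are assumptions.
import cv2
import numpy as np

label_names = {0: "student_a", 1: "student_b"}   # hypothetical label map

# Training data: grayscale face crops with integer labels (hypothetical files).
faces = [cv2.imread(p, cv2.IMREAD_GRAYSCALE) for p in ["a1.png", "a2.png", "b1.png"]]
labels = np.array([0, 0, 1])

recognizer = cv2.face.LBPHFaceRecognizer_create()
recognizer.train(faces, labels)

# Tag a new face crop: a lower "confidence" value means a closer LBPH match.
probe = cv2.imread("unknown.png", cv2.IMREAD_GRAYSCALE)
label, confidence = recognizer.predict(probe)
tag = label_names[label] if confidence < 80 else "unknown"   # 80 is an assumed threshold
print(tag, confidence)
```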

Representative video tagging entries from the summarized tables are as follows:

YouTube-8M dataset: The tag reference score is calculated from the frequency of occurrence of each tag, and then a tag neighbor voting algorithm is applied (illustrated in the sketch below). The authors did not report numerical results, describing the method only as "effective and efficient".
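A minimal sketch of the neighbor voting idea follows, assuming tags are transferred from the visually nearest training items and each vote is discounted by the tag's global frequency; the weighting and data structures are illustrative, not the cited work's exact formulation.

```python
# Hedged sketch of tag neighbor voting: neighbors vote for their tags, and each
# vote is discounted by the tag's global frequency (prior), so common tags do not
# dominate. The weighting is an assumption, not the surveyed paper's exact score.
from collections import Counter

def neighbor_vote_tags(neighbor_tag_lists, global_tag_counts, total_items, top_k=5):
    votes = Counter()
    for tags in neighbor_tag_lists:              # one tag list per retrieved neighbor
        votes.update(set(tags))
    scores = {}
    for tag, v in votes.items():
        prior = global_tag_counts.get(tag, 0) / total_items   # expected frequency
        scores[tag] = v / len(neighbor_tag_lists) - prior     # vote share above prior
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy usage with hypothetical neighbors and corpus statistics.
neighbors = [["beach", "sky"], ["beach", "people"], ["beach", "sky", "dog"]]
corpus_counts = {"beach": 50, "sky": 400, "people": 300, "dog": 20}
print(neighbor_vote_tags(neighbors, corpus_counts, total_items=1000, top_k=3))
```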

Lee, Chanwoong, et al. (2020) [5]: The objective is to reduce the effort and working time of the caregivers (CGs) who take care of and log the activities of Alzheimer's Disease patients. Dataset: K-Log Centre surveillance videos. Method: YOLO-v3 object detection integrated with HAR models, which are used for automatic tagging of the surveillance videos. Result: an accuracy of 81.4% on live video.
Movie tag prediction and segmentation [6]: A tag vocabulary of 50 movie tags was defined, and a dataset was created for each tag by collecting around 700 images per tag from several movies. Method: a deep learning-based technique predicts the most relevant tags for a movie and segments the movie according to the predicted tags; a pre-trained Inception-V3 CNN was fine-tuned via transfer learning, and a frame detection algorithm was then applied (a transfer-learning sketch follows this entry). Result: mAP of 76.50% and an F1-score of 0.7551.
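The following is a minimal transfer-learning sketch in the spirit of the Inception-V3 approach described above, written with Keras; the 50-tag output matches the vocabulary size quoted above, while the pooling, dense layer size, optimizer, and loss are illustrative choices rather than the cited work's configuration.

```python
# Hedged sketch: fine-tuning a pre-trained Inception-V3 for multi-label movie-tag
# prediction. Layer sizes and training settings are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import InceptionV3

NUM_TAGS = 50                                   # size of the movie-tag vocabulary

base = InceptionV3(weights="imagenet", include_top=False, input_shape=(299, 299, 3))
base.trainable = False                          # freeze pre-trained convolutional features

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation="relu"),
    layers.Dense(NUM_TAGS, activation="sigmoid"),   # sigmoid: several tags may co-occur
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",        # multi-label objective
              metrics=[tf.keras.metrics.AUC(multi_label=True)])

# model.fit(train_images, train_tag_matrix, epochs=5)  # hypothetical training data
```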

Video ad segmentation and tagging pipeline: Objective: create a pipeline for segmenting and tagging videos. Dataset: Tencent Advertisement Video. Method: a bi-level approach that first proposes scene boundaries and then merges segments using a confidence score for each segmented scene, with a class tag predicted for every segmented video ad. Result: a mAP of 0.86.

As mentioned earlier, the researchers used different global datasets and various accuracy measures in line with the nature and idea of each study, and some researchers created their own custom datasets. A direct comparison to determine the best method would therefore not be fair; however, approximate conclusions can be drawn by comparing researchers who used the same dataset and metric, as shown in Figures 1-a, 1-b, and 1-c, where the researchers used the Corel, IAPRTC, and ESP Game datasets, respectively. The work of researchers who created custom datasets is instead compared based on the size of the dataset that was created, as shown in Figure 1-d.

Figure 1 (a, b, c, d): Side-by-side comparisons of different researchers on the Corel, IAPRTC, and Tencent Advertisement Video datasets, respectively; Figure 1-d shows side-by-side comparisons for the custom-created datasets.

Discussion
There are diverse methods used by researchers for image and video tagging. Most of the methods used a supervised approach, as in [4], [5], [7], [12], [14], [16], [19], and [20], whereas few researchers adopted an unsupervised approach, as in [6], [8], [15], and [25]. Both the supervised and unsupervised approaches leveraged different types of features. Some researchers, such as [16,19], adopted low-level features followed by a simple classification method to map between image and tag; exploiting low-level features conveys the correctness of the tag associated with an image better than a random guess, but low-level features have poor noise resistance. To create classifiers for generating a wide range of tags, it is necessary to use more powerful low-level features, such as visual terms. Colored histogram features are used for tagging purposes, and sometimes the histogram features are combined with LBP and Haar features to enhance the results, as in [12,20]. These methods are computationally efficient in that they require only one matrix inversion per iteration. However, the prediction loss for each tag is weighed equally, so the overall loss is dominated by contributions from the more frequent tags, sacrificing the prediction accuracy of rare tags. The main challenge in fully automated HAR scenarios is collecting the datasets, which demands precise human and object detection during the testing phase. The real-time methods used by [5,20] show that the existing strategies do not operate properly under distinctive illumination conditions. The methods that gave the best results used a CNN as the main algorithm for mapping between the image features and the semantic tag, as in [9,27], [14], and [6]. It is pertinent to mention that the data used for training CNN-based models is manually tagged. In light of the spread of epidemics and infectious diseases, a future scope for the tagging domain is robots in sanitary isolation rooms that help patients and reduce infection rates for medical staff.
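To make the frequent-tag dominance problem concrete, a hedged sketch follows: it weights the per-tag binary cross-entropy loss by the ratio of negative to positive examples, so that rare tags contribute more to the overall loss. PyTorch's pos_weight mechanism is used as one possible realization; this is not the formulation of any specific surveyed paper.

```python
# Hedged sketch: counteracting frequent-tag dominance in a multi-label tagging loss.
# Rare tags get larger positive-class weights via BCEWithLogitsLoss(pos_weight=...).
# The weighting scheme is illustrative, not taken from a specific surveyed paper.
import torch
import torch.nn as nn

def make_weighted_tag_loss(tag_counts, num_images):
    """tag_counts[i] = number of training images carrying tag i."""
    counts = torch.tensor(tag_counts, dtype=torch.float32)
    # Ratio of negatives to positives per tag: rare tags receive a large weight.
    pos_weight = (num_images - counts).clamp(min=1.0) / counts.clamp(min=1.0)
    return nn.BCEWithLogitsLoss(pos_weight=pos_weight)

# Toy usage: four tags, one far rarer than the others.
criterion = make_weighted_tag_loss(tag_counts=[900, 500, 400, 10], num_images=1000)
logits = torch.randn(8, 4)                       # model outputs for a batch of 8 images
targets = torch.randint(0, 2, (8, 4)).float()    # ground-truth tag indicators
print(criterion(logits, targets))
```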

Conclusion
Marking an image with descriptive terms is also called "tagging." A huge range of digital enterprises depend on photo tagging to manage their visual assets; e-commerce, stock photo databases, booking and travel platforms, traditional and social media, and a variety of other businesses require adequate and efficient image sorting systems. Tagging is also useful to individuals; personal photo libraries can be difficult to organize and search through without user-friendly image categorization and tagging. Decades ago, traditional indexing was performed by a librarian. An intriguing alternative to traditional indexing was collaborative tagging, or "folksonomy," the practice of allowing users to attach tags to data; however, collaborative tagging suffers from slowness, expense, high subjectivity, and the inability to scale to multi-million image libraries. There is therefore strong interest among computer vision researchers in developing robust and efficient automatic image and video tagging systems. A discriminative model, a generative model, a nearest neighbor model, or a deep learning model could be used for automatic tagging. A discriminative model treats tagging as a multi-label classification problem: a separate classifier is trained for each label using the features extracted from the images, and the trained classifiers then predict the tags of a test image. By learning joint distributions over visual and contextual features, generative models detect the dependency between visual features and associated tags. Generative models made a remarkable contribution to the development of tagging, but the complexity of their algorithms prevented them from achieving optimal tag prediction. Nearest neighbor models, in contrast to generative models, became widely used in the tagging domain due to their simplicity. A nearest neighbor method selects similar neighbors and then assigns their tags to the test image; image-to-image similarity, image-to-tag similarity, or both can be used. Subsequently, a greedy label transfer mechanism is employed to assign the tags, selecting them according to the co-occurrence and frequency factors of the nearest neighbors. Finally, many computer vision tasks have achieved high-quality performance by adopting deep learning-based methods that extract effective feature vectors from images to make accurate mappings between the image and the semantic tag.
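A minimal sketch of nearest neighbor tagging with greedy label transfer, as described above, follows: tags are ranked by how often they occur among the visually nearest training images and by their co-occurrence with tags already selected. The distance function, the weighting, and the toy data are illustrative assumptions rather than a specific published algorithm.

```python
# Hedged sketch: nearest-neighbor tag assignment with a greedy label transfer step.
# Neighbor selection uses Euclidean distance on feature vectors; tags are then picked
# greedily by neighbor frequency plus co-occurrence with already selected tags.
import numpy as np
from collections import Counter
from itertools import combinations

def greedy_label_transfer(test_feat, train_feats, train_tags, k=3, n_tags=3):
    # 1. Find the k visually nearest training images.
    dists = np.linalg.norm(train_feats - test_feat, axis=1)
    neighbor_ids = np.argsort(dists)[:k]
    neighbor_tags = [train_tags[i] for i in neighbor_ids]

    # 2. Tag frequency among neighbors and pairwise co-occurrence counts.
    freq = Counter(t for tags in neighbor_tags for t in set(tags))
    cooc = Counter()
    for tags in neighbor_tags:
        for a, b in combinations(sorted(set(tags)), 2):
            cooc[(a, b)] += 1

    # 3. Greedily pick tags: frequency plus co-occurrence with tags already chosen.
    chosen = []
    candidates = set(freq)
    while candidates and len(chosen) < n_tags:
        best = max(candidates, key=lambda t: freq[t] +
                   sum(cooc[tuple(sorted((t, c)))] for c in chosen))
        chosen.append(best)
        candidates.remove(best)
    return chosen

# Toy usage with hypothetical 2-D features and tag lists.
train_feats = np.array([[0.0, 0.0], [0.1, 0.1], [0.9, 0.9], [1.0, 1.0]])
train_tags = [["beach", "sky"], ["beach", "people"], ["city", "night"], ["city", "car"]]
print(greedy_label_transfer(np.array([0.05, 0.05]), train_feats, train_tags))
```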
Tables 2 and 3 describe in detail 23 research papers, published in the Scopus and Web of Science databases from 2008 to 2022, on image and video data tagging, respectively.

Table 2: Image tagging research

Table 3: Video tagging research