Applying Similarity Measures to Improve Query Expansion

Wajih A. Ghani A. Hussain

doi:10.24996/ijs.2021.62.6.31

Authors

Wajih A. Ghani A. Hussain, Department of Computer Science, College of Science, University of Baghdad, Baghdad, Iraq

DOI:

https://doi.org/10.24996/ijs.2021.62.6.31

Keywords:

Information Retrieval, Query expansion, Data source search, Cosine Similarity, Jaccard Similarity

Abstract

The rapid evolution of information technologies, especially in the last few decades, has increased the volume of data on the World Wide Web, which is still growing significantly. Retrieving relevant information from the Internet or any other data source with a query of only a few words has become a major challenge. To overcome this, query expansion (QE) plays an important role in improving information retrieval (IR): the user's original query is reformulated into a new query by appending related terms of the same importance. One of the problems of query expansion is choosing suitable terms. This problem leads to another challenge: how to retrieve the important documents with high precision, high recall, and a high F-measure. In this paper, we address this problem by applying different similarity measures together with English WordNet. The obtained results show that, with a suitable selection method, English WordNet can be used to improve retrieval efficiency. The proposed approach extracts the terms from all the documents and the query, then applies the following steps: preprocessing, expanding the query based on English WordNet, selecting the best terms, term weighting, and finally using cosine similarity and Jaccard similarity to obtain the relevant documents.
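To make the pipeline above concrete, the following is a minimal Python sketch of query expansion followed by cosine and Jaccard scoring. It is an illustration under stated assumptions, not the authors' implementation: `TOY_SYNONYMS` is a hypothetical hand-written dictionary standing in for English WordNet, `tokenize` is a placeholder for the preprocessing step, raw term counts are assumed as the term weights, and the term-selection step is omitted.

```python
import math
from collections import Counter

# Hypothetical stand-in for an English WordNet synonym lookup (assumption).
TOY_SYNONYMS = {
    "car": ["automobile"],
    "fast": ["quick"],
}

def tokenize(text):
    """Lowercase and split on whitespace; a placeholder for real preprocessing."""
    return text.lower().split()

def expand_query(terms, synonyms=TOY_SYNONYMS):
    """Append the synonyms of each query term, keeping the original terms first."""
    expanded = list(terms)
    for term in terms:
        for syn in synonyms.get(term, []):
            if syn not in expanded:
                expanded.append(syn)
    return expanded

def cosine_similarity(a_terms, b_terms):
    """Cosine of the angle between two raw term-count vectors."""
    a, b = Counter(a_terms), Counter(b_terms)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def jaccard_similarity(a_terms, b_terms):
    """Size of the shared term set divided by the size of the combined term set."""
    a, b = set(a_terms), set(b_terms)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

original = tokenize("fast car")
query = expand_query(original)          # ['fast', 'car', 'quick', 'automobile']
doc = tokenize("a quick automobile")

cos_before = cosine_similarity(original, doc)   # no shared terms, so 0.0
cos_after = cosine_similarity(query, doc)       # 2 shared terms out of 4 and 3
jac_after = jaccard_similarity(query, doc)      # 2 shared terms / 5 distinct terms
```

In this toy case the unexpanded query shares no term with the document, while the expanded query matches it under both measures, which is the effect that query expansion aims for.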

Our experiments were conducted on the DUC2002 dataset, which contains 559 documents distributed over several categories. The average precision of cosine similarity (for random queries) was 100%, whereas the average precision of Jaccard similarity was 84.4%. The average recall of cosine similarity was 86.8%, whereas the average recall of Jaccard similarity was 73.4%. The average F-measure of cosine similarity was 92%, whereas the average F-measure of Jaccard similarity was 76%.
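For reference, the three evaluation measures reported above can be computed per query as in the sketch below. The function name `precision_recall_f1` and the document identifiers are hypothetical, and the example values are illustrative only; they are not taken from the DUC2002 experiments.

```python
def precision_recall_f1(retrieved, relevant):
    """Return (precision, recall, F-measure) for one query.

    retrieved: document ids returned by the system
    relevant:  document ids judged relevant for the query
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Illustrative query: every retrieved document is relevant, one relevant document is missed.
p, r, f = precision_recall_f1(["d1", "d2", "d3"], ["d1", "d2", "d3", "d4"])
```

Note that the mean of per-query F-measures generally differs from the F-measure computed from the mean precision and mean recall, so averaged figures need not satisfy the harmonic-mean formula exactly.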