A Hybrid ANN-LSTM Speaker Identification Using Advanced Feature Extraction Techniques

Maha Adnan Shanshool; Husam Ali Abdalmohsen

doi:10.24996/ijs.2025.66.5.23

Authors

Maha Adnan Shanshool, Computer Science Department, College of Science, University of Baghdad, Iraq https://orcid.org/0009-0002-8283-4326
Husam Ali Abdalmohsen, Computer Science Department, College of Science, University of Baghdad, Iraq

DOI:

https://doi.org/10.24996/ijs.2025.66.5.23

Keywords:

Speaker identification, Mel-frequency cepstral coefficients (MFCC), linear predictive coding (LPC), artificial neural network (ANN), long short-term memory (LSTM), Ghadeer Speech Crowd Corpus (GSCC)

Abstract

Over the past decades, speaker identification has attracted the attention of many researchers and security companies because of its many applications in identifying individuals. This work therefore designs and implements a speaker identification system. The system begins with a preprocessing phase that involves silence removal, outlier removal, feature quantization, and the extraction of linear predictive coding (LPC) and Mel-frequency cepstral coefficient (MFCC) features. The system also computes the mean and standard deviation of all features. The classification phase applies machine learning and deep learning models: convolutional neural networks (CNN), artificial neural networks (ANN), long short-term memory (LSTM) networks, and random forests (RF). The novel contribution of this work is a hybrid architecture that combines ANN and LSTM. The proposed hybrid speaker identification system shows high processing efficiency and achieves accuracy rates of 94.63% and 99.2%. This study contributes to the advancement of speech recognition technologies by highlighting the adaptability and practical value of the hybrid ANN-LSTM model, especially in situations where speed is essential. The system was evaluated on a large dataset combined from TIMIT, Prominent Leaders, Fluent Speech Command, and the GSCC dataset, all of which consist of audio files only.
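As an illustration of the preprocessing and summarization steps named in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes an energy-based rule for silence removal and uses a hypothetical `toy_features` function (log energy in a few FFT bands) as a stand-in for MFCC and LPC extraction. The frame length, hop size, threshold, and all function names are assumptions made for this example.

```python
import numpy as np

def frame_signal(signal, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames, one frame per row."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]

def remove_silence(frames, rel_threshold=0.1):
    """Keep frames whose RMS energy exceeds a fraction of the maximum RMS.

    Assumption: a simple relative energy threshold; the paper does not
    specify its silence-removal rule in the abstract.
    """
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return frames[rms > rel_threshold * rms.max()]

def toy_features(frames, n_bands=8):
    """Hypothetical stand-in for MFCC/LPC: log energy in FFT bands."""
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    bands = np.array_split(power, n_bands, axis=1)  # roughly equal-width bands
    return np.log(np.stack([b.sum(axis=1) for b in bands], axis=1) + 1e-10)

def summarize(features):
    """Concatenate the per-dimension mean and standard deviation over frames."""
    return np.concatenate([features.mean(axis=0), features.std(axis=0)])

# Synthetic one-second tone surrounded by near-silent noise (16 kHz).
rng = np.random.default_rng(0)
sr = 16000
t = np.arange(sr) / sr
tone = 0.5 * np.sin(2 * np.pi * 220 * t)
quiet = 0.001 * rng.standard_normal(sr)
signal = np.concatenate([quiet, tone, quiet])

frames = frame_signal(signal)
voiced = remove_silence(frames)
vector = summarize(toy_features(voiced))
print(frames.shape, voiced.shape, vector.shape)
```

The resulting fixed-length vector (mean and standard deviation per feature dimension) is the kind of utterance-level input a classifier could consume. In practice, the stand-in features would be replaced by real MFCC and LPC coefficients.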