Speech Isolation and Recognition in Crowded Noise Using a Dual-Path Recurrent Neural Network
DOI: https://doi.org/10.24996/ijs.2024.65.10.37

Keywords: speech separation, dual-path recurrent neural network, long short-term memory, Time-Domain Audio Separation Network, LibriMix dataset

Abstract
Speech separation is crucial for effective speech processing in multi-talker conditions, especially in real-time, low-latency applications. In this study, the Time-Domain Audio Separation Network (TasNet) and the Dual-Path Recurrent Neural Network (DPRNN) were used to perform time-domain multi-speaker speech separation. Conventional recurrent neural networks (RNNs) cannot accurately model very long sequences, and one-dimensional convolutional neural networks (1-D CNNs) cannot perform utterance-level sequence modeling when the sequence length exceeds their receptive field. DPRNN splits the long sequential input into smaller chunks and iteratively applies intra-chunk and inter-chunk operations, so that each operation processes an input whose length is proportional to the square root of the original sequence length. The resulting model is more efficient than earlier systems and improves performance on the LibriMix dataset. Experiments show that DPRNN with sample-level, time-domain audio separation can replace existing methods; EEND-SS and other separation algorithms perform worse than DPRNN. The proposed model achieved an SI-SDR of 12.376, a STOI (short-time objective intelligibility) of 0.969, an SDR of 12.363, a DER of 9.363, and an SCA of 97.193.
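The chunking idea can be illustrated with a short PyTorch sketch. This is an illustrative reconstruction of the general DPRNN dual-path scheme, not the configuration used in this study: the feature size, chunk length, hidden size, and the helper names segment and DualPathBlock are assumptions chosen for demonstration. The encoded mixture is split into 50%-overlapping chunks, a bidirectional LSTM is applied within each chunk (intra-chunk), and another is applied across chunks at each position (inter-chunk), so each recurrent pass sees a length roughly proportional to the square root of the original sequence length.

```python
# Minimal sketch of DPRNN-style dual-path processing (PyTorch).
# Hyperparameters and helper names are illustrative assumptions, not the
# values or implementation used in the paper.
import torch
import torch.nn as nn


def segment(x, chunk_size):
    """Split a sequence [batch, features, time] into 50%-overlapping chunks.

    Returns [batch, features, chunk_size, n_chunks], so each recurrent pass
    operates on lengths roughly proportional to sqrt(time).
    """
    hop = chunk_size // 2
    batch, feats, time = x.shape
    # Zero-pad so the sequence divides evenly into overlapping chunks.
    pad = (hop - (time - chunk_size) % hop) % hop
    x = nn.functional.pad(x, (hop, hop + pad))
    chunks = x.unfold(dimension=-1, size=chunk_size, step=hop)  # [B, F, n_chunks, chunk]
    return chunks.permute(0, 1, 3, 2)                           # [B, F, chunk, n_chunks]


class DualPathBlock(nn.Module):
    """One dual-path block: a BLSTM over each chunk (intra-chunk), then a
    BLSTM across chunks at each position (inter-chunk), each followed by a
    linear projection, normalization, and a residual connection."""

    def __init__(self, feats=64, hidden=128):
        super().__init__()
        self.intra_rnn = nn.LSTM(feats, hidden, batch_first=True, bidirectional=True)
        self.intra_proj = nn.Linear(2 * hidden, feats)
        self.intra_norm = nn.GroupNorm(1, feats)
        self.inter_rnn = nn.LSTM(feats, hidden, batch_first=True, bidirectional=True)
        self.inter_proj = nn.Linear(2 * hidden, feats)
        self.inter_norm = nn.GroupNorm(1, feats)

    def forward(self, x):  # x: [B, F, chunk, n_chunks]
        b, f, k, s = x.shape
        # Intra-chunk pass: model local structure within each chunk.
        intra = x.permute(0, 3, 2, 1).reshape(b * s, k, f)
        intra = self.intra_proj(self.intra_rnn(intra)[0])
        intra = intra.reshape(b, s, k, f).permute(0, 3, 2, 1)
        x = x + self.intra_norm(intra.reshape(b, f, k * s)).reshape(b, f, k, s)
        # Inter-chunk pass: model global structure across chunks.
        inter = x.permute(0, 2, 3, 1).reshape(b * k, s, f)
        inter = self.inter_proj(self.inter_rnn(inter)[0])
        inter = inter.reshape(b, k, s, f).permute(0, 3, 1, 2)
        x = x + self.inter_norm(inter.reshape(b, f, k * s)).reshape(b, f, k, s)
        return x


# Example: segment an encoded mixture and apply one dual-path block.
features = torch.randn(2, 64, 16000)        # [batch, features, time]
chunks = segment(features, chunk_size=250)  # chunk length on the order of sqrt(time)
out = DualPathBlock(feats=64)(chunks)
print(out.shape)                            # [2, 64, 250, n_chunks]
```

In a full separator, several such blocks are stacked, the chunks are overlap-added back into a sequence, and a mask per speaker is estimated from the result, following the TasNet encoder-mask-decoder structure.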
License
Copyright (c) 2024 Iraqi Journal of Science
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.