In this paper, we study aspects of single-microphone speech enhancement (SE) based on deep neural networks (DNNs). Specifically, we explore the generalizability of state-of-the-art DNN-based SE systems with respect to the background noise type, the gender of the target speaker, and the signal-to-noise ratio (SNR). Furthermore, we investigate how specialized DNN-based SE systems, trained to be noise type specific, speaker specific, or SNR specific, perform relative to DNN-based SE systems trained to be noise type general, speaker general, and SNR general. Finally, we compare a DNN-based SE system trained to be noise type general, speaker general, and SNR general with a state-of-the-art short-time spectral amplitude minimum mean square error (STSA-MMSE) based SE algorithm. We show that DNN-based SE systems, when trained specifically to handle certain speakers, noise types, and SNRs, achieve large improvements in estimated speech quality (SQ) and speech intelligibility (SI) when tested in matched conditions. Furthermore, we show that a DNN-based SE system can improve estimated SQ and SI for unseen speakers, genders, and noise types, provided that a large number of speakers and noise types have been used in training. In addition, we show that a DNN-based SE system trained using a large number of speakers and a wide range of noise types outperforms a state-of-the-art STSA-MMSE based SE method when tested on a range of unseen speakers and noise types. Finally, a listening test using several DNN-based SE systems in unseen speaker conditions shows that these systems can improve SI for some SNR and noise type configurations but degrade SI for others.
|Journal||IEEE Transactions on Audio, Speech and Language Processing|
|Pages (from-to)||153-167|
|Status||Published - 2017|
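The DNN-based SE systems discussed in the abstract broadly follow a time-frequency masking pipeline: transform the noisy signal to the spectral domain, apply a gain (mask) per time-frequency bin, and resynthesize. The sketch below illustrates that pipeline with NumPy under stated assumptions: a square-root Hann window with 50% overlap for analysis and synthesis, and a placeholder `mask_fn` standing in for the trained DNN (or for an STSA-MMSE gain rule); it is not the paper's actual implementation.

```python
import numpy as np

FRAME, HOP = 512, 256
# Periodic sqrt-Hann window: with 50% overlap, the squared window
# overlap-adds to 1, giving perfect reconstruction under a unity mask.
WIN = np.sqrt(0.5 * (1.0 - np.cos(2.0 * np.pi * np.arange(FRAME) / FRAME)))

def enhance(noisy, mask_fn):
    """Generic mask-based SE: STFT -> per-bin gain -> overlap-add iSTFT.

    `mask_fn` maps a frame's magnitude spectrum to gains in [0, 1];
    in a DNN-based system this is the network's prediction (hypothetical
    stand-in here), in STSA-MMSE it is a statistically derived gain.
    """
    n_frames = 1 + (len(noisy) - FRAME) // HOP
    out = np.zeros(len(noisy))
    for t in range(n_frames):
        seg = noisy[t * HOP : t * HOP + FRAME] * WIN   # windowed analysis frame
        spec = np.fft.rfft(seg)
        spec *= mask_fn(np.abs(spec))                  # suppress noise-dominated bins
        out[t * HOP : t * HOP + FRAME] += np.fft.irfft(spec) * WIN  # overlap-add
    return out
```

A unity mask (`lambda mag: np.ones_like(mag)`) passes the signal through unchanged away from the edges, which is a useful sanity check that the analysis/synthesis chain itself is transparent before plugging in a learned mask.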