TY - JOUR
T1 - Shouted and Whispered Speech Compensation for Speaker Verification Systems
AU - Prieto-Calero, Santiago
AU - Ortega, Alfonso
AU - López-Espejo, Iván
AU - Lleida, Eduardo
PY - 2022/7
Y1 - 2022/7
AB - Speaker verification systems are beginning to perform very well under normal speech conditions thanks to the abundance of neutrally-phonated speech data available for training them. Nevertheless, the use of vocal effort modes other than normal severely degrades performance because of vocal effort mismatch. In this paper, we consider whispered, normal and shouted speech production modes and first study how vocal effort mismatch degrades speaker verification performance. Then, to mitigate this issue, we describe a series of techniques for score calibration and speaker embedding compensation that rely on logistic regression-based vocal effort mode detection. To validate these methods, we carry out speaker verification experiments using a modern x-vector-based speaker verification system. Experimental results show that combining score calibration and embedding compensation based on vocal effort mode detection yields relative equal error rate (EER) improvements of up to 19% and 52% under the shouted-normal and whispered-normal scenarios, respectively, compared with a system applying neither calibration nor compensation. Compared to our previous work [1], we obtain a 7.3% relative EER improvement when adding score calibration in the shouted-normal All vs. All condition.
KW - Deep learning
KW - Domain compensation
KW - Shouted speech
KW - Speaker verification
KW - Vocal effort mismatch
KW - Whispered speech
UR - http://www.scopus.com/inward/record.url?scp=85127732756&partnerID=8YFLogxK
U2 - 10.1016/j.dsp.2022.103536
DO - 10.1016/j.dsp.2022.103536
M3 - Journal article
SN - 1051-2004
VL - 127
JO - Digital Signal Processing: A Review Journal
JF - Digital Signal Processing: A Review Journal
M1 - 103536
ER -