TY - GEN
T1 - Multi-Task Adversarial Network Bottleneck Features for Noise-Robust Speaker Verification
AU - Yu, Hong
AU - Hu, Tianrui
AU - Ma, Zhanyu
AU - Tan, Zheng-Hua
AU - Guo, Jun
PY - 2018/11/6
Y1 - 2018/11/6
N2 - Modern automatic speaker verification (ASV) systems need to be robust under various noisy conditions. Motivated by the success of generative adversarial networks (GANs), this paper proposes a multi-task adversarial network (MAN) for extracting noise-invariant bottleneck (BN) features. The MAN consists of three component networks: a feature encoding network (FEN), a speaker discriminative network (SDN) and a noise-domain adaptation network (NAN). The FEN aims to generate noise-robust BN features, the SDN makes the features from the FEN more speaker-discriminative, and the NAN guides the FEN to learn more noise-invariant feature representations. The MAN is trained using an adversarial method. When training the FEN and SDN, speaker identities and the label of being clean speech are used as target labels, which can make the BN features extracted from noisy and clean speech similar. When training the NAN, by contrast, noise types are used as training targets. We evaluate the newly proposed MAN-BN feature extraction method on a Gaussian mixture model-universal background model (GMM-UBM) based ASV system. The experimental results on the RSR2015 database show that the proposed MAN-BN feature can dramatically improve the accuracy of the ASV system under different noise-type and signal-to-noise-ratio conditions.
AB - Modern automatic speaker verification (ASV) systems need to be robust under various noisy conditions. Motivated by the success of generative adversarial networks (GANs), this paper proposes a multi-task adversarial network (MAN) for extracting noise-invariant bottleneck (BN) features. The MAN consists of three component networks: a feature encoding network (FEN), a speaker discriminative network (SDN) and a noise-domain adaptation network (NAN). The FEN aims to generate noise-robust BN features, the SDN makes the features from the FEN more speaker-discriminative, and the NAN guides the FEN to learn more noise-invariant feature representations. The MAN is trained using an adversarial method. When training the FEN and SDN, speaker identities and the label of being clean speech are used as target labels, which can make the BN features extracted from noisy and clean speech similar. When training the NAN, by contrast, noise types are used as training targets. We evaluate the newly proposed MAN-BN feature extraction method on a Gaussian mixture model-universal background model (GMM-UBM) based ASV system. The experimental results on the RSR2015 database show that the proposed MAN-BN feature can dramatically improve the accuracy of the ASV system under different noise-type and signal-to-noise-ratio conditions.
KW - Bottleneck Features
KW - Multi-task Adversarial Training
KW - Speaker Verification
UR - http://www.scopus.com/inward/record.url?scp=85058298314&partnerID=8YFLogxK
U2 - 10.1109/ICNIDC.2018.8525526
DO - 10.1109/ICNIDC.2018.8525526
M3 - Article in proceeding
SN - 978-1-5386-6066-9
T3 - International Conference on Network Infrastructure and Digital Content (IC-NIDC)
SP - 165
EP - 169
BT - 2018 International Conference on Network Infrastructure and Digital Content (IC-NIDC)
PB - IEEE
T2 - 2018 International Conference on Network Infrastructure and Digital Content (IC-NIDC)
Y2 - 22 August 2018 through 24 August 2018
ER -