Self-Segmentation of Pass-Phrase Utterances for Deep Feature Learning in Text-Dependent Speaker Verification

Achintya Kumar Sarkar, Zheng-Hua Tan

Publication: Contribution to journal › Journal article › Research › peer review

3 citations (Scopus)

Abstract

In this paper, we propose a novel method to segment and label pass-phrase utterances for training deep neural network (DNN) bottleneck (BN) features for text-dependent speaker verification (TD-SV). Specifically, gender-dependent hidden Markov models (HMMs) for monophones are first trained using pass-phrase utterances that are disjoint from the evaluation set. Next, the trained HMMs are speaker-adapted and then used to segment and label these training utterances at the phone level. The resulting labeled data is subsequently used to train DNN models to discriminate gender-dependent phones for the purpose of extracting phone-discriminant BN features. This is in contrast to conventional approaches, which apply a general-purpose, speaker-independent automatic speech recognition (ASR) system to generate the segmentation and labels. The proposed method eliminates the need for a separate ASR system, which can additionally suffer from mismatch with the pass-phrase utterances in terms of language, dialect, domain, acoustic conditions and so on. Experiments are conducted on the RedDots challenge 2016 database for TD-SV using short utterances with Gaussian mixture model-universal background model and i-vector techniques. Experimental results demonstrate that the proposed method yields lower error rates in TD-SV than a set of existing methods. A thorough ablation study further confirms the effectiveness of the method. Fusion at both the score and feature levels also shows the complementary nature of the proposed features.
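The score- and feature-level fusion mentioned at the end of the abstract can be sketched as follows. This is a minimal, hypothetical illustration: the weight value, score arrays, and feature dimensions are placeholders, not values from the paper.

```python
import numpy as np


def score_fusion(scores_a, scores_b, weight=0.5):
    """Score-level fusion: linearly combine per-trial verification scores
    from two systems (e.g., a baseline cepstral system and the proposed
    BN-feature system). `weight` is an illustrative tuning parameter."""
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    return weight * scores_a + (1.0 - weight) * scores_b


def feature_fusion(feats_a, feats_b):
    """Feature-level fusion: concatenate two frame-aligned feature
    streams (frames x dims) along the feature dimension."""
    return np.concatenate([np.asarray(feats_a), np.asarray(feats_b)], axis=1)


# Toy usage with made-up scores and feature matrices.
fused_scores = score_fusion([1.0, -0.5], [0.5, 0.5], weight=0.6)
fused_feats = feature_fusion(np.zeros((10, 20)), np.ones((10, 30)))
print(fused_scores, fused_feats.shape)
```

In practice the fusion weight would be tuned on a development set, and feature-level fusion requires the two streams to be extracted frame-synchronously so that concatenation is meaningful.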

Original language: English
Article number: 101229
Journal: Computer Speech and Language
Volume: 70
ISSN: 0885-2308
DOI
Status: Published - 2021
