TY - JOUR
T1 - Self-Segmentation of Pass-Phrase Utterances for Deep Feature Learning in Text-Dependent Speaker Verification
AU - Sarkar, Achintya Kumar
AU - Tan, Zheng-Hua
PY - 2021
Y1 - 2021
N2 - In this paper, we propose a novel method to segment and label pass-phrase utterances for training deep neural network (DNN) bottleneck (BN) features for text-dependent speaker verification (TD-SV). Specifically, gender-dependent hidden Markov models (HMMs) for monophones are first trained using pass-phrase utterances that are disjoint from the evaluation set. Next, the trained HMMs are speaker-adapted and then used for segmenting and labeling these training utterances at the phone level. The resulting labeled data is subsequently used for training DNN models to discriminate gender-dependent phones for the purpose of extracting phone-discriminant BN features. This is in contrast to conventional approaches that apply a general-purpose, speaker-independent automatic speech recognition (ASR) system for generating segmentation and labels. The proposed method eliminates the need for a separate ASR system, which can additionally have the disadvantage of mismatch with the pass-phrase utterances in terms of languages, dialects, domains, acoustic conditions and so on. Experiments are conducted on the RedDots challenge 2016 database of TD-SV using short utterances with Gaussian mixture model-universal background model and i-vector techniques. Experimental results demonstrate that the proposed method yields lower error rates in TD-SV when compared with a set of existing methods. A thorough ablation study further confirms the effectiveness of the method. Fusion at both the score and feature levels also shows the complementary nature of the proposed features.
AB - In this paper, we propose a novel method to segment and label pass-phrase utterances for training deep neural network (DNN) bottleneck (BN) features for text-dependent speaker verification (TD-SV). Specifically, gender-dependent hidden Markov models (HMMs) for monophones are first trained using pass-phrase utterances that are disjoint from the evaluation set. Next, the trained HMMs are speaker-adapted and then used for segmenting and labeling these training utterances at the phone level. The resulting labeled data is subsequently used for training DNN models to discriminate gender-dependent phones for the purpose of extracting phone-discriminant BN features. This is in contrast to conventional approaches that apply a general-purpose, speaker-independent automatic speech recognition (ASR) system for generating segmentation and labels. The proposed method eliminates the need for a separate ASR system, which can additionally have the disadvantage of mismatch with the pass-phrase utterances in terms of languages, dialects, domains, acoustic conditions and so on. Experiments are conducted on the RedDots challenge 2016 database of TD-SV using short utterances with Gaussian mixture model-universal background model and i-vector techniques. Experimental results demonstrate that the proposed method yields lower error rates in TD-SV when compared with a set of existing methods. A thorough ablation study further confirms the effectiveness of the method. Fusion at both the score and feature levels also shows the complementary nature of the proposed features.
KW - Bottleneck feature
KW - DNNs
KW - HMMs
KW - Pass-phrases
KW - Speaker verification
UR - http://www.scopus.com/inward/record.url?scp=85104917992&partnerID=8YFLogxK
U2 - 10.1016/j.csl.2021.101229
DO - 10.1016/j.csl.2021.101229
M3 - Journal article
SN - 0885-2308
VL - 70
JO - Computer Speech and Language
JF - Computer Speech and Language
M1 - 101229
ER -