In total variability modeling, variable-length speech utterances are mapped to fixed low-dimensional i-vectors. Central to both training the total variability matrix and extracting i-vectors is the computation of the posterior distribution of a latent variable conditioned on an observed feature sequence of an utterance. In both cases the prior over the latent variable is assumed to be non-informative, since for homogeneous datasets there is no gain in generality from using an informative prior. This work shows that in the heterogeneous case, using informative priors for computing the posterior can lead to favorable results. We focus on modeling the priors using a minimum divergence criterion or factor analysis techniques. Tests on the NIST 2008 and 2010 Speaker Recognition Evaluation (SRE) datasets show that our proposed method beats four baselines: for i-vector extraction using an already trained matrix, in the short2-short3 task of SRE'08, five out of eight female and four out of eight male common conditions were improved; for the core-extended task in SRE'10, four out of nine female and six out of nine male common conditions were improved. When prior information is incorporated into the training of the T matrix itself, the proposed method beats the baselines in six out of eight female and five out of eight male common conditions for SRE'08, and in five and six out of nine conditions for the male and female cases, respectively, for SRE'10. Tests using factor analysis to estimate the priors show that two priors do not offer much improvement, but in the case of three separate priors (sparse data), considerable improvements were gained.
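The posterior computation the abstract refers to is a standard linear-Gaussian update: given an utterance's Baum-Welch statistics and the total variability matrix T, the latent variable has a Gaussian posterior whose precision and mean depend on the chosen prior. The following NumPy sketch illustrates the idea under simplifying assumptions (diagonal UBM covariances, flattened statistics); the function name and argument layout are illustrative, not the paper's implementation. Replacing the standard N(0, I) prior with an informative Gaussian N(mu0, P0^-1) only changes two terms in the update.

```python
import numpy as np

def ivector_posterior(T, Sigma_inv, N, F, prior_mean=None, prior_prec=None):
    """Gaussian posterior of the latent variable w given Baum-Welch stats.

    T          : (C*D, R) total variability matrix (C mixtures, D features, rank R)
    Sigma_inv  : (C*D,)   inverse diagonal UBM covariances, flattened
    N          : (C,)     zeroth-order occupation counts per mixture
    F          : (C*D,)   centered first-order statistics, flattened
    prior_mean : (R,)     mu0 of an informative prior N(mu0, P0^-1)
    prior_prec : (R, R)   P0; defaults give the usual non-informative N(0, I)
    Returns the posterior mean and posterior precision of w.
    """
    C = N.shape[0]
    D = T.shape[0] // C
    R = T.shape[1]
    if prior_mean is None:
        prior_mean = np.zeros(R)
    if prior_prec is None:
        prior_prec = np.eye(R)
    # Expand per-mixture counts to per-dimension counts.
    N_expanded = np.repeat(N, D)                     # (C*D,)
    TtSi = T.T * Sigma_inv                           # T' Sigma^-1, shape (R, C*D)
    # Posterior precision: P0 + T' Sigma^-1 diag(N) T
    # (the non-informative case replaces P0 with the identity).
    L = prior_prec + TtSi @ (N_expanded[:, None] * T)
    # Posterior mean: L^-1 (T' Sigma^-1 F + P0 mu0)
    # (the non-informative case drops the P0 mu0 term since mu0 = 0).
    mean = np.linalg.solve(L, TtSi @ F + prior_prec @ prior_mean)
    return mean, L
```

With the default arguments this reduces to the classical i-vector extractor; passing a population-specific mu0 and P0 yields the informative-prior variant discussed above, with a strong prior pulling the posterior mean toward mu0 when utterance statistics are sparse.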
Journal: IEEE Transactions on Audio, Speech and Language Processing
Number of pages: 14
Publication status: Published - Mar 2016