Improved Disentangled Speech Representations Using Contrastive Learning in Factorized Hierarchical Variational Autoencoder

Yuying Xie*, Thomas Arildsen, Zheng Hua Tan

*Corresponding author for this work

Research output: Contribution to book/anthology/report/conference proceeding › Article in proceeding › Research › peer-review

1 Citation (Scopus)

Abstract

Leveraging the fact that speaker identity and content vary on different time scales, the factorized hierarchical variational autoencoder (FHVAE) uses separate latent variables to represent these two attributes. Disentanglement is achieved through different prior settings for the corresponding latent variables. For the speaker identity variable, FHVAE assumes a Gaussian prior with an utterance-dependent mean and a fixed variance. By setting a small fixed variance, training encourages identity variables within one utterance to gather close to the mean of their prior. However, this constraint is relatively weak, as the mean of the prior changes between utterances. We therefore introduce contrastive learning into the FHVAE framework, so that speaker identity variables representing the same speaker cluster together while keeping as much distance as possible from those of other speakers. Only the training process is changed in this work, not the model structure, so no additional cost is incurred during testing. Voice conversion is chosen as the application in this paper. Latent-variable evaluations include speaker verification and identification for the speaker identity variable, and speech recognition for the content variable. Voice conversion performance is further assessed through fake speech detection experiments. Results show that the proposed method improves both speaker identity and content feature extraction compared to FHVAE, and outperforms the baseline on conversion.
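The contrastive objective described in the abstract (pulling same-speaker identity variables together while pushing different speakers apart) can be illustrated with a supervised contrastive loss. The sketch below is a generic NumPy illustration of this kind of loss, not the paper's exact formulation; the function name, temperature value, and toy data are assumptions for demonstration.

```python
import numpy as np

def supervised_contrastive_loss(z, labels, tau=0.1):
    """Illustrative supervised contrastive loss on speaker identity
    vectors z of shape (N, D) with integer speaker labels of shape (N,).
    Same-speaker pairs are treated as positives, all other pairs as
    negatives. This is a sketch, not the paper's exact loss."""
    # L2-normalize so the dot product is cosine similarity
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = (z @ z.T) / tau                      # scaled similarity matrix
    np.fill_diagonal(sim, -np.inf)             # exclude self-pairs
    # Log-softmax over each anchor's similarities to all other samples
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    # Positive mask: same speaker, excluding the anchor itself
    pos = (labels[:, None] == labels[None, :]) & ~np.eye(len(z), dtype=bool)
    # Average negative log-probability of the positives per anchor
    per_anchor = -np.where(pos, log_prob, 0.0).sum(axis=1) \
                 / np.maximum(pos.sum(axis=1), 1)
    return per_anchor.mean()

# Toy check: two speakers, identity vectors either well clustered or swapped
labels = np.array([0, 0, 1, 1])
z_tight = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]])
z_bad = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
print(supervised_contrastive_loss(z_tight, labels)
      < supervised_contrastive_loss(z_bad, labels))   # clustered is better
```

Minimizing this loss drives identity vectors of the same speaker toward each other across utterances, which is exactly the stronger constraint the abstract contrasts with FHVAE's fixed-variance prior.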

Original language: English
Title of host publication: 31st European Signal Processing Conference, EUSIPCO 2023 - Proceedings
Number of pages: 5
Publisher: IEEE (Institute of Electrical and Electronics Engineers)
Publication date: 2023
Pages: 1330-1334
ISBN (Electronic): 9789464593600
DOIs
Publication status: Published - 2023
Event: 31st European Signal Processing Conference, EUSIPCO 2023 - Helsinki, Finland
Duration: 4 Sept 2023 - 8 Sept 2023

Conference

Conference: 31st European Signal Processing Conference, EUSIPCO 2023
Country/Territory: Finland
City: Helsinki
Period: 04/09/2023 - 08/09/2023
Series: European Signal Processing Conference
ISSN: 2219-5491

Bibliographical note

Publisher Copyright:
© 2023 European Signal Processing Conference, EUSIPCO. All rights reserved.

Keywords

  • contrastive learning
  • disentangled representation learning
  • voice conversion

