RAVE for Speech: Efficient Voice Conversion at High Sampling Rates

Anders R. Bargum, Simon Lajboschitz, Cumhur Erkut

Research output: Contribution to book/anthology/report/conference proceeding > Article in proceeding > Research > peer-review

Abstract

Voice conversion has gained increasing popularity within the field of audio manipulation and speech synthesis. Often, the main objective is to transfer the input identity to that of a target speaker without changing its linguistic content. While current work provides high-fidelity solutions, it rarely focuses on model simplicity, high-sampling-rate environments, or streamability. By incorporating speech representation learning into a generative timbre transfer model, traditionally created for musical purposes, we investigate voice conversion generated directly in the time domain at high sampling rates. More specifically, we guide the latent space of a baseline model towards linguistically relevant representations and condition it on external speaker information. Through objective and subjective assessments, we demonstrate that the proposed solution can attain levels of naturalness, quality, and intelligibility comparable to those of a state-of-the-art solution for seen speakers, while significantly decreasing inference time. However, despite the presence of target speaker characteristics in the converted output, the actual similarity to unseen speakers remains a challenge.
Original language: English
Title of host publication: Proceedings of the 27th International Conference on Digital Audio Effects (DAFx24)
Number of pages: 8
Place of Publication: Guildford, UK
Publication date: 29 Aug 2024
Pages: 41-48
Publication status: Published - 29 Aug 2024
Event: 27th International Conference on Digital Audio Effects - Guildford, Surrey, United Kingdom
Duration: 3 Sept 2024 to 7 Sept 2024
Conference number: 27
https://dafx24.surrey.ac.uk/

Conference

Conference: 27th International Conference on Digital Audio Effects
Number: 27
Country/Territory: United Kingdom
City: Guildford, Surrey
Period: 03/09/2024 to 07/09/2024

Keywords

  • cs.SD
  • eess.AS
