TY - JOUR
T1 - Reimagining Speech
T2 - A Scoping Review of Deep Learning-Based Methods for Non-Parallel Voice Conversion
AU - Bargum, Anders Riddersholm
AU - Serafin, Stefania
AU - Erkut, Cumhur
PY - 2024/8/16
Y1 - 2024/8/16
N2 - Research on deep learning-powered Voice Conversion (VC) in speech-to-speech scenarios is getting increasingly popular. Although many of the works in the field of voice conversion share a common global pipeline, there is a considerable diversity in the underlying structures, methods, and neural sub-blocks used across research efforts. Thus, obtaining a comprehensive understanding of the reasons behind the choice of the different methods included when training voice conversion models can be challenging, and the actual hurdles in the proposed solutions are often unclear. To shed light on these aspects, this paper presents a scoping review that explores the use of deep learning in speech analysis, synthesis, and disentangled speech representation learning within modern voice conversion systems. We screened 628 publications from more than 38 different venues between the years 2017 and 2023, followed by an in-depth review of a final database consisting of 130 eligible studies. Based on the review, we summarise the most frequently used approaches to voice conversion based on deep learning and highlight common pitfalls within the community. We condense the knowledge gathered to identify main challenges, supply solutions grounded in the analysis and lastly provide recommendations for future research directions.
AB - Research on deep learning-powered Voice Conversion (VC) in speech-to-speech scenarios is getting increasingly popular. Although many of the works in the field of voice conversion share a common global pipeline, there is a considerable diversity in the underlying structures, methods, and neural sub-blocks used across research efforts. Thus, obtaining a comprehensive understanding of the reasons behind the choice of the different methods included when training voice conversion models can be challenging, and the actual hurdles in the proposed solutions are often unclear. To shed light on these aspects, this paper presents a scoping review that explores the use of deep learning in speech analysis, synthesis, and disentangled speech representation learning within modern voice conversion systems. We screened 628 publications from more than 38 different venues between the years 2017 and 2023, followed by an in-depth review of a final database consisting of 130 eligible studies. Based on the review, we summarise the most frequently used approaches to voice conversion based on deep learning and highlight common pitfalls within the community. We condense the knowledge gathered to identify main challenges, supply solutions grounded in the analysis and lastly provide recommendations for future research directions.
KW - voice conversion
KW - voice transformation
KW - voice control
KW - deep learning
KW - disentanglement
KW - speech representation learning
U2 - 10.3389/frsip.2024.1339159
DO - 10.3389/frsip.2024.1339159
M3 - Review article
SN - 2673-8198
VL - 4
JO - Frontiers in Signal Processing
JF - Frontiers in Signal Processing
M1 - 1339159
ER -