Vocal timbre effects with differentiable digital signal processing

David Südholt; Cumhur Erkut

Research output: Contribution to journal › Conference article in Journal › Research › peer-review

Abstract

We explore two approaches to creatively altering vocal timbre using Differentiable Digital Signal Processing (DDSP). The first approach is inspired by classic cross-synthesis techniques. A pretrained DDSP decoder predicts a filter for a noise source and a harmonic distribution, based on pitch and loudness information extracted from the vocal input. Before synthesis, the harmonic distribution is modified by interpolating between the predicted distribution and the harmonics of the input. We provide a real-time implementation of this approach in the form of a Neutone model. In the second approach, autoencoder models are trained on datasets consisting of both vocal and instrument training data. To apply the effect, the trained autoencoder attempts to reconstruct the vocal input. We find that there is a desirable “sweet spot” during training, where the model has learned to reconstruct the phonetic content of the input vocals, but is still affected by the timbre of the instrument mixed into the training data. After further training, that effect disappears. A perceptual evaluation compares the two approaches. We find that the autoencoder in the second approach is able to reconstruct intelligible lyrical content without any explicit phonetic information provided during training.
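The interpolation step of the first approach can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract says only that the predicted harmonic distribution is interpolated with the harmonics of the input, so the linear blend, the function names `normalize` and `interpolate_harmonics`, and the `mix` parameter are all assumptions made for illustration. Each distribution is treated as one frame of non-negative harmonic amplitudes.

```python
from typing import List, Sequence


def normalize(amplitudes: Sequence[float]) -> List[float]:
    """Scale non-negative harmonic amplitudes so that they sum to 1.

    Hypothetical helper: an all-zero frame falls back to a uniform distribution.
    """
    total = sum(amplitudes)
    if total <= 0.0:
        return [1.0 / len(amplitudes)] * len(amplitudes)
    return [a / total for a in amplitudes]


def interpolate_harmonics(
    predicted: Sequence[float],
    measured: Sequence[float],
    mix: float,
) -> List[float]:
    """Linearly blend two harmonic distributions for a single frame.

    predicted: distribution from the decoder (assumed input).
    measured:  harmonic amplitudes estimated from the vocal input (assumed input).
    mix:       0.0 keeps the predicted distribution, 1.0 keeps the measured one.
    """
    if len(predicted) != len(measured):
        raise ValueError("both distributions need the same number of harmonics")
    if not 0.0 <= mix <= 1.0:
        raise ValueError("mix must lie in [0, 1]")
    p = normalize(predicted)
    m = normalize(measured)
    # A convex combination of two distributions that sum to 1 also sums to 1.
    return [(1.0 - mix) * pi + mix * mi for pi, mi in zip(p, m)]


if __name__ == "__main__":
    # Three harmonics, halfway between the two (made-up) distributions.
    print(interpolate_harmonics([0.5, 0.3, 0.2], [2.0, 1.0, 1.0], mix=0.5))
```

Normalizing both inputs first keeps the blend a valid distribution, so overall loudness can still be controlled separately from the harmonic balance.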

Original language	English
Book series	Proceedings of the International Conference on Digital Audio Effects, DAFx
Pages (from-to)	363-366
Number of pages	4
ISSN	2413-6700
Publication status	Published - 2023
Event	26th International Conference on Digital Audio Effects, DAFx 2023 - Copenhagen, Denmark. Duration: 4 Sept 2023 → 7 Sept 2023

Conference

Conference	26th International Conference on Digital Audio Effects, DAFx 2023
Country/Territory	Denmark
City	Copenhagen
Period	04/09/2023 → 07/09/2023
Sponsor	Ableton, AudioKinetic, et al., EURAL - Algorithmically Perfect, Native Instruments, Soundtoys

Bibliographical note

Publisher Copyright:
© 2023 David Südholt et al.

Access to Document

Open Access article: Final published version, 220 KB. Licence: CC BY 4.0

https://www.dafx.de/paper-archive/2023/DAFx23_paper_29.pdf (Licence: CC BY 4.0)

Cite this

@inproceedings{21a54b6cb5aa4301ae15d7df1a4de9c9,

title = "Vocal timbre effects with differentiable digital signal processing",

abstract = "We explore two approaches to creatively altering vocal timbre using Differentiable Digital Signal Processing (DDSP). The first approach is inspired by classic cross-synthesis techniques. A pretrained DDSP decoder predicts a filter for a noise source and a harmonic distribution, based on pitch and loudness information extracted from the vocal input. Before synthesis, the harmonic distribution is modified by interpolating between the predicted distribution and the harmonics of the input. We provide a real-time implementation of this approach in the form of a Neutone model. In the second approach, autoencoder models are trained on datasets consisting of both vocal and instrument training data. To apply the effect, the trained autoencoder attempts to reconstruct the vocal input. We find that there is a desirable “sweet spot” during training, where the model has learned to reconstruct the phonetic content of the input vocals, but is still affected by the timbre of the instrument mixed into the training data. After further training, that effect disappears. A perceptual evaluation compares the two approaches. We find that the autoencoder in the second approach is able to reconstruct intelligible lyrical content without any explicit phonetic information provided during training.",

author = "David S{\"u}dholt and Cumhur Erkut",

note = "Publisher Copyright: {\textcopyright} 2023 David S{\"u}dholt et al.; 26th International Conference on Digital Audio Effects, DAFx 2023 ; Conference date: 04-09-2023 Through 07-09-2023",

year = "2023",

language = "English",

pages = "363--366",

booktitle = "Proceedings of the International Conference on Digital Audio Effects, DAFx",

issn = "2413-6700",

}

TY - GEN

T1 - Vocal timbre effects with differentiable digital signal processing

AU - Südholt, David

AU - Erkut, Cumhur

PY - 2023

Y1 - 2023

N2 - We explore two approaches to creatively altering vocal timbre using Differentiable Digital Signal Processing (DDSP). The first approach is inspired by classic cross-synthesis techniques. A pretrained DDSP decoder predicts a filter for a noise source and a harmonic distribution, based on pitch and loudness information extracted from the vocal input. Before synthesis, the harmonic distribution is modified by interpolating between the predicted distribution and the harmonics of the input. We provide a real-time implementation of this approach in the form of a Neutone model. In the second approach, autoencoder models are trained on datasets consisting of both vocal and instrument training data. To apply the effect, the trained autoencoder attempts to reconstruct the vocal input. We find that there is a desirable “sweet spot” during training, where the model has learned to reconstruct the phonetic content of the input vocals, but is still affected by the timbre of the instrument mixed into the training data. After further training, that effect disappears. A perceptual evaluation compares the two approaches. We find that the autoencoder in the second approach is able to reconstruct intelligible lyrical content without any explicit phonetic information provided during training.

AB - We explore two approaches to creatively altering vocal timbre using Differentiable Digital Signal Processing (DDSP). The first approach is inspired by classic cross-synthesis techniques. A pretrained DDSP decoder predicts a filter for a noise source and a harmonic distribution, based on pitch and loudness information extracted from the vocal input. Before synthesis, the harmonic distribution is modified by interpolating between the predicted distribution and the harmonics of the input. We provide a real-time implementation of this approach in the form of a Neutone model. In the second approach, autoencoder models are trained on datasets consisting of both vocal and instrument training data. To apply the effect, the trained autoencoder attempts to reconstruct the vocal input. We find that there is a desirable “sweet spot” during training, where the model has learned to reconstruct the phonetic content of the input vocals, but is still affected by the timbre of the instrument mixed into the training data. After further training, that effect disappears. A perceptual evaluation compares the two approaches. We find that the autoencoder in the second approach is able to reconstruct intelligible lyrical content without any explicit phonetic information provided during training.

UR - http://www.scopus.com/inward/record.url?scp=85174514681&partnerID=8YFLogxK

M3 - Conference article in Journal

AN - SCOPUS:85174514681

SN - 2413-6700

SP - 363

EP - 366

JO - Proceedings of the International Conference on Digital Audio Effects, DAFx

JF - Proceedings of the International Conference on Digital Audio Effects, DAFx

T2 - 26th International Conference on Digital Audio Effects, DAFx 2023

Y2 - 4 September 2023 through 7 September 2023

ER -
