TY - CONF
T1 - Disentangled speech representation learning based on factorized hierarchical variational autoencoder with self-supervised objective
AU - Xie, Yuying
AU - Arildsen, Thomas
AU - Tan, Zheng-Hua
PY - 2021/10/28
Y1 - 2021/10/28
N2 - Disentangled representation learning aims to extract explanatory factors while retaining salient information. The factorized hierarchical variational autoencoder (FHVAE) disentangles a speech signal into sequential-level and segmental-level features, which represent speaker identity and speech content, respectively. Autoregressive predictive coding (APC), a self-supervised objective, has been used to extract meaningful and transferable speech features for multiple downstream tasks. Inspired by the success of these two representation learning methods, this paper proposes to integrate the APC objective into the FHVAE framework in order to benefit from the additional self-supervision target. The proposed method requires neither more training data nor more computational cost at test time, yet it yields more meaningful representations while maintaining disentanglement. Experiments were conducted on the TIMIT dataset. Results demonstrate that an FHVAE equipped with the additional self-supervised objective learns features that provide superior performance on tasks including speech recognition and speaker recognition. Furthermore, voice conversion, as one application of disentangled representation learning, has been applied and evaluated; on this task, the new framework achieves performance similar to the baseline.
KW - Disentangled representation learning
KW - autoregressive predictive coding
KW - variational autoencoder
UR - http://www.scopus.com/inward/record.url?scp=85122805866&partnerID=8YFLogxK
U2 - 10.1109/MLSP52302.2021.9596320
DO - 10.1109/MLSP52302.2021.9596320
M3 - Article in proceedings
SN - 978-1-6654-1184-4
T3 - IEEE Workshop on Machine Learning for Signal Processing
SP - 1
EP - 6
BT - 2021 IEEE 31st International Workshop on Machine Learning for Signal Processing, MLSP 2021
PB - IEEE
T2 - 2021 IEEE 31st International Workshop on Machine Learning for Signal Processing (MLSP)
Y2 - 25 October 2021 through 28 October 2021
ER -