A Bayesian Permutation Training Deep Representation Learning Method for Speech Enhancement with Variational Autoencoder

Yang Xiang*, Jesper Lisby Højvang, Morten Højfeldt Rasmussen, Mads Græsbøll Christensen

*Corresponding author for this work

Research output: Contribution to book/anthology/report/conference proceeding › Article in proceeding › Research › peer-review

3 Citations (Scopus)

Abstract

Recently, the variational autoencoder (VAE), a deep representation learning (DRL) model, has been used to perform speech enhancement (SE). However, to the best of our knowledge, current VAE-based SE methods only apply the VAE to model the speech signal, while noise is modeled with the traditional non-negative matrix factorization (NMF) model. One of the most important reasons for using NMF is that these VAE-based methods cannot disentangle the speech and noise latent variables from the observed signal. Based on Bayesian theory, this paper derives a novel variational lower bound for the VAE, which ensures that the VAE can be trained in a supervised manner and can disentangle the speech and noise latent variables from the observed signal. This means that the proposed method can apply the VAE to model both the speech and noise signals, which differs fundamentally from previous VAE-based SE work. More specifically, the proposed DRL method can learn to impose speech and noise signal priors on different sets of latent variables for SE. The experimental results show that the proposed method can not only disentangle the speech and noise latent variables from the observed signal, but also achieve a higher scale-invariant signal-to-distortion ratio and speech quality score than a comparable deep neural network (DNN)-based SE method.
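The core idea described in the abstract — an encoder that maps the observed (noisy) signal to a latent vector whose dimensions are split into a speech part and a noise part, each tied to its own prior — can be illustrated with a minimal numpy sketch. This is not the paper's actual network or lower bound: the linear encoder weights, dimensions, and the standard-normal KL term below are all stand-in assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W_mu, W_logvar):
    """Toy linear 'encoder': map a noisy spectrogram frame x to the
    mean and log-variance of a Gaussian posterior over the latents."""
    return W_mu @ x, W_logvar @ x

def reparameterize(mu, logvar, rng):
    # z = mu + sigma * eps (reparameterization trick)
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    # KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

F, D = 8, 6       # spectrogram bins, total latent dimension (assumed)
D_speech = 3      # first D_speech latents for speech, the rest for noise
W_mu = rng.standard_normal((D, F)) * 0.1
W_logvar = rng.standard_normal((D, F)) * 0.1

x = np.abs(rng.standard_normal(F))     # stand-in noisy magnitude frame
mu, logvar = encode(x, W_mu, W_logvar)
z = reparameterize(mu, logvar, rng)
z_speech, z_noise = z[:D_speech], z[D_speech:]

# In the paper's supervised setting, separate prior terms would tie
# z_speech and z_noise to priors learned from clean speech and noise;
# here both are measured against N(0, I) purely for illustration.
kl = kl_to_standard_normal(mu, logvar)
print(z_speech.shape, z_noise.shape, kl >= 0.0)
```

The point of the split is that a decoder conditioned on `z_speech` alone can then reconstruct an estimate of the clean speech, which is what makes the supervised disentanglement useful for enhancement.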

Original language: English
Title of host publication: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Number of pages: 5
Publisher: IEEE
Publication date: 2022
Pages: 381-385
ISBN (Print): 978-1-6654-0541-6
ISBN (Electronic): 978-1-6654-0540-9
DOIs
Publication status: Published - 2022
Event: 47th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Virtual, Online, Singapore
Duration: 23 May 2022 - 27 May 2022

Conference

Conference: 47th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022
Country/Territory: Singapore
City: Virtual, Online
Period: 23/05/2022 - 27/05/2022
Sponsor: Chinese and Oriental Languages Information Processing Society (COLPIS), Singapore Exhibition and Convention Bureau, The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), The Institute of Electrical and Electronics Engineers Signal Processing Society
Series: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
ISSN: 1520-6149

Bibliographical note

Funding Information:
This work was partly supported by Innovation Fund Denmark (Grant No. 9065-00046).

Publisher Copyright:
© 2022 IEEE

Keywords

  • Bayesian permutation training
  • Deep representation learning
  • speech enhancement
  • variational autoencoder

