rVAD: An unsupervised segment-based robust voice activity detection method

Zheng Hua Tan; Achintya kr Sarkar; Najim Dehak

doi:10.1016/j.csl.2019.06.005

Zheng Hua Tan^*, Achintya kr Sarkar, Najim Dehak

^*Corresponding author for this work

Research output: Contribution to journal › Journal article › Research › peer-review

84 Citations (Scopus)

180 Downloads (Pure)

Abstract

This paper presents an unsupervised segment-based method for robust voice activity detection (rVAD). The method consists of two passes of denoising followed by a voice activity detection (VAD) stage. In the first pass, high-energy segments in a speech signal are detected using an a posteriori signal-to-noise ratio (SNR) weighted energy difference; if no pitch is detected within a segment, the segment is considered a high-energy noise segment and set to zero. In the second pass, the speech signal is denoised by a speech enhancement method, for which several methods are explored. Next, neighbouring frames with pitch are grouped together to form pitch segments, and based on speech statistics, the pitch segments are further extended from both ends in order to include both voiced and unvoiced sounds and likely non-speech parts as well. Finally, the a posteriori SNR weighted energy difference is applied to the extended pitch segments of the denoised speech signal to detect voice activity. We evaluate the VAD performance of the proposed method using two databases, RATS and Aurora-2, which contain a large variety of noise conditions. The rVAD method is further evaluated, in terms of speaker verification performance, on the RedDots 2016 challenge database and its noise-corrupted versions. Experimental results show that rVAD compares favourably with a number of existing methods. In addition, we present a modified version of rVAD in which computationally intensive pitch extraction is replaced by computationally efficient spectral flatness calculation. The modified version significantly reduces the computational complexity at the cost of moderately inferior VAD performance; the reduced complexity is an advantage when processing a large amount of data or running on low-resource devices. The source code of rVAD is made publicly available.
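The segment-forming step and the spectral flatness measure mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the end-exclusive segment convention, the fixed `margin` in frames (standing in for the extension "based on speech statistics"), and the `eps` guard are all assumptions made for illustration. Spectral flatness is computed here by its standard definition, the geometric mean of a power spectrum divided by its arithmetic mean; the abstract does not say how the modified rVAD thresholds or applies it.

```python
import math


def spectral_flatness(power_spectrum, eps=1e-12):
    """Geometric mean over arithmetic mean of a power spectrum.

    Values near 1 indicate a flat, noise-like spectrum; values near 0
    indicate a peaky, tonal spectrum. `eps` (an assumed guard value)
    avoids log(0) and division by zero.
    """
    n = len(power_spectrum)
    log_mean = sum(math.log(p + eps) for p in power_spectrum) / n
    arith_mean = sum(power_spectrum) / n
    return math.exp(log_mean) / (arith_mean + eps)


def group_flagged_frames(flags):
    """Group runs of consecutive True frames into (start, end) pairs.

    `end` is exclusive. `flags` holds one boolean per frame, for
    example "pitch detected in this frame".
    """
    segments = []
    start = None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(flags)))
    return segments


def extend_segments(segments, n_frames, margin):
    """Extend each segment by `margin` frames at both ends.

    Segments that touch or overlap after extension are merged.
    The fixed `margin` is a hypothetical stand-in for the
    statistics-based extension described in the abstract.
    """
    extended = []
    for start, end in segments:
        start = max(0, start - margin)
        end = min(n_frames, end + margin)
        if extended and start <= extended[-1][1]:
            extended[-1] = (extended[-1][0], max(extended[-1][1], end))
        else:
            extended.append((start, end))
    return extended


# Example: ten frames, pitch detected in frames 2, 3 and 8.
pitch_flags = [False, False, True, True, False,
               False, False, False, True, False]
pitch_segments = group_flagged_frames(pitch_flags)
extended_segments = extend_segments(pitch_segments, len(pitch_flags), margin=1)
```

In this sketch the final detection stage would then run only inside `extended_segments`, which is why the extension deliberately admits some frames that may turn out to be non-speech.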

Original language	English
Journal	Computer Speech and Language
Volume	59
Pages (from-to)	1-21
Number of pages	21
ISSN	0885-2308
DOIs	https://doi.org/10.1016/j.csl.2019.06.005
Publication status	Published - 1 Jan 2020

Keywords

a posteriori SNR
Energy
Pitch detection
Speaker verification
Spectral flatness
Speech enhancement
Voice activity detection

Access to Document

10.1016/j.csl.2019.06.005

rVAD--An-Unsupervised-Segment-Based-Robust-Voice-Act_2019_Computer-Speech---
Accepted author manuscript, 1.51 MB
Licence: CC BY-NC-ND 4.0

https://arxiv.org/pdf/1906.03588.pdf

Cite this

@article{71b49eb483c04a689f6e04748a8a19f5,

title = "{rVAD}: An unsupervised segment-based robust voice activity detection method",

abstract = "This paper presents an unsupervised segment-based method for robust voice activity detection (rVAD). The method consists of two passes of denoising followed by a voice activity detection (VAD) stage. In the first pass, high-energy segments in a speech signal are detected by using a posteriori signal-to-noise ratio (SNR) weighted energy difference and if no pitch is detected within a segment, the segment is considered as a high-energy noise segment and set to zero. In the second pass, the speech signal is denoised by a speech enhancement method, for which several methods are explored. Next, neighbouring frames with pitch are grouped together to form pitch segments, and based on speech statistics, the pitch segments are further extended from both ends in order to include both voiced and unvoiced sounds and likely non-speech parts as well. In the end, a posteriori SNR weighted energy difference is applied to the extended pitch segments of the denoised speech signal for detecting voice activity. We evaluate the VAD performance of the proposed method using two databases, RATS and Aurora-2, which contain a large variety of noise conditions. The rVAD method is further evaluated, in terms of speaker verification performance, on the RedDots 2016 challenge database and its noise-corrupted versions. Experiment results show that rVAD is compared favourably with a number of existing methods. In addition, we present a modified version of rVAD where computationally intensive pitch extraction is replaced by computationally efficient spectral flatness calculation. The modified version significantly reduces the computational complexity at the cost of moderately inferior VAD performance, which is an advantage when processing a large amount of data and running on low resource devices. The source code of rVAD is made publicly available.",

keywords = "a posteriori SNR, Energy, Pitch detection, Speaker verification, Spectral flatness, Speech enhancement, Voice activity detection",

author = "Tan, {Zheng Hua} and Sarkar, {Achintya kr} and Dehak, Najim",

year = "2020",

month = jan,

day = "1",

doi = "10.1016/j.csl.2019.06.005",

language = "English",

volume = "59",

pages = "1--21",

journal = "Computer Speech and Language",

issn = "0885-2308",

publisher = "Academic Press",

}

TY - JOUR

T1 - rVAD: An unsupervised segment-based robust voice activity detection method

AU - Tan, Zheng Hua

AU - Sarkar, Achintya kr

AU - Dehak, Najim

PY - 2020/1/1

Y1 - 2020/1/1

N2 - This paper presents an unsupervised segment-based method for robust voice activity detection (rVAD). The method consists of two passes of denoising followed by a voice activity detection (VAD) stage. In the first pass, high-energy segments in a speech signal are detected by using a posteriori signal-to-noise ratio (SNR) weighted energy difference and if no pitch is detected within a segment, the segment is considered as a high-energy noise segment and set to zero. In the second pass, the speech signal is denoised by a speech enhancement method, for which several methods are explored. Next, neighbouring frames with pitch are grouped together to form pitch segments, and based on speech statistics, the pitch segments are further extended from both ends in order to include both voiced and unvoiced sounds and likely non-speech parts as well. In the end, a posteriori SNR weighted energy difference is applied to the extended pitch segments of the denoised speech signal for detecting voice activity. We evaluate the VAD performance of the proposed method using two databases, RATS and Aurora-2, which contain a large variety of noise conditions. The rVAD method is further evaluated, in terms of speaker verification performance, on the RedDots 2016 challenge database and its noise-corrupted versions. Experiment results show that rVAD is compared favourably with a number of existing methods. In addition, we present a modified version of rVAD where computationally intensive pitch extraction is replaced by computationally efficient spectral flatness calculation. The modified version significantly reduces the computational complexity at the cost of moderately inferior VAD performance, which is an advantage when processing a large amount of data and running on low resource devices. The source code of rVAD is made publicly available.

KW - a posteriori SNR

KW - Energy

KW - Pitch detection

KW - Speaker verification

KW - Spectral flatness

KW - Speech enhancement

KW - Voice activity detection

UR - http://www.scopus.com/inward/record.url?scp=85067546641&partnerID=8YFLogxK

UR - https://github.com/zhenghuatan/rVAD

U2 - 10.1016/j.csl.2019.06.005

DO - 10.1016/j.csl.2019.06.005

M3 - Journal article

AN - SCOPUS:85067546641

SN - 0885-2308

VL - 59

SP - 1

EP - 21

JO - Computer Speech and Language

JF - Computer Speech and Language

ER -
