Far-Field Voice Activity Detection and Its Applications in Adverse Acoustic Environments

Theodoros Petsatodis

Research output: PhD thesis

Abstract

Voice Activity Detection (VAD), having been a focus of speech processing research for
many years, is now a mature technology with applications in several sectors. Embedded
VAD components in telecommunications systems (such as cellular telephony)
attempt to reduce the power consumption of transmitters and bandwidth utilization. VAD
technology is also integrated in speech-processing systems, such as Speaker Identification,
Automatic Event Detection, and Automatic Speech Recognition, to prevent their
operation in the absence of speech, and thus reduce the error rates of each of these
systems.
The performance of VAD systems depends strongly on various factors, including the
discriminative ability of the classification criterion employed, the dynamics of the additive
noise and the signal to noise ratio. Speech signals transmitted within reverberant
enclosures and captured using far-field microphones are subject to reverberation effects,
competitive sound sources, and speaker movement. Furthermore, speech distribution
varies with time and can be affected by several unpredictable factors, including the speaker’s
temper, mood, gender, age, and more. Thus, during the design phase of a VAD, special
care must be taken to build a robust system able to operate under
variable and adverse conditions.
Given that for most speech processing systems it is of crucial importance to have a
reasonable approximation for the probability density function (pdf) of speech, understanding
the properties of speech distribution plays a very important role in the design
of speech processing systems. Within the framework of this work, variability of speech
distribution, when using far-field microphones, is analysed under the presence of noise
and reverberation.
Observations of how speech distribution is shaped by external interferences are then
used as the basis to develop an adaptive unsupervised VAD scheme. This VAD, in
contrast to other approaches employing fixed distribution assumptions, relies on effectively
modelling the distribution of speech as a convex combination of a Gaussian,
a Laplacian, and a two-sided Gamma distribution. The increased adaptability of the
system, along with the encapsulated adaptive threshold, allows the system to perform
well under complex adverse conditions.
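
As an illustration of the convex-combination model described above, the following minimal sketch evaluates a mixture of zero-mean Gaussian, Laplacian, and two-sided Gamma densities. The function names, the common-variance parameterisation, the Gamma shape parameter of 1/2, and the example weights are assumptions made for illustration only; the abstract does not state how the thesis parameterises the components or adapts the weights.

```python
import math


def gaussian_pdf(x, sigma=1.0):
    """Zero-mean Gaussian density with standard deviation sigma."""
    return math.exp(-x * x / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))


def laplacian_pdf(x, sigma=1.0):
    """Zero-mean Laplacian density; scale b is set so the variance 2*b^2 equals sigma^2."""
    b = sigma / math.sqrt(2.0)
    return math.exp(-abs(x) / b) / (2.0 * b)


def two_sided_gamma_pdf(x, sigma=1.0, eps=1e-12):
    """Zero-mean two-sided Gamma density with shape 1/2 (an assumed choice).

    The scale theta is set so the variance (3/4)*theta^2 equals sigma^2.
    The density is unbounded at x = 0, so |x| is floored at eps.
    """
    ax = max(abs(x), eps)
    theta = 2.0 * sigma / math.sqrt(3.0)
    return 0.5 * math.exp(-ax / theta) / math.sqrt(math.pi * theta * ax)


def mixture_pdf(x, weights, sigma=1.0):
    """Convex combination of the three densities.

    weights = (w_gaussian, w_laplacian, w_gamma) must be non-negative and sum to 1.
    """
    if min(weights) < 0.0 or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    w_gauss, w_lap, w_gamma = weights
    return (w_gauss * gaussian_pdf(x, sigma)
            + w_lap * laplacian_pdf(x, sigma)
            + w_gamma * two_sided_gamma_pdf(x, sigma))


# Hypothetical weights; an adaptive scheme would re-estimate them from the data.
example_density = mixture_pdf(0.8, (0.5, 0.3, 0.2))
```

Because the weights are non-negative and sum to one, the mixture is itself a valid density, and its value at any point lies between the smallest and largest component values at that point.
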
Following the recent technological trend of incorporating microphone arrays in numerous
commercial applications (e.g. mobile phones, VoIP terminals) and research environments
(smart rooms), a multiple microphone VAD is also considered. The system
processes signals captured by far-field sensors in order to integrate spatial information
in addition to the frequency content available at a single sensor. The core of the system
rests on the modification of a multiple observation hypothesis, testing at each sensor
the probability of having an active speaker and then fusing the decisions. The VAD
operates without the need for Direction-of-Arrival (DOA) estimation and eliminates the additional
delay imposed by previous multi-microphone VAD technologies.
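
The per-sensor test followed by decision fusion can be sketched roughly as below. This is only an illustrative outline: the abstract does not give the test statistic or the fusion rule, so the thresholded per-sensor speech probability, the vote-fraction rule, and all names and default values here are assumptions.

```python
def sensor_decision(speech_prob, threshold=0.5):
    """Local decision at one sensor: True if speech is judged present."""
    return speech_prob > threshold


def fuse_decisions(speech_probs, threshold=0.5, min_fraction=0.5):
    """Fuse per-sensor decisions with a simple vote (assumed rule).

    Speech is declared when at least min_fraction of the sensors
    individually decide that a speaker is active.
    """
    if not speech_probs:
        raise ValueError("at least one sensor is required")
    votes = sum(sensor_decision(p, threshold) for p in speech_probs)
    return votes >= min_fraction * len(speech_probs)


# Hypothetical per-sensor speech probabilities for one frame.
frame_is_speech = fuse_decisions([0.9, 0.8, 0.2])
```

A vote of this kind needs no source localisation, which matches the abstract's point that the VAD operates without DOA estimation.
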
The system developed for the multi-microphone VAD serves as the platform to merge
VAD with a powerful analysis framework, namely Empirical Mode Decomposition
(EMD). This highly efficient method relies on the local characteristic time scales of
the data to analyse and decompose non-stationary signals into a set of so-called intrinsic
mode functions (IMFs). These functions are fed into the multiple microphone VAD
scheme in order to decide upon speech presence or absence. The outcome of this procedure
demonstrates significantly enhanced performance compared to single microphone
approaches.
Speech distribution information is also encapsulated in a supervised VAD scheme. Operating
in the far-field, the core of the system employs Hidden Markov Models, the
states of which are modelled using Gaussian Mixture Models to cater for the dynamics
of captured speech. Given the bi-modality of speech production, a simple visual-VAD
is also developed to examine performance enhancement when fusing audio and video
information.
In the final part of the work, applications of VAD in the context of integration with other
signal processing systems are also considered. Performance benefits of combining the
multi-microphone VAD with DOA estimation are demonstrated. Optimization through
adaptation of speech shape characteristics in the embedded Time Delay Estimation
(TDE) scheme is also considered, in the same way that proved beneficial for the
convex-combination-based VAD. To this end, the underlying assumption of a Gaussian-distributed
source is replaced by that of a Generalized Gaussian distribution, which allows
the evaluation of the problem under a larger set of speech-shaped distributions, ranging
from Gaussian to Laplacian and Gamma. The analysis performed revealed a significant
research outcome.
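
To make the role of the Generalized Gaussian family concrete, the sketch below evaluates its zero-mean density for a given shape parameter. The symbol names (`beta` for shape, `alpha` for scale) and the variance normalisation are assumptions for illustration, not notation taken from the thesis. With this parameterisation, `beta = 2` recovers the Gaussian and `beta = 1` the Laplacian, while values below 1 give more peaked, heavier-tailed shapes that resemble, without exactly matching, the Gamma case.

```python
import math


def generalized_gaussian_pdf(x, beta, sigma=1.0):
    """Zero-mean Generalized Gaussian density with shape beta and std sigma.

    The scale alpha is chosen so that the variance equals sigma^2:
    alpha = sigma * sqrt(Gamma(1/beta) / Gamma(3/beta)).
    """
    if beta <= 0.0 or sigma <= 0.0:
        raise ValueError("beta and sigma must be positive")
    alpha = sigma * math.sqrt(math.gamma(1.0 / beta) / math.gamma(3.0 / beta))
    norm = beta / (2.0 * alpha * math.gamma(1.0 / beta))
    return norm * math.exp(-((abs(x) / alpha) ** beta))


# Sweep the shape parameter from Gaussian (2) through Laplacian (1) to a
# more peaked, heavier-tailed shape (0.5); the values are illustrative.
densities_at_half = {beta: generalized_gaussian_pdf(0.5, beta) for beta in (2.0, 1.0, 0.5)}
```
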
Furthermore, performance enhancement when using VAD in combination with noise
reduction systems is also discussed in terms of residual suppression within silence intervals.
For this purpose, a noise reduction architecture has been developed based on
cascading a one-pass scheme.
The final application of VAD examined in the thesis is in the area of biomedical
signal processing. A modification of one of the VAD systems developed is employed
to provide preliminary detection of one of the major breathing-related sleep disorders,
apnea. The motivation for this system is the ability to monitor patients unobtrusively
at home, improving the reliability of sleep disorder detection in
home environments and offering comfort and time savings to patients.
Original language: English
Publication status: Published - 2012
