Pre-processing of Speech Signals for Robust Parameter Estimation

Alfredo Esquivel Jaramillo

doi:10.54337/aau456472165

Publication: PhD thesis

Abstract

This thesis addresses methods for pre-processing speech signals so that the parameters of speech signal models can be estimated robustly. The focus is on the situation where the desired signal is contaminated by colored noise. Estimating the speech signal, or its voiced and unvoiced components, from a noisy observation requires robust estimators that can handle colored and non-stationary noise.

Two important aspects are investigated. The first is robust estimation of the speech signal parameters, such as the fundamental frequency, which is required in many contexts. Fast estimation methods based on a simple white Gaussian noise (WGN) assumption are often used for this purpose. To keep using those methods, the noisy signal can be pre-processed with a filter. If the colored noise is modelled as an autoregressive (AR) process whose parameters are estimated from the noisy signal, a simple pre-processing filter (a pre-whitener) can render the noise component closer to white. This makes it possible to estimate the fundamental frequency under the aforementioned WGN assumption. In non-stationary noise scenarios, an adaptive pre-whitening filter based on supervised noise statistics estimates yields better estimates of the noise spectral envelope and a higher degree of spectral flatness than one based on unsupervised noise statistics. A pre-whitening filter also improves the accuracy of a source localization method. The problem of jointly estimating the parameters of the voiced speech and of the stochastic signal parts (i.e., unvoiced speech and additive noise) is solved first by the cascade of a pre-whitening filter and the nonlinear least squares (NLS) fundamental frequency estimator, followed by an iterative estimation of the pre-whitening filter, based on the modelled residual, and a re-estimation of the fundamental frequency. This further reduces the number of gross errors in the fundamental frequency estimates and the number of voicing detection errors.
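The cascade of a pre-whitener and an NLS fundamental frequency estimator can be illustrated with a minimal Python sketch. It is not the thesis implementation: it assumes a noise-only excerpt is available for fitting the AR model (the thesis estimates the noise statistics from the noisy signal), it uses a plain Yule-Walker fit, a known number of harmonics, and a grid search over candidate frequencies. All function names (`ar_prewhitener`, `prewhiten`, `spectral_flatness`, `nls_f0`) are hypothetical.

```python
import numpy as np


def ar_prewhitener(noise, order):
    """Fit an AR model by Yule-Walker; return the FIR pre-whitener [1, a1, ..., ap].

    Assumed model: x[n] + a1*x[n-1] + ... + ap*x[n-p] = e[n], with e[n] white.
    """
    noise = np.asarray(noise, dtype=float)
    n = len(noise)
    # Biased autocorrelation estimates r[0], ..., r[order].
    r = np.array([noise[: n - k] @ noise[k:] / n for k in range(order + 1)])
    # Toeplitz normal equations R a = -r[1:].
    lags = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    a = np.linalg.solve(r[lags], -r[1:])
    return np.concatenate(([1.0], a))


def prewhiten(y, a):
    """Filter y with the causal FIR filter A(z) = 1 + a1*z^-1 + ... + ap*z^-p."""
    return np.convolve(y, a)[: len(y)]


def spectral_flatness(x):
    """Geometric mean over arithmetic mean of the periodogram (1 means flat)."""
    p = np.abs(np.fft.rfft(x)) ** 2 + 1e-12
    return float(np.exp(np.mean(np.log(p))) / np.mean(p))


def nls_f0(x, fs, f0_grid, n_harmonics):
    """Grid-search NLS estimate under a white-noise assumption.

    For each candidate f0, x is projected onto the harmonic basis and the
    candidate with the largest projected energy is returned.
    """
    x = np.asarray(x, dtype=float)
    t = np.arange(len(x)) / fs
    harmonics = np.arange(1, n_harmonics + 1)
    best_f0, best_energy = None, -np.inf
    for f0 in f0_grid:
        phase = 2.0 * np.pi * f0 * np.outer(t, harmonics)
        Z = np.hstack((np.cos(phase), np.sin(phase)))
        coef, *_ = np.linalg.lstsq(Z, x, rcond=None)
        energy = float(np.sum((Z @ coef) ** 2))
        if energy > best_energy:
            best_f0, best_energy = float(f0), energy
    return best_f0
```

A typical call would fit `a = ar_prewhitener(noise_excerpt, order)` once and then run `nls_f0(prewhiten(frame, a), fs, f0_grid, n_harmonics)` on each frame. Because the pre-whitener is linear and time-invariant, it changes the amplitudes and phases of the harmonics but not their frequencies, so the harmonic model still applies to the filtered frame.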

The second aspect is as follows: once more accurate parameter estimates have been obtained, the extraction of individual speech components (i.e., voiced and unvoiced speech) from a noisy speech signal is investigated through linear filtering based on the statistics of the individual components. A Wiener filtering approach allows for a better recovery of both components than state-of-the-art decomposition methods, which assume that the additive noise is small and insignificant. Instead of using a fixed segment length for the extraction, we also propose to use time-varying segment lengths that are adapted to the signal. The optimal segmentation is obtained once the parameter estimates of a hybrid speech model have been found for all possible candidate models and segment lengths.
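The idea of extracting each component with a filter built from component statistics can be sketched as a per-bin Wiener gain. This is a simplified illustration, not the filters derived in the thesis: it works in the frequency domain on a single segment and assumes that the power spectral densities of the voiced part, the unvoiced part, and the noise are already known. The function names `wiener_gains` and `extract_components` are hypothetical.

```python
import numpy as np


def wiener_gains(psd_voiced, psd_unvoiced, psd_noise):
    """Per-bin Wiener gains for the voiced and the unvoiced component.

    Assumes the three components are mutually uncorrelated, so the PSD of the
    observation is the sum of the three component PSDs.
    """
    total = np.maximum(psd_voiced + psd_unvoiced + psd_noise, 1e-12)
    return psd_voiced / total, psd_unvoiced / total


def extract_components(y, psd_voiced, psd_unvoiced, psd_noise):
    """Apply the gains to one segment y; PSD arrays have len(y)//2 + 1 bins."""
    y = np.asarray(y, dtype=float)
    spectrum = np.fft.rfft(y)
    g_v, g_u = wiener_gains(psd_voiced, psd_unvoiced, psd_noise)
    voiced = np.fft.irfft(g_v * spectrum, n=len(y))
    unvoiced = np.fft.irfft(g_u * spectrum, n=len(y))
    return voiced, unvoiced
```

When the noise PSD is not negligible, both gains shrink in the bins where the noise dominates, which is the case that decomposition methods assuming insignificant noise do not account for.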

Original language	English
Supervisors	Christensen, Mads Græsbøll, Principal supervisor; Nielsen, Jesper Kjær, Company supervisor, External person
Publisher	Aalborg Universitetsforlag
ISBNs, electronic	978-87-7210-984-8
DOI	https://doi.org/10.54337/aau456472165
Status	Published - 2021

Bibliographical note

PhD supervisor:
Professor Mads Græsbøll Christensen, Aalborg University

Assistant PhD supervisor:
Associate Professor Jesper Kjær Nielsen, Siemens Gamesa

Citation formats

@misc{d705661b0a384098bfe9f5d917f33323,

title = "Pre-processing of Speech Signals for Robust Parameter Estimation",

author = "{Esquivel Jaramillo}, Alfredo",

note = "PhD supervisor: Professor Mads Gr{\ae}sb{\o}ll Christensen, Aalborg University Assistant PhD supervisor: Associate Professor Jesper Kj{\ae}r Nielsen, Siemens Gamesa",

year = "2021",

doi = "10.54337/aau456472165",

language = "English",

series = "Ph.d.-serien for Det Tekniske Fakultet for IT og Design, Aalborg Universitet",

publisher = "Aalborg Universitetsforlag",

}

TY - GEN

T1 - Pre-processing of Speech Signals for Robust Parameter Estimation

AU - Esquivel Jaramillo, Alfredo

N1 - PhD supervisor: Professor Mads Græsbøll Christensen, Aalborg University Assistant PhD supervisor: Associate Professor Jesper Kjær Nielsen, Siemens Gamesa

PY - 2021

Y1 - 2021

U2 - 10.54337/aau456472165

DO - 10.54337/aau456472165

M3 - PhD thesis

T3 - Ph.d.-serien for Det Tekniske Fakultet for IT og Design, Aalborg Universitet

PB - Aalborg Universitetsforlag

ER -
