Abstract
This thesis concerns methods for preprocessing speech signals so that the parameters of models of these signals can be estimated robustly. The focus is on the situation where the desired signal is contaminated by colored noise. Estimating the speech signal, or its voiced and unvoiced components, from a noisy observation requires robust estimators that can handle colored and nonstationary noise.
Two important aspects are investigated. The first is robust estimation of speech signal parameters, such as the fundamental frequency, which is required in many contexts. For this purpose, fast estimation methods based on a simple white Gaussian noise (WGN) assumption are often used. To keep using those methods, the noisy signal can be preprocessed with a filter. If the colored noise is modelled as an autoregressive (AR) process whose parameters are estimated from the noisy signal, the noise component can be rendered closer to white with a simple preprocessing filter (a prewhitener). The fundamental frequency can then be estimated under the WGN assumption. In nonstationary noise scenarios, an adaptive prewhitening filter based on supervised noise statistics estimates yields better estimates of the noise spectral envelope, and a higher degree of spectral flatness, than one based on unsupervised noise statistics. A prewhitening filter also improves the accuracy of a source localization method. The problem of jointly estimating the parameters of the voiced speech and of the stochastic signal parts (i.e., unvoiced speech and additive noise) is solved in two stages: first by a cascade of a prewhitening filter and the nonlinear least squares (NLS) fundamental frequency estimator, and then by iteratively reestimating the prewhitening filter from the modelled residual and reestimating the fundamental frequency. This further reduces the number of gross errors in the fundamental frequency estimates as well as the voicing detection errors.
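The AR-based prewhitening idea can be illustrated with a minimal sketch. It is not the thesis implementation: it assumes a noise-only realization is available for fitting, uses the standard autocorrelation (Yule-Walker) method, and the names `ar_coefficients`, `prewhiten`, and `spectral_flatness` are hypothetical helpers introduced here for illustration.

```python
import numpy as np

def ar_coefficients(x, order):
    """Yule-Walker estimate of a = [1, a1, ..., ap] for an AR(p) model
    x[n] + a1*x[n-1] + ... + ap*x[n-p] = e[n], with e[n] white."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    # Biased autocorrelation estimates r[0..order]
    r = np.array([np.dot(x[: n - k], x[k:]) / n for k in range(order + 1)])
    # Toeplitz normal equations: R a = -r[1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, -r[1:])
    return np.concatenate(([1.0], a))

def prewhiten(y, a):
    """Apply the FIR prewhitening filter A(z) = 1 + a1 z^-1 + ... to y."""
    return np.convolve(y, a)[: len(y)]

def spectral_flatness(x):
    """Geometric mean over arithmetic mean of the periodogram (1 = flat)."""
    p = np.abs(np.fft.rfft(x)) ** 2 + 1e-12
    return np.exp(np.mean(np.log(p))) / np.mean(p)

# Toy example (assumed data): AR(1) colored noise v[n] = 0.9 v[n-1] + w[n]
rng = np.random.default_rng(0)
w = rng.standard_normal(20000)
v = np.zeros_like(w)
for i in range(1, len(w)):
    v[i] = 0.9 * v[i - 1] + w[i]

a_hat = ar_coefficients(v, order=1)      # expected to be close to [1, -0.9]
flat_before = spectral_flatness(v)
flat_after = spectral_flatness(prewhiten(v, a_hat))
```

In this toy setting the filtered noise has a flatter spectrum than the input, which is the property a WGN-based fundamental frequency estimator relies on. In practice the AR parameters would have to be estimated from the noisy speech itself, as described above.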
The second aspect is the extraction of individual speech components (i.e., voiced and unvoiced speech) from a noisy speech signal once more accurate parameter estimates have been obtained. This is investigated through linear filtering based on the statistics of the individual components. A Wiener filtering approach allows for a better recovery of both components than state-of-the-art decomposition methods, which assume that the additive noise is small and insignificant. Instead of using a fixed segment length for the extraction, we also propose to use time-varying segment lengths that are adapted to the signal. The optimal segmentation is obtained once the parameter estimates of a hybrid speech model have been found for all candidate models and segment lengths.
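As a rough illustration of Wiener-type component extraction, the sketch below computes frequency-domain gains for the voiced and unvoiced parts from given power spectral densities (PSDs). It assumes the three components are mutually uncorrelated and that their PSDs are known; the function name `wiener_gains` and the toy PSD values are hypothetical, and the filters used in the thesis may be formulated differently.

```python
import numpy as np

def wiener_gains(psd_voiced, psd_unvoiced, psd_noise, floor=1e-12):
    """Per-frequency Wiener gains for the voiced and unvoiced components,
    assuming uncorrelated voiced speech, unvoiced speech, and noise."""
    total = np.maximum(psd_voiced + psd_unvoiced + psd_noise, floor)
    return psd_voiced / total, psd_unvoiced / total

# Toy PSDs on six frequency bins (assumed values, for illustration only)
psd_v = np.array([0.0, 4.0, 0.0, 4.0, 0.0, 0.0])  # harmonic peaks
psd_u = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0])  # high-frequency unvoiced part
psd_n = np.ones(6)                                 # noise floor

g_v, g_u = wiener_gains(psd_v, psd_u, psd_n)
```

Because the noise PSD appears in the denominator of both gains, neither component absorbs the noise energy, in contrast to a decomposition that treats the additive noise as negligible.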
Original language  English

Supervisors

Publisher
ISBNs, electronic  9788772109848
DOI
Status  Published  2021
Bibliographical note
PhD supervisor: Professor Mads Græsbøll Christensen, Aalborg University
Assistant PhD supervisor: Associate Professor Jesper Kjær Nielsen, Siemens Gamesa