Instantaneous Fundamental Frequency Estimation with Optimal Segmentation for Nonstationary Voiced Speech

Research output: Contribution to journal › Journal article › Research › Peer-reviewed

Abstract

In speech processing, speech is often assumed stationary within segments of 20–30 ms, even though this assumption is well known not to hold. In this paper, we take the non-stationarity of voiced speech into account by describing the speech signal with a linear chirp model. We propose a maximum likelihood estimator of the fundamental frequency and chirp rate of this model and show that it attains the Cramér-Rao bound. Since the speech signal varies over time, a fixed segment length is not optimal, and we therefore propose a segmentation of the signal based on the maximum a posteriori (MAP) criterion. With this segmentation method, the segments are on average longer for the chirp model than for the traditional harmonic model: for the signal under test, the average segment lengths are 24.4 ms and 17.1 ms, respectively. This suggests that the chirp model fits the speech signal better than the harmonic model. The methods assume white Gaussian noise, and two prewhitening filters are therefore also proposed.
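The abstract does not reproduce the estimator itself. As a rough illustration of the idea only, the following is a minimal sketch (not the paper's implementation) of maximum likelihood estimation of the fundamental frequency and chirp rate of a harmonic chirp model in white Gaussian noise, via nonlinear least squares over a coarse grid. All function names, grid choices, and parameters here are illustrative assumptions.

```python
import numpy as np

def chirp_basis(N, f0, alpha, K):
    """Cos/sin basis for K harmonics of a linear chirp with normalized
    start frequency f0 (cycles/sample) and chirp rate alpha."""
    n = np.arange(N)
    phase = 2.0 * np.pi * (f0 * n + 0.5 * alpha * n ** 2)
    cols = []
    for k in range(1, K + 1):
        cols.append(np.cos(k * phase))
        cols.append(np.sin(k * phase))
    return np.column_stack(cols)

def ml_chirp_estimate(x, K, f0_grid, alpha_grid):
    """Grid-search NLS estimate of (f0, alpha).  Under white Gaussian
    noise, ML is equivalent to maximizing the energy of the projection
    of x onto the chirp-harmonic subspace."""
    N = len(x)
    best_f0, best_alpha, best_cost = f0_grid[0], alpha_grid[0], -np.inf
    for f0 in f0_grid:
        for alpha in alpha_grid:
            Z = chirp_basis(N, f0, alpha, K)
            amps, *_ = np.linalg.lstsq(Z, x, rcond=None)
            cost = x @ (Z @ amps)  # x' Z (Z'Z)^{-1} Z' x
            if cost > best_cost:
                best_f0, best_alpha, best_cost = f0, alpha, cost
    return best_f0, best_alpha
```

In practice the coarse grid estimate would be refined (e.g. by a local search around the best grid point), and the chirp-rate parameterization and model order selection would follow the paper's definitions, which are not given in this abstract.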

Details

Original language: English
Article number: 756754
Journal: IEEE Transactions on Audio, Speech, and Language Processing
Volume: 24
Issue number: 12
Pages (from-to): 2354-2367
ISSN: 1558-7916
DOI
Publication status: Published - Dec 2016
Publication category: Research
Peer-reviewed: Yes

Research areas

  • Harmonic chirp model, Parameter estimation, Segmentation, Prewhitening
