Instantaneous Fundamental Frequency Estimation with Optimal Segmentation for Nonstationary Voiced Speech

Sidsel Marie Nørholm; Jesper Rindom Jensen; Mads Græsbøll Christensen

doi:10.1109/TASLP.2016.2608948

Publication: Contribution to journal › Journal article › Research › peer review

25 Citations (Scopus)

372 Downloads (Pure)

Abstract

In speech processing, speech is often assumed to be stationary within segments of 20–30 ms, even though this assumption is well known not to hold. In this paper, we take the nonstationarity of voiced speech into account by using a linear chirp model to describe the speech signal. We propose a maximum likelihood estimator of the fundamental frequency and chirp rate of this model, and show that it reaches the Cramér-Rao bound. Since speech varies over time, a fixed segment length is not optimal, and we propose to segment the signal based on the maximum a posteriori (MAP) criterion. With this segmentation method, the segments are on average longer for the chirp model than for the traditional harmonic model. For the signal under test, the average segment length is 24.4 ms for the chirp model and 17.1 ms for the traditional harmonic model. This suggests that the chirp model fits the speech signal better than the harmonic model does. The methods assume white Gaussian noise, and two prewhitening filters are therefore also proposed.
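As a rough illustration of the estimator described in the abstract, the sketch below fits a harmonic chirp model to one segment by a grid search over fundamental frequency and chirp rate. It relies on the standard result that, under white Gaussian noise, maximum likelihood estimation reduces to nonlinear least squares. The exact signal model, the centred time index, the complex-valued test signal, the grids, and all function names (`chirp_basis`, `nls_cost`, `estimate_f0_chirp`) are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def chirp_basis(n, f0, k, num_harmonics, fs):
    """Columns are harmonics l = 1..L of a linear chirp.

    Assumed model: instantaneous fundamental f0 + k*t (Hz), with t = n/fs,
    so harmonic l has phase 2*pi*l*(f0*t + 0.5*k*t**2).
    """
    t = n / fs
    phase = 2 * np.pi * (f0 * t + 0.5 * k * t**2)
    harmonics = np.arange(1, num_harmonics + 1)
    return np.exp(1j * np.outer(phase, harmonics))

def nls_cost(x, Z):
    """Residual energy after least-squares fit of the complex amplitudes."""
    a, *_ = np.linalg.lstsq(Z, x, rcond=None)
    r = x - Z @ a
    return float(np.real(np.vdot(r, r)))

def estimate_f0_chirp(x, fs, num_harmonics, f0_grid, k_grid):
    """Grid search for the (f0, k) pair with the smallest residual energy."""
    N = len(x)
    n = np.arange(N) - (N - 1) / 2  # time index centred on the segment (assumption)
    best_cost, best_f0, best_k = np.inf, None, None
    for f0 in f0_grid:
        for k in k_grid:
            cost = nls_cost(x, chirp_basis(n, f0, k, num_harmonics, fs))
            if cost < best_cost:
                best_cost, best_f0, best_k = cost, f0, k
    return best_f0, best_k

# Synthetic 25 ms segment: 3 harmonics, f0 = 150 Hz at the segment centre,
# chirp rate 400 Hz/s, plus weak white Gaussian noise (all values hypothetical).
fs, N, L = 8000, 200, 3
rng = np.random.default_rng(0)
n = np.arange(N) - (N - 1) / 2
amplitudes = np.exp(1j * np.array([0.3, 1.1, -0.7]))
noise = 0.05 * (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)
x = chirp_basis(n, 150.0, 400.0, L, fs) @ amplitudes + noise

f0_hat, k_hat = estimate_f0_chirp(
    x, fs, L,
    f0_grid=np.arange(100.0, 201.0, 1.0),      # Hz
    k_grid=np.arange(-1000.0, 1001.0, 100.0),  # Hz/s
)
print(f0_hat, k_hat)
```

Setting the chirp-rate grid to the single value 0 gives the traditional harmonic model, so the same routine can be used to compare the residual energy of the two models on a given segment. A practical implementation would normally refine the grid estimate with a local search; that step is omitted here.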

Original language	English
Article number	756754
Journal	IEEE Transactions on Audio, Speech and Language Processing
Volume	24
Issue number	12
Pages (from-to)	2354-2367
ISSN	1558-7916
DOI	https://doi.org/10.1109/TASLP.2016.2608948
Status	Published - Dec. 2016

Keywords

Harmonic chirp model, parameter estimation, segmentation, prewhitening.

Access to Document

10.1109/TASLP.2016.2608948

chirp_ultimate (Submitted manuscript, 623 KB; License: Unspecified)

Localization and Tracking of Speech - a Joint Audio-Visual Approach
Jensen, J. R.
01/10/2013 → 30/09/2016
Projects: Project › Research
Spatio-Temporal Filtering Methods for Enhancement and Separation of Speech Signals
Christensen, M. G., Nørholm, S. M., Karimian-Azari, S. & Jensen, J. R.
01/08/2012 → 30/06/2015
Projects: Project › Research

Citation formats

@article{044f42f77c2e47988688cbeb8949c444,

title = "Instantaneous Fundamental Frequency Estimation with Optimal Segmentation for Nonstationary Voiced Speech",

abstract = "In speech processing, the speech is often considered stationary within segments of 20–30 ms even though it is well known not to be true. In this paper, we take the non-stationarity of voiced speech into account by using a linear chirp model to describe the speech signal. We propose a maximum likelihood estimator of the fundamental frequency and chirp rate of this model, and show that it reaches the Cramer-Rao bound. Since the speech varies over time, a fixed segment length is not optimal, and we propose to make a segmentation of the signal based on the maximum a posteriori (MAP) criterion. Using this segmentation method, the segments are on average seen to be longer for the chirp model compared to the traditional harmonic model. For the signal under test, the average segment length is 24.4 ms and 17.1 ms for the chirp model and traditional harmonic model, respectively. This suggests a better fit of the chirp model than the harmonic model to the speech signal. The methods are based on an assumption of white Gaussian noise, and, therefore, two prewhitening filters are also proposed.",

keywords = "Harmonic chirp model, Parameter estimation, Segmentation, Prewhitening",

author = "N{\o}rholm, {Sidsel Marie} and Jensen, {Jesper Rindom} and Christensen, {Mads Gr{\ae}sb{\o}ll}",

year = "2016",

month = dec,

doi = "10.1109/TASLP.2016.2608948",

language = "English",

volume = "24",

pages = "2354--2367",

journal = "IEEE Transactions on Audio, Speech and Language Processing",

issn = "1558-7916",

publisher = "IEEE Signal Processing Society",

number = "12",

}

Instantaneous Fundamental Frequency Estimation with Optimal Segmentation for Nonstationary Voiced Speech. / Nørholm, Sidsel Marie; Jensen, Jesper Rindom; Christensen, Mads Græsbøll.
In: IEEE Transactions on Audio, Speech and Language Processing, Vol. 24, No. 12, 756754, 12.2016, p. 2354-2367.

TY - JOUR

T1 - Instantaneous Fundamental Frequency Estimation with Optimal Segmentation for Nonstationary Voiced Speech

AU - Nørholm, Sidsel Marie

AU - Jensen, Jesper Rindom

AU - Christensen, Mads Græsbøll

PY - 2016/12

Y1 - 2016/12

N2 - In speech processing, the speech is often considered stationary within segments of 20–30 ms even though it is well known not to be true. In this paper, we take the non-stationarity of voiced speech into account by using a linear chirp model to describe the speech signal. We propose a maximum likelihood estimator of the fundamental frequency and chirp rate of this model, and show that it reaches the Cramer-Rao bound. Since the speech varies over time, a fixed segment length is not optimal, and we propose to make a segmentation of the signal based on the maximum a posteriori (MAP) criterion. Using this segmentation method, the segments are on average seen to be longer for the chirp model compared to the traditional harmonic model. For the signal under test, the average segment length is 24.4 ms and 17.1 ms for the chirp model and traditional harmonic model, respectively. This suggests a better fit of the chirp model than the harmonic model to the speech signal. The methods are based on an assumption of white Gaussian noise, and, therefore, two prewhitening filters are also proposed.

AB - In speech processing, the speech is often considered stationary within segments of 20–30 ms even though it is well known not to be true. In this paper, we take the non-stationarity of voiced speech into account by using a linear chirp model to describe the speech signal. We propose a maximum likelihood estimator of the fundamental frequency and chirp rate of this model, and show that it reaches the Cramer-Rao bound. Since the speech varies over time, a fixed segment length is not optimal, and we propose to make a segmentation of the signal based on the maximum a posteriori (MAP) criterion. Using this segmentation method, the segments are on average seen to be longer for the chirp model compared to the traditional harmonic model. For the signal under test, the average segment length is 24.4 ms and 17.1 ms for the chirp model and traditional harmonic model, respectively. This suggests a better fit of the chirp model than the harmonic model to the speech signal. The methods are based on an assumption of white Gaussian noise, and, therefore, two prewhitening filters are also proposed.

KW - Harmonic chirp model

KW - Parameter estimation

KW - Segmentation

KW - Prewhitening

U2 - 10.1109/TASLP.2016.2608948

DO - 10.1109/TASLP.2016.2608948

M3 - Journal article

SN - 1558-7916

VL - 24

SP - 2354

EP - 2367

JO - IEEE Transactions on Audio, Speech and Language Processing

JF - IEEE Transactions on Audio, Speech and Language Processing

IS - 12

M1 - 756754

ER -
