Abstract
We present new results on single-channel speech separation and
propose a new separation approach to improve the speech quality of
signals separated from an observed mixture. The key idea is to derive
a mixture estimator based on sinusoidal parameters. The proposed
estimator aims to find sinusoidal parameters, in the form of
codevectors from vector quantization (VQ) codebooks pre-trained for
the speakers, that, when combined, best fit the observed mixed signal.
The selected codevectors are then used to reconstruct the recovered
signals for the speakers in the mixture. Compared to the log-max
mixture estimator used in binary masks and to the Wiener filtering
approach, the proposed method achieves acceptable perceptual speech
quality with less cross-talk at different signal-to-signal ratios.
Moreover, the method is independent of pitch estimates and reduces the
computational complexity of the separation by replacing the
high-dimensional short-time Fourier transform (STFT) feature vectors
with sinusoidal feature vectors. We report separation results for the
proposed method and compare them with those of other benchmark
methods. The improvements of the proposed method over the other
methods are confirmed using perceptual evaluation of speech quality
(PESQ) as an objective measure and a MUSHRA listening test as a
subjective evaluation, for both speaker-dependent and gender-dependent
scenarios.
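
The codebook search described in the abstract can be illustrated with a minimal sketch. It is not the paper's estimator: it assumes, purely for illustration, that two codevectors are combined by adding their amplitude vectors and that the fit is measured by squared error. The function name `best_codevector_pair` and the toy codebooks are hypothetical.

```python
import numpy as np

def best_codevector_pair(mixture, codebook_a, codebook_b):
    """Return indices (i, j) of the codevector pair whose sum best fits `mixture`.

    Assumption (not from the paper): codevectors combine additively and the
    fit is the squared error. All pairs are searched exhaustively.
    """
    # combined[i, j, :] = codebook_a[i] + codebook_b[j]
    combined = codebook_a[:, None, :] + codebook_b[None, :, :]
    # Squared error of every candidate pair against the observed mixture
    errors = np.sum((combined - mixture) ** 2, axis=-1)
    i, j = np.unravel_index(np.argmin(errors), errors.shape)
    return int(i), int(j)

# Toy codebooks: one row per codevector (hypothetical values)
codebook_a = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]])
codebook_b = np.array([[0.0, 0.0, 2.0],
                       [0.5, 0.5, 0.0]])

# Observed mixture built from one codevector of each speaker
mixture = codebook_a[1] + codebook_b[0]

i, j = best_codevector_pair(mixture, codebook_a, codebook_b)
estimate_a = codebook_a[i]  # recovered parameters for speaker A
estimate_b = codebook_b[j]  # recovered parameters for speaker B
```

The exhaustive search costs one error evaluation per codevector pair, so shorter feature vectors make each evaluation cheaper, which is consistent with the complexity argument in the abstract.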
Original language | English |
---|---|
Journal | IEEE Transactions on Audio, Speech and Language Processing |
Volume | 19 |
Issue number | 5 |
Pages (from-to) | 1265-1277 |
Number of pages | 13 |
ISSN | 1558-7916 |
DOI | |
Status | Published - 2011 |