A Joint Approach for Single-Channel Speaker Identification and Speech Separation

Pejman Mowlaee; Rahim Saeidi; Mads Græsbøll Christensen; Zheng-Hua Tan; Tomi Kinnunen; Pasi Franti; Søren Holdt Jensen

doi:10.1109/TASL.2012.2208627

Research output: Contribution to journal › Journal article › Research › peer-review

39 Citations (Scopus)

803 Downloads (Pure)

Abstract

In this paper, we present a novel system for joint speaker identification and speech separation. For speaker identification, a single-channel speaker identification algorithm is proposed which provides an estimate of the signal-to-signal ratio (SSR) as a by-product. For speech separation, we propose a sinusoidal model-based algorithm. The speech separation algorithm consists of a double-talk/single-talk detector followed by a minimum mean square error estimator of sinusoidal parameters for finding optimal codevectors from pre-trained speaker codebooks. In evaluating the proposed system, we start from a situation where we have prior information on the codebook indices, speaker identities and SSR level, and then, by relaxing these assumptions one by one, we demonstrate the efficiency of the proposed fully blind system. In contrast to previous studies, which mostly focus on automatic speech recognition (ASR) accuracy, we also report objective and subjective results. The results show that the proposed system performs as well as the best of the state of the art in terms of perceived quality, while its speaker identification and ASR performance is generally lower. It outperforms the state of the art in terms of intelligibility, showing that the ASR results are not conclusive. On average, the proposed method achieves 52.3% ASR accuracy, 41.2 points in MUSHRA and 85.9% in speech intelligibility.
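
The abstract refers to the signal-to-signal ratio (SSR) without defining it. As a rough illustration only, the sketch below assumes the usual reading of SSR as the energy ratio, in dB, between the two speakers' signals in a mixture. The function names `ssr_db` and `scale_interferer_for_ssr` are hypothetical and are not taken from the paper, and this is not the paper's SSR estimator, which works from the mixture alone.

```python
import math


def ssr_db(target, interferer):
    """Energy ratio between two equal-length sample sequences, in dB.

    Assumed definition: 10 * log10(E_target / E_interferer).
    """
    e_target = sum(x * x for x in target)
    e_interferer = sum(x * x for x in interferer)
    if e_target == 0 or e_interferer == 0:
        raise ValueError("both signals need non-zero energy")
    return 10.0 * math.log10(e_target / e_interferer)


def scale_interferer_for_ssr(target, interferer, desired_db):
    """Rescale the interferer so that the pair has the desired SSR.

    Scaling the interferer by g lowers the SSR by 20 * log10(g).
    """
    current_db = ssr_db(target, interferer)
    gain = 10.0 ** ((current_db - desired_db) / 20.0)
    return [gain * x for x in interferer]


# Toy signals: the target has four times the energy of the interferer.
target = [1.0, -1.0, 1.0, -1.0]
interferer = [0.5, -0.5, 0.5, -0.5]
mixed_at_0_db = scale_interferer_for_ssr(target, interferer, 0.0)
```

Here the known clean signals are used only to show what the quantity measures; a blind system such as the one described above has to estimate it from the single-channel mixture.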

Original language	English
Journal	IEEE Transactions on Audio, Speech and Language Processing
Volume	20
Issue number	9
Pages (from-to)	2586 - 2601
Number of pages	16
ISSN	1558-7916
DOIs	https://doi.org/10.1109/TASL.2012.2208627
Publication status	Published - Nov 2012

Keywords

BSS EVAL
single-channel speech separation
sinusoidal modeling
speaker identification
speech recognition

Cite this

@article{5a431878158e47408c82b0c87ac87d53,

title = "A Joint Approach for Single-Channel Speaker Identification and Speech Separation",

abstract = "In this paper, we present a novel system for joint speaker identification and speech separation. For speaker identification a single-channel speaker identification algorithm is proposed which provides an estimate of signal-to-signal ratio (SSR) as a by-product. For speech separation, we propose a sinusoidal model-based algorithm. The speech separation algorithm consists of a double-talk/single-talk detector followed by a minimum mean square error estimator of sinusoidal parameters for finding optimal codevectors from pre-trained speaker codebooks. In evaluating the proposed system, we start from a situation where we have prior information of codebook indices, speaker identities and SSR-level, and then, by relaxing these assumptions one by one, we demonstrate the efficiency of the proposed fully blind system. In contrast to previous studies that mostly focus on automatic speech recognition (ASR) accuracy, here, we report the objective and subjective results as well. The results show that the proposed system performs as well as the best of the state-of-the-art in terms of perceived quality while its performance in terms of speaker identification and automatic speech recognition results are generally lower. It outperforms the state-of-the-art in terms of intelligibility showing that the ASR results are not conclusive. The proposed method achieves on average, 52.3% ASR accuracy, 41.2 points in MUSHRA and 85.9% in speech intelligibility.",

keywords = "BSS EVAL, single-channel speech separation, sinusoidal modeling, speaker identification, speech recognition",

author = "Pejman Mowlaee and Rahim Saeidi and Christensen, {Mads Gr{\ae}sb{\o}ll} and Zheng-Hua Tan and Tomi Kinnunen and Pasi Franti and Jensen, {S{\o}ren Holdt}",

year = "2012",

month = nov,

doi = "10.1109/TASL.2012.2208627",

language = "English",

volume = "20",

pages = "2586--2601",

journal = "IEEE Transactions on Audio, Speech and Language Processing",

issn = "1558-7916",

publisher = "IEEE Signal Processing Society",

number = "9",

}

TY - JOUR

T1 - A Joint Approach for Single-Channel Speaker Identification and Speech Separation

AU - Mowlaee, Pejman

AU - Saeidi, Rahim

AU - Christensen, Mads Græsbøll

AU - Tan, Zheng-Hua

AU - Kinnunen, Tomi

AU - Franti, Pasi

AU - Jensen, Søren Holdt

PY - 2012/11

Y1 - 2012/11

AB - In this paper, we present a novel system for joint speaker identification and speech separation. For speaker identification a single-channel speaker identification algorithm is proposed which provides an estimate of signal-to-signal ratio (SSR) as a by-product. For speech separation, we propose a sinusoidal model-based algorithm. The speech separation algorithm consists of a double-talk/single-talk detector followed by a minimum mean square error estimator of sinusoidal parameters for finding optimal codevectors from pre-trained speaker codebooks. In evaluating the proposed system, we start from a situation where we have prior information of codebook indices, speaker identities and SSR-level, and then, by relaxing these assumptions one by one, we demonstrate the efficiency of the proposed fully blind system. In contrast to previous studies that mostly focus on automatic speech recognition (ASR) accuracy, here, we report the objective and subjective results as well. The results show that the proposed system performs as well as the best of the state-of-the-art in terms of perceived quality while its performance in terms of speaker identification and automatic speech recognition results are generally lower. It outperforms the state-of-the-art in terms of intelligibility showing that the ASR results are not conclusive. The proposed method achieves on average, 52.3% ASR accuracy, 41.2 points in MUSHRA and 85.9% in speech intelligibility.

KW - BSS EVAL

KW - single-channel speech separation

KW - sinusoidal modeling

KW - speaker identification

KW - speech recognition

UR - http://www.scopus.com/inward/record.url?scp=84865693405&partnerID=8YFLogxK

U2 - 10.1109/TASL.2012.2208627

DO - 10.1109/TASL.2012.2208627

M3 - Journal article

SN - 1558-7916

VL - 20

SP - 2586

EP - 2601

JO - IEEE Transactions on Audio, Speech and Language Processing

JF - IEEE Transactions on Audio, Speech and Language Processing

IS - 9

ER -
