Most well-known classic beamformers result from optimization problems that minimize a cost function such as the mean-square error (MSE) between the noisy speech and a reference clean speech. The rationale behind these formulations is a speech-versus-noise dichotomy, in which anything labeled as noise is to be suppressed as much as possible. While this leads to simple closed-form solutions and reasonably practical beamformers, the rationale has its limitations, for instance when the ambient noise provides context and is therefore not entirely undesirable. In this paper, we offer a new rationale, in which the output of the beamformer is minimally processed with respect to a certain reference signal, as long as a given performance criterion is fulfilled. We provide a case study in which the performance criterion is inspired by the Speech Intelligibility Index (SII) and the processing penalty is the MSE. Regarding the reference signal, we consider two cases. In the first case, the reference signal is the unprocessed recording from a reference microphone, giving rise to a beamformer that limits the processing of the noisy signal to the minimum necessary for fulfilling the intelligibility requirement. In the second case, the reference signal is the output of an aggressive beamformer, yielding a beamformer that essentially eliminates the noise unless the concomitant distortion of the clean speech violates the intelligibility requirement. Through simulation studies, we demonstrate some of the benefits that each of the two cases offers in relevant contexts.
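The minimal-processing rationale can be read as a constrained optimization problem. The statement below is only a sketch under assumed notation, not the formulation used in the paper: $\mathbf{y}$ denotes the stacked microphone signals, $\mathbf{w}$ the beamformer weights, $r$ the reference signal, $\mathcal{I}(\mathbf{w})$ a hypothetical SII-inspired intelligibility measure of the beamformer output, and $\mathcal{I}_{0}$ a hypothetical required level.

```latex
% Illustrative notation only; every symbol here is an assumption, not taken from the paper.
\begin{equation}
  \min_{\mathbf{w}} \;
  \mathbb{E}\!\left[ \left| \mathbf{w}^{H}\mathbf{y} - r \right|^{2} \right]
  \quad \text{subject to} \quad
  \mathcal{I}(\mathbf{w}) \ge \mathcal{I}_{0}
\end{equation}
% Case 1 (assumed form): r = \mathbf{e}_{\mathrm{ref}}^{T}\mathbf{y}, the unprocessed reference microphone.
% Case 2 (assumed form): r = \mathbf{w}_{\mathrm{a}}^{H}\mathbf{y}, the output of an aggressive beamformer.
```

In this reading, the objective is the processing penalty (the MSE with respect to the reference) and the constraint is the intelligibility requirement. The two cases differ only in the choice of $r$, where $\mathbf{e}_{\mathrm{ref}}$ is a selection vector for the reference microphone and $\mathbf{w}_{\mathrm{a}}$ stands for the weights of the aggressive beamformer.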
|Journal|IEEE/ACM Transactions on Audio, Speech, and Language Processing|
|Publication status|Published - 2021|