Abstract

Equipping robots with the ability to identify who is talking to them is an important step towards natural and effective verbal interaction. However, speaker identification for voice control remains largely unexplored compared to recent progress in natural language instruction and speech recognition. This motivates us to tackle text-independent speaker identification for human-robot interaction applications in industrial environments. By representing audio segments as time-frequency spectrograms, this can be formulated as an image classification task, allowing us to apply state-of-the-art convolutional neural network (CNN) architectures. To achieve robust prediction in unconstrained, challenging acoustic conditions, we take a data-driven approach and collect a custom dataset with a far-field microphone array, featuring over 3 hours of "in the wild" audio recordings from six speakers, which are then encoded into spectral images for CNN-based classification. We propose a shallow 3-layer CNN, which we compare with the widely used ResNet-18 architecture: in addition to benchmarking these models in terms of accuracy, we visualize the features used by these two models to discriminate between classes, and investigate their reliability in unseen acoustic scenes. Although ResNet-18 reaches the highest raw accuracy, we are able to achieve remarkable online speaker recognition performance with a much more lightweight model which learns lower-level vocal features and produces more reliable confidence scores. The proposed method is successfully integrated into a robotic dialogue system and showcased in a mock user localization and authentication scenario in a realistic industrial environment: https://youtu.be/IVtZ8LKJZ7A.
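
The encoding step mentioned in the abstract, turning an audio segment into a time-frequency image, can be illustrated with a minimal sketch. This is not the authors' pipeline: the abstract does not state any spectrogram parameters, so the frame length, hop size, Hann window, dB scaling, and the function names `hann` and `log_spectrogram` are all assumptions chosen for illustration. The sketch uses a plain DFT from the Python standard library to stay dependency-free; a practical system would more likely rely on an FFT library.

```python
import cmath
import math


def hann(n):
    """Symmetric Hann window of length n (illustrative choice of window)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def log_spectrogram(samples, frame_len=64, hop=32, eps=1e-10):
    """Return a list of frames; each frame holds log-magnitudes (dB)
    for frequency bins 0 .. frame_len // 2.

    frame_len, hop and eps are assumed values, not taken from the paper.
    """
    window = hann(frame_len)
    n_bins = frame_len // 2 + 1
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        # Apply the window to one overlapping frame of audio.
        frame = [samples[start + i] * window[i] for i in range(frame_len)]
        row = []
        for k in range(n_bins):
            # Direct DFT of the windowed frame at bin k.
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            row.append(20 * math.log10(abs(acc) + eps))
        frames.append(row)
    return frames


# Synthetic input: a 1 kHz tone sampled at 8 kHz (hypothetical test signal).
sample_rate = 8000
tone_hz = 1000
samples = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(512)]

spec = log_spectrogram(samples)
# Bin spacing is sample_rate / frame_len = 125 Hz, so the tone falls in bin 8.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
```

The resulting `spec` is a 2-D array (time frames by frequency bins) that could be treated as a single-channel image and passed to an image classifier such as the CNNs compared in the abstract.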

Original language: English
Title: 2021 30th IEEE International Conference on Robot and Human Interactive Communication, RO-MAN 2021
Number of pages: 7
Publisher: IEEE
Publication date: 8 Aug 2021
Pages: 272-278
Applicant: Horizon Europe
ISBN (Electronic): 9781665404921
DOI
Status: Published - 8 Aug 2021
Event: 30th IEEE International Conference on Robot and Human Interactive Communication, RO-MAN 2021 - Virtual, Vancouver, Canada
Duration: 8 Aug 2021 to 12 Aug 2021

Conference

Conference: 30th IEEE International Conference on Robot and Human Interactive Communication, RO-MAN 2021
Country/Territory: Canada
City: Virtual, Vancouver
Period: 08/08/2021 to 12/08/2021
Name: IEEE RO-MAN proceedings
ISSN: 1944-9445

Bibliographical note

Publisher Copyright:
© 2021 IEEE.

Fingerprint

Dive into the research topics of 'Why talk to people when you can talk to robots? Far-field speaker identification in the wild'. Together they form a unique fingerprint.
