
Abstract

Equipping robots with the ability to identify who is talking to them is an important step towards natural and effective verbal interaction. However, speaker identification for voice control remains largely unexplored compared to recent progress in natural language instruction and speech recognition. This motivates us to tackle text-independent speaker identification for human-robot interaction applications in industrial environments. By representing audio segments as time-frequency spectrograms, the task can be formulated as image classification, allowing us to apply state-of-the-art convolutional neural network (CNN) architectures. To achieve robust prediction in unconstrained, challenging acoustic conditions, we take a data-driven approach and collect a custom dataset with a far-field microphone array, featuring over 3 hours of "in the wild" audio recordings from six speakers, which are then encoded into spectral images for CNN-based classification. We propose a shallow 3-layer CNN, which we compare with the widely used ResNet-18 architecture: in addition to benchmarking these models in terms of accuracy, we visualize the features they use to discriminate between classes and investigate their reliability in unseen acoustic scenes. Although ResNet-18 reaches the highest raw accuracy, we achieve remarkable online speaker recognition performance with a much more lightweight model, which learns lower-level vocal features and produces more reliable confidence scores. The proposed method is successfully integrated into a robotic dialogue system and showcased in a mock user localization and authentication scenario in a realistic industrial environment: https://youtu.be/IVtZ8LKJZ7A.
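For concreteness, the spectrogram-to-CNN pipeline described above could look roughly like the PyTorch/torchaudio sketch below. The abstract does not specify the preprocessing parameters or the shallow network's layer widths, so the mel settings, channel sizes, and the six-class output here are illustrative assumptions rather than the authors' exact configuration.

```python
# Minimal sketch of the spectrogram-plus-CNN pipeline from the abstract.
# All hyperparameters (mel bins, window sizes, channel widths, six classes)
# are assumptions for illustration, not the paper's reported settings.
import torch
import torch.nn as nn
import torchaudio

# Encode a mono waveform as a log-mel spectrogram "image" (1 x mels x frames).
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=512, hop_length=160, n_mels=64
)
to_db = torchaudio.transforms.AmplitudeToDB()

class ShallowSpeakerCNN(nn.Module):
    """A 3-convolutional-layer classifier in the spirit of the shallow
    model the paper compares against ResNet-18 (layer widths assumed)."""
    def __init__(self, num_speakers: int = 6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),          # pool to a fixed-size embedding
        )
        self.classifier = nn.Linear(64, num_speakers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.features(x).flatten(1)
        return self.classifier(z)             # logits over enrolled speakers

# Usage: one second of 16 kHz audio -> spectrogram image -> speaker scores.
waveform = torch.randn(1, 16000)              # stand-in for a real recording
spec = to_db(mel(waveform)).unsqueeze(0)      # shape (1, 1, 64, frames)
logits = ShallowSpeakerCNN()(spec)
probs = logits.softmax(dim=-1)                # confidence per enrolled speaker
```

A model this small can run online on robot hardware, which is consistent with the abstract's point that the lightweight network, not the deeper ResNet-18, was the one integrated into the dialogue system.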

Original language: English
Title of host publication: 2021 30th IEEE International Conference on Robot and Human Interactive Communication, RO-MAN 2021
Number of pages: 7
Publisher: IEEE
Publication date: 8 Aug 2021
Pages: 272-278
Commissioning body: Horizon Europe
ISBN (Electronic): 9781665404921
DOIs
Publication status: Published - 8 Aug 2021
Event: 30th IEEE International Conference on Robot and Human Interactive Communication, RO-MAN 2021 - Virtual, Vancouver, Canada
Duration: 8 Aug 2021 - 12 Aug 2021

Conference

Conference: 30th IEEE International Conference on Robot and Human Interactive Communication, RO-MAN 2021
Country/Territory: Canada
City: Virtual, Vancouver
Period: 08/08/2021 - 12/08/2021
Series: IEEE RO-MAN proceedings
ISSN: 1944-9445

Bibliographical note

Funding Information:
We would like to thank the research assistants Martin Bieber, Jinha Park and Hahyeon Kim for their significant help with robotic integration and lending their voices to this experiment. We would also like to thank Letizia Marchegiani for her valuable technical input. Lastly, we would like to acknowledge support by the H2020-WIDESPREAD project no. 857061 “Networking for Research and Development of Human Interactive and Sensitive Robotics Taking Advantage of Additive Manufacturing – R2P2” and the EU’s SMART EUREKA programme under grant agreement S0218-chARmER.

Publisher Copyright:
© 2021 IEEE.

Keywords

  • speaker identification
  • human-robot interaction
  • CNN
  • Little Helper robot
