Attention in Multimodal Neural Networks for Person Re-identification

Publication: Contribution to book/anthology/report/conference proceeding › Conference article in proceeding › Research › Peer-reviewed

Abstract

In spite of increasing interest from the research community, person re-identification remains an unsolved problem. Correctly deciding on a true match by comparing images of a person, captured by several cameras, requires extraction of discriminative features to counter challenges such as changes in lighting, viewpoint and occlusion. Besides devising novel feature descriptors, the setup can be changed to capture persons from an overhead viewpoint rather than a horizontal one. Furthermore, additional modalities can be considered that are not affected by the same environmental changes as RGB images. In this work, we present a Multimodal ATtention network (MAT) based on RGB and depth modalities. We combine a Convolutional Neural Network with an attention module to extract local, discriminative features that are fused with globally extracted features. Attention is based on the correlation between the two modalities, and we finally also fuse RGB and depth features to generate a joint multilevel RGB-D feature. Experiments conducted on three datasets captured from an overhead view show the importance of attention, increasing accuracies by 3.43%, 2.01% and 2.13% on OPR, DPI-T and TVPR, respectively.
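The abstract describes cross-modal attention driven by the correlation between RGB and depth feature maps, local (attended) features fused with global features, and a final fusion of RGB and depth into a joint RGB-D descriptor. The PyTorch sketch below illustrates one plausible reading of that design; the module names, the 1x1 projections, the sigmoid correlation map and the concatenation-based fusion are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of cross-modal correlation attention and RGB-D fusion;
# shapes, backbones and fusion choices are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalCorrelationAttention(nn.Module):
    """Weights spatial locations of each modality by its correlation with the other."""

    def __init__(self, channels: int):
        super().__init__()
        self.proj_rgb = nn.Conv2d(channels, channels, kernel_size=1)
        self.proj_depth = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, f_rgb: torch.Tensor, f_depth: torch.Tensor):
        # f_rgb, f_depth: (B, C, H, W) feature maps from modality-specific CNN backbones.
        r = self.proj_rgb(f_rgb)
        d = self.proj_depth(f_depth)
        # Per-location correlation between the two modalities -> (B, 1, H, W) attention map.
        corr = (r * d).sum(dim=1, keepdim=True) / r.size(1) ** 0.5
        attn = torch.sigmoid(corr)
        # Local (attended) features for each modality.
        return f_rgb * attn, f_depth * attn


def fuse_global_local(f_map: torch.Tensor, f_local: torch.Tensor) -> torch.Tensor:
    # Global feature from average pooling, local feature from the attended map;
    # concatenation is one plausible way to fuse the two levels.
    g = F.adaptive_avg_pool2d(f_map, 1).flatten(1)
    l = F.adaptive_avg_pool2d(f_local, 1).flatten(1)
    return torch.cat([g, l], dim=1)


if __name__ == "__main__":
    attn = CrossModalCorrelationAttention(channels=256)
    f_rgb = torch.randn(4, 256, 16, 8)
    f_depth = torch.randn(4, 256, 16, 8)
    loc_rgb, loc_depth = attn(f_rgb, f_depth)
    # Joint RGB-D descriptor: fused RGB and depth embeddings concatenated.
    rgbd = torch.cat([fuse_global_local(f_rgb, loc_rgb),
                      fuse_global_local(f_depth, loc_depth)], dim=1)
    print(rgbd.shape)  # torch.Size([4, 1024])
```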

Details

Original language: English
Title: 2018 IEEE Computer Vision and Pattern Recognition Workshops: Visual Understanding of Humans in Crowd Scene
Number of pages: 9
Publisher: IEEE
Publication date: Jun. 2018
Pages: 179-187
Status: Published - Jun. 2018
Publication type: Research
Peer review: Yes
Event: IEEE Conference on Computer Vision and Pattern Recognition, 2018 - Salt Lake City, USA
Duration: 18 Jun. 2018 – 22 Jun. 2018

Conference

Conference: IEEE Conference on Computer Vision and Pattern Recognition, 2018
Country: USA
City: Salt Lake City
Period: 18/06/2018 – 22/06/2018
