Abstract

Person re-identification requires extraction of discriminative features to ensure a correct match; this must be done independent of challenges, such as occlusion, view, or lighting changes. While occlusion can be eliminated by changing the camera setup from a horizontal to a vertical (overhead) viewpoint, other challenges arise as the total visible surface area of persons is decreased. As a result, methods that focus on the most discriminative regions of persons must be applied, while different domains should also be considered to extract different semantics. To further increase feature discriminability, complementary features extracted at different abstraction levels should be fused. To emphasize features at certain abstraction levels depending on the input, fusion should be done intelligently. This work considers multiple domains and feature discrimination, where a multimodal convolutional neural network is applied to fuse RGB and depth information. To extract multi-local discriminative features, two different attention modules are proposed: (1) a spatial attention module, which is able to capture local information at different abstraction levels, and (2) a layer-wise attention module, which works as a dynamic weighting scheme to assign weights and fuse local abstraction-level features intelligently, depending on the input image. By fusing local and global features in a multimodal context, we show state-of-the-art accuracies on two publicly available datasets, DPI-T and TVPR, while increasing the state-of-the-art accuracy on a third dataset, OPR. Finally, through both visual and quantitative analysis, we show the ability of the proposed system to leverage multiple frames, by adapting feature weighting depending on the input.
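The layer-wise attention described above assigns input-dependent weights to features from different abstraction levels before fusing them. A minimal sketch of that weighting scheme is shown below; the function name, score parameters, and shapes are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

def layerwise_attention(level_feats, score_weights):
    """Fuse per-level features with input-dependent softmax weights.

    level_feats: list of L feature vectors (one per abstraction level), each shape (d,)
    score_weights: array of shape (L, d); maps each level's feature to a scalar score
    Returns the fused feature vector and the attention weights.
    """
    # One scalar relevance score per abstraction level, computed from the input itself
    scores = np.array([score_weights[i] @ f for i, f in enumerate(level_feats)])
    # Softmax turns the scores into a dynamic weighting that sums to 1
    alphas = np.exp(scores - scores.max())
    alphas /= alphas.sum()
    # Weighted sum fuses the abstraction-level features
    fused = sum(a * f for a, f in zip(alphas, level_feats))
    return fused, alphas
```

Because the weights are recomputed from each input, a frame where mid-level texture is informative can emphasize those features, while another frame can shift weight toward higher-level semantics.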
Original language: English
Journal: IEEE Transactions on Information Forensics and Security
Volume: 15
Pages (from-to): 1216-1231
Number of pages: 16
ISSN: 1556-6013
DOI: 10.1109/TIFS.2019.2938870
Status: Published - 5 Sep 2019


Cite this

@article{fbd549f2f61f41b4a0fa0030c1321853,
title = "Person Re-identification Using Spatial and Layer-Wise Attention",
abstract = "Person re-identification requires extraction of discriminative features to ensure a correct match; this must be done independent of challenges, such as occlusion, view, or lighting changes. While occlusion can be eliminated by changing the camera setup from a horizontal to a vertical (overhead) viewpoint, other challenges arise as the total visible surface area of persons is decreased. As a result, methods that focus on the most discriminative regions of persons must be applied, while different domains should also be considered to extract different semantics. To further increase feature discriminability, complementary features extracted at different abstraction levels should be fused. To emphasize features at certain abstraction levels depending on the input, fusion should be done intelligently. This work considers multiple domains and feature discrimination, where a multimodal convolutional neural network is applied to fuse RGB and depth information. To extract multi-local discriminative features, two different attention modules are proposed: (1) a spatial attention module, which is able to capture local information at different abstraction levels, and (2) a layer-wise attention module, which works as a dynamic weighting scheme to assign weights and fuse local abstraction-level features intelligently, depending on the input image. By fusing local and global features in a multimodal context, we show state-of-the-art accuracies on two publicly available datasets, DPI-T and TVPR, while increasing the state-of-the-art accuracy on a third dataset, OPR. Finally, through both visual and quantitative analysis, we show the ability of the proposed system to leverage multiple frames, by adapting feature weighting depending on the input.",
keywords = "Artificial neural networks, Dynamic feature fusion, Multimodal sensors, Person re-identification, Soft attention",
author = "Lejb{\o}lle, {Aske Rasch} and Kamal Nasrollahi and Benjamin Krogh and Moeslund, {Thomas B.}",
year = "2019",
month = "9",
day = "5",
doi = "10.1109/TIFS.2019.2938870",
language = "English",
volume = "15",
pages = "1216 -- 1231",
journal = "IEEE Transactions on Information Forensics and Security",
issn = "1556-6013",
publisher = "IEEE",

}

Person Re-identification Using Spatial and Layer-Wise Attention. / Lejbølle, Aske Rasch; Nasrollahi, Kamal; Krogh, Benjamin; Moeslund, Thomas B.

In: IEEE Transactions on Information Forensics and Security, Vol. 15, 05.09.2019, pp. 1216-1231.

Publication: Contribution to journal › Journal article › Research › peer review

TY - JOUR

T1 - Person Re-identification Using Spatial and Layer-Wise Attention

AU - Lejbølle, Aske Rasch

AU - Nasrollahi, Kamal

AU - Krogh, Benjamin

AU - Moeslund, Thomas B.

PY - 2019/9/5

Y1 - 2019/9/5

N2 - Person re-identification requires extraction of discriminative features to ensure a correct match; this must be done independent of challenges, such as occlusion, view, or lighting changes. While occlusion can be eliminated by changing the camera setup from a horizontal to a vertical (overhead) viewpoint, other challenges arise as the total visible surface area of persons is decreased. As a result, methods that focus on the most discriminative regions of persons must be applied, while different domains should also be considered to extract different semantics. To further increase feature discriminability, complementary features extracted at different abstraction levels should be fused. To emphasize features at certain abstraction levels depending on the input, fusion should be done intelligently. This work considers multiple domains and feature discrimination, where a multimodal convolutional neural network is applied to fuse RGB and depth information. To extract multi-local discriminative features, two different attention modules are proposed: (1) a spatial attention module, which is able to capture local information at different abstraction levels, and (2) a layer-wise attention module, which works as a dynamic weighting scheme to assign weights and fuse local abstraction-level features intelligently, depending on the input image. By fusing local and global features in a multimodal context, we show state-of-the-art accuracies on two publicly available datasets, DPI-T and TVPR, while increasing the state-of-the-art accuracy on a third dataset, OPR. Finally, through both visual and quantitative analysis, we show the ability of the proposed system to leverage multiple frames, by adapting feature weighting depending on the input.

AB - Person re-identification requires extraction of discriminative features to ensure a correct match; this must be done independent of challenges, such as occlusion, view, or lighting changes. While occlusion can be eliminated by changing the camera setup from a horizontal to a vertical (overhead) viewpoint, other challenges arise as the total visible surface area of persons is decreased. As a result, methods that focus on the most discriminative regions of persons must be applied, while different domains should also be considered to extract different semantics. To further increase feature discriminability, complementary features extracted at different abstraction levels should be fused. To emphasize features at certain abstraction levels depending on the input, fusion should be done intelligently. This work considers multiple domains and feature discrimination, where a multimodal convolutional neural network is applied to fuse RGB and depth information. To extract multi-local discriminative features, two different attention modules are proposed: (1) a spatial attention module, which is able to capture local information at different abstraction levels, and (2) a layer-wise attention module, which works as a dynamic weighting scheme to assign weights and fuse local abstraction-level features intelligently, depending on the input image. By fusing local and global features in a multimodal context, we show state-of-the-art accuracies on two publicly available datasets, DPI-T and TVPR, while increasing the state-of-the-art accuracy on a third dataset, OPR. Finally, through both visual and quantitative analysis, we show the ability of the proposed system to leverage multiple frames, by adapting feature weighting depending on the input.

KW - Artificial neural networks

KW - Dynamic feature fusion

KW - Multimodal sensors

KW - Person re-identification

KW - Soft attention

U2 - 10.1109/TIFS.2019.2938870

DO - 10.1109/TIFS.2019.2938870

M3 - Journal article

VL - 15

SP - 1216

EP - 1231

JO - IEEE Transactions on Information Forensics and Security

JF - IEEE Transactions on Information Forensics and Security

SN - 1556-6013

ER -