Person Re-identification Using Spatial and Layer-Wise Attention

Aske Rasch Lejbølle; Kamal Nasrollahi; Benjamin Krogh; Thomas B. Moeslund

doi:10.1109/TIFS.2019.2938870

Research output: Contribution to journal › Journal article › Research › peer-review

14 Citations (Scopus)

266 Downloads (Pure)

Abstract

Person re-identification requires extraction of discriminative features to ensure a correct match; this must be done independent of challenges, such as occlusion, view, or lighting changes. While occlusion can be eliminated by changing the camera setup from a horizontal to a vertical (overhead) viewpoint, other challenges arise as the total visible surface area of persons is decreased. As a result, methods that focus on the most discriminative regions of persons must be applied, while different domains should also be considered to extract different semantics. To further increase feature discriminability, complementary features extracted at different abstraction levels should be fused. To emphasize features at certain abstraction levels depending on the input, fusion should be done intelligently. This work considers multiple domains and feature discrimination, where a multimodal convolution neural network is applied to fuse RGB and depth information. To extract multilocal discriminative features, two different attention modules are proposed: (1) a spatial attention module, which is able to capture local information at different abstraction levels, and (2) a layer-wise attention module, which works as a dynamic weighting scheme to assign weights and fuse local abstraction-level features intelligently, depending on the input image. By fusing local and global features in a multimodal context, we show state-of-the-art accuracies on two publicly available datasets, DPI-T and TVPR, while increasing the state-of-the-art accuracy on a third dataset, OPR. Finally, through both visual and quantitative analysis, we show the ability of the proposed system to leverage multiple frames, by adapting feature weighting depending on the input.
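The two attention ideas named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the use of a softmax, and the hand-picked scores are all assumptions made for illustration (in a trained network the scores would be predicted from the input image), and the per-layer features are assumed to have already been projected to a common dimension so that they can be summed.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def spatial_attention(feature_map: Sequence[Sequence[float]],
                      location_scores: Sequence[float]) -> List[float]:
    """Pool one layer's feature map into a single local feature vector.

    feature_map holds one channel vector per spatial location;
    location_scores (hypothetical, normally learned) holds one score
    per location. Locations with higher scores contribute more.
    """
    weights = softmax(location_scores)
    channels = len(feature_map[0])
    return [sum(w * loc[c] for w, loc in zip(weights, feature_map))
            for c in range(channels)]


def layer_wise_fusion(layer_features: Sequence[Sequence[float]],
                      layer_scores: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Fuse per-layer feature vectors with input-dependent weights.

    layer_scores (hypothetical, normally predicted per input image)
    holds one score per abstraction level. Returns the fused vector
    and the weights that were applied.
    """
    weights = softmax(layer_scores)
    dim = len(layer_features[0])
    fused = [sum(w * feat[d] for w, feat in zip(weights, layer_features))
             for d in range(dim)]
    return fused, weights


# Toy example: two layers, two spatial locations, two channels each.
low_level = spatial_attention([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
high_level = spatial_attention([[2.0, 2.0], [4.0, 4.0]], [0.0, 0.0])
fused, weights = layer_wise_fusion([low_level, high_level], [0.0, 0.0])
```

With equal scores every location and every layer is weighted equally, so the fusion reduces to plain averaging; raising one layer's score shifts the fused feature toward that abstraction level, which is the dynamic weighting behaviour the abstract describes.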

Original language	English
Article number	8826013
Journal	IEEE Transactions on Information Forensics and Security
Volume	15
Pages (from-to)	1216-1231
Number of pages	16
ISSN	1556-6013
DOI	https://doi.org/10.1109/TIFS.2019.2938870
Status	Published - 5 Sep 2019

Access to document

10.1109/TIFS.2019.2938870

Person Re-identification Using Spatial and Layer-wise Attention (accepted manuscript, 10.7 MB)

Other files and links

http://www.scopus.com/inward/record.url?scp=85076673010&partnerID=8YFLogxK

Vision-based Person Re-identification in a Queue
Lejbølle, A. R.
01/01/2017 → 31/12/2019
Projects: Project › Research

Citation formats

@article{fbd549f2f61f41b4a0fa0030c1321853,

title = "Person Re-identification Using Spatial and Layer-Wise Attention",

abstract = "Person re-identification requires extraction of discriminative features to ensure a correct match; this must be done independent of challenges, such as occlusion, view, or lighting changes. While occlusion can be eliminated by changing the camera setup from a horizontal to a vertical (overhead) viewpoint, other challenges arise as the total visible surface area of persons is decreased. As a result, methods that focus on the most discriminative regions of persons must be applied, while different domains should also be considered to extract different semantics. To further increase feature discriminability, complementary features extracted at different abstraction levels should be fused. To emphasize features at certain abstraction levels depending on the input, fusion should be done intelligently. This work considers multiple domains and feature discrimination, where a multimodal convolution neural network is applied to fuse RGB and depth information. To extract multilocal discriminative features, two different attention modules are proposed: (1) a spatial attention module, which is able to capture local information at different abstraction levels, and (2) a layer-wise attention module, which works as a dynamic weighting scheme to assign weights and fuse local abstraction-level features intelligently, depending on the input image. By fusing local and global features in a multimodal context, we show state-of-the-art accuracies on two publicly available datasets, DPI-T and TVPR, while increasing the state-of-the-art accuracy on a third dataset, OPR. Finally, through both visual and quantitative analysis, we show the ability of the proposed system to leverage multiple frames, by adapting feature weighting depending on the input.",

keywords = "Artificial neural networks, Dynamic feature fusion, Multimodal sensors, Person re-identification, Soft attention",

author = "Lejb{\o}lle, {Aske Rasch} and Kamal Nasrollahi and Benjamin Krogh and Moeslund, {Thomas B.}",

year = "2019",

month = sep,

day = "5",

doi = "10.1109/TIFS.2019.2938870",

language = "English",

volume = "15",

pages = "1216--1231",

journal = "IEEE Transactions on Information Forensics and Security",

issn = "1556-6013",

publisher = "IEEE",

}

TY - JOUR

T1 - Person Re-identification Using Spatial and Layer-Wise Attention

AU - Lejbølle, Aske Rasch

AU - Nasrollahi, Kamal

AU - Krogh, Benjamin

AU - Moeslund, Thomas B.

PY - 2019/9/5

Y1 - 2019/9/5

N2 - Person re-identification requires extraction of discriminative features to ensure a correct match; this must be done independent of challenges, such as occlusion, view, or lighting changes. While occlusion can be eliminated by changing the camera setup from a horizontal to a vertical (overhead) viewpoint, other challenges arise as the total visible surface area of persons is decreased. As a result, methods that focus on the most discriminative regions of persons must be applied, while different domains should also be considered to extract different semantics. To further increase feature discriminability, complementary features extracted at different abstraction levels should be fused. To emphasize features at certain abstraction levels depending on the input, fusion should be done intelligently. This work considers multiple domains and feature discrimination, where a multimodal convolution neural network is applied to fuse RGB and depth information. To extract multilocal discriminative features, two different attention modules are proposed: (1) a spatial attention module, which is able to capture local information at different abstraction levels, and (2) a layer-wise attention module, which works as a dynamic weighting scheme to assign weights and fuse local abstraction-level features intelligently, depending on the input image. By fusing local and global features in a multimodal context, we show state-of-the-art accuracies on two publicly available datasets, DPI-T and TVPR, while increasing the state-of-the-art accuracy on a third dataset, OPR. Finally, through both visual and quantitative analysis, we show the ability of the proposed system to leverage multiple frames, by adapting feature weighting depending on the input.

KW - Artificial neural networks

KW - Dynamic feature fusion

KW - Multimodal sensors

KW - Person re-identification

KW - Soft attention

UR - http://www.scopus.com/inward/record.url?scp=85076673010&partnerID=8YFLogxK

U2 - 10.1109/TIFS.2019.2938870

DO - 10.1109/TIFS.2019.2938870

M3 - Journal article

SN - 1556-6013

VL - 15

SP - 1216

EP - 1231

JO - IEEE Transactions on Information Forensics and Security

JF - IEEE Transactions on Information Forensics and Security

M1 - 8826013

ER -
