Person Re-identification Using Spatial and Layer-Wise Attention

Aske Rasch Lejbølle; Kamal Nasrollahi; Benjamin Krogh; Thomas B. Moeslund

doi:10.1109/TIFS.2019.2938870

Research output: Contribution to journal › Journal article › Research › peer-review

14 Citations (Scopus)

270 Downloads (Pure)

Abstract

Person re-identification requires extraction of discriminative features to ensure a correct match; this must be done independent of challenges, such as occlusion, view, or lighting changes. While occlusion can be eliminated by changing the camera setup from a horizontal to a vertical (overhead) viewpoint, other challenges arise as the total visible surface area of persons is decreased. As a result, methods that focus on the most discriminative regions of persons must be applied, while different domains should also be considered to extract different semantics. To further increase feature discriminability, complementary features extracted at different abstraction levels should be fused. To emphasize features at certain abstraction levels depending on the input, fusion should be done intelligently. This work considers multiple domains and feature discrimination, where a multimodal convolution neural network is applied to fuse RGB and depth information. To extract multilocal discriminative features, two different attention modules are proposed: (1) a spatial attention module, which is able to capture local information at different abstraction levels, and (2) a layer-wise attention module, which works as a dynamic weighting scheme to assign weights and fuse local abstraction-level features intelligently, depending on the input image. By fusing local and global features in a multimodal context, we show state-of-the-art accuracies on two publicly available datasets, DPI-T and TVPR, while increasing the state-of-the-art accuracy on a third dataset, OPR. Finally, through both visual and quantitative analysis, we show the ability of the proposed system to leverage multiple frames, by adapting feature weighting depending on the input.
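The abstract describes layer-wise attention as a dynamic weighting scheme that assigns input-dependent weights to features from different abstraction levels before fusing them. The sketch below illustrates only that general idea, a softmax-weighted sum of per-layer feature vectors, and is not the authors' architecture. The function names `softmax` and `layer_wise_fusion`, the plain-list feature vectors, and the raw `layer_scores` (which a real network would predict from the input image) are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def layer_wise_fusion(layer_features, layer_scores):
    """Fuse equal-length feature vectors taken from several layers.

    layer_features: list of feature vectors, one per abstraction level.
    layer_scores:   one raw score per layer (hypothetical input; in a
                    trained model these would depend on the image).
    Returns (weights, fused_vector).
    """
    if len(layer_features) != len(layer_scores):
        raise ValueError("need exactly one score per layer")
    dim = len(layer_features[0])
    if any(len(feat) != dim for feat in layer_features):
        raise ValueError("all feature vectors must have the same length")

    # Normalise the scores so the layer weights are positive and sum to 1.
    weights = softmax(layer_scores)

    # Weighted sum of the per-layer features.
    fused = [0.0] * dim
    for weight, feat in zip(weights, layer_features):
        for i, value in enumerate(feat):
            fused[i] += weight * value
    return weights, fused
```

With equal scores every layer contributes equally, while a much larger score for one layer makes the fused vector follow that layer's features almost exclusively, which is the sense in which the weighting adapts to the input.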

Original language	English
Article number	8826013
Journal	IEEE Transactions on Information Forensics and Security
Volume	15
Pages (from-to)	1216 - 1231
Number of pages	16
ISSN	1556-6013
DOIs	https://doi.org/10.1109/TIFS.2019.2938870
Publication status	Published - 5 Sept 2019

Keywords

Artificial neural networks
Dynamic feature fusion
Multimodal sensors
Person re-identification
Soft attention

Access to Document

10.1109/TIFS.2019.2938870

Person Re-identification Using Spatial and Layer-wise Attention (accepted author manuscript, 10.7 MB)

Cite this

@article{fbd549f2f61f41b4a0fa0030c1321853,

title = "Person Re-identification Using Spatial and Layer-Wise Attention",

abstract = "Person re-identification requires extraction of discriminative features to ensure a correct match; this must be done independent of challenges, such as occlusion, view, or lighting changes. While occlusion can be eliminated by changing the camera setup from a horizontal to a vertical (overhead) viewpoint, other challenges arise as the total visible surface area of persons is decreased. As a result, methods that focus on the most discriminative regions of persons must be applied, while different domains should also be considered to extract different semantics. To further increase feature discriminability, complementary features extracted at different abstraction levels should be fused. To emphasize features at certain abstraction levels depending on the input, fusion should be done intelligently. This work considers multiple domains and feature discrimination, where a multimodal convolution neural network is applied to fuse RGB and depth information. To extract multilocal discriminative features, two different attention modules are proposed: (1) a spatial attention module, which is able to capture local information at different abstraction levels, and (2) a layer-wise attention module, which works as a dynamic weighting scheme to assign weights and fuse local abstraction-level features intelligently, depending on the input image. By fusing local and global features in a multimodal context, we show state-of-the-art accuracies on two publicly available datasets, DPI-T and TVPR, while increasing the state-of-the-art accuracy on a third dataset, OPR. Finally, through both visual and quantitative analysis, we show the ability of the proposed system to leverage multiple frames, by adapting feature weighting depending on the input.",

keywords = "Artificial neural networks, Dynamic feature fusion, Multimodal sensors, Person re-identification, Soft attention",

author = "Lejb{\o}lle, {Aske Rasch} and Nasrollahi, Kamal and Krogh, Benjamin and Moeslund, {Thomas B.}",

year = "2019",

month = sep,

day = "5",

doi = "10.1109/TIFS.2019.2938870",

language = "English",

volume = "15",

pages = "1216--1231",

journal = "IEEE Transactions on Information Forensics and Security",

issn = "1556-6013",

publisher = "IEEE",

}

TY - JOUR

T1 - Person Re-identification Using Spatial and Layer-Wise Attention

AU - Lejbølle, Aske Rasch

AU - Nasrollahi, Kamal

AU - Krogh, Benjamin

AU - Moeslund, Thomas B.

PY - 2019/9/5

Y1 - 2019/9/5

N2 - Person re-identification requires extraction of discriminative features to ensure a correct match; this must be done independent of challenges, such as occlusion, view, or lighting changes. While occlusion can be eliminated by changing the camera setup from a horizontal to a vertical (overhead) viewpoint, other challenges arise as the total visible surface area of persons is decreased. As a result, methods that focus on the most discriminative regions of persons must be applied, while different domains should also be considered to extract different semantics. To further increase feature discriminability, complementary features extracted at different abstraction levels should be fused. To emphasize features at certain abstraction levels depending on the input, fusion should be done intelligently. This work considers multiple domains and feature discrimination, where a multimodal convolution neural network is applied to fuse RGB and depth information. To extract multilocal discriminative features, two different attention modules are proposed: (1) a spatial attention module, which is able to capture local information at different abstraction levels, and (2) a layer-wise attention module, which works as a dynamic weighting scheme to assign weights and fuse local abstraction-level features intelligently, depending on the input image. By fusing local and global features in a multimodal context, we show state-of-the-art accuracies on two publicly available datasets, DPI-T and TVPR, while increasing the state-of-the-art accuracy on a third dataset, OPR. Finally, through both visual and quantitative analysis, we show the ability of the proposed system to leverage multiple frames, by adapting feature weighting depending on the input.

KW - Artificial neural networks

KW - Dynamic feature fusion

KW - Multimodal sensors

KW - Person re-identification

KW - Soft attention

UR - http://www.scopus.com/inward/record.url?scp=85076673010&partnerID=8YFLogxK

U2 - 10.1109/TIFS.2019.2938870

DO - 10.1109/TIFS.2019.2938870

M3 - Journal article

SN - 1556-6013

VL - 15

SP - 1216

EP - 1231

JO - IEEE Transactions on Information Forensics and Security

JF - IEEE Transactions on Information Forensics and Security

M1 - 8826013

ER -

Projects

Vision-based Person Re-identification in a Queue
