Effective Fusion of Deep Multitasking Representations for Robust Visual Tracking

Seyed Mojtaba Marvasti Zadeh; Hossien Ghanei-Yakhdan; Shohreh Kasaei; Kamal Nasrollahi; Thomas B. Moeslund

doi:10.1007/s00371-021-02304-1

Corresponding author: Hossien Ghanei-Yakhdan

Publication: Contribution to journal › Journal article › Research › peer-review

1 Citation (Scopus)

214 Downloads (Pure)

Abstract

Visual object tracking remains an active research field in computer vision due to persisting challenges with various problem-specific factors in real-world scenes. Many existing tracking methods based on discriminative correlation filters (DCFs) employ feature extraction networks (FENs) to model the target appearance during the learning process. However, using deep feature maps extracted from FENs based on different residual neural networks (ResNets) has not previously been investigated. This paper aims to evaluate the performance of 12 state-of-the-art ResNet-based FENs in a DCF-based framework to determine the best for visual tracking purposes. First, it ranks their best feature maps and explores the generalized adoption of the best ResNet-based FEN into another DCF-based method. Then, the proposed method extracts deep semantic information from a fully convolutional FEN and fuses it with the best ResNet-based feature maps to strengthen the target representation in the learning process of continuous convolution filters. Finally, it introduces a new and efficient semantic weighting method (using semantic segmentation feature maps on each video frame) to reduce the drift problem. Extensive experimental results on the well-known OTB-2013, OTB-2015, TC-128, UAV-123 and VOT-2018 visual tracking datasets demonstrate that the proposed method effectively outperforms state-of-the-art methods in terms of precision and robustness of visual tracking.
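
The fusion and semantic weighting steps described in the abstract can be pictured with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the fusion operator or the weighting formula, so channel-wise concatenation, nearest-neighbour resizing, the `floor` parameter, and all function names below are assumptions made purely for illustration.

```python
import numpy as np


def resize_nearest(fmap, out_h, out_w):
    """Nearest-neighbour resize of a (C, H, W) feature map (illustrative helper)."""
    _, h, w = fmap.shape
    rows = np.arange(out_h) * h // out_h  # source row for each output row
    cols = np.arange(out_w) * w // out_w  # source column for each output column
    return fmap[:, rows[:, None], cols[None, :]]


def fuse_features(resnet_fmap, semantic_fmap):
    """Assumed fusion: match spatial sizes, then concatenate along channels."""
    _, h, w = resnet_fmap.shape
    semantic_resized = resize_nearest(semantic_fmap, h, w)
    return np.concatenate([resnet_fmap, semantic_resized], axis=0)


def semantic_weight(response, target_prob, floor=0.1):
    """Assumed weighting: scale a filter response map by per-pixel target
    probability so that peaks on non-target regions are suppressed.

    `floor` (hypothetical) keeps a small weight where the probability is zero.
    """
    weights = floor + (1.0 - floor) * target_prob
    return response * weights
```

In this sketch, a distractor peak in the response map that falls on a region with low target probability is scaled down, which is one plausible way a segmentation-based weight could limit drift.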

Original language	English
Journal	Visual Computer
Volume	38
Issue number	12
Pages (from-to)	4397-4417
Number of pages	21
ISSN	0178-2789
DOI	https://doi.org/10.1007/s00371-021-02304-1
Status	Published - Dec 2022

Access to document

10.1007/s00371-021-02304-1

2004.01382: Accepted manuscript, 8.73 MB

Citation formats

@article{b0713e163ca9476bb1faabc65904af00,

title = "Effective Fusion of Deep Multitasking Representations for Robust Visual Tracking",

abstract = "Visual object tracking remains an active research field in computer vision due to persisting challenges with various problem-specific factors in real-world scenes. Many existing tracking methods based on discriminative correlation filters (DCFs) employ feature extraction networks (FENs) to model the target appearance during the learning process. However, using deep feature maps extracted from FENs based on different residual neural networks (ResNets) has not previously been investigated. This paper aims to evaluate the performance of 12 state-of-the-art ResNet-based FENs in a DCF-based framework to determine the best for visual tracking purposes. First, it ranks their best feature maps and explores the generalized adoption of the best ResNet-based FEN into another DCF-based method. Then, the proposed method extracts deep semantic information from a fully convolutional FEN and fuses it with the best ResNet-based feature maps to strengthen the target representation in the learning process of continuous convolution filters. Finally, it introduces a new and efficient semantic weighting method (using semantic segmentation feature maps on each video frame) to reduce the drift problem. Extensive experimental results on the well-known OTB-2013, OTB-2015, TC-128, UAV-123 and VOT-2018 visual tracking datasets demonstrate that the proposed method effectively outperforms state-of-the-art methods in terms of precision and robustness of visual tracking.",

keywords = "Appearance modeling, Deep convolutional neural networks, Discriminative correlation filters, Robust visual tracking",

author = "Zadeh, {Seyed Mojtaba Marvasti} and Hossien Ghanei-Yakhdan and Shohreh Kasaei and Kamal Nasrollahi and Moeslund, {Thomas B.}",

year = "2022",

month = dec,

doi = "10.1007/s00371-021-02304-1",

language = "English",

volume = "38",

pages = "4397--4417",

journal = "Visual Computer",

issn = "0178-2789",

publisher = "Physica-Verlag",

number = "12",

}

TY - JOUR

T1 - Effective Fusion of Deep Multitasking Representations for Robust Visual Tracking

AU - Zadeh, Seyed Mojtaba Marvasti

AU - Ghanei-Yakhdan, Hossien

AU - Kasaei, Shohreh

AU - Nasrollahi, Kamal

AU - Moeslund, Thomas B.

PY - 2022/12

Y1 - 2022/12

N2 - Visual object tracking remains an active research field in computer vision due to persisting challenges with various problem-specific factors in real-world scenes. Many existing tracking methods based on discriminative correlation filters (DCFs) employ feature extraction networks (FENs) to model the target appearance during the learning process. However, using deep feature maps extracted from FENs based on different residual neural networks (ResNets) has not previously been investigated. This paper aims to evaluate the performance of 12 state-of-the-art ResNet-based FENs in a DCF-based framework to determine the best for visual tracking purposes. First, it ranks their best feature maps and explores the generalized adoption of the best ResNet-based FEN into another DCF-based method. Then, the proposed method extracts deep semantic information from a fully convolutional FEN and fuses it with the best ResNet-based feature maps to strengthen the target representation in the learning process of continuous convolution filters. Finally, it introduces a new and efficient semantic weighting method (using semantic segmentation feature maps on each video frame) to reduce the drift problem. Extensive experimental results on the well-known OTB-2013, OTB-2015, TC-128, UAV-123 and VOT-2018 visual tracking datasets demonstrate that the proposed method effectively outperforms state-of-the-art methods in terms of precision and robustness of visual tracking.

AB - Visual object tracking remains an active research field in computer vision due to persisting challenges with various problem-specific factors in real-world scenes. Many existing tracking methods based on discriminative correlation filters (DCFs) employ feature extraction networks (FENs) to model the target appearance during the learning process. However, using deep feature maps extracted from FENs based on different residual neural networks (ResNets) has not previously been investigated. This paper aims to evaluate the performance of 12 state-of-the-art ResNet-based FENs in a DCF-based framework to determine the best for visual tracking purposes. First, it ranks their best feature maps and explores the generalized adoption of the best ResNet-based FEN into another DCF-based method. Then, the proposed method extracts deep semantic information from a fully convolutional FEN and fuses it with the best ResNet-based feature maps to strengthen the target representation in the learning process of continuous convolution filters. Finally, it introduces a new and efficient semantic weighting method (using semantic segmentation feature maps on each video frame) to reduce the drift problem. Extensive experimental results on the well-known OTB-2013, OTB-2015, TC-128, UAV-123 and VOT-2018 visual tracking datasets demonstrate that the proposed method effectively outperforms state-of-the-art methods in terms of precision and robustness of visual tracking.

KW - Appearance modeling

KW - Deep convolutional neural networks

KW - Discriminative correlation filters

KW - Robust visual tracking

UR - http://www.scopus.com/inward/record.url?scp=85117342136&partnerID=8YFLogxK

U2 - 10.1007/s00371-021-02304-1

DO - 10.1007/s00371-021-02304-1

M3 - Journal article

SN - 0178-2789

VL - 38

SP - 4397

EP - 4417

JO - Visual Computer

JF - Visual Computer

IS - 12

ER -
