Gate-Shift-Fuse for Video Action Recognition

Swathikiran Sudhakaran; Sergio Escalera; Oswald Lanz

doi:10.1109/TPAMI.2023.3268134

Publication: Contribution to journal › Journal article › Research › peer-review

2 Citations (Scopus)

Abstract

Convolutional Neural Networks are the de facto models for image recognition. However, 3D CNNs, the straightforward extension of 2D CNNs for video recognition, have not achieved the same success on standard action recognition benchmarks. One of the main reasons for this reduced performance of 3D CNNs is their increased computational complexity, which requires large-scale annotated datasets to train them at scale. 3D kernel factorization approaches have been proposed to reduce the complexity of 3D CNNs. Existing kernel factorization approaches follow hand-designed and hard-wired techniques. In this paper, we propose Gate-Shift-Fuse (GSF), a novel spatio-temporal feature extraction module which controls interactions in spatio-temporal decomposition and learns to adaptively route features through time and combine them in a data-dependent manner. GSF leverages grouped spatial gating to decompose the input tensor and channel weighting to fuse the decomposed tensors. GSF can be inserted into existing 2D CNNs to convert them into efficient and high-performing spatio-temporal feature extractors, with negligible parameter and compute overhead. We perform an extensive analysis of GSF using two popular 2D CNN families and achieve state-of-the-art or competitive performance on five standard action recognition benchmarks.

Original language	English
Journal	IEEE Transactions on Pattern Analysis and Machine Intelligence
Volume	45
Issue number	9
Pages (from-to)	10913-10928
Number of pages	16
ISSN	0162-8828
DOI	https://doi.org/10.1109/TPAMI.2023.3268134
Status	Published - 1 Sep 2023

Bibliographical note

Publisher Copyright:
IEEE

Access to document

10.1109/TPAMI.2023.3268134

https://arxiv.org/pdf/2203.08897.pdf

AUB Link

Search for the material in the Aalborg University Library search engine

Other files and links

Link to publication in Scopus

Citation formats

@article{6be8675642ca4d42a7494fd4d562e110,

title = "Gate-Shift-Fuse for Video Action Recognition",

abstract = "Convolutional Neural Networks are the de facto models for image recognition. However 3D CNNs, the straight forward extension of 2D CNNs for video recognition, have not achieved the same success on standard action recognition benchmarks. One of the main reasons for this reduced performance of 3D CNNs is the increased computational complexity requiring large scale annotated datasets to train them in scale. 3D kernel factorization approaches have been proposed to reduce the complexity of 3D CNNs. Existing kernel factorization approaches follow hand-designed and hard-wired techniques. In this paper we propose Gate-Shift-Fuse (GSF), a novel spatio-temporal feature extraction module which controls interactions in spatio-temporal decomposition and learns to adaptively route features through time and combine them in a data dependent manner. GSF leverages grouped spatial gating to decompose input tensor and channel weighting to fuse the decomposed tensors. GSF can be inserted into existing 2D CNNs to convert them into an efficient and high performing spatio-temporal feature extractor, with negligible parameter and compute overhead. We perform an extensive analysis of GSF using two popular 2D CNN families and achieve state-of-the-art or competitive performance on five standard action recognition benchmarks.",

keywords = "Action recognition, channel fusion, Computer architecture, Convolution, Feature extraction, Kernel, Logic gates, Optical imaging, spatial gating, Three-dimensional displays, video classification",

author = "Swathikiran Sudhakaran and Sergio Escalera and Oswald Lanz",

note = "Publisher Copyright: IEEE",

year = "2023",

month = sep,

day = "1",

doi = "10.1109/TPAMI.2023.3268134",

language = "English",

volume = "45",

pages = "10913--10928",

journal = "IEEE Transactions on Pattern Analysis and Machine Intelligence",

issn = "0162-8828",

publisher = "IEEE Computer Society",

number = "9",

}

TY - JOUR

T1 - Gate-Shift-Fuse for Video Action Recognition

AU - Sudhakaran, Swathikiran

AU - Escalera, Sergio

AU - Lanz, Oswald

N1 - Publisher Copyright: IEEE

PY - 2023/9/1

Y1 - 2023/9/1

N2 - Convolutional Neural Networks are the de facto models for image recognition. However 3D CNNs, the straight forward extension of 2D CNNs for video recognition, have not achieved the same success on standard action recognition benchmarks. One of the main reasons for this reduced performance of 3D CNNs is the increased computational complexity requiring large scale annotated datasets to train them in scale. 3D kernel factorization approaches have been proposed to reduce the complexity of 3D CNNs. Existing kernel factorization approaches follow hand-designed and hard-wired techniques. In this paper we propose Gate-Shift-Fuse (GSF), a novel spatio-temporal feature extraction module which controls interactions in spatio-temporal decomposition and learns to adaptively route features through time and combine them in a data dependent manner. GSF leverages grouped spatial gating to decompose input tensor and channel weighting to fuse the decomposed tensors. GSF can be inserted into existing 2D CNNs to convert them into an efficient and high performing spatio-temporal feature extractor, with negligible parameter and compute overhead. We perform an extensive analysis of GSF using two popular 2D CNN families and achieve state-of-the-art or competitive performance on five standard action recognition benchmarks.

AB - Convolutional Neural Networks are the de facto models for image recognition. However 3D CNNs, the straight forward extension of 2D CNNs for video recognition, have not achieved the same success on standard action recognition benchmarks. One of the main reasons for this reduced performance of 3D CNNs is the increased computational complexity requiring large scale annotated datasets to train them in scale. 3D kernel factorization approaches have been proposed to reduce the complexity of 3D CNNs. Existing kernel factorization approaches follow hand-designed and hard-wired techniques. In this paper we propose Gate-Shift-Fuse (GSF), a novel spatio-temporal feature extraction module which controls interactions in spatio-temporal decomposition and learns to adaptively route features through time and combine them in a data dependent manner. GSF leverages grouped spatial gating to decompose input tensor and channel weighting to fuse the decomposed tensors. GSF can be inserted into existing 2D CNNs to convert them into an efficient and high performing spatio-temporal feature extractor, with negligible parameter and compute overhead. We perform an extensive analysis of GSF using two popular 2D CNN families and achieve state-of-the-art or competitive performance on five standard action recognition benchmarks.

KW - Action recognition

KW - channel fusion

KW - Computer architecture

KW - Convolution

KW - Feature extraction

KW - Kernel

KW - Logic gates

KW - Optical imaging

KW - spatial gating

KW - Three-dimensional displays

KW - video classification

UR - http://www.scopus.com/inward/record.url?scp=85153505784&partnerID=8YFLogxK

U2 - 10.1109/TPAMI.2023.3268134

DO - 10.1109/TPAMI.2023.3268134

M3 - Journal article

AN - SCOPUS:85153505784

SN - 0162-8828

VL - 45

SP - 10913

EP - 10928

JO - IEEE Transactions on Pattern Analysis and Machine Intelligence

JF - IEEE Transactions on Pattern Analysis and Machine Intelligence

IS - 9

ER -
