Learning to Recognize Actions on Objects in Egocentric Video with Attention Dictionaries

Swathikiran Sudhakaran; Sergio Escalera; Oswald Lanz

doi:10.1109/TPAMI.2021.3058649

Publication: Contribution to journal › Journal article › Research › peer-reviewed

6 Citations (Scopus)

Abstract

We present EgoACO, a deep neural architecture for video action recognition that learns to pool action-context-object descriptors from frame level features by leveraging the verb-noun structure of action labels in egocentric video datasets. The core component of EgoACO is class activation pooling (CAP), a differentiable pooling operation that combines ideas from bilinear pooling for fine-grained recognition and from feature learning for discriminative localization. CAP uses self-attention with a dictionary of learnable weights to pool from the most relevant feature regions. Through CAP, EgoACO learns to decode object and scene context descriptors from video frame features. For temporal modeling in EgoACO, we design a recurrent version of class activation pooling termed Long Short-Term Attention (LSTA). LSTA extends convolutional gated LSTM with built-in spatial attention and a re-designed output gate. Action, object and context descriptors are fused by a multi-head prediction that accounts for the inter-dependencies between noun-verb-action structured labels in egocentric video datasets. EgoACO features built-in visual explanations, helping learning and interpretation. Results on the two largest egocentric action recognition datasets currently available, EPIC-KITCHENS and EGTEA, show that by explicitly decoding action-context-object descriptors, EgoACO achieves state-of-the-art recognition performance.
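The pooling idea described in the abstract (attention derived from a dictionary of learnable weights, used to pool from the most relevant feature regions) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `attention_pool`, the rule for selecting a dictionary entry (highest response to the mean feature), the softmax normalisation, and the random stand-in weights are all assumptions made for illustration only.

```python
import math
import random


def attention_pool(features, dictionary):
    """Pool location features with attention from one dictionary entry.

    features:   list of N location vectors of length C (a flattened feature map).
    dictionary: list of K weight vectors of length C (learnable in a real model,
                random here; hypothetical stand-in).
    Returns (pooled descriptor of length C, attention weights of length N,
             index of the selected dictionary entry).
    """
    n, c = len(features), len(features[0])
    # Average feature over all locations.
    mean = [sum(f[j] for f in features) / n for j in range(c)]
    # Score every dictionary entry against the averaged feature (assumed rule).
    scores = [sum(w * m for w, m in zip(entry, mean)) for entry in dictionary]
    k = max(range(len(scores)), key=scores.__getitem__)
    # Activation of the selected entry at every location.
    act = [sum(w * x for w, x in zip(dictionary[k], f)) for f in features]
    # Softmax over locations, shifted by the maximum for numerical stability.
    peak = max(act)
    exps = [math.exp(a - peak) for a in act]
    total = sum(exps)
    attn = [e / total for e in exps]
    # Attention-weighted sum of the location features.
    pooled = [sum(a * f[j] for a, f in zip(attn, features)) for j in range(c)]
    return pooled, attn, k


random.seed(0)
C, N, K = 8, 16, 5  # channels, locations (e.g. a 4x4 map), dictionary size
features = [[random.gauss(0, 1) for _ in range(C)] for _ in range(N)]
dictionary = [[random.gauss(0, 1) for _ in range(C)] for _ in range(K)]
pooled, attn, k = attention_pool(features, dictionary)
```

Here `attn` is a distribution over the 16 locations and `pooled` has one value per channel; in a trained model the attention map could also be visualised, which is one plausible reading of the "built-in visual explanations" the abstract mentions.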

Original language	English
Journal	IEEE Transactions on Pattern Analysis and Machine Intelligence
ISSN	0162-8828
DOI	https://doi.org/10.1109/TPAMI.2021.3058649
Status	Accepted/In press - 2021
Published externally	Yes

Bibliographical note

Publisher Copyright:
IEEE

Access to document

10.1109/TPAMI.2021.3058649

Other files and links

Link to publication in Scopus

Citation formats

@article{242415476746497a9251e414b3f779c5,

title = "Learning to Recognize Actions on Objects in Egocentric Video with Attention Dictionaries",

abstract = "We present EgoACO, a deep neural architecture for video action recognition that learns to pool action-context-object descriptors from frame level features by leveraging the verb-noun structure of action labels in egocentric video datasets. The core component of EgoACO is class activation pooling (CAP), a differentiable pooling operation that combines ideas from bilinear pooling for fine-grained recognition and from feature learning for discriminative localization. CAP uses self-attention with a dictionary of learnable weights to pool from the most relevant feature regions. Through CAP, EgoACO learns to decode object and scene context descriptors from video frame features. For temporal modeling in EgoACO, we design a recurrent version of class activation pooling termed Long Short-Term Attention (LSTA). LSTA extends convolutional gated LSTM with built-in spatial attention and a re-designed output gate. Action, object and context descriptors are fused by a multi-head prediction that accounts for the inter-dependencies between noun-verb-action structured labels in egocentric video datasets. EgoACO features built-in visual explanations, helping learning and interpretation. Results on the two largest egocentric action recognition datasets currently available, EPIC-KITCHENS and EGTEA, show that by explicitly decoding action-context-object descriptors, EgoACO achieves state-of-the-art recognition performance.",

keywords = "Action Recognition, Attention, Cameras, Egocentric Vision, Encoding, Feature extraction, Higher Order Pooling, Logic gates, Optical imaging, Task analysis, Video Classification, Visualization",

author = "Swathikiran Sudhakaran and Sergio Escalera and Oswald Lanz",

note = "Publisher Copyright: IEEE",

year = "2021",

doi = "10.1109/TPAMI.2021.3058649",

language = "English",

journal = "IEEE Transactions on Pattern Analysis and Machine Intelligence",

issn = "0162-8828",

publisher = "IEEE Computer Society",

}

TY - JOUR

T1 - Learning to Recognize Actions on Objects in Egocentric Video with Attention Dictionaries

AU - Sudhakaran, Swathikiran

AU - Escalera, Sergio

AU - Lanz, Oswald

N1 - Publisher Copyright: IEEE

PY - 2021

Y1 - 2021

N2 - We present EgoACO, a deep neural architecture for video action recognition that learns to pool action-context-object descriptors from frame level features by leveraging the verb-noun structure of action labels in egocentric video datasets. The core component of EgoACO is class activation pooling (CAP), a differentiable pooling operation that combines ideas from bilinear pooling for fine-grained recognition and from feature learning for discriminative localization. CAP uses self-attention with a dictionary of learnable weights to pool from the most relevant feature regions. Through CAP, EgoACO learns to decode object and scene context descriptors from video frame features. For temporal modeling in EgoACO, we design a recurrent version of class activation pooling termed Long Short-Term Attention (LSTA). LSTA extends convolutional gated LSTM with built-in spatial attention and a re-designed output gate. Action, object and context descriptors are fused by a multi-head prediction that accounts for the inter-dependencies between noun-verb-action structured labels in egocentric video datasets. EgoACO features built-in visual explanations, helping learning and interpretation. Results on the two largest egocentric action recognition datasets currently available, EPIC-KITCHENS and EGTEA, show that by explicitly decoding action-context-object descriptors, EgoACO achieves state-of-the-art recognition performance.

KW - Action Recognition

KW - Attention

KW - Cameras

KW - Egocentric Vision

KW - Encoding

KW - Feature extraction

KW - Higher Order Pooling

KW - Logic gates

KW - Optical imaging

KW - Task analysis

KW - Video Classification

KW - Visualization

UR - http://www.scopus.com/inward/record.url?scp=85101426518&partnerID=8YFLogxK

U2 - 10.1109/TPAMI.2021.3058649

DO - 10.1109/TPAMI.2021.3058649

M3 - Journal article

AN - SCOPUS:85101426518

SN - 0162-8828

JO - IEEE Transactions on Pattern Analysis and Machine Intelligence

JF - IEEE Transactions on Pattern Analysis and Machine Intelligence

ER -
