TY - JOUR
T1 - A deep co-attentive hand-based video question answering framework using multi-view skeleton
AU - Rastgoo, Razieh
AU - Kiani, Kourosh
AU - Escalera, Sergio
N1 - Funding Information:
This work has been partially supported by the Spanish project PID2019-105093GB-I00, ICREA under the ICREA Academia programme, and High Intelligent Solution (HIS) company in Iran.
Publisher Copyright:
© 2022, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
PY - 2023/1
Y1 - 2023/1
N2 - In this paper, we present a novel hand –based Video Question Answering framework, entitled Multi-View Video Question Answering (MV-VQA), employing the Single Shot Detector (SSD), Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), Bidirectional Encoder Representations from Transformers (BERT), and Co-Attention mechanism with RGB videos as the inputs. Our model includes three main blocks: vision, language, and attention. In the vision block, we employ a novel representation to obtain some efficient multiview features from the hand object using the combination of five 3DCNNs and one LSTM network. To obtain the question embedding, we use the BERT model in language block. Finally, we employ a co-attention mechanism on vision and language features to recognize the final answer. For the first time, we propose such a hand-based Video-QA framework including the multi-view hand skeleton features combined with the question embedding and co-attention mechanism. Our framework is capable of processing the arbitrary numbers of questions in the dataset annotations. There are different application domains for this framework. Here, as an application domain, we applied our framework to dynamic hand gesture recognition for the first time. Since the main object in dynamic hand gesture recognition is the human hand, we performed a step-by-step analysis of the hand detection and multi-view hand skeleton impact on the model performance. Evaluation results on five datasets, including two datasets in VideoQA, two datasets in dynamic hand gesture, and one dataset in hand action recognition show that MV-VQA outperforms state-of-the-art alternatives.
AB - In this paper, we present a novel hand –based Video Question Answering framework, entitled Multi-View Video Question Answering (MV-VQA), employing the Single Shot Detector (SSD), Convolutional Neural Network (CNN), Long Short-Term Memory (LSTM), Bidirectional Encoder Representations from Transformers (BERT), and Co-Attention mechanism with RGB videos as the inputs. Our model includes three main blocks: vision, language, and attention. In the vision block, we employ a novel representation to obtain some efficient multiview features from the hand object using the combination of five 3DCNNs and one LSTM network. To obtain the question embedding, we use the BERT model in language block. Finally, we employ a co-attention mechanism on vision and language features to recognize the final answer. For the first time, we propose such a hand-based Video-QA framework including the multi-view hand skeleton features combined with the question embedding and co-attention mechanism. Our framework is capable of processing the arbitrary numbers of questions in the dataset annotations. There are different application domains for this framework. Here, as an application domain, we applied our framework to dynamic hand gesture recognition for the first time. Since the main object in dynamic hand gesture recognition is the human hand, we performed a step-by-step analysis of the hand detection and multi-view hand skeleton impact on the model performance. Evaluation results on five datasets, including two datasets in VideoQA, two datasets in dynamic hand gesture, and one dataset in hand action recognition show that MV-VQA outperforms state-of-the-art alternatives.
KW - BERT
KW - Co-attention
KW - Dynamic hand gesture recognition
KW - RGB video
KW - Video question answering (video-QA)
UR - http://www.scopus.com/inward/record.url?scp=85135340411&partnerID=8YFLogxK
U2 - 10.1007/s11042-022-13573-w
DO - 10.1007/s11042-022-13573-w
M3 - Journal article
AN - SCOPUS:85135340411
SN - 1380-7501
VL - 82
SP - 1401
EP - 1429
JO - Multimedia Tools and Applications
JF - Multimedia Tools and Applications
IS - 1
ER -