Gloss-free Sign Language Translation: Improving from Visual-Language Pretraining

Benjia Zhou, Zhigang Chen, Albert Clapés, Jun Wan*, Yanyan Liang, Sergio Escalera, Zhen Lei, Du Zhang

*Corresponding author

Research output: Contribution to book/anthology/report/conference proceeding › Article in proceeding › Research › peer-review

31 Citations (Scopus)

Abstract

Sign Language Translation (SLT) is a challenging task due to its cross-domain nature, involving the translation of visual-gestural language to text. Many previous methods employ an intermediate representation, i.e., gloss sequences, to facilitate SLT, thus transforming it into a two-stage task of sign language recognition (SLR) followed by sign language translation (SLT). However, the scarcity of gloss-annotated sign language data, combined with the information bottleneck in the mid-level gloss representation, has hindered the further development of the SLT task. To address this challenge, we propose a novel Gloss-Free SLT based on Visual-Language Pretraining (GFSLT-VLP), which improves SLT by inheriting language-oriented prior knowledge from pre-trained models, without any gloss annotation assistance. Our approach involves two stages: (i) integrating Contrastive Language-Image Pre-training (CLIP) with masked self-supervised learning to create pre-tasks that bridge the semantic gap between visual and textual representations and restore masked sentences, and (ii) constructing an end-to-end architecture with an encoder-decoder-like structure that inherits the parameters of the pre-trained Visual Encoder and Text Decoder from the first stage. The seamless combination of these novel designs forms a robust sign language representation and significantly improves gloss-free sign language translation. In particular, we have achieved unprecedented improvements in terms of BLEU-4 score on the PHOENIX14T dataset (≥+5) and the CSL-Daily dataset (≥+3) compared to state-of-the-art gloss-free SLT methods. Furthermore, our approach also achieves competitive results on the PHOENIX14T dataset when compared with most of the gloss-based methods.
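
The contrastive part of stage (i) can be illustrated with a small sketch. The abstract states only that CLIP is integrated with masked self-supervised learning; it does not give the loss, the feature pooling, or any hyperparameters. The code below is therefore not the authors' implementation but a generic CLIP-style symmetric contrastive loss over paired sign-video and sentence embeddings. The function names, the toy two-dimensional features, and the temperature of 0.07 are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(visual_feats, text_feats, temperature=0.07):
    """CLIP-style loss: the i-th video should match the i-th sentence.

    visual_feats, text_feats: lists of equal length, one pooled feature
    vector per sample (hypothetical inputs, assumed non-zero).
    temperature: assumed value, not taken from the paper.
    """
    v = [l2_normalize(x) for x in visual_feats]
    t = [l2_normalize(x) for x in text_feats]
    n = len(v)
    # Cosine-similarity matrix, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Video-to-text direction: the correct class for row i is column i.
    v2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-video direction: same computation on the transposed matrix.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2v = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (v2t + t2v)


# Toy usage: aligned pairs should score a lower loss than swapped pairs.
visual = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = symmetric_contrastive_loss(visual, [[1.0, 0.0], [0.0, 1.0]])
loss_swapped = symmetric_contrastive_loss(visual, [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing a loss of this shape pulls each video embedding toward its own sentence and away from the other sentences in the batch, which is one common way to narrow the visual-textual semantic gap the abstract refers to. The masked-sentence restoration task and the stage (ii) parameter inheritance are not covered by this sketch.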

Original language: English
Title: Proceedings - 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023
Number of pages: 11
Publisher: IEEE (Institute of Electrical and Electronics Engineers)
Publication date: 2023
Pages: 20814-20824
ISBN (Print): 979-8-3503-0719-1
ISBN (Electronic): 979-8-3503-0718-4
Status: Published - 2023
Event: 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023 - Paris, France
Duration: 2 Oct 2023 to 6 Oct 2023

Conference

Conference: 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023
Country/Territory: France
City: Paris
Period: 2 Oct 2023 to 6 Oct 2023
Name: Proceedings of the IEEE International Conference on Computer Vision
ISSN: 1550-5499

Bibliographical note

Publisher Copyright:
© 2023 IEEE.
