Abstract

In recent years, Transformer models have revolutionized machine learning. While this has led to impressive results in Natural Language Processing, Computer Vision quickly ran into computation and memory problems due to the high resolution and dimensionality of the input data. This is particularly true for video, where the number of tokens grows cubically with the frame and temporal resolutions. A first approach to this problem was Vision Transformers, which partition the input into embedded grid cells, lowering the effective resolution. More recently, Swin Transformers introduced a hierarchical scheme that brought the concepts of pooling and locality to Transformers in exchange for much lower computational and memory costs. This work proposes a reformulation of the latter that views Swin Transformers as regular Transformers applied over a quadtree representation of the input, intrinsically providing a wider range of design choices for the attentional mechanism. Compared to similar approaches such as Swin and MaxViT, our method works on the full range of scales while using a single attentional mechanism, allowing us to simultaneously take into account both dense short-range and sparse long-range dependencies with low computational overhead and without introducing additional sequential operations, thus making full use of GPU parallelism.
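The quadtree view of the input described above can be illustrated with a minimal sketch. The code below is not the paper's implementation; it is a hypothetical example of how a square image can be recursively split into quadrants down to a leaf patch size, with each leaf addressed by a path of quadrant indices (the `min_size` parameter and the `quadtree_tokens` helper are illustrative assumptions):

```python
import numpy as np

def quadtree_tokens(img, min_size=4, path=()):
    """Recursively split a square image into four quadrants until
    patches reach `min_size`, yielding (path, patch) leaves.

    Each path is a tuple of quadrant indices (0-3 per level) that
    addresses the leaf within the quadtree; the leaf patches play
    the role of tokens in the quadtree representation."""
    h, w = img.shape[:2]
    if h <= min_size or w <= min_size:
        yield path, img
        return
    hh, wh = h // 2, w // 2
    quads = [img[:hh, :wh], img[:hh, wh:],   # top-left, top-right
             img[hh:, :wh], img[hh:, wh:]]   # bottom-left, bottom-right
    for i, quad in enumerate(quads):
        yield from quadtree_tokens(quad, min_size, path + (i,))

# Example: a 16x16 single-channel image with 4x4 leaf patches.
# 16x16 splits into 4 quadrants of 8x8, each into 4 leaves of 4x4,
# giving 16 leaf tokens addressed by paths of length 2.
img = np.arange(16 * 16, dtype=np.float32).reshape(16, 16)
leaves = list(quadtree_tokens(img, min_size=4))
```

Attending over prefixes of these paths selects coarse-to-fine neighborhoods, which is one way to picture mixing dense short-range and sparse long-range dependencies in a single mechanism.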

Original language: English
Title: Proceedings - 2024 IEEE Winter Conference on Applications of Computer Vision Workshops, WACVW 2024
Number of pages: 9
Publisher: IEEE (Institute of Electrical and Electronics Engineers)
Publication date: 2024
Pages: 193-201
ISBN (electronic): 9798350370287
DOI
Status: Published - 2024
Event: 2024 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops, WACVW 2024 - Waikoloa, USA
Duration: 4 Jan 2024 - 8 Jan 2024

Conference

Conference: 2024 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops, WACVW 2024
Country/Territory: USA
City: Waikoloa
Period: 04/01/2024 - 08/01/2024
Name: IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW)
ISSN: 2572-4398

Bibliographical note

Publisher Copyright:
© 2024 IEEE.

Fingerprint

Dive into the research topics of 'Swin on Axes: Extending Swin Transformers to Quadtree Image Representations'. Together they form a unique fingerprint.
