Multi-scale hybrid vision transformer and Sinkhorn tokenizer for sewer defect classification

Joakim Bruslund Haurum; Meysam Madadi; Sergio Escalera Guerrero; Thomas B. Moeslund

doi:10.1016/j.autcon.2022.104614

Research output: Contribution to journal › Journal article › Research › peer-review

4 Citations (Scopus)

233 Downloads (Pure)

Abstract

A crucial part of image classification consists of capturing non-local spatial semantics of image content. This paper describes the multi-scale hybrid vision transformer (MSHViT), an extension of the classical convolutional neural network (CNN) backbone, for multi-label sewer defect classification. To better model spatial semantics in the images, features are aggregated at different scales non-locally through the use of a lightweight vision transformer, and a smaller set of tokens was produced through a novel Sinkhorn clustering-based tokenizer using distinct cluster centers. The proposed MSHViT and Sinkhorn tokenizer were evaluated on the Sewer-ML multi-label sewer defect classification dataset, showing consistent performance improvements of up to 2.53 percentage points.
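The abstract names a Sinkhorn clustering-based tokenizer but gives no algorithmic detail. As a rough illustration only, the sketch below shows one generic way Sinkhorn-Knopp normalization can softly assign N feature vectors to K cluster centers and pool them into K tokens. It is not the authors' implementation: the function name `sinkhorn_tokenize`, the cosine-similarity scores, the temperature `eps`, the iteration count `n_iters`, and the weighted-average pooling are all assumptions made for this example.

```python
import numpy as np


def sinkhorn_tokenize(features, centers, n_iters=5, eps=0.05):
    """Pool N feature vectors into K tokens via Sinkhorn-Knopp soft assignment.

    features: (N, D) array of feature vectors.
    centers:  (K, D) array of cluster centers (hypothetical, supplied by caller).
    Returns (tokens, assign) with shapes (K, D) and (N, K).
    """
    # Cosine similarity between every feature and every cluster center
    # (an assumed choice of score for this sketch).
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    c = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    scores = f @ c.T  # (N, K)

    # Exponentiate with temperature eps; subtracting the max keeps exp() stable.
    q = np.exp((scores - scores.max()) / eps)

    n, k = q.shape
    for _ in range(n_iters):
        # Column step: every cluster receives total mass 1/K.
        q /= q.sum(axis=0, keepdims=True) * k
        # Row step: every feature distributes total mass 1/N.
        q /= q.sum(axis=1, keepdims=True) * n

    # Rescale so each row is a probability distribution over clusters.
    assign = q * n

    # Each token is the assignment-weighted average of the features.
    weights = assign / assign.sum(axis=0, keepdims=True)
    tokens = weights.T @ features  # (K, D)
    return tokens, assign


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(64, 8))   # e.g. 64 spatial positions, 8 channels
    cents = rng.normal(size=(4, 8))    # 4 hypothetical cluster centers
    tokens, assign = sinkhorn_tokenize(feats, cents)
    print(tokens.shape, assign.shape)
```

The alternating column and row normalizations push the assignment toward a balanced one, so no single cluster absorbs all features; this is the general property that makes Sinkhorn-Knopp attractive for reducing a feature map to a small token set.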

Original language	English
Article number	104614
Journal	Automation in Construction
Volume	144
ISSN	0926-5805
DOIs	https://doi.org/10.1016/j.autcon.2022.104614
Publication status	Published - Dec 2022

Keywords

Closed-Circuit Television
Convolutional Neural Networks
Sewer Defect Classification
Sewer Inspection
Sinkhorn-Knopp
Vision Transformers

Access to Document

10.1016/j.autcon.2022.104614 (Licence: CC BY 4.0)

Open Access article: Final published version, 4.87 MB (Licence: CC BY 4.0)

Cite this

@article{90c6164bd8b54edd8c3cb966692b7f8d,
  title = "Multi-scale hybrid vision transformer and Sinkhorn tokenizer for sewer defect classification",
  abstract = "A crucial part of image classification consists of capturing non-local spatial semantics of image content. This paper describes the multi-scale hybrid vision transformer (MSHViT), an extension of the classical convolutional neural network (CNN) backbone, for multi-label sewer defect classification. To better model spatial semantics in the images, features are aggregated at different scales non-locally through the use of a lightweight vision transformer, and a smaller set of tokens was produced through a novel Sinkhorn clustering-based tokenizer using distinct cluster centers. The proposed MSHViT and Sinkhorn tokenizer were evaluated on the Sewer-ML multi-label sewer defect classification dataset, showing consistent performance improvements of up to 2.53 percentage points.",
  keywords = "Closed-Circuit Television, Convolutional Neural Networks, Sewer Defect Classification, Sewer Inspection, Sinkhorn-Knopp, Vision Transformers",
  author = "Haurum, {Joakim Bruslund} and Madadi, Meysam and Guerrero, {Sergio Escalera} and Moeslund, {Thomas B.}",
  year = "2022",
  month = dec,
  doi = "10.1016/j.autcon.2022.104614",
  language = "English",
  volume = "144",
  journal = "Automation in Construction",
  issn = "0926-5805",
  publisher = "Elsevier",
}

TY  - JOUR
T1  - Multi-scale hybrid vision transformer and Sinkhorn tokenizer for sewer defect classification
AU  - Haurum, Joakim Bruslund
AU  - Madadi, Meysam
AU  - Guerrero, Sergio Escalera
AU  - Moeslund, Thomas B.
PY  - 2022/12
Y1  - 2022/12
AB  - A crucial part of image classification consists of capturing non-local spatial semantics of image content. This paper describes the multi-scale hybrid vision transformer (MSHViT), an extension of the classical convolutional neural network (CNN) backbone, for multi-label sewer defect classification. To better model spatial semantics in the images, features are aggregated at different scales non-locally through the use of a lightweight vision transformer, and a smaller set of tokens was produced through a novel Sinkhorn clustering-based tokenizer using distinct cluster centers. The proposed MSHViT and Sinkhorn tokenizer were evaluated on the Sewer-ML multi-label sewer defect classification dataset, showing consistent performance improvements of up to 2.53 percentage points.
KW  - Closed-Circuit Television
KW  - Convolutional Neural Networks
KW  - Sewer Defect Classification
KW  - Sewer Inspection
KW  - Sinkhorn-Knopp
KW  - Vision Transformers
UR  - http://www.scopus.com/inward/record.url?scp=85140050969&partnerID=8YFLogxK
U2  - 10.1016/j.autcon.2022.104614
DO  - 10.1016/j.autcon.2022.104614
M3  - Journal article
SN  - 0926-5805
VL  - 144
JO  - Automation in Construction
JF  - Automation in Construction
M1  - 104614
ER  - 

Related project: ASIR: Autonomous Sewer Inspection Robot