Multi-scale hybrid vision transformer and Sinkhorn tokenizer for sewer defect classification

A crucial part of image classification consists of capturing non-local spatial semantics of image content. This paper describes the multi-scale hybrid vision transformer (MSHViT), an extension of the classical convolutional neural network (CNN) backbone, for multi-label sewer defect classification. To better model spatial semantics in the images, features are aggregated non-locally at different scales through the use of a lightweight vision transformer, and a smaller set of tokens is produced through a novel Sinkhorn clustering-based tokenizer using distinct cluster centers. The proposed MSHViT and Sinkhorn tokenizer were evaluated on the Sewer-ML multi-label sewer defect classification dataset, showing consistent performance improvements of up to 2.53 percentage points.


Introduction
The sewerage infrastructure is one of a few critical infrastructures in modern society. If the infrastructure does not function properly, it can lead to dramatic environmental damage and pose a risk to public health [1]. Therefore, the sewer pipes require regular inspections in order to determine when a pipe has to be replaced or rehabilitated. However, with more than 1.2 million kilometers of public sewerage infrastructure in just the U.S. [1], this becomes an unimaginable task to perform manually on a regular basis, as each inspection has to be performed by a professional sewer inspector. Therefore, the task of automating the sewer inspection process has been researched for more than three decades, through the development and application of sensors and computer vision algorithms [2][3][4][5].
Since its adoption in 2017, the Convolutional Neural Network (CNN) has been the method of choice within the automated sewer inspection domain [2]. A key component of the CNN is the convolutional layers, which efficiently model local spatial semantics within the image. However, for tasks such as multi-label image classification, object detection, and object segmentation, it is essential to model non-local spatial semantics [7]. For example, a displaced joint and intruding roots could be simultaneously in an image but in opposite corners. This represents a case where multi-scale non-local spatial semantics are helpful, as knowing the presence of the displaced joint is a strong signal for inferring the presence of the roots.
Two different approaches have been adopted for vision tasks: either replacing convolutions within the CNN with non-local operations [8,7,9,10] or appending CNNs with non-local operations [11][12][13][14][15], denoted Hybrid Vision Transformer (HViT)-like methods in this paper. However, none of these methods explicitly model non-local spatial semantics across scales for image classification, even though this is a common approach in object detection and segmentation. We therefore propose the Multi-Scale Hybrid Vision Transformer (MSHViT), where a Vision Transformer (ViT) [13] is appended at different stages of a CNN backbone for non-local aggregation of features and cross-scale propagation of features. We also introduce the Sinkhorn tokenizer, a clustering-based tokenizer to replace the simple patch-based tokenizer in ViTs and act as another source of non-local spatial semantics. Furthermore, we demonstrate that the Sinkhorn tokenizer successfully clusters the CNN features, which are expected to have a high amount of redundant information due to successively applying overlapping convolutional filters and pooling layers. We find that introducing these multi-scale and non-local spatial semantics operations leads to a relative improvement compared to using just the CNN backbone.
In this work we focus on the challenging task of multi-label sewer defect classification, which has been shown by Haurum and Moeslund to be an unsolved problem [2,16], highlighting the difficulties of distinguishing visually similar defect classes and poor classification rates of sewer defects with the highest economic impact. Furthermore, improving sewer defect classification performance is crucial for advancing sewer defect detection and segmentation, as such models build upon pre-trained classification models [17].
Our main contributions are as follows:
• We present the Multi-Scale Hybrid Vision Transformer (MSHViT), a novel multi-scale extension of the Hybrid Vision Transformer model for capturing non-local spatial semantics across scales.
• We present the Sinkhorn tokenizer, a novel clustering-based tokenizer using Sinkhorn distances, which reduces the number of tokens and improves metric performance. We visually verify the cross-scale non-local interactions.
• We demonstrate that the MSHViT model outperforms the baseline CNN approaches and other HViT-like approaches on the Sewer-ML multi-label sewer classification dataset, when only considering the defect classification task.
• We demonstrate the applicability of MSHViT and the Sinkhorn tokenizer across the backbones in the ResNet and TResNet CNN architecture families, and thoroughly investigate the impact of each introduced hyperparameter.
The paper is structured as follows. In Section 2, we review the related works within automated sewer inspections, Vision Transformers, non-local CNN blocks, and tokenizers. In Section 3, we introduce MSHViT and the Sinkhorn tokenizer. In Section 4, we determine the improvement obtained by introducing the MSHViT and Sinkhorn tokenizer and compare to other HViT-like approaches. In Section 5, we conduct an extensive ablation study of the proposed methods. In Section 6, we qualitatively investigate the clustering assignment made by the Sinkhorn tokenizer, and in Section 7 we discuss the limitations and practical use of the proposed method. Finally, in Section 8, we conclude the paper.

Related works
In this section we review the literature within the automated sewer inspection domain, as well as recent progress within Vision Transformers, non-local CNN blocks, and tokenization approaches.

Automated sewer inspections
The automated sewer inspection research field has been active for more than three decades, developing domain-specific computer vision algorithms to handle the unique environment that is the sewerage infrastructure [2]. However, Haurum and Moeslund [2] found that the research field has been hindered by the lack of open source code and data, which in combination with differing evaluation protocols, has made it extremely difficult to compare the proposed methods in the literature and caused the field to lag behind the general computer vision domain. This has been rectified for the classification tasks with the introduction of the public Sewer-ML dataset [16], enabling fair and open comparisons of multi-label classification approaches. Using the Sewer-ML dataset, Haurum and Moeslund showed that the sewer defect classification tasks are far from solved, comparing the leading sewer defect classification methods from Kumar et al. [18], Meijer et al. [19], Xie et al. [20], Chen et al. [21], Hassan et al. [22], and Myrans et al. [23]. Concurrent research directions in the sewer defect classification subfield have focused on the usage of StyleGAN-based approaches to increase the effective size of small training datasets [24,25], developing and deploying networks on embedded devices [26,27], and providing defect localization information without explicit localization labels [28,29].
However, the main focus of the field within recent years has been on the defect detection and segmentation tasks [30-32,17,33-36], where no public datasets are available. The field has, however, become more transparent as many have started to directly compare different methods on the same datasets, in an effort to offset the lack of public detection and segmentation datasets [17,36,34]. Recently, the field has also started investigating other parts of the sewer inspection process [30,32,17,37-41], such as Haurum et al. [37] proposing a multi-task classification approach for simultaneously classifying defects, water level, pipe material, and pipe shape, and Wang et al. [30] proposing a framework to accurately determine the severity of defects related to the operation and maintenance of the pipes. The field has also adopted recent trends from the general computer vision field such as self-supervised learning [39], synthetic data generation [25,24,42-44], neural architecture search [45], and usage of the Transformer architecture [17,46], indicating that the automated sewer inspection field is catching up to the general computer vision domain.

Vision transformers
Transformers were originally developed for Natural Language Processing (NLP) [47]. Dosovitskiy et al. [13] demonstrated how a pure Transformer-based architecture, denoted Vision Transformer (ViT), led to competitive performance on several vision classification tasks. The ViT architecture has led to an increased research focus on adapting Transformers for vision tasks [48][49][50][51][52][53][54][55][56][57][58]. A general trend has been introducing components from CNNs into the ViTs, such as limited regions of interest and hierarchical representations [53,50,54,55] or extending CNNs with Transformers in a hybrid approach [13,15,48,12]. However, unlike CNNs the ViT only processes the input image on a single scale due to the initial tokenization step and the absence of pooling operations. This problem has been approached in two ways, by introducing either hierarchical representations inspired by classical CNN architecture design [53][54][55][56] or multi-scale representations by applying different ViTs sequentially [59] or working on variations of the input in parallel [60,58]. Our proposed method differs fundamentally from the prior work as we introduce multi-scale features by combining CNNs and ViTs, instead of adapting a purely ViT-based model.

Non-local CNN blocks
Combining non-local blocks and operations with classical CNNs has been of great interest as a way of capturing global spatial semantics. The Non-Local Network (NLN) [8] was proposed as an extension of the ResNet architecture family, where non-local aggregation operations were inserted into the last blocks of the architecture. The NLN architecture was extended by Srinivas et al. [7] who introduced the Bottleneck Transformer, where Multi-Head Self-Attention was inserted directly into the ResNet bottleneck blocks. Both of these approaches led to direct improvements on several vision tasks. Appending CNNs with non-local operations has similarly led to improvements in image classification as shown by Dai et al. [14] who investigated how to design Hybrid Vision Transformers (HViTs), i.e., CNNs appended with a ViT, and in tasks such as object detection with the DETR model [11] and enabling image-caption pair based training [15]. In contrast to the previous application of non-local blocks, we append the CNN at several stages in order to explicitly introduce multi-scale interactions through the proposed MSHViT architecture.

Tokenizers
An essential part of the Transformer architecture is the choice of how to generate the token embeddings. In NLP several embedding methods have been utilized through the years in order to represent sentences and words [61,62]. However, for image data this has not been the case. Dosovitskiy et al. [13] proposed simply extracting non-overlapping patches of the input image and linearly mapping the patches to an embedding space. This approach has since been iterated upon, by instead extracting overlapping patches [57], learning to select the patch size of the conventional patch tokenizer [63], as well as replacing the initial layer of the Transformer with a convolutional stem similar to those found in CNNs [49]. In parallel, different token downsampling approaches have been investigated in order to reduce token redundancy. Goyal et al. [64] and Rao et al. [65] propose score-based token downsampling methods, where each token is assigned a score based on the incoming attention from other tokens or a predictive subnetwork, respectively. In contrast, this work and the concurrent work by Marin et al. [66] propose clustering-based approaches for reducing the number of tokens. The method by Marin et al. utilizes a K-means/medoids based approach, whereas our proposed Sinkhorn tokenizer utilizes Sinkhorn distances [6] in order to softly assign the input tokens to a set of cluster centers. All of the prior approaches [64][65][66] are focused on pure ViT architectures and inserted in between each encoder block, progressively decimating the number of tokens present. Comparatively, the proposed Sinkhorn tokenizer is applied on HViTs in order to reduce redundancy in the CNN feature-based tokens.

Methodology
In this section we first review the Vision Transformer and its hybrid variant originally proposed by Dosovitskiy et al. [13]. Then we present our novel clustering-based Sinkhorn tokenizer, designed to reduce the number of redundant tokens in ViTs. Lastly, we present our MSHViT architecture, designed to non-locally combine CNN features at the ith scale and progressively combine features across scales, as illustrated in Fig. 1. An overview of the introduced symbols and notations can be found in Appendix A.

Vision transformers
The Vision Transformer [13] demonstrated that the original Transformer architecture [47] can be used with few modifications for image classification, and without the image-related inductive biases found in CNNs.

Tokenization
The Transformer takes a series of 1D token embeddings as input and processes the series in parallel. In order to convert image data to a series of 1D tokens, the input image $X \in \mathbb{R}^{C \times H \times W}$ is convolved with $D$ different $P \times P$ kernels with a stride of $P$ and flattened to a 1D vector per patch, producing $N = HW/P^2$ linearly embedded tokens $T_p \in \mathbb{R}^{D \times N}$.
Furthermore, a special class (CLS) token $x_{CLS} \in \mathbb{R}^{D}$ is appended to $T_p$. The CLS token is randomly initialized and used to generate an image-level feature representation. In order to encode a spatial ordering into the tokens, a learnable positional embedding $E_{pos} \in \mathbb{R}^{D \times (N+1)}$ is added, leading to the final token representations:
$$Z_0 = [x_{CLS} \,\Vert\, T_p] + E_{pos} \qquad (1)$$
where $\Vert$ denotes concatenation.
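As an illustration of the tokenization step, the following is a minimal NumPy sketch rather than the authors' implementation. It relies on the fact that a convolution with $P \times P$ kernels and stride $P$ is equivalent to a linear map applied to each flattened non-overlapping patch. The function names `patch_tokenize` and `add_cls_and_position`, the random weights, and the toy sizes are assumptions made for the example, as is placing the CLS token at index 0.

```python
import numpy as np

def patch_tokenize(x, weight, bias, patch):
    """Embed non-overlapping patches of x (C, H, W) into tokens T_p of shape (D, N).

    weight has shape (D, C * patch * patch) and plays the role of the D
    convolution kernels of size patch x patch applied with stride patch.
    """
    c, h, w = x.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be divisible by P"
    gh, gw = h // patch, w // patch
    # (C, gh, P, gw, P) -> (gh, gw, C, P, P): one flattened vector per patch
    patches = x.reshape(c, gh, patch, gw, patch).transpose(1, 3, 0, 2, 4)
    patches = patches.reshape(gh * gw, c * patch * patch)  # (N, C*P*P)
    return weight @ patches.T + bias[:, None]  # (D, N)

def add_cls_and_position(t_p, x_cls, e_pos):
    """Concatenate the CLS token (assumed at index 0) and add positional embeddings."""
    z0 = np.concatenate([x_cls[:, None], t_p], axis=1)  # (D, N + 1)
    return z0 + e_pos

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
C, H, W, P, D = 3, 8, 8, 4, 16
x = rng.normal(size=(C, H, W))
weight = rng.normal(size=(D, C * P * P))
bias = np.zeros(D)

t_p = patch_tokenize(x, weight, bias, P)
z0 = add_cls_and_position(t_p, rng.normal(size=D), rng.normal(size=(D, t_p.shape[1] + 1)))
print(t_p.shape, z0.shape)  # (16, 4) (16, 5)
```

With $H = W = 8$ and $P = 4$ the sketch yields $N = HW/P^2 = 4$ tokens, and setting `patch=1` gives one token per spatial location.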

ViT model
The Transformer consists of $L$ stacked encoder blocks, each consisting of a token-aggregation step, such as Multi-Head Self-Attention (MHSA), followed by an inverted bottleneck projecting each token into an intermediate $\mathbb{R}^{D \cdot r}$ space, where $r$ is an adjustable hyperparameter, followed by a down projection to the $D$-dimensional feature space. Layer normalization (LN) [67] is applied before both actions and residual connections are inserted around each action. The final feature representation is the CLS token after $L$ blocks and a final layer normalization step, $y = \mathrm{LN}(Z_{L,0})$.
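The block structure described above can be sketched as follows. This is a simplified NumPy illustration under stated assumptions, not the model used in the paper: the token-aggregation step is a single-head self-attention instead of MHSA, layer normalization has no learnable scale or shift, ReLU stands in for the non-linearity, biases are omitted, and all names and sizes are hypothetical.

```python
import numpy as np

def layer_norm(z, eps=1e-6):
    """Normalize every token (column of z) over its D features."""
    mean = z.mean(axis=0, keepdims=True)
    var = z.var(axis=0, keepdims=True)
    return (z - mean) / np.sqrt(var + eps)

def self_attention(z, w_q, w_k, w_v):
    """Single-head self-attention over tokens stored as columns of z (D, M)."""
    q, k, v = w_q @ z, w_k @ z, w_v @ z
    scores = (q.T @ k) / np.sqrt(z.shape[0])      # (M, M), row m holds query m
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    return v @ attn.T                             # (D, M)

def encoder_block(z, p):
    """Pre-LN block: residual token mixing, then residual inverted bottleneck."""
    z = z + self_attention(layer_norm(z), p["w_q"], p["w_k"], p["w_v"])
    hidden = np.maximum(p["w_up"] @ layer_norm(z), 0.0)  # (D * r, M)
    return z + p["w_down"] @ hidden

def vit_forward(z0, blocks):
    """Apply all blocks and return the normalized CLS token (index 0)."""
    z = z0
    for p in blocks:
        z = encoder_block(z, p)
    return layer_norm(z)[:, 0]

rng = np.random.default_rng(0)
D, M, r, L = 8, 5, 4, 2  # toy sizes for illustration only

def init_block():
    s = 0.1
    return {
        "w_q": s * rng.normal(size=(D, D)),
        "w_k": s * rng.normal(size=(D, D)),
        "w_v": s * rng.normal(size=(D, D)),
        "w_up": s * rng.normal(size=(D * r, D)),
        "w_down": s * rng.normal(size=(D, D * r)),
    }

blocks = [init_block() for _ in range(L)]
y = vit_forward(rng.normal(size=(D, M)), blocks)
print(y.shape)  # (8,)
```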

Hybrid ViT
Unlike CNNs, ViTs have very few image-specific inductive biases [13]. Therefore, ViTs often require large amounts of training data in order to learn relevant relations, which are encoded directly into CNN architectures. However, this lack of inductive biases similarly allows ViTs to learn relations within images which are not viable with CNNs, such as capturing non-local spatial semantics. The HViT aims at combining these two architectures, by first using a CNN to encode local features, and then computing non-local spatial semantics using a ViT. This is realized by extracting the tokens $T_p$ from a CNN feature map with a kernel size $P = 1$, typically at the last feature map before the commonly used global pooling step. This is in contrast to the ViT model where the tokens are extracted directly from the input image $X$.

Sinkhorn tokenizer
The original ViTs generate the token representations of the image through a non-overlapping patch-based method [13]. Several methods have been proposed to improve the tokenizer either by reducing the stride of the convolutional layer such that the patches overlap [57], or instead use a convolutional stem which aggressively downsamples the spatial dimensions of the input [49]. However, these methods do not consider the redundancy of features stemming from encoding similar patches in the image and therefore lead to disproportionately representing these in the generated tokens. While this may be implicitly handled by the attention mechanisms in the ViT, it introduces an unnecessary processing overhead and requires the model to learn these relations.
Fig. 1. [...] (Bottom) The Sinkhorn tokenizer reduces the number of tokens by first measuring the cosine similarity, $V$, between all input tokens $T_p$ and cluster centers $C$. The Sinkhorn distances [6] are then computed by applying Sinkhorn-Knopp for $t_{SK}$ iterations, resulting in the soft assignment matrix, $Q^{*\top}$. Using $Q^{*\top}$ the input features are clustered into the smaller set of tokens, $T_S$.
To deal with the redundant features we introduce a clustering-based tokenizer using Sinkhorn distances [6], inspired by clustering-based self-supervised learning [68,69]. The approach builds upon the original patch tokenizer with $P = 1$. The $N$ patch tokens $T_p$ are compared to $K$ cluster centers $C \in \mathbb{R}^{D \times K}$, which are initialized from a $D$-dimensional Normal distribution with zero mean and unit variance. We assume both $T_p$ and $C$ are $\ell_2$ normalized and measure similarity using the cosine similarity $V = C^{\top} T_p \in \mathbb{R}^{K \times N}$. Based on the similarity scores $V$ we compute the soft assignment matrix $Q \in \mathbb{R}_{+}^{K \times N}$, which belongs to the set of valid assignment matrices $\mathcal{Q}$, such that the similarity between the cluster centers and features is maximized:
$$\max_{Q \in \mathcal{Q}} \; \mathrm{Tr}\left(Q^{\top} V\right) + \epsilon H(Q) \qquad (2)$$
where $H$ is the matrix entropy function and $\epsilon$ controls the weighting of the entropy loss and thereby the smoothness of the assignment scores.
Similar to [68,69] we constrain $Q$ to be in the transportation polytope under an equipartition constraint of the input and cluster centers, i.e., the features should on average be uniformly assigned to the cluster centers. However, instead of applying the constraint on the full dataset [68] or mini-batches [69], we apply the constraint on the $N$ features from a single input, see Eq. (3). We apply the constraint on the $N$ features such that there is no cross-information between input images, enabling single image evaluation.
$$\mathcal{Q} = \left\{ Q \in \mathbb{R}_{+}^{K \times N} \;\middle|\; Q \mathbf{1}_N = \tfrac{1}{K} \mathbf{1}_K,\; Q^{\top} \mathbf{1}_K = \tfrac{1}{N} \mathbf{1}_N \right\} \qquad (3)$$
where $\mathbf{1}_K$ and $\mathbf{1}_N$ are $K$- and $N$-dimensional vectors filled with ones, respectively.
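The solution to Eq. (2) can then be formulated as follows:
$$Q^{*} = \mathrm{Diag}(u) \exp\left(\frac{V}{\epsilon}\right) \mathrm{Diag}(v) \qquad (4)$$
where the renormalization vectors $u$ and $v$ are computed using the iterative Sinkhorn-Knopp algorithm [6] through $t_{SK}$ iterations.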
Using the soft assignments between input features $T_p$ and cluster centers $C$ stored in $Q^{*}$ we transform the input features into $K$ new tokens:
$$T_S = T_p Q^{*\top} \qquad (5)$$
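To make the tokenizer concrete, the following NumPy sketch follows the description above under several assumptions and should not be read as the released implementation. The soft assignment is taken to be the entropy-regularized form $\mathrm{Diag}(u)\exp(V/\epsilon)\mathrm{Diag}(v)$, obtained by alternately rescaling rows to sum to $1/K$ and columns to sum to $1/N$, and the new tokens are formed as $T_p Q^{*\top}$. The function names, the default of three iterations, and the absence of any rescaling of $Q^{*}$ before aggregation are choices made for the example.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale every column of x to unit Euclidean norm."""
    return x / np.maximum(np.linalg.norm(x, axis=0, keepdims=True), eps)

def sinkhorn_knopp(v, eps=0.25, n_iters=3):
    """Soft assignment Q (K, N) from similarities v (K, N).

    Rows are pushed towards sum 1/K and columns towards sum 1/N, i.e. the
    equipartition constraint applied to the tokens of a single image.
    """
    k, n = v.shape
    q = np.exp(v / eps)
    q /= q.sum()
    for _ in range(n_iters):
        q /= q.sum(axis=1, keepdims=True) * k  # each cluster gets mass 1/K
        q /= q.sum(axis=0, keepdims=True) * n  # each token gets mass 1/N
    return q

def sinkhorn_tokenize(t_p, centers, eps=0.25, n_iters=3):
    """Cluster N input tokens (D, N) into K tokens (D, K)."""
    t_p = l2_normalize(t_p)
    v = l2_normalize(centers).T @ t_p          # cosine similarities, (K, N)
    q = sinkhorn_knopp(v, eps, n_iters)
    return t_p @ q.T, q

rng = np.random.default_rng(0)
D, N, K = 16, 49, 8                            # toy sizes for illustration
tokens_in = rng.normal(size=(D, N))
centers = rng.normal(size=(D, K))              # zero-mean, unit-variance init
t_s, q = sinkhorn_tokenize(tokens_in, centers)
print(t_s.shape, q.shape)  # (16, 8) (8, 49)
```

Because the last update in each iteration rescales the columns, the column sums equal $1/N$ exactly while the row sums only approach $1/K$ as the number of iterations grows.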

Multi-scale hybrid vision transformers
Based on prior work on combining non-local operations with classical CNNs, such as HViTs, we propose the Multi-Scale Hybrid Vision Transformer. Whereas the original HViT simply extends the backbone CNN with a ViT, we propose applying ViTs at different scales of the backbone CNN. Furthermore, we also introduce cross-scale connections between the ViTs in order to encode non-local spatial semantics in the image at different scales, see Fig. 1.
CNNs such as ResNets [70] and Inception networks [71,72] have a set of natural scales within them due to the periodic pooling operations. The representative feature map of each scale is defined to be the last feature map before each pooling operation and denoted $X^i$ for the $i$th scale. At every scale each feature in $X^i$ is linearly embedded into a common $D$-dimensional space as tokens $T^i_p$. These tokens are processed using a tokenization function $\psi^i$, representing either the Sinkhorn tokenizer (Eq. (5)) or an identity function for the standard patch tokenizer, with the output denoted $T^i$. The tokens can then be processed by a scale-specific ViT of depth $L$, denoted as $\phi^i$, producing the scale features:
$$T^i = \psi^i\left(T^i_p\right), \qquad Z^i_L = \phi^i\left(T^i\right) \qquad (6)$$

Cross-scale connections
In order to share information between different scales, we introduce cross-scale connections. For scale $i > 1$, all previous scale features, or a subset of the features, denoted $S^i$, are included in addition to the $i$th scale features $T^i_p$, see Eq. (7).
This cross-scale connection can occur using features from three different stages: the linearly embedded CNN features $T_p$, see Eq. (8), the Sinkhorn tokens $T_S$, see Eq. (9), or the final token embeddings $Z_L$, see Eq. (10). Here, $j$ denotes the initial scale which we consider for scale $i$. For example, if $j = 1$ all features from scale 1 to scale $i - 1$ are aggregated, while if $j = i - 1$ only the features from scale $i - 1$ are aggregated.
Lastly, the overall image representation is defined to be $y = \mathrm{LN}(Z^{I}_{L,0})$, where $I$ denotes the last scale of the backbone.
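The role of $j$ in the cross-scale connections can be illustrated with a short NumPy sketch. It assumes that the features of the earlier scales are gathered by concatenation along the token axis and that the gathered set is then concatenated with the tokens of scale $i$; the exact point at which the authors combine them is given by Eqs. (7)-(10), which are not reproduced here. The function names and the token counts per scale are hypothetical.

```python
import numpy as np

def cross_scale_features(features_per_scale, i, j):
    """Gather features from scales j, ..., i-1 as one (D, M) matrix.

    features_per_scale maps a 1-based scale index to a (D, M_k) array, e.g.
    the embedded CNN features, the Sinkhorn tokens, or the final token
    embeddings of that scale.
    """
    if not 1 <= j < i:
        raise ValueError("j must satisfy 1 <= j < i")
    return np.concatenate([features_per_scale[k] for k in range(j, i)], axis=1)

def scale_input(t_i, s_i):
    """Assumed combination: tokens of scale i followed by the gathered features."""
    return np.concatenate([t_i, s_i], axis=1)

rng = np.random.default_rng(0)
D = 16
# Hypothetical token counts per scale, for illustration only.
features = {1: rng.normal(size=(D, 4)), 2: rng.normal(size=(D, 3)), 3: rng.normal(size=(D, 2))}

s_all = cross_scale_features(features, i=3, j=1)   # scales 1 and 2
s_prev = cross_scale_features(features, i=3, j=2)  # scale 2 only
print(s_all.shape, s_prev.shape)  # (16, 7) (16, 3)
print(scale_input(features[3], s_all).shape)  # (16, 9)
```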

Experimental results
In this section we investigate the performance of the MSHViT architecture and Sinkhorn tokenizer on the Sewer-ML dataset, a multi-label sewer defect classification dataset [16]. Sewer-ML is the world's only public multi-label sewer defect dataset, consisting of 1.3 million images, 17 defect classes, and the implicit normal class. The dataset is split into three distinct training, validation, and testing splits, containing 1 million, 130k, and 130k images, respectively. We refer to the Supplementary material of Haurum and Moeslund [16] for example images. Defect predictions are evaluated using the class F2-scores weighted by the class importance weights (CIW), $F2_{CIW}$, which indicate the economic importance of the classes, and the normal pipes are evaluated by the F1-score, $F1_{Normal}$ [16]. An abbreviated introduction to the Sewer-ML dataset and the evaluation metrics can be found in Appendix B. Code and model weights can be found at the project webpage: https://vap.aau.dk/mshvit/.
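As a rough illustration of the two evaluation metrics, the sketch below computes a class-importance-weighted average of per-class F2-scores and an F1-score for the implicit normal class. It assumes that the weighted metric is the CIW-weighted mean of the per-class F2-scores and that an image counts as normal when no defect label is set; the precise definitions are those of [16] and Appendix B. The label matrices and weights are made-up toy values, not the Sewer-ML class importance weights.

```python
import numpy as np

def f_beta_per_class(y_true, y_pred, beta):
    """Per-class F-beta for binary label matrices of shape (samples, classes)."""
    tp = np.sum((y_true == 1) & (y_pred == 1), axis=0)
    fp = np.sum((y_true == 0) & (y_pred == 1), axis=0)
    fn = np.sum((y_true == 1) & (y_pred == 0), axis=0)
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    # A class with no positives and no predictions scores 0 by convention here.
    return (1 + b2) * tp / np.maximum(denom, 1)

def f2_ciw(y_true, y_pred, ciw):
    """Class-importance-weighted mean of the per-class F2-scores."""
    f2 = f_beta_per_class(y_true, y_pred, beta=2.0)
    return float(np.sum(ciw * f2) / np.sum(ciw))

def f1_normal(y_true, y_pred):
    """F1 for the implicit normal class (assumed: no defect label set)."""
    normal_true = (y_true.sum(axis=1) == 0).astype(int)[:, None]
    normal_pred = (y_pred.sum(axis=1) == 0).astype(int)[:, None]
    return float(f_beta_per_class(normal_true, normal_pred, beta=1.0)[0])

# Toy example with two defect classes and hypothetical weights.
y_true = np.array([[1, 0], [1, 1], [0, 0], [0, 1]])
y_pred = np.array([[1, 0], [0, 1], [0, 0], [1, 1]])
weights = np.array([1.0, 0.5])

print(f_beta_per_class(y_true, y_pred, 2.0))  # [0.5 1. ]
print(round(f2_ciw(y_true, y_pred, weights), 4))  # 0.6667
print(f1_normal(y_true, y_pred))  # 1.0
```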

Training procedure
We follow the training procedure of Haurum et al. [37] with the addition of using the Exponential Moving Average (EMA) technique on the model weights, see Table 1. We utilize the Fourier Network (FNet) based attention mechanism [73] in the HViT as an efficient alternative to the conventional MHSA-based attention mechanism.
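For readers unfamiliar with FNet-style token mixing, the sketch below shows the core operation as it is commonly described: a discrete Fourier transform over both the feature and the token dimension, of which only the real part is kept. It is a generic NumPy illustration with a hypothetical function name and toy sizes, not the configuration used in the experiments.

```python
import numpy as np

def fnet_mixing(z):
    """Parameter-free token mixing for tokens stored as columns of z (D, M).

    Applies a 2D discrete Fourier transform over the feature and token axes
    and keeps the real part, so the output has the same shape as the input.
    """
    return np.fft.fft2(z).real

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 5))
mixed = fnet_mixing(z)
print(mixed.shape)  # (8, 5)
```

Since the mixing step has no learnable parameters, it replaces the query, key, and value projections of MHSA; the surrounding layer normalization, residual connections, and inverted bottleneck are unchanged.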
We define the ResNet architecture to have five natural scales: the convolutional stem followed by four ResNet blocks, numbered from 1 to 5. These stages are chosen as they act on feature maps with different spatial dimensions.

Hyperparameter search
The hyperparameter search for the MSHViT and Sinkhorn tokenizer is conducted in a sequential manner in order to reduce the search space due to the number of hyperparameters and the investigated value ranges. The investigated hyperparameter values as well as the initial and final values are shown in Table 2. The initial Sinkhorn tokenizer values were set as in Caron et al. [69], except for the number of clusters $K$, where we chose 64 centers as the initial value to ensure a large average assignment probability per cluster in each scale. For the MSHViT architecture we initialized the model by appending the last two layers, where higher-order features are available. The hyperparameters of the ViTs were chosen such that only a moderate parameter increase was introduced. After each step in the sequential search we used the configuration which performed the best for the next step. The steps of the sequential search were ordered such that the Sinkhorn tokenizer cluster and MSHViT cross-scale hyperparameters were determined first, and lastly the structure of the ViTs. The entire hyperparameter search was conducted with the ResNet-50 backbone. The order of the search was as follows:
1. Search over the entropic regularization $\epsilon$ in the Sinkhorn tokenizer.
2. Search over the number of iterations $t_{SK}$ in the Sinkhorn tokenizer.
3. Search over the number of clusters $K$ in the Sinkhorn tokenizer.
4. Search over which scales to be used and selection of $j$ in the MSHViT extension.
5. Search over the multi-scale method, $S$.
6. Search over token dimensionality $D$.
7. Search over the MLP ratio $r$.
8. Search over ViT depth $L$.
We find that the initial hyperparameters perform well, with only the entropic regularization and number of iterations in the Sinkhorn-Knopp algorithm being adapted.
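The sequential search procedure can be summarized by the following sketch, in which one hyperparameter is varied at a time and the best configuration found so far is carried into the next step. The function name, the toy objective, the initial values, and the candidate lists are all hypothetical; in the paper each evaluation corresponds to training a model and measuring validation performance.

```python
def sequential_search(initial, search_space, evaluate):
    """Coordinate-wise search over an ordered list of (name, candidates).

    evaluate(config) returns a score where higher is better. After each step
    the best value found for that hyperparameter is kept for all later steps.
    """
    best = dict(initial)
    best_score = evaluate(best)
    for name, candidates in search_space:
        for value in candidates:
            trial = {**best, name: value}
            score = evaluate(trial)
            if score > best_score:
                best, best_score = trial, score
    return best, best_score

# Hypothetical stand-in for "train and validate"; peaks at eps=0.25, t_sk=3.
def toy_objective(config):
    return -((config["eps"] - 0.25) ** 2) - (config["t_sk"] - 3) ** 2

initial = {"eps": 0.05, "t_sk": 1}
space = [
    ("eps", [0.05, 0.25, 0.50, 0.75, 1.00, 1.25]),
    ("t_sk", [1, 3, 5]),
]
best, score = sequential_search(initial, space, toy_objective)
print(best)  # {'eps': 0.25, 't_sk': 3}
```

The number of evaluations grows with the sum of the candidate list lengths rather than their product, which is the motivation for searching sequentially; the trade-off is that interactions between hyperparameters are only explored along the chosen order.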

Comparative models
We investigate the performance increase incurred when applying MSHViT to the ResNet-{18, 34, 50, 101}, a commonly used backbone architecture in the image classification literature [7,12,76], as well as the TResNet backbone [75], an adaptation of the ResNet backbone using concepts such as anti-aliased downsampling and Squeeze and Excitation (SE) [77] layers. The same MSHViT hyperparameters are used for all backbones. Furthermore, we compare performance against the HViT-like models BoTNet-50-S1 [7] and CoAtNet-{0, 1} [14], as well as the original HViT structure [13]. BoTNet and CoAtNet were trained with the model structure described in the original papers, while the HViT model uses the same ViT parameters described in Table 2 with the exception of the attention mechanism, where we use the classical MHSA-based token mixing. We compare using both the conventional patch-based tokenizer and the proposed Sinkhorn tokenizer. Lastly, we compare to the previously published results on Sewer-ML [16,37]. We run all experiments within the same codebase, using the torchvision [78], PyTorch Lightning [79], and timm [80] libraries. All models were trained using a single Nvidia V100 GPU except for the CoAtNet models, which required two V100 GPUs due to a higher VRAM consumption.

Results
We find that introducing the MSHViT and Sinkhorn tokenizer leads to a noticeable improvement on all tested backbones, see Table 3. On the $F2_{CIW}$ metric we observe an increase between 0.7 and 2.5 percentage points, with the largest increase observed on the ResNet-50, where the performance is improved by 2.4-2.5 percentage points on both the validation and testing splits. This is significantly better than the benchmark algorithm from Haurum and Moeslund [16], and a comparable performance to the previous best performing model on Sewer-ML, the multi-task classification method CT-GAT [37], while only using the sewer defect labels during training. This demonstrates that it is possible to significantly increase the sewer defect classification performance without needing auxiliary data such as water level, pipe shape, and pipe material. For the non-defective pipes we observe a more moderate increase of up to 0.24 percentage points in the $F1_{Normal}$ metric. However, we observe a higher baseline performance compared to previous methods.
Interestingly, we observe that the ResNet-34 backbone perform

Table 3
Results on Sewer-ML. Comparison using the investigated CNN backbones. We compare each backbone with and without the MSHViT and Sinkhorn tokenizer extension (denoted MSHViT) using the $F2_{CIW}$ and $F1_{Normal}$ metrics [16]. Best performance per column is denoted in bold. We also include the previous published results on Sewer-ML [16,37], and HViT-like models [7,13,14]. * denotes that the method was trained in a multi-task classification framework.

When comparing to other HViT-like models we see that the MSHViT extension outperforms the original HViT structure, as well as all models where the Transformer structure is incorporated directly into the backbone. It should be noted that on the validation split the BoTNet-50-S1 model nearly matches the ResNet-50-MSHViT's $F2_{CIW}$ score and achieves the highest $F1_{Normal}$ performance. However, on the test split the $F2_{CIW}$ performance is significantly lower compared to the ResNet-50-MSHViT, indicating the model does not generalize as well as the ResNet-50-MSHViT model.
From these results we can conclude that the proposed MSHViT extension led to improvements without tuning the hyperparameters for the backbone.We hypothesize that if hyperparameters were tuned for each backbone, the performance gain would further increase.

Per-class analysis
In order to better understand how the compared models work, we investigate how the baseline and MSHViT extended models differ in their class predictions on the validation split. In Fig. 2a we present the per-class F2-scores for all MSHViT models, and in Fig. 2b we determine the difference in per-class F2-scores when comparing the MSHViT variants with the baseline models, see Eq. (11).
$$\delta_c = F2_c^{MSHViT} - F2_c^{Baseline} \qquad (11)$$
where $\delta_c$ is the difference in F2-scores for class $c$, and $F2_c^{MSHViT}$ and $F2_c^{Baseline}$ are the F2-scores for class $c$ for the MSHViT and Baseline models, respectively. When analyzing the absolute per-class performance in Fig. 2a, we see that the ResNet-34, ResNet-50, and ResNet-101 all perform similarly well on nearly all classes, with the ResNet-34 and ResNet-50 achieving noticeable performance in the highest weighted classes, whereas the TResNet models and ResNet-18 have a noticeably lower score on several classes. We also observe that for all models the performance is low on intruding sealing material (IS), obstacles (FO), and cracks, breaks, and collapses (RB). This can be explained by the fact that the IS and FO classes are some of the rarest classes in the Sewer-ML dataset, with less than 10,000 examples per class. Additionally, all three classes have a large variation in their visual appearance within the class, while being less visually distinct from other classes. For example, the FO class is defined such that it encompasses any possible foreign objects that can block the pipes. In Fig. 2b we observe that when using MSHViT together with the ResNet backbones, performance increases on nearly all classes, except for consistent decreases on the attached deposits (BE) class and on the connection with construction changes (OK) class. For the ResNet-34 backbone we also observe a significant decrease in performance on the deformation (DE) class. However, there is a noticeable increase in performance on both the lateral reinstatement cuts (OS) and cracks, breaks, and collapses (RB), the two highest weighted classes, across all ResNet backbones. On the other hand, we see that the TResNet backbones perform very poorly on the OS class, which drags down the overall score, even though they perform well on nearly all other classes.
Fig. 2. Per-Class F2-scores analysis. We present the per-class F2-scores on the validation split for all MSHViT-based models as well as the difference between the MSHViT variants and the baseline models, $\delta_c$. The classes are sorted in ascending order by their class-importance weight [16]. Class names and abbreviations are described in Appendix B.

Qualitative examples
In addition to the quantitative per-class comparison, we also look into specific cases where the predictions of the compared models differ. Focusing on the ResNet-50 backbone we compare cases where the MSHViT extensions match all classes correctly while the baseline misclassifies some or all classes and vice versa, see Fig. 3. Four examples are shown where the MSHViT model correctly predicts all classes. In the top left image, the MSHViT correctly predicts the pipe to be normal, whereas the baseline predicts surface damage (OB). This is most likely due to the missing top half of the pipe, as the image is taken from within the sewer well. In the top middle and bottom left cases the baseline misses the cracks, breaks, and collapses (RB) and lateral reinstatement cuts (OS) classes, the two highest weighted classes by CIW. Missing these classes could lead to significant economic repercussions. The RB class is most likely missed due to its visual similarity to the displaced joint (FS) deeper in the pipe, whereas the OS is similarly missed as the baseline misses the fact that a lining has been inserted and the low severity of the class. In the bottom middle example, the baseline simply misses the intruding sealing material (IS) class, instead only classifying the displaced joint (FS). In the top right and bottom right, the MSHViT variant misses the displaced joint (FS) and roots (RO), respectively. It is not clear why the MSHViT missed the displaced joint. However, we hypothesize it might be due to the co-occurring connection with construction changes (OK) class, where the material of the pipe changes. For the bottom right case, the MSHViT misses the small fine roots in the joint, most likely due to focusing on the much more prevalent displaced joint (FS) and surface damage (OB).

Efficiency analysis
In order to determine the efficiency of the MSHViT extension and verify that the increased metric performance is not simply due to an increase in learnable parameters, we compare the validation F2 CIW against the number of trainable parameters in the models as well as the throughput measured in images processed per second (img/s) during both training and inference, as recommended by Dehghani et al. [81]. The throughput is computed over 200 batches of 256 images with an initial 10 warmup batches, and averaged over five separate runs. As the method from Haurum and Moeslund [16] is a two-stage approach and the method from Haurum et al. [37] is designed for the multi-task classification task, we do not include these in the throughput comparison. The results are shown in Fig. 4. From these results it is clear that the increased performance obtained with the MSHViT extension is not only due to the increased number of parameters, as the extended models consistently outperform baseline variants with a higher number of parameters. When looking at the throughput of the models, we see that the MSHViT does lead to a slower processing speed; however, for the larger models such as ResNet-50 and ResNet-101 this slowdown is marginal.
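The measurement protocol described above can be sketched as follows. This is a minimal illustration and not the benchmarking code used for Fig. 4: `measure_throughput` and `step_fn` are hypothetical names, and it is an assumption that the warmup batches are repeated at the start of each of the five runs.

```python
import time


def measure_throughput(step_fn, batch_size=256, n_batches=200,
                       n_warmup=10, n_runs=5):
    """Return the mean images/second of `step_fn` over `n_runs` runs.

    `step_fn` is a hypothetical callable that processes one batch of
    `batch_size` images (a training step or an inference forward pass).
    """
    rates = []
    for _ in range(n_runs):
        # Warmup batches are executed but not timed.
        # Assumption: the warmup is repeated at the start of every run.
        for _ in range(n_warmup):
            step_fn()
        start = time.perf_counter()
        for _ in range(n_batches):
            step_fn()
        # On a GPU, the device should be synchronized before reading the clock.
        elapsed = time.perf_counter() - start
        rates.append(batch_size * n_batches / elapsed)
    return sum(rates) / len(rates)
```

Excluding the warmup batches keeps one-off costs such as memory allocation and kernel selection out of the reported img/s.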

Ablation studies
We conduct a series of ablation studies in order to determine the sensitivity to the hyperparameter settings in the Sinkhorn tokenizer and MSHViT architecture. All tests are conducted on the Sewer-ML validation set using a ResNet-50 backbone, with the hyperparameter values stated in Tables 1 and 2 unless otherwise stated.

Sinkhorn-Knopp hyperparameters
At the heart of the Sinkhorn tokenizer is the iterative Sinkhorn-Knopp algorithm, which is controlled by two hyperparameters: t SK and ∊. We investigate these hyperparameters' effect on the metric performance one at a time.
First, we investigate the strength of the entropic regularization term in Eq. (2), comparing values of ∊ = {0.05, 0.25, 0.50, 0.75, 1.00, 1.25}, see Table 4. We observe that the highest F2 CIW and F1 Normal are achieved using ∊ = 0.25, a slightly higher entropic regularization term than what has previously been used in the self-supervised training domain [69]. In general, we see that too high or too low an entropic regularization negatively affects the F2 CIW performance.
Secondly, we investigate the effect of the number of iterations conducted, t SK. We compare the performance when setting t SK = {1, 3, 5, 7, 9}, see Table 5, as well as the effect on efficiency by measuring training and inference img/s, see Fig. 5. We observe that peak performance on both F2 CIW and F1 Normal is achieved when t SK is set to 5, while too few or too many iterations lead to a degradation in performance. We also observe a monotonic decrease in throughput when t SK is increased, as expected. When compared to the conventional patch tokenizer, we observe that both the training and inference throughput of the Sinkhorn tokenizer exceed those of the patch tokenizer at all settings of t SK.
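To make the roles of ∊ and t SK concrete, the following NumPy sketch shows a common formulation of the Sinkhorn-Knopp iterations, similar to the variant used in self-supervised clustering. It is an illustration under assumptions and not the exact procedure of Eq. (2): the function name, the (K, N) layout of the similarity matrix, and the uniform marginals are all choices made for this example.

```python
import numpy as np


def sinkhorn_knopp(V, eps=0.25, t_sk=5):
    """Soft-assign N input tokens to K cluster centers.

    V    : (K, N) similarities between cluster centers and input tokens.
    eps  : entropic regularization strength; smaller values give sharper
           assignments.
    t_sk : number of alternating row/column normalizations.
    Returns Q of shape (K, N) in which every column sums to one.
    """
    K, N = V.shape
    # Subtracting the global maximum only rescales Q by a constant, which
    # the normalizations below cancel; it avoids overflow for small eps.
    Q = np.exp((V - V.max()) / eps)
    Q /= Q.sum()
    for _ in range(t_sk):
        # Rows: push every cluster towards the same total mass (1 / K).
        Q /= Q.sum(axis=1, keepdims=True)
        Q /= K
        # Columns: every token distributes the same total mass (1 / N).
        Q /= Q.sum(axis=0, keepdims=True)
        Q /= N
    return Q * N  # rescale so that each token's assignments sum to one


rng = np.random.default_rng(0)
V = rng.uniform(-1.0, 1.0, size=(4, 10))  # toy cosine similarities
Q = sinkhorn_knopp(V, eps=0.25, t_sk=5)
print(Q.sum(axis=0))  # per-token assignment mass
```

Each iteration adds two normalization passes over the K × N matrix, which is consistent with the monotonic decrease in throughput observed as t SK grows.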

Number of cluster centers K
A key part of the Sinkhorn tokenizer is the number of clusters K. We investigate the effect of setting K = {32, 64, 128, 64/64, 128/64}, where x/y denotes x clusters for the 4th scale and y clusters for the 5th scale, see Table 6. We find that increasing or decreasing the number of cluster centers slightly reduces the classification performance, whereas having more clusters for earlier scales dramatically decreases performance. We hypothesize that this is due to the earlier clusters capturing similar semantics, as the larger number of cluster centers allows a less aggressive clustering process.

Tokenizer efficiency at different image resolutions
A key benefit of the Sinkhorn tokenizer is the constant efficiency when the image resolution is increased. To demonstrate this, we compare the training and inference throughput of the MSHViT model (excluding the backbone, which would simply be an offset) at different image resolutions, when using the conventional patch tokenizer and the proposed Sinkhorn tokenizer, see Fig. 6. From this it is clear that the Sinkhorn tokenizer handles changes in image resolution better, whereas the throughput of the conventional patch tokenizer suffers greatly when the resolution is increased.
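The difference can be illustrated by counting the tokens passed to the ViT. The sketch below assumes the standard ResNet strides of 16 and 32 for the 4th and 5th scales, a patch size of P = 1, and K = 64 cluster centers; the value of K and the helper name are assumptions made for this example.

```python
def patch_token_count(height, width, stride, patch=1):
    """Tokens from a non-overlapping patch tokenizer on a CNN feature map."""
    feat_h, feat_w = height // stride, width // stride
    return (feat_h // patch) * (feat_w // patch)


K = 64  # assumed number of cluster centers per scale
STRIDES = {"scale 4": 16, "scale 5": 32}  # standard ResNet strides

for res in (224, 384, 512):
    for name, stride in STRIDES.items():
        n_patch = patch_token_count(res, res, stride)
        print(f"{res}x{res} {name}: patch tokens = {n_patch}, "
              f"Sinkhorn tokens = {K}")
```

The patch token count grows quadratically with the side length of the image, and the cost of self-attention grows quadratically with the token count, whereas the clustering-based tokenizer always hands K tokens to the ViT.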

Effect of ℓ 2 normalization
Within the Sinkhorn-Knopp algorithm is the calculation of the cosine similarities between cluster centers and input features, V. This step requires an ℓ 2 normalization of all cluster centers and input features in order to yield output values between −1 and 1. We investigate the effect of skipping this normalization step, see Table 7. We see that the metric performance clearly drops when the features are not normalized onto the unit D-sphere. We can therefore conclude the normalization step is crucial for the Sinkhorn tokenizer.
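A minimal NumPy sketch of the similarity computation with and without the normalization is given below. The function name, array shapes, and the small constant guarding against division by zero are assumptions made for this illustration.

```python
import numpy as np


def similarity(C, T, normalize=True, tiny=1e-12):
    """Similarity V between cluster centers C (K, D) and tokens T (N, D).

    With normalize=True every row is projected onto the unit sphere first,
    so V holds cosine similarities in [-1, 1]. Without it, V holds raw dot
    products whose magnitude depends on the feature norms.
    """
    if normalize:
        C = C / (np.linalg.norm(C, axis=1, keepdims=True) + tiny)
        T = T / (np.linalg.norm(T, axis=1, keepdims=True) + tiny)
    return C @ T.T


rng = np.random.default_rng(0)
C = rng.normal(size=(4, 8))          # toy cluster centers
T = 10.0 * rng.normal(size=(6, 8))   # toy input features with large norms
print(np.abs(similarity(C, T)).max())                   # bounded by 1
print(np.abs(similarity(C, T, normalize=False)).max())  # unbounded
```

Because the similarities are divided by ∊ and exponentiated inside Sinkhorn-Knopp, unbounded dot products can make the initial assignments far more peaked than bounded cosine similarities would.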

Effect of shared Sinkhorn tokenizer
Inspired by the Perceiver papers [82,83], we investigate the performance when sharing the tokenizer cluster centers and linear projection weights, see Table 8. We find that when sharing the tokenizer parameters, the performance decreases by nearly a half percentage point. This is expected as the same cluster centers have to meaningfully represent CNN features from all considered scales, even though the CNN features are hierarchical in nature.

Comparison of attention mechanisms and tokenizers
We investigate whether the Sinkhorn tokenizer leads to improvements compared to the standard non-overlapping tokenizer from Dosovitskiy et al. [13], as well as the effect of the attention mechanism, see Table 9. Fourier and MHSA refer to the blocks used in the FNet and Transformer models without the per-token MLPs, respectively. The patch-based tokenizer uses a kernel size and stride of P = 1 for both scales. We observe that the Sinkhorn tokenizer outperforms the conventional patch tokenizer on all attention mechanisms, and that the inverted bottleneck yields little benefit in all cases but the Sinkhorn tokenizer combined with FNet. This shows a clear benefit from the clustering-based Sinkhorn tokenizer.

Fig. 6. Effect of image resolution on throughput. We compare the training and inference throughput for the Sinkhorn and patch tokenizers across commonly used image resolutions. The Sinkhorn tokenizer consistently achieves a higher throughput than the conventional patch tokenizer. Throughput is measured only for the MSHViT extension, as the backbone processing time is simply an offset.

Effect of multi-scale approach
In order to determine the effect of the multi-scale approach, we compare the performance when using different scales and different ranges of the cross-scale connections j. Specifically, we compare using subsets of the scales 2-5 of the ResNet architecture, i.e., all but the convolutional stem scale, as well as cross-scale connections with j = i − 1, where only the previous scale is relevant, or j set equal to the initial scale. The comparison is listed in Table 10, where it is clear that a multi-scale approach outperforms the classic single-scale HViT architecture, and that using too many scales diminishes the performance.

Comparison of cross-scale connections
A key part of the MSHViT architecture is the multi-scale connections which enable information sharing across scales. Three variations are presented in Eqs. (8)-(10), and compared in Table 11. We also compare against a scenario with no cross-scale information sharing between the ViTs, instead using a late-stage scale-fusion step. The late-stage fusion step combines the CLS tokens from each scale together with a learnable cross-scale CLS token, using an MHSA operation with 8 heads. We find that all cross-scale connections outperform the late-stage scale-fusion variation and that using the ViT or linearly embedded CNN features leads to a decrease in metric performance. Instead, the best performance is achieved by sharing the clustered tokens from the Sinkhorn tokenizer across scales, indicating that the clustering process is crucial for performance. We also compare sharing weights for the ViTs when applicable, and find that sharing ViT weights results in a clear performance benefit, unlike when sharing weights and cluster centers in the tokenizer (see Section 5.5).

Effect of ViT hyperparameters
Lastly, we investigate the effect of varying the hyperparameters of the ViT. Specifically, we investigate the effect of the token dimensionality, D, the MLP ratio, r, in the inverted bottleneck, and the depth of the ViT, L. The effect on the metrics is reported in Tables 12-14, together with the number of trainable parameters in the MSHViT extension, #P. From these results we observe a clear decrease in metric performance when increasing the token dimensionality D, as well as when the ViT is too shallow or too deep. For the MLP ratio we observe that the best performance is achieved when r = 4, with performance in general decreasing when lowering r as the inverted bottleneck becomes narrower.

Sinkhorn tokenizer cluster visualizations
We visualize the cluster assignments within the Sinkhorn Tokenizer of the ResNet-50-MSHViT model to get a better understanding of how the non-local features are combined. For each cluster k we get the probability for each pixel that the pixel belongs to cluster k. We then visualize this map using a "PARULA" color mapping, where the mapping ranges from the minimum to maximum probability assignment. The PARULA color mapping maps the lowest value to blue and the largest value to yellow, with teal as the intermediate color.

Table 11 Comparison of cross-scale mechanisms.
Comparison of metric performance when using a late-stage scale fusion step or cross-scale mechanism S (Eq. (7)) using either CNN (Eq. (8)), Sinkhorn (Eq. (9)), or ViT (Eq. (10)) features.

In tokenizers where information from previous scales is included, we visualize the clusters by first determining the assignment probability per pixel for the scale in focus.Then, for each cluster center from the previous scales we normalize the cluster assignments such that the maximum value is one.The cluster assignments are then multiplied by the assignment probability from the current scale cluster center and added to the overall assignment map.Lastly, the combined probability map is colored with a PARULA color mapping as before.
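The construction of the combined assignment map can be sketched as follows. This is one possible reading of the procedure and not the plotting code behind Fig. 7: the function names are hypothetical, and it is assumed that the previous-scale maps have already been resized to the resolution of the current scale.

```python
import numpy as np


def minmax(x):
    """Rescale to [0, 1] so a color map spans the minimum to the maximum."""
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo) if hi > lo else np.zeros_like(x)


def combined_map(cur_pixel_probs, prev_maps, prev_token_probs):
    """Combined assignment map for one cluster center of the current scale.

    cur_pixel_probs  : (H, W) probability that each pixel of the current
                       scale belongs to the cluster in focus.
    prev_maps        : (J, H, W) assignment maps of J previous-scale
                       clusters, assumed to be resized to (H, W).
    prev_token_probs : (J,) probability that each previous-scale cluster
                       token is assigned to the cluster in focus.
    """
    total = cur_pixel_probs.astype(float).copy()
    for prev_map, prob in zip(prev_maps, prev_token_probs):
        peak = prev_map.max()
        if peak > 0:
            # Normalize the previous-scale map to a maximum of one, then
            # weight it by its assignment probability at the current scale.
            total += prob * (prev_map / peak)
    return minmax(total)
```

The returned array lies in [0, 1] and can be passed to any sequential color map that runs from blue through teal to yellow.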
Examples are shown in Fig. 7. From these examples it is clear that not only does the Sinkhorn Tokenizer lead to non-local interactions, but it also captures the different scales of the defects. This is exemplified by the highlighting of the multi-scale cracks in the top example of Fig. 7 and the displaced pipe in the bottom example of Fig. 7. We observe that the clusters capture parts of the same regions, but in different contexts, such as one cluster center capturing a crack running along the pipe wall while another cluster center captures a cross section of the pipe.

Limitations and practical use
We have demonstrated that the proposed MSHViT framework improves the sewer defect classification performance, while needing less information in the training process compared to the previously best method, the CT-GAT [37]. However, the proposed method is not yet in a state where it can be used to fully automate sewer inspections, due to poor performance on defect classes such as the very important cracks, breaks, and collapses (RB) class, as shown in Fig. 2a. This is, however, true for the entire sewer defect classification field, as demonstrated by the low performance of all methods compared by Haurum and Moeslund [16]. Instead, it is more plausible that the MSHViT framework can be used as an assistive tool during the inspection process, providing defect predictions to the sewer inspectors, who can then choose to use, adapt, or reject the proposed classifications. Furthermore, similar to prior work from Yang et al. [28] and Dang et al. [29], the MSHViT framework has an extra benefit in that the cluster assignment maps can be used to relate the output predictions to the input image, as demonstrated in Section 6. This makes the automatic classification process less opaque to the sewer inspector, and may help reduce variability in sewer inspections [84,85]. The proposed framework is also limited in that it has only been evaluated for frame-level recognition of sewer defects, and not dense recognition tasks such as object- and instance-level recognition of sewer defects. This choice has primarily been motivated by the lack of publicly available sewer inspection data with object, semantic, or instance annotations. However, it would be possible to extend the MSHViT framework for dense sewer defect recognition by utilizing an upsampling framework [86,87], where the soft assignment scores Q* can be used to reverse the Sinkhorn-Knopp clustering step. This is, however, left for future work.

Conclusions
Vision Transformers (ViTs) have taken the computer vision domain by storm, and led to a surge in Transformer-focused research. A large part of this research focuses on exclusively using a Transformer-based architecture, while in comparison little attention has been given to the fusion of CNNs and Transformers.
In this paper, we presented the Multi-Scale Hybrid Vision Transformer (MSHViT) for image classification, a natural extension of the Hybrid Vision Transformer (HViT) which combines CNNs and ViTs, and the Sinkhorn Tokenizer, a clustering-based tokenizer based on Sinkhorn distances. The MSHViT extension enables the model to learn multi-scale non-local spatial semantics in the input, while the Sinkhorn tokenizer produces a smaller set of tokens that captures non-local spatial semantics.
We investigated the performance difference when extending ResNets with MSHViT and the Sinkhorn tokenizer on the Sewer-ML multi-label sewer defect classification dataset, demonstrating an improvement of up to 2.53 percentage points. Through an extensive ablation study, we provided insights into the sensitivity of the introduced hyperparameters, verifying that the multi-scale extension outperforms regular HViTs, as well as qualitatively showing how the Sinkhorn tokenizer cluster centers capture distinct spatial semantics from one another.
While the focus of this work has been on the sewer defect classification task, the MSHViT framework can in the future be extended to more dense recognition tasks such as defect detection and segmentation, by following commonly used upsampling-based approaches. However, this has been left for future work, due to the lack of publicly available datasets for sewer defect detection and segmentation. We hope that this work will inspire future work in the sewer defect classification area.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix B. Sewer-ML dataset overview
Sewer-ML is the world's first and only publicly available sewer defect image recognition dataset, presented by Haurum and Moeslund [16]. The dataset was constructed from 75,618 annotated sewer inspection videos obtained over 9 years from three different Danish water utilities. Each video was annotated by a professionally licensed sewer inspector following the Danish sewer inspection standard [88]. We refer to Haurum [89] for an introduction to this standard. The sewer inspectors annotated the videos by assigning a frame-level annotation of a specific defect at a specific time. Using a set of heuristic rules, 1.3 million images were extracted, all text was redacted using an automated pipeline, and multi-label ground truth labels were constructed based on spatial proximity of the annotations. A comprehensive breakdown of the sewer and data properties can be found in the Supplementary materials of the Sewer-ML paper [16].
Classes and Class Importance Weighting. Following the Danish sewer inspection standard [88], there are a total of 18 named defect classes, each with a score representing the economic consequence of the class [90], see Table B.16. Haurum and Moeslund normalized these scores into the range [0, 1] to create a "class-importance weight" (CIW), representing the economic importance of each defect class. It should be noted that the Water Level (VA) class was excluded as an explicit class in the experiments, as it was continuously defined throughout the videos. Instead, it has since been treated as a separate classification task by Haurum et al. [37,38]. Lastly, in order to represent the non-defective segment of sewer pipes, the implicit "Normal" class was introduced, evaluated by the lack of classification of any of the 18 annotated sewer defect classes.
Evaluation Protocol. In the survey conducted by Haurum and Moeslund [2], it was determined that there has been no consensus on how to evaluate sewer defect recognition systems. A commonly used metric has been the accuracy metric, often used in the general computer vision domain. However, this is a poor metric for imbalanced datasets as well as multi-label datasets, such as the Sewer-ML dataset. Therefore, Haurum and Moeslund [16] proposed to evaluate the model performance using two metrics based on the Fβ metric [91], while incorporating domain knowledge:
$$F_\beta = (1 + \beta^2) \frac{\mathrm{Prc} \cdot \mathrm{Rcll}}{\beta^2 \cdot \mathrm{Prc} + \mathrm{Rcll}} \qquad \text{(B.1)}$$
where Prc and Rcll are the precision and recall of the classifier, respectively, and β is a weighting of recall, such that recall is considered β times more important than precision.
A key insight made by Haurum and Moeslund was that in the sewer inspection process false negatives have a larger economic impact than false positives. This is due to false positives being verified by human inspectors before initiating a rehabilitation process, whereas false negatives allow defective pipes to further degrade. The second key insight was that the different defects do not have the same importance, as some have a larger economic impact, see Table B.16. These domain insights were incorporated into the defect evaluation metric F2 CIW by setting β = 2, meaning the recall is weighted higher than the precision, and by weighting the class F2 scores by their CIW scores, see Eq. (B.2).
$$F2_{\mathrm{CIW}} = \frac{\sum_{c=1}^{C} F2_c \cdot \mathrm{CIW}_c}{\sum_{c=1}^{C} \mathrm{CIW}_c} \qquad \text{(B.2)}$$
where CIW_c and F2_c are the CIW and F2-score for class c, respectively, and C is the number of annotated classes.
In order to evaluate the normal pipes, which have a CIW of 0 and are therefore not included in F2 CIW, Haurum and Moeslund proposed to simply use the F1 score, denoted as F1 Normal.
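The two metrics can be sketched in a few lines of Python. The sketch assumes that the CIW-weighted sum of per-class F2 scores is normalized by the sum of the CIW weights, and the precision, recall, and CIW values in the usage example are hypothetical and not taken from Table B.16.

```python
def f_beta(precision, recall, beta):
    """F-beta score; recall counts beta times as much as precision."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


def f2_ciw(precisions, recalls, ciw):
    """CIW-weighted mean of the per-class F2 scores.

    Assumption: the weighted sum is normalized by the sum of the weights.
    """
    f2 = [f_beta(p, r, beta=2.0) for p, r in zip(precisions, recalls)]
    return sum(w * f for w, f in zip(ciw, f2)) / sum(ciw)


# Hypothetical per-class values for three defect classes.
precisions = [0.60, 0.80, 0.50]
recalls = [0.90, 0.70, 0.40]
ciw = [1.00, 0.55, 0.10]
print(f2_ciw(precisions, recalls, ciw))

# F1 Normal is the ordinary F1 score (beta = 1) of the implicit Normal class.
print(f_beta(0.9, 0.8, beta=1.0))
```

A class with a CIW of zero, such as the implicit Normal class, contributes nothing to the weighted score, which is why it is reported separately.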

Fig. 1. System overview. (Top) A CNN backbone returns feature maps from a subset of the internal scales in the CNN. The feature maps from each scale are first tokenized and then processed by a weight-shared ViT. The information from previous scales is propagated forward to the next scale, shown in the figure by forwarding the Sinkhorn tokenizer output to the next scale as per Eq. (9). (Bottom) The Sinkhorn tokenizer reduces the number of tokens by first measuring the cosine similarity, V, between all input tokens T p and cluster centers C. The Sinkhorn distances [6] are then computed by applying Sinkhorn-Knopp for t SK iterations, resulting in the soft assign-

Fig. 3. Examples of classifications with MSHViT. Example cases where the MSHViT model correctly classifies all classes as well as misclassifies some classes. The class codes are described in the original Sewer-ML paper [16]. Incorrect predictions are shown in red.

Fig. 4. Comparison of metric performance and efficiency. We compare the performance of the models in Table 3 against the parameter count of each model as well as the throughput of the models in images per second (img/s) during training and inference. MSHViT variants are linked to their baseline variant by a dotted line.

Fig. 5. Effect of t SK on throughput. Comparison of the training and inference throughput at different numbers of iterations in the Sinkhorn tokenizer, t SK. Training and inference throughput are also shown for the conventional patch tokenizer. Note that the reported throughput differs from Fig. 4, as only the processing time of the MSHViT extension is reported. The backbone processing time has been excluded, as it is simply a constant offset along the y-axis.

Fig. 7. Visualization of the Sinkhorn Tokenizer clusters. We show a subset of the cluster assignments for two images using the ResNet-50-MSHViT model. The first image contains the classes cracks, breaks, and collapses (RB), displaced joint (FS), and branch pipe (GR), and the second image contains the classes surface damage (OB), displaced joint (FS), and connection with construction changes (OK). For each image, two rows of cluster assignment map examples are shown along the columns. The top row shows six examples from the 4th scale clusters, whereas the bottom row shows six examples from the 5th scale clusters. See the description of the computation of the cluster assignment maps in Section 6.

Table 1
Detailed training procedures. We follow the training procedures of Haurum et al. [37] with the addition of utilizing model EMA.

Table 2
Hyperparameters. Overview of all searched hyperparameters, with the investigated values as well as the initial and final values.

both the baseline and MSHViT extension. Not only does the ResNet-34 baseline achieve the best performance out of the ResNet networks, it also either outperforms or matches the ResNet-101 backbone when applying the MSHViT extension. For the TResNet architectures we observe that the improvement gained by adding the MSHViT extension is smaller than that for the ResNet backbones. This is most likely due to the SE layers in the TResNet model, which means the TResNet already includes some attention-based mechanisms. However, it is clear that the MSHViT extension is still beneficial.

Table 4
Effect of ∊. Comparison of different entropic regularization values in the Sinkhorn tokenizer.

Table 5 Effect of t SK.
Comparison of number of iterations in the Sinkhorn tokenizer.

Table 6 Effect of number of cluster centers.
Comparison of metric performance when varying the number of cluster centers K in the Sinkhorn tokenizer.

Table 7 Effect of ℓ 2 normalization.
Comparison of performance when ℓ 2 normalizing the cluster centers C and input features T p, before computing the similarity scores V.

Table 8 Effect of sharing tokenizer.
Comparison of metric performance when sharing tokenizer cluster centers.

Table 9 Effect of tokenizer and attention mechanism.
Comparison of metric performance when using the standard non-overlapping patch tokenizer and the Sinkhorn tokenizer.#P indicates the number of trainable parameters in the MSHViT head in millions.

Table 10 Effect of using different scales.
Comparison of metric performance when using different scales and different cross-scale sharing range j.

Table 12 Effect of token dimensionality D.
We see that increasing the token dimensionality leads to poorer performance.

Table 13 Effect of MLP ratio r.
We see that increasing the MLP ratio in general leads to better performance.

Table 14 Effect of depth of the ViTs L.
We observe that increasing or decreasing the depth of the ViTs leads to poorer performance, with the best performance obtained when L = 2.