Sunday, August 14, 2022

Revisiting Mask Transformer from a Clustering Perspective

Panoptic segmentation is a computer vision problem that serves as a core task for many real-world applications. Due to its complexity, previous work often divides panoptic segmentation into semantic segmentation (assigning semantic labels, such as “person” and “sky”, to every pixel in an image) and instance segmentation (identifying and segmenting only countable objects, such as “pedestrians” and “cars”, in an image), and further divides it into several sub-tasks. Each sub-task is processed individually, and extra modules are applied to merge the results from each sub-task stage. This process is not only complex, but it also introduces many hand-designed priors when processing sub-tasks and when combining the results from different sub-task stages.

Recently, inspired by Transformer and DETR, an end-to-end solution for panoptic segmentation with mask transformers (an extension of the Transformer architecture that is used to generate segmentation masks) was proposed in MaX-DeepLab. This solution adopts a pixel path (consisting of either convolutional neural networks or vision transformers) to extract pixel features, a memory path (consisting of transformer decoder modules) to extract memory features, and a dual-path transformer for interaction between pixel features and memory features. However, the dual-path transformer, which utilizes cross-attention, was originally designed for language tasks, where the input sequence consists of dozens or hundreds of words. In vision tasks, and segmentation problems in particular, the input sequence consists of tens of thousands of pixels, which is not only a much larger input scale but also a lower-level embedding than language words.

In “CMT-DeepLab: Clustering Mask Transformers for Panoptic Segmentation”, presented at CVPR 2022, and “kMaX-DeepLab: k-means Mask Transformer”, to be presented at ECCV 2022, we propose to reinterpret and redesign cross-attention from a clustering perspective (i.e., grouping pixels with the same semantic labels together), which better adapts to vision tasks. CMT-DeepLab is built upon the previous state-of-the-art method, MaX-DeepLab, and employs a pixel clustering approach to perform cross-attention, leading to a denser and more plausible attention map. kMaX-DeepLab further redesigns cross-attention to be more like a k-means clustering algorithm, with a simple change to the activation function. We demonstrate that CMT-DeepLab achieves significant performance improvements, while kMaX-DeepLab not only simplifies the modification but also further pushes the state-of-the-art by a large margin, without test-time augmentation. We are also excited to announce the open-source release of kMaX-DeepLab, our best performing segmentation model, in the DeepLab2 library.


Instead of directly applying cross-attention to vision tasks without modifications, we propose to reinterpret it from a clustering perspective. Specifically, we note that the mask Transformer object queries can be considered cluster centers (which aim to group pixels with the same semantic labels), and the process of cross-attention is similar to the k-means clustering algorithm, which adopts an iterative process of (1) assigning pixels to cluster centers, where multiple pixels can be assigned to a single cluster center, and some cluster centers may have no assigned pixels, and (2) updating the cluster centers by averaging pixels assigned to the same cluster center (the cluster centers will not be updated if no pixel is assigned to them).

In CMT-DeepLab and kMaX-DeepLab, we reformulate the cross-attention from the clustering perspective, which consists of iterative cluster-assignment and cluster-update steps.
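
The two k-means steps described above can be sketched in a few lines of plain Python. This is a minimal illustration of the classic algorithm only, not the model's implementation: the function names, the squared Euclidean distance, and the toy 2-D "pixel features" are all assumptions chosen for readability.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_pixels(pixels, centers):
    """Step (1): hard-assign each pixel to its nearest cluster center."""
    return [
        min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
        for p in pixels
    ]


def update_centers(pixels, assignments, centers):
    """Step (2): move each center to the mean of the pixels assigned to it."""
    new_centers = []
    for k, center in enumerate(centers):
        members = [p for p, a in zip(pixels, assignments) if a == k]
        if not members:
            # No pixel was assigned to this center, so it is left unchanged.
            new_centers.append(list(center))
            continue
        new_centers.append(
            [sum(m[d] for m in members) / len(members) for d in range(len(center))]
        )
    return new_centers


def kmeans(pixels, centers, num_iters=10):
    """Alternate assignment and update steps for a fixed number of iterations."""
    for _ in range(num_iters):
        assignments = assign_pixels(pixels, centers)
        centers = update_centers(pixels, assignments, centers)
    return assign_pixels(pixels, centers), centers


# Toy data (hypothetical): four pixels, three centers, one of which is far away.
pixels = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
initial_centers = [[0.0, 0.0], [5.0, 5.0], [100.0, 100.0]]
assignments, centers = kmeans(pixels, initial_centers)
print(assignments)  # [0, 0, 1, 1]
print(centers[2])   # [100.0, 100.0], never updated because it has no pixels
```

Note that several pixels share one center, and the third center receives no pixels and therefore keeps its initial value, matching the two cases called out in the text.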

Given the popularity of the k-means clustering algorithm, in CMT-DeepLab we redesign cross-attention so that the spatial-wise softmax operation (i.e., the softmax operation that is applied along the image spatial resolution) that in effect assigns cluster centers to pixels is instead applied along the cluster centers. In kMaX-DeepLab, we further simplify the spatial-wise softmax to cluster-wise argmax (i.e., applying the argmax operation along the cluster centers). We note that the argmax operation is the same as the hard assignment (i.e., a pixel is assigned to only one cluster) used in the k-means clustering algorithm.
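
To make the difference between the three normalizations concrete, the sketch below applies each of them to a small affinity matrix whose rows are cluster centers and whose columns are pixels. The matrix values, the layout, and the function names are illustrative assumptions; only the choice of axis reflects the description above.

```python
import math


def softmax(values):
    """Numerically stable softmax over a flat list."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(matrix):
    return [list(row) for row in zip(*matrix)]


def spatial_softmax(affinity):
    """Standard cross-attention: normalize each cluster's row over all pixels."""
    return [softmax(row) for row in affinity]


def cluster_softmax(affinity):
    """Softmax along the cluster centers: normalize each pixel's column."""
    return transpose([softmax(col) for col in transpose(affinity)])


def cluster_argmax(affinity):
    """Argmax along the cluster centers: a one-hot column for each pixel."""
    one_hot_columns = []
    for col in transpose(affinity):
        best = max(range(len(col)), key=col.__getitem__)
        one_hot_columns.append([1.0 if k == best else 0.0 for k in range(len(col))])
    return transpose(one_hot_columns)


# affinity[k][p]: similarity between cluster center k and pixel p (made-up values).
affinity = [
    [2.0, 1.5, -1.0, 0.0],
    [0.5, 1.0, 3.0, 0.2],
    [0.1, 0.2, 0.3, 0.1],
]

soft_over_pixels = spatial_softmax(affinity)    # each row sums to 1
soft_over_clusters = cluster_softmax(affinity)  # each column sums to 1
hard_assignment = cluster_argmax(affinity)      # each column is one-hot
for row in hard_assignment:
    print(row)
# [1.0, 1.0, 0.0, 0.0]
# [0.0, 0.0, 1.0, 1.0]
# [0.0, 0.0, 0.0, 0.0]
```

With the spatial-wise softmax every cluster center spreads its attention over all pixels, whereas the cluster-wise versions make the clusters compete for each pixel. In this toy case the third cluster wins no pixel at all, which is exactly the empty-cluster situation that k-means allows.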

Reformulating the cross-attention of the mask transformer from the clustering perspective significantly improves the segmentation performance and simplifies the complex mask transformer pipeline to be more interpretable. First, pixel features are extracted from the input image with an encoder-decoder structure. Then, a set of cluster centers is used to group pixels, and these centers are further updated based on the clustering assignments. Finally, the clustering assignment and update steps are iteratively performed, with the last assignment directly serving as segmentation predictions.

To convert a typical mask Transformer decoder (consisting of cross-attention, multi-head self-attention, and a feed-forward network) into our proposed k-means cross-attention, we simply replace the spatial-wise softmax with cluster-wise argmax.
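
The iterative assignment and update loop can be sketched as a stripped-down attention step. This is a simplified illustration under stated assumptions, not the released DeepLab2 code: the learned query, key, and value projections, the self-attention, and the feed-forward network are all omitted, the cluster centers are used directly as queries, the pixel features serve as both keys and values, and the residual form of the update is assumed from the usual transformer decoder layout. The function name `kmeans_cross_attention` is hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def kmeans_cross_attention(centers, pixel_features):
    """One simplified k-means cross-attention step.

    Returns the updated centers and the per-pixel hard assignment.
    """
    num_clusters = len(centers)
    # affinity[k][p]: dot-product similarity between center k and pixel p.
    affinity = [[dot(c, f) for f in pixel_features] for c in centers]

    # Cluster-wise argmax: each pixel picks exactly one cluster center.
    assignment = [
        max(range(num_clusters), key=lambda k: affinity[k][p])
        for p in range(len(pixel_features))
    ]

    # Residual update: add the features of the assigned pixels to each center.
    updated = []
    for k, center in enumerate(centers):
        new_center = list(center)
        for p, feature in enumerate(pixel_features):
            if assignment[p] == k:
                new_center = [c + v for c, v in zip(new_center, feature)]
        updated.append(new_center)
    return updated, assignment


# Toy inputs (hypothetical): two cluster centers and four pixel features.
centers = [[1.0, 0.0], [0.0, 1.0]]
pixel_features = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.7], [0.1, 1.0]]

# Repeat the step a few times; the last assignment is the segmentation.
for _ in range(3):
    centers, assignment = kmeans_cross_attention(centers, pixel_features)

print(assignment)  # [0, 0, 1, 1]
```

A center that wins no pixels receives an empty sum and is carried through unchanged by the residual term, mirroring the k-means behaviour described earlier.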

The meta architecture of our proposed kMaX-DeepLab consists of three components: pixel encoder, enhanced pixel decoder, and kMaX decoder. The pixel encoder is any network backbone, used to extract image features. The enhanced pixel decoder includes transformer encoders to enhance the pixel features, and upsampling layers to generate higher resolution features. The series of kMaX decoders transform cluster centers into (1) mask embedding vectors, which multiply with the pixel features to generate the predicted masks, and (2) class predictions for each mask.

The meta architecture of kMaX-DeepLab.
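
The last stage, where mask embedding vectors are multiplied with pixel features, amounts to a dot product between every mask embedding and every pixel feature. The sketch below shows only that idea, with a per-pixel argmax as a stand-in for the final merging step. The function names, the toy values, the two-class label set, and the merging rule are assumptions; the actual model's post-processing may differ.

```python
def predict_mask_logits(mask_embeddings, pixel_features):
    """logits[n][p] = dot product of mask embedding n and pixel feature p."""
    return [
        [sum(e * f for e, f in zip(embedding, feature)) for feature in pixel_features]
        for embedding in mask_embeddings
    ]


def label_pixels(mask_logits, class_logits, class_names):
    """Give each pixel the (class name, mask id) of its highest-scoring mask."""
    mask_classes = [
        class_names[max(range(len(row)), key=row.__getitem__)] for row in class_logits
    ]
    labels = []
    for p in range(len(mask_logits[0])):
        best_mask = max(range(len(mask_logits)), key=lambda n: mask_logits[n][p])
        labels.append((mask_classes[best_mask], best_mask))
    return labels


# Hypothetical toy inputs: two masks, four pixels (a flattened 2x2 image).
mask_embeddings = [[1.0, -1.0], [-1.0, 1.0]]
pixel_features = [[2.0, 0.0], [1.5, 0.5], [0.0, 1.0], [0.2, 2.0]]
class_names = ["sky", "person"]
class_logits = [[0.1, 2.0], [3.0, 0.5]]  # one row of class scores per mask

mask_logits = predict_mask_logits(mask_embeddings, pixel_features)
labels = label_pixels(mask_logits, class_logits, class_names)
print(labels)
# [('person', 0), ('person', 0), ('sky', 1), ('sky', 1)]
```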


We evaluate CMT-DeepLab and kMaX-DeepLab using the panoptic quality (PQ) metric on two of the most challenging panoptic segmentation datasets, COCO and Cityscapes, against MaX-DeepLab and other state-of-the-art methods. CMT-DeepLab achieves significant performance improvement, while kMaX-DeepLab not only simplifies the modification but also further pushes the state-of-the-art by a large margin, with 58.0% PQ on COCO val set, and 68.4% PQ, 44.0% mask Average Precision (mask AP), 83.5% mean Intersection-over-Union (mIoU) on Cityscapes val set, without test-time augmentation or using an external dataset.
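
For readers unfamiliar with PQ, the metric as commonly defined divides the summed IoU of matched segments by the number of true positives plus half the false positives and half the false negatives, where a predicted and a ground-truth segment match only if their IoU exceeds 0.5. The sketch below computes it for a single class from made-up numbers. The function name and inputs are illustrative, and official evaluation tools additionally average the score over classes.

```python
def panoptic_quality(matched_ious, num_false_positives, num_false_negatives):
    """PQ for one class.

    matched_ious: IoU of every matched (true positive) segment pair.
    """
    if any(iou <= 0.5 for iou in matched_ious):
        raise ValueError("a match requires IoU strictly greater than 0.5")
    denominator = (
        len(matched_ious) + 0.5 * num_false_positives + 0.5 * num_false_negatives
    )
    if denominator == 0:
        return 0.0  # no segments at all; returning 0 is a convention of this sketch
    return sum(matched_ious) / denominator


# Hypothetical example: three matched segments, one spurious, one missed.
pq = panoptic_quality([0.9, 0.8, 0.7], num_false_positives=1, num_false_negatives=1)
print(round(pq, 3))  # 0.6
```

The score rewards both accurate masks (high IoU for matches) and correct detection (few unmatched segments), which is why it is used as a single summary number for panoptic segmentation.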

[Table not recovered: Comparison on COCO val set.]

Method              PQ               mask AP          mIoU
Panoptic-DeepLab    63.0% (-5.4%)    35.3% (-8.7%)    80.5% (-3.0%)
Axial-DeepLab       64.4% (-4.0%)    36.7% (-7.3%)    80.6% (-2.9%)
SWideRNet           66.4% (-2.0%)    40.1% (-3.9%)    82.2% (-1.3%)
kMaX-DeepLab        68.4%            44.0%            83.5%

Comparison on Cityscapes val set. Values in parentheses are differences relative to kMaX-DeepLab.

Designed from a clustering perspective, kMaX-DeepLab not only achieves higher performance but also yields a more plausible visualization of the attention map, which helps explain its working mechanism. In the example below, kMaX-DeepLab iteratively performs clustering assignments and updates, which gradually improves mask quality.

kMaX-DeepLab’s attention map can be directly visualized as a panoptic segmentation, which gives better plausibility for the model's working mechanism (image credit: coco_url, and license).


We have demonstrated a way to better design mask transformers for vision tasks. With simple modifications, CMT-DeepLab and kMaX-DeepLab reformulate cross-attention to be more like a clustering algorithm. As a result, the proposed models achieve state-of-the-art performance on the challenging COCO and Cityscapes datasets. We hope that the open-source release of kMaX-DeepLab in the DeepLab2 library will facilitate future research on designing vision-specific transformer architectures.


We are grateful for the valuable discussion and support from Huiyu Wang, Dahun Kim, Siyuan Qiao, Maxwell Collins, Yukun Zhu, Florian Schroff, Hartwig Adam, and Alan Yuille.


