Revisiting Mask Transformer from a Clustering Perspective


Panoptic segmentation is a computer vision problem that serves as a core task for many real-world applications. Due to its complexity, previous work often divides panoptic segmentation into semantic segmentation (assigning semantic labels, such as "person" and "sky", to every pixel in an image) and instance segmentation (identifying and segmenting only countable objects, such as "pedestrians" and "cars", in an image), and further divides it into several sub-tasks. Each sub-task is processed individually, and extra modules are applied to merge the results from each sub-task stage. This process is not only complex, but also introduces many hand-designed priors when processing sub-tasks and when combining the results from different sub-task stages.

Recently, inspired by Transformer and DETR, an end-to-end solution for panoptic segmentation with mask transformers (an extension of the Transformer architecture that is used to generate segmentation masks) was proposed in MaX-DeepLab. This solution adopts a pixel path (consisting of either convolutional neural networks or vision transformers) to extract pixel features, a memory path (consisting of transformer decoder modules) to extract memory features, and a dual-path transformer for interaction between pixel features and memory features. However, the dual-path transformer, which utilizes cross-attention, was originally designed for language tasks, where the input sequence consists of dozens or hundreds of words. In vision tasks, specifically segmentation problems, the input sequence consists of tens of thousands of pixels, which not only indicates a much larger input scale, but also represents lower-level embeddings compared to language words.

In "CMT-DeepLab: Clustering Mask Transformers for Panoptic Segmentation", presented at CVPR 2022, and "kMaX-DeepLab: k-means Mask Transformer", to be presented at ECCV 2022, we propose to reinterpret and redesign cross-attention from a clustering perspective (i.e., grouping pixels with the same semantic labels together), which better adapts to vision tasks. CMT-DeepLab is built upon the previous state-of-the-art method, MaX-DeepLab, and employs a pixel clustering approach to perform cross-attention, leading to a denser and more plausible attention map. kMaX-DeepLab further redesigns cross-attention to be more like a k-means clustering algorithm, with a simple change to the activation function. We demonstrate that CMT-DeepLab achieves significant performance improvements, while kMaX-DeepLab not only simplifies the modification but also further pushes the state-of-the-art by a large margin, without test-time augmentation. We are also excited to announce the open-source release of kMaX-DeepLab, our best performing segmentation model, in the DeepLab2 library.


Instead of directly applying cross-attention to vision tasks without modifications, we propose to reinterpret it from a clustering perspective. Specifically, we note that the mask Transformer object queries can be considered cluster centers (which aim to group pixels with the same semantic labels), and the process of cross-attention is similar to the k-means clustering algorithm, which adopts an iterative process of (1) assigning pixels to cluster centers, where multiple pixels can be assigned to a single cluster center and some cluster centers may have no assigned pixels, and (2) updating the cluster centers by averaging the pixels assigned to the same cluster center (cluster centers are not updated if no pixel is assigned to them).
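The iterative assignment and update steps described above can be sketched in NumPy. This is a minimal illustration of the plain k-means view of cross-attention, not the papers' implementation; the function name and toy shapes are our own.

```python
import numpy as np

def kmeans_step(pixel_feats, centers):
    """One k-means iteration: hard-assign pixels to centers, then
    update each center as the mean of its assigned pixels.

    pixel_feats: (N, D) array of N pixel features.
    centers:     (K, D) array of K cluster centers (object queries).
    """
    # (1) Assignment: each pixel goes to its most similar center
    # (here: highest dot-product similarity, argmax over centers).
    similarity = pixel_feats @ centers.T       # (N, K)
    assignment = similarity.argmax(axis=1)     # (N,)

    # (2) Update: average the pixels assigned to each center;
    # centers with no assigned pixels are left unchanged.
    new_centers = centers.copy()
    for k in range(centers.shape[0]):
        members = pixel_feats[assignment == k]
        if len(members) > 0:
            new_centers[k] = members.mean(axis=0)
    return assignment, new_centers
```

Running this step repeatedly mirrors the cluster-assignment and cluster-update iterations of the mask transformer decoder.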

In CMT-DeepLab and kMaX-DeepLab, we reformulate cross-attention from the clustering perspective, which consists of iterative cluster-assignment and cluster-update steps.

Given the popularity of the k-means clustering algorithm, in CMT-DeepLab we redesign cross-attention so that the spatial-wise softmax operation (i.e., the softmax operation applied along the image spatial resolution) that in effect assigns cluster centers to pixels is instead applied along the cluster centers. In kMaX-DeepLab, we further simplify the spatial-wise softmax to cluster-wise argmax (i.e., applying the argmax operation along the cluster centers). We note that the argmax operation is the same as the hard assignment (i.e., a pixel is assigned to only one cluster) used in the k-means clustering algorithm.
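The contrast between the two normalizations can be made concrete. Below is a hedged NumPy sketch, assuming attention logits of shape (K clusters, N pixels); the function names are ours and the real models operate on learned query/key projections.

```python
import numpy as np

def spatial_softmax(logits):
    """Typical cross-attention: softmax over the N spatial positions
    (the pixel axis), separately for each cluster center / query."""
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)    # each row sums to 1

def cluster_argmax(logits):
    """kMaX-style attention: hard argmax over the K cluster centers,
    i.e. each pixel is assigned to exactly one center (one-hot column)."""
    hard = np.zeros_like(logits)
    hard[logits.argmax(axis=0), np.arange(logits.shape[1])] = 1.0
    return hard
```

With the cluster-wise argmax, the attention map itself is a hard pixel-to-cluster assignment, matching the k-means view.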

Reformulating the cross-attention of the mask transformer from the clustering perspective significantly improves the segmentation performance and simplifies the complex mask transformer pipeline into a more interpretable one. First, pixel features are extracted from the input image with an encoder-decoder structure. Then, a set of cluster centers is used to group pixels, and the centers are further updated based on the clustering assignments. Finally, the clustering assignment and update steps are performed iteratively, with the last assignment directly serving as the segmentation prediction.

To convert a typical mask Transformer decoder (consisting of cross-attention, multi-head self-attention, and a feed-forward network) into our proposed k-means cross-attention, we simply substitute the spatial-wise softmax with cluster-wise argmax.

The meta architecture of our proposed kMaX-DeepLab consists of three components: the pixel encoder, the enhanced pixel decoder, and the kMaX decoder. The pixel encoder is any network backbone used to extract image features. The enhanced pixel decoder includes transformer encoders to enhance the pixel features, and upsampling layers to generate higher-resolution features. The series of kMaX decoders transform cluster centers into (1) mask embedding vectors, which are multiplied with the pixel features to generate the predicted masks, and (2) class predictions for each mask.
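The two outputs of the kMaX decoders can be sketched with toy shapes. This is an illustrative NumPy sketch only: the shapes are far smaller than in the real model, and `class_head` is a hypothetical linear classifier standing in for the learned prediction head.

```python
import numpy as np

# Toy shapes; real models use on the order of 128 queries and tens of
# thousands of pixels, with learned projection heads.
K, D, H, W, C = 4, 8, 16, 16, 3   # queries, channels, height, width, classes
rng = np.random.default_rng(0)

cluster_centers = rng.normal(size=(K, D))   # output of the kMaX decoders
pixel_feats = rng.normal(size=(H * W, D))   # enhanced pixel-decoder output
class_head = rng.normal(size=(D, C))        # hypothetical linear classifier

# (1) Mask embedding vectors multiply with pixel features -> mask logits;
# the per-pixel argmax over queries yields the panoptic mask ids.
mask_logits = pixel_feats @ cluster_centers.T        # (HW, K)
panoptic = mask_logits.argmax(axis=1).reshape(H, W)  # per-pixel mask id

# (2) A class prediction for each mask / cluster center.
class_logits = cluster_centers @ class_head          # (K, C)
class_pred = class_logits.argmax(axis=1)             # (K,)
```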

The meta structure of kMaX-DeepLab.


We evaluate CMT-DeepLab and kMaX-DeepLab using the panoptic quality (PQ) metric on two of the most challenging panoptic segmentation datasets, COCO and Cityscapes, against MaX-DeepLab and other state-of-the-art methods. CMT-DeepLab achieves significant performance improvements, while kMaX-DeepLab not only simplifies the modification but also further pushes the state-of-the-art by a large margin, with 58.0% PQ on the COCO val set, and 68.4% PQ, 44.0% mask Average Precision (mask AP), and 83.5% mean Intersection-over-Union (mIoU) on the Cityscapes val set, without test-time augmentation or an external dataset.

Comparison on COCO val set.

Method              PQ              Mask AP         mIoU
Panoptic-DeepLab    63.0% (-5.4%)   35.3% (-8.7%)   80.5% (-3.0%)
Axial-DeepLab       64.4% (-4.0%)   36.7% (-7.3%)   80.6% (-2.9%)
SWideRNet           66.4% (-2.0%)   40.1% (-3.9%)   82.2% (-1.3%)
kMaX-DeepLab        68.4%           44.0%           83.5%
Comparison on Cityscapes val set.

Designed from a clustering perspective, kMaX-DeepLab not only achieves higher performance but also produces a more plausible visualization of the attention map, which helps explain its working mechanism. In the example below, kMaX-DeepLab iteratively performs clustering assignments and updates, which gradually improve mask quality.

kMaX-DeepLab's attention map can be directly visualized as a panoptic segmentation, which gives better plausibility to the model's working mechanism (image credit: coco_url, and license).


We have demonstrated a way to better design mask transformers for vision tasks. With simple modifications, CMT-DeepLab and kMaX-DeepLab reformulate cross-attention to be more like a clustering algorithm. As a result, the proposed models achieve state-of-the-art performance on the challenging COCO and Cityscapes datasets. We hope that the open-source release of kMaX-DeepLab in the DeepLab2 library will facilitate future research on designing vision-specific transformer architectures.


We are grateful for the valuable discussions and support from Huiyu Wang, Dahun Kim, Siyuan Qiao, Maxwell Collins, Yukun Zhu, Florian Schroff, Hartwig Adam, and Alan Yuille.

