Summary by Sofia Aparicio
What was the context? What was studied already?
Transformer-based networks have shown impressive results in semantic segmentation. Yet for real-time semantic segmentation, pure CNN-based approaches still dominate, due to the time-consuming computation mechanism of the transformer.
What was the objective? How was the data collected?
They proposed RTFormer, an efficient dual-resolution transformer for real-time semantic segmentation.
RTFormer block
A dual-resolution module that inherits the multi-resolution fusion paradigm. It is composed of two types of attention, each with its own feed-forward network, arranged in a stepped layout.
On the low-resolution branch they use a GPU-Friendly Attention (GFA) to capture high-level global context. It is derived from external attention, inheriting its linear complexity, and uses vanilla matrix multiplications instead of batch-wise matrix multiplications. To retain the capability of the multi-head mechanism, they use grouped double normalisation, which allows the model to learn more diverse information.
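To make this concrete, here is a minimal NumPy sketch of external attention with a grouped double normalisation, in the spirit of GFA. This is an illustrative reconstruction, not the paper's implementation: the function name, shapes, and the exact axes of the two normalisation steps are my assumptions.

```python
import numpy as np

def gpu_friendly_attention(x, k_ext, v_ext, groups=2):
    """Sketch of GFA-style attention (assumed form, not the official code).

    x:     (N, d) flattened feature map (N pixels, d channels)
    k_ext: (M, d) learnable external keys
    v_ext: (M, d) learnable external values
    """
    # Plain matmul against shared external keys -- no batch-wise
    # per-head multiplications, hence "GPU friendly".
    attn = x @ k_ext.T                        # (N, M)
    n, m = attn.shape
    attn = attn.reshape(n, groups, m // groups)
    # Double normalisation, applied per group (grouped double norm):
    # 1) softmax over the pixel axis
    attn = np.exp(attn - attn.max(axis=0, keepdims=True))
    attn = attn / attn.sum(axis=0, keepdims=True)
    # 2) L1 normalisation over the external-key axis within each group
    attn = attn / (attn.sum(axis=2, keepdims=True) + 1e-6)
    return attn.reshape(n, m) @ v_ext         # (N, d)
```

Because the keys/values are a fixed-size external memory, the cost is linear in the number of pixels N rather than quadratic as in self-attention.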
In the high-resolution branch, they introduce a cross-resolution attention to broadcast the high-level global context learned by the low-resolution branch to each high-resolution pixel. The stepped layout serves to feed more representative features from the low-resolution branch into the cross-resolution attention.
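A simplified way to picture the cross-resolution step: high-resolution pixels act as queries, while the low-resolution features supply keys and values. The sketch below uses a plain scaled dot-product form for clarity; the paper's exact formulation may differ, and all names here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_resolution_attention(x_high, x_low):
    """Assumed sketch: broadcast low-res global context to high-res pixels.

    x_high: (Nh, d) high-resolution pixels (queries)
    x_low:  (Nl, d) low-resolution features carrying global context
                    (keys and values), with Nl << Nh
    """
    d = x_high.shape[1]
    attn = softmax(x_high @ x_low.T / np.sqrt(d), axis=-1)  # (Nh, Nl)
    # Each high-res pixel aggregates a convex combination of
    # low-res context vectors.
    return attn @ x_low                                      # (Nh, d)
```

Since the attention matrix is only Nh x Nl, this stays cheap even though the high-resolution branch has many pixels.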
They also altered the feed-forward networks, as shown below:
The overall network architecture can be seen below.
What highlights emerged? Any surprises?
RTFormer achieves the best speed-accuracy trade-off among real-time methods. For example, RTFormer-Slim reaches 76.3% mIoU at 110.0 FPS, which is both faster and more accurate than STDC2-Seg75 and DDRNet-23-Slim.