Summary by Sofia Aparicio
What was the context? What was studied already?
Transformer-based networks have shown impressive results in semantic segmentation. Yet for real-time semantic segmentation, pure CNN-based approaches still dominate, due to the time-consuming computation mechanism of the transformer.
What was the objective? How was the data collected?
They proposed RTFormer, an efficient dual-resolution transformer for real-time semantic segmentation.
RTFormer block
A dual-resolution module that inherits the multi-resolution fusion paradigm. It is composed of two types of attention, each with its own feed-forward network, arranged in a stepped layout.
On the low-resolution branch they use a GPU-Friendly Attention (GFA) to capture high-level global context. It is derived from external attention, inheriting its linear complexity, and uses vanilla matrix multiplications instead of batch-wise matrix multiplications. To retain the capability of the multi-head mechanism, they use grouped double normalisation, which allows the model to learn more diverse information.
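To make this concrete, here is a minimal NumPy sketch of external attention with a grouped double normalisation, in the spirit of GFA. This is an illustrative reconstruction, not the paper's implementation: the function name, shapes, and the exact axes of the two normalisation steps are my assumptions.

```python
import numpy as np

def gpu_friendly_attention(x, k_ext, v_ext, groups=2):
    """Sketch of GFA-style attention (assumed form, not the official code).

    x:     (N, d) flattened feature map (N pixels, d channels)
    k_ext: (M, d) learnable external keys
    v_ext: (M, d) learnable external values
    """
    # Plain matmul against shared external keys -- no batch-wise
    # per-head multiplications, hence "GPU friendly".
    attn = x @ k_ext.T                        # (N, M)
    n, m = attn.shape
    attn = attn.reshape(n, groups, m // groups)
    # Double normalisation, applied per group (grouped double norm):
    # 1) softmax over the pixel axis
    attn = np.exp(attn - attn.max(axis=0, keepdims=True))
    attn = attn / attn.sum(axis=0, keepdims=True)
    # 2) L1 normalisation over the external-key axis within each group
    attn = attn / (attn.sum(axis=2, keepdims=True) + 1e-6)
    return attn.reshape(n, m) @ v_ext         # (N, d)
```

Because the keys/values are a fixed-size external memory, the cost is linear in the number of pixels N rather than quadratic as in self-attention.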
In the high-resolution branch, they introduce a cross-resolution attention to broadcast the high-level global context learned by the low-resolution branch to each high-resolution pixel. The stepped layout serves to feed more representative features from the low-resolution branch into the cross-resolution attention.
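A simplified way to picture the cross-resolution step: high-resolution pixels act as queries, while the low-resolution features supply keys and values. The sketch below uses a plain scaled dot-product form for clarity; the paper's exact formulation may differ, and all names here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_resolution_attention(x_high, x_low):
    """Assumed sketch: broadcast low-res global context to high-res pixels.

    x_high: (Nh, d) high-resolution pixels (queries)
    x_low:  (Nl, d) low-resolution features carrying global context
                    (keys and values), with Nl << Nh
    """
    d = x_high.shape[1]
    attn = softmax(x_high @ x_low.T / np.sqrt(d), axis=-1)  # (Nh, Nl)
    # Each high-res pixel aggregates a convex combination of
    # low-res context vectors.
    return attn @ x_low                                      # (Nh, d)
```

Since the attention matrix is only Nh x Nl, this stays cheap even though the high-resolution branch has many pixels.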
They also altered the feed-forward networks, as shown below:
The overall network architecture can be seen below.
What highlights emerged? Any surprises?
RTFormer achieves the best speed-accuracy trade-off among real-time methods. For example, RTFormer-Slim reaches 76.3% mIoU at 110.0 FPS, which is both faster and more accurate than STDC2-Seg75 and DDRNet-23-Slim.