ViT-CoMer is a pre-training-free, performance-enhanced ViT backbone designed to address current limitations of ViT backbones in dense prediction tasks. It introduces spatial pyramid multi-receptive-field convolutional features and a novel CNN-Transformer bidirectional fusion interaction module.
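To make the spatial pyramid idea concrete, the sketch below computes the feature-map shapes and token counts that a multi-scale pyramid would produce for a given input size. It is only an illustration of the bookkeeping, not the authors' implementation: the strides of 8, 16, and 32 and the names `PyramidLevel`, `spatial_pyramid_shapes`, and `total_tokens` are assumptions introduced here and do not come from the summary above.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class PyramidLevel:
    """One scale of a (hypothetical) spatial feature pyramid."""
    stride: int   # downsampling factor relative to the input image
    height: int   # feature-map height at this level
    width: int    # feature-map width at this level

    @property
    def num_tokens(self) -> int:
        # Each spatial position becomes one token when the map is flattened.
        return self.height * self.width


def spatial_pyramid_shapes(
    image_h: int,
    image_w: int,
    strides: Sequence[int] = (8, 16, 32),  # assumed strides, for illustration only
) -> List[PyramidLevel]:
    """Return the feature-map shape at each pyramid level."""
    levels = []
    for s in strides:
        if image_h % s or image_w % s:
            raise ValueError(f"image size {image_h}x{image_w} is not divisible by stride {s}")
        levels.append(PyramidLevel(stride=s, height=image_h // s, width=image_w // s))
    return levels


def total_tokens(levels: Sequence[PyramidLevel]) -> int:
    """Number of tokens obtained by flattening and concatenating all levels."""
    return sum(level.num_tokens for level in levels)
```

For example, calling `spatial_pyramid_shapes(512, 512)` yields three levels of decreasing resolution, and `total_tokens` gives the length of the sequence obtained if all levels were flattened and concatenated before being fused with transformer features.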
Highlights include:
ViT-CoMer offers a new perspective on backbone design for dense prediction tasks and is expected to encourage further research in computer vision. The authors have released their codebase and invite the research community to build on it.