Rotary Position Embedding (RoPE) has shown notable success in language models, and its application to computer vision, specifically Vision Transformers (ViT), opens a new avenue. This study examines the adaptation of RoPE to 2D vision data, offering an extensive analysis of its effects on ViT performance. RoPE not only improves vision tasks such as ImageNet-1k classification but also exhibits strong extrapolation when scaling to higher image resolutions.
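As a minimal sketch of how RoPE can be adapted from 1D token positions to 2D patch coordinates, the axial variant below rotates one half of each head's dimensions by the patch's x-coordinate and the other half by its y-coordinate. This is an illustrative implementation, not the paper's exact formulation; the function name, the frequency base, and the pairing scheme are assumptions chosen for clarity. The key property it demonstrates is that attention scores between rotated queries and keys depend only on the relative (Δx, Δy) offset, which is what enables extrapolation to unseen resolutions.

```python
import numpy as np

def axial_rope_2d(v, x_pos, y_pos, base=100.0):
    """Apply an axial 2D RoPE to query/key vectors (illustrative sketch).

    v      : (num_tokens, head_dim) array, head_dim divisible by 4.
    x_pos  : (num_tokens,) patch x-coordinates.
    y_pos  : (num_tokens,) patch y-coordinates.
    The first half of the dimensions is rotated by x, the second by y.
    """
    d = v.shape[-1]
    assert d % 4 == 0, "head_dim must be divisible by 4 for axial pairing"
    half = d // 2
    # One frequency per rotated pair (half // 2 pairs per axis),
    # geometrically spaced as in the original 1D RoPE.
    freqs = base ** (-np.arange(half // 2) / (half // 2))

    def rotate(block, pos):
        # block: (num_tokens, half); rotate consecutive (even, odd) pairs
        # by an angle of pos * freq for each pair.
        angles = pos[:, None] * freqs[None, :]
        cos, sin = np.cos(angles), np.sin(angles)
        b1, b2 = block[:, 0::2], block[:, 1::2]
        out = np.empty_like(block)
        out[:, 0::2] = b1 * cos - b2 * sin
        out[:, 1::2] = b1 * sin + b2 * cos
        return out

    return np.concatenate(
        [rotate(v[:, :half], x_pos), rotate(v[:, half:], y_pos)], axis=1
    )

# Relative-position property: dot products between rotated vectors depend
# only on coordinate offsets, not on absolute positions.
rng = np.random.default_rng(0)
q = rng.standard_normal((2, 8))
a = axial_rope_2d(q, np.array([1, 3]), np.array([2, 5]))    # offsets (2, 3)
b = axial_rope_2d(q, np.array([11, 13]), np.array([7, 10]))  # offsets (2, 3)
print(np.allclose(a[0] @ a[1], b[0] @ b[1]))  # True: same offsets, same score
```

Because each per-pair rotation is an orthogonal transform, the encoding also preserves vector norms, so only the interaction between tokens is position-dependent.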
The integration of RoPE with ViT, as detailed in this study, marks a noteworthy advance in vision-based machine learning. It not only reaffirms the versatility of positional encoding methods but also underscores the potential of re-imagining language model components in new domains.