A paper on tokenization for Arabic language models examines how tokenization strategies and vocabulary sizes affect Arabic language models. The study compares four tokenizers across tasks ranging from news classification to sentiment analysis and reports strong results for BPE combined with Farasa. The full study is available on arXiv.
Authored by Mohamed Taher Alrefaie, Nour Eldin Morsy, and Nada Samir, the research underlines the importance of advancing tokenization strategies for Arabic and other morphologically rich languages.
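To make the link between vocabulary size and tokenization concrete, here is a minimal sketch of byte-pair encoding (BPE) in plain Python. It is not the authors' code and does not include Farasa segmentation; the toy corpus, the function names (`merge_pair`, `train_bpe`, `tokenize`), and the merge counts are all assumptions chosen for illustration. The number of learned merges stands in for vocabulary size: more merges produce longer subword units and fewer tokens per word.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each distinct word starts as a tuple of single characters.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge rule to every word in the working vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def tokenize(word, merges):
    """Split `word` into subword tokens by replaying the merges in learned order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus: Arabic words built on the root k-t-b, with and without "al-".
corpus = ["الكتاب", "الكاتب", "المكتبة", "كتاب", "كاتب", "مكتبة"]

small = train_bpe(corpus, 2)   # stands in for a small vocabulary
large = train_bpe(corpus, 10)  # stands in for a larger vocabulary

for word in corpus:
    print(word, len(tokenize(word, small)), len(tokenize(word, large)))
```

Each printed line shows a word followed by its token count under the smaller and the larger merge table. A real comparison along the lines the study describes would train on a large corpus with an established tokenizer library and measure downstream task accuracy, not just token counts.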