Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs

The paper "Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs" presents a model specialized for mobile UI screens, aiming to overcome the shortcomings of generic multimodal LLMs in this domain. Ferret-UI supports referring, grounding, and reasoning over screens, and it addresses challenges specific to UI screenshots, such as elongated aspect ratios and small details that need to be magnified to be understood.
- To handle these intricacies, the model divides each screen into sub-images that are encoded separately before being passed to the language model (see the sketch after this list).
- Training samples were collected from a broad range of elementary UI tasks, such as icon recognition, and annotated with region information to enable high-precision referring and grounding.
- Ferret-UI exhibits strong screen comprehension and can execute complex, open-ended instructions, surpassing most open-source UI MLLMs as well as GPT-4V on elementary UI tasks.
- A comprehensive benchmark was set up for model assessment, and the results, detailed on the project page, underscore the model's capabilities.
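
To make the sub-image idea in the first bullet concrete, here is a minimal sketch of aspect-ratio-aware splitting: a portrait screenshot is cut into top and bottom halves, a landscape one into left and right halves, and each piece is then encoded separately alongside the full view. The function name, the two-way split, and the use of Pillow are illustrative assumptions, not the paper's actual pre-processing code.

```python
from PIL import Image  # assumed helper library for the illustration


def split_ui_screenshot(img: Image.Image) -> list[Image.Image]:
    """Hypothetical pre-processing: split a UI screenshot into sub-images.

    Portrait screens (taller than wide) are divided horizontally into a
    top and a bottom half; landscape screens are divided vertically into
    a left and a right half. Each sub-image keeps a roughly square shape,
    so small UI details survive the resize applied by the image encoder.
    """
    w, h = img.size
    if h >= w:  # portrait: top and bottom halves
        subs = [img.crop((0, 0, w, h // 2)), img.crop((0, h // 2, w, h))]
    else:  # landscape: left and right halves
        subs = [img.crop((0, 0, w // 2, h)), img.crop((w // 2, 0, w, h))]
    # The full image plus each sub-image are encoded separately downstream.
    return [img] + subs
```

Each returned image would then be passed through the visual encoder on its own, with the resulting features handed to the language model, matching the "encoded separately" description above.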
By providing advanced understanding of and interaction with mobile UIs, Ferret-UI marks an important step in AI development, pointing toward new possibilities for task automation and better human-computer interaction, particularly for visually rich tasks.