3D-VLA is a generative world model that bridges the gap between 2D vision-language-action (VLA) models and the 3D physical world. To strengthen embodied foundation models, it introduces a set of interaction tokens on top of a 3D-based large language model (LLM) backbone. Trained on a newly curated large-scale 3D embodied instruction dataset, 3D-VLA demonstrates superior reasoning, generation, and planning capabilities in embodied environments.
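To make the idea of interaction tokens concrete, here is a minimal sketch of how such tokens could be registered with an LLM backbone, assuming a HuggingFace-style API. The backbone (`gpt2` as a stand-in) and the token names (`<scene>`, `<obj>`, `<action>`) are illustrative assumptions, not the paper's exact implementation.

```python
# A minimal sketch of adding interaction tokens to an LLM backbone,
# assuming a HuggingFace-style tokenizer/model API. Token names are
# illustrative, not the paper's exact set.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder backbone
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Special tokens that mark scenes, objects, and actions in the input stream.
interaction_tokens = ["<scene>", "</scene>", "<obj>", "</obj>",
                      "<action>", "</action>"]
tokenizer.add_special_tokens({"additional_special_tokens": interaction_tokens})

# Grow the embedding table so the new tokens get trainable vectors.
model.resize_token_embeddings(len(tokenizer))

# An embodied instruction can now interleave language with interaction tokens.
prompt = "<scene> kitchen point cloud </scene> Pick up the <obj> red mug </obj>."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```

In a full system, the new token embeddings would be fine-tuned on the embodied instruction data so the model learns to ground them in 3D scene representations.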
Summary Points:
- By incorporating 3D understanding into LLMs, 3D-VLA paves the way for AI systems that can interact with and simulate the dynamics of the real world.