QUAR-VLA: Vision-Language-Action Model for Quadruped Robots
Pengxiang Ding, Han Zhao, Wenjie Zhang, Wenxuan Song, Min Zhang, Siteng Huang, Ningxi Yang, Donglin Wang*
Abstract
"The important manifestation of robot intelligence is the ability to naturally interact and autonomously make decisions. Traditional quadruped robot learning typically handles language interaction and visual autonomous perception separately, which, while simplifying system design, also limits the synergy between different information streams. This separation poses challenges in achieving seamless autonomous reasoning, decision-making, and action execution. To address these limitations, a novel paradigm, named Vision-Language-Action tasks for QUAdruped Robots (QUAR-VLA), has been introduced in this paper. This approach tightly integrates visual information and instructions to generate executable actions, effectively merging perception, planning, and decision-making. The central idea is to elevate the overall intelligence of the robot. Within this framework, a notable challenge lies in aligning fine-grained instructions with visual perception information. This emphasizes the complexity involved in ensuring that the robot accurately interprets and acts upon detailed instructions in harmony with its visual observations. Consequently, we propose QUAdruped Robotic Transformer (QUART), a VLA model to integrate visual information and instructions from diverse modalities as input and generates executable actions for real-world robots and present QUAdruped Robot Dataset (QUARD), a large-scale multi-task dataset including perception, navigation and advanced capability like whole-body manipulation tasks for training QUART model. Our extensive evaluation shows that our approach leads to performant robotic policies and enables QUART to obtain a range of generalization capabilities."