Can Aerial VLA Models Cooperate? Evaluating Closed-Loop Air-Ground Coordination with CARLA-Air

Abstract

Recent aerial vision-language-action (VLA) models show promising single-UAV capabilities, such as tracking moving objects and navigating to language-specified landmarks. However, it remains unclear whether these capabilities transfer to air-ground cooperation, where an unmanned aerial vehicle (UAV) and an unmanned ground vehicle (UGV) must act jointly in a shared, closed-loop physical world. We study this question with CARLA-Air, a single-process air-ground evaluation environment that unifies CARLA and AirSim inside one Unreal Engine runtime. Because both vehicles share the same world state, physics tick, and sensing pipeline, CARLA-Air enables physically consistent UAV-UGV interaction and precise measurement of simulation-timestamp alignment and effective coordination latency. Using CARLA-Air, we evaluate representative aerial VLA and planning baselines on two complementary diagnostic tasks: moving-platform landing and occlusion-recovery escort. The results show that current aerial VLA models can often track or follow a ground partner, but struggle to convert this single-agent competence into stable cooperative behavior. State prompting provides limited benefit, and naive bidirectional interaction does not consistently improve performance and, for most baselines, can amplify errors. These findings suggest that, under the tested text-based cue interfaces, zero-shot cooperative air-ground VLA requires three components beyond the current paradigm: explicit partner-state grounding, low-latency action coordination, and team-level objective alignment. Our code is available at https://github.com/louiszengCN/CarlaAir.

