CosFly-Track: A Large-Scale Multi-Modal Dataset for UAV Visual Tracking via Multi-Constraint Trajectory Optimization

Abstract (EN)

Recent aerial vision-language navigation (VLN) datasets have grown rapidly, but they primarily address goal-oriented navigation to static destinations. UAV visual tracking, which requires continuously following a moving target while maintaining visibility, remains largely without dedicated training data. We introduce CosFlyTrack, a large-scale multi-modal dataset and scalable generation pipeline for UAV visual tracking in urban environments. The dataset provides approximately 12,000 expert and perturbed UAV trajectories generated from 6,000 pedestrian paths, comprising 2.4 million timesteps (approximately 334 hours) across seven aligned data channels: RGB, metric depth, semantic segmentation, six-degree-of-freedom drone pose, target state with visibility flag, bilingual (Chinese-English) instructions, and trajectory-pair metadata. To generate high-quality expert trajectories, we develop MuCO, a multi-constraint optimizer that plans directly in continuous three-dimensional space using collision and visibility queries accelerated by a bounding volume hierarchy (BVH). MuCO jointly enforces target visibility, viewpoint quality, collision avoidance, smoothness, and kinematic feasibility, which avoids the discretization artifacts and post-hoc smoothing of grid-based planners. Fine-tuning experiments on seven vision-language models show that CosFlyTrack raises tracking performance to 78.3% to 95.6% SR@1 m, a gain of 53 to 69 percentage points over zero-shot baselines, supporting the dataset as a training resource for dynamic target-following agents. The dataset is publicly available at https://huggingface.co/datasets/AutelRobotics/CosFly; evaluation scripts and pre-trained checkpoints are hosted at https://huggingface.co/AutelRobotics/CosFly-Track.

摘要 (ZH)

近年来，空中视觉-语言导航数据集发展迅速，但主要针对静态目标的目标导向导航，而无人机视觉跟踪——持续跟随移动目标并保持可见性——仍缺乏专门的训练数据。我们提出CosFlyTrack，一个面向城市环境中无人机视觉跟踪的大规模多模态数据集及可扩展生成流程。该数据集包含从6000条行人路径生成的约12000条专家和扰动无人机轨迹，涵盖240万个时间步（约334小时），并配有七种对齐数据通道：RGB、度量深度、语义分割、六自由度无人机位姿、带可见性标志的目标状态、双语（中英文）指令以及轨迹对元数据。为生成高质量专家轨迹，我们开发了MuCO——一种多约束优化器，直接在连续三维空间中规划，采用BVH加速的碰撞和可见性查询，联合优化目标可见性、视点质量、碰撞避免、平滑性和运动学可行性，避免了基于网格规划器的离散化伪影和后处理平滑。对七个视觉-语言模型的微调实验表明，CosFlyTrack将跟踪性能提升至78.3%至95.6%的SR@1米，相比零样本基线提高53至69个百分点，证明了该数据集作为动态目标跟随代理训练资源的有效性。数据集公开于https://huggingface.co/datasets/AutelRobotics/CosFly；评估脚本和预训练检查点托管于https://huggingface.co/AutelRobotics/CosFly-Track。
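
The headline counts in the abstract can be cross-checked with simple arithmetic. The sketch below is only a back-of-the-envelope estimate: it assumes every trajectory has the same length and a uniform sampling rate, neither of which the abstract states, and the function name `dataset_scale` is a hypothetical helper introduced here for illustration.

```python
# Approximate figures quoted in the abstract.
TRAJECTORIES = 12_000      # expert + perturbed UAV trajectories
PEDESTRIAN_PATHS = 6_000   # source pedestrian paths
TIMESTEPS = 2_400_000      # total timesteps
HOURS = 334                # total duration in hours


def dataset_scale(trajectories, paths, timesteps, hours):
    """Derive per-trajectory figures from the headline counts.

    Assumption (not stated in the abstract): all trajectories are equally
    long and sampled at a uniform rate, so these are rough estimates only.
    """
    total_seconds = hours * 3600
    return {
        "trajectories_per_path": trajectories / paths,
        "timesteps_per_trajectory": timesteps / trajectories,
        "implied_rate_hz": timesteps / total_seconds,
        "seconds_per_trajectory": total_seconds / trajectories,
    }


scale = dataset_scale(TRAJECTORIES, PEDESTRIAN_PATHS, TIMESTEPS, HOURS)
for name, value in scale.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the numbers are mutually consistent: two trajectories per pedestrian path (matching the expert/perturbed pairing), 200 timesteps per trajectory, an implied sampling rate of roughly 2 Hz, and about 100 seconds per trajectory.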
