Automatically tracks the latest arXiv papers on Vision-Language-Action (VLA), Vision-Language Navigation (VLN), and Vision-Language Models (VLM) every day.
Updated on 2026.05.18
| Publish Date (YYYY-MM-DD) | Title | Authors | arXiv ID | HJFY | Evaluation |
|---|---|---|---|---|---|
| 2026-05-14 | VGGT-$Ω$ | Christian Rupprecht Team | 2605.15195 | HJFY | |
| 2026-05-14 | Hand-in-the-Loop: Improving Dexterous VLA via Seamless Interventional Correction | Ruoshi Wen Team | 2605.15157 | HJFY | |
| 2026-05-14 | Evo-Depth: A Lightweight Depth-Enhanced Vision-Language-Action Model | Bo Zhao Team | 2605.14950 | HJFY | |
| 2026-05-14 | Slot-MPC: Goal-Conditioned Model Predictive Control with Object-Centric Representations | Sven Behnke Team | 2605.14937 | HJFY | |
| 2026-05-14 | Chain-of-Procedure: Hierarchical Visual-Language Reasoning for Procedural QA | Derek F. Wong Team | 2605.14928 | HJFY | |
| 2026-05-14 | IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation | Kai Chen Team | 2605.14712 | HJFY | |
| 2026-05-14 | Digital Twin Synchronization Over Mobile Embodied AI Network With Agentic Intelligence | Kaibin Huang Team | 2605.14625 | HJFY | |
| 2026-05-14 | DSSP: Diffusion State Space Policy with Full-History Encoding | Yutong Ban Team | 2605.14598 | HJFY | |
| 2026-05-14 | TeachAnything: A Multimodal Crowdsourcing Platform for Training Embodied AI Agents in Symmetrical Reality | Zhenliang Zhang Team | 2605.14556 | HJFY | |
| 2026-05-14 | DiffPhD: A Unified Differentiable Solver for Projective Heterogeneous Materials in Elastodynamics with Contact-Rich GPU-Acceleration | Bing-Yu Chen Team | 2605.14526 | HJFY | |
| 2026-05-08 | One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy | Bin Liu Team | 2605.07931 | HJFY | |
| 2026-05-08 | EggHand: A Multimodal Foundation Model for Egocentric Hand Pose Forecasting | Daehee Park Team | 2605.07642 | HJFY | |
| 2026-05-08 | ForgeVLA: Federated Vision-Language-Action Learning without Language Annotations | Jiancheng Lyu Team | 2605.07474 | HJFY | |
| 2026-05-08 | EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement | Guangtao Zhai Team | 2605.07457 | HJFY | |
| 2026-05-08 | Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation | Mike Zheng Shou Team | 2605.07381 | HJFY | |
| 2026-05-08 | Task-Oriented Communication for Human Action Understanding via Edge-Cloud Co-Inference | Jiawei Shao Team | 2605.07354 | HJFY | |
| 2026-05-08 | CSR: Infinite-Horizon Real-Time Policies with Massive Cached State Representations | Go Suzui Team | 2605.07325 | HJFY | |
| 2026-05-08 | AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models | Hao Dong Team | 2605.07308 | HJFY | |
| 2026-05-08 | BioProVLA-Agent: An Affordable, Protocol-Driven, Vision-Enhanced VLA-Enabled Embodied Multi-Agent System with Closed-Loop-Capable Reasoning for Biological Laboratory Manipulation | Zhe Liu Team | 2605.07306 | HJFY | |
| 2026-05-08 | Sword: Style-Robust World Models as Simulators via Dynamic Latent Bootstrapping for VLA Policy Post-Training | Sheng Wen Team | 2605.07288 | HJFY | |
| 2026-05-05 | Bridging the Embodiment Gap: Disentangled Cross-Embodiment Video Editing | Joni Pajarinen Team | 2605.03637 | HJFY | |
| 2026-05-05 | MHPR: Multidimensional Human Perception and Reasoning Benchmark for Large Vision-Language Models | Shengzhao Wen Team | 2605.03485 | HJFY | |
| 2026-05-05 | Neural Control: Adjoint Learning Through Equilibrium Constraints | M. Khalid Jawed Team | 2605.03288 | HJFY | |
| 2026-05-05 | RLDX-1 Technical Report | Jinwoo Shin Team | 2605.03269 | HJFY | |
| 2026-05-04 | MolmoAct2: Action Reasoning Models for Real-world Deployment | Ranjay Krishna Team | 2605.02881 | HJFY | |
| 2026-05-05 | VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition | Ranjay Krishna Team | 2605.02834 | HJFY | |
| 2026-05-04 | Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation | Chang Xu Team | 2605.02757 | HJFY | |
| 2026-05-04 | Latent Bridge: Feature Delta Prediction for Efficient Dual-System Vision-Language-Action Model Inference | Hai Li Team | 2605.02739 | HJFY | |
| 2026-05-04 | Learning Equivariant Neural-Augmented Object Dynamics From Few Interactions | Laura Herlant Team | 2605.02699 | HJFY | |
| 2026-05-04 | AnchorD: Metric Grounding of Monocular Depth Using Factor Graphs | Abhinav Valada Team | 2605.02667 | HJFY | |
| 2026-04-30 | LaST-R1: Reinforcing Action via Adaptive Physical Latent Reasoning for VLA Models | Pheng-Ann Heng Team | 2604.28192 | HJFY | |
| 2026-04-30 | RopeDreamer: A Kinematic Recurrent State Space Model for Dynamics of Flexible Deformable Linear Objects | Paula Dornhofer Paro Costa Team | 2604.28161 | HJFY | |
| 2026-04-30 | A Pattern Language for Resilient Visual Agents | Alois Knoll Team | 2604.28001 | HJFY | |
| 2026-04-30 | MotuBrain: An Advanced World Action Model for Robot Control | Jun Zhu Team | 2604.27792 | HJFY | |
| 2026-04-30 | Robot Learning from Human Videos: A Survey | Hesheng Wang Team | 2604.27621 | HJFY | |
| 2026-04-30 | SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation | Nanning Zheng Team | 2604.27620 | HJFY | |
| 2026-04-30 | World2Minecraft: Occupancy-Driven Simulated Scenes Construction | Xin Tan Team | 2604.27578 | HJFY | |
| 2026-04-30 | SpatialGrammar: A Domain-Specific Language for LLM-Based 3D Indoor Scene Generation | Xiaowen Chu Team | 2604.27555 | HJFY | |
| 2026-04-30 | PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations | Xuelong Li Team | 2604.27472 | HJFY | |
| 2026-04-30 | Judge, Then Drive: A Critic-Centric Vision Language Action Framework for Autonomous Driving | Hao Yang Team | 2604.27366 | HJFY | |
| 2026-04-23 | Long-Horizon Manipulation via Trace-Conditioned VLA Planning | Sifei Liu Team | 2604.21924 | HJFY | |
| 2026-04-23 | VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis | Wenchao Ding Team | 2604.21914 | HJFY | |
| 2026-04-23 | From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media | Margret Keuper Team | 2604.21786 | HJFY | |
| 2026-04-23 | Hi-WM: Human-in-the-World-Model for Scalable Robot Post-Training | Yichen Zhu Team | 2604.21741 | HJFY | |
| 2026-04-23 | A Replicable Robotics Awareness Method Using LLM-Enabled Robotics Interaction: Evidence from a Corporate Challenge | P. Olivera Brizzio Team | 2604.21377 | HJFY | |
| 2026-04-23 | Symbolic Grounding Reveals Representational Bottlenecks in Abstract Visual Reasoning | Tanel Tammet Team | 2604.21346 | HJFY | |
| 2026-04-23 | Reasoning About Traversability: Language-Guided Off-Road 3D Trajectory Planning | Soonmin Hwang Team | 2604.21249 | HJFY | |
| 2026-04-23 | CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors | Jianqiang Li Team | 2604.21241 | HJFY | |
| 2026-04-23 | ReCAPA: Hierarchical Predictive Correction to Mitigate Cascading Failures | Hao Wang Team | 2604.21232 | HJFY | |
| 2026-04-23 | How VLAs (Really) Work In Open-World Environments | Sajjad Pakdamansavoji Team | 2604.21192 | HJFY | |
| 2026-04-20 | Neural Garbage Collection: Learning to Forget while Learning to Reason | Noah D. Goodman Team | 2604.18002 | HJFY | |
| 2026-04-20 | Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models | Zongqing Lu Team | 2604.18000 | HJFY | |
| 2026-04-20 | E3VS-Bench: A Benchmark for Viewpoint-Dependent Active Perception in 3D Gaussian Splatting Scenes | Yutaka Matsuo Team | 2604.17969 | HJFY | |
| 2026-04-20 | OneDrive: Unified Multi-Paradigm Driving with Vision-Language-Action Models | Zhipeng Zhang Team | 2604.17915 | HJFY | |
| 2026-04-20 | Can Explicit Physical Feasibility Benefit VLA Learning? An Empirical Study | Hashem Haghbayan Team | 2604.17896 | HJFY | |
| 2026-04-20 | StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement | Huaibo Huang Team | 2604.17887 | HJFY | |
| 2026-04-20 | ST-$π$: Structured SpatioTemporal VLA for Robotic Manipulation | Luxin Yan Team | 2604.17880 | HJFY | |
| 2026-04-20 | OFlow: Injecting Object-Aware Temporal Flow Matching for Robust Robotic Manipulation | Xiangyang Xue Team | 2604.17876 | HJFY | |
| 2026-04-20 | DART: Learning-Enhanced Model Predictive Control for Dual-Arm Non-Prehensile Manipulation | Madhava Krishna Team | 2604.17833 | HJFY | |
| 2026-04-20 | ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning | Minh Nhat Vu Team | 2604.17800 | HJFY | |
| 2026-04-15 | HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System | Ping Luo Team | 2604.14125 | HJFY | |
| 2026-04-15 | MApLe: Multi-instance Alignment of Diagnostic Reports and Large Medical Images | Georg Langs Team | 2604.13970 | HJFY | |
| 2026-04-15 | [Emerging Ideas] Artificial Tripartite Intelligence: A Bio-Inspired, Sensor-First Architecture for Physical AI | Hyung-Sin Kim Team | 2604.13959 | HJFY | |
| 2026-04-15 | Goal2Skill: Long-Horizon Manipulation with Adaptive Planning and Reflection | Zhongzhu Pu Team | 2604.13942 | HJFY | |
| 2026-04-15 | EmbodiedClaw: Conversational Workflow Execution for Embodied AI Development | Yongchao Chen Team | 2604.13800 | HJFY | |
| 2026-04-15 | Failure Identification in Imitation Learning Via Statistical and Semantic Filtering | Jean-Baptiste Mouret Team | 2604.13788 | HJFY | |
| 2026-04-15 | Jump-Start Reinforcement Learning with Vision-Language-Action Regularization | Loris Roveda Team | 2604.13733 | HJFY | |
| 2026-04-15 | Vision-and-Language Navigation for UAVs: Progress, Challenges, and a Research Roadmap | Ji Pei Team | 2604.13654 | HJFY | |
| 2026-04-15 | A Mechanistic Analysis of Sim-and-Real Co-Training in Generative Robot Policies | Yuke Zhu Team | 2604.13645 | HJFY | |
| 2026-04-15 | ESCAPE: Episodic Spatial Memory and Adaptive Execution Policy for Long-Horizon Mobile Manipulation | Li Jiang Team | 2604.13633 | HJFY | |
| 2026-04-06 | InfBaGel: Human-Object-Scene Interaction Generation with Dynamic Perception and Iterative Refinement | Guanjie Zheng Team | 2604.04843 | HJFY | |
| 2026-04-06 | E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes | Kaiwei Wang Team | 2604.04834 | HJFY | |
| 2026-04-06 | AnyUser: Translating Sketched User Intent into Domestic Robots | Shaowu Yang Team | 2604.04811 | HJFY | |
| 2026-04-06 | ROSClaw: A Hierarchical Semantic-Physical Framework for Heterogeneous Multi-Agent Collaboration | Jie Chen Team | 2604.04664 | HJFY | |
| 2026-04-06 | Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation? | Jianyu Chen Team | 2604.04502 | HJFY | |
| 2026-04-05 | Adaptive Action Chunking at Inference-time for Vision-Language-Action Models | Prahlad Vadakkepat Team | 2604.04161 | HJFY | |
| 2026-04-05 | VLA-Forget: Vision-Language-Action Unlearning for Embodied Foundation Models | Agoritsa Polyzou Team | 2604.03956 | HJFY | |
| 2026-04-04 | From Prompt to Physical Action: Structured Backdoor Attacks on LLM-Mediated Robotic Control Systems | Jin Wei-Kocsis Team | 2604.03890 | HJFY | |
| 2026-04-04 | OpenRC: An Open-Source Robotic Colonoscopy Framework for Multimodal Data Acquisition and Autonomy Research | Farshid Alambeigi Team | 2604.03781 | HJFY | |
| 2026-04-04 | When Does Multimodal AI Help? Diagnostic Complementarity of Vision-Language Models and CNNs for Spectrum Management in Satellite-Terrestrial Networks | Yuanhang Li Team | 2604.03774 | HJFY | |
| 2026-04-03 | The Compression Gap: Why Discrete Tokenization Limits Vision-Language-Action Model Scaling | Takuya Shiba Team | 2604.03191 | HJFY | |
| 2026-04-03 | Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model | Tieniu Tan Team | 2604.03181 | HJFY | |
| 2026-04-03 | ARM: Advantage Reward Modeling for Long-Horizon Manipulation | Hua Chen Team | 2604.03037 | HJFY | |
| 2026-04-03 | Open-Loop Planning, Closed-Loop Verification: Speculative Verification for VLA | Xiu-Shen Wei Team | 2604.02965 | HJFY | |
| 2026-04-03 | Learning Structured Robot Policies from Vision-Language Models via Synthetic Neuro-Symbolic Supervision | Pietro Falco Team | 2604.02812 | HJFY | |
| 2026-04-03 | ExploreVLA: Dense World Modeling and Exploration for End-to-End Autonomous Driving | Liu Ren Team | 2604.02714 | HJFY | |
| 2026-04-02 | F2F-AP: Flow-to-Future Asynchronous Policy for Real-time Dynamic Manipulation | Jiwen Lu Team | 2604.02408 | HJFY | |
| 2026-04-02 | UAV-Track VLA: Embodied Aerial Tracking via Vision-Language-Action Models | Yonglin Tian Team | 2604.02241 | HJFY | |
| 2026-04-02 | UniDriveVLA: Unifying Understanding, Perception, and Action Planning for Autonomous Driving | Xinggang Wang Team | 2604.02190 | HJFY | |
| 2026-04-02 | Cross-Modal Visuo-Tactile Object Perception | Mohsen Kaboli Team | 2604.02108 | HJFY | |
| 2026-03-31 | Hybrid Framework for Robotic Manipulation: Integrating Reinforcement Learning and Large Language Models | Mohd Suhaib Team | 2603.30022 | HJFY | |
| 2026-03-31 | Architecting Secure AI Agents: Perspectives on System-Level Defenses Against Indirect Prompt Injection Attacks | G. Edward Suh Team | 2603.30016 | HJFY | |
| 2026-03-31 | Passive iFIR filters for data-driven velocity control in robotics | Fulvio Forni Team | 2603.29882 | HJFY | |
| 2026-03-31 | DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA | Xihui Liu Team | 2603.29844 | HJFY | |
| 2026-03-31 | SceneTeract: Agentic Functional Affordances and VLM Grounding in 3D Scenes | Maks Ovsjanikov Team | 2603.29798 | HJFY | |
| 2026-03-31 | From Skeletons to Semantics: Design and Deployment of a Hybrid Edge-Based Action Detection System for Public Safety | Jan Schagen Team | 2603.29777 | HJFY | |
| 2026-03-31 | SafeDMPs: Integrating Formal Safety with DMPs for Adaptive HRI | Ravi Prakash Team | 2603.29708 | HJFY | |
| 2026-03-31 | RAAP: Retrieval-Augmented Affordance Prediction with Cross-Image Action Alignment | Xiu-Shen Wei Team | 2603.29419 | HJFY | |
| 2026-03-31 | CLaD: Planning with Grounded Foresight via Cross-Modal Latent Dynamics | Sung-Eui Yoon Team | 2603.29409 | HJFY | |
| 2026-03-31 | PRISM: A Multi-View Multi-Capability Retail Video Dataset for Embodied Vision-Language Models | Sashi Reddi Team | 2603.29281 | HJFY | |
| 2026-03-26 | Vega: Learning to Drive with Natural Language Instructions | Jiwen Lu Team | 2603.25741 | HJFY | |
| 2026-03-26 | Drive My Way: Preference Alignment of Vision-Language-Action Model for Personalized Driving | Jiachen Li Team | 2603.25740 | HJFY | |
| 2026-03-26 | SoftMimicGen: A Data Generation System for Scalable Robot Learning in Deformable Object Manipulation | Ajay Mandlekar Team | 2603.25725 | HJFY | |
| 2026-03-26 | Self-Improvement of Large Language Models: A Technical Overview and Future Outlook | Jiawei Zhou Team | 2603.25681 | HJFY | |
| 2026-03-26 | Towards Embodied AI with MuscleMimic: Unlocking full-body musculoskeletal motor learning at scale | Alexander Mathis Team | 2603.25544 | HJFY | |
| 2026-03-26 | PAWS: Perception of Articulation in the Wild at Scale from Egocentric Videos | Arno Solin Team | 2603.25539 | HJFY | |
| 2026-03-26 | LILAC: Language-Conditioned Object-Centric Optical Flow for Open-Loop Trajectory Generation | Komei Sugiura Team | 2603.25481 | HJFY | |
| 2026-03-26 | VideoWeaver: Multimodal Multi-View Video-to-Video Transfer for Embodied Agents | Ziyuan Liu Team | 2603.25420 | HJFY | |
| 2026-03-26 | MMaDA-VLA: Large Diffusion Vision-Language-Action Model with Unified Multi-Modal Instruction and Generation | Donglin Wang Team | 2603.25406 | HJFY | |
| 2026-03-26 | LaMP: Learning Vision-Language-Action Policies with 3D Scene Flow as Latent Motion Prior | Lixin Yang Team | 2603.25399 | HJFY | |
| 2026-03-25 | TAG: Target-Agnostic Guidance for Stable Object-Centric Inference in Vision-Language-Action Models | Guangrun Wang Team | 2603.24584 | HJFY | |
| 2026-03-25 | Chameleon: Episodic Memory for Long-Horizon Robotic Manipulation | Jianfei Yang Team | 2603.24576 | HJFY | |
| 2026-03-25 | 3D-Mix for VLA: A Plug-and-Play Module for Integrating VGGT-based 3D Information into Vision-Language-Action Models | Kai Chen Team | 2603.24393 | HJFY | |
| 2026-03-25 | A Sensorless, Inherently Compliant Anthropomorphic Musculoskeletal Hand Driven by Electrohydraulic Actuators | Robert K. Katzschmann Team | 2603.24357 | HJFY | |
| 2026-03-25 | GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents | Volkan Ustun Team | 2603.24329 | HJFY | |
| 2026-03-25 | Toward Generalist Neural Motion Planners for Robotic Manipulators: Challenges and Opportunities | Minghui Zheng Team | 2603.24318 | HJFY | |
| 2026-03-25 | CarePilot: A Multi-Agent Framework for Long-Horizon Computer Task Automation in Healthcare | Salman Khan Team | 2603.24157 | HJFY | |
| 2026-03-25 | Knowledge-Guided Manipulation Using Multi-Task Reinforcement Learning | Aleksandr Panov Team | 2603.24083 | HJFY | |
| 2026-03-25 | SOMA: Strategic Orchestration and Memory-Augmented System for Vision-Language-Action Model Robustness via In-Context Adaptation | Jinyu Gu Team | 2603.24060 | HJFY | |
| 2026-03-25 | ELITE: Experiential Learning and Intent-Aware Transfer for Self-improving Embodied Agents | Yongtao Wang Team | 2603.24018 | HJFY | |
| 2026-03-19 | Not All Features Are Created Equal: A Mechanistic Study of Vision-Language-Action Models | Peng Wang Team | 2603.19233 | HJFY | |
| 2026-03-19 | MonoArt: Progressive Structural Reasoning for Monocular Articulated 3D Reconstruction | Ziwei Liu Team | 2603.19231 | HJFY | |
| 2026-03-19 | DriveTok: 3D Driving Scene Tokenization for Unified Multi-View Reconstruction and Understanding | Jiwen Lu Team | 2603.19219 | HJFY | |
| 2026-03-19 | OmniVTA: Visuo-Tactile World Modeling for Contact-Rich Robotic Manipulation | Wenchao Ding Team | 2603.19201 | HJFY | |
| 2026-03-19 | FASTER: Rethinking Real-Time Flow VLAs | Hengshuang Zhao Team | 2603.19199 | HJFY | |
| 2026-03-19 | Sparse Autoencoders Reveal Interpretable and Steerable Features in VLA Models | Mac Schwager Team | 2603.19183 | HJFY | |
| 2026-03-19 | Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation | Nakul Gopalan Team | 2603.19166 | HJFY | |
| 2026-03-19 | From Inference Efficiency to Embodied Efficiency: Revisiting Efficiency Metrics for Vision-Language-Action Models | Chaojian Li Team | 2603.19131 | HJFY | |
| 2026-03-19 | MERGE: Guided Vision-Language Models for Multi-Actor Event Reasoning and Grounding in Human-Robot Interaction | Michael Gienger Team | 2603.18988 | HJFY | |
| 2026-03-19 | GHOST: Fast Category-agnostic Hand-Object Interaction Reconstruction from RGB Videos using Gaussian Splatting | Didier Stricker Team | 2603.18912 | HJFY | |
| 2026-03-18 | Unified Spatio-Temporal Token Scoring for Efficient Video VLMs | Sangho Lee Team | 2603.18004 | HJFY | |
| 2026-03-18 | ProbeFlow: Training-Free Adaptive Flow Matching for Vision-Language-Action Models | Qiongfeng Shi Team | 2603.17850 | HJFY | |
| 2026-03-18 | Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control | Hang Zhao Team | 2603.17834 | HJFY | |
| 2026-03-18 | VolumeDP: Modeling Volumetric Representation for Manipulation Policy Learning | Tao Jiang Team | 2603.17720 | HJFY | |
| 2026-03-18 | HeiSD: Hybrid Speculative Decoding for Embodied Vision-Language-Action Models with Kinematic Awareness | Xiang Chen Team | 2603.17573 | HJFY | |
| 2026-03-18 | KineVLA: Towards Kinematics-Aware Vision-Language-Action Models with Bi-Level Action Decomposition | Tongliang Liu Team | 2603.17524 | HJFY | |
| 2026-03-17 | TeleDex: Accessible Dexterous Teleoperation | Yuchen Cui Team | 2603.17065 | HJFY | |
| 2026-03-17 | ManiTwin: Scaling Data-Generation-Ready Digital Object Dataset to 100K | Ping Luo Team | 2603.16866 | HJFY | |
| 2026-03-17 | MolmoB0T: Large-Scale Simulation Enables Zero-Shot Manipulation | Ranjay Krishna Team | 2603.16861 | HJFY | |
| 2026-03-17 | DreamPlan: Efficient Reinforcement Fine-Tuning of Vision-Language Planners via Video World Models | Yue Wang Team | 2603.16860 | HJFY | |
| 2026-03-13 | Towards Spatio-Temporal World Scene Graph Generation from Monocular Videos | Vibhav Gogate Team | 2603.13185 | HJFY | |
| 2026-03-13 | DecoVLN: Decoupling Observation, Reasoning, and Correction for Vision-and-Language Navigation | Shengjun Huang Team | 2603.13133 | HJFY | |
| 2026-03-13 | Language-Grounded Decoupled Action Representation for Robotic Manipulation | Heng Tao Shen Team | 2603.12967 | HJFY | |
| 2026-03-13 | ReMem-VLA: Empowering Vision-Language-Action Model with Memory via Dual-Level Recurrent Queries | Alois Knoll Team | 2603.12942 | HJFY | |
| 2026-03-13 | Coordinated Manipulation of Hybrid Deformable-Rigid Objects in Constrained Environments | Federico Renda Team | 2603.12940 | HJFY | |
| 2026-03-13 | RoboStream: Weaving Spatio-Temporal Reasoning with Memory in Vision-Language Models for Robotics | Zhi Wang Team | 2603.12939 | HJFY | |
| 2026-03-13 | MotionAnymesh: Physics-Grounded Articulation for Simulation-Ready Digital Twins | RuoNan Liu Team | 2603.12936 | HJFY | |
| 2026-03-13 | Finite Difference Flow Optimization for RL Post-Training of Text-to-Image Models | Samuli Laine Team | 2603.12893 | HJFY | |
| 2026-03-13 | Adaptive Vision-Language Model Routing for Computer Use Agents | Huamin Chen Team | 2603.12823 | HJFY | |
| 2026-03-13 | Generalized Recognition of Basic Surgical Actions Enables Skill Assessment and Vision-Language-Model-based Surgical Planning | Qi Dou Team | 2603.12787 | HJFY | |
| 2026-03-10 | NS-VLA: Towards Neuro-Symbolic Vision-Language-Action Models | Haoran Luo Team | 2603.09542 | HJFY | |
| 2026-03-10 | Beyond Short-Horizon: VQ-Memory for Robust Long-Horizon Manipulation in Non-Markovian Simulation Benchmarks | Bai Chenjia Team | 2603.09513 | HJFY | |
| 2026-03-10 | StyleVLA: Driving Style-Aware Vision Language Action Model for Autonomous Driving | Johannes Betz Team | 2603.09482 | HJFY | |
| 2026-03-10 | EvoDriveVLA: Evolving Autonomous Driving Vision-Language-Action Model via Collaborative Perception-Planning Distillation | Shanghang Zhang Team | 2603.09465 | HJFY | |
| 2026-03-10 | From Flow to One Step: Real-Time Multi-Modal Trajectory Policies via Implicit Maximum Likelihood Estimation-based Distribution Distillation | Jianwei Zhang Team | 2603.09415 | HJFY | |
| 2026-03-10 | CORAL: Scalable Multi-Task Robot Learning via LoRA Experts | Zhenguo Li Team | 2603.09298 | HJFY | |
| 2026-03-10 | See, Plan, Rewind: Progress-Aware Vision-Language-Action Models for Robust Robotic Manipulation | Xiaojun Chang Team | 2603.09292 | HJFY | |
| 2026-03-10 | Implicit Geometry Representations for Vision-and-Language Navigation from Web Videos | Ivan Laptev Team | 2603.09259 | HJFY | |
| 2026-03-10 | SPAN-Nav: Generalized Spatial Awareness for Versatile Vision-Language Navigation | He Wang Team | 2603.09163 | HJFY | |
| 2026-03-10 | DexHiL: A Human-in-the-Loop Framework for Vision-Language-Action Model Post-Training in Dexterous Manipulation | Wenzhao Lian Team | 2603.09121 | HJFY | |
| 2026-03-06 | Unified Learning of Temporal Task Structure and Action Timing for Bimanual Robot Manipulation | Tamim Asfour Team | 2603.06538 | HJFY | |
| 2026-03-06 | History-Conditioned Spatio-Temporal Visual Token Pruning for Efficient Vision-Language Navigation | Christopher Rasmussen Team | 2603.06480 | HJFY | |
| 2026-03-06 | Data Analogies Enable Efficient Cross-Embodiment Transfer | Dorsa Sadigh Team | 2603.06450 | HJFY | |
| 2026-03-06 | SuperSuit: An Isomorphic Bimodal Interface for Scalable Mobile Manipulation | Lu Fang Team | 2603.06280 | HJFY | |
| 2026-03-06 | Few-Shot Neural Differentiable Simulator: Real-to-Sim Rigid-Contact Modeling | Fan Shi Team | 2603.06218 | HJFY | |
| 2026-03-06 | Restoring Linguistic Grounding in VLA Models via Train-Free Attention Recalibration | Jingjing Chen Team | 2603.06001 | HJFY | |
| 2026-03-06 | HarvestFlex: Strawberry Harvesting via Vision-Language-Action Policy Adaptation in the Wild | Ya Xiong Team | 2603.05982 | HJFY | |
| 2026-03-06 | Learning Next Action Predictors from Human-Computer Interaction | Diyi Yang Team | 2603.05923 | HJFY | |
| 2026-03-06 | AnyCamVLA: Zero-Shot Camera Adaptation for Viewpoint Robust Vision-Language-Action Models | Young Min Kim Team | 2603.05868 | HJFY | |
| 2026-03-06 | DexEMG: Towards Dexterous Teleoperation System via EMG2Pose Generalization | Kaifeng Zhang Team | 2603.05861 | HJFY | |
| 2026-03-05 | Observing and Controlling Features in Vision-Language-Action Models | Marco Pavone Team | 2603.05487 | HJFY | |
| 2026-03-05 | RealWonder: Real-Time Physical Action-Conditioned Video Generation | Jiajun Wu Team | 2603.05449 | HJFY | |
| 2026-03-05 | PhysiFlow: Physics-Aware Humanoid Whole-Body VLA via Multi-Brain Latent Flow Matching and Robust Tracking | Hesheng Wang Team | 2603.05410 | HJFY | |
| 2026-03-05 | OpenFrontier: General Navigation with Visual-Language Grounded Frontiers | Hermann Blum Team | 2603.05377 | HJFY | |
| 2026-03-05 | Curve-Induced Dynamical Systems on Riemannian Manifolds and Lie Groups | Sylvain Calinon Team | 2603.05268 | HJFY | |
| 2026-03-05 | Critic in the Loop: A Tri-System VLA Framework for Robust Long-Horizon Manipulation | Shanlin Zhong Team | 2603.05185 | HJFY | |
| 2026-03-05 | Lifelong Language-Conditioned Robotic Manipulation Learning | Zhi Han Team | 2603.05160 | HJFY | |
| 2026-03-05 | Act, Think or Abstain: Complexity-Aware Adaptive Inference for Vision-Language-Action Models | Matteo Matteucci Team | 2603.05147 | HJFY | |
| 2026-03-05 | SeedPolicy: Horizon Scaling via Self-Evolving Diffusion Policy for Robot Manipulation | Shuaicheng Liu Team | 2603.05117 | HJFY | |
| 2026-03-05 | SPIRIT: Perceptive Shared Autonomy for Robust Robotic Manipulation under Deep Learning Uncertainty | Konstantin Kondak Team | 2603.05111 | HJFY | |
| 2026-02-26 | EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents | Taku Komura Team | 2602.23205 | HJFY | |
| 2026-02-26 | Residual Koopman Spectral Profiling for Predicting and Preventing Transformer Training Instability | Yutaka Matsuo Team | 2602.22988 | HJFY | |
| 2026-02-26 | Automated Robotic Needle Puncture for Percutaneous Dilatational Tracheostomy | Andrew Weightman Team | 2602.22952 | HJFY | |
| 2026-02-26 | DySL-VLA: Efficient Vision-Language-Action Model Inference via Dynamic-Static Layer-Skipping for Robot Manipulation | Meng Li Team | 2602.22896 | HJFY | |
| 2026-02-26 | GraspLDP: Towards Generalizable Grasping Policy via Latent Diffusion | Di Huang Team | 2602.22862 | HJFY | |
| 2026-02-26 | ArtPro: Self-Supervised Articulated Object Reconstruction with Adaptive Integration of Mobility Proposals | Changhe Tu Team | 2602.22666 | HJFY | |
| 2026-02-26 | Rethinking the Practicality of Vision-language-action Model: A Comprehensive Benchmark and An Improved Baseline | Haoang Li Team | 2602.22663 | HJFY | |
| 2026-02-26 | Metamorphic Testing of Vision-Language Action-Enabled Robots | Aitor Arrieta Team | 2602.22579 | HJFY | |
| 2026-02-26 | SignVLA: A Gloss-Free Vision-Language-Action Framework for Real-Time Sign Language-Guided Robotic Manipulation | Zezhi Tang Team | 2602.22514 | HJFY | |
| 2026-02-25 | When to Act, Ask, or Learn: Uncertainty-Aware Policy Steering | Andrea Bajcsy Team | 2602.22474 | HJFY | |
| 2026-02-24 | NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning | Wei Zhan Team | 2602.21172 | HJFY | |
| 2026-02-24 | ActionReasoning: Robot Action Reasoning in 3D Space with LLM for Robotic Brick Stacking | Brian Sheil Team | 2602.21161 | HJFY | |
| 2026-02-24 | HALO: A Unified Vision-Language-Action Model for Embodied Multimodal Chain-of-Thought Reasoning | Song Guo Team | 2602.21157 | HJFY | |
| 2026-02-24 | From Perception to Action: An Interactive Benchmark for Vision Reasoning | Roy Ka-Wei Lee Team | 2602.21015 | HJFY | |
| 2026-02-24 | Notes-to-Self: Scratchpad Augmented VLAs for Memory Dependent Manipulation Tasks | Roland Memisevic Team | 2602.21013 | HJFY | |
| 2026-02-24 | Toward an Agentic Infused Software Ecosystem | Mark Marron Team | 2602.20979 | HJFY | |
| 2026-02-24 | IG-RFT: An Interaction-Guided RL Framework for VLA Models in Long-Horizon Robotic Manipulation | Huixu Dong Team | 2602.20715 | HJFY | |
| 2026-02-24 | How Foundational Skills Influence VLM-based Embodied Agents: A Native Perspective | Tong Xu Team | 2602.20687 | HJFY | |
| 2026-02-24 | Recursive Belief Vision Language Model | Nirav Patel Team | 2602.20659 | HJFY | |
| 2026-02-24 | Efficient and Explainable End-to-End Autonomous Driving via Masked Vision-Language-Action Diffusion | Ziran Wang Team | 2602.20577 | HJFY | |
| 2026-02-19 | When Vision Overrides Language: Evaluating and Mitigating Counterfactual Failures in VLAs | Mingyu Ding Team | 2602.17659 | HJFY | |
| 2026-02-19 | What Breaks Embodied AI Security: LLM Vulnerabilities, CPS Flaws, or Something Else? | Yue Zhang Team | 2602.17345 | HJFY | |
| 2026-02-19 | FRAPPE: Infusing World Modeling into Generalist Policies via Multiple Future Representation Alignment | Donglin Wang Team | 2602.17259 | HJFY | |
| 2026-02-19 | Web Verbs: Typed Abstractions for Reliable Task Composition on the Agentic Web | Suman Nath Team | 2602.17245 | HJFY | |
| 2026-02-19 | Benchmarking the Effects of Object Pose Estimation and Reconstruction on Robotic Grasping Success | Torsten Sattler Team | 2602.17101 | HJFY | |
| 2026-02-18 | MALLVI: a multi agent framework for integrated generalized robotics manipulation | Babak Khalaj Team | 2602.16898 | HJFY | |
| 2026-02-18 | EgoScale: Scaling Dexterous Manipulation with Diverse Egocentric Human Data | Linxi Fan Team | 2602.16710 | HJFY | |
| 2026-02-19 | RoboGene: Boosting VLA Pre-training via Diversity-Driven Agentic Framework for Real-World Task Generation | Jian Tang Team | 2602.16444 | HJFY | |
| 2026-02-17 | Learning to Retrieve Navigable Candidates for Efficient Vision-and-Language Navigation | Lina Yao Team | 2602.15724 | HJFY | |
| 2026-02-17 | The Next Paradigm Is User-Centric Agent, Not Platform-Centric Service | Enhong Chen Team | 2602.15682 | HJFY | |
| 2026-02-12 | Scaling Verification Can Be More Effective than Scaling Policy Learning for Vision-Language-Action Alignment | Marco Pavone Team | 2602.12281 | HJFY | |
| 2026-02-12 | Embodied AI Agents for Team Collaboration in Co-located Blue-Collar Work | Thomas Olsson Team | 2602.12136 | HJFY | |
| 2026-02-12 | GigaBrain-0.5M*: a VLA That Learns From World Model-Based Reinforcement Learning | Zheng Zhu Team | 2602.12099 | HJFY | |
| 2026-02-12 | VLAW: Iterative Co-Improvement of Vision-Language-Action Policy and World Model | Chelsea Finn Team | 2602.12063 | HJFY | |
| 2026-02-12 | HoloBrain-0 Technical Report | Zhizhong Su Team | 2602.12062 | HJFY | |
| 2026-02-12 | When would Vision-Proprioception Policies Fail in Robotic Manipulation? | Di Hu Team | 2602.12032 | HJFY | |
| 2026-02-12 | Robot-DIFT: Distilling Diffusion Features for Geometrically Consistent Visuomotor Control | Georgia Chalvatzaki Team | 2602.11934 | HJFY | |
| 2026-02-12 | JEPA-VLA: Video Predictive Embedding is Needed for VLA Models | Mingsheng Long Team | 2602.11832 | HJFY | |
| 2026-02-12 | Clutt3R-Seg: Sparse-view 3D Instance Segmentation for Language-grounded Grasping in Cluttered Scenes | Ayoung Kim Team | 2602.11660 | HJFY | |
| 2026-02-12 | ViTaS: Visual Tactile Soft Fusion Contrastive Learning for Visuomotor Learning | Huazhe Xu Team | 2602.11643 | HJFY | |
| 2026-02-10 | MVISTA-4D: View-Consistent 4D World Model with Test-Time Action Inference for Robotic Manipulation | Xiangyu Yue Team | 2602.09878 | HJFY | |
| 2026-02-10 | Code2World: A GUI World Model via Renderable Code Generation | Kevin Qinghong Lin Team | 2602.09856 | HJFY | |
| 2026-02-10 | BagelVLA: Enhancing Long-Horizon Manipulation via Interleaved Vision-Language-Action Generation | Jianyu Chen Team | 2602.09849 | HJFY | |
| 2026-02-10 | NavDreamer: Video Models as Zero-Shot 3D Navigators | Fei Gao Team | 2602.09765 | HJFY | |
| 2026-02-10 | Rethinking Visual-Language-Action Model Scaling: Alignment, Mixture, and Regularization | Qin Jin Team | 2602.09722 | HJFY | |
| 2026-02-10 | AutoFly: Vision-Language-Action Model for UAV Autonomous Navigation in the Wild | Hui Xiong Team | 2602.09657 | HJFY | |
| 2026-02-10 | VideoAfford: Grounding 3D Affordance from Human-Object-Interaction Videos via Multimodal Large Language Model | Hui Xiong Team | 2602.09638 | HJFY | |
| 2026-02-10 | Hand2World: Autoregressive Egocentric Interaction Generation via Free-Space Hand Gestures | Xingang Pan Team | 2602.09600 | HJFY | |
| 2026-02-10 | Preference Aligned Visuomotor Diffusion Policies for Deformable Object Manipulation | Danica Kragic Team | 2602.09583 | HJFY | |
| 2026-02-10 | AUHead: Realistic Emotional Talking Head Generation via Action Units Control | Tat-Seng Chua Team | 2602.09534 | HJFY | |
| 2026-02-04 | Capturing Visual Environment Structure Correlates with Control Performance | Yu-Xiong Wang Team | 2602.04880 | HJFY | |
| 2026-02-04 | CoWTracker: Tracking by Warping instead of Correlation | Andrea Vedaldi Team | 2602.04877 | HJFY | |
| 2026-02-04 | Relational Scene Graphs for Object Grounding of Natural Language Commands | Ville Kyrki Team | 2602.04635 | HJFY | |
| 2026-02-04 | Act, Sense, Act: Learning Non-Markovian Active Perception Strategies from Large-Scale Egocentric Human Data | Wenzhao Lian Team | 2602.04600 | HJFY | |
| 2026-02-04 | A Unified Complementarity-based Approach for Rigid-Body Manipulation and Motion Prediction | Riddhiman Laha Team | 2602.04522 | HJFY | |
| 2026-02-04 | EgoActor: Grounding Task Planning into Spatial-aware Egocentric Actions for Humanoid Robots via Visual-Language Models | Börje F. Karlsson Team | 2602.04515 | HJFY | |
| 2026-02-04 | Self-evolving Embodied AI | Wenwu Zhu Team | 2602.04411 | HJFY | |
| 2026-02-04 | GeneralVLA: Generalizable Vision-Language-Action Models with Knowledge-Guided Trajectory Planning | Hao Tang Team | 2602.04315 | HJFY | |
| 2026-02-04 | Viewpoint Matters: Dynamically Optimizing Viewpoints with Masked Autoencoder for Visual Manipulation | Wenzhao Lian Team | 2602.04243 | HJFY | |
| 2026-02-04 | GeoLanG: Geometry-Aware Language-Guided Grasping with Unified RGB-D Multimodal Learning | Hongliang Ren Team | 2602.04231 | HJFY | |
| 2026-02-02 | TIC-VLA: A Think-in-Control Vision-Language-Action Model for Robot Navigation in Dynamic Environments | Jiaqi Ma Team | 2602.02459 | HJFY | |
| 2026-02-02 | World-Gymnast: Training Robots with Reinforcement Learning in a World Model | Sherry Yang Team | 2602.02454 | HJFY | |
| 2026-02-02 | SoMA: A Real-to-Sim Neural Simulator for Robotic Soft-body Manipulation | Jiangmiao Pang Team | 2602.02402 | HJFY | |
| 2026-02-02 | MAIN-VLA: Modeling Abstraction of Intention and eNvironment for Vision-Language-Action Models | Lemiao Qiu Team | 2602.02212 | HJFY | |
| 2026-02-02 | FD-VLA: Force-Distilled Vision-Language-Action Model for Contact-Rich Manipulation | Haiyue Zhu Team | 2602.02142 | HJFY | |
| 2026-02-02 | See2Refine: Vision-Language Feedback Improves LLM-Based eHMI Action Designers | Takeo Igarashi Team | 2602.02063 | HJFY | |
| 2026-02-02 | Concept-Based Dictionary Learning for Inference-Time Safety in Vision Language Action Models | Di Wang Team | 2602.01834 | HJFY | |
| 2026-02-02 | From Knowing to Doing Precisely: A General Self-Correction and Termination Framework for VLA models | Jianzong Wang Team | 2602.01811 | HJFY | |
| 2026-02-02 | AgenticLab: A Real-World Robot Agent Platform that Can See, Think, and Act | Yu She Team | 2602.01662 | HJFY | |
| 2026-02-02 | From Perception to Action: Spatial AI Agents and World Models | Esteban Rojas Team | 2602.01644 | HJFY | |
| 2026-01-30 | Temporally Coherent Imitation Learning via Latent Action Flow Matching for Robotic Manipulation | Wu Songwei et al. | 2601.23087 | null | |
| 2026-01-30 | EAG-PT: Emission-Aware Gaussians and Path Tracing for Indoor Scene Reconstruction and Editing | Xijie Yang et al. | 2601.23065 | null | |
| 2026-01-30 | Learning Geometrically-Grounded 3D Visual Representations for View-Generalizable Robotic Manipulation | Di Zhang et al. | 2601.22988 | null | |
| 2026-01-30 | Alignment among Language, Vision and Action Representations | Nicola Milano et al. | 2601.22948 | null | |
| 2026-01-30 | When Anomalies Depend on Context: Learning Conditional Compatibility for Anomaly Detection | Shashank Mishra et al. | 2601.22868 | null | |
| 2026-01-30 | Vision-Language Models Unlock Task-Centric Latent Actions | Alexander Nikulin et al. | 2601.22714 | null | |
| 2026-01-30 | Best-of-Q: Improving VLM agents with Q-function Action Ranking at Inference | Emilien Biré et al. | 2601.22701 | null | |
| 2026-01-30 | CARE: Multi-Task Pretraining for Latent Continuous Action Representation in Robot Control | Jiaqi Shi et al. | 2601.22467 | null | |
| 2026-01-29 | PoSafeNet: Safe Learning with Poset-Structured Neural Nets | Kiwan Wong et al. | 2601.22356 | null | |
| 2026-01-29 | DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation | Haozhe Xie et al. | 2601.22153 | null | |
| 2026-01-29 | PLANING: A Loosely Coupled Triangle-Gaussian Framework for Streaming 3D Reconstruction | Changjian Jiang et al. | 2601.22046 | null | |
| 2026-01-29 | PocketDP3: Efficient Pocket-Scale 3D Visuomotor Policy | Jinhao Zhang et al. | 2601.22018 | null | |
| 2026-01-29 | Causal World Modeling for Robot Control | Lin Li et al. | 2601.21998 | null | |
| 2026-01-29 | MoE-ACT: Improving Surgical Imitation Learning Policies through Supervised Mixture-of-Experts | Lorenzo Mazza et al. | 2601.21971 | null | |
| 2026-01-29 | Information Filtering via Variational Regularization for Robot Manipulation | Jinhao Zhang et al. | 2601.21926 | null | |
| 2026-01-29 | Dynamic Topology Awareness: Breaking the Granularity Rigidity in Vision-Language Navigation | Jiankun Peng et al. | 2601.21751 | null | |
| 2026-01-29 | CoFreeVLA: Collision-Free Dual-Arm Manipulation via Vision-Language-Action Model and Risk Estimation | Xuanran Zhai et al. | 2601.21712 | null | |
| 2026-01-29 | AIR-VLA: Vision-Language-Action Systems for Aerial Manipulation | Jianli Sun et al. | 2601.21602 | null | |
| 2026-01-29 | EmboCoach-Bench: Benchmarking AI Agents on Developing Embodied Robots | Zixing Lei et al. | 2601.21570 | null | |
Evaluation status is saved locally in the browser (localStorage) and is not synced across devices or browsers.
| Publish Date (YYYY-MM-DD) | Title | Authors | arXiv ID | HJFY | Evaluation |
|---|---|---|---|---|---|
| 2026-05-14 | Exploring Bottlenecks in VLM-LLM Navigation: How 3D Scene Understanding Capability Impacts Zero-Shot VLN | Ling Pei Team | 2605.14801 | HJFY | |
| 2026-05-13 | What Limits Vision-and-Language Navigation? | Renjing Xu Team | 2605.13328 | HJFY | |
| 2026-05-13 | HCSG: Human-Centric Semantic-Geometric Reasoning for Vision-Language Navigation | Haoang Li Team | 2605.13321 | HJFY | |
| 2026-05-11 | SleepWalk: A Three-Tier Benchmark for Stress-Testing Instruction-Guided Vision-Language Navigation | Amitava Das Team | 2605.10376 | HJFY | |
| 2026-05-11 | Plan in Sandbox, Navigate in Open Worlds: Learning Physics-Grounded Abstracted Experience for Embodied Navigation | Tianrui Li Team | 2605.10118 | HJFY | |
| 2026-05-13 | SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning | Lianhui Qin Team | 2605.09423 | HJFY | |
| 2026-05-09 | LCGNav: Local Candidate-Aware Geometric Enhancement for General Topological Planning in Vision-Language Navigation | Ying Xu Team | 2605.09053 | HJFY | |
| 2026-05-08 | PathPainter: Transferring the Generalization Ability of Image Generation Models to Embodied Navigation | Fei Gao Team | 2605.07496 | HJFY | |
| 2026-05-07 | Cross-Modal Navigation with Multi-Agent Reinforcement Learning | Christopher Amato Team | 2605.06595 | HJFY | |
| 2026-05-08 | NavOne: One-Step Global Planning for Vision-Language Navigation on Top-Down Maps | Xuemiao Xu Team | 2605.06317 | HJFY | |
| 2026-05-04 | Change-Robust Online Spatial-Semantic Topological Mapping | Harold Soh Team | 2605.02227 | HJFY | |
| 2026-05-03 | Multi-Scale Gaussian-Language Map for Zero-shot Embodied Navigation and Reasoning | Shuqiang Jiang Team | 2605.01736 | HJFY | |
| 2026-05-03 | TrajRAG: Retrieving Geometric-Semantic Experience for Zero-Shot Object Navigation | Shuqiang Jiang Team | 2605.01700 | HJFY | |
| 2026-04-30 | SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation | Nanning Zheng Team | 2604.27620 | HJFY | |
| 2026-04-30 | World2Minecraft: Occupancy-Driven Simulated Scenes Construction | Xin Tan Team | 2604.27578 | HJFY | |
| 2026-04-29 | Three-Step Nav: A Hierarchical Global-Local Planner for Zero-Shot Vision-and-Language Navigation | Laurent Itti Team | 2604.26946 | HJFY | |
| 2026-04-28 | Where Did It Go Wrong? Capability-Oriented Failure Attribution for Vision-and-Language Navigation Agents | Fanjiang Xu Team | 2604.25161 | HJFY | |
| 2026-04-27 | FreqCache: Accelerating Embodied VLN Models with Adaptive Frequency-Guided Token Caching | Xiang Chen Team | 2604.24391 | HJFY | |
| 2026-04-23 | A Deployable Embodied Vision-Language Navigation System with Hierarchical Cognition and Context-Aware Exploration | Lihua Xie Team | 2604.21363 | HJFY | |
| 2026-04-22 | Self-Predictive Representation for Autonomous UAV Object-Goal Navigation | Bruno J. T. Fernandes Team | 2604.21130 | HJFY | |
| 2026-04-21 | LiveVLN: Breaking the Stop-and-Go Loop in Vision-Language Navigation | Feng Zheng Team | 2604.19536 | HJFY | |
| 2026-04-21 | The Essence of Balance for Self-Improving Agents in Vision-and-Language Navigation | Jingwen Fu Team | 2604.19064 | HJFY | |
| 2026-04-21 | Explore Like Humans: Autonomous Exploration with Online SG-Memo Construction for Embodied Agents | Mu Xu Team | 2604.19034 | HJFY | |
| 2026-04-20 | Instruction-as-State: Environment-Guided and State-Conditioned Semantic Understanding for Embodied Navigation | Jingwen Fu Team | 2604.18223 | HJFY | |
| 2026-04-19 | Dual-Anchoring: Addressing State Drift in Vision-Language Navigation | Jianyi Liu Team | 2604.17473 | HJFY | |
| 2026-04-19 | LookasideVLN: Direction-Aware Aerial Vision-and-Language Navigation | Guanbin Li Team | 2604.17190 | HJFY | |
| 2026-04-18 | Mini-BEHAVIOR-Gran: Revealing U-Shaped Effects of Instruction Granularity on Language-Guided Embodied Agents | Hamid Rezatofighi Team | 2604.17019 | HJFY | |
| 2026-04-18 | Rule-VLN: Bridging Perception and Compliance via Semantic Reasoning and Geometric Rectification | Xiaowen Chu Team | 2604.16993 | HJFY | |
| 2026-04-17 | FineCog-Nav: Integrating Fine-grained Cognitive Modules for Zero-shot Multimodal UAV Navigation | Jing Huo Team | 2604.16298 | HJFY | |
| 2026-04-15 | Vision-and-Language Navigation for UAVs: Progress, Challenges, and a Research Roadmap | Ji Pei Team | 2604.13654 | HJFY | |
| 2026-04-14 | OVAL: Open-Vocabulary Augmented Memory Model for Lifelong Object Goal Navigation | Xueqian Wang Team | 2604.12872 | HJFY | |
| 2026-04-14 | DeCoNav: Dialog enhanced Long-Horizon Collaborative Vision-Language Navigation | Xuelong Li Team | 2604.12486 | HJFY | |
| 2026-04-13 | Fast-SegSim: Real-Time Open-Vocabulary Segmentation for Robotics in Simulation | Yue Wang Team | 2604.10951 | HJFY | |
| 2026-04-19 | VLN-NF: Feasibility-Aware Vision-and-Language Navigation with False-Premise Instructions | Winston H. Hsu Team | 2604.10533 | HJFY | |
| 2026-04-10 | HTNav: A Hybrid Navigation Framework with Tiered Structure for Urban Aerial Vision-and-Language Navigation | Jie Qin Team | 2604.08883 | HJFY | |
| 2026-04-09 | HiRO-Nav: Hybrid ReasOning Enables Efficient Embodied Navigation | Chunyan Miao Team | 2604.08232 | HJFY | |
| 2026-04-09 | How Far Are Large Multimodal Models from Human-Level Spatial Action? A Benchmark for Goal-Oriented Embodied Navigation in Urban Airspace | Xinlei Chen Team | 2604.07973 | HJFY | |
| 2026-04-09 | WorldMAP: Bootstrapping Vision-Language Navigation Trajectory Prediction with Generative World Models | Zhibo Chen Team | 2604.07957 | HJFY | |
| 2026-04-09 | Vision-Language Navigation for Aerial Robots: Towards the Era of Large Language Models | Wen Yao Team | 2604.07705 | HJFY | |
| 2026-04-06 | ROSClaw: A Hierarchical Semantic-Physical Framework for Heterogeneous Multi-Agent Collaboration | Jie Chen Team | 2604.04664 | HJFY | |
| 2026-04-05 | Hypothesis Graph Refinement: Hypothesis-Driven Exploration with Cascade Error Correction for Embodied Navigation | Qing Li Team | 2604.04108 | HJFY | |
| 2026-04-03 | FSUNav: A Cerebrum-Cerebellum Architecture for Fast, Safe, and Universal Zero-Shot Goal-Oriented Navigation | Wei Zhang Team | 2604.03139 | HJFY | |
| 2026-04-02 | Stop Wandering: Efficient Vision-Language Navigation via Metacognitive Reasoning | Guozi Liu Team | 2604.02318 | HJFY | |
| 2026-03-31 | Benchmarking Interaction, Beyond Policy: a Reproducible Benchmark for Collaborative Instance Object Navigation | Loris Bazzani Team | 2604.00265 | HJFY | |
| 2026-03-31 | LatentPilot: Scene-Aware Vision-and-Language Navigation by Dreaming Ahead with Latent Visual Reasoning | Xiaojun Chang Team | 2603.29165 | HJFY | |
| 2026-03-30 | CARLA-Air: Fly Drones Inside a CARLA World -- A Unified Infrastructure for Air-Ground Embodied Intelligence | Hong Zhang Team | 2603.28032 | HJFY | |
| 2026-03-29 | Structured Observation Language for Efficient and Generalizable Vision-Language Navigation | Jun Ma Team | 2603.27577 | HJFY | |
| 2026-03-27 | Beyond Textual Knowledge-Leveraging Multimodal Knowledge Bases for Enhancing Vision-and-Language Navigation | Liejun Wang Team | 2603.26859 | HJFY | |
| 2026-03-27 | SpatialAnt: Autonomous Zero-Shot Robot Navigation via Active Scene Reconstruction and Visual Anticipation | Qi Wu Team | 2603.26837 | HJFY | |
| 2026-03-23 | IGV-RRT: Prior-Real-Time Observation Fusion for Active Object Search in Changing Environments | Chaoqun Wang Team | 2603.21887 | HJFY | |
| 2026-03-22 | DyGeoVLN: Infusing Dynamic Geometry Foundation Model into Vision-Language Navigation | Sung-Eui Yoon Team | 2603.21269 | HJFY | |
| 2026-03-22 | SpatialFly: Geometry-Guided Representation Alignment for UAV Vision-and-Language Navigation in Urban Environments | Xiangyang Ji Team | 2603.21046 | HJFY | |
| 2026-03-21 | Does Peer Observation Help? Vision-Sharing Collaboration for Vision-Language Navigation | Qi Wu Team | 2603.20804 | HJFY | |
| 2026-03-20 | Memory Over Maps: 3D Object Localization Without Reconstruction | Marc Pollefeys Team | 2603.20530 | HJFY | |
| 2026-03-20 | HUGE-Bench: A Benchmark for High-Level UAV Vision-Language-Action Tasks | Mingming Gong Team | 2603.19822 | HJFY | |
| 2026-03-20 | CeRLP: A Cross-embodiment Robot Local Planning Framework for Visual Navigation | Wei Zhang Team | 2603.19602 | HJFY | |
| 2026-03-19 | NavTrust: Benchmarking Trustworthiness for Embodied Navigation | Jiachen Li Team | 2603.19229 | HJFY | |
| 2026-03-19 | Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation | Nakul Gopalan Team | 2603.19166 | HJFY | |
| 2026-03-19 | REST: Receding Horizon Explorative Steiner Tree for Zero-Shot Object-Goal Navigation | Hui Kong Team | 2603.18624 | HJFY | |
| 2026-03-19 | SR-Nav: Spatial Relationships Matter for Zero-shot Object Goal Navigation | Yinlong Yan Team | 2603.18443 | HJFY | |
| 2026-03-18 | GoalVLM: VLM-driven Object Goal Navigation for Multi-Agent System | Dzmitry Tsetserukou Team | 2603.18210 | HJFY | |
| 2026-03-18 | AgentVLN: Towards Agentic Vision-and-Language Navigation | Shengjun Huang Team | 2603.17670 | HJFY | |
| 2026-03-18 | P$^{3}$Nav: End-to-End Perception, Prediction and Planning for Vision-and-Language Navigation | Haoang Li Team | 2603.17459 | HJFY | |
| 2026-03-18 | FloorPlan-VLN: A New Paradigm for Floor Plan Guided Vision-Language Navigation | Liang Wang Team | 2603.17437 | HJFY | |
| 2026-03-18 | OmniVLN: Omnidirectional 3D Perception and Token-Efficient LLM Reasoning for Visual-Language Navigation across Air and Ground Platforms | Lihua Xie Team | 2603.17351 | HJFY | |
| 2026-03-16 | EmergeNav: Structured Embodied Inference for Zero-Shot Vision-and-Language Navigation in Continuous Environments | Xiaoguang Ma Team | 2603.16947 | HJFY | |
| 2026-03-17 | SignNav: Leveraging Signage for Semantic Visual Navigation in Large-Scale Indoor Environments | Hui Kong Team | 2603.16166 | HJFY | |
| 2026-03-16 | FlatLands: Generative Floormap Completion From a Single Egocentric View | Rahul Shome Team | 2603.16016 | HJFY | |
| 2026-03-16 | Nonequilibrium energetics of sensing and actuation by a smart active particle | Lorenzo Piro Team | 2603.15602 | HJFY | |
| 2026-03-16 | Trajectory-Diversity-Driven Robust Vision-and-Language Navigation | Yihong Gong Team | 2603.15370 | HJFY | |
| 2026-03-16 | HiMemVLN: Enhancing Reliability of Open-Source Zero-Shot Vision-and-Language Navigation with Hierarchical Memory System | Ce Hao Team | 2603.14807 | HJFY | |
| 2026-03-13 | DecoVLN: Decoupling Observation, Reasoning, and Correction for Vision-and-Language Navigation | Shengjun Huang Team | 2603.13133 | HJFY | |
| 2026-03-13 | GoalSwarm: Multi-UAV Semantic Coordination for Open-Vocabulary Object Navigation | Dzmitry Tsetserukou Team | 2603.12908 | HJFY | |
| 2026-03-13 | HaltNav: Reactive Visual Halting over Lightweight Topological Priors for Robust Vision-Language Navigation | Sören Schwertfeger Team | 2603.12696 | HJFY | |
| 2026-03-11 | OnFly: Onboard Zero-Shot Aerial Vision-Language Navigation toward Safety and Efficiency | Boyu Zhou Team | 2603.10682 | HJFY | |
| 2026-03-10 | Let's Reward Step-by-Step: Step-Aware Contrastive Alignment for Vision-Language Navigation in Continuous Environments | Yi Yang Team | 2603.09740 | HJFY | |
| 2026-03-10 | Implicit Geometry Representations for Vision-and-Language Navigation from Web Videos | Ivan Laptev Team | 2603.09259 | HJFY | |
| 2026-03-10 | SPAN-Nav: Generalized Spatial Awareness for Versatile Vision-Language Navigation | He Wang Team | 2603.09163 | HJFY | |
| 2026-03-10 | PM-Nav: Priori-Map Guided Embodied Navigation in Functional Buildings | Xiaoguang Ma Team | 2603.09113 | HJFY | |
| 2026-03-09 | From Reactive to Map-Based AI: Tuned Local LLMs for Semantic Zone Inference in Object-Goal Navigation | Kanji Tanaka Team | 2603.08086 | HJFY | |
| 2026-03-09 | ViSA-Enhanced Aerial VLN: A Visual-Spatial Reasoning Enhanced Framework for Aerial Vision-Language Navigation | Chenghao Lin Team | 2603.08007 | HJFY | |
| 2026-03-09 | CMMR-VLN: Vision-and-Language Navigation via Continual Multimodal Memory Retrieval | Xiaoguang Ma Team | 2603.07997 | HJFY | |
| 2026-03-08 | MWM: Mobile World Models for Action-Conditioned Consistent Prediction | Hao Tang Team | 2603.07799 | HJFY | |
| 2026-03-07 | FreeFly-Thinking: Aligning Chain-of-Thought Reasoning with Continuous UAV Navigation | Tao Li Team | 2603.07181 | HJFY | |
| 2026-03-10 | VLN-Cache: Enabling Token Caching for VLN Models with Visual/Semantic Dynamics Awareness | Xiang Chen Team | 2603.07080 | HJFY | |
| 2026-03-06 | History-Conditioned Spatio-Temporal Visual Token Pruning for Efficient Vision-Language Navigation | Christopher Rasmussen Team | 2603.06480 | HJFY | |
| 2026-03-06 | Lifelong Embodied Navigation Learning | Zhi Han Team | 2603.06073 | HJFY | |
| 2026-03-05 | OpenFrontier: General Navigation with Visual-Language Grounded Frontiers | Hermann Blum Team | 2603.05377 | HJFY | |
| 2026-03-04 | Efficient Autonomous Navigation of a Quadruped Robot in Underground Mines on Edge Hardware | Kwame Awuah-Offei Team | 2603.04470 | HJFY | |
| 2026-03-04 | RAGNav: A Retrieval-Augmented Topological Reasoning Framework for Multi-Goal Visual-Language Navigation | Qiangian Bai Team | 2603.03745 | HJFY | |
| 2026-03-04 | PROSPECT: Unified Streaming Vision-Language Navigation via Semantic-Spatial Fusion and Latent Predictive Representation | Feng Gao Team | 2603.03739 | HJFY | |
| 2026-03-03 | MA-CoNav: A Master-Slave Multi-Agent Framework with Hierarchical Collaboration and Dual-Level Reflection for Long-Horizon Embodied VLN | Qianqian Bai Team | 2603.03024 | HJFY | |
| 2026-03-03 | TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation | Baocai Yin Team | 2603.02972 | HJFY | |
| 2026-03-03 | Agentic Self-Evolutionary Replanning for Embodied Navigation | Chengzhong Xu Team | 2603.02772 | HJFY | |
| 2026-03-02 | CHOP: Counterfactual Human Preference Labels Improve Obstacle Avoidance in Visuomotor Navigation Policies | Dinesh Manocha Team | 2603.02004 | HJFY | |
| 2026-02-27 | Enhancing Vision-Language Navigation with Multimodal Event Knowledge from Real-World Indoor Tour Videos | Haoang Li Team | 2602.23937 | HJFY | |
| 2026-02-20 | CapNav: Benchmarking Vision Language Models on Capability-conditioned Indoor Navigation | Jon Froehlich Team | 2602.18424 | HJFY | |
| 2026-02-17 | Learning to Retrieve Navigable Candidates for Efficient Vision-and-Language Navigation | Lina Yao Team | 2602.15724 | HJFY | |
| 2026-02-17 | One Agent to Guide Them All: Empowering MLLMs for Vision-and-Language Navigation via Explicit World Representation | Qi Wu Team | 2602.15400 | HJFY | |
| 2026-02-16 | pFedNavi: Structure-Aware Personalized Federated Vision-Language Navigation for Embodied AI | Haibing Guan Team | 2602.14401 | HJFY | |
| 2026-02-12 | ABot-N0: Technical Report on the VLA Foundation Model for Versatile Embodied Navigation ABot-N0:面向通用具身导航的视觉-语言-动作基础模型技术报告 摘要 |
Mu Xu Team | 2602.11598 | HJFY |
|
| 2026-02-10 | Hydra-Nav: Object Navigation via Adaptive Dual-Process Reasoning Hydra-Nav:基于自适应双过程推理的目标导航 摘要 |
Yiming Gan Team | 2602.09972 | HJFY |
|
| 2026-02-10 | AutoFly: Vision-Language-Action Model for UAV Autonomous Navigation in the Wild AutoFly:面向野外无人机自主导航的视觉-语言-动作模型 摘要 |
Hui Xiong Team | 2602.09657 | HJFY |
|
| 2026-02-09 | When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning 何时想象与想象多少:基于世界模型的自适应测试时缩放用于视觉空间推理 摘要 |
Mohit Bansal Team | 2602.08236 | HJFY |
|
| 2026-02-10 | LCLA: Language-Conditioned Latent Alignment for Vision-Language Navigation LCLA:面向视觉语言导航的语言条件化潜在对齐框架 摘要 |
Soumik Sarkar Team | 2602.07629 | HJFY |
|
| 2026-02-06 | Bridging the Indoor-Outdoor Gap: Vision-Centric Instruction-Guided Embodied Navigation for the Last Meters 弥合室内外鸿沟:面向最后几米的视觉中心化指令引导具身导航 摘要 |
Mu Xu Team | 2602.06427 | HJFY |
|
| 2026-02-06 | Nipping the Drift in the Bud: Retrospective Rectification for Robust Vision-Language Navigation 防微杜渐:基于回溯修正的鲁棒视觉语言导航 摘要 |
Weiying Xie Team | 2602.06356 | HJFY |
|
| 2026-02-05 | Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation 稀疏视频生成推动现实世界超视距视觉语言导航 摘要 |
Hongyang Li Team | 2602.05827 | HJFY |
|
| 2026-02-05 | Allocentric Perceiver: Disentangling Allocentric Reasoning from Egocentric Visual Priors via Frame Instantiation 他者中心感知器:通过框架实例化将他者中心推理与自我中心视觉先验解耦 摘要 |
Weiming Zhang Team | 2602.05789 | HJFY |
|
| 2026-02-05 | MerNav: A Highly Generalizable Memory-Execute-Review Framework for Zero-Shot Object Goal Navigation MerNav:一种高度可泛化的记忆-执行-回顾框架,用于零样本目标导航 摘要 |
Mu Xu Team | 2602.05467 | HJFY |
|
| 2026-02-02 | LangMap: A Hierarchical Benchmark for Open-Vocabulary Goal Navigation LangMap:面向开放词汇目标导航的分层基准 摘要 |
Anton van den Hengel Team | 2602.02220 | HJFY |
|
| 2026-01-31 | APEX: A Decoupled Memory-based Explorer for Asynchronous Aerial Object Goal Navigation APEX:一种用于异步空中目标导航的解耦记忆型探索器 摘要 |
Shuo Yang Team | 2602.00551 | HJFY |
|
| 2026-02-03 | MapDream: Task-Driven Map Learning for Vision-Language Navigation MapDream:面向视觉语言导航的任务驱动地图学习 摘要 |
Zhaoxin Fan Team | 2602.00222 | HJFY |
|
| 2026-01-29 | Dynamic Topology Awareness: Breaking the Granularity Rigidity in Vision-Language Navigation 动态拓扑感知:打破视觉语言导航中的粒度僵化 摘要 |
Xiaoming Wang Team | 2601.21751 | HJFY |
|
| 2026-01-26 | DV-VLN: Dual Verification for Reliable LLM-Based Vision-and-Language Navigation DV-VLN:基于大语言模型的视觉与语言导航双重验证可靠框架 摘要 |
Shoujun Zhou Team | 2601.18492 | HJFY |
|
| 2026-01-26 | NaVIDA: Vision-Language Navigation with Inverse Dynamics Augmentation NaVIDA:基于逆动力学增强的视觉语言导航 摘要 |
Feng Zheng Team | 2601.18188 | HJFY |
|
| 2026-01-22 | AION: Aerial Indoor Object-Goal Navigation Using Dual-Policy Reinforcement Learning AION:基于双策略强化学习的空中室内目标导航系统 摘要 |
Lin Zhao Team | 2601.15614 | HJFY |
|
| 2026-01-23 | FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation FantasyVLN:面向视觉语言导航的统一多模态思维链推理框架 摘要 |
Yonggang Qi Team | 2601.13976 | HJFY |
|
| 2026-01-19 | Spatial-VLN: Zero-Shot Vision-and-Language Navigation With Explicit Spatial Perception and Exploration Spatial-VLN:具备显式空间感知与探索能力的零样本视觉语言导航 摘要 |
Feitian Zhang Team | 2601.12766 | HJFY |
|
| 2026-01-14 | Towards Open Environments and Instructions: General Vision-Language Navigation via Fast-Slow Interactive Reasoning 迈向开放环境与指令:基于快慢交互推理的通用视觉语言导航 摘要 |
Yahong Han Team | 2601.09111 | HJFY |
|
| 2026-01-11 | Residual Cross-Modal Fusion Networks for Audio-Visual Navigation | Yi Wang et al. | 2601.08868 | null |
|
| 2026-01-13 | VLingNav: Embodied Navigation with Adaptive Reasoning and Visual-Assisted Linguistic Memory | Shaoan Wang et al. | 2601.08665 | null |
|
| 2026-01-12 | GROKE: Vision-Free Navigation Instruction Evaluation via Graph Reasoning on OpenStreetMap | Farzad Shami et al. | 2601.07375 | null |
|
Evaluation status is stored locally in the browser (localStorage) and is not synced across devices or browsers.
| Publish Date (YYYY-MM-DD) | Title | Authors | HJFY | Evaluation | |
|---|---|---|---|---|---|
| 2026-05-14 | Does Synthetic Layered Design Data Benefit Layered Design Decomposition? 合成分层设计数据是否有助于分层设计分解? 摘要 |
Qifeng Chen Team | 2605.15167 | HJFY |
|
| 2026-05-14 | On the Cultural Anachronism and Temporal Reasoning in Vision Language Models 论视觉语言模型中的文化时代错位与时间推理 摘要 |
Zhiqiang Shen Team | 2605.15071 | HJFY |
|
| 2026-05-14 | LATERN: Test-Time Context-Aware Explainable Video Anomaly Detection LATERN:测试时上下文感知的可解释视频异常检测 摘要 |
Muchao Ye Team | 2605.15054 | HJFY |
|
| 2026-05-14 | MHSA: A Lightweight Framework for Mitigating Hallucinations via Steered Attention in LVLMs MHSA:一种通过引导注意力减轻大型视觉语言模型幻觉的轻量级框架 摘要 |
Yu Wang Team | 2605.14966 | HJFY |
|
| 2026-05-14 | Octopus: History-Free Gradient Orthogonalization for Continual Learning in Multimodal Large Language Models 章鱼:面向多模态大语言模型持续学习的无历史梯度正交化方法 摘要 |
Chao Ma Team | 2605.14938 | HJFY |
|
| 2026-05-14 | Chain-of-Procedure: Hierarchical Visual-Language Reasoning for Procedural QA 程序链:面向程序性问答的分层视觉-语言推理 摘要 |
Derek F. Wong Team | 2605.14928 | HJFY |
|
| 2026-05-14 | SteerSeg: Attention Steering for Reasoning Video Segmentation SteerSeg:面向推理视频分割的注意力引导 摘要 |
Lars Petersson Team | 2605.14908 | HJFY |
|
| 2026-05-14 | MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models MemLens: 评估大型视觉-语言模型的多模态长期记忆能力 摘要 |
Simon See Team | 2605.14906 | HJFY |
|
| 2026-05-14 | Your CLIP has 164 dimensions of noise: Exploring the embeddings covariance eigenspectrum of contrastively pretrained vision-language transformers 你的CLIP含有164个噪声维度:对比预训练视觉-语言Transformer的嵌入协方差特征谱探索 摘要 |
Przemysław Biecek Team | 2605.14893 | HJFY |
|
| 2026-05-14 | Exploring Vision-Language Models for Online Signature Verification: A Zero-Shot Capability Study 探索视觉-语言模型用于在线签名验证:零样本能力研究 摘要 |
Javier Ortega-Garcia Team | 2605.14845 | HJFY |
|
| 2026-05-08 | Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment Proxy3D:通过语义聚类与对齐实现高效视语言模型的3D表征 摘要 |
Wenzhao Zheng Team | 2605.08064 | HJFY |
|
| 2026-05-08 | Object Hallucination-Free Reinforcement Unlearning for Vision-Language Models 面向视觉语言模型的无目标幻觉强化反学习 摘要 |
Jinsong Su Team | 2605.08031 | HJFY |
|
| 2026-05-08 | SphereVAD: Training-Free Video Anomaly Detection via Geodesic Inference on the Unit Hypersphere SphereVAD:基于单位超球面上测地线推理的无训练视频异常检测 摘要 |
Xiaochun Cao Team | 2605.08003 | HJFY |
|
| 2026-05-08 | MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence MedVIGIL:在视觉证据缺失情境下评估可信医学视觉语言模型 摘要 |
Xiang Li Team | 2605.07919 | HJFY |
|
| 2026-05-08 | Anisotropic Modality Align 各向异性模态对齐 摘要 |
Hui Xiong Team | 2605.07825 | HJFY |
|
| 2026-05-08 | GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning GazeVLM:通过内部注意力控制实现主动视觉进行多模态推理 摘要 |
Mattia Rigotti Team | 2605.07817 | HJFY |
|
| 2026-05-08 | Operating Within the Operational Design Domain: Zero-Shot Perception with Vision-Language Models 在运行设计域内运行:基于视觉语言模型的零样本感知 摘要 |
Plachetka Christopher Team | 2605.07649 | HJFY |
|
| 2026-05-08 | LithoBench: Benchmarking Large Multimodal Models for Remote-Sensing Lithology Interpretation LithoBench:面向遥感岩性解译的大规模多模态模型基准测试 摘要 |
Wei Han Team | 2605.07640 | HJFY |
|
| 2026-05-08 | PolarVLM: Bridging the Semantic-Physical Gap in Vision-Language Models PolarVLM:弥合视觉语言模型中的语义-物理鸿沟 摘要 |
Zhanyu Ma Team | 2605.07574 | HJFY |
|
| 2026-05-08 | Beyond GSD-as-Token: Continuous Scale Conditioning for Remote Sensing VLMs 超越GSD即Token:遥感视觉语言模型的连续尺度条件化 摘要 |
Yawei Li Team | 2605.07562 | HJFY |
|
| 2026-05-05 | StateVLM: A State-Aware Vision-Language Model for Robotic Affordance Reasoning StateVLM:用于机器人可操作属性推理的状态感知视觉-语言模型 摘要 |
Stefan Wermter Team | 2605.03927 | HJFY |
|
| 2026-05-05 | Task-Aware Scanning Parameter Configuration for Robotic Inspection Using Vision Language Embeddings and Hyperdimensional Computing 基于视觉语言嵌入与超维计算的机器人检测任务感知扫描参数配置 摘要 |
Farhad Imani Team | 2605.03909 | HJFY |
|
| 2026-05-05 | CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing CC-OCR V2: 面向真实世界文档处理的大规模多模态模型读写能力基准评测 摘要 |
Dayiheng Liu Team | 2605.03903 | HJFY |
|
| 2026-05-05 | Deco: Extending Personal Physical Objects into Pervasive AI Companion through a Dual-Embodiment Framework Deco:通过双具身框架将个人实体物品扩展为普适人工智能伴侣 摘要 |
Xuhai Xu Team | 2605.03882 | HJFY |
|
| 2026-05-05 | Quantifying the human visual exposome with vision language models 利用视觉语言模型量化人类视觉暴露组 摘要 |
Magdalena Katharina Wekenborg Team | 2605.03863 | HJFY |
|
| 2026-05-05 | Enhancing Visual Question Answering with Multimodal LLMs via Chain-of-Question Guided Retrieval-Augmented Generation 基于链式问题引导的检索增强生成增强多模态大语言模型视觉问答 摘要 |
Chia-Wen Lin Team | 2605.03790 | HJFY |
|
| 2026-05-05 | Before Forgetting, Learn to Remember: Revisiting Foundational Learning Failures in LVLM Unlearning Benchmarks 遗忘之前,先学会记忆:重新审视LVLM遗忘基准中的基础学习失败 摘要 |
YoungBin Kim Team | 2605.03759 | HJFY |
|
| 2026-05-05 | Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe 摘要 |
Hehe Fan Team | 2605.03677 | HJFY |
|
| 2026-05-05 | The Detector Teaches Itself: Lightweight Self-Supervised Adaptation for Open-Vocabulary Object Detection 检测器自我教导:面向开放词汇目标检测的轻量级自监督适应性方法 摘要 |
Changjae Oh Team | 2605.03642 | HJFY |
|
| 2026-05-05 | Erase Persona, Forget Lore: Benchmarking Multimodal Copyright Unlearning in Large Vision Language Models 抹除角色,遗忘设定:大型视觉语言模型中多模态版权遗忘的基准测试 摘要 |
YoungBin Kim Team | 2605.03547 | HJFY |
|
| 2026-04-30 | AEGIS: A Holistic Benchmark for Evaluating Forensic Analysis of AI-Generated Academic Images AEGIS:面向AI生成学术图片取证分析的全方位基准 摘要 |
Haihong E Team | 2604.28177 | HJFY |
|
| 2026-04-30 | PhyCo: Learning Controllable Physical Priors for Generative Motion PhyCo:学习可控物理先验以生成运动 摘要 |
Manmohan Chandraker Team | 2604.28169 | HJFY |
|
| 2026-04-30 | PRISM: Pre-alignment via Black-box On-policy Distillation for Multimodal Reinforcement Learning PRISM:多模态强化学习中基于黑盒在线策略蒸馏的预对齐方法 摘要 |
Chengwei Qin Team | 2604.28123 | HJFY |
|
| 2026-04-30 | FreeOcc: Training-Free Embodied Open-Vocabulary Occupancy Prediction FreeOcc:无需训练的具身开放词汇占据预测 摘要 |
Changhao Chen Team | 2604.28115 | HJFY |
|
| 2026-04-30 | SpecVQA: A Benchmark for Spectral Understanding and Visual Question Answering in Scientific Images SpecVQA:面向科学图像的光谱理解与视觉问答基准 摘要 |
Xi Fang Team | 2604.28039 | HJFY |
|
| 2026-04-30 | Echo-α: Large Agentic Multimodal Reasoning Model for Ultrasound Interpretation Echo-α:面向超声影像解读的大型智能多模态推理模型 摘要 |
Dacheng Tao Team | 2604.28011 | HJFY |
|
| 2026-04-30 | TransVLM: A Vision-Language Framework and Benchmark for Detecting Any Shot Transitions TransVLM:用于检测任意镜头转换的视觉-语言框架与基准 摘要 |
Mingming Gong Team | 2604.27975 | HJFY |
|
| 2026-04-30 | FineState-Bench: Benchmarking State-Conditioned Grounding for Fine-grained GUI State Setting FineState-Bench: 面向细粒度GUI状态设定的状态条件定位基准 摘要 |
Xiuying Chen Team | 2604.27974 | HJFY |
|
| 2026-04-30 | From Mirage to Grounding: Towards Reliable Multimodal Circuit-to-Verilog Code Generation 从幻象到根基:迈向可靠的多模态电路到Verilog代码生成 摘要 |
Xin Xi Team | 2604.27969 | HJFY |
|
| 2026-04-30 | The Effects of Visual Priming on Cooperative Behavior in Vision-Language Models 视觉启动对视觉语言模型中合作行为的影响 摘要 |
Kenneth J. K. Ong Team | 2604.27953 | HJFY |
|
| 2026-04-23 | When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs 当提示压倒视觉:LVLM中提示诱导的幻觉 摘要 |
Matthieu Cord Team | 2604.21911 | HJFY |
|
| 2026-04-23 | From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media 从编码本到视觉语言模型:评估社交媒体上气候变化的自动化视觉话语分析 摘要 |
Margret Keuper Team | 2604.21786 | HJFY |
|
| 2026-04-23 | Ramen: Robust Test-Time Adaptation of Vision-Language Models with Active Sample Selection Ramen:通过主动样本选择实现视觉-语言模型的鲁棒测试时自适应 摘要 |
Jingrui He Team | 2604.21728 | HJFY |
|
| 2026-04-23 | Seeing Isn't Believing: Uncovering Blind Spots in Evaluator Vision-Language Models 眼见不为实:揭示评估型视觉语言模型中的盲点 摘要 |
Mitesh M. Khapra Team | 2604.21523 | HJFY |
|
| 2026-04-23 | Do MLLMs Understand Pointing? Benchmarking and Enhancing Referential Reasoning in Egocentric Vision 多模态大语言模型理解指向吗?面向第一人称视角的指代推理基准构建与能力增强 摘要 |
Jie Zhou Team | 2604.21461 | HJFY |
|
| 2026-04-23 | VG-CoT: Towards Trustworthy Visual Reasoning via Grounded Chain-of-Thought VG-CoT:通过接地思维链迈向可信视觉推理 摘要 |
YoungBin Kim Team | 2604.21396 | HJFY |
|
| 2026-04-23 | A Deployable Embodied Vision-Language Navigation System with Hierarchical Cognition and Context-Aware Exploration 一种具有层次化认知与上下文感知探索的可部署具身视觉语言导航系统 摘要 |
Lihua Xie Team | 2604.21363 | HJFY |
|
| 2026-04-23 | Prototype-Based Test-Time Adaptation of Vision-Language Models 基于原型的视觉语言模型测试时自适应方法 摘要 |
Rongrong Ji Team | 2604.21360 | HJFY |
|
| 2026-04-23 | Symbolic Grounding Reveals Representational Bottlenecks in Abstract Visual Reasoning 符号化根基揭示抽象视觉推理中的表征瓶颈 摘要 |
Tanel Tammet Team | 2604.21346 | HJFY |
|
| 2026-04-23 | Latent Denoising Improves Visual Alignment in Large Multimodal Models 潜在去噪提升大型多模态模型的视觉对齐 摘要 |
Viktor Prasanna Team | 2604.21343 | HJFY |
|
| 2026-04-20 | Mitigating Multimodal Hallucination via Phase-wise Self-reward 通过分阶段自奖励缓解多模态幻觉 摘要 |
Min Zhang Team | 2604.17982 | HJFY |
|
| 2026-04-20 | From Heads to Neurons: Causal Attribution and Steering in Multi-Task Vision-Language Models 从注意力头到神经元:多任务视觉语言模型中的因果归因与调控 摘要 |
Ming Jiang Team | 2604.17941 | HJFY |
|
| 2026-04-20 | OneDrive: Unified Multi-Paradigm Driving with Vision-Language-Action Models OneDrive:基于视觉-语言-动作模型的多范式统一驾驶框架 摘要 |
Zhipeng Zhang Team | 2604.17915 | HJFY |
|
| 2026-04-20 | AeroRAG: Structured Multimodal Retrieval-Augmented LLM for Fine-Grained Aerial Visual Reasoning AeroRAG:面向细粒度航空视觉推理的结构化多模态检索增强大语言模型 摘要 |
Xuecheng Wu Team | 2604.17889 | HJFY |
|
| 2026-04-20 | SpaceDex: Generalizable Dexterous Grasping in Tiered Workspaces SpaceDex:分层工作空间中的泛化灵巧抓取 摘要 |
Ning Tan Team | 2604.17888 | HJFY |
|
| 2026-04-20 | Weakly-Supervised Referring Video Object Segmentation through Text Supervision 摘要 |
Hanli Wang Team | 2604.17797 | HJFY |
|
| 2026-04-20 | When Vision-Language Models Judge Without Seeing: Exposing Informativeness Bias 当视觉语言模型未见而判:揭示信息量偏见 摘要 |
Dan Roth Team | 2604.17768 | HJFY |
|
| 2026-04-19 | BioVLM: Routing Prompts, Not Parameters, for Cross-Modality Generalization in Biomedical VLMs BioVLM:通过路由提示而非参数实现生物医学视觉语言模型的跨模态泛化 摘要 |
Biplab Banerjee Team | 2604.17629 | HJFY |
|
| 2026-04-19 | PBSBench: A Multi-Level Vision-Language Framework and Benchmark for Hematopathology Whole Slide Image Interpretation PBSBench:用于血液病理学全玻片图像解读的多层次视觉-语言框架与基准 摘要 |
Ping Zhang Team | 2604.17570 | HJFY |
|
| 2026-04-19 | RS-HyRe-R1: A Hybrid Reward Mechanism to Overcome Perceptual Inertia for Remote Sensing Images Understanding RS-HyRe-R1:一种克服遥感图像理解中感知惯性的混合奖励机制 摘要 |
Haifeng Li Team | 2604.17504 | HJFY |
|
| 2026-04-15 | One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding 每个高度筛选帧一个令牌:迈向长视频理解的极致压缩 摘要 |
Yu-Xiong Wang Team | 2604.14149 | HJFY |
|
| 2026-04-15 | ROSE: Retrieval-Oriented Segmentation Enhancement ROSE:面向检索的分割增强框架 摘要 |
Yu-Gang Jiang Team | 2604.14147 | HJFY |
|
| 2026-04-15 | HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System HiVLA:一种以视觉定位为中心的分层具身操作系统 摘要 |
Ping Luo Team | 2604.14125 | HJFY |
|
| 2026-04-15 | Training-Free Semantic Multi-Object Tracking with Vision-Language Models 无需训练的语义多目标跟踪:基于视觉-语言模型的方法 摘要 |
Lorenzo Vaquero Team | 2604.14074 | HJFY |
|
| 2026-04-15 | Towards Unconstrained Human-Object Interaction 迈向无约束的人-物交互检测 摘要 |
Elisa Ricci Team | 2604.14069 | HJFY |
|
| 2026-04-15 | Decoding the Delta: Unifying Remote Sensing Change Detection and Understanding with Multimodal Large Language Models 解码变化:利用多模态大语言模型统一遥感变化检测与理解 摘要 |
Zide Fan Team | 2604.14044 | HJFY |
|
| 2026-04-15 | Seek-and-Solve: Benchmarking MLLMs for Visual Clue-Driven Reasoning in Daily Scenarios 寻解:评估多模态大语言模型在日常场景中基于视觉线索的推理能力 摘要 |
Xu Jia Team | 2604.14041 | HJFY |
|
| 2026-04-15 | POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch POINTS-Seeker:迈向从零开始训练多模态智能搜索模型 摘要 |
Weidi Xie Team | 2604.14029 | HJFY |
|
| 2026-04-15 | MAny: Merge Anything for Multimodal Continual Instruction Tuning MAny:面向多模态持续指令调优的任意合并框架 摘要 |
Kele Xu Team | 2604.14016 | HJFY |
|
| 2026-04-15 | Reward Design for Physical Reasoning in Vision-Language Models 视觉语言模型中物理推理的奖励设计 摘要 |
Sameera Horawalavithana Team | 2604.13993 | HJFY |
|
| 2026-04-06 | Rethinking Model Efficiency: Multi-Agent Inference with Large Models 重新思考模型效率:大模型的多智能体推理 摘要 |
Qi Qian Team | 2604.04929 | HJFY |
|
| 2026-04-06 | Vero: An Open RL Recipe for General Visual Reasoning Vero:面向通用视觉推理的开源强化学习方案 摘要 |
Zhuang Liu Team | 2604.04917 | HJFY |
|
| 2026-04-06 | ClickAIXR: On-Device Multimodal Vision-Language Interaction with Real-World Objects in Extended Reality ClickAIXR:扩展现实中与真实世界物体的端侧多模态视觉-语言交互 摘要 |
Ivan Viola Team | 2604.04905 | HJFY |
|
| 2026-04-06 | Beyond the Global Scores: Fine-Grained Token Grounding as a Robust Detector of LVLM Hallucinations 超越全局分数:细粒度标记定位作为LVLM幻觉的鲁棒检测器 摘要 |
Vu Minh Hieu Phan Team | 2604.04863 | HJFY |
|
| 2026-04-06 | The Blind Spot of Adaptation: Quantifying and Mitigating Forgetting in Fine-tuned Driving Models 适应中的盲点:量化与缓解微调驾驶模型中的灾难性遗忘 摘要 |
Zhipeng Zhang Team | 2604.04857 | HJFY |
|
| 2026-04-06 | Less Detail, Better Answers: Degradation-Driven Prompting for VQA 细节越少,答案越好:面向视觉问答的降质驱动提示方法 摘要 |
Bohan Zhuang Team | 2604.04838 | HJFY |
|
| 2026-04-06 | Discovering Failure Modes in Vision-Language Models using RL 利用强化学习探索视觉语言模型的失效模式 摘要 |
Aishwarya Agrawal Team | 2604.04733 | HJFY |
|
| 2026-04-06 | ROSClaw: A Hierarchical Semantic-Physical Framework for Heterogeneous Multi-Agent Collaboration ROSClaw:一种面向异构多智能体协作的分层语义-物理框架 摘要 |
Jie Chen Team | 2604.04664 | HJFY |
|
| 2026-04-06 | Synthesis4AD: Synthetic Anomalies are All You Need for 3D Anomaly Detection Synthesis4AD:合成异常即三维异常检测所需全部 摘要 |
Weiming Shen Team | 2604.04658 | HJFY |
|
| 2026-04-06 | InCTRLv2: Generalist Residual Models for Few-Shot Anomaly Detection and Segmentation InCTRLv2:用于少样本异常检测与分割的通用残差模型 摘要 |
Guansong Pang Team | 2604.04632 | HJFY |
|
| 2026-04-03 | CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning CoME-VL:扩展互补多编码器视觉-语言学习 摘要 |
Salman Khan Team | 2604.03231 | HJFY |
|
| 2026-04-03 | The Compression Gap: Why Discrete Tokenization Limits Vision-Language-Action Model Scaling 压缩鸿沟:为何离散标记化限制视觉-语言-动作模型的扩展 摘要 |
Takuya Shiba Team | 2604.03191 | HJFY |
|
| 2026-04-03 | Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models 理解幻觉在多模态推理模型强化后训练中的作用 摘要 |
Tianlong Chen Team | 2604.03179 | HJFY |
|
| 2026-04-03 | EffiMiniVLM: A Compact Dual-Encoder Regression Framework EffiMiniVLM:一种紧凑型双编码器回归框架 摘要 |
Yan Chai Hum Team | 2604.03172 | HJFY |
|
| 2026-04-03 | Chart-RL: Policy Optimization Reinforcement Learning for Enhanced Visual Reasoning in Chart Question Answering with Vision Language Models Chart-RL:基于策略优化强化学习的视觉语言模型图表问答视觉推理增强方法 摘要 |
Shekhar Jain Team | 2604.03157 | HJFY |
|
| 2026-04-03 | FSUNav: A Cerebrum-Cerebellum Architecture for Fast, Safe, and Universal Zero-Shot Goal-Oriented Navigation FSUNav:一种用于快速、安全且通用的零样本目标导向导航的大脑-小脑架构 摘要 |
Wei Zhang Team | 2604.03139 | HJFY |
|
| 2026-04-03 | Revealing Physical-World Semantic Vulnerabilities: Universal Adversarial Patches for Infrared Vision-Language Models 揭示物理世界语义漏洞:面向红外视觉语言模型的通用对抗性补丁 摘要 |
Wen Yao Team | 2604.03117 | HJFY |
|
| 2026-04-03 | MI-Pruner: Crossmodal Mutual Information-guided Token Pruner for Efficient MLLMs MI-Pruner:基于跨模态互信息引导的高效多模态大语言模型令牌剪枝器 摘要 |
Matthew B. Blaschko Team | 2604.03072 | HJFY |
|
| 2026-04-03 | QVAD: A Question-Centric Agentic Framework for Efficient and Training-Free Video Anomaly Detection QVAD:一种以问题为中心的高效免训练视频异常检测代理框架 摘要 |
Yasin Yilmaz Team | 2604.03040 | HJFY |
|
| 2026-04-03 | Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence? Agentic-MME:智能体能力究竟为多模态智能带来什么? 摘要 |
Yi-Fan Zhang Team | 2604.03016 | HJFY |
|
| 2026-03-31 | Scaling Video Pretraining for Surgical Foundation Models 扩展视频预训练以构建外科基础模型 摘要 |
Zuozhu Liu Team | 2603.29966 | HJFY |
|
| 2026-03-31 | EC-Bench: Enumeration and Counting Benchmark for Ultra-Long Videos EC-Bench:超长视频的枚举与计数基准测试 摘要 |
Yutaka Matsuo Team | 2603.29943 | HJFY |
|
| 2026-03-31 | ATP-Bench: Towards Agentic Tool Planning for MLLM Interleaved Generation ATP-Bench:迈向多模态大语言模型交错生成的智能体工具规划 摘要 |
Guanjun Jiang Team | 2603.29902 | HJFY |
|
| 2026-03-31 | DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA DIAL:通过潜在世界建模实现意图与动作解耦的端到端视觉语言动作模型 摘要 |
Xihui Liu Team | 2603.29844 | HJFY |
|
| 2026-03-31 | SceneTeract: Agentic Functional Affordances and VLM Grounding in 3D Scenes SceneTeract:三维场景中的智能体功能可供性与视觉语言模型接地验证 摘要 |
Maks Ovsjanikov Team | 2603.29798 | HJFY |
|
| 2026-03-31 | From Skeletons to Semantics: Design and Deployment of a Hybrid Edge-Based Action Detection System for Public Safety 从骨架到语义:面向公共安全的混合边缘动作检测系统设计与部署 摘要 |
Jan Schagen Team | 2603.29777 | HJFY |
|
| 2026-03-31 | TSHA: A Benchmark for Visual Language Models in Trustworthy Safety Hazard Assessment Scenarios TSHA:面向可信安全风险评估场景的视觉语言模型基准 摘要 |
Xin Tan Team | 2603.29759 | HJFY |
|
| 2026-03-31 | A Comprehensive Information-Decomposition Analysis of Large Vision-Language Models 大型视觉语言模型的信息分解综合分析 摘要 |
Hideki Nakayama Team | 2603.29676 | HJFY |
|
| 2026-03-31 | Storing Less, Finding More: How Novelty Filtering Improves Cross-Modal Retrieval on Edge Cameras 存储更少,发现更多:新颖性过滤如何提升边缘摄像头的跨模态检索效果 摘要 |
Sherif Abdelwahab Team | 2603.29631 | HJFY |
|
| 2026-03-31 | Calibrated Confidence Expression for Radiology Report Generation 放射学报告生成中的校准置信度表达 摘要 |
Matthias Keicher Team | 2603.29492 | HJFY |
|
| 2026-03-26 | SlotVTG: Object-Centric Adapter for Generalizable Video Temporal Grounding SlotVTG:面向泛化视频时序定位的对象中心适配器 摘要 |
Jinwoo Choi Team | 2603.25733 | HJFY |
|
| 2026-03-26 | Colon-Bench: An Agentic Workflow for Scalable Dense Lesion Annotation in Full-Procedure Colonoscopy Videos Colon-Bench:一种用于全流程结肠镜视频中可扩展密集病灶标注的智能体工作流 摘要 |
Xin Gao Team | 2603.25645 | HJFY |
|
| 2026-03-26 | LanteRn: Latent Visual Structured Reasoning LanteRn:潜在视觉结构化推理 摘要 |
André Martins Team | 2603.25629 | HJFY |
|
| 2026-03-26 | Demographic Fairness in Multimodal LLMs: A Benchmark of Gender and Ethnicity Bias in Face Verification 多模态大语言模型中的人口统计学公平性:人脸验证中的性别与种族偏见基准研究 摘要 |
Sébastien Marcel Team | 2603.25613 | HJFY |
|
| 2026-03-26 | GeoHeight-Bench: Towards Height-Aware Multimodal Reasoning in Remote Sensing GeoHeight-Bench:迈向遥感中的高度感知多模态推理 摘要 |
Wufan Zhao Team | 2603.25565 | HJFY |
|
| 2026-03-26 | Humans vs Vision-Language Models: A Unified Measure of Narrative Coherence 人类与视觉语言模型:叙事连贯性的统一度量 摘要 |
Sharid Loáiciga Team | 2603.25537 | HJFY |
|
| 2026-03-26 | GridVAD: Open-Set Video Anomaly Detection via Spatial Reasoning over Stratified Frame Grids GridVAD:基于分层帧网格空间推理的开放集视频异常检测 摘要 |
Sondos Mohamed Team | 2603.25467 | HJFY |
|
| 2026-03-26 | HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models HiSpatial:驾驭视觉语言模型中的层次化三维空间理解 摘要 |
Jiaolong Yang Team | 2603.25411 | HJFY |
|
| 2026-03-26 | Shape and Substance: Dual-Layer Side-Channel Attacks on Local Vision-Language Models 形态与实质:针对本地视觉语言模型的双层侧信道攻击 摘要 |
Mordechai Guri Team | 2603.25403 | HJFY |
|
| 2026-03-26 | DAGverse: Building Document-Grounded Semantic DAGs from Scientific Papers DAGverse:从科学论文构建基于文档的语义有向无环图 摘要 |
Huan Liu Team | 2603.25293 | HJFY |
|
| 2026-03-25 | Vision-Language Models vs Human: Perceptual Image Quality Assessment 视觉语言模型与人类:感知图像质量评估对比 摘要 |
Brian Deegan Team | 2603.24578 | HJFY |
|
| 2026-03-25 | VFIG: Vectorizing Complex Figures in SVG with Vision-Language Models VFIG:利用视觉语言模型将复杂图形矢量化至SVG格式 摘要 |
Ranjay Krishna Team | 2603.24575 | HJFY |
|
| 2026-03-25 | LensWalk: Agentic Video Understanding by Planning How You See in Videos LensWalk:通过规划视频观看方式实现智能体驱动的视频理解 摘要 |
Shiguang Shan Team | 2603.24558 | HJFY |
|
| 2026-03-25 | UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience UI-Voyager:一种通过失败经验实现自我进化的图形用户界面代理学习框架 摘要 |
Jie Jiang Team | 2603.24533 | HJFY |
|
| 2026-03-25 | Cross-Modal Prototype Alignment and Mixing for Training-Free Few-Shot Classification 跨模态原型对齐与混合:面向免训练小样本分类 摘要 |
Joost van de Weijer Team | 2603.24528 | HJFY |
|
| 2026-03-25 | Video-Only ToM: Enhancing Theory of Mind in Multimodal Large Language Models 纯视频心智理论:增强多模态大语言模型的心智理论能力 摘要 |
Jiansheng Chen Team | 2603.24484 | HJFY |
|
| 2026-03-25 | Unleashing Vision-Language Semantics for Deepfake Video Detection 释放视觉-语言语义在深度伪造视频检测中的潜力 摘要 |
Guansong Pang Team | 2603.24454 | HJFY |
|
| 2026-03-25 | 3D-Mix for VLA: A Plug-and-Play Module for Integrating VGGT-based 3D Information into Vision-Language-Action Models 面向VLA的3D-Mix:将基于VGGT的三维信息集成到视觉-语言-动作模型中的即插即用模块 摘要 |
Kai Chen Team | 2603.24393 | HJFY |
|
| 2026-03-25 | ViHOI: Human-Object Interaction Synthesis with Visual Priors ViHOI:基于视觉先验的人-物交互合成 摘要 |
Changxing Ding Team | 2603.24383 | HJFY |
|
| 2026-03-25 | GeoRouter: Dynamic Paradigm Routing for Worldwide Image Geolocalization GeoRouter:面向全球图像地理定位的动态范式路由 摘要 |
Xiangyu Zhao Team | 2603.24376 | HJFY |
|
| 2026-03-19 | Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding 生成模型通晓空间:释放隐式三维先验以促进场景理解 摘要 |
Xiang Bai Team | 2603.19235 | HJFY |
|
| 2026-03-19 | Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders 视觉语言模型是否需要视觉Transformer?评估状态空间模型作为视觉编码器的表现 摘要 |
Paola Cascante-Bonilla Team | 2603.19209 | HJFY |
|
| 2026-03-19 | Tinted Frames: Question Framing Blinds Vision-Language Models 着色框架:问题框架使视觉语言模型失明 摘要 |
Ritwik Gupta Team | 2603.19203 | HJFY |
|
| 2026-03-19 | Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation 语义与度量:面向视觉语言导航的多智能体概率性接地方法 摘要 |
Nakul Gopalan Team | 2603.19166 | HJFY |
|
| 2026-03-19 | GSMem: 3D Gaussian Splatting as Persistent Spatial Memory for Zero-Shot Embodied Exploration and Reasoning GSMem:将3D高斯泼溅作为持久空间记忆,用于零样本具身探索与推理 摘要 |
Yu Yin Team | 2603.19137 | HJFY |
|
| 2026-03-19 | TAU-R1: Visual Language Model for Traffic Anomaly Understanding TAU-R1:面向交通异常理解的视觉语言模型 摘要 |
Nic Zhang Team | 2603.19098 | HJFY |
|
| 2026-03-19 | SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues SAVeS:通过语义线索引导视觉语言模型的安全判断 摘要 |
Bernard Ghanem Team | 2603.19092 | HJFY |
|
| 2026-03-19 | SwiftTailor: Efficient 3D Garment Generation with Geometry Image Representation SwiftTailor:基于几何图像表示的高效三维服装生成 摘要 |
Phong Nguyen Team | 2603.19053 | HJFY |
|
| 2026-03-19 | TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation TerraScope:面向地球观测的像素级视觉推理 摘要 |
Paolo Rota Team | 2603.19039 | HJFY |
|
| 2026-03-19 | SEM: Sparse Embedding Modulation for Post-Hoc Debiasing of Vision-Language Models SEM:面向视觉语言模型事后去偏的稀疏嵌入调制方法 摘要 |
Massimiliano Mancini Team | 2603.19028 | HJFY |
|
| 2026-03-18 | Unified Spatio-Temporal Token Scoring for Efficient Video VLMs 统一时空令牌评分:面向高效视频视觉语言模型 摘要 |
Sangho Lee Team | 2603.18004 | HJFY |
|
| 2026-03-18 | Universal Skeleton Understanding via Differentiable Rendering and MLLMs 基于可微分渲染与多模态大语言模型的通用骨架理解 摘要 |
Mengyuan Liu Team | 2603.18003 | HJFY |
|
| 2026-03-18 | Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models Loc3R-VLM:基于语言的定位与视觉语言模型的三维推理 摘要 |
Marc Pollefeys Team | 2603.18002 | HJFY |
|
| 2026-03-18 | Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding 感知空间:面向高效精准三维场景理解的自我运动感知视频表征 摘要 |
Kang G. Shin Team | 2603.17980 | HJFY |
|
| 2026-03-18 | ProbeFlow: Training-Free Adaptive Flow Matching for Vision-Language-Action Models ProbeFlow:面向视觉-语言-动作模型的无训练自适应流匹配方法 摘要 |
Qiongfeng Shi Team | 2603.17850 | HJFY |
|
| 2026-03-18 | Fine-Grained Post-Training Quantization for Large Vision Language Models with Quantization-Aware Integrated Gradients 基于量化感知积分梯度的大规模视觉语言模型细粒度后训练量化 摘要 |
Xu-Yao Zhang Team | 2603.17809 | HJFY |
|
| 2026-03-18 | Evidence Packing for Cross-Domain Image Deepfake Detection with LVLMs 基于大型视觉语言模型的跨域图像深度伪造检测证据包方法 摘要 |
Zhaohong Jia Team | 2603.17761 | HJFY |
|
| 2026-03-18 | Concept-to-Pixel: Prompt-Free Universal Medical Image Segmentation 概念到像素:无需提示的通用医学图像分割 摘要 |
Shaohua Kevin Zhou Team | 2603.17746 | HJFY |
|
| 2026-03-18 | SARE: Sample-wise Adaptive Reasoning for Training-free Fine-grained Visual Recognition SARE:面向免训练细粒度视觉识别的样本自适应推理框架 摘要 |
Xuhong Zhang Team | 2603.17729 | HJFY |
|
| 2026-03-18 | From Virtual Environments to Real-World Trials: Emerging Trends in Autonomous Driving 从虚拟环境到现实世界测试:自动驾驶的新兴趋势 摘要 |
A. Behera Team | 2603.17714 | HJFY |
|
| 2026-03-13 | Visual-ERM: Reward Modeling for Visual Equivalence Visual-ERM:视觉等价性奖励建模 摘要 |
Yuhang Zang Team | 2603.13224 | HJFY |
|
| 2026-03-13 | Navig-AI-tion: Navigation by Contextual AI and Spatial Audio 导航AI化:基于情境人工智能与空间音频的导航系统 摘要 |
Mar Gonzalez-Franco Team | 2603.13200 | HJFY |
|
| 2026-03-13 | Towards Spatio-Temporal World Scene Graph Generation from Monocular Videos 面向单目视频的时空世界场景图生成 摘要 |
Vibhav Gogate Team | 2603.13185 | HJFY |
|
| 2026-03-13 | Geometry-Guided Camera Motion Understanding in VideoLLMs 视频大语言模型中的几何引导相机运动理解 摘要 |
Guan-Ming Su Team | 2603.13119 | HJFY |
|
| 2026-03-13 | Evaluating VLMs' Spatial Reasoning Over Robot Motion: A Step Towards Robot Planning with Motion Preferences 评估视觉语言模型在机器人运动中的空间推理能力:迈向融合运动偏好的机器人规划 摘要 |
Martim Brandão Team | 2603.13100 | HJFY |
|
| 2026-03-13 | Reasoning over Video: Evaluating How MLLMs Extract, Integrate, and Reconstruct Spatiotemporal Evidence 视频推理评估:探究多模态大语言模型如何提取、整合与重构时空证据 摘要 |
Hwanjun Song Team | 2603.13091 | HJFY |
|
| 2026-03-13 | Topo-R1: Detecting Topological Anomalies via Vision-Language Models Topo-R1:基于视觉语言模型的拓扑异常检测 摘要 |
Chao Chen Team | 2603.13054 | HJFY |
|
| 2026-03-13 | ESPIRE: A Diagnostic Benchmark for Embodied Spatial Reasoning of Vision-Language Models ESPIRE:面向视觉语言模型具身空间推理的诊断性基准 摘要 |
Zilong Zheng Team | 2603.13033 | HJFY |
|
| 2026-03-13 | A Closed-Form Solution for Debiasing Vision-Language Models with Utility Guarantees Across Modalities and Tasks 一种跨模态与任务具有效用保证的视觉语言模型去偏闭式解 摘要 |
Oya Celiktutan Team | 2603.12998 | HJFY |
|
| 2026-03-13 | Test-Time Attention Purification for Backdoored Large Vision Language Models 针对后门大型视觉语言模型的测试时注意力净化 摘要 |
Miao Xu Team | 2603.12989 | HJFY |
|
| 2026-03-10 | X-GS: An Extensible Open Framework Unifying 3DGS Architectures with Downstream Multimodal Models | Irwin King Team | 2603.09632 | HJFY | |
| 2026-03-10 | Speech-Omni-Lite: Portable Speech Interfaces for Vision-Language Models | Xiao Chen Team | 2603.09627 | HJFY | |
| 2026-03-10 | More than the Sum: Panorama-Language Models for Adverse Omni-Scenes | Rainer Stiefelhagen Team | 2603.09573 | HJFY | |
| 2026-03-10 | GeoSolver: Scaling Test-Time Reasoning in Remote Sensing with Fine-Grained Process Supervision | Bo Yang Team | 2603.09551 | HJFY | |
| 2026-03-10 | Towards Unified Multimodal Interleaved Generation via Group Relative Policy Optimization | Li Zhang Team | 2603.09538 | HJFY | |
| 2026-03-10 | Probing the Reliability of Driving VLMs: From Inconsistent Responses to Grounded Temporal Reasoning | Alain Pagani Team | 2603.09512 | HJFY | |
| 2026-03-10 | Evolving Prompt Adaptation for Vision-Language Models | Yang Li Team | 2603.09493 | HJFY | |
| 2026-03-10 | StyleVLA: Driving Style-Aware Vision Language Action Model for Autonomous Driving | Johannes Betz Team | 2603.09482 | HJFY | |
| 2026-03-10 | Prune Redundancy, Preserve Essence: Vision Token Compression in VLMs via Synergistic Importance-Diversity | Wenjie Pei Team | 2603.09480 | HJFY | |
| 2026-03-10 | MORE-R1: Guiding LVLM for Multimodal Object-Entity Relation Extraction via Stepwise Reasoning with Reinforcement Learning | Tong Mo Team | 2603.09478 | HJFY | |
| 2026-03-06 | Multimodal Large Language Models as Image Classifiers | Jiri Matas Team | 2603.06578 | HJFY | |
| 2026-03-06 | Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion | Chaoyou Fu Team | 2603.06577 | HJFY | |
| 2026-03-06 | SUREON: A Benchmark and Vision-Language-Model for Surgical Reasoning | Omid Mohareri Team | 2603.06570 | HJFY | |
| 2026-03-06 | Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders | Leoweiliang Team | 2603.06569 | HJFY | |
| 2026-03-06 | Do Foundation Models Know Geometry? Probing Frozen Features for Continuous Physical Measurement | Yakov Pyotr Shkolnikov Team | 2603.06459 | HJFY | |
| 2026-03-06 | OralGPT-Plus: Learning to Use Visual Tools via Reinforcement Learning for Panoramic X-ray Analysis | Hao Tang Team | 2603.06366 | HJFY | |
| 2026-03-06 | K-MaT: Knowledge-Anchored Manifold Transport for Cross-Modal Prompt Learning in Medical Imaging | Shadi Albarqouni Team | 2603.06340 | HJFY | |
| 2026-03-06 | WMoE-CLIP: Wavelet-Enhanced Mixture-of-Experts Prompt Learning for Zero-Shot Anomaly Detection | Chao Huang Team | 2603.06313 | HJFY | |
| 2026-03-06 | DEX-AR: A Dynamic Explainability Method for Autoregressive Vision-Language Models | Hilde Kuehne Team | 2603.06302 | HJFY | |
| 2026-03-06 | HiPP-Prune: Hierarchical Preference-Conditioned Structured Pruning for Vision-Language Models | Raul Santos-Rodriguez Team | 2603.06270 | HJFY | |
| 2026-03-05 | HALP: Detecting Hallucinations in Vision-Language Models without Generating a Single Token | Jiawei Zhou Team | 2603.05465 | HJFY | |
| 2026-03-05 | ORMOT: A Dataset and Framework for Omnidirectional Referring Multi-Object Tracking | Wenbing Tao Team | 2603.05384 | HJFY | |
| 2026-03-05 | Wiki-R1: Incentivizing Multimodal Reasoning for Knowledge-based VQA via Data and Sampling Curriculum | Xuming He Team | 2603.05256 | HJFY | |
| 2026-03-05 | Critic in the Loop: A Tri-System VLA Framework for Robust Long-Horizon Manipulation | Shanlin Zhong Team | 2603.05185 | HJFY | |
| 2026-03-05 | Logi-PAR: Logic-Infused Patient Activity Recognition via Differentiable Rule | Kawsar Farooq Team | 2603.05184 | HJFY | |
| 2026-03-05 | Mario: Multimodal Graph Reasoning with Large Language Models | Qiaoyu Tan Team | 2603.05181 | HJFY | |
| 2026-03-05 | UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark | Wynne Hsu Team | 2603.05075 | HJFY | |
| 2026-03-05 | Direct Contact-Tolerant Motion Planning With Vision Language Models | Chengzhong Xu Team | 2603.05017 | HJFY | |
| 2026-03-05 | VisionPangu: A Compact and Fine-Grained Multimodal Assistant with 1.7B Parameters | Wenpo Song Team | 2603.04957 | HJFY | |
| 2026-03-05 | AdaIAT: Adaptively Increasing Attention to Generated Text to Alleviate Hallucinations in LVLM | Xiangui Kang Team | 2603.04908 | HJFY | |
| 2026-02-26 | MediX-R1: Open Ended Medical Reinforcement Learning | Hisham Cholakkal Team | 2602.23363 | HJFY | |
| 2026-02-26 | Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning | Ranjay Krishna Team | 2602.23351 | HJFY | |
| 2026-02-26 | Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation? | Giorgos Tolias Team | 2602.23339 | HJFY | |
| 2026-02-26 | CXReasonAgent: Evidence-Grounded Diagnostic Reasoning Agent for Chest X-rays | Edward Choi Team | 2602.23276 | HJFY | |
| 2026-02-26 | Large Multimodal Models as General In-Context Classifiers | Elisa Ricci Team | 2602.23229 | HJFY | |
| 2026-02-26 | MovieTeller: Tool-augmented Movie Synopsis with ID Consistent Progressive Abstraction | Gaoang Wang Team | 2602.23228 | HJFY | |
| 2026-02-26 | Efficient Encoder-Free Fourier-based 3D Large Multimodal Model | Fabio Poiesi Team | 2602.23153 | HJFY | |
| 2026-02-26 | Cytoarchitecture in Words: Weakly Supervised Vision-Language Modeling for Human Brain Microscopy | Christian Schiffer Team | 2602.23088 | HJFY | |
| 2026-02-26 | SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling | Egor Bondarev Team | 2602.23013 | HJFY | |
| 2026-02-26 | FactGuard: Agentic Video Misinformation Detection via Reinforcement Learning | Zhaoqi Wang Team | 2602.22963 | HJFY | |
| 2026-02-24 | Spa3R: Predictive Spatial Field Modeling for 3D Visual Reasoning | Xinggang Wang Team | 2602.21186 | HJFY | |
| 2026-02-24 | Seeing Through Words: Controlling Visual Retrieval Quality with Language Models | Yun Fu Team | 2602.21175 | HJFY | |
| 2026-02-24 | LUMEN: Longitudinal Multi-Modal Radiology Model for Prognosis and Diagnosis | Marius George Linguraru Team | 2602.21142 | HJFY | |
| 2026-02-24 | VAUQ: Vision-Aware Uncertainty Quantification for LVLM Self-Evaluation | Sharon Li Team | 2602.21054 | HJFY | |
| 2026-02-24 | OCR-Agent: Agentic OCR with Capability and Memory Reflection | Ying Cai Team | 2602.21053 | HJFY | |
| 2026-02-24 | Not Just What's There: Enabling CLIP to Comprehend Negated Visual Descriptions Without Fine-tuning | Zejiang He Team | 2602.21035 | HJFY | |
| 2026-02-24 | From Perception to Action: An Interactive Benchmark for Vision Reasoning | Roy Ka-Wei Lee Team | 2602.21015 | HJFY | |
| 2026-02-24 | CrystaL: Spontaneous Emergence of Visual Latents in MLLMs | Xiang Li Team | 2602.20980 | HJFY | |
| 2026-02-24 | Are Multimodal Large Language Models Good Annotators for Image Tagging? | Masashi Sugiyama Team | 2602.20972 | HJFY | |
| 2026-02-24 | LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding | Qixiang Ye Team | 2602.20913 | HJFY | |
| 2026-02-19 | Pushing the Frontier of Black-Box LVLM Attacks via Fine-Grained Detail Targeting | Zhiqiang Shen Team | 2602.17645 | HJFY | |
| 2026-02-19 | Catastrophic Forgetting Resilient One-Shot Incremental Federated Learning | Monowar Bhuyan Team | 2602.17625 | HJFY | |
| 2026-02-19 | AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games | Joshua B. Tenenbaum Team | 2602.17594 | HJFY | |
| 2026-02-19 | RetouchIQ: MLLM Agents for Instruction-Based Image Retouching with Generalist Reward | Handong Zhao Team | 2602.17558 | HJFY | |
| 2026-02-19 | GraphThinker: Reinforcing Video Reasoning with Event Graph Thinking | Shaogang Gong Team | 2602.17555 | HJFY | |
| 2026-02-19 | LATA: Laplacian-Assisted Transductive Adaptation for Conformal Uncertainty in Medical VLMs | Zongyuan Ge Team | 2602.17535 | HJFY | |
| 2026-02-19 | QuPAINT: Physics-Aware Instruction Tuning Approach to Quantum Material Discovery | Khoa Luu Team | 2602.17478 | HJFY | |
| 2026-02-19 | EAGLE: Expert-Augmented Attention Guidance for Tuning-Free Industrial Anomaly Detection in Multimodal Large Language Models | Seon Han Choi Team | 2602.17419 | HJFY | |
| 2026-02-19 | EntropyPrune: Matrix Entropy Guided Visual Token Pruning for Multimodal Large Language Models | Lianghua He Team | 2602.17196 | HJFY | |
| 2026-02-19 | Selective Training for Large Vision Language Models via Visual Information Gain | Sangheum Hwang Team | 2602.17186 | HJFY | |
| 2026-02-12 | Scaling Verification Can Be More Effective than Scaling Policy Learning for Vision-Language-Action Alignment | Marco Pavone Team | 2602.12281 | HJFY | |
| 2026-02-12 | ExStrucTiny: A Benchmark for Schema-Variable Structured Information Extraction from Document Images | Manuela Veloso Team | 2602.12203 | HJFY | |
| 2026-02-12 | Visual Reasoning Benchmark: Evaluating Multimodal LLMs on Classroom-Authentic Visual Problems from Primary Education | Oliver G. B. Garrod Team | 2602.12196 | HJFY | |
| 2026-02-12 | 3DGSNav: Enhancing Vision-Language Model Reasoning for Object Navigation via Active 3D Gaussian Splatting | Xinyi Yu Team | 2602.12159 | HJFY | |
| 2026-02-12 | DeepSight: An All-in-One LM Safety Toolkit | Xia Hu Team | 2602.12092 | HJFY | |
| 2026-02-12 | Affordance-Graphed Task Worlds: Self-Evolving Task Generation for Scalable Embodied Learning | Changshui Zhang Team | 2602.12065 | HJFY | |
| 2026-02-12 | Can Local Vision-Language Models improve Activity Recognition over Vision Transformers? -- Case Study on Newborn Resuscitation | Øyvind Meinich-Bache Team | 2602.12002 | HJFY | |
| 2026-02-12 | Spatial Chain-of-Thought: Bridging Understanding and Generation Models for Spatial Reasoning Generation | Long Chen Team | 2602.11980 | HJFY | |
| 2026-02-12 | Benchmarking Vision-Language Models for French PDF-to-Markdown Conversion | Nicolas Mery Team | 2602.11960 | HJFY | |
| 2026-02-12 | Are Two LLMs Better Than One? A Student-Teacher Dual-Head LLMs Architecture for Pharmaceutical Content Optimization | Anubhav Girdhar Team | 2602.11957 | HJFY | |
| 2026-02-10 | Reason-IAD: Knowledge-Guided Dynamic Latent Reasoning for Explainable Industrial Anomaly Detection | Xiaochun Cao Team | 2602.09850 | HJFY | |
| 2026-02-10 | Kelix Technique Report | Ziqi Wang Team | 2602.09843 | HJFY | |
| 2026-02-10 | SAKED: Mitigating Hallucination in Large Vision-Language Models via Stability-Aware Knowledge Enhanced Decoding | Xudong Jiang Team | 2602.09825 | HJFY | |
| 2026-02-10 | GenSeg-R1: RL-Driven Vision-Language Grounding for Fine-Grained Referring Segmentation | Uma Mahesh Team | 2602.09701 | HJFY | |
| 2026-02-10 | VideoAfford: Grounding 3D Affordance from Human-Object-Interaction Videos via Multimodal Large Language Model | Hui Xiong Team | 2602.09638 | HJFY | |
| 2026-02-10 | AGMark: Attention-Guided Dynamic Watermarking for Large Vision-Language Models | Linlin Wang Team | 2602.09611 | HJFY | |
| 2026-02-10 | Tele-Omni: a Unified Multimodal Framework for Video Generation and Editing | Xuelong Li Team | 2602.09609 | HJFY | |
| 2026-02-10 | Delving into Spectral Clustering with Vision-Language Representations | Zhen Fang Team | 2602.09586 | HJFY | |
| 2026-02-10 | Scalpel: Fine-Grained Alignment of Attention Activation Manifolds via Mixture Gaussian Bridges to Mitigate Multimodal Hallucination | Koichi Shirahata Team | 2602.09541 | HJFY | |
| 2026-02-10 | DR.Experts: Differential Refinement of Distortion-Aware Experts for Blind Image Quality Assessment | Runze Hu Team | 2602.09531 | HJFY | |
| 2026-02-04 | When LLaVA Meets Objects: Token Composition for Vision-Language-Models | Hilde Kuehne Team | 2602.04864 | HJFY | |
| 2026-02-04 | El Agente Estructural: An Artificially Intelligent Molecular Editor | Varinia Bernales Team | 2602.04849 | HJFY | |
| 2026-02-04 | VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text? | Huchuan Lu Team | 2602.04802 | HJFY | |
| 2026-02-04 | Alignment Drift in Multimodal LLMs: A Two-Phase, Longitudinal Evaluation of Harm Across Eight Model Releases | Emily Dix Team | 2602.04739 | HJFY | |
| 2026-02-04 | SAR-RAG: ATR Visual Question Answering by Semantic Search, Retrieval, and MLLM Generation | Andreas Spanias Team | 2602.04712 | HJFY | |
| 2026-02-04 | Annotation Free Spacecraft Detection and Segmentation using Vision Language Models | Djamila Aouada Team | 2602.04699 | HJFY | |
| 2026-02-04 | AGILE: Hand-Object Interaction Reconstruction from Video via Agentic Generation | Chunhua Shen Team | 2602.04672 | HJFY | |
| 2026-02-04 | PIO-FVLM: Rethinking Training-Free Visual Token Reduction for VLM Acceleration from an Inference-Objective Perspective | Chunhua Shen Team | 2602.04657 | HJFY | |
| 2026-02-04 | Relational Scene Graphs for Object Grounding of Natural Language Commands | Ville Kyrki Team | 2602.04635 | HJFY | |
| 2026-02-04 | LEAD: Layer-wise Expert-aligned Decoding for Faithful Radiology Report Generation | Yan Song Team | 2602.04617 | HJFY | |
| 2026-02-02 | Avenir-Web: Human-Experience-Imitating Multimodal Web Agents with Mixture of Grounding Experts | Mengdi Wang Team | 2602.02468 | HJFY | |
| 2026-02-02 | MentisOculi: Revealing the Limits of Reasoning with Mental Imagery | Wieland Brendel Team | 2602.02465 | HJFY | |
| 2026-02-02 | Relationship-Aware Hierarchical 3D Scene Graph for Task Reasoning | Kostas Alexis Team | 2602.02456 | HJFY | |
| 2026-02-02 | World-Gymnast: Training Robots with Reinforcement Learning in a World Model | Sherry Yang Team | 2602.02454 | HJFY | |
| 2026-02-02 | ReasonEdit: Editing Vision-Language Models using Human Reasoning | Thomas Hartvigsen Team | 2602.02408 | HJFY | |
| 2026-02-02 | LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization | Limin Wang Team | 2602.02341 | HJFY | |
| 2026-02-02 | Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models | Shaosheng Cao Team | 2602.02185 | HJFY | |
| 2026-02-02 | See2Refine: Vision-Language Feedback Improves LLM-Based eHMI Action Designers | Takeo Igarashi Team | 2602.02063 | HJFY | |
| 2026-02-02 | Auto-Comp: An Automated Pipeline for Scalable Compositional Probing of Contrastive Vision-Language Models | Toshihiko Yamasaki Team | 2602.02043 | HJFY | |
| 2026-02-02 | One Size, Many Fits: Aligning Diverse Group-Wise Click Preferences in Large-Scale Advertising Image Generation | Jian Liang Team | 2602.02033 | HJFY | |
| 2026-01-30 | User Prompting Strategies and Prompt Enhancement Methods for Open-Set Object Detection in XR Environments | Junfeng Lin et al. | 2601.23281 | null | |
| 2026-01-30 | Training-Free Test-Time Adaptation with Brownian Distance Covariance in Vision-Language Models | Yi Zhang et al. | 2601.23253 | null | |
| 2026-01-30 | Structured Over Scale: Learning Spatial Reasoning from Educational Video | Bishoy Galoaa et al. | 2601.23251 | null | |
| 2026-01-30 | Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning | Xiangyu Zeng et al. | 2601.23224 | null | |
| 2026-01-30 | Med-Scout: Curing MLLMs' Geometric Blindness in Medical Perception via Geometry-Aware RL Post-Training | Anglin Liu et al. | 2601.23220 | null | |
| 2026-01-30 | Make Anything Match Your Target: Universal Adversarial Perturbations against Closed-Source MLLMs via Multi-Crop Routed Meta Optimization | Hui Lu et al. | 2601.23179 | null | |
| 2026-01-30 | Hearing is Believing? Evaluating and Analyzing Audio Language Model Sycophancy with SYAUDIO | Junchi Yao et al. | 2601.23149 | null | |
| 2026-01-30 | One-shot Optimized Steering Vector for Hallucination Mitigation for VLMs | Youxu Shi et al. | 2601.23041 | null | |
| 2026-01-30 | Triage: Hierarchical Visual Budgeting for Efficient Video Reasoning in Vision-Language Models | Anmin Wang et al. | 2601.22959 | null | |
| 2026-01-30 | Alignment among Language, Vision and Action Representations | Nicola Milano et al. | 2601.22948 | null | |
| 2026-01-29 | UEval: A Benchmark for Unified Multimodal Generation | Bo Li et al. | 2601.22155 | null | |
| 2026-01-29 | Do VLMs Perceive or Recall? Probing Visual Perception vs. Memory with Classic Visual Illusions | Xiaoxiao Sun et al. | 2601.22150 | null | |
| 2026-01-29 | SINA: A Circuit Schematic Image-to-Netlist Generator Using Artificial Intelligence | Saoud Aldowaish et al. | 2601.22114 | null | |
| 2026-01-29 | VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning | Yibo Wang et al. | 2601.22069 | null | |
| 2026-01-29 | Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models | Wenxuan Huang et al. | 2601.22060 | null | |
| 2026-01-29 | MetricAnything: Scaling Metric Depth Pretraining with Noisy Heterogeneous Sources | Baorui Ma et al. | 2601.22054 | null | |
| 2026-01-29 | Visual-Guided Key-Token Regularization for Multimodal Large Language Model Unlearning | Chengyi Cai et al. | 2601.22020 | null | |
| 2026-01-29 | Causal World Modeling for Robot Control | Lin Li et al. | 2601.21998 | null | |
| 2026-01-29 | Clarity: The Flexibility-Interpretability Trade-Off in Sparsity-aware Concept Bottleneck Models | Konstantinos P. Panousis et al. | 2601.21944 | null | |
| 2026-01-29 | VideoAesBench: Benchmarking the Video Aesthetics Perception Capabilities of Large Multimodal Models | Yunhao Li et al. | 2601.21915 | null | |
Evaluation status is saved locally in the browser (localStorage) and is not synced across devices or browsers.