🤖 VLM-Arxiv-Daily

Automatically tracks the latest arXiv papers on Vision-Language-Action (VLA), Vision-Language Navigation (VLN), and Vision-Language Models (VLM) every day.

Updated on 2026.05.18

📌 VLA

Publish Date (YYYY-MM-DD) Title Authors PDF HJFY Evaluation
2026-05-14 VGGT-$Ω$
Abstract
Christian Rupprecht Team 2605.15195 HJFY
2026-05-14 Hand-in-the-Loop: Improving Dexterous VLA via Seamless Interventional Correction
Abstract
Ruoshi Wen Team 2605.15157 HJFY
2026-05-14 Evo-Depth: A Lightweight Depth-Enhanced Vision-Language-Action Model
Abstract
Bo Zhao Team 2605.14950 HJFY
2026-05-14 Slot-MPC: Goal-Conditioned Model Predictive Control with Object-Centric Representations
Abstract
Sven Behnke Team 2605.14937 HJFY
2026-05-14 Chain-of-Procedure: Hierarchical Visual-Language Reasoning for Procedural QA
Abstract
Derek F. Wong Team 2605.14928 HJFY
2026-05-14 IntentVLA: Short-Horizon Intent Modeling for Aliased Robot Manipulation
Abstract
Kai Chen Team 2605.14712 HJFY
2026-05-14 Digital Twin Synchronization Over Mobile Embodied AI Network With Agentic Intelligence
Abstract
Kaibin Huang Team 2605.14625 HJFY
2026-05-14 DSSP: Diffusion State Space Policy with Full-History Encoding
Abstract
Yutong Ban Team 2605.14598 HJFY
2026-05-14 TeachAnything: A Multimodal Crowdsourcing Platform for Training Embodied AI Agents in Symmetrical Reality
Abstract
Zhenliang Zhang Team 2605.14556 HJFY
2026-05-14 DiffPhD: A Unified Differentiable Solver for Projective Heterogeneous Materials in Elastodynamics with Contact-Rich GPU-Acceleration
Abstract
Bing-Yu Chen Team 2605.14526 HJFY
2026-05-08 One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
Abstract
Bin Liu Team 2605.07931 HJFY
2026-05-08 EggHand: A Multimodal Foundation Model for Egocentric Hand Pose Forecasting
Abstract
Daehee Park Team 2605.07642 HJFY
2026-05-08 ForgeVLA: Federated Vision-Language-Action Learning without Language Annotations
Abstract
Jiancheng Lyu Team 2605.07474 HJFY
2026-05-08 EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement
Abstract
Guangtao Zhai Team 2605.07457 HJFY
2026-05-08 Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation
Abstract
Mike Zheng Shou Team 2605.07381 HJFY
2026-05-08 Task-Oriented Communication for Human Action Understanding via Edge-Cloud Co-Inference
Abstract
Jiawei Shao Team 2605.07354 HJFY
2026-05-08 CSR: Infinite-Horizon Real-Time Policies with Massive Cached State Representations
Abstract
Go Suzui Team 2605.07325 HJFY
2026-05-08 AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models
Abstract
Hao Dong Team 2605.07308 HJFY
2026-05-08 BioProVLA-Agent: An Affordable, Protocol-Driven, Vision-Enhanced VLA-Enabled Embodied Multi-Agent System with Closed-Loop-Capable Reasoning for Biological Laboratory Manipulation
Abstract
Zhe Liu Team 2605.07306 HJFY
2026-05-08 Sword: Style-Robust World Models as Simulators via Dynamic Latent Bootstrapping for VLA Policy Post-Training
Abstract
Sheng Wen Team 2605.07288 HJFY
2026-05-05 Bridging the Embodiment Gap: Disentangled Cross-Embodiment Video Editing
Abstract
Joni Pajarinen Team 2605.03637 HJFY
2026-05-05 MHPR: Multidimensional Human Perception and Reasoning Benchmark for Large Vision-Language Models
Abstract
Shengzhao Wen Team 2605.03485 HJFY
2026-05-05 Neural Control: Adjoint Learning Through Equilibrium Constraints
Abstract
M. Khalid Jawed Team 2605.03288 HJFY
2026-05-05 RLDX-1 Technical Report
Abstract
Jinwoo Shin Team 2605.03269 HJFY
2026-05-04 MolmoAct2: Action Reasoning Models for Real-world Deployment
Abstract
Ranjay Krishna Team 2605.02881 HJFY
2026-05-05 VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition
Abstract
Ranjay Krishna Team 2605.02834 HJFY
2026-05-04 Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation
Abstract
Chang Xu Team 2605.02757 HJFY
2026-05-04 Latent Bridge: Feature Delta Prediction for Efficient Dual-System Vision-Language-Action Model Inference
Abstract
Hai Li Team 2605.02739 HJFY
2026-05-04 Learning Equivariant Neural-Augmented Object Dynamics From Few Interactions
Abstract
Laura Herlant Team 2605.02699 HJFY
2026-05-04 AnchorD: Metric Grounding of Monocular Depth Using Factor Graphs
Abstract
Abhinav Valada Team 2605.02667 HJFY
2026-04-30 LaST-R1: Reinforcing Action via Adaptive Physical Latent Reasoning for VLA Models
Abstract
Pheng-Ann Heng Team 2604.28192 HJFY
2026-04-30 RopeDreamer: A Kinematic Recurrent State Space Model for Dynamics of Flexible Deformable Linear Objects
Abstract
Paula Dornhofer Paro Costa Team 2604.28161 HJFY
2026-04-30 A Pattern Language for Resilient Visual Agents
Abstract
Alois Knoll Team 2604.28001 HJFY
2026-04-30 MotuBrain: An Advanced World Action Model for Robot Control
Abstract
Jun Zhu Team 2604.27792 HJFY
2026-04-30 Robot Learning from Human Videos: A Survey
Abstract
Hesheng Wang Team 2604.27621 HJFY
2026-04-30 SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation
Abstract
Nanning Zheng Team 2604.27620 HJFY
2026-04-30 World2Minecraft: Occupancy-Driven Simulated Scenes Construction
Abstract
Xin Tan Team 2604.27578 HJFY
2026-04-30 SpatialGrammar: A Domain-Specific Language for LLM-Based 3D Indoor Scene Generation
Abstract
Xiaowen Chu Team 2604.27555 HJFY
2026-04-30 PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations
Abstract
Xuelong Li Team 2604.27472 HJFY
2026-04-30 Judge, Then Drive: A Critic-Centric Vision Language Action Framework for Autonomous Driving
Abstract
Hao Yang Team 2604.27366 HJFY
2026-04-23 Long-Horizon Manipulation via Trace-Conditioned VLA Planning
Abstract
Sifei Liu Team 2604.21924 HJFY
2026-04-23 VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis
Abstract
Wenchao Ding Team 2604.21914 HJFY
2026-04-23 From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media
Abstract
Margret Keuper Team 2604.21786 HJFY
2026-04-23 Hi-WM: Human-in-the-World-Model for Scalable Robot Post-Training
Abstract
Yichen Zhu Team 2604.21741 HJFY
2026-04-23 A Replicable Robotics Awareness Method Using LLM-Enabled Robotics Interaction: Evidence from a Corporate Challenge
Abstract
P. Olivera Brizzio Team 2604.21377 HJFY
2026-04-23 Symbolic Grounding Reveals Representational Bottlenecks in Abstract Visual Reasoning
Abstract
Tanel Tammet Team 2604.21346 HJFY
2026-04-23 Reasoning About Traversability: Language-Guided Off-Road 3D Trajectory Planning
Abstract
Soonmin Hwang Team 2604.21249 HJFY
2026-04-23 CorridorVLA: Explicit Spatial Constraints for Generative Action Heads via Sparse Anchors
Abstract
Jianqiang Li Team 2604.21241 HJFY
2026-04-23 ReCAPA: Hierarchical Predictive Correction to Mitigate Cascading Failures
Abstract
Hao Wang Team 2604.21232 HJFY
2026-04-23 How VLAs (Really) Work In Open-World Environments
Abstract
Sajjad Pakdamansavoji Team 2604.21192 HJFY
2026-04-20 Neural Garbage Collection: Learning to Forget while Learning to Reason
Abstract
Noah D. Goodman Team 2604.18002 HJFY
2026-04-20 Unmasking the Illusion of Embodied Reasoning in Vision-Language-Action Models
Abstract
Zongqing Lu Team 2604.18000 HJFY
2026-04-20 E3VS-Bench: A Benchmark for Viewpoint-Dependent Active Perception in 3D Gaussian Splatting Scenes
Abstract
Yutaka Matsuo Team 2604.17969 HJFY
2026-04-20 OneDrive: Unified Multi-Paradigm Driving with Vision-Language-Action Models
Abstract
Zhipeng Zhang Team 2604.17915 HJFY
2026-04-20 Can Explicit Physical Feasibility Benefit VLA Learning? An Empirical Study
Abstract
Hashem Haghbayan Team 2604.17896 HJFY
2026-04-20 StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement
Abstract
Huaibo Huang Team 2604.17887 HJFY
2026-04-20 ST-$π$: Structured SpatioTemporal VLA for Robotic Manipulation
Abstract
Luxin Yan Team 2604.17880 HJFY
2026-04-20 OFlow: Injecting Object-Aware Temporal Flow Matching for Robust Robotic Manipulation
Abstract
Xiangyang Xue Team 2604.17876 HJFY
2026-04-20 DART: Learning-Enhanced Model Predictive Control for Dual-Arm Non-Prehensile Manipulation
Abstract
Madhava Krishna Team 2604.17833 HJFY
2026-04-20 ReFineVLA: Multimodal Reasoning-Aware Generalist Robotic Policies via Teacher-Guided Fine-Tuning
Abstract
Minh Nhat Vu Team 2604.17800 HJFY
2026-04-15 HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
Abstract
Ping Luo Team 2604.14125 HJFY
2026-04-15 MApLe: Multi-instance Alignment of Diagnostic Reports and Large Medical Images
Abstract
Georg Langs Team 2604.13970 HJFY
2026-04-15 [Emerging Ideas] Artificial Tripartite Intelligence: A Bio-Inspired, Sensor-First Architecture for Physical AI
Abstract
Hyung-Sin Kim Team 2604.13959 HJFY
2026-04-15 Goal2Skill: Long-Horizon Manipulation with Adaptive Planning and Reflection
Abstract
Zhongzhu Pu Team 2604.13942 HJFY
2026-04-15 EmbodiedClaw: Conversational Workflow Execution for Embodied AI Development
Abstract
Yongchao Chen Team 2604.13800 HJFY
2026-04-15 Failure Identification in Imitation Learning Via Statistical and Semantic Filtering
Abstract
Jean-Baptiste Mouret Team 2604.13788 HJFY
2026-04-15 Jump-Start Reinforcement Learning with Vision-Language-Action Regularization
Abstract
Loris Roveda Team 2604.13733 HJFY
2026-04-15 Vision-and-Language Navigation for UAVs: Progress, Challenges, and a Research Roadmap
Abstract
Ji Pei Team 2604.13654 HJFY
2026-04-15 A Mechanistic Analysis of Sim-and-Real Co-Training in Generative Robot Policies
Abstract
Yuke Zhu Team 2604.13645 HJFY
2026-04-15 ESCAPE: Episodic Spatial Memory and Adaptive Execution Policy for Long-Horizon Mobile Manipulation
Abstract
Li Jiang Team 2604.13633 HJFY
2026-04-06 InfBaGel: Human-Object-Scene Interaction Generation with Dynamic Perception and Iterative Refinement
Abstract
Guanjie Zheng Team 2604.04843 HJFY
2026-04-06 E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes
Abstract
Kaiwei Wang Team 2604.04834 HJFY
2026-04-06 AnyUser: Translating Sketched User Intent into Domestic Robots
Abstract
Shaowu Yang Team 2604.04811 HJFY
2026-04-06 ROSClaw: A Hierarchical Semantic-Physical Framework for Heterogeneous Multi-Agent Collaboration
Abstract
Jie Chen Team 2604.04664 HJFY
2026-04-06 Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?
Abstract
Jianyu Chen Team 2604.04502 HJFY
2026-04-05 Adaptive Action Chunking at Inference-time for Vision-Language-Action Models
Abstract
Prahlad Vadakkepat Team 2604.04161 HJFY
2026-04-05 VLA-Forget: Vision-Language-Action Unlearning for Embodied Foundation Models
Abstract
Agoritsa Polyzou Team 2604.03956 HJFY
2026-04-04 From Prompt to Physical Action: Structured Backdoor Attacks on LLM-Mediated Robotic Control Systems
Abstract
Jin Wei-Kocsis Team 2604.03890 HJFY
2026-04-04 OpenRC: An Open-Source Robotic Colonoscopy Framework for Multimodal Data Acquisition and Autonomy Research
Abstract
Farshid Alambeigi Team 2604.03781 HJFY
2026-04-04 When Does Multimodal AI Help? Diagnostic Complementarity of Vision-Language Models and CNNs for Spectrum Management in Satellite-Terrestrial Networks
Abstract
Yuanhang Li Team 2604.03774 HJFY
2026-04-03 The Compression Gap: Why Discrete Tokenization Limits Vision-Language-Action Model Scaling
Abstract
Takuya Shiba Team 2604.03191 HJFY
2026-04-03 Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model
Abstract
Tieniu Tan Team 2604.03181 HJFY
2026-04-03 ARM: Advantage Reward Modeling for Long-Horizon Manipulation
Abstract
Hua Chen Team 2604.03037 HJFY
2026-04-03 Open-Loop Planning, Closed-Loop Verification: Speculative Verification for VLA
Abstract
Xiu-Shen Wei Team 2604.02965 HJFY
2026-04-03 Learning Structured Robot Policies from Vision-Language Models via Synthetic Neuro-Symbolic Supervision
Abstract
Pietro Falco Team 2604.02812 HJFY
2026-04-03 ExploreVLA: Dense World Modeling and Exploration for End-to-End Autonomous Driving
Abstract
Liu Ren Team 2604.02714 HJFY
2026-04-02 F2F-AP: Flow-to-Future Asynchronous Policy for Real-time Dynamic Manipulation
Abstract
Jiwen Lu Team 2604.02408 HJFY
2026-04-02 UAV-Track VLA: Embodied Aerial Tracking via Vision-Language-Action Models
Abstract
Yonglin Tian Team 2604.02241 HJFY
2026-04-02 UniDriveVLA: Unifying Understanding, Perception, and Action Planning for Autonomous Driving
Abstract
Xinggang Wang Team 2604.02190 HJFY
2026-04-02 Cross-Modal Visuo-Tactile Object Perception
Abstract
Mohsen Kaboli Team 2604.02108 HJFY
2026-03-31 Hybrid Framework for Robotic Manipulation: Integrating Reinforcement Learning and Large Language Models
Abstract
Mohd Suhaib Team 2603.30022 HJFY
2026-03-31 Architecting Secure AI Agents: Perspectives on System-Level Defenses Against Indirect Prompt Injection Attacks
Abstract
G. Edward Suh Team 2603.30016 HJFY
2026-03-31 Passive iFIR filters for data-driven velocity control in robotics
Abstract
Fulvio Forni Team 2603.29882 HJFY
2026-03-31 DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA
Abstract
Xihui Liu Team 2603.29844 HJFY
2026-03-31 SceneTeract: Agentic Functional Affordances and VLM Grounding in 3D Scenes
Abstract
Maks Ovsjanikov Team 2603.29798 HJFY
2026-03-31 From Skeletons to Semantics: Design and Deployment of a Hybrid Edge-Based Action Detection System for Public Safety
Abstract
Jan Schagen Team 2603.29777 HJFY
2026-03-31 SafeDMPs: Integrating Formal Safety with DMPs for Adaptive HRI
Abstract
Ravi Prakash Team 2603.29708 HJFY
2026-03-31 RAAP: Retrieval-Augmented Affordance Prediction with Cross-Image Action Alignment
Abstract
Xiu-Shen Wei Team 2603.29419 HJFY
2026-03-31 CLaD: Planning with Grounded Foresight via Cross-Modal Latent Dynamics
Abstract
Sung-Eui Yoon Team 2603.29409 HJFY
2026-03-31 PRISM: A Multi-View Multi-Capability Retail Video Dataset for Embodied Vision-Language Models
Abstract
Sashi Reddi Team 2603.29281 HJFY
2026-03-26 Vega: Learning to Drive with Natural Language Instructions
Abstract
Jiwen Lu Team 2603.25741 HJFY
2026-03-26 Drive My Way: Preference Alignment of Vision-Language-Action Model for Personalized Driving
Abstract
Jiachen Li Team 2603.25740 HJFY
2026-03-26 SoftMimicGen: A Data Generation System for Scalable Robot Learning in Deformable Object Manipulation
Abstract
Ajay Mandlekar Team 2603.25725 HJFY
2026-03-26 Self-Improvement of Large Language Models: A Technical Overview and Future Outlook
Abstract
Jiawei Zhou Team 2603.25681 HJFY
2026-03-26 Towards Embodied AI with MuscleMimic: Unlocking full-body musculoskeletal motor learning at scale
Abstract
Alexander Mathis Team 2603.25544 HJFY
2026-03-26 PAWS: Perception of Articulation in the Wild at Scale from Egocentric Videos
Abstract
Arno Solin Team 2603.25539 HJFY
2026-03-26 LILAC: Language-Conditioned Object-Centric Optical Flow for Open-Loop Trajectory Generation
Abstract
Komei Sugiura Team 2603.25481 HJFY
2026-03-26 VideoWeaver: Multimodal Multi-View Video-to-Video Transfer for Embodied Agents
Abstract
Ziyuan Liu Team 2603.25420 HJFY
2026-03-26 MMaDA-VLA: Large Diffusion Vision-Language-Action Model with Unified Multi-Modal Instruction and Generation
Abstract
Donglin Wang Team 2603.25406 HJFY
2026-03-26 LaMP: Learning Vision-Language-Action Policies with 3D Scene Flow as Latent Motion Prior
Abstract
Lixin Yang Team 2603.25399 HJFY
2026-03-25 TAG: Target-Agnostic Guidance for Stable Object-Centric Inference in Vision-Language-Action Models
Abstract
Guangrun Wang Team 2603.24584 HJFY
2026-03-25 Chameleon: Episodic Memory for Long-Horizon Robotic Manipulation
Abstract
Jianfei Yang Team 2603.24576 HJFY
2026-03-25 3D-Mix for VLA: A Plug-and-Play Module for Integrating VGGT-based 3D Information into Vision-Language-Action Models
Abstract
Kai Chen Team 2603.24393 HJFY
2026-03-25 A Sensorless, Inherently Compliant Anthropomorphic Musculoskeletal Hand Driven by Electrohydraulic Actuators
Abstract
Robert K. Katzschmann Team 2603.24357 HJFY
2026-03-25 GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents
Abstract
Volkan Ustun Team 2603.24329 HJFY
2026-03-25 Toward Generalist Neural Motion Planners for Robotic Manipulators: Challenges and Opportunities
Abstract
Minghui Zheng Team 2603.24318 HJFY
2026-03-25 CarePilot: A Multi-Agent Framework for Long-Horizon Computer Task Automation in Healthcare
Abstract
Salman Khan Team 2603.24157 HJFY
2026-03-25 Knowledge-Guided Manipulation Using Multi-Task Reinforcement Learning
Abstract
Aleksandr Panov Team 2603.24083 HJFY
2026-03-25 SOMA: Strategic Orchestration and Memory-Augmented System for Vision-Language-Action Model Robustness via In-Context Adaptation
Abstract
Jinyu Gu Team 2603.24060 HJFY
2026-03-25 ELITE: Experiential Learning and Intent-Aware Transfer for Self-improving Embodied Agents
Abstract
Yongtao Wang Team 2603.24018 HJFY
2026-03-19 Not All Features Are Created Equal: A Mechanistic Study of Vision-Language-Action Models
Abstract
Peng Wang Team 2603.19233 HJFY
2026-03-19 MonoArt: Progressive Structural Reasoning for Monocular Articulated 3D Reconstruction
Abstract
Ziwei Liu Team 2603.19231 HJFY
2026-03-19 DriveTok: 3D Driving Scene Tokenization for Unified Multi-View Reconstruction and Understanding
Abstract
Jiwen Lu Team 2603.19219 HJFY
2026-03-19 OmniVTA: Visuo-Tactile World Modeling for Contact-Rich Robotic Manipulation
Abstract
Wenchao Ding Team 2603.19201 HJFY
2026-03-19 FASTER: Rethinking Real-Time Flow VLAs
Abstract
Hengshuang Zhao Team 2603.19199 HJFY
2026-03-19 Sparse Autoencoders Reveal Interpretable and Steerable Features in VLA Models
Abstract
Mac Schwager Team 2603.19183 HJFY
2026-03-19 Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation
Abstract
Nakul Gopalan Team 2603.19166 HJFY
2026-03-19 From Inference Efficiency to Embodied Efficiency: Revisiting Efficiency Metrics for Vision-Language-Action Models
Abstract
Chaojian Li Team 2603.19131 HJFY
2026-03-19 MERGE: Guided Vision-Language Models for Multi-Actor Event Reasoning and Grounding in Human-Robot Interaction
Abstract
Michael Gienger Team 2603.18988 HJFY
2026-03-19 GHOST: Fast Category-agnostic Hand-Object Interaction Reconstruction from RGB Videos using Gaussian Splatting
Abstract
Didier Stricker Team 2603.18912 HJFY
2026-03-18 Unified Spatio-Temporal Token Scoring for Efficient Video VLMs
Abstract
Sangho Lee Team 2603.18004 HJFY
2026-03-18 ProbeFlow: Training-Free Adaptive Flow Matching for Vision-Language-Action Models
Abstract
Qiongfeng Shi Team 2603.17850 HJFY
2026-03-18 Generative Control as Optimization: Time Unconditional Flow Matching for Adaptive and Robust Robotic Control
Abstract
Hang Zhao Team 2603.17834 HJFY
2026-03-18 VolumeDP: Modeling Volumetric Representation for Manipulation Policy Learning
Abstract
Tao Jiang Team 2603.17720 HJFY
2026-03-18 HeiSD: Hybrid Speculative Decoding for Embodied Vision-Language-Action Models with Kinematic Awareness
Abstract
Xiang Chen Team 2603.17573 HJFY
2026-03-18 KineVLA: Towards Kinematics-Aware Vision-Language-Action Models with Bi-Level Action Decomposition
Abstract
Tongliang Liu Team 2603.17524 HJFY
2026-03-17 TeleDex: Accessible Dexterous Teleoperation
Abstract
Yuchen Cui Team 2603.17065 HJFY
2026-03-17 ManiTwin: Scaling Data-Generation-Ready Digital Object Dataset to 100K
Abstract
Ping Luo Team 2603.16866 HJFY
2026-03-17 MolmoB0T: Large-Scale Simulation Enables Zero-Shot Manipulation
Abstract
Ranjay Krishna Team 2603.16861 HJFY
2026-03-17 DreamPlan: Efficient Reinforcement Fine-Tuning of Vision-Language Planners via Video World Models
Abstract
Yue Wang Team 2603.16860 HJFY
2026-03-13 Towards Spatio-Temporal World Scene Graph Generation from Monocular Videos
Abstract
Vibhav Gogate Team 2603.13185 HJFY
2026-03-13 DecoVLN: Decoupling Observation, Reasoning, and Correction for Vision-and-Language Navigation
Abstract
Shengjun Huang Team 2603.13133 HJFY
2026-03-13 Language-Grounded Decoupled Action Representation for Robotic Manipulation
Abstract
Heng Tao Shen Team 2603.12967 HJFY
2026-03-13 ReMem-VLA: Empowering Vision-Language-Action Model with Memory via Dual-Level Recurrent Queries
Abstract
Alois Knoll Team 2603.12942 HJFY
2026-03-13 Coordinated Manipulation of Hybrid Deformable-Rigid Objects in Constrained Environments
Abstract
Federico Renda Team 2603.12940 HJFY
2026-03-13 RoboStream: Weaving Spatio-Temporal Reasoning with Memory in Vision-Language Models for Robotics
Abstract
Zhi Wang Team 2603.12939 HJFY
2026-03-13 MotionAnymesh: Physics-Grounded Articulation for Simulation-Ready Digital Twins
Abstract
RuoNan Liu Team 2603.12936 HJFY
2026-03-13 Finite Difference Flow Optimization for RL Post-Training of Text-to-Image Models
Abstract
Samuli Laine Team 2603.12893 HJFY
2026-03-13 Adaptive Vision-Language Model Routing for Computer Use Agents
Abstract
Huamin Chen Team 2603.12823 HJFY
2026-03-13 Generalized Recognition of Basic Surgical Actions Enables Skill Assessment and Vision-Language-Model-based Surgical Planning
Abstract
Qi Dou Team 2603.12787 HJFY
2026-03-10 NS-VLA: Towards Neuro-Symbolic Vision-Language-Action Models
Abstract
Haoran Luo Team 2603.09542 HJFY
2026-03-10 Beyond Short-Horizon: VQ-Memory for Robust Long-Horizon Manipulation in Non-Markovian Simulation Benchmarks
Abstract
Bai Chenjia Team 2603.09513 HJFY
2026-03-10 StyleVLA: Driving Style-Aware Vision Language Action Model for Autonomous Driving
Abstract
Johannes Betz Team 2603.09482 HJFY
2026-03-10 EvoDriveVLA: Evolving Autonomous Driving Vision-Language-Action Model via Collaborative Perception-Planning Distillation
Abstract
Shanghang Zhang Team 2603.09465 HJFY
2026-03-10 From Flow to One Step: Real-Time Multi-Modal Trajectory Policies via Implicit Maximum Likelihood Estimation-based Distribution Distillation
Abstract
Jianwei Zhang Team 2603.09415 HJFY
2026-03-10 CORAL: Scalable Multi-Task Robot Learning via LoRA Experts
Abstract
Zhenguo Li Team 2603.09298 HJFY
2026-03-10 See, Plan, Rewind: Progress-Aware Vision-Language-Action Models for Robust Robotic Manipulation
Abstract
Xiaojun Chang Team 2603.09292 HJFY
2026-03-10 Implicit Geometry Representations for Vision-and-Language Navigation from Web Videos
Abstract
Ivan Laptev Team 2603.09259 HJFY
2026-03-10 SPAN-Nav: Generalized Spatial Awareness for Versatile Vision-Language Navigation
Abstract
He Wang Team 2603.09163 HJFY
2026-03-10 DexHiL: A Human-in-the-Loop Framework for Vision-Language-Action Model Post-Training in Dexterous Manipulation
Abstract
Wenzhao Lian Team 2603.09121 HJFY
2026-03-06 Unified Learning of Temporal Task Structure and Action Timing for Bimanual Robot Manipulation
Abstract
Tamim Asfour Team 2603.06538 HJFY
2026-03-06 History-Conditioned Spatio-Temporal Visual Token Pruning for Efficient Vision-Language Navigation
Abstract
Christopher Rasmussen Team 2603.06480 HJFY
2026-03-06 Data Analogies Enable Efficient Cross-Embodiment Transfer
Abstract
Dorsa Sadigh Team 2603.06450 HJFY
2026-03-06 SuperSuit: An Isomorphic Bimodal Interface for Scalable Mobile Manipulation
Abstract
Lu Fang Team 2603.06280 HJFY
2026-03-06 Few-Shot Neural Differentiable Simulator: Real-to-Sim Rigid-Contact Modeling
Abstract
Fan Shi Team 2603.06218 HJFY
2026-03-06 Restoring Linguistic Grounding in VLA Models via Train-Free Attention Recalibration
Abstract
Jingjing Chen Team 2603.06001 HJFY
2026-03-06 HarvestFlex: Strawberry Harvesting via Vision-Language-Action Policy Adaptation in the Wild
Abstract
Ya Xiong Team 2603.05982 HJFY
2026-03-06 Learning Next Action Predictors from Human-Computer Interaction
Abstract
Diyi Yang Team 2603.05923 HJFY
2026-03-06 AnyCamVLA: Zero-Shot Camera Adaptation for Viewpoint Robust Vision-Language-Action Models
Abstract
Young Min Kim Team 2603.05868 HJFY
2026-03-06 DexEMG: Towards Dexterous Teleoperation System via EMG2Pose Generalization
Abstract
Kaifeng Zhang Team 2603.05861 HJFY
2026-03-05 Observing and Controlling Features in Vision-Language-Action Models
Abstract
Marco Pavone Team 2603.05487 HJFY
2026-03-05 RealWonder: Real-Time Physical Action-Conditioned Video Generation
Abstract
Jiajun Wu Team 2603.05449 HJFY
2026-03-05 PhysiFlow: Physics-Aware Humanoid Whole-Body VLA via Multi-Brain Latent Flow Matching and Robust Tracking
Abstract
Hesheng Wang Team 2603.05410 HJFY
2026-03-05 OpenFrontier: General Navigation with Visual-Language Grounded Frontiers
Abstract
Hermann Blum Team 2603.05377 HJFY
2026-03-05 Curve-Induced Dynamical Systems on Riemannian Manifolds and Lie Groups
Abstract
Sylvain Calinon Team 2603.05268 HJFY
2026-03-05 Critic in the Loop: A Tri-System VLA Framework for Robust Long-Horizon Manipulation
Abstract
Shanlin Zhong Team 2603.05185 HJFY
2026-03-05 Lifelong Language-Conditioned Robotic Manipulation Learning
Abstract
Zhi Han Team 2603.05160 HJFY
2026-03-05 Act, Think or Abstain: Complexity-Aware Adaptive Inference for Vision-Language-Action Models
Abstract
Matteo Matteucci Team 2603.05147 HJFY
2026-03-05 SeedPolicy: Horizon Scaling via Self-Evolving Diffusion Policy for Robot Manipulation
Abstract
Shuaicheng Liu Team 2603.05117 HJFY
2026-03-05 SPIRIT: Perceptive Shared Autonomy for Robust Robotic Manipulation under Deep Learning Uncertainty
Abstract
Konstantin Kondak Team 2603.05111 HJFY
2026-02-26 EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents
Abstract
Taku Komura Team 2602.23205 HJFY
2026-02-26 Residual Koopman Spectral Profiling for Predicting and Preventing Transformer Training Instability
Abstract
Yutaka Matsuo Team 2602.22988 HJFY
2026-02-26 Automated Robotic Needle Puncture for Percutaneous Dilatational Tracheostomy
Abstract
Andrew Weightman Team 2602.22952 HJFY
2026-02-26 DySL-VLA: Efficient Vision-Language-Action Model Inference via Dynamic-Static Layer-Skipping for Robot Manipulation
Abstract
Meng Li Team 2602.22896 HJFY
2026-02-26 GraspLDP: Towards Generalizable Grasping Policy via Latent Diffusion
Abstract
Di Huang Team 2602.22862 HJFY
2026-02-26 ArtPro: Self-Supervised Articulated Object Reconstruction with Adaptive Integration of Mobility Proposals
Abstract
Changhe Tu Team 2602.22666 HJFY
2026-02-26 Rethinking the Practicality of Vision-language-action Model: A Comprehensive Benchmark and An Improved Baseline
Abstract
Haoang Li Team 2602.22663 HJFY
2026-02-26 Metamorphic Testing of Vision-Language Action-Enabled Robots
Abstract
Aitor Arrieta Team 2602.22579 HJFY
2026-02-26 SignVLA: A Gloss-Free Vision-Language-Action Framework for Real-Time Sign Language-Guided Robotic Manipulation
Abstract
Zezhi Tang Team 2602.22514 HJFY
2026-02-25 When to Act, Ask, or Learn: Uncertainty-Aware Policy Steering
Abstract
Andrea Bajcsy Team 2602.22474 HJFY
2026-02-24 NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning
Abstract
Wei Zhan Team 2602.21172 HJFY
2026-02-24 ActionReasoning: Robot Action Reasoning in 3D Space with LLM for Robotic Brick Stacking
Abstract
Brian Sheil Team 2602.21161 HJFY
2026-02-24 HALO: A Unified Vision-Language-Action Model for Embodied Multimodal Chain-of-Thought Reasoning
Abstract
Song Guo Team 2602.21157 HJFY
2026-02-24 From Perception to Action: An Interactive Benchmark for Vision Reasoning
Abstract
Roy Ka-Wei Lee Team 2602.21015 HJFY
2026-02-24 Notes-to-Self: Scratchpad Augmented VLAs for Memory Dependent Manipulation Tasks
Abstract
Roland Memisevic Team 2602.21013 HJFY
2026-02-24 Toward an Agentic Infused Software Ecosystem
Abstract
Mark Marron Team 2602.20979 HJFY
2026-02-24 IG-RFT: An Interaction-Guided RL Framework for VLA Models in Long-Horizon Robotic Manipulation
IG-RFT:面向长时程机器人操作的交互引导强化学习框架,用于视觉-语言-动作模型
Huixu Dong Team 2602.20715 HJFY
2026-02-24 How Foundational Skills Influence VLM-based Embodied Agents: A Native Perspective
基础技能如何影响基于视觉语言模型的具身智能体:一个原生视角
Tong Xu Team 2602.20687 HJFY
2026-02-24 Recursive Belief Vision Language Model
递归信念视觉语言模型
Nirav Patel Team 2602.20659 HJFY
2026-02-24 Efficient and Explainable End-to-End Autonomous Driving via Masked Vision-Language-Action Diffusion
基于掩码视觉-语言-动作扩散的高效可解释端到端自动驾驶
Ziran Wang Team 2602.20577 HJFY
2026-02-19 When Vision Overrides Language: Evaluating and Mitigating Counterfactual Failures in VLAs
当视觉凌驾于语言之上:评估与缓解视觉语言动作模型中的反事实失败
Mingyu Ding Team 2602.17659 HJFY
2026-02-19 What Breaks Embodied AI Security: LLM Vulnerabilities, CPS Flaws, or Something Else?
什么在破坏具身人工智能安全:大语言模型漏洞、信息物理系统缺陷,还是其他因素?
Yue Zhang Team 2602.17345 HJFY
2026-02-19 FRAPPE: Infusing World Modeling into Generalist Policies via Multiple Future Representation Alignment
FRAPPE:通过多未来表示对齐将世界建模融入通用策略
Donglin Wang Team 2602.17259 HJFY
2026-02-19 Web Verbs: Typed Abstractions for Reliable Task Composition on the Agentic Web
网络动词:面向智能网络可靠任务组合的类型化抽象
Suman Nath Team 2602.17245 HJFY
2026-02-19 Benchmarking the Effects of Object Pose Estimation and Reconstruction on Robotic Grasping Success
评估物体姿态估计与重建对机器人抓取成功率影响的基准研究
Torsten Sattler Team 2602.17101 HJFY
2026-02-18 MALLVI: a multi agent framework for integrated generalized robotics manipulation
MALLVI:一种面向集成通用机器人操作的多智能体框架
Babak Khalaj Team 2602.16898 HJFY
2026-02-18 EgoScale: Scaling Dexterous Manipulation with Diverse Egocentric Human Data
EgoScale:利用多样化的自我中心人类数据扩展灵巧操作能力
Linxi Fan Team 2602.16710 HJFY
2026-02-19 RoboGene: Boosting VLA Pre-training via Diversity-Driven Agentic Framework for Real-World Task Generation
RoboGene:通过多样性驱动的智能体框架提升视觉语言动作预训练,实现真实世界任务生成
Jian Tang Team 2602.16444 HJFY
2026-02-17 Learning to Retrieve Navigable Candidates for Efficient Vision-and-Language Navigation
学习检索可导航候选对象以实现高效的视觉与语言导航
Lina Yao Team 2602.15724 HJFY
2026-02-17 The Next Paradigm Is User-Centric Agent, Not Platform-Centric Service
下一代范式是用户中心智能体,而非平台中心服务
Enhong Chen Team 2602.15682 HJFY
2026-02-12 Scaling Verification Can Be More Effective than Scaling Policy Learning for Vision-Language-Action Alignment
扩展验证在视觉-语言-动作对齐中比扩展策略学习更有效
Marco Pavone Team 2602.12281 HJFY
2026-02-12 Embodied AI Agents for Team Collaboration in Co-located Blue-Collar Work
面向共址蓝领工作团队协作的具身人工智能体
Thomas Olsson Team 2602.12136 HJFY
2026-02-12 GigaBrain-0.5M*: a VLA That Learns From World Model-Based Reinforcement Learning
GigaBrain-0.5M*:一种基于世界模型强化学习训练的视觉-语言-动作模型
Zheng Zhu Team 2602.12099 HJFY
2026-02-12 VLAW: Iterative Co-Improvement of Vision-Language-Action Policy and World Model
VLAW:视觉-语言-动作策略与世界模型的迭代协同改进
Chelsea Finn Team 2602.12063 HJFY
2026-02-12 HoloBrain-0 Technical Report
HoloBrain-0技术报告
Zhizhong Su Team 2602.12062 HJFY
2026-02-12 When would Vision-Proprioception Policies Fail in Robotic Manipulation?
视觉-本体感知策略在机器人操作中何时会失效?
Di Hu Team 2602.12032 HJFY
2026-02-12 Robot-DIFT: Distilling Diffusion Features for Geometrically Consistent Visuomotor Control
Robot-DIFT:提取扩散特征以实现几何一致的视觉运动控制
Georgia Chalvatzaki Team 2602.11934 HJFY
2026-02-12 JEPA-VLA: Video Predictive Embedding is Needed for VLA Models
JEPA-VLA:视觉语言动作模型需要视频预测性嵌入
Mingsheng Long Team 2602.11832 HJFY
2026-02-12 Clutt3R-Seg: Sparse-view 3D Instance Segmentation for Language-grounded Grasping in Cluttered Scenes
Clutt3R-Seg:面向杂乱场景中语言驱动抓取的稀疏视角三维实例分割
Ayoung Kim Team 2602.11660 HJFY
2026-02-12 ViTaS: Visual Tactile Soft Fusion Contrastive Learning for Visuomotor Learning
ViTaS:面向视觉运动学习的视觉触觉软融合对比学习
Huazhe Xu Team 2602.11643 HJFY
2026-02-10 MVISTA-4D: View-Consistent 4D World Model with Test-Time Action Inference for Robotic Manipulation
MVISTA-4D:具有测试时动作推理能力的视图一致4D世界模型,用于机器人操作
Xiangyu Yue Team 2602.09878 HJFY
2026-02-10 Code2World: A GUI World Model via Renderable Code Generation
Code2World:通过可渲染代码生成的GUI世界模型
Kevin Qinghong Lin Team 2602.09856 HJFY
2026-02-10 BagelVLA: Enhancing Long-Horizon Manipulation via Interleaved Vision-Language-Action Generation
BagelVLA:通过交错视觉-语言-动作生成增强长时程操作能力
Jianyu Chen Team 2602.09849 HJFY
2026-02-10 NavDreamer: Video Models as Zero-Shot 3D Navigators
NavDreamer:视频模型作为零样本三维导航器
Fei Gao Team 2602.09765 HJFY
2026-02-10 Rethinking Visual-Language-Action Model Scaling: Alignment, Mixture, and Regularization
重新审视视觉-语言-动作模型的规模化:对齐、混合与正则化
Qin Jin Team 2602.09722 HJFY
2026-02-10 AutoFly: Vision-Language-Action Model for UAV Autonomous Navigation in the Wild
AutoFly:面向野外无人机自主导航的视觉-语言-动作模型
Hui Xiong Team 2602.09657 HJFY
2026-02-10 VideoAfford: Grounding 3D Affordance from Human-Object-Interaction Videos via Multimodal Large Language Model
VideoAfford:基于多模态大语言模型从人-物交互视频中实现三维功能可及性接地
Hui Xiong Team 2602.09638 HJFY
2026-02-10 Hand2World: Autoregressive Egocentric Interaction Generation via Free-Space Hand Gestures
Hand2World:基于自由空间手势的自回归第一人称交互生成
Xingang Pan Team 2602.09600 HJFY
2026-02-10 Preference Aligned Visuomotor Diffusion Policies for Deformable Object Manipulation
面向可变形物体操作的偏好对齐视觉运动扩散策略
Danica Kragic Team 2602.09583 HJFY
2026-02-10 AUHead: Realistic Emotional Talking Head Generation via Action Units Control
AUHead:基于动作单元控制的逼真情感说话头部生成
Tat-Seng Chua Team 2602.09534 HJFY
2026-02-04 Capturing Visual Environment Structure Correlates with Control Performance
捕捉视觉环境结构与控制性能的相关性
Yu-Xiong Wang Team 2602.04880 HJFY
2026-02-04 CoWTracker: Tracking by Warping instead of Correlation
CoWTracker:通过变形而非相关性进行跟踪
Andrea Vedaldi Team 2602.04877 HJFY
2026-02-04 Relational Scene Graphs for Object Grounding of Natural Language Commands
面向自然语言指令中物体定位的关系场景图
Ville Kyrki Team 2602.04635 HJFY
2026-02-04 Act, Sense, Act: Learning Non-Markovian Active Perception Strategies from Large-Scale Egocentric Human Data
行动、感知、再行动:从大规模第一人称人类数据中学习非马尔可夫主动感知策略
Wenzhao Lian Team 2602.04600 HJFY
2026-02-04 A Unified Complementarity-based Approach for Rigid-Body Manipulation and Motion Prediction
基于互补性的统一方法在刚体操作与运动预测中的应用
Riddhiman Laha Team 2602.04522 HJFY
2026-02-04 EgoActor: Grounding Task Planning into Spatial-aware Egocentric Actions for Humanoid Robots via Visual-Language Models
EgoActor:通过视觉语言模型将任务规划落地为具身机器人的空间感知自我中心动作
Börje F. Karlsson Team 2602.04515 HJFY
2026-02-04 Self-evolving Embodied AI
自演化的具身人工智能
Wenwu Zhu Team 2602.04411 HJFY
2026-02-04 GeneralVLA: Generalizable Vision-Language-Action Models with Knowledge-Guided Trajectory Planning
GeneralVLA:具备知识引导轨迹规划的通用视觉-语言-动作模型
Hao Tang Team 2602.04315 HJFY
2026-02-04 Viewpoint Matters: Dynamically Optimizing Viewpoints with Masked Autoencoder for Visual Manipulation
视角至关重要:利用掩码自编码器动态优化视觉操控的视角
Wenzhao Lian Team 2602.04243 HJFY
2026-02-04 GeoLanG: Geometry-Aware Language-Guided Grasping with Unified RGB-D Multimodal Learning
GeoLanG:基于统一RGB-D多模态学习的几何感知语言引导抓取
Hongliang Ren Team 2602.04231 HJFY
2026-02-02 TIC-VLA: A Think-in-Control Vision-Language-Action Model for Robot Navigation in Dynamic Environments
TIC-VLA:一种用于动态环境中机器人导航的思控一体化视觉-语言-动作模型
Jiaqi Ma Team 2602.02459 HJFY
2026-02-02 World-Gymnast: Training Robots with Reinforcement Learning in a World Model
世界体操家:在世界模型中通过强化学习训练机器人
Sherry Yang Team 2602.02454 HJFY
2026-02-02 SoMA: A Real-to-Sim Neural Simulator for Robotic Soft-body Manipulation
SoMA:面向机器人软体操作的真实到仿真神经模拟器
Jiangmiao Pang Team 2602.02402 HJFY
2026-02-02 MAIN-VLA: Modeling Abstraction of Intention and eNvironment for Vision-Language-Action Models
MAIN-VLA:为视觉-语言-动作模型建模意图与环境的抽象
Lemiao Qiu Team 2602.02212 HJFY
2026-02-02 FD-VLA: Force-Distilled Vision-Language-Action Model for Contact-Rich Manipulation
FD-VLA:用于接触丰富操作的力蒸馏视觉-语言-动作模型
Haiyue Zhu Team 2602.02142 HJFY
2026-02-02 See2Refine: Vision-Language Feedback Improves LLM-Based eHMI Action Designers
See2Refine:视觉-语言反馈提升基于大语言模型的eHMI行为设计能力
Takeo Igarashi Team 2602.02063 HJFY
2026-02-02 Concept-Based Dictionary Learning for Inference-Time Safety in Vision Language Action Models
面向视觉语言动作模型推理时安全性的概念词典学习方法
Di Wang Team 2602.01834 HJFY
2026-02-02 From Knowing to Doing Precisely: A General Self-Correction and Termination Framework for VLA models
从精确认知到精准执行:面向视觉语言动作模型的通用自校正与终止框架
Jianzong Wang Team 2602.01811 HJFY
2026-02-02 AgenticLab: A Real-World Robot Agent Platform that Can See, Think, and Act
AgenticLab:一个能够观察、思考与行动的真实世界机器人智能体平台
Yu She Team 2602.01662 HJFY
2026-02-02 From Perception to Action: Spatial AI Agents and World Models
从感知到行动:空间人工智能代理与世界模型
Esteban Rojas Team 2602.01644 HJFY
2026-01-30 Temporally Coherent Imitation Learning via Latent Action Flow Matching for Robotic Manipulation Wu Songwei et al. 2601.23087
2026-01-30 EAG-PT: Emission-Aware Gaussians and Path Tracing for Indoor Scene Reconstruction and Editing Xijie Yang et al. 2601.23065
2026-01-30 Learning Geometrically-Grounded 3D Visual Representations for View-Generalizable Robotic Manipulation Di Zhang et al. 2601.22988
2026-01-30 Alignment among Language, Vision and Action Representations Nicola Milano et al. 2601.22948
2026-01-30 When Anomalies Depend on Context: Learning Conditional Compatibility for Anomaly Detection Shashank Mishra et al. 2601.22868
2026-01-30 Vision-Language Models Unlock Task-Centric Latent Actions Alexander Nikulin et al. 2601.22714
2026-01-30 Best-of-Q: Improving VLM agents with Q-function Action Ranking at Inference Emilien Biré et al. 2601.22701
2026-01-30 CARE: Multi-Task Pretraining for Latent Continuous Action Representation in Robot Control Jiaqi Shi et al. 2601.22467
2026-01-29 PoSafeNet: Safe Learning with Poset-Structured Neural Nets Kiwan Wong et al. 2601.22356
2026-01-29 DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation Haozhe Xie et al. 2601.22153
2026-01-29 PLANING: A Loosely Coupled Triangle-Gaussian Framework for Streaming 3D Reconstruction Changjian Jiang et al. 2601.22046
2026-01-29 PocketDP3: Efficient Pocket-Scale 3D Visuomotor Policy Jinhao Zhang et al. 2601.22018
2026-01-29 Causal World Modeling for Robot Control Lin Li et al. 2601.21998
2026-01-29 MoE-ACT: Improving Surgical Imitation Learning Policies through Supervised Mixture-of-Experts Lorenzo Mazza et al. 2601.21971
2026-01-29 Information Filtering via Variational Regularization for Robot Manipulation Jinhao Zhang et al. 2601.21926
2026-01-29 Dynamic Topology Awareness: Breaking the Granularity Rigidity in Vision-Language Navigation Jiankun Peng et al. 2601.21751
2026-01-29 CoFreeVLA: Collision-Free Dual-Arm Manipulation via Vision-Language-Action Model and Risk Estimation Xuanran Zhai et al. 2601.21712
2026-01-29 AIR-VLA: Vision-Language-Action Systems for Aerial Manipulation Jianli Sun et al. 2601.21602
2026-01-29 EmboCoach-Bench: Benchmarking AI Agents on Developing Embodied Robots Zixing Lei et al. 2601.21570

Evaluation status is saved locally in the browser (localStorage) and does not sync across devices or browsers.

📌 VLN

Publish Date (YYYY-MM-DD) Title Authors PDF HJFY Evaluation
2026-05-14 Exploring Bottlenecks in VLM-LLM Navigation: How 3D Scene Understanding Capability Impacts Zero-Shot VLN
探索VLM-LLM导航中的瓶颈:3D场景理解能力如何影响零样本VLN
Ling Pei Team 2605.14801 HJFY
2026-05-13 What Limits Vision-and-Language Navigation?
视觉与语言导航的瓶颈何在?
Renjing Xu Team 2605.13328 HJFY
2026-05-13 HCSG: Human-Centric Semantic-Geometric Reasoning for Vision-Language Navigation
HCSG:面向视觉语言导航的以人为中心的语义-几何推理
Haoang Li Team 2605.13321 HJFY
2026-05-11 SleepWalk: A Three-Tier Benchmark for Stress-Testing Instruction-Guided Vision-Language Navigation
Amitava Das Team 2605.10376 HJFY
2026-05-11 Plan in Sandbox, Navigate in Open Worlds: Learning Physics-Grounded Abstracted Experience for Embodied Navigation
沙盒中规划,开放世界中导航:面向具身导航的学习物理基础抽象经验
Tianrui Li Team 2605.10118 HJFY
2026-05-13 SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning
SimWorld Studio:面向具身智能体学习的自主环境生成与演化编码智能体
Lianhui Qin Team 2605.09423 HJFY
2026-05-09 LCGNav: Local Candidate-Aware Geometric Enhancement for General Topological Planning in Vision-Language Navigation
LCGNav:面向视觉语言导航中通用拓扑规划的局部候选感知几何增强方法
Ying Xu Team 2605.09053 HJFY
2026-05-08 PathPainter: Transferring the Generalization Ability of Image Generation Models to Embodied Navigation
PathPainter:将图像生成模型的泛化能力迁移至具身导航
Fei Gao Team 2605.07496 HJFY
2026-05-07 Cross-Modal Navigation with Multi-Agent Reinforcement Learning
基于多智能体强化学习的跨模态导航
Christopher Amato Team 2605.06595 HJFY
2026-05-08 NavOne: One-Step Global Planning for Vision-Language Navigation on Top-Down Maps
NavOne:面向俯视地图的视觉-语言导航的一步全局规划方法
Xuemiao Xu Team 2605.06317 HJFY
2026-05-04 Change-Robust Online Spatial-Semantic Topological Mapping
抗变化在线空间语义拓扑映射
Harold Soh Team 2605.02227 HJFY
2026-05-03 Multi-Scale Gaussian-Language Map for Zero-shot Embodied Navigation and Reasoning
多尺度高斯语言地图用于零样本具身导航与推理
Shuqiang Jiang Team 2605.01736 HJFY
2026-05-03 TrajRAG: Retrieving Geometric-Semantic Experience for Zero-Shot Object Navigation
TrajRAG:为零样本目标导航检索几何-语义经验
Shuqiang Jiang Team 2605.01700 HJFY
2026-04-30 SpaAct: Spatially-Activated Transition Learning with Curriculum Adaptation for Vision-Language Navigation
SpaAct:面向视觉-语言导航的空间激活式迁移学习与课程自适应方法
Nanning Zheng Team 2604.27620 HJFY
2026-04-30 World2Minecraft: Occupancy-Driven Simulated Scenes Construction
World2Minecraft:基于占用驱动的仿真场景构建
Xin Tan Team 2604.27578 HJFY
2026-04-29 Three-Step Nav: A Hierarchical Global-Local Planner for Zero-Shot Vision-and-Language Navigation
三步导航:用于零样本视觉语言导航的分层全局-局部规划器
Laurent Itti Team 2604.26946 HJFY
2026-04-28 Where Did It Go Wrong? Capability-Oriented Failure Attribution for Vision-and-Language Navigation Agents
问题出在哪里?面向视觉与语言导航智能体的能力导向失败归因
Fanjiang Xu Team 2604.25161 HJFY
2026-04-27 FreqCache: Accelerating Embodied VLN Models with Adaptive Frequency-Guided Token Caching
FreqCache:自适应频率引导的令牌缓存加速具身VLN模型
Xiang Chen Team 2604.24391 HJFY
2026-04-23 A Deployable Embodied Vision-Language Navigation System with Hierarchical Cognition and Context-Aware Exploration
一种具有层次化认知与上下文感知探索的可部署具身视觉语言导航系统
Lihua Xie Team 2604.21363 HJFY
2026-04-22 Self-Predictive Representation for Autonomous UAV Object-Goal Navigation
面向自主无人机目标导向导航的自我预测表征
Bruno J. T. Fernandes Team 2604.21130 HJFY
2026-04-21 LiveVLN: Breaking the Stop-and-Go Loop in Vision-Language Navigation
LiveVLN:打破视觉语言导航中的走走停停循环
Feng Zheng Team 2604.19536 HJFY
2026-04-21 The Essence of Balance for Self-Improving Agents in Vision-and-Language Navigation
视觉-语言导航中自我改进代理的平衡本质
Jingwen Fu Team 2604.19064 HJFY
2026-04-21 Explore Like Humans: Autonomous Exploration with Online SG-Memo Construction for Embodied Agents
像人类一样探索:面向具身智能体的在线语义图记忆构建自主探索方法
Mu Xu Team 2604.19034 HJFY
2026-04-20 Instruction-as-State: Environment-Guided and State-Conditioned Semantic Understanding for Embodied Navigation
指令即状态:环境引导与状态条件化的具身导航语义理解
Jingwen Fu Team 2604.18223 HJFY
2026-04-19 Dual-Anchoring: Addressing State Drift in Vision-Language Navigation
双重锚定:解决视觉语言导航中的状态漂移问题
Jianyi Liu Team 2604.17473 HJFY
2026-04-19 LookasideVLN: Direction-Aware Aerial Vision-and-Language Navigation
LookasideVLN:方向感知的空中视觉语言导航
Guanbin Li Team 2604.17190 HJFY
2026-04-18 Mini-BEHAVIOR-Gran: Revealing U-Shaped Effects of Instruction Granularity on Language-Guided Embodied Agents
Mini-BEHAVIOR-Gran:揭示指令粒度对语言引导具身智能体的U型效应
Hamid Rezatofighi Team 2604.17019 HJFY
2026-04-18 Rule-VLN: Bridging Perception and Compliance via Semantic Reasoning and Geometric Rectification
Rule-VLN:通过语义推理与几何校正桥接感知与合规性
Xiaowen Chu Team 2604.16993 HJFY
2026-04-17 FineCog-Nav: Integrating Fine-grained Cognitive Modules for Zero-shot Multimodal UAV Navigation
FineCog-Nav:集成细粒度认知模块实现零样本多模态无人机导航
Jing Huo Team 2604.16298 HJFY
2026-04-15 Vision-and-Language Navigation for UAVs: Progress, Challenges, and a Research Roadmap
无人机视觉与语言导航:进展、挑战与研究路线图
Ji Pei Team 2604.13654 HJFY
2026-04-14 OVAL: Open-Vocabulary Augmented Memory Model for Lifelong Object Goal Navigation
OVAL:面向终身目标导航的开放词汇增强记忆模型
Xueqian Wang Team 2604.12872 HJFY
2026-04-14 DeCoNav: Dialog enhanced Long-Horizon Collaborative Vision-Language Navigation
DeCoNav:对话增强的长时程协作视觉语言导航
Xuelong Li Team 2604.12486 HJFY
2026-04-13 Fast-SegSim: Real-Time Open-Vocabulary Segmentation for Robotics in Simulation
Fast-SegSim:面向机器人仿真的实时开放词汇分割
Yue Wang Team 2604.10951 HJFY
2026-04-19 VLN-NF: Feasibility-Aware Vision-and-Language Navigation with False-Premise Instructions
VLN-NF:具备可行性感知的视觉语言导航与虚假前提指令处理
Winston H. Hsu Team 2604.10533 HJFY
2026-04-10 HTNav: A Hybrid Navigation Framework with Tiered Structure for Urban Aerial Vision-and-Language Navigation
HTNav:一种面向城市空中视觉与语言导航的层级式混合导航框架
Jie Qin Team 2604.08883 HJFY
2026-04-09 HiRO-Nav: Hybrid ReasOning Enables Efficient Embodied Navigation
HiRO-Nav:混合推理实现高效具身导航
Chunyan Miao Team 2604.08232 HJFY
2026-04-09 How Far Are Large Multimodal Models from Human-Level Spatial Action? A Benchmark for Goal-Oriented Embodied Navigation in Urban Airspace
大型多模态模型距离人类水平空间行动能力还有多远?面向城市空域目标导向具身导航的基准测试
Xinlei Chen Team 2604.07973 HJFY
2026-04-09 WorldMAP: Bootstrapping Vision-Language Navigation Trajectory Prediction with Generative World Models
WorldMAP:利用生成式世界模型自举视觉语言导航轨迹预测
Zhibo Chen Team 2604.07957 HJFY
2026-04-09 Vision-Language Navigation for Aerial Robots: Towards the Era of Large Language Models
面向空中机器人的视觉语言导航:迈向大语言模型时代
Wen Yao Team 2604.07705 HJFY
2026-04-06 ROSClaw: A Hierarchical Semantic-Physical Framework for Heterogeneous Multi-Agent Collaboration
ROSClaw:一种面向异构多智能体协作的分层语义-物理框架
Jie Chen Team 2604.04664 HJFY
2026-04-05 Hypothesis Graph Refinement: Hypothesis-Driven Exploration with Cascade Error Correction for Embodied Navigation
假设图优化:基于假设驱动探索与级联错误纠正的具身导航
Qing Li Team 2604.04108 HJFY
2026-04-03 FSUNav: A Cerebrum-Cerebellum Architecture for Fast, Safe, and Universal Zero-Shot Goal-Oriented Navigation
FSUNav:一种用于快速、安全且通用的零样本目标导向导航的大脑-小脑架构
Wei Zhang Team 2604.03139 HJFY
2026-04-02 Stop Wandering: Efficient Vision-Language Navigation via Metacognitive Reasoning
停止徘徊:通过元认知推理实现高效视觉语言导航
Guozi Liu Team 2604.02318 HJFY
2026-03-31 Benchmarking Interaction, Beyond Policy: a Reproducible Benchmark for Collaborative Instance Object Navigation
超越策略的交互基准测试:一个可复现的协作实例物体导航基准
Loris Bazzani Team 2604.00265 HJFY
2026-03-31 LatentPilot: Scene-Aware Vision-and-Language Navigation by Dreaming Ahead with Latent Visual Reasoning
LatentPilot:通过潜在视觉推理前瞻梦境,实现场景感知的视觉与语言导航
Xiaojun Chang Team 2603.29165 HJFY
2026-03-30 CARLA-Air: Fly Drones Inside a CARLA World -- A Unified Infrastructure for Air-Ground Embodied Intelligence
CARLA-Air:在CARLA世界中飞行无人机——面向空地具身智能的统一基础设施
Hong Zhang Team 2603.28032 HJFY
2026-03-29 Structured Observation Language for Efficient and Generalizable Vision-Language Navigation
结构化观察语言:实现高效且可泛化的视觉语言导航
Jun Ma Team 2603.27577 HJFY
2026-03-27 Beyond Textual Knowledge-Leveraging Multimodal Knowledge Bases for Enhancing Vision-and-Language Navigation
超越文本知识:利用多模态知识库增强视觉与语言导航
Liejun Wang Team 2603.26859 HJFY
2026-03-27 SpatialAnt: Autonomous Zero-Shot Robot Navigation via Active Scene Reconstruction and Visual Anticipation
SpatialAnt:通过主动场景重建与视觉预测实现自主零样本机器人导航
Qi Wu Team 2603.26837 HJFY
2026-03-23 IGV-RRT: Prior-Real-Time Observation Fusion for Active Object Search in Changing Environments
IGV-RRT:面向动态环境中主动目标搜索的先验与实时观测融合方法
Chaoqun Wang Team 2603.21887 HJFY
2026-03-22 DyGeoVLN: Infusing Dynamic Geometry Foundation Model into Vision-Language Navigation
DyGeoVLN:将动态几何基础模型融入视觉语言导航
Sung-Eui Yoon Team 2603.21269 HJFY
2026-03-22 SpatialFly: Geometry-Guided Representation Alignment for UAV Vision-and-Language Navigation in Urban Environments
SpatialFly:面向城市环境中无人机视觉语言导航的几何引导表示对齐方法
Xiangyang Ji Team 2603.21046 HJFY
2026-03-21 Does Peer Observation Help? Vision-Sharing Collaboration for Vision-Language Navigation
同伴观察是否有效?视觉语言导航中的视觉共享协作研究
Qi Wu Team 2603.20804 HJFY
2026-03-20 Memory Over Maps: 3D Object Localization Without Reconstruction
记忆优于地图:无需重建的三维物体定位
Marc Pollefeys Team 2603.20530 HJFY
2026-03-20 HUGE-Bench: A Benchmark for High-Level UAV Vision-Language-Action Tasks
HUGE-Bench:面向高级无人机视觉-语言-动作任务的基准测试平台
Mingming Gong Team 2603.19822 HJFY
2026-03-20 CeRLP: A Cross-embodiment Robot Local Planning Framework for Visual Navigation
CeRLP:一种面向视觉导航的跨形态机器人局部规划框架
Wei Zhang Team 2603.19602 HJFY
2026-03-19 NavTrust: Benchmarking Trustworthiness for Embodied Navigation
NavTrust:面向具身导航的信任度基准测试
Jiachen Li Team 2603.19229 HJFY
2026-03-19 Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation
语义与度量:面向视觉语言导航的多智能体概率性接地方法
Nakul Gopalan Team 2603.19166 HJFY
2026-03-19 REST: Receding Horizon Explorative Steiner Tree for Zero-Shot Object-Goal Navigation
REST:用于零样本目标导航的滚动时域探索斯坦纳树
Hui Kong Team 2603.18624 HJFY
2026-03-19 SR-Nav: Spatial Relationships Matter for Zero-shot Object Goal Navigation
SR-Nav:空间关系对零样本目标导向导航至关重要
Yinlong Yan Team 2603.18443 HJFY
2026-03-18 GoalVLM: VLM-driven Object Goal Navigation for Multi-Agent System
GoalVLM:面向多智能体系统的视觉语言模型驱动目标物体导航
Dzmitry Tsetserukou Team 2603.18210 HJFY
2026-03-18 AgentVLN: Towards Agentic Vision-and-Language Navigation
AgentVLN:迈向智能体化的视觉与语言导航
Shengjun Huang Team 2603.17670 HJFY
2026-03-18 P$^{3}$Nav: End-to-End Perception, Prediction and Planning for Vision-and-Language Navigation
P$^{3}$Nav:面向视觉与语言导航的端到端感知、预测与规划框架
Haoang Li Team 2603.17459 HJFY
2026-03-18 FloorPlan-VLN: A New Paradigm for Floor Plan Guided Vision-Language Navigation
FloorPlan-VLN:一种基于平面图引导的视觉语言导航新范式
Liang Wang Team 2603.17437 HJFY
2026-03-18 OmniVLN: Omnidirectional 3D Perception and Token-Efficient LLM Reasoning for Visual-Language Navigation across Air and Ground Platforms
OmniVLN:面向空地和地面平台视觉语言导航的全向三维感知与令牌高效大语言模型推理
Lihua Xie Team 2603.17351 HJFY
2026-03-16 EmergeNav: Structured Embodied Inference for Zero-Shot Vision-and-Language Navigation in Continuous Environments
EmergeNav:面向连续环境中零样本视觉语言导航的结构化具身推理框架
Xiaoguang Ma Team 2603.16947 HJFY
2026-03-17 SignNav: Leveraging Signage for Semantic Visual Navigation in Large-Scale Indoor Environments
SignNav:利用标识牌在大规模室内环境中实现语义视觉导航
Hui Kong Team 2603.16166 HJFY
2026-03-16 FlatLands: Generative Floormap Completion From a Single Egocentric View
FlatLands:基于单视角第一人称视图的生成式楼层平面图补全
Rahul Shome Team 2603.16016 HJFY
2026-03-16 Nonequilibrium energetics of sensing and actuation by a smart active particle
智能活性粒子感知与驱动的非平衡能量学
Lorenzo Piro Team 2603.15602 HJFY
2026-03-16 Trajectory-Diversity-Driven Robust Vision-and-Language Navigation
轨迹多样性驱动的鲁棒视觉语言导航
Yihong Gong Team 2603.15370 HJFY
2026-03-16 HiMemVLN: Enhancing Reliability of Open-Source Zero-Shot Vision-and-Language Navigation with Hierarchical Memory System
HiMemVLN:通过分层记忆系统提升开源零样本视觉语言导航的可靠性
Ce Hao Team 2603.14807 HJFY
2026-03-13 DecoVLN: Decoupling Observation, Reasoning, and Correction for Vision-and-Language Navigation
DecoVLN:面向视觉与语言导航的观测、推理与纠错解耦框架
Shengjun Huang Team 2603.13133 HJFY
2026-03-13 GoalSwarm: Multi-UAV Semantic Coordination for Open-Vocabulary Object Navigation
目标蜂群:面向开放词汇目标导航的多无人机语义协同框架
Dzmitry Tsetserukou Team 2603.12908 HJFY
2026-03-13 HaltNav: Reactive Visual Halting over Lightweight Topological Priors for Robust Vision-Language Navigation
HaltNav:基于轻量级拓扑先验的响应式视觉停顿,实现鲁棒的视觉语言导航
Sören Schwertfeger Team 2603.12696 HJFY
2026-03-11 OnFly: Onboard Zero-Shot Aerial Vision-Language Navigation toward Safety and Efficiency
OnFly:面向安全与效率的机载零样本空中视觉语言导航
Boyu Zhou Team 2603.10682 HJFY
2026-03-10 Let's Reward Step-by-Step: Step-Aware Contrastive Alignment for Vision-Language Navigation in Continuous Environments
逐步奖励:连续环境中视觉语言导航的步骤感知对比对齐
Yi Yang Team 2603.09740 HJFY
2026-03-10 Implicit Geometry Representations for Vision-and-Language Navigation from Web Videos
基于网络视频的视觉语言导航隐式几何表征
Ivan Laptev Team 2603.09259 HJFY
2026-03-10 SPAN-Nav: Generalized Spatial Awareness for Versatile Vision-Language Navigation
SPAN-Nav:面向通用视觉语言导航的广义空间感知
He Wang Team 2603.09163 HJFY
2026-03-10 PM-Nav: Priori-Map Guided Embodied Navigation in Functional Buildings
PM-Nav:基于先验地图引导的功能性建筑内具身导航
Xiaoguang Ma Team 2603.09113 HJFY
2026-03-09 From Reactive to Map-Based AI: Tuned Local LLMs for Semantic Zone Inference in Object-Goal Navigation
从反应式到基于地图的人工智能:利用微调本地大语言模型实现目标导向导航中的语义区域推断
Kanji Tanaka Team 2603.08086 HJFY
2026-03-09 ViSA-Enhanced Aerial VLN: A Visual-Spatial Reasoning Enhanced Framework for Aerial Vision-Language Navigation
ViSA增强的空中视觉语言导航:一种视觉空间推理增强的空中视觉语言导航框架
Chenghao Lin Team 2603.08007 HJFY
2026-03-09 CMMR-VLN: Vision-and-Language Navigation via Continual Multimodal Memory Retrieval
CMMR-VLN:基于持续多模态记忆检索的视觉与语言导航
Xiaoguang Ma Team 2603.07997 HJFY
2026-03-08 MWM: Mobile World Models for Action-Conditioned Consistent Prediction
MWM:面向动作条件一致预测的移动世界模型
Hao Tang Team 2603.07799 HJFY
2026-03-07 FreeFly-Thinking: Aligning Chain-of-Thought Reasoning with Continuous UAV Navigation
自由飞行思维:将思维链推理与连续无人机导航对齐
Tao Li Team 2603.07181 HJFY
2026-03-10 VLN-Cache: Enabling Token Caching for VLN Models with Visual/Semantic Dynamics Awareness
VLN-Cache:基于视觉/语义动态感知的视觉语言导航模型令牌缓存技术
Xiang Chen Team 2603.07080 HJFY
2026-03-06 History-Conditioned Spatio-Temporal Visual Token Pruning for Efficient Vision-Language Navigation
面向高效视觉语言导航的历史条件时空视觉令牌剪枝方法
Christopher Rasmussen Team 2603.06480 HJFY
2026-03-06 Lifelong Embodied Navigation Learning
终身具身导航学习
Zhi Han Team 2603.06073 HJFY
2026-03-05 OpenFrontier: General Navigation with Visual-Language Grounded Frontiers
OpenFrontier:基于视觉-语言锚定前沿的通用导航
Hermann Blum Team 2603.05377 HJFY
2026-03-04 Efficient Autonomous Navigation of a Quadruped Robot in Underground Mines on Edge Hardware
四足机器人在地下矿井边缘硬件上的高效自主导航
Kwame Awuah-Offei Team 2603.04470 HJFY
2026-03-04 RAGNav: A Retrieval-Augmented Topological Reasoning Framework for Multi-Goal Visual-Language Navigation
RAGNav:面向多目标视觉语言导航的检索增强型拓扑推理框架
Qianqian Bai Team 2603.03745 HJFY
2026-03-04 PROSPECT: Unified Streaming Vision-Language Navigation via Semantic-Spatial Fusion and Latent Predictive Representation
PROSPECT:通过语义-空间融合与潜在预测表征实现统一的流式视觉语言导航
Feng Gao Team 2603.03739 HJFY
2026-03-03 MA-CoNav: A Master-Slave Multi-Agent Framework with Hierarchical Collaboration and Dual-Level Reflection for Long-Horizon Embodied VLN
MA-CoNav:一种面向长程具身视觉语言导航的主从式多智能体框架,具备层次化协作与双级反思机制
Qianqian Bai Team 2603.03024 HJFY
2026-03-03 TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation
TagaVLM:面向视觉语言导航的拓扑感知全局动作推理
Baocai Yin Team 2603.02972 HJFY
2026-03-03 Agentic Self-Evolutionary Replanning for Embodied Navigation
具身导航中的自主自进化重规划
Chengzhong Xu Team 2603.02772 HJFY
2026-03-02 CHOP: Counterfactual Human Preference Labels Improve Obstacle Avoidance in Visuomotor Navigation Policies
CHOP:利用反事实人类偏好标签提升视觉运动导航策略的避障能力
Dinesh Manocha Team 2603.02004 HJFY
2026-02-27 Enhancing Vision-Language Navigation with Multimodal Event Knowledge from Real-World Indoor Tour Videos
利用真实世界室内导览视频的多模态事件知识增强视觉语言导航
Haoang Li Team 2602.23937 HJFY
2026-02-20 CapNav: Benchmarking Vision Language Models on Capability-conditioned Indoor Navigation
CapNav:基于能力条件室内导航的视觉语言模型基准测试
Jon Froehlich Team 2602.18424 HJFY
2026-02-17 Learning to Retrieve Navigable Candidates for Efficient Vision-and-Language Navigation
学习检索可导航候选对象以实现高效的视觉与语言导航
Lina Yao Team 2602.15724 HJFY
2026-02-17 One Agent to Guide Them All: Empowering MLLMs for Vision-and-Language Navigation via Explicit World Representation
一智体引领全局:通过显式世界表征赋能多模态大语言模型实现视觉与语言导航
Qi Wu Team 2602.15400 HJFY
2026-02-16 pFedNavi: Structure-Aware Personalized Federated Vision-Language Navigation for Embodied AI
pFedNavi:面向具身AI的结构感知个性化联邦视觉语言导航
Haibing Guan Team 2602.14401 HJFY
2026-02-12 ABot-N0: Technical Report on the VLA Foundation Model for Versatile Embodied Navigation
ABot-N0:面向通用具身导航的视觉-语言-动作基础模型技术报告
Mu Xu Team 2602.11598 HJFY
2026-02-10 Hydra-Nav: Object Navigation via Adaptive Dual-Process Reasoning
Hydra-Nav:基于自适应双过程推理的目标导航
Yiming Gan Team 2602.09972 HJFY
2026-02-10 AutoFly: Vision-Language-Action Model for UAV Autonomous Navigation in the Wild
AutoFly:面向野外无人机自主导航的视觉-语言-动作模型
Hui Xiong Team 2602.09657 HJFY
2026-02-09 When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning
何时想象与想象多少:基于世界模型的自适应测试时缩放用于视觉空间推理
Mohit Bansal Team 2602.08236 HJFY
2026-02-10 LCLA: Language-Conditioned Latent Alignment for Vision-Language Navigation
LCLA:面向视觉语言导航的语言条件化潜在对齐框架
Soumik Sarkar Team 2602.07629 HJFY
2026-02-06 Bridging the Indoor-Outdoor Gap: Vision-Centric Instruction-Guided Embodied Navigation for the Last Meters
弥合室内外鸿沟:面向最后几米的视觉中心化指令引导具身导航
Mu Xu Team 2602.06427 HJFY
2026-02-06 Nipping the Drift in the Bud: Retrospective Rectification for Robust Vision-Language Navigation
防微杜渐:基于回溯修正的鲁棒视觉语言导航
Weiying Xie Team 2602.06356 HJFY
2026-02-05 Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation
稀疏视频生成推动现实世界超视距视觉语言导航
Hongyang Li Team 2602.05827 HJFY
2026-02-05 Allocentric Perceiver: Disentangling Allocentric Reasoning from Egocentric Visual Priors via Frame Instantiation
他者中心感知器:通过框架实例化从他者视觉先验中解耦他者中心推理
Weiming Zhang Team 2602.05789 HJFY
2026-02-05 MerNav: A Highly Generalizable Memory-Execute-Review Framework for Zero-Shot Object Goal Navigation
MerNav:一种高度可泛化的记忆-执行-回顾框架,用于零样本目标导航
Mu Xu Team 2602.05467 HJFY
2026-02-02 LangMap: A Hierarchical Benchmark for Open-Vocabulary Goal Navigation
LangMap:面向开放词汇目标导航的分层基准
Anton van den Hengel Team 2602.02220 HJFY
2026-01-31 APEX: A Decoupled Memory-based Explorer for Asynchronous Aerial Object Goal Navigation
APEX:一种用于异步空中目标导航的解耦记忆型探索器
Shuo Yang Team 2602.00551 HJFY
2026-02-03 MapDream: Task-Driven Map Learning for Vision-Language Navigation
MapDream:面向视觉语言导航的任务驱动地图学习
Zhaoxin Fan Team 2602.00222 HJFY
2026-01-29 Dynamic Topology Awareness: Breaking the Granularity Rigidity in Vision-Language Navigation
动态拓扑感知:打破视觉语言导航中的粒度僵化
Xiaoming Wang Team 2601.21751 HJFY
2026-01-26 DV-VLN: Dual Verification for Reliable LLM-Based Vision-and-Language Navigation
DV-VLN:基于大语言模型的视觉与语言导航双重验证可靠框架
Shoujun Zhou Team 2601.18492 HJFY
2026-01-26 NaVIDA: Vision-Language Navigation with Inverse Dynamics Augmentation
NaVIDA:基于逆动力学增强的视觉语言导航
Feng Zheng Team 2601.18188 HJFY
2026-01-22 AION: Aerial Indoor Object-Goal Navigation Using Dual-Policy Reinforcement Learning
AION:基于双策略强化学习的空中室内目标导航系统
Lin Zhao Team 2601.15614 HJFY
2026-01-23 FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation
FantasyVLN:面向视觉语言导航的统一多模态思维链推理框架
Yonggang Qi Team 2601.13976 HJFY
2026-01-19 Spatial-VLN: Zero-Shot Vision-and-Language Navigation With Explicit Spatial Perception and Exploration
Spatial-VLN:具备显式空间感知与探索能力的零样本视觉语言导航
Feitian Zhang Team 2601.12766 HJFY
2026-01-14 Towards Open Environments and Instructions: General Vision-Language Navigation via Fast-Slow Interactive Reasoning
迈向开放环境与指令:基于快慢交互推理的通用视觉语言导航
Yahong Han Team 2601.09111 HJFY
2026-01-11 Residual Cross-Modal Fusion Networks for Audio-Visual Navigation Yi Wang et al. 2601.08868
2026-01-13 VLingNav: Embodied Navigation with Adaptive Reasoning and Visual-Assisted Linguistic Memory Shaoan Wang et al. 2601.08665
2026-01-12 GROKE: Vision-Free Navigation Instruction Evaluation via Graph Reasoning on OpenStreetMap Farzad Shami et al. 2601.07375


📌 VLM

Publish Date (YYYY-MM-DD) Title Authors PDF HJFY Evaluation
2026-05-14 Does Synthetic Layered Design Data Benefit Layered Design Decomposition?
合成分层设计数据是否有助于分层设计分解?
Qifeng Chen Team 2605.15167 HJFY
2026-05-14 On the Cultural Anachronism and Temporal Reasoning in Vision Language Models
论视觉语言模型中的文化时代错位与时间推理
Zhiqiang Shen Team 2605.15071 HJFY
2026-05-14 LATERN: Test-Time Context-Aware Explainable Video Anomaly Detection
LATERN:测试时上下文感知的可解释视频异常检测
Muchao Ye Team 2605.15054 HJFY
2026-05-14 MHSA: A Lightweight Framework for Mitigating Hallucinations via Steered Attention in LVLMs
MHSA:一种通过引导注意力减轻大型视觉语言模型幻觉的轻量级框架
Yu Wang Team 2605.14966 HJFY
2026-05-14 Octopus: History-Free Gradient Orthogonalization for Continual Learning in Multimodal Large Language Models
Octopus:面向多模态大语言模型持续学习的无历史梯度正交化方法
摘要
Chao Ma Team 2605.14938 HJFY
2026-05-14 Chain-of-Procedure: Hierarchical Visual-Language Reasoning for Procedural QA
程序链:面向程序性问答的分层视觉-语言推理
摘要
Derek F. Wong Team 2605.14928 HJFY
2026-05-14 SteerSeg: Attention Steering for Reasoning Video Segmentation
SteerSeg:面向推理视频分割的注意力引导
摘要
Lars Petersson Team 2605.14908 HJFY
2026-05-14 MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models
MemLens:评估大型视觉-语言模型的多模态长期记忆能力
摘要
Simon See Team 2605.14906 HJFY
2026-05-14 Your CLIP has 164 dimensions of noise: Exploring the embeddings covariance eigenspectrum of contrastively pretrained vision-language transformers
你的CLIP含有164个噪声维度:对比预训练视觉-语言Transformer的嵌入协方差特征谱探索
摘要
Przemysław Biecek Team 2605.14893 HJFY
2026-05-14 Exploring Vision-Language Models for Online Signature Verification: A Zero-Shot Capability Study
探索视觉-语言模型用于在线签名验证:零样本能力研究
摘要
Javier Ortega-Garcia Team 2605.14845 HJFY
2026-05-08 Proxy3D: Efficient 3D Representations for Vision-Language Models via Semantic Clustering and Alignment
Proxy3D:通过语义聚类与对齐实现面向视觉语言模型的高效3D表征
摘要
Wenzhao Zheng Team 2605.08064 HJFY
2026-05-08 Object Hallucination-Free Reinforcement Unlearning for Vision-Language Models
面向视觉语言模型的无目标幻觉强化反学习
摘要
Jinsong Su Team 2605.08031 HJFY
2026-05-08 SphereVAD: Training-Free Video Anomaly Detection via Geodesic Inference on the Unit Hypersphere
SphereVAD:基于单位超球面上测地线推理的无训练视频异常检测
摘要
Xiaochun Cao Team 2605.08003 HJFY
2026-05-08 MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence
MedVIGIL:在视觉证据缺失情境下评估可信医学视觉语言模型
摘要
Xiang Li Team 2605.07919 HJFY
2026-05-08 Anisotropic Modality Align
各向异性模态对齐
摘要
Hui Xiong Team 2605.07825 HJFY
2026-05-08 GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning
GazeVLM:通过内部注意力控制实现主动视觉进行多模态推理
摘要
Mattia Rigotti Team 2605.07817 HJFY
2026-05-08 Operating Within the Operational Design Domain: Zero-Shot Perception with Vision-Language Models
在运行设计域内运行:基于视觉语言模型的零样本感知
摘要
Plachetka Christopher Team 2605.07649 HJFY
2026-05-08 LithoBench: Benchmarking Large Multimodal Models for Remote-Sensing Lithology Interpretation
LithoBench:面向遥感岩性解译的大规模多模态模型基准测试
摘要
Wei Han Team 2605.07640 HJFY
2026-05-08 PolarVLM: Bridging the Semantic-Physical Gap in Vision-Language Models
PolarVLM:弥合视觉语言模型中的语义-物理鸿沟
摘要
Zhanyu Ma Team 2605.07574 HJFY
2026-05-08 Beyond GSD-as-Token: Continuous Scale Conditioning for Remote Sensing VLMs
超越GSD即Token:遥感视觉语言模型的连续尺度条件化
摘要
Yawei Li Team 2605.07562 HJFY
2026-05-05 StateVLM: A State-Aware Vision-Language Model for Robotic Affordance Reasoning
StateVLM:用于机器人可操作属性推理的状态感知视觉-语言模型
摘要
Stefan Wermter Team 2605.03927 HJFY
2026-05-05 Task-Aware Scanning Parameter Configuration for Robotic Inspection Using Vision Language Embeddings and Hyperdimensional Computing
基于视觉语言嵌入与超维计算的机器人检测任务感知扫描参数配置
摘要
Farhad Imani Team 2605.03909 HJFY
2026-05-05 CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing
CC-OCR V2:面向真实世界文档处理的大规模多模态模型读写能力基准评测
摘要
Dayiheng Liu Team 2605.03903 HJFY
2026-05-05 Deco: Extending Personal Physical Objects into Pervasive AI Companion through a Dual-Embodiment Framework
Deco:通过双具身框架将个人实体物品扩展为普适人工智能伴侣
摘要
Xuhai Xu Team 2605.03882 HJFY
2026-05-05 Quantifying the human visual exposome with vision language models
利用视觉语言模型量化人类视觉暴露组
摘要
Magdalena Katharina Wekenborg Team 2605.03863 HJFY
2026-05-05 Enhancing Visual Question Answering with Multimodal LLMs via Chain-of-Question Guided Retrieval-Augmented Generation
基于链式问题引导的检索增强生成增强多模态大语言模型视觉问答
摘要
Chia-Wen Lin Team 2605.03790 HJFY
2026-05-05 Before Forgetting, Learn to Remember: Revisiting Foundational Learning Failures in LVLM Unlearning Benchmarks
遗忘之前,先学会记忆:重新审视LVLM遗忘基准中的基础学习失败
摘要
YoungBin Kim Team 2605.03759 HJFY
2026-05-05 Uni-OPD: Unifying On-Policy Distillation with a Dual-Perspective Recipe
摘要
Hehe Fan Team 2605.03677 HJFY
2026-05-05 The Detector Teaches Itself: Lightweight Self-Supervised Adaptation for Open-Vocabulary Object Detection
检测器自我教导:面向开放词汇目标检测的轻量级自监督适应性方法
摘要
Changjae Oh Team 2605.03642 HJFY
2026-05-05 Erase Persona, Forget Lore: Benchmarking Multimodal Copyright Unlearning in Large Vision Language Models
抹除角色,遗忘设定:大型视觉语言模型中多模态版权遗忘的基准测试
摘要
YoungBin Kim Team 2605.03547 HJFY
2026-04-30 AEGIS: A Holistic Benchmark for Evaluating Forensic Analysis of AI-Generated Academic Images
AEGIS:面向AI生成学术图片取证分析的全方位基准
摘要
Haihong E Team 2604.28177 HJFY
2026-04-30 PhyCo: Learning Controllable Physical Priors for Generative Motion
PhyCo:学习可控物理先验以生成运动
摘要
Manmohan Chandraker Team 2604.28169 HJFY
2026-04-30 PRISM: Pre-alignment via Black-box On-policy Distillation for Multimodal Reinforcement Learning
PRISM:多模态强化学习中基于黑盒在线策略蒸馏的预对齐方法
摘要
Chengwei Qin Team 2604.28123 HJFY
2026-04-30 FreeOcc: Training-Free Embodied Open-Vocabulary Occupancy Prediction
FreeOcc:无需训练的具身开放词汇占据预测
摘要
Changhao Chen Team 2604.28115 HJFY
2026-04-30 SpecVQA: A Benchmark for Spectral Understanding and Visual Question Answering in Scientific Images
SpecVQA:面向科学图像的光谱理解与视觉问答基准
摘要
Xi Fang Team 2604.28039 HJFY
2026-04-30 Echo-α: Large Agentic Multimodal Reasoning Model for Ultrasound Interpretation
Echo-α:面向超声影像解读的大型智能多模态推理模型
摘要
Dacheng Tao Team 2604.28011 HJFY
2026-04-30 TransVLM: A Vision-Language Framework and Benchmark for Detecting Any Shot Transitions
TransVLM:用于检测任意镜头转换的视觉-语言框架与基准
摘要
Mingming Gong Team 2604.27975 HJFY
2026-04-30 FineState-Bench: Benchmarking State-Conditioned Grounding for Fine-grained GUI State Setting
FineState-Bench:面向细粒度GUI状态设定的状态条件定位基准
摘要
Xiuying Chen Team 2604.27974 HJFY
2026-04-30 From Mirage to Grounding: Towards Reliable Multimodal Circuit-to-Verilog Code Generation
从幻象到根基:迈向可靠的多模态电路到Verilog代码生成
摘要
Xin Xi Team 2604.27969 HJFY
2026-04-30 The Effects of Visual Priming on Cooperative Behavior in Vision-Language Models
视觉启动对视觉语言模型中合作行为的影响
摘要
Kenneth J. K. Ong Team 2604.27953 HJFY
2026-04-23 When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs
当提示压倒视觉:LVLM中提示诱导的幻觉
摘要
Matthieu Cord Team 2604.21911 HJFY
2026-04-23 From Codebooks to VLMs: Evaluating Automated Visual Discourse Analysis for Climate Change on Social Media
从编码本到视觉语言模型:评估社交媒体上气候变化的自动化视觉话语分析
摘要
Margret Keuper Team 2604.21786 HJFY
2026-04-23 Ramen: Robust Test-Time Adaptation of Vision-Language Models with Active Sample Selection
Ramen:通过主动样本选择实现视觉-语言模型的鲁棒测试时自适应
摘要
Jingrui He Team 2604.21728 HJFY
2026-04-23 Seeing Isn't Believing: Uncovering Blind Spots in Evaluator Vision-Language Models
眼见不为实:揭示评估型视觉语言模型中的盲点
摘要
Mitesh M. Khapra Team 2604.21523 HJFY
2026-04-23 Do MLLMs Understand Pointing? Benchmarking and Enhancing Referential Reasoning in Egocentric Vision
多模态大语言模型理解指向吗?面向第一人称视角的指代推理基准构建与能力增强
摘要
Jie Zhou Team 2604.21461 HJFY
2026-04-23 VG-CoT: Towards Trustworthy Visual Reasoning via Grounded Chain-of-Thought
VG-CoT:通过接地思维链迈向可信视觉推理
摘要
YoungBin Kim Team 2604.21396 HJFY
2026-04-23 A Deployable Embodied Vision-Language Navigation System with Hierarchical Cognition and Context-Aware Exploration
一种具有层次化认知与上下文感知探索的可部署具身视觉语言导航系统
摘要
Lihua Xie Team 2604.21363 HJFY
2026-04-23 Prototype-Based Test-Time Adaptation of Vision-Language Models
基于原型的视觉语言模型测试时自适应方法
摘要
Rongrong Ji Team 2604.21360 HJFY
2026-04-23 Symbolic Grounding Reveals Representational Bottlenecks in Abstract Visual Reasoning
符号接地揭示抽象视觉推理中的表征瓶颈
摘要
Tanel Tammet Team 2604.21346 HJFY
2026-04-23 Latent Denoising Improves Visual Alignment in Large Multimodal Models
潜在去噪提升大型多模态模型的视觉对齐
摘要
Viktor Prasanna Team 2604.21343 HJFY
2026-04-20 Mitigating Multimodal Hallucination via Phase-wise Self-reward
通过分阶段自奖励缓解多模态幻觉
摘要
Min Zhang Team 2604.17982 HJFY
2026-04-20 From Heads to Neurons: Causal Attribution and Steering in Multi-Task Vision-Language Models
从注意力头到神经元:多任务视觉语言模型中的因果归因与调控
摘要
Ming Jiang Team 2604.17941 HJFY
2026-04-20 OneDrive: Unified Multi-Paradigm Driving with Vision-Language-Action Models
OneDrive:基于视觉-语言-动作模型的多范式统一驾驶框架
摘要
Zhipeng Zhang Team 2604.17915 HJFY
2026-04-20 AeroRAG: Structured Multimodal Retrieval-Augmented LLM for Fine-Grained Aerial Visual Reasoning
AeroRAG:面向细粒度航空视觉推理的结构化多模态检索增强大语言模型
摘要
Xuecheng Wu Team 2604.17889 HJFY
2026-04-20 SpaceDex: Generalizable Dexterous Grasping in Tiered Workspaces
SpaceDex:分层工作空间中的泛化灵巧抓取
摘要
Ning Tan Team 2604.17888 HJFY
2026-04-20 Weakly-Supervised Referring Video Object Segmentation through Text Supervision
摘要
Hanli Wang Team 2604.17797 HJFY
2026-04-20 When Vision-Language Models Judge Without Seeing: Exposing Informativeness Bias
当视觉语言模型未见而判:揭示信息量偏见
摘要
Dan Roth Team 2604.17768 HJFY
2026-04-19 BioVLM: Routing Prompts, Not Parameters, for Cross-Modality Generalization in Biomedical VLMs
BioVLM:通过路由提示而非参数实现生物医学视觉语言模型的跨模态泛化
摘要
Biplab Banerjee Team 2604.17629 HJFY
2026-04-19 PBSBench: A Multi-Level Vision-Language Framework and Benchmark for Hematopathology Whole Slide Image Interpretation
PBSBench:用于血液病理学全玻片图像解读的多层次视觉-语言框架与基准
摘要
Ping Zhang Team 2604.17570 HJFY
2026-04-19 RS-HyRe-R1: A Hybrid Reward Mechanism to Overcome Perceptual Inertia for Remote Sensing Images Understanding
RS-HyRe-R1:一种克服遥感图像理解中感知惯性的混合奖励机制
摘要
Haifeng Li Team 2604.17504 HJFY
2026-04-15 One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding
每个高度精选帧一个令牌:迈向长视频理解的极致压缩
摘要
Yu-Xiong Wang Team 2604.14149 HJFY
2026-04-15 ROSE: Retrieval-Oriented Segmentation Enhancement
ROSE:面向检索的分割增强框架
摘要
Yu-Gang Jiang Team 2604.14147 HJFY
2026-04-15 HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
HiVLA:一种以视觉定位为中心的分层具身操作系统
摘要
Ping Luo Team 2604.14125 HJFY
2026-04-15 Training-Free Semantic Multi-Object Tracking with Vision-Language Models
无需训练的语义多目标跟踪:基于视觉-语言模型的方法
摘要
Lorenzo Vaquero Team 2604.14074 HJFY
2026-04-15 Towards Unconstrained Human-Object Interaction
迈向无约束的人-物交互检测
摘要
Elisa Ricci Team 2604.14069 HJFY
2026-04-15 Decoding the Delta: Unifying Remote Sensing Change Detection and Understanding with Multimodal Large Language Models
解码变化:利用多模态大语言模型统一遥感变化检测与理解
摘要
Zide Fan Team 2604.14044 HJFY
2026-04-15 Seek-and-Solve: Benchmarking MLLMs for Visual Clue-Driven Reasoning in Daily Scenarios
寻解:评估多模态大语言模型在日常场景中基于视觉线索的推理能力
摘要
Xu Jia Team 2604.14041 HJFY
2026-04-15 POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch
POINTS-Seeker:迈向从零开始训练多模态智能搜索模型
摘要
Weidi Xie Team 2604.14029 HJFY
2026-04-15 MAny: Merge Anything for Multimodal Continual Instruction Tuning
MAny:面向多模态持续指令调优的任意合并框架
摘要
Kele Xu Team 2604.14016 HJFY
2026-04-15 Reward Design for Physical Reasoning in Vision-Language Models
视觉语言模型中物理推理的奖励设计
摘要
Sameera Horawalavithana Team 2604.13993 HJFY
2026-04-06 Rethinking Model Efficiency: Multi-Agent Inference with Large Models
重新思考模型效率:大模型的多智能体推理
摘要
Qi Qian Team 2604.04929 HJFY
2026-04-06 Vero: An Open RL Recipe for General Visual Reasoning
Vero:面向通用视觉推理的开源强化学习方案
摘要
Zhuang Liu Team 2604.04917 HJFY
2026-04-06 ClickAIXR: On-Device Multimodal Vision-Language Interaction with Real-World Objects in Extended Reality
ClickAIXR:扩展现实中与现实世界物体的设备端多模态视觉-语言交互
摘要
Ivan Viola Team 2604.04905 HJFY
2026-04-06 Beyond the Global Scores: Fine-Grained Token Grounding as a Robust Detector of LVLM Hallucinations
超越全局分数:细粒度标记定位作为LVLM幻觉的鲁棒检测器
摘要
Vu Minh Hieu Phan Team 2604.04863 HJFY
2026-04-06 The Blind Spot of Adaptation: Quantifying and Mitigating Forgetting in Fine-tuned Driving Models
适应中的盲点:量化与缓解微调驾驶模型中的遗忘
摘要
Zhipeng Zhang Team 2604.04857 HJFY
2026-04-06 Less Detail, Better Answers: Degradation-Driven Prompting for VQA
细节越少,答案越好:面向视觉问答的降质驱动提示方法
摘要
Bohan Zhuang Team 2604.04838 HJFY
2026-04-06 Discovering Failure Modes in Vision-Language Models using RL
利用强化学习探索视觉语言模型的失效模式
摘要
Aishwarya Agrawal Team 2604.04733 HJFY
2026-04-06 ROSClaw: A Hierarchical Semantic-Physical Framework for Heterogeneous Multi-Agent Collaboration
ROSClaw:一种面向异构多智能体协作的分层语义-物理框架
摘要
Jie Chen Team 2604.04664 HJFY
2026-04-06 Synthesis4AD: Synthetic Anomalies are All You Need for 3D Anomaly Detection
Synthesis4AD:合成异常即三维异常检测所需全部
摘要
Weiming Shen Team 2604.04658 HJFY
2026-04-06 InCTRLv2: Generalist Residual Models for Few-Shot Anomaly Detection and Segmentation
InCTRLv2:用于少样本异常检测与分割的通用残差模型
摘要
Guansong Pang Team 2604.04632 HJFY
2026-04-03 CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
CoME-VL:扩展互补多编码器视觉-语言学习
摘要
Salman Khan Team 2604.03231 HJFY
2026-04-03 The Compression Gap: Why Discrete Tokenization Limits Vision-Language-Action Model Scaling
压缩鸿沟:为何离散标记化限制视觉-语言-动作模型的扩展
摘要
Takuya Shiba Team 2604.03191 HJFY
2026-04-03 Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models
理解幻觉在多模态推理模型强化后训练中的作用
摘要
Tianlong Chen Team 2604.03179 HJFY
2026-04-03 EffiMiniVLM: A Compact Dual-Encoder Regression Framework
EffiMiniVLM:一种紧凑型双编码器回归框架
摘要
Yan Chai Hum Team 2604.03172 HJFY
2026-04-03 Chart-RL: Policy Optimization Reinforcement Learning for Enhanced Visual Reasoning in Chart Question Answering with Vision Language Models
Chart-RL:基于策略优化强化学习的视觉语言模型图表问答视觉推理增强方法
摘要
Shekhar Jain Team 2604.03157 HJFY
2026-04-03 FSUNav: A Cerebrum-Cerebellum Architecture for Fast, Safe, and Universal Zero-Shot Goal-Oriented Navigation
FSUNav:一种用于快速、安全且通用的零样本目标导向导航的大脑-小脑架构
摘要
Wei Zhang Team 2604.03139 HJFY
2026-04-03 Revealing Physical-World Semantic Vulnerabilities: Universal Adversarial Patches for Infrared Vision-Language Models
揭示物理世界语义漏洞:面向红外视觉语言模型的通用对抗性补丁
摘要
Wen Yao Team 2604.03117 HJFY
2026-04-03 MI-Pruner: Crossmodal Mutual Information-guided Token Pruner for Efficient MLLMs
MI-Pruner:基于跨模态互信息引导的高效多模态大语言模型令牌剪枝器
摘要
Matthew B. Blaschko Team 2604.03072 HJFY
2026-04-03 QVAD: A Question-Centric Agentic Framework for Efficient and Training-Free Video Anomaly Detection
QVAD:一种以问题为中心的高效免训练视频异常检测代理框架
摘要
Yasin Yilmaz Team 2604.03040 HJFY
2026-04-03 Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence?
Agentic-MME:智能体能力究竟为多模态智能带来什么?
摘要
Yi-Fan Zhang Team 2604.03016 HJFY
2026-03-31 Scaling Video Pretraining for Surgical Foundation Models
扩展视频预训练以构建外科基础模型
摘要
Zuozhu Liu Team 2603.29966 HJFY
2026-03-31 EC-Bench: Enumeration and Counting Benchmark for Ultra-Long Videos
EC-Bench:超长视频的枚举与计数基准测试
摘要
Yutaka Matsuo Team 2603.29943 HJFY
2026-03-31 ATP-Bench: Towards Agentic Tool Planning for MLLM Interleaved Generation
ATP-Bench:迈向多模态大语言模型交错生成的智能体工具规划
摘要
Guanjun Jiang Team 2603.29902 HJFY
2026-03-31 DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA
DIAL:通过潜在世界建模实现意图与动作解耦的端到端视觉语言动作模型
摘要
Xihui Liu Team 2603.29844 HJFY
2026-03-31 SceneTeract: Agentic Functional Affordances and VLM Grounding in 3D Scenes
SceneTeract:三维场景中的智能体功能可供性与视觉语言模型接地验证
摘要
Maks Ovsjanikov Team 2603.29798 HJFY
2026-03-31 From Skeletons to Semantics: Design and Deployment of a Hybrid Edge-Based Action Detection System for Public Safety
从骨架到语义:面向公共安全的混合边缘动作检测系统设计与部署
摘要
Jan Schagen Team 2603.29777 HJFY
2026-03-31 TSHA: A Benchmark for Visual Language Models in Trustworthy Safety Hazard Assessment Scenarios
TSHA:面向可信安全风险评估场景的视觉语言模型基准
摘要
Xin Tan Team 2603.29759 HJFY
2026-03-31 A Comprehensive Information-Decomposition Analysis of Large Vision-Language Models
大型视觉语言模型的信息分解综合分析
摘要
Hideki Nakayama Team 2603.29676 HJFY
2026-03-31 Storing Less, Finding More: How Novelty Filtering Improves Cross-Modal Retrieval on Edge Cameras
存储更少,发现更多:新颖性过滤如何提升边缘摄像头的跨模态检索效果
摘要
Sherif Abdelwahab Team 2603.29631 HJFY
2026-03-31 Calibrated Confidence Expression for Radiology Report Generation
放射学报告生成中的校准置信度表达
摘要
Matthias Keicher Team 2603.29492 HJFY
2026-03-26 SlotVTG: Object-Centric Adapter for Generalizable Video Temporal Grounding
SlotVTG:面向泛化视频时序定位的对象中心适配器
摘要
Jinwoo Choi Team 2603.25733 HJFY
2026-03-26 Colon-Bench: An Agentic Workflow for Scalable Dense Lesion Annotation in Full-Procedure Colonoscopy Videos
Colon-Bench:一种用于全流程结肠镜视频中可扩展密集病灶标注的智能体工作流
摘要
Xin Gao Team 2603.25645 HJFY
2026-03-26 LanteRn: Latent Visual Structured Reasoning
LanteRn:潜在视觉结构化推理
摘要
André Martins Team 2603.25629 HJFY
2026-03-26 Demographic Fairness in Multimodal LLMs: A Benchmark of Gender and Ethnicity Bias in Face Verification
多模态大语言模型中的人口统计学公平性:人脸验证中的性别与种族偏见基准研究
摘要
Sébastien Marcel Team 2603.25613 HJFY
2026-03-26 GeoHeight-Bench: Towards Height-Aware Multimodal Reasoning in Remote Sensing
GeoHeight-Bench:迈向遥感中的高度感知多模态推理
摘要
Wufan Zhao Team 2603.25565 HJFY
2026-03-26 Humans vs Vision-Language Models: A Unified Measure of Narrative Coherence
人类与视觉语言模型:叙事连贯性的统一度量
摘要
Sharid Loáiciga Team 2603.25537 HJFY
2026-03-26 GridVAD: Open-Set Video Anomaly Detection via Spatial Reasoning over Stratified Frame Grids
GridVAD:基于分层帧网格空间推理的开放集视频异常检测
摘要
Sondos Mohamed Team 2603.25467 HJFY
2026-03-26 HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models
HiSpatial:驾驭视觉语言模型中的层次化三维空间理解
摘要
Jiaolong Yang Team 2603.25411 HJFY
2026-03-26 Shape and Substance: Dual-Layer Side-Channel Attacks on Local Vision-Language Models
形态与实质:针对本地视觉语言模型的双层侧信道攻击
摘要
Mordechai Guri Team 2603.25403 HJFY
2026-03-26 DAGverse: Building Document-Grounded Semantic DAGs from Scientific Papers
DAGverse:从科学论文构建基于文档的语义有向无环图
摘要
Huan Liu Team 2603.25293 HJFY
2026-03-25 Vision-Language Models vs Human: Perceptual Image Quality Assessment
视觉语言模型与人类:感知图像质量评估对比
摘要
Brian Deegan Team 2603.24578 HJFY
2026-03-25 VFIG: Vectorizing Complex Figures in SVG with Vision-Language Models
VFIG:利用视觉语言模型将复杂图形矢量化至SVG格式
摘要
Ranjay Krishna Team 2603.24575 HJFY
2026-03-25 LensWalk: Agentic Video Understanding by Planning How You See in Videos
LensWalk:通过规划视频观看方式实现智能体驱动的视频理解
摘要
Shiguang Shan Team 2603.24558 HJFY
2026-03-25 UI-Voyager: A Self-Evolving GUI Agent Learning via Failed Experience
UI-Voyager:一种通过失败经验实现自我进化的图形用户界面代理学习框架
摘要
Jie Jiang Team 2603.24533 HJFY
2026-03-25 Cross-Modal Prototype Alignment and Mixing for Training-Free Few-Shot Classification
跨模态原型对齐与混合:面向免训练小样本分类
摘要
Joost van de Weijer Team 2603.24528 HJFY
2026-03-25 Video-Only ToM: Enhancing Theory of Mind in Multimodal Large Language Models
纯视频心智理论:增强多模态大语言模型的心智理论能力
摘要
Jiansheng Chen Team 2603.24484 HJFY
2026-03-25 Unleashing Vision-Language Semantics for Deepfake Video Detection
释放视觉-语言语义在深度伪造视频检测中的潜力
摘要
Guansong Pang Team 2603.24454 HJFY
2026-03-25 3D-Mix for VLA: A Plug-and-Play Module for Integrating VGGT-based 3D Information into Vision-Language-Action Models
面向VLA的3D-Mix:将基于VGGT的三维信息集成到视觉-语言-动作模型中的即插即用模块
摘要
Kai Chen Team 2603.24393 HJFY
2026-03-25 ViHOI: Human-Object Interaction Synthesis with Visual Priors
ViHOI:基于视觉先验的人-物交互合成
摘要
Changxing Ding Team 2603.24383 HJFY
2026-03-25 GeoRouter: Dynamic Paradigm Routing for Worldwide Image Geolocalization
GeoRouter:面向全球图像地理定位的动态范式路由
摘要
Xiangyu Zhao Team 2603.24376 HJFY
2026-03-19 Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding
生成模型通晓空间:释放隐式三维先验以促进场景理解
摘要
Xiang Bai Team 2603.19235 HJFY
2026-03-19 Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders
视觉语言模型是否需要视觉Transformer?评估状态空间模型作为视觉编码器的表现
摘要
Paola Cascante-Bonilla Team 2603.19209 HJFY
2026-03-19 Tinted Frames: Question Framing Blinds Vision-Language Models
着色框架:问题框架使视觉语言模型失明
摘要
Ritwik Gupta Team 2603.19203 HJFY
2026-03-19 Meanings and Measurements: Multi-Agent Probabilistic Grounding for Vision-Language Navigation
语义与度量:面向视觉语言导航的多智能体概率性接地方法
摘要
Nakul Gopalan Team 2603.19166 HJFY
2026-03-19 GSMem: 3D Gaussian Splatting as Persistent Spatial Memory for Zero-Shot Embodied Exploration and Reasoning
GSMem:将3D高斯泼溅作为持久空间记忆,用于零样本具身探索与推理
摘要
Yu Yin Team 2603.19137 HJFY
2026-03-19 TAU-R1: Visual Language Model for Traffic Anomaly Understanding
TAU-R1:面向交通异常理解的视觉语言模型
摘要
Nic Zhang Team 2603.19098 HJFY
2026-03-19 SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues
SAVeS:通过语义线索引导视觉语言模型的安全判断
摘要
Bernard Ghanem Team 2603.19092 HJFY
2026-03-19 SwiftTailor: Efficient 3D Garment Generation with Geometry Image Representation
SwiftTailor:基于几何图像表示的高效三维服装生成
摘要
Phong Nguyen Team 2603.19053 HJFY
2026-03-19 TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation
TerraScope:面向地球观测的像素级视觉推理
摘要
Paolo Rota Team 2603.19039 HJFY
2026-03-19 SEM: Sparse Embedding Modulation for Post-Hoc Debiasing of Vision-Language Models
SEM:面向视觉语言模型事后去偏的稀疏嵌入调制方法
摘要
Massimiliano Mancini Team 2603.19028 HJFY
2026-03-18 Unified Spatio-Temporal Token Scoring for Efficient Video VLMs
统一时空令牌评分:面向高效视频视觉语言模型
摘要
Sangho Lee Team 2603.18004 HJFY
2026-03-18 Universal Skeleton Understanding via Differentiable Rendering and MLLMs
基于可微分渲染与多模态大语言模型的通用骨架理解
摘要
Mengyuan Liu Team 2603.18003 HJFY
2026-03-18 Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models
Loc3R-VLM:基于视觉语言模型的语言定位与三维推理
摘要
Marc Pollefeys Team 2603.18002 HJFY
2026-03-18 Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding
感知空间:面向高效精准三维场景理解的自我运动感知视频表征
摘要
Kang G. Shin Team 2603.17980 HJFY
2026-03-18 ProbeFlow: Training-Free Adaptive Flow Matching for Vision-Language-Action Models
ProbeFlow:面向视觉-语言-动作模型的无训练自适应流匹配方法
摘要
Qiongfeng Shi Team 2603.17850 HJFY
2026-03-18 Fine-Grained Post-Training Quantization for Large Vision Language Models with Quantization-Aware Integrated Gradients
基于量化感知积分梯度的大规模视觉语言模型细粒度后训练量化
摘要
Xu-Yao Zhang Team 2603.17809 HJFY
2026-03-18 Evidence Packing for Cross-Domain Image Deepfake Detection with LVLMs
基于大型视觉语言模型的跨域图像深度伪造检测证据包方法
摘要
Zhaohong Jia Team 2603.17761 HJFY
2026-03-18 Concept-to-Pixel: Prompt-Free Universal Medical Image Segmentation
概念到像素:无需提示的通用医学图像分割
摘要
Shaohua Kevin Zhou Team 2603.17746 HJFY
2026-03-18 SARE: Sample-wise Adaptive Reasoning for Training-free Fine-grained Visual Recognition
SARE:面向免训练细粒度视觉识别的样本自适应推理框架
摘要
Xuhong Zhang Team 2603.17729 HJFY
2026-03-18 From Virtual Environments to Real-World Trials: Emerging Trends in Autonomous Driving
从虚拟环境到现实世界测试:自动驾驶的新兴趋势
摘要
A. Behera Team 2603.17714 HJFY
2026-03-13 Visual-ERM: Reward Modeling for Visual Equivalence
Visual-ERM:视觉等价性奖励建模
摘要
Yuhang Zang Team 2603.13224 HJFY
2026-03-13 Navig-AI-tion: Navigation by Contextual AI and Spatial Audio
导航AI化:基于情境人工智能与空间音频的导航系统
摘要
Mar Gonzalez-Franco Team 2603.13200 HJFY
2026-03-13 Towards Spatio-Temporal World Scene Graph Generation from Monocular Videos
面向单目视频的时空世界场景图生成
摘要
Vibhav Gogate Team 2603.13185 HJFY
2026-03-13 Geometry-Guided Camera Motion Understanding in VideoLLMs
视频大语言模型中的几何引导相机运动理解
摘要
Guan-Ming Su Team 2603.13119 HJFY
2026-03-13 Evaluating VLMs' Spatial Reasoning Over Robot Motion: A Step Towards Robot Planning with Motion Preferences
评估视觉语言模型在机器人运动中的空间推理能力:迈向融合运动偏好的机器人规划
摘要
Martim Brandão Team 2603.13100 HJFY
2026-03-13 Reasoning over Video: Evaluating How MLLMs Extract, Integrate, and Reconstruct Spatiotemporal Evidence
视频推理评估:探究多模态大语言模型如何提取、整合与重构时空证据
摘要
Hwanjun Song Team 2603.13091 HJFY
2026-03-13 Topo-R1: Detecting Topological Anomalies via Vision-Language Models
Topo-R1:基于视觉语言模型的拓扑异常检测
摘要
Chao Chen Team 2603.13054 HJFY
2026-03-13 ESPIRE: A Diagnostic Benchmark for Embodied Spatial Reasoning of Vision-Language Models
ESPIRE:面向视觉语言模型具身空间推理的诊断性基准
摘要
Zilong Zheng Team 2603.13033 HJFY
2026-03-13 A Closed-Form Solution for Debiasing Vision-Language Models with Utility Guarantees Across Modalities and Tasks
一种跨模态与任务具有效用保证的视觉语言模型去偏闭式解
摘要
Oya Celiktutan Team 2603.12998 HJFY
2026-03-13 Test-Time Attention Purification for Backdoored Large Vision Language Models
针对后门大型视觉语言模型的测试时注意力净化
摘要
Miao Xu Team 2603.12989 HJFY
2026-03-10 X-GS: An Extensible Open Framework Unifying 3DGS Architectures with Downstream Multimodal Models
X-GS:一个可扩展的开放框架,统一3DGS架构与下游多模态模型
摘要
Irwin King Team 2603.09632 HJFY
2026-03-10 Speech-Omni-Lite: Portable Speech Interfaces for Vision-Language Models
Speech-Omni-Lite:面向视觉语言模型的便携式语音接口
摘要
Xiao Chen Team 2603.09627 HJFY
2026-03-10 More than the Sum: Panorama-Language Models for Adverse Omni-Scenes
超越简单叠加:面向全景恶劣场景的全景语言模型
摘要
Rainer Stiefelhagen Team 2603.09573 HJFY
2026-03-10 GeoSolver: Scaling Test-Time Reasoning in Remote Sensing with Fine-Grained Process Supervision
GeoSolver:通过细粒度过程监督扩展遥感领域的测试时推理能力
摘要
Bo Yang Team 2603.09551 HJFY
2026-03-10 Towards Unified Multimodal Interleaved Generation via Group Relative Policy Optimization
通过分组相对策略优化实现统一的多模态交错生成
摘要
Li Zhang Team 2603.09538 HJFY
2026-03-10 Probing the Reliability of Driving VLMs: From Inconsistent Responses to Grounded Temporal Reasoning
探究驾驶视觉语言模型的可靠性:从响应不一致到基于时序的推理
摘要
Alain Pagani Team 2603.09512 HJFY
2026-03-10 Evolving Prompt Adaptation for Vision-Language Models
面向视觉语言模型的演化提示适应方法
摘要
Yang Li Team 2603.09493 HJFY
2026-03-10 StyleVLA: Driving Style-Aware Vision Language Action Model for Autonomous Driving
StyleVLA:面向自动驾驶的驾驶风格感知视觉语言动作模型
摘要
Johannes Betz Team 2603.09482 HJFY
2026-03-10 Prune Redundancy, Preserve Essence: Vision Token Compression in VLMs via Synergistic Importance-Diversity
剪除冗余,保留精髓:基于协同重要性-多样性原则的视觉语言模型视觉令牌压缩
摘要
Wenjie Pei Team 2603.09480 HJFY
2026-03-10 MORE-R1: Guiding LVLM for Multimodal Object-Entity Relation Extraction via Stepwise Reasoning with Reinforcement Learning
MORE-R1:通过强化学习引导大型视觉语言模型进行多模态对象-实体关系提取的逐步推理
摘要
Tong Mo Team 2603.09478 HJFY
2026-03-06 Multimodal Large Language Models as Image Classifiers
多模态大语言模型作为图像分类器
摘要
Jiri Matas Team 2603.06578 HJFY
2026-03-06 Omni-Diffusion: Unified Multimodal Understanding and Generation with Masked Discrete Diffusion
Omni-Diffusion:基于掩码离散扩散的统一多模态理解与生成
摘要
Chaoyou Fu Team 2603.06577 HJFY
2026-03-06 SUREON: A Benchmark and Vision-Language-Model for Surgical Reasoning
SUREON:一个用于外科推理的基准与视觉语言模型
摘要
Omid Mohareri Team 2603.06570 HJFY
2026-03-06 Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders
Penguin-VL:探索基于LLM视觉编码器的视觉语言模型效率极限
摘要
Leoweiliang Team 2603.06569 HJFY
2026-03-06 Do Foundation Models Know Geometry? Probing Frozen Features for Continuous Physical Measurement
基础模型懂几何吗?探究冻结特征在连续物理测量中的应用
摘要
Yakov Pyotr Shkolnikov Team 2603.06459 HJFY
2026-03-06 OralGPT-Plus: Learning to Use Visual Tools via Reinforcement Learning for Panoramic X-ray Analysis
OralGPT-Plus:通过强化学习掌握视觉工具用于全景X射线分析
摘要
Hao Tang Team 2603.06366 HJFY
2026-03-06 K-MaT: Knowledge-Anchored Manifold Transport for Cross-Modal Prompt Learning in Medical Imaging
K-MaT:基于知识锚定的流形迁移用于医学影像中的跨模态提示学习
摘要
Shadi Albarqouni Team 2603.06340 HJFY
2026-03-06 WMoE-CLIP: Wavelet-Enhanced Mixture-of-Experts Prompt Learning for Zero-Shot Anomaly Detection
WMoE-CLIP:基于小波增强专家混合提示学习的零样本异常检测
摘要
Chao Huang Team 2603.06313 HJFY
2026-03-06 DEX-AR: A Dynamic Explainability Method for Autoregressive Vision-Language Models
DEX-AR:一种面向自回归视觉语言模型的动态可解释性方法
摘要
Hilde Kuehne Team 2603.06302 HJFY
2026-03-06 HiPP-Prune: Hierarchical Preference-Conditioned Structured Pruning for Vision-Language Models
HiPP-Prune:面向视觉语言模型的分层偏好条件结构化剪枝
摘要
Raul Santos-Rodriguez Team 2603.06270 HJFY
2026-03-05 HALP: Detecting Hallucinations in Vision-Language Models without Generating a Single Token
HALP:无需生成任何词元即可检测视觉语言模型中的幻觉现象
摘要
Jiawei Zhou Team 2603.05465 HJFY
2026-03-05 ORMOT: A Dataset and Framework for Omnidirectional Referring Multi-Object Tracking
ORMOT:面向全向指涉多目标跟踪的数据集与框架
摘要
Wenbing Tao Team 2603.05384 HJFY
2026-03-05 Wiki-R1: Incentivizing Multimodal Reasoning for Knowledge-based VQA via Data and Sampling Curriculum
Wiki-R1:通过数据与采样课程激励基于知识的视觉问答中的多模态推理
摘要
Xuming He Team 2603.05256 HJFY
2026-03-05 Critic in the Loop: A Tri-System VLA Framework for Robust Long-Horizon Manipulation
闭环批评者:一种用于鲁棒长程操作的三系统视觉语言动作框架
摘要
Shanlin Zhong Team 2603.05185 HJFY
2026-03-05 Logi-PAR: Logic-Infused Patient Activity Recognition via Differentiable Rule
Logi-PAR:基于可微分规则的逻辑增强型患者活动识别
摘要
Kawsar Farooq Team 2603.05184 HJFY
2026-03-05 Mario: Multimodal Graph Reasoning with Large Language Models
Mario:基于大语言模型的多模态图推理
摘要
Qiaoyu Tan Team 2603.05181 HJFY
2026-03-05 UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark
UniM:一个统一的任意到任意交错多模态基准
摘要
Wynne Hsu Team 2603.05075 HJFY
2026-03-05 Direct Contact-Tolerant Motion Planning With Vision Language Models
基于视觉语言模型的直接接触容忍运动规划
摘要
Chengzhong Xu Team 2603.05017 HJFY
2026-03-05 VisionPangu: A Compact and Fine-Grained Multimodal Assistant with 1.7B Parameters
VisionPangu:一款拥有17亿参数的紧凑且细粒度多模态助手
摘要
Wenpo Song Team 2603.04957 HJFY
2026-03-05 AdaIAT: Adaptively Increasing Attention to Generated Text to Alleviate Hallucinations in LVLM
AdaIAT:自适应增强对生成文本的关注以缓解大型视觉语言模型中的幻觉问题
摘要
Xiangui Kang Team 2603.04908 HJFY
2026-02-26 MediX-R1: Open Ended Medical Reinforcement Learning
MediX-R1:开放式医学强化学习框架
摘要
Hisham Cholakkal Team 2602.23363 HJFY
2026-02-26 Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning
规模无法克服语用学:报告偏差对视觉-语言推理的影响
摘要
Ranjay Krishna Team 2602.23351 HJFY
2026-02-26 Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation?
检索与分割:少量示例足以弥合开放词汇分割中的监督鸿沟吗?
摘要
Giorgos Tolias Team 2602.23339 HJFY
2026-02-26 CXReasonAgent: Evidence-Grounded Diagnostic Reasoning Agent for Chest X-rays
CXReasonAgent:基于证据的胸部X光诊断推理智能体
摘要
Edward Choi Team 2602.23276 HJFY
2026-02-26 Large Multimodal Models as General In-Context Classifiers
大型多模态模型作为通用上下文内分类器
摘要
Elisa Ricci Team 2602.23229 HJFY
2026-02-26 MovieTeller: Tool-augmented Movie Synopsis with ID Consistent Progressive Abstraction
MovieTeller:基于工具增强的电影剧情摘要与ID一致渐进式抽象
摘要
Gaoang Wang Team 2602.23228 HJFY
2026-02-26 Efficient Encoder-Free Fourier-based 3D Large Multimodal Model
高效无编码器的基于傅里叶变换的3D大型多模态模型
摘要
Fabio Poiesi Team 2602.23153 HJFY
2026-02-26 Cytoarchitecture in Words: Weakly Supervised Vision-Language Modeling for Human Brain Microscopy
以言构形:弱监督视觉-语言建模用于人脑显微成像
摘要
Christian Schiffer Team 2602.23088 HJFY
2026-02-26 SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling
SubspaceAD:基于子空间建模的无训练少样本异常检测方法
摘要
Egor Bondarev Team 2602.23013 HJFY
2026-02-26 FactGuard: Agentic Video Misinformation Detection via Reinforcement Learning
FactGuard:基于强化学习的智能体视频虚假信息检测
摘要
Zhaoqi Wang Team 2602.22963 HJFY
2026-02-24 Spa3R: Predictive Spatial Field Modeling for 3D Visual Reasoning
Spa3R:面向三维视觉推理的预测性空间场建模
摘要
Xinggang Wang Team 2602.21186 HJFY
2026-02-24 Seeing Through Words: Controlling Visual Retrieval Quality with Language Models
透过文字看见:利用语言模型控制视觉检索质量
摘要
Yun Fu Team 2602.21175 HJFY
2026-02-24 LUMEN: Longitudinal Multi-Modal Radiology Model for Prognosis and Diagnosis
LUMEN:用于预后与诊断的纵向多模态放射学模型
摘要
Marius George Linguraru Team 2602.21142 HJFY
2026-02-24 VAUQ: Vision-Aware Uncertainty Quantification for LVLM Self-Evaluation
VAUQ:面向LVLM自评估的视觉感知不确定性量化
摘要
Sharon Li Team 2602.21054 HJFY
2026-02-24 OCR-Agent: Agentic OCR with Capability and Memory Reflection
OCR-Agent:具备能力与记忆反思的智能OCR代理
摘要
Ying Cai Team 2602.21053 HJFY
2026-02-24 Not Just What's There: Enabling CLIP to Comprehend Negated Visual Descriptions Without Fine-tuning
不止于所见:无需微调,让CLIP理解否定的视觉描述
摘要
Zejiang He Team 2602.21035 HJFY
2026-02-24 From Perception to Action: An Interactive Benchmark for Vision Reasoning
从感知到行动:视觉推理的交互式基准测试
摘要
Roy Ka-Wei Lee Team 2602.21015 HJFY
2026-02-24 CrystaL: Spontaneous Emergence of Visual Latents in MLLMs
CrystaL:多模态大语言模型中视觉潜在特征的自发涌现
摘要
Xiang Li Team 2602.20980 HJFY
2026-02-24 Are Multimodal Large Language Models Good Annotators for Image Tagging?
多模态大语言模型是图像标注的优秀注释者吗?
摘要
Masashi Sugiyama Team 2602.20972 HJFY
2026-02-24 LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding
LongVideo-R1:面向低成本长视频理解的智能导航方法
摘要
Qixiang Ye Team 2602.20913 HJFY
2026-02-19 Pushing the Frontier of Black-Box LVLM Attacks via Fine-Grained Detail Targeting
通过细粒度细节定位推动黑盒大视觉语言模型攻击前沿
摘要
Zhiqiang Shen Team 2602.17645 HJFY
2026-02-19 Catastrophic Forgetting Resilient One-Shot Incremental Federated Learning
抗灾难性遗忘的单次增量联邦学习
摘要
Monowar Bhuyan Team 2602.17625 HJFY
2026-02-19 AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games
AI游戏商店:通过人类游戏实现机器通用智能的可扩展、开放式评估
摘要
Joshua B. Tenenbaum Team 2602.17594 HJFY
2026-02-19 RetouchIQ: MLLM Agents for Instruction-Based Image Retouching with Generalist Reward
RetouchIQ:基于指令的图像修饰多模态大语言模型智能体与通用奖励机制
摘要
Handong Zhao Team 2602.17558 HJFY
2026-02-19 GraphThinker: Reinforcing Video Reasoning with Event Graph Thinking
GraphThinker:通过事件图思维强化视频推理
摘要
Shaogang Gong Team 2602.17555 HJFY
2026-02-19 LATA: Laplacian-Assisted Transductive Adaptation for Conformal Uncertainty in Medical VLMs
LATA:面向医学视觉语言模型共形不确定性的拉普拉斯辅助转导自适应方法
摘要
Zongyuan Ge Team 2602.17535 HJFY
2026-02-19 QuPAINT: Physics-Aware Instruction Tuning Approach to Quantum Material Discovery
QuPAINT:面向量子材料发现的物理感知指令调优方法
Khoa Luu Team 2602.17478 HJFY
2026-02-19 EAGLE: Expert-Augmented Attention Guidance for Tuning-Free Industrial Anomaly Detection in Multimodal Large Language Models
EAGLE:面向多模态大语言模型免调优工业异常检测的专家增强注意力引导方法
Seon Han Choi Team 2602.17419 HJFY
2026-02-19 EntropyPrune: Matrix Entropy Guided Visual Token Pruning for Multimodal Large Language Models
EntropyPrune:基于矩阵熵引导的多模态大语言模型视觉令牌剪枝
Lianghua He Team 2602.17196 HJFY
2026-02-19 Selective Training for Large Vision Language Models via Visual Information Gain
基于视觉信息增益的大型视觉语言模型选择性训练
Sangheum Hwang Team 2602.17186 HJFY
2026-02-12 Scaling Verification Can Be More Effective than Scaling Policy Learning for Vision-Language-Action Alignment
扩展验证在视觉-语言-动作对齐中比扩展策略学习更有效
Marco Pavone Team 2602.12281 HJFY
2026-02-12 ExStrucTiny: A Benchmark for Schema-Variable Structured Information Extraction from Document Images
ExStrucTiny:面向文档图像中模式可变结构化信息提取的基准数据集
Manuela Veloso Team 2602.12203 HJFY
2026-02-12 Visual Reasoning Benchmark: Evaluating Multimodal LLMs on Classroom-Authentic Visual Problems from Primary Education
视觉推理基准:评估多模态大语言模型在基础教育课堂真实视觉问题上的表现
Oliver G. B. Garrod Team 2602.12196 HJFY
2026-02-12 3DGSNav: Enhancing Vision-Language Model Reasoning for Object Navigation via Active 3D Gaussian Splatting
3DGSNav:通过主动3D高斯泼溅增强视觉语言模型在物体导航中的推理能力
Xinyi Yu Team 2602.12159 HJFY
2026-02-12 DeepSight: An All-in-One LM Safety Toolkit
DeepSight:一体化大型模型安全工具箱
Xia Hu Team 2602.12092 HJFY
2026-02-12 Affordance-Graphed Task Worlds: Self-Evolving Task Generation for Scalable Embodied Learning
可供性图化任务世界:面向可扩展具身学习的自演化任务生成
Changshui Zhang Team 2602.12065 HJFY
2026-02-12 Can Local Vision-Language Models improve Activity Recognition over Vision Transformers? -- Case Study on Newborn Resuscitation
本地视觉语言模型能否超越视觉Transformer提升活动识别能力?——以新生儿复苏为例的研究
Øyvind Meinich-Bache Team 2602.12002 HJFY
2026-02-12 Spatial Chain-of-Thought: Bridging Understanding and Generation Models for Spatial Reasoning Generation
空间思维链:连接理解与生成模型以实现空间推理生成
Long Chen Team 2602.11980 HJFY
2026-02-12 Benchmarking Vision-Language Models for French PDF-to-Markdown Conversion
评估视觉语言模型在法语PDF转Markdown任务中的性能基准
Nicolas Mery Team 2602.11960 HJFY
2026-02-12 Are Two LLMs Better Than One? A Student-Teacher Dual-Head LLMs Architecture for Pharmaceutical Content Optimization
双LLM是否优于单一模型?一种用于医药内容优化的师生双头LLM架构
Anubhav Girdhar Team 2602.11957 HJFY
2026-02-10 Reason-IAD: Knowledge-Guided Dynamic Latent Reasoning for Explainable Industrial Anomaly Detection
Reason-IAD:面向可解释工业异常检测的知识引导动态潜在推理框架
Xiaochun Cao Team 2602.09850 HJFY
2026-02-10 Kelix Technique Report
Kelix技术报告
Ziqi Wang Team 2602.09843 HJFY
2026-02-10 SAKED: Mitigating Hallucination in Large Vision-Language Models via Stability-Aware Knowledge Enhanced Decoding
SAKED:通过稳定性感知的知识增强解码缓解大型视觉语言模型中的幻觉问题
Xudong Jiang Team 2602.09825 HJFY
2026-02-10 GenSeg-R1: RL-Driven Vision-Language Grounding for Fine-Grained Referring Segmentation
GenSeg-R1:基于强化学习的视觉语言细粒度指代分割
Uma Mahesh Team 2602.09701 HJFY
2026-02-10 VideoAfford: Grounding 3D Affordance from Human-Object-Interaction Videos via Multimodal Large Language Model
VideoAfford:基于多模态大语言模型从人-物交互视频中实现三维功能可及性接地
Hui Xiong Team 2602.09638 HJFY
2026-02-10 AGMark: Attention-Guided Dynamic Watermarking for Large Vision-Language Models
AGMark:面向大型视觉语言模型的注意力引导动态水印技术
Linlin Wang Team 2602.09611 HJFY
2026-02-10 Tele-Omni: a Unified Multimodal Framework for Video Generation and Editing
Tele-Omni:面向视频生成与编辑的统一多模态框架
Xuelong Li Team 2602.09609 HJFY
2026-02-10 Delving into Spectral Clustering with Vision-Language Representations
探索基于视觉-语言表征的谱聚类方法
Zhen Fang Team 2602.09586 HJFY
2026-02-10 Scalpel: Fine-Grained Alignment of Attention Activation Manifolds via Mixture Gaussian Bridges to Mitigate Multimodal Hallucination
手术刀:通过混合高斯桥精细对齐注意力激活流形以缓解多模态幻觉
Koichi Shirahata Team 2602.09541 HJFY
2026-02-10 DR.Experts: Differential Refinement of Distortion-Aware Experts for Blind Image Quality Assessment
DR.Experts:面向盲图像质量评估的失真感知专家差分细化方法
Runze Hu Team 2602.09531 HJFY
2026-02-04 When LLaVA Meets Objects: Token Composition for Vision-Language-Models
当LLaVA遇见物体:视觉语言模型的令牌组合
Hilde Kuehne Team 2602.04864 HJFY
2026-02-04 El Agente Estructural: An Artificially Intelligent Molecular Editor
结构智能体:一种人工智能分子编辑器
Varinia Bernales Team 2602.04849 HJFY
2026-02-04 VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text?
VISTA-Bench:视觉语言模型真的能像理解纯文本一样理解图像中的文本吗?
Huchuan Lu Team 2602.04802 HJFY
2026-02-04 Alignment Drift in Multimodal LLMs: A Two-Phase, Longitudinal Evaluation of Harm Across Eight Model Releases
多模态大语言模型中的对齐漂移:对八个模型版本有害性的两阶段纵向评估
Emily Dix Team 2602.04739 HJFY
2026-02-04 SAR-RAG: ATR Visual Question Answering by Semantic Search, Retrieval, and MLLM Generation
SAR-RAG:通过语义搜索、检索与多模态大语言模型生成的自动目标识别视觉问答
Andreas Spanias Team 2602.04712 HJFY
2026-02-04 Annotation Free Spacecraft Detection and Segmentation using Vision Language Models
基于视觉语言模型的无标注航天器检测与分割
Djamila Aouada Team 2602.04699 HJFY
2026-02-04 AGILE: Hand-Object Interaction Reconstruction from Video via Agentic Generation
AGILE:基于智能体生成从视频重建手-物交互
Chunhua Shen Team 2602.04672 HJFY
2026-02-04 PIO-FVLM: Rethinking Training-Free Visual Token Reduction for VLM Acceleration from an Inference-Objective Perspective
PIO-FVLM:从推理目标视角重新审视用于VLM加速的无训练视觉令牌缩减
Chunhua Shen Team 2602.04657 HJFY
2026-02-04 Relational Scene Graphs for Object Grounding of Natural Language Commands
面向自然语言指令中物体定位的关系场景图
Ville Kyrki Team 2602.04635 HJFY
2026-02-04 LEAD: Layer-wise Expert-aligned Decoding for Faithful Radiology Report Generation
LEAD:面向忠实放射学报告生成的层级专家对齐解码
Yan Song Team 2602.04617 HJFY
2026-02-02 Avenir-Web: Human-Experience-Imitating Multimodal Web Agents with Mixture of Grounding Experts
Avenir-Web:基于混合定位专家的人类经验模仿式多模态网络代理
Mengdi Wang Team 2602.02468 HJFY
2026-02-02 MentisOculi: Revealing the Limits of Reasoning with Mental Imagery
MentisOculi:揭示心智意象推理的局限性
Wieland Brendel Team 2602.02465 HJFY
2026-02-02 Relationship-Aware Hierarchical 3D Scene Graph for Task Reasoning
面向任务推理的关系感知分层三维场景图
Kostas Alexis Team 2602.02456 HJFY
2026-02-02 World-Gymnast: Training Robots with Reinforcement Learning in a World Model
世界体操家:在世界模型中通过强化学习训练机器人
Sherry Yang Team 2602.02454 HJFY
2026-02-02 ReasonEdit: Editing Vision-Language Models using Human Reasoning
ReasonEdit:基于人类推理的视觉语言模型编辑
Thomas Hartvigsen Team 2602.02408 HJFY
2026-02-02 LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization
LongVPO:从锚定线索到自我推理的长视频偏好优化
Limin Wang Team 2602.02341 HJFY
2026-02-02 Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models
Vision-DeepResearch基准:重新思考多模态大语言模型的视觉与文本搜索能力
Shaosheng Cao Team 2602.02185 HJFY
2026-02-02 See2Refine: Vision-Language Feedback Improves LLM-Based eHMI Action Designers
See2Refine:视觉-语言反馈提升基于大语言模型的eHMI行为设计能力
Takeo Igarashi Team 2602.02063 HJFY
2026-02-02 Auto-Comp: An Automated Pipeline for Scalable Compositional Probing of Contrastive Vision-Language Models
Auto-Comp:面向对比式视觉语言模型可扩展组合性探测的自动化流程
Toshihiko Yamasaki Team 2602.02043 HJFY
2026-02-02 One Size, Many Fits: Aligning Diverse Group-Wise Click Preferences in Large-Scale Advertising Image Generation
一图多配:在大规模广告图像生成中协调多样化的群体点击偏好
Jian Liang Team 2602.02033 HJFY
2026-01-30 User Prompting Strategies and Prompt Enhancement Methods for Open-Set Object Detection in XR Environments Junfeng Lin et al. 2601.23281 null
2026-01-30 Training-Free Test-Time Adaptation with Brownian Distance Covariance in Vision-Language Models Yi Zhang et al. 2601.23253 null
2026-01-30 Structured Over Scale: Learning Spatial Reasoning from Educational Video Bishoy Galoaa et al. 2601.23251 null
2026-01-30 Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning Xiangyu Zeng et al. 2601.23224 null
2026-01-30 Med-Scout: Curing MLLMs' Geometric Blindness in Medical Perception via Geometry-Aware RL Post-Training Anglin Liu et al. 2601.23220 null
2026-01-30 Make Anything Match Your Target: Universal Adversarial Perturbations against Closed-Source MLLMs via Multi-Crop Routed Meta Optimization Hui Lu et al. 2601.23179 null
2026-01-30 Hearing is Believing? Evaluating and Analyzing Audio Language Model Sycophancy with SYAUDIO Junchi Yao et al. 2601.23149 null
2026-01-30 One-shot Optimized Steering Vector for Hallucination Mitigation for VLMs Youxu Shi et al. 2601.23041 null
2026-01-30 Triage: Hierarchical Visual Budgeting for Efficient Video Reasoning in Vision-Language Models Anmin Wang et al. 2601.22959 null
2026-01-30 Alignment among Language, Vision and Action Representations Nicola Milano et al. 2601.22948 null
2026-01-29 UEval: A Benchmark for Unified Multimodal Generation Bo Li et al. 2601.22155 null
2026-01-29 Do VLMs Perceive or Recall? Probing Visual Perception vs. Memory with Classic Visual Illusions Xiaoxiao Sun et al. 2601.22150 null
2026-01-29 SINA: A Circuit Schematic Image-to-Netlist Generator Using Artificial Intelligence Saoud Aldowaish et al. 2601.22114 null
2026-01-29 VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning Yibo Wang et al. 2601.22069 null
2026-01-29 Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models Wenxuan Huang et al. 2601.22060 null
2026-01-29 MetricAnything: Scaling Metric Depth Pretraining with Noisy Heterogeneous Sources Baorui Ma et al. 2601.22054 null
2026-01-29 Visual-Guided Key-Token Regularization for Multimodal Large Language Model Unlearning Chengyi Cai et al. 2601.22020 null
2026-01-29 Causal World Modeling for Robot Control Lin Li et al. 2601.21998 null
2026-01-29 Clarity: The Flexibility-Interpretability Trade-Off in Sparsity-aware Concept Bottleneck Models Konstantinos P. Panousis et al. 2601.21944 null
2026-01-29 VideoAesBench: Benchmarking the Video Aesthetics Perception Capabilities of Large Multimodal Models Yunhao Li et al. 2601.21915 null

Evaluation status is stored locally in the browser (localStorage) and is not synced across devices or browsers.