Automatically tracks the latest arXiv papers on Vision-Language-Action (VLA), Vision-Language Navigation (VLN), and Vision-Language Models (VLM) every day.
Updated on 2026.03.02
| Publish Date (YYYY-MM-DD) | Title | Authors | arXiv ID | HJFY | Evaluation |
|---|---|---|---|---|---|
| 2026-02-26 | EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents | Taku Komura Team | 2602.23205 | HJFY | |
| 2026-02-26 | Residual Koopman Spectral Profiling for Predicting and Preventing Transformer Training Instability | Yutaka Matsuo Team | 2602.22988 | HJFY | |
| 2026-02-26 | Automated Robotic Needle Puncture for Percutaneous Dilatational Tracheostomy | Andrew Weightman Team | 2602.22952 | HJFY | |
| 2026-02-26 | DySL-VLA: Efficient Vision-Language-Action Model Inference via Dynamic-Static Layer-Skipping for Robot Manipulation | Meng Li Team | 2602.22896 | HJFY | |
| 2026-02-26 | GraspLDP: Towards Generalizable Grasping Policy via Latent Diffusion | Di Huang Team | 2602.22862 | HJFY | |
| 2026-02-26 | ArtPro: Self-Supervised Articulated Object Reconstruction with Adaptive Integration of Mobility Proposals | Changhe Tu Team | 2602.22666 | HJFY | |
| 2026-02-26 | Rethinking the Practicality of Vision-language-action Model: A Comprehensive Benchmark and An Improved Baseline | Haoang Li Team | 2602.22663 | HJFY | |
| 2026-02-26 | Metamorphic Testing of Vision-Language Action-Enabled Robots | Aitor Arrieta Team | 2602.22579 | HJFY | |
| 2026-02-26 | SignVLA: A Gloss-Free Vision-Language-Action Framework for Real-Time Sign Language-Guided Robotic Manipulation | Zezhi Tang Team | 2602.22514 | HJFY | |
| 2026-02-25 | When to Act, Ask, or Learn: Uncertainty-Aware Policy Steering | Andrea Bajcsy Team | 2602.22474 | HJFY | |
| 2026-02-24 | NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning | Wei Zhan Team | 2602.21172 | HJFY | |
| 2026-02-24 | ActionReasoning: Robot Action Reasoning in 3D Space with LLM for Robotic Brick Stacking | Brian Sheil Team | 2602.21161 | HJFY | |
| 2026-02-24 | HALO: A Unified Vision-Language-Action Model for Embodied Multimodal Chain-of-Thought Reasoning | Song Guo Team | 2602.21157 | HJFY | |
| 2026-02-24 | From Perception to Action: An Interactive Benchmark for Vision Reasoning | Roy Ka-Wei Lee Team | 2602.21015 | HJFY | |
| 2026-02-24 | Notes-to-Self: Scratchpad Augmented VLAs for Memory Dependent Manipulation Tasks | Roland Memisevic Team | 2602.21013 | HJFY | |
| 2026-02-24 | Toward an Agentic Infused Software Ecosystem | Mark Marron Team | 2602.20979 | HJFY | |
| 2026-02-24 | IG-RFT: An Interaction-Guided RL Framework for VLA Models in Long-Horizon Robotic Manipulation | Huixu Dong Team | 2602.20715 | HJFY | |
| 2026-02-24 | How Foundational Skills Influence VLM-based Embodied Agents: A Native Perspective | Tong Xu Team | 2602.20687 | HJFY | |
| 2026-02-24 | Recursive Belief Vision Language Model | Nirav Patel Team | 2602.20659 | HJFY | |
| 2026-02-24 | Efficient and Explainable End-to-End Autonomous Driving via Masked Vision-Language-Action Diffusion | Ziran Wang Team | 2602.20577 | HJFY | |
| 2026-02-19 | When Vision Overrides Language: Evaluating and Mitigating Counterfactual Failures in VLAs | Mingyu Ding Team | 2602.17659 | HJFY | |
| 2026-02-19 | What Breaks Embodied AI Security: LLM Vulnerabilities, CPS Flaws, or Something Else? | Yue Zhang Team | 2602.17345 | HJFY | |
| 2026-02-19 | FRAPPE: Infusing World Modeling into Generalist Policies via Multiple Future Representation Alignment | Donglin Wang Team | 2602.17259 | HJFY | |
| 2026-02-19 | Web Verbs: Typed Abstractions for Reliable Task Composition on the Agentic Web | Suman Nath Team | 2602.17245 | HJFY | |
| 2026-02-19 | Benchmarking the Effects of Object Pose Estimation and Reconstruction on Robotic Grasping Success | Torsten Sattler Team | 2602.17101 | HJFY | |
| 2026-02-18 | MALLVI: a multi agent framework for integrated generalized robotics manipulation | Babak Khalaj Team | 2602.16898 | HJFY | |
| 2026-02-18 | EgoScale: Scaling Dexterous Manipulation with Diverse Egocentric Human Data | Linxi Fan Team | 2602.16710 | HJFY | |
| 2026-02-19 | RoboGene: Boosting VLA Pre-training via Diversity-Driven Agentic Framework for Real-World Task Generation | Jian Tang Team | 2602.16444 | HJFY | |
| 2026-02-17 | Learning to Retrieve Navigable Candidates for Efficient Vision-and-Language Navigation | Lina Yao Team | 2602.15724 | HJFY | |
| 2026-02-17 | The Next Paradigm Is User-Centric Agent, Not Platform-Centric Service | Enhong Chen Team | 2602.15682 | HJFY | |
| 2026-02-12 | Scaling Verification Can Be More Effective than Scaling Policy Learning for Vision-Language-Action Alignment | Marco Pavone Team | 2602.12281 | HJFY | |
| 2026-02-12 | Embodied AI Agents for Team Collaboration in Co-located Blue-Collar Work | Thomas Olsson Team | 2602.12136 | HJFY | |
| 2026-02-12 | GigaBrain-0.5M*: a VLA That Learns From World Model-Based Reinforcement Learning | Zheng Zhu Team | 2602.12099 | HJFY | |
| 2026-02-12 | VLAW: Iterative Co-Improvement of Vision-Language-Action Policy and World Model | Chelsea Finn Team | 2602.12063 | HJFY | |
| 2026-02-12 | HoloBrain-0 Technical Report | Zhizhong Su Team | 2602.12062 | HJFY | |
| 2026-02-12 | When would Vision-Proprioception Policies Fail in Robotic Manipulation? | Di Hu Team | 2602.12032 | HJFY | |
| 2026-02-12 | Robot-DIFT: Distilling Diffusion Features for Geometrically Consistent Visuomotor Control | Georgia Chalvatzaki Team | 2602.11934 | HJFY | |
| 2026-02-12 | JEPA-VLA: Video Predictive Embedding is Needed for VLA Models | Mingsheng Long Team | 2602.11832 | HJFY | |
| 2026-02-12 | Clutt3R-Seg: Sparse-view 3D Instance Segmentation for Language-grounded Grasping in Cluttered Scenes | Ayoung Kim Team | 2602.11660 | HJFY | |
| 2026-02-12 | ViTaS: Visual Tactile Soft Fusion Contrastive Learning for Visuomotor Learning | Huazhe Xu Team | 2602.11643 | HJFY | |
| 2026-02-10 | MVISTA-4D: View-Consistent 4D World Model with Test-Time Action Inference for Robotic Manipulation | Xiangyu Yue Team | 2602.09878 | HJFY | |
| 2026-02-10 | Code2World: A GUI World Model via Renderable Code Generation | Kevin Qinghong Lin Team | 2602.09856 | HJFY | |
| 2026-02-10 | BagelVLA: Enhancing Long-Horizon Manipulation via Interleaved Vision-Language-Action Generation | Jianyu Chen Team | 2602.09849 | HJFY | |
| 2026-02-10 | NavDreamer: Video Models as Zero-Shot 3D Navigators | Fei Gao Team | 2602.09765 | HJFY | |
| 2026-02-10 | Rethinking Visual-Language-Action Model Scaling: Alignment, Mixture, and Regularization | Qin Jin Team | 2602.09722 | HJFY | |
| 2026-02-10 | AutoFly: Vision-Language-Action Model for UAV Autonomous Navigation in the Wild | Hui Xiong Team | 2602.09657 | HJFY | |
| 2026-02-10 | VideoAfford: Grounding 3D Affordance from Human-Object-Interaction Videos via Multimodal Large Language Model | Hui Xiong Team | 2602.09638 | HJFY | |
| 2026-02-10 | Hand2World: Autoregressive Egocentric Interaction Generation via Free-Space Hand Gestures | Xingang Pan Team | 2602.09600 | HJFY | |
| 2026-02-10 | Preference Aligned Visuomotor Diffusion Policies for Deformable Object Manipulation | Danica Kragic Team | 2602.09583 | HJFY | |
| 2026-02-10 | AUHead: Realistic Emotional Talking Head Generation via Action Units Control | Tat-Seng Chua Team | 2602.09534 | HJFY | |
| 2026-02-04 | Capturing Visual Environment Structure Correlates with Control Performance | Yu-Xiong Wang Team | 2602.04880 | HJFY | |
| 2026-02-04 | CoWTracker: Tracking by Warping instead of Correlation | Andrea Vedaldi Team | 2602.04877 | HJFY | |
| 2026-02-04 | Relational Scene Graphs for Object Grounding of Natural Language Commands | Ville Kyrki Team | 2602.04635 | HJFY | |
| 2026-02-04 | Act, Sense, Act: Learning Non-Markovian Active Perception Strategies from Large-Scale Egocentric Human Data | Wenzhao Lian Team | 2602.04600 | HJFY | |
| 2026-02-04 | A Unified Complementarity-based Approach for Rigid-Body Manipulation and Motion Prediction | Riddhiman Laha Team | 2602.04522 | HJFY | |
| 2026-02-04 | EgoActor: Grounding Task Planning into Spatial-aware Egocentric Actions for Humanoid Robots via Visual-Language Models | Börje F. Karlsson Team | 2602.04515 | HJFY | |
| 2026-02-04 | Self-evolving Embodied AI | Wenwu Zhu Team | 2602.04411 | HJFY | |
| 2026-02-04 | GeneralVLA: Generalizable Vision-Language-Action Models with Knowledge-Guided Trajectory Planning | Hao Tang Team | 2602.04315 | HJFY | |
| 2026-02-04 | Viewpoint Matters: Dynamically Optimizing Viewpoints with Masked Autoencoder for Visual Manipulation | Wenzhao Lian Team | 2602.04243 | HJFY | |
| 2026-02-04 | GeoLanG: Geometry-Aware Language-Guided Grasping with Unified RGB-D Multimodal Learning | Hongliang Ren Team | 2602.04231 | HJFY | |
| 2026-02-02 | TIC-VLA: A Think-in-Control Vision-Language-Action Model for Robot Navigation in Dynamic Environments | Jiaqi Ma Team | 2602.02459 | HJFY | |
| 2026-02-02 | World-Gymnast: Training Robots with Reinforcement Learning in a World Model | Sherry Yang Team | 2602.02454 | HJFY | |
| 2026-02-02 | SoMA: A Real-to-Sim Neural Simulator for Robotic Soft-body Manipulation | Jiangmiao Pang Team | 2602.02402 | HJFY | |
| 2026-02-02 | MAIN-VLA: Modeling Abstraction of Intention and eNvironment for Vision-Language-Action Models | Lemiao Qiu Team | 2602.02212 | HJFY | |
| 2026-02-02 | FD-VLA: Force-Distilled Vision-Language-Action Model for Contact-Rich Manipulation | Haiyue Zhu Team | 2602.02142 | HJFY | |
| 2026-02-02 | See2Refine: Vision-Language Feedback Improves LLM-Based eHMI Action Designers | Takeo Igarashi Team | 2602.02063 | HJFY | |
| 2026-02-02 | Concept-Based Dictionary Learning for Inference-Time Safety in Vision Language Action Models | Di Wang Team | 2602.01834 | HJFY | |
| 2026-02-02 | From Knowing to Doing Precisely: A General Self-Correction and Termination Framework for VLA models | Jianzong Wang Team | 2602.01811 | HJFY | |
| 2026-02-02 | AgenticLab: A Real-World Robot Agent Platform that Can See, Think, and Act | Yu She Team | 2602.01662 | HJFY | |
| 2026-02-02 | From Perception to Action: Spatial AI Agents and World Models | Esteban Rojas Team | 2602.01644 | HJFY | |
| 2026-01-30 | Temporally Coherent Imitation Learning via Latent Action Flow Matching for Robotic Manipulation | Wu Songwei et al. | 2601.23087 | null |
| 2026-01-30 | EAG-PT: Emission-Aware Gaussians and Path Tracing for Indoor Scene Reconstruction and Editing | Xijie Yang et al. | 2601.23065 | null |
| 2026-01-30 | Learning Geometrically-Grounded 3D Visual Representations for View-Generalizable Robotic Manipulation | Di Zhang et al. | 2601.22988 | null |
| 2026-01-30 | Alignment among Language, Vision and Action Representations | Nicola Milano et al. | 2601.22948 | null |
| 2026-01-30 | When Anomalies Depend on Context: Learning Conditional Compatibility for Anomaly Detection | Shashank Mishra et al. | 2601.22868 | null |
| 2026-01-30 | Vision-Language Models Unlock Task-Centric Latent Actions | Alexander Nikulin et al. | 2601.22714 | null |
| 2026-01-30 | Best-of-Q: Improving VLM agents with Q-function Action Ranking at Inference | Emilien Biré et al. | 2601.22701 | null |
| 2026-01-30 | CARE: Multi-Task Pretraining for Latent Continuous Action Representation in Robot Control | Jiaqi Shi et al. | 2601.22467 | null |
| 2026-01-29 | PoSafeNet: Safe Learning with Poset-Structured Neural Nets | Kiwan Wong et al. | 2601.22356 | null |
| 2026-01-29 | DynamicVLA: A Vision-Language-Action Model for Dynamic Object Manipulation | Haozhe Xie et al. | 2601.22153 | null |
| 2026-01-29 | PLANING: A Loosely Coupled Triangle-Gaussian Framework for Streaming 3D Reconstruction | Changjian Jiang et al. | 2601.22046 | null |
| 2026-01-29 | PocketDP3: Efficient Pocket-Scale 3D Visuomotor Policy | Jinhao Zhang et al. | 2601.22018 | null |
| 2026-01-29 | Causal World Modeling for Robot Control | Lin Li et al. | 2601.21998 | null |
| 2026-01-29 | MoE-ACT: Improving Surgical Imitation Learning Policies through Supervised Mixture-of-Experts | Lorenzo Mazza et al. | 2601.21971 | null |
| 2026-01-29 | Information Filtering via Variational Regularization for Robot Manipulation | Jinhao Zhang et al. | 2601.21926 | null |
| 2026-01-29 | Dynamic Topology Awareness: Breaking the Granularity Rigidity in Vision-Language Navigation | Jiankun Peng et al. | 2601.21751 | null |
| 2026-01-29 | CoFreeVLA: Collision-Free Dual-Arm Manipulation via Vision-Language-Action Model and Risk Estimation | Xuanran Zhai et al. | 2601.21712 | null |
| 2026-01-29 | AIR-VLA: Vision-Language-Action Systems for Aerial Manipulation | Jianli Sun et al. | 2601.21602 | null |
| 2026-01-29 | EmboCoach-Bench: Benchmarking AI Agents on Developing Embodied Robots | Zixing Lei et al. | 2601.21570 | null |
|
Evaluation status is saved locally in the browser (localStorage) and is not synced across devices or browsers.
| Publish Date (YYYY-MM-DD) | Title | Authors | arXiv ID | HJFY | Evaluation |
|---|---|---|---|---|---|
| 2026-02-20 | CapNav: Benchmarking Vision Language Models on Capability-conditioned Indoor Navigation | Jon Froehlich Team | 2602.18424 | HJFY | |
| 2026-02-17 | Learning to Retrieve Navigable Candidates for Efficient Vision-and-Language Navigation | Lina Yao Team | 2602.15724 | HJFY | |
| 2026-02-17 | One Agent to Guide Them All: Empowering MLLMs for Vision-and-Language Navigation via Explicit World Representation | Qi Wu Team | 2602.15400 | HJFY | |
| 2026-02-16 | pFedNavi: Structure-Aware Personalized Federated Vision-Language Navigation for Embodied AI | Haibing Guan Team | 2602.14401 | HJFY | |
| 2026-02-12 | ABot-N0: Technical Report on the VLA Foundation Model for Versatile Embodied Navigation | Mu Xu Team | 2602.11598 | HJFY | |
| 2026-02-10 | Hydra-Nav: Object Navigation via Adaptive Dual-Process Reasoning | Yiming Gan Team | 2602.09972 | HJFY | |
| 2026-02-10 | AutoFly: Vision-Language-Action Model for UAV Autonomous Navigation in the Wild | Hui Xiong Team | 2602.09657 | HJFY | |
| 2026-02-09 | When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning | Mohit Bansal Team | 2602.08236 | HJFY | |
| 2026-02-10 | LCLA: Language-Conditioned Latent Alignment for Vision-Language Navigation | Soumik Sarkar Team | 2602.07629 | HJFY | |
| 2026-02-06 | Bridging the Indoor-Outdoor Gap: Vision-Centric Instruction-Guided Embodied Navigation for the Last Meters | Mu Xu Team | 2602.06427 | HJFY | |
| 2026-02-06 | Nipping the Drift in the Bud: Retrospective Rectification for Robust Vision-Language Navigation | Weiying Xie Team | 2602.06356 | HJFY | |
| 2026-02-05 | Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation | Hongyang Li Team | 2602.05827 | HJFY | |
| 2026-02-05 | Allocentric Perceiver: Disentangling Allocentric Reasoning from Egocentric Visual Priors via Frame Instantiation | Weiming Zhang Team | 2602.05789 | HJFY | |
| 2026-02-05 | MerNav: A Highly Generalizable Memory-Execute-Review Framework for Zero-Shot Object Goal Navigation | Mu Xu Team | 2602.05467 | HJFY | |
| 2026-02-02 | LangMap: A Hierarchical Benchmark for Open-Vocabulary Goal Navigation | Anton van den Hengel Team | 2602.02220 | HJFY | |
| 2026-01-31 | APEX: A Decoupled Memory-based Explorer for Asynchronous Aerial Object Goal Navigation | Shuo Yang Team | 2602.00551 | HJFY | |
| 2026-02-03 | MapDream: Task-Driven Map Learning for Vision-Language Navigation | Zhaoxin Fan Team | 2602.00222 | HJFY | |
| 2026-01-29 | Dynamic Topology Awareness: Breaking the Granularity Rigidity in Vision-Language Navigation | Xiaoming Wang Team | 2601.21751 | HJFY | |
| 2026-01-26 | DV-VLN: Dual Verification for Reliable LLM-Based Vision-and-Language Navigation | Shoujun Zhou Team | 2601.18492 | HJFY | |
| 2026-01-26 | NaVIDA: Vision-Language Navigation with Inverse Dynamics Augmentation | Feng Zheng Team | 2601.18188 | HJFY | |
| 2026-01-22 | AION: Aerial Indoor Object-Goal Navigation Using Dual-Policy Reinforcement Learning | Lin Zhao Team | 2601.15614 | HJFY | |
| 2026-01-23 | FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation | Yonggang Qi Team | 2601.13976 | HJFY | |
| 2026-01-19 | Spatial-VLN: Zero-Shot Vision-and-Language Navigation With Explicit Spatial Perception and Exploration | Feitian Zhang Team | 2601.12766 | HJFY | |
| 2026-01-14 | Towards Open Environments and Instructions: General Vision-Language Navigation via Fast-Slow Interactive Reasoning | Yahong Han Team | 2601.09111 | HJFY | |
| 2026-01-11 | Residual Cross-Modal Fusion Networks for Audio-Visual Navigation | Yi Wang et al. | 2601.08868 | null |
| 2026-01-13 | VLingNav: Embodied Navigation with Adaptive Reasoning and Visual-Assisted Linguistic Memory | Shaoan Wang et al. | 2601.08665 | null |
| 2026-01-12 | GROKE: Vision-Free Navigation Instruction Evaluation via Graph Reasoning on OpenStreetMap | Farzad Shami et al. | 2601.07375 | null |
|
Evaluation status is saved locally in the browser (localStorage) and is not synced across devices or browsers.
| Publish Date (YYYY-MM-DD) | Title | Authors | arXiv ID | HJFY | Evaluation |
|---|---|---|---|---|---|
| 2026-02-26 | MediX-R1: Open Ended Medical Reinforcement Learning | Hisham Cholakkal Team | 2602.23363 | HJFY | |
| 2026-02-26 | Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning | Ranjay Krishna Team | 2602.23351 | HJFY | |
| 2026-02-26 | Retrieve and Segment: Are a Few Examples Enough to Bridge the Supervision Gap in Open-Vocabulary Segmentation? | Giorgos Tolias Team | 2602.23339 | HJFY | |
| 2026-02-26 | CXReasonAgent: Evidence-Grounded Diagnostic Reasoning Agent for Chest X-rays | Edward Choi Team | 2602.23276 | HJFY | |
| 2026-02-26 | Large Multimodal Models as General In-Context Classifiers | Elisa Ricci Team | 2602.23229 | HJFY | |
| 2026-02-26 | MovieTeller: Tool-augmented Movie Synopsis with ID Consistent Progressive Abstraction | Gaoang Wang Team | 2602.23228 | HJFY | |
| 2026-02-26 | Efficient Encoder-Free Fourier-based 3D Large Multimodal Model | Fabio Poiesi Team | 2602.23153 | HJFY | |
| 2026-02-26 | Cytoarchitecture in Words: Weakly Supervised Vision-Language Modeling for Human Brain Microscopy | Christian Schiffer Team | 2602.23088 | HJFY | |
| 2026-02-26 | SubspaceAD: Training-Free Few-Shot Anomaly Detection via Subspace Modeling | Egor Bondarev Team | 2602.23013 | HJFY | |
| 2026-02-26 | FactGuard: Agentic Video Misinformation Detection via Reinforcement Learning | Zhaoqi Wang Team | 2602.22963 | HJFY | |
| 2026-02-24 | Spa3R: Predictive Spatial Field Modeling for 3D Visual Reasoning | Xinggang Wang Team | 2602.21186 | HJFY | |
| 2026-02-24 | Seeing Through Words: Controlling Visual Retrieval Quality with Language Models | Yun Fu Team | 2602.21175 | HJFY | |
| 2026-02-24 | LUMEN: Longitudinal Multi-Modal Radiology Model for Prognosis and Diagnosis | Marius George Linguraru Team | 2602.21142 | HJFY | |
| 2026-02-24 | VAUQ: Vision-Aware Uncertainty Quantification for LVLM Self-Evaluation | Sharon Li Team | 2602.21054 | HJFY | |
| 2026-02-24 | OCR-Agent: Agentic OCR with Capability and Memory Reflection | Ying Cai Team | 2602.21053 | HJFY | |
| 2026-02-24 | Not Just What's There: Enabling CLIP to Comprehend Negated Visual Descriptions Without Fine-tuning | Zejiang He Team | 2602.21035 | HJFY | |
| 2026-02-24 | From Perception to Action: An Interactive Benchmark for Vision Reasoning | Roy Ka-Wei Lee Team | 2602.21015 | HJFY | |
| 2026-02-24 | CrystaL: Spontaneous Emergence of Visual Latents in MLLMs | Xiang Li Team | 2602.20980 | HJFY | |
| 2026-02-24 | Are Multimodal Large Language Models Good Annotators for Image Tagging? | Masashi Sugiyama Team | 2602.20972 | HJFY | |
| 2026-02-24 | LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding | Qixiang Ye Team | 2602.20913 | HJFY | |
| 2026-02-19 | Pushing the Frontier of Black-Box LVLM Attacks via Fine-Grained Detail Targeting | Zhiqiang Shen Team | 2602.17645 | HJFY | |
| 2026-02-19 | Catastrophic Forgetting Resilient One-Shot Incremental Federated Learning | Monowar Bhuyan Team | 2602.17625 | HJFY | |
| 2026-02-19 | AI Gamestore: Scalable, Open-Ended Evaluation of Machine General Intelligence with Human Games | Joshua B. Tenenbaum Team | 2602.17594 | HJFY | |
| 2026-02-19 | RetouchIQ: MLLM Agents for Instruction-Based Image Retouching with Generalist Reward | Handong Zhao Team | 2602.17558 | HJFY | |
| 2026-02-19 | GraphThinker: Reinforcing Video Reasoning with Event Graph Thinking | Shaogang Gong Team | 2602.17555 | HJFY | |
| 2026-02-19 | LATA: Laplacian-Assisted Transductive Adaptation for Conformal Uncertainty in Medical VLMs | Zongyuan Ge Team | 2602.17535 | HJFY | |
| 2026-02-19 | QuPAINT: Physics-Aware Instruction Tuning Approach to Quantum Material Discovery | Khoa Luu Team | 2602.17478 | HJFY | |
| 2026-02-19 | EAGLE: Expert-Augmented Attention Guidance for Tuning-Free Industrial Anomaly Detection in Multimodal Large Language Models | Seon Han Choi Team | 2602.17419 | HJFY | |
| 2026-02-19 | EntropyPrune: Matrix Entropy Guided Visual Token Pruning for Multimodal Large Language Models | Lianghua He Team | 2602.17196 | HJFY | |
| 2026-02-19 | Selective Training for Large Vision Language Models via Visual Information Gain | Sangheum Hwang Team | 2602.17186 | HJFY | |
| 2026-02-12 | Scaling Verification Can Be More Effective than Scaling Policy Learning for Vision-Language-Action Alignment | Marco Pavone Team | 2602.12281 | HJFY | |
| 2026-02-12 | ExStrucTiny: A Benchmark for Schema-Variable Structured Information Extraction from Document Images | Manuela Veloso Team | 2602.12203 | HJFY | |
| 2026-02-12 | Visual Reasoning Benchmark: Evaluating Multimodal LLMs on Classroom-Authentic Visual Problems from Primary Education | Oliver G. B. Garrod Team | 2602.12196 | HJFY | |
| 2026-02-12 | 3DGSNav: Enhancing Vision-Language Model Reasoning for Object Navigation via Active 3D Gaussian Splatting | Xinyi Yu Team | 2602.12159 | HJFY | |
| 2026-02-12 | DeepSight: An All-in-One LM Safety Toolkit | Xia Hu Team | 2602.12092 | HJFY | |
| 2026-02-12 | Affordance-Graphed Task Worlds: Self-Evolving Task Generation for Scalable Embodied Learning | Changshui Zhang Team | 2602.12065 | HJFY | |
| 2026-02-12 | Can Local Vision-Language Models improve Activity Recognition over Vision Transformers? -- Case Study on Newborn Resuscitation | Øyvind Meinich-Bache Team | 2602.12002 | HJFY | |
| 2026-02-12 | Spatial Chain-of-Thought: Bridging Understanding and Generation Models for Spatial Reasoning Generation | Long Chen Team | 2602.11980 | HJFY | |
| 2026-02-12 | Benchmarking Vision-Language Models for French PDF-to-Markdown Conversion | Nicolas Mery Team | 2602.11960 | HJFY | |
| 2026-02-12 | Are Two LLMs Better Than One? A Student-Teacher Dual-Head LLMs Architecture for Pharmaceutical Content Optimization | Anubhav Girdhar Team | 2602.11957 | HJFY | |
| 2026-02-10 | Reason-IAD: Knowledge-Guided Dynamic Latent Reasoning for Explainable Industrial Anomaly Detection | Xiaochun Cao Team | 2602.09850 | HJFY | |
| 2026-02-10 | Kelix Technique Report | Ziqi Wang Team | 2602.09843 | HJFY | |
| 2026-02-10 | SAKED: Mitigating Hallucination in Large Vision-Language Models via Stability-Aware Knowledge Enhanced Decoding | Xudong Jiang Team | 2602.09825 | HJFY | |
| 2026-02-10 | GenSeg-R1: RL-Driven Vision-Language Grounding for Fine-Grained Referring Segmentation | Uma Mahesh Team | 2602.09701 | HJFY | |
| 2026-02-10 | VideoAfford: Grounding 3D Affordance from Human-Object-Interaction Videos via Multimodal Large Language Model | Hui Xiong Team | 2602.09638 | HJFY | |
| 2026-02-10 | AGMark: Attention-Guided Dynamic Watermarking for Large Vision-Language Models | Linlin Wang Team | 2602.09611 | HJFY | |
| 2026-02-10 | Tele-Omni: a Unified Multimodal Framework for Video Generation and Editing | Xuelong Li Team | 2602.09609 | HJFY | |
| 2026-02-10 | Delving into Spectral Clustering with Vision-Language Representations | Zhen Fang Team | 2602.09586 | HJFY | |
| 2026-02-10 | Scalpel: Fine-Grained Alignment of Attention Activation Manifolds via Mixture Gaussian Bridges to Mitigate Multimodal Hallucination | Koichi Shirahata Team | 2602.09541 | HJFY | |
| 2026-02-10 | DR.Experts: Differential Refinement of Distortion-Aware Experts for Blind Image Quality Assessment | Runze Hu Team | 2602.09531 | HJFY | |
| 2026-02-04 | When LLaVA Meets Objects: Token Composition for Vision-Language-Models | Hilde Kuehne Team | 2602.04864 | HJFY | |
| 2026-02-04 | El Agente Estructural: An Artificially Intelligent Molecular Editor | Varinia Bernales Team | 2602.04849 | HJFY | |
| 2026-02-04 | VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text? | Huchuan Lu Team | 2602.04802 | HJFY | |
| 2026-02-04 | Alignment Drift in Multimodal LLMs: A Two-Phase, Longitudinal Evaluation of Harm Across Eight Model Releases | Emily Dix Team | 2602.04739 | HJFY | |
| 2026-02-04 | SAR-RAG: ATR Visual Question Answering by Semantic Search, Retrieval, and MLLM Generation | Andreas Spanias Team | 2602.04712 | HJFY | |
| 2026-02-04 | Annotation Free Spacecraft Detection and Segmentation using Vision Language Models | Djamila Aouada Team | 2602.04699 | HJFY | |
| 2026-02-04 | AGILE: Hand-Object Interaction Reconstruction from Video via Agentic Generation | Chunhua Shen Team | 2602.04672 | HJFY | |
| 2026-02-04 | PIO-FVLM: Rethinking Training-Free Visual Token Reduction for VLM Acceleration from an Inference-Objective Perspective | Chunhua Shen Team | 2602.04657 | HJFY | |
| 2026-02-04 | Relational Scene Graphs for Object Grounding of Natural Language Commands | Ville Kyrki Team | 2602.04635 | HJFY | |
| 2026-02-04 | LEAD: Layer-wise Expert-aligned Decoding for Faithful Radiology Report Generation | Yan Song Team | 2602.04617 | HJFY | |
| 2026-02-02 | Avenir-Web: Human-Experience-Imitating Multimodal Web Agents with Mixture of Grounding Experts | Mengdi Wang Team | 2602.02468 | HJFY | |
| 2026-02-02 | MentisOculi: Revealing the Limits of Reasoning with Mental Imagery | Wieland Brendel Team | 2602.02465 | HJFY | |
| 2026-02-02 | Relationship-Aware Hierarchical 3D Scene Graph for Task Reasoning | Kostas Alexis Team | 2602.02456 | HJFY | |
| 2026-02-02 | World-Gymnast: Training Robots with Reinforcement Learning in a World Model | Sherry Yang Team | 2602.02454 | HJFY | |
| 2026-02-02 | ReasonEdit: Editing Vision-Language Models using Human Reasoning | Thomas Hartvigsen Team | 2602.02408 | HJFY | |
| 2026-02-02 | LongVPO: From Anchored Cues to Self-Reasoning for Long-Form Video Preference Optimization | Limin Wang Team | 2602.02341 | HJFY | |
| 2026-02-02 | Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models | Shaosheng Cao Team | 2602.02185 | HJFY | |
| 2026-02-02 | See2Refine: Vision-Language Feedback Improves LLM-Based eHMI Action Designers | Takeo Igarashi Team | 2602.02063 | HJFY | |
| 2026-02-02 | Auto-Comp: An Automated Pipeline for Scalable Compositional Probing of Contrastive Vision-Language Models | Toshihiko Yamasaki Team | 2602.02043 | HJFY | |
| 2026-02-02 | One Size, Many Fits: Aligning Diverse Group-Wise Click Preferences in Large-Scale Advertising Image Generation | Jian Liang Team | 2602.02033 | HJFY | |
| 2026-01-30 | User Prompting Strategies and Prompt Enhancement Methods for Open-Set Object Detection in XR Environments | Junfeng Lin et al. | 2601.23281 | null |
| 2026-01-30 | Training-Free Test-Time Adaptation with Brownian Distance Covariance in Vision-Language Models | Yi Zhang et al. | 2601.23253 | null |
| 2026-01-30 | Structured Over Scale: Learning Spatial Reasoning from Educational Video | Bishoy Galoaa et al. | 2601.23251 | null |
| 2026-01-30 | Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning | Xiangyu Zeng et al. | 2601.23224 | null |
| 2026-01-30 | Med-Scout: Curing MLLMs' Geometric Blindness in Medical Perception via Geometry-Aware RL Post-Training | Anglin Liu et al. | 2601.23220 | null |
| 2026-01-30 | Make Anything Match Your Target: Universal Adversarial Perturbations against Closed-Source MLLMs via Multi-Crop Routed Meta Optimization | Hui Lu et al. | 2601.23179 | null |
| 2026-01-30 | Hearing is Believing? Evaluating and Analyzing Audio Language Model Sycophancy with SYAUDIO | Junchi Yao et al. | 2601.23149 | null |
| 2026-01-30 | One-shot Optimized Steering Vector for Hallucination Mitigation for VLMs | Youxu Shi et al. | 2601.23041 | null |
| 2026-01-30 | Triage: Hierarchical Visual Budgeting for Efficient Video Reasoning in Vision-Language Models | Anmin Wang et al. | 2601.22959 | null |
| 2026-01-30 | Alignment among Language, Vision and Action Representations | Nicola Milano et al. | 2601.22948 | null |
| 2026-01-29 | UEval: A Benchmark for Unified Multimodal Generation | Bo Li et al. | 2601.22155 | null |
| 2026-01-29 | Do VLMs Perceive or Recall? Probing Visual Perception vs. Memory with Classic Visual Illusions | Xiaoxiao Sun et al. | 2601.22150 | null |
| 2026-01-29 | SINA: A Circuit Schematic Image-to-Netlist Generator Using Artificial Intelligence | Saoud Aldowaish et al. | 2601.22114 | null |
| 2026-01-29 | VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning | Yibo Wang et al. | 2601.22069 | null |
| 2026-01-29 | Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models | Wenxuan Huang et al. | 2601.22060 | null |
| 2026-01-29 | MetricAnything: Scaling Metric Depth Pretraining with Noisy Heterogeneous Sources | Baorui Ma et al. | 2601.22054 | null |
| 2026-01-29 | Visual-Guided Key-Token Regularization for Multimodal Large Language Model Unlearning | Chengyi Cai et al. | 2601.22020 | null |
| 2026-01-29 | Causal World Modeling for Robot Control | Lin Li et al. | 2601.21998 | null |
| 2026-01-29 | Clarity: The Flexibility-Interpretability Trade-Off in Sparsity-aware Concept Bottleneck Models | Konstantinos P. Panousis et al. | 2601.21944 | null |
| 2026-01-29 | VideoAesBench: Benchmarking the Video Aesthetics Perception Capabilities of Large Multimodal Models | Yunhao Li et al. | 2601.21915 | null |
|
Evaluation status is saved locally in the browser (localStorage) and is not synced across devices or browsers.