ChartAB: A Benchmark for Chart Grounding & Dense Alignment
Authors: Aniruddh Bansal, Davit Soselia, Dang Nguyen, Tianyi Zhou
First: 2025-10-30T17:56:31+00:00 · Latest: 2025-10-30T17:56:31+00:00
Abstract
Charts play an important role in visualization, reasoning, data analysis, and
the exchange of ideas among humans. However, existing vision-language models
(VLMs) still lack accurate perception of details and struggle to extract
fine-grained structures from charts. Such limitations in chart grounding also
hinder their ability to compare multiple charts and reason over them. In this
paper, we introduce a novel "ChartAlign Benchmark (ChartAB)" to provide a
comprehensive evaluation of VLMs in chart grounding tasks, i.e., extracting
tabular data, localizing visualization elements, and recognizing various
attributes from charts of diverse types and complexities. We design a JSON
template to facilitate the calculation of evaluation metrics specifically
tailored for each grounding task. By incorporating a novel two-stage inference
workflow, the benchmark can further evaluate VLMs' capability to align and
compare elements/attributes across two charts. Our analysis of evaluations on
several recent VLMs reveals new insights into their perception biases,
weaknesses, robustness, and hallucinations in chart understanding. These
findings highlight the fine-grained discrepancies among VLMs in chart
understanding tasks and point to specific skills that need to be strengthened
in current models.
中文标题/摘要
标题:ChartAB:图表定位与密集对齐基准
图表在可视化、推理、数据分析以及人类思想交流中发挥着重要作用。然而,现有的视觉-语言模型(VLMs)在细节感知方面仍存在不足,难以从图表中提取精细结构。这种图表定位的限制也阻碍了它们比较多个图表和推理的能力。在本文中,我们引入了一个新的“图表对齐基准(ChartAB)”,以全面评估VLMs在图表定位任务中的表现,即提取表格数据、定位可视化元素以及从不同类型和复杂度的图表中识别各种属性。我们设计了一个JSON模板,以方便计算每个定位任务的评估指标。通过引入一种新颖的两阶段推理工作流,基准还可以进一步评估VLMs在两个图表之间对齐和比较元素/属性的能力。我们对几种近期VLMs的评估分析揭示了它们在图表理解中的感知偏差、弱点、鲁棒性和幻觉。这些发现突显了VLMs在图表理解任务中的细微差异,并指出了当前模型需要加强的具体技能。
Summary / 总结
The paper introduces ChartAB, a benchmark for evaluating vision-language models in chart grounding tasks, including extracting tabular data, localizing visualization elements, and recognizing attributes. It uses a JSON template to calculate specific evaluation metrics and a two-stage inference workflow to compare elements across charts. The benchmark reveals perception biases, weaknesses, and hallucinations in recent VLMs, highlighting the need to improve their fine-grained understanding of charts.
论文提出了ChartAB,一个用于评估视觉-语言模型在图表定位任务中的基准,包括提取表格数据、定位可视化元素和识别属性。通过使用JSON模板和两阶段推理工作流,基准评估模型在跨图表对元素进行对齐和比较的能力。对最近VLMs的评估揭示了偏见、弱点和幻觉,突显了当前模型在图表理解任务中需要增强的具体技能。
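The ChartAB abstract mentions a JSON template that makes per-task metrics easy to compute. As a rough illustration only, the sketch below scores a predicted data table against a reference one. The schema (`"data"` mapping series names to label/value pairs), the function name `cell_accuracy`, and the 5% relative tolerance are assumptions made for this example; they are not taken from the benchmark.

```python
import json


def cell_accuracy(pred_json: str, gold_json: str, rel_tol: float = 0.05) -> float:
    """Fraction of reference cells whose predicted value is within rel_tol.

    Hypothetical schema: {"data": {series_name: {label: numeric_value}}}.
    """
    pred = json.loads(pred_json)["data"]
    gold = json.loads(gold_json)["data"]
    total = 0
    hits = 0
    for series, points in gold.items():
        for label, gold_value in points.items():
            total += 1
            pred_value = pred.get(series, {}).get(label)
            if pred_value is None:
                continue  # a missing cell counts as a miss
            if abs(pred_value - gold_value) <= rel_tol * max(abs(gold_value), 1e-9):
                hits += 1
    return hits / total if total else 0.0


gold = '{"data": {"sales": {"Q1": 100, "Q2": 200}, "costs": {"Q1": 50, "Q2": 80}}}'
pred = '{"data": {"sales": {"Q1": 102, "Q2": 150}, "costs": {"Q1": 50}}}'
score = cell_accuracy(pred, gold)
print(score)
```

In this toy case two of the four reference cells are matched (one value is off by 25% and one cell is missing). A structured output format is what makes this kind of cell-by-cell comparison possible without parsing free-form text.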
SteerVLM: Robust Model Control through Lightweight Activation Steering for Vision Language Models
Authors: Anushka Sivakumar, Andrew Zhang, Zaber Hakim, Chris Thomas
First: 2025-10-30T17:52:39+00:00 · Latest: 2025-10-30T17:52:39+00:00
Abstract
This work introduces SteerVLM, a lightweight steering module designed to
guide Vision-Language Models (VLMs) towards outputs that better adhere to
desired instructions. Our approach learns from the latent embeddings of paired
prompts encoding target and converse behaviors to dynamically adjust
activations connecting the language modality with image context. This allows
for fine-grained, inference-time control over complex output semantics without
modifying model weights while preserving performance on off-target tasks. Our
steering module requires learning parameters equal to 0.14% of the original
VLM's size. Our steering module gains model control through dimension-wise
activation modulation and adaptive steering across layers without requiring
pre-extracted static vectors or manual tuning of intervention points.
Furthermore, we introduce VNIA (Visual Narrative Intent Alignment), a
multimodal dataset specifically created to facilitate the development and
evaluation of VLM steering techniques. Our method outperforms existing
intervention techniques on steering and hallucination mitigation benchmarks for
VLMs and offers a robust solution for multimodal model control through
activation engineering.
中文标题/摘要
标题:SteerVLM:通过轻量级激活转向实现视觉语言模型稳健的模型控制
本工作介绍了SteerVLM,这是一种轻量级的转向模块,旨在引导视觉语言模型(VLMs)生成更符合所需指令的输出。我们的方法通过学习配对提示的潜在嵌入,编码目标和相反行为,动态调整语言模态与图像上下文之间的激活连接。这允许在不修改模型权重的情况下,在推理时对复杂的输出语义进行精细控制,同时保持对离目标任务的性能。我们的转向模块的学习参数量仅为原始VLM大小的0.14%。我们的转向模块通过维度上的激活调制和跨层自适应转向获得模型控制,无需预先提取的静态向量或手动调整干预点。此外,我们还引入了VNIA(视觉叙事意图对齐)多模态数据集,专门用于促进VLM转向技术的发展和评估。我们的方法在VLM的转向和幻觉缓解基准测试中优于现有干预技术,并提出了一种通过激活工程实现多模态模型控制的稳健解决方案。
CronusVLA: Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling
Authors: Hao Li, Shuai Yang, Yilun Chen, Xinyi Chen, Xiaoda Yang, Yang Tian, Hanqing Wang, Tai Wang, Dahua Lin, Feng Zhao, Jiangmiao Pang
First: 2025-06-24T17:30:27+00:00 · Latest: 2025-10-30T16:38:19+00:00
Comments: 39 pages, 24 figures
Abstract
Recent vision-language-action (VLA) models built on pretrained
vision-language models (VLMs) have demonstrated strong performance in robotic
manipulation. However, these models remain constrained by the single-frame
image paradigm and fail to fully leverage the temporal information offered by
multi-frame histories, as directly feeding multiple frames into VLM backbones
incurs substantial computational overhead and inference latency. We propose
CronusVLA, a unified framework that extends single-frame VLA models to the
multi-frame paradigm. CronusVLA follows a two-stage process: (1) Single-frame
pretraining on large-scale embodied datasets with autoregressive prediction of
action tokens, establishing an effective embodied vision-language foundation;
(2) Multi-frame post-training, which adapts the prediction of the
vision-language backbone from discrete tokens to learnable features, and
aggregates historical information via feature chunking. CronusVLA effectively
addresses the existing challenges of multi-frame modeling while enhancing
performance and observational robustness. To evaluate the robustness under
temporal and spatial disturbances, we introduce SimplerEnv-OR, a novel
benchmark featuring 24 types of observational disturbances and 120 severity
levels. Experiments across three embodiments in simulated and real-world
environments demonstrate that CronusVLA achieves leading performance and
superior robustness, with a 70.9% success rate on SimplerEnv, a 26.8%
improvement over OpenVLA on LIBERO, and the highest robustness score on
SimplerEnv-OR. These results highlight the potential of efficient multi-frame
adaptation in VLA models for more powerful and robust real-world deployment.
中文标题/摘要
标题:CronusVLA:通过多帧视觉-语言-动作建模实现高效稳健操作
基于预训练视觉-语言模型(VLMs)的近期视觉-语言-动作(VLA)模型在机器人操作方面表现出强大的性能。然而,这些模型仍然受限于单帧图像范式,未能充分利用多帧历史提供的时间信息,直接将多帧输入到VLM主干中会带来巨大的计算开销和推理延迟。我们提出了一种名为CronusVLA的统一框架,将单帧VLA模型扩展到多帧范式。CronusVLA遵循两阶段过程:(1)在大规模具身数据集上进行单帧预训练,通过自回归预测动作标记,建立有效的具身视觉-语言基础;(2)多帧后训练,将视觉-语言主干的预测从离散标记调整为可学习特征,并通过特征分块聚合历史信息。CronusVLA有效解决了多帧建模的现有挑战,同时提高了性能和观测鲁棒性。为了评估在时间和空间扰动下的鲁棒性,我们引入了SimplerEnv-OR基准,包含24种观测扰动类型和120种严重程度级别。在模拟和真实环境中的三种具身模型实验表明,CronusVLA实现了领先性能和优越的鲁棒性,在SimplerEnv中的成功率达到了70.9%,在LIBERO中的性能提高了26.8%,在SimplerEnv-OR中获得了最高的鲁棒性得分。这些结果突显了VLA模型中高效多帧适应的潜力,使其在更强大和鲁棒的实际部署中具有更大的可能性。
Summary / 总结
CronusVLA is a framework that extends single-frame vision-language-action models to a multi-frame paradigm to improve robotic manipulation. It uses a two-stage process: single-frame pretraining for action token prediction and multi-frame post-training for feature learning and historical information aggregation. Experiments show CronusVLA outperforms existing models with a 70.9% success rate on SimplerEnv and a 26.8% improvement on LIBERO compared to OpenVLA, demonstrating enhanced performance and robustness under various disturbances.
CronusVLA 是一个框架,将单帧视觉-语言-动作模型扩展到多帧范式,以提高机器人操作能力。它采用两阶段过程:单帧预训练进行动作标记预测和多帧后训练进行特征学习和历史信息聚合。实验表明,CronusVLA 在 SimplerEnv 上的成功率为 70.9%,在 LIBERO 上比 OpenVLA 提高了 26.8%,显示出在各种干扰下增强的性能和鲁棒性。
All You Need for Object Detection: From Pixels, Points, and Prompts to Next-Gen Fusion and Multimodal LLMs/VLMs in Autonomous Vehicles
Authors: Sayed Pedram Haeri Boroujeni, Niloufar Mehrabi, Hazim Alzorgan, Ahmad Sarlak, Mahlagha Fazeli, Abolfazl Razi
First: 2025-10-30T16:08:25+00:00 · Latest: 2025-10-30T16:08:25+00:00
Abstract
Autonomous Vehicles (AVs) are transforming the future of transportation
through advances in intelligent perception, decision-making, and control
systems. However, their success is tied to one core capability, reliable object
detection in complex and multimodal environments. While recent breakthroughs in
Computer Vision (CV) and Artificial Intelligence (AI) have driven remarkable
progress, the field still faces a critical challenge as knowledge remains
fragmented across multimodal perception, contextual reasoning, and cooperative
intelligence. This survey bridges that gap by delivering a forward-looking
analysis of object detection in AVs, emphasizing emerging paradigms such as
Vision-Language Models (VLMs), Large Language Models (LLMs), and Generative AI
rather than re-examining outdated techniques. We begin by systematically
reviewing the fundamental spectrum of AV sensors (camera, ultrasonic, LiDAR,
and Radar) and their fusion strategies, highlighting not only their
capabilities and limitations in dynamic driving environments but also their
potential to integrate with recent advances in LLM/VLM-driven perception
frameworks. Next, we introduce a structured categorization of AV datasets that
moves beyond simple collections, positioning ego-vehicle, infrastructure-based,
and cooperative datasets (e.g., V2V, V2I, V2X, I2I), followed by a
cross-analysis of data structures and characteristics. Ultimately, we analyze
cutting-edge detection methodologies, ranging from 2D and 3D pipelines to
hybrid sensor fusion, with particular attention to emerging transformer-driven
approaches powered by Vision Transformers (ViTs), Large and Small Language
Models (SLMs), and VLMs. By synthesizing these perspectives, our survey
delivers a clear roadmap of current capabilities, open challenges, and future
opportunities.
中文标题/摘要
标题:目标检测所需的一切:从像素、点和提示到下一代融合与多模态LLM/VLM在自动驾驶车辆中的应用
自动驾驶车辆(AVs)通过智能感知、决策和控制系统的发展正在重塑未来的交通。然而,它们的成功取决于一个核心能力——在复杂和多模态环境中可靠地进行目标检测。尽管计算机视觉(CV)和人工智能(AI)领域的最新突破推动了显著的进步,但该领域仍面临一个关键挑战,即知识在多模态感知、上下文推理和协同智能方面仍碎片化。本文综述填补了这一空白,通过提供面向未来的AV目标检测分析,强调了新兴范式,如视觉语言模型(VLMs)、大型语言模型(LLMs)和生成AI,而不是重新审视过时的技术。我们首先系统地回顾了AV传感器(摄像头、超声波、激光雷达和雷达)及其融合策略,不仅突出了它们在动态驾驶环境中的能力和局限性,还强调了它们与基于大/小语言模型/视觉模型的感知框架的潜在整合。接着,我们介绍了AV数据集的结构化分类,超越了简单的集合,将自我车辆、基础设施和协同数据集(例如V2V、V2I、V2X、I2I)置于其中,随后进行了数据结构和特征的交叉分析。最后,我们分析了最新的检测方法,从2D和3D管道到混合传感器融合,特别关注由视觉变换器(ViTs)、大型和小型语言模型(SLMs)和VLMs驱动的新兴变换器方法。通过综合这些视角,我们的综述提供了一条清晰的当前能力、开放挑战和未来机遇的路线图。
Counteracting Matthew Effect in Self-Improvement of LVLMs through Head-Tail Re-balancing
Authors: Xin Guo, Zhiheng Xi, Yiwen Ding, Yitao Zhai, Xiaowei Shi, Xunliang Cai, Tao Gui, Qi Zhang, Xuanjing Huang
First: 2025-10-30T13:26:58+00:00 · Latest: 2025-10-30T13:26:58+00:00
Comments: Preprint
Abstract
Self-improvement has emerged as a mainstream paradigm for advancing the
reasoning capabilities of large vision-language models (LVLMs), where models
explore and learn from successful trajectories iteratively. However, we
identify a critical issue during this process: the model excels at generating
high-quality trajectories for simple queries (i.e., head data) but struggles
with more complex ones (i.e., tail data). This leads to an imbalanced
optimization that drives the model to prioritize simple reasoning skills, while
hindering its ability to tackle more complex reasoning tasks. Over iterations,
this imbalance becomes increasingly pronounced--a dynamic we term the "Matthew
effect"--which ultimately hinders further model improvement and leads to
performance bottlenecks. To counteract this challenge, we introduce four
efficient strategies from two perspectives: distribution-reshaping and
trajectory-resampling, to achieve head-tail re-balancing during the
exploration-and-learning self-improvement process. Extensive experiments on
Qwen2-VL-7B-Instruct and InternVL2.5-4B models across visual reasoning tasks
demonstrate that our methods consistently improve visual reasoning
capabilities, outperforming vanilla self-improvement by 3.86 points on average.
中文标题/摘要
标题:通过头部-尾部再平衡对抗LVLM自我提升中的马太效应
自我提升已成为提升大型视觉-语言模型(LVLM)推理能力的主要范式,其中模型通过迭代探索和学习成功的轨迹。然而,在这一过程中,我们发现一个关键问题:模型在生成简单查询(即头部数据)的高质量轨迹方面表现出色,但在处理更复杂的查询(即尾部数据)方面却遇到困难。这导致了一种不平衡的优化,使模型优先关注简单的推理技能,而阻碍了其解决更复杂推理任务的能力。随着迭代次数的增加,这种不平衡变得越来越明显——我们将其称为“马太效应”——最终阻碍了模型的进一步改进并导致性能瓶颈。为了应对这一挑战,我们从两个角度引入了四种有效的策略:分布重塑和轨迹重采样,以在探索和学习自我提升过程中实现头部-尾部再平衡。在Qwen2-VL-7B-Instruct和InternVL2.5-4B模型的视觉推理任务上的广泛实验表明,我们的方法在视觉推理能力上始终优于传统的自我提升,平均高出3.86分。
Summary / 总结
The paper addresses the issue of the Matthew effect in self-improvement of large vision-language models (LVLMs), where models tend to excel at simple tasks (head data) but struggle with complex ones (tail data). To counteract this, the authors propose four strategies from distribution-reshaping and trajectory-resampling perspectives to achieve head-tail re-balancing. Experiments on Qwen2-VL-7B-Instruct and InternVL2.5-4B models show that these methods improve visual reasoning capabilities by an average of 3.86 points compared to vanilla self-improvement.
论文研究了大型视觉-语言模型(LVLM)自我提升过程中出现的马太效应问题,即模型在简单任务(头部数据)上表现优异,但在复杂任务(尾部数据)上表现较差。为解决这一不平衡,作者提出了四种策略,分别从数据重塑和轨迹重采样的角度出发。实验结果显示,这些方法在Qwen2-VL-7B-Instruct和InternVL2.5-4B模型上平均提高了3.86个点的视觉推理能力,优于传统的自我提升方法。
Representation-Level Counterfactual Calibration for Debiased Zero-Shot Recognition
Authors: Pei Peng, MingKun Xie, Hang Hao, Tong Jin, ShengJun Huang
First: 2025-10-30T13:11:23+00:00 · Latest: 2025-10-30T13:11:23+00:00
Abstract
Object-context shortcuts remain a persistent challenge in vision-language
models, undermining zero-shot reliability when test-time scenes differ from
familiar training co-occurrences. We recast this issue as a causal inference
problem and ask: Would the prediction remain if the object appeared in a
different environment? To answer this at inference time, we estimate object and
background expectations within CLIP's representation space, and synthesize
counterfactual embeddings by recombining object features with diverse
alternative contexts sampled from external datasets, batch neighbors, or
text-derived descriptions. By estimating the Total Direct Effect and simulating
intervention, we further subtract background-only activation, preserving
beneficial object-context interactions while mitigating hallucinated scores.
Without retraining or prompt design, our method substantially improves both
worst-group and average accuracy on context-sensitive benchmarks, establishing
a new zero-shot state of the art. Beyond performance, our framework provides a
lightweight representation-level counterfactual approach, offering a practical
causal avenue for debiased and reliable multimodal reasoning.
中文标题/摘要
标题:表示级反事实校准以实现无偏零样本识别
物体-上下文捷径仍然是视觉-语言模型中的一个持续性挑战,当测试场景与熟悉的训练共现情况不同时,会削弱零样本识别的可靠性。我们将此问题重新定义为因果推理问题,并提出:如果物体出现在不同的环境中,预测结果是否仍然成立?为了在推理时回答这一问题,我们估计CLIP表示空间中的物体和背景期望,并通过重新组合物体特征与从外部数据集、批内邻居或文本描述中采样的多种不同背景,合成反事实嵌入。通过估计总直接效应和模拟干预,我们进一步减去仅由背景引起的激活,保留有益的物体-背景交互,同时减轻幻觉得分。无需重新训练或设计提示,我们的方法在上下文敏感基准测试中显著提高了最差群体和平均准确率,确立了新的零样本最先进水平。除了性能,我们的框架提供了一种轻量级的表示级反事实方法,为无偏和可靠的多模态推理提供了实用的因果途径。
Summary / 总结
The paper addresses the challenge of object-context shortcuts in vision-language models, which can lead to unreliable zero-shot recognition when test scenarios differ from training data. To tackle this, the authors propose a method that recombines object features with diverse alternative contexts to estimate counterfactual embeddings. By estimating the Total Direct Effect and simulating interventions, they mitigate hallucinated scores while preserving beneficial object-context interactions. This approach improves both worst-group and average accuracy on context-sensitive benchmarks, setting a new zero-shot state of the art without requiring retraining or prompt design.
论文针对视觉-语言模型中存在的对象-上下文捷径问题,该问题可能导致在测试场景与训练数据不一致时零样本识别的可靠性降低。为此,作者提出了一种方法,通过重新组合对象特征与多样化的替代上下文来估计反事实嵌入。通过估计总直接效应和模拟干预,他们减轻了幻觉分数,同时保留了有益的对象-上下文交互。这种方法在上下文敏感基准测试中提高了最差群体和平均准确率,确立了新的零样本最先进水平,无需重新训练或设计提示。
Towards Fine-Grained Vision-Language Alignment for Few-Shot Anomaly Detection
Authors: Yuanting Fan, Jun Liu, Xiaochen Chen, Bin-Bin Gao, Jian Li, Yong Liu, Jinlong Peng, Chengjie Wang
First: 2025-10-30T13:09:00+00:00 · Latest: 2025-10-30T13:09:00+00:00
Comments: 12 pages, 7 figures
Abstract
Few-shot anomaly detection (FSAD) methods identify anomalous regions with few
known normal samples. Most existing methods rely on the generalization ability
of pre-trained vision-language models (VLMs) to recognize potentially anomalous
regions through feature similarity between text descriptions and images.
However, lacking detailed textual descriptions, these methods can
only pre-define image-level descriptions and match them against each visual
patch token to identify potentially anomalous regions. This causes semantic
misalignment between image descriptions and patch-level visual anomalies and
yields sub-optimal localization performance. To address these issues, we propose
the Multi-Level Fine-Grained Semantic Caption (MFSC), which provides multi-level,
fine-grained textual descriptions for existing anomaly detection datasets via an
automatic construction pipeline. Based on MFSC, we propose a novel
framework named FineGrainedAD to improve anomaly localization performance,
which consists of two components: Multi-Level Learnable Prompt (MLLP) and
Multi-Level Semantic Alignment (MLSA). MLLP introduces fine-grained semantics
into multi-level learnable prompts through an automatic replacement and
concatenation mechanism, while MLSA uses a region aggregation strategy and
multi-level alignment training to help learnable prompts better align
with their corresponding visual regions. Experiments demonstrate that the proposed
FineGrainedAD achieves superior overall performance in few-shot settings on
the MVTec-AD and VisA datasets.
中文标题/摘要
标题:面向少样本异常检测的细粒度视觉-语言对齐
少量样本异常检测(FSAD)方法使用少量已知正常样本识别异常区域。现有大多数方法依赖预训练的视觉-语言模型(VLMs)通过文本描述和图像特征之间的相似性来识别潜在的异常区域。但由于缺乏详细的文本描述,这些方法只能预先定义图像级别的描述来匹配每个视觉补丁标记,以识别潜在的异常区域,这导致了图像描述与补丁级别视觉异常之间的语义不匹配,从而导致次优的定位性能。为了解决上述问题,我们提出了多级细粒度语义描述(MFSC),为现有的异常检测数据集提供多级和细粒度的文本描述,并通过自动构建管道进行自动构建。基于MFSC,我们提出了一种新的框架FineGrainedAD,以提高异常定位性能,该框架由两个组件组成:多级可学习提示(MLLP)和多级语义对齐(MLSA)。MLLP通过自动替换和连接机制将细粒度语义引入多级可学习提示,而MLSA设计了区域聚合策略和多级对齐训练,以促进可学习提示更好地与相应的视觉区域对齐。实验表明,提出的FineGrainedAD在MVTec-AD和VisA数据集的少量样本设置中实现了优越的整体性能。
Summary / 总结
The paper addresses the challenge of few-shot anomaly detection by proposing Multi-Level Fine-Grained Semantic Caption (MFSC) to provide detailed textual descriptions for anomaly detection datasets. It introduces a novel framework called FineGrainedAD, which includes Multi-Level Learnable Prompt (MLLP) and Multi-Level Semantic Alignment (MLSA) to improve anomaly localization. The framework enhances the alignment between textual descriptions and visual anomalies, leading to better performance on MVTec-AD and VisA datasets compared to existing methods.
论文通过提出多级精细语义描述(Multi-Level Fine-Grained Semantic Caption, MFSC)和新型框架FineGrainedAD来解决少样本异常检测中的语义对齐问题。MFSC为异常检测数据集提供详细的文本描述,而FineGrainedAD包含多级可学习提示(Multi-Level Learnable Prompt, MLLP)和多级语义对齐(Multi-Level Semantic Alignment, MLSA),以提高异常定位性能。实验表明,FineGrainedAD在MVTec-AD和VisA数据集的少样本设置中优于现有方法。
A-TPT: Angular Diversity Calibration Properties for Test-Time Prompt Tuning of Vision-Language Models
Authors: Shihab Aaqil Ahamed, Udaya S. K. P. Miriya Thanthrige, Ranga Rodrigo, Muhammad Haris Khan
First: 2025-10-30T12:45:24+00:00 · Latest: 2025-10-30T12:45:24+00:00
Comments: 23 pages, 14 figures
Abstract
Test-time prompt tuning (TPT) has emerged as a promising technique for
adapting large vision-language models (VLMs) to unseen tasks without relying on
labeled data. However, the lack of dispersion between textual features can hurt
calibration performance, which raises concerns about VLMs' reliability,
trustworthiness, and safety. Current TPT approaches primarily focus on
improving prompt calibration by either maximizing average textual feature
dispersion or enforcing orthogonality constraints to encourage angular
separation. However, these methods do not always achieve optimal angular
separation between class-wise textual features, overlooking the
critical role of angular diversity. To address this, we propose A-TPT, a novel
TPT framework that introduces angular diversity to encourage uniformity in the
distribution of normalized textual features induced by corresponding learnable
prompts. This uniformity is achieved by maximizing the minimum pairwise angular
distance between features on the unit hypersphere. We show that our approach
consistently surpasses state-of-the-art TPT methods in reducing the aggregate
average calibration error while maintaining comparable accuracy through
extensive experiments with various backbones on different datasets. Notably,
our approach exhibits superior zero-shot calibration performance on natural
distribution shifts and generalizes well to medical datasets. We provide
extensive analyses, including theoretical aspects, to establish the grounding
of A-TPT. These results highlight the potency of promoting angular diversity to
achieve well-dispersed textual features, significantly improving VLM
calibration during test-time adaptation. Our code will be made publicly
available.
中文标题/摘要
标题:A-TPT:视觉语言模型测试时提示调优的角多样性校准特性
测试时提示调优(TPT)已成为一种有前景的技术,用于在无需依赖标记数据的情况下,将大型视觉语言模型(VLMs)适应未见过的任务。然而,文本特征之间的缺乏分散性会损害校准性能,这引起了人们对VLMs可靠性和安全性的担忧。当前的TPT方法主要通过最大化平均文本特征分散性或施加正交约束来鼓励角度分离,以提高提示校准。然而,这些方法可能无法始终在类别间文本特征之间实现最优的角度分离,这意味着忽视了角多样性的关键作用。为了解决这个问题,我们提出了一种新颖的A-TPT框架,该框架引入了角多样性,以鼓励由相应可学习提示诱导的归一化文本特征的分布均匀性。这种均匀性是通过最大化单位超球面上特征之间的最小成对角度距离来实现的。我们通过在不同数据集上使用各种骨干网络进行广泛实验,展示了我们的方法在降低综合平均校准误差方面始终优于最先进的TPT方法,同时保持了相当的准确性。值得注意的是,我们的方法在自然分布转移的零样本校准性能方面表现出色,并且能够很好地泛化到医学数据集。我们提供了广泛的分析,包括理论方面,以建立A-TPT的基础。这些结果突显了促进角多样性以实现分散的文本特征的潜力,显著提高了VLM在测试时适应过程中的校准。我们的代码将公开发布。
Summary / 总结
The paper introduces A-TPT, a novel test-time prompt tuning framework that enhances the calibration performance of vision-language models by promoting angular diversity. Unlike existing methods that focus on maximizing average textual feature dispersion or enforcing orthogonality, A-TPT maximizes the minimum pairwise angular distance between features on the unit hypersphere. Extensive experiments show that A-TPT consistently outperforms state-of-the-art TPT methods in reducing calibration error while maintaining accuracy, especially in zero-shot settings and medical datasets.
该论文提出了A-TPT,这是一种新颖的测试时提示调优框架,通过增强文本特征的角多样性来提升视觉语言模型的校准性能。通过最大化特征之间的最小成对角距离,A-TPT在减少校准误差的同时保持了准确性,这一表现跨越了多种数据集和模型架构。特别是在自然分布变化的零样本校准方面表现出色,并且在医学数据集上具有良好的泛化能力。
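The quantity A-TPT is described as maximizing, the minimum pairwise angular distance between normalized features on the unit hypersphere, can be written down directly. The sketch below only computes that quantity for fixed vectors; the function names and toy 2-D features are made up for illustration, and using its negative as a regularizer during prompt tuning is an assumption about how it would be applied, not a description of the authors' code.

```python
import math


def normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def min_pairwise_angle(features):
    """Smallest angle (in radians) between any two normalized feature vectors."""
    unit = [normalize(f) for f in features]
    smallest = math.pi
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            cos = sum(a * b for a, b in zip(unit[i], unit[j]))
            cos = max(-1.0, min(1.0, cos))  # guard against rounding outside [-1, 1]
            smallest = min(smallest, math.acos(cos))
    return smallest


# Two nearly collinear class features versus three evenly spread ones.
clustered = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
spread = [[1.0, 0.0], [-0.5, math.sqrt(3) / 2], [-0.5, -math.sqrt(3) / 2]]

print(min_pairwise_angle(clustered))
print(min_pairwise_angle(spread))
```

The evenly spread set attains the largest possible minimum angle for three vectors in the plane (120 degrees), while the clustered set is penalized by its single closest pair. That is the distinction between a max-min objective and one based on average dispersion.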
On-the-Fly OVD Adaptation with FLAME: Few-shot Localization via Active Marginal-Samples Exploration
Authors: Yehonathan Refael, Amit Aides, Aviad Barzilai, George Leifman, Genady Beryozkin, Vered Silverman, Bolous Jaber, Tomer Shekel
First: 2025-10-20T15:41:55+00:00 · Latest: 2025-10-30T12:05:58+00:00
Abstract
Open-vocabulary object detection (OVD) models offer remarkable flexibility by
detecting objects from arbitrary text queries. However, their zero-shot
performance in specialized domains like Remote Sensing (RS) is often
compromised by the inherent ambiguity of natural language, limiting critical
downstream applications. For instance, an OVD model may struggle to distinguish
between fine-grained classes such as "fishing boat" and "yacht" since their
embeddings are similar and often inseparable. This can hamper specific user
goals, such as monitoring illegal fishing, by producing irrelevant detections.
To address this, we propose a cascaded approach that couples the broad
generalization of a large pre-trained OVD model with a lightweight few-shot
classifier. Our method first employs the zero-shot model to generate
high-recall object proposals. These proposals are then refined for high
precision by a compact classifier trained in real time on only a handful of
user-annotated examples, drastically reducing the high cost of RS imagery
annotation. The core of our framework is FLAME, a one-step active learning
strategy that selects the most informative samples for training. FLAME
identifies, on the fly, uncertain marginal candidates near the decision
boundary using density estimation, followed by clustering to ensure sample
diversity. This efficient sampling technique achieves high accuracy without
costly full-model fine-tuning and enables instant adaptation in less than
a minute, which is significantly faster than state-of-the-art alternatives. Our
method consistently surpasses state-of-the-art performance on RS benchmarks,
establishing a practical and resource-efficient framework for adapting
foundation models to specific user needs.
中文标题/摘要
标题:FLAME驱动的即时OVD适应:基于活跃边际样本探索的少样本定位
开放词汇对象检测(OVD)模型通过从任意文本查询中检测对象提供了显著的灵活性。然而,它们在诸如遥感(RS)等专门领域中的零样本性能往往因自然语言的固有歧义而受损,限制了关键的下游应用。例如,一个OVD模型可能难以区分“渔船”和“游艇”这类细粒度类别,因为它们的嵌入相似且经常不可分。这可能妨碍特定用户目标,如监测非法捕鱼,导致无关的检测结果。为了解决这一问题,我们提出了一种级联方法,将大型预训练OVD模型的广泛泛化与轻量级少样本分类器相结合。我们的方法首先使用零样本模型生成高召回的对象提案,然后通过仅在少量用户标注示例上实时训练的小型分类器进行高精度细化,从而大幅降低RS图像标注的高昂成本。我们框架的核心是FLAME,这是一种一步式主动学习策略,能够选择最具信息量的样本进行训练。FLAME利用密度估计在决策边界附近即时识别不确定的边际候选样本,然后通过聚类确保样本多样性。这种高效的采样技术在无需昂贵的全模型微调的情况下实现了高精度,并能够在不到一分钟内实现即时适应,显著快于最先进的替代方案。我们的方法在RS基准测试中始终超越了最先进的性能,建立了一个实用且资源高效的框架,用于将基础模型适应特定用户需求。
Summary / 总结
The paper addresses the weak zero-shot performance of open-vocabulary object detection (OVD) models in specialized domains like Remote Sensing (RS), where fine-grained class distinctions can be ambiguous. It proposes a cascaded approach combining a large pre-trained OVD model with a lightweight few-shot classifier. The method uses a one-step active learning strategy, FLAME, to select informative samples for real-time training, achieving high accuracy and enabling adaptation in less than a minute, significantly faster than existing methods. The approach consistently outperforms state-of-the-art methods on RS benchmarks.
论文针对开放词汇对象检测(OVD)模型在遥感(RS)等专业领域中的零样本性能不足问题,提出了结合大型预训练OVD模型和轻量级少样本分类器的级联方法。核心方法FLAME实现实时选择最具信息量的样本进行训练,达到高精度并能在不到一分钟内实现快速适应。实验表明,该方法在RS基准测试中优于现有最佳方案。
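The FLAME abstract describes a two-part selection: find uncertain candidates near the decision boundary, then enforce diversity among them. The sketch below is a simplified stand-in, not the paper's procedure. It replaces density estimation with a plain score-margin test and replaces clustering with greedy farthest-point selection. The function name, the `margin` and `k` parameters, and the toy data are all hypothetical.

```python
import math


def select_marginal_samples(embeddings, scores, margin=0.2, k=2):
    """Pick up to k uncertain, mutually distant samples to send for annotation.

    scores are assumed to be classifier probabilities in [0, 1]; a sample is
    treated as "marginal" when its score lies within `margin` of 0.5.
    """
    candidates = [i for i, s in enumerate(scores) if abs(s - 0.5) <= margin]
    if not candidates:
        return []
    # Seed with the single most uncertain candidate.
    chosen = [min(candidates, key=lambda i: abs(scores[i] - 0.5))]
    while len(chosen) < min(k, len(candidates)):
        remaining = [i for i in candidates if i not in chosen]
        # Greedy diversity: take the candidate farthest from its nearest pick.
        nxt = max(
            remaining,
            key=lambda i: min(math.dist(embeddings[i], embeddings[c]) for c in chosen),
        )
        chosen.append(nxt)
    return chosen


embeddings = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [9.0, 9.0]]
scores = [0.5, 0.55, 0.4, 0.95]
picked = select_marginal_samples(embeddings, scores)
print(picked)
```

Here the confident sample (score 0.95) is never considered, and the near-duplicate of the first pick is skipped in favor of a more distant uncertain sample. This is the behavior the uncertainty-plus-diversity recipe aims for when the annotation budget is only a handful of examples.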
MedSAE: Dissecting MedCLIP Representations with Sparse Autoencoders
Authors: Riccardo Renzulli, Colas Lepoutre, Enrico Cassano, Marco Grangetto
First: 2025-10-30T11:58:36+00:00 · Latest: 2025-10-30T11:58:36+00:00
Abstract
Artificial intelligence in healthcare requires models that are accurate and
interpretable. We advance mechanistic interpretability in medical vision by
applying Medical Sparse Autoencoders (MedSAEs) to the latent space of MedCLIP,
a vision-language model trained on chest radiographs and reports. To quantify
interpretability, we propose an evaluation framework that combines correlation
metrics, entropy analyses, and automated neuron naming via the MedGEMMA
foundation model. Experiments on the CheXpert dataset show that MedSAE neurons
achieve higher monosemanticity and interpretability than raw MedCLIP features.
Our findings bridge high-performing medical AI and transparency, offering a
scalable step toward clinically reliable representations.
中文标题/摘要
标题:MedSAE:通过稀疏自编码器剖析MedCLIP表示
医疗保健中的人工智能需要准确且可解释的模型。我们通过将医疗稀疏自编码器(MedSAEs)应用于MedCLIP的潜在空间,推进了医学视觉的机制可解释性,MedCLIP是一种在胸部X光片和报告上训练的视觉-语言模型。为了量化可解释性,我们提出了一种结合相关性度量、熵分析和通过MedGEMMA基础模型自动命名神经元的评估框架。在CheXpert数据集上的实验表明,MedSAE神经元在单义性和可解释性方面优于原始的MedCLIP特征。我们的研究结果将高性能的医疗AI与透明度相结合,提供了一条通往临床可靠表示的可扩展途径。
Summary / 总结
The research aims to enhance the interpretability of medical vision models by applying Medical Sparse Autoencoders (MedSAEs) to the latent space of MedCLIP, a vision-language model trained on chest radiographs and reports. The study evaluates interpretability using a framework that includes correlation metrics, entropy analysis, and automated neuron naming via MedGEMMA. The experiments on the CheXpert dataset demonstrate that MedSAE neurons exhibit higher monosemanticity and interpretability compared to raw MedCLIP features, bridging the gap between high-performing medical AI and transparency.
研究旨在通过将Medical Sparse Autoencoders (MedSAEs)应用于MedCLIP的潜在空间,提升医疗视觉模型的可解释性,MedCLIP是一个在胸部X光片和报告上训练的视觉-语言模型。研究使用包括相关性指标、熵分析和通过MedGEMMA基础模型自动命名神经元的评估框架。实验结果表明,MedSAE神经元在单义性和可解释性方面优于原始的MedCLIP特征,填补了高性能医疗AI与透明度之间的差距。
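For readers unfamiliar with sparse autoencoders, the sketch below shows one common textbook formulation: a ReLU encoder into an overcomplete latent, a linear decoder, and a loss that adds an L1 sparsity penalty to the reconstruction error. It is a generic illustration in plain Python. The MedSAE architecture, its sparsity mechanism, and its hyperparameters are not specified in the abstract, so the weights, dimensions, and `l1_coeff` here are assumptions.

```python
def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def relu(v):
    return [max(0.0, x) for x in v]


def sae_forward(x, W_enc, b_enc, W_dec, b_dec, l1_coeff=0.01):
    """Return (latent code, reconstruction, loss) for one input vector."""
    pre = [a + b for a, b in zip(matvec(W_enc, x), b_enc)]
    z = relu(pre)  # sparse, non-negative latent activations
    x_hat = [a + b for a, b in zip(matvec(W_dec, z), b_dec)]
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(a) for a in z)
    return z, x_hat, recon + l1_coeff * sparsity


# Toy 2-D input encoded into a 4-D overcomplete latent.
x = [1.0, -1.0]
W_enc = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b_enc = [0.0, 0.0, 0.0, 0.0]
W_dec = [[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]]
b_dec = [0.0, 0.0]

z, x_hat, loss = sae_forward(x, W_enc, b_enc, W_dec, b_dec)
print(z, x_hat, loss)
```

In this hand-built example only two of the four latent units fire, and each corresponds to one signed direction of the input. This is a miniature version of the "one unit, one concept" behavior that monosemanticity metrics try to quantify.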
TRUST-VL: An Explainable News Assistant for General Multimodal Misinformation Detection
Authors: Zehong Yan, Peng Qi, Wynne Hsu, Mong Li Lee
Venue: EMNLP 2025 Oral
First: 2025-09-04T17:59:43+00:00 · Latest: 2025-10-30T10:58:04+00:00
Comments: EMNLP 2025 Oral; Project Homepage: https://yanzehong.github.io/trust-vl/
Abstract
Multimodal misinformation, encompassing textual, visual, and cross-modal
distortions, poses an increasing societal threat that is amplified by
generative AI. Existing methods typically focus on a single type of distortion
and struggle to generalize to unseen scenarios. In this work, we observe that
different distortion types share common reasoning capabilities while also
requiring task-specific skills. We hypothesize that joint training across
distortion types facilitates knowledge sharing and enhances the model's ability
to generalize. To this end, we introduce TRUST-VL, a unified and explainable
vision-language model for general multimodal misinformation detection. TRUST-VL
incorporates a novel Question-Aware Visual Amplifier module, designed to
extract task-specific visual features. To support training, we also construct
TRUST-Instruct, a large-scale instruction dataset containing 198K samples
featuring structured reasoning chains aligned with human fact-checking
workflows. Extensive experiments on both in-domain and zero-shot benchmarks
demonstrate that TRUST-VL achieves state-of-the-art performance, while also
offering strong generalization and interpretability.
中文标题/摘要
标题:TRUST-VL:一种可解释的通用多模态虚假信息检测助手
多模态虚假信息,包括文本、视觉和跨模态的扭曲,构成了日益严重的社会威胁,这种威胁被生成式AI放大。现有方法通常专注于一种类型的扭曲,并难以泛化到未见过的场景。在本文中,我们观察到不同类型的扭曲共享一些共同的推理能力,同时也需要特定的任务技能。我们假设跨类型联合训练有助于知识共享并增强模型的泛化能力。为此,我们引入了TRUST-VL,这是一种统一且可解释的视觉语言模型,用于通用多模态虚假信息检测。TRUST-VL 包含一个新颖的问答感知视觉增强模块,旨在提取特定任务的视觉特征。为了支持训练,我们还构建了TRUST-Instruct,一个包含198K样本的大规模指令数据集,这些样本具有与人类事实核查工作流程对齐的结构化推理链。在领域内和零样本基准上的广泛实验表明,TRUST-VL 达到了最先进的性能,同时提供了强大的泛化能力和可解释性。
Summary / 总结
TRUST-VL is designed to address the challenge of detecting multimodal misinformation by jointly training on different types of distortions. It includes a Question-Aware Visual Amplifier module to extract task-specific visual features and is trained on a large instruction dataset called TRUST-Instruct. Experiments show that TRUST-VL outperforms existing methods and demonstrates strong generalization and interpretability.
TRUST-VL旨在通过联合训练不同类型的扭曲来检测多模态虚假信息,包含一个任务感知的视觉放大模块以提取特定的视觉特征,并使用名为TRUST-Instruct的大规模指令数据集进行训练。实验表明,TRUST-VL在性能上超越了现有方法,并且具有较强的泛化能力和可解释性。
D-HUMOR: Dark Humor Understanding via Multimodal Open-ended Reasoning -- A Benchmark Dataset and Method
Authors: Sai Kartheek Reddy Kasu, Mohammad Zia Ur Rehman, Shahid Shafi Dar, Rishi Bharat Junghare, Dhanvin Sanjay Namboodiri, Nagendra Kumar
First: 2025-09-08T14:55:16+00:00 · Latest: 2025-10-30T10:15:05+00:00
Comments: Accepted at IEEE International Conference on Data Mining (ICDM) 2025
Abstract
Dark humor in online memes poses unique challenges due to its reliance on
implicit, sensitive, and culturally contextual cues. To address the lack of
resources and methods for detecting dark humor in multimodal content, we
introduce a novel dataset of 4,379 Reddit memes annotated for dark humor,
target category (gender, mental health, violence, race, disability, and other),
and a three-level intensity rating (mild, moderate, severe). Building on this
resource, we propose a reasoning-augmented framework that first generates
structured explanations for each meme using a Large Vision-Language Model
(VLM). Through a Role-Reversal Self-Loop, VLM adopts the author's perspective
to iteratively refine its explanations, ensuring completeness and alignment. We
then extract textual features from both the OCR transcript and the self-refined
reasoning via a text encoder, while visual features are obtained using a vision
transformer. A Tri-stream Cross-Reasoning Network (TCRNet) fuses these three
streams, text, image, and reasoning, via pairwise attention mechanisms,
producing a unified representation for classification. Experimental results
demonstrate that our approach outperforms strong baselines across three tasks:
dark humor detection, target identification, and intensity prediction. The
dataset, annotations, and code are released to facilitate further research in
multimodal humor understanding and content moderation. Code and Dataset are
available at:
https://github.com/Sai-Kartheek-Reddy/D-Humor-Dark-Humor-Understanding-via-Multimodal-Open-ended-Reasoning
中文标题/摘要
标题:D-HUMOR:通过多模态开放式推理理解黑色幽默——基准数据集与方法
在线表情包中的黑色幽默因其依赖于隐含、敏感和文化背景的提示而面临独特挑战。为了解决检测多模态内容中黑色幽默资源和方法的缺乏,我们引入了一个包含4,379个带有黑色幽默标注的Reddit表情包的数据集,标注了目标类别(性别、心理健康、暴力、种族、残疾和其他)和三级强度评分(轻微、中等、严重)。在此基础上,我们提出了一种增强推理框架,首先使用大型视觉-语言模型(VLM)为每个表情包生成结构化解释。通过角色反转自循环,VLM 采用作者的视角迭代优化其解释,确保完整性和一致性。然后,我们从OCR转录文本和自优化推理中提取文本特征,使用视觉变换器获取视觉特征。三流交叉推理网络(TCRNet)通过成对注意力机制融合这三流,即文本、图像和推理,生成统一表示进行分类。实验结果表明,我们的方法在黑色幽默检测、目标识别和强度预测三项任务上均优于强基线。该数据集、标注和代码已发布,以促进多模态幽默理解和内容审核方面的进一步研究。代码和数据集可在以下链接获取:https://github.com/Sai-Kartheek-Reddy/D-Humor-Dark-Humor-Understanding-via-Multimodal-Open-ended-Reasoning
Summary / 总结
The paper introduces D-HUMOR, a dataset of 4,379 Reddit memes annotated for dark humor, target category, and intensity. It proposes a reasoning-augmented framework in which a Large Vision-Language Model first generates a structured explanation for each meme and then iteratively refines it through a Role-Reversal Self-Loop. A Tri-stream Cross-Reasoning Network (TCRNet) fuses text, image, and reasoning features via pairwise attention, outperforming strong baselines in dark humor detection, target identification, and intensity prediction. The dataset, annotations, and code are publicly available to support further research in multimodal humor understanding and content moderation.
论文旨在解决在线表情包中暗黑幽默的理解难题,这种幽默依赖于隐含和文化敏感的线索。研究引入了一个包含4,379个标注表情包的数据集,并提出了一种增强推理框架,使用大型视觉-语言模型为每个表情包生成结构化解释。该框架通过角色反转自循环迭代完善解释,从文本和图像中提取特征,并通过三流交叉推理网络融合这些特征。该方法在检测暗黑幽默、识别目标类别和预测强度等级方面优于强基线。数据集、标注和代码已公开,以支持进一步研究多模态幽默理解和内容审核。
Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models
Authors: Shiho Matta, Lis Kanashiro Pereira, Peitao Han, Fei Cheng, Shigeru Kitazawa
First: 2025-10-30T08:21:50+00:00 · Latest: 2025-10-30T08:21:50+00:00
Comments: 10 pages
Abstract
Modern vision-language models (VLMs) excel at many multimodal tasks, yet
their grasp of temporal information in video remains weak and, crucially,
under-evaluated. We probe this gap with a deceptively simple but revealing
challenge: judging the arrow of time (AoT), i.e., whether a short clip is played
forward or backward. We introduce AoT-PsyPhyBENCH, a psychophysically validated
benchmark that tests whether VLMs can infer temporal direction in natural
videos using the same stimuli and behavioral baselines established for humans.
Our comprehensive evaluation of open-weight and proprietary, reasoning and
non-reasoning VLMs reveals that most models perform near chance, and even the
best lag far behind human accuracy on physically irreversible processes (e.g.,
free fall, diffusion/explosion) and causal manual actions (division/addition)
that humans recognize almost instantly. These results highlight a fundamental
gap in current multimodal systems: while they capture rich visual-semantic
correlations, they lack the inductive biases required for temporal continuity
and causal understanding. We release the code and data for AoT-PsyPhyBENCH to
encourage further progress in the physical and temporal reasoning capabilities
of VLMs.
中文标题/摘要
标题:时间流动的方向如何?基于心理物理学的视觉-语言模型评估
现代视觉-语言模型(VLMs)在许多多模态任务中表现出色,但在视频中的时间信息理解方面仍然薄弱且未得到充分评估。我们通过一个看似简单但揭示性强的挑战——判断时间箭头(AoT)——即判断短片段是正向播放还是反向播放,来探索这一差距。我们引入了AoT-PsyPhyBENCH,这是一个经心理物理学验证的基准测试,测试VLMs是否能在自然视频中推断时间方向,使用与人类相同的刺激和行为基线。我们对开放权重和专有、推理和非推理VLMs的全面评估显示,大多数模型的表现接近随机猜测,甚至最好的模型在物理不可逆过程(如自由落体、扩散/爆炸)和因果手动动作(如分割/加法)上的人类识别能力方面也远远落后。这些结果突显了当前多模态系统中的一个基本差距:虽然它们捕捉了丰富的视觉-语义关联,但缺乏用于时间连续性和因果理解的归纳偏置。我们发布了AoT-PsyPhyBENCH的代码和数据,以鼓励进一步提高VLMs在物理和时间推理能力方面的发展。
Summary / 总结
The study evaluates the temporal understanding of vision-language models (VLMs) with AoT-PsyPhyBENCH, a psychophysically validated benchmark. It tests whether VLMs can infer the direction of time in natural videos, and finds that most models perform near chance and that even the best lag far behind human accuracy on physically irreversible processes and causal manual actions. This points to a fundamental gap in temporal and causal reasoning, even though current VLMs capture rich visual-semantic correlations.
该研究使用基于心理物理学的基准AoT-PsyPhyBENCH评估了视觉-语言模型(VLMs)的时间理解能力。研究旨在解决VLMs在辨别视频中时间方向方面被忽视的能力评估问题。评估结果显示,大多数VLMs的表现接近随机,特别是在识别不可逆物理过程和因果手动动作方面,表明它们在时间推理能力方面存在显著差距。研究结果表明,VLMs需要更强的时间连续性和因果理解的归纳偏置,以达到人类的性能水平。
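The AoT-PsyPhyBENCH entry above reports that most models perform near chance on a binary forward/backward judgment. As a minimal sketch of how one might check whether an observed accuracy is distinguishable from 50% chance, the snippet below computes accuracy and an exact one-sided binomial tail probability. The function names and the "forward"/"backward" label strings are illustrative assumptions and are not taken from the benchmark's released code.

```python
from math import comb


def accuracy(predictions, labels):
    """Fraction of clips whose predicted direction matches the true direction."""
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)


def p_value_above_chance(correct, total, chance=0.5):
    """One-sided P(X >= correct) for X ~ Binomial(total, chance).

    A small value suggests the model is doing better than guessing.
    """
    return sum(
        comb(total, k) * chance**k * (1 - chance) ** (total - k)
        for k in range(correct, total + 1)
    )


# Hypothetical predictions for four clips.
preds = ["forward", "backward", "forward", "forward"]
truth = ["forward", "forward", "forward", "backward"]
acc = accuracy(preds, truth)
```

With only a handful of clips the tail probability stays large even for moderately high accuracy, which is one reason a "near chance" claim is best read together with the number of test items.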
MV-MLM: Bridging Multi-View Mammography and Language for Breast Cancer Diagnosis and Risk Prediction
Authors: Shunjie-Fabian Zheng, Hyeonjun Lee, Thijs Kooi, Ali Diba
Venue: ICCV 2025
First: 2025-10-30T05:12:29+00:00 · Latest: 2025-10-30T05:12:29+00:00
Comments: Accepted to Computer Vision for Automated Medical Diagnosis (CVAMD)
Workshop at ICCV 2025
Abstract
Large annotated datasets are essential for training robust Computer-Aided
Diagnosis (CAD) models for breast cancer detection or risk prediction. However,
acquiring such datasets with fine-grained annotations is both costly and
time-consuming. Vision-Language Models (VLMs), such as CLIP, which are
pre-trained on large image-text pairs, offer a promising solution by enhancing
robustness and data efficiency in medical imaging tasks. This paper introduces
a novel Multi-View Mammography and Language Model for breast cancer
classification and risk prediction, trained on a dataset of paired mammogram
images and synthetic radiology reports. Our MV-MLM leverages multi-view
supervision to learn rich representations from extensive radiology data by
employing cross-modal self-supervision across image-text pairs. This includes
multiple views and the corresponding pseudo-radiology reports. We propose a
novel joint visual-textual learning strategy to enhance generalization and
accuracy performance over different data types and tasks to distinguish breast
tissues or cancer characteristics (calcification, mass) and utilize these
patterns to understand mammography images and predict cancer risk. We evaluated
our method on both private and publicly available datasets, demonstrating that
the proposed model achieves state-of-the-art performance in three
classification tasks: (1) malignancy classification, (2) subtype
classification, and (3) image-based cancer risk prediction. Furthermore, the
model exhibits strong data efficiency, outperforming existing fully supervised
or VLM baselines while trained on synthetic text reports and without the need
for actual radiology reports.
中文标题/摘要
标题:MV-MLM:连接多视角乳腺X线摄影与语言以实现乳腺癌诊断与风险预测
大规模标注数据集对于训练用于乳腺癌检测或风险预测的稳健计算机辅助诊断(CAD)模型至关重要。然而,获取具有精细详细标注的数据集既昂贵又耗时。视觉-语言模型(VLMs),如CLIP,通过在大规模图像-文本对上进行预训练,提供了增强医疗成像任务中鲁棒性和数据效率的有希望的解决方案。本文介绍了一种新的多视角乳腺X线摄影和语言模型,用于乳腺癌分类和风险预测,该模型基于配对的乳腺X线摄影图像和合成放射学报告数据集进行训练。我们的MV-MLM利用多视角监督,通过跨模态自监督从广泛的放射学数据中学习丰富的表示。这包括多个视角及其相应的伪放射学报告。我们提出了一种新颖的联合视觉-文本学习策略,以增强在不同数据类型和任务上的泛化能力和准确性表现,以区分乳腺组织或癌症特征(钙化、肿块),并利用这些模式来理解乳腺X线摄影图像和预测癌症风险。我们在私人和公开可用的数据集上评估了该方法,证明了所提出模型在三个分类任务中的最佳性能:(1) 恶性分类,(2) 亚型分类,(3) 图像基癌症风险预测。此外,该模型表现出强大的数据效率,在使用合成文本报告进行训练且无需实际放射学报告的情况下,优于现有的完全监督或VLM基线。
Summary / 总结
This paper introduces MV-MLM, a novel model that combines multi-view mammography and language for breast cancer diagnosis and risk prediction. It leverages Vision-Language Models (VLMs) and multi-view supervision to enhance robustness and data efficiency. The model outperforms existing methods in three classification tasks: malignancy classification, subtype classification, and image-based cancer risk prediction, demonstrating strong data efficiency and generalization.
该论文提出了一种名为MV-MLM的新模型,结合了多视角乳腺X线摄影和语言技术,用于乳腺癌诊断和风险预测。该模型利用Vision-Language模型(VLM)和多视角监督来增强鲁棒性和数据效率。该模型在恶性肿瘤分类、亚型分类和基于图像的癌症风险预测三个分类任务中表现出色,显示出强大的数据效率和泛化能力。
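The MV-MLM entry above mentions cross-modal self-supervision across image-text pairs and names CLIP as an example of this model family. The following is a minimal sketch of a CLIP-style symmetric contrastive loss in plain Python, offered as a stand-in rather than the paper's actual objective. The `fuse_views` helper (averaging per-view embeddings) and the temperature value are assumptions made for illustration only.

```python
import math


def normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_entropy_row(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: row k of images is the positive for row k of texts."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    logits = [[dot(i, t) / temperature for t in txts] for i in imgs]
    i2t = sum(cross_entropy_row(logits[k], k) for k in range(n)) / n
    t2i = sum(
        cross_entropy_row([logits[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (i2t + t2i)


def fuse_views(view_embs):
    """Hypothetical multi-view fusion: element-wise mean of per-view embeddings."""
    return [sum(vals) / len(vals) for vals in zip(*view_embs)]


matched = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is low when each image embedding is closest to its own report embedding and high when the pairing is swapped, which is the behavior a contrastive pre-training objective relies on.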
GUI Knowledge Bench: Revealing the Knowledge Gap Behind VLM Failures in GUI Tasks
Authors: Chenrui Shi, Zedong Yu, Zhi Gao, Ruining Feng, Enqi Liu, Yuwei Wu, Yunde Jia, Liuyu Xiang, Zhaofeng He, Qing Li
First: 2025-10-30T03:22:30+00:00 · Latest: 2025-10-30T03:22:30+00:00
Abstract
Large vision language models (VLMs) have advanced graphical user interface
(GUI) task automation but still lag behind humans. We hypothesize this gap
stems from missing core GUI knowledge, which existing training schemes (such as
supervised fine tuning and reinforcement learning) alone cannot fully address.
By analyzing common failure patterns in GUI task execution, we distill GUI
knowledge into three dimensions: (1) interface perception, knowledge about
recognizing widgets and system states; (2) interaction prediction, knowledge
about reasoning action state transitions; and (3) instruction understanding,
knowledge about planning, verifying, and assessing task completion progress. We
further introduce GUI Knowledge Bench, a benchmark with multiple choice and
yes/no questions across six platforms (Web, Android, macOS, Windows, Linux,
iOS) and 292 applications. Our evaluation shows that current VLMs identify
widget functions but struggle with perceiving system states, predicting
actions, and verifying task completion. Experiments on real world GUI tasks
further validate the close link between GUI knowledge and task success. By
providing a structured framework for assessing GUI knowledge, our work supports
the selection of VLMs with greater potential prior to downstream training and
provides insights for building more capable GUI agents.
中文标题/摘要
标题:GUI知识基准:揭示GUI任务中VLM失败背后的知识差距
大型视觉语言模型(VLMs)在图形用户界面(GUI)任务自动化方面取得了进展,但仍落后于人类。我们假设这种差距源于缺失的核心GUI知识,而现有的训练方案(如监督微调和强化学习)无法完全解决这一问题。通过分析GUI任务执行中的常见失败模式,我们将GUI知识提炼为三个维度:(1)界面感知,关于识别控件和系统状态的知识;(2)交互预测,关于推理动作状态转换的知识;(3)指令理解,关于规划、验证和评估任务完成进度的知识。我们进一步引入了GUI知识基准,这是一个包含跨六个平台(Web、Android、MacOS、Windows、Linux、iOS)和292个应用程序的多项选择和是/非问题的基准。我们的评估显示,当前的VLMs能够识别控件功能,但在感知系统状态、预测动作和验证任务完成方面存在困难。在真实世界GUI任务上的实验进一步验证了GUI知识与任务成功之间的密切联系。通过提供一个结构化的框架来评估GUI知识,我们的工作支持在下游训练前选择具有更大潜力的VLMs,并为构建更强大的GUI代理提供了见解。
Summary / 总结
The research aims to identify the knowledge gap in large vision language models (VLMs) for GUI task automation, hypothesizing that this gap arises from insufficient core GUI knowledge. The study analyzes common failure patterns and categorizes GUI knowledge into three dimensions: interface perception, interaction prediction, and instruction understanding. The GUI Knowledge Bench, a benchmark with questions across six platforms and 292 applications, reveals that current VLMs can identify widget functions but struggle with system state perception, action prediction, and task verification. These findings highlight the need for better GUI knowledge in VLMs for improved task success.
研究旨在识别导致大型视觉语言模型(VLMs)在GUI任务中失败的知识缺口,假设这一缺口源于核心GUI知识的缺失。研究引入了一个名为GUI Knowledge Bench的基准,评估VLMs在界面感知、交互预测和指令理解三个维度上的表现。关键发现表明,VLMs能够识别控件,但在感知系统状态、预测动作和验证任务完成方面存在困难。这项工作支持在下游训练前选择具有更大潜力的VLMs,并为构建更强大的GUI代理提供了见解。
Empowering Agentic Video Analytics Systems with Video Language Models
Authors: Yuxuan Yan, Shiqi Jiang, Ting Cao, Yifan Yang, Qianqian Yang, Yuanchao Shu, Yuqing Yang, Lili Qiu
First: 2025-05-01T02:40:23+00:00 · Latest: 2025-10-30T03:12:42+00:00
Comments: Accepted to NSDI 2026, 19 pages, 12 figures, complementary evaluations
and appendix
Abstract
AI-driven video analytics has become increasingly important across diverse
domains. However, existing systems are often constrained to specific,
predefined tasks, limiting their adaptability in open-ended analytical
scenarios. The recent emergence of Vision Language Models (VLMs) as
transformative technologies offers significant potential for enabling
open-ended video understanding, reasoning, and analytics. Nevertheless, their
limited context windows present challenges when processing ultra-long video
content, which is prevalent in real-world applications. To address this, we
introduce AVA, a VLM-powered system designed for open-ended, advanced video
analytics. AVA incorporates two key innovations: (1) the near real-time
construction of Event Knowledge Graphs (EKGs) for efficient indexing of long or
continuous video streams, and (2) an agentic retrieval-generation mechanism
that leverages EKGs to handle complex and diverse queries. Comprehensive
evaluations on public benchmarks, LVBench and VideoMME-Long, demonstrate that
AVA achieves state-of-the-art performance, attaining 62.3% and 64.1% accuracy,
respectively, significantly surpassing existing VLM and video
Retrieval-Augmented Generation (RAG) systems. Furthermore, to evaluate video
analytics in ultra-long and open-world video scenarios, we introduce a new
benchmark, AVA-100. This benchmark comprises 8 videos, each exceeding 10 hours
in duration, along with 120 manually annotated, diverse, and complex
question-answer pairs. On AVA-100, AVA achieves top-tier performance with an
accuracy of 75.8%. The source code of AVA is available at
https://github.com/I-ESC/Project-Ava. The AVA-100 benchmark can be accessed at
https://huggingface.co/datasets/iesc/Ava-100.
中文标题/摘要
标题:利用视频语言模型赋能代理型视频分析系统
AI驱动的视频分析在多个领域变得越来越重要。然而,现有的系统通常局限于特定的、预定义的任务,限制了它们在开放性分析场景中的适应性。最近,视觉语言模型(VLMs)的出现为实现开放性视频理解、推理和分析提供了巨大潜力。然而,它们有限的上下文窗口在处理超长视频内容时带来了挑战,而这种内容在实际应用中非常普遍。为了解决这个问题,我们提出了AVA,这是一种基于VLM的系统,旨在实现开放性、高级的视频分析。AVA包含两项关键创新:(1)近实时构建事件知识图谱(EKGs)以高效索引长或连续视频流,(2)一种代理检索生成机制,利用EKGs处理复杂和多样的查询。在公共基准LVBench和VideoMME-Long上的全面评估表明,AVA达到了最先进的性能,分别取得了62.3%和64.1%的准确率,显著超过了现有的VLM和视频检索增强生成(RAG)系统。此外,为了评估超长和开放世界视频场景中的视频分析,我们引入了一个新的基准AVA-100。该基准包括8个超过10小时的视频,以及120个手动标注的、多样且复杂的问答对。在AVA-100上,AVA取得了顶级性能,准确率为75.8%。AVA的源代码可在https://github.com/I-ESC/Project-Ava获取。AVA-100基准数据集可在https://huggingface.co/datasets/iesc/Ava-100获取。
Summary / 总结
The research aims to enhance the adaptability of AI-driven video analytics systems by leveraging Vision Language Models (VLMs). AVA, a VLM-powered system, introduces Event Knowledge Graphs (EKGs) for efficient indexing and an agentic retrieval-generation mechanism to handle complex queries. AVA outperforms existing systems on public benchmarks, achieving 62.3% and 64.1% accuracy on LVBench and VideoMME-Long, respectively. Additionally, AVA demonstrates robust performance on the newly introduced AVA-100 benchmark, achieving 75.8% accuracy on ultra-long videos.
论文介绍了AVA,这是一种基于VLM的开放型视频分析系统,解决了现有系统在处理多样和复杂查询时的局限性。AVA利用事件知识图谱进行高效索引,并采用代理检索生成机制。在公共基准测试和新引入的AVA-100上的评估表明,AVA在LVBench和VideoMME-Long上的准确率分别达到了62.3%和64.1%,在AVA-100上的准确率为75.8%,超越了现有系统。
Unleashing Diffusion Transformers for Visual Correspondence by Modulating Massive Activations
Authors: Chaofan Gan, Yuanpeng Tu, Xi Chen, Tieyuan Chen, Yuxi Li, Mehrtash Harandi, Weiyao Lin
Venue: NeurIPS 2025
First: 2025-05-24T08:20:36+00:00 · Latest: 2025-10-30T02:59:44+00:00
Comments: NeurIPS 2025
Abstract
Pre-trained stable diffusion models (SD) have shown great advances in visual
correspondence. In this paper, we investigate the capabilities of Diffusion
Transformers (DiTs) for accurate dense correspondence. Distinct from SD, DiTs
exhibit a critical phenomenon in which very few feature activations exhibit
significantly larger values than others, known as "massive activations",
leading to uninformative representations and significant performance
degradation for DiTs. The massive activations consistently concentrate at very
few fixed dimensions across all image patch tokens, holding little local
information. We trace these dimension-concentrated massive activations and find
that such concentration can be effectively localized by the zero-initialized
Adaptive Layer Norm (AdaLN-zero). Building on these findings, we propose
Diffusion Transformer Feature (DiTF), a training-free framework designed to
extract semantic-discriminative features from DiTs. Specifically, DiTF employs
AdaLN to adaptively localize and normalize massive activations with
channel-wise modulation. In addition, we develop a channel discard strategy to
further eliminate the negative impacts from massive activations. Experimental
results demonstrate that our DiTF outperforms both DINO and SD-based models and
establishes a new state-of-the-art performance for DiTs in different visual
correspondence tasks (e.g., +9.4% on Spair-71k and +4.4% on AP-10K-C.S.).
中文标题/摘要
标题:通过调节大规模激活释放扩散变换器的视觉对应能力
预训练的稳定扩散模型(SD)在视觉对应方面取得了巨大进展。本文研究了扩散变换器(DiTs)在精确密集对应方面的能力。与SD不同,DiTs表现出一种关键现象,即极少数特征激活值显著大于其他值,称为“大规模激活”,导致DiTs的不具信息性表示和显著性能下降。大规模激活在所有图像块标记中始终集中在非常少数的固定维度上,几乎没有局部信息。我们追踪这些维度集中的大规模激活,并发现这种集中可以通过零初始化的自适应层归一化(AdaLN-zero)有效定位。基于这些发现,我们提出了一种无需训练的扩散变换器特征(DiTF)框架,旨在从DiTs中提取语义区分特征。具体而言,DiTF使用AdaLN以通道级调节来适应性定位和归一化大规模激活。此外,我们还开发了一种通道丢弃策略,以进一步消除大规模激活的负面影响。实验结果表明,我们的DiTF在不同视觉对应任务中均优于DINO和基于SD的模型,并在Spair-71k和AP-10K-C.S.上分别建立了DiTs的新最佳性能(+9.4%和+4.4%)。
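The DiTF entry above describes massive activations as a few fixed feature dimensions whose values are far larger than the rest across all patch tokens, and mentions a channel discard strategy. The sketch below illustrates only that general idea in plain Python. The detection rule (mean absolute value above `ratio` times the median channel magnitude) and the threshold are assumptions for illustration; the paper's AdaLN-based localization and modulation are not reproduced here.

```python
from statistics import median


def massive_channels(tokens, ratio=20.0):
    """Return indices of channels whose mean |value| across tokens is an outlier.

    tokens: list of per-patch feature vectors, all of the same length.
    """
    num_channels = len(tokens[0])
    mean_abs = [
        sum(abs(t[c]) for t in tokens) / len(tokens) for c in range(num_channels)
    ]
    typical = median(mean_abs)
    return [c for c, m in enumerate(mean_abs) if m > ratio * typical]


def discard_channels(tokens, channels):
    """Drop the given channel indices from every token vector."""
    drop = set(channels)
    return [[v for c, v in enumerate(t) if c not in drop] for t in tokens]


# Two hypothetical patch tokens; channel 2 dominates in both.
features = [[0.1, -0.2, 50.0, 0.3], [0.2, 0.1, 48.0, -0.1]]
outliers = massive_channels(features)
cleaned = discard_channels(features, outliers)
```

Because the outlier dimensions are shared by every token, removing them leaves the per-token differences that matter for matching patches across images.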
DSDE: Dynamic Speculative Decoding with KLD Stability for Real-World Serving
Authors: Mingyu Yang, Jae-Young Choi, Kihyo Moon, Minsung Jang, Eunjoo Jeon
First: 2025-09-01T03:13:50+00:00 · Latest: 2025-10-30T02:05:44+00:00
Comments: Accepted for presentation at the IEEE BigData 2025 Workshop (Special
Session on Intelligent Data Mining). This v2 updates formatting and adds IEEE
copyright notice
Abstract
Speculative decoding accelerates large language model inference, but its
reliance on a fixed speculation length is suboptimal in large-batch serving
environments with diverse requests. This paper explores a new direction for
dynamic adaptation by investigating a novel class of post-hoc, diagnostic
signals. We propose Dynamic Speculative Decoding Engine (DSDE), a training-free
framework built on two primary components: (1) a predictive signal based on the
variance of the Kullback-Leibler divergence (KLD), which diagnoses the
generation's regional stability, and (2) an adaptive speculation length cap to
mitigate the straggler problem in per-sequence decoding. Experiments
demonstrate the potential of using KLD-based stability signals for dynamic
adaptation. An algorithm guided by these signals achieves end-to-end latency
competitive with leading baselines and exhibits superior robustness across
diverse workloads. This robustness is particularly valuable in challenging
low-acceptance-rate regimes, where the proposed signal maintains its diagnostic
utility. Collectively, these findings validate post-hoc signals as a valuable
component for building more robust and intelligent LLM inference systems, and
highlight a promising direction for future research on dynamic speculation
length adaptation.
中文标题/摘要
标题:DSDE:基于KLD稳定性动态推测解码用于实际服务
推测解码加速了大型语言模型的推理,但在具有多样化请求的大批量服务环境中,其依赖于固定推测长度是不理想的。本文探索了一种新的动态适应方向,通过研究一种新型的后验诊断信号。我们提出了动态推测解码引擎(DSDE),这是一种无需训练的框架,主要由两个组成部分构成:(1)基于Kullback-Leibler(KLD)散度方差的预测信号,用于诊断生成的区域稳定性;(2)一种自适应推测长度上限,以缓解逐序列解码中的拖后腿问题。实验表明,使用KLD基稳定性信号进行动态适应具有潜力。由这些信号指导的算法在端到端延迟方面与领先基准相当,并且在各种工作负载下表现出更优的鲁棒性。这种鲁棒性在低接受率的挑战性环境中尤为重要,所提出的信号在此类环境中仍保持其诊断作用。这些发现验证了后验信号作为构建更鲁棒和智能的LLM推理系统的重要组成部分的价值,并强调了未来研究动态推测长度适应的有希望的方向。
Summary / 总结
The paper addresses the limitations of fixed speculation length in speculative decoding for large language models, proposing DSDE, a training-free framework using KLD variance as a diagnostic signal for dynamic speculation length adaptation. Experiments show that DSDE achieves competitive end-to-end latency and superior robustness across diverse workloads, especially in low-acceptance-rate regimes, validating the utility of post-hoc signals in LLM inference systems.
论文针对大型语言模型中固定推测长度的局限性,提出了一种基于KLD方差作为诊断信号的无需训练的框架DSDE,用于动态推测长度适应。实验表明,DSDE在端到端延迟上具有竞争力,并且在各种工作负载下表现出更优的鲁棒性,特别是在低接受率的环境中,验证了后验信号在LLM推理系统中的实用性。
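The DSDE entry above names two components: a stability signal based on the variance of the KL divergence, and an adaptive cap on speculation length. A minimal sketch of how such a signal could drive the speculation length is given below. It assumes the KLD is measured between draft-model and target-model token distributions at each position, which the abstract does not spell out; the mapping from variance to length, the window size, and all parameter names are hypothetical.

```python
import math
from statistics import pvariance


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def speculation_length(kld_history, min_len=1, max_len=8, scale=1.0, window=4):
    """Hypothetical rule: low variance of recent KLD values gives longer drafts."""
    recent = kld_history[-window:]
    if len(recent) < 2:
        return min_len  # not enough history to estimate stability
    var = pvariance(recent)
    # Variance 0 maps to max_len; large variance approaches min_len.
    length = min_len + (max_len - min_len) / (1.0 + var / scale)
    return int(length)


def apply_batch_cap(lengths, cap):
    """Clamp per-sequence lengths so one long draft does not stall the batch."""
    return [min(n, cap) for n in lengths]


stable = speculation_length([0.1, 0.1, 0.1, 0.1])
unstable = speculation_length([0.0, 5.0, 0.0, 5.0])
capped = apply_batch_cap([8, 3, 5], cap=4)
```

The cap reflects the straggler concern mentioned in the abstract: in batched serving, a step takes as long as its longest sequence, so bounding the per-sequence length trades some speculation for more even step times.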
ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models
Authors: Liyan Tang, Grace Kim, Xinyu Zhao, Thom Lake, Wenxuan Ding, Fangcong Yin, Prasann Singhal, Manya Wadhwa, Zeyu Leo Liu, Zayne Sprague, Ramya Namuduri, Bodun Hu, Juan Diego Rodriguez, Puyuan Peng, Greg Durrett
Venue: NeurIPS 2025
First: 2025-05-19T17:59:27+00:00 · Latest: 2025-10-30T01:42:07+00:00
Comments: NeurIPS 2025 Datasets & Benchmarks
Abstract
Chart understanding presents a unique challenge for large vision-language
models (LVLMs), as it requires the integration of sophisticated textual and
visual reasoning capabilities. However, current LVLMs exhibit a notable
imbalance between these skills, falling short on visual reasoning that is
difficult to perform in text. We conduct a case study using a synthetic dataset
solvable only through visual reasoning and show that model performance degrades
significantly with increasing visual complexity, while human performance
remains robust. We then introduce ChartMuseum, a new Chart Question Answering
(QA) benchmark containing 1,162 expert-annotated questions spanning multiple
reasoning types, curated from real-world charts across 184 sources,
specifically built to evaluate complex visual and textual reasoning. Unlike
prior chart understanding benchmarks -- where frontier models perform similarly
and near saturation -- our benchmark exposes a substantial gap between model
and human performance, while effectively differentiating model capabilities:
although humans achieve 93% accuracy, the best-performing model Gemini-2.5-Pro
attains only 63.0%, and the leading open-source LVLM Qwen2.5-VL-72B-Instruct
achieves only 38.5%. Moreover, on questions requiring primarily visual
reasoning, all models experience a 35%-55% performance drop relative to
their performance on text-reasoning-heavy questions. Lastly, our qualitative error
analysis reveals specific categories of visual reasoning that are challenging
for current LVLMs.
中文标题/摘要
标题:ChartMuseum:测试大型视觉语言模型的视觉推理能力
图表理解对大型视觉语言模型(LVLMs)提出了独特挑战,因为它需要结合复杂的文本和视觉推理能力。然而,当前的LVLMs在这两方面技能之间表现出明显的不平衡,特别是在难以在文本中执行的视觉推理方面表现不佳。我们使用一个仅通过视觉推理才能解决的合成数据集进行了案例研究,结果显示,随着视觉复杂性的增加,模型的性能显著下降,而人类的表现则保持稳定。然后,我们引入了ChartMuseum,这是一个包含1,162个专家标注问题的新图表问答基准,涵盖了多种推理类型,从184个来源的真实世界图表中精选而来,专门用于评估复杂的视觉和文本推理能力。与之前的图表理解基准不同,这些基准中前沿模型的表现相似且接近饱和,而我们的基准则揭示了模型与人类表现之间存在的显著差距,同时有效地区分了模型的能力:尽管人类的准确率为93%,但表现最好的模型Gemini-2.5-Pro仅达到63.0%,而领先的开源LVLM Qwen2.5-VL-72B-Instruct仅达到38.5%。此外,在主要需要视觉推理的问题上,所有模型的性能从主要依赖文本推理的问题性能中下降了35%-55%。最后,我们的定性错误分析揭示了当前LVLMs在某些视觉推理类别中面临的挑战。
Summary / 总结
The research aims to evaluate the visual reasoning capabilities of large vision-language models (LVLMs) by introducing a new benchmark, ChartMuseum, which contains 1,162 expert-annotated questions from real-world charts. The study shows that LVLMs perform poorly on complex visual reasoning tasks, with the best model achieving only 63.0% accuracy, while humans achieve 93%. The benchmark effectively highlights the gap between model and human performance, especially in tasks requiring primarily visual reasoning, where models experience a significant performance drop.
研究旨在通过引入ChartMuseum这一新基准来评估大型视觉-语言模型(LVLMs)的视觉推理能力。研究使用合成数据集和真实世界数据集来评估LVLMs的表现,结果显示,随着视觉复杂性的增加,模型性能显著下降,而人类表现则保持稳定。该基准包含来自184个来源的1,162个专家标注的问题,揭示了模型和人类之间的显著差距,人类的准确率为93%,而表现最好的模型Gemini-2.5-Pro仅达到63.0%。此外,当问题主要依赖视觉推理时,模型的表现会下降35%-55%,突显了当前LVLMs在特定视觉推理类别上的挑战。
Dynamic VLM-Guided Negative Prompting for Diffusion Models
Authors: Hoyeon Chang, Seungjin Kim, Yoonseok Choi
Venue: NeurIPS 2025
First: 2025-10-30T01:10:25+00:00 · Latest: 2025-10-30T01:10:25+00:00
Comments: 39th Conference on Neural Information Processing Systems (NeurIPS
2025) Workshop: The First Workshop on Generative and Protective AI for
Content Creation
Abstract
We propose a novel approach for dynamic negative prompting in diffusion
models that leverages Vision-Language Models (VLMs) to adaptively generate
negative prompts during the denoising process. Unlike traditional Negative
Prompting methods that use fixed negative prompts, our method generates
intermediate image predictions at specific denoising steps and queries a VLM to
produce contextually appropriate negative prompts. We evaluate our approach on
various benchmark datasets and demonstrate the trade-offs between negative
guidance strength and text-image alignment.
中文标题/摘要
标题:动态VLM引导的负提示在扩散模型中的应用
我们提出了一种新的扩散模型中动态负提示的方法,利用视觉语言模型(VLM)在去噪过程中自适应地生成负提示。与传统的使用固定负提示的方法不同,我们的方法在特定的去噪步骤中生成中间图像预测,并查询VLM生成上下文相关的负提示。我们在各种基准数据集上评估了该方法,并展示了负引导强度与文本图像对齐之间的权衡。
Summary / 总结
The paper introduces a new method for dynamic negative prompting in diffusion models using Vision-Language Models (VLMs) to generate contextually appropriate negative prompts during the denoising process. Unlike fixed negative prompts, this approach generates intermediate image predictions and queries a VLM to produce negative prompts, showing trade-offs between negative guidance strength and text-image alignment in experiments on various benchmark datasets.
该论文提出了一种使用视觉语言模型(VLM)在去噪过程中动态生成上下文相关负提示的新方法。不同于固定负提示,该方法生成中间图像预测并查询VLM生成负提示,实验表明在各种基准数据集上负指导强度与文本图像对齐之间的权衡。
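The entry above describes generating intermediate image predictions at specific denoising steps and querying a VLM for a context-appropriate negative prompt. The control flow can be sketched as a loop that takes the model-specific pieces as callables. Everything here is a hypothetical interface: `denoise_step`, `preview`, and `ask_vlm` stand in for a real diffusion sampler and VLM and do not correspond to any particular library's API.

```python
def run_dynamic_negative_prompting(
    latent, num_steps, query_steps, denoise_step, preview, ask_vlm, negative_prompt=""
):
    """Denoise for num_steps, refreshing the negative prompt at query_steps.

    denoise_step(latent, step, negative_prompt) -> new latent
    preview(latent, step) -> intermediate image prediction
    ask_vlm(image) -> negative prompt string
    """
    history = []
    for step in range(num_steps):
        if step in query_steps:
            # Ask the VLM what to steer away from, given the current preview.
            negative_prompt = ask_vlm(preview(latent, step))
        latent = denoise_step(latent, step, negative_prompt)
        history.append((step, negative_prompt))
    return latent, history


# Stub components so the loop can be exercised without any model.
final, log = run_dynamic_negative_prompting(
    latent=0,
    num_steps=4,
    query_steps={1, 3},
    denoise_step=lambda lat, step, neg: lat + 1,
    preview=lambda lat, step: f"img@{step}",
    ask_vlm=lambda img: f"avoid artifacts seen in {img}",
)
```

Between query steps the most recent negative prompt is reused, so the number of VLM calls is controlled by the size of `query_steps` rather than by the number of denoising steps.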
Reasoning Visual Language Model for Chest X-Ray Analysis
Authors: Andriy Myronenko, Dong Yang, Baris Turkbey, Mariam Aboian, Sena Azamat, Esra Akcicek, Hongxu Yin, Pavlo Molchanov, Marc Edgar, Yufan He, Pengfei Guo, Yucheng Tang, Daguang Xu
First: 2025-10-28T00:48:00+00:00 · Latest: 2025-10-30T00:14:35+00:00
Comments: NV-Reason-CXR-3B
Abstract
Vision-language models (VLMs) have shown strong promise for medical image
analysis, but most remain opaque, offering predictions without the transparent,
stepwise reasoning clinicians rely on. We present a framework that brings
chain-of-thought (CoT) reasoning to chest X-ray interpretation. Inspired by
reasoning-first training paradigms, our approach is designed to learn how
experts reason, not just what they conclude, by aligning intermediate steps
with observable image evidence and radiology workflow. Beyond accuracy, the
explicit reasoning traces support clinical auditability: they reveal why a
conclusion was reached, which alternatives were considered, and where
uncertainty remains, enabling quality assurance, error analysis, and safer
human-AI collaboration.
Our model couples high-fidelity visual encoding with a two-stage training
recipe: a reasoning-style supervised fine-tuning (SFT) followed by
reinforcement learning (RL) that uses verifiable rewards over a list of X-ray
abnormalities. The model outputs reasoning that mirrors radiologists' systematic
thought process, uncertainty, and differential diagnosis. In
out-of-distribution evaluation, the approach achieves competitive multi-label
classification while improving interpretability. In a reader study with expert
radiologists, full reasoning traces increased confidence, supported error
auditing, and reduced time to finalize reports. We release code and the model
NV-Reason-CXR-3B to support community progress toward trustworthy, explainable
AI in chest radiography and other medical imaging tasks where reasoning quality
is as critical as prediction quality.
中文标题/摘要
标题:胸部X光分析的推理视觉语言模型
视觉-语言模型(VLMs)在医学图像分析方面显示出强大的潜力,但大多数模型仍然不透明,无法提供临床医生依赖的透明、逐步的推理过程。我们提出了一种框架,将链式思考(CoT)推理引入胸部X光解释。受推理优先训练范式的启发,我们的方法旨在学习专家如何推理,而不仅仅是他们得出的结论,通过将中间步骤与可观察的图像证据和放射学工作流程对齐。除了准确性之外,明确的推理轨迹支持临床审计:它们揭示了结论是如何得出的,考虑了哪些替代方案,以及不确定性在哪里,从而促进质量保证、错误分析和更安全的人工智能协作。
我们的模型结合了高保真视觉编码,并采用两阶段训练配方:一种推理风格的监督微调(SFT)后,通过使用可验证奖励的强化学习(RL)来处理X光异常列表。模型输出的推理过程与放射科医生系统的思维过程、不确定性以及鉴别诊断相呼应。在分布外评估中,该方法在多标签分类方面表现出竞争力,同时提高了可解释性。在专家放射科医生的读者研究中,完整的推理轨迹增加了信心,支持了错误审计,并减少了最终报告所需的时间。我们发布了代码和模型NV-Reason-CXR-3B,以支持社区在胸部放射学和其他医学成像任务中对可信、可解释的人工智能的研究。
Summary / 总结
This paper introduces a visual language model for chest X-ray analysis that incorporates chain-of-thought reasoning to enhance clinical auditability. The model is trained using a two-stage process: supervised fine-tuning followed by reinforcement learning, which aligns intermediate reasoning steps with observable image evidence. The approach improves interpretability and supports error auditing, leading to increased confidence and reduced report time among radiologists. The model outputs reasoning that mirrors radiologists' thought processes and differential diagnoses, achieving competitive multi-label classification performance.
该论文提出了一种结合链式推理的胸部X光分析视觉语言模型,以增强临床可审计性。模型采用两阶段训练:监督微调后跟强化学习,使中间推理步骤与可观察的图像证据对齐。该方法提高了可解释性并支持错误审计,使放射科医生的信心增加并减少了报告时间。模型输出的推理过程类似于放射科医生的思考过程和鉴别诊断,实现了具有竞争力的多标签分类性能。
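The entry above mentions reinforcement learning with verifiable rewards over a list of X-ray abnormalities. The abstract does not define the reward, so the following is only one plausible sketch: an F1-style overlap between the set of abnormalities the model names and a reference set. The function name and the choice of F1 are assumptions, not the authors' formulation.

```python
def abnormality_reward(predicted, reference):
    """F1 overlap between predicted and reference abnormality labels, in [0, 1]."""
    pred, ref = set(predicted), set(reference)
    if not pred and not ref:
        return 1.0  # correctly reporting no abnormality
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


# One correct finding plus one spurious finding.
reward = abnormality_reward(["cardiomegaly", "edema"], ["cardiomegaly"])
```

A reward of this kind is verifiable in the sense that it can be computed from labels alone, without a learned judge, which is what makes it usable as an RL signal for multi-label findings.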
CAVE: Detecting and Explaining Commonsense Anomalies in Visual Environments
Authors: Rishika Bhagwatkar, Syrielle Montariol, Angelika Romanou, Beatriz Borges, Irina Rish, Antoine Bosselut
Venue: 2025 Conference on Empirical Methods in Natural Language
Processing
First: 2025-10-29T22:34:26+00:00 · Latest: 2025-10-29T22:34:26+00:00
Abstract
Humans can naturally identify, reason about, and explain anomalies in their
environment. In computer vision, this long-standing challenge remains limited
to industrial defects or unrealistic, synthetically generated anomalies,
failing to capture the richness and unpredictability of real-world anomalies.
In this work, we introduce CAVE, the first benchmark of real-world visual
anomalies. CAVE supports three open-ended tasks: anomaly description,
explanation, and justification, with fine-grained annotations for visual
grounding and categorizing anomalies based on their visual manifestations,
their complexity, severity, and commonness. These annotations draw inspiration
from cognitive science research on how humans identify and resolve anomalies,
providing a comprehensive framework for evaluating Vision-Language Models
(VLMs) in detecting and understanding anomalies. We show that state-of-the-art
VLMs struggle with visual anomaly perception and commonsense reasoning, even
with advanced prompting strategies. By offering a realistic and cognitively
grounded benchmark, CAVE serves as a valuable resource for advancing research
in anomaly detection and commonsense reasoning in VLMs.
中文标题/摘要
标题:CAVE:检测和解释视觉环境中的常识异常
人类可以自然地识别、推理和解释环境中的异常。在计算机视觉领域,这一长期挑战仍然局限于工业缺陷或不现实、合成生成的异常,未能捕捉到现实世界异常的丰富性和不可预测性。在本项工作中,我们引入了CAVE,这是首个现实世界视觉异常基准。CAVE 支持三个开放任务:异常描述、解释和论证;并提供了细粒度的视觉定位注释和基于异常视觉表现、复杂性、严重性和普遍性的分类注释。这些注释借鉴了认知科学中关于人类识别和解决异常的研究,为评估视觉语言模型(VLMs)在检测和理解异常方面的表现提供了全面框架。我们展示了最先进的VLMs在视觉异常感知和常识推理方面存在困难,即使使用了高级提示策略。通过提供一个现实且认知基础的基准,CAVE 成为推动异常检测和常识推理研究的重要资源。
Summary / 总结
The paper introduces CAVE, a benchmark for real-world visual anomalies, addressing the limitation of current benchmarks which focus on industrial defects or synthetic anomalies. CAVE includes tasks for anomaly description, explanation, and justification, with detailed annotations for visual grounding and categorization. The study demonstrates that state-of-the-art Vision-Language Models struggle with visual anomaly perception and commonsense reasoning, highlighting the need for better models in this area. By providing a realistic and cognitively grounded benchmark, CAVE aims to advance research in anomaly detection and commonsense reasoning in VLMs.
该论文介绍了CAVE,一个针对真实世界视觉异常的基准,解决了当前基准主要关注工业缺陷或合成异常的问题。CAVE包括异常描述、解释和论证的任务,附有详细的视觉定位和分类注释。研究发现,最先进的视觉语言模型在视觉异常感知和常识推理方面仍然存在困难,即使使用了高级提示策略。该基准提供了一个现实且认知基础的资源,用于推进异常检测和常识推理在视觉语言模型中的研究。
GenIR: Generative Visual Feedback for Mental Image Retrieval
Authors: Diji Yang, Minghao Liu, Chung-Hsiang Lo, Yi Zhang, James Davis
Venue: NeurIPS 2025
First: 2025-06-06T16:28:03+00:00 · Latest: 2025-10-29T22:25:02+00:00
Comments: NeurIPS 2025
Abstract
Vision-language models (VLMs) have shown strong performance on text-to-image
retrieval benchmarks. However, bridging this success to real-world applications
remains a challenge. In practice, human search behavior is rarely a one-shot
action. Instead, it is often a multi-round process guided by clues in mind,
that is, a mental image ranging from vague recollections to vivid mental
representations of the target image. Motivated by this gap, we study the task
of Mental Image Retrieval (MIR), which targets the realistic yet underexplored
setting where users refine their search for a mentally envisioned image through
multi-round interactions with an image search engine. Central to successful
interactive retrieval is the capability of machines to provide users with
clear, actionable feedback; however, existing methods rely on indirect or
abstract verbal feedback, which can be ambiguous, misleading, or ineffective
for users to refine the query. To overcome this, we propose GenIR, a generative
multi-round retrieval paradigm leveraging diffusion-based image generation to
explicitly reify the AI system's understanding at each round. These synthetic
visual representations provide clear, interpretable feedback, enabling users to
refine their queries intuitively and effectively. We further introduce a fully
automated pipeline to generate a high-quality multi-round MIR dataset.
Experimental results demonstrate that GenIR significantly outperforms existing
interactive methods in the MIR scenario. This work establishes a new task with
a dataset and an effective generative retrieval method, providing a foundation
for future research in this direction.
中文标题/摘要
标题:GenIR:生成式视觉反馈的思维图像检索
视觉语言模型(VLMs)在文本到图像检索基准测试中表现出色。然而,将这些成功应用到实际应用中仍然是一个挑战。实际上,人类的搜索行为通常不是一次性的,而是一个由脑海中线索引导的多轮过程。也就是说,从模糊的记忆到对目标图像的生动心理表征。受此差距的启发,我们研究了思维图像检索(MIR)任务,该任务旨在通过与图像搜索引擎的多轮交互来细化用户对心中想象的图像的搜索,这在现实但尚未充分探索的环境中具有重要意义。成功的交互检索的核心能力是机器能够为用户提供清晰、可操作的反馈;然而,现有方法依赖于间接或抽象的口头反馈,这可能会使用户难以细化查询。为了解决这个问题,我们提出了GenIR,这是一种利用基于扩散的图像生成技术的生成式多轮检索范式,在每一轮中明确地体现AI系统的理解。这些合成的视觉表示提供了清晰、可解释的反馈,使用户能够直观有效地细化查询。我们还引入了一个完全自动化的流水线来生成高质量的多轮MIR数据集。实验结果表明,GenIR在MIR场景中显著优于现有的交互式方法。这项工作建立了一个新的任务,包括一个数据集和一个有效的生成检索方法,为该领域的未来研究奠定了基础
Summary / 总结
The paper addresses the challenge of mental image retrieval (MIR) by proposing GenIR, a generative multi-round retrieval paradigm. It leverages diffusion-based image generation to provide clear, actionable feedback to users, enabling them to refine their queries effectively. Experiments show that GenIR outperforms existing interactive methods in the MIR scenario.
论文提出了GenIR,一种生成性的多轮检索范式,旨在解决心理图像检索(MIR)的问题。受交互式搜索中需要清晰、可操作反馈的驱动,GenIR 使用基于扩散的图像生成在每一轮提供明确的视觉反馈。实验结果表明,GenIR 在 MIR 场景中优于现有方法,展示了其在多轮交互中引导用户查询的有效性。
MoralCLIP: Contrastive Alignment of Vision-and-Language Representations with Moral Foundations Theory
Authors: Ana Carolina Condez, Diogo Tavares, João Magalhães
Venue: ACM MM
First: 2025-06-06T02:52:13+00:00 · Latest: 2025-10-29T21:34:31+00:00
Comments: Updated version: corresponds to the ACM MM '25 published paper and
includes full appendix material
Abstract
Recent advances in vision-language models have enabled rich semantic
understanding across modalities. However, these encoding methods lack the
ability to interpret or reason about the moral dimensions of content, a crucial
aspect of human cognition. In this paper, we address this gap by introducing
MoralCLIP, a novel embedding representation method that extends multimodal
learning with explicit moral grounding based on Moral Foundations Theory (MFT).
Our approach integrates visual and textual moral cues into a unified embedding
space, enabling cross-modal moral alignment. MoralCLIP is grounded on the
multi-label dataset Social-Moral Image Database to identify co-occurring moral
foundations in visual content. For MoralCLIP training, we design a moral data
augmentation strategy to scale our annotated dataset to 15,000 image-text pairs
labeled with MFT-aligned dimensions. Our results demonstrate that explicit
moral supervision improves both unimodal and multimodal understanding of moral
content, establishing a foundation for morally-aware AI systems capable of
recognizing and aligning with human moral values.
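The abstract does not spell out MoralCLIP's training objective. As background only, the sketch below shows the standard CLIP-style symmetric contrastive (InfoNCE) loss that a "unified embedding space" for image-text pairs is commonly trained with. The function name `clip_contrastive_loss`, the temperature value, and the toy embeddings are assumptions for illustration; any moral-label weighting the authors apply is not modeled here.

```python
import math

def _normalize(vec):
    # Scale a vector to unit length (leave all-zero vectors unchanged).
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def _log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss; pair i is (image_embs[i], text_embs[i])."""
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[r][c] = cosine similarity of image r and text c, divided by temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text direction: each row should peak on its diagonal entry.
    i2t = -sum(_log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: same computation over the columns.
    cols = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(_log_softmax(cols[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)

# Toy batch: matched pairs give a much lower loss than swapped pairs.
aligned = clip_contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
swapped = clip_contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
```

The loss is low when each image is most similar to its own caption and high when the pairing is swapped, which is the property any cross-modal alignment objective of this family relies on.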
中文标题/摘要
标题:MoralCLIP:基于道德基础理论的视觉-语言表示对比对齐
近期视觉-语言模型的发展使跨模态的丰富语义理解成为可能。然而,这些编码方法缺乏解释或推理内容道德维度的能力——这是人类认知的一个关键方面。本文通过引入MoralCLIP,一种基于道德基础理论(MFT)的新型多模态嵌入表示方法,来解决这一问题。我们的方法将视觉和文本道德线索整合到一个统一的嵌入空间中,实现跨模态的道德对齐。MoralCLIP基于多标签数据集Social-Moral Image Database,以识别视觉内容中共同出现的道德基础。为了训练MoralCLIP,我们设计了一种道德数据增强策略,将标注数据集扩展到15,000个带有MFT对齐维度的图像-文本对。我们的结果表明,显式的道德监督可以提高单模态和多模态对道德内容的理解,为具备识别和与人类道德价值观对齐的道德意识AI系统奠定了基础。
Summary / 总结
MoralCLIP is a novel method that extends multimodal learning with explicit moral grounding based on Moral Foundations Theory. It integrates visual and textual moral cues into a unified embedding space, using a moral data augmentation strategy to train on 15,000 image-text pairs. The results show that explicit moral supervision enhances both unimodal and multimodal understanding of moral content, paving the way for morally-aware AI systems.
MoralCLIP 是一种基于道德基础理论的方法,将显式的道德指导引入多模态学习。它将视觉和文本的道德线索整合到一个统一的嵌入空间中,并通过道德数据增强策略训练了15,000个带有MFT对齐维度的图像-文本对。结果表明,显式的道德监督可以增强单模态和多模态对道德内容的理解,为具备道德意识的AI系统奠定了基础。
CAUSAL3D: A Comprehensive Benchmark for Causal Learning from Visual Data
Authors: Disheng Liu, Yiran Qiao, Wuche Liu, Yiren Lu, Yunlai Zhou, Tuo Liang, Yu Yin, Jing Ma
First: 2025-03-06T03:40:01+00:00 · Latest: 2025-10-29T20:44:13+00:00
Comments: Datasets link:
https://huggingface.co/datasets/LLDDSS/Causal3D_Dataset
Abstract
True intelligence hinges on the ability to uncover and leverage hidden causal
relations. Despite significant progress in AI and computer vision (CV), there
remains a lack of benchmarks for assessing models' abilities to infer latent
causality from complex visual data. In this paper, we introduce
Causal3D, a novel and comprehensive benchmark that integrates
structured data (tables) with corresponding visual representations (images) to
evaluate causal reasoning. Designed within a systematic framework, Causal3D
comprises 19 3D-scene datasets capturing diverse causal relations, views, and
backgrounds, enabling evaluations across scenes of varying complexity. We
assess multiple state-of-the-art methods, including classical causal discovery,
causal representation learning, and large/vision-language models (LLMs/VLMs).
Our experiments show that as causal structures grow more complex without prior
knowledge, performance declines significantly, highlighting the challenges even
advanced methods face in complex causal scenarios. Causal3D serves as a vital
resource for advancing causal reasoning in CV and fostering trustworthy AI in
critical domains.
中文标题/摘要
标题:CAUSAL3D:视觉数据因果学习的综合基准
真正的智能依赖于发现和利用隐藏的因果关系的能力。尽管在人工智能和计算机视觉(CV)方面取得了显著进展,但仍缺乏评估模型从复杂视觉数据中推断潜在因果关系能力的基准。在本文中,我们介绍了Causal3D,这是一种新颖且全面的基准,将结构化数据(表格)与相应的视觉表示(图像)结合在一起,以评估因果推理能力。Causal3D 设计在系统框架内,包含19个3D场景数据集,捕捉各种因果关系、视角和背景,使不同复杂度场景的评估成为可能。我们评估了多种最先进的方法,包括经典因果发现、因果表示学习以及大型/视觉语言模型(LLMs/VLMs)。实验结果显示,在缺乏先验知识的情况下,随着因果结构变得更加复杂,性能显著下降,突显了即使是先进方法在复杂因果场景中也面临的挑战。Causal3D 是推进CV中的因果推理和促进关键领域可信AI的重要资源。
Summary / 总结
CAUSAL3D is a new comprehensive benchmark that integrates structured data with visual representations to evaluate causal reasoning in complex scenes. It includes 19 3D-scene datasets with diverse causal relations, views, and backgrounds. The benchmark assesses various state-of-the-art methods, showing that performance declines significantly as causal structures become more complex without prior knowledge. This highlights the challenges advanced methods face in complex causal scenarios.
CAUSAL3D 是一个新基准,用于评估模型从视觉数据中推断因果关系的能力。它结合了结构化数据和视觉表示,并包含19个3D场景数据集,以评估在不同复杂性场景中的因果推理能力。实验表明,随着因果结构变得越来越复杂,性能会显著下降,表明即使在复杂因果场景中,先进方法也面临挑战。
Latent Chain-of-Thought for Visual Reasoning
Authors: Guohao Sun, Hang Hua, Jian Wang, Jiebo Luo, Sohail Dianat, Majid Rabbani, Raghuveer Rao, Zhiqiang Tao
Venue: NeurIPS 2025
First: 2025-10-27T23:10:06+00:00 · Latest: 2025-10-29T18:48:20+00:00
Comments: NeurIPS 2025
Abstract
Chain-of-thought (CoT) reasoning is critical for improving the
interpretability and reliability of Large Vision-Language Models (LVLMs).
However, existing training algorithms such as SFT, PPO, and GRPO may not
generalize well across unseen reasoning tasks and heavily rely on a biased
reward model. To address this challenge, we reformulate reasoning in LVLMs as
posterior inference and propose a scalable training algorithm based on
amortized variational inference. By leveraging diversity-seeking reinforcement
learning algorithms, we introduce a novel sparse reward function for
token-level learning signals that encourage diverse, high-likelihood latent
CoT, overcoming deterministic sampling limitations and avoiding reward hacking.
Additionally, we implement a Bayesian inference-scaling strategy that replaces
costly Best-of-N and Beam Search with a marginal likelihood to efficiently rank
optimal rationales and answers. We empirically demonstrate that the proposed
method enhances the state-of-the-art LVLMs on seven reasoning benchmarks, in
terms of effectiveness, generalization, and interpretability.
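To make the idea of ranking by marginal likelihood concrete, here is a minimal sketch. It assumes we already have a set of sampled (rationale, answer) pairs with an estimate of their joint log-likelihood, and it scores each answer by summing probability mass over its distinct rationales. The function names and the simple sum over samples are assumptions for illustration; the paper's actual estimator is not described in the abstract and may differ.

```python
import math
from collections import defaultdict

def logsumexp(values):
    # Numerically stable log(sum(exp(v))) for a non-empty list.
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def rank_answers_by_marginal(samples):
    """Rank answers by an estimate of log p(answer | input).

    samples: iterable of (rationale, answer, log_joint), where log_joint
    approximates log p(rationale, answer | input). Duplicate
    (rationale, answer) pairs are counted once, so the sum over distinct
    rationales is a lower bound on the true marginal.
    """
    seen = set()
    by_answer = defaultdict(list)
    for rationale, answer, log_joint in samples:
        if (rationale, answer) in seen:
            continue
        seen.add((rationale, answer))
        by_answer[answer].append(log_joint)
    scores = {ans: logsumexp(vals) for ans, vals in by_answer.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Answer "B" has the single most likely rationale, but "A" has more total mass.
samples = [
    ("r1", "A", math.log(0.3)),
    ("r2", "A", math.log(0.3)),
    ("r3", "B", math.log(0.5)),
]
ranking = rank_answers_by_marginal(samples)
```

The toy input shows why marginalizing can disagree with picking the single best-scoring sample: an answer supported by several moderately likely rationales can outrank one supported by a single high-likelihood rationale.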
中文标题/摘要
标题:视觉推理中的潜在思维链
思维链(CoT)推理对于提高大型视觉-语言模型(LVLM)的可解释性和可靠性至关重要。然而,现有的训练算法如SFT、PPO和GRPO可能在未见过的推理任务上表现不佳,并且严重依赖于有偏的奖励模型。为了解决这一挑战,我们将LVLM中的推理重新表述为后验推断,并提出了一种基于近似变分推断的可扩展训练算法。通过利用寻求多样性的强化学习算法,我们引入了一种新颖的稀疏奖励函数,用于促进多样且高似然的潜在CoT,克服了确定性采样的局限性,避免了奖励作弊。此外,我们实现了贝叶斯推理扩展策略,用边际似然替代昂贵的Best-of-N和束搜索,高效地排名最优论据和答案。我们实证证明,所提出的方法在七个推理基准上增强了最先进的LVLM,在有效性、泛化能力和可解释性方面表现出色。
Summary / 总结
The paper addresses the challenge of improving the interpretability and reliability of Large Vision-Language Models (LVLMs) by reformulating reasoning as posterior inference and proposing a scalable training algorithm based on amortized variational inference. It introduces a novel sparse reward function for token-level learning signals to encourage diverse, high-likelihood latent CoT, and implements a Bayesian inference-scaling strategy to efficiently rank optimal rationales and answers. The method enhances state-of-the-art LVLMs on seven reasoning benchmarks in terms of effectiveness, generalization, and interpretability.
研究旨在通过解决现有训练算法的局限性,提高大型视觉-语言模型(LVLM)的可解释性和可靠性。方法将LVLM中的推理重新表述为后验推理,并引入基于近似变分推理的可扩展训练算法。关键发现表明,所提出的方法在七个推理基准上提升了最先进的LVLM,提高了有效性、泛化能力和可解释性。
FreeArt3D: Training-Free Articulated Object Generation using 3D Diffusion
Authors: Chuhao Chen, Isabella Liu, Xinyue Wei, Hao Su, Minghua Liu
First: 2025-10-29T17:58:14+00:00 · Latest: 2025-10-29T17:58:14+00:00
Abstract
Articulated 3D objects are central to many applications in robotics, AR/VR,
and animation. Recent approaches to modeling such objects either rely on
optimization-based reconstruction pipelines that require dense-view supervision
or on feed-forward generative models that produce coarse geometric
approximations and often overlook surface texture. In contrast, open-world 3D
generation of static objects has achieved remarkable success, especially with
the advent of native 3D diffusion models such as Trellis. However, extending
these methods to articulated objects by training native 3D diffusion models
poses significant challenges. In this work, we present FreeArt3D, a
training-free framework for articulated 3D object generation. Instead of
training a new model on limited articulated data, FreeArt3D repurposes a
pre-trained static 3D diffusion model (e.g., Trellis) as a powerful shape
prior. It extends Score Distillation Sampling (SDS) into the 3D-to-4D domain by
treating articulation as an additional generative dimension. Given a few images
captured in different articulation states, FreeArt3D jointly optimizes the
object's geometry, texture, and articulation parameters without requiring
task-specific training or access to large-scale articulated datasets. Our
method generates high-fidelity geometry and textures, accurately predicts
underlying kinematic structures, and generalizes well across diverse object
categories. Despite following a per-instance optimization paradigm, FreeArt3D
completes in minutes and significantly outperforms prior state-of-the-art
approaches in both quality and versatility.
中文标题/摘要
标题:FreeArt3D:无需训练的3D可动物体生成方法利用3D扩散
3D可动物体在机器人学、AR/VR和动画等领域中至关重要。最近对这类物体建模的方法要么依赖于需要密集视角监督的优化重建管道,要么依赖于生成前馈模型,这些模型生成粗略的几何近似,往往忽略了表面纹理。相比之下,静态3D物体的开放世界生成已经取得了显著成功,尤其是随着原生3D扩散模型(如Trellis)的出现。然而,将这些方法扩展到可动物体并训练原生3D扩散模型带来了重大挑战。在本文中,我们提出了FreeArt3D,这是一种无需训练的3D可动物体生成框架。FreeArt3D 不是针对有限的可动数据训练新模型,而是将一个预先训练好的静态3D扩散模型(例如Trellis)重新用于强大的形状先验。它将Score Distillation Sampling (SDS) 扩展到3D到4D领域,将可动性视为额外的生成维度。给定不同可动状态下的少量图像,FreeArt3D 联合优化物体的几何形状、纹理和可动参数,无需特定任务的训练或访问大规模可动数据集。我们的方法生成高保真几何形状和纹理,准确预测潜在的运动结构,并在多种物体类别中表现出良好的泛化能力。尽管遵循实例优化范式,FreeArt3D 完成时间仅需几分钟,并且在质量和多功能性方面显著优于先前的先进方法。
Summary / 总结
FreeArt3D is a training-free framework for generating articulated 3D objects. It repurposes a pre-trained static 3D diffusion model as a shape prior and extends Score Distillation Sampling to the 3D-to-4D domain. Given a few images of an object in different articulation states, FreeArt3D optimizes the object's geometry, texture, and articulation parameters without requiring task-specific training. The method generates high-fidelity geometry and textures, accurately predicts kinematic structures, and generalizes well across diverse object categories, outperforming previous approaches in both quality and versatility.
FreeArt3D 是一个无需训练的框架,用于生成可动3D物体。它将一个预训练的静态 3D 扩散模型作为形状先验,并将 Score Distillation Sampling (SDS) 扩展到 3D 到 4D 领域。给定物体在不同可动状态下的几张图片,FreeArt3D 联合优化物体的几何形状、纹理和可动参数,无需特定任务的训练。该方法生成高保真几何形状和纹理,准确预测运动结构,并在多种物体类别中表现出良好的泛化能力,优于之前的先进方法。
ScaleDiff: Higher-Resolution Image Synthesis via Efficient and Model-Agnostic Diffusion
Authors: Sungho Koh, SeungJu Cha, Hyunwoo Oh, Kwanyoung Lee, Dong-Jin Kim
Venue: NeurIPS 2025
First: 2025-10-29T17:17:32+00:00 · Latest: 2025-10-29T17:17:32+00:00
Comments: NeurIPS 2025. Code: https://github.com/KSH00906/ScaleDiff
Abstract
Text-to-image diffusion models often exhibit degraded performance when
generating images beyond their training resolution. Recent training-free
methods can mitigate this limitation, but they often require substantial
computation or are incompatible with recent Diffusion Transformer models. In
this paper, we propose ScaleDiff, a model-agnostic and highly efficient
framework for extending the resolution of pretrained diffusion models without
any additional training. A core component of our framework is Neighborhood
Patch Attention (NPA), an efficient mechanism that reduces computational
redundancy in the self-attention layer with non-overlapping patches. We
integrate NPA into an SDEdit pipeline and introduce Latent Frequency Mixing
(LFM) to better generate fine details. Furthermore, we apply Structure Guidance
to enhance global structure during the denoising process. Experimental results
demonstrate that ScaleDiff achieves state-of-the-art performance among
training-free methods in terms of both image quality and inference speed on
both U-Net and Diffusion Transformer architectures.
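The saving from restricting attention to local patches can be illustrated with a small counting sketch. It assumes a latent grid split into non-overlapping query patches, where each patch attends only to keys inside its own box extended by a fixed margin. The `margin` parameter and the exact neighborhood shape are assumptions made for illustration, not the paper's definition of NPA.

```python
def patch_windows(height, width, patch, margin):
    """Yield (query_box, key_box) per non-overlapping patch.

    Boxes are (top, left, bottom, right) with half-open bounds.
    The key box is the query box grown by `margin` cells, clipped to the grid.
    """
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            bottom = min(top + patch, height)
            right = min(left + patch, width)
            key_box = (max(top - margin, 0), max(left - margin, 0),
                       min(bottom + margin, height), min(right + margin, width))
            yield (top, left, bottom, right), key_box

def area(box):
    top, left, bottom, right = box
    return (bottom - top) * (right - left)

def attention_pairs(height, width, patch, margin):
    # Total number of query-key pairs evaluated under patch-local attention.
    return sum(area(q) * area(k)
               for q, k in patch_windows(height, width, patch, margin))

def full_attention_pairs(height, width):
    # Every token attends to every token.
    return (height * width) ** 2

# On a 4x4 grid with 2x2 patches: 64 pairs without margin, 144 with margin 1,
# versus 256 for full self-attention.
local = attention_pairs(4, 4, patch=2, margin=0)
wider = attention_pairs(4, 4, patch=2, margin=1)
full = full_attention_pairs(4, 4)
```

Because the query patches do not overlap, each token is processed as a query exactly once; the cost then grows with the neighborhood size rather than with the full grid, which is where the reduction in redundant computation comes from in this simplified view.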
中文标题/摘要
标题:ScaleDiff:通过高效且模型无关的扩散实现高分辨率图像合成
文本到图像的扩散模型在生成超出其训练分辨率的图像时通常表现出性能下降。最近的无训练方法可以缓解这一限制,但它们往往需要大量计算或与最近的扩散变换器模型不兼容。在本文中,我们提出了一种模型无关且高效的框架ScaleDiff,无需额外训练即可扩展预训练扩散模型的分辨率。我们框架的核心组件是邻域块注意力(NPA),这是一种高效的机制,通过非重叠块减少自注意力层中的计算冗余。我们将NPA集成到SDEdit管道中,并引入潜在频率混合(LFM)以更好地生成细部。此外,我们在去噪过程中应用结构引导以增强全局结构。实验结果表明,ScaleDiff在U-Net和扩散变换器架构上均实现了无训练方法中的最佳性能,无论是图像质量还是推理速度。
Summary / 总结
ScaleDiff is a model-agnostic and efficient framework for extending the resolution of pretrained diffusion models without additional training. Its core component, Neighborhood Patch Attention (NPA), reduces computational redundancy in the self-attention layer by using non-overlapping patches. NPA is integrated into an SDEdit pipeline, with Latent Frequency Mixing (LFM) to better generate fine details and Structure Guidance to enhance global structure during denoising. Experimental results show that ScaleDiff achieves state-of-the-art image quality and inference speed among training-free methods on both U-Net and Diffusion Transformer architectures.
ScaleDiff 是一个模型无关且高效的框架,能够在不进行额外训练的情况下扩展预训练扩散模型的分辨率。它引入了 Neighborhood Patch Attention (NPA) 来减少计算冗余,并使用 Latent Frequency Mixing (LFM) 来增强细部生成。此外,ScaleDiff 还应用了 Structure Guidance 来在去噪过程中增强全局结构。实验结果表明,ScaleDiff 在 U-Net 和 Diffusion Transformer 架构上在图像质量和推理速度方面均优于其他无训练方法。
ALDEN: Reinforcement Learning for Active Navigation and Evidence Gathering in Long Documents
Authors: Tianyu Yang, Terry Ruas, Yijun Tian, Jan Philip Wahle, Daniel Kurzawe, Bela Gipp
First: 2025-10-29T16:32:26+00:00 · Latest: 2025-10-29T16:32:26+00:00
Abstract
Vision-language models (VLMs) excel at interpreting text-rich images but
struggle with long, visually complex documents that demand analysis and
integration of information spread across multiple pages. Existing approaches
typically rely on fixed reasoning templates or rigid pipelines, which force
VLMs into a passive role and hinder both efficiency and generalization. We
present Active Long-DocumEnt Navigation (ALDEN), a multi-turn reinforcement
learning framework that fine-tunes VLMs as interactive agents capable of
actively navigating long, visually rich documents. ALDEN introduces a novel
fetch action that directly accesses the page by index, complementing the
classic search action and better exploiting document structure. For dense
process supervision and efficient training, we propose a rule-based cross-level
reward that provides both turn- and token-level signals. To address the
empirically observed training instability caused by numerous visual tokens from
long documents, we further propose a visual-semantic anchoring mechanism that
applies a dual-path KL-divergence constraint to stabilize visual and textual
representations separately during training. Trained on a corpus constructed
from three open-source datasets, ALDEN achieves state-of-the-art performance on
five long-document benchmarks. Overall, ALDEN marks a step beyond passive
document reading toward agents that autonomously navigate and reason across
long, visually rich documents, offering a robust path to more accurate and
efficient long-document understanding.
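The difference between the two navigation actions can be sketched with a toy environment. Here `search` ranks pages by content relevance, while `fetch` jumps straight to a page by its index, for example after a table of contents has pointed to it. The `DocumentEnv` class, the keyword-overlap scoring, and the use of plain text in place of page images are all assumptions for illustration; the abstract does not describe ALDEN's actual interfaces.

```python
from dataclasses import dataclass, field

@dataclass
class DocumentEnv:
    """Toy long-document environment with two navigation actions."""
    pages: list                      # page texts, standing in for page images
    visited: list = field(default_factory=list)

    def search(self, query, top_k=1):
        # Rank pages by naive keyword overlap; ties go to the earlier page.
        terms = set(query.lower().split())
        def score(i):
            return (len(terms & set(self.pages[i].lower().split())), -i)
        ranked = sorted(range(len(self.pages)), key=score, reverse=True)
        hits = ranked[:top_k]
        self.visited.extend(hits)
        return hits

    def fetch(self, index):
        # Direct access by page index, independent of any query.
        if not 0 <= index < len(self.pages):
            raise IndexError(f"page index {index} out of range")
        self.visited.append(index)
        return self.pages[index]

env = DocumentEnv(pages=[
    "contents results are on page 2",
    "introduction and motivation",
    "results accuracy is 91 percent",
])
hits = env.search("accuracy results")   # content-based lookup
page = env.fetch(2)                      # index-based lookup
```

An agent that has read the contents page can call `fetch(2)` without issuing a query at all, which is the kind of structure-aware shortcut a search-only action space cannot express.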
中文标题/摘要
标题:ALDEN:在长文档中进行主动导航和证据收集的强化学习
视觉语言模型(VLMs)在解释图文丰富的图像方面表现出色,但在处理长篇复杂文档时却遇到困难,这些文档需要对分布在多页上的信息进行分析和整合。现有方法通常依赖固定的推理模板或刚性的处理流程,这迫使VLMs处于被动角色,影响了效率和泛化能力。我们提出了主动长文档导航(ALDEN),这是一种多轮次的强化学习框架,能够将VLMs训练成能够主动导航长文档的交互式代理。ALDEN引入了一种新的获取动作,可以直接通过索引访问页面,补充了经典的搜索动作,并更好地利用了文档结构。为了实现密集的过程监督和高效的训练,我们提出了一种基于规则的跨层次奖励机制,提供了轮次级和标记级的信号。为了解决由长文档中的大量视觉标记引起的训练不稳定问题,我们进一步提出了一种视觉语义锚定机制,在训练过程中分别对视觉和文本表示施加双重路径的KL散度约束,以稳定它们。ALDEN在三个开源数据集构建的语料库上进行训练,实现了五个长文档基准测试中的最佳性能。总体而言,ALDEN标志着从被动文档阅读向能够自主导航和推理的长文档的转变,提供了一条更准确和高效的长文档理解的稳健路径。
Summary / 总结
ALDEN is a multi-turn reinforcement learning framework that enhances vision-language models to actively navigate and gather evidence from long, visually complex documents. It introduces a novel fetch action and a rule-based cross-level reward system to improve efficiency and generalization. ALDEN also includes a visual-semantic anchoring mechanism to stabilize training. The model achieves state-of-the-art performance on five long-document benchmarks, demonstrating its capability to autonomously navigate and reason across such documents more accurately and efficiently than existing approaches.
ALDEN 是一个强化学习框架,旨在提升视觉语言模型处理长篇复杂文档的能力。它引入了新的 fetch 动作和基于规则的跨层级奖励,以提高效率和通用性。ALDEN 还提出了一种视觉语义锚定机制,以在训练中稳定视觉和文本表示。该模型在五个长文档基准测试中达到了最先进的性能,展示了其在长文档中自主导航和推理的能力。
Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization
Authors: Nikita Kachaev, Mikhail Kolosov, Daniil Zelezetsky, Alexey K. Kovalev, Aleksandr I. Panov
First: 2025-10-29T15:20:10+00:00 · Latest: 2025-10-29T15:20:10+00:00
Comments: 13 pages, 6 figures
Abstract
The growing success of Vision-Language-Action (VLA) models stems from the
promise that pretrained Vision-Language Models (VLMs) can endow agents with
transferable world knowledge and vision-language (VL) grounding, laying a
foundation for action models with broader generalization. Yet when these VLMs
are adapted to the action modality, it remains unclear to what extent their
original VL representations and knowledge are preserved. In this work, we
conduct a systematic study of representation retention during VLA fine-tuning,
showing that naive action fine-tuning leads to degradation of visual
representations. To characterize and measure these effects, we probe VLA's
hidden representations and analyze attention maps. Further, we design a set of
targeted tasks and methods that contrast VLA models with their counterpart
VLMs, isolating changes in VL capabilities induced by action fine-tuning. We
further evaluate a range of strategies for aligning visual representations and
introduce a simple yet effective method that mitigates degradation and yields
improved generalization to out-of-distribution (OOD) scenarios. Taken together,
our analysis clarifies the trade-off between action fine-tuning and the
degradation of VL representations and highlights practical approaches to
recover inherited VL capabilities. Code is publicly available:
https://blind-vla-paper.github.io
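One generic way to quantify how far visual features drift during fine-tuning is to compare them with the features of the original model on the same inputs. The sketch below computes a mean cosine similarity as a retention score and turns it into a penalty that could be added to a training loss. This is an assumed, simplified measure; the abstract does not specify the authors' probing metric or alignment method, and the function names here are hypothetical.

```python
import math

def cosine(u, v):
    # Cosine similarity of two equal-length vectors; 0.0 if either is all zeros.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def representation_retention(reference_feats, finetuned_feats):
    """Mean cosine similarity between paired feature vectors.

    reference_feats: features from the original (frozen) VLM.
    finetuned_feats: features from the action-fine-tuned model, same inputs.
    """
    if not reference_feats or len(reference_feats) != len(finetuned_feats):
        raise ValueError("feature lists must be non-empty and the same length")
    total = sum(cosine(r, f) for r, f in zip(reference_feats, finetuned_feats))
    return total / len(reference_feats)

def alignment_penalty(reference_feats, finetuned_feats):
    # 0 when features are perfectly aligned, larger as they drift apart.
    return 1.0 - representation_retention(reference_feats, finetuned_feats)

reference = [[3.0, 4.0], [1.0, 0.0]]
unchanged = [[3.0, 4.0], [1.0, 0.0]]
drifted = [[-4.0, 3.0], [0.0, 1.0]]
```

With these toy vectors, `unchanged` yields a retention of 1 and `drifted` (each vector rotated by 90 degrees) yields 0, so the penalty ranges from no drift to complete loss of directional agreement.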
中文标题/摘要
标题:别让你的VLA失明:为OOD泛化对齐视觉表示
视觉-语言-行动(VLA)模型的成功得益于预训练视觉-语言模型(VLMs)赋予代理广泛转移的世界知识和视觉-语言(VL)定位的承诺,为具有更广泛泛化能力的行动模型奠定了基础。然而,当这些VLMs适应行动模态时,尚不清楚它们原始的VL表示和知识在多大程度上得到了保留。在本文中,我们系统研究了VLA微调期间表示的保留情况,表明简单的行动微调会导致视觉表示的退化。为了表征和测量这些影响,我们探测了VLA的隐藏表示并分析了注意力图,进一步设计了一系列对比VLA模型与其对应VLMs的目标任务和方法,以隔离由行动微调引起的VL能力的变化。我们还评估了一系列视觉表示对齐策略,并引入了一种简单而有效的方法,该方法减轻了退化并提高了对分布外(OOD)场景的泛化能力。综上所述,我们的分析阐明了行动微调与VL表示退化之间的权衡,并强调了恢复继承的VL能力的实用方法。代码已公开:https://blind-vla-paper.github.io
Summary / 总结
This study investigates how action fine-tuning affects the visual representations of Vision-Language-Action (VLA) models. By probing hidden representations, analyzing attention maps, and contrasting VLA models with their counterpart VLMs on targeted tasks, the authors show that naive action fine-tuning degrades visual representations. They evaluate a range of strategies for aligning visual representations and introduce a simple method that mitigates this degradation and improves generalization to out-of-distribution (OOD) scenarios.
该研究探讨了动作微调对Vision-Language-Action (VLA)模型视觉表示的影响,发现简单的微调会降低这些表示。作者设计了目标任务和方法来衡量降级,并引入了一种简单有效的对齐方法,以提高对未见过分布场景的泛化能力。他们的分析突出了动作微调与视觉表示质量之间的权衡,并提供了缓解这一问题的实际解决方案。