arXiv Paper Digest

2025-11-02 03:26
Snapshot: 20251102_0326
ChartAB: A Benchmark for Chart Grounding & Dense Alignment
Authors: Aniruddh Bansal, Davit Soselia, Dang Nguyen, Tianyi Zhou
First: 2025-10-30T17:56:31+00:00 · Latest: 2025-10-30T17:56:31+00:00
Abstract
Charts play an important role in visualization, reasoning, data analysis, and the exchange of ideas among humans. However, existing vision-language models (VLMs) still lack accurate perception of details and struggle to extract fine-grained structures from charts. Such limitations in chart grounding also hinder their ability to compare multiple charts and reason over them. In this paper, we introduce a novel "ChartAlign Benchmark (ChartAB)" to provide a comprehensive evaluation of VLMs in chart grounding tasks, i.e., extracting tabular data, localizing visualization elements, and recognizing various attributes from charts of diverse types and complexities. We design a JSON template to facilitate the calculation of evaluation metrics specifically tailored for each grounding task. By incorporating a novel two-stage inference workflow, the benchmark can further evaluate VLMs' capability to align and compare elements/attributes across two charts. Our analysis of evaluations on several recent VLMs reveals new insights into their perception biases, weaknesses, robustness, and hallucinations in chart understanding. These findings highlight the fine-grained discrepancies among VLMs in chart understanding tasks and point to specific skills that need to be strengthened in current models.
中文标题/摘要
标题:ChartAB:图表定位与密集对齐基准
图表在可视化、推理、数据分析以及人类思想交流中发挥着重要作用。然而,现有的视觉-语言模型(VLMs)在细节感知方面仍存在不足,难以从图表中提取精细的结构。这种图表定位的限制也阻碍了它们比较多个图表和推理的能力。在本文中,我们引入了一个新的“图表对齐基准(ChartAB)”,以全面评估VLMs在图表定位任务中的表现,即提取表格数据、定位可视化元素以及识别各种不同类型的图表的各种属性。我们设计了一个JSON模板,以方便计算每个定位任务特定的评估指标。通过引入一种新颖的两阶段推理工作流,基准还可以进一步评估VLMs在两个图表之间对齐和比较元素/属性的能力。我们对几个最近的VLMs的评估分析揭示了它们在图表理解中的感知偏差、弱点、鲁棒性和幻觉。这些发现突显了VLMs在图表理解任务中的细微差异,并指出了当前模型需要加强的具体技能。
Summary / 总结
The paper introduces ChartAB, a benchmark for evaluating vision-language models in chart grounding tasks, including extracting tabular data, localizing visualization elements, and recognizing chart attributes. By using a two-stage inference workflow and a JSON template for evaluation, the benchmark assesses models' ability to align and compare elements across charts. The analysis reveals biases, weaknesses, and hallucinations in current models, highlighting the need for improved fine-grained chart understanding capabilities.
论文介绍了ChartAB基准,用于评估视觉-语言模型(VLMs)在图表定位任务中的表现,包括从不同类型的图表中提取表格数据、定位可视化元素和识别属性。该基准使用JSON模板计算特定的评估指标,并采用两阶段推理工作流来比较图表中的元素。对近期VLMs的评估揭示了它们在图表理解中的偏见、弱点和幻觉,突显了当前模型在图表理解任务中需要增强的精细技能。
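To make the JSON-template idea from the ChartAB abstract concrete, here is a minimal sketch of how a templated answer could be scored for the tabular-data task. The field names (`chart_type`, `columns`, `rows`), the 5% relative tolerance, and the `cell_accuracy` metric are assumptions for illustration only; the abstract does not specify the benchmark's schema or metrics.

```python
import json

# Hypothetical template: field names are illustrative, not ChartAB's schema.
TEMPLATE = {"chart_type": None, "columns": [], "rows": []}

def parse_prediction(raw: str) -> dict:
    """Parse a model's JSON answer; fall back to an empty template on bad JSON."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return dict(TEMPLATE)
    if not isinstance(data, dict):
        return dict(TEMPLATE)
    # Keep only the template's keys so every answer has the same shape.
    return {key: data.get(key, default) for key, default in TEMPLATE.items()}

def cell_accuracy(pred: dict, gold: dict, rel_tol: float = 0.05) -> float:
    """Fraction of gold cells whose predicted value is within rel_tol."""
    hits, total = 0, 0
    for i, gold_row in enumerate(gold["rows"]):
        pred_row = pred["rows"][i] if i < len(pred["rows"]) else []
        for j, gold_val in enumerate(gold_row):
            total += 1
            if j < len(pred_row) and isinstance(pred_row[j], (int, float)):
                if abs(pred_row[j] - gold_val) <= rel_tol * max(abs(gold_val), 1e-9):
                    hits += 1
    return hits / total if total else 0.0

gold = {"chart_type": "bar", "columns": ["2023", "2024"],
        "rows": [[10.0, 20.0], [30.0, 40.0]]}
raw = ('{"chart_type": "bar", "columns": ["2023", "2024"], '
       '"rows": [[10.2, 20.0], [33.0, 40.0]]}')
score = cell_accuracy(parse_prediction(raw), gold)  # 3 of 4 cells are close enough
```

Forcing every answer into one fixed shape is what makes a per-task metric easy to compute: malformed output simply scores zero instead of breaking the evaluator.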
SteerVLM: Robust Model Control through Lightweight Activation Steering for Vision Language Models
Authors: Anushka Sivakumar, Andrew Zhang, Zaber Hakim, Chris Thomas
First: 2025-10-30T17:52:39+00:00 · Latest: 2025-10-30T17:52:39+00:00
Abstract
This work introduces SteerVLM, a lightweight steering module designed to guide Vision-Language Models (VLMs) towards outputs that better adhere to desired instructions. Our approach learns from the latent embeddings of paired prompts encoding target and converse behaviors to dynamically adjust activations connecting the language modality with image context. This allows for fine-grained, inference-time control over complex output semantics without modifying model weights while preserving performance on off-target tasks. Our steering module requires learning parameters equal to 0.14% of the original VLM's size. Our steering module gains model control through dimension-wise activation modulation and adaptive steering across layers without requiring pre-extracted static vectors or manual tuning of intervention points. Furthermore, we introduce VNIA (Visual Narrative Intent Alignment), a multimodal dataset specifically created to facilitate the development and evaluation of VLM steering techniques. Our method outperforms existing intervention techniques on steering and hallucination mitigation benchmarks for VLMs and proposes a robust solution for multimodal model control through activation engineering.
中文标题/摘要
标题:SteerVLM:通过轻量级激活转向实现视觉语言模型稳健的模型控制
本工作介绍了SteerVLM,这是一种轻量级的转向模块,旨在引导视觉语言模型(VLMs)生成更符合所需指令的输出。我们的方法通过学习配对提示的潜在嵌入,编码目标和相反行为,动态调整语言模态与图像上下文之间的激活连接。这允许在不修改模型权重的情况下,在推理时对复杂输出语义进行精细控制,同时保持对离目标任务性能的保留。我们的转向模块的学习参数量仅为原始VLM大小的0.14%。我们的转向模块通过维度上的激活调制和跨层自适应转向获得模型控制,无需预先提取的静态向量或手动调整干预点。此外,我们还引入了VNIA(视觉叙事意图对齐)多模态数据集,专门用于促进VLM转向技术的发展和评估。我们的方法在VLM转向和幻觉缓解基准测试中优于现有干预技术,并通过激活工程提出了多模态模型控制的稳健解决方案。
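The SteerVLM abstract mentions dimension-wise activation modulation without weight updates. The sketch below shows the general shape of such an intervention, assuming a per-dimension sigmoid gate applied to the difference between a target-behavior embedding and a converse-behavior embedding. The gate parameterization and all names are hypothetical and are not taken from the paper.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def steer(hidden, target_emb, converse_emb, gate_logits):
    """Shift each hidden dimension toward the target and away from the converse.

    gate_logits stands in for learned parameters (hypothetical); a sigmoid
    turns them into per-dimension strengths in (0, 1). The base model's
    weights are never touched; only the activation is edited.
    """
    return [
        h + sigmoid(g) * (t - c)
        for h, t, c, g in zip(hidden, target_emb, converse_emb, gate_logits)
    ]

hidden = [0.5, -1.0, 2.0]
target = [1.0, 0.0, 0.0]
converse = [0.0, 0.0, 1.0]
gates = [0.0, 0.0, 10.0]   # the third dimension is steered almost fully
steered = steer(hidden, target, converse, gates)
```

Because the gate is learned per dimension, no static steering vector has to be extracted in advance, which matches the abstract's stated motivation.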
CronusVLA: Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling
Authors: Hao Li, Shuai Yang, Yilun Chen, Xinyi Chen, Xiaoda Yang, Yang Tian, Hanqing Wang, Tai Wang, Dahua Lin, Feng Zhao, Jiangmiao Pang
First: 2025-06-24T17:30:27+00:00 · Latest: 2025-10-30T16:38:19+00:00
Comments: 39 pages, 24 figures
Abstract
Recent vision-language-action (VLA) models built on pretrained vision-language models (VLMs) have demonstrated strong performance in robotic manipulation. However, these models remain constrained by the single-frame image paradigm and fail to fully leverage the temporal information offered by multi-frame histories, as directly feeding multiple frames into VLM backbones incurs substantial computational overhead and inference latency. We propose CronusVLA, a unified framework that extends single-frame VLA models to the multi-frame paradigm. CronusVLA follows a two-stage process: (1) Single-frame pretraining on large-scale embodied datasets with autoregressive prediction of action tokens, establishing an effective embodied vision-language foundation; (2) Multi-frame post-training, which adapts the prediction of the vision-language backbone from discrete tokens to learnable features, and aggregates historical information via feature chunking. CronusVLA effectively addresses the existing challenges of multi-frame modeling while enhancing performance and observational robustness. To evaluate the robustness under temporal and spatial disturbances, we introduce SimplerEnv-OR, a novel benchmark featuring 24 types of observational disturbances and 120 severity levels. Experiments across three embodiments in simulated and real-world environments demonstrate that CronusVLA achieves leading performance and superior robustness, with a 70.9% success rate on SimplerEnv, a 26.8% improvement over OpenVLA on LIBERO, and the highest robustness score on SimplerEnv-OR. These results highlight the potential of efficient multi-frame adaptation in VLA models for more powerful and robust real-world deployment.
中文标题/摘要
标题:CronusVLA:通过多帧视觉-语言-动作建模实现高效稳健操作
基于预训练视觉-语言模型(VLMs)的近期视觉-语言-动作(VLA)模型在机器人操作方面表现出强大的性能。然而,这些模型仍然受限于单帧图像范式,未能充分利用多帧历史提供的时间信息,因为直接将多帧输入到VLM主干中会带来巨大的计算开销和推理延迟。我们提出了一种名为CronusVLA的统一框架,将单帧VLA模型扩展到多帧范式。CronusVLA遵循两阶段过程:(1)在大规模具身数据集上进行单帧预训练,通过自回归预测动作标记,建立有效的具身视觉-语言基础;(2)多帧后训练,将视觉-语言主干的预测从离散标记调整为可学习特征,并通过特征分块聚合历史信息。CronusVLA有效解决了多帧建模的现有挑战,同时提高了性能和观测鲁棒性。为了评估在时间和空间扰动下的鲁棒性,我们引入了SimplerEnv-OR基准,该基准包含24种观测扰动类型和120种严重程度级别。在模拟和真实环境中跨三种具身形态的实验表明,CronusVLA取得了领先的性能和更优的鲁棒性:在SimplerEnv上成功率达到70.9%,在LIBERO上比OpenVLA提升26.8%,并在SimplerEnv-OR上获得最高鲁棒性评分。这些结果突显了VLA模型中高效多帧适应的潜力,有助于实现更强大、更鲁棒的实际部署。
Summary / 总结
CronusVLA is a unified framework that extends single-frame vision-language-action models to a multi-frame paradigm, addressing computational overhead and inference latency issues. It involves single-frame pretraining with autoregressive action token prediction and multi-frame post-training for feature learning and historical information aggregation. Experiments show CronusVLA outperforms existing models with a 70.9% success rate on SimplerEnv and a 26.8% improvement on LIBERO, demonstrating enhanced performance and robustness.
CronusVLA 是一个统一框架,将单帧视觉-语言-动作模型扩展到多帧范式,解决计算开销和推理延迟问题。它包含两个阶段:单帧预训练进行自回归动作标记预测和多帧后训练进行特征分块和历史信息聚合。实验结果显示,CronusVLA 在 SimplerEnv 中的成功率为 70.9%,在 LIBERO 上比 OpenVLA 提高了 26.8%,并且在 SimplerEnv-OR 上具有最高的鲁棒性得分。
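The feature-chunking idea in the CronusVLA abstract (aggregating history as cached features rather than re-feeding raw frames to the backbone) can be sketched as a fixed-length FIFO. The class name, the mean-pooling aggregator, and the toy vectors are assumptions for illustration; the paper's actual aggregation module is not described in the abstract.

```python
from collections import deque

class FeatureChunk:
    """Keep the last `max_frames` per-frame feature vectors (FIFO)."""

    def __init__(self, max_frames: int):
        self.frames = deque(maxlen=max_frames)

    def push(self, feature):
        # The backbone runs once per new frame; older features are reused.
        self.frames.append(list(feature))

    def aggregate(self):
        """Mean-pool cached features; a stand-in for a learned aggregator."""
        n = len(self.frames)
        if n == 0:
            raise ValueError("no frames cached yet")
        dim = len(self.frames[0])
        return [sum(f[d] for f in self.frames) / n for d in range(dim)]

chunk = FeatureChunk(max_frames=2)
for feat in ([1.0, 0.0], [3.0, 2.0], [5.0, 4.0]):
    chunk.push(feat)      # the first frame is evicted on the third push
pooled = chunk.aggregate()
```

Caching one feature vector per frame keeps the per-step cost close to the single-frame case, which is the efficiency argument the abstract makes against feeding multiple frames directly into the VLM backbone.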
All You Need for Object Detection: From Pixels, Points, and Prompts to Next-Gen Fusion and Multimodal LLMs/VLMs in Autonomous Vehicles
Authors: Sayed Pedram Haeri Boroujeni, Niloufar Mehrabi, Hazim Alzorgan, Ahmad Sarlak, Mahlagha Fazeli, Abolfazl Razi
First: 2025-10-30T16:08:25+00:00 · Latest: 2025-10-30T16:08:25+00:00
Abstract
Autonomous Vehicles (AVs) are transforming the future of transportation through advances in intelligent perception, decision-making, and control systems. However, their success is tied to one core capability, reliable object detection in complex and multimodal environments. While recent breakthroughs in Computer Vision (CV) and Artificial Intelligence (AI) have driven remarkable progress, the field still faces a critical challenge as knowledge remains fragmented across multimodal perception, contextual reasoning, and cooperative intelligence. This survey bridges that gap by delivering a forward-looking analysis of object detection in AVs, emphasizing emerging paradigms such as Vision-Language Models (VLMs), Large Language Models (LLMs), and Generative AI rather than re-examining outdated techniques. We begin by systematically reviewing the fundamental spectrum of AV sensors (camera, ultrasonic, LiDAR, and Radar) and their fusion strategies, highlighting not only their capabilities and limitations in dynamic driving environments but also their potential to integrate with recent advances in LLM/VLM-driven perception frameworks. Next, we introduce a structured categorization of AV datasets that moves beyond simple collections, positioning ego-vehicle, infrastructure-based, and cooperative datasets (e.g., V2V, V2I, V2X, I2I), followed by a cross-analysis of data structures and characteristics. Ultimately, we analyze cutting-edge detection methodologies, ranging from 2D and 3D pipelines to hybrid sensor fusion, with particular attention to emerging transformer-driven approaches powered by Vision Transformers (ViTs), Large and Small Language Models (SLMs), and VLMs. By synthesizing these perspectives, our survey delivers a clear roadmap of current capabilities, open challenges, and future opportunities.
中文标题/摘要
标题:目标检测所需的一切:从像素、点和提示到自动驾驶车辆中的下一代融合与多模态LLM/VLM
自动驾驶车辆(AVs)通过智能感知、决策和控制系统的发展正在重塑交通运输的未来。然而,它们的成功取决于一个核心能力:在复杂和多模态环境中可靠地进行目标检测。尽管计算机视觉(CV)和人工智能(AI)领域的最新突破推动了显著的进步,但该领域仍面临一个关键挑战,即知识在多模态感知、上下文推理和协同智能方面仍然碎片化。本综述填补了这一空白,对AV目标检测进行了面向未来的分析,着重讨论视觉语言模型(VLMs)、大型语言模型(LLMs)和生成式AI等新兴范式,而不是重新审视过时的技术。我们首先系统地回顾了AV传感器的基本谱系(摄像头、超声波、激光雷达和雷达)及其融合策略,不仅突出了它们在动态驾驶环境中的能力和局限性,还强调了它们与LLM/VLM驱动的感知框架的最新进展相整合的潜力。接下来,我们介绍了AV数据集的结构化分类,超越了简单的罗列,涵盖自车、基于基础设施和协同数据集(例如V2V、V2I、V2X、I2I),随后对数据结构和特征进行了交叉分析。最后,我们分析了最新的检测方法,从2D和3D流程到混合传感器融合,特别关注由视觉Transformer(ViTs)、大型和小型语言模型(SLMs)以及VLMs驱动的新兴Transformer方法。通过综合这些视角,我们的综述为当前能力、开放挑战和未来机遇提供了一条清晰的路线图。
Counteracting Matthew Effect in Self-Improvement of LVLMs through Head-Tail Re-balancing
Authors: Xin Guo, Zhiheng Xi, Yiwen Ding, Yitao Zhai, Xiaowei Shi, Xunliang Cai, Tao Gui, Qi Zhang, Xuanjing Huang
First: 2025-10-30T13:26:58+00:00 · Latest: 2025-10-30T13:26:58+00:00
Comments: Preprint
Abstract
Self-improvement has emerged as a mainstream paradigm for advancing the reasoning capabilities of large vision-language models (LVLMs), where models explore and learn from successful trajectories iteratively. However, we identify a critical issue during this process: the model excels at generating high-quality trajectories for simple queries (i.e., head data) but struggles with more complex ones (i.e., tail data). This leads to an imbalanced optimization that drives the model to prioritize simple reasoning skills, while hindering its ability to tackle more complex reasoning tasks. Over iterations, this imbalance becomes increasingly pronounced--a dynamic we term the "Matthew effect"--which ultimately hinders further model improvement and leads to performance bottlenecks. To counteract this challenge, we introduce four efficient strategies from two perspectives: distribution-reshaping and trajectory-resampling, to achieve head-tail re-balancing during the exploration-and-learning self-improvement process. Extensive experiments on Qwen2-VL-7B-Instruct and InternVL2.5-4B models across visual reasoning tasks demonstrate that our methods consistently improve visual reasoning capabilities, outperforming vanilla self-improvement by 3.86 points on average.
中文标题/摘要
标题:通过头部-尾部再平衡对抗LVLM自我提升中的马太效应
自我提升已成为提升大型视觉-语言模型(LVLM)推理能力的主要范式,其中模型通过迭代探索和学习成功的轨迹。然而,在这一过程中,我们发现一个关键问题:模型在生成简单查询(即头部数据)的高质量轨迹方面表现出色,但在处理更复杂的查询(即尾部数据)方面却遇到困难。这导致了一种不平衡的优化,促使模型优先考虑简单的推理技能,而阻碍其解决更复杂推理任务的能力。随着迭代次数的增加,这种不平衡变得越来越明显——我们将其称为“马太效应”——最终阻碍了模型的进一步改进并导致性能瓶颈。为了应对这一挑战,我们从两个角度引入了四种有效的策略:分布重塑和轨迹重采样,以在探索和学习自我提升过程中实现头部-尾部再平衡。在Qwen2-VL-7B-Instruct和InternVL2.5-4B模型的视觉推理任务上的广泛实验表明,我们的方法在视觉推理能力上始终优于传统的自我提升,平均高出3.86分。
Summary / 总结
The paper addresses the Matthew effect in self-improvement of large vision-language models (LVLMs), where models tend to excel at simple tasks (head data) but struggle with complex ones (tail data). To counteract this imbalance, the authors propose four strategies for distribution reshaping and trajectory resampling to achieve head-tail re-balancing. Experiments on Qwen2-VL-7B-Instruct and InternVL2.5-4B models show that these methods improve visual reasoning capabilities by an average of 3.86 points compared to vanilla self-improvement.
论文针对大型视觉语言模型(LVLMs)在自我提升过程中出现的简单任务表现优异而复杂任务表现不佳的问题,提出了四种头尾重新平衡策略,集中在分布重塑和轨迹重采样。实验结果显示,这些方法显著提升了视觉推理能力,与传统自我提升方法相比,平均提高了3.86个点的性能。
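The head-tail imbalance described above can be illustrated with a simple trajectory-resampling step. This is a generic capping heuristic written for illustration, not one of the paper's four strategies; the function name, the cap, and the toy pool are all assumptions.

```python
import random

def rebalance(trajectories, cap_per_query, seed=0):
    """trajectories: dict mapping query_id -> list of successful trajectories.

    Keep at most `cap_per_query` trajectories per query so that easy (head)
    queries cannot dominate the next training round.
    """
    rng = random.Random(seed)
    balanced = {}
    for query_id, trajs in trajectories.items():
        if len(trajs) > cap_per_query:
            balanced[query_id] = rng.sample(trajs, cap_per_query)
        else:
            balanced[query_id] = list(trajs)
    return balanced

# An easy query yields many successful samples, a hard one yields a single one.
pool = {"easy_q": [f"t{i}" for i in range(8)], "hard_q": ["t_hard"]}
balanced = rebalance(pool, cap_per_query=2)

head_share_before = len(pool["easy_q"]) / sum(len(v) for v in pool.values())
head_share_after = len(balanced["easy_q"]) / sum(len(v) for v in balanced.values())
```

In this toy pool the head query's share of the training data drops from 8/9 to 2/3, which is the kind of shift that keeps the imbalance from compounding across iterations.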
Representation-Level Counterfactual Calibration for Debiased Zero-Shot Recognition
Authors: Pei Peng, MingKun Xie, Hang Hao, Tong Jin, ShengJun Huang
First: 2025-10-30T13:11:23+00:00 · Latest: 2025-10-30T13:11:23+00:00
Abstract
Object-context shortcuts remain a persistent challenge in vision-language models, undermining zero-shot reliability when test-time scenes differ from familiar training co-occurrences. We recast this issue as a causal inference problem and ask: Would the prediction remain if the object appeared in a different environment? To answer this at inference time, we estimate object and background expectations within CLIP's representation space, and synthesize counterfactual embeddings by recombining object features with diverse alternative contexts sampled from external datasets, batch neighbors, or text-derived descriptions. By estimating the Total Direct Effect and simulating intervention, we further subtract background-only activation, preserving beneficial object-context interactions while mitigating hallucinated scores. Without retraining or prompt design, our method substantially improves both worst-group and average accuracy on context-sensitive benchmarks, establishing a new zero-shot state of the art. Beyond performance, our framework provides a lightweight representation-level counterfactual approach, offering a practical causal avenue for debiased and reliable multimodal reasoning.
中文标题/摘要
标题:面向去偏零样本识别的表示级反事实校准
物体-上下文捷径仍然是视觉-语言模型中的一个持续性挑战,当测试场景与熟悉的训练共现情况不同时,会削弱零样本识别的可靠性。我们将此问题重新表述为因果推理问题,并提出:如果物体出现在不同的环境中,预测是否仍然成立?为了在推理时回答这一问题,我们在CLIP的表示空间中估计物体和背景的期望,并将物体特征与从外部数据集、批内邻居或文本描述中采样的多样化替代上下文重新组合,合成反事实嵌入。通过估计总直接效应并模拟干预,我们进一步减去仅由背景引起的激活,在保留有益的物体-上下文交互的同时减轻幻觉得分。无需重新训练或设计提示,我们的方法在上下文敏感基准测试中显著提高了最差群体准确率和平均准确率,建立了新的零样本最先进水平。除了性能之外,我们的框架提供了一种轻量级的表示级反事实方法,为去偏且可靠的多模态推理提供了实用的因果途径。
Summary / 总结
The paper addresses the challenge of object-context shortcuts in vision-language models, which can reduce zero-shot recognition reliability. It proposes a method to estimate object and background expectations within CLIP's representation space and synthesizes counterfactual embeddings by recombining object features with diverse alternative contexts. This approach improves both worst-group and average accuracy on context-sensitive benchmarks, setting a new zero-shot state of the art without requiring retraining or prompt design.
论文解决了视觉-语言模型中对象-上下文捷径的问题,这可能会降低零样本识别的可靠性。它提出了一种方法,在CLIP的表示空间中估计对象和背景的期望,并通过重新组合对象特征与多样化的替代上下文来合成反事实嵌入。这种方法无需重新训练或设计提示,即可在上下文敏感基准测试中提高最差群体准确率和平均准确率,达到新的零样本最先进水平。
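The core step named in the abstract, subtracting background-only activation, can be sketched with toy embeddings. This is a deliberately simplified version: the paper estimates a Total Direct Effect with synthesized counterfactual embeddings, whereas the code below only subtracts a background-only cosine score. The weight `lam`, the 2-D vectors, and the function names are assumptions.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def debiased_scores(image_emb, background_emb, class_text_embs, lam=1.0):
    """Per-class score minus what the background alone would predict."""
    return {
        name: cosine(image_emb, t) - lam * cosine(background_emb, t)
        for name, t in class_text_embs.items()
    }

# Toy 2-D embeddings: axis 0 ~ "water context", axis 1 ~ "bird-shaped object".
texts = {"boat": [1.0, 0.0], "duck": [0.6, 0.8]}
image = [0.9, 0.3]        # a duck on water; the context dominates the embedding
background = [1.0, 0.0]   # estimated background-only embedding

raw = {name: cosine(image, t) for name, t in texts.items()}
fixed = debiased_scores(image, background, texts)
```

In this toy case the raw scores favor "boat" because of the water context, while the background-subtracted scores favor "duck".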
Towards Fine-Grained Vision-Language Alignment for Few-Shot Anomaly Detection
Authors: Yuanting Fan, Jun Liu, Xiaochen Chen, Bin-Bin Gao, Jian Li, Yong Liu, Jinlong Peng, Chengjie Wang
First: 2025-10-30T13:09:00+00:00 · Latest: 2025-10-30T13:09:00+00:00
Comments: 12 pages, 7 figures
Abstract
Few-shot anomaly detection (FSAD) methods identify anomalous regions with few known normal samples. Most existing methods rely on the generalization ability of pre-trained vision-language models (VLMs) to recognize potentially anomalous regions through feature similarity between text descriptions and images. However, due to the lack of detailed textual descriptions, these methods can only pre-define image-level descriptions to match against each visual patch token when identifying potentially anomalous regions, which leads to semantic misalignment between image descriptions and patch-level visual anomalies and thus to sub-optimal localization performance. To address these issues, we propose the Multi-Level Fine-Grained Semantic Caption (MFSC), which provides multi-level and fine-grained textual descriptions for existing anomaly detection datasets through an automatic construction pipeline. Based on the MFSC, we propose a novel framework named FineGrainedAD to improve anomaly localization performance, which consists of two components: Multi-Level Learnable Prompt (MLLP) and Multi-Level Semantic Alignment (MLSA). MLLP introduces fine-grained semantics into multi-level learnable prompts through an automatic replacement and concatenation mechanism, while MLSA designs a region aggregation strategy and multi-level alignment training to help the learnable prompts align better with the corresponding visual regions. Experiments demonstrate that the proposed FineGrainedAD achieves superior overall performance in few-shot settings on the MVTec-AD and VisA datasets.
中文标题/摘要
标题:面向少样本异常检测的细粒度视觉-语言对齐
少量样本异常检测(FSAD)方法使用少量已知正常样本识别异常区域。现有大多数方法依赖预训练的视觉-语言模型(VLMs)通过文本描述和图像特征之间的相似性来识别潜在的异常区域。但由于缺乏详细的文本描述,这些方法只能预先定义图像级别的描述来匹配每个视觉补丁标记,从而导致图像描述与补丁级别的视觉异常之间存在语义不匹配,实现次优的定位性能。为了解决上述问题,我们提出了多级细粒度语义描述(MFSC),为现有的异常检测数据集提供多级和细粒度的文本描述,并通过自动构建管道进行自动构造。基于MFSC,我们提出了一种新的框架FineGrainedAD,以提高异常定位性能,该框架由两个组件组成:多级可学习提示(MLLP)和多级语义对齐(MLSA)。MLLP通过自动替换和连接机制将细粒度语义引入多级可学习提示,而MLSA设计了区域聚合策略和多级对齐训练,以促进可学习提示更好地与相应的视觉区域对齐。实验表明,提出的FineGrainedAD在MVTec-AD和VisA数据集的少量样本设置中实现了优越的整体性能。
Summary / 总结
The paper addresses the issue of semantic misalignment in few-shot anomaly detection by proposing Multi-Level Fine-Grained Semantic Caption (MFSC) and a novel framework called FineGrainedAD. MFSC provides detailed textual descriptions for anomaly detection datasets, and FineGrainedAD includes Multi-Level Learnable Prompt (MLLP) and Multi-Level Semantic Alignment (MLSA) to enhance anomaly localization. Experiments show that FineGrainedAD outperforms existing methods on MVTec-AD and VisA datasets in few-shot settings.
本文通过提出多级细粒度语义描述(MFSC)和一种名为FineGrainedAD的新框架来解决少样本异常检测中的语义不对齐问题。MFSC为异常检测数据集提供详细的文本描述,FineGrainedAD包括多级可学习提示(MLLP)和多级语义对齐(MLSA),以增强异常定位。实验表明,FineGrainedAD在MVTec-AD和VisA数据集的少样本设置中优于现有方法。
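The patch-to-text matching that this line of work builds on can be sketched as a per-patch softmax over "normal" and "anomalous" text similarities. This shows the generic baseline mechanism only, not FineGrainedAD's MLLP or MLSA components; the temperature value and the toy embeddings are assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def anomaly_map(patch_feats, normal_text, anomalous_text, temperature=0.07):
    """Softmax over (normal, anomalous) similarity for every patch.

    Inputs are assumed to be L2-normalised already.
    """
    scores = []
    for patch in patch_feats:
        s_norm = math.exp(dot(patch, normal_text) / temperature)
        s_anom = math.exp(dot(patch, anomalous_text) / temperature)
        scores.append(s_anom / (s_norm + s_anom))
    return scores

normal_text, anomalous_text = [1.0, 0.0], [0.0, 1.0]
patches = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]   # clean, ambiguous, defective
scores = anomaly_map(patches, normal_text, anomalous_text)
```

With a single image-level description per class, every patch is compared against the same two text vectors, which is where the misalignment described in the abstract comes from.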
A-TPT: Angular Diversity Calibration Properties for Test-Time Prompt Tuning of Vision-Language Models
Authors: Shihab Aaqil Ahamed, Udaya S. K. P. Miriya Thanthrige, Ranga Rodrigo, Muhammad Haris Khan
First: 2025-10-30T12:45:24+00:00 · Latest: 2025-10-30T12:45:24+00:00
Comments: 23 pages, 14 figures
Abstract
Test-time prompt tuning (TPT) has emerged as a promising technique for adapting large vision-language models (VLMs) to unseen tasks without relying on labeled data. However, the lack of dispersion between textual features can hurt calibration performance, which raises concerns about VLMs' reliability, trustworthiness, and safety. Current TPT approaches primarily focus on improving prompt calibration by either maximizing average textual feature dispersion or enforcing orthogonality constraints to encourage angular separation. However, these methods may not always have optimal angular separation between class-wise textual features, which implies overlooking the critical role of angular diversity. To address this, we propose A-TPT, a novel TPT framework that introduces angular diversity to encourage uniformity in the distribution of normalized textual features induced by corresponding learnable prompts. This uniformity is achieved by maximizing the minimum pairwise angular distance between features on the unit hypersphere. We show that our approach consistently surpasses state-of-the-art TPT methods in reducing the aggregate average calibration error while maintaining comparable accuracy through extensive experiments with various backbones on different datasets. Notably, our approach exhibits superior zero-shot calibration performance on natural distribution shifts and generalizes well to medical datasets. We provide extensive analyses, including theoretical aspects, to establish the grounding of A-TPT. These results highlight the potency of promoting angular diversity to achieve well-dispersed textual features, significantly improving VLM calibration during test-time adaptation. Our code will be made publicly available.
中文标题/摘要
标题:A-TPT:视觉语言模型测试时提示调优的角多样性校准特性
测试时提示调优(TPT)已成为一种有前景的技术,用于在无需依赖标记数据的情况下,将大型视觉语言模型(VLMs)适应未见过的任务。然而,文本特征之间的缺乏分散性会损害校准性能,这引起了人们对VLMs可靠性和安全性的担忧。当前的TPT方法主要通过最大化平均文本特征分散性或施加正交约束来鼓励角度分离,以提高提示校准。然而,这些方法可能无法始终在类别间文本特征之间实现最优的角度分离,这意味着忽视了角多样性的关键作用。为了解决这个问题,我们提出了一种新颖的A-TPT框架,该框架引入了角多样性,以鼓励由相应可学习提示诱导的归一化文本特征的分布均匀性。这种均匀性是通过最大化单位超球面上特征之间的最小成对角度距离来实现的。我们通过在不同数据集上使用各种骨干网络进行广泛实验,展示了我们的方法在降低综合平均校准误差方面始终优于最先进的TPT方法,同时保持了相当的准确性。值得注意的是,我们的方法在自然分布转移的零样本校准性能方面表现出色,并且能够很好地泛化到医学数据集。我们提供了广泛的分析,包括理论方面,以建立A-TPT的基础。这些结果突显了促进角多样性以实现分散的文本特征的潜力,显著提高了VLM在测试时适应过程中的校准。我们的代码将公开发布。
Summary / 总结
The paper introduces A-TPT, a novel test-time prompt tuning framework that enhances the angular diversity of textual features to improve the calibration performance of vision-language models. By maximizing the minimum pairwise angular distance between features on the unit hypersphere, A-TPT consistently outperforms existing methods in reducing calibration errors while maintaining accuracy. The approach shows superior zero-shot calibration on natural distribution shifts and generalizes well to medical datasets.
论文提出了A-TPT,这是一种新颖的视觉-语言模型测试时提示调优框架,旨在通过增强角度多样性来提升校准性能。不同于现有方法侧重于平均特征分散或正交性,A-TPT通过在单位超球面上最大化特征之间的最小成对角度距离来实现这一目标。广泛的实验表明,A-TPT在减少校准误差、保持准确性方面优于最先进的TPT方法,特别是在自然分布偏移的零样本校准和医疗数据集上的表现尤为突出。
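The quantity A-TPT maximizes, the minimum pairwise angular distance between normalized textual features, is easy to compute directly. The sketch below evaluates it for two toy feature sets. Using its negation as a loss term is an assumption about how it might enter an objective, and the feature values are made up.

```python
import math
from itertools import combinations

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def min_pairwise_angle(features):
    """Smallest angle (radians) between any two L2-normalised features."""
    unit = [normalize(f) for f in features]
    angles = []
    for a, b in combinations(unit, 2):
        # Clamp to [-1, 1] to guard acos against floating-point drift.
        cos = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
        angles.append(math.acos(cos))
    return min(angles)

clustered = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
spread = [[1.0, 0.0], [-0.5, math.sqrt(3) / 2], [-0.5, -math.sqrt(3) / 2]]

loss_clustered = -min_pairwise_angle(clustered)   # hypothetical loss term
loss_spread = -min_pairwise_angle(spread)
```

Three vectors spaced evenly on the unit circle reach the largest possible minimum angle of 2π/3, whereas an average-dispersion objective could score well even when two class features nearly coincide.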
On-the-Fly OVD Adaptation with FLAME: Few-shot Localization via Active Marginal-Samples Exploration
Authors: Yehonathan Refael, Amit Aides, Aviad Barzilai, George Leifman, Genady Beryozkin, Vered Silverman, Bolous Jaber, Tomer Shekel
First: 2025-10-20T15:41:55+00:00 · Latest: 2025-10-30T12:05:58+00:00
Abstract
Open-vocabulary object detection (OVD) models offer remarkable flexibility by detecting objects from arbitrary text queries. However, their zero-shot performance in specialized domains like Remote Sensing (RS) is often compromised by the inherent ambiguity of natural language, limiting critical downstream applications. For instance, an OVD model may struggle to distinguish between fine-grained classes such as "fishing boat" and "yacht" since their embeddings are similar and often inseparable. This can hamper specific user goals, such as monitoring illegal fishing, by producing irrelevant detections. To address this, we propose a cascaded approach that couples the broad generalization of a large pre-trained OVD model with a lightweight few-shot classifier. Our method first employs the zero-shot model to generate high-recall object proposals. These proposals are then refined for high precision by a compact classifier trained in real time on only a handful of user-annotated examples, drastically reducing the high costs of RS imagery annotation. The core of our framework is FLAME, a one-step active learning strategy that selects the most informative samples for training. FLAME identifies, on the fly, uncertain marginal candidates near the decision boundary using density estimation, followed by clustering to ensure sample diversity. This efficient sampling technique achieves high accuracy without costly full-model fine-tuning and enables instant adaptation, within less than a minute, which is significantly faster than state-of-the-art alternatives. Our method consistently surpasses state-of-the-art performance on RS benchmarks, establishing a practical and resource-efficient framework for adapting foundation models to specific user needs.
中文标题/摘要
标题:基于FLAME的即时OVD适应:通过主动边际样本探索实现少样本定位
开放词汇对象检测(OVD)模型通过从任意文本查询中检测对象提供了显著的灵活性。然而,它们在诸如遥感(RS)等专门领域中的零样本性能往往因自然语言的固有歧义而受损,限制了关键的下游应用。例如,一个OVD模型可能难以区分“渔船”和“游艇”这类细粒度类别,因为它们的嵌入相似且经常不可分。这可能妨碍特定用户目标,如监测非法捕鱼,导致无关的检测结果。为了解决这一问题,我们提出了一种级联方法,将大型预训练OVD模型的广泛泛化与轻量级少样本分类器相结合。我们的方法首先使用零样本模型生成高召回的对象提案,然后通过仅在少量用户标注示例上实时训练的小型分类器进行高精度细化,从而大幅降低RS图像标注的高昂成本。我们框架的核心是FLAME,这是一种一步式主动学习策略,能够选择最具信息量的样本进行训练。FLAME利用密度估计在决策边界附近即时识别不确定的边际候选样本,然后通过聚类确保样本多样性。这种高效的采样技术在无需昂贵的全模型微调的情况下实现了高精度,并能够在不到一分钟内实现即时适应,显著快于最先进的替代方案。我们的方法在RS基准测试中始终超越了最先进的性能,建立了一个实用且资源高效的框架,用于将基础模型适应特定用户需求。
Summary / 总结
The paper addresses the challenge of low zero-shot performance of open-vocabulary object detection (OVD) models in specialized domains like Remote Sensing (RS) due to the ambiguity of natural language. It proposes a cascaded approach combining a large pre-trained OVD model with a lightweight few-shot classifier. The method uses the pre-trained model to generate high-recall object proposals, which are then refined by a compact classifier trained on a handful of user-annotated examples. The core of the framework is FLAME, a one-step active learning strategy that selects informative samples for training. FLAME uses density estimation to identify uncertain samples near the decision boundary and clustering to ensure sample diversity. This approach achieves high accuracy without full-model fine-tuning, adapts in less than a minute, and outperforms state-of-the-art methods on RS benchmarks.
论文针对开放词汇对象检测(OVD)模型在遥感(RS)等专业领域中零样本性能不佳、细粒度类别难以区分的问题,提出了一种级联方法,结合大型预训练OVD模型和轻量级的少样本分类器。该方法使用预训练模型生成对象提案,然后通过在少量用户标注样本上训练的紧凑分类器进行精炼。框架的核心是FLAME,这是一种实时选择最具信息量训练样本的主动学习策略。FLAME通过密度估计识别决策边界附近的不确定样本,并通过聚类确保样本多样性。这种方法实现了高精度,并能在不到一分钟内完成即时适应,在RS基准测试中超越了最先进的性能。
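The two ingredients of FLAME's sample selection, uncertainty near the decision boundary and diversity among the picked samples, can be sketched as follows. Note the substitutions: the paper uses density estimation and clustering, while this sketch swaps in a simpler score-margin ranking and greedy farthest-point selection. All names, embeddings, and scores are hypothetical.

```python
import math

def select_for_annotation(embeddings, scores, pool_size, budget):
    """Pick `budget` diverse samples from the `pool_size` most uncertain ones."""
    # 1) Uncertainty: scores nearest 0.5 sit closest to the decision boundary.
    order = sorted(range(len(scores)), key=lambda i: abs(scores[i] - 0.5))
    pool = order[:pool_size]
    # 2) Diversity: greedy farthest-point selection inside the pool.
    chosen = [pool[0]]
    while len(chosen) < min(budget, len(pool)):
        best = max(
            (i for i in pool if i not in chosen),
            key=lambda i: min(math.dist(embeddings[i], embeddings[j]) for j in chosen),
        )
        chosen.append(best)
    return chosen

embeddings = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [9.0, 9.0], [0.0, 0.2]]
scores = [0.50, 0.52, 0.45, 0.99, 0.48]   # detector confidence per proposal
picked = select_for_annotation(embeddings, scores, pool_size=4, budget=2)
```

The confidently classified proposal (score 0.99) never reaches the annotator, and the second pick comes from a different region of the embedding space than the first, so a small labeling budget is not spent on near-duplicates.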
MedSAE: Dissecting MedCLIP Representations with Sparse Autoencoders
Authors: Riccardo Renzulli, Colas Lepoutre, Enrico Cassano, Marco Grangetto
First: 2025-10-30T11:58:36+00:00 · Latest: 2025-10-30T11:58:36+00:00
Abstract
Artificial intelligence in healthcare requires models that are accurate and interpretable. We advance mechanistic interpretability in medical vision by applying Medical Sparse Autoencoders (MedSAEs) to the latent space of MedCLIP, a vision-language model trained on chest radiographs and reports. To quantify interpretability, we propose an evaluation framework that combines correlation metrics, entropy analyses, and automated neuron naming via the MedGEMMA foundation model. Experiments on the CheXpert dataset show that MedSAE neurons achieve higher monosemanticity and interpretability than raw MedCLIP features. Our findings bridge high-performing medical AI and transparency, offering a scalable step toward clinically reliable representations.
中文标题/摘要
标题:MedSAE:通过稀疏自编码器剖析MedCLIP表示
医疗保健中的人工智能需要准确且可解释的模型。我们通过将医疗稀疏自编码器(MedSAEs)应用于MedCLIP的潜在空间,推进了医学视觉的机制可解释性,MedCLIP是一种在胸部X光片和报告上训练的视觉-语言模型。为了量化可解释性,我们提出了一种结合相关性度量、熵分析和通过MedGEMMA基础模型自动命名神经元的评估框架。在CheXpert数据集上的实验表明,MedSAE神经元在单义性和可解释性方面优于原始的MedCLIP特征。我们的研究结果将高性能的医疗AI与透明度相结合,提供了一条通往临床可靠表示的可扩展途径。
Summary / 总结
The research aims to enhance the interpretability of medical vision models by applying Medical Sparse Autoencoders (MedSAEs) to the latent space of MedCLIP, a vision-language model trained on chest radiographs and reports. The study proposes an evaluation framework using correlation metrics, entropy analysis, and automated neuron naming via MedGEMMA. Experiments on the CheXpert dataset demonstrate that MedSAE neurons provide higher monosemanticity and interpretability compared to raw MedCLIP features, bridging high-performing medical AI with transparency.
研究旨在通过将Medical Sparse Autoencoders (MedSAEs)应用于MedCLIP的潜在空间,提高医学视觉模型的可解释性,MedCLIP是一个在胸部X光片和报告上训练的视觉-语言模型。研究提出了一种评估框架,结合了相关性度量、熵分析和通过MedGEMMA基础模型的自动神经元命名。实验结果表明,MedSAE神经元在单义性和可解释性方面优于原始的MedCLIP特征,实现了高性能医学AI与透明度的结合。
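For readers unfamiliar with sparse autoencoders, the sketch below shows the standard forward pass and loss (reconstruction error plus an L1 sparsity penalty) on a toy embedding. The weights, dimensions, and `l1_coeff` are illustrative assumptions; the abstract does not give MedSAE's architecture or hyperparameters.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def sae_forward(x, w_enc, b_enc, w_dec, l1_coeff=0.01):
    """Return (latent code, reconstruction, loss) for one embedding."""
    z = relu([a + b for a, b in zip(matvec(w_enc, x), b_enc)])
    x_hat = matvec(w_dec, z)
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    loss = mse + l1_coeff * sum(z)      # z >= 0, so sum(z) is its L1 norm
    return z, x_hat, loss

# Toy weights: 2-D input, 4 latent units (an overcomplete dictionary).
w_enc = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b_enc = [0.0, 0.0, 0.0, 0.0]
w_dec = [[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]]
z, x_hat, loss = sae_forward([2.0, -3.0], w_enc, b_enc, w_dec)
```

Only two of the four latent units fire for this input, and each active unit corresponds to one direction in the input space, which is the intuition behind the monosemanticity claim.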
TRUST-VL: An Explainable News Assistant for General Multimodal Misinformation Detection
Authors: Zehong Yan, Peng Qi, Wynne Hsu, Mong Li Lee
Venue: EMNLP 2025 Oral
First: 2025-09-04T17:59:43+00:00 · Latest: 2025-10-30T10:58:04+00:00
Comments: EMNLP 2025 Oral; Project Homepage: https://yanzehong.github.io/trust-vl/
Abstract
Multimodal misinformation, encompassing textual, visual, and cross-modal distortions, poses an increasing societal threat that is amplified by generative AI. Existing methods typically focus on a single type of distortion and struggle to generalize to unseen scenarios. In this work, we observe that different distortion types share common reasoning capabilities while also requiring task-specific skills. We hypothesize that joint training across distortion types facilitates knowledge sharing and enhances the model's ability to generalize. To this end, we introduce TRUST-VL, a unified and explainable vision-language model for general multimodal misinformation detection. TRUST-VL incorporates a novel Question-Aware Visual Amplifier module, designed to extract task-specific visual features. To support training, we also construct TRUST-Instruct, a large-scale instruction dataset containing 198K samples featuring structured reasoning chains aligned with human fact-checking workflows. Extensive experiments on both in-domain and zero-shot benchmarks demonstrate that TRUST-VL achieves state-of-the-art performance, while also offering strong generalization and interpretability.
中文标题/摘要
标题:TRUST-VL:面向通用多模态虚假信息检测的可解释新闻助手
多模态虚假信息,包括文本、视觉和跨模态的扭曲,构成了日益严重的社会威胁,而生成式AI进一步加剧了这一威胁。现有方法通常专注于单一类型的扭曲,并难以泛化到未见过的场景。在本文中,我们观察到不同类型的扭曲共享一些共同的推理能力,同时也需要特定任务的技能。我们假设跨类型联合训练有助于知识共享并增强模型的泛化能力。为此,我们引入了TRUST-VL,这是一种统一且可解释的视觉语言模型,用于通用多模态虚假信息检测。TRUST-VL 包含一个新颖的问题感知视觉放大(Question-Aware Visual Amplifier)模块,旨在提取特定任务的视觉特征。为了支持训练,我们还构建了TRUST-Instruct,这是一个包含198K样本的大规模指令数据集,样本中包含与人类事实核查工作流程对齐的结构化推理链。在领域内和零样本基准上的广泛实验表明,TRUST-VL 达到了最先进的性能,同时提供了强大的泛化能力和可解释性。
Summary / 总结
The research aims to address the challenge of detecting multimodal misinformation, which includes textual, visual, and cross-modal distortions, by developing a unified model that can handle different types of distortions. TRUST-VL, a vision-language model, is introduced, which includes a Question-Aware Visual Amplifier module to extract task-specific visual features. The model is trained using TRUST-Instruct, a large dataset of 198K samples with structured reasoning chains. Experiments show that TRUST-VL outperforms existing methods and demonstrates strong generalization and interpretability capabilities.
研究旨在应对由生成式AI加剧的文本、视觉和跨模态混合错误信息的挑战。作者提出了一种统一的视觉语言模型TRUST-VL,该模型包含一个问题感知视觉放大模块,用于提取任务特定的视觉特征。TRUST-VL基于包含198K样本的大规模指令数据集TRUST-Instruct进行训练,这些样本具有与人类事实核查工作流程对齐的结构化推理链。实验结果表明,TRUST-VL在领域内和零样本基准测试中均表现出色,具有强大的泛化能力和可解释性。
D-HUMOR: Dark Humor Understanding via Multimodal Open-ended Reasoning -- A Benchmark Dataset and Method
Authors: Sai Kartheek Reddy Kasu, Mohammad Zia Ur Rehman, Shahid Shafi Dar, Rishi Bharat Junghare, Dhanvin Sanjay Namboodiri, Nagendra Kumar
First: 2025-09-08T14:55:16+00:00 · Latest: 2025-10-30T10:15:05+00:00
Comments: Accepted at IEEE International Conference on Data Mining (ICDM) 2025
Abstract
Dark humor in online memes poses unique challenges due to its reliance on implicit, sensitive, and culturally contextual cues. To address the lack of resources and methods for detecting dark humor in multimodal content, we introduce a novel dataset of 4,379 Reddit memes annotated for dark humor, target category (gender, mental health, violence, race, disability, and other), and a three-level intensity rating (mild, moderate, severe). Building on this resource, we propose a reasoning-augmented framework that first generates structured explanations for each meme using a Large Vision-Language Model (VLM). Through a Role-Reversal Self-Loop, VLM adopts the author's perspective to iteratively refine its explanations, ensuring completeness and alignment. We then extract textual features from both the OCR transcript and the self-refined reasoning via a text encoder, while visual features are obtained using a vision transformer. A Tri-stream Cross-Reasoning Network (TCRNet) fuses these three streams, text, image, and reasoning, via pairwise attention mechanisms, producing a unified representation for classification. Experimental results demonstrate that our approach outperforms strong baselines across three tasks: dark humor detection, target identification, and intensity prediction. The dataset, annotations, and code are released to facilitate further research in multimodal humor understanding and content moderation. Code and Dataset are available at: https://github.com/Sai-Kartheek-Reddy/D-Humor-Dark-Humor-Understanding-via-Multimodal-Open-ended-Reasoning
中文标题/摘要
标题:D-HUMOR:通过多模态开放式推理理解黑色幽默——基准数据集与方法
在线表情包中的黑色幽默因其依赖于隐含、敏感和文化背景的提示而面临独特挑战。为了解决检测多模态内容中黑色幽默资源和方法的缺乏,我们引入了一个包含4,379个带有黑色幽默标注的Reddit表情包的数据集,标注了目标类别(性别、心理健康、暴力、种族、残疾和其他)和三级强度评分(轻微、中等、严重)。在此基础上,我们提出了一种增强推理框架,首先使用大型视觉-语言模型(VLM)为每个表情包生成结构化解释。通过角色反转自循环,VLM 采用作者的视角迭代优化其解释,确保完整性和一致性。然后,我们从OCR转录文本和自优化推理中提取文本特征,使用视觉变换器获取视觉特征。三流交叉推理网络(TCRNet)通过成对注意力机制融合这三流,即文本、图像和推理,生成统一表示进行分类。实验结果表明,我们的方法在黑色幽默检测、目标识别和强度预测三项任务上均优于强基线。该数据集、标注和代码已发布,以促进多模态幽默理解和内容审核方面的进一步研究。代码和数据集可在以下链接获取:https://github.com/Sai-Kartheek-Reddy/D-Humor-Dark-Humor-Understanding-via-Multimodal-Open-ended-Reasoning
Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models
Authors: Shiho Matta, Lis Kanashiro Pereira, Peitao Han, Fei Cheng, Shigeru Kitazawa
First: 2025-10-30T08:21:50+00:00 · Latest: 2025-10-30T08:21:50+00:00
Comments: 10 pages
Abstract
Modern vision-language models (VLMs) excel at many multimodal tasks, yet their grasp of temporal information in video remains weak and, crucially, under-evaluated. We probe this gap with a deceptively simple but revealing challenge: judging the arrow of time (AoT), that is, whether a short clip is played forward or backward. We introduce AoT-PsyPhyBENCH, a psychophysically validated benchmark that tests whether VLMs can infer temporal direction in natural videos using the same stimuli and behavioral baselines established for humans. Our comprehensive evaluation of open-weight and proprietary, reasoning and non-reasoning VLMs reveals that most models perform near chance, and even the best lag far behind human accuracy on physically irreversible processes (e.g., free fall, diffusion/explosion) and causal manual actions (division/addition) that humans recognize almost instantly. These results highlight a fundamental gap in current multimodal systems: while they capture rich visual-semantic correlations, they lack the inductive biases required for temporal continuity and causal understanding. We release the code and data for AoT-PsyPhyBENCH to encourage further progress in the physical and temporal reasoning capabilities of VLMs.
中文标题/摘要
标题:时间流动的方向如何?基于心理物理学的视觉-语言模型评估
现代视觉-语言模型(VLMs)在许多多模态任务中表现出色,但在视频中的时间信息理解方面仍然薄弱且未得到充分评估。我们通过一个看似简单但揭示性强的挑战来探索这一差距:判断时间箭头(AoT),即判断短片段是正向播放还是反向播放。我们引入了AoT-PsyPhyBENCH,这是一个经心理物理学验证的基准测试,使用为人类建立的相同刺激和行为基线,测试VLMs是否能在自然视频中推断出时间方向。我们对开放权重和专有、推理和非推理VLMs的全面评估显示,大多数模型的表现接近随机猜测,甚至最好的模型在物理不可逆过程(如自由落体、扩散/爆炸)和因果手动动作(分割/添加)上也远远落后于人类的准确率,而人类几乎可以瞬间识别这些过程。这些结果突显了当前多模态系统中的一个基本差距:虽然它们捕捉到了丰富的视觉-语义关联,但缺乏时间连续性和因果理解所需的归纳偏置。我们发布了AoT-PsyPhyBENCH的代码和数据,以推动VLMs在物理和时间推理能力方面的进一步发展。
Summary / 总结
This study evaluates the temporal understanding of vision-language models (VLMs) by introducing AoT-PsyPhyBENCH, a psychophysically validated benchmark. The models were tested on their ability to determine whether short video clips are played forward or backward. Most models performed near chance, and even the best lagged far behind human accuracy on physically irreversible processes and causal manual actions, indicating a significant gap in their temporal reasoning capabilities. The results suggest that VLMs lack the inductive biases required for temporal continuity and causal understanding.
该研究通过引入基于心理物理验证的AoT-PsyPhyBENCH基准,评估了视觉语言模型(VLMs)在理解视频中时间信息方面的能力。评估结果显示,大多数VLMs在识别自然视频中的时间箭头时表现接近随机,即使是表现最好的模型在识别不可逆过程和因果动作方面也远远落后于人类的准确度。这突显了当前VLMs在时间推理能力方面的关键缺陷,尽管它们在视觉语义理解方面表现出色。
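In a two-way forward/backward judgment, "near chance" means an accuracy close to 50%. The following is a minimal sketch, not taken from the paper, of one way to check whether an observed accuracy is distinguishable from chance with an exact two-sided binomial test. The function name `binomial_two_sided_p` and the trial counts are hypothetical illustrations, not AoT-PsyPhyBENCH results.

```python
from math import comb


def binomial_two_sided_p(correct, total, chance=0.5):
    """Exact two-sided binomial p-value for `correct` hits out of `total` trials.

    Sums the probability of every outcome that is no more likely than the
    observed one under the chance-level hypothesis.
    """
    def pmf(k):
        return comb(total, k) * chance**k * (1 - chance) ** (total - k)

    observed = pmf(correct)
    # Small tolerance so outcomes tied with the observed one are included.
    tail = sum(pmf(k) for k in range(total + 1) if pmf(k) <= observed * (1 + 1e-9))
    return min(1.0, tail)


# Hypothetical counts: 54/100 correct is hard to tell apart from guessing,
# while 90/100 correct is clearly above chance.
p_near_chance = binomial_two_sided_p(54, 100)
p_well_above = binomial_two_sided_p(90, 100)
```

Under these assumed counts, the first p-value stays well above the usual 0.05 threshold and the second is vanishingly small, which is the distinction the abstract draws between most models and human observers.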
MV-MLM: Bridging Multi-View Mammography and Language for Breast Cancer Diagnosis and Risk Prediction
Authors: Shunjie-Fabian Zheng, Hyeonjun Lee, Thijs Kooi, Ali Diba
Venue: ICCV 2025
First: 2025-10-30T05:12:29+00:00 · Latest: 2025-10-30T05:12:29+00:00
Comments: Accepted to Computer Vision for Automated Medical Diagnosis (CVAMD) Workshop at ICCV 2025
Abstract
Large annotated datasets are essential for training robust Computer-Aided Diagnosis (CAD) models for breast cancer detection or risk prediction. However, acquiring such datasets with fine-grained annotation is both costly and time-consuming. Vision-Language Models (VLMs), such as CLIP, which are pre-trained on large image-text pairs, offer a promising solution by enhancing robustness and data efficiency in medical imaging tasks. This paper introduces a novel Multi-View Mammography and Language Model for breast cancer classification and risk prediction, trained on a dataset of paired mammogram images and synthetic radiology reports. Our MV-MLM leverages multi-view supervision to learn rich representations from extensive radiology data by employing cross-modal self-supervision across image-text pairs. This includes multiple views and the corresponding pseudo-radiology reports. We propose a novel joint visual-textual learning strategy to enhance generalization and accuracy over different data types and tasks, to distinguish breast tissues or cancer characteristics (calcification, mass), and to utilize these patterns to understand mammography images and predict cancer risk. We evaluated our method on both private and publicly available datasets, demonstrating that the proposed model achieves state-of-the-art performance in three classification tasks: (1) malignancy classification, (2) subtype classification, and (3) image-based cancer risk prediction. Furthermore, the model exhibits strong data efficiency, outperforming existing fully supervised or VLM baselines while trained on synthetic text reports and without the need for actual radiology reports.
中文标题/摘要
标题:MV-MLM:连接多视角乳腺X线摄影与语言以实现乳腺癌诊断与风险预测
大规模标注数据集对于训练用于乳腺癌检测或风险预测的稳健计算机辅助诊断(CAD)模型至关重要。然而,获取具有精细详细标注的数据集既昂贵又耗时。视觉-语言模型(VLMs),如CLIP,通过在大规模图像-文本对上进行预训练,提供了增强医疗成像任务中鲁棒性和数据效率的有希望的解决方案。本文介绍了一种新的多视角乳腺X线摄影和语言模型,用于乳腺癌分类和风险预测,该模型基于配对的乳腺X线摄影图像和合成放射学报告数据集进行训练。我们的MV-MLM利用多视角监督,通过跨模态自监督从广泛的放射学数据中学习丰富的表示。这包括多个视角及其相应的伪放射学报告。我们提出了一种新的联合视觉-文本学习策略,以增强在不同数据类型和任务上的泛化和准确性性能,区分乳腺组织或癌症特征(钙化、肿块),并利用这些模式来理解乳腺X线摄影图像和预测癌症风险。我们在私人和公开可用的数据集上评估了该方法,证明了所提出模型在三个分类任务中的最佳性能:(1) 恶性分类,(2) 亚型分类,(3) 图像基癌症风险预测。此外,该模型表现出强大的数据效率,在使用合成文本报告进行训练且无需实际放射学报告的情况下,优于现有的完全监督或VLM基线。
Summary / 总结
The research aims to develop a robust Computer-Aided Diagnosis (CAD) model for breast cancer detection and risk prediction using a novel Multi-View Mammography and Language Model (MV-MLM). The model leverages Vision-Language Models (VLMs) and multi-view supervision to learn from paired mammogram images and synthetic radiology reports. Experimental results show that MV-MLM outperforms existing methods in three classification tasks and demonstrates strong data efficiency, achieving state-of-the-art performance without the need for actual radiology reports.
研究旨在利用新型多视图乳腺成像和语言模型(MV-MLM)开发一种用于乳腺癌检测和风险预测的稳健计算机辅助诊断(CAD)模型。该模型利用视觉语言模型(VLMs)和多视图监督从配对的乳腺X光图像和合成放射学报告中学习。实验结果表明,MV-MLM在三个分类任务中超越了现有方法,并展示了强大的数据效率,在无需实际放射学报告的情况下实现了最先进的性能。
GUI Knowledge Bench: Revealing the Knowledge Gap Behind VLM Failures in GUI Tasks
Authors: Chenrui Shi, Zedong Yu, Zhi Gao, Ruining Feng, Enqi Liu, Yuwei Wu, Yunde Jia, Liuyu Xiang, Zhaofeng He, Qing Li
First: 2025-10-30T03:22:30+00:00 · Latest: 2025-10-30T03:22:30+00:00
Abstract
Large vision language models (VLMs) have advanced graphical user interface (GUI) task automation but still lag behind humans. We hypothesize this gap stems from missing core GUI knowledge, which existing training schemes (such as supervised fine-tuning and reinforcement learning) alone cannot fully address. By analyzing common failure patterns in GUI task execution, we distill GUI knowledge into three dimensions: (1) interface perception, knowledge about recognizing widgets and system states; (2) interaction prediction, knowledge about reasoning action state transitions; and (3) instruction understanding, knowledge about planning, verifying, and assessing task completion progress. We further introduce GUI Knowledge Bench, a benchmark with multiple-choice and yes/no questions across six platforms (Web, Android, macOS, Windows, Linux, iOS) and 292 applications. Our evaluation shows that current VLMs identify widget functions but struggle with perceiving system states, predicting actions, and verifying task completion. Experiments on real-world GUI tasks further validate the close link between GUI knowledge and task success. By providing a structured framework for assessing GUI knowledge, our work supports the selection of VLMs with greater potential prior to downstream training and provides insights for building more capable GUI agents.
中文标题/摘要
标题:GUI知识基准:揭示GUI任务中VLM失败背后的知识差距
大型视觉语言模型(VLMs)在图形用户界面(GUI)任务自动化方面取得了进展,但仍落后于人类。我们假设这种差距源于缺失的核心GUI知识,而现有的训练方案(如监督微调和强化学习)无法完全解决这一问题。通过分析GUI任务执行中的常见失败模式,我们将GUI知识提炼为三个维度:(1)界面感知,关于识别控件和系统状态的知识;(2)交互预测,关于推理动作状态转换的知识;(3)指令理解,关于规划、验证和评估任务完成进度的知识。我们进一步介绍了GUI知识基准,这是一个包含跨六个平台(Web、Android、MacOS、Windows、Linux、iOS)和292个应用程序的多项选择和是/非问题的基准。我们的评估显示,当前的VLMs能够识别控件功能,但在感知系统状态、预测动作和验证任务完成方面存在困难。在真实世界GUI任务上的实验进一步验证了GUI知识与任务成功之间的密切联系。通过提供一个结构化的框架来评估GUI知识,我们的工作支持在下游训练前选择具有更大潜力的VLMs,并为构建更强大的GUI代理提供了见解。
Summary / 总结
The research aims to identify the knowledge gap in large vision language models (VLMs) for GUI task automation, hypothesizing that this gap arises from insufficient core GUI knowledge. The study analyzes common failure patterns and distills GUI knowledge into three dimensions: interface perception, interaction prediction, and instruction understanding. The GUI Knowledge Bench, a benchmark with questions across six platforms and 292 applications, evaluates current VLMs, revealing their struggles with perceiving system states, predicting actions, and verifying task completion. The findings highlight the importance of GUI knowledge for task success and suggest a structured framework for assessing and improving VLMs for GUI tasks.
研究旨在识别大型视觉语言模型(VLMs)在GUI任务自动化中的知识缺口,假设这种差距源于核心GUI知识的不足。研究分析了常见的失败模式,并将GUI知识归纳为三个维度:界面感知、交互预测和指令理解。GUI知识基准包括针对六个平台和292个应用程序的问题,评估当前的VLMs,结果显示它们在感知系统状态、预测动作和验证任务完成方面存在困难。研究结果强调了GUI知识对于任务成功的重要性,并提供了一个结构化的框架来评估和改进VLMs以用于GUI任务。
Empowering Agentic Video Analytics Systems with Video Language Models
Authors: Yuxuan Yan, Shiqi Jiang, Ting Cao, Yifan Yang, Qianqian Yang, Yuanchao Shu, Yuqing Yang, Lili Qiu
First: 2025-05-01T02:40:23+00:00 · Latest: 2025-10-30T03:12:42+00:00
Comments: Accepted to NSDI 2026, 19 pages, 12 figures, complementary evaluations and appendix
Abstract
AI-driven video analytics has become increasingly important across diverse domains. However, existing systems are often constrained to specific, predefined tasks, limiting their adaptability in open-ended analytical scenarios. The recent emergence of Vision Language Models (VLMs) as transformative technologies offers significant potential for enabling open-ended video understanding, reasoning, and analytics. Nevertheless, their limited context windows present challenges when processing ultra-long video content, which is prevalent in real-world applications. To address this, we introduce AVA, a VLM-powered system designed for open-ended, advanced video analytics. AVA incorporates two key innovations: (1) the near real-time construction of Event Knowledge Graphs (EKGs) for efficient indexing of long or continuous video streams, and (2) an agentic retrieval-generation mechanism that leverages EKGs to handle complex and diverse queries. Comprehensive evaluations on public benchmarks, LVBench and VideoMME-Long, demonstrate that AVA achieves state-of-the-art performance, attaining 62.3% and 64.1% accuracy, respectively, significantly surpassing existing VLM and video Retrieval-Augmented Generation (RAG) systems. Furthermore, to evaluate video analytics in ultra-long and open-world video scenarios, we introduce a new benchmark, AVA-100. This benchmark comprises 8 videos, each exceeding 10 hours in duration, along with 120 manually annotated, diverse, and complex question-answer pairs. On AVA-100, AVA achieves top-tier performance with an accuracy of 75.8%. The source code of AVA is available at https://github.com/I-ESC/Project-Ava. The AVA-100 benchmark can be accessed at https://huggingface.co/datasets/iesc/Ava-100.
中文标题/摘要
标题:利用视频语言模型赋能代理型视频分析系统
AI驱动的视频分析在多个领域变得越来越重要。然而,现有的系统通常局限于特定的、预定义的任务,限制了它们在开放分析场景中的适应性。最近,视觉语言模型(VLMs)的出现为实现开放式的视频理解、推理和分析提供了巨大的潜力。然而,它们有限的上下文窗口在处理超长视频内容时带来了挑战,而这种内容在实际应用中非常普遍。为了解决这个问题,我们提出了AVA,这是一种基于VLM的系统,旨在实现开放式的高级视频分析。AVA包含两项关键创新:(1)近实时构建事件知识图谱(EKGs)以高效索引长或连续的视频流,(2)一种代理检索生成机制,利用EKGs处理复杂的多样查询。在公共基准LVBench和VideoMME-Long上的全面评估表明,AVA达到了最先进的性能,分别取得了62.3%和64.1%的准确率,显著超过了现有的VLM和视频检索增强生成(RAG)系统。此外,为了评估超长和开放世界的视频分析,我们引入了一个新的基准AVA-100。该基准包括8个超过10小时的视频,以及120个手动标注的、多样且复杂的问答对。在AVA-100上,AVA取得了顶级的性能,准确率为75.8%。AVA的源代码可在https://github.com/I-ESC/Project-Ava获取。AVA-100基准数据集可在https://huggingface.co/datasets/iesc/Ava-100获取。
Summary / 总结
The research aims to enhance the adaptability of AI-driven video analytics systems by leveraging Vision Language Models (VLMs). AVA, a VLM-powered system, introduces Event Knowledge Graphs (EKGs) for efficient indexing and an agentic retrieval-generation mechanism to handle complex queries. AVA demonstrates superior performance on public benchmarks, achieving 62.3% and 64.1% accuracy, and 75.8% accuracy on the newly introduced AVA-100 benchmark, surpassing existing systems.
研究旨在通过利用Vision Language Models (VLMs)来增强AI驱动的视频分析系统的适应性。AVA是一个VLM驱动的系统,引入了事件知识图谱(EKGs)进行高效索引,并采用一种能处理复杂查询的自主检索生成机制。AVA在公共基准测试LVBench和VideoMME-Long上分别实现了62.3%和64.1%的准确率,并在新引入的AVA-100基准测试中,针对超长视频实现了75.8%的准确率。
Unleashing Diffusion Transformers for Visual Correspondence by Modulating Massive Activations
Authors: Chaofan Gan, Yuanpeng Tu, Xi Chen, Tieyuan Chen, Yuxi Li, Mehrtash Harandi, Weiyao Lin
Venue: NeurIPS 2025
First: 2025-05-24T08:20:36+00:00 · Latest: 2025-10-30T02:59:44+00:00
Comments: NeurIPS 2025
Abstract
Pre-trained stable diffusion models (SD) have shown great advances in visual correspondence. In this paper, we investigate the capabilities of Diffusion Transformers (DiTs) for accurate dense correspondence. Distinct from SD, DiTs exhibit a critical phenomenon in which very few feature activations exhibit significantly larger values than others, known as massive activations, leading to uninformative representations and significant performance degradation for DiTs. The massive activations consistently concentrate at very few fixed dimensions across all image patch tokens, holding little local information. We trace these dimension-concentrated massive activations and find that such concentration can be effectively localized by the zero-initialized Adaptive Layer Norm (AdaLN-zero). Building on these findings, we propose Diffusion Transformer Feature (DiTF), a training-free framework designed to extract semantic-discriminative features from DiTs. Specifically, DiTF employs AdaLN to adaptively localize and normalize massive activations with channel-wise modulation. In addition, we develop a channel discard strategy to further eliminate the negative impacts from massive activations. Experimental results demonstrate that our DiTF outperforms both DINO and SD-based models and establishes a new state-of-the-art performance for DiTs in different visual correspondence tasks (e.g., +9.4% on Spair-71k and +4.4% on AP-10K-C.S.).
中文标题/摘要
标题:通过调节大规模激活释放扩散变换器的视觉对应能力
预训练的稳定扩散模型(SD)在视觉对应方面取得了巨大进展。本文研究了扩散变换器(DiTs)在精确密集对应方面的能力。与SD不同,DiTs表现出一种关键现象,即极少数特征激活值显著大于其他值,称为“大规模激活”,导致DiTs的不具信息性表示和显著性能下降。大规模激活在所有图像块标记中始终集中在非常少数的固定维度上,几乎没有局部信息。我们追踪这些维度集中的大规模激活,并发现这种集中可以通过零初始化的自适应层归一化(AdaLN-zero)有效定位。基于这些发现,我们提出了一种无需训练的扩散变换器特征(DiTF)框架,旨在从DiTs中提取语义区分特征。具体而言,DiTF使用AdaLN按通道调节和归一化大规模激活。此外,我们开发了一种通道丢弃策略,进一步消除大规模激活的负面影响。实验结果表明,我们的DiTF在不同视觉对应任务中均优于DINO和基于SD的模型,并在Spair-71k和AP-10K-C.S.上分别建立了DiTs的新最佳性能(+9.4%和+4.4%)。
Summary / 总结
This paper investigates the use of Diffusion Transformers (DiTs) for visual correspondence, focusing on the issue of massive activations that lead to uninformative representations and performance degradation. The authors propose DiTF, a training-free framework that uses AdaLN to adaptively localize and normalize massive activations, and a channel discard strategy to eliminate their negative impacts. Experimental results show that DiTF outperforms both DINO and SD-based models, establishing a new state-of-the-art performance in visual correspondence tasks.
本文研究了Diffusion Transformers (DiTs)在视觉对应中的应用,重点关注大规模激活导致的无信息表示和性能下降问题。作者提出了一种名为DiTF的无需训练的框架,利用AdaLN对大规模激活进行自适应定位和归一化,并开发了一种通道丢弃策略以消除其负面影响。实验结果表明,DiTF在视觉对应任务中(例如,在Spair-71k上提高了9.4%,在AP-10K-C.S.上提高了4.4%)优于DINO和基于SD的模型,建立了新的性能基准。
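The abstract describes massive activations as a handful of fixed channels whose values dwarf all others across every patch token, and mentions a channel discard strategy. The sketch below illustrates that general idea only, on a toy token-by-channel feature matrix. It is not the authors' DiTF implementation and omits the AdaLN modulation entirely; the function names, the median-based detection rule, and the `ratio` threshold are all assumptions made for illustration.

```python
from statistics import median


def find_massive_channels(features, ratio=10.0):
    """Return indices of channels whose mean |activation| across tokens
    exceeds `ratio` times the median channel magnitude (hypothetical rule)."""
    num_channels = len(features[0])
    mean_abs = [
        sum(abs(token[c]) for token in features) / len(features)
        for c in range(num_channels)
    ]
    typical = median(mean_abs)
    return [c for c, m in enumerate(mean_abs) if m > ratio * typical]


def discard_channels(features, channels):
    """Zero out the given channels in every token vector."""
    drop = set(channels)
    return [
        [0.0 if c in drop else value for c, value in enumerate(token)]
        for token in features
    ]


# Toy features: 3 patch tokens x 4 channels; channel 2 is "massive"
# in every token and carries little token-specific information.
features = [
    [0.2, -0.1, 50.0, 0.3],
    [0.1, 0.4, 48.0, -0.2],
    [-0.3, 0.2, 52.0, 0.1],
]
massive = find_massive_channels(features)
cleaned = discard_channels(features, massive)
```

In this toy case only channel 2 is flagged, and zeroing it leaves the remaining channels, which differ from token to token, to drive any similarity comparison between patches.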
DSDE: Dynamic Speculative Decoding with KLD Stability for Real-World Serving
Authors: Mingyu Yang, Jae-Young Choi, Kihyo Moon, Minsung Jang, Eunjoo Jeon
First: 2025-09-01T03:13:50+00:00 · Latest: 2025-10-30T02:05:44+00:00
Comments: Accepted for presentation at the IEEE BigData 2025 Workshop (Special Session on Intelligent Data Mining). This v2 updates formatting and adds IEEE copyright notice
Abstract
Speculative decoding accelerates large language model inference, but its reliance on a fixed speculation length is suboptimal in large-batch serving environments with diverse requests. This paper explores a new direction for dynamic adaptation by investigating a novel class of post-hoc, diagnostic signals. We propose Dynamic Speculative Decoding Engine (DSDE), a training-free framework built on two primary components: (1) a predictive signal based on the variance of the Kullback-Leibler divergence (KLD), which diagnoses the generation's regional stability, and (2) an adaptive speculation length cap to mitigate the straggler problem in per-sequence decoding. Experiments demonstrate the potential of using KLD-based stability signals for dynamic adaptation. An algorithm guided by these signals achieves end-to-end latency competitive with leading baselines and exhibits superior robustness across diverse workloads. This robustness is particularly valuable in challenging low-acceptance-rate regimes, where the proposed signal maintains its diagnostic utility. Collectively, these findings validate post-hoc signals as a valuable component for building more robust and intelligent LLM inference systems, and highlight a promising direction for future research on dynamic speculation length adaptation.
中文标题/摘要
标题:DSDE:基于KLD稳定性动态推测解码用于实际服务
推测解码加速了大型语言模型的推理,但在具有多样化请求的大批量服务环境中,其依赖于固定推测长度是不理想的。本文探索了一种新的动态适应方向,通过研究一种新型的后处理诊断信号。我们提出了动态推测解码引擎(DSDE),这是一种无需训练的框架,由两个主要组件组成:(1)基于Kullback-Leibler(KLD)散度方差的预测信号,用于诊断生成的区域稳定性;(2)一种自适应推测长度上限,以缓解逐序列解码中的拖后腿问题。实验表明,使用KLD基稳定性信号进行动态适应具有潜力。由这些信号指导的算法在端到端延迟方面与领先基准相当,并且在各种工作负载中表现出更优越的鲁棒性。特别是在低接受率的挑战性环境中,所提出的信号保持其诊断效用。这些发现验证了后处理信号作为构建更鲁棒和智能的LLM推理系统的重要组成部分的价值,并强调了未来研究动态推测长度适应的有希望的方向。
Summary / 总结
This paper addresses the limitations of fixed speculation length in speculative decoding for large language model inference in diverse request environments. It introduces DSDE, a training-free framework using KLD variance as a predictive signal for regional stability and an adaptive speculation length cap to reduce stragglers. Experiments show that DSDE achieves competitive end-to-end latency and superior robustness across various workloads, especially in low-acceptance-rate regimes, validating the use of post-hoc signals for dynamic speculation length adaptation.
本文针对固定推测长度在多样请求环境中的大型语言模型推理中的局限性,提出了DSDE框架,该框架利用KLD方差作为区域稳定性的预测信号,并采用自适应推测长度上限来提高鲁棒性。实验表明,DSDE在各种工作负载下实现了与领先基线相当的端到端延迟,并且在低接受率环境中表现出更优的性能,验证了后处理信号在动态推测长度调整中的应用价值。
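The two DSDE components named in the abstract, a KLD-variance stability signal and a speculation length cap, can be pictured with a small sketch. This is an illustration under stated assumptions rather than the paper's algorithm: the window size, the mapping from variance to length in `choose_speculation_length`, and the fixed batch cap are all hypothetical choices, and every function name is invented here.

```python
import math
from statistics import pvariance


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def kld_variance(kld_history, window=4):
    """Variance of the most recent per-token KLD values (the stability signal)."""
    recent = kld_history[-window:]
    return pvariance(recent) if len(recent) > 1 else 0.0


def choose_speculation_length(variance, min_len=1, max_len=8, scale=10.0):
    """Hypothetical mapping: low variance (stable region) allows longer speculation."""
    length = round(max_len / (1.0 + scale * variance))
    return max(min_len, min(max_len, length))


def apply_batch_cap(lengths, cap):
    """Cap per-sequence lengths so one long speculation does not stall the batch."""
    return [min(length, cap) for length in lengths]


# Toy example: KLD between target and draft token distributions over recent steps.
stable_history = [0.10, 0.10, 0.10, 0.10]
unstable_history = [0.05, 1.20, 0.10, 2.00]
stable_len = choose_speculation_length(kld_variance(stable_history))
unstable_len = choose_speculation_length(kld_variance(unstable_history))
capped = apply_batch_cap([stable_len, unstable_len, 6], cap=4)
```

With these toy histories the stable sequence is allowed the maximum length and the unstable one falls back to a short length, and the cap then bounds how far any single sequence in the batch can run ahead.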
ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models
Authors: Liyan Tang, Grace Kim, Xinyu Zhao, Thom Lake, Wenxuan Ding, Fangcong Yin, Prasann Singhal, Manya Wadhwa, Zeyu Leo Liu, Zayne Sprague, Ramya Namuduri, Bodun Hu, Juan Diego Rodriguez, Puyuan Peng, Greg Durrett
Venue: NeurIPS 2025
First: 2025-05-19T17:59:27+00:00 · Latest: 2025-10-30T01:42:07+00:00
Comments: NeurIPS 2025 Datasets & Benchmarks
Abstract
Chart understanding presents a unique challenge for large vision-language models (LVLMs), as it requires the integration of sophisticated textual and visual reasoning capabilities. However, current LVLMs exhibit a notable imbalance between these skills, falling short on visual reasoning that is difficult to perform in text. We conduct a case study using a synthetic dataset solvable only through visual reasoning and show that model performance degrades significantly with increasing visual complexity, while human performance remains robust. We then introduce ChartMuseum, a new Chart Question Answering (QA) benchmark containing 1,162 expert-annotated questions spanning multiple reasoning types, curated from real-world charts across 184 sources, specifically built to evaluate complex visual and textual reasoning. Unlike prior chart understanding benchmarks -- where frontier models perform similarly and near saturation -- our benchmark exposes a substantial gap between model and human performance, while effectively differentiating model capabilities: although humans achieve 93% accuracy, the best-performing model Gemini-2.5-Pro attains only 63.0%, and the leading open-source LVLM Qwen2.5-VL-72B-Instruct achieves only 38.5%. Moreover, on questions requiring primarily visual reasoning, all models experience a 35%-55% performance drop from text-reasoning-heavy question performance. Lastly, our qualitative error analysis reveals specific categories of visual reasoning that are challenging for current LVLMs.
中文标题/摘要
标题:ChartMuseum:测试大型视觉语言模型的视觉推理能力
图表理解对大型视觉语言模型(LVLMs)提出了独特挑战,因为它需要结合复杂的文本和视觉推理能力。然而,当前的LVLMs在这两方面的技能存在明显不平衡,特别是在难以在文本中执行的视觉推理方面表现不佳。我们使用一个仅通过视觉推理才能解决的合成数据集进行了案例研究,结果显示,随着视觉复杂性的增加,模型的性能显著下降,而人类的表现则保持稳定。然后,我们引入了ChartMuseum,这是一个包含1,162个专家标注问题的新图表问答基准,涵盖了多种推理类型,从184个来源的真实世界图表中精选而来,专门用于评估复杂的视觉和文本推理能力。与之前的图表理解基准不同,这些基准中前沿模型的表现相似且接近饱和,而我们的基准则揭示了模型与人类表现之间存在的显著差距,同时有效地区分了模型的能力:尽管人类的准确率为93%,但表现最好的模型Gemini-2.5-Pro仅达到63.0%,而领先的开源LVLM Qwen2.5-VL-72B-Instruct仅达到38.5%。此外,在主要需要视觉推理的问题上,所有模型的表现从主要依赖文本推理的问题中下降了35%-55%。最后,我们的定性错误分析揭示了当前LVLMs在某些视觉推理类别中面临的挑战。
Summary / 总结
The study aims to evaluate the visual reasoning capabilities of large vision-language models (LVLMs) by introducing a synthetic dataset and a new benchmark, ChartMuseum. The method involves creating a dataset solvable only through visual reasoning and a benchmark with 1,162 expert-annotated questions from real-world charts. Key findings show that model performance declines significantly with increasing visual complexity, while human performance remains robust. The best-performing model, Gemini-2.5-Pro, achieves only 63.0% accuracy, compared to 93% for humans, highlighting a substantial gap between model and human performance.
研究旨在通过引入合成数据集和新的基准ChartMuseum来评估大型视觉-语言模型(LVLM)的视觉推理能力。方法包括创建仅通过视觉推理可解的数据集和包含1,162个专家标注问题的新基准,这些问题来自184个来源的真实世界图表。关键发现表明,随着视觉复杂性的增加,模型性能显著下降,而人类性能保持稳定。最佳模型Gemini-2.5-Pro的准确率为63.0%,而人类的准确率为93%,这表明模型和人类之间的性能差距很大。
Dynamic VLM-Guided Negative Prompting for Diffusion Models
Authors: Hoyeon Chang, Seungjin Kim, Yoonseok Choi
Venue: NeurIPS 2025
First: 2025-10-30T01:10:25+00:00 · Latest: 2025-10-30T01:10:25+00:00
Comments: 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: The First Workshop on Generative and Protective AI for Content Creation
Abstract
We propose a novel approach for dynamic negative prompting in diffusion models that leverages Vision-Language Models (VLMs) to adaptively generate negative prompts during the denoising process. Unlike traditional Negative Prompting methods that use fixed negative prompts, our method generates intermediate image predictions at specific denoising steps and queries a VLM to produce contextually appropriate negative prompts. We evaluate our approach on various benchmark datasets and demonstrate the trade-offs between negative guidance strength and text-image alignment.
中文标题/摘要
标题:动态VLM引导的负提示在扩散模型中的应用
我们提出了一种新的扩散模型中动态负提示的方法,利用视觉语言模型(VLM)在去噪过程中自适应地生成负提示。与传统的使用固定负提示的方法不同,我们的方法在特定的去噪步骤中生成中间图像预测,并查询VLM生成上下文相关的负提示。我们在各种基准数据集上评估了该方法,并展示了负引导强度与文本-图像对齐之间的权衡。
Summary / 总结
The paper introduces a dynamic negative prompting technique for diffusion models that utilizes Vision-Language Models to generate contextually appropriate negative prompts during the denoising process. Unlike static negative prompts, this method generates intermediate image predictions and queries a VLM to produce relevant negative prompts. The evaluation on benchmark datasets shows a trade-off between negative guidance strength and text-image alignment.
该论文提出了一种使用Vision-Language模型(VLM)在去噪过程中动态生成上下文相关负提示的新方法。不同于固定的负提示,这种方法在特定去噪步骤生成中间图像预测,并查询VLM生成适应性的负提示。在各种基准数据集上的评估结果表明,负引导强度与文本-图像对齐之间存在权衡。
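The control flow described in the abstract, predicting an intermediate image at chosen denoising steps and asking a VLM for a negative prompt, can be sketched as a loop. The code below shows only that control flow with toy stand-ins: `denoise_step`, `predict_image`, and `query_vlm` are hypothetical callables supplied by the caller, not functions of any real diffusion or VLM library, and the scalar "latent" is a placeholder for a real latent tensor.

```python
def run_dynamic_negative_prompting(
    latent, prompt, num_steps, query_steps, denoise_step, predict_image, query_vlm
):
    """Run a denoising loop that refreshes the negative prompt at `query_steps`.

    Returns the final latent and a log of (step, negative_prompt) pairs.
    """
    negative_prompt = ""  # start without negative guidance
    history = []
    for step in range(num_steps):
        if step in query_steps:
            preview = predict_image(latent, step)         # intermediate image prediction
            negative_prompt = query_vlm(preview, prompt)  # context-dependent negative prompt
        history.append((step, negative_prompt))
        latent = denoise_step(latent, step, prompt, negative_prompt)
    return latent, history


# Toy stand-ins (assumptions for illustration only).
def toy_denoise_step(latent, step, prompt, negative_prompt):
    return latent * 0.5


def toy_predict_image(latent, step):
    return latent


def toy_query_vlm(preview, prompt):
    return "blurry" if preview > 0.3 else "oversaturated"


final_latent, history = run_dynamic_negative_prompting(
    latent=1.0,
    prompt="a red bicycle",
    num_steps=4,
    query_steps={1, 3},
    denoise_step=toy_denoise_step,
    predict_image=toy_predict_image,
    query_vlm=toy_query_vlm,
)
```

The log makes the contrast with a fixed negative prompt visible: the negative prompt changes at the queried steps and is reused unchanged in between.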
Reasoning Visual Language Model for Chest X-Ray Analysis
Authors: Andriy Myronenko, Dong Yang, Baris Turkbey, Mariam Aboian, Sena Azamat, Esra Akcicek, Hongxu Yin, Pavlo Molchanov, Marc Edgar, Yufan He, Pengfei Guo, Yucheng Tang, Daguang Xu
First: 2025-10-28T00:48:00+00:00 · Latest: 2025-10-30T00:14:35+00:00
Comments: NV-Reason-CXR-3B
Abstract
Vision-language models (VLMs) have shown strong promise for medical image analysis, but most remain opaque, offering predictions without the transparent, stepwise reasoning clinicians rely on. We present a framework that brings chain-of-thought (CoT) reasoning to chest X-ray interpretation. Inspired by reasoning-first training paradigms, our approach is designed to learn how experts reason, not just what they conclude, by aligning intermediate steps with observable image evidence and radiology workflow. Beyond accuracy, the explicit reasoning traces support clinical auditability: they reveal why a conclusion was reached, which alternatives were considered, and where uncertainty remains, enabling quality assurance, error analysis, and safer human-AI collaboration. Our model couples high-fidelity visual encoding with a two-stage training recipe: a reasoning-style supervised fine-tuning (SFT) followed by reinforcement learning (RL) that uses verifiable rewards over a list of X-ray abnormalities. The model outputs reasoning that mirrors radiologists' systematic thought process, uncertainty, and differential diagnosis. In out-of-distribution evaluation, the approach achieves competitive multi-label classification while improving interpretability. In a reader study with expert radiologists, full reasoning traces increased confidence, supported error auditing, and reduced time to finalize reports. We release code and the model NV-Reason-CXR-3B to support community progress toward trustworthy, explainable AI in chest radiography and other medical imaging tasks where reasoning quality is as critical as prediction quality.
中文标题/摘要
标题:胸部X光分析的推理视觉语言模型
视觉语言模型(VLMs)在医学图像分析方面显示出强大的潜力,但大多数模型仍然不透明,仅提供预测而缺乏临床医生依赖的透明推理步骤。我们提出了一种框架,将链式思考(CoT)推理引入胸部X光解释。受推理优先训练范式的启发,我们的方法旨在学习专家如何推理,而不仅仅是他们得出的结论,通过将中间步骤与可观察的图像证据和放射学工作流程对齐。除了准确性之外,明确的推理轨迹支持临床审计:它们揭示了结论是如何得出的,考虑了哪些替代方案,以及不确定性在哪里,从而促进质量保证、错误分析和更安全的人工智能协作。 我们的模型结合了高保真视觉编码,并采用两阶段训练配方:一种推理风格的监督微调(SFT)后跟使用可验证奖励的强化学习(RL),该奖励基于X光异常列表。模型输出的推理过程与放射科医生系统的思维过程、不确定性以及鉴别诊断相呼应。在分布外评估中,该方法在多标签分类方面表现出竞争力,同时提高了可解释性。在专家放射科医生的读者研究中,完整的推理轨迹增加了信心,支持了错误审计,并减少了最终报告所需的时间。我们发布了代码和模型NV-Reason-CXR-3B,以支持社区在胸部放射学和其他医学成像任务中对可信、可解释的人工智能的研究。
Summary / 总结
The research aims to enhance the transparency of vision-language models in medical image analysis, particularly for chest X-ray interpretation. The method involves a two-stage training process: reasoning-style supervised fine-tuning followed by reinforcement learning, which aligns intermediate steps with observable image evidence. Key findings show that this approach improves interpretability and supports clinical auditability, increasing radiologists' confidence and reducing report time. The model, NV-Reason-CXR-3B, outputs reasoning that mirrors radiologists' thought processes and differential diagnoses, achieving competitive multi-label classification while enhancing explainability.
研究旨在提高视觉语言模型在医学图像分析中的透明度,特别是胸部X光解读。方法包括两阶段训练:推理风格的监督微调和使用可验证奖励的强化学习,使中间步骤与可观察的图像证据对齐。关键发现表明,这种方法提高了可解释性并支持临床审计,增加了放射科医生的信心并减少了报告时间。该模型NV-Reason-CXR-3B输出的推理过程与放射科医生的思维过程和鉴别诊断相吻合,实现了具有竞争力的多标签分类,同时增强了可解释性。
CAVE: Detecting and Explaining Commonsense Anomalies in Visual Environments
Authors: Rishika Bhagwatkar, Syrielle Montariol, Angelika Romanou, Beatriz Borges, Irina Rish, Antoine Bosselut
Venue: 2025 Conference on Empirical Methods in Natural Language Processing
First: 2025-10-29T22:34:26+00:00 · Latest: 2025-10-29T22:34:26+00:00
Abstract
Humans can naturally identify, reason about, and explain anomalies in their environment. In computer vision, this long-standing challenge remains limited to industrial defects or unrealistic, synthetically generated anomalies, failing to capture the richness and unpredictability of real-world anomalies. In this work, we introduce CAVE, the first benchmark of real-world visual anomalies. CAVE supports three open-ended tasks: anomaly description, explanation, and justification; with fine-grained annotations for visual grounding and categorizing anomalies based on their visual manifestations, their complexity, severity, and commonness. These annotations draw inspiration from cognitive science research on how humans identify and resolve anomalies, providing a comprehensive framework for evaluating Vision-Language Models (VLMs) in detecting and understanding anomalies. We show that state-of-the-art VLMs struggle with visual anomaly perception and commonsense reasoning, even with advanced prompting strategies. By offering a realistic and cognitively grounded benchmark, CAVE serves as a valuable resource for advancing research in anomaly detection and commonsense reasoning in VLMs.
中文标题/摘要
标题:CAVE:检测和解释视觉环境中的常识异常
人类可以自然地识别、推理和解释环境中的异常。在计算机视觉领域,这一长期挑战仍然局限于工业缺陷或不现实、合成生成的异常,未能捕捉到现实世界异常的丰富性和不可预测性。在本项工作中,我们引入了CAVE,这是首个现实世界视觉异常基准。CAVE 支持三个开放任务:异常描述、解释和论证;并提供了细粒度的视觉定位注释和基于异常视觉表现、复杂性、严重性和普遍性的分类注释。这些注释借鉴了认知科学研究人类如何识别和解决异常的方法,为评估视觉语言模型(VLMs)在检测和理解异常方面的表现提供了全面框架。我们展示了最先进的VLMs在视觉异常感知和常识推理方面存在困难,即使使用高级提示策略也是如此。通过提供一个现实和认知基础的基准,CAVE 成为推动异常检测和常识推理研究的重要资源。
Summary / 总结
The paper introduces CAVE, a benchmark for real-world visual anomalies, addressing the limitations of existing benchmarks which focus on industrial defects or synthetic anomalies. CAVE includes tasks for anomaly description, explanation, and justification, with detailed annotations for visual grounding and categorization. The study demonstrates that state-of-the-art Vision-Language Models struggle with visual anomaly perception and commonsense reasoning, highlighting the need for better models in this area. By providing a realistic and cognitively grounded benchmark, CAVE aims to advance research in anomaly detection and commonsense reasoning in VLMs.
研究旨在解决当前计算机视觉系统在处理真实世界视觉环境中的异常时面临的挑战。CAVE是一个新的基准,提出了异常描述、解释和论证三个任务,并提供了详细的注释。研究发现,最先进的视觉语言模型在视觉异常感知和常识推理方面表现不佳,这表明需要改进这些模型以应对这一问题。
GenIR: Generative Visual Feedback for Mental Image Retrieval
Authors: Diji Yang, Minghao Liu, Chung-Hsiang Lo, Yi Zhang, James Davis
Venue: NeurIPS 2025
First: 2025-06-06T16:28:03+00:00 · Latest: 2025-10-29T22:25:02+00:00
Comments: NeurIPS 2025
Abstract
Vision-language models (VLMs) have shown strong performance on text-to-image retrieval benchmarks. However, bridging this success to real-world applications remains a challenge. In practice, human search behavior is rarely a one-shot action. Instead, it is often a multi-round process guided by clues in mind, that is, a mental image ranging from vague recollections to vivid mental representations of the target image. Motivated by this gap, we study the task of Mental Image Retrieval (MIR), which targets the realistic yet underexplored setting where users refine their search for a mentally envisioned image through multi-round interactions with an image search engine. Central to successful interactive retrieval is the capability of machines to provide users with clear, actionable feedback; however, existing methods rely on indirect or abstract verbal feedback, which can be ambiguous, misleading, or ineffective for users to refine the query. To overcome this, we propose GenIR, a generative multi-round retrieval paradigm leveraging diffusion-based image generation to explicitly reify the AI system's understanding at each round. These synthetic visual representations provide clear, interpretable feedback, enabling users to refine their queries intuitively and effectively. We further introduce a fully automated pipeline to generate a high-quality multi-round MIR dataset. Experimental results demonstrate that GenIR significantly outperforms existing interactive methods in the MIR scenario. This work establishes a new task with a dataset and an effective generative retrieval method, providing a foundation for future research in this direction.
中文标题/摘要
标题:GenIR:生成式视觉反馈的思维图像检索
视觉语言模型(VLMs)在文本到图像检索基准测试中表现出色。然而,将这种成功应用到实际应用中仍然是一个挑战。实际上,人类的搜索行为很少是一次性的,而是一个由脑海中线索引导的多轮过程。也就是说,从模糊的记忆到对目标图像的生动心理表征。受此差距的启发,我们研究了思维图像检索(MIR)任务,该任务旨在通过与图像搜索引擎的多轮交互,让用户逐步细化他们对心中想象的图像的搜索。成功的交互检索的核心在于机器能够为用户提供清晰、可操作的反馈;然而,现有方法依赖于间接或抽象的口头反馈,这可能会使用户难以细化查询。为了解决这个问题,我们提出了GenIR,这是一种利用基于扩散的图像生成技术的生成式多轮检索范式,在每一轮中明确地体现AI系统的理解。这些合成的视觉表示提供了清晰、可解释的反馈,使用户能够直观有效地细化查询。我们还引入了一个全自动流水线来生成高质量的多轮MIR数据集。实验结果表明,GenIR在MIR场景中显著优于现有的交互式方法。这项工作建立了一个新的任务,包括一个数据集和一个有效的生成式检索方法,为该领域的未来研究奠定了基础
Summary / 总结
The paper addresses the challenge of mental image retrieval (MIR) by proposing GenIR, a generative multi-round retrieval paradigm. Motivated by the need for clear, actionable feedback in interactive search, GenIR uses diffusion-based image generation to provide explicit visual feedback at each round. Experimental results show that GenIR outperforms existing methods in the MIR scenario, demonstrating its effectiveness in guiding users to refine their queries more intuitively and effectively.
论文提出了一种生成式多轮检索方法GenIR,以解决心理图像检索(MIR)的问题。受需要在交互搜索中提供清晰且可操作反馈的驱动,GenIR 使用基于扩散的图像生成技术,在每一轮提供明确的视觉反馈。实验结果表明,GenIR 在 MIR 场景中优于现有交互方法,展示了其在引导用户更直观和有效地细化查询方面的有效性。
MoralCLIP: Contrastive Alignment of Vision-and-Language Representations with Moral Foundations Theory
Authors: Ana Carolina Condez, Diogo Tavares, João Magalhães
Venue: ACM MM
First: 2025-06-06T02:52:13+00:00 · Latest: 2025-10-29T21:34:31+00:00
Comments: Updated version: corresponds to the ACM MM '25 published paper and includes full appendix material
Abstract
Recent advances in vision-language models have enabled rich semantic understanding across modalities. However, these encoding methods lack the ability to interpret or reason about the moral dimensions of content, a crucial aspect of human cognition. In this paper, we address this gap by introducing MoralCLIP, a novel embedding representation method that extends multimodal learning with explicit moral grounding based on Moral Foundations Theory (MFT). Our approach integrates visual and textual moral cues into a unified embedding space, enabling cross-modal moral alignment. MoralCLIP is grounded on the multi-label dataset Social-Moral Image Database to identify co-occurring moral foundations in visual content. For MoralCLIP training, we design a moral data augmentation strategy to scale our annotated dataset to 15,000 image-text pairs labeled with MFT-aligned dimensions. Our results demonstrate that explicit moral supervision improves both unimodal and multimodal understanding of moral content, establishing a foundation for morally-aware AI systems capable of recognizing and aligning with human moral values.
中文标题/摘要
标题:MoralCLIP:基于道德基础理论的视觉-语言表示对比对齐
近期视觉-语言模型的发展使跨模态的丰富语义理解成为可能。然而,这些编码方法缺乏解释或推理内容道德维度的能力——这是人类认知的一个关键方面。本文通过引入MoralCLIP,一种基于道德基础理论(MFT)的新型多模态嵌入表示方法,来解决这一缺口。我们的方法将视觉和文本道德线索整合到一个统一的嵌入空间中,实现跨模态道德对齐。MoralCLIP基于多标签数据集Social-Moral Image Database,以识别视觉内容中共同出现的道德基础。为了训练MoralCLIP,我们设计了一种道德数据增强策略,将标注数据集扩展到15,000张带有MFT对齐维度的图像-文本对。我们的结果表明,显式的道德监督可以提高单模态和多模态对道德内容的理解,为具备识别和与人类道德价值观对齐的道德意识AI系统奠定了基础。
Summary / 总结
MoralCLIP is a novel method that extends multimodal learning with explicit moral grounding based on Moral Foundations Theory. It integrates visual and textual moral cues into a unified embedding space, using a moral data augmentation strategy to train on 15,000 image-text pairs. The results show that explicit moral supervision enhances both unimodal and multimodal understanding of moral content, paving the way for morally-aware AI systems.
MoralCLIP 是一种方法,通过结合道德维度(基于道德基础理论)来增强视觉-语言模型。它使用道德数据增强策略在 15,000 个图像-文本对上进行训练,从而提高单模态和多模态对道德内容的理解。这有助于更好地识别和与人类道德价值观对齐,推动道德意识 AI 系统的发展。
CAUSAL3D: A Comprehensive Benchmark for Causal Learning from Visual Data
Authors: Disheng Liu, Yiran Qiao, Wuche Liu, Yiren Lu, Yunlai Zhou, Tuo Liang, Yu Yin, Jing Ma
First: 2025-03-06T03:40:01+00:00 · Latest: 2025-10-29T20:44:13+00:00
Comments: Datasets link: https://huggingface.co/datasets/LLDDSS/Causal3D_Dataset
Abstract
True intelligence hinges on the ability to uncover and leverage hidden causal relations. Despite significant progress in AI and computer vision (CV), there remains a lack of benchmarks for assessing models' abilities to infer latent causality from complex visual data. In this paper, we introduce Causal3D, a novel and comprehensive benchmark that integrates structured data (tables) with corresponding visual representations (images) to evaluate causal reasoning. Designed within a systematic framework, Causal3D comprises 19 3D-scene datasets capturing diverse causal relations, views, and backgrounds, enabling evaluations across scenes of varying complexity. We assess multiple state-of-the-art methods, including classical causal discovery, causal representation learning, and large/vision-language models (LLMs/VLMs). Our experiments show that as causal structures grow more complex without prior knowledge, performance declines significantly, highlighting the challenges even advanced methods face in complex causal scenarios. Causal3D serves as a vital resource for advancing causal reasoning in CV and fostering trustworthy AI in critical domains.
中文标题/摘要
标题:CAUSAL3D:视觉数据因果学习的综合基准
真正的智能依赖于发现和利用隐藏的因果关系的能力。尽管在人工智能和计算机视觉(CV)方面取得了显著进展,但仍缺乏评估模型从复杂视觉数据中推断潜在因果关系能力的基准。在本文中,我们介绍了Causal3D,这是一种新颖且全面的基准,将结构化数据(表格)与相应的视觉表示(图像)结合在一起,以评估因果推理能力。Causal3D 设计在系统框架内,包含19个3D场景数据集,捕捉各种因果关系、视角和背景,使不同复杂度场景的评估成为可能。我们评估了多种最先进的方法,包括经典因果发现、因果表示学习以及大型/视觉语言模型(LLMs/VLMs)。实验结果显示,随着因果结构变得更加复杂且缺乏先验知识时,性能显著下降,突显了即使在复杂因果场景中,先进方法所面临的挑战。Causal3D 是推进CV中的因果推理和促进关键领域可信AI的重要资源。
Summary / 总结
CAUSAL3D is a new comprehensive benchmark for evaluating models' ability to infer causal relationships from complex visual data. It integrates structured data (tables) with corresponding images to assess causal reasoning across 19 diverse 3D-scene datasets. Experiments show that as causal structures become more complex, performance declines significantly, indicating the challenges even advanced methods face in complex causal scenarios.
CAUSAL3D 是一个新的综合基准,用于评估模型从复杂视觉数据中推断因果关系的能力。它将结构化数据(表格)与相应的图像集成,并包含19个3D场景数据集来评估各种因果推理方法。实验表明,在缺乏先验知识的情况下,随着因果结构的复杂性增加,性能显著下降,突显了在复杂因果场景中的挑战。
Latent Chain-of-Thought for Visual Reasoning
Authors: Guohao Sun, Hang Hua, Jian Wang, Jiebo Luo, Sohail Dianat, Majid Rabbani, Raghuveer Rao, Zhiqiang Tao
Venue: NeurIPS 2025
First: 2025-10-27T23:10:06+00:00 · Latest: 2025-10-29T18:48:20+00:00
Comments: NeurIPS 2025
Abstract
Chain-of-thought (CoT) reasoning is critical for improving the interpretability and reliability of Large Vision-Language Models (LVLMs). However, existing training algorithms such as SFT, PPO, and GRPO may not generalize well across unseen reasoning tasks and heavily rely on a biased reward model. To address this challenge, we reformulate reasoning in LVLMs as posterior inference and propose a scalable training algorithm based on amortized variational inference. By leveraging diversity-seeking reinforcement learning algorithms, we introduce a novel sparse reward function for token-level learning signals that encourage diverse, high-likelihood latent CoT, overcoming deterministic sampling limitations and avoiding reward hacking. Additionally, we implement a Bayesian inference-scaling strategy that replaces costly Best-of-N and Beam Search with a marginal likelihood to efficiently rank optimal rationales and answers. We empirically demonstrate that the proposed method enhances the state-of-the-art LVLMs on seven reasoning benchmarks, in terms of effectiveness, generalization, and interpretability.
中文标题/摘要
标题:视觉推理中的潜在思维链
思维链(CoT)推理对于提高大型视觉-语言模型(LVLM)的可解释性和可靠性至关重要。然而,现有的训练算法如SFT、PPO和GRPO可能在未见过的推理任务上表现不佳,并且严重依赖于有偏的奖励模型。为了解决这一挑战,我们将LVLM中的推理重新表述为后验推断,并提出了一种基于摊销变分推断的可扩展训练算法。通过利用寻求多样性的强化学习算法,我们引入了一种新颖的稀疏奖励函数,用于促进多样且高似然的潜在CoT,克服了确定性采样的局限性,避免了奖励作弊。此外,我们实现了贝叶斯推理扩展策略,用边际似然替代了昂贵的Best-of-N和束搜索,以高效地排名最优的推理和答案。我们实证证明,所提出的方法在七个推理基准上增强了最先进的LVLM,在有效性、泛化能力和可解释性方面表现出色。
Summary / 总结
The paper aims to improve the interpretability and reliability of Large Vision-Language Models (LVLMs) by addressing the limitations of existing training algorithms such as SFT, PPO, and GRPO. It reformulates reasoning as posterior inference and proposes a scalable training algorithm based on amortized variational inference. A novel sparse reward function encourages diverse, high-likelihood latent CoT, and a Bayesian inference-scaling strategy uses a marginal likelihood to rank rationales and answers in place of costly Best-of-N and Beam Search. The method improves state-of-the-art LVLMs on seven reasoning benchmarks in terms of effectiveness, generalization, and interpretability.
研究旨在通过解决现有训练算法的局限性,提高大型视觉-语言模型(LVLM)的可解释性和可靠性。方法将LVLM中的推理重新表述为后验推断,并引入基于摊销变分推断的可扩展训练算法。实验结果表明,该方法在七个推理基准上提升了最先进的LVLM的效果、泛化能力和可解释性。
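To make the idea of ranking answers by a marginal likelihood concrete, here is a minimal sketch. It is not the paper's estimator: it assumes each sample is a distinct latent rationale with a known joint log-probability for (rationale, answer), and the function names `logsumexp` and `rank_answers` are hypothetical. The marginal score of an answer is then the log of the summed joint probabilities of all rationales that reach it.

```python
import math
from collections import defaultdict


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def rank_answers(samples):
    """Rank answers by summing probability mass over their rationales.

    samples: list of (answer, joint_log_prob) pairs, one per distinct
    sampled rationale (an assumption of this sketch).
    """
    by_answer = defaultdict(list)
    for answer, log_prob in samples:
        by_answer[answer].append(log_prob)
    scores = {a: logsumexp(lps) for a, lps in by_answer.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy numbers: "B" wins because two rationales support it (0.25 + 0.20),
# even though the single best rationale points to "A" (0.30).
samples = [
    ("A", math.log(0.30)),
    ("B", math.log(0.25)),
    ("B", math.log(0.20)),
    ("C", math.log(0.05)),
]
ranking = rank_answers(samples)
print(ranking[0][0])
```

The toy case shows how marginalizing can differ from picking the single highest-scoring sample, which is what a Best-of-N selection over individual samples would do.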
FreeArt3D: Training-Free Articulated Object Generation using 3D Diffusion
Authors: Chuhao Chen, Isabella Liu, Xinyue Wei, Hao Su, Minghua Liu
First: 2025-10-29T17:58:14+00:00 · Latest: 2025-10-29T17:58:14+00:00
Abstract
Articulated 3D objects are central to many applications in robotics, AR/VR, and animation. Recent approaches to modeling such objects either rely on optimization-based reconstruction pipelines that require dense-view supervision or on feed-forward generative models that produce coarse geometric approximations and often overlook surface texture. In contrast, open-world 3D generation of static objects has achieved remarkable success, especially with the advent of native 3D diffusion models such as Trellis. However, extending these methods to articulated objects by training native 3D diffusion models poses significant challenges. In this work, we present FreeArt3D, a training-free framework for articulated 3D object generation. Instead of training a new model on limited articulated data, FreeArt3D repurposes a pre-trained static 3D diffusion model (e.g., Trellis) as a powerful shape prior. It extends Score Distillation Sampling (SDS) into the 3D-to-4D domain by treating articulation as an additional generative dimension. Given a few images captured in different articulation states, FreeArt3D jointly optimizes the object's geometry, texture, and articulation parameters without requiring task-specific training or access to large-scale articulated datasets. Our method generates high-fidelity geometry and textures, accurately predicts underlying kinematic structures, and generalizes well across diverse object categories. Despite following a per-instance optimization paradigm, FreeArt3D completes in minutes and significantly outperforms prior state-of-the-art approaches in both quality and versatility.
中文标题/摘要
标题:FreeArt3D:利用3D扩散的免训练可动物体生成
3D可动物体在机器人学、AR/VR和动画等领域中至关重要。最近对这类物体建模的方法要么依赖于需要密集视角监督的优化重建管道,要么依赖于生成前馈模型,这些模型会产生粗糙的几何近似,并且往往忽视表面纹理。相比之下,静态3D物体的开放世界生成已经取得了显著成功,尤其是随着原生3D扩散模型(如Trellis)的出现。然而,将这些方法扩展到可动物体并训练原生3D扩散模型面临着重大挑战。在本文中,我们提出了FreeArt3D,这是一种无需训练的3D可动物体生成框架。FreeArt3D 不是针对有限的可动数据训练新模型,而是将一个预先训练好的静态3D扩散模型(例如Trellis)重新用于强大的形状先验。它通过将可动性视为额外的生成维度,将Score Distillation Sampling (SDS) 扩展到3D到4D领域。给定不同可动状态下的少量图像,FreeArt3D 联合优化物体的几何形状、纹理和可动参数,而无需特定任务的训练或访问大规模可动数据集。我们的方法生成了高保真度的几何形状和纹理,准确预测了潜在的运动结构,并在多种物体类别中表现出良好的泛化能力。尽管遵循实例优化范式,FreeArt3D 完成时间仅需几分钟,并且在质量和多功能性方面显著优于先前的先进方法。
Summary / 总结
FreeArt3D is a training-free framework for generating articulated 3D objects. It leverages a pre-trained static 3D diffusion model, such as Trellis, and extends Score Distillation Sampling to the 3D-to-4D domain. Given a few images of an object in different articulation states, FreeArt3D optimizes the object's geometry, texture, and articulation parameters without requiring task-specific training or large-scale datasets. The method produces high-fidelity geometry and textures, accurately predicts kinematic structures, and generalizes well across various object categories, outperforming previous approaches in both quality and versatility.
FreeArt3D 是一个无需训练的可动(articulated)3D 物体生成框架。它利用预训练的静态 3D 扩散模型和 Score Distillation Sampling 来优化几何形状、纹理和关节参数。该方法只需要几张物体在不同关节状态下的图像,不需要特定任务的训练。它能够生成高保真度的几何形状和纹理,准确预测运动结构,并在各种物体类别中表现出良好的泛化能力,在质量和灵活性方面优于之前的先进方法。
ScaleDiff: Higher-Resolution Image Synthesis via Efficient and Model-Agnostic Diffusion
Authors: Sungho Koh, SeungJu Cha, Hyunwoo Oh, Kwanyoung Lee, Dong-Jin Kim
Venue: NeurIPS 2025
First: 2025-10-29T17:17:32+00:00 · Latest: 2025-10-29T17:17:32+00:00
Comments: NeurIPS 2025. Code: https://github.com/KSH00906/ScaleDiff
Abstract
Text-to-image diffusion models often exhibit degraded performance when generating images beyond their training resolution. Recent training-free methods can mitigate this limitation, but they often require substantial computation or are incompatible with recent Diffusion Transformer models. In this paper, we propose ScaleDiff, a model-agnostic and highly efficient framework for extending the resolution of pretrained diffusion models without any additional training. A core component of our framework is Neighborhood Patch Attention (NPA), an efficient mechanism that reduces computational redundancy in the self-attention layer with non-overlapping patches. We integrate NPA into an SDEdit pipeline and introduce Latent Frequency Mixing (LFM) to better generate fine details. Furthermore, we apply Structure Guidance to enhance global structure during the denoising process. Experimental results demonstrate that ScaleDiff achieves state-of-the-art performance among training-free methods in terms of both image quality and inference speed on both U-Net and Diffusion Transformer architectures.
中文标题/摘要
标题:ScaleDiff:通过高效且模型无关的扩散实现高分辨率图像合成
文本到图像的扩散模型在生成超出训练分辨率的图像时通常表现出性能下降。最近的无训练方法可以缓解这一限制,但它们往往需要大量计算或与最近的扩散变换器模型不兼容。在本文中,我们提出了一种模型无关且高效的框架ScaleDiff,无需额外训练即可扩展预训练扩散模型的分辨率。我们框架的核心组件是邻域块注意力(NPA),这是一种高效的机制,通过非重叠块减少自注意力层中的计算冗余。我们将NPA集成到SDEdit管道中,并引入潜在频率混合(LFM)以更好地生成细部。此外,我们在去噪过程中应用结构引导以增强全局结构。实验结果表明,ScaleDiff在U-Net和扩散变换器架构上均实现了无训练方法中的最佳性能,无论是图像质量还是推理速度。
Summary / 总结
ScaleDiff is a model-agnostic and efficient framework for enhancing the resolution of pretrained diffusion models without additional training. It introduces Neighborhood Patch Attention (NPA) to reduce computational redundancy and Latent Frequency Mixing (LFM) to improve fine details. ScaleDiff also applies Structure Guidance to enhance global structure during the denoising process. Experimental results show that ScaleDiff outperforms other training-free methods in terms of both image quality and inference speed on both U-Net and Diffusion Transformer architectures.
ScaleDiff 是一个模型无关且高效的框架,能够在无需额外训练的情况下扩展预训练扩散模型的分辨率。它使用 Neighborhood Patch Attention (NPA) 来减少计算冗余,并结合 Latent Frequency Mixing (LFM) 和 Structure Guidance 来提升图像质量。ScaleDiff 在 U-Net 和 Diffusion Transformer 架构上均实现了最先进的性能,在图像质量和推理速度方面均表现出色。
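A short worked calculation can illustrate why restricting self-attention to non-overlapping patches reduces computation. This is only a sketch under a simplifying assumption: each token attends only to tokens inside its own tile, whereas the paper's NPA may attend over a larger neighborhood. The function names and the 128 x 128 grid with 32 x 32 tiles are hypothetical choices for illustration.

```python
def attention_pairs_full(height: int, width: int) -> int:
    """Query-key pairs when every token attends to every token."""
    n = height * width
    return n * n


def attention_pairs_patched(height: int, width: int, patch: int) -> int:
    """Query-key pairs when attention stays inside non-overlapping tiles."""
    if height % patch or width % patch:
        raise ValueError("grid must be divisible by the patch size")
    tiles = (height // patch) * (width // patch)
    tokens_per_tile = patch * patch
    return tiles * tokens_per_tile * tokens_per_tile


# A 128 x 128 token grid split into 32 x 32 tiles (16 tiles in total).
full = attention_pairs_full(128, 128)
local = attention_pairs_patched(128, 128, 32)
print(full, local, full // local)  # 268435456 16777216 16
```

Under this assumption the saving equals the number of tiles, so the benefit grows as the target resolution increases relative to the patch size.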
ALDEN: Reinforcement Learning for Active Navigation and Evidence Gathering in Long Documents
Authors: Tianyu Yang, Terry Ruas, Yijun Tian, Jan Philip Wahle, Daniel Kurzawe, Bela Gipp
First: 2025-10-29T16:32:26+00:00 · Latest: 2025-10-29T16:32:26+00:00
Abstract
Vision-language models (VLMs) excel at interpreting text-rich images but struggle with long, visually complex documents that demand analysis and integration of information spread across multiple pages. Existing approaches typically rely on fixed reasoning templates or rigid pipelines, which force VLMs into a passive role and hinder both efficiency and generalization. We present Active Long-DocumEnt Navigation (ALDEN), a multi-turn reinforcement learning framework that fine-tunes VLMs as interactive agents capable of actively navigating long, visually rich documents. ALDEN introduces a novel fetch action that directly accesses the page by index, complementing the classic search action and better exploiting document structure. For dense process supervision and efficient training, we propose a rule-based cross-level reward that provides both turn- and token-level signals. To address the empirically observed training instability caused by numerous visual tokens from long documents, we further propose a visual-semantic anchoring mechanism that applies a dual-path KL-divergence constraint to stabilize visual and textual representations separately during training. Trained on a corpus constructed from three open-source datasets, ALDEN achieves state-of-the-art performance on five long-document benchmarks. Overall, ALDEN marks a step beyond passive document reading toward agents that autonomously navigate and reason across long, visually rich documents, offering a robust path to more accurate and efficient long-document understanding.
中文标题/摘要
标题:ALDEN:在长文档中进行主动导航和证据收集的强化学习
视觉语言模型(VLMs)在解释图文丰富的图像方面表现出色,但在处理长篇复杂文档时却遇到困难,这些文档需要对分布在多页上的信息进行分析和整合。现有方法通常依赖固定的推理模板或刚性管道,这迫使VLMs处于被动角色,影响了效率和泛化能力。我们提出了Active Long-DocumEnt Navigation (ALDEN),这是一种多轮次的强化学习框架,可以将VLMs微调为能够主动导航长图文文档的交互式代理。ALDEN引入了一种新的获取动作,可以直接通过索引访问页面,补充了经典的搜索动作,并更好地利用了文档结构。为了进行密集的过程监督和高效的训练,我们提出了一种基于规则的跨层次奖励机制,提供了轮次级和标记级的信号。为了解决由长文档中的大量视觉标记引起的训练不稳定性问题,我们进一步提出了一种视觉语义锚定机制,在训练过程中分别对视觉和文本表示施加双重路径的KL散度约束,以稳定它们。ALDEN在三个开源数据集构建的语料库上进行训练,实现了五个长文档基准测试中的最佳性能。总体而言,ALDEN标志着从被动文档阅读向能够自主导航和在长图文文档中进行推理的代理的一步跨越,提供了一条通往更准确和高效的长文档理解的稳健路径。
Summary / 总结
ALDEN is a reinforcement learning framework that enhances vision-language models to actively navigate and gather evidence from long, complex documents. It introduces a novel fetch action and a rule-based cross-level reward system to improve efficiency and generalization. ALDEN also includes a visual-semantic anchoring mechanism to stabilize training. The model achieves state-of-the-art performance on five long-document benchmarks, demonstrating its effectiveness in autonomous document navigation and reasoning.
ALDEN 是一个强化学习框架,旨在增强视觉语言模型,使其能够主动导航和从长且复杂的文档中收集证据。它引入了一种新的获取动作和基于规则的跨层级奖励系统,以提高效率和泛化能力。ALDEN 还包含一种视觉语义锚定机制,以在训练过程中稳定视觉和文本表示。该模型在五个长文档基准测试中达到了最先进的性能,展示了其在自主文档导航和推理方面的有效性。
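The difference between the classic search action and the fetch-by-index action can be pictured with a toy environment. This is a hypothetical sketch, not ALDEN's implementation: the `ToyDocument` class, its keyword-count search, and the sample pages are all assumptions made for illustration, and the pages are plain strings rather than page images.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDocument:
    pages: list
    visited: list = field(default_factory=list)

    def search(self, query: str, top_k: int = 2):
        """Return indices of the pages that mention the query most often."""
        q = query.lower()
        hits = []
        for index, page in enumerate(self.pages):
            count = page.lower().count(q)
            if count > 0:
                hits.append((count, index))
        # Highest count first; ties are broken by page order.
        hits.sort(key=lambda ci: (-ci[0], ci[1]))
        return [index for _, index in hits[:top_k]]

    def fetch(self, index: int) -> str:
        """Open a page directly by its index and record the visit."""
        if not 0 <= index < len(self.pages):
            raise IndexError("page index out of range")
        self.visited.append(index)
        return self.pages[index]


doc = ToyDocument([
    "Table of contents: Results are on page index 2.",
    "Methods: we fine-tune a model.",
    "Results: accuracy improves.",
])
hits = doc.search("results")   # keyword search finds candidate pages
page = doc.fetch(2)            # the contents page says where to jump
print(hits, page)
```

Here search surfaces both the contents page and the results page, while fetch lets an agent follow a structural cue (a page reference) without issuing another query.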
Don't Blind Your VLA: Aligning Visual Representations for OOD Generalization
Authors: Nikita Kachaev, Mikhail Kolosov, Daniil Zelezetsky, Alexey K. Kovalev, Aleksandr I. Panov
First: 2025-10-29T15:20:10+00:00 · Latest: 2025-10-29T15:20:10+00:00
Comments: 13 pages, 6 figures
Abstract
The growing success of Vision-Language-Action (VLA) models stems from the promise that pretrained Vision-Language Models (VLMs) can endow agents with transferable world knowledge and vision-language (VL) grounding, laying a foundation for action models with broader generalization. Yet when these VLMs are adapted to the action modality, it remains unclear to what extent their original VL representations and knowledge are preserved. In this work, we conduct a systematic study of representation retention during VLA fine-tuning, showing that naive action fine-tuning leads to degradation of visual representations. To characterize and measure these effects, we probe VLA's hidden representations and analyze attention maps. We also design a set of targeted tasks and methods that contrast VLA models with their counterpart VLMs, isolating changes in VL capabilities induced by action fine-tuning. We further evaluate a range of strategies for aligning visual representations and introduce a simple yet effective method that mitigates degradation and yields improved generalization to out-of-distribution (OOD) scenarios. Taken together, our analysis clarifies the trade-off between action fine-tuning and the degradation of VL representations and highlights practical approaches to recover inherited VL capabilities. Code is publicly available: https://blind-vla-paper.github.io
中文标题/摘要
标题:不要盲目训练VLA:为OOD泛化对齐视觉表示
视觉-语言-行动(VLA)模型的成功得益于预训练视觉-语言模型(VLMs)赋予代理广泛转移的世界知识和视觉-语言(VL)定位的承诺,为具有更广泛泛化能力的行动模型奠定了基础。然而,当这些VLMs适应行动模态时,尚不清楚它们原始的VL表示和知识在多大程度上得到了保留。在本文中,我们系统研究了VLA微调期间表示的保留情况,表明简单的行动微调会导致视觉表示的退化。为了表征和测量这些影响,我们探测了VLA的隐藏表示并分析了注意力图,进一步设计了一系列对比VLA模型与其对应VLMs的目标任务和方法,以隔离由行动微调引起的VL能力的变化。我们还评估了一系列视觉表示对齐策略,并引入了一种简单而有效的方法,该方法减轻了退化并提高了对分布外(OOD)场景的泛化能力。综上所述,我们的分析阐明了行动微调与VL表示退化之间的权衡,并强调了恢复继承的VL能力的实用方法。代码已公开:https://blind-vla-paper.github.io
Summary / 总结
This work investigates how action fine-tuning of Vision-Language-Action (VLA) models affects the visual representations inherited from their underlying VLMs. The study finds that naive action fine-tuning degrades these representations. The authors evaluate several strategies for aligning visual representations and introduce a simple method that mitigates the degradation and improves generalization to out-of-distribution (OOD) scenarios. The analysis clarifies the trade-off between action fine-tuning and the preservation of vision-language representations.
该研究探讨了对Vision-Language-Action (VLA)模型进行动作微调对其视觉表示和泛化能力的影响。研究发现,简单的动作微调会导致视觉表示退化。为了解决这一问题,作者评估了多种视觉表示对齐策略,并提出了一种简单的方法来减轻退化,提升对分布外(OOD)场景的泛化能力。研究阐明了动作微调与视觉-语言表示保留之间的权衡,并介绍了缓解退化的实用方法。