ChartAB: A Benchmark for Chart Grounding & Dense Alignment
Authors: Aniruddh Bansal, Davit Soselia, Dang Nguyen, Tianyi Zhou
First: 2025-10-30T17:56:31+00:00 · Latest: 2025-10-30T17:56:31+00:00
Abstract
Charts play an important role in visualization, reasoning, data analysis, and
the exchange of ideas among humans. However, existing vision-language models
(VLMs) still lack accurate perception of details and struggle to extract
fine-grained structures from charts. Such limitations in chart grounding also
hinder their ability to compare multiple charts and reason over them. In this
paper, we introduce a novel "ChartAlign Benchmark (ChartAB)" to provide a
comprehensive evaluation of VLMs in chart grounding tasks, i.e., extracting
tabular data, localizing visualization elements, and recognizing various
attributes from charts of diverse types and complexities. We design a JSON
template to facilitate the calculation of evaluation metrics specifically
tailored for each grounding task. By incorporating a novel two-stage inference
workflow, the benchmark can further evaluate VLMs' capability to align and
compare elements/attributes across two charts. Our analysis of evaluations on
several recent VLMs reveals new insights into their perception biases,
weaknesses, robustness, and hallucinations in chart understanding. These
findings highlight the fine-grained discrepancies among VLMs in chart
understanding tasks and point to specific skills that need to be strengthened
in current models.
中文标题/摘要
标题:ChartAB:图表定位与密集对齐基准
图表在可视化、推理、数据分析以及人类之间的思想交流中发挥着重要作用。然而,现有的视觉-语言模型(VLMs)在细节感知方面仍存在不足,难以从图表中提取精细的结构。这些在图表定位方面的限制也阻碍了它们比较多个图表和推理的能力。在本文中,我们引入了一个新的“图表对齐基准(ChartAB)”,以全面评估VLMs在图表定位任务中的表现,即提取表格数据、定位可视化元素以及从不同类型和复杂度的图表中识别各种属性。我们设计了一个JSON模板,以方便计算每个定位任务特定的评估指标。通过引入一种新颖的两阶段推理工作流,基准还可以进一步评估VLMs在两个图表之间对齐和比较元素/属性的能力。我们对几种近期VLMs的评估分析揭示了它们在图表理解方面的感知偏差、弱点、鲁棒性和幻觉。这些发现突显了VLMs在图表理解任务中的细微差异,并指出了当前模型需要加强的具体技能。
Summary / 总结
The paper introduces ChartAB, a benchmark for evaluating vision-language models (VLMs) in chart grounding tasks, including extracting tabular data, localizing visualization elements, and recognizing attributes from diverse charts. The benchmark uses a JSON template to calculate specific evaluation metrics and a two-stage inference workflow to assess the models' ability to align and compare elements across charts. The analysis reveals biases, weaknesses, and hallucinations in VLMs, highlighting the need to improve their fine-grained understanding of charts.
论文提出了ChartAB基准,用于评估视觉-语言模型在图表定位任务中的表现,包括提取表格数据、定位可视化元素和识别属性。它使用JSON模板来计算评估指标,并采用两阶段推理工作流来评估模型在跨图表对齐和比较元素的能力。分析揭示了模型中的偏见、弱点和幻觉,突出了需要在图表理解任务中增强的具体技能。
SteerVLM: Robust Model Control through Lightweight Activation Steering for Vision Language Models
Authors: Anushka Sivakumar, Andrew Zhang, Zaber Hakim, Chris Thomas
First: 2025-10-30T17:52:39+00:00 · Latest: 2025-10-30T17:52:39+00:00
Abstract
This work introduces SteerVLM, a lightweight steering module designed to
guide Vision-Language Models (VLMs) towards outputs that better adhere to
desired instructions. Our approach learns from the latent embeddings of paired
prompts encoding target and converse behaviors to dynamically adjust
activations connecting the language modality with image context. This allows
for fine-grained, inference-time control over complex output semantics without
modifying model weights while preserving performance on off-target tasks. Our
steering module requires learning parameters equal to 0.14% of the original
VLM's size. Our steering module gains model control through dimension-wise
activation modulation and adaptive steering across layers without requiring
pre-extracted static vectors or manual tuning of intervention points.
Furthermore, we introduce VNIA (Visual Narrative Intent Alignment), a
multimodal dataset specifically created to facilitate the development and
evaluation of VLM steering techniques. Our method outperforms existing
intervention techniques on steering and hallucination mitigation benchmarks for
VLMs and proposes a robust solution for multimodal model control through
activation engineering.
中文标题/摘要
标题:SteerVLM:通过轻量级激活转向实现视觉语言模型稳健的模型控制
本工作介绍了SteerVLM,这是一种轻量级的转向模块,旨在引导视觉语言模型(VLMs)生成更符合所需指令的输出。我们的方法通过学习配对提示的潜在嵌入,编码目标和相反行为,动态调整语言模态与图像上下文之间的激活连接。这允许在不修改模型权重的情况下,在推理时对复杂的输出语义进行精细控制,同时保持对离目标任务的性能。我们的转向模块的学习参数量仅为原始VLM大小的0.14%。我们的转向模块通过维度上的激活调制和跨层自适应转向获得模型控制,无需预先提取的静态向量或手动调整干预点。此外,我们还引入了VNIA(视觉叙事意图对齐)多模态数据集,专门用于促进VLM转向技术的发展和评估。我们的方法在VLM转向和幻觉缓解基准测试中优于现有干预技术,并提出了一种通过激活工程实现多模态模型控制的稳健解决方案。
Summary / 总结
SteerVLM is a lightweight module that guides VLMs to produce outputs more aligned with desired instructions by dynamically adjusting activations. It learns from paired prompts to steer the model without changing model weights, adding learnable parameters equal to only 0.14% of the original VLM's size. SteerVLM outperforms existing intervention techniques on steering and hallucination mitigation benchmarks, offering a robust solution for controlling VLMs through activation modulation.
SteerVLM 是一个轻量级模块,通过动态调整激活来引导 VLM 生成更符合指令要求的输出。它通过学习配对提示来引导模型,而不改变模型权重,仅需原始 VLM 参数的 0.14%。SteerVLM 在引导和幻觉缓解基准测试中优于现有技术,提供了一种通过激活工程实现多模态模型控制的稳健解决方案。
CronusVLA: Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling
Authors: Hao Li, Shuai Yang, Yilun Chen, Xinyi Chen, Xiaoda Yang, Yang Tian, Hanqing Wang, Tai Wang, Dahua Lin, Feng Zhao, Jiangmiao Pang
First: 2025-06-24T17:30:27+00:00 · Latest: 2025-10-30T16:38:19+00:00
Comments: 39 pages, 24 figures
Abstract
Recent vision-language-action (VLA) models built on pretrained
vision-language models (VLMs) have demonstrated strong performance in robotic
manipulation. However, these models remain constrained by the single-frame
image paradigm and fail to fully leverage the temporal information offered by
multi-frame histories, as directly feeding multiple frames into VLM backbones
incurs substantial computational overhead and inference latency. We propose
CronusVLA, a unified framework that extends single-frame VLA models to the
multi-frame paradigm. CronusVLA follows a two-stage process: (1) Single-frame
pretraining on large-scale embodied datasets with autoregressive prediction of
action tokens, establishing an effective embodied vision-language foundation;
(2) Multi-frame post-training, which adapts the prediction of the
vision-language backbone from discrete tokens to learnable features, and
aggregates historical information via feature chunking. CronusVLA effectively
addresses the existing challenges of multi-frame modeling while enhancing
performance and observational robustness. To evaluate the robustness under
temporal and spatial disturbances, we introduce SimplerEnv-OR, a novel
benchmark featuring 24 types of observational disturbances and 120 severity
levels. Experiments across three embodiments in simulated and real-world
environments demonstrate that CronusVLA achieves leading performance and
superior robustness, with a 70.9% success rate on SimplerEnv, a 26.8%
improvement over OpenVLA on LIBERO, and the highest robustness score on
SimplerEnv-OR. These results highlight the potential of efficient multi-frame
adaptation in VLA models for more powerful and robust real-world deployment.
中文标题/摘要
标题:CronusVLA:通过多帧视觉-语言-动作建模实现高效稳健操作
基于预训练视觉-语言模型(VLMs)的近期视觉-语言-动作(VLA)模型在机器人操作方面表现出强大的性能。然而,这些模型仍然受限于单帧图像范式,未能充分利用多帧历史提供的时间信息,因为直接将多帧输入到VLM主干中会带来巨大的计算开销和推理延迟。我们提出了一种名为CronusVLA的统一框架,将单帧VLA模型扩展到多帧范式。CronusVLA遵循两阶段过程:(1)在大规模具身数据集上进行单帧预训练,通过自回归预测动作标记,建立有效的具身视觉-语言基础;(2)多帧后训练,将视觉-语言主干的预测从离散标记调整为可学习特征,并通过特征分块聚合历史信息。CronusVLA有效解决了多帧建模的现有挑战,同时提高了性能和观测鲁棒性。为了评估在时间和空间扰动下的鲁棒性,我们引入了SimplerEnv-OR基准,该基准包含24种观测扰动类型和120种严重程度级别。在模拟和真实环境中的三种具身模型实验表明,CronusVLA实现了领先性能和优越的鲁棒性,在SimplerEnv中的成功率达到了70.9%,在LIBERO中的性能提高了26.8%,在SimplerEnv-OR中获得了最高的鲁棒性得分。这些结果突显了VLA模型中高效多帧适应的潜力,使其在更强大和鲁棒的实际部署中具有更大的可能性。
Summary / 总结
CronusVLA is a unified framework that extends single-frame vision-language-action models to a multi-frame paradigm, addressing computational overhead and inference latency issues. It consists of two stages: single-frame pretraining for establishing an embodied vision-language foundation, and multi-frame post-training for learning from historical information. Experiments show CronusVLA outperforms existing models with a 70.9% success rate on SimplerEnv and a 26.8% improvement over OpenVLA on LIBERO, demonstrating enhanced performance and robustness under various disturbances.
CronusVLA 是一种统一框架,将单帧视觉-语言-动作模型扩展到多帧范式,以利用时间信息并减少计算开销。它包括两个阶段:单帧预训练以建立视觉-语言基础,以及多帧后训练以从历史信息中学习。实验结果显示,CronusVLA 在 SimplerEnv 中的成功率为 70.9%,在 LIBERO 上比 OpenVLA 提高了 26.8% 的性能,并在 SimplerEnv-OR 上获得了最高的鲁棒性评分,证明了其在机器人操作任务中的有效性和鲁棒性。
All You Need for Object Detection: From Pixels, Points, and Prompts to Next-Gen Fusion and Multimodal LLMs/VLMs in Autonomous Vehicles
Authors: Sayed Pedram Haeri Boroujeni, Niloufar Mehrabi, Hazim Alzorgan, Ahmad Sarlak, Mahlagha Fazeli, Abolfazl Razi
First: 2025-10-30T16:08:25+00:00 · Latest: 2025-10-30T16:08:25+00:00
Abstract
Autonomous Vehicles (AVs) are transforming the future of transportation
through advances in intelligent perception, decision-making, and control
systems. However, their success is tied to one core capability, reliable object
detection in complex and multimodal environments. While recent breakthroughs in
Computer Vision (CV) and Artificial Intelligence (AI) have driven remarkable
progress, the field still faces a critical challenge as knowledge remains
fragmented across multimodal perception, contextual reasoning, and cooperative
intelligence. This survey bridges that gap by delivering a forward-looking
analysis of object detection in AVs, emphasizing emerging paradigms such as
Vision-Language Models (VLMs), Large Language Models (LLMs), and Generative AI
rather than re-examining outdated techniques. We begin by systematically
reviewing the fundamental spectrum of AV sensors (camera, ultrasonic, LiDAR,
and Radar) and their fusion strategies, highlighting not only their
capabilities and limitations in dynamic driving environments but also their
potential to integrate with recent advances in LLM/VLM-driven perception
frameworks. Next, we introduce a structured categorization of AV datasets that
moves beyond simple collections, positioning ego-vehicle, infrastructure-based,
and cooperative datasets (e.g., V2V, V2I, V2X, I2I), followed by a
cross-analysis of data structures and characteristics. Ultimately, we analyze
cutting-edge detection methodologies, ranging from 2D and 3D pipelines to
hybrid sensor fusion, with particular attention to emerging transformer-driven
approaches powered by Vision Transformers (ViTs), Large and Small Language
Models (SLMs), and VLMs. By synthesizing these perspectives, our survey
delivers a clear roadmap of current capabilities, open challenges, and future
opportunities.
中文标题/摘要
标题:目标检测所需的一切:从像素、点和提示到下一代融合与多模态LLMs/VLMs在自动驾驶车辆中的应用
自动驾驶车辆(AVs)通过智能感知、决策和控制系统的发展正在重塑未来的交通。然而,它们的成功取决于一个核心能力——在复杂和多模态环境中可靠地进行目标检测。尽管计算机视觉(CV)和人工智能(AI)领域的最新突破推动了显著的进步,但该领域仍面临一个关键挑战,即知识在多模态感知、上下文推理和协同智能方面仍碎片化。本文综述填补了这一空白,通过提供面向未来的AV目标检测分析,强调了新兴范式,如视觉语言模型(VLMs)、大型语言模型(LLMs)和生成AI,而不是重新审视过时的技术。我们首先系统地回顾了AV传感器(摄像头、超声波、激光雷达和雷达)及其融合策略,不仅突出了它们在动态驾驶环境中的能力和局限性,还强调了它们与基于大/小语言模型/视觉模型的感知框架的潜在整合。接着,我们介绍了AV数据集的结构化分类,超越了简单的集合,将自我车辆、基础设施和协同数据集(例如V2V、V2I、V2X、I2I)置于其中,随后进行了数据结构和特征的交叉分析。最后,我们分析了最新的检测方法,从2D和3D管道到混合传感器融合,特别关注由视觉变换器(ViTs)、大型和小型语言模型(SLMs)和VLMs驱动的新兴变换器方法。通过综合这些视角,我们的综述提供了一条清晰的当前能力、开放挑战和未来机遇的路线图。
Summary / 总结
This survey addresses reliable object detection in autonomous vehicles (AVs), emphasizing emerging paradigms such as vision-language models (VLMs), large language models (LLMs), and generative AI. The authors review the fundamental AV sensors (camera, ultrasonic, LiDAR, and radar) and their fusion strategies, categorize ego-vehicle, infrastructure-based, and cooperative datasets, and analyze detection methodologies ranging from 2D and 3D pipelines to hybrid sensor fusion and transformer-driven approaches. The result is a roadmap of current capabilities, open challenges, and future opportunities.
本文旨在通过整合视觉语言模型(VLMs)、大型语言模型(LLMs)和生成式AI来解决自动驾驶汽车(AVs)中可靠的物体检测问题。作者回顾了基本的AV传感器及其融合策略,对AV数据集进行了分类,并分析了最新的检测方法。主要发现包括VLMs和LLMs在多模态感知和上下文推理中的潜在作用,以及混合传感器融合对于下一代AV的重要性。
Counteracting Matthew Effect in Self-Improvement of LVLMs through Head-Tail Re-balancing
Authors: Xin Guo, Zhiheng Xi, Yiwen Ding, Yitao Zhai, Xiaowei Shi, Xunliang Cai, Tao Gui, Qi Zhang, Xuanjing Huang
First: 2025-10-30T13:26:58+00:00 · Latest: 2025-10-30T13:26:58+00:00
Comments: Preprint
Abstract
Self-improvement has emerged as a mainstream paradigm for advancing the
reasoning capabilities of large vision-language models (LVLMs), where models
explore and learn from successful trajectories iteratively. However, we
identify a critical issue during this process: the model excels at generating
high-quality trajectories for simple queries (i.e., head data) but struggles
with more complex ones (i.e., tail data). This leads to an imbalanced
optimization that drives the model to prioritize simple reasoning skills, while
hindering its ability to tackle more complex reasoning tasks. Over iterations,
this imbalance becomes increasingly pronounced--a dynamic we term the "Matthew
effect"--which ultimately hinders further model improvement and leads to
performance bottlenecks. To counteract this challenge, we introduce four
efficient strategies from two perspectives: distribution-reshaping and
trajectory-resampling, to achieve head-tail re-balancing during the
exploration-and-learning self-improvement process. Extensive experiments on
Qwen2-VL-7B-Instruct and InternVL2.5-4B models across visual reasoning tasks
demonstrate that our methods consistently improve visual reasoning
capabilities, outperforming vanilla self-improvement by 3.86 points on average.
中文标题/摘要
标题:通过头部-尾部再平衡对抗LVLM自我提升中的马太效应
自我提升已成为提升大型视觉-语言模型(LVLM)推理能力的主要范式,其中模型通过迭代探索和学习成功的轨迹。然而,在这一过程中,我们发现一个关键问题:模型在生成简单查询(即头部数据)的高质量轨迹方面表现出色,但在处理更复杂的查询(即尾部数据)方面却遇到困难。这导致了一种不平衡的优化,使模型优先关注简单的推理技能,而阻碍了其解决更复杂推理任务的能力。随着迭代次数的增加,这种不平衡变得越来越明显——我们将其称为“马太效应”——最终阻碍了模型的进一步改进并导致性能瓶颈。为了应对这一挑战,我们从两个角度引入了四种有效的策略:分布重塑和轨迹重采样,以在探索和学习自我提升过程中实现头部-尾部再平衡。在Qwen2-VL-7B-Instruct和InternVL2.5-4B模型的视觉推理任务上的广泛实验表明,我们的方法在视觉推理能力上始终优于传统的自我提升,平均高出3.86分。
Summary / 总结
This paper addresses the Matthew effect in the self-improvement of large vision-language models (LVLMs): models generate high-quality trajectories for simple queries (head data) but struggle with complex ones (tail data), and the imbalance grows over iterations. To counteract this, the authors propose four strategies for head-tail re-balancing, based on distribution reshaping and trajectory resampling. Experiments on Qwen2-VL-7B-Instruct and InternVL2.5-4B show that these methods outperform vanilla self-improvement by 3.86 points on average on visual reasoning tasks.
论文研究了大型视觉-语言模型(LVLM)自我改进过程中出现的马太效应,即模型在简单任务(头数据)上表现优异,但在复杂任务(尾数据)上表现不佳。为解决这一问题,作者提出了四种策略,在探索-学习过程中实现头尾平衡。实验表明,这些方法在Qwen2-VL-7B-Instruct和InternVL2.5-4B模型上提高了视觉推理能力,平均提升了3.86分。
Representation-Level Counterfactual Calibration for Debiased Zero-Shot Recognition
Authors: Pei Peng, MingKun Xie, Hang Hao, Tong Jin, ShengJun Huang
First: 2025-10-30T13:11:23+00:00 · Latest: 2025-10-30T13:11:23+00:00
Abstract
Object-context shortcuts remain a persistent challenge in vision-language
models, undermining zero-shot reliability when test-time scenes differ from
familiar training co-occurrences. We recast this issue as a causal inference
problem and ask: Would the prediction remain if the object appeared in a
different environment? To answer this at inference time, we estimate object and
background expectations within CLIP's representation space, and synthesize
counterfactual embeddings by recombining object features with diverse
alternative contexts sampled from external datasets, batch neighbors, or
text-derived descriptions. By estimating the Total Direct Effect and simulating
intervention, we further subtract background-only activation, preserving
beneficial object-context interactions while mitigating hallucinated scores.
Without retraining or prompt design, our method substantially improves both
worst-group and average accuracy on context-sensitive benchmarks, establishing
a new zero-shot state of the art. Beyond performance, our framework provides a
lightweight representation-level counterfactual approach, offering a practical
causal avenue for debiased and reliable multimodal reasoning.
中文标题/摘要
标题:表示级反事实校准以实现去偏零样本识别
物体-上下文捷径仍然是视觉-语言模型中的一个持续性挑战,当测试场景与熟悉的训练共现情况不同时,会削弱零样本识别的可靠性。我们将此问题重新表述为因果推理问题,并提出:如果物体出现在不同的环境中,预测结果是否仍然成立?为了在推理时回答这一问题,我们在CLIP的表示空间中估计物体和背景的期望,并将物体特征与从外部数据集、批内邻居或文本描述中采样的多样化替代上下文重新组合,合成反事实嵌入。通过估计总直接效应并模拟干预,我们进一步减去仅由背景引起的激活,在保留有益的物体-上下文交互的同时减轻幻觉得分。无需重新训练或设计提示,我们的方法在上下文敏感基准上显著提高了最差群体准确率和平均准确率,确立了新的零样本最先进水平。除了性能之外,我们的框架还提供了一种轻量级的表示级反事实方法,为去偏且可靠的多模态推理提供了实用的因果途径。
Summary / 总结
This paper addresses the challenge of object-context shortcuts in vision-language models by recasting the issue as a causal inference problem. The authors estimate object and background expectations within CLIP's representation space and synthesize counterfactual embeddings by recombining object features with diverse alternative contexts. By estimating the Total Direct Effect and simulating intervention, they mitigate hallucinated scores while preserving beneficial object-context interactions. This method improves both worst-group and average accuracy on context-sensitive benchmarks, setting a new zero-shot state of the art without retraining or prompt design. Beyond performance, the framework offers a lightweight causal approach for debiased and reliable multimodal reasoning.
论文针对视觉-语言模型中对象-上下文捷径的问题,该问题可能导致在测试场景与训练场景不一致时零样本识别的可靠性降低。提出了一种方法,在CLIP的表示空间中估计对象和背景的期望,并通过重新组合对象特征与多样化的替代上下文来合成反事实嵌入。这种方法在上下文敏感基准测试中提高了最差群体和平均准确率,确立了新的零样本最先进水平。除了性能提升,该方法还提供了一种轻量级的反事实框架,为去偏见和可靠的多模态推理提供了实用的因果途径。
Towards Fine-Grained Vision-Language Alignment for Few-Shot Anomaly Detection
Authors: Yuanting Fan, Jun Liu, Xiaochen Chen, Bin-Bin Gao, Jian Li, Yong Liu, Jinlong Peng, Chengjie Wang
First: 2025-10-30T13:09:00+00:00 · Latest: 2025-10-30T13:09:00+00:00
Comments: 12 pages, 7 figures
Abstract
Few-shot anomaly detection (FSAD) methods identify anomalous regions with few
known normal samples. Most existing methods rely on the generalization ability
of pre-trained vision-language models (VLMs) to recognize potentially anomalous
regions through feature similarity between text descriptions and images.
However, due to the lack of detailed textual descriptions, these methods can
only pre-define image-level descriptions and match them against each visual
patch token to identify potentially anomalous regions. This causes semantic
misalignment between image descriptions and patch-level visual anomalies and
yields sub-optimal localization performance. To address these issues, we
propose the Multi-Level Fine-Grained Semantic Caption (MFSC), which provides
multi-level, fine-grained textual descriptions for existing anomaly detection
datasets through an automatic construction pipeline. Based on the MFSC, we
propose a novel framework named FineGrainedAD to improve anomaly localization
performance, which consists of two components: Multi-Level Learnable Prompt
(MLLP) and Multi-Level Semantic Alignment (MLSA). MLLP introduces fine-grained
semantics into multi-level learnable prompts through an automatic replacement
and concatenation mechanism, while MLSA designs a region aggregation strategy
and multi-level alignment training to help learnable prompts better align
with corresponding visual regions. Experiments demonstrate that the proposed
FineGrainedAD achieves superior overall performance in few-shot settings on
MVTec-AD and VisA datasets.
中文标题/摘要
标题:朝细粒度的视觉-语言对齐方向发展少量样本异常检测
少量样本异常检测(FSAD)方法使用少量已知正常样本识别异常区域。现有大多数方法依赖预训练的视觉-语言模型(VLMs)通过文本描述和图像特征之间的相似性来识别潜在的异常区域。然而,由于缺乏详细的文本描述,这些方法只能预先定义图像级别的描述来匹配每个视觉补丁标记,以识别潜在的异常区域,这导致了图像描述与补丁级别视觉异常之间的语义不匹配,从而导致次优的定位性能。为了解决上述问题,我们提出了多级细粒度语义描述(MFSC),为现有的异常检测数据集提供多级和细粒度的文本描述,并通过自动构建管道进行自动构建。基于MFSC,我们提出了一种新的框架FineGrainedAD,以提高异常定位性能,该框架由两个组件组成:多级可学习提示(MLLP)和多级语义对齐(MLSA)。MLLP通过自动替换和连接机制将细粒度语义引入多级可学习提示,而MLSA设计了区域聚合策略和多级对齐训练,以促进可学习提示更好地与相应的视觉区域对齐。实验表明,提出的FineGrainedAD在MVTec-AD和VisA数据集的少量样本设置中实现了优越的整体性能。
Summary / 总结
This paper addresses the issue of semantic misalignment in few-shot anomaly detection by proposing Multi-Level Fine-Grained Semantic Caption (MFSC) and a novel framework named FineGrainedAD. MFSC provides detailed textual descriptions for anomaly detection datasets, while FineGrainedAD includes Multi-Level Learnable Prompt (MLLP) and Multi-Level Semantic Alignment (MLSA) to enhance anomaly localization. Experiments show that FineGrainedAD outperforms existing methods on MVTec-AD and VisA datasets in few-shot settings.
论文针对现有少量样本异常检测方法中由于缺乏详细文本描述而导致的语义不匹配问题,提出了多级精细语义标注(MFSC)以提供详细文本描述,并提出了一种名为FineGrainedAD的新框架,该框架包括多级可学习提示(MLLP)和多级语义对齐(MLSA),以提高异常定位性能。实验结果显示FineGrainedAD在MVTec-AD和VisA数据集上的表现优于现有方法。
A-TPT: Angular Diversity Calibration Properties for Test-Time Prompt Tuning of Vision-Language Models
Authors: Shihab Aaqil Ahamed, Udaya S. K. P. Miriya Thanthrige, Ranga Rodrigo, Muhammad Haris Khan
First: 2025-10-30T12:45:24+00:00 · Latest: 2025-10-30T12:45:24+00:00
Comments: 23 pages, 14 figures
Abstract
Test-time prompt tuning (TPT) has emerged as a promising technique for
adapting large vision-language models (VLMs) to unseen tasks without relying on
labeled data. However, the lack of dispersion between textual features can hurt
calibration performance, which raises concerns about VLMs' reliability,
trustworthiness, and safety. Current TPT approaches primarily focus on
improving prompt calibration by either maximizing average textual feature
dispersion or enforcing orthogonality constraints to encourage angular
separation. However, these methods do not always achieve optimal angular
separation between class-wise textual features, which suggests that they
overlook the critical role of angular diversity. To address this, we propose A-TPT, a novel
TPT framework that introduces angular diversity to encourage uniformity in the
distribution of normalized textual features induced by corresponding learnable
prompts. This uniformity is achieved by maximizing the minimum pairwise angular
distance between features on the unit hypersphere. We show that our approach
consistently surpasses state-of-the-art TPT methods in reducing the aggregate
average calibration error while maintaining comparable accuracy through
extensive experiments with various backbones on different datasets. Notably,
our approach exhibits superior zero-shot calibration performance on natural
distribution shifts and generalizes well to medical datasets. We provide
extensive analyses, including theoretical aspects, to establish the grounding
of A-TPT. These results highlight the potency of promoting angular diversity to
achieve well-dispersed textual features, significantly improving VLM
calibration during test-time adaptation. Our code will be made publicly
available.
中文标题/摘要
标题:A-TPT:视觉语言模型测试时提示调优的角多样性校准特性
测试时提示调优(TPT)已成为一种有前景的技术,用于在无需依赖标记数据的情况下,将大型视觉语言模型(VLMs)适应未见过的任务。然而,文本特征之间的缺乏分散性会损害校准性能,这引起了人们对VLMs可靠性和安全性的担忧。当前的TPT方法主要通过最大化平均文本特征分散性或施加正交约束来鼓励角度分离,以提高提示校准。然而,这些方法可能无法始终在类别间文本特征之间实现最优的角度分离,这意味着忽视了角多样性的关键作用。为了解决这个问题,我们提出了一种新颖的A-TPT框架,该框架引入了角多样性,以鼓励由相应可学习提示诱导的归一化文本特征的分布均匀性。这种均匀性是通过最大化单位超球面上特征之间的最小成对角度距离来实现的。我们通过在不同数据集上使用各种骨干网络进行广泛实验,展示了我们的方法在降低综合平均校准误差方面始终优于最先进的TPT方法,同时保持了相当的准确性。值得注意的是,我们的方法在自然分布转移的零样本校准性能方面表现出色,并且能够很好地泛化到医学数据集。我们提供了广泛的分析,包括理论方面,以建立A-TPT的基础。这些结果突显了促进角多样性以实现分散的文本特征的潜力,显著提高了VLM在测试时适应过程中的校准。我们的代码将公开发布。
Summary / 总结
The paper addresses insufficient angular diversity among textual features, which can degrade the calibration of vision-language models during test-time prompt tuning (TPT). It introduces A-TPT, a TPT framework that maximizes the minimum pairwise angular distance between normalized textual features on the unit hypersphere to encourage a uniform distribution. Extensive experiments show that A-TPT reduces calibration error relative to existing TPT methods while maintaining comparable accuracy, with strong zero-shot calibration under natural distribution shifts and good generalization to medical datasets.
论文提出了A-TPT,这是一种新颖的视觉-语言模型测试时提示调优框架,旨在通过增强角度多样性来提升校准性能。通过最大化特征之间的最小成对角度距离,A-TPT在减少校准误差的同时保持了准确性,并且在自然分布变化的零样本校准上表现出色,同时在医学数据集上具有良好的泛化能力。
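The quantity that the A-TPT abstract says is maximized, the minimum pairwise angular distance between normalized features on the unit hypersphere, can be illustrated with a small sketch. This is not the authors' implementation: the function names (`normalize`, `min_pairwise_angle`, `angular_diversity_loss`) are hypothetical, the features are plain Python lists rather than text-encoder outputs, and the negated angle is only one plausible way to turn the objective into a loss.

```python
import math


def normalize(v):
    """Project a vector onto the unit hypersphere (L2 normalization)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def min_pairwise_angle(features):
    """Return the smallest angle (in radians) between any two feature vectors."""
    unit = [normalize(f) for f in features]
    smallest = math.pi  # the largest possible angle between two vectors
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            cos = sum(a * b for a, b in zip(unit[i], unit[j]))
            cos = max(-1.0, min(1.0, cos))  # guard against rounding drift
            smallest = min(smallest, math.acos(cos))
    return smallest


def angular_diversity_loss(features):
    """Assumed loss form: minimizing it maximizes the minimum pairwise angle."""
    return -min_pairwise_angle(features)
```

In a real test-time tuning loop the features would come from the text encoder applied to learnable prompts, and a differentiable framework would be needed to backpropagate through this term; the sketch only shows what is being measured.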
On-the-Fly OVD Adaptation with FLAME: Few-shot Localization via Active Marginal-Samples Exploration
Authors: Yehonathan Refael, Amit Aides, Aviad Barzilai, George Leifman, Genady Beryozkin, Vered Silverman, Bolous Jaber, Tomer Shekel
First: 2025-10-20T15:41:55+00:00 · Latest: 2025-10-30T12:05:58+00:00
Abstract
Open-vocabulary object detection (OVD) models offer remarkable flexibility by
detecting objects from arbitrary text queries. However, their zero-shot
performance in specialized domains like Remote Sensing (RS) is often
compromised by the inherent ambiguity of natural language, limiting critical
downstream applications. For instance, an OVD model may struggle to distinguish
between fine-grained classes such as "fishing boat" and "yacht" since their
embeddings are similar and often inseparable. This can hamper specific user
goals, such as monitoring illegal fishing, by producing irrelevant detections.
To address this, we propose a cascaded approach that couples the broad
generalization of a large pre-trained OVD model with a lightweight few-shot
classifier. Our method first employs the zero-shot model to generate
high-recall object proposals. These proposals are then refined for high
precision by a compact classifier trained in real time on only a handful of
user-annotated examples, drastically reducing the high cost of RS imagery
annotation. The core of our framework is FLAME, a one-step active learning
strategy that selects the most informative samples for training. FLAME
identifies, on the fly, uncertain marginal candidates near the decision
boundary using density estimation, followed by clustering to ensure sample
diversity. This efficient sampling technique achieves high accuracy without
costly full-model fine-tuning and enables instant adaptation in less than
a minute, which is significantly faster than state-of-the-art alternatives. Our
method consistently surpasses state-of-the-art performance on RS benchmarks,
establishing a practical and resource-efficient framework for adapting
foundation models to specific user needs.
中文标题/摘要
标题:FLAME驱动的即时OVD适应:基于活跃边际样本探索的少样本定位
开放词汇对象检测(OVD)模型通过从任意文本查询中检测对象提供了显著的灵活性。然而,在如遥感(RS)等专门领域中,它们的零样本性能往往因自然语言的固有歧义而受损,限制了关键的下游应用。例如,一个OVD模型可能难以区分“渔船”和“游艇”这类细粒度类别,因为它们的嵌入相似且经常不可分。这可能妨碍特定用户目标,如监测非法捕鱼,导致无关的检测结果。为了解决这一问题,我们提出了一种级联方法,将大型预训练OVD模型的广泛泛化与轻量级少样本分类器相结合。我们的方法首先使用零样本模型生成高召回的对象提案,然后通过仅在少量用户标注示例上实时训练的小型分类器进行高精度细化,大幅降低了RS图像标注的高昂成本。我们框架的核心是FLAME,这是一种一步式主动学习策略,能够选择最具信息量的样本进行训练。FLAME利用密度估计在决策边界附近即时识别不确定的边际候选样本,然后通过聚类确保样本多样性。这种高效的采样技术在无需昂贵的全模型微调的情况下实现了高精度,并能在不到一分钟内实现即时适应,显著快于最先进的替代方案。我们的方法在RS基准测试中始终超越了最先进的性能,建立了一个实用且资源高效的框架,用于将基础模型适应特定用户需求。
Summary / 总结
The paper addresses the weak zero-shot performance of open-vocabulary object detection (OVD) models in specialized domains like Remote Sensing (RS), where natural-language queries cannot reliably separate fine-grained classes. It proposes a cascaded approach combining a large pre-trained OVD model, which generates high-recall proposals, with a lightweight few-shot classifier that refines them. A one-step active learning strategy called FLAME selects informative samples near the decision boundary for annotation, achieving high accuracy without full-model fine-tuning and enabling adaptation in under a minute. Experiments show that this approach consistently outperforms state-of-the-art methods on RS benchmarks.
论文针对开放词汇对象检测(OVD)模型在遥感(RS)等专业领域中的零样本性能不足问题,提出了一个级联方法,结合了一个大型预训练OVD模型和一个轻量级的少量样本分类器。该方法使用预训练模型生成对象提案,然后通过少量用户标注的示例训练一个紧凑的分类器进行细化。框架的核心是FLAME,这是一种实时选择具有信息性的样本进行训练的主动学习策略,能够在不进行全模型微调的情况下实现高精度,并且能够在不到一分钟内实现快速适应。实验结果表明该方法在RS基准测试中优于现有最佳方法。
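The two-step selection described in the FLAME abstract (find uncertain candidates near the decision boundary, then enforce diversity) can be sketched roughly as follows. This is a simplified stand-in, not the paper's algorithm: a fixed score band around an assumed threshold replaces the density estimation, greedy farthest-point selection replaces the clustering step, and every name and parameter (`select_marginal_diverse`, `threshold`, `band`, `k`) is hypothetical.

```python
import math


def select_marginal_diverse(embeddings, scores, threshold=0.5, band=0.1, k=2):
    """Pick up to k proposals that are uncertain and mutually far apart.

    embeddings: list of feature vectors, one per object proposal
    scores:     zero-shot confidence per proposal (assumed to lie in [0, 1])
    """
    # Step 1 (stand-in for density estimation): keep proposals whose score
    # falls inside a narrow band around the decision threshold.
    marginal = [i for i, s in enumerate(scores) if abs(s - threshold) <= band]
    if not marginal:
        return []

    # Step 2 (stand-in for clustering): start from the most uncertain proposal,
    # then repeatedly add the candidate farthest from everything chosen so far.
    chosen = [min(marginal, key=lambda i: abs(scores[i] - threshold))]
    while len(chosen) < min(k, len(marginal)):
        remaining = [i for i in marginal if i not in chosen]
        farthest = max(
            remaining,
            key=lambda i: min(math.dist(embeddings[i], embeddings[c]) for c in chosen),
        )
        chosen.append(farthest)
    return chosen
```

The returned indices would be the proposals shown to the user for annotation; the labeled examples then train the lightweight classifier that refines the detector's high-recall proposals.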
MedSAE: Dissecting MedCLIP Representations with Sparse Autoencoders
Authors: Riccardo Renzulli, Colas Lepoutre, Enrico Cassano, Marco Grangetto
First: 2025-10-30T11:58:36+00:00 · Latest: 2025-10-30T11:58:36+00:00
Abstract
Artificial intelligence in healthcare requires models that are accurate and
interpretable. We advance mechanistic interpretability in medical vision by
applying Medical Sparse Autoencoders (MedSAEs) to the latent space of MedCLIP,
a vision-language model trained on chest radiographs and reports. To quantify
interpretability, we propose an evaluation framework that combines correlation
metrics, entropy analyses, and automated neuron naming via the MedGEMMA
foundation model. Experiments on the CheXpert dataset show that MedSAE neurons
achieve higher monosemanticity and interpretability than raw MedCLIP features.
Our findings bridge high-performing medical AI and transparency, offering a
scalable step toward clinically reliable representations.
中文标题/摘要
标题:MedSAE:通过稀疏自编码器剖析MedCLIP表示
医疗保健中的人工智能需要准确且可解释的模型。我们通过将医疗稀疏自编码器(MedSAEs)应用于MedCLIP的潜在空间,推进了医学视觉的机制可解释性,MedCLIP是一种在胸部X光片和报告上训练的视觉-语言模型。为了量化可解释性,我们提出了一种结合相关性度量、熵分析和通过MedGEMMA基础模型自动命名神经元的评估框架。在CheXpert数据集上的实验表明,MedSAE神经元在单义性和可解释性方面优于原始的MedCLIP特征。我们的研究结果将高性能的医疗AI与透明度相结合,提供了一条通往临床可靠表示的可扩展途径。
Summary / 总结
The research aims to enhance the interpretability of medical vision models by applying Medical Sparse Autoencoders (MedSAEs) to the latent space of MedCLIP, a vision-language model trained on chest radiographs and reports. The study introduces an evaluation framework combining correlation metrics, entropy analysis, and automated neuron naming via MedGEMMA to quantify interpretability. Experiments on the CheXpert dataset demonstrate that MedSAE neurons achieve higher monosemanticity and interpretability compared to raw MedCLIP features, bridging high-performing medical AI with transparency.
研究旨在通过将Medical Sparse Autoencoders (MedSAEs)应用于MedCLIP的潜在空间,提升医疗视觉模型的可解释性,MedCLIP是一个在胸部X光片和报告上训练的视觉-语言模型。研究引入了一种结合相关性指标、熵分析和通过MedGEMMA基础模型自动命名神经元的评估框架来量化可解释性。实验结果表明,MedSAE神经元在单义性和可解释性方面优于原始的MedCLIP特征,实现了高性能医疗AI与透明度的结合。
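For readers unfamiliar with sparse autoencoders, the sketch below shows a generic formulation applied to a single embedding vector: a ReLU encoder, a linear decoder, and a reconstruction loss with an L1 sparsity penalty. The MedSAE abstract does not specify the architecture or loss, so this form, the function names, and the `l1_coeff` value are all assumptions rather than the authors' design.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode x into sparse non-negative activations z, then reconstruct x_hat."""
    pre_activation = [a + b for a, b in zip(matvec(w_enc, x), b_enc)]
    z = [max(0.0, a) for a in pre_activation]  # ReLU keeps activations sparse
    x_hat = [a + b for a, b in zip(matvec(w_dec, z), b_dec)]
    return z, x_hat


def sae_loss(x, z, x_hat, l1_coeff=1e-3):
    """Mean squared reconstruction error plus an L1 penalty on the activations."""
    mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    sparsity = sum(abs(a) for a in z)
    return mse + l1_coeff * sparsity
```

In practice `x` would be a vision-language embedding, the hidden layer would be much wider than the input, and the weights would be trained by gradient descent; each hidden unit `z[i]` is then a candidate "neuron" whose interpretability can be scored.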