arXiv 论文速递

2025-11-04 03:27
Snapshot: 20251104_0327
PETAR: Localized Findings Generation with Mask-Aware Vision-Language Modeling for PET Automated Reporting
Authors: Danyal Maqbool, Changhee Lee, Zachary Huemann, Samuel D. Church, Matthew E. Larson, Scott B. Perlman, Tomas A. Romero, Joshua D. Warner, Meghan Lubner, Xin Tie, Jameson Merkow, Junjie Hu, Steve Y. Cho, Tyler J. Bradshaw
First: 2025-10-31T17:49:01+00:00 · Latest: 2025-10-31T17:49:01+00:00
Abstract
Recent advances in vision-language models (VLMs) have enabled impressive multimodal reasoning, yet most medical applications remain limited to 2D imaging. In this work, we extend VLMs to 3D positron emission tomography and computed tomography (PET/CT), a domain characterized by large volumetric data, small and dispersed lesions, and lengthy radiology reports. We introduce a large-scale dataset comprising over 11,000 lesion-level descriptions paired with 3D segmentations from more than 5,000 PET/CT exams, extracted via a hybrid rule-based and large language model (LLM) pipeline. Building upon this dataset, we propose PETAR-4B, a 3D mask-aware vision-language model that integrates PET, CT, and lesion contours for spatially grounded report generation. PETAR bridges global contextual reasoning with fine-grained lesion awareness, producing clinically coherent and localized findings. Comprehensive automated and human evaluations demonstrate that PETAR substantially improves PET/CT report generation quality, advancing 3D medical vision-language understanding.
中文标题/摘要
标题:PETAR:基于掩码意识的视觉-语言建模在PET自动化报告中的局部发现生成
近期视觉-语言模型(VLMs)的发展使多模态推理取得了显著进展,但大多数医学应用仍局限于二维成像。本文将VLMs扩展到3D正电子发射断层扫描和计算机断层扫描(PET/CT),这是一个以大量体数据、小而分散的病灶和冗长的放射学报告为特征的领域。我们引入了一个包含超过11,000个病灶级描述的大规模数据集,这些描述与来自超过5,000次PET/CT检查的3D分割配对,通过混合基于规则和大型语言模型(LLM)的管道提取。基于此数据集,我们提出了PETAR-4B,这是一种3D掩码意识的视觉-语言模型,结合了PET、CT和病灶轮廓,用于空间定位报告生成。PETAR将全局上下文推理与细粒度的病灶意识相结合,生成临床一致且局部化的发现。全面的自动化和人工评估表明,PETAR显著提高了PET/CT报告生成的质量,推进了3D医学视觉-语言理解。
SpecAttn: Speculating Sparse Attention
Authors: Harsh Shah
Venue: NeurIPS 2025
First: 2025-10-31T17:12:34+00:00 · Latest: 2025-10-31T17:12:34+00:00
Comments: Accepted to NeurIPS 2025 Workshop on Structured Probabilistic Inference & Generative Modeling
Abstract
Large Language Models (LLMs) face significant computational bottlenecks during inference due to the quadratic complexity of self-attention mechanisms, particularly as context lengths increase. We introduce SpecAttn, a novel training-free approach that seamlessly integrates with existing speculative decoding techniques to enable efficient sparse attention in pre-trained transformers. Our key insight is to exploit the attention weights already computed by the draft model during speculative decoding to identify important tokens for the target model, eliminating redundant computation while maintaining output quality. SpecAttn employs three core techniques: KL divergence-based layer alignment between draft and target models, a GPU-optimized sorting-free algorithm for top-p token selection from draft attention patterns, and dynamic key-value cache pruning guided by these predictions. By leveraging the computational work already performed in standard speculative decoding pipelines, SpecAttn achieves over 75% reduction in key-value cache accesses with a mere 15.29% increase in perplexity on the PG-19 dataset, significantly outperforming existing sparse attention methods. Our approach demonstrates that speculative execution can be enhanced to provide approximate verification without significant performance degradation.
中文标题/摘要
标题:SpecAttn: 预测稀疏注意
大型语言模型(LLMs)在推理过程中由于自我注意机制的二次复杂性而面临显著的计算瓶颈,特别是在上下文长度增加时。我们引入了SpecAttn,这是一种无需训练的新颖方法,可以无缝集成到现有的推测性解码技术中,以在预训练的变压器中实现高效的稀疏注意。我们的核心见解是利用在推测性解码过程中由草稿模型计算出的注意权重来识别目标模型中的重要令牌,从而消除冗余计算并保持输出质量。SpecAttn 使用了三种核心技术:基于 KL 散度的草稿模型和目标模型之间的层对齐、一种基于草稿注意模式的 GPU 优化的无排序 top-p 令牌选择算法,以及由这些预测指导的动态键值缓存剪枝。通过利用标准推测性解码管道中已经完成的计算工作,SpecAttn 在 PG-19 数据集上实现了超过 75% 的键值缓存访问量减少,同时仅增加了 15.29% 的困惑度,显著优于现有的稀疏注意方法。我们的方法表明,推测执行可以增强以提供近似验证,而不会显著降低性能。
Summary / 总结
SpecAttn is a training-free method that integrates with speculative decoding to enable efficient sparse attention in pre-trained transformers. It uses KL divergence for layer alignment, a GPU-optimized sorting-free algorithm for token selection, and dynamic key-value cache pruning. SpecAttn reduces key-value cache accesses by over 75% with only a 15.29% increase in perplexity on the PG-19 dataset, outperforming existing sparse attention methods.
SpecAttn 是一种无需训练的方法,结合推测性解码来在预训练的变压器中实现高效的稀疏注意力。它使用 KL 散度进行层对齐,一种 GPU 优化的无排序算法进行 token 选择,以及由这些预测引导的动态键值缓存剪枝。SpecAttn 在 PG-19 数据集上将键值缓存访问次数减少了超过 75%,同时仅增加了 15.29% 的困惑度,优于现有的稀疏注意力方法。
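The top-p token selection step mentioned above can be illustrated with a small sketch. This is not the authors' GPU kernel: it is a pure-Python illustration of one way to pick a top-p set without sorting, by binary-searching a weight threshold. The function name `top_p_indices` and its parameters are hypothetical.

```python
def top_p_indices(weights, p=0.9, iters=30):
    """Pick tokens covering at least a fraction p of attention mass, without sorting.

    Assumption: `weights` are non-negative draft-model attention weights for
    one query. We binary-search a threshold and keep every token at or above it.
    """
    total = sum(weights)
    lo, hi = 0.0, max(weights)
    # Invariant: keeping all weights >= lo covers at least p of the total mass.
    for _ in range(iters):
        mid = (lo + hi) / 2
        mass = sum(w for w in weights if w >= mid)
        if mass >= p * total:
            lo = mid  # still enough mass, so the threshold can rise
        else:
            hi = mid  # too little mass, so the threshold must drop
    return [i for i, w in enumerate(weights) if w >= lo]


attn = [0.5, 0.3, 0.1, 0.05, 0.05]
kept = top_p_indices(attn, p=0.75)
```

Each pass over the weights is a simple reduction, which is the property that makes threshold search attractive on parallel hardware compared with a full sort.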
VeriMoA: A Mixture-of-Agents Framework for Spec-to-HDL Generation
Authors: Heng Ping, Arijit Bhattacharjee, Peiyu Zhang, Shixuan Li, Wei Yang, Anzhe Cheng, Xiaole Zhang, Jesse Thomason, Ali Jannesari, Nesreen Ahmed, Paul Bogdan
First: 2025-10-31T16:40:58+00:00 · Latest: 2025-10-31T16:40:58+00:00
Abstract
Automation of Register Transfer Level (RTL) design can help developers meet increasing computational demands. Large Language Models (LLMs) show promise for Hardware Description Language (HDL) generation, but face challenges due to limited parametric knowledge and domain-specific constraints. While prompt engineering and fine-tuning have limitations in knowledge coverage and training costs, multi-agent architectures offer a training-free paradigm to enhance reasoning through collaborative generation. However, current multi-agent approaches suffer from two critical deficiencies: susceptibility to noise propagation and constrained reasoning space exploration. We propose VeriMoA, a training-free mixture-of-agents (MoA) framework with two synergistic innovations. First, a quality-guided caching mechanism maintains all intermediate HDL outputs and enables quality-based ranking and selection across the entire generation process, encouraging knowledge accumulation over layers of reasoning. Second, a multi-path generation strategy leverages C++ and Python as intermediate representations, decomposing specification-to-HDL translation into two-stage processes that exploit LLM fluency in high-resource languages while promoting solution diversity. Comprehensive experiments on VerilogEval 2.0 and RTLLM 2.0 benchmarks demonstrate that VeriMoA achieves 15-30% improvements in Pass@1 across diverse LLM backbones, especially enabling smaller models to match larger models and fine-tuned alternatives without requiring costly training.
中文标题/摘要
标题:VeriMoA:一种基于代理混合的从规格到HDL生成框架
自动化寄存器传输级(RTL)设计可以帮助开发人员满足日益增长的计算需求。大型语言模型(LLMs)在硬件描述语言(HDL)生成方面显示出潜力,但由于参数知识有限和领域特定约束,面临挑战。尽管提示工程和微调在知识覆盖和训练成本方面存在局限性,多代理架构提供了一种无需训练的范式,通过协作生成增强推理。然而,当前的多代理方法存在两个关键缺陷:易受噪声传播的影响和受限的推理空间探索。我们提出VeriMoA,一种无需训练的基于代理混合(MoA)框架,包含两个协同创新。首先,一种质量导向的缓存机制,用于维护所有中间HDL输出,并在整个生成过程中实现基于质量的排名和选择,鼓励在多层推理中积累知识。其次,一种多路径生成策略,利用C++和Python作为中间表示,将规格到HDL的转换分解为两个阶段的过程,利用LLMs在高资源语言中的流畅性,同时促进解决方案的多样性。在VerilogEval 2.0和RTLLM 2.0基准测试上的全面实验表明,VeriMoA在不同的LLM基础模型上实现了15-30%的Pass@1改进,特别是使较小的模型能够匹配较大的模型和微调替代方案,而无需进行昂贵的训练。
Summary / 总结
VeriMoA is a training-free mixture-of-agents framework designed to enhance the automation of RTL design through collaborative generation. It introduces a quality-guided caching mechanism and a multi-path generation strategy to improve reasoning and solution diversity. Experiments show that VeriMoA achieves 15-30% improvements in Pass@1 across various LLM backbones, particularly enabling smaller models to match larger models and fine-tuned alternatives without additional training costs.
VeriMoA 是一个无需训练的混合代理框架,旨在通过协作生成增强 RTL 设计的自动化。它引入了质量导向的缓存机制和多路径生成策略,以解决当前多代理方法的局限性。实验结果表明,VeriMoA 在 VerilogEval 2.0 和 RTLLM 2.0 基准测试中将 Pass@1 提高了 15-30%,特别地,使较小的模型能够匹配较大的模型和微调的替代方案,而无需额外的训练成本。
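The quality-guided caching idea can be sketched roughly as follows. This is a minimal illustration, assuming a numeric quality score per candidate is already available (for example from a simulator or a linter); the class `QualityCache`, its methods, and the scores are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class QualityCache:
    """Keeps every intermediate HDL candidate so later layers can pick the best."""
    entries: list = field(default_factory=list)  # (score, layer, hdl_text)

    def add(self, layer, hdl_text, score):
        # Nothing is discarded: candidates from all layers stay in the cache.
        self.entries.append((score, layer, hdl_text))

    def top_k(self, k):
        # Rank across the whole generation history, not just the last layer.
        ranked = sorted(self.entries, key=lambda e: e[0], reverse=True)
        return [hdl for _, _, hdl in ranked[:k]]


cache = QualityCache()
cache.add(layer=0, hdl_text="module a; endmodule", score=0.4)
cache.add(layer=1, hdl_text="module b; endmodule", score=0.9)
cache.add(layer=2, hdl_text="module c; endmodule", score=0.6)
references = cache.top_k(2)  # fed to the next layer of agents as context
```

Ranking over the full history means a strong early candidate is not lost if a later layer produces noisier outputs.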
Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning
Authors: Yuhong Liu, Beichen Zhang, Yuhang Zang, Yuhang Cao, Long Xing, Xiaoyi Dong, Haodong Duan, Dahua Lin, Jiaqi Wang
First: 2025-10-31T16:30:08+00:00 · Latest: 2025-10-31T16:30:08+00:00
Comments: preprint
Abstract
Spatial understanding remains a weakness of Large Vision-Language Models (LVLMs). Existing supervised fine-tuning (SFT) and recent reinforcement learning with verifiable rewards (RLVR) pipelines depend on costly supervision, specialized tools, or constrained environments that limit scale. We introduce Spatial-SSRL, a self-supervised RL paradigm that derives verifiable signals directly from ordinary RGB or RGB-D images. Spatial-SSRL automatically formulates five pretext tasks that capture 2D and 3D spatial structure: shuffled patch reordering, flipped patch recognition, cropped patch inpainting, regional depth ordering, and relative 3D position prediction. These tasks provide ground-truth answers that are easy to verify and require no human or LVLM annotation. Training on our tasks substantially improves spatial reasoning while preserving general visual capabilities. On seven spatial understanding benchmarks in both image and video settings, Spatial-SSRL delivers average accuracy gains of 4.63% (3B) and 3.89% (7B) over the Qwen2.5-VL baselines. Our results show that simple, intrinsic supervision enables RLVR at scale and provides a practical route to stronger spatial intelligence in LVLMs.
中文标题/摘要
标题:Spatial-SSRL:通过自我监督强化学习提升空间理解
空间理解仍然是大型视觉-语言模型(LVLMs)的弱点。现有的监督微调(SFT)和最近的可验证奖励强化学习(RLVR)管道依赖于昂贵的监督、专门的工具或受限的环境,这限制了规模。我们引入了Spatial-SSRL,这是一种自我监督的RL范式,可以从普通的RGB或RGB-D图像中直接推导出可验证的信号。Spatial-SSRL 自动制定了五个预训练任务,捕捉2D和3D空间结构:打乱的块重排序、翻转的块识别、裁剪的块填充、区域深度排序和相对3D位置预测。这些任务提供了易于验证的正确答案,不需要人类或LVLM的标注。在我们的任务上进行训练显著提高了空间推理能力,同时保留了通用的视觉能力。在七个空间理解基准测试中,无论是图像还是视频设置,Spatial-SSRL 在Qwen2.5-VL 基线上的平均准确率分别提高了4.63%(3B)和3.89%(7B)。我们的结果表明,简单的内在监督使RLVR能够大规模实现,并为在LVLMs中实现更强的空间智能提供了实际途径。
Summary / 总结
Spatial-SSRL is a self-supervised reinforcement learning approach that enhances spatial understanding in large vision-language models (LVLMs) without costly supervision. It introduces five pretext tasks derived from ordinary RGB or RGB-D images to improve 2D and 3D spatial reasoning. On seven spatial understanding benchmarks, Spatial-SSRL achieves average accuracy gains of 4.63% (3B) and 3.89% (7B) over the Qwen2.5-VL baselines, demonstrating that simple intrinsic supervision can enable scalable reinforcement learning with verifiable rewards for LVLMs.
Spatial-SSRL 是一种无需昂贵监督的自监督强化学习方法,旨在增强大型视觉-语言模型(LVLM)的空间理解能力。它通过普通图像自动生成可验证的信号来自动制定五个预训练任务。在七个空间理解基准测试中,Spatial-SSRL 分别在 3B 和 7B 参数的 Qwen2.5-VL 基线模型上实现了 4.63% 和 3.89% 的平均准确率提升。
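To make the notion of a verifiable pretext task concrete, here is a small sketch of shuffled patch reordering on a toy image. It is an illustration under stated assumptions, not the paper's pipeline: the image is a plain 2D list, and the helper names (`split_patches`, `make_reorder_task`, `reward`) are hypothetical.

```python
import random


def split_patches(image, rows, cols):
    """Cut a 2D list into rows x cols patches, in row-major order."""
    ph, pw = len(image) // rows, len(image[0]) // cols
    return [
        [row[c * pw:(c + 1) * pw] for row in image[r * ph:(r + 1) * ph]]
        for r in range(rows)
        for c in range(cols)
    ]


def make_reorder_task(image, rows, cols, seed=0):
    """Return shuffled patches plus the ground-truth order, with no labeling needed."""
    patches = split_patches(image, rows, cols)
    order = list(range(len(patches)))
    random.Random(seed).shuffle(order)
    # order[i] is the original index of the patch shown at position i.
    return [patches[i] for i in order], order


def reward(predicted, order):
    """Binary verifiable reward: exact match with the known permutation."""
    return 1.0 if list(predicted) == list(order) else 0.0


image = [[r * 4 + c for c in range(4)] for r in range(4)]
shuffled, order = make_reorder_task(image, rows=2, cols=2, seed=1)
```

Because the permutation is generated by the task builder itself, the answer can be checked exactly, which is what makes such tasks usable as RL rewards.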
NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation
Authors: Xiangyan Liu, Jinjie Ni, Zijian Wu, Chao Du, Longxu Dou, Haonan Wang, Tianyu Pang, Michael Qizhe Shieh
Venue: NeurIPS 2025
First: 2025-04-17T16:10:13+00:00 · Latest: 2025-10-31T15:41:28+00:00
Comments: NeurIPS 2025
Abstract
Recent advances in reinforcement learning (RL) have strengthened the reasoning capabilities of vision-language models (VLMs). However, enhancing policy exploration to better scale test-time compute remains largely underexplored. In addition, VLMs continue to struggle with imperfect visual perception, which in turn affects the subsequent reasoning process. We introduce NoisyRollout, a simple yet effective data augmentation method that addresses these issues by mixing training trajectories from both clean and moderately distorted images. This approach injects perceptual diversity, encouraging better policy exploration and leading to more robust reasoning. A noise annealing schedule gradually reduces distortion strength, aiding exploration early in training while ensuring later stability. Crucially, our method is easy to adopt, requiring no additional training cost and no modifications to the RL objective. Extensive experiments on 2 distinct training datasets demonstrate that NoisyRollout achieves state-of-the-art performance among open-source RL-tuned models across 5 out-of-domain reasoning and perception benchmarks. Furthermore, we validate the effectiveness of NoisyRollout across model sizes (7B and 32B), data scales (from 1K to 6K) and image augmentation types (Gaussian noise and rotation), highlighting its generalizability and scalability.
中文标题/摘要
标题:NoisyRollout:通过数据增强强化视觉推理
近期强化学习(RL)的进步增强了视觉语言模型(VLMs)的推理能力。然而,如何通过增强策略探索来更好地扩展测试时的计算能力仍然鲜有探索。此外,VLMs 在不完美的视觉感知方面仍然存在困难,这反过来影响了后续的推理过程。我们提出了 NoisyRollout,这是一种简单而有效的方法,通过混合干净图像和适度失真图像的训练轨迹来解决这些问题。这种方法注入了感知多样性,鼓励更好的策略探索,从而实现更稳健的推理。通过逐渐减少失真强度的噪声退火计划,NoisyRollout 在训练早期帮助探索,同时确保后期的稳定性。重要的是,我们的方法易于采用——无需额外的训练成本,也不需要修改RL目标。在两个不同的训练数据集上的广泛实验表明,NoisyRollout 在5个跨域推理和感知基准测试中实现了开源RL调优模型的最新性能。此外,我们验证了NoisyRollout在不同模型规模(7B和32B)、不同数据规模(从1K到6K)和不同图像增强类型(高斯噪声和旋转)中的有效性,突显了其通用性和可扩展性。
Summary / 总结
NoisyRollout is a data augmentation method that enhances the exploration capabilities of vision-language models in reinforcement learning by mixing training trajectories from clean and moderately distorted images. This approach improves robustness and performance across various benchmarks and model sizes. Extensive experiments show that NoisyRollout achieves state-of-the-art performance among open-source RL-tuned models and is easily adoptable with no additional training cost or modifications to the RL objective.
NoisyRollout 是一种数据增强方法,通过混合干净和适度失真的图像来增强强化学习中视觉语言模型的策略探索。它引入了感知多样性,从而提高了鲁棒性并取得了更好的性能。该方法易于采用,无需额外的训练成本或修改RL目标。大量实验表明,NoisyRollout 在不同数据规模和图像增强类型(高斯噪声和旋转)下,对于7B和32B模型在5个跨域基准测试中均达到了最先进的性能。
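The combination of mixed rollouts and noise annealing can be sketched as below. The linear decay, the initial strength of 0.5, and the function names are assumptions chosen for illustration; the abstract does not specify the actual schedule.

```python
import random


def noise_strength(step, total_steps, initial=0.5):
    """Assumed linear annealing: strong distortion early, none at the end."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return initial * (1.0 - frac)


def add_gaussian_noise(pixels, sigma, rng):
    """Perturb pixel values in [0, 1] and clamp back into range."""
    return [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in pixels]


def build_rollout_inputs(pixels, step, total_steps, n_clean=2, n_noisy=2, seed=0):
    """Mix clean and distorted copies of one image for a group of rollouts."""
    rng = random.Random(seed)
    sigma = noise_strength(step, total_steps)
    clean = [list(pixels) for _ in range(n_clean)]
    noisy = [add_gaussian_noise(pixels, sigma, rng) for _ in range(n_noisy)]
    return clean + noisy, sigma


pixels = [0.2, 0.5, 0.8]
early_inputs, early_sigma = build_rollout_inputs(pixels, step=0, total_steps=100)
late_inputs, late_sigma = build_rollout_inputs(pixels, step=100, total_steps=100)
```

At the final step the distortion strength reaches zero, so the noisy copies coincide with the clean ones, which matches the stated goal of stability late in training.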
From Pixels to Paths: A Multi-Agent Framework for Editable Scientific Illustration
Authors: Jianwen Sun, Fanrui Zhang, Yukang Feng, Chuanhao Li, Zizhen Li, Jiaxin Ai, Yifan Chang, Yu Dai, Kaipeng Zhang
First: 2025-10-31T13:00:49+00:00 · Latest: 2025-10-31T13:00:49+00:00
Abstract
Scientific illustrations demand both high information density and post-editability. However, current generative models have two major limitations. First, image generation models output rasterized images lacking semantic structure, making it impossible to access, edit, or rearrange independent visual components in the images. Second, code-based generation methods (TikZ or SVG), although providing element-level control, force users into the cumbersome cycle of "writing-compiling-reviewing" and lack the intuitiveness of direct manipulation. Neither approach adequately meets the needs for efficiency, intuitiveness, and iterative modification in scientific creation. To bridge this gap, we introduce VisPainter, a multi-agent framework for scientific illustration built upon the model context protocol. VisPainter orchestrates three specialized modules (a Manager, a Designer, and a Toolbox) to collaboratively produce diagrams compatible with standard vector graphics software. This modular, role-based design allows each element to be explicitly represented and manipulated, enabling true element-level control: any element can be added or modified later. To systematically evaluate the quality of scientific illustrations, we introduce VisBench, a benchmark with seven-dimensional evaluation metrics. It assesses high-information-density scientific illustrations from four aspects: content, layout, visual perception, and interaction cost. We conducted extensive ablation experiments to verify the rationality of our architecture and the reliability of our evaluation methods. Finally, we evaluated various vision-language models, presenting fair and credible model rankings along with detailed comparisons of their respective capabilities. Additionally, we isolated and quantified the impacts of role division, step control, and description on the quality of illustrations.
中文标题/摘要
标题:从像素到路径:基于模型上下文协议的多智能体科学插图框架
科学插图需要高信息密度和可编辑性。然而,当前的生成模型有两个主要局限性:首先,图像生成模型输出的是缺乏语义结构的位图图像,使得无法访问、编辑或重新排列图像中的独立视觉组件。其次,基于代码的生成方法(如TikZ或SVG),虽然提供了元素级别的控制,但迫使用户陷入“编写-编译-审阅”的繁琐循环中,缺乏操作的直观性。这两种方法都无法很好地满足科学创作中效率、直观性和迭代修改的需求。为弥合这一差距,我们引入了VisPainter,这是一种基于模型上下文协议构建的多智能体科学插图框架。VisPainter 组织了三个专门模块——管理者、设计师和工具箱——以协作方式生成与标准向量图形软件兼容的图表。这种模块化、基于角色的设计允许每个元素被明确表示和操作,实现真正的元素级控制,并且任何元素都可以在之后添加和修改。为了系统地评估科学插图的质量,我们引入了VisBench,这是一个具有七个维度评估指标的基准。它从内容、布局、视觉感知和交互成本四个方面评估高信息密度的科学插图。为此,我们进行了广泛的消融实验,以验证我们架构的合理性以及我们评估方法的可靠性。最后,我们评估了各种视觉语言模型,提供了公平且可信的模型排名,并详细比较了它们各自的性能。此外,我们还分离并量化了角色分工、步骤控制和描述对插图质量的影响。
Summary / 总结
The paper addresses the need for efficient and intuitive editing of scientific illustrations, which require both high information density and post-editability. It introduces VisPainter, a multi-agent framework that uses a Manager, Designer, and Toolbox to collaboratively generate vector graphics compatible with standard software. The framework enables true element-level control and allows for easy addition and modification of elements. VisBench, a benchmark with seven-dimensional evaluation metrics, is introduced to systematically assess the quality of scientific illustrations, and extensive ablation experiments verify the framework's effectiveness.
论文旨在解决科学插图中高信息密度和可编辑性之间的需求。提出了VisPainter,这是一种多代理框架,通过Manager、Designer和Toolbox协作生成矢量图形。VisPainter实现了真正的元素级控制和迭代修改,克服了现有基于像素的图像生成模型和基于代码的方法的局限性。该框架使用VisBench进行评估,这是一个具有七个维度评估指标的基准,广泛的消融实验验证了其有效性和可靠性。还测试了视觉语言模型,提供了公平的比较和详细的性能分析。
Modality Alignment across Trees on Heterogeneous Hyperbolic Manifolds
Authors: Wu Wei, Xiaomeng Fan, Yuwei Wu, Zhi Gao, Pengxiang Li, Yunde Jia, Mehrtash Harandi
First: 2025-10-31T11:32:15+00:00 · Latest: 2025-10-31T11:32:15+00:00
Abstract
Modality alignment is critical for vision-language models (VLMs) to effectively integrate information across modalities. However, existing methods extract hierarchical features from text while representing each image with a single feature, leading to asymmetric and suboptimal alignment. To address this, we propose Alignment across Trees, a method that constructs and aligns tree-like hierarchical features for both image and text modalities. Specifically, we introduce a semantic-aware visual feature extraction framework that applies a cross-attention mechanism to visual class tokens from intermediate Transformer layers, guided by textual cues to extract visual features with coarse-to-fine semantics. We then embed the feature trees of the two modalities into hyperbolic manifolds with distinct curvatures to effectively model their hierarchical structures. To align across the heterogeneous hyperbolic manifolds with different curvatures, we formulate a KL distance measure between distributions on heterogeneous manifolds, and learn an intermediate manifold for manifold alignment by minimizing the distance. We prove the existence and uniqueness of the optimal intermediate manifold. Experiments on taxonomic open-set classification tasks across multiple image datasets demonstrate that our method consistently outperforms strong baselines under few-shot and cross-domain settings.
中文标题/摘要
标题:异质双曲流形上树结构的模态对齐
模态对齐对于视觉语言模型(VLMs)有效整合跨模态信息至关重要。然而,现有方法从文本中提取层次特征,而用单个特征表示每张图像,导致不对称且次优的对齐。为解决这一问题,我们提出了一种树结构对齐方法,该方法为图像和文本模态构建并对齐树状层次特征。具体而言,我们引入了一种语义感知的视觉特征提取框架,该框架在中间Transformer层的视觉类标记上应用交叉注意力机制,并由文本线索引导,以提取具有粗到细语义的视觉特征。然后,我们将两种模态的特征树嵌入具有不同曲率的双曲流形中,以有效地建模它们的层次结构。为了在具有不同曲率的异质双曲流形之间对齐,我们提出了异质流形上分布之间的KL距离度量,并通过最小化距离学习中间流形以实现流形对齐。我们证明了最优中间流形的存在性和唯一性。在多个图像数据集上的分类任务中,我们的方法在少量样本和跨域设置下均优于强基线方法。
Summary / 总结
The paper addresses the issue of asymmetric and suboptimal modality alignment in vision-language models by proposing a method called Alignment across Trees. This method constructs and aligns hierarchical features for both image and text modalities using a semantic-aware visual feature extraction framework and hyperbolic manifolds with distinct curvatures. The experiments show that the proposed method outperforms strong baselines in taxonomic open-set classification tasks under few-shot and cross-domain settings.
研究旨在通过解决文本和图像在层次特征提取上的不对称性,改进视觉语言模型中的模态对齐。方法Alignment across Trees通过构建和对齐两种模态的层次特征树,并使用语义感知的视觉特征提取框架和双曲嵌入,来实现对齐。实验结果显示,该方法在少量样本和跨域设置下的分类任务中,优于强基线方法。
Mano Technical Report
Authors: Tianyu Fu, Anyang Su, Chenxu Zhao, Hanning Wang, Minghui Wu, Zhe Yu, Fei Hu, Mingjia Shi, Wei Dong, Jiayao Wang, Yuyang Chen, Ruiyang Yu, Siran Peng, Menglin Li, Nan Huang, Haitian Wei, Jiawei Yu, Yi Xin, Xilin Zhao, Kai Gu, Ping Jiang, Sifan Zhou, Shuo Wang
First: 2025-09-22T03:13:58+00:00 · Latest: 2025-10-31T09:42:28+00:00
Abstract
Graphical user interfaces (GUIs) are the primary medium for human-computer interaction, yet automating GUI interactions remains challenging due to the complexity of visual elements, dynamic environments, and the need for multi-step reasoning. Existing methods based on vision-language models (VLMs) often suffer from limited resolution, domain mismatch, and insufficient sequential decision-making capability. To address these issues, we propose Mano, a robust GUI agent built upon a multi-modal foundation model pre-trained on extensive web and computer system data. Our approach integrates a novel simulated environment for high-fidelity data generation, a three-stage training pipeline (supervised fine-tuning, offline reinforcement learning, and online reinforcement learning), and a verification module for error recovery. Mano demonstrates state-of-the-art performance on multiple GUI benchmarks, including Mind2Web and OSWorld, achieving significant improvements in success rate and operational accuracy. Our work provides new insights into the effective integration of reinforcement learning with VLMs for practical GUI agent deployment, highlighting the importance of domain-specific data, iterative training, and holistic reward design.
中文标题/摘要
标题:Mano 技术报告
图形用户界面(GUI)是人机交互的主要媒介,但由于视觉元素的复杂性、动态环境以及多步推理的需求,自动化GUI交互仍然具有挑战性。现有的基于视觉-语言模型(VLMs)的方法往往受到分辨率有限、领域不匹配和序列决策能力不足的限制。为了解决这些问题,我们提出了一种名为Mano的稳健的GUI代理,该代理基于在大量网络和计算机系统数据上预训练的多模态基础模型构建。我们的方法结合了一个新颖的模拟环境以生成高保真数据、三阶段训练流程(监督微调、离线强化学习和在线强化学习)以及一个验证模块以实现错误恢复。Mano在多个GUI基准测试中表现出最先进的性能,包括Mind2Web和OSWorld,显著提高了成功率和操作准确性。我们的工作为强化学习与VLMs的有效集成提供了新的见解,强调了领域特定数据、迭代训练和整体奖励设计的重要性。
Summary / 总结
The research addresses the challenges of automating GUI interactions by proposing Mano, a robust GUI agent based on a multi-modal foundation model. Mano integrates a simulated environment for data generation, a three-stage training pipeline, and a verification module. The agent shows state-of-the-art performance on benchmarks like Mind2Web and OSWorld, with improvements in success rate and operational accuracy. This work highlights the importance of domain-specific data and iterative training for practical GUI agent deployment.
研究旨在通过解决现有视觉-语言模型的局限性,提高图形用户界面(GUI)交互的自动化水平。提出了一个名为Mano的稳健GUI代理,利用广泛网页和计算机系统数据预训练的多模态基础模型。它采用三阶段训练管道和验证模块来提升性能。Mano在Mind2Web和OSWorld等基准测试中表现出最先进的成果,显著提高了成功率和操作准确性。
LangHOPS: Language Grounded Hierarchical Open-Vocabulary Part Segmentation
Authors: Yang Miao, Jan-Nico Zaech, Xi Wang, Fabien Despinoy, Danda Pani Paudel, Luc Van Gool
Venue: NeurIPS 2025
First: 2025-10-29T08:21:59+00:00 · Latest: 2025-10-31T09:11:14+00:00
Comments: 10 pages, 5 figures, 14 tables, NeurIPS 2025
Abstract
We propose LangHOPS, the first Multimodal Large Language Model (MLLM) based framework for open-vocabulary object-part instance segmentation. Given an image, LangHOPS can jointly detect and segment hierarchical object and part instances from open-vocabulary candidate categories. Unlike prior approaches that rely on heuristic or learnable visual grouping, our approach grounds object-part hierarchies in language space. It integrates the MLLM into the object-part parsing pipeline to leverage its rich knowledge and reasoning capabilities, and links multi-granularity concepts within the hierarchies. We evaluate LangHOPS across multiple challenging scenarios, including in-domain and cross-dataset object-part instance segmentation, and zero-shot semantic segmentation. LangHOPS achieves state-of-the-art results, surpassing previous methods by 5.5% Average Precision (AP) (in-domain) and 4.8% (cross-dataset) on the PartImageNet dataset and by 2.5% mIoU on unseen object parts in ADE20K (zero-shot). Ablation studies further validate the effectiveness of the language-grounded hierarchy and MLLM-driven part query refinement strategy. The code will be released here.
中文标题/摘要
标题:LangHOPS:基于语言的层次开放词汇部件分割
我们提出了LangHOPS,这是第一个基于多模态大型语言模型(MLLM)的开放词汇物体部件实例分割框架。给定一张图像,LangHOPS 可以联合检测和分割来自开放词汇候选类别中的层次物体和部件实例。与依赖启发式或可学习视觉分组的先前方法不同,我们的方法将物体部件层次结构扎根于语言空间。它将 MLLM 集成到物体部件解析管道中,利用其丰富的知识和推理能力,并在层次结构内链接多粒度概念。我们在多个具有挑战性的场景中评估了 LangHOPS,包括领域内和跨数据集物体部件实例分割以及零样本语义分割。LangHOPS 达到了最先进的技术水平,在 PartImageNet 数据集上,领域内平均精度(AP)提高了 5.5%,跨数据集提高了 4.8%,在 ADE20K 中未见过的物体部件上,mIOU 提高了 2.5%。消融研究进一步验证了语言扎根层次结构和 MLLM 驱动部件查询精炼策略的有效性。代码将在此发布。
Summary / 总结
LangHOPS is a framework that uses a Multimodal Large Language Model to perform open-vocabulary object-part instance segmentation. It can detect and segment hierarchical object and part instances from various categories in an image. Unlike previous methods, LangHOPS grounds object-part hierarchies in language space and integrates an MLLM to leverage its knowledge and reasoning capabilities. LangHOPS outperforms previous methods by 5.5% AP in-domain and 4.8% AP cross-dataset on PartImageNet, and by 2.5% mIoU on unseen object parts in ADE20K for zero-shot semantic segmentation. Ablation studies confirm the effectiveness of the language-grounded hierarchy and part query refinement strategy.
LangHOPS 是一个使用多模态大型语言模型进行开放词汇对象部分实例分割的框架。它可以检测和分割图像中的层次化对象和部分实例。不同于以往的方法,LangHOPS 将对象部分层次结构置于语言空间,并集成了一个 MLLM 来利用其知识和推理能力。LangHOPS 达到了最先进的效果,比以前的方法在 PartImageNet 数据集上提高了 5.5% 的 AP(室内)和 4.8% 的 AP(跨数据集),以及在 ADE20K 上对未见过的对象部分的零样本语义分割提高了 2.5% 的 mIOU。消融研究进一步验证了语言导向的层次结构和 MLLM 驱动的部分查询精炼策略的有效性。
FOCUS: Efficient Keyframe Selection for Long Video Understanding
Authors: Zirui Zhu, Hailun Xu, Yang Luo, Yong Liu, Kanchan Sarkar, Zhenheng Yang, Yang You
First: 2025-10-31T08:41:13+00:00 · Latest: 2025-10-31T08:41:13+00:00
Abstract
Multimodal large language models (MLLMs) represent images and video frames as visual tokens. Scaling from single images to hour-long videos, however, inflates the token budget far beyond practical limits. Popular pipelines therefore either uniformly subsample or apply keyframe selection with retrieval-style scoring using smaller vision-language models. However, these keyframe selection methods still rely on pre-filtering before selection to reduce the inference cost and can miss the most informative moments. We propose FOCUS, Frame-Optimistic Confidence Upper-bound Selection, a training-free, model-agnostic keyframe selection module that selects query-relevant frames under a strict token budget. FOCUS formulates keyframe selection as a combinatorial pure-exploration (CPE) problem in multi-armed bandits: it treats short temporal clips as arms, and uses empirical means and Bernstein confidence radius to identify informative regions while preserving exploration of uncertain areas. The resulting two-stage exploration-exploitation procedure reduces from a sequential policy with theoretical guarantees, first identifying high-value temporal regions, then selecting top-scoring frames within each region. On two long-video question-answering benchmarks, FOCUS delivers substantial accuracy improvements while processing less than 2% of video frames. For videos longer than 20 minutes, it achieves an 11.9% gain in accuracy on LongVideoBench, demonstrating its effectiveness as a keyframe selection method and providing a simple and general solution for scalable long-video understanding with MLLMs.
中文标题/摘要
标题:FOCUS:长视频理解中的高效关键帧选择
多模态大型语言模型(MLLMs)将图像和视频帧表示为视觉令牌。然而,从单张图像扩展到一小时长的视频,会将令牌预算膨胀到实际限制之上。因此,流行的流水线要么均匀下采样,要么使用较小的视觉-语言模型进行检索式评分进行关键帧选择。然而,这些关键帧选择方法仍然依赖于预筛选以减少推理成本,可能会错过最有信息性的时刻。我们提出了FOCUS,一种无需训练、模型无关的关键帧选择模块,它在严格的令牌预算下选择查询相关的帧。FOCUS将关键帧选择形式化为多臂老虎机中的组合纯探索(CPE)问题:它将短时间片段视为臂,并使用经验均值和伯恩斯坦置信半径来识别信息区域,同时保留对不确定区域的探索。由此产生的两阶段探索-利用过程从理论上保证了顺序策略,首先识别高价值的时间区域,然后在每个区域内选择得分最高的帧。在两个长视频问答基准测试中,FOCUS在处理不到2%的视频帧的情况下实现了显著的准确率提升。对于超过20分钟的视频,它在LongVideoBench上实现了11.9%的准确率提升,证明了其作为关键帧选择方法的有效性,并为使用MLLMs进行可扩展的长视频理解提供了简单而通用的解决方案。
Summary / 总结
FOCUS is a training-free, model-agnostic keyframe selection method that selects query-relevant frames under a strict token budget for long video understanding. It formulates keyframe selection as a combinatorial pure-exploration problem in multi-armed bandits, identifying informative regions while preserving exploration. On two long-video question-answering benchmarks, FOCUS improves accuracy while processing less than 2% of video frames, with an 11.9% accuracy gain on LongVideoBench for videos longer than 20 minutes.
FOCUS 是一种无需训练、模型无关的关键帧选择方法,能够在严格的令牌预算下选择查询相关的关键帧。它将关键帧选择问题表述为多臂老虎机中的组合纯探索问题,同时识别信息丰富的区域并保持探索不确定性区域。在两个长视频问答基准测试中,FOCUS通过处理不到2%的视频帧提高了准确性,对于超过20分钟的视频,其准确率提高了11.9%。
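The role of empirical means and a Bernstein confidence radius can be illustrated with a short sketch. It uses one common form of the empirical Bernstein bound, radius = sqrt(2 V ln(3/δ) / n) + 3 b ln(3/δ) / n, which may differ from the exact constants used by FOCUS; the clip names, scores, and function names are hypothetical.

```python
import math


def bernstein_radius(variance, n, delta=0.05, value_range=1.0):
    """One common empirical Bernstein radius (assumed form, not the paper's)."""
    log_term = math.log(3.0 / delta)
    return math.sqrt(2.0 * variance * log_term / n) + 3.0 * value_range * log_term / n


def upper_bound(samples, delta=0.05):
    """Optimistic score for one clip (arm): empirical mean plus radius."""
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((x - mean) ** 2 for x in samples) / n
    return mean + bernstein_radius(variance, n, delta)


# Relevance scores in [0, 1] for frames sampled so far from each clip.
clip_scores = {
    "clip_a": [0.90, 0.80, 0.85, 0.90],  # well explored, high mean
    "clip_b": [0.50],                    # barely explored, large radius
}
bounds = {name: upper_bound(s) for name, s in clip_scores.items()}
next_clip = max(bounds, key=bounds.get)  # clip to sample from next
```

The radius shrinks as a clip is sampled more and grows with its score variance, so uncertain clips keep receiving exploration until the bound rules them out.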
T3: Test-Time Model Merging in VLMs for Zero-Shot Medical Imaging Analysis
Authors: Raza Imam, Hu Wang, Dwarikanath Mahapatra, Mohammad Yaqub
First: 2025-10-31T08:05:40+00:00 · Latest: 2025-10-31T08:05:40+00:00
Comments: Main: 11 pages, Supplementary: 9 pages 10 tables, 10 figures
Abstract
In medical imaging, vision-language models face a critical duality: pretrained networks offer broad robustness but lack subtle, modality-specific characteristics, while fine-tuned expert models achieve high in-distribution accuracy yet falter under modality shift. Existing model-merging techniques, designed for natural-image benchmarks, are simple and efficient but fail to deliver consistent gains across diverse medical modalities; their static interpolation limits reliability in varied clinical tasks. To address this, we introduce Test-Time Task adaptive merging (T^3), a backpropagation-free framework that computes per-sample interpolation coefficients via the Jensen-Shannon divergence between the two models' output distributions. T^3 dynamically preserves local precision when models agree and defers to generalist robustness under drift. To overcome the inference costs of sample-wise merging, we further propose a batch-wise extension, T^3_B, that computes a merging coefficient across a batch of samples, dramatically reducing computational bottleneck. Recognizing the lack of a standardized medical-merging benchmark, we present a rigorous cross-evaluation protocol spanning in-domain, base-to-novel, and corruptions across four modalities. Empirically, T^3 sets new state-of-the-art in Top-1 accuracy and error reduction, outperforming strong baselines while maintaining efficiency, paving the way for adaptive MVLM deployment in clinical settings. Our code is available at https://github.com/Razaimam45/TCube.
中文标题/摘要
标题:T3:在VLMs中进行测试时模型融合以实现零样本医学影像分析
在医学影像领域,视觉-语言模型面临一个关键的二元性:预训练网络提供广泛的鲁棒性,但缺乏特定模态的细微特征,而微调专家模型在分布内达到高准确率,但在模态变化时表现不佳。现有的模型融合技术,针对自然图像基准设计,简单且高效,但在不同医学模态下未能提供一致的增益;它们的静态插值限制了在各种临床任务中的可靠性。为解决这一问题,我们引入了测试时任务自适应融合(T^3),这是一种无需反向传播的框架,通过计算两个模型输出分布之间的杰森-香农散度来确定每个样本的插值系数。T^3在模型一致时动态保持局部精度,在漂移时则依赖于通用鲁棒性。为克服样本级融合的推理成本,我们进一步提出了一种批量级扩展T^3_B,它在一批样本上计算融合系数,显著减少了计算瓶颈。鉴于缺乏标准化的医学融合基准,我们提出了一个跨越四个模态的跨评估协议,涵盖领域内、基础到新颖以及损坏情况。实验证明,T^3在Top-1准确率和错误减少方面设立了新的最佳水平,优于强大的基线模型,同时保持了效率,为临床环境中适应性MVLM部署铺平了道路。我们的代码可在https://github.com/Razaimam45/TCube获取。
Summary / 总结
The research addresses the challenge of combining the broad robustness of pretrained vision-language models with the modality-specific accuracy of fine-tuned expert models in medical imaging. T3 introduces a test-time merging framework that dynamically adjusts the interpolation between these models based on the Jensen-Shannon divergence, ensuring precision in agreement and robustness under drift. T3_B extends this by batch-wise merging, reducing computational costs. Experiments across four medical modalities show T3 outperforms strong baselines in Top-1 accuracy and error reduction, setting new state-of-the-art while maintaining efficiency.
论文提出了T^3,一种针对医学影像的视觉语言模型测试时模型合并技术,解决了现有方法的局限性。T^3利用Jensen-Shannon散度动态合并模型输出,在模型一致时保持精度,在模型漂移时依赖于稳健性。T^3_B通过合并一批样本进一步扩展了这一方法,减少了计算成本。实验表明,T^3在Top-1准确性和错误减少方面表现出色,优于强基线,同时保持了效率。
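A per-sample interpolation coefficient driven by the Jensen-Shannon divergence can be sketched as follows. The mapping from divergence to coefficient (one minus the divergence normalized by ln 2) is an assumption made for illustration, as are the function names; the abstract only states that agreement favors the expert and disagreement favors the generalist.

```python
import math


def kl(p, q):
    """KL divergence in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence, bounded in [0, ln 2] with natural logs."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def expert_weight(p_pretrained, p_expert):
    """Assumed mapping: full trust in the expert when the two models agree."""
    return 1.0 - js_divergence(p_pretrained, p_expert) / math.log(2)


def merge(theta_pretrained, theta_expert, lam):
    """Linear interpolation of parameters with expert weight lam."""
    return [lam * e + (1.0 - lam) * g for g, e in zip(theta_pretrained, theta_expert)]


agree = expert_weight([0.7, 0.2, 0.1], [0.7, 0.2, 0.1])   # identical outputs
drift = expert_weight([1.0, 0.0], [0.0, 1.0])             # total disagreement
merged = merge([0.0, 0.0], [2.0, 4.0], lam=0.5)
```

No gradients are needed: the coefficient comes from two forward passes, which is consistent with the backpropagation-free framing in the abstract.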
ECVL-ROUTER: Scenario-Aware Routing for Vision-Language Models
Authors: Xin Tang, Youfang Han, Fangfei Gou, Wei Zhao, Xin Meng, Yang Yu, Jinguo Zhang, Yuanchun Shi, Yuntao Wang, Tengxiang Zhang
First: 2025-10-31T07:46:44+00:00 · Latest: 2025-10-31T07:46:44+00:00
Comments: 23 pages, 13 figures, 7 tables
Abstract
Vision-Language Models (VLMs) excel in diverse multimodal tasks. However, user requirements vary across scenarios, which can be categorized into fast response, high-quality output, and low energy consumption. Relying solely on large models deployed in the cloud for all queries often leads to high latency and energy cost, while small models deployed on edge devices are capable of handling simpler tasks with low latency and energy cost. To fully leverage the strengths of both large and small models, we propose ECVL-ROUTER, the first scenario-aware routing framework for VLMs. Our approach introduces a new routing strategy and evaluation metrics that dynamically select the appropriate model for each query based on user requirements, maximizing overall utility. We also construct a multimodal response-quality dataset tailored for router training and validate the approach through extensive experiments. Results show that our approach successfully routes over 80% of queries to the small model while incurring less than 10% drop in problem solving probability.
中文标题/摘要
标题:ECVL-ROUTER:面向视觉语言模型的场景感知路由
视觉语言模型(VLMs)在多种跨模态任务中表现出色。然而,用户需求因场景而异,可以分为快速响应、高质量输出和低能耗三类。仅依赖部署在云端的大模型处理所有查询往往会导致高延迟和能耗,而部署在边缘设备的小模型则能够以低延迟和能耗处理简单的任务。为了充分利用大模型和小模型的优势,我们提出了ECVL-ROUTER,这是第一个面向VLMs的场景感知路由框架。我们的方法引入了一种新的路由策略和评估指标,能够根据用户需求动态选择合适的模型,最大化整体效用。我们还构建了一个针对路由训练的跨模态响应质量数据集,并通过大量实验验证了该方法。结果显示,我们的方法成功将超过80%的查询路由到小模型,同时问题解决概率的下降不到10%。
Summary / 总结
The paper proposes ECVL-ROUTER, a scenario-aware routing framework for Vision-Language Models (VLMs) to address varying user requirements. It introduces a new routing strategy and evaluation metrics to dynamically select the most suitable model for each query, balancing latency, quality, and energy consumption. Experiments show that over 80% of queries are routed to small models with less than 10% drop in problem-solving probability.
论文提出了ECVL-ROUTER,一种针对视觉-语言模型(VLMs)的场景感知路由框架,以应对不同的用户需求。该框架引入了一种新的路由策略和评估指标,能够动态选择最适合每个查询的模型,平衡延迟、质量和能耗。实验结果表明,超过80%的查询被路由到小型模型,且问题解决概率下降不到10%。
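To make the notion of scenario-aware routing concrete, here is a minimal sketch that picks a model by maximizing a weighted utility over predicted solve probability, latency, and energy. The weight table, the linear utility, the model profiles, and every name below are hypothetical; the abstract does not specify the routing strategy at this level of detail.

```python
from dataclasses import dataclass

@dataclass
class ModelProfile:
    name: str
    latency_s: float  # assumed expected latency per query, in seconds
    energy_j: float   # assumed expected energy per query, in joules

# Hypothetical (quality, latency, energy) weights for the three scenarios.
SCENARIO_WEIGHTS = {
    "fast_response": (1.0, 2.0, 0.1),
    "high_quality": (3.0, 0.1, 0.1),
    "low_energy": (1.0, 0.1, 2.0),
}

def route(scenario, solve_prob, models):
    """Return the name of the model with the highest utility.
    solve_prob maps model name to a predicted probability of solving
    the query, e.g. the output of a trained router."""
    wq, wl, we = SCENARIO_WEIGHTS[scenario]

    def utility(m):
        return wq * solve_prob[m.name] - wl * m.latency_s - we * m.energy_j

    return max(models, key=utility).name

small = ModelProfile("edge-small", latency_s=0.3, energy_j=0.2)
large = ModelProfile("cloud-large", latency_s=1.5, energy_j=1.0)

easy = {"edge-small": 0.90, "cloud-large": 0.95}
hard = {"edge-small": 0.20, "cloud-large": 0.90}

easy_choice = route("fast_response", easy, [small, large])  # "edge-small"
hard_choice = route("high_quality", hard, [small, large])   # "cloud-large"
```

With these toy numbers, an easy query goes to the edge model even in the high-quality scenario, because the small gain in solve probability does not offset the extra latency and energy.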
Enhancing Spatio-Temporal Zero-shot Action Recognition with Language-driven Description Attributes
Authors: Yehna Kim, Young-Eun Kim, Seong-Whan Lee
First: 2025-10-31T07:45:44+00:00 · Latest: 2025-10-31T07:45:44+00:00
Abstract
Vision-Language Models (VLMs) have demonstrated impressive capabilities in zero-shot action recognition by learning to associate video embeddings with class embeddings. However, a significant challenge arises when relying solely on action classes to provide semantic context, particularly due to the presence of multi-semantic words, which can introduce ambiguity in understanding the intended concepts of actions. To address this issue, we propose an innovative approach that harnesses web-crawled descriptions, leveraging a large-language model to extract relevant keywords. This method reduces the need for human annotators and eliminates the laborious manual process of attribute data creation. Additionally, we introduce a spatio-temporal interaction module designed to focus on objects and action units, facilitating alignment between description attributes and video content. In our zero-shot experiments, our model achieves impressive results, attaining accuracies of 81.0%, 53.1%, and 68.9% on UCF-101, HMDB-51, and Kinetics-600, respectively, underscoring the model's adaptability and effectiveness across various downstream tasks.
中文标题/摘要
标题:利用语言驱动描述属性增强时空零样本动作识别
视觉-语言模型(VLMs)在零样本动作识别方面通过学习将视频嵌入与类别嵌入关联起来,展示了令人印象深刻的性能。然而,仅依赖动作类别来提供语义上下文时存在重大挑战,尤其是由于多义词的存在,这可能导致对动作意图概念的理解产生歧义。为了解决这一问题,我们提出了一种创新方法,利用网络抓取的描述,并利用大型语言模型提取相关关键词。这种方法减少了对人工注释者的依赖,并消除了属性数据创建的繁琐手动过程。此外,我们引入了一个时空交互模块,旨在关注对象和动作单元,促进描述属性与视频内容之间的对齐。在我们的零样本实验中,我们的模型取得了令人印象深刻的结果,在UCF-101、HMDB-51和Kinetics-600上的准确率分别为81.0%、53.1%和68.9%,突显了模型在各种下游任务中的适应性和有效性。
Summary / 总结
The paper addresses the challenge of ambiguity in zero-shot action recognition due to multi-semantic words by proposing a method that uses web-crawled descriptions and a large-language model to extract relevant keywords, reducing the need for human annotation. It also introduces a spatio-temporal interaction module to align description attributes with video content. The model achieves accuracies of 81.0%, 53.1%, and 68.9% on UCF-101, HMDB-51, and Kinetics-600, respectively, demonstrating its effectiveness in various tasks.
研究旨在通过解决多义词引入的歧义问题,提高零样本动作识别的准确性。方法包括使用网络抓取的描述和大型语言模型提取相关关键词,减少人工标注的需求。模型还包含时空交互模块,以对齐描述属性和视频内容。实验结果显示,该模型在UCF-101、HMDB-51和Kinetics-600上的准确率分别为81.0%、53.1%和68.9%,证明了其在各种下游任务中的有效性和适应性。
Reconstructing Unseen Sentences from Speech-related Biosignals for Open-vocabulary Neural Communication
Authors: Deok-Seon Kim, Seo-Hyun Lee, Kang Yin, Seong-Whan Lee
First: 2025-10-31T07:31:13+00:00 · Latest: 2025-10-31T07:31:13+00:00
Comments: Accepted for publication in IEEE Transactions on Neural Systems and Rehabilitation Engineering
Abstract
Brain-to-speech (BTS) systems represent a groundbreaking approach to human communication by enabling the direct transformation of neural activity into linguistic expressions. While recent non-invasive BTS studies have largely focused on decoding predefined words or sentences, achieving open-vocabulary neural communication comparable to natural human interaction requires decoding unconstrained speech. Additionally, effectively integrating diverse signals derived from speech is crucial for developing personalized and adaptive neural communication and rehabilitation solutions for patients. This study investigates the potential of speech synthesis for previously unseen sentences across various speech modes by leveraging phoneme-level information extracted from high-density electroencephalography (EEG) signals, both independently and in conjunction with electromyography (EMG) signals. Furthermore, we examine the properties affecting phoneme decoding accuracy during sentence reconstruction and offer neurophysiological insights to further enhance EEG decoding for more effective neural communication solutions. Our findings underscore the feasibility of biosignal-based sentence-level speech synthesis for reconstructing unseen sentences, highlighting a significant step toward developing open-vocabulary neural communication systems adapted to diverse patient needs and conditions. Additionally, this study provides meaningful insights into the development of communication and rehabilitation solutions utilizing EEG-based decoding technologies.
中文标题/摘要
标题:从与语音相关的生物信号重构未见句子的开放词汇神经通信
脑-语音(BTS)系统代表了一种变革性的交流方式,通过直接将神经活动转化为语言表达。尽管近期非侵入式BTS研究主要集中在解码预定义的单词或句子上,但实现与自然人类交流相媲美的开放词汇神经通信需要解码不受限制的语音。此外,有效整合来自语音的各种信号对于开发个性化和适应性强的神经通信和康复解决方案至关重要。本研究通过利用从高密度脑电图(EEG)信号中提取的音素级信息,以及与肌电图(EMG)信号结合,探索了各种语音模式下以前未见句子的语音合成潜力。此外,我们还研究了句子重构过程中影响音素解码准确性的特性,并提供了神经生理学见解,以进一步提高EEG解码效果,从而更有效地开发神经通信解决方案。我们的研究结果强调了基于生物信号的句子级语音合成的可行性,这标志着朝着开发适应不同患者需求和条件的开放词汇神经通信系统迈出的重要一步。此外,本研究还为利用EEG解码技术开发通信和康复解决方案提供了有意义的见解。
Summary / 总结
This study aims to reconstruct unseen sentences from speech-related biosignals to achieve open-vocabulary neural communication. The researchers used phoneme-level information from high-density EEG signals and, in some cases, combined it with EMG signals to synthesize speech. Key findings include the feasibility of biosignal-based sentence-level speech synthesis, which is a significant step toward developing personalized neural communication systems. The study also offers insights into enhancing EEG decoding accuracy for better neural communication solutions.
本研究探索了利用高密度EEG和EMG等语音相关生物信号重建未见过的句子,以实现开放词汇的神经通信。研究人员利用音素级信息,考察了不同语音模式下未见句子的语音合成,并分析了影响音素解码准确率的因素。研究结果表明,基于生物信号的句子级语音合成具有可行性,为开发更个性化和适应性的神经通信系统铺平了道路。
Generating Accurate and Detailed Captions for High-Resolution Images
Authors: Hankyeol Lee, Gawon Seo, Kyounggyu Lee, Dogun Kim, Kyungwoo Song, Jiyoung Jung
First: 2025-10-31T04:22:22+00:00 · Latest: 2025-10-31T04:22:22+00:00
Comments: Work conducted in 2024; released for archival purposes
Abstract
Vision-language models (VLMs) often struggle to generate accurate and detailed captions for high-resolution images since they are typically pre-trained on low-resolution inputs (e.g., 224x224 or 336x336 pixels). Downscaling high-resolution images to these dimensions may result in the loss of visual details and the omission of important objects. To address this limitation, we propose a novel pipeline that integrates vision-language models, large language models (LLMs), and object detection systems to enhance caption quality. Our proposed pipeline refines captions through a novel, multi-stage process. Given a high-resolution image, an initial caption is first generated using a VLM, and key objects in the image are then identified by an LLM. The LLM predicts additional objects likely to co-occur with the identified key objects, and these predictions are verified by object detection systems. Newly detected objects not mentioned in the initial caption undergo focused, region-specific captioning to ensure they are incorporated. This process enriches caption detail while reducing hallucinations by removing references to undetected objects. We evaluate the enhanced captions using pairwise comparison and quantitative scoring from large multimodal models, along with a benchmark for hallucination detection. Experiments on a curated dataset of high-resolution images demonstrate that our pipeline produces more detailed and reliable image captions while effectively minimizing hallucinations.
中文标题/摘要
标题:生成高分辨率图像的准确和详细描述
视觉语言模型(VLMs)通常难以生成高分辨率图像的准确和详细描述,因为它们通常在低分辨率输入(例如224x224或336x336像素)上进行预训练。将高分辨率图像缩小到这些尺寸可能会导致视觉细节的丢失和重要对象的遗漏。为了解决这一限制,我们提出了一种新的管道,该管道结合了视觉语言模型、大型语言模型(LLMs)和对象检测系统以提高描述质量。我们提出的管道通过一个新颖的多阶段过程来细化描述。给定高分辨率图像,首先使用VLM生成初始描述,然后使用LLM识别图像中的关键对象。LLM预测与已识别的关键对象共现的其他对象,并由对象检测系统验证这些预测。未在初始描述中提及的新检测对象将进行聚焦的、区域特定的描述,以确保它们被纳入。这一过程丰富了描述细节,同时通过去除未检测到的对象的引用减少了幻觉。我们使用成对比较和大型多模态模型的定量评分来评估增强的描述,并提供幻觉检测基准。在精心策划的高分辨率图像数据集上的实验表明,我们的管道生成了更详细和可靠的图像描述,同时有效减少了幻觉。
Summary / 总结
The paper addresses the challenge of generating accurate and detailed captions for high-resolution images by proposing a novel pipeline that integrates vision-language models, large language models, and object detection systems. This pipeline refines captions through a multi-stage process, identifying key objects and predicting additional relevant objects, which are then verified by object detection. The results show that the enhanced captions are more detailed and reliable, with reduced hallucinations compared to those generated by vision-language models alone.
论文提出了一种新的管道,结合了视觉语言模型、大型语言模型和物体检测系统,以解决高分辨率图像生成准确和详细描述的挑战。该管道通过多阶段过程细化描述,增加了细节并减少了幻觉。实验表明,增强后的描述比单独使用视觉语言模型生成的描述更详细和可靠。
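The multi-stage refinement described above can be sketched as a small orchestration function. Every callable passed in (`vlm_caption`, `llm_key_objects`, `llm_cooccurring`, `detect`, `caption_region`) is a stand-in for a real model, and the sentence-level filtering by substring match is a simplifying assumption rather than the authors' implementation.

```python
def refine_caption(image, vlm_caption, llm_key_objects, llm_cooccurring,
                   detect, caption_region):
    """Toy version of the pipeline: caption, propose objects, verify them
    with a detector, drop unverified mentions, add missing objects."""
    caption = vlm_caption(image)
    key_objects = llm_key_objects(caption)
    candidates = set(key_objects) | set(llm_cooccurring(key_objects))
    detected = {obj for obj in candidates if detect(image, obj)}
    undetected = candidates - detected

    # Remove sentences that mention objects the detector could not confirm.
    sentences = [s.strip() for s in caption.split(".") if s.strip()]
    kept = [s for s in sentences
            if not any(obj in s.lower() for obj in undetected)]

    # Add a region-level caption for each confirmed object not yet mentioned.
    for obj in sorted(detected):
        if not any(obj in s.lower() for s in kept):
            kept.append(caption_region(image, obj))
    return ". ".join(kept) + "."

# Stub models operating on a toy "image" that simply lists its contents.
image = {"objects": {"dog", "ball"}}
result = refine_caption(
    image,
    vlm_caption=lambda img: "A dog sits on grass. A cat sleeps nearby.",
    llm_key_objects=lambda cap: ["dog", "cat"],
    llm_cooccurring=lambda objs: ["ball", "leash"],
    detect=lambda img, obj: obj in img["objects"],
    caption_region=lambda img, obj: f"A {obj} is visible",
)
print(result)  # A dog sits on grass. A ball is visible.
```

In this toy run the hallucinated cat is removed because the detector does not confirm it, and the ball is added because it was detected but absent from the initial caption.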
Integrating Video and Text: A Balanced Approach to Multimodal Summary Generation and Evaluation
Authors: Galann Pennec, Zhengyuan Liu, Nicholas Asher, Philippe Muller, Nancy F. Chen
First: 2025-05-10T10:52:23+00:00 · Latest: 2025-10-31T04:00:41+00:00
Abstract
Vision-Language Models (VLMs) often struggle to balance visual and textual information when summarizing complex multimodal inputs, such as entire TV show episodes. In this paper, we propose a zero-shot video-to-text summarization approach that builds its own screenplay representation of an episode, effectively integrating key video moments, dialogue, and character information into a unified document. Unlike previous approaches, we simultaneously generate screenplays and name the characters in zero-shot, using only the audio, video, and transcripts as input. Additionally, we highlight that existing summarization metrics can fail to assess the multimodal content in summaries. To address this, we introduce MFactSum, a multimodal metric that evaluates summaries with respect to both vision and text modalities. Using MFactSum, we evaluate our screenplay summaries on the SummScreen3D dataset, demonstrating superiority against state-of-the-art VLMs such as Gemini 1.5 by generating summaries containing 20% more relevant visual information while requiring 75% less of the video as input.
中文标题/摘要
标题:整合视频与文本:多模态摘要生成与评估的平衡方法
视觉-语言模型(VLMs)在总结复杂的多模态输入,如整个电视节目集时,往往难以平衡视觉和文本信息。在本文中,我们提出了一种零样本视频到文本总结方法,该方法构建了整个集的剧本表示,有效地将关键视频时刻、对话和角色信息整合到一个统一的文档中。与之前的方案不同,我们以零样本方式同时生成剧本并命名角色,仅使用音频、视频和转录作为输入。此外,我们强调现有的总结指标可能无法评估摘要中的多模态内容。为了解决这个问题,我们引入了MFactSum,这是一种多模态指标,可以同时从视觉和文本两种模态评估摘要。使用MFactSum,我们在SummScreen3D数据集上评估我们的剧本摘要,结果显示,与最先进的VLMs(如Gemini 1.5)相比,我们的摘要包含的相关视觉信息多20%,同时所需输入的视频量减少了75%。
Variational Visual Question Answering for Uncertainty-Aware Selective Prediction
Authors: Tobias Jan Wieczorek, Nathalie Daun, Mohammad Emtiyaz Khan, Marcus Rohrbach
First: 2025-05-14T17:40:22+00:00 · Latest: 2025-10-31T02:57:26+00:00
Comments: under review at TMLR
Abstract
Despite remarkable progress in recent years, vision language models (VLMs) remain prone to overconfidence and hallucinations on tasks such as Visual Question Answering (VQA) and Visual Reasoning. Bayesian methods can potentially improve reliability by helping models selectively predict, that is, models respond only when they are sufficiently confident. Unfortunately, Bayesian methods are often assumed to be costly and ineffective for large models, and so far there exists little evidence to show otherwise, especially for multimodal applications. Here, we show the effectiveness and competitive edge of variational Bayes for selective prediction in VQA for the first time. We build on recent advances in variational methods for deep learning and propose an extension called "Variational VQA". This method improves calibration and yields significant gains for selective prediction on VQA and Visual Reasoning, particularly when the error tolerance is low (≤ 1%). Often, just one posterior sample can yield more reliable answers than those obtained by models trained with AdamW. In addition, we propose a new risk-averse selector that outperforms standard sample averaging by considering the variance of predictions. Overall, we present compelling evidence that variational learning is a viable option to make large VLMs safer and more trustworthy.
中文标题/摘要
标题:变分视觉问答以实现不确定性感知的选择性预测
尽管近年来取得了显著进展,视觉语言模型(VLMs)在视觉问答(VQA)和视觉推理等任务上仍然容易表现出过度自信和幻觉。贝叶斯方法有可能通过帮助模型选择性地预测来提高可靠性,即模型仅在足够自信时才作出响应。不幸的是,贝叶斯方法通常被认为对大型模型来说成本高且效果不佳,迄今为止几乎没有证据表明其有效性,尤其是在多模态应用中。在这里,我们首次展示了变分贝叶斯方法在VQA中选择性预测的有效性和竞争优势。我们基于深度学习中变分方法的最新进展,提出了一种名为“变分VQA”的扩展方法。该方法提高了校准度,并在VQA和视觉推理中实现了显著的选择性预测增益,尤其是在容错率较低(≤1%)的情况下。通常,仅一个后验样本就能比使用AdamW训练的模型获得更可靠的答案。此外,我们提出了一种新的风险规避选择器,其性能优于标准样本平均,因为它考虑了预测的方差。总体而言,我们提供了有力的证据表明,变分学习是使大型VLMs更安全和更值得信赖的一种可行选择。
Summary / 总结
The paper addresses the issue of overconfidence and hallucinations in vision language models (VLMs) for tasks like VQA and visual reasoning. It introduces Variational VQA, which uses variational Bayesian methods to improve model reliability and selective prediction. The method enhances calibration and achieves significant gains, especially with low error tolerance, often outperforming models trained with AdamW. Additionally, a risk-averse selector is proposed, which performs better than standard sample averaging by considering prediction variance.
本文针对视觉语言模型(VLMs)在VQA和视觉推理等任务中的过度自信和幻觉问题,引入了使用变分贝叶斯方法的Variational VQA,以提高模型的可靠性和选择性预测能力。该方法改善了校准度,并在低误差容忍度下显著提升了选择性预测性能。此外,本文提出的风险规避选择器通过考虑预测方差,优于标准的样本平均方法。
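The contrast between sample averaging and a variance-aware selector can be illustrated with a short sketch. The scoring rule used here (mean confidence minus a multiple of the standard deviation across posterior samples), the threshold, and the function names are assumptions for illustration; the abstract states only that the proposed selector considers the variance of predictions.

```python
import statistics

def risk_averse_confidence(sample_confidences, penalty=1.0):
    """Hypothetical selector score: mean confidence across posterior
    samples, penalized by their standard deviation."""
    mean = statistics.fmean(sample_confidences)
    std = statistics.pstdev(sample_confidences)
    return mean - penalty * std

def selective_predict(answer, sample_confidences, threshold=0.8, penalty=1.0):
    """Return the answer only if the score clears the threshold,
    otherwise abstain by returning None."""
    score = risk_averse_confidence(sample_confidences, penalty)
    return answer if score >= threshold else None

# Both cases have mean confidence 0.9, so plain averaging would answer both.
stable = selective_predict("two", [0.90, 0.91, 0.89])    # answers
unstable = selective_predict("two", [0.99, 0.99, 0.72])  # abstains
```

Setting `penalty=0.0` recovers plain sample averaging, which makes the effect of the variance term easy to compare on the same inputs.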
AVA: Towards Agentic Video Analytics with Vision Language Models
Authors: Yuxuan Yan, Shiqi Jiang, Ting Cao, Yifan Yang, Qianqian Yang, Yuanchao Shu, Yuqing Yang, Lili Qiu
First: 2025-05-01T02:40:23+00:00 · Latest: 2025-10-31T01:25:26+00:00
Comments: Accepted to NSDI 2026, 19 pages, 12 figures, complementary evaluations and appendix
Abstract
AI-driven video analytics has become increasingly important across diverse domains. However, existing systems are often constrained to specific, predefined tasks, limiting their adaptability in open-ended analytical scenarios. The recent emergence of Vision Language Models (VLMs) as transformative technologies offers significant potential for enabling open-ended video understanding, reasoning, and analytics. Nevertheless, their limited context windows present challenges when processing ultra-long video content, which is prevalent in real-world applications. To address this, we introduce AVA, a VLM-powered system designed for open-ended, advanced video analytics. AVA incorporates two key innovations: (1) the near real-time construction of Event Knowledge Graphs (EKGs) for efficient indexing of long or continuous video streams, and (2) an agentic retrieval-generation mechanism that leverages EKGs to handle complex and diverse queries. Comprehensive evaluations on public benchmarks, LVBench and VideoMME-Long, demonstrate that AVA achieves state-of-the-art performance, attaining 62.3% and 64.1% accuracy, respectively, significantly surpassing existing VLM and video Retrieval-Augmented Generation (RAG) systems. Furthermore, to evaluate video analytics in ultra-long and open-world video scenarios, we introduce a new benchmark, AVA-100. This benchmark comprises 8 videos, each exceeding 10 hours in duration, along with 120 manually annotated, diverse, and complex question-answer pairs. On AVA-100, AVA achieves top-tier performance with an accuracy of 75.8%. The source code of AVA is available at https://github.com/I-ESC/Project-Ava. The AVA-100 benchmark can be accessed at https://huggingface.co/datasets/iesc/Ava-100.
中文标题/摘要
标题:AVA:利用视觉语言模型实现有能动性的视频分析
AI驱动的视频分析在多个领域变得越来越重要。然而,现有的系统通常局限于特定的、预定义的任务,限制了它们在开放性分析场景中的适应性。最近,视觉语言模型(VLMs)的出现作为一种变革性技术,为实现开放性视频理解、推理和分析提供了巨大潜力。然而,它们有限的上下文窗口在处理真实世界应用中常见的超长视频内容时提出了挑战。为了解决这个问题,我们提出了AVA,一个基于VLM的系统,旨在实现开放性、高级的视频分析。AVA包含两项关键创新:(1)近实时构建事件知识图谱(EKGs)以高效索引长或连续视频流,(2)一种有能动性的检索生成机制,利用EKGs处理复杂和多样的查询。在公共基准LVBench和VideoMME-Long上的全面评估表明,AVA达到了最先进的性能,分别取得了62.3%和64.1%的准确率,显著超过了现有的VLM和视频检索增强生成(RAG)系统。此外,为了评估超长和开放世界视频场景中的视频分析,我们引入了一个新的基准AVA-100。该基准包括8个超过10小时的视频,以及120个手动标注的、多样且复杂的问答对。在AVA-100上,AVA取得了顶级性能,准确率为75.8%。AVA的源代码可在https://github.com/I-ESC/Project-Ava获取。AVA-100基准数据集可在https://huggingface.co/datasets/iesc/Ava-100获取。
Summary / 总结
The paper introduces AVA, a system using Vision Language Models for open-ended video analytics, addressing the limitations of existing systems in handling diverse and complex queries. AVA innovates with Event Knowledge Graphs for efficient indexing and an agentic retrieval-generation mechanism. Comprehensive evaluations show AVA outperforms existing systems on public benchmarks and a new ultra-long video benchmark, AVA-100, with accuracies of 62.3%, 64.1%, and 75.8% respectively.
AVA 是一个基于 VLM 的系统,旨在进行开放式的视频分析,解决现有系统在处理多样和复杂查询时的局限性。它引入了事件知识图谱(EKG)进行高效索引,并采用一种代理检索生成机制。AVA 在 LVBench 和 VideoMME-Long 上分别达到了 62.3% 和 64.1% 的准确率,并在新引入的 AVA-100 基准上实现了 75.8% 的准确率,该基准包括超长视频和复杂查询。
Towards Automated Semantic Interpretability in Reinforcement Learning via Vision-Language Models
Authors: Zhaoxin Li, Zhang Xi-Jia, Batuhan Altundas, Letian Chen, Rohan Paleja, Matthew Gombolay
First: 2025-03-20T21:53:19+00:00 · Latest: 2025-10-31T01:04:38+00:00
Abstract
Semantic interpretability in Reinforcement Learning (RL) enables transparency and verifiability of decision-making. Achieving semantic interpretability in reinforcement learning requires (1) a feature space composed of human-understandable concepts and (2) a policy that is interpretable and verifiable. However, constructing such a feature space has traditionally relied on manual human specification, which often fails to generalize to unseen environments. Moreover, even when interpretable features are available, most reinforcement learning algorithms employ black-box models as policies, thereby hindering transparency. We introduce interpretable Tree-based Reinforcement learning via Automated Concept Extraction (iTRACE), an automated framework that leverages pre-trained vision-language models (VLMs) for semantic feature extraction and trains an interpretable tree-based model via RL. To address the impracticality of running VLMs in RL loops, we distill their outputs into a lightweight model. By leveraging VLMs to automate tree-based reinforcement learning, iTRACE reduces the reliance on human annotation that is traditionally required by interpretable models. In addition, it addresses key limitations of VLMs alone, such as their lack of grounding in action spaces and their inability to directly optimize policies. We evaluate iTRACE across three domains: Atari games, grid-world navigation, and driving. The results show that iTRACE outperforms other interpretable policy baselines and matches the performance of black-box policies on the same interpretable feature space.
中文标题/摘要
标题:通过视觉语言模型实现强化学习的自动语义可解释性
强化学习(RL)中的语义可解释性可以提高决策的透明度和验证性。实现语义可解释性需要(1)由人类可理解的概念组成的特征空间,以及(2)可解释和可验证的策略。然而,传统上构建这样的特征空间依赖于手动的人工指定,这往往无法泛化到未见过的环境中。此外,即使可解释的特征可用,大多数强化学习算法仍使用黑盒模型作为策略,从而阻碍了透明度。我们提出了通过自动概念提取实现可解释树形强化学习(iTRACE)的自动化框架,该框架利用预训练的视觉语言模型(VLM)进行语义特征提取,并通过RL训练一个可解释的树形模型。为了解决在RL循环中运行VLM的不切实际性,我们将它们的输出精简为一个轻量级模型。通过利用视觉语言模型(VLMs)自动化树形强化学习,iTRACE减轻了传统上由可解释模型所需的大量人工注释的依赖。此外,它还解决了VLMs自身的一些关键限制,如它们缺乏对动作空间的约束以及无法直接优化策略。我们在Atari游戏、网格世界导航和驾驶三个领域评估了iTRACE。结果表明,iTRACE在可解释策略基线中表现出色,并且在相同的可解释特征空间中与黑盒策略的性能相当。
Summary / 总结
The research aims to enhance semantic interpretability in reinforcement learning by introducing iTRACE, which uses pre-trained vision-language models for automated feature extraction and trains an interpretable tree-based model via reinforcement learning. The method involves distilling VLM outputs into a lightweight model to address computational impracticalities. Key experimental findings show that iTRACE outperforms other interpretable policy baselines and matches the performance of black-box policies on the same interpretable feature space across Atari games, grid-world navigation, and driving domains.
该论文提出了iTRACE,一种用于强化学习中语义可解释性的自动化框架。它利用预训练的视觉-语言模型提取可解释特征,并通过强化学习训练树形模型。iTRACE在Atari游戏、网格导航和驾驶等三个领域中,优于其他可解释策略基线,并且在相同的可解释特征空间中与黑盒策略的性能相当。
MoME: Mixture of Visual Language Medical Experts for Medical Imaging Segmentation
Authors: Arghavan Rezvani, Xiangyi Yan, Anthony T. Wu, Kun Han, Pooya Khosravi, Xiaohui Xie
First: 2025-10-30T20:50:15+00:00 · Latest: 2025-10-30T20:50:15+00:00
Abstract
In this study, we propose MoME, a Mixture of Visual Language Medical Experts, for Medical Image Segmentation. MoME adapts the successful Mixture of Experts (MoE) paradigm, widely used in Large Language Models (LLMs), for medical vision-language tasks. The architecture enables dynamic expert selection by effectively utilizing multi-scale visual features tailored to the intricacies of medical imagery, enriched with textual embeddings. This work explores a novel integration of vision-language models for this domain. Utilizing an assembly of 10 datasets, encompassing 3,410 CT scans, MoME demonstrates strong performance on a comprehensive medical imaging segmentation benchmark. Our approach explores the integration of foundation models for medical imaging, benefiting from the established efficacy of MoE in boosting model performance by incorporating textual information. Demonstrating competitive precision across multiple datasets, MoME explores a novel architecture for achieving robust results in medical image analysis.
中文标题/摘要
标题:MoME:视觉语言医学专家混合体在医学影像分割中的应用
在本研究中,我们提出MoME(Mixture of Visual Language Medical Experts),用于医学影像分割。MoME将广泛应用于大型语言模型(LLMs)中的混合专家(MoE)范式适应医学视觉语言任务。该架构通过有效利用多尺度视觉特征来动态选择专家,这些特征针对医学影像的复杂性进行了定制,并结合了文本嵌入。本研究探索了视觉语言模型在该领域的新型集成。利用包含3,410份CT扫描的10个数据集,MoME在全面的医学影像分割基准测试中表现出色。我们的方法探索了基础模型在医学影像中的集成,得益于MoE在通过引入文本信息提升模型性能方面的已验证效果。MoME在多个数据集上展示了竞争力的精度,探索了一种新的架构以在医学图像分析中实现稳健的结果。
Summary / 总结
MoME is a Mixture of Visual Language Medical Experts designed for medical image segmentation. It integrates the MoE paradigm from Large Language Models with multi-scale visual features and textual embeddings to dynamically select experts. MoME shows strong performance on a comprehensive medical imaging segmentation benchmark using 3,410 CT scans from 10 datasets, highlighting its effectiveness in leveraging textual information to improve model precision across various medical imaging tasks.
MoME 是一种混合视觉语言医学专家系统,用于医学图像分割。它结合了大型语言模型中的 MoE 架构,并利用多尺度视觉特征和文本嵌入来动态选择专家。MoME 使用来自 10 个数据集的 3,410 个 CT 扫描,在全面的医学影像分割基准测试中表现出色,展示了其通过利用文本信息提高模型精度的有效性,适用于多种医学影像任务。
Human Uncertainty-Aware Data Selection and Automatic Labeling in Visual Question Answering
Authors: Jian Lan, Zhicheng Liu, Udo Schlegel, Raoyuan Zhao, Yihong Liu, Hinrich Schütze, Michael A. Hedderich, Thomas Seidl
First: 2025-10-13T11:35:30+00:00 · Latest: 2025-10-30T20:31:38+00:00
Abstract
Large vision-language models (VLMs) achieve strong performance in Visual Question Answering but still rely heavily on supervised fine-tuning (SFT) with massive labeled datasets, which is costly due to human annotations. Crucially, real-world datasets often exhibit human uncertainty (HU) -- variation in human confidence across annotations -- but standard SFT simply optimizes toward the most frequent label, disregarding HU distributions. This leaves two open questions: How does HU affect SFT, and how can HU be effectively leveraged in training? In this work, we first conduct a systematic evaluation of VLMs across varying HU levels. We have two key findings: (i) surprisingly, high-HU samples contribute little or even degrade model performance, and (ii) naively training on the full dataset yields under-calibrated models that fail to capture HU distributions. Motivated by these findings, we introduce HaDola, a human uncertainty-aware data selection and automatic labeling framework. HaDola operates in four stages -- discriminate, self-annotate, error trigger, and training -- to iteratively identify harmful samples, prioritize informative ones, and bootstrap from a small seed set (5% of data). Our approach substantially reduces reliance on costly HU annotations and makes VLMs more accurate and better calibrated. Extensive experiments on VQAv2 and VizWiz datasets demonstrate that HaDola consistently matches or outperforms state-of-the-art baselines with less training data. Our work highlights the importance of explicitly modeling HU in SFT, suggesting that better utilization of HU is more effective than merely scaling up dataset size.
中文标题/摘要
标题:人类不确定性意识的数据选择与视觉问答中的自动标注
大型视觉-语言模型(VLMs)在视觉问答中表现出色,但仍高度依赖大规模标注数据集的监督微调(SFT),这由于人工注释的成本高昂。关键的是,现实世界的数据集经常表现出人类不确定性(HU)——不同注释之间的人类置信度差异,但标准SFT简单地优化最频繁的标签,忽视了HU分布。这留下了两个开放问题:HU如何影响SFT,以及如何有效利用HU进行训练?在本文中,我们首先系统评估了VLMs在不同HU水平下的表现。我们有两个关键发现:(i) 惊人的是,高HU样本对模型性能贡献甚微甚至降低性能,(ii) 直接在完整数据集上训练会导致欠校准模型,无法捕捉HU分布。受这些发现的启发,我们引入了HaDola,一种人类不确定性意识的数据选择和自动标注框架。HaDola在四个阶段——区分、自我标注、错误触发和训练——中迭代识别有害样本、优先处理信息性样本,并从一个小种子集(数据的5%)开始进行自我提升。我们的方法显著减少了对昂贵的HU注释的依赖,使VLMs更准确且更校准。在VQAv2和VizWiz数据集上的广泛实验表明,HaDola在使用更少训练数据的情况下,能够一致地匹配或超越最先进的基线。我们的工作强调了在SFT中明确建模HU的重要性,表明更好地利用HU比仅仅扩大数据集规模更有效。
Summary / 总结
This paper addresses the issue of human uncertainty (HU) in supervised fine-tuning (SFT) of vision-language models (VLMs) for Visual Question Answering (VQA). The authors find that high-HU samples contribute little to model performance and that naively training on the full dataset leads to under-calibrated models. To address these issues, they propose HaDola, a framework that selects and labels data based on HU, reducing the need for costly human annotations and improving model accuracy and calibration. Experiments show that HaDola outperforms or matches state-of-the-art baselines with less training data.
该研究解决了视觉语言模型(VLM)在视觉问答(VQA)任务中的人类不确定性(HU)问题。作者首先在不同HU水平下评估了VLM的表现,并发现高HU样本会降低模型性能。他们随后提出了HaDola框架,该框架基于HU选择数据并自动标注,以提高模型的准确性和校准度。实验表明,HaDola在使用更少训练数据的情况下优于现有方法。该研究强调了在SFT中明确建模HU的重要性,表明更好地利用HU比仅仅增加数据集规模更有效。
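One simple way to quantify disagreement among annotators, and to separate low-HU from high-HU samples, is the normalized entropy of their answers. This is only a proxy chosen for illustration: the abstract defines HU in terms of variation in human confidence, and the entropy measure, the 0.5 threshold, and the function names below are assumptions, not HaDola's actual discriminate stage.

```python
import math
from collections import Counter

def human_uncertainty(annotations):
    """Normalized entropy in [0, 1] of the annotators' answers:
    0 when everyone agrees, 1 when every answer is different."""
    counts = Counter(annotations)
    n = len(annotations)
    if len(counts) <= 1:
        return 0.0
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(n)  # ln(n) is the maximum for n annotators

def split_by_uncertainty(samples, threshold=0.5):
    """Partition (question, annotations) pairs into low-HU and high-HU."""
    low, high = [], []
    for question, annotations in samples:
        target = low if human_uncertainty(annotations) <= threshold else high
        target.append(question)
    return low, high

samples = [
    ("Is the light on?", ["yes"] * 10),
    ("What is he holding?", ["bat", "stick", "racket", "club"]),
]
low_hu, high_hu = split_by_uncertainty(samples)
```

A selection scheme could then prioritize or filter samples by this score before fine-tuning.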
MM-OPERA: Benchmarking Open-ended Association Reasoning for Large Vision-Language Models
Authors: Zimeng Huang, Jinxin Ke, Xiaoxuan Fan, Yufeng Yang, Yang Liu, Liu Zhonghan, Zedi Wang, Junteng Dai, Haoyi Jiang, Yuyu Zhou, Keze Wang, Ziliang Chen
Venue: NeurIPS 2025 poster
First: 2025-10-30T18:49:06+00:00 · Latest: 2025-10-30T18:49:06+00:00
Comments: NeurIPS 2025 Datasets and Benchmarks Track poster
Abstract
Large Vision-Language Models (LVLMs) have exhibited remarkable progress. However, deficiencies remain compared to human intelligence, such as hallucination and shallow pattern matching. In this work, we aim to evaluate a fundamental yet underexplored intelligence: association, a cornerstone of human cognition for creative thinking and knowledge integration. Current benchmarks, often limited to closed-ended tasks, fail to capture the complexity of open-ended association reasoning vital for real-world applications. To address this, we present MM-OPERA, a systematic benchmark with 11,497 instances across two open-ended tasks: Remote-Item Association (RIA) and In-Context Association (ICA), aligning association intelligence evaluation with human psychometric principles. It challenges LVLMs to resemble the spirit of divergent thinking and convergent associative reasoning through free-form responses and explicit reasoning paths. We deploy tailored LLM-as-a-Judge strategies to evaluate open-ended outputs, applying process-reward-informed judgment to dissect reasoning with precision. Extensive empirical studies on state-of-the-art LVLMs, including sensitivity analysis of task instances, validity analysis of LLM-as-a-Judge strategies, and diversity analysis across abilities, domains, languages, cultures, etc., provide a comprehensive and nuanced understanding of the limitations of current LVLMs in associative reasoning, paving the way for more human-like and general-purpose AI. The dataset and code are available at https://github.com/MM-OPERA-Bench/MM-OPERA.
中文标题/摘要
标题:MM-OPERA:评估大型视觉语言模型的开放性关联推理
大型视觉语言模型(LVLMs)已经取得了显著的进步,但在与人类智能相比时,仍存在幻觉和浅层模式匹配等缺陷。本文旨在评估一种基础但尚未充分探索的智能:关联,这是人类认知中创造性思考和知识整合的核心。当前的基准测试通常局限于封闭式任务,未能捕捉到开放性关联推理的复杂性,这对于实际应用至关重要。为解决这一问题,我们提出了MM-OPERA,这是一个系统性的基准测试,包含11,497个实例,涵盖了两个开放性任务:远程项关联(RIA)和上下文关联(ICA),并使关联智能评估与人类心理测量原则相一致。它通过自由形式的响应和明确的推理路径,挑战LVLMs表现出发散思维和收敛关联推理的精神。我们采用定制的LLM作为裁判策略来评估开放性输出,并应用过程奖励导向的判断来精确剖析推理。对最先进的LVLMs的广泛实证研究,包括任务实例的敏感性分析、LLM作为裁判策略的有效性分析以及在能力、领域、语言、文化等方面的多样性分析,提供了对当前LVLMs在关联推理方面局限性的全面而细致的理解,为开发更接近人类和通用的人工智能铺平了道路。数据集和代码可在https://github.com/MM-OPERA-Bench/MM-OPERA/获取。
Summary / 总结
The research aims to evaluate the open-ended association reasoning capability of large vision-language models (LVLMs) by introducing MM-OPERA, a new benchmark with 11,497 instances across two tasks: Remote-Item Association and In-Context Association. The study uses LLM-as-a-Judge strategies with process-reward-informed judgment to evaluate the models' free-form responses and reasoning paths. Empirical studies on state-of-the-art LVLMs, including sensitivity, validity, and diversity analyses, characterize the limitations of current models in associative reasoning.
研究旨在通过引入MM-OPERA基准,评估大型视觉-语言模型(LVLM)的开放性关联推理能力,该基准包含11,497个实例,涵盖远程项关联和上下文关联任务。研究采用LLM-as-a-Judge策略评估模型的自由形式响应和推理路径,揭示了其在关联推理方面的局限性,并为未来的发展提供了见解。全面的实证研究强调了在这一领域需要更具人类特性和通用性的AI。
Cognition Envelopes for Bounded AI Reasoning in Autonomous UAS Operations
Authors: Pedro Antonio Alarcón Granadeno, Arturo Miguel Bernal Russell, Sofia Nelson, Demetrius Hernandez, Maureen Petterson, Michael Murphy, Walter J. Scheirer, Jane Cleland-Huang
First: 2025-10-30T18:11:32+00:00 · Latest: 2025-10-30T18:11:32+00:00
Comments: 10.5 pages, 9 figures
Abstract
Cyber-physical systems increasingly rely on Foundational Models such as Large Language Models (LLMs) and Vision-Language Models (VLMs) to increase autonomy through enhanced perception, inference, and planning. However, these models also introduce new types of errors, such as hallucinations, overgeneralizations, and context misalignments, resulting in incorrect and flawed decisions. To address this, we introduce the concept of Cognition Envelopes, designed to establish reasoning boundaries that constrain AI-generated decisions while complementing the use of meta-cognition and traditional safety envelopes. As with safety envelopes, Cognition Envelopes require practical guidelines and systematic processes for their definition, validation, and assurance.
中文标题/摘要
标题:认知边界在自主无人机系统操作中受限人工智能推理中的应用
网络物理系统越来越多地依赖于基础模型,如大型语言模型(LLMs)和视觉-语言模型(VLMs),以通过增强感知、推理和规划来提高自主性。然而,这些模型也会引入新的错误类型,如幻觉、过度泛化和上下文错位,导致错误和有缺陷的决策。为了解决这一问题,我们提出了认知边界的概念,旨在建立推理边界,限制AI生成的决策,同时补充元认知和传统安全边界的使用。与安全边界类似,认知边界需要实用的指导方针和系统的过程来定义、验证和保证。
Summary / 总结
The paper addresses errors such as hallucinations, overgeneralizations, and context misalignments that Foundational Models like LLMs and VLMs introduce into autonomous systems, which can lead to incorrect decisions. It introduces Cognition Envelopes, which set reasoning boundaries that constrain AI-generated decisions and complement meta-cognition and traditional safety envelopes. The authors argue that Cognition Envelopes, like safety envelopes, need practical guidelines and systematic processes for their definition, validation, and assurance.
论文针对使用大型语言模型(LLM)和视觉-语言模型(VLM)在自主操作中可能出现的幻觉和上下文错位等错误问题,提出了认知包络的概念,以界定和约束AI生成的决策,确保其在可接受的推理边界内。主要发现是,认知包络类似于安全包络,需要实用的指导方针和系统的过程来定义、验证和保证,以提高自主系统的可靠性。
ChartAB: A Benchmark for Chart Grounding & Dense Alignment
Authors: Aniruddh Bansal, Davit Soselia, Dang Nguyen, Tianyi Zhou
First: 2025-10-30T17:56:31+00:00 · Latest: 2025-10-30T17:56:31+00:00
Abstract
Charts play an important role in visualization, reasoning, data analysis, and the exchange of ideas among humans. However, existing vision-language models (VLMs) still lack accurate perception of details and struggle to extract fine-grained structures from charts. Such limitations in chart grounding also hinder their ability to compare multiple charts and reason over them. In this paper, we introduce a novel "ChartAlign Benchmark (ChartAB)" to provide a comprehensive evaluation of VLMs in chart grounding tasks, i.e., extracting tabular data, localizing visualization elements, and recognizing various attributes from charts of diverse types and complexities. We design a JSON template to facilitate the calculation of evaluation metrics specifically tailored for each grounding task. By incorporating a novel two-stage inference workflow, the benchmark can further evaluate VLMs' capability to align and compare elements/attributes across two charts. Our analysis of evaluations on several recent VLMs reveals new insights into their perception biases, weaknesses, robustness, and hallucinations in chart understanding. These findings highlight the fine-grained discrepancies among VLMs in chart understanding tasks and point to specific skills that need to be strengthened in current models.
中文标题/摘要
标题:ChartAB:图表定位与密集对齐基准
图表在可视化、推理、数据分析以及人类思想交流中发挥着重要作用。然而,现有的视觉-语言模型(VLMs)在细节感知方面仍存在不足,难以从图表中提取精细结构。这种图表定位的限制也阻碍了它们比较多个图表和推理的能力。在本文中,我们引入了一个新的“图表对齐基准(ChartAB)”,以全面评估VLMs在图表定位任务中的表现,即提取表格数据、定位可视化元素以及从不同类型和复杂度的图表中识别各种属性。我们设计了一个JSON模板,以方便计算每个定位任务的评估指标。通过引入一种新颖的两阶段推理工作流,基准还可以进一步评估VLMs在两个图表之间对齐和比较元素/属性的能力。我们对几种近期VLMs的评估分析揭示了它们在图表理解中的感知偏差、弱点、鲁棒性和幻觉。这些发现突显了VLMs在图表理解任务中的细微差异,并指出了当前模型需要加强的具体技能。
Summary / 总结
The paper introduces ChartAB, a benchmark for evaluating vision-language models in chart grounding tasks, including extracting tabular data, localizing visualization elements, and recognizing attributes. It uses a JSON template to calculate specific evaluation metrics and a two-stage inference workflow to assess models' ability to align and compare elements across charts. The benchmark reveals perception biases, weaknesses, and hallucinations in recent models, highlighting the need for improved fine-grained skills in chart understanding tasks.
论文提出了ChartAB,一个用于评估视觉-语言模型在图表定位任务中的基准,包括提取表格数据、定位可视化元素和识别属性。它使用JSON模板来计算特定的评估指标,并采用两阶段推理工作流来评估模型在跨图表对齐和比较元素方面的能力。分析揭示了模型中的偏见、弱点和幻觉,强调了需要改进其对图表的细粒度理解。
SteerVLM: Robust Model Control through Lightweight Activation Steering for Vision Language Models
Authors: Anushka Sivakumar, Andrew Zhang, Zaber Hakim, Chris Thomas
First: 2025-10-30T17:52:39+00:00 · Latest: 2025-10-30T17:52:39+00:00
Abstract
This work introduces SteerVLM, a lightweight steering module designed to guide Vision-Language Models (VLMs) towards outputs that better adhere to desired instructions. Our approach learns from the latent embeddings of paired prompts encoding target and converse behaviors to dynamically adjust activations connecting the language modality with image context. This allows for fine-grained, inference-time control over complex output semantics without modifying model weights while preserving performance on off-target tasks. Our steering module requires learning parameters equal to 0.14% of the original VLM's size. Our steering module gains model control through dimension-wise activation modulation and adaptive steering across layers without requiring pre-extracted static vectors or manual tuning of intervention points. Furthermore, we introduce VNIA (Visual Narrative Intent Alignment), a multimodal dataset specifically created to facilitate the development and evaluation of VLM steering techniques. Our method outperforms existing intervention techniques on steering and hallucination mitigation benchmarks for VLMs and proposes a robust solution for multimodal model control through activation engineering.
中文标题/摘要
标题:SteerVLM:通过轻量级激活转向实现视觉语言模型稳健的模型控制
本工作介绍了SteerVLM,这是一种轻量级的转向模块,旨在引导视觉语言模型(VLMs)生成更符合所需指令的输出。我们的方法通过学习配对提示的潜在嵌入,编码目标和相反行为,动态调整语言模态与图像上下文之间的激活连接。这允许在不修改模型权重的情况下,在推理时对复杂的输出语义进行精细控制,同时保持对离目标任务的性能。我们的转向模块的学习参数量仅为原始VLM大小的0.14%。我们的转向模块通过维度上的激活调制和跨层自适应转向获得模型控制,无需预先提取的静态向量或手动调整干预点。此外,我们还引入了VNIA(视觉叙事意图对齐)多模态数据集,专门用于促进VLM转向技术的发展和评估。我们的方法在VLM的转向和幻觉缓解基准测试中优于现有干预技术,并通过激活工程提出了多模态模型控制的稳健解决方案。
Summary / 总结
SteerVLM is a lightweight module that guides VLMs to produce outputs more aligned with desired instructions by dynamically adjusting activations. It learns from paired prompts and requires only 0.14% of the original VLM's parameters. SteerVLM achieves fine-grained control over complex output semantics without modifying model weights and outperforms existing techniques in steering and hallucination mitigation benchmarks for VLMs.
SteerVLM 是一个轻量级模块,通过动态调整激活来引导视觉语言模型 (VLM) 生成更符合指令输出的结果。它通过学习配对提示来引导模型,而不修改权重,仅需原始 VLM 参数的 0.14%。SteerVLM 实现了对复杂输出语义的精细控制,并在 VLM 的引导和幻觉缓解基准测试中优于现有技术。
Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench
Authors: Fenfen Lin, Yesheng Liu, Haiyu Xu, Chen Yue, Zheqi He, Mingxuan Zhao, Miguel Hu Chen, Jiakang Liu, JG Yao, Xi Yang
First: 2025-10-30T17:20:51+00:00 · Latest: 2025-10-30T17:20:51+00:00
Comments: Project page: https://flageval-baai.github.io/MeasureBenchPage/
Abstract
Reading measurement instruments is effortless for humans and requires relatively little domain expertise, yet it remains surprisingly challenging for current vision-language models (VLMs), as we find in a preliminary evaluation. In this work, we introduce MeasureBench, a benchmark on visual measurement reading covering both real-world and synthesized images of various types of measurements, along with an extensible pipeline for data synthesis. Our pipeline procedurally generates a specified type of gauge with controllable visual appearance, enabling scalable variation in key details such as pointers, scales, fonts, lighting, and clutter. Evaluation on popular proprietary and open-weight VLMs shows that even the strongest frontier VLMs struggle with measurement reading in general. A consistent failure mode is indicator localization: models can read digits or labels but misidentify the key positions of pointers or alignments, leading to large numeric errors despite plausible textual reasoning. We have also conducted preliminary experiments with reinforcement learning over synthetic data, and find encouraging results on the in-domain synthetic subset but less promising results on real-world images. Our analysis highlights a fundamental limitation of current VLMs in fine-grained spatial grounding. We hope this resource can support future advances in visually grounded numeracy and precise spatial perception of VLMs, bridging the gap between recognizing numbers and measuring the world.
中文标题/摘要
标题:视觉语言模型能胜任视觉测量阅读吗?MeasureBench基准测试
人类阅读测量仪器轻而易举,所需的专业知识相对较少,但当前的视觉语言模型(VLMs)在初步评估中仍面临巨大挑战。本文介绍了MeasureBench,这是一个涵盖各种类型测量的真实世界和合成图像的视觉测量阅读基准,以及一个可扩展的数据合成管道。我们的管道程序化生成具有可控视觉外观的特定类型的量具,使关键细节如指针、刻度、字体、照明和杂乱的可扩展变化成为可能。对流行的专有和开源VLMs的评估显示,即使是最先进的VLMs在一般测量阅读上也表现不佳。一致的失败模式是指示器定位:模型可以读取数字或标签,但错误识别关键指针位置或对齐,导致大数值错误,尽管有合理的文本推理。我们还对合成数据进行了初步的强化学习实验,发现对领域内合成子集有令人鼓舞的结果,但对真实世界图像的效果较差。我们的分析突显了当前VLMs在精细空间定位方面的基本局限性。我们希望这一资源能帮助未来在视觉接地数理和精确空间感知方面取得进展,弥合识别数字与测量世界之间的差距。
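The indicator-localization failure described above can be illustrated with a small sketch. It assumes a dial gauge with a linear scale; the function name `gauge_reading` and all angles and ranges are hypothetical and are not taken from MeasureBench. The point is only that a modest error in the estimated pointer angle becomes a proportionally large error in the reported value, even when every digit on the scale is read correctly.

```python
def gauge_reading(pointer_deg, start_deg, end_deg, min_value, max_value):
    """Linearly map a pointer angle to a value on a dial gauge (hypothetical helper)."""
    if end_deg == start_deg:
        raise ValueError("scale must span a non-zero angle")
    # Fraction of the scale swept by the pointer, 0.0 at start and 1.0 at end.
    fraction = (pointer_deg - start_deg) / (end_deg - start_deg)
    return min_value + fraction * (max_value - min_value)


# Assumed example: a 0-100 gauge whose scale sweeps 270 degrees.
true_value = gauge_reading(135.0, 0.0, 270.0, 0.0, 100.0)

# Mislocalizing the pointer by 27 degrees (10% of the sweep) shifts the
# reading by about 10 units, although the scale labels were read correctly.
off_value = gauge_reading(162.0, 0.0, 270.0, 0.0, 100.0)

print(true_value, off_value)
```

Real instruments may use non-linear or multi-pointer scales, so this linear mapping is only one simple case.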
CronusVLA: Towards Efficient and Robust Manipulation via Multi-Frame Vision-Language-Action Modeling
Authors: Hao Li, Shuai Yang, Yilun Chen, Xinyi Chen, Xiaoda Yang, Yang Tian, Hanqing Wang, Tai Wang, Dahua Lin, Feng Zhao, Jiangmiao Pang
First: 2025-06-24T17:30:27+00:00 · Latest: 2025-10-30T16:38:19+00:00
Comments: 39 pages, 24 figures
Abstract
Recent vision-language-action (VLA) models built on pretrained vision-language models (VLMs) have demonstrated strong performance in robotic manipulation. However, these models remain constrained by the single-frame image paradigm and fail to fully leverage the temporal information offered by multi-frame histories, as directly feeding multiple frames into VLM backbones incurs substantial computational overhead and inference latency. We propose CronusVLA, a unified framework that extends single-frame VLA models to the multi-frame paradigm. CronusVLA follows a two-stage process: (1) Single-frame pretraining on large-scale embodied datasets with autoregressive prediction of action tokens, establishing an effective embodied vision-language foundation; (2) Multi-frame post-training, which adapts the prediction of the vision-language backbone from discrete tokens to learnable features, and aggregates historical information via feature chunking. CronusVLA effectively addresses the existing challenges of multi-frame modeling while enhancing performance and observational robustness. To evaluate the robustness under temporal and spatial disturbances, we introduce SimplerEnv-OR, a novel benchmark featuring 24 types of observational disturbances and 120 severity levels. Experiments across three embodiments in simulated and real-world environments demonstrate that CronusVLA achieves leading performance and superior robustness, with a 70.9% success rate on SimplerEnv, a 26.8% improvement over OpenVLA on LIBERO, and the highest robustness score on SimplerEnv-OR. These results highlight the potential of efficient multi-frame adaptation in VLA models for more powerful and robust real-world deployment.
中文标题/摘要
标题:CronusVLA:通过多帧视觉-语言-动作建模实现高效稳健操作
基于预训练视觉-语言模型(VLMs)的近期视觉-语言-动作(VLA)模型在机器人操作方面表现出强大的性能。然而,这些模型仍然受限于单帧图像范式,未能充分利用多帧历史提供的时间信息,因为直接将多帧输入到VLM主干中会带来巨大的计算开销和推理延迟。我们提出了一种名为CronusVLA的统一框架,将单帧VLA模型扩展到多帧范式。CronusVLA遵循两阶段过程:(1)在大规模具身数据集上进行单帧预训练,通过自回归预测动作标记,建立有效的具身视觉-语言基础;(2)多帧后训练,将视觉-语言主干的预测从离散标记调整为可学习特征,并通过特征分块聚合历史信息。CronusVLA有效解决了多帧建模的现有挑战,同时提高了性能和观测鲁棒性。为了评估在时间和空间扰动下的鲁棒性,我们引入了SimplerEnv-OR基准,该基准包含24种观测扰动类型和120种严重程度级别。在模拟和真实环境中的三种具身模型实验表明,CronusVLA实现了领先性能和优越的鲁棒性,在SimplerEnv中的成功率为70.9%,在LIBERO中的性能提高了26.8%,在SimplerEnv-OR中的鲁棒性得分最高。这些结果突显了VLA模型中高效多帧适应的潜力,使其在更强大和鲁棒的实际部署中具有更大的可能性。
Summary / 总结
CronusVLA is a unified framework that extends single-frame vision-language-action models to a multi-frame paradigm to enhance performance and robustness in robotic manipulation. It involves a two-stage process: single-frame pretraining with autoregressive action token prediction and multi-frame post-training that adapts the vision-language backbone and aggregates historical information. Experiments show CronusVLA outperforms existing models with a 70.9% success rate on SimplerEnv and a 26.8% improvement over OpenVLA on LIBERO, demonstrating superior robustness under various disturbances.
CronusVLA 是一个框架,将单帧视觉-语言-动作模型扩展到多帧范式,以提高机器人操作能力。它包括单帧预训练以建立有效的感知-语言基础,以及多帧后训练来适应并整合历史信息。实验表明,CronusVLA 在 SimplerEnv 上的成功率为 70.9%,在 LIBERO 上比 OpenVLA 提高了 26.8%,并在 SimplerEnv-OR 上展示了更强的鲁棒性。
All You Need for Object Detection: From Pixels, Points, and Prompts to Next-Gen Fusion and Multimodal LLMs/VLMs in Autonomous Vehicles
Authors: Sayed Pedram Haeri Boroujeni, Niloufar Mehrabi, Hazim Alzorgan, Ahmad Sarlak, Mahlagha Fazeli, Abolfazl Razi
First: 2025-10-30T16:08:25+00:00 · Latest: 2025-10-30T16:08:25+00:00
Abstract
Autonomous Vehicles (AVs) are transforming the future of transportation through advances in intelligent perception, decision-making, and control systems. However, their success is tied to one core capability, reliable object detection in complex and multimodal environments. While recent breakthroughs in Computer Vision (CV) and Artificial Intelligence (AI) have driven remarkable progress, the field still faces a critical challenge as knowledge remains fragmented across multimodal perception, contextual reasoning, and cooperative intelligence. This survey bridges that gap by delivering a forward-looking analysis of object detection in AVs, emphasizing emerging paradigms such as Vision-Language Models (VLMs), Large Language Models (LLMs), and Generative AI rather than re-examining outdated techniques. We begin by systematically reviewing the fundamental spectrum of AV sensors (camera, ultrasonic, LiDAR, and Radar) and their fusion strategies, highlighting not only their capabilities and limitations in dynamic driving environments but also their potential to integrate with recent advances in LLM/VLM-driven perception frameworks. Next, we introduce a structured categorization of AV datasets that moves beyond simple collections, positioning ego-vehicle, infrastructure-based, and cooperative datasets (e.g., V2V, V2I, V2X, I2I), followed by a cross-analysis of data structures and characteristics. Ultimately, we analyze cutting-edge detection methodologies, ranging from 2D and 3D pipelines to hybrid sensor fusion, with particular attention to emerging transformer-driven approaches powered by Vision Transformers (ViTs), Large and Small Language Models (SLMs), and VLMs. By synthesizing these perspectives, our survey delivers a clear roadmap of current capabilities, open challenges, and future opportunities.
中文标题/摘要
标题:目标检测所需的一切:从像素、点和提示到下一代融合与多模态LLM/VLM在自动驾驶车辆中的应用
自动驾驶车辆(AVs)通过智能感知、决策和控制系统的发展正在重塑未来的交通。然而,它们的成功取决于一个核心能力——在复杂和多模态环境中可靠地进行目标检测。尽管计算机视觉(CV)和人工智能(AI)领域的最新突破推动了显著的进步,但该领域仍面临一个关键挑战,即知识在多模态感知、上下文推理和协同智能方面仍碎片化。本文综述填补了这一空白,通过提供面向未来的AV目标检测分析,强调了新兴范式,如视觉语言模型(VLMs)、大型语言模型(LLMs)和生成AI,而不是重新审视过时的技术。我们首先系统地回顾了AV传感器(摄像头、超声波、激光雷达和雷达)及其融合策略,不仅突出了它们在动态驾驶环境中的能力和局限性,还强调了它们与基于大/小语言模型/视觉模型的感知框架的潜在整合。接着,我们介绍了AV数据集的结构化分类,超越了简单的集合,将自我车辆、基础设施和协同数据集(例如V2V、V2I、V2X、I2I)置于其中,随后进行了数据结构和特征的交叉分析。最后,我们分析了最新的检测方法,从2D和3D管道到混合传感器融合,特别关注由视觉变换器(ViTs)、大型和小型语言模型(SLMs)和VLMs驱动的新兴变换器方法。通过综合这些视角,我们的综述提供了一条清晰的当前能力、开放挑战和未来机遇的路线图。
Counteracting Matthew Effect in Self-Improvement of LVLMs through Head-Tail Re-balancing
Authors: Xin Guo, Zhiheng Xi, Yiwen Ding, Yitao Zhai, Xiaowei Shi, Xunliang Cai, Tao Gui, Qi Zhang, Xuanjing Huang
First: 2025-10-30T13:26:58+00:00 · Latest: 2025-10-30T13:26:58+00:00
Comments: Preprint
Abstract
Self-improvement has emerged as a mainstream paradigm for advancing the reasoning capabilities of large vision-language models (LVLMs), where models explore and learn from successful trajectories iteratively. However, we identify a critical issue during this process: the model excels at generating high-quality trajectories for simple queries (i.e., head data) but struggles with more complex ones (i.e., tail data). This leads to an imbalanced optimization that drives the model to prioritize simple reasoning skills, while hindering its ability to tackle more complex reasoning tasks. Over iterations, this imbalance becomes increasingly pronounced--a dynamic we term the "Matthew effect"--which ultimately hinders further model improvement and leads to performance bottlenecks. To counteract this challenge, we introduce four efficient strategies from two perspectives: distribution-reshaping and trajectory-resampling, to achieve head-tail re-balancing during the exploration-and-learning self-improvement process. Extensive experiments on Qwen2-VL-7B-Instruct and InternVL2.5-4B models across visual reasoning tasks demonstrate that our methods consistently improve visual reasoning capabilities, outperforming vanilla self-improvement by 3.86 points on average.
中文标题/摘要
标题:通过头部-尾部再平衡对抗LVLM自我提升中的马太效应
自我提升已成为提升大型视觉-语言模型(LVLM)推理能力的主要范式,其中模型通过迭代探索和学习成功的轨迹。然而,在这一过程中,我们发现一个关键问题:模型在生成简单查询(即头部数据)的高质量轨迹方面表现出色,但在处理更复杂的查询(即尾部数据)方面却遇到困难。这导致了一种不平衡的优化,使模型优先关注简单的推理技能,而阻碍了其解决更复杂推理任务的能力。随着迭代次数的增加,这种不平衡变得越来越明显——我们将其称为“马太效应”——最终阻碍了模型的进一步改进并导致性能瓶颈。为了应对这一挑战,我们从两个角度引入了四种有效的策略:分布重塑和轨迹重采样,以在探索和学习自我提升过程中实现头部-尾部再平衡。在Qwen2-VL-7B-Instruct和InternVL2.5-4B模型的视觉推理任务上的广泛实验表明,我们的方法在视觉推理能力上始终优于传统的自我提升,平均高出3.86分。
Summary / 总结
The paper addresses the issue of the Matthew effect in self-improvement of large vision-language models (LVLMs), where models tend to excel at simple tasks (head data) while struggling with complex ones (tail data). To counteract this, the authors propose four strategies for distribution reshaping and trajectory resampling to achieve head-tail re-balancing. Experiments on Qwen2-VL-7B-Instruct and InternVL2.5-4B models show that these methods improve visual reasoning capabilities by an average of 3.86 points compared to vanilla self-improvement.
研究解决了大型视觉语言模型(LVLM)在自我改进过程中出现的不平衡问题,即模型在简单任务(头数据)上表现出色,但在复杂任务(尾数据)上却表现不佳。为了应对这一挑战,作者提出了两种视角下的四种策略:分布重塑和轨迹重采样,以实现头尾平衡。实验结果显示,这些方法在Qwen2-VL-7B-Instruct和InternVL2.5-4B模型上的视觉推理能力平均提高了3.86分,优于传统的自我改进方法。
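As a rough illustration of the trajectory-resampling idea, the sketch below caps how many successful trajectories each query may contribute to the next training round, so that easy (head) queries cannot crowd out hard (tail) ones. This is an assumed, simplified scheme and not one of the paper's four strategies; the function name `resample_trajectories` and the `per_query_cap` parameter are hypothetical.

```python
import random
from collections import defaultdict


def resample_trajectories(trajectories, per_query_cap, seed=0):
    """Keep at most `per_query_cap` successful trajectories per query.

    `trajectories` is a list of (query_id, trajectory) pairs that already
    passed a correctness check. Capping the count per query is one simple
    way to stop head queries from dominating the training set.
    """
    rng = random.Random(seed)
    by_query = defaultdict(list)
    for query_id, trajectory in trajectories:
        by_query[query_id].append(trajectory)

    balanced = []
    for query_id, items in by_query.items():
        if len(items) > per_query_cap:
            # Down-sample over-represented (head) queries.
            items = rng.sample(items, per_query_cap)
        balanced.extend((query_id, t) for t in items)
    return balanced


# Toy pool: an easy query with 8 successes and a hard query with 1.
pool = [("easy", f"e{i}") for i in range(8)] + [("hard", "h0")]
balanced = resample_trajectories(pool, per_query_cap=2)
print(balanced)
```

After resampling, the easy query contributes two trajectories and the hard query keeps its single one, which narrows the head-to-tail ratio from 8:1 to 2:1.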
Representation-Level Counterfactual Calibration for Debiased Zero-Shot Recognition
Authors: Pei Peng, MingKun Xie, Hang Hao, Tong Jin, ShengJun Huang
First: 2025-10-30T13:11:23+00:00 · Latest: 2025-10-30T13:11:23+00:00
Abstract
Object-context shortcuts remain a persistent challenge in vision-language models, undermining zero-shot reliability when test-time scenes differ from familiar training co-occurrences. We recast this issue as a causal inference problem and ask: Would the prediction remain if the object appeared in a different environment? To answer this at inference time, we estimate object and background expectations within CLIP's representation space, and synthesize counterfactual embeddings by recombining object features with diverse alternative contexts sampled from external datasets, batch neighbors, or text-derived descriptions. By estimating the Total Direct Effect and simulating intervention, we further subtract background-only activation, preserving beneficial object-context interactions while mitigating hallucinated scores. Without retraining or prompt design, our method substantially improves both worst-group and average accuracy on context-sensitive benchmarks, establishing a new zero-shot state of the art. Beyond performance, our framework provides a lightweight representation-level counterfactual approach, offering a practical causal avenue for debiased and reliable multimodal reasoning.
中文标题/摘要
标题:表示级反事实校准以实现无偏的零样本识别
物体-上下文捷径仍然是视觉-语言模型中的一个持续性挑战,当测试场景与熟悉的训练共现情况不同时,会削弱零样本识别的可靠性。我们将此问题重新定义为因果推理问题,并提出:如果物体出现在不同的环境中,预测是否仍然成立?为了在推理时回答这一问题,我们估计CLIP表示空间中的物体和背景期望,并将物体特征与从外部数据集、批内邻居或文本描述中采样的多样化替代上下文重新组合,合成反事实嵌入。通过估计总直接效应并模拟干预,我们进一步减去仅由背景引起的激活,保留有益的物体-上下文交互,同时减轻幻觉得分。无需重新训练或设计提示,我们的方法在上下文敏感基准测试中显著提高了最差组和平均准确率,建立了新的零样本最先进水平。除了性能,我们的框架提供了一种轻量级的表示级反事实方法,为无偏和可靠的多模态推理提供了实用的因果途径。
Summary / 总结
The paper addresses the challenge of object-context shortcuts in vision-language models, which can lead to unreliable zero-shot predictions. It proposes a method to estimate object and background expectations within CLIP's representation space and synthesizes counterfactual embeddings by recombining object features with diverse alternative contexts. This approach improves both worst-group and average accuracy on context-sensitive benchmarks, setting a new zero-shot state of the art without requiring retraining or prompt design. Beyond performance, the method provides a lightweight causal framework for debiased and reliable multimodal reasoning.
论文针对视觉-语言模型中存在的对象-上下文捷径问题,该问题可能导致零样本预测的可靠性下降。提出了一种方法,在CLIP的表示空间中估计对象和背景的期望,并通过重新组合对象特征与多样化的替代上下文来合成反事实嵌入。该方法在上下文敏感基准测试中显著提高了最差群体和平均准确率,建立了新的零样本状态的前沿。除了性能提升,该方法还提供了一种轻量级的反事实框架,为去偏见和可靠的多模态推理提供了实际的因果途径。
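The background-subtraction step can be pictured with a toy sketch. It is a deliberately simplified stand-in and not the paper's estimator: the 3-dimensional vectors replace real CLIP embeddings, and the function `debiased_scores` and the weight `alpha` are hypothetical. It shows only how removing the score that the background alone would earn can flip a context-driven prediction.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def debiased_scores(image_emb, background_emb, class_embs, alpha=1.0):
    """Subtract the background-only score from each class score (toy version)."""
    return [
        dot(image_emb, c) - alpha * dot(background_emb, c)
        for c in class_embs
    ]


# Toy embeddings (assumed values, not real CLIP features).
class_embs = [
    [1.0, 0.0, 0.0],  # class 0: "cow"
    [0.0, 1.0, 0.0],  # class 1: "camel", correlated with desert scenes
]
background = [0.0, 0.6, 0.0]  # desert-like context on its own
image = [0.5, 0.6, 0.0]       # a cow photographed in a desert

raw = [dot(image, c) for c in class_embs]
adjusted = debiased_scores(image, background, class_embs)

# Index of the winning class before and after the correction.
raw_pred = raw.index(max(raw))
adjusted_pred = adjusted.index(max(adjusted))
print(raw_pred, adjusted_pred)
```

In this toy case the raw scores favor the context-correlated class, while the adjusted scores favor the object class. A fixed `alpha` of 1.0 removes all background evidence, whereas the abstract describes preserving beneficial object-context interactions, so a real implementation would need a more careful estimate.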