arXiv Paper Digest

3D Aware Region Prompted Vision Language Model

Authors: An-Chieh Cheng, Yang Fu, Yukang Chen, Zhijian Liu, Xiaolong Li, Subhashree Radhakrishnan, Song Han, Yao Lu, Jan Kautz, Pavlo Molchanov, Hongxu Yin, Xiaolong Wang, Sifei Liu

First: 2025-09-16T17:59:06+00:00 · Latest: 2025-09-16T17:59:06+00:00

Comments: Project Website: https://www.anjiecheng.me/sr3d

Abstract

We present Spatial Region 3D (SR-3D) aware vision-language model that connects single-view 2D images and multi-view 3D data through a shared visual token space. SR-3D supports flexible region prompting, allowing users to annotate regions with bounding boxes, segmentation masks on any frame, or directly in 3D, without the need for exhaustive multi-frame labeling. We achieve this by enriching 2D visual features with 3D positional embeddings, which allows the 3D model to draw upon strong 2D priors for more accurate spatial reasoning across frames, even when objects of interest do not co-occur within the same view. Extensive experiments on both general 2D vision language and specialized 3D spatial benchmarks demonstrate that SR-3D achieves state-of-the-art performance, underscoring its effectiveness for unifying 2D and 3D representation space on scene understanding. Moreover, we observe applicability to in-the-wild videos without sensory 3D inputs or ground-truth 3D annotations, where SR-3D accurately infers spatial relationships and metric measurements.
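
To make the idea of enriching 2D visual features with 3D positional embeddings concrete, here is a minimal sketch; it is not the authors' implementation. It assumes each 2D patch feature comes with a 3D point (for example, back-projected from depth) and adds a sinusoidal code elementwise to the feature. The abstract does not specify the encoding or the fusion step, so `sinusoidal_embedding`, `positional_embedding_3d`, and `enrich_tokens` are hypothetical names chosen for illustration.

```python
import math


def sinusoidal_embedding(value, dim):
    """Encode one scalar coordinate as `dim` sin/cos features (dim must be even)."""
    half = dim // 2
    freqs = [1.0 / (10000 ** (i / half)) for i in range(half)]
    return [math.sin(value * f) for f in freqs] + [math.cos(value * f) for f in freqs]


def positional_embedding_3d(xyz, dim):
    """Concatenate per-axis sinusoidal codes for one 3D point."""
    if dim % 6 != 0:
        raise ValueError("dim must be divisible by 6 (3 axes, sin/cos pairs)")
    per_axis = dim // 3
    code = []
    for coord in xyz:
        code.extend(sinusoidal_embedding(coord, per_axis))
    return code


def enrich_tokens(patch_features, patch_xyz):
    """Add a 3D positional code to each 2D patch feature (assumed fusion: addition)."""
    dim = len(patch_features[0])
    return [
        [f + p for f, p in zip(feat, positional_embedding_3d(xyz, dim))]
        for feat, xyz in zip(patch_features, patch_xyz)
    ]


# Two toy patches with zero features and made-up 3D positions.
features = [[0.0] * 12, [0.0] * 12]
points = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.5)]
tokens = enrich_tokens(features, points)
```

Because the code depends only on the 3D point, patches from different frames that observe the same location receive the same positional signal, which is one plausible way cross-frame reasoning could benefit.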

Summary

SR-3D is a vision-language model that connects single-view 2D images and multi-view 3D data through a shared visual token space. It enriches 2D visual features with 3D positional embeddings, so region prompts (bounding boxes, segmentation masks on any frame, or direct 3D annotations) support spatial reasoning across frames without exhaustive multi-frame labeling, even when the objects of interest never appear in the same view. The authors report state-of-the-art results on both general 2D vision-language benchmarks and specialized 3D spatial benchmarks, and observe that the model infers spatial relationships and metric measurements on in-the-wild videos without sensory 3D inputs or ground-truth 3D annotations.

StyleSculptor: Zero-Shot Style-Controllable 3D Asset Generation with Texture-Geometry Dual Guidance

Authors: Zefan Qu, Zhenwei Wang, Haoyuan Wang, Ke Xu, Gerhard Hancke, Rynson W. H. Lau

Venue: SIGGRAPH Asia 2025

First: 2025-09-16T17:55:20+00:00 · Latest: 2025-09-16T17:55:20+00:00

Comments: SIGGRAPH Asia 2025 Conference Paper

Abstract

Creating 3D assets that follow the texture and geometry style of existing ones is often desirable or even inevitable in practical applications like video gaming and virtual reality. While impressive progress has been made in generating 3D objects from text or images, creating style-controllable 3D assets remains a complex and challenging problem. In this work, we propose StyleSculptor, a novel training-free approach for generating style-guided 3D assets from a content image and one or more style images. Unlike previous works, StyleSculptor achieves style-guided 3D generation in a zero-shot manner, enabling fine-grained 3D style control that captures the texture, geometry, or both styles of user-provided style images. At the core of StyleSculptor is a novel Style Disentangled Attention (SD-Attn) module, which establishes a dynamic interaction between the input content image and style image for style-guided 3D asset generation via a cross-3D attention mechanism, enabling stable feature fusion and effective style-guided generation. To alleviate semantic content leakage, we also introduce a style-disentangled feature selection strategy within the SD-Attn module, which leverages the variance of 3D feature patches to disentangle style- and content-significant channels, allowing selective feature injection within the attention framework. With SD-Attn, the network can dynamically compute texture-, geometry-, or both-guided features to steer the 3D generation process. Built upon this, we further propose the Style Guided Control (SGC) mechanism, which enables exclusive geometry- or texture-only stylization, as well as adjustable style intensity control. Extensive experiments demonstrate that StyleSculptor outperforms existing baseline methods in producing high-fidelity 3D assets.

Summary

StyleSculptor is a training-free method that generates style-guided 3D assets from a content image and one or more style images in a zero-shot manner. Its Style Disentangled Attention (SD-Attn) module fuses content and style features through cross-3D attention and uses the variance of 3D feature patches to separate style-significant from content-significant channels, which limits semantic content leakage. A Style Guided Control (SGC) mechanism adds geometry-only or texture-only stylization and adjustable style intensity. The authors report that StyleSculptor outperforms existing baselines in producing high-fidelity 3D assets.

Image Realness Assessment and Localization with Multimodal Features

Authors: Lovish Kaushik, Agnij Biswas, Somdyuti Paul

First: 2025-09-16T17:42:51+00:00 · Latest: 2025-09-16T17:42:51+00:00

Abstract

A reliable method of quantifying the perceptual realness of AI-generated images and identifying visually inconsistent regions is crucial for practical use of AI-generated images and for improving photorealism of generative AI via realness feedback during training. This paper introduces a framework that accomplishes both overall objective realness assessment and local inconsistency identification of AI-generated images using textual descriptions of visual inconsistencies generated by vision-language models trained on large datasets that serve as reliable substitutes for human annotations. Our results demonstrate that the proposed multimodal approach improves objective realness prediction performance and produces dense realness maps that effectively distinguish between realistic and unrealistic spatial regions.

ChartGaze: Enhancing Chart Understanding in LVLMs with Eye-Tracking Guided Attention Refinement

Authors: Ali Salamatian, Amirhossein Abaskohi, Wan-Cyuan Fan, Mir Rayat Imtiaz Hossain, Leonid Sigal, Giuseppe Carenini

Venue: EMNLP 2025

First: 2025-09-16T17:35:39+00:00 · Latest: 2025-09-16T17:35:39+00:00

Comments: EMNLP 2025

Abstract

Charts are a crucial visual medium for communicating and representing information. While Large Vision-Language Models (LVLMs) have made progress on chart question answering (CQA), the task remains challenging, particularly when models attend to irrelevant regions of the chart. In this work, we present ChartGaze, a new eye-tracking dataset that captures human gaze patterns during chart reasoning tasks. Through a systematic comparison of human and model attention, we find that LVLMs often diverge from human gaze, leading to reduced interpretability and accuracy. To address this, we propose a gaze-guided attention refinement that aligns image-text attention with human fixations. Our approach improves both answer accuracy and attention alignment, yielding gains of up to 2.56 percentage points across multiple models. These results demonstrate the promise of incorporating human gaze to enhance both the reasoning quality and interpretability of chart-focused LVLMs.
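
One way to picture aligning image-text attention with human fixations is as an auxiliary loss that penalizes divergence between the two distributions over image patches. The sketch below uses a KL divergence for that purpose; this choice is an assumption, since the abstract does not state the objective, and `normalize` and `gaze_alignment_loss` are hypothetical names.

```python
import math


def normalize(values, eps=1e-8):
    """Scale non-negative values so they sum to (approximately) 1."""
    total = sum(values) + eps
    return [v / total for v in values]


def gaze_alignment_loss(model_attention, gaze_fixations, eps=1e-8):
    """KL(gaze || attention) over image patches; lower means better aligned."""
    p = normalize(gaze_fixations)
    q = normalize(model_attention)
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


# Toy example with four patches: humans fixate mostly on patch 1.
gaze = [0.0, 3.0, 1.0, 0.0]
aligned = [0.05, 0.70, 0.20, 0.05]      # attention close to the gaze pattern
distracted = [0.60, 0.10, 0.10, 0.20]   # attention on irrelevant patches

loss_aligned = gaze_alignment_loss(aligned, gaze)
loss_distracted = gaze_alignment_loss(distracted, gaze)
```

In this toy case the attention pattern that concentrates on irrelevant patches receives the larger loss, so minimizing it would push attention toward the fixated regions.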

Summary

ChartGaze is a new eye-tracking dataset that captures human gaze patterns during chart reasoning tasks. Comparing human and model attention, the authors find that LVLMs often diverge from human gaze, which reduces both interpretability and accuracy. They propose a gaze-guided attention refinement that aligns image-text attention with human fixations; it improves answer accuracy and attention alignment, with gains of up to 2.56 percentage points across multiple models.

RadGame: An AI-Powered Platform for Radiology Education

Authors: Mohammed Baharoon, Siavash Raissi, John S. Jun, Thibault Heintz, Mahmoud Alabbad, Ali Alburkani, Sung Eun Kim, Kent Kleinschmidt, Abdulrahman O. Alhumaydhi, Mohannad Mohammed G. Alghamdi, Jeremy Francis Palacio, Mohammed Bukhaytan, Noah Michael Prudlo, Rithvik Akula, Brady Chrisler, Benjamin Galligos, Mohammed O. Almutairi, Mazeen Mohammed Alanazi, Nasser M. Alrashdi, Joel Jihwan Hwang, Sri Sai Dinesh Jaliparthi, Luke David Nelson, Nathaniel Nguyen, Sathvik Suryadevara, Steven Kim, Mohammed F. Mohammed, Yevgeniy R. Semenov, Kun-Hsing Yu, Abdulrhman Aljouie, Hassan AlOmaish, Adam Rodman, Pranav Rajpurkar

First: 2025-09-16T17:27:33+00:00 · Latest: 2025-09-16T17:27:33+00:00

Abstract

We introduce RadGame, an AI-powered gamified platform for radiology education that targets two core skills: localizing findings and generating reports. Traditional radiology training is based on passive exposure to cases or active practice with real-time input from supervising radiologists, limiting opportunities for immediate and scalable feedback. RadGame addresses this gap by combining gamification with large-scale public datasets and automated, AI-driven feedback that provides clear, structured guidance to human learners. In RadGame Localize, players draw bounding boxes around abnormalities, which are automatically compared to radiologist-drawn annotations from public datasets, and visual explanations are generated by vision-language models for user missed findings. In RadGame Report, players compose findings given a chest X-ray, patient age and indication, and receive structured AI feedback based on radiology report generation metrics, highlighting errors and omissions compared to a radiologist's written ground truth report from public datasets, producing a final performance and style score. In a prospective evaluation, participants using RadGame achieved a 68% improvement in localization accuracy compared to 17% with traditional passive methods and a 31% improvement in report-writing accuracy compared to 4% with traditional methods after seeing the same cases. RadGame highlights the potential of AI-driven gamification to deliver scalable, feedback-rich radiology training and reimagines the application of medical AI resources in education.
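
The automatic comparison between a player's bounding box and a radiologist-drawn annotation can be illustrated with intersection over union (IoU). This is a sketch under assumptions: the abstract does not name the matching metric, and the 0.5 threshold and the `is_hit` helper are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_hit(player_box, reference_box, threshold=0.5):
    """Count a player's box as correct if it overlaps the reference enough."""
    return iou(player_box, reference_box) >= threshold


reference = (10, 10, 50, 50)
good_guess = (12, 12, 52, 52)
missed = (60, 60, 90, 90)
results = [is_hit(good_guess, reference), is_hit(missed, reference)]
```

A reference box with no sufficiently overlapping player box would then count as a missed finding, which is the case the platform explains to the user.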

Summary

RadGame is an AI-powered gamified platform for radiology education that targets two skills: localizing findings and generating reports. It combines large-scale public datasets with automated AI feedback. In a prospective evaluation, participants using RadGame improved localization accuracy by 68%, versus 17% with traditional passive methods, and report-writing accuracy by 31%, versus 4% with traditional methods, after seeing the same cases.

Evaluating the Robustness of Open-Source Vision-Language Models to Domain Shift in Object Captioning

Authors: Federico Tavella, Amber Drinkwater, Angelo Cangelosi

First: 2025-06-24T12:45:09+00:00 · Latest: 2025-09-16T15:12:16+00:00

Abstract

Vision-Language Models (VLMs) have emerged as powerful tools for generating textual descriptions from visual data. While these models excel on web-scale datasets, their robustness to the domain shifts inherent in many real-world applications remains under-explored. This paper presents a systematic evaluation of VLM performance on a single-view object captioning task when faced with a controlled, physical domain shift. We compare captioning accuracy across two distinct object sets: a collection of multi-material, real-world tools and a set of single-material, 3D-printed items. The 3D-printed set introduces a significant domain shift in texture and material properties, challenging the models' generalization capabilities. Our quantitative results demonstrate that all tested VLMs show a marked performance degradation when describing the 3D-printed objects compared to the real-world tools. This underscores a critical limitation in the ability of current models to generalize beyond surface-level features and highlights the need for more robust architectures for real-world signal processing applications.

Summary

This paper evaluates how robust open-source vision-language models are to a controlled, physical domain shift in single-view object captioning. It compares captioning accuracy on two object sets: multi-material real-world tools and single-material 3D-printed items. All tested VLMs degrade markedly on the 3D-printed objects, which suggests that current models struggle to generalize beyond surface-level features and motivates more robust architectures.

HERO: Rethinking Visual Token Early Dropping in High-Resolution Large Vision-Language Models

Authors: Xu Li, Yuxuan Liang, Xiaolei Chen, Yi Zheng, Haotian Chen, Bin Li, Xiangyang Xue

First: 2025-09-16T13:22:08+00:00 · Latest: 2025-09-16T13:22:08+00:00

Abstract

By cropping high-resolution images into local tiles and encoding them independently, High-Resolution Large Vision-Language Models (HR-LVLMs) have demonstrated remarkable fine-grained visual understanding capabilities. However, this divide-and-conquer paradigm significantly increases the number of visual tokens, resulting in substantial computational and memory overhead. To better understand and address this challenge, we empirically investigate visual token utilization in HR-LVLMs and uncover three key findings: (1) the local tiles have varying importance, jointly determined by visual saliency and task relevance; (2) the CLS token in CLIP-based vision encoders exhibits a two-stage attention pattern across layers, with each stage attending to different types of visual tokens; (3) the visual tokens emphasized at different stages encode information at varying levels of granularity, playing complementary roles within LVLMs. Building on these insights, we propose HERO, a High-resolution visual token early dropping framework that integrates content-adaptive token budget allocation with function-aware token selection. By accurately estimating tile-level importance and selectively retaining visual tokens with complementary roles, HERO achieves superior efficiency-accuracy trade-offs across diverse benchmarks and model scales, all in a training-free manner. This study provides both empirical insights and practical solutions toward efficient inference in HR-LVLMs.
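
The two ingredients named above, content-adaptive budget allocation and token selection, can be sketched as follows. This is an illustration under assumptions and not HERO itself: tile importance and per-token scores are taken as given, the budget is split proportionally, and tokens are kept by top-k score. The names `allocate_budget` and `keep_top_tokens` are hypothetical.

```python
def allocate_budget(tile_importance, total_budget):
    """Split a token budget across tiles in proportion to positive importance scores."""
    total = sum(tile_importance)
    raw = [total_budget * w / total for w in tile_importance]
    budgets = [int(r) for r in raw]
    # Hand out the tokens lost to rounding, largest fractional part first.
    leftovers = total_budget - sum(budgets)
    order = sorted(range(len(raw)), key=lambda i: raw[i] - budgets[i], reverse=True)
    for i in order[:leftovers]:
        budgets[i] += 1
    return budgets


def keep_top_tokens(token_scores, k):
    """Return the indices of the k highest-scoring tokens, in original order."""
    ranked = sorted(range(len(token_scores)), key=lambda i: token_scores[i], reverse=True)
    return sorted(ranked[:k])


# Three tiles; the first is judged three times as important as the others.
budgets = allocate_budget([3, 1, 1], total_budget=8)
kept = keep_top_tokens([0.1, 0.9, 0.4, 0.7], k=2)
```

Returning kept indices in original order preserves the spatial ordering of the surviving tokens, which matters when they are passed on to the language model as a sequence.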

Summary

This study examines visual token utilization in High-Resolution Large Vision-Language Models (HR-LVLMs), where tiling high-resolution images inflates the token count and, with it, compute and memory cost. It reports three findings: local tiles vary in importance, determined jointly by visual saliency and task relevance; the CLS token in CLIP-based vision encoders shows a two-stage attention pattern across layers; and the tokens emphasized at each stage encode information at different granularities. Building on these, HERO combines content-adaptive token budget allocation with function-aware token selection to drop visual tokens early, improving the efficiency-accuracy trade-off without any training.

Perception Before Reasoning: Two-Stage Reinforcement Learning for Visual Reasoning in Vision-Language Models

Authors: Yan Chen, Long Li, Teng Xi, Long Zeng, Jingdong Wang

First: 2025-09-16T12:51:11+00:00 · Latest: 2025-09-16T12:51:11+00:00

Abstract

Reinforcement learning (RL) has proven highly effective in eliciting the reasoning capabilities of large language models (LLMs). Inspired by this success, recent studies have explored applying similar techniques to vision-language models (VLMs), aiming to enhance their reasoning performance. However, directly transplanting RL methods from LLMs to VLMs is suboptimal, as the tasks faced by VLMs are inherently more complex. Specifically, VLMs must first accurately perceive and understand visual inputs before reasoning can be effectively performed. To address this challenge, we propose a two-stage reinforcement learning framework designed to jointly enhance both the perceptual and reasoning capabilities of VLMs. To mitigate the vanishing advantage issue commonly observed in RL training, we first perform dataset-level sampling to selectively strengthen specific capabilities using distinct data sources. During training, the first stage focuses on improving the model's visual perception through coarse- and fine-grained visual understanding, while the second stage targets the enhancement of reasoning abilities. After the proposed two-stage reinforcement learning process, we obtain PeBR-R1, a vision-language model with significantly enhanced perceptual and reasoning capabilities. Experimental results on seven benchmark datasets demonstrate the effectiveness of our approach and validate the superior performance of PeBR-R1 across diverse visual reasoning tasks.
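
The vanishing advantage issue is easiest to see with group-normalized advantages, where each sampled response is scored relative to the other responses for the same prompt. The sketch below assumes that style of advantage, which the abstract does not confirm, and the filter `has_learning_signal` is a hypothetical illustration rather than the authors' dataset-level sampling procedure.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def has_learning_signal(rewards):
    """A prompt contributes gradient only if its sampled rewards differ."""
    return len(set(rewards)) > 1


# Every response to an easy prompt is correct: all advantages collapse to zero.
too_easy = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
# Mixed outcomes give positive and negative advantages to learn from.
mixed = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

When a prompt is uniformly solved or uniformly failed, its advantages are all zero and it contributes nothing to the update, which is why choosing data of suitable difficulty per capability can help.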

Summary

This paper proposes a two-stage reinforcement learning framework that improves both the perceptual and the reasoning capabilities of vision-language models (VLMs). The first stage strengthens visual perception through coarse- and fine-grained visual understanding; the second stage targets reasoning. To mitigate the vanishing advantage issue in RL training, the authors first apply dataset-level sampling that strengthens specific capabilities using distinct data sources. The resulting model, PeBR-R1, is reported to perform strongly on seven benchmark datasets covering diverse visual reasoning tasks.

ByDeWay: Boost Your multimodal LLM with DEpth prompting in a Training-Free Way

Authors: Rajarshi Roy, Devleena Das, Ankesh Banerjee, Arjya Bhattacharjee, Kousik Dasgupta, Subarna Tripathi

First: 2025-07-11T15:21:49+00:00 · Latest: 2025-09-16T12:31:06+00:00

Abstract

We introduce ByDeWay, a training-free framework designed to enhance the performance of Multimodal Large Language Models (MLLMs). ByDeWay uses a novel prompting strategy called Layered-Depth-Based Prompting (LDP), which improves spatial reasoning and grounding without modifying any model parameters. It segments the scene into closest, mid-range, and farthest layers using monocular depth estimation, then generates region-specific captions with a grounded vision-language model. These structured, depth-aware captions are appended to the image-question prompt, enriching it with spatial context. This guides MLLMs to produce more grounded and less hallucinated responses. Our method is lightweight, modular, and compatible with black-box MLLMs. Experiments on hallucination-sensitive (POPE) and reasoning-intensive (GQA) benchmarks show consistent improvements across multiple MLLMs, validating the effectiveness of depth-aware prompting in a zero-training setting.
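
The pipeline described above can be sketched in two steps: label pixels by depth layer, then append per-layer captions to the question. This is a minimal illustration under assumptions: the depth map is a plain nested list where smaller values mean closer, layers are split at depth terciles (the abstract does not say how the boundaries are chosen), and the captions are supplied by hand rather than by a captioning model. `split_depth_layers` and `build_prompt` are hypothetical names.

```python
LAYERS = ("closest", "mid-range", "farthest")


def split_depth_layers(depth_map):
    """Label each pixel with a depth layer using tercile thresholds."""
    flat = sorted(v for row in depth_map for v in row)
    low = flat[len(flat) // 3]
    high = flat[(2 * len(flat)) // 3]

    def label(value):
        if value < low:
            return "closest"
        if value < high:
            return "mid-range"
        return "farthest"

    return [[label(v) for v in row] for row in depth_map]


def build_prompt(question, layer_captions):
    """Append depth-aware captions to the question, one line per layer."""
    lines = [f"Question: {question}", "Scene description by depth:"]
    lines += [f"{name}: {layer_captions[name]}" for name in LAYERS]
    return "\n".join(lines)


labels = split_depth_layers([[1, 2, 3], [4, 5, 6]])
prompt = build_prompt(
    "What is behind the car?",
    {"closest": "a red car", "mid-range": "a tree", "farthest": "mountains"},
)
```

Since the only change is to the text prompt, the same string can be sent to any MLLM, including one reachable only through an API.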

Summary

ByDeWay is a training-free framework that improves Multimodal Large Language Models (MLLMs) through Layered-Depth-Based Prompting (LDP). It uses monocular depth estimation to segment the scene into closest, mid-range, and farthest layers, generates region-specific captions with a grounded vision-language model, and appends these depth-aware captions to the image-question prompt. Because no model parameters change, the method works with black-box MLLMs. Experiments on the hallucination-sensitive POPE benchmark and the reasoning-intensive GQA benchmark show consistent improvements across multiple MLLMs.

Zero-shot Hierarchical Plant Segmentation via Foundation Segmentation Models and Text-to-image Attention

Authors: Junhao Xing, Ryohei Miyakawa, Yang Yang, Xinpeng Liu, Risa Shinoda, Hiroaki Santo, Yosuke Toda, Fumio Okura

Venue: WACV 2026

First: 2025-09-11T02:53:58+00:00 · Latest: 2025-09-16T10:47:41+00:00

Comments: WACV 2026 Accepted

Abstract

Foundation segmentation models achieve reasonable leaf instance extraction from top-view crop images without training (i.e., zero-shot). However, segmenting entire plant individuals with each consisting of multiple overlapping leaves remains challenging. This problem is referred to as a hierarchical segmentation task, typically requiring annotated training datasets, which are often species-specific and require notable human labor. To address this, we introduce ZeroPlantSeg, a zero-shot segmentation for rosette-shaped plant individuals from top-view images. We integrate a foundation segmentation model, extracting leaf instances, and a vision-language model, reasoning about plants' structures to extract plant individuals without additional training. Evaluations on datasets with multiple plant species, growth stages, and shooting environments demonstrate that our method surpasses existing zero-shot methods and achieves better cross-domain performance than supervised methods. Implementations are available at https://github.com/JunhaoXing/ZeroPlantSeg.

Beyond Averages: Open-Vocabulary 3D Scene Understanding with Gaussian Splatting and Bag of Embeddings

Authors: Abdalla Arafa, Didier Stricker

First: 2025-09-16T10:39:37+00:00 · Latest: 2025-09-16T10:39:37+00:00

Abstract

Novel view synthesis has seen significant advancements with 3D Gaussian Splatting (3DGS), enabling real-time photorealistic rendering. However, the inherent fuzziness of Gaussian Splatting presents challenges for 3D scene understanding, restricting its broader applications in AR/VR and robotics. While recent works attempt to learn semantics via 2D foundation model distillation, they inherit fundamental limitations: alpha blending averages semantics across objects, making 3D-level understanding impossible. We propose a paradigm-shifting alternative that bypasses differentiable rendering for semantics entirely. Our key insight is to leverage predecomposed object-level Gaussians and represent each object through multiview CLIP feature aggregation, creating comprehensive "bags of embeddings" that holistically describe objects. This allows: (1) accurate open-vocabulary object retrieval by comparing text queries to object-level (not Gaussian-level) embeddings, and (2) seamless task adaptation: propagating object IDs to pixels for 2D segmentation or to Gaussians for 3D extraction. Experiments demonstrate that our method effectively overcomes the challenges of 3D open-vocabulary object extraction while remaining comparable to state-of-the-art performance in 2D open-vocabulary segmentation, ensuring minimal compromise.
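
Object-level retrieval against a bag of embeddings can be sketched as follows. The example assumes each object keeps one embedding per view and scores a text query by its best-matching view (maximum cosine similarity); the abstract does not specify the scoring rule, and the toy 2D vectors stand in for real CLIP features. `score_object` and `retrieve` are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def score_object(text_embedding, bag):
    """Score an object by its best-matching view embedding (assumed: max)."""
    return max(cosine(text_embedding, view) for view in bag)


def retrieve(text_embedding, bags):
    """Return the ID of the object whose bag best matches the text query."""
    return max(bags, key=lambda obj_id: score_object(text_embedding, bags[obj_id]))


# Each object keeps every per-view embedding instead of a single average.
bags = {
    "chair": [[1.0, 0.0], [0.9, 0.1]],
    "lamp": [[0.0, 1.0], [0.1, 0.9]],
}
best = retrieve([1.0, 0.1], bags)
```

Once an object ID is retrieved, it can be mapped to that object's pixels or Gaussians, which is how a single retrieval step can serve both 2D segmentation and 3D extraction.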

Summary

The paper argues that alpha blending in 3D Gaussian Splatting averages semantics across objects, which limits 3D scene understanding. Instead of learning semantics through differentiable rendering, the method starts from predecomposed object-level Gaussians and aggregates multiview CLIP features into a "bag of embeddings" per object. Text queries are compared against object-level embeddings for open-vocabulary retrieval, and object IDs are propagated to pixels for 2D segmentation or to Gaussians for 3D extraction. The authors report effective 3D open-vocabulary object extraction with 2D segmentation performance comparable to the state of the art.

All Roads Lead to Rome: Graph-Based Confidence Estimation for Large Language Model Reasoning

Authors: Caiqi Zhang, Chang Shu, Ehsan Shareghi, Nigel Collier

Venue: EMNLP 2025

First: 2025-09-16T10:02:52+00:00 · Latest: 2025-09-16T10:02:52+00:00

Comments: EMNLP 2025 Main

Abstract

Confidence estimation is essential for the reliable deployment of large language models (LLMs). Existing methods are primarily designed for factual QA tasks and often fail to generalize to reasoning tasks. To address this gap, we propose a set of training-free, graph-based confidence estimation methods tailored to reasoning tasks. Our approach models reasoning paths as directed graphs and estimates confidence by exploiting graph properties such as centrality, path convergence, and path weighting. Experiments with two LLMs on three reasoning datasets demonstrate improved confidence estimation and enhanced performance on two downstream tasks.
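
A small sketch can show what modeling reasoning paths as a directed graph looks like in practice. It assumes several sampled reasoning chains per question, each a list of step labels ending in an answer, and computes two simple signals: the share of paths converging on each answer, and the number of distinct steps leading into an answer node. These are illustrative stand-ins for the path convergence and centrality measures the abstract mentions, not the authors' exact definitions, and all function names are hypothetical.

```python
from collections import defaultdict


def build_graph(paths):
    """Merge sampled reasoning paths into one directed graph with edge counts."""
    edges = defaultdict(int)
    for path in paths:
        for src, dst in zip(path, path[1:]):
            edges[(src, dst)] += 1
    return dict(edges)


def convergence_confidence(paths):
    """Confidence of each answer = share of sampled paths that end at it."""
    counts = defaultdict(int)
    for path in paths:
        counts[path[-1]] += 1
    return {answer: n / len(paths) for answer, n in counts.items()}


def answer_in_degree(edges, answer):
    """Number of distinct reasoning steps that lead directly to an answer node."""
    return len({src for (src, dst) in edges if dst == answer})


# Three sampled chains for one question; two different routes reach "x=4".
paths = [
    ["question", "factor", "x=4"],
    ["question", "substitute", "x=4"],
    ["question", "guess", "x=5"],
]
graph = build_graph(paths)
confidence = convergence_confidence(paths)
support = answer_in_degree(graph, "x=4")
```

The in-degree signal is what distinguishes a graph view from plain majority voting: an answer reached through several different intermediate steps has more independent support than one reached repeatedly by the same route.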

Summary

The paper addresses confidence estimation for large language models on reasoning tasks, where existing methods designed for factual QA often fail to generalize. It proposes training-free, graph-based methods that model reasoning paths as directed graphs and estimate confidence from graph properties such as centrality, path convergence, and path weighting. Experiments with two LLMs on three reasoning datasets show improved confidence estimation and better performance on two downstream tasks.

Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation

Authors: Luca Barsellotti, Lorenzo Bianchi, Nicola Messina, Fabio Carrara, Marcella Cornia, Lorenzo Baraldi, Fabrizio Falchi, Rita Cucchiara

Venue: ICCV 2025

First: 2024-11-28T19:00:03+00:00 · Latest: 2025-09-16T10:01:27+00:00

Comments: ICCV 2025

Abstract

Open-Vocabulary Segmentation (OVS) aims at segmenting images from free-form textual concepts without predefined training classes. While existing vision-language models such as CLIP can generate segmentation masks by leveraging coarse spatial information from Vision Transformers, they face challenges in spatial localization due to their global alignment of image and text features. Conversely, self-supervised visual models like DINO excel in fine-grained visual encoding but lack integration with language. To bridge this gap, we present Talk2DINO, a novel hybrid approach that combines the spatial accuracy of DINOv2 with the language understanding of CLIP. Our approach aligns the textual embeddings of CLIP to the patch-level features of DINOv2 through a learned mapping function without the need to fine-tune the underlying backbones. At training time, we exploit the attention maps of DINOv2 to selectively align local visual patches with textual embeddings. We show that the powerful semantic and localization abilities of Talk2DINO can enhance the segmentation process, resulting in more natural and less noisy segmentations, and that our approach can also effectively distinguish foreground objects from the background. Experimental results demonstrate that Talk2DINO achieves state-of-the-art performance across several unsupervised OVS benchmarks. Source code and models are publicly available at: https://lorebianchi98.github.io/Talk2DINO/.
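
At inference time, the alignment described above amounts to projecting each class's text embedding into the patch feature space and labeling every patch with its most similar class. The sketch below assumes a linear mapping given as a fixed matrix and toy 2D vectors in place of real CLIP and DINOv2 features; the actual mapping function and its training are not specified in the abstract, and `project` and `segment` are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def project(text_embedding, weight):
    """Map a text embedding into patch space with a linear map (assumed form)."""
    return [sum(w * x for w, x in zip(row, text_embedding)) for row in weight]


def segment(patch_features, class_text_embeddings, weight):
    """Assign each patch the class whose projected text embedding is most similar."""
    projected = {name: project(emb, weight) for name, emb in class_text_embeddings.items()}
    return [
        max(projected, key=lambda name: cosine(patch, projected[name]))
        for patch in patch_features
    ]


# Stand-in for a learned mapping: here it simply swaps the two dimensions.
weight = [[0.0, 1.0], [1.0, 0.0]]
classes = {"cat": [1.0, 0.0], "sky": [0.0, 1.0]}
patches = [[0.1, 0.9], [0.8, 0.2]]
labels = segment(patches, classes, weight)
```

Only the mapping is trainable in this picture; the patch features and text embeddings come from frozen backbones, which matches the abstract's point that neither backbone needs fine-tuning.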

中文标题/摘要

标题：与DINO对话：通过语言连接自我监督视觉骨干以实现开放词汇分割

开放词汇分割（OVS）旨在根据自由形式的文本概念分割图像，而无需预定义的训练类别。虽然现有的视觉-语言模型如CLIP可以通过利用视觉变换器的粗略空间信息生成分割掩码，但由于其图像和文本特征的全局对齐，它们在空间定位方面面临挑战。相反，自我监督视觉模型如DINO在细粒度视觉编码方面表现出色，但缺乏与语言的整合。为了解决这一差距，我们提出了一种名为Talk2DINO的新型混合方法，该方法结合了DINOv2的空间准确性与CLIP的语言理解能力。我们的方法通过一个学习映射函数将CLIP的文本嵌入与DINOv2的块级特征对齐，而无需微调底层骨干。在训练时，我们利用DINOv2的注意力图来选择性地将局部视觉块与文本嵌入对齐。我们展示了Talk2DINO的强大语义和定位能力可以增强分割过程，从而产生更自然且更少噪声的分割结果，并且我们的方法还可以有效地区分前景对象与背景。实验结果表明，Talk2DINO在多个无监督OVS基准测试中达到了最先进的性能。源代码和模型可在以下网址获取：https://lorebianchi98.github.io/Talk2DINO/

Summary / 总结

The research aims to improve open-vocabulary segmentation by integrating the spatial accuracy of DINOv2 with the language understanding of CLIP. The method, Talk2DINO, aligns CLIP’s textual embeddings with DINOv2’s patch-level features using a learned mapping function. Experimental results show that Talk2DINO outperforms existing methods, producing more natural and less noisy segmentations and effectively distinguishing foreground objects from the background, achieving state-of-the-art performance on several unsupervised OVS benchmarks.

论文通过将DINOv2的空间准确性与CLIP的语言理解相结合，解决了开放词汇分割（OVS）的挑战。提出的Talk2DINO方法使用学习的映射函数将CLIP的文本嵌入与DINOv2的块级特征对齐。在训练过程中，DINOv2的注意力图引导局部视觉块与文本嵌入的对齐。实验结果表明，Talk2DINO在多个无监督OVS基准上优于现有方法，生成了更自然且更少噪声的分割结果。该方法还能够有效地区分前景对象和背景。源代码和模型已公开可用。
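The alignment step described above can be sketched with toy vectors: a mapping projects each text embedding into the patch-feature space, and every patch takes the label whose projected embedding has the highest cosine similarity. The linear map `W`, the tiny dimensions, and the argmax labeling rule are illustrative assumptions; the abstract says only that the mapping is learned and does not describe its architecture or the inference procedure.

```python
import math

def matvec(W, v):
    """Apply a linear map (list of rows) to a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def cosine(a, b):
    """Cosine similarity, returning 0.0 for zero-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def segment(patches, text_embs, W):
    """Label each patch with the best-matching projected text embedding."""
    projected = {name: matvec(W, emb) for name, emb in text_embs.items()}
    labels = []
    for patch in patches:
        best = max(projected, key=lambda name: cosine(patch, projected[name]))
        labels.append(best)
    return labels

# Hypothetical stand-ins: 3-d "text" embeddings mapped into a 2-d "patch" space.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
text_embs = {"cat": [1.0, 0.0, 0.5], "grass": [0.0, 1.0, 0.5]}
patches = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.0]]
labels = segment(patches, text_embs, W)
```

Reshaping the resulting per-patch labels back onto the patch grid would give a coarse segmentation map.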

FastDriveVLA: Efficient End-to-End Driving via Plug-and-Play Reconstruction-based Token Pruning

Authors: Jiajun Cao, Qizhe Zhang, Peidong Jia, Xuhui Zhao, Bo Lan, Xiaoan Zhang, Zhuo Li, Xiaobao Wei, Sixiang Chen, Liyun Li, Xianming Liu, Ming Lu, Yang Wang, Shanghang Zhang

First: 2025-07-31T07:55:56+00:00 · Latest: 2025-09-16T09:59:46+00:00

Comments: 9 pages, 5 figures

Abs · PDF · Code1 · Code2

Abstract

Vision-Language-Action (VLA) models have demonstrated significant potential in complex scene understanding and action reasoning, leading to their increasing adoption in end-to-end autonomous driving systems. However, the long visual tokens of VLA models greatly increase computational costs. Current visual token pruning methods in Vision-Language Models (VLM) rely on either visual token similarity or visual-text attention, but both have shown poor performance in autonomous driving scenarios. Given that human drivers concentrate on relevant foreground areas while driving, we assert that retaining visual tokens containing this foreground information is essential for effective decision-making. Inspired by this, we propose FastDriveVLA, a novel reconstruction-based vision token pruning framework designed specifically for autonomous driving. FastDriveVLA includes a plug-and-play visual token pruner called ReconPruner, which prioritizes foreground information through MAE-style pixel reconstruction. A novel adversarial foreground-background reconstruction strategy is designed to train ReconPruner for the visual encoder of VLA models. Once trained, ReconPruner can be seamlessly applied to different VLA models with the same visual encoder without retraining. To train ReconPruner, we also introduce a large-scale dataset called nuScenes-FG, consisting of 241K image-mask pairs with annotated foreground regions. Our approach achieves state-of-the-art results on the nuScenes open-loop planning benchmark across different pruning ratios.

中文标题/摘要

标题：FastDriveVLA：通过即插即用的基于重建的令牌剪枝实现高效端到端驾驶

视觉-语言-动作（VLA）模型在复杂场景理解和动作推理方面展示了显著的潜力，推动了其在端到端自动驾驶系统中的广泛应用。然而，VLA模型的长视觉令牌大大增加了计算成本。当前视觉令牌剪枝方法依赖于视觉令牌相似性或视觉-文本注意力，但在自动驾驶场景中表现不佳。鉴于人类驾驶员在驾驶时集中于相关前景区域，我们断言保留包含这些前景信息的视觉令牌对于有效决策至关重要。受此启发，我们提出FastDriveVLA，这是一种专为自动驾驶设计的新型基于重建的视觉令牌剪枝框架。FastDriveVLA包括一种名为ReconPruner的插件式视觉令牌剪枝器，它通过MAE风格的像素重建优先处理前景信息。设计了一种新颖的对抗式前景-背景重建策略来训练ReconPruner以适应VLA模型的视觉编码器。训练完成后，ReconPruner可以无缝应用于具有相同视觉编码器的不同VLA模型而无需重新训练。为了训练ReconPruner，我们还引入了一个名为nuScenes-FG的大规模数据集，包含241K带有标注前景区域的图像-掩码对。我们的方法在nuScenes开环规划基准测试中实现了不同剪枝比例下的最佳结果。

Summary / 总结

FastDriveVLA is a novel reconstruction-based vision token pruning framework for autonomous driving, which retains foreground information through a plug-and-play visual token pruner called ReconPruner. The method uses a novel adversarial foreground-background reconstruction strategy to train ReconPruner on a large-scale dataset, nuScenes-FG, and can be applied to different VLA models without retraining. It achieves state-of-the-art results on the nuScenes open-loop planning benchmark with various pruning ratios.

FastDriveVLA 是一种面向自动驾驶的新型基于重建的视觉令牌剪枝框架，旨在解决 VLA 模型的高计算成本问题。它使用了一个名为 ReconPruner 的即插即用剪枝器，通过像素重建优先保留前景信息。该方法在 nuScenes 开环规划基准测试中，在不同剪枝比例下均取得了最先进的结果。
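Once a pruner has assigned each visual token a saliency score, the pruning step itself reduces to a top-k selection. The sketch below shows only that selection step, assuming the scores are already given. How ReconPruner produces them (MAE-style reconstruction with adversarial foreground-background training) is not reproduced here, and the function name and `keep_ratio` parameter are hypothetical.

```python
def prune_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring tokens, preserving their original order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    k = max(1, round(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])  # restore positional order for the language model
    return [tokens[i] for i in keep], keep

# Toy tokens named after what they depict; scores are made-up saliency values.
tokens = ["sky", "car", "road", "pedestrian", "tree", "sign"]
scores = [0.05, 0.9, 0.4, 0.95, 0.1, 0.7]
kept, kept_idx = prune_tokens(tokens, scores, keep_ratio=0.5)
```

Keeping the surviving tokens in their original order means any positional information attached to them stays consistent after pruning.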

Cross-Layer Vision Smoothing: Enhancing Visual Understanding via Sustained Focus on Key Objects in Large Vision-Language Models

Authors: Jianfei Zhao, Feng Zhang, Xin Sun, Lingxing Kong, Zhixing Tan, Chong Feng

First: 2025-09-16T09:54:01+00:00 · Latest: 2025-09-16T09:54:01+00:00

Abs · PDF · Code1 · Code2

Abstract

Large Vision-Language Models (LVLMs) can accurately locate key objects in images, yet their attention to these objects tends to be very brief. Motivated by the hypothesis that sustained focus on key objects can improve LVLMs' visual capabilities, we propose Cross-Layer Vision Smoothing (CLVS). The core idea of CLVS is to incorporate a vision memory that smooths the attention distribution across layers. Specifically, we initialize this vision memory with position-unbiased visual attention in the first layer. In subsequent layers, the model's visual attention jointly considers the vision memory from previous layers, while the memory is updated iteratively, thereby maintaining smooth attention on key objects. Given that visual understanding primarily occurs in the early and middle layers of the model, we use uncertainty as an indicator of completed visual understanding and terminate the smoothing process accordingly. Experiments on four benchmarks across three LVLMs confirm the effectiveness and generalizability of our method. CLVS achieves state-of-the-art performance on a variety of visual understanding tasks, with particularly significant improvements in relation and attribute understanding.

中文标题/摘要

标题：跨层视觉平滑：通过持续关注关键对象提升大型视觉语言模型的视觉理解能力

大型视觉语言模型（LVLMs）能够准确地定位图像中的关键对象，但它们对这些对象的关注时间非常短暂。受持续关注关键对象可以提高LVLMs视觉能力这一假设的启发，我们提出了跨层视觉平滑（CLVS）。CLVS的核心思想是引入一个视觉记忆，以平滑各层之间的注意力分布。具体而言，我们用第一层的位置无偏视觉注意力初始化这个视觉记忆。在后续层中，模型的视觉注意力会同时考虑来自前一层的记忆，并且记忆会迭代更新，从而保持对关键对象的平滑关注。由于视觉理解主要发生在模型的早期和中期层，我们使用不确定性作为视觉理解完成的指标，并据此终止平滑过程。在三个LVLMs的四个基准测试上进行的实验验证了我们方法的有效性和普适性。CLVS在多种视觉理解任务上取得了最先进的性能，特别是在关系和属性理解方面取得了显著的改进。

Summary / 总结

The research aims to enhance the visual understanding of large Vision-Language Models (LVLMs) by proposing Cross-Layer Vision Smoothing (CLVS), which involves maintaining sustained focus on key objects through a vision memory that smooths attention across layers. The method initializes the vision memory with position-unbiased visual attention in the first layer and updates it iteratively in subsequent layers to maintain smooth attention on key objects. Experiments on four benchmarks across three LVLMs show that CLVS improves visual understanding, especially in relation and attribute understanding, achieving state-of-the-art performance.

研究旨在通过提出跨层视觉平滑（CLVS）方法来增强大型视觉-语言模型（LVLM）的视觉理解能力，该方法通过在各层间平滑注意力分布来保持对关键对象的持续关注。实验表明，CLVS提高了LVLM在各种视觉理解任务中的性能，特别是在关系和属性理解方面取得了显著改进。
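A minimal sketch of the memory-based smoothing described above: the memory starts as the first layer's attention over visual tokens, each later layer's attention is blended with it, and the blended result becomes the new memory. The blending weight `beta` and the specific update rule are assumptions. The position-unbiased initialization and the uncertainty-based stopping criterion mentioned in the abstract are omitted.

```python
def normalize(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

def smooth_across_layers(layer_attn, beta=0.5):
    """Blend each layer's visual attention with a running vision memory."""
    memory = normalize(layer_attn[0])
    smoothed = [memory]
    for attn in layer_attn[1:]:
        current = normalize(attn)
        mixed = normalize(
            [(1 - beta) * a + beta * m for a, m in zip(current, memory)]
        )
        smoothed.append(mixed)
        memory = mixed  # iterative memory update (assumed rule)
    return smoothed

# Toy attention over three visual tokens; the first token is the "key object".
layer_attn = [
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
]
smoothed = smooth_across_layers(layer_attn, beta=0.5)
```

In this toy run the key token's attention in the second layer rises from 0.1 to 0.4, which is the "sustained focus" effect in miniature.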

Cross-Image Contrastive Decoding: Precise, Lossless Suppression of Language Priors in Large Vision-Language Models

Authors: Jianfei Zhao, Feng Zhang, Xin Sun, Lingxing Kong, Zhixing Tan, Chong Feng

First: 2025-05-15T18:16:56+00:00 · Latest: 2025-09-16T09:49:38+00:00

Comments: Under Review

Abs · PDF · Code1 · Code2

Abstract

Over-reliance on language priors is a major cause of hallucinations in Large Vision-Language Models (LVLMs), often leading to outputs that are linguistically plausible but visually inconsistent. Recent studies have explored contrastive decoding as a training-free solution. However, these methods typically construct contrastive visual inputs by perturbing the original image, resulting in distorted contrastive distributions, incomplete contrastive signals, and excessive suppression of language priors. Motivated by the observation that language priors tend to remain consistent across different images, we propose Cross-Image Contrastive Decoding (CICD), a simple yet effective training-free method that uses unrelated images as contrastive visual inputs. To address the issue of over-suppressing language priors, which can negatively affect the quality of generated responses, we further introduce a dynamic selection mechanism based on the cross-image differences in model behavior. By selectively suppressing language priors, our method reduces hallucinations without compromising the model's performance. Extensive experiments across multiple benchmarks and LVLMs confirm the effectiveness and generalizability of CICD, particularly in image captioning, where language priors are especially dominant.

中文标题/摘要

标题：跨图像对比解码：在大型视觉语言模型中精确、无损抑制语言先验

对语言先验的过度依赖是大型视觉语言模型（LVLMs）产生幻觉的主要原因，常导致语义上合理但视觉上不一致的输出。近期研究探索了对比解码作为无训练解决方案的可能性。然而，这些方法通常通过扰动原始图像来构建对比视觉输入，导致对比分布失真、对比信号不完整以及语言先验过度抑制。鉴于观察到语言先验在不同图像中保持一致，我们提出了一种简单而有效的无训练方法——跨图像对比解码（CICD），该方法使用不相关图像作为对比视觉输入。为了解决过度抑制语言先验的问题，该方法进一步引入了基于模型行为跨图像差异的动态选择机制。通过选择性抑制语言先验，我们的方法减少了幻觉现象而不影响模型性能。在多个基准测试和LVLMs上的广泛实验验证了CICD的有效性和普适性，特别是在图像描述任务中，语言先验尤为突出。

Summary / 总结

The research aims to address hallucinations in Large Vision-Language Models (LVLMs) caused by over-reliance on language priors. The proposed Cross-Image Contrastive Decoding (CICD) method uses unrelated images as contrastive visual inputs, avoiding the distortion of contrastive distributions. It introduces a dynamic selection mechanism based on cross-image differences in model behavior to selectively suppress language priors, reducing hallucinations without negatively impacting model performance. Experiments across various benchmarks show CICD's effectiveness and generalizability, especially in image captioning tasks.

论文针对大型视觉-语言模型（LVLM）因过度依赖语言先验而导致的幻觉问题，提出了一种名为跨图像对比解码（CICD）的无训练方法，该方法使用不相关的图像作为对比视觉输入来抑制语言先验。该方法引入了一种基于模型行为在不同图像之间差异的动态选择机制，以选择性地抑制语言先验，从而减少幻觉同时保持模型性能。实验表明，CICD在各种LVLM中有效减少了幻觉，尤其是在图像描述任务中表现尤为明显。
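The core contrast can be illustrated with the commonly used contrastive-decoding form, in which next-token log-probabilities conditioned on an unrelated image are subtracted from those conditioned on the real image. Whether CICD uses exactly this formula is an assumption, as are the `alpha` weight and the toy logits. The dynamic selection mechanism from the abstract is not modeled here.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    lse = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - lse for x in logits]

def contrastive_scores(logits_target, logits_unrelated, alpha=1.0):
    """Score tokens as (1 + alpha) * log p(target) - alpha * log p(unrelated)."""
    lp_target = log_softmax(logits_target)
    lp_unrelated = log_softmax(logits_unrelated)
    return [(1 + alpha) * t - alpha * u for t, u in zip(lp_target, lp_unrelated)]

# Toy vocabulary: "dog" is favored by the language prior under both images,
# while "frisbee" is supported only by the real image.
vocab = ["dog", "frisbee", "grass"]
logits_target = [2.0, 1.8, 0.5]
logits_unrelated = [2.0, 0.2, 0.4]

scores = contrastive_scores(logits_target, logits_unrelated, alpha=1.0)
greedy = vocab[max(range(len(vocab)), key=logits_target.__getitem__)]
contrastive = vocab[max(range(len(vocab)), key=scores.__getitem__)]
```

Plain greedy decoding picks the prior-driven token, while the contrastive score prefers the token that only the real image supports.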

Adversarial Prompt Distillation for Vision-Language Models

Authors: Lin Luo, Xin Wang, Bojia Zi, Shihao Zhao, Xingjun Ma, Yu-Gang Jiang

First: 2024-11-22T03:02:13+00:00 · Latest: 2025-09-16T09:06:19+00:00

Comments: This work has been submitted to the IEEE for possible publication

Abs · PDF · Code1 · Code2

Abstract

Large pre-trained Vision-Language Models (VLMs) such as Contrastive Language-Image Pre-training (CLIP) have been shown to be susceptible to adversarial attacks, raising concerns about their deployment in safety-critical applications like autonomous driving and medical diagnosis. One promising approach for robustifying pre-trained VLMs is Adversarial Prompt Tuning (APT), which applies adversarial training during the process of prompt tuning. However, existing APT methods are mostly single-modal methods that design prompt(s) for only the visual or textual modality, limiting their effectiveness in either robustness or clean accuracy. In this work, we propose Adversarial Prompt Distillation (APD), a bimodal knowledge distillation framework that enhances APT by integrating it with multi-modal knowledge transfer. APD optimizes prompts for both visual and textual modalities while distilling knowledge from a clean pre-trained teacher CLIP model. Extensive experiments on multiple benchmark datasets demonstrate the superiority of our APD method over the current state-of-the-art APT methods in terms of both adversarial robustness and clean accuracy. The effectiveness of APD also validates the possibility of using a non-robust teacher to improve the generalization and robustness of fine-tuned VLMs.

中文标题/摘要

标题：视觉语言模型的对抗提示蒸馏

大型预训练视觉语言模型（VLMs）如对比语言-图像预训练（CLIP）已被证明对对抗攻击敏感，这在自动驾驶和医疗诊断等关键安全应用中引发了担忧。一种有希望的方法是对抗提示调优（APT），它在提示调优过程中应用对抗训练。然而，现有的APT方法主要是单模态方法，仅针对视觉或文本模态设计提示，这限制了它们在鲁棒性或干净准确度方面的效果。在本文中，我们提出了一种双模态知识蒸馏框架——对抗提示蒸馏（APD），该框架通过将APT与多模态知识转移相结合来增强APT。APD同时优化视觉和文本模态的提示，并从干净的预训练教师CLIP模型中蒸馏知识。在多个基准数据集上的广泛实验表明，与现有的最先进的APT方法相比，我们的APD方法在对抗鲁棒性和干净准确度方面都具有优势。APD的有效性也验证了使用非鲁棒教师来提高微调后的VLMs的泛化能力和鲁棒性的可能性。

Summary / 总结

This paper addresses the vulnerability of large pre-trained Vision-Language Models (VLMs) like CLIP to adversarial attacks, which is a concern for safety-critical applications. It proposes Adversarial Prompt Distillation (APD), a bimodal knowledge distillation framework that enhances Adversarial Prompt Tuning (APT) by integrating multi-modal knowledge transfer. APD optimizes prompts for both visual and textual modalities while distilling knowledge from a clean pre-trained teacher model. Experiments show that APD outperforms existing APT methods in both adversarial robustness and clean accuracy.

本文针对大型预训练视觉-语言模型（VLMs）如CLIP在对抗攻击下的脆弱性，这在安全关键应用中是一个担忧。文中提出了一种双模态知识蒸馏框架Adversarial Prompt Distillation (APD)，它通过多模态知识转移来增强Adversarial Prompt Tuning (APT)。APD同时优化视觉和文本模态的提示，并从干净的预训练教师模型中蒸馏知识。实验表明，APD在对抗鲁棒性和干净准确性方面均优于现有APT方法。
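One plausible form of the distillation term is a KL divergence that pulls the student's predictions on an adversarial input toward the clean teacher's predictions on the clean input. The direction of the KL term, the temperature, and the toy logits are assumptions; the abstract does not state the exact loss.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical class logits: teacher sees the clean image, student the attacked one.
teacher_clean = softmax([3.0, 0.5, 0.2])
student_adv = softmax([1.0, 1.2, 0.3])

distill_loss = kl_divergence(teacher_clean, student_adv)
self_loss = kl_divergence(teacher_clean, teacher_clean)
```

In a training loop this term would be minimized with respect to the student's visual and textual prompts, alongside whatever adversarial objective generates the attacked input.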

Can Generalist Vision Language Models (VLMs) Rival Specialist Medical VLMs? Benchmarking and Strategic Insights

Authors: Yuan Zhong, Ruinan Jin, Qi Dou, Xiaoxiao Li

First: 2025-06-19T07:59:00+00:00 · Latest: 2025-09-16T07:44:30+00:00

Comments: version 2

Abs · PDF · Code1 · Code2

Abstract

Vision Language Models (VLMs) have shown promise in automating image diagnosis and interpretation in clinical settings. However, developing specialist medical VLMs requires substantial computational resources and carefully curated datasets, and it remains unclear under which conditions generalist and specialist medical VLMs each perform best. This study highlights the complementary strengths of specialist medical and generalist VLMs. Specialists remain valuable in modality-aligned use cases, but we find that efficiently fine-tuned generalist VLMs can achieve comparable or even superior performance in most tasks, particularly when transferring to unseen or rare OOD medical modalities. These results suggest that generalist VLMs, rather than being constrained by their lack of specialist medical pretraining, may offer a scalable and cost-effective pathway for advancing clinical AI development.

中文标题/摘要

标题：通用视觉语言模型（VLMs）能否与专科医疗VLMs匹敌？基准测试与战略洞察

视觉语言模型（VLMs）在临床环境中自动化图像诊断和解释方面显示出潜力。然而，开发专科医疗VLMs需要大量的计算资源和精心策划的数据集，尚不清楚在何种条件下通用和专科医疗VLMs各自表现最佳。本研究强调了专科医疗和通用VLMs的互补优势。专科模型在模态对齐的应用场景中仍然有价值，但研究发现，高效微调的通用VLMs在大多数任务中可以达到相当甚至更优的性能，尤其是在向未见过或罕见的OOD医疗模态迁移时。这些结果表明，通用VLMs并未因缺乏专科医疗预训练而受限，反而可能为推进临床AI开发提供一种可扩展且成本效益高的途径。

Summary / 总结

This study investigates the performance of generalist and specialist medical Vision Language Models (VLMs) in clinical settings. The research highlights that generalist VLMs, when efficiently fine-tuned, can match or surpass the performance of specialist VLMs, especially in tasks involving unseen or rare medical modalities. This suggests that generalist VLMs could provide a scalable and cost-effective solution for advancing clinical AI development.

研究比较了通用和专科医疗视觉语言模型（VLMs）在临床环境中的表现。研究发现，尽管专科VLMs在特定模态任务上表现出色，但经过高效微调的通用VLMs在大多数任务中可以达到相当甚至更好的效果，尤其是在未见过或罕见的医疗模态中。这表明通用VLMs可能为临床AI开发提供一种可扩展且成本效益高的解决方案。

IAG: Input-aware Backdoor Attack on VLMs for Visual Grounding

Authors: Junxian Li, Beining Xu, Di Zhang

First: 2025-08-13T03:22:19+00:00 · Latest: 2025-09-16T07:37:39+00:00

Comments: 13 pages, 13 Figures

Abs · PDF · Code1 · Code2

Abstract

Vision-language models (VLMs) have shown significant advancements in tasks such as visual grounding, where they localize specific objects in images based on natural language queries and images. However, security issues in visual grounding tasks for VLMs remain underexplored, especially in the context of backdoor attacks. In this paper, we introduce a novel input-aware backdoor attack method, IAG, designed to manipulate the grounding behavior of VLMs. This attack forces the model to ground a specific target object in the input image, regardless of the user's query. We propose an adaptive trigger generator that embeds the semantic information of the attack target's description into the original image using a text-conditional U-Net, thereby overcoming the open-vocabulary attack challenge. To ensure the attack's stealthiness, we utilize a reconstruction loss to minimize visual discrepancies between poisoned and clean images. Additionally, we introduce a unified method for generating attack data. IAG is evaluated theoretically and empirically, demonstrating its feasibility and effectiveness. Notably, our ASR@0.5 on InternVL-2.5-8B reaches over 65% on various testing sets. IAG also shows promising potential on manipulating Ferret-7B and LLaVA-1.5-7B with very little accuracy decrease on clean samples. Extensive specific experiments, such as ablation study and potential defense, also indicate the robustness and transferability of our attack.

中文标题/摘要

标题：IAG：面向视觉定位的VLM输入感知后门攻击

视觉语言模型（VLMs）在视觉定位等任务中取得了显著进展，这些任务要求它们根据自然语言查询和图像在图像中定位特定对象。然而，视觉定位任务中的安全问题在VLMs中仍然被忽视，特别是在后门攻击的背景下。本文介绍了一种新颖的输入感知后门攻击方法IAG，旨在操纵VLMs的定位行为。该攻击迫使模型在输入图像中定位特定目标对象，而不管用户的查询。我们提出了一种自适应触发器生成器，使用文本条件U-Net将攻击目标描述的语义信息嵌入到原始图像中，从而克服了开放式词汇攻击的挑战。为了确保攻击的隐蔽性，我们使用重构损失来最小化受污染图像和干净图像之间的视觉差异。此外，我们还提出了一种统一的攻击数据生成方法。IAG在理论上和实验上进行了评估，证明了其可行性和有效性。值得注意的是，我们的InternVL-2.5-8B上的ASR@0.5在各种测试集上超过65%。IAG在操纵Ferret-7B和LlaVA-1.5-7B方面也显示出很大的潜力，对干净样本的准确率下降很小。广泛的特定实验，如消融研究和潜在防御，也表明了我们攻击的鲁棒性和可转移性。
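The ASR@0.5 figure quoted above is not defined in the abstract. A natural reading, assumed here, is that an attack counts as successful when the predicted box overlaps the attacker's target box with an intersection-over-union (IoU) of at least 0.5. The sketch below computes the metric under that assumption, with boxes given as `(x1, y1, x2, y2)`.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def attack_success_rate(predicted, attack_targets, threshold=0.5):
    """Share of samples whose predicted box matches the attack target box."""
    hits = sum(
        iou(pred, target) >= threshold
        for pred, target in zip(predicted, attack_targets)
    )
    return hits / len(predicted)

# Toy boxes: one exact match, one partial overlap (IoU = 1/3), one miss.
predicted = [(0, 0, 10, 10), (0, 0, 10, 10), (20, 20, 30, 30)]
attack_targets = [(0, 0, 10, 10), (5, 0, 15, 10), (0, 0, 10, 10)]
asr = attack_success_rate(predicted, attack_targets, threshold=0.5)
```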

Teaching Vision-Language Models to Ask: Resolving Ambiguity in Visual Questions

Authors: Pu Jian, Donglei Yu, Wen Yang, Shuo Ren, Jiajun Zhang

First: 2025-07-18T09:31:43+00:00 · Latest: 2025-09-16T07:08:58+00:00

Comments: ACL2025 Main (SAC Highlight Award)

Abs · PDF · Code1 · Code2

Abstract

In the visual question answering (VQA) context, users often pose ambiguous questions to visual language models (VLMs) due to varying expression habits. Existing research addresses such ambiguities primarily by rephrasing questions. These approaches neglect the inherently interactive nature of user interactions with VLMs, where ambiguities can be clarified through user feedback. However, research on interactive clarification faces two major challenges: (1) Benchmarks are absent to assess VLMs' capacity for resolving ambiguities through interaction; (2) VLMs are trained to prefer answering rather than asking, preventing them from seeking clarification. To overcome these challenges, we introduce the ClearVQA benchmark, which targets three common categories of ambiguity in the VQA context and encompasses various VQA scenarios.

中文标题/摘要

标题：教视觉语言模型学会提问：解决视觉问题中的歧义

在视觉问答（VQA）背景下，用户由于表达习惯不同往往会提出含糊的问题。现有研究主要通过重新表述问题来解决这种歧义，但忽视了用户与视觉语言模型（VLMs）交互的本质，即通过用户反馈可以澄清歧义。然而，关于交互澄清的研究面临两大挑战：（1）缺乏评估VLMs通过交互解决歧义能力的基准；（2）VLMs被训练成更倾向于回答而不是提问，这阻碍了它们寻求澄清。为克服这些挑战，我们引入了ClearVQA基准，该基准针对VQA背景下常见的三种歧义类别，并涵盖了各种VQA场景。

Zero-shot Graph Reasoning via Retrieval Augmented Framework with LLMs

Authors: Hanqing Li, Kiran Sheena Jyothi, Henry Liang, Sharika Mahadevan, Diego Klabjan

First: 2025-09-16T06:58:58+00:00 · Latest: 2025-09-16T06:58:58+00:00

Abs · PDF · Code1 · Code2

Abstract

We propose a new, training-free method, Graph Reasoning via Retrieval Augmented Framework (GRRAF), that harnesses retrieval-augmented generation (RAG) alongside the code-generation capabilities of large language models (LLMs) to address a wide range of graph reasoning tasks. In GRRAF, the target graph is stored in a graph database, and the LLM is prompted to generate executable code queries that retrieve the necessary information. This approach circumvents the limitations of existing methods that require extensive finetuning or depend on predefined algorithms, and it incorporates an error feedback loop with a time-out mechanism to ensure both correctness and efficiency. Experimental evaluations on the GraphInstruct dataset reveal that GRRAF achieves 100% accuracy on most graph reasoning tasks, including cycle detection, bipartite graph checks, shortest path computation, and maximum flow, while maintaining consistent token costs regardless of graph sizes. Imperfect but still very high performance is observed on subgraph matching. Notably, GRRAF scales effectively to large graphs with up to 10,000 nodes.

Summary / 总结

The research proposes GRRAF, a zero-shot graph reasoning method using retrieval-augmented generation and large language models. It stores the target graph in a graph database and prompts the LLM to generate executable code queries that retrieve the necessary information, avoiding extensive fine-tuning. GRRAF achieves 100% accuracy on most graph reasoning tasks while keeping token costs consistent across graph sizes; performance on subgraph matching is imperfect but still very high. It scales effectively to large graphs with up to 10,000 nodes.

研究提出了GRRAF，一种利用检索增强生成和大型语言模型的零样本图推理方法。它将目标图存储在数据库中，并提示LLM生成可执行代码查询以检索必要信息，避免了大量微调。GRRAF在大多数图推理任务上实现了100%的准确率，并且在不同大小的图上保持了一致的令牌成本，即使在子图匹配任务上也表现出很高的性能。它能够有效地扩展到包含多达10,000个节点的大图。
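The error feedback loop with a time-out can be sketched as follows. This is a toy under stated assumptions: `fake_llm` is a hypothetical stand-in for the LLM, an in-memory adjacency dictionary replaces the graph database, and generated queries are plain Python snippets run with `exec`. The time budget is checked only between attempts and does not interrupt a query that is already running.

```python
import time

# Stand-in for the graph database: adjacency lists of a small directed graph.
GRAPH = {"a": ["b"], "b": ["c"], "c": ["a"], "d": []}

def fake_llm(task, feedback):
    """Hypothetical LLM stub: the first draft is buggy, the retry is fixed."""
    if feedback is None:
        return "result = len(graph['missing'])"
    return "result = sum(len(v) for v in graph.values())"

def run_with_feedback(task, generate, graph, timeout_s=2.0, max_rounds=5):
    """Generate a query, run it, and feed any error back until it succeeds."""
    deadline = time.monotonic() + timeout_s
    feedback = None
    for attempt in range(1, max_rounds + 1):
        if time.monotonic() > deadline:
            raise TimeoutError("query generation exceeded the time budget")
        code = generate(task, feedback)
        scope = {"graph": graph}
        try:
            exec(code, scope)  # run the generated query against the graph
            return scope["result"], attempt
        except Exception as err:
            feedback = f"{type(err).__name__}: {err}"
    raise RuntimeError("no working query within the round limit")

answer, attempts = run_with_feedback("count the edges", fake_llm, GRAPH)
```

Because the LLM only sees the task and the error message, never the full graph, the prompt size does not grow with the graph, which fits the constant-token-cost claim in the abstract. Running untrusted generated code with `exec` is acceptable only in a toy; a real system would sandbox it.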

Training-free Adjustable Polynomial Graph Filtering for Ultra-fast Multimodal Recommendation

Authors: Yu-Seung Roh, Joo-Young Kim, Jin-Duk Park, Won-Yong Shin

First: 2025-03-06T13:00:53+00:00 · Latest: 2025-09-16T06:35:48+00:00

Comments: 17 pages, 7 figures, 6 tables

Abs · PDF · Code1 · Code2

Abstract

Multimodal recommender systems improve the performance of canonical recommender systems with no item features by utilizing diverse content types such as text, images, and videos, while alleviating inherent sparsity of user-item interactions and accelerating user engagement. However, current neural network-based models often incur significant computational overhead due to the complex training process required to learn and integrate information from multiple modalities. To address this challenge, we propose MultiModal-Graph Filtering (MM-GF), a training-free method grounded in graph filtering (GF) for efficient and accurate multimodal recommendations. Specifically, MM-GF first constructs multiple similarity graphs for two distinct modalities as well as user-item interaction data. Then, MM-GF optimally fuses these multimodal signals using a polynomial graph filter that allows for precise control of the frequency response by adjusting frequency bounds. Furthermore, the filter coefficients are treated as hyperparameters, enabling flexible and data-driven adaptation. Extensive experiments on real-world benchmark datasets demonstrate that MM-GF not only improves recommendation accuracy by up to 22.25% compared to the best competitor but also dramatically reduces computational costs by achieving a runtime of less than 10 seconds.

中文标题/摘要

标题：无训练可调多项式图滤波用于超快速多模态推荐

多模态推荐系统通过利用文本、图像和视频等多种内容类型，提高了没有项目特征的典型推荐系统的性能，缓解了用户-项目交互的固有稀疏性并加速了用户参与。然而，当前基于神经网络的模型由于需要复杂训练过程来学习和整合多种模态的信息，往往会产生显著的计算开销。为了解决这一挑战，我们提出了基于图滤波（GF）的多模态图滤波（MM-GF）方法，这是一种无训练方法，用于高效且准确的多模态推荐。具体来说，MM-GF首先为两种不同的模态以及用户-项目交互数据构建多个相似性图。然后，MM-GF使用多项式图滤波器以调整频率边界的方式精确控制频率响应来融合这些多模态信号。此外，滤波器系数被视为超参数，使滤波器能够灵活且数据驱动地适应。在真实基准数据集上的广泛实验表明，与最佳竞争对手相比，MM-GF不仅将推荐准确性最多提高了22.25%，而且通过实现不到10秒的运行时间，显著降低了计算成本。

Summary / 总结

The paper proposes MultiModal-Graph Filtering (MM-GF), a training-free method for efficient and accurate multimodal recommendations. It constructs similarity graphs for different modalities and user-item interactions, then uses a polynomial graph filter to fuse these signals, allowing for precise control through adjustable frequency bounds. Experiments show MM-GF improves recommendation accuracy by up to 22.25% and reduces computational costs, achieving runtime under 10 seconds.

论文提出了一种免训练方法 MultiModal-Graph Filtering (MM-GF)，以解决基于神经网络的多模态推荐系统中的计算开销问题。MM-GF 构建多个相似性图，并使用多项式图滤波器融合多模态信号，允许对频率响应进行精确控制。实验表明，MM-GF 可以将推荐准确率提高多达 22.25%，并大幅降低计算成本，实现不到 10 秒的运行时间。
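A polynomial graph filter of the general kind described above applies a weighted sum of powers of a similarity matrix to a user's interaction vector, with no training step. The sketch below shows that generic form, h(P) = c0·I + c1·P + c2·P², on a three-item toy graph. The matrix `P`, the coefficients, and the function names are illustrative assumptions; MM-GF's actual graph construction, multimodal fusion, and frequency-bound adjustment are not reproduced.

```python
def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def identity(n):
    """n-by-n identity matrix."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def polynomial_filter(P, coeffs):
    """Return sum_k coeffs[k] * P**k, starting from k = 0."""
    n = len(P)
    result = [[0.0] * n for _ in range(n)]
    power = identity(n)
    for c in coeffs:
        result = [
            [r + c * p for r, p in zip(res_row, pow_row)]
            for res_row, pow_row in zip(result, power)
        ]
        power = matmul(power, P)
    return result

def recommend_scores(user_row, P, coeffs):
    """Filter a user's interaction vector through the polynomial filter."""
    return matmul([user_row], polynomial_filter(P, coeffs))[0]

# Toy item-item similarity on a 3-item chain; the user interacted with item 0.
P = [[0.0, 0.5, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 0.5, 0.0]]
scores = recommend_scores([1.0, 0.0, 0.0], P, coeffs=[1.0, 1.0, 0.5])
```

Among the unseen items, the direct neighbor (item 1) scores higher than the two-hop neighbor (item 2). Changing the coefficients changes how much weight multi-hop similarity receives.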

Defense-to-Attack: Bypassing Weak Defenses Enables Stronger Jailbreaks in Vision-Language Models

Authors: Yunhan Zhao, Xiang Zheng, Xingjun Ma

First: 2025-09-16T06:25:58+00:00 · Latest: 2025-09-16T06:25:58+00:00

Comments: This work has been submitted to the IEEE for possible publication

Abs · PDF · Code1 · Code2

Abstract

Despite their superb capabilities, Vision-Language Models (VLMs) have been shown to be vulnerable to jailbreak attacks. While recent jailbreaks have achieved notable progress, their effectiveness and efficiency can still be improved. In this work, we reveal an interesting phenomenon: incorporating weak defense into the attack pipeline can significantly enhance both the effectiveness and the efficiency of jailbreaks on VLMs. Building on this insight, we propose Defense2Attack, a novel jailbreak method that bypasses the safety guardrails of VLMs by leveraging defensive patterns to guide jailbreak prompt design. Specifically, Defense2Attack consists of three key components: (1) a visual optimizer that embeds universal adversarial perturbations with affirmative and encouraging semantics; (2) a textual optimizer that refines the input using a defense-styled prompt; and (3) a red-team suffix generator that enhances the jailbreak through reinforcement fine-tuning. We empirically evaluate our method on four VLMs and four safety benchmarks. The results demonstrate that Defense2Attack achieves superior jailbreak performance in a single attempt, outperforming state-of-the-art attack methods that often require multiple tries. Our work offers a new perspective on jailbreaking VLMs.

中文标题/摘要

标题：防御转攻击：绕过弱防御可在视觉语言模型中实现更强的越狱

尽管视觉语言模型（VLMs）具有出色的能力，但它们已被证明对越狱攻击易受攻击。虽然最近的越狱已经取得了显著的进步，但其效果和效率仍有待提高。在本工作中，我们揭示了一个有趣的现象：将弱防御融入攻击管道可以显著提高VLMs越狱的有效性和效率。基于这一洞察，我们提出了Defense2Attack，这是一种新颖的越狱方法，通过利用防御模式来引导越狱提示设计，绕过VLMs的安全防护。具体而言，Defense2Attack 包含三个关键组件：（1）视觉优化器，嵌入具有肯定和鼓励语义的通用对抗扰动；（2）文本优化器，使用防御风格的提示精炼输入；（3）红队后缀生成器，通过强化微调增强越狱。我们在四个VLMs和四个安全基准上进行了实证评估。结果表明，Defense2Attack 在一次尝试中实现了优越的越狱性能，优于通常需要多次尝试的最先进的攻击方法。我们的工作为越狱VLMs提供了新的视角。

Summary / 总结

This paper explores the phenomenon that integrating weak defenses into the attack pipeline can significantly enhance the effectiveness and efficiency of jailbreak attacks on Vision-Language Models (VLMs). The proposed Defense2Attack method consists of a visual optimizer, a textual optimizer, and a red-team suffix generator. Empirical evaluations on four VLMs and four safety benchmarks show that Defense2Attack outperforms state-of-the-art attack methods in a single attempt, achieving superior jailbreak performance.

该研究提出了Defense2Attack方法，通过在攻击流程中引入弱防御来提升针对视觉语言模型（VLMs）的越狱攻击的效果和效率。Defense2Attack包括视觉优化器、文本优化器和红队后缀生成器。在四个VLMs和四个安全基准上的实证评估表明，Defense2Attack在单次尝试中表现出色，优于最先进的越狱攻击方法。

Leveraging Geometric Priors for Unaligned Scene Change Detection

Authors: Ziling Liu, Ziwei Chen, Mingqi Gao, Jinyu Yang, Feng Zheng

First: 2025-09-14T14:31:08+00:00 · Latest: 2025-09-16T06:25:53+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Unaligned Scene Change Detection aims to detect scene changes between image pairs captured at different times without assuming viewpoint alignment. To handle viewpoint variations, current methods rely solely on 2D visual cues to establish cross-image correspondence to assist change detection. However, large viewpoint changes can alter visual observations, causing appearance-based matching to drift or fail. Additionally, supervision limited to 2D change masks from small-scale SCD datasets restricts the learning of generalizable multi-view knowledge, making it difficult to reliably identify visual overlaps and handle occlusions. This lack of explicit geometric reasoning represents a critical yet overlooked limitation. In this work, we introduce geometric priors for the first time to address the core challenges of unaligned SCD, for reliable identification of visual overlaps, robust correspondence establishment, and explicit occlusion detection. Building on these priors, we propose a training-free framework that integrates them with the powerful representations of a visual foundation model to enable reliable change detection under viewpoint misalignment. Through extensive evaluation on the PSCD, ChangeSim, and PASLCD datasets, we demonstrate that our approach achieves superior and robust performance. Our code will be released at https://github.com/ZilingLiu/GeoSCD.

中文标题/摘要

标题：利用几何先验进行未对齐场景变化检测

未对齐场景变化检测旨在检测不同时间拍摄的图像对之间的场景变化，而不假设视点对齐。为了处理视点变化，当前方法仅依赖于2D视觉线索来建立跨图像对应关系以辅助变化检测。然而，大的视点变化会改变视觉观察，导致基于外观的匹配漂移或失败。此外，仅限于2D变化掩模的小规模SCD数据集的监督限制了多视图知识的泛化学习，使得可靠地识别视觉重叠和处理遮挡变得困难。这种缺乏明确的几何推理代表了一个关键但被忽视的局限性。在本文中，我们首次引入几何先验以解决未对齐SCD的核心挑战，实现可靠的视觉重叠识别、稳健的对应关系建立和明确的遮挡检测。基于这些先验，我们提出了一种无需训练的框架，将它们与视觉基础模型的强大表示相结合，以在视点错位的情况下实现可靠的变更检测。通过在PSCD、ChangeSim和PASLCD数据集上的广泛评估，我们证明了我们的方法实现了优越且稳健的性能。我们的代码将在https://github.com/ZilingLiu/GeoSCD发布。

Summary / 总结

This paper addresses the challenge of detecting scene changes between images captured at different times without assuming viewpoint alignment. To handle viewpoint variations, the authors introduce geometric priors for the first time, which help in reliable identification of visual overlaps, robust correspondence establishment, and explicit occlusion detection. The proposed training-free framework integrates these priors with a visual foundation model, leading to superior and robust performance on the PSCD, ChangeSim, and PASLCD datasets.

研究旨在通过引入几何先验来检测不同时间拍摄的未对齐图像之间的场景变化，解决仅依赖2D视觉线索和小规模数据集的局限性。提出的方案使用了一个无需训练的框架，将几何先验与视觉基础模型结合，以建立稳健的对应关系并处理遮挡。在PSCD、ChangeSim和PASLCD数据集上的实验表明，该方法在视角偏差情况下表现出更高的可靠性和鲁棒性。

AsyMoE: Leveraging Modal Asymmetry for Enhanced Expert Specialization in Large Vision-Language Models

Authors: Heng Zhang, Haichuan Hu, Yaomin Shen, Weihao Yu, Yilei Yuan, Haochen You, Guo Cheng, Zijian Zhang, Lubin Gan, Huihui Wei, Hao Zhang, Jin Huang

First: 2025-09-16T06:16:05+00:00 · Latest: 2025-09-16T06:16:05+00:00

Abs · PDF · Code1 · Code2

Abstract

Large Vision-Language Models (LVLMs) have demonstrated impressive performance on multimodal tasks through scaled architectures and extensive training. However, existing Mixture of Experts (MoE) approaches face challenges due to the asymmetry between visual and linguistic processing. Visual information is spatially complete, while language requires maintaining sequential context. As a result, MoE models struggle to balance modality-specific features and cross-modal interactions. Through systematic analysis, we observe that language experts in deeper layers progressively lose contextual grounding and rely more on parametric knowledge rather than utilizing the provided visual and linguistic information. To address this, we propose AsyMoE, a novel architecture that models this asymmetry using three specialized expert groups. We design intra-modality experts for modality-specific processing, hyperbolic inter-modality experts for hierarchical cross-modal interactions, and evidence-priority language experts to suppress parametric biases and maintain contextual grounding. Extensive experiments demonstrate that AsyMoE achieves 26.58% and 15.45% accuracy improvements over vanilla MoE and modality-specific MoE respectively, with 25.45% fewer activated parameters than dense models.

中文标题/摘要

标题：AsyMoE：利用模态不对称性增强大型视觉-语言模型专家专业化

大型视觉-语言模型（LVLMs）通过扩展架构和大量训练，在多模态任务中表现出色。然而，现有的混合专家（MoE）方法由于视觉和语言处理之间的不对称性而面临挑战。视觉信息是空间上完整的，而语言则需要保持顺序上下文。因此，MoE模型难以平衡模态特定特征和跨模态交互。通过系统分析，我们发现深层的语言专家逐渐失去上下文基础，更多依赖参数知识，而不是利用提供的视觉和语言信息。为了解决这一问题，我们提出了一种新的AsyMoE架构，该架构使用三个专门的专家组来建模这种不对称性。我们设计了模态内专家进行模态特定处理，双曲跨模态专家进行分层跨模态交互，以及证据优先语言专家来抑制参数偏差并保持上下文基础。广泛的实验表明，与标准MoE和模态特定MoE相比，AsyMoE分别实现了26.58%和15.45%的准确率提升，且激活参数比密集模型少25.45%。

Summary / 总结

The paper addresses the challenge of modality asymmetry in large vision-language models (LVLMs) by proposing AsyMoE, a novel architecture that models the asymmetry using three specialized expert groups: intra-modality experts for modality-specific processing, hyperbolic inter-modality experts for hierarchical cross-modal interactions, and evidence-priority language experts to maintain contextual grounding. AsyMoE achieves 26.58% and 15.45% accuracy gains over vanilla MoE and modality-specific MoE, respectively, while activating 25.45% fewer parameters than dense models.

研究旨在通过解决现有Mixture of Experts (MoE)方法中视觉和语言处理之间的不对称性，提高大型视觉-语言模型的性能。提出了AsyMoE，该方法使用三个专门的专家组：模态内专家用于模态特定处理，双曲跨模态专家用于分层跨模态交互，以及证据优先语言专家以保持上下文关联性。AsyMoE 显示出显著的改进，相比 vanilla MoE 和模态特定 MoE 分别实现了 26.58% 和 15.45% 的准确率提升，同时激活的参数比密集模型少 25.45%。

Palmprint De-Identification Using Diffusion Model for High-Quality and Diverse Synthesis

Authors: Licheng Yan, Bob Zhang, Andrew Beng Jin Teoh, Lu Leng, Shuyi Li, Yuqi Wang, Ziyuan Yang

First: 2025-04-11T06:00:06+00:00 · Latest: 2025-09-16T04:49:55+00:00

Abstract

Palmprint recognition techniques have advanced significantly in recent years, enabling reliable recognition even when palmprints are captured in uncontrolled or challenging environments. However, this strength also introduces new risks, as publicly available palmprint images can be misused by adversaries for malicious activities. Despite this growing concern, research on methods to obscure or anonymize palmprints remains largely unexplored. Thus, it is essential to develop a palmprint de-identification technique capable of removing identity-revealing features while retaining the image's utility and preserving non-sensitive information. In this paper, we propose a training-free framework that utilizes pre-trained diffusion models to generate diverse, high-quality palmprint images that conceal identity features for de-identification purposes. To ensure greater stability and controllability in the synthesis process, we incorporate a semantic-guided embedding fusion alongside a prior interpolation mechanism. We further propose the de-identification ratio, a novel metric for intuitive de-identification assessment. Extensive experiments across multiple palmprint datasets and recognition methods demonstrate that our method effectively conceals identity-related traits with significant diversity across de-identified samples. The de-identified samples preserve high visual fidelity and maintain excellent usability, achieving a balance between de-identification and retaining non-identity information.

中文标题/摘要

标题：基于扩散模型的高质量、多样化合成掌纹去标识化

近年来，掌纹识别技术取得了显著进步，即使掌纹是在不受控或具有挑战性的环境中采集的，也能实现可靠的识别。然而，这一优势也带来了新的风险，因为公开的掌纹图像可能被攻击者用于恶意活动。尽管这种担忧日益增长，关于如何模糊或匿名化掌纹的研究仍基本处于空白。因此，有必要开发一种掌纹去标识化技术，在去除可揭示身份的特征的同时保留图像的实用性和非敏感信息。在本文中，我们提出了一种无需训练的框架，利用预训练的扩散模型生成多样且高质量、隐藏了身份特征的掌纹图像，以实现去标识化。为了提高合成过程的稳定性和可控性，我们结合了语义引导的嵌入融合和先验插值机制。我们还提出了去标识化比率，这是一种用于直观评估去标识化效果的新指标。在多个掌纹数据集和识别方法上的广泛实验表明，我们的方法能够有效隐藏与身份相关的特征，并且去标识化样本之间具有显著的多样性。去标识化样本保持了高视觉保真度和良好的可用性，在去标识化与保留非身份信息之间取得了平衡。

Summary / 总结

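This paper aims to mitigate the security risks of publicly available palmprint images by de-identifying them while preserving their utility. It proposes a training-free framework that uses pre-trained diffusion models to generate diverse, high-quality palmprint images that conceal identity features, combining semantic-guided embedding fusion with a prior interpolation mechanism to make synthesis more stable and controllable. Experiments across multiple palmprint datasets and recognition methods show that the method conceals identity-related traits while preserving high visual fidelity and usability.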

本文旨在通过对掌纹图像进行去标识化来缓解安全风险，同时保持其实用性。提出了一种无需训练的框架，利用预训练的扩散模型生成多样且高质量的掌纹图像，以遮蔽身份特征。该方法结合了语义引导的嵌入融合和先验插值机制，以增强合成过程的稳定性和可控性。实验结果表明，所提出的方法能够有效遮蔽与身份相关的特征，保持高视觉保真度，并在多个数据集和识别方法中保持良好的可用性。

No Need for "Learning" to Defer? A Training Free Deferral Framework to Multiple Experts through Conformal Prediction

Authors: Tim Bary, Benoît Macq, Louis Petit

First: 2025-09-16T02:01:21+00:00 · Latest: 2025-09-16T02:01:21+00:00

Comments: 9 pages, 4 figures, 1 table

Abstract

AI systems often fail to deliver reliable predictions across all inputs, prompting the need for hybrid human-AI decision-making. Existing Learning to Defer (L2D) approaches address this by training deferral models, but these are sensitive to changes in expert composition and require significant retraining if experts change. We propose a training-free, model- and expert-agnostic framework for expert deferral based on conformal prediction. Our method uses the prediction set generated by a conformal predictor to identify label-specific uncertainty and selects the most discriminative expert using a segregativity criterion, measuring how well an expert distinguishes between the remaining plausible labels. Experiments on CIFAR10-H and ImageNet16-H show that our method consistently outperforms both the standalone model and the strongest expert, with accuracies attaining $99.57\pm0.10\%$ and $99.40\pm0.52\%$, while reducing expert workload by up to a factor of $11$. The method remains robust under degraded expert performance and shows a gradual performance drop in low-information settings. These results suggest a scalable, retraining-free alternative to L2D for real-world human-AI collaboration.

中文标题/摘要

标题：无需“学习”即可推迟？基于共形预测的多专家无训练推迟框架

AI系统经常无法在所有输入上提供可靠的预测，因此需要人机混合决策。现有的学习推迟（L2D）方法通过训练推迟模型来解决这一问题，但这些模型对专家组成的变化敏感，如果专家发生变化，需要进行大量重新训练。我们提出了一种基于共形预测的无训练、模型和专家无关的专家推迟框架。该方法使用共形预测器生成的预测集来识别标签特定的不确定性，并使用分离性标准选择最具区分性的专家，该标准衡量专家区分剩余可能标签的能力。在CIFAR10-H和ImageNet16-H上的实验表明，我们的方法始终优于独立模型和最强专家，准确率分别达到99.57±0.10%和99.40±0.52%，同时使专家工作量最多降至原来的十一分之一。该方法在专家表现下降时仍保持稳健，并在信息量低的情况下表现出逐步的性能下降。这些结果表明，该方法为实际的人机协作提供了一种可扩展的、无需重新训练的L2D替代方案。

Summary / 总结

The paper addresses the need for reliable predictions in AI systems by proposing a training-free framework for deferring to multiple experts using conformal prediction. Unlike existing Learning to Defer (L2D) approaches, this method does not require retraining when experts change and is model- and expert-agnostic. Experiments on CIFAR10-H and ImageNet16-H demonstrate that the proposed method outperforms both the standalone model and the strongest expert, achieving accuracies of 99.57±0.10% and 99.40±0.52% respectively, while significantly reducing expert workload. The method also maintains robustness under degraded expert performance and shows a gradual performance drop in low-information settings, suggesting a scalable alternative to L2D for human-AI collaboration.

论文提出了一种基于共形预测的无训练专家推迟框架，以实现可靠的人机协同决策。该方法通过识别标签特定的不确定性并选择最具区分性的专家，使专家工作量最多降至原来的十一分之一，同时在CIFAR10-H和ImageNet16-H数据集上分别达到99.57±0.10%和99.40±0.52%的高准确率。该方法在专家性能下降的情况下仍保持稳健，并在信息量低的情况下表现出逐步的性能下降，为Learning to Defer方法提供了一种无需重新训练的可扩展替代方案。
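
The deferral pipeline described above (conformal prediction set first, then expert choice) can be sketched as follows. This is a hedged illustration under stated assumptions, not the authors' implementation: it uses standard split conformal prediction with the score 1 - p(true label), and the expert score in `pick_expert` is a stand-in for the paper's segregativity criterion, whose exact definition is not given in the abstract. All function names and the confusion-matrix input format are hypothetical.

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split-conformal threshold from held-out calibration data.

    Nonconformity score = 1 - predicted probability of the true label.
    """
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    k = int(np.ceil((n + 1) * (1 - alpha)))  # rank of the score used as threshold
    if k > n:
        return np.inf  # too little calibration data: every label stays plausible
    return np.sort(scores)[k - 1]

def prediction_set(probs, qhat):
    """Labels whose nonconformity score is within the threshold."""
    return np.flatnonzero(1.0 - probs <= qhat)

def pick_expert(pred_set, confusions):
    """Choose the expert that best separates the labels in pred_set.

    confusions: (n_experts, n_classes, n_classes) counts, row = true label,
    column = expert answer. ASSUMED score (not the paper's definition):
    mean per-label accuracy after restricting to the plausible labels.
    """
    sub = confusions[:, pred_set][:, :, pred_set].astype(float)
    sub /= np.maximum(sub.sum(axis=2, keepdims=True), 1e-12)  # row-normalise
    scores = np.trace(sub, axis1=1, axis2=2) / len(pred_set)
    return int(np.argmax(scores))

def decide(probs, qhat, confusions):
    """Return ('model', label) for a singleton set, else ('expert', index)."""
    s = prediction_set(probs, qhat)
    if len(s) == 1:
        return "model", int(s[0])
    if len(s) == 0:
        s = np.arange(len(probs))  # empty set: treat all labels as plausible
    return "expert", pick_expert(s, confusions)
```

Because nothing here is trained on expert behaviour beyond a confusion matrix per expert, adding or removing an expert only changes the `confusions` array, which mirrors the retraining-free property the abstract emphasises.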

VARCO-VISION-2.0 Technical Report

Authors: Young-rok Cha, Jeongho Ju, SunYoung Park, Jong-Hyeon Lee, Younghyun Yu, Youngjune Kim

First: 2025-09-12T09:55:56+00:00 · Latest: 2025-09-16T01:21:28+00:00

Comments: 19 pages, 1 figure, 14 tables. Technical report for VARCO-VISION-2.0, a Korean-English bilingual VLM in 14B and 1.7B variants. Key features: multi-image understanding, OCR with text localization, improved Korean capabilities

Abstract

We introduce VARCO-VISION-2.0, an open-weight bilingual vision-language model (VLM) for Korean and English with improved capabilities compared to the previous model VARCO-VISION-14B. The model supports multi-image understanding for complex inputs such as documents, charts, and tables, and delivers layout-aware OCR by predicting both textual content and its spatial location. Trained with a four-stage curriculum with memory-efficient techniques, the model achieves enhanced multimodal alignment, while preserving core language abilities and improving safety via preference optimization. Extensive benchmark evaluations demonstrate strong spatial grounding and competitive results for both languages, with the 14B model achieving 8th place on the OpenCompass VLM leaderboard among models of comparable scale. Alongside the 14B-scale model, we release a 1.7B version optimized for on-device deployment. We believe these models advance the development of bilingual VLMs and their practical applications. Two variants of VARCO-VISION-2.0 are available at Hugging Face: a full-scale 14B model and a lightweight 1.7B model.

中文标题/摘要

标题：VARCO-VISION-2.0 技术报告

我们介绍了VARCO-VISION-2.0，这是一种面向韩语和英语的开放权重双语视觉语言模型（VLM），与之前的模型VARCO-VISION-14B相比，其能力得到了提升。该模型支持多图像理解，适用于文档、图表和表格等复杂输入，并通过预测文本内容及其空间位置提供布局感知的OCR。该模型采用四阶段课程并结合内存高效技术进行训练，实现了更强的多模态对齐，同时保留了核心语言能力，并通过偏好优化提高了安全性。广泛的基准评估表明，该模型在空间定位方面表现出色，并且在两种语言上都取得了有竞争力的结果，其中14B模型在OpenCompass VLM排行榜的同等规模模型中排名第8。除了14B规模的模型，我们还发布了针对设备端部署优化的1.7B版本。我们相信这些模型推动了双语VLM及其实际应用的发展。VARCO-VISION-2.0在Hugging Face上提供了两种变体：全规模14B模型和轻量级1.7B模型。

Summary / 总结

VARCO-VISION-2.0 is an open-weight bilingual vision-language model for Korean and English, enhancing previous capabilities with improved multimodal alignment and safety. Trained through a four-stage curriculum with memory-efficient techniques, it supports multi-image understanding and layout-aware OCR. Extensive benchmark evaluations show strong spatial grounding and competitive results, with the 14B model ranking 8th on the OpenCompass VLM leaderboard. Two variants are available: a full-scale 14B model and a lightweight 1.7B model for on-device deployment.

VARCO-VISION-2.0 是一个开放权重的双语视觉语言模型，支持韩语和英语，通过四阶段课程训练和内存高效技术提升了前代模型的能力。该模型支持多图像理解及带有文本定位的OCR。基准评估显示其空间定位能力强、结果具有竞争力，其中14B模型在OpenCompass VLM排行榜上位列第8。提供了两种变体：一个全规模的14B模型和一个轻量级的1.7B模型，后者适用于设备端部署。

Instance-Level Data-Use Auditing of Visual ML Models

Authors: Zonghao Huang, Neil Zhenqiang Gong, Michael K. Reiter

First: 2025-03-28T13:28:57+00:00 · Latest: 2025-09-16T00:34:40+00:00

Abstract

The growing trend of legal disputes over the unauthorized use of data in machine learning (ML) systems highlights the urgent need for reliable data-use auditing mechanisms to ensure accountability and transparency in ML. We present the first proactive, instance-level, data-use auditing method designed to enable data owners to audit the use of their individual data instances in ML models, providing more fine-grained auditing results than previous work. To do so, our research generalizes previous work integrating black-box membership inference and sequential hypothesis testing, expanding its scope of application while preserving the quantifiable and tunable false-detection rate that is its hallmark. We evaluate our method on three types of visual ML models: image classifiers, visual encoders, and vision-language models (Contrastive Language-Image Pretraining (CLIP) and Bootstrapping Language-Image Pretraining (BLIP) models). In addition, we apply our method to evaluate the performance of two state-of-the-art approximate unlearning methods. As a noteworthy second contribution, our work reveals that neither method successfully removes the influence of the unlearned data instances from image classifiers and CLIP models, even if sacrificing model utility by $10\%$.

中文标题/摘要

标题：视觉ML模型实例级数据使用审计

机器学习(ML)系统中未经授权使用数据的法律纠纷日益增多，突显了对可靠的数据使用审计机制的迫切需求，以确保ML中的问责制和透明度。我们提出了首个主动的、实例级的数据使用审计方法，旨在使数据所有者能够审计其个体数据实例在ML模型中的使用情况，提供比以往工作更精细的审计结果。为此，我们的研究对结合黑盒成员推断和序贯假设检验的先前工作进行了泛化，扩大了其应用范围，同时保持了其标志性的可量化和可调节的误检率。我们在三种类型的视觉ML模型上评估了该方法：图像分类器、视觉编码器以及视觉语言模型（对比语言-图像预训练（CLIP）和自助语言-图像预训练（BLIP）模型）。此外，我们还应用该方法评估了两种最先进的近似遗忘方法的性能。作为第二个值得注意的贡献，我们的工作揭示了这两种方法均未能成功从图像分类器和CLIP模型中移除被遗忘数据实例的影响，即使以牺牲10%的模型效用为代价。
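
To make the idea of sequential hypothesis testing with a tunable false-detection rate concrete, the sketch below implements Wald's classical sequential probability ratio test (SPRT) on binary outcomes. It is a generic stand-in, not the paper's auditing procedure: the interpretation of each observation as a membership-inference "hit", the rates `p0` and `p1`, and the function name `sprt` are all assumptions for illustration.

```python
import math

def sprt(observations, p0=0.5, p1=0.8, alpha=0.05, beta=0.05):
    """Wald's SPRT for a Bernoulli rate (illustrative only).

    H0: hit rate is p0 (assumed to mean the data was not used).
    H1: hit rate is p1 > p0 (assumed to mean the data was used).
    alpha bounds the false-detection rate, beta the miss rate (approximately).
    """
    upper = math.log((1 - beta) / alpha)   # cross this to accept H1
    lower = math.log(beta / (1 - alpha))   # cross this to accept H0
    llr = 0.0                              # running log-likelihood ratio
    for i, hit in enumerate(observations, start=1):
        if hit:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "used", i
        if llr <= lower:
            return "not used", i
    return "undecided", len(observations)
```

Lowering `alpha` raises the upper threshold, so more evidence is required before declaring that data was used; this is the sense in which the false-detection rate is tunable.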

Evaluating Robustness of Vision-Language Models Under Noisy Conditions

Authors: Purushoth, Alireza

First: 2025-09-15T22:31:21+00:00 · Latest: 2025-09-15T22:31:21+00:00

Abstract

Vision-Language Models (VLMs) have attained exceptional success across multimodal tasks such as image captioning and visual question answering. However, their robustness under noisy conditions remains underexplored. In this study, we present a comprehensive framework for evaluating several state-of-the-art VLMs under controlled perturbations, including lighting variation, motion blur, and compression artifacts. We use both lexical metrics (BLEU, METEOR, ROUGE, CIDEr) and neural similarity measures based on sentence embeddings to quantify semantic alignment. Our experiments span diverse datasets, revealing key insights: (1) descriptiveness of ground-truth captions significantly influences model performance; (2) larger models like LLaVA excel in semantic understanding but do not universally outperform smaller models; and (3) certain noise types, such as JPEG compression and motion blur, dramatically degrade performance across models. Our findings highlight the nuanced trade-offs between model size, dataset characteristics, and noise resilience, offering a standardized benchmark for future robust multimodal learning.

中文标题/摘要

标题：视觉语言模型在噪声条件下的鲁棒性评估

视觉语言模型（VLMs）在图像字幕和视觉问答等多模态任务中取得了卓越的成功。然而，它们在噪声条件下的鲁棒性仍未得到充分研究。在本研究中，我们提出了一种全面的评估框架，以评估几种最先进的VLMs在受控扰动下的性能，包括光照变化、运动模糊和压缩伪影。我们使用了基于词汇的度量标准（BLEU、METEOR、ROUGE、CIDEr）和基于句子嵌入的神经相似性度量来量化语义对齐。我们的实验涵盖了多种数据集，揭示了几个关键见解：（1）真实标注字幕的描述性显著影响模型性能；（2）较大的模型如LLaVA在语义理解方面表现出色，但并不普遍优于较小的模型；（3）某些噪声类型，如JPEG压缩和运动模糊，会显著降低各模型的性能。我们的研究结果突显了模型大小、数据集特性和抗噪能力之间的微妙权衡，为未来的鲁棒多模态学习提供了一个标准化基准。

Summary / 总结

This study evaluates the robustness of Vision-Language Models (VLMs) under noisy conditions by applying controlled perturbations such as lighting variation, motion blur, and compression artifacts. The research uses both lexical-based metrics and neural-based similarity measures to assess performance. Key findings include the significant influence of ground-truth descriptiveness, the varying performance of larger models like LLaVA compared to smaller models, and the detrimental effect of certain noise types on model performance across the board.

该研究通过施加光照变化、运动模糊和压缩伪影等受控扰动，评估了视觉语言模型（VLMs）在噪声条件下的鲁棒性。研究使用了基于词汇的度量标准和基于神经网络的相似性度量来评估性能。主要发现包括真实标注字幕描述性的显著影响、大型模型如LLaVA与小型模型之间的不同表现以及某些噪声类型对所有模型性能的负面影响。
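
Two of the controlled perturbations named above, lighting variation and motion blur, together with an embedding-based similarity score, can be sketched in a few lines of NumPy. This is an illustrative approximation, not the study's benchmark code: the function names, the box-filter blur, and the assumption that images are float arrays in [0, 1] are all hypothetical, and JPEG compression is omitted because it needs an image codec.

```python
import numpy as np

def adjust_lighting(img, factor):
    """Scale brightness of a float image in [0, 1] and clip to the valid range."""
    return np.clip(img * factor, 0.0, 1.0)

def motion_blur(img, length=5):
    """Horizontal motion blur: box-filter each row along the width axis.

    Borders are zero-padded, so pixels near the left and right edges darken.
    """
    kernel = np.ones(length) / length
    return np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), 1, img
    )

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

A robustness check would caption the clean and perturbed images with the same model, embed both captions with a sentence encoder of choice, and compare `cosine_similarity` (or a lexical metric such as BLEU) before and after the perturbation.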