arXiv Paper Digest

Cinéaste: A Fine-grained Contextual Movie Question Answering Benchmark

Authors: Nisarg A. Shah, Amir Ziai, Chaitanya Ekanadham, Vishal M. Patel

First: 2025-09-17T17:58:06+00:00 · Latest: 2025-09-17T17:58:06+00:00

Comments: 11 pages, 5 figures, 5 tables

Abstract

While recent advancements in vision-language models have improved video understanding, diagnosing their capacity for deep, narrative comprehension remains a challenge. Existing benchmarks often test short-clip recognition or use template-based questions, leaving a critical gap in evaluating fine-grained reasoning over long-form narrative content. To address these gaps, we introduce Cinéaste, a comprehensive benchmark for long-form movie understanding. Our dataset comprises 3,119 multiple-choice question-answer pairs derived from 1,805 scenes across 200 diverse movies, spanning five novel fine-grained contextual reasoning categories. We use GPT-4o to generate diverse, context-rich questions by integrating visual descriptions, captions, scene titles, and summaries, which require deep narrative understanding. To ensure high-quality evaluation, our pipeline incorporates a two-stage filtering process: Context-Independence filtering ensures questions require video context, while Contextual Veracity filtering validates factual consistency against the movie content, mitigating hallucinations. Experiments show that existing MLLMs struggle on Cinéaste; our analysis reveals that long-range temporal reasoning is a primary bottleneck, with the top open-source model achieving only 63.15% accuracy. This underscores significant challenges in fine-grained contextual understanding and the need for advancements in long-form movie comprehension.
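
The Context-Independence stage can be pictured as a simple filter. The sketch below is illustrative only: the `blind_answerer` callable, the dictionary keys, and the stub answerer are hypothetical and not taken from the paper. It assumes a question is dropped when a model that sees no video context still picks the correct option.

```python
def context_independence_filter(questions, blind_answerer):
    """Keep only questions that a context-free answerer gets wrong.

    questions: list of dicts with hypothetical keys
               "question", "options", "answer".
    blind_answerer: callable(question, options) -> chosen option,
                    standing in for a model queried without video context.
    """
    kept = []
    for item in questions:
        guess = blind_answerer(item["question"], item["options"])
        if guess != item["answer"]:
            # The blind guess failed, so the question plausibly needs context.
            kept.append(item)
    return kept


def always_first_option(question, options):
    """Stub answerer for demonstration: always picks the first option."""
    return options[0]
```

In practice a single wrong blind guess is weak evidence, so a real pipeline would likely repeat the check across several models or shuffled option orders; that choice is left open here.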

Summary

The research aims to evaluate the deep narrative comprehension capabilities of vision-language models by introducing Cinéaste, a benchmark for long-form movie understanding. The dataset includes 3,119 multiple-choice questions derived from 1,805 scenes across 200 movies, covering five reasoning categories. GPT-4o generates context-rich questions by integrating visual and textual information, and a two-stage filtering process (Context-Independence and Contextual Veracity) ensures high-quality evaluation. Experiments show that existing MLLMs struggle on the benchmark, with long-range temporal reasoning identified as a primary bottleneck; the top open-source model achieves only 63.15% accuracy.

TGPO: Tree-Guided Preference Optimization for Robust Web Agent Reinforcement Learning

Authors: Ziyuan Chen, Zhenghui Zhao, Zhangye Han, Miancan Liu, Xianhang Ye, Yiqing Li, Hongbo Min, Jinkui Ren, Xiantao Zhang, Guitao Cao

First: 2025-09-17T16:58:44+00:00 · Latest: 2025-09-17T16:58:44+00:00

Abstract

With the rapid advancement of large language models and vision-language models, employing large models as Web Agents has become essential for automated web interaction. However, training Web Agents with reinforcement learning faces critical challenges including credit assignment misallocation, prohibitively high annotation costs, and reward sparsity. To address these issues, we propose Tree-Guided Preference Optimization (TGPO), an offline reinforcement learning framework that proposes a tree-structured trajectory representation merging semantically identical states across trajectories to eliminate label conflicts. Our framework incorporates a Process Reward Model that automatically generates fine-grained rewards through subgoal progress, redundancy detection, and action verification. Additionally, a dynamic weighting mechanism prioritizes high-impact decision points during training. Experiments on Online-Mind2Web and our self-constructed C-WebShop datasets demonstrate that TGPO significantly outperforms existing methods, achieving higher success rates with fewer redundant steps.
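
A rough way to picture how merging identical states removes label conflicts is sketched below. This is not the TGPO algorithm itself: it assumes each step already carries a hypothetical `state_key` that is equal for semantically identical states, it uses a flat keyed map rather than an explicit tree, and the mean-reward comparison with a `margin` parameter is an illustrative choice.

```python
from collections import defaultdict


def merge_states(trajectories):
    """Group step rewards by (state_key, action) across all trajectories.

    Each trajectory is a list of (state_key, action, reward) tuples.
    Returns {state_key: {action: [rewards, ...]}}.
    """
    merged = defaultdict(lambda: defaultdict(list))
    for trajectory in trajectories:
        for state_key, action, reward in trajectory:
            merged[state_key][action].append(reward)
    return merged


def preference_pairs(merged, margin=0.0):
    """Build (state_key, preferred_action, rejected_action) triples.

    Within one merged state, the action with the highest mean reward is
    preferred over every action whose mean reward is lower by more than margin.
    """
    pairs = []
    for state_key, actions in merged.items():
        mean_reward = {a: sum(r) / len(r) for a, r in actions.items()}
        ranked = sorted(mean_reward, key=mean_reward.get, reverse=True)
        best = ranked[0]
        for other in ranked[1:]:
            if mean_reward[best] - mean_reward[other] > margin:
                pairs.append((state_key, best, other))
    return pairs
```

Because every occurrence of a state contributes to the same entry, the same action in the same state cannot end up labeled both preferred and rejected, which is the conflict the abstract refers to.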

Summary

The paper addresses the challenges of training Web Agents using reinforcement learning, such as credit assignment misallocation, high annotation costs, and sparse rewards. It introduces Tree-Guided Preference Optimization (TGPO), an offline reinforcement learning framework that uses a tree-structured trajectory representation to merge semantically identical states and eliminate label conflicts. TGPO also includes a Process Reward Model that generates fine-grained rewards based on subgoal progress, redundancy detection, and action verification. The framework further prioritizes high-impact decision points during training. Experimental results on Online-Mind2Web and C-WebShop datasets show that TGPO outperforms existing methods, achieving higher success rates with fewer redundant steps.

StyleSculptor: Zero-Shot Style-Controllable 3D Asset Generation with Texture-Geometry Dual Guidance

Authors: Zefan Qu, Zhenwei Wang, Haoyuan Wang, Ke Xu, Gerhard Hancke, Rynson W. H. Lau

Venue: SIGGRAPH Asia 2025

First: 2025-09-16T17:55:20+00:00 · Latest: 2025-09-17T15:58:50+00:00

Comments: SIGGRAPH Asia 2025, Project page: https://stylesculptor.github.io

Abstract

Creating 3D assets that follow the texture and geometry style of existing ones is often desirable or even inevitable in practical applications like video gaming and virtual reality. While impressive progress has been made in generating 3D objects from text or images, creating style-controllable 3D assets remains a complex and challenging problem. In this work, we propose StyleSculptor, a novel training-free approach for generating style-guided 3D assets from a content image and one or more style images. Unlike previous works, StyleSculptor achieves style-guided 3D generation in a zero-shot manner, enabling fine-grained 3D style control that captures the texture, geometry, or both styles of user-provided style images. At the core of StyleSculptor is a novel Style Disentangled Attention (SD-Attn) module, which establishes a dynamic interaction between the input content image and style image for style-guided 3D asset generation via a cross-3D attention mechanism, enabling stable feature fusion and effective style-guided generation. To alleviate semantic content leakage, we also introduce a style-disentangled feature selection strategy within the SD-Attn module, which leverages the variance of 3D feature patches to disentangle style- and content-significant channels, allowing selective feature injection within the attention framework. With SD-Attn, the network can dynamically compute texture-, geometry-, or both-guided features to steer the 3D generation process. Built upon this, we further propose the Style Guided Control (SGC) mechanism, which enables exclusive geometry- or texture-only stylization, as well as adjustable style intensity control. Extensive experiments demonstrate that StyleSculptor outperforms existing baseline methods in producing high-fidelity 3D assets.

Summary

The research aims to generate 3D assets that follow the texture and geometry style of existing ones, which is often desirable in video gaming and virtual reality. StyleSculptor, a training-free, zero-shot approach, takes a content image and one or more style images and uses a novel Style Disentangled Attention (SD-Attn) module to achieve fine-grained control over texture style, geometry style, or both. A Style Guided Control (SGC) mechanism adds geometry-only or texture-only stylization and adjustable style intensity. Experiments show that the method outperforms existing baselines in producing high-fidelity 3D assets.

VSE-MOT: Multi-Object Tracking in Low-Quality Video Scenes Guided by Visual Semantic Enhancement

Authors: Jun Du, Weiwei Xing, Ming Li, Fei Richard Yu

First: 2025-09-17T15:04:45+00:00 · Latest: 2025-09-17T15:04:45+00:00

Abstract

Current multi-object tracking (MOT) algorithms typically overlook issues inherent in low-quality videos, leading to significant degradation in tracking performance when confronted with real-world image deterioration. Therefore, advancing the application of MOT algorithms in real-world low-quality video scenarios represents a critical and meaningful endeavor. To address the challenges posed by low-quality scenarios, inspired by vision-language models, this paper proposes a Visual Semantic Enhancement-guided Multi-Object Tracking framework (VSE-MOT). Specifically, we first design a tri-branch architecture that leverages a vision-language model to extract global visual semantic information from images and fuse it with query vectors. Subsequently, to further enhance the utilization of visual semantic information, we introduce the Multi-Object Tracking Adapter (MOT-Adapter) and the Visual Semantic Fusion Module (VSFM). The MOT-Adapter adapts the extracted global visual semantic information to suit multi-object tracking tasks, while the VSFM improves the efficacy of feature fusion. Through extensive experiments, we validate the effectiveness and superiority of the proposed method in real-world low-quality video scenarios. Its tracking performance metrics outperform those of existing methods by approximately 8% to 20%, while maintaining robust performance in conventional scenarios.

Summary

The research addresses the challenge of multi-object tracking (MOT) in low-quality video scenes, where the performance of existing methods degrades significantly. It proposes VSE-MOT, a framework guided by visual semantic enhancement. The method uses a tri-branch architecture in which a vision-language model extracts global visual semantic information that is fused with query vectors, and it introduces the MOT-Adapter and VSFM to make better use of that information. Experiments show that VSE-MOT outperforms existing methods by approximately 8% to 20% on tracking metrics in real-world low-quality scenarios while maintaining robust performance in conventional scenarios.

Visible Yet Unreadable: A Systematic Blind Spot of Vision Language Models Across Writing Systems

Authors: Jie Zhang, Ting Xu, Gelei Deng, Runyi Hu, Han Qiu, Tianwei Zhang, Qing Guo, Ivor Tsang

First: 2025-09-04T05:35:32+00:00 · Latest: 2025-09-17T13:47:40+00:00

Abstract

Writing is a universal cultural technology that reuses vision for symbolic communication. Humans display striking resilience: we readily recognize words even when characters are fragmented, fused, or partially occluded. This paper investigates whether advanced vision language models (VLMs) share this resilience. We construct two psychophysics-inspired benchmarks across distinct writing systems, Chinese logographs and English alphabetic words, by splicing, recombining, and overlaying glyphs to yield "visible but unreadable" stimuli for models while remaining legible to humans. Despite strong performance on clean text, contemporary VLMs show a severe drop under these perturbations, frequently producing unrelated or incoherent outputs. The pattern suggests a structural limitation: models heavily leverage generic visual invariances but under-rely on compositional priors needed for robust literacy. We release stimuli generation code, prompts, and evaluation protocols to facilitate transparent replication and follow-up work. Our findings motivate architectures and training strategies that encode symbol segmentation, composition, and binding across scripts, and they delineate concrete challenges for deploying multimodal systems in education, accessibility, cultural heritage, and security.

Summary

This paper investigates the resilience of advanced vision language models (VLMs) in recognizing fragmented or occluded text across different writing systems. By creating 'visible but unreadable' stimuli, the study finds that contemporary VLMs perform poorly under these conditions, indicating a structural limitation in leveraging compositional priors for robust literacy. The research suggests the need for architectures that encode symbol segmentation and composition across scripts, and highlights challenges for deploying multimodal systems in various fields.

Can Current AI Models Count What We Mean, Not What They See? A Benchmark and Systematic Evaluation

Authors: Gia Khanh Nguyen, Yifeng Huang, Minh Hoai

First: 2025-09-17T13:06:58+00:00 · Latest: 2025-09-17T13:06:58+00:00

Abstract

Visual counting is a fundamental yet challenging task, especially when users need to count objects of a specific type in complex scenes. While recent models, including class-agnostic counting models and large vision-language models (VLMs), show promise in counting tasks, their ability to perform fine-grained, intent-driven counting remains unclear. In this paper, we introduce PairTally, a benchmark dataset specifically designed to evaluate fine-grained visual counting. Each of the 681 high-resolution images in PairTally contains two object categories, requiring models to distinguish and count based on subtle differences in shape, size, color, or semantics. The dataset includes both inter-category (distinct categories) and intra-category (closely related subcategories) settings, making it suitable for rigorous evaluation of selective counting capabilities. We benchmark a variety of state-of-the-art models, including exemplar-based methods, language-prompted models, and large VLMs. Our results show that despite recent advances, current models struggle to reliably count what users intend, especially in fine-grained and visually ambiguous cases. PairTally provides a new foundation for diagnosing and improving fine-grained visual counting systems.

Summary

This paper introduces PairTally, a benchmark dataset for evaluating fine-grained visual counting, where models must distinguish and count objects based on subtle differences. The dataset includes both inter-category and intra-category settings, challenging models to perform selective counting. Experiments with various state-of-the-art models reveal that current AI models struggle to reliably count objects as intended, particularly in fine-grained and visually ambiguous scenarios.

Evolution Meets Diffusion: Efficient Neural Architecture Generation

Authors: Bingye Zhou, Caiyang Yu

First: 2025-04-24T03:09:04+00:00 · Latest: 2025-09-17T11:36:56+00:00

Abstract

Neural Architecture Search (NAS) has gained widespread attention for its transformative potential in deep learning model design. However, the vast and complex search space of NAS leads to significant computational and time costs. Neural Architecture Generation (NAG) addresses this by reframing NAS as a generation problem, enabling the precise generation of optimal architectures for specific tasks. Despite its promise, mainstream methods like diffusion models face limitations in global search capabilities and are still hindered by high computational and time demands. To overcome these challenges, we propose Evolutionary Diffusion-based Neural Architecture Generation (EDNAG), a novel approach that achieves efficient and training-free architecture generation. EDNAG leverages evolutionary algorithms to simulate the denoising process in diffusion models, using fitness to guide the transition from random Gaussian distributions to optimal architecture distributions. This approach combines the strengths of evolutionary strategies and diffusion models, enabling rapid and effective architecture generation. Extensive experiments demonstrate that EDNAG achieves state-of-the-art (SOTA) performance in architecture optimization, with an improvement in accuracy of up to 10.45%. Furthermore, it eliminates the need for time-consuming training and boosts inference speed by an average of 50 times, showcasing its exceptional efficiency and effectiveness.
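
The idea of using fitness to guide samples from Gaussian noise toward good solutions can be illustrated with a toy loop. The sketch below is not EDNAG's actual update rule: the 2-D search space, the quadratic `fitness`, the `OPTIMUM` point, and the `pull` and `noise` parameters are all hypothetical stand-ins chosen so the example runs without any architecture encoding or benchmark.

```python
import math
import random

OPTIMUM = (2.0, -1.0)  # hypothetical best point in a 2-D toy search space


def fitness(x):
    """Toy fitness: higher is better, maximal at OPTIMUM."""
    return -((x[0] - OPTIMUM[0]) ** 2 + (x[1] - OPTIMUM[1]) ** 2)


def init_population(size, rng):
    """Start from pure Gaussian noise, like the first step of a sampler."""
    return [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(size)]


def evolutionary_denoise(population, steps, rng, pull=0.5, noise=0.5):
    """Repeatedly move samples toward a fitness-weighted mean of the population."""
    for step in range(steps):
        scores = [fitness(x) for x in population]
        best = max(scores)
        # Softmax-style weights; subtracting the max keeps exp() stable.
        weights = [math.exp(s - best) for s in scores]
        total = sum(weights)
        target = [
            sum(w * x[d] for w, x in zip(weights, population)) / total
            for d in range(2)
        ]
        # Injected noise is annealed linearly and reaches zero at the last step.
        scale = noise * (1.0 - (step + 1) / steps)
        population = [
            [x[d] + pull * (target[d] - x[d]) + scale * rng.gauss(0.0, 1.0)
             for d in range(2)]
            for x in population
        ]
    return population


def mean_distance(population):
    """Average Euclidean distance of the population from OPTIMUM."""
    return sum(math.dist(x, OPTIMUM) for x in population) / len(population)
```

No model is trained anywhere in the loop; only fitness evaluations are needed, which mirrors the training-free property the abstract emphasizes.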

Summary

The paper proposes EDNAG, a method that combines evolutionary algorithms and diffusion models to efficiently generate neural architectures. It addresses the computational and time costs of Neural Architecture Search by leveraging the strengths of both approaches. EDNAG achieves state-of-the-art performance in architecture optimization with up to 10.45% improvement in accuracy and boosts inference speed by an average of 50 times.

Towards Rationale-Answer Alignment of LVLMs via Self-Rationale Calibration

Authors: Yuanchen Wu, Ke Yan, Shouhong Ding, Ziyin Zhou, Xiaoqiang Li

Venue: ICML 2025

First: 2025-09-17T11:27:33+00:00 · Latest: 2025-09-17T11:27:33+00:00

Comments: Accepted by ICML 2025

Abstract

Large Vision-Language Models (LVLMs) have manifested strong visual question answering capability. However, they still struggle with aligning the rationale and the generated answer, leading to inconsistent reasoning and incorrect responses. To this end, this paper introduces the Self-Rationale Calibration (SRC) framework to iteratively calibrate the alignment between rationales and answers. SRC begins by employing a lightweight "rationale fine-tuning" approach, which modifies the model's response format to require a rationale before deriving an answer without explicit prompts. Next, SRC searches for a diverse set of candidate responses from the fine-tuned LVLMs for each sample, followed by a proposed pairwise scoring strategy using a tailored scoring model, R-Scorer, to evaluate both rationale quality and factual consistency of candidates. Based on a confidence-weighted preference curation process, SRC decouples the alignment calibration into a preference fine-tuning manner, leading to significant improvements of LVLMs in perception, reasoning, and generalization across multiple benchmarks. Our results emphasize the rationale-oriented alignment in exploring the potential of LVLMs.

Summary

This paper addresses the issue of LVLMs generating answers that do not align with their rationales, leading to inconsistent reasoning. It introduces the Self-Rationale Calibration (SRC) framework, which fine-tunes the model to require a rationale before generating an answer. The framework then evaluates diverse candidate responses using a scoring model, R-Scorer, to improve alignment. Experiments show significant improvements in perception, reasoning, and generalization across multiple benchmarks.

EDITS: Enhancing Dataset Distillation with Implicit Textual Semantics

Authors: Qianxin Xia, Jiawei Du, Guoming Lu, Zhiyong Shu, Jielei Wang

First: 2025-09-17T09:48:39+00:00 · Latest: 2025-09-17T09:48:39+00:00

Abstract

Dataset distillation aims to synthesize a compact dataset from the original large-scale one, enabling highly efficient learning while preserving competitive model performance. However, traditional techniques primarily capture low-level visual features, neglecting the high-level semantic and structural information inherent in images. In this paper, we propose EDITS, a novel framework that exploits the implicit textual semantics within the image data to achieve enhanced distillation. First, external texts generated by a Vision Language Model (VLM) are fused with image features through a Global Semantic Query module, forming the prior clustered buffer. Local Semantic Awareness then selects representative samples from the buffer to construct image and text prototypes, with the latter produced by guiding a Large Language Model (LLM) with a meticulously crafted prompt. Ultimately, the Dual Prototype Guidance strategy generates the final synthetic dataset through a diffusion model. Extensive experiments confirm the effectiveness of our method. Source code is available at: https://github.com/einsteinxia/EDITS.

Summary

The paper proposes EDITS, a novel framework for dataset distillation that leverages implicit textual semantics within image data. External texts generated by a Vision Language Model are fused with image features to form a prior clustered buffer, from which representative samples are selected to construct image and text prototypes. The Dual Prototype Guidance strategy then generates the synthetic dataset using a diffusion model. Extensive experiments confirm the effectiveness of the method. Source code is available on GitHub.

SpecDiff: Accelerating Diffusion Model Inference with Self-Speculation

Authors: Jiayi Pan, Jiaming Xu, Yongkang Zhou, Guohao Dai

First: 2025-09-17T09:24:40+00:00 · Latest: 2025-09-17T09:24:40+00:00

Abstract

Feature caching has recently emerged as a promising method for diffusion model acceleration. It effectively alleviates the inefficiency problem caused by high computational requirements by caching similar features in the inference process of the diffusion model. In this paper, we analyze existing feature caching methods from the perspective of information utilization, and point out that relying solely on historical information will lead to constrained accuracy and speed performance. We propose a novel paradigm that introduces future information via self-speculation based on the information similarity at the same time step across different iteration times. Based on this paradigm, we present SpecDiff, a training-free multi-level feature caching strategy including a cached feature selection algorithm and a multi-level feature classification algorithm. (1) Feature selection algorithm based on self-speculative information. SpecDiff determines a dynamic importance score for each token based on self-speculative information and historical information, and performs cached feature selection through the importance score. (2) Multi-level feature classification algorithm based on feature importance scores. SpecDiff classifies tokens by leveraging the differences in feature importance scores and introduces a multi-level feature calculation strategy. Extensive experiments show that SpecDiff achieves average speedups of 2.80×, 2.74×, and 3.17× with negligible quality loss on Stable Diffusion 3, 3.5, and FLUX, respectively, compared to RFlow on an NVIDIA A800-80GB GPU. By merging speculative and historical information, SpecDiff overcomes the speedup-accuracy trade-off bottleneck, pushing the Pareto frontier of speedup and accuracy in efficient diffusion model inference.

Summary

The paper addresses the inefficiency of diffusion models by proposing SpecDiff, a training-free multi-level feature caching strategy. It introduces self-speculation to utilize future information and enhances feature selection and classification. SpecDiff achieves an average speedup of 2.80x, 2.74x, and 3.17x in Stable Diffusion 3, 3.5, and FLUX, respectively, with negligible quality loss compared to RFlow on NVIDIA A800-80GB GPU.

Diving into Mitigating Hallucinations from a Vision Perspective for Large Vision-Language Models

Authors: Weihang Wang, Xinhao Li, Ziyue Wang, Yan Pang, Jielei Zhang, Peiyi Li, Qiang Zhang, Longwen Gao

First: 2025-09-17T09:08:05+00:00 · Latest: 2025-09-17T09:08:05+00:00

Comments: Accepted to EMNLP 2025 Findings

Abstract

Object hallucination in Large Vision-Language Models (LVLMs) significantly impedes their real-world applicability. As the primary component for accurately interpreting visual information, the choice of visual encoder is pivotal. We hypothesize that the diverse training paradigms employed by different visual encoders instill them with distinct inductive biases, which leads to their diverse hallucination performances. Existing benchmarks typically focus on coarse-grained hallucination detection and fail to capture the diverse hallucinations elaborated in our hypothesis. To systematically analyze these effects, we introduce VHBench-10, a comprehensive benchmark with approximately 10,000 samples for evaluating LVLMs across ten fine-grained hallucination categories. Our evaluations confirm encoders exhibit unique hallucination characteristics. Building on these insights and the suboptimality of simple feature fusion, we propose VisionWeaver, a novel Context-Aware Routing Network. It employs global visual features to generate routing signals, dynamically aggregating visual features from multiple specialized experts. Comprehensive experiments confirm the effectiveness of VisionWeaver in significantly reducing hallucinations and improving overall model performance.
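
The routing step described above, where a global feature produces signals that weight several experts, resembles a standard softmax gate. The sketch below is a generic mixture-style aggregation under that assumption and is not VisionWeaver's actual architecture: the linear router, the `router_weights` matrix, and the plain-list feature vectors are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_and_aggregate(global_feature, router_weights, expert_features):
    """Weight each expert's feature vector by a gate computed from a global feature.

    global_feature: list of floats (e.g. a pooled image embedding).
    router_weights: one weight row per expert, same length as global_feature.
    expert_features: one feature vector per expert, all of equal length.
    Returns (gates, fused_feature).
    """
    logits = [
        sum(w * g for w, g in zip(row, global_feature))
        for row in router_weights
    ]
    gates = softmax(logits)
    dim = len(expert_features[0])
    fused = [
        sum(gate * feat[d] for gate, feat in zip(gates, expert_features))
        for d in range(dim)
    ]
    return gates, fused
```

Because the gates depend on the input image, different images can lean on different encoders, in contrast to a fixed concatenation or average of expert features.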

Summary

The paper addresses the issue of object hallucination in Large Vision-Language Models (LVLMs), which hinders their practical use. It introduces VHBench-10, a benchmark with approximately 10,000 samples to evaluate LVLMs across ten fine-grained hallucination categories. The study finds that different visual encoders have unique hallucination characteristics. To mitigate hallucinations, the authors propose VisionWeaver, a Context-Aware Routing Network that uses global visual features to generate routing signals and dynamically aggregate features from multiple specialized experts, reducing hallucinations and improving overall model performance.

CROP: Contextual Region-Oriented Visual Token Pruning

Authors: Jiawei Guo, Feifei Zhai, Pu Jian, Qianrun Wei, Yu Zhou

First: 2025-05-27T14:16:52+00:00 · Latest: 2025-09-17T08:06:44+00:00

Comments: EMNLP 2025 Main

Abstract

Current VLM-based VQA methods often process entire images, leading to excessive visual tokens that include redundant information irrelevant to the posed question. This abundance of unnecessary image details creates numerous visual tokens, drastically increasing memory and computational requirements in VLMs. To address this, we propose Contextual Region-Oriented Visual Token Pruning (CROP), a novel framework to compress visual tokens through a two-step process: Localization and Pruning. Specifically, CROP first employs an efficient model to identify the contextual region relevant to the input query. Subsequently, two distinct strategies are introduced for pruning: (1) Pre-LLM Compression (PLC), which adaptively compresses different image regions with varying ratios, and (2) Inner-LLM Pruning (ILP), a training-free method that prunes tokens within early LLM layers guided by the identified contextual region. Extensive experiments on a wide range of VQA tasks demonstrate that CROP significantly outperforms existing visual token pruning methods and achieves state-of-the-art performance.
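
Region-dependent compression before the LLM can be sketched as keeping every token inside the localized region and subsampling the rest. This is only an illustration of the general idea, not CROP's PLC module: the rectangular `region`, the fixed `outside_stride`, and the grid indexing are assumptions made for the example.

```python
def compress_tokens(grid_h, grid_w, region, outside_stride=2):
    """Select which visual tokens of a grid_h x grid_w patch grid to keep.

    region: (top, left, bottom, right) in patch coordinates, with bottom and
            right exclusive; a hypothetical output of the localization step.
    outside_stride: keep only every outside_stride-th row and column
                    outside the region.
    Returns a list of (row, col) indices in row-major order.
    """
    top, left, bottom, right = region
    kept = []
    for row in range(grid_h):
        for col in range(grid_w):
            inside = top <= row < bottom and left <= col < right
            on_coarse_grid = (
                row % outside_stride == 0 and col % outside_stride == 0
            )
            # Full resolution inside the region, coarse sampling elsewhere.
            if inside or on_coarse_grid:
                kept.append((row, col))
    return kept
```

For a 4 x 4 grid with the region covering the top-left 2 x 2 patches and a stride of 2, this keeps 7 of the 16 tokens: the 4 inside the region plus 3 coarse samples outside it.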

中文标题/摘要

标题：CROP：基于上下文区域的视觉标记剪枝

当前基于VLM的VQA方法通常处理整个图像，导致包含与提出的问题无关的冗余信息的过多视觉标记。这些不必要的图像细节产生了大量的视觉标记，大幅增加了VLM的内存和计算需求。为了解决这一问题，我们提出了基于上下文区域的视觉标记剪枝（CROP），这是一种通过两步过程压缩视觉标记的新框架：定位和剪枝。具体而言，CROP 首先使用高效的模型识别与输入查询相关的上下文区域。随后，引入了两种不同的剪枝策略：（1）预LLM压缩（PLC），以不同的压缩比自适应地压缩不同的图像区域；（2）内LLM剪枝（ILP），这是一种无需训练的方法，在已识别的上下文区域的引导下对LLM早期层中的标记进行剪枝。在多种VQA任务上的广泛实验表明，CROP 显著优于现有的视觉标记剪枝方法，并实现了最先进的性能。

Summary / 总结

CROP is a novel framework for compressing visual tokens in VQA tasks by identifying and pruning unnecessary image regions. It uses a two-step process: Localization and Pruning. Specifically, CROP first locates the relevant contextual region and then applies Pre-LLM Compression and Inner-LLM Pruning to reduce the number of visual tokens. Experiments show that CROP outperforms existing methods and achieves state-of-the-art performance on various VQA tasks.

CROP 是一种通过识别相关上下文区域并修剪不必要的视觉标记来压缩 VQA 任务中视觉标记的新框架。它采用两步过程：定位和修剪。该框架在各种 VQA 任务上显著优于现有方法，并达到了最先进的性能。
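
The region-guided compression idea can be illustrated with a small sketch. This is not the authors' implementation: the function name, the `(row0, col0, row1, col1)` region format, and the "keep every k-th token outside the region" rule are all assumptions made for illustration. It only shows how a localized region can drive different compression ratios inside and outside that region.

```python
def prune_tokens_by_region(grid_h, grid_w, region, keep_outside_every=4):
    """Return the flat indices of visual tokens to keep.

    region is (row0, col0, row1, col1) in patch coordinates, with the
    end row/column exclusive (a hypothetical format for this sketch).
    Tokens inside the region are always kept; outside the region only
    every `keep_outside_every`-th token survives.
    """
    row0, col0, row1, col1 = region
    kept = []
    outside_seen = 0  # counts tokens outside the region, in raster order
    for r in range(grid_h):
        for c in range(grid_w):
            idx = r * grid_w + c
            if row0 <= r < row1 and col0 <= c < col1:
                kept.append(idx)  # query-relevant token: no compression
            else:
                if outside_seen % keep_outside_every == 0:
                    kept.append(idx)  # sparse sample of the background
                outside_seen += 1
    return kept


# A 4x4 patch grid whose central 2x2 block is the contextual region.
kept = prune_tokens_by_region(4, 4, region=(1, 1, 3, 3), keep_outside_every=4)
```

In this toy setting 7 of the 16 tokens are kept: all 4 inside the region and 3 of the 12 outside it.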

BWCache: Accelerating Video Diffusion Transformers through Block-Wise Caching

Authors: Hanshuai Cui, Zhiqing Tang, Zhifei Xu, Zhi Yao, Wenyi Zeng, Weijia Jia

First: 2025-09-17T07:58:36+00:00 · Latest: 2025-09-17T07:58:36+00:00

Abstract

Recent advancements in Diffusion Transformers (DiTs) have established them as the state-of-the-art method for video generation. However, their inherently sequential denoising process results in inevitable latency, limiting real-world applicability. Existing acceleration methods either compromise visual quality due to architectural modifications or fail to reuse intermediate features at proper granularity. Our analysis reveals that DiT blocks are the primary contributors to inference latency. Across diffusion timesteps, the feature variations of DiT blocks exhibit a U-shaped pattern with high similarity during intermediate timesteps, which suggests substantial computational redundancy. In this paper, we propose Block-Wise Caching (BWCache), a training-free method to accelerate DiT-based video generation. BWCache dynamically caches and reuses features from DiT blocks across diffusion timesteps. Furthermore, we introduce a similarity indicator that triggers feature reuse only when the differences between block features at adjacent timesteps fall below a threshold, thereby minimizing redundant computations while maintaining visual fidelity. Extensive experiments on several video diffusion models demonstrate that BWCache achieves up to 2.24$\times$ speedup with comparable visual quality.

中文标题/摘要

标题：BWCache：通过块级缓存加速视频扩散变换器

近期扩散变换器（DiTs）的发展已将其确立为视频生成的最新方法。然而，其固有的顺序去噪过程不可避免地导致了延迟，限制了其实用性。现有的加速方法要么因架构修改而牺牲视觉质量，要么无法在适当粒度上重用中间特征。我们的分析表明，DiT块是推理延迟的主要来源。在扩散时间步中，DiT块的特征变化呈现出U形模式，在中间时间步具有高度相似性，这表明存在大量的计算冗余。在本文中，我们提出了一种无需训练的块级缓存（BWCache）方法，以加速基于DiT的视频生成。BWCache动态地跨扩散时间步缓存和重用DiT块的特征。此外，我们引入了一个相似性指标，仅在相邻时间步块特征之间的差异低于阈值时触发特征重用，从而在保持视觉保真度的同时最小化冗余计算。在几种视频扩散模型上的广泛实验表明，BWCache实现了最高2.24倍的加速，视觉质量相当。

Summary / 总结

BWCache is a training-free method that accelerates Diffusion Transformer (DiT) video generation by caching and reusing DiT block features across diffusion timesteps. It is motivated by the observation that block feature variations follow a U-shaped pattern, with high similarity at intermediate timesteps, which indicates substantial computational redundancy. A similarity indicator triggers reuse only when the difference between block features at adjacent timesteps falls below a threshold. Across several video diffusion models, BWCache achieves up to a 2.24× speedup with comparable visual quality.

BWCache 通过在扩散时间步之间缓存和重用来自 DiT 块的特征来加速视频扩散变换器（DiTs），减少计算冗余。该方法基于相似性指标动态缓存特征，并仅在相邻时间步之间的差异低于阈值时触发重用，从而最小化冗余计算同时保持视觉保真度。实验表明，BWCache 可以实现最高 2.24 倍的加速，视觉质量相当。
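
The threshold-triggered reuse described above can be sketched as follows. This is a minimal illustration, not the BWCache code: the relative L1 indicator, the choice to compare block inputs against the last computed input, and all names are assumptions.

```python
def relative_l1(prev, curr):
    """Relative L1 change between two feature vectors (assumed indicator)."""
    num = sum(abs(a - b) for a, b in zip(prev, curr))
    den = sum(abs(a) for a in prev) or 1.0
    return num / den


def run_with_block_cache(block, inputs_per_step, threshold):
    """Run `block` over a sequence of timestep inputs with feature reuse.

    The block is recomputed only when its input has drifted from the last
    computed input by at least `threshold`; otherwise the cached output
    is reused. Returns (outputs, number_of_real_block_evaluations).
    """
    outputs = []
    cached_in = None
    cached_out = None
    computed = 0
    for x in inputs_per_step:
        if cached_in is not None and relative_l1(cached_in, x) < threshold:
            outputs.append(cached_out)  # similar enough: reuse features
        else:
            cached_out = block(x)  # too different: recompute and refresh
            cached_in = x
            computed += 1
            outputs.append(cached_out)
    return outputs, computed


# Toy "block" and four timesteps; steps 2 and 4 barely differ from 1 and 3.
toy_block = lambda x: [2.0 * v for v in x]
steps = [[1.0, 1.0], [1.01, 1.0], [2.0, 2.0], [2.0, 2.01]]
outputs, computed = run_with_block_cache(toy_block, steps, threshold=0.05)
```

Here only two of the four steps run the block. The threshold trades speed against fidelity: a larger value reuses more aggressively at the cost of staler features.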

Data-Efficient Fine-Tuning of Vision-Language Models for Diagnosis of Alzheimer's Disease

Authors: Fangqi Cheng, Surajit Ray, Xiaochen Yang

First: 2025-09-09T11:36:21+00:00 · Latest: 2025-09-17T07:27:33+00:00

Abstract

Medical vision-language models (Med-VLMs) have shown impressive results in tasks such as report generation and visual question answering, but they still face several limitations. Most notably, they underutilize patient metadata and lack integration of clinical diagnostic knowledge. Moreover, most existing models are typically trained from scratch or fine-tuned on large-scale 2D image-text pairs, requiring extensive computational resources, and their effectiveness on 3D medical imaging is often limited due to the absence of structural information. To address these gaps, we propose a data-efficient fine-tuning pipeline to adapt 3D CT-based Med-VLMs for 3D MRI and demonstrate its application in Alzheimer's disease (AD) diagnosis. Our system introduces two key innovations. First, we convert structured metadata into synthetic reports, enriching textual input for improved image-text alignment. Second, we add an auxiliary token trained to predict the mini-mental state examination (MMSE) score, a widely used clinical measure of cognitive function that correlates with AD severity. This provides additional supervision for fine-tuning. Applying lightweight prompt tuning to both image and text modalities, our approach achieves state-of-the-art performance on two AD datasets using 1,500 training images, outperforming existing methods fine-tuned on 10,000 images. Code will be released upon publication.

中文标题/摘要

标题：数据高效调优视觉-语言模型以诊断阿尔茨海默病

医学视觉-语言模型（Med-VLMs）在报告生成和视觉问答等任务中取得了令人印象深刻的成果，但仍然面临一些限制。最显著的是，它们未能充分利用患者元数据，并缺乏临床诊断知识的整合。此外，大多数现有模型通常从头开始训练或在大规模2D图像-文本对上进行微调，需要大量的计算资源，而且由于缺乏结构信息，它们在3D医学成像上的效果往往有限。为了解决这些差距，我们提出了一种数据高效的微调流水线，将基于3D CT的Med-VLMs适配到3D MRI，并展示了其在阿尔茨海默病（AD）诊断中的应用。我们的系统引入了两个关键创新。首先，我们将结构化元数据转换为合成报告，丰富了文本输入，以改善图像-文本对齐。其次，我们添加了一个辅助标记，用于预测简易精神状态检查（MMSE）分数，这是一种广泛使用的临床认知功能测量指标，与AD严重程度相关。这为微调提供了额外的监督。通过对图像和文本模态进行轻量级提示调优，我们的方法在两个AD数据集上使用1,500张训练图像达到了最先进的性能，优于在10,000张图像上微调的现有方法。代码将在发表后发布。

Summary / 总结

This study addresses the limitations of medical vision-language models by proposing a data-efficient fine-tuning pipeline for 3D CT-based models to be applied to 3D MRI, specifically for Alzheimer's disease diagnosis. The method involves converting structured metadata into synthetic reports and adding an auxiliary token to predict the MMSE score, providing additional supervision. The approach achieves state-of-the-art performance on two AD datasets using only 1,500 training images, outperforming existing methods fine-tuned on 10,000 images.

该研究提出了一种数据高效的微调流水线，将基于3D CT的医学视觉-语言模型适配到3D MRI，并用于阿尔茨海默病诊断。方法包括将结构化元数据转换为合成报告，并添加一个辅助标记以预测MMSE分数，提供额外的监督。该方法仅使用1,500张训练图像就在两个AD数据集上实现了最先进的性能，优于在10,000张图像上微调的现有方法。
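
As a rough illustration of turning structured metadata into a synthetic report, the sketch below renders a metadata record as short sentences. The field names (`age`, `sex`, `education_years`) and the sentence templates are hypothetical; the abstract does not say which fields or wording the authors use.

```python
def metadata_to_report(meta):
    """Render a structured metadata record as a short free-text report.

    Required keys in this sketch: 'age', 'sex'.
    Optional key: 'education_years'. All keys are hypothetical.
    """
    sentences = [f"The patient is a {meta['age']}-year-old {meta['sex']}."]
    if "education_years" in meta:
        sentences.append(
            f"They completed {meta['education_years']} years of education."
        )
    return " ".join(sentences)


report = metadata_to_report({"age": 74, "sex": "female", "education_years": 12})
```

The resulting text could then be paired with the scan as the textual input for image-text alignment. The MMSE score is deliberately left out of the report here, since in the described pipeline it is a prediction target rather than an input.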

Iterative Prompt Refinement for Safer Text-to-Image Generation

Authors: Jinwoo Jeon, JunHyeok Oh, Hayeong Lee, Byung-Jun Lee

First: 2025-09-17T07:16:06+00:00 · Latest: 2025-09-17T07:16:06+00:00

Abstract

Text-to-Image (T2I) models have made remarkable progress in generating images from text prompts, but their output quality and safety still depend heavily on how prompts are phrased. Existing safety methods typically refine prompts using large language models (LLMs), but they overlook the images produced, which can result in unsafe outputs or unnecessary changes to already safe prompts. To address this, we propose an iterative prompt refinement algorithm that uses Vision Language Models (VLMs) to analyze both the input prompts and the generated images. By leveraging visual feedback, our method refines prompts more effectively, improving safety while maintaining user intent and reliability comparable to existing LLM-based approaches. Additionally, we introduce a new dataset labeled with both textual and visual safety signals using an off-the-shelf multi-modal LLM, enabling supervised fine-tuning. Experimental results demonstrate that our approach produces safer outputs without compromising alignment with user intent, offering a practical solution for generating safer T2I content. Our code is available at https://github.com/ku-dmlab/IPR. WARNING: This paper contains examples of harmful or inappropriate images generated by models.

中文标题/摘要

标题：迭代提示精炼以提高文本到图像生成的安全性

文本到图像（T2I）模型在从文本提示生成图像方面取得了显著进展，但其输出质量和安全性仍然高度依赖于提示的表述方式。现有的安全方法通常使用大型语言模型（LLMs）来精炼提示，但它们忽略了生成的图像，可能导致不安全的输出或对已经安全的提示进行不必要的更改。为了解决这个问题，我们提出了一种迭代提示精炼算法，该算法使用视觉语言模型（VLMs）来分析输入提示和生成的图像。通过利用视觉反馈，我们的方法能够更有效地精炼提示，提高安全性同时保持与用户意图和现有基于LLM方法相当的可靠性。此外，我们还引入了一个新的数据集，该数据集使用现成的多模态LLM进行标注，包含文本和视觉安全信号，以实现监督微调。实验结果表明，我们的方法能够在不牺牲与用户意图的对齐性的情况下生成更安全的输出，提供了一种生成更安全T2I内容的实用解决方案。我们的代码可在 https://github.com/ku-dmlab/IPR 获取。警告：本论文包含由模型生成的有害或不适当图像的示例。

Summary / 总结

The paper addresses the issue of safety in text-to-image generation by proposing an iterative prompt refinement algorithm that uses Vision Language Models (VLMs) to analyze both input prompts and generated images. This method improves safety while maintaining user intent and reliability comparable to existing approaches. The authors introduce a new dataset with both textual and visual safety signals for supervised fine-tuning, and experimental results show that their approach produces safer outputs without compromising alignment with user intent.

论文提出了一种迭代提示精炼算法，用于更安全的文本到图像生成。该算法使用视觉语言模型分析提示和生成的图像，从而提高安全性同时保留用户意图。实验结果表明，该方法能够生成更安全的输出，同时不失去与用户意图的一致性，提供了一种生成更安全T2I内容的实用解决方案。
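
The generate, inspect, and rewrite loop can be sketched as below. This is a minimal sketch under stated assumptions: `generate`, `is_safe`, and `rewrite` are hypothetical callables standing in for the T2I model and the VLM, and the toy stand-ins operate on strings only. It is not the algorithm from the linked repository.

```python
def refine_prompt(prompt, generate, is_safe, rewrite, max_rounds=3):
    """Iteratively refine a prompt using feedback from the generated image.

    Returns (final_prompt, rounds_used). A prompt whose image already
    passes the check is returned unchanged with rounds_used == 0. If the
    budget runs out, the last rewritten prompt is returned unchecked.
    """
    for rounds_used in range(max_rounds):
        image = generate(prompt)
        if is_safe(prompt, image):  # judge the prompt AND the image
            return prompt, rounds_used
        prompt = rewrite(prompt, image)  # rewrite only when needed
    return prompt, max_rounds


# Toy stand-ins (assumptions): an "image" is just a string here.
def toy_generate(prompt):
    return f"<image of {prompt}>"

def toy_is_safe(prompt, image):
    return "gore" not in image

def toy_rewrite(prompt, image):
    return prompt.replace("gore", "confetti")


safe_result = refine_prompt("a quiet beach", toy_generate, toy_is_safe, toy_rewrite)
fixed_result = refine_prompt("a party with gore", toy_generate, toy_is_safe, toy_rewrite)
```

Checking the generated image before rewriting is what lets an already safe prompt pass through untouched, which is the failure mode of text-only refinement that the abstract points out.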

DREAM: Domain-aware Reasoning for Efficient Autonomous Underwater Monitoring

Authors: Zhenqi Wu, Abhinav Modi, Angelos Mavrogiannis, Kaustubh Joshi, Nikhil Chopra, Yiannis Aloimonos, Nare Karapetyan, Ioannis Rekleitis, Xiaomin Lin

Venue: ICRA 2026

First: 2025-09-17T03:35:52+00:00 · Latest: 2025-09-17T03:35:52+00:00

Comments: submitted to ICRA 2026

Abstract

The ocean is warming and acidifying, increasing the risk of mass mortality events for temperature-sensitive shellfish such as oysters. This motivates the development of long-term monitoring systems. However, human labor is costly and long-duration underwater work is highly hazardous, thus favoring robotic solutions as a safer and more efficient option. To enable underwater robots to make real-time, environment-aware decisions without human intervention, we must equip them with an intelligent "brain." This highlights the need for persistent, wide-area, and low-cost benthic monitoring. To this end, we present DREAM, a Vision Language Model (VLM)-guided autonomy framework for long-term underwater exploration and habitat monitoring. The results show that our framework is highly efficient in finding and exploring target objects (e.g., oysters, shipwrecks) without prior location information. In the oyster-monitoring task, our framework takes 31.5% less time than the previous baseline with the same amount of oysters. Compared to the vanilla VLM, it uses 23% fewer steps while covering 8.88% more oysters. In shipwreck scenes, our framework successfully explores and maps the wreck without collisions, requiring 27.5% fewer steps than the vanilla model and achieving 100% coverage, while the vanilla model achieves 60.23% average coverage in our shipwreck environments.

中文标题/摘要

标题：DREAM：域感知推理以实现高效的自主水下监测

海洋正在变暖和酸化，增加了对温度敏感的贝类如牡蛎的大规模死亡事件的风险。这促使了长期监测系统的开发。然而，人力成本高昂且长时间的水下工作极具危险性，因此更倾向于采用机器人解决方案作为更安全和更高效的选项。为了使水下机器人能够在没有人类干预的情况下做出实时、环境感知的决策，我们必须为它们装备一个智能的“大脑”。这突显了持续、广域和低成本底栖监测的必要性。为此，我们提出了DREAM，一种由视觉语言模型(VLM)引导的长期水下探索和栖息地监测自主框架。结果显示，我们的框架在没有先验位置信息的情况下，能够高效地发现和探索目标物体（如牡蛎、沉船）。在牡蛎监测任务中，与之前的基线相比，我们的框架在相同数量的牡蛎下节省了31.5%的时间。与纯VLM相比，它使用23%更少的步骤，但覆盖了8.88%更多的牡蛎。在沉船场景中，我们的框架成功地探索和绘制了沉船，没有发生碰撞，所需的步骤比纯模型少了27.5%，并且实现了100%的覆盖，而纯模型在我们的沉船环境中平均覆盖率为60.23%。

Summary / 总结

The research develops an autonomous underwater monitoring system, motivated by the risk that ocean warming and acidification pose to temperature-sensitive shellfish such as oysters. It introduces DREAM, a Vision Language Model (VLM)-guided autonomy framework for long-term underwater exploration and habitat monitoring. In the oyster-monitoring task, DREAM takes 31.5% less time than the previous baseline, and it uses 23% fewer steps than the vanilla VLM while covering 8.88% more oysters. In shipwreck scenes, it needs 27.5% fewer steps than the vanilla model and reaches 100% coverage, against 60.23% average coverage for the vanilla model.

DREAM 是一种由视觉语言模型（VLM）引导的自主框架，用于长期的水下探索和栖息地监测，其动机是对牡蛎等温度敏感贝类进行持续、高效监测的需求。在牡蛎监测任务中，它比之前的基线节省了31.5%的时间；与原始VLM相比，步骤减少了23%，同时多覆盖了8.88%的牡蛎。在沉船场景中，它比原始VLM少用27.5%的步骤，并实现了100%的覆盖率，而原始VLM的平均覆盖率为60.23%。

Gaussian Alignment for Relative Camera Pose Estimation via Single-View Reconstruction

Authors: Yumin Li, Dylan Campbell

First: 2025-09-17T02:57:34+00:00 · Latest: 2025-09-17T02:57:34+00:00

Comments: 12 pages, 4 figures, accepted by AJCAI 2025

Abstract

Estimating metric relative camera pose from a pair of images is of great importance for 3D reconstruction and localisation. However, conventional two-view pose estimation methods are not metric, with camera translation known only up to a scale, and struggle with wide baselines and textureless or reflective surfaces. This paper introduces GARPS, a training-free framework that casts this problem as the direct alignment of two independently reconstructed 3D scenes. GARPS leverages a metric monocular depth estimator and a Gaussian scene reconstructor to obtain a metric 3D Gaussian Mixture Model (GMM) for each image. It then refines an initial pose from a feed-forward two-view pose estimator by optimising a differentiable GMM alignment objective. This objective jointly considers geometric structure, view-independent colour, anisotropic covariance, and semantic feature consistency, and is robust to occlusions and texture-poor regions without requiring explicit 2D correspondences. Extensive experiments on the RealEstate10K dataset demonstrate that GARPS outperforms both classical and state-of-the-art learning-based methods, including MASt3R. These results highlight the potential of bridging single-view perception with multi-view geometry to achieve robust and metric relative pose estimation.

中文标题/摘要

标题：基于单视图重建的高斯对齐方法用于相对相机姿态估计

从一对图像中估计度量相对相机姿态对于三维重建和定位非常重要。然而，传统的双视图姿态估计方法不是度量的，相机平移只能确定到相差一个尺度因子，并且在宽基线、无纹理或反射表面的情况下表现不佳。本文介绍了一种无需训练的框架GARPS，将该问题视为两个独立重建的三维场景的直接对齐。GARPS 利用度量单目深度估计器和高斯场景重建器为每张图像获得一个度量的3D高斯混合模型（GMM）。然后，它通过优化可微的GMM对齐目标，对前馈双视图姿态估计器给出的初始姿态进行细化。该目标同时考虑几何结构、视点无关颜色、各向异性协方差和语义特征一致性，并且在不需要显式2D对应关系的情况下对遮挡和纹理贫乏区域具有鲁棒性。在RealEstate10K数据集上的大量实验表明，GARPS 优于经典方法和最先进的基于学习的方法（包括MASt3R）。这些结果突显了将单视图感知与多视图几何相结合以实现稳健且度量的相对姿态估计的潜力。

Summary / 总结

The paper addresses the challenge of estimating metric relative camera pose from a pair of images, which is crucial for 3D reconstruction and localization. It proposes GARPS, a training-free framework that aligns two independently reconstructed 3D scenes using a metric monocular depth estimator and a Gaussian scene reconstructor. The method optimizes a differentiable GMM alignment objective that considers geometric structure, view-independent color, anisotropic covariance, and semantic feature consistency, making it robust to occlusions and texture-poor regions. Experiments on the RealEstate10K dataset show that GARPS outperforms both classical and state-of-the-art learning-based methods, including MASt3R.

GARPS 是一个无需训练的框架，用于从一对图像中估计具有度量的相对相机姿态。该框架使用单目深度估计器和高斯场景重建器为每张图像生成一个具有度量的 3D 高斯混合模型 (GMM)。该框架通过优化一个考虑几何结构、颜色、协方差和语义特征一致性的可微 GMM 对齐目标来细化初始姿态。实验结果表明，GARPS 在 RealEstate10K 数据集上优于经典和最先进的基于学习的方法。

Enhancing Generalization in Vision-Language-Action Models by Preserving Pretrained Representations

Authors: Shresth Grover, Akshay Gopalkrishnan, Bo Ai, Henrik I. Christensen, Hao Su, Xuanlin Li

First: 2025-09-14T20:08:56+00:00 · Latest: 2025-09-17T02:41:34+00:00

Comments: Project Page: https://gen-vla.github.io/

Abstract

Vision-language-action (VLA) models finetuned from vision-language models (VLMs) hold the promise of leveraging rich pretrained representations to build generalist robots across diverse tasks and environments. However, direct fine-tuning on robot data often disrupts these representations and limits generalization. We present a framework that better preserves pretrained features while adapting them for robot manipulation. Our approach introduces three components: (i) a dual-encoder design with one frozen vision encoder to retain pretrained features and another trainable for task adaptation, (ii) a string-based action tokenizer that casts continuous actions into character sequences aligned with the model's pretraining domain, and (iii) a co-training strategy that combines robot demonstrations with vision-language datasets emphasizing spatial reasoning and affordances. Evaluations in simulation and on real robots show that our method improves robustness to visual perturbations, generalization to novel instructions and environments, and overall task success compared to baselines.
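
Component (ii), the string-based action tokenizer, can be pictured with a small sketch. The fixed two-decimal format and the space separator are assumptions; the abstract only states that continuous actions are cast into character sequences.

```python
def encode_action(action, decimals=2):
    """Cast a continuous action vector into a plain character sequence."""
    return " ".join(f"{value:.{decimals}f}" for value in action)


def decode_action(text):
    """Parse the character sequence back into a list of floats."""
    return [float(token) for token in text.split()]


# A hypothetical 3-dimensional action (e.g. part of an end-effector delta).
encoded = encode_action([0.1234, -0.5, 1.0])
decoded = decode_action(encoded)
```

Because the encoded action is ordinary text, it can be tokenized by the language model's existing vocabulary rather than by newly added action tokens. The cost is quantization: the decoded action matches the original only up to the chosen number of decimals.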

Singular Value Few-shot Adaptation of Vision-Language Models

Authors: Taha Koleilat, Hassan Rivaz, Yiming Xiao

First: 2025-09-03T22:00:23+00:00 · Latest: 2025-09-16T23:58:52+00:00

Comments: 10 pages, 2 figures, 8 tables

Abstract

Vision-language models (VLMs) like CLIP have shown impressive zero-shot and few-shot learning capabilities across diverse applications. However, adapting these models to new fine-grained domains remains difficult due to reliance on prompt engineering and the high cost of full model fine-tuning. Existing adaptation approaches rely on augmented components, such as prompt tokens and adapter modules, which could limit adaptation quality, destabilize the model, and compromise the rich knowledge learned during pretraining. In this work, we present CLIP-SVD, a novel multi-modal and parameter-efficient adaptation technique that leverages Singular Value Decomposition (SVD) to modify the internal parameter space of CLIP without injecting additional modules. Specifically, we fine-tune only the singular values of the CLIP parameter matrices to rescale the basis vectors for domain adaptation while retaining the pretrained model. This design enables enhanced adaptation performance using only 0.04% of the model's total parameters and better preservation of its generalization ability. CLIP-SVD achieves state-of-the-art classification results on 11 natural and 10 biomedical datasets, outperforming previous methods in both accuracy and generalization under few-shot settings. Additionally, we leverage a natural language-based approach to analyze the effectiveness and dynamics of the CLIP adaptation to allow interpretability of CLIP-SVD. The code is publicly available at https://github.com/HealthX-Lab/CLIP-SVD.

中文标题/摘要

标题：视觉-语言模型的奇异值少样本适应

视觉-语言模型（VLMs）如CLIP在多种应用中展示了令人印象深刻的零样本和少样本学习能力。然而，将这些模型适应到新的细粒度领域仍然困难，因为依赖于提示工程和全模型微调的高成本。现有的适应方法依赖于增强组件，如提示令牌和适配模块，这可能会限制适应质量，使模型不稳定，并损害预训练期间学到的丰富知识。在本文中，我们提出了CLIP-SVD，这是一种新颖的多模态和参数高效的适应技术，利用奇异值分解（SVD）修改CLIP的内部参数空间，而不注入额外模块。具体来说，我们仅微调CLIP参数矩阵的奇异值以重新缩放基向量进行领域适应，同时保留预训练模型。此设计仅使用模型总参数的0.04%即可实现增强的适应性能，并更好地保留其泛化能力。CLIP-SVD在11个自然和10个生物医学数据集上实现了最先进的分类结果，在少样本设置中在准确性和泛化方面均优于先前的方法。此外，我们利用基于自然语言的方法分析CLIP适应的有效性和动态，以实现CLIP-SVD的可解释性。代码可在 https://github.com/HealthX-Lab/CLIP-SVD 上公开获取。

Summary / 总结

The research aims to improve the adaptation of vision-language models like CLIP to new fine-grained domains with minimal parameter changes. The method uses Singular Value Decomposition (SVD) to fine-tune only the singular values of CLIP's parameter matrices, enabling domain adaptation with only 0.04% of the model's parameters. This approach achieves state-of-the-art results on 11 natural and 10 biomedical datasets, outperforming previous methods in both accuracy and generalization under few-shot settings. Additionally, a natural language-based approach is used to analyze the adaptation dynamics for interpretability.

该研究旨在解决使用CLIP等视觉-语言模型在新领域进行有限微调时的适应性问题。它提出了CLIP-SVD，一种参数高效的适应技术，利用奇异值分解修改CLIP的内部参数空间，而不添加额外模块。CLIP-SVD在21个数据集上实现了最先进的分类结果，展示了仅微调模型0.04%参数的优越适应性和泛化能力。
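
The reparameterization behind singular-value-only tuning can be shown on a toy 2×2 matrix. This is a sketch, not the CLIP-SVD code: the factors U and Vᵀ are hand-picked rotation matrices rather than the result of decomposing a real CLIP weight, and no training loop is shown. The point is only that W = U·diag(s)·Vᵀ can be adapted by changing s while U and Vᵀ stay frozen.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def rotation(theta):
    """A 2x2 rotation matrix; used here as a stand-in orthogonal factor."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def compose(u, singular_values, vt):
    """Rebuild W = U @ diag(s) @ Vt from its factors."""
    n = len(singular_values)
    sigma = [[singular_values[i] if i == j else 0.0 for j in range(n)]
             for i in range(n)]
    return matmul(matmul(u, sigma), vt)


def frobenius_sq(w):
    """Squared Frobenius norm, which equals the sum of squared singular values."""
    return sum(v * v for row in w for v in row)


U = rotation(0.3)       # frozen left singular vectors (illustrative)
Vt = rotation(-1.1)     # frozen right singular vectors (illustrative)
s_pretrained = [2.0, 0.5]
s_adapted = [2.2, 0.4]  # only these values would be updated in fine-tuning

W_pretrained = compose(U, s_pretrained, Vt)
W_adapted = compose(U, s_adapted, Vt)
trainable_fraction = len(s_pretrained) / (2 * 2)  # d values out of d*d entries
```

For a d×d matrix only d of the d² entries are trainable, so the fraction shrinks as matrices grow. The 0.04% figure in the abstract refers to the full model and is not derived here.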

Intelligent Healthcare Imaging Platform An VLM-Based Framework for Automated Medical Image Analysis and Clinical Report Generation

Authors: Samer Al-Hamadani

First: 2025-09-16T23:15:44+00:00 · Latest: 2025-09-16T23:15:44+00:00

Comments: 32 pages, 14 figures, 6 tables

Abstract

The rapid advancement of artificial intelligence (AI) in healthcare imaging has revolutionized diagnostic medicine and clinical decision-making processes. This work presents an intelligent multimodal framework for medical image analysis that leverages Vision-Language Models (VLMs) in healthcare diagnostics. The framework integrates Google Gemini 2.5 Flash for automated tumor detection and clinical report generation across multiple imaging modalities including CT, MRI, X-ray, and Ultrasound. The system combines visual feature extraction with natural language processing to enable contextual image interpretation, incorporating coordinate verification mechanisms and probabilistic Gaussian modeling for anomaly distribution. Multi-layered visualization techniques generate detailed medical illustrations, overlay comparisons, and statistical representations to enhance clinical confidence, with location measurement achieving 80 pixels average deviation. Result processing utilizes precise prompt engineering and textual analysis to extract structured clinical information while maintaining interpretability. Experimental evaluations demonstrated high performance in anomaly detection across multiple modalities. The system features a user-friendly Gradio interface for clinical workflow integration and demonstrates zero-shot learning capabilities to reduce dependence on large datasets. This framework represents a significant advancement in automated diagnostic support and radiological workflow efficiency, though clinical validation and multi-center evaluation are necessary prior to widespread adoption.

中文标题/摘要

标题：智能医疗成像平台基于VLM的自动化医学图像分析和临床报告生成框架

医疗成像中人工智能（AI）的迅速发展已经革新了诊断医学和临床决策过程。本研究提出了一种基于视觉语言模型（VLMs）的智能多模态医学图像分析框架。该框架利用Google Gemini 2.5 Flash实现跨CT、MRI、X光和超声等多种成像模态的自动化肿瘤检测和临床报告生成。系统结合视觉特征提取与自然语言处理，实现上下文图像解释，采用坐标验证机制和概率高斯建模来描述异常分布。多层可视化技术生成详细的医学插图、叠加比较和统计表示，以增强临床信心，位置测量平均偏差为80像素。结果处理利用精确的提示工程和文本分析提取结构化的临床信息，同时保持可解释性。实验评估表明，该系统在多种模态下具有高异常检测性能。该系统具有用户友好的Gradio界面，用于临床工作流程集成，并展示了零样本学习能力，减少对大数据集的依赖。该框架代表了自动化诊断支持和放射学工作流程效率的重要进步，但临床验证和多中心评估是广泛采用前的必要步骤。

Summary / 总结

This work introduces an intelligent multimodal framework for medical image analysis using Vision-Language Models (VLMs) to automate tumor detection and clinical report generation across various imaging modalities. The system combines visual feature extraction with natural language processing, incorporating coordinate verification and probabilistic modeling. Experimental evaluations showed high performance in anomaly detection across multiple modalities, with location measurement achieving an 80-pixel average deviation. The framework includes a user-friendly Gradio interface and demonstrates zero-shot learning capabilities, though further clinical validation is needed.

该研究提出了一种使用视觉-语言模型（VLM）的智能多模态框架，用于自动化多种成像模态下的肿瘤检测和临床报告生成。系统结合了视觉特征提取和自然语言处理，采用了坐标验证和概率高斯建模。实验评估显示，该系统在多种模态下的异常检测性能优异，具有用户友好的Gradio界面和零样本学习能力。但需进行临床验证和多中心评估才能广泛采用。

Chunked TabPFN: Exact Training-Free In-Context Learning for Long-Context Tabular Data

Authors: Renat Sergazinov, Shao-An Yin

First: 2025-08-30T02:57:01+00:00 · Latest: 2025-09-16T22:27:26+00:00

Comments: 14 pages, 6 figures

Abstract

TabPFN v2 achieves better results than tree-based models on several tabular benchmarks, which is notable since tree-based models are usually the strongest choice for tabular data. However, it cannot handle more than 10K context tokens because transformers have quadratic computation and memory costs. Unlike existing approaches that rely on context compression, such as selecting representative samples via K-nearest neighbors (KNN), we introduce a tiled-block strategy to compute attention within the TabPFN framework. This design is compatible with standard GPU setups and, to the best of our knowledge, is the first to enable TabPFN to process long contexts without any pre-processing. We demonstrate the effectiveness of our approach on the standard TabArena benchmark, with code available at https://github.com/mrsergazinov/chunk_tabpfn.

中文标题/摘要

标题：分块TabPFN：面向长上下文表格数据的精确免训练上下文学习

TabPFN v2在多个表格基准测试中优于基于树的模型，这值得注意，因为基于树的模型通常是表格数据的最佳选择。然而，它无法处理超过10K上下文标记，因为变换器具有二次计算和内存成本。与现有的依赖上下文压缩的方法不同，例如通过K近邻（KNN）选择代表性样本，我们引入了一种分块策略，在TabPFN框架内计算注意力。此设计兼容标准GPU设置，并且据我们所知，这是首次使TabPFN能够处理长上下文而无需任何预处理。我们在标准的TabArena基准测试中展示了我们方法的有效性，代码可在https://github.com/mrsergazinov/chunk_tabpfn获取。

Summary / 总结

The research addresses a limitation of TabPFN v2: it cannot handle more than 10K context tokens because of the quadratic computation and memory costs of transformers. The authors propose Chunked TabPFN, which uses a tiled-block strategy to compute attention within the TabPFN framework. Unlike existing approaches that rely on context compression, such as selecting representative samples via K-nearest neighbors (KNN), it processes long contexts without any pre-processing and is compatible with standard GPU setups. The authors demonstrate its effectiveness on the TabArena benchmark.

研究旨在解决TabPFN v2的一个局限性：由于变换器的计算和内存成本呈二次增长，它无法处理超过10K的上下文标记。作者提出了分块TabPFN（Chunked TabPFN）方法，在TabPFN框架中使用分块策略计算注意力。与依赖上下文压缩（例如通过K近邻（KNN）选择代表性样本）的现有方法不同，该方法无需任何预处理即可处理长上下文，并兼容标准GPU设置。作者在TabArena基准测试中展示了该方法的有效性。
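
One common way to compute attention exactly over tiles is to keep a running maximum and normalizer across chunks (often called online softmax). The sketch below uses that technique for a single query with scalar values. Whether Chunked TabPFN uses exactly this formulation is an assumption; the abstract only names a tiled-block strategy.

```python
import math


def attention_full(scores, values):
    """Reference: softmax-weighted average computed over all keys at once."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def attention_chunked(scores, values, chunk_size):
    """Same result, but only one chunk of keys is processed at a time."""
    running_max = float("-inf")
    normalizer = 0.0
    weighted_sum = 0.0
    for start in range(0, len(scores), chunk_size):
        s_chunk = scores[start:start + chunk_size]
        v_chunk = values[start:start + chunk_size]
        new_max = max(running_max, max(s_chunk))
        # Rescale what was accumulated so far to the new reference maximum.
        # On the first chunk this factor is exp(-inf) == 0.0, which is harmless
        # because both accumulators are still zero.
        scale = math.exp(running_max - new_max)
        normalizer *= scale
        weighted_sum *= scale
        for s, v in zip(s_chunk, v_chunk):
            w = math.exp(s - new_max)
            normalizer += w
            weighted_sum += w * v
        running_max = new_max
    return weighted_sum / normalizer


scores = [0.5, 2.0, -1.0, 3.0, 0.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
full = attention_full(scores, values)
chunked = attention_chunked(scores, values, chunk_size=2)
```

The two results agree up to floating-point rounding, which is why this kind of tiling counts as exact rather than as an approximation such as context compression.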

Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint

Authors: Heekyung Lee, Jiaxin Ge, Tsung-Han Wu, Minwoo Kang, Trevor Darrell, David M. Chan

Venue: EMNLP 2025

First: 2025-05-29T17:59:47+00:00 · Latest: 2025-09-16T18:13:16+00:00

Comments: EMNLP 2025 Main Conference

Abstract

Rebus puzzles, visual riddles that encode language through imagery, spatial arrangement, and symbolic substitution, pose a unique challenge to current vision-language models (VLMs). Unlike traditional image captioning or question answering tasks, rebus solving requires multi-modal abstraction, symbolic reasoning, and a grasp of cultural, phonetic and linguistic puns. In this paper, we investigate the capacity of contemporary VLMs to interpret and solve rebus puzzles by constructing a hand-generated and annotated benchmark of diverse English-language rebus puzzles, ranging from simple pictographic substitutions to spatially-dependent cues ("head" over "heels"). We analyze how different VLMs perform, and our findings reveal that while VLMs exhibit some surprising capabilities in decoding simple visual clues, they struggle significantly with tasks requiring abstract reasoning, lateral thinking, and understanding visual metaphors.

中文标题/摘要

标题：困惑于谜题：当视觉语言模型无法得到提示

谜语谜题，通过图像、空间排列和象征性替代编码语言的视觉谜题，对当前的视觉语言模型（VLMs）提出了独特的挑战。与传统的图像描述或问答任务不同，解谜需要多模态抽象、象征性推理和对文化、音韵和语言双关的理解。在本文中，我们通过构建一个手工生成和标注的多元英语谜语谜题基准，研究了当代VLMs解释和解决谜语谜题的能力，该基准包括从简单的象形替换到空间依赖性线索（“头”在“脚”上）的各种谜题。我们分析了不同VLMs的表现，研究结果表明，虽然VLMs在解码简单视觉线索方面表现出一些令人惊讶的能力，但在需要抽象推理、横向思维和理解视觉隐喻的任务上却面临重大挑战。

3D Aware Region Prompted Vision Language Model

Authors: An-Chieh Cheng, Yang Fu, Yukang Chen, Zhijian Liu, Xiaolong Li, Subhashree Radhakrishnan, Song Han, Yao Lu, Jan Kautz, Pavlo Molchanov, Hongxu Yin, Xiaolong Wang, Sifei Liu

First: 2025-09-16T17:59:06+00:00 · Latest: 2025-09-16T17:59:06+00:00

Comments: Project Website: https://www.anjiecheng.me/sr3d

Abstract

We present Spatial Region 3D (SR-3D) aware vision-language model that connects single-view 2D images and multi-view 3D data through a shared visual token space. SR-3D supports flexible region prompting, allowing users to annotate regions with bounding boxes, segmentation masks on any frame, or directly in 3D, without the need for exhaustive multi-frame labeling. We achieve this by enriching 2D visual features with 3D positional embeddings, which allows the 3D model to draw upon strong 2D priors for more accurate spatial reasoning across frames, even when objects of interest do not co-occur within the same view. Extensive experiments on both general 2D vision language and specialized 3D spatial benchmarks demonstrate that SR-3D achieves state-of-the-art performance, underscoring its effectiveness for unifying 2D and 3D representation space on scene understanding. Moreover, we observe applicability to in-the-wild videos without sensory 3D inputs or ground-truth 3D annotations, where SR-3D accurately infers spatial relationships and metric measurements.

中文标题/摘要

标题：三维感知区域提示视觉语言模型

我们提出了一种三维空间区域感知（SR-3D）视觉-语言模型，通过共享的视觉标记空间连接单视图二维图像和多视图三维数据。SR-3D 支持灵活的区域提示，允许用户在任何帧上使用边界框、分割掩码或直接在三维空间中注释区域，而无需进行详尽的多帧标注。我们通过增强二维视觉特征的三维位置嵌入来实现这一点，这使得三维模型能够利用强大的二维先验知识进行更准确的空间推理，即使感兴趣的对象在同一个视图中不同时出现。在通用二维视觉语言和专门的三维空间基准上的广泛实验表明，SR-3D 达到了最先进的性能，突显了其在场景理解中统一二维和三维表示空间的有效性。此外，我们观察到其在无需感官三维输入或真实三维注释的野外视频中的适用性，其中 SR-3D 准确地推断了空间关系和度量测量。

Summary / 总结

The research introduces SR-3D, a vision-language model that integrates single-view 2D images and multi-view 3D data through a shared visual token space. It supports flexible region prompting, enabling users to annotate regions with bounding boxes, segmentation masks, or directly in 3D. The model enriches 2D visual features with 3D positional embeddings, enhancing spatial reasoning across frames. SR-3D achieves state-of-the-art performance on both general 2D vision-language tasks and specialized 3D spatial benchmarks, and it can accurately infer spatial relationships and metric measurements in real-world videos without 3D inputs or annotations.

研究介绍了SR-3D，这是一种将单视图2D图像和多视图3D数据通过共享视觉标记空间连接起来的视觉-语言模型。它支持灵活的区域提示，允许用户使用边界框、分割掩码或直接在3D中进行标注。SR-3D通过将3D位置嵌入添加到2D视觉特征中，增强了跨帧的空间推理能力。实验表明，SR-3D在通用2D视觉-语言任务和专门的3D空间基准测试中均优于现有模型，并且可以在没有3D输入或真实标注的情况下准确推断空间关系和度量尺寸。

EdiVal-Agent: An Object-Centric Framework for Automated, Scalable, Fine-Grained Evaluation of Multi-Turn Editing

Authors: Tianyu Chen, Yasi Zhang, Zhi Zhang, Peiyu Yu, Shu Wang, Zhendong Wang, Kevin Lin, Xiaofei Wang, Zhengyuan Yang, Linjie Li, Chung-Ching Lin, Jianwen Xie, Oscar Leong, Lijuan Wang, Ying Nian Wu, Mingyuan Zhou

First: 2025-09-16T17:45:39+00:00 · Latest: 2025-09-16T17:45:39+00:00

Comments: Tianyu Chen and Yasi Zhang contributed equally; Oscar Leong, Lijuan Wang, Ying Nian Wu, and Mingyuan Zhou advised equally

Abstract

Instruction-based image editing has advanced rapidly, yet reliable and interpretable evaluation remains a bottleneck. Current protocols either (i) depend on paired reference images -- resulting in limited coverage and inheriting biases from prior generative models -- or (ii) rely solely on zero-shot vision-language models (VLMs), whose prompt-based assessments of instruction following, content consistency, and visual quality are often imprecise. To address this, we introduce EdiVal-Agent, an automated, scalable, and fine-grained evaluation framework for multi-turn instruction-based editing from an object-centric perspective, supported by a suite of expert tools. Given an image, EdiVal-Agent first decomposes it into semantically meaningful objects, then synthesizes diverse, context-aware editing instructions. For evaluation, it integrates VLMs with open-vocabulary object detectors to assess instruction following, uses semantic-level feature extractors to evaluate content consistency, and leverages human preference models to judge visual quality. We show that combining VLMs with object detectors yields stronger agreement with human judgments in instruction-following evaluation compared to using VLMs alone and CLIP-based metrics. Furthermore, the pipeline's modular design allows future tools to be seamlessly integrated, enhancing evaluation accuracy over time. Instantiating this pipeline, we build EdiVal-Bench, a multi-turn editing benchmark covering 9 instruction types and 11 state-of-the-art editing models spanning autoregressive (AR) (including Nano Banana, GPT-Image-1), flow-matching, and diffusion paradigms. We demonstrate that EdiVal-Agent can be used to identify existing failure modes, thereby informing the development of the next generation of editing models. Project page: https://tianyucodings.github.io/EdiVAL-page/.

中文标题/摘要

标题：EdiVal-Agent：基于对象中心视角的多轮编辑自动化、可扩展且细粒度的评估框架

基于指令的图像编辑技术取得了快速进展，但可靠的且可解释的评估仍然是瓶颈。当前的评估协议要么依赖配对的参考图像——这导致了有限的覆盖范围并继承了先前生成模型的偏见——要么仅依赖于零样本视觉语言模型（VLMs），这些模型基于提示对指令遵循、内容一致性和视觉质量所作的评估往往是不精确的。为了解决这个问题，我们引入了EdiVal-Agent，这是一种基于对象中心视角的自动化、可扩展且细粒度的多轮指令编辑评估框架，由一系列专家工具提供支持。给定一张图像，EdiVal-Agent 首先将其分解为语义上有意义的对象，然后合成多样且上下文相关的编辑指令。在评估过程中，它将VLMs与开放词汇对象检测器结合以评估指令遵循情况，使用语义级特征提取器评估内容一致性，并利用人类偏好模型判断视觉质量。我们展示了将VLMs与对象检测器结合使用在指令遵循评估中比单独使用VLMs和基于CLIP的度量标准更能与人类判断达成一致。此外，该流水线的模块化设计允许未来工具无缝集成，随着时间的推移提高评估准确性。实例化此流水线，我们构建了EdiVal-Bench，这是一个多轮编辑基准，涵盖9种指令类型和11种最先进的编辑模型，这些模型横跨自回归（AR，包括Nano Banana、GPT-Image-1）、流匹配和扩散范式。我们证明了EdiVal-Agent可以用于识别现有失败模式，从而指导下一代编辑模型的发展。项目页面：https://tianyucodings.github.io/EdiVAL-page/

Summary / 总结

EdiVal-Agent is an automated evaluation framework for multi-turn image editing, addressing the limitations of existing protocols by integrating VLMs with object detectors and semantic-level feature extractors. It decomposes images into objects, generates context-aware editing instructions, and evaluates instruction following, content consistency, and visual quality. The framework shows improved agreement with human judgments in instruction-following evaluation and can be extended to incorporate new tools, enhancing evaluation accuracy over time. EdiVal-Agent is instantiated in EdiVal-Bench, a benchmark covering various editing models, which helps identify existing failure modes in image editing systems.

EdiVal-Agent 是一个自动化评估框架，用于多轮指令驱动的图像编辑，解决了现有评估方法的局限性。该框架将图像分解为对象，生成上下文相关的编辑指令，并结合视觉语言模型和对象检测器评估指令遵循、内容一致性和视觉质量。该框架在指令遵循评估中与人类判断的契合度更高，并且可以通过集成未来工具来增强评估准确性。

Image Realness Assessment and Localization with Multimodal Features

Authors: Lovish Kaushik, Agnij Biswas, Somdyuti Paul

First: 2025-09-16T17:42:51+00:00 · Latest: 2025-09-16T17:42:51+00:00

Abstract

A reliable method of quantifying the perceptual realness of AI-generated images and identifying visually inconsistent regions is crucial for practical use of AI-generated images and for improving photorealism of generative AI via realness feedback during training. This paper introduces a framework that accomplishes both overall objective realness assessment and local inconsistency identification of AI-generated images using textual descriptions of visual inconsistencies generated by vision-language models trained on large datasets that serve as reliable substitutes for human annotations. Our results demonstrate that the proposed multimodal approach improves objective realness prediction performance and produces dense realness maps that effectively distinguish between realistic and unrealistic spatial regions.

中文标题/摘要

标题：基于多模态特征的图像真实感评估与定位

可靠地量化AI生成图像的感知真实感并识别视觉不一致区域的方法，对于AI生成图像的实际应用以及通过训练期间的真实感反馈提高生成式AI的逼真度至关重要。本文介绍了一种框架，该框架利用在大规模数据集上训练的视觉-语言模型生成的视觉不一致性文本描述（可作为人工标注的可靠替代），同时实现AI生成图像的整体客观真实感评估和局部不一致性识别。我们的结果表明，所提出的多模态方法提高了客观真实感预测性能，并生成了密集的真实感图，能够有效区分逼真与不逼真的空间区域。

Summary / 总结

This paper presents a framework for assessing the perceptual realness of AI-generated images and identifying visually inconsistent regions. It uses textual descriptions of visual inconsistencies generated by vision-language models to achieve both overall realness assessment and local inconsistency detection. The results show that the multimodal approach enhances realness prediction and generates detailed realness maps that effectively differentiate between realistic and unrealistic areas of the images.

本文提出了一种框架，用于评估AI生成图像的感知真实性和识别视觉不一致区域。该方法利用由大规模数据集训练的视觉-语言模型生成的视觉不一致性文本描述，实现整体真实度评估和局部不一致性检测。结果表明，多模态方法提高了真实度预测性能，并生成了详细的真实度图，有效地区分了真实和不真实的空间区域。

ChartGaze: Enhancing Chart Understanding in LVLMs with Eye-Tracking Guided Attention Refinement

Authors: Ali Salamatian, Amirhossein Abaskohi, Wan-Cyuan Fan, Mir Rayat Imtiaz Hossain, Leonid Sigal, Giuseppe Carenini

Venue: EMNLP 2025

First: 2025-09-16T17:35:39+00:00 · Latest: 2025-09-16T17:35:39+00:00

Comments: EMNLP 2025

Abstract

Charts are a crucial visual medium for communicating and representing information. While Large Vision-Language Models (LVLMs) have made progress on chart question answering (CQA), the task remains challenging, particularly when models attend to irrelevant regions of the chart. In this work, we present ChartGaze, a new eye-tracking dataset that captures human gaze patterns during chart reasoning tasks. Through a systematic comparison of human and model attention, we find that LVLMs often diverge from human gaze, leading to reduced interpretability and accuracy. To address this, we propose a gaze-guided attention refinement that aligns image-text attention with human fixations. Our approach improves both answer accuracy and attention alignment, yielding gains of up to 2.56 percentage points across multiple models. These results demonstrate the promise of incorporating human gaze to enhance both the reasoning quality and interpretability of chart-focused LVLMs.

中文标题/摘要

标题：ChartGaze：通过眼动追踪引导注意力精炼以增强LVLM中的图表理解

图表是传达和表示信息的重要视觉媒介。虽然大型视觉-语言模型（LVLM）在图表问答（CQA）方面取得了进展，但该任务仍然具有挑战性，尤其是当模型关注图表中的无关区域时。在这项工作中，我们介绍了ChartGaze，一个新的眼动追踪数据集，捕捉人类在图表推理任务中的注视模式。通过系统比较人类和模型的注意力，我们发现LVLM经常偏离人类的注视，导致可解释性和准确性降低。为了解决这一问题，我们提出了一种注视引导的注意力精炼方法，使图像-文本注意力与人类注视点对齐。我们的方法同时提高了答案准确率和注意力对齐程度，在多个模型上最高带来2.56个百分点的提升。这些结果表明，引入人类注视有望同时增强面向图表的LVLM的推理质量和可解释性。

Summary / 总结

The research aims to improve chart understanding in LVLMs using eye-tracking data. It introduces ChartGaze, an eye-tracking dataset that captures human gaze patterns during chart reasoning tasks. Comparing human and model attention, the authors find that LVLMs often diverge from human gaze, which lowers accuracy and interpretability. The proposed gaze-guided attention refinement aligns image-text attention with human fixations, improving both answer accuracy and attention alignment, with gains of up to 2.56 percentage points across multiple models.

该研究提出了新的眼动追踪数据集 ChartGaze，记录人类在图表推理任务中的注视模式。研究发现，LVLMs 经常偏离人类的注视，导致准确性和可解释性降低。所提出的注视引导注意力精炼方法将模型注意力与人类注视点对齐，在多个模型上实现了最高 2.56 个百分点的提升。这表明利用人类注视来增强 LVLMs 在图表任务中的推理能力和可解释性具有潜力。
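
One way to picture gaze-guided attention refinement is as a loss that penalizes model attention for drifting away from human fixations. The sketch below is illustrative only: the abstract does not state the training objective, so the KL-divergence form, the function names `normalize` and `gaze_alignment_loss`, and the flat per-patch lists are all assumptions rather than the authors' method.

```python
import math


def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


def gaze_alignment_loss(model_attention, human_fixation, eps=1e-8):
    """KL(human || model) over image patches (an assumed objective).

    Both inputs are per-patch non-negative weights of equal length.
    The loss is 0 when the two normalized maps match and grows as the
    model puts less mass where humans looked.
    """
    if len(model_attention) != len(human_fixation):
        raise ValueError("maps must cover the same patches")
    p = normalize(human_fixation)
    q = normalize(model_attention)
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Hypothetical 4-patch chart: humans fixate mostly on patch 2.
human = [0.0, 0.1, 0.8, 0.1]
aligned = [0.05, 0.15, 0.7, 0.1]
distracted = [0.7, 0.1, 0.1, 0.1]
print(gaze_alignment_loss(aligned, human) < gaze_alignment_loss(distracted, human))
```

In a real model such a term would presumably be added to the answer loss and computed on differentiable attention tensors; plain lists are used here only to keep the example dependency-free.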

RadGame: An AI-Powered Platform for Radiology Education

Authors: Mohammed Baharoon, Siavash Raissi, John S. Jun, Thibault Heintz, Mahmoud Alabbad, Ali Alburkani, Sung Eun Kim, Kent Kleinschmidt, Abdulrahman O. Alhumaydhi, Mohannad Mohammed G. Alghamdi, Jeremy Francis Palacio, Mohammed Bukhaytan, Noah Michael Prudlo, Rithvik Akula, Brady Chrisler, Benjamin Galligos, Mohammed O. Almutairi, Mazeen Mohammed Alanazi, Nasser M. Alrashdi, Joel Jihwan Hwang, Sri Sai Dinesh Jaliparthi, Luke David Nelson, Nathaniel Nguyen, Sathvik Suryadevara, Steven Kim, Mohammed F. Mohammed, Yevgeniy R. Semenov, Kun-Hsing Yu, Abdulrhman Aljouie, Hassan AlOmaish, Adam Rodman, Pranav Rajpurkar

First: 2025-09-16T17:27:33+00:00 · Latest: 2025-09-16T17:27:33+00:00

Abstract

We introduce RadGame, an AI-powered gamified platform for radiology education that targets two core skills: localizing findings and generating reports. Traditional radiology training is based on passive exposure to cases or active practice with real-time input from supervising radiologists, limiting opportunities for immediate and scalable feedback. RadGame addresses this gap by combining gamification with large-scale public datasets and automated, AI-driven feedback that provides clear, structured guidance to human learners. In RadGame Localize, players draw bounding boxes around abnormalities, which are automatically compared to radiologist-drawn annotations from public datasets, and visual explanations are generated by vision-language models for user missed findings. In RadGame Report, players compose findings given a chest X-ray, patient age and indication, and receive structured AI feedback based on radiology report generation metrics, highlighting errors and omissions compared to a radiologist's written ground truth report from public datasets, producing a final performance and style score. In a prospective evaluation, participants using RadGame achieved a 68% improvement in localization accuracy compared to 17% with traditional passive methods and a 31% improvement in report-writing accuracy compared to 4% with traditional methods after seeing the same cases. RadGame highlights the potential of AI-driven gamification to deliver scalable, feedback-rich radiology training and reimagines the application of medical AI resources in education.

中文标题/摘要

标题：RadGame：一种基于人工智能的放射学教育平台

我们介绍了RadGame，一个基于人工智能的游戏化放射学教育平台，旨在培养两项核心技能：病灶定位和报告生成。传统放射学培训依赖于被动接触病例，或在监督放射科医生实时指导下进行的主动实践，这限制了获得即时且可扩展反馈的机会。RadGame通过将游戏化与大规模公共数据集以及自动化的人工智能反馈相结合来弥补这一差距，为人类学习者提供清晰、结构化的指导。在RadGame Localize中，玩家在异常区域周围绘制边界框，这些边界框会自动与公共数据集中放射科医生绘制的标注进行比较，并由视觉语言模型针对用户遗漏的发现生成视觉解释。在RadGame Report中，玩家根据胸部X光片、患者年龄和检查指征撰写检查所见，并收到基于放射学报告生成指标的结构化AI反馈，该反馈突出显示与公共数据集中放射科医生撰写的真值报告相比的错误和遗漏，并给出最终的表现和风格评分。在前瞻性评估中，在看过相同病例之后，使用RadGame的参与者定位准确性提高了68%，而传统被动方法为17%；报告撰写准确性提高了31%，而传统方法为4%。RadGame突显了基于人工智能的游戏化在提供可扩展、反馈丰富的放射学培训方面的潜力，并重新构想了医疗人工智能资源在教育中的应用。

Summary / 总结

RadGame is an AI-powered gamified platform designed to enhance radiology education by focusing on localizing findings and generating reports. It uses large-scale public datasets and AI-driven feedback to provide structured guidance. After using RadGame, participants showed a 68% improvement in localization accuracy and a 31% improvement in report-writing accuracy, whereas traditional passive methods yielded improvements of only 17% and 4%, respectively.

RadGame 是一个结合了 AI 的游戏化平台，旨在通过病灶定位和报告撰写两项练习来提升放射学教育。它利用大规模公共数据集和 AI 反馈来提供结构化的指导。参与者在使用 RadGame 后，定位准确性提高了 68%，报告撰写准确性提高了 31%，而传统被动方法仅分别提高了 17% 和 4%。
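
The comparison of player-drawn boxes against radiologist annotations in RadGame Localize can be sketched with intersection over union (IoU). This is a minimal illustration under stated assumptions: the abstract does not name the matching metric, so IoU, the 0.5 threshold, and the helper names `iou` and `missed_findings` are hypothetical choices.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap width/height, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def missed_findings(player_boxes, reference_boxes, threshold=0.5):
    """Indices of reference boxes that no player box overlaps enough.

    The threshold is an assumed cutoff, not a value from the paper.
    """
    return [
        i
        for i, ref in enumerate(reference_boxes)
        if all(iou(p, ref) < threshold for p in player_boxes)
    ]


# Hypothetical case: one drawn box, two annotated abnormalities.
player = [(0, 0, 10, 10)]
reference = [(1, 1, 10, 10), (50, 50, 60, 60)]
print(missed_findings(player, reference))
```

The indices returned by `missed_findings` would be the findings for which, per the abstract, a vision-language model generates a visual explanation.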

Evaluating the Robustness of Open-Source Vision-Language Models to Domain Shift in Object Captioning

Authors: Federico Tavella, Amber Drinkwater, Angelo Cangelosi

First: 2025-06-24T12:45:09+00:00 · Latest: 2025-09-16T15:12:16+00:00

Abstract

Vision-Language Models (VLMs) have emerged as powerful tools for generating textual descriptions from visual data. While these models excel on web-scale datasets, their robustness to the domain shifts inherent in many real-world applications remains under-explored. This paper presents a systematic evaluation of VLM performance on a single-view object captioning task when faced with a controlled, physical domain shift. We compare captioning accuracy across two distinct object sets: a collection of multi-material, real-world tools and a set of single-material, 3D-printed items. The 3D-printed set introduces a significant domain shift in texture and material properties, challenging the models' generalization capabilities. Our quantitative results demonstrate that all tested VLMs show a marked performance degradation when describing the 3D-printed objects compared to the real-world tools. This underscores a critical limitation in the ability of current models to generalize beyond surface-level features and highlights the need for more robust architectures for real-world signal processing applications.

中文标题/摘要

标题：评估开源视觉-语言模型在对象描述任务中对领域转移的鲁棒性

视觉-语言模型（VLMs）已成为从视觉数据生成文本描述的强大工具。尽管这些模型在大规模网络数据集上表现出色，但它们对许多现实世界应用中固有的领域转移的鲁棒性仍未得到充分探索。本文系统地评估了VLM在面对可控的物理领域转移时，在单视角对象描述任务中的性能。我们比较了两个不同对象集的描述准确性：一组多材料的真实世界工具和一组单一材料的3D打印物品。3D打印集在纹理和材料属性上引入了显著的领域转移，挑战了模型的泛化能力。我们的定量结果显示，与描述真实世界工具相比，所有测试的VLM在描述3D打印对象时性能都明显下降。这突显了当前模型在超越表面特征进行泛化方面的关键局限性，并强调了现实世界的信号处理应用需要更鲁棒的架构。

Summary / 总结

This paper evaluates the robustness of open-source vision-language models to domain shifts in object captioning. The study compares the models' performance on a single-view object captioning task using two distinct object sets: real-world multi-material tools and 3D-printed single-material items. The results show a significant drop in captioning accuracy when describing 3D-printed objects, indicating that current models struggle with generalizing beyond surface-level features and emphasizing the need for more robust architectures.

该研究评估了开源视觉-语言模型在对象描述任务中面对领域转移的鲁棒性。实验使用两种不同对象集进行比较：现实世界中的多材料工具和3D打印的单一材料物品。结果显示，当描述3D打印对象时，模型的描述准确性显著下降，这表明当前模型难以超越表面特征进行泛化，并强调了现实世界的信号处理应用需要更鲁棒的架构。

HERO: Rethinking Visual Token Early Dropping in High-Resolution Large Vision-Language Models

Authors: Xu Li, Yuxuan Liang, Xiaolei Chen, Yi Zheng, Haotian Chen, Bin Li, Xiangyang Xue

First: 2025-09-16T13:22:08+00:00 · Latest: 2025-09-16T13:22:08+00:00

Abstract

By cropping high-resolution images into local tiles and encoding them independently, High-Resolution Large Vision-Language Models (HR-LVLMs) have demonstrated remarkable fine-grained visual understanding capabilities. However, this divide-and-conquer paradigm significantly increases the number of visual tokens, resulting in substantial computational and memory overhead. To better understand and address this challenge, we empirically investigate visual token utilization in HR-LVLMs and uncover three key findings: (1) the local tiles have varying importance, jointly determined by visual saliency and task relevance; (2) the CLS token in CLIP-based vision encoders exhibits a two-stage attention pattern across layers, with each stage attending to different types of visual tokens; (3) the visual tokens emphasized at different stages encode information at varying levels of granularity, playing complementary roles within LVLMs. Building on these insights, we propose HERO, a High-resolution visual token early dropping framework that integrates content-adaptive token budget allocation with function-aware token selection. By accurately estimating tile-level importance and selectively retaining visual tokens with complementary roles, HERO achieves superior efficiency-accuracy trade-offs across diverse benchmarks and model scales, all in a training-free manner. This study provides both empirical insights and practical solutions toward efficient inference in HR-LVLMs.

中文标题/摘要

标题：HERO：重新思考高分辨率大型视觉语言模型中的视觉标记早期丢弃

通过将高分辨率图像裁剪成局部小块并独立编码，高分辨率大型视觉语言模型（HR-LVLMs）展示了出色的细粒度视觉理解能力。然而，这种分而治之的方法显著增加了视觉标记的数量，导致了巨大的计算和内存开销。为了更好地理解和解决这一挑战，我们实证研究了HR-LVLMs中的视觉标记利用情况，并得出三个关键发现：（1）局部小块的重要性各不相同，由视觉显著性和任务相关性共同决定；（2）基于CLIP的视觉编码器中的CLS标记在各层中表现出两阶段的注意力模式，每个阶段关注不同类型的视觉标记；（3）在不同阶段被强调的视觉标记编码了不同粒度的信息，在视觉语言模型中发挥互补作用。基于这些见解，我们提出了HERO，一种高分辨率视觉标记早期丢弃框架，结合内容自适应标记预算分配与功能感知标记选择。通过准确估计小块级别的重要性并选择性保留具有互补作用的视觉标记，HERO在各种基准和模型规模上实现了优越的效率-准确度权衡，且无需训练。本研究提供了关于HR-LVLMs高效推理的实证见解和实用解决方案。

Summary / 总结

This paper addresses the computational and memory overhead of High-Resolution Large Vision-Language Models (HR-LVLMs) by proposing HERO, a framework that drops less important visual tokens early. The authors find that local tiles vary in importance according to visual saliency and task relevance, and that the CLS token in CLIP-based vision encoders follows a two-stage attention pattern across layers. HERO integrates content-adaptive token budget allocation with function-aware token selection, achieving better efficiency-accuracy trade-offs without any training.

该研究针对高分辨率大型视觉语言模型（HR-LVLMs）因采用分而治之的方法而产生的计算和内存开销问题，提出了名为HERO的框架，通过内容自适应的token预算分配和功能感知的token选择，尽早丢弃不重要的视觉token。研究发现局部tile的重要性各不相同，且基于CLIP的视觉编码器中的CLS token在各层中表现出两阶段的注意力模式。HERO在不同基准和模型规模上实现了更好的效率-准确度权衡，且无需训练。
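
The idea of splitting a fixed token budget across tiles by importance and then keeping only the best tokens per tile can be sketched as follows. This is a simplified illustration, not HERO itself: the proportional allocation rule, the single per-token score (HERO's selection is described as function-aware across two attention stages), and the names `allocate_budget` and `select_tokens` are all assumptions.

```python
def allocate_budget(tile_importance, total_budget):
    """Split total_budget across tiles in proportion to importance.

    Uses largest-remainder rounding so the integer budgets sum exactly
    to total_budget. Importance scores are assumed to be given.
    """
    total = sum(tile_importance)
    if total <= 0:
        raise ValueError("importance scores must have a positive sum")
    raw = [total_budget * s / total for s in tile_importance]
    budgets = [int(r) for r in raw]
    leftover = total_budget - sum(budgets)
    # Hand the remaining tokens to the tiles with the largest fractions.
    order = sorted(range(len(raw)), key=lambda i: raw[i] - budgets[i], reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


def select_tokens(token_scores, k):
    """Indices of the k highest-scoring tokens, in original order."""
    top = sorted(range(len(token_scores)), key=lambda i: token_scores[i], reverse=True)[:k]
    return sorted(top)


# Hypothetical image with three tiles and a budget of 8 visual tokens.
budgets = allocate_budget([2, 1, 1], 8)
kept = select_tokens([0.1, 0.9, 0.4, 0.7], 2)
print(budgets, kept)
```

Because both steps only read precomputed scores, a scheme of this shape needs no gradient updates, which is consistent with the training-free setting the abstract describes.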

Perception Before Reasoning: Two-Stage Reinforcement Learning for Visual Reasoning in Vision-Language Models

Authors: Yan Chen, Long Li, Teng Xi, Long Zeng, Jingdong Wang

First: 2025-09-16T12:51:11+00:00 · Latest: 2025-09-16T12:51:11+00:00

Abstract

Reinforcement learning (RL) has proven highly effective in eliciting the reasoning capabilities of large language models (LLMs). Inspired by this success, recent studies have explored applying similar techniques to vision-language models (VLMs), aiming to enhance their reasoning performance. However, directly transplanting RL methods from LLMs to VLMs is suboptimal, as the tasks faced by VLMs are inherently more complex. Specifically, VLMs must first accurately perceive and understand visual inputs before reasoning can be effectively performed. To address this challenge, we propose a two-stage reinforcement learning framework designed to jointly enhance both the perceptual and reasoning capabilities of VLMs. To mitigate the vanishing advantage issue commonly observed in RL training, we first perform dataset-level sampling to selectively strengthen specific capabilities using distinct data sources. During training, the first stage focuses on improving the model's visual perception through coarse- and fine-grained visual understanding, while the second stage targets the enhancement of reasoning abilities. After the proposed two-stage reinforcement learning process, we obtain PeBR-R1, a vision-language model with significantly enhanced perceptual and reasoning capabilities. Experimental results on seven benchmark datasets demonstrate the effectiveness of our approach and validate the superior performance of PeBR-R1 across diverse visual reasoning tasks.

中文标题/摘要

标题：感知先于推理：面向视觉语言模型视觉推理的两阶段强化学习

强化学习（RL）已被证明在激发大型语言模型（LLMs）的推理能力方面非常有效。受此成功启发，近期研究探索了将类似技术应用于视觉语言模型（VLMs），以提高其推理性能。然而，直接将RL方法从LLMs移植到VLMs是不理想的，因为VLMs面临的任务本质上更为复杂。具体而言，VLMs必须首先准确地感知和理解视觉输入，然后才能有效地进行推理。为应对这一挑战，我们提出了一种两阶段的强化学习框架，旨在同时增强VLMs的感知和推理能力。为缓解RL训练中常见的消失优势问题，我们首先在数据集层面进行采样，以选择性地利用不同的数据源强化特定能力。在训练过程中，第一阶段专注于通过粗粒度和细粒度的视觉理解来提高模型的视觉感知能力，而第二阶段则针对推理能力的提升。经过提出的两阶段强化学习过程后，我们获得了PeBR-R1，这是一种感知和推理能力显著增强的视觉语言模型。在七个基准数据集上的实验结果表明，我们的方法有效，并且验证了PeBR-R1在各种视觉推理任务中的优越性能。

Summary / 总结

This paper proposes a two-stage reinforcement learning framework to improve the perceptual and reasoning capabilities of vision-language models (VLMs). The first stage focuses on enhancing visual perception through coarse- and fine-grained understanding, while the second stage aims to improve reasoning abilities. The approach mitigates the vanishing advantage issue by using dataset-level sampling. Experimental results on seven benchmark datasets show that the resulting model, PeBR-R1, achieves superior performance across diverse visual reasoning tasks.

本文提出了一种两阶段强化学习框架，旨在提升视觉语言模型（VLM）的感知和推理能力。第一阶段通过粗细粒度的视觉理解来增强感知能力，第二阶段则致力于提高推理能力。该方法通过首先确保准确的感知来应对VLM任务的内在复杂性。实验结果表明，所提出的PeBR-R1方法在七个基准数据集上的各种视觉推理任务中表现优于现有模型。
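
The vanishing advantage issue can be illustrated with group-normalized advantages. This sketch assumes a GRPO-style setup in which each prompt is sampled several times and rewards are standardized within the group; the abstract does not name the RL algorithm, and the filter `keep_informative` is a hypothetical stand-in for the paper's dataset-level sampling, whose details are not given.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts (assumed GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def has_learning_signal(rewards):
    """True when rollouts disagree, so some advantage is non-zero."""
    return len(set(rewards)) > 1


def keep_informative(prompt_rewards):
    """Hypothetical filter: keep prompts whose rollouts are not all equal."""
    return [p for p, rewards in prompt_rewards.items() if has_learning_signal(rewards)]


# Hypothetical rollout rewards for three prompts (1 = correct, 0 = wrong).
rollouts = {
    "too_easy": [1, 1, 1, 1],   # every advantage is 0: no gradient signal
    "too_hard": [0, 0, 0, 0],   # same problem
    "useful": [1, 0, 0, 0],     # mixed outcomes give non-zero advantages
}
print(group_advantages(rollouts["too_easy"]))
print(keep_informative(rollouts))
```

When every rollout for a prompt receives the same reward, the standardized advantages are all zero and the prompt contributes nothing to the update, which is why selecting data with mixed outcomes can help.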