arXiv Paper Digest

Seg4Diff: Unveiling Open-Vocabulary Segmentation in Text-to-Image Diffusion Transformers

Authors: Chaehyun Kim, Heeseong Shin, Eunbeen Hong, Heeji Yoon, Anurag Arnab, Paul Hongsuck Seo, Sunghwan Hong, Seungryong Kim

Venue: NeurIPS 2025

First: 2025-09-22T17:59:54+00:00 · Latest: 2025-09-22T17:59:54+00:00

Comments: NeurIPS 2025. Project page: https://cvlab-kaist.github.io/Seg4Diff/


Abstract

Text-to-image diffusion models excel at translating language prompts into photorealistic images by implicitly grounding textual concepts through their cross-modal attention mechanisms. Recent multi-modal diffusion transformers extend this by introducing joint self-attention over concatenated image and text tokens, enabling richer and more scalable cross-modal alignment. However, a detailed understanding of how and where these attention maps contribute to image generation remains limited. In this paper, we introduce Seg4Diff (Segmentation for Diffusion), a systematic framework for analyzing the attention structures of MM-DiT, with a focus on how specific layers propagate semantic information from text to image. Through comprehensive analysis, we identify a semantic grounding expert layer, a specific MM-DiT block that consistently aligns text tokens with spatially coherent image regions, naturally producing high-quality semantic segmentation masks. We further demonstrate that applying a lightweight fine-tuning scheme with mask-annotated image data enhances the semantic grouping capabilities of these layers and thereby improves both segmentation performance and generated image fidelity. Our findings demonstrate that semantic grouping is an emergent property of diffusion transformers and can be selectively amplified to advance both segmentation and generation performance, paving the way for unified models that bridge visual perception and generation.
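The idea that a layer's attention "aligns text tokens with spatially coherent image regions" can be illustrated with a toy example. The sketch below is not the authors' pipeline: the function names, the tiny attention matrix, and the argmax rule for turning attention into a mask are assumptions made for illustration.

```python
def masks_from_attention(attn, labels):
    """Assign each image token the label of the text token it attends to most.

    attn[i][j] is a hypothetical image-to-text attention weight from image
    token i to text token j, taken from a single chosen layer.
    """
    assignment = []
    for row in attn:
        best = max(range(len(row)), key=lambda j: row[j])
        assignment.append(labels[best])
    return assignment


def to_grid(flat, width):
    """Reshape a flat list of per-token labels into rows of `width` tokens."""
    return [flat[k:k + width] for k in range(0, len(flat), width)]


# Toy 2x2 image (4 image tokens) and two text tokens; values are made up.
attn = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.1, 0.9]]
seg = to_grid(masks_from_attention(attn, ["cat", "grass"]), width=2)
```

In this toy case the top row of image tokens is assigned to "cat" and the bottom row to "grass", which is the kind of label map an open-vocabulary segmentation readout would produce.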


Summary

Seg4Diff is a framework for analyzing attention structures in multi-modal diffusion transformers (MM-DiT), with a focus on how specific layers propagate semantic information from text to image. The analysis identifies a semantic grounding expert layer, an MM-DiT block that consistently aligns text tokens with spatially coherent image regions and thereby yields high-quality semantic segmentation masks. Lightweight fine-tuning on mask-annotated image data strengthens this semantic grouping, improving both segmentation performance and generated image fidelity.

Co-STAR: Collaborative Curriculum Self-Training with Adaptive Regularization for Source-Free Video Domain Adaptation

Authors: Amirhossein Dadashzadeh, Parsa Esmati, Majid Mirmehdi

First: 2025-04-15T23:47:35+00:00 · Latest: 2025-09-22T17:27:30+00:00


Abstract

Recent advances in Source-Free Unsupervised Video Domain Adaptation (SFUVDA) leverage vision-language models to enhance pseudo-label generation. However, challenges such as noisy pseudo-labels and over-confident predictions limit their effectiveness in adapting well across domains. We propose Co-STAR, a novel framework that integrates curriculum learning with collaborative self-training between a source-trained teacher and a contrastive vision-language model (CLIP). Our curriculum learning approach employs a reliability-based weight function that measures bidirectional prediction alignment between the teacher and CLIP, balancing between confident and uncertain predictions. This function preserves uncertainty for difficult samples, while prioritizing reliable pseudo-labels when the predictions from both models closely align. To further improve adaptation, we propose Adaptive Curriculum Regularization, which modifies the learning priority of samples in a probabilistic, adaptive manner based on their confidence scores and prediction stability, mitigating overfitting to noisy and over-confident samples. Extensive experiments across multiple video domain adaptation benchmarks demonstrate that Co-STAR consistently outperforms state-of-the-art SFUVDA methods. Code is available at: https://github.com/Plrbear/Co-Star
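A reliability weight based on bidirectional prediction alignment could look like the sketch below. The abstract does not give the functional form, so the symmetric KL divergence and the exponential mapping used here are assumptions, as are the function names.

```python
import math


def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions (eps avoids log(0))."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def reliability_weight(teacher_probs, clip_probs):
    """Hypothetical weight in (0, 1]: 1 when both models agree exactly,
    decaying as the two-directional divergence grows."""
    symmetric = 0.5 * (kl(teacher_probs, clip_probs) + kl(clip_probs, teacher_probs))
    return math.exp(-symmetric)


agree = reliability_weight([0.9, 0.1], [0.9, 0.1])
disagree = reliability_weight([0.9, 0.1], [0.1, 0.9])
```

With this form, a pseudo-label on which the teacher and CLIP agree gets a weight near 1, while a sample on which they disagree is down-weighted rather than discarded, which keeps its uncertainty in play.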


Summary

Co-STAR integrates curriculum learning with collaborative self-training between a source-trained teacher and CLIP for source-free unsupervised video domain adaptation. A reliability-based weight function measures bidirectional prediction alignment between the two models to balance confident and uncertain predictions, and Adaptive Curriculum Regularization adjusts each sample's learning priority according to its confidence score and prediction stability. Experiments on multiple video domain adaptation benchmarks show that Co-STAR consistently outperforms state-of-the-art SFUVDA methods.

NeuS-QA: Grounding Long-Form Video Understanding in Temporal Logic and Neuro-Symbolic Reasoning

Authors: Sahil Shah, S P Sharan, Harsh Goel, Minkyu Choi, Mustafa Munir, Manvik Pasula, Radu Marculescu, Sandeep Chinchali

First: 2025-09-22T17:15:13+00:00 · Latest: 2025-09-22T17:15:13+00:00


Abstract

Long-Form Video Question Answering (LVQA) poses challenges beyond traditional visual question answering (VQA), which is often limited to static images or short video clips. While current vision-language models (VLMs) perform well in those settings, they struggle with complex queries in LVQA over long videos involving multi-step temporal reasoning and causality. Vanilla approaches, which sample frames uniformly and feed them to a VLM with the question, incur significant token overhead, forcing severe downsampling. As a result, the model often misses fine-grained visual structure, subtle event transitions, or key temporal cues, ultimately leading to incorrect answers. To address these limitations, recent works have explored query-adaptive frame sampling, hierarchical keyframe selection, and agent-based iterative querying. However, these methods remain fundamentally heuristic: they lack explicit temporal representations and cannot enforce or verify logical event relationships. As a result, there are no formal guarantees that the sampled context actually encodes the compositional or causal logic demanded by the question. To address these foundational gaps, we introduce NeuS-QA, a training-free, plug-and-play neuro-symbolic pipeline for LVQA. NeuS-QA translates a natural language question into a formal temporal logic expression, constructs a video automaton from frame-level semantic propositions, and applies model checking to rigorously identify video segments satisfying the question's logical requirements. Only these logic-verified segments are submitted to the VLM, thus improving interpretability, reducing hallucinations, and enabling compositional reasoning without modifying or fine-tuning the model. Experiments on LongVideoBench and CinePile show NeuS-QA improves performance by over 10%, especially on questions involving event ordering, causality, and multi-step compositional reasoning.
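To give a feel for selecting "logic-verified segments", the sketch below checks one simple temporal pattern ("first A, then eventually B") over frame-level propositions. It is a deliberately simplified stand-in: it does not build a video automaton or run a real model checker, and the proposition names and function are hypothetical.

```python
def find_then_segment(frames, first, then):
    """Return (start, end) indices of the earliest span in which `first` holds
    at some frame and `then` holds at a strictly later frame, else None.

    frames is a list of sets of proposition names detected per frame.
    """
    start = None
    for idx, props in enumerate(frames):
        if start is None and first in props:
            start = idx
        elif start is not None and then in props:
            return (start, idx)
    return None


# Hypothetical per-frame detections for a short clip.
frames = [set(), {"door_opens"}, {"door_opens"}, set(), {"person_enters"}, set()]
segment = find_then_segment(frames, "door_opens", "person_enters")
reverse = find_then_segment(frames, "person_enters", "door_opens")
```

Only the frames inside the returned span would be forwarded to the VLM; when the ordering required by the question never occurs (the `reverse` query), nothing is returned.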


Summary

NeuS-QA is a training-free neuro-symbolic pipeline for Long-Form Video Question Answering (LVQA). It translates a natural language question into a temporal logic expression and uses model checking to identify the video segments that satisfy the question's logical requirements. Only these segments are passed to the vision-language model, which avoids the token overhead of uniform frame sampling, improves interpretability, and reduces hallucinations. On LongVideoBench and CinePile, NeuS-QA improves performance by over 10%, especially on questions involving event ordering, causality, and multi-step compositional reasoning.

Equilibrium flow: From Snapshots to Dynamics

Authors: Yanbo Zhang, Michael Levin

First: 2025-09-22T16:33:20+00:00 · Latest: 2025-09-22T16:33:20+00:00

Comments: 17 pages, 8 figures


Abstract

Scientific data, from cellular snapshots in biology to celestial distributions in cosmology, often consists of static patterns from underlying dynamical systems. These snapshots, while lacking temporal ordering, implicitly encode the processes that preserve them. This work investigates how strongly such a distribution constrains its underlying dynamics and how to recover them. We introduce the Equilibrium flow method, a framework that learns continuous dynamics that preserve a given pattern distribution. Our method successfully identifies plausible dynamics for 2-D systems and recovers the signature chaotic behavior of the Lorenz attractor. For high-dimensional Turing patterns from the Gray-Scott model, we develop an efficient, training-free variant that achieves high fidelity to the ground truth, validated both quantitatively and qualitatively. Our analysis reveals the solution space is constrained not only by the data but also by the learning model's inductive biases. This capability extends beyond recovering known systems, enabling a new paradigm of inverse design for Artificial Life. By specifying a target pattern distribution, we can discover the local interaction rules that preserve it, leading to the spontaneous emergence of complex behaviors, such as life-like flocking, attraction, and repulsion patterns, from simple, user-defined snapshots.
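One way to make "dynamics that preserve a given pattern distribution" concrete is the stationary continuity equation. The abstract does not state the training objective, so the condition below is the standard requirement for a deterministic flow and not necessarily the authors' formulation; the symbols $p$ (snapshot density) and $v$ (velocity field) are introduced here for illustration.

```latex
\frac{\partial p}{\partial t} + \nabla\cdot\bigl(p\,v\bigr) = 0
\quad\Longrightarrow\quad
\nabla\cdot\bigl(p\,v\bigr) = 0 \quad \text{when } \partial_t p = 0 .

% Example: a 2-D standard Gaussian with a rotational field
p(x,y) \propto e^{-(x^2+y^2)/2}, \qquad v(x,y) = (-y,\; x)

\nabla\cdot(p\,v) = \nabla p\cdot v + p\,\nabla\cdot v
                  = p\,(-x,\,-y)\cdot(-y,\,x) + p\cdot 0
                  = p\,(xy - xy) = 0 .
```

The rotation leaves the Gaussian unchanged, and so would the zero field or a rotation at any other speed. Many different dynamics share the same snapshots, which is consistent with the abstract's remark that the solution space is also constrained by the learning model's inductive biases.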


Summary

This work studies how to infer underlying dynamics from static, temporally unordered snapshots in scientific data. It introduces the Equilibrium flow method, which learns continuous dynamics that preserve a given pattern distribution. The method identifies plausible dynamics for 2-D systems and recovers the chaotic behavior of the Lorenz attractor. For high-dimensional Turing patterns from the Gray-Scott model, an efficient, training-free variant achieves high fidelity to the ground truth, validated both quantitatively and qualitatively. The same capability supports inverse design for Artificial Life: specifying a target pattern distribution allows the discovery of local interaction rules that preserve it, producing complex behaviors such as life-like flocking, attraction, and repulsion patterns.

No Need for Learning to Defer? A Training Free Deferral Framework to Multiple Experts through Conformal Prediction

Authors: Tim Bary, Benoît Macq, Louis Petit

First: 2025-09-16T02:01:21+00:00 · Latest: 2025-09-22T14:32:27+00:00

Comments: 9 pages, 4 figures, 1 table


Abstract

AI systems often fail to deliver reliable predictions across all inputs, prompting the need for hybrid human-AI decision-making. Existing Learning to Defer (L2D) approaches address this by training deferral models, but these are sensitive to changes in expert composition and require significant retraining if experts change. We propose a training-free, model- and expert-agnostic framework for expert deferral based on conformal prediction. Our method uses the prediction set generated by a conformal predictor to identify label-specific uncertainty and selects the most discriminative expert using a segregativity criterion, measuring how well an expert distinguishes between the remaining plausible labels. Experiments on CIFAR10-H and ImageNet16-H show that our method consistently outperforms both the standalone model and the strongest expert, with accuracies attaining $99.57\pm0.10\%$ and $99.40\pm0.52\%$, while reducing expert workload by up to a factor of $11$. The method remains robust under degraded expert performance and shows a gradual performance drop in low-information settings. These results suggest a scalable, retraining-free alternative to L2D for real-world human-AI collaboration.
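The deferral logic described above can be sketched in two steps: build a conformal prediction set, then pick the expert who is best on the labels left in that set. The threshold rule is the usual split-conformal quantile; the expert score used here (mean per-label accuracy over the set) is only an assumed stand-in for the paper's segregativity criterion, and all names and numbers are hypothetical.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Split-conformal quantile of calibration scores (score = 1 - p_true)."""
    n = len(cal_scores)
    rank = min(math.ceil((n + 1) * (1 - alpha)), n)  # clamp to the largest score
    return sorted(cal_scores)[rank - 1]


def prediction_set(probs, qhat):
    """Labels whose nonconformity score 1 - p does not exceed the threshold."""
    return [label for label, p in enumerate(probs) if 1.0 - p <= qhat]


def pick_expert(label_set, expert_accuracy):
    """Assumed criterion: expert with the highest mean accuracy on the set."""
    def score(name):
        acc = expert_accuracy[name]
        return sum(acc[label] for label in label_set) / len(label_set)
    return max(expert_accuracy, key=score)


cal_scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.90]
qhat = conformal_threshold(cal_scores, alpha=0.25)
label_set = prediction_set([0.50, 0.45, 0.05], qhat)
experts = {"expert_a": [0.95, 0.60, 0.90], "expert_b": [0.85, 0.90, 0.50]}
chosen = pick_expert(label_set, experts) if len(label_set) > 1 else None
```

When the prediction set contains a single label, the model's answer is kept and no expert is consulted, which is where the workload reduction would come from in a scheme like this.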


Summary

The paper proposes a training-free, model- and expert-agnostic framework for expert deferral based on conformal prediction, as an alternative to Learning to Defer (L2D). The method uses the conformal prediction set to identify label-specific uncertainty and selects the most discriminative expert for the remaining plausible labels. On CIFAR10-H and ImageNet16-H it reaches accuracies of 99.57±0.10% and 99.40±0.52%, respectively, while reducing expert workload by up to a factor of 11. The framework remains robust under degraded expert performance and degrades gradually in low-information settings.

ContextFlow: Training-Free Video Object Editing via Adaptive Context Enrichment

Authors: Yiyang Chen, Xuanhua He, Xiujun Ma, Yue Ma

First: 2025-09-22T14:13:31+00:00 · Latest: 2025-09-22T14:13:31+00:00

Comments: The project page is at https://yychen233.github.io/ContextFlow-page


Abstract

Training-free video object editing aims to achieve precise object-level manipulation, including object insertion, swapping, and deletion. However, it faces significant challenges in maintaining fidelity and temporal consistency. Existing methods, often designed for U-Net architectures, suffer from two primary limitations: inaccurate inversion due to first-order solvers, and contextual conflicts caused by crude "hard" feature replacement. These issues are more challenging in Diffusion Transformers (DiTs), where the unsuitability of prior layer-selection heuristics makes effective guidance challenging. To address these limitations, we introduce ContextFlow, a novel training-free framework for DiT-based video object editing. In detail, we first employ a high-order Rectified Flow solver to establish a robust editing foundation. The core of our framework is Adaptive Context Enrichment (for specifying what to edit), a mechanism that addresses contextual conflicts. Instead of replacing features, it enriches the self-attention context by concatenating Key-Value pairs from parallel reconstruction and editing paths, empowering the model to dynamically fuse information. Additionally, to determine where to apply this enrichment (for specifying where to edit), we propose a systematic, data-driven analysis to identify task-specific vital layers. Based on a novel Guidance Responsiveness Metric, our method pinpoints the most influential DiT blocks for different tasks (e.g., insertion, swapping), enabling targeted and highly effective guidance. Extensive experiments show that ContextFlow significantly outperforms existing training-free methods and even surpasses several state-of-the-art training-based approaches, delivering temporally coherent, high-fidelity results.
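The difference between "hard" feature replacement and enriching the self-attention context can be shown with a single-query toy example. This is a minimal sketch, assuming plain scaled dot-product attention; the helper names and the tiny vectors are made up and do not come from the paper.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def enriched_attend(query, edit_kv, recon_kv):
    """Concatenate Key-Value pairs from the editing and reconstruction paths,
    then attend over the joint context instead of overwriting features."""
    keys = edit_kv[0] + recon_kv[0]
    values = edit_kv[1] + recon_kv[1]
    return attend(query, keys, values)


query = [1.0, 0.0]
edit_kv = ([[0.0, 1.0]], [[0.0]])    # (keys, values) from the editing path
recon_kv = ([[1.0, 0.0]], [[1.0]])   # (keys, values) from the reconstruction path
plain = attend(query, *edit_kv)
enriched = enriched_attend(query, edit_kv, recon_kv)
```

The query can only see the editing path's value in `plain`, whereas in `enriched` the softmax decides how much of the reconstruction path to mix in, so the fusion is soft rather than an all-or-nothing swap.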


Summary

ContextFlow is a training-free framework for video object editing with Diffusion Transformers (DiTs). It addresses fidelity and temporal consistency with a high-order Rectified Flow solver and Adaptive Context Enrichment. Instead of replacing features, the enrichment mechanism concatenates Key-Value pairs from parallel reconstruction and editing paths into the self-attention context, so the model can fuse information dynamically. A data-driven analysis identifies the most influential DiT blocks for each task, enabling targeted guidance. Experiments show that ContextFlow outperforms existing training-free methods and surpasses several training-based approaches, delivering temporally coherent, high-fidelity results.

A State-Update Prompting Strategy for Efficient and Robust Multi-turn Dialogue

Authors: Ziyi Liu

First: 2025-09-22T13:26:24+00:00 · Latest: 2025-09-22T13:26:24+00:00


Abstract

Large Language Models (LLMs) struggle with information forgetting and inefficiency in long-horizon, multi-turn dialogues. To address this, we propose a training-free prompt engineering method, the State-Update Multi-turn Dialogue Strategy. It utilizes "State Reconstruction" and "History Remind" mechanisms to effectively manage dialogue history. Our strategy shows strong performance across multiple multi-hop QA datasets. For instance, on the HotpotQA dataset, it improves the core information filtering score by 32.6%, leading to a 14.1% increase in the downstream QA score, while also reducing inference time by 73.1% and token consumption by 59.4%. Ablation studies confirm the pivotal roles of both components. Our work offers an effective solution for optimizing LLMs in long-range interactions, providing new insights for developing more robust Agents.


FROQ: Observing Face Recognition Models for Efficient Quality Assessment

Authors: Žiga Babnik, Deepak Kumar Jain, Peter Peer, Vitomir Štruc

First: 2025-09-22T12:29:44+00:00 · Latest: 2025-09-22T12:29:44+00:00

Comments: Presented at the International Joint Conference on Biometrics (IJCB 2025)


Abstract

Face Recognition (FR) plays a crucial role in many critical (high-stakes) applications, where errors in the recognition process can lead to serious consequences. Face Image Quality Assessment (FIQA) techniques enhance FR systems by providing quality estimates of face samples, enabling the systems to discard samples that are unsuitable for reliable recognition or lead to low-confidence recognition decisions. Most state-of-the-art FIQA techniques rely on extensive supervised training to achieve accurate quality estimation. In contrast, unsupervised techniques eliminate the need for additional training but tend to be slower and typically exhibit lower performance. In this paper, we introduce FROQ (Face Recognition Observer of Quality), a semi-supervised, training-free approach that leverages specific intermediate representations within a given FR model to estimate face-image quality, and combines the efficiency of supervised FIQA models with the training-free approach of unsupervised methods. A simple calibration step based on pseudo-quality labels allows FROQ to uncover specific representations, useful for quality assessment, in any modern FR model. To generate these pseudo-labels, we propose a novel unsupervised FIQA technique based on sample perturbations. Comprehensive experiments with four state-of-the-art FR models and eight benchmark datasets show that FROQ leads to highly competitive results compared to the state-of-the-art, achieving both strong performance and efficient runtime, without requiring explicit training.


Summary

FROQ is a semi-supervised, training-free approach to face image quality assessment that leverages intermediate representations within a given face recognition model. A simple calibration step, based on pseudo-quality labels generated by an unsupervised perturbation-based FIQA technique, identifies the representations useful for quality assessment. FROQ thereby combines the efficiency of supervised FIQA models with the training-free nature of unsupervised methods. Experiments with four FR models and eight benchmark datasets show highly competitive results against the state of the art, with strong performance and efficient runtime and no explicit training.

SD-VLM: Spatial Measuring and Understanding with Depth-Encoded Vision-Language Models

Authors: Pingyi Chen, Yujing Lou, Shen Cao, Jinhui Guo, Lubin Fan, Yue Wu, Lin Yang, Lizhuang Ma, Jieping Ye

Venue: NeurIPS 2025

First: 2025-09-22T12:08:12+00:00 · Latest: 2025-09-22T12:08:12+00:00

Comments: Accepted by NeurIPS 2025


Abstract

While vision language models (VLMs) excel in 2D semantic visual understanding, their ability to quantitatively reason about 3D spatial relationships remains under-explored, due to the deficiency of 2D images' spatial representation ability. In this paper, we analyze the problem hindering VLMs' spatial understanding abilities and propose SD-VLM, a novel framework that significantly enhances fundamental spatial perception abilities of VLMs through two key contributions: (1) propose Massive Spatial Measuring and Understanding (MSMU) dataset with precise spatial annotations, and (2) introduce a simple depth positional encoding method strengthening VLMs' spatial awareness. MSMU dataset covers massive quantitative spatial tasks with 700K QA pairs, 2.5M physical numerical annotations, and 10K chain-of-thought augmented samples. We have trained SD-VLM, a strong generalist VLM which shows superior quantitative spatial measuring and understanding capability. SD-VLM not only achieves state-of-the-art performance on our proposed MSMU-Bench, but also shows spatial generalization abilities on other spatial understanding benchmarks including Q-Spatial and SpatialRGPT-Bench. Extensive experiments demonstrate that SD-VLM outperforms GPT-4o and Intern-VL3-78B by 26.91% and 25.56% respectively on MSMU-Bench. Code and models are released at https://github.com/cpystan/SD-VLM.
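A depth positional encoding can be pictured as an extra signal, derived from each patch's depth, that is added to the patch embedding. The abstract does not specify the encoding, so the sinusoidal form below is an assumption borrowed from standard positional encodings, and the function names are hypothetical.

```python
import math


def depth_encoding(depth, dim, base=10000.0):
    """Sinusoidal encoding of a scalar depth value (assumed form)."""
    enc = []
    for i in range(0, dim, 2):
        freq = 1.0 / (base ** (i / dim))
        enc.append(math.sin(depth * freq))
        enc.append(math.cos(depth * freq))
    return enc[:dim]


def add_depth(patch_embeddings, patch_depths):
    """Add each patch's depth encoding to its embedding, element-wise."""
    return [
        [e + d for e, d in zip(emb, depth_encoding(z, len(emb)))]
        for emb, z in zip(patch_embeddings, patch_depths)
    ]


# Two patches with identical appearance embeddings but different depths.
patches = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
encoded = add_depth(patches, patch_depths=[0.0, 2.5])
```

After the addition the two patches are no longer identical, so later layers can tell near from far even when the 2D appearance is the same.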


Evict3R: Training-Free Token Eviction for Memory-Bounded Streaming Visual Geometry Transformers

Authors: Soroush Mahdi, Fardin Ayar, Ehsan Javanmardi, Manabu Tsukada, Mahdi Javanmardi

First: 2025-09-22T11:54:58+00:00 · Latest: 2025-09-22T11:54:58+00:00


Abstract

Streaming visual transformers like StreamVGGT achieve strong 3D perception but suffer from unbounded growth of key value (KV) memory, which limits scalability. We propose a training-free, inference-time token eviction policy that bounds memory by discarding redundant tokens while keeping the most informative ones. Our method uses significantly less memory with little to no drop in accuracy: on 7-Scenes with long sequences it reduces peak memory from 18.63 GB to 9.39 GB while accuracy and completeness drop by only 0.003. Under strict memory budgets, eviction enables denser frame sampling, which improves reconstruction accuracy compared to the baseline. Experiments across video depth estimation (Sintel, KITTI), 3D reconstruction (7-Scenes, NRGBD), and camera pose estimation (Sintel, TUM-dynamics) show that our approach closely matches StreamVGGT at a fraction of the memory and makes long-horizon streaming inference more practical.
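A budgeted eviction step of the kind described above might look like the following sketch. The abstract does not say how token informativeness is scored, so the per-token `scores` (for example, accumulated attention received) are an assumption, as is the function name.

```python
def evict(cache, scores, budget):
    """Keep at most `budget` cached tokens with the highest scores,
    preserving their original temporal order."""
    if len(cache) <= budget:
        return list(cache)
    ranked = sorted(range(len(cache)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:budget])
    return [cache[i] for i in keep]


# Hypothetical cache of five tokens with made-up importance scores.
cache = ["t0", "t1", "t2", "t3", "t4"]
scores = [0.9, 0.1, 0.5, 0.05, 0.7]
kept = evict(cache, scores, budget=3)
```

Running this step whenever a new frame's tokens are appended keeps the cache size bounded by `budget` regardless of sequence length.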


Summary

The paper addresses unbounded growth of key-value (KV) memory in streaming visual transformers such as StreamVGGT, which limits their scalability. It introduces Evict3R, a training-free, inference-time token eviction policy that keeps the most informative tokens and discards redundant ones. On 7-Scenes with long sequences, peak memory drops from 18.63 GB to 9.39 GB with minimal accuracy loss. Under strict memory budgets, eviction allows denser frame sampling and thus better reconstruction accuracy, making long-horizon streaming inference more practical across video depth estimation, 3D reconstruction, and camera pose estimation.

From Benchmarks to Reality: Advancing Visual Anomaly Detection by the VAND 3.0 Challenge

Authors: Lars Heckler-Kram, Ashwin Vaidya, Jan-Hendrik Neudeck, Ulla Scheler, Dick Ameln, Samet Akcay, Paula Ramos

First: 2025-09-22T11:27:49+00:00 · Latest: 2025-09-22T11:27:49+00:00


Abstract

Visual anomaly detection is a strongly application-driven field of research. Consequently, the connection between academia and industry is of paramount importance. In this regard, we present the VAND 3.0 Challenge to showcase current progress in anomaly detection across different practical settings whilst addressing critical issues in the field. The challenge hosted two tracks, fostering the development of anomaly detection methods robust against real-world distribution shifts (Category 1) and exploring the capabilities of Vision Language Models within the few-shot regime (Category 2), respectively. The participants' solutions reached significant improvements over previous baselines by combining or adapting existing approaches and fusing them with novel pipelines. While for both tracks the progress in large pre-trained vision (language) backbones played a pivotal role for the performance increase, scaling up anomaly detection methods more efficiently needs to be addressed by future research to meet real-time and computational constraints on-site.


COLA: Context-aware Language-driven Test-time Adaptation

Authors: Aiming Zhang, Tianyuan Yu, Liang Bai, Jun Tang, Yanming Guo, Yirun Ruan, Yun Zhou, Zhihe Lu

Venue: IEEE Trans. Image Process. (2025)

First: 2025-09-22T11:19:17+00:00 · Latest: 2025-09-22T11:19:17+00:00


Abstract

Test-time adaptation (TTA) has gained increasing popularity due to its efficacy in addressing "distribution shift" issue while simultaneously protecting data privacy. However, most prior methods assume that a paired source domain model and target domain sharing the same label space coexist, heavily limiting their applicability. In this paper, we investigate a more general source model capable of adaptation to multiple target domains without needing shared labels. This is achieved by using a pre-trained vision-language model (VLM), e.g., CLIP, that can recognize images through matching with class descriptions. While the zero-shot performance of VLMs is impressive, they struggle to effectively capture the distinctive attributes of a target domain. To that end, we propose a novel method -- Context-aware Language-driven TTA (COLA). The proposed method incorporates a lightweight context-aware module that consists of three key components: a task-aware adapter, a context-aware unit, and a residual connection unit for exploring task-specific knowledge, domain-specific knowledge from the VLM and prior knowledge of the VLM, respectively. It is worth noting that the context-aware module can be seamlessly integrated into a frozen VLM, ensuring both minimal effort and parameter efficiency. Additionally, we introduce a Class-Balanced Pseudo-labeling (CBPL) strategy to mitigate the adverse effects caused by class imbalance. We demonstrate the effectiveness of our method not only in TTA scenarios but also in class generalisation tasks. The source code is available at https://github.com/NUDT-Bai-Group/COLA-TTA.
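Class-balanced pseudo-labeling can be illustrated by selecting the same number of confident samples for every predicted class. The abstract does not describe how CBPL works internally, so the per-class top-k rule and the function name below are assumptions for illustration only.

```python
def class_balanced_pseudo_labels(probs, per_class):
    """Return {sample_index: label}, keeping at most `per_class` of the most
    confident samples for each predicted class (assumed selection rule)."""
    by_class = {}
    for idx, p in enumerate(probs):
        label = max(range(len(p)), key=lambda c: p[c])
        by_class.setdefault(label, []).append((p[label], idx))
    selected = {}
    for label, items in by_class.items():
        items.sort(reverse=True)  # highest confidence first
        for _confidence, idx in items[:per_class]:
            selected[idx] = label
    return selected


# Three samples lean toward class 0 and one toward class 1 (made-up scores).
probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.4, 0.6]]
pseudo = class_balanced_pseudo_labels(probs, per_class=1)
```

A global confidence threshold of 0.65 would have kept three class-0 samples and no class-1 sample here; the per-class cap keeps the minority class represented.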


Summary / 总结

The research aims to address the distribution shift issue and protect data privacy by developing a more general test-time adaptation (TTA) method that can adapt a pre-trained vision-language model (VLM) like CLIP to multiple target domains without needing shared labels. The proposed COLA method introduces a context-aware module with three components: a task-aware adapter, a context-aware unit, and a residual connection unit. This module allows the VLM to effectively capture task-specific and domain-specific knowledge while maintaining minimal effort and parameter efficiency. Experimental results show that COLA is effective in TTA scenarios and class generalization tasks.

研究旨在通过开发一种更通用的测试时适应（TTA）方法来解决分布偏移问题并保护数据隐私，该方法可以在无需共享标签的情况下将预训练的视觉-语言模型（VLM）如CLIP适应到多个目标域。提出的COLA方法引入了一个包含任务感知适配器、上下文感知单元和残差连接单元的上下文感知模块，该模块使VLM能够有效捕捉任务特定和领域特定的知识，同时保持最小的努力和参数效率。实验结果表明，COLA在TTA场景和类别泛化任务中都是有效的。
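The COLA abstract above names a Class-Balanced Pseudo-labeling (CBPL) strategy but does not spell out its selection rule. As a rough illustration of the general idea only, the sketch below keeps at most a fixed number of the most confident samples per predicted class, so that a dominant class cannot crowd out rare ones. The function name `class_balanced_pseudo_labels`, the `per_class_k` parameter, and the per-class top-k rule are all assumptions made for this example and may differ from the authors' method.

```python
from collections import defaultdict


def class_balanced_pseudo_labels(probs, per_class_k):
    """Keep at most `per_class_k` of the most confident samples per predicted class.

    probs: list of per-sample class-probability lists (hypothetical model outputs).
    Returns a dict mapping sample index -> pseudo-label.
    """
    by_class = defaultdict(list)
    for idx, p in enumerate(probs):
        # Predicted class is the arg-max of the probability vector.
        label = max(range(len(p)), key=p.__getitem__)
        by_class[label].append((p[label], idx))

    selected = {}
    for label, scored in by_class.items():
        # Most confident first, then truncate to the per-class budget.
        scored.sort(reverse=True)
        for _, idx in scored[:per_class_k]:
            selected[idx] = label
    return selected


# Toy probabilities: three samples lean towards class 0, one towards class 1.
probs = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.7, 0.3],
    [0.4, 0.6],
]
picked = class_balanced_pseudo_labels(probs, per_class_k=2)
```

With a budget of two per class, the least confident class-0 sample is dropped while the single class-1 sample is kept, which is the balancing effect the strategy aims for.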

Interpreting Attention Heads for Image-to-Text Information Flow in Large Vision-Language Models

Authors: Jinyeong Kim, Seil Kang, Jiwoo Park, Junhyeok Kim, Seong Jae Hwang

First: 2025-09-22T11:12:12+00:00 · Latest: 2025-09-22T11:12:12+00:00

Abs · PDF · Code1 · Code2

Abstract

Large Vision-Language Models (LVLMs) answer visual questions by transferring information from images to text through a series of attention heads. While this image-to-text information flow is central to visual question answering, its underlying mechanism remains difficult to interpret due to the simultaneous operation of numerous attention heads. To address this challenge, we propose head attribution, a technique inspired by component attribution methods, to identify consistent patterns among attention heads that play a key role in information transfer. Using head attribution, we investigate how LVLMs rely on specific attention heads to identify and answer questions about the main object in an image. Our analysis reveals that a distinct subset of attention heads facilitates the image-to-text information flow. Remarkably, we find that the selection of these heads is governed by the semantic content of the input image rather than its visual appearance. We further examine the flow of information at the token level and discover that (1) text information first propagates to role-related tokens and the final token before receiving image information, and (2) image information is embedded in both object-related and background tokens. Our work provides evidence that image-to-text information flow follows a structured process, and that analysis at the attention-head level offers a promising direction toward understanding the mechanisms of LVLMs.

中文标题/摘要

标题：图像到文本信息流在大型视觉语言模型中的注意力头解释

大型视觉语言模型（LVLMs）通过一系列注意力头将图像信息转移到文本中来回答视觉问题。尽管图像到文本的信息流是视觉问题回答的核心，但由于众多注意力头同时运作，其背后的机制仍然难以解释。为了解决这一挑战，我们提出了一种名为头归因的技术，该技术借鉴了组件归因方法，以识别在信息转移中起关键作用的注意力头中的一致模式。通过头归因，我们研究了LVLMs如何依赖特定的注意力头来识别和回答有关图像中主要对象的问题。我们的分析表明，有一组独特的注意力头促进了图像到文本的信息流。令人惊讶的是，我们发现这些头的选择是由输入图像的语义内容而不是其视觉外观所驱动的。我们进一步在标记层面研究了信息流，发现（1）文本信息首先传播到角色相关的标记和最终标记，然后再接收图像信息；（2）图像信息嵌入在对象相关的和背景标记中。我们的工作提供了证据表明，图像到文本的信息流遵循一个结构化的过程，而对注意力头层面的分析为理解LVLMs的机制提供了有希望的方向。

Visual Instruction Pretraining for Domain-Specific Foundation Models

Authors: Yuxuan Li, Yicheng Zhang, Wenhao Tang, Yimian Dai, Ming-Ming Cheng, Xiang Li, Jian Yang

First: 2025-09-22T10:57:42+00:00 · Latest: 2025-09-22T10:57:42+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Modern computer vision is converging on a closed loop in which perception, reasoning and generation mutually reinforce each other. However, this loop remains incomplete: the top-down influence of high-level reasoning on the foundational learning of low-level perceptual features remains underexplored. This paper addresses this gap by proposing a new paradigm for pretraining foundation models in downstream domains. We introduce Visual insTruction Pretraining (ViTP), a novel approach that directly leverages reasoning to enhance perception. ViTP embeds a Vision Transformer (ViT) backbone within a Vision-Language Model and pretrains it end-to-end using a rich corpus of visual instruction data curated from target downstream domains. ViTP is powered by our proposed Visual Robustness Learning (VRL), which compels the ViT to learn robust and domain-relevant features from a sparse set of visual tokens. Extensive experiments on 16 challenging remote sensing and medical imaging benchmarks demonstrate that ViTP establishes new state-of-the-art performance across a diverse range of downstream tasks. The code is available at github.com/zcablii/ViTP.

中文标题/摘要

标题：领域特定基础模型的视觉指令预训练

现代计算机视觉正在形成一个感知、推理和生成相互增强的闭环。然而，这个闭环仍然不完整：高层推理对低层感知特征基础学习的自上而下的影响尚未得到充分探索。本文通过提出一种新的预训练范式来解决这一缺口，以在下游领域预训练基础模型。我们引入了视觉指令预训练（ViTP），这是一种新颖的方法，可以直接利用推理来增强感知。ViTP 将视觉变换器（ViT）骨干嵌入到视觉语言模型中，并使用从目标下游领域收集的丰富视觉指令数据集进行端到端预训练。ViTP 由我们提出的视觉鲁棒性学习（VRL）驱动，促使 ViT 从稀疏的视觉标记集中学习稳健且领域相关的特征。在 16 个具有挑战性的遥感和医学成像基准测试上的广泛实验表明，ViTP 在多种下游任务中建立了新的最佳性能。代码可在 github.com/zcablii/ViTP 获取。

MontePrep: Monte-Carlo-Driven Automatic Data Preparation without Target Data Instances

Authors: Congcong Ge, Yachuan Liu, Yixuan Tang, Yifan Zhu, Yaofeng Tu, Yunjun Gao

First: 2025-09-22T09:17:41+00:00 · Latest: 2025-09-22T09:17:41+00:00

Abs · PDF · Code1 · Code2

Abstract

In commercial systems, a pervasive requirement for automatic data preparation (ADP) is to transfer relational data from disparate sources to targets with standardized schema specifications. Previous methods rely on labor-intensive supervision signals or target table data access permissions, limiting their usage in real-world scenarios. To tackle these challenges, we propose an effective end-to-end ADP framework MontePrep, which enables training-free pipeline synthesis with zero target-instance requirements. MontePrep is formulated as an open-source large language model (LLM) powered tree-structured search problem. It consists of three pivot components, i.e., a data preparation action sandbox (DPAS), a fundamental pipeline generator (FPG), and an execution-aware pipeline optimizer (EPO). We first introduce DPAS, a lightweight action sandbox, to navigate the search-based pipeline generation. The design of DPAS circumvents exploration of infeasible pipelines. Then, we present FPG to build executable DP pipelines incrementally, which explores the predefined action sandbox by the LLM-powered Monte Carlo Tree Search. Furthermore, we propose EPO, which invokes pipeline execution results from sources to targets to evaluate the reliability of the generated pipelines in FPG. In this way, unreasonable pipelines are eliminated, thus facilitating the search process from both efficiency and effectiveness perspectives. Extensive experimental results demonstrate the superiority of MontePrep with significant improvement against five state-of-the-art competitors.

中文标题/摘要

标题：MontePrep：基于蒙特卡洛驱动的无需目标数据实例的自动数据准备

在商业系统中，自动数据准备（ADP）的一个普遍要求是将来自不同来源的关系数据转移到具有标准化模式规范的目标中。以往的方法依赖于劳动密集型的监督信号或目标表数据访问权限，限制了它们在实际场景中的应用。为了解决这些挑战，我们提出了一种有效的端到端ADP框架MontePrep，该框架能够在无需目标实例的情况下实现无训练的管道合成。MontePrep被表述为一个由大型语言模型（LLM）驱动的树结构搜索问题。它由三个关键组件组成，即数据准备动作沙箱（DPAS）、基本管道生成器（FPG）和执行感知管道优化器（EPO）。我们首先介绍了DPAS，这是一个轻量级的动作沙箱，用于导航基于搜索的管道生成。DPAS的设计绕过了不可行管道的探索。然后，我们介绍了FPG，它通过LLM驱动的蒙特卡洛树搜索逐步构建可执行的DP管道。此外，我们提出了EPO，它从源到目标调用管道执行结果，以评估FPG生成的管道的可靠性。这样，不合理管道被消除，从而从效率和有效性两个方面促进搜索过程。广泛的实验结果表明，MontePrep在与五种最先进的竞争对手相比时具有显著优势。

Can LLMs Reason Over Non-Text Modalities in a Training-Free Manner? A Case Study with In-Context Representation Learning

Authors: Tianle Zhang, Wanlong Fang, Jonathan Woo, Paridhi Latawa, Deepak A. Subramanian, Alvin Chan

Venue: NIPS 2025

First: 2025-09-22T09:16:34+00:00 · Latest: 2025-09-22T09:16:34+00:00

Comments: NIPS 2025

Abs · PDF · Code1 · Code2

Abstract

The remarkable performance of Large Language Models (LLMs) can be enhanced with test-time computation, which relies on external tools and even other deep learning models. However, existing approaches for integrating non-text modality representations into LLMs typically require additional costly supervised training, restricting on-the-fly adaptation to new domains and modalities. In this work, we explore the feasibility of integrating representations from non-text foundational models (FMs) into text-based LLMs in a training-free manner. We propose In-Context Representation Learning (ICRL) as a proof-of-concept to allow LLMs to adaptively utilize non-text modality representations with few-shot learning. Unlike traditional in-context learning, which incorporates text-label pairs, ICRL replaces text inputs with FM representations, enabling the LLM to perform multi-modal inference without fine-tuning. We evaluate ICRL on a suite of tasks in the molecular domain, investigating three core research questions: (i) how to map FM representations into LLMs in a training-free manner, (ii) what factors influence ICRL performance, and (iii) what mechanisms underlie the effectiveness of ICRL. To the best of our knowledge, ICRL is the first training-free framework for integrating non-text modality representations into text-based LLMs, presenting a promising direction for adaptable, multi-modal generalization.

中文标题/摘要

标题：大型语言模型能否在无需训练的情况下推理非文本模态？一种基于上下文表示学习的案例研究

大型语言模型（LLMs）的出色表现可以通过测试时计算得到增强，这依赖于外部工具甚至其他深度学习模型。然而，将非文本模态表示集成到LLMs中的现有方法通常需要额外的昂贵的监督训练，限制了对新领域和模态的即时适应。在本工作中，我们探索了以无需训练的方式将非文本基础模型（FMs）的表示集成到基于文本的LLMs中的可行性。我们提出了一种基于上下文表示学习（ICRL）的概念，以允许LLMs通过少样本学习适应性地利用非文本模态表示。与传统的基于上下文学习不同，ICRL用FM表示替换文本输入，使LLM能够在无需微调的情况下进行多模态推理。我们在分子领域的多个任务上评估了ICRL，探讨了三个核心研究问题：（i）如何以无需训练的方式将FM表示映射到LLMs中，（ii）哪些因素影响ICRL的性能，（iii）ICRL有效性的机制是什么。据我们所知，ICRL是第一个无需训练的框架，用于将非文本模态表示集成到基于文本的LLMs中，为适应性、多模态泛化的研究提供了有希望的方向。

Summary / 总结

This work investigates the feasibility of integrating non-text modality representations into text-based Large Language Models (LLMs) without additional training. It proposes In-Context Representation Learning (ICRL) to allow LLMs to adaptively use non-text modality representations through few-shot learning. The study evaluates ICRL on molecular domain tasks and finds that it can effectively perform multi-modal inference without fine-tuning, addressing the limitation of requiring costly supervised training for integrating non-text modalities into LLMs.

这项研究探讨了在无需额外训练的情况下将非文本模态表示集成到大型语言模型（LLMs）中的可行性。它提出了In-Context Representation Learning（ICRL），允许LLMs通过少样本学习适应性地利用非文本模态表示。研究在分子任务上评估了ICRL，并发现它可以在无需微调的情况下有效执行多模态推理，解决了将非文本模态集成到LLMs中需要昂贵的监督训练的限制。

Mitigating Hallucinations in Large Vision-Language Models via Entity-Centric Multimodal Preference Optimization

Authors: Jiulong Wu, Zhengliang Shi, Shuaiqiang Wang, Jizhou Huang, Dawei Yin, Lingyong Yan, Min Cao, Min Zhang

First: 2025-06-04T15:03:50+00:00 · Latest: 2025-09-22T09:12:18+00:00

Comments: This paper is accepted by EMNLP2025

Abs · PDF · Code1 · Code2

Abstract

Large Visual Language Models (LVLMs) have demonstrated impressive capabilities across multiple tasks. However, their trustworthiness is often challenged by hallucinations, which can be attributed to modality misalignment and the inherent hallucinations of their underlying Large Language Model (LLM) backbone. Existing preference alignment methods focus on aligning model responses with human preferences while neglecting image-text modality alignment, resulting in over-reliance on LLMs and hallucinations. In this paper, we propose Entity-centric Multimodal Preference Optimization (EMPO), which achieves enhanced modality alignment compared to existing human preference alignment methods. Besides, to overcome the scarcity of high-quality multimodal preference data, we utilize open-source instruction datasets to automatically construct high-quality preference data across three aspects: image, instruction, and response. Experiments on two human preference datasets and five multimodal hallucination benchmarks demonstrate the effectiveness of EMPO, e.g., reducing hallucination rates by 85.9% on Object-HalBench and 49.8% on MM-HalBench.

中文标题/摘要

标题：通过实体中心的多模态偏好优化减轻大型视觉语言模型的幻觉

大型视觉语言模型（LVLMs）在多个任务中展现了令人印象深刻的性能。然而，它们的信任度常常受到幻觉的挑战，这可以归因于模态不匹配以及其底层大型语言模型（LLMs）的固有幻觉。现有的偏好对齐方法专注于使模型响应与人类偏好一致，而忽视了图像-文本模态对齐，导致过度依赖LLMs和幻觉。在本文中，我们提出了实体中心的多模态偏好优化（EMPO），与现有的人类偏好对齐方法相比，它实现了增强的模态对齐。此外，为了克服高质量多模态偏好数据的稀缺性，我们利用开源指令数据集自动构建了涵盖图像、指令和响应三个方面的高质量偏好数据。在两个人类偏好数据集和五个多模态幻觉基准上的实验表明，EMPO的有效性，例如在Object-HalBench上将幻觉率降低了85.9%，在MM-HalBench上降低了49.8%。

Summary / 总结

This paper addresses the issue of hallucinations in Large Visual Language Models (LVLMs) by proposing Entity-centric Multimodal Preference Optimization (EMPO). EMPO enhances modality alignment and reduces reliance on the underlying Large Language Model (LLM) by focusing on entity-centric multimodal preferences. The method utilizes open-source instruction datasets to automatically generate high-quality multimodal preference data, which are then used to optimize model responses. Experiments show that EMPO significantly reduces hallucination rates, achieving an 85.9% reduction on Object-HalBench and a 49.8% reduction on MM-HalBench.

本文提出了一种实体中心的多模态偏好优化方法（EMPO），以解决大型视觉语言模型（LVLM）中的幻觉问题。EMPO通过聚焦实体中心的多模态偏好来增强模态对齐，并减少对底层大型语言模型（LLM）的依赖。该方法利用开源指令数据集自动生成高质量的多模态偏好数据，然后用于优化模型响应。实验结果显示，EMPO显著降低了幻觉率，分别在Object-HalBench和MM-HalBench上实现了85.9%和49.8%的幻觉率降低。

ChartHal: A Fine-grained Framework Evaluating Hallucination of Large Vision Language Models in Chart Understanding

Authors: Xingqi Wang, Yiming Cui, Xin Yao, Shijin Wang, Guoping Hu, Xiaoyu Qin

First: 2025-09-22T08:15:55+00:00 · Latest: 2025-09-22T08:15:55+00:00

Abs · PDF · Code1 · Code2 · Code3

Abstract

Large Vision-Language Models (LVLMs) have recently demonstrated remarkable progress, yet hallucination remains a critical barrier, particularly in chart understanding, which requires sophisticated perceptual and cognitive abilities as well as rigorous factual accuracy. While prior work has investigated hallucinations and chart comprehension independently, their intersection remains largely unexplored. To address this gap, we present ChartHal, a benchmark that features a fine-grained taxonomy of hallucination scenarios in chart understanding, along with a human-validated dataset of 1,062 samples. Our evaluation shows that state-of-the-art LVLMs suffer from severe hallucinations on ChartHal, including proprietary models such as GPT-5 and o4-mini, which achieve only 34.46% and 22.79% accuracy, respectively. Further analysis reveals that questions involving information absent from or contradictory to charts are especially likely to trigger hallucinations, underscoring the urgent need for more robust mitigation strategies. Code and data are available at https://github.com/ymcui/ChartHal .

中文标题/摘要

标题：ChartHal：一种细粒度框架，评估大型视觉语言模型在图表理解中的幻觉

大型视觉-语言模型（LVLMs）最近取得了显著进展，但幻觉仍然是一个关键障碍，特别是在图表理解方面，这需要复杂的感知和认知能力以及严格的事实准确性。尽管先前的工作已经独立地研究了幻觉和图表理解，但它们的交集仍然很少被探索。为了解决这一差距，我们提出了ChartHal，这是一个基准，其中包括图表理解中幻觉场景的细粒度分类，以及一个由1,062个样本组成的人工验证数据集。我们的评估表明，最先进的LVLMs在ChartHal上遭受严重的幻觉，包括如GPT-5和o4-mini等专有模型，它们的准确率分别仅为34.46%和22.79%。进一步的分析表明，涉及图表中不存在或与图表矛盾的信息的问题特别容易触发幻觉，突显了对更稳健的缓解策略的迫切需求。代码和数据可在 https://github.com/ymcui/ChartHal 获取。

Summary / 总结

ChartHal is a benchmark that evaluates hallucination in chart understanding by LVLMs, built around a fine-grained taxonomy of hallucination scenarios. It includes a human-validated dataset of 1,062 samples. State-of-the-art LVLMs, including GPT-5 and o4-mini, perform poorly on this benchmark, achieving only 34.46% and 22.79% accuracy, respectively. The study highlights that questions involving information absent from or contradictory to charts are especially prone to triggering hallucinations, indicating the need for better mitigation strategies.

ChartHal 是一个基准，通过细粒度的幻觉场景分类来评估 LVLMs 在图表理解中的幻觉问题，包含 1,062 个人工验证样本。最先进的 LVLMs，包括 GPT-5 和 o4-mini，在此基准上的准确率分别仅为 34.46% 和 22.79%。研究显示，涉及图表中不存在或与其矛盾的信息的问题更容易引发幻觉，表明需要更好的缓解策略。

Time Is a Feature: Exploiting Temporal Dynamics in Diffusion Language Models

Authors: Wen Wang, Bozhen Fang, Chenchen Jing, Yongliang Shen, Yangyi Shen, Qiuyu Wang, Hao Ouyang, Hao Chen, Chunhua Shen

First: 2025-08-12T17:59:57+00:00 · Latest: 2025-09-22T07:56:08+00:00

Comments: Project webpage: https://aim-uofa.github.io/dLLM-MidTruth

Abs · PDF · Code1 · Code2 · Project1

Abstract

Diffusion large language models (dLLMs) generate text through iterative denoising, yet current decoding strategies discard rich intermediate predictions in favor of the final output. Our work here reveals a critical phenomenon, temporal oscillation, where correct answers often emerge in the middle process, but are overwritten in later denoising steps. To address this issue, we introduce two complementary methods that exploit temporal consistency: 1) Temporal Self-Consistency Voting, a training-free, test-time decoding strategy that aggregates predictions across denoising steps to select the most consistent output; and 2) a post-training method termed Temporal Consistency Reinforcement, which uses Temporal Semantic Entropy (TSE), a measure of semantic stability across intermediate predictions, as a reward signal to encourage stable generations. Empirical results across multiple benchmarks demonstrate the effectiveness of our approach. Using the negative TSE reward alone, we observe a remarkable average improvement of 24.7% on the Countdown dataset over an existing dLLM. Combined with the accuracy reward, we achieve absolute gains of 2.0% on GSM8K, 4.3% on MATH500, 6.6% on SVAMP, and 25.3% on Countdown, respectively. Our findings underscore the untapped potential of temporal dynamics in dLLMs and offer two simple yet effective tools to harness them.

中文标题/摘要

标题：时间是特征：在扩散语言模型中利用时间动态

扩散大型语言模型（dLLMs）通过迭代去噪生成文本，但当前解码策略倾向于丢弃中间丰富的预测，而保留最终输出。我们的研究揭示了一个关键现象，即时间振荡，其中正确的答案往往在中间过程中出现，但在后续去噪步骤中被覆盖。为解决这一问题，我们引入了两种互补的方法来利用时间一致性：1) 时间自我一致性投票，这是一种无需训练的测试时解码策略，它通过聚合去噪步骤中的预测来选择最一致的输出；2) 一种后训练方法，称为时间一致性强化，它使用时间语义熵（TSE），这是一种衡量中间预测语义稳定性的度量，作为奖励信号以鼓励稳定生成。在多个基准上的实验证明了我们方法的有效性。仅使用负TSE奖励，我们在Countdown数据集上观察到与现有dLLM相比平均改进了24.7%。结合准确性奖励，我们在GSM8K上取得了2.0%的绝对收益，在MATH500上取得了4.3%的收益，在SVAMP上取得了6.6%的收益，在Countdown上取得了25.3%的收益。我们的研究结果强调了dLLMs中时间动态的未开发潜力，并提供了两种简单而有效的工具来利用它们。

Summary / 总结

This paper addresses the issue of temporal oscillation in diffusion large language models (dLLMs), where correct answers often emerge at intermediate denoising steps but are overwritten later. To tackle this, the authors propose two methods: Temporal Self-Consistency Voting, a training-free test-time decoding strategy that aggregates predictions across denoising steps, and Temporal Consistency Reinforcement, a post-training method that uses Temporal Semantic Entropy (TSE) as a reward signal. Experiments show significant improvements on various benchmarks: the negative TSE reward alone yields an average improvement of 24.7% on the Countdown dataset over an existing dLLM, and combining it with the accuracy reward gives absolute gains of 2.0% on GSM8K, 4.3% on MATH500, 6.6% on SVAMP, and 25.3% on Countdown.

本文研究了扩散大型语言模型(dLLMs)中的时间振荡问题，即正确答案在中间步骤中出现但后来被覆盖。为了解决这一问题，作者提出了两种方法：时间自一致性投票，这是一种在测试时的解码策略，它会聚合去噪步骤中的预测；以及时间一致性强化，它使用时间语义熵作为奖励信号来鼓励生成的稳定性。实验结果显示，在多个基准测试中取得了显著改进，单独使用负TSE奖励在Countdown数据集上的改进幅度为24.7%，结合其他奖励后分别在GSM8K、MATH500、SVAMP和Countdown数据集上取得了2.0%、4.3%、6.6%和25.3%的绝对收益。
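The two ideas in this entry, voting across denoising steps and measuring how stable the intermediate answers are, can be illustrated with a small sketch. It is a simplified illustration, not the paper's exact formulation: the vote is an unweighted majority (the abstract does not say whether steps are weighted), and the entropy treats answers as equal only when their strings match exactly, whereas a real Temporal Semantic Entropy would group answers by meaning. The names `temporal_vote` and `temporal_entropy` are hypothetical.

```python
import math
from collections import Counter


def temporal_vote(step_answers):
    """Return the answer that appears most often across denoising steps."""
    counts = Counter(step_answers)
    return counts.most_common(1)[0][0]


def temporal_entropy(step_answers):
    """Shannon entropy (in nats) of the answer distribution across steps.

    Exact string match stands in for semantic clustering here (an assumption).
    Lower values mean the intermediate predictions are more stable.
    """
    counts = Counter(step_answers)
    total = len(step_answers)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Toy trajectory: the answer "15" appears mid-way but the final step says "9".
answers = ["12", "15", "15", "15", "12", "9"]
voted = temporal_vote(answers)
entropy = temporal_entropy(answers)
```

In this toy trajectory the final step alone would return "9", while the vote recovers the answer that dominated the intermediate steps. A reward built on the negative of such an entropy would favour trajectories that settle early and stay settled.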

Training-Free Label Space Alignment for Universal Domain Adaptation

Authors: Dujin Lee, Sojung An, Jungmyung Wi, Kuniaki Saito, Donghyun Kim

First: 2025-09-22T07:46:10+00:00 · Latest: 2025-09-22T07:46:10+00:00

Comments: 22 pages, 12 figures

Abs · PDF · Code1 · Code2

Abstract

Universal domain adaptation (UniDA) transfers knowledge from a labeled source domain to an unlabeled target domain, where label spaces may differ and the target domain may contain private classes. Previous UniDA methods primarily focused on visual space alignment but often struggled with visual ambiguities due to content differences, which limited their robustness and generalizability. To overcome this, we introduce a novel approach that leverages the strong zero-shot capabilities of recent vision-language foundation models (VLMs) like CLIP, concentrating solely on label space alignment to enhance adaptation stability. CLIP can generate task-specific classifiers based only on label names. However, adapting CLIP to UniDA is challenging because the label space is not fully known in advance. In this study, we first utilize generative vision-language models to identify unknown categories in the target domain. Noise and semantic ambiguities in the discovered labels -- such as those similar to source labels (e.g., synonyms, hypernyms, hyponyms) -- complicate label alignment. To address this, we propose a training-free label-space alignment method for UniDA. Our method aligns label spaces instead of visual spaces by filtering and refining noisy labels between the domains. We then construct a universal classifier that integrates both shared knowledge and target-private class information, thereby improving generalizability under domain shifts. The results reveal that the proposed method considerably outperforms existing UniDA techniques across key DomainBed benchmarks, delivering an average improvement of +7.9% in H-score and +6.1% in H$^3$-score. Furthermore, incorporating self-training further enhances performance and achieves an additional +1.6% increment in both H- and H$^3$-scores.

Summary / 总结

The paper addresses the challenge of universal domain adaptation (UniDA) by proposing a training-free label space alignment method. It leverages the zero-shot capabilities of vision-language models like CLIP to align label spaces between source and target domains, improving adaptation stability. The method identifies unknown categories in the target domain and filters noisy labels to construct a universal classifier that integrates shared and private class information. Experimental results show that this approach significantly outperforms existing UniDA techniques, achieving an average improvement of +7.9% in H-score and +6.1% in H$^3$-score on key DomainBed benchmarks.

论文针对源域和目标域标签空间可能不同的通用领域适应（UniDA）挑战，提出了一种无需训练的标签空间对齐方法，利用如CLIP等视觉语言模型对标签空间进行对齐，以提高适应稳定性。该方法通过过滤和精炼噪声标签构建一个综合共享和私有类信息的通用分类器，相比现有UniDA技术，在H-score和H$^3$-score上取得了显著的改进。
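The abstract's remark that CLIP can build a classifier from label names alone comes down to comparing an image embedding with one text embedding per label and picking the closest. The sketch below shows only that comparison step, using hand-made toy vectors in place of real CLIP outputs; no CLIP API is called, and `cosine`, `zero_shot_classify`, and the embedding values are assumptions for illustration. Extending the label dictionary with a newly discovered target-private label is shown as a plain dictionary update, which is a simplification of the filtering and refinement the paper describes.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_emb, label_embs):
    """Return the label whose (toy) text embedding is closest to the image embedding."""
    return max(label_embs, key=lambda name: cosine(image_emb, label_embs[name]))


# Toy text embeddings for labels shared by source and target (made-up values).
label_embs = {
    "dog": [1.0, 0.1, 0.0],
    "cat": [0.1, 1.0, 0.0],
}
image_emb = [0.1, 0.1, 0.9]

# With only the shared labels, the image is forced into the nearest known class.
before = zero_shot_classify(image_emb, label_embs)

# Adding a discovered target-private label extends the classifier without any training.
label_embs["car"] = [0.0, 0.0, 1.0]
after = zero_shot_classify(image_emb, label_embs)
```

The classifier changes purely by editing the label set, which is why aligning label spaces can be done without gradient updates.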

Emergent 3D Correspondence from Neural Shape Representation

Authors: Keyu Du, Jingyu Hu, Haipeng Li, Hao Xu, Haibing Huang, Chi-Wing Fu, Shuaicheng Liu

Venue: Siggraph Asia 2025

First: 2025-09-22T07:23:07+00:00 · Latest: 2025-09-22T07:23:07+00:00

Comments: This paper is accepted by Siggraph Asia 2025 conference track

Abs · PDF · Code1 · Code2

Abstract

This paper presents a new approach to estimate accurate and robust 3D semantic correspondence with the hierarchical neural semantic representation. Our work has three key contributions. First, we design the hierarchical neural semantic representation (HNSR), which consists of a global semantic feature to capture high-level structure and multi-resolution local geometric features to preserve fine details, by carefully harnessing 3D priors from pre-trained 3D generative models. Second, we design a progressive global-to-local matching strategy, which establishes coarse semantic correspondence using the global semantic feature, then iteratively refines it with local geometric features, yielding accurate and semantically-consistent mappings. Third, our framework is training-free and broadly compatible with various pre-trained 3D generative backbones, demonstrating strong generalization across diverse shape categories. Our method also supports various applications, such as shape co-segmentation, keypoint matching, and texture transfer, and generalizes well to structurally diverse shapes, with promising results even in cross-category scenarios. Both qualitative and quantitative evaluations show that our method outperforms previous state-of-the-art techniques.

中文标题/摘要

标题：从神经形状表示中涌现的3D对应关系

本文提出了一种新的方法，利用分层神经语义表示估计准确且鲁棒的3D语义对应关系。我们的工作有三个关键贡献。首先，我们设计了分层神经语义表示（HNSR），它由全局语义特征组成，用于捕捉高层结构，并结合多分辨率局部几何特征来保留细节，通过精心利用预训练3D生成模型的3D先验知识。其次，我们设计了一种逐步全局到局部匹配策略，该策略使用全局语义特征建立粗略的语义对应关系，然后通过局部几何特征逐步细化，从而获得准确且语义一致的映射。第三，我们的框架无需训练，并且广泛兼容各种预训练3D生成主干网络，展示了在各种形状类别中强大的泛化能力。我们的方法还支持多种应用，如形状共分割、关键点匹配和纹理转移，并且在结构多样化的形状上表现出良好的泛化能力，即使在跨类别场景中也取得了令人鼓舞的结果。定性和定量评估表明，我们的方法优于之前最先进的技术。

Summary / 总结

This paper introduces a new method for estimating accurate and robust 3D semantic correspondence using a hierarchical neural semantic representation (HNSR). The method includes a global semantic feature for capturing high-level structure and multi-resolution local geometric features for fine details. It also features a progressive global-to-local matching strategy that iteratively refines coarse semantic correspondences with local geometric features, resulting in semantically consistent mappings. The framework is training-free and compatible with various pre-trained 3D generative models, showing strong generalization across diverse shape categories and supporting applications like shape co-segmentation and texture transfer.

本文提出了一种新的方法，使用层次神经语义表示（HNSR）来估计准确且鲁棒的3D语义对应。该方法包括用于高层次结构的全局语义特征和用于细节的多分辨率局部几何特征。它还采用逐步全局到局部匹配策略来逐步细化对应关系。该框架无需训练且兼容各种预训练的3D生成模型，展示了在多种形状类别中的强大泛化能力，并支持形状共分割、关键点匹配和纹理转移等应用。实验结果表明，该方法在与之前最先进的技术相比时表现出更优的性能。

Multi-scale Temporal Prediction via Incremental Generation and Multi-agent Collaboration

Authors: Zhitao Zeng, Guojian Yuan, Junyuan Mao, Yuxuan Wang, Xiaoshuang Jia, Yueming Jin

Venue: NeurIPS 2025

First: 2025-09-22T07:22:27+00:00 · Latest: 2025-09-22T07:22:27+00:00

Comments: 20 pages, 6 figures

Abs · PDF · Code1 · Code2

Abstract

Accurate temporal prediction is the bridge between comprehensive scene understanding and embodied artificial intelligence. However, predicting multiple fine-grained states of a scene at multiple temporal scales is difficult for vision-language models. We formalize the Multi-Scale Temporal Prediction (MSTP) task in general and surgical scenes by decomposing multi-scale into two orthogonal dimensions: the temporal scale, forecasting states of humans and surgery at varying look-ahead intervals, and the state scale, modeling a hierarchy of states in general and surgical scenes. For example, in general scenes, states of contact relationships are finer-grained than states of spatial relationships. In surgical scenes, medium-level steps are finer-grained than high-level phases yet remain constrained by their encompassing phase. To support this unified task, we introduce the first MSTP Benchmark, featuring synchronized annotations across multiple state scales and temporal scales. We further propose a method, Incremental Generation and Multi-agent Collaboration (IG-MC), which integrates two key innovations. First, we present a plug-and-play incremental generation module that continuously synthesizes up-to-date visual previews at expanding temporal scales to inform multiple decision-making agents, keeping decisions and generated visuals synchronized and preventing performance degradation as look-ahead intervals lengthen. Second, we present a decision-driven multi-agent collaboration framework for multi-state prediction, comprising generation, initiation, and multi-state assessment agents that dynamically trigger and evaluate prediction cycles to balance global coherence and local fidelity.

中文标题/摘要

标题：多尺度时间预测通过增量生成和多智能体协作

准确的时间预测是全面场景理解与具身人工智能之间的桥梁。然而，对于视觉语言模型来说，在多个时间尺度上预测场景的多个细粒度状态是困难的。我们通过将多尺度分解为两个正交维度来形式化一般场景和手术场景中的多尺度时间预测任务：时间尺度，预测在不同预测间隔内的人类和手术状态；状态尺度，建模一般和手术场景中的状态层次结构。例如，在一般场景中，接触关系的状态比空间关系的状态更细粒度。在手术场景中，中等水平的步骤比高级阶段更细粒度，但仍受其包含阶段的约束。为了支持这一统一任务，我们引入了第一个多尺度时间预测基准，该基准在多个状态尺度和时间尺度上提供了同步注释。我们进一步提出了一种方法，增量生成和多智能体协作（IG-MC），该方法结合了两项关键创新。首先，我们提出了一种即插即用的增量生成模块，该模块在扩展的时间尺度上连续生成最新的视觉预览，以告知多个决策智能体，使决策和生成的视觉同步，并防止随着预测间隔的延长而降低性能。其次，我们提出了一种以决策为导向的多智能体协作框架，用于多状态预测，该框架包括生成、启动和多状态评估智能体，它们动态触发和评估预测周期，以平衡全局一致性和局部保真度。

Summary / 总结

The paper addresses the challenge of predicting multiple fine-grained states at various temporal scales in general and surgical scenes. It introduces the Multi-Scale Temporal Prediction (MSTP) task and benchmark, and proposes a method called Incremental Generation and Multi-agent Collaboration (IG-MC). IG-MC includes an incremental generation module that continuously updates visual previews and a multi-agent collaboration framework for predicting multiple states, ensuring synchronization and performance stability over longer look-ahead intervals.

论文针对在不同时间尺度上预测场景中多个细粒度状态的挑战，正式提出了多尺度时间预测（MSTP）任务。它引入了第一个MSTP基准，具有跨不同尺度的同步注释。所提出的方法，增量生成和多智能体协作（IG-MC），包括一个持续更新视觉预览的增量生成模块和一个用于预测多个状态的多智能体协作框架，确保决策同步并保持性能随前瞻间隔增加而不会下降。

Vision Language Models Are Not (Yet) Spelling Correctors

Authors: Junhong Liang, Bojun Zhang

First: 2025-09-22T07:10:42+00:00 · Latest: 2025-09-22T07:10:42+00:00

Abs · PDF · Code1 · Code2

Abstract

Spelling correction from visual input poses unique challenges for vision language models (VLMs), as it requires not only detecting but also correcting textual errors directly within images. We present ReViCo (Real Visual Correction), the first benchmark that systematically evaluates VLMs on real-world visual spelling correction across Chinese and English. ReViCo contains naturally occurring errors collected from real-world image data and supports fine-grained evaluation at both image and token levels. Through comprehensive experiments on representative cascaded (Qwen) and native (InternVL) open-source models, as well as closed-source systems (GPT-4o, Claude), we show that current VLMs fall significantly short of human performance, particularly in correction. To address these limitations, we explore two solution paradigms: a Joint OCR-Correction pipeline and a Background Information enhanced approach, both of which yield consistent performance gains. Our analysis highlights fundamental limitations of existing architectures and provides actionable insights for advancing multimodal spelling correction.

中文标题/摘要

标题：视觉语言模型（尚）不是拼写校正器

从视觉输入进行拼写校正对视觉语言模型（VLMs）提出了独特的挑战，因为它不仅需要检测，还需要直接在图像中纠正文本错误。我们提出了ReViCo（真实视觉校正），这是首个系统评估VLMs在中文和英文真实世界视觉拼写校正方面的基准。ReViCo包含从真实图像数据中自然收集的错误，并支持在图像和标记级别进行细粒度评估。通过对代表性的级联（Qwen）和原生（InternVL）开源模型以及封闭源系统（GPT-4o，Claude）进行全面实验，我们表明当前的VLMs在拼写校正方面远远低于人类表现，尤其是在校正方面。为了应对这些局限性，我们探索了两种解决方案范式：联合OCR-校正流水线和背景信息增强方法，两者都带来了一致的性能提升。我们的分析揭示了现有架构的基本局限性，并为推进多模态拼写校正提供了可操作的见解。

Summary / 总结

The study addresses the challenge of spelling correction from visual input, which requires both detecting and correcting textual errors within images. It introduces ReViCo, a benchmark for evaluating vision language models (VLMs) on real-world visual spelling correction in both Chinese and English. Comprehensive experiments on open-source and closed-source VLMs show that they fall significantly short of human performance, especially in correction. The research explores two potential solutions, a Joint OCR-Correction pipeline and a Background Information enhanced approach, both of which improve performance. The findings highlight the limitations of current VLM architectures and suggest directions for future research in multimodal spelling correction.

研究针对从视觉输入进行拼写修正的挑战，这需要在图像中同时检测和修正文本错误。研究引入了ReViCo，这是一个用于评估视觉语言模型（VLMs）在中文和英文实际图像拼写修正任务上的基准。全面的实验表明，当前的VLMs在修正任务上显著低于人类的表现，尤其是在修正方面。研究探索了两种潜在解决方案：联合OCR-修正流水线和背景信息增强方法，这两种方法都提高了性能。研究分析指出了现有架构的基本局限性，并为未来多模态拼写修正的研究提供了行动建议。

ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding

Authors: Jialiang Kang, Han Shu, Wenshuo Li, Yingjie Zhai, Xinghao Chen

Venue: NeurIPS 2025

First: 2025-09-17T11:28:58+00:00 · Latest: 2025-09-22T06:42:20+00:00

Comments: NeurIPS 2025

Abs · PDF · Code1 · Code2 · Code3

Abstract

Speculative decoding is a widely adopted technique for accelerating inference in large language models (LLMs), yet its application to vision-language models (VLMs) remains underexplored, with existing methods achieving only modest speedups (<1.5x). This gap is increasingly significant as multimodal capabilities become central to large-scale models. We hypothesize that large VLMs can effectively filter redundant image information layer by layer without compromising textual comprehension, whereas smaller draft models struggle to do so. To address this, we introduce Vision-Aware Speculative Decoding (ViSpec), a novel framework tailored for VLMs. ViSpec employs a lightweight vision adaptor module to compress image tokens into a compact representation, which is seamlessly integrated into the draft model's attention mechanism while preserving original image positional information. Additionally, we extract a global feature vector for each input image and augment all subsequent text tokens with this feature to enhance multimodal coherence. To overcome the scarcity of multimodal datasets with long assistant responses, we curate a specialized training dataset by repurposing existing datasets and generating extended outputs using the target VLM with modified prompts. Our training strategy mitigates the risk of the draft model exploiting direct access to the target model's hidden states, which could otherwise lead to shortcut learning when training solely on target model outputs. Extensive experiments validate ViSpec, achieving, to our knowledge, the first substantial speedup in VLM speculative decoding. Code is available at https://github.com/KangJialiang/ViSpec.

中文标题/摘要

标题：ViSpec：通过视觉感知投机解码加速视觉语言模型

投机解码是广泛采用的大语言模型（LLMs）推理加速技术，但在视觉语言模型（VLMs）中的应用尚未充分探索，现有方法仅能实现轻微加速（<1.5x）。随着多模态能力在大规模模型中变得越来越重要，这一差距变得越来越显著。我们假设大型VLMs可以在逐层过滤冗余图像信息的同时不损害文本理解，而较小的草稿模型则难以做到这一点。为了解决这个问题，我们引入了视觉感知投机解码（ViSpec），这是一种针对VLMs的新型框架。ViSpec采用轻量级的视觉适配模块将图像标记压缩成紧凑表示，并无缝集成到草稿模型的注意力机制中，同时保留原始图像的位置信息。此外，我们为每个输入图像提取一个全局特征向量，并将该特征添加到所有后续文本标记中以增强多模态一致性。为了克服缺乏长辅助响应的多模态数据集，我们通过重新利用现有数据集并使用目标VLM生成扩展输出来构建一个专门的训练数据集。我们的训练策略减轻了草稿模型直接访问目标模型隐藏状态的风险，这在仅使用目标模型输出进行训练时可能会导致捷径学习。广泛的实验验证了ViSpec，据我们所知，这是首次在VLM投机解码中实现显著加速。代码可在https://github.com/KangJialiang/ViSpec/获取。

Summary / 总结

ViSpec accelerates vision-language models (VLMs) with vision-aware speculative decoding. A lightweight vision adaptor compresses image tokens into a compact representation that is integrated into the draft model's attention mechanism while preserving the original image positional information, and a global feature vector extracted from each image is added to all subsequent text tokens to improve multimodal coherence. To address the scarcity of multimodal data with long assistant responses, the authors build a training set by repurposing existing datasets and generating extended outputs with the target VLM; the training strategy is designed to keep the draft model from shortcut learning through its access to the target model's hidden states. The authors report what is, to their knowledge, the first substantial speedup in VLM speculative decoding, where existing methods achieve less than 1.5x.

ViSpec 是一种新颖的框架，用于通过视觉感知推测性解码加速视觉语言模型（VLMs）。它使用轻量级的视觉适配器来压缩图像标记并将其无缝集成到草稿模型的注意力机制中，同时保留图像的位置信息。ViSpec 还通过将全局图像特征添加到文本标记中来增强多模态的一致性。实验表明，ViSpec 实现了 VLM 推测性解码中的显著加速，克服了之前的限制，并提供了该领域的首个实质性加速。训练使用专门的数据集进行，以避免捷径学习并确保稳健的性能。
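
The draft-then-verify loop that speculative decoding relies on can be sketched with toy models. This is a minimal sketch of generic greedy speculative decoding, not ViSpec's vision adaptor or training recipe; `speculative_decode`, `target_next`, and `draft_next` are hypothetical names introduced here, and the per-position verification calls stand in for what a real system would do in one batched forward pass of the target model.

```python
from typing import Callable, List, Tuple

# A "model" here is just a function from a token context to the next token.
NextToken = Callable[[List[int]], int]


def speculative_decode(
    target_next: NextToken,
    draft_next: NextToken,
    prompt: List[int],
    max_new: int,
    k: int = 4,
) -> Tuple[List[int], int]:
    """Greedy speculative decoding with toy next-token functions.

    Returns the generated tokens and the number of verification rounds.
    """
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new:
        rounds += 1
        # 1) The cheap draft model proposes k tokens autoregressively.
        ctx = list(out)
        proposal = []
        for _ in range(k):
            token = draft_next(ctx)
            proposal.append(token)
            ctx.append(token)
        # 2) The target model checks each proposed position in order.
        accepted: List[int] = []
        for token in proposal:
            expected = target_next(out + accepted)
            accepted.append(expected)  # always keep the target's token
            if token != expected:
                break  # first mismatch ends this round
        else:
            # Every draft token matched, so the target adds one bonus token.
            accepted.append(target_next(out + accepted))
        out.extend(accepted)
    generated = out[len(prompt):len(prompt) + max_new]
    return generated, rounds


# Toy usage: the target counts upward modulo 10 and the draft agrees with it.
target = lambda ctx: (ctx[-1] + 1) % 10
draft = lambda ctx: (ctx[-1] + 1) % 10
tokens, rounds = speculative_decode(target, draft, [0], max_new=12)
```

Because every kept token comes from the target model, the output matches plain greedy decoding with the target; the draft only changes how many verification rounds are needed.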

CLIP-IN: Enhancing Fine-Grained Visual Understanding in CLIP via Instruction Editing Data and Long Captions

Authors: Ziteng Wang, Siqi Yang, Limeng Qiao, Lin Ma

Venue: NeurIPS 2025

First: 2025-08-04T11:57:10+00:00 · Latest: 2025-09-22T06:33:36+00:00

Comments: NeurIPS 2025 Main

Abs · PDF · Code1 · Code2

Abstract

Despite the success of Vision-Language Models (VLMs) like CLIP in aligning vision and language, their proficiency in detailed, fine-grained visual comprehension remains a key challenge. We present CLIP-IN, a novel framework that bolsters CLIP's fine-grained perception through two core innovations. Firstly, we leverage instruction-editing datasets, originally designed for image manipulation, as a unique source of hard negative image-text pairs. Coupled with a symmetric hard negative contrastive loss, this enables the model to effectively distinguish subtle visual-semantic differences. Secondly, CLIP-IN incorporates long descriptive captions, utilizing rotary positional encodings to capture rich semantic context often missed by standard CLIP. Our experiments demonstrate that CLIP-IN achieves substantial gains on the MMVP benchmark and various fine-grained visual recognition tasks, without compromising robust zero-shot performance on broader classification and retrieval tasks. Critically, integrating CLIP-IN's visual representations into Multimodal Large Language Models significantly reduces visual hallucinations and enhances reasoning abilities. This work underscores the considerable potential of synergizing targeted, instruction-based contrastive learning with comprehensive descriptive information to elevate the fine-grained understanding of VLMs.

中文标题/摘要

标题：CLIP-IN：通过指令编辑数据和长描述增强CLIP的细粒度视觉理解

尽管视觉语言模型（VLMs）如CLIP在视觉和语言对齐方面取得了成功，但在细节和细粒度的视觉理解方面仍面临关键挑战。我们提出了CLIP-IN，这是一种新颖的框架，通过两项核心创新增强了CLIP的细粒度感知。首先，我们利用最初为图像操作设计的指令编辑数据集作为硬负样本图像-文本对的独特来源。结合对称的硬负样本对比损失，这使模型能够有效地区分细微的视觉语义差异。其次，CLIP-IN引入了长描述性字幕，利用旋转位置编码来捕捉标准CLIP经常忽略的丰富语义上下文。我们的实验表明，CLIP-IN在MMVP基准和各种细粒度视觉识别任务上取得了显著的提升，而不会牺牲更广泛分类和检索任务的鲁棒零样本性能。关键的是，将CLIP-IN的视觉表示集成到多模态大型语言模型中显著减少了视觉幻觉并增强了推理能力。这项工作强调了将目标导向的对比学习与全面的描述性信息相结合以提升VLMs的细粒度理解的巨大潜力。

Summary / 总结

CLIP-IN strengthens CLIP's fine-grained visual understanding in two ways. First, it repurposes instruction-editing datasets, originally built for image manipulation, as a source of hard negative image-text pairs and trains on them with a symmetric hard negative contrastive loss, which helps the model distinguish subtle visual-semantic differences. Second, it incorporates long descriptive captions, using rotary positional encodings to capture semantic context that standard CLIP often misses. Experiments show substantial gains on the MMVP benchmark and on fine-grained visual recognition tasks while preserving robust zero-shot performance on broader classification and retrieval tasks. Integrating CLIP-IN's visual representations into multimodal large language models reduces visual hallucinations and improves reasoning.

CLIP-IN通过使用指令编辑数据集生成硬负样本图像-文本对，并结合对称的硬负样本对比损失，帮助模型区分细微的视觉语义差异。此外，CLIP-IN还采用了长描述性字幕，并使用旋转位置编码来捕捉丰富的语义上下文。实验表明，CLIP-IN在细粒度视觉识别任务上的表现得到提升，同时在更广泛的零样本分类和检索任务上保持了稳健的表现。将CLIP-IN的视觉表示整合到多模态大型语言模型中可以减少视觉幻觉并增强推理能力。
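
One way to picture a symmetric hard negative contrastive loss on an instruction-editing pair is sketched below. The abstract does not give the loss formula, so this is an assumption: each source or edited image is contrasted against the caption of its counterpart, and each caption against the counterpart image, with a two-way softmax. The function names and the temperature value are hypothetical.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def _nll_positive(pos_sim: float, neg_sim: float, temperature: float) -> float:
    """Negative log-probability of the positive in a {positive, hard negative} softmax."""
    p, n = pos_sim / temperature, neg_sim / temperature
    m = max(p, n)  # subtract the max for numerical stability
    log_denominator = m + math.log(math.exp(p - m) + math.exp(n - m))
    return log_denominator - p


def symmetric_hard_negative_loss(
    img_src: Sequence[float],
    txt_src: Sequence[float],
    img_edit: Sequence[float],
    txt_edit: Sequence[float],
    temperature: float = 0.07,  # assumed value, not taken from the paper
) -> float:
    """Average of image-to-text and text-to-image terms for one editing pair."""
    terms = [
        # Image -> text: the other caption is the hard negative.
        _nll_positive(cosine(img_src, txt_src), cosine(img_src, txt_edit), temperature),
        _nll_positive(cosine(img_edit, txt_edit), cosine(img_edit, txt_src), temperature),
        # Text -> image: the other image is the hard negative.
        _nll_positive(cosine(txt_src, img_src), cosine(txt_src, img_edit), temperature),
        _nll_positive(cosine(txt_edit, img_edit), cosine(txt_edit, img_src), temperature),
    ]
    return sum(terms) / len(terms)
```

When the model cannot tell the source and edited samples apart, every term equals log 2; the loss falls toward zero as each embedding moves closer to its own caption or image than to its near-duplicate.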

Mano Report

Authors: Tianyu Fu, Anyang Su, Chenxu Zhao, Hanning Wang, Minghui Wu, Zhe Yu, Fei Hu, Mingjia Shi, Wei Dong, Jiayao Wang, Yuyang Chen, Ruiyang Yu, Siran Peng, Menglin Li, Nan Huang, Haitian Wei, Jiawei Yu, Yi Xin, Xilin Zhao, Kai Gu, Ping Jiang, Sifan Zhou, Shuo Wang

First: 2025-09-22T03:13:58+00:00 · Latest: 2025-09-22T03:13:58+00:00

Abs · PDF · Code1 · Code2

Abstract

Graphical user interfaces (GUIs) are the primary medium for human-computer interaction, yet automating GUI interactions remains challenging due to the complexity of visual elements, dynamic environments, and the need for multi-step reasoning. Existing methods based on vision-language models (VLMs) often suffer from limited resolution, domain mismatch, and insufficient sequential decision-making capability. To address these issues, we propose Mano, a robust GUI agent built upon a multi-modal foundation model pre-trained on extensive web and computer system data. Our approach integrates a novel simulated environment for high-fidelity data generation, a three-stage training pipeline (supervised fine-tuning, offline reinforcement learning, and online reinforcement learning), and a verification module for error recovery. Mano demonstrates state-of-the-art performance on multiple GUI benchmarks, including Mind2Web and OSWorld, achieving significant improvements in success rate and operational accuracy. Our work provides new insights into the effective integration of reinforcement learning with VLMs for practical GUI agent deployment, highlighting the importance of domain-specific data, iterative training, and holistic reward design.

中文标题/摘要

标题：Mano报告

图形用户界面（GUI）是人机交互的主要媒介，但由于视觉元素的复杂性、动态环境以及多步推理的需求，自动化GUI交互仍然具有挑战性。现有的基于视觉-语言模型（VLMs）的方法往往受到分辨率有限、领域不匹配和序列决策能力不足的限制。为了解决这些问题，我们提出了一种名为Mano的稳健GUI代理，该代理基于在大量网络和计算机系统数据上预训练的多模态基础模型构建。我们的方法结合了一个新颖的模拟环境以生成高保真数据、三阶段训练流程（监督微调、离线强化学习和在线强化学习）以及一个验证模块以实现错误恢复。Mano在多个GUI基准测试中表现出最先进的性能，包括Mind2Web和OSWorld，显著提高了成功率和操作准确性。我们的工作为强化学习与VLMs的有效集成提供了新的见解，强调了领域特定数据、迭代训练和整体奖励设计的重要性。

Summary / 总结

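The research aims to improve the automation of graphical user interface (GUI) interactions, where existing vision-language model (VLM) methods suffer from limited resolution, domain mismatch, and insufficient sequential decision-making capability. It proposes Mano, a GUI agent built on a multi-modal foundation model pre-trained on extensive web and computer system data. The approach combines a simulated environment for high-fidelity data generation, a three-stage training pipeline (supervised fine-tuning, offline reinforcement learning, and online reinforcement learning), and a verification module for error recovery. Mano reports state-of-the-art performance on GUI benchmarks including Mind2Web and OSWorld, with improvements in success rate and operational accuracy.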

研究旨在通过解决现有视觉-语言模型的局限性，提高GUI交互的自动化水平。提出了一个名为Mano的稳健GUI代理，利用广泛网页和计算机系统数据预训练的多模态基础模型。它采用三阶段训练管道和验证模块，展示了在Mind2Web和OSWorld等基准测试中的优越性能，显著提高了成功率和操作准确性。

UIPro: Unleashing Superior Interaction Capability For GUI Agents

Authors: Hongxin Li, Jingran Su, Jingfan Chen, Zheng Ju, Yuntao Chen, Qing Li, Zhaoxiang Zhang

Venue: ICCV 2025

First: 2025-09-22T03:04:53+00:00 · Latest: 2025-09-22T03:04:53+00:00

Comments: Accepted to ICCV 2025

Abs · PDF · Code1 · Code2

Abstract

Building autonomous agents that perceive and operate graphical user interfaces (GUIs) like humans has long been a vision in the field of artificial intelligence. Central to these agents is the capability for GUI interaction, which involves GUI understanding and planning capabilities. Existing methods have tried developing GUI agents based on the multi-modal comprehension ability of vision-language models (VLMs). However, the limited scenario, insufficient size, and heterogeneous action spaces hinder the progress of building generalist GUI agents. To resolve these issues, this paper proposes UIPro, a novel generalist GUI agent trained with extensive multi-platform and multi-task GUI interaction data, coupled with a unified action space. We first curate a comprehensive dataset encompassing 20.6 million GUI understanding tasks to pre-train UIPro, granting it a strong GUI grounding capability, which is key to downstream GUI agent tasks. Subsequently, we establish a unified action space to harmonize heterogeneous GUI agent task datasets and produce a merged dataset to foster the action prediction ability of UIPro via continued fine-tuning. Experimental results demonstrate UIPro's superior performance across multiple GUI task benchmarks on various platforms, highlighting the effectiveness of our approach.

中文标题/摘要

标题：UIPro：释放图形用户界面代理的卓越交互能力

构建能够像人类一样感知和操作图形用户界面（GUI）的自主代理一直是人工智能领域的愿景。这些代理的核心能力是GUI交互，这包括GUI理解能力和规划能力。现有方法尝试基于视觉语言模型（VLMs）的多模态理解能力来开发GUI代理。然而，有限的场景、不足的规模和异构的动作空间阻碍了构建通用GUI代理的进展。为了解决这些问题，本文提出了一种名为UIPro的新颖通用GUI代理，该代理通过广泛多平台和多任务GUI交互数据进行训练，并结合统一的动作空间。我们首先整理了一个包含2060万GUI理解任务的综合数据集，用于预训练UIPro，赋予其强大的GUI基础能力，这是下游GUI代理任务的关键。随后，我们建立了一个统一的动作空间，以协调异构的GUI代理任务数据集，并生成一个合并数据集，通过持续微调促进UIPro的动作预测能力。实验结果表明，UIPro在各种平台的多个GUI任务基准测试中表现出色，突显了我们方法的有效性。

Summary / 总结

This paper addresses the challenge of building generalist GUI agents, which has been held back by limited scenarios, insufficient data size, and heterogeneous action spaces. It proposes UIPro, which is first pre-trained on a curated dataset of 20.6 million GUI understanding tasks to give it strong GUI grounding, and then fine-tuned on a merged dataset built by mapping heterogeneous GUI agent task datasets into a unified action space. The results show superior performance across multiple GUI task benchmarks on various platforms.

本文提出了UIPro，这是一种通过跨多个平台的大规模GUI交互任务数据集进行训练的通用GUI代理。UIPro使用统一的动作空间来协调异构的数据集并提高动作预测能力。实验结果表明，UIPro在各种GUI任务基准测试中优于现有方法，证明了所提出方法的有效性。
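
The idea of a unified action space can be illustrated with a small schema plus per-dataset adapters. This is a sketch under stated assumptions: the abstract does not describe UIPro's actual schema, so the `UnifiedAction` fields and both source record formats (`from_web_record`, `from_mobile_record`) are hypothetical and exist only to show how differently shaped annotations can be mapped onto one representation.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple


@dataclass(frozen=True)
class UnifiedAction:
    """Hypothetical platform-independent action record."""
    kind: str                                     # "click", "type", or "scroll"
    target: Optional[Tuple[float, float]] = None  # normalised (x, y) in [0, 1]
    text: Optional[str] = None                    # text payload for "type"
    direction: Optional[str] = None               # direction for "scroll"


def from_web_record(rec: Dict[str, Any], width: int, height: int) -> UnifiedAction:
    """Adapter for a made-up web format that uses pixel coordinates.

    Example records: {"op": "CLICK", "x": 960, "y": 540}, {"op": "TYPE", "value": "hi"}.
    """
    op = rec["op"].lower()
    if op == "click":
        return UnifiedAction("click", target=(rec["x"] / width, rec["y"] / height))
    if op == "type":
        return UnifiedAction("type", text=rec["value"])
    raise ValueError(f"unsupported web op: {rec['op']}")


def from_mobile_record(rec: Dict[str, Any]) -> UnifiedAction:
    """Adapter for a made-up mobile format that already uses normalised points.

    Example records: {"action_type": "tap", "point": [0.5, 0.5]},
    {"action_type": "swipe", "direction": "up"}.
    """
    kind = rec["action_type"]
    if kind == "tap":
        x, y = rec["point"]
        return UnifiedAction("click", target=(x, y))
    if kind == "input_text":
        return UnifiedAction("type", text=rec["text"])
    if kind == "swipe":
        return UnifiedAction("scroll", direction=rec["direction"])
    raise ValueError(f"unsupported mobile action: {kind}")
```

Once both sources emit the same record type, a web click and a mobile tap at the same relative position become identical training targets, which is what allows the datasets to be merged.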

GLSim: Detecting Object Hallucinations in LVLMs via Global-Local Similarity

Authors: Seongheon Park, Yixuan Li

Venue: NeurIPS 2025

First: 2025-08-27T15:30:06+00:00 · Latest: 2025-09-22T02:58:41+00:00

Comments: NeurIPS 2025

Abs · PDF · Code1 · Code2

Abstract

Object hallucination in large vision-language models presents a significant challenge to their safe deployment in real-world applications. Recent works have proposed object-level hallucination scores to estimate the likelihood of object hallucination; however, these methods typically adopt either a global or local perspective in isolation, which may limit detection reliability. In this paper, we introduce GLSim, a novel training-free object hallucination detection framework that leverages complementary global and local embedding similarity signals between image and text modalities, enabling more accurate and reliable hallucination detection in diverse scenarios. We comprehensively benchmark existing object hallucination detection methods and demonstrate that GLSim achieves superior detection performance, outperforming competitive baselines by a significant margin.

中文标题/摘要

标题：GLSim：通过全局-局部相似性检测LVLM中的对象幻觉

大型视觉-语言模型中的对象幻觉对其实用场景的安全部署构成了重大挑战。近期研究提出了对象级幻觉评分以估计对象幻觉的可能性；然而，这些方法通常孤立地采用全局或局部视角，这可能限制了检测的可靠性。本文介绍了一种名为GLSim的新颖无训练对象幻觉检测框架，该框架利用图像和文本模态之间互补的全局和局部嵌入相似性信号，能够在多种场景中实现更准确和可靠的幻觉检测。我们全面评估了现有的对象幻觉检测方法，并证明GLSim在检测性能上优于竞争基线，具有显著优势。

Summary / 总结

The research aims to address the challenge of object hallucination in large vision-language models for safe real-world deployment. GLSim is a training-free framework that combines global and local embedding similarity signals to detect object hallucinations more accurately and reliably across various scenarios. Experimental results show that GLSim outperforms existing methods by a significant margin in detection performance.

研究旨在解决大型视觉-语言模型中对象幻觉的问题，以确保其在实际应用中的安全部署。GLSim 是一个无需训练的框架，结合全局和局部嵌入相似性信号来更准确可靠地检测各种场景中的对象幻觉。实验结果表明，GLSim 在检测性能上显著优于现有方法。
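
A training-free score that mixes a global and a local similarity signal might look like the sketch below. The abstract does not specify how GLSim computes or combines the two signals, so everything here is an assumption: the global term is the cosine similarity between an object's text embedding and a whole-image embedding, the local term averages the top-k patch similarities, and `weight` and `top_k` are hypothetical parameters.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def global_local_score(
    obj_emb: Sequence[float],
    global_img_emb: Sequence[float],
    patch_embs: Sequence[Sequence[float]],
    top_k: int = 2,
    weight: float = 0.5,
) -> float:
    """Higher score suggests the object is grounded; lower suggests hallucination."""
    if not patch_embs:
        raise ValueError("at least one patch embedding is required")
    # Global signal: does the image as a whole agree with the object?
    global_sim = cosine(obj_emb, global_img_emb)
    # Local signal: do the best-matching regions agree with the object?
    patch_sims = sorted((cosine(obj_emb, p) for p in patch_embs), reverse=True)
    k = min(top_k, len(patch_sims))
    local_sim = sum(patch_sims[:k]) / k
    return weight * global_sim + (1.0 - weight) * local_sim
```

Thresholding such a score per mentioned object gives a detector that needs no additional training, which is the property the abstract emphasises.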

MoEs Are Stronger than You Think: Hyper-Parallel Inference Scaling with RoE

Authors: Soheil Zibakhsh, Mohammad Samragh, Kumari Nishu, Lauren Hannah, Arnav Kundu, Minsik Cho

First: 2025-09-21T21:05:29+00:00 · Latest: 2025-09-21T21:05:29+00:00

Abs · PDF · Code1 · Code2

Abstract

The generation quality of large language models (LLMs) is often improved by utilizing inference-time sequence-level scaling methods (e.g., Chain-of-Thought). We introduce hyper-parallel scaling, a complementary framework that improves prediction quality at the token level. Hyper-parallel scaling computes and aggregates multiple output proposals for a single token from the model. We implement this concept in Mixture-of-Experts (MoE) models, which we refer to as Roster of Experts (RoE). RoE is a training-free inference algorithm that turns a single MoE into a dynamic ensemble of MoEs. RoE injects controlled stochasticity into the expert routing mechanism, enabling it to sample multiple diverse experts for each token and aggregate their outputs for a more accurate final prediction. To overcome the computational cost, we introduce an efficient batching strategy and a specialized KV-caching mechanism that minimizes compute and memory overhead. For example, RoE enables a 7B MoE model to match the performance of a 10.5B MoE model while using 30% less compute for inference. These gains are achieved without any fine-tuning of model parameters.

中文标题/摘要

标题：MoE比你想象的更强：RoE的超并行推理扩展

大型语言模型（LLMs）的生成质量通常通过利用推理时序列级别的扩展方法（例如，思维链）来提高。我们介绍了超并行扩展，这是一种互补的框架，可以在标记级别提高预测质量。超并行扩展计算并聚合模型为单个标记生成的多个输出提案。我们通过混合专家模型（MoE）实现了这一概念，并称之为专家阵容（RoE）。RoE 是一种无需训练的推理算法，可以将单个 MoE 转换为动态的 MoE 集合。RoE 向专家路由机制注入了受控的随机性，使其能够为每个标记采样多个多样化的专家，并聚合它们的输出以获得更准确的最终预测。为了克服计算成本，我们引入了一种高效的批量策略和一种专门的 KV 缓存机制，以最小化计算和内存开销。例如，RoE 使一个 7B 的 MoE 模型能够达到一个 10.5B 的 MoE 模型的性能，同时在推理时使用 30% 更少的计算量。这些收益是在不调整模型参数的情况下实现的。

Summary / 总结

The paper introduces hyper-parallel scaling, a framework that improves prediction quality at the token level by computing and aggregating multiple output proposals for a single token. It is implemented for Mixture-of-Experts (MoE) models as Roster of Experts (RoE), a training-free inference algorithm that injects controlled stochasticity into expert routing, samples multiple diverse experts for each token, and aggregates their outputs. An efficient batching strategy and a specialized KV-caching mechanism limit the added compute and memory overhead. As an example, RoE enables a 7B MoE model to match the performance of a 10.5B MoE model while using 30% less inference compute, without any fine-tuning.

论文提出了一种超并行扩展方法，通过为每个令牌计算和聚合多个输出提案来提高大型语言模型的预测质量。这种方法在Mixture-of-Experts (MoE)模型中实现，称为Roster of Experts (RoE)，动态地为每个令牌采样多个不同的专家并聚合它们的输出。RoE 通过高效的批量处理和专门的KV缓存机制降低计算成本，使得一个7B MoE模型的性能与10.5B MoE模型相当，同时减少30%的推理计算量。
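
The core idea of sampling several expert subsets for one token and aggregating the results can be sketched with scalar toy experts. This is not the paper's implementation: the use of Gumbel noise on the router logits, gating with the clean logits, and simple averaging are all assumptions, and `moe_forward` and `roe_forward` are hypothetical names. The batching and KV-caching optimisations mentioned in the abstract are not modelled.

```python
import math
import random
from typing import Callable, List, Sequence

Expert = Callable[[float], float]


def softmax(xs: Sequence[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x: float, experts: Sequence[Expert], router_logits: Sequence[float],
                selection_scores: Sequence[float], top_k: int) -> float:
    """Top-k MoE layer: choose experts by selection_scores, gate by router_logits."""
    order = sorted(range(len(experts)), key=lambda i: selection_scores[i], reverse=True)
    chosen = order[:top_k]
    gates = softmax([router_logits[i] for i in chosen])
    return sum(g * experts[i](x) for g, i in zip(gates, chosen))


def _gumbel(rng: random.Random) -> float:
    u = max(rng.random(), 1e-12)  # avoid log(0)
    return -math.log(-math.log(u))


def roe_forward(x: float, experts: Sequence[Expert], router_logits: Sequence[float],
                top_k: int = 2, n_samples: int = 8, noise_scale: float = 1.0,
                seed: int = 0) -> float:
    """Average several MoE outputs, each routed with independently perturbed scores."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(n_samples):
        # Controlled stochasticity: perturb only the scores used to pick experts.
        noisy = [logit + noise_scale * _gumbel(rng) for logit in router_logits]
        outputs.append(moe_forward(x, experts, router_logits, noisy, top_k))
    return sum(outputs) / len(outputs)
```

With `noise_scale=0.0` every sample selects the same experts and the result reduces to the ordinary top-k MoE output; larger values spread the samples over more of the roster, turning a single MoE into a small ensemble without changing any weights.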

FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions

Authors: Bowen Qin, Chen Yue, Fang Yin, Hui Wang, JG Yao, Jiakang Liu, Jing-Shu Zheng, Miguel Hu Chen, Richeng Xuan, Shibei Meng, Shiqi Zhou, Teng Dai, Tong-Shuai Ren, Wei Cui, Xi Yang, Xialin Du, Xiaojing Xu, Xue Sun, Xuejing Li, Yaming Liu, Yesheng Liu, Ying Liu, Yonghua Lin, Yu Zhao, Yunduo Zhang, Yuwen Luo, Zheqi He, Zhiyuan He, Zhongyuan Wang

First: 2025-09-21T17:53:30+00:00 · Latest: 2025-09-21T17:53:30+00:00

Comments: 23 pages in main text

Abs · PDF · Code1 · Code2 · Project1

Abstract

We conduct a moderate-scale contamination-free (to some extent) evaluation of current large reasoning models (LRMs) with some preliminary findings. We also release ROME, our evaluation benchmark for vision language models intended to test reasoning from visual clues. We attach links to the benchmark, evaluation data, and other updates on this website: https://flageval-baai.github.io/LRM-Eval/

中文标题/摘要

标题：FlagEval 研究报告：对自动可验证文本和视觉问题上大型推理模型的初步评估

我们进行了一项中等规模的无污染（在一定程度上）的大型推理模型（LRMs）评估，并获得了初步发现。我们还发布了用于视觉语言模型的评估基准 ROME，旨在测试从视觉线索进行推理的能力。我们在此网站上附上了基准、评估数据和其他更新的链接：https://flageval-baai.github.io/LRM-Eval/