arXiv Paper Digest

Stitch: Training-Free Position Control in Multimodal Diffusion Transformers

Authors: Jessica Bader, Mateusz Pach, Maria A. Bravo, Serge Belongie, Zeynep Akata

First: 2025-09-30T17:59:51+00:00 · Latest: 2025-09-30T17:59:51+00:00

Comments: Preprint

Abstract

Text-to-Image (T2I) generation models have advanced rapidly in recent years, but accurately capturing spatial relationships like "above" or "to the right of" poses a persistent challenge. Earlier methods improved spatial relationship following with external position control. However, as architectures evolved to enhance image quality, these techniques became incompatible with modern models. We propose Stitch, a training-free method for incorporating external position control into Multi-Modal Diffusion Transformers (MMDiT) via automatically-generated bounding boxes. Stitch produces images that are both spatially accurate and visually appealing by generating individual objects within designated bounding boxes and seamlessly stitching them together. We find that targeted attention heads capture the information necessary to isolate and cut out individual objects mid-generation, without needing to fully complete the image. We evaluate Stitch on PosEval, our benchmark for position-based T2I generation. Featuring five new tasks that extend the concept of Position beyond the basic GenEval task, PosEval demonstrates that even top models still have significant room for improvement in position-based generation. Tested on Qwen-Image, FLUX, and SD3.5, Stitch consistently enhances base models, even improving FLUX by 218% on GenEval's Position task and by 206% on PosEval. Stitch achieves state-of-the-art results with Qwen-Image on PosEval, improving over previous models by 54%, all accomplished while integrating position control into leading models training-free. Code is available at https://github.com/ExplainableML/Stitch.

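Summary

Stitch is a training-free method that improves the spatial accuracy of text-to-image generation by adding external position control to Multi-Modal Diffusion Transformers via automatically generated bounding boxes. It generates each object inside its designated box, cuts the objects out mid-generation, and stitches them together into a spatially accurate and visually appealing image. Applied to FLUX, Stitch improves results by 218% on GenEval's Position task and by 206% on PosEval, a new benchmark that extends the basic GenEval Position task; with Qwen-Image it achieves state-of-the-art results on PosEval.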

TTT3R: 3D Reconstruction as Test-Time Training

Authors: Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, Anpei Chen

First: 2025-09-30T17:59:51+00:00 · Latest: 2025-09-30T17:59:51+00:00

Comments: Page: https://rover-xingyu.github.io/TTT3R Code: https://github.com/Inception3D/TTT3R

Abstract

Modern Recurrent Neural Networks have become a competitive architecture for 3D reconstruction due to their linear-time complexity. However, their performance degrades significantly when applied beyond the training context length, revealing limited length generalization. In this work, we revisit the 3D reconstruction foundation models from a Test-Time Training perspective, framing their designs as an online learning problem. Building on this perspective, we leverage the alignment confidence between the memory state and incoming observations to derive a closed-form learning rate for memory updates, to balance between retaining historical information and adapting to new observations. This training-free intervention, termed TTT3R, substantially improves length generalization, achieving a $2\times$ improvement in global pose estimation over baselines, while operating at 20 FPS with just 6 GB of GPU memory to process thousands of images. Code is available at https://rover-xingyu.github.io/TTT3R
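
The idea of a confidence-dependent learning rate for memory updates can be pictured with a toy sketch. Everything below is an illustrative assumption rather than the formula from the paper: the names `alignment_confidence` and `gated_update` are hypothetical, cosine similarity stands in for the alignment confidence, and a sigmoid turns it into a step size between 0 and 1.

```python
import math

def alignment_confidence(state, observation):
    """Hypothetical confidence: cosine similarity squashed into (0, 1)."""
    dot = sum(s * o for s, o in zip(state, observation))
    norm = math.sqrt(sum(s * s for s in state)) * math.sqrt(sum(o * o for o in observation))
    cosine = dot / norm if norm > 0 else 0.0
    return 1.0 / (1.0 + math.exp(-cosine))

def gated_update(state, observation):
    """Blend old memory and new observation with a per-step learning rate."""
    lr = alignment_confidence(state, observation)
    # lr near 0 retains history; lr near 1 adapts to the new observation.
    return [(1.0 - lr) * s + lr * o for s, o in zip(state, observation)]
```

The point of the sketch is only the shape of the update: a single scalar, computed from the state and the observation without any training, decides how much of the old memory is kept.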

Query-Kontext: An Unified Multimodal Model for Image Generation and Editing

Authors: Yuxin Song, Wenkai Dong, Shizun Wang, Qi Zhang, Song Xue, Tao Yuan, Hu Yang, Haocheng Feng, Hang Zhou, Xinyan Xiao, Jingdong Wang

First: 2025-09-30T17:59:46+00:00 · Latest: 2025-09-30T17:59:46+00:00

Comments: 23 pages, 10 figures

Abstract

Unified Multimodal Models (UMMs) have demonstrated remarkable performance in text-to-image generation (T2I) and editing (TI2I), whether instantiated as assembled unified frameworks which couple powerful vision-language model (VLM) with diffusion-based generator, or as naive Unified Multimodal Models with an early fusion of understanding and generation modalities. We contend that in current unified frameworks, the crucial capability of multimodal generative reasoning which encompasses instruction understanding, grounding, and image referring for identity preservation and faithful reconstruction, is intrinsically entangled with high-fidelity synthesis. In this work, we introduce Query-Kontext, a novel approach that bridges the VLM and diffusion model via a multimodal ``kontext'' composed of semantic cues and coarse-grained image conditions encoded from multimodal inputs. This design delegates the complex ability of multimodal generative reasoning to powerful VLM while reserving diffusion model's role for high-quality visual synthesis. To achieve this, we propose a three-stage progressive training strategy. First, we connect the VLM to a lightweight diffusion head via multimodal kontext tokens to unleash the VLM's generative reasoning ability. Second, we scale this head to a large, pre-trained diffusion model to enhance visual detail and realism. Finally, we introduce a low-level image encoder to improve image fidelity and perform instruction tuning on downstream tasks. Furthermore, we build a comprehensive data pipeline integrating real, synthetic, and open-source datasets, covering diverse multimodal reference-to-image scenarios, including image generation, instruction-driven editing, customized generation, and multi-subject composition. Experiments show that our approach matches strong unified baselines and even outperforms task-specific state-of-the-art methods in several cases.

Summary

The research aims to improve the performance of unified multimodal models in text-to-image generation and editing by addressing the intrinsic entanglement of multimodal generative reasoning and high-fidelity synthesis. The Query-Kontext approach introduces a multimodal 'kontext' composed of semantic cues and image conditions, bridging a vision-language model with a diffusion model. The method employs a three-stage training strategy, first connecting the VLM to a lightweight diffusion head, then scaling it to a large diffusion model, and finally incorporating a low-level image encoder for fidelity improvement. Experiments demonstrate that Query-Kontext matches strong unified baselines and outperforms task-specific state-of-the-art methods in several cases.

Seeing Through Deception: Uncovering Misleading Creator Intent in Multimodal News with Vision-Language Models

Authors: Jiaying Wu, Fanxiao Li, Zihang Fu, Min-Yen Kan, Bryan Hooi

First: 2025-05-21T13:14:32+00:00 · Latest: 2025-09-30T17:53:25+00:00

Abstract

The impact of misinformation arises not only from factual inaccuracies but also from the misleading narratives that creators deliberately embed. Interpreting such creator intent is therefore essential for multimodal misinformation detection (MMD) and effective information governance. To this end, we introduce DeceptionDecoded, a large-scale benchmark of 12,000 image-caption pairs grounded in trustworthy reference articles, created using an intent-guided simulation framework that models both the desired influence and the execution plan of news creators. The dataset captures both misleading and non-misleading cases, spanning manipulations across visual and textual modalities, and supports three intent-centric tasks: (1) misleading intent detection, (2) misleading source attribution, and (3) creator desire inference. We evaluate 14 state-of-the-art vision-language models (VLMs) and find that they struggle with intent reasoning, often relying on shallow cues such as surface-level alignment, stylistic polish, or heuristic authenticity signals. These results highlight the limitations of current VLMs and position DeceptionDecoded as a foundation for developing intent-aware models that go beyond shallow cues in MMD.

Summary

The study aims to expose misleading narratives in multimodal news by interpreting creator intent, which is essential for misinformation detection. It introduces DeceptionDecoded, a large-scale dataset of 12,000 image-caption pairs, and evaluates 14 state-of-the-art vision-language models, finding that they typically rely on shallow cues rather than deeper intent reasoning. This highlights the need for more capable models in misinformation detection.

Clarification as Supervision: Reinforcement Learning for Vision-Language Interfaces

Authors: John Gkountouras, Ivan Titov

First: 2025-09-30T17:46:46+00:00 · Latest: 2025-09-30T17:46:46+00:00

Abstract

Recent text-only models demonstrate remarkable mathematical reasoning capabilities. Extending these to visual domains requires vision-language models to translate images into text descriptions. However, current models, trained to produce captions for human readers, often omit the precise details that reasoning systems require. This creates an interface mismatch: reasoners often fail not due to reasoning limitations but because they lack access to critical visual information. We propose Adaptive-Clarification Reinforcement Learning (AC-RL), which teaches vision models what information reasoners need through interaction. Our key insight is that clarification requests during training reveal information gaps; by penalizing success that requires clarification, we create pressure for comprehensive initial captions that enable the reasoner to solve the problem in a single pass. AC-RL improves average accuracy by 4.4 points over pretrained baselines across seven visual mathematical reasoning benchmarks, and analysis shows it would cut clarification requests by up to 39% if those were allowed. By treating clarification as a form of implicit supervision, AC-RL demonstrates that vision-language interfaces can be effectively learned through interaction alone, without requiring explicit annotations.
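
The reward shaping described in the abstract, where a success that needed clarification earns less than a single-pass success, can be sketched as follows. The function name `ac_rl_reward` and the penalty value of 0.5 are hypothetical; the abstract does not state the actual reward values.

```python
def ac_rl_reward(correct, used_clarification, clarification_penalty=0.5):
    """Toy reward: full credit only for single-pass success.

    clarification_penalty is an assumed value, not taken from the paper.
    """
    if not correct:
        return 0.0
    if used_clarification:
        # The reasoner succeeded, but only after asking for missing details.
        return 1.0 - clarification_penalty
    return 1.0
```

Under such a scheme the captioner is pushed toward descriptions complete enough that the reasoner never has to ask.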

Stable Cinemetrics: Structured Taxonomy and Evaluation for Professional Video Generation

Authors: Agneet Chatterjee, Rahim Entezari, Maksym Zhuravinskyi, Maksim Lapin, Reshinth Adithyan, Amit Raj, Chitta Baral, Yezhou Yang, Varun Jampani

Venue: NeurIPS 2025

First: 2025-09-30T17:22:18+00:00 · Latest: 2025-09-30T17:22:18+00:00

Comments: NeurIPS 2025. Project page: https://stable-cinemetrics.github.io/

Abstract

Recent advances in video generation have enabled high-fidelity video synthesis from user-provided prompts. However, existing models and benchmarks fail to capture the complexity and requirements of professional video generation. Towards that goal, we introduce Stable Cinemetrics, a structured evaluation framework that formalizes filmmaking controls into four disentangled, hierarchical taxonomies: Setup, Event, Lighting, and Camera. Together, these taxonomies define 76 fine-grained control nodes grounded in industry practices. Using these taxonomies, we construct a benchmark of prompts aligned with professional use cases and develop an automated pipeline for prompt categorization and question generation, enabling independent evaluation of each control dimension. We conduct a large-scale human study spanning 10+ models and 20K videos, annotated by a pool of 80+ film professionals. Our analyses, both coarse- and fine-grained, reveal that even the strongest current models exhibit significant gaps, particularly in Events and Camera-related controls. To enable scalable evaluation, we train an automatic evaluator, a vision-language model aligned with expert annotations that outperforms existing zero-shot baselines. SCINE is the first approach to situate professional video generation within the landscape of video generative models, introducing taxonomies centered around cinematic controls and supporting them with structured evaluation pipelines and detailed analyses to guide future research.

Summary

Stable Cinemetrics introduces a structured evaluation framework for professional video generation, formalizing filmmaking controls into four taxonomies: Setup, Event, Lighting, and Camera. This framework enables detailed evaluation of video generation models across 76 fine-grained control nodes. The study, involving 10+ models and 20K videos annotated by 80+ film professionals, reveals significant gaps in current models, especially in Events and Camera-related controls. An automatic evaluator was also trained to support scalable evaluation, outperforming existing zero-shot baselines.

LoLA: Low-Rank Linear Attention With Sparse Caching

Authors: Luke McDermott, Robert W. Heath Jr., Rahul Parhi

First: 2025-05-29T17:12:42+00:00 · Latest: 2025-09-30T16:42:50+00:00

Abstract

The per-token cost of transformer inference scales with context length, preventing its application to lifelong in-context learning. Linear attention is an efficient alternative that maintains a constant memory footprint, even on infinite context lengths. While this is a potential candidate for lifelong learning, it falls short in memory capacity. In this paper, we propose LoLA, a training-free augmentation to linear attention that boosts associative recall. LoLA distributes past key-value pairs from context into three memory systems: (i) recent pairs in a local sliding window cache; (ii) difficult-to-memorize pairs in a sparse, global cache; and (iii) generic pairs in the recurrent hidden state of linear attention. We show through ablations that our self-recall error metric is crucial to efficiently manage long-term associative memories. On pass-key retrieval tasks, LoLA improves the base model's performance from 0.6% to 97.4% accuracy. This is achieved with a 4.6x smaller cache than Llama-3.1 8B on 4K context length. LoLA also outperforms other 1B and 8B parameter subquadratic models on zero-shot commonsense reasoning tasks.
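
The three memory systems listed in the abstract can be laid out as a small data structure. This is a minimal sketch under stated assumptions: the class `ThreeTierMemory`, its eviction policy, and the particular `recall_error` formula are hypothetical stand-ins, not the self-recall error metric or cache management from the paper.

```python
import numpy as np

class ThreeTierMemory:
    """Toy layout: sliding window -> sparse cache -> recurrent state."""

    def __init__(self, dim, window=2, sparse_slots=2):
        self.window, self.sparse_slots = window, sparse_slots
        self.local = []                    # (i) most recent key-value pairs
        self.sparse = []                   # (ii) pairs the state recalls poorly
        self.state = np.zeros((dim, dim))  # (iii) linear-attention hidden state

    def recall_error(self, key, value):
        # Assumed metric: how far the state's readout for `key` would be
        # from `value` if the pair were absorbed into the state.
        predicted = key @ (self.state + np.outer(key, value))
        return float(np.linalg.norm(predicted - value))

    def add(self, key, value):
        self.local.append((key, value))
        if len(self.local) <= self.window:
            return
        # The oldest pair leaves the window and becomes a sparse-cache candidate.
        self.sparse.append(self.local.pop(0))
        if len(self.sparse) > self.sparse_slots:
            # Absorb the easiest-to-recall pair into the hidden state,
            # keeping the hard-to-memorize pairs in the sparse cache.
            errors = [self.recall_error(k, v) for k, v in self.sparse]
            k, v = self.sparse.pop(int(np.argmin(errors)))
            self.state += np.outer(k, v)
```

The cache sizes stay fixed however long the context grows, which is the property that keeps the memory footprint constant.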

Summary

LoLA is a training-free method that enhances linear attention in transformers to improve associative recall for lifelong in-context learning. It divides past key-value pairs into a local sliding window cache, a sparse global cache, and a recurrent hidden state. Experiments show that LoLA significantly improves performance on pass-key retrieval tasks, achieving 97.4% accuracy with a smaller cache compared to Llama-3.1 8B. It also outperforms other subquadratic models on zero-shot commonsense reasoning tasks.

Zero-Shot Decentralized Federated Learning

Authors: Alessio Masano, Matteo Pennisi, Federica Proietto Salanitri, Concetto Spampinato, Giovanni Bellitto

First: 2025-09-30T16:13:21+00:00 · Latest: 2025-09-30T16:13:21+00:00

Comments: Accepted at International Joint Conference on Neural Networks (IJCNN) 2025. Code available at https://github.com/perceivelab/ZeroDFL

Abstract

CLIP has revolutionized zero-shot learning by enabling task generalization without fine-tuning. While prompting techniques like CoOp and CoCoOp enhance CLIP's adaptability, their effectiveness in Federated Learning (FL) remains an open challenge. Existing federated prompt learning approaches, such as FedCoOp and FedTPG, improve performance but face generalization issues, high communication costs, and reliance on a central server, limiting scalability and privacy. We propose Zero-shot Decentralized Federated Learning (ZeroDFL), a fully decentralized framework that enables zero-shot adaptation across distributed clients without a central coordinator. ZeroDFL employs an iterative prompt-sharing mechanism, allowing clients to optimize and exchange textual prompts to enhance generalization while drastically reducing communication overhead. We validate ZeroDFL on nine diverse image classification datasets, demonstrating that it consistently outperforms--or remains on par with--state-of-the-art federated prompt learning methods. More importantly, ZeroDFL achieves this performance in a fully decentralized setting while reducing communication overhead by 118x compared to FedTPG. These results highlight that our approach not only enhances generalization in federated zero-shot learning but also improves scalability, efficiency, and privacy preservation--paving the way for decentralized adaptation of large vision-language models in real-world applications.

Summary

ZeroDFL is a fully decentralized framework for zero-shot federated learning that enables clients to optimize and exchange textual prompts without a central coordinator. It reduces communication overhead by 118x compared to FedTPG and consistently outperforms or matches state-of-the-art methods across nine diverse image classification datasets, enhancing generalization, scalability, and privacy.

Adaptive Planning for Multi-Attribute Controllable Summarization with Monte Carlo Tree Search

Authors: Sangwon Ryu, Heejin Do, Yunsu Kim, Gary Geunbae Lee, Jungseul Ok

First: 2025-09-30T15:55:24+00:00 · Latest: 2025-09-30T15:55:24+00:00

Abstract

Controllable summarization moves beyond generic outputs toward human-aligned summaries guided by specified attributes. In practice, the interdependence among attributes makes it challenging for language models to satisfy correlated constraints consistently. Moreover, previous approaches often require per-attribute fine-tuning, limiting flexibility across diverse summary attributes. In this paper, we propose adaptive planning for multi-attribute controllable summarization (PACO), a training-free framework that reframes the task as planning the order of sequential attribute control with a customized Monte Carlo Tree Search (MCTS). In PACO, nodes represent summaries, and actions correspond to single-attribute adjustments, enabling progressive refinement of only the attributes requiring further control. This strategy adaptively discovers optimal control orders, ultimately producing summaries that effectively meet all constraints. Extensive experiments across diverse domains and models demonstrate that PACO achieves robust multi-attribute controllability, surpassing both LLM-based self-planning models and fine-tuned baselines. Remarkably, PACO with Llama-3.2-1B rivals the controllability of the much larger Llama-3.3-70B baselines. With larger models, PACO achieves superior control performance, outperforming all competitors.
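
For readers unfamiliar with Monte Carlo Tree Search, the selection step that decides which single-attribute adjustment to try next can be sketched with the standard UCT rule. This is textbook UCT, not PACO's customized variant; the function names, the `stats` layout, and the exploration constant are assumptions made for illustration.

```python
import math

def uct_score(total_reward, visits, parent_visits, exploration=1.4):
    """Standard UCT: average reward plus an exploration bonus."""
    if visits == 0:
        return float("inf")  # always try an unvisited action first
    bonus = exploration * math.sqrt(math.log(parent_visits) / visits)
    return total_reward / visits + bonus

def select_attribute(stats, parent_visits):
    """stats maps an attribute name to (total_reward, visits)."""
    return max(stats, key=lambda name: uct_score(*stats[name], parent_visits))
```

In a planner of this kind each node would hold a candidate summary, and the chosen attribute names the next adjustment applied to it.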

Summary

This paper addresses the challenge of generating human-aligned summaries with multiple attributes by proposing PACO, a training-free framework that uses Monte Carlo Tree Search to adaptively plan the order of attribute adjustments. Experiments show that PACO effectively meets all constraints across various domains and models, outperforming both self-planning models and fine-tuned baselines, and even matching the controllability of much larger models.

AdaBlock-dLLM: Semantic-Aware Diffusion LLM Inference via Adaptive Block Size

Authors: Guanxi Lu, Hao Chen, Yuto Karashima, Zhican Wang, Daichi Fujiki, Hongxiang Fan

First: 2025-09-30T15:53:56+00:00 · Latest: 2025-09-30T15:53:56+00:00

Comments: Preprint. Under review

Abstract

Diffusion-based large language models (dLLMs) are gaining attention for their inherent capacity for parallel decoding, offering a compelling alternative to autoregressive LLMs. Among various decoding strategies, blockwise semi-autoregressive (semi-AR) approaches are widely adopted due to their natural support for KV caching and their favorable accuracy-speed trade-off. However, this paper identifies two fundamental limitations in the conventional semi-AR decoding approach that applies a fixed block size: i) late decoding overhead, where the unmasking of high-confidence tokens outside the current block is unnecessarily delayed, and ii) premature decoding error, where low-confidence tokens inside the current block are committed too early, leading to incorrect tokens. This paper presents the first systematic investigation challenging the fixed block size assumption in semi-AR decoding. Through a statistical analysis of confidence dynamics during the denoising process, we identify a volatility band (VB) region during dLLM decoding, which encodes local semantic structure and can be used to guide adaptive block sizing. Leveraging these insights, we introduce AdaBlock-dLLM, a training-free, plug-and-play scheduler that adaptively aligns block boundaries with semantic steps by adjusting block size during runtime. Extensive experiments across diverse benchmarks show that AdaBlock-dLLM achieves up to 5.3% accuracy improvement under the same throughput budget. Beyond inference-time optimization, we hope our semantics-aware adaptive scheduling approach and confidence-based analysis will inspire future training strategies for dLLMs.
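
One way to picture adaptive block sizing is a scheduler that ends the current block at a confidently predicted delimiter instead of at a fixed offset. The rule below is a hypothetical illustration, not the scheduler from the paper: the function name `choose_block_size`, the delimiter set, the confidence threshold, and the default and maximum sizes are all assumptions.

```python
def choose_block_size(tokens, confidences, default_size=4, max_size=8,
                      delimiters=("\n", ".", ","), threshold=0.9):
    """Pick how many upcoming positions to decode as one block.

    tokens:      currently predicted tokens for the masked positions
    confidences: the model's confidence for each predicted token
    """
    window = min(max_size, len(tokens))
    for i in range(window):
        if tokens[i] in delimiters and confidences[i] >= threshold:
            # End the block right after the confident delimiter, so the
            # block boundary lines up with a semantic step.
            return i + 1
    # No confident delimiter in the window: fall back to the fixed size.
    return min(default_size, len(tokens))
```

A fixed-size scheduler corresponds to always taking the fallback branch.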

Summary

This paper addresses the limitations of fixed block size in semi-autoregressive decoding of diffusion-based large language models (dLLMs), focusing on late decoding overhead and premature decoding errors. It introduces AdaBlock-dLLM, a training-free scheduler that adaptively adjusts block sizes based on a volatility band region identified during the decoding process. Experiments show up to 5.3% accuracy improvement under the same throughput budget.

SQUARE: Semantic Query-Augmented Fusion and Efficient Batch Reranking for Training-free Zero-Shot Composed Image Retrieval

Authors: Ren-Di Wu, Yu-Yen Lin, Huei-Fang Yang

First: 2025-09-30T14:41:24+00:00 · Latest: 2025-09-30T14:41:24+00:00

Comments: 20 pages, 9 figures

Abstract

Composed Image Retrieval (CIR) aims to retrieve target images that preserve the visual content of a reference image while incorporating user-specified textual modifications. Training-free zero-shot CIR (ZS-CIR) approaches, which require no task-specific training or labeled data, are highly desirable, yet accurately capturing user intent remains challenging. In this paper, we present SQUARE, a novel two-stage training-free framework that leverages Multimodal Large Language Models (MLLMs) to enhance ZS-CIR. In the Semantic Query-Augmented Fusion (SQAF) stage, we enrich the query embedding derived from a vision-language model (VLM) such as CLIP with MLLM-generated captions of the target image. These captions provide high-level semantic guidance, enabling the query to better capture the user's intent and improve global retrieval quality. In the Efficient Batch Reranking (EBR) stage, top-ranked candidates are presented as an image grid with visual marks to the MLLM, which performs joint visual-semantic reasoning across all candidates. Our reranking strategy operates in a single pass and yields more accurate rankings. Experiments show that SQUARE, with its simplicity and effectiveness, delivers strong performance on four standard CIR benchmarks. Notably, it maintains high performance even with lightweight pre-trained models, demonstrating its potential applicability.
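
The query-enrichment idea in the first stage can be sketched as a weighted blend of two embeddings followed by a cosine-similarity ranking. The helper names `fuse_query` and `rank`, the linear blend, and the `weight` parameter are assumptions for illustration; the abstract does not specify how the fusion is computed.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def fuse_query(query_emb, caption_emb, weight=0.5):
    """Blend the VLM query embedding with a caption embedding."""
    q, c = normalize(query_emb), normalize(caption_emb)
    return normalize([(1 - weight) * a + weight * b for a, b in zip(q, c)])

def rank(candidates, fused):
    """Order candidate names by cosine similarity to the fused query."""
    def score(name):
        return sum(a * b for a, b in zip(normalize(candidates[name]), fused))
    return sorted(candidates, key=score, reverse=True)
```

Setting `weight=0.0` recovers retrieval with the original query alone, which gives a simple baseline to compare against.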

Simple yet Effective Semi-supervised Knowledge Distillation from Vision-Language Models via Dual-Head Optimization

Authors: Seongjae Kang, Dong Bok Lee, Hyungjoon Jang, Sung Ju Hwang

First: 2025-05-12T15:39:51+00:00 · Latest: 2025-09-30T14:13:57+00:00

Comments: 38 pages, 17 figures, preprint

Abstract

Semi-supervised learning (SSL) has emerged as a practical solution for addressing data scarcity challenges by leveraging unlabeled data. Recently, vision-language models (VLMs), pre-trained on massive image-text pairs, have demonstrated remarkable zero-/few-shot performance that often surpasses SSL approaches due to their exceptional generalization capabilities. This gap motivates us to question: how can we effectively harness the powerful generalization capabilities of VLMs into task-specific models? Knowledge distillation (KD) offers a natural framework for transferring VLM capabilities, but we identify that it suffers from gradient conflicts between supervised and distillation losses. To address this challenge, we propose Dual-Head Optimization (DHO), which introduces dual prediction heads for each distinct signal. We observe that DHO resolves gradient conflicts, enabling improved feature learning compared to single-head KD baselines, with practical benefits of minimal computational overhead and test-time hyperparameter tuning without retraining. Extensive experiments across 15 datasets show that DHO consistently outperforms KD baselines, often outperforming teacher models with smaller student models. DHO also achieves new state-of-the-art performance on both in-distribution ImageNet semi-supervised learning and out-of-distribution generalization across ImageNet variants. We publicly release our code and model checkpoints to facilitate future research at https://github.com/erjui/DHO.
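
Test-time hyperparameter tuning without retraining is possible because the two heads can be combined after training. A minimal sketch, assuming the combination is a linear interpolation of the two heads' softmax outputs; the names `combine_heads`, `alpha`, and `temperature` are illustrative and not confirmed by the abstract.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def combine_heads(ce_logits, kd_logits, alpha=0.5, temperature=1.0):
    """Interpolate the supervised head and the distillation head."""
    p_ce = softmax(ce_logits)               # head trained on labels
    p_kd = softmax(kd_logits, temperature)  # head trained on teacher outputs
    return [alpha * a + (1 - alpha) * b for a, b in zip(p_ce, p_kd)]
```

Because `alpha` and `temperature` only affect this final step, they can be swept on a validation set without touching the trained weights.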

Scaling RL to Long Videos

Authors: Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, Sifei Liu, Hongxu Yin, Yao Lu, Song Han

Venue: NeurIPS 2025 Long

First: 2025-07-10T17:47:40+00:00 · Latest: 2025-09-30T14:13:20+00:00

Comments: Accepted by NeurIPS 2025. Code at https://github.com/NVlabs/Long-RL and model at https://huggingface.co/Efficient-Large-Model/LongVILA-R1-7B

Abstract

We introduce a full-stack framework that scales up reasoning in vision-language models (VLMs) to long videos, leveraging reinforcement learning. We address the unique challenges of long video reasoning by integrating three critical components: (1) a large-scale dataset, LongVideo-Reason, comprising 104K long video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a training infrastructure for long video RL, named Multi-modal Reinforcement Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a vLLM-based engine tailored for long video, using cached video embeddings for efficient rollout and prefilling. In our experiments, LongVILA-R1-7B achieves strong performance on video benchmarks, reaching 65.1% and 71.1% accuracy on VideoMME without and with subtitles, respectively, and consistently outperforming LongVILA-7B across multiple benchmarks. Moreover, LongVILA-R1-7B supports processing up to 8,192 video frames per video, and configurable FPS settings. Notably, our MR-SP system achieves up to 2.1x speedup on long video RL training. In addition, we release our training system for public availability that supports RL training on various modalities (video, text, and audio), various models (VILA and Qwen series), and even image and video generation models. On a single A100 node (8 GPUs), it supports RL training on hour-long videos (e.g., 3,600 frames).

ProfVLM: A Lightweight Video-Language Model for Multi-View Proficiency Estimation

Authors: Edoardo Bianchi, Jacopo Staiano, Antonio Liotta

First: 2025-09-30T14:00:41+00:00 · Latest: 2025-09-30T14:00:41+00:00

Abstract

Existing approaches to skill proficiency estimation often rely on black-box video classifiers, ignoring multi-view context and lacking explainability. We present ProfVLM, a compact vision-language model that reformulates this task as generative reasoning: it jointly predicts skill level and generates expert-like feedback from egocentric and exocentric videos. Central to our method is an AttentiveGatedProjector that dynamically fuses multi-view features, projected from a frozen TimeSformer backbone into a language model tuned for feedback generation. Trained on EgoExo4D with expert commentaries, ProfVLM surpasses state-of-the-art methods while using up to 20x fewer parameters and reducing training time by up to 60%. Our approach not only achieves superior accuracy across diverse activities, but also outputs natural language critiques aligned with performance, offering transparent reasoning. These results highlight generative vision-language modeling as a powerful new direction for skill assessment.
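
The abstract does not describe the internals of the AttentiveGatedProjector, so the following is only a generic sketch of gated multi-view fusion, not the paper's module. The function name `gated_fuse`, the scalar gate, and the weight vector `gate_w` are all assumptions made for illustration: a gate computed from both views decides how much of the egocentric versus exocentric feature to keep.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fuse(ego, exo, gate_w, gate_b=0.0):
    """Blend two view features with one scalar gate computed from both views.

    gate_w is a hypothetical weight vector over the concatenated [ego, exo]
    features; a real projector would learn it and likely use per-channel gates.
    """
    concat = list(ego) + list(exo)
    gate = sigmoid(sum(w * x for w, x in zip(gate_w, concat)) + gate_b)
    fused = [gate * a + (1.0 - gate) * b for a, b in zip(ego, exo)]
    return fused, gate


ego_feat = [1.0, 0.0]
exo_feat = [0.0, 1.0]
# With all-zero gate weights the gate is 0.5, so both views contribute equally.
fused, gate = gated_fuse(ego_feat, exo_feat, gate_w=[0.0, 0.0, 0.0, 0.0])
```

A gate near 1 would favor the egocentric view and a gate near 0 the exocentric view; the fused vector is what would then be projected into the language model.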

RobustMerge: Parameter-Efficient Model Merging for MLLMs with Direction Robustness

Authors: Fanhu Zeng, Haiyang Guo, Fei Zhu, Li Shen, Hao Tang

Venue: NeurIPS 2025 Spotlight

First: 2025-02-24T13:52:05+00:00 · Latest: 2025-09-30T13:50:33+00:00

Comments: NeurIPS 2025 (Spotlight)

Abstract

Fine-tuning pre-trained models with custom data leads to numerous expert models on specific tasks. Merging models into one universal model to empower multi-task ability refraining from data leakage has gained popularity. With the expansion in data and model size, parameter-efficient tuning becomes the common practice for obtaining task-specific models efficiently. However, few methods are dedicated to efficient merging, and existing methods designed for full fine-tuning merging fail under efficient merging. To address the issue, we analyze from low-rank decomposition and reveal that direction robustness during merging is crucial for merging efficient modules. We furthermore uncover that compensating for the gap between stark singular values contributes to direction robustness. Therefore, we propose RobustMerge, a training-free parameter-efficient merging method with complementary parameter adaptation to maintain direction robustness. Specifically, we (1) prune parameters and scale coefficients from inter-parameter relation for singular values to maintain direction stability away from task interference, and (2) perform cross-task normalization to enhance unseen task generalization. We establish a benchmark consisting of diverse multimodal tasks, on which we conduct experiments to certify the outstanding performance and generalizability of our method. Additional studies and extensive analyses further showcase the effectiveness. Code is available at https://github.com/AuroraZengfh/RobustMerge.

Summary / 总结

RobustMerge is a training-free, parameter-efficient merging method for multimodal large language models (MLLMs) that maintains direction robustness when merging efficiently tuned modules. It prunes parameters and scales coefficients based on inter-parameter relations among singular values to keep directions stable, and performs cross-task normalization to improve generalization to unseen tasks. Experiments on a benchmark of diverse multimodal tasks demonstrate strong performance and generalizability.

Interpret, prune and distill Donut : towards lightweight VLMs for VQA on document

Authors: Adnan Ben Mansour, Ayoub Karine, David Naccache

First: 2025-09-30T13:31:03+00:00 · Latest: 2025-09-30T13:31:03+00:00

Comments: Accepted at Workshop on Machine Learning in Document Analysis and Recognition (ICDAR WML 2025), Wuhan, China

Abstract

Recent advances in Visually-rich Document Understanding rely on large Vision-Language Models like Donut, which perform document-level Visual Question Answering without Optical Character Recognition. Despite their effectiveness, these models are too costly for real-time or resource-constrained applications. We investigate model compression through knowledge distillation, training compact student models from a larger teacher. We leverage mechanistic interpretability to drive student architecture design within this framework. By analyzing internal computations, we identify essential subcomponents to retain, while having a clear view of which subcomponents should be approximated, skipped, or reparametrized based on their function. This approach yields Donut-MINT (Mechanistic Interpretability-based Network Trimming), a pruned Donut variant that reduces inference time and memory usage while maintaining strong performance on DocVQA, a standard benchmark for document Visual Question Answering. Our method reframes compression as circuit discovery, bridging interpretability research and practical Vision-Language Model deployment.

Summary / 总结

This work targets the computational cost of large vision-language models (VLMs) such as Donut for real-time document-level Visual Question Answering (VQA). The authors use knowledge distillation to train a compact student model from a larger teacher, with mechanistic interpretability guiding the pruning. The result is Donut-MINT, a pruned Donut variant that reduces inference time and memory usage while maintaining strong performance on the DocVQA benchmark. The method reframes model compression as circuit discovery, connecting interpretability research with practical VLM deployment.

TSalV360: A Method and Dataset for Text-driven Saliency Detection in 360-Degrees Videos

Authors: Ioannis Kontostathis, Evlampios Apostolidis, Vasileios Mezaris

First: 2025-09-30T13:11:16+00:00 · Latest: 2025-09-30T13:11:16+00:00

Comments: IEEE CBMI 2025. This is the authors' accepted version. The final publication is available at https://ieeexplore.ieee.org/

Abstract

In this paper, we deal with the task of text-driven saliency detection in 360-degree videos. For this, we introduce the TSV360 dataset, which includes 16,000 triplets of ERP frames, textual descriptions of salient objects/events in these frames, and the associated ground-truth saliency maps. Next, we extend and adapt a SOTA visual-based approach for 360-degree video saliency detection, and develop the TSalV360 method, which takes into account a user-provided text description of the desired objects and/or events. This method leverages a SOTA vision-language model for data representation and integrates a similarity estimation module and a viewport spatio-temporal cross-attention mechanism to discover dependencies between the different data modalities. Quantitative and qualitative evaluations using the TSV360 dataset showed the competitiveness of TSalV360 compared to a SOTA visual-based approach and documented its competency to perform customized text-driven saliency detection in 360-degree videos.

Summary / 总结

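This work addresses text-driven saliency detection in 360-degree videos by introducing the TSV360 dataset and developing the TSalV360 method. The method builds on a state-of-the-art vision-language model and adds a similarity estimation module and a viewport spatio-temporal cross-attention mechanism to relate text descriptions to visual data. Experiments on TSV360 show that TSalV360 is competitive with a state-of-the-art visual-based approach and can perform customized text-driven saliency detection.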

AVCD: Mitigating Hallucinations in Audio-Visual Large Language Models through Contrastive Decoding

Authors: Chaeyoung Jung, Youngjoon Jang, Joon Son Chung

First: 2025-05-27T08:13:57+00:00 · Latest: 2025-09-30T12:14:50+00:00

Abstract

Hallucination remains a major challenge in multimodal large language models (MLLMs). To address this, various contrastive decoding (CD) methods have been proposed that contrast original logits with hallucinated logits generated from perturbed inputs. While CD has shown promise in vision-language models (VLMs), it is not well-suited for AV-LLMs, where hallucinations often emerge from both unimodal and cross-modal combinations involving audio, video, and language. These intricate interactions call for a more adaptive and modality-aware decoding strategy. In this paper, we propose Audio-Visual Contrastive Decoding (AVCD), a novel, training-free decoding framework designed to model trimodal interactions and suppress modality-induced hallucinations in AV-LLMs. Unlike previous CD methods in VLMs that corrupt a fixed modality, AVCD leverages attention distributions to dynamically identify less dominant modalities and applies attentive masking to generate perturbed output logits. To support CD in a trimodal setting, we also reformulate the original CD framework to jointly handle audio, visual, and textual inputs. Finally, to improve efficiency, we introduce entropy-guided adaptive decoding, which selectively skips unnecessary decoding steps based on the model's confidence in its predictions. Extensive experiments demonstrate that AVCD consistently outperforms existing decoding methods. In particular, on the AVHBench dataset, it improves accuracy by 2% for VideoLLaMA2 and 7% for video-SALMONN, demonstrating strong robustness and generalizability. Our code is available at https://github.com/kaistmm/AVCD.
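
To make the two ideas in the abstract concrete, here is a minimal sketch of generic contrastive decoding combined with an entropy gate. It is not AVCD's trimodal formulation: the combination rule `(1 + alpha) * original - alpha * perturbed` is the commonly used single-contrast form, and `alpha`, `entropy_threshold`, and the `perturb_fn` callback are assumptions chosen for illustration. The perturbed forward pass is only requested when the original distribution is uncertain.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


def contrastive_logits(original, perturbed, alpha=1.0):
    """Boost tokens the original input supports more than the perturbed one."""
    return [(1.0 + alpha) * o - alpha * p for o, p in zip(original, perturbed)]


def decode_step(original, perturb_fn, alpha=1.0, entropy_threshold=0.5):
    """Return (token_id, used_contrast).

    perturb_fn is a hypothetical callback that runs the model on a masked
    input and returns its logits; it is skipped when the model is confident.
    """
    if entropy(softmax(original)) < entropy_threshold:
        return argmax(original), False
    adjusted = contrastive_logits(original, perturb_fn(), alpha)
    return argmax(adjusted), True


# Uncertain step: tokens 0 and 1 are close, and token 0 survives perturbation,
# which suggests it does not depend on the masked modality.
token, used = decode_step([2.0, 1.9, 0.0], lambda: [2.0, 1.0, 0.0])
```

In the uncertain step above, the contrast shifts the choice from token 0 to token 1, because token 1 loses support once the input is perturbed. In a confident step the gate returns the greedy token and the extra forward pass is never run, which is where the efficiency gain would come from.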

Summary / 总结

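This paper proposes AVCD, a training-free decoding framework that targets hallucinations in audio-visual large language models (AV-LLMs). Unlike previous contrastive decoding (CD) methods, AVCD uses attention distributions to dynamically identify less dominant modalities and applies attentive masking to generate perturbed output logits. For efficiency, it adds entropy-guided adaptive decoding, which skips unnecessary decoding steps based on the model's confidence in its predictions. On the AVHBench dataset, AVCD improves accuracy by 2% for VideoLLaMA2 and 7% for video-SALMONN.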

OmniCount: Multi-label Object Counting with Semantic-Geometric Priors

Authors: Anindya Mondal, Sauradip Nag, Xiatian Zhu, Anjan Dutta

Venue: AAAI 2025

First: 2024-03-08T16:38:11+00:00 · Latest: 2025-09-30T12:05:14+00:00

Comments: Accepted to AAAI 2025

Abstract

Object counting is pivotal for understanding the composition of scenes. Previously, this task was dominated by class-specific methods, which have gradually evolved into more adaptable class-agnostic strategies. However, these strategies come with their own set of limitations, such as the need for manual exemplar input and multiple passes for multiple categories, resulting in significant inefficiencies. This paper introduces a more practical approach enabling simultaneous counting of multiple object categories using an open-vocabulary framework. Our solution, OmniCount, stands out by using semantic and geometric insights (priors) from pre-trained models to count multiple categories of objects as specified by users, all without additional training. OmniCount distinguishes itself by generating precise object masks and leveraging varied interactive prompts via the Segment Anything Model for efficient counting. To evaluate OmniCount, we created the OmniCount-191 benchmark, a first-of-its-kind dataset with multi-label object counts, including points, bounding boxes, and VQA annotations. Our comprehensive evaluation in OmniCount-191, alongside other leading benchmarks, demonstrates OmniCount's exceptional performance, significantly outpacing existing solutions. The project webpage is available at https://mondalanindya.github.io/OmniCount.

Summary / 总结

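OmniCount is a multi-label object counting method that counts multiple object categories simultaneously, using semantic and geometric priors from pre-trained models without additional training. It generates precise object masks and uses interactive prompts via the Segment Anything Model for efficient counting. Evaluations on the OmniCount-191 benchmark and other datasets show that OmniCount significantly outperforms existing solutions.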

SoMi-ToM: Evaluating Multi-Perspective Theory of Mind in Embodied Social Interactions

Authors: Xianzhe Fan, Xuhui Zhou, Chuanyang Jin, Kolby Nottingham, Hao Zhu, Maarten Sap

Venue: NeurIPS 2025

First: 2025-06-29T00:54:13+00:00 · Latest: 2025-09-30T11:52:11+00:00

Comments: 24 pages, 6 figures

Abstract

Humans continuously infer the states, goals, and behaviors of others by perceiving their surroundings in dynamic, real-world social interactions. However, most Theory of Mind (ToM) benchmarks only evaluate static, text-based scenarios, which have a significant gap compared to real interactions. We propose the SoMi-ToM benchmark, designed to evaluate multi-perspective ToM in embodied multi-agent complex social interactions. This benchmark is based on rich multimodal interaction data generated by the interaction environment SoMi, covering diverse crafting goals and social relationships. Our framework supports multi-level evaluation: (1) first-person evaluation provides multimodal (visual, dialogue, action, etc.) input from a first-person perspective during a task for real-time state inference, (2) third-person evaluation provides complete third-person perspective video and text records after a task for goal and behavior inference. This evaluation method allows for a more comprehensive examination of a model's ToM capabilities from both the subjective immediate experience and the objective global observation. We constructed a challenging dataset containing 35 third-person perspective videos, 363 first-person perspective images, and 1225 expert-annotated multiple-choice questions (three options). On this dataset, we systematically evaluated the performance of human subjects and several state-of-the-art large vision-language models (LVLMs). The results show that LVLMs perform significantly worse than humans on SoMi-ToM: the average accuracy gap between humans and models is 40.1% in first-person evaluation and 26.4% in third-person evaluation. This indicates that future LVLMs need to further improve their ToM capabilities in embodied, complex social interactions.

Summary / 总结

The research aims to evaluate multi-perspective Theory of Mind (ToM) in embodied social interactions by proposing the SoMi-ToM benchmark. This benchmark uses rich multimodal interaction data from the SoMi interaction environment, covering various crafting goals and social relationships. The evaluation method includes both first-person and third-person perspectives, allowing for a comprehensive examination of ToM capabilities. The results show that state-of-the-art large vision-language models perform significantly worse than humans, with an average accuracy gap of 40.1% in first-person evaluation and 26.4% in third-person evaluation, highlighting the need for improved ToM capabilities in embodied social interactions.

PathoHR: Hierarchical Reasoning for Vision-Language Models in Pathology

Authors: Yating Huang, Ziyan Huang, Lintao Xiang, Qijun Yang, Hujun Yin

First: 2025-09-07T15:42:38+00:00 · Latest: 2025-09-30T11:39:52+00:00

Comments: Accepted by EMNLP 2025

Abstract

Accurate analysis of pathological images is essential for automated tumor diagnosis but remains challenging due to high structural similarity and subtle morphological variations in tissue images. Current vision-language (VL) models often struggle to capture the complex reasoning required for interpreting structured pathological reports. To address these limitations, we propose PathoHR-Bench, a novel benchmark designed to evaluate VL models' abilities in hierarchical semantic understanding and compositional reasoning within the pathology domain. Results on this benchmark reveal that existing VL models fail to effectively model intricate cross-modal relationships, limiting their applicability in clinical settings. To overcome this, we further introduce a pathology-specific VL training scheme that generates enhanced and perturbed samples for multimodal contrastive learning. Experimental evaluations demonstrate that our approach achieves state-of-the-art performance on PathoHR-Bench and six additional pathology datasets, highlighting its effectiveness in fine-grained pathology representation.

Summary / 总结

The research aims to improve the accuracy of automated tumor diagnosis by addressing the limitations of current vision-language models in understanding complex pathological images. The study introduces PathoHR-Bench, a new benchmark to evaluate models' hierarchical reasoning and compositional understanding in pathology. The results show that existing models struggle with cross-modal relationships, while the proposed pathology-specific training scheme significantly enhances performance, achieving state-of-the-art results on multiple pathology datasets.

SGS: Segmentation-Guided Scoring for Global Scene Inconsistencies

Authors: Gagandeep Singh, Samudi Amarsinghe, Urawee Thani, Ki Fung Wong, Priyanka Singh, Xue Li

First: 2025-09-30T10:15:11+00:00 · Latest: 2025-09-30T10:15:11+00:00

Comments: 6 pages, 3 figures

Abstract

We extend HAMMER, a state-of-the-art model for multimodal manipulation detection, to handle global scene inconsistencies such as foreground-background (FG-BG) mismatch. While HAMMER achieves strong performance on the DGM4 dataset, it consistently fails when the main subject is contextually misplaced into an implausible background. We diagnose this limitation as a combination of label-space bias, local attention focus, and spurious text-foreground alignment. To remedy this without retraining, we propose a lightweight segmentation-guided scoring (SGS) pipeline. SGS uses person/face segmentation masks to separate foreground and background regions, extracts embeddings with a joint vision-language model, and computes region-aware coherence scores. These scores are fused with HAMMER's original prediction to improve binary detection, grounding, and token-level explanations. SGS is inference-only, incurs negligible computational overhead, and significantly enhances robustness to global manipulations. This work demonstrates the importance of region-aware reasoning in multimodal disinformation detection. We release scripts for segmentation and scoring at https://github.com/Gaganx0/HAMMER-sgs
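
The scoring and fusion steps described above can be sketched as follows. This is an illustrative approximation, not the released SGS scripts: the embeddings are assumed to be already extracted for the foreground region, the background region, and the caption, and the coherence formula, the `fuse` function, and the fusion `weight` are hypothetical choices.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def coherence_score(fg_emb, bg_emb, text_emb):
    """Higher when the background agrees with both the foreground and the caption."""
    return 0.5 * (cosine(fg_emb, bg_emb) + cosine(bg_emb, text_emb))


def fuse(base_fake_prob, coherence, weight=0.3):
    """Mix the base detector's fake probability with a region incoherence term.

    Cosine-based coherence lies in [-1, 1]; it is mapped to an incoherence
    value in [0, 1] before the weighted average.
    """
    incoherence = (1.0 - coherence) / 2.0
    return (1.0 - weight) * base_fake_prob + weight * incoherence


# Consistent scene: foreground, background, and caption embeddings all align.
consistent = fuse(0.4, coherence_score([1.0, 0.0], [1.0, 0.0], [1.0, 0.0]))
# FG-BG mismatch: the background points away from the subject and the caption.
mismatched = fuse(0.4, coherence_score([1.0, 0.0], [-1.0, 0.0], [1.0, 0.0]))
```

Because the fusion only post-processes an existing prediction, it needs no retraining of the base detector, which matches the inference-only design the abstract describes.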

AgenticIQA: An Agentic Framework for Adaptive and Interpretable Image Quality Assessment

Authors: Hanwei Zhu, Yu Tian, Keyan Ding, Baoliang Chen, Bolin Chen, Shiqi Wang, Weisi Lin

First: 2025-09-30T09:37:01+00:00 · Latest: 2025-09-30T09:37:01+00:00

Abstract

Image quality assessment (IQA) is inherently complex, as it reflects both the quantification and interpretation of perceptual quality rooted in the human visual system. Conventional approaches typically rely on fixed models to output scalar scores, limiting their adaptability to diverse distortions, user-specific queries, and interpretability needs. Furthermore, scoring and interpretation are often treated as independent processes, despite their interdependence: interpretation identifies perceptual degradations, while scoring abstracts them into a compact metric. To address these limitations, we propose AgenticIQA, a modular agentic framework that integrates vision-language models (VLMs) with traditional IQA tools in a dynamic, query-aware manner. AgenticIQA decomposes IQA into four subtasks -- distortion detection, distortion analysis, tool selection, and tool execution -- coordinated by a planner, executor, and summarizer. The planner formulates task-specific strategies, the executor collects perceptual evidence via tool invocation, and the summarizer integrates this evidence to produce accurate scores with human-aligned explanations. To support training and evaluation, we introduce AgenticIQA-200K, a large-scale instruction dataset tailored for IQA agents, and AgenticIQA-Eval, the first benchmark for assessing the planning, execution, and summarization capabilities of VLM-based IQA agents. Extensive experiments across diverse IQA datasets demonstrate that AgenticIQA consistently surpasses strong baselines in both scoring accuracy and explanatory alignment.

Summary / 总结

AgenticIQA is a modular agentic framework that combines vision-language models with traditional image quality assessment (IQA) tools in a dynamic, query-aware manner. It decomposes IQA into four subtasks (distortion detection, distortion analysis, tool selection, and tool execution) coordinated by a planner, an executor, and a summarizer. Across diverse IQA datasets, AgenticIQA consistently surpasses strong baselines in both scoring accuracy and explanatory alignment.

Inducing Dyslexia in Vision Language Models

Authors: Melika Honarmand, Ayati Sharma, Badr AlKhamissi, Johannes Mehrer, Martin Schrimpf

First: 2025-09-29T11:03:16+00:00 · Latest: 2025-09-30T09:36:07+00:00

Abstract

Dyslexia, a neurodevelopmental disorder characterized by persistent reading difficulties, is often linked to reduced activity of the visual word form area in the ventral occipito-temporal cortex. Traditional approaches to studying dyslexia, such as behavioral and neuroimaging methods, have provided valuable insights but remain limited in their ability to test causal hypotheses about the underlying mechanisms of reading impairments. In this study, we use large-scale vision-language models (VLMs) to simulate dyslexia by functionally identifying and perturbing artificial analogues of word processing. Using stimuli from cognitive neuroscience, we identify visual-word-form-selective units within VLMs and demonstrate that targeted ablation of these units, unlike ablation of random units, leads to selective impairments in reading tasks while general visual and language comprehension abilities remain intact. In particular, the resulting model matches dyslexic humans' phonological deficits without a significant change in orthographic processing. Taken together, our modeling results replicate key characteristics of dyslexia and establish a computational framework for investigating reading disorders.
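
The identify-then-ablate procedure can be illustrated with a toy sketch. The abstract does not give the selection criterion, so the selectivity measure here (mean activation on word stimuli minus mean activation on other stimuli), the top-k rule, and all function names are assumptions; a random control set of the same size is drawn for comparison.

```python
import random


def selectivity(word_acts, other_acts):
    """Per-unit mean activation on word stimuli minus mean on other stimuli."""
    n_units = len(word_acts[0])

    def mean_per_unit(rows):
        return [sum(row[u] for row in rows) / len(rows) for u in range(n_units)]

    word_mean = mean_per_unit(word_acts)
    other_mean = mean_per_unit(other_acts)
    return [w - o for w, o in zip(word_mean, other_mean)]


def top_k_units(scores, k):
    return sorted(range(len(scores)), key=lambda u: scores[u], reverse=True)[:k]


def ablate(activations, units):
    """Zero out the chosen units and leave the rest untouched."""
    drop = set(units)
    return [0.0 if i in drop else a for i, a in enumerate(activations)]


# Rows are stimuli, columns are units (toy numbers, not real model activations).
word_acts = [[5.0, 1.0, 0.0], [7.0, 1.0, 0.0]]
other_acts = [[1.0, 1.0, 0.0], [1.0, 1.0, 4.0]]

scores = selectivity(word_acts, other_acts)
targeted = top_k_units(scores, k=1)
control = random.Random(0).sample(range(len(scores)), k=1)  # random-unit baseline

ablated = ablate([3.0, 2.0, 1.0], targeted)
```

Comparing task performance after ablating `targeted` versus `control` is what separates a selective reading impairment from the generic damage caused by removing any units.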

Summary / 总结

This study simulates dyslexia in vision-language models by functionally identifying and perturbing units that are selective for visual word forms, artificial analogues of the brain's visual word form area. Targeted ablation of these units, unlike ablation of random units, produces selective reading impairments while general visual and language comprehension remain intact. The ablated model matches the phonological deficits of dyslexic humans without a significant change in orthographic processing, providing a computational framework for investigating reading disorders.

Learning Egocentric In-Hand Object Segmentation through Weak Supervision from Human Narrations

Authors: Nicola Messina, Rosario Leonardi, Luca Ciampi, Fabio Carrara, Giovanni Maria Farinella, Fabrizio Falchi, Antonino Furnari

First: 2025-09-30T09:34:55+00:00 · Latest: 2025-09-30T09:34:55+00:00

Abstract

Pixel-level recognition of objects manipulated by the user from egocentric images enables key applications spanning assistive technologies, industrial safety, and activity monitoring. However, progress in this area is currently hindered by the scarcity of annotated datasets, as existing approaches rely on costly manual labels. In this paper, we propose to learn human-object interaction detection leveraging narrations -- natural language descriptions of the actions performed by the camera wearer which contain clues about manipulated objects (e.g., "I am pouring vegetables from the chopping board to the pan"). Narrations provide a form of weak supervision that is cheap to acquire and readily available in state-of-the-art egocentric datasets. We introduce Narration-Supervised in-Hand Object Segmentation (NS-iHOS), a novel task where models have to learn to segment in-hand objects by learning from natural-language narrations. Narrations are then not employed at inference time. We showcase the potential of the task by proposing Weakly-Supervised In-hand Object Segmentation from Human Narrations (WISH), an end-to-end model distilling knowledge from narrations to learn plausible hand-object associations and enable in-hand object segmentation without using narrations at test time. We benchmark WISH against different baselines based on open-vocabulary object detectors and vision-language models, showing the superiority of its design. Experiments on EPIC-Kitchens and Ego4D show that WISH surpasses all baselines, recovering more than 50% of the performance of fully supervised methods, without employing fine-grained pixel-wise annotations.

Summary / 总结

This paper addresses pixel-level recognition of objects manipulated by the user in egocentric images, which matters for assistive technologies, industrial safety, and activity monitoring. The authors propose using natural-language narrations as weak supervision and introduce the task Narration-Supervised in-Hand Object Segmentation (NS-iHOS). Their end-to-end model, Weakly-Supervised In-hand Object Segmentation from Human Narrations (WISH), learns hand-object associations from narrations and segments in-hand objects without narrations at test time. Experiments on EPIC-Kitchens and Ego4D show that WISH surpasses all baselines and recovers more than 50% of the performance of fully supervised methods.

Towards Unified Multimodal Misinformation Detection in Social Media: A Benchmark Dataset and Baseline

Authors: Haiyang Li, Yaxiong Wang, Lianwei Wu, Lechao Cheng, Zhun Zhong

First: 2025-09-30T09:26:32+00:00 · Latest: 2025-09-30T09:26:32+00:00

Abstract

In recent years, detecting fake multimodal content on social media has drawn increasing attention. Two major forms of deception dominate: human-crafted misinformation (e.g., rumors and misleading posts) and AI-generated content produced by image synthesis models or vision-language models (VLMs). Although both share deceptive intent, they are typically studied in isolation. NLP research focuses on human-written misinformation, while the CV community targets AI-generated artifacts. As a result, existing models are often specialized for only one type of fake content. In real-world scenarios, however, the type of a multimodal post is usually unknown, limiting the effectiveness of such specialized systems. To bridge this gap, we construct the Omnibus Dataset for Multimodal News Deception (OmniFake), a comprehensive benchmark of 127K samples that integrates human-curated misinformation from existing resources with newly synthesized AI-generated examples. Based on this dataset, we propose Unified Multimodal Fake Content Detection (UMFDet), a framework designed to handle both forms of deception. UMFDet leverages a VLM backbone augmented with a Category-aware Mixture-of-Experts (MoE) Adapter to capture category-specific cues, and an attribution chain-of-thought mechanism that provides implicit reasoning guidance for locating salient deceptive signals. Extensive experiments demonstrate that UMFDet achieves robust and consistent performance across both misinformation types, outperforming specialized baselines and offering a practical solution for real-world multimodal deception detection.

Summary / 总结

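This paper tackles the detection of both human-crafted and AI-generated misinformation on social media by constructing the OmniFake dataset (127K samples) and proposing UMFDet, a unified detection framework. UMFDet uses a VLM backbone augmented with a Category-aware Mixture-of-Experts (MoE) Adapter and an attribution chain-of-thought mechanism to handle both forms of deception. Experiments show that UMFDet performs consistently across both misinformation types and outperforms specialized baselines.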

M$^{2}$SNet: Multi-scale in Multi-scale Subtraction Network for Medical Image Segmentation

Authors: Xiaoqi Zhao, Hongpeng Jia, Youwei Pang, Long Lv, Feng Tian, Lihe Zhang, Weibing Sun, Huchuan Lu

First: 2023-03-20T06:26:49+00:00 · Latest: 2025-09-30T08:55:10+00:00

Comments: Manuscript

Abstract

Accurate medical image segmentation is critical for early medical diagnosis. Most existing methods are based on a U-shaped structure and use element-wise addition or concatenation to progressively fuse features from different levels in the decoder. However, both operations tend to generate redundant information, which weakens the complementarity between features at different levels and results in inaccurate localization and blurred lesion edges. To address this challenge, we propose a general multi-scale in multi-scale subtraction network (M$^{2}$SNet) for diverse medical image segmentation tasks. Specifically, we first design a basic subtraction unit (SU) to produce the difference features between adjacent levels in the encoder. Next, we expand the single-scale SU to the intra-layer multi-scale SU, which can provide the decoder with both pixel-level and structure-level difference information. Then, we pyramidally equip the multi-scale SUs at different levels with varying receptive fields, thereby achieving inter-layer multi-scale feature aggregation and obtaining rich multi-scale difference information. In addition, we build a training-free network, "LossNet", to comprehensively supervise the task-aware features from the bottom layer to the top layer, which drives our multi-scale subtraction network to capture detailed and structural cues simultaneously. Without bells and whistles, our method performs favorably against most state-of-the-art methods under different evaluation metrics on eleven datasets covering four medical image segmentation tasks and diverse image modalities, including color colonoscopy imaging, ultrasound imaging, computed tomography (CT), and optical coherence tomography (OCT). The source code is available at https://github.com/Xiaoqi-Zhao-DLUT/MSNet.
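
The subtraction idea can be shown on a toy 1D signal. This sketch makes several assumptions not stated in the abstract: features are plain lists rather than 2D feature maps, the multi-scale filters are fixed all-ones windows of sizes 1, 3, and 5 with zero padding, and the learned convolution that would normally follow the subtraction is omitted.

```python
def box_filter(signal, size):
    """Sum over a centered window of odd `size`, zero-padded at the borders."""
    half = size // 2
    n = len(signal)
    return [
        sum(signal[j] for j in range(max(0, i - half), min(n, i + half + 1)))
        for i in range(n)
    ]


def subtraction_unit(feat_a, feat_b):
    """Element-wise absolute difference between two feature levels."""
    return [abs(a - b) for a, b in zip(feat_a, feat_b)]


def multi_scale_subtraction(feat_a, feat_b, sizes=(1, 3, 5)):
    """Sum the difference features computed at several window sizes."""
    out = [0.0] * len(feat_a)
    for size in sizes:
        diff = subtraction_unit(box_filter(feat_a, size), box_filter(feat_b, size))
        out = [o + d for o, d in zip(out, diff)]
    return out


level_a = [1.0, 2.0, 3.0]
level_b = [1.0, 0.0, 3.0]

pixel_level = subtraction_unit(level_a, level_b)
multi_scale = multi_scale_subtraction(level_a, level_b)
```

The size-1 window keeps the point-wise difference, while larger windows compare neighborhoods, which is one way to read the abstract's distinction between pixel-level and structure-level difference information. Unlike addition or concatenation, positions where the two levels agree contribute nothing.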

Summary / 总结

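M$^{2}$SNet targets medical image segmentation, where U-shaped networks that fuse multi-level features by addition or concatenation tend to accumulate redundant information, leading to inaccurate localization and blurred lesion edges. Instead, the method uses subtraction units that extract difference features between adjacent encoder levels, extends them to intra-layer multi-scale units, and arranges them pyramidally for inter-layer multi-scale aggregation, with a training-free LossNet supervising features from the bottom layer to the top layer. It performs favorably against most state-of-the-art methods on eleven datasets spanning four segmentation tasks and several imaging modalities.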
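The subtraction idea in the M$^{2}$SNet abstract can be pictured with a small NumPy sketch. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the feature maps are single-channel 2-D arrays of equal size, the multi-scale variant uses fixed all-ones windows of sizes 1, 3 and 5 (an assumption; the abstract gives no sizes), and the learned convolutions that would normally follow are omitted.

```python
import numpy as np


def box_sum(x: np.ndarray, k: int) -> np.ndarray:
    """Sum every k x k neighbourhood of a 2-D map (zero padding, same output size)."""
    pad = k // 2
    padded = np.pad(x, pad, mode="constant")
    h, w = x.shape
    out = np.zeros((h, w), dtype=float)
    # Accumulate the k*k shifted views; equivalent to an all-ones k x k filter.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out


def subtraction_unit(fa: np.ndarray, fb: np.ndarray) -> np.ndarray:
    """Single-scale difference feature: element-wise absolute difference."""
    return np.abs(fa - fb)


def multi_scale_subtraction_unit(fa, fb, sizes=(1, 3, 5)):
    """Add up absolute differences of locally aggregated features.

    Window size 1 keeps pixel-level differences; larger windows compare
    neighbourhood sums, a rough stand-in for structure-level differences.
    The sizes are an assumption made for this sketch.
    """
    return sum(np.abs(box_sum(fa, k) - box_sum(fb, k)) for k in sizes)


# Two toy "adjacent-level" feature maps, assumed already resized to one shape.
level_a = np.ones((4, 4))
level_b = np.zeros((4, 4))
pixel_diff = subtraction_unit(level_a, level_b)
multi_diff = multi_scale_subtraction_unit(level_a, level_b)
```

Identical inputs give an all-zero output at every scale, which is the sense in which subtraction discards information the two levels share while addition or concatenation would keep it.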

NuRisk: A Visual Question Answering Dataset for Agent-Level Risk Assessment in Autonomous Driving

Authors: Yuan Gao, Mattia Piccinini, Roberto Brusnicki, Yuchen Zhang, Johannes Betz

First: 2025-09-30T08:37:31+00:00 · Latest: 2025-09-30T08:37:31+00:00

Comments: 8 pages

Abstract

Understanding risk in autonomous driving requires not only perception and prediction, but also high-level reasoning about agent behavior and context. Current Vision Language Model (VLM)-based methods primarily ground agents in static images and provide qualitative judgments, lacking the spatio-temporal reasoning needed to capture how risks evolve over time. To address this gap, we propose NuRisk, a comprehensive Visual Question Answering (VQA) dataset comprising 2,900 scenarios and 1.1 million agent-level samples, built on real-world data from nuScenes and Waymo, supplemented with safety-critical scenarios from the CommonRoad simulator. The dataset provides Bird-Eye-View (BEV) based sequential images with quantitative, agent-level risk annotations, enabling spatio-temporal reasoning. We benchmark well-known VLMs across different prompting techniques and find that they fail to perform explicit spatio-temporal reasoning, resulting in a peak accuracy of 33% at high latency. To address these shortcomings, our fine-tuned 7B VLM agent improves accuracy to 41% and reduces latency by 75%, demonstrating explicit spatio-temporal reasoning capabilities that proprietary models lacked. While this represents a significant step forward, the modest accuracy underscores the profound challenge of the task, establishing NuRisk as a critical benchmark for advancing spatio-temporal reasoning in autonomous driving.

Summary / 总结

NuRisk is a Visual Question Answering dataset for agent-level risk assessment in autonomous driving. It comprises 2,900 scenarios and 1.1 million agent-level samples, and provides sequential Bird-Eye-View (BEV) images with quantitative, agent-level risk annotations, so that models can be evaluated on spatio-temporal reasoning rather than on static images alone. The benchmarked well-known VLMs reach a peak accuracy of only 33% at high latency, while a fine-tuned 7B VLM reaches 41% accuracy and reduces latency by 75%. The still-modest accuracy highlights how difficult the task remains.

How Do Large Vision-Language Models See Text in Image? Unveiling the Distinctive Role of OCR Heads

Authors: Ingeol Baek, Hwan Chang, Sunghyun Ryu, Hwanhee Lee

Venue: EMNLP 2025 Oral

First: 2025-05-21T10:53:41+00:00 · Latest: 2025-09-30T08:28:38+00:00

Comments: EMNLP 2025 Oral

Abstract

Despite significant advancements in Large Vision Language Models (LVLMs), a gap remains, particularly regarding their interpretability and how they locate and interpret textual information within images. In this paper, we explore various LVLMs to identify the specific heads responsible for recognizing text from images, which we term the Optical Character Recognition Head (OCR Head). Our findings regarding these heads are as follows: (1) Less Sparse: Unlike previous retrieval heads, a large number of heads are activated to extract textual information from images. (2) Qualitatively Distinct: OCR heads possess properties that differ significantly from general retrieval heads, exhibiting low similarity in their characteristics. (3) Statically Activated: The frequency of activation for these heads closely aligns with their OCR scores. We validate our findings in downstream tasks by applying Chain-of-Thought (CoT) to both OCR and conventional retrieval heads and by masking these heads. We also demonstrate that redistributing sink-token values within the OCR heads improves performance. These insights provide a deeper understanding of the internal mechanisms LVLMs employ in processing embedded textual information in images.

Summary / 总结

This paper investigates how Large Vision-Language Models (LVLMs) read text in images by identifying the attention heads responsible, termed Optical Character Recognition (OCR) heads. The study finds that OCR heads are less sparse than previously studied retrieval heads, qualitatively distinct from general retrieval heads, and statically activated, with activation frequency closely tracking OCR scores. The authors validate these findings on downstream tasks by applying Chain-of-Thought (CoT) to OCR and conventional retrieval heads and by masking those heads, and they show that redistributing sink-token values within OCR heads improves performance.
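The head-masking validation described for the OCR heads paper can be illustrated with a generic sketch of what "masking a head" means in multi-head attention. This is not the authors' code: the function names are hypothetical, the attention layer is a bare NumPy toy, and how OCR heads are selected (the OCR score) is not modelled here.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(q, k, v, masked_heads=()):
    """Scaled dot-product attention with optional head masking.

    q, k, v have shape (heads, tokens, dim). Heads listed in
    `masked_heads` (a hypothetical argument) contribute zeros to the output.
    Returns an array of shape (tokens, heads * dim).
    """
    heads, tokens, dim = q.shape
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dim)   # (heads, tokens, tokens)
    weights = softmax(scores, axis=-1)
    per_head = weights @ v                             # (heads, tokens, dim)

    keep = np.ones(heads)
    for h in masked_heads:
        keep[h] = 0.0                                  # ablate this head
    per_head = per_head * keep[:, None, None]

    # Concatenate heads along the feature axis.
    return per_head.transpose(1, 0, 2).reshape(tokens, heads * dim)


rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 5, 8)) for _ in range(3))
full = multi_head_attention(q, k, v)
ablated = multi_head_attention(q, k, v, masked_heads=[1])
```

Comparing downstream accuracy with `full` versus `ablated` outputs is the general shape of a head-ablation experiment: a large drop when a candidate head is masked suggests that head carries the capability under study.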

VLM-FO1: Bridging the Gap Between High-Level Reasoning and Fine-Grained Perception in VLMs

Authors: Peng Liu, Haozhan Shen, Chunxin Fang, Zhicheng Sun, Jiajia Liao, Tiancheng Zhao

First: 2025-09-30T08:10:56+00:00 · Latest: 2025-09-30T08:10:56+00:00

Comments: 22 pages

Abstract

Vision-Language Models (VLMs) excel at high-level scene understanding but falter on fine-grained perception tasks requiring precise localization. This failure stems from a fundamental mismatch, as generating exact numerical coordinates is a challenging task for language-centric architectures. In this paper, we introduce VLM-FO1, a novel framework that overcomes this limitation by reframing object-centric perception from a brittle coordinate generation problem into a robust feature retrieval task. Our method operates as a plug-and-play module that integrates with any pre-trained VLM. It leverages a Hybrid Fine-grained Region Encoder (HFRE), featuring a dual vision encoder, to generate powerful region tokens rich in both semantic and spatial detail. A token-based referencing system then enables the LLM to seamlessly reason about and ground language in these specific visual regions. Experiments show that VLM-FO1 achieves state-of-the-art performance across a diverse suite of benchmarks, demonstrating exceptional capabilities in object grounding, region generational understanding, and visual region reasoning. Crucially, our two-stage training strategy ensures that these perception gains are achieved without compromising the base model's general visual understanding capabilities. VLM-FO1 establishes an effective and flexible paradigm for building perception-aware VLMs, bridging the gap between high-level reasoning and fine-grained visual grounding.

Summary / 总结

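This paper addresses the difficulty VLMs have with fine-grained perception by introducing VLM-FO1, which reframes object-centric perception as a feature retrieval task rather than a coordinate generation problem. VLM-FO1 uses a Hybrid Fine-grained Region Encoder (HFRE) to generate detailed region tokens, allowing the language model to reason about specific visual regions. The method achieves state-of-the-art performance across a range of benchmarks, strengthening object grounding and visual region reasoning, and its two-stage training strategy preserves the base model's general visual understanding.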
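The shift from coordinate generation to feature retrieval in VLM-FO1 can be pictured with a deliberately simple sketch: each candidate box is turned into one feature vector, and the language model refers to a region by a placeholder token instead of emitting numbers. Everything here is an assumption made for illustration: the average pooling stands in for the HFRE, which the abstract does not specify, and the `<region0>`-style token names are hypothetical.

```python
import numpy as np


def region_token(feature_map: np.ndarray, box) -> np.ndarray:
    """Average-pool the features inside a box into one vector.

    feature_map: array of shape (H, W, C).
    box: (x0, y0, x1, y1) in normalised [0, 1] image coordinates.
    """
    h, w, _ = feature_map.shape
    x0, y0, x1, y1 = box
    # Convert to grid indices, clamping so the crop is never empty.
    c0 = min(int(np.floor(x0 * w)), w - 1)
    r0 = min(int(np.floor(y0 * h)), h - 1)
    c1 = max(c0 + 1, min(int(np.ceil(x1 * w)), w))
    r1 = max(r0 + 1, min(int(np.ceil(y1 * h)), h))
    return feature_map[r0:r1, c0:c1].mean(axis=(0, 1))


# Toy feature map whose two channels store the row and column index.
rows, cols = np.meshgrid(np.arange(4), np.arange(4), indexing="ij")
features = np.stack([rows, cols], axis=-1).astype(float)   # (4, 4, 2)

boxes = [(0.5, 0.0, 1.0, 0.5), (0.0, 0.5, 0.5, 1.0)]
# Hypothetical lookup table: the model would cite "<region0>" in its answer,
# and the system maps that token back to the box it came from.
region_tokens = {f"<region{i}>": region_token(features, b) for i, b in enumerate(boxes)}
```

The design point is that the language model only has to pick among a finite set of region references, which sidesteps the brittle step of writing exact numerical coordinates as text.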