arXiv Paper Digest

2025-10-22 03:30
Snapshot: 20251022_0330
ConsistEdit: Highly Consistent and Precise Training-free Visual Editing
Authors: Zixin Yin, Ling-Hao Chen, Lionel Ni, Xili Dai
Venue: SIGGRAPH Asia 2025
First: 2025-10-20T17:59:52+00:00 · Latest: 2025-10-20T17:59:52+00:00
Comments: SIGGRAPH Asia 2025
Abstract
Recent advances in training-free attention control methods have enabled flexible and efficient text-guided editing capabilities for existing generation models. However, current approaches struggle to simultaneously deliver strong editing strength while preserving consistency with the source. This limitation becomes particularly critical in multi-round and video editing, where visual errors can accumulate over time. Moreover, most existing methods enforce global consistency, which limits their ability to modify individual attributes such as texture while preserving others, thereby hindering fine-grained editing. Recently, the architectural shift from U-Net to MM-DiT has brought significant improvements in generative performance and introduced a novel mechanism for integrating text and vision modalities. These advancements pave the way for overcoming challenges that previous methods failed to resolve. Through an in-depth analysis of MM-DiT, we identify three key insights into its attention mechanisms. Building on these, we propose ConsistEdit, a novel attention control method specifically tailored for MM-DiT. ConsistEdit incorporates vision-only attention control, mask-guided pre-attention fusion, and differentiated manipulation of the query, key, and value tokens to produce consistent, prompt-aligned edits. Extensive experiments demonstrate that ConsistEdit achieves state-of-the-art performance across a wide range of image and video editing tasks, including both structure-consistent and structure-inconsistent scenarios. Unlike prior methods, it is the first approach to perform editing across all inference steps and attention layers without hand-crafting, significantly enhancing reliability and consistency, which enables robust multi-round and multi-region editing. Furthermore, it supports progressive adjustment of structural consistency, enabling finer control.
中文标题/摘要
标题:ConsistEdit:高度一致且精确的无训练视觉编辑
近期无训练注意力控制方法的进步使现有生成模型具备了灵活高效的文本引导编辑能力。然而,当前方法难以同时提供强大的编辑强度并保持与源内容的一致性。这一限制在多轮编辑和视频编辑中尤为关键,因为视觉错误会随时间累积。此外,大多数现有方法仅保证全局一致性,这限制了它们修改个别属性(如纹理)并保留其他属性的能力,从而妨碍了精细编辑。最近,从U-Net到MM-DiT的架构转变带来了生成性能的重大提升,并引入了一种新的机制来整合文本和视觉模态。这些进步为克服先前方法未能解决的挑战铺平了道路。通过对MM-DiT进行深入分析,我们发现了其注意力机制的三个关键见解。基于这些见解,我们提出了ConsistEdit,一种专门针对MM-DiT的新型注意力控制方法。ConsistEdit结合了仅视觉注意力控制、掩码引导预注意力融合以及对查询、键和值令牌的差异化操作,以生成一致且与提示对齐的编辑。大量实验表明,ConsistEdit在各种图像和视频编辑任务中均实现了最先进的性能,包括结构一致和结构不一致的场景。与先前方法不同,它是首个无需手工设计即可在所有推理步骤和注意力层中进行编辑的方法,显著提高了可靠性和一致性,从而实现稳健的多轮和多区域编辑。此外,它支持结构一致性渐进调整,从而实现更精细的控制。
Summary / 总结
ConsistEdit is a novel training-free attention control method designed for MM-DiT architecture, addressing the limitations of previous approaches in maintaining consistency and precision during visual editing. By incorporating vision-only attention control, mask-guided pre-attention fusion, and differentiated manipulation of query, key, and value tokens, ConsistEdit achieves highly consistent and precise edits across various image and video editing tasks. The method enables robust multi-round and multi-region editing, and supports progressive adjustment of structural consistency, demonstrating state-of-the-art performance and reliability.
ConsistEdit 是一种针对 MM-DiT 的新颖注意力控制方法,旨在提高文本引导的视觉编辑的一致性和精确性。它通过引入仅视觉注意力控制、掩码引导的预注意力融合以及对查询、键和值令牌的差异化操作来解决先前方法的局限性。广泛的实验表明,ConsistEdit 在各种图像和视频编辑任务中表现出色,能够在多个推理步骤和注意力层中提供可靠且一致的编辑效果。
Glyph: Scaling Context Windows via Visual-Text Compression
Authors: Jiale Cheng, Yusen Liu, Xinyu Zhang, Yulin Fei, Wenyi Hong, Ruiliang Lyu, Weihan Wang, Zhe Su, Xiaotao Gu, Xiao Liu, Yushi Bai, Jie Tang, Hongning Wang, Minlie Huang
First: 2025-10-20T17:58:56+00:00 · Latest: 2025-10-20T17:58:56+00:00
Abstract
Large language models (LLMs) increasingly rely on long-context modeling for tasks such as document understanding, code analysis, and multi-step reasoning. However, scaling context windows to the million-token level brings prohibitive computational and memory costs, limiting the practicality of long-context LLMs. In this work, we take a different perspective, visual context scaling, to tackle this challenge. Instead of extending token-based sequences, we propose Glyph, a framework that renders long texts into images and processes them with vision-language models (VLMs). This approach substantially compresses textual input while preserving semantic information, and we further design an LLM-driven genetic search to identify optimal visual rendering configurations for balancing accuracy and compression. Through extensive experiments, we demonstrate that our method achieves 3-4x token compression while maintaining accuracy comparable to leading LLMs such as Qwen3-8B on various long-context benchmarks. This compression also leads to around 4x faster prefilling and decoding, and approximately 2x faster SFT training. Furthermore, under extreme compression, a 128K-context VLM could scale to handle 1M-token-level text tasks. In addition, the rendered text data benefits real-world multimodal tasks, such as document understanding. Our code and model are released at https://github.com/thu-coai/Glyph.
中文标题/摘要
标题:Glyph:通过视觉-文本压缩扩展上下文窗口
大型语言模型(LLMs)越来越多地依赖于长上下文建模,用于文档理解、代码分析和多步推理等任务。然而,将上下文窗口扩展到百万词级别带来了巨大的计算和内存成本,限制了长上下文LLMs的实际应用。在本工作中,我们从视觉上下文扩展的角度出发,应对这一挑战。我们不扩展基于词元的序列,而是提出了一种名为Glyph的框架,将长文本渲染为图像,并使用视觉-语言模型(VLMs)处理这些图像。这种方法在大幅压缩文本输入的同时保留了语义信息,并进一步设计了一种由LLM驱动的遗传搜索,以识别平衡准确性和压缩的最佳视觉渲染配置。通过广泛的实验,我们证明了我们的方法在各种长上下文基准测试中实现了3-4倍的词元压缩,同时保持与领先LLM(如Qwen3-8B)相当的准确性。这种压缩还导致填充和解码速度提高了约4倍,SFT训练速度提高了约2倍。此外,在极端压缩下,一个128K上下文的VLM可以扩展处理百万词级别的文本任务。此外,渲染的文本数据也有助于实际的多模态任务,如文档理解。我们的代码和模型已发布在https://github.com/thu-coai/Glyph。
Summary / 总结
The research aims to address the computational and memory challenges of scaling context windows in large language models (LLMs) to handle long documents. Glyph proposes a visual context scaling method, converting long texts into images and processing them with vision-language models (VLMs). This approach achieves 3-4x token compression while maintaining accuracy comparable to leading LLMs, and it speeds up prefilling and decoding by around 4x and SFT training by around 2x. Under extreme compression, a 128K-context VLM can handle 1M-token-level text tasks, and the rendered text data benefits multimodal tasks like document understanding.
该论文提出了Glyph框架,将长文本转换为图像以减少大型语言模型(LLMs)的计算和内存成本。通过使用视觉语言模型(VLMs)和基于LLM的遗传搜索,Glyph实现了3-4倍的令牌压缩,同时保持与Qwen3-8B等领先LLM相当的准确性。这种方法还使预填充、解码和SFT训练分别提速4倍、4倍和2倍,并且在极端压缩下,128K上下文的VLM能够处理1M令牌级别的任务。渲染后的文本数据还能够用于多模态任务,如文档理解。
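As a back-of-the-envelope illustration of the Glyph compression figures above, the sketch below computes how many text tokens a fixed context window could cover if each visual token stands in for several text tokens. The helper name and the 8x ratio are assumptions made for illustration (8x is simply what a 128K window would need to reach roughly 1M tokens); neither comes from the paper or its code.

```python
def effective_text_tokens(context_window: int, compression_ratio: float) -> int:
    """Text tokens representable when each visual token stands in for
    `compression_ratio` text tokens (hypothetical helper)."""
    return int(context_window * compression_ratio)


window = 128_000  # 128K-context VLM, as in the abstract (taken here as 128,000)
for ratio in (3.0, 4.0, 8.0):  # 3-4x is reported; 8x is an assumed "extreme" setting
    print(f"{ratio:.0f}x -> {effective_text_tokens(window, ratio):,} text tokens")
```

At the reported 3-4x ratios the same window covers roughly 384K to 512K text tokens, which is why a higher ratio is needed before million-token inputs fit.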
SparseVILA: Decoupling Visual Sparsity for Efficient VLM Inference
Authors: Samir Khaki, Junxian Guo, Jiaming Tang, Shang Yang, Yukang Chen, Konstantinos N. Plataniotis, Yao Lu, Song Han, Zhijian Liu
First: 2025-10-20T17:35:47+00:00 · Latest: 2025-10-20T17:35:47+00:00
Abstract
Vision Language Models (VLMs) have rapidly advanced in integrating visual and textual reasoning, powering applications across high-resolution image understanding, long-video analysis, and multi-turn conversation. However, their scalability remains limited by the growing number of visual tokens that dominate inference latency. We present SparseVILA, a new paradigm for efficient VLM inference that decouples visual sparsity across the prefilling and decoding stages. SparseVILA distributes sparsity across stages by pruning redundant visual tokens during prefill and retrieving only query-relevant tokens during decoding. This decoupled design matches leading prefill pruning methods while preserving multi-turn fidelity by retaining most of the visual cache so that query-aware tokens can be retrieved at each conversation round. Built on an AWQ-optimized inference pipeline, SparseVILA achieves up to 4.0 times faster prefilling, 2.5 times faster decoding, and an overall 2.6 times end-to-end speedup on long-context video tasks, while improving accuracy on document-understanding and reasoning tasks. By decoupling query-agnostic pruning and query-aware retrieval, SparseVILA establishes a new direction for efficient multimodal inference, offering a training-free, architecture-agnostic framework for accelerating large VLMs without sacrificing capability.
中文标题/摘要
标题:SparseVILA:解耦视觉稀疏性以实现高效的VLM推理
视觉语言模型(VLMs)在视觉和文本推理的整合方面取得了快速进展,推动了高分辨率图像理解、长视频分析和多轮对话等应用的发展。然而,它们的可扩展性仍然受到主导推理延迟的视觉标记数量不断增长的限制。我们提出了SparseVILA,这是一种新的高效VLM推理范式,它在预填充和解码阶段解耦视觉稀疏性。SparseVILA通过在预填充阶段剪枝冗余的视觉标记,并在解码阶段仅检索与查询相关的标记,来在阶段之间分配稀疏性。这种解耦设计在保持多轮对话保真度的同时,通过保留大部分视觉缓存,使得查询感知的标记可以在每次对话回合中被检索。基于AWQ优化的推理管道,SparseVILA在长上下文视频任务中实现了4.0倍的预填充速度、2.5倍的解码速度和整体2.6倍的端到端加速,同时在文档理解和推理任务上提高了准确性。通过解耦查询无关的剪枝和查询感知的检索,SparseVILA为高效的多模态推理确立了一个新方向,提供了一个无需训练、架构无关的框架,用于加速大型VLMs而不牺牲其能力。
Summary / 总结
SparseVILA is designed to enhance the efficiency of Vision Language Models (VLMs) by decoupling visual sparsity during prefilling and decoding. It prunes redundant visual tokens during prefill and retrieves only query-relevant tokens during decoding, achieving up to 4.0 times faster prefilling, 2.5 times faster decoding, and a 2.6 times overall speedup on long-context video tasks. The method also improves accuracy on document-understanding and reasoning tasks.
SparseVILA 通过在预填充和解码阶段解耦视觉稀疏性来提高视觉语言模型(VLM)的效率。它在预填充阶段修剪冗余的视觉令牌,在解码阶段仅检索与查询相关的令牌,从而在长上下文视频任务中实现预填充高达4.0倍、解码高达2.5倍的速度提升,以及整体2.6倍的端到端加速。该方法同时提高了文档理解和推理任务的准确性,且不牺牲模型能力。
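The two-stage idea described for SparseVILA (query-agnostic pruning at prefill, query-aware retrieval at each decoding round) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the salience proxy (squared norm), the dot-product relevance score, and the function names `prune_prefill` and `retrieve_for_query` are all assumptions.

```python
import random


def prune_prefill(tokens, keep_ratio):
    """Query-agnostic pruning: keep the tokens with the largest squared norm
    (an assumed salience proxy, not the paper's criterion)."""
    n_keep = max(1, round(len(tokens) * keep_ratio))
    salience = [sum(x * x for x in t) for t in tokens]
    ranked = sorted(range(len(tokens)), key=lambda i: salience[i], reverse=True)
    return sorted(ranked[:n_keep])  # kept indices, in original order


def retrieve_for_query(tokens, cached_ids, query, top_k):
    """Query-aware retrieval: pick the cached tokens with the highest dot product."""
    def score(i):
        return sum(a * b for a, b in zip(tokens[i], query))
    ranked = sorted(cached_ids, key=score, reverse=True)
    return sorted(ranked[:top_k])


random.seed(0)
dim = 16
tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(200)]
cache = prune_prefill(tokens, keep_ratio=0.8)       # most of the cache is retained
for turn in range(2):                                # each round retrieves afresh
    query = [random.gauss(0, 1) for _ in range(dim)]
    selected = retrieve_for_query(tokens, cache, query, top_k=20)
    print(turn, len(cache), len(selected))
```

Keeping the pruned cache fixed across turns while re-running retrieval per query mirrors the abstract's point that a later question can still reach tokens an earlier question did not need.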
Seeing but Not Believing: Probing the Disconnect Between Visual Attention and Answer Correctness in VLMs
Authors: Zhining Liu, Ziyi Chen, Hui Liu, Chen Luo, Xianfeng Tang, Suhang Wang, Joy Zeng, Zhenwei Dai, Zhan Shi, Tianxin Wei, Benoit Dumoulin, Hanghang Tong
First: 2025-10-20T17:31:09+00:00 · Latest: 2025-10-20T17:31:09+00:00
Comments: 21 pages, 10 figures, 6 tables
Abstract
Vision-Language Models (VLMs) achieve strong results on multimodal tasks such as visual question answering, yet they can still fail even when the correct visual evidence is present. In this work, we systematically investigate whether these failures arise from not perceiving the evidence or from not leveraging it effectively. By examining layer-wise attention dynamics, we find that shallow layers focus primarily on text, while deeper layers sparsely but reliably attend to localized evidence regions. Surprisingly, VLMs often perceive the visual evidence when outputting incorrect answers, a phenomenon we term "seeing but not believing" that widely exists in major VLM families. Building on this, we introduce an inference-time intervention that highlights deep-layer evidence regions through selective attention-based masking. It requires no training and consistently improves accuracy across multiple families, including LLaVA, Qwen, Gemma, and InternVL. These results show that VLMs encode reliable evidence internally but under-utilize it; making such signals explicit can bridge the gap between perception and reasoning, advancing the diagnostic understanding and reliability of VLMs.
中文标题/摘要
标题:见而不信:探索VLMs中视觉注意与答案正确性之间的差距
视觉-语言模型(VLMs)在多模态任务如视觉问答中表现出色,但即使存在正确的视觉证据,它们也可能失败。在本研究中,我们系统地探讨了这些失败是由于未能感知证据还是未能有效利用证据。通过分析逐层的注意力动态,我们发现浅层主要关注文本,而深层则稀疏但可靠地关注局部证据区域。令人惊讶的是,VLMs在输出错误答案时往往能感知到视觉证据,我们将其称为“见而不信”,这一现象在主要的VLM家族中普遍存在。基于此,我们引入了一种推理时的干预措施,通过选择性注意力掩蔽突出深层的证据区域。该方法无需训练,并且在多个家族中(包括LLaVA、Qwen、Gemma和InternVL)一致提高了准确性。这些结果表明,VLMs内部编码了可靠的证据,但利用率不足,使这些信号显性化可以弥合感知与推理之间的差距,从而推进对VLMs的诊断理解和可靠性。
Summary / 总结
This study investigates why Vision-Language Models (VLMs) can fail to provide correct answers despite having access to relevant visual evidence. By analyzing attention patterns, the research reveals that while deeper layers of VLMs reliably focus on visual evidence, they often fail to utilize this information effectively, leading to incorrect answers. The study introduces a method to highlight deep-layer evidence through selective attention-based masking, which improves accuracy across various VLMs without requiring additional training. This finding suggests that making internal evidence more explicit can enhance the performance and reliability of VLMs.
该研究探讨了为什么视觉语言模型(VLMs)即使拥有相关视觉证据仍会给出错误答案的原因。通过分析注意力模式,研究发现虽然VLMs的深层能够可靠地聚焦于视觉证据,但往往未能有效利用这些信息,导致错误答案。研究引入了一种基于选择性注意力的掩蔽方法来突出显示深层证据,无需额外训练即可提高多种VLMs的准确性。这一发现表明,使内部证据更加显性可以提升VLMs的性能和可靠性。
VERA-V: Variational Inference Framework for Jailbreaking Vision-Language Models
Authors: Qilin Liao, Anamika Lochab, Ruqi Zhang
First: 2025-10-20T17:12:10+00:00 · Latest: 2025-10-20T17:12:10+00:00
Comments: 18 pages, 7 figures
Abstract
Vision-Language Models (VLMs) extend large language models with visual reasoning, but their multimodal design also introduces new, underexplored vulnerabilities. Existing multimodal red-teaming methods largely rely on brittle templates, focus on single-attack settings, and expose only a narrow subset of vulnerabilities. To address these limitations, we introduce VERA-V, a variational inference framework that recasts multimodal jailbreak discovery as learning a joint posterior distribution over paired text-image prompts. This probabilistic view enables the generation of stealthy, coupled adversarial inputs that bypass model guardrails. We train a lightweight attacker to approximate the posterior, allowing efficient sampling of diverse jailbreaks and providing distributional insights into vulnerabilities. VERA-V further integrates three complementary strategies: (i) typography-based text prompts that embed harmful cues, (ii) diffusion-based image synthesis that introduces adversarial signals, and (iii) structured distractors to fragment VLM attention. Experiments on HarmBench and HADES benchmarks show that VERA-V consistently outperforms state-of-the-art baselines on both open-source and frontier VLMs, achieving up to 53.75% higher attack success rate (ASR) over the best baseline on GPT-4o.
中文标题/摘要
标题:VERA-V:视觉语言模型越狱的变分推理框架
视觉语言模型(VLMs)通过视觉推理扩展了大型语言模型,但其多模态设计也引入了新的、未被充分探索的漏洞。现有的多模态红队方法主要依赖于脆弱的模板,专注于单一攻击场景,并仅暴露了一小部分漏洞。为了解决这些局限性,我们引入了VERA-V,这是一种变分推理框架,将多模态越狱发现重新表述为学习配对文本-图像提示的联合后验分布。这种概率视角使得生成隐蔽的、耦合的对抗输入成为可能,这些输入可以绕过模型的护栏。我们训练了一个轻量级的攻击者来近似后验,从而实现多样越狱的高效采样,并提供关于漏洞的分布见解。VERA-V 进一步整合了三种互补策略:(i)基于字体的文本提示,嵌入有害线索;(ii)基于扩散的图像合成,引入对抗信号;(iii)结构化的干扰物,以分散 VLM 的注意力。在 HarmBench 和 HADES 基准测试中,VERA-V 在开源和前沿 VLM 上始终优于最先进的基线,相对于最佳基线在 GPT-4o 上的攻击成功率(ASR)提高了高达 53.75%。
Summary / 总结
VERA-V is a variational inference framework designed to discover vulnerabilities in Vision-Language Models (VLMs) by learning a joint posterior distribution over text-image pairs. This probabilistic approach enables the generation of stealthy adversarial inputs that bypass model guardrails. VERA-V outperforms existing methods on various benchmarks, achieving up to 53.75% higher attack success rate compared to the best baseline on GPT-4o. The framework integrates three strategies: typography-based text prompts, diffusion-based image synthesis, and structured distractors to fragment VLM attention.
VERA-V 是一种变分推理框架,旨在通过学习文本-图像对的联合后验分布来发现视觉-语言模型(VLM)的漏洞。这种概率方法能够生成隐蔽的对抗性输入,绕过模型的防护机制。VERA-V 在各种基准测试中表现优于现有方法,与最佳基线相比,在 GPT-4o 上的攻击成功率提高了高达 53.75%。该框架整合了三种策略:基于字型的文本提示、扩散驱动的图像合成以及结构化的干扰物来分散 VLM 的注意力。
UniCTokens: Boosting Personalized Understanding and Generation via Unified Concept Tokens
Authors: Ruichuan An, Sihan Yang, Renrui Zhang, Zijun Shen, Ming Lu, Gaole Dai, Hao Liang, Ziyu Guo, Shilin Yan, Yulin Luo, Bocheng Zou, Chaoqun Yang, Wentao Zhang
Venue: NeurIPS 2025
First: 2025-05-20T17:56:01+00:00 · Latest: 2025-10-20T16:56:39+00:00
Abstract
Personalized models have demonstrated remarkable success in understanding and generating concepts provided by users. However, existing methods use separate concept tokens for understanding and generation, treating these tasks in isolation. This may result in limitations for generating images with complex prompts. For example, given the concept <bo>, a model should generate "<bo> wearing its hat" without additional textual descriptions of its hat. We call this kind of generation personalized attribute-reasoning generation. To address the limitation, we present UniCTokens, a novel framework that effectively integrates personalized information into a unified vision language model (VLM) for understanding and generation. UniCTokens trains a set of unified concept tokens to leverage complementary semantics, boosting two personalized tasks. Moreover, we propose a progressive training strategy with three stages: understanding warm-up, bootstrapping generation from understanding, and deepening understanding from generation to enhance mutual benefits between both tasks. To quantitatively evaluate the unified VLM personalization, we present UnifyBench, the first benchmark for assessing concept understanding, concept generation, and attribute-reasoning generation. Experimental results on UnifyBench indicate that UniCTokens shows competitive performance compared to leading methods in concept understanding and concept generation, and achieves state-of-the-art results in personalized attribute-reasoning generation. Our research demonstrates that enhanced understanding improves generation, and the generation process can yield valuable insights into understanding. Our code and dataset will be released at: https://github.com/arctanxarc/UniCTokens.
中文标题/摘要
标题:UniCTokens:通过统一概念令牌增强个性化理解和生成
个性化模型在理解和生成用户提供的概念方面取得了显著的成功。然而,现有方法使用单独的概念令牌分别进行理解和生成,将这两个任务孤立处理。这可能导致在生成具有复杂提示的图像时存在局限性。例如,给定概念<bo>,生成"<bo>戴着它的帽子"而无需额外的关于其帽子的文字描述。我们称这种生成为个性化属性推理生成。为了解决这一局限性,我们提出了UniCTokens,这是一种新颖的框架,能够有效地将个性化信息整合到统一的视觉语言模型(VLM)中,用于理解和生成。UniCTokens训练一组统一的概念令牌,利用互补的语义,增强两个个性化任务。此外,我们提出了一种分阶段的训练策略,分为理解预热、从理解中启动生成和从生成深化理解三个阶段,以增强两个任务之间的相互收益。为了定量评估统一VLM的个性化,我们提出了UnifyBench,这是第一个用于评估概念理解、概念生成和属性推理生成的基准。UnifyBench上的实验结果表明,UniCTokens在概念理解、概念生成方面表现出竞争力,并在个性化属性推理生成方面达到了最先进的结果。我们的研究证明,增强的理解可以提高生成,而生成过程也可以为理解提供有价值的见解。我们的代码和数据集将在以下链接发布: https://github.com/arctanxarc/UniCTokens。
Summary / 总结
The paper introduces UniCTokens, a framework that integrates personalized information into a unified vision language model for understanding and generation. It addresses the limitation of existing methods by using unified concept tokens and a progressive three-stage training strategy. Experimental results on UnifyBench show that UniCTokens is competitive with leading methods in concept understanding and concept generation, and achieves state-of-the-art results in personalized attribute-reasoning generation. The framework enhances mutual benefits between understanding and generation, and the authors plan to release the code and dataset.
论文提出了UniCTokens框架,将个性化信息整合到统一的视觉语言模型中,用于理解和生成。该框架通过使用统一的概念令牌和渐进式训练策略解决了现有方法的局限性。实验结果表明,UniCTokens在概念理解和概念生成方面与领先方法相比具有竞争力,并在个性化属性推理生成方面取得了最先进的结果。该框架增强了理解和生成之间的相互作用,作者计划发布代码和数据集。
Joint Multi-Condition Representation Modelling via Matrix Factorisation for Visual Place Recognition
Authors: Timur Ismagilov, Shakaiba Majeed, Michael Milford, Tan Viet Tuyen Nguyen, Sarvapali D. Ramchurn, Shoaib Ehsan
First: 2025-10-20T16:50:03+00:00 · Latest: 2025-10-20T16:50:03+00:00
Comments: 13 pages
Abstract
We address multi-reference visual place recognition (VPR), where reference sets captured under varying conditions are used to improve localisation performance. While deep learning with large-scale training improves robustness, increasing data diversity and model complexity incur extensive computational cost during training and deployment. Descriptor-level fusion via voting or aggregation avoids training, but often targets multi-sensor setups or relies on heuristics with limited gains under appearance and viewpoint change. We propose a training-free, descriptor-agnostic approach that jointly models places using multiple reference descriptors via matrix decomposition into basis representations, enabling projection-based residual matching. We also introduce SotonMV, a structured benchmark for multi-viewpoint VPR. On multi-appearance data, our method improves Recall@1 by up to ~18% over single-reference and outperforms multi-reference baselines across appearance and viewpoint changes, with gains of ~5% on unstructured data, demonstrating strong generalisation while remaining lightweight.
中文标题/摘要
标题:通过矩阵分解联合多条件表示建模用于视觉地点识别
我们解决了多参考视觉地点识别(VPR),其中使用在不同条件下捕获的参考集来提高定位性能。虽然大规模训练的深度学习提高了鲁棒性,但增加数据多样性和模型复杂性会导致训练和部署时的大量计算成本。通过投票或聚合进行描述符级融合可以避免训练,但通常针对多传感器设置或依赖于在外观和视角变化下效果有限的启发式方法。我们提出了一种无需训练、描述符无关的方法,通过矩阵分解将多个参考描述符联合建模为基础表示,从而实现投影残差匹配。我们还引入了SotonMV,一个用于多视角VPR的结构化基准。在多外观数据上,我们的方法在Recall@1上提高了高达约18%,在外观和视角变化下优于单参考基线,并在无结构数据上提高了约5%的性能,展示了强大的泛化能力同时保持轻量级。
Summary / 总结
The paper addresses the challenge of multi-reference visual place recognition by proposing a training-free, descriptor-agnostic approach that uses matrix factorization to jointly model places from multiple reference descriptors. This method enables projection-based residual matching, improves Recall@1 by up to about 18% over single-reference matching on multi-appearance data, and outperforms multi-reference baselines across appearance and viewpoint changes, showing strong generalization while remaining lightweight. The authors also introduce SotonMV, a structured benchmark for multi-viewpoint VPR.
研究旨在通过使用多种在不同条件下的参考集来提升视觉位置识别(VPR)。提出了一种无需训练、描述符无关的方法,通过矩阵分解来联合建模位置,实现高效的投影残差匹配。该方法在多外观数据上将Recall@1提高了最多18%,优于单参考和多参考基线,并在非结构化数据上表现出强大的泛化能力,同时保持轻量级。
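To make "projection-based residual matching" concrete, the sketch below builds an orthonormal basis from each place's reference descriptors and assigns a query to the place whose subspace leaves the smallest residual. It is a minimal illustration under stated assumptions: Gram-Schmidt (a QR-style factorisation) is swapped in for whatever decomposition the paper actually uses, and the toy descriptors, place names, and function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(descriptors, eps=1e-9):
    """Gram-Schmidt over one place's reference descriptors (a QR-style factorisation)."""
    basis = []
    for d in descriptors:
        r = list(d)
        for b in basis:
            c = dot(r, b)
            r = [ri - c * bi for ri, bi in zip(r, b)]
        norm = math.sqrt(dot(r, r))
        if norm > eps:                       # skip linearly dependent descriptors
            basis.append([ri / norm for ri in r])
    return basis


def residual(query, basis):
    """Norm of the part of `query` lying outside the span of `basis`."""
    r = list(query)
    for b in basis:
        c = dot(r, b)
        r = [ri - c * bi for ri, bi in zip(r, b)]
    return math.sqrt(dot(r, r))


# Toy data: two places, each with two reference descriptors from different conditions.
places = {
    "place_a": [[1.0, 0.0, 0.0, 0.0], [0.9, 0.4, 0.0, 0.0]],
    "place_b": [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.8, 0.5]],
}
bases = {name: orthonormal_basis(refs) for name, refs in places.items()}
query = [0.7, 0.7, 0.05, 0.0]                # mostly inside place_a's subspace
best = min(bases, key=lambda name: residual(query, bases[name]))
print(best)
```

Because the query only has to lie near the span of a place's references, it can match a place even when it resembles none of the individual reference descriptors closely.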
Towards 3D Objectness Learning in an Open World
Authors: Taichi Liu, Zhenyu Wang, Ruofeng Liu, Guang Wang, Desheng Zhang
Venue: NeurIPS 2025
First: 2025-10-20T16:01:20+00:00 · Latest: 2025-10-20T16:01:20+00:00
Comments: Accepted by NeurIPS 2025
Abstract
Recent advancements in 3D object detection and novel category detection have made significant progress, yet research on learning generalized 3D objectness remains insufficient. In this paper, we delve into learning open-world 3D objectness, which focuses on detecting all objects in a 3D scene, including novel objects unseen during training. Traditional closed-set 3D detectors struggle to generalize to open-world scenarios, while directly incorporating 3D open-vocabulary models for open-world ability struggles with vocabulary expansion and semantic overlap. To achieve generalized 3D object discovery, we propose OP3Det, a class-agnostic Open-World Prompt-free 3D Detector to detect any objects within 3D scenes without relying on hand-crafted text prompts. We introduce the strong generalization and zero-shot capabilities of 2D foundation models, utilizing both 2D semantic priors and 3D geometric priors for class-agnostic proposals to broaden 3D object discovery. Then, by integrating complementary information from point cloud and RGB image in the cross-modal mixture of experts, OP3Det dynamically routes uni-modal and multi-modal features to learn generalized 3D objectness. Extensive experiments demonstrate the extraordinary performance of OP3Det, which significantly surpasses existing open-world 3D detectors by up to 16.0% in AR and achieves a 13.5% improvement compared to closed-world 3D detectors.
中文标题/摘要
标题:向开放世界中的3D对象性学习迈进
近年来,3D物体检测和新型类别检测取得了显著进展,但关于学习通用3D对象性的研究仍显不足。本文探讨了开放世界3D对象性的学习,旨在检测3D场景中的所有物体,包括训练期间未见过的新物体。传统的封闭集3D检测器难以在开放世界场景中泛化,而直接将3D开放词汇模型用于开放世界能力则面临词汇扩展和语义重叠的问题。为了实现通用3D物体发现,我们提出了OP3Det,这是一种无提示的开放世界通用3D检测器,能够在不依赖手工设计的文本提示的情况下检测3D场景中的任何物体。我们引入了2D基础模型的强大泛化能力和零样本能力,利用2D语义先验和3D几何先验生成通用类别提案,以拓宽3D物体发现。然后,通过在跨模态专家混合中整合点云和RGB图像的互补信息,OP3Det动态路由单模态和多模态特征以学习通用3D对象性。大量实验表明,OP3Det表现出色,其性能显著优于现有开放世界3D检测器,AR指标上高出16.0%,与封闭世界3D检测器相比,性能提高了13.5%。
Summary / 总结
This paper addresses the challenge of learning generalized 3D objectness in an open-world setting, where the detector must identify both known and novel objects. The authors propose OP3Det, a class-agnostic 3D detector that leverages 2D foundation models and integrates information from point clouds and RGB images to achieve strong generalization and zero-shot capabilities. Experimental results show that OP3Det outperforms existing open-world 3D detectors by up to 16.0% in Average Recall and improves upon closed-world detectors by 13.5%.
本文针对开放世界中的3D物体检测问题,即检测所有物体,包括训练时未见过的新物体。作者提出了一种类别无关的3D检测器OP3Det,该检测器利用2D基础模型并结合点云和RGB图像信息,实现了强大的泛化能力和零样本能力。实验结果表明,OP3Det在平均召回率(AR)上比现有开放世界3D检测器最多高出16.0%,比封闭世界3D检测器提高了13.5%。
On-the-Fly OVD Adaptation with FLAME: Few-shot Localization via Active Marginal-Samples Exploration
Authors: Yehonathan Refael, Amit Aides, Aviad Barzilai, George Leifman, Genady Beryozkin, Vered Silverman, Bolous Jaber, Tomer Shekel
First: 2025-10-20T15:41:55+00:00 · Latest: 2025-10-20T15:41:55+00:00
Abstract
Open-vocabulary object detection (OVD) models offer remarkable flexibility by detecting objects from arbitrary text queries. However, their zero-shot performance in specialized domains like Remote Sensing (RS) is often compromised by the inherent ambiguity of natural language, limiting critical downstream applications. For instance, an OVD model may struggle to distinguish between fine-grained classes such as "fishing boat" and "yacht" since their embeddings are similar and often inseparable. This can hamper specific user goals, such as monitoring illegal fishing, by producing irrelevant detections. To address this, we propose a cascaded approach that couples the broad generalization of a large pre-trained OVD model with a lightweight few-shot classifier. Our method first employs the zero-shot model to generate high-recall object proposals. These proposals are then refined for high precision by a compact classifier trained in real-time on only a handful of user-annotated examples, drastically reducing the high costs of RS imagery annotation. The core of our framework is FLAME, a one-step active learning strategy that selects the most informative samples for training. FLAME identifies, on the fly, uncertain marginal candidates near the decision boundary using density estimation, followed by clustering to ensure sample diversity. This efficient sampling technique achieves high accuracy without costly full-model fine-tuning and enables instant adaptation, within less than a minute, which is significantly faster than state-of-the-art alternatives. Our method consistently surpasses state-of-the-art performance on RS benchmarks, establishing a practical and resource-efficient framework for adapting foundation models to specific user needs.
中文标题/摘要
标题:FLAME驱动的即时OVD适应:基于活跃边际样本探索的少样本定位
开放词汇对象检测(OVD)模型通过从任意文本查询中检测对象提供了显著的灵活性。然而,在如遥感(RS)等专门领域中,它们的零样本性能往往因自然语言的固有歧义而受损,限制了关键的下游应用。例如,一个OVD模型可能难以区分“渔船”和“游艇”这类细粒度类别,因为它们的嵌入相似且经常不可分。这可能妨碍特定用户目标,如监测非法捕鱼,导致无关的检测结果。为了解决这一问题,我们提出了一种级联方法,将大型预训练OVD模型的广泛泛化与轻量级少样本分类器相结合。我们的方法首先使用零样本模型生成高召回的对象提案,然后通过仅在少量用户标注示例上实时训练的小型分类器进行高精度细化,大幅降低了RS图像标注的高昂成本。我们框架的核心是FLAME,这是一种一步式主动学习策略,能够选择最具信息量的样本进行训练。FLAME利用密度估计在决策边界附近即时识别不确定的边际候选样本,然后通过聚类确保样本多样性。这种高效的采样技术在无需昂贵的全模型微调的情况下实现了高精度,并能在不到一分钟内实现即时适应,显著快于最先进的替代方案。我们的方法在RS基准测试中始终超越了最先进的性能,建立了一个实用且资源高效的框架,用于将基础模型适应特定用户需求。
Summary / 总结
The paper addresses the challenge of zero-shot performance in open-vocabulary object detection (OVD) models in specialized domains like Remote Sensing (RS), where the models struggle with distinguishing fine-grained classes. It proposes a cascaded approach using a large pre-trained OVD model and a lightweight few-shot classifier. The method employs FLAME, a one-step active learning strategy, to select informative samples for real-time training, achieving high accuracy and adaptation in less than a minute. The approach consistently outperforms state-of-the-art methods on RS benchmarks.
论文针对开放词汇对象检测(OVD)模型在遥感(RS)等专门领域中的零样本性能不足问题,提出了一个级联方法,使用一个大型预训练的OVD模型生成高召回的对象提案,然后通过仅基于少量用户标注样本训练的轻量级few-shot分类器进行精细化。核心方法是FLAME,这是一种实时选择最具信息量样本的主动学习策略,通过密度估计和聚类确保样本多样性,从而在不到一分钟内实现高效适应,显著快于现有最佳方法。该方法在RS基准测试中表现出色,提供了一种实用且资源高效的框架,以适应特定用户需求。
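The sampling idea behind FLAME (pick uncertain candidates near the decision boundary, then spread the picks out for diversity) can be illustrated with a much simpler stand-in. The sketch below is not the paper's algorithm: it replaces density estimation with a fixed band around a 0.5 classifier score, and replaces clustering with greedy farthest-point selection. The function name, the toy features, and the scores are all hypothetical.

```python
import math


def select_marginal_diverse(features, scores, margin, budget):
    """Pick up to `budget` samples whose score is within `margin` of 0.5,
    spreading picks out with greedy farthest-point selection."""
    pool = [i for i, s in enumerate(scores) if abs(s - 0.5) <= margin]
    if not pool:
        return []
    # Start from the most uncertain candidate.
    chosen = [min(pool, key=lambda i: abs(scores[i] - 0.5))]
    while len(chosen) < min(budget, len(pool)):
        rest = [i for i in pool if i not in chosen]
        # Add the candidate farthest from everything already chosen.
        nxt = max(
            rest,
            key=lambda i: min(math.dist(features[i], features[j]) for j in chosen),
        )
        chosen.append(nxt)
    return chosen


features = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
scores = [0.50, 0.52, 0.45, 0.95, 0.40]
picked = select_marginal_diverse(features, scores, margin=0.15, budget=3)
print(picked)
```

In this toy run the confident sample (score 0.95) is never offered for annotation, and the near-duplicate of the first pick is skipped in favour of candidates from other regions of feature space.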
Frugal Federated Learning for Violence Detection: A Comparison of LoRA-Tuned VLMs and Personalized CNNs
Authors: Sébastien Thuau, Siba Haidar, Ayush Bajracharya, Rachid Chelouah
First: 2025-10-20T15:26:43+00:00 · Latest: 2025-10-20T15:26:43+00:00
Comments: 7 pages, 1 figure, FLTA 2025
Abstract
We examine frugal federated learning approaches to violence detection by comparing two complementary strategies: (i) zero-shot and federated fine-tuning of vision-language models (VLMs), and (ii) personalized training of a compact 3D convolutional neural network (CNN3D). Using LLaVA-7B and a 65.8M parameter CNN3D as representative cases, we evaluate accuracy, calibration, and energy usage under realistic non-IID settings. Both approaches exceed 90% accuracy. CNN3D slightly outperforms Low-Rank Adaptation (LoRA)-tuned VLMs in ROC AUC and log loss, while using less energy. VLMs remain favorable for contextual reasoning and multimodal inference. We quantify energy and CO2 emissions across training and inference, and analyze sustainability trade-offs for deployment. To our knowledge, this is the first comparative study of LoRA-tuned vision-language models and personalized CNNs for federated violence detection, with an emphasis on energy efficiency and environmental metrics. These findings support a hybrid model: lightweight CNNs for routine classification, with selective VLM activation for complex or descriptive scenarios. The resulting framework offers a reproducible baseline for responsible, resource-aware AI in video surveillance, with extensions toward real-time, multimodal, and lifecycle-aware systems.
中文标题/摘要
标题:节俭联邦学习在暴力检测中的应用:LoRA调优VLM与个性化CNN的比较
我们通过比较两种互补策略来研究节俭联邦学习在暴力检测中的应用:(i) 零样本和联邦微调视觉语言模型(VLMs),(ii) 个性化训练紧凑的3D卷积神经网络(CNN3D)。使用LLaVA-7B和一个65.8M参数的CNN3D作为代表性案例,我们在非IID的现实环境中评估准确率、校准和能耗。两种方法的准确率均超过90%。CNN3D在ROC AUC和log loss上略优于LoRA调优的VLMs,同时能耗更低。VLMs在上下文推理和多模态推理方面仍占优势。我们量化了训练和推理过程中的能耗和CO2排放,并分析了部署过程中的可持续性权衡。据我们所知,这是首次对LoRA调优的视觉语言模型和个性化CNN在联邦暴力检测中的比较研究,重点在于能效和环境指标。这些发现支持一种混合模型:轻量级CNN用于常规分类,而选择性激活VLM用于复杂或描述性场景。由此形成的框架为视频监控中的负责任、资源感知AI提供了一个可重复的基线,并扩展到实时、多模态和生命周期感知系统。
Summary / 总结
This study compares frugal federated learning approaches for violence detection, evaluating zero-shot and federated fine-tuning of vision-language models (VLMs) against personalized training of a compact 3D CNN. Both methods achieve over 90% accuracy, with the CNN3D slightly outperforming LoRA-tuned VLMs in ROC AUC and log loss while using less energy. The research quantifies energy and CO2 emissions, highlighting the sustainability trade-offs and supporting a hybrid model for routine and complex scenarios.
研究比较了暴力检测中的节俭联邦学习方法,评估了零样本和联邦微调视觉语言模型(VLMs)与个性化训练紧凑型3D CNN的效果。两种方法均超过90%的准确率,3D CNN在ROC AUC和对数损失上略优于LoRA微调的VLMs,同时能耗更低。研究量化了训练和推理过程中的能耗和二氧化碳排放,强调了可持续性权衡,并支持一种混合模型,用于常规和复杂场景。
DitHub: A Modular Framework for Incremental Open-Vocabulary Object Detection
Authors: Chiara Cappellino, Gianluca Mancusi, Matteo Mosconi, Angelo Porrello, Simone Calderara, Rita Cucchiara
Venue: NeurIPS 2025
First: 2025-03-12T11:15:34+00:00 · Latest: 2025-10-20T15:09:57+00:00
Comments: Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
Abstract
Open-Vocabulary object detectors can generalize to an unrestricted set of categories through simple textual prompting. However, adapting these models to rare classes or reinforcing their abilities on multiple specialized domains remains essential. While recent methods rely on monolithic adaptation strategies with a single set of weights, we embrace modular deep learning. We introduce DitHub, a framework designed to build and maintain a library of efficient adaptation modules. Inspired by Version Control Systems, DitHub manages expert modules as branches that can be fetched and merged as needed. This modular approach allows us to conduct an in-depth exploration of the compositional properties of adaptation modules, marking the first such study in Object Detection. Our method achieves state-of-the-art performance on the ODinW-13 benchmark and ODinW-O, a newly introduced benchmark designed to assess class reappearance. For more details, visit our project page: https://aimagelab.github.io/DitHub/
中文标题/摘要
标题:DitHub:增量开放词汇对象检测的模块化框架
开放词汇对象检测器可以通过简单的文本提示泛化到不受限制的类别集合。然而,将这些模型适应稀有类别或在多个专业化领域加强其能力仍然是必不可少的。虽然最近的方法依赖于单一权重的庞大适应策略,我们则采用了模块化深度学习。我们引入了DitHub,一个旨在构建和维护高效适应模块库的框架。受到版本控制系统的影响,DitHub将专家模块视为分支,可以根据需要获取和合并。这种模块化方法使我们能够深入探索适应模块的组合特性,这是在对象检测领域中的首次此类研究。我们的方法在ODinW-13基准和新引入的ODinW-O基准上达到了最先进的性能,后者旨在评估类别的再现性。欲了解更多信息,请访问我们的项目页面:https://aimagelab.github.io/DitHub/
Summary / 总结
DitHub is a modular framework for incremental open-vocabulary object detection that adapts detectors to rare classes and specialized domains. Borrowing from version control systems, it manages expert adaptation modules as branches that can be fetched and merged as needed. This modular design enables the first in-depth study of the compositional properties of adaptation modules in object detection. DitHub achieves state-of-the-art performance on ODinW-13 and on ODinW-O, a newly introduced benchmark designed to assess class reappearance.
研究旨在通过开发一个模块化框架来提升开放词汇对象检测能力,该框架能够高效地适应新类或稀有类。DitHub借鉴了版本控制系统的方法,将适应模块作为分支管理,实现灵活高效的更新。该方法在ODinW-13基准测试和新引入的ODinW-O基准测试中表现出色,后者专门评估类的再现能力。
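The branch metaphor above can be illustrated with a minimal sketch. Everything here is hypothetical: the `ModuleLibrary` class, the parameter name `adapter.weight`, and in particular the merge rule (a plain element-wise average) are assumptions chosen for illustration, since the abstract does not say how DitHub stores or merges modules.

```python
from typing import Dict, List

# Parameter name -> flat list of values (a stand-in for real tensors).
Weights = Dict[str, List[float]]


class ModuleLibrary:
    """Hypothetical store of adaptation modules, one 'branch' per class."""

    def __init__(self) -> None:
        self._branches: Dict[str, Weights] = {}

    def commit(self, branch: str, weights: Weights) -> None:
        # Store a copy so later edits by the caller do not alter the library.
        self._branches[branch] = {k: list(v) for k, v in weights.items()}

    def fetch(self, branch: str) -> Weights:
        # Return a copy of the module stored under this branch name.
        return {k: list(v) for k, v in self._branches[branch].items()}

    def merge(self, branches: List[str]) -> Weights:
        # ASSUMPTION: merging is an element-wise average of the fetched
        # modules. The source does not specify the actual merge rule.
        fetched = [self.fetch(b) for b in branches]
        merged: Weights = {}
        for name in fetched[0]:
            columns = zip(*(w[name] for w in fetched))
            merged[name] = [sum(col) / len(fetched) for col in columns]
        return merged


library = ModuleLibrary()
library.commit("cat", {"adapter.weight": [1.0, 2.0]})
library.commit("dog", {"adapter.weight": [3.0, 4.0]})
merged = library.merge(["cat", "dog"])
print(merged)
```

Keeping each class in its own branch means a reappearing class can be updated by committing to one branch without touching the others, which is the property the ODinW-O benchmark is described as probing.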
When Does Supervised Training Pay Off? The Hidden Economics of Object Detection in the Era of Vision-Language Models
Authors: Samer Al-Hamadani
First: 2025-10-13T11:48:48+00:00 · Latest: 2025-10-20T15:09:23+00:00
Comments: 30 pages, 12 figures, 4 tables
Abstract
Object detection traditionally relies on costly manual annotation. We present the first comprehensive cost-effectiveness analysis comparing supervised YOLO and zero-shot vision-language models (Gemini Flash 2.5 and GPT-4). Evaluated on 5,000 stratified COCO images and 500 diverse product images, combined with Total Cost of Ownership modeling, we derive break-even thresholds for architecture selection. Results show supervised YOLO attains 91.2% accuracy versus 68.5% for Gemini and 71.3% for GPT-4 on standard categories; the annotation expense for a 100-category system is $10,800, and the accuracy advantage only pays off beyond 55 million inferences (151,000 images/day for one year). On diverse product categories Gemini achieves 52.3% and GPT-4 55.1%, while supervised YOLO cannot detect untrained classes. Cost-per-correct-detection favors Gemini ($0.00050) and GPT-4 ($0.00067) over YOLO ($0.143) at 100,000 inferences. We provide decision frameworks showing that optimal architecture choice depends on inference volume, category stability, budget, and accuracy requirements.
中文标题/摘要
标题:监督训练何时见效?视觉语言模型时代目标检测的隐性经济学分析
目标检测传统上依赖昂贵的手动标注。我们首次进行全面的成本效益分析,比较了监督YOLO和零样本视觉语言模型(Gemini Flash 2.5和GPT-4)。在5,000张分层COCO图像和500张多样产品图像上进行评估,并结合总拥有成本模型,我们推导出架构选择的临界点。结果显示,监督YOLO在标准类别上的准确率为91.2%,而Gemini为68.5%,GPT-4为71.3%;对于100个类别的系统,标注费用为10,800美元,只有在超过5,500万次推理(相当于一年内每天处理151,000张图像)时,准确率优势才开始见效。在多样产品类别上,Gemini达到52.3%,GPT-4达到55.1%,而监督YOLO无法检测未训练的类别。每正确检测一次的成本表明,在10万次推理时,Gemini(0.00050美元)和GPT-4(0.00067美元)优于YOLO(0.143美元)。我们提供了决策框架,表明最佳架构选择取决于推理量、类别稳定性、预算和准确率要求。
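As a quick sanity check on the break-even figure in the abstract above, the short sketch below converts 55 million inferences into a daily volume over one year and splits the annotation cost per category. It uses only the numbers quoted in the abstract; the variable names are illustrative, and the even per-category split is an assumption made only for this arithmetic.

```python
# Figures quoted in the abstract.
BREAK_EVEN_INFERENCES = 55_000_000   # inferences before YOLO pays off
ANNOTATION_COST_USD = 10_800         # for a 100-category system
NUM_CATEGORIES = 100
DAYS_PER_YEAR = 365

# Daily volume needed to reach break-even within one year.
images_per_day = BREAK_EVEN_INFERENCES / DAYS_PER_YEAR

# Average annotation cost per category (assumes an even split).
annotation_per_category = ANNOTATION_COST_USD / NUM_CATEGORIES

print(f"{images_per_day:,.0f} images/day")
print(f"${annotation_per_category:.2f} per category")
```

The daily volume rounds to the 151,000 images/day stated in the abstract, which confirms that the threshold is 55 million inferences rather than a larger order of magnitude.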
Grounded Reinforcement Learning for Visual Reasoning
Authors: Gabriel Sarch, Snigdha Saha, Naitik Khandelwal, Ayush Jain, Michael J. Tarr, Aviral Kumar, Katerina Fragkiadaki
First: 2025-05-29T17:20:26+00:00 · Latest: 2025-10-20T14:54:22+00:00
Comments: Project website: https://visually-grounded-rl.github.io/
Abstract
While reinforcement learning (RL) over chains of thought has significantly advanced language models in tasks such as mathematics and coding, visual reasoning introduces added complexity by requiring models to direct visual attention, interpret perceptual inputs, and ground abstract reasoning in spatial evidence. We introduce ViGoRL (Visually Grounded Reinforcement Learning), a vision-language model trained with RL to explicitly anchor each reasoning step to specific visual coordinates. Inspired by human visual decision-making, ViGoRL learns to produce spatially grounded reasoning traces, guiding visual attention to task-relevant regions at each step. When fine-grained exploration is required, our novel multi-turn RL framework enables the model to dynamically zoom into predicted coordinates as reasoning unfolds. Across a diverse set of visual reasoning benchmarks--including SAT-2 and BLINK for spatial reasoning, V*bench for visual search, and ScreenSpot and VisualWebArena for web-based grounding--ViGoRL consistently outperforms both supervised fine-tuning and conventional RL baselines that lack explicit grounding mechanisms. Incorporating multi-turn RL with zoomed-in visual feedback significantly improves ViGoRL's performance on localizing small GUI elements and visual search, achieving 86.4% on V*Bench. Additionally, we find that grounding amplifies other visual behaviors such as region exploration, grounded subgoal setting, and visual verification. Finally, human evaluations show that the model's visual references are not only spatially accurate but also helpful for understanding model reasoning steps. Our results show that visually grounded RL is a strong paradigm for imbuing models with general-purpose visual reasoning.
中文标题/摘要
标题:基于视觉锚定的强化学习视觉推理
虽然强化学习(RL)在语言模型中的链式思维任务中取得了显著进展,如数学和编程,但视觉推理增加了复杂性,要求模型指导视觉注意力、解释感知输入,并将抽象推理与空间证据联系起来。我们提出了ViGoRL(视觉锚定强化学习),这是一种通过RL训练的视觉-语言模型,明确将每一步推理与特定的视觉坐标锚定。受人类视觉决策的启发,ViGoRL学习生成空间上锚定的推理轨迹,在每一步引导视觉注意力到任务相关区域。当需要精细探索时,我们新颖的多轮RL框架使模型能够随着推理展开动态放大预测坐标。在包括SAT-2和BLINK的空间推理、V*bench的视觉搜索、ScreenSpot和VisualWebArena的基于网页的定位等一系列视觉推理基准测试中,ViGoRL始终优于监督微调和缺乏明确锚定机制的传统RL基线。结合多轮RL与放大的视觉反馈显著提高了ViGoRL在定位小GUI元素和视觉搜索方面的性能,达到V*Bench的86.4%。此外,我们发现锚定放大了其他视觉行为,如区域探索、锚定子目标设置和视觉验证。最后,人类评估表明,模型的视觉参考不仅在空间上准确,而且有助于理解模型的推理步骤。我们的结果表明,视觉锚定的RL是一种强大的范式,用于赋予模型通用视觉推理能力。
Summary / 总结
The research aims to enhance visual reasoning in language models by integrating reinforcement learning (RL) with explicit visual grounding. ViGoRL, a vision-language model, is trained using RL to anchor each reasoning step to specific visual coordinates, mimicking human visual decision-making. Across various visual reasoning benchmarks, ViGoRL outperforms both supervised fine-tuning and conventional RL baselines, particularly in tasks requiring fine-grained exploration, achieving 86.4% on V*Bench. Visual grounding also improves other visual behaviors such as region exploration and visual verification, and human evaluations confirm the model's spatial accuracy and reasoning clarity.
研究旨在通过将强化学习(RL)与视觉坐标相结合,提升模型的视觉推理能力。ViGoRL 是一种视觉语言模型,使用 RL 明确将推理步骤锚定到特定的视觉位置,模仿人类的视觉决策过程。在各种视觉推理基准测试中,ViGoRL 在需要精细探索的任务中表现优于监督微调和传统 RL 方法,特别是在 V*Bench 任务中达到 86.4% 的成绩。
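The "zoom into predicted coordinates" step described above can be pictured as computing a crop window around a predicted point. The sketch below is a hypothetical illustration, not ViGoRL's implementation: the function name `zoom_box`, the fixed `scale` parameter, and the clamping policy are all assumptions.

```python
from typing import Tuple


def zoom_box(x: float, y: float, width: int, height: int,
             scale: float = 0.5) -> Tuple[int, int, int, int]:
    """Return a (left, top, right, bottom) crop centered near (x, y).

    The crop covers `scale` of each image dimension and is shifted as
    needed so that it stays fully inside the image.
    """
    if not 0 < scale <= 1:
        raise ValueError("scale must be in (0, 1]")
    crop_w = max(1, int(width * scale))
    crop_h = max(1, int(height * scale))
    # Center the window on the predicted point.
    left = int(round(x - crop_w / 2))
    top = int(round(y - crop_h / 2))
    # Clamp so the window does not extend past the image borders.
    left = min(max(left, 0), width - crop_w)
    top = min(max(top, 0), height - crop_h)
    return left, top, left + crop_w, top + crop_h


center_box = zoom_box(500, 500, 1000, 1000)
corner_box = zoom_box(10, 990, 1000, 1000)
print(center_box, corner_box)
```

The returned tuple follows the (left, upper, right, lower) order that common image libraries expect for cropping, so the window could be passed to a crop call and the zoomed view fed back to the model on the next turn.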
MIRAGE: Agentic Framework for Multimodal Misinformation Detection with Web-Grounded Reasoning
Authors: Mir Nafis Sharear Shopnil, Sharad Duwal, Abhishek Tyagi, Adiba Mahbub Proma
First: 2025-10-20T14:40:26+00:00 · Latest: 2025-10-20T14:40:26+00:00
Comments: 16 pages, 3 tables, 1 figure
Abstract
Misinformation spreads across web platforms through billions of daily multimodal posts that combine text and images, overwhelming manual fact-checking capacity. Supervised detection models require domain-specific training data and fail to generalize across diverse manipulation tactics. We present MIRAGE, an inference-time, model-pluggable agentic framework that decomposes multimodal verification into four sequential modules: visual veracity assessment detects AI-generated images, cross-modal consistency analysis identifies out-of-context repurposing, retrieval-augmented factual checking grounds claims in web evidence through iterative question generation, and a calibrated judgment module integrates all signals. MIRAGE orchestrates vision-language model reasoning with targeted web retrieval and outputs structured, citation-linked rationales. On the MMFakeBench validation set (1,000 samples), MIRAGE with GPT-4o-mini achieves 81.65% F1 and 75.1% accuracy, outperforming the strongest zero-shot baseline (GPT-4V with MMD-Agent at 74.0% F1) by 7.65 points while maintaining a 34.3% false positive rate versus 97.3% for a judge-only baseline. Test set results (5,000 samples) confirm generalization with 81.44% F1 and 75.08% accuracy. Ablation studies show visual verification contributes 5.18 F1 points and retrieval-augmented reasoning contributes 2.97 points. Our results demonstrate that decomposed agentic reasoning with web retrieval can match supervised detector performance without domain-specific training, enabling misinformation detection across modalities where labeled data remains scarce.
中文标题/摘要
标题:MIRAGE:基于网页溯源推理的多模态虚假信息检测智能体框架
虚假信息通过数十亿日均的多模态帖子(结合文本和图像)在网页平台上传播,使人工事实核查能力不堪重负。监督检测模型需要特定领域的训练数据,并且无法泛化到多种操纵策略。我们提出了 MIRAGE,一种推理时、模型可插拔的智能体框架,将多模态验证分解为四个顺序模块:视觉真实性评估检测 AI 生成的图像,跨模态一致性分析识别脱离原始语境的图像挪用,检索增强的事实核查通过迭代问题生成将声明与网页证据联系起来,以及一个校准判断模块整合所有信号。MIRAGE 将视觉语言模型推理与定向网页检索相结合,输出结构化且带引文链接的推理依据。在 MMFakeBench 验证集(1,000 个样本)上,MIRAGE 使用 GPT-4o-mini 达到了 81.65% 的 F1 和 75.1% 的准确率,优于最强的零样本基线(GPT-4V 与 MMD-Agent 的 74.0% F1)7.65 个百分点,同时将假阳性率保持在 34.3%,而仅使用判断模块的基线为 97.3%。测试集结果(5,000 个样本)证实了泛化能力,F1 为 81.44%,准确率为 75.08%。消融研究显示视觉验证贡献了 5.18 个 F1 点,检索增强的推理贡献了 2.97 个点。我们的结果表明,分解的智能体推理与网页检索可以匹配监督检测器的性能,无需特定领域的训练,从而在标注数据稀缺的情况下实现跨模态的虚假信息检测。
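The abstract above reports F1, accuracy and false positive rate side by side. The sketch below shows how the three are computed from one confusion matrix, assuming misinformation is the positive class (the abstract does not state this). The counts are made-up illustrative values, not MMFakeBench results; only the final subtraction uses figures from the abstract.

```python
from typing import Dict


def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute F1, accuracy and false positive rate from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "f1": 2 * precision * recall / (precision + recall),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        # Share of genuine (negative) posts wrongly flagged as misinformation.
        "false_positive_rate": fp / (fp + tn),
    }


# Illustrative counts only (NOT taken from the paper).
example = detection_metrics(tp=80, fp=20, tn=60, fn=40)
print(example)

# Gap between the two F1 scores quoted in the abstract.
f1_gain = round(81.65 - 74.0, 2)
print(f1_gain)
```

A detector that flags almost everything can still score a respectable F1 when positives dominate, which is why the abstract pairs F1 with the false positive rate when comparing against the judge-only baseline.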
SketchMind: A Multi-Agent Cognitive Framework for Assessing Student-Drawn Scientific Sketches
Authors: Ehsan Latif, Zirak Khan, Xiaoming Zhai
First: 2025-06-29T11:35:10+00:00 · Latest: 2025-10-20T13:55:37+00:00
Comments: Submitted to NeurIPS2025
Abstract
Scientific sketches (e.g., models) offer a powerful lens into students' conceptual understanding, yet AI-powered automated assessment of such free-form, visually diverse artifacts remains a critical challenge. Existing solutions often treat sketch evaluation either as an image classification task or as a task for monolithic vision-language models, which lack interpretability, pedagogical alignment, and adaptability across cognitive levels. To address these limitations, we present SketchMind, a cognitively grounded, multi-agent framework for evaluating and improving student-drawn scientific sketches. SketchMind comprises modular agents responsible for rubric parsing, sketch perception, cognitive alignment, and iterative feedback with sketch modification, enabling personalized and transparent evaluation. We evaluate SketchMind on a curated dataset of 3,575 student-generated sketches across six science assessment items that differ in their highest Bloom's level and require students to draw models to explain phenomena. Compared with baseline GPT-4o performance without SRG (average accuracy: 55.6%), integrating SRG raises average accuracy to 77.1% (+21.4% average absolute gain). We also demonstrate that multi-agent orchestration with SRG enhances SketchMind performance; for example, GPT-4.1 gains an average 8.9% increase in sketch prediction accuracy, outperforming single-agent pipelines across all items. Human evaluators rated the feedback and co-created sketches generated by SketchMind with GPT-4.1 at an average of 4.1 out of 5, significantly higher than those of baseline models (e.g., 2.3 for GPT-4o). Experts noted the system's potential to meaningfully support conceptual growth through guided revision. Our code and (pending approval) dataset will be released to support reproducibility and future research in AI-driven education.
中文标题/摘要
标题:SketchMind:一种评估学生绘制科学草图的认知多代理框架
科学草图(例如,模型)为了解学生概念理解提供了一种强大的视角,但利用人工智能自动评估这种自由形式、视觉多样的艺术制品仍然是一个关键挑战。现有解决方案往往将草图评估视为图像分类任务或单一的视觉-语言模型,缺乏可解释性、教学一致性以及跨认知层次的适应性。为了解决这些限制,我们提出了SketchMind,这是一种基于认知的、多代理的评估和改进学生绘制科学草图的框架。SketchMind包括负责评分标准解析、草图感知、认知对齐以及迭代反馈与草图修改的模块化代理,从而实现个性化和透明的评估。我们使用一个包含3,575个学生生成草图的定制数据集,评估了六项不同最高阶别布鲁姆水平的科学评估项目,这些项目要求学生绘制模型来解释现象。与没有SRG的基线GPT-4o性能(平均准确率:55.6%)相比,集成SRG的性能达到了77.1%的平均准确率(绝对平均增益+21.4%)。我们还展示了多代理协调与SRG集成如何提升SketchMind的性能,例如,GPT-4.1在草图预测准确率上平均提高了8.9%,在所有项目中均优于单代理管道。人类评估者对由SketchMind与GPT-4.1生成的反馈和共创草图的评价为4.1分(满分5分),显著高于基线模型(例如,GPT-4o的评分为2.3)。专家指出,该系统有可能通过引导性修订有意义地支持概念发展。我们的代码和(待审批)数据集将被发布,以支持可重复性和未来的人工智能驱动教育研究。
Summary / 总结
The research aims to develop a more interpretable and adaptable framework for assessing student-drawn scientific sketches. SketchMind, a multi-agent cognitive framework, evaluates and improves these sketches through modular agents that handle rubric parsing, sketch perception, cognitive alignment, and iterative feedback. On 3,575 student sketches across six assessment items, integrating SRG raises average accuracy from 55.6% (baseline GPT-4o without SRG) to 77.1%. Human evaluators rated the feedback and co-created sketches from SketchMind with GPT-4.1 at 4.1 out of 5, versus 2.3 for GPT-4o, and experts noted its potential to support conceptual growth through guided revision.
研究旨在解决对学生绘制的科学草图进行自动评估的挑战,这些草图有助于理解学生的概念知识。提出了一个基于认知的多智能体框架SketchMind,通过处理评分标准解析、草图感知、认知对齐和迭代反馈的模块化智能体来评估和改进这些草图。该框架显著提高了准确性,与基线模型相比,平均准确率达到77.1%,绝对增益为21.4%。人类评估者还对SketchMind生成的反馈和共同创作的草图的评价高于基线模型,表明更好的教学对齐和评估过程的透明性。
Context-Aware Pseudo-Label Scoring for Zero-Shot Video Summarization
Authors: Yuanli Wu, Long Zhang, Yue Du, Bin Li
First: 2025-10-20T12:54:32+00:00 · Latest: 2025-10-20T12:54:32+00:00
Abstract
With the rapid proliferation of video content across social media, surveillance, and education platforms, efficiently summarizing long videos into concise yet semantically faithful surrogates has become increasingly vital. Existing supervised methods achieve strong in-domain accuracy by learning from dense annotations but suffer from high labeling costs and limited cross-dataset generalization, while unsupervised approaches, though label-free, often fail to capture high-level human semantics and fine-grained narrative cues. More recently, zero-shot prompting pipelines have leveraged large language models (LLMs) for training-free video summarization, yet remain highly sensitive to handcrafted prompt templates and dataset-specific score normalization. To overcome these limitations, we introduce a rubric-guided, pseudo-labeled prompting framework that transforms a small subset of ground-truth annotations into high-confidence pseudo labels, which are aggregated into structured, dataset-adaptive scoring rubrics guiding interpretable scene evaluation. During inference, first and last segments are scored based solely on their descriptions, whereas intermediate ones incorporate brief contextual summaries of adjacent scenes to assess narrative progression and redundancy. This contextual prompting enables the LLM to balance local salience and global coherence without parameter tuning. On SumMe and TVSum, our method achieves F1 scores of 57.58 and 63.05, surpassing unsupervised and prior zero-shot baselines while approaching supervised performance. The results demonstrate that rubric-guided pseudo labeling effectively stabilizes LLM-based scoring and establishes a general, interpretable zero-shot paradigm for video summarization.
中文标题/摘要
标题:基于上下文感知的伪标签评分方法在零样本视频摘要中的应用
随着社交媒体、监控和教育平台上的视频内容迅速增长,高效地将长视频浓缩为简洁且语义忠实的替代品变得越来越重要。现有监督方法通过学习密集注释实现领域内较强的准确性,但面临高标注成本和跨数据集泛化能力有限的问题,而无监督方法虽然无需标注,但往往难以捕捉高层次的人类语义和细腻的叙事线索。最近,零样本提示流水线利用大型语言模型(LLMs)进行无训练的视频摘要,但仍然高度依赖手工制作的提示模板和数据集特定的评分标准化。为克服这些限制,我们提出了一种基于评分标准的伪标签提示框架,将一小部分真实标注转换为高置信度的伪标签,并将其聚合为结构化的、数据集适应性的评分标准,指导可解释的场景评估。在推理过程中,首尾段落仅根据其描述进行评分,而中间段落则结合相邻场景的简要上下文摘要来评估叙事进展和冗余。这种上下文提示使LLM能够在不调整参数的情况下平衡局部显著性和全局连贯性。在SumMe和TVSum上,我们的方法分别实现了F1分数的57.58和63.05,超越了无监督和先前的零样本基线,接近监督性能。结果表明,基于评分标准的伪标签有效稳定了LLM评分,并建立了视频摘要的一般、可解释的零样本范式。
Summary / 总结
The paper addresses the challenge of efficiently summarizing long videos while maintaining semantic accuracy. It proposes a rubric-guided, pseudo-labeled prompting framework that converts a small set of ground-truth annotations into high-confidence pseudo labels and aggregates them into dataset-adaptive scoring rubrics that guide a large language model (LLM) without any parameter tuning. First and last segments are scored from their descriptions alone, while intermediate segments also receive brief summaries of adjacent scenes. The method achieves F1 scores of 57.58 and 63.05 on SumMe and TVSum, respectively, surpassing unsupervised and prior zero-shot baselines while approaching supervised performance.
本文旨在高效地对长视频进行总结并保持语义准确性。提出了一种基于评分表的伪标签提示框架,将一小部分真实标注转换为高置信度的伪标签,用于指导场景评估。在推理过程中,框架根据描述对首尾段进行评分,并使用相邻场景的简要上下文总结来评估叙事进展。该方法在SumMe和TVSum上的F1分数分别为57.58和63.05,优于无监督和先前的零样本基线,并接近监督性能。
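The inference rule above (boundary segments scored from their own description, intermediate segments scored with neighbor context) can be sketched as a prompt builder. This is a hypothetical illustration: the function name `build_prompts` and the prompt wording are assumptions, and the paper's actual templates and rubrics are not given in the abstract.

```python
from typing import List


def build_prompts(descriptions: List[str], summaries: List[str]) -> List[str]:
    """Build one scoring prompt per scene.

    descriptions[i] is the full description of scene i and summaries[i]
    is a brief summary of the same scene, used only as neighbor context.
    """
    prompts = []
    last = len(descriptions) - 1
    for i, desc in enumerate(descriptions):
        if i == 0 or i == last:
            # First and last segments: description only.
            prompts.append(f"Scene: {desc}\nScore this scene using the rubric.")
        else:
            # Intermediate segments: add summaries of the adjacent scenes.
            prompts.append(
                f"Previous scene summary: {summaries[i - 1]}\n"
                f"Scene: {desc}\n"
                f"Next scene summary: {summaries[i + 1]}\n"
                "Score this scene using the rubric, considering "
                "narrative progression and redundancy."
            )
    return prompts


prompts = build_prompts(
    ["a dog runs", "the dog jumps", "the dog sleeps"],
    ["dog running", "dog jumping", "dog sleeping"],
)
print(prompts[1])
```

Passing short summaries rather than full neighbor descriptions keeps each prompt compact while still letting the model judge whether a scene repeats or advances what surrounds it.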
Nexus: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision
Authors: Che Liu, Yingji Zhang, Dong Zhang, Weijie Zhang, Chenggong Gong, Yu Lu, Shilin Zhou, Ziliang Gan, Ziao Wang, Haipang Wu, Ji Liu, André Freitas, Qifan Wang, Zenglin Xu, Rongjuncheng Zhang, Yong Dai
First: 2025-02-26T17:26:36+00:00 · Latest: 2025-10-20T12:05:19+00:00
Comments: Project: https://github.com/HiThink-Research/NEXUS-O
Abstract
This work proposes an industry-level omni-modal large language model (LLM) pipeline that integrates auditory, visual, and linguistic modalities to overcome challenges such as limited tri-modal datasets, high computational costs, and complex feature alignments. Our pipeline consists of three main components: First, a modular framework enabling flexible configuration of various encoder-LLM-decoder architectures. Second, a lightweight training strategy that pre-trains audio-language alignment on the state-of-the-art vision-language model Qwen2.5-VL, thus avoiding the costly pre-training of vision-specific modalities. Third, an audio synthesis pipeline that generates high-quality audio-text data from diverse real-world scenarios, supporting applications such as Automatic Speech Recognition and Speech-to-Speech chat. To this end, we introduce an industry-level omni-modal LLM, Nexus. Extensive experiments validate the efficacy of our pipeline, yielding the following key findings: (1) In the visual understanding task, Nexus exhibits superior performance compared with its backbone model, Qwen2.5-VL-7B, validating the efficiency of our training strategy. (2) Within the English Spoken Question-Answering task, the model achieves better accuracy than the same-period competitor (i.e., MiniCPM-o2.6-7B) in the LLaMA Q. benchmark. (3) In our real-world ASR test set, Nexus achieves outstanding performance, indicating its robustness in real scenarios. (4) In the Speech-to-Text Translation task, our model outperforms Qwen2-Audio-Instruct-7B. (5) In the Text-to-Speech task, based on a pretrained vocoder (e.g., Fishspeech1.4 or CosyVoice2.0), Nexus is comparable to its backbone vocoder on the Seed-TTS benchmark. (6) An in-depth analysis of tri-modal alignment reveals that incorporating the audio modality enhances representational alignment between vision and language.
中文标题/摘要
标题:Nexus:一种跨模态和交互式语言、音频和视觉模型
本研究提出了一种工业级的跨模态大型语言模型(LLM)流水线,整合了听觉、视觉和语言模态,以克服三模态数据集有限、高计算成本和复杂特征对齐等挑战。该流水线包含三个主要组件:首先,一个模块化框架,允许灵活配置各种编码器-LLM-解码器架构。其次,一种轻量级的训练策略,在最先进的视觉-语言模型Qwen2.5-VL上预训练音频-语言对齐,从而避免了视觉特定模态的昂贵预训练。第三,一个音频合成流水线,从各种真实场景生成高质量的音频-文本数据,支持自动语音识别和语音到语音聊天等应用。为此,我们引入了一种工业级的跨模态LLM,Nexus。广泛的实验验证了该流水线的有效性,得出以下关键发现:(1) 在视觉理解任务中,Nexus 的表现优于其骨干模型 Qwen2.5-VL-7B,验证了我们训练策略的有效性。(2) 在英语口语问答任务中,该模型在LLaMA Q. 基准测试中优于同期竞争对手(即MiniCPM-o2.6-7B)。(3) 在我们的实际ASR测试集中,Nexus 表现突出,表明其在实际场景中的鲁棒性。(4) 在语音到文本翻译任务中,我们的模型优于Qwen2-Audio-Instruct-7B。(5) 在文本到语音任务中,基于预训练的声码器(例如Fishspeech1.4或CosyVoice2.0),Nexus 在Seed-TTS基准测试中与其骨干声码器相当。(6) 对三模态对齐的深入分析表明,引入音频模态增强了视觉和语言之间的表示对齐。
Summary / 总结
This work introduces Nexus, an omni-modal large language model pipeline integrating auditory, visual, and linguistic modalities to address challenges like limited tri-modal datasets and high computational costs. The pipeline includes a modular framework, a lightweight training strategy, and an audio synthesis pipeline. Key experimental findings show that Nexus outperforms its backbone model Qwen2.5-VL-7B in visual understanding, achieves better accuracy than MiniCPM-o2.6-7B in English Spoken Question-Answering, demonstrates robustness on a real-world ASR test set, outperforms Qwen2-Audio-Instruct-7B in Speech-to-Text Translation, and is comparable to its backbone vocoder on the Seed-TTS Text-to-Speech benchmark.
该研究提出了Nexus,一种将听觉、视觉和语言模态整合的全模态大型语言模型管道。该管道包括模块化框架、轻量级训练策略和音频合成管道。关键实验发现包括:在视觉理解任务中优于其骨干模型Qwen2.5-VL-7B,在英语口语问答中准确率高于同期竞争对手MiniCPM-o2.6-7B,在真实场景ASR测试集上表现出色,在语音到文本翻译任务中优于Qwen2-Audio-Instruct-7B,并在文本到语音任务中与其骨干声码器相当。
Learning to Detect Unknown Jailbreak Attacks in Large Vision-Language Models
Authors: Shuang Liang, Zhihao Xu, Jialing Tao, Hui Xue, Xiting Wang
First: 2025-10-17T08:37:45+00:00 · Latest: 2025-10-20T11:50:13+00:00
Comments: Withdrawn due to an accidental duplicate submission. This paper (arXiv:2510.15430) was unintentionally submitted as a new entry instead of a new version of our previous work (arXiv:2508.09201)
Abstract
Despite extensive alignment efforts, Large Vision-Language Models (LVLMs) remain vulnerable to jailbreak attacks, posing serious safety risks. To address this, existing detection methods either learn attack-specific parameters, which hinders generalization to unseen attacks, or rely on heuristically sound principles, which limit accuracy and efficiency. To overcome these limitations, we propose Learning to Detect (LoD), a general framework that accurately detects unknown jailbreak attacks by shifting the focus from attack-specific learning to task-specific learning. This framework includes a Multi-modal Safety Concept Activation Vector module for safety-oriented representation learning and a Safety Pattern Auto-Encoder module for unsupervised attack classification. Extensive experiments show that our method achieves consistently higher detection AUROC on diverse unknown attacks while improving efficiency. The code is available at https://anonymous.4open.science/r/Learning-to-Detect-51CB.
中文标题/摘要
标题:学习检测大型视觉语言模型中的未知越狱攻击
尽管进行了广泛的对齐努力,大型视觉语言模型(LVLMs)仍然容易受到越狱攻击的影响,这带来了严重的安全风险。为了解决这一问题,现有的检测方法要么学习特定攻击的参数,这妨碍了对未见过攻击的泛化,要么依赖于经验主义的原则,这限制了准确性和效率。为克服这些限制,我们提出了学习检测(LoD),这是一种通用框架,通过将重点从特定攻击的学习转移到特定任务的学习,准确地检测未知的越狱攻击。该框架包括一个多模态安全概念激活向量模块,用于安全导向的表示学习,以及一个安全模式自编码器模块,用于无监督攻击分类。广泛的实验表明,我们的方法在多种未知攻击上的检测AUROC始终更高,同时提高了效率。代码可在https://anonymous.4open.science/r/Learning-to-Detect-51CB获取。
Summary / 总结
The research aims to address the vulnerability of Large Vision-Language Models (LVLMs) to jailbreak attacks, which pose significant safety risks. To improve detection accuracy and efficiency, the study proposes Learning to Detect (LoD), a framework that focuses on task-specific learning rather than attack-specific parameters. Experiments demonstrate that LoD achieves higher detection AUROC on various unknown attacks while enhancing efficiency. The code is available at https://anonymous.4open.science/r/Learning-to-Detect-51CB.
研究旨在提高大型视觉-语言模型对未知越狱攻击的检测能力,尽管这些模型在对齐努力后仍然容易受到此类攻击。提出的Learning to Detect (LoD)框架将重点从特定攻击的学习转移到特定任务的学习,使用多模态安全概念激活向量模块进行表示学习,以及安全模式自动编码器模块进行无监督攻击分类。实验表明,LoD在各种未知攻击上的检测AUROC更高,同时提高了效率。代码可在https://anonymous.4open.science/r/Learning-to-Detect-51CB获取。
Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing
Authors: Baode Wang, Biao Wu, Weizhen Li, Meng Fang, Zuming Huang, Jun Huang, Haozhe Wang, Yanjie Liang, Ling Chen, Wei Chu, Yuan Qi
First: 2025-06-01T15:19:52+00:00 · Latest: 2025-10-20T11:09:33+00:00
Comments: 16 pages, 12 figures
Abstract
Automated parsing of scanned documents into richly structured, machine-readable formats remains a critical bottleneck in Document AI, as traditional multi-stage pipelines suffer from error propagation and limited adaptability to diverse layouts. We introduce layoutRL, an end-to-end reinforcement learning framework that trains models to be explicitly layout-aware by optimizing a composite reward of normalized edit distance, paragraph count accuracy, and reading order preservation. Leveraging our newly released dataset, Infinity-Doc-55K, which combines 55K high-fidelity synthetic scanned document parsing data with expert-filtered real-world documents, we instantiate layoutRL in a vision-language-model-based parser called Infinity-Parser. Evaluated on English and Chinese benchmarks for OCR, table and formula extraction, and reading order detection, Infinity-Parser achieves new state-of-the-art performance in both accuracy and structural fidelity, outpacing specialist pipelines and general-purpose vision-language models. We will publicly release our code and dataset to accelerate progress in robust document understanding.
中文标题/摘要
标题:无限解析器:面向布局的强化学习扫描文档解析
自动将扫描文档解析为丰富结构化的、机器可读格式仍然是文档AI中的关键瓶颈,因为传统的多阶段管道会遭受错误传播和对多样化布局的有限适应性。我们引入了布局RL,这是一种端到端的强化学习框架,通过优化归一化编辑距离、段落计数准确性和阅读顺序保留的复合奖励来训练模型使其明确具有布局意识。利用我们新发布的数据集Infinity-Doc-55K,该数据集结合了55K高质量合成扫描文档解析数据和专家筛选的真实世界文档,我们在基于视觉-语言模型的解析器Infinity-Parser中实例化了布局RL。在OCR、表格和公式提取以及阅读顺序检测的英文和中文基准测试中,Infinity-Parser在准确性和结构保真度方面均达到了新的最佳性能,超越了专门的管道和通用的视觉-语言模型。我们将公开发布我们的代码和数据集,以加速稳健文档理解的进步。
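The composite reward named in the abstract above (normalized edit distance, paragraph count accuracy, reading order preservation) can be sketched as follows. This is a minimal illustration under stated assumptions: the equal weighting of the three terms, the exact-match rule used to locate paragraphs, and the function names are all hypothetical and not taken from the paper.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def layout_reward(pred: List[str], ref: List[str]) -> float:
    """Average of three scores in [0, 1]; higher is better."""
    pred_text, ref_text = "\n".join(pred), "\n".join(ref)
    longest = max(len(pred_text), len(ref_text), 1)
    # 1) Text similarity: one minus the normalized edit distance.
    text_score = 1 - edit_distance(pred_text, ref_text) / longest
    # 2) Paragraph count accuracy, clamped at zero.
    count_score = max(0.0, 1 - abs(len(pred) - len(ref)) / max(len(ref), 1))
    # 3) Reading order: share of matched paragraph pairs kept in order.
    positions = [pred.index(p) for p in ref if p in pred]
    pairs = [(i, j) for i in range(len(positions))
             for j in range(i + 1, len(positions))]
    if pairs:
        kept = sum(positions[i] < positions[j] for i, j in pairs)
        order_score = kept / len(pairs)
    else:
        order_score = 1.0 if positions else 0.0
    # ASSUMPTION: equal weights; the source does not give the weighting.
    return (text_score + count_score + order_score) / 3


perfect = layout_reward(["Title", "Body"], ["Title", "Body"])
swapped = layout_reward(["Body", "Title"], ["Title", "Body"])
print(perfect, swapped)
```

Scoring structure separately from text means a parse with the right characters but scrambled paragraphs is penalized, which a plain edit-distance reward would capture only indirectly.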
Leveraging Vision-Language Models for Open-Vocabulary Instance Segmentation and Tracking
Authors: Bastian Pätzold, Jan Nogga, Sven Behnke
First: 2025-03-18T20:18:42+00:00 · Latest: 2025-10-20T11:05:46+00:00
Comments: IEEE Robotics and Automation Letters (RA-L), November 2025
Abstract
Vision-language models (VLMs) excel in visual understanding but often lack reliable grounding capabilities and actionable inference rates. Integrating them with open-vocabulary object detection (OVD), instance segmentation, and tracking leverages their strengths while mitigating these drawbacks. We utilize VLM-generated structured descriptions to identify visible object instances, collect application-relevant attributes, and inform an open-vocabulary detector to extract corresponding bounding boxes that are passed to a video segmentation model providing segmentation masks and tracking. Once initialized, this model directly extracts segmentation masks, processing image streams in real time with minimal computational overhead. Tracks can be updated online as needed by generating new structured descriptions and detections. This combines the descriptive power of VLMs with the grounding capability of OVD and the pixel-level understanding and speed of video segmentation. Our evaluation across datasets and robotics platforms demonstrates the broad applicability of this approach, showcasing its ability to extract task-specific attributes from non-standard objects in dynamic environments. Code, data, videos, and benchmarks are available at https://vlm-gist.github.io
中文标题/摘要
标题:利用视觉语言模型进行开放词汇实例分割和跟踪
视觉语言模型(VLMs)在视觉理解方面表现出色,但往往缺乏可靠的语义关联能力和可操作的推理速度。将它们与开放词汇对象检测(OVD)、实例分割和跟踪结合使用,可以发挥其优势并缓解这些缺点。我们利用VLM生成的结构化描述来识别可见的对象实例,收集与应用相关的属性,并指导开放词汇检测器提取相应的边界框,这些边界框传递给一个提供分割掩码和跟踪的视频分割模型。一旦初始化,该模型可以直接提取分割掩码,实时处理图像流,且计算开销较小。跟踪可以根据需要在线更新,通过生成新的结构化描述和检测。这种方法结合了VLM的描述能力和OVD的语义关联能力,以及视频分割的像素级理解和速度。我们在多个数据集和机器人平台上的评估表明,该方法具有广泛的适用性,能够从动态环境中提取非标准对象的特定属性。代码、数据、视频和基准数据可在https://vlm-gist.github.io 获取。
Summary / 总结
The research addresses the unreliable grounding and slow inference of vision-language models (VLMs) by integrating them with open-vocabulary object detection, instance segmentation, and tracking. VLM-generated structured descriptions identify visible object instances and their application-relevant attributes, and inform an open-vocabulary detector that extracts bounding boxes; these are passed to a video segmentation model that provides segmentation masks and tracking. Once initialized, the segmentation model processes image streams in real time with minimal computational overhead, and tracks can be updated online as needed. Evaluation across datasets and robotics platforms demonstrates broad applicability, including extracting task-specific attributes from non-standard objects in dynamic environments. Code, data, videos, and benchmarks are available at https://vlm-gist.github.io.
研究旨在通过将视觉语言模型(VLMs)与开放词汇对象检测和实例分割相结合,增强其定位和推理能力。方法包括使用VLM生成的结构化描述来识别和提取物体的边界框,然后由视频分割模型提供分割掩码和跟踪。该方法在多个数据集和机器人平台上展示了广泛的应用性,有效从动态环境中的非标准物体中提取任务特定属性。关键发现包括实时处理能力的提升以及能够在线更新跟踪。评估表明该方法在动态环境中具有有效性。代码、数据、视频和基准数据可在https://vlm-gist.github.io 获取。
Diffusion Models as Dataset Distillation Priors
Authors: Duo Su, Huyu Wu, Huanran Chen, Yiming Shi, Yuzhu Wang, Xi Ye, Jun Zhu
First: 2025-10-20T11:04:09+00:00 · Latest: 2025-10-20T11:04:09+00:00
Abstract
Dataset distillation aims to synthesize compact yet informative datasets from large ones. A significant challenge in this field is achieving a trifecta of diversity, generalization, and representativeness in a single distilled dataset. Although recent generative dataset distillation methods adopt powerful diffusion models as their foundation models, the inherent representativeness prior in diffusion models is overlooked. Consequently, these approaches often necessitate the integration of external constraints to enhance data quality. To address this, we propose Diffusion As Priors (DAP), which formalizes representativeness by quantifying the similarity between synthetic and real data in feature space using a Mercer kernel. We then introduce this prior as guidance to steer the reverse diffusion process, enhancing the representativeness of distilled samples without any retraining. Extensive experiments on large-scale datasets, such as ImageNet-1K and its subsets, demonstrate that DAP outperforms state-of-the-art methods in generating high-fidelity datasets while achieving superior cross-architecture generalization. Our work not only establishes a theoretical connection between diffusion priors and the objectives of dataset distillation but also provides a practical, training-free framework for improving the quality of the distilled dataset.
中文标题/摘要
标题:扩散模型作为数据集蒸馏先验
数据集蒸馏旨在从大型数据集中合成紧凑且信息丰富的数据集。该领域的一个重大挑战是在单一蒸馏数据集中实现多样性、泛化能力和代表性三者的平衡。尽管最近的生成数据集蒸馏方法采用强大的扩散模型作为基础模型,但扩散模型固有的代表性先验被忽视了。因此,这些方法往往需要结合外部约束来提高数据质量。为了解决这一问题,我们提出了扩散作为先验(DAP),通过使用Mercer核量化合成数据和真实数据在特征空间中的相似性来正式化代表性。然后,我们将这一先验作为指导,引导逆向扩散过程,从而在无需重新训练的情况下增强蒸馏样本的代表性。在ImageNet-1K及其子集等大规模数据集上的广泛实验表明,DAP在生成高保真数据集方面优于最先进的方法,并且在跨架构泛化方面表现更优。我们的工作不仅建立了扩散先验与数据集蒸馏目标之间的理论联系,还提供了一种无需训练的实用框架,以提高蒸馏数据集的质量。
Summary / 总结
The paper addresses the challenge of synthesizing high-quality distilled datasets with diversity, generalization, and representativeness using diffusion models. It proposes Diffusion As Priors (DAP), which uses a Mercer kernel to quantify the similarity between synthetic and real data, guiding the reverse diffusion process to enhance representativeness. Experiments on ImageNet-1K and its subsets show that DAP outperforms existing methods in generating high-fidelity datasets with better cross-architecture generalization.
论文旨在利用扩散模型合成高质量、多样性和代表性的数据集。提出了Diffusion As Priors (DAP) 方法,通过Mercer核量化合成数据和真实数据之间的相似性,引导反向扩散过程以增强代表性。在ImageNet-1K及其子集上的实验表明,DAP 在生成高保真数据集方面优于现有方法,并且具有更好的跨架构泛化能力。
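The representativeness term described above, a kernel similarity between synthetic and real data in feature space, can be sketched with a Gaussian (RBF) kernel, which is one standard Mercer kernel. The choice of RBF, the `gamma` value, and the averaging over all pairs are assumptions for illustration; the abstract does not say which kernel DAP uses or how the guidance signal is derived from it.

```python
import math
from typing import Sequence


def rbf_kernel(u: Sequence[float], v: Sequence[float], gamma: float = 1.0) -> float:
    """Gaussian (RBF) kernel: exp(-gamma * ||u - v||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def representativeness(synthetic: Sequence[Sequence[float]],
                       real: Sequence[Sequence[float]],
                       gamma: float = 1.0) -> float:
    """Mean kernel similarity over all (synthetic, real) feature pairs."""
    total = sum(rbf_kernel(s, r, gamma) for s in synthetic for r in real)
    return total / (len(synthetic) * len(real))


real_features = [[0.0, 0.0], [0.2, 0.1]]
near = representativeness([[0.1, 0.0]], real_features)
far = representativeness([[3.0, 3.0]], real_features)
print(near, far)
```

Because a score of this kind is differentiable in the synthetic features, its gradient could in principle serve as a guidance signal during reverse diffusion, consistent with the training-free framing in the abstract.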
Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation
Authors: Ziming Wei, Bingqian Lin, Yunshuang Nie, Jiaqi Chen, Shikui Ma, Hang Xu, Xiaodan Liang
First: 2025-03-23T13:18:17+00:00 · Latest: 2025-10-20T10:19:21+00:00
Comments: Accepted by IEEE Transactions on Neural Networks and Learning Systems
Abstract
Data scarcity is a long-standing challenge in the Vision-Language Navigation (VLN) field, which severely hinders the generalization of agents to unseen environments. Previous works primarily rely on additional simulator data or web-collected images/videos to improve the generalization. However, the simulator environments still face limited diversity, and the web-collected data often requires extensive labor to remove the noise. In this paper, we propose a Rewriting-driven AugMentation (RAM) paradigm for VLN, which directly creates the unseen observation-instruction pairs via rewriting human-annotated training data. Benefiting from our rewriting mechanism, new observation-instruction pairs can be obtained in both simulator-free and labor-saving manners to promote generalization. Specifically, we first introduce Object-Enriched Observation Rewriting, where we combine Vision-Language Models (VLMs) and Large Language Models (LLMs) to derive rewritten object-enriched scene descriptions, enabling observation synthesis with diverse objects and spatial layouts via Text-to-Image Generation Models (T2IMs). Then, we propose Observation-Contrast Instruction Rewriting, which generates observation-aligned rewritten instructions by requiring LLMs to reason about the difference between original and new observations. We further develop a mixing-then-focusing training strategy with a random observation cropping scheme, effectively enhancing data distribution diversity while suppressing augmentation data noise during training. Experiments on both the discrete environments (R2R, REVERIE, and R4R datasets) and continuous environments (R2R-CE dataset) show the superior performance and impressive generalization ability of our method. Code is available at https://github.com/SaDil13/VLN-RAM.
中文标题/摘要
标题:从已见重写未见:使用基础模型增强视觉语言导航中的观察-指令重写
视觉语言导航(VLN)领域长期面临数据稀缺的问题,极大地限制了代理在未见环境中的泛化能力。以往的工作主要依赖额外的模拟器数据或网络收集的图像/视频来提高泛化能力。然而,模拟器环境仍然缺乏多样性,而网络收集的数据往往需要大量劳动来去除噪声。在本文中,我们提出了一种重写驱动的增强(RAM)范式,直接通过重写人类标注的训练数据来生成未见的观察-指令对。得益于我们的重写机制,新的观察-指令对可以在无需模拟器和节省劳动的情况下获得,从而促进泛化。具体而言,我们首先引入了对象增强的观察重写,其中结合了视觉语言模型(VLMs)和大型语言模型(LLMs),以推导出重写后对象丰富的场景描述,通过文本到图像生成模型(T2IMs)实现具有多样对象和空间布局的观察合成。然后,我们提出了观察对比指令重写,通过要求LLMs推理原始观察与新观察之间的差异来生成与观察对齐的重写指令。我们进一步开发了一种混合然后聚焦的训练策略,结合随机观察裁剪方案,有效增强了数据分布的多样性,同时在训练过程中抑制增强数据噪声。在离散环境(R2R、REVERIE和R4R数据集)和连续环境(R2R-CE数据集)上的实验表明,我们的方法具有优越的性能和令人印象深刻的泛化能力。代码可在https://github.com/SaDil13/VLN-RAM/获取。
Summary / 总结
This paper addresses the challenge of data scarcity in Vision-Language Navigation (VLN) by proposing a Rewriting-driven AugMentation (RAM) paradigm. The method uses Vision-Language Models (VLMs) and Large Language Models (LLMs) to rewrite human-annotated training data, generating unseen observation-instruction pairs in a simulator-free and labor-saving manner. The approach includes Object-Enriched Observation Rewriting and Observation-Contrast Instruction Rewriting, and employs a mixing-then-focusing training strategy to enhance data diversity and suppress noise. Experiments on various VLN datasets demonstrate the method's superior performance and strong generalization ability.
本文提出了一种重写驱动增强(RAM)范式,以解决视觉-语言导航(VLN)中的数据稀缺问题。该方法通过重写机制生成未见过的观察-指令对,无需依赖额外的模拟器数据或网络收集的图像/视频。方法包括对象增强的观察重写和观察对比指令重写,并采用混合-聚焦训练策略以提高数据多样性并减少噪声。实验结果表明,该方法在离散和连续环境中均表现出优越的性能和强大的泛化能力。
CReFT-CAD: Boosting Orthographic Projection Reasoning for CAD via Reinforcement Fine-Tuning
Authors: Ke Niu, Zhuofan Chen, Haiyang Yu, Yuwen Chen, Teng Fu, Mengyang Zhao, Bin Li, Xiangyang Xue
First: 2025-05-31T13:52:56+00:00 · Latest: 2025-10-20T10:16:50+00:00
Abstract
Computer-Aided Design (CAD) plays a pivotal role in industrial manufacturing. Orthographic projection reasoning underpins the entire CAD workflow, encompassing design, manufacturing, and simulation. However, prevailing deep-learning approaches employ standard 3D reconstruction pipelines as an alternative, which often introduce imprecise dimensions and limit the parametric editability required for CAD workflows. Recently, some researchers adopt vision-language models (VLMs), particularly supervised fine-tuning (SFT), to tackle CAD-related challenges. SFT shows promise but often devolves into pattern memorization, yielding poor out-of-distribution performance on complex reasoning tasks. To address these gaps, we introduce CReFT-CAD, a two-stage fine-tuning paradigm that first employs a curriculum-driven reinforcement learning stage with difficulty-aware rewards to build reasoning ability steadily, and then applies supervised post-tuning to hone instruction following and semantic extraction. Complementing this, we release TriView2CAD, the first large-scale, open-source benchmark for orthographic projection reasoning, comprising 200,000 synthetic and 3,000 real-world orthographic projections with precise dimension annotations and six interoperable data modalities. We benchmark leading VLMs on orthographic projection reasoning and demonstrate that CReFT-CAD substantially improves reasoning accuracy and out-of-distribution generalizability in real-world scenarios, offering valuable insights for advancing CAD reasoning research.
中文标题/摘要
标题:CReFT-CAD:通过强化微调增强面向CAD的正投影推理
计算机辅助设计(CAD)在工业制造中发挥着关键作用。正投影推理贯穿整个CAD工作流程,包括设计、制造和仿真。然而,现有的深度学习方法通常采用标准的3D重建管道作为替代方案,这往往引入不精确的尺寸并限制了CAD工作流程所需的参数可编辑性。最近,一些研究人员采用视觉-语言模型(VLMs),特别是监督微调(SFT),来应对CAD相关的挑战。SFT显示出潜力,但往往退化为模式记忆,导致在复杂推理任务上的分布外性能较差。为了解决这些差距,我们引入了CReFT-CAD,这是一种两阶段微调范式,首先采用基于课程的强化学习阶段,带有难度感知的奖励,逐步建立推理能力,然后应用监督后微调以提高指令遵循和语义提取。此外,我们发布了TriView2CAD,这是首个大规模、开源的正投影推理基准,包含200,000个合成和3,000个真实世界的正投影,具有精确的尺寸注释和六种互操作的数据模态。我们对领先VLMs进行了正投影推理基准测试,并证明CReFT-CAD在实际场景中的推理准确性和分布外泛化能力显著提高,为推进CAD推理研究提供了宝贵的见解。
Summary / 总结
CReFT-CAD is a two-stage fine-tuning method that enhances orthographic projection reasoning in CAD by using a curriculum-driven reinforcement learning stage followed by supervised post-tuning. This approach improves reasoning accuracy and out-of-distribution generalizability compared to standard 3D reconstruction pipelines and supervised fine-tuning. The method is benchmarked against leading vision-language models, showing significant improvements in real-world scenarios.
CReFT-CAD 是一种两阶段微调方法,通过使用基于课程的强化学习阶段,随后进行监督后调优来增强 CAD 中的正投影推理。这种方法在推理准确性和分布外泛化能力方面优于标准的 3D 重建管道和监督微调。该方法与领先的视觉语言模型进行了基准对比,在实际场景中显示出显著改进。
Recurrent Attention-based Token Selection for Efficient Streaming Video-LLMs
Authors: Vaggelis Dorovatas, Soroush Seifi, Gunshi Gupta, Rahaf Aljundi
Venue: NeurIPS 2025
First: 2025-10-20T10:04:49+00:00 · Latest: 2025-10-20T10:04:49+00:00
Comments: NeurIPS 2025
Abstract
Video Large Language Models (Video-LLMs) excel at understanding videos in-context, provided they have full access to the video when answering queries. However, these models face challenges in streaming scenarios where hour-long videos must be processed online, and questions need timely responses. In this work, we propose a training-free approach compatible with standard Video-LLMs, leveraging three key concepts: 1) LLM-informed selection of visual tokens to identify those that the LLM has attended to and contributed to its understanding of each short clip. Our attention-based selection allows us to discard up to ~95% of unimportant visual tokens with minimal performance loss; 2) Recurrent processing of past selected tokens to generate temporally coherent understanding of each processed clip; 3) Caption-based question answering for lightweight and accurate responses. Our method achieves state-of-the-art performance on streaming video benchmarks, striking a balance between efficiency and effectiveness.
中文标题/摘要
标题:基于循环注意力的标记选择以提高流式视频-LLM的效率
视频大型语言模型(Video-LLMs)在上下文中理解视频方面表现出色,前提是它们在回答查询时可以访问整个视频。然而,在流式处理场景中,这些模型面临挑战,因为需要在线处理长达数小时的视频,并且需要及时响应问题。在本工作中,我们提出了一种无需训练的方法,适用于标准的Video-LLMs,利用三个关键概念:1)LLM指导的视觉标记选择,以识别LLM已关注并对其理解每个短片段做出贡献的标记。基于注意力的选择使我们能够丢弃多达约95%的不重要视觉标记,同时性能损失最小;2)对过去选择的标记进行循环处理,以生成每个处理片段在时间上连贯的理解;3)基于字幕的问题回答,实现轻量级和准确的响应。我们的方法在流式视频基准测试中达到了最先进的性能,实现了效率和效果之间的平衡。
Summary / 总结
This work addresses the challenge of processing long videos in real-time for Video-LLMs by proposing a training-free approach. It involves LLM-informed token selection to focus on important visual tokens, recurrent processing of selected tokens for coherent understanding, and caption-based question answering. The method achieves state-of-the-art performance on streaming video benchmarks with high efficiency and effectiveness, discarding up to 95% of unimportant tokens with minimal performance loss.
该研究提出了一种无需训练的方法,以解决实时处理长视频的挑战。方法包括LLM指导的标记选择以聚焦重要视觉标记、对选定标记的循环处理以获得连贯的理解,以及基于字幕的问题回答。该方法在流式视频基准测试中达到了最先进的性能,具有高效率和有效性,同时丢弃了高达95%的不重要标记且性能损失很小。
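The attention-based token selection described in this entry can be illustrated with a minimal sketch. It assumes a hypothetical per-token score (for example, the attention the LLM paid to each visual token, averaged over heads and layers) and simply keeps the top fraction. The function name, the `keep_ratio` parameter, and the way scores are obtained are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence


def select_tokens_by_attention(attn: Sequence[float], keep_ratio: float = 0.05) -> List[int]:
    """Return indices of the visual tokens with the highest attention scores.

    attn[i] is a hypothetical score for visual token i (an assumption of this
    sketch), e.g. attention from the LLM averaged over heads and layers.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, round(len(attn) * keep_ratio))
    # Rank token indices by score, highest first, then keep the top n_keep.
    ranked = sorted(range(len(attn)), key=lambda i: attn[i], reverse=True)
    # Restore the original order so kept tokens stay in sequence.
    return sorted(ranked[:n_keep])


# Keep half of four tokens: the two with the largest scores.
kept = select_tokens_by_attention([0.1, 0.9, 0.3, 0.8], keep_ratio=0.5)
```

With the default `keep_ratio=0.05`, roughly 95% of tokens are discarded, which mirrors the figure quoted in the abstract; the recurrent processing and caption-based answering steps are not covered by this sketch.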
Towards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation
Authors: Chenghao Zhang, Guanting Dong, Xinyu Yang, Zhicheng Dou
First: 2025-10-20T09:56:43+00:00 · Latest: 2025-10-20T09:56:43+00:00
Comments: This work is in progress
Abstract
Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing large language models (LLMs) by retrieving relevant documents from an external corpus. However, existing RAG systems primarily focus on unimodal text documents, and often fall short in real-world scenarios where both queries and documents may contain mixed modalities (such as text and images). In this paper, we address the challenge of Universal Retrieval-Augmented Generation (URAG), which involves retrieving and reasoning over mixed-modal information to improve vision-language generation. To this end, we propose Nyx, a unified mixed-modal to mixed-modal retriever tailored for URAG scenarios. To mitigate the scarcity of realistic mixed-modal data, we introduce a four-stage automated pipeline for generation and filtering, leveraging web documents to construct NyxQA, a dataset comprising diverse mixed-modal question-answer pairs that better reflect real-world information needs. Building on this high-quality dataset, we adopt a two-stage training framework for Nyx: we first perform pre-training on NyxQA along with a variety of open-source retrieval datasets, followed by supervised fine-tuning using feedback from downstream vision-language models (VLMs) to align retrieval outputs with generative preferences. Experimental results demonstrate that Nyx not only performs competitively on standard text-only RAG benchmarks, but also excels in the more general and realistic URAG setting, significantly improving generation quality in vision-language tasks.
中文标题/摘要
标题:迈向面向通用检索增强生成的混合模态检索
检索增强生成(RAG)已成为通过从外部语料库检索相关文档来增强大型语言模型(LLMs)的强大范式。然而,现有的RAG系统主要关注单一模态的文本文档,并且在查询和文档可能包含混合模态(如文本和图像)的现实场景中往往表现不佳。本文旨在解决通用检索增强生成(URAG)的挑战,即检索和推理混合模态信息以提高视觉语言生成。为此,我们提出了Nyx,一种针对URAG场景的统一的混合模态到混合模态检索器。为了解决现实中的混合模态数据稀缺问题,我们引入了一个四阶段的自动化生成和过滤管道,利用网络文档构建了NyxQA数据集,该数据集包含多样化的混合模态问题-答案对,更好地反映了实际信息需求。基于此高质量数据集,我们采用两阶段训练框架对Nyx进行训练:首先在NyxQA和多种开源检索数据集上进行预训练,然后使用下游视觉语言模型(VLMs)的反馈进行监督微调,以使检索输出与生成偏好对齐。实验结果表明,Nyx不仅在标准的纯文本RAG基准测试中具有竞争力,而且在更通用和现实的URAG设置中表现出色,显著提高了视觉语言任务中的生成质量。
Summary / 总结
This paper addresses the challenge of Universal Retrieval-Augmented Generation (URAG) by proposing Nyx, a mixed-modal retriever designed for scenarios involving both text and images. Nyx is trained using a two-stage framework: pre-training on a dataset of diverse mixed-modal question-answer pairs (NyxQA) and fine-tuning with feedback from vision-language models. The results show that Nyx performs well on standard text-only RAG benchmarks and excels in the more general URAG setting, enhancing generation quality in vision-language tasks.
本文提出了一种统一的混合模态检索器Nyx,以解决通用检索增强生成(URAG)的挑战。Nyx基于一个新构建的包含多种混合模态问题-答案对的数据集NyxQA进行训练。训练框架包括预训练和监督微调两个阶段。实验结果表明,Nyx不仅在标准的纯文本RAG基准测试中表现良好,还在更通用和现实的URAG设置中表现出色,提高了视觉语言任务中的生成质量。
DynVFX: Augmenting Real Videos with Dynamic Content
Authors: Danah Yatim, Rafail Fridman, Omer Bar-Tal, Tali Dekel
First: 2025-02-05T21:14:55+00:00 · Latest: 2025-10-20T09:26:34+00:00
Comments: Project page: https://dynvfx.github.io
Abstract
We present a method for augmenting real-world videos with newly generated dynamic content. Given an input video and a simple user-provided text instruction describing the desired content, our method synthesizes dynamic objects or complex scene effects that naturally interact with the existing scene over time. The position, appearance, and motion of the new content are seamlessly integrated into the original footage while accounting for camera motion, occlusions, and interactions with other dynamic objects in the scene, resulting in a cohesive and realistic output video. We achieve this via a zero-shot, training-free framework that harnesses a pre-trained text-to-video diffusion transformer to synthesize the new content and a pre-trained vision-language model to envision the augmented scene in detail. Specifically, we introduce a novel inference-based method that manipulates features within the attention mechanism, enabling accurate localization and seamless integration of the new content while preserving the integrity of the original scene. Our method is fully automated, requiring only a simple user instruction. We demonstrate its effectiveness on a wide range of edits applied to real-world videos, encompassing diverse objects and scenarios involving both camera and object motion.
中文标题/摘要
标题:DynVFX:为真实视频添加动态内容
我们提出了一种方法,用于在真实世界视频中添加新生成的动态内容。给定一个输入视频和一个简单的用户提供的文本说明,我们的方法会合成与现有场景自然交互的动态对象或复杂的场景效果。新内容的位置、外观和运动无缝地融入原始画面,同时考虑到摄像机运动、遮挡和其他动态对象的交互,从而产生连贯且逼真的输出视频。我们通过一个无需训练的零样本框架实现这一点,该框架利用预训练的文本到视频扩散变换器生成新内容,并利用预训练的视觉语言模型详细构想增强后的场景。具体而言,我们引入了一种新颖的基于推理的方法,通过在注意力机制中操作特征,实现准确的定位和无缝集成新内容,同时保持原始场景的完整性。我们的方法完全自动化,只需简单的用户指令即可。我们展示了其在各种应用于真实世界视频的编辑中的有效性,涵盖了各种各样的对象和场景,涉及摄像机和物体运动。
Summary / 总结
The research aims to augment real-world videos with dynamic content based on user instructions. It uses a zero-shot, training-free framework combining a text-to-video diffusion transformer and a vision-language model. The method synthesizes new dynamic objects or scene effects that interact naturally with the existing scene, accounting for camera motion and occlusions. Key findings include seamless integration of new content into the original footage, maintaining the integrity of the original scene and producing cohesive and realistic output videos.
研究旨在根据用户指令为真实世界视频添加动态内容。它使用一种零样本、无需训练的框架,结合文本到视频的扩散变换器和视觉语言模型。该方法生成新的动态对象或场景效果,使其自然地与现有场景互动,同时考虑摄像机运动和遮挡。主要发现包括新内容能够无缝融入原始视频,保持原始场景的完整性,并在各种编辑和场景中实现连贯和逼真的输出视频。
Auto-Rubric: Learning to Extract Generalizable Criteria for Reward Modeling
Authors: Lipeng Xie, Sen Huang, Zhuo Zhang, Anni Zou, Yunpeng Zhai, Dingchao Ren, Kezun Zhang, Haoyuan Hu, Boyin Liu, Haoran Chen, Zhaoyang Liu, Bolin Ding
First: 2025-10-20T09:01:37+00:00 · Latest: 2025-10-20T09:01:37+00:00
Abstract
Reward models are essential for aligning Large Language Models (LLMs) with human values, yet their development is hampered by costly preference datasets and poor interpretability. While recent rubric-based approaches offer transparency, they often lack systematic quality control and optimization, creating a trade-off between scalability and reliability. We address these limitations with a novel, training-free framework built on a key assumption: evaluation rubrics underlying human preferences exhibit significant generalization ability across diverse queries, a property that enables remarkable data efficiency. Our two-stage approach first infers high-quality, query-specific rubrics using a validation-guided Propose-Evaluate-Revise pipeline. Second, it generalizes these granular rubrics into a compact, non-redundant core set by maximizing an information-theoretic coding rate. The final output is an interpretable, hierarchical "Theme-Tips" rubric set. Extensive experiments demonstrate the framework's exceptional data efficiency and performance. Critically, using just 70 preference pairs (1.5% of the source data), our method also empowers smaller models like Qwen3-8B to outperform specialized, fully-trained counterparts. This work pioneers a scalable, interpretable, and data-efficient path for reward modeling.
中文标题/摘要
标题:自动评分标准:学习提取可泛化的奖励建模准则
奖励模型对于使大型语言模型(LLMs)与人类价值观对齐至关重要,但其开发受到昂贵的偏好数据集和较差的可解释性的阻碍。尽管最近的基于评分表的方法提供了透明度,但它们往往缺乏系统性的质量控制和优化,从而在可扩展性和可靠性之间造成了权衡。我们通过一个新颖的、无需训练的框架解决了这些限制,该框架基于一个关键假设:人类偏好背后的评估评分表在多种查询中表现出显著的泛化能力,这一特性使得数据效率显著提高。我们的两阶段方法首先使用验证引导的“提出-评估-修订”(Propose-Evaluate-Revise)管道推断高质量、查询特定的评分表。其次,通过最大化信息论编码率将这些细粒度的评分表泛化为一个紧凑且无冗余的核心集。最终输出是一个可解释的、分层的“主题-提示”评分表集。广泛的实验表明,该框架具有出色的数据效率和性能。关键的是,仅使用70对偏好数据(源数据的1.5%),我们的方法也使较小的模型如Qwen3-8B能够超越专门的、完全训练的对应模型。这项工作开创了一条可扩展、可解释且数据高效的奖励建模路径。
Summary / 总结
The paper addresses the challenges of developing reward models for aligning Large Language Models with human values, focusing on the limitations of costly datasets and poor interpretability. It introduces a training-free framework that infers query-specific rubrics through a Propose-Evaluate-Revise pipeline and then generalizes them into a compact core set. Experiments show that this method is highly data-efficient, requiring only 70 preference pairs to outperform specialized, fully-trained models. The final output is an interpretable, hierarchical rubric set that enhances the scalability and interpretability of reward modeling.
论文旨在解决开发用于使大型语言模型与人类价值观对齐的奖励模型时面临的挑战,重点关注成本高昂的数据集和较差的可解释性。它提出了一种无需训练的框架,通过Propose-Evaluate-Revise管道推断出查询特定的评分标准,然后将其概括为一个紧凑的核心集。实验表明,该方法具有高度的数据效率,仅需70个偏好对就能超越专门的、完全训练的模型。最终输出的是一个可解释的、分层的评分标准集,增强了奖励模型的可扩展性和可解释性。
Disentanglement Beyond Static vs. Dynamic: A Benchmark and Evaluation Framework for Multi-Factor Sequential Representations
Authors: Tal Barami, Nimrod Berman, Ilan Naiman, Amos H. Hason, Rotem Ezra, Omri Azencot
First: 2025-10-20T08:58:23+00:00 · Latest: 2025-10-20T08:58:23+00:00
Abstract
Learning disentangled representations in sequential data is a key goal in deep learning, with broad applications in vision, audio, and time series. While real-world data involves multiple interacting semantic factors over time, prior work has mostly focused on simpler two-factor static and dynamic settings, primarily because such settings make data collection easier, thereby overlooking the inherently multi-factor nature of real-world data. We introduce the first standardized benchmark for evaluating multi-factor sequential disentanglement across six diverse datasets spanning video, audio, and time series. Our benchmark includes modular tools for dataset integration, model development, and evaluation metrics tailored to multi-factor analysis. We additionally propose a post-hoc Latent Exploration Stage to automatically align latent dimensions with semantic factors, and introduce a Koopman-inspired model that achieves state-of-the-art results. Moreover, we show that Vision-Language Models can automate dataset annotation and serve as zero-shot disentanglement evaluators, removing the need for manual labels and human intervention. Together, these contributions provide a robust and scalable foundation for advancing multi-factor sequential disentanglement.
中文标题/摘要
标题:超越静态与动态:多因素序列表示的去纠缠基准与评估框架
在序列数据中学习去纠缠表示是深度学习中的一个关键目标,具有广泛的应用前景,特别是在视觉、音频和时间序列领域。尽管现实世界的数据包含多个相互作用的语义因素,但先前的工作主要集中在更简单的两因素静态和动态设置上,主要是因为这样的设置使得数据收集更加容易,从而忽视了现实世界数据的多因素本质。我们提出了第一个标准化基准,用于在六个涵盖视频、音频和时间序列的多样数据集上评估多因素序列去纠缠。该基准包括用于数据集集成、模型开发和多因素分析定制评估指标的模块化工具。我们还提出了一种后验潜空间探索阶段,以自动对齐潜空间维度与语义因素,并引入了一种基于Koopman的模型,实现了最先进的结果。此外,我们展示了视觉语言模型可以自动进行数据集注释,并作为零样本去纠缠评估器,从而消除手动标签和人工干预的需要。这些贡献共同为多因素序列去纠缠的发展提供了稳健且可扩展的基础。
FineVision: Open Data Is All You Need
Authors: Luis Wiedmann, Orr Zohar, Amir Mahla, Xiaohan Wang, Rui Li, Thibaud Frere, Leandro von Werra, Aritra Roy Gosthipaty, Andrés Marafioti
First: 2025-10-20T07:54:46+00:00 · Latest: 2025-10-20T07:54:46+00:00
Abstract
The advancement of vision-language models (VLMs) is hampered by a fragmented landscape of inconsistent and contaminated public datasets. We introduce FineVision, a meticulously collected, curated, and unified corpus of 24 million samples - the largest open resource of its kind. We unify more than 200 sources into 185 subsets via a semi-automated, human-in-the-loop pipeline: automation performs bulk ingestion and schema mapping, while reviewers audit mappings and spot-check outputs to verify faithful consumption of annotations, appropriate formatting and diversity, and safety; issues trigger targeted fixes and re-runs. The workflow further applies rigorous de-duplication within and across sources and decontamination against 66 public benchmarks. FineVision also encompasses agentic/GUI tasks with a unified action space; reviewers validate schemas and inspect a sample of trajectories to confirm executable fidelity. Models trained on FineVision consistently outperform those trained on existing open mixtures across a broad evaluation suite, underscoring the benefits of scale, data hygiene, and balanced automation with human oversight. We release the corpus and curation tools to accelerate data-centric VLM research.
中文标题/摘要
标题:FineVision:开放数据就是你所需的一切
视觉-语言模型(VLMs)的进步受到公共数据集格局碎片化、不一致且受污染的阻碍。我们引入了FineVision,这是一个精心收集、整理和统一的包含2400万样本的语料库,也是此类最大的开放资源。我们通过半自动、人工在环的管道将超过200个来源统一成185个子集:自动化执行批量导入和模式映射,而审查员审核映射并抽查输出,以验证注释得到忠实使用、格式与多样性恰当以及安全性;问题触发针对性修复和重新运行。该工作流进一步在来源内部和跨来源进行严格的去重,并针对66个公共基准进行去污染。FineVision 还包括具有统一动作空间的代理/GUI 任务;审查员验证模式并检查轨迹样本以确认可执行性。在广泛评估套件中,基于FineVision训练的模型始终优于现有开放混合数据集训练的模型,突显了规模、数据卫生和平衡自动化与人工监督的好处。我们发布语料库和整理工具以加速以数据为中心的VLM研究。
Summary / 总结
FineVision addresses the fragmented and contaminated landscape of public vision-language datasets by creating a unified corpus of 24 million samples through a semi-automated, human-in-the-loop pipeline. The dataset is rigorously de-duplicated and decontaminated, and it includes agentic/GUI tasks. Models trained on FineVision outperform those trained on existing open datasets across various evaluations, highlighting the importance of scale, data hygiene, and balanced automation with human oversight. The corpus and curation tools are released to promote data-centric VLM research.
FineVision 是一个包含 2400 万样本的大规模数据集,通过半自动化流程整合了超过 200 个来源,并经过严格的去重和去污染处理。该数据集包括代理/GUI 任务,并且模型在 FineVision 上训练后在多种评估中表现优于现有开放混合数据集,突显了数据规模、数据质量和平衡自动化与人工监督的重要性。
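The de-duplication and benchmark decontamination steps mentioned in this entry can be illustrated with a small sketch. The abstract does not say how FineVision detects duplicates, so this example swaps in a deliberately simple stand-in: exact matching on a hash of normalized text. The function names and the normalization rule are assumptions made for illustration only.

```python
import hashlib
from typing import Iterable, List, Set


def fingerprint(text: str) -> str:
    """Hash a whitespace- and case-normalized copy of the text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedup_and_decontaminate(samples: Iterable[str], benchmark_items: Iterable[str]) -> List[str]:
    """Keep the first copy of each sample and drop any that match a benchmark item."""
    blocked: Set[str] = {fingerprint(b) for b in benchmark_items}
    seen: Set[str] = set()
    kept: List[str] = []
    for sample in samples:
        fp = fingerprint(sample)
        if fp in blocked or fp in seen:
            continue  # duplicate of an earlier sample, or overlaps a benchmark
        seen.add(fp)
        kept.append(sample)
    return kept


clean = dedup_and_decontaminate(["A cat", "a  cat", "A dog", "Test Q"], ["test q"])
```

A real pipeline for image-text data would likely need near-duplicate detection on images as well; exact text hashing only catches the simplest overlaps.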
Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs
Authors: Hao Fang, Changle Zhou, Jiawei Kong, Kuofeng Gao, Bin Chen, Tao Liang, Guojun Ma, Shu-Tao Xia
First: 2025-05-26T08:36:10+00:00 · Latest: 2025-10-20T06:44:03+00:00
Abstract
Large Vision-Language Models (LVLMs) are susceptible to hallucinations, where generated responses seem semantically plausible yet exhibit little or no relevance to the input image. Previous studies reveal that this issue primarily stems from LVLMs' over-reliance on language priors while disregarding the visual information during decoding. To alleviate this issue, we introduce a novel Conditional Pointwise Mutual Information (C-PMI) calibrated decoding strategy, which adaptively strengthens the mutual dependency between generated texts and input images to mitigate hallucinations. Unlike existing methods solely focusing on text token sampling, we propose to jointly model the contributions of visual and textual tokens to C-PMI, formulating hallucination mitigation as a bi-level optimization problem aimed at maximizing mutual information. To solve it, we design a token purification mechanism that dynamically regulates the decoding process by sampling text tokens remaining maximally relevant to the given image, while simultaneously refining image tokens most pertinent to the generated response. Extensive experiments across various benchmarks reveal that the proposed method significantly reduces hallucinations in LVLMs while preserving decoding efficiency.
中文标题/摘要
标题:视觉引导语言:一种条件互信息校准解码策略以减少LVLM中的幻觉
大型视觉-语言模型(LVLMs)容易出现幻觉现象,即生成的响应在语义上看似合理,但实际上与输入图像几乎没有关联。先前的研究表明,这一问题主要源于LVLMs过度依赖语言先验,而在解码过程中忽略了视觉信息。为了解决这一问题,我们提出了一种新颖的条件点互信息(C-PMI)校准解码策略,该策略能够自适应地增强生成文本与输入图像之间的相互依赖性,从而减轻幻觉现象。与现有方法仅关注文本词元采样不同,我们提出了一种同时建模视觉和文本词元对C-PMI贡献的方法,将幻觉缓解问题表述为一个双层优化问题,旨在最大化互信息。为了解决这一问题,我们设计了一种词元净化机制,该机制通过动态调节解码过程来采样与给定图像最相关的文本词元,同时不断优化与生成响应最相关的图像词元。在各种基准上的广泛实验表明,所提出的方法在显著减少LVLM中的幻觉现象的同时,保持了解码效率。
Summary / 总结
The paper addresses the issue of hallucinations in Large Vision-Language Models (LVLMs), where generated responses are semantically plausible but irrelevant to the input image. It introduces a Conditional Pointwise Mutual Information (C-PMI) calibrated decoding strategy that enhances the mutual dependency between generated texts and input images. The method formulates hallucination mitigation as a bi-level optimization problem and includes a token purification mechanism that dynamically regulates the decoding process to reduce hallucinations while maintaining efficiency.
该论文通过提出一种条件点互信息(C-PMI)校准解码策略来解决大型视觉语言模型(LVLMs)中的幻觉问题。该方法增强了生成文本与输入图像之间的相互依赖性,将其形式化为一个双层优化问题,以最大化互信息。实验结果表明,这种方法有效减少了幻觉现象,同时保持了解码效率。
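The quantity at the center of this entry, conditional pointwise mutual information, has a standard definition: for a candidate token y, image v, and text context x, PMI(y; v | x) = log p(y | v, x) - log p(y | x). The sketch below only computes that definition from two hypothetical next-token distributions (one conditioned on the image, one text-only). It does not reproduce the paper's calibrated decoding, bi-level optimization, or token purification, and how the text-only distribution is obtained is an assumption.

```python
import math
from typing import Dict


def conditional_pmi(p_with_image: Dict[str, float], p_text_only: Dict[str, float]) -> Dict[str, float]:
    """Compute log p(y | v, x) - log p(y | x) for each candidate token y."""
    return {
        tok: math.log(p_with_image[tok]) - math.log(p_text_only[tok])
        for tok in p_with_image
    }


# Hypothetical distributions over two candidate tokens.
scores = conditional_pmi(
    {"dog": 0.6, "cat": 0.4},  # model sees the image
    {"dog": 0.3, "cat": 0.7},  # language prior only
)
```

A positive score means the image makes the token more likely than the language prior alone; a negative score flags tokens favored mainly by the prior, which is the behavior the abstract associates with hallucination.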