Coupled Diffusion Sampling for Training-Free Multi-View Image Editing
Authors: Hadi Alzayer, Yunzhi Zhang, Chen Geng, Jia-Bin Huang, Jiajun Wu
First: 2025-10-16T17:59:59+00:00 · Latest: 2025-10-16T17:59:59+00:00
Comments: Project page: https://coupled-diffusion.github.io
Abstract
We present an inference-time diffusion sampling method to perform multi-view
consistent image editing using pre-trained 2D image editing models. These
models can independently produce high-quality edits for each image in a set of
multi-view images of a 3D scene or object, but they do not maintain consistency
across views. Existing approaches typically address this by optimizing over
explicit 3D representations, but they suffer from a lengthy optimization
process and instability under sparse view settings. We propose an implicit 3D
regularization approach by constraining the generated 2D image sequences to
adhere to a pre-trained multi-view image distribution. This is achieved through
coupled diffusion sampling, a simple diffusion sampling technique that
concurrently samples two trajectories from both a multi-view image distribution
and a 2D edited image distribution, using a coupling term to enforce the
multi-view consistency among the generated images. We validate the
effectiveness and generality of this framework on three distinct multi-view
image editing tasks, demonstrating its applicability across various model
architectures and highlighting its potential as a general solution for
multi-view consistent editing.
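The coupling idea described in the abstract can be illustrated with a toy sketch. The abstract does not specify the coupling term or the sampler, so everything below is an assumption for illustration: two Langevin-style trajectories, each following its own score function, with a hypothetical quadratic coupling that pulls the two samples toward each other. The names `coupled_langevin`, `gaussian_score`, and `coupling` are not from the paper.

```python
import random


def coupled_langevin(score_a, score_b, dim, coupling=2.0, steps=400,
                     step_size=0.05, noise_scale=0.0, seed=0):
    """Run two coupled trajectories and return their final states.

    score_a / score_b return the score (gradient of the log-density) of each
    distribution at a point. The quadratic coupling is an assumption made for
    illustration, not the paper's coupling term.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # trajectory under model A
    y = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # trajectory under model B
    for _ in range(steps):
        grad_x, grad_y = score_a(x), score_b(y)
        new_x, new_y = [], []
        for xi, yi, gx, gy in zip(x, y, grad_x, grad_y):
            # Each sample follows its own score and is pulled toward the other.
            new_x.append(xi + step_size * (gx - coupling * (xi - yi))
                         + noise_scale * rng.gauss(0.0, 1.0))
            new_y.append(yi + step_size * (gy - coupling * (yi - xi))
                         + noise_scale * rng.gauss(0.0, 1.0))
        x, y = new_x, new_y
    return x, y


def gaussian_score(mean):
    """Score of a unit-variance Gaussian centred at `mean`."""
    return lambda point: [m - p for m, p in zip(mean, point)]
```

With two Gaussians centred at 2 and -2, running `coupled_langevin` with `coupling=0.0` leaves the two samples at their separate modes, while a positive `coupling` shrinks the gap between them. In this toy the coupling strength trades agreement between the trajectories against fidelity to each distribution.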
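Chinese title/abstract (translated to English)
Title: Coupled Diffusion Sampling for Training-Free Multi-View Image Editing
We propose an inference-time diffusion sampling method that uses pre-trained 2D image editing models to perform multi-view consistent editing over a set of multi-view images. These models can independently produce high-quality edits for each image of a multi-view scene or object, but they cannot maintain consistency across views. Existing methods usually address this by optimizing an explicit 3D representation, but they suffer from a lengthy optimization process and instability under sparse-view settings. We propose an implicit 3D regularization approach that constrains the generated 2D image sequence to follow a pre-trained multi-view image distribution. This is achieved through coupled diffusion sampling, a simple diffusion sampling technique that simultaneously samples two trajectories, one from the multi-view image distribution and one from the 2D edited image distribution, and uses a coupling term to enforce multi-view consistency among the generated images. We validate the effectiveness and generality of the framework on three distinct multi-view image editing tasks, showing its applicability across model architectures and its potential as a general solution for multi-view consistent editing.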
Summary / 总结
The research aims to address the inconsistency issue in multi-view image editing by proposing a training-free method using coupled diffusion sampling. This method allows for high-quality edits in each view of a 3D scene while maintaining consistency across views. The approach constrains the generated 2D images to follow a pre-trained multi-view distribution, ensuring consistency through a coupling term. Experiments on three distinct tasks show the method's effectiveness and general applicability across different model architectures.
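The paper proposes a method for multi-view image editing with pre-trained 2D image editing models. It applies coupled diffusion sampling to achieve multi-view consistency without explicit 3D optimization, which makes it faster and more stable under sparse-view settings. The method is validated on three different editing tasks, demonstrating its effectiveness and generality across model architectures.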
From Pixels to Words -- Towards Native Vision-Language Primitives at Scale
Authors: Haiwen Diao, Mingxuan Li, Silei Wu, Linjun Dai, Xiaohua Wang, Hanming Deng, Lewei Lu, Dahua Lin, Ziwei Liu
First: 2025-10-16T17:59:58+00:00 · Latest: 2025-10-16T17:59:58+00:00
Comments: 21 pages, 7 figures
Abstract
The edifice of native Vision-Language Models (VLMs) has emerged as a rising
contender to typical modular VLMs, shaped by evolving model architectures and
training paradigms. Yet, two lingering clouds cast shadows over its widespread
exploration and promotion: (-) What fundamental constraints set native VLMs
apart from modular ones, and to what extent can these barriers be overcome? (-)
How can research in native VLMs be made more accessible and democratized,
thereby accelerating progress in the field? In this paper, we clarify these challenges
and outline guiding principles for constructing native VLMs. Specifically, one
native VLM primitive should: (i) effectively align pixel and word
representations within a shared semantic space; (ii) seamlessly integrate the
strengths of formerly separate vision and language modules; (iii) inherently
embody various cross-modal properties that support unified vision-language
encoding, aligning, and reasoning. Hence, we launch NEO, a novel family of
native VLMs built from first principles, capable of rivaling top-tier modular
counterparts across diverse real-world scenarios. With only 390M image-text
examples, NEO efficiently develops visual perception from scratch while
mitigating vision-language conflicts inside a dense and monolithic model
crafted from our elaborate primitives. We position NEO as a cornerstone for
scalable and powerful native VLMs, paired with a rich set of reusable
components that foster a cost-effective and extensible ecosystem. Our code and
models are publicly available at: https://github.com/EvolvingLMMs-Lab/NEO.
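Chinese title/abstract (translated to English)
Title: From Pixels to Words -- Towards Native Vision-Language Primitives at Scale
Native vision-language models (VLMs) have become a strong contender to typical modular VLMs, thanks to evolving model architectures and training paradigms. Yet two open questions still hinder their broad exploration and adoption: (-) What fundamental constraints separate native VLMs from modular ones, and to what extent can these barriers be overcome? (-) How can research on native VLMs be made more accessible and democratized, thereby accelerating progress in the field? In this paper, we clarify these challenges and outline guiding principles for building native VLMs. Specifically, a native VLM primitive should: (i) effectively align pixel and word representations within a shared semantic space; (ii) seamlessly integrate the strengths of previously separate vision and language modules; (iii) inherently embody various cross-modal properties that support unified vision-language encoding, alignment, and reasoning. We therefore introduce NEO, a new family of native VLMs built from first principles that can compete with top-tier modular counterparts across diverse real-world scenarios. Using only 390M image-text examples, NEO efficiently develops visual perception from scratch while mitigating vision-language conflicts inside a dense, monolithic model built from our carefully designed primitives. We position NEO as a cornerstone for scalable and powerful native VLMs, paired with a rich set of reusable components that foster a cost-effective and extensible ecosystem. Our code and models are publicly available at: https://github.com/EvolvingLMMs-Lab/NEO.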
Learning an Image Editing Model without Image Editing Pairs
Authors: Nupur Kumari, Sheng-Yu Wang, Nanxuan Zhao, Yotam Nitzan, Yuheng Li, Krishna Kumar Singh, Richard Zhang, Eli Shechtman, Jun-Yan Zhu, Xun Huang
First: 2025-10-16T17:59:57+00:00 · Latest: 2025-10-16T17:59:57+00:00
Comments: project page: https://nupurkmr9.github.io/npedit/
Abstract
Recent image editing models have achieved impressive results while following
natural language editing instructions, but they rely on supervised fine-tuning
with large datasets of input-target pairs. This is a critical bottleneck, as
such naturally occurring pairs are hard to curate at scale. Current workarounds
use synthetic training pairs that leverage the zero-shot capabilities of
existing models. However, this can propagate and magnify the artifacts of the
pretrained model into the final trained model. In this work, we present a new
training paradigm that eliminates the need for paired data entirely. Our
approach directly optimizes a few-step diffusion model by unrolling it during
training and leveraging feedback from vision-language models (VLMs). For each
input and editing instruction, the VLM evaluates if an edit follows the
instruction and preserves unchanged content, providing direct gradients for
end-to-end optimization. To ensure visual fidelity, we incorporate distribution
matching loss (DMD), which constrains generated images to remain within the
image manifold learned by pretrained models. We evaluate our method on standard
benchmarks and include an extensive ablation study. Without any paired data,
our method performs on par with various image editing diffusion models trained
on extensive supervised paired data, under the few-step setting. Given the same
VLM as the reward model, we also outperform RL-based techniques like Flow-GRPO.
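Chinese title/abstract (translated to English)
Title: Learning an Image Editing Model without Image Editing Pairs
Recent image editing models have achieved impressive results in following natural-language editing instructions, but they rely on supervised fine-tuning with large datasets of input-target pairs. This is a critical bottleneck, because such naturally occurring pairs are hard to curate at scale. Current workarounds use synthetic training pairs that exploit the zero-shot capabilities of existing models. However, this can propagate and amplify the artifacts of the pretrained model into the final trained model. In this work, we propose a new training paradigm that eliminates the need for paired data entirely. Our method directly optimizes a few-step diffusion model by unrolling it during training and using feedback from vision-language models (VLMs). For each input and editing instruction, the VLM evaluates whether the edit follows the instruction and preserves unchanged content, providing direct gradients for end-to-end optimization. To ensure visual fidelity, we incorporate a distribution matching loss (DMD), which constrains generated images to stay within the image manifold learned by pretrained models. We evaluate our method on standard benchmarks and include an extensive ablation study. Without any paired data, our method performs on par, under the few-step setting, with various image editing diffusion models trained on extensive supervised paired data. Given the same VLM as the reward model, we also outperform RL-based techniques such as Flow-GRPO.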
Attention Is All You Need for KV Cache in Diffusion LLMs
Authors: Quan Nguyen-Tri, Mukul Ranjan, Zhiqiang Shen
First: 2025-10-16T17:59:48+00:00 · Latest: 2025-10-16T17:59:48+00:00
Comments: https://vila-lab.github.io/elastic-cache-webpage/
Abstract
This work studies how to adaptively recompute key-value (KV) caches for
diffusion large language models (DLMs) to maximize prediction accuracy while
minimizing decoding latency. Prior methods' decoders recompute QKV for all
tokens at every denoising step and layer, despite KV states changing little
across most steps, especially in shallow layers, leading to substantial
redundancy. We make three observations: (1) distant ${\bf MASK}$ tokens
primarily act as a length-bias and can be cached block-wise beyond the active
prediction window; (2) KV dynamics increase with depth, suggesting that
selective refresh starting from deeper layers is sufficient; and (3) the
most-attended token exhibits the smallest KV drift, providing a conservative
lower bound on cache change for other tokens. Building on these, we propose
${\bf Elastic-Cache}$, a training-free, architecture-agnostic strategy that
jointly decides ${when}$ to refresh (via an attention-aware drift test on the
most-attended token) and ${where}$ to refresh (via a depth-aware schedule that
recomputes from a chosen layer onward while reusing shallow-layer caches and
off-window MASK caches). Unlike fixed-period schemes, Elastic-Cache performs
adaptive, layer-aware cache updates for diffusion LLMs, reducing redundant
computation and accelerating decoding with negligible loss in generation
quality. Experiments on LLaDA-Instruct, LLaDA-1.5, and LLaDA-V across
mathematical reasoning and code generation tasks demonstrate consistent
speedups: $8.7\times$ on GSM8K (256 tokens), $45.1\times$ on longer sequences,
and $4.8\times$ on HumanEval, while consistently maintaining higher accuracy
than the baseline. Our method achieves significantly higher throughput
($6.8\times$ on GSM8K) than existing confidence-based approaches while
preserving generation quality, enabling practical deployment of diffusion LLMs.
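The "when" and "where" decisions described above can be sketched as a single planning function. This is a minimal illustration under stated assumptions, not the authors' implementation: the cosine-similarity drift test, the `threshold` value, and the names `layers_to_refresh` and `start_layer` are all hypothetical, and real KV states are tensors per layer and head rather than flat lists.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def layers_to_refresh(attention_mass, cached_kv, fresh_kv, num_layers,
                      start_layer, threshold=0.9):
    """Return the layer indices whose KV cache should be recomputed.

    attention_mass: attention received by each token (one float per token).
    cached_kv / fresh_kv: per-token KV vectors before and after this step.
    The cosine test, `threshold`, and `start_layer` are illustrative choices.
    """
    # "When": test drift only on the most-attended token.
    probe = max(range(len(attention_mass)), key=attention_mass.__getitem__)
    if cosine_similarity(cached_kv[probe], fresh_kv[probe]) >= threshold:
        return []  # drift is small, keep reusing every cached layer
    # "Where": reuse shallow layers, recompute from start_layer onward.
    return list(range(start_layer, num_layers))
```

Probing a single token keeps the test cheap. The abstract's third observation is what would justify it: if the most-attended token drifts least, a large drift there suggests the other tokens have drifted at least as much.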
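Chinese title/abstract (translated to English)
Title: Attention Is All You Need for KV Cache in Diffusion LLMs
This work studies how to adaptively recompute key-value (KV) caches for diffusion large language models (DLMs) so as to maximize prediction accuracy while minimizing decoding latency. The decoders of prior methods recompute QKV at every denoising step and every layer, even though KV states change little across most steps, especially in shallow layers, which leads to substantial redundancy. We make three observations: (1) distant ${\bf MASK}$ tokens mainly act as a length bias and can be cached block-wise beyond the active prediction window; (2) KV dynamics increase with depth, suggesting that selective refresh starting from deeper layers is sufficient; (3) the most-attended token exhibits the smallest KV drift, providing a conservative lower bound on cache change for other tokens. Building on these observations, we propose ${\bf Elastic-Cache}$, a training-free, architecture-agnostic strategy that jointly decides when to refresh the cache (via an attention-aware drift test on the most-attended token) and where to refresh it (via a depth-aware schedule that recomputes from a chosen layer onward while reusing shallow-layer caches and off-window ${\bf MASK}$ caches). Unlike fixed-period schemes, Elastic-Cache performs adaptive, layer-aware cache updates for diffusion LLMs, reducing redundant computation and accelerating decoding with almost no loss in generation quality. In experiments with LLaDA-Instruct, LLaDA-1.5, and LLaDA-V on mathematical reasoning and code generation tasks, Elastic-Cache shows consistent speedups: $8.7\times$ on GSM8K (256 tokens), $45.1\times$ on longer sequences, and $4.8\times$ on HumanEval, while consistently maintaining higher accuracy than the baseline. Our method achieves significantly higher throughput ($6.8\times$ on GSM8K) than existing confidence-based approaches while preserving generation quality, enabling practical deployment of diffusion LLMs.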
Summary / 总结
This work addresses the issue of redundant computation in key-value (KV) caches for diffusion large language models (DLMs) by proposing Elastic-Cache, a strategy that adaptively recomputes KV caches based on the most-attended token's drift and depth-aware schedules. Experiments show consistent speedups across different tasks, with significant throughput improvements and maintained generation quality.
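The study aims to reduce decoding latency while preserving prediction accuracy by managing key-value (KV) caches in diffusion large language models (DLMs) effectively. The authors observe that distant ${\bf MASK}$ tokens can be cached block-wise, that KV dynamics increase with depth, and that the most-attended token provides a conservative lower bound on cache change. Based on these observations they propose ${\bf Elastic-Cache}$, a strategy that decides when and where to refresh the KV cache, achieving significant speedups on tasks such as GSM8K and HumanEval while maintaining or improving accuracy. The method outperforms existing confidence-based approaches in throughput while preserving generation quality, enabling practical deployment of diffusion language models.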
RDD: Retrieval-Based Demonstration Decomposer for Planner Alignment in Long-Horizon Tasks
Authors: Mingxuan Yan, Yuping Wang, Zechun Liu, Jiachen Li
Venue: NeurIPS 2025
First: 2025-10-16T17:59:37+00:00 · Latest: 2025-10-16T17:59:37+00:00
Comments: 39th Conference on Neural Information Processing Systems (NeurIPS
2025); Project Website: rdd-neurips.github.io
Abstract
To tackle long-horizon tasks, recent hierarchical vision-language-action
(VLA) frameworks employ vision-language model (VLM)-based planners to
decompose complex manipulation tasks into simpler sub-tasks that low-level
visuomotor policies can easily handle. Typically, the VLM planner is finetuned
to learn to decompose a target task. This finetuning requires target task
demonstrations segmented into sub-tasks by either human annotation or heuristic
rules. However, the heuristic subtasks can deviate significantly from the
training data of the visuomotor policy, which degrades task performance. To
address these issues, we propose a Retrieval-based Demonstration Decomposer
(RDD) that automatically decomposes demonstrations into sub-tasks by aligning
the visual features of the decomposed sub-task intervals with those from the
training data of the low-level visuomotor policies. Our method outperforms the
state-of-the-art sub-task decomposer on both simulation and real-world tasks,
demonstrating robustness across diverse settings. Code and more results are
available at rdd-neurips.github.io.
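One way to picture "aligning the visual features of sub-task intervals with the policy's training data" is as a segmentation problem over frame features. The sketch below is an assumption-laden illustration, not RDD itself: it scores an interval by the squared distance between its mean feature and the nearest entry of a hypothetical feature bank `bank`, adds a hypothetical `segment_penalty` to discourage over-splitting, and finds the best split by dynamic programming.

```python
def nearest_distance(feature, bank):
    """Squared distance from `feature` to its nearest neighbour in `bank`."""
    return min(sum((a - b) ** 2 for a, b in zip(feature, ref)) for ref in bank)


def mean_feature(frames, start, end):
    """Element-wise mean of the frame features in frames[start:end]."""
    count = end - start
    return [sum(f[d] for f in frames[start:end]) / count
            for d in range(len(frames[0]))]


def decompose(frames, bank, segment_penalty=0.1):
    """Split `frames` into intervals whose mean features match `bank` entries."""
    n = len(frames)
    best = [float("-inf")] * (n + 1)   # best[i]: top score for frames[:i]
    back = [0] * (n + 1)               # back[i]: start of the last interval
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            cost = (end - start) * nearest_distance(
                mean_feature(frames, start, end), bank)
            score = best[start] - cost - segment_penalty
            if score > best[end]:
                best[end], back[end] = score, start
    # Walk the back-pointers to recover the interval boundaries.
    intervals, end = [], n
    while end > 0:
        intervals.append((back[end], end))
        end = back[end]
    return intervals[::-1]
```

The returned list holds half-open `(start, end)` frame ranges. In this toy, the penalty is what stops the search from cutting a demonstration into single-frame intervals when several splits fit the bank equally well.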
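Chinese title/abstract (translated to English)
Title: RDD: Retrieval-Based Demonstration Decomposer for Planner Alignment in Long-Horizon Tasks
To tackle long-horizon tasks, recent hierarchical vision-language-action (VLA) frameworks use vision-language model (VLM)-based planners to decompose complex manipulation tasks into simpler sub-tasks that low-level visuomotor policies can handle easily. Typically, the VLM planner is fine-tuned to learn to decompose a target task. This fine-tuning requires demonstrations of the target task that have been segmented into sub-tasks, either by human annotation or by heuristic rules. However, heuristic sub-tasks can deviate significantly from the training data of the low-level visuomotor policy, which degrades task performance. To address these issues, we propose a Retrieval-based Demonstration Decomposer (RDD) that automatically decomposes demonstrations by aligning the visual features of the decomposed sub-task intervals with those from the training data of the low-level visuomotor policies. Our method outperforms the state-of-the-art sub-task decomposer on both simulation and real-world tasks, demonstrating robustness across diverse settings. Code and more results are available at rdd-neurips.github.io.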
Summary / 总结
The paper introduces RDD, a retrieval-based demonstration decomposer for aligning planners in long-horizon tasks. It automatically decomposes demonstrations into sub-tasks by aligning visual features with the training data of visuomotor policies, addressing the issue of heuristic sub-tasks deviating from the training data. RDD outperforms existing methods on both simulation and real-world tasks, showing robust performance across different settings.
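The paper proposes a method called RDD that automatically decomposes demonstrations by aligning the visual features of the decomposed sub-tasks with features from the training data of the low-level visuomotor policy. This addresses the drop in task performance caused by heuristic sub-tasks deviating from the training data. RDD outperforms existing methods on both simulation and real-world tasks and is robust across different scenarios.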
RoboGPT-R1: Enhancing Robot Planning with Reinforcement Learning
Authors: Jinrui Liu, Bingyan Nie, Boyu Li, Yaran Chen, Yuze Wang, Shunsen He, Haoran Li
First: 2025-10-16T16:04:35+00:00 · Latest: 2025-10-16T16:04:35+00:00
Abstract
Improving the reasoning capabilities of embodied agents is crucial for robots
to complete complex human instructions in long-horizon manipulation tasks
successfully. Despite the success of large language models and vision language
models based on Supervised Fine-Tuning (SFT) in planning tasks, they continue
to face challenges in performing long-horizon manipulation tasks in complex
real-world environments, owing to their restricted common sense and reasoning
capabilities. Considering that aligning general-purpose vision language models
to robotic planning tasks via supervised fine-tuning suffers from poor
generalization and insufficient physical understanding, we propose RoboGPT-R1,
a two-stage fine-tuning framework for embodied planning. In this framework,
supervised training acquires foundational knowledge through expert sequences,
followed by RL to address the model's shortcomings in visual-spatial
understanding and reasoning. To achieve physical understanding and action
sequence consistency in multi-step reasoning tasks, we design a rule-based
reward function that simultaneously considers long-horizon performance and
action constraint in the environment. The reasoning model, trained on
Qwen2.5-VL-3B, significantly outperforms the larger-scale model, GPT-4o-mini,
by 21.33% and surpasses other work trained on Qwen2.5-VL-7B by 20.33% on the
EmbodiedBench benchmark.
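Chinese title/abstract (translated to English)
Title: RoboGPT-R1: Enhancing Robot Planning with Reinforcement Learning
Improving the reasoning capabilities of embodied agents is crucial for robots to successfully complete complex human instructions in long-horizon manipulation tasks. Although large language models and vision language models based on supervised fine-tuning (SFT) have succeeded in planning tasks, they still face challenges in long-horizon manipulation tasks in complex real-world environments because of their limited common sense and reasoning capabilities. Given that aligning general-purpose vision language models to robotic planning tasks via supervised fine-tuning suffers from poor generalization and insufficient physical understanding, we propose RoboGPT-R1, a two-stage fine-tuning framework for embodied planning. In this framework, supervised training acquires foundational knowledge from expert sequences, and RL then addresses the model's shortcomings in visual-spatial understanding and reasoning. To achieve physical understanding and action-sequence consistency in multi-step reasoning tasks, we design a rule-based reward function that jointly considers long-horizon performance and action constraints in the environment. The reasoning model, trained on Qwen2.5-VL-3B, significantly outperforms the larger-scale GPT-4o-mini by 21.33% and surpasses other work trained on Qwen2.5-VL-7B by 20.33% on the EmbodiedBench benchmark.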
Summary / 总结
The paper aims to enhance the reasoning capabilities of robots for complex manipulation tasks. It introduces RoboGPT-R1, a two-stage fine-tuning framework combining supervised training and reinforcement learning. The model, trained on Qwen2.5-VL-3B, outperforms GPT-4o-mini by 21.33% and other work trained on Qwen2.5-VL-7B by 20.33% on the EmbodiedBench benchmark, demonstrating better performance in long-horizon manipulation tasks.
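The study aims to enable robots to complete complex manipulation tasks by strengthening their planning ability. It proposes RoboGPT-R1, a two-stage fine-tuning framework that combines supervised training with reinforcement learning. The model, trained on Qwen2.5-VL-3B, outperforms GPT-4o-mini by 21.33% and other work trained on Qwen2.5-VL-7B by 20.33% on the EmbodiedBench benchmark, showing better long-horizon manipulation and visual-spatial reasoning.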
CoT-PL: Visual Chain-of-Thought Reasoning Meets Pseudo-Labeling for Open-Vocabulary Object Detection
Authors: Hojun Choi, Youngsun Lim, Jaeyo Shin, Hyunjung Shim
First: 2025-10-16T15:27:10+00:00 · Latest: 2025-10-16T15:27:10+00:00
Comments: 28 pages, 13 Figures, 12 Tables
Abstract
Open-vocabulary object detection (OVD) seeks to recognize and localize object
categories beyond those seen during training. Recent approaches typically
leverage vision-language models (VLMs) to generate pseudo-labels using
image-text alignment, allowing detectors to generalize to unseen classes
without explicit supervision. However, these methods depend heavily on direct
image-text matching, neglecting the intermediate reasoning steps essential for
interpreting semantically complex scenes. This results in limited robustness
when confronted with crowded or occluded visual contexts. In this paper, we
introduce CoT-PL, a new framework that integrates structured visual
chain-of-thought (CoT) reasoning into the pseudo-labeling process. CoT-PL
decomposes object understanding into three interpretable steps: (1) region
perception even for unseen objects, (2) category recognition via zero-shot
reasoning, and (3) background grounding to separate semantically complex
objects. Crucially, the third step naturally motivates our contrastive
background learning (CBL) that uses the pre-computed background cues as
negatives to promote feature disentanglement between objects and background. In
this way, CoT reasoning and CBL form an integrated pipeline tailored to robust
pseudo-labeling in crowded or occluded scenes. Notably, in these two settings,
our novel-class pseudo-label quality achieves relative improvements of 103.4%
and 168.4% over the best prior, respectively. Our extensive experiments
demonstrate that CoT-PL achieves +7.7 AP50 on open-vocabulary COCO and +2.9
mask AP on LVIS for novel classes, setting a new state of the art.
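The idea of using background cues as negatives can be illustrated with a generic contrastive objective. The abstract does not give the form of the CBL loss, so the InfoNCE-style expression below is a stand-in chosen for illustration. The function name, the `temperature` value, and the assumption that all embeddings are unit-normalised vectors are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def contrastive_background_loss(region, class_text, backgrounds,
                                temperature=0.1):
    """InfoNCE-style loss for one region embedding.

    The region is pulled toward its class text embedding (the positive) and
    pushed away from background embeddings (the negatives). Inputs are assumed
    to be unit-normalised; this is not the paper's exact CBL formulation.
    """
    positive = dot(region, class_text) / temperature
    negatives = [dot(region, bg) / temperature for bg in backgrounds]
    logits = [positive] + negatives
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum - positive
```

The loss is close to zero when the region embedding matches its class text and is far from every background embedding, and it grows as the region moves toward a background embedding.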
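Chinese title/abstract (translated to English)
Title: CoT-PL: Visual Chain-of-Thought Reasoning Meets Pseudo-Labeling for Open-Vocabulary Object Detection
Open-vocabulary object detection (OVD) aims to recognize and localize object categories not seen during training. Recent approaches typically use vision-language models (VLMs) to generate pseudo-labels through image-text alignment, allowing detectors to generalize to unseen classes without explicit supervision. However, these methods depend heavily on direct image-text matching and neglect the intermediate reasoning steps needed to interpret semantically complex scenes, which limits their robustness in crowded or occluded visual contexts. This paper introduces CoT-PL, a new framework that integrates structured visual chain-of-thought (CoT) reasoning into the pseudo-labeling process. CoT-PL decomposes object understanding into three interpretable steps: (1) region perception, even for unseen objects, (2) category recognition via zero-shot reasoning, and (3) background grounding to separate semantically complex objects. Crucially, the third step naturally motivates our contrastive background learning (CBL), which uses the pre-computed background cues as negatives to promote feature disentanglement between objects and background. In this way, CoT reasoning and CBL form an integrated pipeline tailored to robust pseudo-labeling in crowded or occluded scenes. Notably, in these two settings our novel-class pseudo-label quality achieves relative improvements of 103.4% and 168.4%, respectively, over the best prior method. Our extensive experiments show that CoT-PL achieves +7.7 AP50 on open-vocabulary COCO and +2.9 mask AP on LVIS for novel classes, setting a new state of the art.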
Summary / 总结
The paper introduces CoT-PL, a framework that combines visual chain-of-thought reasoning with pseudo-labeling to improve open-vocabulary object detection. It decomposes object understanding into region perception, zero-shot category recognition, and background grounding. The background grounding step uses contrastive background learning to enhance feature disentanglement. Experiments show that CoT-PL significantly improves pseudo-label quality, achieving 103.4% and 168.4% relative improvements over previous methods in crowded and occluded scenes. On open-vocabulary COCO and LVIS, CoT-PL sets new state-of-the-art performance with +7.7 AP50 and +2.9 mask AP for novel classes respectively.
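The paper proposes CoT-PL, a framework that combines visual chain-of-thought reasoning with pseudo-labeling to improve open-vocabulary object detection. It decomposes object understanding into three steps: region perception, zero-shot category recognition, and background grounding. The background grounding step uses contrastive background learning to strengthen feature disentanglement. Experiments show that CoT-PL significantly improves pseudo-label quality, with relative improvements of 103.4% and 168.4% in crowded and occluded scenes, respectively. On open-vocabulary COCO and LVIS, CoT-PL sets a new state of the art for novel classes with +7.7 AP50 and +2.9 mask AP, respectively.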
SVAG-Bench: A Large-Scale Benchmark for Multi-Instance Spatio-temporal Video Action Grounding
Authors: Tanveer Hannan, Shuaicong Wu, Mark Weber, Suprosanna Shit, Jindong Gu, Rajat Koner, Aljoša Ošep, Laura Leal-Taixé, Thomas Seidl
First: 2025-10-14T22:10:49+00:00 · Latest: 2025-10-16T15:16:51+00:00
Abstract
Understanding fine-grained actions and accurately localizing their
corresponding actors in space and time are fundamental capabilities for
advancing next-generation AI systems, including embodied agents, autonomous
platforms, and human-AI interaction frameworks. Despite recent progress in
video understanding, existing methods predominantly address either
coarse-grained action recognition or generic object tracking, thereby
overlooking the challenge of jointly detecting and tracking multiple objects
according to their actions while grounding them temporally. To address this
gap, we introduce Spatio-temporal Video Action Grounding (SVAG), a novel task
that requires models to simultaneously detect, track, and temporally localize
all referent objects in videos based on natural language descriptions of their
actions. To support this task, we construct SVAG-Bench, a large-scale benchmark
comprising 688 videos, 19,590 annotated records, and 903 unique verbs, covering
a diverse range of objects, actions, and real-world scenes. We further propose
SVAGFormer, a baseline framework that adapts state-of-the-art vision-language
models for joint spatial and temporal grounding, and introduce SVAGEval, a
standardized evaluation toolkit for fair and reproducible benchmarking.
Empirical results show that existing models perform poorly on SVAG,
particularly in dense or complex scenes, underscoring the need for more
advanced reasoning over fine-grained object-action interactions in long videos.
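Chinese title/abstract (translated to English)
Title: SVAG-Bench: A Large-Scale Benchmark for Multi-Instance Spatio-temporal Video Action Grounding
Understanding fine-grained actions and accurately localizing their corresponding actors in space and time are fundamental capabilities for advancing next-generation AI systems, including embodied agents, autonomous platforms, and human-AI interaction frameworks. Despite progress in video understanding, existing methods mainly address either coarse-grained action recognition or generic object tracking, and thus overlook the challenge of jointly detecting and tracking multiple objects according to their actions while grounding them temporally. To address this gap, we introduce Spatio-temporal Video Action Grounding (SVAG), a new task that requires models to simultaneously detect, track, and temporally localize all referent objects in a video based on natural language descriptions of their actions. To support this task, we construct SVAG-Bench, a large-scale benchmark comprising 688 videos, 19,590 annotated records, and 903 unique verbs, covering a wide range of objects, actions, and real-world scenes. We further propose SVAGFormer, a baseline framework that adapts state-of-the-art vision-language models for joint spatial and temporal grounding, and introduce SVAGEval, a standardized evaluation toolkit for fair and reproducible benchmarking. Experiments show that existing models perform poorly on SVAG, particularly in dense or complex scenes, underscoring the need for more advanced reasoning over fine-grained object-action interactions in long videos.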
Summary / 总结
The paper introduces SVAG-Bench, a large-scale benchmark for multi-instance spatio-temporal video action grounding, addressing the challenge of jointly detecting, tracking, and temporally localizing objects based on their actions. The benchmark includes 688 videos, 19,590 annotated records, and 903 unique verbs. The authors propose SVAGFormer, a baseline framework that leverages state-of-the-art vision-language models for joint spatial and temporal grounding, and introduce SVAGEval, an evaluation toolkit. Experimental results indicate that current models struggle with SVAG, especially in dense or complex scenes, highlighting the need for improved reasoning over fine-grained object-action interactions in long videos.
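The paper introduces SVAG-Bench, a large-scale benchmark for multi-instance spatio-temporal video action grounding, which targets the challenge of simultaneously detecting, tracking, and temporally localizing objects based on their actions. The authors propose SVAGFormer, a baseline framework built on state-of-the-art vision-language models, and SVAGEval, a standardized evaluation toolkit. Experimental results show that current models perform poorly on SVAG, especially in dense or complex scenes, highlighting the need for more advanced reasoning over fine-grained object-action interactions in long videos.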
Free-Grained Hierarchical Recognition
Authors: Seulki Park, Zilin Wang, Stella X. Yu
First: 2025-10-16T14:35:18+00:00 · Latest: 2025-10-16T14:35:18+00:00
Comments: 26 pages
Abstract
Hierarchical image classification predicts labels across a semantic taxonomy,
but existing methods typically assume complete, fine-grained annotations, an
assumption rarely met in practice. Real-world supervision varies in
granularity, influenced by image quality, annotator expertise, and task
demands; a distant bird may be labeled Bird, while a close-up reveals Bald
eagle. We introduce ImageNet-F, a large-scale benchmark curated from ImageNet
and structured into cognitively inspired basic, subordinate, and fine-grained
levels. Using CLIP as a proxy for semantic ambiguity, we simulate realistic,
mixed-granularity labels reflecting human annotation behavior. We propose
free-grain learning, with heterogeneous supervision across instances. We
develop methods that enhance semantic guidance via pseudo-attributes from
vision-language models and visual guidance via semi-supervised learning. These,
along with strong baselines, substantially improve performance under mixed
supervision. Together, our benchmark and methods advance hierarchical
classification under real-world constraints.
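Chinese title/abstract (translated to English)
Title: Free-Grained Hierarchical Recognition
Hierarchical image classification predicts labels across a semantic taxonomy, but existing methods usually assume complete, fine-grained annotations, an assumption rarely met in practice. Real-world supervision varies in granularity, influenced by image quality, annotator expertise, and task demands; a distant bird may be labeled "Bird", while a close-up reveals "Bald eagle". We introduce ImageNet-F, a large-scale benchmark curated from ImageNet and organized into cognitively inspired basic, subordinate, and fine-grained levels. Using CLIP as a proxy for semantic ambiguity, we simulate realistic, mixed-granularity labels that reflect human annotation behavior. We propose free-grain learning, in which supervision is heterogeneous across instances. We develop methods that enhance semantic guidance via pseudo-attributes from vision-language models and visual guidance via semi-supervised learning. Together with strong baselines, these methods substantially improve performance under mixed supervision. Our benchmark and methods jointly advance hierarchical classification under real-world constraints.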
Summary / 总结
The research addresses the challenge of hierarchical image classification with varying annotation granularity, which is common in real-world scenarios. The study introduces ImageNet-F, a benchmark with labels structured into basic, subordinate, and fine-grained levels, and simulates mixed-granularity labels using CLIP. The authors propose free-grain learning, which uses heterogeneous supervision and enhances semantic and visual guidance. The methods significantly improve performance under mixed supervision, advancing hierarchical classification under practical constraints.
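The study addresses inconsistent annotation granularity in real-world image classification. It introduces the ImageNet-F benchmark, whose labels are organized into basic, subordinate, and fine-grained levels, and uses CLIP to simulate mixed-granularity labels. The proposed free-grain learning approach strengthens semantic and visual guidance through pseudo-attributes from vision-language models and semi-supervised learning, significantly improving performance under mixed supervision and advancing hierarchical classification under practical constraints.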
VTimeCoT: Thinking by Drawing for Video Temporal Grounding and Reasoning
Authors: Jinglei Zhang, Yuanfan Guo, Rolandos Alexandros Potamias, Jiankang Deng, Hang Xu, Chao Ma
Venue: ICCV 2025
First: 2025-10-16T13:29:02+00:00 · Latest: 2025-10-16T13:29:02+00:00
Comments: Accepted by ICCV 2025
Abstract
In recent years, video question answering based on multimodal large language
models (MLLM) has garnered considerable attention, owing to the
substantial advancements in LLMs. However, these models have a notable
deficiency in the domains of video temporal grounding and reasoning, posing
challenges to the development of effective real-world video understanding
systems. Inspired by how humans use video players to interact with the progress
bar for video comprehension, we introduce VTimeCoT, a simple yet effective
training-free framework, designed for high-performance video grounding and
reasoning. The proposed framework incorporates two novel visual tools of the
progress bar: a plug-and-play progress bar integration tool and a
high-efficiency highlighting tool. In addition, to address the limitations of
conventional text-based chain-of-thought (CoT) approaches, we introduce a
visuotemporal CoT process that integrates cross-modality reasoning across both
video and text. Our approach demonstrates significant performance improvements
on both Qwen2VL-7B and GPT4o baselines in tasks of video temporal grounding and
reasoning-based question answering. Finally, we showcase that the proposed
framework achieves a compositional and interpretable reasoning process. Project
page: https://vtimecot.github.io
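To make the progress-bar idea concrete, the following is a minimal sketch of overlaying a bar that encodes a frame's position in the video. It is an illustration under assumptions, not the authors' tool: frames are plain nested lists of grayscale values, and the function name, bar height, and intensity values are hypothetical.

```python
def draw_progress_bar(frame, index, total, bar_height=2, filled=255, empty=64):
    """Return a copy of a grayscale `frame` with a progress bar on its bottom rows.

    frame: list of rows, each a list of pixel intensities.
    index / total: 0-based position of this frame and the number of frames.
    """
    width = len(frame[0])
    # Number of columns to fill, so the last frame fills the whole bar.
    filled_cols = round(width * (index + 1) / total)
    out = [row[:] for row in frame]  # copy rows so the input stays untouched
    for row in out[-bar_height:]:
        for col in range(width):
            row[col] = filled if col < filled_cols else empty
    return out
```

Because the bar is drawn into the pixels themselves, a model that only sees sampled frames could in principle read off each frame's temporal position without any change to its architecture or weights.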
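Chinese title/abstract (translated to English)
Title: VTimeCoT: Thinking by Drawing for Video Temporal Grounding and Reasoning
In recent years, video question answering based on multimodal large language models (MLLM) has attracted wide attention, thanks to substantial advances in LLMs. However, these models have a notable deficiency in video temporal grounding and reasoning, which poses challenges for building effective real-world video understanding systems. Inspired by how humans interact with a video player's progress bar to understand a video, we propose VTimeCoT, a simple yet effective training-free framework designed for high-performance video grounding and reasoning. The framework incorporates two novel visual tools: a plug-and-play progress bar integration tool and a high-efficiency highlighting tool. In addition, to address the limitations of conventional text-based chain-of-thought (CoT) approaches, we introduce a visuotemporal CoT process that integrates cross-modality reasoning across video and text. Our approach shows significant performance improvements over both Qwen2VL-7B and GPT4o baselines on video temporal grounding and reasoning-based question answering. Finally, we show that the proposed framework achieves a compositional and interpretable reasoning process. Project page: https://vtimecot.github.io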
Summary / 总结
VTimeCoT is a training-free framework designed to enhance video temporal grounding and reasoning. It leverages a progress bar and highlighting tools to facilitate visual interaction and cross-modality reasoning. The framework significantly improves performance on video question answering tasks compared to baselines like Qwen2VL-7B and GPT4o, demonstrating a compositional and interpretable reasoning process.
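VTimeCoT is a training-free framework that aims to improve video temporal grounding and reasoning. Inspired by how humans interact with a video progress bar, it integrates a progress bar tool and a high-efficiency highlighting tool to improve video understanding. VTimeCoT introduces a cross-modal visuotemporal CoT process and shows significant performance gains over the Qwen2VL-7B and GPT4o baselines on video question answering tasks. The framework also provides a compositional and interpretable reasoning process.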
Does FLUX Already Know How to Perform Physically Plausible Image Composition?
Authors: Shilin Lu, Zhuming Lian, Zihan Zhou, Shaocong Zhang, Chen Zhao, Adams Wai-Kin Kong
First: 2025-09-25T15:01:49+00:00 · Latest: 2025-10-16T12:37:53+00:00
Comments: Preprint
Abstract
Image composition aims to seamlessly insert a user-specified object into a
new scene, but existing models struggle with complex lighting (e.g., accurate
shadows, water reflections) and diverse, high-resolution inputs. Modern
text-to-image diffusion models (e.g., SD3.5, FLUX) already encode essential
physical and resolution priors, yet lack a framework to unleash them without
resorting to latent inversion, which often locks object poses into contextually
inappropriate orientations, or brittle attention surgery. We propose SHINE, a
training-free framework for Seamless, High-fidelity Insertion with Neutralized
Errors. SHINE introduces manifold-steered anchor loss, leveraging pretrained
customization adapters (e.g., IP-Adapter) to guide latents for faithful subject
representation while preserving background integrity. Degradation-suppression
guidance and adaptive background blending are proposed to further eliminate
low-quality outputs and visible seams. To address the lack of rigorous
benchmarks, we introduce ComplexCompo, featuring diverse resolutions and
challenging conditions such as low lighting, strong illumination, intricate
shadows, and reflective surfaces. Experiments on ComplexCompo and
DreamEditBench show state-of-the-art performance on standard metrics (e.g.,
DINOv2) and human-aligned scores (e.g., DreamSim, ImageReward, VisionReward).
Code and benchmark will be publicly available upon publication.
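Chinese title/abstract (translated to English)
Title: Does FLUX Already Know How to Perform Physically Plausible Image Composition?
Image composition aims to seamlessly insert a user-specified object into a new scene, but existing models struggle with complex lighting (e.g., accurate shadows, water reflections) and with diverse, high-resolution inputs. Modern text-to-image diffusion models (e.g., SD3.5, FLUX) already encode essential physical and resolution priors, yet lack a framework to unleash them without resorting to brittle attention surgery or to latent inversion, which often locks object poses into contextually inappropriate orientations. We propose SHINE, a training-free framework for Seamless, High-fidelity Insertion with Neutralized Errors. SHINE introduces a manifold-steered anchor loss that uses pretrained customization adapters (e.g., IP-Adapter) to guide latents toward faithful subject representation while preserving background integrity. We propose degradation-suppression guidance and adaptive background blending to further eliminate low-quality outputs and visible seams. To address the lack of rigorous benchmarks, we introduce ComplexCompo, which features diverse resolutions and challenging conditions such as low lighting, strong illumination, intricate shadows, and reflective surfaces. Experiments on ComplexCompo and DreamEditBench show that SHINE achieves state-of-the-art performance on standard metrics (e.g., DINOv2) and human-aligned scores (e.g., DreamSim, ImageReward, VisionReward). Code and benchmark will be made public upon publication.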
Summary / 总结
The paper addresses the challenge of physically plausible image composition, where existing models struggle with complex lighting and high-resolution inputs. It introduces SHINE, a training-free framework that uses manifold-steered anchor loss and pretrained customization adapters to guide latents for faithful subject representation while preserving the background. The framework also includes degradation-suppression guidance and adaptive background blending to improve output quality. Experiments on ComplexCompo and DreamEditBench demonstrate superior performance compared to existing models on both standard metrics and human-aligned scores.
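The paper targets the difficulty existing models have in producing physically plausible image composition under complex lighting and high-resolution inputs. It proposes the SHINE framework, which uses a manifold-steered anchor loss and pretrained customization adapters to guide latents toward faithful subject representation while preserving background integrity. The framework also includes degradation-suppression guidance and adaptive background blending to further eliminate low-quality outputs and visible seams. Experimental results show that the framework outperforms existing models on both standard metrics and human-aligned scores.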
Efficient Video Sampling: Pruning Temporally Redundant Tokens for Faster VLM Inference
Authors: Natan Bagrov, Eugene Khvedchenia, Borys Tymchenko, Shay Aharon, Lior Kadoch, Tomer Keren, Ofri Masad, Yonatan Geifman, Ran Zilberstein, Tuomas Rintamaki, Matthieu Le, Andrew Tao
First: 2025-10-16T12:34:38+00:00 · Latest: 2025-10-16T12:34:38+00:00
Abstract
Vision-language models (VLMs) have recently expanded from static image
understanding to video reasoning, but their scalability is fundamentally
limited by the quadratic cost of processing dense frame sequences. Long videos
often exceed the token budget of modern language models, leading to severe
context limitations and latency issues. We introduce Efficient Video Sampling
(EVS), a simple, plug-and-play method for reducing token redundancy in videos
by identifying and pruning temporally static patches -- spatial regions that
remain unchanged across consecutive frames. EVS preserves positional identity
and requires no architectural changes or retraining. We show that EVS substantially
reduces token count while maintaining semantic fidelity, enabling faster
inference and longer input sequences. Applied at inference time, EVS reduces
large language model (LLM) time-to-first-token (TTFT) by up to 4x with minimal
accuracy loss. When combined with an uptraining phase using stochastic pruning
rates, EVS yields models that are robust to varying compression levels and
retain full performance under aggressive pruning. Extensive experiments
demonstrate that EVS consistently improves efficiency-accuracy trade-offs,
unlocking scalable video-language understanding without sacrificing quality.
中文标题/摘要
标题:高效视频采样:通过剪枝时间冗余令牌加速VLM推理
视觉-语言模型(VLMs)最近从静态图像理解扩展到了视频推理,但其可扩展性从根本上受到处理密集帧序列的二次成本限制。长视频经常超出现代语言模型的令牌预算,导致严重的上下文限制和延迟问题。我们引入了高效视频采样(EVS),这是一种简单且即插即用的方法,通过识别并剪枝时间上静态的补丁(即连续帧中保持不变的空间区域)来减少视频中的令牌冗余。EVS 保持了位置标识,无需进行架构更改或重新训练。我们展示了EVS在显著减少令牌数量的同时保持语义保真度,从而实现更快的推理和更长的输入序列。在推理时应用EVS,可以将大型语言模型(LLM)的时间到首个令牌(TTFT)减少多达4倍,同时最小化准确率损失。当与使用随机剪枝率的上训练阶段结合时,EVS 生成的模型对不同的压缩级别具有鲁棒性,并且在激进剪枝下仍能保持全性能。大量实验表明,EVS 一致地改善了效率-准确性的权衡,无需牺牲质量即可实现可扩展的视频-语言理解。
Summary / 总结
The research aims to address the scalability issues of vision-language models (VLMs) when processing long videos, which are computationally expensive due to the quadratic cost of handling dense frame sequences. The method, Efficient Video Sampling (EVS), identifies and prunes temporally static patches to reduce token redundancy without altering the model architecture or requiring retraining. Key findings show that EVS can reduce token count while maintaining semantic fidelity, enabling faster inference and longer input sequences. At inference time, EVS reduces large language model time-to-first-token by up to 4x with minimal accuracy loss, and when combined with uptraining, it retains full performance under aggressive pruning.
研究旨在解决视觉语言模型(VLMs)处理长视频时的可扩展性问题,因为长视频往往超过了现代语言模型的令牌预算。所提方法为高效视频采样(EVS),通过识别并移除连续帧中不变的空间区域来减少令牌冗余,无需修改模型架构或重新训练。关键发现表明,EVS可以在保持语义保真度的同时减少令牌数量,从而实现更快的推理和更长的输入序列。在推理时,EVS可以将大型语言模型的时间到首个令牌(TTFT)最多缩短4倍,且精度损失极小。结合随机剪枝率的上训练阶段,EVS可以在激进剪枝下保持全性能。广泛的实验表明,EVS能够一致地改善效率-准确性的权衡,无需牺牲质量即可实现可扩展的视频-语言理解。
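The pruning idea in the EVS abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the mean-absolute-difference test, the tolerance `tol`, and the choice to compare each frame only with its immediate predecessor are all assumptions made for illustration. Each kept patch carries its original (frame, patch) index, which is one simple way to preserve positional identity.

```python
from typing import List, Sequence, Tuple

Patch = Sequence[float]   # flattened patch values (hypothetical representation)
Frame = Sequence[Patch]   # one frame = list of patches in a fixed spatial order


def static_patch_mask(prev: Frame, curr: Frame, tol: float) -> List[bool]:
    """Mark a patch as static if it barely differs from the same position in `prev`."""
    mask = []
    for p, c in zip(prev, curr):
        mean_abs_diff = sum(abs(a - b) for a, b in zip(p, c)) / len(p)
        mask.append(mean_abs_diff <= tol)
    return mask


def prune_static_patches(
    frames: Sequence[Frame], tol: float = 0.05
) -> List[Tuple[int, int, Patch]]:
    """Keep every patch of the first frame, then only patches that changed.

    Returns (frame_index, patch_index, patch) triples so that the original
    position of each surviving patch is still known after pruning.
    """
    kept = []
    for t, frame in enumerate(frames):
        if t == 0:
            static = [False] * len(frame)
        else:
            static = static_patch_mask(frames[t - 1], frame, tol)
        for i, patch in enumerate(frame):
            if not static[i]:
                kept.append((t, i, patch))
    return kept


# Toy video: 3 frames, 4 patches each; only one patch changes per later frame.
frames = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
    [[0.0, 0.0], [1.0, 1.0], [2.5, 2.5], [3.0, 3.0]],
    [[0.0, 0.0], [1.0, 1.0], [2.5, 2.5], [9.0, 9.0]],
]
kept = prune_static_patches(frames)
kept_positions = [(t, i) for t, i, _ in kept]
```

In this toy input, 6 of the 12 patches survive: the whole first frame plus the single changed patch in each later frame.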
Talking Points: Describing and Localizing Pixels
Authors: Matan Rusanovsky, Shimon Malnick, Shai Avidan
First: 2025-10-16T11:42:03+00:00 · Latest: 2025-10-16T11:42:03+00:00
Abstract
Vision-language models have achieved remarkable success in cross-modal
understanding. Yet, these models remain limited to object-level or region-level
grounding, lacking the capability for pixel-precise keypoint comprehension
through natural language. We introduce a novel framework for pixel-level
grounding. The framework consists of two complementary components: a Point
Descriptor that generates rich, contextual descriptions of individual
keypoints, and a Point Localizer that regresses precise pixel coordinates from
these descriptions. Unlike prior work that relies on templated prompts or
keypoint names, our approach produces free-form, coarse-to-fine descriptions
that situate keypoints within their visual context. Since there is no available
dataset to train such a system, we introduce LlamaPointInPart, a carefully
curated dataset of 20K+ image-keypoint-description triplets synthesized from
multiple vision-language models, capturing multi-scale information from
scene-level context to visual features around the keypoint. For cross-category
generalization, we optimize the Point Descriptor on AP-10K via GRPO, using the
frozen Point Localizer as a reward model to produce descriptions that maximize
localization accuracy. To evaluate our results, we establish a new evaluation
protocol. Instead of comparing the text description produced by our method to
the ground truth, we use the localizer to measure how close the point predicted
from that description is to the ground-truth point. Experiments demonstrate
superior performance compared to baseline models on LlamaPointInPart. The
bidirectional nature of our framework should enable future applications in both
keypoint-guided image understanding and language-guided precise localization.
Our code and dataset are publicly available at
https://github.com/matanr/Talking_Points.
中文标题/摘要
标题:Talking Points:像素的描述与定位
视觉-语言模型在跨模态理解方面取得了显著成就。然而,这些模型仍然局限于对象级或区域级的定位,缺乏通过自然语言理解像素级关键点的能力。我们提出了一种新的像素级定位框架。该框架由两个互补组件组成:一个点描述器,生成丰富的上下文描述,以及一个点定位器,从这些描述中回归精确的像素坐标。与依赖于模板提示或关键点名称的先前工作不同,我们的方法生成了自由形式、从粗到细的描述,将关键点置于其视觉上下文中。由于没有可用的数据集来训练此类系统,我们引入了LlamaPointInPart,这是一个精心策划的数据集,包含来自多个视觉-语言模型的20K+幅图像-关键点-描述三元组,捕捉从场景级上下文到关键点周围视觉特征的多尺度信息。为了实现跨类别泛化,我们通过GRPO在AP-10K上优化点描述器,并使用冻结的点定位器作为奖励模型,生成最大化定位准确性的描述。为了评估我们的结果,我们建立了一个新的评估协议。我们不将我们方法生成的文本描述与真实值进行比较,而是使用定位器来确定预测的关键点与真实关键点的接近程度。实验结果表明,与基线模型相比,在LlamaPointInPart上具有优越的性能。我们框架的双向性质应能在未来在关键点引导的图像理解和语言引导的精确定位中发挥重要作用。我们的代码和数据集可在https://github.com/matanr/Talking_Points/公开获取。
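The evaluation protocol described above scores a description by where the localizer places the point, not by text overlap. A minimal sketch of such a distance-based score follows. The abstract does not state the exact metric, so the normalization by the image diagonal, the threshold value, and the function names are assumptions.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def normalized_point_error(pred: Point, gt: Point, width: float, height: float) -> float:
    """Euclidean pixel distance divided by the image diagonal (assumed normalization)."""
    return math.hypot(pred[0] - gt[0], pred[1] - gt[1]) / math.hypot(width, height)


def localization_accuracy(
    preds: Sequence[Point],
    gts: Sequence[Point],
    sizes: Sequence[Tuple[float, float]],
    threshold: float = 0.05,
) -> float:
    """Fraction of predicted points within `threshold` of the ground truth."""
    hits = sum(
        normalized_point_error(p, g, w, h) <= threshold
        for p, g, (w, h) in zip(preds, gts, sizes)
    )
    return hits / len(preds)


# Two predictions on 300x400 images (diagonal = 500 pixels).
preds = [(30.0, 40.0), (103.0, 104.0)]
gts = [(0.0, 0.0), (100.0, 100.0)]
sizes = [(300.0, 400.0), (300.0, 400.0)]
accuracy = localization_accuracy(preds, gts, sizes)
```

Here the first prediction is 50 pixels off (normalized error 0.1) and misses the 0.05 threshold, while the second is 5 pixels off (0.01) and counts as a hit.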
TTT3R: 3D Reconstruction as Test-Time Training
Authors: Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, Anpei Chen
First: 2025-09-30T17:59:51+00:00 · Latest: 2025-10-16T11:37:35+00:00
Comments: Page: https://rover-xingyu.github.io/TTT3R/ Code:
https://github.com/Inception3D/TTT3R
Abstract
Modern Recurrent Neural Networks have become a competitive architecture for
3D reconstruction due to their linear-time complexity. However, their
performance degrades significantly when applied beyond the training context
length, revealing limited length generalization. In this work, we revisit the
3D reconstruction foundation models from a Test-Time Training perspective,
framing their designs as an online learning problem. Building on this
perspective, we leverage the alignment confidence between the memory state and
incoming observations to derive a closed-form learning rate for memory updates,
to balance between retaining historical information and adapting to new
observations. This training-free intervention, termed TTT3R, substantially
improves length generalization, achieving a $2\times$ improvement in global
pose estimation over baselines, while operating at 20 FPS with just 6 GB of GPU
memory to process thousands of images. Code is available at
https://rover-xingyu.github.io/TTT3R
中文标题/摘要
标题:TTT3R:测试时训练的3D重建
现代循环神经网络因其线性时间复杂性已成为3D重建的竞争性架构。然而,当应用于训练上下文长度之外时,其性能显著下降,显示出有限长度泛化能力。在本文中,我们从测试时训练的角度重新审视3D重建基础模型,将其设计框架为在线学习问题。基于这一视角,我们利用记忆状态与新观测之间的对齐置信度来推导出记忆更新的闭式学习率,以平衡保留历史信息和适应新观测之间的关系。这种无需训练的干预措施,称为TTT3R,显著提高了长度泛化能力,在全局姿态估计方面比基线提高了2倍,同时以每秒20帧的速度运行,仅使用6 GB的GPU内存处理数千张图像。代码可在https://rover-xingyu.github.io/TTT3R/获取
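The trade-off between retaining history and adapting to new observations can be pictured as a gated interpolation of the memory state. The sketch below is only an illustration of that general idea: the abstract does not give the closed-form learning rate, so the sigmoid gate, the scalar `confidence_logit`, and the function names are hypothetical stand-ins, not the TTT3R formula.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_memory_update(
    state: Sequence[float], candidate: Sequence[float], confidence_logit: float
) -> List[float]:
    """Move `state` toward `candidate` by a step size derived from a confidence score.

    A step size near 0 keeps the historical state; a step size near 1 overwrites it
    with the new observation. The sigmoid mapping is an assumption of this sketch.
    """
    lr = sigmoid(confidence_logit)
    return [s + lr * (c - s) for s, c in zip(state, candidate)]


state = [0.0, 0.0]
candidate = [2.0, 4.0]
halfway = gated_memory_update(state, candidate, confidence_logit=0.0)       # step size 0.5
almost_kept = gated_memory_update(state, candidate, confidence_logit=-50.0)  # step size near 0
```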
Consistent text-to-image generation via scene de-contextualization
Authors: Song Tang, Peihao Gong, Kunyu Li, Kai Guo, Boyu Wang, Mao Ye, Jianwei Zhang, Xiatian Zhu
First: 2025-10-16T10:54:49+00:00 · Latest: 2025-10-16T10:54:49+00:00
Abstract
Consistent text-to-image (T2I) generation seeks to produce
identity-preserving images of the same subject across diverse scenes, yet it
often fails due to a phenomenon called identity (ID) shift. Previous methods
have tackled this issue, but typically rely on the unrealistic assumption of
knowing all target scenes in advance. This paper reveals that a key source of
ID shift is the native correlation between subject and scene context, called
scene contextualization, which arises naturally as T2I models fit the training
distribution of vast natural images. We formally prove the near-universality of
this scene-ID correlation and derive theoretical bounds on its strength. On
this basis, we propose a novel, efficient, training-free prompt embedding
editing approach, called Scene De-Contextualization (SDeC), that imposes an
inversion process of T2I's built-in scene contextualization. Specifically, it
identifies and suppresses the latent scene-ID correlation within the ID
prompt's embedding by quantifying the SVD directional stability to adaptively
re-weight the corresponding eigenvalues. Critically, SDeC allows for per-scene
use (one scene per prompt) without requiring prior access to all target scenes.
This makes it a highly flexible and general solution well-suited to real-world
applications where such prior knowledge is often unavailable or varies over
time. Experiments demonstrate that SDeC significantly enhances identity
preservation while maintaining scene diversity.
中文标题/摘要
标题:通过场景去语境化实现一致的文本到图像生成
一致的文本到图像(T2I)生成旨在跨不同场景生成同一主题的身份保留图像,但由于身份(ID)偏移现象,往往难以实现。先前的方法已经解决了这一问题,但通常依赖于事先知道所有目标场景的不现实假设。本文揭示了ID偏移的一个关键来源是主题和场景语境之间的自然相关性,称为场景语境化,这是T2I模型拟合大量自然图像训练分布时自然产生的。我们正式证明了这种场景-ID相关性的普遍性,并推导出其强度的理论界。在此基础上,我们提出了一种新颖、高效、无需训练的提示嵌入编辑方法,称为场景去语境化(SDeC),它施加了T2I内置场景语境化的逆过程。具体而言,它通过量化SVD方向稳定性来识别并抑制ID提示嵌入中的潜在场景-ID相关性,从而自适应地重新加权相应的特征值。关键的是,SDeC允许每场景使用(每个提示一个场景),而无需事先访问所有目标场景。这使其成为一种高度灵活且通用的解决方案,特别适合于事先知识往往不可用或随时间变化的实际应用场景。实验表明,SDeC显著增强了身份保留能力,同时保持了场景多样性。
Summary / 总结
This paper addresses the issue of identity shift in text-to-image generation, where models produce images with shifted identities across different scenes. It introduces a method called Scene De-Contextualization (SDeC) that identifies and suppresses the latent scene-ID correlation within the identity prompt's embedding, allowing for per-scene use without requiring prior access to all target scenes. Experiments show that SDeC significantly improves identity preservation while maintaining scene diversity.
该论文解决了文本到图像(T2I)生成中身份(ID)漂移的问题,即由于场景上下文化,模型生成的图像身份会发生改变。作者提出了一种名为场景去上下文化(SDeC)的新方法,该方法通过量化SVD方向稳定性来识别并抑制ID提示嵌入中的场景-ID相关性,从而允许针对每个场景使用而无需事先访问所有目标场景。实验表明,SDeC在保持场景多样性的同时显著提高了身份保真度。
Catching the Details: Self-Distilled RoI Predictors for Fine-Grained MLLM Perception
Authors: Yuheng Shi, Xiaohuan Pei, Minjing Dong, Chang Xu
First: 2025-09-21T06:54:04+00:00 · Latest: 2025-10-16T10:53:17+00:00
Comments: 20 pages, 6 figures
Abstract
Multimodal Large Language Models (MLLMs) require high-resolution visual
information to perform fine-grained perception, yet processing entire
high-resolution images is computationally prohibitive. While recent methods
leverage a Region-of-Interest (RoI) mechanism to focus on salient areas, they
typically present a difficult trade-off: training-based approaches depend on
large-scale annotated datasets, while training-free methods that utilize the
model's internal attention are computationally inefficient and less accurate,
requiring either multi-pass prefill stages or reliance on the slow
auto-regressive decoding process. In this paper, we propose an efficient,
annotation-free Self-Distilled Region Proposal Network (SD-RPN) that resolves
this trade-off. The SD-RPN is built around a pipeline that transforms the noisy
attention maps from the MLLM's middle layers into high-quality pseudo-RoI
labels by explicitly denoising the signal and resolving ambiguity. We use these
labels to train a lightweight Region Proposal Network (RPN) that learns a more
precise localization. This RPN is also highly efficient, predicting the RoI in
a single forward pass using features from the MLLM's middle layers, decoupling
RoI identification from the auto-regressive generation and avoiding costly
multi-pass operations. To validate our approach, we integrate the framework
into multiple MLLM families. Despite being trained on only a few (e.g. 10K)
question-answer pairs, our method demonstrates exceptional data efficiency and
generalization, achieving over a 10% absolute accuracy improvement on unseen
benchmarks, including TextVQA, DocVQA, and V-Star. Our work presents a
practical and scalable solution for enhancing the fine-grained perception of
MLLMs without requiring costly supervision or full model fine-tuning. Code is
available at https://github.com/YuHengsss/SD-RPN.
中文标题/摘要
标题:捕捉细节:自蒸馏RoI预测器实现细粒度MLLM感知
多模态大型语言模型(MLLMs)需要高分辨率的视觉信息来执行细粒度感知,但处理整个高分辨率图像在计算上是不可行的。虽然最近的方法利用区域-of-兴趣(RoI)机制专注于显著区域,但它们通常面临一个困难的权衡:基于训练的方法依赖于大规模标注数据集,而无需训练的方法利用模型内部注意力则计算效率低且准确性较低,需要多轮预填充阶段或依赖于缓慢的自回归解码过程。在本文中,我们提出了一种高效的、无需标注的自蒸馏区域建议网络(SD-RPN),解决了这一权衡问题。SD-RPN围绕一个管道构建,该管道将MLLM中间层的嘈杂注意力图转换为高质量的伪RoI标签,通过明确去噪和解决歧义。我们使用这些标签训练一个轻量级的区域建议网络(RPN),学习更精确的定位。该RPN也非常高效,在单次前向传播中使用MLLM中间层的特征预测RoI,将RoI识别与自回归生成解耦,避免了昂贵的多轮操作。为了验证我们的方法,我们将框架集成到多个MLLM家族中。尽管仅在少量(例如10K)问答对上进行训练,我们的方法仍表现出色,数据效率和泛化能力极佳,在TextVQA、DocVQA和V-Star等未见基准上实现了超过10%的绝对准确率提升。我们的工作提供了一种无需昂贵监督或全面模型微调的实用且可扩展的解决方案,以增强MLLM的细粒度感知。代码可在https://github.com/YuHengsss/SD-RPN获取。
Summary / 总结
This paper addresses the challenge of fine-grained perception in Multimodal Large Language Models (MLLMs) by proposing an efficient, annotation-free Self-Distilled Region Proposal Network (SD-RPN). The SD-RPN transforms noisy attention maps from the MLLM's middle layers into high-quality pseudo-RoI labels, which are then used to train a lightweight RPN for more precise localization. The method avoids the need for large-scale annotated datasets and multi-pass operations, making it computationally efficient. Experiments show that the SD-RPN achieves over a 10% absolute accuracy improvement on benchmarks like TextVQA, DocVQA, and V-Star, even when trained on a small dataset of question-answer pairs.
本文提出了一种高效的无注释Self-Distilled Region Proposal Network (SD-RPN),通过将MLLM中间层的嘈杂注意力图转换为高质量的伪RoI标签,并使用这些标签训练轻量级RPN来进行精确定位,解决了多模态大型语言模型(MLLMs)的细粒度感知问题。该方法避免了大规模标注数据集或计算昂贵的多遍操作的需求,仅使用少量(例如10K)问题-答案对进行训练,就在TextVQA、DocVQA和V-Star等基准测试中实现了超过10%的绝对准确率提升。
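One simple way to turn a noisy attention map into pseudo-RoI labels is to label only confident cells and ignore the ambiguous middle band. The sketch below illustrates that idea. The abstract does not specify the denoising procedure, so the peak normalization, the two thresholds `hi` and `lo`, and the ignore label `-1` are assumptions.

```python
from typing import List, Sequence


def pseudo_roi_labels(
    attn: Sequence[Sequence[float]], hi: float = 0.6, lo: float = 0.2
) -> List[List[int]]:
    """Label each cell 1 (RoI), 0 (background), or -1 (ambiguous, ignored in training).

    Values are first divided by the map's peak so the thresholds are relative.
    """
    peak = max(max(row) for row in attn)
    labels = []
    for row in attn:
        out = []
        for a in row:
            r = a / peak if peak > 0 else 0.0
            if r >= hi:
                out.append(1)
            elif r <= lo:
                out.append(0)
            else:
                out.append(-1)
        labels.append(out)
    return labels


# Toy 3x3 attention map with one ambiguous cell in the center.
attn = [
    [0.0, 0.1, 0.2],
    [0.1, 0.5, 1.0],
    [0.0, 0.8, 0.9],
]
labels = pseudo_roi_labels(attn)
```

A lightweight predictor trained on such labels would skip the `-1` cells in its loss, so uncertain regions do not push it in either direction.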
Exploring Cross-Modal Flows for Few-Shot Learning
Authors: Ziqi Jiang, Yanghao Wang, Long Chen
First: 2025-10-16T10:32:48+00:00 · Latest: 2025-10-16T10:32:48+00:00
Comments: 13 pages, 6 figures
Abstract
Aligning features from different modalities is one of the most fundamental
challenges for cross-modal tasks. Although pre-trained vision-language models
can achieve a general alignment between image and text, they often require
parameter-efficient fine-tuning (PEFT) for further adjustment. Today's PEFT
methods (e.g., prompt tuning, LoRA-based, or adapter-based) always selectively
fine-tune a subset of parameters, which can slightly adjust either visual or
textual features, and avoid overfitting. In this paper, we are the first to
highlight that all existing PEFT methods perform one-step adjustment, which is
insufficient for complex (or difficult) datasets, where features of different
modalities are highly entangled. To this end, we propose the first
model-agnostic multi-step adjustment approach by learning a cross-modal
velocity field: Flow Matching Alignment (FMA). Specifically, to ensure the
correspondence between categories during training, we first utilize a fixed
coupling strategy. Then, we propose a noise augmentation strategy to alleviate
the data scarcity issue. Finally, we design an early-stopping solver, which
terminates the transformation process earlier, improving both efficiency and
accuracy. Compared with one-step PEFT methods, FMA has the multi-step
rectification ability to achieve more precise and robust alignment. Extensive
results have demonstrated that FMA can consistently yield significant
performance gains across various benchmarks and backbones, particularly on
challenging datasets.
中文标题/摘要
标题:探索少样本学习中的跨模态流动
不同模态特征的对齐是跨模态任务中最基本的挑战之一。尽管预训练的视觉-语言模型可以在图像和文本之间实现一般对齐,但它们通常需要参数高效微调(PEFT)进行进一步调整。今天的PEFT方法(例如提示调优、LoRA基的或适配器基的)总是选择性地微调一部分参数,这可以轻微调整视觉或文本特征,避免过拟合。在本文中,我们首次指出,所有现有的PEFT方法都是一步调整。对于特征高度纠缠的复杂(或困难)数据集来说是不够的。为此,我们提出了第一个模型无关的多步调整方法,通过学习跨模态速度场:流动匹配对齐(FMA)。具体来说,为了在训练过程中确保类别的对应性,我们首先使用固定耦合策略。然后,我们提出了一种噪声增强策略来缓解数据稀缺问题。最后,我们设计了一个早期停止求解器,该求解器在更早的阶段终止变换过程,提高效率和准确性。与一步PEFT方法相比,FMA具有多步校正能力,可以实现更精确和稳健的对齐。广泛的结果表明,FMA可以在各种基准和骨干网络上一致地获得显著的性能提升,特别是在具有挑战性的数据集上。
Summary / 总结
This paper addresses the challenge of aligning features from different modalities in cross-modal tasks, particularly in few-shot learning scenarios. It introduces a novel multi-step adjustment approach called Flow Matching Alignment (FMA) that learns a cross-modal velocity field. The method uses a fixed coupling strategy, noise augmentation to handle data scarcity, and an early-stopping solver to improve efficiency and accuracy. Experimental results show that FMA outperforms one-step parameter-efficient fine-tuning methods across various benchmarks and backbones, especially on challenging datasets.
本文解决了跨模态任务中不同模态特征对齐的挑战,提出了多步调整方法Flow Matching Alignment (FMA),改进了现有的参数高效微调(PEFT)方法。FMA 学习了一个跨模态速度场,并包含固定耦合策略、噪声增强和早期停止求解器,以实现更精确和稳健的对齐。实验表明,FMA 在各种基准和骨干网络上优于单步 PEFT 方法,特别是在具有挑战性的数据集上表现出色。
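Multi-step adjustment with early stopping can be sketched as a fixed-step Euler solver that integrates a velocity field and halts before the final step. In FMA the velocity field is learned; the sketch substitutes a hypothetical constant field, and the function and parameter names (`euler_transport`, `stop_after`) are assumptions, not the paper's API.

```python
from typing import Callable, List, Optional, Sequence

VelocityField = Callable[[Sequence[float], float], Sequence[float]]


def euler_transport(
    x0: Sequence[float],
    velocity: VelocityField,
    num_steps: int = 4,
    stop_after: Optional[int] = None,
) -> List[float]:
    """Integrate dx/dt = velocity(x, t) from t=0 toward t=1 with fixed Euler steps.

    `stop_after` ends the integration early, after that many steps.
    """
    steps_to_run = num_steps if stop_after is None else min(stop_after, num_steps)
    dt = 1.0 / num_steps
    x = list(x0)
    for k in range(steps_to_run):
        v = velocity(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x: Sequence[float], t: float) -> Sequence[float]:
    """Stand-in for a learned model: a constant field pointing to the target feature."""
    return [1.0, -2.0]


full = euler_transport([0.0, 0.0], toy_velocity, num_steps=4)
early = euler_transport([0.0, 0.0], toy_velocity, num_steps=4, stop_after=2)
```

With this constant field, the full run reaches `[1.0, -2.0]` and stopping after two of four steps yields the halfway point `[0.5, -1.0]`.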
PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
Authors: Cheng Cui, Ting Sun, Suyin Liang, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Xueqing Wang, Changda Zhou, Hongen Liu, Manhui Lin, Yue Zhang, Yubo Zhang, Handong Zheng, Jing Zhang, Jun Zhang, Yi Liu, Dianhai Yu, Yanjun Ma
First: 2025-10-16T10:18:48+00:00 · Latest: 2025-10-16T10:18:48+00:00
Abstract
In this report, we propose PaddleOCR-VL, a SOTA and resource-efficient model
tailored for document parsing. Its core component is PaddleOCR-VL-0.9B, a
compact yet powerful vision-language model (VLM) that integrates a NaViT-style
dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model to
enable accurate element recognition. This innovative model efficiently supports
109 languages and excels in recognizing complex elements (e.g., text, tables,
formulas, and charts), while maintaining minimal resource consumption. Through
comprehensive evaluations on widely used public benchmarks and in-house
benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level document
parsing and element-level recognition. It significantly outperforms existing
solutions, exhibits strong competitiveness against top-tier VLMs, and delivers
fast inference speeds. These strengths make it highly suitable for practical
deployment in real-world scenarios.
Summary / 总结
PaddleOCR-VL is a state-of-the-art and resource-efficient model for document parsing, featuring PaddleOCR-VL-0.9B, a compact vision-language model that integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model. This model supports 109 languages and excels in recognizing complex elements such as text, tables, formulas, and charts. Comprehensive evaluations show that PaddleOCR-VL outperforms existing solutions in both page-level document parsing and element-level recognition, while maintaining fast inference speeds and strong competitiveness against top-tier VLMs.
PaddleOCR-VL 是一种面向文档解析的先进且资源高效的模型,其核心是 PaddleOCR-VL-0.9B,该模型结合了 NaViT 风格的动态分辨率视觉编码器和 ERNIE-4.5-0.3B 语言模型。该模型支持 109 种语言,并且在识别文本、表格、公式和图表等复杂元素方面表现出色。全面的评估显示,PaddleOCR-VL 在页面级文档解析和元素级识别方面均优于现有解决方案,同时保持了快速推理速度和与顶级视觉语言模型的竞争力。
Noise Projection: Closing the Prompt-Agnostic Gap Behind Text-to-Image Misalignment in Diffusion Models
Authors: Yunze Tong, Didi Zhu, Zijing Hu, Jinluan Yang, Ziyu Zhao
First: 2025-10-16T10:14:34+00:00 · Latest: 2025-10-16T10:14:34+00:00
Comments: Appendix will be appended soon
Abstract
In text-to-image generation, different initial noises induce distinct
denoising paths with a pretrained Stable Diffusion (SD) model. While this
pattern could output diverse images, some of them may fail to align well with
the prompt. Existing methods alleviate this issue either by altering the
denoising dynamics or by drawing multiple noises and conducting post-selection.
In this paper, we attribute the misalignment to a training-inference mismatch:
during training, prompt-conditioned noises lie in a prompt-specific subset of
the latent space, whereas at inference the noise is drawn from a
prompt-agnostic Gaussian prior. To close this gap, we propose a noise projector
that applies text-conditioned refinement to the initial noise before denoising.
Conditioned on the prompt embedding, it maps the noise to a prompt-aware
counterpart that better matches the distribution observed during SD training,
without modifying the SD model. Our framework consists of these steps: we first
sample some noises and obtain token-level feedback for their corresponding
images from a vision-language model (VLM), then distill these signals into a
reward model, and finally optimize the noise projector via a quasi-direct
preference optimization. Our design has two benefits: (i) it requires no
reference images or handcrafted priors, and (ii) it incurs small inference
cost, replacing multi-sample selection with a single forward pass. Extensive
experiments further show that our prompt-aware noise projection improves
text-image alignment across diverse prompts.
中文标题/摘要
标题:噪声投影:在扩散模型中弥合文本到图像对齐偏差的提示无关差距
在文本到图像生成中,不同的初始噪声会引导出不同的去噪路径,使用预训练的稳定扩散(SD)模型。虽然这种模式可以生成多种多样的图像,但其中一些图像可能无法很好地与提示对齐。现有方法通过改变去噪动态或绘制多个噪声并进行后选来缓解这一问题。在本文中,我们将对齐偏差归因于训练与推理之间的不匹配:在训练过程中,提示条件化的噪声位于潜空间的提示特定子集中,而在推理过程中,噪声是从提示无关的高斯先验中抽取的。为了弥合这一差距,我们提出了一种噪声投影器,在去噪之前对初始噪声应用文本条件化的细化。基于提示嵌入,它将噪声映射到一个提示感知的对应物,更好地匹配SD训练期间观察到的分布,而不修改SD模型。我们的框架包括以下步骤:首先,我们采样一些噪声,并从视觉语言模型(VLM)获得它们对应图像的标记级反馈,然后将这些信号提炼成奖励模型,最后通过准直接偏好优化优化噪声投影器。我们的设计具有两个优点:(i) 它不需要参考图像或手工制作的先验,(ii) 它的推理成本较低,用单次前向传递替代多样本选择。广泛的实验进一步表明,我们的提示感知噪声投影可以提高不同提示下的文本-图像对齐。
Summary / 总结
This paper addresses the issue of text-to-image misalignment in diffusion models by proposing a noise projector that refines the initial noise based on the prompt embedding, without modifying the Stable Diffusion model. The method samples noises, obtains token-level feedback on the corresponding images from a vision-language model, distills that feedback into a reward model, and optimizes the noise projector via quasi-direct preference optimization. Experiments demonstrate that this approach improves text-image alignment without requiring reference images or handcrafted priors, and that it keeps inference cost small by replacing multi-sample selection with a single forward pass.
本文提出了一种噪声投影方法,通过基于提示嵌入对初始噪声进行细化来解决文本到图像生成中的对齐问题。该方法包括采样噪声、从视觉语言模型获取标记级反馈,并通过奖励模型优化噪声投影器。实验表明,这种方法在不需参考图像或手工制作先验的情况下提高了文本与图像的对齐,并且相比现有方法具有较低的推理成本。
OmnimatteZero: Fast Training-free Omnimatte with Pre-trained Video Diffusion Models
Authors: Dvir Samuel, Matan Levy, Nir Darshan, Gal Chechik, Rami Ben-Ari
Venue: SIGGRAPH ASIA 2025
First: 2025-03-23T11:26:48+00:00 · Latest: 2025-10-16T09:59:36+00:00
Comments: Accepted to SIGGRAPH ASIA 2025. Project Page:
https://dvirsamuel.github.io/omnimattezero.github.io/
Abstract
In Omnimatte, one aims to decompose a given video into semantically
meaningful layers, including the background and individual objects along with
their associated effects, such as shadows and reflections. Existing methods
often require extensive training or costly self-supervised optimization. In
this paper, we present OmnimatteZero, a training-free approach that leverages
off-the-shelf pre-trained video diffusion models for omnimatte. It can remove
objects from videos, extract individual object layers along with their effects,
and composite those objects onto new videos. These are accomplished by adapting
zero-shot image inpainting techniques for video object removal, a task they
fail to handle effectively out-of-the-box. To overcome this, we introduce
temporal and spatial attention guidance modules that steer the diffusion
process for accurate object removal and temporally consistent background
reconstruction. We further show that self-attention maps capture information
about the object and its footprints and use them to inpaint the object's
effects, leaving a clean background. Additionally, through simple latent
arithmetic, object layers can be isolated and recombined seamlessly with new
video layers to produce new videos. Evaluations show that OmnimatteZero not
only achieves superior performance in terms of background reconstruction but
also sets a new record for the fastest Omnimatte approach, achieving real-time
performance with minimal frame runtime.
中文标题/摘要
标题:OmnimatteZero:基于预训练视频扩散模型的无需训练的全景 matte
在全景 matte 中,目标是将给定的视频分解为语义上有意义的图层,包括背景和个体对象及其相关的效果,如阴影和反射。现有方法通常需要大量的训练或昂贵的自我监督优化。在本文中,我们提出了无需训练的方法 OmnimatteZero,该方法利用现成的预训练视频扩散模型进行全景 matte。它可以移除视频中的对象,提取个体对象图层及其效果,并将这些对象合成为新的视频。这些操作通过将零样本图像修复技术适应于视频对象移除来实现,这是它们无法有效处理的任务。为了解决这个问题,我们引入了时间注意力和空间注意力引导模块,以引导扩散过程,实现准确的对象移除和时间上一致的背景重建。我们还展示了自我注意力图捕获关于对象及其足迹的信息,并使用它们来修复对象的效果,从而留下干净的背景。此外,通过简单的潜在算术,可以隔离对象图层并无缝地与新的视频图层重新组合以生成新的视频。评估表明,OmnimatteZero 不仅在背景重建方面取得了更好的性能,而且是最快的全景 matte 方法,实现了接近实时的性能,帧运行时间极小。
Summary / 总结
OmnimatteZero is a training-free method that uses pre-trained video diffusion models to decompose videos into background and object layers, including their effects. It employs zero-shot image inpainting techniques and introduces temporal and spatial attention guidance modules to accurately remove objects and reconstruct backgrounds. Experimental results demonstrate that OmnimatteZero outperforms existing methods in background reconstruction and achieves real-time performance with minimal frame runtime.
OmnimatteZero 是一种无需训练的方法,利用预训练的视频扩散模型将视频分解为背景和对象层,包括它们的效果。它使用零样本图像修复技术,并引入注意力引导模块以准确移除对象并重建背景。实验表明,OmnimatteZero 在背景重建方面优于现有方法,并实现了接近实时的性能,帧运行时间极小。
Internet of Agents: Fundamentals, Applications, and Challenges
Authors: Yuntao Wang, Shaolong Guo, Yanghe Pan, Zhou Su, Fahao Chen, Tom H. Luan, Peng Li, Jiawen Kang, Dusit Niyato
First: 2025-05-12T02:04:37+00:00 · Latest: 2025-10-16T09:32:37+00:00
Comments: 25 pages, 10 figures, 10 tables. Accepted by IEEE TCCN in Oct. 2025
Abstract
With the rapid proliferation of large language models and vision-language
models, AI agents have evolved from isolated, task-specific systems into
autonomous, interactive entities capable of perceiving, reasoning, and acting
without human intervention. As these agents proliferate across virtual and
physical environments, from virtual assistants to embodied robots, the need for
a unified, agent-centric infrastructure becomes paramount. In this survey, we
introduce the Internet of Agents (IoA) as a foundational framework that enables
seamless interconnection, dynamic discovery, and collaborative orchestration
among heterogeneous agents at scale. We begin by presenting a general IoA
architecture, highlighting its hierarchical organization, distinguishing
features relative to the traditional Internet, and emerging applications. Next,
we analyze the key operational enablers of IoA, including capability
notification and discovery, adaptive communication protocols, dynamic task
matching, consensus and conflict-resolution mechanisms, and incentive models.
Finally, we identify open research directions toward building resilient and
trustworthy IoA ecosystems.
中文标题/摘要
标题:代理互联网:基础、应用与挑战
随着大型语言模型和视觉-语言模型的迅速普及,AI代理从孤立的任务特定系统演变为无需人类干预即可感知、推理和行动的自主交互实体。随着这些代理在虚拟和物理环境中普及,从虚拟助手到具身机器人,构建统一的代理中心基础设施变得至关重要。在这篇综述中,我们介绍了代理互联网(IoA)作为基础框架,使大规模异构代理能够无缝互联、动态发现和协作编排。我们首先介绍了IoA的一般架构,强调其分层组织、相对于传统互联网的独特特征以及新兴应用。接着,我们分析了IoA的关键操作使能器,包括能力通知和发现、自适应通信协议、动态任务匹配、共识和冲突解决机制以及激励模型。最后,我们指出了构建稳健和可信赖的IoA生态系统的开放研究方向。
Summary / 总结
This survey traces the evolution of AI agents into autonomous, interactive entities and introduces the Internet of Agents (IoA) as a framework for their seamless interconnection, dynamic discovery, and collaborative orchestration. It presents a general IoA architecture and analyzes the key operational enablers: capability notification and discovery, adaptive communication protocols, dynamic task matching, consensus and conflict-resolution mechanisms, and incentive models. It closes by identifying open research directions toward resilient and trustworthy IoA ecosystems.
本文探讨了AI代理从孤立的任务特定系统演变为自主交互实体的过程,并引入了互联网代理(IoA)作为其无缝互联和协作编排的基础框架。研究介绍了IoA架构、能力发现和自适应通信协议等关键运行使能器,并指出了构建稳健和可信赖的IoA生态系统的研究方向。主要发现包括动态任务匹配和共识机制对于构建稳健和可信赖的IoA生态系统的重要性。
ChartGalaxy: A Dataset for Infographic Chart Understanding and Generation
Authors: Zhen Li, Duan Li, Yukai Guo, Xinyuan Guo, Bowen Li, Lanxi Xiao, Shenyu Qiao, Jiashu Chen, Zijian Wu, Hui Zhang, Xinhuan Shu, Shixia Liu
First: 2025-05-24T12:06:22+00:00 · Latest: 2025-10-16T09:17:24+00:00
Comments: 58 pages
Abstract
Infographic charts are a powerful medium for communicating abstract data by
combining visual elements (e.g., charts, images) with textual information.
However, their visual and structural richness poses challenges for large
vision-language models (LVLMs), which are typically trained on plain charts. To
bridge this gap, we introduce ChartGalaxy, a million-scale dataset designed to
advance the understanding and generation of infographic charts. The dataset is
constructed through an inductive process that identifies 75 chart types, 440
chart variations, and 68 layout templates from real infographic charts and uses
them to create synthetic ones programmatically. We showcase the utility of this
dataset through: 1) improving infographic chart understanding via fine-tuning,
2) benchmarking code generation for infographic charts, and 3) enabling
example-based infographic chart generation. By capturing the visual and
structural complexity of real design, ChartGalaxy provides a useful resource
for enhancing multimodal reasoning and generation in LVLMs.
中文标题/摘要
标题:ChartGalaxy:用于信息图表图表理解和生成的数据集
信息图表是一种通过结合视觉元素(例如图表、图像)与文本信息来传达抽象数据的强大媒介。然而,它们的视觉和结构丰富性为大型视觉语言模型(LVLMs)带来了挑战,这些模型通常是在简单的图表上进行训练的。为了弥合这一差距,我们引入了ChartGalaxy,这是一个百万规模的数据集,旨在推进信息图表的理解和生成。该数据集通过归纳过程构建,从实际的信息图表中识别出75种图表类型、440种图表变体和68种布局模板,并使用它们来程序化地创建合成信息图表。我们通过以下方式展示了该数据集的用途:1)通过微调提高信息图表的理解能力,2)对信息图表的代码生成进行基准测试,3)实现基于示例的信息图表生成。通过捕捉实际设计的视觉和结构复杂性,ChartGalaxy为增强LVLMs中的多模态推理和生成提供了有用的资源。
Summary / 总结
The research aims to address the challenges that large vision-language models face when dealing with the complex visual and structural elements of infographic charts. ChartGalaxy, a million-scale dataset, is introduced to help improve the understanding and generation of these charts. The dataset is created by identifying 75 chart types, 440 chart variations, and 68 layout templates in real infographics and using them to generate synthetic ones programmatically. The authors demonstrate its utility in three ways: improving infographic chart understanding via fine-tuning, benchmarking code generation for infographic charts, and enabling example-based infographic chart generation.
ChartGalaxy 是一个大规模数据集,旨在通过 LVLMs 提高对信息图表的理解和生成能力。它包含 75 种图表类型、440 种变体和 68 种布局模板,通过从真实图表中归纳创建。该数据集通过微调增强信息图表理解,用于代码生成基准测试,并支持基于示例的信息图表生成,解决了真实设计的复杂性问题。
InfoDet: A Dataset for Infographic Element Detection
Authors: Jiangning Zhu, Yuxing Zhou, Zheng Wang, Juntao Yao, Yima Gu, Yuhui Yuan, Shixia Liu
Venue: ICLR 2026
First: 2025-05-23T04:56:07+00:00 · Latest: 2025-10-16T09:10:01+00:00
Comments: Submitted to ICLR 2026
Abstract
Given the central role of charts in scientific, business, and communication
contexts, enhancing the chart understanding capabilities of vision-language
models (VLMs) has become increasingly critical. A key limitation of existing
VLMs lies in their inaccurate visual grounding of infographic elements,
including charts and human-recognizable objects (HROs) such as icons and
images. However, chart understanding often requires identifying relevant
elements and reasoning over them. To address this limitation, we introduce
InfoDet, a dataset designed to support the development of accurate object
detection models for charts and HROs in infographics. It contains 11,264 real
and 90,000 synthetic infographics, with over 14 million bounding box
annotations. These annotations are created by combining model-in-the-loop
and programmatic methods. We demonstrate the usefulness of InfoDet through
three applications: 1) constructing a Thinking-with-Boxes scheme to boost the
chart understanding performance of VLMs, 2) comparing existing object detection
models, and 3) applying the developed detection model to document layout and UI
element detection.
中文标题/摘要
标题:InfoDet:信息图表元素检测数据集
鉴于图表在科学、商业和交流等领域的核心作用,增强视觉-语言模型(VLMs)的图表理解能力变得越来越关键。现有VLMs的一个主要限制在于它们对信息图表元素,包括图表和人可识别对象(HROs,如图标和图像)的不准确视觉定位。然而,图表理解通常需要识别相关元素并进行推理。为了解决这一限制,我们引入了InfoDet,一个旨在支持开发准确的图表和HROs检测模型的数据集。它包含11,264个真实和90,000个合成的信息图表,以及超过1400万个边界框注释。这些注释通过结合模型在环中和程序化方法创建。我们通过三个应用展示了InfoDet的价值:1)构建一个思考框方案以提升VLMs的图表理解性能,2)比较现有的检测模型,3)将开发的检测模型应用于文档布局和UI元素检测。
Summary / 总结
InfoDet is a dataset aimed at improving the chart understanding capabilities of vision-language models by accurately detecting infographic elements such as charts and human-recognizable objects. It includes 11,264 real and 90,000 synthetic infographics with over 14 million bounding box annotations created using model-in-the-loop and programmatic methods. The dataset is used to enhance chart understanding performance, compare object detection models, and apply detection models to document layout and UI element detection.
InfoDet 是一个旨在通过增强视觉语言模型对信息图元素的视觉定位能力来提升其图表理解能力的数据集。它包含11,264个真实和90,000个合成的信息图,带有超过1400万个边界框注释,这些注释是通过模型在环和程序化方法创建的。该数据集用于开发适用于信息图中图表和人类可识别对象的准确检测模型,并通过三个应用展示了其有效性:提升视觉语言模型的图表理解性能、比较现有的检测模型以及将开发的检测模型应用于文档布局和UI元素检测。
Backpropagation-Free Test-Time Adaptation via Probabilistic Gaussian Alignment
Authors: Youjia Zhang, Youngeun Kim, Young-Geun Choi, Hongyeob Kim, Huiling Liu, Sungeun Hong
First: 2025-08-21T13:42:49+00:00 · Latest: 2025-10-16T08:27:41+00:00
Abstract
Test-time adaptation (TTA) enhances the zero-shot robustness under
distribution shifts by leveraging unlabeled test data during inference. Despite
notable advances, several challenges still limit its broader applicability.
First, most methods rely on backpropagation or iterative optimization, which
limits scalability and hinders real-time deployment. Second, they lack explicit
modeling of class-conditional feature distributions. This modeling is crucial
for producing reliable decision boundaries and calibrated predictions, but it
remains underexplored due to the lack of both source data and supervision at
test time. In this paper, we propose ADAPT, an Advanced Distribution-Aware and
backPropagation-free Test-time adaptation method. We reframe TTA as a Gaussian
probabilistic inference task by modeling class-conditional likelihoods using
gradually updated class means and a shared covariance matrix. This enables
closed-form, training-free inference. To correct potential likelihood bias, we
introduce lightweight regularization guided by CLIP priors and a historical
knowledge bank. ADAPT requires no source data, no gradient updates, and no full
access to target data, supporting both online and transductive settings.
Extensive experiments across diverse benchmarks demonstrate that our method
achieves state-of-the-art performance under a wide range of distribution shifts
with superior scalability and robustness.
中文标题/摘要
标题:无需反向传播的测试时自适应通过概率高斯对齐
测试时自适应(TTA)通过在推理过程中利用未标记的测试数据来增强零样本鲁棒性,从而在分布偏移下提高鲁棒性。尽管取得了显著进展,但几个挑战仍然限制了其更广泛的适用性。首先,大多数方法依赖于反向传播或迭代优化,这限制了可扩展性并阻碍了实时部署。其次,它们缺乏对类条件特征分布的显式建模。这种建模对于生成可靠决策边界和校准预测至关重要,但由于缺乏源数据和测试时的监督,这种建模仍然未被充分探索。在本文中,我们提出了一种无需反向传播的先进分布感知测试时自适应方法ADAPT。我们将TTA重新定义为一个高斯概率推理任务,通过使用逐渐更新的类均值和共享协方差矩阵来建模类条件似然性。这使得可以进行闭式、无需训练的推理。为了纠正潜在的似然偏差,我们引入了由CLIP先验和历史知识库引导的轻量级正则化。ADAPT不需要源数据、不需要梯度更新,并且不需要完全访问目标数据,支持在线和归纳设置。在多种基准上的广泛实验表明,我们的方法在各种分布偏移下实现了最先进的性能,具有更高的可扩展性和鲁棒性。
Summary / 总结
The research aims to enhance test-time adaptation (TTA) for robustness under distribution shifts by leveraging unlabeled test data. ADAPT, a backpropagation-free method, models class-conditional likelihoods as Gaussians with gradually updated class means and a shared covariance matrix, which enables closed-form, training-free inference. It introduces lightweight regularization guided by CLIP priors and a historical knowledge bank to correct likelihood bias, and achieves state-of-the-art performance across various benchmarks with better scalability and robustness.
论文提出了ADAPT方法,将TTA重新定义为高斯概率推断任务,避免了反向传播和迭代优化,实现了可扩展的实时部署。ADAPT使用更新的类均值和共享协方差矩阵来建模类条件似然性,并引入正则化来纠正似然偏差。实验表明,ADAPT在各种基准测试中表现出色,实现了最先进的性能,并且在分布变化下具有更好的可扩展性和鲁棒性。
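The closed-form inference described in the ADAPT abstract can be illustrated with a minimal sketch, assuming a textbook shared-covariance Gaussian classifier. The function names, the running-average mean update, and the toy numbers are illustrative assumptions rather than the authors' implementation, and the CLIP-guided regularization and knowledge bank are left out.

```python
import numpy as np


def gaussian_logits(x, means, cov, log_prior):
    """Class scores for Gaussians with per-class means and one shared covariance.

    Up to a class-independent constant, log p(x | k) + log p(k) equals
    mu_k^T S^-1 x - 0.5 * mu_k^T S^-1 mu_k + log p(k).
    """
    # Solve S W = means^T rather than forming the inverse explicitly.
    w = np.linalg.solve(cov, means.T)          # shape (d, K)
    bias = -0.5 * np.sum(means.T * w, axis=0)  # shape (K,)
    return x @ w + bias + log_prior


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def update_mean(mean, count, x):
    """Running average of a class mean (assumed update rule, for illustration)."""
    count += 1
    return mean + (x - mean) / count, count


# Toy setup: two classes in a 2-D feature space (all values are made up).
means = np.array([[0.0, 0.0], [3.0, 3.0]])
cov = np.array([[1.0, 0.2], [0.2, 1.0]])
log_prior = np.log(np.array([0.5, 0.5]))

x = np.array([2.5, 2.0])
probs = softmax(gaussian_logits(x, means, cov, log_prior))
pred = int(np.argmax(probs))

# Move the predicted class mean toward the new test feature, no gradients needed.
new_mean, new_count = update_mean(means[pred], 1, x)
```

Because the covariance is shared, the scores are linear in the feature vector, so prediction needs only one linear solve and no backpropagation.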
Training-Free Personalization via Retrieval and Reasoning on Fingerprints
Authors: Deepayan Das, Davide Talon, Yiming Wang, Massimiliano Mancini, Elisa Ricci
First: 2025-03-24T12:36:24+00:00 · Latest: 2025-10-16T08:19:45+00:00
Abstract
Vision Language Models (VLMs) have led to major improvements in multimodal
reasoning, yet they still struggle to understand user-specific concepts.
Existing personalization methods address this limitation but rely heavily on
training procedures that can be either costly or unpleasant for individual
users. We depart from existing work and, for the first time, explore the
training-free setting in the context of personalization. We propose a novel
method, Retrieval and Reasoning for Personalization (R2P), leveraging internal
knowledge of VLMs. First, we leverage VLMs to extract the concept fingerprint,
i.e., key attributes uniquely defining the concept within its semantic class.
When a query arrives, the most similar fingerprints are retrieved and scored
via chain-of-thought reasoning. To reduce the risk of hallucinations, the
scores are validated through cross-modal verification at the attribute level:
in case of a discrepancy between the scores, R2P refines the concept
association via pairwise multimodal matching, where the retrieved fingerprints
and their images are directly compared with the query. We validate R2P on two
publicly available benchmarks and a newly introduced dataset, Personal Concepts
with Visual Ambiguity (PerVA), for concept identification highlighting
challenges in visual ambiguity. R2P consistently outperforms state-of-the-art
approaches on various downstream tasks across all benchmarks. Code will be
available upon acceptance.
中文标题/摘要
标题:基于检索与推理的无训练个性化
视觉语言模型(VLMs)在多模态推理方面取得了重大进展,但仍难以理解用户特定的概念。现有个性化方法解决了这一限制,但严重依赖于训练过程,这可能对个别用户来说既昂贵又不愉快。我们从现有工作出发,首次在个性化背景下探索无训练设置。我们提出了一种新颖的方法——个性化中的检索与推理(R2P),利用VLMs内部知识。首先,我们利用VLMs提取概念指纹,即定义概念在语义类中独特属性的关键特征。当查询到达时,检索最相似的概念指纹并通过链式推理评分。为了降低幻觉风险,评分通过属性级别的跨模态验证进行验证:如果评分之间存在差异,R2P将通过成对的多模态匹配进行概念关联的细化,其中检索到的概念指纹及其图像直接与查询进行比较。我们在两个公开可用的基准和一个新引入的数据集——视觉歧义中的个人概念(PerVA)——上验证了R2P,用于概念识别,突出了视觉歧义中的挑战。R2P在所有基准上的各种下游任务中均优于现有最佳方法。代码将在接受后提供。
Summary / 总结
The research aims to address the limitation of Vision Language Models (VLMs) in understanding user-specific concepts without relying on costly training procedures. The proposed method, Retrieval and Reasoning for Personalization (R2P), uses VLMs to extract concept fingerprints; when a query arrives, it retrieves the most similar fingerprints and scores them via chain-of-thought reasoning. It validates these scores through attribute-level cross-modal verification and, if discrepancies arise, refines concept associations via pairwise multimodal matching. R2P outperforms existing approaches on various downstream tasks across multiple benchmarks, including a newly introduced dataset, Personal Concepts with Visual Ambiguity (PerVA).
研究旨在解决视觉语言模型(VLMs)在理解用户特定概念时的局限性,而不依赖于昂贵的训练过程。提出了一种名为Retrieval and Reasoning for Personalization (R2P)的新方法,利用VLMs的内部知识提取概念指纹,并使用链式推理来评分最相似的概念指纹。R2P通过跨模态验证进一步验证这些评分,并通过成对的多模态匹配来细化概念关联。实验在公共基准和一个新的数据集,Personal Concepts with Visual Ambiguity (PerVA)上显示,R2P在各种下游任务中优于现有方法。
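The retrieval step in R2P can be pictured with a small sketch. The abstract does not say how fingerprint similarity is computed, so the Jaccard overlap between attribute sets used here is an assumption, as are the function names and the toy fingerprint database. The chain-of-thought scoring and cross-modal verification stages would require a VLM and are not shown.

```python
def jaccard(a, b):
    """Overlap between two attribute sets, in [0, 1]."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve(query_attrs, fingerprints, k=2):
    """Return the k concept names whose fingerprints best match the query."""
    ranked = sorted(
        fingerprints.items(),
        key=lambda item: jaccard(query_attrs, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical fingerprint database: concept name -> key attributes.
fingerprints = {
    "my_mug": {"ceramic", "blue", "chipped handle"},
    "office_mug": {"ceramic", "white", "logo"},
    "my_dog": {"corgi", "red collar"},
}

# Attributes a VLM might extract from a query image (made up for this example).
candidates = retrieve({"ceramic", "blue", "steam"}, fingerprints, k=2)
```

In a full pipeline the retrieved candidates would then be passed to the reasoning and verification stages that the abstract describes.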
WoW: Towards a World omniscient World model Through Embodied Interaction
Authors: Xiaowei Chi, Peidong Jia, Chun-Kai Fan, Xiaozhu Ju, Weishi Mi, Kevin Zhang, Zhiyuan Qin, Wanxin Tian, Kuangzhi Ge, Hao Li, Zezhong Qian, Anthony Chen, Qiang Zhou, Yueru Jia, Jiaming Liu, Yong Dai, Qingpo Wuwu, Chengyu Bai, Yu-Kai Wang, Ying Li, Lizhang Chen, Yong Bao, Zhiyuan Jiang, Jiacheng Zhu, Kai Tang, Ruichuan An, Yulin Luo, Qiuxuan Feng, Siyuan Zhou, Chi-min Chan, Chengkai Hou, Wei Xue, Sirui Han, Yike Guo, Shanghang Zhang, Jian Tang
First: 2025-09-26T17:59:07+00:00 · Latest: 2025-10-16T07:48:00+00:00
Abstract
Humans develop an understanding of intuitive physics through active
interaction with the world. This approach is in stark contrast to current video
models, such as Sora, which rely on passive observation and therefore struggle
with grasping physical causality. This observation leads to our central
hypothesis: authentic physical intuition of the world model must be grounded in
extensive, causally rich interactions with the real world. To test this
hypothesis, we present WoW, a 14-billion-parameter generative world model
trained on 2 million robot interaction trajectories. Our findings reveal that
the model's understanding of physics is a probabilistic distribution of
plausible outcomes, leading to stochastic instabilities and physical
hallucinations. Furthermore, we demonstrate that this emergent capability can
be actively constrained toward physical realism by SOPHIA, where
vision-language model agents evaluate the DiT-generated output and guide its
refinement by iteratively evolving the language instructions. In addition, a
co-trained Inverse Dynamics Model translates these refined plans into
executable robotic actions, thus closing the imagination-to-action loop. We
establish WoWBench, a new benchmark focused on physical consistency and causal
reasoning in video, where WoW achieves state-of-the-art performance in both
human and autonomous evaluation, demonstrating strong ability in physical
causality, collision dynamics, and object permanence. Our work provides
systematic evidence that large-scale, real-world interaction is a cornerstone
for developing physical intuition in AI. Models, data, and benchmarks will be
open-sourced.
中文标题/摘要
标题:WoW:通过具身交互构建世界全知模型
人类通过与世界的主动互动来理解直观的物理法则。这种方法与当前基于被动观察的视频模型(如Sora)形成鲜明对比,后者难以掌握物理因果关系。这一观察促使我们提出中心假设:世界模型中的真实物理直觉必须基于与现实世界进行广泛且因果丰富的互动。为了验证这一假设,我们提出了WoW,一个由200万机器人互动轨迹训练而成的140亿参数生成世界模型。我们的研究发现,该模型对物理法则的理解表现为可能结果的概率分布,导致随机不稳定性和物理幻觉。此外,我们展示了通过SOPHIA(一种视图-语言模型代理)可以主动约束这种新兴能力,使其向物理现实靠拢,其中视图-语言模型代理评估DiT生成的输出,并通过迭代演化语言指令来引导其改进。此外,一个共同训练的逆动力学模型将这些改进后的计划转化为可执行的机器人动作,从而闭合了想象到行动的循环。我们建立了WoWBench,一个专注于视频中物理一致性和因果推理的新基准,WoW在人类和自主评估中均达到最先进的性能,展示了强大的物理因果关系、碰撞动力学和物体持久性能力。我们的工作提供了大规模现实世界互动是开发物理直觉的基石的系统性证据。模型、数据和基准将开源。
Summary / 总结
The research aims to develop a world model that understands physics through active interaction, contrasting with passive observation methods. WoW, a 14-billion-parameter generative model, was trained on 2 million robot interaction trajectories. The model exhibits stochastic instabilities and physical hallucinations, but these can be constrained by SOPHIA, which refines language instructions based on vision-language model evaluations. The Inverse Dynamics Model translates these plans into robotic actions, closing the imagination-to-action loop. WoW outperforms existing models in physical consistency and causal reasoning, as demonstrated by the WoWBench benchmark, showing strong physical causality, collision dynamics, and object permanence. This work highlights the importance of real-world interaction for developing physical intuition in AI.
研究旨在通过主动交互来发展一个理解物理的世界模型,与依赖被动观察的方法形成对比。WoW是一个包含140亿参数的生成模型,通过200万次机器人交互轨迹进行训练。该模型表现出随机不稳定性和物理幻觉,但可以通过SOPHIA进行约束,SOPHIA基于视觉-语言模型评估来逐步优化语言指令。逆动力学模型将这些计划转化为可执行的机器人动作,从而完成想象到行动的闭环。WoW在物理一致性与因果推理方面超越了现有模型,在WoWBench基准测试中表现出强大的物理因果性、碰撞动力学和物体持久性。这项工作强调了大规模真实世界交互对于在AI中发展物理直觉的重要性。
Hi-Agent: Hierarchical Vision-Language Agents for Mobile Device Control
Authors: Zhe Wu, Hongjin Lu, Junliang Xing, Changhao Zhang, Yin Zhu, Yuhao Yang, Yuheng Jing, Kai Li, Kun Shao, Jianye Hao, Jun Wang, Yuanchun Shi
First: 2025-10-16T07:38:21+00:00 · Latest: 2025-10-16T07:38:21+00:00
Abstract
Building agents that autonomously operate mobile devices has attracted
increasing attention. While Vision-Language Models (VLMs) show promise, most
existing approaches rely on direct state-to-action mappings, which lack
structured reasoning and planning, and thus generalize poorly to novel tasks or
unseen UI layouts. We introduce Hi-Agent, a trainable hierarchical
vision-language agent for mobile control, featuring a high-level reasoning
model and a low-level action model that are jointly optimized. For efficient
training, we reformulate multi-step decision-making as a sequence of
single-step subgoals and propose a foresight advantage function, which
leverages execution feedback from the low-level model to guide high-level
optimization. This design alleviates the path explosion issue encountered by
Group Relative Policy Optimization (GRPO) in long-horizon tasks and enables
stable, critic-free joint training. Hi-Agent achieves a new state-of-the-art
(SOTA) task success rate of 87.9% on the Android-in-the-Wild (AitW) benchmark,
significantly outperforming prior methods across three paradigms: prompt-based
(AppAgent: 17.7%), supervised (Filtered BC: 54.5%), and reinforcement
learning-based (DigiRL: 71.9%). It also demonstrates competitive zero-shot
generalization on the ScreenSpot-v2 benchmark. On the more challenging
AndroidWorld benchmark, Hi-Agent also scales effectively with larger backbones,
showing strong adaptability in high-complexity mobile control scenarios.
中文标题/摘要
标题:Hi-Agent:移动设备控制的分层视觉语言代理
构建能够自主操作移动设备的代理引起了越来越多的关注。尽管视觉语言模型(VLMs)显示出潜力,但大多数现有方法依赖于直接的状态到动作映射,缺乏结构化的推理和规划,因此在新任务或未见过的UI布局上泛化能力较差。我们引入了Hi-Agent,这是一种用于移动控制的可训练分层视觉语言代理,具备高层推理模型和低层动作模型,并且是联合优化的。为了高效训练,我们将多步决策制定重新表述为一系列单步子目标,并提出了一种前瞻优势函数,该函数利用低层模型的执行反馈来指导高层优化。这种设计缓解了在长期任务中遇到的组相对策略优化(GRPO)路径爆炸问题,并使稳定、无批评家的联合训练成为可能。Hi-Agent在Android-in-the-Wild(AitW)基准测试中达到了新的最佳状态87.9%的任务成功率,显著优于先前方法在三种范式中的表现:提示驱动(AppAgent:17.7%)、监督(过滤后的BC:54.5%)和强化学习驱动(DigiRL:71.9%)。它还在ScreenSpot-v2基准测试中展示了竞争力的零样本泛化能力。在更具挑战性的AndroidWorld基准测试中,Hi-Agent也随着更大模型的使用而有效扩展,展示了在高复杂度移动控制场景中的强大适应性。
Summary / 总结
Hi-Agent is a hierarchical vision-language agent designed for mobile device control, addressing the limitations of direct state-to-action mappings by incorporating structured reasoning and planning. It consists of a high-level reasoning model and a low-level action model that are jointly optimized. Hi-Agent uses a foresight advantage function to guide high-level optimization with feedback from the low-level model, enabling stable joint training. The agent achieves an 87.9% task success rate on the Android-in-the-Wild benchmark, surpassing previous methods across prompt-based, supervised, and reinforcement learning paradigms, and demonstrates strong zero-shot generalization and adaptability in complex scenarios.
研究旨在开发能够自主操作移动设备的代理,解决现有方法缺乏结构化推理和规划的问题。引入了Hi-Agent,这是一种具有高阶推理模型和低阶动作模型的分层视觉语言代理,这两个模型是联合优化的。通过将多步决策制定重新表述为一系列单步子目标,并使用前瞻优势函数,Hi-Agent 在 AitW 基准测试中达到了新的 SOTA 87.9% 任务成功率,超越了先前的方法,并在复杂场景中展示了强大的适应性。
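For context on the critic-free training mentioned in the Hi-Agent abstract, the sketch below shows the group-relative advantage commonly associated with GRPO: each sampled response is scored against the mean and standard deviation of rewards within its own group. This is the generic formulation, not Hi-Agent's foresight advantage function, whose exact form the abstract does not give. The function name, the use of the population standard deviation, and the reward values are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled responses.

    No learned critic is needed: the group mean acts as the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation (an assumed choice)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to the same subgoal; 1.0 means the step succeeded.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Successful samples receive positive advantages and failed ones negative, and the values sum to zero within the group.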
A Clinically-Grounded Two-Stage Framework for Renal CT Report Generation
Authors: Renjie Liang, Zhengkang Fan, Jinqian Pan, Chenkun Sun, Bruce Daniel Steinberg, Russell Terry, Jie Xu
First: 2025-06-30T07:45:02+00:00 · Latest: 2025-10-16T06:21:00+00:00
Abstract
Objective: Renal cancer is a common malignancy and a major cause of
cancer-related deaths. Computed tomography (CT) is central to early detection,
staging, and treatment planning. However, the growing CT workload increases
radiologists' burden and risks incomplete documentation. Automatically
generating accurate reports remains challenging because it requires integrating
visual interpretation with clinical reasoning. Advances in artificial
intelligence (AI), especially large language and vision-language models, offer
potential to reduce workload and enhance diagnostic quality.
Methods: We propose a clinically informed, two-stage framework for automatic
renal CT report generation. In Stage 1, a multi-task learning model detects
structured clinical features from each 2D image. In Stage 2, a vision-language
model generates free-text reports conditioned on the image and the detected
features. To evaluate clinical fidelity, generated clinical features are
extracted from the reports and compared with expert-annotated ground truth.
Results: Experiments on an expert-labeled dataset show that incorporating
detected features improves both report quality and clinical accuracy. The model
achieved an average AUC of 0.75 for key imaging features and a METEOR score of
0.33, demonstrating higher clinical consistency and fewer template-driven
errors.
Conclusion: Linking structured feature detection with conditioned report
generation provides a clinically grounded approach to integrate structured
prediction and narrative drafting for renal CT reporting. This method enhances
interpretability and clinical faithfulness, underscoring the value of
domain-relevant evaluation metrics for medical AI development.
中文标题/摘要
标题:基于临床的两阶段框架用于肾CT报告生成
目标 肾癌是一种常见的恶性肿瘤,是癌症相关死亡的主要原因。计算机断层扫描(CT)在早期检测、分期和治疗计划中起着关键作用。然而,CT工作量的增加增加了放射科医生的负担并可能导致记录不完整。自动生成准确的报告仍然具有挑战性,因为它需要将视觉解释与临床推理相结合。人工智能(AI)的进步,尤其是大型语言和视觉-语言模型,有可能减轻工作量并提高诊断质量。 方法 我们提出了一种基于临床的两阶段框架,用于自动肾CT报告生成。在第一阶段,多任务学习模型从每个2D图像中检测结构化的临床特征。在第二阶段,视觉-语言模型根据图像和检测到的特征生成自由文本报告。为了评估临床准确性,从报告中提取生成的临床特征并与专家标注的真实值进行比较。 结果 在专家标注的数据集上进行的实验表明,结合检测到的特征可以提高报告质量和临床准确性。该模型在关键影像特征上的平均AUC为0.75,METEOR得分为0.33,显示出更高的临床一致性并减少了模板驱动的错误。 结论 将结构化特征检测与条件报告生成相结合,提供了一种基于临床的方法,用于将结构化预测与叙述性起草相结合,以进行肾CT报告。该方法增强了可解释性和临床真实性,突显了医学AI开发中领域相关评估指标的价值。
Summary / 总结
The study aims to address the challenges of generating accurate renal CT reports by proposing a two-stage framework. In Stage 1, a multi-task learning model detects structured clinical features from 2D images, and in Stage 2, a vision-language model generates free-text reports conditioned on the image and these detected features. Evaluation on an expert-labeled dataset shows that incorporating detected features improves report quality and clinical accuracy, with an average AUC of 0.75 for key imaging features and a METEOR score of 0.33, indicating higher clinical consistency and fewer template-driven errors.
研究旨在通过提出两阶段框架来解决生成准确的肾CT报告的挑战。第一阶段使用多任务学习模型从2D图像中检测结构化的临床特征,第二阶段使用视觉-语言模型根据这些特征生成自由文本报告。实验结果显示,结合检测到的特征可以提高报告质量和临床准确性,关键影像特征的平均AUC为0.75,METEOR得分为0.33,表明更高的临床一致性和更少的模板驱动错误。
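The average AUC reported for the renal CT framework is a mean over per-feature AUCs. As a rough illustration of how such a number is obtained, the sketch below computes AUC as the fraction of positive/negative pairs that are ranked correctly and then averages across features. The feature names, labels, and scores are invented for the example and do not come from the paper.

```python
from statistics import mean


def auc(labels, scores):
    """AUC as the probability that a positive case outranks a negative one.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical per-feature predictions: (ground-truth labels, predicted scores).
predictions = {
    "feature_a": ([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]),
    "feature_b": ([1, 1, 0, 0], [0.9, 0.6, 0.4, 0.2]),
}

per_feature = {name: auc(y, s) for name, (y, s) in predictions.items()}
average_auc = mean(list(per_feature.values()))
```

The pairwise definition is quadratic in the number of cases, which is fine for a small illustration; library implementations typically use a sort-based computation instead.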
Group-Wise Optimization for Self-Extensible Codebooks in Vector Quantized Models
Authors: Hong-Kai Zheng, Piji Li
First: 2025-10-15T09:14:22+00:00 · Latest: 2025-10-16T05:26:09+00:00
Abstract
Vector Quantized Variational Autoencoders (VQ-VAEs) leverage self-supervised
learning through reconstruction tasks to represent continuous vectors using the
closest vectors in a codebook. However, issues such as codebook collapse
persist in the VQ model. To address these issues, existing approaches employ
implicit static codebooks or jointly optimize the entire codebook, but these
methods constrain the codebook's learning capability, leading to reduced
reconstruction quality. In this paper, we propose Group-VQ, which performs
group-wise optimization on the codebook. Each group is optimized independently,
with joint optimization performed within groups. This approach improves the
trade-off between codebook utilization and reconstruction performance.
Additionally, we introduce a training-free codebook resampling method, allowing
post-training adjustment of the codebook size. In image reconstruction
experiments under various settings, Group-VQ demonstrates improved performance
on reconstruction metrics, and the post-training codebook resampling method
achieves the desired flexibility in adjusting the codebook size.
中文标题/摘要
标题:组优化在向量量化模型中自扩展码本
向量量化变分自编码器(VQ-VAEs)通过重建任务利用自监督学习来用码本中最近的向量表示连续向量。然而,VQ模型中存在码本崩溃等问题。为解决这些问题,现有方法采用隐式静态码本或联合优化整个码本,但这些方法限制了码本的学习能力,导致重建质量降低。本文提出了一种组-VQ方法,对码本进行组优化。每个组独立优化,组内进行联合优化。这种方法改善了码本利用与重建性能之间的权衡。此外,我们引入了一种无需训练的码本重采样方法,允许在训练后调整码本大小。在不同设置下的图像重建实验中,组-VQ在重建指标上表现出更好的性能。而训练后的码本采样方法实现了调整码本大小所需的灵活性。
Summary / 总结
This paper addresses the issue of codebook collapse in VQ-VAEs by proposing Group-VQ, which optimizes codebook groups independently while allowing joint optimization within groups. This method enhances the balance between codebook utilization and reconstruction performance. Experimental results show that Group-VQ improves reconstruction metrics in image reconstruction tasks, and the post-training codebook resampling method provides flexibility in adjusting the codebook size.
本文通过提出Group-VQ来解决VQ-VAEs中的代码簿坍塌问题,该方法对代码簿进行分组优化,每组独立优化,组内进行联合优化,从而改善代码簿利用和重建性能之间的权衡。实验结果表明,Group-VQ在重建指标上优于现有方法,并通过无训练的代码簿采样方法在后训练阶段实现了代码簿大小的灵活调整。
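Two notions from the Group-VQ abstract, nearest-code quantization and codebook utilization, can be made concrete with a short sketch. It shows plain vector quantization and how unused codes reveal collapse. It does not implement the paper's group-wise optimization or resampling, and the codebook and input vectors are toy values chosen for illustration.

```python
def quantize(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sqdist(vec, codebook[i]))


def utilization(vectors, codebook):
    """Fraction of codebook entries selected by at least one input vector."""
    used = {quantize(v, codebook) for v in vectors}
    return len(used) / len(codebook)


# Toy codebook with four codes; the inputs cluster near the first two only.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (9.0, 9.0)]
vectors = [(0.1, -0.2), (0.9, 1.2), (1.1, 0.8), (0.2, 0.1)]

indices = [quantize(v, codebook) for v in vectors]
util = utilization(vectors, codebook)
```

Here half of the codes are never selected, which is the kind of low utilization that codebook-collapse remedies aim to avoid.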
Watermarking for Factuality: Guiding Vision-Language Models Toward Truth via Tri-layer Contrastive Decoding
Authors: Kyungryul Back, Seongbeom Park, Milim Kim, Mincheol Kwon, SangHyeok Lee, Hyunyoung Lee, Junhee Cho, Seunghyun Park, Jinkyu Kim
Venue: EMNLP 2025
First: 2025-10-16T04:58:45+00:00 · Latest: 2025-10-16T04:58:45+00:00
Comments: EMNLP 2025 Findings; Project: https://github.com/KR-0822/TCD
Abstract
Large Vision-Language Models (LVLMs) have recently shown promising results on
various multimodal tasks, even achieving human-comparable performance in
certain cases. Nevertheless, LVLMs remain prone to hallucinations -- they often
rely heavily on a single modality or memorize training data without properly
grounding their outputs. To address this, we propose a training-free, tri-layer
contrastive decoding with watermarking, which proceeds in three steps: (1)
select a mature layer and an amateur layer among the decoding layers, (2)
identify a pivot layer using a watermark-related question to assess whether the
layer is visually well-grounded, and (3) apply tri-layer contrastive decoding
to generate the final output. Experiments on public benchmarks such as POPE,
MME and AMBER demonstrate that our method achieves state-of-the-art performance
in reducing hallucinations in LVLMs and generates more visually grounded
responses.
中文标题/摘要
标题:事实水印:通过三层对比解码引导视觉-语言模型趋向真实
大型视觉-语言模型(LVLMs)在各种多模态任务上最近取得了令人鼓舞的结果,甚至在某些情况下达到了与人类相当的性能。然而,LVLMs仍然容易产生幻觉——它们往往依赖单一模态或记忆训练数据,而没有正确地将输出与视觉内容对接。为了解决这个问题,我们提出了一种无需训练的三层对比解码方法,带有水印,该方法分为三个步骤:(1)选择解码层中的成熟层和新手层,(2)使用与水印相关的问题来识别枢轴层,以评估该层是否视觉对接良好,(3)应用三层对比解码生成最终输出。在POPE、MME和AMBER等公开基准上的实验表明,我们的方法在减少LVLMs中的幻觉方面达到了最先进的性能,并生成了更多视觉对接良好的响应。
Summary / 总结
The research aims to reduce hallucinations in large vision-language models (LVLMs) by proposing a tri-layer contrastive decoding method with watermarking. This method involves selecting a mature layer, an amateur layer, and a pivot layer, and then applying contrastive decoding to generate more grounded outputs. Experiments on benchmarks show that the proposed method effectively reduces hallucinations and improves the visual grounding of responses in LVLMs.
研究旨在通过提出一种无需训练的三层对比解码方法结合水印技术来减少大型视觉语言模型(LVLM)中的幻觉。该方法包括选择成熟层、新手层和枢轴层,然后应用三层对比解码生成更视觉地接地的响应。实验表明,这种方法显著减少了幻觉并提高了LVLM响应的接地性。
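As background for the mature and amateur layers mentioned in this abstract, the sketch below shows generic two-way contrastive decoding: token scores are the difference between the log-probabilities of the two layers, restricted to tokens the mature layer already finds plausible. How the pivot layer enters the tri-layer formulation is not specified in the abstract, so it is not modeled here. The `alpha` cutoff, the function names, and the logits are illustrative assumptions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def contrastive_scores(mature_logits, amateur_logits, alpha=0.1):
    """Score tokens by log p_mature - log p_amateur.

    Tokens whose mature probability is below alpha times the top mature
    probability are masked out (a commonly used plausibility cutoff).
    """
    lp_m = log_softmax(mature_logits)
    lp_a = log_softmax(amateur_logits)
    cutoff = math.log(alpha) + max(lp_m)
    return [
        m - a if m >= cutoff else float("-inf")
        for m, a in zip(lp_m, lp_a)
    ]


# Toy vocabulary of three tokens (all logits are made up).
mature = [2.0, 1.8, -3.0]
amateur = [2.0, 0.5, 1.0]

scores = contrastive_scores(mature, amateur)
choice = max(range(len(scores)), key=scores.__getitem__)
greedy = max(range(len(mature)), key=mature.__getitem__)
```

In this toy case greedy decoding on the mature layer picks token 0, while the contrast favors token 1, the token whose probability rises most between the amateur and mature layers.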