Grounded GUI Understanding for Vision-Based Spatial Intelligent Agent: Exemplified by Extended Reality Apps
Authors: Shuqing Li, Binchang Li, Yepang Liu, Cuiyun Gao, Jianping Zhang, Shing-Chi Cheung, Michael R. Lyu
First: 2024-09-17T00:58:00+00:00 · Latest: 2025-10-01T17:58:44+00:00
Abstract
In recent years, spatial computing a.k.a. Extended Reality (XR) has emerged
as a transformative technology, offering users immersive and interactive
experiences across diversified virtual environments. Users can interact with XR
apps through interactable GUI elements (IGEs) on the stereoscopic
three-dimensional (3D) graphical user interface (GUI). The accurate recognition
of these IGEs is instrumental, serving as the foundation of many software
engineering tasks, including automated testing and effective GUI search. The
most recent IGE detection approaches for 2D mobile apps typically train a
supervised object detection model based on a large-scale manually-labeled GUI
dataset, usually with a pre-defined set of clickable GUI element categories
like buttons and spinners. Such approaches can hardly be applied to IGE
detection in XR apps, due to a multitude of challenges including complexities
posed by open-vocabulary and heterogeneous IGE categories, intricacies of
context-sensitive interactability, and the necessities of precise spatial
perception and visual-semantic alignment for accurate IGE detection results.
Thus, IGE detection research tailored to XR apps is needed. In
this paper, we propose the first zero-shot cOntext-sensitive inteRactable GUI
ElemeNT dEtection framework for virtual Reality apps, named Orienter. By
imitating human behaviors, Orienter observes and understands the semantic
contexts of XR app scenes first, before performing the detection. The detection
process is iterated within a feedback-directed validation and reflection loop.
Specifically, Orienter contains three components, including (1) Semantic
context comprehension, (2) Reflection-directed IGE candidate detection, and (3)
Context-sensitive interactability classification. Extensive experiments
demonstrate that Orienter is more effective than the state-of-the-art GUI
element detection approaches.
Alternating Training-based Label Smoothing Enhances Prompt Generalization
Authors: Yang Chen, Yanbin Wei, Ke Jin, Yi Kong, James Kwok, Yu Zhang
First: 2025-08-25T09:54:37+00:00 · Latest: 2025-10-01T17:22:06+00:00
Abstract
Recent advances in pre-trained vision-language models have demonstrated
remarkable zero-shot generalization capabilities. To further enhance these
models' adaptability to various downstream tasks, prompt tuning has emerged as
a parameter-efficient fine-tuning method. However, despite its efficiency, the
generalization ability of the learned prompts remains limited. In contrast, label smoothing
(LS) has been widely recognized as an effective regularization technique that
prevents models from becoming over-confident and improves their generalization.
This inspires us to explore the integration of LS with prompt tuning. However,
we have observed that vanilla LS even weakens the generalization ability of
prompt tuning. To address this issue, we propose the Alternating Training-based
Label Smoothing (ATLaS) method, which alternately trains with standard one-hot
labels and soft labels generated by LS to supervise the prompt tuning.
Moreover, we introduce two types of efficient offline soft labels, including
Class-wise Soft Labels (CSL) and Instance-wise Soft Labels (ISL), to provide
inter-class or instance-class relationships for prompt tuning. The theoretical
properties of the proposed ATLaS method are analyzed. Extensive experiments
demonstrate that the proposed ATLaS method, combined with CSL and ISL,
consistently enhances the generalization performance of prompt tuning.
Moreover, the proposed ATLaS method exhibits high compatibility with prevalent
prompt tuning methods, enabling seamless integration into existing methods.
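Summary
This study aims to improve the generalization ability of prompt tuning in vision-language models by integrating label smoothing (LS) with prompt tuning. The proposed Alternating Training-based Label Smoothing (ATLaS) method alternates between training with one-hot labels and soft labels generated by LS. Theoretical analysis and extensive experiments show that ATLaS, combined with Class-wise Soft Labels (CSL) and Instance-wise Soft Labels (ISL), consistently enhances the generalization performance of prompt tuning and is highly compatible with existing prompt tuning methods.
The paper aims to strengthen the generalization ability of prompt tuning in vision-language models by combining label smoothing (LS) with prompt tuning. It proposes the Alternating Training-based Label Smoothing (ATLaS) method, which alternates between training with one-hot labels and with soft labels generated by LS. The study introduces two types of efficient offline soft labels, Class-wise Soft Labels (CSL) and Instance-wise Soft Labels (ISL), to improve prompt tuning. Extensive experiments show that ATLaS combined with CSL and ISL consistently improves the generalization performance of prompt tuning.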
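The alternation between one-hot and smoothed targets described in this entry can be sketched as follows. This is a minimal illustration, not the authors' implementation: the uniform smoothing formula is the standard one, while the even/odd epoch schedule, the value `eps=0.1`, and all function names are assumptions made for the example. The CSL and ISL soft labels are not modeled here.

```python
import math

def one_hot(label, num_classes):
    # Hard target: probability 1 on the true class.
    return [1.0 if k == label else 0.0 for k in range(num_classes)]

def smooth(label, num_classes, eps=0.1):
    # Standard uniform label smoothing: (1 - eps) * one_hot + eps / K.
    return [(1.0 - eps) * p + eps / num_classes
            for p in one_hot(label, num_classes)]

def target_for_epoch(epoch, label, num_classes, eps=0.1):
    # Assumed schedule (hypothetical): even epochs use one-hot targets,
    # odd epochs use smoothed targets.
    if epoch % 2 == 0:
        return one_hot(label, num_classes)
    return smooth(label, num_classes, eps)

def cross_entropy(probs, target):
    # Cross-entropy between predicted probabilities and a (soft) target.
    return -sum(t * math.log(p) for p, t in zip(probs, target) if t > 0.0)
```

With four classes and `eps=0.1`, the smoothed target for class 1 puts 0.925 on the true class and 0.025 on each of the others, so the loss on odd epochs also penalizes over-confident predictions.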
DepthLM: Metric Depth From Vision Language Models
Authors: Zhipeng Cai, Ching-Feng Yeh, Hu Xu, Zhuang Liu, Gregory Meyer, Xinjie Lei, Changsheng Zhao, Shang-Wen Li, Vikas Chandra, Yangyang Shi
First: 2025-09-29T19:12:13+00:00 · Latest: 2025-10-01T17:18:15+00:00
Abstract
Vision language models (VLMs) can flexibly address various vision tasks
through text interactions. Although successful in semantic understanding,
state-of-the-art VLMs including GPT-5 still struggle in understanding 3D from
2D inputs. On the other hand, expert pure vision models achieve super-human
accuracy in metric depth estimation, a key 3D understanding task. However, they
require task-specific architectures and losses. Such difference motivates us to
ask: Can VLMs reach expert-level accuracy without architecture or loss change?
We take per-pixel metric depth estimation as the representative task and show
that the answer is yes! Surprisingly, comprehensive analysis shows that
text-based supervised-finetuning with sparse labels is sufficient for VLMs to
unlock strong 3D understanding; no dense prediction head or complex
regression/regularization loss is needed. The bottleneck for VLMs actually lies
in pixel reference and cross-dataset camera ambiguity, which we address through
visual prompting and intrinsic-conditioned augmentation. With much smaller
models, our method DepthLM surpasses the accuracy of most advanced VLMs by over
2x, making VLMs for the first time comparable with pure vision models.
Interestingly, without explicit enforcement during training, VLMs trained with
DepthLM naturally avoid over-smoothing, having far fewer flying points at
boundary regions than pure vision models. The simplicity of DepthLM also
enables a single VLM to cover various 3D tasks beyond metric depth. Our code
and model will be released at the link below.
Summary
The research aims to leverage vision language models (VLMs) for metric depth estimation, a key 3D understanding task, by exploring their potential without altering their architecture or loss functions. The study shows that text-based supervised fine-tuning with sparse labels is sufficient for VLMs to achieve expert-level accuracy. The main bottleneck, pixel reference and cross-dataset camera ambiguity, is addressed through visual prompting and intrinsic-conditioned augmentation, leading to a method, DepthLM, that surpasses most advanced VLMs by over 2x in accuracy, making VLMs comparable to pure vision models. Interestingly, DepthLM-trained VLMs avoid over-smoothing, resulting in fewer flying points at boundary regions compared to pure vision models.
The study explores whether vision language models (VLMs) can reach expert-level accuracy in metric depth estimation without changing their architecture or loss functions. It shows that text-based supervised fine-tuning with sparse labels is enough for VLMs to perform well on this task. Key findings include accuracy more than 2x that of most advanced VLMs and the avoidance of over-smoothing, with fewer flying points at boundary regions than pure vision models. Visual prompting and intrinsic-conditioned augmentation address pixel reference and cross-dataset camera ambiguity.
SpargeAttention: Accurate and Training-free Sparse Attention Accelerating Any Model Inference
Authors: Jintao Zhang, Chendong Xiang, Haofeng Huang, Jia Wei, Haocheng Xi, Jun Zhu, Jianfei Chen
Venue: ICML
First: 2025-02-25T12:02:17+00:00 · Latest: 2025-10-01T16:15:31+00:00
Comments: @inproceedings{zhang2025spargeattn, title={Spargeattn: Accurate
sparse attention accelerating any model inference}, author={Zhang, Jintao and
Xiang, Chendong and Huang, Haofeng and Wei, Jia and Xi, Haocheng and Zhu, Jun
and Chen, Jianfei}, booktitle={International Conference on Machine Learning
(ICML)}, year={2025} }
Abstract
An efficient attention implementation is essential for large models because
attention has quadratic time complexity. Fortunately, attention commonly exhibits
sparsity, i.e., many values in the attention map are near zero, allowing for
the omission of corresponding computations. Many studies have utilized the
sparse pattern to accelerate attention. However, most existing works focus on
optimizing attention within specific models by exploiting certain sparse
patterns of the attention map. A universal sparse attention that guarantees
both the speedup and end-to-end performance of diverse models remains elusive.
In this paper, we propose SpargeAttn, a universal sparse and quantized
attention for any model. Our method uses a two-stage online filter: in the
first stage, we rapidly and accurately predict the attention map, enabling the
skip of some matrix multiplications in attention. In the second stage, we
design an online softmax-aware filter that incurs no extra overhead and further
skips some matrix multiplications. Experiments show that our method
significantly accelerates diverse models, including language, image, and video
generation, without sacrificing end-to-end metrics. The codes are available at
https://github.com/thu-ml/SpargeAttn.
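Summary
SpargeAttention aims to accelerate attention mechanisms in large models by exploiting the inherent sparsity in attention maps, without compromising performance. It introduces a two-stage online filter to predict and filter out unnecessary computations, significantly speeding up diverse models including language, image, and video generation tasks. The method achieves this without any additional training overhead and maintains end-to-end performance.
SpargeAttention proposes a universal and efficient sparse attention mechanism that accelerates model inference without sacrificing performance. The method uses a two-stage online filter: the first stage predicts the attention map to skip unnecessary matrix multiplications, and the second stage skips further multiplications by being aware of the softmax operation. Experiments show that SpargeAttention significantly accelerates diverse models, including language, image, and video generation, while maintaining end-to-end performance.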
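The idea behind a softmax-aware skip can be sketched with an online softmax over key blocks for a single query. This is an illustrative approximation under stated assumptions, not the SpargeAttn kernel: the skip rule (drop a block whose largest logit is more than `threshold` below the running maximum), the scalar values, and every name below are hypothetical. The first-stage prediction of the attention map is not shown.

```python
import math

def dense_attention_row(scores, values):
    # Reference: exact softmax-weighted average for one query.
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    return sum(wi * vi for wi, vi in zip(w, values)) / sum(w)

def sparse_attention_row(scores, values, block=4, threshold=20.0):
    # Online softmax over key blocks; a block is skipped when its largest
    # logit is so far below the running maximum that its softmax weights
    # would be negligible (assumed rule, hypothetical threshold).
    running_max = -math.inf
    denom = 0.0
    acc = 0.0
    skipped = 0
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        blk_max = max(s_blk)
        if blk_max < running_max - threshold:
            skipped += 1
            continue
        new_max = max(running_max, blk_max)
        # Rescale what was accumulated so far to the new maximum.
        scale = math.exp(running_max - new_max) if running_max > -math.inf else 0.0
        denom *= scale
        acc *= scale
        for s, v in zip(s_blk, v_blk):
            w = math.exp(s - new_max)
            denom += w
            acc += w * v
        running_max = new_max
    return acc / denom, skipped
```

In this toy form the logits are already known, so only the exponentials and the weighted sum are saved; the point is that the skipped block changes the output by a negligible amount.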
CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification
Authors: Wei Li, Renshan Zhang, Rui Shao, Jie He, Liqiang Nie
Venue: NeurIPS 2025
First: 2025-08-28T17:50:58+00:00 · Latest: 2025-10-01T15:48:45+00:00
Comments: Accepted to NeurIPS 2025, Project Page:
https://jiutian-vl.github.io/CogVLA-page
Abstract
Recent Vision-Language-Action (VLA) models built on pre-trained
Vision-Language Models (VLMs) require extensive post-training, resulting in
high computational overhead that limits scalability and deployment. We propose
CogVLA, a Cognition-Aligned Vision-Language-Action framework that leverages
instruction-driven routing and sparsification to improve both efficiency and
performance. CogVLA draws inspiration from human multimodal coordination and
introduces a 3-stage progressive architecture. 1) Encoder-FiLM based
Aggregation Routing (EFA-Routing) injects instruction information into the
vision encoder to selectively aggregate and compress dual-stream visual tokens,
forming an instruction-aware latent representation. 2) Building upon this
compact visual encoding, LLM-FiLM based Pruning Routing (LFP-Routing)
introduces action intent into the language model by pruning
instruction-irrelevant visually grounded tokens, thereby achieving token-level
sparsity. 3) To ensure that compressed perception inputs can still support
accurate and coherent action generation, we introduce V-L-A Coupled Attention
(CAtten), which combines causal vision-language attention with bidirectional
action parallel decoding. Extensive experiments on the LIBERO benchmark and
real-world robotic tasks demonstrate that CogVLA achieves state-of-the-art
performance with success rates of 97.4% and 70.0%, respectively, while reducing
training costs by 2.5-fold and decreasing inference latency by 2.8-fold
compared to OpenVLA. CogVLA is open-sourced and publicly available at
https://github.com/JiuTian-VL/CogVLA.
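Summary
Recent Vision-Language-Action (VLA) models built on pre-trained Vision-Language Models (VLMs) require extensive post-training, resulting in high computational overhead that limits scalability and deployment. We propose CogVLA, a Cognition-Aligned Vision-Language-Action framework that leverages instruction-driven routing and sparsification to improve both efficiency and performance.
CogVLA is a cognition-aligned VLA framework that improves efficiency and performance through instruction-driven routing and sparsification. It has three stages: EFA-Routing injects instruction information into the vision encoder to form an instruction-aware latent representation, LFP-Routing achieves token-level sparsity by removing instruction-irrelevant visual tokens, and CAtten combines causal vision-language attention with bidirectional action parallel decoding. CogVLA reaches a 97.4% success rate on the LIBERO benchmark and 70.0% on real-world robotic tasks, while reducing training cost by 2.5-fold and inference latency by 2.8-fold compared to OpenVLA.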
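Both routing stages are described as FiLM-based. FiLM (feature-wise linear modulation) scales and shifts each feature with parameters computed from a conditioning vector. The sketch below shows only that generic operation with an instruction embedding as the condition; how CogVLA wires it into the vision encoder or the language model is not stated in the abstract, and the weights and names here are hypothetical.

```python
def linear(vec, weight, bias):
    # Plain affine map: one output per weight row.
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]

def film(features, gamma, beta):
    # Feature-wise linear modulation: gamma * x + beta per feature.
    return [g * x + b for x, g, b in zip(features, gamma, beta)]

def film_from_instruction(features, instr_emb, w_gamma, b_gamma, w_beta, b_beta):
    # gamma and beta are predicted from the instruction embedding
    # (hypothetical projection weights), then applied to visual features.
    gamma = linear(instr_emb, w_gamma, b_gamma)
    beta = linear(instr_emb, w_beta, b_beta)
    return film(features, gamma, beta)
```

Because gamma and beta depend only on the instruction, the same visual features are modulated differently for different instructions, which is what makes the resulting representation instruction-aware.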
ReWatch-R1: Boosting Complex Video Reasoning in Large Vision-Language Models through Agentic Data Synthesis
Authors: Congzhi Zhang, Zhibin Wang, Yinchao Ma, Jiawei Peng, Yihan Wang, Qiang Zhou, Jun Song, Bo Zheng
First: 2025-09-28T05:38:16+00:00 · Latest: 2025-10-01T15:33:19+00:00
Abstract
While Reinforcement Learning with Verifiable Reward (RLVR) significantly
advances image reasoning in Large Vision-Language Models (LVLMs), its
application to complex video reasoning remains underdeveloped. This gap stems
primarily from a critical data bottleneck: existing datasets lack the
challenging, multi-hop questions and high-quality, video-grounded
Chain-of-Thought (CoT) data necessary to effectively bootstrap RLVR. To address
this, we introduce ReWatch, a large-scale dataset built to foster advanced
video reasoning. We propose a novel multi-stage synthesis pipeline to
synthesize its three components: ReWatch-Caption, ReWatch-QA, and ReWatch-CoT.
A core innovation is our Multi-Agent ReAct framework for CoT synthesis, which
simulates a human-like "re-watching" process to generate video-grounded
reasoning traces by explicitly modeling information retrieval and verification.
Building on this dataset, we develop ReWatch-R1 by post-training a strong
baseline LVLM with Supervised Fine-Tuning (SFT) and our RLVR framework. This
framework incorporates a novel Observation & Reasoning (O&R) reward mechanism
that evaluates both the final answer's correctness and the reasoning's
alignment with video content, directly penalizing hallucination. Our
experiments show that ReWatch-R1 achieves state-of-the-art average performance
on five challenging video reasoning benchmarks. Project Page:
https://rewatch-r1.github.io
Training Vision-Language Process Reward Models for Test-Time Scaling in Multimodal Reasoning: Key Insights and Lessons Learned
Authors: Brandon Ong, Tej Deep Pala, Vernon Toh, William Chandra Tjhi, Soujanya Poria
First: 2025-09-27T10:56:58+00:00 · Latest: 2025-10-01T12:54:17+00:00
Abstract
Process Reward Models (PRMs) provide step-level supervision that improves the
reliability of reasoning in large language models. While PRMs have been
extensively studied in text-based domains, their extension to Vision Language
Models (VLMs) remains limited. Existing Vision-Language PRMs (VL-PRMs) rely on
Monte Carlo Tree Search (MCTS) for data construction, which can often produce
noisy supervision signals and limit generalization across tasks. In this work,
we aim to elucidate the design space of VL-PRMs by exploring diverse strategies
for dataset construction, training, and test-time scaling. First, we introduce
a hybrid data synthesis framework that combines MCTS with judgments from a
strong VLM, producing more accurate step-level labels. Second, we propose
perception-focused supervision, enabling our PRM to explicitly detect errors at
the visual grounding stage of reasoning. Third, we systematically evaluate
multiple test-time scaling strategies, showing that our PRMs can reliably guide
VLMs toward more accurate solutions. Our experiments covering five diverse
multimodal benchmarks (MMMU, PuzzleVQA, AlgoPuzzleVQA, MathVista, and
MathVision) reveal several key insights: (i) VL-PRMs when used as Outcome
Reward Models (ORMs) during test-time scaling (TTS) can outperform VL-PRM
guided process step selection, (ii) smaller VL-PRMs can match or even surpass
larger ones in detecting process errors, (iii) VL-PRMs uncover latent reasoning
abilities in stronger VLM backbones, (iv) perception-level supervision leads to
significant gains in test-time scaling, and (v) TTS performance of different
policies improves on advanced math reasoning datasets despite not training
VL-PRMs on such datasets. We hope our work will motivate further research and
support the advancement of VLMs.
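Summary
Process Reward Models (PRMs) provide step-level supervision that improves the reliability of reasoning in large language models.
The study aims to improve the reasoning reliability of vision language models (VLMs) by developing process reward models (PRMs). The authors introduce a hybrid data synthesis framework that combines Monte Carlo Tree Search (MCTS) with a strong VLM to produce more accurate labels, and propose perception-focused supervision to detect errors at the visual grounding stage of reasoning. Multiple test-time scaling strategies are evaluated on five multimodal benchmarks. Results show that VL-PRMs used as outcome reward models during test-time scaling can outperform step selection, that smaller VL-PRMs perform well at detecting process errors, and that perception-level supervision significantly improves test-time scaling performance. These findings offer key insights for advancing VLMs.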
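The two test-time scaling styles contrasted in insight (i) can be sketched generically: scoring complete solutions once and keeping the best (outcome-style best-of-N), versus greedily choosing each next step by its score (guided step selection). The functions and the toy scores below are hypothetical and do not reproduce the paper's setup or results.

```python
def best_of_n(solutions, outcome_score):
    # Outcome-style selection: score each finished solution once, keep the best.
    return max(solutions, key=outcome_score)

def guided_step_search(expand, step_score, max_steps):
    # Process-style selection: at each step keep the highest-scoring next step.
    prefix = []
    for _ in range(max_steps):
        options = expand(prefix)
        if not options:
            break
        prefix.append(max(options, key=lambda s: step_score(prefix, s)))
    return prefix

# Toy example with made-up scores for four reasoning steps.
SCORES = {"a": 0.8, "c": 0.6, "b": 0.2, "d": 0.9}

def toy_expand(prefix):
    if not prefix:
        return ["a", "c"]
    if len(prefix) == 1:
        return ["b"] if prefix[0] == "a" else ["d"]
    return []

chosen_by_outcome = best_of_n([["a", "b"], ["c", "d"]], lambda sol: SCORES[sol[-1]])
chosen_by_steps = guided_step_search(toy_expand, lambda p, s: SCORES[s], max_steps=3)
```

In this toy case the greedy search commits to step "a" because it scores highest locally and ends on the weak step "b", while outcome scoring picks the path ending in "d". It only illustrates one way the two strategies can diverge.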
ViLBias: Detecting and Reasoning about Bias in Multimodal Content
Authors: Shaina Raza, Caesar Saleh, Azib Farooq, Emrul Hasan, Franklin Ogidi, Maximus Powers, Veronica Chatrath, Marcelo Lotif, Karanpal Sekhon, Roya Javadi, Haad Zahid, Anam Zahid, Vahid Reza Khazaie, Zhenyu Yu
First: 2024-12-22T15:05:30+00:00 · Latest: 2025-10-01T12:49:54+00:00
Comments: Under review
Abstract
Detecting bias in multimodal news requires models that reason over
text-image pairs, not just classify text. In response, we present ViLBias, a
VQA-style benchmark and framework for detecting and reasoning about bias in
multimodal news. The dataset comprises 40,945 text-image pairs from diverse
outlets, each annotated with a bias label and concise rationale using a
two-stage LLM-as-annotator pipeline with hierarchical majority voting and
human-in-the-loop validation. We evaluate Small Language Models (SLMs), Large
Language Models (LLMs), and Vision-Language Models (VLMs) across closed-ended
classification and open-ended reasoning (oVQA), and compare parameter-efficient
tuning strategies. Results show that incorporating images alongside text
improves detection accuracy by 3-5%, and that LLMs/VLMs better capture subtle
framing and text-image inconsistencies than SLMs. Parameter-efficient methods
(LoRA/QLoRA/Adapters) recover 97-99% of full fine-tuning performance with
<5% trainable parameters. For oVQA, reasoning accuracy spans 52-79% and
faithfulness 68-89%, both improved by instruction tuning; closed accuracy
correlates strongly with reasoning (r = 0.91). ViLBias offers a scalable
benchmark and strong baselines for multimodal bias detection and rationale
quality.
Summary
The research aims to detect and reason about bias in multimodal news by developing ViLBias, a VQA-style benchmark. It evaluates SLMs, LLMs, and VLMs on closed-ended classification and open-ended reasoning tasks, showing that incorporating images improves detection accuracy and that LLMs/VLMs better capture subtle biases. Parameter-efficient tuning methods recover near-full fine-tuning performance with minimal trainable parameters. For open-ended reasoning, accuracy and faithfulness are improved with instruction tuning, and closed accuracy correlates strongly with reasoning.
The study aims to detect and reason about bias in multimodal news by developing the ViLBias benchmark. It evaluates SLMs, LLMs, and VLMs on closed-ended classification and open-ended reasoning tasks; results show that adding images improves detection accuracy and that LLMs/VLMs capture subtle bias better than SLMs. Parameter-efficient tuning methods recover near-full fine-tuning performance, and instruction tuning improves reasoning accuracy and faithfulness.
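To give a feel for why parameter-efficient methods train so few parameters, the arithmetic for LoRA on a single weight matrix can be worked out directly: a rank-r adapter on a d_in x d_out matrix adds r * (d_in + d_out) trainable parameters. The dimensions and rank below are illustrative assumptions, not values reported for ViLBias, and the abstract's "<5%" figure refers to the evaluated models as a whole rather than to one matrix.

```python
def lora_trainable_fraction(d_in, d_out, rank):
    # Full fine-tuning updates every entry of the d_in x d_out matrix.
    full = d_in * d_out
    # LoRA trains two low-rank factors: (d_in x rank) and (rank x d_out).
    lora = rank * (d_in + d_out)
    return lora / full

# Hypothetical example: a 4096 x 4096 projection with rank 16.
fraction = lora_trainable_fraction(4096, 4096, 16)
```

For this hypothetical matrix the adapter has 131,072 parameters against 16,777,216 in the full matrix, a fraction of 1/128.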
AdaBlock-dLLM: Semantic-Aware Diffusion LLM Inference via Adaptive Block Size
Authors: Guanxi Lu, Hao Mark Chen, Yuto Karashima, Zhican Wang, Daichi Fujiki, Hongxiang Fan
First: 2025-09-30T15:53:56+00:00 · Latest: 2025-10-01T11:26:36+00:00
Comments: Preprint. Under review
Abstract
Diffusion-based large language models (dLLMs) are gaining attention for their
inherent capacity for parallel decoding, offering a compelling alternative to
autoregressive LLMs. Among various decoding strategies, blockwise
semi-autoregressive (semi-AR) approaches are widely adopted due to their
natural support for KV caching and their favorable accuracy-speed trade-off.
However, this paper identifies two fundamental limitations in the conventional
semi-AR decoding approach that applies a fixed block size: i) late decoding
overhead, where the unmasking of high-confidence tokens outside the current
block is unnecessarily delayed, and ii) premature decoding error, where
low-confidence tokens inside the current block are committed too early, leading
to incorrect tokens. This paper presents the first systematic investigation
challenging the fixed block size assumption in semi-AR decoding. Through a
statistical analysis of confidence dynamics during the denoising process, we
identify a volatility band (VB) region during dLLM decoding, which encodes
local semantic structure and can be used to guide adaptive block sizing.
Leveraging these insights, we introduce AdaBlock-dLLM, a training-free,
plug-and-play scheduler that adaptively aligns block boundaries with semantic
steps by adjusting block size during runtime. Extensive experiments across
diverse benchmarks show that AdaBlock-dLLM achieves up to 5.3% accuracy
improvement under the same throughput budget. Beyond inference-time
optimization, we hope our semantics-aware adaptive scheduling approach and
confidence-based analysis will inspire future training strategies for dLLMs.
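Summary
Diffusion-based large language models (dLLMs) are gaining attention for their inherent capacity for parallel decoding, offering a compelling alternative to autoregressive LLMs.
This paper addresses the limitations of a fixed block size in semi-autoregressive decoding and proposes AdaBlock-dLLM, a scheduler that adaptively adjusts block size at runtime based on confidence dynamics. Experiments show accuracy improvements of up to 5.3% under the same throughput budget.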
Training-Free Data Assimilation with GenCast
Authors: Thomas Savary, François Rozet, Gilles Louppe
First: 2025-09-23T08:59:44+00:00 · Latest: 2025-10-01T10:31:26+00:00
Abstract
Data assimilation is widely used in many disciplines such as meteorology,
oceanography, and robotics to estimate the state of a dynamical system from
noisy observations. In this work, we propose a lightweight and general method
to perform data assimilation using diffusion models pre-trained for emulating
dynamical systems. Our method builds on particle filters, a class of data
assimilation algorithms, and does not require any further training. As a
guiding example throughout this work, we illustrate our methodology on GenCast,
a diffusion-based model that generates global ensemble weather forecasts.
Summary
This paper introduces a training-free method for data assimilation that leverages pre-trained diffusion models to estimate the state of dynamical systems. It builds on particle filters and is illustrated on GenCast, a diffusion-based model for generating global ensemble weather forecasts.
The paper proposes a method that uses pre-trained diffusion models for data assimilation, estimating the state of a dynamical system without further training. The method is based on particle filters and is demonstrated with GenCast, a diffusion model that generates global weather forecasts.
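For readers unfamiliar with particle filters, one step of the classic bootstrap variant can be sketched as below: propagate each particle with the dynamics model, weight it by the likelihood of the observation, and resample. This is the textbook algorithm, not the paper's specific scheme. In the paper's setting the `transition` callable would be a sample from the pre-trained diffusion emulator; here it is a hypothetical one-dimensional random walk, and the Gaussian likelihood is likewise an assumption.

```python
import math
import random

def particle_filter_step(particles, transition, log_likelihood, observation, rng):
    # 1) Predict: push every particle through the (stochastic) dynamics.
    predicted = [transition(x, rng) for x in particles]
    # 2) Weight: score each predicted state against the observation.
    log_w = [log_likelihood(observation, x) for x in predicted]
    m = max(log_w)  # subtract the max for numerical stability
    w = [math.exp(lw - m) for lw in log_w]
    total = sum(w)
    w = [wi / total for wi in w]
    # 3) Resample: draw a new equally weighted ensemble.
    return rng.choices(predicted, weights=w, k=len(predicted))

# Toy usage: random-walk dynamics, noisy observation of the state at 2.0.
rng = random.Random(0)
prior = [rng.uniform(-5.0, 5.0) for _ in range(500)]
random_walk = lambda x, r: x + r.gauss(0.0, 0.1)
gaussian_loglik = lambda y, x: -0.5 * ((y - x) / 0.5) ** 2
posterior = particle_filter_step(prior, random_walk, gaussian_loglik, 2.0, rng)
```

After the step the ensemble concentrates around the observed value, and no model parameters were updated, which is the sense in which such a scheme needs no further training.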
Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization
Authors: Tao Zhang, Cheng Da, Kun Ding, Huan Yang, Kun Jin, Yan Li, Tingting Gao, Di Zhang, Shiming Xiang, Chunhong Pan
Venue: NeurIPS 2025
First: 2025-02-03T04:51:28+00:00 · Latest: 2025-10-01T10:07:18+00:00
Comments: NeurIPS 2025
Abstract
Preference optimization for diffusion models aims to align them with human
preferences for images. Previous methods typically use Vision-Language Models
(VLMs) as pixel-level reward models to approximate human preferences. However,
when used for step-level preference optimization, these models face challenges
in handling noisy images of different timesteps and require complex
transformations into pixel space. In this work, we show that pre-trained
diffusion models are naturally suited for step-level reward modeling in the
noisy latent space, as they are explicitly designed to process latent images at
various noise levels. Accordingly, we propose the Latent Reward Model (LRM),
which repurposes components of the diffusion model to predict preferences of
latent images at arbitrary timesteps. Building on LRM, we introduce Latent
Preference Optimization (LPO), a step-level preference optimization method
conducted directly in the noisy latent space. Experimental results indicate
that LPO significantly improves the model's alignment with general, aesthetic,
and text-image alignment preferences, while achieving a 2.5-28x training
speedup over existing preference optimization methods. Our code and models are
available at https://github.com/Kwai-Kolors/LPO.
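Summary
Preference optimization for diffusion models aims to align them with human preferences for images.
This paper proposes the Latent Reward Model (LRM) and Latent Preference Optimization (LPO) to align diffusion models with human preferences. LRM repurposes components of the diffusion model to predict preferences of latent images at arbitrary timesteps, and LPO optimizes preferences directly in the noisy latent space. The method significantly improves alignment with general, aesthetic, and text-image alignment preferences, and achieves a 2.5-28x training speedup over existing methods.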
Easi3R: Estimating Disentangled Motion from DUSt3R Without Training
Authors: Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, Anpei Chen
Venue: ICCV
First: 2025-03-31T17:59:58+00:00 · Latest: 2025-10-01T09:41:33+00:00
Comments: Page: https://easi3r.github.io/ Code:
https://github.com/Inception3D/Easi3R
Abstract
Recent advances in DUSt3R have enabled robust estimation of dense point
clouds and camera parameters of static scenes, leveraging Transformer network
architectures and direct supervision on large-scale 3D datasets. In contrast,
the limited scale and diversity of available 4D datasets present a major
bottleneck for training a highly generalizable 4D model. This constraint has
driven conventional 4D methods to fine-tune 3D models on scalable dynamic video
data with additional geometric priors such as optical flow and depths. In this
work, we take an opposite path and introduce Easi3R, a simple yet efficient
training-free method for 4D reconstruction. Our approach applies attention
adaptation during inference, eliminating the need for from-scratch pre-training
or network fine-tuning. We find that the attention layers in DUSt3R inherently
encode rich information about camera and object motion. By carefully
disentangling these attention maps, we achieve accurate dynamic region
segmentation, camera pose estimation, and 4D dense point map reconstruction.
Extensive experiments on real-world dynamic videos demonstrate that our
lightweight attention adaptation significantly outperforms previous
state-of-the-art methods that are trained or finetuned on extensive dynamic
datasets. Our code is publicly available for research purposes at
https://easi3r.github.io/
Summary
Easi3R is a training-free method for 4D reconstruction that leverages attention adaptation during inference to estimate disentangled motion from DUSt3R. It achieves accurate dynamic region segmentation, camera pose estimation, and 4D dense point map reconstruction without the need for extensive training or fine-tuning. Experiments on real-world dynamic videos show that Easi3R outperforms previous state-of-the-art methods that rely on extensive dynamic datasets for training or fine-tuning.
Easi3R is a training-free 4D reconstruction method that estimates disentangled motion from DUSt3R through attention adaptation at inference time. It achieves accurate dynamic region segmentation, camera pose estimation, and 4D dense point map reconstruction without from-scratch training or fine-tuning. Experiments on real-world dynamic videos show that Easi3R outperforms methods that are trained or fine-tuned on large dynamic datasets.
CoFFT: Chain of Foresight-Focus Thought for Visual Language Models
Authors: Xinyu Zhang, Yuxuan Dong, Lingling Zhang, Chengyou Jia, Zhuohang Dang, Basura Fernando, Jun Liu, Mike Zheng Shou
First: 2025-09-26T07:46:30+00:00 · Latest: 2025-10-01T09:15:14+00:00
Abstract
Despite significant advances in Vision Language Models (VLMs), they remain
constrained by the complexity and redundancy of visual input. When images
contain large amounts of irrelevant information, VLMs are susceptible to
interference, thus generating excessive task-irrelevant reasoning processes or
even hallucinations. This limitation stems from their inability to discover and
process the required regions during reasoning precisely. To address this
limitation, we present the Chain of Foresight-Focus Thought (CoFFT), a novel
training-free approach that enhances VLMs' visual reasoning by emulating human
visual cognition. Each Foresight-Focus Thought consists of three stages: (1)
Diverse Sample Generation: generates diverse reasoning samples to explore
potential reasoning paths, where each sample contains several reasoning steps;
(2) Dual Foresight Decoding: rigorously evaluates these samples based on both
visual focus and reasoning progression, adding the first step of the optimal sample
to the reasoning process; (3) Visual Focus Adjustment: precisely adjusts visual
focus toward regions most beneficial for future reasoning, before returning to
stage (1) to generate subsequent reasoning samples until reaching the final
answer. These stages function iteratively, creating an interdependent cycle
where reasoning guides visual focus and visual focus informs subsequent
reasoning. Empirical results across multiple benchmarks using Qwen2.5-VL,
InternVL-2.5, and Llava-Next demonstrate consistent performance improvements of
3.1-5.8% with a controllable increase in computational overhead.
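Summary
CoFFT is a training-free approach that enhances VLMs' visual reasoning by emulating human visual cognition. It consists of three stages: Diverse Sample Generation, Dual Foresight Decoding, and Visual Focus Adjustment. This method iteratively generates and evaluates reasoning samples, adjusting visual focus to improve reasoning efficiency. Experiments on Qwen2.5-VL, InternVL-2.5, and Llava-Next show consistent performance improvements of 3.1-5.8% with controllable computational overhead.
CoFFT is a training-free method that enhances the visual reasoning of VLMs by emulating human visual cognition. It consists of three stages: Diverse Sample Generation, Dual Foresight Decoding, and Visual Focus Adjustment. This iterative process yields performance improvements of 3.1-5.8% across multiple benchmarks, with a controllable increase in computational overhead. The method addresses VLMs' susceptibility to interference from irrelevant visual information, which leads to excessive task-irrelevant reasoning or hallucinations.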
Poutine: Vision-Language-Trajectory Pre-Training and Reinforcement Learning Post-Training Enable Robust End-to-End Autonomous Driving
Authors: Luke Rowe, Rodrigue de Schaetzen, Roger Girgis, Christopher Pal, Liam Paull
First: 2025-06-12T19:14:00+00:00 · Latest: 2025-10-01T06:56:48+00:00
Abstract
We present Poutine, a 3B-parameter vision-language model (VLM) tailored for
end-to-end autonomous driving in long-tail driving scenarios. Poutine is
trained in two stages. To obtain strong base driving capabilities, we train
Poutine-Base in a self-supervised vision-language-trajectory (VLT) next-token
prediction fashion on 83 hours of CoVLA nominal driving and 11 hours of Waymo
long-tail driving. Accompanying language annotations are auto-generated with a
72B-parameter VLM. Poutine is obtained by fine-tuning Poutine-Base with Group
Relative Policy Optimization (GRPO) using fewer than 500 preference-labeled
frames from the Waymo validation set. We show that both VLT pretraining and RL
fine-tuning are critical to attain strong driving performance in the long-tail.
Poutine-Base achieves a rater-feedback score (RFS) of 8.12 on the validation
set, nearly matching Waymo's expert ground-truth RFS. The final Poutine model
achieves an RFS of 7.99 on the official Waymo test set, placing 1st in the 2025
Waymo Vision-Based End-to-End Driving Challenge by a significant margin. These
results highlight the promise of scalable VLT pre-training and lightweight RL
fine-tuning to enable robust and generalizable autonomy.
中文标题/摘要
标题:Poutine:面向长尾驾驶场景的端到端自主驾驶的视觉-语言-轨迹预训练和强化学习微调
我们介绍了Poutine,一个针对长尾驾驶场景的端到端自主驾驶定制的30亿参数视觉-语言模型(VLM)。Poutine分为两个阶段进行训练。为了获得强大的基础驾驶能力,我们使用83小时的CoVLA标准驾驶数据和11小时的Waymo长尾驾驶数据,以自我监督的视觉-语言-轨迹(VLT)下一个标记预测方式训练Poutine-Base。伴随的语言注解由一个720亿参数的VLM自动生成。Poutine是通过组相对策略优化(GRPO)对Poutine-Base进行微调得到的,仅使用了Waymo验证集中不到500个带偏好标注的帧。我们展示了VLT预训练和RL微调对于在长尾场景中获得强大驾驶性能都至关重要。Poutine-Base在验证集上的评分者反馈评分(RFS)为8.12,几乎与Waymo专家的地面真值RFS相当。最终的Poutine模型在官方Waymo测试集上的RFS为7.99,显著领先于2025年Waymo基于视觉的端到端驾驶挑战赛的其他参赛者,位居第一。这些结果突显了可扩展的VLT预训练和轻量级RL微调在实现稳健和泛化自主性方面的潜力。
Summary / 总结
We present Poutine, a 3B-parameter vision-language model (VLM) tailored for end-to-end autonomous driving in long-tail driving scenarios.
iFinder: Structured Zero-Shot Vision-Based LLM Grounding for Dash-Cam Video Reasoning
Authors: Manyi Yao, Bingbing Zhuang, Sparsh Garg, Amit Roy-Chowdhury, Christian Shelton, Manmohan Chandraker, Abhishek Aich
Venue: NeurIPS 2025
First: 2025-09-23T20:25:53+00:00 · Latest: 2025-10-01T06:54:44+00:00
Comments: Accepted at NeurIPS 2025
Abstract
Grounding large language models (LLMs) in domain-specific tasks like post-hoc
dash-cam driving video analysis is challenging due to their general-purpose
training and lack of structured inductive biases. As vision is often the sole
modality available for such analysis (i.e., no LiDAR, GPS, etc.), existing
video-based vision-language models (V-VLMs) struggle with spatial reasoning,
causal inference, and explainability of events in the input video. To this end,
we introduce iFinder, a structured semantic grounding framework that decouples
perception from reasoning by translating dash-cam videos into a hierarchical,
interpretable data structure for LLMs. iFinder operates as a modular,
training-free pipeline that employs pretrained vision models to extract
critical cues -- object pose, lane positions, and object trajectories -- which
are hierarchically organized into frame- and video-level structures. Combined
with a three-block prompting strategy, it enables step-wise, grounded reasoning
for the LLM to refine a peer V-VLM's outputs and provide accurate reasoning.
Zero-shot evaluations on four public dash-cam video benchmarks show that
iFinder's proposed grounding with domain-specific cues, especially object
orientation and global context, significantly outperforms end-to-end V-VLMs,
with up to 39% gains in accident reasoning accuracy. By
grounding LLMs with driving domain-specific representations, iFinder offers a
zero-shot, interpretable, and reliable alternative to end-to-end V-VLMs for
post-hoc driving video understanding.
Summary / 总结
iFinder is a structured semantic grounding framework designed to enhance the performance of large language models (LLMs) in post-hoc dash-cam driving video analysis. It decouples perception from reasoning by translating dash-cam videos into a hierarchical, interpretable data structure. The framework uses pretrained vision models to extract critical cues such as object pose, lane positions, and object trajectories, which are then organized into frame- and video-level structures. Evaluations on four public dash-cam video benchmarks demonstrate that iFinder significantly outperforms end-to-end vision-language models, achieving up to 39% gains in accident reasoning accuracy.
iFinder 是一种结构化的语义接地框架,旨在增强大型语言模型(LLMs)在后置dash-cam视频分析中的性能。该框架通过将dash-cam视频转换为层次化、可解释的数据结构来分离感知和推理。框架使用预训练的视觉模型提取关键线索,如物体姿态、车道位置和物体轨迹,并将其组织成帧级和视频级结构。在四个公开的dash-cam视频基准上的评估表明,iFinder 在事故推理准确性方面显著优于端到端的视觉-语言模型,最高可提高39%的准确率。
ATAS: Any-to-Any Self-Distillation for Enhanced Open-Vocabulary Dense Prediction
Authors: Juan Yeo, Soonwoo Cha, Jiwoo Song, Hyunbin Jin, Taesup Kim
First: 2025-06-10T10:40:10+00:00 · Latest: 2025-10-01T06:34:36+00:00
Comments: Accepted at ICCV25
Abstract
Vision-language models such as CLIP have recently propelled open-vocabulary
dense prediction tasks by enabling recognition of a broad range of visual
concepts. However, CLIP still struggles with fine-grained, region-level
understanding, hindering its effectiveness on these dense prediction tasks. We
identify two pivotal factors required to address this limitation: semantic
coherence and fine-grained vision-language alignment. Current adaptation
methods often improve fine-grained alignment at the expense of semantic
coherence, and often rely on extra modules or supervised fine-tuning. To
overcome these issues, we propose Any-to-Any Self-Distillation (ATAS), a novel
approach that simultaneously enhances semantic coherence and fine-grained
alignment by leveraging a model's own knowledge across all representation
levels. Unlike prior methods, ATAS uses only unlabeled images and an internal
self-distillation process to refine representations of CLIP vision encoders,
preserving local semantic consistency while sharpening local detail
recognition. On open-vocabulary object detection and semantic segmentation
benchmarks, ATAS achieves substantial performance gains, outperforming baseline
CLIP models. These results validate the effectiveness of our approach and
underscore the importance of jointly maintaining semantic coherence and
fine-grained alignment for advanced open-vocabulary dense prediction.
中文标题/摘要
标题:ATAS:任意到任意的自我蒸馏以增强开放词汇密集预测
视觉-语言模型如CLIP最近通过使识别广泛视觉概念成为可能,推动了开放词汇密集预测任务。然而,CLIP在细粒度和区域级理解方面仍然存在问题,阻碍了其在这些密集预测任务中的有效性。我们确定了两个关键因素以解决这一限制:语义连贯性和细粒度的视觉-语言对齐。当前的适应方法通常在提高细粒度对齐的同时牺牲了语义连贯性,并且往往依赖于额外的模块或监督微调。为了克服这些问题,我们提出了任意到任意的自我蒸馏(ATAS),这是一种新颖的方法,通过利用模型在所有表示层次上的自我知识,同时增强语义连贯性和细粒度对齐。与先前的方法不同,ATAS仅使用未标记的图像和内部自我蒸馏过程来细化CLIP视觉编码器的表示,保持局部语义一致性的同时增强局部细节识别。在开放词汇目标检测和语义分割基准测试中,ATAS实现了显著的性能提升,超越了基线CLIP模型。这些结果验证了我们方法的有效性,并强调了同时保持语义连贯性和细粒度对齐对于高级开放词汇密集预测的重要性。
Training-free LLM Verification via Recycling Few-shot Examples
Authors: Dongseok Lee, Jimyung Hong, Dongyoung Kim, Jaehyung Kim
First: 2025-06-08T10:02:07+00:00 · Latest: 2025-10-01T04:58:58+00:00
Abstract
Although LLMs have achieved remarkable performance, the inherent
stochasticity of their reasoning process and varying conclusions present
significant challenges. Majority voting or Best-of-N with external verification
models has been explored to find the most promising solution among multiple LLM
outputs. However, these approaches have certain limitations, such as limited
applicability or the cost of an additional training step. To address this
problem, we propose a novel and effective framework that Recycles Few-shot
examples to verify LLM outputs (ReFeri). Our key idea is to additionally
utilize the given few-shot examples to evaluate the candidate outputs of the
target query, rather than using them only to generate outputs as in the
conventional few-shot prompting setup. Specifically, ReFeri evaluates the
generated outputs by combining two different scores, both motivated by Bayes' rule, and
subsequently selects the candidate that is both confidently determined and
contextually coherent through a few additional LLM inferences. Experiments with
three different LLMs and across seven diverse tasks demonstrate that our
framework significantly improves the accuracy of LLMs, with an average gain
of 4.8%, through effective response selection and without additional training.
中文标题/摘要
标题:基于回收少量示例的无训练LLM验证
尽管LLM取得了显著的性能,其推理过程中的固有随机性和结论的多样性带来了重大挑战。多数投票或外部验证模型的Best-of-N已被探索以在多个LLM输出中找到最佳解决方案。然而,这些方法存在一定的局限性,如适用性有限或需要额外的训练步骤。为解决这一问题,我们提出了一种新颖且有效的框架,通过回收少量示例来验证LLM输出(ReFeri)。我们的核心思想是除了像传统少量示例提示设置那样使用给定的少量示例生成输出外,还利用这些示例来评估目标查询的候选输出。具体而言,ReFeri通过结合两种不同的分数来评估生成的输出,这些分数的设计灵感来自贝叶斯规则,并通过少量额外的LLM推理来选择既自信又上下文连贯的候选输出。实验表明,我们的框架在三个不同LLM和七个不同任务上显著提高了LLM的准确性,平均提高了4.8%,而无需额外训练。
Summary / 总结
The paper addresses the challenge of verifying the outputs of large language models (LLMs) due to their inherent stochasticity and varying conclusions. It proposes ReFeri, a training-free framework that recycles few-shot examples to evaluate and select the most accurate LLM outputs. By combining two scores derived from Bayes' rule, ReFeri ensures both confidence and contextual coherence, achieving an average accuracy improvement of 4.8% across three LLMs and seven tasks without additional training.
论文针对大型语言模型(LLMs)由于其固有的随机性和结论变化带来的验证难题,提出了一种无需训练的框架ReFeri,通过回收少量示例来评估和选择最准确的LLM输出。通过结合源自贝叶斯规则的两种评分,ReFeri确保了信心和上下文一致性,实现了在三个LLM和七个任务上的平均准确率提升4.8%,且无需额外训练。
GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents
Authors: Run Luo, Lu Wang, Wanwei He, Longze Chen, Jiaming Li, Xiaobo Xia
First: 2025-04-14T17:45:54+00:00 · Latest: 2025-10-01T04:55:20+00:00
Abstract
Existing efforts in building Graphical User Interface (GUI) agents largely
rely on the training paradigm of supervised fine-tuning on Large
Vision-Language Models (LVLMs). However, this approach not only demands
extensive amounts of training data but also struggles to effectively understand
GUI screenshots and generalize to unseen interfaces. The issue significantly
limits its application in real-world scenarios, especially for high-level
tasks. Inspired by Reinforcement Fine-Tuning (RFT) in large reasoning models
(e.g., DeepSeek-R1), which efficiently enhances the problem-solving
capabilities of large language models in real-world settings, we propose GUI-R1,
the first reinforcement learning framework designed to enhance the GUI
capabilities of LVLMs in high-level real-world task scenarios, through unified
action space rule modeling. By leveraging a small amount of carefully curated
high-quality data across multiple platforms (including Windows, Linux, MacOS,
Android, and Web) and employing policy optimization algorithms such as Group
Relative Policy Optimization (GRPO) to update the model, GUI-R1 achieves
superior performance using only 0.02% of the data (3K vs. 13M) compared to
previous state-of-the-art methods like OS-Atlas across eight benchmarks
spanning three different platforms (mobile, desktop, and web). These results
demonstrate the immense potential of reinforcement learning based on unified
action space rule modeling in improving the execution capabilities of LVLMs for
real-world GUI agent tasks.
中文标题/摘要
标题:GUI-R1:一种通用的基于R1风格的视觉-语言行动模型用于GUI代理
现有构建图形用户界面(GUI)代理的努力主要依赖于在大型视觉-语言模型(LVLM)上进行监督微调的训练范式。然而,这种方法不仅需要大量的训练数据,而且难以有效理解GUI截图并泛化到未见过的界面。这一问题显著限制了其在实际场景中的应用,尤其是在执行高级任务时。受大型推理模型(如DeepSeek-R1)中强化微调(RFT)的启发,该方法能够高效地增强大型语言模型在实际环境中的问题解决能力,我们提出了GUI-R1,这是第一个通过统一行动空间规则建模来增强LVLM在高级实际任务场景中GUI能力的强化学习框架。通过利用跨多个平台(包括Windows、Linux、MacOS、Android和Web)精心收集的少量高质量数据,并采用如组相对策略优化(GRPO)等策略优化算法来更新模型,GUI-R1仅使用了0.02%的数据(3K vs. 13M)就超越了包括OS-Atlas在内的先前最先进的方法,在涵盖三个不同平台(移动、桌面和网络)的八个基准测试中取得了优异的性能。这些结果表明,基于统一行动空间规则建模的强化学习在提高LVLM执行实际GUI代理任务的能力方面具有巨大的潜力。
Summary / 总结
The paper addresses the limitations of supervised fine-tuning for GUI agents, which require large amounts of training data and struggle with understanding GUI screenshots. It proposes GUI-R1, a reinforcement learning framework that uses a small amount of curated data and policy optimization algorithms to enhance LVLMs for high-level GUI tasks. GUI-R1 achieves superior performance using only 0.02% of the data compared to previous methods across eight benchmarks on three platforms, showcasing the potential of reinforcement learning for improving LVLMs in real-world GUI agent tasks.
论文针对监督微调方法在GUI代理中的局限性,需要大量训练数据且难以理解GUI截图。它提出了GUI-R1,一种使用少量精挑细选的数据和策略优化算法来增强LVLMs在高级GUI任务中的框架。GUI-R1在三个平台(移动、桌面和网络)的八个基准测试中表现出色,仅使用了之前方法如OS-Atlas的0.02%的数据。
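The GRPO update mentioned in the abstract can be illustrated with its group-relative advantage step. This is a minimal sketch of the commonly described formulation (each sampled response's reward is normalized by the mean and standard deviation of its own group), not GUI-R1's actual training code. The function name, the `eps` term, the use of the population standard deviation, and the example rewards are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own sampled group.

    `rewards` holds one scalar reward per response sampled for the same prompt.
    `eps` (an assumption) avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rule-based rewards for four responses sampled for one GUI task.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)  # approximately [1.0, -1.0, -1.0, 1.0]
```

Responses that score above their group's mean receive positive advantages and the rest negative ones, so no separate value network is needed to provide a baseline.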
Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models
Authors: Jianing Qi, Jiawei Liu, Hao Tang, Zhigang Zhu
First: 2025-03-21T17:51:14+00:00 · Latest: 2025-10-01T04:26:04+00:00
Abstract
Vision Language Models (VLMs) excel at identifying and describing objects but
often fail at spatial reasoning. We study why VLMs, such as LLaVA, underutilize
spatial cues despite having positional encodings and spatially rich vision
encoder features. Our analysis reveals a key imbalance: vision token embeddings
have much larger norms than text tokens, suppressing the LLM's position embedding.
To expose this mechanism, we developed three interpretability tools: (1) the
Position Sensitivity Index, which quantifies reliance on token order, (2) the
Cross Modality Balance, which reveals attention head allocation patterns, and
(3) a RoPE Sensitivity probe, which measures dependence on rotary positional
embeddings. These tools uncover that vision tokens and system prompts dominate
attention. We validated our mechanistic understanding through targeted
interventions that predictably restore positional sensitivity. These findings
reveal previously unknown failure modes in multimodal attention and demonstrate
how interpretability analysis can guide principled improvements.
中文标题/摘要
标题:超越语义:重新发现视觉语言模型中的空间意识
视觉语言模型(VLMs)在识别和描述物体方面表现出色,但在空间推理方面经常失败。我们研究了为什么像LLaVA这样的VLMs尽管具有位置编码和空间丰富的视觉编码器特征,但仍然未能充分利用空间线索。我们的分析揭示了一个关键的不平衡:视觉标记嵌入的范数远大于文本标记,抑制了LLM的位置嵌入。为了揭示这一机制,我们开发了三种可解释性工具:(1)位置敏感指数,量化对标记顺序的依赖程度;(2)跨模态平衡,揭示注意力头分配模式;(3)RoPE敏感性探针,衡量对旋转位置嵌入的依赖程度。这些工具揭示了视觉标记和系统提示主导了注意力。我们通过有针对性的干预措施验证了我们的机制理解,这些干预措施能够预测性地恢复位置敏感性。这些发现揭示了多模态注意力中未知的失败模式,并展示了可解释性分析如何指导原理性的改进。
AgenticIQA: An Agentic Framework for Adaptive and Interpretable Image Quality Assessment
Authors: Hanwei Zhu, Yu Tian, Keyan Ding, Baoliang Chen, Bolin Chen, Shiqi Wang, Weisi Lin
First: 2025-09-30T09:37:01+00:00 · Latest: 2025-10-01T04:01:40+00:00
Abstract
Image quality assessment (IQA) is inherently complex, as it reflects both the
quantification and interpretation of perceptual quality rooted in the human
visual system. Conventional approaches typically rely on fixed models to output
scalar scores, limiting their adaptability to diverse distortions,
user-specific queries, and interpretability needs. Furthermore, scoring and
interpretation are often treated as independent processes, despite their
interdependence: interpretation identifies perceptual degradations, while
scoring abstracts them into a compact metric. To address these limitations, we
propose AgenticIQA, a modular agentic framework that integrates vision-language
models (VLMs) with traditional IQA tools in a dynamic, query-aware manner.
AgenticIQA decomposes IQA into four subtasks -- distortion detection,
distortion analysis, tool selection, and tool execution -- coordinated by a
planner, executor, and summarizer. The planner formulates task-specific
strategies, the executor collects perceptual evidence via tool invocation, and
the summarizer integrates this evidence to produce accurate scores with
human-aligned explanations. To support training and evaluation, we introduce
AgenticIQA-200K, a large-scale instruction dataset tailored for IQA agents, and
AgenticIQA-Eval, the first benchmark for assessing the planning, execution, and
summarization capabilities of VLM-based IQA agents. Extensive experiments
across diverse IQA datasets demonstrate that AgenticIQA consistently surpasses
strong baselines in both scoring accuracy and explanatory alignment.
中文标题/摘要
标题:AgenticIQA:一种用于自适应且可解释的图像质量评估的代理式框架
图像质量评估(IQA)本质上是复杂的,因为它既涉及对感知质量的量化,也涉及对感知质量的解释,而感知质量根植于人类视觉系统。传统方法通常依赖固定模型输出标量分数,限制了它们对多种失真、用户特定查询和解释需求的适应性。此外,评分和解释通常被视为独立的过程,尽管它们之间存在相互依赖性:解释识别感知降级,而评分将它们抽象为紧凑的度量标准。为了解决这些限制,我们提出了AgenticIQA,这是一种模块化的代理式框架,以动态、查询感知的方式将视觉语言模型(VLMs)与传统IQA工具结合在一起。AgenticIQA将IQA分解为四个子任务(失真检测、失真分析、工具选择和工具执行),由规划者、执行者和总结者协调。规划者制定任务特定策略,执行者通过工具调用收集感知证据,总结者将这些证据整合以生成准确的评分和与人类对齐的解释。为了支持训练和评估,我们引入了AgenticIQA-200K,这是一个针对IQA代理定制的大规模指令数据集,以及AgenticIQA-Eval,这是第一个评估基于VLM的IQA代理规划、执行和总结能力的基准。在多种IQA数据集上的广泛实验表明,AgenticIQA在评分准确性和解释对齐方面始终优于强大的基线。
Summary / 总结
AgenticIQA is proposed to address the limitations of conventional IQA methods by integrating vision-language models with traditional IQA tools in a dynamic, query-aware manner. It decomposes IQA into four subtasks: distortion detection, distortion analysis, tool selection, and tool execution, coordinated by a planner, executor, and summarizer. AgenticIQA outperforms strong baselines in scoring accuracy and explanatory alignment across various IQA datasets.
AgenticIQA 是一个模块化框架,将视觉语言模型与传统图像质量评估工具结合,以解决固定模型在适应性和可解释性方面的局限性。它将图像质量评估分解为四个子任务:失真检测、分析、工具选择和执行,由规划者、执行者和总结者管理。实验表明,AgenticIQA 在各种图像质量评估数据集中的评分准确性和解释一致性方面均优于强基线。
STORK: Faster Diffusion And Flow Matching Sampling By Resolving Both Stiffness And Structure-Dependence
Authors: Zheng Tan, Weizhen Wang, Andrea L. Bertozzi, Ernest K. Ryu
First: 2025-05-30T04:46:34+00:00 · Latest: 2025-10-01T03:07:21+00:00
Abstract
Diffusion models (DMs) and flow-matching models have demonstrated remarkable
performance in image and video generation. However, such models require a
significant number of function evaluations (NFEs) during sampling, leading to
costly inference. Consequently, quality-preserving fast sampling methods that
require fewer NFEs have been an active area of research. However, prior
training-free sampling methods fail to simultaneously address two key
challenges: the stiffness of the ODE (i.e., the non-straightness of the
velocity field) and dependence on the semi-linear structure of the DM ODE
(which limits their direct applicability to flow-matching models). In this
work, we introduce the Stabilized Taylor Orthogonal Runge--Kutta (STORK)
method, addressing both design concerns. We demonstrate that STORK consistently
improves the quality of diffusion and flow-matching sampling for image and
video generation. Code is available at https://github.com/ZT220501/STORK.
中文标题/摘要
标题:STORK: 通过解决刚性和结构依赖性加快扩散和流匹配采样
扩散模型(DMs)和流匹配模型在图像和视频生成方面表现出色。然而,这些模型在采样过程中需要大量的函数评估(NFEs),导致成本高昂的推理。因此,能够保持质量且需要较少NFEs的快速采样方法一直是研究的热点。然而,先前的无训练采样方法未能同时解决两个关键挑战:ODE的刚性(即速度场的非直线性)和DM ODE的半线性结构依赖性(这限制了它们直接应用于流匹配模型的适用性)。在本工作中,我们引入了稳定泰勒正交龙格-库塔(STORK)方法,解决了这两个设计问题。我们证明STORK能够一致地提高图像和视频生成中扩散和流匹配采样的质量。代码可在https://github.com/ZT220501/STORK 获取。
Summary / 总结
This paper addresses the challenge of fast sampling in diffusion models (DMs) and flow-matching models, which require a large number of function evaluations (NFEs) for inference. The authors introduce the Stabilized Taylor Orthogonal Runge--Kutta (STORK) method to resolve both the stiffness of the ODE and the semi-linear structure dependence, which are key issues in previous methods. Experiments show that STORK improves the quality of sampling for both DMs and flow-matching models in image and video generation tasks.
本文解决了扩散模型(DMs)和流匹配模型在快速采样时需要大量函数评估(NFEs)的问题。作者引入了稳定泰勒正交龙格-库塔(STORK)方法,以解决先前方法中存在的两个关键问题:ODE的刚性和半线性结构依赖性。实验表明,STORK在图像和视频生成任务中的采样质量得到了提升。
PCoreSet: Effective Active Learning through Knowledge Distillation from Vision-Language Models
Authors: Seongjae Kang, Dong Bok Lee, Hyungjoon Jang, Dongseop Kim, Sung Ju Hwang
First: 2025-06-01T08:54:37+00:00 · Latest: 2025-10-01T01:14:57+00:00
Comments: 39 pages, 25 figures, preprint
Abstract
Knowledge distillation (KD) is a widely used framework for training compact,
task-specific models by transferring the knowledge from teacher models.
However, its application to active learning (AL), which aims to minimize
annotation costs through iterative sample selection, remains underexplored.
This gap stems from the fact that KD typically assumes access to sufficient
labeled data, whereas AL operates in data-scarce scenarios where task-specific
teacher models are often unavailable. In this paper, we first introduce
ActiveKD, a framework that integrates AL with KD by leveraging the zero- and
few-shot capabilities of large vision-language models (VLMs). A key aspect of
ActiveKD is the structured prediction bias of VLMs, i.e., their predictions form
clusters in the probability space. We regard this structure as an inductive
bias of the teacher model, capturing generalizable output patterns beneficial
to student learning. To exploit this bias, we propose Probabilistic CoreSet
(PCoreSet), a selection strategy that maximizes coverage in the probability
space rather than the feature space. PCoreSet strategically selects
probabilistically diverse unlabeled samples, facilitating more efficient
transfer of teacher knowledge under limited annotation budgets. Extensive
evaluations on 11 datasets show that ActiveKD consistently improves performance
across selection methods (e.g., +29.07% on ImageNet, averaged over methods).
Under ActiveKD, PCoreSet ranks first in 64/73 settings (approximately 87.7%)
across 5 student and 3 teacher networks, always achieving the best performance
except for the first 2 AL rounds. Our code is available at
https://github.com/erjui/PCoreSet.
中文标题/摘要
标题:PCoreSet:通过来自视觉-语言模型的知识蒸馏实现有效的主动学习
知识蒸馏(KD)是一种广泛使用的框架,通过从教师模型转移知识来训练紧凑的任务特定模型。然而,其在主动学习(AL)中的应用仍然未被充分探索,主动学习旨在通过迭代样本选择来最小化标注成本。这一差距源于KD通常假设有足够的标注数据,而AL则在数据稀缺的场景中运行,此时任务特定的教师模型往往不可用。在本文中,我们首先引入了ActiveKD框架,该框架通过利用大型视觉-语言模型(VLM)的零样本和少样本能力,将AL与KD结合起来。ActiveKD的关键方面是VLM的结构化预测偏差,即它们的预测在概率空间中形成簇。我们将这种结构视为教师模型的归纳偏置,它捕捉到对学生学习有益的可泛化输出模式。为了利用这种偏置,我们提出了概率核心集(PCoreSet)选择策略,该策略在概率空间而不是特征空间中最大化覆盖范围。PCoreSet战略性地选择概率上多样化的未标注样本,在有限的标注预算下更有效地转移教师知识。在11个数据集上的广泛评估显示,ActiveKD在各种选择方法上始终提升性能(例如,在ImageNet上按方法平均提高29.07%)。在ActiveKD下,PCoreSet在5个学生网络和3个教师网络的64/73个设置(约87.7%)中排名第一,除前2轮主动学习外始终达到最佳性能。我们的代码可在https://github.com/erjui/PCoreSet获取。
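The idea of maximizing coverage in probability space rather than feature space can be sketched with a greedy farthest-first (k-center style) selection over teacher class-probability vectors. This is an illustrative sketch under stated assumptions, not the authors' implementation: the Euclidean distance, the function names, and the toy probability vectors are all hypothetical, and the abstract does not specify the distance PCoreSet actually uses.

```python
import math


def l2(p, q):
    """Euclidean distance between two probability vectors (an assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def greedy_prob_coreset(unlabeled_probs, labeled_probs, budget):
    """Farthest-first selection over class-probability vectors.

    Repeatedly picks the unlabeled sample whose probability vector is farthest
    from everything already covered (labeled or previously picked).
    Returns indices into `unlabeled_probs`.
    """
    n = len(unlabeled_probs)
    if labeled_probs:
        min_dist = [min(l2(p, q) for q in labeled_probs) for p in unlabeled_probs]
    else:
        min_dist = [math.inf] * n
    selected = []
    for _ in range(min(budget, n)):
        remaining = [i for i in range(n) if i not in selected]
        idx = max(remaining, key=lambda i: min_dist[i])
        selected.append(idx)
        # Update each sample's distance to its nearest covered point.
        for i, p in enumerate(unlabeled_probs):
            min_dist[i] = min(min_dist[i], l2(p, unlabeled_probs[idx]))
    return selected


# Hypothetical teacher predictions over two classes.
unlabeled = [[0.90, 0.10], [0.88, 0.12], [0.10, 0.90], [0.50, 0.50]]
labeled = [[0.95, 0.05]]
picked = greedy_prob_coreset(unlabeled, labeled, budget=2)
print(picked)  # [2, 3]
```

In this toy case the two near-duplicate confident predictions are skipped in favor of samples that occupy uncovered regions of the probability simplex.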
Breaking Down and Building Up: Mixture of Skill-Based Vision-and-Language Navigation Agents
Authors: Tianyi Ma, Yue Zhang, Zehao Wang, Parisa Kordjamshidi
First: 2025-08-11T05:50:30+00:00 · Latest: 2025-10-01T00:48:33+00:00
Abstract
Vision-and-Language Navigation (VLN) poses significant challenges for agents
to interpret natural language instructions and navigate complex 3D
environments. While recent progress has been driven by large-scale pre-training
and data augmentation, current methods still struggle to generalize to unseen
scenarios, particularly when complex spatial and temporal reasoning is
required. In this work, we propose SkillNav, a modular framework that
introduces structured, skill-based reasoning into Transformer-based VLN agents.
Our method decomposes navigation into a set of interpretable atomic skills
(e.g., Vertical Movement, Area and Region Identification, Stop and Pause), each
handled by a specialized agent. To support targeted skill training without
manual data annotation, we construct a synthetic dataset pipeline that
generates diverse, linguistically natural, skill-specific
instruction-trajectory pairs. We then introduce a novel training-free
Vision-Language Model (VLM)-based router, which dynamically selects the most
suitable agent at each time step by aligning sub-goals with visual observations
and historical actions. SkillNav obtains competitive results on commonly used
benchmarks and establishes state-of-the-art generalization on GSA-R2R, a
benchmark with novel instruction styles and unseen environments.
中文标题/摘要
标题:分解与构建:基于技能的视觉-语言导航代理的混合
视觉-语言导航(VLN)对代理解读自然语言指令并导航复杂3D环境提出了重大挑战。尽管最近的进步主要得益于大规模预训练和数据增强,但当前的方法仍然难以在未见过的场景中泛化,尤其是在需要复杂的空间和时间推理时。在本文中,我们提出了一种模块化框架SkillNav,该框架将结构化的技能推理引入到基于Transformer的VLN代理中。我们的方法将导航分解为一组可解释的基本技能(例如,垂直移动、区域和区域识别、停止和暂停),每种技能由一个专门的代理处理。为了支持针对特定技能的训练而无需手动数据注释,我们构建了一个合成数据集管道,生成多样且语言自然的技能特定指令-轨迹对。然后,我们引入了一种新颖的无需训练的视觉-语言模型(VLM)路由器,该路由器在每个时间步动态选择最合适的代理,通过将子目标与视觉观察和历史动作对齐来实现。SkillNav在常用基准测试中获得了竞争力的结果,并在具有新颖指令风格和未见过环境的GSA-R2R基准测试中建立了最先进的泛化能力。
Summary / 总结
The research aims to address the challenges of vision-and-language navigation by proposing SkillNav, a modular framework that decomposes navigation into interpretable atomic skills. It uses a synthetic dataset pipeline to generate diverse skill-specific instruction-trajectory pairs and introduces a VLM-based router to dynamically select the most suitable agent at each step. SkillNav achieves competitive results on benchmarks and sets a new state-of-the-art in generalization to the GSA-R2R benchmark, which includes novel instruction styles and unseen environments.
研究旨在通过提出SkillNav模块化框架解决视觉-语言导航的挑战,该框架将导航分解为可解释的基本技能。使用合成数据集管道生成多样化的技能特定指令-轨迹对,并引入基于VLM的路由器在每一步动态选择最合适的代理。SkillNav在基准测试中取得了竞争力的结果,并在包含新型指令风格和未见过的环境的GSA-R2R基准测试中达到了最先进的泛化能力。
Imagining Alternatives: Towards High-Resolution 3D Counterfactual Medical Image Generation via Language Guidance
Authors: Mohamed Mohamed, Brennan Nichyporuk, Douglas L. Arnold, Tal Arbel
Venue: MICCAI
First: 2025-09-07T08:52:18+00:00 · Latest: 2025-09-30T23:25:22+00:00
Comments: Accepted to the 2025 MICCAI ELAMI Workshop
Abstract
Vision-language models have demonstrated impressive capabilities in
generating 2D images under various conditions; however, the success of these
models is largely enabled by extensive, readily available pretrained foundation
models. Critically, comparable pretrained models do not exist for 3D,
significantly limiting progress. As a result, the potential of vision-language
models to produce high-resolution 3D counterfactual medical images conditioned
solely on natural language remains unexplored. Addressing this gap would enable
powerful clinical and research applications, such as personalized
counterfactual explanations, simulation of disease progression, and enhanced
medical training by visualizing hypothetical conditions in realistic detail.
Our work takes a step toward this challenge by introducing a framework capable
of generating high-resolution 3D counterfactual medical images of synthesized
patients guided by free-form language prompts. We adapt state-of-the-art 3D
diffusion models with enhancements from Simple Diffusion and incorporate
augmented conditioning to improve text alignment and image quality. To our
knowledge, this is the first demonstration of a language-guided native-3D
diffusion model applied to neurological imaging, where faithful
three-dimensional modeling is essential. On two neurological MRI datasets, our
framework simulates varying counterfactual lesion loads in Multiple Sclerosis
and cognitive states in Alzheimer's disease, generating high-quality images
while preserving subject fidelity. Our results lay the groundwork for
prompt-driven disease progression analysis in 3D medical imaging. Project link:
https://lesupermomo.github.io/imagining-alternatives/.
Summary / 总结
The research aims to generate high-resolution 3D counterfactual medical images using language guidance, addressing the lack of pretrained models for 3D vision-language tasks. The method involves adapting state-of-the-art 3D diffusion models with enhancements from Simple Diffusion and augmented conditioning. The framework successfully generates high-quality 3D images of synthesized patients based on free-form language prompts, demonstrating its capability in simulating varying counterfactual conditions in Multiple Sclerosis and Alzheimer's disease MRI datasets while maintaining subject fidelity. This work paves the way for prompt-driven disease progression analysis in 3D medical imaging.
研究旨在利用语言指导生成高分辨率的3D反事实医学图像,解决3D视觉-语言任务缺乏预训练模型的问题。方法包括改进最先进的3D扩散模型,并结合Simple Diffusion和增强条件。框架成功地根据自由形式的语言提示生成了合成患者的高质量3D图像,展示了其在模拟多发性硬化症和阿尔茨海默病MRI数据集中不同反事实条件的能力,同时保持了个体的一致性。这项工作为3D医学成像中的提示驱动疾病进展分析奠定了基础。
TDBench: A Benchmark for Top-Down Image Understanding with Reliability Analysis of Vision-Language Models
Authors: Kaiyuan Hou, Minghui Zhao, Lilin Xu, Yuang Fan, Xiaofan Jiang
First: 2025-04-01T19:01:13+00:00 · Latest: 2025-09-30T22:02:15+00:00
Abstract
Top-down images play an important role in safety-critical settings such as
autonomous navigation and aerial surveillance, where they provide holistic
spatial information that front-view images cannot capture. Despite this, Vision
Language Models (VLMs) are mostly trained and evaluated on front-view
benchmarks, leaving their performance in the top-down setting poorly
understood. Existing evaluations also overlook a unique property of top-down
images: their physical meaning is preserved under rotation. In addition,
conventional accuracy metrics can be misleading, since they are often inflated
by hallucinations or "lucky guesses", which obscures a model's true reliability
and its grounding in visual evidence. To address these issues, we introduce
TDBench, a benchmark for top-down image understanding that includes 2000
curated questions for each rotation. We further propose RotationalEval (RE),
which measures whether models provide consistent answers across four rotated
views of the same scene, and we develop a reliability framework that separates
genuine knowledge from chance. Finally, we conduct four case studies targeting
underexplored real-world challenges. By combining rigorous evaluation with
reliability metrics, TDBench not only benchmarks VLMs in top-down perception
but also provides a new perspective on trustworthiness, guiding the development
of more robust and grounded AI systems. Project homepage:
https://github.com/Columbia-ICSL/TDBench
Summary / 总结
TDBench is introduced to evaluate the performance of Vision Language Models (VLMs) in understanding top-down images, which are crucial for safety-critical applications. The benchmark includes 2000 questions per rotation and introduces RotationalEval (RE) to measure consistency across rotated views. Key findings show that existing VLMs often perform poorly in top-down settings, and the reliability framework helps distinguish genuine knowledge from chance. This work enhances the trustworthiness of VLMs in real-world applications by providing a more rigorous evaluation.
TDBench 是一个用于评估 VLM 俯视(top-down)图像理解能力的基准,弥补了该场景下评估不足的问题。它针对每个旋转角度包含 2000 个问题,并引入 RotationalEval 来衡量模型在同一场景的不同旋转视图之间回答的一致性,同时提出一个可靠性框架,用于区分真实知识与偶然猜中。研究发现,VLM 的回答经常不一致且不可靠,突显了对更鲁棒模型的需求。案例研究显示,当前模型在俯视感知的现实挑战上表现不佳。
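A rotation-consistency metric of the kind RotationalEval describes can be sketched as follows. The abstract only states that RE measures whether answers are consistent across four rotated views, so the specific rule used here (a question counts only if it is answered correctly under all four rotations), the rotation angles, and the data layout are assumptions for illustration.

```python
ROTATIONS = (0, 90, 180, 270)  # assumed rotation angles, in degrees


def rotational_accuracy(results):
    """Fraction of questions answered correctly under every rotation.

    `results` maps a question id to {rotation_degrees: is_correct}.
    A missing rotation is treated as incorrect.
    """
    if not results:
        return 0.0
    consistent = sum(
        all(per_rot.get(r, False) for r in ROTATIONS)
        for per_rot in results.values()
    )
    return consistent / len(results)


def plain_accuracy(results):
    """Conventional accuracy over all (question, rotation) pairs."""
    flat = [ok for per_rot in results.values() for ok in per_rot.values()]
    return sum(flat) / len(flat) if flat else 0.0


# Hypothetical outcomes: q2 is missed only in the 90-degree view.
results = {
    "q1": {0: True, 90: True, 180: True, 270: True},
    "q2": {0: True, 90: False, 180: True, 270: True},
}
print(plain_accuracy(results))       # 0.875
print(rotational_accuracy(results))  # 0.5
```

The gap between the two numbers shows how a per-view accuracy can look high while the stricter consistency score exposes answers that do not survive rotation.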
Stitch: Training-Free Position Control in Multimodal Diffusion Transformers
Authors: Jessica Bader, Mateusz Pach, Maria A. Bravo, Serge Belongie, Zeynep Akata
First: 2025-09-30T17:59:51+00:00 · Latest: 2025-09-30T17:59:51+00:00
Comments: Preprint
Abstract
Text-to-Image (T2I) generation models have advanced rapidly in recent years,
but accurately capturing spatial relationships like "above" or "to the right
of" poses a persistent challenge. Earlier methods improved spatial relationship
following with external position control. However, as architectures evolved to
enhance image quality, these techniques became incompatible with modern models.
We propose Stitch, a training-free method for incorporating external position
control into Multi-Modal Diffusion Transformers (MMDiT) via
automatically-generated bounding boxes. Stitch produces images that are both
spatially accurate and visually appealing by generating individual objects
within designated bounding boxes and seamlessly stitching them together. We
find that targeted attention heads capture the information necessary to isolate
and cut out individual objects mid-generation, without needing to fully
complete the image. We evaluate Stitch on PosEval, our benchmark for
position-based T2I generation. Featuring five new tasks that extend the concept
of Position beyond the basic GenEval task, PosEval demonstrates that even top
models still have significant room for improvement in position-based
generation. Tested on Qwen-Image, FLUX, and SD3.5, Stitch consistently enhances
base models, even improving FLUX by 218% on GenEval's Position task and by 206%
on PosEval. Stitch achieves state-of-the-art results with Qwen-Image on
PosEval, improving over previous models by 54%, all accomplished while
integrating position control into leading models training-free. Code is
available at https://github.com/ExplainableML/Stitch.
Summary / 总结
Stitch is a training-free method that integrates external position control into Multi-Modal Diffusion Transformers (MMDiT) using automatically-generated bounding boxes. It generates individual objects within designated bounding boxes and stitches them together, improving how accurately text-to-image (T2I) models render spatial relationships. Tested on Qwen-Image, FLUX, and SD3.5, Stitch consistently enhances the base models: it improves FLUX by 218% on GenEval's Position task and by 206% on PosEval, and with Qwen-Image it improves over previous models on PosEval by 54%.
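Stitch is a training-free method that integrates external position control into Multi-Modal Diffusion Transformers (MMDiT) through automatically generated bounding boxes. This lets text-to-image models accurately generate spatial relationships such as "above" or "to the right of". Stitch significantly improves Qwen-Image, FLUX, SD3.5, and other models on position-based tasks: for example, FLUX improves by 218% on GenEval's Position task, and Stitch with Qwen-Image improves over previous models by 54% on PosEval.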
TTT3R: 3D Reconstruction as Test-Time Training
Authors: Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, Anpei Chen
First: 2025-09-30T17:59:51+00:00 · Latest: 2025-09-30T17:59:51+00:00
Comments: Page: https://rover-xingyu.github.io/TTT3R Code:
https://github.com/Inception3D/TTT3R
Abstract
Modern Recurrent Neural Networks have become a competitive architecture for
3D reconstruction due to their linear-time complexity. However, their
performance degrades significantly when applied beyond the training context
length, revealing limited length generalization. In this work, we revisit the
3D reconstruction foundation models from a Test-Time Training perspective,
framing their designs as an online learning problem. Building on this
perspective, we leverage the alignment confidence between the memory state and
incoming observations to derive a closed-form learning rate for memory updates,
to balance between retaining historical information and adapting to new
observations. This training-free intervention, termed TTT3R, substantially
improves length generalization, achieving a $2\times$ improvement in global
pose estimation over baselines, while operating at 20 FPS with just 6 GB of GPU
memory to process thousands of images. Code is available at
https://rover-xingyu.github.io/TTT3R
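The idea of a confidence-derived learning rate for memory updates can be illustrated with a toy sketch. The abstract does not give the closed-form expression, so everything below is an assumption made for illustration only: the memory state and the observation are plain vectors, the confidence is a hypothetical mapping of cosine similarity to [0, 1], and the update is a simple interpolation. It is not the TTT3R update rule.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def alignment_confidence(state: List[float], obs: List[float]) -> float:
    """Hypothetical confidence in [0, 1] (assumption, not the paper's formula)."""
    return 0.5 * (1.0 + cosine(state, obs))


def update_memory(state: List[float], obs: List[float]) -> List[float]:
    """One online update: the confidence acts as a per-step learning rate.

    beta near 0 retains the historical state; beta near 1 adapts to the
    new observation.
    """
    beta = alignment_confidence(state, obs)
    return [(1.0 - beta) * s + beta * o for s, o in zip(state, obs)]


if __name__ == "__main__":
    memory = [1.0, 0.0]
    for observation in ([0.0, 1.0], [0.0, 1.0], [-1.0, 0.0]):
        memory = update_memory(memory, observation)
        print(memory)
```

The point of the sketch is only the shape of the rule: the step size is computed from the state and the observation at inference time, so no extra training is required.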
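Title / Abstract (translated from Chinese)
Title: TTT3R: 3D Reconstruction as Test-Time Training
Modern recurrent neural networks have become a competitive architecture for 3D reconstruction thanks to their linear-time complexity. However, their performance drops significantly when they are applied beyond the training context length, which reveals limited length generalization. In this work, we revisit 3D reconstruction foundation models from a test-time training perspective and frame their design as an online learning problem. Building on this perspective, we use the alignment confidence between the memory state and new observations to derive a closed-form learning rate for memory updates, balancing the retention of historical information against adaptation to new observations. This training-free intervention, called TTT3R, substantially improves length generalization: it achieves a 2x improvement over baselines in global pose estimation while running at 20 FPS and using only 6 GB of GPU memory to process thousands of images. Code is available at https://rover-xingyu.github.io/TTT3R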
Query-Kontext: An Unified Multimodal Model for Image Generation and Editing
Authors: Yuxin Song, Wenkai Dong, Shizun Wang, Qi Zhang, Song Xue, Tao Yuan, Hu Yang, Haocheng Feng, Hang Zhou, Xinyan Xiao, Jingdong Wang
First: 2025-09-30T17:59:46+00:00 · Latest: 2025-09-30T17:59:46+00:00
Comments: 23 pages, 10 figures
Abstract
Unified Multimodal Models (UMMs) have demonstrated remarkable performance in
text-to-image generation (T2I) and editing (TI2I), whether instantiated as
assembled unified frameworks that couple a powerful vision-language model (VLM)
with a diffusion-based generator, or as naive Unified Multimodal Models with an
early fusion of understanding and generation modalities. We contend that in
current unified frameworks, the crucial capability of multimodal generative
reasoning, which encompasses instruction understanding, grounding, and image
referring for identity preservation and faithful reconstruction, is
intrinsically entangled with high-fidelity synthesis. In this work, we
introduce Query-Kontext, a novel approach that bridges the VLM and diffusion
model via a multimodal "kontext" composed of semantic cues and coarse-grained
image conditions encoded from multimodal inputs. This design delegates the
complex ability of multimodal generative reasoning to a powerful VLM while
reserving the diffusion model's role for high-quality visual synthesis. To achieve
this, we propose a three-stage progressive training strategy. First, we connect
the VLM to a lightweight diffusion head via multimodal kontext tokens to
unleash the VLM's generative reasoning ability. Second, we scale this head to a
large, pre-trained diffusion model to enhance visual detail and realism.
Finally, we introduce a low-level image encoder to improve image fidelity and
perform instruction tuning on downstream tasks. Furthermore, we build a
comprehensive data pipeline integrating real, synthetic, and open-source
datasets, covering diverse multimodal reference-to-image scenarios, including
image generation, instruction-driven editing, customized generation, and
multi-subject composition. Experiments show that our approach matches strong
unified baselines and even outperforms task-specific state-of-the-art methods
in several cases.
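Title / Abstract (translated from Chinese)
Title: Query-Kontext: A Unified Multimodal Model for Image Generation and Editing
Unified Multimodal Models (UMMs) perform strongly in text-to-image generation (T2I) and editing (TI2I), whether they are built as assembled unified frameworks that couple a powerful vision-language model (VLM) with a diffusion generator, or as naive unified multimodal models that fuse the understanding and generation modalities early. We argue that in current unified frameworks, the key capability of multimodal generative reasoning, which includes instruction understanding, grounding, and image referring for identity preservation and faithful reconstruction, is intrinsically entangled with high-fidelity synthesis. In this work, we introduce Query-Kontext, a new approach that connects the VLM to the diffusion model through a multimodal "kontext" composed of semantic cues and coarse-grained image conditions encoded from multimodal inputs. This design delegates the complex multimodal generative reasoning to the powerful VLM while reserving the diffusion model for high-quality visual synthesis. To achieve this, we propose a three-stage progressive training strategy. First, we connect the VLM to a lightweight diffusion head via multimodal kontext tokens to unlock the VLM's generative reasoning ability. Second, we scale this head up to a large pre-trained diffusion model to enhance visual detail and realism. Finally, we introduce a low-level image encoder to improve image fidelity and perform instruction tuning on downstream tasks. In addition, we build a comprehensive data pipeline that integrates real, synthetic, and open-source datasets and covers diverse multimodal reference-to-image scenarios, including image generation, instruction-driven editing, customized generation, and multi-subject composition. Experiments show that our approach matches strong unified baselines and in some cases even outperforms task-specific state-of-the-art methods.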
Summary / 总结
The research aims to improve text-to-image generation and editing by proposing Query-Kontext, a unified multimodal model. It bridges a vision-language model (VLM) and a diffusion model through a multimodal "kontext", delegating multimodal generative reasoning to the VLM and leaving high-fidelity synthesis to the diffusion model. The model is trained in three stages: connecting the VLM to a lightweight diffusion head, scaling that head to a large pre-trained diffusion model, and adding a low-level image encoder followed by instruction tuning on downstream tasks. Experiments show that Query-Kontext matches strong unified baselines and outperforms task-specific state-of-the-art methods in some cases.
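The research aims to improve Unified Multimodal Models (UMMs) for text-to-image generation and editing by addressing the intrinsic entanglement of multimodal generative reasoning with high-fidelity synthesis. Query-Kontext introduces a multimodal "kontext", composed of semantic cues and image conditions, that connects the vision-language model to the diffusion model. The method uses a three-stage training strategy to strengthen generative reasoning and visual synthesis. Experiments show that Query-Kontext matches strong unified baselines and in some cases outperforms task-specific state-of-the-art methods.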
Seeing Through Deception: Uncovering Misleading Creator Intent in Multimodal News with Vision-Language Models
Authors: Jiaying Wu, Fanxiao Li, Zihang Fu, Min-Yen Kan, Bryan Hooi
First: 2025-05-21T13:14:32+00:00 · Latest: 2025-09-30T17:53:25+00:00
Abstract
The impact of misinformation arises not only from factual inaccuracies but
also from the misleading narratives that creators deliberately embed.
Interpreting such creator intent is therefore essential for multimodal
misinformation detection (MMD) and effective information governance. To this
end, we introduce DeceptionDecoded, a large-scale benchmark of 12,000
image-caption pairs grounded in trustworthy reference articles, created using
an intent-guided simulation framework that models both the desired influence
and the execution plan of news creators. The dataset captures both misleading
and non-misleading cases, spanning manipulations across visual and textual
modalities, and supports three intent-centric tasks: (1) misleading intent
detection, (2) misleading source attribution, and (3) creator desire inference.
We evaluate 14 state-of-the-art vision-language models (VLMs) and find that
they struggle with intent reasoning, often relying on shallow cues such as
surface-level alignment, stylistic polish, or heuristic authenticity signals.
These results highlight the limitations of current VLMs and position
DeceptionDecoded as a foundation for developing intent-aware models that go
beyond shallow cues in MMD.
Summary / 总结
The research aims to uncover misleading narratives in multimodal news by interpreting creator intent, which is essential for multimodal misinformation detection (MMD). DeceptionDecoded, a benchmark of 12,000 image-caption pairs grounded in trustworthy reference articles, was created using an intent-guided simulation framework. Evaluating 14 state-of-the-art vision-language models, the study found that these models struggle with intent reasoning and often rely on shallow cues such as surface-level alignment, stylistic polish, or heuristic authenticity signals, which highlights the need for intent-aware models.
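The paper introduces DeceptionDecoded, a large-scale dataset of 12,000 image-caption pairs designed to uncover misleading creator intent in multimodal news. The dataset includes both misleading and non-misleading cases and supports three tasks: misleading intent detection, misleading source attribution, and creator desire inference. After evaluating 14 state-of-the-art vision-language models, the study finds that these models usually rely on shallow cues rather than deeper intent reasoning, which highlights the need for more capable models in multimodal misinformation detection.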
Clarification as Supervision: Reinforcement Learning for Vision-Language Interfaces
Authors: John Gkountouras, Ivan Titov
First: 2025-09-30T17:46:46+00:00 · Latest: 2025-09-30T17:46:46+00:00
Abstract
Recent text-only models demonstrate remarkable mathematical reasoning
capabilities. Extending these to visual domains requires vision-language models
to translate images into text descriptions. However, current models, trained to
produce captions for human readers, often omit the precise details that
reasoning systems require. This creates an interface mismatch: reasoners often
fail not due to reasoning limitations but because they lack access to critical
visual information. We propose Adaptive-Clarification Reinforcement Learning
(AC-RL), which teaches vision models what information reasoners need through
interaction. Our key insight is that clarification requests during training
reveal information gaps; by penalizing success that requires clarification, we
create pressure for comprehensive initial captions that enable the reasoner to
solve the problem in a single pass. AC-RL improves average accuracy by 4.4
points over pretrained baselines across seven visual mathematical reasoning
benchmarks, and analysis shows it would cut clarification requests by up to 39%
if those were allowed. By treating clarification as a form of implicit
supervision, AC-RL demonstrates that vision-language interfaces can be
effectively learned through interaction alone, without requiring explicit
annotations.
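The phrase "penalizing success that requires clarification" can be read as a reward-shaping rule. The sketch below is one minimal way to express it, under stated assumptions: the function names, the binary correctness signal, and the penalty value of 0.5 are all hypothetical and are not taken from the paper.

```python
from typing import Iterable, Tuple


def shaped_reward(correct: bool, used_clarification: bool,
                  penalty: float = 0.5) -> float:
    """Reward for one episode (hypothetical shaping, not the paper's values).

    A correct answer reached in a single pass earns the full reward, a correct
    answer that needed a clarification request earns a reduced reward, and an
    incorrect answer earns nothing.
    """
    if not 0.0 <= penalty <= 1.0:
        raise ValueError("penalty must lie in [0, 1]")
    if not correct:
        return 0.0
    return 1.0 - penalty if used_clarification else 1.0


def mean_reward(episodes: Iterable[Tuple[bool, bool]],
                penalty: float = 0.5) -> float:
    """Average shaped reward over (correct, used_clarification) pairs."""
    rewards = [shaped_reward(c, u, penalty) for c, u in episodes]
    return sum(rewards) / len(rewards) if rewards else 0.0


if __name__ == "__main__":
    batch = [(True, False), (True, True), (False, False)]
    print(mean_reward(batch))
```

Under this kind of shaping, a captioner maximizes reward by writing an initial caption complete enough that the reasoner never needs to ask, which matches the pressure the abstract describes.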
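Title / Abstract (translated from Chinese)
Title: Clarification as Supervision: Reinforcement Learning for Vision-Language Interfaces
Recent text-only models demonstrate remarkable mathematical reasoning capabilities. Extending these capabilities to visual domains requires vision-language models to translate images into text descriptions. However, current models, which are trained to produce captions for human readers, often omit the precise details that reasoning systems require. This creates an interface mismatch: reasoners often fail not because of reasoning limitations but because they lack critical visual information. We propose Adaptive-Clarification Reinforcement Learning (AC-RL), which teaches vision models what information reasoners need through interaction. Our key insight is that clarification requests made during training reveal information gaps; by penalizing success that requires clarification, we create pressure for comprehensive initial captions that let the reasoner solve the problem in a single pass. AC-RL improves average accuracy by 4.4 points across seven visual mathematical reasoning benchmarks, and analysis shows that it would cut clarification requests by up to 39% if they were allowed. By treating clarification as a form of implicit supervision, AC-RL shows that vision-language interfaces can be learned effectively through interaction, without explicit annotations.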
Summary / 总结
The research aims to improve vision-language models for visual mathematical reasoning by closing the interface mismatch between captions written for human readers and the precise visual details that reasoning systems need. The method, Adaptive-Clarification Reinforcement Learning (AC-RL), uses clarification requests during training to expose information gaps and penalizes success that requires clarification, which pushes the model toward comprehensive initial captions. AC-RL improves average accuracy by 4.4 points over pretrained baselines across seven benchmarks, and analysis shows it would cut clarification requests by up to 39% if they were allowed. This suggests that vision-language interfaces can be learned effectively through interaction alone, without explicit annotations.
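The research aims to improve vision-language models by addressing the mismatch, in visual mathematical reasoning, between the information that reasoners need and the information that current models provide. The method, Adaptive-Clarification Reinforcement Learning (AC-RL), teaches the model what information is needed through interaction and penalizes success that requires clarification, which pushes the model to provide comprehensive initial descriptions. The key findings are that AC-RL improves average accuracy by 4.4 points across seven benchmarks and, if clarification requests were allowed, would reduce them by up to 39%. This indicates that vision-language interfaces can be trained effectively through interaction, without explicit annotations.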