arXiv Paper Digest

Optimal Control Meets Flow Matching: A Principled Route to Multi-Subject Fidelity

Authors: Eric Tillmann Bill, Enis Simsar, Thomas Hofmann

First: 2025-10-02T17:59:58+00:00 · Latest: 2025-10-02T17:59:58+00:00

Comments: Code: https://github.com/ericbill21/FOCUS/

Abstract

Text-to-image (T2I) models excel on single-entity prompts but struggle with multi-subject descriptions, often showing attribute leakage, identity entanglement, and subject omissions. We introduce the first theoretical framework with a principled, optimizable objective for steering sampling dynamics toward multi-subject fidelity. Viewing flow matching (FM) through stochastic optimal control (SOC), we formulate subject disentanglement as control over a trained FM sampler. This yields two architecture-agnostic algorithms: (i) a training-free test-time controller that perturbs the base velocity with a single-pass update, and (ii) Adjoint Matching, a lightweight fine-tuning rule that regresses a control network to a backward adjoint signal while preserving base-model capabilities. The same formulation unifies prior attention heuristics, extends to diffusion models via a flow-diffusion correspondence, and provides the first fine-tuning route explicitly designed for multi-subject fidelity. Empirically, on Stable Diffusion 3.5, FLUX, and Stable Diffusion XL, both algorithms consistently improve multi-subject alignment while maintaining base-model style. Test-time control runs efficiently on commodity GPUs, and fine-tuned controllers trained on limited prompts generalize to unseen ones. We further highlight FOCUS (Flow Optimal Control for Unentangled Subjects), which achieves state-of-the-art multi-subject fidelity across models.

Summary

Text-to-image models often suffer from attribute leakage, identity entanglement, and subject omission on multi-subject prompts. This paper views flow matching through stochastic optimal control so that the sampling dynamics can be steered toward multi-subject fidelity. It proposes two algorithms: a training-free test-time controller that perturbs the base velocity, and Adjoint Matching, a lightweight fine-tuning rule. Both improve multi-subject alignment while preserving the base model's style. Test-time control runs efficiently on commodity GPUs, and controllers fine-tuned on a limited set of prompts generalize to unseen ones. FOCUS achieves state-of-the-art multi-subject fidelity across models.
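
The idea of a test-time controller that perturbs a base velocity can be illustrated with a toy sketch. Everything below is an assumption made for illustration: `base_velocity` stands in for a trained flow-matching sampler, `cost` is a hypothetical fidelity cost, and the gradient-based control term is a generic choice rather than the paper's controller.

```python
def base_velocity(x, t):
    # Toy stand-in for a trained flow-matching velocity field (assumption):
    # it pulls every coordinate toward 1.0.
    return [1.0 - xi for xi in x]


def cost(x):
    # Hypothetical fidelity cost; lower is better. Only x[0] is penalized.
    return 0.5 * (x[0] - 2.0) ** 2


def cost_grad(x):
    # Gradient of the hypothetical cost above.
    return [x[0] - 2.0, 0.0]


def sample(steps, strength):
    """Euler integration of dx/dt = v(x, t) + u(x), with u = -strength * grad(cost)."""
    x = [0.0, 0.0]
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = base_velocity(x, t)
        g = cost_grad(x)
        # The control term perturbs the base velocity; strength = 0 recovers the base sampler.
        x = [xi + dt * (vi - strength * gi) for xi, vi, gi in zip(x, v, g)]
    return x


uncontrolled = sample(steps=50, strength=0.0)
controlled = sample(steps=50, strength=1.0)
```

In this toy setting the controlled trajectory ends at a lower cost, while the coordinate that the cost ignores follows the base dynamics unchanged.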

NoiseShift: Resolution-Aware Noise Recalibration for Better Low-Resolution Image Generation

Authors: Ruozhen He, Moayed Haji-Ali, Ziyan Yang, Vicente Ordonez

First: 2025-10-02T17:59:43+00:00 · Latest: 2025-10-02T17:59:43+00:00

Abstract

Text-to-image diffusion models trained on a fixed set of resolutions often fail to generalize, even when asked to generate images at lower resolutions than those seen during training. High-resolution text-to-image generators therefore cannot easily offer an out-of-the-box, budget-efficient alternative to users who do not need high-resolution images. We identify a key property of diffusion models that, once addressed, helps tackle this limitation: noise schedulers have unequal perceptual effects across resolutions. The same level of noise removes disproportionately more signal from lower-resolution images than from high-resolution images, leading to a train-test mismatch. We propose NoiseShift, a training-free method that recalibrates the noise level of the denoiser conditioned on resolution size. NoiseShift requires no changes to model architecture or sampling schedule and is compatible with existing models. When applied to Stable Diffusion 3, Stable Diffusion 3.5, and Flux-Dev, quality at low resolutions is significantly improved. On LAION-COCO, NoiseShift improves SD3.5 by 15.89%, SD3 by 8.56%, and Flux-Dev by 2.44% in FID on average. On CelebA, NoiseShift improves SD3.5 by 10.36%, SD3 by 5.19%, and Flux-Dev by 3.02% in FID on average. These results demonstrate the effectiveness of NoiseShift in mitigating resolution-dependent artifacts and enhancing the quality of low-resolution image generation.

Summary

NoiseShift is a training-free method that recalibrates the noise level of text-to-image diffusion models according to resolution. It addresses the train-test mismatch that arises because the same noise level removes more signal from low-resolution images. It improves low-resolution generation across models: on LAION-COCO, average FID improves by 15.89% for SD3.5, 8.56% for SD3, and 2.44% for Flux-Dev; on CelebA, by 10.36%, 5.19%, and 3.02%, respectively.
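
A minimal sketch of resolution-conditioned noise recalibration, under stated assumptions: the offsets in `CALIBRATION_OFFSET` are hypothetical placeholders (real values would come from a calibration step not described in this digest), and the additive form of the shift is an illustrative choice. The point it shows is that only the noise level given to the denoiser changes, while the sampling schedule is left alone.

```python
# Hypothetical per-resolution offsets for the noise level the denoiser is
# conditioned on. These numbers are placeholders, not published values.
CALIBRATION_OFFSET = {256: 0.08, 512: 0.03, 1024: 0.0}


def conditioning_sigma(sigma, resolution):
    """Return the noise level passed to the denoiser as conditioning.

    The sampler keeps stepping with the original `sigma`; only the value
    shown to the denoiser is shifted, then clamped to [0, 1].
    """
    offset = CALIBRATION_OFFSET.get(resolution, 0.0)
    return min(1.0, max(0.0, sigma + offset))


schedule = [1.0, 0.75, 0.5, 0.25]  # unchanged sampling schedule
shifted_256 = [conditioning_sigma(s, 256) for s in schedule]
shifted_1024 = [conditioning_sigma(s, 1024) for s in schedule]
```

At the training resolution (assumed here to be 1024) the conditioning equals the schedule, so the base behavior is preserved there.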

VideoNSA: Native Sparse Attention Scales Video Understanding

Authors: Enxin Song, Wenhao Chai, Shusheng Yang, Ethan Armand, Xiaojun Shan, Haiyang Xu, Jianwen Xie, Zhuowen Tu

First: 2025-10-02T17:58:54+00:00 · Latest: 2025-10-02T17:58:54+00:00

Comments: Project Page: https://enxinsong.com/VideoNSA-web/, Code: https://github.com/Espere-1119-Song/VideoNSA

Abstract

Video understanding in multimodal language models remains limited by context length: models often miss key transition frames and struggle to maintain coherence across long time scales. To address this, we adapt Native Sparse Attention (NSA) to video-language models. Our method, VideoNSA, adapts Qwen2.5-VL through end-to-end training on a 216K video instruction dataset. We employ a hardware-aware hybrid approach to attention, preserving dense attention for text, while employing NSA for video. Compared to token-compression and training-free sparse baselines, VideoNSA achieves improved performance on long-video understanding, temporal reasoning, and spatial benchmarks. Further ablation analysis reveals four key findings: (1) reliable scaling to 128K tokens; (2) an optimal global-local attention allocation at a fixed budget; (3) task-dependent branch usage patterns; and (4) the learnable combined sparse attention helps induce dynamic attention sinks.

Summary

VideoNSA addresses the context-length limits of video understanding in multimodal language models by adapting Native Sparse Attention (NSA) to Qwen2.5-VL. It uses a hybrid scheme, with dense attention for text and NSA for video, and is trained end-to-end on a 216K video instruction dataset. It outperforms token-compression and training-free sparse baselines on long-video understanding, temporal reasoning, and spatial benchmarks. Key findings include reliable scaling to 128K tokens, an optimal global-local attention allocation at a fixed budget, task-dependent branch usage patterns, and evidence that learnable combined sparse attention helps induce dynamic attention sinks.

From Behavioral Performance to Internal Competence: Interpreting Vision-Language Models with VLM-Lens

Authors: Hala Sheta, Eric Huang, Shuyu Wu, Ilia Alenabi, Jiajun Hong, Ryker Lin, Ruoxi Ning, Daniel Wei, Jialin Yang, Jiawei Zhou, Ziqiao Ma, Freda Shi

Venue: EMNLP 2025

First: 2025-10-02T17:58:41+00:00 · Latest: 2025-10-02T17:58:41+00:00

Comments: EMNLP 2025 System Demonstration | Code: https://github.com/compling-wat/vlm-lens

Abstract

We introduce VLM-Lens, a toolkit designed to enable systematic benchmarking, analysis, and interpretation of vision-language models (VLMs) by supporting the extraction of intermediate outputs from any layer during the forward pass of open-source VLMs. VLM-Lens provides a unified, YAML-configurable interface that abstracts away model-specific complexities and supports user-friendly operation across diverse VLMs. It currently supports 16 state-of-the-art base VLMs and their over 30 variants, and is extensible to accommodate new models without changing the core logic. The toolkit integrates easily with various interpretability and analysis methods. We demonstrate its usage with two simple analytical experiments, revealing systematic differences in the hidden representations of VLMs across layers and target concepts. VLM-Lens is released as an open-sourced project to accelerate community efforts in understanding and improving VLMs.

Test-Time Anchoring for Discrete Diffusion Posterior Sampling

Authors: Litu Rout, Andreas Lugmayr, Yasamin Jafarian, Srivatsan Varadharajan, Constantine Caramanis, Sanjay Shakkottai, Ira Kemelmacher-Shlizerman

First: 2025-10-02T17:58:37+00:00 · Latest: 2025-10-02T17:58:37+00:00

Comments: Preprint

Abstract

We study the problem of posterior sampling using pretrained discrete diffusion foundation models, aiming to recover images from noisy measurements without retraining task-specific models. While diffusion models have achieved remarkable success in generative modeling, most advances rely on continuous Gaussian diffusion. In contrast, discrete diffusion offers a unified framework for jointly modeling categorical data such as text and images. Beyond unification, discrete diffusion provides faster inference, finer control, and principled training-free Bayesian inference, making it particularly well-suited for posterior sampling. However, existing approaches to discrete diffusion posterior sampling face severe challenges: derivative-free guidance yields sparse signals, continuous relaxations limit applicability, and split Gibbs samplers suffer from the curse of dimensionality. To overcome these limitations, we introduce Anchored Posterior Sampling (APS) for masked diffusion foundation models, built on two key innovations -- quantized expectation for gradient-like guidance in discrete embedding space, and anchored remasking for adaptive decoding. Our approach achieves state-of-the-art performance among discrete diffusion samplers across linear and nonlinear inverse problems on the standard benchmarks. We further demonstrate the benefits of our approach in training-free stylization and text-guided editing.

Summary

This paper studies posterior sampling with pretrained discrete diffusion models, recovering images from noisy measurements without retraining task-specific models. The proposed Anchored Posterior Sampling (APS) combines quantized expectation, which provides gradient-like guidance in discrete embedding space, with anchored remasking for adaptive decoding. APS achieves state-of-the-art performance among discrete diffusion samplers on linear and nonlinear inverse problems on standard benchmarks, and it also benefits training-free stylization and text-guided editing.

microCLIP: Unsupervised CLIP Adaptation via Coarse-Fine Token Fusion for Fine-Grained Image Classification

Authors: Sathira Silva, Eman Ali, Chetan Arora, Muhammad Haris Khan

First: 2025-10-02T17:47:39+00:00 · Latest: 2025-10-02T17:47:39+00:00

Abstract

Unsupervised adaptation of CLIP-based vision-language models (VLMs) for fine-grained image classification requires sensitivity to microscopic local cues. While CLIP exhibits strong zero-shot transfer, its reliance on coarse global features restricts its performance on fine-grained classification tasks. Prior efforts inject fine-grained knowledge by aligning large language model (LLM) descriptions with the CLIP [CLS] token; however, this approach overlooks spatial precision. We propose microCLIP, a self-training framework that jointly refines CLIP's visual and textual representations using fine-grained cues. At its core is Saliency-Oriented Attention Pooling (SOAP) within a lightweight TokenFusion module, which builds a saliency-guided [FG] token from patch embeddings and fuses it with the global [CLS] token for coarse-fine alignment. To stabilize adaptation, we introduce a two-headed LLM-derived classifier: a frozen classifier that, via multi-view alignment, provides a stable text-based prior for pseudo-labeling, and a learnable classifier initialized from LLM descriptions and fine-tuned with TokenFusion. We further develop Dynamic Knowledge Aggregation, which convexly combines fixed LLM/CLIP priors with TokenFusion's evolving logits to iteratively refine pseudo-labels. Together, these components uncover latent fine-grained signals in CLIP, yielding a consistent 2.90% average accuracy gain across 13 fine-grained benchmarks while requiring only light adaptation. Our code is available at https://github.com/sathiiii/microCLIP.

Summary

microCLIP is an unsupervised adaptation framework that improves fine-grained image classification with CLIP-based vision-language models. It introduces Saliency-Oriented Attention Pooling (SOAP) inside a TokenFusion module to refine CLIP's visual and textual representations using fine-grained cues. A two-headed LLM-derived classifier and Dynamic Knowledge Aggregation stabilize adaptation and refine pseudo-labels. The approach yields an average accuracy gain of 2.90% across 13 fine-grained benchmarks while requiring only light adaptation.
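
The convex combination used for pseudo-label refinement can be sketched in a few lines. This is an illustration only: the function names, the weight `alpha`, and the example logits are hypothetical, and how the weight is chosen or scheduled is not stated in the abstract.

```python
def aggregate_logits(prior_logits, fusion_logits, alpha):
    """Convex combination alpha * prior + (1 - alpha) * fusion, with 0 <= alpha <= 1."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * p + (1.0 - alpha) * f
            for p, f in zip(prior_logits, fusion_logits)]


def pseudo_label(logits):
    # Index of the largest logit, used as the pseudo-label.
    return max(range(len(logits)), key=lambda i: logits[i])


prior = [2.0, 1.0, 0.5]    # fixed prior logits (hypothetical values)
fusion = [0.5, 3.0, 0.2]   # evolving logits from the adapted branch (hypothetical)

early = pseudo_label(aggregate_logits(prior, fusion, alpha=0.9))  # prior dominates
late = pseudo_label(aggregate_logits(prior, fusion, alpha=0.3))   # adapted branch dominates
```

With a large weight on the fixed prior the pseudo-label follows the prior; as the weight shifts toward the evolving logits, the label can change.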

From Frames to Clips: Efficient Key Clip Selection for Long-Form Video Understanding

Authors: Guangyu Sun, Archit Singhal, Burak Uzkent, Mubarak Shah, Chen Chen, Garin Kessler

First: 2025-10-02T17:43:01+00:00 · Latest: 2025-10-02T17:43:01+00:00

Abstract

Video Large Language Models (VLMs) have achieved remarkable results on a variety of vision-language tasks, yet their practical use is limited by the "needle in a haystack" problem: the massive number of visual tokens produced from raw video frames exhausts the model's context window. Existing solutions alleviate this issue by selecting a sparse set of frames, thereby reducing token count, but such frame-wise selection discards essential temporal dynamics, leading to suboptimal reasoning about motion and event continuity. In this work we systematically explore the impact of temporal information and demonstrate that extending selection from isolated key frames to key clips, which are short, temporally coherent segments, improves video understanding. To maintain a fixed computational budget while accommodating the larger token footprint of clips, we propose an adaptive resolution strategy that dynamically balances spatial resolution and clip length, ensuring a constant token count per video. Experiments on three long-form video benchmarks demonstrate that our training-free approach, F2C, outperforms uniform sampling by up to 8.1%, 5.6%, and 10.3% on the Video-MME, LongVideoBench, and MLVU benchmarks, respectively. These results highlight the importance of preserving temporal coherence in frame selection and provide a practical pathway for scaling Video LLMs to real-world video understanding applications. The project webpage is available at https://guangyusun.com/f2c.

Summary

This work tackles long-form video understanding by selecting key clips, which are short, temporally coherent segments, instead of isolated frames. An adaptive resolution strategy balances spatial resolution against clip length so that the token count per video stays constant under a fixed computational budget. The training-free method, F2C, outperforms uniform sampling by up to 8.1%, 5.6%, and 10.3% on Video-MME, LongVideoBench, and MLVU, respectively, underlining the importance of temporal coherence in frame selection.
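
The trade-off between spatial resolution and clip length under a constant token count can be worked through with simple arithmetic. The sketch assumes a ViT-style encoder that emits one token per 14 x 14 patch and square frames; the patch size, the helper names, and the example numbers are illustrative assumptions, not values from the paper.

```python
import math


def tokens_per_frame(side, patch=14):
    # Assumes one visual token per patch x patch tile of a square frame.
    return (side // patch) ** 2


def max_side_for_budget(budget, num_clips, clip_len, patch=14):
    """Largest square frame side (a multiple of `patch`) such that
    num_clips * clip_len * tokens_per_frame stays within `budget`."""
    per_frame = budget // (num_clips * clip_len)
    return math.isqrt(per_frame) * patch


# Budget defined by 8 isolated key frames at 448 px (hypothetical baseline).
budget = 8 * 1 * tokens_per_frame(448)

# Replacing each frame with a 4-frame clip forces a lower spatial resolution.
side_for_clips = max_side_for_budget(budget, num_clips=8, clip_len=4)
total_tokens = 8 * 4 * tokens_per_frame(side_for_clips)
```

Here quadrupling the clip length halves the frame side, and the total token count matches the original budget.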

GeoPurify: A Data-Efficient Geometric Distillation Framework for Open-Vocabulary 3D Segmentation

Authors: Weijia Dou, Xu Zhang, Yi Bin, Jian Liu, Bo Peng, Guoqing Wang, Yang Yang, Heng Tao Shen

First: 2025-10-02T16:37:56+00:00 · Latest: 2025-10-02T16:37:56+00:00

Abstract

Recent attempts to transfer features from 2D Vision-Language Models (VLMs) to 3D semantic segmentation expose a persistent trade-off. Directly projecting 2D features into 3D yields noisy and fragmented predictions, whereas enforcing geometric coherence necessitates costly training pipelines and large-scale annotated 3D data. We argue that this limitation stems from the dominant segmentation-and-matching paradigm, which fails to reconcile 2D semantics with 3D geometric structure. The geometric cues are not eliminated during the 2D-to-3D transfer but remain latent within the noisy, view-aggregated features. To exploit this property, we propose GeoPurify, which applies a small Student Affinity Network to purify 2D VLM-generated 3D point features using geometric priors distilled from a 3D self-supervised teacher model. During inference, we devise a Geometry-Guided Pooling module to further denoise the point cloud and ensure semantic and structural consistency. Benefiting from latent geometric information and the learned affinity network, GeoPurify effectively mitigates the trade-off and achieves superior data efficiency. Extensive experiments on major 3D benchmarks demonstrate that GeoPurify achieves or surpasses state-of-the-art performance while utilizing only about 1.5% of the training data. Our code and checkpoints are available at https://github.com/tj12323/GeoPurify.

Summary

GeoPurify is a data-efficient framework that purifies 3D point features generated by 2D vision-language models, using geometric priors distilled from a 3D self-supervised teacher model. It consists of a Student Affinity Network for purification and a Geometry-Guided Pooling module that denoises the point cloud at inference time and enforces semantic and structural consistency. Experiments show that GeoPurify matches or surpasses state-of-the-art performance while using only about 1.5% of the training data.

Unlocking Vision-Language Models for Video Anomaly Detection via Fine-Grained Prompting

Authors: Shu Zou, Xinyu Tian, Lukas Wesemann, Fabian Waschkowski, Zhaoyuan Yang, Jing Zhang

First: 2025-10-02T16:06:31+00:00 · Latest: 2025-10-02T16:06:31+00:00

Comments: 14 pages, video anomaly detection

Abstract

Prompting has emerged as a practical way to adapt frozen vision-language models (VLMs) for video anomaly detection (VAD). Yet, existing prompts are often overly abstract, overlooking the fine-grained human-object interactions or action semantics that define complex anomalies in surveillance videos. We propose ASK-Hint, a structured prompting framework that leverages action-centric knowledge to elicit more accurate and interpretable reasoning from frozen VLMs. Our approach organizes prompts into semantically coherent groups (e.g., violence, property crimes, public safety) and formulates fine-grained guiding questions that align model predictions with discriminative visual cues. Extensive experiments on UCF-Crime and XD-Violence show that ASK-Hint consistently improves AUC over prior baselines, achieving state-of-the-art performance compared to both fine-tuned and training-free methods. Beyond accuracy, our framework provides interpretable reasoning traces that lead to each anomaly decision and demonstrates strong generalization across datasets and VLM backbones. These results highlight the critical role of prompt granularity and establish ASK-Hint as a new training-free and generalizable solution for explainable video anomaly detection.

Post-hoc Probabilistic Vision-Language Models

Authors: Anton Baumann, Rui Li, Marcus Klasson, Santeri Mentu, Shyamgopal Karthik, Zeynep Akata, Arno Solin, Martin Trapp

First: 2024-12-08T18:16:13+00:00 · Latest: 2025-10-02T15:13:15+00:00

Comments: Project page: https://aaltoml.github.io/BayesVLM/

Abstract

Vision-language models (VLMs), such as CLIP and SigLIP, have found remarkable success in classification, retrieval, and generative tasks. For this, VLMs deterministically map images and text descriptions to a joint latent space in which their similarity is assessed using the cosine similarity. However, a deterministic mapping of inputs fails to capture uncertainties over concepts arising from domain shifts when used in downstream tasks. In this work, we propose post-hoc uncertainty estimation in VLMs that does not require additional training. Our method leverages a Bayesian posterior approximation over the last layers in VLMs and analytically quantifies uncertainties over cosine similarities. We demonstrate its effectiveness for uncertainty quantification and support set selection in active learning. Compared to baselines, we obtain improved and well-calibrated predictive uncertainties, interpretable uncertainty estimates, and sample-efficient active learning. Our results show promise for safety-critical applications of large-scale models.

Summary

This work addresses the limits of deterministic embeddings in vision-language models (VLMs) with a post-hoc uncertainty estimation method that requires no additional training. It places a Bayesian posterior approximation over the last layers of the VLM and analytically quantifies the uncertainty of cosine similarities. Compared with baselines, the approach gives improved and well-calibrated predictive uncertainties, interpretable uncertainty estimates, and sample-efficient active learning, which is promising for safety-critical applications of large-scale models.
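
To make "uncertainty over cosine similarities" concrete, the sketch below propagates Gaussian uncertainty on two embeddings to their cosine similarity. Note the swap: the paper quantifies this analytically, whereas this illustration uses plain Monte Carlo sampling with independent per-dimension variances, which is a simplifying assumption. All names and numbers are hypothetical.

```python
import math
import random


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_uncertainty(mean_a, var_a, mean_b, var_b, n_samples=2000, seed=0):
    """Monte Carlo estimate of the mean and standard deviation of the cosine
    similarity when both embeddings are Gaussian with diagonal covariance."""
    rng = random.Random(seed)
    sims = []
    for _ in range(n_samples):
        a = [rng.gauss(m, math.sqrt(v)) for m, v in zip(mean_a, var_a)]
        b = [rng.gauss(m, math.sqrt(v)) for m, v in zip(mean_b, var_b)]
        sims.append(cosine(a, b))
    mu = sum(sims) / n_samples
    std = math.sqrt(sum((s - mu) ** 2 for s in sims) / n_samples)
    return mu, std


image_mean, text_mean = [1.0, 0.0, 1.0], [1.0, 1.0, 0.0]
no_var = [0.0, 0.0, 0.0]
some_var = [0.05, 0.05, 0.05]

mu_det, std_det = cosine_uncertainty(image_mean, no_var, text_mean, no_var)
mu_unc, std_unc = cosine_uncertainty(image_mean, some_var, text_mean, some_var)
```

With zero variance the estimate collapses to the deterministic cosine similarity; with non-zero variance the spread of the similarity becomes an explicit uncertainty estimate.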

DiCache: Let Diffusion Model Determine Its Own Cache

Authors: Jiazi Bu, Pengyang Ling, Yujie Zhou, Yibin Wang, Yuhang Zang, Dahua Lin, Jiaqi Wang

First: 2025-08-24T13:30:00+00:00 · Latest: 2025-10-02T14:42:41+00:00

Comments: Project Page: https://bujiazi.github.io/dicache.github.io/, Code: https://github.com/Bujiazi/DiCache

Abstract

Recent years have witnessed the rapid development of acceleration techniques for diffusion models, especially caching-based acceleration methods. These studies seek to answer two fundamental questions: "When to cache" and "How to use cache", typically relying on predefined empirical laws or dataset-level priors to determine caching timings and adopting handcrafted rules for multi-step cache utilization. However, given the highly dynamic nature of the diffusion process, they often exhibit limited generalizability and fail to cope with diverse samples. In this paper, a strong sample-specific correlation is revealed between the variation patterns of the shallow-layer feature differences in the diffusion model and those of deep-layer features. Moreover, we have observed that the features from different model layers form similar trajectories. Based on these observations, we present DiCache, a novel training-free adaptive caching strategy for accelerating diffusion models at runtime, answering both when and how to cache within a unified framework. Specifically, DiCache is composed of two principal components: (1) Online Probe Profiling Scheme leverages a shallow-layer online probe to obtain an on-the-fly indicator for the caching error in real time, enabling the model to dynamically customize the caching schedule for each sample. (2) Dynamic Cache Trajectory Alignment adaptively approximates the deep-layer feature output from multi-step historical caches based on the shallow-layer feature trajectory, facilitating higher visual quality. Extensive experiments validate DiCache's capability in achieving higher efficiency and improved fidelity over state-of-the-art approaches on various leading diffusion models including WAN 2.1, HunyuanVideo and Flux.

Summary

DiCache is a training-free adaptive caching strategy for diffusion models that decides both when and how to cache, based on the sample-specific correlation between shallow-layer and deep-layer feature variations. An online shallow-layer probe customizes the caching schedule for each sample, and the deep-layer output is approximated from multi-step historical caches guided by the shallow-layer feature trajectory. Experiments show that DiCache improves on existing methods in both efficiency and visual fidelity across diffusion models including WAN 2.1, HunyuanVideo, and Flux.
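
The "when to cache" decision driven by a shallow-layer probe can be sketched as a threshold rule. This is a simplified illustration under assumptions: the relative L1 change, the threshold value, and the function names are hypothetical, and the sketch does not cover how cached outputs are combined.

```python
def relative_change(current, reference):
    # Relative L1 change of the probe features against a reference.
    num = sum(abs(c - r) for c, r in zip(current, reference))
    den = sum(abs(r) for r in reference) or 1.0
    return num / den


def plan_cache_schedule(probe_features, threshold):
    """probe_features: cheap shallow-layer features, one list per timestep.
    Returns one decision per step: True = run the full model,
    False = reuse the cached deep-layer output."""
    decisions = [True]                 # the first step always runs the full model
    last_full = probe_features[0]      # probe features at the last full computation
    for feat in probe_features[1:]:
        if relative_change(feat, last_full) > threshold:
            decisions.append(True)     # probe drifted too far: recompute
            last_full = feat
        else:
            decisions.append(False)    # probe still close: reuse the cache
    return decisions


probe = [[1.0, 1.0], [1.01, 1.0], [1.02, 1.0], [1.5, 1.2], [1.51, 1.2]]
schedule = plan_cache_schedule(probe, threshold=0.05)
```

Because the decision depends on the probe values of the sample being generated, two samples with different dynamics get different schedules from the same rule.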

Robust Prompt Tuning for Vision-Language Models with Mild Semantic Noise

Authors: Yansheng Gao, Yufei Zheng, Shengsheng Wang

First: 2025-08-06T17:42:30+00:00 · Latest: 2025-10-02T14:03:47+00:00

Abstract

Prompt tuning has shown promising results, but its robustness and generalization to unseen categories remain limited. Through our experiments, we demonstrate that the complete removal of semantic noise is a key factor restricting robustness. Existing methods typically suppress or filter out semantic noise in the prompt space, inadvertently hindering the model's robustness and its ability to generalize to unseen categories. To address this, we propose ANPrompt, a robust prompt tuning framework that actively incorporates weak semantic noise. By clustering weakly perturbed features into noise prompts and integrating them with learnable tokens in both the text and vision encoders, ANPrompt ensures controlled exposure to semantic variations. To enhance the visual pathway, we introduce the Noise-Resistant Visual Prompt Prototype (NRVPP), which stabilizes visual semantics under weak perturbations. Additionally, we propose a Weak Alignment Loss (WALoss) at the logits level to enforce consistency between clean and perturbed predictions, providing stable supervision. By combining weak semantic noise exposure with logits-based consistency, ANPrompt prevents overfitting to specific phrasings while preserving semantic integrity. Extensive experiments across 11 benchmarks, including base-to-new splits, show that ANPrompt consistently outperforms existing prompt tuning methods, offering superior robustness to semantic noise and improved generalization across tasks.

Summary / 总结

The research aims to enhance the robustness and generalization of vision-language models by addressing the limitations of prompt tuning. The method involves ANPrompt, which actively incorporates weak semantic noise into the prompt tuning process. By clustering weakly perturbed features and integrating them with learnable tokens, ANPrompt ensures controlled exposure to semantic variations. Experiments across 11 benchmarks demonstrate that ANPrompt outperforms existing methods, providing better robustness to semantic noise and improved generalization across tasks.

研究旨在通过解决提示调优方法的局限性，增强视觉-语言模型的鲁棒性和泛化能力。ANPrompt 是一种新型框架，通过将弱语义噪声引入提示调优过程来提升模型的鲁棒性。实验结果表明，ANPrompt 在 11 个基准测试中均优于现有方法，展示了对语义噪声的更强鲁棒性和更好的任务泛化能力。
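
The abstract describes WALoss only as a logits-level consistency term between clean and perturbed predictions. As a minimal sketch of what such a term can look like, the Python below computes a KL divergence between the two softmax distributions. The function names, the temperature parameter, and the choice of KL divergence are assumptions made for illustration, not the authors' definition.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def consistency_loss(clean_logits, perturbed_logits, temperature=1.0):
    """KL(clean || perturbed) between the two predicted distributions.

    Hypothetical stand-in for a logits-level alignment loss.
    """
    p = softmax(clean_logits, temperature)
    q = softmax(perturbed_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Toy logits (made up for illustration only).
clean = [2.0, 0.5, -1.0]
weakly_perturbed = [1.9, 0.6, -1.0]
strongly_perturbed = [-1.0, 0.5, 2.0]

small = consistency_loss(clean, weakly_perturbed)
large = consistency_loss(clean, strongly_perturbed)
```

In this toy setup a weak perturbation yields a small penalty and a prediction flip yields a large one, which is the behavior a consistency term of this kind is meant to have.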

PlaceFM: A Training-free Geospatial Foundation Model of Places using Large-Scale Point of Interest Data

Authors: Mohammad Hashemi, Hossein Amiri, Andreas Zufle

First: 2025-06-25T15:10:31+00:00 · Latest: 2025-10-02T13:01:06+00:00

Abstract

With the rapid growth and continual updates of geospatial data from diverse sources, geospatial foundation model pre-training for urban representation learning has emerged as a key research direction for advancing data-driven urban planning. Spatial structure is fundamental to effective geospatial intelligence systems; however, existing foundation models often lack the flexibility to reason about places, context-rich regions spanning multiple spatial granularities that may consist of many spatially and semantically related points of interest. To address this gap, we propose PlaceFM, a geospatial foundation model that captures place representations through a training-free, clustering-based approach. PlaceFM summarizes the entire point of interest graph constructed from U.S. Foursquare data, producing general-purpose region embeddings while automatically identifying places of interest. These embeddings can be directly integrated into geolocation data pipelines to support a variety of urban downstream tasks. Without the need for costly pre-training, PlaceFM provides a scalable and efficient solution for multi-granular geospatial analysis. Extensive experiments on two real-world prediction tasks, ZIP code-level population density and housing prices, demonstrate that PlaceFM not only outperforms most state-of-the-art graph-based geospatial foundation models but also achieves up to a 100x speedup in generating region-level representations on large-scale POI graphs. The implementation is available at https://github.com/mohammadhashemii/PlaceFM.

中文标题/摘要

标题：PlaceFM：基于大规模兴趣点数据的无训练地理空间基础模型

随着地理空间数据从多种来源的快速增长和持续更新，基于地理空间数据的城市表示学习预训练已成为推动数据驱动城市规划的关键研究方向。空间结构是有效地理空间智能系统的基础；然而，现有的基础模型往往缺乏灵活地推理地方的能力，即跨越多个空间粒度的上下文丰富的区域，这些区域可能包含许多空间上和语义上相关的兴趣点。为了解决这一差距，我们提出了一种无训练的地理空间基础模型PlaceFM，通过基于聚类的方法捕捉地方表示。PlaceFM 通过美国Foursquare数据构建的兴趣点图总结，生成通用区域嵌入，同时自动识别兴趣点。这些嵌入可以直接集成到地理定位数据管道中，以支持各种城市下游任务。无需昂贵的预训练，PlaceFM 提供了一种可扩展且高效的多粒度地理空间分析解决方案。在两个实际预测任务（邮政编码级别的人口密度和住房价格）上的广泛实验表明，PlaceFM 不仅优于大多数最先进的基于图的地理空间基础模型，还在大规模POI图上生成区域级表示时实现了高达100倍的速度提升。实现代码可在https://github.com/mohammadhashemii/PlaceFM 获取。

Summary / 总结

PlaceFM is a training-free geospatial foundation model that captures place representations using a clustering-based approach on large-scale point of interest data. It automatically identifies places of interest without pre-training, providing general-purpose region embeddings that can be integrated into geolocation data pipelines. Experiments show that PlaceFM outperforms most state-of-the-art graph-based models and achieves up to a 100x speedup in generating region-level representations for urban tasks such as population density and housing price prediction.

PlaceFM 是一种无需训练的地理空间基础模型，采用基于聚类的方法，从大规模兴趣点（POI）数据中捕获地方表示。它能够自动识别兴趣地点并生成通用区域嵌入，这些嵌入可以直接集成到地理定位数据管道中以支持城市任务。实验表明，PlaceFM 优于大多数最先进的基于图的地理空间基础模型，并且在大规模 POI 图上生成区域级表示时可实现高达 100 倍的速度提升。

More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models

Authors: Xinyu Tian, Shu Zou, Zhaoyuan Yang, Mengqi He, Fabian Waschkowski, Lukas Wesemann, Peter Tu, Jing Zhang

First: 2025-09-30T06:37:47+00:00 · Latest: 2025-10-02T12:24:56+00:00

Abstract

Reasoning has emerged as a pivotal capability in Large Language Models (LLMs). Through Reinforcement Learning (RL), typically Group Relative Policy Optimization (GRPO), these models are able to solve complex tasks such as mathematics and code generation. Building on these advances, recent research has sought to extend reasoning to Vision-Language Models (VLMs), yielding promising results across diverse visual tasks. Despite this progress, our study uncovers the dual nature of multimodal reasoning: while it substantially enhances logical inference and facilitates performance on challenging problems, it may gradually impair perceptual grounding, leading to recognition failures on otherwise basic visual questions. Through further analysis, we attribute this phenomenon to visual forgetting, wherein prolonged reasoning causes the model to increasingly disregard visual input. To address this, we propose Vision-Anchored Policy Optimization (VAPO), a simple yet effective method that explicitly steers the reasoning process toward visually grounded trajectories. Our resulting model, VAPO-Thinker-7B, significantly strengthens the model's reliance on visual information and achieves new state-of-the-art results on a wide range of established benchmarks. Project page: https://xytian1008.github.io/VAPO/

Summary / 总结

This study explores the dual nature of reasoning in Vision-Language Models (VLMs), showing that while reasoning enhances logical inference and performance on complex tasks, it can also impair perceptual grounding, leading to recognition failures. The authors attribute this to visual forgetting, where prolonged reasoning causes the model to disregard visual input. To address this, they propose Vision-Anchored Policy Optimization (VAPO), which improves the model's reliance on visual information and achieves new state-of-the-art results on various benchmarks.

研究探讨了视觉语言模型（VLMs）中推理的双重性质，发现虽然推理增强了逻辑推理能力和复杂任务的表现，但也可能损害感知定位，导致基本视觉问题的识别失败。作者将这一现象归因于视觉遗忘，即长时间推理导致模型逐渐忽视视觉输入。为了解决这一问题，他们提出了视觉锚定策略优化（VAPO），该方法增强了模型对视觉信息的依赖，并在多种基准测试中取得了新的最佳结果。
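
The abstract names GRPO as the typical RL recipe behind reasoning models. Its core idea is to score each sampled response relative to the other responses drawn for the same prompt. The sketch below shows only that group-relative normalization. The function name and the `eps` constant are illustrative, the population standard deviation is one possible choice (implementations differ), and nothing here reproduces VAPO itself, whose objective the abstract does not specify.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt: two correct (1.0), two incorrect (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every response earns the same reward carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Correct responses receive positive advantages and incorrect ones negative advantages, without needing a separately learned value function.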

SoftCFG: Uncertainty-guided Stable Guidance for Visual Autoregressive Model

Authors: Dongli Xu, Aleksei Tiulpin, Matthew B. Blaschko

First: 2025-10-01T15:04:00+00:00 · Latest: 2025-10-02T09:32:57+00:00

Comments: preprint

Abstract

Autoregressive (AR) models have emerged as powerful tools for image generation by modeling images as sequences of discrete tokens. While Classifier-Free Guidance (CFG) has been adopted to improve conditional generation, its application in AR models faces two key issues: guidance diminishing, where the conditional-unconditional gap quickly vanishes as decoding progresses, and over-guidance, where strong conditions distort visual coherence. To address these challenges, we propose SoftCFG, an uncertainty-guided inference method that distributes adaptive perturbations across all tokens in the sequence. The key idea behind SoftCFG is to let each generated token contribute certainty-weighted guidance, ensuring that the signal persists across steps while resolving conflicts between text guidance and visual context. To further stabilize long-sequence generation, we introduce Step Normalization, which bounds cumulative perturbations of SoftCFG. Our method is training-free, model-agnostic, and seamlessly integrates with existing AR pipelines. Experiments show that SoftCFG significantly improves image quality over standard CFG and achieves state-of-the-art FID on ImageNet 256*256 among autoregressive models.

中文标题/摘要

标题：SoftCFG：基于不确定性指导的视觉自回归模型稳定生成方法

自回归（AR）模型已成为通过将图像建模为离散标记序列的强大工具。尽管已采用分类器无条件引导（CFG）来改进条件生成，但在AR模型中的应用面临两个关键问题：引导减弱，其中条件与无条件之间的差距随着解码进程迅速消失；以及过度引导，其中强烈的条件导致视觉连贯性受损。为了解决这些挑战，我们提出了SoftCFG，这是一种基于不确定性指导的推理方法，它在整个序列中分配自适应扰动。SoftCFG的核心思想是让每个生成的标记贡献加权引导，确保信号在步骤之间持续存在，同时解决文本指导与视觉上下文之间的冲突。为了进一步稳定长序列生成，我们引入了步长归一化，以限制SoftCFG的累积扰动。该方法无需训练，模型无关，并且可以无缝集成到现有的AR管道中。实验表明，SoftCFG在图像质量上显著优于标准CFG，并在ImageNet 256*256的自回归模型中达到了最先进的FID。

Summary / 总结

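SoftCFG is an uncertainty-guided inference method for improving classifier-free guidance in autoregressive image generation. It distributes adaptive perturbations across all tokens in the sequence to address guidance diminishing and over-guidance, so that the guidance signal persists across decoding steps while conflicts between text guidance and visual context are resolved. Step Normalization bounds the cumulative perturbations to stabilize long-sequence generation. Experiments show that SoftCFG significantly improves image quality over standard CFG and achieves state-of-the-art FID on ImageNet 256×256 among autoregressive models.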

SoftCFG 是一种基于不确定性指导的推理方法，旨在提高自回归模型在图像生成中的稳定性。它通过在整个序列中分配自适应扰动来解决指导减弱和过度指导的问题，确保信号在各步骤中持续存在并解决文本指导与视觉上下文之间的冲突。实验表明，SoftCFG 显著提升了图像质量，并在 ImageNet 256*256 的自回归模型中达到了最先进的 FID 分数。
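
For context, the sketch below shows standard classifier-free guidance applied to next-token logits, and why a vanishing conditional-unconditional gap ("guidance diminishing") makes the guidance scale irrelevant. It is a generic illustration of the baseline that SoftCFG modifies, not SoftCFG itself. The function names and the toy logits are assumptions.

```python
def cfg_logits(cond, uncond, scale):
    """Standard CFG: push the unconditional logits along (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def guidance_gap(cond, uncond):
    """Largest absolute difference between conditional and unconditional logits."""
    return max(abs(c - u) for c, u in zip(cond, uncond))


# Early decoding step: the two branches disagree, so guidance has an effect.
cond_early = [2.0, 0.5, -1.0]
uncond_early = [1.0, 1.0, 0.0]
guided_early = cfg_logits(cond_early, uncond_early, scale=3.0)

# Late decoding step (toy case): the branches agree, so the gap is zero
# and the guided logits equal the unconditional ones for any scale.
cond_late = [0.3, 1.2, -0.4]
uncond_late = [0.3, 1.2, -0.4]
guided_late = cfg_logits(cond_late, uncond_late, scale=3.0)
```

With `scale=1.0` the function returns the conditional logits unchanged. Larger scales amplify the gap, which is where over-guidance can come from.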

VideoGen-of-Thought: Step-by-step generating multi-shot video with minimal manual intervention

Authors: Mingzhe Zheng, Yongqi Xu, Haojian Huang, Xuran Ma, Yexin Liu, Wenjie Shu, Yatian Pang, Feilong Tang, Qifeng Chen, Harry Yang, Ser-Nam Lim

First: 2024-12-03T08:33:50+00:00 · Latest: 2025-10-02T08:38:03+00:00

Comments: Code: https://github.com/DuNGEOnmassster/VideoGen-of-Thought.git; Webpage: https://cheliosoops.github.io/VGoT/

Abstract

Current video generation models excel at short clips but fail to produce cohesive multi-shot narratives due to disjointed visual dynamics and fractured storylines. Existing solutions either rely on extensive manual scripting/editing or prioritize single-shot fidelity over cross-scene continuity, limiting their practicality for movie-like content. We introduce VideoGen-of-Thought (VGoT), a step-by-step framework that automates multi-shot video synthesis from a single sentence by systematically addressing three core challenges: (1) Narrative fragmentation: Existing methods lack structured storytelling. We propose dynamic storyline modeling, which turns the user prompt into concise shot drafts and then expands them into detailed specifications across five domains (character dynamics, background continuity, relationship evolution, camera movements, and HDR lighting) with self-validation to ensure logical progress. (2) Visual inconsistency: Previous approaches struggle to maintain consistent appearance across shots. Our identity-aware cross-shot propagation builds identity-preserving portrait (IPP) tokens that keep character identity while allowing controlled trait changes (expressions, aging) required by the story. (3) Transition artifacts: Abrupt shot changes disrupt immersion. Our adjacent latent transition mechanisms implement boundary-aware reset strategies that process adjacent shots' features at transition points, enabling seamless visual flow while preserving narrative continuity. Combined in a training-free pipeline, VGoT surpasses strong baselines by 20.4% in within-shot face consistency and 17.4% in style consistency, while requiring 10x fewer manual adjustments. VGoT bridges the gap between raw visual synthesis and director-level storytelling for automated multi-shot video generation.

Nav-EE: Navigation-Guided Early Exiting for Efficient Vision-Language Models in Autonomous Driving

Authors: Haibo Hu, Lianming Huang, Xinyu Wang, Yufei Cui, Nan Guan, Chun Jason Xue

First: 2025-10-02T08:37:58+00:00 · Latest: 2025-10-02T08:37:58+00:00

Abstract

Vision-Language Models (VLMs) are increasingly applied in autonomous driving for unified perception and reasoning, but high inference latency hinders real-time deployment. Early-exit reduces latency by terminating inference at intermediate layers, yet its task-dependent nature limits generalization across diverse scenarios. We observe that this limitation aligns with autonomous driving: navigation systems can anticipate upcoming contexts (e.g., intersections, traffic lights), indicating which tasks will be required. We propose Nav-EE, a navigation-guided early-exit framework that precomputes task-specific exit layers offline and dynamically applies them online based on navigation priors. Experiments on CODA, Waymo, and BOSCH show that Nav-EE achieves accuracy comparable to full inference while reducing latency by up to 63.9%. Real-vehicle integration with Autoware Universe further demonstrates reduced inference latency (600ms to 300ms), supporting faster decision-making in complex scenarios. These results suggest that coupling navigation foresight with early-exit offers a viable path toward efficient deployment of large models in autonomous systems. Code and data are available at our anonymous repository: https://anonymous.4open.science/r/Nav-EE-BBC4

中文标题/摘要

标题：Nav-EE：自主驾驶中高效视觉-语言模型的导航引导早期退出

视觉-语言模型（VLMs）在自主驾驶中被广泛应用，用于统一感知和推理，但高推理延迟阻碍了实时部署。早期退出通过在中间层终止推理来减少延迟，但其任务依赖性限制了在不同场景中的泛化能力。我们观察到这一限制与自主驾驶一致：导航系统可以预测即将出现的上下文（例如，交叉口、交通灯），从而指示哪些任务将被需要。我们提出了一种导航引导的早期退出框架Nav-EE，该框架离线预计算了特定任务的退出层，并基于导航先验在线动态应用。在CODA、Waymo和BOSCH上的实验表明，Nav-EE在保持与完整推理相同准确性的前提下，延迟最多可减少63.9%。与Autoware Universe的实车集成进一步证明了延迟减少（600ms至300ms），支持在复杂场景中更快的决策。这些结果表明，将导航先见与早期退出相结合，为在自主系统中高效部署大型模型提供了一条可行路径。代码和数据可在我们的匿名存储库中获得：https://anonymous.4open.science/r/Nav-EE-BBC4

Summary / 总结

The research aims to address the high inference latency of Vision-Language Models (VLMs) in autonomous driving, which hinders real-time deployment. Nav-EE, a navigation-guided early-exit framework, precomputes task-specific exit layers offline and applies them dynamically based on navigation priors, reducing latency by up to 63.9% while maintaining comparable accuracy to full inference. Real-vehicle integration shows a significant reduction in inference latency from 600ms to 300ms, supporting faster decision-making in complex scenarios.

研究旨在解决视觉-语言模型（VLMs）在自动驾驶中的高推理延迟问题，这阻碍了实时部署。Nav-EE框架提出了一种基于导航的早期退出方法，该方法在离线预计算任务特定的退出层后，根据导航先验在线动态应用，实现了与完整推理相当的准确性，同时延迟最多减少了63.9%。实车集成进一步证实了推理延迟的显著减少，支持在复杂场景中更快的决策。这表明将导航先见与早期退出相结合可以有效地在自主系统中部署大型模型。
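
The mechanism described above (exit layers precomputed offline per task, then selected online from navigation priors) can be sketched as a lookup followed by a truncated forward pass. Everything below is hypothetical: the context names, the exit-layer values, and the toy layers are placeholders, and a real system would also need an output head that can read intermediate features.

```python
# Hypothetical offline result: navigation context -> number of layers to run.
EXIT_LAYER_BY_CONTEXT = {
    "intersection": 2,
    "traffic_light": 3,
}


def run_with_early_exit(layers, x, context, table=EXIT_LAYER_BY_CONTEXT):
    """Run only the first `exit_layer` layers for the anticipated context.

    Contexts missing from the table fall back to full inference.
    """
    exit_layer = table.get(context, len(layers))
    for layer in layers[:exit_layer]:
        x = layer(x)
    return x, exit_layer


# Toy "model": four layers that each add 1 to the input.
toy_layers = [lambda v: v + 1 for _ in range(4)]

out_fast, used_fast = run_with_early_exit(toy_layers, 0, "intersection")
out_full, used_full = run_with_early_exit(toy_layers, 0, "unknown_context")
```

The fallback to full depth for unseen contexts is a design choice of this sketch, chosen so that an unanticipated situation never runs a shallower model than intended.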

Accelerating Attention with Basis Decomposition

Authors: Jialin Zhao

First: 2025-10-02T06:58:10+00:00 · Latest: 2025-10-02T06:58:10+00:00

Abstract

Attention is a core operation in large language models (LLMs) and vision-language models (VLMs). We present BD Attention (BDA), the first lossless algorithmic reformulation of attention. BDA is enabled by a simple matrix identity from Basis Decomposition (BD), which restructures multi-head projections into a compact form while preserving exact outputs. Unlike I/O-aware system optimizations such as FlashAttention, BDA provides a mathematically guaranteed acceleration that is architecture-agnostic. On DeepSeek-V2-Lite (16B, FP16), BDA requires only 4s of offline preparation with no retraining required and, on modern GPUs, achieves 32% faster key/value projections and 25% smaller weights, while increasing end-to-end perplexity (PPL) by just 0.02% (FP16) or 0.0004% (FP32), a negligible effect on model performance. These results position BDA as the first theoretically exact method for lossless attention acceleration that is complementary to existing engineering-level optimizations. Our code is available at https://github.com/abcbdf/basis-decomposition-official.

中文标题/摘要

标题：基于分解加速注意力

注意力是大型语言模型（LLMs）和视觉-语言模型（VLMs）中的核心操作。我们提出了BD注意力（BDA），这是第一个无损的算法重构注意力的方法。BDA得益于基分解（BD）中的一个简单矩阵恒等式，将多头投影重新结构化为紧凑形式，同时保持精确输出。与FlashAttention等输入输出感知系统优化不同，BDA提供了数学上保证的加速，且不受架构限制。在DeepSeek-V2-Lite（16B，FP16）上，BDA只需要4s的离线准备时间，无需重新训练，并在现代GPU上实现32%更快的关键/值投影和25%更小的权重，同时增加端到端困惑度（PPL）仅0.02%（FP16）或0.0004%（FP32），对模型性能的影响可以忽略不计。这些结果将BDA定位为第一个理论上精确的无损注意力加速方法，与现有的工程级优化相辅相成。我们的代码可在https://github.com/abcbdf/basis-decomposition-official获取。

Summary / 总结

The paper introduces BD Attention (BDA), a lossless algorithmic reformulation of attention that uses Basis Decomposition (BD) to restructure multi-head projections into a compact form while preserving exact outputs. On DeepSeek-V2-Lite (16B, FP16), BDA requires only 4s of offline preparation with no retraining and achieves 32% faster key/value projections and 25% smaller weights, with a negligible increase in end-to-end perplexity (PPL) of 0.02% (FP16) or 0.0004% (FP32).

BD Attention (BDA) 是基于 Basis Decomposition (BD) 的注意力算法重构，将多头投影重新结构化为紧凑形式而不改变输出。在 DeepSeek-V2-Lite (16B, FP16) 上，BDA 需要 4s 的离线准备时间，实现 32% 更快的键/值投影和 25% 更小的权重，同时仅导致端到端困惑度 (PPL) 微小增加（FP16 中为 0.02%，FP32 中为 0.0004%）。这使 BDA 成为一种理论上精确的无损注意力加速方法，可以补充现有的优化。

Contrastive Representation Regularization for Vision-Language-Action Models

Authors: Taeyoung Kim, Jimin Lee, Myungkyu Koo, Dongyoung Kim, Kyungmin Lee, Changyeon Kim, Younggyo Seo, Jinwoo Shin

First: 2025-10-02T06:41:22+00:00 · Latest: 2025-10-02T06:41:22+00:00

Comments: 20 pages, 12 figures

Abstract

Vision-Language-Action (VLA) models have shown their capabilities in robot manipulation by leveraging rich representations from pre-trained Vision-Language Models (VLMs). However, their representations arguably remain suboptimal, lacking sensitivity to robotic signals such as control actions and proprioceptive states. To address this issue, we introduce Robot State-aware Contrastive Loss (RS-CL), a simple and effective representation regularization for VLA models, designed to bridge the gap between VLM representations and robotic signals. In particular, RS-CL aligns the representations more closely with the robot's proprioceptive states by using relative distances between the states as soft supervision. Complementing the original action prediction objective, RS-CL effectively enhances control-relevant representation learning, while being lightweight and fully compatible with standard VLA training pipelines. Our empirical results demonstrate that RS-CL substantially improves the manipulation performance of state-of-the-art VLA models; it pushes the prior art from 30.8% to 41.5% on pick-and-place tasks in RoboCasa-Kitchen, through more accurate positioning during grasping and placing, and boosts success rates from 45.0% to 58.3% on challenging real-robot manipulation tasks.

Summary / 总结

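This paper introduces Robot State-aware Contrastive Loss (RS-CL), a representation regularization for Vision-Language-Action (VLA) models used in robot manipulation. RS-CL aligns the model's representations more closely with the robot's proprioceptive states by using relative distances between states as soft supervision, complementing the original action prediction objective while remaining lightweight. Experiments show that RS-CL improves pick-and-place performance in RoboCasa-Kitchen from 30.8% to 41.5% and raises success rates on real-robot manipulation tasks from 45.0% to 58.3%.

本文提出了机器人状态感知对比损失（RS-CL），以改进用于机器人操作的视觉-语言-动作（VLA）模型的表示学习。RS-CL通过使用状态之间的相对距离作为软监督，使模型的表示与机器人的本体感受状态更加一致。该方法增强了控制相关的表示学习，且不增加显著的复杂性。实验结果表明，RS-CL显著提高了操作性能，将RoboCasa-Kitchen中抓取和放置任务的成功率从30.8%提升到了41.5%，并将真实机器人操作任务的成功率从45.0%提升到了58.3%。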
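
One plausible reading of "relative distances between the states as soft supervision" is a contrastive objective whose targets are a soft distribution derived from state distances instead of hard positive/negative labels. The sketch below implements that reading in plain Python. The loss form, the temperature, the dot-product similarity, and all names are assumptions for illustration and are not taken from the paper.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def soft_contrastive_loss(embeddings, states, temperature=0.5):
    """Cross-entropy between state-derived soft targets and embedding similarities.

    For each anchor i, closer robot states get larger target weight, and the
    embedding similarities are trained to match that distribution.
    """
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        target = softmax([-euclidean(states[i], states[j]) / temperature for j in others])
        pred = softmax([dot(embeddings[i], embeddings[j]) / temperature for j in others])
        total += -sum(t * math.log(p) for t, p in zip(target, pred))
    return total / n


# Toy batch: samples 0 and 1 have nearby states, sample 2 is far away.
states = [[0.0], [0.1], [1.0]]
aligned = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]   # embedding geometry follows the states
shuffled = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]  # embedding geometry contradicts them

loss_aligned = soft_contrastive_loss(aligned, states)
loss_shuffled = soft_contrastive_loss(shuffled, states)
```

Embeddings whose geometry mirrors the state geometry incur the lower loss, which is the property such a regularizer would rely on.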

Audio-Enhanced Vision-Language Modeling with Latent Space Broadening for High Quality Data Expansion

Authors: Yu Sun, Yin Li, Ruixiao Sun, Chunhui Liu, Fangming Zhou, Ze Jin, Linjie Wang, Xiang Shen, Zhuolin Hao, Hongyu Xiong

First: 2025-03-21T21:55:05+00:00 · Latest: 2025-10-02T06:26:27+00:00

Abstract

Transformer-based multimodal models are widely used in industrial-scale recommendation, search, and advertising systems for content understanding and relevance ranking. Enhancing labeled training data quality and cross-modal fusion significantly improves model performance, influencing key metrics such as quality view rates and ad revenue. High-quality annotations are crucial for advancing content modeling, yet traditional statistical-based active learning (AL) methods face limitations: they struggle to detect overconfident misclassifications and are less effective in distinguishing semantically similar items in deep neural networks. Additionally, audio information plays an increasing role, especially in short-video platforms, yet most pre-trained multimodal architectures primarily focus on text and images. While training from scratch across all three modalities is possible, it sacrifices the benefits of leveraging existing pre-trained visual-language (VL) and audio models. To address these challenges, we propose kNN-based Latent Space Broadening (LSB) to enhance AL efficiency and Vision-Language Modeling with Audio Enhancement (VLMAE), a mid-fusion approach integrating audio into VL models. This system has been deployed in production, leading to significant business gains.

中文标题/摘要

标题：增强音频的视觉语言建模与潜在空间扩展以实现高质量数据扩展

基于变换器的多模态模型在工业规模的内容理解和相关性排名中广泛应用于推荐、搜索和广告系统。提高标记训练数据质量和跨模态融合显著提升了模型性能，影响了诸如高质量观看率和广告收入等关键指标。高质量注解对于推进内容建模至关重要，但传统的基于统计的主动学习（AL）方法存在局限性：它们难以检测过度自信的误分类，并且在区分深度神经网络中的语义相似项方面效果较差。此外，音频信息在短视频平台中发挥着越来越重要的作用，但大多数预训练的多模态架构主要集中在文本和图像上。虽然可以从所有三种模态重新训练是可能的，但这牺牲了利用现有预训练视觉语言（VL）和音频模型的好处。为了解决这些挑战，我们提出了基于kNN的潜在空间扩展（LSB）以提高AL效率，并提出了一种将音频整合到VL模型中的视觉语言建模与音频增强（VLMAE）的中融合方法。该系统部署在生产系统中，带来了显著的商业收益。

Summary / 总结

The research aims to improve the quality of labeled training data and cross-modal fusion in transformer-based multimodal models, which are used in industrial-scale recommendation, search, and advertising systems. The method involves using kNN-based Latent Space Broadening (LSB) to enhance active learning efficiency and integrating audio into vision-language models through a mid-fusion approach, known as VLMAE. The system has been deployed in production, where it led to significant business gains.

研究旨在通过改进标记训练数据质量和跨模态融合，在工业规模的推荐、搜索和广告系统中提升基于变换器的多模态模型性能。方法包括使用基于kNN的潜在空间扩展（LSB）来提高主动学习效率，并通过中间融合方式将音频整合到视觉语言模型中，称为VLMAE。该系统已部署于生产环境，并带来了显著的商业收益。

VaPR -- Vision-language Preference alignment for Reasoning

Authors: Rohan Wadhawan, Fabrice Y Harel-Canada, Zi-Yi Dou, Suhaila Shakiah, Robinson Piramuthu, Nanyun Peng

Venue: COLM 2025

First: 2025-10-02T06:10:43+00:00 · Latest: 2025-10-02T06:10:43+00:00

Abstract

Preference finetuning methods like Direct Preference Optimization (DPO) with AI-generated feedback have shown promise in aligning Large Vision-Language Models (LVLMs) with human preferences. However, existing techniques overlook the prevalence of noise in synthetic preference annotations in the form of stylistic and length biases. To this end, we introduce a hard-negative response generation framework based on LLM-guided response editing that produces rejected responses with targeted errors while maintaining stylistic and length similarity to the accepted ones. Using this framework, we develop the VaPR dataset, comprising 30K high-quality samples, to finetune three LVLM families: LLaVA-V1.5, Qwen2VL, and Qwen2.5VL (2B-13B sizes). Our VaPR models deliver significant performance improvements across ten benchmarks, achieving average gains of 6.5% (LLaVA), 4.0% (Qwen2VL), and 1.5% (Qwen2.5VL), with notable improvements on reasoning tasks. A scaling analysis shows that performance consistently improves with data size, with LLaVA models benefiting even at smaller scales. Moreover, VaPR reduces the tendency to answer "Yes" in binary questions, addressing a common failure mode in LVLMs like LLaVA. Lastly, we show that the framework generalizes to open-source LLMs as editors, with models trained on VaPR-OS achieving ~99% of the performance of models trained on VaPR, which is synthesized using GPT-4o. Our data, models, and code can be found on the project page https://vap-r.github.io

中文标题/摘要

标题：VaPR -- 视觉-语言偏好对齐以推理

像直接偏好优化（DPO）这样的偏好微调方法，通过AI生成的反馈，已经在对齐大型视觉-语言模型（LVLMs）与人类偏好方面显示出潜力。然而，现有技术忽视了合成偏好注解中噪声的普遍存在，这些噪声以风格和长度偏差的形式出现。为此，我们提出了一种基于LLM引导的响应编辑的硬负响应生成框架，该框架生成具有目标错误的被拒绝响应，同时保持与被接受响应的风格和长度相似性。利用这一框架，我们开发了包含30000个高质量样本的VaPR数据集，用于微调三个LVLM家族：LLaVA-V1.5、Qwen2VL和Qwen2.5VL（2B-13B规模）。我们的VaPR模型在十个基准测试中实现了显著的性能提升，平均增益分别为6.5%（LLaVA）、4.0%（Qwen2VL）和1.5%（Qwen2.5VL），在推理任务上尤为突出。规模分析显示，随着数据量的增加，性能持续提升，LLaVA模型即使在较小规模下也能受益。此外，VaPR减少了二元问题中回答“是”的倾向，解决了LVLMs如LLaVA的常见失败模式。最后，我们展示了该框架可以推广到以开源LLM作为编辑器的情形，使用VaPR-OS训练的模型在性能上达到了在VaPR（使用GPT-4o合成）上训练的模型的约99%。我们的数据、模型和代码可以在项目页面https://vap-r.github.io找到

Euclid's Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks

Authors: Shijie Lian, Changti Wu, Laurence Tianruo Yang, Hang Yuan, Bin Yu, Lei Zhang, Kai Chen

First: 2025-09-29T08:49:21+00:00 · Latest: 2025-10-02T06:05:35+00:00

Abstract

Spatial intelligence spans a rich suite of abilities, including visualising and transforming shapes, mentally rotating objects, judging relational positions and containment, and estimating numerosity. However, it remains a critical unresolved challenge for Multimodal Large Language Models (MLLMs). To fill this gap, we propose to treat Euclidean geometry problem-solving as a surrogate task. Specifically, we meticulously constructed a curated multimodal dataset, called Euclid30K, comprising approximately 30K plane and solid geometry problems. To enable the model to acquire and apply Euclidean principles from these geometry problems, we employed Group Relative Policy Optimization (GRPO) to finetune the Qwen2.5VL family and RoboBrain2.0 family, inspiring the models to identify shapes, count, and relate entities, and perform multi-step deductive reasoning using Euclidean principles. Our experiments demonstrate that the resulting models achieve substantial zero-shot gains across four spatial reasoning benchmarks (Super-CLEVR, Omni3DBench, VSI-Bench, and MindCube) without any task-specific adaptations. Notably, after training on the Euclid30K, the mean VSI-Bench accuracy of all evaluated models rose from 34.5% to 40.5%, improving by 5.5 percentage points. Among them, RoboBrain2.0-Euclid-7B achieves 49.6% accuracy, surpassing the previous state-of-the-art model, Spatial-MLLM. To our knowledge, this is the first systematic study showing that geometry-centric fine-tuning can confer vision-language models with broadly transferable spatial skills. Code and Euclid30K dataset can be found at https://zgca-ai4edu.github.io/Euclids_Gift.

中文标题/摘要

标题：欧几里得的礼物：通过几何代理任务增强视觉-语言模型的空间感知和推理能力

空间智能涵盖了丰富的能力，包括可视化和变换形状、心理旋转物体、判断相对位置和包含关系，以及估算数量。然而，这仍然是多模态大型语言模型（MLLMs）的一个关键未解决的挑战。为了填补这一空白，我们建议将欧几里得几何问题解决作为代理任务。具体来说，我们精心构建了一个多模态数据集，称为Euclid30K，包含约30000个平面几何和立体几何问题。为了使模型能够从这些几何问题中获取和应用欧几里得原理，我们使用了组相对策略优化（GRPO）对Qwen2.5VL家族和RoboBrain2.0家族进行微调，激励模型识别形状、计数和关联实体，并使用欧几里得原理进行多步演绎推理。我们的实验表明，这些模型在四个空间推理基准测试（Super-CLEVR、Omni3DBench、VSI-Bench和MindCube）上实现了显著的零样本增益，无需任何特定任务的调整。值得注意的是，经过Euclid30K训练后，所有评估模型的平均VSI-Bench准确率从34.5%提高到40.5%，提高了5.5个百分点。其中，RoboBrain2.0-Euclid-7B的准确率达到49.6%，超越了之前的最佳模型Spatial-MLLM。据我们所知，这是首次系统研究表明几何导向的微调可以赋予视觉-语言模型广泛转移的空间技能。代码和Euclid30K数据集可在https://zgca-ai4edu.github.io/Euclids_Gift/找到。

FreeViS: Training-free Video Stylization with Inconsistent References

Authors: Jiacong Xu, Yiqun Mei, Ke Zhang, Vishal M. Patel

First: 2025-10-02T05:27:06+00:00 · Latest: 2025-10-02T05:27:06+00:00

Comments: Project Page: https://xujiacong.github.io/FreeViS/

Abstract

Video stylization plays a key role in content creation, but it remains a challenging problem. Naively applying image stylization frame-by-frame hurts temporal consistency and reduces style richness. Alternatively, training a dedicated video stylization model typically requires paired video data and is computationally expensive. In this paper, we propose FreeViS, a training-free video stylization framework that generates stylized videos with rich style details and strong temporal coherence. Our method integrates multiple stylized references into a pretrained image-to-video (I2V) model, effectively mitigating the propagation errors observed in prior works, without introducing flickers and stutters. In addition, it leverages high-frequency compensation to constrain the content layout and motion, together with flow-based motion cues to preserve style textures in low-saliency regions. Through extensive evaluations, FreeViS delivers higher stylization fidelity and superior temporal consistency, outperforming recent baselines and achieving strong human preference. Our training-free pipeline offers a practical and economic solution for high-quality, temporally coherent video stylization. The code and videos can be accessed via https://xujiacong.github.io/FreeViS/

中文标题/摘要

标题：FreeViS: 无需训练的视频风格化方法与不一致的参考

视频风格化在内容创作中起着关键作用，但仍然是一个具有挑战性的问题。逐帧应用图像风格化会损害时间一致性并减少风格丰富性。或者，训练专门的视频风格化模型通常需要配对的视频数据且计算成本高昂。在本文中，我们提出了一种无需训练的视频风格化框架FreeViS，该框架能够生成具有丰富风格细节和强时间连贯性的风格化视频。我们的方法将多个风格化参考整合到预训练的图像到视频（I2V）模型中，有效地缓解了先前工作中观察到的传播错误，同时不引入闪烁和卡顿。此外，它利用高频补偿来约束内容布局和运动，并结合基于流的运动线索来保留低显著性区域中的风格纹理。通过广泛的评估，FreeViS 提供了更高的风格化保真度和更强的时间一致性，优于最近的基线，并获得了强烈的人类偏好。我们的无需训练的管道为高质量、时间连贯的视频风格化提供了一种实用且经济的解决方案。代码和视频可通过 https://xujiacong.github.io/FreeViS/ 获取

Summary / 总结

FreeViS is a training-free video stylization framework that integrates multiple stylized references into a pretrained image-to-video model to generate videos with rich style details and strong temporal coherence. It leverages high-frequency compensation and flow-based motion cues to preserve style textures and constrain content layout and motion. Extensive evaluations show that FreeViS outperforms recent baselines in terms of stylization fidelity and temporal consistency, and it receives strong human preference. The training-free pipeline offers a practical and economic solution for high-quality, temporally coherent video stylization.

FreeViS 是一个无需训练的视频风格化框架，通过将多种风格化参考整合到预训练的图像到视频模型中，生成具有丰富风格细节和强时间连贯性的视频。它利用高频补偿和基于流的运动线索来保留风格纹理并约束内容布局和运动。广泛评估表明，FreeViS 在风格化保真度和时间连贯性方面优于最近的基线，并且获得了强烈的人类偏好。无需训练的流水线为高质量、时间连贯的视频风格化提供了一个实用且经济的解决方案。

Look Less, Reason More: Rollout-Guided Adaptive Pixel-Space Reasoning

Authors: Xuchen Li, Xuzhao Li, Jiahui Gao, Renjie Pi, Shiyu Hu, Wentao Zhang

First: 2025-10-02T05:14:52+00:00 · Latest: 2025-10-02T05:14:52+00:00

Comments: Preprint, Under review

Abstract

Vision-Language Models (VLMs) excel at many multimodal tasks, yet they frequently struggle with tasks requiring precise understanding and handling of fine-grained visual elements. This is mainly due to information loss during image encoding or insufficient attention to critical regions. Recent work has shown promise by incorporating pixel-level visual information into the reasoning process, enabling VLMs to access high-resolution visual details during their thought process. However, this pixel-level information is often overused, leading to inefficiency and distraction from irrelevant visual details. To address these challenges, we propose the first framework for adaptive pixel reasoning that dynamically determines necessary pixel-level operations based on the input query. Specifically, we first apply operation-aware supervised fine-tuning to establish baseline competence in textual reasoning and visual operations, then design a novel rollout-guided reinforcement learning framework relying on feedback of the model's own responses, which enables the VLM to determine when pixel operations should be invoked based on query difficulty. Experiments on extensive multimodal reasoning benchmarks show that our model achieves superior performance while significantly reducing unnecessary visual operations. Impressively, our model achieves 73.4% accuracy on HR-Bench 4K while maintaining a tool usage ratio of only 20.1%, improving accuracy and simultaneously reducing tool usage by 66.5% compared to previous methods.

中文标题/摘要

标题：少看多思：基于展开引导的自适应像素空间推理

视觉-语言模型（VLMs）在许多跨模态任务中表现出色，但在需要精确理解和处理细粒度视觉元素的任务中经常遇到困难。这主要是由于图像编码过程中的信息丢失或对关键区域的关注不足。最近的工作通过将像素级视觉信息纳入推理过程，使VLMs能够在思考过程中访问高分辨率的视觉细节，显示出前景。然而，这种像素级信息的过度使用导致了效率低下和对无关视觉细节的干扰。为了解决这些挑战，我们提出了第一个自适应像素推理框架，该框架根据输入查询动态确定必要的像素级操作。具体来说，我们首先应用操作感知的监督微调来建立文本推理和视觉操作的基础能力，然后设计一种基于模型自身响应反馈的展开引导强化学习框架，使VLM能够根据查询难度确定何时调用像素操作。在广泛的跨模态推理基准测试中，我们的模型在显著减少不必要的视觉操作的同时实现了优越的性能。令人印象深刻的是，我们的模型在HR-Bench 4K上达到了73.4%的准确率，同时工具使用率为20.1%，与之前的方法相比，准确率提高了，同时工具使用率减少了66.5%。

Summary / 总结

The research aims to enhance the precision of Vision-Language Models (VLMs) in tasks requiring fine-grained visual understanding by proposing an adaptive pixel reasoning framework. This framework dynamically decides when to use pixel-level visual information based on the input query, reducing unnecessary operations. Experiments on multimodal reasoning benchmarks show that the proposed model outperforms previous methods while invoking fewer visual operations: it achieves 73.4% accuracy on HR-Bench 4K with a tool usage ratio of only 20.1%, a 66.5% reduction in tool usage compared to previous methods.

研究旨在通过提出一种自适应像素推理框架来提升视觉-语言模型在需要精细视觉理解的任务中的精度。该框架根据输入查询动态决定何时使用像素级视觉信息，减少不必要的操作并提高效率。实验表明，所提出模型在HR-Bench 4K上的准确率达到73.4%，工具使用率为20.1%，相比之前的方法降低了66.5%的工具使用率。

Does Bigger Mean Better? Comparitive Analysis of CNNs and Biomedical Vision Language Modles in Medical Diagnosis

Authors: Ran Tong, Jiaqi Liu, Su Liu, Jiexi Xu, Lanruo Wang, Tong Wang

First: 2025-10-01T01:46:09+00:00 · Latest: 2025-10-02T04:22:36+00:00

Comments: 6 pages, 3 figures. Under review at the International Conference on Artificial Intelligence, Computer, Data Sciences and Applications

Abstract

The accurate interpretation of chest radiographs using automated methods is a critical task in medical imaging. This paper presents a comparative analysis between a supervised lightweight Convolutional Neural Network (CNN) and a state-of-the-art, zero-shot medical Vision-Language Model (VLM), BiomedCLIP, across two distinct diagnostic tasks: pneumonia detection on the PneumoniaMNIST benchmark and tuberculosis detection on the Shenzhen TB dataset. Our experiments show that supervised CNNs serve as highly competitive baselines in both cases. While the default zero-shot performance of the VLM is lower, we demonstrate that its potential can be unlocked via a simple yet crucial remedy: decision threshold calibration. By optimizing the classification threshold on a validation set, the performance of BiomedCLIP is significantly boosted across both datasets. For pneumonia detection, calibration enables the zero-shot VLM to achieve a superior F1-score of 0.8841, surpassing the supervised CNN's 0.8803. For tuberculosis detection, calibration dramatically improves the F1-score from 0.4812 to 0.7684, bringing it close to the supervised baseline's 0.7834. This work highlights a key insight: proper calibration is essential for leveraging the full diagnostic power of zero-shot VLMs, enabling them to match or even outperform efficient, task-specific supervised models.

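Title / Abstract (translated from Chinese)

Title: Does Bigger Mean Better? Comparative Analysis of CNNs and Biomedical Vision Language Models in Medical Diagnosis

Accurately interpreting chest radiographs with automated methods is a key task in medical imaging. This paper compares a supervised lightweight convolutional neural network (CNN) with BiomedCLIP, a state-of-the-art zero-shot medical vision-language model (VLM), on two distinct diagnostic tasks: pneumonia detection on the PneumoniaMNIST benchmark and tuberculosis detection on the Shenzhen TB dataset. Our experiments show that the supervised CNN is a highly competitive baseline in both cases. Although the VLM's default zero-shot performance is lower, we show that a simple but crucial remedy, decision threshold calibration, unlocks its potential. Optimizing the classification threshold on a validation set significantly improves BiomedCLIP's performance on both datasets. For pneumonia detection, calibration lifts the zero-shot VLM's F1-score to 0.8841, above the supervised CNN's 0.8803. For tuberculosis detection, calibration raises the F1-score from 0.4812 to 0.7684, close to the supervised baseline's 0.7834. This work highlights a key insight: proper calibration is essential to exploiting the full diagnostic power of zero-shot VLMs, allowing them to match or even outperform efficient, task-specific supervised models.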

Summary / 总结

This paper compares a supervised lightweight Convolutional Neural Network (CNN) with BiomedCLIP, a state-of-the-art zero-shot medical Vision-Language Model (VLM), on pneumonia and tuberculosis detection. BiomedCLIP's default zero-shot performance is lower than the CNN's, but optimizing its decision threshold on a validation set improves it substantially. After calibration it reaches an F1-score of 0.8841 on pneumonia detection, surpassing the CNN's 0.8803, and 0.7684 on tuberculosis detection (up from 0.4812), close to the CNN's 0.7834.

Source-Free Cross-Domain Continual Learning

Authors: Muhammad Tanzil Furqon, Mahardhika Pratama, Igor Škrjanc, Lin Liu, Habibullah Habibullah, Kutluyil Dogancay

First: 2025-10-02T04:09:25+00:00 · Latest: 2025-10-02T04:09:25+00:00

Abstract

Although existing cross-domain continual learning approaches successfully address many streaming tasks with domain shifts, they require a fully labeled source domain, which hinders their feasibility in privacy-constrained environments. This paper goes one step further and tackles source-free cross-domain continual learning, where the use of source-domain samples is completely prohibited. We propose rehearsal-free frequency-aware dynamic prompt collaborations (REFEREE) to cope with the absence of labeled source-domain samples in cross-domain continual learning. REFEREE is built upon a synergy between a source-pre-trained model and a large-scale vision-language model, thus overcoming the sub-optimal generalization that results from relying only on a source-pre-trained model. The domain shift between the source domain and the target domain is handled by a frequency-aware prompting technique that encourages low-frequency components while suppressing high-frequency components. This strategy generates frequency-aware augmented samples that are robust against noisy pseudo labels. The noisy pseudo-label problem is further addressed with an uncertainty-aware weighting strategy in which the mean and covariance matrix are weighted by prediction uncertainties, thus mitigating the adverse effects of noisy pseudo labels. In addition, catastrophic forgetting (CF) is overcome by kernel linear discriminant analysis (KLDA): the backbone network is frozen while classification is performed using linear discriminant analysis guided by the random kernel method. Our rigorous numerical studies confirm the advantage of our approach, which beats prior art having access to source-domain samples by significant margins.
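
The idea of weighting the mean and covariance matrix by prediction uncertainty can be sketched as follows. This is only an illustration under stated assumptions: the confidence weight used here (one minus the normalized entropy of the predicted class probabilities) and the function names are hypothetical choices, since the abstract does not give REFEREE's exact weighting formula.

```python
import math


def confidence_weight(probs):
    """1 - normalized entropy: 1 for a one-hot prediction, 0 for a uniform one."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return 1.0 - entropy / math.log(len(probs))


def weighted_mean_cov(features, weights):
    """Weighted mean vector and covariance matrix of a list of feature vectors."""
    total = sum(weights)
    dim = len(features[0])
    mean = [
        sum(w * x[d] for x, w in zip(features, weights)) / total
        for d in range(dim)
    ]
    cov = [
        [
            sum(w * (x[i] - mean[i]) * (x[j] - mean[j])
                for x, w in zip(features, weights)) / total
            for j in range(dim)
        ]
        for i in range(dim)
    ]
    return mean, cov


# Three pseudo-labeled samples of one class (hypothetical values). The third is
# an outlier whose prediction is maximally uncertain, so it gets zero weight.
features = [[1.0, 0.0], [3.0, 0.0], [100.0, 50.0]]
probs = [[1.0, 0.0], [1.0, 0.0], [0.5, 0.5]]

weights = [confidence_weight(p) for p in probs]
mean, cov = weighted_mean_cov(features, weights)
print(weights, mean, cov)
```

With these weights the uncertain outlier does not pull the class statistics away from the two confident samples, which is the effect the abstract attributes to uncertainty-aware weighting.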

Summary / 总结

This paper addresses source-free cross-domain continual learning, in which the use of source-domain samples is completely prohibited. It proposes REFEREE, which combines a source-pre-trained model with a large-scale vision-language model to improve generalization. A frequency-aware prompting technique handles the domain shift and produces augmented samples that are robust to noisy pseudo labels, an uncertainty-aware weighting strategy further reduces the effect of those noisy labels, and kernel linear discriminant analysis on a frozen backbone counters catastrophic forgetting. The approach outperforms, by significant margins, prior methods that do have access to source-domain samples.

Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization

Authors: Tao Zhang, Cheng Da, Kun Ding, Huan Yang, Kun Jin, Yan Li, Tingting Gao, Di Zhang, Shiming Xiang, Chunhong Pan

Venue: NeurIPS 2025

First: 2025-02-03T04:51:28+00:00 · Latest: 2025-10-02T04:07:53+00:00

Comments: NeurIPS 2025

Abstract

Preference optimization for diffusion models aims to align them with human preferences for images. Previous methods typically use Vision-Language Models (VLMs) as pixel-level reward models to approximate human preferences. However, when used for step-level preference optimization, these models face challenges in handling noisy images of different timesteps and require complex transformations into pixel space. In this work, we show that pre-trained diffusion models are naturally suited for step-level reward modeling in the noisy latent space, as they are explicitly designed to process latent images at various noise levels. Accordingly, we propose the Latent Reward Model (LRM), which repurposes components of the diffusion model to predict preferences of latent images at arbitrary timesteps. Building on LRM, we introduce Latent Preference Optimization (LPO), a step-level preference optimization method conducted directly in the noisy latent space. Experimental results indicate that LPO significantly improves the model's alignment with general, aesthetic, and text-image alignment preferences, while achieving a 2.5-28x training speedup over existing preference optimization methods. Our code and models are available at https://github.com/Kwai-Kolors/LPO.
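
A reward model that predicts preferences is commonly trained with a pairwise Bradley-Terry loss. The sketch below shows that generic loss on scalar rewards for a preferred and a rejected latent at the same timestep. Whether LRM uses exactly this objective is an assumption, and the function name and reward values are hypothetical.

```python
import math


def bradley_terry_loss(reward_preferred, reward_rejected):
    """Pairwise loss -log(sigmoid(r_preferred - r_rejected)), computed stably."""
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical reward scores for two noisy latents at one timestep.
correct_order = bradley_terry_loss(2.0, 0.0)   # preferred latent scored higher
wrong_order = bradley_terry_loss(0.0, 2.0)     # preferred latent scored lower
tie = bradley_terry_loss(1.0, 1.0)
print(correct_order, wrong_order, tie)
```

The loss shrinks as the preferred latent is scored further above the rejected one, and a tie costs log 2.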

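Title / Abstract (translated from Chinese)

Title: Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization

Preference optimization for diffusion models aims to align them with human preferences for images. Previous methods usually use vision-language models (VLMs) as pixel-level reward models to approximate human preferences. When used for step-level preference optimization, however, these models struggle with noisy images at different timesteps and require complex transformations into pixel space. In this paper, we show that pre-trained diffusion models are naturally suited to step-level reward modeling in the noisy latent space, because they are explicitly designed to process latent images at various noise levels. We therefore propose the Latent Reward Model (LRM), which repurposes components of the diffusion model to predict preferences for latent images at arbitrary timesteps. Building on LRM, we introduce Latent Preference Optimization (LPO), a step-level preference optimization method carried out directly in the noisy latent space. Experimental results show that LPO significantly improves the model's alignment with general, aesthetic, and text-image alignment preferences while training 2.5 to 28 times faster than existing preference optimization methods. Our code and models are available at https://github.com/Kwai-Kolors/LPO/.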

Summary / 总结

This paper addresses the challenge of aligning diffusion models with human preferences by proposing a Latent Reward Model (LRM) and a Latent Preference Optimization (LPO) method. LRM repurposes components of a pre-trained diffusion model to predict preferences for latent images at arbitrary timesteps, while LPO performs step-level preference optimization directly in the noisy latent space. The results show that LPO significantly improves alignment with general, aesthetic, and text-image alignment preferences and achieves a 2.5-28x training speedup over existing preference optimization methods.

ImageNet-Think-250K: A Large-Scale Synthetic Dataset for Multimodal Reasoning for Vision Language Models

Authors: Krishna Teja Chitty-Venkata, Murali Emani

First: 2025-10-02T02:02:45+00:00 · Latest: 2025-10-02T02:02:45+00:00

Comments: Preprint

Abstract

We develop ImageNet-Think, a multimodal reasoning dataset designed to aid the development of Vision Language Models (VLMs) with explicit reasoning capabilities. Our dataset is built on 250,000 images from ImageNet21k dataset, providing structured thinking tokens and corresponding answers. Our synthetic dataset is generated by two state-of-the-art VLMs: GLM-4.1V-9B-Thinking and Kimi-VL-A3B-Thinking-2506. Each image is accompanied by two pairs of thinking-answer sequences, creating a resource for training and evaluating multimodal reasoning models. We capture the step-by-step reasoning process of VLMs and the final descriptive answers. Our goal with this dataset is to enable the development of more robust VLMs while contributing to the broader understanding of multimodal reasoning mechanisms. The dataset and evaluation benchmarks will be publicly available to aid research in reasoning/thinking multimodal VLMs.
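
One way to picture a record in such a dataset is sketched below. The class and field names are hypothetical, and the check that the two thinking-answer pairs come from the two different generator models is an assumption; the abstract only states that each image has two pairs and that two VLMs generated the data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ThinkingAnswer:
    source_model: str   # which VLM produced this pair
    thinking: str       # step-by-step reasoning text
    answer: str         # final descriptive answer


@dataclass
class ImageNetThinkRecord:
    image_id: str
    pairs: List[ThinkingAnswer]

    def is_complete(self) -> bool:
        """True when the record has two pairs from two distinct models (assumed layout)."""
        models = {pair.source_model for pair in self.pairs}
        return len(self.pairs) == 2 and len(models) == 2


# Hypothetical record; the image id and texts are placeholders.
record = ImageNetThinkRecord(
    image_id="example_0001",
    pairs=[
        ThinkingAnswer("GLM-4.1V-9B-Thinking", "reasoning text A", "answer A"),
        ThinkingAnswer("Kimi-VL-A3B-Thinking-2506", "reasoning text B", "answer B"),
    ],
)
print(record.is_complete())
```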

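Title / Abstract (translated from Chinese)

Title: ImageNet-Think-250K: A Large-Scale Synthetic Dataset for Multimodal Reasoning for Vision Language Models

We develop ImageNet-Think, a multimodal reasoning dataset designed to support the development of vision language models (VLMs) with explicit reasoning capabilities. The dataset is built on 250,000 images from the ImageNet21k dataset and provides structured thinking tokens and corresponding answers. The synthetic data is generated by two state-of-the-art VLMs: GLM-4.1V-9B-Thinking and Kimi-VL-A3B-Thinking-2506. Each image comes with two pairs of thinking-answer sequences, providing a resource for training and evaluating multimodal reasoning models. We capture the VLMs' step-by-step reasoning process and their final descriptive answers. With this dataset we hope to enable the development of more robust VLMs while contributing to a broader understanding of multimodal reasoning mechanisms. The dataset and evaluation benchmarks will be released publicly to support research on reasoning/thinking multimodal VLMs.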

Summary / 总结

The paper presents ImageNet-Think, a synthetic multimodal reasoning dataset built on 250,000 images from ImageNet21k and intended to support Vision Language Models (VLMs) with explicit reasoning capabilities. Two state-of-the-art VLMs, GLM-4.1V-9B-Thinking and Kimi-VL-A3B-Thinking-2506, generate the structured thinking tokens and corresponding answers, and each image comes with two thinking-answer pairs. The dataset records both the step-by-step reasoning process and the final descriptive answer, so it can be used to train and evaluate multimodal reasoning models. The dataset and evaluation benchmarks will be made publicly available.

Poutine: Vision-Language-Trajectory Pre-Training and Reinforcement Learning Post-Training Enable Robust End-to-End Autonomous Driving

Authors: Luke Rowe, Rodrigue de Schaetzen, Roger Girgis, Christopher Pal, Liam Paull

First: 2025-06-12T19:14:00+00:00 · Latest: 2025-10-02T01:37:54+00:00

Abstract

Maintaining good driving behavior in out-of-distribution scenarios remains a critical challenge in autonomous driving. A promising direction is to leverage the generalist knowledge and reasoning capabilities of large-language models by treating unusual driving scenarios as a logical reasoning task. In this work, we present Poutine, a method that uses an off-the-shelf 3B-parameter vision-language model (VLM) - without any additional components - to achieve robust end-to-end autonomous driving via a simple and scalable training recipe. To learn strong base driving capabilities, we first train Poutine-Base using self-supervised next-token prediction over vision, language, and trajectory (VLT) tokens, leveraging both nominal and long-tail driving data. In the second stage, we fine-tune Poutine-Base using Group Relative Policy Optimization (GRPO) with a small set of human preference-labeled examples. We evaluated our approach on the Waymo end-to-end driving benchmark curated for long-tail scenarios. The final Poutine model achieves an RFS of 7.99 on the test set, placing 1st in the 2025 Waymo Vision-Based End-to-End Driving Challenge by a significant margin. Our results suggest that handcrafted tokenizers or custom architectural components added to base VLMs in prior work are not necessary to achieve strong driving performance. Instead, this work highlights the potential of scalable VLT pretraining combined with lightweight RL fine-tuning to enable robust and generalizable autonomous driving.
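
GRPO replaces a learned value baseline with statistics computed over a group of samples drawn for the same input: each sample's advantage is its reward standardized by the group mean and standard deviation. A minimal sketch of that normalization step follows. The reward values are made up, the function name is hypothetical, and how Poutine derives rewards from its preference-labeled examples is not specified in the abstract.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the mean and std of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    # eps keeps the division finite when all rewards in the group are equal.
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for three trajectories sampled for one driving scene.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
print(advantages)
```

Trajectories that score above the group average get a positive advantage and are reinforced, while below-average ones are pushed down, without training a separate critic.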

Summary / 总结

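Poutine targets robust end-to-end autonomous driving in out-of-distribution scenarios using an off-the-shelf 3B-parameter vision-language model (VLM) with no additional components. Poutine-Base is first trained with self-supervised next-token prediction over vision, language, and trajectory (VLT) tokens on nominal and long-tail driving data, then fine-tuned with Group Relative Policy Optimization (GRPO) on a small set of human preference-labeled examples. The final model achieves an RFS of 7.99 on the test set of the Waymo end-to-end driving benchmark and places 1st in the 2025 Waymo Vision-Based End-to-End Driving Challenge, suggesting that handcrafted tokenizers and custom architectural components are not necessary for strong driving performance.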

MIRA: Towards Mitigating Reward Hacking in Inference-Time Alignment of T2I Diffusion Models

Authors: Kevin Zhai, Utsav Singh, Anirudh Thatipelli, Souradip Chakraborty, Anit Kumar Sahu, Furong Huang, Amrit Singh Bedi, Mubarak Shah

First: 2025-10-02T00:47:36+00:00 · Latest: 2025-10-02T00:47:36+00:00

Abstract

Diffusion models excel at generating images conditioned on text prompts, but the resulting images often do not satisfy user-specific criteria measured by scalar rewards such as Aesthetic Scores. This alignment typically requires fine-tuning, which is computationally demanding. Recently, inference-time alignment via noise optimization has emerged as an efficient alternative, modifying initial input noise to steer the diffusion denoising process towards generating high-reward images. However, this approach suffers from reward hacking, where the model produces images that score highly, yet deviate significantly from the original prompt. We show that noise-space regularization is insufficient and that preventing reward hacking requires an explicit image-space constraint. To this end, we propose MIRA (MItigating Reward hAcking), a training-free, inference-time alignment method. MIRA introduces an image-space, score-based KL surrogate that regularizes the sampling trajectory with a frozen backbone, constraining the output distribution so reward can increase without off-distribution drift (reward hacking). We derive a tractable approximation to KL using diffusion scores. Across SDv1.5 and SDXL, multiple rewards (Aesthetic, HPSv2, PickScore), and public datasets (e.g., Animal-Animal, HPDv2), MIRA achieves >60% win rate vs. strong baselines while preserving prompt adherence; mechanism plots show reward gains with near-zero drift, whereas DNO drifts as compute increases. We further introduce MIRA-DPO, mapping preference optimization to inference time with a frozen backbone, extending MIRA to non-differentiable rewards without fine-tuning.

Summary / 总结

The paper addresses reward hacking in inference-time alignment of text-to-image diffusion models, where the model generates images that score highly but deviate from the original prompt. To mitigate this, the authors propose MIRA, a training-free, inference-time method that introduces an image-space, score-based KL surrogate to regularize the sampling trajectory with a frozen backbone, preventing off-distribution drift. Across SDv1.5 and SDXL, multiple rewards, and public datasets, MIRA achieves a >60% win rate against strong baselines while preserving prompt adherence with near-zero drift. A further variant, MIRA-DPO, extends the method to non-differentiable rewards without fine-tuning.