arXiv 论文速递

PhysTalk: Language-driven Real-time Physics in 3D Gaussian Scenes

Authors: Luca Collorone, Mert Kiray, Indro Spinelli, Fabio Galasso, Benjamin Busam

First: 2025-12-31T17:32:31+00:00 · Latest: 2025-12-31T17:32:31+00:00

Abstract

Realistic visual simulations are omnipresent, yet their creation requires computing time, rendering, and expert animation knowledge. Open-vocabulary visual effects generation from text inputs emerges as a promising solution that can unlock immense creative potential. However, current pipelines lack both physical realism and effective language interfaces, requiring slow offline optimization. In contrast, PhysTalk takes a 3D Gaussian Splatting (3DGS) scene as input and translates arbitrary user prompts into real-time, physics-based, interactive 4D animations. A large language model (LLM) generates executable code that directly modifies 3DGS parameters through lightweight proxies and particle dynamics. Notably, PhysTalk is the first framework to couple 3DGS directly with a physics simulator without relying on time-consuming mesh extraction. While remaining open-vocabulary, this design enables interactive 3D Gaussian animation via collision-aware, physics-based manipulation of arbitrary, multi-material objects. Finally, PhysTalk is training-free and computationally lightweight: this makes 4D animation broadly accessible and shifts these workflows from a "render and wait" paradigm toward an interactive dialogue with a modern, physics-informed pipeline.

中文标题/摘要

标题：PhysTalk: 3D 高斯场景中的语言驱动实时物理

逼真的视觉模拟无处不在，但其创建需要计算时间、渲染和专家动画知识。从文本输入生成开放词汇视觉效果成为一种有前景的解决方案，能够释放巨大的创意潜力。然而，当前的工作流程缺乏物理真实性和有效的语言界面，需要缓慢的离线优化。相比之下，PhysTalk 将 3D 高斯点绘（3DGS）场景作为输入，并将任意用户提示翻译成实时、基于物理的 4D 动画。一个大型语言模型（LLM）生成可执行代码，直接通过轻量级代理和粒子动力学修改 3DGS 参数。值得注意的是，PhysTalk 是第一个直接将 3DGS 与物理模拟器结合而无需依赖耗时的网格提取的框架。尽管保持开放词汇，这种设计使得通过碰撞感知的基于物理的操纵任意多材料对象的交互式 3D 高斯动画成为可能。最后，PhysTalk 是无训练的且计算量轻：这使得 4D 动画广泛可及，并将这些工作流程从“渲染和等待”的范式转向与现代、基于物理的管道进行互动对话。

Summary / 总结

PhysTalk is a framework that translates user prompts into real-time, physics-based 4D animations using a 3D Gaussian Splatting (3DGS) scene as input. It leverages a large language model to generate executable code that modifies 3DGS parameters through lightweight proxies and particle dynamics, enabling interactive manipulation of multi-material objects. Notably, PhysTalk is the first to couple 3DGS directly with a physics simulator without mesh extraction, making 4D animation accessible and shifting workflows from a 'render and wait' paradigm to an interactive dialogue with a physics-informed pipeline.

PhysTalk 是一个以 3D 高斯点绘 (3DGS) 场景为输入、将用户提示转化为实时、基于物理的 4D 动画的框架。它利用大型语言模型生成可执行代码，通过轻量级代理和粒子动力学修改 3DGS 参数，实现对多材料对象的交互式操作。值得注意的是，PhysTalk 是第一个直接将 3DGS 与物理模拟器结合而无需进行网格提取的框架，这使得 4D 动画更加普及，并将工作流程从“渲染等待”模式转变为与基于物理的管道的互动对话。

DarkEQA: Benchmarking Vision-Language Models for Embodied Question Answering in Low-Light Indoor Environments

Authors: Yohan Park, Hyunwoo Ha, Wonjun Jo, Tae-Hyun Oh

First: 2025-12-31T17:31:29+00:00 · Latest: 2025-12-31T17:31:29+00:00

Comments: Submitted to IEEE Robotics and Automation Letters (RA-L)

Abstract

Vision-Language Models (VLMs) are increasingly adopted as central reasoning modules for embodied agents. Existing benchmarks evaluate their capabilities under ideal, well-lit conditions, yet robust 24/7 operation demands performance under a wide range of visual degradations, including low-light conditions at night or in dark environments, a core necessity that has been largely overlooked. To address this underexplored challenge, we present DarkEQA, an open-source benchmark for evaluating EQA-relevant perceptual primitives under multi-level low-light conditions. DarkEQA isolates the perception bottleneck by evaluating question answering from egocentric observations under controlled degradations, enabling attributable robustness analysis. A key design feature of DarkEQA is its physical fidelity: visual degradations are modeled in linear RAW space, simulating physics-based illumination drop and sensor noise followed by an ISP-inspired rendering pipeline. We demonstrate the utility of DarkEQA by evaluating a wide range of state-of-the-art VLMs and Low-Light Image Enhancement (LLIE) models. Our analysis systematically reveals VLMs' limitations when operating under these challenging visual conditions. Our code and benchmark dataset will be released upon acceptance.

中文标题/摘要

标题：DarkEQA：在低光室内环境中的视觉语言模型体态问答基准测试

视觉语言模型（VLMs）越来越多地被用作体态代理的核心推理模块。现有的基准测试在理想、光线充足的条件下评估其能力，但全天候24/7运行需要在广泛的视觉退化条件下表现出色，包括夜间或黑暗环境中的低光条件——这一核心需求已被很大程度上忽视。为应对这一未充分探索的挑战，我们提出了DarkEQA，这是一个开源基准测试，用于在多级低光条件下评估与体态问答相关的感知基本能力。DarkEQA通过在受控退化条件下评估从第一人称观察中进行问答来隔离感知瓶颈，从而实现可归因的鲁棒性分析。DarkEQA的一个关键设计特点是其物理保真度：视觉退化在线性RAW空间中建模，模拟基于物理的照明下降和传感器噪声，随后通过ISP启发式的渲染管道。我们通过评估一系列最先进的VLMs和低光图像增强（LLIE）模型来展示DarkEQA的实用性。我们的分析系统地揭示了这些视觉条件下的VLMs的局限性。我们的代码和基准数据集将在接受后发布。

Summary / 总结

DarkEQA is a benchmark designed to evaluate the performance of Vision-Language Models (VLMs) in low-light indoor environments, addressing the underexplored challenge of robust 24/7 operation. The method degrades egocentric observations in a physically faithful manner (illumination drop and sensor noise modeled in linear RAW space, followed by an ISP-inspired rendering pipeline) to isolate perceptual limitations. The evaluation shows that state-of-the-art VLMs struggle with question answering under low-light conditions, highlighting the need for improved robustness in real-world applications. The code and benchmark dataset will be released upon acceptance.

DarkEQA 是一个基准，旨在评估 Vision-Language 模型在低光室内环境中的性能，解决其在 24/7 运行中鲁棒性的不足。该方法通过在受控的低光条件下降级第一人称观察，视觉降级在线性 RAW 空间中建模以模拟真实的物理现象。关键发现表明，当前的 VLM 在这些具有挑战性的视觉条件下难以完成感知任务，突显了低光性能改进的需求。基准将在接受后发布。
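
The degradation idea in the abstract (illumination drop and sensor noise applied in a linear space, then rendered back) can be sketched in a few lines. This is a minimal illustration and not the DarkEQA pipeline: it assumes sRGB-encoded inputs in [0, 1] as a stand-in for RAW data, approximates shot noise with a signal-dependent Gaussian, and reduces the ISP to clamping plus the sRGB transfer curve. The names `simulate_low_light`, `shot_gain`, and `read_sigma` are hypothetical.

```python
import random

def srgb_to_linear(v):
    """Invert the sRGB transfer curve for a value in [0, 1]."""
    return v / 12.92 if v <= 0.04045 else ((v + 0.055) / 1.055) ** 2.4

def linear_to_srgb(v):
    """Apply the sRGB transfer curve to a linear value in [0, 1]."""
    return 12.92 * v if v <= 0.0031308 else 1.055 * v ** (1 / 2.4) - 0.055

def simulate_low_light(pixels, light_scale, shot_gain=0.01, read_sigma=0.002, seed=0):
    """Darken pixels in linear space, add noise, and re-encode (toy model)."""
    rng = random.Random(seed)
    out = []
    for p in pixels:
        # Illumination drop: scale the linear intensity.
        lin = srgb_to_linear(p) * light_scale
        # Assumed noise model: variance grows with the signal (shot noise)
        # plus a constant read-noise floor.
        sigma = (shot_gain * lin + read_sigma ** 2) ** 0.5
        noisy = lin + rng.gauss(0.0, sigma)
        # Stand-in for the ISP: clamp to the valid range and re-encode.
        noisy = min(1.0, max(0.0, noisy))
        out.append(linear_to_srgb(noisy))
    return out
```

Calling `simulate_low_light(row, 0.1)` on a row of pixel values simulates a tenfold illumination drop. Because the noise floor does not shrink with the signal, darker settings have a lower signal-to-noise ratio, which is the effect a linear-space model captures and a simple brightness scaling of the encoded image does not.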

DAVE: A VLM Vision Encoder for Document Understanding and Web Agents

Authors: Brandon Huang, Hang Hua, Zhuoran Yu, Trevor Darrell, Rogerio Feris, Roei Herzig

First: 2025-12-19T04:09:24+00:00 · Latest: 2025-12-31T17:30:11+00:00

Abstract

While Vision-language models (VLMs) have demonstrated remarkable performance across multi-modal tasks, their choice of vision encoders presents a fundamental weakness: their low-level features lack the robust structural and spatial information essential for document understanding and web agents. To bridge this gap, we introduce DAVE, a vision encoder purpose-built for VLMs and tailored for these tasks. Our training pipeline is designed to leverage abundant unlabeled data to bypass the need for costly large-scale annotations for document and web images. We begin with a self-supervised pretraining stage on unlabeled images, followed by a supervised autoregressive pretraining stage, where the model learns tasks like parsing and localization from limited, high-quality data. Within the supervised stage, we adopt two strategies to improve our encoder's alignment with both general visual knowledge and diverse document and web agentic tasks: (i) We introduce a novel model-merging scheme, combining encoders trained with different text decoders to ensure broad compatibility with different web agentic architectures. (ii) We use ensemble training to fuse features from pretrained generalist encoders (e.g., SigLIP2) with our own document and web-specific representations. Extensive experiments on classic document tasks, VQAs, web localization, and agent-based benchmarks validate the effectiveness of our approach, establishing DAVE as a strong vision encoder for document and web applications.

中文标题/摘要

标题：DAVE：一种面向文档理解和网络代理的VLM视觉编码器

尽管视觉语言模型（VLMs）在多模态任务中表现出色，但它们所选择的视觉编码器存在根本性弱点：低级特征缺乏对于文档理解和网络代理至关重要的稳健的结构和空间信息。为弥补这一差距，我们提出了DAVE，一种专为VLMs设计并针对这些任务定制的视觉编码器。我们的训练管道旨在利用大量未标注数据，以避免对文档和网络图像进行昂贵的大规模注释的需求。我们首先在未标注图像上进行自我监督预训练，然后在监督自回归预训练阶段，模型从有限的高质量数据中学习解析和定位等任务。在监督阶段内，我们采用了两种策略来提高编码器与通用视觉知识和多样化文档及网络代理任务的对齐：(i) 我们引入了一种新的模型合并方案，将使用不同文本解码器训练的编码器结合起来，以确保与不同网络代理架构的广泛兼容性。(ii) 我们使用集成训练，将预训练的一般性编码器（例如SigLIP2）的特征与我们自己的文档和网络特定表示融合。在经典文档任务、VQAs、网络定位和基于代理的基准测试中的广泛实验验证了我们方法的有效性，确立了DAVE作为文档和网络应用的强大视觉编码器的地位。

Summary / 总结

DAVE is a vision encoder designed to enhance the performance of Vision-language models (VLMs) in document understanding and web agent tasks by leveraging self-supervised and supervised pretraining on unlabeled and high-quality data, respectively. The model incorporates a novel model-merging scheme and ensemble training to ensure broad compatibility and improved performance. Experimental results demonstrate DAVE's effectiveness in various document and web tasks, making it a robust vision encoder for these applications.

DAVE 是一种为 VLMs 设计的视觉编码器，旨在通过自监督和监督预训练增强文档理解和网页代理任务。它利用大量未标注数据，并结合模型合并（合并使用不同文本解码器训练的编码器）和集成训练来提高兼容性和性能。在文档任务、VQA、网页定位和基于代理的基准测试上的实验验证了该方法的有效性，使 DAVE 成为这些应用的强大视觉编码器。

CPJ: Explainable Agricultural Pest Diagnosis via Caption-Prompt-Judge with LLM-Judged Refinement

Authors: Wentao Zhang, Tao Fang, Lina Lu, Lifei Wang, Weihe Zhong

First: 2025-12-31T16:21:31+00:00 · Latest: 2025-12-31T16:21:31+00:00

Comments: This paper is 6 pages in length and contains 2 figures. Tao Fang (Corresponding Author), Lina Lu (Co-corresponding Author)

Abstract

Accurate and interpretable crop disease diagnosis is essential for agricultural decision-making, yet existing methods often rely on costly supervised fine-tuning and perform poorly under domain shifts. We propose Caption-Prompt-Judge (CPJ), a training-free few-shot framework that enhances Agri-Pest VQA through structured, interpretable image captions. CPJ employs large vision-language models to generate multi-angle captions, refined iteratively via an LLM-as-Judge module, which then inform a dual-answer VQA process for both recognition and management responses. Evaluated on CDDMBench, CPJ significantly improves performance: using GPT-5-mini captions, GPT-5-Nano achieves +22.7 pp in disease classification and +19.5 points in QA score over no-caption baselines. The framework provides transparent, evidence-based reasoning, advancing robust and explainable agricultural diagnosis without fine-tuning. Our code and data are publicly available at: https://github.com/CPJ-Agricultural/CPJ-Agricultural-Diagnosis.

中文标题/摘要

标题：CPJ: 通过LLM评判修正的可解释农业害虫诊断

准确且可解释的农作物疾病诊断对于农业决策至关重要，但现有方法往往依赖昂贵的监督微调且在领域迁移时表现不佳。我们提出了一种无需训练的少样本框架Caption-Prompt-Judge (CPJ)，通过结构化、可解释的图像描述来增强农业害虫问答。CPJ 使用大型视觉-语言模型生成多角度描述，并通过LLM评判模块迭代修正，然后指导双重答案问答过程，用于识别和管理响应。在CDDMBench上评估，CPJ 显著提高了性能：使用GPT-5-mini描述，GPT-5-Nano 在疾病分类上的得分提高了22.7个百分点，在问答得分上提高了19.5分，超过无描述基线。该框架提供了透明、基于证据的推理，无需微调即可推动稳健且可解释的农业诊断。我们的代码和数据已公开发布在：https://github.com/CPJ-Agricultural/CPJ-Agricultural-Diagnosis。

Summary / 总结

The research aims to improve the accuracy and interpretability of crop disease diagnosis in agriculture. The proposed Caption-Prompt-Judge (CPJ) framework uses large vision-language models to generate multi-angle captions, which are iteratively refined by an LLM-as-Judge module. This process informs a dual-answer VQA system for disease recognition and management. On the CDDMBench, CPJ significantly outperforms no-caption baselines, achieving a 22.7 percentage point improvement in disease classification and a 19.5 point increase in QA score using GPT-5-mini captions, while providing transparent reasoning without fine-tuning.

研究旨在开发一种准确且可解释的方法，以支持农作物疾病的诊断，从而促进农业决策。提出的Caption-Prompt-Judge (CPJ)框架使用大型视觉-语言模型生成多角度的描述，这些描述通过LLM-as-Judge模块迭代精炼，以指导双重答案的VQA过程。在CDDMBench上，CPJ显著提高了性能，疾病分类提高了22.7个百分点，QA得分提高了19.5分，且无需微调。该框架提供了透明的推理，并已公开发布。

ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting

Authors: Abhijit Mishra, Mingda Li, Hsiang Fu, Richard Noh, Minji Kim

First: 2025-02-20T18:01:41+00:00 · Latest: 2025-12-31T15:43:05+00:00

Comments: Accepted and to appear in IJCNLP-AACL 2025

Abstract

Efficient and privacy-preserving multimodal interaction is essential as AR, VR, and modern smartphones with powerful cameras become primary interfaces for human-computer communication. Existing powerful large vision-language models (VLMs) enabling multimodal interaction often rely on cloud-based processing, raising significant concerns about (1) visual privacy by transmitting sensitive vision data to servers, and (2) their limited real-time, on-device usability. This paper explores Visual Instruction Rewriting, a novel approach that transforms multimodal instructions into text-only commands, allowing seamless integration of lightweight on-device instruction rewriter VLMs (250M parameters) with existing conversational AI systems, enhancing vision data privacy. To achieve this, we present a dataset of over 39,000 examples across 14 domains and develop a compact VLM, pretrained on image captioning datasets and fine-tuned for instruction rewriting. Experimental results, evaluated through NLG metrics such as BLEU, METEOR, and ROUGE, along with semantic parsing analysis, demonstrate that even a quantized version of the model (<500MB storage footprint) can achieve effective instruction rewriting, thus enabling privacy-focused, multimodal AI applications.

中文标题/摘要

标题：ReVision：一种用于隐私保护任务导向视觉指令重写的数据集和基线VLM

随着AR、VR和配备强大摄像头的现代智能手机成为人机通信的主要接口，高效的隐私保护多模态交互变得至关重要。现有的强大视觉-语言模型（VLMs）支持多模态交互，通常依赖于基于云的处理，这引发了（1）视觉隐私问题，即传输敏感的视觉数据到服务器，以及（2）其有限的实时、设备端可用性问题。本文探讨了视觉指令重写这一新颖的方法，即将多模态指令转换为纯文本命令，允许轻量级的设备端指令重写VLM（参数量250M）与现有的对话AI系统无缝集成，增强视觉数据隐私。为此，我们提供了一个涵盖14个领域的超过39,000个示例的数据集，并开发了一个紧凑的VLM，该模型在图像字幕数据集上进行预训练，并针对指令重写进行了微调。实验结果通过NLG指标（如BLEU、METEOR和ROUGE）以及语义解析分析评估，表明即使是量化版本的模型（存储占用<500MB）也能实现有效的指令重写，从而实现以隐私为中心的多模态AI应用。

Summary / 总结

This paper addresses the need for efficient and privacy-preserving multimodal interaction by introducing ReVision, a dataset and baseline vision-language model for visual instruction rewriting. The model transforms multimodal instructions into text-only commands, enhancing privacy and on-device usability. Experiments show that even a quantized version of the model can effectively rewrite instructions, achieving good performance on NLG metrics and semantic parsing analysis.

本文探讨了AR和VR技术兴起背景下高效且隐私保护的多模态交互需求。提出了ReVision数据集和基线视觉语言模型，用于隐私保护的任务导向视觉指令重写。该模型将多模态指令转换为纯文本命令，支持轻量级的设备端处理。实验结果显示，即使是量化版本的模型也能有效重写指令，从而增强隐私保护同时保持性能。

Are First-Order Diffusion Samplers Really Slower? A Fast Forward-Value Approach

Authors: Yuchen Jiao, Na Li, Changxiao Cai, Gen Li

First: 2025-12-31T15:35:53+00:00 · Latest: 2025-12-31T15:35:53+00:00

Abstract

Higher-order ODE solvers have become a standard tool for accelerating diffusion probabilistic model (DPM) sampling, motivating the widespread view that first-order methods are inherently slower and that increasing discretization order is the primary path to faster generation. This paper challenges this belief and revisits acceleration from a complementary angle: beyond solver order, the placement of DPM evaluations along the reverse-time dynamics can substantially affect sampling accuracy in the low-neural function evaluation (NFE) regime. We propose a novel training-free, first-order sampler whose leading discretization error has the opposite sign to that of DDIM. Algorithmically, the method approximates the forward-value evaluation via a cheap one-step lookahead predictor. We provide theoretical guarantees showing that the resulting sampler provably approximates the ideal forward-value trajectory while retaining first-order convergence. Empirically, across standard image generation benchmarks (CIFAR-10, ImageNet, FFHQ, and LSUN), the proposed sampler consistently improves sample quality under the same NFE budget and can be competitive with, and sometimes outperform, state-of-the-art higher-order samplers. Overall, the results suggest that the placement of DPM evaluations provides an additional and largely independent design angle for accelerating diffusion sampling.

中文标题/摘要

标题：一阶扩散采样器真的更慢吗？一种快速前向值方法

高阶ODE求解器已成为加速扩散概率模型(DPM)采样的标准工具，这促使人们普遍认为一阶方法本质上更慢，并且提高离散化阶数是实现更快生成的主要途径。本文挑战了这一观点，并从互补的角度重新审视加速：除了求解器阶数之外，DPM评估在反向时间动力学中的位置会在低神经网络评估次数(NFE)区间内显著影响采样精度。我们提出了一种新的无需训练的一阶采样器，其主要离散化误差与DDIM相反。算法上，该方法通过廉价的一步前瞻预测器近似前向值评估。我们提供了理论保证，表明该采样器能够证明地逼近理想的前向值轨迹，同时保持一阶收敛性。实验上，在标准图像生成基准（CIFAR-10、ImageNet、FFHQ和LSUN）上，所提出的采样器在相同的NFE预算下始终能提高样本质量，并且可以与最先进的高阶采样器竞争，有时甚至优于它们。总体而言，结果表明，DPM评估的位置提供了加速扩散采样的另一个独立设计角度。

Summary / 总结

This paper challenges the belief that first-order diffusion samplers are inherently slower than higher-order methods. It proposes a training-free first-order sampler that approximates the forward-value evaluation via a cheap one-step lookahead predictor and whose leading discretization error has the opposite sign to that of DDIM. The sampler retains first-order convergence while improving sample quality under the same neural function evaluation (NFE) budget. Empirically, it is competitive with, and sometimes outperforms, state-of-the-art higher-order samplers on CIFAR-10, ImageNet, FFHQ, and LSUN.

该论文挑战了一阶扩散采样器本质上比高阶方法更慢的观点。它提出了一种新型的一阶采样器，通过廉价的一步前瞻预测来近似前向值评估，在保持一阶收敛性的同时，在相同的神经函数评估预算下提高了样本质量。实验结果显示，该方法在各种图像生成基准上可与最先进的高阶采样器媲美，有时甚至优于它们。
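
The claim that where the drift is evaluated can flip the sign of the leading discretization error can be seen on a toy problem. The sketch below is not the paper's sampler: it integrates the scalar ODE dx/dt = -x (a hypothetical stand-in for a learned drift), uses explicit Euler as a stand-in for a backward-value first-order update, and compares it with a step that evaluates the drift at a one-step lookahead prediction. This naive version spends two drift evaluations per step; how the paper fits the lookahead into a fixed NFE budget is not described in the abstract.

```python
import math

def drift(x):
    """Toy linear drift dx/dt = -x (assumed stand-in for a learned model)."""
    return -x

def euler_step(x, h):
    # Evaluate the drift at the current point.
    return x + h * drift(x)

def lookahead_step(x, h):
    # Predict the next state with one cheap Euler step,
    # then evaluate the drift at that predicted forward point.
    x_pred = x + h * drift(x)
    return x + h * drift(x_pred)

def integrate(step, x0, h, n_steps):
    """Apply a one-step update n_steps times starting from x0."""
    x = x0
    for _ in range(n_steps):
        x = step(x, h)
    return x

# Integrate from t = 0 to t = 1 and compare with the exact solution exp(-1).
exact = math.exp(-1.0)
euler_err = integrate(euler_step, 1.0, 0.1, 10) - exact
lookahead_err = integrate(lookahead_step, 1.0, 0.1, 10) - exact
```

On this problem the Euler update multiplies the state by 1 - h per step and undershoots the exact solution, while the lookahead update multiplies by 1 - h + h^2 and overshoots it, so `euler_err` is negative and `lookahead_err` is positive. Both errors shrink roughly in proportion to h, consistent with first-order convergence.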

Video and Language Alignment in 2D Systems for 3D Multi-object Scenes with Multi-Information Derivative-Free Control

Authors: Jason Armitage, Rico Sennrich

First: 2025-12-31T12:39:03+00:00 · Latest: 2025-12-31T12:39:03+00:00

Abstract

Cross-modal systems trained on 2D visual inputs are presented with a dimensional shift when processing 3D scenes. An in-scene camera bridges the dimensionality gap but requires learning a control module. We introduce a new method that improves multivariate mutual information estimates by regret minimisation with derivative-free optimisation. Our algorithm enables off-the-shelf cross-modal systems trained on 2D visual inputs to adapt online to object occlusions and differentiate features. The pairing of expressive measures and value-based optimisation assists control of an in-scene camera to learn directly from the noisy outputs of vision-language models. The resulting pipeline improves performance in cross-modal tasks on multi-object 3D scenes without resorting to pretraining or finetuning.

中文标题/摘要

标题：面向3D多物体场景的2D系统视频与语言对齐：基于多信息无导数控制

跨模态系统在处理3D场景时面临维度跃迁问题，通过场景内相机可以弥合维度差距，但需要学习一个控制模块。我们提出了一种新方法，通过无导数优化实现后悔最小化，以提高多元互信息估计。我们的算法使基于2D视觉输入训练的即插即用跨模态系统能够在线适应物体遮挡并区分特征。富有表现力的度量与基于价值的优化相结合，帮助场景内相机直接从视觉语言模型的嘈杂输出中学习。由此产生的流水线在无需预训练或微调的情况下，提高了跨模态任务在多对象3D场景中的性能。

Summary / 总结

The research addresses the challenge of aligning video and language in 2D systems for processing 3D scenes. It introduces a method that uses regret minimisation with derivative-free optimisation to improve multivariate mutual information estimates. This method allows cross-modal systems trained on 2D inputs to adapt to 3D scenes, handle object occlusions, and differentiate features. The approach enables the system to learn directly from vision-language model outputs, enhancing performance in cross-modal tasks on multi-object 3D scenes without pretraining or fine-tuning.

研究解决了2D跨模态系统处理3D场景的挑战，这些系统虽然训练于2D视觉输入，但面临维度上的转变。作者提出了一种方法，通过使用无导数优化的后悔最小化来增强多变量互信息估计。该方法使系统能够在线适应物体遮挡并区分特征，从而在多对象3D场景的跨模态任务中提高性能，无需预训练或微调。

CritiFusion: Semantic Critique and Spectral Alignment for Faithful Text-to-Image Generation

Authors: ZhenQi Chen, TsaiChing Ni, YuanFu Yang

First: 2025-12-27T19:08:18+00:00 · Latest: 2025-12-31T10:44:57+00:00

Abstract

Recent text-to-image diffusion models have achieved remarkable visual fidelity but often struggle with semantic alignment to complex prompts. We introduce CritiFusion, a novel inference-time framework that integrates a multimodal semantic critique mechanism with frequency-domain refinement to improve text-to-image consistency and detail. The proposed CritiCore module leverages a vision-language model and multiple large language models to enrich the prompt context and produce high-level semantic feedback, guiding the diffusion process to better align generated content with the prompt's intent. Additionally, SpecFusion merges intermediate generation states in the spectral domain, injecting coarse structural information while preserving high-frequency details. No additional model training is required. CritiFusion serves as a plug-in refinement stage compatible with existing diffusion backbones. Experiments on standard benchmarks show that our method notably improves human-aligned metrics of text-to-image correspondence and visual quality. CritiFusion consistently boosts performance on human preference scores and aesthetic evaluations, achieving results on par with state-of-the-art reward optimization approaches. Qualitative results further demonstrate superior detail, realism, and prompt fidelity, indicating the effectiveness of our semantic critique and spectral alignment strategy.

中文标题/摘要

标题：CritiFusion: 语义批评和光谱对齐的文本到图像生成

近期的文本到图像扩散模型在视觉保真度方面取得了显著进展，但往往难以与复杂的提示实现语义对齐。我们提出了一种名为CritiFusion的新型推理时框架，该框架结合了多模态语义批评机制和频域细化，以提高文本到图像的一致性和细节。所提出的CritiCore模块利用视觉语言模型和多个大型语言模型来丰富提示上下文并生成高层次的语义反馈，引导扩散过程更好地与提示的意图对齐。此外，SpecFusion在频域中合并中间生成状态，注入粗略的结构信息同时保留高频细节。无需额外的模型训练。CritiFusion作为与现有扩散主干兼容的插件细化阶段。在标准基准上的实验表明，我们的方法显著提高了文本到图像对应和视觉质量的人类对齐指标。CritiFusion在人类偏好评分和美学评估中持续提升性能，达到与最先进的奖励优化方法相当的结果。定性结果进一步证明了我们的语义批评和光谱对齐策略在细节、真实性和提示忠实度方面的优越性。

Summary / 总结

CritiFusion is a novel framework that enhances text-to-image generation by integrating a semantic critique mechanism and spectral alignment. It uses a vision-language model and multiple large language models to enrich the prompt context and guide the diffusion process, improving alignment with the prompt's intent. Additionally, it merges intermediate generation states in the spectral domain to preserve high-frequency details. Experiments show that CritiFusion significantly improves human-aligned metrics and aesthetic evaluations, achieving results comparable to state-of-the-art reward optimization approaches.

研究旨在通过提出CritiFusion框架解决文本到图像生成中的语义对齐问题，该框架结合了语义批评机制和频域对齐。CritiCore使用视觉语言模型和语言模型丰富提示上下文，引导扩散过程，而SpecFusion在频域中细化中间生成状态，保留细节。实验表明，CritiFusion提高了人类对齐的度量标准、美学评估和人类偏好评分，达到了与最先进的奖励优化方法相当的结果。
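
Merging two states in the spectral domain, taking coarse structure from one and fine detail from the other, can be illustrated on 1-D signals. This is a sketch of the general frequency-split idea only, not the SpecFusion module: it uses a plain discrete Fourier transform on short lists, and the function name `spectral_merge` and the hard `cutoff` are assumptions made for the example.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spec):
    """Inverse DFT, returning the real part of each sample."""
    n = len(spec)
    return [(sum(spec[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]

def spectral_merge(structure, detail, cutoff):
    """Take low-frequency bins from `structure` and the rest from `detail`."""
    n = len(structure)
    s_spec, d_spec = dft(structure), dft(detail)
    # Bin k and bin n - k carry the same frequency, so compare min(k, n - k).
    merged = [s_spec[k] if min(k, n - k) <= cutoff else d_spec[k] for k in range(n)]
    return idft(merged)
```

For example, merging a constant signal (pure low frequency) with an alternating signal (pure high frequency) at `cutoff=1` keeps both components, since each source contributes only the band it is selected for. A smooth mask instead of a hard cutoff would reduce ringing on real images; that choice is left open here.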

Multimodal Fact-Checking: An Agent-based Approach

Authors: Danni Xu, Shaojing Fan, Harry Cheng, Mohan Kankanhalli

First: 2025-12-28T13:58:33+00:00 · Latest: 2025-12-31T09:37:15+00:00

Comments: Code and dataset will be released at https://github.com/xudanni0927/AgentFact

Abstract

The rapid spread of multimodal misinformation poses a growing challenge for automated fact-checking systems. Existing approaches, including large vision language models (LVLMs) and deep multimodal fusion methods, often fall short due to limited reasoning and shallow evidence utilization. A key bottleneck is the lack of dedicated datasets that provide complete real-world multimodal misinformation instances accompanied by annotated reasoning processes and verifiable evidence. To address this limitation, we introduce RW-Post, a high-quality and explainable dataset for real-world multimodal fact-checking. RW-Post aligns real-world multimodal claims with their original social media posts, preserving the rich contextual information in which the claims are made. In addition, the dataset includes detailed reasoning and explicitly linked evidence, which are derived from human written fact-checking articles via a large language model assisted extraction pipeline, enabling comprehensive verification and explanation. Building upon RW-Post, we propose AgentFact, an agent-based multimodal fact-checking framework designed to emulate the human verification workflow. AgentFact consists of five specialized agents that collaboratively handle key fact-checking subtasks, including strategy planning, high-quality evidence retrieval, visual analysis, reasoning, and explanation generation. These agents are orchestrated through an iterative workflow that alternates between evidence searching and task-aware evidence filtering and reasoning, facilitating strategic decision-making and systematic evidence analysis. Extensive experimental results demonstrate that the synergy between RW-Post and AgentFact substantially improves both the accuracy and interpretability of multimodal fact-checking.

中文标题/摘要

标题：多模态事实核查：一种基于代理的方法

多模态错误信息的快速传播对自动化事实核查系统构成了日益严峻的挑战。现有的方法，包括大型视觉语言模型（LVLM）和深度多模态融合方法，往往由于推理能力有限和证据利用浅显而效果不佳。一个关键瓶颈是没有专门的数据集提供完整的现实世界多模态错误信息实例，并附带标注的推理过程和可验证的证据。为解决这一限制，我们引入了RW-Post，这是一个高质量且可解释的现实世界多模态事实核查数据集。RW-Post将现实世界多模态声明与其原始社交媒体帖子对齐，保留了声明中丰富的上下文信息。此外，该数据集还包括详细的推理过程和明确链接的证据，这些证据是通过大型语言模型辅助提取管道从人类撰写的事实核查文章中提取出来的，从而实现全面验证和解释。基于RW-Post，我们提出了AgentFact，这是一种代理导向的多模态事实核查框架，旨在模拟人类验证工作流程。AgentFact由五个专门的代理组成，它们协作处理关键的事实核查子任务，包括策略规划、高质量证据检索、视觉分析、推理和解释生成。这些代理通过迭代工作流协调，该工作流在证据搜索和任务感知证据过滤与推理之间交替进行，促进战略决策和系统性证据分析。广泛的实验结果表明，RW-Post与AgentFact之间的协同作用显著提高了多模态事实核查的准确性和可解释性。

Summary / 总结

The paper addresses the challenge of automated fact-checking for multimodal misinformation by introducing RW-Post, a new dataset that includes detailed reasoning and linked evidence. Based on RW-Post, the authors propose AgentFact, an agent-based framework that collaborates through an iterative workflow to handle key fact-checking tasks. Experimental results show that AgentFact enhances both the accuracy and interpretability of multimodal fact-checking compared to existing methods.

论文通过引入RW-Post数据集，该数据集提供了包含详细推理过程和证据的完整现实世界多模态虚假信息实例，来应对自动化事实核查的挑战。随后提出了AgentFact，这是一种基于代理的框架，通过迭代工作流协作处理关键事实核查子任务，提高了准确性和可解释性。实验结果表明，RW-Post和AgentFact的协同作用显著提升了多模态事实核查系统的性能。

ColaVLA: Leveraging Cognitive Latent Reasoning for Hierarchical Parallel Trajectory Planning in Autonomous Driving

Authors: Qihang Peng, Xuesong Chen, Chenye Yang, Shaoshuai Shi, Hongsheng Li

First: 2025-12-28T14:06:37+00:00 · Latest: 2025-12-31T09:18:13+00:00

Comments: 11 pages, 4 figures. Project page: https://pqh22.github.io/projects/ColaVLA/index.html

Abstract

Autonomous driving requires generating safe and reliable trajectories from complex multimodal inputs. Traditional modular pipelines separate perception, prediction, and planning, while recent end-to-end (E2E) systems learn them jointly. Vision-language models (VLMs) further enrich this paradigm by introducing cross-modal priors and commonsense reasoning, yet current VLM-based planners face three key challenges: (i) a mismatch between discrete text reasoning and continuous control, (ii) high latency from autoregressive chain-of-thought decoding, and (iii) inefficient or non-causal planners that limit real-time deployment. We propose ColaVLA, a unified vision-language-action framework that transfers reasoning from text to a unified latent space and couples it with a hierarchical, parallel trajectory decoder. The Cognitive Latent Reasoner compresses scene understanding into compact, decision-oriented meta-action embeddings through ego-adaptive selection and only two VLM forward passes. The Hierarchical Parallel Planner then generates multi-scale, causality-consistent trajectories in a single forward pass. Together, these components preserve the generalization and interpretability of VLMs while enabling efficient, accurate and safe trajectory generation. Experiments on the nuScenes benchmark show that ColaVLA achieves state-of-the-art performance in both open-loop and closed-loop settings with favorable efficiency and robustness.

中文标题/摘要

标题：ColaVLA：利用认知潜在推理进行自主驾驶分层并行轨迹规划

自主驾驶需要从复杂的多模态输入中生成安全可靠的轨迹。传统模块化管道将感知、预测和规划分离，而近期的端到端(E2E)系统则联合学习它们。视觉语言模型(VLMs)进一步丰富了这一范式，通过引入跨模态先验和常识推理，然而当前基于VLM的规划器面临三个关键挑战：(i) 离散文本推理与连续控制之间的不匹配，(ii) 自回归链式思考解码的高延迟，(iii) 低效或非因果规划器限制了实时部署。我们提出ColaVLA，一个统一的视觉语言行动框架，将推理从文本转移到统一的潜在空间，并与分层并行轨迹解码器耦合。认知潜在推理器通过自我适应选择将场景理解压缩为决策导向的元动作嵌入，仅需两次VLM前向传递。分层并行规划器则在单次前向传递中生成多尺度、因果一致的轨迹。这些组件共同保留了VLM的泛化能力和可解释性，同时实现高效、准确和安全的轨迹生成。在nuScenes基准测试中，ColaVLA在开环和闭环设置中均实现了最先进的性能，具有有利的效率和鲁棒性。

Summary / 总结

ColaVLA addresses the challenges of generating safe and reliable trajectories in autonomous driving by leveraging a unified vision-language-action framework. It uses a Cognitive Latent Reasoner to compress scene understanding into compact embeddings and a Hierarchical Parallel Planner to generate multi-scale, causality-consistent trajectories efficiently. Experiments show that ColaVLA outperforms existing methods in both open-loop and closed-loop settings with better efficiency and robustness.

ColaVLA 通过利用统一的视觉-语言-动作框架来解决生成安全可靠轨迹的挑战。它使用认知潜空间推理器将场景理解压缩成紧凑的嵌入，并使用分层并行规划器在单次前向传递中生成多尺度、因果一致的轨迹。实验表明，ColaVLA 在 nuScenes 基准测试中在开环和闭环设置中均优于现有方法，具有更好的效率和鲁棒性。

LSRE: Latent Semantic Rule Encoding for Real-Time Semantic Risk Detection in Autonomous Driving

Authors: Qian Cheng, Weitao Zhou, Cheng Jing, Nanshan Deng, Junze Wen, Zhaoyang Liu, Kun Jiang, Diange Yang

First: 2025-12-31T08:27:10+00:00 · Latest: 2025-12-31T08:27:10+00:00

Abstract

Real-world autonomous driving must adhere to complex human social rules that extend beyond legally codified traffic regulations. Many of these semantic constraints, such as yielding to emergency vehicles, complying with traffic officers' gestures, or stopping for school buses, are intuitive for humans yet difficult to encode explicitly. Although large vision-language models (VLMs) can interpret such semantics, their inference cost makes them impractical for real-time deployment. This work proposes LSRE, a Latent Semantic Rule Encoding framework that converts sparsely sampled VLM judgments into decision boundaries within the latent space of a recurrent world model. By encoding language-defined safety semantics into a lightweight latent classifier, LSRE enables real-time semantic risk assessment at 10 Hz without per-frame VLM queries. Experiments on six semantic-failure scenarios in CARLA demonstrate that LSRE attains semantic risk detection accuracy comparable to a large VLM baseline, while providing substantially earlier hazard anticipation and maintaining low computational latency. LSRE further generalizes to rarely seen semantic-similar test cases, indicating that language-guided latent classification offers an effective and deployable mechanism for semantic safety monitoring in autonomous driving.

中文标题/摘要

标题：LSRE：实时语义风险检测的潜在语义规则编码在自动驾驶中的应用

真实的自动驾驶必须遵守复杂的社会规则，这些规则超出了法律规定的交通法规。许多语义约束，如为紧急车辆让路、遵守交通警察的手势或为校车停车，对于人类来说是直观的，但很难明确编码。尽管大型视觉-语言模型（VLMs）可以解释这些语义，但其推理成本使其在实时部署中不切实际。本文提出了一种LSRE（潜在语义规则编码）框架，将稀疏采样的VLM判断转化为递归世界模型潜在空间中的决策边界。通过将语言定义的安全语义编码到轻量级的潜在分类器中，LSRE能够在10 Hz的频率下进行实时语义风险评估，而无需每帧查询VLM。在CARLA上的六个语义失败场景实验表明，LSRE在语义风险检测准确性方面与大型VLM基线相当，同时提供显著更早的危险预知，并保持较低的计算延迟。此外，LSRE还能够泛化到罕见的语义相似测试案例，表明语言引导的潜在分类为自动驾驶中的语义安全监控提供了一种有效且可部署的机制。

Summary / 总结

The research aims to address the challenge of real-time semantic risk detection in autonomous driving, where complex social rules beyond legal regulations must be followed. LSRE, a Latent Semantic Rule Encoding framework, converts VLM judgments into decision boundaries within a recurrent world model, enabling real-time semantic risk assessment at 10 Hz without per-frame VLM queries. Experiments show that LSRE achieves comparable semantic risk detection accuracy to a large VLM baseline, with earlier hazard anticipation and low computational latency, and it generalizes well to unseen cases.

研究旨在解决自动驾驶中实时语义风险检测的挑战，需要遵循超出法律法规的复杂社会规则。LSRE（Latent Semantic Rule Encoding）框架将VLM判断转化为循环世界模型中的决策边界，实现每秒10次的实时语义风险评估，无需每帧查询VLM。实验表明，LSRE在语义风险检测准确性上与大型VLM基线相当，具有更早的危险预兆和低计算延迟，并且能够很好地泛化到未见过的案例中。
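
The idea of distilling sparse VLM judgments into a lightweight classifier over latent states can be sketched with plain logistic regression. This is an illustrative assumption, not LSRE's architecture: the 2-D latent vectors are made up, the labels merely play the role of occasional VLM verdicts (1 = risky), and the names `train_latent_classifier` and `risk_score` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_latent_classifier(latents, labels, lr=0.5, epochs=200):
    """Fit logistic regression with per-sample gradient steps."""
    dim = len(latents[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(latents, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the log loss with respect to the logit
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def risk_score(w, b, latent):
    """Probability-like risk estimate for one latent state (no VLM call)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, latent)) + b)

# Hypothetical latents; labels stand in for sparsely sampled VLM judgments.
LATENTS = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (1.0, 0.9), (0.8, 1.1), (1.2, 1.0)]
LABELS = [0, 0, 0, 1, 1, 1]
W, B = train_latent_classifier(LATENTS, LABELS)
```

Once trained, `risk_score(W, B, latent)` costs a dot product per frame, which is what makes a fixed-rate check feasible without querying the large model at every step.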

Evolving, Not Training: Zero-Shot Reasoning Segmentation via Evolutionary Prompting

Authors: Kai Ye, Xiaotong You, Jianghang Lin, Jiayi Ji, Pingyang Dai, Liujuan Cao

First: 2025-12-31T08:10:03+00:00 · Latest: 2025-12-31T08:10:03+00:00

Abstract

Reasoning Segmentation requires models to interpret complex, context-dependent linguistic queries to achieve pixel-level localization. Current dominant approaches rely heavily on Supervised Fine-Tuning (SFT) or Reinforcement Learning (RL). However, SFT suffers from catastrophic forgetting and domain dependency, while RL is often hindered by training instability and rigid reliance on predefined reward functions. Although recent training-free methods circumvent these training burdens, they are fundamentally limited by a static inference paradigm. These methods typically rely on a single-pass "generate-then-segment" chain, which suffers from insufficient reasoning depth and lacks the capability to self-correct linguistic hallucinations or spatial misinterpretations. In this paper, we challenge these limitations and propose EVOL-SAM3, a novel zero-shot framework that reformulates reasoning segmentation as an inference-time evolutionary search process. Instead of relying on a fixed prompt, EVOL-SAM3 maintains a population of prompt hypotheses and iteratively refines them through a "Generate-Evaluate-Evolve" loop. We introduce a Visual Arena to assess prompt fitness via reference-free pairwise tournaments, and a Semantic Mutation operator to inject diversity and correct semantic errors. Furthermore, a Heterogeneous Arena module integrates geometric priors with semantic reasoning to ensure robust final selection. Extensive experiments demonstrate that EVOL-SAM3 not only substantially outperforms static baselines but also significantly surpasses fully supervised state-of-the-art methods on the challenging ReasonSeg benchmark in a zero-shot setting. The code is available at https://github.com/AHideoKuzeA/Evol-SAM3.

中文标题/摘要

标题：演化而非训练：通过演化提示实现零样本推理分割

推理分割要求模型解释复杂的、上下文相关的语言查询以实现像素级定位。当前主流方法主要依赖监督微调（SFT）或强化学习（RL）。然而，SFT面临灾难性遗忘和领域依赖性问题，而RL常常受到训练不稳定性及对预定义奖励函数的严格依赖的困扰。尽管最近的无训练方法绕过了这些训练负担，但它们本质上受限于静态推理范式。这些方法通常依赖于一次性的“先生成后分割”链，这导致推理深度不足，缺乏自我纠正语言幻觉或空间误解的能力。在本文中，我们挑战这些限制并提出EVOL-SAM3，这是一种新颖的零样本框架，将推理分割重新定义为推理时的演化搜索过程。EVOL-SAM3 不依赖于固定提示，而是维护一组提示假设，并通过“生成-评估-演化”循环迭代优化它们。我们引入了视觉竞技场，通过无参考的两两对决评估提示适应度，并引入语义变异算子来注入多样性并纠正语义错误。此外，异构竞技场模块结合几何先验与语义推理以确保稳健的最终选择。大量实验表明，EVOL-SAM3 不仅大幅优于静态基线，还在零样本设置下于具有挑战性的ReasonSeg基准上显著超越完全监督的最新方法。代码可在 https://github.com/AHideoKuzeA/Evol-SAM3 获取。

Summary / 总结

The paper addresses the limitations of current reasoning segmentation methods, which rely on supervised fine-tuning or reinforcement learning, by proposing EVOL-SAM3. This framework reformulates reasoning segmentation as an evolutionary search process at inference time, maintaining a population of prompt hypotheses and iteratively refining them through a 'Generate-Evaluate-Evolve' loop. The method uses a Visual Arena for prompt fitness assessment and a Semantic Mutation operator to inject diversity and correct semantic errors. Experiments show that EVOL-SAM3 outperforms static baselines and fully supervised state-of-the-art methods on the ReasonSeg benchmark in a zero-shot setting.

本文提出了一种名为EVOL-SAM3的零样本框架，该框架在推理时使用进化搜索过程来解决推理分割的挑战。与依赖固定提示或监督微调的传统方法不同，EVOL-SAM3维护了一群提示假设，并通过‘生成-评估-进化’循环逐步优化它们。该框架引入了视觉竞技场来评估提示的适应度，并引入了语义变异操作来纠正语义错误。实验表明，EVOL-SAM3在零样本设置下不仅超越了静态基线，还在挑战性的ReasonSeg基准上显著超过了全监督的最新方法。

Less is More: Improving LLM Reasoning with Minimal Test-Time Intervention

Authors: Zhen Yang, Mingyang Zhang, Feng Chen, Ganggui Ding, Liang Hou, Xin Tao, Pengfei Wan, Ying-Cong Chen

First: 2025-10-15T17:59:45+00:00 · Latest: 2025-12-31T07:36:24+00:00

Comments: Code: https://github.com/EnVision-Research/MTI

Abstract

Recent progress in large language models (LLMs) has focused on test-time scaling to improve reasoning via increased inference computation, but often at the cost of efficiency. We revisit test-time behavior and uncover a simple yet underexplored phenomenon: reasoning uncertainty is highly localized, and only a small subset of high-entropy tokens dominantly affects output correctness. Motivated by this, we propose Minimal Test-Time Intervention (MTI), a training-free framework that enhances reasoning accuracy and stability with minimal overhead. MTI includes: (i) Selective CFG intervention, applying classifier-free guidance only at uncertain positions; and (ii) Lightweight negative-prompt guidance, reusing the main model's KV cache to approximate unconditional decoding efficiently. MTI yields consistent gains across general, coding, and STEM tasks (e.g., +9.28% average improvement on six benchmarks for DeepSeek-R1-7B and +11.25% on AIME2024 using Ling-mini-2.0) while remaining highly efficient.
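A minimal sketch of the selective-guidance idea, under stated assumptions: the entropy threshold, the guidance scale, and the function names are hypothetical, and the guidance formula is the standard classifier-free guidance combination rather than anything confirmed by the abstract. Guidance is applied only when the conditional next-token distribution is uncertain; confident positions are left untouched.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def selective_cfg(cond_logits: List[float],
                  neg_logits: List[float],
                  threshold: float = 1.0,   # hypothetical entropy cutoff
                  scale: float = 1.5) -> List[float]:  # hypothetical guidance scale
    """Apply guidance only at high-entropy positions."""
    if entropy(softmax(cond_logits)) <= threshold:
        return list(cond_logits)  # confident position: no intervention
    # Standard CFG combination: move away from the negative-prompt branch.
    return [n + scale * (c - n) for c, n in zip(cond_logits, neg_logits)]
```

Because most positions fall below the threshold, the extra negative-prompt computation is only needed for a small fraction of tokens, which is the source of the claimed low overhead.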

中文标题/摘要

标题：少即是多：通过最小化测试时干预提高LLM推理能力

大型语言模型（LLMs）的近期进展集中在通过增加推理计算来提高测试时的推理能力，但往往以效率为代价。我们重新审视测试时的行为，并发现一个简单但未充分探索的现象：推理不确定性是高度局部化的——只有少量高熵令牌主要影响输出的正确性。受此启发，我们提出了最小化测试时干预（MTI），这是一种无需训练的框架，通过最小的开销来增强推理准确性和稳定性。MTI 包括：(i) 选择性CFG干预，在不确定位置应用无分类引导；(ii) 轻量级负提示引导，重用主模型的KV缓存以高效地近似无条件解码。MTI 在通用任务、编程任务和STEM任务中均表现出一致的改进——例如，DeepSeek-R1-7B在六个基准上的平均改进为+9.28%，AIME2024使用Ling-mini-2.0时为+11.25%，同时保持高度高效。

Summary / 总结

This paper addresses the efficiency cost of test-time scaling in large language models (LLMs) by proposing Minimal Test-Time Intervention (MTI), a training-free framework that enhances reasoning accuracy and stability with minimal overhead. MTI applies classifier-free guidance only at uncertain (high-entropy) positions and uses lightweight negative-prompt guidance that reuses the main model's KV cache. The method consistently improves performance across general, coding, and STEM tasks, including a +9.28% average improvement on six benchmarks for DeepSeek-R1-7B and +11.25% on AIME2024 using Ling-mini-2.0, while maintaining efficiency.

本文提出了一种最小测试时干预（MTI）方法，通过仅在不确定位置应用分类器无条件引导和轻量级负提示引导，以最小的开销提升大型语言模型（LLMs）的推理准确性和稳定性。MTI在各种任务中表现出一致的改进，例如在六个基准测试中DeepSeek-R1-7B的平均改进为+9.28%，在AIME2024中使用Ling-mini-2.0的改进为+11.25%。

ProCache: Constraint-Aware Feature Caching with Selective Computation for Diffusion Transformer Acceleration

Authors: Fanpu Cao, Yaofo Chen, Zeng You, Wei Luo, Cen Chen

Venue: AAAI 2026 poster

First: 2025-12-19T07:27:19+00:00 · Latest: 2025-12-31T06:37:00+00:00

Comments: Accepted for poster presentation at AAAI 2026

Abstract

Diffusion Transformers (DiTs) have achieved state-of-the-art performance in generative modeling, yet their high computational cost hinders real-time deployment. While feature caching offers a promising training-free acceleration solution by exploiting temporal redundancy, existing methods suffer from two key limitations: (1) uniform caching intervals fail to align with the non-uniform temporal dynamics of DiT, and (2) naive feature reuse with excessively large caching intervals can lead to severe error accumulation. In this work, we analyze the evolution of DiT features during denoising and reveal that both feature changes and error propagation are highly time- and depth-varying. Motivated by this, we propose ProCache, a training-free dynamic feature caching framework that addresses these issues via two core components: (i) a constraint-aware caching pattern search module that generates non-uniform activation schedules through offline constrained sampling, tailored to the model's temporal characteristics; and (ii) a selective computation module that selectively computes within deep blocks and high-importance tokens for cached segments to mitigate error accumulation with minimal overhead. Extensive experiments on PixArt-alpha and DiT demonstrate that ProCache achieves up to 1.96x and 2.90x acceleration with negligible quality degradation, significantly outperforming prior caching-based methods.
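To make the notion of a non-uniform activation schedule concrete, here is a small sketch. It assumes a schedule is a list of booleans (True means "recompute features at this denoising step", False means "reuse the cache"); the two constraints checked, a compute budget and a maximum caching interval, are illustrative choices and not necessarily those used by ProCache. The selective computation module is not modeled.

```python
from typing import Callable, List


def max_cache_interval(schedule: List[bool]) -> int:
    """Longest run of consecutive steps that reuse cached features."""
    longest = run = 0
    for active in schedule:
        run = 0 if active else run + 1
        longest = max(longest, run)
    return longest


def satisfies_constraints(schedule: List[bool],
                          max_interval: int,
                          budget: int) -> bool:
    """Hypothetical constraints: first step computed, bounded cost, bounded reuse."""
    return (bool(schedule)
            and schedule[0]
            and sum(schedule) <= budget
            and max_cache_interval(schedule) <= max_interval)


def run_with_cache(schedule: List[bool],
                   compute_features: Callable[[int], object]) -> list:
    """Return the features used at each step, recomputing only on active steps."""
    cache = None
    used = []
    for step, active in enumerate(schedule):
        if active or cache is None:
            cache = compute_features(step)
        used.append(cache)
    return used


# Dense computation early, sparser later: a non-uniform pattern.
schedule = [True, True, False, True, False, False, True, False, False, False]
```

An offline search in this spirit would sample many candidate schedules, discard those that violate the constraints, and keep the one with the best quality proxy.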

中文标题/摘要

标题：ProCache：基于约束的特征缓存与选择性计算以加速扩散变换器

扩散变换器（DiTs）在生成建模中取得了最先进的性能，但其高昂的计算成本阻碍了实时部署。虽然特征缓存通过利用时间冗余提供了一种无训练的加速解决方案，但现有方法存在两个关键局限性：（1）均匀的缓存间隔无法与DiT的时间非均匀动态对齐；（2）使用过大的缓存间隔进行简单的特征重用会导致严重的误差累积。在本文中，我们分析了去噪过程中DiT特征的演变，发现特征变化和误差传播在时间和深度上都高度变化。受此启发，我们提出了ProCache，这是一种无需训练的动态特征缓存框架，通过两个核心组件解决了这些问题：（i）一种约束感知的缓存模式搜索模块，通过离线约束采样生成非均匀的激活时间表，针对模型的时间特性进行定制；（ii）一种选择性计算模块，对缓存段仅在深层块和高重要性标记上进行选择性计算，以最小的开销减轻误差累积。在PixArt-alpha和DiT上的广泛实验表明，ProCache在几乎不降低质量的情况下实现了高达1.96倍和2.90倍的加速，显著优于先前的基于缓存的方法。

Summary / 总结

ProCache is a training-free dynamic feature caching framework designed to accelerate Diffusion Transformers (DiTs) by addressing two limitations of prior caching methods: uniform caching intervals and error accumulation from naive feature reuse. It uses a constraint-aware caching pattern search module to generate non-uniform activation schedules and a selective computation module that recomputes only deep blocks and high-importance tokens to limit error accumulation. Experiments on PixArt-alpha and DiT show up to 1.96x and 2.90x acceleration with negligible quality degradation, outperforming prior caching-based methods.

ProCache 是一个无需训练的动态特征缓存框架，旨在通过解决均匀缓存间隔和特征重用的简单方法带来的问题来加速扩散变换器（DiTs）。它使用一个约束感知的缓存模式搜索模块生成非均匀的激活时间表，并使用一个选择性计算模块来最小化错误累积，同时减少开销。实验表明，ProCache 可以实现最高1.96倍和2.90倍的加速，且质量几乎没有下降，显著优于之前的缓存方法。

CoT-PL: Visual Chain-of-Thought Reasoning Meets Pseudo-Labeling for Open-Vocabulary Object Detection

Authors: Hojun Choi, Youngsun Lim, Jaeyo Shin, Hyunjung Shim

First: 2025-10-16T15:27:10+00:00 · Latest: 2025-12-31T05:45:29+00:00

Comments: 28 pages, 13 Figures, 12 Tables

Abstract

Open-vocabulary object detection (OVD) seeks to recognize and localize object categories beyond those seen during training. Recent approaches typically leverage vision-language models (VLMs) to generate pseudo-labels using image-text alignment, allowing detectors to generalize to unseen classes without explicit supervision. However, these methods depend heavily on direct image-text matching, neglecting the intermediate reasoning steps essential for interpreting semantically complex scenes. This results in limited robustness when confronted with crowded or occluded visual contexts. In this paper, we introduce CoT-PL, a new framework that integrates structured visual chain-of-thought (CoT) reasoning into the pseudo-labeling process. CoT-PL decomposes object understanding into three interpretable steps: (1) region perception even for unseen objects, (2) category recognition via zero-shot reasoning, and (3) background grounding to separate semantically complex objects. Crucially, the third step naturally motivates our contrastive background learning (CBL) that uses the pre-computed background cues as negatives to promote feature disentanglement between objects and background. In this way, CoT reasoning and CBL form an integrated pipeline tailored to robust pseudo-labeling in crowded or occluded scenes. Notably, in these two settings, our novel-class pseudo-label quality achieves relative improvements of 103.4% and 168.4% over the best prior, respectively. Our extensive experiments demonstrate that CoT-PL achieves +7.7 AP50 on open-vocabulary COCO and +2.9 mask AP on LVIS for novel classes, setting a new state of the art. Code and models are available at https://github.com/hchoi256/cotpl.

中文标题/摘要

标题：CoT-PL：视觉链式思考推理与伪标签结合在开放词汇对象检测中的应用

开放词汇对象检测（OVD）旨在识别和定位训练期间未见过的对象类别。近期方法通常利用视觉语言模型（VLMs）通过图像-文本对齐生成伪标签，使检测器能够在没有显式监督的情况下泛化到未见过的类别。然而，这些方法高度依赖直接的图像-文本匹配，忽视了解释语义复杂场景所需的中间推理步骤。这导致在拥挤或遮挡的视觉上下文中表现有限。本文提出了一种新的框架CoT-PL，该框架将结构化的视觉链式思考（CoT）推理融入伪标签生成过程。CoT-PL将对象理解分解为三个可解释的步骤：（1）即使对于未见过的对象也能感知区域，（2）通过零样本推理进行类别识别，（3）背景定位以分离语义复杂的对象。最关键的是，第三步自然地促使我们使用预先计算的背景线索作为负样本，以促进对象与背景的特征解耦。这样，CoT推理和对比背景学习（CBL）形成了一条针对拥挤或遮挡场景的集成流水线，以实现稳健的伪标签生成。值得注意的是，在这两种情况下，我们对新类伪标签的质量分别比最佳先前方法提高了103.4%和168.4%。我们的大量实验表明，CoT-PL在开放词汇COCO数据集上实现了+7.7 AP50，在LVIS数据集上实现了+2.9掩码AP，创下了新的最佳水平。代码和模型可在https://github.com/hchoi256/cotpl获取。

Summary / 总结

The paper introduces CoT-PL, a framework that integrates visual chain-of-thought reasoning into the pseudo-labeling process for open-vocabulary object detection. It decomposes object understanding into region perception, zero-shot category recognition, and background grounding. The background grounding step motivates contrastive background learning, which uses pre-computed background cues as negatives to improve feature disentanglement. In crowded and occluded scenes, novel-class pseudo-label quality improves by 103.4% and 168.4% (relative) over the best prior method, respectively. CoT-PL also achieves +7.7 AP50 on open-vocabulary COCO and +2.9 mask AP on LVIS for novel classes, setting a new state of the art. Code and models are available online.

研究旨在通过解决现有方法依赖直接图像-文本匹配的局限性，提高开放词汇对象检测的性能。提出的CoT-PL框架引入了结构化的视觉链式推理和对比背景学习，以生成更鲁棒的伪标签，特别是在拥挤或遮挡的场景中。关键实验结果表明，CoT-PL 显著提高了新类别伪标签的质量，在这些设置中分别实现了103.4%和168.4%的相对改进。在开放词汇COCO和LVIS上，CoT-PL 在新类别上分别取得了+7.7 AP50和+2.9 mask AP的提升，达到新的最佳性能。

Improving Few-Shot Change Detection Visual Question Answering via Decision-Ambiguity-guided Reinforcement Fine-Tuning

Authors: Fuyu Dong, Ke Li, Di Wang, Nan Luo, Yiming Zhang, Kaiyu Li, Jianfei Yang, Quan Wang

First: 2025-12-31T03:28:17+00:00 · Latest: 2025-12-31T03:28:17+00:00

Abstract

Change detection visual question answering (CDVQA) requires answering text queries by reasoning about semantic changes in bi-temporal remote sensing images. A straightforward approach is to boost CDVQA performance with generic vision-language models via supervised fine-tuning (SFT). Despite recent progress, we observe that a significant portion of failures do not stem from clearly incorrect predictions, but from decision ambiguity, where the model assigns similar confidence to the correct answer and strong distractors. To formalize this challenge, we define Decision-Ambiguous Samples (DAS) as instances with a small probability margin between the ground-truth answer and the most competitive alternative. We argue that explicitly optimizing DAS is crucial for improving the discriminability and robustness of CDVQA models. To this end, we propose DARFT, a Decision-Ambiguity-guided Reinforcement Fine-Tuning framework that first mines DAS using an SFT-trained reference policy and then applies group-relative policy optimization on the mined subset. By leveraging multi-sample decoding and intra-group relative advantages, DARFT suppresses strong distractors and sharpens decision boundaries without additional supervision. Extensive experiments demonstrate consistent gains over SFT baselines, particularly under few-shot settings.
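The two ingredients named above, mining samples by probability margin and computing group-relative advantages, can be sketched as follows. This is an illustration under assumptions rather than the DARFT implementation: the threshold `tau`, the use of the absolute margin, and the dictionary layout of a sample are all hypothetical, and the advantage is the usual mean/standard-deviation normalization within a group of sampled answers.

```python
import statistics
from typing import Dict, List


def probability_margin(probs: Dict[str, float], answer: str) -> float:
    """Ground-truth probability minus the strongest competing answer's probability."""
    best_other = max(p for key, p in probs.items() if key != answer)
    return probs[answer] - best_other


def mine_das(samples: List[dict], tau: float = 0.1) -> List[dict]:
    """Keep samples whose margin is small in magnitude (hypothetical criterion)."""
    return [s for s in samples
            if abs(probability_margin(s["probs"], s["answer"])) < tau]


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each sampled answer's reward against its own group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


samples = [
    {"id": "ambiguous", "answer": "yes", "probs": {"yes": 0.52, "no": 0.48}},
    {"id": "clear", "answer": "yes", "probs": {"yes": 0.90, "no": 0.10}},
]
mined = mine_das(samples)
```

Within a group, answers rewarded above the group mean receive positive advantages and strong distractors receive negative ones, which is how the margin gets widened without extra labels.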

中文标题/摘要

标题：通过决策模糊性引导的强化微调改进少样本变化检测视觉问答

变化检测视觉问答（CDVQA）需要通过推理双时相遥感图像中的语义变化来回答文本查询。一种直接的方法是通过监督微调（SFT）利用通用视觉语言模型提升CDVQA性能。尽管取得了近期进展，我们观察到，大量失败并非源自明显错误的预测，而是决策模糊性，其中模型对正确答案和强干扰项赋予了相似的置信度。为了正式化这一挑战，我们将决策模糊样本（DAS）定义为真实答案与最具竞争力的替代答案之间概率差距较小的实例。我们认为，明确优化DAS对于提高CDVQA模型的可区分性和鲁棒性至关重要。为此，我们提出了DARFT框架，该框架首先使用SFT训练的参考策略挖掘DAS，然后在挖掘的子集上应用组相对策略优化。通过利用多样本解码和组内相对优势，DARFT抑制了强干扰项并细化了决策边界，而无需额外监督。广泛的实验表明，DARFT相对于SFT基线具有一致的改进，在少样本设置下尤为明显。

Summary / 总结

The paper addresses the challenge of decision ambiguity in change detection visual question answering (CDVQA) by defining Decision-Ambiguous Samples (DAS) and proposing DARFT, a Decision-Ambiguity-guided Reinforcement Fine-Tuning framework. This framework mines DAS using an SFT-trained reference policy and applies group-relative policy optimization to improve model discriminability and robustness. Experiments show consistent improvements over supervised fine-tuning baselines, especially in few-shot settings.

研究旨在通过解决决策模糊问题来改进少量样本的变更检测视觉问答（CDVQA），即模型在正确答案和强干扰选项之间难以抉择。方法是决策模糊引导强化微调（DARFT），首先通过监督微调训练的参考策略识别决策模糊样本（DAS），然后使用组相对策略优化来优化这些样本。实验结果显示，在少量样本设置中，DARFT 比监督微调基线有持续的改进。

Understanding and Steering the Cognitive Behaviors of Reasoning Models at Test-Time

Authors: Zhenyu Zhang, Xiaoxia Wu, Zhongzhu Zhou, Qingyang Wu, Yineng Zhang, Pragaash Ponnusamy, Harikaran Subbaraj, Jue Wang, Shuaiwen Leon Song, Ben Athiwaratkun

First: 2025-12-31T02:46:04+00:00 · Latest: 2025-12-31T02:46:04+00:00

Abstract

Large Language Models (LLMs) often rely on long chain-of-thought (CoT) reasoning to solve complex tasks. While effective, these trajectories are frequently inefficient, leading to high latency from excessive token generation, or unstable reasoning that alternates between underthinking (shallow, inconsistent steps) and overthinking (repetitive, verbose reasoning). In this work, we study the structure of reasoning trajectories and uncover specialized attention heads that correlate with distinct cognitive behaviors such as verification and backtracking. By lightly intervening on these heads at inference time, we can steer the model away from inefficient modes. Building on this insight, we propose CREST, a training-free method for Cognitive REasoning Steering at Test-time. CREST has two components: (1) an offline calibration step that identifies cognitive heads and derives head-specific steering vectors, and (2) an inference-time procedure that rotates hidden representations to suppress components along those vectors. CREST adaptively suppresses unproductive reasoning behaviors, yielding both higher accuracy and lower computational cost. Across diverse reasoning benchmarks and models, CREST improves accuracy by up to 17.5% while reducing token usage by 37.6%, offering a simple and effective pathway to faster, more reliable LLM reasoning.
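One way to picture the inference-time step is as a projection followed by a rescale. The sketch below is only one possible reading of "rotates hidden representations to suppress components along those vectors": it removes part of the component along a steering vector and then restores the original norm, so the vector changes direction but not length. The function name and the `strength` parameter are hypothetical.

```python
import math
from typing import List


def suppress_along(hidden: List[float],
                   steer: List[float],
                   strength: float = 1.0) -> List[float]:
    """Shrink the component of `hidden` along `steer`, keeping the vector's norm."""
    steer_norm = math.sqrt(sum(v * v for v in steer))
    unit = [v / steer_norm for v in steer]
    coeff = sum(h * u for h, u in zip(hidden, unit))  # projection coefficient
    reduced = [h - strength * coeff * u for h, u in zip(hidden, unit)]
    hidden_norm = math.sqrt(sum(h * h for h in hidden))
    reduced_norm = math.sqrt(sum(r * r for r in reduced))
    if reduced_norm == 0.0:
        return reduced  # hidden was parallel to steer; nothing left to rescale
    return [r * hidden_norm / reduced_norm for r in reduced]


steered = suppress_along([3.0, 4.0], [1.0, 0.0])
```

With `strength=1.0` the steering direction is removed entirely; smaller values would only dampen it, which is closer to a light intervention.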

中文标题/摘要

标题：理解与引导测试时推理模型的认知行为

大型语言模型（LLMs）通常依赖长链推理（CoT）来解决复杂任务。虽然有效，但这些路径往往效率低下，导致因过度生成标记而产生高延迟，或者产生不稳定推理，交替出现浅层、不一致的推理和重复、冗长的推理。在本工作中，我们研究了推理路径的结构，并发现与验证和回溯等不同认知行为相关的专门注意头。通过在推理时轻柔地干预这些头，我们可以引导模型远离低效模式。基于这一见解，我们提出了CREST，一种无需训练的方法，用于测试时的认知推理引导。CREST有两个组成部分：（1）离线校准步骤，识别认知头并推导出特定于头的引导向量，（2）推理时的程序，旋转隐藏表示以抑制沿这些向量的分量。CREST自适应地抑制无生产力的推理行为，从而提高准确性和降低计算成本。在各种推理基准和模型中，CREST将准确率提高高达17.5%，同时减少标记使用量37.6%，提供了一条简单而有效的快速、可靠LLM推理途径。

Summary / 总结

This work addresses the inefficiencies in long chain-of-thought reasoning used by large language models, which can lead to high latency and unstable reasoning. By analyzing the structure of reasoning trajectories, the authors identify specialized attention heads associated with cognitive behaviors like verification and backtracking. They propose CREST, a training-free method for steering these behaviors at test-time, which involves an offline calibration step and an inference-time procedure to suppress unproductive reasoning. Experiments show that CREST improves accuracy by up to 17.5% and reduces token usage by 37.6% across various benchmarks and models, making reasoning more efficient and reliable.

这项工作研究了大型语言模型在使用长链推理时的低效性，并提出了一种无需训练的方法CREST来引导模型进行更高效的推理。通过识别与验证和回溯等认知行为相关的特殊注意力头，在推理时抑制不必要的推理，CREST提高了准确率并降低了计算成本。在各种基准测试中，CREST将准确率提高了最多17.5%，并减少了37.6%的令牌使用量。

PhyGDPO: Physics-Aware Groupwise Direct Preference Optimization for Physically Consistent Text-to-Video Generation

Authors: Yuanhao Cai, Kunpeng Li, Menglin Jia, Jialiang Wang, Junzhe Sun, Feng Liang, Weifeng Chen, Felix Juefei-Xu, Chu Wang, Ali Thabet, Xiaoliang Dai, Xuan Ju, Alan Yuille, Ji Hou

First: 2025-12-31T01:19:14+00:00 · Latest: 2025-12-31T01:19:14+00:00

Abstract

Recent advances in text-to-video (T2V) generation have achieved good visual quality, yet synthesizing videos that faithfully follow physical laws remains an open challenge. Existing methods mainly based on graphics or prompt extension struggle to generalize beyond simple simulated environments or learn implicit physical reasoning. The scarcity of training data with rich physics interactions and phenomena is also a problem. In this paper, we first introduce a Physics-Augmented video data construction Pipeline, PhyAugPipe, that leverages a vision-language model (VLM) with chain-of-thought reasoning to collect a large-scale training dataset, PhyVidGen-135K. Then we formulate a principled Physics-aware Groupwise Direct Preference Optimization, PhyGDPO, framework that builds upon the groupwise Plackett-Luce probabilistic model to capture holistic preferences beyond pairwise comparisons. In PhyGDPO, we design a Physics-Guided Rewarding (PGR) scheme that embeds VLM-based physics rewards to steer optimization toward physical consistency. We also propose a LoRA-Switch Reference (LoRA-SR) scheme that eliminates memory-heavy reference duplication for efficient training. Experiments show that our method significantly outperforms state-of-the-art open-source methods on PhyGenBench and VideoPhy2. Please check our project page at https://caiyuanhao1998.github.io/project/PhyGDPO for more video results. Our code, models, and data will be released at https://github.com/caiyuanhao1998/Open-PhyGDPO
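For readers unfamiliar with the Plackett-Luce model mentioned above, the following sketch computes the log-probability of a full ranking from per-item scores. How PhyGDPO derives the scores (for example from policy and reference log-probabilities or physics rewards) is not specified in the abstract, so the scores here are plain numbers and the function name is hypothetical.

```python
import math
from typing import List


def plackett_luce_log_likelihood(scores_in_rank_order: List[float]) -> float:
    """log P(ranking), with scores listed from most to least preferred.

    At each position, the chosen item competes against all items not yet placed.
    """
    total = 0.0
    for i in range(len(scores_in_rank_order)):
        remaining = scores_in_rank_order[i:]
        m = max(remaining)  # log-sum-exp with max subtraction for stability
        log_norm = m + math.log(sum(math.exp(s - m) for s in remaining))
        total += scores_in_rank_order[i] - log_norm
    return total
```

With two items this reduces to the pairwise Bradley-Terry form used in standard DPO, which is why a groupwise objective can be seen as a generalization of pairwise comparison.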

中文标题/摘要

标题：PhyGDPO：物理感知的组间直接偏好优化方法以实现物理一致的文本到视频生成

近年来，文本到视频(T2V)生成取得了良好的视觉效果，但合成严格遵循物理定律的视频仍然是一个开放的挑战。现有方法主要基于图形或提示扩展，难以泛化到简单模拟环境之外，也难以学习隐式的物理推理。训练数据中缺乏丰富的物理交互和现象也是一个问题。在本文中，我们首先引入了一个物理增强的视频数据构建流水线PhyAugPipe，利用具有链式推理的视觉语言模型(VLM)收集大规模训练数据集PhyVidGen-135K。然后，我们提出了一个基于组间Plackett-Luce概率模型的物理感知的组间直接偏好优化PhyGDPO框架，以捕捉超越成对比较的整体偏好。在PhyGDPO中，我们设计了一种物理引导奖励(PGR)方案，将基于VLM的物理奖励嵌入以引导优化向物理一致性发展。我们还提出了一种LoRA-Switch参考(LoRA-SR)方案，以消除内存密集型的参考重复，实现高效的训练。实验表明，我们的方法在PhyGenBench和VideoPhy2上显著优于最先进的开源方法。请访问我们的项目页面https://caiyuanhao1998.github.io/project/PhyGDPO查看更多视频结果。我们的代码、模型和数据将在https://github.com/caiyuanhao1998/Open-PhyGDPO发布

Summary / 总结

This paper addresses the challenge of generating physically consistent videos from text descriptions. It introduces PhyAugPipe, a pipeline that uses a vision-language model for collecting a large dataset with rich physics interactions. The proposed PhyGDPO framework optimizes video generation by incorporating physics-aware rewards and a reference scheme that reduces memory usage. Experiments demonstrate that PhyGDPO outperforms existing methods on PhyGenBench and VideoPhy2 benchmarks, achieving better physical consistency in generated videos.

本文解决了从文本描述生成物理一致视频的挑战。它引入了PhyAugPipe，一种使用视觉-语言模型的数据增强管道，以及PhyGDPO框架，该框架基于群体偏好和物理导向奖励优化视频生成。该方法在PhyGenBench和VideoPhy2基准测试上显著提高了物理一致性。

Training-Free Color-Aware Adversarial Diffusion Sanitization for Diffusion Stegomalware Defense at Security Gateways

Authors: Vladimir Frants, Sos Agaian

First: 2025-12-30T22:53:33+00:00 · Latest: 2025-12-30T22:53:33+00:00

Abstract

The rapid expansion of generative AI has normalized large-scale synthetic media creation, enabling new forms of covert communication. Recent generative steganography methods, particularly those based on diffusion models, can embed high-capacity payloads without fine-tuning or auxiliary decoders, creating significant challenges for detection and remediation. Coverless diffusion-based techniques are difficult to counter because they generate image carriers directly from secret data, enabling attackers to deliver stegomalware for command-and-control, payload staging, and data exfiltration while bypassing detectors that rely on cover-stego discrepancies. This work introduces Adversarial Diffusion Sanitization (ADS), a training-free defense for security gateways that neutralizes hidden payloads rather than detecting them. ADS employs an off-the-shelf pretrained denoiser as a differentiable proxy for diffusion-based decoders and incorporates a color-aware, quaternion-coupled update rule to reduce artifacts under strict distortion limits. Under a practical threat model and in evaluation against the state-of-the-art diffusion steganography method Pulsar, ADS drives decoder success rates to near zero with minimal perceptual impact. Results demonstrate that ADS provides a favorable security-utility trade-off compared to standard content transformations, offering an effective mitigation strategy against diffusion-driven steganography.

中文标题/摘要

标题：无需训练的色彩感知对抗扩散净化：安全网关中的扩散隐秘软件防御

生成式AI的迅速发展使大规模合成媒体的创建变得普遍，开启了新的隐蔽通信形式。最近基于扩散模型的生成式隐写术方法可以在不进行微调或辅助解码器的情况下嵌入高容量的载荷，给检测和修复带来了巨大挑战。无掩蔽的基于扩散的技术难以对抗，因为它们直接从秘密数据生成图像载体，使攻击者能够通过绕过依赖于掩蔽-隐写术差异的检测器来交付隐秘软件，用于命令与控制、载荷部署和数据泄露。本文引入了对抗扩散净化（ADS），这是一种无需训练的安全网关防御技术，旨在中和隐藏的载荷而非检测它们。ADS 使用现成的预训练去噪器作为可微代理扩散解码器，并结合色彩感知的四元数耦合更新规则，在严格失真限制下减少伪影。在实际威胁模型下，与最先进的扩散隐写术方法Pulsar相比，ADS 在最小感知影响下将解码成功率驱动至接近零。结果表明，ADS 提供了与标准内容转换相比更有利的安全-效用权衡，提供了一种有效的对抗扩散驱动隐写术的缓解策略。

Summary / 总结

This work addresses the challenge of detecting and mitigating diffusion-based steganography methods that embed high-capacity payloads covertly in images. It introduces Adversarial Diffusion Sanitization (ADS), a training-free defense mechanism that uses an off-the-shelf pretrained denoiser as a differentiable proxy for diffusion-based decoders and incorporates a color-aware, quaternion-coupled update rule. The method successfully drives decoder success rates to near zero while minimizing perceptual impact, demonstrating a favorable security-utility trade-off compared to standard content transformations.

该研究针对检测和抵御不需微调或辅助解码器即可在图像中嵌入高容量载荷的扩散模型隐写术方法的挑战。引入了一种名为对抗扩散净化（ADS）的无训练防御机制，使用现成的预训练去噪器作为扩散解码器的可微代理。ADS 包含一种颜色感知的四元数耦合更新规则，以减少失真下的伪影。评估结果显示，ADS 可以显著将解码器的成功率降低到接近零，同时保持最小的感知影响，相比标准内容变换提供了更有利的安全-实用性权衡，有效对抗了基于扩散的隐写术方法。

Foundation models on the bridge: Semantic hazard detection and safety maneuvers for maritime autonomy with vision-language models

Authors: Kim Alexander Christensen, Andreas Gudahl Tufte, Alexey Gusev, Rohan Sinha, Milan Ganai, Ole Andreas Alsos, Marco Pavone, Martin Steinert

First: 2025-12-30T21:20:41+00:00 · Latest: 2025-12-30T21:20:41+00:00

Comments: 17 pages without bibliography or appendix. The main paper has 16 figures

Abstract

The draft IMO MASS Code requires autonomous and remotely supervised maritime vessels to detect departures from their operational design domain, enter a predefined fallback that notifies the operator, permit immediate human override, and avoid changing the voyage plan without approval. Meeting these obligations in the alert-to-takeover gap calls for a short-horizon, human-overridable fallback maneuver. Classical maritime autonomy stacks struggle when the correct action depends on meaning (e.g., diver-down flag means people in the water, fire close by means hazard). We argue (i) that vision-language models (VLMs) provide semantic awareness for such out-of-distribution situations, and (ii) that a fast-slow anomaly pipeline with a short-horizon, human-overridable fallback maneuver makes this practical in the handover window. We introduce Semantic Lookout, a camera-only, candidate-constrained vision-language model (VLM) fallback maneuver selector that selects one cautious action (or station-keeping) from water-valid, world-anchored trajectories under continuous human authority. On 40 harbor scenes we measure per-call scene understanding and latency, alignment with human consensus (model majority-of-three voting), short-horizon risk-relief on fire hazard scenes, and an on-water alert->fallback maneuver->operator handover. Sub-10 s models retain most of the awareness of slower state-of-the-art models. The fallback maneuver selector outperforms geometry-only baselines and increases standoff distance on fire scenes. A field run verifies end-to-end operation. These results support VLMs as semantic fallback maneuver selectors compatible with the draft IMO MASS Code, within practical latency budgets, and motivate future work on domain-adapted, hybrid autonomy that pairs foundation-model semantics with multi-sensor bird's-eye-view perception and short-horizon replanning.
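The "majority-of-three voting" mentioned above can be illustrated with a few lines of code. This is a sketch under assumptions: the action labels are invented, and falling back to station-keeping when the three calls all disagree is a conservative choice made for this example, not a rule stated in the abstract.

```python
from collections import Counter
from typing import List


def majority_of_three(votes: List[str],
                      fallback: str = "station_keeping") -> str:
    """Return the action chosen by at least two of three model calls."""
    if len(votes) != 3:
        raise ValueError("expected exactly three votes")
    action, count = Counter(votes).most_common(1)[0]
    # Hypothetical tie rule: with no majority, hold position.
    return action if count >= 2 else fallback
```

Voting over a fixed set of candidate trajectories keeps the model's output space small, which matches the candidate-constrained design described in the abstract.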

中文标题/摘要

标题：桥梁上的基础模型：基于视觉语言模型的海上自主航行语义风险检测与安全机动

国际海事组织（IMO）的MASS代码草案要求自主和远程监督的海上船舶在偏离其操作设计域时能够检测到，并进入预定义的后备程序通知操作员，允许立即的人工干预，并在未经批准的情况下不得更改航程计划。在警报到接管的间隙满足这些义务需要一个短期、可人工干预的后备机动程序。传统的海上自主系统在需要理解意义的情况下（例如，潜水员标志意味着水中有人员，火意味着危险）难以应对。我们认为（i）视觉语言模型（VLMs）为这些分布外情况提供了语义意识，（ii）快速-慢速异常检测管道与短期、可人工干预的后备机动程序使这一操作在交接窗口内成为可能。我们引入了语义瞭望（Semantic Lookout），这是一种仅使用摄像头、候选受限的视觉语言模型（VLM）后备机动程序选择器，它在持续的人类授权下从水有效、世界锚定的轨迹中选择一个谨慎的动作（或保持位置）。在40个港口场景中，我们测量了每次呼叫的场景理解和延迟，与人类共识的对齐（模型三票多数投票），火灾危险场景下的短期风险缓解，以及水上警报->后备机动->操作员交接。亚10秒模型保留了大多数先进模型的大部分意识。后备机动程序选择器优于仅几何模型基准，并在火灾场景中增加了安全距离。现场运行验证了端到端操作。这些结果支持VLMs作为与IMO MASS代码草案兼容的语义后备机动程序选择器，符合实际的延迟预算，并激励未来工作，即领域适应的混合自主，将基础模型语义与多传感器鸟瞰视图感知和短期重规划相结合。

DermaVQA-DAS: Dermatology Assessment Schema (DAS) & Datasets for Closed-Ended Question Answering & Segmentation in Patient-Generated Dermatology Images

Authors: Wen-wai Yim, Yujuan Fu, Asma Ben Abacha, Meliha Yetisgen, Noel Codella, Roberto Andres Novoa, Josep Malvehy

First: 2025-12-30T16:48:20+00:00 · Latest: 2025-12-30T16:48:20+00:00

Abstract

Recent advances in dermatological image analysis have been driven by large-scale annotated datasets; however, most existing benchmarks focus on dermatoscopic images and lack patient-authored queries and clinical context, limiting their applicability to patient-centered care. To address this gap, we introduce DermaVQA-DAS, an extension of the DermaVQA dataset that supports two complementary tasks: closed-ended question answering (QA) and dermatological lesion segmentation. Central to this work is the Dermatology Assessment Schema (DAS), a novel expert-developed framework that systematically captures clinically meaningful dermatological features in a structured and standardized form. DAS comprises 36 high-level and 27 fine-grained assessment questions, with multiple-choice options in English and Chinese. Leveraging DAS, we provide expert-annotated datasets for both closed QA and segmentation and benchmark state-of-the-art multimodal models. For segmentation, we evaluate multiple prompting strategies and show that prompt design impacts performance: the default prompt achieves the best results under Mean-of-Max and Mean-of-Mean evaluation aggregation schemes, while an augmented prompt incorporating both patient query title and content yields the highest performance under majority-vote-based microscore evaluation, achieving a Jaccard index of 0.395 and a Dice score of 0.566 with BiomedParse. For closed-ended QA, overall performance is strong across models, with average accuracies ranging from 0.729 to 0.798; o3 achieves the best overall accuracy (0.798), closely followed by GPT-4.1 (0.796), while Gemini-1.5-Pro shows competitive performance within the Gemini family (0.783). We publicly release DermaVQA-DAS, the DAS schema, and evaluation protocols to support and accelerate future research in patient-centered dermatological vision-language modeling (https://osf.io/72rp3).
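As a reminder of how the two reported segmentation metrics relate, the sketch below computes the Jaccard index and Dice score for binary masks given as flat 0/1 lists (a simplification; the dataset's aggregation schemes are not reproduced here). On the same overlap counts the two are linked by Dice = 2J / (1 + J), and the reported pair (0.395, 0.566) is consistent with that identity.

```python
from typing import List, Tuple


def overlap_counts(pred: List[int], target: List[int]) -> Tuple[int, int, int]:
    """True positives, false positives, false negatives for binary masks."""
    tp = sum(1 for p, t in zip(pred, target) if p and t)
    fp = sum(1 for p, t in zip(pred, target) if p and not t)
    fn = sum(1 for p, t in zip(pred, target) if not p and t)
    return tp, fp, fn


def jaccard(tp: int, fp: int, fn: int) -> float:
    denom = tp + fp + fn
    return tp / denom if denom else 1.0  # two empty masks count as a perfect match


def dice(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


counts = overlap_counts([1, 1, 0, 0], [1, 0, 1, 0])
j, d = jaccard(*counts), dice(*counts)
```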

中文标题/摘要

标题：DermaVQA-DAS：皮肤病评估方案（DAS）及数据集，用于患者生成的皮肤病图像的封闭式问题回答与分割

皮肤病图像分析的最新进展得益于大规模标注数据集；然而，现有大多数基准主要集中在皮肤镜图像上，缺乏患者自述查询和临床背景，限制了其在以患者为中心的护理中的应用。为解决这一问题，我们引入了DermaVQA-DAS，这是DermaVQA数据集的扩展，支持两种互补任务：封闭式问题回答（QA）和皮肤病病变分割。该工作的核心是皮肤病评估方案（DAS），这是一种新型专家开发的框架，系统地以结构化和标准化的形式捕捉临床有意义的皮肤病特征。DAS 包含36个高层次和27个细粒度的评估问题，其中包含英文和中文的多项选择题。利用DAS，我们提供了专家标注的数据集，用于封闭式QA和分割，并对最先进的多模态模型进行了基准测试。对于分割，我们评估了多种提示策略，并展示了提示设计对性能的影响：默认提示在Mean-of-Max和Mean-of-Mean评估聚合方案下表现最佳，而结合患者查询标题和内容的增强提示在基于多数投票的微评分评估下表现最佳，使用BiomedParse时Jaccard指数为0.395，Dice得分为0.566。对于封闭式QA，模型的整体性能很强，平均准确率从0.729到0.798不等；o3获得最佳整体准确率（0.798），紧随其后的是GPT-4.1（0.796），而Gemini-1.5-Pro在Gemini家族中表现出竞争力（0.783）。我们公开发布了DermaVQA-DAS、DAS方案和评估协议，以支持和加速未来在患者中心的皮肤病视觉语言建模研究（https://osf.io/72rp3）。

Summary / 总结

The research aims to address the gap in dermatological image analysis by focusing on patient-generated images and their associated queries, which are often lacking in existing benchmarks. The study introduces DermaVQA-DAS, which includes a Dermatology Assessment Schema (DAS) for structured clinical feature capture and expert-annotated datasets for closed-ended question answering and segmentation. Key findings show that for segmentation, the default prompt performs best under the Mean-of-Max and Mean-of-Mean aggregation schemes, while an augmented prompt incorporating the patient query title and content performs best under majority-vote-based microscore evaluation. For closed-ended QA, models such as o3 and GPT-4.1 achieve high accuracies, with o3 showing the best overall performance (0.798). The datasets and evaluation protocols are publicly available to support future research.

研究旨在通过关注患者生成的图像及其相关查询来弥补现有基准数据的不足，这些查询在现有数据集中往往缺失。研究引入了DermaVQA-DAS，其中包括一个皮肤病评估框架（DAS）用于结构化的临床特征捕捉和专家标注的数据集，用于封闭式问题回答和分割。在分割方面，不同的提示策略被评估，结果显示默认提示在某些评估方案下表现最佳，而结合患者查询的增强提示在多数投票基础上的微评分中表现最佳。在封闭式问题回答方面，模型如o3和GPT-4.1表现出较强性能，o3取得了最高的准确率。数据集和评估协议已公开发布，以支持未来的研究。

Spatial-aware Vision Language Model for Autonomous Driving

Authors: Weijie Wei, Zhipeng Luo, Ling Feng, Venice Erin Liong

First: 2025-12-30T16:35:00+00:00 · Latest: 2025-12-30T16:35:00+00:00

Abstract

While Vision-Language Models (VLMs) show significant promise for end-to-end autonomous driving by leveraging the common sense embedded in language models, their reliance on 2D image cues for complex scene understanding and decision-making presents a critical bottleneck for safety and reliability. Current image-based methods struggle with accurate metric spatial reasoning and geometric inference, leading to unreliable driving policies. To bridge this gap, we propose LVLDrive (LiDAR-Vision-Language), a novel framework specifically designed to upgrade existing VLMs with robust 3D metric spatial understanding for autonomous driving by incorporating LiDAR point clouds as an extra input modality. A key challenge lies in mitigating the catastrophic disturbance introduced by disparate 3D data to the pre-trained VLMs. To this end, we introduce a Gradual Fusion Q-Former that incrementally injects LiDAR features, ensuring the stability and preservation of the VLM's existing knowledge base. Furthermore, we develop a spatial-aware question-answering (SA-QA) dataset to explicitly teach the model advanced 3D perception and reasoning capabilities. Extensive experiments on driving benchmarks demonstrate that LVLDrive achieves superior performance compared to vision-only counterparts across scene understanding, metric spatial perception, and reliable driving decision-making. Our work highlights the necessity of explicit 3D metric data for building trustworthy VLM-based autonomous systems.

中文标题/摘要

标题：具有空间意识的视觉语言模型用于自动驾驶

视觉-语言模型（VLMs）通过利用语言模型中嵌入的常识，在端到端的自动驾驶中显示出巨大的潜力。然而，它们依赖于2D图像线索进行复杂场景理解和决策，这成为确保安全性和可靠性的关键瓶颈。当前基于图像的方法在准确的度量空间推理和几何推断方面存在困难，导致不可靠的驾驶策略。为了解决这一问题，我们提出了一种名为LVLDrive（LiDAR-视觉-语言）的新框架，该框架通过引入LiDAR点云作为额外输入模态，专门设计用于增强现有VLMs的稳健3D度量空间理解能力，以实现自动驾驶。关键挑战在于减轻3D数据引入的灾难性干扰对预训练VLMs的影响。为此，我们引入了一种渐进融合Q-Former，逐步注入LiDAR特征，确保VLMs的稳定性和知识库的保留。此外，我们开发了一个空间意识问答（SA-QA）数据集，以明确教授模型高级3D感知和推理能力。在驾驶基准上的广泛实验表明，LVLDrive在场景理解、度量空间感知和可靠的驾驶决策方面优于仅基于视觉的模型。我们的工作强调了构建可信赖的基于VLM的自动驾驶系统时明确的3D度量数据的必要性。

Summary / 总结

The research aims to enhance Vision-Language Models (VLMs) for autonomous driving by integrating LiDAR data to improve 3D spatial understanding. The method involves a Gradual Fusion Q-Former that incrementally incorporates LiDAR features into pre-trained VLMs to maintain their existing knowledge. Key experimental results show that LVLDrive outperforms vision-only models in scene understanding, metric spatial perception, and driving decision-making, emphasizing the importance of 3D data for reliable autonomous systems.

本文提出了一种名为LVLDrive的方法，通过集成LiDAR数据来增强Vision-Language Models (VLMs)的3D空间理解能力。该方法使用Gradual Fusion Q-Former逐步将LiDAR特征注入到预训练的VLMs中，确保稳定性。作者还引入了一个空间感知问答数据集来训练模型的3D感知能力。实验结果显示，LVLDrive在场景理解、度量空间感知和驾驶决策方面优于仅基于视觉的模型。这项工作强调了3D数据对于构建可靠的自主驾驶系统的必要性。
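
The abstract says the Gradual Fusion Q-Former "incrementally injects LiDAR features" but does not describe the mechanism. One common way to obtain that behaviour is a residual gate that starts at zero and ramps up during training, so the pre-trained VLM initially sees its visual tokens unchanged. The sketch below illustrates only that general idea; the linear schedule, the `warmup_steps` parameter, and both function names are assumptions for illustration, not the paper's design.

```python
def fusion_gate(step: int, warmup_steps: int) -> float:
    """Hypothetical linear schedule: 0.0 at step 0, 1.0 once warmup is done."""
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, max(0.0, step / warmup_steps))


def fuse_tokens(visual, lidar, gate):
    """Residual injection: visual + gate * lidar, element by element.

    `visual` and `lidar` are equally shaped lists of token vectors
    (lists of floats). With gate == 0.0 the visual tokens pass through
    unchanged, so the pre-trained model is not disturbed at the start.
    """
    return [
        [v + gate * l for v, l in zip(v_tok, l_tok)]
        for v_tok, l_tok in zip(visual, lidar)
    ]


visual_tokens = [[1.0, 2.0], [3.0, 4.0]]
lidar_tokens = [[10.0, 10.0], [20.0, 20.0]]

for step in (0, 500, 1000):
    g = fusion_gate(step, warmup_steps=1000)
    print(step, g, fuse_tokens(visual_tokens, lidar_tokens, g))
```

At step 0 the fused tokens equal the visual tokens; by the end of the assumed warmup the LiDAR features are added at full strength.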

SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning

Authors: Yong Xien Chng, Tao Hu, Wenwen Tong, Xueheng Li, Jiandong Chen, Haojia Yu, Jiefan Lu, Hewei Guo, Hanming Deng, Chengjun Xie, Gao Huang, Dahua Lin, Lewei Lu

First: 2025-12-30T16:31:45+00:00 · Latest: 2025-12-30T16:31:45+00:00

Abstract

While Vision-Language Models (VLMs) can solve complex tasks through agentic reasoning, their capabilities remain largely constrained to text-oriented chain-of-thought or isolated tool invocation. They fail to exhibit the human-like proficiency required to seamlessly interleave dynamic tool manipulation with continuous reasoning, particularly in knowledge-intensive and visually complex scenarios that demand coordinated external tools such as search and image cropping. In this work, we introduce SenseNova-MARS, a novel Multimodal Agentic Reasoning and Search framework that empowers VLMs with interleaved visual reasoning and tool-use capabilities via reinforcement learning (RL). Specifically, SenseNova-MARS dynamically integrates the image search, text search, and image crop tools to tackle fine-grained and knowledge-intensive visual understanding challenges. In the RL stage, we propose the Batch-Normalized Group Sequence Policy Optimization (BN-GSPO) algorithm to improve the training stability and advance the model's ability to invoke tools and reason effectively. To comprehensively evaluate the agentic VLMs on complex visual tasks, we introduce the HR-MMSearch benchmark, the first search-oriented benchmark composed of high-resolution images with knowledge-intensive and search-driven questions. Experiments demonstrate that SenseNova-MARS achieves state-of-the-art performance on open-source search and fine-grained image understanding benchmarks. Specifically, on search-oriented benchmarks, SenseNova-MARS-8B scores 67.84 on MMSearch and 41.64 on HR-MMSearch, surpassing proprietary models such as Gemini-3-Flash and GPT-5. SenseNova-MARS represents a promising step toward agentic VLMs by providing effective and robust tool-use capabilities. To facilitate further research in this field, we will release all code, models, and datasets.

中文标题/摘要

标题：SenseNova-MARS：通过强化学习赋能多模态代理推理与搜索

尽管视觉语言模型（VLMs）可以通过代理推理解决复杂任务，但它们的能力仍然主要局限于文本导向的推理链或孤立的工具调用。它们无法展现出人类所需的熟练度，以无缝地将动态工具操作与持续推理交织在一起，特别是在需要协调外部工具（如搜索和图像裁剪）的知识密集型和视觉复杂场景中。在本文中，我们提出了SenseNova-MARS，这是一种新颖的多模态代理推理与搜索框架，通过强化学习（RL）赋予VLMs交织的视觉推理和工具使用能力。具体而言，SenseNova-MARS动态整合了图像搜索、文本搜索和图像裁剪工具，以应对精细和知识密集型的视觉理解挑战。在RL阶段，我们提出了批标准化组序列策略优化（BN-GSPO）算法，以提高训练稳定性并增强模型调用工具和有效推理的能力。为了全面评估代理VLMs在复杂视觉任务上的表现，我们引入了HR-MMSearch基准，这是第一个由高分辨率图像和知识密集型及搜索驱动的问题组成的搜索导向基准。实验表明，SenseNova-MARS在开源搜索和精细图像理解基准上达到了最先进的性能。具体而言，在搜索导向基准上，SenseNova-MARS-8B在MMSearch上的得分为67.84，在HR-MMSearch上的得分为41.64，超过了诸如Gemini-3-Flash和GPT-5等专有模型。SenseNova-MARS代表了向代理VLMs迈出的有希望的一步，提供了有效的和稳健的工具使用能力。为了促进该领域的进一步研究，我们将发布所有代码、模型和数据集。

Summary / 总结

SenseNova-MARS is a framework that enhances Vision-Language Models (VLMs) with the ability to perform interleaved visual reasoning and tool-use through reinforcement learning. It integrates image search, text search, and image crop tools to handle fine-grained and knowledge-intensive visual understanding tasks. The BN-GSPO algorithm in the RL stage improves training stability and tool invocation. Experiments show that SenseNova-MARS outperforms existing models on search-oriented benchmarks, achieving 67.84 on MMSearch and 41.64 on HR-MMSearch, surpassing proprietary models like Gemini-3-Flash and GPT-5.

本文提出了SenseNova-MARS框架，通过强化学习增强视觉语言模型(VLMs)的视觉推理和工具使用能力，动态整合图像搜索、文本搜索和图像裁剪工具以应对复杂的视觉理解任务。BN-GSPO算法被提出以提高训练稳定性。SenseNova-MARS在开源搜索和细粒度图像理解基准测试中表现出色，显著提升了代理型VLMs的能力。同时引入了HR-MMSearch基准，用于评估VLMs在高分辨率图像和知识密集型问题上的表现，进一步验证了模型的有效性。

Bringing The Consistency Gap: Explicit Structured Memory for Interleaved Image-Text Generation

Authors: Zeteng Lin, Xingxing Li, Wen You, Xiaoyang Li, Zehan Lu, Yujun Cai, Jing Tang

First: 2025-10-13T03:19:45+00:00 · Latest: 2025-12-30T15:40:12+00:00

Abstract

Existing Vision Language Models (VLMs) often struggle to preserve logic, entity identity, and artistic style during extended, interleaved image-text interactions. We identify this limitation as "Multimodal Context Drift", which stems from the inherent tendency of implicit neural representations to decay or become entangled over long sequences. To bridge this gap, we propose IUT-Plug, a model-agnostic Neuro-Symbolic Structured State Tracking mechanism. Unlike purely neural approaches that rely on transient attention maps, IUT-Plug introduces the Image Understanding Tree (IUT) as an explicit, persistent memory module. The framework operates by (1) parsing visual scenes into hierarchical symbolic structures (entities, attributes, and relationships); (2) performing incremental state updates to logically lock invariant properties while modifying changing elements; and (3) guiding generation through topological constraints. We evaluate our approach on a novel benchmark comprising 3,000 human-annotated samples. Experimental results demonstrate that IUT-Plug effectively mitigates context drift, achieving significantly higher consistency scores compared to unstructured text-prompting baselines. This confirms that explicit symbolic grounding is essential for maintaining robust long-horizon consistency in multimodal generation.

中文标题/摘要

标题：弥补一致性差距：交错图像-文本生成的显式结构化记忆

现有的视觉语言模型（VLMs）在长时间的交错图像-文本交互中往往难以保持逻辑性、实体身份和艺术风格。我们将其局限性称为“多模态上下文漂移”，这源于隐式神经表示在长序列中固有的衰减或纠缠倾向。为了解决这一问题，我们提出了IUT-Plug，这是一种模型无关的神经-符号结构化状态跟踪机制。不同于依赖于瞬态注意力图的纯神经方法，IUT-Plug 引入了图像理解树（IUT）作为显式的持久性记忆模块。该框架通过以下步骤运作：(1) 将视觉场景解析为分层的符号结构（实体、属性和关系）；(2) 进行增量状态更新，逻辑锁定不变属性并修改变化元素；(3) 通过拓扑约束指导生成。我们在一个包含3,000个人工标注样本的新基准上评估了我们的方法。实验结果表明，IUT-Plug 有效地缓解了上下文漂移，与无结构的文本提示基线相比，实现了显著更高的一致性得分。这表明显式的符号定位对于保持多模态生成中的稳健长期一致性至关重要。

Summary / 总结

The paper addresses the issue of 'Multimodal Context Drift' in Vision Language Models (VLMs), where the models struggle to maintain consistency in extended image-text interactions. To tackle this, the authors propose IUT-Plug, a model-agnostic mechanism that uses an explicit, persistent Image Understanding Tree (IUT) to track structured state. The method involves parsing visual scenes into hierarchical symbolic structures, updating states incrementally, and guiding generation with topological constraints. Experiments on a new benchmark show that IUT-Plug improves consistency scores, indicating the importance of explicit symbolic grounding for long-term multimodal generation.

研究针对Vision Language Models (VLMs)在长时间图像-文本交互中出现的‘多模态上下文漂移’问题，提出了一种名为IUT-Plug的模型通用机制，通过引入显式的持久性Image Understanding Tree (IUT)来跟踪结构化的状态。方法包括解析视觉场景为符号结构、增量更新状态并用拓扑约束引导生成。实验在新基准上显示，IUT-Plug在一致性得分上显著优于无结构的文本提示基线，表明显式的符号接地对于保持多模态生成中的长期一致性至关重要。
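
The three steps in the abstract (parse into entities, attributes, and relationships; update state incrementally while locking invariants; guide generation with the stored structure) can be pictured with a small explicit state object. This is a minimal sketch under stated assumptions: the `Entity` and `SceneState` classes, the per-attribute `locked` set, and the text serialization in `to_prompt` are all hypothetical stand-ins, not the Image Understanding Tree as implemented by the authors.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """One node of the toy tree: an entity with attributes and a lock set."""
    name: str
    attributes: dict = field(default_factory=dict)
    locked: set = field(default_factory=set)  # attribute keys that must not change


@dataclass
class SceneState:
    """Toy stand-in for a persistent scene memory (layout is an assumption)."""
    entities: dict = field(default_factory=dict)   # name -> Entity
    relations: set = field(default_factory=set)    # (subject, predicate, object)

    def add_entity(self, name, attributes, locked=()):
        self.entities[name] = Entity(name, dict(attributes), set(locked))

    def update(self, name, changes):
        """Apply an incremental update; return the keys rejected by a lock."""
        entity = self.entities[name]
        rejected = []
        for key, value in changes.items():
            if key in entity.locked and entity.attributes.get(key) != value:
                rejected.append(key)
            else:
                entity.attributes[key] = value
        return rejected

    def to_prompt(self):
        """Serialize the state into text constraints for the next generation step."""
        lines = []
        for name in sorted(self.entities):
            attrs = self.entities[name].attributes
            desc = ", ".join(f"{k}={attrs[k]}" for k in sorted(attrs))
            lines.append(f"{name}: {desc}")
        for subj, pred, obj in sorted(self.relations):
            lines.append(f"{subj} {pred} {obj}")
        return "\n".join(lines)


state = SceneState()
state.add_entity("fox", {"fur": "red", "pose": "sitting"}, locked={"fur"})
state.add_entity("log", {"material": "wood"})
state.relations.add(("fox", "on", "log"))

# Next turn: the pose may change, the locked fur colour may not.
rejected = state.update("fox", {"pose": "jumping", "fur": "blue"})
print(rejected)
print(state.to_prompt())
```

Because the state lives outside the model, a property that is locked in the first turn is still available verbatim many turns later, which is the point the abstract makes against relying on transient attention alone.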

ARM: A Learnable, Plug-and-Play Module for CLIP-based Open-vocabulary Semantic Segmentation

Authors: Ziquan Liu, Zhewei Zhu, Xuyang Shi

First: 2025-12-30T13:38:30+00:00 · Latest: 2025-12-30T13:38:30+00:00

Comments: 10 pages, 4 figures

Abstract

Open-vocabulary semantic segmentation (OVSS) is fundamentally hampered by the coarse, image-level representations of CLIP, which lack precise pixel-level details. Existing training-free methods attempt to resolve this by either importing priors from costly external foundation models (e.g., SAM, DINO) or by applying static, hand-crafted heuristics to CLIP's internal features. These approaches are either computationally expensive or sub-optimal. We propose the Attention Refinement Module (ARM), a lightweight, learnable module that effectively unlocks and refines CLIP's internal potential. Unlike static-fusion methods, ARM learns to adaptively fuse hierarchical features. It employs a semantically-guided cross-attention block, using robust deep features (K, V) to select and refine detail-rich shallow features (Q), followed by a self-attention block. The key innovation lies in a "train once, use anywhere" paradigm. Trained once on a general-purpose dataset (e.g., COCO-Stuff), ARM acts as a universal plug-and-play post-processor for diverse training-free frameworks. Extensive experiments show that ARM consistently boosts baseline performance on multiple benchmarks with negligible inference overhead, establishing an efficient and effective paradigm for training-free OVSS.

中文标题/摘要

标题：ARM：一种可学习的即插即用模块用于基于CLIP的开放词汇语义分割

开放词汇语义分割（OVSS）从根本上受到CLIP粗略的图像级表示的限制，缺乏精确的像素级细节。现有的无需训练的方法试图通过从昂贵的外部基础模型（如SAM、DINO）导入先验知识，或对CLIP的内部特征应用静态的手工设计启发式规则来解决这一问题。这些方法要么计算成本高，要么效果不佳。我们提出了注意力精炼模块（ARM），这是一种轻量级、可学习的模块，有效地解锁并精炼了CLIP的内部潜力。与静态融合方法不同，ARM学习自适应地融合层次特征。它采用语义引导的交叉注意力块，使用鲁棒的深层特征（K, V）来选择和精炼细节丰富的浅层特征（Q），随后接一个自注意力块。关键创新在于“一次训练，随处使用”的范式。ARM在通用数据集（如COCO-Stuff）上训练一次后，作为通用的即插即用后处理器，适用于多种无需训练的框架。大量实验表明，ARM在多个基准测试上始终提升了基线性能，且几乎无推理开销，建立了高效且有效的无需训练的OVSS范式。

Summary / 总结

The research addresses the challenge of open-vocabulary semantic segmentation (OVSS) by proposing the Attention Refinement Module (ARM), which enhances CLIP's coarse image-level representations. ARM is a lightweight, learnable module that adaptively fuses hierarchical features through a semantically-guided cross-attention mechanism, improving pixel-level detail. Experiments demonstrate that ARM consistently improves baseline performance across multiple benchmarks with minimal computational cost, making it a versatile post-processor for various training-free frameworks.

研究针对开放词汇语义分割（OVSS）的挑战，提出了注意力精炼模块（ARM），以增强CLIP的粗略图像级表示。ARM是一个轻量级的学习模块，能够适应性地融合层次特征，通过语义引导的交叉注意力机制来精炼浅层特征。实验表明，ARM在多个基准测试上提高了基线性能，且计算成本较低，使其成为各种无训练框架的通用后处理器。
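
The data flow described in the abstract, a cross-attention block with shallow features as queries and deep features as keys and values, followed by a self-attention block, can be sketched in a few lines. This is an illustration under stated assumptions: the learned projection matrices, normalization, and residual connections of a real module are omitted (identity projections), and the function names and toy feature values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-width vectors."""
    scale = math.sqrt(len(keys[0]))
    output = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        output.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return output


def refine(shallow, deep):
    """Cross-attention (Q = shallow, K = V = deep), then self-attention.

    Learned projections are left out, so this shows data flow only.
    """
    crossed = attention(shallow, deep, deep)
    return attention(crossed, crossed, crossed)


shallow_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 detail-rich tokens
deep_feats = [[2.0, 0.0], [0.0, 2.0]]                 # 2 semantic tokens
refined = refine(shallow_feats, deep_feats)
print(len(refined), len(refined[0]))
```

The output keeps one vector per shallow token, so the spatial resolution of the shallow layer is preserved while each vector is rebuilt from the deep, semantically stronger features.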

RANGER: A Monocular Zero-Shot Semantic Navigation Framework through Contextual Adaptation

Authors: Ming-Ming Yu, Yi Chen, Börje F. Karlsson, Wenjun Wu

First: 2025-12-30T13:25:22+00:00 · Latest: 2025-12-30T13:25:22+00:00

Abstract

Efficiently finding targets in complex environments is fundamental to real-world embodied applications. While recent advances in multimodal foundation models have enabled zero-shot object goal navigation, allowing robots to search for arbitrary objects without fine-tuning, existing methods face two key limitations: (1) heavy reliance on precise depth and pose information provided by simulators, which restricts applicability in real-world scenarios; and (2) lack of in-context learning (ICL) capability, making it difficult to quickly adapt to new environments, as in leveraging short videos. To address these challenges, we propose RANGER, a novel zero-shot, open-vocabulary semantic navigation framework that operates using only a monocular camera. Leveraging powerful 3D foundation models, RANGER eliminates the dependency on depth and pose while exhibiting strong ICL capability. By simply observing a short video of a new environment, the system can also significantly improve task efficiency without requiring architectural modifications or fine-tuning. The framework integrates several key components: keyframe-based 3D reconstruction, semantic point cloud generation, vision-language model (VLM)-driven exploration value estimation, high-level adaptive waypoint selection, and low-level action execution. Experiments on the HM3D benchmark and real-world environments demonstrate that RANGER achieves competitive performance in terms of navigation success rate and exploration efficiency, while showing superior ICL adaptability, with no previous 3D mapping of the environment required.

中文标题/摘要

标题：RANGER：通过上下文适应的单目零样本语义导航框架

在复杂环境中高效地找到目标是现实世界具身应用的基础。尽管最近多模态基础模型的进步使得零样本物体目标导航成为可能，允许机器人搜索任意物体而无需微调，但现有方法面临两个关键限制：（1）对模拟器提供的精确深度和姿态信息的高度依赖，这限制了其在现实世界场景中的应用；（2）缺乏上下文学习（ICL）能力，使得难以快速适应新环境，如利用短视频。为了解决这些挑战，我们提出了一种名为RANGER的新颖零样本、开放词汇语义导航框架，仅使用单目相机操作。利用强大的3D基础模型，RANGER消除了对深度和姿态的依赖，同时展示了强大的ICL能力。通过简单观察新环境的短视频，系统也可以显著提高任务效率，无需进行架构修改或微调。该框架整合了几个关键组件：基于关键帧的3D重建、语义点云生成、基于视觉语言模型（VLM）的探索价值估计、高层自适应航点选择和低层动作执行。在HM3D基准和真实世界环境中进行的实验表明，RANGER在导航成功率和探索效率方面表现出竞争力，同时展示了优越的ICL适应性，无需对环境进行先前的3D建图。

Summary / 总结

RANGER is a zero-shot semantic navigation framework that uses only a monocular camera to navigate complex environments without the need for precise depth and pose information. It leverages 3D foundation models and in-context learning (ICL) to adapt quickly to new environments. Experiments show that RANGER performs well in terms of navigation success rate and exploration efficiency, and it can adapt to new environments by observing short videos without requiring architectural modifications or fine-tuning.

RANGER 是一种仅使用单目相机的零样本语义导航框架，旨在高效地在复杂环境中寻找目标。它通过消除对精确深度和姿态信息的依赖并结合上下文学习（ICL）能力来解决现有方法的局限性。RANGER 集成了 3D 重建、语义点云生成和 VLM 驱动的探索价值估计等关键组件，并在 HM3D 基准和真实环境中的实验中展示了竞争力的导航成功率和探索效率，同时无需对环境进行先前的 3D 映射。

UniHetero: Could Generation Enhance Understanding for Vision-Language-Model at Large Data Scale?

Authors: Fengjiao Chen, Minhao Jing, Weitao Lu, Yan Feng, Xiaoyu Li, Xuezhi Cao

First: 2025-12-29T14:49:50+00:00 · Latest: 2025-12-30T13:23:48+00:00

Abstract

Vision-language large models are moving toward the unification of visual understanding and visual generation tasks. However, whether generation can enhance understanding is still under-explored at large data scale. In this work, we analyze the unified structure with a concise model, UniHetero, under large-scale pretraining (>200M samples). Our key observations are: (1) Generation can improve understanding, but only if you generate semantics, not pixels. A common assumption in unified vision-language models is that adding generation will naturally strengthen understanding. However, this is not always true at scale. At 200M+ pretraining samples, generation helps understanding only when it operates at the semantic level, i.e., when the model learns to autoregress high-level visual representations inside the LLM. Once pixel-level objectives (e.g., diffusion losses) directly interfere with the LLM, understanding performance often degrades. (2) Generation reveals a superior data scaling trend and higher data utilization. Unified generation-understanding demonstrates a superior scaling trend compared to understanding alone, revealing a more effective way to learn vision-only knowledge directly from the vision modality rather than by captioning to text. (3) Autoregression on input embeddings is effective at capturing visual details. Compared to the commonly used vision encoder, visual autoregression on input embeddings shows less cumulative error and is modality-independent, so it can be extended to all modalities. The learned semantic representations capture visual information such as objects, locations, shapes, and colors, and further enable pixel-level image generation.

中文标题/摘要

标题：UniHetero：生成能否在大规模数据下增强视觉-语言模型的理解？

视觉-语言大型模型正朝着统一视觉理解与生成任务的方向发展。然而，生成是否能增强理解在大规模数据下仍是一个未被充分探索的问题。在本工作中，我们通过一个简洁的统一结构模型UniHetero，在超过2亿样本的大规模预训练下进行了分析。我们的主要观察结果是：(1) 生成可以提高理解，但只有在生成语义而非像素时才有效。统一的视觉-语言模型中普遍认为添加生成任务会自然增强理解，但在大规模数据下并非总是如此。在超过2亿样本的预训练下，生成任务仅在操作语义层面时才有助于理解，即模型学会在大规模语言模型中自回归高层次的视觉表示时。一旦像素级目标（如扩散损失）直接干扰大规模语言模型，理解性能往往会下降。(2) 生成揭示了更优的数据扩展趋势和更高的数据利用效率。统一的生成-理解相比单独的理解展示了更优的扩展趋势，揭示了一种更有效的从视觉模态直接学习视觉知识的方法，而不是通过描述到文本。(3) 在输入嵌入上进行自回归可以有效捕捉视觉细节。与常用的视觉编码器相比，在输入嵌入上进行视觉自回归显示出较少的累积误差，并且与模态无关，可以扩展到所有模态。学习到的语义表示捕捉了视觉信息，如物体、位置、形状和颜色；进一步支持像素级图像生成。

CorGi: Contribution-Guided Block-Wise Interval Caching for Training-Free Acceleration of Diffusion Transformers

Authors: Yonglak Son, Suhyeok Kim, Seungryong Kim, Young Geun Kim

First: 2025-12-30T12:55:38+00:00 · Latest: 2025-12-30T12:55:38+00:00

Comments: 16 pages, 20 figures

Abstract

Diffusion transformer (DiT) achieves remarkable performance in visual generation, but its iterative denoising process combined with larger capacity leads to a high inference cost. Recent works have demonstrated that the iterative denoising process of DiT models involves substantial redundant computation across steps. To effectively reduce the redundant computation in DiT, we propose CorGi (Contribution-Guided Block-Wise Interval Caching), a training-free DiT inference acceleration framework that selectively reuses the outputs of transformer blocks in DiT across denoising steps. CorGi caches low-contribution blocks and reuses them in later steps within each interval to reduce redundant computation while preserving generation quality. For text-to-image tasks, we further propose CorGi+, which leverages per-block cross-attention maps to identify salient tokens and applies partial attention updates to protect important object details. Evaluation on the state-of-the-art DiT models demonstrates that CorGi and CorGi+ achieve up to 2.0x speedup on average, while preserving high generation quality.

中文标题/摘要

标题：CorGi: 贡献指导的块级区间缓存加速扩散变换器推理，无需训练

扩散变换器（DiT）在视觉生成方面取得了显著的性能，但其迭代去噪过程与较大的容量相结合导致了高昂的推理成本。近期研究表明，DiT模型的迭代去噪过程在各步骤中存在大量的冗余计算。为了有效减少DiT中的冗余计算，我们提出了CorGi（贡献指导的块级区间缓存），这是一种无需训练的DiT推理加速框架，它选择性地在去噪步骤之间重用DiT中的Transformer块输出。CorGi缓存低贡献块，并在每个区间内的后续步骤中重用它们，以减少冗余计算同时保持生成质量。对于文本到图像任务，我们进一步提出了CorGi+，它利用每个块的交叉注意力图来识别重要标记，并应用部分注意力更新以保护重要的对象细节。在最先进的DiT模型上的评估表明，CorGi和CorGi+平均实现了最高2.0倍的加速，同时保持了高质量的生成。

Summary / 总结

CorGi is a training-free inference acceleration framework for diffusion transformers (DiT) that selectively reuses transformer block outputs across denoising steps to reduce redundant computation. It caches low-contribution blocks and reuses them in later steps within each interval, which cuts computation while preserving generation quality and yields up to 2.0x speedup on average. For text-to-image tasks, CorGi+ uses per-block cross-attention maps to identify salient tokens and applies partial attention updates to protect important object details.

CorGi 是一种针对扩散变换器（DiT）的无需训练的推理加速框架，通过在去噪步骤间选择性重用 Transformer 块输出来减少冗余计算。它缓存低贡献度的块并在每个区间内的后续步骤中重用它们，在保持生成质量的同时，对最先进的 DiT 模型平均实现了最高 2.0 倍的加速。CorGi+ 进一步通过使用每块的交叉注意力图识别重要标记并进行部分注意力更新，以保护文本到图像任务中的重要对象细节。
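
The caching scheme in the abstract (reuse the outputs of low-contribution blocks for the later steps of each interval) can be sketched with a toy denoising loop. This is a minimal sketch under stated assumptions: the abstract does not define the contribution score, so the L1 norm of each block's residual is used here as a stand-in, and `run_with_interval_cache`, `interval`, and `num_cached` are hypothetical names.

```python
def run_with_interval_cache(blocks, x, num_steps, interval, num_cached):
    """Toy denoising loop with block-wise interval caching.

    blocks: list of functions mapping a hidden state (list of floats) to a
        residual update of the same length.
    At the first step of every interval all blocks run, and the `num_cached`
    blocks with the smallest residual norm are marked as low-contribution.
    For the remaining steps of the interval their stored residual is reused.
    Returns the final state and the number of block evaluations performed.
    """
    cache = {}          # block index -> residual saved at the interval start
    evaluations = 0
    for step in range(num_steps):
        refresh = step % interval == 0
        if refresh:
            cache = {}
        residuals = {}
        for i, block in enumerate(blocks):
            if i in cache:
                residual = cache[i]          # reuse, skip computation
            else:
                residual = block(x)
                evaluations += 1
            residuals[i] = residual
            x = [xi + ri for xi, ri in zip(x, residual)]
        if refresh:
            # Assumed contribution score: L1 norm of the block's residual.
            ranked = sorted(residuals, key=lambda i: sum(abs(r) for r in residuals[i]))
            cache = {i: residuals[i] for i in ranked[:num_cached]}
    return x, evaluations


# Three toy "transformer blocks" with very different update magnitudes.
toy_blocks = [
    lambda h: [-0.5 * v for v in h],    # large contribution
    lambda h: [0.001 * v for v in h],   # tiny contribution
    lambda h: [-0.1 * v for v in h],    # medium contribution
]

_, full_cost = run_with_interval_cache(toy_blocks, [1.0, -2.0], 8, interval=4, num_cached=0)
_, cached_cost = run_with_interval_cache(toy_blocks, [1.0, -2.0], 8, interval=4, num_cached=1)
print(full_cost, cached_cost)
```

With 8 steps, an interval of 4, and one cached block out of three, the loop performs 18 block evaluations instead of 24. The saving grows with longer intervals and more cached blocks, at the cost of reusing staler outputs.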

VADTree: Explainable Training-Free Video Anomaly Detection via Hierarchical Granularity-Aware Tree

Authors: Wenlong Li, Yifei Xu, Yuan Rao, Zhenhua Wang, Shuiguang Deng

Venue: NeurIPS 2025 poster

First: 2025-10-26T14:36:15+00:00 · Latest: 2025-12-30T12:31:56+00:00

Comments: NeurIPS 2025 poster

Abstract

Video anomaly detection (VAD) focuses on identifying anomalies in videos. Supervised methods demand substantial in-domain training data and fail to deliver clear explanations for anomalies. In contrast, training-free methods leverage the knowledge reserves and language interactivity of large pre-trained models to detect anomalies. However, the current fixed-length temporal window sampling approaches struggle to accurately capture anomalies with varying temporal spans. Therefore, we propose VADTree that utilizes a Hierarchical Granularity-aware Tree (HGTree) structure for flexible sampling in VAD. VADTree leverages the knowledge embedded in a pre-trained Generic Event Boundary Detection (GEBD) model to characterize potential anomaly event boundaries. Specifically, VADTree decomposes the video into generic event nodes based on boundary confidence, and performs adaptive coarse-fine hierarchical structuring and redundancy removal to construct the HGTree. Then, the multi-dimensional priors are injected into the visual language models (VLMs) to enhance the node-wise anomaly perception, and anomaly reasoning for generic event nodes is achieved via large language models (LLMs). Finally, an inter-cluster node correlation method is used to integrate the multi-granularity anomaly scores. Extensive experiments on three challenging datasets demonstrate that VADTree achieves state-of-the-art performance in training-free settings while drastically reducing the number of sampled video segments. The code will be available at https://github.com/wenlongli10/VADTree.

中文标题/摘要

标题：VADTree：基于层次粒度感知树的可解释免训练视频异常检测

视频异常检测（VAD）旨在识别视频中的异常。监督方法需要大量领域内的训练数据，并且无法为异常提供清晰的解释。相比之下，免训练方法利用大型预训练模型的知识储备和语言互动来检测异常。然而，当前固定长度的时间窗口采样方法难以准确捕捉具有不同时间跨度的异常。因此，我们提出了VADTree，它利用层次粒度感知树（HGTree）结构进行灵活的采样。VADTree利用预训练的通用事件边界检测（GEBD）模型嵌入的知识来表征潜在的异常事件边界。具体来说，VADTree基于边界置信度将视频分解为通用事件节点，并通过自适应粗细层次结构化和冗余去除构建HGTree。然后，将多维先验注入视觉语言模型（VLMs）以增强节点级别的异常感知，并通过大型语言模型（LLMs）实现通用事件节点的异常推理。最后，使用跨簇节点相关方法整合多粒度异常评分。在三个具有挑战性的数据集上的广泛实验表明，VADTree在免训练设置中实现了最先进的性能，同时大幅减少了采样的视频片段数量。代码将在https://github.com/wenlongli10/VADTree上提供。

Summary / 总结

VADTree is proposed to address the limitations of training-free video anomaly detection methods by utilizing a Hierarchical Granularity-aware Tree (HGTree) structure. It decomposes videos into generic event nodes and constructs an HGTree for adaptive coarse-fine hierarchical structuring. VADTree leverages a pre-trained Generic Event Boundary Detection (GEBD) model and integrates multi-dimensional priors into visual language models to enhance anomaly perception. Experiments show that VADTree outperforms existing methods in training-free settings with fewer sampled video segments.

VADTree 通过使用层次粒度感知树（HGTree）结构来解决免训练视频异常检测方法的局限性。它将视频分解为通用事件节点并构建 HGTree，以捕捉不同时间跨度的异常。VADTree 利用预训练的通用事件边界检测（GEBD）模型，并将多维先验注入视觉语言模型以增强异常感知。实验表明，VADTree 在减少采样视频片段数量的同时，在免训练设置下优于现有方法。
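
The decomposition step in the abstract (split the video at confident event boundaries, then build a coarse-to-fine hierarchy and drop redundant nodes) can be illustrated on a list of per-frame boundary scores. This is a simplified sketch under stated assumptions: the two fixed thresholds, the two-level depth, and the names `split_at` and `build_tree` are hypothetical, and the scores are made up rather than produced by a GEBD model.

```python
def split_at(boundaries, start, end):
    """Cut the half-open frame range [start, end) at the given boundary frames."""
    cuts = [start] + [b for b in sorted(boundaries) if start < b < end] + [end]
    return list(zip(cuts[:-1], cuts[1:]))


def build_tree(confidence, coarse_thr, fine_thr):
    """Two-level segment tree from per-frame boundary confidences.

    confidence[i] is the (hypothetical) detector score that a new event
    starts at frame i. Strong boundaries delimit coarse nodes; weaker ones
    split each coarse node into fine children. A coarse node that would get
    a single child identical to itself keeps no children (redundancy removal).
    """
    n = len(confidence)
    strong = [i for i, c in enumerate(confidence) if c >= coarse_thr]
    weak = [i for i, c in enumerate(confidence) if c >= fine_thr]
    tree = []
    for start, end in split_at(strong, 0, n):
        children = split_at(weak, start, end)
        if children == [(start, end)]:
            children = []
        tree.append({"span": (start, end), "children": children})
    return tree


scores = [0.0, 0.1, 0.1, 0.9, 0.1, 0.5, 0.1, 0.1, 0.95, 0.2]
for node in build_tree(scores, coarse_thr=0.8, fine_thr=0.4):
    print(node)
```

Here the 10 frames become three coarse nodes, and only the middle one is subdivided. Segment lengths follow the content rather than a fixed window, which is the property the abstract contrasts with fixed-length temporal sampling.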

Towards Open-Vocabulary Industrial Defect Understanding with a Large-Scale Multimodal Dataset

Authors: TsaiChing Ni, ZhenQi Chen, YuanFu Yang

First: 2025-12-30T11:45:22+00:00 · Latest: 2025-12-30T11:45:22+00:00

Abstract

We present IMDD-1M, the first large-scale Industrial Multimodal Defect Dataset comprising 1,000,000 aligned image-text pairs, designed to advance multimodal learning for manufacturing and quality inspection. IMDD-1M contains high-resolution real-world defects spanning over 60 material categories and more than 400 defect types, each accompanied by expert-verified annotations and fine-grained textual descriptions detailing defect location, severity, and contextual attributes. This dataset enables a wide spectrum of applications, including classification, segmentation, retrieval, captioning, and generative modeling. Building upon IMDD-1M, we train a diffusion-based vision-language foundation model from scratch, specifically tailored for industrial scenarios. The model serves as a generalizable foundation that can be efficiently adapted to specialized domains through lightweight fine-tuning. With less than 5% of the task-specific data required by dedicated expert models, it achieves comparable performance, highlighting the potential of data-efficient foundation model adaptation for industrial inspection and generation, paving the way for scalable, domain-adaptive, and knowledge-grounded manufacturing intelligence.

中文标题/摘要

标题：基于大规模多模态数据集的开放词汇工业缺陷理解研究

我们提出了IMDD-1M，这是首个包含1,000,000个图像-文本对的大型工业多模态缺陷数据集，旨在推动制造和质量检测中的多模态学习。IMDD-1M 包含超过60种材料类别和400多种缺陷类型的高分辨率真实世界缺陷，每种缺陷都附有专家验证的注释和详细的文本描述，说明缺陷的位置、严重程度和上下文属性。该数据集支持包括分类、分割、检索、描述和生成建模在内的广泛应用。基于IMDD-1M，我们从零开始训练了一个基于扩散的视觉-语言基础模型，特别适用于工业场景。该模型作为可泛化的基础，可以通过轻量级微调高效适应特定领域。与专门的专家模型相比，它只需要不到5%的任务特定数据即可达到相当的性能，突显了数据高效基础模型适应在工业检测和生成中的潜力，为可扩展、领域适应和知识导向的制造智能铺平了道路。

Summary / 总结

The research aims to advance multimodal learning in industrial defect understanding by introducing IMDD-1M, a large-scale multimodal dataset with 1,000,000 image-text pairs. The dataset includes high-resolution defects from over 60 material categories and 400 defect types, each with expert annotations and detailed descriptions. A diffusion-based vision-language foundation model was trained on this dataset, which can be fine-tuned with minimal data to achieve performance comparable to specialized expert models, demonstrating the potential for data-efficient adaptation in industrial inspection and generation tasks.

研究旨在通过引入包含1,000,000个图像-文本对的IMDD-1M大型多模态数据集，推进工业缺陷理解中的多模态学习。该数据集包括来自超过60种材料类别和400种缺陷类型的高分辨率缺陷，每种缺陷都有专家注释和详细的描述。基于此数据集训练了一个扩散型视觉-语言基础模型，该模型可以通过少量数据进行微调以达到与专门专家模型相当的性能，展示了在工业检测和生成任务中数据高效适应的潜力。