PhysTalk: Language-driven Real-time Physics in 3D Gaussian Scenes
Authors: Luca Collorone, Mert Kiray, Indro Spinelli, Fabio Galasso, Benjamin Busam
First: 2025-12-31T17:32:31+00:00 · Latest: 2025-12-31T17:32:31+00:00
Abstract
Realistic visual simulations are omnipresent, yet their creation requires computing time, rendering, and expert animation knowledge. Open-vocabulary visual effects generation from text inputs emerges as a promising solution that can unlock immense creative potential. However, current pipelines lack both physical realism and effective language interfaces, requiring slow offline optimization. In contrast, PhysTalk takes a 3D Gaussian Splatting (3DGS) scene as input and translates arbitrary user prompts into real-time, physics-based, interactive 4D animations. A large language model (LLM) generates executable code that directly modifies 3DGS parameters through lightweight proxies and particle dynamics. Notably, PhysTalk is the first framework to couple 3DGS directly with a physics simulator without relying on time-consuming mesh extraction. While remaining open-vocabulary, this design enables interactive 3D Gaussian animation via collision-aware, physics-based manipulation of arbitrary, multi-material objects. Finally, PhysTalk is training-free and computationally lightweight: this makes 4D animation broadly accessible and shifts these workflows from a "render and wait" paradigm toward an interactive dialogue with a modern, physics-informed pipeline.
中文标题/摘要
标题:PhysTalk: 3D 高斯场景中的语言驱动实时物理
逼真的视觉模拟无处不在,但其创建需要计算时间、渲染和专家动画知识。基于开放词汇的视觉效果生成从文本输入中脱颖而出,成为一种有望释放巨大创意潜力的解决方案。然而,当前的工作流程缺乏物理真实感和有效的语言接口,需要缓慢的离线优化。相比之下,PhysTalk 将 3D 高斯点绘(3DGS)场景作为输入,并将任意用户提示翻译成基于实时物理的 4D 交互式动画。一个大型语言模型(LLM)生成可执行代码,直接通过轻量级代理和粒子动力学修改 3DGS 参数。值得注意的是,PhysTalk 是第一个直接将 3DGS 与物理模拟器结合而无需依赖耗时的网格提取的框架。尽管保持开放词汇,此设计使用户能够通过碰撞感知的物理基础操作任意多材料对象进行交互式 3D 高斯动画。最后,PhysTalk 是无训练的且计算量轻:这使得 4D 动画广泛可及,并将这些工作流程从“渲染等待”范式转向与现代、物理启发式管道的互动对话。
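As a rough illustration of what "particle dynamics" acting on 3DGS parameters could look like, the sketch below treats each Gaussian center as a point particle under gravity with a ground-plane collision. This is not PhysTalk's implementation: the `Particle` type, the `step_particles` function, the restitution value, and the flat ground at y = 0 are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Particle:
    pos: list  # [x, y, z] center of one Gaussian (hypothetical proxy)
    vel: list  # [vx, vy, vz] velocity of that center


def step_particles(particles, dt=1.0 / 60.0, gravity=-9.81, restitution=0.5):
    """Advance all particles by one semi-implicit Euler step.

    Assumes a flat ground plane at y = 0; a particle that falls below it is
    pushed back onto the plane and its vertical velocity is reflected.
    """
    for p in particles:
        p.vel[1] += gravity * dt          # update velocity first
        for i in range(3):
            p.pos[i] += p.vel[i] * dt     # then position, using the new velocity
        if p.pos[1] < 0.0:                # collision with the ground plane
            p.pos[1] = 0.0
            p.vel[1] = -p.vel[1] * restitution
    return particles
```

Calling `step_particles` once per rendered frame and writing the updated `pos` values back into the Gaussian means would give a simple collision-aware animation loop; a real system would also need to handle rotation, scale, and material-specific behavior.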
DarkEQA: Benchmarking Vision-Language Models for Embodied Question Answering in Low-Light Indoor Environments
Authors: Yohan Park, Hyunwoo Ha, Wonjun Jo, Tae-Hyun Oh
First: 2025-12-31T17:31:29+00:00 · Latest: 2025-12-31T17:31:29+00:00
Comments: Submitted to IEEE Robotics and Automation Letters (RA-L)
Abstract
Vision Language Models (VLMs) are increasingly adopted as central reasoning modules for embodied agents. Existing benchmarks evaluate their capabilities under ideal, well-lit conditions, yet robust 24/7 operation demands performance under a wide range of visual degradations, including low-light conditions at night or in dark environments--a core necessity that has been largely overlooked. To address this underexplored challenge, we present DarkEQA, an open-source benchmark for evaluating EQA-relevant perceptual primitives under multi-level low-light conditions. DarkEQA isolates the perception bottleneck by evaluating question answering from egocentric observations under controlled degradations, enabling attributable robustness analysis. A key design feature of DarkEQA is its physical fidelity: visual degradations are modeled in linear RAW space, simulating physics-based illumination drop and sensor noise followed by an ISP-inspired rendering pipeline. We demonstrate the utility of DarkEQA by evaluating a wide range of state-of-the-art VLMs and Low-Light Image Enhancement (LLIE) models. Our analysis systematically reveals VLMs' limitations when operating under these challenging visual conditions. Our code and benchmark dataset will be released upon acceptance.
中文标题/摘要
标题:DarkEQA:在低光室内环境中的视觉语言模型体态问答基准测试
视觉语言模型(VLMs)越来越多地被用作体态代理的核心推理模块。现有的基准测试在理想、光线充足的条件下评估它们的能力,但全天候24/7运行需要在各种视觉退化条件下表现出色,包括夜间或黑暗环境中的低光条件——这一核心需求已被很大程度上忽视。为应对这一未充分探索的挑战,我们提出了DarkEQA,这是一个开源基准测试,用于在多级低光条件下评估与体态问答相关的感知基本能力。DarkEQA通过在受控退化条件下从第一人称观察进行问答评估,隔离了感知瓶颈,使可归因的鲁棒性分析成为可能。DarkEQA的一个关键设计特点是其物理保真度:视觉退化在线性RAW空间中建模,模拟基于物理的照明下降和传感器噪声,随后通过ISP启发式的渲染管道。我们通过评估一系列最先进的VLMs和低光图像增强(LLIE)模型展示了DarkEQA的实用性。我们的分析系统地揭示了这些视觉条件下的体态操作限制。我们的代码和基准数据集将在接受后发布。
Summary / 总结
DarkEQA is a benchmark designed to evaluate the performance of Vision-Language Models (VLMs) in low-light indoor environments, addressing the underexplored challenge of robust 24/7 operation. It uses a controlled degradation process to simulate low-light conditions and evaluates the models' perceptual capabilities. Key findings show that current VLMs struggle with question answering under these challenging visual conditions.
DarkEQA 是一个基准,旨在评估 Vision-Language 模型在低光条件下的性能,解决了 24/7 运行时的低光照环境下的鲁棒性问题。它通过模拟低光照环境来控制降级过程,专注于与体感问答相关的感知基本原理。研究结果表明,当前的 VLMs 在低光条件下表现不佳,突显了需要开发能够在黑暗环境中运行的改进模型的必要性。
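The degradation described in the abstract (illumination drop and sensor noise applied in linear space, followed by rendering) can be sketched for a single pixel value as below. This is a simplified illustration, not the benchmark's pipeline: the noise parameters `shot_gain` and `read_sigma` are hypothetical, shot noise is approximated as signal-dependent Gaussian noise, and the "ISP" is reduced to clamping plus the standard sRGB transfer function.

```python
import random


def srgb_to_linear(v):
    """Standard sRGB decoding for a value in [0, 1]."""
    return v / 12.92 if v <= 0.04045 else ((v + 0.055) / 1.055) ** 2.4


def linear_to_srgb(v):
    """Standard sRGB encoding for a linear value in [0, 1]."""
    return 12.92 * v if v <= 0.0031308 else 1.055 * v ** (1 / 2.4) - 0.055


def degrade_pixel(v, exposure_scale, read_sigma=0.002, shot_gain=0.01, rng=random):
    """Darken one sRGB value in linear space and add sensor-like noise.

    Noise variance = shot_gain * signal + read_sigma**2, a Gaussian
    approximation of shot plus read noise (an assumption of this sketch).
    """
    lin = srgb_to_linear(v) * exposure_scale
    sigma = (shot_gain * lin + read_sigma ** 2) ** 0.5
    noisy = lin + rng.gauss(0.0, sigma)
    noisy = min(max(noisy, 0.0), 1.0)     # clamp before display encoding
    return linear_to_srgb(noisy)
```

Sweeping `exposure_scale` over several values (for example 1.0, 0.3, 0.1, 0.03) would give the kind of multi-level low-light series the benchmark describes.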
DAVE: A VLM Vision Encoder for Document Understanding and Web Agents
Authors: Brandon Huang, Hang Hua, Zhuoran Yu, Trevor Darrell, Rogerio Feris, Roei Herzig
First: 2025-12-19T04:09:24+00:00 · Latest: 2025-12-31T17:30:11+00:00
Abstract
While Vision-language models (VLMs) have demonstrated remarkable performance across multi-modal tasks, their choice of vision encoders presents a fundamental weakness: their low-level features lack the robust structural and spatial information essential for document understanding and web agents. To bridge this gap, we introduce DAVE, a vision encoder purpose-built for VLMs and tailored for these tasks. Our training pipeline is designed to leverage abundant unlabeled data to bypass the need for costly large-scale annotations for document and web images. We begin with a self-supervised pretraining stage on unlabeled images, followed by a supervised autoregressive pretraining stage, where the model learns tasks like parsing and localization from limited, high-quality data. Within the supervised stage, we adopt two strategies to improve our encoder's alignment with both general visual knowledge and diverse document and web agentic tasks: (i) We introduce a novel model-merging scheme, combining encoders trained with different text decoders to ensure broad compatibility with different web agentic architectures. (ii) We use ensemble training to fuse features from pretrained generalist encoders (e.g., SigLIP2) with our own document and web-specific representations. Extensive experiments on classic document tasks, VQAs, web localization, and agent-based benchmarks validate the effectiveness of our approach, establishing DAVE as a strong vision encoder for document and web applications.
中文标题/摘要
标题:DAVE:一种用于文档理解和网络代理的VLM视觉编码器
尽管视觉语言模型(VLMs)在多模态任务中表现出色,但它们所选择的视觉编码器存在根本性弱点:其低级特征缺乏文档理解和网络代理所需的稳健的结构和空间信息。为弥补这一差距,我们引入了DAVE,这是一种专为VLMs设计的视觉编码器,特别适用于这些任务。我们的训练管道旨在利用大量未标注数据,以避免对文档和网络图像进行昂贵的大规模注释的需要。我们首先在未标注图像上进行自我监督预训练阶段,然后进行监督自回归预训练阶段,在此阶段,模型从有限的高质量数据中学习解析和定位等任务。在监督阶段内,我们采用了两种策略来提高编码器与通用视觉知识和多样化文档及网络代理任务的对齐:(i) 我们引入了一种新的模型合并方案,将使用不同文本解码器训练的编码器结合起来,以确保与不同网络代理架构的广泛兼容性。(ii) 我们使用集成训练将预训练的一般编码器(例如SigLIP2)的特征与我们自己的文档和网络特定表示融合在一起。在经典文档任务、VQAs、网络定位和基于代理的基准测试中的广泛实验验证了我们方法的有效性,确立了DAVE作为文档和网络应用的强大视觉编码器的地位。
Summary / 总结
DAVE is a vision encoder designed to enhance the performance of vision-language models in document understanding and web agent tasks. It is trained in two stages: self-supervised pretraining on unlabeled images, then supervised autoregressive pretraining on limited, high-quality data. Within the supervised stage, a model-merging scheme combines encoders trained with different text decoders to keep DAVE compatible with different web agentic architectures, and ensemble training fuses features from pretrained generalist encoders (e.g., SigLIP2) with document- and web-specific representations. Experiments on classic document tasks, VQAs, web localization, and agent-based benchmarks validate the approach, establishing DAVE as a strong vision encoder for document and web applications.
DAVE 是一种专门为 VLMs 设计的视觉编码器,旨在增强文档理解和网页代理任务,通过结合自监督和监督预训练方法。它利用未标记的数据进行初始训练,并用高质量数据进行微调,以完成解析和定位等任务。DAVE 的有效性通过在文档任务、VQA、网页定位和代理基准测试中的实验得到验证,展示了其在这些应用中的稳健性和适应性。
CPJ: Explainable Agricultural Pest Diagnosis via Caption-Prompt-Judge with LLM-Judged Refinement
Authors: Wentao Zhang, Tao Fang, Lina Lu, Lifei Wang, Weihe Zhong
First: 2025-12-31T16:21:31+00:00 · Latest: 2025-12-31T16:21:31+00:00
Comments: This paper is 6 pages in length and contains 2 figures. Tao Fang (Corresponding Author), Lina Lu (Co-corresponding Author)
Abstract
Accurate and interpretable crop disease diagnosis is essential for agricultural decision-making, yet existing methods often rely on costly supervised fine-tuning and perform poorly under domain shifts. We propose Caption-Prompt-Judge (CPJ), a training-free few-shot framework that enhances Agri-Pest VQA through structured, interpretable image captions. CPJ employs large vision-language models to generate multi-angle captions, refined iteratively via an LLM-as-Judge module, which then inform a dual-answer VQA process for both recognition and management responses. Evaluated on CDDMBench, CPJ significantly improves performance: using GPT-5-mini captions, GPT-5-Nano achieves +22.7 pp in disease classification and +19.5 points in QA score over no-caption baselines. The framework provides transparent, evidence-based reasoning, advancing robust and explainable agricultural diagnosis without fine-tuning. Our code and data are publicly available at: https://github.com/CPJ-Agricultural/CPJ-Agricultural-Diagnosis.
中文标题/摘要
标题:CPJ: 通过 Caption-Prompt-Judge 与 LLM 判定修正实现可解释的农业害虫诊断
准确且可解释的农作物疾病诊断对于农业决策至关重要,但现有方法往往依赖昂贵的监督微调且在领域迁移时表现不佳。我们提出 Caption-Prompt-Judge (CPJ),一种无需训练的少样本框架,通过结构化、可解释的图像描述增强农业害虫问答。CPJ 利用大型视觉-语言模型生成多角度描述,并通过一个 LLM-as-Judge 模块迭代修正,然后指导一个双答案问答过程,用于识别和管理响应。在 CDDMBench 上评估,CPJ 显著提高了性能:使用 GPT-5-mini 描述,GPT-5-Nano 在疾病分类上比无描述基线提高了 +22.7 个百分点,在问答得分上提高了 +19.5 分。该框架提供了透明、基于证据的推理,推动了无需微调的稳健和可解释的农业诊断。我们的代码和数据已公开发布在:https://github.com/CPJ-Agricultural/CPJ-Agricultural-Diagnosis.
Summary / 总结
The research aims to improve the accuracy and interpretability of crop disease diagnosis in agriculture. The proposed Caption-Prompt-Judge (CPJ) framework uses large vision-language models to generate multi-angle captions, which are iteratively refined by an LLM-as-Judge module. This process informs a dual-answer VQA system for disease recognition and management. On CDDMBench, CPJ significantly outperforms no-caption baselines: with GPT-5-mini captions, GPT-5-Nano gains 22.7 percentage points in disease classification and 19.5 points in QA score. The framework provides transparent reasoning, enhancing robust and explainable agricultural diagnosis without requiring fine-tuning.
研究旨在提高农业作物病害诊断的准确性和可解释性。提出的Caption-Prompt-Judge (CPJ)框架利用大型视觉-语言模型生成多角度的图像描述,并通过LLM-as-Judge模块迭代精炼这些描述。这些描述指导一个双答案的VQA系统进行病害识别和管理。在CDDMBench上,CPJ显著优于无描述基线,使用GPT-5-mini描述时,GPT-5-Nano在病害分类上的表现提高了22.7个百分点,在问答得分上提高了19.5分。该框架提供了透明的推理过程,增强了农业诊断的稳健性和可解释性,无需进行微调。
ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting
Authors: Abhijit Mishra, Mingda Li, Hsiang Fu, Richard Noh, Minji Kim
First: 2025-02-20T18:01:41+00:00 · Latest: 2025-12-31T15:43:05+00:00
Comments: Accepted and to appear in IJCNLP-AACL 2025
Abstract
Efficient and privacy-preserving multimodal interaction is essential as AR, VR, and modern smartphones with powerful cameras become primary interfaces for human-computer communication. Existing powerful large vision-language models (VLMs) enabling multimodal interaction often rely on cloud-based processing, raising significant concerns about (1) visual privacy by transmitting sensitive vision data to servers, and (2) their limited real-time, on-device usability. This paper explores Visual Instruction Rewriting, a novel approach that transforms multimodal instructions into text-only commands, allowing seamless integration of lightweight on-device instruction rewriter VLMs (250M parameters) with existing conversational AI systems, enhancing vision data privacy. To achieve this, we present a dataset of over 39,000 examples across 14 domains and develop a compact VLM, pretrained on image captioning datasets and fine-tuned for instruction rewriting. Experimental results, evaluated through NLG metrics such as BLEU, METEOR, and ROUGE, along with semantic parsing analysis, demonstrate that even a quantized version of the model (<500MB storage footprint) can achieve effective instruction rewriting, thus enabling privacy-focused, multimodal AI applications.
中文标题/摘要
标题:ReVision:一种用于隐私保护任务导向视觉指令重写的数据集和基线VLM
随着AR、VR和配备强大摄像头的现代智能手机成为人机通信的主要接口,高效的隐私保护多模态交互变得至关重要。现有的强大视觉-语言模型(VLMs)支持多模态交互,通常依赖于基于云的处理,这引发了关于(1)视觉隐私问题,即传输敏感的视觉数据到服务器,以及(2)其有限的实时、设备端可用性的问题。本文探讨了视觉指令重写这一新颖的方法,即将多模态指令转换为纯文本命令,允许轻量级设备端指令重写VLM(参数量250M)与现有对话AI系统的无缝集成,增强视觉数据隐私。为此,我们提供了一个涵盖14个领域的超过39,000个示例的数据集,并开发了一个紧凑的VLM,该模型在图像描述数据集上进行预训练,并针对指令重写进行了微调。实验结果通过NLG指标(如BLEU、METEOR和ROUGE)以及语义解析分析评估,表明即使是最小量化版本的模型(存储占用量<500MB)也能实现有效的指令重写,从而实现以隐私为中心的多模态AI应用。
Summary / 总结
This paper addresses the need for efficient and privacy-preserving multimodal interaction by introducing ReVision, a dataset and baseline VLM for visual instruction rewriting. The method involves transforming multimodal instructions into text-only commands using a compact VLM (250M parameters) pretrained on image captioning datasets and fine-tuned for instruction rewriting. Experimental results show that even a quantized version of the model can effectively rewrite instructions, enhancing privacy in vision-language interactions.
该论文通过引入ReVision数据集和基线视觉语言模型,解决高效且保护隐私的多模态交互需求,该模型将视觉指令转换为纯文本命令,减少对云处理的需求并增强隐私保护。数据集包含超过39,000个跨14个领域的示例,模型在图像描述数据集上预训练,并针对指令重写进行微调,即使在小于500MB存储足迹的量化版本中也能实现有效的指令重写,通过NLG指标和语义解析分析进行评估。
Are First-Order Diffusion Samplers Really Slower? A Fast Forward-Value Approach
Authors: Yuchen Jiao, Na Li, Changxiao Cai, Gen Li
First: 2025-12-31T15:35:53+00:00 · Latest: 2025-12-31T15:35:53+00:00
Abstract
Higher-order ODE solvers have become a standard tool for accelerating diffusion probabilistic model (DPM) sampling, motivating the widespread view that first-order methods are inherently slower and that increasing discretization order is the primary path to faster generation. This paper challenges this belief and revisits acceleration from a complementary angle: beyond solver order, the placement of DPM evaluations along the reverse-time dynamics can substantially affect sampling accuracy in the low-neural function evaluation (NFE) regime.
We propose a novel training-free, first-order sampler whose leading discretization error has the opposite sign to that of DDIM. Algorithmically, the method approximates the forward-value evaluation via a cheap one-step lookahead predictor. We provide theoretical guarantees showing that the resulting sampler provably approximates the ideal forward-value trajectory while retaining first-order convergence. Empirically, across standard image generation benchmarks (CIFAR-10, ImageNet, FFHQ, and LSUN), the proposed sampler consistently improves sample quality under the same NFE budget and can be competitive with, and sometimes outperform, state-of-the-art higher-order samplers. Overall, the results suggest that the placement of DPM evaluations provides an additional and largely independent design angle for accelerating diffusion sampling.
中文标题/摘要
标题:一阶扩散采样器真的更慢吗?一种快速前向值方法
高阶ODE求解器已成为加速扩散概率模型(DPM)采样的标准工具,这促使人们普遍认为一阶方法本质上较慢,并且提高离散化阶数是实现更快生成的主要途径。本文挑战了这一观点,并从互补的角度重新审视加速:除了求解器阶数外,DPM评估在反向时间动力学中的位置会在低神经网络评估次数(NFE)区间显著影响采样精度。
我们提出了一种新的无需训练的一阶采样器,其主要离散化误差与DDIM相反。算法上,该方法通过廉价的一步前瞻预测器近似前向值评估。我们提供了理论保证,表明该采样器能够证明地逼近理想的前向值轨迹,同时保持一阶收敛性。实验上,在标准图像生成基准(CIFAR-10、ImageNet、FFHQ和LSUN)上,所提出的采样器在相同的NFE预算下始终能提高样本质量,并且有时可以与最先进的高阶采样器竞争。总体而言,结果表明,DPM评估的位置提供了加速扩散采样的另一个独立设计角度。
Summary / 总结
This paper challenges the belief that first-order diffusion samplers are inherently slower than higher-order methods. It introduces a novel first-order sampler that approximates the forward-value evaluation through a cheap one-step lookahead predictor, providing theoretical guarantees of first-order convergence. Empirically, the sampler improves sample quality across various benchmarks and can match or outperform state-of-the-art higher-order samplers under the same neural function evaluation budget.
本文挑战了第一阶扩散采样器比高阶方法更慢的观念。它提出了一种新型的第一阶采样器,通过廉价的一步前瞻预测来近似前向值评估,并展示了其在各种基准上的样本质量改进,同时具有理论上的第一阶收敛保证。结果表明,DPM评估的位置可以显著影响采样精度,并提供了一个加速扩散采样的额外设计角度。
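For reference, the DDIM baseline that the abstract compares against uses the deterministic first-order update sketched below (the eta = 0 case, written for a single scalar coordinate). The proposed forward-value sampler and its one-step lookahead predictor are not specified in the abstract and are not reproduced here; the function name and argument names are illustrative.

```python
import math


def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    """One deterministic DDIM update (eta = 0).

    x_t       : current noisy sample
    eps_pred  : the model's noise prediction at the current step
    abar_t    : cumulative alpha product at the current step
    abar_prev : cumulative alpha product at the next (less noisy) step
    """
    # Estimate the clean sample implied by the noise prediction.
    x0_pred = (x_t - math.sqrt(1.0 - abar_t) * eps_pred) / math.sqrt(abar_t)
    # Re-noise that estimate to the noise level of the next step.
    return math.sqrt(abar_prev) * x0_pred + math.sqrt(1.0 - abar_prev) * eps_pred
```

If `eps_pred` equals the true noise, the update lands exactly on the correspondingly noised sample at the next level, so any discretization error comes from how the noise prediction changes along the trajectory. That is the error term whose sign the abstract says the new sampler reverses.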
Video and Language Alignment in 2D Systems for 3D Multi-object Scenes with Multi-Information Derivative-Free Control
Authors: Jason Armitage, Rico Sennrich
First: 2025-12-31T12:39:03+00:00 · Latest: 2025-12-31T12:39:03+00:00
Abstract
Cross-modal systems trained on 2D visual inputs are presented with a dimensional shift when processing 3D scenes. An in-scene camera bridges the dimensionality gap but requires learning a control module. We introduce a new method that improves multivariate mutual information estimates by regret minimisation with derivative-free optimisation. Our algorithm enables off-the-shelf cross-modal systems trained on 2D visual inputs to adapt online to object occlusions and differentiate features. The pairing of expressive measures and value-based optimisation assists control of an in-scene camera to learn directly from the noisy outputs of vision-language models. The resulting pipeline improves performance in cross-modal tasks on multi-object 3D scenes without resorting to pretraining or finetuning.
中文标题/摘要
标题:2D系统中2D视觉输入与3D多对象场景的语言对齐
跨模态系统在处理3D场景时面临维度跃迁问题。场景内的相机可以弥合维度差距,但需要学习一个控制模块。我们提出了一种新方法,通过无导数优化的遗憾最小化来提高多元互信息估计。我们的算法使基于2D视觉输入训练的即插即用跨模态系统能够在线适应物体遮挡并区分特征。表达性度量与基于价值的优化相结合,帮助控制场景内的相机直接从视觉-语言模型的嘈杂输出中学习。由此产生的流水线在无需预训练或微调的情况下,提高了多对象3D场景跨模态任务的性能。
Summary / 总结
The research addresses the challenge of processing 3D scenes by cross-modal systems originally trained on 2D inputs. It introduces a method using regret minimisation with derivative-free optimisation to improve multivariate mutual information estimates. This enables the systems to adapt online to object occlusions and differentiate features, improving performance in cross-modal tasks on multi-object 3D scenes without pretraining or fine-tuning.
研究解决了由原本训练于2D输入的跨模态系统处理3D场景的挑战。它提出了一种使用无导数优化的后悔最小化方法来提高多变量互信息估计。这使得系统能够在线适应物体遮挡并区分特征,从而在多对象3D场景的跨模态任务中提高性能,无需进行预训练或微调。
CritiFusion: Semantic Critique and Spectral Alignment for Faithful Text-to-Image Generation
Authors: ZhenQi Chen, TsaiChing Ni, YuanFu Yang
First: 2025-12-27T19:08:18+00:00 · Latest: 2025-12-31T10:44:57+00:00
Abstract
Recent text-to-image diffusion models have achieved remarkable visual fidelity but often struggle with semantic alignment to complex prompts. We introduce CritiFusion, a novel inference-time framework that integrates a multimodal semantic critique mechanism with frequency-domain refinement to improve text-to-image consistency and detail. The proposed CritiCore module leverages a vision-language model and multiple large language models to enrich the prompt context and produce high-level semantic feedback, guiding the diffusion process to better align generated content with the prompt's intent. Additionally, SpecFusion merges intermediate generation states in the spectral domain, injecting coarse structural information while preserving high-frequency details. No additional model training is required. CritiFusion serves as a plug-in refinement stage compatible with existing diffusion backbones. Experiments on standard benchmarks show that our method notably improves human-aligned metrics of text-to-image correspondence and visual quality. CritiFusion consistently boosts performance on human preference scores and aesthetic evaluations, achieving results on par with state-of-the-art reward optimization approaches. Qualitative results further demonstrate superior detail, realism, and prompt fidelity, indicating the effectiveness of our semantic critique and spectral alignment strategy.
中文标题/摘要
标题:CritiFusion: 语义批评和频域对齐以实现忠实的文本到图像生成
近期的文本到图像扩散模型在视觉保真度方面取得了显著进展,但往往难以与复杂的提示实现语义对齐。我们提出了一种名为CritiFusion的新型推理时框架,该框架结合了多模态语义批评机制和频域细化,以提高文本到图像的一致性和细节。所提出的CritiCore模块利用视觉语言模型和多个大型语言模型来丰富提示上下文并生成高层次的语义反馈,引导扩散过程更好地与提示的意图对齐。此外,SpecFusion在频域中合并中间生成状态,注入粗略的结构信息同时保留高频细节。无需额外的模型训练。CritiFusion作为与现有扩散主干兼容的插件细化阶段。在标准基准上的实验表明,我们的方法显著提高了文本到图像对应和视觉质量的人类对齐指标。CritiFusion在人类偏好评分和美学评估中持续提升性能,达到与最先进的奖励优化方法相当的结果。定性结果进一步证明了我们的语义批评和频域对齐策略在细节、真实性和提示忠实度方面的优越性。
Summary / 总结
CritiFusion is a novel framework that enhances text-to-image generation by integrating a semantic critique mechanism and spectral alignment. It uses a vision-language model and multiple large language models to enrich prompt context and guide the diffusion process, ensuring better alignment with the prompt's intent. Additionally, it merges intermediate generation states in the spectral domain to preserve high-frequency details. Experiments show that CritiFusion improves human-aligned metrics and aesthetic evaluations, achieving results comparable to state-of-the-art reward optimization approaches.
CritiFusion 是一种新颖的推理时框架,通过集成多模态语义批评机制和频域细化来提升文本到图像生成。CritiCore 模块使用视觉语言和语言模型提供高层次的语义反馈,引导扩散过程更好地与提示的意图对齐。SpecFusion 在频域中合并中间生成状态,保留高频细节的同时增加粗略的结构信息。实验表明,CritiFusion 改进了人类对齐的指标、美学评估和偏好评分,达到了与最先进的奖励优化方法相当的结果。
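The idea of merging two states in the spectral domain, taking coarse structure from one and high-frequency detail from the other, can be sketched in one dimension as follows. This is an illustrative toy, not the SpecFusion module: the hard frequency cutoff, the 1D signals, and the function names are assumptions, and a naive DFT is used only to keep the example dependency-free.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * f * k / n) for k in range(n))
            for f in range(n)]


def idft(X):
    """Inverse DFT, returning the real part of each sample."""
    n = len(X)
    return [(sum(X[f] * cmath.exp(2j * cmath.pi * f * k / n) for f in range(n)) / n).real
            for k in range(n)]


def spectral_merge(coarse, detail, cutoff):
    """Take frequencies up to `cutoff` from `coarse` and the rest from `detail`."""
    n = len(coarse)
    C, D = dft(coarse), dft(detail)
    merged = []
    for f in range(n):
        freq = min(f, n - f)  # distance from DC; keeps the mask symmetric
        merged.append(C[f] if freq <= cutoff else D[f])
    return idft(merged)
```

With `coarse = [1.0] * 8` and `detail = [1.0, -1.0] * 4`, `spectral_merge(coarse, detail, 0)` keeps the constant level of the first signal and the alternating pattern of the second.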
Multimodal Fact-Checking: An Agent-based Approach
Authors: Danni Xu, Shaojing Fan, Harry Cheng, Mohan Kankanhalli
First: 2025-12-28T13:58:33+00:00 · Latest: 2025-12-31T09:37:15+00:00
Comments: Code and dataset will be released at https://github.com/xudanni0927/AgentFact
Abstract
The rapid spread of multimodal misinformation poses a growing challenge for automated fact-checking systems. Existing approaches, including large vision language models (LVLMs) and deep multimodal fusion methods, often fall short due to limited reasoning and shallow evidence utilization. A key bottleneck is the lack of dedicated datasets that provide complete real-world multimodal misinformation instances accompanied by annotated reasoning processes and verifiable evidence. To address this limitation, we introduce RW-Post, a high-quality and explainable dataset for real-world multimodal fact-checking. RW-Post aligns real-world multimodal claims with their original social media posts, preserving the rich contextual information in which the claims are made. In addition, the dataset includes detailed reasoning and explicitly linked evidence, which are derived from human written fact-checking articles via a large language model assisted extraction pipeline, enabling comprehensive verification and explanation. Building upon RW-Post, we propose AgentFact, an agent-based multimodal fact-checking framework designed to emulate the human verification workflow. AgentFact consists of five specialized agents that collaboratively handle key fact-checking subtasks, including strategy planning, high-quality evidence retrieval, visual analysis, reasoning, and explanation generation. These agents are orchestrated through an iterative workflow that alternates between evidence searching and task-aware evidence filtering and reasoning, facilitating strategic decision-making and systematic evidence analysis. Extensive experimental results demonstrate that the synergy between RW-Post and AgentFact substantially improves both the accuracy and interpretability of multimodal fact-checking.
中文标题/摘要
标题:基于代理的多模态事实核查:一种代理导向的方法
多模态错误信息的快速传播对自动化事实核查系统构成了日益严峻的挑战。现有的方法,包括大型视觉语言模型(LVLM)和深度多模态融合方法,往往由于推理能力有限和证据利用浅显而效果不佳。一个关键瓶颈是没有专门的数据集提供完整的现实世界多模态错误信息实例及其注释的推理过程和可验证的证据。为了解决这一限制,我们引入了RW-Post,这是一个高质量且可解释的现实世界多模态事实核查数据集。RW-Post将现实世界多模态声明与其原始社交媒体帖子对齐,保留了声明中丰富的上下文信息。此外,该数据集还包括详细的推理过程和明确链接的证据,这些证据是通过大型语言模型辅助提取管道从人类撰写的事实核查文章中提取出来的,从而实现全面的验证和解释。基于RW-Post,我们提出了AgentFact,这是一种代理导向的多模态事实核查框架,旨在模拟人类验证工作流程。AgentFact 包含五个专门的代理,它们协作处理关键的事实核查子任务,包括策略规划、高质量证据检索、视觉分析、推理和解释生成。这些代理通过迭代工作流协调,该工作流在证据搜索和任务感知证据过滤与推理之间交替进行,促进战略决策和系统性证据分析。广泛的实验结果表明,RW-Post 和 AgentFact 的协同作用显著提高了多模态事实核查的准确性和可解释性。
Summary / 总结
The paper addresses the challenge of automated fact-checking for multimodal misinformation by introducing RW-Post, a new dataset that aligns real-world multimodal claims with their original social media posts and provides detailed reasoning and linked evidence. Based on RW-Post, the authors propose AgentFact, a framework of five specialized agents orchestrated through an iterative workflow that alternates between evidence searching and task-aware evidence filtering and reasoning. Experimental results show that combining RW-Post with AgentFact substantially improves both the accuracy and interpretability of multimodal fact-checking.
论文通过引入RW-Post数据集,该数据集包含详细的推理和证据,来应对多模态虚假信息的自动事实核查挑战。基于RW-Post,作者提出了AgentFact,这是一种协作处理关键事实核查任务的基于代理的框架。实验结果表明,AgentFact在多模态事实核查的准确性和可解释性方面优于现有方法。
ColaVLA: Leveraging Cognitive Latent Reasoning for Hierarchical Parallel Trajectory Planning in Autonomous Driving
Authors: Qihang Peng, Xuesong Chen, Chenye Yang, Shaoshuai Shi, Hongsheng Li
First: 2025-12-28T14:06:37+00:00 · Latest: 2025-12-31T09:18:13+00:00
Comments: 11 pages, 4 figures. Project page: https://pqh22.github.io/projects/ColaVLA/index.html
Abstract
Autonomous driving requires generating safe and reliable trajectories from complex multimodal inputs. Traditional modular pipelines separate perception, prediction, and planning, while recent end-to-end (E2E) systems learn them jointly. Vision-language models (VLMs) further enrich this paradigm by introducing cross-modal priors and commonsense reasoning, yet current VLM-based planners face three key challenges: (i) a mismatch between discrete text reasoning and continuous control, (ii) high latency from autoregressive chain-of-thought decoding, and (iii) inefficient or non-causal planners that limit real-time deployment. We propose ColaVLA, a unified vision-language-action framework that transfers reasoning from text to a unified latent space and couples it with a hierarchical, parallel trajectory decoder. The Cognitive Latent Reasoner compresses scene understanding into compact, decision-oriented meta-action embeddings through ego-adaptive selection and only two VLM forward passes. The Hierarchical Parallel Planner then generates multi-scale, causality-consistent trajectories in a single forward pass. Together, these components preserve the generalization and interpretability of VLMs while enabling efficient, accurate and safe trajectory generation. Experiments on the nuScenes benchmark show that ColaVLA achieves state-of-the-art performance in both open-loop and closed-loop settings with favorable efficiency and robustness.
中文标题/摘要
标题:ColaVLA:利用认知潜在推理进行自主驾驶分层并行轨迹规划
自主驾驶需要从复杂的多模态输入中生成安全可靠的轨迹。传统模块化管道将感知、预测和规划分离,而最近的端到端(E2E)系统则联合学习它们。视觉语言模型(VLMs)进一步丰富了这一范式,通过引入跨模态先验和常识推理,但当前基于VLM的规划器面临三个关键挑战:(i)离散文本推理与连续控制之间的不匹配,(ii)自回归链式思考解码的高延迟,以及(iii)效率低下或非因果规划器,限制了实时部署。我们提出ColaVLA,这是一种统一的视觉语言行动框架,将推理从文本转移到统一的潜在空间,并与分层并行轨迹解码器耦合。认知潜在推理器通过自我适应选择将场景理解压缩为决策导向的元动作嵌入,仅需两次VLM前向传递。分层并行规划器则在单次前向传递中生成多尺度、因果一致的轨迹。这些组件共同保留了VLM的泛化能力和可解释性,同时实现了高效、准确和安全的轨迹生成。在nuScenes基准测试中,ColaVLA在开环和闭环设置中均实现了最先进的性能,具有有利的效率和鲁棒性。
Summary / 总结
ColaVLA addresses the challenges in VLM-based trajectory planning by introducing a unified vision-language-action framework that transfers reasoning into a latent space and couples it with a hierarchical, parallel trajectory decoder. This approach reduces the mismatch between text reasoning and continuous control, lowers latency, and enables efficient real-time deployment. Experiments on the nuScenes benchmark demonstrate that ColaVLA outperforms existing methods in both open-loop and closed-loop settings with improved efficiency and robustness.
ColaVLA 通过将认知潜空间推理整合到统一的视觉-语言-动作框架中,解决了生成安全可靠轨迹的挑战。它使用认知潜空间推理器将场景理解压缩为紧凑的元动作嵌入,并使用层次并行规划器生成多尺度、因果一致的轨迹。实验表明,ColaVLA 在效率和鲁棒性方面优于现有方法,同时保持了最先进的性能。
LSRE: Latent Semantic Rule Encoding for Real-Time Semantic Risk Detection in Autonomous Driving
Authors: Qian Cheng, Weitao Zhou, Cheng Jing, Nanshan Deng, Junze Wen, Zhaoyang Liu, Kun Jiang, Diange Yang
First: 2025-12-31T08:27:10+00:00 · Latest: 2025-12-31T08:27:10+00:00
Abstract
Real-world autonomous driving must adhere to complex human social rules that extend beyond legally codified traffic regulations. Many of these semantic constraints, such as yielding to emergency vehicles, complying with traffic officers' gestures, or stopping for school buses, are intuitive for humans yet difficult to encode explicitly. Although large vision-language models (VLMs) can interpret such semantics, their inference cost makes them impractical for real-time deployment. This work proposes LSRE, a Latent Semantic Rule Encoding framework that converts sparsely sampled VLM judgments into decision boundaries within the latent space of a recurrent world model. By encoding language-defined safety semantics into a lightweight latent classifier, LSRE enables real-time semantic risk assessment at 10 Hz without per-frame VLM queries. Experiments on six semantic-failure scenarios in CARLA demonstrate that LSRE attains semantic risk detection accuracy comparable to a large VLM baseline, while providing substantially earlier hazard anticipation and maintaining low computational latency. LSRE further generalizes to rarely seen semantic-similar test cases, indicating that language-guided latent classification offers an effective and deployable mechanism for semantic safety monitoring in autonomous driving.
中文标题/摘要
标题:LSRE:自主驾驶中的实时语义风险检测的潜在语义规则编码
现实世界中的自主驾驶必须遵守复杂的社会规则,这些规则超出了法律规定的交通法规。许多语义约束,如为紧急车辆让路、遵守交通警察的手势或为校车停车,对于人类来说是直观的,但很难明确编码。尽管大型视觉-语言模型(VLMs)可以解释这些语义,但它们的推理成本使其在实时部署中不切实际。本文提出了一种LSRE(潜在语义规则编码)框架,将稀疏采样的VLM判断转换为递归世界模型潜在空间中的决策边界。通过将语言定义的安全语义编码到轻量级的潜在分类器中,LSRE能够在10 Hz的频率下进行实时语义风险评估,而无需每帧查询VLM。在CARLA上的六个语义失败场景实验表明,LSRE在语义风险检测准确性方面与大型VLM基线相当,同时提供显著更早的危险预知,并保持较低的计算延迟。此外,LSRE还能够泛化到罕见的语义相似测试案例,表明语言引导的潜在分类为自主驾驶中的语义安全监控提供了一种有效且可部署的机制。
Summary / 总结
The research aims to address the challenge of real-time semantic risk detection in autonomous driving, where complex social rules beyond traffic regulations must be followed. LSRE, a Latent Semantic Rule Encoding framework, converts sparsely sampled VLM judgments into decision boundaries in a recurrent world model's latent space, enabling real-time semantic risk assessment at 10 Hz without querying VLMs per frame. Experiments on six semantic-failure scenarios in CARLA show that LSRE matches a large VLM baseline in accuracy while providing earlier hazard anticipation and low computational latency, and that it generalizes to rarely seen, semantically similar test cases.
研究旨在解决自动驾驶中实时语义风险检测的问题,其中必须遵守超出交通法规的复杂社会规则。LSRE(Latent Semantic Rule Encoding)框架将VLM判断转换为递归世界模型潜空间中的决策边界,从而在每帧不查询VLM的情况下实现10 Hz的实时语义风险评估。实验表明,LSRE在准确度上与大型VLM基线相当,但能更早地预见危险并保持较低的计算延迟,且能够很好地泛化到未见过的语义相似测试案例。
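One way to picture "converting sparse VLM judgments into a decision boundary in latent space" is a small classifier fitted on the latent vectors of the few frames for which a VLM label exists. The sketch below uses plain logistic regression for this. The abstract does not state the form of LSRE's classifier, so the linear model, the training loop, and all names here are assumptions for illustration only.

```python
import math


def _sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def train_latent_classifier(latents, labels, lr=0.5, epochs=200):
    """Fit logistic regression by batch gradient descent.

    latents : list of latent vectors for the sparsely sampled frames
    labels  : 1 if the (hypothetical) VLM judged the frame risky, else 0
    """
    dim, n = len(latents[0]), len(latents)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * dim, 0.0
        for z, y in zip(latents, labels):
            p = _sigmoid(sum(wi * zi for wi, zi in zip(w, z)) + b)
            for i in range(dim):
                gw[i] += (p - y) * z[i]   # gradient of the log loss w.r.t. w
            gb += p - y
        w = [wi - lr * gi / n for wi, gi in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def risk_score(w, b, z):
    """Per-frame risk probability; cheap enough to run on every frame."""
    return _sigmoid(sum(wi * zi for wi, zi in zip(w, z)) + b)
```

Once fitted, `risk_score` needs only a dot product per frame, which is the kind of cost that makes a fixed-rate monitor feasible without calling the VLM at run time.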
Evolving, Not Training: Zero-Shot Reasoning Segmentation via Evolutionary Prompting
Authors: Kai Ye, Xiaotong You, Jianghang Lin, Jiayi Ji, Pingyang Dai, Liujuan Cao
First: 2025-12-31T08:10:03+00:00 · Latest: 2025-12-31T08:10:03+00:00
Abstract
Reasoning Segmentation requires models to interpret complex, context-dependent linguistic queries to achieve pixel-level localization. Current dominant approaches rely heavily on Supervised Fine-Tuning (SFT) or Reinforcement Learning (RL). However, SFT suffers from catastrophic forgetting and domain dependency, while RL is often hindered by training instability and rigid reliance on predefined reward functions. Although recent training-free methods circumvent these training burdens, they are fundamentally limited by a static inference paradigm. These methods typically rely on a single-pass "generate-then-segment" chain, which suffers from insufficient reasoning depth and lacks the capability to self-correct linguistic hallucinations or spatial misinterpretations. In this paper, we challenge these limitations and propose EVOL-SAM3, a novel zero-shot framework that reformulates reasoning segmentation as an inference-time evolutionary search process. Instead of relying on a fixed prompt, EVOL-SAM3 maintains a population of prompt hypotheses and iteratively refines them through a "Generate-Evaluate-Evolve" loop. We introduce a Visual Arena to assess prompt fitness via reference-free pairwise tournaments, and a Semantic Mutation operator to inject diversity and correct semantic errors. Furthermore, a Heterogeneous Arena module integrates geometric priors with semantic reasoning to ensure robust final selection. Extensive experiments demonstrate that EVOL-SAM3 not only substantially outperforms static baselines but also significantly surpasses fully supervised state-of-the-art methods on the challenging ReasonSeg benchmark in a zero-shot setting. The code is available at https://github.com/AHideoKuzeA/Evol-SAM3.
中文标题/摘要
标题:演化而非训练:通过演化提示实现零样本推理分割
推理分割要求模型解释复杂的、上下文相关的语言查询以实现像素级定位。当前主流方法主要依赖监督微调(SFT)或强化学习(RL)。然而,SFT容易出现灾难性遗忘和领域依赖性问题,而RL则常常受到训练不稳定性及对预定义奖励函数的严格依赖的困扰。尽管最近的无训练方法绕过了这些训练负担,但它们本质上受限于静态推理范式。这些方法通常依赖于一次性的“生成-分割”链,这导致推理深度不足,缺乏自我纠正语言幻觉或空间误解的能力。在本文中,我们挑战这些限制并提出EVOL-SAM3,这是一种新颖的零样本框架,将推理分割重新定义为推理时的演化搜索过程。EVOL-SAM3 不依赖于固定提示,而是维护一组提示假设,并通过“生成-评估-演化”循环逐步优化它们。我们引入了视觉竞技场来通过无参考的成对比赛评估提示适应度,并引入语义变异操作来注入多样性并纠正语义错误。此外,异构竞技场模块将几何先验与语义推理结合,以确保最终选择的鲁棒性。大量实验表明,EVOL-SAM3 不仅在零样本设置下显著优于静态基线,还在具有挑战性的ReasonSeg基准测试中显著超越完全监督的最新方法。代码可在https://github.com/AHideoKuzeA/Evol-SAM3 获取。
Summary / 总结
This paper addresses the limitations of current reasoning segmentation methods by proposing EVOL-SAM3, which reformulates the task as an evolutionary search process at inference time. Instead of relying on a single prompt, it maintains a population of hypotheses and iteratively refines them through a 'Generate-Evaluate-Evolve' loop. The method uses a Visual Arena for prompt fitness assessment and a Semantic Mutation operator to correct semantic errors. Experiments show that EVOL-SAM3 outperforms both static baselines and fully supervised state-of-the-art methods on the ReasonSeg benchmark in a zero-shot setting.
论文针对当前依赖监督微调或强化学习的推理分割方法的局限性,提出了EVOL-SAM3框架。该框架将推理分割重新定义为推理时的进化搜索过程,维护一个提示假设的群体,并通过“生成-评估-进化”循环迭代优化。方法使用视觉竞技场进行提示适应性评估,并使用语义变异操作注入多样性并纠正语义错误。实验表明,EVOL-SAM3在零样本设置下不仅优于静态基线,还在挑战性的ReasonSeg基准上显著超越完全监督的最新方法。
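The "Generate-Evaluate-Evolve" loop described in this entry can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `judge` and `mutate` are hypothetical callables standing in for the roles the abstract assigns to the Visual Arena (reference-free pairwise comparison) and the Semantic Mutation operator, and the toy versions below only count target words in a string.

```python
import random
from typing import Callable, Dict, List


def rank_by_tournament(pop: List[str], judge: Callable[[str, str], str]) -> List[str]:
    """Rank prompts by pairwise wins; only comparisons are used, no reference mask."""
    wins: Dict[str, int] = {p: 0 for p in pop}
    for i in range(len(pop)):
        for j in range(i + 1, len(pop)):
            wins[judge(pop[i], pop[j])] += 1
    return sorted(pop, key=lambda p: wins[p], reverse=True)


def evolve_prompts(
    population: List[str],
    judge: Callable[[str, str], str],   # hypothetical: returns the preferred prompt
    mutate: Callable[[str], str],       # hypothetical: returns a perturbed prompt
    generations: int = 3,
    seed: int = 0,
) -> str:
    rng = random.Random(seed)
    pop = list(population)
    for _ in range(generations):
        ranked = rank_by_tournament(pop, judge)           # Evaluate
        survivors = ranked[: max(1, len(ranked) // 2)]    # keep the top half
        children = [mutate(rng.choice(survivors))         # Evolve (refill population)
                    for _ in range(len(pop) - len(survivors))]
        pop = survivors + children
    return rank_by_tournament(pop, judge)[0]


# Toy stand-ins (assumptions for illustration only).
TARGET_WORDS = {"red", "mug", "left"}


def toy_score(prompt: str) -> int:
    return len(TARGET_WORDS & set(prompt.split()))


def toy_judge(a: str, b: str) -> str:
    return a if toy_score(a) >= toy_score(b) else b


def toy_mutate(prompt: str) -> str:
    return prompt if "left" in prompt.split() else prompt + " left"


best = evolve_prompts(["the mug", "the red mug", "a cup"], toy_judge, toy_mutate)
```

In a real system the judge would compare the segmentation masks produced by two prompts rather than the prompt strings themselves; the population-and-tournament structure is the only part this sketch is meant to convey.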
Less is More: Improving LLM Reasoning with Minimal Test-Time Intervention
Authors: Zhen Yang, Mingyang Zhang, Feng Chen, Ganggui Ding, Liang Hou, Xin Tao, Pengfei Wan, Ying-Cong Chen
First: 2025-10-15T17:59:45+00:00 · Latest: 2025-12-31T07:36:24+00:00
Comments: Code: https://github.com/EnVision-Research/MTI
Abstract
Recent progress in large language models (LLMs) has focused on test-time scaling to improve reasoning via increased inference computation, but often at the cost of efficiency. We revisit test-time behavior and uncover a simple yet underexplored phenomenon: reasoning uncertainty is highly localized, with only a small subset of high-entropy tokens dominantly affecting output correctness. Motivated by this, we propose Minimal Test-Time Intervention (MTI), a training-free framework that enhances reasoning accuracy and stability with minimal overhead. MTI includes: (i) Selective CFG intervention, applying classifier-free guidance only at uncertain positions; and (ii) Lightweight negative-prompt guidance, reusing the main model's KV cache to approximate unconditional decoding efficiently. MTI yields consistent gains across general, coding, and STEM tasks (e.g., +9.28% average improvement on six benchmarks for DeepSeek-R1-7B and +11.25% on AIME2024 using Ling-mini-2.0) while remaining highly efficient.
中文标题/摘要
标题:少即是多:通过最小化测试时干预提高LLM推理能力
大型语言模型(LLMs)的近期进展集中在通过增加推理计算来提高测试时的推理能力,但往往以效率为代价。我们重新审视测试时的行为,并揭示了一个简单但尚未充分探索的现象:推理不确定性是高度局部化的,只有少量高熵令牌主要影响输出的正确性。受此启发,我们提出了最小化测试时干预(MTI),这是一种无需训练的框架,通过最小的开销来增强推理准确性和稳定性。MTI 包括:(i) 选择性CFG干预,仅在不确定位置应用无分类器引导;(ii) 轻量级负提示引导,复用主模型的KV缓存高效地近似无条件解码。MTI 在通用任务、编程任务和STEM任务中均表现出一致的改进,例如,DeepSeek-R1-7B在六个基准上的平均改进为+9.28%,AIME2024使用Ling-mini-2.0时为+11.25%,同时保持高度高效。
Summary / 总结
This paper addresses the inefficiency of test-time scaling in large language models (LLMs) by proposing Minimal Test-Time Intervention (MTI), which enhances reasoning accuracy and stability with minimal overhead. MTI applies classifier-free guidance selectively at uncertain (high-entropy) positions and uses lightweight negative-prompt guidance that reuses the main model's KV cache. The method yields consistent gains across general, coding, and STEM tasks, e.g., a +9.28% average improvement on six benchmarks for DeepSeek-R1-7B and +11.25% on AIME2024 using Ling-mini-2.0, while remaining efficient.
本文通过提出Minimal Test-Time Intervention (MTI)来解决大型语言模型(LLMs)测试时扩展的低效率问题,MTI以最小的开销提升推理准确性和稳定性。MTI在不确定位置选择性地应用无分类器引导,并利用轻量级的负提示引导来高效地近似无条件解码。实验结果显示,MTI在各种任务上表现出一致的改进,例如在六个基准测试中DeepSeek-R1-7B的平均改进为9.28%,在AIME2024中使用Ling-mini-2.0的改进为11.25%,同时保持高效率。
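To make the idea of selective classifier-free guidance concrete, here is a small sketch that applies the standard CFG combination of conditional and unconditional logits only when the entropy of the conditional next-token distribution exceeds a threshold. The function names, the threshold, and the guidance scale are illustrative assumptions; the KV-cache reuse for negative-prompt guidance mentioned in the abstract is not modeled here.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)                              # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def selective_cfg(
    cond_logits: List[float],
    uncond_logits: List[float],
    guidance_scale: float = 2.0,      # assumed value, for illustration
    entropy_threshold: float = 1.0,   # assumed value, for illustration
) -> List[float]:
    # Confident position: leave the conditional logits untouched.
    if entropy(softmax(cond_logits)) <= entropy_threshold:
        return list(cond_logits)
    # Uncertain position: standard CFG extrapolation away from the unconditional logits.
    return [u + guidance_scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


confident = selective_cfg([10.0, 0.0, 0.0], [0.0, 0.0, 0.0])
uncertain = selective_cfg([1.0, 0.9, 0.8], [0.0, 0.9, 0.8])
```

Because guidance is skipped at low-entropy positions, the extra unconditional pass is only needed for the small fraction of tokens where the model is uncertain, which is where the efficiency argument in the abstract comes from.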
ProCache: Constraint-Aware Feature Caching with Selective Computation for Diffusion Transformer Acceleration
Authors: Fanpu Cao, Yaofo Chen, Zeng You, Wei Luo, Cen Chen
Venue: AAAI 2026 poster
First: 2025-12-19T07:27:19+00:00 · Latest: 2025-12-31T06:37:00+00:00
Comments: Accepted for poster presentation at AAAI 2026
Abstract
Diffusion Transformers (DiTs) have achieved state-of-the-art performance in generative modeling, yet their high computational cost hinders real-time deployment. While feature caching offers a promising training-free acceleration solution by exploiting temporal redundancy, existing methods suffer from two key limitations: (1) uniform caching intervals fail to align with the non-uniform temporal dynamics of DiT, and (2) naive feature reuse with excessively large caching intervals can lead to severe error accumulation. In this work, we analyze the evolution of DiT features during denoising and reveal that both feature changes and error propagation are highly time- and depth-varying. Motivated by this, we propose ProCache, a training-free dynamic feature caching framework that addresses these issues via two core components: (i) a constraint-aware caching pattern search module that generates non-uniform activation schedules through offline constrained sampling, tailored to the model's temporal characteristics; and (ii) a selective computation module that selectively computes within deep blocks and high-importance tokens for cached segments to mitigate error accumulation with minimal overhead. Extensive experiments on PixArt-alpha and DiT demonstrate that ProCache achieves up to 1.96x and 2.90x acceleration with negligible quality degradation, significantly outperforming prior caching-based methods.
中文标题/摘要
标题:ProCache:基于约束的特征缓存与选择性计算以加速扩散变换器
扩散变换器(DiTs)在生成建模中取得了最先进的性能,但其高昂的计算成本阻碍了实时部署。虽然特征缓存通过利用时间冗余提供了一种无训练的加速解决方案,但现有方法存在两个关键局限性:(1)均匀的缓存间隔无法与DiT的时间非均匀动态对齐;(2)使用过大的缓存间隔进行简单的特征重用会导致严重的误差累积。在本文中,我们分析了去噪过程中DiT特征的演变,发现特征变化和误差传播在时间和深度上都高度变化。受此启发,我们提出了一种基于约束的动态特征缓存框架ProCache,通过两个核心组件解决了这些问题:(i)一种约束感知的缓存模式搜索模块,通过离线约束采样生成非均匀激活时间表,以适应模型的时间特性;(ii)一种选择性计算模块,在深层块和高重要性标记中选择性地计算缓存段,以最小化误差累积,同时减少开销。在PixArt-alpha和DiT上的广泛实验表明,ProCache在几乎不降低质量的情况下实现了高达1.96倍和2.90倍的加速,显著优于先前的基于缓存的方法。
Summary / 总结
ProCache is a training-free dynamic feature caching framework designed to accelerate Diffusion Transformers (DiTs) by addressing the limitations of uniform caching intervals and naive feature reuse. It uses a constraint-aware caching pattern search module to generate non-uniform activation schedules and a selective computation module that recomputes only deep blocks and high-importance tokens within cached segments to limit error accumulation. Experiments on PixArt-alpha and DiT show that ProCache achieves up to 1.96x and 2.90x acceleration with negligible quality degradation, significantly outperforming prior caching-based methods.
ProCache 是一个无需训练的动态特征缓存框架,旨在通过解决均匀缓存间隔和特征重用的简单方法带来的问题来加速扩散变换器(DiTs)。它使用一个约束感知的缓存模式搜索模块生成非均匀的激活时间表,并使用一个选择性计算模块来最小化误差累积。实验表明,ProCache 可以实现高达 1.96 倍和 2.90 倍的加速,同时保持质量基本不变,显著优于之前的缓存基方法。
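The effect of a non-uniform caching schedule can be illustrated with a toy denoising loop that recomputes an expensive block only at scheduled steps and reuses the cached output elsewhere. Everything here is an assumption made for illustration: `toy_block` is a stand-in for a DiT block, the schedule `{0, 1, 2, 5, 9}` is hand-picked rather than found by the constrained search the abstract describes, and the selective computation module is not modeled.

```python
from typing import Callable, Set, Tuple


def denoise_with_cache(
    x0: float,
    num_steps: int,
    compute_steps: Set[int],
    block: Callable[[float, int], float],
) -> Tuple[float, int]:
    """Run a toy denoising loop; return the final state and the number of block calls."""
    x = x0
    cached = None
    calls = 0
    for t in range(num_steps):
        if cached is None or t in compute_steps:
            cached = block(x, t)   # full computation, refresh the cache
            calls += 1
        x = x + cached             # otherwise reuse the stale cached feature
    return x, calls


def toy_block(x: float, t: int) -> float:
    # Hypothetical stand-in for an expensive transformer block.
    return -0.1 * x


x_full, calls_full = denoise_with_cache(1.0, 10, set(range(10)), toy_block)
# Dense early, sparse late: a hand-picked non-uniform schedule.
x_fast, calls_fast = denoise_with_cache(1.0, 10, {0, 1, 2, 5, 9}, toy_block)
```

In this toy setting the cached run uses half as many block calls, at the cost of a small deviation from the fully computed result; the deviation grows with the length of each reuse interval, which is the error accumulation the abstract refers to.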
CoT-PL: Visual Chain-of-Thought Reasoning Meets Pseudo-Labeling for Open-Vocabulary Object Detection
Authors: Hojun Choi, Youngsun Lim, Jaeyo Shin, Hyunjung Shim
First: 2025-10-16T15:27:10+00:00 · Latest: 2025-12-31T05:45:29+00:00
Comments: 28 pages, 13 Figures, 12 Tables
Abstract
Open-vocabulary object detection (OVD) seeks to recognize and localize object categories beyond those seen during training. Recent approaches typically leverage vision-language models (VLMs) to generate pseudo-labels using image-text alignment, allowing detectors to generalize to unseen classes without explicit supervision. However, these methods depend heavily on direct image-text matching, neglecting the intermediate reasoning steps essential for interpreting semantically complex scenes. This results in limited robustness when confronted with crowded or occluded visual contexts. In this paper, we introduce CoT-PL, a new framework that integrates structured visual chain-of-thought (CoT) reasoning into the pseudo-labeling process. CoT-PL decomposes object understanding into three interpretable steps: (1) region perception even for unseen objects, (2) category recognition via zero-shot reasoning, and (3) background grounding to separate semantically complex objects. Crucially, the third step naturally motivates our contrastive background learning (CBL) that uses the pre-computed background cues as negatives to promote feature disentanglement between objects and background. In this way, CoT reasoning and CBL form an integrated pipeline tailored to robust pseudo-labeling in crowded or occluded scenes. Notably, in these two settings, our novel-class pseudo-label quality achieves relative improvements of 103.4% and 168.4% over the best prior method, respectively. Our extensive experiments demonstrate that CoT-PL achieves +7.7 AP50 on open-vocabulary COCO and +2.9 mask AP on LVIS for novel classes, setting a new state of the art. Code and models are available at https://github.com/hchoi256/cotpl.
中文标题/摘要
标题:CoT-PL:视觉链式思考推理与伪标签结合的开放词汇目标检测
开放词汇目标检测(OVD)旨在识别和定位训练期间未见过的对象类别。最近的方法通常利用视觉语言模型(VLMs)通过图像-文本对齐生成伪标签,使检测器能够在没有显式监督的情况下泛化到未见过的类别。然而,这些方法高度依赖直接的图像-文本匹配,忽视了解释语义复杂场景所必需的中间推理步骤。这导致在拥挤或遮挡的视觉上下文中表现有限。本文提出了一种新的框架CoT-PL,该框架将结构化的视觉链式思考(CoT)推理融入伪标签过程。CoT-PL将对象理解分解为三个可解释的步骤:(1)即使对于未见过的对象也能感知区域,(2)通过零样本推理进行类别识别,(3)背景定位以分离语义复杂的对象。最关键的是,第三步自然地促使我们使用预计算的背景线索作为负样本,以促进对象和背景之间的特征解耦。这样,CoT推理和CBL形成了一条集成的流水线,专门针对拥挤或遮挡场景中的稳健伪标签生成。值得注意的是,在这两种情况下,我们对新类伪标签的质量分别相对于最佳先验提高了103.4%和168.4%。我们的大量实验表明,CoT-PL在开放词汇COCO上实现了+7.7 AP50,在LVIS上实现了+2.9掩码AP的新最佳性能。代码和模型可在https://github.com/hchoi256/cotpl获取。
Summary / 总结
The paper introduces CoT-PL, a framework that integrates visual chain-of-thought reasoning into pseudo-labeling for open-vocabulary object detection. It decomposes object understanding into region perception, zero-shot category recognition, and background grounding. The background grounding step uses contrastive background learning (CBL) to improve feature disentanglement. Experiments show that CoT-PL significantly improves pseudo-label quality in crowded or occluded scenes, achieving state-of-the-art results with +7.7 AP50 on open-vocabulary COCO and +2.9 mask AP on LVIS for novel classes.
CoT-PL 提出了一种新的开放词汇对象检测框架,该框架将结构化的视觉链式推理和对比背景学习集成到伪标签生成过程中。该方法将对象理解分解为区域感知、零样本类别识别和背景分离三个步骤,有助于在拥挤或遮挡场景中实现稳健的伪标签生成。该方法显著提高了新类别伪标签的质量,在这些设置中分别比之前的方法提高了103.4%和168.4%,并在开放词汇 COCO 和 LVIS 数据集上达到了新的最佳性能。
Improving Few-Shot Change Detection Visual Question Answering via Decision-Ambiguity-guided Reinforcement Fine-Tuning
Authors: Fuyu Dong, Ke Li, Di Wang, Nan Luo, Yiming Zhang, Kaiyu Li, Jianfei Yang, Quan Wang
First: 2025-12-31T03:28:17+00:00 · Latest: 2025-12-31T03:28:17+00:00
Abstract
Change detection visual question answering (CDVQA) requires answering text queries by reasoning about semantic changes in bi-temporal remote sensing images. A straightforward approach is to boost CDVQA performance with generic vision-language models via supervised fine-tuning (SFT). Despite recent progress, we observe that a significant portion of failures do not stem from clearly incorrect predictions, but from decision ambiguity, where the model assigns similar confidence to the correct answer and strong distractors. To formalize this challenge, we define Decision-Ambiguous Samples (DAS) as instances with a small probability margin between the ground-truth answer and the most competitive alternative. We argue that explicitly optimizing DAS is crucial for improving the discriminability and robustness of CDVQA models. To this end, we propose DARFT, a Decision-Ambiguity-guided Reinforcement Fine-Tuning framework that first mines DAS using an SFT-trained reference policy and then applies group-relative policy optimization on the mined subset. By leveraging multi-sample decoding and intra-group relative advantages, DARFT suppresses strong distractors and sharpens decision boundaries without additional supervision. Extensive experiments demonstrate consistent gains over SFT baselines, particularly under few-shot settings.
中文标题/摘要
标题:通过决策模糊性引导的强化微调改进少样本变化检测视觉问答
变化检测视觉问答(CDVQA)需要通过推理双时相遥感图像中的语义变化来回答文本查询。一种直接的方法是通过监督微调(SFT)利用通用视觉语言模型提升CDVQA性能。尽管取得了近期进展,我们观察到,大量失败并非源自明显错误的预测,而是决策模糊性,即模型对正确答案和强干扰项赋予了相似的置信度。为了正式化这一挑战,我们将决策模糊样本(DAS)定义为真实答案与最竞争替代方案之间概率差距较小的实例。我们认为,明确优化DAS对于提高CDVQA模型的可区分性和鲁棒性至关重要。为此,我们提出了DARFT框架,该框架首先使用SFT训练的参考策略挖掘DAS,然后在挖掘的子集上应用组相对策略优化。通过利用多样本解码和组内相对优势,DARFT抑制了强干扰项并细化了决策边界,而无需额外监督。大量实验表明,DARFT在少样本设置中相对于SFT基线具有一致的改进。
Summary / 总结
The paper addresses the challenge of decision ambiguity in change detection visual question answering (CDVQA) by defining Decision-Ambiguous Samples (DAS) and proposing DARFT, a Decision-Ambiguity-guided Reinforcement Fine-Tuning framework. This framework mines DAS using an SFT-trained reference policy and applies group-relative policy optimization to improve model discriminability and robustness. Experiments show consistent improvements over supervised fine-tuning baselines, especially in few-shot settings.
论文通过定义决策模糊样本(DAS)并提出决策模糊引导强化微调(DARFT)框架来解决变化检测视觉问答(CDVQA)中的决策模糊问题。该框架使用SFT训练的参考策略挖掘DAS,并应用组相对策略优化以提高模型的可区分性和鲁棒性。实验结果显示,在少量样本设置中,DARFT框架相对于监督微调基线有持续的改进。
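The two ingredients named in this entry, mining Decision-Ambiguous Samples by probability margin and computing intra-group relative advantages, can be sketched as below. This is a simplified illustration under assumptions: the margin threshold `tau`, the use of the absolute margin, the sample dictionary layout, and the mean/standard-deviation normalization (in the style commonly used for group-relative policy optimization) are not taken from the paper.

```python
import statistics
from typing import Dict, List


def probability_margin(probs: Dict[str, float], truth: str) -> float:
    """Ground-truth probability minus the strongest competing answer's probability."""
    best_other = max(p for answer, p in probs.items() if answer != truth)
    return probs[truth] - best_other


def mine_das(samples: List[dict], tau: float = 0.1) -> List[dict]:
    # Assumption: a sample is "decision-ambiguous" when |margin| < tau.
    return [s for s in samples if abs(probability_margin(s["probs"], s["truth"])) < tau]


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of sampled answers."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


samples = [
    {"id": "ambiguous", "probs": {"yes": 0.52, "no": 0.48}, "truth": "yes"},
    {"id": "confident", "probs": {"yes": 0.90, "no": 0.10}, "truth": "yes"},
]
das = mine_das(samples)
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Only the mined subset would then be used for reinforcement fine-tuning, with each sampled answer's advantage measured relative to the other answers drawn for the same question.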
Understanding and Steering the Cognitive Behaviors of Reasoning Models at Test-Time
Authors: Zhenyu Zhang, Xiaoxia Wu, Zhongzhu Zhou, Qingyang Wu, Yineng Zhang, Pragaash Ponnusamy, Harikaran Subbaraj, Jue Wang, Shuaiwen Leon Song, Ben Athiwaratkun
First: 2025-12-31T02:46:04+00:00 · Latest: 2025-12-31T02:46:04+00:00
Abstract
Large Language Models (LLMs) often rely on long chain-of-thought (CoT) reasoning to solve complex tasks. While effective, these trajectories are frequently inefficient, leading to high latency from excessive token generation, or unstable reasoning that alternates between underthinking (shallow, inconsistent steps) and overthinking (repetitive, verbose reasoning). In this work, we study the structure of reasoning trajectories and uncover specialized attention heads that correlate with distinct cognitive behaviors such as verification and backtracking. By lightly intervening on these heads at inference time, we can steer the model away from inefficient modes. Building on this insight, we propose CREST, a training-free method for Cognitive REasoning Steering at Test-time. CREST has two components: (1) an offline calibration step that identifies cognitive heads and derives head-specific steering vectors, and (2) an inference-time procedure that rotates hidden representations to suppress components along those vectors. CREST adaptively suppresses unproductive reasoning behaviors, yielding both higher accuracy and lower computational cost. Across diverse reasoning benchmarks and models, CREST improves accuracy by up to 17.5% while reducing token usage by 37.6%, offering a simple and effective pathway to faster, more reliable LLM reasoning.
中文标题/摘要
标题:理解与引导测试时推理模型的认知行为
大型语言模型(LLMs)通常依赖长链推理(CoT)来解决复杂任务。虽然有效,但这些路径往往效率低下,导致因过度生成标记而产生高延迟,或者产生不稳定推理,交替出现浅层且不一致的推理和冗长且重复的推理。在本工作中,我们研究了推理路径的结构,并发现与验证和回溯等不同认知行为相关的专门注意头。通过在推理时轻柔地干预这些头,我们可以引导模型远离低效模式。基于这一见解,我们提出了CREST,一种无需训练的方法,用于测试时的认知推理引导。CREST有两个组成部分:(1)一个离线校准步骤,用于识别认知头并推导出特定于头的引导向量,以及(2)一个推理时的程序,用于旋转隐藏表示以抑制沿这些向量的成分。CREST自适应地抑制无生产力的推理行为,从而提高准确性和降低计算成本。在各种推理基准和模型上,CREST将准确率提高了最多17.5%,同时减少了37.6%的标记使用量,提供了一条简单而有效的快速、可靠LLM推理途径。
Summary / 总结
This work addresses the inefficiencies in long chain-of-thought reasoning used by large language models, which can lead to high latency and unstable reasoning. The authors identify specialized attention heads that correlate with cognitive behaviors like verification and backtracking. They propose CREST, a training-free method that calibrates these heads offline and suppresses unproductive reasoning at inference time, improving accuracy and reducing computational cost. Across various benchmarks, CREST enhances accuracy by up to 17.5% and reduces token usage by 37.6%.
这项工作解决了大型语言模型中长链推理的低效问题,可能导致高延迟和不稳定推理。作者识别了与认知行为如验证和回溯相关的特定注意力头。他们提出了一种名为CREST的无训练方法,在推理时通过旋转隐藏表示来引导模型远离低效推理。CREST在各种推理基准和模型上提高了高达17.5%的准确性,并减少了37.6%的令牌使用量,提供了一种简单而有效的增强LLM性能的方法。
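One simple way to picture "suppressing components along a steering vector" is to project out part of a hidden state's component in that direction and then rescale to the original norm, so that the operation changes direction but not magnitude. This is only a geometric sketch under that assumption; how CREST actually identifies heads, derives vectors, and performs the rotation is not specified in the abstract, and `alpha` is a hypothetical strength parameter.

```python
import math
from typing import List


def norm(v: List[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def suppress_direction(h: List[float], v: List[float], alpha: float = 1.0) -> List[float]:
    """Remove a fraction `alpha` of h's component along v, preserving h's norm."""
    v_norm = norm(v)
    unit = [x / v_norm for x in v]
    coeff = sum(a * b for a, b in zip(h, unit))        # component of h along v
    out = [a - alpha * coeff * b for a, b in zip(h, unit)]
    out_norm = norm(out)
    if out_norm == 0.0:                                # h was parallel to v
        return out
    scale = norm(h) / out_norm                         # restore the original magnitude
    return [x * scale for x in out]


steered = suppress_direction([3.0, 4.0], [1.0, 0.0])
```

With `alpha` below 1 the component along the steering vector is only partially reduced, which gives a knob for how strongly a behavior is discouraged.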
PhyGDPO: Physics-Aware Groupwise Direct Preference Optimization for Physically Consistent Text-to-Video Generation
Authors: Yuanhao Cai, Kunpeng Li, Menglin Jia, Jialiang Wang, Junzhe Sun, Feng Liang, Weifeng Chen, Felix Juefei-Xu, Chu Wang, Ali Thabet, Xiaoliang Dai, Xuan Ju, Alan Yuille, Ji Hou
First: 2025-12-31T01:19:14+00:00 · Latest: 2025-12-31T01:19:14+00:00
Abstract
Recent advances in text-to-video (T2V) generation have achieved good visual quality, yet synthesizing videos that faithfully follow physical laws remains an open challenge. Existing methods mainly based on graphics or prompt extension struggle to generalize beyond simple simulated environments or learn implicit physical reasoning. The scarcity of training data with rich physics interactions and phenomena is also a problem. In this paper, we first introduce a Physics-Augmented video data construction Pipeline, PhyAugPipe, that leverages a vision-language model (VLM) with chain-of-thought reasoning to collect a large-scale training dataset, PhyVidGen-135K. Then we formulate a principled Physics-aware Groupwise Direct Preference Optimization, PhyGDPO, framework that builds upon the groupwise Plackett-Luce probabilistic model to capture holistic preferences beyond pairwise comparisons. In PhyGDPO, we design a Physics-Guided Rewarding (PGR) scheme that embeds VLM-based physics rewards to steer optimization toward physical consistency. We also propose a LoRA-Switch Reference (LoRA-SR) scheme that eliminates memory-heavy reference duplication for efficient training. Experiments show that our method significantly outperforms state-of-the-art open-source methods on PhyGenBench and VideoPhy2. Please check our project page at https://caiyuanhao1998.github.io/project/PhyGDPO for more video results. Our code, models, and data will be released at https://github.com/caiyuanhao1998/Open-PhyGDPO
中文标题/摘要
标题:PhyGDPO:物理感知的组内直接偏好优化方法以实现物理一致的文本到视频生成
文本到视频(T2V)生成的最新进展在视觉质量方面取得了良好的成果,但合成严格遵循物理定律的视频仍然是一个开放的挑战。现有方法主要基于图形或提示扩展,难以泛化到复杂的模拟环境或学习隐式的物理推理。训练数据中缺乏丰富的物理交互和现象也是一个问题。在本文中,我们首先引入了一个物理增强的视频数据构建流水线PhyAugPipe,利用具有链式推理的视觉语言模型(VLM)收集大规模训练数据集PhyVidGen-135K。然后,我们提出了一个基于组内Plackett-Luce概率模型的物理感知的组内直接偏好优化PhyGDPO框架,以捕捉超越成对比较的整体偏好。在PhyGDPO中,我们设计了一种物理引导奖励(PGR)方案,将基于VLM的物理奖励嵌入其中,以引导优化向物理一致性方向发展。我们还提出了一种LoRA-Switch参考(LoRA-SR)方案,以消除内存密集型的参考重复,实现高效的训练。实验表明,我们的方法在PhyGenBench和VideoPhy2上显著优于最先进的开源方法。请访问我们的项目页面https://caiyuanhao1998.github.io/project/PhyGDPO获取更多视频结果。我们的代码、模型和数据将在https://github.com/caiyuanhao1998/Open-PhyGDPO发布
Summary / 总结
This paper addresses the challenge of generating physically consistent videos from text descriptions. It introduces PhyAugPipe, a data construction pipeline that uses a vision-language model with chain-of-thought reasoning to collect the PhyVidGen-135K training dataset, and PhyGDPO, a framework that optimizes video generation using groupwise preferences (built on the Plackett-Luce model) and VLM-based physics rewards. The method significantly outperforms state-of-the-art open-source methods on the PhyGenBench and VideoPhy2 benchmarks, demonstrating improved physical consistency in generated videos.
本文解决了从文本描述生成物理一致视频的挑战。它引入了PhyAugPipe,一种使用视觉语言模型的数据增强管道,以及PhyGDPO框架,该框架基于群体偏好和物理奖励优化视频生成。该方法在PhyGenBench和VideoPhy2基准测试中显著优于现有开源方法。
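For readers unfamiliar with the Plackett-Luce model mentioned in this entry, the sketch below computes the log-probability of a full ranking given one scalar score per item. It shows the generic model only; how PhyGDPO derives scores from the policy and reference models, and how the physics rewards enter, is not covered, and the example scores are arbitrary.

```python
import itertools
import math
from typing import Sequence


def plackett_luce_log_prob(scores: Sequence[float], ranking: Sequence[int]) -> float:
    """Log-probability of `ranking` (item indices, best first) under Plackett-Luce."""
    log_prob = 0.0
    for k in range(len(ranking)):
        remaining = [scores[i] for i in ranking[k:]]
        m = max(remaining)                                   # stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(s - m) for s in remaining))
        log_prob += scores[ranking[k]] - log_norm            # pick item k among the rest
    return log_prob


# With two items the model reduces to the pairwise (Bradley-Terry) case.
pairwise = plackett_luce_log_prob([1.0, 0.0], [0, 1])

# Probabilities over all rankings of three items sum to one.
example_scores = [0.3, -1.2, 2.0]
total_prob = sum(
    math.exp(plackett_luce_log_prob(example_scores, perm))
    for perm in itertools.permutations(range(3))
)
```

The groupwise view scores an entire ordering of several candidates at once, which is what allows a preference objective to go beyond pairwise comparisons.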
Training-Free Color-Aware Adversarial Diffusion Sanitization for Diffusion Stegomalware Defense at Security Gateways
Authors: Vladimir Frants, Sos Agaian
First: 2025-12-30T22:53:33+00:00 · Latest: 2025-12-30T22:53:33+00:00
Abstract
The rapid expansion of generative AI has normalized large-scale synthetic media creation, enabling new forms of covert communication. Recent generative steganography methods, particularly those based on diffusion models, can embed high-capacity payloads without fine-tuning or auxiliary decoders, creating significant challenges for detection and remediation. Coverless diffusion-based techniques are difficult to counter because they generate image carriers directly from secret data, enabling attackers to deliver stegomalware for command-and-control, payload staging, and data exfiltration while bypassing detectors that rely on cover-stego discrepancies. This work introduces Adversarial Diffusion Sanitization (ADS), a training-free defense for security gateways that neutralizes hidden payloads rather than detecting them. ADS employs an off-the-shelf pretrained denoiser as a differentiable proxy for diffusion-based decoders and incorporates a color-aware, quaternion-coupled update rule to reduce artifacts under strict distortion limits. Under a practical threat model and in evaluation against the state-of-the-art diffusion steganography method Pulsar, ADS drives decoder success rates to near zero with minimal perceptual impact. Results demonstrate that ADS provides a favorable security-utility trade-off compared to standard content transformations, offering an effective mitigation strategy against diffusion-driven steganography.
中文标题/摘要
标题:无需训练的色彩感知对抗扩散净化:安全网关中的扩散隐秘软件防御
生成式AI的迅速发展使大规模合成媒体的创建变得司空见惯,开启了新的隐蔽通信形式。最近基于扩散模型的生成式隐写术方法可以在不进行微调或辅助解码器的情况下嵌入高容量载荷,给检测和修复带来了巨大挑战。无伪装的基于扩散的技术难以对抗,因为它们直接从秘密数据生成图像载体,使攻击者能够绕过依赖伪装-隐写术差异的检测器,将隐秘软件用于命令与控制、载荷部署和数据泄露。本文提出了一种无需训练的对抗扩散净化(ADS),这是一种安全网关的防御方法,旨在中和隐藏的载荷而非检测它们。ADS 使用现成的预训练去噪器作为可微代理扩散解码器,并结合色彩感知的四元数耦合更新规则,在严格失真限制下减少伪影。在实际威胁模型下,与最先进的扩散隐写术方法Pulsar进行评估,ADS 将解码成功率驱动至接近零,同时对感知影响最小。结果表明,ADS 提供了与标准内容转换相比更有利的安全-效用权衡,为对抗驱动的隐写术提供了一种有效的缓解策略。
Summary / 总结
This work addresses diffusion-based steganography, which can covertly embed high-capacity payloads without fine-tuning or auxiliary decoders. It introduces Adversarial Diffusion Sanitization (ADS), a training-free defense for security gateways that neutralizes hidden payloads rather than detecting them, using an off-the-shelf pretrained denoiser as a differentiable proxy for diffusion-based decoders. ADS incorporates a color-aware, quaternion-coupled update rule to reduce artifacts under strict distortion limits. In evaluation against the Pulsar steganography method, ADS drives decoder success rates to near zero with minimal perceptual impact, providing a favorable security-utility trade-off compared to standard content transformations.
这项工作旨在检测和缓解基于扩散模型的隐写术,该隐写术直接从秘密数据中嵌入高容量载荷且无需微调。它引入了对抗扩散净化(ADS),这是一种无需训练的防御方法,使用现成的预训练去噪器作为扩散解码器的可微代理。ADS采用颜色感知的四元数耦合更新规则以减少伪影。评估结果显示,ADS可以将解码器的成功率驱动至接近零,同时保持最小的感知影响,相比标准内容变换提供了更有利的安全-实用性权衡。
Foundation models on the bridge: Semantic hazard detection and safety maneuvers for maritime autonomy with vision-language models
Authors: Kim Alexander Christensen, Andreas Gudahl Tufte, Alexey Gusev, Rohan Sinha, Milan Ganai, Ole Andreas Alsos, Marco Pavone, Martin Steinert
First: 2025-12-30T21:20:41+00:00 · Latest: 2025-12-30T21:20:41+00:00
Comments: 17 pages without bibliography or appendix. The main paper has 16 figures
Abstract
The draft IMO MASS Code requires autonomous and remotely supervised maritime vessels to detect departures from their operational design domain, enter a predefined fallback that notifies the operator, permit immediate human override, and avoid changing the voyage plan without approval. Meeting these obligations in the alert-to-takeover gap calls for a short-horizon, human-overridable fallback maneuver. Classical maritime autonomy stacks struggle when the correct action depends on meaning (e.g., diver-down flag means people in the water, fire close by means hazard). We argue (i) that vision-language models (VLMs) provide semantic awareness for such out-of-distribution situations, and (ii) that a fast-slow anomaly pipeline with a short-horizon, human-overridable fallback maneuver makes this practical in the handover window. We introduce Semantic Lookout, a camera-only, candidate-constrained vision-language model (VLM) fallback maneuver selector that selects one cautious action (or station-keeping) from water-valid, world-anchored trajectories under continuous human authority. On 40 harbor scenes we measure per-call scene understanding and latency, alignment with human consensus (model majority-of-three voting), short-horizon risk-relief on fire hazard scenes, and an on-water alert->fallback maneuver->operator handover. Sub-10 s models retain most of the awareness of slower state-of-the-art models. The fallback maneuver selector outperforms geometry-only baselines and increases standoff distance on fire scenes. A field run verifies end-to-end operation. These results support VLMs as semantic fallback maneuver selectors compatible with the draft IMO MASS Code, within practical latency budgets, and motivate future work on domain-adapted, hybrid autonomy that pairs foundation-model semantics with multi-sensor bird's-eye-view perception and short-horizon replanning.
中文标题/摘要
标题:桥梁上的基础模型:基于视觉语言模型的海上自主航行语义风险检测与安全操作
国际海事组织(IMO)的MASS代码草案要求自主和远程监督的海上船舶在偏离其操作设计域时能够检测到,并进入预定义的后备模式通知操作员,允许立即的人工干预,并且未经批准不得改变航程计划。在警报到接管的窗口期内满足这些义务需要一种短时间范围、可人工干预的后备操作。传统的海上自主系统在需要理解意义的情况下(例如,潜水员标志意味着有人在水中,火意味着危险)难以应对。我们认为(i)视觉语言模型(VLMs)为这些分布外情况提供了语义意识,(ii)快速-慢速异常检测流水线与短时间范围、可人工干预的后备操作可以在交接窗口内实现这一目标。我们引入了语义瞭望,这是一种仅使用摄像头、候选受限的视觉语言模型(VLM)后备操作选择器,它在持续的人类监督下从水下有效、世界锚定的轨迹中选择一个谨慎的操作(或保持位置)。在40个港口场景中,我们测量了每次呼叫的场景理解能力和延迟,与人类共识(模型三票多数投票)的对齐情况,以及在火灾危险场景中的短时间范围内的风险缓解,并进行了水上警报-后备操作-操作员交接。亚10秒的模型保留了大多数先进模型的大部分意识。后备操作选择器优于仅几何模型基准,并在火灾场景中增加了安全距离。现场运行验证了端到端操作。这些结果支持VLMs作为与IMO MASS代码草案兼容的语义后备操作选择器,符合实际的延迟预算,并激励未来的工作,即领域适应的混合自主,将基础模型语义与多传感器鸟瞰感知和短时间范围重新规划相结合。
DermaVQA-DAS: Dermatology Assessment Schema (DAS) & Datasets for Closed-Ended Question Answering & Segmentation in Patient-Generated Dermatology Images
Authors: Wen-wai Yim, Yujuan Fu, Asma Ben Abacha, Meliha Yetisgen, Noel Codella, Roberto Andres Novoa, Josep Malvehy
First: 2025-12-30T16:48:20+00:00 · Latest: 2025-12-30T16:48:20+00:00
Abstract
Recent advances in dermatological image analysis have been driven by large-scale annotated datasets; however, most existing benchmarks focus on dermatoscopic images and lack patient-authored queries and clinical context, limiting their applicability to patient-centered care. To address this gap, we introduce DermaVQA-DAS, an extension of the DermaVQA dataset that supports two complementary tasks: closed-ended question answering (QA) and dermatological lesion segmentation. Central to this work is the Dermatology Assessment Schema (DAS), a novel expert-developed framework that systematically captures clinically meaningful dermatological features in a structured and standardized form. DAS comprises 36 high-level and 27 fine-grained assessment questions, with multiple-choice options in English and Chinese. Leveraging DAS, we provide expert-annotated datasets for both closed QA and segmentation and benchmark state-of-the-art multimodal models. For segmentation, we evaluate multiple prompting strategies and show that prompt design impacts performance: the default prompt achieves the best results under Mean-of-Max and Mean-of-Mean evaluation aggregation schemes, while an augmented prompt incorporating both patient query title and content yields the highest performance under majority-vote-based microscore evaluation, achieving a Jaccard index of 0.395 and a Dice score of 0.566 with BiomedParse. For closed-ended QA, overall performance is strong across models, with average accuracies ranging from 0.729 to 0.798; o3 achieves the best overall accuracy (0.798), closely followed by GPT-4.1 (0.796), while Gemini-1.5-Pro shows competitive performance within the Gemini family (0.783). We publicly release DermaVQA-DAS, the DAS schema, and evaluation protocols to support and accelerate future research in patient-centered dermatological vision-language modeling (https://osf.io/72rp3).
中文标题/摘要
标题:DermaVQA-DAS:皮肤病评估方案(DAS)及数据集,用于患者生成的皮肤病图像的封闭式问题回答与分割
皮肤病图像分析的最新进展得益于大规模标注数据集;然而,现有大多数基准主要集中在皮肤镜图像上,缺乏患者撰写的查询和临床背景,限制了其在以患者为中心的护理中的应用。为解决这一问题,我们引入了DermaVQA-DAS,这是DermaVQA数据集的扩展,支持两种互补任务:封闭式问题回答(QA)和皮肤病病变分割。本研究的核心是皮肤病评估方案(DAS),这是一种新颖的专家开发框架,系统地以结构化和标准化形式捕捉临床有意义的皮肤病特征。DAS 包含36个高层次和27个细粒度评估问题,其中包含英文和中文的多项选择题。利用DAS,我们提供了专家标注的数据集,用于封闭式QA和分割,并对最先进的多模态模型进行了基准测试。对于分割,我们评估了多种提示策略,并展示了提示设计对性能的影响:默认提示在Mean-of-Max和Mean-of-Mean评估聚合方案下表现最佳,而结合患者查询标题和内容的增强提示在基于多数投票的微评分评估下表现最佳,使用BiomedParse时Jaccard指数为0.395,Dice得分为0.566。对于封闭式QA,模型的整体性能很强,平均准确率从0.729到0.798不等;o3在整体准确率上表现最佳(0.798),紧随其后的是GPT-4.1(0.796),而Gemini-1.5-Pro在Gemini家族中表现出竞争力(0.783)。我们公开发布了DermaVQA-DAS、DAS方案和评估协议,以支持和加速未来在以患者为中心的皮肤病视觉语言建模方面的研究(https://osf.io/72rp3)。
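The two segmentation metrics quoted in this entry can be computed from binary masks as in the sketch below, which uses flat 0/1 lists and treats two empty masks as a perfect match (a convention chosen here, not taken from the paper). For a single pair of masks, or for pooled pixel counts, the metrics are linked by Dice = 2J / (1 + J), and the reported values (Jaccard 0.395, Dice 0.566) are consistent with that identity.

```python
from typing import Sequence


def jaccard(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Intersection over union of two binary masks given as flat 0/1 sequences."""
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    union = sum(1 for p, t in zip(pred, truth) if p or t)
    return 1.0 if union == 0 else inter / union    # convention: empty vs empty = 1


def dice(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Dice coefficient: 2 * |A and B| / (|A| + |B|)."""
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    return 1.0 if size_sum == 0 else 2.0 * inter / size_sum


def dice_from_jaccard(j: float) -> float:
    return 2.0 * j / (1.0 + j)


pred_mask = [1, 1, 0, 0]
true_mask = [1, 0, 1, 0]
j_value = jaccard(pred_mask, true_mask)
d_value = dice(pred_mask, true_mask)
reported_check = round(dice_from_jaccard(0.395), 3)
```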
Spatial-aware Vision Language Model for Autonomous Driving
Authors: Weijie Wei, Zhipeng Luo, Ling Feng, Venice Erin Liong
First: 2025-12-30T16:35:00+00:00 · Latest: 2025-12-30T16:35:00+00:00
Abstract
While Vision-Language Models (VLMs) show significant promise for end-to-end autonomous driving by leveraging the common sense embedded in language models, their reliance on 2D image cues for complex scene understanding and decision-making presents a critical bottleneck for safety and reliability. Current image-based methods struggle with accurate metric spatial reasoning and geometric inference, leading to unreliable driving policies. To bridge this gap, we propose LVLDrive (LiDAR-Vision-Language), a novel framework specifically designed to upgrade existing VLMs with robust 3D metric spatial understanding for autonomous driving by incorporating LiDAR point clouds as an extra input modality. A key challenge lies in mitigating the catastrophic disturbance introduced by disparate 3D data to the pre-trained VLMs. To this end, we introduce a Gradual Fusion Q-Former that incrementally injects LiDAR features, ensuring the stability and preservation of the VLM's existing knowledge base. Furthermore, we develop a spatial-aware question-answering (SA-QA) dataset to explicitly teach the model advanced 3D perception and reasoning capabilities. Extensive experiments on driving benchmarks demonstrate that LVLDrive achieves superior performance compared to vision-only counterparts across scene understanding, metric spatial perception, and reliable driving decision-making. Our work highlights the necessity of explicit 3D metric data for building trustworthy VLM-based autonomous systems.
中文标题/摘要
标题:具备空间意识的视觉语言模型在自动驾驶中的应用
尽管视觉-语言模型(VLMs)通过嵌入在语言模型中的常识在端到端的自动驾驶中展现出显著的潜力,但它们依赖于二维图像线索进行复杂场景理解和决策,这成为确保安全性和可靠性的关键瓶颈。当前基于图像的方法在精确的度量空间推理和几何推断方面存在困难,导致不可靠的驾驶策略。为解决这一问题,我们提出了一种名为LVLDrive(LiDAR-视觉-语言)的新框架,该框架通过引入LiDAR点云作为额外输入模态,专门设计用于增强现有VLMs的稳健的三维度量空间理解能力,以适应自动驾驶的需求。一个关键挑战在于减轻来自不同三维数据对预训练VLMs的灾难性干扰。为此,我们引入了一种渐进融合Q-Former,逐步注入LiDAR特征,确保VLMs的稳定性和知识库的保留。此外,我们还开发了一个空间意识问答(SA-QA)数据集,以明确教授模型高级的三维感知和推理能力。在驾驶基准上的广泛实验表明,LVLDrive在场景理解、度量空间感知和可靠的驾驶决策方面优于仅基于视觉的模型。我们的工作强调了构建可信赖的基于VLM的自动驾驶系统时明确的三维度量数据的必要性。
Summary / 总结
The research aims to enhance Vision-Language Models (VLMs) for autonomous driving by integrating LiDAR data to improve 3D spatial understanding. The method introduces LVLDrive, which uses a Gradual Fusion Q-Former to incrementally incorporate LiDAR features into pre-trained VLMs, ensuring stability. Key findings show that LVLDrive outperforms vision-only models in scene understanding, metric spatial perception, and driving decision-making, emphasizing the importance of 3D data for reliable autonomous systems.
研究旨在通过集成LiDAR数据来增强Vision-Language Models (VLMs)的空间推理能力,以实现自主驾驶。提出的LVLDrive框架引入了Gradual Fusion Q-Former,逐步将LiDAR特征注入到预训练的VLMs中,确保稳定性。此外,开发了一个空间感知问答数据集,以训练模型进行3D感知和推理。实验结果表明,LVLDrive在场景理解、度量空间感知和驾驶决策方面优于仅基于视觉的模型,强调了可靠自主系统中3D度量数据的重要性。
SenseNova-MARS: Empowering Multimodal Agentic Reasoning and Search via Reinforcement Learning
Authors: Yong Xien Chng, Tao Hu, Wenwen Tong, Xueheng Li, Jiandong Chen, Haojia Yu, Jiefan Lu, Hewei Guo, Hanming Deng, Chengjun Xie, Gao Huang, Dahua Lin, Lewei Lu
First: 2025-12-30T16:31:45+00:00 · Latest: 2025-12-30T16:31:45+00:00
Abstract
While Vision-Language Models (VLMs) can solve complex tasks through agentic reasoning, their capabilities remain largely constrained to text-oriented chain-of-thought or isolated tool invocation. They fail to exhibit the human-like proficiency required to seamlessly interleave dynamic tool manipulation with continuous reasoning, particularly in knowledge-intensive and visually complex scenarios that demand coordinated external tools such as search and image cropping. In this work, we introduce SenseNova-MARS, a novel Multimodal Agentic Reasoning and Search framework that empowers VLMs with interleaved visual reasoning and tool-use capabilities via reinforcement learning (RL). Specifically, SenseNova-MARS dynamically integrates the image search, text search, and image crop tools to tackle fine-grained and knowledge-intensive visual understanding challenges. In the RL stage, we propose the Batch-Normalized Group Sequence Policy Optimization (BN-GSPO) algorithm to improve the training stability and advance the model's ability to invoke tools and reason effectively. To comprehensively evaluate the agentic VLMs on complex visual tasks, we introduce the HR-MMSearch benchmark, the first search-oriented benchmark composed of high-resolution images with knowledge-intensive and search-driven questions. Experiments demonstrate that SenseNova-MARS achieves state-of-the-art performance on open-source search and fine-grained image understanding benchmarks. Specifically, on search-oriented benchmarks, SenseNova-MARS-8B scores 67.84 on MMSearch and 41.64 on HR-MMSearch, surpassing proprietary models such as Gemini-3-Flash and GPT-5. SenseNova-MARS represents a promising step toward agentic VLMs by providing effective and robust tool-use capabilities. To facilitate further research in this field, we will release all code, models, and datasets.
中文标题/摘要
标题:SenseNova-MARS:通过强化学习赋能多模态代理推理与搜索
尽管视觉语言模型(VLMs)可以通过代理推理解决复杂任务,但其能力主要局限于文本导向的链式思考或孤立工具调用。它们无法展现出人类所需的熟练度,以无缝地将动态工具操作与连续推理交织在一起,特别是在需要协调外部工具(如搜索和图像裁剪)的知识密集型和视觉复杂场景中。在本研究中,我们提出了SenseNova-MARS,这是一种新颖的多模态代理推理与搜索框架,通过强化学习(RL)赋予VLMs交织的视觉推理和工具使用能力。具体而言,SenseNova-MARS动态整合了图像搜索、文本搜索和图像裁剪工具,以应对精细和知识密集型的视觉理解挑战。在RL阶段,我们提出了批标准化组序列策略优化(BN-GSPO)算法,以提高训练稳定性并增强模型调用工具和有效推理的能力。为了全面评估代理VLMs在复杂视觉任务上的表现,我们引入了HR-MMSearch基准,这是第一个由高分辨率图像和知识密集型及搜索驱动的问题组成的搜索导向基准。实验表明,SenseNova-MARS在开源搜索和细粒度图像理解基准上达到了最先进的性能。具体而言,在搜索导向基准上,SenseNova-MARS-8B在MMSearch上的得分为67.84,在HR-MMSearch上的得分为41.64,超过了诸如Gemini-3-Flash和GPT-5等专有模型。SenseNova-MARS代表了向代理VLMs迈出的有希望的一步,提供了有效的和稳健的工具使用能力。为了促进该领域的进一步研究,我们将发布所有代码、模型和数据集。
Summary / 总结
SenseNova-MARS is a framework that enhances Vision-Language Models (VLMs) with the ability to perform interleaved visual reasoning and tool-use through reinforcement learning. It integrates image search, text search, and image crop tools to handle fine-grained and knowledge-intensive visual tasks. The BN-GSPO algorithm is proposed to improve training stability and tool invocation. Experiments show that SenseNova-MARS outperforms existing models on search-oriented benchmarks, achieving scores of 67.84 on MMSearch and 41.64 on HR-MMSearch, surpassing proprietary models like Gemini-3-Flash and GPT-5.
SenseNova-MARS 是一个框架,通过强化学习增强视觉语言模型(VLMs)的交错式视觉推理和工具使用能力。它整合了图像搜索、文本搜索和图像裁剪工具,以处理细粒度和知识密集型的视觉任务。BN-GSPO 算法被提出以提高训练稳定性和工具调用能力。实验表明,SenseNova-MARS 在搜索导向基准测试中表现出色,分别在 MMSearch 和 HR-MMSearch 上得分 67.84 和 41.64,超过了如 Gemini-3-Flash 和 GPT-5 等专有模型。
Bridging The Consistency Gap: Explicit Structured Memory for Interleaved Image-Text Generation
Authors: Zeteng Lin, Xingxing Li, Wen You, Xiaoyang Li, Zehan Lu, Yujun Cai, Jing Tang
First: 2025-10-13T03:19:45+00:00 · Latest: 2025-12-30T15:40:12+00:00
Abstract
Existing Vision Language Models (VLMs) often struggle to preserve logic, entity identity, and artistic style during extended, interleaved image-text interactions. We identify this limitation as "Multimodal Context Drift", which stems from the inherent tendency of implicit neural representations to decay or become entangled over long sequences. To bridge this gap, we propose IUT-Plug, a model-agnostic Neuro-Symbolic Structured State Tracking mechanism. Unlike purely neural approaches that rely on transient attention maps, IUT-Plug introduces the Image Understanding Tree (IUT) as an explicit, persistent memory module. The framework operates by (1) parsing visual scenes into hierarchical symbolic structures (entities, attributes, and relationships); (2) performing incremental state updates to logically lock invariant properties while modifying changing elements; and (3) guiding generation through topological constraints. We evaluate our approach on a novel benchmark comprising 3,000 human-annotated samples. Experimental results demonstrate that IUT-Plug effectively mitigates context drift, achieving significantly higher consistency scores compared to unstructured text-prompting baselines. This confirms that explicit symbolic grounding is essential for maintaining robust long-horizon consistency in multimodal generation.
中文标题/摘要
标题:弥补一致性差距:交错图像-文本生成的显式结构化记忆
现有的视觉语言模型(VLMs)在长时间的交错图像-文本交互中往往难以保持逻辑性、实体身份和艺术风格。我们将其局限性称为“多模态上下文漂移”,这源于隐式神经表示在长序列中固有的衰减或纠缠倾向。为了解决这一问题,我们提出了IUT-Plug,这是一种模型无关的神经-符号结构化状态跟踪机制。不同于依赖于瞬态注意力图的纯神经方法,IUT-Plug 引入了图像理解树(IUT)作为显式的持久性记忆模块。该框架通过以下步骤运作:(1) 将视觉场景解析为分层的符号结构(实体、属性和关系);(2) 进行增量状态更新,逻辑锁定不变的属性并修改变化的元素;(3) 通过拓扑约束指导生成。我们在一个包含3,000个人工标注样本的新基准上评估了我们的方法。实验结果表明,IUT-Plug 有效地缓解了上下文漂移,与无结构的文本提示基线相比,实现了显著更高的一致性得分。这证实了显式的符号定位对于在多模态生成中保持稳健的长期一致性至关重要。
Summary / 总结
The paper addresses the issue of 'Multimodal Context Drift' in Vision Language Models (VLMs), where the models struggle to maintain consistency in extended, interleaved image-text interactions. To tackle this, the authors propose IUT-Plug, a model-agnostic mechanism that uses an explicit, persistent Image Understanding Tree (IUT) to parse visual scenes into hierarchical symbolic structures and perform incremental state updates. Evaluations on a new benchmark show that IUT-Plug significantly improves consistency scores, indicating the importance of explicit symbolic grounding for long-horizon multimodal generation.
研究针对视觉语言模型(VLMs)在长时间交互中逻辑和视觉一致性下降的问题,提出了一种名为IUT-Plug的模型通用机制,通过引入显式的图像理解树(IUT)来保留逻辑和视觉信息。方法包括将视觉场景解析为符号结构、增量更新状态并使用拓扑约束引导生成。实验表明,IUT-Plug在保持一致性得分方面优于无结构的文本提示基线,这证明了在多模态生成中显式符号接地的重要性。
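The incremental state update described in the IUT-Plug abstract (lock invariant properties, modify changing ones, then feed the state back into generation) can be illustrated with a small symbolic memory. This is only a sketch under stated assumptions: the class names `EntityNode` and `SceneState`, the flat entity dictionary, and the text serialization are hypothetical and are not taken from the paper, and relationships between entities are omitted for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class EntityNode:
    """One entity in a hypothetical, simplified Image Understanding Tree."""
    name: str
    attributes: dict = field(default_factory=dict)
    locked: set = field(default_factory=set)  # attribute keys treated as invariant


class SceneState:
    """Explicit, persistent symbolic memory carried across generation turns."""

    def __init__(self):
        self.entities = {}

    def add(self, node, lock=()):
        """Register an entity and mark the given attribute keys as invariant."""
        node.locked.update(lock)
        self.entities[node.name] = node

    def update(self, name, **changes):
        """Apply changes, skipping edits to locked attributes; return rejected keys."""
        node = self.entities[name]
        rejected = []
        for key, value in changes.items():
            if key in node.locked and node.attributes.get(key) != value:
                rejected.append(key)
            else:
                node.attributes[key] = value
        return rejected

    def to_prompt(self):
        """Serialize the state as a text constraint for the next generation step."""
        lines = []
        for node in self.entities.values():
            attrs = ", ".join(f"{k}={v}" for k, v in sorted(node.attributes.items()))
            lines.append(f"{node.name}: {attrs}")
        return "\n".join(lines)


# Usage: the fur colour is locked, so only the pose change is accepted.
state = SceneState()
state.add(EntityNode("fox", {"fur": "red", "pose": "sitting"}), lock={"fur"})
rejected = state.update("fox", fur="blue", pose="running")
```

Keeping the locked set next to the attributes makes the invariants explicit and inspectable, which is the contrast the abstract draws with transient attention maps.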
ARM: A Learnable, Plug-and-Play Module for CLIP-based Open-vocabulary Semantic Segmentation
Authors: Ziquan Liu, Zhewei Zhu, Xuyang Shi
First: 2025-12-30T13:38:30+00:00 · Latest: 2025-12-30T13:38:30+00:00
Comments: 10 pages, 4 figures
Abstract
Open-vocabulary semantic segmentation (OVSS) is fundamentally hampered by the coarse, image-level representations of CLIP, which lack precise pixel-level details. Existing training-free methods attempt to resolve this by either importing priors from costly external foundation models (e.g., SAM, DINO) or by applying static, hand-crafted heuristics to CLIP's internal features. These approaches are either computationally expensive or sub-optimal. We propose the Attention Refinement Module (ARM), a lightweight, learnable module that effectively unlocks and refines CLIP's internal potential. Unlike static-fusion methods, ARM learns to adaptively fuse hierarchical features. It employs a semantically-guided cross-attention block, using robust deep features (K, V) to select and refine detail-rich shallow features (Q), followed by a self-attention block. The key innovation lies in a "train once, use anywhere" paradigm. Trained once on a general-purpose dataset (e.g., COCO-Stuff), ARM acts as a universal plug-and-play post-processor for diverse training-free frameworks. Extensive experiments show that ARM consistently boosts baseline performance on multiple benchmarks with negligible inference overhead, establishing an efficient and effective paradigm for training-free OVSS.
中文标题/摘要
标题:ARM:一种可学习的插件式模块,用于基于CLIP的开放式词汇语义分割
开放式词汇语义分割(OVSS)从根本上受到CLIP粗略的图像级表示的限制,缺乏精确的像素级细节。现有的无需训练的方法试图通过从昂贵的外部基础模型(例如SAM、DINO)导入先验知识,或对CLIP的内部特征应用静态的手工设计启发式方法来解决这一问题。这些方法要么计算成本高,要么效果不佳。我们提出了注意力精炼模块(ARM),这是一种轻量级、可学习的模块,有效地解锁并精炼了CLIP的内部潜力。与静态融合方法不同,ARM学习自适应地融合层次特征。它采用语义引导的交叉注意力块,使用鲁棒的深层特征(K, V)来选择和精炼细节丰富的浅层特征(Q),然后通过一个自注意力块。关键创新在于“一次训练,随处使用”的范式。ARM在通用数据集(例如COCO-Stuff)上训练一次后,作为通用插件式后处理器,适用于多种无需训练的框架。大量实验表明,ARM在多个基准测试上一致地提升了基线性能,且推理开销可忽略不计,为无需训练的OVSS建立了高效且有效的范式。
Summary / 总结
The research addresses the challenge of open-vocabulary semantic segmentation (OVSS) by proposing the Attention Refinement Module (ARM), a lightweight, learnable module that enhances CLIP's coarse image-level representations. ARM uses a semantically-guided cross-attention mechanism to refine and fuse hierarchical features, providing a universal post-processor that can be applied to various training-free frameworks. Experiments demonstrate that ARM improves baseline performance on multiple benchmarks with minimal computational cost, establishing an efficient paradigm for training-free OVSS.
研究通过提出注意力精炼模块(ARM),解决了开放词汇语义分割(OVSS)中的挑战,该模块通过可学习的、自适应地融合层次特征来增强CLIP的粗略图像级表示。ARM 使用语义引导的交叉注意力机制,用稳健的深层特征来精炼浅层特征。ARM 可以一次训练并在各种无需训练的框架中通用应用。实验表明,ARM 在多个基准上提高了基线性能,并且具有极小的计算成本。
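The fusion order in the ARM abstract (shallow features as queries, deep features as keys and values, then a self-attention pass) can be sketched with plain scaled dot-product attention. This is an illustration under stated assumptions, not the authors' module: it is single-head, has no learned projections or normalization, and the residual addition in `refine` is a hypothetical choice. The function names are illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return out


def refine(shallow, deep):
    """Cross-attention (Q = shallow, K = V = deep), then self-attention."""
    cross = attention(shallow, deep, deep)
    # Assumed residual connection: keep shallow detail, add deep semantics.
    fused = [[s + c for s, c in zip(s_row, c_row)] for s_row, c_row in zip(shallow, cross)]
    return attention(fused, fused, fused)


# Usage: three shallow tokens refined by two deep tokens, all of dimension 2.
shallow = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
deep = [[2.0, 3.0], [2.0, 3.0]]
refined = refine(shallow, deep)
```

The output keeps one vector per shallow token, so the refined features stay at the shallow layer's resolution while drawing on deep semantics.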
RANGER: A Monocular Zero-Shot Semantic Navigation Framework through Contextual Adaptation
Authors: Ming-Ming Yu, Yi Chen, Börje F. Karlsson, Wenjun Wu
First: 2025-12-30T13:25:22+00:00 · Latest: 2025-12-30T13:25:22+00:00
Abstract
Efficiently finding targets in complex environments is fundamental to real-world embodied applications. While recent advances in multimodal foundation models have enabled zero-shot object goal navigation, allowing robots to search for arbitrary objects without fine-tuning, existing methods face two key limitations: (1) heavy reliance on precise depth and pose information provided by simulators, which restricts applicability in real-world scenarios; and (2) lack of in-context learning (ICL) capability, making it difficult to quickly adapt to new environments, for example by leveraging short videos. To address these challenges, we propose RANGER, a novel zero-shot, open-vocabulary semantic navigation framework that operates using only a monocular camera. Leveraging powerful 3D foundation models, RANGER eliminates the dependency on depth and pose while exhibiting strong ICL capability. By simply observing a short video of a new environment, the system can also significantly improve task efficiency without requiring architectural modifications or fine-tuning. The framework integrates several key components: keyframe-based 3D reconstruction, semantic point cloud generation, vision-language model (VLM)-driven exploration value estimation, high-level adaptive waypoint selection, and low-level action execution. Experiments on the HM3D benchmark and real-world environments demonstrate that RANGER achieves competitive performance in terms of navigation success rate and exploration efficiency, while showing superior ICL adaptability, with no prior 3D mapping of the environment required.
中文标题/摘要
标题:RANGER:通过上下文适应的单目零样本语义导航框架
在复杂环境中高效地找到目标是现实世界体态应用的基础。尽管最近多模态基础模型的发展使得零样本物体目标导航成为可能,允许机器人搜索任意物体而无需微调,但现有方法面临两个关键限制:(1)对模拟器提供的精确深度和姿态信息的高度依赖,这限制了其在现实世界场景中的应用;(2)缺乏上下文学习(ICL)能力,使得难以快速适应新环境,如利用短视频。为了解决这些挑战,我们提出了一种名为RANGER的新型零样本、开放式词汇语义导航框架,仅使用单目相机操作。利用强大的3D基础模型,RANGER消除了对深度和姿态的依赖,同时展示了强大的ICL能力。通过简单观察新环境的短视频,系统也可以显著提高任务效率,无需进行架构修改或微调。该框架整合了几个关键组件:基于关键帧的3D重建、语义点云生成、基于视觉语言模型(VLM)的探索价值估计、高层自适应航点选择和低层动作执行。在HM3D基准和真实世界环境中进行的实验表明,RANGER在导航成功率和探索效率方面表现出竞争力,同时展示了优越的ICL适应性,无需对环境进行先前的3D建图。
Summary / 总结
RANGER is a zero-shot semantic navigation framework that uses only a monocular camera to navigate complex environments efficiently. It addresses the limitations of existing methods by eliminating the need for precise depth and pose information and incorporating in-context learning capability. RANGER demonstrates strong performance in navigation success rate and exploration efficiency, and it can adapt quickly to new environments without requiring architectural changes or fine-tuning.
RANGER 是一种仅使用单目相机的零样本语义导航框架,无需依赖精确的深度和姿态信息。它利用 3D 基础模型增强在新环境中的即席学习(ICL)能力,通过短视频观察快速适应新环境。实验表明,RANGER 在导航成功率和探索效率方面表现出色,并且在无需先前 3D 映射环境的情况下展示了更优的 ICL 调适能力。
UniHetero: Could Generation Enhance Understanding for Vision-Language-Model at Large Data Scale?
Authors: Fengjiao Chen, Minhao Jing, Weitao Lu, Yan Feng, Xiaoyu Li, Xuezhi Cao
First: 2025-12-29T14:49:50+00:00 · Latest: 2025-12-30T13:23:48+00:00
Abstract
Vision-language large models are moving toward the unification of visual understanding and visual generation tasks. However, whether generation can enhance understanding is still under-explored at large data scale. In this work, we analyze the unified structure with a concise model, UniHetero, under large-scale pretraining (>200M samples). Our key observations are: (1) Generation can improve understanding, but only if the model generates semantics, not pixels. A common assumption in unified vision-language models is that adding generation will naturally strengthen understanding. However, this is not always true at scale. At 200M+ pretraining samples, generation helps understanding only when it operates at the semantic level, i.e., when the model learns to autoregress high-level visual representations inside the LLM. Once pixel-level objectives (e.g., diffusion losses) directly interfere with the LLM, understanding performance often degrades. (2) Generation reveals a superior data scaling trend and higher data utilization. Unified generation-understanding demonstrates a superior scaling trend compared to understanding alone, revealing a more effective way to learn vision-only knowledge directly from the vision modality rather than by captioning to text. (3) Autoregression on input embeddings is effective at capturing visual details. Compared to the commonly used vision encoder, visual autoregression on input embeddings shows less cumulative error and is modality independent, so it can be extended to all modalities. The learned semantic representations capture visual information such as objects, locations, shapes, and colors, and further enable pixel-level image generation.
中文标题/摘要
标题:UniHetero:生成能否在大规模数据下增强视觉-语言模型的理解?
视觉-语言大型模型正朝着统一视觉理解与生成任务的方向发展。然而,生成是否能增强理解在大规模数据下仍是一个未被充分探索的问题。在本工作中,我们通过一个简洁的统一结构模型UniHetero,在超过200M样本的大规模预训练下进行分析。我们的主要观察结果如下:(1) 生成可以提高理解,但只有在生成语义而非像素时才有效。统一的视觉-语言模型中普遍认为添加生成任务会自然增强理解,但在大规模数据下并非总是如此。在200M+预训练样本下,生成任务仅在操作语义级别时(即模型学会在LLM内部自回归高层次的视觉表示)才有助于理解。一旦像素级目标(如扩散损失)直接干扰LLM,理解性能往往会下降。(2) 生成揭示了更优的数据扩展趋势和更高的数据利用效率。统一的生成-理解模型相比单独的理解模型,展示了更优的扩展趋势,揭示了一种更有效的学习仅来自视觉模态的视觉知识的方法,而不是从描述到文本。(3) 在输入嵌入上进行自回归可以有效捕捉视觉细节。与常用的视觉编码器相比,在输入嵌入上进行视觉自回归显示出较少的累积误差,并且是跨模态的,可以扩展到所有模态。学习到的语义表示捕捉了视觉信息,如物体、位置、形状和颜色;进一步支持了像素级图像生成。
Summary / 总结
This study investigates whether generation can enhance understanding in large-scale vision-language models. Using UniHetero, a unified model trained on over 200 million samples, the research finds that generation improves understanding only when it focuses on semantics rather than pixels. Additionally, unified generation-understanding shows a better data scaling trend and higher data utilization compared to understanding alone. The study also highlights the effectiveness of autoregression on input embedding for capturing visual details, which can be applied across different modalities.
研究探讨了生成是否能在大规模视觉语言模型中增强理解。使用UniHetero模型,在超过2亿样本的预训练下,研究发现生成仅在关注语义而非像素时才能提升理解。此外,统一的生成-理解显示出比单独理解更好的数据扩展趋势和更高的数据利用率。研究还强调了在输入嵌入上进行自回归对于捕捉视觉细节的有效性,这种技术可以应用于不同模态。
CorGi: Contribution-Guided Block-Wise Interval Caching for Training-Free Acceleration of Diffusion Transformers
Authors: Yonglak Son, Suhyeok Kim, Seungryong Kim, Young Geun Kim
First: 2025-12-30T12:55:38+00:00 · Latest: 2025-12-30T12:55:38+00:00
Comments: 16 pages, 20 figures
Abstract
Diffusion transformer (DiT) achieves remarkable performance in visual generation, but its iterative denoising process combined with larger capacity leads to a high inference cost. Recent works have demonstrated that the iterative denoising process of DiT models involves substantial redundant computation across steps. To effectively reduce the redundant computation in DiT, we propose CorGi (Contribution-Guided Block-Wise Interval Caching), a training-free DiT inference acceleration framework that selectively reuses the outputs of transformer blocks in DiT across denoising steps. CorGi caches low-contribution blocks and reuses them in later steps within each interval to reduce redundant computation while preserving generation quality. For text-to-image tasks, we further propose CorGi+, which leverages per-block cross-attention maps to identify salient tokens and applies partial attention updates to protect important object details. Evaluation on the state-of-the-art DiT models demonstrates that CorGi and CorGi+ achieve up to 2.0x speedup on average, while preserving high generation quality.
中文标题/摘要
标题:CorGi: 贡献指导的块级区间缓存加速扩散变换器的无训练推理
扩散变换器(DiT)在视觉生成方面取得了显著的性能,但其迭代去噪过程与较大的容量相结合导致了高昂的推理成本。最近的研究表明,DiT模型的迭代去噪过程在各个步骤中包含了大量的冗余计算。为了有效减少DiT中的冗余计算,我们提出了CorGi(贡献指导的块级区间缓存),这是一种无训练的DiT推理加速框架,它选择性地在去噪步骤中重用DiT中的Transformer块的输出。CorGi缓存低贡献的块,并在每个区间内的后续步骤中重用它们,以减少冗余计算并保持生成质量。对于文本到图像任务,我们进一步提出了CorGi+,它利用每个块的交叉注意力图来识别重要的标记,并应用部分注意更新以保护重要的对象细节。在最先进的DiT模型上的评估表明,CorGi和CorGi+平均实现了最高2.0倍的加速,同时保持了高质量的生成。
Summary / 总结
CorGi is a training-free acceleration framework for diffusion transformers that reduces redundant computation by caching low-contribution blocks and reusing them in later steps. This method improves the efficiency of text-to-image generation tasks, achieving up to 2.0x speedup while maintaining high generation quality. Additionally, CorGi+ enhances this approach by using per-block cross-attention maps to protect important object details, further improving the quality of generated images.
CorGi 是一个无需训练的框架,通过在去噪步骤间选择性地重用 Transformer 块输出来加速扩散 Transformer(DiT)的推理,减少冗余计算并保持生成质量。对于文本到图像任务,CorGi+ 使用每块的交叉注意力图来识别并保护重要标记,使最先进的 DiT 模型的平均加速比最高达到 2.0 倍,同时保持高质量。
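The interval caching idea in the CorGi abstract (score blocks, cache the low-contribution ones, reuse them for the rest of the interval) can be shown on a toy stack of residual blocks. This is a sketch under stated assumptions: the contribution score used here (residual norm relative to input norm), the function names, and the list-of-floats hidden state are all hypothetical, and a real denoising loop would also apply a scheduler update between steps.

```python
def l2(v):
    """Euclidean norm of a list of floats."""
    return sum(x * x for x in v) ** 0.5


def run_with_interval_cache(blocks, x, num_steps, interval, num_cached):
    """Run residual blocks for num_steps, reusing cached residuals inside each interval.

    blocks: callables mapping a hidden state (list of floats) to a residual.
    Returns the final hidden state and the number of block evaluations performed.
    """
    cache = {}
    evaluations = 0
    for step in range(num_steps):
        refresh = step % interval == 0
        if refresh:
            cache = {}  # first step of an interval: recompute every block
        scores = []
        for i, block in enumerate(blocks):
            if i in cache:
                residual = cache[i]  # reuse: the block is skipped entirely
            else:
                residual = block(x)
                evaluations += 1
                # Assumed contribution score: residual size relative to the input.
                scores.append((l2(residual) / (l2(x) + 1e-8), i, residual))
            x = [a + b for a, b in zip(x, residual)]
        if refresh:
            # Cache the residuals of the lowest-contribution blocks.
            for _, i, residual in sorted(scores)[:num_cached]:
                cache[i] = residual
    return x, evaluations


# Usage: three toy blocks; the middle one contributes least and gets cached.
blocks = [
    lambda h: [0.5 * v for v in h],
    lambda h: [0.01 * v for v in h],
    lambda h: [0.1 * v for v in h],
]
_, cached_evals = run_with_interval_cache(blocks, [1.0, 2.0], num_steps=4, interval=2, num_cached=1)
_, full_evals = run_with_interval_cache(blocks, [1.0, 2.0], num_steps=4, interval=2, num_cached=0)
```

With one cached block and an interval of two steps, the toy run performs 10 block evaluations instead of 12; the saving grows with longer intervals and more cached blocks, at the cost of staler residuals.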
VADTree: Explainable Training-Free Video Anomaly Detection via Hierarchical Granularity-Aware Tree
Authors: Wenlong Li, Yifei Xu, Yuan Rao, Zhenhua Wang, Shuiguang Deng
Venue: NeurIPS 2025 poster
First: 2025-10-26T14:36:15+00:00 · Latest: 2025-12-30T12:31:56+00:00
Comments: NeurIPS 2025 poster
Abstract
Video anomaly detection (VAD) focuses on identifying anomalies in videos. Supervised methods demand substantial in-domain training data and fail to deliver clear explanations for anomalies. In contrast, training-free methods leverage the knowledge reserves and language interactivity of large pre-trained models to detect anomalies. However, the current fixed-length temporal window sampling approaches struggle to accurately capture anomalies with varying temporal spans. Therefore, we propose VADTree that utilizes a Hierarchical Granularity-aware Tree (HGTree) structure for flexible sampling in VAD. VADTree leverages the knowledge embedded in a pre-trained Generic Event Boundary Detection (GEBD) model to characterize potential anomaly event boundaries. Specifically, VADTree decomposes the video into generic event nodes based on boundary confidence, and performs adaptive coarse-fine hierarchical structuring and redundancy removal to construct the HGTree. Then, the multi-dimensional priors are injected into the visual language models (VLMs) to enhance the node-wise anomaly perception, and anomaly reasoning for generic event nodes is achieved via large language models (LLMs). Finally, an inter-cluster node correlation method is used to integrate the multi-granularity anomaly scores. Extensive experiments on three challenging datasets demonstrate that VADTree achieves state-of-the-art performance in training-free settings while drastically reducing the number of sampled video segments. The code will be available at https://github.com/wenlongli10/VADTree.
中文标题/摘要
标题:VADTree:基于层次粒度感知树的无训练视频异常检测
视频异常检测(VAD)专注于识别视频中的异常。监督方法需要大量领域内训练数据,并且无法为异常提供清晰的解释。相比之下,无训练方法利用大型预训练模型的知识储备和语言互动能力来检测异常。然而,当前固定长度的时间窗口采样方法难以准确捕捉具有不同时间跨度的异常。因此,我们提出了VADTree,利用层次粒度感知树(HGTree)结构进行灵活的VAD采样。VADTree利用预训练的通用事件边界检测(GEBD)模型嵌入的知识来表征潜在的异常事件边界。具体来说,VADTree基于边界置信度将视频分解为通用事件节点,并进行自适应粗细层次结构构建和冗余去除以构建HGTree。然后,将多维先验注入视觉语言模型(VLMs)以增强节点级别的异常感知,并通过大型语言模型(LLMs)实现通用事件节点的异常推理。最后,使用跨簇节点相关方法整合多粒度异常评分。在三个具有挑战性的数据集上的广泛实验表明,VADTree在无训练设置中实现了最先进的性能,同时大幅减少了采样的视频片段数量。代码将在https://github.com/wenlongli10/VADTree上提供。
Summary / 总结
VADTree proposes a Hierarchical Granularity-aware Tree (HGTree) structure for flexible sampling in video anomaly detection (VAD), leveraging a pre-trained Generic Event Boundary Detection (GEBD) model to identify potential anomaly event boundaries. VADTree decomposes videos into generic event nodes and constructs an HGTree for adaptive coarse-fine hierarchical structuring. It enhances node-wise anomaly perception using visual language models and achieves anomaly reasoning via large language models. Experiments show VADTree outperforms existing methods in training-free settings with fewer sampled video segments.
VADTree 提出了一种层次粒度感知树(HGTree)结构,用于灵活的视频异常检测(VAD)采样,利用预训练的通用事件边界检测(GEBD)模型来识别潜在的异常事件边界。VADTree 将视频分解为通用事件节点,并构建 HGTree 进行自适应粗细层次结构化和冗余去除。它将多维度先验注入视觉语言模型以增强异常感知,并通过大型语言模型实现通用事件节点的异常推理。实验表明,VADTree 在更少的视频片段采样下优于现有方法。
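The coarse-to-fine decomposition described in the VADTree abstract can be illustrated by splitting a video at frames whose boundary confidence exceeds a threshold, first with a high threshold and then with a lower one inside each coarse segment. This is a simplified sketch under stated assumptions: the two fixed thresholds, the two-level depth, the redundancy rule, and the function names are hypothetical, and the per-frame confidences would in practice come from a GEBD model.

```python
def split_at_boundaries(confidence, start, end, threshold):
    """Split frames [start, end) at interior frames whose confidence >= threshold."""
    cuts = [t for t in range(start + 1, end) if confidence[t] >= threshold]
    edges = [start] + cuts + [end]
    return list(zip(edges[:-1], edges[1:]))


def build_tree(confidence, coarse_threshold, fine_threshold):
    """Two-level tree: coarse segments, each subdivided at weaker boundaries."""
    tree = []
    for s, e in split_at_boundaries(confidence, 0, len(confidence), coarse_threshold):
        children = split_at_boundaries(confidence, s, e, fine_threshold)
        # A single child identical to its parent is redundant, so drop it.
        if children == [(s, e)]:
            children = []
        tree.append({"span": (s, e), "children": children})
    return tree


# Usage: strong boundaries at frames 2 and 6, a weaker one at frame 4.
confidence = [0.0, 0.1, 0.9, 0.2, 0.5, 0.1, 0.8, 0.1]
tree = build_tree(confidence, coarse_threshold=0.7, fine_threshold=0.4)
```

Segments follow the detected event boundaries rather than a fixed window length, so short and long events each get a node of matching span.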
Towards Open-Vocabulary Industrial Defect Understanding with a Large-Scale Multimodal Dataset
Authors: TsaiChing Ni, ZhenQi Chen, YuanFu Yang
First: 2025-12-30T11:45:22+00:00 · Latest: 2025-12-30T11:45:22+00:00
Abstract
We present IMDD-1M, the first large-scale Industrial Multimodal Defect Dataset comprising 1,000,000 aligned image-text pairs, designed to advance multimodal learning for manufacturing and quality inspection. IMDD-1M contains high-resolution real-world defects spanning over 60 material categories and more than 400 defect types, each accompanied by expert-verified annotations and fine-grained textual descriptions detailing defect location, severity, and contextual attributes. This dataset enables a wide spectrum of applications, including classification, segmentation, retrieval, captioning, and generative modeling. Building upon IMDD-1M, we train a diffusion-based vision-language foundation model from scratch, specifically tailored for industrial scenarios. The model serves as a generalizable foundation that can be efficiently adapted to specialized domains through lightweight fine-tuning. With less than 5% of the task-specific data required by dedicated expert models, it achieves comparable performance, highlighting the potential of data-efficient foundation model adaptation for industrial inspection and generation, paving the way for scalable, domain-adaptive, and knowledge-grounded manufacturing intelligence.
中文标题/摘要
标题:面向大规模多模态数据的工业缺陷开放词汇理解
我们提出了IMDD-1M,这是首个包含1,000,000个图像-文本对的大型工业多模态缺陷数据集,旨在推动制造和质量检测中的多模态学习。IMDD-1M 包含了60多种材料类别和400多种缺陷类型的高分辨率真实世界缺陷,每种缺陷都附有专家验证的注释和详细的文本描述,详细说明了缺陷的位置、严重程度和上下文属性。该数据集支持包括分类、分割、检索、描述生成和生成模型在内的广泛应用。基于IMDD-1M,我们从零开始训练了一个基于扩散的视觉-语言基础模型,特别适用于工业场景。该模型作为可泛化的基础模型,可以通过轻量级微调高效适应特定领域。与专门的专家模型相比,它只需要不到5%的任务特定数据即可达到相当的性能,突显了数据高效基础模型适应在工业检测和生成中的潜力,为可扩展、领域适应和知识导向的制造智能铺平了道路。
Summary / 总结
The paper introduces IMDD-1M, a large-scale multimodal dataset with 1,000,000 aligned image-text pairs for industrial defect understanding, covering over 60 material categories and more than 400 defect types. The dataset is used to train a diffusion-based vision-language foundation model from scratch that can be adapted to specialized domains through lightweight fine-tuning, achieving performance comparable to dedicated expert models with less than 5% of the task-specific data they require. This demonstrates the potential of data-efficient foundation model adaptation in industrial applications.
研究旨在通过引入包含1,000,000个图像-文本对的IMDD-1M大规模多模态数据集,推进工业缺陷理解中的多模态学习。该数据集包括来自超过60种材料类别和400种缺陷类型的高分辨率缺陷,每种缺陷都有专家注释和详细的描述。基于该数据集训练了一个扩散型视觉-语言模型,该模型可以通过少量的专业数据进行微调以达到与专家模型相当的性能,展示了在工业检测和生成任务中数据高效适应的潜力。